Monday, May 27, 2019 - 13:30

TEITOK: Merging Digital Humanities and Corpus Linguistics

Maarten Janssen (ÚFAL MFF UK)

Corpora nowadays form a core part of linguistics, and for historical linguistics they should be an even more solid cornerstone, given that there are no native speakers to rely on. However, the tools built for linguistic corpora do not apply well to historical corpora: not only are automatic tools considerably less accurate on them, but corpus tools also discard much of the information that documents coming from the digital humanities contain, such as formatting and writing order. TEITOK is a corpus tool that attempts to bridge this gap by providing a full platform for TEI/XML-based corpora that can respond to all the needs of the DH community and combine them with linguistic annotation. This makes it possible to have meticulously transcribed documents, be they historical, dialectal, spoken, etc., that are at the same time fully searchable and exploitable using NLP techniques.