Universal Dependencies for an Under-Represented Language

Guidelines

Building a syntactically annotated corpus (treebank) for a language that is currently under-represented, i.e., one for which little or no such data exists. Under-represented languages (usually, but not always, minority languages) are disadvantaged because general NLP tools need data on which language-specific models can be trained. Being able to use full-scale language technology means that texts, oral history, social media and other resources can be accessed, processed and further exploited in linguistic and cultural studies. Universal Dependencies (UD, http://universaldependencies.org/) is a de-facto standard for morphological and syntactic annotation applicable to a broad set of languages; it is thus a natural choice for a new treebank of a small language.
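
To give a concrete idea of the annotation, the lines below show a toy Czech sentence in the CoNLL-U format used by UD (word forms, lemmas, universal part-of-speech tags, selected morphological features, and dependency relations). The sentence and its analysis are illustrative only, not taken from an existing treebank.

# text = Pes štěká.
1	Pes	pes	NOUN	_	Case=Nom|Gender=Masc|Number=Sing	2	nsubj	_	_
2	štěká	štěkat	VERB	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres	0	root	_	SpaceAfter=No
3	.	.	PUNCT	_	_	2	punct	_	_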

The student will either build a treebank of a language that is currently not part of the Universal Dependencies project, or substantially improve a treebank that is already included but lacking in size and quality. For instance, the Upper Sorbian, Belarusian and Hungarian treebanks are very small, while Lower Sorbian, Kashubian and Rusyn treebanks don't exist at all. The Sorbian and Kashubian languages are of particular interest, given the potential for collaboration. We therefore primarily look for candidates willing to work with one of these languages. (While an ideal candidate would be a native speaker of Sorbian or Kashubian, this is not a necessary condition for the success of the project. However, a good command of a West Slavic language is required.)

The student will learn to use existing language technology (annotation and visualization tools, morphological analyzers, taggers, parsers, etc.) and adapt it to the target language. It is likely that a morphological tag set and a dictionary will have to be developed as well. UD guidelines specific to the target language will be developed (consistent with the current UD guidelines for related resource-rich languages, such as Czech) and documented. The student will also learn and apply techniques for projecting models from resource-rich to resource-poor languages in order to bootstrap the annotation, rather than annotating everything manually from scratch.
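
As a rough sketch of the bootstrapping workflow, the following Python fragment uses the UDPipe bindings (the ufal.udpipe package) to pre-annotate raw target-language text with a model trained on a related resource-rich language. The model file name and the toy input sentence are assumptions made for illustration; other toolchains (e.g. Stanza) could serve the same purpose.

# A minimal sketch of cross-lingual bootstrapping with the UDPipe Python
# bindings (pip install ufal.udpipe).  The model file name is an assumption;
# any UDPipe model for a related resource-rich language can be used.
from ufal.udpipe import Model, Pipeline, ProcessingError

model = Model.load("czech-pdt-ud.udpipe")  # hypothetical path to a Czech model
if model is None:
    raise RuntimeError("Cannot load the UDPipe model")

# Tokenize, tag and parse raw text, producing CoNLL-U output.
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
error = ProcessingError()

# A Czech toy sentence stands in for target-language input here.
text = "Pes štěká."
conllu = pipeline.process(text, error)
if error.occurred():
    raise RuntimeError(error.message)

# The output is only a rough pre-annotation: annotators correct it manually,
# and the corrected data can later train a first model for the target language.
print(conllu)

In such a workflow, the manually corrected pre-annotations gradually form the new treebank, and each round of corrections can be used to retrain a better model for the target language.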

References

Joakim Nivre et al. (2020). Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). https://universaldependencies.org/

Ayan Das and Sudeshna Sarkar (2020). A Survey of the Model Transfer Approaches to Cross-Lingual Dependency Parsing. ACM Transactions on Asian and Low-Resource Language Information Processing 19(5), Article 67. https://doi.org/10.1145/3383772