Modeling named entities across languages

Guidelines

Named entity (NE) recognition and associated subtasks such as NE linking constitute a well-established discipline in Natural Language Processing (NLP). However, building applications that exploit NEs runs into the same bottleneck as in other NLP disciplines: although NE resources are now available for quite a few languages, their annotation schemes are highly diverse, which makes it very difficult to develop technology capable of handling multilingual data and to transfer that technology to under-resourced languages.

This project aims to collect and harmonize existing NE resources, to interlink NE-related knowledge across language boundaries, and to transfer it from one or more resource-rich languages to a set of under-resourced languages. Inspiration for these goals can be drawn from other NLP tasks for which harmonized data and cross-lingual methods already exist, with the Universal Dependencies project (UD, [1]) being probably the most prominent representative. Given that named entity borrowings constitute a non-negligible part of word-formation in most languages (after orthographic and morphological adaptation to the target language, such borrowings sometimes result in quite complex morphological subsystems), we plan to incorporate NE-related knowledge into word-formation databases, such as those recently included in the Universal Derivations collection (UDer, [2]).
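To make the harmonization goal more concrete, the following is a minimal sketch of how tags from differently annotated resources could be mapped onto one shared tag set. It is only an illustration: the scheme names `scheme_a` and `scheme_b`, their labels, the coarse target set, and the helper functions are all hypothetical and are not taken from any existing resource or from the project itself.

```python
# Hypothetical example: two invented source tag sets mapped to a shared
# coarse tag set. None of these labels come from a real corpus.
COMMON_TAGS = {"PER", "LOC", "ORG", "MISC"}

SCHEME_MAPPINGS = {
    "scheme_a": {"PERSON": "PER", "GPE": "LOC", "FACILITY": "LOC", "COMPANY": "ORG"},
    "scheme_b": {"person_name": "PER", "city": "LOC", "country": "LOC", "institution": "ORG"},
}


def harmonize_tag(scheme, tag, fallback="MISC"):
    """Map one tag (optionally BIO-prefixed) from a source scheme to the common set."""
    if tag == "O":
        return "O"  # tokens outside any entity stay untouched
    prefix, sep, label = tag.partition("-")
    if not sep or prefix not in ("B", "I"):
        # No BIO prefix: treat the whole string as the label.
        prefix, label = None, tag
    # Labels without a known counterpart fall back to a catch-all class.
    mapped = SCHEME_MAPPINGS[scheme].get(label, fallback)
    return f"{prefix}-{mapped}" if prefix else mapped


def harmonize_sentence(scheme, tagged_tokens):
    """Apply harmonize_tag to every (token, tag) pair of a sentence."""
    return [(token, harmonize_tag(scheme, tag)) for token, tag in tagged_tokens]
```

A flat lookup table like this loses fine-grained distinctions of the source schemes; a real harmonization effort would likely need to preserve the original labels alongside the common ones.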

At the Institute of Formal and Applied Linguistics, there is substantial experience with building multilingual data (including participation in UD and UDer), with exploiting such data via various cross-lingual transfer strategies [3, 4], and with building NE resources and developing NE recognizers [5]. The NE dataset that became a de facto standard for Czech was developed at this university a decade ago, and a sequence of state-of-the-art NE recognition solutions was created here, too.

References

[1] Nivre Joakim, de Marneffe Marie-Catherine, Ginter Filip, Goldberg Yoav, Hajič Jan, Manning Christopher, McDonald Ryan, Petrov Slav, Pyysalo Sampo, Silveira Natalia, Tsarfaty Reut, Zeman Daniel: Universal Dependencies v1: A Multilingual Treebank Collection. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), Copyright © European Language Resources Association, Paris, France, ISBN 978-2-9517408-9-1, pp. 1659-1666, 2016

[2] Kyjánek Lukáš, Žabokrtský Zdeněk, Ševčíková Magda, Vidra Jonáš: Universal Derivations Kickoff: A Collection of Harmonized Derivational Resources for Eleven Languages. In: Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology (DeriMo 2019), ÚFAL MFF UK, Praha, Czechia, ISBN 978-80-88132-08-0, pp. 101-110, 2019

[3] Ramasamy Loganathan, Mareček David, Žabokrtský Zdeněk: Multilingual Dependency Parsing: Using Machine Translated Texts instead of Parallel Corpora. In: The Prague Bulletin of Mathematical Linguistics, Vol. 102, Copyright © Univerzita Karlova v Praze, ISSN 0032-6585, pp. 93-104, 2014

[4] Rosa Rudolf, Žabokrtský Zdeněk: KLcpos3 - a Language Similarity Measure for Delexicalized Parser Transfer. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-941643-73-0, pp. 243-249, 2015

[5] Straková Jana: Neural Network Based Named Entity Recognition. Ph.D. thesis, Charles University, Prague, Czech Republic, 120 pp., Jun 2017