Computational Models of Word Formation

Guidelines

Word formation data resources harmonized across multiple natural languages were almost non-existent until very recently ([1], [2]), which limited the development of models whose validity could be tested empirically in a multilingual setting. The aim of the thesis is to develop, implement, and evaluate word formation models that make use of modern distributional vector space word representations (word embedding models), with a special focus on derivational morphology ([3]) and on multilingual aspects ([4]). Optionally, the optimization criteria used in the models can be interpreted in terms of information theory and may reflect hierarchical interactions in a language’s vocabulary, biological and cognitive biases relevant to natural languages, and language evolution perspectives.
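
As a purely illustrative starting point, and not a method prescribed by these guidelines, one simple baseline treats a derivational relation as an approximately constant offset between the embeddings of base words and their derivatives. The sketch below assumes hand-made three-dimensional toy vectors; the vocabulary, the vector values, and the helper names (`TOY_VECTORS`, `mean_offset`, `predict_derivative`) are hypothetical and are not taken from any of the cited resources. A real experiment would presumably use trained embeddings and base-derivative pairs from a resource such as [2].

```python
import math

# Toy 3-dimensional "embeddings". The values are invented for illustration
# only; they do not come from a trained model or from the cited resources.
TOY_VECTORS = {
    "teach":   [0.9, 0.1, 0.0],
    "teacher": [0.9, 0.1, 1.0],
    "bake":    [0.1, 0.9, 0.0],
    "baker":   [0.1, 0.9, 1.0],
    "sing":    [0.5, 0.5, 0.0],
    "singer":  [0.5, 0.5, 1.0],
    "song":    [0.6, 0.4, 0.2],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_offset(pairs, vectors):
    """Average the (derivative - base) difference vector over known pairs."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for base, derived in pairs:
        for i in range(dim):
            total[i] += vectors[derived][i] - vectors[base][i]
    return [t / len(pairs) for t in total]


def predict_derivative(base, offset, vectors):
    """Return the vocabulary word closest (by cosine) to base + offset."""
    target = [b + o for b, o in zip(vectors[base], offset)]
    candidates = (w for w in vectors if w != base)
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Estimate the agent-noun offset from two known pairs, then apply it
# to a base word that was not used for the estimate.
offset = mean_offset([("teach", "teacher"), ("bake", "baker")], TOY_VECTORS)
prediction = predict_derivative("sing", offset, TOY_VECTORS)
print(prediction)
```

On these toy vectors the offset learned from teach/teacher and bake/baker maps "sing" to "singer" rather than to the semantically related "song". Whether such offsets remain stable for real derivational relations, and across languages, is the kind of question the thesis is meant to examine empirically.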

References

[1] Batsuren, K., Bella, G., & Giunchiglia, F. (2019). CogNet: A Large-Scale Cognate Database. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3136-3145).
[2] Kyjánek, L., Žabokrtský, Z., Ševčíková, M., & Vidra, J. (2019). Universal Derivations Kickoff: A Collection of Harmonized Derivational Resources for Eleven Languages. In Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology (pp. 101-110).
[3] Bonami, O., & Paperno, D. (2018). Inflection vs. derivation in a distributional vector space. Lingue e linguaggio, 17(2), 173-196.
[4] Ruder, S., Vulić, I., & Søgaard, A. (2019). A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research, 65, 569-631.