Principal investigator (ÚFAL): 
Project Manager (ÚFAL): 
Provider: 
Grant id: 
128122
Duration: 
2022-2024

The proposed project deals with compound words in Czech, Russian, German and English. It aims to construct a deep learning model that will a) distinguish compounds from non-compounds and b) generate their base words without relying on dictionary data. Task a) will be referred to as compound identification and task b) as compound splitting. Given a word such as ‘rybolov’ ("fishery") or ‘flowerpot’, the model will determine whether it is a compound and, if so, find its parent words, in this case 'rybolov' -> ['ryba', 'lov'] and 'flowerpot' -> ['flower', 'pot']. The project’s priority will be to handle compounding in several languages at the same time, because: 1) compounding presumably shares patterns across these languages; 2) covering a number of languages with a single model is both computationally and practically convenient; 3) the model will be able to deal with loanwords in its input data. Because the four languages comprise one pair of Slavic languages and one pair of Germanic languages, all four being Indo-European, it will be interesting to see whether training the model on data from all four languages improves its overall performance. We believe that it will, because in all four languages the overall patterning of bound and free morphemes into compound lexical units is similar. Furthermore, the model will be released as a Python package, making it a potentially useful natural language processing tool. It will then be possible to use it alongside tokenizers and lemmatizers to help automatically process and analyze ad-hoc coined compounds, which would otherwise be out-of-vocabulary.
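The input and output of the two tasks can be illustrated with a minimal Python sketch. It is not the project's model or the API of the planned package: the names `CompoundAnalysis`, `analyze` and `toy_splitter` are hypothetical, and the splitter is a hard-coded stand-in containing only the two examples from the text, used where the trained, dictionary-free model would be plugged in.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class CompoundAnalysis:
    """Result of both tasks for a single input word (hypothetical structure)."""
    word: str
    is_compound: bool                                  # task a) compound identification
    parents: List[str] = field(default_factory=list)   # task b) compound splitting


# Placeholder only: the proposed model is meant to work WITHOUT dictionary data.
# This lookup merely stands in for it so the interface can be demonstrated.
_TOY_PARENTS = {
    "rybolov": ["ryba", "lov"],
    "flowerpot": ["flower", "pot"],
}


def toy_splitter(word: str) -> Optional[List[str]]:
    """Return parent words for a compound, or None for a non-compound."""
    return _TOY_PARENTS.get(word.lower())


def analyze(
    word: str,
    splitter: Callable[[str], Optional[List[str]]] = toy_splitter,
) -> CompoundAnalysis:
    """Run identification and splitting with any splitter callable."""
    parents = splitter(word)
    if not parents:
        return CompoundAnalysis(word=word, is_compound=False)
    return CompoundAnalysis(word=word, is_compound=True, parents=list(parents))


if __name__ == "__main__":
    for w in ["rybolov", "flowerpot", "pot"]:
        print(analyze(w))
```

Passing the splitter as a callable keeps the interface independent of the model behind it, so a single multilingual model could serve all four languages through the same call.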