Principal investigator (ÚFAL): 
Project Manager (ÚFAL): 
Provider: 
Grant id: 
128122
Duration: 
2022-2024

The proposed project deals with compound words in Czech, Russian, German and English, aiming to construct a deep learning model that will a) distinguish compounds from non-compounds and b) generate their base words without relying on dictionary data. Task a) will be referred to as compound identification, and task b) as compound splitting. Given a word such as 'rybolov' ("fishery") or 'flowerpot', the model will determine whether it is a compound and, if so, retrieve its parent words, in this case 'rybolov' -> ['ryba', 'lov'] and 'flowerpot' -> ['flower', 'pot']. The project's priority will be to handle compounding for several languages at once, because: 1) compounding presumably shares patterns across these languages; 2) covering a number of languages with a single model is both computationally and practically convenient; 3) the model will be able to deal with loanwords appearing in its input data. Because the four languages form one pair of Slavic languages and one pair of Germanic languages, all four being Indo-European, it will be interesting to see how training the model on data from all four languages improves its overall performance. We believe it will, because in all four languages the overall patterning of bound and free morphemes into compound lexical units is similar. Furthermore, the model will be packaged and distributed as a Python package, turning it into a potentially useful natural language processing tool. It can then be used alongside tokenizers and lemmatizers to help automatically process and analyze ad-hoc coined compounds, which would otherwise be out-of-vocabulary.
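To make the two tasks concrete, here is a toy dictionary-based sketch of compound identification and splitting. It is purely illustrative and is precisely the kind of lexicon-dependent baseline the project's neural model is meant to replace: the tiny vocabulary is hypothetical, and the sketch handles neither linking elements (e.g. the '-o-' in 'rybolov' = 'ryb' + 'o' + 'lov') nor out-of-vocabulary parents, which is where a dictionary-free model pays off.

```python
# Toy compound splitter for illustration only -- NOT the project's model.
# VOCAB is a hypothetical stand-in for real lexicon data.
VOCAB = {"flower", "pot", "sun", "light", "book"}

def split_compound(word: str):
    """Return [left, right] if `word` splits into two vocabulary words,
    otherwise None (i.e. the word is not identified as a compound)."""
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in VOCAB and right in VOCAB:
            return [left, right]
    return None

print(split_compound("flowerpot"))  # -> ['flower', 'pot']
print(split_compound("table"))      # -> None
```

A real lexicon-based splitter fails as soon as a parent word or linking morpheme is missing from its dictionary; a character-level neural model trained across related languages can generalize past that limitation.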

The project's goals were, in the end, exceeded. The output of the project is a more general tool called PaReNT (Parent Retrieval Neural Tool), which covers derivation in addition to compounding and has been extended to eight languages instead of the original four (Czech, Russian, English, Dutch, German, French, and Spanish). The model, usable either as a Linux command-line tool or as an importable Python package, is available on GitHub (https://github.com/iml-r/PaReNT/).

The work was presented at the LREC-COLING 2024 conference in Turin, Italy, where it was nominated for Best Paper and received an Outstanding Paper award as one of 10 papers out of 3,500 submissions.