Computational linguistics needs data resources of all kinds, be they lexical databases or annotated corpora of many diverse languages. These resources typically work with the word (or token) as the basic processing unit. Unfortunately, the word is not easy to define in a cross-linguistically consistent manner while staying reasonably close to the traditional perception of the term. Without a clear definition, there is a danger that linguistic resources will not be mutually compatible, with negative consequences both for comparative linguistics and for the training of multilingual models.
Linguists have been attempting to come up with a definition since at least the beginning of the 20th century, with varying levels of success. At any rate, digital linguistic resources typically rely more on language-particular traditions (where they exist) than on language-neutral definitions.
In my talk, I will present preliminary results of a survey on wordhood in different languages, which we conducted within the UniDive COST Action, contrasted with the latest contribution to the ‘defining the word’ debate by Haspelmath (2023). Our particular focus was on how words are delimited in the Universal Dependencies (UD) collection, although we did not restrict the discussion to examples attested in UD. I will bring up observations of two kinds: (1) the difficulties faced when trying to apply a theoretical definition to real data; and (2) cases where the word unit in the data does not match the definition.
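To give a concrete picture of word delimitation in UD: in UD's CoNLL-U format, an orthographic token that fuses several syntactic words is encoded as a multiword token with a range ID, with the individual syntactic words listed on the following lines. The sketch below is a minimal, invented Spanish example (not taken from any particular UD treebank); morphological features are left unspecified, and the columns are space-aligned here for readability, whereas real CoNLL-U files are tab-separated. It shows the contraction al split into the preposition a and the article el:

    # text = Vamos al mar.
    1    Vamos  ir   VERB   _  _  0  root   _  _
    2-3  al     _    _      _  _  _  _      _  _
    2    a      a    ADP    _  _  4  case   _  _
    3    el     el   DET    _  _  4  det    _  _
    4    mar    mar  NOUN   _  _  1  obl    _  SpaceAfter=No
    5    .      .    PUNCT  _  _  1  punct  _  _

The range line 2-3 preserves the surface token, while syntactic annotation (heads, relations) attaches only to the individual words, so the treebank commits to a particular answer to the wordhood question for every such contraction.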