Universal Segmentations

Universal Segmentations (UniSegments) is a collection of lexical resources capturing morphological segmentations, harmonised into a cross-linguistically consistent annotation scheme for many languages. The annotation scheme consists of simple tab-separated columns that store a word and its morphological segmentation, together with information about the word and the segmented units, e.g. part-of-speech categories and types of morphs/morphemes.

 

 

 

Segmentation data before and after harmonisation

Ex. 1: Démonette
Original format:
"abaissement","tlfnome","abaisser","tlfnome","Ncms","tlfnome",
"Vmn----","tlfnome", "simple","derif","suf","ment","derif",,,,
"@RES","demonette","@","demonette", "résultat de abaisser",
"derif","résultat de @","demonette","descendant", "demonette",
"abaiss","derif",,,"derif"
UniSegments format: abaiss + e + ment (lowering)

Ex. 2: DerIvaTario
Original format:
3951;ABBATTIMENTO;BATTERE:vrb_th;
ACons:ad:mt2:ms2b;MENTO:mento:mt4:ms1;;;;
UniSegments format: ab + batt + i + mento (breakdown)

Ex. 3: DerivBase.Ru
Original format:
вымор noun повыморить verb
rule887(по + noun + и1(ть) -> verb)
PFX,SFX
UniSegments format: по + вымори + ть (become extinct)

Ex. 4: MorphoLex
Original format:
rafraîchissant [VB]>>sant>
UniSegments format: r + a + fraîchis + sant (refreshing)

Ex. 5: Word Formation Latin
Original format:
(23891,'malaxo','V1','','VmF','m0158','malaxo',
'VERB',NULL,'B')
(23890,'malaxatio','N3B','f','NcC','m0157',
'malaxatio','NOUN',NULL,'B')
(23891,1,23890,'86','a','2016-03-29 12:45:48')
('V-To-N','Derivation_Suffix','86','','n6p1; n2np;
Regular PP: v1*; v2*; v3*; v4*; v5*; v6*','','(t)io(n)',
'n31','abiurat-io, -ion-is; abstrus-io, -ion-is')
UniSegments format: malax + a + tio (comminution)

 

Examples of the harmonised data

The file format consists of five tab-separated columns: word form, lemma, part-of-speech category, simplified morphological segmentation, and a detailed JSON annotation giving the character indices and types of the individual morphological segments.

An excerpt from CroDeriV for Croatian:

podrapati   podrapati   VERB   po + drap + a + ti   {"annot_name": "CroDeriV-1.0", "segmentation": [{"span": [0, 1], "type": "prefix"}, {"span": [2, 3, 4, 5], "type": "stem"}, {"span": [6], "type": "suffix"}, {"span": [7, 8], "type": "ending"}]}
podrazumijevati   podrazumijevati   VERB   pod + raz + um + ijev + a + ti   {"annot_name": "CroDeriV-1.0", "segmentation": [{"span": [0, 1, 2], "type": "prefix"}, {"span": [3, 4, 5], "type": "prefix"}, {"span": [6, 7], "type": "stem"}, {"span": [8, 9, 10, 11], "type": "suffix"}, {"span": [12], "type": "suffix"}, {"span": [13, 14], "type": "ending"}]}
podraškati   podraškati   VERB   po + draš + k + a + ti   {"annot_name": "CroDeriV-1.0", "segmentation": [{"span": [0, 1], "type": "prefix"}, {"span": [2, 3, 4, 5], "type": "stem"}, {"span": [6], "type": "suffix"}, {"span": [7], "type": "suffix"}, {"span": [8, 9], "type": "ending"}]}
podraškivati   podraškivati   VERB   po + draš + k + iv + a + ti   {"annot_name": "CroDeriV-1.0", "segmentation": [{"span": [0, 1], "type": "prefix"}, {"span": [2, 3, 4, 5], "type": "stem"}, {"span": [6], "type": "suffix"}, {"span": [7, 8], "type": "suffix"}, {"span": [9], "type": "suffix"}, {"span": [10, 11], "type": "ending"}]}
podražavati   podražavati   VERB   po + draž + av + a + ti   {"annot_name": "CroDeriV-1.0", "segmentation": [{"span": [0, 1], "type": "prefix"}, {"span": [2, 3, 4, 5], "type": "stem"}, {"span": [6, 7], "type": "suffix"}, {"span": [8], "type": "suffix"}, {"span": [9, 10], "type": "ending"}]}
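
To illustrate how the "span" indices relate to the simplified segmentation, the following minimal Python sketch derives per-morph character indices from a string such as po + drap + a + ti. The function name spans_from_segmentation is hypothetical and not part of the UniSegments API. The sketch assumes that morphs are joined by " + " and that concatenating them yields the word form, which holds for the excerpt above but may not hold for every resource.

```python
def spans_from_segmentation(segmentation: str) -> list:
    """Return one list of character indices per morph.

    Assumption: morphs are separated by " + " and, when concatenated,
    reproduce the word form exactly (true for the CroDeriV excerpt).
    """
    spans = []
    offset = 0
    for morph in segmentation.split(" + "):
        # Each morph covers the next len(morph) characters of the word.
        spans.append(list(range(offset, offset + len(morph))))
        offset += len(morph)
    return spans


print(spans_from_segmentation("po + drap + a + ti"))
# [[0, 1], [2, 3, 4, 5], [6], [7, 8]]
```

The printed indices match the "span" values given for podrapati in the excerpt.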

An excerpt from DErivBase for German:

Geschäftsführerin   Geschäftsführerin   NOUN   Geschäftsführer + in   {"annot_name": "DErivBase-2.0", "segmentation": [{"span": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], "type": "unsegmented"}, {"span": [15, 16], "type": "suffix"}]}
Geschäftsführung   Geschäftsführung   NOUN   Geschäftsführ + ung   {"annot_name": "DErivBase-2.0", "segmentation": [{"span": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], "type": "unsegmented"}, {"span": [13, 14, 15], "type": "suffix"}]}
Gesellschafterin   Gesellschafterin   NOUN   Gesellschaft + er + in   {"annot_name": "DErivBase-2.0", "segmentation": [{"span": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], "type": "unsegmented"}, {"span": [12, 13], "type": "suffix"}, {"span": [14, 15], "type": "suffix"}]}
Großherzigkeit   Großherzigkeit   NOUN   Großherzig + keit   {"annot_name": "DErivBase-2.0", "segmentation": [{"span": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "type": "unsegmented"}, {"span": [10, 11, 12, 13], "type": "suffix"}]}
Großbäckerei   Großbäckerei   NOUN   Großbäcker + ei   {"annot_name": "DErivBase-2.0", "segmentation": [{"span": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "type": "unsegmented"}, {"span": [10, 11], "type": "suffix"}]}
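
A line in this format can be read with the Python standard library alone. The sketch below is not the project's own API; the function names parse_line and typed_morphs are hypothetical. It assumes that the columns are separated by tab characters and that the "span" indices refer to characters of the word form in the first column.

```python
import json


def parse_line(line: str) -> dict:
    """Split one harmonised line into its five tab-separated columns."""
    form, lemma, pos, segmentation, annotation = line.rstrip("\n").split("\t", 4)
    return {
        "form": form,
        "lemma": lemma,
        "pos": pos,
        "segmentation": segmentation,
        "annotation": json.loads(annotation),  # fifth column is JSON
    }


def typed_morphs(record: dict) -> list:
    """Pair each morph with its type, using the span indices.

    Assumption: spans index into the word form (first column).
    """
    form = record["form"]
    return [
        ("".join(form[i] for i in segment["span"]), segment["type"])
        for segment in record["annotation"]["segmentation"]
    ]


# Example line taken from the DErivBase excerpt, with explicit tabs.
example = "\t".join([
    "Gesellschafterin",
    "Gesellschafterin",
    "NOUN",
    "Gesellschaft + er + in",
    '{"annot_name": "DErivBase-2.0", "segmentation": ['
    '{"span": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], "type": "unsegmented"}, '
    '{"span": [12, 13], "type": "suffix"}, '
    '{"span": [14, 15], "type": "suffix"}]}',
])

print(typed_morphs(parse_line(example)))
# [('Gesellschaft', 'unsegmented'), ('er', 'suffix'), ('in', 'suffix')]
```

Splitting with a maximum of four splits keeps the JSON column intact even if it were to contain a tab character.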

 

The current version

The current version of the collection is UniSegments 1.0. Its public release contains 38 harmonised segmentation datasets covering 30 languages (listed in the table below). UniSegments 1.0 is available in the LINDAT/CLARIAH-CZ digital library (item: http://hdl.handle.net/11234/1-4629). The license of each harmonised resource included in the collection is specified in the corresponding language/resource directory.

A Python application programming interface (API) for working with the UniSegments format is also available. It is provided in the related GitHub repository.

Columns: resource, size, distribution of morphs per unit [%] (share of units with 1, 2, 3, and 4+ morphs), license.
Resource   Size   1   2   3   4+   License
Public release
ben-KCIS 1 kW 0 100 0 0   CC BY-NC 4.0
cat-MorphyNet 516 kL 0 100 0 0   CC BY-SA 3.0
ces-DeriNet 1,039 kL 8 16 19 57   CC BY-NC-SA 3.0
ces-MorphyNet 67 kL 0 100 0 0   CC BY-SA 3.0
deu-DerivBaseDE 61 kL 36 59 4 0   CC BY-SA 3.0
deu-MorphyNet 29 kL 0 100 0 0   CC BY-SA 3.0
eng-MorphoLex 69 kW 21 45 27 7   CC BY-NC-SA 4.0
eng-MorphyNet 292 kL 0 100 0 0   CC BY-SA 3.0
fas-PerSegLex 45 kW 34 31 24 10   CC BY-NC-SA 4.0
fin-MorphyNet 400 kL 0 100 0 0   CC BY-SA 3.0
fra-Demonette 63 kL 46 80 3 0   CC BY-NC-SA 3.0
fra-Echantinom 5 kL 53 40 6 1   CC BY 4.0
fra-MorphoLex 16 kW 43 44 12 1   CC BY-NC-SA 4.0
fra-MorphyNet 363 kL 0 100 0 0   CC BY-SA 3.0
hbs-MorphyNet 34 kL 0 100 0 0   CC BY-SA 3.0
hin-KCIS 2 kW 29 71 0 0   CC BY-NC 4.0
hrv-CroDeriV 16 kL 0 1 20 79   CC BY-SA 3.0
hun-MorphyNet 428 kL 0 100 0 0   CC BY-SA 3.0
hye-Uniparser 594 kW 9 41 37 13   MIT
ita-DerIvaTario 11 kL 1 46 31 21   CC BY-SA 4.0
ita-MorphyNet 599 kL 0 100 0 0   CC BY-SA 3.0
kan-KCIS 26 kW 0 11 25 64   CC BY-NC 4.0
kpv-Uniparser 205 kW 9 40 35 16   MIT
lat-WordFormationLatin 36 kL 16 52 27 5   CC BY-NC-SA 4.0
mal-KCIS 33 kW 2 98 0 0   CC BY-NC 4.0
mar-KCIS 32 kW 0 51 43 6   CC BY-NC 4.0
mdf-Uniparser 105 kW 10 50 31 8   MIT
mhr-Uniparser 260 kW 9 38 36 17   MIT
mon-MorphyNet 35 kL 0 100 0 0   CC BY-SA 3.0
myv-Uniparser 164 kW 10 41 36 13   MIT
pol-MorphyNet 508 kL 0 100 0 0   CC BY-SA 3.0
por-MorphyNet 449 kL 0 100 0 0   CC BY-SA 3.0
rus-DerivBaseRU 156 kL 31 35 23 10   Apache-2.0
rus-MorphyNet 692 kL 0 100 0 0   CC BY-SA 3.0
spa-MorphyNet 541 kL 0 100 0 0   CC BY-SA 3.0
swe-MorphyNet 438 kL 0 100 0 0   CC BY-SA 3.0
tgk-Uniparser 232 kW 17 56 24 3   MIT
udm-Uniparser 375 kW 8 35 36 21   MIT
Private release            
deu-CELEX 48 kL 14 40 34 13   Non-public
deu-MorphoChallenge 3 kL 4 27 42 27   Non-public
eng-CELEX 44 kL 30 51 16 3   Non-public
eng-MorphoChallenge 3 kL 16 49 27 9   Non-public
fin-MorphoChallenge 4 kL 3 18 35 44   Non-public
nld-CELEX 101 kL 11 52 25 12   Non-public
rus-KuznetsEfremDict 73 kL 1 7 17 75   Non-public
rus-TikhonovDict 103 kL 6 11 22 61   Non-public
tur-MorphoChallenge 7 kL 3 19 34 45   Non-public
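
The "distribution of morphs per unit" columns can be approximated from the simplified segmentation column. The following Python sketch is one possible way to compute such percentages, not necessarily the procedure used to produce the table. The function name morph_count_distribution is hypothetical, and the sketch assumes that morphs are separated by " + " and that percentages are rounded to whole numbers.

```python
from collections import Counter


def morph_count_distribution(segmentations) -> dict:
    """Percentage of units with 1, 2, 3 and 4+ morphs.

    Assumption: each item is a simplified segmentation such as
    "Gesellschaft + er + in", with morphs separated by " + ".
    """
    # Cap the morph count at 4 so that longer units fall into "4+".
    counts = Counter(min(len(s.split(" + ")), 4) for s in segmentations)
    total = sum(counts.values())
    if total == 0:
        return {"1": 0, "2": 0, "3": 0, "4+": 0}
    labels = {1: "1", 2: "2", 3: "3", 4: "4+"}
    return {label: round(100 * counts[n] / total) for n, label in labels.items()}


# The five DErivBase entries shown in the German excerpt.
sample = [
    "Geschäftsführer + in",
    "Geschäftsführ + ung",
    "Gesellschaft + er + in",
    "Großherzig + keit",
    "Großbäcker + ei",
]
print(morph_count_distribution(sample))
# {'1': 0, '2': 80, '3': 20, '4+': 0}
```

Because each percentage is rounded independently, a row computed this way may sum to slightly more or less than 100.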

 

Related publications

  • Bafna, N.; Bodnár, J.; Kyjánek, L.; Svoboda, E.; Ševčíková, M.; Vidra, J.; Žabokrtský, Z. 2021. Towards Universal Segmentations: Survey of Existing Morphosegmentation Resources. Technical Report TR-2021-69. Prague: Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University. ISSN: 1214-5521. URL: https://ufal.mff.cuni.cz/techrep/tr69.pdf.