Universal Segmentations (UniSegments) is a collection of lexical resources capturing morphological segmentations, harmonized into a cross-linguistically consistent annotation scheme for many languages. The annotation scheme consists of simple tab-separated columns that store a word and its morphological segmentation, together with information about the word and its segmented units, e.g., the part-of-speech category and the type of each morph/morpheme.
Example | Resource | Original format | → | UniSegments format |
Ex. 1 | Démonette | "abaissement","tlfnome","abaisser","tlfnome","Ncms","tlfnome", "Vmn----","tlfnome", "simple","derif","suf","ment","derif",,,, "@RES","demonette","@","demonette", "résultat de abaisser", "derif","résultat de @","demonette","descendant", "demonette", "abaiss","derif",,,"derif" | → | abaiss + e + ment (lowering) |
Ex. 2 | DerIvaTario | 3951;ABBATTIMENTO;BATTERE:vrb_th; ACons:ad:mt2:ms2b;MENTO:mento:mt4:ms1;;;; | → | ab + batt + i + mento (breakdown) |
Ex. 3 | DerivBase.Ru | вымор noun повыморить verb rule887(по + noun + и1(ть) -> verb) PFX,SFX | → | по + вымори + ть (become extinct) |
Ex. 4 | MorphoLex | rafraîchissant | → | r + a + fraîchis + sant (refreshing) |
Ex. 5 | Word Formation Latin | (23891,'malaxo','V1','','VmF','m0158','malaxo', 'VERB',NULL,'B') (23890,'malaxatio','N3B','f','NcC','m0157', 'malaxatio','NOUN',NULL,'B') (23891,1,23890,'86','a','2016-03-29 12:45:48') ('V-To-N','Derivation_Suffix','86','','n6p1; n2np; Regular PP: v1*; v2*; v3*; v4*; v5*; v6*','','(t)io(n)', 'n31','abiurat-io, -ion-is; abstrus-io, -ion-is') | → | malax + a + tio (comminution) |
The file format consists of five columns: word form, lemma, part-of-speech category, simplified morphological segmentation, and detailed annotations of indices and types of individual morphological segments.
An excerpt from CroDeriV for Croatian:
podrapati | podrapati | VERB | po + drap + a + ti | {"annot_name": "CroDeriV-1.0", "segmentation": [{"span": [0, 1], "type": "prefix"}, {"span": [2, 3, 4, 5], "type": "stem"}, {"span": [6], "type": "suffix"}, {"span": [7, 8], "type": "ending"}]} |
podrazumijevati | podrazumijevati | VERB | pod + raz + um + ijev + a + ti | {"annot_name": "CroDeriV-1.0", "segmentation": [{"span": [0, 1, 2], "type": "prefix"}, {"span": [3, 4, 5], "type": "prefix"}, {"span": [6, 7], "type": "stem"}, {"span": [8, 9, 10, 11], "type": "suffix"}, {"span": [12], "type": "suffix"}, {"span": [13, 14], "type": "ending"}]} |
podraškati | podraškati | VERB | po + draš + k + a + ti | {"annot_name": "CroDeriV-1.0", "segmentation": [{"span": [0, 1], "type": "prefix"}, {"span": [2, 3, 4, 5], "type": "stem"}, {"span": [6], "type": "suffix"}, {"span": [7], "type": "suffix"}, {"span": [8, 9], "type": "ending"}]} |
podraškivati | podraškivati | VERB | po + draš + k + iv + a + ti | {"annot_name": "CroDeriV-1.0", "segmentation": [{"span": [0, 1], "type": "prefix"}, {"span": [2, 3, 4, 5], "type": "stem"}, {"span": [6], "type": "suffix"}, {"span": [7, 8], "type": "suffix"}, {"span": [9], "type": "suffix"}, {"span": [10, 11], "type": "ending"}]} |
podražavati | podražavati | VERB | po + draž + av + a + ti | {"annot_name": "CroDeriV-1.0", "segmentation": [{"span": [0, 1], "type": "prefix"}, {"span": [2, 3, 4, 5], "type": "stem"}, {"span": [6, 7], "type": "suffix"}, {"span": [8], "type": "suffix"}, {"span": [9, 10], "type": "ending"}]} |
An excerpt from DErivBase for German:
Geschäftsführerin | Geschäftsführerin | NOUN | Geschäftsführer + in | {"annot_name": "DErivBase-2.0", "segmentation": [{"span": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], "type": "unsegmented"}, {"span": [15, 16], "type": "suffix"}]} |
Geschäftsführung | Geschäftsführung | NOUN | Geschäftsführ + ung | {"annot_name": "DErivBase-2.0", "segmentation": [{"span": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], "type": "unsegmented"}, {"span": [13, 14, 15], "type": "suffix"}]} |
Gesellschafterin | Gesellschafterin | NOUN | Gesellschaft + er + in | {"annot_name": "DErivBase-2.0", "segmentation": [{"span": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], "type": "unsegmented"}, {"span": [12, 13], "type": "suffix"}, {"span": [14, 15], "type": "suffix"}]} |
Großherzigkeit | Großherzigkeit | NOUN | Großherzig + keit | {"annot_name": "DErivBase-2.0", "segmentation": [{"span": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "type": "unsegmented"}, {"span": [10, 11, 12, 13], "type": "suffix"}]} |
Großbäckerei | Großbäckerei | NOUN | Großbäcker + ei | {"annot_name": "DErivBase-2.0", "segmentation": [{"span": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "type": "unsegmented"}, {"span": [10, 11], "type": "suffix"}]} |
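To make the five-column layout concrete, here is a minimal sketch of reading one line and rebuilding the typed morphs from the character indices in the fifth column. It uses only the Python standard library, not the project's own API. The names `Entry`, `parse_line` and `typed_morphs` are hypothetical, and the sketch assumes that the fifth column holds a single JSON object (as in the excerpts above) and that span indices point into the word form in the first column.

```python
import json
from typing import NamedTuple


class Entry(NamedTuple):
    form: str
    lemma: str
    pos: str
    simple_segmentation: str
    annotation: dict  # parsed JSON from the fifth column


def parse_line(line: str) -> Entry:
    """Split one tab-separated line into its five columns."""
    form, lemma, pos, simple, raw = line.rstrip("\n").split("\t", 4)
    return Entry(form, lemma, pos, simple, json.loads(raw))


def typed_morphs(entry: Entry) -> list:
    """Rebuild (morph, type) pairs from the character indices in each span.

    Assumption: the indices refer to characters of the word form.
    """
    return [
        ("".join(entry.form[i] for i in seg["span"]), seg["type"])
        for seg in entry.annotation["segmentation"]
    ]


# Sample line rebuilt from the CroDeriV excerpt, joined with tabs.
annotation = {
    "annot_name": "CroDeriV-1.0",
    "segmentation": [
        {"span": [0, 1], "type": "prefix"},
        {"span": [2, 3, 4, 5], "type": "stem"},
        {"span": [6], "type": "suffix"},
        {"span": [7, 8], "type": "ending"},
    ],
}
sample = "\t".join(
    ["podrapati", "podrapati", "VERB", "po + drap + a + ti", json.dumps(annotation)]
)

entry = parse_line(sample)
print(typed_morphs(entry))
```

For the sample line this prints `[('po', 'prefix'), ('drap', 'stem'), ('a', 'suffix'), ('ti', 'ending')]`, which matches the simplified segmentation in the fourth column.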
The current version of the collection is UniSegments 1.0. In its public version, it contains 38 harmonized segmentation datasets covering 30 languages (listed in the table below). UniSegments 1.0 is available in the LINDAT/CLARIAH-CZ digital library (item: http://hdl.handle.net/11234/1-4629). The license for each of the harmonized resources included in the collection is specified in the appropriate language/resource directory.
A Python application programming interface (API) for working with the UniSegments format is also available in the related GitHub repository.
Resource | Size | Distribution of morphs per unit [%] | | | | License |
Public release | | 1 | 2 | 3 | 4+ | |
ben-KCIS | 1 kW | 0 | 100 | 0 | 0 | CC BY-NC 4.0 |
cat-MorphyNet | 516 kL | 0 | 100 | 0 | 0 | CC BY-SA 3.0 |
ces-DeriNet | 1,039 kL | 8 | 16 | 19 | 57 | CC BY-NC-SA 3.0 |
ces-MorphyNet | 67 kL | 0 | 100 | 0 | 0 | CC BY-SA 3.0 |
deu-DerivBaseDE | 61 kL | 36 | 59 | 4 | 0 | CC BY-SA 3.0 |
deu-MorphyNet | 29 kL | 0 | 100 | 0 | 0 | CC BY-SA 3.0 |
eng-MorphoLex | 69 kW | 21 | 45 | 27 | 7 | CC BY-NC-SA 4.0 |
eng-MorphyNet | 292 kL | 0 | 100 | 0 | 0 | CC BY-SA 3.0 |
fas-PerSegLex | 45 kW | 34 | 31 | 24 | 10 | CC BY-NC-SA 4.0 |
fin-MorphyNet | 400 kL | 0 | 100 | 0 | 0 | CC BY-SA 3.0 |
fra-Demonette | 63 kL | 46 | 80 | 3 | 0 | CC BY-NC-SA 3.0 |
fra-Echantinom | 5 kL | 53 | 40 | 6 | 1 | CC BY 4.0 |
fra-MorphoLex | 16 kW | 43 | 44 | 12 | 1 | CC BY-NC-SA 4.0 |
fra-MorphyNet | 363 kL | 0 | 100 | 0 | 0 | CC BY-SA 3.0 |
hbs-MorphyNet | 34 kL | 0 | 100 | 0 | 0 | CC BY-SA 3.0 |
hin-KCIS | 2 kW | 29 | 71 | 0 | 0 | CC BY-NC 4.0 |
hrv-CroDeriV | 16 kL | 0 | 1 | 20 | 79 | CC BY-SA 3.0 |
hun-MorphyNet | 428 kL | 0 | 100 | 0 | 0 | CC BY-SA 3.0 |
hye-Uniparser | 594 kW | 9 | 41 | 37 | 13 | MIT |
ita-DerIvaTario | 11 kL | 1 | 46 | 31 | 21 | CC BY-SA 4.0 |
ita-MorphyNet | 599 kL | 0 | 100 | 0 | 0 | CC BY-SA 3.0 |
kan-KCIS | 26 kW | 0 | 11 | 25 | 64 | CC BY-NC 4.0 |
kpv-Uniparser | 205 kW | 9 | 40 | 35 | 16 | MIT |
lat-WordFormationLatin | 36 kL | 16 | 52 | 27 | 5 | CC BY-NC-SA 4.0 |
mal-KCIS | 33 kW | 2 | 98 | 0 | 0 | CC BY-NC 4.0 |
mar-KCIS | 32 kW | 0 | 51 | 43 | 6 | CC BY-NC 4.0 |
mdf-Uniparser | 105 kW | 10 | 50 | 31 | 8 | MIT |
mhr-Uniparser | 260 kW | 9 | 38 | 36 | 17 | MIT |
mon-MorphyNet | 35 kL | 0 | 100 | 0 | 0 | CC BY-SA 3.0 |
myv-Uniparser | 164 kW | 10 | 41 | 36 | 13 | MIT |
pol-MorphyNet | 508 kL | 0 | 100 | 0 | 0 | CC BY-SA 3.0 |
por-MorphyNet | 449 kL | 0 | 100 | 0 | 0 | CC BY-SA 3.0 |
rus-DerivBaseRU | 156 kL | 31 | 35 | 23 | 10 | Apache-2.0 |
rus-MorphyNet | 692 kL | 0 | 100 | 0 | 0 | CC BY-SA 3.0 |
spa-MorphyNet | 541 kL | 0 | 100 | 0 | 0 | CC BY-SA 3.0 |
swe-MorphyNet | 438 kL | 0 | 100 | 0 | 0 | CC BY-SA 3.0 |
tgk-Uniparser | 232 kW | 17 | 56 | 24 | 3 | MIT |
udm-Uniparser | 375 kW | 8 | 35 | 36 | 21 | MIT |
Private release | | | | | | |
deu-CELEX | 48 kL | 14 | 40 | 34 | 13 | Non-public |
deu-MorphoChallenge | 3 kL | 4 | 27 | 42 | 27 | Non-public |
eng-CELEX | 44 kL | 30 | 51 | 16 | 3 | Non-public |
eng-MorphoChallenge | 3 kL | 16 | 49 | 27 | 9 | Non-public |
fin-MorphoChallenge | 4 kL | 3 | 18 | 35 | 44 | Non-public |
nld-CELEX | 101 kL | 11 | 52 | 25 | 12 | Non-public |
rus-KuznetsEfremDict | 73 kL | 1 | 7 | 17 | 75 | Non-public |
rus-TikhonovDict | 103 kL | 6 | 11 | 22 | 61 | Non-public |
tur-MorphoChallenge | 7 kL | 3 | 19 | 34 | 45 | Non-public |
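The distribution columns in the table can be illustrated with a short sketch that buckets units by their number of morphs (1, 2, 3, 4+) and reports rounded percentages. This is not the authors' script: the function name `morph_count_distribution` is hypothetical, and counting morphs by splitting the simplified segmentation on " + " and rounding to whole percent are assumptions made for illustration.

```python
from collections import Counter

BUCKETS = ("1", "2", "3", "4+")


def morph_count_distribution(segmentations):
    """Return the share of units with 1, 2, 3 and 4+ morphs, in percent."""
    counts = Counter()
    for seg in segmentations:
        n = len(seg.split(" + "))  # morphs are separated by " + "
        counts["4+" if n >= 4 else str(n)] += 1
    total = sum(counts.values())
    if total == 0:
        return {bucket: 0 for bucket in BUCKETS}
    return {bucket: round(100 * counts[bucket] / total) for bucket in BUCKETS}


# Simplified segmentations taken from the excerpts above.
sample = [
    "po + drap + a + ti",
    "Geschäftsführer + in",
    "Gesellschaft + er + in",
    "Großherzig + keit",
]
distribution = morph_count_distribution(sample)
print(distribution)
```

With these four units the result is 0 % for one morph, 50 % for two, 25 % for three and 25 % for four or more.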