Universal Segmentations

Universal Segmentations (UniSegments) is a collection of lexical resources capturing morphological segmentations, harmonised into a cross-linguistically consistent annotation scheme for many languages. The annotation scheme consists of simple tab-separated columns that store a word and its morphological segmentation, together with information about the word and the segmented units, e.g., part-of-speech categories and types of morphs/morphemes.

Segmentation data before and after harmonisation

	
Each example gives the resource, a record in its original format, and the same word in the UniSegments format.

Ex. 1: Démonette
  Original format:
    "abaissement","tlfnome","abaisser","tlfnome","Ncms","tlfnome",
    "Vmn----","tlfnome", "simple","derif","suf","ment","derif",,,,
    "@RES","demonette","@","demonette", "résultat de abaisser",
    "derif","résultat de @","demonette","descendant", "demonette",
    "abaiss","derif",,,"derif"
  UniSegments format:
    abaiss + e + ment (lowering)

Ex. 2: DerIvaTario
  Original format:
    3951;ABBATTIMENTO;BATTERE:vrb_th;
    ACons:ad:mt2:ms2b;MENTO:mento:mt4:ms1;;;;
  UniSegments format:
    ab + batt + i + mento (breakdown)

Ex. 3: DerivBase.Ru
  Original format:
    вымор noun повыморить verb
    rule887(по + noun + и1(ть) -> verb)
    PFX,SFX
  UniSegments format:
    по + вымори + ть (become extinct)

Ex. 4: MorphoLex
  Original format:
    rafraîchissant [VB]>>sant>
  UniSegments format:
    r + a + fraîchis + sant (refreshing)

Ex. 5: Word Formation Latin
  Original format:
    (23891,'malaxo','V1','','VmF','m0158','malaxo',
    'VERB',NULL,'B')
    (23890,'malaxatio','N3B','f','NcC','m0157',
    'malaxatio','NOUN',NULL,'B')
    (23891,1,23890,'86','a','2016-03-29 12:45:48')
    ('V-To-N','Derivation_Suffix','86','','n6p1; n2np;
    Regular PP: v1*; v2*; v3*; v4*; v5*; v6*','','(t)io(n)',
    'n31','abiurat-io, -ion-is; abstrus-io, -ion-is')
  UniSegments format:
    malax + a + tio (comminution)

Examples of the harmonised data

The file format consists of five columns: word form, lemma, part-of-speech category, simplified morphological segmentation, and detailed annotations of indices and types of individual morphological segments.
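As a rough illustration, one line of such a file could be read as follows. This is a minimal sketch, assuming the five columns are tab-separated in the order listed above and that the fifth column is a JSON object, as it appears in the excerpts on this page. The names `Entry` and `parse_line` are hypothetical and are not taken from the UniSegments Python API.

```python
import json
from typing import NamedTuple


class Entry(NamedTuple):
    """One line of a UniSegments file (hypothetical helper type)."""
    form: str
    lemma: str
    pos: str
    segmentation: str
    annotation: dict


def parse_line(line: str) -> Entry:
    """Split a tab-separated line into five columns and decode the JSON column."""
    form, lemma, pos, segmentation, annotation = line.rstrip("\n").split("\t")
    return Entry(form, lemma, pos, segmentation, json.loads(annotation))


# Sample line built from the CroDeriV excerpt, with tabs assumed as separators.
line = (
    "podrapati\tpodrapati\tVERB\tpo + drap + a + ti\t"
    '{"annot_name": "CroDeriV-1.0", "segmentation": ['
    '{"span": [0, 1], "type": "prefix"}, {"span": [2, 3, 4, 5], "type": "stem"}, '
    '{"span": [6], "type": "suffix"}, {"span": [7, 8], "type": "ending"}]}'
)
entry = parse_line(line)
print(entry.pos, entry.segmentation.split(" + "))
print([segment["type"] for segment in entry.annotation["segmentation"]])
```

The simplified segmentation stays a plain string here; splitting it on ` + ` gives the list of morphs.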

An excerpt from CroDeriV for Croatian:

	
podrapati	podrapati	VERB	po + drap + a + ti	{"annot_name": "CroDeriV-1.0", "segmentation": [{"span": [0, 1], "type": "prefix"}, {"span": [2, 3, 4, 5], "type": "stem"}, {"span": [6], "type": "suffix"}, {"span": [7, 8], "type": "ending"}]}
podrazumijevati	podrazumijevati	VERB	pod + raz + um + ijev + a + ti	{"annot_name": "CroDeriV-1.0", "segmentation": [{"span": [0, 1, 2], "type": "prefix"}, {"span": [3, 4, 5], "type": "prefix"}, {"span": [6, 7], "type": "stem"}, {"span": [8, 9, 10, 11], "type": "suffix"}, {"span": [12], "type": "suffix"}, {"span": [13, 14], "type": "ending"}]}
podraškati	podraškati	VERB	po + draš + k + a + ti	{"annot_name": "CroDeriV-1.0", "segmentation": [{"span": [0, 1], "type": "prefix"}, {"span": [2, 3, 4, 5], "type": "stem"}, {"span": [6], "type": "suffix"}, {"span": [7], "type": "suffix"}, {"span": [8, 9], "type": "ending"}]}
podraškivati	podraškivati	VERB	po + draš + k + iv + a + ti	{"annot_name": "CroDeriV-1.0", "segmentation": [{"span": [0, 1], "type": "prefix"}, {"span": [2, 3, 4, 5], "type": "stem"}, {"span": [6], "type": "suffix"}, {"span": [7, 8], "type": "suffix"}, {"span": [9], "type": "suffix"}, {"span": [10, 11], "type": "ending"}]}
podražavati	podražavati	VERB	po + draž + av + a + ti	{"annot_name": "CroDeriV-1.0", "segmentation": [{"span": [0, 1], "type": "prefix"}, {"span": [2, 3, 4, 5], "type": "stem"}, {"span": [6, 7], "type": "suffix"}, {"span": [8], "type": "suffix"}, {"span": [9, 10], "type": "ending"}]}
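In the CroDeriV excerpt, each `span` appears to list zero-based character positions within the word form. Assuming that reading, the morphs and their types can be recovered from the detailed annotation alone, as in the sketch below. The helper `morphs_from_spans` is hypothetical, and the indices are taken to count Unicode code points, so the word form should use precomposed characters such as `š`.

```python
import json
from typing import List, Tuple


def morphs_from_spans(form: str, annotation: dict) -> List[Tuple[str, str]]:
    """Return (morph, type) pairs, reading each span as character indices."""
    pairs = []
    for segment in annotation["segmentation"]:
        # Join the characters at the listed positions to rebuild the morph.
        morph = "".join(form[i] for i in segment["span"])
        pairs.append((morph, segment["type"]))
    return pairs


# Annotation column of the "podraškati" entry from the excerpt.
annotation = json.loads(
    '{"annot_name": "CroDeriV-1.0", "segmentation": ['
    '{"span": [0, 1], "type": "prefix"}, {"span": [2, 3, 4, 5], "type": "stem"}, '
    '{"span": [6], "type": "suffix"}, {"span": [7], "type": "suffix"}, '
    '{"span": [8, 9], "type": "ending"}]}'
)
pairs = morphs_from_spans("podraškati", annotation)
print(" + ".join(morph for morph, _ in pairs))
```

For this entry the rebuilt string matches the fourth column, `po + draš + k + a + ti`, which gives a simple consistency check between the two segmentation columns.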

An excerpt from DErivBase for German:

	
Geschäftsführerin	Geschäftsführerin	NOUN	Geschäftsführer + in	{"annot_name": "DErivBase-2.0", "segmentation": [{"span": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], "type": "unsegmented"}, {"span": [15, 16], "type": "suffix"}]}
Geschäftsführung	Geschäftsführung	NOUN	Geschäftsführ + ung	{"annot_name": "DErivBase-2.0", "segmentation": [{"span": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], "type": "unsegmented"}, {"span": [13, 14, 15], "type": "suffix"}]}
Gesellschafterin	Gesellschafterin	NOUN	Gesellschaft + er + in	{"annot_name": "DErivBase-2.0", "segmentation": [{"span": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], "type": "unsegmented"}, {"span": [12, 13], "type": "suffix"}, {"span": [14, 15], "type": "suffix"}]}
Großherzigkeit	Großherzigkeit	NOUN	Großherzig + keit	{"annot_name": "DErivBase-2.0", "segmentation": [{"span": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "type": "unsegmented"}, {"span": [10, 11, 12, 13], "type": "suffix"}]}
Großbäckerei	Großbäckerei	NOUN	Großbäcker + ei	{"annot_name": "DErivBase-2.0", "segmentation": [{"span": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "type": "unsegmented"}, {"span": [10, 11], "type": "suffix"}]}

The current version

The current version of the collection is UniSegments 1.0. In its public version, it contains 38 harmonised segmentation datasets covering 30 languages (listed in the table below). UniSegments 1.0 is available in the LINDAT/CLARIAH-CZ digital library (item: http://hdl.handle.net/11234/1-4629). The license for each of the harmonised resources included in the collection is specified in the corresponding language/resource directory.

A Python application programming interface (API) for working with the UniSegments format is also available. It is provided in the related GitHub repository.

	
                                      Distribution of morphs per unit [%]
Resource                 Size           1     2     3    4+   License

Public release
ben-KCIS                 1 kW           0   100     0     0   CC BY-NC 4.0
cat-MorphyNet            516 kL         0   100     0     0   CC BY-SA 3.0
ces-DeriNet              1,039 kL       8    16    19    57   CC BY-NC-SA 3.0
ces-MorphyNet            67 kL          0   100     0     0   CC BY-SA 3.0
deu-DerivBaseDE          61 kL         36    59     4     0   CC BY-SA 3.0
deu-MorphyNet            29 kL          0   100     0     0   CC BY-SA 3.0
eng-MorphoLex            69 kW         21    45    27     7   CC BY-NC-SA 4.0
eng-MorphyNet            292 kL         0   100     0     0   CC BY-SA 3.0
fas-PerSegLex            45 kW         34    31    24    10   CC BY-NC-SA 4.0
fin-MorphyNet            400 kL         0   100     0     0   CC BY-SA 3.0
fra-Demonette            63 kL         46    80     3     0   CC BY-NC-SA 3.0
fra-Echantinom           5 kL          53    40     6     1   CC BY 4.0
fra-MorphoLex            16 kW         43    44    12     1   CC BY-NC-SA 4.0
fra-MorphyNet            363 kL         0   100     0     0   CC BY-SA 3.0
hbs-MorphyNet            34 kL          0   100     0     0   CC BY-SA 3.0
hin-KCIS                 2 kW          29    71     0     0   CC BY-NC 4.0
hrv-CroDeriV             16 kL          0     1    20    79   CC BY-SA 3.0
hun-MorphyNet            428 kL         0   100     0     0   CC BY-SA 3.0
hye-Uniparser            594 kW         9    41    37    13   MIT
ita-DerIvaTario          11 kL          1    46    31    21   CC BY-SA 4.0
ita-MorphyNet            599 kL         0   100     0     0   CC BY-SA 3.0
kan-KCIS                 26 kW          0    11    25    64   CC BY-NC 4.0
kpv-Uniparser            205 kW         9    40    35    16   MIT
lat-WordFormationLatin   36 kL         16    52    27     5   CC BY-NC-SA 4.0
mal-KCIS                 33 kW          2    98     0     0   CC BY-NC 4.0
mar-KCIS                 32 kW          0    51    43     6   CC BY-NC 4.0
mdf-Uniparser            105 kW        10    50    31     8   MIT
mhr-Uniparser            260 kW         9    38    36    17   MIT
mon-MorphyNet            35 kL          0   100     0     0   CC BY-SA 3.0
myv-Uniparser            164 kW        10    41    36    13   MIT
pol-MorphyNet            508 kL         0   100     0     0   CC BY-SA 3.0
por-MorphyNet            449 kL         0   100     0     0   CC BY-SA 3.0
rus-DerivBaseRU          156 kL        31    35    23    10   Apache-2.0
rus-MorphyNet            692 kL         0   100     0     0   CC BY-SA 3.0
spa-MorphyNet            541 kL         0   100     0     0   CC BY-SA 3.0
swe-MorphyNet            438 kL         0   100     0     0   CC BY-SA 3.0
tgk-Uniparser            232 kW        17    56    24     3   MIT
udm-Uniparser            375 kW         8    35    36    21   MIT

Private release
deu-CELEX                48 kL         14    40    34    13   Non-public
deu-MorphoChallenge      3 kL           4    27    42    27   Non-public
eng-CELEX                44 kL         30    51    16     3   Non-public
eng-MorphoChallenge      3 kL          16    49    27     9   Non-public
fin-MorphoChallenge      4 kL           3    18    35    44   Non-public
nld-CELEX                101 kL        11    52    25    12   Non-public
rus-KuznetsEfremDict     73 kL          1     7    17    75   Non-public
rus-TikhonovDict         103 kL         6    11    22    61   Non-public
tur-MorphoChallenge      7 kL           3    19    34    45   Non-public
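
The "Distribution of morphs per unit" columns can be read as the share of entries that have one, two, three, or four or more morphs. The sketch below shows one way such a distribution could be computed from the simplified segmentation column. It assumes morphs are separated by ` + ` and that percentages are rounded to whole numbers; how the published figures were actually computed is not stated on this page, and `morph_distribution` is a hypothetical helper.

```python
from collections import Counter


def morph_distribution(segmentations):
    """Percentage of entries with 1, 2, 3 and 4+ morphs, rounded to integers."""
    buckets = ("1", "2", "3", "4+")
    counts = Counter()
    for segmentation in segmentations:
        n_morphs = len(segmentation.split(" + "))
        counts["4+" if n_morphs >= 4 else str(n_morphs)] += 1
    total = sum(counts.values())
    if total == 0:
        # No entries: report an all-zero distribution instead of dividing by zero.
        return {bucket: 0 for bucket in buckets}
    return {bucket: round(100 * counts[bucket] / total) for bucket in buckets}


# Simplified segmentations taken from the excerpts on this page.
sample = [
    "po + drap + a + ti",
    "po + draš + k + a + ti",
    "Geschäftsführer + in",
    "Gesellschaft + er + in",
]
print(morph_distribution(sample))
```

Because each percentage is rounded independently, the four values of a row need not add up to exactly 100.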

Presentations

LREC 2022, Towards Universal Segmentations: UniSegments 1.0

Related publications

Žabokrtský, Z.; Bafna, N.; Bodnár, J.; Kyjánek, L.; Svoboda, E.; Ševčíková, M.; Vidra, J. 2022. Towards Universal Segmentations: UniSegments 1.0. In: Proceedings of the 13th Language Resources and Evaluation Conference (LREC). Marseille, pp. 1137-1149.
Bafna, N.; Bodnár, J.; Kyjánek, L.; Svoboda, E.; Ševčíková, M.; Vidra, J.; Žabokrtský, Z. 2021. Towards Universal Segmentations: Survey of Existing Morphosegmentation Resources. Technical Report TR-2021-69. Prague: Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University. ISSN: 1214-5521.
