Universal Segmentations (UniSegments) is a collection of lexical resources that capture morphological segmentations, harmonised into a cross-linguistically consistent annotation scheme for many languages. The annotation scheme consists of simple tab-separated columns that store a word and its morphological segmentation, together with information about the word and the segmented units, e.g., part-of-speech categories and types of morphs/morphemes.
The file format consists of five columns: word form, lemma, part-of-speech category, simplified morphological segmentation, and detailed annotations of indices and types of individual morphological segments.
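As a rough illustration of the five-column layout, the following Python sketch reads such a file line by line. It is not the official API: the names `Entry` and `read_entries` are hypothetical, the sample line is invented for illustration and not taken from any UniSegments resource, and the last two columns are kept as raw strings because their internal syntax is not described here.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Entry(NamedTuple):
    """One line of a five-column file (field names are illustrative)."""
    form: str          # column 1: word form
    lemma: str         # column 2: lemma
    pos: str           # column 3: part-of-speech category
    segmentation: str  # column 4: simplified segmentation, kept as raw text
    annotation: str    # column 5: detailed segment annotation, kept as raw text


def read_entries(stream: TextIO) -> Iterator[Entry]:
    """Yield one Entry per non-empty line; raise on a wrong column count."""
    for line_number, line in enumerate(stream, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(
                f"line {line_number}: expected 5 columns, got {len(fields)}"
            )
        yield Entry(*fields)


# Invented sample line (assumption): real files may format columns 4 and 5 differently.
sample = "teachers\tteacher\tNOUN\tteach + er + s\t{}\n"
for entry in read_entries(io.StringIO(sample)):
    print(entry.form, entry.pos, entry.segmentation)
```

To read an actual resource, the same function can be given a file opened with `open(path, encoding="utf-8")`. Raising on malformed lines, rather than skipping them silently, is a design choice that makes format deviations visible early.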
An excerpt from CroDeriV for Croatian:
An excerpt from DErivBase for German:
The current version of the collection is UniSegments 1.0. Its public version contains 38 harmonised segmentation datasets covering 30 languages (listed in the table below). UniSegments 1.0 is available in the LINDAT/CLARIAH-CZ digital library (item: http://hdl.handle.net/11234/1-4629). The license for each harmonised resource in the collection is specified in the corresponding language/resource directory.
A Python application programming interface (API) for working with the UniSegments format is also available in the related GitHub repository.