Universal Derivations (UDer) is a collection of harmonized lexical networks capturing word-formation, especially derivation, in a cross-linguistically consistent annotation scheme for many languages. The annotation scheme is based on a rooted tree data structure (as used in the DeriNet 2.0 database), in which nodes correspond to lexemes while edges represent derivational relations or compounding.
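The rooted-tree scheme described above can be illustrated with a minimal sketch. The `Lexeme` class, its method names, and the English example family are all hypothetical and chosen for illustration only; this is not the DeriNet 2.0 API, and the example words are not taken from any UDer resource.

```python
from typing import List, Optional


class Lexeme:
    """One node of a derivational tree (hypothetical class, not the DeriNet API)."""

    def __init__(self, lemma: str, pos: str) -> None:
        self.lemma = lemma
        self.pos = pos
        self.parent: Optional["Lexeme"] = None  # base lexeme; None for a root
        self.children: List["Lexeme"] = []      # lexemes derived from this one

    def derive(self, lemma: str, pos: str) -> "Lexeme":
        """Create a derived lexeme and link it to this one by a single edge."""
        child = Lexeme(lemma, pos)
        child.parent = self
        self.children.append(child)
        return child

    def root(self) -> "Lexeme":
        """Follow parent links up to the root of the family."""
        node = self
        while node.parent is not None:
            node = node.parent
        return node

    def subtree(self) -> List["Lexeme"]:
        """Return this lexeme and everything derived from it, in pre-order."""
        nodes = [self]
        for child in self.children:
            nodes.extend(child.subtree())
        return nodes


# Illustrative English family (not taken from any UDer resource).
read = Lexeme("read", "VERB")
reader = read.derive("reader", "NOUN")
readership = reader.derive("readership", "NOUN")
readable = read.derive("readable", "ADJ")
readability = readable.derive("readability", "NOUN")

# A family is the whole tree reachable from the root.
family = readership.root().subtree()
edges = sum(len(node.children) for node in family)
print([node.lemma for node in family])
print(len(family), edges)
```

Because every lexeme except the root has exactly one parent, a family of n lexemes in this sketch always has n - 1 edges.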
Each individual resource in the UDer collection can be searched online using two versions of DeriNet Search. DeriSearch v2 shows all pieces of information stored in the data. The data can be processed using the DeriNet 2.0 API. We provide three Jupyter Notebooks as manuals: documentation of the API, a tutorial on the API modules, and a simple example of using the API. The scripts for harmonizing the original resources and releasing the UDer collection are available in the GitHub repository.
The current version of the collection is UDer 1.1. It contains 31 harmonized resources covering 21 languages (listed in the table below). UDer 1.1 is available in the LINDAT/CLARIAH-CZ digital library (item: http://hdl.handle.net/11234/1-3247). The license for each harmonized resource in the collection is specified in the corresponding language/resource directory.
Resource | Language | Lexemes | Relations | Families | License |
---|---|---|---|---|---|
CatVar | English | 82,675 | 24,628 | 58,047 | OSL-1.1 |
CroDeriV | Croatian | 5,093 | 4,948 | 145 | CC BY-NC-SA 3.0 |
D-CELEX | Dutch | 125,611 | 11,150 | 114,461 | GPL-3.0 (for scripts) |
Démonette | French | 22,060 | 13,808 | 8,252 | CC BY-NC-SA 3.0 |
DeriNet | Czech | 1,039,012 | 835,738 | 203,274 | CC BY-NC-SA 3.0 |
DeriNet.ES | Spanish | 151,173 | 42,825 | 108,348 | CC BY-NC-SA 3.0 |
DeriNet.FA | Persian | 43,357 | 35,745 | 7,612 | CC BY-NC-SA 4.0 |
DeriNet.RU | Russian | 337,632 | 164,725 | 172,907 | CC BY-NC-SA 4.0 |
DerIvaTario | Italian | 8,267 | 1,783 | 6,484 | CC BY-SA 4.0 |
DErivBase | German | 280,775 | 43,367 | 237,408 | CC BY-SA 3.0 |
DerivBase.Hr | Croatian | 99,606 | 34,639 | 64,967 | CC BY-SA 3.0 |
DerivBase.Ru | Russian | 270,473 | 136,449 | 136,449 | Apache 2.0 |
E-CELEX | English | 52,447 | 9,319 | 43,128 | GPL-3.0 (for scripts) |
EstWordNet | Estonian | 988 | 507 | 481 | CC BY-SA 3.0 |
EtymWordNet-cat | Catalan | 7,496 | 4,568 | 2,928 | CC BY-SA 3.0 |
EtymWordNet-ces | Czech | 7,633 | 5,237 | 2,396 | CC BY-SA 3.0 |
EtymWordNet-gla | Gaelic | 7,524 | 5,013 | 2,511 | CC BY-SA 3.0 |
EtymWordNet-pol | Polish | 27,797 | 24,876 | 2,921 | CC BY-SA 3.0 |
EtymWordNet-por | Portuguese | 2,797 | 1,610 | 1,187 | CC BY-SA 3.0 |
EtymWordNet-rus | Russian | 4,005 | 3,227 | 778 | CC BY-SA 3.0 |
EtymWordNet-hbs | Serbo-Croatian | 8,033 | 6,303 | 1,730 | CC BY-SA 3.0 |
EtymWordNet-swe | Swedish | 7,333 | 4,423 | 2,910 | CC BY-SA 3.0 |
EtymWordNet-tur | Turkish | 7,774 | 5,837 | 1,937 | CC BY-SA 3.0 |
FinnWordNet | Finnish | 20,035 | 11,890 | 8,145 | CC BY-SA 4.0 |
G-CELEX | German | 51,728 | 13,301 | 38,427 | GPL-3.0 (for scripts) |
GoldenCompoundAnalyses | Russian | 4,931 | 1,639 | 3,292 | CC BY-NC 4.0 |
Nomlex-PT | Portuguese | 7,020 | 4,201 | 2,819 | CC BY-SA 4.0 |
Sloleks | Slovenian | 48,054 | 29,121 | 18,933 | CC BY-NC-SA 4.0 |
The Morpho-Semantic Database | English | 13,813 | 7,855 | 5,958 | CC BY-NC-SA 3.0 |
The Polish WFN | Polish | 262,887 | 189,217 | 73,670 | CC BY-NC-SA 3.0 |
Word Formation Latin | Latin | 36,258 | 32,625 | 3,633 | CC BY-NC-SA 4.0 |
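The counts in the table can be cross-checked against the tree structure. Assuming every family is a single rooted tree and the Relations column counts only tree edges, a family of n lexemes contributes n - 1 relations, so Families should equal Lexemes minus Relations. The sketch below applies this check to three rows copied from the table; the function names are hypothetical, and the assumption about what Relations counts is ours rather than a statement from the UDer documentation.

```python
from typing import List, Tuple

# Three rows copied verbatim from the table above.
ROWS = """\
CatVar | English | 82,675 | 24,628 | 58,047 | OSL-1.1 |
CroDeriV | Croatian | 5,093 | 4,948 | 145 | CC BY-NC-SA 3.0 |
DerivBase.Ru | Russian | 270,473 | 136,449 | 136,449 | Apache 2.0 |
"""


def to_int(cell: str) -> int:
    """Parse a count written with thousands separators, e.g. '82,675'."""
    return int(cell.replace(",", ""))


def mismatches(rows: str) -> List[Tuple[str, int, int]]:
    """Return (resource, listed families, lexemes - relations) where they differ."""
    flagged = []
    for line in rows.splitlines():
        if not line.strip():
            continue
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        name = cells[0]
        lexemes, relations, families = (to_int(c) for c in cells[2:5])
        implied = lexemes - relations  # assumes Relations counts tree edges only
        if implied != families:
            flagged.append((name, families, implied))
    return flagged


print(mismatches(ROWS))
```

On these rows the check flags only DerivBase.Ru, where the listed Relations and Families values are identical and 270,473 - 136,449 gives 134,024. A flagged row may indicate a transcription slip or relations that are not tree edges, so it is worth comparing against the released data.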