Folder and File Hierarchy in HamleDT 3.0
- First-level folders are named by treebank code. A code always starts with the ISO 639 language code, optionally followed by a hyphen and a treebank identifier. For example, de-ud11 is the German corpus taken from Universal Dependencies release 1.1, while de is the TIGER / CoNLL 2009 corpus that was part of previous versions of HamleDT.
- treex/{train,dev,test}/*.treex.gz … Treex XML format. For *-ud11 treebanks the files contain only the UD tree. For other treebanks they contain three trees per sentence: the original annotation, a Prague-style tree, and the UD tree.
- conllu/{train,dev,test}/*.conllu.gz … Universal Dependencies exported to CoNLL-U format.
- conllu-patch/{train,dev,test}/*.conllu.gz … for non-free treebanks. Word forms, lemmas and original POS tags have been removed, except that once per 1000 words three full tokens are kept to facilitate synchronization checks. The files contain universal POS tags, features and dependencies.
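The layout above can be navigated programmatically. The following is a minimal sketch, assuming that the format folders (treex, conllu, conllu-patch) sit directly under each treebank folder and that language codes are two or three lowercase letters. The helper names split_treebank_code and data_glob are hypothetical and not part of HamleDT or Treex.

```python
import re
from pathlib import Path

# Assumption: a treebank code is a 2-3 letter ISO 639 language code,
# optionally followed by a hyphen and a lowercase alphanumeric treebank id.
TREEBANK_CODE = re.compile(r"^(?P<lang>[a-z]{2,3})(?:-(?P<treebank>[a-z0-9]+))?$")

# File name patterns per format folder, as listed in the hierarchy above.
FILE_PATTERNS = {
    "treex": "*.treex.gz",
    "conllu": "*.conllu.gz",
    "conllu-patch": "*.conllu.gz",
}


def split_treebank_code(folder_name):
    """Split a first-level folder name into (language, treebank id or None)."""
    match = TREEBANK_CODE.match(folder_name)
    if match is None:
        raise ValueError(f"not a recognized treebank code: {folder_name!r}")
    return match.group("lang"), match.group("treebank")


def data_glob(root, code, fmt, part):
    """Build a glob pattern for one format and one data split of a treebank."""
    if fmt not in FILE_PATTERNS:
        raise ValueError(f"unknown format folder: {fmt!r}")
    if part not in ("train", "dev", "test"):
        raise ValueError(f"unknown data split: {part!r}")
    return (Path(root) / code / fmt / part / FILE_PATTERNS[fmt]).as_posix()
```

For example, data_glob("hamledt", "de-ud11", "conllu", "train") yields a pattern that can be passed to glob.glob to list the training files of the German UD 1.1 treebank, where "hamledt" stands for wherever the release was unpacked.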
The preferred file format is Treex XML because the other formats are less expressive and information gets lost in conversion. Both the Treex and the CoNLL-U files can be read by the Treex framework for further processing. Treex files can also be viewed (or even edited!) using the TrEd editor with the EasyTreex extension.
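For a quick look at the CoNLL-U export without Treex, the files can also be read with plain Python. This is a minimal sketch, assuming the standard CoNLL-U layout (ten tab-separated columns, comment lines starting with #, a blank line after each sentence). The function names parse_conllu and read_conllu_gz and the lowercase field names are hypothetical choices, not part of HamleDT or Treex.

```python
import gzip

# The ten CoNLL-U columns, in order (lowercase names chosen for this sketch).
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_conllu(lines):
    """Yield sentences as lists of token dicts from an iterable of lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Sentence-level comments are skipped in this sketch.
            continue
        else:
            sentence.append(dict(zip(FIELDS, line.split("\t"))))
    if sentence:
        # Handle a file that does not end with a blank line.
        yield sentence


def read_conllu_gz(path):
    """Yield sentences from a gzipped CoNLL-U file such as *.conllu.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from parse_conllu(handle)
```

All values are kept as strings, so the head column must be converted with int() before it is used as an index. Since the CoNLL-U export carries only the UD tree, the original and Prague-style trees are available only through the Treex files.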
Dependency Labels and Structural Rules
For documentation of Universal Dependencies, see http://universaldependencies.github.io/docs/.
The treebanks that we harmonized ourselves (i.e. all except the *-ud11 treebanks) conform to the UD standard somewhat less closely. Most notably, they keep the original tokenization, including NULL nodes, multi-word expressions collapsed into one node, etc.
Interset
Morphosyntactic features and values used in HamleDT come from DZ Interset, which is a superset of the Universal Features (part of the Universal Dependencies standard). For their list and brief description, see here.
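In the CoNLL-U files, features appear in the FEATS column as Name=Value pairs separated by vertical bars, with an underscore for an empty set and commas between multiple values of one feature. The following is a minimal sketch of unpacking that column. The function name parse_feats is hypothetical, and the feature names in the example are ordinary Universal Features used only for illustration.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string into a dict of feature name -> list of values."""
    if feats in ("", "_"):
        # An underscore marks a token without any features.
        return {}
    result = {}
    for pair in feats.split("|"):
        # Split on the first "=" only, in case a value contains one.
        name, _, value = pair.partition("=")
        # Several values of one feature are joined by commas.
        result[name] = value.split(",")
    return result
```

For example, parse_feats("Case=Nom|Gender=Fem,Masc") returns {"Case": ["Nom"], "Gender": ["Fem", "Masc"]}. Because Interset is a superset of the Universal Features, the harmonized treebanks may contain feature names or values beyond those in the UD documentation, so downstream code should not assume a closed inventory.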