Obtaining HamleDT

HamleDT is derived from many pre-existing treebanks with varying license terms. Many are available free of charge (at least for non-commercial research), but we can only redistribute those treebanks whose licenses are the most permissive. There are several options for obtaining the rest.

Free

Currently there are 30 treebanks (of 28 languages) in HamleDT whose license terms permit us to redistribute them. This includes all treebanks taken from Universal Dependencies 1.1 (bg, da, de, el, en, es, eu, fa, fi-ftb, fi-tdt, fr, ga, he, hr, hu, id, it, sv), plus some others harmonized by us (ar, cs, et, fa, grc, la-it, la-ldt, nl, pl, pt, ro, sl, ta). This data can be downloaded directly from us [http://hdl.handle.net/11234/1-1508] in the Treex XML format and in the CoNLL-U format. For the treebanks converted by us, the Treex format also contains the original trees and the Prague-style trees. The Treex representation of the UD 1.1 treebanks and the CoNLL-U representation of all treebanks contain only the harmonized UD-style trees. By downloading the data you agree to the license terms (note that different terms apply to different portions of the data and that some data are free only for non-commercial educational and research purposes).

Easy

These treebanks cannot be redistributed by us, so you have to obtain them from their original sources before you can create their harmonized versions. Nevertheless, they are available free of charge and easily obtainable through established distribution channels, typically the websites of their providers. Getting them usually means registering online; sometimes you agree to the license implicitly by downloading the data, sometimes you have to print, sign, scan and send the license agreement. The following treebanks belong to this category: ca, de (the Tiger treebank), es (the Ancora treebank), hi, ja, tr.

Rest

We have two English treebanks. One of them comes from UD 1.1 and is in the Free group; the other is derived from the CoNLL 2007 data, which in turn is based on the Penn Treebank (PTB). PTB is distributed by the Linguistic Data Consortium (LDC). If your institution is a member of LDC, chances are that you already have access to PTB, even in its CoNLL 2007 form.

You may be able to get the remaining treebanks free of charge; however, getting them involves writing to the right people and politely asking for the data. This applies to the following treebanks: bn, ru, sk, te.

Treex

Treex is the natural-language-processing framework that was used to transform the annotation styles of the treebanks in HamleDT. Treex is open-source, so you can use it to transform the treebanks that we cannot distribute directly. It is written in Perl and Bash and has been tested most extensively on Ubuntu, although it should run on other systems as well. If you do not have Treex on your system yet, please refer to the Treex Installation Guide and follow the instructions therein. Do not skip the optional step 5; as of this writing, however, it is somewhat outdated and you will need to adjust it: the installation guide tells you to update Treex from the SVN repository, but in 2015 Treex moved to GitHub, so you need to clone the Git repository "git@github.com:ufal/treex.git" instead. To be sure that you have the Treex version that was used to create HamleDT 3.0, check out the commit dbba1b232d2eae69f43577e1764d8f8f30e9c3b9. Your Perl libraries should contain Lingua::Interset version 2.045.

HamleDT-related Makefiles are now in a separate GitHub repository, https://github.com/ufal/hamledt/. To get the version that was used to create HamleDT 3.0, check out the commit 19f47665fed00b9defe5119b557ca950384db0ba. In the following text we assume that the HAMLEDT environment variable points to the folder where you cloned the repository.
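
The two checkouts described above can be sketched as follows. This is only an illustration: the function names and the folder layout are hypothetical and not part of Treex or HamleDT, while the repository addresses and commit hashes are the ones quoted above. Checking out a commit hash leaves the working copy in a detached-HEAD state, which is fine for reproducing a release. The last helper assumes that Lingua::Interset sets the usual Perl $VERSION package variable.

```shell
# Repositories and commits quoted in the text above.
TREEX_REPO="git@github.com:ufal/treex.git"
TREEX_COMMIT="dbba1b232d2eae69f43577e1764d8f8f30e9c3b9"
HAMLEDT_REPO="https://github.com/ufal/hamledt"
HAMLEDT_COMMIT="19f47665fed00b9defe5119b557ca950384db0ba"

# Hypothetical helper: clone repository $1 into folder $2
# and check out commit $3 there.
clone_at_commit() {
    git clone "$1" "$2" && git -C "$2" checkout "$3"
}

# Hypothetical helper: fetch both repositories below folder $1 and
# point HAMLEDT at the HamleDT clone, as the following text assumes.
fetch_hamledt_tools() {
    clone_at_commit "$TREEX_REPO" "$1/treex" "$TREEX_COMMIT" &&
    clone_at_commit "$HAMLEDT_REPO" "$1/hamledt" "$HAMLEDT_COMMIT" &&
    HAMLEDT="$1/hamledt" &&
    export HAMLEDT
}

# Hypothetical helper: print the installed Lingua::Interset version
# (assumes the module defines the standard $VERSION variable).
interset_version() {
    perl -MLingua::Interset -e 'print "$Lingua::Interset::VERSION\n"'
}
```

Calling fetch_hamledt_tools "$HOME/src" and then interset_version would fetch both working copies and print the version to compare with 2.045. The Treex clone still has to be made visible to Perl as the installation guide describes.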

Suppose you have a copy of the German TIGER treebank in its CoNLL 2009 form (note that CoNLL 2009 data can also be obtained through the Linguistic Data Consortium) and you want to create its harmonized, i.e. HamleDT, version. First make sure that the required hierarchy of folders is ready:

cd $HAMLEDT/normalize/de
make dirs

Next, you will have to edit the source target in the Makefile in that folder. This target is responsible for copying the source data from the place where they currently reside on your system to the dedicated folder within the HamleDT folder hierarchy. Run make source and check the contents of data/source when it is done. Finally, you can transform the treebank to the HamleDT (Universal Dependencies) style:

make treex
make prague
make ud

Note that by default the prague and ud targets are parallelized, which makes them much faster. Parallelization assumes a Sun Grid Engine cluster. If you have one, parallel Treex might work on your system as well. If you don't, go to ../common.mak, find the definition of the QTREEX variable and make it identical to the definition of TREEX.

You can view the treebank using the TrEd tree editor that you installed together with Treex:

ttred data/treex/02/train/001.treex.gz

Inspecting the Makefiles will help you understand how everything is done. Nevertheless, you do not have to stick with the Makefiles. In fact, normalizing your own corpus in CoNLL format is pretty easy and can be done in one step (use Read::CoNLLX for CoNLL 2006 and 2007 data and Read::CoNLL2009 for CoNLL 2009 data):

treex -Lde Read::CoNLL2009 from=mycorpus.conll HamleDT::DE::Harmonize HamleDT::Udep Write::CoNLLU to=mycorpus.hamledt.conllu
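
After the conversion, it can be worth confirming that the output is well-formed at the most basic level. The sketch below relies only on the CoNLL-U convention that every token line has ten tab-separated columns, while comment lines start with # and sentences are separated by empty lines. The function name is hypothetical and the check is not part of Treex or HamleDT.

```shell
# Hypothetical helper: report every token line of the CoNLL-U file $1
# that does not have exactly ten tab-separated columns.
# Returns 0 if all token lines are fine, 1 otherwise.
check_conllu_columns() {
    awk -F'\t' '
        # Skip empty lines (sentence breaks) and comment lines.
        NF > 0 && $0 !~ /^#/ && NF != 10 {
            print FILENAME ":" NR ": " NF " columns"
            bad++
        }
        END { exit (bad > 0) }
    ' "$1"
}
```

For example, check_conllu_columns mycorpus.hamledt.conllu prints nothing and succeeds when every token line has ten columns.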

Patches

If you have your own copy of one of the non-free treebanks and you do not want to install Treex, there is yet another way for you to obtain the harmonized annotation. We provide CoNLL patches: files in the CoNLL-U format where the underlying text, lemmas and original POS tags have been removed while our harmonized annotation is retained. This will be useful especially if you have the CoNLL distribution of the original treebank, because merging your files with ours should then be straightforward. Every 1000 or so tokens the patches reveal the full line so that you can make sure that both data sources are synchronized.
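
Before merging, a quick sanity check is to compare the number of tokens on both sides. The sketch below is not part of the HamleDT distribution and its function names are hypothetical. It assumes that the patch contains exactly one token line for every token line of the original CoNLL-X file and that the patch files are gzipped and named *.conllu.gz.

```shell
# Hypothetical helper: count tokens in the CoNLL-X file $1.
# Every non-empty line is one token.
count_conllx_tokens() {
    awk 'NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Hypothetical helper: count tokens in all gzipped patch files in folder $1.
# Only lines whose first column is a plain integer are counted, which
# skips comments, empty lines and multi-word token ranges such as "1-2".
count_patch_tokens() {
    gzip -dc "$1"/*.conllu.gz |
        awk -F'\t' '$1 ~ /^[0-9]+$/ { n++ } END { print n + 0 }'
}

# Hypothetical helper: succeed only if both counts agree.
same_token_count() {
    [ "$(count_conllx_tokens "$1")" -eq "$(count_patch_tokens "$2")" ]
}
```

If same_token_count train.conll train fails, the two sources are probably not the same version of the treebank, and the fully revealed lines in the patch can help locate where they diverge.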

We provide a script that combines the original CoNLL-X files with the corresponding patch. The script takes just one original file (e.g. train.conll) and an entire folder of gzipped patch files (e.g. train/*.conllu.gz), because HamleDT splits long documents into numbered files (001.conllu.gz, 002.conllu.gz etc.). The script is applied as follows:

perl apply_conll_patch.pl $CONLL/2006/sl/test.conll $HAMLEDT/sl/conllu/test > patched-sl-test.conllu

Patches for non-free treebanks are included in one package together with the full data of the free treebanks (see the download link above). The script is also included.

For treebanks with an unclear ordering of files and/or with other file formats, using patches will be trickier. Contact us if you need help. At any rate, CoNLL patches are experimental and we intend to improve their support in the future.