NEW: HamleDT 3.0
The new version of HamleDT was released on 2015-08-18.
What's New in this Release?
- Universal Dependencies. HamleDT has switched to this new annotation style as its core harmonized style (in fact, we have been heavily involved in formulating the UD standard). In the syntactic part, UD is close to the Stanford dependencies that we first used in HamleDT 2.0. In the morphological part, UD uses morphological features from Interset, which have been used in HamleDT since the beginning. See http://universaldependencies.github.io/docs/ for more details.
- New languages and treebanks. HamleDT 3.0 is a superset of the Universal Dependencies treebanks 1.1 released in May 2015. UD 1.1 has 19 treebanks of 18 languages, all with licenses that permit redistribution. Whenever there is a UD 1.1 dataset from a treebank previously covered by HamleDT, we now replace our own transformation with the official UD 1.1 data (even if it is smaller). We add 18 other languages, 10 of them with redistributable licenses. Compared to HamleDT 2.0, there are five new languages coming from UD 1.1 (Croatian, French, Hebrew, Indonesian, Irish) and one new language introduced in HamleDT (Polish). Some treebanks that were not freely redistributable in HamleDT 2.0 are freely redistributable in 3.0, thanks to the generosity of the teams that produced the original data (e.g. Basque, Bulgarian, Greek and Hungarian). Several languages now have two different treebanks (although sometimes only one of them has a free license): English, Finnish, German, Latin, Persian, Spanish. In total, there are now 42 treebanks of 36 languages, and 28 languages have at least one redistributable treebank.
- Search HamleDT online. Visit the PML-TQ search interface and in the list of treebanks, look for HamleDT and Universal Dependencies. (Note that some of the treebanks are accessible only if you are logged in.)
What's Next?
- We have received requests for the Prague-style harmonization, which is not present in HamleDT 3.0. We will probably honor these requests by releasing a standalone package with the Prague flavor (PrahamleDT?). This is not a matter of a few days, as we have to convert the UD 1.1 treebanks to the Prague style (whereas the remaining treebanks were converted to UD through Prague). But it will hopefully happen before the end of 2015.
- We are constantly working on the harmonization scenario, removing bugs and normalizing newly found phenomena. So far the treebanks that we did not take from UD 1.1 are converted through the Prague style, which causes loss of information. Direct transformations from the original styles to the UD style are being prepared.
- We will replace the CoNLL 2006 Slovenian data with the new Slovene reference treebank (SSJ), which is much larger.
- We will replace the CoNLL-converted Bengali and Telugu data with data in the native Shakti Standard Format. The CoNLL version contains only inter-chunk dependencies, while we also want to add intra-chunk dependencies and to make all tokens of the sentence visible and accessible in the tree.
- We will probably further improve Arabic, as the source treebank moves towards PADT 2.0.
- We will update the import filter for Syntagrus (Russian) so that punctuation symbols become regular tree nodes.
- Quite a few new languages and treebanks are waiting in the queue to be added to HamleDT, many of them under a free license.
- The Universal Dependencies project plans its next release for November 2015. As long as the new data have compatible licenses, we will incorporate them in HamleDT. There are also new languages in the pipeline.
Archive: What Was New in HamleDT 2.0?
HamleDT 2.0 was released on 2014-05-24.
- Stanford dependencies. HamleDT now provides two different normalizations of the member treebanks: besides its main and native Prague annotation style, users can also opt for the “Stanford style”. Both styles are quite popular and widely used de-facto standards. HamleDT / Prague is close to the style of at least ten different treebanks, most notably the (Czech) Prague Dependency Treebank. HamleDT / Stanford is close to the Universal Stanford Dependencies, proposed by de Marneffe et al. (LREC 2014; this version is slightly different from the older Stanford Dependencies, used e.g. in the Google Universal Dependency Treebanks).
- New language. We added a large treebank of Slovak [sk].
- And: Estonian [et] is not exactly new but we moved it to the free section of HamleDT because we are not aware of any restrictions affecting redistributability.
- New, better and larger data for old languages.
  - Arabic [ar]: We now use PADT 1.5 instead of CoNLL 2007. The new data is substantially larger and we also got rid of a major flaw of the CoNLL data, caused by the conversion to the CoNLL format: the distinction between conjuncts and true dependents was missing in CoNLL.
  - Czech [cs]: We use the newly released PDT 3.0. It is larger than CoNLL 2007 and it is distributed under less restrictive license terms. CoNLL 2007 is its subset, so the old data is not needed any more, except for comparability of experiments.
  - English [en]: In HamleDT 1.0, we worked with the CoNLL 2009 data; later we switched to the CoNLL 2007 data. Both are derived from the Penn Treebank but they used different procedures to convert constituents to dependencies. We find the CoNLL 2007 conversion better.
  - Hindi [hi]: The Hyderabad Dependency Treebank is growing. In HamleDT 1.0, we worked with the sample released for the ICON 2010 shared task. Now we work with the larger sample released for the shared task at the MTPIL workshop at COLING 2012.
- Numerous bugfixes. Most notably, it should never again happen that there are no conjuncts among children of a Coord node, or that there are conjuncts whose parent is not Coord.
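
The two coordination invariants named in the bugfix item can be checked mechanically. The following is only a minimal sketch and not the actual HamleDT code: the Node class, its field names and the check_coordination function are hypothetical, and the sketch assumes that conjuncts are marked by an is_member flag on the child node, as is usual in the Prague annotation style.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical minimal tree node; not the HamleDT data model."""
    ord: int                 # 1-based position of the token in the sentence
    parent: int              # ord of the parent node, 0 for the artificial root
    afun: str                # analytical function, e.g. "Coord", "Sb", "Pred"
    is_member: bool = False  # assumption: True marks a conjunct


def check_coordination(nodes):
    """Return a list of violations of the two coordination invariants."""
    by_ord = {n.ord: n for n in nodes}
    problems = []
    coords_with_conjunct = set()

    # Invariant 1: the parent of every conjunct must be a Coord node.
    for n in nodes:
        if not n.is_member:
            continue
        parent = by_ord.get(n.parent)
        if parent is None or parent.afun != "Coord":
            problems.append(f"node {n.ord}: conjunct whose parent is not Coord")
        else:
            coords_with_conjunct.add(parent.ord)

    # Invariant 2: every Coord node must have at least one conjunct child.
    for n in nodes:
        if n.afun == "Coord" and n.ord not in coords_with_conjunct:
            problems.append(f"node {n.ord}: Coord node without conjuncts")

    return problems


# "cats and dogs sleep": the conjunction heads the coordination.
good = [
    Node(1, 2, "Sb", is_member=True),   # cats
    Node(2, 4, "Coord"),                # and
    Node(3, 2, "Sb", is_member=True),   # dogs
    Node(4, 0, "Pred"),                 # sleep
]

# The same tree with the conjunct flags lost.
bad = [
    Node(1, 2, "Sb"),
    Node(2, 4, "Coord"),
    Node(3, 2, "Sb"),
    Node(4, 0, "Pred"),
]

print(check_coordination(good))
print(check_coordination(bad))
```

Running such a check over every sentence after harmonization would flag both kinds of inconsistency that the bugfix item describes.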