Parsing HamleDT

Preliminary parsing results on HamleDT data, for both the original and the harmonized (Prague) annotation. The table reports unlabeled attachment score (UAS) and, for the Prague annotation, labeled attachment score (LAS). We used MaltParser (http://maltparser.org/) with the stack-lazy algorithm and a feature definition file that draws on the complete gold-standard morphological information (lemmas, parts of speech, and values of individual morphosyntactic features).
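As a reminder of what the two metrics measure, here is a minimal Python sketch. It is not the evaluation script used for these experiments: the function name `attachment_scores`, the `(head, deprel)` token representation, and the toy labels are assumptions made for illustration. It also counts every token, whereas some evaluation setups exclude punctuation; the text above does not say which convention was used.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as fractions between 0 and 1.

    Both arguments are parallel lists with one (head_index, deprel) pair
    per token. This layout is an assumption of the sketch.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty token list")

    # UAS: the predicted head matches the gold head.
    head_hits = sum(1 for (g_head, _), (p_head, _) in zip(gold, predicted)
                    if g_head == p_head)
    # LAS: both the head and the dependency label match.
    labeled_hits = sum(1 for g, p in zip(gold, predicted) if g == p)

    total = len(gold)
    return head_hits / total, labeled_hits / total


# Toy four-token sentence; the labels are only illustrative.
gold = [(2, "Sb"), (0, "Pred"), (2, "Obj"), (3, "Atr")]
pred = [(2, "Sb"), (0, "Pred"), (2, "Adv"), (2, "Atr")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2%}, LAS = {las:.2%}")
```

In the toy sentence three of four heads are correct but only two tokens also carry the correct label, so LAS is lower than UAS; LAS can never exceed UAS.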

To reduce time requirements and increase comparability, we first ran a short experiment in which the training data was limited to the first 5000 sentences. Treebanks with fewer than 5000 training sentences used their entire training set. The following table also contains the results of the long experiment, in which the full training data was used for all treebanks. Most models were trained within a couple of days, but training the model for the original Czech treebank took more than one week.
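Limiting the training data to the first 5000 sentences could be done along the following lines. This is only a sketch: it assumes the training data is in a CoNLL-style text format with one token per line and a blank line between sentences, and the helper names `read_sentences` and `first_n_sentences` are hypothetical, not part of any HamleDT tooling.

```python
from itertools import islice


def read_sentences(lines):
    """Yield sentences as lists of token lines from blank-line separated input."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            sentence.append(line)
        elif sentence:
            # A blank line closes the current sentence.
            yield sentence
            sentence = []
    if sentence:
        # The last sentence may lack a trailing blank line.
        yield sentence


def first_n_sentences(lines, limit=5000):
    """Return at most `limit` sentences, keeping their original order."""
    return list(islice(read_sentences(lines), limit))


# Tiny made-up sample with three sentences (token lines are abbreviated).
sample = [
    "1\tA\t_\n", "2\tB\t_\n", "\n",
    "1\tC\t_\n", "\n",
    "1\tD\t_\n", "2\tE\t_\n",
]

print(len(first_n_sentences(sample, limit=2)))
print(len(first_n_sentences(sample)))
```

A treebank with fewer sentences than the limit is returned in full, which matches the treatment of the smaller treebanks described above.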

                  UAS, trained on max 5000 sentences  UAS, full training data   LAS
Code  TCode       Original  Prague    P>O             Original  Prague          Prague
ar    padtr349    78,91%    79,44%    1               79,70%    80,37%          72,24%
bg    conll2006   83,59%    89,83%    1               84,50%    90,92%          83,09%
bn    icon2010    86,83%    80,30%    0               86,83%    80,30%          60,84%
ca    conll2009   84,74%    88,37%    1               84,74%    89,71%          84,62%
cs    pdt30       78,36%    78,77%    1               86,35%    86,71%          82,05%
da    conll2006   88,01%    87,72%    0               88,93%    87,97%          80,59%
de    conll2009   79,55%    84,28%    1               81,59%    88,42%          84,44%
el    conll2007   81,92%    82,54%    1               81,92%    82,54%          76,59%
en    conll2007   84,25%    85,47%    1               86,89%    88,17%          85,29%
es    conll2009   82,19%    88,07%    1               90,40%    89,76%          85,01%
et    puudepank   91,32%    88,92%    0               91,32%    88,92%          86,30%
eu    bdt         74,63%    78,52%    1               75,92%    80,72%          74,30%
fa    perdt       84,80%    82,27%    0               86,98%    84,10%          75,37%
fi    turku       77,81%    80,28%    1               77,81%    80,28%          75,84%
grc   agdt        63,41%    62,81%    0               63,12%    62,91%          54,73%
hi    hydt05      94,52%    92,87%    0               95,12%    93,99%          90,16%
hu    conll2007   70,80%    80,94%    1               74,23%    81,46%          79,43%
it    conll2007   85,56%    83,11%    0               85,56%    83,11%          78,24%
ja    conll2007   78,43%    88,37%    1               80,22%    90,15%          73,37%
la    ldt         52,60%    53,04%    1               52,60%    53,04%          45,32%
nl    conll2006   77,89%    77,37%    0               82,58%    81,44%          73,97%
pt    conll2006   78,43%    85,86%    1               77,83%    86,74%          81,90%
ro    rodt        81,52%    84,21%    1               81,52%    84,21%          78,00%
ru    syntagrus   86,70%    82,02%    0               89,59%    85,43%          77,39%
sk    sta1        74,76%    76,36%    1               80,73%    82,24%          75,35%
sl    conll2006   80,77%    81,95%    1               80,77%    81,95%          75,00%
sv    conll2006   82,98%    81,23%    0               86,99%    84,98%          78,89%
ta    tamiltb     77,58%    77,38%    0               77,58%    77,38%          68,43%
te    icon2010    92,30%    90,29%    0               92,30%    90,29%          71,19%
tr    conll2007   85,69%    82,43%    0               84,76%    81,57%          75,99%

P>O is 1 where the Prague UAS exceeds the Original UAS in the 5000-sentence experiment; this is the case for 57% of the treebanks.
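The 57% figure that closes the table can be re-derived from the two UAS columns of the 5000-sentence experiment. The sketch below copies those two columns from the table, parses the decimal-comma percentages, and counts the treebanks where the Prague score is higher. The names `ROWS`, `parse_percent`, and `prague_wins` are introduced here for illustration only.

```python
# Code, Original UAS, Prague UAS (5000-sentence experiment), copied from the table.
ROWS = """
ar 78,91% 79,44%
bg 83,59% 89,83%
bn 86,83% 80,30%
ca 84,74% 88,37%
cs 78,36% 78,77%
da 88,01% 87,72%
de 79,55% 84,28%
el 81,92% 82,54%
en 84,25% 85,47%
es 82,19% 88,07%
et 91,32% 88,92%
eu 74,63% 78,52%
fa 84,80% 82,27%
fi 77,81% 80,28%
grc 63,41% 62,81%
hi 94,52% 92,87%
hu 70,80% 80,94%
it 85,56% 83,11%
ja 78,43% 88,37%
la 52,60% 53,04%
nl 77,89% 77,37%
pt 78,43% 85,86%
ro 81,52% 84,21%
ru 86,70% 82,02%
sk 74,76% 76,36%
sl 80,77% 81,95%
sv 82,98% 81,23%
ta 77,58% 77,38%
te 92,30% 90,29%
tr 85,69% 82,43%
"""


def parse_percent(text):
    """Convert a string such as '78,91%' (decimal comma) to the float 78.91."""
    return float(text.rstrip("%").replace(",", "."))


scores = {}
for line in ROWS.strip().splitlines():
    code, original, prague = line.split()
    scores[code] = (parse_percent(original), parse_percent(prague))

# P>O is 1 when the Prague annotation yields the higher UAS.
prague_wins = [code for code, (orig, prag) in scores.items() if prag > orig]
share = len(prague_wins) / len(scores)

print(f"{len(prague_wins)} of {len(scores)} treebanks: {share:.0%}")
```

The Prague annotation scores higher for 17 of the 30 treebanks, which rounds to the 57% reported in the table.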