Friday, May 2, 2008

A Word on the Million Words

Work on the new PADT 2.0 is now in progress. The recent developments are described in our submission to the LREC 2008 Workshop on Arabic & Local Languages:

Prague Arabic Dependency Treebank: A Word on the Million Words
[paper]

According to the paper, PADT 2.0 is expected to include the following annotations:

PADT 2.0  Corpus     Fun. Morphology  Dep. Syntax  Tectogrammatics  Notes
Total     1,095,610  1,281,858        1,001,908    30,894           merged annotations
Prague    328,240    383,482          282,252      30,894           original annotations
Penn      767,370    898,376          719,656                       converted annotations

Prague    Corpus     Fun. Morphology  Dep. Syntax  Tectogrammatics  Notes
AEP       99,360     116,717          116,717      9,690            Arabic English Parallel News
EAT       48,371     55,097           55,097       13,934           English-Arabic Treebank
ASB       11,881     14,254           14,254                        Arabic Gigaword
NHR       21,445     25,329           12,613                        Arabic Gigaword
HYT       85,683     100,537          41,855       5,228            Arabic Gigaword
XIN       61,500     71,548           41,716       2,042            Arabic Gigaword

Penn      Corpus     Fun. Morphology  Dep. Syntax  Tectogrammatics  Notes
1v3       151,546    172,386          172,386                       Penn Arabic Treebank 1v3
2v2       141,515    161,217          161,217                       Penn Arabic Treebank 2v2
3v2       335,250    394,466          394,466                       Penn Arabic Treebank 3v2
4v1       149,784    178,720                                        Penn Arabic Treebank 4v1
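
For readers who want to cross-check the figures, here is a minimal Python sketch that re-adds some of the rows. It covers only the Prague breakdown and the top-level merge of Prague and Penn. The variable names and the `column_sums` helper are made up for this illustration, and blank table cells are entered as 0, which is an assumption about what an empty cell means.

```python
# Counts copied from the table, in column order:
# (Corpus, Fun. Morphology, Dep. Syntax, Tectogrammatics).
# Blank cells are entered as 0 (an assumption made for this sketch).
TOTAL = (1_095_610, 1_281_858, 1_001_908, 30_894)
PRAGUE = (328_240, 383_482, 282_252, 30_894)
PENN = (767_370, 898_376, 719_656, 0)

PRAGUE_PARTS = {
    "AEP": (99_360, 116_717, 116_717, 9_690),
    "EAT": (48_371, 55_097, 55_097, 13_934),
    "ASB": (11_881, 14_254, 14_254, 0),
    "NHR": (21_445, 25_329, 12_613, 0),
    "HYT": (85_683, 100_537, 41_855, 5_228),
    "XIN": (61_500, 71_548, 41_716, 2_042),
}


def column_sums(rows):
    """Add up each column across an iterable of equal-length tuples."""
    return tuple(sum(col) for col in zip(*rows))


# The six Prague sub-corpora should add up to the Prague row.
print(column_sums(PRAGUE_PARTS.values()) == PRAGUE)  # True

# The Prague and Penn rows should add up to the Total row.
print(column_sums([PRAGUE, PENN]) == TOTAL)  # True
```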

Your suggestions and comments are very welcome. Thank you.

