#
######################################################################## 2004/10/22
#
# Prague Arabic Dependency Treebank 1.0
#############################################

Title:        Prague Arabic Dependency Treebank 1.0

Authors:      Jan Hajic, Otakar Smrz, Petr Zemanek, Petr Pajas,
              Jan Snaidauf, Emanuel Beska, Jakub Kracmar, Kamila Hassanova

Annotators:   Ondrej Beranek, Viktor Bielicky, Simona Hlavacova,
              Marketa Husinecka, Emira Klementova, Monika Kolbova,
              Alena Pejcharova, Martin Spata, Pavel Tupek

Consultants:  Ivona Kucerova, Jarmila Panevova, Jan Stepanek,
              Zdenek Zabokrtsky

Support:      Jiri Mirovsky, Roman Ondruska, Jiri Hana

Coordinator:  Otakar Smrz

Website:      http://ufal.mff.cuni.cz/padt/

E-mail:       padt@ufal.mff.cuni.cz

Address:      Institute of Formal and Applied Linguistics &
              Center for Computational Linguistics
              Faculty of Mathematics and Physics
              Charles University in Prague
              Malostranske nam. 25
              118 00 Praha 1
              Czech Republic

              http://ufal.mff.cuni.cz/

              phone: +420 221 914 306
              fax:   +420 221 914 304

Data Type:    text

Data Sources: newswire

Project:      Prague Arabic Dependency Treebank

Applications: cross-lingual information retrieval, language modeling,
              machine translation, natural language processing,
              parsing, tagging

Languages:    Arabic, English

License:      http://ufal.mff.cuni.cz/corp-lic/padt10-reg.html

Funding:      Ministry of Education of the Czech Republic
              (LN00A063, MSM113200006),
              Grant Agency of the Czech Republic (405/02/0823)

Copyright:    Portions
              (C) 2002-2004 Trustees of the University of Pennsylvania,
              (C) 2000 Agence France Presse,
              (C) 2001 Al Hayat News Agency,
              (C) 2002 Ummah Press Service,
              (C) 2002 An Nahar News Agency,
              (C) 2003 Xinhua News Agency,
              (C) 2002-2004 Center for Computational Linguistics &
                  Institute of Formal and Applied Linguistics &
                  Institute of Comparative Linguistics,
                  Charles University in Prague

Data Size:    148 000 tokens of data annotated as MorphoTrees
              113 500 tokens of analytically annotated data


INTRODUCTION

Prague Arabic Dependency
Treebank (PADT) not only consists of multi-level linguistic annotations
of Modern Standard Arabic, but also provides a variety of unique software
implementations designed for general use in Natural Language Processing.

The PADT project can be summarized as an open-ended activity of the
Center for Computational Linguistics, the Institute of Formal and Applied
Linguistics, and the Institute of Comparative Linguistics, Charles
University in Prague. It consists in multi-level annotation of Arabic
language resources in the light of the theory of Functional Generative
Description. The project is a younger sibling of the Prague Dependency
Treebank for Czech.


DOCUMENTATION

index.html
data/index.html
data/ArabicPDT.pl
data/comment_syntax.dat
data/comment_morpho.dat
data/ALH/count_syntax.dat
data/ALH/count_morpho.dat
data/ANN/count_syntax.dat
data/ANN/count_morpho.dat
data/XIA/count_syntax.dat
data/XIA/count_morpho.dat
data/AFP/count_syntax.dat
data/UMH/count_syntax.dat
data/XIN/count_syntax.dat
docs/README.txt
docs/index.html
docs/PADT_body.html
docs/PADT_menu.html
docs/PADT_license.html
docs/PADT-logo.gif
docs/morpho_AfrAd.gif
docs/morpho_AmA.gif
docs/morpho_fhm.gif
docs/morpho_view.gif
docs/syntax_pred.gif
docs/syntax_elps.gif
docs/syntax_view.gif
docs/guides/PADT_Analytical.pdf
docs/guides/PADT_Analytical.ps
docs/guides/PDT_Analytical-en.link.html
docs/guides/PDT_Analytical-cz.link.html
docs/papers/2002-flm-extra.pdf
docs/papers/2002-flm-extra.ps
docs/papers/2002-flm-padt.pdf
docs/papers/2002-flm-padt.ps.zip
docs/papers/2002-pbml-sherds.pdf
docs/papers/2002-pbml-sherds.ps
docs/papers/2003-eacl-trees.pdf
docs/papers/2003-eacl-trees.ps
docs/papers/2004-nemlar-padt.pdf
docs/papers/2004-nemlar-padt.ps
docs/papers/2004-nemlar-tred.pdf
docs/papers/2004-nemlar-tred.ps
docs/slides/2003-eacl-trees.pps
docs/slides/2003-eacl-trees.ppt
docs/slides/2004-nemlar-padt.pps
docs/slides/2004-nemlar-padt.ppt
docs/slides/2004-nemlar-tred.pps
docs/slides/2004-nemlar-tred.ppt
tools/Encode-Arabic/html/
tools/PADT-modules/index.html
tools/PADT-scripts/index.html
tools/TrEd/tred/documentation/HTML/
cvs/CVS.link.html


DATA

The Prague Arabic Dependency Treebank 1.0 data are in the FS format,
suitable for TrEd and Netgraph, and are encoded in UTF-8. Conversion tools
for both the format and the encoding are included in this distribution.

In addition to the data files proper, located in /data/, the distribution
includes the CVS repository recording the history of the data, in /cvs/.

data/ALH/syntax/
data/ALH/morpho/
data/ALH/corpus/
data/ANN/syntax/
data/ANN/morpho/
data/ANN/corpus/
data/XIA/syntax/
data/XIA/morpho/
data/XIA/corpus/
data/AFP/syntax/
data/UMH/syntax/
data/XIN/syntax/
cvs/data/ALH/
cvs/data/ANN/
cvs/data/XIA/
cvs/data/AFP/
cvs/data/UMH/
cvs/data/XIN/


TOOLS

For information on how to install and use the tools, please refer to:

tools/TrEd/
tools/TrEd.link.html
tools/PADT-scripts/
tools/PADT-modules/
tools/Netgraph.link.html
tools/Encode-Arabic/
tools/Encode-Arabic.link.html
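

EXAMPLE

As a rough illustration of the directory layout listed under DATA, the
following Python sketch checks that an unpacked copy of the distribution
contains the expected data directories, and reads a file as UTF-8. It is
not part of the distribution. The names LAYOUT, expected_dirs,
missing_dirs and read_text_utf8 are hypothetical, and the location of the
distribution root is an assumption supplied by the caller. The sketch does
not parse the FS format itself; TrEd and the bundled tools remain the
reference for that.

```python
from pathlib import Path

# Annotation levels available per source, transcribed from the DATA listing.
# (Hypothetical helper table, not shipped with the distribution.)
LAYOUT = {
    "ALH": ("syntax", "morpho", "corpus"),
    "ANN": ("syntax", "morpho", "corpus"),
    "XIA": ("syntax", "morpho", "corpus"),
    "AFP": ("syntax",),
    "UMH": ("syntax",),
    "XIN": ("syntax",),
}


def expected_dirs(root):
    """Return the data directories the README lists, relative to root."""
    root = Path(root)
    return [
        root / "data" / source / level
        for source, levels in LAYOUT.items()
        for level in levels
    ]


def missing_dirs(root):
    """Return the listed data directories that do not exist under root."""
    return [d for d in expected_dirs(root) if not d.is_dir()]


def read_text_utf8(path):
    """Read a data file as text, using the UTF-8 encoding the README states."""
    return Path(path).read_text(encoding="utf-8")
```

Calling missing_dirs with the path of the unpacked distribution returns an
empty list when all twelve listed data directories are present.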