Chapter 1. Introduction

Table of Contents

1.1. What is PDT 2.0
1.2. Historical background of the project
1.3. Development of the project
1.4. About Czech
1.5. Directory structure

This guide introduces the Prague Dependency Treebank, version 2.0 (PDT 2.0). It helps you quickly become familiar with the basic ideas and contents of PDT 2.0. It provides an overview of the data and tools, together with links to more extensive documentation, including tutorials, formal specifications and further references. This document exists in both an HTML and a PDF version.

The website of PDT 2.0 is http://ufal.mff.cuni.cz/pdt2.0. You can also view the web page http://ufal.mff.cuni.cz/pdt2.0update, where possible corrections of the data, improved versions of the tools etc. will be published.

1.1. What is PDT 2.0

The Prague Dependency Treebank (PDT) is an open-ended project for manual annotation of a substantial amount of Czech-language data with linguistically rich information ranging from morphology through syntax and semantics/pragmatics and beyond.

PDT version 2.0 is a sequel to version 1.0, which contains manual annotation of morphology and (surface) syntax (see http://ufal.mff.cuni.cz/pdt/ or the web page of the Linguistic Data Consortium (LDC), http://www.ldc.upenn.edu, Catalog No. LDC2001T10). Version 2.0 adds underlying syntax and semantics, topic/focus, coreference, and lexical semantics based on a valency dictionary to the surface syntax and morphology that were at the core of version 1.0. The corrections of version 1.0 are also included in version 2.0, and the old data format is preserved for those who have already invested in its use.

The annotation in PDT 2.0 covers a large amount of Czech text with interlinked morphological (2 million words), syntactic (1.5 million words) and complex underlying syntactic and semantic annotation (0.8 million words). The corpus itself now uses the latest annotation technology: standoff annotation using XML and RelaxNG (see Section 3.4, "Data formats" and the whole Chapter 3, Data).

PDT 2.0 is based on the long-standing Praguian linguistic tradition and adapted for the current Computational Linguistics research needs (see also Section 1.2, "Historical background of the project"). Software tools for corpus search, annotation and language analysis are included. Extensive documentation (in Czech and English) is provided as well.

This version of PDT concludes a 10-year period of development at the Institute of Formal and Applied Linguistics (ÚFAL) and its Center for Computational Linguistics (see Section 1.3, "Development of the project"). Recently, the project has been complemented by the publication of the Prague Arabic Dependency Treebank (http://www.ldc.upenn.edu, Catalog No. LDC2004T23) and a parallel Prague Czech-English Dependency Treebank (http://www.ldc.upenn.edu, Catalog No. LDC2004T25). The former demonstrates that the Czech specifications can be adapted to a typologically different language. The latter builds on the manual annotation of the Penn Treebank corpus and is geared towards Machine Translation experiments between the two languages.

PDT 2.0 has had two purposes:

  • first, to map the theoretical achievements of the Prague Linguistic School to real language data, and thus explicitly test and preserve the theory of the dependency-based Functional Generative Description (FGD) (see also Section 1.2, "Historical background of the project") not only "on paper", but applied to a very large number of real "examples";

  • second, to allow for machine learning methods to be applied to yield automatic analysis and generation tools with reasonable accuracy.

Whereas the first purpose could have been served by choosing only a few examples for each linguistic phenomenon, the second one definitely needs a large number of naturally occurring sequences of sentences. The statistics obtained from them can also be used to distinct advantage in linguistic research.

The future of PDT is not completely determined at this point. There are several future directions under consideration (funding permitting, of course): adding spoken data; adding a deeper and broader annotation especially for coreference, information structure and/or discourse; annotation of another (very different) language; manual annotation of Czech/English and other parallel texts using the same (tectogrammatical) representation; and adding another layer (contents-based knowledge representation).

1.2. Historical background of the project

The Prague School of Functional and Structural Linguistics is distinguished from other European schools of linguistic structuralism by, among other things, its openness to new trends and ideas. The history of the School formally dates back to 1926, when the Prague Linguistic Circle was founded by such prominent linguists as Vilém Mathesius, Roman Jakobson, and Bohumil Trnka. The research opened up several directions; phonology was perhaps the first domain to win high international recognition. Original contributions soon followed, also well received internationally, to language typology, word-formation, the functional stratification of language, and such general linguistic issues as the distinction between core and periphery in the language system. Last but not least, there were attempts at a systematic account of the information structure of the sentence (functional sentence perspective, topic-focus articulation).

The Prague Linguistic Circle did not restrict its activities geographically; several linguists abroad openly avowed the Circle's tenets and worked in their spirit. One of them was Lucien Tesnière, a French linguist, who can rightly be called "the Father of dependency syntax". Tesnière's approach found a very positive response outside the Circle as well, especially in the work of the Czech syntactician Vladimír Šmilauer, whose Novočeská skladba (Syntax of Modern Czech, 1947) is an indispensable source of information for all those who study Czech syntax.

The Prague School inspiration has also found a continuation in the new linguistic paradigm of explicit description of language, namely in the Functional Generative Description (FGD), proposed by Petr Sgall in the 1960s and elaborated since then by him and his collaborators (for the most comprehensive treatment, see the book The Meaning of the Sentence in Its Semantic and Pragmatic Aspects, 1986). The FGD framework has three important distinctive features:

  • inclusion of an underlying syntactic layer (tectogrammatics) into the linguistic description;

  • use of dependency syntax;

  • a specification of a formal account of the information structure (topic-focus articulation) of the sentence and its integration into the description.

1.3. Development of the project

The project started, in fact, in the lobby of a small hotel in Dublin, Ireland, at the end of March 1995, during the 7th conference of the European Chapter of the ACL. A small group of us decided to pursue a project similar to the English Penn Treebank project, which had come out not long before, but based on the Praguian dependency tradition, with full morphological analysis and with the perspective of gradual enrichment of the annotation (for more on the project context, see Section 1.2, "Historical background of the project").

Funding had to be secured first; we were lucky to get two grants from the Czech Grant Agency and one Ministry of Education project simultaneously, starting in 1996: one smaller grant to write the specification of the treebank, one multi-institutional project to support the Czech National Corpus (our source of raw texts), and finally, a project called the "Linguistic Data Lab" to get the annotation itself going.

The specification called for a three-layer annotation scenario, with morphological, analytical and tectogrammatical layers of annotation. Except for the morphological layer, which was designed to use the existing Czech tagset, the annotation guidelines were only sketchy, with the understanding that they would be developed in parallel with the annotation as new phenomena and problems were discovered. Nevertheless, some basic principles were taken as "unbreakable" rules:

  • morphological annotation will be applied to individual tokens; no attempt will be made to analyze e.g. complex verb forms,

  • the tagset used in the existing morphological dictionary for Czech, developed at ÚFAL, will be used directly for annotation,

  • the unit of the surface-syntactic annotation (the analytical layer) will also be a token, with a 1:1 correspondence to the morphological layer units; no "traces", substitutes for ellipsis or anything similar will be inserted into the annotation,

  • dependency-style annotation will be used not only on the underlying syntactic layer (the tectogrammatical one), but also on the analytical layer,

  • the tectogrammatical annotation will include everything the theory has to offer, i.e. topic/focus, coreference, and other detailed annotation; "inserting" and "deleting" nodes (with respect to the lower layers) will be allowed to match the theory and the desired purpose of the underlying representation,

  • valency will be taken into account when determining the function of the dependents of a verb (noun, adjective).
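The relationship between the three layers described in the rules above can be illustrated with a small sketch. It is not the PDT data format: the class names, field names, tag values and function labels below are all hypothetical and chosen only for illustration. The sketch shows a 1:1 check between morphological tokens and analytical nodes, and a tectogrammatical layer in which some nodes have no token ("inserted") and some tokens have no node ("deleted").

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MToken:
    """Morphological layer unit: one token with a lemma and a tag."""
    id: str
    form: str
    lemma: str
    tag: str  # placeholder value, not the real Czech tagset


@dataclass
class ANode:
    """Analytical layer unit: exactly one node per token."""
    token_id: str
    parent: Optional[str]  # token_id of the governing node; None for the root
    function: str  # hypothetical dependency-function label


@dataclass
class TNode:
    """Tectogrammatical layer unit: may refer to zero or more tokens."""
    id: str
    parent: Optional[str]
    token_ids: List[str] = field(default_factory=list)  # empty if "inserted"


def is_one_to_one(tokens, a_nodes):
    """True if every token has exactly one analytical node and vice versa."""
    token_ids = [t.id for t in tokens]
    node_ids = [n.token_id for n in a_nodes]
    no_duplicates = len(node_ids) == len(set(node_ids))
    return no_duplicates and sorted(token_ids) == sorted(node_ids)


def inserted_nodes(t_nodes):
    """Tectogrammatical nodes that have no counterpart on the lower layers."""
    return [n.id for n in t_nodes if not n.token_ids]


def deleted_tokens(tokens, t_nodes):
    """Tokens that no tectogrammatical node refers to."""
    covered = {tid for n in t_nodes for tid in n.token_ids}
    return [t.id for t in tokens if t.id not in covered]


# Invented three-token example; all labels are illustrative only.
tokens = [
    MToken("w1", "Přišel", "přijít", "VERB"),
    MToken("w2", "domů", "domů", "ADV"),
    MToken("w3", ".", ".", "PUNCT"),
]
a_nodes = [
    ANode("w1", None, "predicate"),
    ANode("w2", "w1", "adverbial"),
    ANode("w3", "w1", "punctuation"),
]
t_nodes = [
    TNode("t1", None, ["w1"]),
    TNode("t2", "t1", []),  # node with no surface token
    TNode("t3", "t1", ["w2"]),
]

print(is_one_to_one(tokens, a_nodes))
print(inserted_nodes(t_nodes))
print(deleted_tokens(tokens, t_nodes))
```

Which nodes are actually inserted or omitted at the tectogrammatical layer is defined by the annotation guidelines, not by this sketch; the example only shows why a 1:1 check makes sense between the two lower layers but not for the third.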

Moreover, some further decisions were made. The data markup format was designed as an extension of the proprietary SGML format called CSTS used in the Czech National Corpus. Then, the organization of the annotation had to be determined: we started annotation of the lower two layers simultaneously (morphology and analytical syntax); the tectogrammatical layer annotation was postponed until the first two layers were finished. Furthermore, tools were developed so that the annotation could proceed. Among them, Graph, the graphical tree editor, used our proprietary annotation format (called FS), a non-SGML but quite general and space-saving one.

The annotation of the morphological and analytical layers was performed mainly by students with a linguistic background. The lack of complete guidelines at the analytical layer required weekly meetings of the annotators, where problems were discussed and solutions immediately applied to the annotation process. Later, a dedicated editor was chosen from among the annotators, and technical issues required two more annotators to stop annotating and take over the technical work.

The morphological annotation was performed twice, followed by the usual adjudication phase (by a single person to ensure high consistency). The annotators chose among the possible lemmas and tags offered by the Czech morphological dictionary, without any automatic pre-tagging or any other kind of tag preference. Almost 2 million tokens were annotated manually at the morphological layer.

The analytical-layer annotation was performed only once, but with an extensive set of automatic consistency checks that included cross-layer annotation checking. At the beginning, no automatic pre-processing took place; later, hand-written code was used to pre-assign the dependency functions. In 1998, a pre-release called PDT 0.5 was put together (containing about 380k annotated words) for the summer JHU Language Engineering Workshop in Baltimore, MD, U.S.A., where the first Czech parser was developed (by converting the data for a slightly adapted version of Collins' lexicalized English parser). Since 1999, the data for annotation were preprocessed by this parser and presented to the annotators for correction only, which sped up annotation by approximately 30%. Over 1.5 million tokens were manually annotated at the analytical layer, matching the Penn Treebank in size.

Merging the two layers of annotation, a non-trivial task, took over a year. It included extensive consistency checking of the data, final editing of the guidelines (and their translation into English), and finally preparing the CD-ROM for publication in 2001 as the Prague Dependency Treebank, version 1.0. During the checking phase, a new platform-independent editing tool, TrEd, was developed.

The tectogrammatical layer annotation (using TrEd) started in 2000 with the establishment of the Center for Computational Linguistics, after the original funding expired. It was originally thought to be too difficult to carry out in full on all of the planned data (about 50k sentences, a subset of the PDT 1.0 data). The annotation was divided into four areas:

  • dependency structure in the form of a dependency tree, including semantic labeling and valency annotation,

  • topic/focus annotation,

  • coreference (grammatical coreference and a restricted subset of textual coreference),

  • grammatical attributes of the nodes of the tree (not covered by any of the above).

Most of the effort was directed to the first area, since the others were to be covered by small samples only. Manually written rules were applied to the analytical-layer trees to pre-annotate them in cases where the relation between the analytical and the tectogrammatical trees was thought to be clear. A rudimentary valency dictionary was prepared (in hard-copy form) to ensure consistency at least for the annotation around the most frequent verbs. The XML version of the valency dictionary, PDT-VALLEX, was created later, and an interface was added to TrEd allowing on-line use and editing of the dictionary; it also made it possible to assign the appropriate valency frame to an occurrence of a word in the corpus. Meanwhile, as the work on the guidelines and the test annotation of coreference and topic/focus progressed, it was eventually decided to perform these annotations on the whole data. Still later, in 2004, the fourth area (assignment of additional grammatical information, filling no less than another 16 attributes of every tectogrammatical tree node) was also semi-automatically extended to the whole tectogrammatically annotated data, i.e. 50k sentences.

In contrast to the analytical layer, the tectogrammatical annotation staff was divided into many small teams, with specialized (sub)areas assigned to their members. This was also a disadvantage: information sometimes did not reach all the people for whom it was relevant. Up to 30 people worked on the project at any given time. Everything was annotated only once, except in pilot inter-annotator agreement tests. Consistency checking was applied in a similar way as at the analytical layer, using complex cross-layer checking.

The final stage (after the "assembly-line" annotation process had finished in 2004) also took over a year. A completely new stand-off XML annotation scheme was developed for the distribution of the data. The valency dictionary PDT-VALLEX was fully manually checked and revised for verbs and certain categories of nouns (in both cases, by a single person to ensure maximum consistency), and extensive automatic cross-layer checks were developed to find annotation inconsistencies, all of which were then manually corrected. A dedicated editor of the tectogrammatical annotator guidelines was appointed, whose task was to rewrite the individual sections of the guidelines (over 800 pages in total) in a clear manner that uses consistent terminology and corresponds to what was eventually annotated in the data. The guidelines were also translated into English. The CD-ROM was completed and shipped to LDC for publication in 2006.

1.4. About Czech

Czech, the language of the texts incorporated in the Prague Dependency Treebank, belongs to the western group of Slavic languages. It is spoken mainly in the Czech Republic, where it is the only official language. Native Czech speakers also live in other European countries, especially in Slovakia, and tens of thousands of Czech speakers live in the U.S.A., Canada and Australia. Czech has over 10 million speakers.

Like other Slavic languages, Czech is highly inflectional: it has seven cases and four genders (e.g. there are 16 main paradigms for the inflection of nouns). It also has free word order (from the purely syntactic point of view): words in a sentence can usually be ordered in several ways. However, the particular word order does influence the meaning of the sentence.

Czech is written using the Latin alphabet extended with several letters with accents. Czech letters (82 characters total) are included in the Unicode standard; also ISO 8859-2 (Latin 2), the standard 8-bit encoding for Central-European languages, and CP1250, its Windows counterpart, are widely used.
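The encodings mentioned above can be compared with a short Python sketch that uses only the standard library codecs. One assumption is made here that is not stated in this guide: the count of 82 is read as the 26 basic Latin letters plus 15 accented letters, each in lower and upper case. The helper names are hypothetical.

```python
import string

# Assumption: the 15 accented lowercase letters used in Czech.
ACCENTED_LOWER = "áčďéěíňóřšťúůýž"

ENCODINGS = ("utf-8", "iso8859_2", "cp1250")


def czech_letters():
    """Return basic and accented letters, lowercase followed by uppercase."""
    lower = string.ascii_lowercase + ACCENTED_LOWER
    return lower + lower.upper()


def byte_lengths(text):
    """Number of bytes needed to store `text` in each encoding."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


letters = czech_letters()
print(len(letters))  # size of the letter inventory under the assumption above

sample = "Příliš žluťoučký kůň"
print(len(sample), byte_lengths(sample))

# The two 8-bit encodings do not assign the same byte to every letter,
# so a file must be decoded with the encoding it was written in.
for ch in "šžť":
    print(ch, ch.encode("iso8859_2").hex(), ch.encode("cp1250").hex())
```

Both 8-bit encodings store every Czech letter in a single byte, whereas UTF-8 needs more than one byte for the accented letters. Because ISO 8859-2 and CP1250 disagree on some byte values, mixing them up produces garbled accented characters rather than an error.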

More information about the Czech language can be found at http://www.czech-language.cz.

1.5. Directory structure

This section contains a short description of the directory structure of the PDT 2.0 distribution, down to the second level.