LSA 2011 Workshop on the Prague Dependency Treebank

Jan Hajic, Zdenka Uresova
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University in Prague
Czech Republic

The Prague Dependency Treebank workshop at LSA 2011, University of Colorado at Boulder, CO

This half-day LSA 2011 workshop (Sat, July 30th, 1:30-5:15, HUMN 135) will introduce the family of Prague Dependency Treebanks. It will be structured as two standard-length talks.

In the first part (1:30-3:15), we will first explain the foundations of the Prague Dependency Treebanks (PDT), with references to the underlying linguistic theory. We will then show the structure of the PDT and describe the morphological and surface-syntactic dependency annotation, illustrated with examples.

In the second part (3:30-5:15), we will introduce the linguistically rich "tectogrammatical" layer of annotation, which is built over the surface dependency annotation. This layer includes various semantic features as well as coreference annotation and annotation of information structure (topic/focus). We will also show the annotation and search tools that can be used to search (and/or edit) the treebank(s). Before concluding, we will briefly describe the use of treebanks in the development of various NLP technologies (taggers, parsers, etc.), including the use of the parallel Czech-English treebank in one of our machine translation systems.

Those interested in the documentation of the Czech treebank and (the first version of) the English treebank can visit the PDT 2.0 web page and the PEDT v. 1.0 web page, respectively, or just look below for Czech and English sample annotations.

Sample annotated trees

Czech

A sample of PDT-style annotation for Czech ("Mnozi klienti podle nej se svymi ucty skutecne zacali spekulovat.", lit. "Many clients according-to him with their accounts really started to-speculate."):
Deep syntactic and semantic "tectogrammatical" annotation:           
Surface-syntactic dependency representation (same sentence as above):           

English

... and for English (the famous opening sentence of the Penn Treebank, about Pierre Vinken):
Deep syntactic and semantic "tectogrammatical" annotation (simplified for visualization):           
Surface-syntactic dependency representation (same sentence as above):           

The two sets of slides will be available here (or separately as Part I, Part II) at the time of or shortly after the workshop.


Additional resources and links

TrEd: a graphical annotation and search tool for PML-encoded treebanks, by Petr Pajas et al. (freely downloadable). For installing and using the treebank search client embedded in TrEd, please see this documentation and reference.

Online version of the PML-TQ search system (for basic search functions, no installation needed): http://euler.ms.mff.cuni.cz:8111 (user: anonymous, password: anonymous, for sample data access). Please come to the workshop to find out how to get full access to the 20+ treebanks searchable by the online version of PML-TQ!

Online search tool "Netgraph" for the (Czech) PDT (predecessor of PML-TQ, simpler interface, no longer maintained):

Documentation at http://ufal.mff.cuni.cz/pdt2.0/doc/tools/netgraph.

Simple browsing: sample Penn Treebank annotation, PDT-style (new PEDT corpus); browse both the tectogrammatical and dependency annotation and compare them to the original one. NB: Does not work under IE (yet); use Chrome, Mozilla, etc. NB2: Preliminary and not yet documented; the location might change later.

Software platform for NLP processing (analysis, generation, translation, etc.) called "TreeX". For those familiar with GATE, it's a direct competitor :-), but it can in fact be used from within GATE, too. (The previous version, with tutorials etc., is called TectoMT and is here.)


You might also want to visit our Institute's pages at http://ufal.mff.cuni.cz.
