Course on Treebank Annotation

December 13-16, 2010, Prague, Czech Republic

Invited Speakers

Erhard Hinrichs and Kathrin Beck: Tübingen Treebank Resources
The Tübingen suite of treebanks includes semi-automatically and fully automatically annotated corpora for spoken and written language. TüBa-D/S, TüBa-E/S and TüBa-J/S are treebanks of spoken German, English, and Japanese, which were annotated semi-automatically at the level of POS-tagging, syntax, and grammatical functions.
The most elaborate Tübingen treebank is TüBa-D/Z, based on articles from the German newspaper taz. With 55 000 sentences (in release 6, released at the end of 2010), TüBa-D/Z is one of the biggest and richest semi-automatically annotated treebanks for German. It has been continuously extended, corrected, and enriched with additional annotation layers since 2001.
Annotation includes morphology, POS tags and syntax (including grammatical functions), named entities, and coreference. Further annotation layers for discourse connectors and the classification of named entities are currently in progress.
The course will provide an overview of all treebanks mentioned above, focussing on the annotation principles, the annotation process, and the information contained in each annotation layer.
Aravind Joshi: Discourse Relations: Going Beyond the Domain of Sentences
How does one go beyond the domain of sentences? A principled way is via the so-called Discourse Relations (DR), which are signaled by Discourse Connectives (DC); these can be thought of as higher-level predicates taking abstract objects (such as events, situations, propositions) as their arguments. After discussing some general properties of DRs, I will briefly describe the Penn Discourse Treebank (PDTB), which is a corpus of about 1 million words* annotated with DCs (explicit and implicit) and their arguments, the senses of the DRs, and attributions of the arguments, among other pieces of information. I will discuss the dependency relations at the level of discourse relations and compare them to the syntactic dependencies. The set of DRs does not appear to be a closed class, yet not completely an open class either. There are expressions which behave as DCs and can be thought of as Alternate Lexicalizations (AltLex) of discourse relations. These are annotated in PDTB. We will also briefly discuss some applications of PDTB.

* The same corpus that has been annotated syntactically in the Penn Treebank (PTB).
Sandra Kübler: Querying Treebanks
Treebanks are useful for finding specific syntactic phenomena, which are difficult to detect in raw text. One example would be finding ditransitive verbs and their objects. But they are only helpful if the annotation contains the information we are looking for and if there is a way of finding out how specific phenomena are annotated in the treebank.
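As a rough illustration of the ditransitive example, the sketch below searches a toy dependency treebank in plain Python. It does not use the query language of TigerSearch or Treebank Search; the token format, the relation labels "obj" and "iobj", and the function name find_ditransitives are all assumptions made for this example only.

```python
# Minimal sketch: find verbs that govern both a direct and an indirect object.
# The token format and the labels "obj"/"iobj" are hypothetical; a real
# treebank may annotate ditransitives quite differently.

def find_ditransitives(sentence):
    """sentence: list of tokens, each a dict with 'id', 'form', 'head', 'deprel'."""
    # Group the dependents of every head so each node's children are easy to inspect.
    deps_by_head = {}
    for tok in sentence:
        deps_by_head.setdefault(tok["head"], []).append(tok)

    hits = []
    for tok in sentence:
        children = deps_by_head.get(tok["id"], [])
        direct = [c["form"] for c in children if c["deprel"] == "obj"]
        indirect = [c["form"] for c in children if c["deprel"] == "iobj"]
        if direct and indirect:
            hits.append((tok["form"], indirect, direct))
    return hits


# Toy data: "She gave him a book" and "She read a book".
gave_sentence = [
    {"id": 1, "form": "She", "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "gave", "head": 0, "deprel": "root"},
    {"id": 3, "form": "him", "head": 2, "deprel": "iobj"},
    {"id": 4, "form": "a", "head": 5, "deprel": "det"},
    {"id": 5, "form": "book", "head": 2, "deprel": "obj"},
]
read_sentence = [
    {"id": 1, "form": "She", "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "read", "head": 0, "deprel": "root"},
    {"id": 3, "form": "a", "head": 4, "deprel": "det"},
    {"id": 4, "form": "book", "head": 2, "deprel": "obj"},
]

if __name__ == "__main__":
    for sent in (gave_sentence, read_sentence):
        print(find_ditransitives(sent))
```

The search only succeeds because the toy annotation marks both objects explicitly, which is the point made above: a query is only as good as the information the annotation contains.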
In this course, we will look at which phenomena we can find in treebanks and how to find them. More specifically, we will look at TigerSearch and Steven Bird's Treebank Search. We will explore the query languages used in the two tools, their strengths and limitations as well as the phenomena that we can find in different corpora.
Since the course will mostly be practical, students are encouraged to bring their laptops to the course. They are also encouraged to download TigerSearch (here) and install it on their laptops.
(There is also an updated version of TigerSearch available here.)
Paul Meurer, Victoria Rosén and Koenraad de Smedt: Tools for Automatically Analyzed Corpora
In this course, we first give a short motivation for parsing corpora, based on construction principles and use cases. We present the context of the TREPIL and XPAR projects and the INESS infrastructure. We then present the LFG Parsebanker, a comprehensive toolkit for interactive incremental construction of a treebank as a parsed corpus, using the XLE parsing tool. This web-based toolkit offers an environment for batch and interactive parsing, versioning, inspection of structures, discriminant-based disambiguation and a structural search facility. The tool is suited for any language with an XLE-based LFG grammar. Parallel treebanks can also be constructed, and a dependency treebank mode is under development. We will give a short overview of the theoretical foundations and demonstrate the features of the tool during an online session.
Detmar Meurers: Detecting Errors in Corpus Annotation
Large corpora that are annotated with various types of linguistic annotation are central to computational linguistics and arguably also to theoretical linguistics. They play a crucial role as training and testing data for a wide range of natural language processing algorithms, and they provide access to natural examples relevant for creating and testing linguistic theories.
At the same time, the "gold standard" annotations used for these purposes contain a significant number of errors, which have been shown to negatively affect both kinds of uses.
As a step towards addressing this situation, we discuss an automatic method for detecting errors in annotated corpora that is generally applicable to corpora with a wide range of annotation schemes. The approach, developed in collaboration with Markus Dickinson and Adriane Boyd, is based on the idea that data recurring within a comparable context should be annotated the same way in all occurrences. Variation in the annotation within similar contexts is thus likely to be erroneous. We demonstrate the applicability of this variation n-gram method by illustrating that it can detect errors with high precision for a range of annotation types, including positional (part-of-speech), tree-based syntactic, discontinuous syntactic, and dependency annotation.
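The core idea for positional (part-of-speech) annotation can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it uses a fixed symmetric context window rather than growing n-grams incrementally, and the function name variation_ngrams, the boundary markers, and the toy corpus are assumptions made for this example.

```python
from collections import defaultdict

# Simplified sketch of the variation n-gram idea for POS annotation:
# a word recurring in an identical word context but with different tags
# is flagged as a candidate annotation error.

def variation_ngrams(corpus, context=1):
    """corpus: list of sentences, each a list of (word, tag) pairs.

    Returns a dict mapping each word n-gram (the focus word plus `context`
    words on either side) to the tag counts of its focus word, restricted
    to n-grams whose focus word received more than one tag.
    """
    seen = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        words = [word for word, _ in sentence]
        # Pad with boundary markers so sentence-initial and sentence-final
        # words also get a full context window.
        padded = ["<s>"] * context + words + ["</s>"] * context
        for i, (_, tag) in enumerate(sentence):
            ngram = tuple(padded[i : i + 2 * context + 1])
            seen[ngram][tag] += 1
    return {ngram: dict(tags) for ngram, tags in seen.items() if len(tags) > 1}


# Hypothetical toy corpus: "fish" is tagged inconsistently in the same context.
toy_corpus = [
    [("they", "PRP"), ("can", "MD"), ("fish", "VB"), ("here", "RB")],
    [("they", "PRP"), ("can", "MD"), ("fish", "NN"), ("here", "RB")],
    [("we", "PRP"), ("like", "VBP"), ("fish", "NN"), ("a", "DT"), ("lot", "NN")],
]

if __name__ == "__main__":
    for ngram, tags in variation_ngrams(toy_corpus).items():
        print(" ".join(ngram), tags)
```

With context=0 the function reports every word that carries more than one tag anywhere in the corpus, most of which is genuine ambiguity; widening the context narrows the candidates to variation that is more likely to be an error.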
Martha Palmer: From Propositions to Event Descriptions
The PropBank annotated data has contributed significantly to the improvement in our ability to detect semantic roles. However, detecting semantic roles in individual predicate argument structures is just the first step towards realizing actual event descriptions, including co-references with previous mentions of the same event. VerbNet provided a key resource for the recent ACL paper on recovering implicit arguments, and further steps in this direction and that of building richer event descriptions will be discussed. The latest enhancements to VerbNet, which include greater coverage, a simplification and regularization of syntactic frame descriptions and thematic roles, and plans for generalizing semantic predicates, will be presented. The talk will also describe SemLink, an effort to map between complementary lexical resources: WordNet, FrameNet, VerbNet and PropBank. The goal is to develop a broad-coverage, unified English resource that has the fine granularity and rich semantics of WordNet and FrameNet, that is a platform for syntactically based semantic generalizations derived from VerbNet, and that provides PropBank-like broad-coverage training data for supervised Machine Learning techniques. SemLink should provide a necessary foundation for building richer event descriptions.
Jan Hajič, Eva Hajičová, Silvie Cinková, Martin Popel, Jan Štěpánek and Zdeněk Žabokrtský: Prague Dependency Treebank Tutorial: Annotation and Technology
The tutorial will introduce the Prague Dependency Treebank project, which aims at a complex manual annotation of a substantial amount of naturally occurring sentences in continuous Czech texts. The Prague Dependency Treebank has three layers of annotation: morphological, analytical (describing surface syntax in a dependency fashion) and tectogrammatical, which combines syntax and sentence semantics into a language meaning representation, keeping the dependency structure as the core of the annotation structure but adding basic coreferential links, topic/focus annotation, and a detailed semantic labeling of every sentence unit. The Prague Czech-English Dependency Treebank will be introduced as well. In addition to the data, the treebank and data processing tools will be discussed.
This tutorial is intended for students, researchers, and practitioners in natural language processing who want to see how the broadly annotated data and the annotation and data processing tools have been built in the Prague treebanking projects. The fact that the annotations and tools can be used in a general way could be a strong motivation for all attendees.
Barbora Hladká and Jiří Mírovský: Play the Language. An Alternative Way of Annotation
Collecting high-quality data is resource-demanding regardless of the research area and the type of data. We will present Internet games and applications whose purpose is to enrich text data with various types of annotation. In addition, a game competition will be organized.

Program

All presentations are available in a single zip file here.

Monday, Dec 13 (Mamaison Hotel Riverside Prague)
9:20-9:30	Erhard Hinrichs and Jan Hajič: Opening
9:30-11:00	Erhard Hinrichs and Kathrin Beck: Tübingen Treebank Resources (Slides 1, Slides 2)
11:30-13:00	Paul Meurer, Victoria Rosén and Koenraad de Smedt: Tools for Automatically Analyzed Corpora (Slides 1, Slides 2)
	lunch
14:30-16:00	Paul Meurer, Victoria Rosén and Koenraad de Smedt
16:30-18:00	Detmar Meurers: Detecting Errors in Corpus Annotation (Slides 1, Slides 2, Slides 3)

Tuesday, Dec 14 (Mamaison Hotel Riverside Prague)
9:30-11:00	Erhard Hinrichs and Kathrin Beck
11:30-13:00	Sandra Kübler: Querying Treebanks (Slides)
	lunch
14:30-16:00	Detmar Meurers
16:30-18:00	Paul Meurer, Victoria Rosén and Koenraad de Smedt

Wednesday, Dec 15 (Mamaison Hotel Riverside Prague)
9:30-11:00	Sandra Kübler
11:30-12:30	Barbora Hladká and Jiří Mírovský: Play the Language. An Alternative Way of Annotation (Slides)
	lunch
14:00-15:30	Prague Dependency Treebank Tutorial: Technology. Jan Štěpánek: Tred Editor and PML-TQ Query Engine and Query Language (Slides)
16:00-17:00	Prague Dependency Treebank Tutorial: Technology. Zdeněk Žabokrtský and Martin Popel: Introduction to TectoMT (Slides)

from 19:30	workshop dinner (Konírna restaurant)

Thursday, Dec 16 (Refectory, School of Computer Science)
9:30-12:50	Prague Dependency Treebank Tutorial: Data
	9:30-10:50	Jan Hajič: Introduction; Three Layers of Annotation: Morphology, Surface and Deep Syntax (Slides)
	11:20-11:50	Eva Hajičová: Topic-Focus Articulation (Slides)
	11:50-12:20	Zdeněk Žabokrtský: Grammatemes; Coreference (Slides)
	12:20-12:50	Silvie Cinková: Prague Czech-English Dependency Treebank (Slides)
12:50-13:00	A Competition before Christmas 2010: A Medal Ceremony
	lunch
14:30-16:00	Martha Palmer: From Propositions to Event Descriptions (Slides)
16:30-18:00	Aravind Joshi: Discourse corpus (Slides 1, Slides 2, Slides 3)

CLARA Joint Training Programme: Course on Treebank Annotation

Institute of Formal and Applied Linguistics

Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
