CoNLL-2009 Shared Task:
Syntactic and Semantic Dependencies in Multiple Languages


CoNLL-2009 Shared Task Description

The task builds on the CoNLL-2008 task and extends it to multiple languages. The core of the task is to predict syntactic and semantic dependencies and their labeling. Data for statistical training and evaluation is provided; the labeled dependencies are extracted from manually annotated treebanks, such as the Penn Treebank for English, the Prague Dependency Treebank for Czech, and similar treebanks for Catalan, Chinese, German, Japanese and Spanish, enriched with semantic relations (such as those captured in PropBank, NomBank and similar resources). Great effort has been devoted to providing the participants with a common and relatively simple data representation for all the languages, similar to last year's English data.

The objective is to perform and evaluate Semantic Role Labeling (SRL) using a dependency-based representation for both syntactic and semantic dependencies, a relatively novel approach (1). Furthermore, the SRL problem addresses not only propositions centered around verbal predicates but also those centered around nouns and other major part-of-speech categories (where available).

The syntactic dependencies to be modeled are more complex than those used in previous CoNLL evaluations: Johansson and Nugues have shown that a richer set of syntactic dependencies improves semantic processing (2). The proposed evaluation offers a practical framework for joint learning of the two problems. Several considerations make this task attractive: we believe the proposed dependency-based representation is a better fit for many applications (e.g., Information Retrieval, Information Extraction), where it is often sufficient to identify the dependency between the predicate and the head of the argument constituent rather than to extract the complete argument constituent.

Furthermore, it has been shown that dependencies can be extracted with state-of-the-art accuracy in linear time (3), which can give a significant boost to the adoption of this technology in real-world applications and across multiple languages.

The task continues to investigate several important research directions. For example, is the dependency-based representation better suited for SRL than the constituent-based formalism? Does joint learning improve syntactic and semantic analysis?

Training data provided

For training, the data will contain the gold-standard annotation of the HEAD, DEPREL, PRED and APRED columns, as well as the gold-standard lemma (LEMMA), part-of-speech (POS) and morphological features (FEAT). Training and development sections will be designated as such, even though their content will be formally the same. As in the CoNLL-2008 shared task, the evaluation is separated into two challenges:

'Closed Challenge'. Systems have to be built strictly with the information contained in the given training corpus and tuned on the development section. In addition, the PropBank frames and similar resources for the other languages can also be used (see Official Resources). Note that this means that constituent-based parsers or SRL systems cannot be used in this challenge, because constituent-based annotations are not provided in our training set (unless such tools are trained in an un- or semi-supervised manner using only the data provided). The aim of this challenge is to compare the performance of the participating systems in a fair environment.

'Open Challenge'. Systems can be developed making use of any kind of external tools and resources. The only condition is that such tools or resources must not have been developed with the annotations of the test set, for either the input or the output annotations of the data. In this challenge, we are interested in learning methods that make use of any tools or resources that might improve performance. For example, we encourage the use of rich semantic information from WordNet, VerbNet or a WSD system. Participants in this challenge are also encouraged to use constituent-based parsers and SRL systems, as long as these systems were trained only on the sections of the TreeBank/PropBank/NomBank used in the shared task training corpus (the exact section numbers will be announced in due time; similarly for the other languages). To encourage the participation of groups interested only in SRL, the organizers will provide the output of state-of-the-art dependency parsers as input in this challenge. The comparison of different systems in this setting may not be fair, and thus the ranking of systems is not necessarily important.

The task in detail

Participants are required to choose whether their system computes both the syntactic dependencies and the semantic ones (joint system) or just the latter (SRL-only system). We strongly encourage participants to address the joint task. Participants are required to provide the output of the selected type of system for all the languages provided. A description of the representation and format of the provided input and required output can be found in the Data format section. To make life easier for all and to avoid guessing which words are annotated as predicates (which is often arbitrary and language- and annotation-schema-dependent), we provide a clear indication of it (even in the test data): the column FILLPRED contains Y for those lines that should be considered predicates (not necessarily just verbs, but all words that take arguments in the particular annotation schema and language).

Specifically, the task consists of automatically filling the data columns for HEAD (syntactic dependency), DEPREL (type of dependency), PRED (frame, roleset, sense, or whatever it is called for a particular language), and APREDs (the PREDs' argument dependencies and labels). HEAD and DEPREL are optional (see below for the consequences for evaluation), and PRED is to be filled only for rows with FILLPRED=Y; similarly, APREDs are to be filled only in the corresponding columns. The input will consist of sentences with uniquely identified words, each of which will carry (on the same data line) the automatically pre-analyzed PLEMMA (lemma), PPOS (part of speech) and PFEAT (other morphological and lexical features). For those interested only in the Semantic Role Labeling task, PHEAD (automatically predicted head node) and PDEPREL (automatically inferred dependency relation) will also be provided. The input test data distributed during the test period will contain only the information described above, even though all the columns described in the data will be present (but possibly unfilled).
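As a concrete illustration, the following Python sketch (ours, not an official tool) scans one blank-line-separated sentence block of the tab-separated input and collects the rows marked as predicates via FILLPRED; the column indices follow the layout given in the Data format section below.

FILLPRED_COL = 12  # 0-based index of FILLPRED in the column layout below

def predicate_rows(sentence_lines):
    """Yield (ID, FORM) for every row whose FILLPRED cell is 'Y'.

    A minimal sketch assuming tab-separated fields; a real reader
    should also handle the '_' placeholder described under
    'Common Values' below.
    """
    for line in sentence_lines:
        cols = line.rstrip("\n").split("\t")
        if cols[FILLPRED_COL] == "Y":
            yield cols[0], cols[1]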

Each participant will be allowed to submit the output of 1 (one) system, with results for (possibly) both challenges. The type of submission (joint/SRL-only) will have to be specified. Participants are asked to be honest in declaring the type of the system; we discourage simply copying the provided PHEAD and PDEPREL into HEAD and DEPREL: just declare your system SRL-only and leave HEAD and DEPREL blank.

Evaluation

The output sent by the participants will be evaluated, for all languages, using a common evaluation script similar to that of the 2008 shared task. The principles will also be similar to those of the CoNLL-2008 Shared Task: there will be separate Recall, Precision and F-measure scores for the syntactic dependencies, for the SRL task and for the joint task. Three rankings will therefore be produced; given the emphasis on joint prediction, the main one will be based on the joint (labeled) macro F-score (see the detailed scorer description).
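For intuition, here is a minimal sketch of how labeled precision, recall and F-measure can be computed over sets of (head, dependent, label) triples. It only mirrors the idea; the official scorer additionally combines the syntactic and semantic scores into the joint macro F-score.

def labeled_prf(gold, predicted):
    """Labeled precision/recall/F1 over sets of (head, dependent, label)
    triples. A sketch, not the official scorer."""
    correct = len(gold & predicted)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f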

For those who submit only the SRL results, we will additionally compute (as a separate "system") their joint score using the input-supplied PHEAD/PDEPREL as if they were the true HEAD and DEPREL. For ranking the SRL-only systems, the SRL F-score will be primary.

Learning Curve (Optional)

To assess the future implications for annotation and research, participants are encouraged to submit (up to one week after the test output deadline, using a similar uploading mechanism, to be announced) data for a learning curve for at least the large-data languages (English and Czech are recommended, but you can choose any other available language as well). We suggest using 25%, 50% and 75% of the training set size as the minimal granularity of the data points for this task; see the sketch below. Ideally, participants should run both challenges under this setting and send the outputs clearly identified by the training data size.
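One simple way to produce such subsets is to truncate the training file at sentence boundaries, taking the first fraction of sentences (random sampling is an equally valid choice). A hedged sketch follows; the file names are placeholders.

def truncate_training(in_path, out_path, fraction):
    """Write the first `fraction` of the sentences of a CoNLL-2009
    training file to out_path. Sentences are separated by blank lines.
    A sketch for the learning-curve setting, not an official tool."""
    with open(in_path, encoding="utf-8") as f:
        sentences = f.read().strip("\n").split("\n\n")
    keep = sentences[:int(len(sentences) * fraction)]
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n\n".join(keep) + "\n\n")

# e.g., truncate_training("train.conll09", "train.50.conll09", 0.50)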

Performance (Optional)

Participants are encouraged to submit, by email to stranak@ufal.mff.cuni.cz, the memory footprint and the timing of both the training and testing phases of their systems. While these figures are clearly also hardware- and OS-dependent, they should give an indication of time efficiency and thus a hint at the set of possible applications that require a certain response time. Depending on the entries received, they will appear in the common task description paper.

Publication of the results

Participants are requested to submit a system description paper (see the deadlines published elsewhere) after the test period is over and they know the official scores.

In the paper, participants may also report results on restricted or extended task(s), possibly only for some languages (apart from the official ones); e.g., they can opt to ignore the FILLPRED column and predict the argument-bearing predicates themselves. They can also report results of more than the one system officially permitted for result submission.

References

(1) Hacioglu K. Semantic Role Labeling Using Dependency Trees. In Proceedings of COLING-2004, 2004.

(2) Johansson R. and Nugues P. Extended Constituent-to-dependency Conversion for English. In Proceedings of NODALIDA 2007, 2007.

(3) Nivre J., Hall J., Nilsson J. and Eryigit G. Labeled Pseudo-Projective Dependency Parsing with Support Vector Machines. In Proceedings of the CoNLL-X Shared Task, 2006.

Data format

Columns (overview)

ID FORM LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL FILLPRED PRED APREDs

Description of the columns

ID, FORM, LEMMA, POS, FEAT, HEAD and DEPREL are the same as in the CoNLL-2006 and CoNLL-2007 Shared Tasks.

FEAT is a set of morphological features (separated by |) defined for a particular language, e.g. more detailed part of speech, number, gender, case, tense, aspect, degree of comparison, etc.
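For illustration, a small sketch (ours) that splits such a FEAT value into a dictionary, following the '|'-separated Attribute=Value convention visible in the Czech example below:

def parse_feat(feat):
    """Split a FEAT value such as 'SubPOS=N|Gen=F|Num=S' into a dict.
    '_' means no features (see 'Common Values' below). A sketch;
    some languages may use a different internal FEAT structure."""
    if feat == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feat.split("|"))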

The P-columns (PLEMMA, PPOS, PFEAT, PHEAD and PDEPREL) are the automatically predicted variants of the gold-standard LEMMA, POS, FEAT, HEAD and DEPREL columns. They are produced by independently (or cross-)trained taggers and parsers.

PRED is the same as in the 2008 English data. APREDs correspond to 2008's ARGs. FILLPRED contains Y for lines where PRED is/should be filled.
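To make the layout concrete, here is a hedged sketch of a line reader: the 14 fixed columns are followed by a variable number of APRED columns, one per predicate in the sentence.

from collections import namedtuple

COLUMNS = ["ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS", "FEAT", "PFEAT",
           "HEAD", "PHEAD", "DEPREL", "PDEPREL", "FILLPRED", "PRED"]
Token = namedtuple("Token", COLUMNS + ["APREDS"])

def parse_line(line):
    """Split one tab-separated data line into named fields; APREDS
    collects the variable-length tail. A sketch, not an official reader."""
    cols = line.rstrip("\n").split("\t")
    return Token(*cols[:14], cols[14:])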

Deviations / differences from the common format

All the languages come in the same data format. However, due to inherent differences, language-specific descriptions of the content of the data columns, including tagset and label descriptions, are attached to each dataset. They will be available as soon as the data is ready for the participants.

Common Values

An UNDERSCORE ('_') is used as the "unknown", "unannotated" or "unfilled" value (i.e., for all those cells in the large data table that do not contain a defined label from the label sets described for each language in the documentation). The same character is also used in all cells of columns that are left completely unfilled because, e.g., the data for the particular language is not available.
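In the same spirit as the sketches above, a one-line helper can normalize this placeholder:

def cell(value):
    """Map the '_' placeholder to None; otherwise return the value unchanged."""
    return None if value == "_" else value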

Examples

This is an example sentence from the Czech data; for English, see below.

The "P-" columns are filled with values from the corresponding gold-standard columns. The trial data (to be released soon) will have tagger and parser outputs there.

ID FORM LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL FILLPRED PRED APRED1 APRED2 APRED3 APRED4 APRED5 APRED6 APRED7 APRED8 APRED9 APRED10 APRED11 APRED12 APRED13 APRED14 APRED15 APRED16
1 Z z z R R SubPOS=R|Cas=2 SubPOS=R|Cas=2 10 10 AuxP AuxP _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2 této tento tento P P SubPOS=D|Gen=F|Num=S|Cas=2 SubPOS=D|Gen=F|Num=S|Cas=2 3 3 Atr Atr Y tento _ RSTR _ _ _ _ _ _ _ _ _ _ _ _ _ _
3 knihy kniha kniha N N SubPOS=N|Gen=F|Num=S|Cas=2|Neg=A SubPOS=N|Gen=F|Num=S|Cas=2|Neg=A 1 1 Adv Adv Y kniha _ _ _ _ _ _ _ DIR1 _ _ _ _ _ _ _ _
4 - - - Z Z SubPOS=: SubPOS=: 8 8 AuxG AuxG _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
5 po po po R R SubPOS=R|Cas=6 SubPOS=R|Cas=6 8 8 AuxP AuxP _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
6 mnoha mnoho mnoho C C SubPOS=a|Cas=6 SubPOS=a|Cas=6 7 7 Atr Atr Y mnoho _ _ _ RSTR _ _ _ _ _ _ _ _ _ _ _ _
7 stránkách stránka stránka N N SubPOS=N|Gen=F|Num=P|Cas=6|Neg=A SubPOS=N|Gen=F|Num=P|Cas=6|Neg=A 5 5 Adv Adv Y stránka _ _ _ _ REG _ _ _ _ _ _ _ _ _ _ _
8 pobuřující pobuřující pobuřující A A SubPOS=G|Gen=F|Num=S|Cas=2|Neg=A SubPOS=G|Gen=F|Num=S|Cas=2|Neg=A 3 3 Atr Atr Y pobuřující _ RSTR _ _ _ _ _ _ _ _ _ _ _ _ _ _
9 - - - Z Z SubPOS=: SubPOS=: 8 8 AuxG AuxG _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
10 lze lze lze V V SubPOS=B|Num=S|Per=3|Ten=P|Neg=A|Voi=A SubPOS=B|Num=S|Per=3|Ten=P|Neg=A|Voi=A 0 0 Pred Pred Y v-w1757f1 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
11 totiž totiž totiž D D SubPOS=b SubPOS=b 10 10 AuxY AuxY _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
12 Linda Lind Lind N N SubPOS=N|Gen=M|Num=S|Cas=4|Neg=A|Sem=S SubPOS=N|Gen=M|Num=S|Cas=4|Neg=A|Sem=S 13 13 Obj Obj Y Lind _ _ _ _ _ _ _ PAT _ _ _ _ _ _ _ _
13 poznat poznat poznat V V SubPOS=f|Neg=A SubPOS=f|Neg=A 10 10 Sb Sb Y v-w4210f3 _ _ _ _ _ ACT _ _ _ _ _ _ _ _ _ _
14 jako jako jako J J SubPOS=, SubPOS=, 15 15 AuxY AuxY _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
15 mistra mistr mistr N N SubPOS=N|Gen=M|Num=S|Cas=4|Neg=A SubPOS=N|Gen=M|Num=S|Cas=4|Neg=A 12 12 Atv Atv Y mistr _ _ _ _ _ _ COMPL2 COMPL _ _ _ _ _ _ _ _
16 protikladu protiklad protiklad N N SubPOS=N|Gen=I|Num=S|Cas=2|Neg=A SubPOS=N|Gen=I|Num=S|Cas=2|Neg=A 15 15 Atr Atr Y protiklad _ _ _ _ _ _ _ _ APP _ _ _ _ _ _ _
17 , , , Z Z SubPOS=: SubPOS=: 20 20 AuxX AuxX _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
18 který který který P P SubPOS=4|Gen=Y|Num=S|Cas=1 SubPOS=4|Gen=Y|Num=S|Cas=1 20 20 Sb Sb Y který _ _ _ _ _ _ _ _ _ _ _ ACT _ _ _ _
19 prý prý prý T T SubPOS=T SubPOS=T 20 20 AuxY AuxY _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
20 nešetří šetřit šetřit V V SubPOS=B|Num=S|Per=3|Ten=P|Neg=N|Voi=A SubPOS=B|Num=S|Per=3|Ten=P|Neg=N|Voi=A 15 15 Atr Atr Y v-w6711f1 _ _ _ _ _ _ _ _ RSTR _ _ _ _ _ _ _
21 sebe se se P P SubPOS=6|Num=X|Cas=4 SubPOS=6|Num=X|Cas=4 26 26 Obj_M Obj_M Y se _ _ _ _ _ _ _ _ _ _ _ PAT _ _ _ _
22 , , , Z Z SubPOS=: SubPOS=: 26 26 AuxX AuxX _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
23 čtenáře čtenář čtenář N N SubPOS=N|Gen=M|Num=P|Cas=4|Neg=A SubPOS=N|Gen=M|Num=P|Cas=4|Neg=A 26 26 Obj_M Obj_M Y čtenář _ _ _ _ _ _ _ _ _ _ _ PAT _ _ _ _
24 , , , Z Z SubPOS=: SubPOS=: 26 26 AuxX AuxX _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
25 Židy Žid Žid N N SubPOS=N|Gen=M|Num=P|Cas=4|Neg=A|Sem=E SubPOS=N|Gen=M|Num=P|Cas=4|Neg=A|Sem=E 26 26 Obj_M Obj_M Y Žid _ _ _ _ _ _ _ _ _ _ _ PAT _ _ _ _
26 ani ani ani J J SubPOS=^ SubPOS=^ 20 20 Coord Coord _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
27 Němce Němec Němec N N SubPOS=N|Gen=M|Num=P|Cas=4|Neg=A|Sem=E SubPOS=N|Gen=M|Num=P|Cas=4|Neg=A|Sem=E 26 26 Obj_M Obj_M Y Němec _ _ _ _ _ _ _ _ _ _ _ PAT _ _ _ _
28 . . . Z Z SubPOS=: SubPOS=: 0 0 AuxK AuxK _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The following English example has been adapted from the CoNLL-2008 task description (http://www.yr-bcn.es/conll2008/) to fit the CoNLL-2009 format specification. As in the 2008 version, our tokenization differs from the Penn Treebank's in its treatment of hyphens and slashes; e.g., we break up "New York-based" into "New", "York", "-", "based". Unlike the 2008 task, we do not include the Penn Treebank's alternate tokenization in our version of the data.

ID FORM LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL FILLPRED PRED APRED1 APRED2 APRED3 APRED4 APRED5 APRED6
1 W. w. w. NNP NNP _ _ 3 3 NAME NAME _ _ _ _ _ _ _ _
2 Ed ed ed NNP NNP _ _ 3 3 NAME NAME _ _ _ _ _ _ _ _
3 Tyler tyler tyler NNP NNP _ _ 18 18 SBJ SBJ _ _ _ _ A1 _ _ _
4 , , , , , _ _ 3 3 P P _ _ _ _ _ _ _ _
5 37 37 37 CD CD _ _ 6 6 NMOD NMOD _ _ _ _ _ _ _ _
6 years year year NNS NNS _ _ 7 7 AMOD AMOD _ _ _ _ _ _ _ _
7 old old old JJ JJ _ _ 3 3 NMOD NMOD _ _ _ _ _ _ _ _
8 , , , , , _ _ 3 3 P P _ _ _ _ _ _ _ _
9 a a a DT DT _ _ 12 12 NMOD NMOD _ _ _ _ _ _ _ _
10 senior senior senior JJ JJ _ _ 12 12 NMOD NMOD _ _ A3 _ _ _ _ _
11 vice vice vice NN NN _ _ 12 12 NMOD NMOD _ _ A3 _ _ _ _ _
12 president president president NN NN _ _ 3 3 APPO APPO Y president.01 A0 _ _ _ _ _
13 at at at IN IN _ _ 12 12 LOC LOC _ _ A2 _ _ _ _ _
14 this this this DT DT _ _ 16 16 NMOD NMOD _ _ _ _ _ _ _ _
15 printing print print VBG VBG _ _ 16 16 NMOD NMOD _ _ _ A1 _ _ _ _
16 concern concern concern NN NN _ _ 13 13 PMOD PMOD Y concern.02 _ A0 _ _ _ _
17 , , , , , _ _ 3 3 P P _ _ _ _ _ _ _ _
18 was be be VBD VBD _ _ 0 0 ROOT ROOT _ _ _ _ _ _ _ _
19 elected elect elect VBN VBN _ _ 18 18 VC VC Y elect.01 _ _ _ _ _ _
20 president president president NN NN _ _ 19 19 OPRD OPRD Y president.01 _ _ A2 A0 _ _
21 of of of IN IN _ _ 20 20 NMOD NMOD _ _ _ _ _ A2 _ _
22 its its its PRP$ PRP$ _ _ 24 24 NMOD NMOD _ _ _ _ _ _ A2 _
23 technology technology technology NN NN _ _ 24 24 NMOD NMOD _ _ _ _ _ _ A1 _
24 group group group NN NN _ _ 21 21 PMOD PMOD Y group.01 _ _ _ _ _ _
25 , , , , , _ _ 20 20 P P _ _ _ _ _ _ _ _
26 a a a DT DT _ _ 28 28 NMOD NMOD _ _ _ _ _ _ _ _
27 new new new JJ JJ _ _ 28 28 NMOD NMOD _ _ _ _ _ _ _ AM-TMP
28 position position position NN NN _ _ 20 20 APPO APPO Y position.02 _ _ _ _ _ A1-REF
29 . . . . . _ _ 18 18 P P _ _ _ _ _ _ _ _