CoNLL-2009 Shared Task:
Syntactic and Semantic Dependencies in Multiple Languages

 

 

CoNLL-2009 Shared Task Trial Data Download

This page contains trial-sized data for all seven languages taking part in the CoNLL 2009 Shared Task. Under each language's heading, we provide a very basic description of the dataset and a link to the zip file with all the data and documentation.

For the common format description of the data and some short examples, please see the Task Description section.

After unpacking, the data file for a given <Language> is named "CoNLL2009-ST-<Language>-trial.txt". All files should be encoded in UTF-8. Please report any problems with the datasets to stranak@ufal.mff.cuni.cz and we will try to correct them as soon as possible. Please note that these are trial data, so, for example, the contents of the P-columns might not be final. The final data will be released on a separate webpage on January 19, 2009. All registered participants will again be notified by email.
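The naming scheme and encoding described above can be sketched in Python as follows. This is only an illustration: the helper names `trial_filename` and `read_trial_lines` are hypothetical, and the sketch assumes that the zip file has already been unpacked into a local directory and that <Language> is spelled as in the headings on this page.

```python
from pathlib import Path
from typing import List


def trial_filename(language: str) -> str:
    """Build the trial data file name for one language (hypothetical helper)."""
    return f"CoNLL2009-ST-{language}-trial.txt"


def read_trial_lines(directory: str, language: str) -> List[str]:
    """Read one unpacked trial file as UTF-8 and return its lines.

    Decoding is strict, so a file that is not valid UTF-8 raises
    UnicodeDecodeError instead of being silently misread.
    """
    path = Path(directory) / trial_filename(language)
    with open(path, encoding="utf-8", errors="strict") as handle:
        return [line.rstrip("\n") for line in handle]
```

A decoding error raised by `read_trial_lines` would be one kind of problem worth reporting to the address above.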

Catalan

The data distributed is a subset of the Ancora corpus.

Both Catalan and Spanish (see below) data share the same formats and specifications. For more details on the corpora and the accompanying information (tagsets, lexicons, etc.), see the README file of the data set distribution below.

Download Catalan Trial Data

Chinese

This data is a subset of the Chinese Treebank and Chinese PropBank.

The Chinese data are provided in the UTF-8 encoding. For a description of the labels and annotation principles, please refer to the original web pages for the source corpora (see the links above).

Download Chinese Trial Data

Czech

This data is a subset of the Prague Dependency Treebank v2.0.

The columns ID to DEPREL are almost the same as in 2006 and 2007 (minor change: SUBPOS is no longer a separate column, but a part of FEAT).

PRED is the same as in the 2008 English data. APREDn corresponds to 2008's ARGn, but it may contain a list of functions (separated by |): e.g. in "They met each other", "they" is both ACT and PAT of "met".
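A minimal sketch of how such a multi-function APREDn value might be split, written in Python. The function name `split_functions` is hypothetical, and treating "_" as the placeholder for an empty cell is an assumption that should be checked against the format description in the Task Description section.

```python
from typing import List


def split_functions(apred_value: str) -> List[str]:
    """Split one APREDn cell into its list of functions.

    Assumption: an empty cell is written as "_" (or is an empty string).
    """
    if apred_value in ("", "_"):
        return []
    return apred_value.split("|")


# "they" in "They met each other" carries two functions with respect to "met".
print(split_functions("ACT|PAT"))
```

A cell holding a single function, such as "ACT", yields a one-element list, so downstream code can treat both cases uniformly.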

Download Czech Trial Data

English

This data comes from the Penn Treebank, PropBank and NomBank annotated datasets.

It has been converted to the CoNLL 2009 Shared Task specification, but otherwise it is very close to the 2008 Shared Task specification.

Download English Trial Data

German

This data was prepared by converting the Tiger treebank and the Salsa semantic annotation.

The trial data set below contains 400 sentences in the CoNLL 2009 Shared Task format specification.

Download German Trial Data

Japanese

This data is a subset of the Kyoto University Corpus 4.0 and other data (web text and blog) annotated under the same criteria as the Kyoto University Corpus.

Download Japanese Trial Data

Spanish

The data distributed is a subset of the Ancora corpus.

Both Catalan (see above) and Spanish data share the same formats and specifications. For more details on the corpora and the accompanying information (tagsets, lexicons, etc.), see the README file of the data set distribution below.

Download Spanish Trial Data

All trial data at once...

...are available here (1.5MB zip of zipfiles).

Bonus: Graphical Visualization of the Trial Data

Since the tabular format is hardly readable for humans, we have prepared a visual representation of the trial data in the form of dependency trees (using the HEAD column pointers, represented as black-colored straight edges), with additional (colored, curved) links representing the predicate-argument relations (taken from the PRED and APREDx columns). It has been done for all seven languages, using the trial data as provided above.
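For readers who prefer to explore the structure programmatically, the way HEAD pointers define a dependency tree can be sketched in Python. The function name `children_by_head` and the example HEAD values are hypothetical and are not taken from the trial data. The sketch also assumes that token IDs are integers starting at 1 and that a HEAD value of 0 marks the root of the sentence.

```python
from collections import defaultdict
from typing import Dict, List


def children_by_head(heads: Dict[int, int]) -> Dict[int, List[int]]:
    """Group token IDs under the ID of their head.

    `heads` maps each token ID to the value of its HEAD column.
    Assumption: HEAD 0 denotes the artificial root.
    """
    tree = defaultdict(list)
    for token_id, head_id in sorted(heads.items()):
        tree[head_id].append(token_id)
    return dict(tree)


# Invented HEAD values for "They met each other" (IDs 1 to 4).
example_heads = {1: 2, 2: 0, 3: 4, 4: 2}
tree = children_by_head(example_heads)
print(tree[0])  # children of the root
print(tree[2])  # dependents of token 2 ("met")
```

Each entry in the resulting mapping corresponds to the straight edges leaving one node in the pictures. The predicate-argument links would need a second mapping built from the PRED and APREDx columns.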

The data is presented in the form of (quite compressed) .jpg files for fast viewing.

This visualization is meant only as a visual aid for simple sequential browsing of the data, to get some idea about, e.g., features useful for dependency parsing. However, for those interested, we can also provide the PML version of the data and TrEd, an annotation, visualization and search tool that can be used for looking at the PML-formatted data in a number of interesting ways. If interested, please contact stepanek@ufal.mff.cuni.cz for the data and installation hints.

Due to their expected enormous size, the future final data release (training, development, evaluation) will NOT be provided in this format.