CoNLL-2009 Shared Task Training and Development Data Download
This page contains instructions for downloading the main (training and development) data for the CoNLL-2009 Shared Task. The corresponding packages also contain the trial data available here since January 5, 2009, so there is no need to download the trial data separately.
Due to relatively complicated licensing issues, the data is split into two packages: the data for Chinese, Czech and English will be made available to you by the LDC, while the data for the remaining four languages will be available for download from this website after you sign a common license (see below).
The common license is signed electronically, by clicking the "Sign and Submit" button at the bottom of the license page. Please take the licensing issues seriously, if only because participants who obey all of the licensing rules and the rules of the Shared Task might be offered a permanent license for the data, to be kept even after the task finishes.
After signing the license, you will be (a) emailed a password to access the Catalan, German, Spanish and Japanese training and development data from this website and (b) contacted soon by the LDC (where the signed license gets sent automatically) with information on how to access and download the remaining data for Chinese, Czech and English.
If you are on this page for the first time, please proceed now to fill in all the requested information at the license page.
If you already have a license key from a previous download, you may use it again (please do NOT fill in the license form now): go to this direct download page.
(Feb. 9, 2009) Version B of the LDC-distributed data (Chinese, Czech, English) is now available, making the patches made available earlier (see below) obsolete. You should have received download instructions for the B versions from LDC; if not, please let us know.
CORRECTION No. 2: Please download the Chinese training data version A correction diff file here. No id/key is needed to download the diff file. Only a few lines need to be changed (please see the diff file); if you are not familiar with patch/diff, you can locate and modify the files you originally downloaded from LDC by hand.
CORRECTION No. 1: Please download the English training data version A correction diff file here. No id/key is needed to download the diff file. There is only one difference, though, which you can also easily correct by hand: on line 580366, there should be an underscore '_' between "comic" and "strip" in the third column (not a space). While this should not affect evaluation (since the line is in neither the development nor the evaluation data), under certain not-so-uncommon circumstances fixing it will make your life easier and more bug-free, e.g. when using '\s' as the field separator pattern in Perl's regular expressions.
Thanks to Pierre Nugues (LTH, Lund, Sweden) for discovering this bug.
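The field-separator problem described in Correction No. 1 can be illustrated with a small sketch. It assumes that the data files are tab-separated, one token per line, with blank lines between sentences; this page does not state that, so please check it against your own files. The sample line is made up for illustration and is not copied from the corpus, and the helper names (split_on_whitespace, split_on_tabs, fix_column, find_suspect_lines) are hypothetical.

```python
import re


def split_on_whitespace(line):
    """Split the way a '\\s+' field-separator pattern would."""
    return re.split(r"\s+", line.strip())


def split_on_tabs(line):
    """Split on tab characters only (assumed to be the real separator)."""
    return line.rstrip("\n").split("\t")


def fix_column(line, column_index=2):
    """Replace spaces with underscores in one tab-separated column.

    column_index=2 is the third column (0-based indexing).
    The trailing newline is dropped from the returned line.
    """
    fields = split_on_tabs(line)
    if column_index < len(fields):
        fields[column_index] = fields[column_index].replace(" ", "_")
    return "\t".join(fields)


def find_suspect_lines(lines):
    """Return 1-based numbers of lines where the two splits disagree."""
    suspects = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank line (assumed sentence separator)
        if len(split_on_whitespace(line)) != len(split_on_tabs(line)):
            suspects.append(number)
    return suspects


# Made-up example line with a space inside the third column.
sample = "3\tstrip\tcomic strip\tNN\n"
fixed = fix_column(sample)

print(len(split_on_tabs(sample)), len(split_on_whitespace(sample)))  # 4 5
print(len(split_on_tabs(fixed)), len(split_on_whitespace(fixed)))    # 4 4
print(find_suspect_lines([sample, "\n", fixed]))                     # [1]
```

With the space in place, a whitespace split yields one field too many and shifts every later column; after the underscore fix both splits agree. Note that find_suspect_lines would also flag lines containing empty fields (two adjacent tabs), so treat its output as a list of candidates to inspect rather than a list of confirmed errors.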