We encourage authors of datasets and corpora with annotated coreference or anaphoric relations to extend the CorefUD collection with their data. Please write to Michal Novák and Zdeněk Žabokrtský if interested. We propose the following timeline for our cooperation:
In this phase, we ask you first to make sure that your data satisfy the following requirements:
your data and their derivatives can be distributed under an open and free license, ideally a Creative Commons variant (without the ND clause)
it is OK if your data are still under development and the annotation is planned to continue for a long time, but at least a reasonable subset with reasonable annotation quality must be available by mid-January 2023
the data should contain at least 500 sentences (or 10k tokens)
the conversion pipeline from your file format into the CorefUD file format can be – at least in theory – fully automatic
If your data fulfill the above conditions and you would like to include them in CorefUD 1.1, please contact us. We will give you access to the GitHub repository, and will help you integrate your data into the collection.
Depending on your source format, we will discuss possible technical solutions for the conversion with you, and ideally also point you to an existing converter for a case similar to yours.
We will also ask you for more detailed information concerning other aspects of your data, e.g.
what types of anaphoric relations are annotated in your resource
whether a mention is specified by its boundaries and/or by its head
whether coreference is represented by links or clusters (a small illustration follows this list)
whether you have gold-standard UD trees for the sentences, or whether we shall parse the data with a dependency parser; the same applies to tagging and lemmatization
whether and how textual documents annotated in your data resource overlap with texts in UD and/or in CorefUD 1.0
whether any train/dev/test split is already pre-defined for your resource
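To make the links-vs-clusters distinction concrete, here is a small illustration in plain Python (our own sketch, not part of the CorefUD tooling): link-based annotation connects each anaphor to its antecedent, and clusters arise as the transitive closure of such links.

```python
# Illustrative only: turn link-based coreference annotation
# (anaphor -> antecedent pairs) into cluster-based annotation
# by taking the transitive closure of the links.
def links_to_clusters(links):
    cluster_of = {}  # mention id -> set of mention ids in its cluster
    for anaphor, antecedent in links:
        merged = cluster_of.get(anaphor, {anaphor}) | cluster_of.get(antecedent, {antecedent})
        for mention in merged:
            cluster_of[mention] = merged
    return {frozenset(cluster) for cluster in cluster_of.values()}

# "m2 refers back to m1, m3 to m2, m5 to m4" yields two clusters:
# {m1, m2, m3} and {m4, m5}
print(links_to_clusters([("m2", "m1"), ("m3", "m2"), ("m5", "m4")]))
```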
If needed, a Zoom meeting can be organized to clarify the details.
In the next phase, you should deliver a sample of at least 100 sentences converted into the CorefUD format:
make sure the sample file passes the validation using validate.py at least on level 2 (e.g. validate.py --level 2 --lang xy --coref sample.conllu, where xy is the respective ISO language code)
the sample should be produced by the conversion pipeline stored in the import directory, but possibly with manual corrections of the output if needed
ideally produced using the Udapi framework
even if you decide not to use Udapi for the conversion, your resulting data should be parsable by Udapi (see the sketch after this list)
the whole conversion pipeline should be executable in a Linux environment via a Makefile (a sketch of such a Makefile follows this list)
the Makefile should be stored in the import directory
its download target should download the original data
...
all required information about the input resources stored in a markdown file with predefined structure (authors, reference to the primary resource, license, …)
if your input data do not contain morpho-syntactic annotation (lemmas, POS tags, and dependency trees), we will add it automatically using UDPipe 2
the converted data should satisfy the tests in the releasing directory
unless a train/dev/test split is already defined for your resource, you can use the default CorefUD splitter (cyclic division of documents into train:dev:test 8:1:1; see the divide rule in processing/Makefile and the illustrative sketch below); if you want to come up with some other solution, please follow the rules in https://universaldependencies.org/release_checklist.html#data-split
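Regarding parsability by Udapi, a simple round trip through Udapi's Python API is usually enough as a smoke test; the following is a minimal sketch (file names are placeholders):

```python
# Minimal Udapi round-trip check; file names are placeholders.
from udapi.core.document import Document

doc = Document()
doc.load_conllu("sample.conllu")      # fails with an exception if Udapi cannot parse the file
print(sum(1 for _ in doc.nodes), "nodes loaded")
doc.store_conllu("roundtrip.conllu")  # write the document back to CoNLL-U
```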
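As for the import Makefile, the sketch below shows one possible shape; only the download target name comes from the requirements above, while the URL, script names, and the remaining targets are placeholders to be adapted to your resource:

```makefile
# Sketch of an import Makefile; the URL and convert.py are placeholders.
download:
	wget -O original.zip https://example.org/your-resource.zip
	unzip -o original.zip -d original_data

convert: download
	python3 convert.py original_data/ > sample.conllu

validate: convert
	python3 validate.py --level 2 --lang xy --coref sample.conllu

.PHONY: download convert validate
```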
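The idea behind the default cyclic splitter can be pictured as follows; this is our own sketch of the 8:1:1 scheme, not the actual code of the divide rule:

```python
# Illustration of a cyclic train:dev:test 8:1:1 split over documents;
# it mimics the idea of the 'divide' rule, not its exact implementation.
def cyclic_split(documents):
    split = {"train": [], "dev": [], "test": []}
    for i, doc in enumerate(documents):
        part = "dev" if i % 10 == 8 else "test" if i % 10 == 9 else "train"
        split[part].append(doc)
    return split

parts = cyclic_split([f"doc{i:02d}" for i in range(20)])
print({k: len(v) for k, v in parts.items()})  # {'train': 16, 'dev': 2, 'test': 2}
```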
In the final phase, two weeks are reserved for last fixes and the resolution of unexpected problems.