Instructions for Data Contributors

We encourage authors of datasets and corpora with annotated coreference or anaphoric relations to extend the CorefUD collection with their data. Please write to Michal Novák and Zdeněk Žabokrtský if interested. We propose the following timeline for our cooperation:

Phase 1 – Preliminary negotiations (the sooner the better)

  • In this phase, we ask you first to make sure that your data satisfy the following requirements:

    • your data and their derivatives can be distributed under an open and free license, ideally a Creative Commons variant (without the ND, i.e. NoDerivatives, clause)

    • it is fine if your data are still under development and the annotation is planned to continue for a long time, but a reasonable subset with adequate annotation quality must be available by mid-January 2023

    • the data should contain at least 500 sentences (or 10k tokens)

    • the conversion pipeline from your file format into the CorefUD file format can be – at least in theory – fully automatic

  • If your data fulfill the above conditions and you would like to have them included in CorefUD 1.1, please contact us. We will give you access to the GitHub repository and help you integrate your data into the collection.

  • Depending on your source format, we will discuss possible technical solutions for the conversion with you, and ideally also point you to an existing converter for a similar resource.

  • We will also ask you for more detailed information concerning other aspects of your data, e.g.

    • what types of anaphoric relations are annotated in your resource

    • whether a mention is specified by its boundaries and/or by its head

    • whether coreference is represented by links (anaphor–antecedent pairs) or by clusters of mentions (see the sketch after this list)

    • whether you have gold-standard UD trees for the sentences, or whether we should parse the data with a dependency parser; the same applies to tagging and lemmatization

    • whether and how textual documents annotated in your data resource overlap with texts in UD and/or in CorefUD 1.0

    • whether any train/dev/test split is already pre-defined for your resource
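
  To illustrate the links-vs-clusters distinction, here is a minimal Python sketch (purely illustrative; the function and the mention IDs are hypothetical, not part of any CorefUD tooling) that merges anaphor–antecedent links into clusters of mentions using union-find:

      # Group link-based annotation (anaphor -> antecedent pairs)
      # into clusters of mentions.
      def links_to_clusters(links):
          parent = {}

          def find(x):
              parent.setdefault(x, x)
              while parent[x] != x:
                  parent[x] = parent[parent[x]]  # path halving
                  x = parent[x]
              return x

          for anaphor, antecedent in links:
              parent[find(anaphor)] = find(antecedent)

          clusters = {}
          for mention in parent:
              clusters.setdefault(find(mention), set()).add(mention)
          return list(clusters.values())

      # Two links chaining three mentions yield a single cluster.
      print(links_to_clusters([("m2", "m1"), ("m3", "m2")]))
      # -> [{'m1', 'm2', 'm3'}]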

  • If needed, a Zoom meeting can be organized to clarify the details.

Phase 2 – Pilot converter (deadline 20th December 2022)

  • deliver a sample of at least 100 sentences converted into the CorefUD format

  • make sure the sample file passes validation by validate.py at least at level 2 (e.g. validate.py --level 2 --lang xy --coref sample.conllu, where xy is the respective ISO language code)

  • the sample should ideally be produced by a fully automatic Python script located in a subdirectory of the import directory, but manual corrections of the output are acceptable if needed

  • ideally, the script is built on the Udapi framework (a minimal skeleton is sketched below)

  • even if you decide not to use Udapi for the conversion, your resulting data should be parsable by Udapi
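
  As a rough illustration, the skeleton below loads a CoNLL-U file with Udapi and adds coreference annotation. The file names and the MentionStart MISC attribute are our invented placeholders, and the coreference calls (create_coref_entity, create_mention) reflect udapi.core.coref as we understand it, so please verify the exact signatures against the Udapi documentation:

      # Rough sketch of a Udapi-based converter; file names and the
      # MentionStart MISC attribute are hypothetical placeholders.
      from udapi.core.document import Document

      doc = Document()
      doc.load_conllu('parsed_input.conllu')   # morpho-syntax already present

      for bundle in doc.bundles:
          tree = bundle.get_tree()
          for node in tree.descendants:
              # Suppose the source annotation marks mention heads in MISC.
              if node.misc['MentionStart']:
                  # Assumed udapi.core.coref API -- check the Udapi docs.
                  entity = doc.create_coref_entity()
                  entity.create_mention(words=[node], head=node)

      doc.store_conllu('sample.conllu')        # this file goes to validate.py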

Phase 3 – Final converter (deadline 15th January 2023)

  • the whole conversion pipeline must be executable in a Linux environment via a Makefile (a minimal sketch follows the sub-items below)

    • the Makefile is stored in the import directory

    • the download target should download the original data

    • ...
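
  A minimal Makefile along these lines might look as follows; only the download target is prescribed here, while the convert target, the URL, and the file names are purely illustrative assumptions:

      # Illustrative sketch: only the download target is required;
      # the URL, file names and the convert target are hypothetical.
      download:
      	wget -O orig_data.zip https://example.org/your-corpus.zip
      	unzip -o orig_data.zip -d orig_data

      convert: download
      	python3 convert.py orig_data converted.conllu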

  • all required information about the input resources stored in a markdown file with predefined structure (authors, reference to the primary resource, license, …)

  • if your input data do not contain morpho-syntactic annotation (lemmas, POS tags, and dependency trees), we will add it automatically using UDPipe 2

  • the converted data should satisfy the tests in the releasing directory

  • unless a train/dev/test split is already pre-defined for your resource, you can use the default CorefUD splitter (cyclic division of documents into train:dev:test in an 8:1:1 ratio; see the divide rule in processing/Makefile and the sketch below); if you want to come up with some other solution, please follow the rules at https://universaldependencies.org/release_checklist.html#data-split
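
  For concreteness, the cyclic division can be pictured with this Python sketch (an illustration of the idea only; the actual implementation lives in the divide rule of processing/Makefile):

      # Cyclic 8:1:1 split: within every run of 10 documents, the first
      # 8 go to train, the 9th to dev, and the 10th to test.
      def cyclic_split(documents):
          split = {'train': [], 'dev': [], 'test': []}
          for i, doc in enumerate(documents):
              pos = i % 10
              if pos < 8:
                  split['train'].append(doc)
              elif pos == 8:
                  split['dev'].append(doc)
              else:
                  split['test'].append(doc)
          return split

      print(cyclic_split([f'doc{i:02d}' for i in range(12)]))
      # docs 0-7 and 10-11 go to train, doc 8 to dev, doc 9 to test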

Phase 4 – Releasing CorefUD 1.1 (deadline 31st January 2023)

  • two weeks reserved for final fixes and the resolution of unexpected problems