Coreference resolution is the task of clustering together multiple mentions of the same entity appearing in a textual document (e.g. Beethoven, the German composer and he). This CodaLab-powered shared task deals with multilingual coreference resolution and is associated with the CODI-CRAC 2025 Workshop (6th Workshop on Computational Approaches to Discourse and 8th Workshop on Computational Models of Reference, Anaphora and Coreference) held at EMNLP 2025. This shared task builds on the previous three editions, in 2022, 2023, and 2024. The current edition mainly focuses on LLMs and challenges their ability to predict coreference across a range of typologically different languages. Coreference datasets from 16 languages are involved.
This shared task focuses on the development of systems for coreference resolution across multiple languages. Participants are expected to build systems that:
Systems must handle linguistic diversity, including different languages, annotation styles, and presence or absence of zero mentions.
This edition encourages the use of large language models (LLMs) for coreference resolution. While non-LLM models are still welcome, a dedicated LLM Track has been introduced to highlight and explore the capabilities of LLM-based approaches. To accommodate different modeling strategies, two tracks are available:
If you are unsure whether your system qualifies for the LLM track, feel free to ask the organizers. Detailed track-specific rules and constraints are provided in the respective sections below.
You should register for the shared task before proceeding to the development phase. During this phase, you can develop and test your system using provided training and development data, iteratively refining your model. The evaluation phase begins with the release of blind test data, and you must submit your predictions. Final rankings will be based on an official evaluation metric, with separate rankings for each track.
The main rules applicable to both tracks are:
Further details are provided in the LLM Track and Unconstrained Track sections.
We look forward to your participation and encourage innovative approaches to coreference resolution!
The LLM Track is dedicated to approaches that rely on large language models (LLMs) for coreference resolution. LLMs refer to transformer-based models pre-trained on vast textual corpora and capable of generating coherent, context-aware outputs.
To qualify for this track, systems must primarily utilize LLMs for coreference resolution. Acceptable approaches include:
Systems in this track must independently generate mentions (including zero mentions) and cluster them into coreferential entities. In other words, external non-LLM systems cannot be used to solve the entire task or any of its subtasks. As a result, baseline systems from the Unconstrained Track cannot be used here.
If a submission does not fully adhere to these constraints, it must be submitted to the Unconstrained Track. The organizers reserve the right to reassign submissions from the LLM track to the Unconstrained track if they determine that the system does not adhere to the track's requirements.
For any questions about the shared task, contact the organizers via corefud@googlegroups.com.
The original data format is CoNLL-U. While datasets for the LLM track are also provided in plaintext format, generated by the export script, evaluation requires information that plaintext cannot fully capture. Therefore, working with CoNLL-U files is essential. However, if you do not modify the plaintext format, the import script should successfully integrate coreference annotations into the input CoNLL-U files, eliminating the need for an in-depth understanding of the CoNLL-U format.
The following table provides download links for the LLM track data.
| Data type | Empty tokens | Coreference | Morpho-syntax | Forms of empty tokens | Plaintext | Download |
|---|---|---|---|---|---|---|
| Gold | manual | manual | original (manual if available, otherwise automatic) | deleted | included | train, dev |
| Input | deleted | deleted | automatic UDPipe 2 | deleted | included | dev, test |
To support the development of LLM systems, we provide data in the plaintext format, which is more suitable for LLMs compared to the CoNLL-U format. We also offer tools for exporting coreference annotations from CoNLL-U into plaintext and importing them back.
Here is an example of the plaintext format (first sentence of the Catalan-AnCora dataset):
Los|[e1 jugadores de el Espanyol|[e2],e1] aseguraron hoy que ##|[e1] prefieren enfrentar se a el Barcelona|[e3] en la|[e4 final de la|[e5 Copa de el Rey|e4],e5] en lugar de en las|[e6 semifinales|e6] , tras clasificar se ayer ambos|[e7 equipos catalanes|e7] para esta|[e6 ronda|e6] .
The plaintext format is a simple text file where each line corresponds to a document and tokens are separated by spaces. Coreference annotations are expressed on the token level, after the '|' character at the end of a token. The annotations are represented in a form similar to the CoNLL-U Entity attribute, but with square brackets defining span boundaries. Empty nodes are represented by the '##' prefix; if an empty node has a form or lemma in the original data, they are appended right after it.
Participants are encouraged to use the text2text-coref tool for exporting coreference annotations from CoNLL-U into plaintext and importing them back.
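For quick inspection or prototyping, the annotations can also be parsed directly. The following is a minimal sketch (not part of the official tooling), assuming entity labels are simple ids like e1 and that nested mentions of the same entity close in last-in-first-out order:
from collections import defaultdict
def parse_plaintext_line(line):
    """Extract tokens and mention spans from one document line of the plaintext format."""
    open_stacks = defaultdict(list)   # entity id -> stack of start positions of open mentions
    spans = []                        # (entity_id, start, end), token indices, inclusive
    tokens = []
    for i, item in enumerate(line.split()):
        form, _, ann = item.partition("|")   # the annotation follows the first '|', if any
        tokens.append(form)
        for piece in filter(None, ann.split(",")):
            eid = piece.strip("[]")
            if piece.startswith("["):        # a mention of eid starts at this token
                open_stacks[eid].append(i)
            if piece.endswith("]"):          # a mention of eid ends at this token
                spans.append((eid, open_stacks[eid].pop(), i))
    return tokens, spans
# Example with a fragment of the Catalan sentence shown above:
# parse_plaintext_line("las|[e6 semifinales|e6] , tras")
#   -> (['las', 'semifinales', ',', 'tras'], [('e6', 0, 1)])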
Regardless of whether you use the plaintext format provided or modify the export script, you will need to import the coreference annotations back into CoNLL-U format to run the official scorer. Run the following commands to clean the output from the LLM system and then import the coreference annotations back into CoNLL-U:
# clean the output of the LLM system
python -m text2text_coref clean [input_txt_file] [conll_skeleton_file] [output_txt_file]
# import the coreference annotations back to CoNLL-U
python -m text2text_coref text2conllu [input_file] [conll_skeleton_file] [output_conll_file]
For more information about the text2text-coref tool, refer to the documentation in the tool's GitHub repository.
To qualify for the LLM track, your solution must rely on one or more existing LLMs, utilizing methods such as in-context learning, prompt tuning, or fine-tuning. If you use non-LLM models or other existing systems, your solution will be placed in the unconstrained track (for more examples, see here). If you're unsure whether your system meets the LLM track requirements, feel free to reach out to us for clarification.
The official scorer for the shared task is the CorefUD scorer in its latest version. Its functionality is guaranteed to remain unchanged from the start of the development phase through to the end of the evaluation phase.
The scorer requires two CoNLL-U files as input: one with gold annotations and one with predicted annotations. Therefore, all plaintext predictions must be converted back to the CoNLL-U format before evaluation (for more details, see here).
We calculate the primary score under the following setup:
All shared task participants are invited to submit system description papers to the CODI-CRAC 2025 Workshop. Submission details will be provided soon.
Unlike the LLM Track, the Unconstrained Track allows you to leverage any approach to address the task. You should submit your system to this track if:
Since existing non-LLM systems are allowed, the organizers provide two baseline systems to support participants: one for empty node reconstruction and another for coreference resolution. You can either extend these baseline systems or use their outputs as a starting point. Unlike in the LLM Track, you must first choose a starting point, which determines the amount of work required for your own system.
There is no plaintext format provided for the Unconstrained Track; the data is available only in the original CoNLL-U format. Although this may present a steeper learning curve, it offers greater control over the final output, ensuring better alignment with evaluation criteria.
Registration is the same for both tracks. See the LLM-track registration section for more details.
Participants can choose from different starting points for joining the shared task, which vary based on the amount of work they need to do on their own. Depending on the starting point chosen, different degrees of predictions by baseline systems are available.
There are three starting points:
Given that the shared task data comprises multiple datasets in different languages, participants have the flexibility to approach the task from various starting points across the datasets/languages.
In the unconstrained track, the data source is identical to that in the LLM track: CorefUD 1.3. For the details on the data collection shared between the tracks, see the LLM track data section.
There are two main differences in how the data is pre-processed for the unconstrained track:
Download the data for the unconstrained track from the following table. Choose the input data variant based on the starting point you have chosen.
| Data type | Starting point | Empty tokens | Coreference | Morpho-syntax | Forms of empty tokens | Plaintext | Download |
|---|---|---|---|---|---|---|---|
| Gold | All | manual | manual | original (manual if available, otherwise automatic) | deleted | not included | train, dev |
| Input | Coref. and zeros from scratch | deleted | deleted | automatic UDPipe 2 | deleted | not included | dev, test |
| Input | Coref. from scratch | automatic baseline | deleted | automatic UDPipe 2 | deleted | not included | dev, test |
| Input | Refine the baseline | automatic baseline | automatic baseline | automatic UDPipe 2 | deleted | not included | dev, test |
Udapi is a Python API for reading, writing, querying and editing Universal Dependencies data in the CoNLL-U format (and several other formats). It also supports coreference annotations (and was used for producing CorefUD). You can use Udapi to access and write the data in a comfortable way. See the following example Python script, which reads the Spanish blind dev file, creates coreference entities with their mentions (including a zero mention on a newly created empty node), and stores the result into a CoNLL-U file:
#!/usr/bin/env python3
import udapi
# Extract the words of the first sentence in the Spanish blind dev set.
doc = udapi.Document("es_ancora-corefud-dev.conllu")
trees = list(doc.trees)
words = trees[0].descendants
print([w.form for w in words])
#['Los', 'jugadores', 'de', 'el', 'Espanyol', 'aseguraron', 'hoy', 'que',
# 'prefieren', 'enfrentar', 'se', 'a', 'el', 'Barcelona', 'en', 'la', 'final',
# 'de', 'la', 'Copa', 'de', 'el', 'Rey', 'en', 'lugar', 'de', 'en', 'las',
# 'semifinales', ',', 'tras', 'clasificar', 'se', 'ayer', 'ambos', 'equipos',
# 'catalanes', 'para', 'esta', 'ronda', '.']
# Create entity e1 with two mentions: "las semifinales" and "esta ronda"
e1 = doc.create_coref_entity()
e1.create_mention(words=words[27:29], head=words[28])
e1.create_mention(words=words[38:40], head=words[39])
# Create an empty node (zero) before the 9th word "prefieren".
zero = words[8].create_empty_child(deprel="nsubj", after=False, form="_")
# Make sure the input file es_ancora-corefud-dev.conllu is really
# the blind dev set without any empty nodes.
assert zero == trees[0].descendants_and_empty[8], "unexpected input file"
# Create entity e2 with two mentions:
# "Los jugadores de el Espanyol" and the newly created zero.
e2 = doc.create_coref_entity()
e2.create_mention(words=words[0:5], head=words[1])
e2.create_mention(words=[zero], head=zero)
# Print the newly created coreference entities.
udapi.create_block("corefud.PrintEntities").process_document(doc)
# Save the predictions into a CoNLL-U file
doc.store_conllu("output.conllu")
For getting a deeper insight into Udapi, you can use
If you use the Udapi interface for loading and storing the shared task data, which is the recommended way, you don't have to deal with the file format at all. However, it may be useful to understand the format for quick glimpses into the data.
The full specification of the CoNLL-U format is available at the website of Universal Dependencies. In a nutshell: every token has its own line; lines starting with # are sentence-level comments, and empty lines terminate a sentence. Regular token lines start with an integer number. There are also lines starting with intervals (e.g. 4-5), which introduce what UD calls “multi-word tokens”; these lines must be preserved in the output, but otherwise participants do not have to care about them (coreference annotation does not occur on them). Finally, there are also lines starting with decimal numbers (e.g. 2.1), which correspond to empty nodes in the dependency graph; these nodes may represent zero mentions and may contain coreference annotation. Every token/node line contains 10 tab-separated fields (columns). The first column is the numeric ID of the token/node, the next column contains the word FORM; any coreference annotation, if present, appears in the last column, which is called MISC. The file must use Linux-style line breaks, that is, a single LF character, rather than CR LF, which is common on Windows.
The MISC column is either a single underscore (_), meaning there is no extra annotation, or one or more pieces of annotation (typically in the Attribute=Value form), separated by vertical bars (|). The annotation pieces relevant for this shared task always start with Entity=; these should be learned from the training data and predicted for the test data. Any other annotation that is present in the MISC column of the input file should be preserved in the output (especially note that if you discard SpaceAfter=No, or introduce a new one, the validator may report the file as invalid). For more information on the Entity attribute, see the PDF with the description of the CorefUD 1.0 format (the CorefUD 1.2 format is identical).
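To illustrate how the MISC column is laid out, here is a minimal sketch of extracting the Entity value from a MISC string (Udapi parses this for you, so this is only for orientation; the example values in the comments are hypothetical):
def entity_annotation(misc_field):
    """Return the value of the Entity attribute in a MISC field, or None if absent."""
    if misc_field == "_":               # a lone underscore means no annotation at all
        return None
    for piece in misc_field.split("|"): # annotation pieces are separated by vertical bars
        if piece.startswith("Entity="):
            return piece[len("Entity="):]
    return None
# entity_annotation("_")                                   -> None
# entity_annotation("SpaceAfter=No|Entity=(e9-place-2-")   -> "(e9-place-2-"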
You have virtually no limits in building your system. You can develop it from scratch or extend/modify the two baseline systems that we provide the participants with: the baseline for predicting empty tokens, and the baseline for coreference resolution. If you want to treat the baseline systems as black boxes and base your system just on their predictions, choose either the "Coreference from scratch" or the "Refine the baseline" starting points.
Your coreference resolution system is supposed to identify sets of tokens as mentions and cluster them into coreferential entities. To identify a mention, your system is expected to predict a mention head word. However, it is still advisable to predict the full mention span, too (the reasons are explained here). If your system is not able to predict the mention heads (i.e. it predicts mention spans only, and the head index is always 1), mention heads can be estimated using the provided dependency tree and heuristics, e.g. the ones provided by Udapi, using the following command: udapy -s corefud.MoveHead < in.conllu > out.conllu
If you choose the "Coreference and zeros from scratch" starting point, your system is supposed to reconstruct empty tokens prior to coreference resolution. A newly added empty token must be connected to the rest of the sentence by an enhanced dependency relation. Your system is thus expected to identify the parent token of the empty token. It is also advisable to predict a type of the dependency relation (the reasons are explained here). If your system is not able to predict the dependency relation type, set each type to dep
(empty value would cause the validation tests to fail).
The system for predicting empty tokens (zeros) can be downloaded here. We have applied this system to the data for the "Coref. and zeros from scratch" starting point to produce the data for the "Coref. from scratch" starting point. The system predicts the position of empty tokens in a sentence and the DEPS column, i.e. their parent in the enhanced dependencies and the dependency relation (deprel). While CoNLL-U allows multiple enhanced parents in the DEPS column, the baseline system predicts only one (the training data was pre-processed with the corefud.SingleParent Udapi block). The baseline system does not predict any attributes of the empty nodes, so all the CoNLL-U columns except DEPS (including FORM) are empty (i.e. _).
The baseline coreference resolution system is based on the multilingual coreference resolution system presented in [7], using multilingual BERT in the end-to-end setting. The system only predicts the coreference annotation in the MISC column; that is, if the input files do not contain empty nodes, the system cannot reconstruct them and consequently fails at resolving zero anaphora. We have applied this system to the data for the "Coref. from scratch" starting point to produce the data for the "Refine the baseline" starting point.
Many things can go wrong when filling the predicted coreference annotation into the CoNLL-U format, especially if you are not using the API (incorrect syntax in the MISC column, unmatched brackets, etc.). Although the evaluation script may recover from many potential validation errors, it is highly recommended to check validity before submitting the files, so that you do not run out of the maximum number of daily submission trials.
For the CoNLL-U file produced by your system to be ready for submission, it must satisfy the following two criteria:
The official UD validator will be used to check the validity of the CoNLL-U format. Anyone can obtain it by cloning the UD tools repository from GitHub and running the script validate.py. Python 3 is needed to run the script (depending on your system, it may be available under the command python or python3; if in doubt, try python -V to see the version).
$ git clone git@github.com:UniversalDependencies/tools.git
$ cd tools
$ python3 validate.py -h
In addition, a third-party module called regex must be installed via pip. Try this if you do not have the module already:
$ sudo apt-get install python3-pip; python3 -m pip install regex
The validation script distinguishes several levels of validity; level 2 is sufficient in the shared task, as the higher levels deal with morphosyntactic requirements on the UD-released treebanks. On the other hand, we will use the --coref option to turn on tests specific to coreference annotation. The validator also requires the option --lang xx, where xx is the ISO language code of the dataset.
$ python3 validate.py --level 2 --coref --lang cs cs_pdt-corefud-test.conllu
*** PASSED ***
If there are errors, the script prints messages describing the location and nature of each error, prints *** FAILED *** with (number of) errors, and returns a non-zero exit value. If the file is OK, the script prints *** PASSED *** and returns zero as its exit value. The script may also print warning messages that point to potential problems in the file; these are not considered errors and will not make the file invalid.
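If you prefer to run this check from Python (e.g. as part of a submission pipeline), a small wrapper along the following lines can be used; the location of validate.py is an assumption and must match wherever you cloned the UD tools repository:
import subprocess
import sys
def validate_conllu(conllu_path, lang):
    """Run the official UD validator on a CoNLL-U file; return True iff it passes."""
    cmd = [sys.executable, "tools/validate.py",       # path to your clone of the UD tools repo
           "--level", "2", "--coref", "--lang", lang, conllu_path]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:                        # a non-zero exit value signals *** FAILED ***
        print(result.stdout, result.stderr, sep="\n")
    return result.returncode == 0
# e.g. validate_conllu("cs_pdt-corefud-test.conllu", "cs")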
The scorer and the evaluation setup are shared between both tracks of the shared task. See the LLM-track evaluation section for more details. In the following subsections, we only elaborate on the matching strategies for overt and zero mentions.
The primary score is calculated using head match. That is, to compare gold and predicted mentions, we compare their heads. Submitted systems are thus expected to predict a mention head word by filling its relative position among the words of the corresponding mention span into the Entity attribute. For example, the annotation Entity=(e9-place-2- identifies the second word of the mention as its head.
However, it is still advisable to predict full mention spans, too. Evaluation with head matching uses them to disambiguate between mentions with the same head token. In addition, systems that predict only mention heads are likely to fail in the evaluation with exact matching, which will be calculated as one of the supplementary scores.
If the submitted system is not able to predict the mention heads (i.e. it predicts mention spans only, and the head index is always 1), mention heads can be estimated using the provided dependency tree and heuristics, e.g. the ones provided by Udapi, using the following command: udapy -s corefud.MoveHead < in.conllu > out.conllu
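If you want to implement such a head heuristic yourself rather than call Udapi, the sketch below illustrates the basic idea (a simplification, not a faithful reimplementation of corefud.MoveHead): pick the word of the span whose syntactic parent lies outside the span.
def estimate_head(mention_words):
    """Guess a head for a mention span given Udapi word nodes with .parent links."""
    span = set(mention_words)
    for word in mention_words:
        # a word governed from outside the span is a natural head candidate
        if word.parent is None or word.parent not in span:
            return word
    return mention_words[0]   # fallback for degenerate spans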
Since the 2024 edition, participants have also been expected to predict the empty nodes involved in zero anaphora (if they opt for the LLM track or the "Coreference and zeros from scratch" starting point in the unconstrained track). In the system outputs, some empty nodes may be missing and some may be spurious. In addition, some empty nodes may be predicted at different surface positions within the sentence while playing the same role. Nevertheless, if such empty nodes are heads of the gold and the predicted mention, the evaluation method must be capable of matching these zero mentions.
The shared task applies a dependency-based method of matching zero mentions. It looks for the matching of zeros within the same sentence that maximizes the F-score of predicting dependencies of zeros in the DEPS field. Specifically, the task is cast as searching for a 1-to-1 matching in a weighted bipartite graph (with gold mentions and predicted mentions as partitions) that maximizes the total sum of weights in the matching. Each candidate pair (gold zero mention, predicted zero mention) is weighted with a non-zero score only if the two mentions belong to the same sentence. The score is then calculated as a weighted sum of two features:
The scoring prioritizes exact matches of both parents and dependency types, while matches of parents alone serve only to break ties.
Note that matching zero mentions by their dependencies is applied first, preceding the matching strategies for non-zero mentions. Zeros that have not been matched to other zeros may then be matched to non-zero mentions. Although such matching may seem counterintuitive, it can be valid in cases where a predicted zero mention is incorrectly labeled as non-zero, or vice versa, often due to the wrong choice of the head in multi-token mentions involving empty tokens.
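As an illustration of this matching step, the sketch below casts it as an assignment problem; the concrete weights are an assumption derived from the description above (a parent+deprel match dominates, a parent-only match merely breaks ties), and the official CorefUD scorer remains the authoritative implementation. Each zero is represented here as a dict with sent, parent and deprel keys.
import numpy as np
from scipy.optimize import linear_sum_assignment
def match_zeros(gold_zeros, pred_zeros):
    """1-to-1 matching of gold and predicted zero mentions maximizing the total weight."""
    weights = np.zeros((len(gold_zeros), len(pred_zeros)))
    for i, g in enumerate(gold_zeros):
        for j, p in enumerate(pred_zeros):
            if g["sent"] != p["sent"]:
                continue                      # only zeros from the same sentence may match
            if g["parent"] == p["parent"]:
                weights[i, j] = 1.0           # parent agrees: weak evidence, breaks ties
                if g["deprel"] == p["deprel"]:
                    weights[i, j] += 10.0     # parent and deprel agree: strongly preferred
    rows, cols = linear_sum_assignment(weights, maximize=True)
    return [(i, j) for i, j in zip(rows, cols) if weights[i, j] > 0]
# e.g. match_zeros([{"sent": 3, "parent": 7, "deprel": "nsubj"}],
#                  [{"sent": 3, "parent": 7, "deprel": "dep"}])   -> [(0, 0)]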
Submissions to the unconstrained track are collected through CodaLab. Note that this is a different CodaLab URL than the one used for the LLM track.
The remaining instructions and details about the submission process are exactly the same as for the LLM track. Please refer to the submission instructions in the LLM track for more information.
All participants of the unconstrained track are invited to submit their system description papers to the CODI-CRAC 2025 Workshop. Submission details are the same as for the LLM track (see here).
Training, development, and test datasets are subject to license agreements specified individually for each dataset in the public edition of the CorefUD 1.3 collection (which, in turn, are the same as license agreements of the original resources before CorefUD harmonization). In all cases, the licenses are sufficient for using the data for the CRAC 2025 shared task purposes. However, the participants must check the license agreements in case they want to use their trained models also for other purposes; for instance, usage for commercial purposes is prohibited with several CorefUD datasets as they are available under CC BY-NC-SA.
Whenever using the CorefUD 1.3 collection (inside or outside this shared task), please cite it as follows:
@misc{11234/1-5478,
  title = {Coreference in Universal Dependencies 1.2 ({CorefUD} 1.2)},
  author = {Popel, Martin and Nov{\'a}k, Michal and {\v Z}abokrtsk{\'y}, Zden{\v e}k and Zeman, Daniel and Nedoluzhko, Anna and Acar, Kutay and Bamman, David and Bourgonje, Peter and Cinkov{\'a}, Silvie and Eckhoff, Hanne and Cebiro{\u g}lu Eryi{\u g}it, G{\"u}l{\c s}en and Haji{\v c}, Jan and Hardmeier, Christian and Haug, Dag and J{\o}rgensen, Tollef and K{\aa}sen, Andre and Krielke, Pauline and Landragin, Fr{\'e}d{\'e}ric and Lapshinova-Koltunski, Ekaterina and M{\ae}hlum, Petter and Mart{\'{\i}}, M. Ant{\`o}nia and Mikulov{\'a}, Marie and N{\o}klestad, Anders and Ogrodniczuk, Maciej and {\O}vrelid, Lilja and Pamay Arslan, Tu{\u g}ba and Recasens, Marta and Solberg, Per Erik and Stede, Manfred and Straka, Milan and Swanson, Daniel and Toldova, Svetlana and Vad{\'a}sz, No{\'e}mi and Velldal, Erik and Vincze, Veronika and Zeldes, Amir and {\v Z}itkus, Voldemaras},
  url = {http://hdl.handle.net/11234/1-5478},
  note = {{LINDAT}/{CLARIAH}-{CZ} digital library at the Institute of Formal and Applied Linguistics ({{\'U}FAL}), Faculty of Mathematics and Physics, Charles University},
  copyright = {Licence {CorefUD} v1.2},
  year = {2024}
}
For a more general reference to CorefUD harmonization efforts, please cite the following LREC paper:
@inproceedings{biblio8283899234757555533,
  author = {Anna Nedoluzhko and Michal Novák and Martin Popel and Zdeněk Žabokrtský and Amir Zeldes and Daniel Zeman},
  year = {2022},
  title = {CorefUD 1.0: Coreference Meets Universal Dependencies},
  booktitle = {Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022)},
  pages = {4859--4872},
  publisher = {European Language Resources Association},
  address = {Marseille, France},
  isbn = {979-10-95546-72-6}
}
By submitting results to this competition, the participants consent to the public release of their scores at the CODI-CRAC 2025 workshop and in one of the associated proceedings, at the task organizers' discretion. Participants further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgment that the submission was erroneous or deceptive.
Charles University (Prague, Czechia): Anna Nedoluzhko, Michal Novák, Martin Popel, Milan Straka, Zdeněk Žabokrtský, Daniel Zeman
University of West Bohemia (Pilsen, Czechia): Miloslav Konopík, Ondřej Pražák, Jakub Sido
You can send any questions about the shared task to the organizers via corefud@googlegroups.com.
The main differences between the three editions are as follows:
Inspired by the Universal Dependencies initiative (UD) [1], the coreference community has started discussions on establishing a universal annotation scheme and using it to harmonize existing corpora. The discussions at the CRAC 2020 workshop led to proposing the Universal Anaphora initiative. One of the lines of effort related to Universal Anaphora resulted in CorefUD, which is a multilingual collection of coreference data resources harmonized under a common scheme [2]. The current public release of CorefUD 1.3 contains 23 datasets for 16 languages, namely Ancient Greek, Ancient Hebrew, Catalan, Czech (2×), English (3×), French (2×), German (2×), Hungarian (2×), Korean, Lithuanian, Norwegian (2×), Old Church Slavonic, Polish, Russian, Spanish, and Turkish. The CRAC 2025 shared task deals with coreference resolution in all these languages. It is the 4th edition of the shared task; findings of the previous three editions can be found in [8]-[10].
References
This shared task is supported by the Grants No. 20-16819X (LUSyD) of the Czech Science Foundation, UNCE24/SSH/009, and LM2023062 (LINDAT/CLARIAH-CZ) of the Ministry of Education, Youth, and Sports of the Czech Republic.