Coreference resolution is the task of clustering together multiple mentions of the same entity appearing in a textual document (e.g. Beethoven, the German composer and he). This CodaLab-powered shared task deals with multilingual coreference resolution and is associated with the CODI-CRAC 2025 Workshop (6th Workshop on Computational Approaches to Discourse and 8th Workshop on Computational Models of Reference, Anaphora and Coreference) held at EMNLP 2025. This shared task builds on the previous three editions, in 2022, 2023, and 2024. The current edition mainly focuses on LLMs and challenges their ability to predict coreference across a range of typologically different languages. Coreference datasets from 16 languages are involved.
This shared task focuses on the development of systems for coreference resolution across multiple languages. Participants are expected to build systems that:
Systems must handle linguistic diversity, including different languages, annotation styles, and presence or absence of zero mentions.
This edition encourages the use of large language models (LLMs) for coreference resolution. While non-LLM models are still welcome, a dedicated LLM Track has been introduced to highlight and explore the capabilities of LLM-based approaches. To accommodate different modeling strategies, two tracks are available:
If you are unsure whether your system qualifies for the LLM track, feel free to ask the organizers. Detailed track-specific rules and constraints are provided in the respective sections below.
You should register for the shared task before proceeding to the development phase. During this phase, you can develop and test your system using provided training and development data, iteratively refining your model. The evaluation phase begins with the release of blind test data, and you must submit your predictions. Final rankings will be based on an official evaluation metric, with separate rankings for each track.
The main rules applicable to both tracks are:
Further details are provided in the LLM Track and Unconstrained Track sections.
We look forward to your participation and encourage innovative approaches to coreference resolution!
The LLM Track is dedicated to approaches that rely on large language models (LLMs) for coreference resolution. LLMs refer to transformer-based models pre-trained on vast textual corpora and capable of generating coherent, context-aware outputs.
To qualify for this track, systems must primarily utilize LLMs for coreference resolution. Acceptable approaches include:
Systems in this track must independently generate mentions (including zero mentions) and cluster them into coreferential entities. In other words, external non-LLM systems cannot be used to solve the entire task or any of its subtasks. As a result, baseline systems from the Unconstrained Track cannot be used here.
If a submission does not fully adhere to these constraints, it must be submitted to the Unconstrained Track. The organizers reserve the right to reassign submissions from the LLM track to the Unconstrained track if they determine that the system does not adhere to the track's requirements.
For any questions about the shared task, contact the organizers via corefud@googlegroups.com.
The original data format is CoNLL-U. While datasets for the LLM track are also provided in plaintext format, generated by the export script, evaluation requires information that plaintext cannot fully capture. Therefore, working with CoNLL-U files is essential. However, if you do not modify the plaintext format, the import script should successfully integrate coreference annotations into the input CoNLL-U files, eliminating the need for an in-depth understanding of the CoNLL-U format.
The following table provides download links for the LLM track data.
| Data type | Empty tokens | Coreference | Morpho-syntax | Forms of empty tokens | Plaintext | Download |
|---|---|---|---|---|---|---|
| Gold | manual | manual | original (manual if available, otherwise automatic) | deleted | included | train, dev |
| Input | deleted | deleted | automatic UDPipe 2 | deleted | included | dev, test |
To support the development of LLM systems, we provide data in the plaintext format, which is more suitable for LLMs compared to the CoNLL-U format. We also offer tools for exporting coreference annotations from CoNLL-U into plaintext and importing them back.
Here is an example of the plaintext format (first sentence of the Catalan-AnCora dataset):
Los|[e1 jugadores de el Espanyol|[e2],e1] aseguraron hoy que ##|[e1] prefieren enfrentar se a el Barcelona|[e3] en la|[e4 final de la|[e5 Copa de el Rey|e4],e5] en lugar de en las|[e6 semifinales|e6] , tras clasificar se ayer ambos|[e7 equipos catalanes|e7] para esta|[e6 ronda|e6] .
The plaintext format is a simple text file where each line corresponds to a document and tokens are separated by spaces. Coreference annotations are expressed on the token level, after the '|' character at the end of a token. The annotations are represented in a form similar to the CoNLL-U Entity attribute, but with square brackets defining span boundaries. Empty nodes are represented by the '##' prefix; if an empty node has a form or lemma in the original data, they are appended right after it.
Participants are encouraged to use the text2text-coref tool for exporting coreference annotations from CoNLL-U into plaintext and importing them back.
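For quick inspection or prototyping, the annotations can also be parsed directly. The following is a minimal sketch (not part of the official tooling), assuming entity labels are simple ids like e1 and that nested mentions of the same entity close in last-in-first-out order:
from collections import defaultdict
def parse_plaintext_line(line):
    """Extract tokens and mention spans from one document line of the plaintext format."""
    open_stacks = defaultdict(list)   # entity id -> stack of start positions of open mentions
    spans = []                        # (entity_id, start, end), token indices, inclusive
    tokens = []
    for i, item in enumerate(line.split()):
        form, _, ann = item.partition("|")   # the annotation follows the first '|', if any
        tokens.append(form)
        for piece in filter(None, ann.split(",")):
            eid = piece.strip("[]")
            if piece.startswith("["):        # a mention of eid starts at this token
                open_stacks[eid].append(i)
            if piece.endswith("]"):          # a mention of eid ends at this token
                spans.append((eid, open_stacks[eid].pop(), i))
    return tokens, spans
# Example with a fragment of the Catalan sentence shown above:
# parse_plaintext_line("las|[e6 semifinales|e6] , tras")
#   -> (['las', 'semifinales', ',', 'tras'], [('e6', 0, 1)])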
Regardless of whether you use the plaintext format provided or modify the export script, you will need to import the coreference annotations back into CoNLL-U format to run the official scorer. Run the following commands to clean the output from the LLM system and then import the coreference annotations back into CoNLL-U:
# clean the output of the LLM system
python -m text2text_coref clean [input_txt_file] [conll_skeleton_file] [output_txt_file]
# import the coreference annotations back to CoNLL-U
python -m text2text_coref text2conllu [input_file] [conll_skeleton_file] [output_conll_file]
For more information about the text2text-coref tool, refer to the documentation in the tool's GitHub repository.
To qualify for the LLM track, your solution must rely on one or more existing LLMs, utilizing methods such as in-context learning, prompt tuning, or fine-tuning. If you use non-LLM models or other existing systems, your solution will be placed in the unconstrained track (for more examples, see here). If you're unsure whether your system meets the LLM track requirements, feel free to reach out to us for clarification.
The official scorer for the shared task is the CorefUD scorer in its latest version. Its functionality is guaranteed to remain unchanged from the start of the development phase through to the end of the evaluation phase.
The scorer requires two CoNLL-U files as input: one with gold annotations and one with predicted annotations. Therefore, all plaintext predictions must be converted back to the CoNLL-U format before evaluation (for more details, see here).
We calculate the primary score under the following setup:
All shared task participants are invited to submit system description papers to the CODI-CRAC 2025 Workshop. Submission details will be provided soon.
Unlike the LLM Track, the Unconstrained Track allows you to leverage any approach to address the task. You should submit your system to this track if:
Since existing non-LLM systems are allowed, the organizers provide two baseline systems to support participants: one for empty node reconstruction and another for coreference resolution. You can either extend these baseline systems or use their outputs as a starting point. Unlike in the LLM Track, you must first choose a starting point, which determines the amount of work required for your own system.
There is no plaintext format provided for the Unconstrained Track; the data is available only in the original CoNLL-U format. Although this may present a steeper learning curve, it offers greater control over the final output, ensuring better alignment with evaluation criteria.
Registration is the same for both tracks. See the LLM-track registration section for more details.
Participants can choose from different starting points for joining the shared task, which vary based on the amount of work they need to do on their own. Depending on the starting point chosen, different degrees of predictions by baseline systems are available.
There are three starting points:
Given that the shared task data comprises multiple datasets in different languages, participants have the flexibility to approach the task from various starting points across the datasets/languages.
In the unconstrained track, the data source is identical to that in the LLM track: CorefUD 1.3. For the details on the data collection shared between the tracks, see the LLM track data section.
There are two main differences in how the data is pre-processed for the unconstrained track:
Download the data for the unconstrained track from the following table. Choose the input data variant based on the starting point you have chosen.
| Data type | Starting point | Empty tokens | Coreference | Morpho-syntax | Forms of empty tokens | Plaintext | Download |
|---|---|---|---|---|---|---|---|
| Gold | All | manual | manual | original (manual if available, otherwise automatic) | deleted | not included | train, dev |
| Input | Coref. and zeros from scratch | deleted | deleted | automatic UDPipe 2 | deleted | not included | dev, test |
| Input | Coref. from scratch | automatic baseline | deleted | automatic UDPipe 2 | deleted | not included | dev, test |
| Input | Refine the baseline | automatic baseline | automatic baseline | automatic UDPipe 2 | deleted | not included | dev, test |
Udapi is a Python API for reading, writing, querying and editing Universal Dependencies data in the CoNLL-U format (and several other formats). It also supports coreference annotations (and was used for producing CorefUD). You can use Udapi to access and write the data in a comfortable way. See the following example Python script, which reads the Spanish blind dev file, creates coreference entities with their mentions (including a zero mention on a newly created empty node), and stores the result into a CoNLL-U file:
#!/usr/bin/env python3
import udapi
# Extract the words of the first sentence in the Spanish blind dev set.
doc = udapi.Document("es_ancora-corefud-dev.conllu")
trees = list(doc.trees)
words = trees[0].descendants
print([w.form for w in words])
#['Los', 'jugadores', 'de', 'el', 'Espanyol', 'aseguraron', 'hoy', 'que',
# 'prefieren', 'enfrentar', 'se', 'a', 'el', 'Barcelona', 'en', 'la', 'final',
# 'de', 'la', 'Copa', 'de', 'el', 'Rey', 'en', 'lugar', 'de', 'en', 'las',
# 'semifinales', ',', 'tras', 'clasificar', 'se', 'ayer', 'ambos', 'equipos',
# 'catalanes', 'para', 'esta', 'ronda', '.']
# Create entity e1 with two mentions: "las semifinales" and "esta ronda"
e1 = doc.create_coref_entity()
e1.create_mention(words=words[27:29], head=words[28])
e1.create_mention(words=words[38:40], head=words[39])
# Create an empty node (zero) before the 9th word "prefieren".
zero = words[8].create_empty_child(deprel="nsubj", after=False, form="_")
# Make sure the input file es_ancora-corefud-dev.conllu is really
# the blind dev set without any empty nodes.
assert zero == trees[0].descendants_and_empty[8], "unexpected input file"
# Create entity e2 with two mentions:
# "Los jugadores de el Espanyol" and the newly created zero.
e2 = doc.create_coref_entity()
e2.create_mention(words=words[0:5], head=words[1])
e2.create_mention(words=[zero], head=zero)
# Print the newly created coreference entities.
udapi.create_block("corefud.PrintEntities").process_document(doc)
# Save the predictions into a CoNLL-U file
doc.store_conllu("output.conllu")
For getting a deeper insight into Udapi, you can use
If you use the Udapi interface for loading and storing the shared task data, which is the recommended way, you don't have to deal with the file format at all. However, it may be useful to understand the format for quick glimpses into the data.
The full specification of the CoNLL-U format is available at the website of Universal Dependencies. In a nutshell: every token has its own line; lines starting with # are sentence-level comments, and empty lines terminate a sentence. Regular token lines start with an integer number. There are also lines starting with intervals (e.g. 4-5), which introduce what UD calls “multi-word tokens”; these lines must be preserved in the output, but otherwise participants do not have to care about them (coreference annotation does not occur on them). Finally, there are also lines starting with decimal numbers (e.g. 2.1), which correspond to empty nodes in the dependency graph; these nodes may represent zero mentions and may contain coreference annotation. Every token/node line contains 10 tab-separated fields (columns). The first column is the numeric ID of the token/node, the next column contains the word FORM; any coreference annotation, if present, appears in the last column, which is called MISC. The file must use Linux-style line breaks, that is, a single LF character, rather than CR LF, which is common on Windows.
The MISC column is either a single underscore (_), meaning there is no extra annotation, or one or more pieces of annotation (typically in the Attribute=Value form), separated by vertical bars (|). The annotation pieces relevant for this shared task always start with Entity=; these should be learned from the training data and predicted for the test data. Any other annotation that is present in the MISC column of the input file should be preserved in the output (especially note that if you discard SpaceAfter=No, or introduce a new one, the validator may report the file as invalid). For more information on the Entity attribute, see the PDF with the description of the CorefUD 1.0 format (the CorefUD 1.2 format is identical).
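To illustrate how the MISC column is laid out, here is a minimal sketch of extracting the Entity value from a MISC string (Udapi parses this for you, so this is only for orientation; the example values in the comments are hypothetical):
def entity_annotation(misc_field):
    """Return the value of the Entity attribute in a MISC field, or None if absent."""
    if misc_field == "_":               # a lone underscore means no annotation at all
        return None
    for piece in misc_field.split("|"): # annotation pieces are separated by vertical bars
        if piece.startswith("Entity="):
            return piece[len("Entity="):]
    return None
# entity_annotation("_")                                   -> None
# entity_annotation("SpaceAfter=No|Entity=(e9-place-2-")   -> "(e9-place-2-"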
You have virtually no limits in building your system. You can develop it from scratch or extend/modify the two baseline systems that we provide the participants with: the baseline for predicting empty tokens, and the baseline for coreference resolution. If you want to treat the baseline systems as black boxes and base your system just on their predictions, choose either the "Coreference from scratch" or the "Refine the baseline" starting points.
Your coreference resolution system is supposed to identify sets of tokens as mentions and cluster them into coreferential entities. To identify a mention, your system is expected to predict a mention head word. However, it is still advisable to predict the full mention span, too (the reasons are explained here). If your system is not able to predict the mention heads (i.e. it predicts mention spans only, and the head index is always 1), mention heads can be estimated using the provided dependency tree and heuristics, e.g. the ones provided by Udapi, using the following command: udapy -s corefud.MoveHead < in.conllu > out.conllu
If you choose the "Coreference and zeros from scratch" starting point, your system is supposed to reconstruct empty tokens prior to coreference resolution. A newly added empty token must be connected to the rest of the sentence by an enhanced dependency relation. Your system is thus expected to identify the parent token of the empty token. It is also advisable to predict a type of the dependency relation (the reasons are explained here). If your system is not able to predict the dependency relation type, set each type to dep
(empty value would cause the validation tests to fail).
The system for predicting empty tokens (zeros) can be downloaded here. We have applied this system to the data for the "Coref. and zeros from scratch" starting point to produce the data for the "Coref. from scratch" starting point. The system predicts the position of empty tokens in a sentence and the DEPS column, i.e. their parent in the enhanced dependencies and the dependency relation (deprel). While CoNLL-U allows multiple enhanced parents in the DEPS column, the baseline system predicts only one (the training data was pre-processed with the corefud.SingleParent Udapi block). The baseline system does not predict any attributes of the empty nodes, so all the CoNLL-U columns except DEPS (including FORM) are empty (i.e. _).
The baseline coreference resolution system is based on the multilingual coreference resolution system presented in [7], using multilingual BERT in the end-to-end setting. The system only predicts the coreference annotation in the MISC column; that is, if the input files do not contain empty nodes, the system cannot reconstruct them and consequently fails at resolving zero anaphora. We have applied this system to the data for the "Coref. from scratch" starting point to produce the data for the "Refine the baseline" starting point.
Many things can go wrong when filling the predicted coreference annotation into the CoNLL-U format, especially if you are not using the API (incorrect syntax in the MISC column, unmatched brackets, etc.). Although the evaluation script may recover from many potential validation errors, it is highly recommended to check validity before submitting the files, so that you do not run out of the maximum number of daily submission trials.
For the CoNLL-U file produced by your system to be ready for submission, it must satisfy the following two criteria:
The official UD validator will be used to check the validity of the CoNLL-U format. Anyone can obtain it by cloning the UD tools repository from GitHub and running the script validate.py. Python 3 is needed to run the script (depending on your system, it may be available under the command python or python3; if in doubt, try python -V to see the version).
$ git clone git@github.com:UniversalDependencies/tools.git
$ cd tools
$ python3 validate.py -h
In addition, a third-party module called regex must be installed via pip. Try this if you do not have the module already:
$ sudo apt-get install python3-pip; python3 -m pip install regex
The validation script distinguishes several levels of validity; level 2 is sufficient in the shared task, as the higher levels deal with morphosyntactic requirements on the UD-released treebanks. On the other hand, we will use the --coref option to turn on tests specific to coreference annotation. The validator also requires the option --lang xx, where xx is the ISO language code of the dataset.
$ python3 validate.py --level 2 --coref --lang cs cs_pdt-corefud-test.conllu
*** PASSED ***
If there are errors, the script prints messages describing the location and nature of each error, prints *** FAILED *** with (number of) errors, and returns a non-zero exit value. If the file is OK, the script prints *** PASSED *** and returns zero as its exit value. The script may also print warning messages that point to potential problems in the file; these are not considered errors and will not make the file invalid.
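If you prefer to run this check from Python (e.g. as part of a submission pipeline), a small wrapper along the following lines can be used; the location of validate.py is an assumption and must match wherever you cloned the UD tools repository:
import subprocess
import sys
def validate_conllu(conllu_path, lang):
    """Run the official UD validator on a CoNLL-U file; return True iff it passes."""
    cmd = [sys.executable, "tools/validate.py",       # path to your clone of the UD tools repo
           "--level", "2", "--coref", "--lang", lang, conllu_path]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:                        # a non-zero exit value signals *** FAILED ***
        print(result.stdout, result.stderr, sep="\n")
    return result.returncode == 0
# e.g. validate_conllu("cs_pdt-corefud-test.conllu", "cs")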
The scorer and the evaluation setup are shared between both tracks of the shared task. See the LLM-track evaluation section for more details. In the following subsections, we only elaborate on the matching strategies for overt and zero mentions.
The primary score is calculated using head match. That is, to compare gold and predicted mentions, we compare their heads. Submitted systems are thus expected to predict a mention head word by filling its relative position among the words of the corresponding mention span into the Entity attribute. For example, the annotation Entity=(e9-place-2- identifies the second word of the mention as its head.
However, it is still advisable to predict full mention spans, too. Evaluation with head matching uses them to disambiguate between mentions with the same head token. In addition, systems that predict only mention heads are likely to fail in the evaluation with exact matching, which will be calculated as one of the supplementary scores.
If the submitted system is not able to predict the mention heads (i.e. it predicts mention spans only, and the head index is always 1), mention heads can be estimated using the provided dependency tree and heuristics, e.g. the ones provided by Udapi, using the following command: udapy -s corefud.MoveHead < in.conllu > out.conllu
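If you want to implement such a head heuristic yourself rather than call Udapi, the sketch below illustrates the basic idea (a simplification, not a faithful reimplementation of corefud.MoveHead): pick the word of the span whose syntactic parent lies outside the span.
def estimate_head(mention_words):
    """Guess a head for a mention span given Udapi word nodes with .parent links."""
    span = set(mention_words)
    for word in mention_words:
        # a word governed from outside the span is a natural head candidate
        if word.parent is None or word.parent not in span:
            return word
    return mention_words[0]   # fallback for degenerate spans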
Since the 2024 edition, participants have also been expected to predict the empty nodes involved in zero anaphora (if they opt for the LLM track or the "Coreference and zeros from scratch" starting point in the unconstrained track). In the system outputs, some empty nodes may be missing and some may be spurious. In addition, some empty nodes may be predicted at different surface positions within the sentence while playing the same role. Nevertheless, if such empty nodes are heads of the gold and the predicted mention, the evaluation method must be capable of matching these zero mentions.
The shared task applies a dependency-based method of matching zero mentions. It looks for the matching of zeros within the same sentence that maximizes the F-score of predicting dependencies of zeros in the DEPS field. Specifically, the task is cast as searching for a 1-to-1 matching in a weighted bipartite graph (with gold mentions and predicted mentions as partitions) that maximizes the total sum of weights in the matching. Each candidate pair (gold zero mention, predicted zero mention) is weighted with a non-zero score only if the two mentions belong to the same sentence. The score is then calculated as a weighted sum of two features:
The scoring prioritizes exact matches of both parents and dependency types, while matches of parents alone serve only to break ties.
Note that matching zero mentions by their dependencies is applied first, preceding the matching strategies for non-zero mentions. Zeros that have not been matched to other zeros may then be matched to non-zero mentions. Although such matching may seem counterintuitive, it can be valid in cases where a predicted zero mention is incorrectly labeled as non-zero, or vice versa, often due to the wrong choice of the head in multi-token mentions involving empty tokens.
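As an illustration of this matching step, the sketch below casts it as an assignment problem; the concrete weights are an assumption derived from the description above (a parent+deprel match dominates, a parent-only match merely breaks ties), and the official CorefUD scorer remains the authoritative implementation. Each zero is represented here as a dict with sent, parent and deprel keys.
import numpy as np
from scipy.optimize import linear_sum_assignment
def match_zeros(gold_zeros, pred_zeros):
    """1-to-1 matching of gold and predicted zero mentions maximizing the total weight."""
    weights = np.zeros((len(gold_zeros), len(pred_zeros)))
    for i, g in enumerate(gold_zeros):
        for j, p in enumerate(pred_zeros):
            if g["sent"] != p["sent"]:
                continue                      # only zeros from the same sentence may match
            if g["parent"] == p["parent"]:
                weights[i, j] = 1.0           # parent agrees: weak evidence, breaks ties
                if g["deprel"] == p["deprel"]:
                    weights[i, j] += 10.0     # parent and deprel agree: strongly preferred
    rows, cols = linear_sum_assignment(weights, maximize=True)
    return [(i, j) for i, j in zip(rows, cols) if weights[i, j] > 0]
# e.g. match_zeros([{"sent": 3, "parent": 7, "deprel": "nsubj"}],
#                  [{"sent": 3, "parent": 7, "deprel": "dep"}])   -> [(0, 0)]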
Submissions to the unconstrained track are collected through CodaLab. Note that this is a different CodaLab URL than the one used for the LLM track.
The remaining instructions and details about the submission process are exactly the same as for the LLM track. Please refer to the submission instructions in the LLM track for more information.
All participants of the unconstrained track are invited to submit their system description papers to the CODI-CRAC 2025 Workshop. Submission details are the same as for the LLM track (see here).
Training, development, and test datasets are subject to license agreements specified individually for each dataset in the public edition of the CorefUD 1.3 collection (which, in turn, are the same as license agreements of the original resources before CorefUD harmonization). In all cases, the licenses are sufficient for using the data for the CRAC 2025 shared task purposes. However, the participants must check the license agreements in case they want to use their trained models also for other purposes; for instance, usage for commercial purposes is prohibited with several CorefUD datasets as they are available under CC BY-NC-SA.
Whenever using the CorefUD 1.3 collection (inside or outside this shared task), please cite it as follows:
@misc{11234/1-5478,
  title = {Coreference in Universal Dependencies 1.2 ({CorefUD} 1.2)},
  author = {Popel, Martin and Nov{\'a}k, Michal and {\v Z}abokrtsk{\'y}, Zden{\v e}k and Zeman, Daniel and Nedoluzhko, Anna and Acar, Kutay and Bamman, David and Bourgonje, Peter and Cinkov{\'a}, Silvie and Eckhoff, Hanne and Cebiro{\u g}lu Eryi{\u g}it, G{\"u}l{\c s}en and Haji{\v c}, Jan and Hardmeier, Christian and Haug, Dag and J{\o}rgensen, Tollef and K{\aa}sen, Andre and Krielke, Pauline and Landragin, Fr{\'e}d{\'e}ric and Lapshinova-Koltunski, Ekaterina and M{\ae}hlum, Petter and Mart{\'{\i}}, M. Ant{\`o}nia and Mikulov{\'a}, Marie and N{\o}klestad, Anders and Ogrodniczuk, Maciej and {\O}vrelid, Lilja and Pamay Arslan, Tu{\u g}ba and Recasens, Marta and Solberg, Per Erik and Stede, Manfred and Straka, Milan and Swanson, Daniel and Toldova, Svetlana and Vad{\'a}sz, No{\'e}mi and Velldal, Erik and Vincze, Veronika and Zeldes, Amir and {\v Z}itkus, Voldemaras},
  url = {http://hdl.handle.net/11234/1-5478},
  note = {{LINDAT}/{CLARIAH}-{CZ} digital library at the Institute of Formal and Applied Linguistics ({{\'U}FAL}), Faculty of Mathematics and Physics, Charles University},
  copyright = {Licence {CorefUD} v1.2},
  year = {2024}
}
For a more general reference to CorefUD harmonization efforts, please cite the following LREC paper:
@inproceedings{biblio8283899234757555533,
  author = {Anna Nedoluzhko and Michal Novák and Martin Popel and Zdeněk Žabokrtský and Amir Zeldes and Daniel Zeman},
  year = {2022},
  title = {CorefUD 1.0: Coreference Meets Universal Dependencies},
  booktitle = {Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022)},
  pages = {4859--4872},
  publisher = {European Language Resources Association},
  address = {Marseille, France},
  isbn = {979-10-95546-72-6}
}
By submitting results to this competition, the participants consent to the public release of their scores at the CODI-CRAC 2025 workshop and in one of the associated proceedings, at the task organizers' discretion. Participants further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgment that the submission was erroneous or deceptive.
Charles University (Prague, Czechia): Anna Nedoluzhko, Michal Novák, Martin Popel, Milan Straka, Zdeněk Žabokrtský, Daniel Zeman
University of West Bohemia (Pilsen, Czechia): Miloslav Konopík, Ondřej Pražák, Jakub Sido
You can send any questions about the shared task to the organizers via corefud@googlegroups.com.
The main differences between the three editions are as follows:
Inspired by the Universal Dependencies initiative (UD) [1], the coreference community has started discussions on establishing a universal annotation scheme and using it to harmonize existing corpora. The discussions at the CRAC 2020 workshop led to proposing the Universal Anaphora initiative. One of the lines of effort related to Universal Anaphora resulted in CorefUD, which is a multilingual collection of coreference data resources harmonized under a common scheme [2]. The current public release of CorefUD 1.3 contains 23 datasets for 16 languages, namely Ancient Greek, Ancient Hebrew, Catalan, Czech (2×), English (3×), French (2×), German (2×), Hungarian (2×), Korean, Lithuanian, Norwegian (2×), Old Church Slavonic, Polish, Russian, Spanish, and Turkish. The CRAC 2025 shared task deals with coreference resolution in all these languages. It is the 4th edition of the shared task; findings of the previous three editions can be found in [8]-[10].
References
This shared task is supported by the Grants No. 20-16819X (LUSyD) of the Czech Science Foundation, UNCE24/SSH/009, and LM2023062 (LINDAT/CLARIAH-CZ) of the Ministry of Education, Youth, and Sports of the Czech Republic.