Coreference resolution is the task of clustering together multiple mentions of the same entity appearing in a textual document (e.g. Joe Biden, the U.S. President and he). This CodaLab-powered shared task deals with multilingual coreference resolution and is associated with the CRAC 2024 Workshop (the Seventh Workshop on Computational Models of Reference, Anaphora and Coreference) held at EMNLP 2024. This shared task builds on the previous two editions, in 2022 and 2023, paying more attention to zero-pronoun coreference this time. Coreference datasets from 15 languages are involved. Still, you can participate just for selected languages and without dealing with zero mentions.
The following table shows four versions of the CoNLL metric macro-averaged over all datasets:
A more detailed evaluation will be provided in the shared task overview paper.
system | head-match | partial-match | exact-match | with singletons |
---|---|---|---|---|
1. CorPipe-2stage | 73.90 | 72.19 | 69.86 | 75.65 |
2. CorPipe | 72.75 | 70.30 | 68.36 | 74.65 |
3. CorPipe-single | 70.18 | 68.02 | 66.07 | 71.96 |
4. Ondfa | 69.97 | 69.82 | 40.25 | 70.67 |
5. BASELINE | 53.16 | 52.48 | 51.26 | 46.45 |
6. DFKI-CorefGen | 33.38 | 32.36 | 30.71 | 38.65 |
7. Ritwikmishra | 16.47 | 16.65 | 14.16 | 15.42 |
The goal: shared task participants are supposed to create systems that identify sets of tokens as mentions of entities in the input texts (including zero mentions represented by empty tokens) and cluster them into coreferential entities.
The main rules of the shared task are the following:
Even if all datasets included in the shared task are available in the same file format, systems competing in the shared task are supposed to be flexible enough to accommodate various types of variability present in the CorefUD collection, such as different languages and domains, different dataset sizes, and differences in annotation (e.g. the presence or absence of singletons and zero mentions).
If you are interested in the shared task, you should register first.
You can then proceed to the development phase. As the first step, choose your starting point: the easier the starting point, the more you rely on the baseline solutions provided by us. You can then start developing your system. Feel free to use the gold training and development data. Run your system on the appropriate blind development data, depending on your starting point. Ensure that the output of your system is valid and evaluate it using the scorer. Repeat the development loop until you are satisfied with the results. We also encourage you to share intermediate results with the other participants by submitting the system outputs to CodaLab. This is also strongly recommended for practising the submission process, so that potential technical issues are discovered early.
With the start of the evaluation phase, we will publish the blind test data. Choose the appropriate variant based on your starting point and run your system on it. Check the validity of the output and submit it to CodaLab. You should immediately see the evaluation score. If anything goes wrong, repeat the process.
After the end of the evaluation phase, we will ask you to provide us with details on your submission and your system. You are also encouraged to write a paper describing your system in detail, which you can then present at the workshop.
If you are interested in participating in this shared task, please fill in the registration form as soon as possible.
Technically, this registration will not be connected with participants' CodaLab accounts in any way. In other words, it will be possible to upload your CodaLab submissions without being registered here. However, we strongly recommend that at least one person from each participating team fills this registration form so that we can keep you informed about all updates regarding the shared task.
In addition, you can send any questions about the shared task to the organizers via corefud@googlegroups.com.
Participants can choose from different starting points for joining the shared task, which vary based on the amount of work they need to do on their own. Depending on the starting point chosen, different degrees of predictions by baseline systems are available.
There are three starting points:
1. Coreference and zeros from scratch: your system must predict both the empty tokens (zeros) and the coreference annotation.
2. Coreference from scratch: empty tokens predicted by the baseline system are provided; your system predicts the coreference annotation.
3. Refine the baseline: both empty tokens and coreference annotation predicted by the baseline systems are provided; your system is expected to improve them.
Given that the shared task data comprises multiple datasets in different languages, participants have the flexibility to approach the task from various starting points across the datasets/languages.
The source of the data for the shared task is the public edition of CorefUD 1.2. CorefUD is a collection of previously existing datasets annotated with coreference, converted into a common annotation scheme. Coreference is annotated also for empty tokens, mainly in pro-drop languages. The datasets are enriched with morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. Each dataset in the collection is divided into a training section, a development section, and a test section (train/dev/test for short) and stored in the CoNLL-U format, with coreference-specific information captured in the MISC column.
Compared to the public edition of CorefUD 1.2, the data provided for the shared task participants are slightly adjusted.
The gold data intended for training and evaluation have undergone a small technical modification: the forms (the 2nd field in the CoNLL-U format) of empty tokens have been deleted (replaced by an underscore _). The reason is that the baseline system for predicting zeros does not predict the forms of zeros (empty tokens), and the evaluation ignores these forms as well. While we make the gold train and dev sets available for download, the gold test set is kept secret and will be used internally in CodaLab for the evaluation of submissions.
We also make available the input data to be processed by your systems. The input data approximates a real-world setup, where no manual linguistic annotation is available. However, this is done with respect to the different starting points. Consequently, for each starting point we provide the dev and test sets, in which the empty nodes and/or the coreference annotation are either deleted or replaced by the output of the baseline systems. Furthermore, the original morpho-syntactic features (POS tags, lemmas, and dependency trees) are replaced by the output of UDPipe 2 (a pipeline for automatic UD-style annotation), even in the datasets for which these features are manually annotated in CorefUD 1.2.
Data type | Starting point | Empty tokens | Coreference | Morpho-syntax | Forms of empty tokens | Download |
---|---|---|---|---|---|---|
Gold | All | manual | manual | original (manual if available, otherwise automatic) | deleted | train, dev |
Input | Coref. and zeros from scratch | deleted | deleted | automatic UDPipe 2 | deleted | dev, test |
Input | Coref. from scratch | automatic baseline | deleted | automatic UDPipe 2 | deleted | dev, test |
Input | Refine the baseline | automatic baseline | automatic baseline | automatic UDPipe 2 | deleted | dev, test |
Udapi is a Python API for reading, writing, querying and editing Universal Dependencies data in the CoNLL-U format (and several other formats). It also supports coreference annotations (and it was used for producing CorefUD). You can use Udapi to access and write the data in a comfortable way. See the following example Python script, which loads a CorefUD file, creates coreference entities and mentions, adds an empty node, and saves the result:
#!/usr/bin/env python3
import udapi
# Extract the words of the first sentence in the Spanish blind dev set.
doc = udapi.Document("es_ancora-corefud-dev.conllu")
trees = list(doc.trees)
words = trees[0].descendants
print([w.form for w in words])
#['Los', 'jugadores', 'de', 'el', 'Espanyol', 'aseguraron', 'hoy', 'que',
# 'prefieren', 'enfrentar', 'se', 'a', 'el', 'Barcelona', 'en', 'la', 'final',
# 'de', 'la', 'Copa', 'de', 'el', 'Rey', 'en', 'lugar', 'de', 'en', 'las',
# 'semifinales', ',', 'tras', 'clasificar', 'se', 'ayer', 'ambos', 'equipos',
# 'catalanes', 'para', 'esta', 'ronda', '.']
# Create entity e1 with two mentions: "las semifinales" and "esta ronda"
e1 = doc.create_coref_entity()
e1.create_mention(words=words[27:29], head=words[28])
e1.create_mention(words=words[38:40], head=words[39])
# Create an empty node (zero) before the 9th word "prefieren".
zero = words[8].create_empty_child(deprel="nsubj", after=False, form="_")
# Make sure the input file es_ancora-corefud-dev.conllu is really
# the blind dev set without any empty nodes.
assert zero == trees[0].descendants_and_empty[8], "unexpected input file"
# Create entity e2 with two mentions:
# "Los jugadores de el Espanyol" and the newly created zero.
e2 = doc.create_coref_entity()
e2.create_mention(words=words[0:5], head=words[1])
e2.create_mention(words=[zero], head=zero)
# Print the newly created coreference entities.
udapi.create_block("corefud.PrintEntities").process_document(doc)
# Save the predictions into a CoNLL-U file
doc.store_conllu("output.conllu")
To get a deeper insight into Udapi, you can explore its documentation and the sources of the corefud.* blocks used above.
If you use the Udapi interface for loading and storing the shared task data, which is the recommended way, you don't have to deal with the file format at all. However, it may be useful to understand the format for quick glimpses into the data.
The full specification of the CoNLL-U format is available at the website of Universal Dependencies. In a nutshell: every token has its own line; lines starting with #
are sentence-level comments, and empty lines terminate a sentence. Regular token lines start with an integer number. There are also lines starting with intervals (e.g. 4-5
), which introduce what UD calls “multi-word tokens”; these lines must be preserved in the output but otherwise the participants do not have to care about them (coreference annotation does not occur on them). Finally, there are also lines starting with decimal numbers (e.g. 2.1
), which correspond to empty nodes in the dependency graph; these nodes may represent zero mentions and may contain coreference annotation. Every token/node line contains 10 tab-separated fields (columns). The first column is the numeric ID of the token/node, the next column contains the word FORM; any coreference annotation, if present, will appear in the last column, which is called MISC. The file must use Linux-style line breaks, that is, a single LF character, rather than CR LF, which is common on Windows.
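If you nevertheless decide to read the files yourself, the following minimal sketch shows how one token line maps to the ten named CoNLL-U columns (the column names follow the CoNLL-U specification referenced above; the example line is an empty node carrying coreference annotation):

```python
# Column names as defined by the CoNLL-U specification.
CONLLU_COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                  "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

# An example empty-node line (fields are tab-separated in real files).
line = "8.1\t_\t_\tPRON\tp\t_\t_\t_\t9:nsubj\tEntity=(e16088--1-CorefType:ident,gstype:gen)"
token = dict(zip(CONLLU_COLUMNS, line.split("\t")))
print(token["ID"], token["DEPS"], token["MISC"])
# 8.1 9:nsubj Entity=(e16088--1-CorefType:ident,gstype:gen)
```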
The MISC column is either a single underscore (_
), meaning there is no extra annotation, or one or more pieces of annotation (typically in the Attribute=Value
form), separated by vertical bars (|
). The annotation pieces relevant for this shared task always start with Entity=
; these should be learned from the training data and predicted for the test data. Any other annotation that is present in the MISC column of the input file should be preserved in the output (especially note that if you discard SpaceAfter=No
, or introduce a new one, the validator may report the file as invalid).
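If you write the MISC column yourself instead of going through Udapi, a sketch like the following (the helper name set_entity is ours, not part of any library) illustrates how to add or replace the Entity piece while keeping all other pieces such as SpaceAfter=No untouched:

```python
def set_entity(misc, entity_value):
    """Return a MISC string with the Entity piece replaced by entity_value,
    keeping all other pieces (e.g. SpaceAfter=No) untouched."""
    pieces = [] if misc == "_" else misc.split("|")
    # Drop any Entity annotation already present in the input.
    pieces = [p for p in pieces if not p.startswith("Entity=")]
    if entity_value:
        pieces.append("Entity=" + entity_value)
    return "|".join(pieces) if pieces else "_"

print(set_entity("SpaceAfter=No", "(e1--2-"))
# SpaceAfter=No|Entity=(e1--2-
print(set_entity("_", None))
# _
```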
For more information on the Entity
attribute, see the PDF with the description of the CorefUD 1.0 format (the CorefUD 1.2 format is identical).
# newdoc id = CESS-CAST-A-20000217-13959
# global.Entity = eid-etype-head-other
# sent_id = CESS-CAST-A-20000217-13959-s1
# text = Los jugadores del Espanyol aseguraron hoy que prefieren enfrentarse al Barcelona en la final de la Copa del Rey en lugar de en las semifinales, tras clasificarse ayer ambos equipos catalanes para esta ronda.
1 Los el DET da0mp0 Definite=Def|Gender=Masc|Number=Plur|PronType=Art 2 det 2:det Entity=(e16088--2-gstype:gen,HomoDD
2 jugadores jugador NOUN ncmp000 Gender=Masc|Number=Plur 6 nsubj 6:nsubj ArgTem=arg0:agt
3-4 del _ _ _ _ _ _ _ _
3 de de ADP spcms _ 5 case 5:case _
4 el el DET _ Definite=Def|Gender=Masc|Number=Sing|PronType=Art 5 det 5:det _
5 Espanyol Espanyol PROPN np0000o _ 2 nmod 2:nmod Entity=(e16089-organization-1-gstype:spec)e16088)
6 aseguraron asegurar VERB vmis3p0 Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin 0 root 0:root _
7 hoy hoy ADV rg _ 6 advmod 6:advmod ArgTem=argM:tmp
8 que que SCONJ cs _ 9 mark 9:mark _
8.1 _ _ PRON p _ _ _ 9:nsubj ArgTem=arg0:agt|Entity=(e16088--1-CorefType:ident,gstype:gen)
9 prefieren preferir VERB vmip3p0 Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin 6 ccomp 6:ccomp ArgTem=arg1:pat
10-11 enfrentarse _ _ _ _ _ _ _ _
10 enfrentar enfrentar VERB vmn0000 VerbForm=Inf 9 xcomp 9:xcomp ArgTem=arg1:pat
11 se él PRON _ Case=Acc|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes 10 expl:pv 10:expl:pv _
12-13 al _ _ _ _ _ _ _ _
12 a a ADP spcms _ 14 case 14:case _
13 el el DET _ Definite=Def|Gender=Masc|Number=Sing|PronType=Art 14 det 14:det _
14 Barcelona Barcelona PROPN np0000l _ 10 obj 10:obj ArgTem=arg1:pat|Entity=(e16090-organization-1-gstype:spec)
15 en en ADP sps00 _ 17 case 17:case _
16 la el DET da0fs0 Definite=Def|Gender=Fem|Number=Sing|PronType=Art 17 det 17:det Entity=(e16091--2-gstype:gen,HomoDD
17 final final NOUN ncfs000 Gender=Fem|Number=Sing 10 obl 10:obl ArgTem=argM:loc
18 de de ADP sps00 _ 20 case 20:case _
19 la el DET da0fs0 Definite=Def|Gender=Fem|Number=Sing|PronType=Art 20 det 20:det Entity=(e16092-other-2-gstype:spec
20 Copa Copa PROPN np0000a _ 17 nmod 17:nmod _
21-22 del _ _ _ _ _ _ _ _
21 de de ADP _ _ 23 case 23:case _
22 el el DET _ Definite=Def|Gender=Masc|Number=Sing|PronType=Art 23 det 23:det _
23 Rey Rey PROPN _ _ 20 flat 20:flat Entity=e16092)e16091)
24 en en ADP sps00 _ 29 cc 29:cc _
25 lugar lugar NOUN _ _ 24 fixed 24:fixed _
26 de de ADP _ _ 24 fixed 24:fixed _
27 en en ADP sps00 _ 29 case 29:case _
28 las el DET da0fp0 Definite=Def|Gender=Fem|Number=Plur|PronType=Art 29 det 29:det Entity=(e16093--2-gstype:gen,HomoDD
29 semifinales semifinal NOUN ncfp000 Gender=Fem|Number=Plur 17 conj 17:conj Entity=e16093)|SpaceAfter=No
30 , , PUNCT fc PunctType=Comm 32 punct 32:punct _
31 tras tras ADP sps00 _ 32 mark 32:mark _
32-33 clasificarse _ _ _ _ _ _ _ _
32 clasificar clasificar VERB vmn0000 VerbForm=Inf 6 advcl 6:advcl ArgTem=argM:tmp
33 se él PRON _ Case=Acc|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes 32 expl:pv 32:expl:pv _
34 ayer ayer ADV rg _ 32 advmod 32:advmod ArgTem=argM:tmp
35 ambos ambos NUM dn0mp0 Gender=Masc|Number=Plur|NumType=Card 36 nummod 36:nummod Entity=(e16094-other-2-CorefType:ident,gstype:spec|SplitAnte=e16089<e16094,e16090<e16094
36 equipos equipo NOUN ncmp000 Gender=Masc|Number=Plur 32 nsubj 32:nsubj ArgTem=arg1:tem
37 catalanes catalán ADJ aq0mp0 Gender=Masc|Number=Plur 36 amod 36:amod Entity=e16094)
38 para para ADP sps00 _ 40 case 40:case _
39 esta este DET dd0fs0 Gender=Fem|Number=Sing|PronType=Dem 40 det 40:det Entity=(e16093--2-CorefType:ident,gstype:gen
40 ronda ronda NOUN ncfs000 Gender=Fem|Number=Sing 32 obl 32:obl ArgTem=argM:adv|Entity=e16093)|SpaceAfter=No
41 . . PUNCT fp PunctType=Peri 6 punct 6:punct _
You have virtually no limits in building your system. You can develop it from scratch, or extend/modify the two baseline systems that we provide to the participants: the baseline for predicting empty tokens, and the baseline for coreference resolution. If you want to treat the baseline systems as black boxes and base your system just on their predictions, choose either the "Coreference from scratch" or the "Refine the baseline" starting point.
Your coreference resolution system is supposed to identify sets of tokens as mentions and cluster them into coreferential entities. To identify a mention, your system is expected to predict the mention head word. However, it is still advisable to also predict the full mention span (the reasons are explained here). If your system is not able to predict the mention heads (i.e. it predicts mention spans only, and the head index is always 1), mention heads can be estimated using the provided dependency tree and heuristics, e.g. the ones provided by Udapi, using the following command: udapy -s corefud.MoveHead < in.conllu > out.conllu
If you choose the "Coreference and zeros from scratch" starting point, your system is supposed to reconstruct empty tokens prior to coreference resolution. A newly added empty token must be connected to the rest of the sentence by an enhanced dependency relation. Your system is thus expected to identify the parent token of the empty token. It is also advisable to predict a type of the dependency relation (the reasons are explained here). If your system is not able to predict the dependency relation type, set each type to dep
(empty value would cause the validation tests to fail).
You do not need to predict all information present in the gold data. As demonstrated on the example above, instead of generating:
1 Los el DET da0mp0 Definite=Def|Gender=Masc|Number=Plur|PronType=Art 2 det 2:det Entity=(e16088--2-gstype:gen,HomoDD
2 jugadores jugador NOUN ncmp000 Gender=Masc|Number=Plur 6 nsubj 6:nsubj ArgTem=arg0:agt
...
5 Espanyol Espanyol PROPN np0000o _ 2 nmod 2:nmod Entity=(e16089-organization-1-gstype:spec)e16088)
...
8.1 _ _ PRON p _ _ _ 9:nsubj ArgTem=arg0:agt|Entity=(e16088--1-CorefType:ident,gstype:gen)
...
35 ambos ambos NUM dn0mp0 Gender=Masc|Number=Plur|NumType=Card 36 nummod 36:nummod Entity=(e16094-other-2-CorefType:ident,gstype:spec|SplitAnte=e16089<e16094,e16090<e16094
36 equipos equipo NOUN ncmp000 Gender=Masc|Number=Plur 32 nsubj 32:nsubj ArgTem=arg1:tem
37 catalanes catalán ADJ aq0mp0 Gender=Masc|Number=Plur 36 amod 36:amod Entity=e16094)
...
Depending on your starting point, it is sufficient for your system to generate the following for a perfect match:
1 Los el DET da0mp0 Definite=Def|Gender=Masc|Number=Plur|PronType=Art 2 det 2:det Entity=(e1--2-
2 jugadores jugador NOUN ncmp000 Gender=Masc|Number=Plur 6 nsubj 6:nsubj _
2.1 _ _ _ _ _ _ _ 9:dep Entity=(e1--1-)
...
5 Espanyol Espanyol PROPN np0000o _ 2 nmod 2:nmod Entity=(e2--1-)e1)
...
35 ambos ambos NUM dn0mp0 Gender=Masc|Number=Plur|NumType=Card 36 nummod 36:nummod Entity=(e3--2-
36 equipos equipo NOUN ncmp000 Gender=Masc|Number=Plur 32 nsubj 32:nsubj _
37 catalanes catalán ADJ aq0mp0 Gender=Masc|Number=Plur 36 amod 36:amod Entity=e3)
...
Out of the possible anaphora annotation tags available in the CorefUD format, only the Entity tags have been predicted (and e.g. the SplitAnte annotation can be ignored), and even those not in their full content. Bracketing defines the mention span. The index in the 3rd field of the Entity tag defines the relative position of the mention head within all tokens of the mention (in the example, the predicted mention heads are: jugadores, _, Espanyol, and equipos). The e* co-indexing in the 1st field of Entity tags clusters mentions into coreferential entities. As for the generated zero, its position does not need to be matched (2.1 vs. 8.1). Instead, its dependency relation in the DEPS field is used to align it to a gold zero. While the parent index (9) must be the same, the dependency relation type does not need to match (dep vs. nsubj) unless there are multiple zeros with the same parent.
The baseline system for predicting empty tokens (zeros) can be downloaded here. We have applied this system to the data for the "Coref. and zeros from scratch" starting point to produce the data for the "Coref. from scratch" starting point. The system predicts the position of empty tokens in a sentence and their DEPS column, i.e. their parent in the enhanced dependencies and the dependency relation (deprel). While CoNLL-U allows multiple enhanced parents in the DEPS column, the baseline system predicts only one (the training data was pre-processed with the corefud.SingleParent Udapi block). The baseline system does not predict any other attributes of the empty nodes, so all the CoNLL-U columns except DEPS (including FORM) are empty (i.e. _).
The baseline coreference resolution system is based on the multilingual coreference resolution system presented in [7], using multilingual BERT in the end-to-end setting. The system only predicts the coreference annotation in the MISC column, i.e. if the input files do not contain empty nodes, the system cannot reconstruct them and consequently fails in resolving zero anaphora. We have applied this system on the data for "Coref. from scratch" starting point to produce the data for the "Refine the baseline" starting point.
Many things can go wrong when filling the predicted coreference annotation into the CoNLL-U format, especially if you do not use the API (incorrect syntax in the MISC column, unmatched brackets, etc.). Although the evaluation script may recover from many potential validation errors, it is highly recommended to check validity before submitting the files, so that you do not exhaust the maximum number of daily submission trials.
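For instance, a very rough pre-submission sanity check of the Entity bracketing could look like the sketch below. It only counts opening and closing brackets in Entity annotations, which should balance; it is our own quick heuristic and not a substitute for the official validator described next.

```python
import sys

def entity_bracket_counts(conllu_path):
    """Count opening and closing brackets in Entity annotations of a CoNLL-U
    file; they should be equal. A quick heuristic only, not a validator."""
    opened = closed = 0
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            misc = line.rstrip("\n").split("\t")[-1]
            for piece in misc.split("|"):
                if piece.startswith("Entity="):
                    opened += piece.count("(")
                    closed += piece.count(")")
    return opened, closed

if __name__ == "__main__":
    opened, closed = entity_bracket_counts(sys.argv[1])
    print(f"opened={opened} closed={closed}",
          "OK" if opened == closed else "MISMATCH")
```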
For the CoNLL-U file produced by your system to be ready for submission, it must satisfy the following two criteria:
The official UD validator will be used to check the validity of the CoNLL-U format. Anyone can obtain it by cloning the UD tools repository from GitHub and running the script validate.py
. Python 3 is needed to run the script (depending on your system, it may be available under the command python
or python3
; if in doubt, try python -V
to see the version).
$ git clone git@github.com:UniversalDependencies/tools.git
$ cd tools
$ python3 validate.py -h
In addition, a third-party module called regex
must be installed via pip. Try this if you do not have the module already:
$ sudo apt-get install python3-pip; python3 -m pip install regex
The validation script distinguishes several levels of validity; level 2 is sufficient in the shared task, as the higher levels deal with morphosyntactic requirements on the UD-released treebanks. On the other hand, we will use the --coref
option to turn on tests specific to coreference annotation. The validator also requires the option --lang xx
where xx
is the ISO language code of the data set.
$ python3 validate.py --level 2 --coref --lang cs cs_pdt-corefud-test.conllu
*** PASSED ***
If there are errors, the script prints messages describing the location and nature of each error, prints *** FAILED *** with (number of) errors, and returns a non-zero exit value. If the file is OK, the script prints *** PASSED *** and returns zero as its exit value. The script may also print warning messages that point to potential problems in the file; these are not considered errors and do not make the file invalid.
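To check all output files in one go, you may wrap the validator in a small script such as the sketch below. It assumes the UD tools repository has been cloned into a tools/ directory next to a (hypothetical) predictions/ directory holding your outputs, and it derives the language code from the file name prefix.

```python
import subprocess
from pathlib import Path

# Run the official validator on every predicted dev file and report the result.
for conllu in sorted(Path("predictions").glob("*-corefud-dev.conllu")):
    lang = conllu.name.split("_")[0]          # e.g. "es" from es_ancora-corefud-dev.conllu
    result = subprocess.run(
        ["python3", "tools/validate.py", "--level", "2", "--coref",
         "--lang", lang, str(conllu)],
        capture_output=True, text=True)
    status = "PASSED" if result.returncode == 0 else "FAILED"
    print(f"{conllu.name}: {status}")
```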
The official scorer for the shared task is the CorefUD scorer in versions after May 10. Its functionality is guaranteed not to change until the end of the evaluation phase.
The main evaluation metric for the task is the CoNLL score, which is an unweighted average of the F1 values of the MUC, B-cubed, and CEAFe scores. To encourage the participants to develop multilingual systems, the primary ranking score will be computed by macro-averaging the CoNLL F1 scores over all datasets. For the same reason, singletons (entities with a single mention) will not be taken into account in the calculation of the primary score, as many of the datasets do not have singletons annotated. Although some of the datasets also comprise annotation of split antecedents, bridging and other anaphoric relations, these are not going to be evaluated.
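As a simple illustration of how the primary score is composed (all numbers below are made up, not real results):

```python
# CoNLL score for one dataset: unweighted average of MUC, B-cubed and CEAFe F1.
muc_f1, b3_f1, ceafe_f1 = 0.72, 0.65, 0.68          # made-up F1 values
conll_f1 = (muc_f1 + b3_f1 + ceafe_f1) / 3
print(round(conll_f1, 4))                            # 0.6833

# Primary ranking score: macro-average of the per-dataset CoNLL scores,
# so every dataset contributes equally regardless of its size.
conll_per_dataset = {"ca_ancora": 0.70, "cs_pdt": 0.66, "tr_itcc": 0.41}
primary = sum(conll_per_dataset.values()) / len(conll_per_dataset)
print(round(primary, 4))                             # 0.59
```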
Besides the primary ranking, the overview paper on the shared task will also introduce multiple secondary rankings, e.g. by CoNLL score for individual languages, or by CoNLL scores calculated with exact matching.
The primary score is calculated using head match: to compare gold and predicted mentions, we compare their heads. Submitted systems are thus expected to predict a mention head word by filling its relative position within all words of the corresponding mention span into the Entity attribute. For example, the annotation Entity=(e9-place-2- identifies the second word of the mention as its head.
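The following minimal sketch shows how that head index is interpreted; the mention span and the parsing of the Entity tag are illustrative only (in practice Udapi parses the Entity attribute for you):

```python
# Interpreting the head index in the eid-etype-head-other scheme.
mention_words = ["la", "final", "de", "la", "Copa", "de", "el", "Rey"]
entity_tag = "(e9-place-2-"                   # eid-etype-head-other
head_index = int(entity_tag.split("-")[2])    # third field -> 2
head_word = mention_words[head_index - 1]     # 1-based within the mention span
print(head_word)                              # final
```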
However, it is still advisable to predict full mention spans, too. Evaluation with head matching uses them to disambiguate between mentions with the same head token. In addition, systems that predict only mention heads are likely to fail in the evaluation with exact matching, which will be calculated as one of the supplementary scores.
If the submitted system is not able to predict the mention heads (i.e. it predicts mention spans only, and the head index is always 1
), mention heads can be estimated using the provided dependency tree and heuristics, e.g. the ones provided by Udapi, using the following command: udapy -s corefud.MoveHead < in.conllu > out.conllu
Unlike in the previous editions, this year the participants are also expected to predict the empty nodes involved in zero anaphora (if they opt for the "Coreference and zeros from scratch" starting point). In the system outputs, some empty nodes may be missing and some may be spurious. In addition, some empty nodes may be predicted at different surface positions within the sentence while playing the same role. Nevertheless, if such empty nodes are heads of a gold and a predicted mention, the evaluation method must be capable of matching these zero mentions.
The shared task applies a dependency-based method of matching zero mentions. It looks for a matching of zeros within the same sentence that maximizes the F-score of predicting the dependencies of zeros in the DEPS field. Specifically, the task is cast as searching for a 1-to-1 matching in a weighted bipartite graph (with gold mentions and predicted mentions as the two partitions) that maximizes the total sum of weights in the matching. Each candidate pair (gold zero mention, predicted zero mention) is assigned a non-zero weight only if the two mentions belong to the same sentence. The weight is then calculated as a weighted sum of two features: a match of both the parent and the dependency relation type in DEPS, and a match of the parent alone.
The scoring prioritizes exact agreement of both parents and dependency relation types, while parent agreement that ignores the relation type only serves to break ties.
Note that matching zero mentions by their dependencies is applied first, preceding the matching strategies for non-zero mentions. Zeros that have not been matched to other zeros may then be matched to non-zero mentions. Although such matching may seem counterintuitive, it can be valid in cases where a predicted zero mention is incorrectly labeled as non-zero, or vice versa, often due to the wrong choice of the head in multi-token mentions involving empty tokens.
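The sketch below illustrates the bipartite matching idea on toy data, with gold and predicted zeros represented as (enhanced parent, deprel) pairs. The concrete weights (10 vs. 1) are ours and only encode that a parent+type match outranks a parent-only match; the official CorefUD scorer defines its own weighting.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Zeros within one sentence, as (enhanced parent id, deprel) pairs.
gold = [(9, "nsubj"), (12, "obj")]
pred = [(9, "dep"), (9, "nsubj")]

# Build the weight matrix: non-zero only when the parents agree,
# with a much higher weight when the relation type also agrees.
weights = np.zeros((len(gold), len(pred)))
for i, (gp, gd) in enumerate(gold):
    for j, (pp, pd) in enumerate(pred):
        if gp == pp:
            weights[i, j] = 10 if gd == pd else 1

# 1-to-1 matching maximizing the total weight.
rows, cols = linear_sum_assignment(weights, maximize=True)
for i, j in zip(rows, cols):
    if weights[i, j] > 0:
        print(f"gold zero {gold[i]} matched to predicted zero {pred[j]}")
# gold zero (9, 'nsubj') matched to predicted zero (9, 'nsubj')
```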
The submissions to the shared task are collected through CodaLab. You need to create a CodaLab account prior to your first submission. We suggest using your team name as the username when creating the account, because the username will be shown publicly in the CodaLab results.
We recommend submitting the outputs of your system to CodaLab already during the development phase, using the dev set as the input. By doing so, you can practice the submission process and prevent unexpected issues during the evaluation phase. The limits on the maximum number of submission trials are high enough to also share your intermediate results: 15 trials per day and 100 in total.
On the other hand, the evaluation phase limits the maximum number of submission trials more strictly: 2 trials per day and 10 in total. Multiple trials are allowed exclusively for resolving unexpected situations and definitely should not be used for systematic optimization of parameters or hyper-parameters of your model towards the scores shown by CodaLab.
Participants who have developed multiple coreference prediction systems are encouraged to submit their predictions separately, up to 3 systems per team, as long as the systems are different in some interesting ways (e.g. using different architectures, not just different hyper-parameter settings). In order to submit an additional system of yours, please create an additional team account at CodaLab.
The submission to CodaLab must be a zip file with the CoNLL-U files produced by your system residing in the root folder and with names identical to the names of the corresponding input files. The zip file for the development phase should contain the following 21 files (substitute dev with test for submissions in the evaluation phase).
ca_ancora-corefud-dev.conllu
cs_pcedt-corefud-dev.conllu
cs_pdt-corefud-dev.conllu
cu_proiel-corefud-dev.conllu
de_parcorfull-corefud-dev.conllu
de_potsdamcc-corefud-dev.conllu
en_gum-corefud-dev.conllu
en_litbank-corefud-dev.conllu
en_parcorfull-corefud-dev.conllu
es_ancora-corefud-dev.conllu
fr_democrat-corefud-dev.conllu
grc_proiel-corefud-dev.conllu
hbo_ptnk-corefud-dev.conllu
hu_korkor-corefud-dev.conllu
hu_szegedkoref-corefud-dev.conllu
lt_lcc-corefud-dev.conllu
no_bokmaalnarc-corefud-dev.conllu
no_nynorsknarc-corefud-dev.conllu
pl_pcc-corefud-dev.conllu
ru_rucor-corefud-dev.conllu
tr_itcc-corefud-dev.conllu
If this naming (and placement) convention is not observed, the scorer will not be able to pair the outputs with the inputs, and the outputs will not be scored. We also recommend checking the validity of your output files. However, even files that do not pass the validation tests will be considered for the evaluation and will contribute to the final score (provided the evaluation script does not fail on them).
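A minimal sketch for packaging the submission is shown below. It zips the CoNLL-U files from a hypothetical predictions/ directory into the root of the archive, keeping the required file names; substitute dev with test for the evaluation phase.

```python
import zipfile
from pathlib import Path

output_dir = Path("predictions")          # assumed location of your system outputs
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for conllu in sorted(output_dir.glob("*-corefud-dev.conllu")):
        # arcname keeps only the file name, so the files end up in the zip root.
        zf.write(conllu, arcname=conllu.name)
```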
All shared task participants are invited to submit their system description papers to the CRAC 2024 Workshop. Please submit your paper using SoftConf and select one of the "Shared Task paper (short/long)" options as its Submission Type. If accepted, the papers will be published in the workshop proceedings.
System description papers can have the form of long or short research papers, up to 8 pages of content for long papers and up to 4 pages of content for short papers, plus an unlimited number of pages for references in both cases. For the formatting instructions, please follow the instructions for the other CRAC papers.
The identity of the authors of the participating systems is known, and thus it is not required to make the submissions anonymous.
Accepted papers will be presented in a dedicated session of the CRAC 2024 Workshop (the Seventh Workshop on Computational Models of Reference, Anaphora and Coreference) held at EMNLP 2024, and published in the conference proceedings.
Training, development, and test datasets are subject to license agreements specified individually for each dataset in the public edition of the CorefUD 1.2 collection (which, in turn, are the same as license agreements of the original resources before CorefUD harmonization). In all cases, the licenses are sufficient for using the data for the CRAC 2024 shared task purposes. However, the participants must check the license agreements in case they want to use their trained models also for other purposes; for instance, usage for commercial purposes is prohibited with several CorefUD datasets as they are available under CC BY-NC-SA.
Whenever using the CorefUD 1.2 collection (inside or outside this shared task), please cite it as follows:
@misc{11234/1-5478,
  title = {Coreference in Universal Dependencies 1.2 ({CorefUD} 1.2)},
  author = {Popel, Martin and Nov{\'a}k, Michal and {\v Z}abokrtsk{\'y}, Zden{\v e}k and Zeman, Daniel and Nedoluzhko, Anna and Acar, Kutay and Bamman, David and Bourgonje, Peter and Cinkov{\'a}, Silvie and Eckhoff, Hanne and Cebiro{\u g}lu Eryi{\u g}it, G{\"u}l{\c s}en and Haji{\v c}, Jan and Hardmeier, Christian and Haug, Dag and J{\o}rgensen, Tollef and K{\aa}sen, Andre and Krielke, Pauline and Landragin, Fr{\'e}d{\'e}ric and Lapshinova-Koltunski, Ekaterina and M{\ae}hlum, Petter and Mart{\'{\i}}, M. Ant{\`o}nia and Mikulov{\'a}, Marie and N{\o}klestad, Anders and Ogrodniczuk, Maciej and {\O}vrelid, Lilja and Pamay Arslan, Tu{\u g}ba and Recasens, Marta and Solberg, Per Erik and Stede, Manfred and Straka, Milan and Swanson, Daniel and Toldova, Svetlana and Vad{\'a}sz, No{\'e}mi and Velldal, Erik and Vincze, Veronika and Zeldes, Amir and {\v Z}itkus, Voldemaras},
  url = {http://hdl.handle.net/11234/1-5478},
  note = {{LINDAT}/{CLARIAH}-{CZ} digital library at the Institute of Formal and Applied Linguistics ({{\'U}FAL}), Faculty of Mathematics and Physics, Charles University},
  copyright = {Licence {CorefUD} v1.2},
  year = {2024}
}
For a more general reference to CorefUD harmonization efforts, please cite the following LREC paper:
@inproceedings{biblio8283899234757555533,
  author = {Anna Nedoluzhko and Michal Novák and Martin Popel and Zdeněk Žabokrtský and Amir Zeldes and Daniel Zeman},
  year = {2022},
  title = {CorefUD 1.0: Coreference Meets Universal Dependencies},
  booktitle = {Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022)},
  pages = {4859--4872},
  publisher = {European Language Resources Association},
  address = {Marseille, France},
  isbn = {979-10-95546-72-6}
}
By submitting results to this competition, the participants consent to the public release of their scores at the CRAC 2024 workshop and in the associated proceedings, at the task organizers' discretion. Participants further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgment that the submission was erroneous or deceptive.
Charles University (Prague, Czechia): Anna Nedoluzhko, Michal Novák, Martin Popel, Zdeněk Žabokrtský, Daniel Zeman
Polish Academy of Sciences (Warsaw, Poland): Maciej Ogrodniczuk
University of West Bohemia (Pilsen, Czechia): Miloslav Konopík, Ondřej Pražák, Jakub Sido
You can send any questions about the shared task to the organizers via corefud@googlegroups.com.
The main differences between the three editions are as follows:
Inspired by the Universal Dependencies initiative (UD) [1], the coreference community has started discussions on establishing a universal annotation scheme and using it to harmonize existing corpora. The discussions at the CRAC 2020 workshop led to proposing the Universal Anaphora initiative. One of the lines of effort related to Universal Anaphora resulted in CorefUD, which is a multilingual collection of coreference data resources harmonized under a common scheme [2]. The current public release of CorefUD 1.2 contains 21 datasets for 15 languages, namely Ancient Greek, Ancient Hebrew, Catalan, Czech (2×), English (3×), French, German (2×), Hungarian (2×), Lithuanian, Norwegian (2×), Old Church Slavonic, Polish, Russian, Spanish, and Turkish. The CRAC 2024 shared task deals with coreference resolution in all these languages. It is the 3rd edition of the shared task; findings of the first and second edition can be found in [8] and [9], respectively.
References
This shared task is supported by the Grants No. 20-16819X (LUSyD) of the Czech Science Foundation, UNCE24/SSH/009, and LM2023062 (LINDAT/CLARIAH-CZ) of the Ministry of Education, Youth, and Sports of the Czech Republic.