Coreference resolution is the task of clustering together multiple mentions of the same entity appearing in a textual document (e.g. Joe Biden, the U.S. President and he). This CodaLab-powered shared task deals with multilingual coreference resolution and is associated with the CRAC 2024 Workshop (the Seventh Workshop on Computational Models of Reference, Anaphora and Coreference) held at EMNLP 2024. This shared task builds on the previous two editions, in 2022 and 2023, paying more attention to zero-pronoun coreference this time. Coreference datasets from 15 languages are involved. Still, you can participate just for selected languages and without dealing with zero mentions.
The following table shows four versions of the CoNLL metric macro-averaged over all datasets:
A more detailed evaluation will be provided in the shared task overview paper.
system | head-match | partial-match | exact-match | with singletons |
---|---|---|---|---|
1. CorPipe-2stage | 73.90 | 72.19 | 69.86 | 75.65 |
2. CorPipe | 72.75 | 70.30 | 68.36 | 74.65 |
3. CorPipe-single | 70.18 | 68.02 | 66.07 | 71.96 |
4. Ondfa | 69.97 | 69.82 | 40.25 | 70.67 |
5. BASELINE | 53.16 | 52.48 | 51.26 | 46.45 |
6. DFKI-CorefGen | 33.38 | 32.36 | 30.71 | 38.65 |
7. Ritwikmishra | 16.47 | 16.65 | 14.16 | 15.42 |
The goal: shared task participants are supposed to create systems that identify sets of tokens as mentions of entities in the input texts (including zero mentions represented by empty tokens) and cluster them into coreferential entities.
The main rules of the shared task are the following:
Even if all datasets included in the shared task are available in the same file format, systems competing in the shared task are supposed to be flexible enough to accommodate various types of variability present in the CorefUD collection, such as different languages and domains, different dataset sizes, and differences in annotation (e.g. the presence or absence of singletons and zero mentions).
If you are interested in the shared task, you should register first.
You can then proceed to the development phase. As the first step, choose your starting point: the easier the starting point, the more you rely on the baseline solutions provided by us. You can then start developing your system. Feel free to use the gold training and development data. Run your system on the appropriate blind development data, depending on your starting point. Ensure that the output of your system is valid and evaluate it using the scorer. Repeat the development loop until you are satisfied with the results. We also encourage you to share intermediate results with the other participants by submitting the system outputs to CodaLab. This is also strongly recommended for practising the submission process, so that potential technical issues are discovered early.
With the start of the evaluation phase, we will publish the blind test data. Choose the appropriate variant based on your starting point and run your system on it. Check the validity of the output and submit it to CodaLab. You should immediately see the evaluation score. If anything goes wrong, repeat the process.
After the end of the evaluation phase, we will ask you to provide us with details on your submission and your system. You are also encouraged to write a paper describing your system in detail, which you can then present at the workshop.
If you are interested in participating in this shared task, please fill in the registration form as soon as possible.
Technically, this registration will not be connected with participants' CodaLab accounts in any way. In other words, it will be possible to upload your CodaLab submissions without being registered here. However, we strongly recommend that at least one person from each participating team fills this registration form so that we can keep you informed about all updates regarding the shared task.
In addition, you can send any questions about the shared task to the organizers via corefud@googlegroups.com.
Participants can choose from different starting points for joining the shared task, which vary based on the amount of work they need to do on their own. Depending on the starting point chosen, different degrees of predictions by baseline systems are available.
There are three starting points:
1. Coreference and zeros from scratch: your system must predict both the empty tokens (zeros) and the coreference annotation.
2. Coreference from scratch: empty tokens predicted by the baseline system are provided; your system predicts the coreference annotation.
3. Refine the baseline: both empty tokens and coreference annotation predicted by the baseline systems are provided; your system is expected to improve them.
Given that the shared task data comprises multiple datasets in different languages, participants have the flexibility to approach the task from various starting points across the datasets/languages.
The source of the data for the shared task is the public edition of CorefUD 1.2. CorefUD is a collection of previously existing datasets annotated with coreference, converted into a common annotation scheme. Coreference is annotated also for empty tokens, mainly in pro-drop languages. The datasets are enriched with morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. Each dataset in the collection is divided into a training section, a development section, and a test section (train/dev/test for short) and stored in the CoNLL-U format, with coreference-specific information captured in the MISC column.
Compared to the public edition of CorefUD 1.2, the data provided for the shared task participants are slightly adjusted.
The gold data intended for training and evaluation have undergone a small technical modification: the forms (the 2nd field in the CoNLL-U format) of empty tokens have been deleted (replaced by an underscore _). The reason is that the baseline system for predicting zeros does not predict the forms of zeros (empty tokens), and the evaluation ignores these forms as well. While we make the gold train and dev sets available for download, the gold test set is kept secret and will be used internally in CodaLab for the evaluation of submissions.
We also make available the input data to be processed by your systems. The input data approximates a real-world setup, where no manual linguistic annotation is available. However, this is done with respect to the different starting points. Consequently, for each starting point we provide the dev and test sets, in which the empty nodes and/or the coreference annotation are either deleted or replaced by the output of the baseline systems. Furthermore, the original morpho-syntactic features (POS tags, lemmas, and dependency trees) are replaced by the output of UDPipe 2 (a pipeline for automatic UD-style annotation), even in the datasets for which these features are manually annotated in CorefUD 1.2.
Data type | Starting point | Empty tokens | Coreference | Morpho-syntax | Forms of empty tokens | Download |
---|---|---|---|---|---|---|
Gold | All | manual | manual | original (manual if available, otherwise automatic) | deleted | train, dev |
Input | Coref. and zeros from scratch | deleted | deleted | automatic UDPipe 2 | deleted | dev, test |
Input | Coref. from scratch | automatic baseline | deleted | automatic UDPipe 2 | deleted | dev, test |
Input | Refine the baseline | automatic baseline | automatic baseline | automatic UDPipe 2 | deleted | dev, test |
Udapi is a Python API for reading, writing, querying and editing Universal Dependencies data in the CoNLL-U format (and several other formats). It also supports coreference annotations (and it was used for producing CorefUD). You can use Udapi to access and write the data in a comfortable way. See the following example Python script, which loads a CorefUD file, creates coreference entities and mentions, adds an empty node, and saves the result:
#!/usr/bin/env python3
import udapi
# Extract the words of the first sentence in the Spanish blind dev set.
doc = udapi.Document("es_ancora-corefud-dev.conllu")
trees = list(doc.trees)
words = trees[0].descendants
print([w.form for w in words])
#['Los', 'jugadores', 'de', 'el', 'Espanyol', 'aseguraron', 'hoy', 'que',
# 'prefieren', 'enfrentar', 'se', 'a', 'el', 'Barcelona', 'en', 'la', 'final',
# 'de', 'la', 'Copa', 'de', 'el', 'Rey', 'en', 'lugar', 'de', 'en', 'las',
# 'semifinales', ',', 'tras', 'clasificar', 'se', 'ayer', 'ambos', 'equipos',
# 'catalanes', 'para', 'esta', 'ronda', '.']
# Create entity e1 with two mentions: "las semifinales" and "esta ronda"
e1 = doc.create_coref_entity()
e1.create_mention(words=words[27:29], head=words[28])
e1.create_mention(words=words[38:40], head=words[39])
# Create an empty node (zero) before the 9th word "prefieren".
zero = words[8].create_empty_child(deprel="nsubj", after=False, form="_")
# Make sure the input file es_ancora-corefud-dev.conllu is really
# the blind dev set without any empty nodes.
assert zero == trees[0].descendants_and_empty[8], "unexpected input file"
# Create entity e2 with two mentions:
# "Los jugadores de el Espanyol" and the newly created zero.
e2 = doc.create_coref_entity()
e2.create_mention(words=words[0:5], head=words[1])
e2.create_mention(words=[zero], head=zero)
# Print the newly created coreference entities.
udapi.create_block("corefud.PrintEntities").process_document(doc)
# Save the predictions into a CoNLL-U file
doc.store_conllu("output.conllu")
To get a deeper insight into Udapi, you can explore its documentation and the sources of the corefud.* blocks used above.
If you use the Udapi interface for loading and storing the shared task data, which is the recommended way, you don't have to deal with the file format at all. However, it may be useful to understand the format for quick glimpses into the data.
The full specification of the CoNLL-U format is available at the website of Universal Dependencies. In a nutshell: every token has its own line; lines starting with #
are sentence-level comments, and empty lines terminate a sentence. Regular token lines start with an integer number. There are also lines starting with intervals (e.g. 4-5
), which introduce what UD calls “multi-word tokens”; these lines must be preserved in the output but otherwise the participants do not have to care about them (coreference annotation does not occur on them). Finally, there are also lines starting with decimal numbers (e.g. 2.1
), which correspond to empty nodes in the dependency graph; these nodes may represent zero mentions and may contain coreference annotation. Every token/node line contains 10 tab-separated fields (columns). The first column is the numeric ID of the token/node, the next column contains the word FORM; any coreference annotation, if present, will appear in the last column, which is called MISC. The file must use Linux-style line breaks, that is, a single LF character, rather than CR LF, which is common on Windows.
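If you nevertheless decide to read the files yourself, the following minimal sketch shows how one token line maps to the ten named CoNLL-U columns (the column names follow the CoNLL-U specification referenced above; the example line is an empty node carrying coreference annotation):

```python
# Column names as defined by the CoNLL-U specification.
CONLLU_COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                  "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

# An example empty-node line (fields are tab-separated in real files).
line = "8.1\t_\t_\tPRON\tp\t_\t_\t_\t9:nsubj\tEntity=(e16088--1-CorefType:ident,gstype:gen)"
token = dict(zip(CONLLU_COLUMNS, line.split("\t")))
print(token["ID"], token["DEPS"], token["MISC"])
# 8.1 9:nsubj Entity=(e16088--1-CorefType:ident,gstype:gen)
```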
The MISC column is either a single underscore (_
), meaning there is no extra annotation, or one or more pieces of annotation (typically in the Attribute=Value
form), separated by vertical bars (|
). The annotation pieces relevant for this shared task always start with Entity=
; these should be learned from the training data and predicted for the test data. Any other annotation that is present in the MISC column of the input file should be preserved in the output (especially note that if you discard SpaceAfter=No
, or introduce a new one, the validator may report the file as invalid).
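If you write the MISC column yourself instead of going through Udapi, a sketch like the following (the helper name set_entity is ours, not part of any library) illustrates how to add or replace the Entity piece while keeping all other pieces such as SpaceAfter=No untouched:

```python
def set_entity(misc, entity_value):
    """Return a MISC string with the Entity piece replaced by entity_value,
    keeping all other pieces (e.g. SpaceAfter=No) untouched."""
    pieces = [] if misc == "_" else misc.split("|")
    # Drop any Entity annotation already present in the input.
    pieces = [p for p in pieces if not p.startswith("Entity=")]
    if entity_value:
        pieces.append("Entity=" + entity_value)
    return "|".join(pieces) if pieces else "_"

print(set_entity("SpaceAfter=No", "(e1--2-"))
# SpaceAfter=No|Entity=(e1--2-
print(set_entity("_", None))
# _
```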
For more information on the Entity
attribute, see the PDF with the description of the CorefUD 1.0 format (the CorefUD 1.2 format is identical).
# newdoc id = CESS-CAST-A-20000217-13959
# global.Entity = eid-etype-head-other
# sent_id = CESS-CAST-A-20000217-13959-s1
# text = Los jugadores del Espanyol aseguraron hoy que prefieren enfrentarse al Barcelona en la final de la Copa del Rey en lugar de en las semifinales, tras clasificarse ayer ambos equipos catalanes para esta ronda.
1 Los el DET da0mp0 Definite=Def|Gender=Masc|Number=Plur|PronType=Art 2 det 2:det Entity=(e16088--2-gstype:gen,HomoDD
2 jugadores jugador NOUN ncmp000 Gender=Masc|Number=Plur 6 nsubj 6:nsubj ArgTem=arg0:agt
3-4 del _ _ _ _ _ _ _ _
3 de de ADP spcms _ 5 case 5:case _
4 el el DET _ Definite=Def|Gender=Masc|Number=Sing|PronType=Art 5 det 5:det _
5 Espanyol Espanyol PROPN np0000o _ 2 nmod 2:nmod Entity=(e16089-organization-1-gstype:spec)e16088)
6 aseguraron asegurar VERB vmis3p0 Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin 0 root 0:root _
7 hoy hoy ADV rg _ 6 advmod 6:advmod ArgTem=argM:tmp
8 que que SCONJ cs _ 9 mark 9:mark _
8.1 _ _ PRON p _ _ _ 9:nsubj ArgTem=arg0:agt|Entity=(e16088--1-CorefType:ident,gstype:gen)
9 prefieren preferir VERB vmip3p0 Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin 6 ccomp 6:ccomp ArgTem=arg1:pat
10-11 enfrentarse _ _ _ _ _ _ _ _
10 enfrentar enfrentar VERB vmn0000 VerbForm=Inf 9 xcomp 9:xcomp ArgTem=arg1:pat
11 se él PRON _ Case=Acc|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes 10 expl:pv 10:expl:pv _
12-13 al _ _ _ _ _ _ _ _
12 a a ADP spcms _ 14 case 14:case _
13 el el DET _ Definite=Def|Gender=Masc|Number=Sing|PronType=Art 14 det 14:det _
14 Barcelona Barcelona PROPN np0000l _ 10 obj 10:obj ArgTem=arg1:pat|Entity=(e16090-organization-1-gstype:spec)
15 en en ADP sps00 _ 17 case 17:case _
16 la el DET da0fs0 Definite=Def|Gender=Fem|Number=Sing|PronType=Art 17 det 17:det Entity=(e16091--2-gstype:gen,HomoDD
17 final final NOUN ncfs000 Gender=Fem|Number=Sing 10 obl 10:obl ArgTem=argM:loc
18 de de ADP sps00 _ 20 case 20:case _
19 la el DET da0fs0 Definite=Def|Gender=Fem|Number=Sing|PronType=Art 20 det 20:det Entity=(e16092-other-2-gstype:spec
20 Copa Copa PROPN np0000a _ 17 nmod 17:nmod _
21-22 del _ _ _ _ _ _ _ _
21 de de ADP _ _ 23 case 23:case _
22 el el DET _ Definite=Def|Gender=Masc|Number=Sing|PronType=Art 23 det 23:det _
23 Rey Rey PROPN _ _ 20 flat 20:flat Entity=e16092)e16091)
24 en en ADP sps00 _ 29 cc 29:cc _
25 lugar lugar NOUN _ _ 24 fixed 24:fixed _
26 de de ADP _ _ 24 fixed 24:fixed _
27 en en ADP sps00 _ 29 case 29:case _
28 las el DET da0fp0 Definite=Def|Gender=Fem|Number=Plur|PronType=Art 29 det 29:det Entity=(e16093--2-gstype:gen,HomoDD
29 semifinales semifinal NOUN ncfp000 Gender=Fem|Number=Plur 17 conj 17:conj Entity=e16093)|SpaceAfter=No
30 , , PUNCT fc PunctType=Comm 32 punct 32:punct _
31 tras tras ADP sps00 _ 32 mark 32:mark _
32-33 clasificarse _ _ _ _ _ _ _ _
32 clasificar clasificar VERB vmn0000 VerbForm=Inf 6 advcl 6:advcl ArgTem=argM:tmp
33 se él PRON _ Case=Acc|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes 32 expl:pv 32:expl:pv _
34 ayer ayer ADV rg _ 32 advmod 32:advmod ArgTem=argM:tmp
35 ambos ambos NUM dn0mp0 Gender=Masc|Number=Plur|NumType=Card 36 nummod 36:nummod Entity=(e16094-other-2-CorefType:ident,gstype:spec|SplitAnte=e16089<e16094,e16090<e16094
36 equipos equipo NOUN ncmp000 Gender=Masc|Number=Plur 32 nsubj 32:nsubj ArgTem=arg1:tem
37 catalanes catalán ADJ aq0mp0 Gender=Masc|Number=Plur 36 amod 36:amod Entity=e16094)
38 para para ADP sps00 _ 40 case 40:case _
39 esta este DET dd0fs0 Gender=Fem|Number=Sing|PronType=Dem 40 det 40:det Entity=(e16093--2-CorefType:ident,gstype:gen
40 ronda ronda NOUN ncfs000 Gender=Fem|Number=Sing 32 obl 32:obl ArgTem=argM:adv|Entity=e16093)|SpaceAfter=No
41 . . PUNCT fp PunctType=Peri 6 punct 6:punct _
You have virtually no limits in building your system. You can develop it from scratch, or extend/modify the two baseline systems that we provide to the participants: the baseline for predicting empty tokens, and the baseline for coreference resolution. If you want to treat the baseline systems as black boxes and base your system just on their predictions, choose either the "Coreference from scratch" or the "Refine the baseline" starting point.
Your coreference resolution system is supposed to identify sets of tokens as mentions and cluster them into coreferential entities. To identify a mention, your system is expected to predict the mention head word. However, it is still advisable to also predict the full mention span (the reasons are explained here). If your system is not able to predict the mention heads (i.e. it predicts mention spans only, and the head index is always 1), mention heads can be estimated using the provided dependency tree and heuristics, e.g. the ones provided by Udapi, using the following command: udapy -s corefud.MoveHead < in.conllu > out.conllu
If you choose the "Coreference and zeros from scratch" starting point, your system is supposed to reconstruct empty tokens prior to coreference resolution. A newly added empty token must be connected to the rest of the sentence by an enhanced dependency relation. Your system is thus expected to identify the parent token of the empty token. It is also advisable to predict a type of the dependency relation (the reasons are explained here). If your system is not able to predict the dependency relation type, set each type to dep
(empty value would cause the validation tests to fail).
You do not need to predict all information present in the gold data. As demonstrated on the example above, instead of generating:
1 Los el DET da0mp0 Definite=Def|Gender=Masc|Number=Plur|PronType=Art 2 det 2:det Entity=(e16088--2-gstype:gen,HomoDD
2 jugadores jugador NOUN ncmp000 Gender=Masc|Number=Plur 6 nsubj 6:nsubj ArgTem=arg0:agt
...
5 Espanyol Espanyol PROPN np0000o _ 2 nmod 2:nmod Entity=(e16089-organization-1-gstype:spec)e16088)
...
8.1 _ _ PRON p _ _ _ 9:nsubj ArgTem=arg0:agt|Entity=(e16088--1-CorefType:ident,gstype:gen)
...
35 ambos ambos NUM dn0mp0 Gender=Masc|Number=Plur|NumType=Card 36 nummod 36:nummod Entity=(e16094-other-2-CorefType:ident,gstype:spec|SplitAnte=e16089<e16094,e16090<e16094
36 equipos equipo NOUN ncmp000 Gender=Masc|Number=Plur 32 nsubj 32:nsubj ArgTem=arg1:tem
37 catalanes catalán ADJ aq0mp0 Gender=Masc|Number=Plur 36 amod 36:amod Entity=e16094)
...
Depending on your starting point, it is sufficient for your system to generate the following for a perfect match:
1 Los el DET da0mp0 Definite=Def|Gender=Masc|Number=Plur|PronType=Art 2 det 2:det Entity=(e1--2-
2 jugadores jugador NOUN ncmp000 Gender=Masc|Number=Plur 6 nsubj 6:nsubj _
2.1 _ _ _ _ _ _ _ 9:dep Entity=(e1--1-)
...
5 Espanyol Espanyol PROPN np0000o _ 2 nmod 2:nmod Entity=(e2--1-)e1)
...
35 ambos ambos NUM dn0mp0 Gender=Masc|Number=Plur|NumType=Card 36 nummod 36:nummod Entity=(e3--2-
36 equipos equipo NOUN ncmp000 Gender=Masc|Number=Plur 32 nsubj 32:nsubj _
37 catalanes catalán ADJ aq0mp0 Gender=Masc|Number=Plur 36 amod 36:amod Entity=e3)
...
Out of the possible anaphora annotation tags available in the CorefUD format, only the Entity tags have been predicted (and e.g. the SplitAnte annotation can be ignored), and even those not in their full content. Bracketing defines the mention span. The index in the 3rd field of the Entity tag defines the relative position of the mention head within all tokens of the mention (in the example, the predicted mention heads are: jugadores, _, Espanyol, and equipos). The e* co-indexing in the 1st field of Entity tags clusters mentions into coreferential entities. As for the generated zero, its position does not need to be matched (2.1 vs. 8.1). Instead, its dependency relation in the DEPS field is used to align it to a gold zero. While the parent index (9) must be the same, the dependency relation type does not need to match (dep vs. nsubj) unless there are multiple zeros with the same parent.
The baseline system for predicting empty tokens (zeros) can be downloaded here. We have applied this system to the data for the "Coref. and zeros from scratch" starting point to produce the data for the "Coref. from scratch" starting point. The system predicts the position of empty tokens in a sentence and their DEPS column, i.e. their parent in the enhanced dependencies and the dependency relation (deprel). While CoNLL-U allows multiple enhanced parents in the DEPS column, the baseline system predicts only one (the training data was pre-processed with the corefud.SingleParent Udapi block). The baseline system does not predict any other attributes of the empty nodes, so all the CoNLL-U columns except DEPS (including FORM) are empty (i.e. _).
The baseline coreference resolution system is based on the multilingual coreference resolution system presented in [7], using multilingual BERT in the end-to-end setting. The system only predicts the coreference annotation in the MISC column, i.e. if the input files do not contain empty nodes, the system cannot reconstruct them and consequently fails in resolving zero anaphora. We have applied this system on the data for "Coref. from scratch" starting point to produce the data for the "Refine the baseline" starting point.
Many things can go wrong when filling the predicted coreference annotation into the CoNLL-U format, especially if you do not use the API (incorrect syntax in the MISC column, unmatched brackets, etc.). Although the evaluation script may recover from many potential validation errors, it is highly recommended to check validity before submitting the files, so that you do not exhaust the maximum number of daily submission trials.
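For instance, a very rough pre-submission sanity check of the Entity bracketing could look like the sketch below. It only counts opening and closing brackets in Entity annotations, which should balance; it is our own quick heuristic and not a substitute for the official validator described next.

```python
import sys

def entity_bracket_counts(conllu_path):
    """Count opening and closing brackets in Entity annotations of a CoNLL-U
    file; they should be equal. A quick heuristic only, not a validator."""
    opened = closed = 0
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            misc = line.rstrip("\n").split("\t")[-1]
            for piece in misc.split("|"):
                if piece.startswith("Entity="):
                    opened += piece.count("(")
                    closed += piece.count(")")
    return opened, closed

if __name__ == "__main__":
    opened, closed = entity_bracket_counts(sys.argv[1])
    print(f"opened={opened} closed={closed}",
          "OK" if opened == closed else "MISMATCH")
```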
For the CoNLL-U file produced by your system to be ready for submission, it must satisfy the following two criteria:
The official UD validator will be used to check the validity of the CoNLL-U format. Anyone can obtain it by cloning the UD tools repository from GitHub and running the script validate.py
. Python 3 is needed to run the script (depending on your system, it may be available under the command python
or python3
; if in doubt, try python -V
to see the version).
$ git clone git@github.com:UniversalDependencies/tools.git
$ cd tools
$ python3 validate.py -h
In addition, a third-party module called regex
must be installed via pip. Try this if you do not have the module already:
$ sudo apt-get install python3-pip; python3 -m pip install regex
The validation script distinguishes several levels of validity; level 2 is sufficient in the shared task, as the higher levels deal with morphosyntactic requirements on the UD-released treebanks. On the other hand, we will use the --coref
option to turn on tests specific to coreference annotation. The validator also requires the option --lang xx
where xx
is the ISO language code of the data set.
$ python3 validate.py --level 2 --coref --lang cs cs_pdt-corefud-test.conllu
*** PASSED ***
If there are errors, the script prints messages describing the location and nature of each error, prints *** FAILED *** with (number of) errors, and returns a non-zero exit value. If the file is OK, the script prints *** PASSED *** and returns zero as its exit value. The script may also print warning messages that point to potential problems in the file; these are not considered errors and do not make the file invalid.
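To check all output files in one go, you may wrap the validator in a small script such as the sketch below. It assumes the UD tools repository has been cloned into a tools/ directory next to a (hypothetical) predictions/ directory holding your outputs, and it derives the language code from the file name prefix.

```python
import subprocess
from pathlib import Path

# Run the official validator on every predicted dev file and report the result.
for conllu in sorted(Path("predictions").glob("*-corefud-dev.conllu")):
    lang = conllu.name.split("_")[0]          # e.g. "es" from es_ancora-corefud-dev.conllu
    result = subprocess.run(
        ["python3", "tools/validate.py", "--level", "2", "--coref",
         "--lang", lang, str(conllu)],
        capture_output=True, text=True)
    status = "PASSED" if result.returncode == 0 else "FAILED"
    print(f"{conllu.name}: {status}")
```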
The official scorer for the shared task is the CorefUD scorer in versions after May 10. Its functionality is guaranteed not to change until the end of the evaluation phase.
The main evaluation metric for the task is the CoNLL score, which is an unweighted average of the F1 values of the MUC, B-cubed, and CEAFe scores. To encourage the participants to develop multilingual systems, the primary ranking score will be computed by macro-averaging the CoNLL F1 scores over all datasets. For the same reason, singletons (entities with a single mention) will not be taken into account in the calculation of the primary score, as many of the datasets do not have singletons annotated. Although some of the datasets also comprise annotation of split antecedents, bridging and other anaphoric relations, these are not going to be evaluated.
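As a simple illustration of how the primary score is composed (all numbers below are made up, not real results):

```python
# CoNLL score for one dataset: unweighted average of MUC, B-cubed and CEAFe F1.
muc_f1, b3_f1, ceafe_f1 = 0.72, 0.65, 0.68          # made-up F1 values
conll_f1 = (muc_f1 + b3_f1 + ceafe_f1) / 3
print(round(conll_f1, 4))                            # 0.6833

# Primary ranking score: macro-average of the per-dataset CoNLL scores,
# so every dataset contributes equally regardless of its size.
conll_per_dataset = {"ca_ancora": 0.70, "cs_pdt": 0.66, "tr_itcc": 0.41}
primary = sum(conll_per_dataset.values()) / len(conll_per_dataset)
print(round(primary, 4))                             # 0.59
```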
Besides the primary ranking, the overview paper on the shared task will also introduce multiple secondary rankings, e.g. by CoNLL score for individual languages, or by CoNLL scores calculated with exact matching.
The primary score is calculated using head match: to compare gold and predicted mentions, we compare their heads. Submitted systems are thus expected to predict a mention head word by filling its relative position within all words of the corresponding mention span into the Entity attribute. For example, the annotation Entity=(e9-place-2- identifies the second word of the mention as its head.
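The following minimal sketch shows how that head index is interpreted; the mention span and the parsing of the Entity tag are illustrative only (in practice Udapi parses the Entity attribute for you):

```python
# Interpreting the head index in the eid-etype-head-other scheme.
mention_words = ["la", "final", "de", "la", "Copa", "de", "el", "Rey"]
entity_tag = "(e9-place-2-"                   # eid-etype-head-other
head_index = int(entity_tag.split("-")[2])    # third field -> 2
head_word = mention_words[head_index - 1]     # 1-based within the mention span
print(head_word)                              # final
```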
However, it is still advisable to predict full mention spans, too. Evaluation with head matching uses them to disambiguate between mentions with the same head token. In addition, systems that predict only mention heads are likely to fail in the evaluation with exact matching, which will be calculated as one of the supplementary scores.
If the submitted system is not able to predict the mention heads (i.e. it predicts mention spans only, and the head index is always 1
), mention heads can be estimated using the provided dependency tree and heuristics, e.g. the ones provided by Udapi, using the following command: udapy -s corefud.MoveHead < in.conllu > out.conllu
Unlike in the previous editions, this year the participants are also expected to predict the empty nodes involved in zero anaphora (if they opt for the "Coreference and zeros from scratch" starting point). In the system outputs, some empty nodes may be missing and some may be spurious. In addition, some empty nodes may be predicted at different surface positions within the sentence while playing the same role. Nevertheless, if such empty nodes are heads of a gold and a predicted mention, the evaluation method must be capable of matching these zero mentions.
The shared task applies a dependency-based method of matching zero mentions. It looks for a matching of zeros within the same sentence that maximizes the F-score of predicting the dependencies of zeros in the DEPS field. Specifically, the task is cast as searching for a 1-to-1 matching in a weighted bipartite graph (with gold mentions and predicted mentions as the two partitions) that maximizes the total sum of weights in the matching. Each candidate pair (gold zero mention, predicted zero mention) is assigned a non-zero weight only if the two mentions belong to the same sentence. The weight is then calculated as a weighted sum of two features: a match of both the parent and the dependency relation type in DEPS, and a match of the parent alone.
The scoring prioritizes exact agreement of both parents and dependency relation types, while parent agreement that ignores the relation type only serves to break ties.
Note that matching zero mentions by their dependencies is applied first, preceding the matching strategies for non-zero mentions. Zeros that have not been matched to other zeros may then be matched to non-zero mentions. Although such matching may seem counterintuitive, it can be valid in cases where a predicted zero mention is incorrectly labeled as non-zero, or vice versa, often due to the wrong choice of the head in multi-token mentions involving empty tokens.
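The sketch below illustrates the bipartite matching idea on toy data, with gold and predicted zeros represented as (enhanced parent, deprel) pairs. The concrete weights (10 vs. 1) are ours and only encode that a parent+type match outranks a parent-only match; the official CorefUD scorer defines its own weighting.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Zeros within one sentence, as (enhanced parent id, deprel) pairs.
gold = [(9, "nsubj"), (12, "obj")]
pred = [(9, "dep"), (9, "nsubj")]

# Build the weight matrix: non-zero only when the parents agree,
# with a much higher weight when the relation type also agrees.
weights = np.zeros((len(gold), len(pred)))
for i, (gp, gd) in enumerate(gold):
    for j, (pp, pd) in enumerate(pred):
        if gp == pp:
            weights[i, j] = 10 if gd == pd else 1

# 1-to-1 matching maximizing the total weight.
rows, cols = linear_sum_assignment(weights, maximize=True)
for i, j in zip(rows, cols):
    if weights[i, j] > 0:
        print(f"gold zero {gold[i]} matched to predicted zero {pred[j]}")
# gold zero (9, 'nsubj') matched to predicted zero (9, 'nsubj')
```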
The submissions to the shared task are collected through CodaLab. You need to create a CodaLab account prior to your first submission. We suggest using your team name as the username when creating the account, because the username will be shown publicly in the CodaLab results.
We recommend submitting the outputs of your system to CodaLab already during the development phase, using the dev set as the input. By doing so, you can practice the submission process and prevent unexpected issues during the evaluation phase. The limits on the maximum number of submission trials are high enough to also share your intermediate results: 15 trials per day and 100 in total.
On the other hand, the evaluation phase limits the maximum number of submission trials more strictly: 2 trials per day and 10 in total. Multiple trials are allowed exclusively for resolving unexpected situations and definitely should not be used for systematic optimization of parameters or hyper-parameters of your model towards the scores shown by CodaLab.
Participants who have developed multiple coreference prediction systems are encouraged to submit their predictions separately, up to 3 systems per team, as long as the systems are different in some interesting ways (e.g. using different architectures, not just different hyper-parameter settings). In order to submit an additional system of yours, please create an additional team account at CodaLab.
The submission to CodaLab must be a zip file with the CoNLL-U files produced by your system residing in the root folder and with names identical to the names of the corresponding input files. The zip file for the development phase should contain the following 21 files (substitute dev with test for submissions in the evaluation phase).
ca_ancora-corefud-dev.conllu
cs_pcedt-corefud-dev.conllu
cs_pdt-corefud-dev.conllu
cu_proiel-corefud-dev.conllu
de_parcorfull-corefud-dev.conllu
de_potsdamcc-corefud-dev.conllu
en_gum-corefud-dev.conllu
en_litbank-corefud-dev.conllu
en_parcorfull-corefud-dev.conllu
es_ancora-corefud-dev.conllu
fr_democrat-corefud-dev.conllu
grc_proiel-corefud-dev.conllu
hbo_ptnk-corefud-dev.conllu
hu_korkor-corefud-dev.conllu
hu_szegedkoref-corefud-dev.conllu
lt_lcc-corefud-dev.conllu
no_bokmaalnarc-corefud-dev.conllu
no_nynorsknarc-corefud-dev.conllu
pl_pcc-corefud-dev.conllu
ru_rucor-corefud-dev.conllu
tr_itcc-corefud-dev.conllu
If this naming (and placement) convention is not observed, the scorer will not be able to pair the outputs with the inputs, and the outputs will not be scored. We also recommend checking the validity of your output files. However, even files that do not pass the validation tests will be considered for the evaluation and will contribute to the final score (provided the evaluation script does not fail on them).
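A minimal sketch for packaging the submission is shown below. It zips the CoNLL-U files from a hypothetical predictions/ directory into the root of the archive, keeping the required file names; substitute dev with test for the evaluation phase.

```python
import zipfile
from pathlib import Path

output_dir = Path("predictions")          # assumed location of your system outputs
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for conllu in sorted(output_dir.glob("*-corefud-dev.conllu")):
        # arcname keeps only the file name, so the files end up in the zip root.
        zf.write(conllu, arcname=conllu.name)
```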
All shared task participants are invited to submit their system description papers to the CRAC 2024 Workshop. Please submit your paper using SoftConf and select one of the "Shared Task paper (short/long)" options as its Submission Type. If accepted, the papers will be published in the workshop proceedings.
System description papers can have the form of long or short research papers, up to 8 pages of content for long papers and up to 4 pages of content for short papers, plus an unlimited number of pages for references in both cases. For the formatting instructions, please follow the instructions for the other CRAC papers.
The identity of the authors of the participating systems is known, and thus it is not required to make the submissions anonymous.
Accepted papers will be presented in a dedicated session of the CRAC 2024 Workshop (the Seventh Workshop on Computational Models of Reference, Anaphora and Coreference) held at EMNLP 2024, and published in the conference proceedings.
Training, development, and test datasets are subject to license agreements specified individually for each dataset in the public edition of the CorefUD 1.2 collection (which, in turn, are the same as license agreements of the original resources before CorefUD harmonization). In all cases, the licenses are sufficient for using the data for the CRAC 2024 shared task purposes. However, the participants must check the license agreements in case they want to use their trained models also for other purposes; for instance, usage for commercial purposes is prohibited with several CorefUD datasets as they are available under CC BY-NC-SA.
Whenever using the CorefUD 1.2 collection (inside or outside this shared task), please cite it as follows:
@misc{11234/1-5478,
  title = {Coreference in Universal Dependencies 1.2 ({CorefUD} 1.2)},
  author = {Popel, Martin and Nov{\'a}k, Michal and {\v Z}abokrtsk{\'y}, Zden{\v e}k and Zeman, Daniel and Nedoluzhko, Anna and Acar, Kutay and Bamman, David and Bourgonje, Peter and Cinkov{\'a}, Silvie and Eckhoff, Hanne and Cebiro{\u g}lu Eryi{\u g}it, G{\"u}l{\c s}en and Haji{\v c}, Jan and Hardmeier, Christian and Haug, Dag and J{\o}rgensen, Tollef and K{\aa}sen, Andre and Krielke, Pauline and Landragin, Fr{\'e}d{\'e}ric and Lapshinova-Koltunski, Ekaterina and M{\ae}hlum, Petter and Mart{\'{\i}}, M. Ant{\`o}nia and Mikulov{\'a}, Marie and N{\o}klestad, Anders and Ogrodniczuk, Maciej and {\O}vrelid, Lilja and Pamay Arslan, Tu{\u g}ba and Recasens, Marta and Solberg, Per Erik and Stede, Manfred and Straka, Milan and Swanson, Daniel and Toldova, Svetlana and Vad{\'a}sz, No{\'e}mi and Velldal, Erik and Vincze, Veronika and Zeldes, Amir and {\v Z}itkus, Voldemaras},
  url = {http://hdl.handle.net/11234/1-5478},
  note = {{LINDAT}/{CLARIAH}-{CZ} digital library at the Institute of Formal and Applied Linguistics ({{\'U}FAL}), Faculty of Mathematics and Physics, Charles University},
  copyright = {Licence {CorefUD} v1.2},
  year = {2024}
}
For a more general reference to CorefUD harmonization efforts, please cite the following LREC paper:
@inproceedings{biblio8283899234757555533,
  author = {Anna Nedoluzhko and Michal Novák and Martin Popel and Zdeněk Žabokrtský and Amir Zeldes and Daniel Zeman},
  year = {2022},
  title = {CorefUD 1.0: Coreference Meets Universal Dependencies},
  booktitle = {Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022)},
  pages = {4859--4872},
  publisher = {European Language Resources Association},
  address = {Marseille, France},
  isbn = {979-10-95546-72-6}
}
By submitting results to this competition, the participants consent to the public release of their scores at the CRAC 2024 workshop and in the associated proceedings, at the task organizers' discretion. Participants further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgment that the submission was erroneous or deceptive.
Charles University (Prague, Czechia): Anna Nedoluzhko, Michal Novák, Martin Popel, Zdeněk Žabokrtský, Daniel Zeman
Polish Academy of Sciences (Warsaw, Poland): Maciej Ogrodniczuk
University of West Bohemia (Pilsen, Czechia): Miloslav Konopík, Ondřej Pražák, Jakub Sido
You can send any questions about the shared task to the organizers via corefud@googlegroups.com.
The main differences between the three editions are as follows:
Inspired by the Universal Dependencies initiative (UD) [1], the coreference community has started discussions on establishing a universal annotation scheme and using it to harmonize existing corpora. The discussions at the CRAC 2020 workshop led to proposing the Universal Anaphora initiative. One of the lines of effort related to Universal Anaphora resulted in CorefUD, which is a multilingual collection of coreference data resources harmonized under a common scheme [2]. The current public release of CorefUD 1.2 contains 21 datasets for 15 languages, namely Ancient Greek, Ancient Hebrew, Catalan, Czech (2×), English (3×), French, German (2×), Hungarian (2×), Lithuanian, Norwegian (2×), Old Church Slavonic, Polish, Russian, Spanish, and Turkish. The CRAC 2024 shared task deals with coreference resolution in all these languages. It is the 3rd edition of the shared task; findings of the first and second edition can be found in [8] and [9], respectively.
References
This shared task is supported by the Grants No. 20-16819X (LUSyD) of the Czech Science Foundation, UNCE24/SSH/009, and LM2023062 (LINDAT/CLARIAH-CZ) of the Ministry of Education, Youth, and Sports of the Czech Republic.