Principal investigator (ÚFAL):
Coreference resolution is the task of determining which expressions in a text refer to the same entity. For example, in the sentence "I bought a new car; it is red.", both "car" and "it" refer to the same thing. One could equally argue that the coreferring expressions are "a new car" and "it", which shows that even the boundaries of a referring expression are open to interpretation and hints at the complexity of the task. Any Natural Language Understanding application is expected to resolve what different expressions refer to, and which expressions refer to the same entity.
There have been many approaches to resolving coreference and, as can be expected, many datasets have been created to train systems and test their performance. However, these datasets are not consistent in the annotation schemes or the linguistic guidelines that they follow. This leaves the resources fragmented and difficult to use for training a cohesive multilingual system.
In this proposal, we focus on a) formalizing the guidelines of a uniform multilingual scheme for annotating and representing coreference and b) converting existing datasets to this uniform scheme in an automated way. The result will be a multilingual dataset that is better suited than existing datasets for training and evaluating systems that automatically detect and resolve coreference.
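As an illustration of the kind of automated conversion meant in b), the following minimal sketch assumes a source dataset that annotates coreference as pairwise links (anaphor, antecedent) and a target representation that groups mentions into entity clusters. The `Mention` class, the `links_to_clusters` function, and the token-span encoding are hypothetical and chosen for illustration only; they do not describe the scheme that the project will define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """Hypothetical mention: a span of token indices (inclusive)."""
    start: int
    end: int


def links_to_clusters(links):
    """Group mentions connected by pairwise links into entity clusters."""
    parent = {}

    def find(m):
        # Union-find lookup with path halving.
        parent.setdefault(m, m)
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for anaphor, antecedent in links:
        # Merge the anaphor's cluster into the antecedent's cluster.
        parent[find(anaphor)] = find(antecedent)

    clusters = {}
    for m in parent:
        clusters.setdefault(find(m), []).append(m)
    # Order mentions inside each cluster by their position in the text.
    return [sorted(c, key=lambda m: (m.start, m.end)) for c in clusters.values()]


# The example sentence from above, tokenized by hand (an assumption).
tokens = ["I", "bought", "a", "new", "car", ";", "it", "is", "red", "."]
links = [(Mention(6, 6), Mention(2, 4))]  # "it" -> "a new car"
clusters = links_to_clusters(links)

for cluster in clusters:
    print([" ".join(tokens[m.start:m.end + 1]) for m in cluster])
```

Cluster-based output of this kind does not depend on which antecedent an annotator happened to link, which is one reason a uniform scheme might prefer it; the actual design decision is part of the proposed work.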