Adapting Uniform Meaning Representation (UMR) for the Italic/Romance languages

Principal investigator (ÚFAL):

Federica Gamba

Project Manager (ÚFAL):

Hana Kubištová

Provider:

GAUK

Grant id:

104924

Duration:

2024-2026

Tags:

Data

Semantics

People:

Daniel Zeman

The project centers on developing and extending the Uniform Meaning Representation (UMR) framework to cover more languages, including ones that are not yet represented. UMR is a meaning representation framework that originates from Abstract Meaning Representation (AMR), which was initially designed for English; UMR extends it to other languages, with a special focus on morphologically complex and low-resource languages. It also enhances AMR in several ways, for instance by devising a strategy to capture linguistic phenomena that extend beyond sentence boundaries, such as coreference.
The project has three main objectives. First, we intend to refine the UMR annotation guidelines, which are often incomplete, unclear, or underspecified, and which remain skewed towards English despite their stated cross-linguistic approach. We intend to make them more adaptable to other languages. The focus will be primarily on the Italic languages of the Indo-European family, with Latin as a starting point; we will also draw on Romance languages (above all Italian, Spanish, and French) to pursue cross-linguistic coverage, and we will collaborate closely with a team working on Czech.
Second, we plan to release an annotated dataset for Latin, following a two-pronged approach. A sample of data will be annotated manually, which will also serve as a testbed for refining the guidelines. In addition, we will explore strategies for the (semi-)automatic extraction of UMR annotations from existing language resources. The goal is to provide valuable data for downstream machine learning and natural language understanding applications.
Third, the project focuses on enhancing language resources that encode predicate-argument structure, a critical component of UMR annotation. Two separate resources are currently available for Latin, but both have already revealed substantial limitations. The project aims to merge and improve them, making them more comprehensive and better suited to UMR annotation, most probably resulting in the release of a new lexical resource.
In essence, this research project aims to strengthen UMR's cross-linguistic capabilities, addressing language-specific challenges while broadening its applicability to a wide range of languages. It is a comprehensive effort to contribute to the field of meaning representation, and thereby to take another step towards the understanding of natural language, which is, after all, the ultimate goal of language processing.

Institute of Formal and Applied Linguistics

Charles University, Czech Republic
Faculty of Mathematics and Physics
