Uniform Meaning Representation (UMR) for Latin
Uniform Meaning Representation (UMR) is a meaning representation framework designed to annotate the semantic content of a text. It is primarily based on Abstract Meaning Representation (AMR), but was explicitly developed with cross-linguistic scope in mind. UMR aims to extend AMR to other languages, and in particular to morphologically complex, possibly low-resource languages, by adjusting the AMR schema to make it more cross-linguistically applicable. UMR also adds new semantic coverage to the schema by representing tense, aspect, modality, and scope. It further enhances the representation with document-level dependency structures for linguistic phenomena that may extend beyond the sentence, such as temporal and modal relations and coreference. UMR is intended to be scalable, learnable, and cross-linguistically plausible, and it is designed to support both lexical and logical inference.
Data releases
-
UMR 1.0
This data release did not include any Latin data, but is listed here for completeness. It is available in the LINDAT/CLARIAH-CZ repository, and it contains data annotated by the U.S. team for six languages: Arapaho, Chinese, Kukama, English, Navajo, and Sanapaná.
-
UMR 2.0
This data release contains the first version of the manually annotated Latin data, together with the first version of the converted Czech data, which was also produced by the ÚFAL MFF UK team. It is available in the LINDAT/CLARIAH-CZ repository at TBA.
Project objectives
This project aims to explore how to adapt the UMR framework to a wider range of languages, namely Latin and the Romance languages.
In particular,
-
it places a special focus on historical languages, which are absent from the UMR 1.0 collection. The manual annotation of a sample of Latin data contributes to this objective.
-
it investigates strategies for the (semi-)automatic extraction of UMR annotations from syntactic resources, such as Universal Dependencies.
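As a rough illustration of the second objective, the sketch below projects a few Universal Dependencies relations onto draft role labels and prints a PENMAN-style skeleton. It is not the project's conversion procedure: the relation-to-role table, the variable naming scheme, the function names, and the hand-written CoNLL-U fragment are all assumptions made for this example, and the output is only a starting point that would still need manual correction and the remaining UMR attributes (aspect, modality, document-level relations).

```python
# Hypothetical, deliberately naive mapping from UD relations to draft role labels.
# Real conversion needs predicate senses and valency information; this is only a sketch.
UD_TO_DRAFT_ROLE = {"nsubj": ":ARG0", "obj": ":ARG1", "iobj": ":ARG2"}

# Hand-written CoNLL-U fragment for illustration (not taken from the released data):
# "Catilina urbem reliquit" ("Catiline left the city").
CONLLU = """\
1\tCatilina\tCatilina\tPROPN\t_\t_\t3\tnsubj\t_\t_
2\turbem\turbs\tNOUN\t_\t_\t3\tobj\t_\t_
3\treliquit\trelinquo\tVERB\t_\t_\t0\troot\t_\t_
"""


def parse_conllu(text):
    """Read the ID, LEMMA, HEAD and DEPREL columns of one CoNLL-U sentence."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        tokens.append({
            "id": int(cols[0]),
            "lemma": cols[2],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    return tokens


def draft_graph(tokens, sent_no=1):
    """Build a one-level PENMAN-style skeleton rooted in the UD root token."""
    root = next(t for t in tokens if t["head"] == 0)

    def var(tok):
        # Assumed naming scheme: sentence number plus token id.
        return f"s{sent_no}x{tok['id']}"

    parts = [f"({var(root)} / {root['lemma']}"]
    for tok in tokens:
        role = UD_TO_DRAFT_ROLE.get(tok["deprel"])
        if tok["head"] == root["id"] and role:
            parts.append(f"{role} ({var(tok)} / {tok['lemma']})")
    return " ".join(parts) + ")"


print(draft_graph(parse_conllu(CONLLU)))
```

Only exact relation matches are mapped, so subtypes such as `nsubj:pass` are left out rather than being mislabelled; a realistic pipeline would treat them explicitly.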
Latin in UMR 2.0
The Latin sentences included in the UMR 2.0 release are sourced from a portion of the Latin Dependency Treebank (LDT) that was made available by the Index Thomisticus Project. Before releasing the data, that project corrected it at the syntactic layer and annotated it from scratch at the semantic/pragmatic layer. The sentences annotated in UMR are all taken from Sallust's De Coniuratione Catilinae ("Conspiracy of Catiline"). The annotation of the source LDT data follows that of the Prague Dependency Treebank (PDT), including a detailed annotation at the tectogrammatical level, which focuses on the syntactic-semantic properties of the language. This is particularly relevant to the UMR for Czech project, also carried out by the ÚFAL MFF UK team, which explores a (semi-)automatic conversion of PDT data into the UMR format.
Publications
-
Gamba, F. 2024. Predicate Sense Disambiguation for UMR Annotation of Latin: Challenges and Insights. In Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024), pages 19–29, Hybrid in Bangkok, Thailand and online. Association for Computational Linguistics.
-
Lopatková, M., Fučíková, E., Gamba, F., Štěpánek, J., Zeman, D., Zikánová, Š. 2024. Towards a Conversion of the Prague Dependency Treebank Data to the Uniform Meaning Representation. In Proceedings of the 24th Conference Information Technologies – Applications and Theory (ITAT 2024), pages 62–76, CEUR-WS.org, Košice, Slovakia.
Related projects
The development of Latin UMRs has been supported by the following projects:
-
Project GAUK No. 104924: Adapting Uniform Meaning Representation (UMR) for the Italic/Romance languages, 2024-2026.
This project focuses on developing UMRs for Latin and Romance languages through two primary approaches: manually annotating a sample of data and investigating strategies for the (semi-)automatic extraction of UMR annotations from existing linguistic resources.
-
Project LUSyD: Language Understanding: from Syntax to Discourse, GAČR EXPRO program, Project No. GX20-16819X.
This project carries out fundamental research on meaning representations in general, including the testing of various Natural Language Understanding tools, work on discourse, and the foundations of the SynSemClass event-type ontology. From the UMR perspective, it supports a basic understanding of UMR principles within the broader approach to meaning representations.
-
Project of the large research infrastructure LINDAT/CLARIAH-CZ, project No. LM2023062, MŠMT LRI program.
This project provides infrastructural support for hosting the data, tools, and services developed in the UMR project, as well as related resources. It also serves as the primary distribution repository for the data developed by the U.S. partner.
The work on UMR for Latin is also related to the following project:
-
Project UMR – Uniform Meaning Representation, No. LUAUS23283, in the Inter-Excellence II program (Inter-Action subprogram), 2023-2027.
The project primarily supports cooperation with the U.S. partner, preparation for release, manual checks, and work on the SynSemClass event-type ontology for application to UMR.