Uniform Meaning Representation for Czech

Tags:

Annotations, Coreference, Corpora, Data, Lexicons, Monolingual, Morphology, Semantics, Syntax, Valency

Uniform Meaning Representation (UMR) for Czech

For centuries, linguists have deliberated on how to represent meaning. In recent years, this question has been pursued not only as an intriguing theoretical challenge but also for its practical implications: meaning representation can serve as a basis for any system that requires sound and reliable knowledge representation to enable logical inference.

While numerous formalisms for meaning representation have been proposed in recent decades, this project focuses on two specific approaches: the meaning representation used in the Prague Dependency Treebank family (PDT) and the Uniform Meaning Representation. The choice of the first formalism is motivated by the availability of data for Czech, particularly the PDT-C treebank. This treebank provides the most comprehensive Czech data (almost 175.5 thousand sentences across different genres) with fine-grained annotation at the tectogrammatical level, capturing linguistically structured meaning. The second approach, Uniform Meaning Representation (UMR), offers significant potential to enhance the PDT-C representation in several key ways:

UMR provides a more abstract representation, which is less dependent on a specific language and its structure.
UMR anchors concepts within a knowledge base, utilizing resources like English Wikipedia or WikiData.
UMR aims to support logical inference, an aspect that lies beyond the scope of the PDT.
Furthermore, UMR is being used for a variety of typologically diverse languages, including Chinese, Arapaho, Navajo, Kukama, and Sanapaná. This approach and its rich data may help in understanding some features of the Czech language from a typological point of view.

Project Objective

The primary objective of the project is to explore the feasibility of a (semi-)automatic conversion of the PDT-C data into a format that adheres to the UMR specification. In particular, the project aims to identify:

Language phenomena that can be transferred relatively easily and reliably from the available Czech annotation to the UMR structures (e.g., sentence syntactic structure or coreference relations);
Phenomena that require specific treatment and detailed analysis but can still be transferred (e.g., modality or negation);
Phenomena that are unavailable in PDT-C and thus necessitate new annotations, either through automatic methods (utilizing advanced machine learning techniques) or through manual annotation (e.g., concept anchoring).

UMR Parsing Shared Task

Data Releases

UMR 2.2
This data release contains data prepared for the First Shared Task on UMR Parsing. It is available from the LINDAT/CLARIAH-CZ repository at http://hdl.handle.net/11234/1-6132.

UMR 2.1 (Czech and Latin)
This data release contains automatically converted data from the manually annotated corpora (Czech: PDT-C, Latin: LDT), a sample of manually annotated data and the corresponding automatically converted data used for comparison, and sample data annotated in parallel by two different annotators (only for Czech). It is available from the LINDAT/CLARIAH-CZ repository at http://hdl.handle.net/11234/1-5951.

UMR 2.0
This data release contains, in addition to the languages from UMR 1.0, the first version of the Czech data converted from PDT-C, as well as manually prepared Latin data, also from the ÚFAL MFF UK team. It is also available from the LINDAT/CLARIAH-CZ repository at http://hdl.handle.net/11234/1-5902. The Czech conversion is described here:
- Lopatková, M., Fučíková, E., Gamba, F., Hajič, J., Hledíková, H., Mikulová, M., Novák, M., Štěpánek, J., Zeman, D., Zikánová, Š.: UMR 2.0 - Czech: Release Notes, ÚFAL TR-2025-74, 2025 (on-line version)

UMR 1.0
This data release does not yet contain Czech data; it is listed here for completeness. It is available from the LINDAT/CLARIAH-CZ repository at http://hdl.handle.net/11234/1-5198. It contains all the data annotated by the U.S. team.

Publications and Presentations

Štěpánek, J., Zeman, D., Lopatková, M., Gamba, F., Hledíková, H., Xue, N.: First Shared Task on UMR Parsing. Accepted for the Seventh International Workshop on Designing Meaning Representations (DMR 2026).
Hledíková, H., Gamba, F., Lopatková, M., Štěpánek, J.: Towards Consistent UMR Annotation of Deverbal Nouns: Evidence from Czech and Latin. Accepted for the Seventh International Workshop on Designing Meaning Representations (DMR 2026).
Lopatková, M., Hledíková, H., Štěpánek, J., Zeman, D.: From the Prague Dependency Treebank to the Uniform Meaning Representation: Gold-Standard Czech UMR Data and Partial Automatic Conversion. In Proceedings of the 25th Conference Information Technologies – Applications and Theory (ITAT 2025), CEUR-WS.org, Košice, Slovakia, p. 179-190, 2025.
Štěpánek, J., Zeman, D., Lopatková, M., Gamba, F., Hledíková, H.: Comparing Manual and Automatic UMRs for Czech and Latin. In Proceedings of the Sixth International Workshop on Designing Meaning Representations (DMR 2025), pages 1-12, Prague, Czechia. Association for Computational Linguistics, 2025.
Lopatková, M., Fučíková, E., Gamba, F., Hajič, J., Hledíková, H., Mikulová, M., Novák, M., Štěpánek, J., Zeman, D., Zikánová, Š.: UMR 2.0 - Czech: Release Notes, ÚFAL TR-2025-74, 2025 (on-line version)
Hajič, J.: Towards a Conversion of the Prague Dependency Treebank Data to the Uniform Meaning Representation. Presentation at the UMR meeting 2024. (slides)
Lopatková, M., Fučíková, E., Gamba, F., Štěpánek, J., Zeman, D., Zikánová, Š.: Towards a Conversion of the Prague Dependency Treebank Data to the Uniform Meaning Representation. In Proceedings of the 24th Conference Information Technologies – Applications and Theory (ITAT 2024), CEUR-WS.org, Košice, Slovakia, p. 62-76, 2024. (slides)
Hajič, J., Fučíková, E., Lopatková, M., Urešová, Z.: Mapping Czech Verbal Valency to PropBank Argument Labels. In Proceedings of the Fifth International Workshop on Designing Meaning Representations (DMR 2024), LREC-COLING 2024, ELRA Language Resource Association, p. 88-100, 2024. (poster)

Related Projects

The development of Czech UMR has been supported by the following projects:

Project UMR – Uniform Meaning Representation, No. LUAUS23283, in the Inter-Excellence II program (Inter-Action subprogram), 2023-2027
The project primarily supports cooperation with the U.S. partner, preparation for release, manual checks, and work on the SynSemClass event-type ontology for its application to UMR.

Project LUSyD: Language Understanding: from Syntax to Discourse, GAČR EXPRO program, Project No. GX20-16819X
This project covers fundamental research on meaning representations in general, including testing of various Natural Language Understanding tools, work on discourse, and the foundations of the SynSemClass event-type ontology. From the UMR perspective, it supports a basic understanding of the UMR principles within the broader approach to meaning representations.

Project of the large research infrastructure LINDAT/CLARIAH-CZ, project No. LM2023062, MŠMT LRI program
This project provides infrastructure support for hosting the data, tools, and services developed in the UMR project, as well as related resources. It also serves as the primary distribution repository for the data developed by the U.S. partner.

The UMR for Czech is also related to the following project:

Adapting Uniform Meaning Representation (UMR) for the Italic/Romance languages, project No. 104924, GAUK
