14:00–14:20
Minoo Nassajian
Uniform Meaning Representation for a Low Resource Language (Persian)
Abstract: Uniform Meaning Representation (UMR), which builds primarily on Abstract Meaning Representation (AMR), is a framework that provides a consistent semantic representation across languages, facilitating better understanding and processing of multilingual data. The framework also offers considerable detail on representing low-resource languages that are typologically quite distinct from languages like English: by abstracting away from language-specific syntax and focusing on the underlying meaning, it enables better processing of the semantics of languages with limited data. The current research applies the UMR framework to Persian for the first time. Persian is an Indo-European language with rich morphology. This research can therefore not only boost NLP capabilities for Persian, but also advance the wider field of multilingual semantic representation and provide a valuable resource for future research in Persian linguistics and computational linguistics, including applications such as information extraction. UMR comprises both a sentence-level representation that focuses on predicate-argument structures and a document-level representation that captures semantic relations beyond sentence boundaries.
14:20–14:40
Tomáš Polák
Legal clarity: rules and LLMs
Abstract: Comprehensibility is a key quality of any legal text. Traditional readability metrics, such as the Flesch-Kincaid and Gunning-Fog indices, are simple and objective but offer limited utility in assessing comprehensibility. More complex comprehensibility rules are better at identifying lexical and syntactic features that undermine legal clarity. Among other features, the PONK tool applies a set of such rules to flag problematic text segments. However, it relies on resource-intensive morphological and syntactic annotations and a complex algorithmic implementation. An alternative, a purely LLM-based approach without comprehensibility rules in the instructions, requires neither, but is less stable, predictable, and explainable. Instructing and constraining LLMs with comprehensibility rules could address these limitations and offset the costs of a rule-based approach. We test this hybrid approach by comparing span annotations generated by rule-prompted LLMs with those produced by the PONK tool's rule engine. Evaluating the performance and stability of this hybrid approach is a critical step towards developing a reliable, low-effort comprehensibility metric with an easily extensible set of rules.
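As a rough illustration of the traditional metrics the abstract contrasts with rule-based tools (this sketch is not part of PONK, and syllable counts are taken as input since reliable syllabification is language-specific), the Flesch-Kincaid grade level can be computed as follows:

```python
# Illustrative sketch of one traditional readability metric.
# Syllable counting is language-specific, so counts are passed in directly.

def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level: higher scores mean harder-to-read text."""
    return (0.39 * (n_words / n_sentences)
            + 11.8 * (n_syllables / n_words)
            - 15.59)

# e.g. a text with 100 words, 5 sentences, and 150 syllables:
print(round(flesch_kincaid_grade(100, 5, 150), 2))  # 9.91
```

The formula uses only surface counts (sentence length and syllables per word), which is exactly why such metrics are simple and objective yet blind to the lexical and syntactic features that comprehensibility rules target.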
14:40–15:00
Gianluca Vico
Alignable Tokenization
Abstract: Tokenization is the first step in many natural language processing pipelines, but low-resource languages are often over-tokenized, reducing downstream task efficiency.
We use parallel data to train a tokenizer that optimizes semantic subword alignment, extending SentencePiece by conditioning on the source sentence. We test it on several language pairs and evaluate it both intrinsically and on downstream tasks, namely machine translation and language modelling. The language pairs are French-Italian (high-resource), Czech-Ukrainian (medium-resource), and Italian-Maltese and German-Upper Sorbian (low-resource).
The intrinsic evaluation includes parity, fertility, and one-to-one alignability. The downstream tasks use chrF++ and byte-level perplexity.
Preliminary results show that this tokenizer matches SentencePiece on the intrinsic evaluation but performs similarly or worse on the translation tasks.
Future work includes studying noise-robust tokenizers and conducting a quantitative survey of tokenizers.
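Two of the intrinsic metrics mentioned in the abstract can be sketched roughly as follows (a minimal illustration, not the authors' implementation; `tokenize` stands in for any subword tokenizer such as SentencePiece):

```python
# Hypothetical sketch of two intrinsic tokenizer metrics.
# `tokenize` is any function mapping a sentence string to a token list.

def fertility(sentences, tokenize):
    """Average number of subword tokens per whitespace-separated word."""
    total_tokens = sum(len(tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

def parity(src_sentences, tgt_sentences, tokenize):
    """Ratio of token counts over parallel data; a value near 1.0 means
    both sides of the corpus are segmented into similar token counts."""
    src_tokens = sum(len(tokenize(s)) for s in src_sentences)
    tgt_tokens = sum(len(tokenize(s)) for s in tgt_sentences)
    return src_tokens / tgt_tokens

# Toy example with a character-level "tokenizer":
toy = lambda s: list(s.replace(" ", ""))
print(fertility(["ab cd"], toy))  # 4 tokens / 2 words = 2.0
```

High fertility on a low-resource language relative to a high-resource one is the over-tokenization problem the talk addresses; parity makes the cross-lingual imbalance directly measurable on parallel data.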