Monday, 6 May, 2024 - 14:00

ÚFAL PhD conference

Christopher Brückner, Vojtěch Lanz, Michal Olbrich, Adam Štefunko (ÚFAL MFF UK)


Christopher Brückner

Information Extraction from Domain-Specific Data

Abstract: Information extraction is the task of automatically extracting structured information from unstructured data. This information includes entities explicitly mentioned in the text as well as more implicit, abstract subjects. The recent increase in digitized historical documents has given rise to a challenging NLP domain characterized by noisy data and evolving languages.

This talk presents plans to annotate multilingual historical text segments according to a domain-specific hierarchical ontology. Planned approaches include named entity linking, hierarchical classification of abstract subjects, and experiments with summarization models that aim to solve both of these tasks jointly.




Vojtěch Lanz

Document-level information extraction in the medical domain for low-resource languages 

Abstract: Besides patient care, doctors and nurses spend a lot of time on administrative tasks, such as manually writing and reading discharge summaries and extracting important information from them. We aim to facilitate this work by enabling efficient extraction of information from discharge summaries in various European languages. This requires an adequate amount of multilingual data, yet publicly available clinical data is currently mostly in English. In our work, we strive to secure enough multilingual clinical data and to solve NLP tasks such as Question Answering, Text Classification, or Passage Retrieval using language models. The goal is to explore the behavior of both encoder-based and decoder-based models and to report our observations about the LLMs. Besides introducing the dissertation topic, related work, plans, and challenges, we also present the results of our first experiments on question answering in low-resource languages.




Michal Olbrich

Computational Models of Competition in Natural Languages

Abstract: Competition in natural languages is a widely researched topic. Approaches range from purely linguistic ones to those using mathematical models. We will investigate the use of mathematical models of population dynamics to describe the relations and historical changes attested in diachronic data. Although there have been attempts to use such models to describe certain linguistic phenomena, none of them has gone beyond stating hypotheses in a purely theoretical manner or applied the models to a corpus of diachronic data. Models of population dynamics have proven useful in other fields, such as biology and economics. Testing these models on large-scale language data will help us understand processes of language evolution and confirm or reject specific linguistic hypotheses about competition in natural languages.
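
As an illustration of what a population-dynamics model of competition can look like, the following is a minimal sketch of a two-variant Lotka-Volterra competition model. The abstract does not name a specific model, so the choice of model, the function name simulate_competition, and all parameter values are assumptions made purely for illustration. Here x and y stand for the relative usage of two competing forms, and a_xy and a_yx express how strongly each form suppresses the other.

```python
def simulate_competition(x0, y0, r_x, r_y, a_xy, a_yx, steps=20000, dt=0.01):
    """Euler integration of a two-variant Lotka-Volterra competition model.

    x, y       : relative usage of two competing forms (carrying capacity 1)
    r_x, r_y   : intrinsic growth rates (hypothetical values)
    a_xy, a_yx : competition coefficients; a_xy is the effect of y on x
    """
    x, y = x0, y0
    trajectory = [(x, y)]
    for _ in range(steps):
        # Each variant grows logistically and is slowed down by its competitor.
        dx = r_x * x * (1.0 - x - a_xy * y)
        dy = r_y * y * (1.0 - y - a_yx * x)
        x += dt * dx
        y += dt * dy
        trajectory.append((x, y))
    return trajectory


# Illustrative run: the initially rare form x is only weakly suppressed by y
# (a_xy < 1) while it strongly suppresses y (a_yx > 1).
trajectory = simulate_competition(x0=0.1, y0=0.9, r_x=1.0, r_y=1.0, a_xy=0.5, a_yx=1.5)
final_x, final_y = trajectory[-1]
print(f"final usage: x={final_x:.3f}, y={final_y:.3f}")
```

With these assumed coefficients the initially rare form gradually replaces the dominant one, which is the kind of trajectory that could in principle be compared against frequencies observed in a diachronic corpus.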




Adam Štefunko

Data-driven modeling of musical harmony and improvised accompaniment

Abstract: Harmony is a key element of any musical piece, and it is directly associated with accompanying a melody or a bass line. In many musical styles, including Baroque music and jazz, accompaniment is very often improvised, usually but not always following written-out cues. Accompaniment necessarily relies on knowledge of the given musical style and its harmonic system, but it also reflects the accompanist's personal taste. In the talk, we will discuss how musical accompaniments can serve as interesting data that can be analyzed using NLP methods and then used to build a data-driven model capable of evaluating a given accompaniment from a stylistic viewpoint and of generating a new accompaniment. We illustrate this with the example of basso continuo, the practice of improvising harmonic accompaniments over a given bass line, typical of Baroque music, and of partimento, the pedagogical tradition of doing so without an accompanied voice present. Existing computational models of continuo mostly employ hard-coded procedures based on common rules extracted from historical treatises and textbooks. They do not easily scale to different styles of realization and are not designed to describe live realizations at all. We show a pilot example-driven basso continuo evaluation and generation model based on pattern matching and present our future plans for the topic.