Monday, 17 April, 2023 - 14:00
Room: S1

ÚFAL PhD conference

Hana Hledíková
Josef Jon
Jiří Mayer (ÚFAL MFF UK)

14:00-14:25 

Hana Hledíková

Morphematic structure of complex verbs and its relationship to valency

Abstract: Both the word formation of verbs and their valency structure are theoretically well described and captured in valency lexicons and derivational networks (e.g., Vallex, FrameNet, PropBank; DeriNet, Universal Derivations, Universal Segmentations). However, the interaction between the two is less well researched. This talk focuses on how to combine available corpora and language resources on word formation and valency in order to analyse this interaction on larger language data, also taking into account the distribution of complex verbs across the frequency spectrum. I will describe the process of preparing data for analysing the relationship between verbs’ morphematic complexity, valency and frequency in four languages – Czech, English, German and Spanish – and the problems connected with it, such as the comparability of the corpora used, lemmatization, morphematic segmentation, and the limited availability of lower-frequency verbs in valency lexicons. These problems become especially challenging when dealing with multiple languages.

14:25-14:50

Josef Jon

Exploring Diversity in Machine Translation

Abstract: Current approaches to machine translation (and to natural language processing in general) often struggle with text that is in some way creative, atypical, eccentric, or outlandish. This is caused by the models, algorithms and data used. In neural machine translation (NMT), the problem has two parts: processing diverse input (often related to the robustness of the NMT system) and producing an adequate (and similarly creative) output. We can even consider a long-term risk arising from the popularity of current methods, which can induce a feedback loop towards ordinary, stereotypical and uniform language: users will simplify their formulations because these are processed better by automatic tools, which in turn will lead to further proliferation of mundane data.

The aim of the thesis is to explore diversity as a specific quality of MT in the following areas. Firstly, we are interested in NMT systems capable of processing text that is already diverse or atypical. Secondly, we want to produce similarly atypical, diverse but correct translations. Thirdly, translation diversity is an aspect that crucially affects current MT evaluation methods based on similarity to the reference translation.

14:50-15:15

Jiří Mayer

End-to-end full-page optical music recognition utilizing synthetic training data

Abstract: Optical music recognition (OMR) is the process of converting scans of printed or handwritten music notation into a machine-readable format. It has many similarities to optical character recognition (reading text), but remains unsolved for a number of reasons. Firstly, written music (unlike text) is not strictly sequential. For example, piano music often has at least two simultaneous voices (the bass and the melody). This creates problems in deep learning, as most image-to-sequence models expect the input to be just a single sequence. Secondly, this encoding-related complexity, coupled with the comparatively small interest in the field, means that very little training data is available.

In this talk, I will present how we plan to overcome some of these challenges using data synthesis and how we intend to use such artificial data to train modern end-to-end image-to-sequence deep learning models, with a primary focus on polyphonic music and the extensibility of the synthesizer.


*** The talks will be delivered in person (MFF UK, Malostranské nám. 25, 4th floor, room S1) and will be streamed via Zoom. For details on how to join the Zoom meeting, please write to sevcikova et ufal.mff.cuni.cz ***