Monday, 23 February, 2026 - 14:00
Room: S1

Measuring Syntactic Complexity in a Multilingual Corpus: Cross-linguistic and Genre Variation

Olga Nádvorníková
Alexandr Rosen (FF UK)

This presentation introduces syntactic complexity metrics (SCMs) newly available in InterCorp, a large multilingual corpus annotated with Universal Dependencies, and demonstrates their application in research on language and genre variation. The SCMs are computed for individual sentences and texts, offering researchers a way to quantify structural properties of language use across a wide variety of languages and registers. Beyond simple frequency counts, SCMs capture dimensions such as clausal embedding, phrasal expansion, and dependency distance, thereby providing a richer picture of syntactic organization. To illustrate their application, we report on a contrastive study involving 17 languages and four textual genres. The analysis of six SCMs reveals systematic correlations that cluster into clausal and phrasal measures, while mean dependency distance emerges as particularly sensitive to cross-linguistic variation. Using random forest classification, we show that SCMs reliably predict genre, with NP-related measures ranking highest, whereas mean dependency distance and its standard deviation provide the best discrimination among languages. Patterns of misclassification further point to affinities between languages, such as the proximity of English to Romance, previously observed in lexical studies. By linking corpus annotation to empirical findings, the presentation demonstrates how SCMs can inform contrastive linguistics, translation studies, register analysis, and L1/L2 research.

*** The talk will be delivered in person (MFF UK, Malostranské nám. 25, 4th floor, room S1) and will be streamed via Zoom. For details on how to join the Zoom meeting, please write to sevcikova et ufal.mff.cuni.cz ***