Monday, 25 March, 2024 - 14:00

The ParlaMint project: developing comparable corpora of parliamentary debates in Europe

The talk presents the results of the ParlaMint project, which comprise comparable corpora of parliamentary debates from 29 European countries and autonomous regions, covering at least the period from 2015 to 2022 and containing over 1 billion words. The corpora are uniformly encoded, contain rich metadata about their 24 thousand speakers, and are linguistically annotated up to the level of Universal Dependencies syntax and named entities. We present the compilation of the corpora, including the encoding infrastructure, the use of GitHub, the production of the individual corpora, the common pipeline for producing their distribution, and the use of CLARIN services for dissemination. We then introduce the latest additions to the corpora: metadata localisation; new metadata, such as the political orientation of political parties; machine translation of the corpora into English and its tagging with semantic classes; and the production of pilot speech corpora. Finally, we discuss outreach activities and further work.


Tomaž Erjavec is a senior researcher at the Department of Knowledge Technologies, Jožef Stefan Institute, and at the Fran Ramovš Institute of the Slovenian Language at the Scientific Research Centre of the Slovenian Academy of Sciences and Arts. He works in language technologies and digital humanities, focusing on the development of language resources, especially their annotation and encoding. He is the national coordinator of the Slovenian node of the CLARIN research infrastructure for language resources and tools. He was a member of ISO/TC 37/SC 4 (Language resource management), of the Council of the Text Encoding Initiative Consortium, and of the Board of the European Chapter of the Association for Computational Linguistics, and was the founding president of the Slovenian Language Technologies Society.