ParCzech

ParCzech is a project on compiling Czech parliamentary data into annotated corpora.

May 7, 2020

We are extremely pleased that the very first ParCzech corpus has been published. It is called ParCzech PS7 1.0, and this TEI-encoded corpus consists of the stenographic protocols that record the meetings of the Chamber of Deputies (PS) held in its 7th term, from 2013 to 2017. The audio recordings are available as well. The corpus is automatically enriched with morphological and named-entity annotations produced by the tools MorphoDiTa and NameTag.

Download

To download the data, please visit the LINDAT/CLARIAH-CZ repository:

  • ParCzech PS7 1.0
    • parczech-ps7-1.0-raw.tar.gz (stenoprotocols converted into TEI-derived coding and split into speeches, links to audio files included)
    • parczech-ps7-1.0-annotated.tar.gz (stenoprotocols tokenized and processed by NameTag)
    • parczech-ps7-1.0-audio-DDD.tar (MP3 audio recordings; DDD stands for the file number; one file may contain stenoprotocols from more than one meeting, but no meeting is split across more than one archive)
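
Once an archive from the list above has been downloaded, it can be unpacked with any standard tar tool. The following is only a minimal Python sketch using the standard-library tarfile module; the helper name extract_archive and the destination directory are illustrative assumptions, not part of the ParCzech distribution, and the internal layout of the archives is not assumed here.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar or .tar.gz archive into dest_dir and return the member names.

    Hypothetical helper for illustration; not shipped with ParCzech.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression itself
    # (none for the .tar audio archives, gzip for the .tar.gz ones).
    with tarfile.open(archive_path, "r:*") as tar:
        members = tar.getmembers()
        for member in members:
            # Refuse entries whose resolved path would land outside dest_dir.
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest, members=members)
    return [m.name for m in members]
```

For example, assuming parczech-ps7-1.0-raw.tar.gz has already been saved to the working directory, calling extract_archive("parczech-ps7-1.0-raw.tar.gz", "parczech-ps7-1.0-raw") would unpack it into a directory of the same name and return the list of contained paths.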

Search the corpus in the KonText service at LINDAT/CLARIAH-CZ

Cite

To properly acknowledge ParCzech PS7 1.0, please cite the following data item in the LINDAT/CLARIAH-CZ repository (txt, BibTeX):


Hladká, Barbora; Kopp, Matyáš and Straňák, Pavel, 2020, ParCzech PS7 1.0, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-3174.

@misc{11234/1-3174,
 title = {{ParCzech} {PS7} 1.0},
 author = {Hladk{\'a}, Barbora and Kopp, Maty{\'a}{\v s} and Stra{\v n}{\'a}k, Pavel},
 url = {http://hdl.handle.net/11234/1-3174},
 note = {{LINDAT}/{CLARIAH}-{CZ} digital library at the Institute of Formal and Applied Linguistics ({{\'U}FAL}), Faculty of Mathematics and Physics, Charles University},
 copyright = {Public Domain Dedication ({CC} Zero)},
 year = {2020} }

Publications

  • Hladká Barbora, Kopp Matyáš and Straňák Pavel. Compiling Czech Parliamentary Stenographic Protocols into a Corpus. In Proceedings of the LREC 2020 Workshop on Creating, Using and Linking of Parliamentary Corpora with Other Types of Political Discourse (ParlaCLARIN II), Darja Fišer, Maria Eskevich, Franciska de Jong (eds.), pp. 18–22, 2020.

Acknowledgements

This work has been using language resources and tools developed and/or stored and/or distributed by the LINDAT/CLARIAH-CZ project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2018101).

Contact

Barbora Hladká

Matyáš Kopp

Pavel Straňák