Sources in iRozhlas 2.0 (SiR 2.0) is a collection of newspaper articles annotated partly by experts and partly through a crowdsourced annotation task. The annotations cover citation signals, citation sources, the classes of the sources, and the links between sources and signals.
SiR 2.0 is an update of SiR 1.0, bringing expert annotation to a part of the data.
SiR 2.0 was published in April 2026 in the LINDAT/CLARIAH-CZ repository.
The documents published in SiR 2.0 come from iRozhlas, a news server of the Czech public radio.
Jiří Mírovský (Charles University, Faculty of Mathematics and Physics),
Barbora Hladká (Charles University, Faculty of Mathematics and Physics),
Matyáš Kopp (Charles University, Faculty of Mathematics and Physics),
Václav Moravec (Charles University, Faculty of Mathematics and Physics)
SiR 2.0 is an annotated corpus of Czech articles from iRozhlas, a news server of the Czech public radio. It is a collection of 1 718 articles (42 890 sentences, 614 995 words) with manually annotated attribution. SiR 2.0 is an update of the prior release, SiR 1.0, and brings expert annotation to two (out of three) of the original data parts.
For example, in the sentence:
Premiér prohlásil, že platy vzrostou o 10 %.
[The prime minister has stated that salaries will increase by 10%.],
the citation phrase prohlásil [has stated] refers to the citation source premiér [the prime minister], who provided the information.
The sources are further classified into several classes of named (official political, official non-political, unofficial) and unnamed (anonymous, partially anonymous) sources:
In the example sentence above, premiér [the prime minister] is an official political source.
The sources are classified independently of the provided information.
The corpus consists of two parts, depending on the origin of the annotations:
The data were annotated in the Brat tool and are distributed in the native Brat format, i.e. each article is represented by the original plain text and a stand-off annotation file. The annotation files carry three types of information:
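As an illustration of how such stand-off files can be read, the following is a minimal Python sketch based on the general Brat stand-off conventions (tab-separated lines whose IDs start with `T` for text spans, `A` for attributes, and `R` for relations). The label names used in the sample (`Source`, `Signal`, `SourceClass`, `SignalSource`) and the helper `parse_ann` are hypothetical and chosen only for this example; the labels actually used in SiR 2.0 may differ, so check the distributed annotation files before relying on them.

```python
from dataclasses import dataclass, field

# Example sentence from this page; the annotation below is a hand-made
# illustration, NOT taken from the corpus. All label names are assumptions.
SAMPLE_TEXT = "Premiér prohlásil, že platy vzrostou o 10 %."
SAMPLE_ANN = (
    "T1\tSource 0 7\tPremiér\n"
    "T2\tSignal 8 17\tprohlásil\n"
    "A1\tSourceClass T1 official-political\n"
    "R1\tSignalSource Arg1:T2 Arg2:T1\n"
)


@dataclass
class BratDoc:
    spans: dict = field(default_factory=dict)       # id -> (type, [(start, end), ...], text)
    attributes: list = field(default_factory=list)  # (type, target id, value)
    relations: list = field(default_factory=list)   # (type, {role: target id})


def parse_ann(ann_text: str) -> BratDoc:
    """Parse text spans, attributes and binary relations from Brat stand-off text."""
    doc = BratDoc()
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # T1<TAB>Type start end[;start end ...]<TAB>covered text
            ann_type, offsets = fields[1].split(" ", 1)
            fragments = [tuple(int(n) for n in frag.split())
                         for frag in offsets.split(";")]
            covered = fields[2] if len(fields) > 2 else ""
            doc.spans[ann_id] = (ann_type, fragments, covered)
        elif ann_id.startswith(("A", "M")):
            # A1<TAB>Type target [value]; attributes without a value are binary flags
            parts = fields[1].split()
            value = parts[2] if len(parts) > 2 else True
            doc.attributes.append((parts[0], parts[1], value))
        elif ann_id.startswith("R"):
            # R1<TAB>Type Arg1:T2 Arg2:T1
            parts = fields[1].split()
            args = dict(p.split(":", 1) for p in parts[1:])
            doc.relations.append((parts[0], args))
        # Other line types (notes, events, ...) are ignored in this sketch.
    return doc


doc = parse_ann(SAMPLE_ANN)
for ann_id, (ann_type, fragments, covered) in doc.spans.items():
    start, end = fragments[0]
    # Offsets are character positions in the plain-text file.
    assert SAMPLE_TEXT[start:end] == covered
    print(ann_id, ann_type, covered)
```

In practice, the text and annotation of one article would be read from the two files that represent it and passed to `parse_ann` in the same way; the offsets in the annotation file index into the plain text, which is why the sketch checks each span against the text.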
Please cite the following paper when using the corpus for your research:
Barbora Hladká, Jiří Mírovský, Matyáš Kopp, Václav Moravec: Annotating Attribution in Czech News Server Articles. In: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), pp. 1817–1823, Marseille, France, 20-25 June 2022.
You might also be interested in a related paper on automatic recognition and classification of citation sources that uses the corpus:
Jiří Mírovský and Barbora Hladká: SouDeC: Source Detection and Classification in Czech. In: Proceedings of the 15th Conference on Language Resources and Evaluation (LREC 2026), pp. 685–693, Palma, Mallorca, Spain, May 2026.
The corpus SiR 2.0 is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) licence.
The work on the first version of the corpus (SiR 1.0) was financed by the TAČR project TL05000057: Signál a šum v éře Žurnalistiky 5.0 - komparativní perspektiva novinářských žánrů automatizovaných obsahů. The manual annotation of the citations was organized as an annotation task for students attending the courses Digital Communication and Sources and Ethics for Journalists at the Faculty of Social Sciences, Charles University to practice selected theoretical journalistic concepts.
The expert annotation of the second version of the corpus (SiR 2.0) was financed by the National Recovery Plan project MPO 60273/24/21300/21000 NPO.