Sources in iRozhlas 1.0 (SiR 1.0) is a collection of annotated news articles produced in a crowdsourced annotation task. The task involved approx. 2 thousand articles and over 290 annotators, who marked more than 11 thousand citation signals and about 10 thousand citation sources in the texts, together with the class of each source and its links to the signals.
SiR 1.0 was published in September 2022 in the LINDAT/CLARIAH-CZ repository. Articles that were double- or triple-annotated (589 articles) are available for searching in Teitok.
The documents published in SiR 1.0 originated from iRozhlas, a news server of the Czech public radio.
Barbora Hladká (Charles University, Faculty of Mathematics and Physics),
Jiří Mírovský (Charles University, Faculty of Mathematics and Physics),
Matyáš Kopp (Charles University, Faculty of Mathematics and Physics),
Václav Moravec (Charles University, Faculty of Social Sciences)
SiR 1.0 is an annotated corpus of Czech articles from iRozhlas, a news server of the Czech public radio. It contains 1 718 articles (42 890 sentences, 614 995 words) with manually annotated attribution.
For example, in the sentence:
Jak už vědci uvedli při prvním kole vykopávek, jde pro ně o záhadu. [As the scientists already stated during the first round of excavations, it is a mystery to them.],
the citation phrase uvedli [stated] refers to the citation source vědci [scientists], who provided the information.
The sources are further classified into several classes of named (unofficial, official non-political, official political) and unnamed (anonymous, partially anonymous) sources.
The corpus consists of three parts, depending on the quality of the annotations:
The data were annotated in the Brat tool and are distributed in the Brat native format, i.e. each article is represented by the original plain text and a stand-off annotation file. The annotation files carry three types of information:
In the case of double-annotated articles, the above-mentioned tags are used for annotations agreed upon by both annotators. Annotations on which the annotators disagreed are marked with numbered tags (e.g., PHRASE1 is a citation phrase recognized only by the first annotator, "anonymous2" is a citation source marked as anonymous only by the second annotator).
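The stand-off files can be read with a few lines of Python. The following is a minimal sketch, not an official reader for the corpus. It assumes the general Brat stand-off conventions (tab-separated lines, "T" identifiers for text spans, "R" identifiers for relations) and assumes that a trailing 1 or 2 on a tag always marks the annotator. The sample annotation lines are made up for illustration from the example sentence above; the relation name LINK and the function names are hypothetical and may not match the tags actually used in SiR 1.0.

```python
import re
from typing import NamedTuple, Optional


class TextBound(NamedTuple):
    ann_id: str
    tag: str
    start: int
    end: int
    text: str


class Relation(NamedTuple):
    ann_id: str
    tag: str
    args: dict


def parse_ann(ann_text: str):
    """Collect text-bound (T) and relation (R) lines of a Brat .ann file."""
    spans, relations = [], []
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # Shape: T1<TAB>TAG START END<TAB>covered text
            tag, offsets = fields[1].split(" ", 1)
            # Discontinuous spans look like "s1 e1;s2 e2"; keep the outer bounds.
            nums = [int(n) for n in re.findall(r"\d+", offsets)]
            covered = fields[2] if len(fields) > 2 else ""
            spans.append(TextBound(ann_id, tag, nums[0], nums[-1], covered))
        elif ann_id.startswith("R"):
            # Shape: R1<TAB>TAG Arg1:T1 Arg2:T2
            tag, *args = fields[1].split()
            arg_map = dict(a.split(":", 1) for a in args)
            relations.append(Relation(ann_id, tag, arg_map))
        # Other line types (attributes, notes, ...) are skipped in this sketch.
    return spans, relations


def split_annotator_suffix(tag: str):
    """Split 'PHRASE1' into ('PHRASE', 1); agreed tags get annotator None.

    Assumption: no base tag name itself ends in the digit 1 or 2.
    """
    m = re.fullmatch(r"(.+?)([12])", tag)
    if m:
        return m.group(1), int(m.group(2))
    return tag, None


# Illustrative input only; "LINK" is a hypothetical relation name.
text = "Jak už vědci uvedli při prvním kole vykopávek, jde pro ně o záhadu."
sample_ann = (
    "T1\tPHRASE 13 19\tuvedli\n"
    "T2\tanonymous2 7 12\tvědci\n"
    "R1\tLINK Arg1:T1 Arg2:T2\n"
)

spans, relations = parse_ann(sample_ann)
for span in spans:
    base, annotator = split_annotator_suffix(span.tag)
    status = "agreed" if annotator is None else f"annotator {annotator} only"
    print(span.ann_id, base, repr(text[span.start:span.end]), status)
```

Separating the base tag from the annotator number makes it possible to treat agreed and disputed annotations uniformly, for instance when counting how often the two annotators disagreed on a given class.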
Please cite the following paper when using the corpus for your research:
Barbora Hladká, Jiří Mírovský, Matyáš Kopp, Václav Moravec: Annotating Attribution in Czech News Server Articles. In: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), pp. 1817–1823, Marseille, France, 20-25 June 2022.
The corpus SiR 1.0 is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) licence.
The work on the corpus was financed by the TAČR project TL05000057: Signál a šum v éře Žurnalistiky 5.0 - komparativní perspektiva novinářských žánrů automatizovaných obsahů. The manual annotation of the citations was organized as an annotation task for students attending the courses Digital Communication and Sources and Ethics for Journalists at the Faculty of Social Sciences, Charles University, so that they could practice selected theoretical journalistic concepts.