Sources in iRozhlas 2.0

Sources in iRozhlas 2.0 (SiR 2.0) is a collection of news articles annotated partly by experts and partly through a crowdsourced annotation task. The annotation covers citation signals, citation sources, the class of each source, and the links between sources and signals.

SiR 2.0 is an update of SiR 1.0, bringing expert annotation to a part of the data.

SiR 2.0 was published in April 2026 in the LINDAT/CLARIAH-CZ repository.

Source of the data

 
The documents published in SiR 2.0 come from iRozhlas, a news server of the Czech public radio.

Authors

Jiří Mírovský (Charles University, Faculty of Mathematics and Physics),
Barbora Hladká (Charles University, Faculty of Mathematics and Physics),
Matyáš Kopp (Charles University, Faculty of Mathematics and Physics),
Václav Moravec (Charles University, Faculty of Mathematics and Physics)

Introduction

SiR 2.0 is an annotated corpus of Czech articles from iRozhlas, a news server of the Czech public radio. It contains 1 718 articles (42 890 sentences, 614 995 words) with manually annotated attribution. SiR 2.0 updates the previous release, SiR 1.0, by adding expert annotation to two of the three original data parts.

Annotation Scheme

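The annotation marks citation phrases (signals) and citation sources, and links each source to its citation phrase. For example, in the sentence: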

Premiér prohlásil, že platy vzrostou o 10 %.
[The prime minister has stated that salaries will increase by 10%.],

the citation phrase prohlásil [has stated] refers to the citation source premiér [the prime minister], who provided the information.

The sources are further classified into several classes of named (official political, official non-political, unofficial) and unnamed (anonymous, partially anonymous) sources:

  • official (with mandate, also an institution)
    • official political – e.g., předseda vlády [prime minister], Ministerstvo obrany [Ministry of Defence]
    • official non-political – e.g., ředitel Národního muzea [director of the National Museum], mluvčí fotbalového klubu [spokesman of a football club], firma Apple [Apple]
  • unofficial (fully specified, without mandate)
    • unofficial – e.g., kytarový virtuos Lubomír Brabec [guitar virtuoso Lubomír Brabec], bývalý prezident Václav Klaus [former president Václav Klaus], New York Times [New York Times]
  • anonymous (underspecified)
    • partially anonymous – e.g., většina lékařů [most doctors], zdroje z okolí prezidenta [sources close to the president]
    • anonymous – e.g., dostupné informace [available information], anonymní zdroj [anonymous source]

In the example sentence above, premiér [the prime minister] is an official political source.
The sources are classified independently of the provided information.
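The classification above can be written down as a small lookup table. The following is only a sketch: the keys are the source tags listed in the Data Format section, while the names SOURCE_CLASSES and is_named are hypothetical helpers introduced for illustration and are not part of the corpus distribution.

```python
# Hypothetical lookup table: source tag -> (named/unnamed, top-level class).
# The tag names are those used in the annotation files; the grouping follows
# the classification described above.
SOURCE_CLASSES = {
    "official-political": ("named", "official"),
    "official-non-political": ("named", "official"),
    "unofficial": ("named", "unofficial"),
    "anonymous-partial": ("unnamed", "anonymous"),
    "anonymous": ("unnamed", "anonymous"),
}


def is_named(tag: str) -> bool:
    """Return True if the source tag belongs to the named sources."""
    return SOURCE_CLASSES[tag][0] == "named"


# The source "premiér" from the example sentence is an official political source.
print(is_named("official-political"))
```

Such a table can be handy when the five fine-grained classes need to be collapsed into coarser groups, for example when counting named versus unnamed sources.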

Data

The corpus consists of two parts, depending on the origin of the annotations:

  • expert-annotated articles: 589 articles (13 280 sentences, 193 864 words); in SiR 1.0, these data were in two directories: triple_manual and double_unified,
  • student-annotated articles: 1 129 articles (29 610 sentences, 421 131 words), each annotated by a single student; in SiR 1.0, these data were in directory single.

Data Format

The data were annotated in the Brat tool and are distributed in the native Brat format, i.e. each article is represented by the original plain text and a stand-off annotation file. The annotation files carry three types of information:

  1. annotation of citation phrases (text spans marked with tag "PHRASE"),
  2. annotation of citation sources (text spans marked with tags representing the type of the source: "anonymous", "anonymous-partial", "unofficial", "official-non-political" and "official-political"), and
  3. links connecting the citation source with the respective citation phrase (relation type "attribution").
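The three types of information listed above can be read with a few lines of Python. The following is a minimal sketch that assumes the general Brat stand-off conventions: text-bound annotations on lines starting with "T" (tag, character offsets, covered text) and relations on lines starting with "R" (relation type followed by role:ID arguments). The sample annotation, the argument role names Arg1 and Arg2, and the helpers parse_ann and attribution_pairs are illustrative assumptions, not taken from the corpus or from any official tooling.

```python
from dataclasses import dataclass

# Source tags as listed in the data format description.
SOURCE_TAGS = {
    "anonymous", "anonymous-partial", "unofficial",
    "official-non-political", "official-political",
}


@dataclass
class Span:
    id: str
    tag: str
    offsets: list  # list of (start, end) character offsets
    text: str


@dataclass
class Relation:
    id: str
    type: str
    args: dict  # role name -> span id


def parse_ann(content: str):
    """Parse text-bound annotations (T lines) and relations (R lines)."""
    spans, relations = {}, []
    for line in content.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            tag, _, offsets_str = fields[1].partition(" ")
            # Discontinuous spans separate their fragments with ";".
            offsets = [tuple(int(x) for x in frag.split())
                       for frag in offsets_str.split(";")]
            text = fields[2] if len(fields) > 2 else ""
            spans[ann_id] = Span(ann_id, tag, offsets, text)
        elif ann_id.startswith("R"):
            parts = fields[1].split()
            args = dict(p.split(":", 1) for p in parts[1:])
            relations.append(Relation(ann_id, parts[0], args))
    return spans, relations


def attribution_pairs(spans, relations):
    """Return (source, phrase) pairs, regardless of argument order."""
    pairs = []
    for rel in relations:
        if rel.type != "attribution":
            continue
        linked = [spans[i] for i in rel.args.values() if i in spans]
        sources = [s for s in linked if s.tag in SOURCE_TAGS]
        phrases = [s for s in linked if s.tag == "PHRASE"]
        if sources and phrases:
            pairs.append((sources[0], phrases[0]))
    return pairs


# Hypothetical annotation of the example sentence (not copied from the corpus).
SAMPLE_TEXT = "Premiér prohlásil, že platy vzrostou o 10 %."
SAMPLE_ANN = (
    "T1\tofficial-political 0 7\tPremiér\n"
    "T2\tPHRASE 8 17\tprohlásil\n"
    "R1\tattribution Arg1:T1 Arg2:T2\n"
)

spans, relations = parse_ann(SAMPLE_ANN)
for source, phrase in attribution_pairs(spans, relations):
    print(f"{source.text} ({source.tag}) -> {phrase.text}")
# prints: Premiér (official-political) -> prohlásil
```

The sketch identifies the source and the phrase by their tags rather than by argument position, so it does not depend on which role name the annotation files actually use for each end of the link.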

Citation

Please cite the following paper when using the corpus for your research:
Barbora Hladká, Jiří Mírovský, Matyáš Kopp, Václav Moravec: Annotating Attribution in Czech News Server Articles. In: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), pp. 1817–1823, Marseille, France, 20–25 June 2022.

You might also be interested in a related paper on automatic recognition and classification of the citation sources that utilizes the corpus:
Jiří Mírovský and Barbora Hladká: SouDeC: Source Detection and Classification in Czech. In: Proceedings of the 15th Conference on Language Resources and Evaluation (LREC 2026), pp. 685–693, Palma, Mallorca, Spain, May 2026.

Licence

The corpus SiR 2.0 is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) licence.

Acknowledgement

The work on the first version of the corpus (SiR 1.0) was financed by the TAČR project TL05000057: Signál a šum v éře Žurnalistiky 5.0 - komparativní perspektiva novinářských žánrů automatizovaných obsahů. The manual annotation of the citations was organized as an annotation task for students attending the courses Digital Communication and Sources and Ethics for Journalists at the Faculty of Social Sciences, Charles University to practice selected theoretical journalistic concepts.

The expert annotation of the second version of the corpus (SiR 2.0) was financed by the National Recovery Plan project MPO 60273/24/21300/21000 NPO.