SumeCzech

SumeCzech is a dataset of one million Czech news documents, each consisting of:

  • headline;
  • abstract (visually distinguished first paragraph);
  • rest of the text.

For more details, please read our paper SumeCzech: Large Czech News-Based Summarization Dataset.
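As a rough illustration of the document structure described above, the following sketch models one document and derives the source/target pair for each of the three summarization tasks evaluated on this page. The class name, field names, and task keys are hypothetical and chosen only for this example; they are not guaranteed to match the format produced by the download scripts.

```python
from dataclasses import dataclass


@dataclass
class NewsDocument:
    """One SumeCzech-style document (field names are assumptions)."""
    headline: str
    abstract: str  # visually distinguished first paragraph
    text: str      # rest of the article


# Hypothetical task keys mapped to (source field, target field).
TASKS = {
    "abstract_to_headline": ("abstract", "headline"),
    "text_to_headline": ("text", "headline"),
    "text_to_abstract": ("text", "abstract"),
}


def make_pair(doc: NewsDocument, task: str) -> tuple[str, str]:
    """Return the (source, target) strings for the given task."""
    source_field, target_field = TASKS[task]
    return getattr(doc, source_field), getattr(doc, target_field)
```

For example, `make_pair(doc, "text_to_abstract")` would return the article body as the input and the abstract as the reference summary.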

Download

We distribute only the scripts that download the dataset from CommonCrawl, not the documents themselves. You can download the scripts from the LINDAT/CLARIAH-CZ repository.

Best Results

Here we collect the published results of summarization methods on the SumeCzech dataset, using the published ROUGE_RAW metric. Note that the results differ from the original results reported in the paper, because the published metric uses slightly different tokenization.
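The P, R and F columns in the tables below stand for precision, recall and F-score. For the -1 and -2 variants these are based on unigram and bigram overlap between a system summary and the reference. The sketch below shows the general idea for a single document. It is not the published metric: whitespace tokenization and the function names are assumptions made for this example, the -L variant (longest common subsequence) is omitted, and, as noted above, tokenization details do change the resulting scores.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int) -> tuple[float, float, float]:
    """Precision, recall and F-score of n-gram overlap for one document.

    Assumption: plain whitespace tokenization, no stemming or stop words.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Multiset intersection: each n-gram counts at most as often as in both.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

If per-document scores are averaged over a test set, which is an assumption here, the averaged F need not equal the harmonic mean of the averaged P and R.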

Abstract → Headline

Paper      System         Test set                                             Out-of-domain test set
                          ROUGE_RAW-1      ROUGE_RAW-2      ROUGE_RAW-L        ROUGE_RAW-1      ROUGE_RAW-2      ROUGE_RAW-L
                          P    R    F      P    R    F      P    R    F        P    R    F      P    R    F      P    R    F
SumeCzech  first          13.9 23.6 16.5   04.1 07.4 05.0   12.2 20.7 14.5     13.3 26.5 16.7   04.7 10.0 06.0   11.6 23.3 14.7
           random         11.0 17.8 12.8   02.6 04.5 03.1   09.6 15.5 11.1     10.6 20.7 13.1   03.2 06.9 04.1   09.3 18.2 11.5
           textrank       13.3 22.8 15.9   03.7 06.8 04.6   11.6 19.9 13.8     12.8 25.9 16.3   04.5 09.6 05.7   11.3 22.7 14.2
           tensor2tensor  20.2 15.9 17.2   06.7 05.1 05.6   18.6 14.7 15.8     19.4 15.1 16.3   07.1 05.2 05.7   18.1 14.1 15.2

Text → Headline

Paper                            System          Test set                                             Out-of-domain test set
                                                 ROUGE_RAW-1      ROUGE_RAW-2      ROUGE_RAW-L        ROUGE_RAW-1      ROUGE_RAW-2      ROUGE_RAW-L
                                                 P    R    F      P    R    F      P    R    F        P    R    F      P    R    F      P    R    F
SumeCzech                        first           07.4 13.5 08.9   01.1 02.2 01.3   06.5 11.7 07.7     06.7 13.6 08.3   01.3 02.8 01.6   05.9 12.0 07.4
                                 random          05.9 10.3 06.9   00.5 01.0 00.6   05.2 08.9 06.0     05.2 10.0 06.3   00.6 01.4 00.8   04.6 08.9 05.6
                                 textrank        06.0 16.5 08.3   00.8 02.3 01.1   05.0 13.8 06.9     05.8 16.9 08.1   01.1 03.4 01.5   05.0 14.5 06.9
                                 tensor2tensor   08.8 07.0 07.5   00.8 00.6 00.7   08.1 06.5 07.0     06.3 05.1 05.5   00.5 00.4 00.4   05.9 04.8 05.1
Bachelor Thesis of Müller, 2020  Seq2seq-FT      15.4 13.7 14.1   02.4 02.1 02.1   13.9 12.4 12.8     12.6 11.4 11.6   01.9 01.6 01.7   11.7 10.7 10.8
                                 Seq2seq-FT-NER  15.3 13.6 14.0   02.4 02.0 02.1   13.9 12.4 12.7     13.0 11.6 11.9   01.9 01.7 01.7   12.0 10.8 11.0

Text → Abstract

Paper      System         Test set                                             Out-of-domain test set
                          ROUGE_RAW-1      ROUGE_RAW-2      ROUGE_RAW-L        ROUGE_RAW-1      ROUGE_RAW-2      ROUGE_RAW-L
                          P    R    F      P    R    F      P    R    F        P    R    F      P    R    F      P    R    F
SumeCzech  first          13.1 17.9 14.4   01.9 02.8 02.1   08.8 12.0 09.6     11.1 17.1 12.7   01.6 02.7 01.9   07.6 11.7 08.7
           random         11.7 15.5 12.7   01.2 01.7 01.3   07.7 10.3 08.4     10.1 15.1 11.4   01.0 01.7 01.2   06.9 10.3 07.8
           textrank       11.1 20.8 13.8   01.6 03.1 02.0   07.1 13.4 08.9     09.8 19.9 12.5   01.5 03.3 02.0   06.6 13.3 08.4
           tensor2tensor  13.2 10.5 11.3   01.2 00.9 01.0   10.2 08.1 08.7     12.5 09.4 10.3   00.8 00.6 00.6   09.8 07.5 08.1

How to cite

@inproceedings{straka-etal-2018-sumeczech,
    title = "{S}ume{C}zech: Large {C}zech News-Based Summarization Dataset",
    author = "Straka, Milan  and
      Mediankin, Nikita  and
      Kocmi, Tom  and
      {\v{Z}}abokrtsk{\'y}, Zden{\v{e}}k  and
      Hude{\v{c}}ek, Vojt{\v{e}}ch  and
      Haji{\v{c}}, Jan",
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://www.aclweb.org/anthology/L18-1551",
}