SumeCzech

SumeCzech is a dataset of 1 million Czech news documents, each consisting of:

headline;
abstract (visually distinguished first paragraph);
rest of the text.

For more details, please read our paper SumeCzech: Large Czech News-Based Summarization Dataset.
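The three-part document structure, and the three summarization tasks reported in the results below, can be sketched as follows. This is only an illustration: it assumes each downloaded document is stored as one JSON object per line with `headline`, `abstract` and `text` keys, and the names `Document`, `TASKS`, `parse_document` and `make_pair` are hypothetical helpers, not part of the distributed scripts.

```python
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Document:
    headline: str
    abstract: str  # visually distinguished first paragraph
    text: str      # rest of the article


# (source field, target field) for each task in the results tables.
TASKS = {
    "abstract2headline": ("abstract", "headline"),
    "text2headline": ("text", "headline"),
    "text2abstract": ("text", "abstract"),
}


def parse_document(line: str) -> Document:
    """Parse one JSON line (assumed format) into a Document."""
    record = json.loads(line)
    return Document(record["headline"], record["abstract"], record["text"])


def make_pair(doc: Document, task: str) -> Tuple[str, str]:
    """Return the (input, reference summary) pair for the given task."""
    source, target = TASKS[task]
    return getattr(doc, source), getattr(doc, target)
```

For example, `make_pair(doc, "text2abstract")` returns the article text as the input and the abstract as the reference summary.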

Download

We distribute only the scripts capable of downloading the dataset from CommonCrawl. You can download them from the LINDAT/CLARIAH-CZ repository.

Best Results

Here we collect the published results of summarization methods on the SumeCzech dataset, using the published ROUGE_RAW metric. These results differ from those originally reported in the paper because the published metric uses slightly different tokenization. In the tables, P, R and F denote precision, recall and F-score.
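To make the P, R and F columns in the tables concrete, here is a minimal sketch of an n-gram overlap score in the ROUGE style. It is not the published ROUGE_RAW implementation: the `tokenize` function (lowercasing and splitting on word characters) is an assumption, and since the published metric's tokenization affects the numbers, this sketch should not be expected to reproduce the reported values. The tables list the scores as percentages.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and keep runs of word characters only.
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    # Multiset of all n-grams in the token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(gold, system, n=1):
    """Return (precision, recall, F-score) of n-gram overlap, each in [0, 1]."""
    gold_ngrams = ngrams(tokenize(gold), n)
    system_ngrams = ngrams(tokenize(system), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((gold_ngrams & system_ngrams).values())
    system_total = sum(system_ngrams.values())
    gold_total = sum(gold_ngrams.values())
    precision = overlap / system_total if system_total else 0.0
    recall = overlap / gold_total if gold_total else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


gold = "Vláda schválila nový rozpočet"
system = "Vláda schválila rozpočet na rok"
p, r, f = rouge_n(gold, system, n=1)
print(f"P={100 * p:.1f} R={100 * r:.1f} F={100 * f:.1f}")
```

Here three of the five system unigrams appear in the four-word reference, so precision is 3/5 and recall is 3/4. A longer output tends to raise recall at the cost of precision, which is why the tables report all three values.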

Abstract → Headline

Paper	System	Test set									Out-of-domain test set
		ROUGE_RAW-1			ROUGE_RAW-2			ROUGE_RAW-L			ROUGE_RAW-1			ROUGE_RAW-2			ROUGE_RAW-L
		P	R	F	P	R	F	P	R	F	P	R	F	P	R	F	P	R	F
SumeCzech	first	13.9	23.6	16.5	04.1	07.4	05.0	12.2	20.7	14.5	13.3	26.5	16.7	04.7	10.0	06.0	11.6	23.3	14.7
	random	11.0	17.8	12.8	02.6	04.5	03.1	09.6	15.5	11.1	10.6	20.7	13.1	03.2	06.9	04.1	09.3	18.2	11.5
	textrank	13.3	22.8	15.9	03.7	06.8	04.6	11.6	19.9	13.8	12.8	25.9	16.3	04.5	09.6	05.7	11.3	22.7	14.2
	tensor2tensor	20.2	15.9	17.2	06.7	05.1	05.6	18.6	14.7	15.8	19.4	15.1	16.3	07.1	05.2	05.7	18.1	14.1	15.2

Text → Headline

Paper	System	Test set									Out-of-domain test set
		ROUGE_RAW-1			ROUGE_RAW-2			ROUGE_RAW-L			ROUGE_RAW-1			ROUGE_RAW-2			ROUGE_RAW-L
		P	R	F	P	R	F	P	R	F	P	R	F	P	R	F	P	R	F
SumeCzech	first	07.4	13.5	08.9	01.1	02.2	01.3	06.5	11.7	07.7	06.7	13.6	08.3	01.3	02.8	01.6	05.9	12.0	07.4
	random	05.9	10.3	06.9	00.5	01.0	00.6	05.2	08.9	06.0	05.2	10.0	06.3	00.6	01.4	00.8	04.6	08.9	05.6
	textrank	06.0	16.5	08.3	00.8	02.3	01.1	05.0	13.8	06.9	05.8	16.9	08.1	01.1	03.4	01.5	05.0	14.5	06.9
	tensor2tensor	08.8	07.0	07.5	00.8	00.6	00.7	08.1	06.5	07.0	06.3	05.1	05.5	00.5	00.4	00.4	05.9	04.8	05.1
Bachelor Thesis of Müller, 2020	Seq2seq-FT	15.4	13.7	14.1	02.4	02.1	02.1	13.9	12.4	12.8	12.6	11.4	11.6	01.9	01.6	01.7	11.7	10.7	10.8
Bachelor Thesis of Müller, 2020	Seq2seq-FT-NER	15.3	13.6	14.0	02.4	02.0	02.1	13.9	12.4	12.7	13.0	11.6	11.9	01.9	01.7	01.7	12.0	10.8	11.0

Text → Abstract

Paper	System	Test set									Out-of-domain test set
		ROUGE_RAW-1			ROUGE_RAW-2			ROUGE_RAW-L			ROUGE_RAW-1			ROUGE_RAW-2			ROUGE_RAW-L
		P	R	F	P	R	F	P	R	F	P	R	F	P	R	F	P	R	F
SumeCzech	first	13.1	17.9	14.4	01.9	02.8	02.1	08.8	12.0	09.6	11.1	17.1	12.7	01.6	02.7	01.9	07.6	11.7	08.7
	random	11.7	15.5	12.7	01.2	01.7	01.3	07.7	10.3	08.4	10.1	15.1	11.4	01.0	01.7	01.2	06.9	10.3	07.8
	textrank	11.1	20.8	13.8	01.6	03.1	02.0	07.1	13.4	08.9	09.8	19.9	12.5	01.5	03.3	02.0	06.6	13.3	08.4
	tensor2tensor	13.2	10.5	11.3	01.2	00.9	01.0	10.2	08.1	08.7	12.5	09.4	10.3	00.8	00.6	00.6	09.8	07.5	08.1

How to cite

@inproceedings{straka-etal-2018-sumeczech,
    title = "{S}ume{C}zech: Large {C}zech News-Based Summarization Dataset",
    author = "Straka, Milan  and
      Mediankin, Nikita  and
      Kocmi, Tom  and
      {\v{Z}}abokrtsk{\'y}, Zden{\v{e}}k  and
      Hude{\v{c}}ek, Vojt{\v{e}}ch  and
      Haji{\v{c}}, Jan",
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://www.aclweb.org/anthology/L18-1551",
}
