Natural language generation plays a critical role in conversational agents, as it has a significant impact on a user's impression of the system. This shared task focuses on recent end-to-end (E2E), data-driven NLG methods, which jointly learn sentence planning and surface realisation from non-aligned data, e.g. Wen et al. (2015), Mei et al. (2016), Dusek and Jurcicek (2016), or Lampouras and Vlachos (2016).
So far, E2E NLG approaches have been limited to small, delexicalised data sets, e.g. BAGEL, SF Hotels/Restaurants, or RoboCup. For this shared challenge, we provide a new crowd-sourced data set of 50k instances in the restaurant domain, as described in Novikova, Lemon and Rieser (2016). Each instance consists of a dialogue-act-based meaning representation (MR) and up to 5 references in natural language. In contrast to previously used data, our data set poses additional challenges, such as open vocabulary, complex syntactic structures and diverse discourse phenomena. For example:

MR: name[The Eagle], eatType[coffee shop], food[French], priceRange[moderate], customerRating[3/5], area[riverside], kidsFriendly[yes], near[Burger King]

Reference: "The three star coffee shop, The Eagle, gives families a mid-priced dining experience featuring a variety of wines and cheeses. Find The Eagle near Burger King."
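The MR format is simply a comma-separated list of attribute[value] pairs, so it is easy to work with programmatically. A minimal sketch of parsing it in Python (the `parse_mr` helper below is our own illustration, not part of the official tools):

```python
import re

def parse_mr(mr):
    """Parse an E2E MR such as "name[The Eagle], eatType[coffee shop]"
    into an attribute -> value dict."""
    return {m.group(1).strip(): m.group(2)
            for m in re.finditer(r'([^,\[]+)\[([^\]]*)\]', mr)}

print(parse_mr("name[The Eagle], eatType[coffee shop], food[French]"))
# {'name': 'The Eagle', 'eatType': 'coffee shop', 'food': 'French'}
```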
The full data set can now be downloaded here. A detailed description of the data can be found in our SIGDIAL 2017 paper. A brief summary of the E2E NLG Challenge results is now available in our INLG 2018 paper.
This challenge follows on from previous successful shared tasks on generation, e.g. SemEval'17 Task 9 on text generation from AMR and Generation Challenges 2008–11. However, this is the first NLG task to concentrate on (1) generation from dialogue acts and (2) the use of semantically unaligned data.
The task is to generate an utterance from a given MR which is (a) similar to human-generated reference texts and (b) highly rated by humans. Similarity will be assessed using standard metrics such as BLEU and METEOR. Human ratings will be obtained using a mixture of crowd-sourcing and expert annotations. We will also test a suite of novel metrics to estimate the quality of a generated utterance.
The metrics used for automatic evaluation are available on Github.
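As a rough, unofficial illustration of how multi-reference word-overlap metrics such as BLEU behave on this data (the sentences below are invented; the official scores come from the scripts linked above, not from NLTK):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Up to 5 human references are available per MR; BLEU compares the
# hypothesis against all of them at once.
references = [
    "The Eagle is a moderately priced French coffee shop near Burger King.".split(),
    "Near Burger King, The Eagle serves French food at moderate prices.".split(),
]
hypothesis = "The Eagle is a French coffee shop with moderate prices near Burger King.".split()

# Smoothing avoids zero scores when some higher-order n-grams are unmatched.
score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.4f}")
```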
The full E2E dataset is now available for download here. The package includes a description of the data format. A paper with a description of the dataset appeared at SIGDIAL 2017 and is also available on arXiv. An updated description of the data has now been released on arXiv (as a journal submission under review).
A package with the outputs of all participating systems on the test set as well as raw human ratings used for the evaluation is now available for download here. The package includes a short description of the data formats.
To cite the E2E data, use:
```
@article{dusek2019e2e,
  title={Evaluating the State-of-the-Art of End-to-End Natural Language Generation: {The} {E2E} {NLG} {Challenge}},
  author={Du{\v{s}}ek, Ond{\v{r}}ej and Novikova, Jekaterina and Rieser, Verena},
  journal={arXiv preprint arXiv:1901.11528},
  year={2019},
  month=jan,
  url={https://arxiv.org/abs/1901.11528},
}
```
See the Proceedings section below for citing the E2E NLG Challenge results.
We used TGen (Dusek and Jurcicek, 2016) as the baseline system for the challenge. It is a seq2seq model with attention (Bahdanau et al., 2015), extended with beam search and a reranker that penalises outputs straying from the input MR. The baseline scores on the development set are as follows:

| BLEU | NIST | METEOR | ROUGE_L | CIDEr |
|---|---|---|---|---|
| 0.6925 | 8.4781 | 0.4703 | 0.7257 | 2.3987 |
The full baseline system outputs can be downloaded here for both the development and test sets (one instance per line). If you want to run the baseline yourself, basic instructions are provided in the TGen Github repository.
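To make the baseline's reranking step concrete, here is a minimal sketch of the idea (TGen's actual reranker is a trained classifier over dialogue-act slots; the simple string-matching penalty below is only illustrative):

```python
def rerank(mr_slots, candidates, penalty=100.0):
    """Pick the best beam-search candidate for an MR.

    mr_slots: attribute -> value dict parsed from the input MR.
    candidates: list of (log_prob, text) pairs produced by beam search.
    Candidates that fail to realise an MR value are heavily penalised.
    """
    def score(log_prob, text):
        missing = sum(1 for value in mr_slots.values()
                      if value.lower() not in text.lower())
        return log_prob - penalty * missing
    return max(candidates, key=lambda c: score(*c))[1]

candidates = [
    (-4.2, "The Eagle is a coffee shop near Burger King."),
    (-5.1, "The Eagle is a French coffee shop near Burger King."),
]
print(rerank({"name": "The Eagle", "food": "French"}, candidates))
# prints the second candidate: it realises the 'French' slot despite a lower log-prob
```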
The scripts used for evaluation are available on Github.
We are happy to announce that interest in the E2E NLG shared task has far exceeded our expectations. Heriot-Watt University ran this challenge for the first time this year, and we received a total of 62 submissions from 17 institutions, with about one third of these submissions coming from industry. For comparison, the well-established Conference on Machine Translation (WMT'17, running since 2006) received submissions from 31 institutions across a total of 8 tasks.
A brief summary of the E2E NLG Challenge results is now available in our INLG 2018 paper, a more detailed analysis is in preparation.
The automatic evaluation results below were obtained using the metrics scripts provided with the baseline.
The "P?" column marks primary submissions, which took part in the human evaluation.

| Submitter | Affiliation | System name | P? | BLEU | NIST | METEOR | ROUGE_L | CIDEr |
|---|---|---|---|---|---|---|---|---|
| BASELINE | Heriot-Watt University | Baseline | ✓ | 0.6593 | 8.6094 | 0.4483 | 0.6850 | 2.2338 |
| Biao Zhang | Xiamen University | bzhang_submit | ✓ | 0.6545 | 8.1840 | 0.4392 | 0.7083 | 2.1012 |
| Chen Shuang | Harbin Institute of Technology | Abstract-beam1 | | 0.5854 | 5.4691 | 0.3977 | 0.6747 | 1.6391 |
| Chen Shuang | Harbin Institute of Technology | Abstract-beam2 | | 0.5916 | 5.9477 | 0.3974 | 0.6701 | 1.6513 |
| Chen Shuang | Harbin Institute of Technology | Abstract-beam3 | | 0.6150 | 6.8029 | 0.4068 | 0.6750 | 1.7870 |
| Chen Shuang | Harbin Institute of Technology | Abstract-greedy | | 0.6635 | 8.3977 | 0.4312 | 0.6909 | 2.0788 |
| Chen Shuang | Harbin Institute of Technology | NonAbstract-beam2 | | 0.5860 | 6.1602 | 0.3833 | 0.6619 | 1.6133 |
| Chen Shuang | Harbin Institute of Technology | NonAbstract-beam3 | | 0.6088 | 6.9790 | 0.3899 | 0.6628 | 1.7015 |
| Chen Shuang | Harbin Institute of Technology | Primary_NonAbstract-beam1 | ✓ | 0.5859 | 5.4383 | 0.3836 | 0.6714 | 1.5790 |
| ZHAW | Zurich University of Applied Sciences | base | | 0.6544 | 8.3391 | 0.4448 | 0.6783 | 2.1438 |
| ZHAW | Zurich University of Applied Sciences | primary_1 | ✓ | 0.5864 | 8.0212 | 0.4322 | 0.5998 | 1.8173 |
| ZHAW | Zurich University of Applied Sciences | primary_2 | ✓ | 0.6004 | 8.1394 | 0.4388 | 0.6119 | 1.9188 |
| FORGe | Pompeu Fabra University | E2E_UPF_1 | ✓ | 0.4207 | 6.5139 | 0.3685 | 0.5437 | 1.3106 |
| FORGe | Pompeu Fabra University | E2E_UPF_2 | | 0.4113 | 6.3293 | 0.3686 | 0.5593 | 1.2467 |
| FORGe | Pompeu Fabra University | E2E_UPF_3 | ✓ | 0.4599 | 7.1092 | 0.3858 | 0.5611 | 1.5586 |
| Sheffield NLP | University of Sheffield | sheffield_primarySystem1_var1 | ✓ | 0.6015 | 8.3075 | 0.4405 | 0.6778 | 2.1775 |
| Sheffield NLP | University of Sheffield | sheffield_primarySystem1_var2 | | 0.6233 | 8.1751 | 0.4378 | 0.6887 | 2.2840 |
| Sheffield NLP | University of Sheffield | sheffield_primarySystem1_var3 | | 0.5690 | 8.0382 | 0.4202 | 0.6348 | 2.0956 |
| Sheffield NLP | University of Sheffield | sheffield_primarySystem1_var4 | | 0.5799 | 7.9163 | 0.4310 | 0.6670 | 2.0691 |
| Sheffield NLP | University of Sheffield | sheffield_primarySystem2_var1 | ✓ | 0.5436 | 5.7462 | 0.3561 | 0.6152 | 1.4130 |
| Sheffield NLP | University of Sheffield | sheffield_primarySystem2_var2 | | 0.5356 | 7.8373 | 0.3831 | 0.5513 | 1.5825 |
| HarvardNLP & Henry Elder | Harvard SEAS & Adapt | main_1_support_1 | | 0.6581 | 8.5719 | 0.4409 | 0.6893 | 2.1065 |
| HarvardNLP & Henry Elder | Harvard SEAS & Adapt | main_1_support_2 | | 0.6618 | 8.6025 | 0.4571 | 0.7038 | 2.3371 |
| HarvardNLP & Henry Elder | Harvard SEAS & Adapt | main_1_support_3 | | 0.6737 | 8.6061 | 0.4523 | 0.7084 | 2.3056 |
| HarvardNLP & Henry Elder | Harvard SEAS & Adapt | Primary_main_1 | ✓ | 0.6496 | 8.5268 | 0.4386 | 0.6872 | 2.0850 |
| Heng Gong | Harbin Institute of Technology | Primary_test_2 | ✓ | 0.6422 | 8.3453 | 0.4469 | 0.6645 | 2.2721 |
| Heng Gong | Harbin Institute of Technology | test_1 | | 0.6396 | 8.3111 | 0.4466 | 0.6620 | 2.2272 |
| Heng Gong | Harbin Institute of Technology | test_3 | | 0.6395 | 8.3127 | 0.4457 | 0.6628 | 2.2442 |
| Heng Gong | Harbin Institute of Technology | test_4 | | 0.6395 | 8.3127 | 0.4457 | 0.6628 | 2.2442 |
| Adapt | Adapt | primary_submission-temperature_1.1 | ✓ | 0.5092 | 7.1954 | 0.4025 | 0.5872 | 1.5039 |
| Adapt | Adapt | supporting_submission-temperature_0.9 | | 0.5573 | 7.7013 | 0.4154 | 0.6130 | 1.8110 |
| Adapt | Adapt | supporting_submission-temperature_1.0 | | 0.5265 | 7.3991 | 0.4095 | 0.5992 | 1.6488 |
| <anonymous 1> | <anonymous 1> | <anonymous 1 combined> | | 0.2921 | 4.7690 | 0.2515 | 0.4361 | 0.6674 |
| <anonymous 1> | <anonymous 1> | <anonymous 1 primary> | ✓ | 0.4723 | 6.1938 | 0.3170 | 0.5616 | 1.2127 |
| Shubham Agarwal | NLE | submission_primary | ✓ | 0.6534 | 8.5300 | 0.4435 | 0.6829 | 2.1539 |
| Shubham Agarwal | NLE | submission_second | | 0.6669 | 8.5388 | 0.4484 | 0.6991 | 2.2239 |
| Shubham Agarwal | NLE | submission_third | | 0.6676 | 8.5416 | 0.4485 | 0.6991 | 2.2276 |
| UCSC-Slug2Slug | UC Santa Cruz | Slug2Slug | ✓ | 0.6619 | 8.6130 | 0.4454 | 0.6772 | 2.2615 |
| UCSC-Slug2Slug | UC Santa Cruz | Slug2Slug-alt (late submission) | ✓ | 0.6035 | 8.3954 | 0.4369 | 0.5991 | 2.1019 |
| Thomson Reuters NLG | Thomson Reuters | NonPrimary_1_test_output_model_11_post | | 0.6536 | 8.3293 | 0.4550 | 0.6805 | 2.1050 |
| Thomson Reuters NLG | Thomson Reuters | NonPrimary_2_test_output_model_13_post | | 0.6562 | 8.3942 | 0.4571 | 0.6876 | 2.1706 |
| Thomson Reuters NLG | Thomson Reuters | NonPrimary_3_test_output_beam_5_model_11_post | | 0.6805 | 8.7777 | 0.4462 | 0.6928 | 2.3195 |
| Thomson Reuters NLG | Thomson Reuters | NonPrimary_4_test_output_beam_5_model_13_post | | 0.6742 | 8.6590 | 0.4499 | 0.6983 | 2.3018 |
| Thomson Reuters NLG | Thomson Reuters | NonPrimary_5_submission_6 | | 0.6208 | 8.0632 | 0.4417 | 0.6692 | 2.1127 |
| Thomson Reuters NLG | Thomson Reuters | NonPrimary_6_submission_4_beam | | 0.6201 | 8.0938 | 0.4419 | 0.6740 | 2.1251 |
| Thomson Reuters NLG | Thomson Reuters | NonPrimary_7_submission_4 | | 0.6182 | 8.0616 | 0.4417 | 0.6729 | 2.0783 |
| Thomson Reuters NLG | Thomson Reuters | NonPrimary_8_test_train_only | | 0.4111 | 6.7541 | 0.3970 | 0.5435 | 1.4096 |
| Thomson Reuters NLG | Thomson Reuters | Primary_1_submission_6_beam | ✓ | 0.6336 | 8.1848 | 0.4322 | 0.6828 | 2.1425 |
| Thomson Reuters NLG | Thomson Reuters | Primary_2_test_train_dev | ✓ | 0.4202 | 6.7686 | 0.3968 | 0.5481 | 1.4389 |
| UCSC-TNT-NLG | UC Santa Cruz | System 1/Primary-Sys1 | ✓ | 0.6561 | 8.5105 | 0.4517 | 0.6839 | 2.2183 |
| UCSC-TNT-NLG | UC Santa Cruz | System 1/Sys1-Model1 | | 0.6476 | 8.4301 | 0.4508 | 0.6795 | 2.1233 |
| UCSC-TNT-NLG | UC Santa Cruz | System 2/Primary-Sys2 | ✓ | 0.6502 | 8.5211 | 0.4396 | 0.6853 | 2.1670 |
| UCSC-TNT-NLG | UC Santa Cruz | System 2/Sys2-Model1 | | 0.6606 | 8.6223 | 0.4439 | 0.6772 | 2.1997 |
| UCSC-TNT-NLG | UC Santa Cruz | System 2/Sys2-Model2 | | 0.6563 | 8.5482 | 0.4482 | 0.6835 | 2.1953 |
| UCSC-TNT-NLG | UC Santa Cruz | System 2/Sys2-Model3 | | 0.3681 | 6.6004 | 0.3846 | 0.5259 | 1.5205 |
| UIT-DANGNT | VNU-HCM University of Information Technology | test_e2e_result_2 final_TSV | ✓ | 0.5990 | 7.9277 | 0.4346 | 0.6634 | 2.0783 |
| UKP-TUDA | Technische Universität Darmstadt | test_e2e-Puzikov | ✓ | 0.5657 | 7.4544 | 0.4529 | 0.6614 | 1.8206 |
The human evaluation was conducted on the 20 primary systems plus the baseline using the CrowdFlower platform. We used our newly developed RankME method (Novikova et al., 2018) to obtain the ratings. Crowd workers were presented with five randomly selected outputs of different systems corresponding to a single meaning representation and were asked to rank these systems from best to worst, with ties permitted. A single human-authored reference was provided for comparison. We collected separate ranks for quality and naturalness.
Quality is defined as the overall quality of the utterance, in terms of its grammatical correctness, fluency, adequacy and other important factors. When collecting quality ratings, system outputs were presented to crowd workers together with the corresponding meaning representation.
Naturalness is defined as the extent to which the utterance could have been produced by a native speaker. When collecting naturalness ratings, system outputs were presented to crowd workers without the corresponding meaning representation.
If used in a real-life NLG system, quality would be considered the primary measure.
The final evaluation results were produced using the TrueSkill algorithm (Sakaguchi et al., 2014). For naturalness, the algorithm performed 1890 pairwise comparisons per system (37800 comparisons in total); for quality, 1260 comparisons per system (25200 comparisons in total). In the results tables, systems are ordered by their inferred TrueSkill scores and clustered; systems within a cluster are considered tied. The clusters were created using bootstrap resampling at a significance level of p ≤ 0.05.
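As a minimal sketch of how TrueSkill turns pairwise comparisons into a ranking (using the open-source `trueskill` Python package rather than the Sakaguchi et al. adaptation actually used for the challenge; the system names and comparison outcomes below are invented):

```python
import trueskill  # pip install trueskill

# Each system starts with the default prior rating (mu=25, sigma~8.33).
systems = {"sysA": trueskill.Rating(),
           "sysB": trueskill.Rating(),
           "sysC": trueskill.Rating()}

# Pairwise outcomes extracted from crowd workers' rankings of system
# outputs; ties between two systems enter as draws.
comparisons = [("sysA", "sysB", False),
               ("sysA", "sysC", False),
               ("sysB", "sysC", True)]

for winner, loser, drawn in comparisons:
    systems[winner], systems[loser] = trueskill.rate_1vs1(
        systems[winner], systems[loser], drawn=drawn)

# Rank systems by their inferred mean skill.
for name, rating in sorted(systems.items(), key=lambda kv: kv[1].mu, reverse=True):
    print(f"{name}: mu={rating.mu:.2f}, sigma={rating.sigma:.2f}")
```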
Quality:

| # | TrueSkill | Rank range | System name | Submitter |
|---|---|---|---|---|
| 1 | 0.300 | (1.0, 1.0) | Slug2Slug | UCSC-Slug2Slug |
| 2 | 0.228 | (2.0, 4.0) | ukp-tuda | UKP-TUDA |
| | 0.213 | (2.0, 5.0) | Primary_test_2 | Heng Gong |
| | 0.184 | (3.0, 5.0) | test_e2e_result_2_final_TSV | UIT-DANGNT |
| | 0.184 | (3.0, 6.0) | Baseline | BASELINE |
| | 0.136 | (5.0, 7.0) | Slug2Slug-alt (late submission) | UCSC-Slug2Slug |
| | 0.117 | (6.0, 8.0) | primary_2 | ZHAW |
| | 0.084 | (7.0, 10.0) | System 1/Primary-Sys1 | UCSC-TNT-NLG |
| | 0.065 | (8.0, 10.0) | System 2/Primary-Sys2 | UCSC-TNT-NLG |
| | 0.048 | (8.0, 12.0) | submission_primary | NLE |
| | 0.018 | (10.0, 13.0) | primary_1 | ZHAW |
| | 0.014 | (10.0, 14.0) | E2E_UPF_1 | FORGe |
| | -0.012 | (11.0, 14.0) | sheffield_primarySystem1_var1 | Sheffield NLP |
| | -0.012 | (11.0, 14.0) | Primary_main_1 | HarvardNLP & Henry Elder |
| 3 | -0.078 | (15.0, 16.0) | Primary_2_test_train_dev | Thomson Reuters NLG |
| | -0.083 | (15.0, 16.0) | E2E_UPF_3 | FORGe |
| 4 | -0.152 | (17.0, 19.0) | primary_submission-temperature_1.1 | Adapt |
| | -0.185 | (17.0, 19.0) | Primary_1_submission_6_beam | Thomson Reuters NLG |
| | -0.186 | (17.0, 19.0) | bzhang_submit | Biao Zhang |
| 5 | -0.426 | (20.0, 21.0) | Primary_NonAbstract-beam1 | Chen Shuang |
| | -0.457 | (20.0, 21.0) | sheffield_primarySystem2_var1 | Sheffield NLP |
Naturalness:

| # | TrueSkill | Rank range | System name | Submitter |
|---|---|---|---|---|
| 1 | 0.211 | (1.0, 1.0) | sheffield_primarySystem2_var1 | Sheffield NLP |
| 2 | 0.171 | (2.0, 3.0) | Slug2Slug | UCSC-Slug2Slug |
| | 0.154 | (2.0, 4.0) | Primary_NonAbstract-beam1 | Chen Shuang |
| | 0.126 | (3.0, 6.0) | Primary_main_1 | HarvardNLP & Henry Elder |
| | 0.105 | (4.0, 8.0) | submission_primary | NLE |
| | 0.101 | (4.0, 8.0) | Baseline | BASELINE |
| | 0.091 | (5.0, 8.0) | test_e2e_result_2 final_TSV | UIT-DANGNT |
| | 0.077 | (5.0, 10.0) | ukp-tuda | UKP-TUDA |
| | 0.060 | (7.0, 11.0) | System 2/Primary-Sys2 | UCSC-TNT-NLG |
| | 0.046 | (9.0, 12.0) | Primary_test_2 | Heng Gong |
| | 0.027 | (9.0, 12.0) | System 1/Primary-Sys1 | UCSC-TNT-NLG |
| | 0.027 | (10.0, 12.0) | bzhang_submit | Biao Zhang |
| 3 | -0.053 | (13.0, 16.0) | Primary_1_submission_6_beam | Thomson Reuters NLG |
| | -0.073 | (13.0, 17.0) | Slug2Slug-alt (late submission) | UCSC-Slug2Slug |
| | -0.077 | (13.0, 17.0) | sheffield_primarySystem1_var1 | Sheffield NLP |
| | -0.083 | (13.0, 17.0) | primary_2 | ZHAW |
| | -0.104 | (15.0, 17.0) | primary_1 | ZHAW |
| 4 | -0.144 | (18.0, 19.0) | E2E_UPF_1 | FORGe |
| | -0.164 | (18.0, 19.0) | primary_submission-temperature_1.1 | Adapt |
| 5 | -0.243 | (20.0, 21.0) | Primary_2_test_train_dev | Thomson Reuters NLG |
| | -0.255 | (20.0, 21.0) | E2E_UPF_3 | FORGe |
A brief description of the challenge results was published at INLG. To cite the challenge, use:
```
@inproceedings{dusek2018findings,
  title={Findings of the {E2E} {NLG} {Challenge}},
  author={Du{\v{s}}ek, Ond{\v{r}}ej and Novikova, Jekaterina and Rieser, Verena},
  booktitle={Proceedings of the 11th International Conference on Natural Language Generation},
  address={Tilburg, The Netherlands},
  year={2018},
  note={arXiv:1810.01170},
  url={https://arxiv.org/abs/1810.01170},
}
```
A greatly extended analysis of the challenge results is now released on arXiv (as a journal submission under review).
System outputs and human ratings can now be downloaded from here. Please use the same citation to refer to this data release.
All submitters participating in human evaluation provided a description of their primary systems as a technical paper. The papers are linked below:
| System | Paper |
|---|---|
| Adapt | Henry Elder, Sebastian Gehrmann, Alexander O'Connor and Qun Liu: E2E NLG Challenge Submission: Towards Controllable Generation of Diverse Natural Language |
| Chen Shuang | Shuang Chen: A General Model for Neural Text Generation from Structured Data |
| FORGe (both systems) | Simon Mille and Stamatia Dasiopoulou: FORGe at E2E 2017 |
| HarvardNLP & Henry Elder | Sebastian Gehrmann, Falcon Z. Dai, Henry Elder and Alexander M. Rush: End-to-End Content and Plan Selection for Natural Language Generation |
| Heng Gong | (final paper pending) |
| NLE | Shubham Agarwal, Marc Dymetman and Éric Gaussier: A char-based seq2seq submission to the E2E NLG Challenge |
| Sheffield NLP (both systems) | Mingjie Chen, Gerasimos Lampouras and Andreas Vlachos: Sheffield at E2E: structured prediction approaches to end-to-end language generation |
| UCSC-Slug2Slug | Juraj Juraska, Panagiotis Karagiannis, Kevin K. Bowden and Marilyn A. Walker: Slug2Slug: A Deep Ensemble Model with Slot Alignment for Sequence-to-Sequence Natural Language Generation |
| UCSC-TNT-NLG, System 1 | Shereen Oraby, Lena Reed, Shubhangi Tandon, Sharath T.S., Stephanie Lukin and Marilyn Walker: TNT-NLG, System 1: Using a Statistical NLG to Massively Augment Crowd-Sourced Data for Neural Generation |
| UCSC-TNT-NLG, System 2 | Shubhangi Tandon, Sharath T.S., Shereen Oraby, Lena Reed, Stephanie Lukin and Marilyn Walker: TNT-NLG, System 2: Data Repetition and Meaning Representation Manipulation to Improve Neural Generation |
| Thomson Reuters NLG, System 1 | Elnaz Davoodi, Charese Smiley, Dezhao Song and Frank Schilder: The E2E NLG Challenge: Training a Sequence-to-Sequence Approach for Meaning Representation to Natural Language Sentences |
| Thomson Reuters NLG, System 2 | Charese Smiley, Elnaz Davoodi, Dezhao Song and Frank Schilder: The E2E NLG Challenge: End-to-End Generation through Partial Template Mining |
| UIT-DANGNT | Dang Tuan Nguyen and Trung Tran: Structure-based Generation System for E2E NLG Challenge |
| UKP-TUDA | Yevgeniy Puzikov and Iryna Gurevych: E2E NLG Challenge: Neural Models vs. Templates |
| Biao Zhang | Biao Zhang, Jing Yang, Qian Lin and Jinsong Su: Attention Regularized Sequence-to-Sequence Learning for E2E NLG Challenge |
| ZHAW (both systems) | Jan Deriu and Mark Cieliebak: End-to-End Trainable System for Enhancing Diversity in Natural Language Generation |
Published versions of the systems participating in the Challenge:
Further works that use the E2E dataset but did not participate in the official E2E challenge:
Jekaterina Novikova
Ondrej Dusek
Verena Rieser
Heriot-Watt University, Edinburgh, UK.
e2e-nlg-challenge@googlegroups.com
Mohit Bansal, University of North Carolina at Chapel Hill
Ehud Reiter, University of Aberdeen
Amanda Stent, Bloomberg
Andreas Vlachos, University of Sheffield
Marilyn Walker, University of California Santa Cruz
Matthew Walter, Toyota Technological Institute at Chicago
Tsung-Hsien Wen, University of Cambridge
Luke Zettlemoyer, University of Washington