Natural language generation plays a critical role in conversational agents, as it has a significant impact on a user's impression of the system. This shared task focuses on recent end-to-end (E2E), data-driven NLG methods, which jointly learn sentence planning and surface realisation from non-aligned data, e.g. Wen et al. (2015), Mei et al. (2016), Dusek and Jurcicek (2016), or Lampouras and Vlachos (2016).
So far, E2E NLG approaches have been limited to small, delexicalised data sets, e.g. BAGEL, SF Hotels/Restaurants, or RoboCup. For this shared challenge, we provide a new crowd-sourced data set of 50k instances in the restaurant domain, as described in Novikova, Lemon and Rieser (2016). Each instance consists of a dialogue-act-based meaning representation (MR) and up to 5 references in natural language. In contrast to previously used data, our data set poses additional challenges, such as open vocabulary, complex syntactic structures and diverse discourse phenomena. For example:

MR: name[The Eagle], eatType[coffee shop], food[French], priceRange[moderate], customerRating[3/5], area[riverside], kidsFriendly[yes], near[Burger King]

Reference: "The three star coffee shop, The Eagle, gives families a mid-priced dining experience featuring a variety of wines and cheeses. Find The Eagle near Burger King."
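The MR format is simply a comma-separated list of attribute[value] pairs, so it is easy to work with programmatically. A minimal sketch of parsing it in Python (the `parse_mr` helper below is our own illustration, not part of the official tools):

```python
import re

def parse_mr(mr):
    """Parse an E2E MR such as "name[The Eagle], eatType[coffee shop]"
    into an attribute -> value dict."""
    return {m.group(1).strip(): m.group(2)
            for m in re.finditer(r'([^,\[]+)\[([^\]]*)\]', mr)}

print(parse_mr("name[The Eagle], eatType[coffee shop], food[French]"))
# {'name': 'The Eagle', 'eatType': 'coffee shop', 'food': 'French'}
```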
The full data set can now be downloaded here. A detailed description of the data can be found in our SIGDIAL 2017 paper. A brief summary of the E2E NLG Challenge results is now available in our INLG 2018 paper.
This challenge follows on from previous successful shared tasks on generation, e.g. SemEval'17 Task 9 on text generation from AMR and Generation Challenges 2008–11. However, this is the first NLG task to concentrate on (1) generation from dialogue acts and (2) the use of semantically unaligned data.
The task is to generate an utterance from a given MR which is (a) similar to human-generated reference texts and (b) highly rated by humans. Similarity will be assessed using standard metrics such as BLEU and METEOR. Human ratings will be obtained using a mixture of crowd-sourcing and expert annotations. We will also test a suite of novel metrics to estimate the quality of a generated utterance.
The metrics used for automatic evaluation are available on Github.
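As a rough, unofficial illustration of how multi-reference word-overlap metrics such as BLEU behave on this data (the sentences below are invented; the official scores come from the scripts linked above, not from NLTK):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Up to 5 human references are available per MR; BLEU compares the
# hypothesis against all of them at once.
references = [
    "The Eagle is a moderately priced French coffee shop near Burger King.".split(),
    "Near Burger King, The Eagle serves French food at moderate prices.".split(),
]
hypothesis = "The Eagle is a French coffee shop with moderate prices near Burger King.".split()

# Smoothing avoids zero scores when some higher-order n-grams are unmatched.
score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.4f}")
```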
The full E2E dataset is now available for download here. The package includes a description of the data format. A paper with a description of the dataset appeared at SIGDIAL 2017 and is also available on arXiv. An updated description of the data has now been released on arXiv (as a journal submission under review).
A package with the outputs of all participating systems on the test set as well as raw human ratings used for the evaluation is now available for download here. The package includes a short description of the data formats.
To cite the E2E data, use:
```
@article{dusek2019e2e,
  title={Evaluating the State-of-the-Art of End-to-End Natural Language Generation: {The} {E2E} {NLG} {Challenge}},
  author={Du{\v{s}}ek, Ond{\v{r}}ej and Novikova, Jekaterina and Rieser, Verena},
  journal={arXiv preprint arXiv:1901.11528},
  year={2019},
  month=jan,
  url={https://arxiv.org/abs/1901.11528},
}
```
See the Proceedings section below for citing the E2E NLG Challenge results.
We used TGen (Dusek and Jurcicek, 2016) as the baseline system for the challenge. It is a seq2seq model with attention (Bahdanau et al., 2015), extended with beam search and a reranker that penalises outputs straying from the input MR. The baseline scores on the development set are as follows:

| BLEU | NIST | METEOR | ROUGE_L | CIDEr |
|---|---|---|---|---|
| 0.6925 | 8.4781 | 0.4703 | 0.7257 | 2.3987 |
The full baseline system outputs can be downloaded here for both the development and test sets (one instance per line). If you want to run the baseline yourself, basic instructions are provided in the TGen Github repository.
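To make the baseline's reranking step concrete, here is a minimal sketch of the idea (TGen's actual reranker is a trained classifier over dialogue-act slots; the simple string-matching penalty below is only illustrative):

```python
def rerank(mr_slots, candidates, penalty=100.0):
    """Pick the best beam-search candidate for an MR.

    mr_slots: attribute -> value dict parsed from the input MR.
    candidates: list of (log_prob, text) pairs produced by beam search.
    Candidates that fail to realise an MR value are heavily penalised.
    """
    def score(log_prob, text):
        missing = sum(1 for value in mr_slots.values()
                      if value.lower() not in text.lower())
        return log_prob - penalty * missing
    return max(candidates, key=lambda c: score(*c))[1]

candidates = [
    (-4.2, "The Eagle is a coffee shop near Burger King."),
    (-5.1, "The Eagle is a French coffee shop near Burger King."),
]
print(rerank({"name": "The Eagle", "food": "French"}, candidates))
# prints the second candidate: it realises the 'French' slot despite a lower log-prob
```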
The scripts used for evaluation are available on Github.
We are happy to announce that interest in the E2E NLG shared task has far exceeded our expectations. Heriot-Watt University ran this challenge for the first time this year, and we received a total of 62 submissions from 17 institutions, with about one third of these submissions coming from industry. For comparison, the well-established Conference on Machine Translation (WMT'17, running since 2006) received submissions from 31 institutions across a total of 8 tasks.
A brief summary of the E2E NLG Challenge results is now available in our INLG 2018 paper, a more detailed analysis is in preparation.
The automatic evaluation results below were obtained using the metrics scripts provided with the baseline.
The "P?" column marks primary submissions, which took part in the human evaluation.

| Submitter | Affiliation | System name | P? | BLEU | NIST | METEOR | ROUGE_L | CIDEr |
|---|---|---|---|---|---|---|---|---|
| BASELINE | Heriot-Watt University | Baseline | ✓ | 0.6593 | 8.6094 | 0.4483 | 0.6850 | 2.2338 |
| Biao Zhang | Xiamen University | bzhang_submit | ✓ | 0.6545 | 8.1840 | 0.4392 | 0.7083 | 2.1012 |
| Chen Shuang | Harbin Institute of Technology | Abstract-beam1 | | 0.5854 | 5.4691 | 0.3977 | 0.6747 | 1.6391 |
| Chen Shuang | Harbin Institute of Technology | Abstract-beam2 | | 0.5916 | 5.9477 | 0.3974 | 0.6701 | 1.6513 |
| Chen Shuang | Harbin Institute of Technology | Abstract-beam3 | | 0.6150 | 6.8029 | 0.4068 | 0.6750 | 1.7870 |
| Chen Shuang | Harbin Institute of Technology | Abstract-greedy | | 0.6635 | 8.3977 | 0.4312 | 0.6909 | 2.0788 |
| Chen Shuang | Harbin Institute of Technology | NonAbstract-beam2 | | 0.5860 | 6.1602 | 0.3833 | 0.6619 | 1.6133 |
| Chen Shuang | Harbin Institute of Technology | NonAbstract-beam3 | | 0.6088 | 6.9790 | 0.3899 | 0.6628 | 1.7015 |
| Chen Shuang | Harbin Institute of Technology | Primary_NonAbstract-beam1 | ✓ | 0.5859 | 5.4383 | 0.3836 | 0.6714 | 1.5790 |
| ZHAW | Zurich University of Applied Sciences | base | | 0.6544 | 8.3391 | 0.4448 | 0.6783 | 2.1438 |
| ZHAW | Zurich University of Applied Sciences | primary_1 | ✓ | 0.5864 | 8.0212 | 0.4322 | 0.5998 | 1.8173 |
| ZHAW | Zurich University of Applied Sciences | primary_2 | ✓ | 0.6004 | 8.1394 | 0.4388 | 0.6119 | 1.9188 |
| FORGe | Pompeu Fabra University | E2E_UPF_1 | ✓ | 0.4207 | 6.5139 | 0.3685 | 0.5437 | 1.3106 |
| FORGe | Pompeu Fabra University | E2E_UPF_2 | | 0.4113 | 6.3293 | 0.3686 | 0.5593 | 1.2467 |
| FORGe | Pompeu Fabra University | E2E_UPF_3 | ✓ | 0.4599 | 7.1092 | 0.3858 | 0.5611 | 1.5586 |
| Sheffield NLP | University of Sheffield | sheffield_primarySystem1_var1 | ✓ | 0.6015 | 8.3075 | 0.4405 | 0.6778 | 2.1775 |
| Sheffield NLP | University of Sheffield | sheffield_primarySystem1_var2 | | 0.6233 | 8.1751 | 0.4378 | 0.6887 | 2.2840 |
| Sheffield NLP | University of Sheffield | sheffield_primarySystem1_var3 | | 0.5690 | 8.0382 | 0.4202 | 0.6348 | 2.0956 |
| Sheffield NLP | University of Sheffield | sheffield_primarySystem1_var4 | | 0.5799 | 7.9163 | 0.4310 | 0.6670 | 2.0691 |
| Sheffield NLP | University of Sheffield | sheffield_primarySystem2_var1 | ✓ | 0.5436 | 5.7462 | 0.3561 | 0.6152 | 1.4130 |
| Sheffield NLP | University of Sheffield | sheffield_primarySystem2_var2 | | 0.5356 | 7.8373 | 0.3831 | 0.5513 | 1.5825 |
| HarvardNLP & Henry Elder | Harvard SEAS & Adapt | main_1_support_1 | | 0.6581 | 8.5719 | 0.4409 | 0.6893 | 2.1065 |
| HarvardNLP & Henry Elder | Harvard SEAS & Adapt | main_1_support_2 | | 0.6618 | 8.6025 | 0.4571 | 0.7038 | 2.3371 |
| HarvardNLP & Henry Elder | Harvard SEAS & Adapt | main_1_support_3 | | 0.6737 | 8.6061 | 0.4523 | 0.7084 | 2.3056 |
| HarvardNLP & Henry Elder | Harvard SEAS & Adapt | Primary_main_1 | ✓ | 0.6496 | 8.5268 | 0.4386 | 0.6872 | 2.0850 |
| Heng Gong | Harbin Institute of Technology | Primary_test_2 | ✓ | 0.6422 | 8.3453 | 0.4469 | 0.6645 | 2.2721 |
| Heng Gong | Harbin Institute of Technology | test_1 | | 0.6396 | 8.3111 | 0.4466 | 0.6620 | 2.2272 |
| Heng Gong | Harbin Institute of Technology | test_3 | | 0.6395 | 8.3127 | 0.4457 | 0.6628 | 2.2442 |
| Heng Gong | Harbin Institute of Technology | test_4 | | 0.6395 | 8.3127 | 0.4457 | 0.6628 | 2.2442 |
| Adapt | Adapt | primary_submission-temperature_1.1 | ✓ | 0.5092 | 7.1954 | 0.4025 | 0.5872 | 1.5039 |
| Adapt | Adapt | supporting_submission-temperature_0.9 | | 0.5573 | 7.7013 | 0.4154 | 0.6130 | 1.8110 |
| Adapt | Adapt | supporting_submission-temperature_1.0 | | 0.5265 | 7.3991 | 0.4095 | 0.5992 | 1.6488 |
| <anonymous 1> | <anonymous 1> | <anonymous 1 combined> | | 0.2921 | 4.7690 | 0.2515 | 0.4361 | 0.6674 |
| <anonymous 1> | <anonymous 1> | <anonymous 1 primary> | ✓ | 0.4723 | 6.1938 | 0.3170 | 0.5616 | 1.2127 |
| Shubham Agarwal | NLE | submission_primary | ✓ | 0.6534 | 8.5300 | 0.4435 | 0.6829 | 2.1539 |
| Shubham Agarwal | NLE | submission_second | | 0.6669 | 8.5388 | 0.4484 | 0.6991 | 2.2239 |
| Shubham Agarwal | NLE | submission_third | | 0.6676 | 8.5416 | 0.4485 | 0.6991 | 2.2276 |
| UCSC-Slug2Slug | UC Santa Cruz | Slug2Slug | ✓ | 0.6619 | 8.6130 | 0.4454 | 0.6772 | 2.2615 |
| UCSC-Slug2Slug | UC Santa Cruz | Slug2Slug-alt (late submission) | ✓ | 0.6035 | 8.3954 | 0.4369 | 0.5991 | 2.1019 |
| Thomson Reuters NLG | Thomson Reuters | NonPrimary_1_test_output_model_11_post | | 0.6536 | 8.3293 | 0.4550 | 0.6805 | 2.1050 |
| Thomson Reuters NLG | Thomson Reuters | NonPrimary_2_test_output_model_13_post | | 0.6562 | 8.3942 | 0.4571 | 0.6876 | 2.1706 |
| Thomson Reuters NLG | Thomson Reuters | NonPrimary_3_test_output_beam_5_model_11_post | | 0.6805 | 8.7777 | 0.4462 | 0.6928 | 2.3195 |
| Thomson Reuters NLG | Thomson Reuters | NonPrimary_4_test_output_beam_5_model_13_post | | 0.6742 | 8.6590 | 0.4499 | 0.6983 | 2.3018 |
| Thomson Reuters NLG | Thomson Reuters | NonPrimary_5_submission_6 | | 0.6208 | 8.0632 | 0.4417 | 0.6692 | 2.1127 |
| Thomson Reuters NLG | Thomson Reuters | NonPrimary_6_submission_4_beam | | 0.6201 | 8.0938 | 0.4419 | 0.6740 | 2.1251 |
| Thomson Reuters NLG | Thomson Reuters | NonPrimary_7_submission_4 | | 0.6182 | 8.0616 | 0.4417 | 0.6729 | 2.0783 |
| Thomson Reuters NLG | Thomson Reuters | NonPrimary_8_test_train_only | | 0.4111 | 6.7541 | 0.3970 | 0.5435 | 1.4096 |
| Thomson Reuters NLG | Thomson Reuters | Primary_1_submission_6_beam | ✓ | 0.6336 | 8.1848 | 0.4322 | 0.6828 | 2.1425 |
| Thomson Reuters NLG | Thomson Reuters | Primary_2_test_train_dev | ✓ | 0.4202 | 6.7686 | 0.3968 | 0.5481 | 1.4389 |
| UCSC-TNT-NLG | UC Santa Cruz | System 1/Primary-Sys1 | ✓ | 0.6561 | 8.5105 | 0.4517 | 0.6839 | 2.2183 |
| UCSC-TNT-NLG | UC Santa Cruz | System 1/Sys1-Model1 | | 0.6476 | 8.4301 | 0.4508 | 0.6795 | 2.1233 |
| UCSC-TNT-NLG | UC Santa Cruz | System 2/Primary-Sys2 | ✓ | 0.6502 | 8.5211 | 0.4396 | 0.6853 | 2.1670 |
| UCSC-TNT-NLG | UC Santa Cruz | System 2/Sys2-Model1 | | 0.6606 | 8.6223 | 0.4439 | 0.6772 | 2.1997 |
| UCSC-TNT-NLG | UC Santa Cruz | System 2/Sys2-Model2 | | 0.6563 | 8.5482 | 0.4482 | 0.6835 | 2.1953 |
| UCSC-TNT-NLG | UC Santa Cruz | System 2/Sys2-Model3 | | 0.3681 | 6.6004 | 0.3846 | 0.5259 | 1.5205 |
| UIT-DANGNT | VNU-HCM University of Information Technology | test_e2e_result_2 final_TSV | ✓ | 0.5990 | 7.9277 | 0.4346 | 0.6634 | 2.0783 |
| UKP-TUDA | Technische Universität Darmstadt | test_e2e-Puzikov | ✓ | 0.5657 | 7.4544 | 0.4529 | 0.6614 | 1.8206 |
The human evaluation was conducted on the 20 primary systems plus the baseline using the CrowdFlower platform. We used our newly developed RankME method (Novikova et al., 2018) to obtain the ratings. Crowd workers were presented with five randomly selected outputs of different systems corresponding to a single meaning representation and were asked to rank these systems from best to worst, with ties permitted. A single human-authored reference was provided for comparison. We collected separate ranks for quality and naturalness.
Quality is defined as the overall quality of the utterance, in terms of its grammatical correctness, fluency, adequacy and other important factors. When collecting quality ratings, system outputs were presented to crowd workers together with the corresponding meaning representation.
Naturalness is defined as the extent to which the utterance could have been produced by a native speaker. When collecting naturalness ratings, system outputs were presented to crowd workers without the corresponding meaning representation.
If used in a real-life NLG system, quality would be considered the primary measure.
The final evaluation results were produced using the TrueSkill algorithm (Sakaguchi et al., 2014). For naturalness, the algorithm performed 1890 pairwise comparisons per system (37800 comparisons in total); for quality, 1260 comparisons per system (25200 comparisons in total). In the results tables, systems are ordered by their inferred TrueSkill scores and clustered; systems within a cluster are considered tied. The clusters were created using bootstrap resampling at a significance level of p ≤ 0.05.
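As a minimal sketch of how TrueSkill turns pairwise comparisons into a ranking (using the open-source `trueskill` Python package rather than the Sakaguchi et al. adaptation actually used for the challenge; the system names and comparison outcomes below are invented):

```python
import trueskill  # pip install trueskill

# Each system starts with the default prior rating (mu=25, sigma~8.33).
systems = {"sysA": trueskill.Rating(),
           "sysB": trueskill.Rating(),
           "sysC": trueskill.Rating()}

# Pairwise outcomes extracted from crowd workers' rankings of system
# outputs; ties between two systems enter as draws.
comparisons = [("sysA", "sysB", False),
               ("sysA", "sysC", False),
               ("sysB", "sysC", True)]

for winner, loser, drawn in comparisons:
    systems[winner], systems[loser] = trueskill.rate_1vs1(
        systems[winner], systems[loser], drawn=drawn)

# Rank systems by their inferred mean skill.
for name, rating in sorted(systems.items(), key=lambda kv: kv[1].mu, reverse=True):
    print(f"{name}: mu={rating.mu:.2f}, sigma={rating.sigma:.2f}")
```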
Quality:

| # | TrueSkill | Rank range | System name | Submitter |
|---|---|---|---|---|
| 1 | 0.300 | (1.0, 1.0) | Slug2Slug | UCSC-Slug2Slug |
| 2 | 0.228 | (2.0, 4.0) | ukp-tuda | UKP-TUDA |
| | 0.213 | (2.0, 5.0) | Primary_test_2 | Heng Gong |
| | 0.184 | (3.0, 5.0) | test_e2e_result_2_final_TSV | UIT-DANGNT |
| | 0.184 | (3.0, 6.0) | Baseline | BASELINE |
| | 0.136 | (5.0, 7.0) | Slug2Slug-alt (late submission) | UCSC-Slug2Slug |
| | 0.117 | (6.0, 8.0) | primary_2 | ZHAW |
| | 0.084 | (7.0, 10.0) | System 1/Primary-Sys1 | UCSC-TNT-NLG |
| | 0.065 | (8.0, 10.0) | System 2/Primary-Sys2 | UCSC-TNT-NLG |
| | 0.048 | (8.0, 12.0) | submission_primary | NLE |
| | 0.018 | (10.0, 13.0) | primary_1 | ZHAW |
| | 0.014 | (10.0, 14.0) | E2E_UPF_1 | FORGe |
| | -0.012 | (11.0, 14.0) | sheffield_primarySystem1_var1 | Sheffield NLP |
| | -0.012 | (11.0, 14.0) | Primary_main_1 | HarvardNLP & Henry Elder |
| 3 | -0.078 | (15.0, 16.0) | Primary_2_test_train_dev | Thomson Reuters NLG |
| | -0.083 | (15.0, 16.0) | E2E_UPF_3 | FORGe |
| 4 | -0.152 | (17.0, 19.0) | primary_submission-temperature_1.1 | Adapt |
| | -0.185 | (17.0, 19.0) | Primary_1_submission_6_beam | Thomson Reuters NLG |
| | -0.186 | (17.0, 19.0) | bzhang_submit | Biao Zhang |
| 5 | -0.426 | (20.0, 21.0) | Primary_NonAbstract-beam1 | Chen Shuang |
| | -0.457 | (20.0, 21.0) | sheffield_primarySystem2_var1 | Sheffield NLP |
Naturalness:

| # | TrueSkill | Rank range | System name | Submitter |
|---|---|---|---|---|
| 1 | 0.211 | (1.0, 1.0) | sheffield_primarySystem2_var1 | Sheffield NLP |
| 2 | 0.171 | (2.0, 3.0) | Slug2Slug | UCSC-Slug2Slug |
| | 0.154 | (2.0, 4.0) | Primary_NonAbstract-beam1 | Chen Shuang |
| | 0.126 | (3.0, 6.0) | Primary_main_1 | HarvardNLP & Henry Elder |
| | 0.105 | (4.0, 8.0) | submission_primary | NLE |
| | 0.101 | (4.0, 8.0) | Baseline | BASELINE |
| | 0.091 | (5.0, 8.0) | test_e2e_result_2 final_TSV | UIT-DANGNT |
| | 0.077 | (5.0, 10.0) | ukp-tuda | UKP-TUDA |
| | 0.060 | (7.0, 11.0) | System 2/Primary-Sys2 | UCSC-TNT-NLG |
| | 0.046 | (9.0, 12.0) | Primary_test_2 | Heng Gong |
| | 0.027 | (9.0, 12.0) | System 1/Primary-Sys1 | UCSC-TNT-NLG |
| | 0.027 | (10.0, 12.0) | bzhang_submit | Biao Zhang |
| 3 | -0.053 | (13.0, 16.0) | Primary_1_submission_6_beam | Thomson Reuters NLG |
| | -0.073 | (13.0, 17.0) | Slug2Slug-alt (late submission) | UCSC-Slug2Slug |
| | -0.077 | (13.0, 17.0) | sheffield_primarySystem1_var1 | Sheffield NLP |
| | -0.083 | (13.0, 17.0) | primary_2 | ZHAW |
| | -0.104 | (15.0, 17.0) | primary_1 | ZHAW |
| 4 | -0.144 | (18.0, 19.0) | E2E_UPF_1 | FORGe |
| | -0.164 | (18.0, 19.0) | primary_submission-temperature_1.1 | Adapt |
| 5 | -0.243 | (20.0, 21.0) | Primary_2_test_train_dev | Thomson Reuters NLG |
| | -0.255 | (20.0, 21.0) | E2E_UPF_3 | FORGe |
A brief description of the challenge results was published at INLG. To cite the challenge, use:
```
@inproceedings{dusek2018findings,
  title={Findings of the {E2E} {NLG} {Challenge}},
  author={Du{\v{s}}ek, Ond{\v{r}}ej and Novikova, Jekaterina and Rieser, Verena},
  booktitle={Proceedings of the 11th International Conference on Natural Language Generation},
  address={Tilburg, The Netherlands},
  year={2018},
  note={arXiv:1810.01170},
  url={https://arxiv.org/abs/1810.01170},
}
```
A greatly extended analysis of the challenge results is now released on arXiv (as a journal submission under review).
System outputs and human ratings can now be downloaded from here. Please use the same citation to refer to this data release.
All submitters participating in human evaluation provided a description of their primary systems as a technical paper. The papers are linked below:
| System | Paper |
|---|---|
| Adapt | Henry Elder, Sebastian Gehrmann, Alexander O'Connor and Qun Liu: E2E NLG Challenge Submission: Towards Controllable Generation of Diverse Natural Language |
| Chen Shuang | Shuang Chen: A General Model for Neural Text Generation from Structured Data |
| FORGe (both systems) | Simon Mille and Stamatia Dasiopoulou: FORGe at E2E 2017 |
| HarvardNLP & Henry Elder | Sebastian Gehrmann, Falcon Z. Dai, Henry Elder and Alexander M. Rush: End-to-End Content and Plan Selection for Natural Language Generation |
| Heng Gong | (final paper pending) |
| NLE | Shubham Agarwal, Marc Dymetman and Éric Gaussier: A char-based seq2seq submission to the E2E NLG Challenge |
| Sheffield NLP (both systems) | Mingjie Chen, Gerasimos Lampouras and Andreas Vlachos: Sheffield at E2E: structured prediction approaches to end-to-end language generation |
| UCSC-Slug2Slug | Juraj Juraska, Panagiotis Karagiannis, Kevin K. Bowden and Marilyn A. Walker: Slug2Slug: A Deep Ensemble Model with Slot Alignment for Sequence-to-Sequence Natural Language Generation |
| UCSC-TNT-NLG, System 1 | Shereen Oraby, Lena Reed, Shubhangi Tandon, Sharath T.S., Stephanie Lukin and Marilyn Walker: TNT-NLG, System 1: Using a Statistical NLG to Massively Augment Crowd-Sourced Data for Neural Generation |
| UCSC-TNT-NLG, System 2 | Shubhangi Tandon, Sharath T.S., Shereen Oraby, Lena Reed, Stephanie Lukin and Marilyn Walker: TNT-NLG, System 2: Data Repetition and Meaning Representation Manipulation to Improve Neural Generation |
| Thomson Reuters NLG, System 1 | Elnaz Davoodi, Charese Smiley, Dezhao Song and Frank Schilder: The E2E NLG Challenge: Training a Sequence-to-Sequence Approach for Meaning Representation to Natural Language Sentences |
| Thomson Reuters NLG, System 2 | Charese Smiley, Elnaz Davoodi, Dezhao Song and Frank Schilder: The E2E NLG Challenge: End-to-End Generation through Partial Template Mining |
| UIT-DANGNT | Dang Tuan Nguyen and Trung Tran: Structure-based Generation System for E2E NLG Challenge |
| UKP-TUDA | Yevgeniy Puzikov and Iryna Gurevych: E2E NLG Challenge: Neural Models vs. Templates |
| Biao Zhang | Biao Zhang, Jing Yang, Qian Lin and Jinsong Su: Attention Regularized Sequence-to-Sequence Learning for E2E NLG Challenge |
| ZHAW (both systems) | Jan Deriu and Mark Cieliebak: End-to-End Trainable System for Enhancing Diversity in Natural Language Generation |
Published versions of the systems participating in the Challenge:
Further works that use the E2E dataset but did not participate in the official E2E challenge:
Jekaterina Novikova
Ondrej Dusek
Verena Rieser
Heriot-Watt University, Edinburgh, UK.
e2e-nlg-challenge@googlegroups.com
Mohit Bansal, University of North Carolina at Chapel Hill
Ehud Reiter, University of Aberdeen
Amanda Stent, Bloomberg
Andreas Vlachos, University of Sheffield
Marilyn Walker, University of California Santa Cruz
Matthew Walter, Toyota Technological Institute at Chicago
Tsung-Hsien Wen, University of Cambridge
Luke Zettlemoyer, University of Washington