Every Layer Counts: Multi-Layer Multi-Head Attention for Neural Machine Translation

Isaac Kojo Essel Ampomah, Sally McClean, Zhiwei Lin, Glenn Hawe
