Diacritics Restoration using BERT with Analysis on Czech language

Jakub Náplava, Milan Straka, Jana Straková

References:

  1. Kübra Adali and Gülşen Eryiğit. Vowel and diacritic restoration for social media texts. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM), pages 53–61, 2014. (http://doi.org/10.3115/v1/W14-1307)
  2. Badr AlKhamissi, Muhammad N ElNokrashy, and Mohamed Gabr. Deep Diacritization: Efficient Hierarchical Recurrence for Improved Arabic Diacritization. arXiv preprint arXiv:2011.00538, 2020.
  3. Sawsan Alqahtani, Ajay Mishra, and Mona Diab. Efficient Convolutional Neural Networks for Diacritic Restoration. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1442–1448, 2019. (http://doi.org/10.18653/v1/D19-1151)
  4. Yonatan Belinkov and James Glass. Arabic diacritization with recurrent neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2281–2285, 2015. (http://doi.org/10.18653/v1/D15-1274)
  5. Yonatan Belinkov and Yonatan Bisk. Synthetic and natural noise both break neural machine translation. arXiv preprint arXiv:1711.02173, 2017.
  6. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Édouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, 2020.
  7. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
  8. Mokhtar Madhfar and Ali Mustafa Qamar. Effective Deep Learning Models for Automatic Diacritization of Arabic Text. IEEE Access, IEEE, 2020.
  9. Jana Straková, Milan Straka, and Jan Hajič. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13–18, Association for Computational Linguistics, Baltimore, Maryland, 2014. (http://doi.org/10.3115/v1/P14-5003)
  10. Hamdy Mubarak, Ahmed Abdelali, Hassan Sajjad, Younes Samih, and Kareem Darwish. Highly effective Arabic diacritization using sequence to sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2390–2395, 2019.
  11. Jakub Náplava, Milan Straka, Pavel Straňák, and Jan Hajič. Diacritics restoration using neural networks. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.
  12. Cao Hong Nga, Nguyen Khai Thinh, Pao-Chi Chang, and Jia-Ching Wang. Deep Learning Based Vietnamese Diacritics Restoration. In 2019 IEEE International Symposium on Multimedia (ISM), pages 331–3313, 2019. (http://doi.org/10.1109/ISM46123.2019.00074)
  13. Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4034–4043, European Language Resources Association, Marseille, France, 2020.
  14. Dan Zeman. DIAKRITIZACE TEXTU. In CzechEncy - Nový encyklopedický slovník češtiny, Nakladatelství Lidové noviny, Praha, Czech Republic, 2016.
  15. Maria Nuţu, Beáta Lőrincz, and Adriana Stan. Deep learning for automatic diacritics restoration in Romanian. In 2019 IEEE 15th International Conference on Intelligent Computer Communication and Processing (ICCP), pages 235–240, 2019. (http://doi.org/10.1109/ICCP48234.2019.8959557)
  16. Iroro Orife. Attentive Sequence-to-Sequence Learning for Diacritic Restoration of Yorùbá Language Text. In Proc. Interspeech 2018, pages 2848–2852, 2018. (http://doi.org/10.21437/Interspeech.2018-42)
  17. Barbara Rychalska, Dominika Basaj, Alicja Gosiewska, and Przemysław Biecek. Models in the Wild: On Corruption Robustness of Neural NLP Systems. In International Conference on Neural Information Processing, pages 235–247, 2019. (http://doi.org/10.1007/978-3-030-36718-3_20)
  18. Milan Straka, Jana Straková, and Jan Hajič. Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER. In Text, Speech, and Dialogue, pages 137–150, Springer International Publishing, Cham, 2019. (http://doi.org/10.1007/978-3-030-27947-9_12)