
Institute of Formal and Applied Linguistics

at Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic


Publication


Year 2018
Type in proceedings
Status published
Language English
Author(s) Náplava, Jakub; Straka, Milan; Straňák, Pavel; Hajič, Jan
Title Diacritics Restoration Using Neural Networks
Czech title Doplnění diakritiky pomocí neuronových sítí
Proceedings 2018: Paris, France: LREC 2018: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018)
Pages range 1-10
How published online
URL http://www.lrec-conf.org/proceedings/lrec2018/summaries/573.html
Supported by 2016-2019 DG16P02R019 (NAKI II project on Oral history, with USD AV CR and NFA: Virtuální asistent pro zpřístupnění historických audiovizuálních dat); 2017-2019 EF16_013/0001781 (LINDAT/CLARIN - Výzkumná infrastruktura pro jazykové technologie - rozšíření repozitáře a výpočetní kapacity); 2016-2019 LM2015071 (LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat); 2017-2021 PROGRES Q18 (Společenské vědy: od víceoborovosti k mezioborovosti)
Czech abstract (English translation) The article describes a novel combination of a character-level recurrent neural-network based model and a language model, applied to the task of restoring diacritics in text.
English abstract In this paper, we describe a novel combination of a character-level recurrent neural-network based model and a language model applied to diacritics restoration. People have often replaced, and still replace, characters with diacritics by their ASCII counterparts. Although the resulting text is usually easy for humans to understand, it is much harder to process computationally. The paper opens with a discussion of the applicability of diacritics restoration in selected languages. Next, we present a neural network-based approach to diacritics generation. The core component of our model is a bidirectional recurrent neural network operating at the character level. We evaluate the model on two existing datasets covering four European languages. When combined with a language model, our model reduces the error of the current best systems by 20% to 64%. Finally, we propose a pipeline for obtaining consistent diacritics restoration datasets for twelve languages and evaluate our model on it. All the code is available under an open-source license at https://github.com/arahusky/diacritics_restoration.
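The task setup in the abstract (text with diacritics replaced by ASCII counterparts, restored by a character-level model) can be illustrated with a minimal sketch of how input/target character pairs might be derived from correctly accented text. This is not the authors' code or data pipeline: the function names `strip_char`, `strip_diacritics`, and `char_pairs` are hypothetical, and the sketch assumes Latin-script text in which diacritics are expressed as Unicode combining marks after canonical decomposition.

```python
import unicodedata


def strip_char(ch: str) -> str:
    """Return ch without combining marks (hypothetical helper, not from the paper)."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # A lone combining mark would become empty; keep it unchanged instead.
    return base or ch


def strip_diacritics(text: str) -> str:
    """Simulate de-accented input by removing combining marks.

    Letters without a canonical decomposition (for example 'Ł' or 'ø')
    are left unchanged, so this only approximates an ASCII fallback.
    """
    composed = unicodedata.normalize("NFC", text)
    return "".join(strip_char(ch) for ch in composed)


def char_pairs(text: str) -> list[tuple[str, str]]:
    """Pair each stripped character with its original, accented form.

    A character-level model could take the first element as input and
    predict the second as the target label.
    """
    composed = unicodedata.normalize("NFC", text)
    return [(strip_char(ch), ch) for ch in composed]
```

For example, `strip_diacritics("Doplnění diakritiky")` yields `"Doplneni diakritiky"`, and `char_pairs("řeč")` yields `[("r", "ř"), ("e", "e"), ("c", "č")]`. Framing restoration as one label per input character keeps input and output aligned, which is one plausible reason a bidirectional character-level network suits the task; the exact labeling scheme used in the paper is not stated in this record.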
Specialization linguistics ("jazykověda")
Confidentiality default – not confidential
Open access yes
Article no. 573
Editor(s)* Nicoletta Calzolari; Khalid Choukri; Thierry Declerck; Bente Maegaard; Joseph Mariani; Hélène Mazo; Asunción Moreno; Jan Odijk; Stelios Piperidis
ISBN* 979-10-95546-00-9
Address* Paris, France
Month* May
Venue* Phoenix Seagaia Conference Center
Publisher* European Language Resources Association

Paper (public): diacritization.pdf (application/pdf)