LanideNN: Language Identification at Character Level
LanideNN is a language identification method based on bidirectional recurrent neural networks [1]. It performs well in monolingual and multilingual language identification tasks on six test sets. The method retains its accuracy on short documents and across domains, which makes it well suited to off-the-shelf use without preparing training data.
LanideNN predicts a language for every character in a text window.
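To illustrate what per-character output can look like downstream, here is a minimal sketch that merges a sequence of per-character language labels into contiguous spans. It is not the LanideNN implementation: the function name `collapse_char_labels` and the example labels are hypothetical, and the sketch assumes that exactly one label per character is already available.

```python
from itertools import groupby


def collapse_char_labels(text, labels):
    """Merge per-character language labels into (language, substring) spans.

    Hypothetical helper for illustration; assumes one label per character.
    """
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    spans = []
    pos = 0
    # groupby yields one group per run of identical consecutive labels.
    for lang, run in groupby(labels):
        length = sum(1 for _ in run)
        spans.append((lang, text[pos:pos + length]))
        pos += length
    return spans


# Made-up labels for a two-language string (not real model output).
text = "Hello Hallo"
labels = ["eng"] * 6 + ["deu"] * 5
print(collapse_char_labels(text, labels))
```

Grouping consecutive identical labels keeps the character offsets intact, so a multilingual text can be split at the points where the predicted language changes.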
We have released the LanideNN code and models at https://github.com/tomkocmi/LanideNN
As part of the paper [1], we created a test set containing 131 languages in order to properly test multilingual language identification of short texts. The dataset contains 100 lines for each of the tested languages, with an average line length of 142.3 characters. Each line of the dataset starts with an ISO-3 label of the language present on that line. All lines are shuffled.
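The line format described above can be read with a few lines of Python. This is a sketch under assumptions: the description does not say which separator follows the label, so the code assumes the label is the first whitespace-delimited token. The names `parse_line` and `accuracy`, the sample lines, and the dummy predictor are all hypothetical.

```python
def parse_line(line):
    """Split one test-set line into (label, text).

    Assumption: the label is the first whitespace-delimited token.
    """
    parts = line.rstrip("\r\n").split(None, 1)
    if not parts:
        raise ValueError("empty line")
    label = parts[0]
    text = parts[1] if len(parts) > 1 else ""
    return label, text


def accuracy(lines, predict):
    """Fraction of non-empty lines whose predicted label matches the gold label."""
    correct = 0
    total = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        label, text = parse_line(line)
        total += 1
        if predict(text) == label:
            correct += 1
    return correct / total if total else 0.0


# Illustrative lines only, not taken from the released test set.
sample = ["eng This is a sentence.", "ces Toto je věta."]
print(accuracy(sample, lambda text: "eng"))  # dummy predictor that always answers "eng"
```

Any identifier that maps a string to a label can be passed as `predict`, which keeps the evaluation loop independent of a particular model.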
The test set contains data from the following corpora: W2C [2], Tatoeba [3] and the Leipzig Corpora Collection [4].
The dataset is freely available for non-commercial research purposes at this link.
References
[1] Tom Kocmi and Ondřej Bojar. LanideNN: Multilingual Language Identification on Character Window. In EACL 2017.
[2] Martin Majliš and Zdeněk Žabokrtský. 2012. Language richness of the web. In LREC, pp. 2927-2934.
[4] Uwe Quasthoff, Matthias Richter, and Christian Biemann. 2006. Corpus portal for search in monolingual corpora. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), pp. 1799-1802.