LanideNN: Language Identification at Character Level

LanideNN is a language identification method based on bidirectional recurrent neural networks [1]. It performs well in both monolingual and multilingual language identification tasks on six test sets. The method also keeps its accuracy on short documents and across domains, which makes it well suited to off-the-shelf use without preparing training data.

LanideNN predicts a language for every character in a text window:

[Figure: Illustration of text partitioning]
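To make the per-character output more concrete, the following minimal sketch shows one way such labels could be merged into language segments. It is not taken from the LanideNN code: the function name `segments_from_char_labels` is hypothetical, and the per-character labels are written by hand rather than produced by the model.

```python
from itertools import groupby


def segments_from_char_labels(text, labels):
    """Merge runs of identically labelled characters into (label, substring) pairs.

    `labels` is assumed to hold exactly one language label per character of `text`.
    """
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    segments = []
    pos = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)  # number of consecutive characters with this label
        segments.append((label, text[pos:pos + length]))
        pos += length
    return segments


# Hand-made example: six characters labelled English, five labelled German.
text = "Hello Hallo"
labels = ["eng"] * 6 + ["deu"] * 5
print(segments_from_char_labels(text, labels))
# [('eng', 'Hello '), ('deu', 'Hallo')]
```

Grouping consecutive labels keeps the segment boundaries exactly where the predicted language changes, which is what makes a per-character prediction usable for multilingual text.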

We have released the LanideNN code and models at https://github.com/tomkocmi/LanideNN

As part of the paper [1], we created a test set containing 131 languages in order to properly test multilingual language identification of short texts. The dataset contains 100 lines for each of the tested languages, with an average line length of 142.3 characters. Each line of the dataset starts with an ISO-3 label of the language present on that line. All lines are shuffled.
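A file in this format could be read and scored roughly as follows. This is only a sketch under stated assumptions: the separator between the label and the text is assumed to be whitespace (the description above does not specify it), and `parse_line`, `accuracy`, and the `identify` callable are hypothetical names, not part of the released code.

```python
def parse_line(line):
    """Split one test-set line into (iso3_label, text).

    Assumption: the label is separated from the text by whitespace.
    """
    parts = line.rstrip("\n").split(None, 1)
    if len(parts) != 2:
        raise ValueError("expected '<label> <text>', got: %r" % line)
    return parts[0], parts[1]


def accuracy(lines, identify):
    """Fraction of lines for which identify(text) returns the gold label."""
    total = 0
    correct = 0
    for line in lines:
        label, text = parse_line(line)
        total += 1
        if identify(text) == label:
            correct += 1
    return correct / total if total else 0.0


# Two made-up lines in the assumed format, scored with a dummy identifier.
sample = [
    "ces Dobrý den, jak se máte?\n",
    "eng Good morning, how are you?\n",
]
print(accuracy(sample, lambda text: "eng"))
```

Any language identifier that maps a string to a label of the same kind can be passed in as `identify`; the dummy one here always answers `eng` and is only meant to show the call shape.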

The test set contains data from the following corpora: W2C [2], Tatoeba [3], and the Leipzig Corpora Collection [4].

The dataset is freely available for non-commercial research purposes at this link.

References

[1] Tom Kocmi and Ondřej Bojar. LanideNN: Multilingual Language Identification on Character Window. In EACL 2017.

[2] Martin Majliš and Zdeněk Žabokrtský. 2012. Language richness of the web. In LREC, pp. 2927-2934.

[3] http://tatoeba.org

[4] Uwe Quasthoff, Matthias Richter, and Christian Biemann. 2006. Corpus portal for search in monolingual corpora. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), pp. 1799-1802.