Deltacorpus

Grant:

Morphologically and Syntactically Annotated Corpora of Many Languages

Tags:

Corpora, Data, Machine Learning, Taggers

Deltacorpus is a corpus of many languages, tagged by a DELexicalized TAgger (Yu et al., 2016). The tagging approach is unique in that it needs only a raw corpus of the target language, whereas other semi-supervised methods typically need bilingual data or a dictionary. We employ language-independent features such as word length, frequency, neighborhood entropy and character classes (alphabetic vs. numeric vs. punctuation). We demonstrate that such features can, to a certain extent, serve as predictors of the part of speech, represented by the universal POS tag (Das and Petrov, 2011). Even though the tagging accuracy is well below the results achieved by methods based on parallel data, the independence of our method from such data makes it a temporary solution for a number of languages for which parallel data are hard to obtain.
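To make the idea of language-independent features more concrete, here is a minimal Python sketch that computes word length, relative frequency, neighborhood entropy and a character class for each word type in a raw tokenized corpus. The function names, the feature names and the exact definitions (for example, Shannon entropy over immediate left and right neighbors) are assumptions made for illustration; the feature set actually used by the tagger is described in Yu et al. (2016) and may differ.

```python
import math
from collections import Counter, defaultdict


def char_class(token):
    """Coarse character class of a token (illustrative definition)."""
    if token.isalpha():
        return "alpha"
    if token.isdigit():
        return "numeric"
    if all(not ch.isalnum() for ch in token):
        return "punct"
    return "mixed"


def entropy(counter):
    """Shannon entropy (in bits) of a distribution given as counts."""
    total = sum(counter.values())
    return 0.0 - sum((c / total) * math.log2(c / total) for c in counter.values())


def delexicalized_features(tokens):
    """Map each word type to features that do not depend on the word's identity."""
    freq = Counter(tokens)
    left = defaultdict(Counter)   # counts of words seen immediately before a word
    right = defaultdict(Counter)  # counts of words seen immediately after a word
    for prev, cur in zip(tokens, tokens[1:]):
        right[prev][cur] += 1
        left[cur][prev] += 1

    n = len(tokens)
    features = {}
    for word, count in freq.items():
        features[word] = {
            "length": len(word),
            "rel_freq": count / n,
            "left_entropy": entropy(left[word]),
            "right_entropy": entropy(right[word]),
            "char_class": char_class(word),
        }
    return features


if __name__ == "__main__":
    sample = "the cat saw the dog , the cat ran 42 times .".split()
    for word, feats in delexicalized_features(sample).items():
        print(word, feats)
```

Because none of these values refers to the word form itself, a classifier trained on them in one set of languages can in principle be applied to the raw text of another language.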

Deltacorpus contains web-crawled texts in 107 languages from the W2C corpus, roughly 1 million tokens per language (except for a few languages that have less data in W2C). We have excluded languages whose WEB part in W2C is too noisy (especially due to wrong language identification), as well as a few Asian languages with non-trivial word segmentation (e.g. Chinese, Japanese and Thai). All languages were tagged using the same delexicalized model, trained on a mixture of 7 languages from HamleDT 3.0. These source languages are Bulgarian, Catalan, German, Greek, Hindi, Hungarian and Turkish (50,000 training tokens per language).

Deltacorpus 1.0 is available from the LINDAT/CLARIN repository at http://hdl.handle.net/11234/1-1662.

Deltacorpus 1.1 differs in three respects:
1. The Universal Dependencies tagset (http://universaldependencies.org/) is used instead of the older and smaller Google universal POS tagset.
2. The classifier was trained on Universal Dependencies 1.2 instead of HamleDT.
3. Balto-Slavic, Germanic and Romance languages were each tagged by a classifier trained only on the respective group of languages. All other languages were tagged by a classifier trained on all available languages. (Exceptions: Hungarian is tagged with the Germanic classifier; Romanian and Latin are tagged with the classifier trained on all languages.)
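The choice of classifier per language in version 1.1 can be sketched as a simple lookup. The model names, the function `select_model` and the use of ISO 639-3 codes (`hun`, `ron`, `lat`) are hypothetical and serve only to illustrate the grouping rule and its exceptions; they are not taken from the Deltacorpus distribution.

```python
# Hypothetical model identifiers, one per language group with its own classifier.
GROUP_MODEL = {
    "balto-slavic": "balto_slavic",
    "germanic": "germanic",
    "romance": "romance",
}

# Exceptions stated in the description, keyed by assumed ISO 639-3 codes.
OVERRIDES = {
    "hun": "germanic",  # Hungarian goes with Germanic
    "ron": "all",       # Romanian goes with the all-languages classifier
    "lat": "all",       # Latin goes with the all-languages classifier
}


def select_model(lang_code, group):
    """Return the name of the classifier used to tag a given language."""
    if lang_code in OVERRIDES:
        return OVERRIDES[lang_code]
    # Languages outside the three groups fall back to the all-languages model.
    return GROUP_MODEL.get(group.lower(), "all")


if __name__ == "__main__":
    print(select_model("ces", "Balto-Slavic"))
    print(select_model("hun", "Uralic"))
    print(select_model("ron", "Romance"))
```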

Zhiwei Yu, David Mareček, Daniel Zeman, Zdeněk Žabokrtský. 2016. If You Even Don't Have a Bit of Bible: Learning Delexicalized POS Taggers. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia.

How to cite

@inproceedings{ biblio:YuMaZeZa2016,
title = {If You Even Don't Have a Bit of Bible: Learning Delexicalized {POS} Taggers},
author = {Zhiwei Yu and David Mare{\v{c}}ek and Daniel Zeman and Zden{\v{e}}k {\v{Z}}abokrtsk{\'{y}}},
year = {2016},
booktitle = {Proceedings of the 10th International Conference on Language Resources and Evaluation ({LREC 2016})},
editor = {Nicoletta Calzolari and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asunción Moreno and Jan Odijk and Stelios Piperidis},
publisher = {European Language Resources Association},
address = {Portorož, Slovenia},
venue = {Grand Hotel Bernardin Conference Center},
pages = {1659--1666},
isbn = {978-2-9517408-9-1}
}

Delexicalized tagger applied to many languages
