Institute of Formal and Applied Linguistics

at Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic

Year 2016
Type data/software
Status published
Language English
Author(s) Mareček, David; Yu, Zhiwei; Zeman, Daniel; Žabokrtský, Zdeněk
Title Deltacorpus
Czech title Deltacorpus
Publisher LINDAT/CLARIN digital library
Institution Univerzita Karlova v Praze
Publisher's city and country Praha, Czechia
Month March
Note Version 1.1 released 2016-06-20, id http://hdl.handle.net/11234/1-1743
How published online
URL http://hdl.handle.net/11234/1-1662
Supported by 2015-2017 GA15-10472S (Morfologicky a syntakticky anotované korpusy mnoha jazyků; Morphologically and syntactically annotated corpora of many languages); 2012-2016 PRVOUK P46 (Informatika; Informatics)
Czech abstract Texty ve 107 jazycích z korpusu W2C (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), první 1000000 tokenů pro každý jazyk, označkované delexikalizovaným taggerem popsaným v Yu et al. (2016, LREC, Portorož, Slovenia).
English abstract Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia).
Specialization linguistics ("jazykověda")
Confidentiality default – not confidential
Category data
Economic parameters The resource provides a POS tagging solution for 107 languages. For most of them, no such resource was previously available, and creating a manually tagged corpus for a single language may cost hundreds of thousands of CZK.
Open access no
License approval required never
Fee required never
Identifier http://hdl.handle.net/11234/1-1662
