Martin Majliš

Ph.D. student (supervised by Zdeněk Žabokrtský) at
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University in Prague

E-mail: majlis@ufal.mff.cuni.cz

Address

UFAL, MFF UK
Malostranské náměstí 25
CZ-118 00 Praha

Education

Research interests

Teaching

Projects

Publications

[7] Martin Majliš and Zdeněk Žabokrtský. Language richness of the web. In Proceedings of LREC2012, Istanbul, Turkey, May 2012. ELRA, European Language Resources Association. In print. [ bib | .pdf ]
We have built a corpus containing texts in 106 languages, collected from the Internet and from Wikipedia. The W2C Web Corpus contains 54.7 GB of text and the W2C Wiki Corpus contains 8.5 GB of text. The W2C Web Corpus offers more than 100 MB of text for 75 languages. At least 10 MB of text is available for 100 languages. These corpora are a unique data source for linguists, since they outclass all published works both in the size of the material collected and the number of languages covered. This language data resource can be of use particularly to researchers specializing in the development of multilingual technologies. We also developed software that greatly simplifies the creation of a new text corpus for a given language, using text materials freely available on the Internet. Special attention was given to components for filtering and de-duplication, which keep the quality of the material very high.

[6] Ondřej Bojar, Zdeněk Žabokrtský, Ondřej Dušek, Petra Galuščáková, Martin Majliš, David Mareček, Jiří Maršík, Michal Novák, Martin Popel, and Aleš Tamchyna. The Joy of Parallelism with CzEng 1.0. In Proceedings of LREC2012, Istanbul, Turkey, May 2012. ELRA, European Language Resources Association. In print. [ bib | .pdf ]
CzEng 1.0 is an updated release of our Czech-English parallel corpus, freely available for non-commercial research or educational purposes. In this release, we approximately doubled the corpus size, reaching 15 million sentence pairs (about 200 million tokens per language). More importantly, we carefully filtered the data to reduce the number of non-matching sentence pairs. CzEng 1.0 is automatically aligned at the level of sentences as well as words. We provide not only the plain text representation, but also automatic morphological tags, surface syntactic as well as deep syntactic dependency parse trees, and automatic co-reference links in both English and Czech.

This paper describes key properties of the released resource including the distribution of text domains, the corpus data formats, and a toolkit to handle the provided rich annotation. We also summarize the procedure of the rich annotation (incl. co-reference resolution) and of the automatic filtering. Finally, we provide some suggestions on exploiting such an automatically annotated sentence-parallel corpus.

[5] Martin Majliš. Yet Another Language Identifier. In EACL 2012. Association for Computational Linguistics, April 2012. [ bib | .pdf ]
Language identification of written text has been studied for several decades. Despite this, most of the research focuses on a few of the most widely spoken languages, whereas the minor ones are ignored. Identifying a larger number of languages brings new difficulties that do not occur with only a few languages, and these difficulties decrease accuracy. The objective of this paper is to investigate the sources of this degradation. To isolate the impact of individual factors, 5 different algorithms and 3 different numbers of languages are used.

The Support Vector Machine algorithm achieved an accuracy of 98% for 90 languages and the YALI algorithm based on a scoring function had an accuracy of 95.4%. The YALI algorithm has slightly lower accuracy but classifies around 17 times faster and its training is more than 4000 times faster.

Three different data sets with varying numbers of languages and sample sizes were prepared to overcome the lack of standardized data sets. These data sets are now publicly available.

[4] Martin Majliš and Zdeněk Žabokrtský. W2C - large multilingual corpus. Technical Report, Prague, Czech Republic, ÚFAL, Charles University, December 2011. [ bib | .pdf ]
We built a corpus containing 106 languages from texts available on Wikipedia and on the Internet. The W2C Wiki Corpus contains 8.5 GB of text and the W2C Web Corpus contains 54.7 GB of text. The software part contains tools for distributed crawling and processing of web pages.

[3] Martin Majliš and Zdeněk Žabokrtský. W2C - web to corpus, December 2011. [ bib | http ]
W2C is a collection of software and data. The software part greatly simplifies creating a new text corpus for a given language, using text materials freely available on the Internet. Special attention was given to the filtering components, which keep the quality of the material very high. The data part contains corpora for more than 100 languages, with around 10 million words in each. This language data resource is especially useful to researchers specializing in the development of multilingual technologies.

[2] Ondřej Bojar, Zdeněk Žabokrtský, Ondřej Dušek, Petra Galuščáková, Martin Majliš, David Mareček, Jiří Maršík, Michal Novák, Martin Popel, and Aleš Tamchyna. CzEng 1.0, December 2011. [ bib | http ]
A new release of the parallel corpus CzEng, this time with a focus on the removal of bad sentence pairs.

[1] Martin Majliš. Large multilingual corpus, September 2011. [ bib | .pdf ]
This thesis introduces the W2C Corpus, which contains 97 languages with more than 10 million words for each of these languages and a total size of 10.5 billion words. The corpus was built by crawling the Internet. This work describes the methods and tools used for its construction. The complete process consisted of building an initial corpus from Wikipedia, developing a language recognizer for 122 languages, implementing a distributed system for crawling and parsing web pages, and finally removing duplicates. A comparative analysis of the texts from Wikipedia and from the Internet is provided at the end of the thesis. The analysis is based on basic statistics such as average word and sentence length, conditional entropy, and perplexity.
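
As a rough illustration of the statistics named in the abstract above, the following is a minimal sketch, not the thesis's actual code. It assumes whitespace tokenization and unsmoothed maximum-likelihood bigram estimates; the function names are hypothetical, and the thesis may define or smooth these quantities differently.

```python
import math
from collections import Counter


def average_word_length(tokens):
    """Mean number of characters per token (assumes a non-empty list)."""
    return sum(len(token) for token in tokens) / len(tokens)


def conditional_entropy(tokens):
    """Estimate H(next word | previous word) in bits from bigram counts.

    Uses unsmoothed maximum-likelihood estimates, which is an assumption
    made for this sketch only.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    history = Counter(tokens[:-1])  # how often each word occurs as a predecessor
    total = sum(bigrams.values())
    entropy = 0.0
    for (previous, _following), count in bigrams.items():
        p_joint = count / total             # P(previous, following)
        p_cond = count / history[previous]  # P(following | previous)
        entropy -= p_joint * math.log2(p_cond)
    return entropy


def perplexity(tokens):
    """Perplexity corresponding to the conditional entropy: 2 ** H."""
    return 2 ** conditional_entropy(tokens)
```

Comparing two text collections, for example Wikipedia text and crawled web text in the same language, would then amount to calling these functions on each token list and comparing the resulting values.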

(see the UFAL publication database for their BibTeX entries)