Martin Majliš
|
Ph.D. student (supervised by Zdeněk Žabokrtský) at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague. E-mail: majlis@ufal.mff.cuni.cz |
Address: UFAL, MFF UK, Malostranské náměstí 25, CZ-118 00 Praha |
Education
- 2011 Mgr. (Master's degree) in Computational Linguistics, Faculty of Mathematics and Physics, Charles University in Prague.
- 2008 Bc. (Bachelor's degree) in Computer Science, Faculty of Mathematics and Physics, Charles University in Prague.
Research interests
- machine learning, machine translation
Teaching
Projects
Publications
| [7] |
Martin Majliš and Zdeněk Žabokrtský.
Language richness of the web.
In Proceedings of LREC2012, Istanbul, Turkey, May 2012. ELRA,
European Language Resources Association.
In print.
[ bib |
.pdf ]
We have built a corpus containing texts in 106 languages from texts available on the Internet and on Wikipedia. The W2C Web Corpus contains 54.7 GB of text and the W2C Wiki Corpus contains 8.5 GB of text. The W2C Web Corpus contains more than 100 MB of text for 75 languages, and at least 10 MB of text for 100 languages. These corpora are a unique data source for linguists, since they outclass all published works both in the size of the material collected and the number of languages covered. This language data resource can be of use particularly to researchers specializing in the development of multilingual technologies. We also developed software that greatly simplifies the creation of a new text corpus for a given language, using text materials freely available on the Internet. Special attention was given to the filtering and de-duplication components, which keep the quality of the material very high.
|
| [6] |
Ondřej Bojar, Zdeněk Žabokrtský, Ondřej Dušek,
Petra Galuščáková, Martin Majliš, David
Mareček, Jiří Maršík, Michal Novák,
Martin Popel, and Aleš Tamchyna.
The Joy of Parallelism with CzEng 1.0.
In Proceedings of LREC2012, Istanbul, Turkey, May 2012. ELRA,
European Language Resources Association.
In print.
[ bib |
.pdf ]
CzEng 1.0 is an updated release of our Czech-English parallel corpus, freely available for non-commercial research or educational purposes. In this release, we approximately doubled the corpus size, reaching 15 million sentence pairs (about 200 million tokens per language). More importantly, we carefully filtered the data to reduce the amount of non-matching sentence pairs. CzEng 1.0 is automatically aligned at the level of sentences as well as words. We provide not only the plain text representation, but also automatic morphological tags, surface syntactic as well as deep syntactic dependency parse trees and automatic co-reference links in both English and Czech.
|
| [5] |
Martin Majliš.
Yet Another Language Identifier.
In EACL 2012. The Association for Computational Linguistics, April
2012.
[ bib |
.pdf ]
Language identification of written text has been studied for several decades. Despite this, most research focuses on a few of the most widely spoken languages, whereas the minor ones are ignored. Identifying a larger number of languages brings new difficulties that do not occur with only a few languages, and these difficulties decrease accuracy. The objective of this paper is to investigate the sources of this degradation. In order to isolate the impact of individual factors, 5 different algorithms and 3 different numbers of languages are used.
|
| [4] |
Martin Majliš and Zdeněk Žabokrtský.
W2C - large multilingual corpus.
Technical Report Prague, Czech Republic, ÚFAL, Charles
University, December 2011.
[ bib |
.pdf ]
We built a corpus containing 106 languages from texts available on Wikipedia and on the Internet. The W2C Wiki Corpus contains 8.5 GB of text and the W2C Web Corpus contains 54.7 GB of text. The software part contains tools for distributed crawling and processing of web pages.
|
| [3] |
Martin Majliš and Zdeněk Žabokrtský.
W2C - web to corpus, December 2011.
[ bib |
http ]
W2C is a collection of software and data. The software part greatly simplifies the creation of a new text corpus for a given language, using text materials freely available on the Internet. Special attention was given to the filtering components, which keep the quality of the material very high. The data part contains corpora for more than 100 languages, with around 10 million words in each. This language data resource can be of use particularly to researchers specializing in the development of multilingual technologies.
|
| [2] |
Ondřej Bojar, Zdeněk Žabokrtský, Ondřej Dušek,
Petra Galuščáková, Martin Majliš, David
Mareček, Jiří Maršík, Michal Novák, Martin
Popel, and Aleš Tamchyna.
CzEng 1.0, December 2011.
[ bib |
http ]
A new release of the parallel corpus CzEng, this time with a focus on the removal of bad sentence pairs.
|
| [1] |
Martin Majliš.
Large multilingual corpus, September 2011.
[ bib |
.pdf ]
This thesis introduces the W2C Corpus, which contains 97 languages with more than 10 million words for each of these languages, with a total size of 10.5 billion words. The corpus was built by crawling the Internet. This work describes the methods and tools used for its construction. The complete process consisted of building an initial corpus from Wikipedia, developing a language recognizer for 122 languages, implementing a distributed system for crawling and parsing web pages, and finally removing duplicates. A comparative analysis of the texts of Wikipedia and the Internet is provided at the end of this thesis. The analysis is based on basic statistics such as average word and sentence length, conditional entropy, and perplexity.
|