Downloads

CzEng
CzEng is a Czech-English parallel corpus. The latest version, 0.9, replaces version 0.7 and is more than four times its size.
Visit the CzEng main site.
Trainable Tokenizer v.0.1
Trainable Tokenizer was introduced in: Natalia Klyueva and Ondřej Bojar: UMC 0.1: Czech-Russian-English Multilingual Corpus. Proceedings of the Conference "Corpora 2008". PDF
For more information see the README file in the package.
Download Trainable Tokenizer v.0.1 (tar.gz, 2.9 MB, including training data for cs, en, ru and tentatively also for de, hi, it, pt)
Manually Flagged Errors in WMT09 Test Set
We carried out a thorough analysis of error types occurring in four English-to-Czech MT systems participating in WMT09. The annotated data are available for further research. A report is now under review for publication. Contact Ondřej Bojar if you are interested.
For more information see the README file in the package.
Download Manually Flagged Errors in WMT09 Test Set (tar.gz, 344 KB)
Extensions to Moses Decoder
We are also experimenting with phrase-based (and, soon, hierarchical) models for English-to-Czech translation. Some of our experiments require minor extensions to the Moses MT system.
To avoid cluttering the main repository, our contributions are available separately but are kept synchronized with the main development:
Moses Extensions developed at ÚFAL (on GitHub).

This project has been supported by the grant FP7-ICT-2007-3-231720 (EuroMatrixPlus).
2009 © Institute of Formal and Applied Linguistics. All Rights Reserved.