Downloads

CzEng: CzEng is a Czech-English parallel corpus. Version 0.7 has been superseded by version 0.9, which is more than four times larger.
Visit the CzEng main site.
Trainable Tokenizer v.0.1: Trainable Tokenizer was introduced in: Natalia Klyueva and Ondřej Bojar: UMC 0.1: Czech-Russian-English Multilingual Corpus. Proceedings of the Conference "Corpora 2008". PDF
For more information see the README file in the package.
Download Trainable Tokenizer v.0.1 (tar.gz, 2.9 MB, including training data for cs, en, ru and tentatively also for de, hi, it, pt)
Manually Flagged Errors in WMT09 Test Set: We carried out a thorough analysis of error types occurring in four English-to-Czech MT systems participating in WMT09. The annotated data are available for further research. A report is now under review for publication. Contact Ondřej Bojar if you are interested.
For more information see the README file in the package.
Download Manually Flagged Errors in WMT09 Test Set (tar.gz, 344 KB)
Extensions to Moses Decoder: We also keep experimenting with phrase-based (and, soon, hierarchical) models for English-to-Czech translation. Some of our experiments need minor extensions to the Moses MT system.
To avoid cluttering the main repository, our contributions are available independently, yet synchronized with the main development:
Moses Extensions developed at ÚFAL, hosted on GitHub.

ÚFAL Participation in EuroMatrixPlus