PDT 1.0 - Czech-English Corpus

The Czech/English Parallel Corpus

Reader's Digest

The Reader's Digest corpus is a parallel text consisting of articles from Reader's Digest published in 1993-1996. The Czech part is a translation of the English one.

Statistics

Number of articles: 450
Number of parallel sentences: 53,117
Number of tokens in English part: 1,010,346 (after tokenization and normalization)
Number of tokens in Czech part: 877,658 (after tokenization and normalization)
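
A few derived figures follow from the counts above. The short sketch below only does the arithmetic; the variable names are illustrative, and the averages are per aligned sentence pair, not per sentence of either language taken separately.

```python
# Counts copied from the statistics above.
articles = 450
sentence_pairs = 53_117
english_tokens = 1_010_346
czech_tokens = 877_658

# Average number of tokens per aligned sentence pair, by language.
english_per_pair = english_tokens / sentence_pairs
czech_per_pair = czech_tokens / sentence_pairs

# Ratio of Czech to English tokens and average sentence pairs per article.
czech_to_english = czech_tokens / english_tokens
pairs_per_article = sentence_pairs / articles

print(f"English tokens per sentence pair: {english_per_pair:.2f}")
print(f"Czech tokens per sentence pair:   {czech_per_pair:.2f}")
print(f"Czech/English token ratio:        {czech_to_english:.2f}")
print(f"Sentence pairs per article:       {pairs_per_article:.1f}")
```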

Included Information

Sentence pairs were aligned automatically using our implementation of the Gale and Church (1993) algorithm and then realigned using the SIMR/GSA tool (Melamed, 1996).
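
For orientation, the length-based idea behind Gale and Church (1993) can be sketched in a few lines. This is a simplified illustration, not the implementation used to build this corpus. It assumes sentence lengths are given in characters, uses the parameter values reported in the paper (c = 1, s² = 6.8, and the alignment-type priors), and omits the paragraph-level anchoring and the SIMR/GSA realignment step. The function names are made up for this example.

```python
import math

# Prior probabilities of alignment types, as reported by Gale and Church (1993).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
C = 1.0    # expected number of target characters per source character
S2 = 6.8   # variance of the target length per source character


def norm_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def match_cost(len_src, len_tgt, prior):
    """Negative log probability of aligning two segments of these lengths."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / C) / 2.0
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-tailed probability of a length difference at least this large.
    p_delta = max(2.0 * (1.0 - norm_cdf(abs(delta))), 1e-12)
    return -math.log(p_delta) - math.log(prior)


def align(src_lens, tgt_lens):
    """Return the minimum-cost list of (source indices, target indices) beads."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward dynamic programming over prefixes of both texts.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = match_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]), prior)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Trace the best path back from the end of both texts.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Calling `align([10, 12, 40], [22, 40])`, for instance, groups the first two source sentences with the first target sentence, because their combined length matches it far better than either does alone.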

See the comparison of sentence alignments between the previous and the current version of this corpus.

The whole corpus has been morphologically analyzed, tagged, and lemmatized using the BH tools, which are described in (Hajic and Hladka, 1998). The Czech part of the corpus has been parsed by the Statistical Parser for Czech (Hajic et al., 1998).

Data Format and Location

Data files are in SGML format; see the PCDoc DTD. A brief description of the data format can be found in the file Description.txt.
File name format:
[file_id]c.[ext] for Czech part of the corpus
[file_id]e.[ext] for English part of the corpus
Data Location
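
The naming convention above makes it straightforward to pair the Czech and English files of one article. The following is a minimal sketch under stated assumptions: the file names in the usage note (`rd01c.sgm` and so on) are hypothetical, as is the helper `pair_files`; only the `[file_id]c.[ext]` / `[file_id]e.[ext]` pattern comes from this description.

```python
import re
from collections import defaultdict

# [file_id] followed by a language letter (c = Czech, e = English) and an extension.
NAME_RE = re.compile(r"^(?P<file_id>.+)(?P<lang>[ce])\.(?P<ext>[^.]+)$")


def pair_files(names):
    """Return sorted (file_id, czech_name, english_name) for complete pairs."""
    by_key = defaultdict(dict)
    for name in names:
        match = NAME_RE.match(name)
        if match is None:
            continue  # not a corpus data file, e.g. Description.txt
        key = (match["file_id"], match["ext"])
        by_key[key][match["lang"]] = name
    pairs = []
    for (file_id, _ext), langs in sorted(by_key.items()):
        # Keep only articles for which both language parts are present.
        if "c" in langs and "e" in langs:
            pairs.append((file_id, langs["c"], langs["e"]))
    return pairs
```

For example, `pair_files(["rd01c.sgm", "rd01e.sgm", "rd02c.sgm", "Description.txt"])` yields a single pair for `rd01`; the unpaired Czech file and the description file are skipped.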

Related Projects and Experiments

Automatic translation dictionary extraction and noun phrase identification (Curin and Cmejrek, 1999)
Statistical machine translation from Czech to English at JHU Workshop 1999 (Al-Onaizan et al., 1999)

References

Al-Onaizan Yaser, Jan Curin, Michael Jahr, Kevin Knight, John Lafferty, Dan Melamed, Franz-Josef Och, David Purdy, Noah A. Smith, David Yarowsky. 1999. Statistical Machine Translation. Final Report, JHU Summer Workshop'99.
Available in: pdffile, psfile
Curin, J. and M. Cmejrek. 1999. Automatic Translation Lexicon Extraction from Czech/English Parallel Texts. In The Prague Bulletin of Mathematical Linguistics 71. pp. 47-57.
Available in: pdffile, psfile
Gale, W. and K. Church. 1993. A program for aligning sentences in bilingual corpora. In Computational Linguistics 19(1).
Hajic, Jan, Eric Brill, Michael Collins, Barbora Hladka, Douglas Jones, Cynthia Kuo, Lance Ramshaw, Oren Schwartz, Christoph Tillmann, and Daniel Zeman. 1998. Core Natural Language Processing Technology Applicable to Multiple Languages. Final Report, JHU Summer Workshop'98. (JHU WS'98 webpage)
Hajic, Jan and Barbora Hladka. 1998. Tagging inflective languages: Prediction of morphological categories for a rich, structured tagset. In Proceedings of Coling/ACL.
Available in: pdffile, psfile
Melamed, I. Dan. 1996. A geometric approach to mapping bitext correspondence. In Proceedings of the First Conference on Empirical Methods in Natural Language Processing.