The Czech/English Parallel Corpus

Reader's Digest

The Reader's Digest corpus is a parallel text of articles from Reader's Digest, years 1993-1996. The Czech part is a translation of the English one.


Number of articles: 450
Number of parallel sentences: 53,117
Number of tokens in English part: 1,010,346 (after tokenization and normalization)
Number of tokens in Czech part: 877,658 (after tokenization and normalization)
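
The counts above imply a few derived figures (average sentence length per language, Czech-to-English token ratio, sentence pairs per article). The following Python sketch only redoes that arithmetic from the published numbers; the function and key names are illustrative and are not part of the corpus distribution.

```python
# Figures copied from the corpus statistics above.
ARTICLES = 450
SENTENCE_PAIRS = 53_117
EN_TOKENS = 1_010_346
CS_TOKENS = 877_658


def corpus_ratios(articles=ARTICLES, pairs=SENTENCE_PAIRS,
                  en_tokens=EN_TOKENS, cs_tokens=CS_TOKENS):
    """Return simple averages derived from the published counts."""
    return {
        "en_tokens_per_sentence": en_tokens / pairs,
        "cs_tokens_per_sentence": cs_tokens / pairs,
        "cs_to_en_token_ratio": cs_tokens / en_tokens,
        "sentence_pairs_per_article": pairs / articles,
    }


if __name__ == "__main__":
    for name, value in corpus_ratios().items():
        print(f"{name}: {value:.3f}")
```

The Czech side comes out shorter in tokens, roughly 16.5 tokens per sentence against roughly 19 for English.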

Included Information

Sentence pairs were aligned automatically by our implementation of the Gale and Church (1993) algorithm and then realigned using the SIMR/GSA tool (Melamed, 1996).
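
For readers unfamiliar with length-based alignment, here is a minimal Python sketch in the spirit of Gale and Church (1993). It is not the implementation used to build this corpus, and it does not cover the SIMR/GSA realignment step. The constants (match-type priors, length ratio `C`, variance `S2`) are the values usually quoted from the paper and should be treated as assumptions. Sentence length is measured in characters, and all function names are illustrative.

```python
import math

# Assumed priors for each alignment type (source sentences, target sentences).
PRIORS = {
    (1, 1): 0.89,
    (1, 0): 0.0099, (0, 1): 0.0099,
    (2, 1): 0.089, (1, 2): 0.089,
    (2, 2): 0.011,
}
C = 1.0    # assumed expected number of target characters per source character
S2 = 6.8   # assumed variance of the length difference per character


def match_cost(len_src, len_tgt, prior):
    """Negative log-probability of aligning segments with these lengths."""
    if len_src == 0 and len_tgt == 0:
        return -math.log(prior)
    mean = (len_src + len_tgt / C) / 2.0
    z = (C * len_src - len_tgt) / math.sqrt(S2 * mean)
    # Two-tailed probability of a deviation at least this large under N(0, 1).
    p_len = max(math.erfc(abs(z) / math.sqrt(2.0)), 1e-300)
    return -math.log(prior * p_len)


def align(src, tgt):
    """Return a list of (source_sentences, target_sentences) beads."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward dynamic programming over sentence positions.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                total = cost[i][j] + match_cost(len_src, len_tgt, prior)
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (i, j)
    # Follow back-pointers from the end to recover the cheapest path.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    beads.reverse()
    return beads


if __name__ == "__main__":
    english = ["He came home late.", "It was raining.", "Nobody was there."]
    czech = ["Přišel domů pozdě a pršelo.", "Nikdo tam nebyl."]
    for en_part, cs_part in align(english, czech):
        print(en_part, "<->", cs_part)
```

The sketch relies only on sentence lengths, so it needs no dictionary; the price is that a single bad match can throw off the rest of a passage, which is one reason a second realignment pass is useful.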

See the comparison of sentence alignments between the previous and current versions of this corpus.

The whole corpus has been morphologically analyzed, tagged, and lemmatized using the BH tools, which are described in (Hajic and Hladka, 1998). The Czech part of the corpus has been parsed by the Statistical Parser for Czech (Hajic et al., 1998).

Data Format and Location

Related Projects and Experiments