The Czech/English Parallel Corpus
Reader's Digest
The Reader's Digest corpus is a parallel text of articles from Reader's
Digest, years 1993-1996. The Czech part is a translation of the English one.
Statistics
Number of articles: 450
Number of parallel sentences: 53,117
Number of tokens in English part: 1,010,346 (after tokenization and
normalization)
Number of tokens in Czech part: 877,658 (after tokenization and normalization)
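A few derived ratios follow directly from the figures above. The short sketch below only restates that arithmetic; the constant and function names are illustrative and not part of the corpus distribution.

```python
# Figures copied from the Statistics section above.
ARTICLES = 450
SENTENCE_PAIRS = 53_117
ENGLISH_TOKENS = 1_010_346
CZECH_TOKENS = 877_658


def corpus_ratios():
    """Return simple averages derived from the published corpus counts."""
    return {
        "sentence_pairs_per_article": SENTENCE_PAIRS / ARTICLES,
        "english_tokens_per_sentence": ENGLISH_TOKENS / SENTENCE_PAIRS,
        "czech_tokens_per_sentence": CZECH_TOKENS / SENTENCE_PAIRS,
        "czech_to_english_token_ratio": CZECH_TOKENS / ENGLISH_TOKENS,
    }


if __name__ == "__main__":
    for name, value in corpus_ratios().items():
        print(f"{name}: {value:.3f}")
```

The Czech side has fewer tokens than the English side, roughly 0.87 Czech tokens per English token.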
Included Information
Sentence pairs were aligned automatically by our implementation of the Gale
and Church (1993) algorithm and then realigned using the SIMR/GSA tool
(Melamed, 1996).
See the comparison of sentence alignments
between the previous and current versions of this corpus.
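For readers unfamiliar with length-based alignment, the following is a minimal sketch of the idea behind the Gale and Church (1993) method: sentences are matched by dynamic programming over their character lengths. It is not the implementation used to build this corpus. The parameter values are the ones reported in the original paper, the mean-length variance normalization is one common variant, and the function names are illustrative.

```python
import math

# Parameters reported by Gale and Church (1993); assuming they also suit
# other language pairs is a simplification made for this sketch.
MEAN_RATIO = 1.0   # expected target characters per source character
VARIANCE = 6.8     # variance of that ratio per character
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}


def match_cost(len_src, len_tgt, prior):
    """Negative log probability of aligning spans with the given lengths."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / MEAN_RATIO) / 2.0
    delta = (len_tgt - len_src * MEAN_RATIO) / math.sqrt(mean * VARIANCE)
    # Two-sided tail probability of a standard normal variable.
    tail = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(tail, 1e-300)) - math.log(prior)


def align(src_lens, tgt_lens):
    """Return a list of (source indices, target indices) alignment beads."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for (di, dj), prior in PRIORS.items():
                if i < di or j < dj or cost[i - di][j - dj] == inf:
                    continue
                step = match_cost(sum(src_lens[i - di:i]),
                                  sum(tgt_lens[j - dj:j]), prior)
                total = cost[i - di][j - dj] + step
                if total < cost[i][j]:
                    cost[i][j] = total
                    back[i][j] = (di, dj)
    # Walk the back-pointers from the end to recover the best path.
    beads = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    beads.reverse()
    return beads
```

Given sentence lengths in characters, `align([50], [24, 25])` pairs the single source sentence with both target sentences, since their combined length is the closest match.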
The whole corpus has been morphologically analyzed, tagged, and lemmatized
using BH tools. A description of these tools is in (Hajic
and Hladka, 1998). The Czech part of the corpus has been parsed by the Statistical
Parser for Czech (Hajic et al., 1998).
Data Format and Location
- Data files are in SGML format. See the PCDoc DTD.
A brief description of the data format can be found in the file Description.txt.
- File name format:
[file_id]c.[ext] for Czech part of the corpus
[file_id]e.[ext] for English part of the corpus
- Data Location
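As an illustration of the file naming convention above, the sketch below pairs Czech and English files that share a file_id and extension. The function name and the example file names are hypothetical; actual file ids and extensions are given with the data.

```python
import os
from collections import defaultdict


def pair_corpus_files(filenames):
    """Pair '<file_id>c.<ext>' (Czech) with '<file_id>e.<ext>' (English).

    Returns a dict mapping (file_id, ext) to (czech_name, english_name).
    Files without a counterpart, or not ending in 'c' or 'e', are skipped.
    """
    sides = defaultdict(dict)
    for name in filenames:
        stem, ext = os.path.splitext(name)
        if len(stem) < 2 or stem[-1] not in ("c", "e"):
            continue
        sides[(stem[:-1], ext)][stem[-1]] = name
    return {
        key: (pair["c"], pair["e"])
        for key, pair in sides.items()
        if "c" in pair and "e" in pair
    }


if __name__ == "__main__":
    # Hypothetical file names, used only to show the pairing.
    names = ["rd001c.sgm", "rd001e.sgm", "rd002c.sgm", "notes.txt"]
    print(pair_corpus_files(names))
```

Note that the match is purely by suffix, so any unrelated file whose name happens to end in "c" or "e" should be filtered out beforehand.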
Related Projects and Experiments
References
-
Al-Onaizan, Yaser, Jan Curin,
Michael Jahr, Kevin Knight, John Lafferty, Dan Melamed, Franz-Josef Och,
David Purdy, Noah A. Smith, and David Yarowsky. 1999. Statistical Machine Translation.
Final Report, JHU Summer Workshop'99.
Available in: pdffile,
psfile
-
Curin, J. and M. Cmejrek.
1999. Automatic Translation Lexicon Extraction from Czech/English Parallel
Texts. In The Prague Bulletin of Mathematical Linguistics 71. pp. 47-57.
Available in: pdffile,
psfile
-
Gale, W. and K. Church. 1993.
A program for aligning sentences in bilingual corpora. In Computational
Linguistics 19(1).
-
Hajic, Jan, Eric Brill, Michael Collins,
Barbora Hladka, Douglas Jones, Cynthia Kuo, Lance Ramshaw, Oren Schwartz,
Christoph Tillmann, and Daniel Zeman. 1998. Core Natural Language Processing
Technology Applicable to Multiple Languages. Final Report, JHU Summer Workshop'98.
(JHU WS'98 webpage)
-
Hajic, Jan and Barbora
Hladka. 1998. Tagging inflective languages: Prediction of morphological
categories for a rich, structured tagset. In Proceedings of Coling/ACL.
Available in: pdffile,
psfile
-
Melamed, I. Dan. 1996. A geometric
approach to mapping bitext correspondence. In Proceedings of the First
Conference on Empirical Methods in Natural Language Processing.