Machine Translation Projects at ÚFAL

Machine translation (MT) has a long tradition at Charles University in Prague. In the years 1977-1986, an English-to-Czech translation system called APAČ was designed and implemented by the group led by Zdeněk Kirchner. The APAČ system was based on a dependency grammar, and its core parts were implemented in Q-systems. A similar architecture was used in the Czech-Russian translation system RUSLAN in the late 1980s and in the Czech-English system MATRACE at the beginning of the 1990s.

The current machine translation research at the Institute of Formal and Applied Linguistics (ÚFAL), Charles University, Prague proceeds in two parallel but cooperating streams. One focuses on translation between Czech and a set of close (mainly Slavic) languages, such as Slovak, Polish or Lithuanian, and is based on morphosyntactic analysis (shallow MT). The other stream deals with more distant language pairs such as Czech and English and uses two different approaches: one is based on syntactico-semantic analysis of the input language (dependency-based MT), and the other on statistical translation models built from a parallel corpus (statistical MT).

Shallow Machine Translation

Experience with the MT system RUSLAN made it apparent that a full-fledged syntactic analysis of Czech is unnecessary, and also too unreliable and costly, for MT between closely related languages. The system Česílko therefore uses direct word-for-word translation (after the necessary morphological processing), which is justified by the similarity (though not identity) of syntactic constructions in the two languages.
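The word-for-word idea can be illustrated with a minimal sketch: analyze each Czech word form into a lemma and a morphological tag, translate the lemma, and synthesize the target form with the same tag. This is not Česílko's actual implementation; the tiny lexicons, the tag strings and the function name `translate` are all hypothetical and serve only as an illustration.

```python
# Toy illustration of shallow, word-for-word transfer between close languages.
# All lexicon entries and tag names are hypothetical, not Česílko data.

# Morphological analysis: Czech surface form -> (lemma, tag)
CZECH_ANALYSIS = {
    "velký": ("velký", "ADJ.nom.sg.masc"),
    "pes": ("pes", "NOUN.nom.sg.masc"),
    "spí": ("spát", "VERB.pres.3sg"),
}

# Bilingual lemma dictionary: Czech lemma -> Slovak lemma
LEMMA_DICT = {"velký": "veľký", "pes": "pes", "spát": "spať"}

# Morphological synthesis: (Slovak lemma, tag) -> Slovak surface form
SLOVAK_SYNTHESIS = {
    ("veľký", "ADJ.nom.sg.masc"): "veľký",
    ("pes", "NOUN.nom.sg.masc"): "pes",
    ("spať", "VERB.pres.3sg"): "spí",
}


def translate(sentence):
    """Translate word by word, keeping the source word order."""
    output = []
    for word in sentence.split():
        if word not in CZECH_ANALYSIS:
            output.append(word)  # unknown word: copy it through unchanged
            continue
        lemma, tag = CZECH_ANALYSIS[word]
        target_lemma = LEMMA_DICT.get(lemma, lemma)
        # Fall back to the bare lemma if no inflected form is known.
        output.append(SLOVAK_SYNTHESIS.get((target_lemma, tag), target_lemma))
    return " ".join(output)


print(translate("velký pes spí"))  # prints "veľký pes spí"
```

Because the word order is never changed, the sketch relies entirely on the syntactic similarity of the two languages, which is the assumption the paragraph above describes.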

The system has been tested on texts from the domain of documentation of corporate information systems. It is not limited to any specific domain, however: it has also undergone thorough testing on rather difficult texts from a Czech general encyclopedia and in a cross-lingual treebank annotation transfer project. Its primary task is nevertheless to support the translation and localization of various technical texts. The system currently translates from Czech into three languages: Slovak, Polish and Lithuanian. The Czech-Slovak translation contains large dictionaries and is fully "market ready", while the remaining two language pairs are only experimental.

Statistical Machine Translation

Word-Based Statistical MT

The first experiments in Czech-English statistical machine translation were performed at the NLP Summer Workshop 1999 at CLSP.

A statistical machine translation system consisting of the GIZA++ Toolkit, the ISI ReWrite Decoder and the CMU Statistical Language Modelling Toolkit was customized to translate between Czech and English. Several experiments were performed with different configurations, which take as input either plain text or text normalized by linguistic preprocessing (such as lemmatization). We used 21,600 parallel sentences from the Prague Czech-English Dependency Treebank and 54,091 aligned segments from Reader's Digest for translation model training.
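GIZA++ trains word-based translation models of the IBM family. As a rough illustration of what such training does, the following sketch implements the simplest of them, IBM Model 1, with the EM algorithm on a toy corpus. It is a simplification under stated assumptions: it omits the NULL word and all higher models, the function name `train_ibm_model1` and the three sentence pairs are made up for the example, and it is not the configuration used in the experiments above.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate lexical translation probabilities t[(tgt, src)] = P(tgt | src).

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    Simplified IBM Model 1: no NULL word, uniform initialization.
    """
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    # Start from a uniform distribution over all co-occurring word pairs.
    t = {(w, s): 1.0 / len(tgt_vocab)
         for src, tgt in pairs for s in src for w in tgt}

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (tgt, src) links
        total = defaultdict(float)  # expected count of src being linked
        # E-step: distribute each target word over the source words.
        for src, tgt in pairs:
            for w in tgt:
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / norm
                    count[(w, s)] += frac
                    total[s] += frac
        # M-step: renormalize expected counts per source word.
        t = {(w, s): c / total[s] for (w, s), c in count.items()}
    return t


# Toy Czech-English corpus (hypothetical example data).
corpus = [
    ("dům".split(), "house".split()),
    ("modrý dům".split(), "blue house".split()),
    ("modrý květ".split(), "blue flower".split()),
]
table = train_ibm_model1(corpus)
for (en, cs), p in sorted(table.items(), key=lambda kv: -kv[1])[:3]:
    print(f"P({en} | {cs}) = {p:.3f}")
```

Running the same training on lemmatized rather than plain text reduces the number of distinct Czech word forms, which is one motivation for the linguistic preprocessing mentioned above.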

A package of scripts and instructions for building a statistical machine translation system, called "SMT Quick Run", is available here.

Phrase-Based Statistical MT

Phrase-based Czech↔English MT was studied during the CLSP Summer Workshop 2006. The Moses Decoder was used to experiment with multiple factors, aiming at better morphological coherence of the MT output.

Have a look at our interactive demo of Czech-English Machine Translation.

Dependency-based Machine Translation

The Dependency-based Machine Translation system (DBMT) is based on tectogrammatical dependency trees capturing the underlying structure of the sentence.

The experimental Czech-English DBMT system has the classical analysis-transfer-generation architecture. The automatic process begins with the analysis of the Czech input into a tectogrammatical (underlying) representation. The Czech sentence is automatically tokenized and morphologically tagged, and each word form is assigned a lemma. A statistical dependency parser (either Collins or Charniak) is used to obtain the analytical representation. The analytical structure is then converted into the tectogrammatical representation using linguistic rules.

In transfer, the tectogrammatical base-form attribute of each autosemantic node is replaced by its English equivalent found in the Czech-English probabilistic dictionary, which was trained by GIZA++ on the parallel corpus. A simple rule-based system is then used for reordering constituents and for generating the English surface realization, and an n-gram language model is used for scoring and choosing among translation hypotheses. The results can be evaluated quantitatively with the BLEU score. The following resources were used: the Prague Dependency Treebank, the newly created Prague Czech-English Dependency Treebank, an English monolingual corpus, and translation lexicons.
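The interplay between the probabilistic dictionary and the n-gram language model can be sketched as follows. This is only an illustration under simplifying assumptions: it works on a flat sequence of lemmas rather than on tectogrammatical trees, it skips reordering and surface generation, and the dictionary entries, bigram probabilities and function names are hypothetical.

```python
import itertools
import math

# Hypothetical probabilistic dictionary: Czech lemma -> {English lemma: P(en | cs)}
DICTIONARY = {
    "silný": {"powerful": 0.6, "strong": 0.4},
    "čaj": {"tea": 1.0},
}

# Hypothetical bigram language model P(w2 | w1); unseen bigrams get a small floor.
BIGRAMS = {
    ("<s>", "strong"): 0.2,
    ("<s>", "powerful"): 0.2,
    ("strong", "tea"): 0.3,
    ("powerful", "tea"): 0.001,
    ("tea", "</s>"): 0.5,
}
FLOOR = 1e-6


def lm_logprob(words):
    """Log-probability of a word sequence under the toy bigram model."""
    padded = ["<s>"] + list(words) + ["</s>"]
    return sum(math.log(BIGRAMS.get(bg, FLOOR)) for bg in zip(padded, padded[1:]))


def best_translation(lemmas):
    """Enumerate all lexical choices and keep the best-scoring hypothesis."""
    # Unknown lemmas are copied through with probability 1.
    options = [list(DICTIONARY.get(l, {l: 1.0}).items()) for l in lemmas]
    best, best_score = None, -math.inf
    for combo in itertools.product(*options):
        words = [w for w, _ in combo]
        score = sum(math.log(p) for _, p in combo) + lm_logprob(words)
        if score > best_score:
            best, best_score = words, score
    return best


print(best_translation(["silný", "čaj"]))  # prints ['strong', 'tea']
```

In this toy example the dictionary alone prefers "powerful", but the language model makes "strong tea" the winning hypothesis, which is the kind of disambiguation the scoring step is meant to provide. Exhaustive enumeration is used only for brevity; it grows exponentially with sentence length.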

Prague Czech-English Dependency Treebank

The Prague Czech-English Dependency Treebank (PCEDT, version 1.0) is a corpus of Czech-English parallel resources suitable for experiments in structural machine translation. PCEDT was published by the Linguistic Data Consortium in 2004, catalog number LDC2004T25.

The core part of the PCEDT is a Czech translation of 21,600 English sentences from the Wall Street Journal part of the Penn Treebank 3 corpus (PTB, released by LDC in 1999). The sentences of the Czech translation were automatically morphologically annotated and parsed into two levels (analytical and tectogrammatical). The original English sentences were transformed from the Penn Treebank phrase-structure trees into dependency representations. A heldout (development) and evaluation set of 515 sentence pairs was selected and manually annotated on the tectogrammatical level for both Czech and English; for the purposes of quantitative evaluation, this set has been retranslated from Czech to English by 4 different translation companies.
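Several independent retranslations are useful because metrics such as BLEU can be computed against multiple references. The following is a minimal sketch of corpus-level BLEU with multiple references (clipped n-gram precision up to 4-grams, brevity penalty based on the closest reference length, no smoothing). The function names and the example sentences are assumptions made for illustration; it is not the evaluation script used with PCEDT.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """hypotheses: list of token lists.
    references: for each hypothesis, a list of reference token lists."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, refs in zip(hypotheses, references):
        hyp_len += len(hyp)
        # Closest reference length; ties go to the shorter reference.
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            max_ref = Counter()
            for r in refs:
                max_ref |= ngrams(r, n)      # element-wise maximum over references
            clipped = hyp_counts & max_ref   # element-wise minimum (clipping)
            matched[n - 1] += sum(clipped.values())
            total[n - 1] += sum(hyp_counts.values())
    if min(matched) == 0:
        return 0.0  # unsmoothed BLEU is zero if any n-gram order has no match
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


# Hypothetical example with two references for one sentence.
hyp = ["the economy grew faster than expected".split()]
refs = [[
    "the economy expanded more quickly than expected".split(),
    "the economy grew faster than expected".split(),
]]
print(corpus_bleu(hyp, refs))  # prints 1.0
```

With more references, a correct translation that happens to use different wording from one reference still has a chance of matching another, which makes the score less sensitive to a single translator's choices.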

PCEDT also comprises a parallel Czech-English corpus of plain text from Reader's Digest 1993-1996 consisting of 53,000 parallel sentences, and a large monolingual corpus of Czech (2.4 M sentences). Also included is a probabilistic Czech-English translation dictionary, which consists of 46,150 word-translation pairs of base forms.



Ústav formální a aplikované lingvistiky
Jan Cuřín and Ondřej Bojar, bojar <at> ufal.mff.cuni.cz
2007-05-22