Example of Translation from English to Czech

This example shows how an English sentence is translated into Czech in the TectoMT system. TectoMT is a hybrid statistical/rule-based translation system organized as a pipeline with three stages: deep dependency analysis, transfer at the deep layer, and generation of the surface form.

Deep analysis

The source English sentence is first tokenized, tagged and parsed. We use the statistical Morce tagger with the Penn Treebank tagset for tagging, and the statistical Maximum Spanning Tree parser for dependency parsing. We also apply several small correction rules. The Stanford Named Entity Recognizer is used to detect named entities.


Source dependency tree (surface/analytical layer), including syntactic functions (blue) and part-of-speech tags (dark green)

 

The dependency parse of the sentence is converted by hand-written rules into the deep structure, where all grammatical words are hidden and only content words remain as nodes. Each node is then given three further kinds of annotation. Formemes, which describe the surface morpho-syntactic function of the node, and grammatemes (various grammatical attributes) are assigned by a rule-based module. Functors (semantic functions) are assigned by a statistical tool (LibLINEAR). However, only lemmas, formemes and grammatemes are currently used in the transfer phase.
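The step of hiding grammatical words can be illustrated with a minimal Python sketch. It is not TectoMT code: the `SurfaceNode` class, the `hide_function_words` function, the short list of function-word tags and the example sentence are all assumptions made for illustration. Each content word keeps its lemma, is re-attached to its nearest content-word ancestor, and remembers the function words that were folded into it.

```python
from dataclasses import dataclass

# Penn Treebank tags treated as grammatical (function) words in this toy example.
# The real rule set is richer; this short list is an assumption.
FUNCTION_TAGS = {"DT", "IN", "TO", "MD"}


@dataclass
class SurfaceNode:
    ord: int      # 1-based word position
    lemma: str
    tag: str      # Penn Treebank part-of-speech tag
    parent: int   # ord of the governing word, 0 for the root


def hide_function_words(nodes):
    """Return {ord: deep node} containing content words only."""
    by_ord = {n.ord: n for n in nodes}

    def is_function(n):
        return n.tag in FUNCTION_TAGS

    def content_ancestor(ord_):
        """Climb from ord_ until a content word or the root (0) is reached."""
        path = []
        while ord_ != 0 and is_function(by_ord[ord_]):
            path.append(ord_)
            ord_ = by_ord[ord_].parent
        return ord_, path

    deep, claimed = {}, set()
    for n in nodes:
        if is_function(n):
            continue
        # Function words between a content word and its content ancestor
        # (e.g. a governing preposition) are folded into the content word.
        parent, skipped = content_ancestor(n.parent)
        claimed.update(skipped)
        deep[n.ord] = {"lemma": n.lemma, "parent": parent,
                       "aux": [by_ord[o].lemma for o in skipped]}
    for n in nodes:
        # Remaining function words (e.g. articles) go to their content ancestor.
        if is_function(n) and n.ord not in claimed:
            owner, _ = content_ancestor(n.parent)
            if owner != 0:
                deep[owner]["aux"].append(n.lemma)
    return deep


# "The cat sat in the house", with the preposition governing its noun.
sentence = [
    SurfaceNode(1, "the", "DT", 2), SurfaceNode(2, "cat", "NN", 3),
    SurfaceNode(3, "sit", "VBD", 0), SurfaceNode(4, "in", "IN", 3),
    SurfaceNode(5, "the", "DT", 6), SurfaceNode(6, "house", "NN", 4),
]
deep_tree = hide_function_words(sentence)
```

In this sketch the six surface words collapse into three deep nodes (cat, sit, house), and the node for "house" records "in" and "the" as hidden auxiliaries. A formeme such as n:in+X could be derived from exactly this information, although the sketch does not attempt that.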


Source deep syntax tree (tectogrammatical layer): lemmas (black lowercase), formemes (violet), functors (black uppercase) and the corresponding surface tokens (green/orange, just for illustration)

 

Transfer

The lemma and formeme of each node are translated using Maximum Entropy models, which produce n-best lists of candidates. The topology of the tree is left unchanged, on the assumption that the deep structures of the two languages are similar in most cases. Grammatemes are translated using simple rules. From the n-best lists, a target-language tree model then selects the best overall combination of lemmas and formemes for the whole tree.
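The text does not say how the best combination is searched for. One common way to combine per-node n-best lists with a model over parent-child pairs is a Viterbi-style dynamic program over the tree, sketched below in Python. Everything in it is an assumption made for illustration: the function names, the edge-based form of the tree model, the candidate translations and all probabilities.

```python
import math


def choose_labels(root, children, candidates, edge_logp):
    """Pick one (lemma, formeme) label per node, maximizing the sum of
    candidate log-probabilities and parent-child tree-model log-probabilities."""
    best = {}  # node -> {label: (best subtree score, {child: chosen label})}

    def solve(node):
        kids = children.get(node, [])
        for kid in kids:
            solve(kid)
        table = {}
        for label, logp in candidates[node]:
            total, picks = logp, {}
            for kid in kids:
                # Best label for this child given the parent's label.
                kid_label, kid_score = max(
                    ((cl, best[kid][cl][0] + edge_logp(label, cl))
                     for cl in best[kid]),
                    key=lambda pair: pair[1],
                )
                total += kid_score
                picks[kid] = kid_label
            table[label] = (total, picks)
        best[node] = table

    solve(root)
    root_label = max(best[root], key=lambda lab: best[root][lab][0])
    chosen, stack = {}, [(root, root_label)]
    while stack:  # follow the stored choices back down the tree
        node, label = stack.pop()
        chosen[node] = label
        stack.extend(best[node][label][1].items())
    return chosen, best[root][root_label][0]


# Toy deep tree for "run (a) program"; all numbers are invented.
children = {"run": ["program"]}
candidates = {
    "run": [(("běžet", "v:fin"), math.log(0.6)),
            (("spustit", "v:fin"), math.log(0.4))],
    "program": [(("program", "n:4"), math.log(0.7)),
                (("pořad", "n:4"), math.log(0.3))],
}
TREE_MODEL = {("spustit", "program"): 0.5, ("běžet", "program"): 0.02,
              ("spustit", "pořad"): 0.05}


def edge_logp(parent_label, child_label):
    # Unseen lemma pairs get a small default probability.
    return math.log(TREE_MODEL.get((parent_label[0], child_label[0]), 0.01))


chosen, score = choose_labels("run", children, candidates, edge_logp)
```

With these invented numbers, the per-node model alone would prefer "běžet" for "run", but the tree model's preference for "spustit" governing "program" makes the combination spustit + program the overall winner. This is the kind of whole-tree decision the paragraph above describes.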


Target tectogrammatical layer: the translation of lemmas and formemes.

 

Generation

The resulting deep representation in the target language is then converted by rule-based modules into a surface dependency tree, which includes all auxiliary words as well as the lemmas and morphological features of all inflected words. A morphological generation module then produces the target surface word forms. Finally, the resulting sentence is normalized.
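The three generation steps can be sketched in a few lines of Python. This is a toy illustration, not the actual modules: the `FORMS` lookup table stands in for the morphological generator (with tense and agreement hard-coded), the formeme strings are simplified, nodes are assumed to arrive in surface order, and normalization is assumed here to mean capitalization and final punctuation, which the text does not specify.

```python
# Toy stand-in for a morphological generator: (lemma, feature) -> word form.
FORMS = {
    ("kočka", "1"): "kočka",      # nominative
    ("sedět", "fin"): "seděla",   # finite verb; tense and gender hard-coded
    ("dům", "6"): "domě",         # locative
}


def parse_formeme(formeme):
    """Split 'n:v+6' into ('v', '6') and 'n:1' into (None, '1')."""
    _, _, rest = formeme.partition(":")
    if "+" in rest:
        prep, feature = rest.split("+", 1)
        return prep, feature
    return None, rest


def generate(deep_nodes):
    """deep_nodes: list of (lemma, formeme) pairs in surface order."""
    words = []
    for lemma, formeme in deep_nodes:
        prep, feature = parse_formeme(formeme)
        if prep:                                       # add the auxiliary word
            words.append(prep)
        words.append(FORMS.get((lemma, feature), lemma))  # inflect the lemma
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."    # assumed normalization


result = generate([("kočka", "n:1"), ("sedět", "v:fin"), ("dům", "n:v+6")])
print(result)
```

Here the formeme n:v+6 contributes both the preposition "v" and the locative case of "dům", so the three deep nodes yield the four-word sentence "Kočka seděla v domě."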


Target surface tree, generated from the tectogrammatical layer.