Work at CUNI (Univerzita Karlova v Praze, partner No. 3)

6 month report (due August 1, 2010)

Translation engines and translation API (WP2)

  • We have improved the translation quality of our TectoMT system by adding a new translation model based on a maximum entropy classifier. The improvement reached 0.8 BLEU points. [WMT paper]
  • The TectoMT system, which translates texts from English to Czech, was adapted to run on our internal server as a web service. It can also communicate with sites through our conventional REST API.
  • We now plan to do the same for our second system, a factored phrase-based MT system that translates in both directions (English to Czech and Czech to English).

Automatic linguistic annotation of "noisy" MT outputs (WP4)

  • We made our first experiments with parsing "noisy" output from machine translation systems. We created a "noisy" treebank in the following way: we took the Prague Dependency Treebank (a richly linguistically annotated treebank of Czech) and made its sentences "noisy" by translating them to English and back to Czech using our phrase-based MT system. These "noisy" sentences were then aligned with the original sentences, and the dependency relations were transferred to the "noisy" sentences through these alignments. Dependency parsers trained on this "noisy" treebank may be better suited to machine translation output, because they should put less weight on those features of the text that are often wrong.
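
The transfer of dependency relations through word alignments can be illustrated with a minimal sketch. It is not the procedure actually used at CUNI: the function name `project_heads`, the data layout (1-based token indices, 0 for the root), and the handling of unaligned words are all assumptions made for illustration. Here an unaligned "noisy" token receives no head, and when the original head has no aligned counterpart, the nearest aligned ancestor is used instead.

```python
from typing import Dict, List, Optional

ROOT = 0  # index of the artificial root node (assumed convention)


def project_heads(
    orig_heads: List[int],
    alignment: Dict[int, int],
    noisy_len: int,
) -> List[Optional[int]]:
    """Project dependency heads from an original sentence onto its noisy version.

    orig_heads[i - 1] is the head of original token i (1-based; 0 = root).
    alignment maps a noisy token index to the original token index it is
    aligned with (both 1-based); unaligned noisy tokens are simply absent.
    Element j - 1 of the result is the projected head of noisy token j,
    or None when that token has no alignment.
    """
    # Invert the alignment: original index -> first noisy index aligned to it.
    orig_to_noisy: Dict[int, int] = {}
    for noisy_idx in sorted(alignment):
        orig_to_noisy.setdefault(alignment[noisy_idx], noisy_idx)

    projected: List[Optional[int]] = []
    for noisy_idx in range(1, noisy_len + 1):
        orig_idx = alignment.get(noisy_idx)
        if orig_idx is None:
            projected.append(None)  # inserted noise: nothing to project
            continue
        # Climb the original tree until an ancestor with a noisy
        # counterpart is found, or the root is reached.
        head = orig_heads[orig_idx - 1]
        while head != ROOT and head not in orig_to_noisy:
            head = orig_heads[head - 1]
        projected.append(ROOT if head == ROOT else orig_to_noisy[head])
    return projected


# Original sentence: 3 tokens, token 2 is the root.
# Noisy sentence: 4 tokens, token 3 has no counterpart in the original.
heads = project_heads([2, 0, 2], {1: 1, 2: 2, 4: 3}, noisy_len=4)
print(heads)  # [2, 0, None, 2]
```

A real pipeline would also have to decide what to do with the unaligned tokens (for example, attach them to a neighbour or drop the sentence), which the report does not specify.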

Automatic evaluation metrics for MT (WP4, WP6)

  • We continue working on our MT metric SemPOS, which is suitable for morphologically rich target languages such as Czech. [ACL paper]

Test data translation (WP6)

  • Test data have been randomly selected for all languages and both directions; translation guidelines have been agreed upon, including the translation support software (Olifant) and the working and interchange format (TMX).
  • For the English to Czech direction, the cleanup and translation have been performed (by three independent translators) and the datasets are ready; they have also been extensively checked for formal errors, and all problems have been corrected. A three-way comparison of both the "CLEAN" output and the "TRANS" (translated, Czech) output has been done for the three English-to-Czech datasets.
  • For the Czech to English direction, one translator has been hired, and the cleaning and translation work has started; the second parallel translation will start immediately after a second translator is hired.
  • Work has started on testing a repository (DSpace), in which all Faust data will be stored, initially for internal use only, later also for public distribution. DSpace is a widely used repository which implements many standards for data identification (persistent ID assignment), description (Dublin Core metadata, OAI protocol for metadata sharing), and sharing (licensed distribution), with the possibility of single-source user authentication through Shibboleth services and APIs. DSpace is considered (as one of the standard repositories) in EU resource-sharing projects, both infrastructural (Clarin) and in the Language Technology area (METANET). Work has started on proper metadata description for the already completed or partially completed Faust test datasets.
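
As a rough illustration of what a Dublin Core description of a test dataset might look like, the sketch below builds a small `oai_dc` record with Python's standard library. The two namespace URIs are the standard Dublin Core and OAI ones; the helper name `build_dc_record` and every field value are hypothetical placeholders, not the metadata actually assigned to the Faust datasets.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"


def build_dc_record(fields: Iterable[Tuple[str, str]]) -> str:
    """Serialize (element name, value) pairs as a simple oai_dc record."""
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("oai_dc", OAI_DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name, value in fields:
        # Dublin Core elements are repeatable, so duplicates are allowed.
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values for illustration only.
record = build_dc_record([
    ("title", "Example test dataset, English to Czech"),
    ("language", "en"),
    ("language", "cs"),
    ("format", "TMX"),
    ("type", "Dataset"),
])
print(record)
```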

Content: David Marecek.
2010 © Institute of Formal and Applied Linguistics. All Rights Reserved.
