Institute of Formal and Applied Linguistics

at Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic

Year 2012
Type in proceedings
Status published
Language English
Author(s) Tamchyna, Aleš; Galuščáková, Petra; Kamran, Amir; Stanojević, Miloš; Bojar, Ondřej
Title Selecting Data for English-to-Czech Machine Translation
Czech title Výběr dat pro anglicko-český strojový překlad
Proceedings 2012: Montréal, Canada: WMT 2012 (NAACL): Proceedings of the Seventh Workshop on Statistical Machine Translation
Pages range 374-381
URL http://www.aclweb.org/anthology/W12-3148
Supported by 2009-2012 FP7-ICT-2007-3-231720 (EuroMatrix Plus); 2009-2012 7E09003 (EuroMatrixPlus – Bringing Machine Translation for European Languages to the User); 2011-2012 7E11051 (EuroMatrixPlus - Enlarged European Union Bringing Machine Translation for European Languages to the User); 2011-2013 GAP406/11/1499 (Čeština ve věku strojového překladu); 2010-2012 GPP406/10/P259 (Hybridní frázový a hloubkově-syntaktický strojový překlad); 2012-2016 PRVOUK P46 (Informatika)
Czech abstract (English translation) We study the influence of various data selection methods on English-to-Czech machine translation. We evaluate the quality of the new parallel corpus CzEng 1.0, describe a simple method for improving the coverage of the dictionary extracted from parallel data, and examine several methods of filtering parallel data for better translation. The paper also serves as the description of our system CU-TAMCH-BOJ in the WMT12 competition.
English abstract We provide a few insights on data selection for machine translation. We evaluate the quality of the new CzEng 1.0, a parallel data source used in WMT12. We describe a simple technique for reducing the out-of-vocabulary rate after phrase extraction. We discuss the benefits of tuning towards multiple reference translations for the English-Czech language pair. We introduce a novel approach to data selection by full-text indexing and search: we select sentences similar to the test set from a large monolingual corpus and explore several options for incorporating them in a machine translation system. We show that this method can improve translation quality. Finally, we describe our submitted system CU-TAMCH-BOJ.
Specialization linguistics ("jazykověda")
Confidentiality default – not confidential
Open access no
ISBN* 978-1-937284-20-6
Address* Montréal, Canada
Month* June
Publisher* Association for Computational Linguistics
Creator: Common Account
Created: 6/20/12 4:36 PM
Modifier: Almighty Admin
Modified: 9/6/13 4:51 PM

Published version of the paper (public): 2012-wmt-selecting-data.pdf