Year 2010
Type in proceedings
Status published
Language English
Author(s) Bojar, Ondřej; Liška, Adam; Žabokrtský, Zdeněk
Title Evaluating Utility of Data Sources in a Large Parallel Czech-English Corpus CzEng 0.9
Czech title Vyhodnocení přínosu jednotlivých zdrojů dat ve velkém paralelním česko-anglickém korpusu CzEng
Proceedings 2010: Valletta, Malta: LREC 2010: Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010)
Pages range 447-452
Supported by 2010-2012 GPP406/10/P259 (Hybridní frázový a hloubkově-syntaktický strojový překlad); 2009-2012 FP7-ICT-2007-3-231720 (EuroMatrix Plus); 2009-2012 7E09003 (EuroMatrixPlus – Bringing Machine Translation for European Languages to the User); 2005-2009 LC536 (Centrum komputační lingvistiky); 2005-2010 MSM 0021620838 (Moderní metody, struktury a systémy informatiky)
Czech abstract CzEng 0.9 je třetí vydání velkého paralelního korpusu. V tomto vydání byl rozšířen o velké množství textů z různých typů zdrojů. Příspěvek popisuje a vyhodnocuje metody čištění paralelních dat a nabízí tak pohled na přínos jednotlivých typů zdrojů.
English abstract CzEng 0.9 is the third release of a large parallel corpus of Czech and English. For the current release, CzEng was extended by a significant amount of text from various types of sources, including parallel web pages, electronically available books and subtitles. This paper describes and evaluates the filtering techniques employed in the process to avoid misaligned or otherwise damaged parallel sentences in the collection. We estimate the precision and recall of two sets of filters. The first set was used to process the data before their inclusion in CzEng. The filters from the second set were newly created to improve the filtering process for future releases of CzEng. Given the overall amount and variety of data sources, our experiments illustrate the utility of parallel data sources with respect to extractable parallel segments. As similar behaviour can be expected for other language pairs, our results can be interpreted as guidelines indicating which sources other researchers should exploit first.
Specialization linguistics ("jazykověda")
Confidentiality default – not confidential
Open access no
ISBN* 2-9517408-6-7
Address* Valletta, Malta
Month* May
Venue* Mediterranean Conference Centre
Publisher* European Language Resources Association
