
Institute of Formal and Applied Linguistics

at Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic



Publication


Year 2012
Type in proceedings
Status published
Language English
Author(s) Larasati, Septina Dian
Title IDENTIC Corpus: Morphologically Enriched Indonesian-English Parallel Corpus
Czech title IDENTIC Corpus: Indonésko-anglický paralelní korpus obohacený o morfologii
Proceedings 2012: İstanbul, Turkey: LREC 2012: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)
Pages range 902-906
Supported by 2009-2013 FP7-238405 (CLARA (Common Language Resources and their Applications)); 2005-2009 LC536 (Centrum komputační lingvistiky); 2010-2015 LM2010013 (LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat)
English abstract This paper describes the creation process of an Indonesian-English parallel corpus (IDENTIC). The corpus contains 45,000 sentences collected from different sources in different genres. Several manual text preprocessing tasks, such as alignment and spelling correction, are applied to the corpus to ensure its quality. We also apply language-specific text processing, such as tokenization on both sides and clitic normalization on the Indonesian side. The corpus is available in two different formats: ‘plain’, stored in text format, and ‘morphologically enriched’, stored in CoNLL format. Some parts of the corpus are publicly available at the IDENTIC homepage.
Specialization linguistics ("jazykověda")
Confidentiality default – not confidential
Open access no
ISBN* 978-2-9517408-7-7
Address* İstanbul, Turkey
Month* May
Venue* Lütfi Kırdar Convention & Exhibition Centre
Publisher* European Language Resources Association