Hashem Sellat
Shadi Saleh
Mateusz Krubiński
Adam Pospíšil
Petr Zemánek
Pavel Pecina
This is the first release of the UFAL Parallel Corpus of North Levantine, compiled by the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University within the WELCOME project (https://welcome-h2020.eu/). The corpus consists of 120,600 multi-parallel sentences in English, French, German, Greek, Spanish, and Modern Standard Arabic selected from the OpenSubtitles2018 corpus [1] and manually translated into the North Levantine Arabic language. The corpus was created for the purpose of training machine translation systems capable of handling North Levantine Arabic.
In OpenSubtitles2018, we identified 3,661,627 English sentences that were aligned with their translations in all of the following languages: arb, fra, deu, ell, and spa. In the next step, we filtered out noisy sentences; for convenience, the filters were applied to the English side.
After those filtering steps, we ended up with 120,771 sentences. Before the translation, an additional corpus-wide filtering step removed multi-parallel lines in which English characters appear in the Arabic sentence, Arabic characters appear in the English sentence, or Arabic characters appear in a particular sentence for all of the Indo-European languages. The final corpus contains 120,600 lines, which were manually translated into the North Levantine Arabic dialect.
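The additional filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' actual script. It assumes that "Arabic characters" means the basic Arabic Unicode block (U+0600 to U+06FF), that "English characters" means ASCII Latin letters, and that the third condition fires when Arabic characters appear in any of the Indo-European sides. The function name `keep_line` and the dictionary layout are hypothetical.

```python
import re

# Assumption: basic Arabic Unicode block and ASCII Latin letters only.
ARABIC = re.compile(r"[\u0600-\u06FF]")
LATIN = re.compile(r"[A-Za-z]")

# ISO 639-3 codes of the Indo-European languages in the corpus.
INDO_EUROPEAN = ("eng", "fra", "deu", "ell", "spa")


def keep_line(sentences):
    """Return True if a multi-parallel line passes the script-based filter.

    `sentences` maps an ISO 639-3 code to the sentence in that language.
    """
    # English (Latin) characters in the Modern Standard Arabic sentence.
    if LATIN.search(sentences["arb"]):
        return False
    # Arabic characters in the English sentence.
    if ARABIC.search(sentences["eng"]):
        return False
    # Arabic characters in any Indo-European sentence
    # (one possible reading of the third condition).
    if any(ARABIC.search(sentences[code]) for code in INDO_EUROPEAN):
        return False
    return True
```

A line would then be kept only when `keep_line` returns `True` for its dictionary of sentences.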
Corpus-level word counts and language-specific ISO 639-3 codes are reported below.
language | ISO 639-3 code | #words |
---|---|---|
North Levantine Arabic | apc | 738,812 |
Modern Standard Arabic | arb | 802,313 |
German | deu | 940,234 |
Greek | ell | 869,543 |
English | eng | 999,193 |
French | fra | 956,208 |
Spanish | spa | 920,922 |
The translations are provided in seven files, each containing data in one language. We also link the English-centred sentence pairs to the original data in OpenSubtitles2018. This information is stored in the *.ids files, which are aligned by line number with the corresponding translations. Each line contains four tab-separated items: the source filename, the target filename, the space-separated positions of the source sentence in the source file, and the space-separated positions of the target sentence in the target file.
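As an illustration of the *.ids layout described above, the following sketch parses one line into its four fields. It assumes that the positions are integers, which the description does not state explicitly. The names `IdsRecord` and `parse_ids_line`, and the example line, are hypothetical.

```python
from typing import List, NamedTuple


class IdsRecord(NamedTuple):
    source_file: str
    target_file: str
    source_positions: List[int]
    target_positions: List[int]


def parse_ids_line(line: str) -> IdsRecord:
    """Split one tab-separated *.ids line into its four items."""
    src_file, tgt_file, src_pos, tgt_pos = line.rstrip("\n").split("\t")
    return IdsRecord(
        source_file=src_file,
        target_file=tgt_file,
        # Assumption: positions are whitespace-separated integers.
        source_positions=[int(p) for p in src_pos.split()],
        target_positions=[int(p) for p in tgt_pos.split()],
    )


# Hypothetical example line, not taken from the corpus.
record = parse_ids_line("en/a.xml.gz\tar/b.xml.gz\t12 13\t15\n")
print(record.source_positions, record.target_positions)
```

Because the *.ids files are aligned by line number with the translations, the record parsed from line n describes the sentence pair on line n of the corresponding translation files.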
[1] Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. 2018. OpenSubtitles2018: Statistical Rescoring of Sentence Alignments in Large, Noisy Parallel Corpora. Proceedings of the Eleventh International Conference on Language Resources and Evaluation, pages 1742–1748. Miyazaki, Japan. https://opus.nlpl.eu/OpenSubtitles-v2018.php
CC BY-NC-SA 4.0
http://hdl.handle.net/11234/1-5033
The work was supported by the European Commission via the H2020 Program, project WELCOME, grant agreement: 870930.