March 10, 2023
Hashem Sellat
Shadi Saleh
Mateusz Krubiński
Adam Pospíšil
Petr Zemánek
Pavel Pecina
This is the first release of the UFAL Parallel Corpus of North Levantine, compiled by the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University within the Welcome project (https://welcome-h2020.eu/). The corpus consists of 120,600 multiparallel sentences in English, French, German, Greek, Spanish, and Standard Arabic, selected from the OpenSubtitles2018 corpus [1] and manually translated into North Levantine Arabic. The corpus was created for training machine translation systems between North Levantine Arabic and the other languages.
In OpenSubtitles2018, we identified 3,661,627 sentences in English that were aligned with their translations in all of the following languages: arb, fra, deu, ell, spa, and filtered out those that matched any of the following conditions:
Then, we removed exact and near duplicates (detected on the English side) and sampled a subset of approximately 1 million words on the English side. This resulted in 120,771 multiparallel sentences with an average length of 8.28 words per sentence on the English side.
The sentences in Standard Arabic were then manually translated to North Levantine Arabic by native speakers. A few erroneous translations were automatically detected (e.g. empty or unfinished translations) and discarded. The remaining translations were aligned with the other languages through Standard Arabic and English. The final corpus comprises 120,600 sentences in English, Spanish, Greek, German, French, Standard Arabic, and the newly added North Levantine Arabic. The table below shows some overall statistics. The languages of the data files are denoted by their ISO 639-3 codes.
language | ISO 639-3 code | #words |
---|---|---|
North Levantine Arabic | apc | 738,813 |
Standard Arabic | arb | 802,316 |
German | deu | 940,234 |
Greek | ell | 869,543 |
English | eng | 999,193 |
French | fra | 956,208 |
Spanish | spa | 920,922 |
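As a rough illustration of how the table relates to the sentence count, the sketch below derives the average number of words per sentence for each language. It assumes that the word counts in the table refer to the final 120,600-sentence corpus; the names `SENTENCES`, `WORDS`, and `average_length` are introduced here for illustration only.

```python
# Sentence count of the final corpus, as stated in the description.
SENTENCES = 120_600

# Word counts copied from the table, keyed by ISO 639-3 code.
WORDS = {
    "apc": 738_813,
    "arb": 802_316,
    "deu": 940_234,
    "ell": 869_543,
    "eng": 999_193,
    "fra": 956_208,
    "spa": 920_922,
}


def average_length(words: int, sentences: int) -> float:
    """Return the mean number of words per sentence."""
    return words / sentences


if __name__ == "__main__":
    for code, count in WORDS.items():
        print(f"{code}\t{average_length(count, SENTENCES):.2f}")
```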
The translations are provided in seven files, one per language. The files are aligned by line number; the order of lines is random. We also provide links from the English-centred sentence pairs to the original data in OpenSubtitles2018. This information is stored in the *.ids files, which are aligned by line number with the corresponding translations. Each line contains four tab-separated items: the source filename, the target filename, the space-separated positions of the source sentence in the source file, and the space-separated positions of the target sentence in the target file.
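The file layout described above can be read with a few lines of Python. This is a minimal sketch, not an official loader: the names `IdsRecord`, `parse_ids_line`, and `iter_aligned` are hypothetical, and sentence positions are kept as strings because their exact type is not specified here.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import zip_longest
from typing import Iterable, Iterator


@dataclass(frozen=True)
class IdsRecord:
    """One line of an *.ids file (field names are illustrative)."""
    source_file: str
    target_file: str
    source_positions: tuple[str, ...]
    target_positions: tuple[str, ...]


def parse_ids_line(line: str) -> IdsRecord:
    """Split one *.ids line into its four tab-separated items."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}")
    src_file, tgt_file, src_pos, tgt_pos = fields
    # Positions are space-separated; they are left as strings (assumption).
    return IdsRecord(src_file, tgt_file,
                     tuple(src_pos.split()), tuple(tgt_pos.split()))


def iter_aligned(*streams: Iterable[str]) -> Iterator[tuple[str, ...]]:
    """Yield one tuple per line number across several line-aligned inputs."""
    missing = object()
    for row in zip_longest(*streams, fillvalue=missing):
        if any(item is missing for item in row):
            raise ValueError("inputs have different numbers of lines")
        yield tuple(item.rstrip("\n") for item in row)
```

Passing open file handles for, say, the North Levantine file, the English file, and the matching *.ids file (under whatever names they have in the release) to `iter_aligned` yields one tuple per sentence; the last element can then be handed to `parse_ids_line`.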
[1] Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. 2018. OpenSubtitles2018: Statistical Rescoring of Sentence Alignments in Large, Noisy Parallel Corpora. Proceedings of the Eleventh International Conference on Language Resources and Evaluation, pages 1742–1748. Miyazaki, Japan. https://opus.nlpl.eu/OpenSubtitles-v2018.php
CC BY-NC-SA 4.0
http://hdl.handle.net/11234/1-5033
The work was supported by the European Commission via the H2020 Program, project WELCOME, grant agreement: 870930.