UFAL Parallel Corpus of North Levantine

ArabicNLP 2023 Paper

Authors

Hashem Sellat
Shadi Saleh
Mateusz Krubiński
Adam Pospíšil
Petr Zemánek
Pavel Pecina

Overview

This is the first release of the UFAL Parallel Corpus of North Levantine, compiled by the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University within the WELCOME project (https://welcome-h2020.eu/). The corpus consists of 120,600 multi-parallel sentences in English, French, German, Greek, Spanish, and Modern Standard Arabic selected from the OpenSubtitles2018 corpus [1] and manually translated into the North Levantine Arabic language. The corpus was created for the purpose of training machine translation systems capable of handling North Levantine Arabic.

Data processing

In OpenSubtitles2018, we identified 3,661,627 English sentences that were aligned with their translations in all of the following languages: arb, fra, deu, ell, and spa. In the next step, we filtered out noisy sentences (for convenience, the filters were applied to the English side):

  • sentences containing vulgar words (based on a hand-crafted list) were removed
  • sentences containing non-standard characters were removed – only punctuation marks, English alphabet letters and digits were allowed
  • to avoid incomplete sentences, only sentences that start with a capital letter were kept
  • very similar sentences were discarded by lowercasing the text, removing punctuation and digits, and removing the duplicates. The goal was to avoid translating near-identical sentences such as "Good morning" and "Good morning!", or "I was born in 1961" and "I was born in 1983"
  • to ensure the inner variance and semantic richness of the translated text, sentences with fewer than two words, sentences containing very rare words, and sentences with a high proportion of frequent words (frequency-based approach with a manual filtering step) were removed.
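The rule-based part of these filters can be sketched as follows. This is a minimal illustration, not the authors' script: the function names are hypothetical, "punctuation marks" is assumed to mean ASCII punctuation, and the vulgar-word list and the frequency-based filters are left out because their word lists and thresholds are not given here.

```python
import string

# Assumption: allowed characters are ASCII letters, digits, ASCII punctuation and spaces.
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + " ")
# Translation table that deletes punctuation and digits for the near-duplicate key.
STRIP = str.maketrans("", "", string.punctuation + string.digits)


def passes_basic_filters(sentence):
    """Character, capitalisation and minimum-length checks on one English sentence."""
    if not sentence or any(ch not in ALLOWED for ch in sentence):
        return False
    if not sentence[0].isupper():
        return False
    return len(sentence.split()) >= 2


def dedup_key(sentence):
    """Lowercase, drop punctuation and digits, and collapse whitespace."""
    return " ".join(sentence.lower().translate(STRIP).split())


def filter_sentences(sentences):
    """Keep the first sentence seen for each near-duplicate key."""
    seen = set()
    kept = []
    for sentence in sentences:
        if not passes_basic_filters(sentence):
            continue
        key = dedup_key(sentence)
        if key in seen:
            continue
        seen.add(key)
        kept.append(sentence)
    return kept
```

With this sketch, "Good morning" and "Good morning!" share the key "good morning", so only the first of them is kept.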

After those filtering steps, we ended up with 120,771 sentences. Before the translation, an additional corpus-wise filtering step was applied, removing multi-parallel lines where English characters appear in the Arabic sentence, Arabic characters appear in the English sentence, or Arabic characters appear in a particular sentence for all of the Indo-European languages. The final size of the corpus is 120,600 lines, which were manually translated into the North Levantine Arabic dialect.
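A script check of this kind could look like the sketch below. It is only an assumption about how such a filter might be written: the names are hypothetical, "Arabic characters" is approximated by the basic Arabic Unicode block (U+0600 to U+06FF), "English characters" by ASCII letters, and the third condition is read as "Arabic characters appear in the sentence of any of the other languages".

```python
import re

# Assumption: the basic Arabic block stands in for "Arabic characters".
ARABIC = re.compile(r"[\u0600-\u06FF]")
# Assumption: ASCII letters stand in for "English characters".
LATIN = re.compile(r"[A-Za-z]")


def keep_line(eng, arb, others):
    """Return True if a multi-parallel line passes the script checks.

    eng    -- the English sentence
    arb    -- the Modern Standard Arabic sentence
    others -- sentences in the remaining languages (fra, deu, ell, spa)
    """
    if LATIN.search(arb):
        return False  # English characters in the Arabic sentence
    if ARABIC.search(eng):
        return False  # Arabic characters in the English sentence
    if any(ARABIC.search(s) for s in others):
        return False  # Arabic characters in another language's sentence
    return True
```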

Some corpus-wise word-level statistics and language-specific ISO 639-3 codes are reported below.

 

language                ISO 639-3 code  #words
North Levantine Arabic  apc             738,812
Modern Standard Arabic  arb             802,313
German                  deu             940,234
Greek                   ell             869,543
English                 eng             999,193
French                  fra             956,208
Spanish                 spa             920,922

 

The translations are provided in seven files, one per language. We also link the English-centred sentence pairs to the original data in OpenSubtitles2018. This information is stored in the *.ids files, which are aligned line by line with the corresponding translations. Each line contains four tab-separated items: the source filename, the target filename, the space-separated positions of the source sentence in the source file, and the space-separated positions of the target sentence in the target file.
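As an illustration of the format described above, one line of an *.ids file could be parsed as in the following sketch. The record type and function name are hypothetical, and positions are kept as strings because their exact form is not specified here.

```python
from typing import List, NamedTuple


class IdsRecord(NamedTuple):
    source_file: str
    target_file: str
    source_positions: List[str]
    target_positions: List[str]


def parse_ids_line(line):
    """Split one tab-separated *.ids line into its four items."""
    # Strip only the line ending so that an empty last field is preserved.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 4:
        raise ValueError("expected 4 tab-separated items, got %d" % len(fields))
    source_file, target_file, src_pos, tgt_pos = fields
    return IdsRecord(source_file, target_file, src_pos.split(), tgt_pos.split())
```

For example, a made-up line such as "en/a.xml.gz<TAB>fr/b.xml.gz<TAB>12 13<TAB>15" would yield the source positions ["12", "13"] and the target positions ["15"]. Because the *.ids files are aligned with the translations by line number, the n-th record describes the n-th sentence in the corresponding translation files.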

References

[1] Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. 2018. OpenSubtitles2018: Statistical Rescoring of Sentence Alignments in Large, Noisy Parallel Corpora. Proceedings of the Eleventh International Conference on Language Resources and Evaluation, pages 1742–1748. Miyazaki, Japan. https://opus.nlpl.eu/OpenSubtitles-v2018.php

Licence

CC BY-NC-SA 4.0

Download

http://hdl.handle.net/11234/1-5033

Acknowledgement

The work was supported by the European Commission via the H2020 Program, project WELCOME, grant agreement: 870930.