UFAL Parallel Corpus of North Levantine 1.0

March 10, 2023

Authors

Hashem Sellat
Shadi Saleh
Mateusz Krubiński
Adam Pospíšil
Petr Zemánek
Pavel Pecina

Overview

This is the first release of the UFAL Parallel Corpus of North Levantine, compiled by the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University within the Welcome project (https://welcome-h2020.eu/). The corpus consists of 120,600 multiparallel sentences in English, French, German, Greek, Spanish, and Standard Arabic selected from the OpenSubtitles2018 corpus [1] and manually translated into North Levantine Arabic. The corpus was created for training machine translation systems between North Levantine Arabic and the other languages.

Data processing

In OpenSubtitles2018, we identified 3,661,627 sentences in English that were aligned with their translations in all of the following languages: arb, fra, deu, ell, spa, and filtered out those that matched any of the following conditions:

  • presence of non-standard characters in the English side, to reduce noise (only the English alphabet, digits, and the following characters were allowed: .!?,:; '$%£€)
  • non-capital first letter in the English side (to avoid incomplete sentences)
  • presence of fewer than two infrequent words (to increase lexical richness)
  • presence of vulgar words in the English side
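
The four conditions above can be sketched as a single predicate over an English sentence. This is only an illustration under stated assumptions: the source does not say how "infrequent" was defined or which vulgar word list was used, so the frequency table `word_freq`, the cut-off `rare_threshold`, the set `vulgar_words`, and the simple tokenizer are all hypothetical stand-ins.

```python
import re

# Characters allowed in the English side, as listed above.
ALLOWED = re.compile(r"^[A-Za-z0-9.!?,:; '$%£€]+$")


def passes_filters(sentence, word_freq, vulgar_words,
                   rare_threshold=1000, min_rare=2):
    """Return True if the English sentence survives all four filters.

    word_freq, vulgar_words and rare_threshold are assumptions made
    for this sketch; the source does not specify them.
    """
    # 1. Only the allowed characters may appear.
    if not ALLOWED.match(sentence):
        return False
    # 2. The first character must be a capital letter.
    if not sentence[0].isupper():
        return False
    # Assumed tokenization: lowercase runs of letters, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    # 3. At least `min_rare` infrequent words are required.
    rare = sum(1 for w in words if word_freq.get(w, 0) < rare_threshold)
    if rare < min_rare:
        return False
    # 4. No vulgar words.
    if any(w in vulgar_words for w in words):
        return False
    return True
```

With such a predicate, a sentence is kept only when every check passes, which matches the "filtered out those that matched any of the following conditions" wording.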

Then, we removed exact and near duplicates (detected in the English side) and sampled a subset of approximately 1 million words in the English side. This resulted in 120,771 multiparallel sentences with an average length of 8.28 words per sentence in the English side.
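
A minimal sketch of the deduplication and sampling step follows. The source does not describe how near duplicates were detected or how the sample was drawn, so the normalization key (lowercasing and dropping punctuation) and the shuffle-then-fill strategy are assumptions chosen for illustration only.

```python
import random
import re


def normalize(sentence):
    # Assumed near-duplicate key: lowercase, replace anything that is
    # not a letter, digit, or space, then collapse whitespace.
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", sentence.lower()).split())


def dedupe_and_sample(sentences, word_budget, seed=0):
    """Drop exact and near duplicates, then sample up to a word budget."""
    seen = set()
    unique = []
    for s in sentences:
        key = normalize(s)
        if key not in seen:  # keep the first occurrence only
            seen.add(key)
            unique.append(s)
    rng = random.Random(seed)  # fixed seed for reproducibility
    rng.shuffle(unique)
    sample, total = [], 0
    for s in unique:
        if total >= word_budget:
            break
        sample.append(s)
        total += len(s.split())  # whitespace-separated word count
    return sample
```

For the corpus described here, the budget would be about 1,000,000 English words; 120,771 sentences at 8.28 words each comes to roughly that figure.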

The sentences in Standard Arabic were then manually translated to North Levantine Arabic by native speakers. A few erroneous translations were automatically detected (e.g. empty or unfinished translations) and discarded. The remaining translations were aligned with the other languages through Standard Arabic and English. The final corpus comprises 120,600 sentences in English, Spanish, Greek, German, French, Standard Arabic, and the newly added North Levantine Arabic. The table below shows some overall statistics. The languages of the data files are denoted by their ISO 639-3 codes.

Language                 ISO 639-3 code   #words
North Levantine Arabic   apc              738,813
Standard Arabic          arb              802,316
German                   deu              940,234
Greek                    ell              869,543
English                  eng              999,193
French                   fra              956,208
Spanish                  spa              920,922

The translations are provided in seven files, each containing data in one language. The files are aligned by line number; the order of lines is random. We also link the English-centred sentence pairs to the original data in OpenSubtitles2018. This information is stored in the *.ids files, which are aligned by line number with the corresponding translations. Each line contains four tab-separated items: the source filename, the target filename, the space-separated positions of the source sentence in the source file, and the space-separated positions of the target sentence in the target file.
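
The layout described above could be read along the following lines. This is a sketch, not part of the release: the names `IdsEntry`, `parse_ids_line`, and `align_by_line` are hypothetical, positions are kept as strings because the source does not state their type, and the functions work on lists of lines so that no particular file names are assumed.

```python
from typing import Dict, Iterator, List, NamedTuple


class IdsEntry(NamedTuple):
    source_file: str
    target_file: str
    source_positions: List[str]
    target_positions: List[str]


def parse_ids_line(line: str) -> IdsEntry:
    """Split one *.ids line into its four tab-separated items."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated items, got {len(fields)}")
    src_file, tgt_file, src_pos, tgt_pos = fields
    # Positions are space-separated within their field.
    return IdsEntry(src_file, tgt_file, src_pos.split(), tgt_pos.split())


def align_by_line(columns: Dict[str, List[str]]) -> Iterator[Dict[str, str]]:
    """Yield one {language: sentence} dict per line number."""
    if len({len(lines) for lines in columns.values()}) > 1:
        raise ValueError("inputs are not aligned: line counts differ")
    langs = list(columns)
    for row in zip(*(columns[lang] for lang in langs)):
        yield dict(zip(langs, (s.rstrip("\n") for s in row)))
```

In practice each list would come from reading one language file with `open(path, encoding="utf-8")`, keyed by its ISO 639-3 code; the actual file names should be taken from the downloaded archive.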

References

[1] Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. 2018. OpenSubtitles2018: Statistical Rescoring of Sentence Alignments in Large, Noisy Parallel Corpora. Proceedings of the Eleventh International Conference on Language Resources and Evaluation, pages 1742–1748. Miyazaki, Japan. https://opus.nlpl.eu/OpenSubtitles-v2018.php

Licence

CC BY-NC-SA 4.0

Download

http://hdl.handle.net/11234/1-5033

Acknowledgement

The work was supported by the European Commission via the H2020 Program, project WELCOME, grant agreement: 870930.