UFAL Medical Corpus v.1.0

UFAL Medical Corpus is a collection of parallel corpora assembled in the course of the KConnect, Khresmoi and HimL projects, which aim at more reliable machine translation of medical texts. The collected corpora are described in HimL project Deliverable D1.1, "Report on Building Translation Systems for Public Health Domain", and in KConnect project Deliverable D1.2, "Toolkit and Report for Translator Adaptation to New Languages". Both deliverables are available from the list of results of the respective project.

UFAL Medical Corpus covers the following languages: Czech, German, Spanish, French, Hungarian, Polish, Romanian and Swedish. Each language is paired with English.

UFAL Medical Corpus v.1.0 also serves as the training data for the WMT17 Biomedical Task. For this purpose, we somewhat restricted the set of sentences for copyright reasons.

Data

We have combined data from various in-domain and out-of-domain sources into a single corpus. Duplicate sentences were excluded and the resulting corpus was shuffled.
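The deduplication and shuffling step can be illustrated with a minimal sketch. This is not the pipeline actually used to build the corpus; it only assumes that each corpus entry is one text line and that duplicates are exact whole-line matches, roughly the effect of 'sort | uniq'. The function name and the seed parameter are hypothetical.

```python
import random


def dedup_and_shuffle(lines, seed=0):
    """Drop exact duplicate lines, then shuffle the remainder.

    Illustration only: the seed is a hypothetical parameter that makes
    the shuffle reproducible; it is not documented for the corpus.
    """
    # set() removes exact duplicates; sorting first gives a stable
    # starting order, so the seeded shuffle is reproducible.
    unique = sorted(set(lines))
    random.Random(seed).shuffle(unique)
    return unique


if __name__ == "__main__":
    sample = ["a\tb", "c\td", "a\tb", "e\tf"]
    print(dedup_and_shuffle(sample))
```

For files as large as the out-of-domain part of the collection, an in-memory set may not be practical, and an external sort is the more realistic choice.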

UFAL Medical Corpus has the following line format:

source_sentence [tab] target_sentence [tab] type_of_data [tab] original_corpus_name

For dictionaries, source_sentence and target_sentence are dictionary entries.

Type_of_data can have the following values: medical_corpus, general_corpus, medical_dictionary, general_dictionary.
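As an illustration of the line format, the following sketch splits one line into its four fields and checks the type_of_data value. The names Entry and parse_line are hypothetical, the sample line is made up, and stripping whitespace around each field is an assumption rather than a documented property of the files.

```python
from typing import NamedTuple

# The four type_of_data values listed in the corpus description.
DATA_TYPES = {
    "medical_corpus",
    "general_corpus",
    "medical_dictionary",
    "general_dictionary",
}


class Entry(NamedTuple):
    source: str
    target: str
    data_type: str
    corpus_name: str


def parse_line(line: str) -> Entry:
    """Parse one tab-separated corpus line into an Entry."""
    # Assumption: whitespace around the tab separators is not significant.
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}")
    entry = Entry(*fields)
    if entry.data_type not in DATA_TYPES:
        raise ValueError(f"unknown type_of_data: {entry.data_type!r}")
    return entry


if __name__ == "__main__":
    # Made-up sample line, not taken from the corpus.
    print(parse_line("sentence one\tsentence two\tmedical_corpus\tECDC\n"))
```

Raising an error on unexpected lines, rather than skipping them silently, makes malformed entries visible early when the data is loaded.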

In-Domain

The following table summarizes medical-domain corpora included in the UFAL Medical Corpus collection:

Corpora cs-en de-en es-en fr-en hu-en pl-en ro-en sv-en
CESTA - - - 3,617 - - - -
ECDC 2,324 2,379 2,357 2,377 2,306 2,202 2,363 2,345
EMEA (OpenSubtitles) 445,365 481,443 487,901 493,933 462,541 459,225 424,904 466,108
EMEA (new crawl) 687,635 615,256 - - - 652,336 621,490 -
Medical Web Crawl - - 148,982 - - - - -
Medical Web Texts from CzEng 1.6 7,029 - - - - - - -
MuchMore - 28,919 - - - - - -
PatTR Medical - 1,830,647 - 2,191,537 - - - -
Subtitles 3,140 77,937 151,675 120,841 - 3,010 116,335 96,575
Total Parallel Segments 1,145,493 3,036,581 790,915 2,812,305 464,847 1,116,773 1,165,092 565,028
Total Parallel Segments (after 'sort | uniq') 819,697 2,662,810 631,087 2,634,229 351,336 800,662 852,800 444,777
Total Words (target language/en) 14M/15M 84M/94M 9M/10M 89M/100M 5M/5M 14M/14M 14M/15M 6M/5M

Out-of-Domain

We also included general domain data in the release. The following table summarizes the general purpose corpora included in the UFAL Medical Corpus collection:

Corpora cs-en de-en es-en fr-en hu-en pl-en ro-en sv-en
Cordis - - - - - 168,067 - -
EUbookshop 428,339 9,011,774 5,103,274 10,225,247 412,618 509,105 310,653 1,877,976
EUROPARL 643,361 1,918,724 1,964,134 2,006,305 621,328 627,367 396,882 1,852,450
Hunglish - - - - 2,083,159 - - -
JRC-Acquis 1,113,649 642,797 720,201 720,747 449,361 1,412,095 428,618 708,759
MultiUN - 153,545 7,734,469 11,840,859 - - - -
News Commentary 146,135 200,534 193,665 182,645 - - - -
OpenSubtitles 44,618,012 12,815,341 75,947,825 49,035,989 44,612,969 34,926,913 59,732,934 17,840,535
PatTR Other - 9,302,172 - 10,957,584 - - - -
Rapid - - - - - 132,156 - -
Total Parallel Segments 46,949,496 34,034,887 91,663,568 84,969,376 48,179,435 37,775,703 60,869,087 22,279,720
Total Parallel Segments (after 'sort | uniq') 38,065,775 31,638,916 75,421,729 74,045,053 39,499,594 31,786,926 47,829,602 19,447,606
Total Words (target language/en) 276M/333M 716M/817M 874M/889M 1,392M/1,490M 340M/262M 288M/229M 402M/377M 221M/195M

Dictionaries cs-en de-en es-en fr-en hu-en pl-en ro-en sv-en
DBpedia 148,181 681,494 544,686 44,977 139,329 549,600 - 297,913
Linguee - 51,571 - - - - - -

Register

To download UFAL Medical Corpus v.1.0, you have to register by filling in the following form. We will send you a unique username to access the files. If you do not hear from us within a week, fill in the form again or contact us directly.

Name:
E-mail:
Institution:
Country:

I certify that I will use UFAL Medical Corpus v.1.0 only for research and non-commercial purposes.

Download

After registration, you will receive a unique username. This username and the shared password "ufalmedi" will be requested at the following link:

Tip for Linux wget tool: Use the flags --user=YOUR-USERNAME --password=ufalmedi to pass the authorization check. Use the flag --continue to continue an interrupted transfer.

WARNING: Due to a processing error, corpus entries extracted from Medical_Web_Texts_from_CzEng1.6 contain an additional column with a score, resulting in the following line format:

source_sentence [tab] target_sentence [tab] score [tab] type_of_data [tab] original_corpus_name

Entries extracted from other corpora should be unaffected.
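One way to cope with the extra column when reading the data is sketched below. It assumes that a line with five tab-separated fields carries the score in the third position, as described in the warning, and that every other valid line has four fields. The function name split_entry is hypothetical, and the score is kept as a string because its format is not documented here.

```python
def split_entry(line):
    """Return (source, target, type_of_data, corpus_name, score).

    score is None for regular four-column lines and a string for
    five-column lines that carry the additional score column.
    """
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    if len(fields) == 5:
        # Affected entries: the score sits between target and type_of_data.
        source, target, score, data_type, corpus_name = fields
    elif len(fields) == 4:
        source, target, data_type, corpus_name = fields
        score = None
    else:
        raise ValueError(f"expected 4 or 5 fields, got {len(fields)}")
    return source, target, data_type, corpus_name, score


if __name__ == "__main__":
    # Made-up sample lines, not taken from the corpus.
    print(split_entry("a\tb\tmedical_corpus\tECDC"))
    print(split_entry("a\tb\t0.97\tmedical_corpus\tMedical_Web_Texts_from_CzEng1.6"))
```

Returning the score as a trailing element keeps the first four positions identical for both line variants, so downstream code can ignore it.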

Acknowledgement

We gratefully acknowledge support from: