SIS code: NPFL120
1/1 KZ

NPFL120 - Multilingual Natural Language Processing

The course focuses on multilingual aspects of natural language processing. It explains both the issues and the benefits of doing NLP in a multilingual setting, and shows possible approaches to use. We will cover both handling multilingual variety when monolingual methods are applied to multiple languages, and truly multilingual and cross-lingual approaches that use resources in multiple languages at once. We will review and work with a range of freely available multilingual resources, both plaintext and annotated.


Topic overview

  2. Alphabets, encoding, language identification
  3. Tokenization and Word Segmentation
  4. Machine Translation: Alignment and Phrase-Based MT
  5. Cross-lingual POS tagging

Informal prerequisites

We suggest that students first attend the NPFL100 course Variability of languages in time and space / Variabilita jazyků v čase a prostoru, which takes a more theoretical and linguistic look at many phenomena that we will approach more practically and computationally.

Some basic programming skills are expected, e.g. from the NPFL092 course NLP Technology.

The course nicely complements the NPFL070 course Language Data Resources.

Organization of the course

The course has the form of a practical seminar in the computer lab. In each class we will try to combine a lecture with practical hands-on exercises (students are therefore required to have a Unix lab account).

Passing requirements

Homework tasks

There will be homework from most of the classes, typically based on finishing and/or extending the exercises from that class.

To pass the course, you will be required to actively participate in the classes and to submit all of the homework tasks. The quality of your homework solutions will determine your grade.

Grading rules

Currently, the idea is that you get some points for each homework, where a good solution gets 3 points; a weaker solution gets fewer and a stronger solution gets more. Then, if your final average of points per homework is at least 3, you get the grade 1; otherwise you get a lower grade.

Detailed course plan

Introduction; WALS

  • slides
  • WALS online for clicking
  • language.tsv -- WALS dataset for computer processing (free to download in CSV, this file has been converted to TSV for convenience)
  • grepping and cutting in the WALS dataset
  • Homework: a script for measuring language similarity using the WALS dataset
    • Idea: similarity of a pair of languages can be estimated by comparing their WALS features, e.g. by counting the number of WALS features on which they agree (Agić, 2017). The simplest way is to iterate over the features, skipping those that are undefined for either of the two languages, and adding 1 to the score if the values match and 0 if they do not. If you then divide the score by the number of features compared, you get the Hamming similarity.
    • Task 1: input = WALS code of one language, output = WALS code and similarity scores for most similar languages.
    • Task 2: input = genus (e.g. "Slavic"), output = centroid language of that genus, i.e. the language most similar to the other languages of the genus
    • Task 3: find the weirdest language, i.e. most dissimilar to any other language (for whole WALS, or for a given language genus/family)
    • The definition of the task is somewhat vague; feel free to spend as much or as little time on it as you wish
    • Use any programming language, send the script to us by e-mail once you have it. Deadline: 8th March 2018.
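The similarity measure from the homework idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the first column of the TSV is taken to be the WALS code, feature columns are recognized by a header starting with a digit (e.g. "1A ..."), and an empty cell is read as an undefined feature. The function names (load_wals, similarity, most_similar) are invented for this sketch; check the column layout of your copy of language.tsv before relying on it.

```python
import csv
from typing import Dict, List, Optional, Tuple

Features = Dict[str, str]


def load_wals(path: str) -> Dict[str, Features]:
    """Read the TSV into {wals_code: {feature_name: value}}.

    Assumptions (not checked against the real file): column 0 is the WALS
    code, feature headers start with a digit, empty cell = undefined feature.
    """
    languages: Dict[str, Features] = {}
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        feature_columns = [i for i, name in enumerate(header) if name[:1].isdigit()]
        for row in reader:
            if not row:
                continue
            languages[row[0]] = {
                header[i]: row[i]
                for i in feature_columns
                if i < len(row) and row[i] != ""
            }
    return languages


def similarity(a: Features, b: Features) -> Optional[float]:
    """Share of matching values among the features defined for both languages.

    Returns None when the two languages have no defined feature in common.
    """
    shared = [feature for feature in a if feature in b]
    if not shared:
        return None
    matches = sum(1 for feature in shared if a[feature] == b[feature])
    return matches / len(shared)


def most_similar(code: str, languages: Dict[str, Features], n: int = 5) -> List[Tuple[str, float]]:
    """Return up to n (wals_code, score) pairs, best first (Task 1)."""
    scored = []
    for other, features in languages.items():
        if other == code:
            continue
        score = similarity(languages[code], features)
        if score is not None:
            scored.append((other, score))
    # Sort by descending score; ties are broken by WALS code for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:n]
```

Languages that share only a handful of defined features can get deceptively high or low scores, so for Tasks 2 and 3 it may be worth requiring a minimum number of shared features before trusting a score.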

Alphabets, encoding, language identification

Tokenization and Word Segmentation

  • slides
  • One tokenizer you may often encounter is the Moses tokenizer:
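    mkdir -p mosestok/tokenizer/; cd mosestok/tokenizer/
    # put tokenizer.perl from the Moses distribution (scripts/tokenizer/) into this directory
    chmod u+x tokenizer.perl; cd ..; mkdir -p share/nonbreaking_prefixes/; cd share/nonbreaking_prefixes/
    # put the nonbreaking_prefix.* files for the languages you need into this directory
    cd ../../..
    mosestok/tokenizer/tokenizer.perl -h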
  • A quite powerful tokenizer is part of UDPipe -- download UDPipe 1.2.0, download UD 2.0 models, see the UDPipe manual
  • Try tokenizing the sentences from the slides with the Moses tokenizer and with the UDPipe tokenizer -- see Running UDPipe tokenizer
    • hint: udpipe --tokenize path/to/model < input.txt
    • Playing with quotes: “English” — ‘English’ — „česky“ — ‚česky‘ — « français » — ‹ français › — „magyar” — »magyar« — ’magyar’ -- and ``tex quotes'' --- 'cause it's a mess, you know... But don’t don‘t don’t don’t don't talk 'bout that too much or students' heads'll explode!
    • Varied Chinese punctuation: 「你看過《三國演義》嗎?」他問我。“你看過‘三國演義’嗎?”他問我.
    • Vietnamese: Tất cả đường bêtông nội đồng thành quả
    • Japanese: 経堂の美容室に行ってきました。
    • Spanish: «¡María, te amo!», exclamó Juan. “María, I love you!” Juan exclaimed. ¿Vámonos al mar? Escríbeme a Soy de 'Kladno'... Tiene que bañarse.
  • Download the cleaned UDHR dataset and try tokenizing some of the texts with UDPipe
    • cmn = Chinese (Mandarin), yue = Cantonese, jpn = Japanese, vie = Vietnamese...
  • Universal Dependencies -- download UD 2.1
  • Some languages have more than one treebank. Does the tokenizer work similarly well on each of them? (I.e. are the treebanks tokenized similarly?) See Measuring Model Accuracy in the UDPipe manual
    • hint: add the --accuracy switch and use the treebank test file (xyz-ud-test.conllu) as input
  • Homework 2A: Some languages are new in UD2.1 and there is no trained UDPipe tokenizer for them yet. How would you tokenize Cantonese, Buryat, or Upper Sorbian?
    • Try to find a reasonable UDPipe tokenization model for these three languages (e.g. for tokenizing Cantonese, maybe using the Chinese model makes sense?).
    • You may try to reuse what you did in HW1 ;-)
    • Report which tokenizer you chose for each of the languages, how you did that, and what accuracy it achieves (again, evaluate the tokenizer on the test data).
    • Deadline: 15th March 2018
  • Homework 2B: train a UDPipe tokenizer for one of the languages new in UD2.1 -- see Training UDPipe Tokenizer
    • hint: udpipe --train --tagger=none --parser=none output_model.udpipe < xyz-ud-train.conllu
    • should be easy: Afrikaans, Northern Sami, Serbian -- so probably just choose one of those three
    • annotation not optimal: Marathi, Telugu
    • no training data: Buryat, Cantonese, Upper Sorbian
    • report the command you used for training (on the train data) and the accuracy you got (on the test data)
    • sanity check: also run the tokenizer on some plaintext data for the language (probably from UDHR) and check that it actually does perform some reasonable-looking tokenization
    • Deadline: 15th March 2018
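To get a feeling for what a tokenization score measures when you compare a tokenizer against treebank test data, here is a small Python sketch of a span-based F1. It is an illustration only and not the computation UDPipe performs with --accuracy; the function names are invented here. Tokens are mapped to character offsets in the text with all whitespace removed, so tokens containing spaces (as in some treebanks) are handled too.

```python
from typing import List, Set, Tuple


def _no_space(token: str) -> str:
    """Remove all whitespace from a token."""
    return "".join(token.split())


def token_spans(tokens: List[str]) -> Set[Tuple[int, int]]:
    """Map tokens to (start, end) offsets in the whitespace-free text."""
    spans = set()
    position = 0
    for token in tokens:
        length = len(_no_space(token))
        if length == 0:
            continue  # skip empty tokens
        spans.add((position, position + length))
        position += length
    return spans


def tokenization_f1(gold: List[str], predicted: List[str]) -> float:
    """F1 over token spans; both tokenizations must cover the same text."""
    if "".join(map(_no_space, gold)) != "".join(map(_no_space, predicted)):
        raise ValueError("gold and predicted tokens do not cover the same text")
    gold_spans = token_spans(gold)
    predicted_spans = token_spans(predicted)
    correct = len(gold_spans & predicted_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # One gold token is split in two by the (hypothetical) tokenizer output.
    print(tokenization_f1(["don't", "talk"], ["don", "'t", "talk"]))
```

A token counts as correct only if both of its boundaries match the gold tokenization, so a single wrong split costs one gold token in recall and two predicted tokens in precision.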

Machine Translation: Alignment and Phrase-Based MT

Cross-lingual POS tagging

  • devise a cross-lingual POS tagger for one under-resourced language
    • start here, finish as homework
    • use one or more source languages
    • you can get 2 points for the HW if you use just 1 source language
    • you can get 4 points if you use multiple sources (at least 3)
  • target language to use
    • suggested target language: Kazakh (kk) / Telugu (te) / Lithuanian (lt)
      • there are some very small training data in UD, so let's pretend there are none and just use the test data
      • there are some reasonable parallel data, both OpenSubtitles and Watchtower (so you can choose one of them or use both)
      • there is at least one reasonable source language for each of these
    • or: use a truly low-resource language for which there is no training data in UD (or only test data)
      • ideally use a language that you know at least a bit so that you can at least approximately evaluate how good the results are (or a language which has some UD test data)
      • please make sure there are some reasonable parallel data available
      • e.g. Uyghur (ug) could work well -- there is the Tanzil parallel corpus for it, and there are test data in UD but no train data
  • approach to use -- choose one
    1. machine translation of training data
      1. take parallel data [Watchtower/OpenSubtitles/?]
      2. train an MT system [Moses, ideally word-based monotone]
      3. translate the source-language training data into the target-language [UD 2.1]
        • extract the 2nd column, translate, put back
        • results will be better if you put each sentence on one line for the translation
        • if the data contains weird tokens (where the first column is not an integer but a range e.g. 5-7 or a decimal e.g. 5.1), you'd better remove these weird lines
        • if you don't use monotone word-based translation, you will have a hard time putting the translation back into the CoNLL-U file (if you really want to allow reordering or even full phrase-based MT, you can then ask Moses to output the source-target alignment by using the switch -alignment-output-file cs-sk.align)
      4. train a tagger on the translated data [UDPipe]
      5. run the tagger on some target language data and evaluate -- if you use multiple sources, there can be multiple POS suggestions for one word, so use voting
    2. POS projection over (multi)parallel data
      1. take parallel data [Watchtower/OpenSubtitles/?]
      2. POS tag source side [trained UDPipe UD 2.0 models/or train your own]
      3. align [Giza/FastAlign]
      4. project POS tags through the alignment from the tagged source to the non-tagged target
        • NOUN if unaligned?
        • if unaligned but elsewhere in the data aligned, use that POS tag?
        • if you use multiple sources, there can be multiple POS suggestions for one word, so use voting
        • if you don't use intersection symmetrization, there can be multiple POS suggestions even with a single source
      5. train tagger [UDPipe]
      6. run the tagger on some target language data and evaluate
  • tools to use
    • Moses -- see the previous lab; when running train-model.perl, I suggest using the options -max-phrase-length 1 -alignment intersect and omitting the -reordering [whatever] switch, and then running Moses with the -dl 0 switch -- this should enforce word-by-word translation without reordering (=monotone), which will make the translations easier for you to work with
    • if you just want the alignment, you can also try FastAlign
      • installation:
        git clone;
        cd fast_align;
        mkdir build;
        cd build;
        cmake ..;
        make
      • usage:
        paste cs sk | sed 's/\t/ ||| /' > cs-sk;
        fast_align -d -o -v -i cs-sk > cs-sk.f;
        fast_align -r -d -o -v -i cs-sk > cs-sk.r;
        atools -i cs-sk.f -j cs-sk.r -c intersect > cs-sk.i
    • if you don't like Moses but want to do translation, you can also simply translate each source word to the target word most frequently aligned to it (but the translation quality will be lower as this is a single-best translation without a language model)
    • UDPipe -- we already saw it two weeks ago (where we used it for tokenization but it can also do POS tagging)
      • tag a tokenized text: udpipe --tag --input=horizontal path/to/model < input.txt > output.conllu
      • train a tagger: udpipe --train --tokenizer=none --parser=none --tagger='use_xpos=0;use_features=0' output.model < input.conllu
    • HunAlign sentence aligner if you use parallel data that are not sentence-aligned
      • both WTC and Opus data are already sentence aligned
      • I have not written up the instructions, but I can share with you the scripts I use to run hunalign in case you ever need it: install_hunalign.sh, hun_align.sh
  • Parallel data sources: OPUS, Watchtower (do not share)
    • some data in Opus are weird; OpenSubtitles and Tanzil are nice
    • WTC data contain empty sentences, so you have to clean them up, e.g.:
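      paste sk.s cs.s | grep -P '.\t.' > sk-cs.nonempty;   # the output file names here are placeholders, pick your own
      cut -f1 sk-cs.nonempty > sk.nonempty;
      cut -f2 sk-cs.nonempty > cs.nonempty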

      but if you use multiple sources and projection over the multi-parallel data, you have to be more careful, so that you do not lose the information which sentence is which (if you do MT then this does not matter)
    • some Opus data are multiparallel, but I don't know how to easily get the multiparallel sentence alignment; so if you use multiple sources, I suggest you use WTC
  • as the solution to the homework, turn in:
    • the POS tagging accuracy you got -- if there are test data in UD, measure the accuracy on the test data, if there are not, just look at the first 100 words and compute the accuracy on that
    • the trained UDPipe tagger model
    • your source codes
    • notes on what procedure you used; this can be a text description and/or the sequence of commands that you ran, at least approximately (it is a good idea to organize the whole process into a Bash script or a Makefile so that you can then e.g. easily run it again or run it for a different language etc. -- this is not required, this is just a good and useful practice)
  • note: this is already research, so if you get some good results, this may be publishable at a conference or in a journal -- or at least at SloNLP for sure
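For approach 1 (machine translation of training data), the step "extract the 2nd column, translate, put back" can be sketched in Python as below. This is a minimal illustration, not a prescribed solution: the function names are invented, the translation itself is left out, and the sketch assumes a monotone word-by-word translation that returns exactly one output token per input token.

```python
from typing import List


def clean_conllu(lines: List[str]) -> List[str]:
    """Keep comments, blank lines and word lines with a plain integer ID.

    Lines whose ID is a range (5-7) or a decimal (5.1) are dropped.
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if line == "" or line.startswith("#") or line.split("\t", 1)[0].isdigit():
            kept.append(line)
    return kept


def extract_forms(lines: List[str]) -> List[List[str]]:
    """Return the word forms (2nd column) of each sentence, one list per sentence."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        if line == "":
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):
            current.append(line.split("\t")[1])
    if current:
        sentences.append(current)
    return sentences


def replace_forms(lines: List[str], translated: List[List[str]]) -> List[str]:
    """Put translated forms back into the 2nd column, keeping all other columns."""
    original = extract_forms(lines)
    if [len(s) for s in original] != [len(s) for s in translated]:
        raise ValueError("translation must keep one output token per input token")
    new_forms = iter(form for sentence in translated for form in sentence)
    output = []
    for line in lines:
        if line == "" or line.startswith("#"):
            output.append(line)
        else:
            columns = line.split("\t")
            columns[1] = next(new_forms)
            output.append("\t".join(columns))
    return output
```

For the translation step you would write each sentence from extract_forms on one line (tokens joined by spaces), translate the file, split the output on spaces and pass it to replace_forms. Note that the lemma column still holds source-language lemmas afterwards, so you may prefer not to let the tagger use it.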
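For approach 2 (POS projection over parallel data), the projection and voting steps can be sketched in Python as follows. The sketch assumes alignments in the "i-j" format (0-based source index, then target index, pairs separated by spaces), which is what FastAlign is expected to print for a "source ||| target" input. The function names, the tie-breaking rule and the optional lexicon lookup are choices made for this illustration only.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def parse_alignment(line: str) -> List[Tuple[int, int]]:
    """Parse 'i-j' pairs: i = source token index, j = target token index."""
    pairs = []
    for pair in line.split():
        source_index, target_index = pair.split("-")
        pairs.append((int(source_index), int(target_index)))
    return pairs


def project_tags(source_tags: List[str],
                 alignment: List[Tuple[int, int]],
                 target_length: int) -> List[List[str]]:
    """For each target token, collect the tags of the source tokens aligned to it."""
    votes: List[List[str]] = [[] for _ in range(target_length)]
    for i, j in alignment:
        if 0 <= i < len(source_tags) and 0 <= j < target_length:
            votes[j].append(source_tags[i])
    return votes


def choose_tags(votes_per_source: List[List[List[str]]],
                words: Optional[List[str]] = None,
                lexicon: Optional[Dict[str, str]] = None,
                fallback: str = "NOUN") -> List[str]:
    """Majority vote over all sources for one target sentence.

    Unaligned tokens get the tag stored for the word in `lexicon` (e.g. its
    most frequent projected tag elsewhere in the data), otherwise `fallback`.
    Ties are broken alphabetically, an arbitrary but deterministic choice.
    """
    tags = []
    for j in range(len(votes_per_source[0])):
        counts = Counter(tag for votes in votes_per_source for tag in votes[j])
        if counts:
            tags.append(min(counts, key=lambda tag: (-counts[tag], tag)))
        elif lexicon is not None and words is not None and words[j] in lexicon:
            tags.append(lexicon[words[j]])
        else:
            tags.append(fallback)
    return tags
```

With several source languages, call project_tags once per source for the same target sentence and pass the resulting lists together to choose_tags. With a single source and a non-intersection symmetrization, the same vote resolves multiple suggestions for one target word.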

Tentative schedule

Plain text

  • alphabets, encoding, working with unicode, transcription and transliteration, language identification
  • tokenization, word segmentation
  • parallel data, word alignment, machine translation


  • variability in part-of-speech tagsets (including morphological features), UPOS, Universal Features, Interset, tagset conversion
  • cross-lingual POS tag projection
  • influence of tagging on parsing


  • cross-lingual parsing via direct model transfer, delexicalized parser transfer, lexicalized parser transfer
  • harmonization of annotation (POS, features, dependency relations), Universal Dependencies
  • source language(s) selection, adjusting the source data for the target language (annotation, word order)
  • cross-lingual parsing via projection through parallel data
  • cross-lingual parsing via treebank translation, translation for close languages, string similarity measures

Other (even more tentative)

  • multilingual machine translation
  • multilingual word embeddings
  • on the edge of monolinguality: historical texts, non-standard domains, dialects, speech transcriptions; spelling normalization, morphology normalization
  • valency, semantic roles