NPFL120 – Multilingual Natural Language Processing

The course focuses on multilingual aspects of natural language processing. It explains both the issues and the benefits of doing NLP in a multilingual setting, and shows possible approaches to use. We will target both dealing with multilingual variety in monolingual methods applied to multiple languages, as well as truly multilingual and crosslingual approaches which use resources in multiple languages at once. We will review and work with a range of freely available multilingual resources, both plaintext and annotated.

About

SIS code: NPFL120
Semester: summer
E-credits: 3
Examination: 1/1 KZ
Guarantors: Daniel Zeman, Rudolf Rosa, Ondřej Bojar
Taught in: English, unless all students present understand Czech.

Timespace Coordinates

  • In the summer semester of 2018, the course takes place every Friday at 14:00 in SU1.

Informal prerequisites

We suggest that students first attend the NPFL100 course Variability of languages in time and space / Variabilita jazyků v čase a prostoru, which takes a more theoretical and linguistic view of many phenomena that we will approach more practically and computationally.

Some basic programming skills are expected, e.g. from the NPFL092 course NLP Technology.

The course is nicely complemented by the NPFL070 course Language Data Resources.

Organization of the course

The course takes the form of a practical seminar in the computer lab. In each class, we will try to combine a lecture with practical hands-on exercises (students are therefore required to have a Unix lab account).

Lectures

1. Introduction; WALS Slides wals

2. Alphabets, encoding, language identification Slides

3. Tokenization and Word Segmentation Slides tokenization

4. Machine Translation: Alignment and Phrase-Based MT (Ondřej Bojar) Slides

5. Cross-lingual POS tagging pos_tagging

6. Interset Slides

7. Cross-lingual POS tagging; POS harmonization Slides pos_harmonization

8. Delexicalized parsing Slides delex_parsing

9. Delexicalized parsing klcpos3

10. Tree projection Slides tree_projection

11. Treebank translation Slides tree_translation

12. Syntax harmonization and Enhanced Universal Dependencies Slides enhancing_ud

13. Multilingual Machine Translation (Ondřej Bojar) Slides


Requirements

Homework tasks

There will be homework for most of the classes, typically based on finishing and/or extending the exercises from that class.

To pass the course, you will be required to actively participate in the classes and to submit all of the homework tasks. The quality of your homework solutions will determine your grade.

Grading rules

Currently, the idea is that you get some points for each homework: a good solution gets 3 points, a weaker solution gets fewer, and a stronger solution gets more. If your final average of points per homework is at least 3, you get the grade 1; otherwise you get a lower grade.

1. Introduction; WALS

 Feb 23 Slides wals

2. Alphabets, encoding, language identification

 Mar 02 Slides

3. Tokenization and Word Segmentation

 Mar 09 Slides tokenization

4. Machine Translation: Alignment and Phrase-Based MT (Ondřej Bojar)

 Mar 16 Slides

5. Cross-lingual POS tagging

 Mar 23 pos_tagging

6. Interset

 Apr 06 Slides

7. Cross-lingual POS tagging; POS harmonization

 Apr 13 Slides pos_harmonization

8. Delexicalized parsing

 Apr 20 Slides delex_parsing

9. Delexicalized parsing

 Apr 27 klcpos3

10. Tree projection

 May 04 Slides tree_projection

11. Treebank translation

 May 11 Slides tree_translation

12. Syntax harmonization and Enhanced Universal Dependencies

 May 18 Slides enhancing_ud

13. Multilingual Machine Translation (Ondřej Bojar)

 May 25 Slides

wals

 Deadline: Mar 8  3 points

  • WALS online for clicking
  • language.tsv -- the WALS dataset for computer processing (free to download in CSV; this file has been converted to TSV for convenience)
  • using grep and cut on the WALS dataset
  • Homework: a script for measuring language similarity using the WALS dataset
    • Idea: the similarity of a pair of languages can be estimated by comparing their WALS features, e.g. by counting the number of WALS features in which they agree (Agić, 2017). The simplest way is to iterate over the features, ignoring those that are undefined for one of the two languages, and adding 1 to the score if the values match and 0 if they do not. If you then divide this by the number of features, you get the Hamming similarity. (A minimal sketch of this computation is shown after this list.)
    • Task 1: input = WALS code of one language, output = WALS code and similarity scores for most similar languages.
    • Task 2: input = genus (e.g. "Slavic"), output = centroid language of that genus, i.e. a language most similar to other languages of the genus
    • Task 3: find the weirdest language, i.e. most dissimilar to any other language (for whole WALS, or for a given language genus/family)
    • The definition of the task is somewhat vague; feel free to spend as much or as little time on it as you wish
    • Use any programming language, send the script to us by e-mail once you have it. Deadline: 8th March 2018.
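
A minimal sketch of the similarity computation (any language is fine; Python is used here). It assumes that language.tsv is tab-separated with the WALS code in the first column and the feature values starting at some fixed column -- the FEATURE_START index below is a placeholder you need to adjust to the actual file layout:

  #!/usr/bin/env python3
  # Sketch for the WALS homework: Hamming-style similarity of two languages.
  # Assumption: language.tsv is tab-separated, the WALS code is in the first
  # column and the feature values start at column FEATURE_START (a placeholder
  # index -- adjust it to the real layout of the file).
  import csv
  import sys

  FEATURE_START = 7   # hypothetical index of the first feature column

  def load_languages(path):
      languages = {}
      with open(path, encoding="utf-8") as f:
          reader = csv.reader(f, delimiter="\t")
          next(reader)                       # skip the header row
          for row in reader:
              languages[row[0]] = row[FEATURE_START:]
      return languages

  def similarity(feats_a, feats_b):
      """Fraction of features defined for both languages that have equal values."""
      both = [(a, b) for a, b in zip(feats_a, feats_b) if a and b]
      if not both:
          return 0.0
      return sum(a == b for a, b in both) / len(both)

  if __name__ == "__main__":
      languages = load_languages("language.tsv")
      target = sys.argv[1]                   # WALS code of the query language, e.g. "cze"
      scores = sorted(((similarity(languages[target], feats), code)
                       for code, feats in languages.items() if code != target),
                      reverse=True)
      for score, code in scores[:10]:        # Task 1: the most similar languages
          print(f"{code}\t{score:.3f}")

Tasks 2 and 3 can reuse the same similarity function, just aggregating it over all language pairs within a genus or over the whole dataset.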

tokenization

 Deadline: Mar 15  5 points

  • One tokenizer you may often encounter is the Moses tokenizer:
    mkdir -p mosestok/tokenizer/; cd mosestok/tokenizer/
    wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/tokenizer.perl
    chmod u+x tokenizer.perl; cd ..; mkdir -p share/nonbreaking_prefixes/; cd share/nonbreaking_prefixes/
    wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
    cd ../../..
    mosestok/tokenizer/tokenizer.perl -h
  • A quite powerful tokenizer is part of UDPipe -- download UDPipe 1.2.0, download the UD 2.0 models, see the UDPipe manual
  • Try tokenizing the sentences from the slides with Moses tokenizer and with UDPipe tokenizer -- see Running UDPipe tokenizer
    • hint: udpipe --tokenize path/to/model < input.txt
    • Playing with quotes: “English” — ‘English’ — „česky“ — ‚česky‘ — « français » — ‹ français › — „magyar” — »magyar« — ’magyar’ -- and ``tex quotes'' --- 'cause it's a mess, you know... But don’t don‘t don’t don’t don't talk 'bout that too much or students' heads'll explode!
    • Varied Chinese punctuation: 「你看過《三國演義》嗎?」他問我。“你看過‘三國演義’嗎?”他問我.
    • Vietnamese: Tất cả đường bêtông nội đồng thành quả
    • Japanese: 経堂の美容室に行ってきました。
    • Spanish: «¡María, te amo!», exclamó Juan. “María, I love you!” Juan exclaimed. ¿Vámonos al mar? Escríbeme a rur@nikde.eu. Soy de 'Kladno'... Tiene que bañarse.
  • Download the cleaned UDHR dataset and try tokenizing some of the texts with UDPipe
    • cmn = Chinese (Mandarin), yue = Cantonese, jpn = Japanese, vie = Vietnamese...
  • Universal Dependencies -- download UD 2.1
  • Some languages have more than one treebank. Does the tokenizer work similarly well on each of them? (I.e. are the treebanks tokenized similarly?) See Measuring Model Accuracy in UDPipe manual
    • hint: add the --accuracy switch and use the treebank test file (xyz-ud-test.conllu) as input
  • Homework 2A: Some languages are new in UD2.1 and there is no trained UDPipe tokenizer for them yet. How would you tokenize Cantonese, Buryat, or Upper Sorbian?
    • Try to find a reasonable UDPipe tokenization model for these three languages (e.g. for tokenizing Cantonese, maybe using the Chinese model makes sense?).
    • You may try to reuse what you did in HW1 ;-)
    • Report which tokenizer you chose for each of the languages, how you chose it, and what accuracy it achieves (again, evaluate the tokenizer on the test data); a small evaluation-loop sketch is shown after this list
    • Deadline: 15th March 2018
  • Homework 2B: train a UDPipe tokenizer for one of the languages new in UD2.1 -- see Training UDPipe Tokenizer
    • hint: udpipe --train --tagger=none --parser=none output_model.udpipe < xyz-ud-train.conllu
    • should be easy: Afrikaans, Northern Sami, Serbian -- so probably just choose one of those three
    • annotation not optimal: Marathi, Telugu
    • no training data: Buryat, Cantonese, Upper Sorbian
    • report the command you used for training (on the train data) and the accuracy you got (on the test data)
    • sanity check: also run the tokenizer on some plaintext data for the language (probably from UDHR) and check that it actually does perform some reasonable-looking tokenization
    • Deadline: 15th March 2018
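
For Homework 2A, one simple way to compare candidate models is to loop over them and run the accuracy evaluation described above for each one. A small sketch, assuming udpipe is on your PATH and using placeholder model and test-file names that you need to replace:

  #!/usr/bin/env python3
  # Sketch for Homework 2A: run the udpipe accuracy evaluation for several
  # candidate tokenization models on one target-language test file and print
  # whatever report udpipe produces. Model and test-file names are placeholders.
  import subprocess

  TEST_FILE = "yue-ud-test.conllu"            # hypothetical target test file
  CANDIDATE_MODELS = [                        # hypothetical model file names
      "chinese-ud-2.0.udpipe",
      "japanese-ud-2.0.udpipe",
      "vietnamese-ud-2.0.udpipe",
  ]

  for model in CANDIDATE_MODELS:
      with open(TEST_FILE) as test:
          # same usage as described above: udpipe --tokenize --accuracy <model> < test.conllu
          result = subprocess.run(["udpipe", "--tokenize", "--accuracy", model],
                                  stdin=test, capture_output=True, text=True)
      print(f"=== {model} ===")
      print(result.stdout.strip())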

pos_tagging

 Deadline: Apr 05  3 points

  • devise a cross-lingual POS tagger for one under-resourced language
    • start here, finish as homework
    • use one or more source languages
    • you can get 2 points for the HW if you use just 1 source language
    • you can get 4 points if you use multiple sources (at least 3)
  • target language to use
    • suggested target language: Kazakh (kk) / Telugu (te) / Lithuanian (lt)
      • there are some very small training data in UD, so let's pretend there are none and just use the test data
      • there are some reasonable parallel data, both OpenSubtitles and Watchtower (so you can choose one of them or use both)
      • there is at least one reasonable source language for each of these
    • or: use a truly low-resource language for which there is no training data in UD (or only test data)
      • ideally use a language that you know at least a bit so that you can at least approximately evaluate how good the results are (or a language which has some UD test data)
      • please make sure there are some reasonable parallel data available
      • e.g. Uyghur (ug) could work well -- there is the Tanzil parallel corpus for it, and there are test data in UD but no train data
  • approach to use -- choose one
    1. machine translation of training data
      1. take parallel data [Watchtower/OpenSubtitles/?]
      2. train an MT system [Moses, ideally word-based monotone]
      3. translate the source-language training data into the target language [UD 2.1]
        • extract the 2nd column, translate, put back
        • results will be better if you put each sentence on one line for the translation
        • if the data contains weird tokens (where the first column is not an integer but a range e.g. 5-7 or a decimal e.g. 5.1), you'd better remove these weird lines
        • if you don't use monotone word-based translation, you will have a hard time putting the translation back into the CoNLL-U file (if you really want to allow reordering or even full phrase-based MT, you can then ask Moses to output the source-target alignment by using the switch -alignment-output-file cs-sk.align)
      4. train a tagger on the translated data [UDPipe]
      5. run the tagger on some target language data and evaluate -- if you use multiple sources, there can be multiple POS suggestions for one word, so use voting
    2. POS projection over (multi)parallel data
      1. take parallel data [Watchtower/OpenSubtitles/?]
      2. POS tag source side [trained UDPipe UD 2.0 models/or train your own]
      3. align [Giza/FastAlign]
      4. project POS tags through the alignment from the tagged source to the non-tagged target
        • NOUN if unaligned?
        • if unaligned but elsewhere in the data aligned, use that POS tag?
        • if you use multiple sources, there can be multiple POS suggestions for one word, so use voting (see the projection-and-voting sketch at the end of this section)
        • if you don't use intersection symmetrization, there can be multiple POS suggestions even with a single source
      5. train tagger [UDPipe]
      6. run the tagger on some target language data and evaluate
  • tools to use
    • Moses -- see the previous lab; when running train-model.perl, I suggest using the options -max-phrase-length 1 -alignment intersect and not using the -reordering [whatever] switch, and then running Moses with the -dl 0 switch -- this should enforce word-by-word translation without reordering (= monotone), which will make it easier for you to work with the translations
    • if you just want the alignment, you can also try FastAlign
      • installation:
        git clone https://github.com/clab/fast_align.git;
        cd fast_align;
        mkdir build;
        cd build;
        cmake ..;
        make;
      • usage:
        paste cs sk | sed 's/\t/ ||| /' > cs-sk;
        fast_align -d -o -v -i cs-sk > cs-sk.f;
        fast_align -r -d -o -v -i cs-sk > cs-sk.r;
        atools -i cs-sk.f -j cs-sk.r -c intersect > cs-sk.i
        # the fast_align and atools binaries are built in the build/ directory created above
    • if you don't like Moses but want to do translation, you can also simply translate each source word to the target word most frequently aligned to it (but the translation quality will be lower as this is a single-best translation without a language model)
    • UDPipe -- we already saw it two weeks ago (where we used it for tokenization but it can also do POS tagging)
      • tag a tokenized text: udpipe --tag --input=horizontal path/to/model < input.txt > output.conllu
      • train a tagger: udpipe --train --tokenizer=none --parser=none --tagger='use_xpos=0;use_features=0' output.model < input.conllu (the model file is given as an argument, and the tagger options have to be quoted because of the semicolon)
    • HunAlign sentence aligner if you use parallel data that are not sentence-aligned
      • both WTC and Opus data are already sentence aligned
      • I have not written up the instructions, but I can share with you the scripts I use to run hunalign in case you ever need it: install_hunalign.sh, hun_align.sh
  • Parallel data sources: OPUS, Watchtower (do not share); data from (Agić+, 2016)
    • some data in Opus are weird; OpenSubtitles and Tanzil are nice
    • WTC data contain empty sentences, so you have to clean them up, e.g.:
      paste sk.s cs.s | grep -P '.\t.' > wtc.sk-cs;
      cut -f1 wtc.sk-cs > wtc.sk-cs.sk;
      cut -f2 wtc.sk-cs > wtc.sk-cs.cs

      but if you use multiple sources and projection over the multi-parallel data, you have to be more careful, so that you do not lose the information which sentence is which (if you do MT then this does not matter)
    • some Opus data are multiparallel, but I don't know how to easily get the multiparallel sentence alignment; so if you use multiple sources, I suggest you use WTC
  • as the solution to the homework, turn in:
    • the POS tagging accuracy you got -- if there are test data in UD, measure the accuracy on the test data, if there are not, just look at the first 100 words and compute the accuracy on that
    • the trained UDPipe tagger model
    • your source codes
    • notes on the procedure you used; this can be a text description and/or the sequence of commands you ran, at least approximately (it is a good idea to organize the whole process into a Bash script or a Makefile so that you can, e.g., easily run it again or run it for a different language -- this is not required, just a good and useful practice)
    • Deadline: 5th April
  • note: this is already research, so if you get some good results, this may be publishable at a conference or in a journal -- or at least at SloNLP for sure
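
A rough sketch of the projection-and-voting step of approach 2 (referenced above). The data formats here are simplifying assumptions: each source language contributes one POS tag sequence per sentence plus a FastAlign-style alignment ("src-tgt" index pairs, 0-based); voting takes the most frequent suggestion per target token and falls back to NOUN for unaligned tokens:

  #!/usr/bin/env python3
  # Sketch of POS projection with voting (approach 2). The data formats are
  # simplifying assumptions: one POS tag sequence per source sentence and a
  # FastAlign-style alignment ("src-tgt" index pairs, 0-based).
  from collections import Counter

  def read_alignment(line):
      """Parse one FastAlign line like '0-0 1-2 2-1' into (src, tgt) index pairs."""
      return [tuple(map(int, pair.split("-"))) for pair in line.split()]

  def project_sentence(source_tags, target_len, alignment_pairs):
      """Collect POS suggestions for each target token from one source language."""
      suggestions = [[] for _ in range(target_len)]
      for src, tgt in alignment_pairs:
          if src < len(source_tags) and tgt < target_len:
              suggestions[tgt].append(source_tags[src])
      return suggestions

  def vote(suggestion_lists, fallback="NOUN"):
      """Most frequent suggestion per token; unaligned tokens get the fallback tag."""
      return [Counter(s).most_common(1)[0][0] if s else fallback
              for s in suggestion_lists]

  if __name__ == "__main__":
      # Toy example: two source languages aligned to one 3-token target sentence.
      target_tokens = ["t1", "t2", "t3"]
      sources = [
          (["PRON", "NOUN", "VERB"], read_alignment("0-0 1-1 2-2")),
          (["NOUN", "VERB"],         read_alignment("0-1 1-2")),
      ]
      merged = [[] for _ in target_tokens]
      for tags, pairs in sources:
          for j, sugg in enumerate(project_sentence(tags, len(target_tokens), pairs)):
              merged[j].extend(sugg)
      print(list(zip(target_tokens, vote(merged))))

To train UDPipe on the projected data, write the voted tags into the UPOS column of a CoNLL-U file (one token per line, sentences separated by blank lines).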

pos_harmonization

 Deadline: Apr 26  3 points

  • Tagset harmonization exercise: You get a syntactic parser trained on the UD tagset (UPOS and Universal Features), and data tagged with a different tagset. Try to convert the tagset into the UD tagset to get better results when applying the parser to the data.
    • The data in the CoNLL-U format and the trained UDPipe models can be found at http://ufallab.ms.mff.cuni.cz/~rosa/npfl120/pos_harm/.
    • Running the parser
      • To run the parser and get results in the CoNLL-U format, use e.g.: cat ta-ud-dev-orig.conllu | ./udpipe --parse ta.sup.parser.udpipe
      • To view the tree structures in the CoNLL-U data, you can use e.g. view_conll or Udapi.
      • To evaluate the parsing accuracy, use e.g.: cat ta-ud-dev-orig.conllu | ./udpipe --parse --accuracy ta.sup.parser.udpipe
    • The tagset documentation (in practice it is often quite hard to get proper documentation for a tagset, but we decided to be nice to you):
    • Try to achieve some reasonable parsing accuracy – I guess at least 50% should be achievable rather easily.
    • Your task is to try to do the harmonization yourself, not using any pre-existing tools for that.
    • Homework:
      • Harmonize the tagset for one of the languages.
      • Turn in the code that you used.
      • Report the parsing accuracy before and after your harmonization (both UAS and LAS); please measure the accuracy repeatedly during development and report which changes to your solution brought which improvements in parsing accuracy.
      • The minimum is to identify some of the main POS categories, such as verbs, nouns, adjectives, and adverbs, so that you get a reasonable parsing accuracy. For doing that, you can get 2 points for the homework. You can get more points if you further improve your solution; some suggestions are listed below.
      • You can try to identify more POS categories; ideally you should map all of the original POS tags to some UPOS tags (a minimal mapping sketch is shown after this section).
      • (You can try to produce some of the Universal Features (documentation) – but this will most probably not work well, as UDPipe uses the features as one atomic string.)
      • You can try to cover all of the languages, at least in a basic way.
      • You can figure out how to use Interset (see previous lecture), use it to harmonize the tagset, and compare the parsing accuracy achieved when using your solution and when using Interset (but you still need to create at least a simple solution of your own).
      • You can use a different language and/or tagset than those listed here. Some of the UD treebanks contain the original tags in the XPOS field, so you can use those. Or you can use other data that are not part of UD -- in that case, the evaluation may not be as straightforward, so just do and report what you manage to do... To train a parsing model on the UD data, use e.g.:
        cat cs-ud-train.conllu | ./udpipe --train --tokenizer=none --tagger=none cs.parser.udpipe
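
A minimal mapping sketch (referenced above): rewrite the UPOS column of the CoNLL-U file using a hand-written table from the original tags to UPOS. The table below is a toy placeholder, and it assumes the original tag sits in the XPOS (5th) column of the -orig files -- check where it actually is and fill the table in from the tagset documentation:

  #!/usr/bin/env python3
  # Sketch for the harmonization homework: rewrite the UPOS column of a CoNLL-U
  # file using a hand-written mapping from the original tags. The mapping below is
  # a toy placeholder, and the original tag is assumed to be in the XPOS (5th)
  # column -- check the -orig files and fill the table in from the documentation.
  import sys

  ORIG_TO_UPOS = {        # hypothetical original tags -> UPOS
      "NN": "NOUN",
      "VB": "VERB",
      "JJ": "ADJ",
      "RB": "ADV",
  }

  def harmonize(orig_tag):
      # a common first approximation: also try just the first characters of the tag
      return ORIG_TO_UPOS.get(orig_tag, ORIG_TO_UPOS.get(orig_tag[:2], "X"))

  for line in sys.stdin:
      line = line.rstrip("\n")
      if line and not line.startswith("#"):
          cols = line.split("\t")
          if cols[0].isdigit():                  # skip multiword ranges and empty nodes
              cols[3] = harmonize(cols[4])       # UPOS (4th col) from the original tag (5th col)
          line = "\t".join(cols)
      print(line)

You can run it e.g. as python3 harmonize.py < ta-ud-dev-orig.conllu | ./udpipe --parse --accuracy ta.sup.parser.udpipe to see how the parsing accuracy changes.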

delex_parsing

 Deadline: May 04  3 points

  • applying lexicalized versus delexicalized parsers in a monolingual and cross-lingual setting
    • lexicalized ("sup") and delexicalized ("delex") UDPipe 1.2 models trained on UD 2.1 treebanks: models
    • language groups for experimenting:
      • Norwegian (no), Danish (da), Swedish (sv)
      • Czech (cs), Slovak (sk)
      • Spanish (es), Portuguese (pt)
    • training a delexicalized UDPipe parser (without morpho features):
      cat cs-ud-train.conllu | ./udpipe --train --parser='embedding_form=0;embedding_feats=0;' --tokenizer=none --tagger=none cs.delex.parser.udpipe
    • combining multiple parsers via the MST algorithm:
      • parse a sentence with multiple parsers -- you get multiple parse trees, i.e. 3 sets of dependency edges if you used 3 parsers
      • assign weights to the edges (e.g. 1 if the edge appeared in one parser output, 2 if in 2, etc.; or incorporate language similarity into the weights as well, i.e. edges from less similar languages get a lower weight) -- see the edge-list sketch after this section
      • give the list of edges and their weights to an MST algorithm, which outputs the best tree that can be constructed from the edges
      • you can use my Perl wrapper of the Perl Graph::ChuLiuEdmonds library (or look at my code and use the library directly from your Perl code; unfortunately I am unaware of any good implementation of a directed MST algorithm in Python)
        • my wrapper reads standard input (one sentence per line) and writes to standard output (one sentence per line)
        • the input format is parent child weight parent child weight... where parent and child are some IDs of the parent and child nodes of the edge and the weight is a weight you assign to the edge, so e.g.:
          0 2 1.5 2 1 0.5 2 3 0.5 3 1 1.2
  • Homework:
    • Extend your cross-lingual POS tagging homework to cross-lingual parsing
    • At this point, it is probably sufficient to use only one source language (in parsing, combining multiple source languages is considerably more complicated than in tagging)
    • So: train a delexicalized parser on a source language treebank, and apply it to your cross-lingually-POS-tagged target-language data
    • Report the parsing accuracies you obtain (LAS and UAS) if possible, or at least some rough estimates if you have no data to use for automatic evaluation
    • If you have a lexically close source language, you may also try the source-lexicalized parsing, i.e. training a standard lexicalized parser (but still without morpho features) and applying it to the target language; but this will only work if there is a substantial amount of shared vocabulary between the source and the target language
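
A sketch of building the weighted edge list for the MST combination described above: each parser's output for a sentence is reduced to a set of (head, child) edges, identical edges get their weights summed (optionally scaled by language similarity), and the result is printed in the one-sentence-per-line "parent child weight ..." format that the wrapper expects. The toy parses and parser weights are placeholders:

  #!/usr/bin/env python3
  # Sketch: combine several parses of one sentence into a weighted edge list in the
  # "parent child weight parent child weight ..." format used by the Perl wrapper.
  # The toy parses and parser weights below are placeholders.
  from collections import defaultdict

  def edges_from_conllu_sentence(lines):
      """Extract (head, child) edges from the lines of one CoNLL-U sentence."""
      edges = []
      for line in lines:
          cols = line.split("\t")
          if cols and cols[0].isdigit():
              edges.append((int(cols[6]), int(cols[0])))    # (HEAD, ID)
      return edges

  def combine(parses, parser_weights):
      """Sum the weights of identical edges proposed by different parsers."""
      edge_weights = defaultdict(float)
      for edges, weight in zip(parses, parser_weights):
          for edge in edges:
              edge_weights[edge] += weight
      return edge_weights

  if __name__ == "__main__":
      # Toy example: three delexicalized parsers; the third comes from a less
      # similar language, so its edges get a lower weight.
      parses = [
          [(2, 1), (0, 2), (2, 3)],
          [(2, 1), (0, 2), (1, 3)],
          [(3, 1), (0, 2), (2, 3)],
      ]
      edge_weights = combine(parses, parser_weights=[1.0, 1.0, 0.5])
      # one sentence per line: "parent child weight parent child weight ..."
      print(" ".join(f"{p} {c} {w}" for (p, c), w in sorted(edge_weights.items())))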

klcpos3

 Deadline: May 18  3 points

  • Voluntary homework: implement KLcpos3 and evaluate it for a few languages -- for a given target treebank, compute KLcpos3 for a few source treebanks and report what you got; you can also train and evaluate cross-lingual delexicalized parsing for these languages and report its LAS, observing how much it correlates with KLcpos3 (you can get some extra points for doing that). A sketch of the KLcpos3 computation is shown below.
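
A sketch of the KLcpos3 computation, reading it as the KL divergence between the target and source relative frequencies of UPOS trigrams; the smoothing of trigrams unseen in the source is a simplification here (the smoothing used in the original paper may differ):

  #!/usr/bin/env python3
  # Sketch for the KLcpos3 homework:
  # KLcpos3(tgt, src) = sum over UPOS trigrams t of f_tgt(t) * log(f_tgt(t) / f_src(t)),
  # where f are relative frequencies; trigrams unseen in the source get a small
  # smoothing value (the exact smoothing is a simplifying assumption here).
  import math
  import sys
  from collections import Counter

  def upos_trigrams(conllu_path):
      """Count UPOS trigrams over all sentences of a CoNLL-U file."""
      counts = Counter()
      sentence = []
      with open(conllu_path, encoding="utf-8") as f:
          for line in f:
              line = line.rstrip("\n")
              if not line:
                  counts.update(zip(sentence, sentence[1:], sentence[2:]))
                  sentence = []
              elif not line.startswith("#"):
                  cols = line.split("\t")
                  if cols[0].isdigit():
                      sentence.append(cols[3])       # UPOS column
      counts.update(zip(sentence, sentence[1:], sentence[2:]))
      return counts

  def klcpos3(target_counts, source_counts, smoothing=1e-6):
      tgt_total = sum(target_counts.values())
      src_total = sum(source_counts.values())
      kl = 0.0
      for trigram, count in target_counts.items():
          f_tgt = count / tgt_total
          f_src = source_counts.get(trigram, 0) / src_total or smoothing
          kl += f_tgt * math.log(f_tgt / f_src)
      return kl

  if __name__ == "__main__":
      target, *sources = sys.argv[1:]    # usage: klcpos3.py target.conllu source1.conllu ...
      target_counts = upos_trigrams(target)
      for source in sources:
          print(source, klcpos3(target_counts, upos_trigrams(source)))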

tree_projection

 Deadline: May 18  3 points

  • Projecting trees over parallel data:
    • all data is here: PUD = parallel treebanks, align = alignments by FastAlign
    • Beware: CoNLL-U token IDs are 1-based, FastAlign token IDs are 0-based
    • Beware: tokens with non-integer IDs (like 5-6 or 8.1) are part neither of the tree nor of the alignment (so maybe you can just grep them away)
    • Beware: forms and lemmas can contain spaces in CoNLL-U
    • You can use the template project.py which I prepared (it does the reading in and writing out)
    • because this is a parallel treebank, you have gold-standard annotation for both the source tree and the target tree, so you can measure the accuracy of your projection
      • you can use e.g. my evaluator.py for that
      • use it e.g. as python3 evaluator.py -j -m head gold.conllu pred.conllu
      • run it as python3 evaluator.py -h for more info; most importantly, you can also use -m deprel or -m las
    • Homework: implement the projections somehow; you will get points according to how good and sophisticated they are; evaluate them automatically for several language pairs and report the scores (a baseline head-projection sketch is shown after this list)
    • Deadline: 18th May
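
A baseline head-projection sketch (referenced above): for each alignment link, look up the head of the source token, map that head through the alignment to a target token, and use it as the head of the aligned target token; everything unaligned stays attached to the root. The plain-list sentence representation is a simplification -- the provided project.py template handles the actual CoNLL-U reading and writing -- and note the 0-based/1-based offset warned about above:

  # Sketch for the projection homework: naive head projection through the word
  # alignment. Sentences are represented here just as lists of integer heads
  # (list index 0 = first token, head 0 = root), which is a simplification -- the
  # provided project.py template does the actual CoNLL-U reading and writing.

  def project_heads(source_heads, target_len, alignment_pairs):
      """source_heads: 1-based CoNLL-U heads; alignment_pairs: 0-based FastAlign (src, tgt) pairs."""
      src2tgt = {}
      for src, tgt in alignment_pairs:
          src2tgt.setdefault(src, tgt)           # keep the first link of each source token
      target_heads = [0] * target_len            # default: attach everything to the root
      for src, tgt in alignment_pairs:
          src_head = source_heads[src]           # head of the aligned source token (1-based)
          if src_head == 0:
              target_heads[tgt] = 0              # source token is the root -> target root
          elif (src_head - 1) in src2tgt:
              target_heads[tgt] = src2tgt[src_head - 1] + 1    # back to 1-based CoNLL-U heads
      return target_heads

  if __name__ == "__main__":
      # Toy example: a 3-token source tree (token 2 is the root), aligned monotonically.
      source_heads = [2, 0, 2]                            # CoNLL-U HEAD values of tokens 1..3
      alignment = [(0, 0), (1, 1), (2, 2)]
      print(project_heads(source_heads, 3, alignment))    # expected output: [2, 0, 2]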

tree_translation

 Deadline: May 25  3 points

  • Lab: cross-lingual parsing lexicalized by translation of the training treebank using machine translation
    • we get back to the VarDial 2017 cross-lingual parsing shared task setup: 3 language pairs (one is actually a triplet), using supervised POS tags:
      • Czech (cs) source, Slovak (sk) target
      • Slovene (sl) source, Croatian (hr) target
      • Danish (da) and/or Swedish (sv) source, Norwegian (no) target
    • choose any language pair you want to, or use other languages if you want to
    • for the language pairs above, some datasets are prepared for the lab (but that's only a minor convenience, you can simply use UD treebanks and e.g. OpenSubtitles or WatchTower parallel data for any languages)
      • "treebanks" are the training treebanks for the source languages and evaluation treebanks for the target languages
      • "smaller_delex_models" are the baselines, i.e. delexicalized UDPipe parsers trained on the first 4096 sentences from the training treebanks; apply them to the target evaluation treebanks to measure the baseline accuracy (around 55 LAS I think)
      • "our_vardial_models" are lexicalized parsing models which we submitted into the competition, about +5 LAS above the baselines (can you beat us?! :-))
      • "para" are parallel data, obtained from OpenSubtitles2016 aligned by MonolingualGreedyAligner with intersection symmetrization (the format of the data is "sourceword[tab]targetword" on each line); there are also "tag" variants where POS tag and morphological features are annotated for the source word
      • "translate_treebank.py" is a simple implementation of treebank translation which you can use for your inspiration
    • the baseline approach is to translate each word form in the source treebank (second column) by its most frequent target counterpart from the parallel data (as done by the sample "translate_treebank.py" script, and sketched after this section), and then train a standard UDPipe parser on that:
      udpipe --train --tokenizer=none --tagger=none out.model < train.conllu
      and then evaluate the parser on the target evaluation treebank:
      udpipe --parse --accuracy out.model < dev.conllu
    • there are many possible improvements to the approach:
      • use better word alignment (e.g. FastAlign intersection alignment)
      • use the source POS tags and/or morphological features for source-side disambiguation -- e.g. the word "stát" in Czech should be translated differently as a noun ("state") and as a verb ("stand"); you already have this annotation in the source treebank, and you can get it in the parallel data using a UDPipe tagger trained on the source treebank (which is how we produced the "tag" variants of the para data, which you can use)
      • use multiple source languages -- either combine the parsers using the MST algorithm, or simply concatenate the source treebanks into one (that's what we did in VarDial for Danish and Swedish -- if you see the "ds" language code, this means just that)
      • use a proper MT system (word-based Moses probably?)
      • use your knowledge of the target language for some additional processing
      • guess some translations for unknown words
      • pre-train target-language word embeddings with word2vec (on some target-language plaintext -- you can also use the target side of the parallel data) and provide the pre-trained embeddings to UDPipe during training; see the UDPipe manual
      • etc.; you can also come up with your own ideas for improvements
  • Homework:
    • implement cross-lingual parsing lexicalized by treebank translation (it is sufficient to use one language pair, either one of the above or your own)
    • describe what you did and report achieved LAS scores evaluated on the target language treebank
    • doing the simplest baseline lexicalization approach described above carries 2 points
    • implementing some of the improvements carries more points
    • Deadline: 25th May
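
A sketch of the baseline lexicalization step (referenced above, and essentially what the sample translate_treebank.py does): build a source-to-target dictionary from the "sourceword[tab]targetword" para files by taking the most frequent target counterpart of each source word, then rewrite the FORM column of the source treebank, keeping words that have no translation as they are. File names are placeholders:

  #!/usr/bin/env python3
  # Sketch of the baseline treebank translation: most-frequent-counterpart
  # translation of the FORM column of a CoNLL-U treebank, using the
  # "sourceword<TAB>targetword" parallel data. File names are placeholders.
  import sys
  from collections import Counter, defaultdict

  def build_dictionary(para_path):
      """Map each source word to its most frequent target counterpart."""
      counts = defaultdict(Counter)
      with open(para_path, encoding="utf-8") as f:
          for line in f:
              parts = line.rstrip("\n").split("\t")
              if len(parts) == 2:
                  counts[parts[0]][parts[1]] += 1
      return {src: tgt.most_common(1)[0][0] for src, tgt in counts.items()}

  def translate_treebank(conllu_in, conllu_out, dictionary):
      """Rewrite the FORM column; words without a translation are kept as they are."""
      with open(conllu_in, encoding="utf-8") as fin, \
           open(conllu_out, "w", encoding="utf-8") as fout:
          for line in fin:
              line = line.rstrip("\n")
              if line and not line.startswith("#"):
                  cols = line.split("\t")
                  if cols[0].isdigit():
                      cols[1] = dictionary.get(cols[1], cols[1])    # FORM column
                  line = "\t".join(cols)
              print(line, file=fout)

  if __name__ == "__main__":
      # e.g.: python3 translate.py para.cs-sk cs-train.conllu cs2sk-train.conllu
      para, treebank_in, treebank_out = sys.argv[1:4]
      translate_treebank(treebank_in, treebank_out, build_dictionary(para))

A natural next step from the improvement list is to key the dictionary on (form, POS tag) pairs using the "tag" variants of the para data, so that e.g. "stát" gets different translations as a noun and as a verb.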

enhancing_ud

 Deadline: Jun 01  3 points

  • We will spend some time on syntactic annotation harmonization (so far we have only covered morphological annotation harmonization) and on Enhanced UD.
  • Lab: enhancing the Czech UD treebank with information from the tectogrammatical (deep syntactic) annotation in PDT
  • Homework: try to enhance the Czech UD with something
    • Choose any of the phenomena listed in Enhanced UD, and try to enrich the Czech UD annotation with it
    • You may add abstract nodes (with non-integer IDs such as 7.1) and/or add secondary dependencies (these go into the DEPS (9th) column, see the CoNLL-U format); a minimal DEPS-filling sketch is shown after this list
    • You may use the tecto annotation; there are the same sentences in the tecto file as in the ud file, and the enhance.py script takes care of loading the corresponding pairs of sentences; the most useful columns are probably the ID, the COREF_IDS (IDs of coreference antecedents, separated by pipes "|"), and the EFFHEADS (effective heads, such as the "real" head for all conjuncts, or even multiple heads e.g. for shared modifiers)
    • For some phenomena the tecto annotation is probably not needed
    • You can also try to work with a different language if you decide to focus on something where you don't need the tecto annotation
    • It is not always clear how to do the enhanced UD, and the tecto annotation is often quite complex, so don't worry if you get lost or confused -- just try to do something, and then submit the code and a commentary on what you tried to do, how you did it, and how well you think you succeeded...
    • Deadline: 1st June
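
As a trivial starting point (referenced above), you can first copy the basic HEAD:DEPREL relation of every token into the DEPS column -- the enhanced graph is then simply the basic tree, on top of which you can add extra edges or nodes. A minimal sketch, independent of the tecto file:

  #!/usr/bin/env python3
  # Sketch for the enhancing homework: initialize the DEPS (9th) column of a
  # CoNLL-U file with the basic HEAD:DEPREL relation of each token, so that real
  # enhancements (extra edges, added nodes) can then be layered on top of it.
  import sys

  for line in sys.stdin:
      line = line.rstrip("\n")
      if line and not line.startswith("#"):
          cols = line.split("\t")
          if cols[0].isdigit():                     # skip multiword ranges and empty nodes
              cols[8] = f"{cols[6]}:{cols[7]}"      # DEPS = basic HEAD:DEPREL
          line = "\t".join(cols)
      print(line)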