# NPFL120 – Multilingual Natural Language Processing

The course focuses on multilingual aspects of natural language processing. It explains both the issues and the benefits of doing NLP in a multilingual setting, and shows possible approaches to use. We will target both dealing with multilingual variety in monolingual methods applied to multiple languages, as well as truly multilingual and crosslingual approaches which use resources in multiple languages at once. We will review and work with a range of freely available multilingual resources, both plaintext and annotated.

• SIS code: NPFL120
• Semester: summer
• E-credits: 3
• Examination: 1/1 KZ
• Guarantors: Daniel Zeman, Rudolf Rosa, Ondřej Bojar
• Taught in: English, unless all students present understand Czech.

### Timespace Coordinates

• in summer semester 2019, the course takes place every Monday 9:00 in SU1.

### Informal prerequisites

We suggest that students first attend the NPFL100 course Variability of languages in time and space / Variabilita jazyků v čase a prostoru, which takes a more theoretical and linguistic look at many phenomena that we will approach more practically and computationally.

Some basic programming skills are expected, e.g. from the NPFL092 course NLP Technology.

The course nicely complements the NPFL070 course Language Data Resources.

### Organization of the course

The course has the form of a practical seminar in the computer lab. In each class we will try to combine a lecture with practical hands-on exercises (students are therefore required to have a unix lab account).

### Requirements

There will be homework from most of the classes, typically based on finishing and/or extending the exercises from that class.

To pass the course, you will be required to actively participate in the classes and to submit homework tasks. The quality of your homework solutions will determine your grade.

Currently, the idea is that you get points for each homework: a good solution gets 3 points, a weaker solution gets fewer, and a stronger solution gets more. If your final average of points per homework is at least 3, you get the grade 1; otherwise you get a lower grade.

### 1. Introduction; WALS

Feb 18 Slides wals

### 2. Alphabets, encoding, language identification

Feb 25 Slides

• langid library and the accompanying paper
• cleaned UDHR dataset, originally from here
• mix of languages
• Unidecode for transliterating nearly any characters into ASCII
• A similar but different tool by Dan Zeman can be found in /home/zeman/projekty/transliterace/translit.pl on the ÚFAL network, if you have access.
• get ISO codes of languages here
• unicodedata for various Unicode stuff in Python
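As a small illustration of what the standard `unicodedata` module offers, the following sketch strips diacritics via canonical decomposition and lists character names and categories. The helper names `strip_diacritics` and `describe` are made up for this example; note that Unidecode does much more (e.g. it also transliterates non-Latin scripts), which this sketch does not attempt.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def describe(text):
    """Return (character, Unicode name, general category) for each character."""
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"), unicodedata.category(ch))
        for ch in text
    ]


if __name__ == "__main__":
    print(strip_diacritics("česky"))
    for row in describe("č"):
        print(row)
```

Characters that have no canonical decomposition (e.g. "ł" or "đ") pass through unchanged, which is one reason a dedicated transliteration tool is often the better choice.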

### 3. Tokenization and Word Segmentation

Mar 04 Slides tokenization

### 4. Machine Translation: Alignment and Phrase-Based MT (Ondřej Bojar)

Mar 11 Slides mt

### 5. Interset, POS harmonization

Mar 18 Slides pos_harmonization

### 6. Cross-lingual POS tagging

Mar 25 Slides pos_tagging

### 7. Delexicalized parsing

Apr 1 Slides delex_parsing

### 9. Treebank translation

Apr 15 Slides tree_translation

### 10. Syntax harmonization and Enhanced Universal Dependencies

Apr 27 Slides enhancing_ud

May 6 Slides

### 12. Word Embeddings

May 13 Slides embeddings

### Rules

• use any programming language you like (we suggest Python)
• send in
• source codes
• short report
• at least a few sentences, saying what you did, how you did it, how it worked, what you observed in the results, etc.
• if it makes sense, please also include a sample of the results/outputs
• it can be a long report if there is a lot to say about what you did, but otherwise a few sentences are sufficient
• the report is more important than the source codes (we may or may not check/run your code, but we will always read the report)
• submit via e-mail to Rudolf
• either put everything into the e-mail
• or put it elsewhere (e.g. a Git repository) and send info on where it is
• the deadline is in 1.5 weeks by default (in 2019 this means next week Thursday 23:59)
• you will get points
• 3 points is the base for an OK solution
• fewer points for a bad solution, 0 points for no solution
• more points for a great solution, doing something clever, doing more work, going deeper, finding something good...
• if your point average is at least 3 at the end of the semester, you get the grade 1
• feel free to go deeper
• We are operating at the frontier of current research, so in any of the assignments there is a chance that you will discover something new (worth publishing at a scientific conference, or investigating more in a diploma thesis, etc.)
• So feel free to go as deep as you want in any of the assignments!
• You can even diverge from the task if you come up with something more interesting to do. Just follow your imagination :-) This is how research is done.
• You will get more points if you do anything beyond the base task (and if it is extra interesting, we can talk about publishing it in a scientific paper).
• But also feel free to simply do the assignment as it is set, this will still give you 3 points. You can do more, but you do not have to.

### wals

• WALS online for clicking
• language.tsv -- WALS dataset for computer processing (free to download in CSV; this file has been converted to TSV for convenience, but it was generated in 2018 and WALS has been updated in the meantime, so you may want to download the new original WALS dataset instead)
• grepping and cutting in the WALS dataset
• Homework: a script for measuring language similarity using the WALS dataset
• Idea: similarity of a pair of languages can be estimated by comparing their WALS features, e.g. by counting the number of WALS features in which they are similar (Agić, 2017). The simplest way is to iterate over the features, ignoring those that are undefined for one of the two languages, and adding 1 to the score if the values match or 0 if they do not match. If you then divide this score by the number of features you compared, you get the Hamming similarity.
• You can either do the tasks 1-3 (1 is really THE task, 2 and 3 are just simple extensions), or you can do the harder alternative task.
• Task 1: input = WALS code of one language, output = WALS code and similarity scores for most similar languages.
• Task 2: input = genus (e.g. "Slavic"), output = centroid language of that genus, i.e. a language most similar to other languages of the genus
• Task 3: find the weirdest language, i.e. most dissimilar to any other language (for whole WALS, or for a given language genus/family)
• Alternative task: automatically generate missing values in WALS (e.g. if all Slavic languages have the number of genders either 3 or unspecified, you can probably set the unspecified values to 3). This is a harder task, so if you do this one, you do not have to do the tasks 1-3.
• The definition of the task is somewhat vague, feel free to spend as much or as little time with it as you wish
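The similarity idea above can be sketched as follows. This is only a minimal illustration, not a reference solution: it assumes each language has already been loaded into a dictionary mapping feature IDs to values (empty string = undefined), and the language codes and feature values in the usage example are made up rather than taken from WALS.

```python
def wals_similarity(features_a, features_b):
    """Share of matching values among features defined for both languages.

    Each argument maps a feature ID to its value; missing or empty
    values count as undefined and are skipped.
    """
    shared = [f for f in features_a if features_a[f] and features_b.get(f)]
    if not shared:
        return 0.0
    matches = sum(1 for f in shared if features_a[f] == features_b[f])
    return matches / len(shared)


def most_similar(code, languages, top_n=5):
    """Return up to top_n (language code, similarity) pairs for one language."""
    scores = [
        (wals_similarity(languages[code], feats), other)
        for other, feats in languages.items()
        if other != code
    ]
    scores.sort(reverse=True)
    return [(other, score) for score, other in scores[:top_n]]


if __name__ == "__main__":
    # Toy data with invented codes and values, for illustration only.
    languages = {
        "aaa": {"81A": "2 SVO", "30A": "3 Three", "87A": ""},
        "bbb": {"81A": "2 SVO", "30A": "3 Three", "87A": "1 Adjective-Noun"},
        "ccc": {"81A": "1 SOV", "30A": "3 Three"},
    }
    print(most_similar("aaa", languages))
```

Dividing by the number of features defined for both languages is one possible normalization. A pair of languages that share only one or two defined features then gets an unreliable score, so you may want to require a minimum overlap.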

### tokenization

• One tokenizer you may often encounter is the Moses tokenizer:
mkdir -p mosestok/tokenizer/; cd mosestok/tokenizer/
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/tokenizer.perl
chmod u+x tokenizer.perl; cd ..; mkdir -p share/nonbreaking_prefixes/; cd share/nonbreaking_prefixes/
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
cd ../../..
mosestok/tokenizer/tokenizer.perl -h
• Try tokenizing the sentences from the slides with the Moses tokenizer and with the UDPipe tokenizer -- see Running UDPipe tokenizer
• hint: udpipe --tokenize path/to/model < input.txt
• Playing with quotes: “English” — ‘English’ — „česky“ — ‚česky‘ — « français » — ‹ français › — „magyar” — »magyar« — ’magyar’ -- and tex quotes'' --- 'cause it's a mess, you know... But don’t don‘t don’t don’t don't talk 'bout that too much or students' heads'll explode!
• Varied Chinese punctuation: 「你看過《三國演義》嗎？」他問我。“你看過‘三國演義’嗎?”他問我.
• Vietnamese: Tất cả đường bêtông nội đồng thành quả
• Japanese: 経堂の美容室に行ってきました。
• Spanish: «¡María, te amo!», exclamó Juan. “María, I love you!” Juan exclaimed. ¿Vámonos al mar? Escríbeme a rur@nikde.eu. Soy de 'Kladno'... Tiene que bañarse.
• Download the cleaned UDHR dataset and try tokenizing some of the texts with UDPipe
• cmn = Chinese (Mandarin), yue = Cantonese, jpn = Japanese, vie = Vietnamese...
• Some languages have more than one treebank. Does the tokenizer work similarly well on each of them? (I.e. are the treebanks tokenized similarly?) See Measuring Model Accuracy in UDPipe manual
• hint: add the --accuracy switch and use the treebank test file (xyz-ud-test.conllu) as input
• Task A: Some languages have small or no training data in UD 2.3 and there is no trained UDPipe tokenizer for them yet. How would you tokenize e.g. Cantonese (yue), Buryat (bxr), or Upper Sorbian (hsb)?
• Try to find a reasonable UDPipe tokenization model for these three languages (e.g. for tokenizing Cantonese, maybe using the Chinese model makes sense?).
• You may try to reuse what you did in hw_wals ;-)
• Report which tokenizer you chose for each of the languages, how you did that, and what accuracy it achieves (again, evaluate the tokenizer on the test data).
• Task B: train a UDPipe tokenizer for one of the languages for which no trained model is available -- see Training UDPipe Tokenizer
• hint: udpipe --train --tagger=none --parser=none output_model.udpipe < xyz-ud-train.conllu
• hint: no model is available if too little data is available (typically the training data is missing), so you either have to do a different split of the data, or perform n-fold cross-validation (so that you can evaluate the tokenizer on something)
• report the accuracy you got, and compare it to using an existing tokenizer model trained on larger data for a different language
• sanity check: also run the tokenizer on some plaintext data for the language (probably from UDHR) and check that it actually does perform some reasonable-looking tokenization
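For the n-fold cross-validation mentioned in the Task B hint, the data split might look like the sketch below. It assumes the standard CoNLL-U convention that sentences are separated by blank lines, and it uses contiguous chunks as folds so that neighbouring sentences stay together; the function names are invented for this example.

```python
def read_sentences(conllu_text):
    """Split CoNLL-U text into sentence blocks (separated by blank lines)."""
    blocks = [block.strip("\n") for block in conllu_text.split("\n\n")]
    return [block for block in blocks if block.strip()]


def make_folds(sentences, n_folds):
    """Yield (train, test) sentence lists, one contiguous test chunk per fold."""
    total = len(sentences)
    for fold in range(n_folds):
        start = fold * total // n_folds
        end = (fold + 1) * total // n_folds
        yield sentences[:start] + sentences[end:], sentences[start:end]


def write_conllu(sentences):
    """Join sentence blocks back into CoNLL-U text."""
    return "".join(sentence + "\n\n" for sentence in sentences)


if __name__ == "__main__":
    text = "# text = A\n1\tA\t_\n\n# text = B\n1\tB\t_\n\n# text = C\n1\tC\t_\n\n"
    sentences = read_sentences(text)
    for i, (train, test) in enumerate(make_folds(sentences, 3)):
        print(i, len(train), len(test))
```

Each (train, test) pair would then be written to files, used to train a tokenizer with the `udpipe --train` command from the hint above, and evaluated with `--accuracy`; the per-fold accuracies can be averaged.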

### mt

• Visually compare the left, right and intersection alignments ... check in how many sentences you see the 'garbage alignments' that all fall onto one word

• Compare the intersection alignment for the baseline and improved alignments.

• Write a small script that reads:

1. source tokens
2. target tokens
3. alignment

and emits all pairs of aligned words.

If run through sort | uniq -c | sort -n, this would be a translation dictionary.

• Continue the Moses tutorial to train a phrase-based model (apply mert-moses.pl).

• Apply the trained model.

• Compare the translations from the default run and from the run with these model flags:

-dl=0 -max-phrase-length 1
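The small script for emitting aligned word pairs, described above, could look roughly like this. It is only a sketch, and it assumes that the alignment is given as whitespace-separated `i-j` links with 0-based source and target token indices (the format commonly produced by alignment symmetrization tools); check this against your actual files.

```python
from collections import Counter


def aligned_pairs(source_line, target_line, alignment_line):
    """Return (source word, target word) pairs for one sentence pair.

    Assumes alignment links of the form "i-j" with 0-based indices.
    """
    source = source_line.split()
    target = target_line.split()
    pairs = []
    for link in alignment_line.split():
        i, j = link.split("-")
        pairs.append((source[int(i)], target[int(j)]))
    return pairs


def count_pairs(source_lines, target_lines, alignment_lines):
    """Count aligned word pairs over a whole corpus."""
    counts = Counter()
    for src, tgt, ali in zip(source_lines, target_lines, alignment_lines):
        counts.update(aligned_pairs(src, tgt, ali))
    return counts


if __name__ == "__main__":
    counts = count_pairs(
        ["das Haus ist klein"], ["the house is small"], ["0-0 1-1 2-2 3-3"]
    )
    for (src, tgt), n in counts.most_common():
        print(n, src, tgt)
```

Counting with `Counter` plays the same role as piping the emitted pairs through `sort | uniq -c | sort -n`.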


### pos_harmonization

• Tagset harmonization exercise: You get a syntactic parser trained on the UD tagset (UPOS and Universal Features), and data tagged with a different tagset. Try to convert the tagset into the UD tagset to get better results when applying the parser to the data.
• The data in the CoNLL-U format and the trained UDPipe models can be found at http://ufallab.ms.mff.cuni.cz/~rosa/npfl120/pos_harm/.
• Running the parser
• To run the parser and get results in the CoNLL-U format, use e.g.: cat ta-ud-dev-orig.conllu | ./udpipe --parse ta.sup.parser.udpipe
• To view the tree structures in the CoNLL-U data, you can use e.g. view_conll or Udapi.
• To evaluate the parsing accuracy, use e.g.: cat ta-ud-dev-orig.conllu | ./udpipe --parse --accuracy ta.sup.parser.udpipe
• Tagset documentation (in practice it is often quite hard to get proper documentation for a tagset, but we decided to be nice to you):
• Try to achieve some reasonable parsing accuracy – I guess at least 50% should be achievable rather easily.
• Note that 100% accuracy is not reachable; the UAS upper bounds (measured on UD test data) are: CS 90%, DE 85%, EN 88%, LA 68%, TA 78%
• Your task is to try to do the harmonization yourself, not using any pre-existing tools for that.
• Homework:
• Harmonize the tagset for one of the languages.
• You can use the template harmonize.py
• Turn in the code that you used.
• Report the parsing accuracy before and after your harmonization (both UAS and LAS); please measure the accuracy repeatedly during the development and report which changes to your solution brought which improvements of the parsing accuracy.
• The minimum is to identify some of the main POS categories, such as verbs, nouns, adjectives, and adverbs, so that you get a reasonable parsing accuracy. For doing that, you can get 2 points for the homework. You can get more points if you further improve your solution; some suggestions are listed below.
• You can try to identify more POS categories; ideally you should map all of the original POS tags to some UPOS tags.
• (You can try to produce some of the Universal Features (documentation) – but this will most probably not work well, as UDPipe uses the features as one atomic string.)
• You can try to cover all of the languages, at least in a basic way.
• You can figure out how to use Interset (see the lecture), use it to harmonize the tagset, and compare the parsing accuracy achieved when using your solution and when using Interset (but you still need to create at least a simple solution of your own).
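To give an idea of the shape of a very basic harmonization script, here is a hedged sketch. It is not the provided harmonize.py template (whose contents are not shown here), and both the tag prefixes in `PREFIX_TO_UPOS` and the assumption that the original tag sits in the XPOS column (5th column) are hypothetical; you need to replace them according to the tagset documentation and the actual data.

```python
# Hypothetical mapping from original tag prefixes to UPOS tags;
# replace it with rules derived from the documentation of your tagset.
PREFIX_TO_UPOS = {"N": "NOUN", "V": "VERB", "J": "ADJ", "R": "ADV"}


def harmonize_tag(original_tag):
    """Map an original tag to a UPOS tag, falling back to X."""
    for prefix, upos in PREFIX_TO_UPOS.items():
        if original_tag.startswith(prefix):
            return upos
    return "X"


def harmonize_line(line):
    """Fill the UPOS column of one CoNLL-U line from the XPOS column.

    Assumption: the original tag is stored in XPOS (column index 4).
    Comment lines and empty lines are returned unchanged.
    """
    if not line.strip() or line.startswith("#"):
        return line
    columns = line.split("\t")
    columns[3] = harmonize_tag(columns[4])
    return "\t".join(columns)


if __name__ == "__main__":
    import sys
    for raw in sys.stdin:
        print(harmonize_line(raw.rstrip("\n")))
```

A script like this can be put in front of the parser in a pipeline, so that you can re-measure UAS and LAS after every change to the mapping.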

### pos_tagging

• devise a cross-lingual POS tagger for one under-resourced target language
• start here, finish as homework
• report what you did and your POS tagging accuracy on the UD test data
• suggested target language: Kazakh (kk) / Telugu (te)
• there is some small training data in UD, so let's pretend there is none and just use the test data
• there are some reasonable parallel data
• there is at least one reasonable high-resource source language for each of these to project the POS tags from -- choose the source language(s) yourself
• POS projection over (multi)parallel data
1. take parallel data -- I suggest Watchtower and/or OpenSubtitles from OPUS:

• Watchtower (do not share) by Agić+ (2016)
• OPUS by Tiedemann+ (2004)
• WTC data is in a multiparallel format
• the same line in all the files corresponds to the same sentence in the various languages
• but some lines may be empty, as not all sentences are present in all the files
• some OPUS data are multiparallel, but I don't know how to easily get the multiparallel sentence alignment
• so if you use multiple sources at once, I suggest you use WTC
2. POS tag the source side of the parallel data

• you can use the trained UDPipe models

• tag a tokenized text with UDPipe:

udpipe --tag --input=horizontal path/to/model < input.txt > output.conllu

• tokenize and tag:

udpipe --tokenize --tag path/to/model < input.txt > output.conllu

• only convert tokenized text to CoNLL-U format:

udpipe --input=horizontal path/to/model < input.txt > output.conllu

3. word-align source and target

• you can use Giza++ (see MT lab) or FastAlign (see below)

• I suggest using intersection alignment symmetrization, but you can play with this a bit

• FastAlign installation:

git clone https://github.com/clab/fast_align.git
cd fast_align
mkdir build
cd build
cmake ..
make

• FastAlign usage (add -s to also output alignment scores):

paste cs sk | sed 's/\t/ ||| /' > cs-sk
fast_align -d -o -v -i cs-sk > cs-sk.f
fast_align -d -o -v -r -i cs-sk > cs-sk.r
atools -i cs-sk.f -j cs-sk.r -c intersect > cs-sk.i

4. project POS tags through the alignment from the tagged source to the non-tagged target

• you can use the template pos_project.py (but it was created for a slightly different purpose so you may need to change it a bit or a lot)
• take inspiration from the lecture to do the projection
• simply copying the POS tag from source to target with no other tricks is sufficient to get 2 points for the assignment
• you still need to do something with unaligned words or multiply aligned words (e.g. voting or weighted voting, or simply use the knowledge that NOUN is usually the most frequent POS...)
• doing something more clever carries more points
• ideally start with the simple solution, measure the base accuracy, then implement some improvements, and repeatedly measure the increase in accuracy (if any)
5. train tagger on the target data

udpipe --train --tokenizer=none --parser=none --tagger='use_xpos=0;use_features=0' output.model < input.conllu

6. evaluate the tagger on target test data

udpipe --tag --accuracy path/to/model < test.conllu

• other notes (not important for this HW)
• you can use the HunAlign sentence aligner if you use parallel data that are not sentence-aligned: install_hunalign.sh, hun_align.sh
• some data in OPUS are weird; OpenSubtitles and Tanzil are nice
• once you have word-aligned data, you can also extract a simple word-to-word translation dictionary (this single-best translation is weaker than e.g. Moses as it does not take the context into account)
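The projection in step 4 can be sketched as below. This is a minimal illustration and not the pos_project.py template: it assumes 0-based `source-target` alignment links (as output by FastAlign), resolves multiple alignments by simple voting, and gives unaligned target words a default tag of NOUN, which is just one of the options mentioned above.

```python
from collections import Counter


def parse_alignment(line):
    """Parse "0-0 1-2 ..." into a list of (source, target) index pairs."""
    return [tuple(int(x) for x in link.split("-")) for link in line.split()]


def project_tags(source_tags, n_target, alignment, default="NOUN"):
    """Project POS tags from source tokens to target tokens.

    source_tags: list of tags of the source sentence
    n_target: number of target tokens
    alignment: (source index, target index) pairs, 0-based
    Each target token takes the most frequent tag among the source
    tokens aligned to it; unaligned tokens get the default tag.
    """
    votes = [Counter() for _ in range(n_target)]
    for source_index, target_index in alignment:
        votes[target_index][source_tags[source_index]] += 1
    return [v.most_common(1)[0][0] if v else default for v in votes]


if __name__ == "__main__":
    tags = ["PRON", "VERB", "DET", "NOUN"]
    links = parse_alignment("0-0 1-1 3-2")
    print(project_tags(tags, 4, links))
```

The projected tags would then be written into the UPOS column of the target-side CoNLL-U file used for training the tagger in step 5. Weighted voting (e.g. with alignment scores) fits into the same structure by adding the score instead of 1.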

### delex_parsing

• applying lexicalized versus delexicalized parsers in a monolingual and cross-lingual setting
• lexicalized ("sup") and delexicalized ("delex") UDPipe 1.2 models trained on UD 2.1 treebanks

• language groups for experimenting:

• Norwegian (no), Danish (da), Swedish (sv)
• Czech (cs), Slovak (sk)
• Spanish (es), Portuguese (pt)
• training a delexicalized UDPipe parser (without morpho features):

cat cs-ud-train.conllu | ./udpipe --train --parser='embedding_form=0;embedding_feats=0;' --tokenizer=none --tagger=none cs.delex.parser.udpipe

• Homework:
• Extend your cross-lingual POS tagging homework to cross-lingual parsing
• Train a delexicalized parser on a source language treebank, and apply it to your cross-lingually-POS-tagged target-language data
• Report the parsing accuracies you obtain (LAS and UAS)
• You may also try the source-lexicalized parsing:
• Train a standard lexicalized parser (but still without morpho features) on the source language
• Apply it to the target language (without any translation)
• This will only work well if there is a substantial amount of shared vocabulary between the source and the target language, i.e. they are lexically very close
• Other notes -- combining multiple parsers via the MST algorithm (you do not have to do this in this HW):
• parse a sentence with multiple parsers -- you get multiple parse trees, i.e. 3 sets of dependency edges if you used 3 parsers
• assign weights to the edges (e.g. 1 if the edge appeared in one parser output, 2 if in 2, etc.; or incorporate language similarity into the weights as well, i.e. edges from less similar languages get a lower weight)
• give the list of edges and their weights to a MST algorithm, which outputs the best tree that can be constructed from the edges
• you can use my Perl wrapper of the Perl Graph::ChuLiuEdmonds library (or look at my code and use the library directly from your Perl code; unfortunately I am unaware of any good implementation of a directed MST algorithm in Python)
• my wrapper takes standard input (one sentence per line) and writes to standard output (one sentence per line)

• the input format is

number_of_nodes parent child weight parent child weight...


where parent and child are 1-based integer IDs of the parent and child nodes of the edge (a parent ID of 0 stands for the artificial root, as in the example below) and the weight is a weight you assign to the edge, so e.g.:

3 0 2 1.5 2 1 0.5 2 3 0.5 3 1 1.2
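Producing such an input line from several parser outputs can be sketched as follows. The sketch assumes that each parse is given as a list of head IDs (one per token, 0 for the root) and that the weight of an edge is the sum of the weights of the parsers that proposed it; the function names are invented here, and the MST step itself is left to the wrapper described above.

```python
from collections import Counter


def combine_edges(parses, parser_weights=None):
    """Sum up edge weights over several parses of the same sentence.

    parses: list of head lists; parses[p][i] is the head ID of token i+1
            (1-based token IDs, 0 = artificial root)
    parser_weights: one weight per parser (default 1.0 each), e.g. based
            on the similarity of the source language to the target
    """
    if parser_weights is None:
        parser_weights = [1.0] * len(parses)
    weights = Counter()
    for heads, parser_weight in zip(parses, parser_weights):
        for child, parent in enumerate(heads, start=1):
            weights[(parent, child)] += parser_weight
    return len(parses[0]), weights


def format_for_mst(n_nodes, weights):
    """Format edges as: number_of_nodes parent child weight parent child weight..."""
    fields = [str(n_nodes)]
    for (parent, child), weight in sorted(weights.items()):
        fields += [str(parent), str(child), f"{weight:g}"]
    return " ".join(fields)


if __name__ == "__main__":
    parses = [[2, 0, 2], [2, 0, 2], [3, 0, 2]]
    print(format_for_mst(*combine_edges(parses)))
```

Edges proposed by no parser are simply left out, so the MST algorithm can only choose among edges that at least one parser produced.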


### tree_projection

• Projecting trees over parallel data:
• all data is here: PUD = parallel treebanks, align = alignments by FastAlign
• Beware: CoNLL-U token IDs are 1-based, FastAlign token IDs are 0-based
• Beware: tokens with a non-integer ID (like 5-6 or 8.1) are not part of the tree nor of the alignment (so maybe you can just grep them away)
• Beware: forms and lemmas can contain spaces in CoNLL-U
• You can use the template project.py which I prepared (it does the reading in and writing out)
• because this is a parallel treebank, you have gold standard annotation for both the source tree and the target tree, so you can measure the accuracy of your projection
• you can use e.g. my evaluator.py for that
• use it e.g. as python3 evaluator.py -j -m head gold.conllu pred.conllu
• run it as python3 evaluator.py -h for more info; most importantly, you can also use -m deprel or -m las
• Homework: implement the projections somehow; you will get points according to how good and sophisticated they are; evaluate them automatically for several language pairs and report the scores
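One very simple projection strategy is sketched below, mainly to show how the 0-based alignment indices and the 1-based CoNLL-U IDs fit together. It is independent of the project.py template and assumes a one-to-one alignment (e.g. intersection symmetrization) with multiword and empty tokens already removed. The fallback, attaching every unresolved target token to the projected root with the `dep` relation, is just one arbitrary choice.

```python
def project_tree(source_heads, source_deprels, n_target, alignment):
    """Project a dependency tree from source to target over an alignment.

    source_heads: head ID of each source token (1-based IDs, 0 = root)
    source_deprels: dependency relation of each source token
    n_target: number of target tokens
    alignment: (source, target) pairs with 0-based indices,
               assumed to be one-to-one
    Returns (heads, deprels) for the target tokens, using 1-based IDs.
    """
    # Convert the 0-based alignment indices to 1-based CoNLL-U IDs.
    src2tgt = {s + 1: t + 1 for s, t in alignment}
    tgt2src = {t: s for s, t in src2tgt.items()}

    heads = [None] * n_target
    deprels = ["dep"] * n_target
    for t in range(1, n_target + 1):
        s = tgt2src.get(t)
        if s is None:
            continue  # unaligned target token, resolved by the fallback
        s_head = source_heads[s - 1]
        if s_head == 0:
            heads[t - 1] = 0
        elif s_head in src2tgt:
            heads[t - 1] = src2tgt[s_head]
        else:
            continue  # the source head has no target counterpart
        deprels[t - 1] = source_deprels[s - 1]

    # Fallback: make sure there is a root and attach the rest to it.
    if 0 in heads:
        root = heads.index(0) + 1
    else:
        root = 1
        heads[0], deprels[0] = 0, "root"
    for i in range(n_target):
        if heads[i] is None:
            heads[i] = root
    return heads, deprels


if __name__ == "__main__":
    print(project_tree([2, 3, 0], ["det", "nsubj", "root"], 4,
                       [(0, 1), (1, 2), (2, 0)]))
```

Better solutions typically handle one-to-many alignments and find more sensible attachment points for unaligned words, which is where most of the accuracy differences tend to come from.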

### tree_translation

• Lab: cross-lingual parsing lexicalized by translation of the training treebank using machine translation
• we get back to the VarDial 2017 cross-lingual parsing shared task setup: 3 language pairs (one is actually a triplet), using supervised POS tags:
• Czech (cs) source, Slovak (sk) target
• Slovene (sl) source, Croatian (hr) target
• Danish (da) and/or Swedish (sv) source, Norwegian (no) target
• choose any language pair you want to, or use other languages if you want to
• for the language pairs above, some datasets are prepared for the lab (but that's only a minor convenience, you can simply use UD treebanks and e.g. OpenSubtitles or Watchtower parallel data for any languages)
• "treebanks" are the training treebanks for the source languages and evaluation treebanks for the target languages
• "smaller_delex_models" are the baselines, i.e. delexicalized UDPipe parsers trained on the first 4096 sentences from the training treebanks; apply them to the target evaluation treebanks to measure the baseline accuracy (around 55 LAS I think)
• "our_vardial_models" are lexicalized parsing models which we submitted into the competition, about +5 LAS above the baselines (can you beat us?! :-))
• "para" are parallel data, obtained from OpenSubtitles2016 aligned by MonolingualGreedyAligner with intersection symmetrization (the format of the data is "sourceword[tab]targetword" on each line); there are also "tag" variants where POS tag and morphological features are annotated for the source word
• "translate_treebank.py" is a simple implementation of treebank translation which you can use for your inspiration
• the baseline approach is to translate each word form in the source treebank (second column) by its most frequent target counterpart from the parallel data (as done by the sample "translate_treebank.py" script), and then train a standard UDPipe parser on that:
udpipe --train --tokenizer=none --tagger=none out.model < train.conllu
and then evaluate the parser on the target evaluation treebank:
udpipe --parse --accuracy out.model < dev.conllu
• there are many possible improvements to the approach:
• use better word alignment (e.g. FastAlign intersection alignment)
• use the source POS tags and/or morphological features for source-side disambiguation -- e.g. the word "stát" in Czech should be translated differently as a noun ("state") and as a verb ("stand"); you already have this annotation in the source treebank, and you can get it in the parallel data using a UDPipe tagger trained on the source treebank (which is how we produced the "tag" variants of the para data, which you can use)
• use multiple source languages -- either combine the parsers using the MST algorithm, or simply concatenate the source treebanks into one (that's what we did in VarDial for Danish and Swedish -- if you see the "ds" language code, this means just that)
• use a proper MT system (word-based Moses probably?)
• use your knowledge of the target language for some additional processing
• guess some translations for unknown words
• pre-train target language word embeddings with word2vec (on some target language plaintext -- you can also use the target side of the parallel data) and provide the pre-trained embeddings to UDPipe in training; see the UDPipe manual; another good option is to download pre-trained FastText word embeddings from fasttext.cc (use the text format, this is what UDPipe can read in)
• etc., you can come up with your own ideas for improvements
• Homework:
• implement cross-lingual parsing lexicalized by treebank translation (it is sufficient to use one language pair, either one of the above or your own)
• describe what you did and report achieved LAS scores evaluated on the target language treebank
• doing the simplest baseline lexicalization approach described above carries 2 points
• implementing some of the improvements carries more points
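The baseline approach (translating each word form by its most frequent counterpart) can be sketched as follows. This is written independently of the provided translate_treebank.py script, so it may differ from it in details; it assumes the "sourceword[tab]targetword" format of the para data and translates only the FORM column of the CoNLL-U file. The Czech-Slovak word pairs in the usage example are toy data.

```python
from collections import Counter, defaultdict


def build_dictionary(pair_lines):
    """Map each source word to its most frequent aligned target word.

    pair_lines: lines of the form "sourceword<TAB>targetword"
    """
    counts = defaultdict(Counter)
    for line in pair_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2 and parts[0] and parts[1]:
            counts[parts[0]][parts[1]] += 1
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}


def translate_conllu_line(line, dictionary):
    """Translate the FORM column (2nd column) of one CoNLL-U line.

    Unknown words are kept as they are; comment lines and empty lines
    are returned unchanged.
    """
    if not line.strip() or line.startswith("#"):
        return line
    columns = line.split("\t")
    columns[1] = dictionary.get(columns[1], columns[1])
    return "\t".join(columns)


if __name__ == "__main__":
    # Toy word pairs, for illustration only.
    pairs = ["stát\tštát", "stát\tštát", "stát\tstáť", "pes\tpes"]
    dictionary = build_dictionary(pairs)
    print(translate_conllu_line("1\tstát\tstát\tNOUN\t_\t_\t0\troot\t_\t_", dictionary))
```

To use the POS tags for source-side disambiguation, the dictionary key could be extended from the word form to a (form, tag) pair, using the "tag" variants of the para data.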