NPFL120 – Multilingual Natural Language Processing

The course focuses on multilingual aspects of natural language processing. It explains both the issues and the benefits of doing NLP in a multilingual setting, and shows possible approaches to use. We will cover both handling multilingual variety with monolingual methods applied to multiple languages, and truly multilingual and cross-lingual approaches which use resources in multiple languages at once. We will review and work with a range of freely available multilingual resources, both plaintext and annotated.

About

SIS code: NPFL120
Semester: summer
E-credits: 3
Examination: 1/1 KZ
Guarantors: Daniel Zeman
Rudolf Rosa
Ondřej Bojar
Taught in: English, unless all students present understand Czech.

Timespace Coordinates

  • in summer semester 2020, the course takes place every Friday 14:00 at https://matfyz.zoom.us/j/863861861
  • you can follow the class online in real time
  • we will also put the recordings of the classes online, so you can watch them afterwards

Informal prerequisites

We suggest that students first attend the NPFL100 course Variability of languages in time and space / Variabilita jazyků v čase a prostoru, which takes a more theoretical and linguistic view of many phenomena that we will approach more practically and computationally.

Some basic programming skills are expected, e.g. from the NPFL092 course NLP Technology.

The course also complements the NPFL070 course Language Data Resources nicely.

Organization of the course

The course has the form of a practical seminar in the computer lab. In each class we will try to combine a lecture with practical hands-on exercises (students are therefore required to have a unix lab account).

Lectures

1. Introduction; WALS Slides wals

2. Alphabets, encoding, language identification Slides

3. Tokenization and Word Segmentation Slides tokenization

4. Interset, POS harmonization Slides pos_harmonization Online class recording

5. Cross-lingual POS tagging Slides pos_tagging Online class recording

6. Delexicalized parsing Slides delex_parsing Online class recording

7. Tree projection + Treebank translation Tree projection Treebank translation tree_projection tree_translation Online class recording

8. Word Embeddings Slides embeddings Online class recording

9. Contextual Word Embeddings Slides bert Online class recording

10. Multilingual Machine Translation (Ondřej Bojar) Slides Online class

Syntax harmonization and Enhanced Universal Dependencies Slides

Machine Translation: Alignment and Phrase-Based MT (Ondřej Bojar) Slides


Requirements

Homework tasks

There will be homework from most of the classes, typically based on finishing and/or extending the exercises from that class.

To pass the course, you will be required to actively participate in the classes and to submit homework tasks. The quality of your homework solutions will determine your grade.

Grading rules

You get some points for each homework. A good solution gets 3 points – a weaker solution gets less, a stronger solution gets more. Then, if your final average of points per homework is at least 3, you get the grade 1; otherwise you get a lower grade.

1. Introduction; WALS

 Feb 21 Slides wals

2. Alphabets, encoding, language identification

 Feb 28 Slides

  • langid library and the accompanying paper
  • cleaned UDHR dataset, originally from here
  • mix of languages
  • Unidecode for transliterating nearly any characters into ASCII
    • A similar but different tool by Dan Zeman can be found in /home/zeman/projekty/transliterace/translit.pl on the ÚFAL network, if you have access.
  • get iso codes of languages here
  • unicodedata for various unicode stuff in Python
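  • a minimal sketch of using langid and unicodedata together (assuming langid has been installed, e.g. with pip install langid; the sample sentence is just an illustration):

    #!/usr/bin/env python3
    # Identify the language of a line and inspect its characters.
    import langid          # pip install langid
    import unicodedata

    line = "Všichni lidé rodí se svobodní a sobě rovní co do důstojnosti a práv."

    # langid.classify returns a (language code, score) pair
    lang, score = langid.classify(line)
    print(lang, score)

    # unicodedata gives the category and name of each character
    for ch in sorted(set(line)):
        print(repr(ch), unicodedata.category(ch), unicodedata.name(ch, "UNKNOWN"))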

3. Tokenization and Word Segmentation

 Mar 06 Slides tokenization

4. Interset, POS harmonization

 Mar 20 Slides pos_harmonization Online class recording

5. Cross-lingual POS tagging

 Mar 27 Slides pos_tagging Online class recording

6. Delexicalized parsing

 Apr 3 Slides delex_parsing Online class recording

7. Tree projection + Treebank translation

 Apr 17 Tree projection Treebank translation tree_projection tree_translation Online class recording

8. Word Embeddings

 Apr 24 Slides embeddings Online class recording

9. Contextual Word Embeddings

 May 15 Slides bert Online class recording

10. Multilingual Machine Translation (Ondřej Bojar)

 May 22 Slides Online class

Syntax harmonization and Enhanced Universal Dependencies

 not taught Slides

Machine Translation: Alignment and Phrase-Based MT (Ondřej Bojar)

 not taught Slides

Rules

  • use any programming language you like (we suggest Python)
  • submit
    • source codes
    • short report
      • at least a few sentences, saying what you did, how you did it, how it worked, what you observed in the results, etc.
      • if it makes sense, please also include a sample of the results/outputs
      • it can be a long report if there is a lot to say about what you did, but otherwise a few sentences are sufficient
      • the report is more important than the source codes: we may or may not check/run your code, but we will always read the report
      • use any reasonable format you like (TXT, PDF, MD, DOC...)
  • submit via a Git repository
    • create a Git repository somewhere, probably ÚFAL Redmine or faculty GitLab
    • give read access to the repository to Rudolf and send him the address of the repository
    • the deadline is in ~1.5 weeks by default (in 2020 this means next week Sunday 23:59)
  • you will get points
    • 3 points is the base for an OK solution
    • fewer points for a bad solution, 0 points for no solution
    • more points for a great solution, doing something clever, doing more work, going deeper, finding something good...
    • if your point average is at least 3 at the end of the semester, you get the grade 1
  • feel free to go deeper
    • We are operating at the edge of the current research frontier, so in any of the assignments there is a chance that you will discover something new (worth publishing at a scientific conference, investigating further in a diploma thesis, etc.)
    • So feel free to go as deep as you want in any of the assignments!
    • You can even diverge from the task if you come up with something more interesting to do. Just follow your imagination :-) because this is how research is done.
    • You will get more points if you do anything beyond the base task (and if it is extra interesting, we can talk about publishing it in a scientific paper).
    • But also feel free to simply do the assignment as it is set, this will still give you 3 points. You can do more, but you do not have to.
  • In year 2020, instead of doing some of the assignments, you can participate in the SIGTYP shared task on prediction of typological features, see below

sigtyp

 Deadline: July 1  some points

In year 2020 there is a SIGTYP shared task on the prediction of typological features.

  • Instead of doing some or all of the homework tasks for the course, you can participate in this shared task
  • Depending on how many of you are interested, we can form a team or even several teams
  • A sufficiently sophisticated submission to the task can be worth 30 points (effectively replacing all homework tasks)
  • Any valid submission to the task is worth at least 12 points (i.e. four homework assignments)

Currently, the shared task website contains some of the information for the task. Some other information may be available in e-mails which Dan Zeman is getting, so he might know answers to some questions we might have about the task. However, for some questions, we might have to ask the task organizers anyway.

There is a GitHub repository for the shared task. We have our fork for our work on the task; we suggest that we form one team together and use that common Github repo for everything (and Github issues to track tasks and progress etc.) Tell us your Github account to get push access.

Currently, train and dev data have been released on GitHub, plus there are some trial data which are, confusingly, in a slightly different format. We currently assume that the format used in the shared task will be that of the train and dev data, not that of the trial data, but that some feature values will be filled in by question marks ("?") and these will need to be predicted.

In the dev data there are no missing values, so we suggest simulating the task setting by replacing some part of the values in the dev data with question marks (e.g. for each language, replace the first half of the feature values and try to predict them using the second half). Then we suggest evaluating by micro-averaged accuracy, i.e. computing the percentage of correctly predicted values (a small sketch of such a simulation with the simplest baseline is given after the list of suggested approaches below).

Suggested approaches (simpler to more complex):

  • majority voting based on language family (the language genera in the train and test data will probably have no overlap)
  • prediction by the closest language (find the most similar language based on the filled-in features as well as language family and GPS coordinates, and copy values from that language; if a value is missing there, take e.g. the second most similar language, etc.)
  • a combination of the two: weighted voting (weight = language similarity)
  • looking for intralingual causation or correlation (such as SVO implies SV, or postpositions imply OV), probably using some statistical methods such as CCA
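A minimal sketch of the dev-data simulation and of the simplest genus-based majority baseline (the languages dict and the hidden cells below are made-up stand-ins for the real dev data, chosen by hand just to keep the example small):

    #!/usr/bin/env python3
    # Simulate missing values in the dev data and fill them in by majority vote
    # within the language's genus; report micro-averaged accuracy.
    from collections import Counter

    languages = {
        "ces": {"genus": "Slavic", "feats": {"order": "SVO", "genders": "3"}},
        "rus": {"genus": "Slavic", "feats": {"order": "SVO", "genders": "3"}},
        "pol": {"genus": "Slavic", "feats": {"order": "SVO", "genders": "3"}},
        "hun": {"genus": "Ugric",  "feats": {"order": "SOV", "cases": "18"}},
    }

    # cells to hide (in the real setting: replace e.g. half of the values by "?")
    hidden = {(lang, feat): languages[lang]["feats"].pop(feat)
              for lang, feat in [("ces", "order"), ("hun", "cases")]}

    def predict(lang, feat):
        """Majority vote over same-genus languages that still have the value."""
        genus = languages[lang]["genus"]
        votes = Counter(e["feats"][feat] for l, e in languages.items()
                        if l != lang and e["genus"] == genus and feat in e["feats"])
        return votes.most_common(1)[0][0] if votes else "?"

    correct = sum(predict(lang, feat) == value for (lang, feat), value in hidden.items())
    print("micro-averaged accuracy:", correct / len(hidden))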

The task website also lists some existing work on the topic.

wals

 Deadline: Mar 1  3 points

  • WALS online for clicking
  • language.tsv -- WALS dataset for computer processing (free to download in CSV; this file has been converted to TSV for convenience, but it was generated in 2018 and WALS has been updated in the meantime, so you may want to download the new original WALS dataset instead)
  • grepping and cutting in the WALS dataset
  • Homework: a script for measuring language similarity using the WALS dataset
    • Idea: the similarity of a pair of languages can be estimated by comparing their WALS features, e.g. by counting the number of WALS features in which they are similar (Agić, 2017). The simplest way is to iterate over the features, ignoring those that are undefined for one of the two languages, and adding 1 to the score if the values match or 0 if they do not. If you then divide this by the number of compared features, you get the Hamming similarity (see the sketch after this list).
    • You can either do the tasks 1-3 (1 is really THE task, 2 and 3 are just simple extensions), or you can do the harder alternative task.
    • Task 1: input = WALS code of one language, output = WALS code and similarity scores for most similar languages.
    • Task 2: input = genus (e.g. "Slavic"), output = centroid language of that genus, i.e. a language most similar to other languages of the genus
    • Task 3: find the weirdest language, i.e. most dissimilar to any other language (for whole WALS, or for a given language genus/family)
    • Alternative task: automatically generate missing values in WALS (e.g. if all Slavic languages have the number of genders either 3 or unspecified, you can probably set the unspecified values to 3). This is a harder task, so if you do this one, you do not have to do the tasks 1-3. In year 2020, there is a shared task for this, which can replace some or all of the homework tasks, see above.
    • The definition of the task is somewhat vague; feel free to spend as much or as little time on it as you wish
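    • a minimal sketch of the Hamming-similarity computation for Task 1 (it assumes that language.tsv has a header row, that the first column is the WALS code, and that a fixed number of leading columns are metadata rather than features -- check the actual header and adjust N_META):

      #!/usr/bin/env python3
      # Given a WALS code, print the most similar languages by Hamming similarity
      # over the WALS features that are defined for both languages.
      import csv
      import sys

      N_META = 10                               # assumed number of leading metadata columns

      query = sys.argv[1]                       # WALS code of the input language, e.g. "cze"
      with open("language.tsv", encoding="utf-8") as f:
          rows = list(csv.reader(f, delimiter="\t"))
      header, rows = rows[0], rows[1:]          # header row is not used further
      feats = {row[0]: row[N_META:] for row in rows}

      def similarity(a, b):
          # compare only features defined (non-empty) for both languages
          shared = [(x, y) for x, y in zip(feats[a], feats[b]) if x and y]
          if not shared:
              return 0.0
          return sum(x == y for x, y in shared) / len(shared)

      scores = [(similarity(query, code), code) for code in feats if code != query]
      for score, code in sorted(scores, reverse=True)[:20]:
          print(f"{code}\t{score:.3f}")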

tokenization

 Deadline: Mar 15  3 points

  • One tokenizer you may often encounter is the Moses tokenizer:
    mkdir -p mosestok/tokenizer/; cd mosestok/tokenizer/
    wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/tokenizer.perl
    chmod u+x tokenizer.perl; cd ..; mkdir -p share/nonbreaking_prefixes/; cd share/nonbreaking_prefixes/
    wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
    cd ../../..
    mosestok/tokenizer/tokenizer.perl -h
  • A quite powerful tokenizer is part of UDPipe -- download UDPipe 1.2.0, download the UD 2.4 models, see the UDPipe manual
  • Try tokenizing the sentences from the slides with Moses tokenizer and with UDPipe tokenizer -- see Running UDPipe tokenizer
    • hint: udpipe --tokenize path/to/model < input.txt
    • Playing with quotes: “English” — ‘English’ — „česky“ — ‚česky‘ — « français » — ‹ français › — „magyar” — »magyar« — ’magyar’ -- and ``tex quotes'' --- 'cause it's a mess, you know... But don’t don‘t don’t don’t don't talk 'bout that too much or students' heads'll explode!
    • Varied Chinese punctuation: 「你看過《三國演義》嗎?」他問我。“你看過‘三國演義’嗎?”他問我.
    • Vietnamese: Tất cả đường bêtông nội đồng thành quả
    • Japanese: 経堂の美容室に行ってきました。
    • Spanish: «¡María, te amo!», exclamó Juan. “María, I love you!” Juan exclaimed. ¿Vámonos al mar? Escríbeme a rur@nikde.eu. Soy de 'Kladno'... Tiene que bañarse.
  • Download the cleaned UDHR dataset and try tokenizing some of the texts with UDPipe
    • cmn = Chinese (Mandarin), yue = Cantonese, jpn = Japanese, vie = Vietnamese...
  • Universal Dependencies -- download UD 2.5
  • Some languages have more than one treebank. Does the tokenizer work similarly well on each of them? (I.e. are the treebanks tokenized similarly?) See Measuring Model Accuracy in UDPipe manual
    • hint: add the --accuracy switch and use the treebank test file (xyz-ud-test.conllu) as input
  • Task A: Some languages have little or no training data in UD 2.3 and there is no trained UDPipe tokenizer for them yet. How would you tokenize e.g. Cantonese (yue), Buryat (bxr), or Upper Sorbian (hsb)?
    • Try to find a reasonable UDPipe tokenization model for these three languages (e.g. for tokenizing Cantonese, maybe using the Chinese model makes sense?).
    • You may try to reuse what you did in hw_wals ;-)
    • Report which tokenizer you chose for each of the languages, how you did that, and what accuracy it achieves (again, evaluate the tokenizer on the test data).
  • Task B: train a UDPipe tokenizer for one of the languages for which no trained model is available -- see Training UDPipe Tokenizer
    • hint: udpipe --train --tagger=none --parser=none output_model.udpipe < xyz-ud-train.conllu
    • hint: no model is available when there is little data for the language (typically the training data is missing), so you either have to do a different split of the data, or perform n-fold cross-validation (so that you have something to evaluate the tokenizer on); see the sketch after this list
    • report the accuracy you got, and compare it to using an existing tokenizer model trained on larger data for a different language
    • sanity check: also run the tokenizer on some plaintext data for the language (probably from UDHR) and check that it actually does perform some reasonable-looking tokenization
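    • a minimal sketch of such a split (holding out every 10th sentence, i.e. one fold of 10-fold cross-validation; the input file name is just an example):

      #!/usr/bin/env python3
      # Split a small CoNLL-U file sentence-wise into train and test parts,
      # so that a tokenizer can be trained and still evaluated on held-out data.
      # Usage: ./split_conllu.py xyz-ud-test.conllu
      import sys

      with open(sys.argv[1], encoding="utf-8") as f:
          # CoNLL-U sentences are separated by blank lines
          sentences = f.read().strip().split("\n\n")

      # hold out every 10th sentence for testing (one fold of 10-fold cross-validation)
      train = [s for i, s in enumerate(sentences) if i % 10 != 0]
      test  = [s for i, s in enumerate(sentences) if i % 10 == 0]

      for name, part in (("train", train), ("test", test)):
          with open(f"{sys.argv[1]}.{name}", "w", encoding="utf-8") as out:
              out.write("\n\n".join(part) + "\n\n")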

pos_harmonization

 Deadline: Mar 29  3 points

  • Tagset harmonization exercise: You get a syntactic parser trained on the UD tagset (UPOS and Universal Features), and data tagged with a different tagset. Try to convert the tagset into the UD tagset to get better results when applying the parser to the data.
    • The data in the CoNLLU format and the trained UDPipe models can be found at http://ufallab.ms.mff.cuni.cz/~rosa/npfl120/pos_harm/.
    • Running the parser
      • To run the parser and get results in the CoNLLU format, use e.g.: cat ta-ud-dev-orig.conllu | ./udpipe --parse ta.sup.parser.udpipe
      • To view the tree structures in the CoNLLU data, you can use e.g. view_conll or Udapi.
      • To evaluate the parsing accuracy, use e.g.: cat ta-ud-dev-orig.conllu | ./udpipe --parse --accuracy ta.sup.parser.udpipe
    • The tagset documentations are provided (in practice it is often quite hard to get proper documentation for a tagset, but we decided to be nice to you).
    • Try to achieve some reasonable parsing accuracy – I guess at least 50% should be achievable rather easily.
      • Note that 100% accuracy is not reachable; the UAS upper bounds (measured on UD test data) are: CS 90%, DE 85%, EN 88%, LA 68%, TA 78%
    • Your task is to try to do the harmonization yourself, not using any pre-existing tools for that.
    • Homework:
      • Harmonize the tagset for one of the languages.
      • You can use the template harmonize.py (an independent minimal sketch of the same idea is also shown at the end of this section)
      • Turn in the code that you used.
      • Report the parsing accuracy before and after your harmonization (both UAS and LAS); please measure the accuracy repeatedly during the development and report which changes to your solution brought which improvements of the parsing accuracy.
      • The minimum is to identify some of the main POS categories, such as verbs, nouns, adjectives, and adverbs, so that you get a reasonable parsing accuracy. For doing that, you can get 2 points for the homework. You can get more points if you further improve your solution; some suggestions are listed below.
      • You can try to identify more POS categories; ideally you should map all of the original POS tags to some UPOS tags.
      • (You can try to produce some of the Universal Features (documentation) – but this will most probably not work well, as UDPipe uses the features as one atomic string.)
      • You can try to cover all of the languages, at least in a basic way.
      • You can figure out how to use Interset (see the lecture), use it to harmonize the tagset, and compare the parsing accuracy achieved when using your solution and when using Interset (but you still need to create at least a simple solution of your own).
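    • A minimal, independent sketch of such a hand-written harmonization (it assumes the original tag is in the XPOS column of the provided files -- adjust ORIG_COL if it is elsewhere -- and the prefix rules are made-up examples, not a real mapping for any of the tagsets):

      #!/usr/bin/env python3
      # Read CoNLL-U from stdin, rewrite the UPOS column based on the original tag,
      # write CoNLL-U to stdout.
      import sys

      ORIG_COL, UPOS_COL = 4, 3               # 0-based CoNLL-U column indices (XPOS, UPOS)

      PREFIX_MAP = [                          # (original tag prefix, UPOS) -- examples only
          ("N", "NOUN"),
          ("V", "VERB"),
          ("A", "ADJ"),
          ("D", "ADV"),
      ]

      def harmonize(tag):
          for prefix, upos in PREFIX_MAP:
              if tag.startswith(prefix):
                  return upos
          return "X"                          # unknown

      for line in sys.stdin:
          cols = line.rstrip("\n").split("\t")
          if len(cols) == 10 and cols[0].isdigit():
              cols[UPOS_COL] = harmonize(cols[ORIG_COL])
          print("\t".join(cols))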

pos_tagging

 Deadline: Apr 05  3 points

  • devise a cross-lingual POS tagger for one under-resourced target language
    • start here, finish as homework
    • report what you did and your POS tagging accuracy on the UD test data
  • suggested target language: Kazakh (kk) / Telugu (te)
    • there is a small amount of training data in UD, so let's pretend there is none and just use the test data
    • there are some reasonable parallel data
    • there is at least one reasonable high-resource source language for each of these to project the POS tags from -- choose the source language(s) yourself
  • POS projection over (multi)parallel data
    1. take parallel data -- I suggest Watchtower and/or OpenSubtitles from OPUS:

      • Watchtower (do not share) by Agić+ (2016)
      • OPUS by Tiedemann+ (2004)
      • WTC data is in a multiparallel format
        • the same line in all the files corresponds to the same sentence in the various languages
        • but some lines may be empty, as not all sentences are present in all the files
      • some Opus data are multiparallel, but I don't know how to easily get the multiparallel sentence alignment
        • so if you use multiple sources at once, I suggest you use WTC
    2. POS tag the source side of the parallel data

      • you can use the trained UDPipe models

      • tokenize and tag with UDPipe:

        udpipe --tokenize --tag path/to/model < input.txt > output.conllu
        
      • or tag an already tokenized text:

        udpipe --tag --input=horizontal path/to/model < input.txt > output.conllu
        
      • or to only convert tokenized text to CONLLU format:

        udpipe --input=horizontal path/to/model < input.txt > output.conllu
        
    3. word-align source and target

      • you can use Giza++ or efmaral or FastAlign (see below)

      • I suggest using intersection alignment symmetrization, but you can play with this a bit

      • FastAlign installation:

        git clone https://github.com/clab/fast_align.git
        cd fast_align
        mkdir build
        cd build
        cmake ..
        make
        
      • FastAlign usage (add -s to also output alignment scores):

        paste cs sk | sed 's/\t/ ||| /' | grep '. ||| .' > cs-sk
        fast_align -d -o -v -i cs-sk > cs-sk.f
        fast_align -d -o -v -r -i cs-sk > cs-sk.r
        atools -i cs-sk.f -j cs-sk.r -c intersect > cs-sk.i
        
    4. project POS tags through the alignment from the tagged source to the non-tagged target

      • you can use the template pos_project.py (but it was created for a slightly different purpose, so you may need to change it a bit or a lot); a minimal projection sketch is also given at the end of this section
      • take inspiration from the lecture to do the projection
        • simply copying the POS tag from source to target with no other tricks is sufficient to get 2 points for the assignment
          • you still need to do something with unaligned words or multiply aligned words (e.g. voting or weighted voting, or simply use the knowledge that NOUN is usually the most frequent POS...)
        • doing something more clever carries more points
        • ideally start with the simple solution, measure the base accuracy, then implement some improvements, and repeatedly measure the increase in accuracy (if any)
    5. train tagger on the target data

      udpipe --train --tokenizer=none --parser=none --tagger='use_xpos=0;use_features=0' output.model < input.conllu
      
    6. evaluate the tagger on target test data

      udpipe --tag --accuracy path/to/model < test.conllu
      
  • other notes (not important for this HW)
    • you can use the HunAlign sentence aligner if you use parallel data that are not sentence-aligned: install_hunalign.sh, hun_align.sh
    • some data in Opus are weird; OpenSubtitles and Tanzil are nice
    • once you have word-aligned data, you can also extract a simple word-to-word translation dictionary (this single-best translation is weaker than e.g. Moses as it does not take the context into account)
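  • a minimal sketch of the projection step (step 4 above); the file names are examples, and the "attach NOUN to everything unaligned" fallback is the crudest possible choice:

    #!/usr/bin/env python3
    # Project UPOS tags through the word alignment from the tagged source side
    # to the untagged target side, writing target-side CoNLL-U to stdout.
    # Assumed inputs (adjust the file names to your own data):
    #   source.conllu  tagged source side (UDPipe output), same sentence order as the alignment
    #   target.txt     tokenized target side, one sentence per line
    #   align.i        intersection alignment, "src-tgt" 0-based index pairs (FastAlign)
    def conllu_upos(path):
        """Yield the list of UPOS tags for each sentence of a CoNLL-U file."""
        with open(path, encoding="utf-8") as f:
            for block in f.read().strip().split("\n\n"):
                tags = []
                for line in block.split("\n"):
                    cols = line.split("\t")
                    if len(cols) == 10 and cols[0].isdigit():
                        tags.append(cols[3])
                yield tags

    with open("target.txt", encoding="utf-8") as tgt, open("align.i", encoding="utf-8") as ali:
        for tags, tgt_line, ali_line in zip(conllu_upos("source.conllu"), tgt, ali):
            words = tgt_line.split()
            projected = ["NOUN"] * len(words)        # crude fallback for unaligned tokens
            for pair in ali_line.split():
                s, t = map(int, pair.split("-"))
                if s < len(tags) and t < len(words):
                    projected[t] = tags[s]
            for i, (word, upos) in enumerate(zip(words, projected), start=1):
                print(f"{i}\t{word}\t_\t{upos}\t_\t_\t_\t_\t_\t_")
            print()                                  # sentence separator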

delex_parsing

 Deadline: Apr 12  3 points

  • applying lexicalized versus delexicalized parsers in a monolingual and a cross-lingual setting
    • lexicalized ("sup") and delexicalized ("delex") UDPipe 1.2 models trained on UD 2.1 treebanks

    • language groups for experimenting:

      • Norwegian (no), Danish (da), Swedish (sv)
      • Czech (cs), Slovak (sk)
      • Spanish (es), Portuguese (pt)
    • UD treebanks

    • evaluating a trained UDPipe parser on test treebank data (only parsing, no tagging!):

      udpipe --parse --accuracy path/to/model < test.conllu
      
    • training a delexicalized UDPipe parser (without morpho features):

      cat cs-ud-train.conllu | ./udpipe --train --parser='embedding_form=0;embedding_feats=0;' --tokenizer=none --tagger=none cs.delex.parser.udpipe
      
  • Homework:
    • Extend your cross-lingual POS tagging homework to cross-lingual parsing
    • Train a delexicalized parser on a source language treebank, and apply it to your cross-lingually-POS-tagged target-language data
    • Report the parsing accuracies you obtain (LAS and UAS)
    • You may also try the source-lexicalized parsing:
      • Train a standard lexicalized parser (but still without morpho features) on the source language
      • Apply it to the target language (without any translation)
      • This will only work well if there is a substantial amount of shared vocabulary between the source and the target language, i.e. they are lexically very close
  • Other notes -- combining multiple parsers via the MST algorithm (you do not have to do this in this HW):
    • parse a sentence with multiple parsers -- you get multiple parse trees, i.e. 3 sets of dependency edges if you used 3 parsers
    • assign weights to the edges (e.g. 1 if the edge appeared in one parser output, 2 if in 2, etc.; or incorporate language similarity into the weights as well, i.e. edges from less similar languages get a lower weight)
    • give the list of edges and their weights to an MST algorithm, which outputs the best tree that can be constructed from the edges
    • you can use my Perl wrapper of the Perl Graph::ChuLiuEdmonds library (or look at my code and use the library directly from your Perl code; unfortunately I am unaware of any good implementation of a directed MST algorithm in Python)
      • my wrapper takes standard input (one sentence per line) and writes to standard output (one sentence per line)

      • the input format is

        number_of_nodes parent child weight parent child weight...
        

        where parent and child are 1-based integer IDs of the parent and child nodes of the edge and the weight is a weight you assign to the edge, so e.g.:

        3 0 2 1.5 2 1 0.5 2 3 0.5 3 1 1.2
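      • a small sketch of building such an input line from a set of weighted edges collected from several parsers (the edges and weights are a made-up example matching the line above):

        # (parent, child) -> weight; 1-based node IDs, 0 = artificial root (made-up example)
        edges = {(0, 2): 1.5, (2, 1): 0.5, (2, 3): 0.5, (3, 1): 1.2}
        n_nodes = 3
        line = str(n_nodes) + " " + " ".join(
            f"{parent} {child} {weight}" for (parent, child), weight in edges.items())
        print(line)   # prints: 3 0 2 1.5 2 1 0.5 2 3 0.5 3 1 1.2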
        

tree_projection

 Deadline: Apr 26  3 points

  • Projecting trees over parallel data:
    • all data is here: PUD = parallel treebanks, align = alignments by FastAlign
    • Beware: CONLL-U token IDs are 1-based, FastAlign token IDs are 0-based
    • Beware: tokens with a non-integer ID (like 5-6 or 8.1) are not part of the tree or the alignment (so you can probably just grep them away)
    • Beware: forms and lemmas can contain spaces in CONLL-U
    • You can use the template project.py which I prepared (it does the reading in and writing out)
    • because this is a parallel treebank, you have gold-standard annotation for both the source tree and the target tree, so you can measure the accuracy of your projection (in real life you have parallel data without any annotation, so you need to parse the source data with a parser and then train a parser on the target data)
      • you can use e.g. my evaluator.py for that
      • use it e.g. as python3 evaluator.py -j -m head gold.conllu pred.conllu
      • run it as python3 evaluator.py -h for more info; most importantly, you can also use -m deprel or -m las
    • Homework:
      • implement the projection somehow (a minimal sketch of the core idea is given after this list)
      • try to ensure that what you produce is a rooted tree (only one root, all nodes have a head assigned, no cycles); report how you did this and if you succeeded
      • evaluate your solution automatically for several language pairs and report the scores
      • ideally, also compare the accuracies to the delex parsing approach
      • report what you found out
      • Alternatively, you can try to use the tree translation approach (see below); it is sufficient to only do one of the approaches, either projection or translation.
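    • A minimal sketch of the core projection idea (the reading and writing of CoNLL-U is what the project.py template already handles; the variable names are mine, and the result is not yet guaranteed to be a single-rooted tree, which the homework still asks you to ensure):

      def project_heads(src_heads, a2b, n_tgt):
          """src_heads[i] = head of source token i (1-based IDs, 0 = root; index 0 unused);
          a2b maps source token ID -> aligned target token ID (1-based, intersection alignment);
          returns the list of projected heads for target tokens 1..n_tgt."""
          b2a = {t: s for s, t in a2b.items()}          # target -> source
          tgt_heads = [0] * (n_tgt + 1)                 # default: attach to the root
          for t in range(1, n_tgt + 1):
              s = b2a.get(t)
              if s is None:
                  continue                              # unaligned target token stays attached to the root
              src_head = src_heads[s]
              # the projected head is the target counterpart of the source head
              tgt_heads[t] = 0 if src_head == 0 else a2b.get(src_head, 0)
          return tgt_heads[1:]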

tree_translation

 Deadline: Apr 26  3 points

  • Lab: cross-lingual parsing lexicalized by translation of the training treebank using machine translation (this is an alternative to tree projection)
    • we get back to the VarDial 2017 cross-lingual parsing shared task setup: 3 language pairs (one is actually a triplet), using supervised POS tags:
      • Czech (cs) source, Slovak (sk) target
      • Slovene (sl) source, Croatian (hr) target
      • Danish (da) and/or Swedish (sv) source, Norwegian (no) target
    • choose any language pair you want to, or use other languages if you want to
    • for the language pairs above, some datasets are prepared for the lab (but that's only a minor convenience, you can simply use UD treebanks and e.g. OpenSubtitles or WatchTower parallel data for any languages)
      • "treebanks" are the training treebanks for the source languages and evaluation treebanks for the target languages
      • "smaller_delex_models" are the baselines, i.e. delexicalized UDPipe parsers trained on the first 4096 sentences from the training treebanks; apply them to the target evaluation treebanks to measure the baseline accuracy (around 55 LAS I think)
      • "our_vardial_models" are lexicalized parsing models which we submitted into the competition, about +5 LAS above the baselines (can you beat us?! :-))
      • "para" are parallel data, obtained from OpenSubtitles2016 aligned by MonolingualGreedyAligner with intersection symmetrization (the format of the data is "sourceword[tab]targetword" on each line); there are also "tag" variants where POS tag and morphological features are annotated for the source word
      • "translate_treebank.py" is a simple implementation of treebank translation which you can use for your inspiration
    • the baseline approach is to translate each word form in the source treebank (second column) by its most frequent target counterpart from the parallel data (as done by the sample "translate_treebank.py" script; see also the sketch at the end of this section), and then train a standard UDPipe parser on the translated treebank:
      udpipe --train --tokenizer=none --tagger=none out.model < train.conllu
      and evaluate the parser on the target evaluation treebank:
      udpipe --parse --accuracy out.model < dev.conllu
    • there are many possible improvements to the approach:
      • use better word alignment (e.g. FastAlign intersection alignment)
      • use the source POS tags and/or morphological features for source-side disambiguation -- e.g. the word "stát" in Czech should be translated differently as a noun ("state") and as a verb ("stand"); you already have this annotation in the source treebank, and you can get it in the parallel data using a UDPipe tagger trained on the source treebank (which is how we produced the "tag" variants of the para data, which you can use)
      • use multiple source languages -- either combine the parsers using the MST algorithm, or simply concatenate the source treebanks into one (that's what we did in VarDial for Danish and Swedish -- if you see the "ds" language code, this means just that)
      • use a proper MT system (word-based Moses probably?)
      • use your knowledge of the target language for some additional processing
      • guess some translations for unknown words
      • pre-train target language word embeddings with word2vec (on some target language plaintext -- you can also use the target side of the parallel data) and provide the pre-trained embeddings to UDPipe in training; see the UDPipe manual; another good option is to download pre-trained FastText word embeddings from fasttext.cc (use the text format, this is what UDPipe can read in)
      • etc.; you can also have your own ideas for improvements
  • Homework (an alternative to the `tree_projection` homework; it is sufficient to just do one of them):
    • implement cross-lingual parsing lexicalized by treebank translation (it is sufficient to use one language pair, either one of the above or your own)
    • describe what you did and report achieved LAS scores evaluated on the target language treebank
    • doing the simplest baseline lexicalization approach described above carries 2 points
    • implementing some of the improvements carries more points
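  • a minimal sketch of the baseline translation step, independent of the provided translate_treebank.py (the para file name is an example; the para format is one "sourceword[tab]targetword" pair per line):

    #!/usr/bin/env python3
    # Build a most-frequent-translation dictionary from the parallel word pairs
    # and translate the FORM column of a source treebank read from stdin;
    # unknown words are kept untranslated.
    import sys
    from collections import Counter, defaultdict

    counts = defaultdict(Counter)
    with open("para.cs-sk", encoding="utf-8") as f:            # example file name
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                counts[parts[0]][parts[1]] += 1
    translate = {src: tgt.most_common(1)[0][0] for src, tgt in counts.items()}

    for line in sys.stdin:                                      # source treebank in CoNLL-U
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 10 and cols[0].isdigit():
            cols[1] = translate.get(cols[1], cols[1])           # FORM column
        print("\t".join(cols))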

embeddings

 Deadline: May 03  3 points

  • monolingual word embeddings: https://fasttext.cc/
    • download and install fastText (it is sufficient to build it with make)
    • download and gunzip a model
      • Models for 157 languages
      • download both the bin format and the text format (bin is for the fasttext tools, text for any other usage)
    • run fasttext to see all the available options
      • fasttext nn cc.en.300.bin
        • input e.g. "dog" to see words most similar to a dog
      • fasttext analogies cc.en.300.bin
        • input e.g. "teacher school hospital" (which means "teacher - school + hospital") to see what happens when you replace the "schoolness" of a teacher by "hospitalness"
      • embeddings visualisation: https://projector.tensorflow.org/
  • bilingual embeddings: https://github.com/artetxem/vecmap
  • homework assignment: cross-lingual parsing with bilingual word embeddings
    • choose a source language and a target language

    • get word embeddings for both the languages

    • perform a cross-lingual mapping of the embeddings with VecMap (or another tool if you want)

      • you can improve the results by using a bilingual dictionary extracted from parallel data -- e.g. take intersection alignment and construct a dictionary from all aligned pairs of words -- and use the supervised or semi-supervised setting
      • or you can use the identical or unsupervised setting
      • note that the supervised variant runs much faster (~2 minutes) than the other options (~5 hours unless you have a GPU)
    • create one bilingual embeddings file

      • from VecMap, you will get two new embedding files, one for the source language and one for the target language, which contain similar vectors for similar words across languages
      • we need one mapping of words from both languages into a common space, which we can give to UDPipe as the word representation
      • (or we would need to train UDPipe with the crosslingual embeddings for the source language and then exchange its embedding vocabulary for the target language crosslingual embeddings, which may or may not be possible)
      • so we can just concatenate the two files to get bilingual embeddings (ideally being a little clever about the header line and duplicates; see the sketch at the end of this section)
    • train a UDPipe parser using the bilingual embeddings

         udpipe --train source.udpipe --tokenizer=none --tagger=none --parser='embedding_form_file=bilingual_embeddings.txt' source.conllu
      
    • evaluate the trained model

      • evaluate the model on a test treebank for both the source and the target language
      • compare with some meaningful alternatives (delex parser, projected parser, supervised parser...)
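    • a small sketch of the concatenation, taking care of the header line and of duplicate words (the file names are examples; the files are in the standard word2vec text format that VecMap reads and writes):

      #!/usr/bin/env python3
      # Merge two embedding files in word2vec text format into one bilingual file.
      # The first line of each file is "<number of words> <dimension>"; if a word
      # occurs in both files, the vector from the source-language file is kept.

      def read_embeddings(path):
          with open(path, encoding="utf-8") as f:
              count, dim = f.readline().split()
              vectors = {}
              for line in f:
                  word, vec = line.rstrip("\n").split(" ", 1)
                  vectors.setdefault(word, vec)
              return int(dim), vectors

      dim_src, src = read_embeddings("src_mapped.emb.txt")      # example file names
      dim_tgt, tgt = read_embeddings("tgt_mapped.emb.txt")
      assert dim_src == dim_tgt, "embedding dimensions must match"

      merged = dict(tgt)
      merged.update(src)                                         # source vectors win on duplicates
      with open("bilingual_embeddings.txt", "w", encoding="utf-8") as out:
          out.write(f"{len(merged)} {dim_src}\n")
          for word, vec in merged.items():
              out.write(f"{word} {vec}\n")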

bert

 Deadline: May 24  3 points

  • BERT and mBERT by Google: https://github.com/google-research/bert

  • HuggingFace community makes neural NLP easy to use: https://huggingface.co/

  • I will work with BERT and mBERT

    • but feel free to use other models, e.g. DistilBERT (and the multilingual DistilmBERT), which are smaller, faster and lighter
  • Install HuggingFace Transformers (there is a detailed guide on their website)

    # virtual environment
    python3 -m venv venv
    source venv/bin/activate
    
    # install Torch without CUDA (cpu version)
    # transformers can be used from Torch or Tensorflow
    pip install torch==1.5.0+cpu torchvision==0.6.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
    
    # install transformers
    pip install transformers
    
  • Get contextual word embeddings!

    # Imports
    from transformers import BertModel, BertTokenizer
    import torch
    
    # Loads the model (downloads it if not yet downloaded)
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')
    # Some valid options:
    # bert-base-uncased
    # bert-base-cased
    # bert-large-cased
    # bert-base-multilingual-uncased
    # bert-base-multilingual-cased
    
    # Input
    sentence = "A platypus is a mammal."
    
    # Tokenize and convert to token ids
    ids = tokenizer.encode(sentence)
    
    # Let's see the tokens
    # Note the special initial and final token
    bert_tokens = tokenizer.convert_ids_to_tokens(ids)
    
    # Convert to Torch tensor (a batch of 1 sentence)
    t = torch.tensor([ids])
    
    # Run the BERT model
    output = model(t)
    
    # See the contextual embedding of the first "A" word
    # output[0] is the last encoder layer output
    # output[0][0] for the first sentence
    # output[0][0][0] is the [CLS] token
    # output[0][0][1] is the "A" token
    emb_a = output[0][0][1]
    emb_mammal = output[0][0][-3]
    
    # TODO: measure cosine similarity of instances of "a"
    # TODO: measure cosine similarity of "dog" and "pes" in mBERT
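     # One possible way to approach the first TODO (a sketch, not the only way):
     # the index of the second "a" depends on how "platypus" gets split into
     # wordpieces, so look it up in bert_tokens instead of hard-coding it.
     import torch.nn.functional as F
     second_a = len(bert_tokens) - 1 - bert_tokens[::-1].index("a")   # last "a" in the sentence
     emb_a2 = output[0][0][second_a]
     print(F.cosine_similarity(emb_a, emb_a2, dim=0))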
    
  • Let's train a simple BERT-based tagger!

    • Take some treebanks, e.g. PUD; we need English, and e.g. Czech

    • Get BERT contextual embeddings for tokens in the treebank using connlu2vectors.py

      # Load BERT, read CoNLL-U, for each token write UPOS and contextual embedding
      # Uses a rather naive heuristic to join wordpieces back into tokens
      # Skips sentences where this fails
      ./connlu2vectors.py bert-base-uncased < en_pud-ud-test.conllu > en_pud.bert
      
    • Train a simple MLP classifier to predict UPOS from contextual embedding using train_mlp.py

      # Read data, split into train and test,
      # train MLP classifier, report accuracies, save model
      ./train_mlp.py en_pud.bert.model < en_pud.bert
      
    • Apply the classifier to English data, as well as to e.g. Czech data, using eval_mlp.py

      # Evaluating on training data here, basically...
      ./eval_mlp.py en_pud.bert.model < en_pud.bert
      # Using monolingual (English) BERT on Czech: not good
      ./eval_mlp.py en_pud.bert.model < cs_pud.bert
      
    • Do the same but using multilingual mBERT instead of monolingual

      ./connlu2vectors.py bert-base-multilingual-uncased < en_pud-ud-test.conllu > en_pud.mbert
      ./connlu2vectors.py bert-base-multilingual-uncased < cs_pud-ud-test.conllu > cs_pud.mbert
      ./train_mlp.py en_pud.mbert.model < en_pud.mbert
      ./eval_mlp.py en_pud.mbert.model < cs_pud.mbert
      
    • Now the tagger trained on English magically works for Czech as well!

  • Homework assignment

    • try to do cross-lingual POS tagging with mBERT
    • compare several setups
    • you can choose a fixed target language and vary the source languages
    • you can also combine (concatenate) multiple source languages
    • you can try various target languages
    • you can try to mitigate the tokenization mismatch problem
    • you can compare mBERT to vecmap
    • you can play with the classifier setup
    • (if you have access to a GPU, you can also try fine-tuning the mBERT model; but this is very computationally demanding and can take a lot of time, so probably you should not attempt this for the homework assignment)

Conclusion

  • So can we forget everything we learned in previous classes and just use mBERT?
    • Probably in most multilingual and cross-lingual situations, mBERT should be the tool to use.
    • Many of the problems are still there (alphabets, tokenization, language similarity...)
    • Some approaches are somewhat outdated for the task for which we showed them (e.g. delexicalized parsing) but are useful concepts often used elsewhere (e.g. delexicalization in dialogue systems)
  • In the course, you should have learned both general transferable stuff and specific practical stuff
    • Individual tools and methods change fast; you should know and be able to use some of the current tools and methods, but you need to keep learning
    • General ideas and approaches transfer; you should be able to apply your understanding of the problem area even with new tools and methods
    • Language properties stay; we may have new tools to solve the old problems, but the problems themselves will not go away