NPFL070 - Language Data Resources

Lecturers: Zdeněk Žabokrtský, Martin Popel
Time and location: Tuesday 14.00–16.20, Linux lab SU1
the course's site in the Student Information System

Course schedule overview

Introduction
Corpora, esp. Czech National Corpus
Treebanks
Parallel corpora
Resources related to lexical semantics
Named entity, coreference and anaphora, and discourse corpora
Using data for evaluation

More detailed course schedule

Introduction
- organizational matters
- Overview of language data resources: slides
- Warm-up morning exercise - find haiku sentences
  - for a language of your choice, write a Python script that finds sentences which can be split on word boundaries into three contiguous segments with 5, 7, and 5 syllables (ok, haiku for Europeans)
  - Input: read a plain-text utf8 file (e.g. Zápisky z mrtvého domu available at Project Gutenberg) from STDIN
  - Output: print the haiku sentences, each of the three segments on a separate line, with additional empty line separating sentences
  - Simplification: You can approximate the number of syllables by the number of vowels (or vowel sequences).
  - commit your solution into the undergrads svn repository into 2017-npfl070/exercise1 in your directory
- Homework 1 - create a sequence of tools for building a very simple 1MW corpus
  - choose a language different from Czech and English and also from your native language
  - find on-line sources of texts for the language, containing altogether more than 1 million words, and download them
  - convert the material into one large plain-text utf8 file
  - tokenize the file on word boundaries and print 50 most frequent tokens
  - organize all these steps into a Makefile so that the whole procedure is executed after running make all
  - commit the Makefile into hw/my-corpus in your git repository for this course
  - deadline: 12th March 2018
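The haiku warm-up above can be approached roughly as follows. This is only a minimal sketch, not a reference solution: the vowel list assumes a Czech-like alphabet, the sentence splitter is deliberately naive, and the function names (count_syllables, split_haiku, find_haiku) are made up for this illustration.

```python
import re

# Assumption: a Czech-like vowel inventory; adjust VOWELS for other languages.
VOWELS = "aeiouyáéěíóúůý"
VOWEL_RUN = re.compile(f"[{VOWELS}]+", re.IGNORECASE)


def count_syllables(word):
    """Approximate the syllable count by the number of vowel sequences."""
    return len(VOWEL_RUN.findall(word))


def split_haiku(sentence, pattern=(5, 7, 5)):
    """Return three segments with 5, 7 and 5 syllables, or None if impossible."""
    segments, current, total = [], [], 0
    for word in sentence.split():
        if len(segments) == len(pattern):
            return None  # words left over after the last segment
        current.append(word)
        total += count_syllables(word)
        target = pattern[len(segments)]
        if total > target:
            return None  # a word straddles a segment boundary
        if total == target:
            segments.append(" ".join(current))
            current, total = [], 0
    return segments if len(segments) == len(pattern) else None


def find_haiku(text):
    """Yield the three segments of every haiku-like sentence in the text."""
    # Naive sentence splitting on ., ! and ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        segments = split_haiku(sentence)
        if segments:
            yield segments


# Small inline demo; the exercise itself reads the text from STDIN.
demo = "Hello there. An old silent pond a frog jumps into the pond splash silence now."
for segments in find_haiku(demo):
    print("\n".join(segments), end="\n\n")
```

To follow the input and output format of the exercise, the demo text would be replaced by sys.stdin.read() (after import sys). The printing loop already puts each segment on its own line and separates sentences by an empty line.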
Corpora - Case Study: the Czech National Corpus
- (let's finish the general overview slides from the last week first)
- we'll explore the most important corpus for Czech at www.korpus.cz
- Kontext search tool - Czech intro
- POS tagset documentation
- Michal Křen's overview of recent developments in CNC: slides
- Additional reading about corpora:
Czech National Corpus, cont.
- Warm-up exercise:
  - try to assemble POS tags for all tokens in the following sentence (from today's newspapers): "Trumpův ministr uvažuje o odebírání dětí matkám, které přijdou ilegálně."
  - when finished, compare your solution with that of the on-line morphological analyser of Czech Morphodita
  - once again, POS tagset documentation
- Warm-up exercise 2 (if the previous one is too easy for you or if you have spare time): use a tiny multilingual raw-text "corpus" for language identification
  1. use raw-text data for several languages available in minicorp.zip
  2. implement a language identifier in Python; the identifier should ideally recognize any language from the minicorp set
  3. use the minicorp texts for training (e.g. first 1000 lines for each language) and evaluation (next 100 lines) of a language identifier implemented in Python
  4. hint: use "similarity" in occurrences of letter trigrams
  5. commit your solution into the undergrads svn repository into 2017-npfl070/exercise2 in your directory
- construct Kontext queries for the following examples
  1. occurrences of word form "kousnout"; occurrences of all forms of lemma "kousnout"; occurrences of verbs derived from "kousnout" by prefixation (and make frequency list of their lemmas) and occurrences of adjectives derived from such prefixed verbs (and their frequency list too),
  2. name 5 verbs whose infinitive does not end with '-t'; find them in the corpus and make their frequency list
  3. find adjectives with 'illusory negation', such as "nekalý", "neohrabaný", "nevrlý"...
  4. find adverbs that modify adjectives, make their frequency list,
  5. find beginnings of subordinating conditional clauses,
  6. find beginnings of subordinating relative clauses,
  7. find examples of names of (state) presidents (first name + surname), order them according to frequency of occurrences,
  8. find all occurrences of phraseme "mráz někomu běhá po zádech"
  9. find nouns with temporal meaning
  10. find adverbs with locational or directional meaning
  11. find nouns that are typical objects of the verb "kousnout" (and the same for subjects)
  12. find five pairs of alternative or questionable spelling variants, and compare their frequencies using SyD
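The trigram hint from warm-up exercise 2 above can be sketched as follows. This is one possible reading of "similarity", here cosine similarity between letter-trigram count vectors. The two tiny training sets are invented for the demo and are not taken from minicorp.zip, and the function names are hypothetical.

```python
import math
from collections import Counter


def trigram_profile(lines):
    """Count letter trigrams over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        text = f" {line.strip().lower()} "  # pad so that word edges form trigrams
        counts.update(text[i:i + 3] for i in range(len(text) - 2))
    return counts


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(text, profiles):
    """Return the language whose training profile is most similar to the text."""
    query = trigram_profile([text])
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Toy training data (assumption: in the exercise these would be
# the first 1000 lines of each minicorp language file).
train = {
    "en": ["the quick brown fox jumps over the lazy dog",
           "this is a short english sentence"],
    "cs": ["příliš žluťoučký kůň úpěl ďábelské ódy",
           "toto je krátká česká věta"],
}
profiles = {lang: trigram_profile(lines) for lang, lines in train.items()}
print(identify("the dog is lazy", profiles))
```

For the evaluation part of the exercise, identify would be called on each of the held-out lines and the share of correctly labelled lines reported per language.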
Using annotated data for evaluation
- Warm-up exercise: find synonyms (or near synonyms) in English
  - you can use a simple probabilistic English-Czech translation dictionary derived from CzEng czeng-reduced-dict.tsv.gz
  - you can rely on the hint that synonymous words are likely to have the same translation equivalents
  - you should find some reliability metrics for the detected synonymous pairs and sort the pairs, starting from the most reliable synonyms
- Intro to evaluation in NLP: pfl070-evaluation.ppt
- Homework our-annotation (work in pairs): design your own annotation project for a linguistic phenomenon of your choice
  - minimal requirements: annotation in a plain-text format, two annotations by two independently working annotators, at least 50 annotated instances, evaluated inter-annotator agreement, experiment documentation
  - commit the annotated data and experiment documentation into hw/our-annotation/ in your git repository for this course; in each pair, only one student commits the solution, while the second student is only mentioned in the documentation
  - deadline: 10th April, 2018
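The homework above asks for an evaluated inter-annotator agreement but does not prescribe a measure. One common choice for two annotators and categorical labels is Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. A minimal sketch, with a hypothetical function name and invented labels:

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of instances with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance of a match given each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Invented example: two annotators labelling four instances.
annotator_1 = ["N", "N", "V", "V"]
annotator_2 = ["N", "V", "V", "V"]
print(f"kappa = {cohen_kappa(annotator_1, annotator_2):.2f}")
```

With a plain-text annotation format, the two label sequences would simply be read from the two annotators' files, one label per annotated instance.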
Treebanking
- Warm-up exercise: use the Czech National Corpus query interface (and possibly also some command-line postprocessing, if needed) to find types of Czech adjectivals
  - find "subparts of speech" (second position in morphological tags) of words which behave syntactically like adjectives (esp. they can modify nouns), but belong to other parts of speech
  - example: S - possessive pronoun
  - hint: adjectivals appear in similar contexts as adjectives; the context of a word might be modeled as the pair of morphological tags (or their parts) of the left and right neighboring words
  - compare your findings with what you'd consider as adjectival according to the tagset documentation
- Slides on treebanks: pfl070-treebanks.ppt
- An Introduction to the Prague Dependency Treebank: slides/pfl070-pdt-intro.ppt; short description of annotated attributes:
  - morphological layer: slides/Appendix_M_Tags.pdf
  - analytical layer: slides/Appendix_A_Tags.pdf
  - tectogrammatical layer: slides/Appendix_T_Tags.pdf
  - some more details on grammatemes and coreference pdt_gramms_and_coref.ppt
Universal dependencies, Udapi (by Martin Popel)
- Warm-up exercise: Refresh ISO-639 language codes. For each language from UD 2.0 guess its typical word order type (SOV, SVO,...), prepositions vs. postpositions and (subject) pro-drop.
- Universal dependencies (slides adapted from Joakim Nivre and Dan Zeman, slides about UDv2 changes)
- Try UDPipe online service.
- Install Udapi (with git clone), download a UDv2.0 sample (or full UDv2.0) and follow the tutorial.
- Homework 3: Deadline April 10, commit the source code of your blocks and the results to hw/adpos-and-wordorder.
  - Complete tutorial.Adpositions (see the usage hint) and detect which of the UD2.0 treebanks (based on the */sample.conllu files) use postpositions.
  - Write a new Udapi block to detect word order type – for each language (and treebank, i.e. each sample file), compute the percentage of each of the six possible word order types. Hint: Verbs can be detected by upos. Subjects and objects can be detected by deprel, they are Core dependents of clausal predicates.
  - Bonus: Detect which languages are pro-drop (again write a new Udapi block). For a language of your choice, write a block which inserts a node for each dropped pronoun (fill form, lemma, gender, number and person, whenever applicable).
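The word-order counting idea from the homework above can be illustrated without Udapi. The sketch below is plain Python that reads the ten CoNLL-U columns directly, so it is not the Udapi block the homework asks for. The function names and the three-token sample are made up for this illustration, and it assumes that a verb with both an nsubj and an obj child is a reasonable place to observe word order.

```python
from collections import Counter


def read_conllu(lines):
    """Yield sentences as lists of (id, upos, head, deprel) tuples."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            # Skip multiword token ranges (1-2) and empty nodes (1.1).
            if cols[0].isdigit() and cols[6].isdigit():
                sentence.append((int(cols[0]), cols[3], int(cols[6]), cols[7]))
    if sentence:
        yield sentence


def word_order_counts(sentences):
    """Count SVO, SOV, ... patterns over verbs having both a subject and an object."""
    counts = Counter()
    for sentence in sentences:
        for verb_id, upos, _, _ in sentence:
            if upos != "VERB":
                continue
            # Keep only the universal part of deprel (nsubj:pass -> nsubj).
            # If a verb has several such children, the last one wins.
            deps = {deprel.split(":")[0]: tok_id
                    for tok_id, _, head, deprel in sentence if head == verb_id}
            if "nsubj" in deps and "obj" in deps:
                order = sorted([(deps["nsubj"], "S"), (verb_id, "V"), (deps["obj"], "O")])
                counts["".join(letter for _, letter in order)] += 1
    return counts


def percentages(counts):
    """Convert raw counts to percentages of all observed patterns."""
    total = sum(counts.values())
    return {order: 100 * n / total for order, n in counts.items()} if total else {}


# Invented three-token sample in CoNLL-U format.
SAMPLE = [
    "# text = Dogs chase cats",
    "\t".join(["1", "Dogs", "dog", "NOUN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "chase", "chase", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", "cats", "cat", "NOUN", "_", "_", "2", "obj", "_", "_"]),
    "",
]
print(percentages(word_order_counts(read_conllu(SAMPLE))))
```

In a Udapi block the same logic would work on node objects instead of column tuples; the counting and the percentage step stay the same.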
Udapi, cont. (by Martin Popel)
- warm-up: Where (and why) do we use commas in Czech and English?
- exercise1: Implement bhead – a tool like Unix head but instead of first n lines, it prints first n blocks of lines, where blocks are separated by an empty line. It will be useful for sampling conllu files.
- exercise2: Write a Udapi block which changes prepositions to postpositions (moves them after their parent's subtree).
- Homework 4: Deadline April 18, commit your block to hw/add-commas/addcommas.py. Write a Udapi block which heuristically inserts commas into a conllu file (where all commas were deleted). Choose Czech, German or English (the final evaluation will be done on all, with the language parameter set to "cs", "de" or "en", but for getting full points for this hw only the best language result counts). Use the UDv2.0 sample data: you can use the train.conllu and sample.conllu files for training and debugging your code. For evaluating with the F1 measure use the dev.conllu file, but don't look at the errors you made on this dev data (so you don't overfit). The final evaluation will be done on a secret test set (where the commas will be deleted also from root.text and node.misc['SpaceAfter'] using tutorial.RemoveCommas).
  Hints: See the tutorial.AddCommas template block. You can hardlink it to your hw directory: ln ~/udapi-python/udapi/block/tutorial/addcommas.py ~/where/my/git/is/npfl070/hw/add-commas/addcommas.py. For Czech and German (and partially for English) it is useful to detect (finite) clauses first (and finite verbs).
```
cd sample
cp UD_English/dev.conllu gold.conllu
cat gold.conllu | udapy -s \
  util.Eval node='if node.form==",": node.remove(children="rehang")' \
  > without.conllu

# substitute the next line with your solution
cat without.conllu | udapy -s tutorial.AddCommas language=en > pred.conllu

# evaluate
udapy \
  read.Conllu files=gold.conllu zone=en_gold \
  read.Conllu files=pred.conllu zone=en_pred \
  eval.F1 gold_zone=en_gold focus=,

# You should see an output similar to this
Comparing predicted trees (zone=en_pred) with gold trees (zone=en_gold), sentences=2002
=== Details ===
token       pred  gold  corr   prec     rec      F1
,            176   800    40  22.73%   5.00%   8.20%
=== Totals ===
predicted =     176
gold      =     800
correct   =      40
precision =  22.73%
recall    =   5.00%
F1        =   8.20%
```
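The totals in the sample output above follow from the standard definitions of precision, recall and F1. The sketch below only redoes that arithmetic from the three counts (predicted, gold, correct). It is not Udapi's eval.F1 code, and the function name is made up.

```python
def precision_recall_f1(predicted, gold, correct):
    """Compute precision, recall and F1 from raw counts."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts taken from the sample output: 176 predicted commas,
# 800 gold commas, 40 of the predicted ones correct.
p, r, f = precision_recall_f1(predicted=176, gold=800, correct=40)
print(f"precision = {p:7.2%}")
print(f"recall    = {r:7.2%}")
print(f"F1        = {f:7.2%}")
```

Because F1 is the harmonic mean of precision and recall, the very low recall (5.00%) pulls F1 down to 8.20% even though precision is 22.73%.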
  Results (F1) as of 2018-04-18:
  SLOC means source lines of code excluding comments and docstrings. It is reported just for information and plays no role in the evaluation. The homeworks are not code golf; the code should be nice to read.
```
             en-test en-dev  SLOC
1. mp        54.32%  54.42%    53
2. heslo     36.65%  37.65%   131
3. Lampa     35.69%  33.17%    45
4. aaa       30.15%  28.77%    81
5. kenajykul 26.10%  23.45%    82
6. base       8.80%   8.20%    18


             cs-test cs-dev  SLOC
1. mp        88.92%  88.40%    53
2. aaa       81.25%  80.32%    81
3. Lampa     80.26%  80.71%    45
4. kenajykul 69.60%  69.51%    82
5. heslo     67.23%  66.49%   131
6. base       3.49%   3.62%    18


             de-test de-dev  SLOC
1. heslo     73.18%  72.79%   131
2. mp        62.74%  68.16%    53
3. Lampa     51.92%  53.67%    45
4. kenajykul 50.23%  46.86%    82
5. aaa       45.99%  42.40%    81
6. base       5.90%   2.93%    18
```
- What do zone and bundle mean in Udapi? How to compare two conllu files (don't forget you should use train or sample, but not dev for this):
```
udapy -TN < gold.conllu > gold.txt # N means no colors
cat without.conllu | udapy -s tutorial.AddCommas write.TextModeTrees files=pred.txt > pred.conllu
vimdiff gold.txt pred.txt # exit vimdiff with ":qa" or "ZZZZ"
```
- Homework 5: Deadline April 25, commit your block to hw/add-articles/addarticles.py. Write a Udapi block tutorial.AddArticles which heuristically inserts English definite and indefinite articles (the, a, an) into a conllu file (where all articles were deleted). As in the previous homework, the F1 score will be used for the evaluation, just with focus='(?i)an?|the' (note that only the form is evaluated, but it is case sensitive). For removing articles use util.Eval node='if node.upos=="DET" and node.lemma in {"a", "the"}: node.remove(children="rehang")'. Everything else is the same. To get all points for this hw, you need at least 30% F1 (on the secret test set).
  Results (F1) as of 2018-04-25:
```
             en-test en-dev  SLOC
1. mp        41.13%  35.32%  17
2. Lampa     40.28%  34.81%  23
3. kenajykul 37.83%  33.14%  32
4. aaa       37.36%  34.23%  34
5. heslo     37.01%  31.87%  78
6. base      17.64%  15.31%   6
```
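The bhead tool from exercise1 in this section could look roughly like the sketch below. It is one possible solution rather than the expected one; the function works on any iterable of lines, and the default of 10 blocks is an assumption modelled on Unix head.

```python
def bhead(lines, n=10):
    """Yield the lines of the first n blocks; blocks are separated by empty lines."""
    if n <= 0:
        return
    seen = 0          # number of blocks started so far
    in_block = False  # True while inside a run of non-empty lines
    for line in lines:
        if line.strip():
            if not in_block:
                if seen == n:
                    return  # block n+1 is starting: stop here
                seen += 1
                in_block = True
            yield line
        else:
            in_block = False
            yield line  # keep the separator so the output stays valid conllu


# Inline demo with three one-line or two-line blocks.
demo = ["a\n", "b\n", "\n", "c\n", "\n", "d\n"]
print("".join(bhead(demo, 2)), end="")
```

As a command-line filter, the generator can be connected to the standard streams with sys.stdout.writelines(bhead(sys.stdin, n)), where n is taken from the command-line arguments.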
Parsing and practical applications (by Martin Popel)
- warm-up: What is the most popular month and year? Why? How about the frequency in "English Fiction"?
- warm-up: Does the usage of present perfect vs. past simple actually depend on the absence/presence of time details, as some resources suggest?
- advanced usage of Google ngrams viewer
- how to use in Udapi util.See, util.MarkDiff, eval.Parsing
- Homework hw-parse: Deadline May 2, commit your block to hw/parse/parse.py. Write a Udapi block tutorial.Parse, which does dependency parsing (labelled, i.e. including deprel assignment) for English, Czech and German. A simple rule-based approach is expected, but machine learning is not forbidden (using the provided {train,dev}.conllu). Your goal is to achieve the highest LAS (you can ignore the language-specific part of deprel, so "LAS (udeprel)" reported by eval.Parsing is the evaluation measure to be optimized). To get all points for this hw, you need at least 40% LAS on at least one of the three languages or at least 30% LAS average on all three languages (on the secret test sets).
  Results (LAS) as of 2018-05-02:
```
             en-test en-dev |cs-test  cs-dev |de-test  de-dev |avg-test  avg-dev | SLOC
1. mp         58.53% 59.27% | 39.16%  39.60% | 43.75%  45.47% |  47.15%  48.11%  |  93
2. kenajykul  32.24% 31.83% | 39.04%  38.53% | 43.04%  45.17% |  38.11%  38.51%  |  84
3. aaa        34.08% 36.50% | 37.40%  37.02% | 41.29%  42.97% |  37.59%  38.83%  |  71
4. Lampa      31.05% 32.04% | 32.45%  31.47% | 38.13%  38.03% |  33.88%  33.85%  |  80
5. heslo      26.76% 26.79% | 31.89%  32.80% | 42.52%  48.49% |  33.72%  36.03%  | 185
6. base        0.36%  0.66% |  0.13%   0.15% |  0.00%   0.02% |   0.16%   0.28%  |  10
```
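For orientation, the evaluation measure used in the parsing homework above can be written down in a few lines. The sketch below follows the usual definitions: UAS counts tokens with the correct head, LAS additionally requires the correct dependency label, and the language-specific part of the label after the colon is ignored, as in "LAS (udeprel)". It is not the eval.Parsing implementation, and the function names and the four-token example are invented.

```python
def udeprel(deprel):
    """Strip the language-specific subtype, e.g. nsubj:pass -> nsubj."""
    return deprel.split(":")[0]


def attachment_scores(gold, pred):
    """Return (UAS, LAS) for two aligned lists of (head, deprel) pairs."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted trees must have the same number of tokens")
    total = len(gold)
    if total == 0:
        return 0.0, 0.0
    # UAS: the predicted head matches the gold head.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: the head and the universal part of the label both match.
    las_hits = sum(g[0] == p[0] and udeprel(g[1]) == udeprel(p[1])
                   for g, p in zip(gold, pred))
    return uas_hits / total, las_hits / total


# Invented four-token example: one wrong head, one wrong label.
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
pred = [(2, "nsubj:pass"), (0, "root"), (2, "det"), (2, "iobj")]
uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2%}, LAS = {las:.2%}")
```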
Lexical resources
- Princeton WordNet - a lexical database for English
- VALLEX - a valency lexicon of Czech verbs
- DeriNet - a derivation network of Czech lemmas (more detailed slides)
Licensing
- slides: Intro to authors' rights and licensing
- Licensing, LDC resources, HW overview

Additional material

Possible types of errors in Czech morphologically tagged corpora

You can use the following list, either directly or just for an inspiration.

word form "se" - search for corpus positions, where "se" is tagged as a vocalized preposition, but in fact it is a reflexive pronoun (or vice versa)
word form "jí" - conjugated form of the verb "jíst" (to eat) wrongly tagged as a pronoun, or vice versa
surnames derived from verbs (such as "Pospíšil") - such surnames might be incorrectly tagged as verbs (or vice versa)
forms "a" and "A" - find corpus positions, where "a" is tagged as a coordination conjunction which is wrong (it could be the English article, physical unit, itemizer, etc.)
"weird imperatives" - search for tokens incorrectly tagged as imperatives (such as "leč", which is more likely to be a conjunction)
search for errors caused by homonymy between some verbs and adjectives (e.g. the form "zelená" can be an adjective or a verb)
search for tokens incorrectly tagged as vocalized prepositions (e.g. in cases in which the following word does not require any vocalization of the preceding preposition)
search for tokens whose tags indicate the locative case (6th case); hint: this case can appear only in prepositional groups in Czech
search for errors based on the fact that for each preposition there should be a word form somewhere behind the preposition which 'saturates' the preposition and indicates the same morphological case
word form "ty" - search for places in which "ty" is tagged as a personal pronoun, but in fact it is a demonstrative pronoun (or vice versa)
word form "ti" - analogously to the previous item
swap of nominative and accusative - search for nouns (or other parts of speech) with accusative indicated in the POS tag, even if they should be tagged as nominatives (or vice versa)
"weird vocatives" - search for tokens incorrectly tagged as vocative forms of nouns
two finite verbs close to each other - search for wrongly tagged tokens using the fact that in Czech there should not be two or more finite verb forms in a single clause (but there can be complex verb forms)
foreign words - search for foreign words incorrectly tagged as forms of obviously unrelated Czech words (such as "line" in "on-line" tagged as present-tense form of the verb "linout", or a German article tagged as a form of the Czech verb "drát")
wrong clitics - search for tagging errors using the fact that Czech clitics (several short words such as "by", "ti", "mi" etc.) should appear in the so-called second position (Wackernagel's position) in a sentence
confusion of prepositions and other parts of speech - find tokens wrongly tagged as prepositions which are in fact nouns or adverbs (homonymous forms such as kolem/kolem/kolem, místo/místo)
search for corpus spots with incorrectly segmented sentences
search for corpus spots with incorrect tokenization (such as "... sejí ..." instead of "... se jí ...")
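As an illustration of how one item from the list above could be turned into an automatic check, the sketch below implements the locative hint: a token tagged with case 6 is flagged if no preposition tagged with case 6 precedes it in the same sentence. It assumes positional tags in which the first position is the part of speech (R for prepositions) and the fifth position is the case. The tags in the demo are illustrative only, and the function name is hypothetical. The check produces candidates for manual inspection, not confirmed errors.

```python
def suspicious_locatives(sentence):
    """Return indices of locative tokens with no locative preposition before them.

    sentence: list of (form, tag) pairs; tag is a positional tag string
    (assumption: position 1 = part of speech, position 5 = case).
    """
    flagged = []
    licensed = False  # becomes True once a locative preposition has been seen
    for index, (form, tag) in enumerate(sentence):
        case = tag[4] if len(tag) > 4 else "-"
        if tag.startswith("R") and case == "6":
            licensed = True
        elif case == "6" and not licensed:
            flagged.append(index)
    return flagged


# Illustrative tagged sentences (tags written by hand for this demo).
with_preposition = [("v", "RR--6----------"), ("domě", "NNIS6-----A----")]
without_preposition = [("domě", "NNIS6-----A----"), ("spí", "VB-S---3P-AA---")]
print(suspicious_locatives(with_preposition))
print(suspicious_locatives(without_preposition))
```

The heuristic is deliberately permissive: any earlier locative preposition licenses all later locatives in the sentence, so a stricter version would limit the distance or reset the flag at clause boundaries.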

Course passing requirements

In short:

solving all homeworks
participating in lab tasks within classes during the semester
writing the test at the end of the semester

Homework tasks

all homeworks must be committed into the git repository (homeworks sent by email will not be accepted)
follow the naming instructions
there will be an explicit deadline for submitting each homework
if the deadline is not met by a student, an additional homework will be specified for the student
each student is supposed to create all homework solutions himself/herself; any cheating will be penalized

Premium tasks

occasionally, there will be special programming tasks announced in the class
students can get additional points (equivalent to the number of points for a single homework) for solving such a task within the time limit
for each premium task, only the winner (the one who is fastest) will get the points

Final test

all homework tasks (including the penalty ones) must be submitted before the final test
Please see the list of possible test questions here. Each test will contain around 10 questions selected from the list. The questions are to be answered in the written form in 75 minutes.
Timespace coordinates: Tuesday 29th May 2018 14:00, SU1 (i.e., same time and space as usual)

Determination of the final grade

A student's final grade will be determined by the number of points collected during the semester:

homeworks: 60 points
written test: 30 points
lab activity: 10 points

Grading scheme:

excellent: > 90 points
very good: > 70 points
good: > 50 points