NPFL070 – Language Data Resources

1. Introduction hw_my_corpus Slides: Overview of language data resources

2. Corpora - Case Study: the Czech National Corpus Reading: Text corpus (Wikipedia)

3. Czech National Corpus cont.

4. Using annotated data for evaluation hw_our_annotation Slides: Intro to evaluation in NLP

5. Treebanking Slides: Treebanks Slides: PDT

6. Universal dependencies, Udapi (by Martin Popel) hw_adpos_and_wordorder Slides: UD (Joakim Nivre and Dan Zeman) Slides: UDv2 changes

7. Udapi cont. (by Martin Popel) hw_add_commas hw_add_articles

8. Parsing and practical applications (by Martin Popel) hw_parse

9. Lexical resources Slides: Derinet

10. Licensing Slides: Intro to authors' rights and licensing


1. Introduction

hw_my_corpus Slides: Overview of language data resources

  • Course overview
  • Prerequisites:
    • Make sure you have a valid account for accessing the Czech National Corpus. If not, see the CNC registration page.
    • Make sure you understand the topics taught in Technology for NLP, which is an informal prerequisite of this course.
    • Make sure you have a valid account for accessing computers in the Linux labs. If not, consult the student service in the main lab hall ('rotunda').
    • Create your git repository at UFAL's Redmine: follow these instructions, just replace npfl092 with npfl070 and 2017 with 2018.
  • Warm-up exercise - find haiku sentences
    • for a language of your choice, write a Python script that finds sentences which can be split on word boundaries into three contiguous segments with 5, 7, and 5 syllables (ok, haiku for Europeans)
    • Input: read a plain-text utf8 file (e.g. Zápisky z mrtvého domu available at Project Gutenberg) from STDIN
    • Output: print the haiku sentences, each of the three segments on a separate line, with additional empty line separating sentences
    • Simplification: You can approximate the number of syllables by the number of vowels (or vowel sequences).
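
  A minimal sketch of the exercise, assuming a Czech-like vowel inventory (adjust VOWELS and the naive sentence splitter for your language):

    import re
    import sys

    # rough vowel set for Czech; extend or replace for your language
    VOWELS = re.compile(r'[aeiouyáéěíóúůý]+', re.IGNORECASE)

    def syllables(word):
        """Approximate the syllable count by the number of vowel sequences."""
        return len(VOWELS.findall(word))

    def haiku_split(words, pattern=(5, 7, 5)):
        """Greedily split words into 5-7-5-syllable segments, or return None."""
        segments = []
        for target in pattern:
            segment, count = [], 0
            while words and count < target:
                count += syllables(words[0])
                segment.append(words.pop(0))
            if count != target:
                return None
            segments.append(' '.join(segment))
        return segments if not words else None

    text = sys.stdin.read()
    for sentence in re.split(r'(?<=[.!?])\s+', ' '.join(text.split())):
        segments = haiku_split(sentence.split())
        if segments:
            print('\n'.join(segments) + '\n')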

2. Corpora - Case Study: the Czech National Corpus

Reading: Text corpus (Wikipedia)

3. Czech National Corpus cont.

  • Warm-up exercise:
    • try to assign POS tags to all tokens in the following sentence (from today's newspapers): "Trumpův ministr uvažuje o odebírání dětí matkám, které přijdou ilegálně."
    • when finished, compare your solution with that of MorphoDiTa, the on-line morphological analyser of Czech
    • once again, POS tagset documentation
  • Warm-up exercise 2 (if the previous one is too easy for you or if you have spare time): use a tiny multilingual raw-text "corpus" for language identification
    1. use raw-text data for several languages available in minicorp.zip
    2. implement a language identifier in Python; the identifier should ideally recognize any language from the minicorp set
    3. use the minicorp texts for training (e.g. first 1000 lines for each language) and evaluation (next 100 lines) of a language identifier implemented in Python
    4. hint: use "similarity" in occurrences of letter trigrams
    5. commit your solution to 2017-npfl070/exercise2 in your directory in the undergrads SVN repository
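
  A possible skeleton for the identifier (the one-file-per-language layout inside minicorp.zip is an assumption; adapt the paths to the real data):

    import collections
    import glob
    import os

    def trigram_profile(lines):
        """Relative frequencies of letter trigrams in the given lines."""
        counts = collections.Counter()
        for line in lines:
            line = line.lower()
            for i in range(len(line) - 2):
                counts[line[i:i + 3]] += 1
        total = sum(counts.values()) or 1
        return {tri: c / total for tri, c in counts.items()}

    def similarity(profile_a, profile_b):
        """Dot product of two profiles; higher means more similar."""
        return sum(p * profile_b.get(tri, 0.0) for tri, p in profile_a.items())

    models, heldout = {}, {}
    for path in glob.glob('minicorp/*.txt'):  # assumed: one file per language
        lang = os.path.splitext(os.path.basename(path))[0]
        with open(path, encoding='utf-8') as f:
            lines = f.readlines()
        models[lang] = trigram_profile(lines[:1000])  # training part
        heldout[lang] = lines[1000:1100]              # evaluation part

    correct = total = 0
    for lang, lines in heldout.items():
        for line in lines:
            profile = trigram_profile([line])
            guess = max(models, key=lambda m: similarity(profile, models[m]))
            correct += guess == lang
            total += 1
    print(f'accuracy: {correct / total:.2%}')
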
  • construct Kontext queries for the following examples
    1. occurrences of word form "kousnout"; occurrences of all forms of lemma "kousnout"; occurrences of verbs derived from "kousnout" by prefixation (and make frequency list of their lemmas) and occurrences of adjectives derived from such prefixed verbs (and their frequency list too),
    2. name 5 verbs whose infinitive does not end in '-t'; find them in the corpus and make their frequency list
    3. find adjectives with 'illusory negation', such as "nekalý", "neohrabaný", "nevrlý"...
    4. find adverbs that modify adjectives, make their frequency list,
    5. find beginnings of subordinating conditional clauses,
    6. find beginnings of subordinating relative clauses,
    7. find examples of names of (state) presidents (given name + surname), order them by frequency of occurrence,
    8. find all occurrences of phraseme "mráz někomu běhá po zádech"
    9. find nouns with temporal meaning
    10. find adverbs with locational or directional meaning
    11. find nouns that are typical objects of the verb "kousnout" (and the same for subjects)
    12. find five pairs of alternative or questionable spelling variants, and compare their frequencies using SyD

4. Using annotated data for evaluation

hw_our_annotation Slides: Intro to evaluation in NLP

  • Warm-up exercise: find synonyms (or near synonyms) in English
    • you can use a simple probabilistic English-Czech translation dictionary derived from CzEng czeng-reduced-dict.tsv.gz
    • you can rely on the hint that synonymous words are likely to have the same translation equivalents
    • you should devise a reliability metric for the detected synonym pairs and sort the pairs, starting from the most reliable ones
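
  One possible approach (the three-column TSV format of the dictionary is an assumption; check the real file first):

    import collections
    import gzip

    # assumed format: "english_word <TAB> czech_word <TAB> p(czech|english)" per line
    trans = collections.defaultdict(dict)
    with gzip.open('czeng-reduced-dict.tsv.gz', 'rt', encoding='utf-8') as f:
        for line in f:
            en, cs, prob = line.rstrip('\n').split('\t')
            trans[en][cs] = float(prob)

    # candidate synonyms: English words sharing at least one Czech translation
    by_czech = collections.defaultdict(set)
    for en, translations in trans.items():
        for cs in translations:
            by_czech[cs].add(en)

    # reliability: overlap of the two words' translation distributions
    scores = {}
    for en_words in by_czech.values():
        for a in en_words:
            for b in en_words:
                if a < b and (a, b) not in scores:
                    shared = set(trans[a]) & set(trans[b])
                    scores[a, b] = sum(trans[a][cs] * trans[b][cs] for cs in shared)

    for (a, b), score in sorted(scores.items(), key=lambda kv: -kv[1])[:50]:
        print(f'{score:.3f}\t{a}\t{b}')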

5. Treebanking

Slides: Treebanks Slides: PDT

  • Warm-up exercise: use the Czech National Corpus query interface (and possibly also some command-line postprocessing, if needed) to find types of Czech adjectivals
    • find "subparts of speech" (second position in morphological tags) of words which behave syntactically like adjectives (esp. they can modify nouns), but belong to other parts of speech
    • example: S - possessive pronoun
    • hint: adjectivals appear in contexts similar to those of adjectives; the context of a word might be modeled as the pair of morphological tags (or their parts) of the left and right neighboring words
    • compare your findings with what you'd consider as adjectival according to the tagset documentation
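
  If you export the query results as a tagged vertical file (one "form<TAB>tag" token per line, which is an assumption about the export format), the command-line postprocessing could look like this:

    import collections
    import sys

    # read "form<TAB>tag" tokens from STDIN
    tags = [line.split('\t')[1].strip() for line in sys.stdin if '\t' in line]

    # the most frequent (left POS, right POS) contexts of real adjectives (tag A...)
    adj_contexts = collections.Counter()
    for left, this, right in zip(tags, tags[1:], tags[2:]):
        if this[0] == 'A':
            adj_contexts[left[0], right[0]] += 1
    frequent = {ctx for ctx, _ in adj_contexts.most_common(20)}

    # subparts of speech of non-adjectives occurring in those contexts
    candidates = collections.Counter()
    for left, this, right in zip(tags, tags[1:], tags[2:]):
        if this[0] != 'A' and (left[0], right[0]) in frequent:
            candidates[this[:2]] += 1
    for subpos, count in candidates.most_common(15):
        print(count, subpos)
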
  • An Introduction to the Prague Dependency Treebank: a short description of the annotated attributes.

6. Universal dependencies, Udapi (by Martin Popel)

hw_adpos_and_wordorder Slides: UD (Joakim Nivre and Dan Zeman) Slides: UDv2 changes

7. Udapi cont. (by Martin Popel)

hw_add_commas hw_add_articles

  • warm-up: Where (and why) do we use commas in Czech and English?

  • exercise1: Implement bhead – a tool like Unix head, but instead of the first n lines it prints the first n blocks of lines, where blocks are separated by an empty line. It will be useful for sampling conllu files.
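
  A minimal sketch (taking the number of blocks as a positional argument rather than head's -n option):

    #!/usr/bin/env python3
    """bhead: print the first N empty-line-separated blocks from STDIN."""
    import sys

    n = int(sys.argv[1]) if len(sys.argv) > 1 else 10
    blocks = 0
    for line in sys.stdin:
        sys.stdout.write(line)
        if line.strip() == '':
            blocks += 1  # an empty line closes a block
            if blocks == n:
                break

  Usage, e.g.: python3 bhead.py 5 < train.conllu > sample5.conllu (the trailing empty line after the last block is kept, so the output stays valid conllu).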

  • exercise2: Write a Udapi block which changes prepositions to postpositions (moves them after their parent's subtree).
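
  A sketch of such a block (Udapi's shift_after_subtree does the actual reordering; the tutorial.Postpositions name and location are an assumption):

    from udapi.core.block import Block

    class Postpositions(Block):
        """Turn prepositions into postpositions: move each adposition
        that precedes its parent after the parent's subtree."""

        def process_node(self, node):
            if node.upos == 'ADP' and node.precedes(node.parent):
                node.shift_after_subtree(node.parent)

  If saved as udapi/block/tutorial/postpositions.py, it could be run as: cat in.conllu | udapy -s tutorial.Postpositions > out.conllu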

  • What do zone and bundle mean in Udapi? How to compare two conllu files (remember: use train or sample data for this, not dev):

    udapy -TN < gold.conllu > gold.txt # N means no colors
    cat without.conllu | udapy -s tutorial.AddCommas write.TextModeTrees files=pred.txt > pred.conllu
    vimdiff gold.txt pred.txt # exit vimdiff with ":qa" or "ZZZZ"
    

8. Parsing and practical applications (by Martin Popel)

hw_parse

9. Lexical resources

Slides: Derinet

10. Licensing

Slides: Intro to authors' rights and licensing

  • Licensing, LDC resources, HW overview

1. hw_my_corpus

2. hw_our_annotation

3. hw_adpos_and_wordorder

4. hw_add_commas

5. hw_add_articles

6. hw_parse

1. hw_my_corpus

 100 points Create a sequence of tools for building a very simple 1MW corpus

  • choose a language other than Czech, English, and your native language
  • find on-line sources of texts for the language, containing altogether more than 1 million words, and download them
  • convert the material into one large plain-text utf8 file
  • tokenize the file on word boundaries and print the 50 most frequent tokens
  • organize all these steps into a Makefile so that the whole procedure is executed after running make all
  • commit the Makefile into hw/my-corpus in your git repository for this course
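
  The tokenize-and-count step could be as simple as the following script, called from one of the Makefile targets (approximating word boundaries with \w+):

    import collections
    import re
    import sys

    text = sys.stdin.read().lower()
    tokens = re.findall(r'\w+', text)  # crude word-boundary tokenization
    for token, count in collections.Counter(tokens).most_common(50):
        print(f'{count}\t{token}')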

2. hw_our_annotation

 100 points Design your own annotation project for a linguistic phenomenon of your choice

  • work in pairs
  • minimal requirements: annotation in a plain-text format, two annotations by two independently working annotators, at least 50 annotated instances, evaluated inter-annotator agreement, experiment documentation
  • commit the annotated data and experiment documentation into hw/our-annotation/ in your git repository for this course; in each pair, only one student commits the solution, while the second student is only mentioned in the documentation
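
  For the inter-annotator agreement, plain accuracy is a start, but a chance-corrected measure such as Cohen's kappa is more informative; a self-contained sketch:

    def cohens_kappa(labels_a, labels_b):
        """Inter-annotator agreement corrected for chance agreement."""
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # chance agreement from each annotator's label distribution
        expected = 0.0
        for label in set(labels_a) | set(labels_b):
            expected += (labels_a.count(label) / n) * (labels_b.count(label) / n)
        return (observed - expected) / (1 - expected)

    print(cohens_kappa(['y', 'y', 'n', 'y'], ['y', 'n', 'n', 'y']))  # 0.5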

3. hw_adpos_and_wordorder

 Deadline: April 10 23:59  100 points

  • Commit the blocks' source code and results to hw/adpos-and-wordorder.
  • Complete tutorial.Adpositions (see the usage hint) and detect which of the UD2.0 treebanks (based on the */sample.conllu files) use postpositions.
  • Write a new Udapi block to detect the word order type – for each language (and treebank, i.e. each sample file), compute the percentage of each of the six possible word order types (see the sketch after this list). Hint: verbs can be detected by upos; subjects and objects can be detected by deprel – they are core dependents of clausal predicates.
  • Bonus: Detect which languages are pro-drop (again write a new Udapi block). For a language of your choice, write a block which inserts a node for each dropped pronoun (fill form, lemma, gender, number and person, whenever applicable).
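
  A sketch of the word-order block (it only counts VERB nodes that have both an nsubj and an obj child; a real solution should also handle copulas, auxiliaries, etc.):

    import collections
    from udapi.core.block import Block

    class WordOrder(Block):
        """Print percentages of the six S/V/O orders found in the data."""

        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            self.counts = collections.Counter()

        def process_node(self, node):
            if node.upos != 'VERB':
                return
            subjects = [c for c in node.children if c.udeprel == 'nsubj']
            objects = [c for c in node.children if c.udeprel == 'obj']
            if subjects and objects:
                # sort subject, verb and object by word order (ord = position)
                triple = sorted([(subjects[0], 'S'), (node, 'V'), (objects[0], 'O')],
                                key=lambda item: item[0].ord)
                self.counts[''.join(letter for _, letter in triple)] += 1

        def process_end(self):
            total = sum(self.counts.values()) or 1
            for order, count in self.counts.most_common():
                print(f'{order}\t{100 * count / total:.1f}%')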

4. hw_add_commas

 100 points

  • Commit your block to hw/add-commas/addcommas.py. Write a Udapi block which heuristically inserts commas into a conllu file (from which all commas were deleted). Choose Czech, German or English (the final evaluation will be done on all three, with the language parameter set to "cs", "de" or "en", but only your best language result counts towards full points for this hw). Use the UDv2.0 sample data: the train.conllu and sample.conllu files are for training and debugging your code. For evaluating with the F1 measure use the dev.conllu file, but do not inspect the errors you make on this dev data (so that you don't overfit). The final evaluation will be done on a secret test set (where the commas will also be deleted from root.text and node.misc['SpaceAfter'] using tutorial.RemoveCommas).

  • Hints: See the tutorial.AddCommas template block. You can hardlink it to your hw directory: ln ~/udapi-python/udapi/block/tutorial/addcommas.py ~/where/my/git/is/npfl070/hw/add-commas/addcommas.py. For Czech and German (and partially for English) it is useful to detect (finite) clauses first (and finite verbs).

    cd sample
    cp UD_English/dev.conllu gold.conllu
    cat gold.conllu | udapy -s \
      util.Eval node='if node.form==",": node.remove(children="rehang")' \
      > without.conllu

    # substitute the next line with your solution
    cat without.conllu | udapy -s tutorial.AddCommas language=en > pred.conllu

    # evaluate
    udapy \
      read.Conllu files=gold.conllu zone=en_gold \
      read.Conllu files=pred.conllu zone=en_pred \
      eval.F1 gold_zone=en_gold focus=,

    # You should see an output similar to this
    Comparing predicted trees (zone=en_pred) with gold trees (zone=en_gold), sentences=2002
    === Details ===
    token       pred  gold  corr   prec     rec      F1
    ,            176   800    40  22.73%   5.00%   8.20%
    === Totals ===
    predicted =     176
    gold      =     800
    correct   =      40
    precision =  22.73%
    recall    =   5.00%
    F1        =   8.20%
    

5. hw_add_articles

 100 points

  • Commit your block to hw/add-articles/addarticles.py.
  • Write a Udapi block tutorial.AddArticles which heuristically inserts English definite and indefinite articles (the, a, an) into a conllu file (from which all articles were deleted). As in the previous homework, the F1 score will be used for the evaluation, just with focus='(?i)an?|the' (note that only the form is evaluated, but it is case sensitive). For removing articles use util.Eval node='if node.upos=="DET" and node.lemma in {"a", "the"}: node.remove(children="rehang")'. Everything else is the same. To get all points for this hw, you need at least 30% F1 (on the secret test set).

6. hw_parse

 100 points

  • Commit your block to hw/parse/parse.py.
  • Write a Udapi block tutorial.Parse, which does dependency parsing (labelled, i.e. including deprel assignment) for English, Czech and German. A simple rule-based approach is expected, but machine learning is not forbidden (using the provided {train,dev}.conllu). Your goal is to achieve the highest LAS (you can ignore the language-specific part of deprel, so "LAS (udeprel)" reported by eval.Parsing is the evaluation measure to be optimized). To get all points for this hw, you need at least 40% LAS on at least one of the three languages or at least 30% LAS average on all three languages (on the secret test sets).
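
  To illustrate the expected interface, here is a naive attach-to-the-right baseline, not a serious solution (it assumes the input trees are flat, i.e. everything attached to the technical root):

    from udapi.core.block import Block

    # crude deprel guesses based on upos alone (just a starting point)
    DEPREL_BY_UPOS = {
        'ADJ': 'amod', 'ADP': 'case', 'ADV': 'advmod', 'AUX': 'aux',
        'CCONJ': 'cc', 'DET': 'det', 'NOUN': 'nmod', 'NUM': 'nummod',
        'PART': 'advmod', 'PRON': 'nmod', 'PROPN': 'nmod',
        'PUNCT': 'punct', 'SCONJ': 'mark', 'VERB': 'conj',
    }

    class Parse(Block):
        """Baseline: attach every word to the following word,
        the last word to the root, and guess deprel from upos."""

        def process_tree(self, root):
            nodes = root.descendants
            for node, parent in zip(nodes, nodes[1:]):
                node.parent = parent
                node.deprel = DEPREL_BY_UPOS.get(node.upos, 'dep')
            if nodes:
                nodes[-1].parent = root
                nodes[-1].deprel = 'root'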

Homework assignments

  • There will be 8–12 homework assignments.
  • For most assignments, you will get points, up to a given maximum (the maximum is specified with each assignment).
    • If your submission is especially good, you can get extra points (up to +10% of the maximum).
  • Most assignments will have a fixed deadline (usually in 1 week).
  • If you submit the assignment after the deadline, you will get:
    • up to 50% of the maximum points if it is less than 2 weeks after the deadline;
    • 0 points if it is more than 2 weeks after the deadline.
  • Once we check the submitted assignments, you will see the points you got and the comments from us.

Test

Grading

Your grade is based on your overall performance; the test and the homework assignments are weighted 1:1.

  1. ≥ 90%
  2. ≥ 70%
  3. ≥ 50%
  4. < 50%

For example, if you get 600 out of 1000 points for homework assignments (60%) and 36 out of 40 points for the test (90%), your total performance is 75% and you get a 2.

No cheating

  • Cheating is strictly prohibited and any student found cheating will be punished. The punishment can involve failing the whole course, or, in grave cases, being expelled from the faculty.
  • Discussing homework assignments with your classmates is OK. Sharing code is not OK (unless explicitly allowed); by default, you must complete the assignments yourself.
  • All students involved in cheating will be punished. E.g. if you share your assignment with a friend, both you and your friend will be punished.