hw/my-corpus
Follow the instructions from 2017-npfl070/exercise1 (substitute npfl092 with npfl070 and 2017 with 2018); make all must run successfully in your directory. Commit the solution to hw/my-corpus in your git repository for this course.

hw/our-annotation/
Follow 2017-npfl070/exercise2 (work in pairs): design your own annotation project for a linguistic phenomenon of your choice in your directory our-annotation. Commit the solution to hw/our-annotation/ in your git repository for this course; in each pair, only one student commits the solution, while the second student is only mentioned in the documentation.

hw/adpos-and-wordorder

hw/add-commas/addcommas.py
Write a Udapi block which heuristically inserts commas into a conllu file (where all commas were deleted).
Choose Czech, German or English (the final evaluation will be done on all, with the language
parameter set to "cs", "de" or "en",
but for getting full points for this hw only the best language result counts).
Use the UDv2.0 sample data:
you can use the train.conllu and sample.conllu files for training and debugging your code.
For evaluating with the F1 measure use the dev.conllu file,
but don't inspect the individual errors you make on this dev data (so you don't overfit to it).
The final evaluation will be done on a secret test set
(where the commas will be deleted also from root.text
and node.misc['SpaceAfter']
using tutorial.RemoveCommas).
Hints: See the tutorial.AddCommas template block.
You can hardlink it to your hw directory: ln ~/udapi-python/udapi/block/tutorial/addcommas.py ~/where/my/git/is/npfl070/hw/add-commas/addcommas.py
For Czech and German (and partially for English) it is useful to detect (finite) clauses first (and finite verbs).
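To illustrate the hint, here is a toy, library-free sketch of the idea (plain Python over (form, upos) pairs, not the actual Udapi block): treat each subordinating conjunction as a rough proxy for a clause boundary and insert a comma before it. This works for Czech, where subordinate clauses are regularly comma-separated.

```python
def add_commas(tokens):
    """Toy heuristic: insert a comma before every subordinating
    conjunction (SCONJ), a rough proxy for a finite-clause boundary.
    `tokens` is a list of (form, upos) pairs; returns a new list."""
    out = []
    for form, upos in tokens:
        # a new clause often starts at an SCONJ; precede it by a comma,
        # unless we are at sentence start or a comma is already there
        if upos == 'SCONJ' and out and out[-1][0] != ',':
            out.append((',', 'PUNCT'))
        out.append((form, upos))
    return out

# Czech "Vím že odešel" -> "Vím , že odešel" ("I know that he left")
add_commas([('Vím', 'VERB'), ('že', 'SCONJ'), ('odešel', 'VERB')])
```

A real solution would additionally detect finite verbs to delimit whole clauses, and would need different rules for English, where "that"-clauses usually take no comma.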
cd sample
cp UD_English/dev.conllu gold.conllu
cat gold.conllu | udapy -s \
  util.Eval node='if node.form==",": node.remove(children="rehang")' \
  > without.conllu
# substitute the next line with your solution
cat without.conllu | udapy -s tutorial.AddCommas language=en > pred.conllu
# evaluate
udapy \
  read.Conllu files=gold.conllu zone=en_gold \
  read.Conllu files=pred.conllu zone=en_pred \
  eval.F1 gold_zone=en_gold focus=,
# You should see an output similar to this
Comparing predicted trees (zone=en_pred) with gold trees (zone=en_gold), sentences=2002
=== Details ===
token     pred    gold    corr     prec      rec       F1
,          176     800      40   22.73%    5.00%    8.20%
=== Totals ===
predicted = 176
gold = 800
correct = 40
precision = 22.73%
recall = 5.00%
F1 = 8.20%
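As a quick sanity check (not part of the assignment), the precision/recall/F1 numbers in the sample output can be reproduced from the three raw counts:

```python
def prf(predicted, gold, correct):
    """Precision, recall and F1 from raw token counts."""
    p = correct / predicted
    r = correct / gold
    f1 = 2 * p * r / (p + r)  # harmonic mean of precision and recall
    return p, r, f1

p, r, f1 = prf(predicted=176, gold=800, correct=40)
print(f'{p:.2%} {r:.2%} {f1:.2%}')  # → 22.73% 5.00% 8.20%
```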
Results (F1) as of 2018-04-18:
SLOC means source lines of code excluding comments and docstrings. It is reported just for information and plays no role in the evaluation. The homeworks are not code golf; the code should be nice to read.

             en-test  en-dev  SLOC
1. mp         54.32%  54.42%    53
2. heslo      36.65%  37.65%   131
3. Lampa      35.69%  33.17%    45
4. aaa        30.15%  28.77%    81
5. kenajykul  26.10%  23.45%    82
6. base        8.80%   8.20%    18

             cs-test  cs-dev  SLOC
1. mp         88.92%  88.40%    53
2. aaa        81.25%  80.32%    81
3. Lampa      80.26%  80.71%    45
4. kenajykul  69.60%  69.51%    82
5. heslo      67.23%  66.49%   131
6. base        3.49%   3.62%    18

             de-test  de-dev  SLOC
1. heslo      73.18%  72.79%   131
2. mp         62.74%  68.16%    53
3. Lampa      51.92%  53.67%    45
4. kenajykul  50.23%  46.86%    82
5. aaa        45.99%  42.40%    81
6. base        5.90%   2.93%    18
udapy -TN < gold.conllu > gold.txt   # N means no colors
cat without.conllu | udapy -s tutorial.AddCommas write.TextModeTrees files=pred.txt > pred.conllu
vimdiff gold.txt pred.txt
# exit vimdiff with ":qa" or "ZZZZ"
hw/add-articles/addarticles.py
Write a Udapi block tutorial.AddArticles
which heuristically inserts English definite and indefinite articles (the, a, an) into a conllu file (where all articles were deleted).
As in the previous homework, the F1 score will be used for the evaluation, just with focus='(?i)an?|the'
(note that only the word form is evaluated, and the comparison is case-sensitive).
For removing articles use util.Eval node='if node.upos=="DET" and node.lemma in {"a", "the"}: node.remove(children="rehang")'
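A minimal, library-free sketch of one possible baseline (plain Python over (form, upos) pairs, not the actual Udapi block): put "the" before every noun that is not already preceded by a determiner.

```python
def add_articles(tokens):
    """Toy baseline: insert "the" before every noun that is not
    already preceded by a determiner.
    `tokens` is a list of (form, upos) pairs; returns a new list."""
    out = []
    for form, upos in tokens:
        # insert a definite article at the start of a bare noun
        if upos == 'NOUN' and (not out or out[-1][1] != 'DET'):
            out.append(('the', 'DET'))
        out.append((form, upos))
    return out

add_articles([('dog', 'NOUN'), ('barks', 'VERB')])
# → [('the', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB')]
```

A real solution needs to look at the whole noun phrase (adjectives push the article further left, as in "a big dog"), to skip uncountable and plural generic nouns, and to choose between "the", "a" and "an" (e.g. "an" before a vowel sound).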
Everything else is the same.
To get all points for this hw, you need at least 30% F1 (on the secret test set).
Results (F1) as of 2018-04-25:
             en-test  en-dev  SLOC
1. mp         41.13%  35.32%    17
2. Lampa      40.28%  34.81%    23
3. kenajykul  37.83%  33.14%    32
4. aaa        37.36%  34.23%    34
5. heslo      37.01%  31.87%    78
6. base       17.64%  15.31%     6
hw/parse/parse.py
Write a Udapi block tutorial.Parse,
which does dependency parsing (labelled, i.e. including deprel assignment) for English, Czech and German.
A simple rule-based approach is expected, but machine learning is not forbidden (using the provided {train,dev}.conllu).
Your goal is to achieve the highest LAS
(you can ignore the language-specific part of deprel,
so "LAS (udeprel)" reported by eval.Parsing
is the evaluation measure to be optimized).
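Assuming the standard definition of the metric (a token is scored as correct iff both its head and the universal part of its deprel match the gold tree), "LAS (udeprel)" can be computed like this (a sketch of the metric itself, not of eval.Parsing internals):

```python
def las_udeprel(pred, gold):
    """LAS with language-specific deprel subtypes ignored.
    pred and gold are per-token lists of (head_index, deprel);
    the part of deprel after ':' is stripped before comparison."""
    correct = sum(
        ph == gh and pd.split(':')[0] == gd.split(':')[0]
        for (ph, pd), (gh, gd) in zip(pred, gold)
    )
    return correct / len(gold)

# "nsubj" matches gold "nsubj:pass" once the subtype is stripped
las_udeprel([(2, 'nsubj'), (0, 'root')],
            [(2, 'nsubj:pass'), (0, 'root')])  # → 1.0
```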
To get all points for this hw, you need at least 40% LAS on at least one of the three languages
or at least 30% LAS average on all three languages (on the secret test sets).
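One very simple rule-based starting point (a "chain" baseline, illustrated here in plain Python rather than as a Udapi block, and far from a full solution) attaches every token to the following one and the last token to the artificial root:

```python
def chain_parse(upos_tags):
    """Chain baseline: token i attaches to token i+1; the last token
    attaches to the artificial root (head 0). Heads are 1-indexed.
    Returns a list of (head, deprel) per token."""
    n = len(upos_tags)
    parsed = []
    for i, upos in enumerate(upos_tags, start=1):
        head = i + 1 if i < n else 0
        if head == 0:
            deprel = 'root'
        elif upos == 'PUNCT':
            deprel = 'punct'
        else:
            deprel = 'dep'  # placeholder; real rules map UPOS pairs to deprels
        parsed.append((head, deprel))
    return parsed

chain_parse(['DET', 'NOUN', 'VERB'])
# → [(2, 'dep'), (3, 'dep'), (0, 'root')]
```

Real rule-based parsers improve on this by picking the main verb as root, attaching determiners and adjectives to the nearest noun, nouns to verbs, and so on; each such rule typically buys a few LAS points.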
Results (LAS) as of 2018-05-02:
             en-test  en-dev | cs-test  cs-dev | de-test  de-dev | avg-test  avg-dev | SLOC
1. mp         58.53%  59.27% |  39.16%  39.60% |  43.75%  45.47% |   47.15%   48.11% |   93
2. kenajykul  32.24%  31.83% |  39.04%  38.53% |  43.04%  45.17% |   38.11%   38.51% |   84
3. aaa        34.08%  36.50% |  37.40%  37.02% |  41.29%  42.97% |   37.59%   38.83% |   71
4. Lampa      31.05%  32.04% |  32.45%  31.47% |  38.13%  38.03% |   33.88%   33.85% |   80
5. heslo      26.76%  26.79% |  31.89%  32.80% |  42.52%  48.49% |   33.72%   36.03% |  185
6. base        0.36%   0.66% |   0.13%   0.15% |   0.00%   0.02% |    0.16%    0.28% |   10
You can use the following list, either directly or just for inspiration.

In short: a student's final grade will be determined by the number of points collected during the semester.

Grading scheme: