Language Data Resources

NPFL070 – Language Data Resources

About

SIS code: NPFL070
Semester: winter
E-credits: 5
Examination: 1/2 MC (KZ)
Instructors: Martin Popel, Zdeněk Žabokrtský

The classes combine lectures and practicals.

Timespace Coordinates in 2024

Tuesday 10:40–11:25 + 11:30–13:00 in SW1

Course prerequisites

Only an informal one: NPFL125 – Introduction to Language Technologies (unless, of course, you gained your knowledge of bash, Python, XML and the like elsewhere).

Course passing requirements

To pass the course you will need to submit homework assignments and do a written test. See Grading for more details.

Classes

1. Introduction Overview of language data types

2. More on corpora and a case study: the Czech National Corpus hw_my_corpus Reading Text corpus (Wikipedia)

3. Czech National Corpus cont., Treebanking intro hw_our_annotation Intro to Intercorp by Lucie Lukešová To tree or not to tree? Slides: Examples of constituency treebanks Slides: PDT

4. Universal Dependencies, Udapi (by Martin Popel) Slides: UD (by Dan Zeman) Slides: UDv2 hw_adpos_and_wordorder Slides: UD (Joakim Nivre and Dan Zeman)

5. Udapi cont. (by Martin Popel) hw_add_commas

6. No lecture - Dean's day

7. Using annotated data for evaluation hw_shared_task Evaluation in NLP

8. Parsing and practical applications (by Martin Popel) Tools for UD (slides 32-45) hw_add_articles

9. Lexical databases (a guided tour) Slides: Derinet Slides: Selected topics from morphology

10. Licensing, data repos Slides: Intro to authors' rights and licensing

11. HuggingFace datasets, tokenizers (by Martin Popel) hw_hf

12. Significance and Hypothesis testing (by Martin Popel) Slides: Significance and Hypothesis testing

13. Final written test

License

Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.

1. Introduction

Overview of language data types Oct 1, 2024

Course overview
Prerequisites:
- Make sure you have a valid account for accessing the Czech National Corpus. If not, see the CNC registration page.
- Make sure you understand the topics taught in Introduction to Language Technologies, which is an informal prerequisite of this course
- Make sure you have a valid account for accessing computers in the Linux labs. If not, consult the student service in the main lab hall ('rotunda').

2. More on corpora and a case study: the Czech National Corpus

Oct 8, 2024 hw_my_corpus Reading Text corpus (Wikipedia)

Make sure you have a valid account for accessing the Czech National Corpus. If not, quickly ask for one at the CNC registration page.
Have a look at the Corpus Query Language basics.
Have a look at morphological categories distinguished in Czech positional morphological tags used in the CNC.
Have a look at the CNC search interface manual (but don't worry, we'll practice it a lot during the practicals)
During the class:
- we'll explore the most important corpus for Czech (containing many other languages too, though) at www.korpus.cz
- we'll work with the Kontext search tool
- we'll explore the tagset using MorphoDiTa
  - first, try to assemble POS tags for all tokens in the following sentence (from today's newspapers): "V úterý byl na nejméně dva týdny poslední den, kdy mohly mít restaurace otevřeno do 20 hodin."
  - when finished, compare your solution with that of the on-line morphological analyser of Czech Morphodita
  - once again, POS tagset documentation
- construct Kontext queries for the following examples
  1. occurrences of word form "kousnout"; occurrences of all forms of lemma "kousnout"; occurrences of verbs derived from "kousnout" by prefixation (and make frequency list of their lemmas) and occurrences of adjectives derived from such prefixed verbs (and their frequency list too),
  2. name 5 verbs whose infinitive does not end with '-t'; find them in the corpus and make their frequency list
  3. find adjectives with 'illusory negation', such as "nekalý", "neohrabaný", "nevrlý"...
  4. find adverbs that modify adjectives, make their frequency list,
  5. find beginnings of subordinating conditional clauses,
  6. find beginnings of subordinating relative clauses,
  7. find examples of names of (state) presidents (first name + surname), order them according to frequency of occurrences,
  8. find all occurrences of phraseme "mráz někomu běhá po zádech"
  9. find nouns that are typical objects of the verb "kousnout" (and the same for subjects)
  10. find adverbs with locational or directional meaning (this is a bit tricky)
  11. find nouns with temporal meaning
  12. find some nouns created by compounding (such as "autoopravna")
Additional reading about corpora:
- Michal Křen's overview of developments in CNC: slides
- Sandra Kuebler's Introduction to Corpus Linguistics (slides)
- Text corpus (Wikipedia)
- Corpus linguistics (Wikipedia)

3. Czech National Corpus cont., Treebanking intro

October 15, 2024 hw_our_annotation

Warm-up exercise: use the Czech National Corpus query interface (and possibly also some command-line postprocessing, if needed) to find types of Czech adjectivals
- find "subparts of speech" (second position in morphological tags) of words which behave syntactically like adjectives (esp. they can modify nouns), but belong to other parts of speech
- example: S - possessive pronoun
- hint: adjectivals appear in similar contexts as adjectives; the context of a word might be modeled as the pair of morphological tags (or their parts) of the left and right neighboring words,
- compare your findings with what you'd consider as adjectival according to the tagset documentation
Individual tasks: practise CQL querying in Kontext by constructing at least three queries that find disambiguation mistakes (=incorrectly tagged and/or lemmatized tokens). Choose any corpus available in Kontext, any language. Ideally, each query should identify at least 100 corpus positions, at least two thirds of which should be truly erroneous (a very rough estimate based on a smaller sample is sufficient for this exercise). For instance, for English you can try to detect incorrectly tagged "that", or "der" in German, or "vino" in Spanish. If you choose Czech, then you can use the following list, either directly or just for your inspiration:
1. word form "se" - search for corpus positions, where "se" is tagged as a vocalized preposition, but in fact it is a reflexive pronoun (or vice versa)
2. word form "jí" - conjugated form of the verb "jíst" (to eat) wrongly tagged as a pronoun, or vice versa
3. surnames derived from verbs (such as "Pospíšil") - such surnames might be incorrectly tagged as verbs (or vice versa)
4. forms "a" and "A" - find corpus positions, where "a" is tagged as a coordination conjunction which is wrong (it could be the English article, physical unit, itemizer, etc.)
5. "weird imperatives" - search for tokens incorrectly tagged as imperatives (such as "leč", which is more likely to be a conjunction)
6. search for errors caused by homonymy between some verbs and adjectives (e.g., word form "zelená" could be an adjective or a verb)
7. search for tokens incorrectly tagged as vocalized prepositions (e.g. in cases in which the following word does not require any vocalization of the preceding preposition)
8. search for tokens whose tags indicate the locative case (6th case); hint: this case can appear only in prepositional groups in Czech
9. search for errors based on the fact that for each preposition there should be a word form somewhere behind the preposition which 'saturates' the preposition and indicates the same morphological case
10. word form "ty" - search for places in which "ty" is tagged as a personal pronoun, but in fact it is a demonstrative pronoun (or vice versa)
11. word form "ti" - analogously to the previous item
12. swap of nominative and accusative - search for nouns (or other parts of speech) with accusative indicated in the POS tag, even if they should be tagged as nominatives (or vice versa)
13. "weird vocatives" - search for tokens incorrectly tagged as vocative forms of nouns
14. two finite verbs close to each other - search for wrongly tagged tokens using the fact that in Czech there should not be two or more finite verb forms in a single clause (but there can be complex verb forms)
15. foreign words - search for foreign words incorrectly tagged as forms of obviously unrelated Czech words (such as "line" in "on-line" tagged as a present-tense form of the verb "linout", or a German article tagged as a form of the Czech verb "drát")
16. wrong clitics - search for tagging errors using the fact that Czech clitics (several short words such as "by", "ti", "mi" etc.) should appear in the so-called second position (Wackernagel's position) in a sentence
17. confusion of prepositions and other parts of speech - find tokens wrongly tagged as prepositions which are in fact nouns or adverbs (homonymous forms such as kolem/kolem/kolem, místo/místo)
18. search for corpus spots with incorrectly segmented sentences
19. search for corpus spots with incorrect tokenization (such as "... sejí ..." instead of "... se jí ...")

practicals:
- searching in Intercorp using Kontext
- a tour through other CNK-related tools: Treq, SyD, Morfio
Additional reading: Intro to Intercorp by Lucie Lukešová

To tree or not to tree? Slides: Examples of constituency treebanks Slides: PDT

Task 1: draw a constituency tree from the Penn Treebank
- draw the constituency tree that is represented by a bracketed text format in sample file penntb-sample.txt
- use pen'n'paper or any drawing software, whatever you find more efficient
- please be prepared to share the image of your tree on our googledoc whiteboard at the beginning of the class
Task 2: draw a constituency tree from the Negra Treebank
- draw the constituency tree that is represented by a line-oriented text format in sample file negra-sample.txt
- again, be prepared to share the image
Task 3: play with UDPipe (a pipeline of NLP tools)
- parse some authentic sentences using the online interface of UDPipe, for any language for which a model is provided in the interface
- view the resulting dependency trees (switch to the rightmost tab that is labeled "Show trees")
- find at least two or three trees in which the parser makes a mistake (=you'd draw the tree differently), and try to guess why the sentence is not parsed correctly
- again, store images of such suspicious trees and be ready to share them

4. Universal Dependencies, Udapi (by Martin Popel)

October 22, 2024 Slides: UD (by Dan Zeman) Slides: UDv2 hw_adpos_and_wordorder

Warm-up exercise: Refresh ISO-639 language codes.
Guess the typical word-order type (SOV, SVO,...) and adposition type (i.e. whether the language uses more prepositions or postpositions) for selected 30 languages. Mark your guesses in https://forms.gle/gp5rRUpmeHqjEFi4A. Skip the languages where you really don't know.
Try UDPipe online service.
Get familiar with the Universal Dependencies universal PoS tags, dependency relations and the CoNLL-U file format.
Universal dependencies
Install Udapi (with git clone so that you have also a local copy of 01-visualizing.ipynb), install Jupyter (pip install --user jupyter or pip install --user jupyterlab) and follow the Udapi visualizing Tutorial.
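
To get a feel for the CoNLL-U format before working with Udapi, the following minimal Python sketch reads one sentence by hand. It assumes the standard ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The helper name parse_conllu_sentence and the sample sentence are made up for illustration; for real work, Udapi is the tool to use, not this parser.

```python
# Column names of the CoNLL-U format (ten tab-separated fields per token line).
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]

# A hand-made sample sentence (illustrative, not taken from any treebank).
SAMPLE = "\n".join([
    "# text = Dogs bark.",
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
])


def parse_conllu_sentence(block):
    """Return a list of token dicts for one sentence (hypothetical helper)."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected 10 columns, got {len(fields)}")
        token = dict(zip(COLUMNS, fields))
        # Multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1")
        # have non-integer IDs; this sketch simply skips them.
        if not token["id"].isdigit():
            continue
        token["id"] = int(token["id"])
        # Assumes HEAD is filled in (it is "_" in unparsed data).
        token["head"] = int(token["head"])
        tokens.append(token)
    return tokens


tokens = parse_conllu_sentence(SAMPLE)
for tok in tokens:
    # Print each word with its universal POS tag, head index and relation.
    print(tok["id"], tok["form"], tok["upos"], tok["head"], tok["deprel"])
```

HEAD 0 marks the root of the tree; every other HEAD value is the ID of the governing word in the same sentence.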

5. Udapi cont. (by Martin Popel)

October 29, 2024 hw_add_commas

Brainstorming: Where (and why) do we use commas in Czech and English?
Complete Udapi visualizing Tutorial.

What zone and bundle mean in Udapi. How to compare two CoNLL-U files (don't forget you should use train or sample, but not dev for this):

udapy -TN < gold.conllu > gold.txt # N means no colors
cat without.conllu | udapy -s tutorial.AddCommas write.TextModeTrees files=pred.txt > pred.conllu
vimdiff gold.txt pred.txt # exit vimdiff with ":qa" or "ZZZZ"

How to use util.See in Udapi.

6. No lecture - Dean's day

November 5, 2024

7. Using annotated data for evaluation

November 12, 2024 hw_shared_task Evaluation in NLP

Let's assume the task of definite and indefinite article reconstruction in English sentences: because of some strange reason, we receive amounts of English texts without any articles, and our tool should fill the articles as accurately as possible.

How exactly would you evaluate the quality of automatically filled articles, if we can use a dataset in which articles are correct? Be prepared to present the formula/algorithm.
More specifically, what precisely is the output value of your evaluation measure obtained for the following three files, in which articles have been automatically reconstructed by three hypothetical tools A, B, and C:
The no-article version of the same text (this would be the input for the hypothetical article-reconstructors): removed-articles.txt
The ground-truth version of the same text: correct-articles.txt
No, you are not expected to implement any article-reconstruction tool now. We are interested solely in the evaluation of this task.
Prepare a 3-5 line summary of your evaluation, so that you can paste it quickly into a shared whiteboard at the beginning of the next class.

8. Parsing and practical applications (by Martin Popel)

November 19, 2024 Tools for UD (slides 32-45) hw_add_articles

warm-up: What is the most popular month and year? Why? How about the frequency in "English Fiction"?
warm-up: Does the usage of present perfect vs. past simple actually depend on the absence/presence of time details, as some resources suggest?
advanced usage of Google ngrams viewer

How to use util.MarkDiff and eval.Parsing in Udapi:

udapy -HM \
  read.Conllu files=gold.conllu zone=gold \
  read.Conllu files=pred.conllu zone=pred \
  util.MarkDiff gold_zone=gold attributes=form ignore_parent=1 > diff.html

9. Lexical databases (a guided tour)

November 26, 2024 Slides: Derinet

Morphological properties of lexical units Slides: Selected topics from morphology
- inflection
  - Example: morfflex.cz (as used in MorphoDiTa)
  - Exercise: choose a verb in your native language and list all its inflected forms
  - Exercise: try to find a word with as many inflected wordforms as possible
- derivation
  - Example: DeriNet (optional reading in Czech: slides)
  - Example: Universal Derivations
  - Exercise: choose a verb in your native language and list all words derived from it
- morpheme segmentation
  - Examples: morpheme segmentation datasets for a few languages available e.g.
    - in the SIGMORPHON'22 segmentation shared task
    - in the UniSegments collection
  - Exercise: choose a past-tense word form of some prefixed verb in your native language and segment it into morphemes
  - Exercise: for a chosen language in the SIGMORPHON datasets, make a frequency list of morphemes
Syntactic/semantic combinatorial potential of lexical units
- fine-grained role inventories
- coarse-grained role inventories
Sense inventories
- traditional explanatory dictionaries
- wordnets and other thesauri
  - Example: Princeton WordNet for English
- Exercise: without using any dictionary, list all senses of leave. Then compare your list with sense inventories listed by Merriam Webster and Longman dictionaries
Multilingual lexical resources
- translation databases
- multilingual wordnets
- cognate databases
  - Example: CogNet
Other lexical resources
- terminological databases
- named entity lists (such as that of geographical names)
- etymological dictionaries
Recommended reading:
- an overview of lexical resources by Christian M. Meyer and Hatem Mousselly Sergieh

10. Licensing, data repos

December 3, 2024 Slides: Intro to authors' rights and licensing

Licensing, LDC resources
Individual preparation before the class (duration: 45 minutes):
- Read proprietary (resource-specific, non-generic) license agreements for at least 3 distinct language data resources, such as
- Can you find some repeated patterns in the licenses that you have seen? Be ready to comment on your observations at the beginning of the class.
- In the remaining time,
  - if you speak Czech, find the version of the Czech Copyright Act (Autorský zákon, 121/2000 Sb.) currently valid in the Czech Republic, and read it until the preparation time is over.
  - if you don't, find the text of the legal norm that is most relevant concerning the Authors' Rights in your home country and start reading it.
The Copyright Act currently valid in the Czech Republic
The Civil Code, Section 5 - Licenses
Examples of linguistic data hubs:
- LINDAT/CLARIAH-CZ Repository
- Linguistic Data Consortium catalogue

11. HuggingFace datasets, tokenizers (by Martin Popel)

December 10, 2024 hw_hf

Feedback: What have you learned from the add_articles assignment (both UD-specific and general know-how)?
Install HuggingFace datasets and transformers
Browse HuggingFace Datasets Hub and Models Hub
Load an NLP dataset of your choice (e.g. squad) and tokenize it with a (subword) tokenizer (e.g. bert-base-cased), inspect the results, similarly to the example Usage.

12. Significance and Hypothesis testing (by Martin Popel)

December 17, 2024 Slides: Significance and Hypothesis testing

common pitfalls and fallacies
meaning of "significantly better", p-value, null hypothesis
population vs. sample statistic, un/paired two-sample tests, one/two-tailed tests
confidence interval, IQR, bootstrap resampling
Why do we use bootstrapping and not the normal-based CI formula for BLEU?
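
As an illustration of bootstrap resampling, here is a minimal sketch of a percentile bootstrap confidence interval for the mean of per-sentence scores. The function name bootstrap_ci, the number of resamples and the scores are assumptions made up for this example. For a corpus-level metric such as BLEU, one would recompute the metric on each resampled test set rather than average per-sentence values.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample n items with replacement and record the statistic.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()
    # Cut off alpha/2 of the resampled statistics at each end.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-sentence scores of some system (made-up numbers).
scores = [0.62, 0.71, 0.55, 0.80, 0.67, 0.59, 0.74, 0.66, 0.70, 0.61]
low, high = bootstrap_ci(scores)
print(f"mean = {statistics.fmean(scores):.3f}")
print(f"95% bootstrap CI = [{low:.3f}, {high:.3f}]")
```

The percentile method makes no normality assumption about the statistic: it only needs a way to recompute the statistic on a resampled test set.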

13. Final written test

January 7, 2025

A general remark: please note that all your homework solutions should be submitted exclusively using the faculty GitLab server. Detailed information on creating and using your GitLab repository is available within the course NPFL125. For our course, the instructions are to be modified as expected:

Your project name should be "NPFL070"; the identifier should be "npfl070".
Access to your repository should be given to both instructors of NPFL070, i.e. to Zdeněk Žabokrtský and Martin Popel.

Please note that a homework specification may be subject to change until its deadline is announced on this web page.

1. hw_my_corpus

Deadline: 23:59 October 22, 2024 100 points

Create a sequence of tools for building a very simple 1MW corpus

choose a language different from Czech and English and also from your native language
find on-line sources of texts for the language, containing altogether more than 1 million words, and download them
convert the material into one large plain-text utf8 file
tokenize the file on word boundaries and print 50 most frequent tokens
organize all these steps into a Makefile so that the whole procedure is executed after running make all
commit the Makefile into hw/my-corpus in your git repository for this course
do not store the data in the repository; it must be possible to (re)construct the corpus just by running the Makefile
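
The tokenization and counting step could be sketched in Python as follows. This is only one possible approach under simple assumptions: the regular expression is a crude tokenizer that may suit some languages poorly, and the function name most_frequent_tokens and the sample text are made up for illustration.

```python
import re
from collections import Counter

# Very rough tokenizer: runs of word characters, or single non-space symbols.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def most_frequent_tokens(text, n=50):
    """Return the n most frequent tokens of `text` as (token, count) pairs."""
    return Counter(TOKEN_RE.findall(text)).most_common(n)


# Tiny in-memory example; for the homework the text would come from the
# large UTF-8 file (e.g. read via open(path, encoding="utf-8")).
sample = "Der Hund bellt. Der Hund schläft, die Katze schläft nicht."
top = most_frequent_tokens(sample, n=3)
for token, count in top:
    print(f"{count}\t{token}")
```

In Python 3, \w is Unicode-aware for str patterns, so accented letters stay inside tokens. Languages written without spaces between words need a different tokenization strategy.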

2. hw_our_annotation

Deadline: 23:59 October 29, 2024 100 points

Design your own annotation project for a linguistic phenomenon of your choice

work in small teams composed of two or three students
minimal requirements: annotation added as textual marks in a plain-text file format, at least 100 instances annotated (independently!) by all team members, evaluated inter-annotator agreement (which implies that you must be able to process the data automatically), experiment documentation
commit the annotated data and experiment documentation into hw/our-annotation/ in your git repository for this course; for each team, only one team member commits the solution, with all team members being mentioned in the documentation
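
One common way to quantify inter-annotator agreement between two annotators is Cohen's kappa. The sketch below is a minimal illustration, assuming each annotator assigned exactly one categorical label to each of the same instances. The function name cohen_kappa and the labels are made up; for three annotators one could report pairwise values or choose a different measure.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_o - p_e) / (1 - p_e)


# Made-up annotations of ten instances with two labels.
ann1 = ["Y", "Y", "N", "Y", "N", "N", "Y", "Y", "N", "Y"]
ann2 = ["Y", "N", "N", "Y", "N", "Y", "Y", "Y", "N", "Y"]
print(f"kappa = {cohen_kappa(ann1, ann2):.3f}")
```

Here the annotators agree on 8 of 10 items (observed agreement 0.8), the chance agreement is 0.52, and kappa is (0.8 - 0.52) / (1 - 0.52), roughly 0.58.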

3. hw_adpos_and_wordorder

Deadline: 23:59 November 4, 2024 100 points

Commit blocks' source codes and results to hw/adpos-and-wordorder.
Complete tutorial.Adpositions (see the usage hint) and detect which of the UD2.0 treebanks (based on the */sample.conllu files from the UDv2.0 sample) use postpositions.
Write a new Udapi block to detect word order type – for each language (and treebank, i.e. each sample file), compute the percentage of each of the six possible word order types. Hint: Verbs can be detected by upos. Subjects and objects can be detected by deprel, they are Core dependents of clausal predicates.
Bonus: Detect which languages are pro-drop (again write a new Udapi block). For a language of your choice, write a block which inserts a node for each dropped pronoun (fill form, lemma, gender, number and person, whenever applicable).

4. hw_add_commas

Deadline: 23:59 November 18, 2024 100 points

Commit your block to hw/add-commas/addcommas.py. Write a Udapi block which heuristically inserts commas into a CoNLL-U file (where all commas were deleted). Choose Czech, German, French or English (the final evaluation will be done on all four, with the language parameter set to "cs", "de", "fr" or "en"). Use the UDv2.0 sample data: you can use the train.conllu and sample.conllu files for training and debugging your code. For evaluating with the F1 measure use the dev.conllu file, but don't look at the errors you made on this dev data (so you don't overfit). The final evaluation will be done on a secret test set (where the commas will be deleted also from root.text and node.misc['SpaceAfter'] using tutorial.RemoveCommas). To get all points for this hw, you need to achieve at least the LY-MEDIAN (see the results below) F1 score for any of the four languages or at least 45% average F1 over all four languages (on the secret test sets).

Hints: See the tutorial.AddCommas template block. You can hardlink it to your hw directory: ln ~/udapi-python/udapi/block/tutorial/addcommas.py ~/where/my/git/is/npfl070/hw/add-commas/addcommas.py. For Czech and German (and partially for English) it is useful to detect (finite) clauses first (and finite verbs). It may be useful to first add commas according to a general rule and then delete extra commas (e.g. if neighboring a punctuation token or start/end of sentence).

cd sample
cp UD_English/dev.conllu gold.conllu
cat gold.conllu | udapy -s \
  util.Eval node='if node.form==",": node.remove(children="rehang")' \
  > without.conllu

# substitute the next line with your solution
cat without.conllu | udapy -s tutorial.AddCommas language=en > pred.conllu

# evaluate
udapy \
  read.Conllu files=gold.conllu zone=en_gold \
  read.Conllu files=pred.conllu zone=en_pred \
  eval.F1 gold_zone=en_gold focus=,

# You should see an output similar to this
Comparing predicted trees (zone=en_pred) with gold trees (zone=en_gold), sentences=2002
=== Details ===
token       pred  gold  corr   prec     rec      F1
,            176   800    33  18.75%   4.12%   6.76%
=== Totals ===
predicted =     176
gold      =     800
correct   =      33
precision =  18.75%
recall    =   4.12%
F1        =   6.76%
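
The totals in the sample output can be reproduced from the three counts alone. The following sketch uses the standard definitions of precision, recall and F1; the function name precision_recall_f1 is made up for illustration and is not part of Udapi.

```python
def precision_recall_f1(predicted, gold, correct):
    """Compute precision, recall and F1 from three counts."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts taken from the sample output above.
p, r, f1 = precision_recall_f1(predicted=176, gold=800, correct=33)
print(f"precision = {p:.4f}, recall = {r:.4f}, F1 = {f1:.4f}")
```

Equivalently, F1 = 2 * correct / (predicted + gold) = 66 / 976, which is about 6.76%.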

Results (F1) as of 2024-11-19 10:00

SLOC means source lines of code excluding comments and docstrings. It is reported just for info, it plays no role in the evaluation. The homeworks are not code golf, the code should be nice to read.

Dev set

NICK	SLOC	CS	DE	EN	FR	AVG
LY-BEST		80.71	72.79	49.39	52.36	62.20
LY-MEDIAN		70.39	53.67	38.32	32.70	51.78
BASE	18	3.15	2.60	6.76	3.43	3.98
base-appos	7	6.92	6.80	8.60	12.78	8.78
base-all	7	12.69	8.36	6.54	9.12	9.18
-_-	39	20.19	13.10	27.23	25.99	21.63
🐙	74	55.35	32.98	26.21	39.09	38.41
🚀👨‍🚀🌑	44	64.41	57.21	26.35	27.26	43.81
s	99	72.92	41.27	28.13	33.26	43.89
:)	135	59.62	38.20	39.92	39.26	44.25
bc789	49	69.25	55.91	28.01	38.10	47.82
😳	114	78.12	58.73	32.45	43.34	53.16
aeiou	79	82.22	51.06	39.16	48.79	55.31
mp	53	87.61	67.97	52.77	46.12	63.62

Test set

NICK	SLOC	CS	DE	EN	FR	AVG	Points
LY-BEST		81.16	73.18	51.86	53.72	62.62
LY-MEDIAN		72.10	51.92	36.19	39.06	52.24
BASE	18	3.08	4.38	8.09	3.81	4.84
base-all	7	12.20	8.38	6.77	9.69	9.26
base-appos	7	5.96	8.67	9.44	12.98	9.26
-_-	39	19.55	14.38	25.98	26.97	21.72	71 (EN)
s	99	fail	46.14	30.64	36.27	28.26	92 (FR)
🐙	74	55.22	30.48	28.13	39.25	38.27	100 (FR)
:)	135	58.72	37.87	39.28	39.14	43.75	100 (EN)
🚀👨‍🚀🌑	44	64.16	61.54	23.67	28.67	44.51	100 (DE)
bc789	49	68.15	54.05	29.41	39.52	47.78	100 (AV)
😳	114	77.55	64.01	32.41	45.73	54.92	100 (DE)
aeiou	79	82.04	53.81	36.36	50.53	55.69	100 (FR)
mp	53	86.91	62.52	52.34	49.03	62.70

5. hw_shared_task

Deadline: 23:59 November 25, 2024 100 points

Using the data from the previous assignment, design a toy shared task:

create a golden version of the annotated data by resolving all annotation disagreements
divide the golden data into training and evaluation sections (50:50)
implement a baseline predictor (ideally in Python) that produces automatic annotations
choose (or develop) an evaluation metric for the task, implement an evaluator and evaluate the baseline's performance, and if it makes sense for your task, evaluate also an oracle score
prepare a shared task web page that explains the annotation task, make the training and test data, the baseline predictor, as well as the evaluation script available on the web page, and present the performance values of your baseline predictor
commit it all into hw/shared-task/ in your git repository for this course; again, only one team member makes the commit
the shared tasks will be briefly presented by the students during one of the subsequent practicals
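
A 50:50 split, a trivial baseline and an evaluator might be sketched as below. Everything here is an assumption for illustration: the data are made-up (instance, label) pairs, the baseline always predicts the most frequent training label, and accuracy stands in for whatever metric suits your task.

```python
import random
from collections import Counter


def split_half(items, seed=0):
    """Shuffle with a fixed seed and split 50:50 into train and eval parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


def majority_baseline(train):
    """Return a predictor that always outputs the most frequent train label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda instance: majority


def accuracy(predict, data):
    """Fraction of (instance, label) pairs that `predict` gets right."""
    return sum(predict(x) == y for x, y in data) / len(data)


# Made-up golden data: (instance, label) pairs with a skewed label distribution.
golden = [(f"instance-{i}", "A" if i % 4 else "B") for i in range(100)]
train, evaluation = split_half(golden)
baseline = majority_baseline(train)
print(f"train={len(train)} eval={len(evaluation)}")
print(f"baseline accuracy = {accuracy(baseline, evaluation):.2f}")
```

Fixing the random seed keeps the split reproducible, so every participant of the shared task is evaluated on the same data.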

6. hw_add_articles

Deadline: 23:59 December 2, 2024 (improvements allowed till December 9) 100 points

Commit your block to hw/add-articles/addarticles.py.
Write a Udapi block tutorial.AddArticles which heuristically inserts English definite and indefinite articles (the, a, an) into a conllu file (where all articles were deleted). Similarly as in the previous homework: F1 score will be used for the evaluation, just with focus='(?i:an?|the)' (note that only the form is evaluated, but it is case sensitive; you can use details=10 to see more than the top 4 casing variants). For removing articles use util.Eval node='if node.upos=="DET" and node.lemma in {"a", "the"}: node.remove(children="rehang")'. Everything else is the same. To get all points for this hw, you need at least 30% F1 (on the secret test set).
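
The focus pattern uses an inline case-insensitive group. The sketch below only illustrates which token forms this regular expression accepts in Python. That the evaluator applies the pattern to the whole form is an assumption made here, not something stated in the assignment.

```python
import re

# The focus pattern quoted in the assignment text.
FOCUS = re.compile(r"(?i:an?|the)")

# Assumption: the pattern must match the whole token form.
is_article = {form: FOCUS.fullmatch(form) is not None
              for form in ["a", "An", "THE", "and", "then"]}

for form, matched in is_article.items():
    print(form, "article" if matched else "other")
```

Under this assumption "The" and "the" both fall within the focus, but they still count as different tokens because the comparison of forms is case sensitive.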

Results (F1) as of 2024-12-09 19:00

NICK	SLOC	DEV	TEST	Points
LY-BEST	23	34.81	40.28
LY-MEDIAN	32	33.14	37.83
BASE	6	12.73	14.09
:)	34	28.31	31.44	100
watticka	38	28.13	31.56	100
bc789	29	29.29	32.86	100
🚀👨‍🚀🌑	22	30.17	33.29	100
-_-	47	31.81	34.21	100
s	21	31.73	35.10	100
🐙	60	34.56	37.46	100
aeiou	18	33.26	38.95	100
😳	17	36.10	40.37	100
mp	17	35.14	40.89

7. hw_hf

Deadline: 23:59 December 23, 2024 100 points

Commit your block to hw/hf/addarticles.py
The task is the same as add-articles, but you should use a model from HuggingFace and you cannot use the morphosyntactic UD annotation (upos, lemma, feats, deprel, parent). Your code should predict articles for sample.conllu in less than 10 minutes using less than 8GiB RAM on a computer in our lab (i5-4570S CPU @ 2.90GHz, no GPU). Everything else is the same (for 100 points, at least 30% F1 is needed on the secret test set). You are encouraged to use your favorite LLM assistant (ChatGPT, Gemini,...) for assistance when solving this homework (generating the source code, suggesting which HuggingFace model to use, etc.).

Results (F1) as of 2024-12-20 17:00

NICK	SLOC	DEV	TEST	Points
LY-BEST	23	34.81	40.28
LY-MEDIAN	32	33.14	37.83
bc789	36	31.47	32.84	100
gpt2	53	43.95	42.11	100
🐙	66	46.17	46.34	100
pls_score_my_1st_hw_😳	19	60.75	59.00	100
Bert :)	44	62.08	61.15	100
roberta-mp	18	61.71	71.76	100

The pool of final written test questions

Basic types of corpora

What is a corpus?
How can you classify corpora? Give at least three classification criteria.
What is an annotation? What kinds of annotation do you know?
Explain terms sentence segmentation and tokenization.
Explain what lemmatization is and for what purpose it is used.
Give examples of problematic situations (from the annotation viewpoint) in sentence segmentation and in tokenization, two examples for each.
Give examples of problematic situations (from the annotation viewpoint) in lemmatization and in tagging, two examples for each.
Explain what a balanced corpus is. Why is this notion problematic?
Explain what POS tagging is and give examples of tag sets. Give examples of situations in which tagging is non-trivial even for a human.
Describe the main sources of variability of POS tag sets across different corpora.
Explain the main property of positional tag sets. Give examples of positional and non-positional tag sets.
Give examples of at least three corpora (of any type). What is their size? (very roughly, order of magnitude is enough; do not forget to mention units)
How is the tokenization produced by the bert-base-cased tokenizer (downloaded from HuggingFace Models Hub) different from the tokenization used in PennTB?

Parallel corpora

What is a parallel corpus?
What types (levels) of alignment can be present in parallel corpora?
Give examples of situations in which document alignment can be problematic.
Give examples of situations in which sentence alignment can be problematic.
Give examples of situations in which word alignment can be problematic.
Give at least three examples of possible sources of parallel texts, and for each source describe expected advantages and disadvantages.

Treebanking

Either assign Penn Treebank POS tags to words in a given English sentence (short tagset documentation of Penn Treebank tags will be available to you), or assign CNK-style morphological tags to words in a given Czech sentence (short tagset documentation will be available to you). You can choose the language.
Draw a dependency tree for a given Czech or English sentence (and mention which annotation scheme you adhere to, e.g. PDT or UD)
Draw a phrase-structure tree for a given Czech or English sentence.
Describe two main types of syntactic trees used in treebanks.
How do we recognize the presence/absence of a dependency relation between two words (in dependency treebanking)? In other words, how does the language manifest (express) dependencies in sentences?
Give at least two examples of situations in which the "treeness assumption" on intra-sentence dependency relations is clearly violated.
Give at least two examples of situations (e.g. syntactic constructions) for which annotation conventions for dependency analysis must be chosen since there are multiple solutions possible that are similarly good from the common sense view.
Why is coordination difficult to capture in dependency trees (compared to e.g. predicate-argument structure)?

Define non-projectivity in dependency trees and provide an example.

Universal Dependencies

How are Universal Dependencies different from other treebanks?
Describe the CoNLL-U format used in Universal Dependencies.
When working with Universal Dependencies which tools are suitable for automatic parsing, manual annotation, querying, automatic transformations and validity checking? Name at least one tool for each task.

Other phenomena for which annotated corpora exist

Explain what coreference is and how it can be annotated.
Explain what named entities are and how they can be annotated.
Explain what sentiment (in the context of NLP) is and how it can be annotated.

Lexical data resources

What is WordNet? What do its nodes and edges represent?
What is a synset?
What is polysemy? Give examples.
Give an example of an NLP tool/lexicon that captures inflectional morphology, explain what it can be used for and describe its main properties.
Give an example of an NLP tool/lexicon that captures derivational morphology, explain what it can be used for and describe its main properties.
What is valency? Give an example of a data resource that captures valency and describe its main properties.

Other resources

Name at least two data repositories where NLP data and models are stored.

Evaluation

In the context of NLP evaluation, explain the intrinsic/extrinsic distinction.
Give at least two examples of situations in which measuring a percentage accuracy is not adequate.
Explain the notions of precision and recall (formulas needed).
What is the precision-recall tradeoff?
What is F-measure and what is it useful for? (formula needed)
Why is the arithmetic mean not used for combining precision and recall?
What is k-fold cross-validation?
Explain BLEU (the exact formula not needed, just the main principles).
Explain the purpose of brevity penalty in BLEU.
What is an oracle experiment?
Give examples of baseline solutions for at least three distinct NLP tasks, one for each (you can choose any).
Give examples of three distinct baseline solutions for a single NLP task (you can choose any).
What is Labeled Attachment Score (in parsing)?
What is precision at K?
What is Word Error Rate (in speech recognition)?
What is inter-annotator agreement? How can it be measured?
What is Cohen's kappa?

Licensing

In the Czech legal system, if you create an artifact, who/what protects your author's rights?
In the Czech legal system, if you create an artifact, what should you do in order to allow an efficient protection of your author's rights?
In the Czech legal system, if you create an artifact and you want to make it usable by anyone for free, what should you do?
In the Czech legal system, what are the implications of attaching a copyright notice (e.g. "(C)opyright Josef Novák, 2018") compared to simply mentioning the author's name?
What is the difference between moral and economic authors' rights? How can you transfer them to some other person/entity?
Explain main features of GNU GPL.
Explain main features of Creative Commons.
There are four on-off elements defined in the Creative Commons license family (by, nc, sa, nd). Why does this not lead to 2⁴=16 possible licenses?
Explain the difference between copyleft licenses and permissive licenses.
Give two examples of copyleft licenses.
Give two examples of permissive (non-copyleft) licenses.

Homework assignments

There will be 7 homework assignments.
For most assignments, you will get points, up to a given maximum (the maximum is specified with each assignment).
- If your submission is especially good, you can get extra points (up to +10% of the maximum).
Most assignments will have a fixed deadline (usually in two weeks).
If you submit the assignment after the deadline, you will get:
- up to 50% of the maximum points if it is less than 2 weeks after the deadline;
- 0 points if it is more than 2 weeks after the deadline.
Once we check the submitted assignments, you will see the points you got and the comments from us in:
- Studijní mezivýsledky module in the Czech version of SIS
- Study group roster module in the English version of SIS
To pass the course, you need to get at least 50% of the total points from the assignments.
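The late-submission rules above can be sketched as a small function. This is only an illustration, not the instructors' actual grading script: the function name `max_points_available` is hypothetical, the possible +10% bonus is ignored, and a delay of exactly two weeks (which the rules do not define) is assumed here to yield 0 points.

```python
from datetime import datetime, timedelta


def max_points_available(maximum: float, deadline: datetime, submitted: datetime) -> float:
    """Return the cap on points for a submission (bonus points not included)."""
    delay = submitted - deadline
    if delay <= timedelta(0):
        # Submitted on time: the full maximum is available.
        return maximum
    if delay < timedelta(weeks=2):
        # Less than two weeks late: capped at 50% of the maximum.
        return 0.5 * maximum
    # Two weeks late or more (assumption for the exact boundary): no points.
    return 0.0


deadline = datetime(2024, 11, 19, 10, 0)  # illustrative deadline only
print(max_points_available(100, deadline, deadline + timedelta(days=3)))
```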

Test

There will be a written test (75 minutes) at the end of the semester.
To pass the course, you need to get at least 50% of the total points from the test.
You can find a sample of test questions on the website.

Grading

Your grade is based on the average of your performance; the test and the homework assignments are weighted 1:1.

≥ 90%: 1
≥ 70%: 2
≥ 50%: 3
< 50%: 4

For example, if you get 360 out of 600 points for homework assignments (60%) and 36 out of 40 points for the test (90%), your total performance is 75% and you get a 2.
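The computation in this example can be reproduced with a short sketch. The function name `final_grade` is hypothetical. The mapping of the thresholds to grades 1, 2, 3 and 4 (fail) is an assumption based on the usual Czech scale, consistent with the worked example (75% gives a 2). The separate 50% minimum on each component comes from the Homework assignments and Test sections.

```python
def final_grade(hw_points: float, hw_max: float, test_points: float, test_max: float):
    """Return (total percentage, grade) with homework and test weighted 1:1."""
    hw_pct = 100 * hw_points / hw_max
    test_pct = 100 * test_points / test_max
    total = (hw_pct + test_pct) / 2  # 1:1 weighting = plain average

    # Each component must reach at least 50% on its own to pass.
    if hw_pct < 50 or test_pct < 50:
        return total, 4
    if total >= 90:
        return total, 1
    if total >= 70:
        return total, 2
    if total >= 50:
        return total, 3
    return total, 4


print(final_grade(360, 600, 36, 40))  # the example from the text: (75.0, 2)
```

Note that averaging the two percentages is not the same as pooling the raw points: 396 out of 640 points would be about 62%, because the homework maximum is much larger than the test maximum.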

No cheating

Cheating is strictly prohibited and any student found cheating will be punished. The punishment can involve failing the whole course, or, in grave cases, being expelled from the faculty.
Discussing homework assignments with your classmates is OK. Sharing code is not OK (unless explicitly allowed); by default, you must complete the assignments yourself.
All students involved in cheating will be punished. E.g. if you share your assignment with a friend, both you and your friend will be punished.

2018 http://ufal.mff.cuni.cz/~zabokrtsky/courses/npfl070/html/
