NPFL070 - Examples of final written test questions

Questions on basic types of corpora

What is a corpus?
How can you classify corpora? Give at least three criteria.
What is an annotation? What kinds of annotation do you know?
Explain the terms sentence segmentation and tokenization. Give examples of non-trivial situations.
Explain what lemmatization is and why it is used.
Explain what a balanced corpus is. Why is this notion problematic?
Explain what POS tagging is and give examples of tag sets. Give examples of situations in which tagging is non-trivial even for a human.
Explain the main sources of variability of POS tag sets across different corpora.
Explain the main property of positional tag sets. Give examples of positional and non-positional tag sets.
Give examples of at least three corpora (of any type). What is their size? (very roughly, order of magnitude is enough; do not forget to mention units)

What is a parallel corpus?
What types (levels) of alignment can be present in parallel corpora?
Give examples of situations in which document alignment can be problematic.
Give examples of situations in which sentence alignment can be problematic.
Give examples of situations in which node alignment can be problematic.
Give at least three examples of possible sources of parallel data, and for each source describe its expected advantages and disadvantages.

Either assign Penn Treebank POS tags to words in a given English sentence (short tagset documentation of Penn Treebank tags will be available to you), or assign CNK-style morphological tags to words in a given Czech sentence (short tagset documentation will be available to you). You can choose the language.
Draw a dependency tree for a given Czech or English sentence.
Draw a phrase-structure tree for a given Czech or English sentence.
Name at least four treebanks and describe their main properties.
Describe two main types of syntactic trees used in treebanks.
What is a trace (in phrase-structure trees)?
How do we recognize the presence or absence of a dependency relation between two words (in dependency treebanking)?
Give at least two examples of situations in which the "treeness assumption" on intra-sentence dependency relations is clearly violated.
Give at least two examples of situations (e.g. syntactic constructions) for which annotation conventions for dependency analysis must be chosen because several solutions are possible that are similarly good from a common-sense point of view.
Why is coordination difficult to capture in dependency trees (compared to e.g. predicate-argument structure)?

How are Universal Dependencies different from other treebanks?
Describe the CoNLL-U format used in Universal Dependencies.
When working with Universal Dependencies, which tools are suitable for automatic parsing, manual annotation, querying, automatic transformations, and validity checking? Name at least one tool for each task.
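
As a study aid for the CoNLL-U question above, here is a minimal sketch of reading the format in Python. It assumes the usual layout: ten tab-separated columns per token line, comment lines starting with "#", and a blank line after each sentence. The function name parse_conllu and the hand-written SAMPLE sentence (with simplified features) are illustrative assumptions, not part of any official tool, and the sketch does not treat multiword-token ranges or empty nodes specially.

```python
# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Sentence-level comment (e.g. "# text = ..."); skipped here.
            continue
        else:
            columns = line.split("\t")
            current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences


# Hand-written, simplified example sentence (an assumption for illustration).
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for token in parse_conllu(SAMPLE)[0]:
        print(token["id"], token["form"], token["upos"],
              token["head"], token["deprel"])
```

Each token points to its parent through the head column (0 marks the root), so a dependency tree can be rebuilt from the id and head values alone.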

What is WordNet? What do its nodes and edges represent?
What is EuroWordNet? How does the interlinking through the hub language work?
What is a synset?
What is polysemy? Give examples.
Explain the difference between the notions of polysemy and homography. Why is this distinction non-trivial to make?
Give an example of an NLP tool/lexicon that captures inflectional morphology, explain what it can be used for and describe its main properties.
Give an example of an NLP tool/lexicon that captures derivational morphology, explain what it can be used for and describe its main properties.
What is valency? Give an example of a data resource that captures valency and describe its main properties.

Give at least two examples of situations in which measuring percentage accuracy is not adequate.
Explain the terms precision and recall.
What is the F-measure? What is it useful for?
What is k-fold cross-validation?
Explain BLEU (the exact formula not needed, just the main principles).
Explain the purpose of brevity penalty in BLEU.
What is Labeled Attachment Score (in parsing)?
What is Word Error Rate (in speech recognition)?
What is inter-annotator agreement? How can it be measured?
What is Cohen's kappa?
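
For checking answers to the evaluation questions above on small examples, here is a minimal sketch of precision, recall, the balanced F-measure (F1), and Cohen's kappa for two annotators. The function names and the toy label sequences are assumptions made for illustration. Kappa is computed as (observed agreement - chance agreement) / (1 - chance agreement), with chance agreement estimated from each annotator's own label distribution.

```python
from collections import Counter


def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for one class treated as 'positive'."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label
    # if each annotator labels independently with their own distribution.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    gold = ["N", "N", "V", "V"]       # toy data, not from any corpus
    predicted = ["N", "V", "V", "V"]
    print(precision_recall_f1(gold, predicted, positive="V"))
    print(cohen_kappa(gold, predicted))
```

On the toy data, raw agreement is 0.75 but kappa is only 0.5, because half of the agreement would be expected by chance.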

In the Czech legal system, if you create an artifact, who/what protects your author's rights?
In the Czech legal system, if you create an artifact, what should you do in order to allow an efficient protection of your author's rights?
In the Czech legal system, if you create an artifact and you want to make it usable by anyone for free, what should you do?
In the Czech legal system, what are the implications of attaching a copyright notice (e.g. "(C)opyright Josef Novák, 2018") compared to simply mentioning the author's name?
What is the difference between moral and economic authors' rights? How can you transfer them to some other person/entity?
Explain the main features of the GNU GPL.
Explain the main features of Creative Commons.
There are four on-off elements defined in the Creative Commons license family (by, nc, sa, nd). Why does this not lead to 2⁴ = 16 possible licenses?