NPFL070 - Language Data Resources

Course schedule overview

  1. Introduction
  2. Corpora, esp. Czech National Corpus
  3. Treebanks
  4. Parallel corpora
  5. Resources related to lexical semantics
  6. Named entity, coreference and anaphora, and discourse corpora
  7. Using data for evaluation

More detailed course schedule

  1. Introduction
  2. Corpora - Case Study: the Czech National Corpus
  3. Czech National Corpus, cont.
  4. Treebanking
  5. Universal dependencies, Udapi (by Martin Popel)
  6. Udapi, cont. (by Martin Popel)
  7. Parallel corpora
  8. Using annotated data for evaluation
  9. Lexical resources
  10. Parsing (by Martin Popel)
  11. Licensing

Additional material

Possible types of errors in Czech morphologically tagged corpora

You can use the following list either directly or just for inspiration.

  1. word form "se" - search for corpus positions where "se" is tagged as a vocalized preposition but is in fact a reflexive pronoun (or vice versa)
  2. word form "jí" - conjugated form of the verb "jíst" (to eat) wrongly tagged as a pronoun, or vice versa
  3. surnames derived from verbs (such as "Pospíšil") - such surnames might be incorrectly tagged as verbs (or vice versa)
  4. forms "a" and "A" - find corpus positions where "a" is wrongly tagged as a coordinating conjunction (it could be the English article, a physical unit, an itemizer, etc.)
  5. "weird imperatives" - search for tokens incorrectly tagged as imperatives (such as "leč", which is more likely to be a conjunction)
  6. search for errors caused by homonymy between some verbs and adjectives (e.g. the form "zelená" can be an adjective or a verb)
  7. search for tokens incorrectly tagged as vocalized prepositions (e.g. in cases in which the following word does not require any vocalization of the preceding preposition)
  8. search for tokens whose tags indicate the locative case (6th case); hint: this case can appear only in prepositional groups in Czech
  9. search for errors based on the fact that each preposition should be followed, somewhere later in the clause, by a word form which 'saturates' the preposition and indicates the same morphological case
  10. word form "ty" - search for places in which "ty" is tagged as a personal pronoun, but in fact it is a demonstrative pronoun (or vice versa)
  11. word form "ti" - analogously to the previous item
  12. swap of nominative and accusative - search for nouns (or other parts of speech) with accusative indicated in the POS tag, even if they should be tagged as nominatives (or vice versa)
  13. "weird vocatives" - search for tokens incorrectly tagged as vocative forms of nouns
  14. two finite verbs close to each other - search for wrongly tagged tokens using the fact that in Czech there should not be two or more finite verb forms in a single clause (but there can be complex verb forms)
  15. foreign words - search for foreign words incorrectly tagged as forms of obviously unrelated Czech words (such as "line" in "on-line" tagged as a present-tense form of the verb "linout", or a German article tagged as a form of the Czech verb "drát")
  16. wrong clitics - search for tagging errors using the fact that Czech clitics (several short words such as "by", "ti", "mi", etc.) should appear in the so-called second position (Wackernagel's position) in a sentence
  17. confusion of prepositions and other parts of speech - find tokens wrongly tagged as prepositions which are in fact nouns or adverbs (homonymous forms such as kolem/kolem/kolem, místo/místo)
  18. search for corpus spots with incorrectly segmented sentences
  19. search for corpus spots with incorrect tokenization (such as "... sejí ..." instead of "... se jí ...")
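
As an illustration of how one of the heuristics above (item 8, locative without a preposition) might be turned into a search, here is a minimal Python sketch. It assumes that each token carries a 15-position Prague-style positional tag in which position 1 is the part of speech ("R" for prepositions) and position 5 is the case ("6" for locative). The function name, the window size and the sample sentences are hypothetical and only serve the example; a real search would more likely be run as a corpus query or over the actual corpus files.

```python
from typing import List, Tuple

# A token is a (word form, positional tag) pair; the tag layout is an assumption:
# position 1 = part of speech, position 5 = case.
Token = Tuple[str, str]


def suspicious_locatives(sentence: List[Token], window: int = 5) -> List[int]:
    """Return indices of non-preposition tokens tagged as locative (case 6)
    that have no preposition among the `window` preceding tokens."""
    hits = []
    for i, (_form, tag) in enumerate(sentence):
        if len(tag) < 5 or tag[4] != "6":
            continue  # not tagged as locative
        if tag.startswith("R"):
            continue  # the preposition itself may carry case 6
        left_context = sentence[max(0, i - window):i]
        if not any(t.startswith("R") for _f, t in left_context):
            hits.append(i)
    return hits


# Made-up sample data, not taken from any corpus.
ok_sentence = [
    ("V", "RR--6----------"),
    ("Praze", "NNFS6-----A----"),
    ("prší", "VB-S---3P-AA---"),
]
odd_sentence = [
    ("Praze", "NNFS6-----A----"),
    ("prší", "VB-S---3P-AA---"),
]

for name, sent in [("ok", ok_sentence), ("odd", odd_sentence)]:
    flagged = [sent[i][0] for i in suspicious_locatives(sent)]
    print(name, flagged)
```

Every hit is only a candidate: the window is a crude stand-in for the prepositional group, so the flagged positions still need manual inspection before they are counted as tagging errors.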

Course passing requirements

In short:

Homework tasks

Premium tasks

Final test

Determination of the final grade

The student's final grade will be determined by the number of points collected during the semester:

Grading scheme:

Homework results

     corp tag-err heads adpos comma artic annot parse
     hw01    hw02  hw03  hw04  hw05  hw06  hw07  hw08
KD    100     100   100    75   100   100  100     86
EL    100     100  100   100   100  100     88
NM    100     100   100   100   100   100  100    100
JV    100     100   100   100   100   100  100    100