Czech title: Vybrané derivační vztahy pro automatické zpracování češtiny

Postdoc project GA ČR P406/12/P175

Principal investigator: Magda Ševčíková

2012–2014

The project deals with selected word-formation relations in Czech, namely relations between adjectives and their derivatives. Based on their semantic relation to the base adjective, derivatives were classified into two groups: syntactic derivatives (which have the same meaning as their base adjectives but differ in syntactic function) and lexical derivatives (which differ from the base adjectives in meaning). Building on these theoretical findings, an annotation proposal reflecting the derivational relations was integrated into the deep-syntactic annotation of the Prague Dependency Treebank (PDT) and included in PDT 3.0.

A database of words derived from adjectives (AdjDeriNet) was created under the project:

 

AdjDeriNet: Words Derived from Adjectives in Czech

Authors: Magda Ševčíková, Zdeněk Žabokrtský

The data consists of pairs of base adjectives and their derivatives. It contains 17,942 base adjectives (1st column in the TSV file; source_lemma in the XML file) that are base words for 26,329 lexemes of several parts of speech (2nd column in TSV; target_lemma in XML); the part of speech of the derivative is specified (3rd column in TSV; target_pos in XML):

  • 14,058 deadjectival adverbs (D)
  • 11,443 deadjectival nouns (N)
  • 609 deadjectival adjectives (A)
  • 219 deadjectival verbs (V)
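The per-category counts above can be recomputed from the tab-separated file. The following is a minimal sketch, assuming the three-column layout described above (base adjective, derivative, part-of-speech tag) and no header row. The function names and the inline sample are illustrative only and are not part of the distributed data; the tags assigned to the sample rows are an assumption based on the D/N/A/V scheme.

```python
import csv
import io
from collections import Counter

# Illustrative sample in the assumed three-column layout:
# base adjective <TAB> derivative <TAB> part-of-speech tag (D, N, A, V).
SAMPLE_TSV = (
    "mladý\tmladě\tD\n"
    "mladý\tmládí\tN\n"
    "mladý\tmladinký\tA\n"
    "mladý\tmladit\tV\n"
    "černý\tčerně\tD\n"
    "černý\tčernost\tN\n"
)


def read_pairs(handle):
    """Yield (base, derivative, pos) triples from a tab-separated stream."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        yield row[0], row[1], row[2]


def count_by_pos(pairs):
    """Count derivatives per part-of-speech tag."""
    return Counter(pos for _base, _derivative, pos in pairs)


pairs = list(read_pairs(io.StringIO(SAMPLE_TSV)))
pos_counts = count_by_pos(pairs)
for tag, count in sorted(pos_counts.items()):
    print(tag, count)
```

To run this on the real data, replace `io.StringIO(SAMPLE_TSV)` with a file opened with `encoding="utf-8"` and `newline=""`; the actual file name depends on the downloaded package.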

The most productive base adjectives are:

  • mladý: 14 derivatives (mladě, mládek, mládě, mladík, mládí, mladina, mladinký, mladit, mládnout, mlaďoch, mlado, mladost, mlaďoučký, mlaďounký)
  • černý: 12 derivatives (černat, černě, černice, černík, černit, černoch, černo, černost, černota, černouš, černucha, černýš)
  • světlý: 12 derivatives
  • žlutý: 12 derivatives
  • starý: 11 derivatives
  • blbý: 10 derivatives
  • červený: 10 derivatives
  • zelený: 10 derivatives
  • zlatý: 10 derivatives
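A ranking of this kind can be obtained by grouping derivatives under their base adjective and sorting by group size. The sketch below assumes the pairs have already been loaded as (base, derivative) tuples. The sample is a small subset of the derivatives listed above, so the counts it yields are not the real ones, and `rank_bases` is a hypothetical helper name.

```python
from collections import defaultdict

# A small subset of the pairs listed above, for illustration only.
SAMPLE_PAIRS = [
    ("mladý", "mladě"),
    ("mladý", "mládí"),
    ("mladý", "mladík"),
    ("černý", "černě"),
    ("černý", "černoch"),
]


def rank_bases(pairs):
    """Return (base, number of distinct derivatives), most productive first."""
    derivatives = defaultdict(set)
    for base, derivative in pairs:
        derivatives[base].add(derivative)
    # Sort by descending count, then alphabetically to make ties deterministic.
    return sorted(
        ((base, len(items)) for base, items in derivatives.items()),
        key=lambda item: (-item[1], item[0]),
    )


ranking = rank_bases(SAMPLE_PAIRS)
for base, count in ranking:
    print(f"{base}: {count} derivatives")
```

Collecting derivatives in a set means a pair repeated in the input is counted only once.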

The list of the most productive affixes:

  • 10,473 -e/-ě
  • 7,966 -ost
  • 3,575 -y
  • 689 -ství/-ctví

Nouns ending in -as (e.g. kliďas ‘phlegmatic person’), verbs in -at (zelenat ‘to turn green’), and adjectives with the suffix -ičký (maličký ‘very small’) are among the least frequent deadjectival derivatives in the database.

The development procedure focused on precision rather than recall (for instance, prefixation and prefixation combined with suffixation were omitted).

The AdjDeriNet data can be downloaded from the LINDAT/CLARIN infrastructure in a simple plain-text format (tab-separated columns) or in a self-documenting XML-based format. It can be used under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License (CC BY-NC-SA 3.0).