Czech Academic Corpus

Present

Past

Exactly ten years ago, the statistical corpus-based approach known from machine translation was first combined with corpus-based modeling in Czech language processing itself. We really liked the idea of letting computers calculate "some" numbers reflecting "some" linguistic properties of text and then assign these properties to a text of our choice. At that time, we were lucky to have at our disposal a source of "some" numbers reflecting the morphological and syntactic-analytic properties of Czech. We had the Czech Academic Corpus (CAC).

CAC was created during the 1970s and 1980s at the Institute of Czech Language under the supervision of Marie Těšitelová. The main motivation for building it (a total of 550 thousand word tokens) was to obtain quantitative characteristics of present-day Czech. Accordingly, CAC is structured as a corpus annotated on two layers: (i) a morphological layer (lowest) with full morphological annotation; (ii) an analytic layer (middle) with superficial (surface) syntactic annotation.
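The two-layer structure described above can be pictured as a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Token` class, its field names, the label strings, and the example sentence are hypothetical and do not reproduce the actual CAC tag set or file format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    form: str             # word form as it appears in the text
    morph: str            # morphological layer: placeholder label, not a real CAC tag
    head: Optional[int]   # analytic layer: index of the governing token, None for the root
    function: str         # analytic layer: placeholder syntactic function label


# Hypothetical annotated sentence: "Petr čte knihu" ("Petr reads a book").
sentence: List[Token] = [
    Token("Petr", "noun|nominative|singular", 1, "subject"),
    Token("čte", "verb|present|3rd-person|singular", None, "predicate"),
    Token("knihu", "noun|accusative|singular", 1, "object"),
]


def root(tokens: List[Token]) -> Token:
    """Return the token that has no governor on the analytic layer."""
    return next(t for t in tokens if t.head is None)


def dependents(tokens: List[Token], head_index: int) -> List[str]:
    """Return the forms of all tokens governed by the token at head_index."""
    return [t.form for t in tokens if t.head == head_index]


if __name__ == "__main__":
    print(root(sentence).form)
    print(dependents(sentence, 1))
```

In this sketch each token carries both layers at once: the `morph` field stands for the morphological layer, while `head` and `function` together stand for the surface-syntactic (analytic) layer.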

The CAC consists of 180 texts, each containing 3000 word tokens on average. The texts are sampled from three different categories:

Original annotation scheme

At the end of the 1990s, work on the Prague Dependency Treebank (PDT) started (independently of the CAC), and its first release was published in 2001. With the next releases of the PDT in mind, we decided to convert the CAC into a PDT-like format. The conversion concerns only those CAC annotations that are relevant to the PDT annotations.