Czech Academic Corpus
Present
- October 2008: The Czech Academic Corpus version 2.0 has been released by the Linguistic Data Consortium.
- September 2007: The Czech Academic Corpus version 1.0 has been released.
Past
Exactly ten years ago, the idea of a statistical corpus-based approach from machine translation was first joined with the idea of corpus-based modeling within Czech language processing itself. We really liked the idea of letting computers calculate "some" numbers reflecting "some" linguistic properties of text and then assign these properties to a text of our choice. At that time, we were lucky to have at our disposal a source of "some" numbers reflecting morphological and syntactic-analytic properties of Czech. We had the Czech Academic Corpus (CAC).
CAC was created during the 1970s and 1980s at the Institute of Czech Language under the supervision of Marie Těšitelová. The main motivation for building it (a total of 550 thousand word tokens) was to obtain the quantitative characteristics of present-day Czech. Thus, the structure of CAC corresponds to that of a two-layer annotated corpus: (i) morphological layer (lowest) - full morphological annotation; (ii) analytic layer (middle) - superficial (surface) syntactic annotation.
The CAC consists of 180 texts, each containing 3000 word tokens on average. The texts are sampled from three different categories:
- newspapers (52 written and 8 spoken texts)
- scientific documents (68 written and 32 spoken texts)
- administrative documents (15 written and 5 spoken texts)
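As a quick consistency check on the figures above, the following minimal Python sketch tallies the written and spoken texts per category and estimates the corpus size from the stated average. The names `composition`, `AVG_TOKENS_PER_TEXT` and `summarize` are illustrative assumptions and are not part of any CAC tooling.

```python
# Composition of the CAC as stated in the text:
# category -> (number of written texts, number of spoken texts)
composition = {
    "newspapers": (52, 8),
    "scientific documents": (68, 32),
    "administrative documents": (15, 5),
}

# Approximate average text length given in the text (an average, not an exact value).
AVG_TOKENS_PER_TEXT = 3000


def summarize(comp, avg_tokens=AVG_TOKENS_PER_TEXT):
    """Return totals of written and spoken texts and a rough token estimate."""
    written = sum(w for w, _ in comp.values())
    spoken = sum(s for _, s in comp.values())
    texts = written + spoken
    return {
        "written": written,
        "spoken": spoken,
        "texts": texts,
        "approx_tokens": texts * avg_tokens,
    }


summary = summarize(composition)
print(summary)
```

The category counts add up to the 180 texts mentioned above, and the rough estimate of 540 thousand tokens is in line with the stated total of 550 thousand, given that 3000 is only an approximate average.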
Original annotation scheme
At the end of the 1990s, work on the Prague Dependency Treebank (PDT) started (independently of the CAC), and its first release was published in 2001. Looking ahead to the next releases of PDT, we decided to convert the CAC into a PDT-like format. The conversion concerns only those CAC annotations that are relevant to the PDT annotations.
