Czech Academic Corpus

The Czech Academic Corpus (CAC) was created by a team from the Institute of the Czech Language of the ASCR, led by Marie Těšitelová from 1971 to 1985. The corpus was originally built to support a frequency dictionary of the Czech language, and its original name was “Korpus věcného stylu” (Practical corpus). The corpus was manually annotated for morphology and syntax.

Independently of the CAC, annotation of the Prague Dependency Treebank (PDT) began in 1996. The idea of converting the CAC’s internal format and annotation scheme to those of the PDT emerged during work on the second version of the PDT. The main goal was to make the CAC and the PDT fully compatible and thus enable the integration of the CAC into the PDT.

CAC offers:

  • For linguists: Language material reflecting the real usage of the language,

  • For computational linguists: Tools and a considerable amount of data that can help improve natural language applications, which are not feasible without morphological and syntactic text processing,

  • For TrEd annotation tool users: The possibility to use voice control for the tool,

  • For teachers and their students: An interesting didactic tool for practising Czech language morphology and syntax.

CAC 1.0

After converting the internal format and the morphological annotation scheme, we published the first version of the CAC (Vidová Hladká et al., 2007). Visit the CAC 1.0 guide.

CAC 2.0

The second version enriches the CAC with surface syntax annotation; in PDT terminology, this annotation is called the “analytical layer”. Visit the CAC 2.0 guide.