RAW TEXTS

The electronic source texts were provided by the Institute of the Czech National Corpus. All data originate from news articles published in the daily newspaper Lidove noviny, 1994-1995. The internal format of the data corresponds to SGML markup with the CSTS document type definition (csts.dtd).

The texts are split into subdirectories according to their source, that is, the particular SECTION of the Lidove noviny newspaper the articles come from.

D/ ... main section - daily (1991, 1992, 1993, 1994)
FI/ ... sport section - weekly (1992, 1993, 1994)
FN/ ... financial section - weekly (1993, 1994)
KP/ ... cultural section - weekly (1992, 1993, 1994)
LN/ ... digest section "Literarni noviny" - weekly (1991, 1992)
ML/ ... national section "Moravske listy" - weekly (1993)
NP/ ... Sunday section - weekly (1991, 1992, 1993, 1994)
1994/ ... main section - daily (1994)
1995/ ... main section - daily (1995)
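
The directory layout above can be sketched as a small lookup table, which may be handy when scripts need to label files by section. This is only an illustrative sketch: the names SECTIONS, describe and count_files are hypothetical and not part of the distribution, and the corpus root path is an assumption the user has to supply.

```python
from pathlib import Path

# Subdirectory name -> (section, frequency, years), copied from the list above.
# SECTIONS, describe() and count_files() are hypothetical helper names.
SECTIONS = {
    "D": ("main section", "daily", (1991, 1992, 1993, 1994)),
    "FI": ("sport section", "weekly", (1992, 1993, 1994)),
    "FN": ("financial section", "weekly", (1993, 1994)),
    "KP": ("cultural section", "weekly", (1992, 1993, 1994)),
    "LN": ('digest section "Literarni noviny"', "weekly", (1991, 1992)),
    "ML": ('national section "Moravske listy"', "weekly", (1993,)),
    "NP": ("Sunday section", "weekly", (1991, 1992, 1993, 1994)),
    "1994": ("main section", "daily", (1994,)),
    "1995": ("main section", "daily", (1995,)),
}


def describe(subdir):
    """Return a one-line description of a subdirectory name."""
    section, frequency, years = SECTIONS[subdir]
    return f"{section}, {frequency} ({', '.join(map(str, years))})"


def count_files(root):
    """Count regular files under each known subdirectory of root.

    Subdirectories that do not exist under root are reported as 0.
    """
    counts = {}
    for name in SECTIONS:
        directory = Path(root) / name
        if directory.is_dir():
            counts[name] = sum(1 for p in directory.rglob("*") if p.is_file())
        else:
            counts[name] = 0
    return counts
```

For example, describe("FN") yields the label for the financial section, and count_files applied to the local copy of the raw texts gives a quick check that every section directory is present and non-empty.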

The data contain over 39 million tokens in total (words proper plus punctuation) in about 2,385,000 sentences.

DATA LOCATION