This Document Type Definition specifies the Czech National Corpus SGML markup scheme, as used (with extensions) in various derived corpora, most notably in the Prague Dependency Treebank. The pure DTD file is called csts.dtd and the declaration file is called csts.dcl; there is also a version for direct use with the nsgmls software (csts.doctype), and a description file which can be fed into the dtd2html program, csts.desc, which produced these HTML documentation pages.

The SGML document name. It contains an (optional) <h>eader, with bibliographic (source-of-text) and annotation/annotator information, and a number of <doc>uments (sometimes just one, sometimes hundreds) which share the same header and which contain the text proper. The <csts> tag cannot be omitted, not even at the end of the SGML document (whereas most other embedded tags can be).

It has only one attribute (lang), which specifies the default language of the contents; a different language can be specified in the <a> tag within each <doc>ument.

The default language for the whole SGML document. Can be redefined in the <a> tag within each <doc> element.
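The default-and-override rule described above can be sketched in a few lines of Python. This is only an illustration, not part of the DTD: the function name effective_lang and the language values used when calling it are assumptions made for the example.

```python
from typing import Optional


def effective_lang(csts_lang: Optional[str], doc_lang: Optional[str]) -> Optional[str]:
    """Return the language that applies to a single <doc>.

    csts_lang: value of the lang attribute on <csts> (file-wide default).
    doc_lang:  language given in the document's <a> header, or None if absent.
    """
    # A language given in the document header overrides the file-wide default.
    return doc_lang if doc_lang is not None else csts_lang
```

With a hypothetical default of "cs", a document that specifies nothing resolves to "cs", while a document whose header specifies "en" resolves to "en".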

An optional file header, as the first element of the <csts> element. It contains information about the source of the file, and about the markup performed on the file so far.

Short identification of the source of the data. Usually contains a publisher's common name, such as "Lidove noviny". Please note that this element's closing tag cannot be omitted.

Default markup information for the file. Might be (logically) superseded by the element of the same name in the <doc> / <a> header. It is meant to contain the name of the (main) author of the markup, the date, and a description of the markup performed on the file, whether automatically or manually. The latest markup information comes first in the sequence of markup elements.

Author of markup. Human-readable full name(s) of the author(s) (or of the main author if too many people were involved), or of person(s) who can provide further information, documentation and/or software concerning the markup.

Free-format specification of the date/time the markup was performed. Any human-readable format of the specification is acceptable, such as "01-Dec-1997", "1998-11-30", "Fri Oct 1 10:41:18 1999", or even "spring 1997". Regionally-specific (and thus globally-ambiguous) date formats should be avoided (such as "7/4/2001").
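Since the format is free, a tool cannot validate the value, but it can at least warn about the kind of all-numeric date the text advises against. The following Python sketch is one possible heuristic, not a rule of the DTD; the function name is_ambiguous_date and the choice of separators ("/" and "-") are assumptions.

```python
import re

# All-numeric dates with one- or two-digit leading fields, such as "7/4/2001",
# can be read as day/month or month/day, so they are flagged.
_AMBIGUOUS_DATE = re.compile(r"^\s*\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\s*$")


def is_ambiguous_date(text: str) -> bool:
    """Return True if text looks like a regionally ambiguous numeric date."""
    return _AMBIGUOUS_DATE.match(text) is not None
```

Under this heuristic "7/4/2001" is flagged, while "01-Dec-1997", "1998-11-30", "Fri Oct 1 10:41:18 1999" and "spring 1997" are not.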

Free-text description of what has been done to the data. English is preferred as the language of the description. Several mdesc elements might be present within a <markup> element. Even though everything is optional, useful information should be recorded here, such as additional people involved, software used (with version id, parameter settings, etc.), and/or the environment in which the processing has occurred.

A document. Each file contains one or more documents. Typically, there is one document per file for books, ephemerals, poems, etc., but possibly hundreds of documents per file for a newspaper, where one file contains the whole daily issue and each document corresponds to an article.

A document is identified by a (numerical) id attribute (documents are simply numbered within a file, starting at 1). For ease of local reference, the name of the file in which the document resides is repeated at every document in the file, in the file attribute. The full path within the archive is used for the file reference, even though care has been taken to identify all files in the CNC (and thus in the PDT as well) uniquely by the filename alone.

The document contains one header (<a>) and its contents (<c>). The header contains information about the genre, time period, and other bibliographical and classification information, as well as additional markup processing information (if any). The contents then contain a sequence of paragraphs (<p>) and, within the paragraphs, sentences (<s>) containing the linguistic material proper.

The name of the file in which the document resides (in the case of texts from the CNC, it is the filename under which the document is stored in the "Bank of the CNC"). The full path from the archive root should be included (and it is included for files coming from the CNC). If the filename is unique, the filename alone should be sufficient. In the case of the CNC, however, the full path is given even though the filenames are unique, and it has the following structure:

Part 1 (top level directory): taken from the <mod> element,
Part 2: taken from the <txtype> element,
Part 3: taken from the <med> element,
Part 4: taken from the <temp> element,
Last part: the filename, which is also recorded in the <opus> element.

A typical value of the file attribute is thus

s/inf/nws/1994/ln94164
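The path structure above can be illustrated with a short Python sketch that assembles and splits a file attribute value. It assumes exactly five components separated by "/", as in the typical value shown; the function names build_file_attribute and split_file_attribute are hypothetical and not part of any CNC or PDT tool.

```python
from typing import Dict


def build_file_attribute(mod: str, txtype: str, med: str, temp: str, opus: str) -> str:
    """Join the five header values into a path from the archive root."""
    return "/".join([mod, txtype, med, temp, opus])


def split_file_attribute(path: str) -> Dict[str, str]:
    """Split a file attribute value back into its five named parts.

    Raises ValueError if the path does not have exactly five components.
    """
    parts = path.split("/")
    if len(parts) != 5:
        raise ValueError(f"expected 5 path components, got {len(parts)}: {path!r}")
    mod, txtype, med, temp, opus = parts
    return {"mod": mod, "txtype": txtype, "med": med, "temp": temp, "opus": opus}
```

Applied to the typical value, the split yields mod "s", txtype "inf", med "nws", temp "1994" and opus "ln94164"; joining those five values reproduces the original string.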

A document identification (a decimal number). Documents are identified uniquely within a file.

Documents are initially numbered continuously and densely (starting at 1 in a file), but some of them might disappear or be added during subsequent markup and processing. Therefore, nothing is guaranteed but the fact that the attribute's value is a decimal number.
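In practice this means that software should look documents up by their id value rather than by position. A minimal Python sketch under that assumption follows; the function name index_by_id and the representation of a document as an (id string, payload) pair are illustrative choices, not something the DTD prescribes.

```python
from typing import Dict, Iterable, Tuple


def index_by_id(docs: Iterable[Tuple[str, str]]) -> Dict[int, str]:
    """Build a lookup table from document id to document payload.

    docs yields (raw id attribute value, payload) pairs. Ids need not be
    contiguous or start at 1; they only have to be decimal numbers that are
    unique within the file.
    """
    index: Dict[int, str] = {}
    for raw_id, payload in docs:
        # Accept plain ASCII decimal digits only.
        if not (raw_id.isascii() and raw_id.isdigit()):
            raise ValueError(f"id is not a decimal number: {raw_id!r}")
        doc_id = int(raw_id)
        if doc_id in index:
            raise ValueError(f"duplicate document id within file: {doc_id}")
        index[doc_id] = payload
    return index
```

A file whose documents carry the ids 1 and 3 (document 2 having been removed during processing) is indexed without complaint, whereas a non-numeric or repeated id raises an error.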

Document header. The <markup> subelement has the same semantics as the one in the file header (<h>), and it is considered to contain markup information additional to that in the file header's <markup> subelement.

Corpus type:

s current-epoch ("modern") language written corpus

d diachronic ("old") language written corpus

o oral (spoken) corpus (transcribed speech)
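The one-letter codes lend themselves to a simple lookup table. The Python sketch below covers only the three codes listed above and treats any other value as an error, which is an assumption of the example; the names CORPUS_TYPES and describe_corpus_type are hypothetical.

```python
# One-letter corpus type codes and their meanings, as listed above.
CORPUS_TYPES = {
    "s": 'current-epoch ("modern") language written corpus',
    "d": 'diachronic ("old") language written corpus',
    "o": "oral (spoken) corpus (transcribed speech)",
}


def describe_corpus_type(code: str) -> str:
    """Return the description of a corpus type code, or raise ValueError."""
    try:
        return CORPUS_TYPES[code]
    except KeyError:
        raise ValueError(f"unknown corpus type code: {code!r}") from None
```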