Description of Reader's Digest Czech/English Corpus
---------------------------------------------------

Files are in SGML format (DTD PCDoc).

Location of files:
  *c.pcd   ... Czech part of the corpus
  *e.pcd   ... English part of the corpus

Each document starts with the tag <doc> and ends with </doc>.

The other markup elements are described below:

<p> ..... beginning of a paragraph

<s> ..... beginning of a sentence [<s id=pXsY>, where X is a paragraph index and
    Y is a sentence index]
</s> .... end of a sentence
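
The sentence id described above can be split into its two indices. The
following is a minimal Python sketch, assuming that X and Y are plain decimal
integers (e.g. "p3s2"); the function name parse_sentence_id is hypothetical
and not part of the corpus tools.

```python
import re

# Assumed shape of the id value: "p<paragraph>s<sentence>", e.g. "p3s2".
SENTENCE_ID = re.compile(r"p(\d+)s(\d+)")


def parse_sentence_id(sid):
    """Return (paragraph_index, sentence_index) for an id such as 'p3s2'."""
    match = SENTENCE_ID.fullmatch(sid)
    if match is None:
        raise ValueError(f"unexpected sentence id: {sid!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_sentence_id("p3s2"))  # (3, 2)
```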

<ccs> ... start of an automatically aligned passage (a set of sentences).
    The number of <ccs> passages is the same for both languages.
    [<ccs id=X align=A-B>, where X is a passage index and A-B gives the
    sentence counts in the passage: A for this language, B for the other
    language. Passages of the form <ccs id=X align=A-B test> are marked
    for test purposes]
</ccs> .. end of an aligned passage
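
To illustrate the attributes of <ccs>, here is a minimal Python sketch that
reads one start tag. It assumes that X, A and B are unquoted decimal integers
and that the tag is written exactly as shown above; the function name
parse_ccs and the dictionary keys are hypothetical.

```python
import re

# Assumed tag shape: <ccs id=X align=A-B> with an optional trailing "test".
CCS_TAG = re.compile(r"<ccs\s+id=(\d+)\s+align=(\d+)-(\d+)(\s+test)?\s*>")


def parse_ccs(tag):
    """Return the passage index, both sentence counts and the test flag."""
    match = CCS_TAG.fullmatch(tag.strip())
    if match is None:
        raise ValueError(f"not a <ccs> start tag: {tag!r}")
    return {
        "id": int(match.group(1)),
        "this_count": int(match.group(2)),   # sentences in this language
        "other_count": int(match.group(3)),  # sentences in the other language
        "test": match.group(4) is not None,  # marked for test purposes
    }


print(parse_ccs("<ccs id=12 align=2-1 test>"))
```

Since the number of <ccs> passages is the same in both languages, passages
with equal id values in the Czech and English files can be paired directly.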

<f> ..... the original word form and its annotation:
          <l> .... lemma (base form) as defined by morphology
          <t> .... morphological tag as assigned automatically; see
              the description of morphological tags at the end of this file.
      There are four possible attributes of the tag <f>:
        cap .... the first letter of the word is uppercase
        upper .. all letters are uppercase
        num .... the word is a number
        mixed .. the word contains both letters and digits

      other annotation:
          <o> .... tag from a different tagset (for English)
          <Ct> ... set of possible tags (for English)
          <r> .... node number assigned to this word (Czech part only)
          <g> .... number of the node governing this node, automatically
              assigned by a statistical parser for Czech (Czech part only)
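
The four attributes of <f> (cap, upper, num, mixed) can be illustrated with a
small Python sketch. This is only one possible reading of the definitions
above: the order in which the conditions are checked, and the treatment of
words that fit more than one definition, are assumptions, and the function
name form_attribute is hypothetical.

```python
def form_attribute(word):
    """Return 'num', 'mixed', 'upper', 'cap', or None for a word form."""
    has_digit = any(ch.isdigit() for ch in word)
    has_alpha = any(ch.isalpha() for ch in word)
    if has_digit and not has_alpha:
        return "num"    # the word is a number
    if has_digit and has_alpha:
        return "mixed"  # both letters and digits
    if has_alpha and word.isupper():
        return "upper"  # all letters are uppercase
    if word[:1].isupper():
        return "cap"    # only the first letter is uppercase
    return None         # no attribute applies


for form in ("Praha", "USA", "1998", "B52", "corpus"):
    print(form, form_attribute(form))
```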

<d> ..... delimiter (';', ':', ',', '.', etc.) and its automatically assigned
       tag, which follows <t>.

<D> ..... indicates that there was no space in the original text.

See DTD PCDoc (file PCDoc.DTD) for details.

*****************************************************************
Brief description of morphological tags

For Czech:

There are 13 morphological categories in Czech. A tag is a sequence of the
values of these categories, each value represented by a single character.

Morphological categories and their values:
1) part of speech (values: noun (N), verb (V), adjective (A), pronoun
  (P), adverb (D), numeral (C), preposition (R), conjunction (J),
  interjection (I), particle (T), punctuation (Z), and "undefined" (X)).
2) subpart of speech (gives details about the major POS)
3) gender (masculine animate (M), masculine inanimate (I), feminine
  (F), neuter (N), any of them (X), M or I (Y) ...)
4) number (singular (S), plural (P), dual (D), both S and P (X))
5) case (seven grammatical cases: nominative (1), genitive (2), dative
  (3), accusative (4), vocative (5), locative (6), instrumental (7),
  any (X))
6) possessor's gender
7) possessor's number
8) person (1st (1), 2nd (2), 3rd (3))
9) tense (verb tense)
10) degree of comparison (positive (1), comparative (2), superlative (3))
11) negation (affirmative (A)/negative (N))
12) voice (active (A)/passive (P))
13) variant/register
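
Because each category occupies one fixed position, a Czech tag can be decoded
by position. The Python sketch below assumes that every tag is exactly 13
characters long. The sample tag "NNFS1-----A--" is a hypothetical example,
and the use of "-" for a category that does not apply is an assumption that
is not stated in this file. The names CATEGORIES, POS and decode_czech_tag
are likewise hypothetical.

```python
# Category names in the order of the list above (positions 1 to 13).
CATEGORIES = (
    "pos", "subpos", "gender", "number", "case",
    "possessor_gender", "possessor_number", "person", "tense",
    "degree", "negation", "voice", "variant",
)

# Part-of-speech values as listed under category 1.
POS = {
    "N": "noun", "V": "verb", "A": "adjective", "P": "pronoun",
    "D": "adverb", "C": "numeral", "R": "preposition", "J": "conjunction",
    "I": "interjection", "T": "particle", "Z": "punctuation",
    "X": "undefined",
}


def decode_czech_tag(tag):
    """Map each character of a 13-character tag to its category name."""
    if len(tag) != len(CATEGORIES):
        raise ValueError(f"expected {len(CATEGORIES)} characters: {tag!r}")
    return dict(zip(CATEGORIES, tag))


decoded = decode_czech_tag("NNFS1-----A--")  # hypothetical sample tag
print(POS[decoded["pos"]], decoded["gender"], decoded["number"], decoded["case"])
# noun F S 1
```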


For English:

The tagging of the English part corresponds to WSJ tagging.

The first character indicates the part of speech (values: noun (N), verb (V), 
  adjective (A), pronoun (P), ... , and "undefined" (X)).