Description of Reader's Digest Czech/English Corpus
---------------------------------------------------
Files are in SGML format (DTD PCDoc).
Location of files:
*c.pcd ... Czech part of the corpus
*e.pcd ... English part of the corpus
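The naming convention above can be used to separate the two halves of the corpus. The following is a minimal sketch in Python; the function name split_by_language and the sample file names are hypothetical and serve only as illustration.

```python
from fnmatch import fnmatchcase


def split_by_language(filenames):
    """Sort file names into Czech and English lists by their suffix."""
    # *c.pcd marks the Czech part, *e.pcd the English part.
    czech = [name for name in filenames if fnmatchcase(name, "*c.pcd")]
    english = [name for name in filenames if fnmatchcase(name, "*e.pcd")]
    return czech, english


# Hypothetical file names, not taken from the corpus.
names = ["rd01c.pcd", "rd01e.pcd", "rd02c.pcd", "rd02e.pcd", "PCDoc.DTD"]
czech_files, english_files = split_by_language(names)
```

Names that match neither pattern, such as the DTD file, are simply left out of both lists.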
Each document starts with an opening markup tag and ends with the corresponding closing tag.
A description of the other markup follows:
..... beginning of the paragraph
<s> ..... beginning of the sentence [X is a paragraph index and
Y is a sentence index]
</s> .... end of the sentence
<ccs> ... delimiter of automatically aligned passages (sets of sentences).
The count of <ccs> is the same for both languages. [X is a
passage index, A-B is the (this language)-to-(second language)
sentence count in the passage. Some passages are
marked for test purposes]
</ccs> .. end of aligned passage
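Since the number of aligned passages is stated to be the same in both languages, a quick consistency check is to count the opening passage tags in a Czech file and in its English counterpart. The sketch below is only an illustration: the helper names are hypothetical, the sample strings are invented, and the opening tag is assumed to begin with "<ccs".

```python
import re

# Matches opening tags only: a closing tag has a slash or backslash
# right after "<", so it never starts with "<ccs".
CCS_OPEN = re.compile(r"<ccs\b")


def count_passages(text):
    """Return the number of aligned-passage opening tags in the text."""
    return len(CCS_OPEN.findall(text))


def same_passage_count(czech_text, english_text):
    """Check that both language versions hold the same number of passages."""
    return count_passages(czech_text) == count_passages(english_text)


# Invented fragments, not real corpus data.
czech_sample = "<ccs>veta jedna<\\ccs><ccs>veta dva<\\ccs>"
english_sample = "<ccs>sentence one<\\ccs><ccs>sentence two<\\ccs>"
aligned = same_passage_count(czech_sample, english_sample)
```

A mismatch in the counts suggests that one of the two files is truncated or was not produced by the same alignment run.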
..... the original word form and its annotation:
.... lemma (base form) as defined by morphology
.... morphological tag as assigned automatically, see
the description of a morphological tag at the end of this file.
There are four possible attributes of the word form mark:
cap .... the first letter of the word is uppercase
upper .. all letters are uppercase
num .... the word is a number
mixed .. the word includes both letters and digits
other annotation:
.... different tagset (for English)
... set of possible tags (for English)
.... node number assigned to this word (Czech part only)
.... number of the node governing this node, assigned automatically
by a statistical parser for Czech (Czech part only)
..... delimiter (';',':',',','.', etc.) and automatically assigned
tag, after .
..... means that there was no space in the original text.
See DTD PCDoc (file PCDoc.DTD) for details.
*****************************************************************
Brief description of morphological tags
---------------------------------------
For Czech:
The Czech tagset uses 13 morphological categories. A tag is a sequence of
values for these categories, each value represented by a single character.
Morphological categories and their values:
1) part of speech (values: noun (N), verb (V), adjective (A), pronoun
(P), adverb (D), numeral (C), preposition (R), conjunction (J),
interjection (I), particle (T), punctuation (Z), and "undefined" (X)).
2) subpart of speech - contains details about the major POS.
3) gender (masculine animate (M), masculine inanimate (I), feminine
(F), neuter (N), any of them (X), M or I (Y) ...)
4) number (singular (S), plural (P), dual (D), both S and P (X))
5) case (seven grammatical cases: nominative (1), genitive (2), dative
(3), accusative (4), vocative (5), locative (6), instrumental (7),
any (X))
6) possessor's gender
7) possessor's number
8) person (1st (1), 2nd (2), 3rd (3))
9) tense (verb tense)
10) degree of comparison (positive (1), comparative (2), superlative (3))
11) negation (affirmative (A)/negative (N))
12) voice (active (A)/passive (P))
13) variant/register
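Because every category occupies exactly one character, a Czech tag can be decoded by position. The sketch below maps only the values listed above and returns any other character unchanged. The names CATEGORIES, VALUES and decode_tag are hypothetical, and both the sample tag and the use of "-" for a category that does not apply are assumptions not stated in this description.

```python
# Category names in tag order (positions 1 to 13).
CATEGORIES = [
    "pos", "subpos", "gender", "number", "case",
    "possessor_gender", "possessor_number", "person", "tense",
    "degree", "negation", "voice", "variant",
]

# Only the values named in the list above; others are passed through.
VALUES = {
    "pos": {"N": "noun", "V": "verb", "A": "adjective", "P": "pronoun",
            "D": "adverb", "C": "numeral", "R": "preposition",
            "J": "conjunction", "I": "interjection", "T": "particle",
            "Z": "punctuation", "X": "undefined"},
    "gender": {"M": "masculine animate", "I": "masculine inanimate",
               "F": "feminine", "N": "neuter", "X": "any", "Y": "M or I"},
    "number": {"S": "singular", "P": "plural", "D": "dual",
               "X": "singular or plural"},
    "case": {"1": "nominative", "2": "genitive", "3": "dative",
             "4": "accusative", "5": "vocative", "6": "locative",
             "7": "instrumental", "X": "any"},
    "person": {"1": "1st", "2": "2nd", "3": "3rd"},
    "degree": {"1": "positive", "2": "comparative", "3": "superlative"},
    "negation": {"A": "affirmative", "N": "negative"},
    "voice": {"A": "active", "P": "passive"},
}


def decode_tag(tag):
    """Split a 13-character tag into a dict of category name -> value."""
    if len(tag) != len(CATEGORIES):
        raise ValueError("expected %d characters, got %d"
                         % (len(CATEGORIES), len(tag)))
    decoded = {}
    for name, char in zip(CATEGORIES, tag):
        # Unknown characters (including the assumed "-") are kept as-is.
        decoded[name] = VALUES.get(name, {}).get(char, char)
    return decoded


# Assumed sample tag for a feminine singular nominative noun.
example = decode_tag("NNFS1-----A--")
```

Keeping unmapped characters as they are means the function does not fail on values that this description leaves unspecified, such as the subpart of speech or the tense.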
For English:
Tagging of the English part corresponds to WSJ tagging.
The first character indicates the part of speech (values: noun (N), verb (V),
adjective (A), pronoun (P), ... , and "undefined" (X)).