Description of Reader's Digest Czech/English Corpus
---------------------------------------------------

Files are in SGML format (DTD PCDoc).

Location of files:

  *c.pcd ... Czech part of the corpus
  *e.pcd ... English part of the corpus

Each document starts with markup and ends with .
Description of other markups follows:

         ..... beginning of the paragraph
         ..... beginning of the sentence
               [, where X is a paragraph index and Y is a sentence index]
<\s>     ..... end of the sentence
         ..... delimiter of automatically aligned passages (the set of
               sentences). The count of is the same for both languages.
               [, where X is a passage index, A-B is the
               (this language)-to-(second language) sentence count in
               the passage. are passages marked for test purposes]
<\ccs>   ..... end of aligned passage
         ..... the original word form and its annotation:
                 .... lemma (base form) as defined by morphology
                 .... morphological tag as assigned automatically; see the
                      description of a morphological tag at the end of
                      this file.
               There are four possible attributes of the mark :
                 cap .... the first letter of the word is uppercase
                 upper .. all letters are uppercase
                 num .... the word is a number
                 mixed .. the word includes both letters and digits
               other annotation:
                 .... different tagset (for English)
                 .... set of possible tags (for English)
                 .... node number assigned to this word (Czech part only)
                 .... number of the node governing this node,
                      automatically assigned by Statistical parser for
                      Czech (Czech part only)
         ..... delimiter (';', ':', ',', '.', etc.) and automatically
               assigned tag, after .
         ..... means that there was no space in the original text.

See DTD PCDoc (file PCDoc.DTD) for details.

*****************************************************************

Brief description of morphological tags

For Czech:

There are 13 morphological categories in Czech. A tag is a sequence of
these categories, each represented by a single character.

Morphological categories and their values:

   1) part of speech (values: noun (N), verb (V), adjective (A),
      pronoun (P), adverb (D), numeral (C), preposition (R),
      conjunction (J), interjection (I), particle (T), punctuation (Z),
      and "undefined" (X))
   2) subpart of speech - contains details about the major POS
   3) gender (masculine animate (M), masculine inanimate (I),
      feminine (F), neuter (N), any of them (X), M or I (Y) ...)
   4) number (singular (S), plural (P), dual (D), both S and P (X))
   5) case (seven grammatical cases: nominative (1), genitive (2),
      dative (3), accusative (4), vocative (5), locative (6),
      instrumental (7), any (X))
   6) possessor's gender
   7) possessor's number
   8) person (1st (1), 2nd (2), 3rd (3))
   9) tense (verb tense)
  10) degree of comparison (positive (1), comparative (2),
      superlative (3))
  11) negation (affirmative (A)/negative (N))
  12) voice (active (A)/passive (P))
  13) variant/register

For English:

Tagging of the English part corresponds to WSJ tagging. The first
character indicates the part of speech (values: noun (N), verb (V),
adjective (A), pronoun (P), ... , and "undefined" (X)).
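
Example: decoding a Czech tag

The Czech tag layout described above (13 categories, one character each)
can be unpacked position by position. The following Python sketch is only
an illustration, not part of the corpus tools. The names CATEGORIES,
VALUES and decode_tag are hypothetical, the sample tag is made up for the
example, and the use of '-' for a category that does not apply is an
assumption that this description does not state. Only the values listed
above are translated; any other character is returned unchanged.

```python
# Category names in tag order, following the list of 13 categories above.
CATEGORIES = [
    "pos", "subpos", "gender", "number", "case",
    "possessor_gender", "possessor_number", "person", "tense",
    "degree", "negation", "voice", "variant",
]

# Only the values spelled out in the description are listed here.
VALUES = {
    "pos": {
        "N": "noun", "V": "verb", "A": "adjective", "P": "pronoun",
        "D": "adverb", "C": "numeral", "R": "preposition",
        "J": "conjunction", "I": "interjection", "T": "particle",
        "Z": "punctuation", "X": "undefined",
    },
    "gender": {
        "M": "masculine animate", "I": "masculine inanimate",
        "F": "feminine", "N": "neuter", "X": "any", "Y": "M or I",
    },
    "number": {"S": "singular", "P": "plural", "D": "dual", "X": "S or P"},
    "case": {
        "1": "nominative", "2": "genitive", "3": "dative",
        "4": "accusative", "5": "vocative", "6": "locative",
        "7": "instrumental", "X": "any",
    },
    "person": {"1": "1st", "2": "2nd", "3": "3rd"},
    "degree": {"1": "positive", "2": "comparative", "3": "superlative"},
    "negation": {"A": "affirmative", "N": "negative"},
    "voice": {"A": "active", "P": "passive"},
}

# Assumption: '-' marks a category that does not apply to the word.
NOT_APPLICABLE = "-"


def decode_tag(tag):
    """Map a 13-character tag to a dict of category name -> value."""
    if len(tag) != len(CATEGORIES):
        raise ValueError(
            "expected %d characters, got %d" % (len(CATEGORIES), len(tag))
        )
    decoded = {}
    for name, char in zip(CATEGORIES, tag):
        if char == NOT_APPLICABLE:
            continue  # category not relevant for this word
        # Fall back to the raw character when the value is not listed.
        decoded[name] = VALUES.get(name, {}).get(char, char)
    return decoded


if __name__ == "__main__":
    # Made-up sample tag: noun, feminine, singular, nominative, affirmative.
    print(decode_tag("NNFS1-----A--"))
```

Categories whose values are not enumerated above (subpart of speech,
possessor's gender and number, tense, variant/register) are passed
through as raw characters, so the sketch does not guess at their meaning.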