Obtaining the analytical parse for English
The tree structure as produced by the MST parser differs in certain points from the main guidelines for the PDT-based analytical layer. Since we wanted this layer (albeit automatically generated) to be as close to the Czech annotation (of the same layer) as possible, additional conversion of the output of the parser has been done (using our Treex NLP platform). For instance, complex verb forms (e.g. would have been playing) are always governed by the lexical verb node while the auxiliary verbs are its daughters. However, no "new" annotation has been introduced at this stage. The resulting annotation style differs from the specification of the analytical layer for Czech only in some minor points.
Figure 1 presents an example of a tree on the analytical (surface-dependency) layer (an a-layer tree, or a-tree). This section and the following section describe its structure and the inner structure of the complex labels of analytical nodes (a-nodes). The annotated sentence has a form of a dependency tree (with nodes and edges, as usual). Each tree has an extra technical root node, the attributes of which are only a subset of the attributes of the regular nodes and which differ also in their definition from all the other nodes (see below).
Unlike the tectogrammatical representation and the phrase-structure representations, the analytical representation reflects only tokens that are actually present in the text. It does not use any trace- or gap-like "artificial" nodes. The following attributes (complex labels) are attached to the analytical-layer nodes (see also Figureá2):
Figureá2: Example of the inner structure of a node on the analytical layer
- afun: for a set of possible values, see below the section on Analytical function labeling
- p_terminal.rf: reference to the id of the corresponding terminal node in the phrase-structure tree
- tag: the PTB tagset is used
- no_space_after: possible values are 0,1. Tokens followed by a punctuation mark have 1.
- is_member: possible values are 0,1. Members of a coordination or other "parallel structure" have 1.
- alignment: alignment to Czech analytical-layer tree
"Afun" means analytical function. The attribute afun labels the dependency relation between the child node it is attached to and its governing (parent) node. (The root node always has the afun AuxS, used nowhere else.) Nodes that represent regular tokens get different afuns according to their syntactic function in the sentence. Section Analytical function labeling provides a list of afuns and their descriptions.
The id attribute consists (in PCEDT 2.0) of the name of the language, the initial of the layer name (A), the name of the corpus, the ordinal number of the sentence in the corpus (the sentences are numbered throughout the entire corpus) and the number of the node itself. The root node is an exception: its id does not contain any node number but only the sentence number.
The ord attribute keeps the ordinal number of the node within the sentence. It is an integer which corresponds to the position of the token in the original word order, numbered from 1. The root has always ord="0".
PCEDT 2.0 is an (automatically) sentence- and word-aligned parallel corpus. The alignment is directed from the English part to the Czech part, for each layer separately. The English annotation layers contain the alignment information in a form of references from the English nodes to their corresponding Czech nodes, using node IDs. At the English analytical layer, it is the alignment sub-attributes alignment/counterpart.rf and alignment/type. Figureá3 illustrates the English-Czech alignment at the analytical layer.
The topmost node in a tree is always the technical root (afun AuxS). The sentence punctuation (afun AuxK) is governed directly by the technical root. Linguistically, of course, the topmost node is (within the dependency framework) always the main predicate, or a coordination node (afun Coord) governing several main predicates as members of the coordination. The modifying word is always governed by the word it syntactically modifies. Hence, arguments and adverbials of a verb are governed by that verb (with a preposition or a subordinator in between, if the modifier is a prepositional phrase or a subordinate clause), adjectives modifying nouns are governed by the nouns and adverbs modifying adjectives are governed by the adjectives.
A coordination is governed by a node with the afun Coord (see Figureá4). This can be a punctuation (e.g. comma, colon or semicolon) or a coordinating conjunction (e.g. and, or). The members of the coordination get their afuns according to their position and function in the sentence, followed by the "suffix" _Co. The original Czech annotation scheme distinguishes among coordination, apposition and parenthesis. The English annotation uses this scheme only for coordination; sequences identified as apposition or parenthesis are not regarded as paratactic. Rather, their members are regarded as being governed by the first member. Only coordination members have thus is_member="1".
Complex verb forms consisting of a lexical verb and one or several auxiliary verbs are always governed by the lexical verb. This node gets its afun according to its position and function in the sentence. The auxiliary verbs are its children (see Figureá5). In the group will be used, the node used governs the nodes will and be).
Verbs of control (e.g. plan to do something) govern the controlled verbs (see Figureá6); so do modal verbs.
Prepositions and subordinating conjunctions always govern "their" content words. They get the "auxiliary" afuns AuxP, AuxC; it is the content word (governed by a preposition/subordinating conjunction) that gets an afun according to its syntactic function in the sentence. In a clause, the predicate is the root node of the subtree representing the clause. E.g. in the sentence "I said that I like it", that is governed by said and gets the afun AuxC. In turn, that governs the node like, which has the afun Obj (object). Figureá7 shows the subtree of a prepositional adverbial clause from non-Tagalog regions governed by the verb come.
Modifiers of nouns, even the prepositional ones, are treated as Atr (attributes) and are governed by the nouns (Figureá8).
The existential there is always the clause subject. Determiners are governed by nouns and get the afun AuxA (this afun does not exist in Czech). Verbless clauses are treated as cases of verbal ellipsis and the governing node gets the afun ExD.
The possible values of the afun (analytical function) attribute are listed here.
|Pred||Predicate. A node not depending on other node.|
|Pnom||Nominal part of a copular predicate. This can be a noun (Martin is a briliant speaker), an adjective (Joan is pretty) or a prepositional group (This was of great importance). Verb complements of other verbs are marked as Obj (e.g. They considered him smart)|
|AuxV||Auxiliary verbs be, have and do, the infinitive particle to, particles in phrasal verbs|
|AuxP||preposition, parts of a secondary (multiword) preposition|
|AuxC||Subordinating conjunction, subordinator|
|AuxK||Terminal punctuation of a sentence|
|AuxG||Non-terminal and non-coordinating graphic symbols (e.g. commas in a serial coordination)|
|AuxS||The technical root of the tree|
|ExD||A technical value signalling a "missing" (elliptical) governing node; also for the main element of a sentence without predicate (stands for "externally dependent")|
|AuxA||articles a, an, the|
|Neg||Negation expressed by the negation particle not, n't|
|NR||Not recognized dependencies|
Please note: the English analytical annotation is only automatic and thus imperfect, with errors both in the structure and in the labeling. Therefore, there might be inconsistency (in terms of the linguistic information provided) along the node links between the manually annotated tectogrammatical layer and the analytical one. In such a case, the tectogrammatical layer (obviously) contains the correct interpretation of the sentence in question (barring annotation errors).