Jan Štěpánek, Charles University in Prague, ÚFAL
<sentence id="s5_5">
  <chunk id="hic43">
    <feats> <lemma>हो</lemma> <wxlemma>ho</wxlemma> <pos>v</pos> <g>m</g> <n>s</n> <p>m</p> <t>future</t> </feats>
    <ord>21</ord>
    <phrase>VG</phrase>
    <children>
      <chunk id="hic37">
        <drel>k7</drel>
        <feats> <lemma>अभाव</lemma> <wxlemma>aBAva</wxlemma> <pos>n</pos> <g>m</g> <n>s</n> <c>0</c> </feats>
        <ord>4</ord>
        <phrase>NP</phrase>
        <children>
          <chunk id="hic36">
            <drel>r6</drel>
            <feats> <lemma>पाठक</lemma> <wxlemma>pATaka</wxlemma> <pos>n</pos> <g>m</g> <n>s</n> <c>0</c> </feats>
            <ord>1</ord>
            <phrase>NP</phrase>
            <children>
              <word id="hiw74">
                <feats> <lemma>पाठक</lemma> <wxlemma>pATaka</wxlemma> <pos>n</pos> <g>m</g> <n>s</n> <c>0</c> </feats>
                <form>पाठक</form>
                <ord>2</ord>
                <phrase>NN</phrase>
                <wxform>pATaka</wxform>
              </word>
<?xml version="1.0" encoding="utf-8"?> <pml_schema xmlns="http://ufal.mff.cuni.cz/pdt/pml/schema/" version="1.1"> <revision>1.0.0</revision> <description>Converted Hyderabad Treebank morph data</description> <root name="hydtmorph"> <structure> <member name="meta" required="0" type="meta.type"/> <member name="document" required="1"> <sequence role="#TREES"> <element name="sentence" type="sentence.type"/> </sequence> </member> </structure> </root> <type name="meta.type"> <structure> <member name="annotation_info"> <structure> <member name="version_info"><cdata format="any"/></member> <member name="desc"><cdata format="any"/></member> </structure> </member> </structure> </type> <type name="sentence.type"> <container role="#NODE"> <attribute name="id" role="#ID" required="1"><cdata format="ID"/></attribute> <sequence role="#CHILDNODES"> <element name="chunk" type="chunk.type"/> <element name="word" type="word.type"/> </sequence> </container> </type> <type name="children.type"> <sequence> <element name="chunk" type="chunk.type"/> <element name="word" type="word.type"/> </sequence> </type> <type name="chunk.type"> <structure role="#NODE"> <member name="id" role="#ID" as_attribute="1" required="1"><cdata format="ID"/></member> <member name="error" type="error.type"/> <member name="feats" type="feats.type"/> <member name="drel"><cdata format="any"/></member> <member name="ord" role="#ORDER"><cdata format="integer"/></member> <member name="phrase"><cdata format="any"/></member> <member name="children" type="children.type" role="#CHILDNODES"/> </structure> </type> <type name="word.type"> <structure role="#NODE"> <member name="id" role="#ID" as_attribute="1" required="1"><cdata format="ID"/></member> <member name="head"><choice><value>0</value><value>1</value></choice></member> <member name="error" type="error.type"/> <member name="feats" type="feats.type"/> <member name="form"><cdata format="any"/></member> <member name="wxform"><cdata format="any"/></member> <member 
name="ord" role="#ORDER"><cdata format="integer"/></member> <member name="phrase"><cdata format="any"/></member> </structure> </type> <type name="feats.type"> <structure> <member name="wxlemma"><cdata format="any"/></member> <member name="lemma"><cdata format="any"/></member> <member name="pos"><cdata format="any"/></member> <member name="g"><cdata format="any"/></member> <member name="n"><cdata format="any"/></member> <member name="p"><cdata format="any"/></member> <member name="c"><cdata format="any"/></member> <member name="v"><cdata format="any"/></member> <member name="t"><cdata format="any"/></member> </structure> </type> <type name="error.type"> <list ordered="0"> <choice> <value>not-connected</value> <value>inside-chunk</value> <value>drel-like-name</value> <value>missing-parent</value> <value>duplicate-name</value> </choice> </list> </type> </pml_schema>
http://ufal.mff.cuni.cz/~pajas/tred
The word “Tree” is important.
Archive of modules, key-bindings, resources, style-sheets, etc.
btred: Batch TrEd
$this->parent
$this->root
$this->rbrother
$this->firstson
$this->attr("m/tag")
Example: Build a frequency table of POS tags in the WSJ part of the Penn Treebank
btred -q -T -N -e '
writeln $this->{pos} unless $this->children
' ??/wsj*.pml \
| sort | uniq -c | sort -n
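The one-liner above prints the `pos` of every childless node and leaves the counting to `sort | uniq -c | sort -n`. The same counting logic can be sketched in Python on a toy tree. The `Node` class and the sample sentence are hypothetical stand-ins for illustration, not the btred API or Penn Treebank data.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a treebank node (not the btred API)."""
    label: str                      # `cat` for nonterminals, `pos` for terminals
    children: list = field(default_factory=list)


def leaves(node):
    """Yield every node without children, in left-to-right order."""
    if not node.children:
        yield node
    for child in node.children:
        yield from leaves(child)


def pos_frequencies(trees):
    """Count leaf labels, sorted ascending by count like `sort -n`."""
    counts = Counter(leaf.label for tree in trees for leaf in leaves(tree))
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


# Toy sentence: (S (NP DT NN) (VP VBD (NP DT NN)))
toy = Node("S", [
    Node("NP", [Node("DT"), Node("NN")]),
    Node("VP", [Node("VBD"), Node("NP", [Node("DT"), Node("NN")])]),
])

for tag, count in pos_frequencies([toy]):
    print(f"{count:7d} {tag}")
```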
The SQL engine requires the data to be converted to SQL and loaded into a database → stable datasets.
The Perl engine is slow, but works directly with the data files → data in progress.
Atomic attribute value equality, child relation:
nonterminal [
cat = 'VP',
child terminal [
pos = 'VB'
] ]
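To make the meaning of this query concrete, here is a minimal Python sketch that performs the same selection on a toy tree: every nonterminal with `cat = 'VP'` paired with each of its terminal children with `pos = 'VB'`. The dictionary layout (`cat`, `pos`, `children`) is an assumption made for the example and does not reflect how PML-TQ stores or evaluates data.

```python
def iter_nodes(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def vp_with_vb_child(tree):
    """Return (VP, VB) pairs: a nonterminal with cat 'VP' and a child with pos 'VB'."""
    return [
        (node, child)
        for node in iter_nodes(tree)
        if node.get("cat") == "VP"
        for child in node.get("children", [])
        if child.get("pos") == "VB"
    ]


# Toy tree: (S (NP PRP) (VP MD (VP VB (NP NN))))
toy = {"cat": "S", "children": [
    {"cat": "NP", "children": [{"pos": "PRP"}]},
    {"cat": "VP", "children": [
        {"pos": "MD"},
        {"cat": "VP", "children": [
            {"pos": "VB"},
            {"cat": "NP", "children": [{"pos": "NN"}]},
        ]},
    ]},
]}

matches = vp_with_vb_child(toy)
print(len(matches))
```

Only the inner VP matches here: the outer VP has a VB among its descendants, but not among its children.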
More relations and operators:
nonterminal $n := [
  cat in { 'S','VP' },
  descendant terminal [
    pos ~ '^V',
    parent nonterminal [
      cat = $n.cat
] ] ]
+, -, *, div, mod, &
Tiger Treebank
nonterminal [
cat = 'S',
0x * [ label = 'SB' ],
nonterminal [
cat = 'VP'
] ]
Penn Chinese Treebank
nonterminal [
cat = 'IP',
0x nonterminal [
functions = 'SBJ' ],
nonterminal [
cat = 'VP'
] ]
Penn Treebank (WSJ)
nonterminal [
cat = 'S',
0x nonterminal [
functions = 'SBJ' ],
nonterminal [
cat = 'VP',
coindex.rf nonterminal [
cat = 'VP',
sibling nonterminal [
functions = 'SBJ'
] ] ] ]
Penn Arabic Treebank
nonterminal [
cat = 'VP',
0x nonterminal [
cat = 'VP'
or
functions = 'SBJ'
] ]
Prague Dependency Treebank
a-node [
$$ = $aux and m/tag ~ '^V[^fs]'
or $$ != $aux and m/tag ~ '^V[sf]',
0x echild a-node [ afun = 'Sb' ],
? echild a-node $aux := [
afun = 'AuxV', m/tag ~ '^V[^f]' ] ]
| Condition | Equivalent to |
|---|---|
| cat = 'NP' | ∃ x ∈ cat ( x = 'NP' ) |
| cat != 'NP' | ∃ x ∈ cat ( x ≠ 'NP' ) |
| ! cat = 'NP' | ∀ x ∈ cat ( x ≠ 'NP' ) |
| * cat = 'NP' | ∀ x ∈ cat ( x = 'NP' ) |
Same for in, ~, etc.
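The four equivalences can be written out directly with Python's `any` and `all`, treating `cat` as a list of values. This is only a sketch of the logic stated above: the function names are invented for the example, and the behaviour on an empty list simply follows standard logic (∃ is false, ∀ is true), which the slides do not confirm for PML-TQ.

```python
def eq(values, target):
    """cat = 'NP': some value equals the target."""
    return any(x == target for x in values)


def ne(values, target):
    """cat != 'NP': some value differs from the target."""
    return any(x != target for x in values)


def not_eq(values, target):
    """! cat = 'NP': no value equals the target, i.e. all values differ."""
    return all(x != target for x in values)


def all_eq(values, target):
    """* cat = 'NP': every value equals the target."""
    return all(x == target for x in values)


mixed = ["NP", "PP"]
print(eq(mixed, "NP"), ne(mixed, "NP"), not_eq(mixed, "NP"), all_eq(mixed, "NP"))
```

For a multi-valued attribute, `cat != 'NP'` and `! cat = 'NP'` therefore differ: the list `["NP", "PP"]` satisfies the first but not the second.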
nonterminal [ 0x descendant terminal [pos = "NN"]]
t-node [ a/lex.rf $n4,
  t-node [ a/lex.rf $n3 ] ];
a-node $n3 := [
  a-node $n4 := [ ] ];
For Penn Treebank (WSJ), extract the underlying grammar. For each rule, show the number of applications.
nonterminal $p := [ * $ch := [ ] ]
>> give $p, $p.cat,
first_defined($ch.cat,$ch.pos),
lbrothers($ch)
>> give $2 & " -> "
& concat($3," " over $1 sort by $4)
>> for $1 give count(),$1 sort by $1 desc
| Number of applications | Rule |
|---|---|
| 189856 | PP → IN NP |
| 128140 | S → NP VP |
| 87402 | NP → NP PP |
| 72106 | NP → DT NN |
| 65508 | S → NP VP . |
| 45995 | NP → -NONE- |
| 36078 | NP → DT JJ NN |
| 31916 | VP → TO VP |
| 28796 | NP → NNP NNP |
| 23272 | SBAR → IN S |
| … | |
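The query collects, for each nonterminal, the labels of its children in left-to-right order, joins them into a rule string, and counts identical rules. A minimal Python sketch of the same idea on a toy tree follows. The `(label, children)` tuple layout is an assumption for illustration; the label plays the role of `first_defined($ch.cat,$ch.pos)`, and list order replaces the sort by `lbrothers($ch)`.

```python
from collections import Counter


def rules(tree):
    """Yield one 'LHS -> RHS' string per node that has children."""
    label, children = tree
    if children:
        yield f"{label} -> {' '.join(child[0] for child in children)}"
        for child in children:
            yield from rules(child)


def rule_counts(trees):
    """Count rule applications, most frequent first."""
    return Counter(rule for tree in trees for rule in rules(tree)).most_common()


# Toy tree: (S (NP DT NN) (VP VBD (NP DT NN)))
toy = ("S", [
    ("NP", [("DT", []), ("NN", [])]),
    ("VP", [("VBD", []), ("NP", [("DT", []), ("NN", [])])]),
])

for rule, count in rule_counts([toy]):
    print(count, rule)
```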
node $upper := [
same-tree-as $between,
node $lower := [ ] ];
node $between := [ ! ancestor $upper,
( (order-precedes $upper
and order-follows $lower)
or (order-follows $upper
and order-precedes $lower) ) ]
$upper is not an ancestor of $between
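Read procedurally, the query looks for an edge from `$upper` to its child `$lower` and a third node `$between` that lies between them in the surface order without being dominated by `$upper`. The following Python sketch performs that check by brute force. The `parent` and `order` dictionaries are a hypothetical encoding chosen for the example, and the sketch assumes all nodes belong to one tree, which stands in for `same-tree-as`.

```python
def ancestors(node, parent):
    """Return all ancestors of a node, nearest first."""
    found = []
    while parent.get(node) is not None:
        node = parent[node]
        found.append(node)
    return found


def gaps(parent, order):
    """Return (upper, lower, between) triples that violate projectivity."""
    found = []
    for lower, upper in parent.items():
        if upper is None:          # the root has no incoming edge
            continue
        lo, hi = sorted((order[upper], order[lower]))
        for between in parent:
            inside = lo < order[between] < hi
            if inside and upper not in ancestors(between, parent):
                found.append((upper, lower, between))
    return found


# Four words in surface order 1..4; node 2 is the root, node 1 hangs on node 4.
parent = {1: 4, 2: None, 3: 2, 4: 2}
order = {1: 1, 2: 2, 3: 3, 4: 4}
print(gaps(parent, order))
```

In the toy tree, the edge from node 4 to node 1 spans nodes 2 and 3, neither of which is a descendant of node 4, so both are reported.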
Based on CoNLL ST 2009 data
node $p := [ substr(pos,0,1) = 'V',
? node $ch := [
deprel in {'SB','OA','OC','OA2','OP'}
] ];
>> give $p.xml:id,
if($p=$ch,
if($p.deprel = 'ROOT','V','v'),
substr($ch.deprel,0,1)),
$ch.order
>> give distinct $1,
concat($2,'' over $1 sort by $3)
>> give
substitute($2,'([OS])\\1+','\\1','g')
>> filter ($1 ~ 'O' and $1 ~ 'S')
>> for $1 give $1,count() sort by $2 desc
| Main clause | Num. of occurrences | Dependent clause | Num. of occurrences |
|---|---|---|---|
| SVO | 11267 | SOv | 7556 |
| VSO | 7111 | SvO | 2273 |
| OVS | 2209 | vSO | 1113 |
| VOS | 625 | OSv | 606 |
| SOV | 110 | OvS | 109 |
| OVSO | 91 | vOS | 64 |
| VOSO | 64 | SOvO | 37 |
| OVOS | 31 | OSOv | 34 |
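The pipeline that produces these patterns can be traced step by step in Python: label the verb `V` (root) or `v` (non-root), label each argument by the first letter of its `deprel`, order the symbols by word position, collapse runs of `O` or `S`, and keep only patterns containing both. This is one reading of the query, sketched on invented input. The `(order, deprel)` tuples are a hypothetical encoding, not the CoNLL ST 2009 format.

```python
import re
from collections import Counter

ARGUMENT_DEPRELS = {"SB", "OA", "OC", "OA2", "OP"}


def pattern(verb, dependents):
    """Build a word order pattern from (order, deprel) tuples."""
    items = [(verb[0], "V" if verb[1] == "ROOT" else "v")]
    items += [(order, deprel[0]) for order, deprel in dependents
              if deprel in ARGUMENT_DEPRELS]
    raw = "".join(symbol for _, symbol in sorted(items))
    # Same substitution as in the query: collapse repeated O or S.
    return re.sub(r"([OS])\1+", r"\1", raw)


def count_patterns(clauses):
    """Count patterns that contain both an object and a subject."""
    counts = Counter(pattern(verb, deps) for verb, deps in clauses)
    kept = {p: c for p, c in counts.items() if "O" in p and "S" in p}
    return Counter(kept).most_common()


clauses = [
    ((2, "ROOT"), [(1, "SB"), (3, "OA")]),             # SVO
    ((2, "ROOT"), [(1, "SB"), (3, "OA"), (4, "OC")]),  # SVOO collapses to SVO
    ((3, "OC"), [(1, "SB"), (2, "OA")]),               # SOv
    ((1, "ROOT"), [(2, "OA")]),                        # VO, dropped: no subject
]
print(count_patterns(clauses))
```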
CLARA Course on Treebank Annotation, Prague, December 2010