Prague Dependency Treebank Tutorial: Technology


Prague Markup Language


Tree Editor TrEd


PML Tree Query Language and Engines


Jan Štěpánek, Charles University in Prague, ÚFAL

Prague Markup Language (PML)

PML—an Example

<sentence id="s5_5">
  <chunk id="hic43">
   <feats>
     <lemma>हो</lemma>
     <wxlemma>ho</wxlemma>
     <pos>v</pos>
     <g>m</g>
     <n>s</n>
     <p>m</p>
     <t>future</t>
   </feats>
   <ord>21</ord>
   <phrase>VG</phrase>
   <children>
     <chunk id="hic37">
      <drel>k7</drel>
      <feats>
        <lemma>अभाव</lemma>
        <wxlemma>aBAva</wxlemma>
        <pos>n</pos>
        <g>m</g>
        <n>s</n>
        <c>0</c>
      </feats>
      <ord>4</ord>
      <phrase>NP</phrase>
      <children>
       <chunk id="hic36">
         <drel>r6</drel>
         <feats>
           <lemma>पाठक</lemma>
           <wxlemma>pATaka</wxlemma>
           <pos>n</pos>
           <g>m</g>
           <n>s</n>
           <c>0</c>
         </feats>
         <ord>1</ord>
         <phrase>NP</phrase>
         <children>
           <word id="hiw74">
            <feats>
              <lemma>पाठक</lemma>
              <wxlemma>pATaka</wxlemma>
              <pos>n</pos>
              <g>m</g>
              <n>s</n>
              <c>0</c>
            </feats>
            <form>पाठक</form>
            <ord>2</ord>
            <phrase>NN</phrase>
            <wxform>pATaka</wxform>
           </word>

PML Schema—an Example

<?xml version="1.0" encoding="utf-8"?>
<pml_schema xmlns="http://ufal.mff.cuni.cz/pdt/pml/schema/" version="1.1">
  <revision>1.0.0</revision>
  <description>Converted Hyderabad Treebank morph data</description>


  <root name="hydtmorph">
    <structure>
      <member name="meta" required="0" type="meta.type"/>
      <member name="document" required="1">
        <sequence role="#TREES">
          <element name="sentence" type="sentence.type"/>
        </sequence>
      </member>
    </structure>
  </root>

  <type name="meta.type">
    <structure>
      <member name="annotation_info">
        <structure>
          <member name="version_info"><cdata format="any"/></member>
          <member name="desc"><cdata format="any"/></member>
        </structure>
      </member>
    </structure>
  </type>

  <type name="sentence.type">
    <container role="#NODE">
      <attribute name="id" role="#ID" required="1"><cdata format="ID"/></attribute>
      <sequence role="#CHILDNODES">
        <element name="chunk" type="chunk.type"/>
        <element name="word" type="word.type"/>
      </sequence>
    </container>
  </type>

  <type name="children.type">
    <sequence>
      <element name="chunk" type="chunk.type"/>
      <element name="word" type="word.type"/>
    </sequence>
  </type>

  <type name="chunk.type">
    <structure role="#NODE">
      <member name="id" role="#ID" as_attribute="1" required="1"><cdata format="ID"/></member>
      <member name="error" type="error.type"/>
      <member name="feats" type="feats.type"/>
      <member name="drel"><cdata format="any"/></member>
      <member name="ord" role="#ORDER"><cdata format="integer"/></member>
      <member name="phrase"><cdata format="any"/></member>
      <member name="children" type="children.type" role="#CHILDNODES"/>
    </structure>
  </type>

  <type name="word.type">
    <structure role="#NODE">
      <member name="id" role="#ID" as_attribute="1" required="1"><cdata format="ID"/></member>
      <member name="head"><choice><value>0</value><value>1</value></choice></member>
      <member name="error" type="error.type"/>
      <member name="feats" type="feats.type"/>
      <member name="form"><cdata format="any"/></member>
      <member name="wxform"><cdata format="any"/></member>
      <member name="ord" role="#ORDER"><cdata format="integer"/></member>
      <member name="phrase"><cdata format="any"/></member>
    </structure>
  </type>

  <type name="feats.type">
    <structure>
      <member name="wxlemma"><cdata format="any"/></member>
      <member name="lemma"><cdata format="any"/></member>
      <member name="pos"><cdata format="any"/></member>
      <member name="g"><cdata format="any"/></member>
      <member name="n"><cdata format="any"/></member>
      <member name="p"><cdata format="any"/></member>
      <member name="c"><cdata format="any"/></member>
      <member name="v"><cdata format="any"/></member>
      <member name="t"><cdata format="any"/></member>
    </structure>
  </type>

  <type name="error.type">
    <list ordered="0">
      <choice>
        <value>not-connected</value>
        <value>inside-chunk</value>
        <value>drel-like-name</value>
        <value>missing-parent</value>
        <value>duplicate-name</value>
      </choice>
    </list>
  </type>

</pml_schema>

Why Should I Use PML?

Tools to

Tree Editor TrEd

http://ufal.mff.cuni.cz./~pajas/tred
  • Written in Perl/Tk: OS independent (MS Windows, Linux, Mac OS X)
  • Drag-and-drop modification of trees
  • User defined macros and style-sheets
  • Minor modes

The word “Tree” is important.

TrEd Extensions

Archive of modules, key-bindings, resources, style-sheets, etc.

  • PDT 2.0 related (valency lexicon, multi-word expressions, discourse, coreference, etc.)
  • a parser
  • parallel treebank alignment
  • support for other formats:
    • Prague Arabic Dependency Treebank
    • Slovene Dependency Treebank
    • Prague Czech-English Dependency Treebank
    • CoNLL 2009 ST (2006, 2007)
    • Penn Treebank (+ Chinese, Arabic)
    • Tiger
    • Hyderabad Treebank
    • Sinica

TrEd Extensions—Examples

TrEd without GUI

btred: Batch TrEd

Example: Build a frequency table of POS tags in the WSJ part of the Penn Treebank

btred -q -T -N -e '
    writeln $this->{pos} unless $this->children
    ' ??/wsj*.pml \
    | sort | uniq -c | sort -n

Speeding btred Up

Parallelization

ntred
  • works over ssh in a shared file system
  • works in two steps:
    1. load the data into the memory of the servers
    2. submit requests serially
jtred
  • works on SGE
  • loads the data for each request, but several requests can be submitted at the same time

PML Tree Query

The SQL engine requires the data to be converted to SQL and loaded into a database → stable datasets.

The Perl engine is slow, but works directly with the data files → data in progress.

PML-TQ: The Query Language

PML-TQ: Basic Syntax

Atomic attribute value equality, child relation:

nonterminal [
  cat = 'VP',
  child terminal [
    pos = 'VB'
] ]

More relations and operators:

nonterminal $n := [
  cat in { 'S','VP' },
  descendant terminal [
    pos ~ '^V',
    parent nonterminal [
      cat = $n.cat
] ] ]

PML-TQ: Basic Syntax (2)

Relations

Predicates

PML-TQ: Basic Syntax (3)

Operators

+, -, *, div, mod, &

Functions

PML-TQ: Verb Clause without a Subject (I)

Tiger Treebank

nonterminal [
  cat = 'S', 
  0x * [ label = 'SB' ], 
  nonterminal [ 
    cat = 'VP'
] ]

Penn Chinese Treebank

nonterminal [
  cat = 'IP',
  0x nonterminal [ 
    functions = 'SBJ' ],
  nonterminal [ 
    cat = 'VP'
] ]

PML-TQ: Verb Clause without a Subject (II)

Penn Treebank (WSJ)

nonterminal [
  cat = 'S', 0x nonterminal [
    functions = 'SBJ' ],
  nonterminal [ 
    cat = 'VP', 
    coindex.rf nonterminal [
      cat = 'VP',
      sibling nonterminal [
        functions = 'SBJ'
] ] ] ]

Penn Arabic Treebank

nonterminal [
  cat = 'VP',
  0x nonterminal [
    cat = 'VP' 
    or
    functions = 'SBJ'
] ]

PML-TQ: Verb Clause without a Subject (III)

Prague Dependency Treebank

a-node [
 $$ = $aux and m/tag ~ '^V[^fs]'
    or $$ != $aux and m/tag ~ '^V[sf]',
  0x echild a-node [ afun = 'Sb' ], 
     ? echild a-node $aux := [
       afun = 'AuxV', m/tag ~ '^V[^f]' ] ]

PML-TQ: Negation

Atomic

cat = 'NP'
is equivalent to
∃ x ∈ cat ( x = 'NP' )
cat != 'NP'
is equivalent to
∃ x ∈ cat ( x ≠ 'NP' )
! cat = 'NP'
is equivalent to
∀ x ∈ cat ( x ≠ 'NP' )
* cat = 'NP'
is equivalent to
∀ x ∈ cat ( x = 'NP' )

Same for in, ~, etc.

Existencial

nonterminal [ 0x descendant terminal [pos = "NN"]]

Layers in PML-TQ

t-node [
  a/lex.rf $n4, 
  t-node [ a/lex.rf $n3] ];
a-node $n3 := [
  a-node $n4 := [  ] ];

PML-TQ: Grammar Extraction

For Penn Treebank (WSJ), extract the underlying grammar. For each rule, show the number of applications.

nonterminal $p := [ * $ch := [  ] ]
>> give $p, $p.cat,
        first_defined($ch.cat,$ch.pos),
        lbrothers($ch)
>> give $2 & " -> " 
        & concat($3," " over $1 sort by $4)
>> for $1 give count(),$1 sort by $1 desc
Number of applicationsRule
189856 PP → IN NP
128140 SNP VP
87402 NPNP PP
72106 NP → DT NN
65508 SNP VP .
45995 NP → -NONE-
36078 NP → DT JJ NN
31916 VP → TO VP
28796 NP → NNP NNP
23272 SBAR → IN S

PML-TQ: Non-Projectivity

CoNLL 2009 treebanks

node $upper := [ 
  same-tree-as $between,
  node $lower := [ ] ];
node $between := [ ! ancestor $upper,
  ( (order-precedes $upper 
     and order-follows $lower)
    or (order-follows $upper
        and order-precedes $lower) ) ]

$upper is not an ancestor of $between

Word Order Typology

For German

Based on CoNLL ST 2009 data

node $p := [ substr(pos,0,1) = 'V', 
  ? node $ch := [
    deprel in {'SB','OA','OC','OA2','OP'}
] ];
>> give $p.xml:id,
     if($p=$ch,
       if($p.deprel = 'ROOT','V','v'),
       substr($ch.deprel,0,1)),
     $ch.order
>> give distinct $1,
     concat($2,'' over $1 sort by $3)
>> give 
     substitute($2,'([OS])\\1+','\\1','g')
>> filter ($1 ~ 'O' and $1 ~ 'S')
>> for $1 give $1,count() sort by $2 desc
Main clauseNum. of occurrences Dependent clauseNum. of occurrences
SVO 11267 SOv 7556
VSO 7111 SvO 2273
OVS 2209 vSO 1113
VOS 625 OSv 606
SOV 110 OvS 109
OVSO 91 vOS 64
VOSO 64 SOvO 37
OVOS 31 OSOv 34

Conclusion

PML Framework