NPFL075 — Prague Dependency Treebank

Course schedule overview

Phrase trees, dependency trees, non-dependency relations. Functional Generative Description, structural layers, coordination, differences between the theory and the PDT. Lower layers in FGD, word-form and morphological layers in the PDT, asymmetric dualism. Analytical layer (prepositions, verbal complements, combined functions, modal verbs, word order). Valency. PDT Vallex.

More detailed course schedule

Practice:
  1. Be sure to correctly set CPAN and PERL5LIB:

    cpan
    o conf cpan_home ~/BIG/.cpan
    o conf makepl_arg INSTALL_BASE=~/BIG/perl
    o conf mbuildpl_arg --install_base ~/BIG/perl
    $ grep PERL5 .bashrc
    export PERL5LIB=~/BIG/perl/lib/perl5${PERL5LIB:+:$PERL5LIB}
    • Install XML::XSH2 via CPAN:
      cpan install XML::XSH2
    • Install TrEd with
      wget http://ufal.mff.cuni.cz/~pajas/tred/install_tred.bash
      bash install_tred.bash --tred-dir ~/BIG/TrEd
      Make aliases/links/scripts to run ~/BIG/TrEd/bin/start_tred as tred and similarly for btred. Run TrEd, select Setup → Manage Extensions → Install New. Tick pmltq and pdt20_sample, click Install Selected.

    Convert a phrase-structure tree to a dependency tree.

    Homework (hw01): Finish the exercise.
    Deadline: Thursday 2012/03/01 12.00 (noon)

    Additional homework (ahw01) for those who did not submit the homework in time:
    Download several phrase trees here. The format is similar to that of the previous homework, just the single quotes are omitted at terminals. Write a script that outputs the underlying grammar that is able to generate all the trees. The grammar should have the following form:

    S -> NP VP @ 20
    VP -> V NP PP @ 4
    The rules should be sorted alphabetically by the left-hand non-terminal; for each non-terminal, sort the rules by their frequency of use (printed after the at-sign).
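
    A minimal Python sketch of the rule counting and sorting described above. The bracketed input format `(S (NP John) (VP ...))` is an assumption, since the actual homework format is not reproduced here. The sketch also assumes that lexical rules (a non-terminal rewriting only to words) are not printed and that more frequent rules come first. All function names are hypothetical.

```python
from collections import Counter

def tokenize(text):
    # Assumed format: Lisp-like brackets, e.g. "(S (NP John) (VP (V sleeps)))".
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens, pos=0):
    """Parse the subtree starting at tokens[pos]; return ((label, children), next_pos)."""
    if tokens[pos] != "(":
        raise ValueError("expected '('")
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            children.append(child)
        else:
            children.append(tokens[pos])  # terminal word
            pos += 1
    return (label, children), pos + 1

def count_rules(tree, counts):
    label, children = tree
    nonterminals = [c for c in children if isinstance(c, tuple)]
    if nonterminals:  # assumption: purely lexical rules are skipped
        counts[(label, " ".join(c[0] for c in nonterminals))] += 1
        for child in nonterminals:
            count_rules(child, counts)

def grammar(trees):
    counts = Counter()
    for text in trees:
        tree, _ = parse(tokenize(text))
        count_rules(tree, counts)
    # Primary key: left-hand side; secondary key: frequency (descending, assumed).
    ordered = sorted(counts.items(), key=lambda kv: (kv[0][0], -kv[1], kv[0][1]))
    return [f"{lhs} -> {rhs} @ {n}" for (lhs, rhs), n in ordered]

for rule in grammar(["(S (NP John) (VP (V sleeps)))",
                     "(S (NP Mary) (VP (V sees) (NP John)))"]):
    print(rule)
```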
  2. Conversion of dependency trees to phrase trees.

    Premium task: representing two trees by one string.

    Homework (hw02): Download the data. They contain trees (i.e. just nodes and edges, no ordering) in two different formats, encoded by a structure (-s) or by a reference (-l). The reference encoding just represents the root of the tree as

    root.
    and any other node with the edge to its parent as
    child.parent
    (one node per line in no particular order).
    The structure encoding uses the recursive notation
    parent(child1,child2)
    Write scripts to convert between the formats. Each script should report any error encountered in its input file.
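
    The two encodings can be illustrated with a short Python sketch. It assumes that node names contain no dots, commas or parentheses, and it reports only a few basic errors (duplicate nodes, missing or multiple roots, disconnected nodes, unbalanced brackets). The function names are hypothetical.

```python
def reference_to_structure(lines):
    """Convert 'child.parent' lines (root written as 'root.') to 'parent(child1,child2)'."""
    children, parent_of, root = {}, {}, None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        child, dot, parent = line.partition(".")
        if not dot or not child:
            raise ValueError(f"malformed line: {line!r}")
        if child in parent_of or child == root:
            raise ValueError(f"node listed twice: {child!r}")
        if parent:
            parent_of[child] = parent
            children.setdefault(parent, []).append(child)
        elif root is None:
            root = child
        else:
            raise ValueError("more than one root")
    if root is None:
        raise ValueError("no root")

    reached = 0
    def render(node):
        nonlocal reached
        reached += 1
        kids = children.get(node, [])
        return node + ("(" + ",".join(render(k) for k in kids) + ")" if kids else "")

    text = render(root)
    if reached != len(parent_of) + 1:
        raise ValueError("some nodes are not connected to the root")
    return text

def structure_to_reference(text):
    """Convert 'parent(child1,child2)' to a list of 'child.parent' lines."""
    lines, stack, name, last = [], [], "", None

    def flush():
        nonlocal name, last
        if name:
            lines.append(f"{name}.{stack[-1] if stack else ''}")
            last, name = name, ""

    for ch in text.strip():
        if ch == "(":
            flush()
            if last is None:
                raise ValueError("'(' without a parent node")
            stack.append(last)
        elif ch == ",":
            flush()
        elif ch == ")":
            flush()
            if not stack:
                raise ValueError("unbalanced ')'")
            last = stack.pop()
        else:
            name += ch
    flush()
    if stack:
        raise ValueError("unbalanced '('")
    return lines

print(reference_to_structure(["b.a", "a.", "c.b", "d.a"]))
print(structure_to_reference("a(b(c),d)"))
```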

    Deadline: Thursday 2012/03/08 12.00 (noon).

    Additional homework (ahw02): Make a script that generates a random directed graph in the following format:

    node1 -> node2
    node2 -> node3
    node3 -> node2
    The script takes the number of nodes and number of edges as its parameters. Write another script that finds the longest directed simple cycle in a graph in the same format. Output:
    4: node1 node2 node3 node4 node1
    i.e. the length of the cycle, colon, list of the nodes.
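
    One possible approach to the cycle search, sketched in Python: a depth-first enumeration of simple paths that starts each cycle at its alphabetically smallest node, so that no cycle is visited twice. This is a brute-force sketch (the problem is NP-hard in general), not a reference solution. The function names are hypothetical.

```python
def longest_cycle(edge_lines):
    """Return the nodes of a longest directed simple cycle, or [] if the graph is acyclic."""
    graph = {}
    for line in edge_lines:
        src, arrow, dst = line.split()
        if arrow != "->":
            raise ValueError(f"malformed edge: {line!r}")
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())

    best = []

    def extend(start, path, on_path):
        nonlocal best
        for nxt in sorted(graph[path[-1]]):
            if nxt == start:
                if len(path) > len(best):
                    best = path[:]
            elif nxt > start and nxt not in on_path:
                # Only nodes larger than the start are allowed, so each
                # cycle is found exactly once, from its smallest node.
                on_path.add(nxt)
                path.append(nxt)
                extend(start, path, on_path)
                path.pop()
                on_path.remove(nxt)

    for start in sorted(graph):
        extend(start, [start], {start})
    return best

def format_cycle(cycle):
    # Length, colon, nodes, with the first node repeated at the end.
    return f"{len(cycle)}: {' '.join(cycle + cycle[:1])}"

edges = ["node1 -> node2", "node2 -> node3", "node3 -> node2", "node3 -> node1"]
print(format_cycle(longest_cycle(edges)))
```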
  3. Two serially connected pushdown transducers can simulate a Turing machine.

    xsh
    Basic commands:

    help
    ls
    insert element child into scratch
    cd scratch/child
    insert attribute 'id="a1"' into .
    cd /
    insert attribute 'id="a2"' into /scratch
    rm /scratch/@id
    copy :r scratch/child after .
    set /scratch/child[2]/@id 'a2'
    set scratch/child[2]/text() 'A&B'
    ls :d-1
    exit
    
    Put this line into ~/.xsh2rc:
    register-namespace pml http://ufal.mff.cuni.cz/pdt/pml/ ;

    Processing w-layer: getting back the original text.

    for my $file in { glob '*.w.gz' } {
        open $file ;
        for //pml:para {
            for .//pml:w {
                echo :n (pml:token) ;
                if not(pml:no_space_after=1) echo :n ' ' ;
            }
            echo {"\n"} ;   # perl expression
        }
    } 
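
    For comparison, the same reconstruction can be sketched in Python with the standard library. The inline sample is a simplified, hypothetical w-layer fragment that uses only the element names appearing in the XSH script above; real files are gzipped and would have to be opened with gzip first.

```python
import xml.etree.ElementTree as ET

NS = {"pml": "http://ufal.mff.cuni.cz/pdt/pml/"}

# Hypothetical, heavily simplified w-layer document.
SAMPLE = """<wdata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">
 <doc><para>
  <w id="w1"><token>Ahoj</token><no_space_after>1</no_space_after></w>
  <w id="w2"><token>,</token></w>
  <w id="w3"><token>světe</token><no_space_after>1</no_space_after></w>
  <w id="w4"><token>!</token></w>
 </para></doc>
</wdata>"""

def paragraphs(xml_text):
    """Yield the text of each paragraph, one string per <para>."""
    root = ET.fromstring(xml_text)
    for para in root.iterfind(".//pml:para", NS):
        pieces = []
        for w in para.iterfind(".//pml:w", NS):
            pieces.append(w.findtext("pml:token", default="", namespaces=NS))
            # Like the XSH script: emit a space unless no_space_after is 1.
            flag = w.findtext("pml:no_space_after", default="0", namespaces=NS)
            if flag.strip() != "1":
                pieces.append(" ")
        yield "".join(pieces)

for text in paragraphs(SAMPLE):
    print(text)
```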

    Homework (hw03):
    Create an XSH script that prints a frequency table of all the tokens that have no space before themselves. (Hint: see the XPath axis following-sibling.) Did you expect all the tokens?

    Deadline: Thursday 2012/03/15 12.00 (noon) - sharp!

    Additional homework (ahw03): If you closely inspect the original texts reconstructed from w-layer, you may notice that the attribute no_space_after is misplaced at some punctuation (e.g. some opening brackets). Write a script that will detect these errors and correct them (you might need to read xsh help move).

  4. m-layer
    Searching for deleted tokens (i.e. tokens from w-layer that are not represented on m-layer):

    • The slow way: looping over the files.
      for my $mfile in { glob '*.m.gz' } {
          my $mdoc := open $mfile;
          # The header of the m-file names the corresponding w-file.
          my $wdoc := open /pml:mdata/pml:head/pml:references/pml:reffile[@name='wdata']/@href ;
          for my $wid in $wdoc//pml:w/@id {
              # Every w-id triggers a scan of all w.rf links: quadratic time.
              if (count($mdoc//pml:w.rf[substring-after(., '#')=$wid])=0) echo $wid ;
          }
      }
    • The fast way: using a hash.
      for my $mfile in { glob '*.m.gz' } {
          my $mdoc := open $mfile;
          open /pml:mdata/pml:head/pml:references/pml:reffile[@name='wdata']/@href ;
          # Index all w.rf links once, keyed by the referenced w-id.
          my $whash := hash substring-after(., '#') $mdoc//pml:w.rf ;
          for my $wid in //pml:w/@id
              if (count(xsh:lookup('whash', $wid))=0)
                  echo $wid ;
      }
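
    The difference between the two loops above is the cost of the lookup: the slow version rescans every w.rf link for each w-id, while the hash is built once and then queried in roughly constant time. A minimal Python sketch of the same idea, using made-up ids and a hypothetical function name:

```python
def deleted_tokens(w_ids, m_refs):
    """Return the w-layer ids that no m-layer w.rf link points to."""
    # Build the index once; partition("#")[2] mirrors substring-after(., '#').
    referenced = {ref.partition("#")[2] for ref in m_refs}
    return [wid for wid in w_ids if wid not in referenced]

# Made-up ids for illustration only.
print(deleted_tokens(["w1", "w2", "w3"], ["w#w1", "w#w3"]))
```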

    Homework (hw04):

    1. Write an XSH script that checks that for each change in form (w/token against m/form) the reason for the change is given in m/form_change.
    2. Write an XSH script to check that the segmentation into sentences on m-layer respects the segmentation into paragraphs on w-layer (i.e. no sentence is split across several paragraphs). Do not try to parse the @id's, the only thing guaranteed about them is their uniqueness; rather use the links and XML structure.

    Additional homework (ahw04): Write a script that removes the <s> elements from the m-files and tries to segment the data into sentences (it can use the corresponding w-files, too). You can use the whole sample data as the training/devel-test data (but do not parse the id attribute).

  5. m-layer
    Frequency table of form lengths, sorting by secondary key:

    for { glob '*.m.gz' } {
        open (.) ;
        for //pml:form echo string-length(.) ;
    } | sort | uniq -c | sort -k1,1n -k2nr
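
    The final `sort -k1,1n -k2nr` orders the rows by count (ascending) and breaks ties by form length (descending). A small Python sketch of the same two-key ordering, with a hypothetical function name and toy input:

```python
from collections import Counter

def length_table(forms):
    """Return (count, length) rows sorted like `sort -k1,1n -k2nr`."""
    counts = Counter(len(form) for form in forms)
    rows = [(n, length) for length, n in counts.items()]
    # Primary key: count ascending; secondary key: length descending.
    return sorted(rows, key=lambda row: (row[0], -row[1]))

for n, length in length_table(["a", "bb", "cc", "ddd", "e", "ffff"]):
    print(n, length)
```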
    

    Word in locative case is always preceded by a preposition requiring locative:

    for { glob '*.m.gz' } {
        open (.) ;
        for //pml:tag
            if (substring(.,5,1)='6' and not(substring(.,1,1)='R'))
                if (count(../preceding-sibling::pml:m/pml:tag[xsh:matches(.,'^R...6')])=0)
                    echo ../@id ;
    }
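
    The check relies on the positional tags: the first position holds the part of speech (R for prepositions) and the fifth position holds the case (6 for locative). A Python sketch of the same test over one sentence, given as a list of tags in word order. The function name is hypothetical and the sample tags are only illustrative.

```python
def unlicensed_locatives(tags):
    """Return indexes of locative non-prepositions with no earlier locative preposition."""
    hits = []
    seen_locative_preposition = False
    for index, tag in enumerate(tags):
        is_preposition = tag[:1] == "R"   # position 1: part of speech
        is_locative = tag[4:5] == "6"     # position 5: case
        if is_locative and not is_preposition and not seen_locative_preposition:
            hits.append(index)
        if is_preposition and is_locative:
            seen_locative_preposition = True
    return hits

# Preposition first: nothing to report.
print(unlicensed_locatives(["RR--6----------", "NNFS6-----A----"]))
# The first noun precedes any locative preposition: index 0 is reported.
print(unlicensed_locatives(["NNFS6-----A----", "RR--6----------", "NNIS6-----A----"]))
```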

    Each preposition is followed by a word in the case the preposition requires (code not shown here).

    Homework (hw05):
    Write a script that builds a table of numbers of possible lemma & tag pairs for each form. Example output:

    první    7
    které    7
    další    6
      ...
    zástupců 1

    Additional homework (ahw05): See ahw04.

  6. m-layer and a-layer
    Getting the corrected text from m-layer, formatted as HTML:

    echo '<html><head>' ;
    echo '<meta http-equiv="content-type" content="text/html; charset=utf-8">' ;
    echo '<title>m</title>' ;
    echo '<style type="text/css">ins {color:green} del {color:red}</style>' ;
    echo '</head><body>' ;
    
    for my $file in { glob '*.m.gz' } {
        my $mdoc := open $file;
        my $wdoc := open
            /pml:mdata/pml:head/pml:references/pml:reffile[@name="wdata"]/@href ;
        my $wh := hash @id $wdoc//pml:w ;
    
        my $par;
        for $mdoc/pml:mdata/pml:s {
            my $newpar = xsh:lookup('wh',
                   substring-after(./pml:m[1]/pml:w.rf, '#'))/../pml:othermarkup ;
            if not($par = $newpar ) {
                echo '<p>' ;
                $par = $newpar ;
            }
    
            for ./pml:m {
                my $w = xsh:lookup('wh', substring-after(./pml:w.rf, '#')) ;
    
                if pml:form_change {
                    echo :n '<del>' ;
                    echo :n $w/pml:token ;
                    echo :n '</del><ins>' ;
                }
    
                echo :n (.)/pml:form ;
    
                if pml:form_change echo :n '</ins>' ;
                if not($w/pml:no_space_after = 1) echo :n ' ' ;
            }
            echo '<br>' ;
        }
    }
    
    echo '</body></html>' ;
    Frequency table of cases of Subjects:
    for { glob "sample?.a.gz" } {
      $adoc := open (.) ;
      $mdoc := open
        /pml:adata/pml:head/pml:references/pml:reffile[@name="mdata"]/@href ;
      $tag_table := hash ../@id $mdoc/pml:mdata/pml:s/pml:m/pml:tag ;
      for $adoc/pml:adata/pml:trees//pml:afun[.='Sb'] {
        my $tag = xsh:lookup('tag_table',substring-after(../pml:m.rf,"#")) ;
        echo xsh:match($tag,'^....(.)','') ;
      }
    } | sort | uniq -c | sort -n

    Homework (hw06):

    1. Finish the exercise: make a frequency table of POS of parents of Subjects. Try to explain everything that is not a Verb.
    2. Make a frequency table of the analytical functions that occur exclusively at leaves (i.e. they do not occur at inner nodes at all).

    Additional homework (ahw06): Convert all m-forms to lowercase. Try to write a program that capitalizes where appropriate (rule-based or statistical) based on sentence boundaries, lemmas and tags. Baseline: capitalize the first word of each sentence and each personal or geographical name (95.36%). Beat the baseline.

  7. a-layer, PML-TQ
    Subject with a preposition:

    a-node [
        afun = 'AuxP',
        a-node [
            afun = "Sb"
        ]
    ]
    Frequency table of analytical functions of nodes that have more than one effective parent:
    a-node $child := [
        !   afun in {'AuxX', 'AuxG', 'AuxY', 'AuxZ', 'AuxK', 'ExD'}, 
        2+x eparent a-node [  ]
    ];
      >> give $child.afun
      >> for     $1
         give    $1,count()
         sort by $2
    
    Each case required by a preposition is present at its child even in coordinations:
    a-node $prep := [
        afun = 'AuxP',
        substr(m/tag,0,1) = 'R',
        descendant a-node [
            substr(m/tag,4,1) ~ '[1-7]',
            is_member = 1,
            ! substr(m/tag,4,1) = substr($prep.m/tag,4,1),
            ! afun = 'AuxP',                               # Skip compound prepositions.
            0x ancestor a-node [                           # All nodes in between are coordinations.
                ! afun in {'Coord', 'Apos'},
                ancestor $prep
            ]
        ]
    ]

    Homework (hw07): Make a frequency table of numbers of effective parents. Either the string version or the PML version of the query is a valid solution. E.g.

    Number of effective parents     Number of occurrences
                             0                      500
                             1                    23456
                             2                       50
                             3                       21
                             4                        2
                             7                        1

    Additional homework (ahw07): Try to find two nodes in the parent-child relation on the analytical layer whose surface words have the greatest distance from each other.

  8. a-layer, PML-TQ
    Printing the sentences:

    a-root $root := [
        descendant a-node $node := [  ]
    ]
      >> give $root.id, $node.m/form & if($node.m/w/no_space_after = 1, '', ' '), $node.ord
      >> give distinct $1, concat($2, '' over $1 sort by $3)
      >> give $2

    Optional nodes: matched also by their parents. Example: frequency table of all the compound prepositions.

    a-node $top := [
        afun = 'AuxP', 
      ? a-node $child := [
            afun = 'AuxP',
            sons() = 0
        ]
    ];
      >> give $top.id, lower($child.m/form), $child.ord
      >> give distinct $1, concat($2, ' ' over $1 sort by $3)
      >> filter $2 ~ ' ' 
      >> for $2 give $1, count() sort by $2 desc
    

    Homework (hw08): Write a query that prints sentences containing compound prepositions. Extra points: surround the compound prepositions with <b> and </b> to be rendered in bold face in HTML.

    Additional homework (ahw08): Try to find the t-nodes that refer (by their a/lex.rf or a/aux.rf attributes) to analytical nodes from a different sentence. Your query should print the sentences in HTML and put the referenced words in bold face.

  9. t-layer, btred
    Documentation of Treex::PML::Node
    Documentation of the PDT 2.0 extension.

    A frequency table of analytical functions:

    btred -T -N -e 'writeln $this->{afun}' sample?.a.gz | sort | uniq -c | sort -n
    
    Writing longer scripts: calling with btred -I.

    Print all sentences with compound prepositions:

    #!btred -t PML_A -T -e show()
    
    package PML_A;
    
    sub compound_preposition {
        my $node = shift;
        return ($node->{afun} eq 'AuxP'
            and not $node->children
            and $node->parent->{afun} eq 'AuxP');
    }
    
    sub show {
        if (grep compound_preposition($_), $root->descendants) {
            writeln GetSentenceString();
        }
    }
    
    Frequency table of number of effective parents:
    #!btred -t PML_A -TN -e eparents()
    
    package PML_A;
    
    sub eparents {
        return if $this->{afun} =~ /Aux[CPXG]/;
        my @parents = GetEParents($this, \&DiveAuxCP);
        writeln scalar @parents;
    }
    Relation between t-layer and a-layer: Searching for "switched" dependency:
    a-node $n3 := 
    [ echild a-node $n4 := [  ] ];
    
    t-node 
    [ a/lex.rf $n4, 
         echild t-node 
         [ a/lex.rf $n3 ] ];
    

    Homework (hw09): Try to search for "switched" dependencies in btred. You can use the following for starters:

    #!btred -t PML_T -TN -e switched()
    
    package PML_T;
    
    sub switched {
        my @tp      = GetEParents();
        my $anode   = GetALexNode();
        my @ach     = PML_A::GetEChildren($anode, \&PML_A::DiveAuxCP);
        my @anodes  = map GetALexNode($_), @tp;
    
        if ( ... ) {
            FPosition();
        }
    }
    
    Also, try to find ids of the analytical nodes that are not referenced from the t-layer. You can use xsh, btred or PML-TQ (several tools: extra points).

    Additional homework (ahw09): For each tectogrammatical tree, find the two nodes with the longest path that connects them. Print their ids and the length of the path.

  10. t-layer
    Verbal complement: Verify that the a-nodes corresponding to a t-node with functor COMPL are:

    1. a-node with afun Atv. The node in compl.rf relation to the original one corresponds to the parent of the Atv. The parent of the original node corresponds to the grandparent of the Atv.
    2. a-node with afun AtvV. The node in compl.rf relation to the original one is a generated node. The parent of the original node corresponds to the parent of the AtvV.

    Possible forms of DIR1.

    t-node $t := [
        functor = "DIR1", 
        a/aux.rf a-node $a := [  ]
    ];
      >> for lower(if($a.m/form ~ '^.[Ee]$', substr($a.m/form,0,1), $a.m/form)),
             substr($a.m/tag,4,1)
         give $1 & '+' & $2, count()
         sort by $2

    Inefficient queries. Several ways to find a node with more than four children:

    a-node [sons() > 4]
    a-node [ 5+x a-node [] ]
    a-node [ a-node [], a-node [], a-node [], a-node [], a-node [] ]
    

    Subject-Verb-Object classification of Czech (see Joseph Greenberg). Build a frequency table of all the possible orderings of Subject, Verb and Object in the data. Count only sentences where all three elements appear, squash sequences of the same letter down to one (as in tr//s in Perl). Print percentage in the second column, e.g.

    SVO      57.4%  SvO      66.7%
    OVS      18.2%  OvS       8.3%
    VSO       7.4%  SOv       7.1%
    VOS       6.2%  vSO       4.8%
    SOV       4.1%  OSv       4.8%
    OSV       2.1%  vOS       3.6%
    OVOS      1.7%  SOvO      3.6%
    SOVO      1.7%  OSvO      1.2%
    OVSO      0.8%
    OSVO      0.4%
    
    Extra points: main clauses and subordinate clauses counted separately (as in the example).
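
    The squashing and percentage steps can be sketched in Python as follows. The input is assumed to be one string of S/V/O letters per clause, already extracted from the trees; that extraction is the actual exercise and is not shown. The function names are hypothetical.

```python
from collections import Counter
from itertools import groupby

def squash(letters):
    """Collapse runs of the same letter, like tr/A-Z//s in Perl."""
    return "".join(key for key, _ in groupby(letters))

def order_table(sequences):
    """Return (ordering, percentage) rows, most frequent first."""
    # Count only sequences in which all three elements appear.
    kept = [squash(s) for s in sequences if set("SVO") <= set(s)]
    counts = Counter(kept)
    total = sum(counts.values())
    return [(order, f"{100 * n / total:.1f}%") for order, n in counts.most_common()]

for order, share in order_table(["SVO", "SSVO", "OVS", "SV", "SVOO"]):
    print(order, share)
```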

    Homework (hw10):

    1. Find all the possible forms of TWHEN and their frequencies. Pay attention to compound prepositions, subordinate conjunctions, auxiliary verbs (should not be printed), etc. The output should look like this:
      v + 6           10  # preposition + case
      když + c         4  # subordinate conjunction + clause
      na začátku + 2   4  # compound preposition + case
      2                3  # just case
      tehdy            2  # word not expressing case
      
    2. Finish the SVO classification.

    Additional homework (ahw10): How many nouns have a preposition on the analytical layer?

  11. t-layer
    List all modal verbs and their corresponding modality type.

    t-node $t := [
        gram/deontmod ~ '.',
      ! gram/deontmod = 'decl',
        a/aux.rf a-node $a := [
            substr(m/tag,0,1) = 'V',
          ! m/lemma = 'být'
        ]
    ]
      >> for $t.gram/deontmod, $a.m/lemma
         give $1, $2, count()
         sort by $3 desc
    
    Find all non-projective trees.
    a-root $root := [
        descendant a-node $gap := [
          (  ( order-follows $parent
               and order-precedes $child)
          or ( order-precedes $parent
               and order-follows $child
              )
           )
        ], 
        descendant a-node $parent := [
          ! descendant $gap, 
            a-node $child := [  ]
        ]
    ];
      >> give distinct $root.id
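
    The query encodes the usual definition: an edge is non-projective if some node lying between the parent and the child in the surface order is not a descendant of the parent. A Python sketch of the same test on a toy representation, where `parents[i]` is the word-order index of the parent of word i and `None` marks the root. The representation and the function name are assumptions for illustration.

```python
def is_projective(parents):
    """Return True if no edge has a gap; assumes the input is a tree (no cycles)."""
    def is_descendant(node, ancestor):
        while node is not None:
            if node == ancestor:
                return True
            node = parents[node]
        return False

    for child, parent in enumerate(parents):
        if parent is None:
            continue
        low, high = sorted((child, parent))
        for gap in range(low + 1, high):
            # A node inside the span that the parent does not dominate is a gap.
            if not is_descendant(gap, parent):
                return False
    return True

print(is_projective([None, 0, 1]))     # a simple chain
print(is_projective([2, 3, None, 2]))  # edge 3 -> 1 spans the root at position 2
```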
    

    Homework (hw11): Textual coreference links can be chained into so-called “coreference chains”. Make a frequency table of lengths of the coreference chains.

    Additional homework (ahw11): Some tectogrammatical nodes have the list type (attribute nodetype) - they correspond to a list, whose members are represented by some (find out which) children of the node. Try to find whether these members always form a continuous part of the surface sentence.

  1. Lecture: Trees
    Slides: PPT/PDF
  2. Lecture: Projectivity, FGD
    Slides: PPT/PDF
  3. Lecture: Differences between PDT and FGD
    Slides: PPT/PDF
  4. Lecture: Morphological Layer
    Slides: PPT/PDF
  5. Lecture: Analytical Layer
    Slides: PPT/PDF
  6. Lecture: Analytical Layer (Continued)
    Slides: PPT/PDF
  7. Lecture: Tectogrammatical Layer (Introduction)
    Slides: PPT/PDF
  8. Lecture: Tectogrammatical Layer (Valency)
    Slides: PPT/PDF
  9. Lecture: Tectogrammatical Layer (Valency Continued)
    Slides: PPT/PDF
  10. Lecture: Tectogrammatical Layer (Coreference)
    Slides: PPT/PDF
  11. Lecture: Tectogrammatical Layer (TFA)
    Slides: PPT/PDF
  12. Lecture: Tectogrammatical Layer of English
    Slides: PPTX/PDF

Literature

Required work

Rules for homeworks

Results

Final test

Determination of final grade