Be sure to correctly set CPAN and PERL5LIB:
cpan
o conf cpan_home ~/BIG/.cpan
o conf makepl_arg INSTALL_BASE=~/BIG/perl
o conf mbuildpl_arg --install_base ~/BIG/perl
$ grep PERL5 .bashrc
export PERL5LIB=~/BIG/perl/lib/perl5${PERL5LIB:+:$PERL5LIB}
Convert a phrase-structure tree
to a dependency tree.
Homework (hw01): Finish the exercise.
Deadline: Thursday 2012/03/01 12.00 (noon)
Additional homework (ahw01) for those who did not submit the
homework in time:
Download several phrase trees here. The format is similar
to that of the previous homework, just the single quotes
are omitted at terminals. Write a script that outputs
the underlying grammar that is able to generate all the
trees. The grammar should have the following form:
S -> NP VP @ 20
VP -> V NP PP @ 4
The rules should be sorted alphabetically by the left-hand
non-terminal; for each non-terminal, sort the rules by
the frequency of their usage (printed after the at-sign).
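The required ordering and output format can be sketched as follows. This is only an illustration, assuming the rule occurrences have already been collected from the trees as (left-hand side, right-hand side) pairs. The function name format_grammar, the descending direction of the frequency sort and the sample data are assumptions, not part of the assignment.

```python
from collections import Counter


def format_grammar(rule_occurrences):
    """Return grammar lines for an iterable of (lhs, rhs_tuple) rule uses."""
    counts = Counter(rule_occurrences)
    # Sort by left-hand side alphabetically, then by frequency.
    # Descending frequency is an assumption; the right-hand side is
    # used only as a final tie-breaker to keep the output stable.
    ordered = sorted(
        counts.items(),
        key=lambda item: (item[0][0], -item[1], item[0][1]),
    )
    return ["%s -> %s @ %d" % (lhs, " ".join(rhs), n) for (lhs, rhs), n in ordered]


# Hypothetical rule occurrences, one entry per use of a rule in some tree.
occurrences = (
    [("S", ("NP", "VP"))] * 3
    + [("VP", ("V", "NP"))] * 2
    + [("VP", ("V",))]
)
for line in format_grammar(occurrences):
    print(line)
```

Counting with a Counter keeps the extraction of rules from the trees separate from the formatting step, so the tree reader can be changed without touching the sorting.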
-
Conversion of dependency trees to phrase trees.
Premium task: representing two trees by one string.
Homework (hw02):
Download the data. They contain
trees (i.e. just nodes and edges, no ordering) in two
different formats, encoded by a structure (-s) or by a
reference (-l). The reference encoding just represents a
root of the tree as
root.
and any other node
with the edge to its parent as
child.parent
(one node per line in no
particular order).
The structure encoding uses the recursive notation parent(child1,child2)
Write scripts to convert between the formats. Each script
should report any error encountered in its input file.
Deadline: Thursday 2012/03/08 12.00 (noon).
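To make the two encodings concrete, here is a minimal sketch of one direction of the conversion (reference to structure). It assumes that node names contain no dots, commas or parentheses, and it prints the children in alphabetical order because the trees carry no ordering. The function name reference_to_structure is hypothetical, and only a few input errors are detected (a node with two parents or a cycle is not reported).

```python
def reference_to_structure(lines):
    """Convert 'child.parent' lines (root written as 'root.') to parent(child1,child2)."""
    children = {}
    root = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        child, dot, parent = line.partition(".")
        if not dot or not child:
            raise ValueError("malformed line: %r" % line)
        if parent == "":
            # A line ending with the dot denotes the root.
            if root is not None:
                raise ValueError("more than one root: %r and %r" % (root, child))
            root = child
        else:
            children.setdefault(parent, []).append(child)
    if root is None:
        raise ValueError("no root found")

    def render(node):
        kids = sorted(children.get(node, []))
        if not kids:
            return node
        return "%s(%s)" % (node, ",".join(render(kid) for kid in kids))

    return render(root)


print(reference_to_structure(["a.", "b.a", "c.a", "d.b"]))
```

A complete solution would also have to check that every node is reachable from the root and has exactly one parent.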
Additional homework (ahw02):
Make a script that generates a random directed
graph in the following format:
node1 -> node2
node2 -> node3
node3 -> node2
The script takes the number of nodes and number of
edges as its parameters. Write another script
that finds the longest directed simple cycle in a graph
in the same format. Output:4: node1 node2 node3 node4 node1
i.e. the length of the cycle, colon, list of the nodes.
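Both the graph format and the expected output can be illustrated by a small sketch. It assumes that self-loops and duplicate edges are not generated and that a brute-force search is acceptable (it is exponential, so it suits only small graphs). The function names, the seed parameter and the node naming scheme are assumptions.

```python
import random


def random_digraph(n_nodes, n_edges, seed=None):
    """Return n_edges distinct lines of the form 'nodeA -> nodeB'."""
    rng = random.Random(seed)
    nodes = ["node%d" % i for i in range(1, n_nodes + 1)]
    # All possible edges without self-loops (an assumption of this sketch).
    candidates = [(a, b) for a in nodes for b in nodes if a != b]
    if n_edges > len(candidates):
        raise ValueError("too many edges requested")
    return ["%s -> %s" % edge for edge in rng.sample(candidates, n_edges)]


def longest_simple_cycle(edge_lines):
    """Return 'length: n1 n2 ... n1' for a longest simple cycle, or None."""
    graph = {}
    for line in edge_lines:
        src, dst = line.split(" -> ")
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    best = []

    def extend(start, path, on_path):
        nonlocal best
        for nxt in sorted(graph[path[-1]]):
            if nxt == start:
                if len(path) + 1 > len(best):
                    best = path + [start]
            elif nxt not in on_path and nxt > start:
                # Only cycles whose smallest node is 'start' are explored,
                # so each cycle is enumerated once.
                on_path.add(nxt)
                extend(start, path + [nxt], on_path)
                on_path.discard(nxt)

    for start in sorted(graph):
        extend(start, [start], {start})
    if not best:
        return None
    return "%d: %s" % (len(best) - 1, " ".join(best))


print(longest_simple_cycle(["node1 -> node2", "node2 -> node3", "node3 -> node2"]))
```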
-
Two serially connected pushdown transducers can
simulate a Turing machine.
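A related classical observation is that two stacks are enough to represent a Turing machine tape: one stack holds the cells to the left of the head, the other holds the cell under the head and everything to its right. The sketch below illustrates only this tape representation, not the transducer construction itself. The class name TwoStackTape and the blank symbol are assumptions.

```python
class TwoStackTape:
    """A Turing machine tape represented by two stacks (Python lists)."""

    def __init__(self, content, blank="_"):
        self.blank = blank
        self.left = []  # cells left of the head, nearest cell on top
        # The cell under the head is on top of the right stack.
        self.right = list(reversed(content)) or [blank]

    def read(self):
        return self.right[-1]

    def write(self, symbol):
        self.right[-1] = symbol

    def move_right(self):
        self.left.append(self.right.pop())
        if not self.right:
            self.right.append(self.blank)  # extend the tape with a blank

    def move_left(self):
        self.right.append(self.left.pop() if self.left else self.blank)

    def contents(self):
        return "".join(self.left + list(reversed(self.right)))


tape = TwoStackTape("ab")
tape.write("x")
tape.move_right()
print(tape.read(), tape.contents())
```

Every head move is a pop from one stack and a push to the other, which is the operation a pushdown device can perform.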
xsh
Basic commands:
help
ls
insert element child into scratch
cd scratch/child
insert attribute 'id="a1"' into .
cd /
insert attribute 'id="a2"' into /scratch
rm /scratch/@id
copy :r scratch/child after .
set /scratch/child[2]/@id 'a2'
set scratch/child[2]/text() 'A&B'
ls :d-1
exit
Put this line into ~/.xsh2rc:
register-namespace pml http://ufal.mff.cuni.cz/pdt/pml/ ;
Processing w-layer: getting back the original text.
for my $file in { glob '*.w.gz' } {
open $file ;
for //pml:para {
for .//pml:w {
echo :n (pml:token) ;
if not(pml:no_space_after=1) echo :n ' ' ;
}
echo {"\n"} ; # perl expression
}
}
Homework (hw03):
Create an XSH script that prints a frequency table of all
the tokens that have no space before themselves. (Hint:
see the XPath axis following-sibling.) Did
you expect all the tokens?
Deadline: Thursday 2012/03/15 12.00 (noon) - sharp!
Additional homework (ahw03):
If you closely inspect the original texts reconstructed
from w-layer, you may notice that the
attribute no_space_after is misplaced at
some punctuation (e.g. some opening brackets). Write a
script that will detect these errors and correct them
(you might need to read xsh help move).
-
m-layer
Searching for deleted tokens (i.e. tokens from
w-layer that are not represented on
m-layer):
- The slow way: looping over the files.
for my $mfile in { glob '*.m.gz' } {
my $mdoc := open $mfile;
my $wdoc := open /pml:mdata/pml:head/pml:references/pml:reffile[@name='wdata']/@href ;
for my $wid in $wdoc//pml:w/@id {
if (count($mdoc//pml:w.rf[substring-after(., '#')=$wid])=0) echo $wid ;
}
}
- The fast way: using a hash.
for my $mfile in { glob '*.m.gz' } {
my $mdoc := open $mfile;
open /pml:mdata/pml:head/pml:references/pml:reffile[@name='wdata']/@href ;
my $whash := hash substring-after(., '#') $mdoc//pml:w.rf ;
for my $wid in //pml:w/@id
if (count(xsh:lookup('whash', $wid))=0)
echo $wid ;
}
Homework (hw04):
- Write an XSH script that checks that for each
change in form (
w/token against
m/form) the reason for the change is given
in m/form_change.
Write an XSH script to check that the segmentation
to sentences on m-layer respects the
segmentation to paragraphs on w-layer
(i.e. no sentence is split across several paragraphs). Do not try
to parse the @id's, the only thing guaranteed about them is
their uniqueness; rather use the links and XML structure.
Additional homework (ahw04):
Write a script that removes the <s>
elements from the m-files and tries to
segment the data to sentences (it can use the
corresponding w-files, too). You can use
the whole sample data as the training/devel-test data
(but do not parse the id attribute).
m-layer
Frequency table of form lengths, sorting by secondary
key:
for { glob '*.m.gz' } {
open (.) ;
for //pml:form echo string-length(.) ;
} | sort | uniq -c | sort -k1,1n -k2nr
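The final sort -k1,1n -k2nr orders the table by the count (first column, ascending) and breaks ties by the length (second column, descending). The same ordering can be sketched in Python, assuming the forms are already available as a list of strings. The function name length_table and the sample forms are hypothetical.

```python
from collections import Counter


def length_table(forms):
    """Return (count, length) rows: count ascending, then length descending."""
    counts = Counter(len(form) for form in forms)
    rows = [(n, length) for length, n in counts.items()]
    # Primary key: count, ascending. Secondary key: length, descending.
    return sorted(rows, key=lambda row: (row[0], -row[1]))


for count, length in length_table(["a", "b", "cc", "dd", "eee"]):
    print(count, length)
```

Negating the secondary key is a compact way to mix an ascending and a descending numeric key in a single sort.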
A word in the locative case is always preceded by a preposition requiring
the locative:
for { glob '*.m.gz' } {
open (.) ;
for //pml:tag
if (substring(.,5,1)='6' and not(substring(.,1,1)='R'))
if (count(../preceding-sibling::pml:m/pml:tag[xsh:matches(.,'^R...6')])=0)
echo ../@id ;
}
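The logic of the check can be restated outside XSH as a sketch. It assumes that a sentence is given as a list of (id, tag) pairs in surface order and that the tags are positional, with the part of speech at position 1 and the case at position 5, as the XSH code above expects. The function name and the sample tags are illustrative only.

```python
def unlicensed_locatives(sentence):
    """Return ids of locative words with no preceding locative preposition."""
    seen_locative_preposition = False
    result = []
    for node_id, tag in sentence:
        is_locative = tag[4] == "6"       # 5th position: case
        is_preposition = tag[0] == "R"    # 1st position: part of speech
        if is_preposition and is_locative:
            seen_locative_preposition = True
        elif is_locative and not is_preposition and not seen_locative_preposition:
            result.append(node_id)
    return result


# Hypothetical ids with tags shaped like 15-position PDT tags.
with_preposition = [("m1", "RR--6----------"), ("m2", "NNIS6-----A----")]
without_preposition = [("m1", "NNIS6-----A----")]
print(unlicensed_locatives(with_preposition))
print(unlicensed_locatives(without_preposition))
```

Like the XSH version, the sketch accepts a locative preposition anywhere earlier in the sentence, not only immediately before the word.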
Each preposition is followed by a word in the case the preposition
requires (code not shown here).
Homework (hw05):
Write a script that builds a table of
numbers of possible lemma & tag pairs for each
form. Example output:
první 7
které 7
další 6
...
zástupců 1
Additional homework (ahw05): See ahw04.
-
m-layer and a-layer
Getting the corrected text from m-layer,
formatted as HTML:
echo '<html><head>' ;
echo '<meta http-equiv="content-type" content="text/html; charset=utf-8">' ;
echo '<title>m</title>' ;
echo '<style type="text/css">ins {color:green} del {color:red}</style>' ;
echo '</head><body>' ;
for my $file in { glob '*.m.gz' } {
my $mdoc := open $file;
my $wdoc := open
/pml:mdata/pml:head/pml:references/pml:reffile[@name="wdata"]/@href ;
my $wh := hash @id $wdoc//pml:w ;
my $par;
for $mdoc/pml:mdata/pml:s {
my $newpar = xsh:lookup('wh',
substring-after(./pml:m[1]/pml:w.rf, '#'))/../pml:othermarkup ;
if not($par = $newpar ) {
echo '<p>' ;
$par = $newpar ;
}
for ./pml:m {
my $w = xsh:lookup('wh', substring-after(./pml:w.rf, '#')) ;
if pml:form_change {
echo :n '<del>' ;
echo :n $w/pml:token ;
echo :n '</del><ins>' ;
}
echo :n (.)/pml:form ;
if pml:form_change echo :n '</ins>' ;
if not($w/pml:no_space_after = 1) echo :n ' ' ;
}
echo '<br>' ;
}
}
echo '</body></html>' ;
Frequency table of cases of Subjects:
for { glob "sample?.a.gz" } {
$adoc := open (.) ;
$mdoc := open
/pml:adata/pml:head/pml:references/pml:reffile[@name="mdata"]/@href ;
$tag_table := hash ../@id $mdoc/pml:mdata/pml:s/pml:m/pml:tag ;
for $adoc/pml:adata/pml:trees//pml:afun[.='Sb'] {
my $tag = xsh:lookup('tag_table',substring-after(../pml:m.rf,"#")) ;
echo xsh:match($tag,'^....(.)','') ;
}
} | sort | uniq -c | sort -n
Homework (hw06):
- Finish the exercise: make a frequency
table of POS of parents of Subjects. Try to explain
everything that is not a Verb.
- Make a frequency table of the analytical functions
that occur exclusively at leaves (i.e. they do not occur
at inner nodes at all).
Additional homework (ahw06): Convert all
m-forms to lowercase. Try to write a program
that capitalizes where appropriate (rule-based or
statistical) based on sentence boundaries, lemmas and
tags. Baseline: capitalize the first word of each sentence
and each personal or geographical name (95.36%). Beat the
baseline.
-
a-layer, PML-TQ
Subject with a preposition:
a-node [
afun = 'AuxP',
a-node [
afun = "Sb"
]
]
Frequency table of analytical functions of nodes that have
more than one effective parent:
a-node $child := [
! afun in {'AuxX', 'AuxG', 'AuxY', 'AuxZ', 'AuxK', 'ExD'},
2+x eparent a-node [ ]
];
>> give $child.afun
>> for $1
give $1,count()
sort by $2
Each case required by a preposition is present at its child
even in coordinations:
a-node $prep := [
afun = 'AuxP',
substr(m/tag,0,1) = 'R',
descendant a-node [
substr(m/tag,4,1) ~ '[1-7]',
is_member = 1,
! substr(m/tag,4,1) = substr($prep.m/tag,4,1),
! afun = 'AuxP', # Skip compound prepositions.
0x ancestor a-node [ # All nodes in between are coordinations.
! afun in {'Coord', 'Apos'},
ancestor $prep
]
]
]
Homework (hw07):
Make a frequency table of numbers of effective
parents. Both the string version and the PML version of the
query are valid solutions. E.g.
Number of effective parents    Number of occurrences
0                              500
1                              23456
2                              50
3                              21
4                              2
7                              1
Additional homework (ahw07):
Try to find two nodes in the parent-child relation on the
analytical layer whose surface words have the greatest
distance from each other.
-
a-layer, PML-TQ
Printing the sentences:
a-root $root := [
descendant a-node $node := [ ]
]
>> give $root.id, $node.m/form & if($node.m/w/no_space_after = 1, '', ' '), $node.ord
>> give distinct $1, concat($2, '' over $1 sort by $3)
>> give $2
Optional nodes: matched also by their parents. Example:
frequency table of all the compound prepositions.
a-node $top := [
afun = 'AuxP',
? a-node $child := [
afun = 'AuxP',
sons() = 0
]
];
>> give $top.id, lower($child.m/form), $child.ord
>> give distinct $1, concat($2, ' ' over $1 sort by $3)
>> filter $2 ~ ' '
>> for $2 give $1, count() sort by $2 desc
Homework (hw08):
Write a query that prints sentences containing compound
prepositions. Extra points: surround the compound prepositions with
<b> and </b> to be rendered in bold face in HTML.
Additional homework (ahw08):
Try to find the t-nodes that refer (by their a/lex.rf
or a/aux.rf attributes) to analytical nodes from a
different sentence. Your query should print the sentences in HTML
and put the referenced words in bold face.
-
t-layer, btred
Documentation of Treex::PML::Node
Documentation of the PDT 2.0 extension.
A frequency table of numbers of analytical functions:
btred -T -N -e 'writeln $this->{afun}' sample?.a.gz | sort | uniq -c | sort -n
Writing longer scripts: calling with btred -I.
Print all sentences with compound prepositions:
#!btred -t PML_A -T -e show()
package PML_A;
sub compound_preposition {
my $node = shift;
return ($node->{afun} eq 'AuxP'
and not $node->children
and $node->parent->{afun} eq 'AuxP');
}
sub show {
if (grep compound_preposition($_), $root->descendants) {
writeln GetSentenceString();
}
}
Frequency table of number of effective parents:
#!btred -t PML_A -TN -e eparents()
package PML_A;
sub eparents {
return if $this->{afun} =~ /Aux[CPXG]/;
my @parents = GetEParents($this, \&DiveAuxCP);
writeln scalar @parents;
}
Relation between t-layer and a-layer: Searching for
"switched" dependency:
a-node $n3 :=
[ echild a-node $n4 := [ ] ];
t-node
[ a/lex.rf $n4,
echild t-node
[ a/lex.rf $n3 ] ];
Homework (hw09):
Try to search for "switched" dependencies in btred. You can
use the following for starters:
#!btred -t PML_T -TN -e switched()
package PML_T;
sub switched {
my @tp = GetEParents();
my $anode = GetALexNode();
my @ach = PML_A::GetEChildren($anode, \&PML_A::DiveAuxCP);
my @anodes = map GetALexNode($_), @tp;
if ( ... ) {
FPosition();
}
}
Also, try to find ids of the analytical nodes that are not referenced
from the t-layer. You can use xsh, btred or PML-TQ (several tools:
extra points).
Additional homework (ahw09): For each tectogrammatical tree,
find the two nodes with the longest path that connects them. Print
their ids and the length of the path.
-
t-layer
Verbal complement: Verify that a-nodes corresponding to t-node with
functor COMPL are:
- a-node with afun
Atv. The node in
compl.rf relation to the original one corresponds to
the parent of the Atv. The parent of the original node
corresponds to the grandparent of the Atv.
- a-node with afun
AtvV. The node in
compl.rf relation to the original one is a generated
node. The parent of the original node corresponds to the parent of
the AtvV.
Possible forms of DIR1.
t-node $t := [
functor = "DIR1",
a/aux.rf a-node $a := [ ]
];
>> for lower(if($a.m/form ~ '^.[Ee]$', substr($a.m/form,0,1), $a.m/form)),
substr($a.m/tag,4,1)
give $1 & '+' & $2, count()
sort by $2
Inefficient queries. Several ways to find a node with more than four children:
a-node [sons() > 4]
a-node [ 5+x a-node [] ]
a-node [ a-node [], a-node [], a-node [], a-node [], a-node [] ]
Subject-Verb-Object classification of Czech (see
Joseph Greenberg). Build a frequency table of all the possible
orderings of Subject, Verb and Object in the data. Count only
sentences where all three elements appear, squash sequences of the
same letter down to one (as in tr//s in Perl). Print
percentage in the second column, e.g.
SVO   57.4%    SvO   66.7%
OVS   18.2%    OvS    8.3%
VSO    7.4%    SOv    7.1%
VOS    6.2%    vSO    4.8%
SOV    4.1%    OSv    4.8%
OSV    2.1%    vOS    3.6%
OVOS   1.7%    SOvO   3.6%
SOVO   1.7%    OSvO   1.2%
OVSO   0.8%
OSVO   0.4%
Extra points: main clauses and subordinate clauses counted separately (as in the example).
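The squashing and the percentage column can be sketched as follows, assuming each sentence has already been reduced to a string of the letters S, V and O in surface order. The function names and the sample input are hypothetical, and the sketch does not distinguish main and subordinate clauses.

```python
from collections import Counter
from itertools import groupby


def squash(letters):
    """Collapse runs of the same letter, like tr/A-Z//s in Perl."""
    return "".join(key for key, _ in groupby(letters))


def order_table(sentences):
    """Return (pattern, percentage) rows, most frequent pattern first."""
    patterns = Counter()
    for letters in sentences:
        # Count only sentences where all three elements appear.
        if {"S", "V", "O"} <= set(letters):
            patterns[squash(letters)] += 1
    total = sum(patterns.values())
    return [
        (pattern, "%.1f%%" % (100.0 * n / total))
        for pattern, n in patterns.most_common()
    ]


for pattern, share in order_table(["SVO", "SSVO", "OVS", "SV"]):
    print(pattern, share)
```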
Homework (hw10):
- Find all the possible forms of
TWHEN and their
frequencies. Pay attention to compound prepositions, subordinate
conjunctions, auxiliary verbs (should not be printed), etc. The output
should look like this:
v + 6 10 # preposition + case
když + c 4 # subordinate conjunction + clause
na začátku + 2 4 # compound preposition + case
2 3 # just case
tehdy 2 # word not expressing case
- Finish the SVO classification.
Additional homework (ahw10):
How many nouns have a preposition
on the analytical layer?
-
t-layer
List all modal verbs and their corresponding modality type.
t-node $t := [
gram/deontmod ~ '.',
! gram/deontmod = 'decl',
a/aux.rf a-node $a := [
substr(m/tag,0,1) = 'V',
! m/lemma = 'být'
]
]
>> for $t.gram/deontmod, $a.m/lemma
give $1, $2, count()
sort by $3 desc
Find all non-projective trees.
a-root $root := [
descendant a-node $gap := [
( ( order-follows $parent
and order-precedes $child)
or ( order-precedes $parent
and order-follows $child
)
)
],
descendant a-node $parent := [
! descendant $gap,
a-node $child := [ ]
]
];
>> give distinct $root.id
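The condition expressed by the query (a node lying between a parent and its child in the surface order without being a descendant of that parent) can be restated as a sketch. It assumes the tree is given as a list in which position i holds the index of the parent of the i-th node in surface order, with None for the root at position 0. The function name and this representation are assumptions.

```python
def is_nonprojective(parent):
    """Return True if some edge has a gap node that is not under its parent."""

    def is_descendant(node, ancestor):
        node = parent[node]
        while node is not None:
            if node == ancestor:
                return True
            node = parent[node]
        return False

    for child, par in enumerate(parent):
        if par is None:
            continue
        low, high = sorted((child, par))
        # Every node strictly between the parent and the child
        # must be dominated by the parent.
        for gap in range(low + 1, high):
            if not is_descendant(gap, par):
                return True
    return False


print(is_nonprojective([None, 0, 1, 1]))
print(is_nonprojective([None, 3, 0, 0]))
```

In the second tree, node 2 lies between node 3 and its child node 1 but hangs on the root, so the edge from 3 to 1 is non-projective.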
Homework (hw11):
Textual coreference links can be chained into so-called
“coreference chains”. Make a
frequency table of lengths of the coreference
chains.
Additional homework (ahw11): Some tectogrammatical nodes
have the list type (attribute nodetype) -
they correspond to a list, whose members are represented by some
(find out which) children of the node. Try to find whether these
members always form a continuous part of the surface sentence.