Lab sessions
Every other Thursday (more or less), 10:40 a.m. in SW1
Jiří Mírovský
mirovsky at ufal.mff.cuni.cz
room 422
Daniel Zeman
zeman at ufal.mff.cuni.cz
room 409
Outline
- various formats for phrase-structure and dependency trees, transformations (Perl or another programming language)
- mining information from the word layer and morphological layer of the Prague Dependency Treebank or the Prague English Dependency Treebank (bash, Perl or another programming language)
- mining information from all layers of the PDT/PEDT (btred, Perl)
- mining information from UD data (btred, Perl)
- searching in treebanks with PML-Tree Query (PML-TQ)
Homework
Homework results (click here)
Class 02 - March 19th, 2026
The goal of today's class and the homework is to learn to use btred, a scripting tool for working with TrEd data.
Homework 02: After you attend the class or finish the btred tutorial (see below), carefully read the instructions for the homework (due on March 30th).
Individual study (if you miss the class)
Our task is to learn to use btred, a scripting tool for working with TrEd data.
Let us start with a few clarifying statements, which you may already know:
- The tree editor TrEd works primarily with data encoded in the Prague Markup Language (PML), which is a general XML format for encoding linguistically annotated treebanks. Specifics of individual treebanks (annotation layers, types of nodes, types of relations between the nodes, attributes defined at nodes, etc.) are defined in TrEd extensions.
- btred is a command line scripting interface for the PML data. It can use both general functions defined in the PML and treebank-specific functions defined in TrEd extensions.
- TrEd and btred are written in Perl, so scripts for btred also need to be written in Perl. However, you only need basic knowledge of Perl.
Your task
Please follow the btred tutorial, which will take you through the first steps of working with btred. Our plan is to cover steps 1-5 and 7 of the tutorial in this class and step 6 in the next class. For the homework, knowledge from steps 1-5 and 7 should suffice.
For the tutorial, you can use any PDT-like data, e.g. these PDT data containing annotation of texts up to the analytical layer (a-layer). For examples that use the tectogrammatical layer (t-layer), use these data from the PDT, which also contain the t-files.
As Perl may be new to you and btred most certainly is, let me give you a few hints to make your work with Perl and btred easier. You will see in the tutorial that btred scripts may start with three different lines:
- #!btred -e function_to_run()
  The function function_to_run() will be run once for each given file. You can get an array of all trees (their roots) in the given file by my @roots = GetTrees().
- #!btred -T -e function_to_run()
  The function function_to_run() will be run on each tree in the given files. Variable $root will contain the root of the current tree. You can get an array of all nodes (incl. the root) in the given tree by my @nodes = GetNodes($root), or (excl. the root) by my @nodes = $root->descendants.
- #!btred -TN -e function_to_run()
  The function function_to_run() will be run on all nodes in all trees in the given files. Variable $root will contain the root of the current tree and variable $this will contain the current node.
A simple script for counting all nodes in each given file and printing the number next to each file name might look like this:
#!btred -e count_nodes()
sub count_nodes {
    my $total_count = 0;
    my @roots = GetTrees(); # get an array of roots of all trees in the file
    foreach my $root (@roots) {
        my @nodes = GetNodes($root); # get an array of all nodes in the tree (incl. the root)
        my $number_of_nodes = scalar(@nodes); # get the length of the array
        $total_count += $number_of_nodes;
    }
    my $filename = FileName();
    print "$filename: $total_count\n";
}
If the script is named count.btred, it can be run on all gzipped a-files in the current directory from a terminal with the following command:
btred -I count.btred *.a.gz
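For comparison, a -TN variant of the counting idea could look like the sketch below. It is an illustration only: the function name count_node is made up, and the script assumes that a variable declared with my at the top level of the script keeps its value between calls and that btred calls exit_hook once after the last input file (see the TrEd/btred user manual). It prints one total for all files rather than one number per file, and it needs btred to run.

```perl
#!btred -TN -e count_node()
use strict;
use warnings;

my $total_count = 0;    # assumed to be shared by all calls of count_node()

sub count_node {
    # called once for every node (incl. the root) of every tree
    $total_count++;
}

sub exit_hook {
    # assumed to be called once, after all input files have been processed
    print "all files: $total_count\n";
}
```

If this sketch were saved as count_all.btred (a hypothetical name), it would be run the same way as above: btred -I count_all.btred *.a.gz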
I suggest that after the first line in any btred (or Perl in general) script, you add the following Perl instructions:
use strict; # it informs e.g. about non-declared variables (often typos)
use warnings; # it warns e.g. if an uninitialized variable is used (e.g. in addition or concatenation)
use utf8; # it allows utf8 in the script source code
binmode STDIN, ':utf8'; # setting utf8 for STDIN
binmode STDOUT, ':utf8'; # the same for STDOUT
binmode STDERR, ':utf8'; # the same for STDERR
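To see what use utf8 changes in practice, here is a small stand-alone Perl sketch (plain perl, no btred needed). The helper name char_and_byte_length is made up for this illustration, and the script file is assumed to be saved in UTF-8. With use utf8, a Czech name in the source code is treated as a string of characters, so its length is counted in characters rather than in bytes.

```perl
use strict;
use warnings;
use utf8;                     # string literals in this file are decoded from UTF-8
use Encode qw(encode);
binmode STDOUT, ':utf8';      # print characters as UTF-8

# Return the length of a string in characters and in UTF-8 bytes.
sub char_and_byte_length {
    my ($string) = @_;
    return (length($string), length(encode('UTF-8', $string)));
}

my ($chars, $bytes) = char_and_byte_length('Jiří');
print "Jiří: $chars characters, $bytes bytes\n";
```

The script prints "Jiří: 4 characters, 6 bytes", because ř and í each take two bytes in UTF-8.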
Manuals and documentation
For writing btred scripts generally and for a particular treebank (say, a PDT-like treebank), there are three main sources of information:
- TrEd/btred user manual, namely its section 15 - User Macros, and most importantly its subsections 15.8. Public API: pre-defined macros (functions GetNodes, GetTrees, ListV, FileName, etc.) and 15.9. Hooks: automatically executed macros (function exit_hook, which is executed once after all input files are processed),
- documentation for Treex::PML - the fundamental libraries used by the TrEd toolkit, first of all documentation for Treex::PML::Node (object methods such as parent, firstson, level, attr, set_attr, children, descendants, ancestors, etc.),
- documentation for the PDT extension (treebank-specific functions such as PML_A::GetSentenceString, PML_A::GetEParents, etc.)
Exercise after 5 steps of the tutorial
After you finish the first five steps of the tutorial, you may practice the acquired knowledge, if you want, on the following tasks (you may not need to actually write the scripts; just thinking the tasks through may suffice):
- Print the id of each tree root and the depth of the tree (the length of the longest path from the root to a leaf) to STDOUT.
- Count (and print to STDOUT) the distribution of sizes (number of nodes) of all trees in the given files (i.e., how many times there are trees with 1 node, 2 nodes, 3 nodes, etc.). You may count the final distribution in an outside script (bash) or use the function exit_hook.
- Print all sentences in the a-files that are shorter than 5 tokens (there is a function PML_A::GetSentenceString($root) defined in the TrEd extension for the PDT).
Exercise after 7 steps of the tutorial
- Compute the distribution of morphological tags (using only the first five positions of the tag) for a-nodes with afun Pred, and similarly for a-nodes with afun Sb.
As another example, below is a script for printing out verb-less sentences from the analytical layer:
#!btred -T -e function_to_run()
use strict;
sub function_to_run {
    my $found = 0; # remember if a verb has been found
    my @nodes = GetNodes($root); # get all nodes in the tree
    foreach my $node (@nodes) { # and process them one by one
        my $tag = $node->attr('m/tag'); # get the morphological tag (may be undefined, e.g. on the technical root)
        if (defined $tag && $tag =~ /^V/) { # check if it is a verb
            $found = 1; # if it is, remember that a verb has been found
            last; # and finish the cycle
        }
    }
    if ($found == 0) { # if a verb was not found in the sentence
        my $id = $root->attr('id'); # get the id of the tree root
        my $sent = PML_A::GetSentenceString($root); # get the sentence
        print "$id: $sent\n"; # and print it out
    }
}
Class 01 - March 5th, 2026
First task for today: Transform phrase-structure trees to dependency ones
Sample phrase structure tree (file)
S (
NP ( N ( 'Peter' ) )
* VP ( * V ( 'gave' )
NP ( D ( 'a' )
* N ( 'flower' ) )
PP ( * P ( 'to' )
N ( 'Mary' ) )
)
)
Another sample phrase structure tree (file)
S (
NP ( A ('Young') * N ('men'))
* VP (* V ( 'love' )
COORD (NP ( N('beer'))
* CONJ ('and')
NP ( N( 'girls' ) )
))
)
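One possible way to approach the transformation is sketched below in plain Perl. It is a minimal illustration, not a reference solution, and it rests on assumptions read off the two samples: a constituent is written as LABEL ( ... ), a preterminal contains exactly one quoted word, * marks the head child among its siblings, and an only child counts as the head even without *. All function names (tokenize, parse_node, attach, to_dependencies) are made up for this sketch.

```perl
use strict;
use warnings;

# Split the bracketed text into tokens: quoted words, '(', ')', '*', labels.
sub tokenize {
    my ($text) = @_;
    return $text =~ /('[^']*'|[()*]|[^\s()*']+)/g;
}

# Recursive descent for:  node := ['*'] LABEL '(' ( WORD | node+ ) ')'
sub parse_node {
    my ($tokens, $words) = @_;
    my %node = (is_head => 0, children => []);
    if (@$tokens && $tokens->[0] eq '*') {
        shift @$tokens;
        $node{is_head} = 1;
    }
    my $label = shift @$tokens;
    die "unexpected end of input\n" unless defined $label;
    $node{label} = $label;
    my $open = shift @$tokens;
    die "expected '(' after $label\n" unless defined $open && $open eq '(';
    if (@$tokens && $tokens->[0] =~ /^'(.*)'$/) {    # preterminal with one word
        push @$words, $1;
        $node{index} = scalar @$words;               # 1-based word position
        shift @$tokens;
    } else {
        while (@$tokens && $tokens->[0] ne ')') {
            push @{ $node{children} }, parse_node($tokens, $words);
        }
        die "empty constituent $label\n" unless @{ $node{children} };
    }
    my $close = shift @$tokens;
    die "expected ')' to close $label\n" unless defined $close && $close eq ')';
    return \%node;
}

# Return the word index of the lexical head of $node and store the governor
# of every other word below $node in %$parent_of.
sub attach {
    my ($node, $parent_of) = @_;
    my @kids = @{ $node->{children} };
    return $node->{index} unless @kids;              # preterminal
    my @heads = @kids == 1 ? @kids : grep { $_->{is_head} } @kids;
    die "no unique head under $node->{label}\n" unless @heads == 1;
    my $head = attach($heads[0], $parent_of);
    foreach my $kid (@kids) {
        next if $kid == $heads[0];                   # same reference: the head child
        $parent_of->{ attach($kid, $parent_of) } = $head;
    }
    return $head;
}

# Return a list of [index, word, parent index]; the root word has parent 0.
sub to_dependencies {
    my ($text) = @_;
    my @tokens = tokenize($text);
    my (@words, %parent_of);
    my $tree = parse_node(\@tokens, \@words);
    die "unexpected tokens after the tree\n" if @tokens;
    $parent_of{ attach($tree, \%parent_of) } = 0;
    return map { [ $_, $words[$_ - 1], $parent_of{$_} ] } 1 .. @words;
}

my $sample = <<'END';
S (
  NP ( N ( 'Peter' ) )
  * VP ( * V ( 'gave' )
         NP ( D ( 'a' ) * N ( 'flower' ) )
         PP ( * P ( 'to' ) N ( 'Mary' ) )
  )
)
END
print join("\t", @$_), "\n" for to_dependencies($sample);
```

For the first sample, the sketch prints one word per line with its position and the position of its parent: Peter depends on gave (2), gave is the root (0), a depends on flower (4), flower and to depend on gave (2), and Mary depends on to (5).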
Documentation for the morphological layer of the Prague Dependency Treebank
Demo of a Czech and English morphological analyzer and tagger
Homework 01 (due on March 16th)
Submit the results via svn. Ask me by e-mail if you encounter difficulties or if something is unclear in the instructions.
Class 00 - February 19th, 2026
The first task for today: Install the tree editor TrEd on the computers in the lab or on your personal computers.
We will NOT install from the TrEd home page, as the installation packages are outdated.
Instead, on Linux, follow these instructions:
- Setting up cpan (so that it uses local directories):
  - mkdir -p ~/.cpan/CPAN
    echo "1" >~/.cpan/CPAN/MyConfig.pm
  - perl -MCPAN -e shell
    cpan> o conf init # (choose the local::lib option and at the end, allow setting (or set manually) some variables in .bashrc file)
  - exit cpan (via 'q'), exit and start bash (or "source .bashrc")
- Installing cpanm for easier installation of other Perl modules
  - cpan App::cpanminus
- Installing TrEd
  - git clone https://github.com/ufal/TrEd.git # it clones the TrEd development repository into an (automatically created) TrEd directory
  - cd TrEd/tred
  - ./tred
  - install missing libraries as reported (e.g., cpanm UNIVERSAL::DOES) and repeat; use force if needed: cpanm --force UNIVERSAL::DOES
The second task for today: Test the installation, set up TrEd (extensions, fonts)
After we have installed TrEd, let us try it - download the following data:
lindat.cz - go to "Repository" and search for Prague Dependency Treebank 2.0 - sample data
After you unzip the data, try to open one of the .t.gz files. You should get an error message complaining about missing schemas. It is because you also need to install a TrEd extension for the given type of data:
In TrEd, go to Setup -> Manage Extensions -> Get New Extensions and search for pdt20. Check it and press "Install Selected". Now close and start TrEd again. It should be able to open the data now.
You can customize TrEd in the configuration file .tredrc. To customize fonts, see the font section in the TrEd documentation.
TrEd can handle all types of treebanks - try, e.g., an example from the Penn Treebank:
An example file from the Penn Treebank transformed to the PML - to test the TrEd installation; an extension for the Penn Treebank files (ptb) needs to be installed first...
More English data to test the TrEd installation: section 000 of PEDT
- Download and unzip the data. It is section 000 of the Penn Treebank transformed to the format of the Prague treebank family; in this particular case, one document is represented by three files corresponding to surface syntax - analytical layer (a-files), deep syntax - tectogrammatical layer (t-files), and original phrase structure layer (p-files). Then try to open one of the t-files in TrEd (you will need to install the "pedt" extension), you can also open an a-file and a p-file.
A lexicon type of data: CzeDLex (Lexicon of Czech Discourse Connectives)
- Download the czedlex1.0.zip file from lindat.cz and unzip it.
- In TrEd, install the "czedlex" extension
- In TrEd, open the file czedlex1.0/PML/czedlex1.0.pml
The third task for today (if there is time): Transform phrase-structure trees to dependency ones
Sample phrase structure tree (file)
S (
NP ( N ( 'Peter' ) )
* VP ( * V ( 'gave' )
NP ( D ( 'a' )
* N ( 'flower' ) )
PP ( * P ( 'to' )
N ( 'Mary' ) )
)
)
Another sample phrase structure tree (file)
S (
NP ( A ('Young') * N ('men'))
* VP (* V ( 'love' )
COORD (NP ( N('beer'))
* CONJ ('and')
NP ( N( 'girls' ) )
))
)


