Lab sessions

Every other Thursday (more or less), 10:40 a.m. in SW1

Jiří Mírovský
mirovsky at ufal.mff.cuni.cz
room 422

Daniel Zeman
zeman at ufal.mff.cuni.cz
room 409

Outline

  • various formats for phrase-structure and dependency trees, transformations (Perl or another programming language)
  • mining information from the word layer and morphological layer of the Prague Dependency Treebank or the Prague English Dependency Treebank (bash, Perl or another programming language)
  • mining information from all layers of the PDT/PEDT (btred, Perl)
  • mining information from UD data (btred, Perl)
  • searching in treebanks with PML-Tree Query (PML-TQ)

Homework assignments

Results of the homework assignments (click here)


Class 02 - March 19th, 2026

The goal of today's class and the homework is to learn to use btred, a scripting tool for working with TrEd data.

Homework 02: After you attend the class or finish the btred tutorial (see below), carefully read the instructions for the homework (due on March 30th).

Individual study (if you miss the class)

Our task is to learn to use btred, a scripting tool for working with TrEd data.

Let us start with a few clarifying statements, which you may already know:

  • The tree editor TrEd works primarily with data encoded in the Prague Markup Language (PML), which is a general XML format for encoding linguistically annotated treebanks. Specifics of individual treebanks (annotation layers, types of nodes, types of relations between the nodes, attributes defined at nodes, etc.) are defined in TrEd extensions.
  • btred is a command-line scripting interface for PML data. It can use both general functions defined for PML and treebank-specific functions defined in TrEd extensions.
  • TrEd and btred are written in Perl, so btred scripts must also be written in Perl. However, only a basic knowledge of Perl is needed.

Your task

Please follow the btred tutorial, which will take you through the first steps of working with btred. Our plan is to cover steps 1-5 and 7 of the tutorial in this class and step 6 in the next class. For the homework, knowledge from steps 1-5 and 7 should suffice.

For the tutorial, you can use any PDT-like data, e.g. these PDT data containing annotation of texts up to the analytical layer (a-layer). For examples that use the tectogrammatical layer (t-layer), use these data from the PDT, which also contain the t-files.

As Perl may be new to you and btred most certainly is, let me give you a few hints to make your work with Perl and btred easier. You will see in the tutorial that btred scripts may start with three different lines:

  • #!btred -e function_to_run() # the function function_to_run() will be run once for each given file. You can get an array of all trees (their roots) in the given file by my @roots = GetTrees().
  • #!btred -T -e function_to_run() # the function function_to_run() will be run once for each tree in the given files. Variable $root will contain the root of the current tree. You can get an array of all nodes (incl. the root) in the given tree by my @nodes = GetNodes($root), or (excl. the root) by my @nodes = $root->descendants.
  • #!btred -TN -e function_to_run() # the function function_to_run() will be run on all nodes in all trees in the given files. Variable $root will contain the root of the current tree and variable $this will contain the current node.

A simple script for counting all nodes in each given file and printing the number next to each file name might look like this:

#!btred -e count_nodes()

sub count_nodes {
    my $total_count = 0;
    my @roots = GetTrees();  # get an array of roots of all trees in the file
    foreach my $root (@roots) {
        my @nodes = GetNodes($root);  # get an array of all nodes in the tree (incl. the root)
        my $number_of_nodes = scalar(@nodes);  # get the length of the array
        $total_count += $number_of_nodes;
    }
    my $filename = FileName();
    print "$filename: $total_count\n";
}

If the script is named count.btred, it can be run on all gzipped a-files in the current directory from a terminal with the following command:

btred -I count.btred *.a.gz

I suggest that after the first line in any btred (or Perl in general) script, you add the following Perl instructions:

use strict;  # it informs e.g. about non-declared variables (often typos)
use warnings;  # it warns e.g. if an uninitialized variable is used (e.g. in addition or concatenation)
use utf8;  # it allows utf8 in the script source code
binmode STDIN, ':utf8';  # setting utf8 for STDIN
binmode STDOUT, ':utf8';  # ditto for STDOUT
binmode STDERR, ':utf8';  # ditto for STDERR

Manuals and documentation

For writing btred scripts generally and for a particular treebank (say, a PDT-like treebank), there are three main sources of information:

Exercise after 5 steps of the tutorial

After you finish the first five steps of the tutorial, you may, if you want, practice the acquired knowledge on the following tasks (you do not necessarily need to write the scripts; just thinking the tasks through may suffice):

  • Print the id of each tree root and the depth of the tree (the length of the longest path from the root to a leaf) to STDOUT.
  • Count (and print to STDOUT) the distribution of sizes (number of nodes) of all trees in the given files (i.e., how many times there are trees with 1 node, 2 nodes, 3 nodes, etc.). You may count the final distribution in an outside script (bash) or use the function exit_hook.
  • Print all sentences in the a-files that are shorter than 5 tokens (there is a function PML_A::GetSentenceString($root) defined in the TrEd extension for the PDT).
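
For the first task above, the core of the computation is a recursive walk over the children of each node. The following is only a minimal sketch in plain Perl that runs outside btred. The package ToyNode is a hypothetical stand-in for tree nodes; the only thing it assumes is a children() method returning the list of child nodes. Depth is counted here in edges, so a tree consisting of the root alone has depth 0.

```perl
use strict;
use warnings;

# ToyNode is a hypothetical stand-in for a tree node; it only offers children().
package ToyNode {
    sub new {
        my ($class, @children) = @_;
        return bless { children => [@children] }, $class;
    }
    sub children { return @{ $_[0]{children} } }
}

# Length (in edges) of the longest path from $node down to a leaf.
sub depth {
    my ($node) = @_;
    my $max = -1;  # so that a leaf ends up with depth 0
    foreach my $child ($node->children) {
        my $d = depth($child);
        $max = $d if $d > $max;
    }
    return $max + 1;
}

# A small tree: the root has two children, one of which has a child of its own.
my $tree = ToyNode->new(
    ToyNode->new( ToyNode->new() ),
    ToyNode->new(),
);
print "depth: ", depth($tree), "\n";
```

In an actual btred script run with -T, the same function could be called on $root, provided the nodes offer a children() method as assumed here.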

Exercise after 7 steps of the tutorial

  • Compute the distribution of morphological tags (use only the first five positions of the tag) for a-nodes with afun 'Pred', and similarly for a-nodes with afun 'Sb'.
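
The counting part of this task does not depend on btred at all, so it can be tried out in plain Perl first. The sketch below truncates each tag to its first five positions and counts the resulting prefixes in a hash. The function name tag_prefix_distribution and the sample tags are made up for illustration; in a btred script, the tags would instead be collected from the nodes with the required afun.

```perl
use strict;
use warnings;

# Count how often each tag prefix (the first $length positions) occurs.
sub tag_prefix_distribution {
    my ($length, @tags) = @_;
    my %count;
    $count{ substr($_, 0, $length) }++ foreach @tags;
    return %count;
}

# Made-up sample input standing in for tags collected from a treebank.
my @pred_tags = qw(VB-S---3P-AA--- VB-P---3P-AA--- VpYS---XR-AA--- VB-S---3P-NA---);

my %dist = tag_prefix_distribution(5, @pred_tags);

# Print the prefixes from the most frequent to the least frequent one.
foreach my $prefix (sort { $dist{$b} <=> $dist{$a} || $a cmp $b } keys %dist) {
    print "$prefix\t$dist{$prefix}\n";
}
```

Keeping the counts in a single hash makes it easy to accumulate them across many trees and print the result only once at the end.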

As another example, below is a script for printing out verb-less sentences from the analytical layer:

#!btred -T -e function_to_run()
use strict;

sub function_to_run {
  my $found = 0; # remember if a verb has been found
  my @nodes = GetNodes($root); # get all nodes in the tree (incl. the root)
  foreach my $node (@nodes) { # and process them one by one
    my $tag = $node->attr('m/tag'); # get the morphological tag
    next unless defined $tag; # skip nodes without a tag (e.g. the technical root)
    if ($tag =~ /^V/) { # check if it is a verb
      $found = 1; # if it is, remember that a verb has been found
      last; # and finish the cycle
    }
  }

  if ($found == 0) { # if a verb was not found in the sentence
    my $id = $root->attr('id'); # get the id of the tree root
    my $sent = PML_A::GetSentenceString($root); # get the sentence
    print "$id: $sent\n"; # and print it out
  }
}

 



Class 01 - March 5th, 2026

(slides from the class)

First task for today: Transform phrase-structure trees to dependency ones

Sample phrase structure tree (file)

S (
  NP ( N ( 'Peter' ) )
  * VP ( * V ( 'gave' )
         NP ( D   ( 'a' )
              * N ( 'flower' ) )
         PP ( * P ( 'to' )
              N   ( 'Mary' ) )
       )
)

Another sample phrase structure tree (file)

S (
  NP ( A ('Young') * N ('men'))
  * VP (* V ( 'love' )
        COORD (NP ( N('beer'))
               * CONJ ('and')
               NP ( N( 'girls' ) )
        ))
)
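
One possible way to approach the transformation is a small recursive parser: each constituent passes the position of its head word up to its parent, and the head words of all non-head children are attached to it. The following is a sketch only, not a reference solution. The function names parse_constituent and ps_to_dependencies are made up, and the rule that a constituent without a starred child takes its first child as the head is an assumption based on the two samples above.

```perl
use strict;
use warnings;

# Parse one constituent "LABEL ( child child ... )" from the token list and
# return the 1-based position of its head word. Words are appended to @$words;
# $parent->[$i] receives the position of the word that governs word $i.
sub parse_constituent {
    my ($tokens, $words, $parent) = @_;
    die "unexpected end of input\n" unless @$tokens >= 2;
    my ($label, $open) = splice @$tokens, 0, 2;
    die "expected '(' after $label\n" unless $open eq '(';

    my (@heads, $marked);
    while (@$tokens && $tokens->[0] ne ')') {
        my $starred = $tokens->[0] eq '*';
        shift @$tokens if $starred;
        die "unexpected end of input\n" unless @$tokens;
        my $head;
        if ($tokens->[0] =~ /^'(.*)'$/) {   # a quoted word form
            push @$words, $1;
            shift @$tokens;
            $head = scalar @$words;
        }
        else {                              # an embedded constituent
            $head = parse_constituent($tokens, $words, $parent);
        }
        push @heads, $head;
        $marked = $head if $starred;
    }
    die "missing ')' in $label\n" unless @$tokens;
    shift @$tokens;                         # drop the ')'
    die "empty constituent $label\n" unless @heads;

    $marked //= $heads[0];  # assumption: no star means the first child is the head
    $parent->[$_] = $marked foreach grep { $_ != $marked } @heads;
    return $marked;
}

# Return a list of [position, word, parent position]; the root has parent 0.
sub ps_to_dependencies {
    my ($text) = @_;
    my @tokens = $text =~ /'[^']*'|[()*]|[^\s()*']+/g;
    my (@words, @parent);
    my $root = parse_constituent(\@tokens, \@words, \@parent);
    $parent[$root] = 0;
    return map { [ $_, $words[$_ - 1], $parent[$_] ] } 1 .. scalar @words;
}

my $tree = "S ( NP ( N ( 'Peter' ) ) * VP ( * V ( 'gave' ) NP ( D ( 'a' ) * N ( 'flower' ) ) PP ( * P ( 'to' ) N ( 'Mary' ) ) ) )";
foreach my $row (ps_to_dependencies($tree)) {
    print join("\t", @$row), "\n";
}
```

For the first sample tree, this makes 'gave' the root, attaches 'Peter', 'flower' and 'to' to 'gave', 'a' to 'flower', and 'Mary' to 'to'. The sketch ignores the constituent labels; a fuller solution might keep them, e.g. to derive dependency labels.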

Documentation for the morphological layer of the Prague Dependency Treebank
Demo of a Czech and English morphological analyzer and tagger

Homework 01 (due on March 16th)
Submit the results via svn. Ask me by e-mail if you encounter difficulties or if something is unclear in the instructions.

 



Class 00 - February 19th, 2026

(slides from the class)

The first task for today: Installing the tree editor TrEd on the computers in the lab or on your personal computer.

We will NOT install from the TrEd home page, as the installation packages are outdated.

Instead, on Linux, follow these instructions:

  1. Setting up cpan (so that it uses local directories):
    • mkdir -p ~/.cpan/CPAN
      echo "1" >~/.cpan/CPAN/MyConfig.pm
    • perl -MCPAN -e shell
      cpan> o conf init  # (choose the local::lib option and at the end, allow setting (or set manually) some variables in .bashrc file)
    • exit cpan (via 'q'), then exit and restart bash (or run "source ~/.bashrc")
  2. Installing cpanm for easier installation of other Perl modules
    • cpan App::cpanminus
  3. Installing TrEd
    • git clone https://github.com/ufal/TrEd.git  # it clones the TrEd development repository into an (automatically created) TrEd directory
    • cd TrEd/tred
    • ./tred
      • install missing libraries as reported (e.g., cpanm UNIVERSAL::DOES) and repeat; use force if needed: cpanm --force UNIVERSAL::DOES

 

The second task for today: Test the installation, set up TrEd (extensions, fonts)

After we have installed TrEd, let us try it - download the following data:
lindat.cz - go to "Repository" and search for Prague Dependency Treebank 2.0 - sample data

After you unzip the data, try to open one of the .t.gz files. You should get an error message complaining about missing schemas. It is because you also need to install a TrEd extension for the given type of data:

In TrEd, go to Setup -> Manage Extensions -> Get New Extensions and search for pdt20. Check it and press "Install Selected". Now close and start TrEd again. It should be able to open the data now.

You can customize TrEd in the configuration file .tredrc; for customizing fonts, see the font section in the TrEd documentation.

TrEd can handle all types of treebanks - try, e.g., an example from the Penn Treebank:

An example file from the Penn Treebank transformed to the PML - to test the TrEd installation; an extension for the Penn Treebank files (ptb) needs to be installed first...

More English data to test the TrEd installation: section 000 of PEDT

  • Download and unzip the data. It is section 000 of the Penn Treebank transformed to the format of the Prague treebank family; in this particular case, one document is represented by three files corresponding to surface syntax - analytical layer (a-files), deep syntax - tectogrammatical layer (t-files), and original phrase structure layer (p-files). Then try to open one of the t-files in TrEd (you will need to install the "pedt" extension). You can also open an a-file and a p-file.

A lexicon type of data: CzeDLex (Lexicon of Czech Discourse Connectives)

  • Download the czedlex1.0.zip file from lindat.cz and unzip it.
  • In TrEd, install the "czedlex" extension
  • In TrEd, open the file czedlex1.0/PML/czedlex1.0.pml

 

The third task for today (if there is time): Transform phrase-structure trees to dependency ones

Sample phrase structure tree (file)

S (
  NP ( N ( 'Peter' ) )
  * VP ( * V ( 'gave' )
         NP ( D   ( 'a' )
              * N ( 'flower' ) )
         PP ( * P ( 'to' )
              N   ( 'Mary' ) )
       )
)

Another sample phrase structure tree (file)

S (
  NP ( A ('Young') * N ('men'))
  * VP (* V ( 'love' )
        COORD (NP ( N('beer'))
               * CONJ ('and')
               NP ( N( 'girls' ) )
        ))
)