Lab sessions

on-line or individual, Wednesday 10:40 a.m.

Jiří Mírovský
mirovsky at ufal.mff.cuni.cz
room 422

Daniel Zeman
zeman at ufal.mff.cuni.cz
room 409

(Archive of the practical classes from 2020)

Outline

  • various formats for phrase-structure and dependency trees, transformations (Perl or another programming language)
  • mining information from the word layer and morphological layer of the Prague Dependency Treebank or the Prague English Dependency Treebank (bash, Perl or another programming language)
  • mining information from all layers of the PDT/PEDT (btred, Perl)
  • mining information from UD data (Udapi, Python)
  • searching in treebanks with PML-Tree Query (PML-TQ)

Homeworks

Results of the homeworks (click here)

  • All homeworks must be committed into the https://svn.ms.mff.cuni.cz/svn/undergrads/students svn repository; do not send your homeworks by e-mail. Ask for an svn account if you do not have one in this repository yet.
  • Submit your work into your personal directory in the svn repository. Each homework has an explicit deadline, usually the Monday after next before midnight (so you have about 12 days to finish the homework).
  • If the deadline is not met, ask for additional homework. All homeworks must be submitted in order to get the credit (zápočet). You can solve an additional homework even if you submitted the normal homework in time, i.e., you can improve your average by solving some of the additional homeworks. (You have to e-mail us to confirm your additional homework is ready to be rated. All additional homeworks must be submitted at least one week before the credit.)

Class 07 - May 26th, 2021

(Note: This is the last practical class.)

Zoom meeting:

https://matfyz.zoom.us/j/97437108463?pwd=Tnc3T0lzcG5MSTNnWk1jR24wWHg1dz09

Homework 07 (due on May 31st)


Class 06 - May 19th, 2021

(Note: The subsequent (and the last) practical class will take place after ONE week, i.e. on May 26th, 2021)

Zoom meeting:

https://matfyz.zoom.us/j/97437108463?pwd=Tnc3T0lzcG5MSTNnWk1jR24wWHg1dz09

Homework 06 (due on May 24th)


Class 05 - May 5th, 2021

(Note: The subsequent practical class will take place after two weeks, i.e. on May 19th, 2021)

Zoom meeting:

https://matfyz.zoom.us/j/97437108463?pwd=Tnc3T0lzcG5MSTNnWk1jR24wWHg1dz09

Homework 05 (due on May 17th)

If you do not attend the practical class, you can continue following instructions from Class 03 below.


Class 04 - April 21st, 2021

(Note: The subsequent practical class will take place after two weeks, i.e. on May 5th, 2021; it will be led by Jiří Mírovský.)

Zoom meeting:

https://matfyz.zoom.us/j/97437108463?pwd=Tnc3T0lzcG5MSTNnWk1jR24wWHg1dz09

Homework 04: see the last section in the tutorial below (due on May 4th)

If you do not attend the practical class, you can follow the instructions in this document.


Class 03 - April 7th, 2021

(Note: The subsequent practical class will take place after two weeks, i.e. on April 21st, 2021; it will be led by Dan Zeman.)

Zoom meeting:

https://matfyz.zoom.us/j/97437108463?pwd=Tnc3T0lzcG5MSTNnWk1jR24wWHg1dz09

Homework 03 (due on April 19th)

At the end of the class, we did not succeed in debugging a script for printing out verb-less sentences from the analytical layer. See the solution below, just at the end of instructions for today's class.

If you do not attend the practical class, you can follow these instructions:

Our task is to learn to use btred, a scripting tool for working with TrEd data.

Let me start with a few clarifying statements, which you may already know:

  • Tree editor TrEd works primarily with data encoded in the Prague Markup Language (PML), which is a general XML format for encoding linguistically annotated treebanks. Specifics of individual treebanks (annotation layers, types of nodes, types of relations between the nodes, attributes defined at nodes, etc.) are defined in TrEd extensions.
  • btred is a command line scripting interface for the PML data. It can use both general functions defined in the PML and treebank-specific functions defined in TrEd extensions.
  • TrEd and btred are written in Perl, so scripts for btred also need to be written in Perl. However, you only need basic knowledge of Perl.

Your task

Please follow the btred tutorial, which will take you through the first steps of working with btred. Our plan is to cover steps 1-7 of the tutorial. For the tutorial, you can use any PDT-like data, e.g. the PDT data from Class 02. For examples that use the tectogrammatical layer (t-layer), use this data from the PDT, which also contains the t-files.

As Perl may be new to you and btred most certainly is, let me give you a few hints to make your work with Perl and btred easier. You will see in the tutorial that btred scripts may start with three different lines:

  • #!btred -e function_to_run() # the function function_to_run() will be run once for each given file. You can get an array of all trees (their roots) in the given file by my @roots = GetTrees().
  • #!btred -T -e function_to_run() # the function function_to_run() will be run on all trees in the given files. Variable $root will contain the root of the current tree. You can get an array of all nodes (incl. the root) in the given tree by my @nodes = GetNodes($root), or (excl. the root) by my @nodes = $root->descendants
  • #!btred -TN -e function_to_run() # the function function_to_run() will be run on all nodes in all trees in the given files. Variable $root will contain the root of the current tree and variable $this will contain the current node.

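A simple script for counting nodes in each tree of the given files and printing the number next to the name of the file the tree comes from might look like this (because of the -T switch, the function is called once per tree, so one line is printed per tree):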

#!btred -T -e count_nodes()

sub count_nodes {
    my @nodes = GetNodes($root);  # get all nodes in the tree
    my $number_of_nodes = scalar(@nodes);  # get the length of the array
    my $filename = FileName();
    print "$filename: $number_of_nodes\n";
}

If the script is named count.btred, it can be run on all gzipped a-files in the current directory from a terminal with the following command:

btred -I count.btred *.a.gz
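With the -T switch the function is called once per tree, so the command above prints one line per tree. If you want a single total per file, the lines can be summed outside btred. The following is only a minimal sketch in Python (not part of the tutorial): the function name sum_per_file and the sample lines are made up for illustration, and it assumes that the output consists solely of lines of the form "filename: number".

```python
from collections import defaultdict


def sum_per_file(lines):
    """Sum the node counts of all trees that belong to the same file."""
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        # split at the LAST ": " so that colons inside a path do no harm
        filename, count = line.rsplit(": ", 1)
        totals[filename] += int(count)
    return dict(totals)


# Hypothetical output of the btred script: one line per tree
sample_output = ["a.a.gz: 12", "a.a.gz: 7", "b.a.gz: 30"]
totals = sum_per_file(sample_output)
for filename, total in totals.items():
    print(f"{filename}: {total}")
```

In practice you would feed the function the lines read from standard input (e.g. sys.stdin) instead of the sample list.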

I suggest that after the first line in any btred (or Perl in general) script, you add the following Perl instructions:

use strict;  # it informs e.g. about non-declared variables (often typos)
use warnings;  # it warns e.g. if an uninitialized variable is used (e.g. in addition or concatenation)
use utf8;  # it allows utf8 in the script source code
binmode STDIN, ':utf8';  # setting utf8 for STDIN
binmode STDOUT, ':utf8';  # ditto for STDOUT
binmode STDERR, ':utf8';  # ditto for STDERR

Manuals and documentation

For writing btred scripts generally and for a particular treebank (say, a PDT-like treebank), there are three main sources of information:

Exercise after 4 steps of the tutorial

After you finish the first four steps of the tutorial, you may practice the acquired knowledge, if you want, on the following tasks (you do not necessarily need to write the scripts; just thinking the tasks through may suffice):

  • Print the id of each tree root and the depth of the tree (the length of the longest path from the root to a leaf) to STDOUT.
  • Count (and print to STDOUT) the distribution of sizes (number of nodes) of all trees in the given files (i.e., how many times there are trees with 1 node, 2 nodes, 3 nodes, etc.). You may count the final distribution in an outside script (bash) or use the function exit_hook.
  • Print all sentences in the a-files that are shorter than 5 tokens (there is a function PML_A::GetSentenceString($root) defined in the TrEd extension for the PDT).
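If you want to think the first two tasks through independently of btred, the underlying computations can be sketched on a toy representation. The sketch below is in Python rather than Perl and makes its own assumptions: each tree is a list in which position i holds the index of the parent of node i (None for the root), and the names tree_depth and size_distribution are hypothetical, not btred functions.

```python
from collections import Counter


def tree_depth(parents):
    """Length (in edges) of the longest path from the root to a leaf."""
    def depth_of(i):
        # climb from node i to the root, counting the edges
        depth = 0
        while parents[i] is not None:
            i = parents[i]
            depth += 1
        return depth
    return max(depth_of(i) for i in range(len(parents)))


def size_distribution(trees):
    """Map tree size (number of nodes) to the number of trees of that size."""
    return Counter(len(parents) for parents in trees)


# Hypothetical toy trees, not real treebank data
trees = [
    [None, 0, 1, 1],  # root -> 1 -> {2, 3}
    [None, 0],        # root -> 1
    [None, 0, 0, 2],  # root -> {1, 2}, 2 -> 3
]
depths = [tree_depth(t) for t in trees]
distribution = size_distribution(trees)
for size in sorted(distribution):
    print(f"{size}\t{distribution[size]}")
```

In a btred script the same counter would be filled tree by tree and printed at the very end, which is what the exit_hook function mentioned above is meant for.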

Exercise after 7 steps of the tutorial

  • homework 3 (see above)
  • Count the distribution of morphological tags (using only the first five positions of the tag) for a-nodes with afun 'Pred', and similarly for a-nodes with afun 'Sb'.
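The counting part of the second task is a simple frequency table over tag prefixes. As a minimal sketch in Python (not a btred script): the nodes are assumed here to be already extracted as (afun, tag) pairs, the function name tag_prefix_distribution is hypothetical, and the sample pairs are invented for illustration.

```python
from collections import Counter


def tag_prefix_distribution(nodes, afun, prefix_len=5):
    """Count the first prefix_len tag positions of nodes with the given afun."""
    return Counter(
        tag[:prefix_len] for node_afun, tag in nodes if node_afun == afun
    )


# Invented (afun, tag) pairs; the tags are 15-position positional tags
nodes = [
    ("Pred", "VB-S---3P-AA---"),
    ("Sb",   "NNMS1-----A----"),
    ("Pred", "VB-P---3P-AA---"),
    ("Pred", "VB-S---3P-NA---"),
    ("Sb",   "NNFS1-----A----"),
]
pred = tag_prefix_distribution(nodes, "Pred")
sb = tag_prefix_distribution(nodes, "Sb")
for prefix, count in pred.most_common():
    print(f"{prefix}\t{count}")
```

Truncating the tag before counting is what keeps the table small; counting full tags would split the same part of speech, number and case over many more rows.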

At the end of the class, we did not succeed in debugging a script for printing out verb-less sentences from the analytical layer. The problem was in several variables not marked with keyword "my" and a misplaced directive "last". Here is a working script (works with "use strict"):

#!btred -T -e function_to_run()
use strict;

sub function_to_run {
  my $found = 0; # remember if a verb has been found
  my @nodes = GetNodes($root); # get all nodes in the tree
  foreach my $node (@nodes) { # and process them one by one
    my $tag = $node->attr('m/tag') // ''; # get the morphological tag (empty if the node has none, e.g. the technical root)
    if ($tag =~ /^V/) { # check if it is a verb
      $found = 1; # if it is, remember that a verb has been found
      last; # and finish the cycle
    }
  }

  if ($found == 0) { # if a verb was not found in the sentence
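    my $sent = PML_A::GetSentenceString($root); # get the sentence
    my $id = $root->attr('id'); # get the id of the tree root (assumes the root has an 'id' attribute, as in PDT a-files)
    print "$id: $sent\n"; # and print it out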
  }
}

 


Class 02 - March 24th, 2021

(Note: The subsequent practical class will take place after two weeks, i.e. on April 7th, 2021)

Zoom meeting:

https://matfyz.zoom.us/j/97437108463?pwd=Tnc3T0lzcG5MSTNnWk1jR24wWHg1dz09

Slides for the practical class 02 (click here)

Section 000 of the WSJ part of the Penn Treebank in the original merged file format.
PDT data for the class (it is a part of PDT w-, m- and a-files)
PEDT data for the class (it is a part of PEDT, namely a-files from sections 00* (with w- and m- info merged in))
Documentation for the m-layer
PDT 3.5
Demo of a Czech and English morphological analyzer and tagger

Homework 02 (due on April 5th)


Class 01 - March 10th, 2021

(Note: The subsequent practical class will take place after two weeks, i.e. on March 24th, 2021)

Zoom meeting:

https://matfyz.zoom.us/j/97437108463?pwd=Tnc3T0lzcG5MSTNnWk1jR24wWHg1dz09

 

English data to test the TrEd installation: section 000 of PEDT

  • Download and unzip the data. It is section 000 of the Penn Treebank transformed to the format of the Prague treebank family; in this particular case, one document is represented by three files corresponding to surface syntax (analytical layer, a-files), deep syntax (tectogrammatical layer, t-files), and the original phrase structure layer (p-files). Then try to open one of the t-files in TrEd (you will need to install the pedt extension); you can also open an a-file and a p-file.

Slides for the practical class 01 (click here)

Sample phrase structure tree (file)

S (
  NP ( N ( 'Peter' ) )
  * VP ( * V ( 'gave' )
         NP ( D   ( 'a' )
              * N ( 'flower' ) )
         PP ( * P ( 'to' )
              N   ( 'Mary' ) )
       )
)

Another sample phrase structure tree (file)

S (
  NP ( A ('Young') * N ('men'))
  * VP (* V ( 'love' )
        COORD (NP ( N('beer'))
               * CONJ ('and')
               NP ( N( 'girls' ) )
        ))
)
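The bracketed format of the two sample trees can be read with a small recursive-descent parser. The following is a minimal sketch in Python, not a reference solution: it assumes that a node is written as LABEL ( ... ) with either a quoted word or a sequence of child nodes inside, and that the asterisk marks the head child of its parent (the page itself does not spell this out). The names parse, words and head_word are made up for this example.

```python
import re

# tokens: quoted words, brackets, the '*' marker, and labels
TOKEN = re.compile(r"'[^']*'|[()*]|[^\s()*']+")


def parse(text):
    """Parse one bracketed tree into nested dicts."""
    tokens = TOKEN.findall(text)
    pos = 0

    def expect(symbol):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != symbol:
            raise ValueError(f"expected {symbol!r} at token {pos}")
        pos += 1

    def node():
        nonlocal pos
        is_head = tokens[pos] == "*"
        if is_head:
            pos += 1
        label = tokens[pos]
        pos += 1
        expect("(")
        if tokens[pos].startswith("'"):  # terminal: a quoted word
            result = {"label": label, "head": is_head, "word": tokens[pos][1:-1]}
            pos += 1
        else:                            # nonterminal: one or more children
            children = []
            while tokens[pos] != ")":
                children.append(node())
            result = {"label": label, "head": is_head, "children": children}
        expect(")")
        return result

    tree = node()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the end of the tree")
    return tree


def words(node):
    """Return the words of the tree in surface order."""
    if "word" in node:
        return [node["word"]]
    return [w for child in node["children"] for w in words(child)]


def head_word(node):
    """Follow head-marked children down to a word (first child if none is marked)."""
    while "children" in node:
        marked = [c for c in node["children"] if c["head"]]
        node = marked[0] if marked else node["children"][0]
    return node["word"]


sample = """S ( NP ( N ( 'Peter' ) ) * VP ( * V ( 'gave' )
    NP ( D ( 'a' ) * N ( 'flower' ) ) PP ( * P ( 'to' ) N ( 'Mary' ) ) ) )"""
tree = parse(sample)
print(words(tree), head_word(tree))
```

Falling back to the first child when no child is marked covers nodes with a single unmarked child such as NP ( N ( 'Peter' ) ); whether that is the intended convention is an assumption of this sketch.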

Homework 01 (due on March 22nd)

 


Class 00 - March 3rd, 2021

Zoom meeting:

https://matfyz.zoom.us/j/97437108463?pwd=Tnc3T0lzcG5MSTNnWk1jR24wWHg1dz09

Meeting ID: 974 3710 8463

Passcode: 211886

The task for today: install the tree editor TrEd from the TrEd home page, either on the computers in the lab or on your personal computer.

On MS Windows, use the installation package containing also the Strawberry Perl distribution.

On Linux, follow these instructions:

  1. Setting up cpan (so that it uses local directories; http://www.perlmonks.org/?node_id=630026):
    • mkdir -p ~/.cpan/CPAN
      touch ~/.cpan/CPAN/MyConfig.pm # (or echo "1" >~/.cpan/CPAN/MyConfig.pm, or cp /root/.cpan/CPAN/Config.pm ~/.cpan/CPAN/MyConfig.pm if the subsequent command does not work)
    • perl -MCPAN -e shell
      cpan> o conf init # (use local::lib and at the end, allow setting (or set manually) some variables in .bashrc file)
    • exit cpan, then exit and restart bash
  2. Installing cpanm for easier installation of other Perl modules
    • cpan App::cpanminus
  3. Installing TrEd

After we have installed TrEd, let us try it - download the following data:
lindat.cz - search for Prague Dependency Treebank 2.0 - sample data

After you unzip the data, try to open one of the .t.gz files. You should get an error message complaining about missing schemas. This is because you also need to install a TrEd extension for the given type of data:

In TrEd, go to Setup -> Manage Extensions -> Get New Extensions and search for pdt20. Check it and press "Install Selected". Now close and start TrEd again. It should be able to open the data now.

You can customize TrEd in the configuration file .tredrc, e.g. the fonts: see the font section in the TrEd documentation

TrEd can handle all types of treebanks - try, e.g., an example from the Penn Treebank:

An example file from the Penn Treebank transformed to the PML - to test the TrEd installation; an extension for the Penn Treebank files (ptb) needs to be installed first...