Lab sessions

SU1, Friday 12:20 p.m.

Jiří Mírovský
mirovsky at ufal.mff.cuni.cz
room 422

Daniel Zeman
zeman at ufal.mff.cuni.cz
room 409

Outline

  • various formats for phrase-structure and dependency trees, transformations (Perl or another programming language)
  • mining information from the word layer and morphological layer of the Prague Dependency Treebank or the Prague English Dependency Treebank (bash, Perl or another programming language)
  • mining information from all layers of the PDT/PEDT (btred, Perl)
  • mining information from UD data (Udapi, Python)
  • searching in treebanks with PML-Tree Query (PML-TQ)

Homework

Results of the homework (click here)


Class 09 - May 15, 2020

The last topic of this course is searching in treebanks, again as individual study based on the following instructions. This is the last practical class this year, and it comes with the last homework. Feel free to contact me in the coming weeks with any problems or questions.

We will search treebanks using the Prague Markup Language Tree Query (PML-TQ), which is a powerful search engine for any treebank encoded in the Prague Markup Language (PML). Please notice the power of the framework - once a treebank is encoded in the PML, you can open/browse/edit it using TrEd, process it automatically using btred, and search in it using the PML-TQ.

The PML-TQ is a client-server system. You can choose between two clients; the homework can be done in either of them:

  1. Web interface - You do not need to install anything and you get instant access to the treebanks, most of them without an account. The examples in the tutorial link directly to the web interface, so you can try them easily. BUT: you can create queries only in textual form (not graphically), and the results are displayed in a single fixed form. The web interface also does not support searching in local files (which you do not need now).
  2. TrEd interface - You first need to install the PML-TQ extension. The installation may go smoothly, or it may fail and require some manual steps; connecting to the server also requires some authentication steps. BUT: you can create queries in a graphical environment, the queries are represented graphically, and the results are displayed with the full range of options that TrEd and the TrEd extension for the given treebank offer. TrEd also supports searching in local files (which you do not need now).

If you have time and plan to work with treebanks in the future, I recommend trying option 2 and, if the installation causes too much trouble, falling back to option 1. However, if you want to get through the topic of searching in treebanks as quickly as possible, you can choose option 1 right away. Please refer to the web page listing the two clients for connection and installation instructions. You will also receive some information (login name etc.) by e-mail.

There are two tutorials for the PML-TQ, one focused on the web interface and one on the TrEd interface. Please follow the one that matches the client you prefer:

Later, you may want to consult other documentation sources; see the PML-TQ documentation page.

Homework 05 (due on May 27th) - if you finish and wish to get the marks sooner, write me an e-mail.


Class 08 - April 24, 2020

Online interactive Zoom session, continuing work with Universal Dependencies in Udapi. The PDF from last week has been expanded with new sections.

Homework 04 (due by May 13, 2020) is specified at the end of the PDF tutorial mentioned above.



Class 07 - April 17, 2020

Individual work with Universal Dependencies in TrEd and with Udapi. For instructions, see this PDF. The instructions include several exercises. While you are encouraged to do them all, they are not mandatory and they do not constitute official homework.



Class 06 - April 3rd, 2020

Continuing individual study based on the instructions given below in class 05.

Further notes about using Perl in btred (useful for homework 03)

  • Be careful when using the Perl function sort. If you, for example, want to sort analytical nodes according to their tree order (attribute ord), be sure to use the operator <=> (numerical comparison), not cmp (alphabetical comparison), i.e.:
    my @sorted = sort {$a->attr('ord') <=> $b->attr('ord')} @nodes;
    
  • Take advantage of the Perl function grep to filter elements of an array, for example:
    my @children = grep {$_->attr('afun') =~ /^(Sb|Obj)$/} $verb->children;
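
The difference between the two comparison operators, and the effect of grep, can be tried outside btred. The following is a minimal sketch in plain Perl; the hashes are hypothetical stand-ins for btred nodes (in btred you would call $node->attr('ord') instead of reading a hash key), and the values are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for btred nodes: plain hashes with 'ord' and 'afun'.
my @nodes = (
    { ord => 10, afun => 'Obj'  },
    { ord => 9,  afun => 'AuxP' },
    { ord => 2,  afun => 'Sb'   },
    { ord => 1,  afun => 'Atr'  },
);

# Numerical comparison gives the tree order.
my @numeric = map { $_->{ord} } sort { $a->{ord} <=> $b->{ord} } @nodes;
# String comparison puts 10 before 2 because it compares character by character.
my @string  = map { $_->{ord} } sort { $a->{ord} cmp $b->{ord} } @nodes;
# grep keeps only the elements for which the block is true (original order is kept).
my @args    = map { $_->{ord} } grep { $_->{afun} =~ /^(Sb|Obj)$/ } @nodes;

print "numeric: @numeric\n";  # numeric: 1 2 9 10
print "string:  @string\n";   # string:  1 10 2 9
print "Sb/Obj:  @args\n";     # Sb/Obj:  10 2
```

With cmp, a sentence of ten or more words would have its nodes silently reordered, which is why the numerical operator matters for ord.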
    

Homework 03 (due on April 15th)


Class 05 - March 27th, 2020

Individual study based on the following instructions.

Our task for today and for the next class is to learn to use btred, a scripting tool for working with TrEd data. There is no new homework today; you still have time to finish the previous one (homework 2). A new homework (homework 3) will be given next week. The following instructions cover two classes (today and next week), so you can schedule the work as you need.

Let me start with a few clarifying statements, which you may already know:

  • The tree editor TrEd works primarily with data encoded in the Prague Markup Language (PML), which is a general XML format for encoding linguistically annotated treebanks. Specifics of individual treebanks (annotation layers, types of nodes, types of relations between the nodes, attributes defined at nodes, etc.) are defined in TrEd extensions.
  • btred is a command line scripting interface for the PML data. It can use both general functions defined in the PML and treebank-specific functions defined in TrEd extensions.
  • TrEd and btred are written in Perl, so scripts for btred must also be written in Perl. However, you only need basic knowledge of Perl.

Your task

Please follow the btred tutorial, which will take you through the first steps of working with btred. Our plan is to cover steps 1-7 of the tutorial. I suggest splitting the work as follows: steps 1-4 this week, steps 5-7 next week. Of course, if you find the first four steps simple enough, you can move on sooner. For the tutorial, you can use any PDT-like data, e.g. the PDT data from class 02. For examples that use the tectogrammatical layer (t-layer), use this data from the PDT, which also contains the t-files.

As Perl may be new to you and btred most certainly is, let me give you a few hints to make your work with Perl and btred easier. You will see in the tutorial that btred scripts may start with one of three different lines:

  • #!btred -e function_to_run()    # the function function_to_run() will be run once for each given file. You can get an array of all trees (their roots) in the given file by @roots = GetTrees().
  • #!btred -T -e function_to_run()    # the function function_to_run() will be run on all trees in the given files. Variable $root will contain the root of the current tree. You can get an array of all nodes (incl. the root) in the given tree by @nodes = GetNodes($root), or (excl. the root) by @nodes = $root->descendants.
  • #!btred -TN -e function_to_run()   # the function function_to_run() will be run on all nodes in all trees in the given files. Variable $root will contain the root of the current tree and variable $this will contain the current node.

A simple script that counts the nodes in each tree and prints the count next to the name of the file the tree comes from (one output line per tree) might look like this:

#!btred -T -e count_nodes()

sub count_nodes {
    my @nodes = GetNodes($root);  # get all nodes in the tree
    my $number_of_nodes = scalar(@nodes);  # get the length of the array
    my $filename = FileName();
    print "$filename: $number_of_nodes\n";
}

If the script is named count.btred, it can be run on all gzipped a-files in the current directory from a terminal with the following command:

btred -I count.btred *.a.gz

I suggest that after the first line in any btred (or Perl in general) script, you add the following Perl instructions:

use strict;  # reports e.g. undeclared variables (often typos)
use warnings;  # warns e.g. when an uninitialized variable is used (e.g. in addition or concatenation)
use utf8;  # allows UTF-8 in the script source code
binmode STDIN, ':utf8';  # sets UTF-8 for STDIN
binmode STDOUT, ':utf8';  # ditto for STDOUT
binmode STDERR, ':utf8';  # ditto for STDERR
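
To see what the UTF-8 settings change in practice, here is a small sketch in plain Perl (no btred needed). It assumes the script file itself is saved in UTF-8; the sample string is chosen only for illustration.

```perl
use strict;
use warnings;
use utf8;                     # string literals in this file are decoded as UTF-8
use Encode qw(encode);
binmode STDOUT, ':utf8';      # characters are encoded back to UTF-8 on output

my $name  = 'Jiří';
my $chars = length($name);                   # length in characters
my $bytes = length(encode('UTF-8', $name));  # length of the UTF-8 byte sequence

print "$name: $chars characters, $bytes bytes\n";
```

Without use utf8, Perl would treat the literal as a sequence of bytes, so functions such as length, substr, or regular expressions would operate on bytes rather than on the characters of Czech words.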

Manuals and documentation

For writing btred scripts generally and for a particular treebank (say, a PDT-like treebank), there are three main sources of information:

Exercise after 4 steps of the tutorial

After you finish the first four steps of the tutorial, you may practice the acquired knowledge, if you want, on the following tasks (you do not necessarily need to write the scripts; just thinking the tasks through may suffice):

  • Print the id of each tree root and the depth of the tree (the length of the longest path from the root to a leaf) to STDOUT.
  • Count (and print to STDOUT) the distribution of sizes (number of nodes) of all trees in the given files (i.e., how many times there are trees with 1 node, 2 nodes, 3 nodes, etc.). You may count the final distribution in an outside script (bash) or use the function exit_hook.
  • Print all sentences in the a-files that are shorter than 5 tokens (there is a function PML_A::GetSentenceString($root) defined in the TrEd extension for the PDT).

Exercise after 7 steps of the tutorial

  • homework 3 (see above in class 06)
  • Compute the distribution of morphological tags (use only the first five positions of the tag) for a-nodes with afun 'Pred', and similarly for a-nodes with afun 'Sb'.
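
The counting part of such a distribution is a common Perl idiom: a hash keyed by the shortened tag. The sketch below is plain Perl and only illustrates the idiom; the list of tags is a hypothetical sample, and reading the tags from the a-nodes in btred is left to you.

```perl
use strict;
use warnings;

# Hypothetical sample of 15-position PDT-style morphological tags;
# in btred they would be read from the nodes instead.
my @tags = ('VB-S---3P-AA---', 'VB-P---3P-AA---', 'VpYS---XR-AA---', 'VB-S---3P-NA---');

my %count;
$count{ substr($_, 0, 5) }++ for @tags;   # key = first five tag positions

# Most frequent prefixes first; ties are ordered alphabetically.
for my $prefix (sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count) {
    print "$prefix\t$count{$prefix}\n";
}
```

Note that the frequencies are compared with <=> and the keys with cmp, in line with the note on sort in class 06.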

Class 04 - March 20th, 2020

Individual study based on the homework. Please contact me (Jiří Mírovský) with any questions.
Homework 02 (due on April 1st)


Class 03 - March 13th, 2020

Cancelled.


Class 02 - March 6th, 2020

Section 000 of the WSJ part of the Penn Treebank in the original merged file format.
PDT data for the class (it is a part of PDT w-, m- and a-files)
PEDT data for the class (it is a part of PEDT, namely a-files from sections 00* (with w- and m- info merged in))
Documentation for the m-layer
PDT 3.5
Demo of a Czech and English morphological analyzer and tagger


Class 01 - February 28th, 2020

English data to test the TrEd installation: section 000 of PEDT

Sample phrase structure tree (file)

S (
  NP ( N ( 'Peter' ) )
  * VP ( * V ( 'gave' )
         NP ( D   ( 'a' )
              * N ( 'flower' ) )
         PP ( * P ( 'to' )
              N   ( 'Mary' ) )
       )
)

Another sample phrase structure tree (file)

S (
  NP ( A ('Young') * N ('men'))
  * VP (* V ( 'love' )
        COORD (NP ( N('beer'))
               * CONJ ('and')
               NP ( N( 'girls' ) )
        ))
)

Homework 01 (due on March 11th)


Class 00 - February 21, 2020

Installation of tree editor TrEd on computers in the lab (installation script 'install_tred.bash' from the TrEd home page)

  1. Setting up cpan (so that it uses local directories; http://www.perlmonks.org/?node_id=630026):
    • mkdir -p ~/.cpan/CPAN
      touch ~/.cpan/CPAN/MyConfig.pm # (or echo "1" >~/.cpan/CPAN/MyConfig.pm, or cp /root/.cpan/CPAN/Config.pm ~/.cpan/CPAN/MyConfig.pm if the subsequent command does not work)
    • perl -MCPAN -e shell
      cpan> o conf init # (use local::lib and at the end, allow setting (or set manually) some variables in .bashrc file)
    • exit cpan, then exit and restart bash
  2. Installing cpanm for easier installation of other Perl modules
    • cpan App::cpanminus
  3. Installing TrEd


lindat.cz - search for Prague Dependency Treebank 2.0 - sample data

Configuration file .tredrc - customize fonts: font section in the TrEd documentation

TrEd & svn: file pmlbackend_conf.xml

An example file from the Penn Treebank transformed to the PML - to test the TrEd installation; an extension for the Penn Treebank files (ptb) needs to be installed first...