SU1, Friday 12:20 p.m.
Jiří Mírovský
mirovsky at ufal.mff.cuni.cz
room 422
Daniel Zeman
zeman at ufal.mff.cuni.cz
room 409
Results of the homeworks (click here)
The last topic for this course is searching in treebanks, again in the form of your individual study based on the following instructions. This is the last practical class this year and you will get the last homework. Feel free to contact me in the subsequent weeks with any problems or questions.
We will search treebanks using the Prague Markup Language Tree Query (PML-TQ), which is a powerful search engine for any treebank encoded in the Prague Markup Language (PML). Note the power of the framework: once a treebank is encoded in PML, you can open, browse and edit it using TrEd, process it automatically using btred, and search it using PML-TQ.
PML-TQ is a client-server system. There are two clients to choose from, and the homework can be done in either of them:
If you have time and plan to work with treebanks in the future, I recommend trying option 2 and, if the installation causes too much trouble, reverting to option 1. However, if you want to get through the topic of searching in treebanks as quickly and with as little effort as possible, you can choose option 1 right away. Please refer to a web page listing the two clients for some connection/installation instructions. You will also get some info (login name etc.) in an e-mail.
There are two tutorials for PML-TQ, one focused on the web interface and one on the TrEd interface. Please follow the one that matches the client interface you prefer:
Later, you may want to consult other documentation sources, see the PML-TQ documentation page.
Homework 05 (due on May 27th) - if you finish and wish to get the marks sooner, write me an e-mail.
Online interactive Zoom session, continuing work with Universal Dependencies in Udapi. The PDF from last week has been expanded with new sections.
Homework 04 (due by May 13, 2020) is specified at the end of the PDF tutorial mentioned above.
Individual work with Universal Dependencies in TrEd and with Udapi. For instructions, see this PDF. The instructions include several exercises. While you are encouraged to do them all, they are not mandatory and they do not constitute official homework.
Continuing individual study based on the instructions given below in class 05.
sort: if, for example, you want to sort analytical nodes according to their tree order (attribute ord), be sure to use the operator <=> (numerical comparison), not cmp (alphabetical comparison), i.e.:
my @sorted = sort {$a->attr('ord') <=> $b->attr('ord')} @nodes;
grep to filter elements of an array, for example:
my @children = grep {$_->attr('afun') =~ /^(Sb|Obj)$/} $verb->children;
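The difference between the two comparison operators, and the effect of the grep filter, can be tried out without btred. The following is a minimal sketch in plain Perl in which nodes are stood in for by hash references; the sample ord and afun values are made up for illustration, and in a real btred script you would write $node->attr('ord') instead of $node->{ord}.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for analytical nodes: plain hash references
# instead of real node objects, with made-up 'ord' and 'afun' values.
my @nodes = (
    { ord => 10, afun => 'Obj'  },
    { ord => 2,  afun => 'Sb'   },
    { ord => 1,  afun => 'Pred' },
    { ord => 3,  afun => 'Atr'  },
);

# Numerical comparison: the values are compared as numbers.
my @numeric = sort { $a->{ord} <=> $b->{ord} } @nodes;

# Alphabetical (string) comparison: "10" sorts before "2".
my @alphabetic = sort { $a->{ord} cmp $b->{ord} } @nodes;

# Keep only the nodes whose afun is exactly Sb or Obj.
my @arguments = grep { $_->{afun} =~ /^(Sb|Obj)$/ } @nodes;

print join(' ', map { $_->{ord} } @numeric), "\n";
print join(' ', map { $_->{ord} } @alphabetic), "\n";
print join(' ', map { $_->{afun} } @arguments), "\n";
```

With cmp, node 10 ends up between nodes 1 and 2, which is almost never what you want for tree order. Note also that grep keeps the elements in their original order, so sort the result if order matters.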
Homework 03 (due on April 15th)
Individual study based on the following instructions.
Our task for today and for the next class is to learn to use btred, a scripting tool for working with TrEd data. There will be no homework today; you still have time to do your previous homework (homework 2). A new homework (homework 3) will be given next week. The following work instructions apply to two classes (today and next week), so you can plan the work according to your needs.
Let me start with a few clarifying statements, which you may already know:
btred is a command line scripting interface for the PML data. It can use both general functions defined in the PML and treebank-specific functions defined in TrEd extensions.
btred is written in Perl, so scripts for btred also need to be written in Perl. However, you only need basic knowledge of Perl.
Please follow the btred tutorial, which will take you through the first steps of working with btred. Our plan is to cover steps 1-7 of the tutorial. I suggest that you split the work in this way: steps 1-4 this week, steps 5-7 next week. But, of course, if you find the first four steps simple enough, you can proceed further sooner. For the tutorial, you can use any PDT-like data, e.g. the PDT data from class 02. For examples that use the tectogrammatical layer (t-layer), use this data from the PDT, which also contain the t-files.
As Perl may be new to you, and btred most certainly is, let me give you a few hints to make your work with Perl and btred easier. You will see in the tutorial that btred scripts may start with three different lines:
#!btred -e function_to_run()
The function function_to_run() will be run once for each given file. You can get an array of all trees (their roots) in the given file with @roots = GetTrees().
#!btred -T -e function_to_run()
The function function_to_run() will be run on each tree in the given files. The variable $root will contain the root of the current tree. You can get an array of all nodes (incl. the root) in the given tree with @nodes = GetNodes($root), or (excl. the root) with @nodes = $root->descendants.
#!btred -TN -e function_to_run()
The function function_to_run() will be run on each node of each tree in the given files. The variable $root will contain the root of the current tree and the variable $this will contain the current node.
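To make the difference between the three modes concrete, here is a small simulation in plain Perl. It does not use btred at all: the nested toy data and the helpers run_per_file, run_per_tree and run_per_node are hypothetical and only mimic how often btred would call your function in each mode.

```perl
use strict;
use warnings;

# Hypothetical toy data: two "files", each a list of "trees",
# each tree a list of node names (the first one plays the root).
my %files = (
    'a.gz' => [ [qw(root1 n1 n2)], [qw(root2 n3)] ],
    'b.gz' => [ [qw(root3 n4 n5 n6)] ],
);

my %calls = (file => 0, tree => 0, node => 0);

# Mimics "btred -e": one call per file.
sub run_per_file {
    my ($function) = @_;
    $function->() for sort keys %files;
}

# Mimics "btred -T -e": one call per tree.
sub run_per_tree {
    my ($function) = @_;
    for my $file (sort keys %files) {
        $function->() for @{ $files{$file} };
    }
}

# Mimics "btred -TN -e": one call per node.
sub run_per_node {
    my ($function) = @_;
    for my $file (sort keys %files) {
        for my $tree (@{ $files{$file} }) {
            $function->() for @$tree;
        }
    }
}

run_per_file(sub { $calls{file}++ });
run_per_tree(sub { $calls{tree}++ });
run_per_node(sub { $calls{node}++ });

print "per file: $calls{file}, per tree: $calls{tree}, per node: $calls{node}\n";
```

The practical consequence is that anything you print inside the function is printed once per file, once per tree or once per node, depending on the first line of the script.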
A simple script for counting the nodes in each tree of the given files and printing the number next to the name of the file the tree comes from might look like this:
#!btred -T -e count_nodes()

sub count_nodes {
  my @nodes = GetNodes($root);            # get all nodes in the tree
  my $number_of_nodes = scalar(@nodes);   # get the length of the array
  my $filename = FileName();
  print "$filename: $number_of_nodes\n";
}
If the script is named count.btred, it can be run on all gzipped a-files in the current directory from a terminal with the following command:
btred -I count.btred *.a.gz
I suggest that after the first line in any btred (or Perl in general) script, you add the following Perl instructions:
use strict;    # reports e.g. undeclared variables (often typos)
use warnings;  # warns e.g. if an uninitialized variable is used (e.g. in addition or concatenation)
use utf8;      # allows UTF-8 in the script source code
binmode STDIN,  ':utf8';  # sets UTF-8 for STDIN
binmode STDOUT, ':utf8';  # ditto for STDOUT
binmode STDERR, ':utf8';  # ditto for STDERR
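What the first two of these instructions buy you can be seen in a short plain-Perl sketch (no btred needed). The misspelled variable name and the empty %count hash are made up for illustration; the string eval and the __WARN__ handler are used here only so that the messages can be captured and printed instead of stopping the script.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go straight to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# 1) strict: a misspelled, undeclared variable is a compile-time error.
#    The code is compiled inside a string eval so the error can be inspected.
my $compiled = eval q{ $nubmer_of_nodes = 5; 1 };
my $strict_error = $@;

# 2) warnings: using an undefined value in an addition triggers a warning.
my %count;                      # nothing has been counted yet
my $total = 1 + $count{Sb};     # $count{Sb} is undefined

print "strict said:   $strict_error";
print "warnings said: $_" for @warnings;
```

Without use strict, the misspelled variable would silently become a new global variable; without use warnings, the undefined value would silently count as 0.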
For writing btred scripts generally and for a particular treebank (say, a PDT-like treebank), there are three main sources of information:
GetNodes, GetTrees, ListV, FileName, etc.) and 15.9. Hooks: automatically executed macros (function exit_hook, which is executed once after all input files are processed),
parent, firstson, level, attr, set_attr, children, descendants, ancestors, etc.),
PML_A::GetSentenceString, PML_A::GetEParents, etc.)
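A common use of exit_hook is to collect numbers while the trees are being processed and to print a summary only once at the end. The following plain-Perl sketch mocks this pattern: the toy input, the arguments passed to count_nodes and the final loop that stands in for btred are all hypothetical. In a real btred script, count_nodes would take no arguments and would use FileName() and GetNodes($root), and btred itself would call exit_hook.

```perl
use strict;
use warnings;

# Hypothetical toy input: file name => list of trees, each tree a list of nodes.
my %files = (
    'sample1.a.gz' => [ [qw(root a b)], [qw(root c)] ],
    'sample2.a.gz' => [ [qw(root d e f)] ],
);

my %nodes_in_file;   # filled while the "trees" are processed
my @report;          # lines produced by exit_hook

# Plays the role of the function named after -e; called once per tree.
sub count_nodes {
    my ($filename, $tree) = @_;
    $nodes_in_file{$filename} += scalar(@$tree);
}

# Plays the role of the hook that is run once after all input files.
sub exit_hook {
    for my $filename (sort keys %nodes_in_file) {
        push @report, "$filename: $nodes_in_file{$filename}";
    }
    print "$_\n" for @report;
}

# Stand-in for what "btred -T -e count_nodes()" would do for us.
for my $filename (sort keys %files) {
    count_nodes($filename, $_) for @{ $files{$filename} };
}
exit_hook();
```

Keeping the totals in a hash keyed by file name gives one line per file, even though the counting function itself runs once per tree.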
After you finish the first four steps of the tutorial, you may, if you want, practice the acquired knowledge on the following tasks (you may not need to actually write the scripts; just thinking the tasks through may suffice):
exit_hook.
PML_A::GetSentenceString($root) defined in the TrEd extension for the PDT.
Individual study based on the homework. Please contact me (Jiří Mírovský) with any questions.
Homework 02 (due on April 1st)
Cancelled.
Section 000 of the WSJ part of the Penn Treebank in the original merged file format.
PDT data for the class (it is a part of PDT w-, m- and a-files)
PEDT data for the class (it is a part of PEDT, namely a-files from sections 00* (with w- and m- info merged in))
Documentation for the m-layer
PDT 3.5
Demo of a Czech and English morphological analyzer and tagger
English data to test the TrEd installation: section 000 of PEDT
S ( NP ( N ( 'Peter' ) ) * VP ( * V ( 'gave' ) NP ( D ( 'a' ) * N ( 'flower' ) ) PP ( * P ( 'to' ) N ( 'Mary' ) ) ) )
S ( NP ( A ('Young') * N ('men')) * VP (* V ( 'love' ) COORD (NP ( N('beer')) * CONJ ('and') NP ( N( 'girls' ) ) )) )
Homework 01 (due on March 11th)
Installation of tree editor TrEd on computers in the lab (installation script 'install_tred.bash' from the TrEd home page)
lindat.cz - search for Prague Dependency Treebank 2.0 - sample data
Configuration file .tredrc - customize fonts: font section in TrEd documentation
TrEd & svn: file pmlbackend_conf.xml
An example file from the Penn Treebank transformed to the PML - to test the TrEd installation; an extension for the Penn Treebank files (ptb) needs to be installed first...