Treex Tutorial @ MTM2013

My slides about TectoMT and Treex.

Try Treex online

Using Treex::Web, try to analyze various English sentences up to the tectogrammatical layer (t-layer). You may also try Czech analysis or English-to-Czech TectoMT translation, but this will take much longer. It is a beta version; you are the first testers :-).

Prepare Treex environment

For this tutorial, we recommend using the Treex pre-installed on the SU1 and SU2 lab machines:
source ~popem3am/preinstall/treex.sh
If you love hacking, you may try installing Treex on your notebook and downloading the sample data files, but it will take more than 20 minutes to install the Perl dependencies and do the SVN checkout.

Browse files with TrEd

The command ttred executes the TrEd editor with support for opening files in the treex format.
ttred treex_tutorial/data/czeng1.treex.gz
You can browse other files in treex_tutorial/data/ to see samples from the CzEng 1.0 English-Czech parallel treebank, the HamleDT collection of 29 dependency treebanks, PCEDT 2.0, the PennTB, and the British National Corpus (with Stanford and FANSE parser analyses). You can also download treex files from your previous Treex::Web experiments (click on "Download all" and unzip result.treex).

First steps with Treex

The elementary unit of code in Treex is called a block. Each block should solve a well-defined and usually linguistically motivated task, e.g. tokenization, tagging or parsing.

A sequence of blocks is called a scenario, and it can describe an end-to-end NLP application, e.g. machine translation or preprocessing of a parallel treebank.

Treex applications can be executed from Perl. However, they are usually executed using the command-line interface treex.

We will start traditionally with the "Hello, world!" example :-).

 echo 'Hello, world!' | treex Read::Text language=en Write::Text language=en

The desired output was printed to STDOUT, but some info messages are printed to STDERR around it. To filter out these messages, you can either use the --quiet option (-q) or standard redirection of STDERR.

 echo 'Hello, world!' | treex -q Read::Text language=en Write::Text language=en
 echo 'Hello, world!' | treex Read::Text language=en Write::Text language=en 2>/dev/null
What does the syntax mean?

Read::Text language=en Write::Text language=en is a scenario definition. The scenario consists of two blocks: Read::Text and Write::Text. Each block has one parameter set: its name is language and its value is en (the ISO 639-1 code for English).

Why is the language parameter needed?

One Treex document can contain sentences in multiple languages (which is useful for tasks like word alignment or machine translation), so it is necessary to instruct each block which language it should operate on.

Can I make the scenario description shorter?

It is not necessary to repeat the same parameter specification for every block. You can use a special block Util::SetGlobal:

 echo 'Hello, world!' | treex -q Util::SetGlobal language=en Read::Text Write::Text
Can I make it even shorter?

Yes. (And I know the previous example was not actually shorter.) There is an option --language (-L) which is just a shortcut for Util::SetGlobal language=...

 echo 'Hello, world!' | treex -q --language=en Read::Text Write::Text
 echo 'Hello, world!' | treex -q -Len Read::Text Write::Text

The "Hello, world!" example is silly. The first block (a so-called reader) reads the plain text input and converts it into the Treex in-memory document representation; this document is passed to the second block (a so-called writer), which converts it back to plain text and prints it to STDOUT. No (linguistic) processing was done.

There are readers and writers for various formats other than plain text (e.g. HTML, CoNLL, PennTB MRG, PDT PML), so you can use Treex for format conversion. You can also create your own readers and writers for new formats.

For simplicity, we'll continue to use plain text format in this tutorial chapter, but we'll try to do something slightly more interesting.

SEGMENTATION TO SENTENCES

To segment a text into sentences, we can use the block W2A::Segment and the writer Write::Sentences, which prints each sentence on a separate line.

 echo "Hello! Mr. Brown, how are you?" \
  | treex -Len Read::Text W2A::Segment Write::Sentences

You can see that the text was segmented into three sentences: "Hello!", "Mr.", and "Brown, how are you?". Block W2A::Segment is language independent (at least for languages using the Latin alphabet) and it finds sentence boundaries based on regex rules that detect sentence-final punctuation ([.?!]) followed by a capital letter. To get the correct segmentation, we must use W2A::EN::Segment, which has a list of English words (tokens) that usually do not end a sentence even if they are followed by a full stop and a capital letter. By the way, Treex is object-oriented: blocks are classes, and W2A::EN::Segment is a descendant of the W2A::Segment base class.

 echo "Hello! Mr. Brown, how are you?" \
  | treex -Len Read::Text W2A::EN::Segment Write::Sentences
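The rule-based approach can be sketched in plain Perl. The following standalone script is only an illustration of the idea, not the actual W2A::EN::Segment code, and its abbreviation list is a tiny hypothetical sample:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample of non-sentence-final abbreviations; the real
# W2A::EN::Segment list is much longer.
my %non_final = map { $_ => 1 } qw(Mr Mrs Dr Prof e.g i.e);

# Split text after sentence-final punctuation [.?!] followed by whitespace
# and a capital letter, unless the token before the full stop is a known
# abbreviation. (Chunks are rejoined with single spaces, which is enough
# for this illustration.)
sub segment {
    my ($text) = @_;
    my @sentences;
    my $current = '';
    while ($text =~ /\G(.*?[.?!])\s+(?=[A-Z])/gcs) {
        my $chunk = $1;
        $current .= ($current eq '' ? '' : ' ') . $chunk;
        # Keep accumulating if the chunk ends with an abbreviation + dot.
        next if $chunk =~ /(?:^|\s)(\S+)\.$/ && $non_final{$1};
        push @sentences, $current;
        $current = '';
    }
    my $rest = substr $text, (pos($text) // 0);
    $current .= ($current eq '' ? '' : ' ') . $rest;
    push @sentences, $current if $current =~ /\S/;
    return @sentences;
}

print "$_\n" for segment('Hello! Mr. Brown, how are you?');
```

Running the script prints the two correctly segmented sentences, one per line, mirroring what W2A::EN::Segment plus Write::Sentences produce for this input.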
Where can I find blocks' source code?

The blocks are actually Perl modules, and you can find them in ~/preinstalled_perl5/lib/perl5/Treex/Block/. Generally, you can find the real location of a Perl module with perldoc -l. The full name of the W2A::EN::Segment module is actually Treex::Block::W2A::EN::Segment, but since the prefix Treex::Block:: is common to all blocks, it is omitted in scenario descriptions. So the location of W2A::EN::Segment can be found with

 perldoc -l Treex::Block::W2A::EN::Segment
What does the name W2A::EN::Segment mean?

All Treex blocks that do shallow linguistic analysis (segmentation, tokenization, lemmatization, PoS tagging, dependency parsing) are grouped in the directory W2A (W and A are the names of two layers of language description). Language-specific blocks are stored in a subdirectory named with the uppercase ISO code of the given language (EN in this case).

How to read already segmented input?

If you have a file sample.txt with one sentence per line, you can load it into Treex using

 cat sample.txt | treex -Len Read::Sentences ...

There are many other options for segmentation, see (perldoc for) modules Treex::Block::W2A::Segment, Treex::Block::W2A::SegmentOnNewlines, and Treex::Block::W2A::ResegmentSentences.

Task 1

You have an input plain text file (e.g. data/news.txt) where each paragraph (including headlines) is on a separate line. Load this file into Treex and print one sentence per line. Note that headlines do not end with a full stop, but they should still be treated as separate sentences.

HINT: See documentation of Treex::Block::W2A::Segment.

TOKENIZATION, LEMMATIZATION, TAGGING

Try these scenarios and check the differences:

 echo "Mr. Brown, we'll start tagging." |\
  treex -Len Read::Sentences W2A::TokenizeOnWhitespace Write::CoNLLX

 echo "Mr. Brown, we'll start tagging." |\
  treex -Len Read::Sentences W2A::Tokenize Write::CoNLLX

 echo "Mr. Brown, we'll start tagging." |\
  treex -Len Read::Sentences W2A::EN::Tokenize Write::CoNLLX

 echo "Mr. Brown, we'll start tagging." |\
  treex -Len Read::Sentences W2A::EN::TagLinguaEn Write::CoNLLX

Now, the fourth column in the CoNLLX format contains PoS (part-of-speech) tags, but the tokenization differs from that of W2A::EN::Tokenize. The reason is that W2A::EN::TagLinguaEn is actually a thin wrapper around the popular Perl module Lingua::EN::Tagger, which does tokenization and tagging in one step.

We can try another tagger that fits the modular design better.

 echo "Mr. Brown, we'll start tagging." |\
  treex -Len Read::Sentences\
             W2A::EN::Tokenize\
             W2A::TagTreeTagger\
             W2A::EN::Lemmatize\
             Write::CoNLLX

Now, the third column contains lemmas, but the tags are not from the standard PennTB tagset. As a result, even the lemmas of proper nouns are lowercased (because W2A::EN::Lemmatize expects the NNP tag for proper nouns). For English and Czech, Treex offers pre-trained models for the high-quality MorphoDiTa tagger. For many other languages, there are pre-trained TreeTagger models.

 echo "Mr. Brown, we'll start tagging." |\
  treex -Len Read::Sentences\
             W2A::EN::Tokenize\
             W2A::EN::TagMorphoDiTa\
             W2A::EN::Lemmatize\
             Write::CoNLLX

 echo "Es tut mir leid." |\
  treex -Lde Read::Sentences W2A::Tokenize W2A::TagTreeTagger Write::CoNLLX
 echo "Lo siento" |\
  treex -Les Read::Sentences W2A::Tokenize W2A::TagTreeTagger Write::CoNLLX
 echo "Mi dispiace" |\
  treex -Lit Read::Sentences W2A::Tokenize W2A::TagTreeTagger Write::CoNLLX
 echo "Je suis désolée" |\
  treex -Lfr Read::Sentences W2A::Tokenize W2A::TagTreeTagger Write::CoNLLX
 echo "Bohužel jsem tento tutorial nedokončil." |\
  treex -Lcs Read::Sentences W2A::CS::Tokenize W2A::CS::TagMorphoDiTa lemmatize=1 Write::CoNLLX

DEPENDENCY PARSING

 echo "John loves Mary" |\
  treex -Len Read::Sentences\
             W2A::EN::TagLinguaEn\
             W2A::EN::Lemmatize\
             W2A::EN::ParseMSTperl\
             Write::CoNLLX

Task 3

Try to use different taggers and find sentences where different tagging leads to different parsing. You can also try different parsers: W2A::EN::ParseMST (R. McDonald's original implementation) or W2A::EN::ParseMalt (use the parameter memory=1g), but those blocks (and the wrappers for the Java implementations) are not released on CPAN yet, so if you are not using the preinstalled Treex on SU2, you may need to install the parsers first.

Task 4

The Treex native format *.treex.gz is actually gzipped XML. During the following section on readers and writers, look inside the files (zless my.treex.gz). Check what happens when lemmatization is added. Try to continue the analysis by adding tagging and parsing. Visualize the individual steps in TrEd with

 ttred my.treex.gz

READERS, WRITERS AND TREEX NATIVE FORMAT

So far, we have printed all the results to STDOUT in the CoNLLX format. You can easily redirect the output to a file using standard redirection, but you can also specify the output file with the writer's parameter to.

 echo "John loves Mary" | treex -Len Read::Sentences W2A::EN::TagLinguaEn\
      Write::CoNLLX to=my.conll

Similarly, you can specify the input files with the reader's parameter from.

 treex -Len Read::CoNLLX from=my.conll Write::Sentences

The parameter from can contain a list of comma- or space-separated file names. If an item starts with the @ character, it is interpreted as a file list with one file name per line. So you can do, e.g.:

 ls data/pcedt*.treex.gz > my.list
 treex -Len Read::Treex from=@my.list Write::Sentences to=out.txt
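The expansion of such a from= value can be sketched in plain Perl. This is only an illustration of the convention described above, not Treex's actual implementation:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Expand a from= value: items are separated by commas or spaces, and an
# item starting with @ names a file list (one file name per line).
sub expand_filenames {
    my ($from) = @_;
    my @files;
    for my $item (split /[ ,]+/, $from) {
        if ($item =~ s/^\@//) {
            open my $fh, '<', $item or die "Cannot open file list $item: $!";
            chomp(my @listed = <$fh>);
            close $fh;
            push @files, grep { length } @listed;
        }
        else {
            push @files, $item;
        }
    }
    return @files;
}

# Usage: create a small file list and expand a mixed from= value.
open my $fh, '>', 'my.list' or die $!;
print {$fh} "data/pcedt1.treex.gz\ndata/pcedt2.treex.gz\n";
close $fh;

print "$_\n" for expand_filenames('a.treex @my.list b.treex');
unlink 'my.list';
```

The script prints the plain item, the two file names read from my.list, and the second plain item, in the order they appeared in the from= value.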

For the Treex file format (*.treex or *.treex.gz), there is a shortcut that automatically adds the reader to the beginning of the scenario.

 treex -Len Write::Sentences -- data/pcedt*.treex.gz

You can use the treex CLI as a format converter.

 treex -Len Read::CoNLLX from=my.conll Write::Treex to=my.treex.gz

Finally, there is another shortcut that allows you to modify treex files in place.

 treex -s -Len W2A::EN::Lemmatize -- my.treex.gz
 # check that lemmas were added
 treex -Len -q Write::CoNLLX -- my.treex.gz

WRITING YOUR OWN BLOCKS

Now, you can try some (Perl) programming tasks. Both templates (containing specifications) and solutions are provided in ~/treex_tutorial/Treex/Block/Tutorial/.

USEFUL TRICKS

Sometimes you need a block with just a few lines of code for ad hoc hacking. You don't need to create a separate file for the block; you can write the code directly in the scenario (on the command line):

 treex Util::Eval document='print $document->full_filename' -- *.treex
 treex Util::Eval anode='print $anode->lemma."\n"' -- *.treex
 treex Util::Eval zone='$zone->remove_tree("t") if $zone->has_tree("t");' -- *.treex

The three examples above are substitutes for blocks overriding the methods process_document, process_anode, and process_zone, respectively.
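The mechanism behind such one-liner blocks can be illustrated in plain Perl: the code snippet is compiled by a string eval into a callback that receives the current node. This is only a sketch of the idea, using plain hashrefs as mock a-nodes, not the actual Util::Eval code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The snippet passed on the command line (here hard-coded) is compiled by
# a string eval into a callback; the processed node is exposed under the
# conventional variable name $anode.
my $snippet  = 'print $anode->{lemma}, "\n"';
my $callback = eval "sub { my \$anode = shift; $snippet }";
die $@ if $@;

# Plain hashrefs stand in for real a-nodes in this demonstration.
my @mock_anodes = ({ lemma => 'john' }, { lemma => 'love' }, { lemma => 'mary' });
$callback->($_) for @mock_anodes;
```

The callback is applied to each mock node in turn and prints one lemma per line, analogous to how the anode='...' snippet is applied to every a-node in the document.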

You may need to store within nodes some ad hoc information for which no official attribute is designed. You can use so-called wild attributes for this purpose. They are automatically serialized (into the attribute wild_dump) before saving to a treex file.

 $node->wild->{name_of_my_new_attribute} = $value;
 $value = $node->wild->{name_of_my_new_attribute};

Task 5: Clause Patterns

Each sentence can be divided into (finite) clauses. For example: The man who wrote this tutorial was lazy and had not enough time, so he could finish it. This sentence has four clauses; the second clause is embedded in the first one. We can assign a clause_number to each token (tokens separating clauses get 0): 11222211033330444440. Based on the dependency tree, we can assign a clause_depth to each token with a non-zero clause_number: 11222211 1111 22222. Using the clause_depths, we can define the clause pattern of a sentence: 12112. See the PDT 2.5 documentation for details.
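The derivation of a clause pattern from clause_numbers and clause_depths can be sketched in plain Perl. The depth map below is supplied by hand for the example sentence (in Treex the depths would come from the dependency tree); this is only an illustration, not PDT's actual algorithm:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Derive the clause pattern: walk the per-token clause_numbers and emit the
# clause_depth whenever a new run of a clause starts (0 marks separator
# tokens and only terminates the current run).
sub clause_pattern {
    my ($numbers, $depth_of) = @_;
    my ($pattern, $prev) = ('', 0);
    for my $n (split //, $numbers) {
        $pattern .= $depth_of->{$n} if $n != 0 && $n != $prev;
        $prev = $n;
    }
    return $pattern;
}

# The example sentence from above; clauses 1-4 have depths 1, 2, 1, 2.
my $numbers  = '11222211033330444440';
my %depth_of = (1 => 1, 2 => 2, 3 => 1, 4 => 2);
print clause_pattern($numbers, \%depth_of), "\n";   # prints 12112
```

The runs of clause numbers (1, 2, 1, 3, 4) map to the depths 1, 2, 1, 1, 2, giving the pattern 12112 from the example.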

Download PDT 3.0 and extract clause patterns for the first ten files. Alternatively, you can use data/czeng[123].treex.gz and write your own heuristic block for assigning clause_numbers. Present a table (or a gnuplot chart) with a histogram (number of occurrences) of the clause patterns.