Documentation of The Czech Adaptation of Michael Collins' Parser

Václav Honetschläger

Table of Contents

1. Introduction

2. User's Guide

2.1. General Information
2.2. Parsing
2.3. Training
2.4. What is new

3. Developer's & Maintainer's Guide

3.1. Czech Adaptation Overview
3.2. Training Dissected
3.3. Parsing Dissected

4. Map of Files and Folders

Abstract

This document describes the Czech adaptation of Michael Collins' parser from the user's point of view and, in part, from the developer's and maintainer's point of view.

1. Introduction

Michael Collins has described his parser in the following documents:

Michael Collins: A New Statistical Parser Based on Bigram Lexical Dependencies, Proceedings of the 34th Annual Meeting of the ACL, Santa Cruz
Michael Collins: Three Generative, Lexicalized Models for Statistical Parsing, Proceedings of the 35th Annual Meeting of the ACL, Madrid

It is a statistical parser that assigns constituent trees to sentences and thus determines their shallow structure. In 1998 it was adapted for Czech, namely for data in the CSTS format, where the sentence structure takes the form of dependency trees. The adaptation is described in this paper:

Michael Collins, Jan Hajič, Lance Ramshaw and Christoph Tillmann: A Statistical Parser for Czech, Proceedings of the 37th Annual Meeting of the ACL

The current accuracy of the parser on PDT 1.0 is 82.61% on the development test data and 82.76% on the evaluation test data. On PDT 2.0, the accuracy is 82.43% on the development test data and 81.57% on the evaluation test data. All the data were tagged automatically.

You can download the parser from ~honet/download/collins/.

2. User's Guide

2.1. General Information

Environment variables

The environment variable COLLINS_PATH must point to the parser's root folder.

Limitations

The parser has several limits; the most important are:

sentences must be shorter than 150 words (PMAXWORDS defined in sentence.h),
words have to be shorter than 100 characters (MAXWORDLEN defined in sentence.h),
morphologic tags have to be shorter than 100 characters (MAXTAGLEN defined in sentence.h).
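
A pre-flight check for these limits could look like the following Python sketch. The constants mirror the names quoted from sentence.h; the function check_sentence and the (form, tag) pair representation are assumptions made for illustration and are not part of the parser. The guide does not say whether the C code counts bytes or characters, so the sketch counts characters.

```python
# Limits quoted in this guide (PMAXWORDS, MAXWORDLEN, MAXTAGLEN in sentence.h).
PMAXWORDS = 150
MAXWORDLEN = 100
MAXTAGLEN = 100


def check_sentence(words):
    """Return a list of problems for one sentence given as (form, tag) pairs.

    Hypothetical helper: it only restates the limits from the guide.
    """
    problems = []
    if len(words) >= PMAXWORDS:
        problems.append(f"sentence has {len(words)} words, must be shorter than {PMAXWORDS}")
    for position, (form, tag) in enumerate(words, start=1):
        if len(form) >= MAXWORDLEN:
            problems.append(f"word {position} is too long")
        if len(tag) >= MAXTAGLEN:
            problems.append(f"tag of word {position} is too long")
    return problems
```

An empty result means the sentence stays within all three limits.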

Config Files

The parser (both the parsing and the training script) always loads the default.config file in its root folder (set by COLLINS_PATH). A user config file can also be loaded through the -c<filename> option (see also Section 2.2, "Parsing" and Section 2.3, "Training"). If an option is set in both files and its values differ, the value from the user config file is applied.

Each line of a config file holds a name-value pair describing one option; the name is separated from the value by whitespace. Text from '#' to the end of the line is treated as a comment.
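
The format and the override rule can be illustrated with a short Python sketch. The functions parse_config and load_options are hypothetical and are not the code in collins-common.pl; the sketch also assumes that '#' never appears inside a value.

```python
def parse_config(text):
    """Parse 'name value' lines; text from '#' to the end of a line is a comment."""
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or comment-only line
        parts = line.split(None, 1)  # name, then the rest of the line as the value
        options[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return options


def load_options(default_text, user_text=None):
    """Start from the default config and let the user config override it."""
    options = parse_config(default_text)
    if user_text is not None:
        options.update(parse_config(user_text))
    return options
```

For instance, if the default file sets use-hand-tags to 0 and the user file sets it to 1, load_options returns 1 for that option and leaves the other defaults untouched.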

All options follow, each marked with whether it applies to training or to parsing.

train-data (parsing)

name of the model (trained by means of the collins-train.pl script); a model consists of a set of files in the ${COLLINS_PATH}/data folder, and this name is their common basename

depstotree (training)

arguments to be passed to depstotree.* program; separated by '::'

tag-set (training & parsing)

Perl script containing the maptag() function for conversion of morphologic tags; it must be in the ${COLLINS_PATH}/exec/d2t folder

use-hand-tags (training & parsing)

determines which morphologic tags will be chosen for training/parsing:

1 - manually disambiguated, i.e. t
0 - automatically disambiguated, i.e. MDt (with or without attributes)
other - automatically disambiguated with the source of the tags identified in attributes, i.e. MDt value or MDt src="value"

vdist (parsing)

Not documented; the author does not know what this option does.

2.2. Parsing

The parsing process is launched by the script collins.pl. A short usage summary is printed by collins.pl --help or collins.pl -h.

The general form of the launching command is collins.pl [config] [online] [dump] [input-file] [output-file]; the arguments are explained below.

input-file

Input file in CSTS format. If omitted, stdin is used for input. Each word must carry the following data:

the form of the word (tag f or d)
the morphologic tag (tag t or MDt, see the section called "Config Files")
order of the word in its sentence (tag r)

output-file

The parsed input file, i.e. a file identical to the input file except that a g tag with a link to the parent is added to each word. If omitted, stdout is used for output.

config

Loads the supplied config file in addition to the default one (see the section called "Config Files"). Format: -c<filename>

online

Runs the parser in on-line mode: when a sentence is entered, its parse is output without buffering (in this mode, temporary files are replaced with named pipes). Because the parser can hang in this mode for reasons that are not yet understood, the mode is optional. Format: -n

dump

Dumps intermediate results. Format: -<number><filename>

number: 1 - before parsing, 2 - after parsing
filename: name of the file to dump to

E.g.: collins.pl -cmy.config -1before -2after input output means:

the config file my.config is also loaded;
data for the parser are also written into the file before;
data from the parser are also written into the file after;
input is the input file;
output is the output file.
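
The option syntax described above can be made concrete with a small Python sketch. It is not the argument handling of collins.pl; the function parse_command_line and the shape of the returned dictionary are assumptions, and the sketch does not enforce the order of the options.

```python
import re


def parse_command_line(argv):
    """Split collins.pl-style arguments into options and file names (illustrative only)."""
    settings = {"help": False, "config": None, "online": False,
                "dumps": {}, "input": None, "output": None}
    positional = []
    for arg in argv:
        dump = re.fullmatch(r"-([12])(.+)", arg)  # -<number><filename>
        if arg in ("-h", "--help"):
            settings["help"] = True
        elif arg == "-n":
            settings["online"] = True
        elif arg.startswith("-c") and len(arg) > 2:
            settings["config"] = arg[2:]  # -c<filename>
        elif dump:
            settings["dumps"][int(dump.group(1))] = dump.group(2)
        elif arg.startswith("-"):
            raise ValueError(f"unknown option: {arg}")
        else:
            positional.append(arg)
    if len(positional) > 2:
        raise ValueError("at most one input file and one output file are expected")
    if positional:
        settings["input"] = positional[0]
    if len(positional) == 2:
        settings["output"] = positional[1]
    return settings
```

Applied to the example command, the sketch yields the config file my.config, the dump files before (1) and after (2), and the file names input and output. A missing input or output stays None, which corresponds to stdin or stdout.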

To use the parser in the "on-line" mode, sentences must be ended explicitly. The ending tag </s> serves this purpose.

2.3. Training

The training process is launched by the script collins-train.pl. A short usage summary is printed when collins-train.pl is run without arguments.

The general form of the launching command is collins-train.pl [config] train-data output-name; the arguments are explained below.

train-data: File in CSTS format with word links to parents determined.
output-name: Name of the model files; files will be created in ${COLLINS_PATH}/data folder and will have names beginning with the value of this argument.
config: Loads the supplied config file in addition to the default one (see the section called "Config Files"). Format: -c<filename>.

2.4. What is new

The parser has been improved in several ways since November 2001:

a bug that lowered accuracy has been fixed,
the parser is more user-friendly now,
several constants had to be enlarged because of bigger data (PDT 1.0),
the parser has been adapted to the slightly different format of PDT 1.0,
intermediate results can be dumped,
a user config file can be supplied,
the input file is no longer read all at once; sentences are processed one by one instead (so the limit on the maximum number of sentences has disappeared),
the parser can run in "on-line" mode.

3. Developer's & Maintainer's Guide

3.1. Czech Adaptation Overview

The Czech adaptation consists of pre- and postprocessing phases added around the parser; their purpose is to convert between dependency and constituent trees. In addition, the trainer and the parser require their data in a special format, so the data must be preprocessed for that reason as well. Before training, the data are tailored for the parser and then converted from dependency trees to constituent trees. Before parsing, the data are tailored for the parser; after parsing, the resulting constituent trees are converted to dependency trees and merged with the unparsed data, because the data processed by the parser do not retain all of the original information.

3.2. Training Dissected

The training process consists of several steps; each is described below along with a sample of the data it outputs.

Beginning

The sample training file follows.

<s id=cmpr9413:002-p4s2/bcb01aba.fs/#38>
<f cap>V<MDl a>v<MDt a>RR--6----------<r>1<g>6
<f>návrzích<MDl a>návrh<MDt a>NNIP6-----A----<r>2<g>1
<f>na<MDl a>na<MDt a>RR--4----------<r>3<g>2
<f>případné<MDl a>případný<MDt a>AAFP4----1A----<r>4<g>5
<f>změny<MDl a>změna<MDt a>NNFP4-----A----<r>5<g>3
<f>vycházejí<MDl a>vycházet_:T<MDt a>VB-P---3P-AA---<r>6<g>0
<f>ze<MDl a>z<MDt a>RV--2----------<r>7<g>6
<f>svých<MDl a>svůj-1<MDt a>P8XP2----------<r>8<g>12
<f>většinou<MDl a>většinou<MDt a>Db-------------<r>9<g>10
<f>několikaletých<MDl a>několikaletý<MDt a>AAFP2----1A----<r>10<g>12
<f>podnikatelských<MDl a>podnikatelský<MDt a>AAFP2----1A----<r>11<g>12
<f>zkušeností<MDl a>zkušenost<MDt a>NNFP2-----A----<r>12<g>7
<D>
<d>.<MDl a>.<MDt a>Z:-------------<r>13<g>0

proc2.pl

The script proc2.pl is used both in parsing and training and operates in the following manner:

copies the starting/ending s-tags indicating the beginning/end of a sentence
rewrites the information about words (f- and d-tags) so that every line holds the following four pieces of information about one word, separated by spaces:
1. value of the f- or d-tag
2. value of the mapped morphologic tag (see below)
3. value of the r-tag
4. value of the g-tag (when parsing, 0 is supplied instead)
ignores every other tag
adds #START# before the beginning of a sentence and #END# after its end

Arguments of this script follow:

name of the Perl module containing the function maptag(), which maps the morphologic tag (i.e. filters irrelevant information out of it); set by the config option tag-set
information which morphologic tags to use; set by the config option use-hand-tags
information whether training (0) or parsing (1) is performed; this value is hard-coded in collins.pl and collins-train.pl

#START#
<s id=cmpr9413:002-p4s2/bcb01aba.fs/#38>
V R6 1 6
návrzích N6 2 1
na R4 3 2
případné A4 4 5
změny N4 5 3
vycházejí VB 6 0
ze R2 7 6
svých P2 8 12
většinou Db 9 10
několikaletých A2 10 12
podnikatelských A2 11 12
zkušeností N2 12 7
. Z- 13 0
#END#
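
The transformation from the training sample to the listing above can be approximated in a few lines of Python. This is a sketch only. The maptag function here is a guess that happens to reproduce the sample lines in this document (first tag position plus the case position when the case is set, otherwise the first two positions, with ':' written as '-'); the real mapping lives in the script named by tag-set. The sketch reads only MDt tags (use-hand-tags 0), assumes full-length positional tags, and does not handle closing </s> tags.

```python
import re

WORD_RE = re.compile(r"^<[fd](?:\s[^>]*)?>([^<]*)")   # word form in an f- or d-tag
TAG_RE = re.compile(r"<MDt(?:\s[^>]*)?>([^<]*)")      # machine-disambiguated tag
ORDER_RE = re.compile(r"<r>(\d+)")
PARENT_RE = re.compile(r"<g>(\d+)")


def maptag(tag):
    """Hypothetical stand-in for maptag(); a guess fitted to the samples only."""
    if tag[4] != "-":                 # fifth position: case
        return tag[0] + tag[4]
    return tag[0] + ("-" if tag[1] == ":" else tag[1])


def proc2(lines, parsing=False):
    """Turn CSTS lines into 'form tag r g' lines framed by #START#/#END#."""
    out = []
    in_sentence = False
    for line in lines:
        line = line.rstrip("\n")
        if re.match(r"<s[\s>]", line):
            if in_sentence:
                out.append("#END#")
            out.extend(["#START#", line])
            in_sentence = True
            continue
        word, tag, order = WORD_RE.match(line), TAG_RE.search(line), ORDER_RE.search(line)
        if not (word and tag and order):
            continue                  # every other tag is ignored
        parent = PARENT_RE.search(line)
        g = "0" if parsing or parent is None else parent.group(1)
        out.append(f"{word.group(1)} {maptag(tag.group(1))} {order.group(1)} {g}")
    if in_sentence:
        out.append("#END#")
    return out
```

With parsing=True the fourth column is always 0, which matches the listing shown for the parsing pipeline in Section 3.3.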

depstotree.*

The program converts dependency trees into constituent ones. Its arguments are copied from the config option depstotree. Heads of phrases are denoted with '>'.

( TOP ( #P ( >#NULL# #NULL# ) ( VP ( RP ( >R6 V ) ( NP ( >N6 návrzích ) ( RP ( >R4 na ) ( NP ( A4 případné ) ( >N4 změny ) ) ) ) ) ( >VB vycházejí ) ( RP ( >R2 ze ) ( NP ( P2 svých ) ( AP ( Db většinou ) ( >A2 několikaletých ) ) ( A2 podnikatelských ) ( >N2 zkušeností ) ) ) ) ( Z- . ) ) )
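
The shape of this conversion can be illustrated with a simplified Python sketch. It is not the algorithm of depstotree.*: it assumes that every word with dependents projects one phrase labelled with the first character of its tag plus "P", that dependents keep their surface order around the head, and that the #NULL# element is the root. The label N7P in the parser output of Section 3.3 suggests that the real program treats coordination specially, which the sketch ignores, as it ignores the depstotree config option. The function name deps_to_tree is hypothetical.

```python
def deps_to_tree(rows):
    """Convert (form, tag, order, parent) rows of one sentence to a bracketed tree."""
    nodes = {0: ("#NULL#", "#NULL#")}      # the zeroth element is the artificial root
    children = {0: []}
    for form, tag, order, parent in rows:
        nodes[order] = (form, tag)
        children.setdefault(order, [])
    for form, tag, order, parent in rows:
        children[parent].append(order)

    def build(index):
        form, tag = nodes[index]
        kids = sorted(children[index])
        if not kids:
            return f"( {tag} {form} )"      # a word without dependents stays a bare leaf
        parts = [build(k) for k in kids if k < index]
        parts.append(f"( >{tag} {form} )")  # '>' marks the head of the phrase
        parts += [build(k) for k in kids if k > index]
        return f"( {tag[0]}P " + " ".join(parts) + " )"

    return "( TOP " + build(0) + " )"
```

Under these assumptions the sketch reproduces the bracketing shown above for the sample sentence; other sentences may well differ from what depstotree.* produces.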

treetrain.*

This program creates the first file of the resulting model, *.events, whose contents follow.

3 #NULL# #NULL# TOP #P 0 0
2 #STOP# #STOP# #NULL# #NULL# #STOP# TOP #P 0 110 0 0
2 #STOP# #STOP# #NULL# #NULL# #STOP# TOP #P 0 010 0 0
3 #NULL# #NULL# #P #NULL# 0 0
2 #STOP# #STOP# #NULL# #NULL# #STOP# #P #NULL# 0 110 0 0
2 vycházejí VB #NULL# #NULL# VP #P #NULL# 0 010 0 0
2 . Z- #NULL# #NULL# Z- #P #NULL# 0 001 0 0
2 #STOP# #STOP# #NULL# #NULL# #STOP# #P #NULL# 0 001 0 0
6 1 #NULL# #NULL#
3 vycházejí VB VP VB 0 0
2 V R6 vycházejí VB RP VP VB 0 110 0 0
2 #STOP# #STOP# vycházejí VB #STOP# VP VB 0 100 0 0
2 ze R2 vycházejí VB RP VP VB 0 010 0 0
2 #STOP# #STOP# vycházejí VB #STOP# VP VB 0 000 0 0
3 V R6 RP R6 0 0
2 #STOP# #STOP# V R6 #STOP# RP R6 0 110 0 0
2 návrzích N6 V R6 NP RP R6 0 010 0 0
2 #STOP# #STOP# V R6 #STOP# RP R6 0 000 0 0
6 1 V R6
3 návrzích N6 NP N6 0 0
2 #STOP# #STOP# návrzích N6 #STOP# NP N6 0 110 0 0
2 na R4 návrzích N6 RP NP N6 0 010 0 0
2 #STOP# #STOP# návrzích N6 #STOP# NP N6 0 000 0 0
6 1 návrzích N6
3 na R4 RP R4 0 0
2 #STOP# #STOP# na R4 #STOP# RP R4 0 110 0 0
2 změny N4 na R4 NP RP R4 0 010 0 0
2 #STOP# #STOP# na R4 #STOP# RP R4 0 000 0 0
6 1 na R4
3 změny N4 NP N4 0 0
2 případné A4 změny N4 A4 NP N4 0 110 0 0
2 #STOP# #STOP# změny N4 #STOP# NP N4 0 100 0 0
2 #STOP# #STOP# změny N4 #STOP# NP N4 0 010 0 0
6 1 případné A4
6 1 změny N4
6 1 vycházejí VB
3 ze R2 RP R2 0 0
2 #STOP# #STOP# ze R2 #STOP# RP R2 0 110 0 0
2 zkušeností N2 ze R2 NP RP R2 0 010 0 0
2 #STOP# #STOP# ze R2 #STOP# RP R2 0 000 0 0
6 1 ze R2
3 zkušeností N2 NP N2 0 0
2 podnikatelských A2 zkušeností N2 A2 NP N2 0 110 0 0
2 několikaletých A2 zkušeností N2 AP NP N2 0 100 0 0
2 svých P2 zkušeností N2 P2 NP N2 0 100 0 0
2 #STOP# #STOP# zkušeností N2 #STOP# NP N2 0 100 0 0
2 #STOP# #STOP# zkušeností N2 #STOP# NP N2 0 010 0 0
6 1 svých P2
3 několikaletých A2 AP A2 0 0
2 většinou Db několikaletých A2 Db AP A2 0 110 0 0
2 #STOP# #STOP# několikaletých A2 #STOP# AP A2 0 100 0 0
2 #STOP# #STOP# několikaletých A2 #STOP# AP A2 0 010 0 0
6 1 většinou Db
6 1 několikaletých A2
6 1 podnikatelských A2
6 1 zkušeností N2
6 1 . Z-

makefiles.pl

The script creates the remaining model files. It takes the base name of the model as its argument.

The file with lexicon, *.lexicon:

#NULL# #NULL# 0
. Z- 0
V R6 0
na R4 0
návrzích N6 0
několikaletých A2 0
podnikatelských A2 0
případné A4 0
svých P2 0
vycházejí VB 0
většinou Db 0
ze R2 0
zkušeností N2 0
změny N4 0

The file with grammar, *.grm:

L AP A2 Db
L NP N2 A2
L NP N2 AP
L NP N2 P2
L NP N4 A4
L VP VB RP
R #P #NULL# VP
R #P #NULL# Z-
R NP N6 RP
R RP R2 NP
R RP R4 NP
R RP R6 NP
R VP VB RP
U #P #NULL#
U AP A2
U NP N2
U NP N4
U NP N6
U RP R2
U RP R4
U RP R6
U TOP #P
U VP VB
X #P #NULL# 0
X AP A2 0
X NP N2 0
X NP N4 0
X NP N6 0
X RP R2 0
X RP R4 0
X RP R6 0
X TOP #P 0
X VP VB 0
Y #P #NULL# 0
Y AP A2 0
Y NP N2 0
Y NP N4 0
Y NP N6 0
Y RP R2 0
Y RP R4 0
Y RP R6 0
Y TOP #P 0
Y VP VB 0

The file with non-terminal symbols, *.nts:

#NULL#
#P
A2
A4
AP
Db
N2
N4
N6
NP
P2
R2
R4
R6
RP
TOP
VB
VP
Z-

The Final Phase

As the last step, the symbol TOP in the file *.nts is moved to the beginning of the file by means of operating system utilities (the reason is not known to the author):

TOP
#NULL#
#P
A2
A4
AP
Db
N2
N4
N6
NP
P2
R2
R4
R6
RP
VB
VP
Z-
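
The same reordering, expressed in Python rather than with operating system utilities, might look like this. The function move_top_first is hypothetical and only illustrates the effect on the list of symbols.

```python
def move_top_first(symbols):
    """Return the symbols with TOP moved to the front; the order of the rest is kept."""
    if "TOP" not in symbols:
        return list(symbols)
    return ["TOP"] + [symbol for symbol in symbols if symbol != "TOP"]
```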

3.3. Parsing Dissected

Parser's Design

The parser was required to output each parsed sentence immediately after it is entered rather than waiting for the end of the input. Two things had to be done to achieve this: buffering had to be switched off, and the parser had to cope with the restriction that files could no longer be used during parsing.

Buffering of outputs must be switched off completely, i.e. with setbuf(stdout, NULL) in C and with the STREAM->autoflush(1) method in Perl.

Using files was natural because the parser's input had to be read twice: once for parsing and once for merging the parsed data with the original input. In the required pipeline processing, however, the input can be read only once. Named pipes solve this problem. The parser fork()s itself. The parent process opens the main parsing pipe with its output directed to the first named pipe, opens the second named pipe, reads the "global" input and sends it into both the parsing pipe and the second named pipe. The child process executes the merging script with the named pipes as arguments, so that the script can read from them as from files, captures its output and sends it to the "global" output. (exec() cannot be used because the command's output would be lost if it were sent to stdout.)
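
The fork-and-two-pipes arrangement can be sketched in Python on a POSIX system. This is an illustration of the idea, not the code of collins.pl: parse_line stands in for the whole parsing pipeline and merge for the merging script, both hypothetical callables, and the stand-in "parser" emits exactly one line per input line. To keep the sketch easy to check, the child writes the merged lines to a temporary file that the parent reads back at the end, whereas the real parser sends them straight to its output.

```python
import os
import tempfile


def run_online(lines, parse_line, merge):
    """Send each input line down two named pipes and merge them in a forked child."""
    workdir = tempfile.mkdtemp()
    parsed_fifo = os.path.join(workdir, "parsed")      # first named pipe
    original_fifo = os.path.join(workdir, "original")  # second named pipe
    result_path = os.path.join(workdir, "result")
    os.mkfifo(parsed_fifo)
    os.mkfifo(original_fifo)

    pid = os.fork()
    if pid == 0:
        # Child: read both pipes as if they were files and merge line by line.
        status = 1
        try:
            with open(parsed_fifo) as parsed, open(original_fifo) as original, \
                    open(result_path, "w") as result:
                for parsed_line, original_line in zip(parsed, original):
                    merged = merge(original_line.rstrip("\n"), parsed_line.rstrip("\n"))
                    result.write(merged + "\n")
                    result.flush()             # no buffering of the output
            status = 0
        finally:
            os._exit(status)                   # never return into the caller's code

    # Parent: read the input once and feed both pipes, flushing after every line.
    with open(parsed_fifo, "w") as to_parser, open(original_fifo, "w") as to_merge:
        for line in lines:
            to_parser.write(parse_line(line) + "\n")
            to_parser.flush()
            to_merge.write(line + "\n")
            to_merge.flush()
    os.waitpid(pid, 0)

    with open(result_path) as result:
        output = result.read().splitlines()
    for path in (parsed_fifo, original_fifo, result_path):
        os.remove(path)
    os.rmdir(workdir)
    return output
```

Both processes open the two pipes in the same order, which avoids a deadlock at open time; flushing after every line is what lets the child see each sentence as soon as it has been written.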

Beginning

The sample to-be-parsed file follows.

<s id=cmpr9415:001-p4s1/bcc01zua.fs/#4>
<i>b
<f cap>Výměna<MDl a>výměna<MDt a>NNFS1-----A----<r>1
<f>zboží<MDl a>zboží<MDt a>NNNS2-----A----<r>2
<f>mezi<MDl a>mezi<MDt a>RR--7----------<r>3
<f upper>ČR<MDl a>ČR_:B_;G_^(Česká_republika)<MDt a>NNFXX-----A----<r>4
<f>a<MDl a>a<MDt a>J^-------------<r>5
<f cap>Kanadou<MDl a>Kanada_;G<MDt a>NNFS7-----A----<r>6
<f>představuje<MDl a>představovat_:T<MDt a>VB-S---3P-AA---<r>7
<f>kolem<MDl a>kolem<MDt a>RR--2----------<r>8
<f>půl<MDl a>půl-1<MDt a>ClXS2----------<r>9
<f>promile<MDl a>promile<MDt a>NNNS1-----A----<r>10
<f>kanadského<MDl a>kanadský<MDt a>AAIS2----1A----<r>11
<f>zahraničního<MDl a>zahraniční<MDt a>AAIS2----1A----<r>12
<f>obchodu<MDl a>obchod<MDt a>NNIS2-----A----<r>13
<D>
<d>.<MDl a>.<MDt a>Z:-------------<r>14

proc2.pl

For a description of how proc2.pl operates, see the section called "proc2.pl".

#START#
<s id=cmpr9415:001-p4s1/bcc01zua.fs/#4>
Výměna N1 1 0
zboží N2 2 0
mezi R7 3 0
ČR NX 4 0
a J^ 5 0
Kanadou N7 6 0
představuje VB 7 0
kolem R2 8 0
půl C2 9 0
promile N1 10 0
kanadského A2 11 0
zahraničního A2 12 0
obchodu N2 13 0
. Z- 14 0
#END#

forparser.prl

The script only converts the words and the (mapped) morphologic tags into a linear form, explicitly adds the special zeroth element #NULL#, and puts the number of elements at the beginning of the line.

15 #NULL# #NULL# Výměna N1 zboží N2 mezi R7 ČR NX a J^ Kanadou N7 představuje VB kolem R2 půl C2 promile N1 kanadského A2 zahraničního A2 obchodu N2 . Z-
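
A Python sketch of this linearization, under the assumption that the count includes the #NULL# element (14 words give 15 in the sample above). The function linearize is hypothetical and is not a port of forparser.prl.

```python
def linearize(rows):
    """Build the one-line parser input from (form, tag) pairs of one sentence."""
    items = [("#NULL#", "#NULL#")] + list(rows)   # the zeroth element is added explicitly
    fields = [str(len(items))]                    # number of elements comes first
    for form, tag in items:
        fields += [form, tag]
    return " ".join(fields)
```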

filtertags.prl

15 #NULL# #NULL# Výměna N1 zboží N2 mezi R7 ČR NX a J^ Kanadou N7 představuje VB kolem R2 půl C2 promile N1 kanadského A2 zahraničního A2 obchodu N2 . Z-

parser.*

The main program. Its output contains the tree both in an indented, tree-like form and in a form with parentheses; as far as the structure is concerned, the two forms carry the same information.

PROB 1554 -72.3783 0
TOP -72.3783 #P -72.3783 #NULL# 0 #NULL#
       VP -63.981 NP -26.0739 N1 0 Výměna
             N2 0 zboží
             RP -9.09766 R7 0 mezi
                N7P -7.51357 NX 0 ČR
                    J^ 0 a
                    N7 0 Kanadou
          VB 0 představuje
          RP -8.14667 R2 0 kolem
             CP -2.6891 C2 0 půl
          NP -15.2441 N1 0 promile
             NP -5.85716 A2 0 kanadského
                A2 0 zahraničního
                N2 0 obchodu
       Z- 0 .
(TOP~~1~#NULL# (#P~~1~#NULL# #NULL#/#NULL# (VP~~2~představuje (NP~~1~Výměna Výměna/N1 zboží/N2 (RP~~1~mezi mezi/R7 (N7P~~2~a ČR/NX a/J^ Kanadou/N7 ) ) ) představuje/VB (RP~~1~kolem kolem/R2 (CP~~1~půl půl/C2 ) ) (NP~~1~promile promile/N1 (NP~~3~obchodu kanadského/A2 zahraničního/A2 obchodu/N2 ) ) ) ./Z- ) )
TIME 1

addheads.prl

The script retains just the form with parentheses and adds a head to every phrase (denoted by '>').

(TOP~~1~#NULL# (#P~~1~#NULL# #NULL#/>#NULL# (VP~~2~představuje (NP~~1~Výměna Výměna/>N1 zboží/N2 (RP~~1~mezi mezi/>R7 (N7P~~2~a ČR/NX a/>J^ Kanadou/N7 ) ) ) představuje/>VB (RP~~1~kolem kolem/>R2 (CP~~1~půl půl/>C2 ) ) (NP~~1~promile promile/>N1 (NP~~3~obchodu kanadského/A2 zahraničního/A2 obchodu/>N2 ) ) ) ./Z- ) )

fortreedep.prl

( TOP ( #P ( >#NULL# #NULL# ) ( VP ( NP ( >N1 Výměna ) ( N2 zboží ) ( RP ( >R7 mezi ) ( N7P ( NX ČR ) ( >J^ a ) ( N7 Kanadou ) ) ) ) ( >VB představuje ) ( RP ( >R2 kolem ) ( CP ( >C2 půl ) ) ) ( NP ( >N1 promile ) ( NP ( A2 kanadského ) ( A2 zahraničního ) ( >N2 obchodu ) ) ) ) ( Z- . ) ) )

treetodeps.*

The program converts constituent trees into dependency ones. The output format is the same as that of proc2.pl.

#START#
<s BLANK>
#NULL# #NULL# 0 0
Výměna N1 1 7
zboží N2 2 1
mezi R7 3 1
ČR NX 4 5
a J^ 5 3
Kanadou N7 6 5
představuje VB 7 0
kolem R2 8 7
půl C2 9 8
promile N1 10 7
kanadského A2 11 13
zahraničního A2 12 13
obchodu N2 13 10
. Z- 14 0
#END#
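
The principle of the conversion can be sketched in Python: inside every phrase, the head words of the non-head children become dependents of the phrase's head word. The sketch reads the bracketed format shown in the fortreedep.prl section, assumes that the head child is the leaf marked with '>' (or the first child when no leaf is marked, as in TOP), and assumes that the tree starts with the #NULL# element so that positions match word order. The function tree_to_deps is hypothetical and is not the algorithm of treetodeps.*.

```python
def tree_to_deps(tree):
    """Return (form, tag, order, parent) rows for a bracketed tree with '>' heads."""
    tokens = tree.split()
    pos = 0
    words = []  # [form, tag, parent] in surface order; index 0 is #NULL#

    def parse():
        """Parse one bracket; return (index of its head word, marked-as-head flag)."""
        nonlocal pos
        label = tokens[pos + 1]          # tokens[pos] is the opening "("
        pos += 2
        if tokens[pos + 1] == ")":       # leaf: ( TAG form )
            form = tokens[pos]
            pos += 2
            marked = label.startswith(">")
            words.append([form, label[1:] if marked else label, 0])
            return len(words) - 1, marked
        kids = []
        while tokens[pos] == "(":
            kids.append(parse())
        pos += 1                         # skip the closing ")"
        marked_kids = [index for index, flag in kids if flag]
        head = marked_kids[0] if marked_kids else kids[0][0]
        for index, _ in kids:
            if index != head:
                words[index][2] = head   # non-head children depend on the head
        return head, False

    parse()
    return [(form, tag, order, parent)
            for order, (form, tag, parent) in enumerate(words)]
```

For the sample tree from the fortreedep.prl section, the parents computed this way agree with the fourth column of the listing above.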

mergecoll.pl

The script takes the original file in CSTS format (the first argument) and the file containing the parsed data (the second argument), and merges the information about the shallow structure from the latter into the former. This is the final phase of parsing.

<s id=cmpr9415:001-p4s1/bcc01zua.fs/#4>
<i>b
<f cap>Výměna<MDl a>výměna<MDt a>NNFS1-----A----<r>1<g>7
<f>zboží<MDl a>zboží<MDt a>NNNS2-----A----<r>2<g>1
<f>mezi<MDl a>mezi<MDt a>RR--7----------<r>3<g>1
<f upper>ČR<MDl a>ČR_:B_;G_^(Česká_republika)<MDt a>NNFXX-----A----<r>4<g>5
<f>a<MDl a>a<MDt a>J^-------------<r>5<g>3
<f cap>Kanadou<MDl a>Kanada_;G<MDt a>NNFS7-----A----<r>6<g>5
<f>představuje<MDl a>představovat_:T<MDt a>VB-S---3P-AA---<r>7<g>0
<f>kolem<MDl a>kolem<MDt a>RR--2----------<r>8<g>7
<f>půl<MDl a>půl-1<MDt a>ClXS2----------<r>9<g>8
<f>promile<MDl a>promile<MDt a>NNNS1-----A----<r>10<g>7
<f>kanadského<MDl a>kanadský<MDt a>AAIS2----1A----<r>11<g>13
<f>zahraničního<MDl a>zahraniční<MDt a>AAIS2----1A----<r>12<g>13
<f>obchodu<MDl a>obchod<MDt a>NNIS2-----A----<r>13<g>10
<D>
<d>.<MDl a>.<MDt a>Z:-------------<r>14<g>0
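
For a single sentence, the merge can be pictured with the following Python sketch. It is not mergecoll.pl: the function merge_parents is hypothetical, it matches words by their r value, and it assumes the parsed rows use the 'form tag r g' layout shown in the treetodeps.* section.

```python
import re


def merge_parents(original_lines, parsed_rows):
    """Append <g>parent to every word line of one sentence."""
    parents = {}
    for row in parsed_rows:
        fields = row.split()
        # Keep only 'form tag r g' rows; skip markers and the #NULL# element.
        if (len(fields) == 4 and fields[0] != "#NULL#"
                and fields[2].isdigit() and fields[3].isdigit()):
            parents[fields[2]] = fields[3]
    merged = []
    for line in original_lines:
        order = re.search(r"<r>(\d+)", line)
        if re.match(r"<[fd][\s>]", line) and order and order.group(1) in parents:
            line = line + "<g>" + parents[order.group(1)]
        merged.append(line)  # all other lines pass through unchanged
    return merged
```

Lines that are not word lines, such as the s-tag or <D>, are copied as they are, which is why the merged output keeps everything the parser itself discarded.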

4. Map of Files and Folders

The whole parser is written in C and Perl. Files compiled from C sources have suffixes indicating the hardware architecture on which they were compiled.

default.config - the default configuration file
data/ - folder where models are saved to when training and loaded from when parsing
exec/ - all executable files (compiled binary programs and Perl scripts)
- collins-common.pl - Perl file with code shared by parsing and training
- collins-train.pl - the main training script
- collins.pl - the main parsing script
- d2t/ - executables for the preprocessing phase of both parser and trainer
  - depstotree.* - for training
  - p-sc-tags2.pl - module used by proc2.pl
  - proc2.pl - for parsing and training
- parser/ - the core of the parser
  - filtertags.prl
  - forparser.prl
  - parser.*
- t2d/ - executables for the postprocessing phase of the parser
  - addheads.prl
  - fortreedep.prl
  - mergecoll.pl
  - treetodeps.*
- train/ - the core of the trainer
  - makefiles.pl
  - treetrain.*
obj/ - compiled object files
- d2t/
- parser/
- t2d/
- train/
src/ - sources of the C programs
- d2t/ - sources of depstotree.*
- parser/ - sources of parser.*
- t2d/ - sources of treetodeps.*
- train/ - sources of treetrain.*