Abstract
This document describes the Czech adaptation of Michael Collins' parser from the user's point of view and, in part, from the developer's and maintainer's point of view. Michael Collins has described his parser in the following documents:
It is a statistical parser that assigns constituent trees to sentences and thus determines their shallow structure. In 1998 it was adapted for Czech, namely for data in the CSTS format, where the sentence structure has the form of dependency trees. The adaptation is described in this paper:
The current accuracy of the parser for PDT 1.0 is 82.61% on the development test data and 82.76% on the evaluation test data. For PDT 2.0, the accuracy is 82.43% on the development test data and 81.57% on the evaluation test data. All the data were tagged automatically.
You can download the parser from ~honet/download/collins/.
The environment variable COLLINS_PATH has to point to the parser's root folder.
The parser has several limits; the most important ones are set by the following constants:
- PMAXWORDS (defined in sentence.h),
- MAXWORDLEN (defined in sentence.h),
- MAXTAGLEN (defined in sentence.h).
The parser (both the parsing and the training script) always loads the default.config file in its root folder (set by COLLINS_PATH). Through the -c<filename> option (see also Section 2.2, "Parsing" and Section 2.3, "Training") a user config file can also be loaded. If an option is set in both files and the values differ, the value from the user config file is applied.
Each line of a config file holds a name-value pair describing one option; the name is separated from the value by whitespace. Text from '#' to the end of the line is regarded as a comment.
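The config rules described above (name-value pairs, '#' comments, user values overriding defaults) can be illustrated with a minimal Python sketch. It is not the parser's own Perl code. The function names are hypothetical, the option values in the demo are made up, and it assumes that '#' never occurs inside a value.

```python
def parse_config(text):
    """Parse 'name value' lines; '#' starts a comment running to end of line."""
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop the comment, if any
        if not line:
            continue                           # blank or comment-only line
        parts = line.split(None, 1)            # name, then the rest as the value
        options[parts[0]] = parts[1] if len(parts) > 1 else ""
    return options


def load_options(default_text, user_text=None):
    """Merge default options with user options; user values win on conflict."""
    options = parse_config(default_text)
    if user_text is not None:
        options.update(parse_config(user_text))
    return options


# Option names come from this document; the values are hypothetical.
default_cfg = "tag-set  maptag   # hypothetical value\ndepstotree -x\n"
user_cfg = "depstotree -y\n"
print(load_options(default_cfg, user_cfg))
```

Loading the default file first and updating it with the user file keeps the precedence rule in one place.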
All options follow, with remarks on whether they apply to training or parsing.
- ${COLLINS_PATH}/data folder
- '::'
- maptag() function for conversion of morphological tags; it has to be in the ${COLLINS_PATH}/exec/d2t folder
- t
- MDt (with or without attributes)
- MDt value or MDt src="value"
The parsing process is launched by the script collins.pl. Brief usage information is printed by collins.pl --help or collins.pl -h.
The general form of the launching command is collins.pl [config] [online] [dump] [input-file] [output-file]; the arguments are explained below.
- input-file: if omitted, stdin is used for input. The words have to contain the following data:
  - the f or d tag
  - the t or MDt tag (see the section called "Config Files")
  - the r tag
- output-file: the g tag with a link to the parent is added to the words. If omitted, stdout is used for output.
E.g., collins.pl -cmy.config -1before -2after input output means:
- my.config is also loaded;
- the -1 option is given the value before;
- the -2 option is given the value after;
- input is the input file;
- output is the output file.
To use the parser in the "on-line" mode, sentences have to be ended explicitly. The ending tag </s> serves this purpose.
The training process is launched by the script collins-train.pl. Short information on usage is obtained after executing collins-train.pl without arguments.
The general form of the launching command is collins-train.pl [config] train-data output-name; the arguments are explained below.
- output-name: the model files are saved to the ${COLLINS_PATH}/data folder and will have names beginning with the value of this argument.
The parser has been improved in more or less important ways since November 2001:
The Czech adaptation consists of adding pre- and/or postprocessing phases to the parser; their purpose is the conversion of constituent trees to dependency trees (or vice versa). Furthermore, data for the trainer/parser have to be in a special format, so they need additional preprocessing. Before training, data have to be tailored for the parser and then converted from dependency to constituent trees. Before parsing, data have to be tailored for the parser; after parsing, the resulting constituent trees have to be converted to dependency trees, and these data are merged with the unparsed data (since the data processed by the parser do not retain all the information).
The training process consists of several steps; I will describe each of them along with a sample of the data that the step outputs.
The sample training file follows.
<s id=cmpr9413:002-p4s2/bcb01aba.fs/#38>
<f cap>V<MDl a>v<MDt a>RR--6----------<r>1<g>6
<f>návrzích<MDl a>návrh<MDt a>NNIP6-----A----<r>2<g>1
<f>na<MDl a>na<MDt a>RR--4----------<r>3<g>2
<f>případné<MDl a>případný<MDt a>AAFP4----1A----<r>4<g>5
<f>změny<MDl a>změna<MDt a>NNFP4-----A----<r>5<g>3
<f>vycházejí<MDl a>vycházet_:T<MDt a>VB-P---3P-AA---<r>6<g>0
<f>ze<MDl a>z<MDt a>RV--2----------<r>7<g>6
<f>svých<MDl a>svůj-1<MDt a>P8XP2----------<r>8<g>12
<f>většinou<MDl a>většinou<MDt a>Db-------------<r>9<g>10
<f>několikaletých<MDl a>několikaletý<MDt a>AAFP2----1A----<r>10<g>12
<f>podnikatelských<MDl a>podnikatelský<MDt a>AAFP2----1A----<r>11<g>12
<f>zkušeností<MDl a>zkušenost<MDt a>NNFP2-----A----<r>12<g>7
<D>
<d>.<MDl a>.<MDt a>Z:-------------<r>13<g>0
The script proc2.pl is used both in parsing and training and operates in the following manner:
- processes the s-tags indicating the beginning/end of a sentence
- transcribes the words (t- and d-tag) so that each line holds the following four pieces of information about one word, separated by a space:
  - t- or d-tag
  - r-tag
  - g-tag (when parsing, 0 is supplied instead)
- adds #START# in front of the beginning of a sentence and #END# after its end
Arguments of this script follow:
- maptag(), which maps the morphological tag (i.e. filters irrelevant information out of it); set by the config option tag-set
- whether training (0) or parsing (1) is performed; thus it is set to a constant in collins.pl and collins-train.pl
#START#
<s id=cmpr9413:002-p4s2/bcb01aba.fs/#38>
V R6 1 6
návrzích N6 2 1
na R4 3 2
případné A4 4 5
změny N4 5 3
vycházejí VB 6 0
ze R2 7 6
svých P2 8 12
většinou Db 9 10
několikaletých A2 10 12
podnikatelských A2 11 12
zkušeností N2 12 7
. Z- 13 0
#END#
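The transformation shown in the sample can be approximated by the following Python sketch. It is not the actual proc2.pl. The maptag() rule here is only inferred from the sample data (first letter of the tag, then the case at the fifth position if present, otherwise the second letter, with ':' written as '-'), and the real function is selected by the tag-set option. Putting #START# and the s-tag on separate lines is also an assumption.

```python
import re

# One CSTS word line: <f ...>form or <d ...>form, later <MDt ...>tag,
# then <r>order and an optional <g>parent.
WORD_RE = re.compile(
    r"^<[fd][^>]*>(?P<form>[^<]*)"
    r".*?<MDt[^>]*>(?P<tag>[^<]*)"
    r"<r>(?P<r>\d+)(?:<g>(?P<g>\d+))?"
)


def maptag(tag):
    """Hypothetical tag reduction inferred from the samples in this document."""
    second = tag[4] if tag[4] != "-" else tag[1]   # case if present, else subPOS
    return tag[0] + ("-" if second == ":" else second)


def proc2(csts_lines, parsing=False):
    """Return lines of 'form tag r g' framed by #START# / #END# markers."""
    out, in_sentence = [], False
    for line in csts_lines:
        line = line.strip()
        if line.startswith("<s ") or line.startswith("<s>"):
            if in_sentence:
                out.append("#END#")
            out += ["#START#", line]
            in_sentence = True
            continue
        m = WORD_RE.match(line)
        if m:
            # When parsing, the parent is unknown, so 0 is supplied.
            g = "0" if parsing or m["g"] is None else m["g"]
            out.append(f'{m["form"]} {maptag(m["tag"])} {m["r"]} {g}')
    if in_sentence:
        out.append("#END#")
    return out


sample = [
    "<s id=x>",
    "<f cap>V<MDl a>v<MDt a>RR--6----------<r>1<g>6",
    "<D>",
    "<d>.<MDl a>.<MDt a>Z:-------------<r>2<g>0",
]
print("\n".join(proc2(sample)))
```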
The program converts dependency trees into constituent ones. Its arguments are copied from the config option depstotree. Heads of phrases are denoted with '>'.
( TOP ( #P ( >#NULL# #NULL# ) ( VP ( RP ( >R6 V ) ( NP ( >N6 návrzích ) ( RP ( >R4 na ) ( NP ( A4 případné ) ( >N4 změny ) ) ) ) ) ( >VB vycházejí ) ( RP ( >R2 ze ) ( NP ( P2 svých ) ( AP ( Db většinou ) ( >A2 několikaletých ) ) ( A2 podnikatelských ) ( >N2 zkušeností ) ) ) ) ( Z- . ) ) )
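The shape of this output can be reproduced for simple sentences by a much simplified Python sketch. It is not the algorithm of the real program, which takes options from the config file and presumably treats coordination and other cases specially. The labelling rule (first character of the head's tag plus "P") is only inferred from the sample, and the function name is hypothetical.

```python
def deps_to_tree(words):
    """words: list of (form, tag, order, parent) with order starting at 1.

    Every word with dependents becomes a phrase whose head is marked '>';
    words without dependents stay bare '( tag form )' leaves.
    """
    children = {}
    for form, tag, order, parent in words:
        children.setdefault(parent, []).append(order)
    by_order = {order: (form, tag) for form, tag, order, _ in words}
    by_order[0] = ("#NULL#", "#NULL#")           # artificial root element

    def build(node):
        form, tag = by_order[node]
        kids = sorted(children.get(node, []))
        if not kids:
            return f"( {tag} {form} )"
        parts = [build(k) for k in kids if k < node]    # left dependents
        parts.append(f"( >{tag} {form} )")              # the head itself
        parts += [build(k) for k in kids if k > node]   # right dependents
        return f"( {tag[0]}P " + " ".join(parts) + " )"

    return "( TOP " + build(0) + " )"


print(deps_to_tree([("případné", "A4", 1, 2), ("změny", "N4", 2, 0)]))
```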
This program creates the first file of the resulting model, *.events, whose transcription follows.
3 #NULL# #NULL# TOP #P 0 0 2 #STOP# #STOP# #NULL# #NULL# #STOP# TOP #P 0 110 0 0 2 #STOP# #STOP# #NULL# #NULL# #STOP# TOP #P 0 010 0 0 3 #NULL# #NULL# #P #NULL# 0 0 2 #STOP# #STOP# #NULL# #NULL# #STOP# #P #NULL# 0 110 0 0 2 vycházejí VB #NULL# #NULL# VP #P #NULL# 0 010 0 0 2 . Z- #NULL# #NULL# Z- #P #NULL# 0 001 0 0 2 #STOP# #STOP# #NULL# #NULL# #STOP# #P #NULL# 0 001 0 0 6 1 #NULL# #NULL# 3 vycházejí VB VP VB 0 0 2 V R6 vycházejí VB RP VP VB 0 110 0 0 2 #STOP# #STOP# vycházejí VB #STOP# VP VB 0 100 0 0 2 ze R2 vycházejí VB RP VP VB 0 010 0 0 2 #STOP# #STOP# vycházejí VB #STOP# VP VB 0 000 0 0 3 V R6 RP R6 0 0 2 #STOP# #STOP# V R6 #STOP# RP R6 0 110 0 0 2 návrzích N6 V R6 NP RP R6 0 010 0 0 2 #STOP# #STOP# V R6 #STOP# RP R6 0 000 0 0 6 1 V R6 3 návrzích N6 NP N6 0 0 2 #STOP# #STOP# návrzích N6 #STOP# NP N6 0 110 0 0 2 na R4 návrzích N6 RP NP N6 0 010 0 0 2 #STOP# #STOP# návrzích N6 #STOP# NP N6 0 000 0 0 6 1 návrzích N6 3 na R4 RP R4 0 0 2 #STOP# #STOP# na R4 #STOP# RP R4 0 110 0 0 2 změny N4 na R4 NP RP R4 0 010 0 0 2 #STOP# #STOP# na R4 #STOP# RP R4 0 000 0 0 6 1 na R4 3 změny N4 NP N4 0 0 2 případné A4 změny N4 A4 NP N4 0 110 0 0 2 #STOP# #STOP# změny N4 #STOP# NP N4 0 100 0 0 2 #STOP# #STOP# změny N4 #STOP# NP N4 0 010 0 0 6 1 případné A4 6 1 změny N4 6 1 vycházejí VB 3 ze R2 RP R2 0 0 2 #STOP# #STOP# ze R2 #STOP# RP R2 0 110 0 0 2 zkušeností N2 ze R2 NP RP R2 0 010 0 0 2 #STOP# #STOP# ze R2 #STOP# RP R2 0 000 0 0 6 1 ze R2 3 zkušeností N2 NP N2 0 0 2 podnikatelských A2 zkušeností N2 A2 NP N2 0 110 0 0 2 několikaletých A2 zkušeností N2 AP NP N2 0 100 0 0 2 svých P2 zkušeností N2 P2 NP N2 0 100 0 0 2 #STOP# #STOP# zkušeností N2 #STOP# NP N2 0 100 0 0 2 #STOP# #STOP# zkušeností N2 #STOP# NP N2 0 010 0 0 6 1 svých P2 3 několikaletých A2 AP A2 0 0 2 většinou Db několikaletých A2 Db AP A2 0 110 0 0 2 #STOP# #STOP# několikaletých A2 #STOP# AP A2 0 100 0 0 2 #STOP# #STOP# několikaletých A2 #STOP# AP A2 0 010 0 0 6 1 většinou Db 6 1 několikaletých A2 6 1 
podnikatelských A2 6 1 zkušeností N2 6 1 . Z-
The script creates the remaining model files. It takes the base name of the model as its argument.
The lexicon file, *.lexicon:
#NULL# #NULL# 0 . Z- 0 V R6 0 na R4 0 návrzích N6 0 několikaletých A2 0 podnikatelských A2 0 případné A4 0 svých P2 0 vycházejí VB 0 většinou Db 0 ze R2 0 zkušeností N2 0 změny N4 0
The grammar file, *.grm:
L AP A2 Db L NP N2 A2 L NP N2 AP L NP N2 P2 L NP N4 A4 L VP VB RP R #P #NULL# VP R #P #NULL# Z- R NP N6 RP R RP R2 NP R RP R4 NP R RP R6 NP R VP VB RP U #P #NULL# U AP A2 U NP N2 U NP N4 U NP N6 U RP R2 U RP R4 U RP R6 U TOP #P U VP VB X #P #NULL# 0 X AP A2 0 X NP N2 0 X NP N4 0 X NP N6 0 X RP R2 0 X RP R4 0 X RP R6 0 X TOP #P 0 X VP VB 0 Y #P #NULL# 0 Y AP A2 0 Y NP N2 0 Y NP N4 0 Y NP N6 0 Y RP R2 0 Y RP R4 0 Y RP R6 0 Y TOP #P 0 Y VP VB 0
The file with non-terminal symbols, *.nts:
#NULL# #P A2 A4 AP Db N2 N4 N6 NP P2 R2 R4 R6 RP TOP VB VP Z-
The parser was required to output a parsed sentence immediately after it is entered, without waiting for the end of the input. Two things had to be done to achieve this: switching off buffering and coping with the limitation that files cannot be used during parsing.
Buffering of outputs must be switched off completely, i.e. with setbuf(stdout, NULL) in C and the STREAM->autoflush(1) method in Perl.
It was natural to use files because the parser's input has to be read twice: for parsing and for merging the parsed data with the original input. However, in the required pipeline processing the input can be read only once. Named pipes are used to solve this problem. The parser fork()s itself. The parent process opens the main parsing pipe with its output directed to the first named pipe, opens the second named pipe, reads the "global" input and sends it into both the parsing pipe and the second named pipe. The child process executes the merging script with the named pipes as arguments, so that the script can read from them as from files, catches its output and sends it to the "global" output. (exec() cannot be used because the command's output would be lost if it was sent to stdout.)
The sample to-be-parsed file follows.
<s id=cmpr9415:001-p4s1/bcc01zua.fs/#4>
<i>b
<f cap>Výměna<MDl a>výměna<MDt a>NNFS1-----A----<r>1
<f>zboží<MDl a>zboží<MDt a>NNNS2-----A----<r>2
<f>mezi<MDl a>mezi<MDt a>RR--7----------<r>3
<f upper>ČR<MDl a>ČR_:B_;G_^(Česká_republika)<MDt a>NNFXX-----A----<r>4
<f>a<MDl a>a<MDt a>J^-------------<r>5
<f cap>Kanadou<MDl a>Kanada_;G<MDt a>NNFS7-----A----<r>6
<f>představuje<MDl a>představovat_:T<MDt a>VB-S---3P-AA---<r>7
<f>kolem<MDl a>kolem<MDt a>RR--2----------<r>8
<f>půl<MDl a>půl-1<MDt a>ClXS2----------<r>9
<f>promile<MDl a>promile<MDt a>NNNS1-----A----<r>10
<f>kanadského<MDl a>kanadský<MDt a>AAIS2----1A----<r>11
<f>zahraničního<MDl a>zahraniční<MDt a>AAIS2----1A----<r>12
<f>obchodu<MDl a>obchod<MDt a>NNIS2-----A----<r>13
<D>
<d>.<MDl a>.<MDt a>Z:-------------<r>14
For a description of how proc2.pl operates, see the section called "proc2.pl".
#START#
<s id=cmpr9415:001-p4s1/bcc01zua.fs/#4>
Výměna N1 1 0
zboží N2 2 0
mezi R7 3 0
ČR NX 4 0
a J^ 5 0
Kanadou N7 6 0
představuje VB 7 0
kolem R2 8 0
půl C2 9 0
promile N1 10 0
kanadského A2 11 0
zahraničního A2 12 0
obchodu N2 13 0
. Z- 14 0
#END#
The script only converts words and (mapped) morphological tags into a linear form, explicitly adds the special zeroth element #NULL# and puts the number of elements at the beginning of the line.
15 #NULL# #NULL# Výměna N1 zboží N2 mezi R7 ČR NX a J^ Kanadou N7 představuje VB kolem R2 půl C2 promile N1 kanadského A2 zahraničního A2 obchodu N2 . Z-
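A minimal Python sketch of this linearisation follows. It is not the original script; the function name is hypothetical and the input is assumed to be in the four-field-per-word format produced by proc2.pl.

```python
def to_linear(proc2_lines):
    """Turn 'form tag r g' lines into 'N #NULL# #NULL# form tag form tag ...'."""
    pairs = [("#NULL#", "#NULL#")]            # the explicit zeroth element
    for line in proc2_lines:
        fields = line.split()
        if len(fields) == 4:                  # skips #START#, the s-tag, #END#
            pairs.append((fields[0], fields[1]))
    flat = " ".join(f"{form} {tag}" for form, tag in pairs)
    return f"{len(pairs)} {flat}"             # element count goes first


print(to_linear(["#START#", "<s id=x>", "Výměna N1 1 0", ". Z- 2 0", "#END#"]))
```

The count includes the #NULL# element, which is why the 14-word sample sentence starts with 15.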
15 #NULL# #NULL# Výměna N1 zboží N2 mezi R7 ČR NX a J^ Kanadou N7 představuje VB kolem R2 půl C2 promile N1 kanadského A2 zahraničního A2 obchodu N2 . Z-
The main program. As far as the structure is concerned, the tree-like form and the parenthesized form carry the same information.
PROB 1554 -72.3783 0 TOP -72.3783 #P -72.3783 #NULL# 0 #NULL# VP -63.981 NP -26.0739 N1 0 Výměna N2 0 zboží RP -9.09766 R7 0 mezi N7P -7.51357 NX 0 ČR J^ 0 a N7 0 Kanadou VB 0 představuje RP -8.14667 R2 0 kolem CP -2.6891 C2 0 půl NP -15.2441 N1 0 promile NP -5.85716 A2 0 kanadského A2 0 zahraničního N2 0 obchodu Z- 0 . (TOP~~1~#NULL# (#P~~1~#NULL# #NULL#/#NULL# (VP~~2~představuje (NP~~1~Výměna Výměna/N1 zboží/N2 (RP~~1~mezi mezi/R7 (N7P~~2~a ČR/NX a/J^ Kanadou/N7 ) ) ) představuje/VB (RP~~1~kolem kolem/R2 (CP~~1~půl půl/C2 ) ) (NP~~1~promile promile/N1 (NP~~3~obchodu kanadského/A2 zahraničního/A2 obchodu/N2 ) ) ) ./Z- ) ) TIME 1
The script retains just the parenthesized form and adds a head to every phrase (denoted by '>').
(TOP~~1~#NULL# (#P~~1~#NULL# #NULL#/>#NULL# (VP~~2~představuje (NP~~1~Výměna Výměna/>N1 zboží/N2 (RP~~1~mezi mezi/>R7 (N7P~~2~a ČR/NX a/>J^ Kanadou/N7 ) ) ) představuje/>VB (RP~~1~kolem kolem/>R2 (CP~~1~půl půl/>C2 ) ) (NP~~1~promile promile/>N1 (NP~~3~obchodu kanadského/A2 zahraničního/A2 obchodu/>N2 ) ) ) ./Z- ) )
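The head marking can be sketched in Python as follows. This is not the original script. It assumes that the number in a phrase label such as NP~~3~obchodu is the 1-based position of the head child, which is only inferred from the sample output, and that no word is itself a parenthesis.

```python
def add_heads(tree):
    """Mark the head terminal of every phrase as word/>tag."""
    out, stack = [], []        # stack items: [head_index, children_seen]
    for tok in tree.split():
        if tok == ")":
            stack.pop()                        # phrase closed
        elif tok.startswith("("):
            if stack:
                stack[-1][1] += 1              # a phrase is a child of its parent
            head_index = int(tok.split("~")[2])
            stack.append([head_index, 0])
        else:
            stack[-1][1] += 1                  # a terminal child
            if stack[-1][1] == stack[-1][0]:
                word, tag = tok.rsplit("/", 1)
                tok = f"{word}/>{tag}"
        out.append(tok)
    return " ".join(out)


print(add_heads("(NP~~1~promile promile/N1 (NP~~3~obchodu kanadského/A2 zahraničního/A2 obchodu/N2 ) )"))
```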
( TOP ( #P ( >#NULL# #NULL# ) ( VP ( NP ( >N1 Výměna ) ( N2 zboží ) ( RP ( >R7 mezi ) ( N7P ( NX ČR ) ( >J^ a ) ( N7 Kanadou ) ) ) ) ( >VB představuje ) ( RP ( >R2 kolem ) ( CP ( >C2 půl ) ) ) ( NP ( >N1 promile ) ( NP ( A2 kanadského ) ( A2 zahraničního ) ( >N2 obchodu ) ) ) ) ( Z- . ) ) )
The program converts constituent trees into dependency ones. The output format is the same as that of proc2.pl.
#START#
<s BLANK>
#NULL# #NULL# 0 0
Výměna N1 1 7
zboží N2 2 1
mezi R7 3 1
ČR NX 4 5
a J^ 5 3
Kanadou N7 6 5
představuje VB 7 0
kolem R2 8 7
půl C2 9 8
promile N1 10 7
kanadského A2 11 13
zahraničního A2 12 13
obchodu N2 13 10
. Z- 14 0
#END#
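The conversion can be illustrated by a short recursive Python sketch that reads the '>'-marked bracketed form and returns the word lines of the dependency output. It is a simplified stand-in, not the original C program. It assumes whitespace-separated tokens, no word that is itself a parenthesis, and that a phrase without a marked child (such as TOP) takes the head of its first child.

```python
def tree_to_deps(tree):
    """Return 'form tag order parent' lines; element 0 is #NULL#."""
    toks = tree.split()
    pos = 0
    words = []      # (form, tag) in sentence order
    parent = {}     # word index -> index of its governing word

    def parse():
        nonlocal pos
        pos += 1                               # skip "("
        label = toks[pos]
        pos += 1
        if toks[pos] != "(":                   # terminal: ( tag word )
            form = toks[pos]
            pos += 2                           # skip the word and ")"
            is_head = label.startswith(">")
            words.append((form, label[1:] if is_head else label))
            return len(words) - 1, is_head
        kids = []
        while toks[pos] == "(":
            kids.append(parse())
        pos += 1                               # skip ")"
        marked = [i for i, is_head in kids if is_head]
        head = marked[0] if marked else kids[0][0]
        for i, _ in kids:                      # non-head children depend on head
            if i != head:
                parent[i] = head
        return head, False

    parse()
    return [f"{form} {tag} {i} {parent.get(i, 0)}"
            for i, (form, tag) in enumerate(words)]


demo = "( TOP ( #P ( >#NULL# #NULL# ) ( NP ( A4 případné ) ( >N4 změny ) ) ) )"
print("\n".join(tree_to_deps(demo)))
```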
The script merges the original file in CSTS format (the first argument) with the file containing the parsed data (the second argument), adding the information about the shallow structure from the latter to the original file. This is the final phase of parsing.
<s id=cmpr9415:001-p4s1/bcc01zua.fs/#4>
<i>b
<f cap>Výměna<MDl a>výměna<MDt a>NNFS1-----A----<r>1<g>7
<f>zboží<MDl a>zboží<MDt a>NNNS2-----A----<r>2<g>1
<f>mezi<MDl a>mezi<MDt a>RR--7----------<r>3<g>1
<f upper>ČR<MDl a>ČR_:B_;G_^(Česká_republika)<MDt a>NNFXX-----A----<r>4<g>5
<f>a<MDl a>a<MDt a>J^-------------<r>5<g>3
<f cap>Kanadou<MDl a>Kanada_;G<MDt a>NNFS7-----A----<r>6<g>5
<f>představuje<MDl a>představovat_:T<MDt a>VB-S---3P-AA---<r>7<g>0
<f>kolem<MDl a>kolem<MDt a>RR--2----------<r>8<g>7
<f>půl<MDl a>půl-1<MDt a>ClXS2----------<r>9<g>8
<f>promile<MDl a>promile<MDt a>NNNS1-----A----<r>10<g>7
<f>kanadského<MDl a>kanadský<MDt a>AAIS2----1A----<r>11<g>13
<f>zahraničního<MDl a>zahraniční<MDt a>AAIS2----1A----<r>12<g>13
<f>obchodu<MDl a>obchod<MDt a>NNIS2-----A----<r>13<g>10
<D>
<d>.<MDl a>.<MDt a>Z:-------------<r>14<g>0
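The merge step can be sketched in Python as below. This is not the original script. It assumes one word per line in the CSTS input, that the words in both inputs come in the same order, and that the g element is simply appended to each word line, as the sample suggests.

```python
def merge(csts_lines, parsed_lines):
    """Append <g>parent to every <f>/<d> line of the original CSTS data."""
    # Parents in sentence order; markers and the #NULL# element are skipped.
    parents = iter(
        fields[3]
        for fields in (line.split() for line in parsed_lines)
        if len(fields) == 4 and fields[0] != "#NULL#"
    )
    out = []
    for line in csts_lines:
        if line.startswith(("<f", "<d")):      # word or punctuation line
            line = f"{line}<g>{next(parents)}"
        out.append(line)                       # other lines pass unchanged
    return out


csts = [
    "<s id=x>",
    "<f>kolem<MDl a>kolem<MDt a>RR--2----------<r>1",
    "<f>půl<MDl a>půl-1<MDt a>ClXS2----------<r>2",
    "<D>",
    "<d>.<MDl a>.<MDt a>Z:-------------<r>3",
]
parsed = ["#START#", "<s BLANK>", "#NULL# #NULL# 0 0",
          "kolem R2 1 0", "půl C2 2 1", ". Z- 3 0", "#END#"]
print("\n".join(merge(csts, parsed)))
```

Only the parent index is taken from the parsed data, so everything the parser's input format dropped (lemmas, full tags, other markup) survives from the original file.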
The whole parser is written in C and Perl. Files compiled from C sources have suffixes indicating the hardware architecture on which they were compiled.
default.config - the default configuration file
data/ - folder where models are saved to when training and loaded from when parsing
exec/ - all executable files (compiled binary programs and Perl scripts)
    collins-common.pl - Perl file with stuff common for parsing and training
    collins-train.pl - the main training script
    collins.pl - the main parsing script
    d2t/ - executables for the preprocessing phase of both parser and trainer
        depstotree.* - for training
        p-sc-tags2.pl - module used by proc2.pl
        proc2.pl - for parsing and training
    parser/ - the core of the parser
        filtertags.prl
        forparser.prl
        parser.*
    t2d/ - executables for the postprocessing phase of the parser
        addheads.prl
        fortreedep.prl
        mergecoll.pl
        treetodeps.*
    train/ - the core of the trainer
        makefiles.pl
        treetrain.*
obj/ - compiled object files
    d2t/
    parser/
    t2d/
    train/
src/ - sources of the C programs
    d2t/ - sources of depstotree.*
    parser/ - sources of parser.*
    t2d/ - sources of treetodeps.*
    train/ - sources of treetrain.*