English has been studied most for this purpose and many good results
have been obtained, especially using statistically-based techniques coupled
with automatic learning from large annotated corpora. However, these techniques
have rarely been applied to languages other than English, especially
to languages whose behavior differs substantially (both morphologically
and syntactically) from English, such as the Slavic languages, spoken mostly
in Europe by approximately 350 million people.
We have worked on the following parsers before and during the workshop: Michael Collins' state-of-the-art parser for English (which was being adapted for Czech), Dan Zeman's "direct" dependency parser for Czech, and Ciprian Chelba's and Fred Jelinek's Structured Language Model (as a parser). Also, we were exploring the possibility (and discovering the difficulties) of parsing an "unknown" language using the "Parsing by translation" approach.
On top of those parsers, we were working on the "Classifier Combination"
method for combining the results of several parsers to obtain still better
results. Although we could not test this method on genuinely different parsers,
interesting results were obtained using several variants of a single
parser.
In order to compare the results among different parsers, we have set
up an evaluation scheme which would be applicable to all parsers. As the
test data consist of dependency trees, it was natural to choose the following
evaluation metric:
An error occurs if and only if a dependency link goes to a wrong governing node.
Every sentence contains one artificial node (the tree root), thus the number of nodes representing "real" words equals the number of dependency links (counting also links leading to the artificial root node). Parsing accuracy can thus be defined very simply:
Accuracy = number of correct dependency links / number of words in a sentence
provided that the parser being evaluated supplies each node with exactly one upgoing link. Precision and recall can easily be defined, should some parser(s) provide partial or multiple parses.
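As a concrete illustration, the metric above can be computed in a few lines (a sketch only; the head-array encoding and index conventions here are ours, not those of the actual evaluation tool):

```python
def dependency_accuracy(gold_heads, pred_heads):
    """Accuracy = correct dependency links / number of words.

    Each argument maps word position -> position of its governing
    node, with 0 standing for the artificial root node.
    """
    assert len(gold_heads) == len(pred_heads)
    correct = sum(1 for g, p in zip(gold_heads, pred_heads) if g == p)
    return correct / len(gold_heads)

# Toy 4-word sentence: word 2 hangs on the root, words 1 and 3 on
# word 2, word 4 on word 3; the parser misattaches word 3.
gold = [2, 0, 2, 3]
pred = [2, 0, 4, 3]
print(dependency_accuracy(gold, pred))  # → 0.75
```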
Special evaluation tools have been developed, and all parser "developers" have been instructed to provide the output of their respective parsers in the test data format for automatic evaluation.
There were two test sets of data. One of them (the development test
set) has been freely available, and people could in fact use it in any
way they wanted. The final evaluation test set has been kept separate
for final and completely unbiased evaluation.
Both sets consisted of over 3500 sentences, while the training
set contained over 19 thousand sentences. The original partitioning of
the Prague Dependency Treebank into training and the two test sets is
still kept in the freely available version of the PDT (http://ufal.ms.mff.cuni.cz/pdt/pdt.html),
so that results obtained later and/or by other people are directly
comparable.
The sizes of the data:

                      Sentences     Words
  Training               19,126   327,597
  Development test        3,697    63,718
  Evaluation test         3,787    65,390
The following resources were available for the workshop (some
of them were developed in the past without direct relation to the
workshop, but others, such as the evaluation tool and the tree viewer,
were developed by the team members before and during the
workshop):
Recent work in statistical parsing of English has used lexicalized
trees as a representation, and has exploited parameterizations that lead
to probabilities which are directly associated with dependencies between
pairs of words in the tree structure. Typically, a corpus such as the Penn
treebank is used as training and test data: hand-coded rules are used to
assign head-words to each constituent in the tree, and the dependency structures
are then implicit in the tree.
In the Czech PDT corpus we have dependency annotations, but no tree structures. For parsing Czech we considered a strategy of converting dependency structures in training data to lexicalized trees, then running the parsing algorithms originally developed for English. Crucially, the mapping from dependencies to trees is one-to-many. The choice of tree structure is important in that it determines the parameterization of the model: that is, the independence assumptions that the parsing model makes. There are at least 4 degrees of freedom when deciding on the tree structures:
1) How "flat" should the trees be? The trees could be maximally flat (one phrasal level per head-word), binary branching, or anything between these two extremes.
2) What set of part of speech (POS) tags should be used?
3) What non-terminal labels should the internal nodes have?
4) How to deal with crossing dependencies?
As a baseline system we: 1) made the trees as flat as possible; 2) chose just the major categories (noun, verb etc.) as parts of speech; 3) derived non-terminal labels from the POS tag of the headword (for example, a phrase headed by a noun would be labeled NP); 4) effectively ignored the crossing dependencies problem.
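The baseline conversion can be sketched as follows (a minimal illustration with hypothetical data structures, not the actual conversion code): every head word receives a single, maximally flat phrasal level whose label is derived from the head's POS tag (e.g. a noun head yields NP).

```python
def to_flat_tree(words, tags, heads):
    """heads[i] is the governor of word i+1 (0 = artificial root).
    Returns a bracketed constituency string."""
    children = {i: [] for i in range(len(words) + 1)}
    for i, h in enumerate(heads, start=1):
        children[h].append(i)

    def build(i):
        # one phrasal level: the head word plus all its dependents,
        # kept in surface word order
        parts = sorted(children[i] + [i])
        body = " ".join(build(c) if c != i else words[i - 1]
                        for c in parts)
        return "(%sP %s)" % (tags[i - 1], body)

    top = children[0][0]  # assume exactly one word hangs on the root
    return build(top)

print(to_flat_tree(["the", "dog", "barks"], ["D", "N", "V"], [2, 3, 0]))
# → (VP (NP (DP the) dog) barks)
```

Note that this sketch ignores crossing dependencies, just as the baseline system did.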
During the workshop we refined this conversion process in several ways: modifying the tree structures for linguistically special cases such as coordination and relative clauses; experimenting with different POS tag-sets; and modifying the parsing model (for example, extending it to handle bigram dependencies at the phrasal level).
The baseline results on the final test set were 72.4% accuracy. The final system, with all refinements, recovered dependencies with 80.0% accuracy. Results on the development set showed that newspaper-style text (75% of the sentences in the test data) was parsed at around 2% higher accuracy than this overall 80% figure. These results therefore compare favorably to those on English Wall Street Journal text, where accuracy is around 90% but with considerably more training data (890,000 words vs. 347,000 for Czech).
Several techniques that had not been published before and that help the tag-based part of the parser were developed in preparation for this workshop. Although they were prepared before the workshop, the workshop contributed greatly to their implementation for the Prague Dependency Treebank and to their thorough testing.
The second part of the workshop was then devoted to lexicalizing the parser, i.e. to developing a new statistical model that deals with dependencies between words rather than between morphological tags. It is true that this part did not help the parser as much as expected (and as reported for other parsers for English), but the outcomes are still interesting, and the model developed here enables the research to continue in the future.
Let us look very briefly at the structure of the parser. Its main task can be characterized by the following expression:

    T_best = argmax_T p(T|S)

It means that the parser looks for the dependency tree T that maximizes p(T|S), where S is the sentence being parsed. In other words, we want to construct the tree that is most likely to be a dependency structure of the sentence S. Because there is no way we could decide among all possible trees directly, we have to decompose the tree probability into edge probabilities. These can be estimated from the relative frequencies of dependencies in the training data. Then, using the Viterbi search algorithm, we try to find the tree with the highest possible product of the probabilities of its edges. Here we make the significant simplifying assumption that the dependency probabilities are statistically independent, i.e.

    p(T|S) = product over all edges (d,g) in T of p(d depends on g | S)

This obviously is not true, and it weakens the parser, so we had to introduce various constraints, additional models and techniques that help us work around this weakness to some extent. A list of them follows; a more detailed description will be given later in this report.
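The independence assumption can be sketched in code as follows (the relative-frequency model over bare tag pairs is purely illustrative; the actual parser conditions on richer context):

```python
from collections import Counter

# Estimate p(dependent -> governor) by relative frequency over
# (dependent-tag, governor-tag) pairs in training data, then score a
# candidate tree as the product of its edge probabilities.

def train_edge_model(treebank):
    """treebank: list of sentences, each a list of
    (dependent_tag, governor_tag) edges."""
    edge_counts, dep_counts = Counter(), Counter()
    for sentence in treebank:
        for dep, gov in sentence:
            edge_counts[(dep, gov)] += 1
            dep_counts[dep] += 1

    def p_edge(dep, gov):
        return (edge_counts[(dep, gov)] / dep_counts[dep]
                if dep_counts[dep] else 0.0)
    return p_edge

def tree_prob(edges, p_edge):
    # independence assumption: p(T) = product of p(edge) over all edges
    prob = 1.0
    for dep, gov in edges:
        prob *= p_edge(dep, gov)
    return prob

p = train_edge_model([[("A", "N"), ("N", "V")], [("N", "V")]])
good = [("A", "N"), ("N", "V")]  # adjective on noun, noun on verb
bad = [("A", "V"), ("N", "V")]   # adjective on verb: unseen in training
print(tree_prob(good, p) > tree_prob(bad, p))  # → True
```

The Viterbi search then amounts to finding the tree with the highest such product subject to the well-formedness constraints.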
The baseline parser includes all the techniques whose development started
before the workshop so "baseline" may be read as "before lexicalization".
A deeper description of the techniques and a more diversified summary of
their contribution to the parsing accuracy will be given later in this
report. The final results include the lexical part of the parser as well
as some minor improvements that will be described later, too.
                 D-test               E-test
            norm   sci   all     norm   sci   all
  Baseline    54    48    51       57    51    54
  Final       57    52    55       58    53    56
We conducted a preliminary survey of how we might use a bilingual corpus to generate a grammar for monolingual parsing. In particular, we examined the Czech Readers Digest corpus, which consists of around 2 million words of aligned Czech and English sentences. These sentences were taken from the Czech Readers Digest articles paired with the original English Readers Digest articles from which they were translated. The essential idea is that correspondences in the bitext imply structure for the two monolingual halves of the text. In the ideal case, we would like to infer the structure itself, based on the knowledge that the aligned sentences are translations of each other, or perhaps to parse the English side and use those structures to infer the Czech parses. In these cases, we would have a way to build up a new treebank, based in part on an analysis of the English side of the bilingual corpus. Furthermore, since we do have the Czech treebank, we could compare any new results with it as a reference point. Our somewhat more modest goal was to see to what extent the English parses and Czech parses correspond, given the corpus, parsers, and other resources that were available at the workshop. Since very large Czech and English treebanks are now available, we were able to train parsers to parse both sides of the corpus.
Our original motivation for trying this experiment at the workshop was
to provide a very different source of grammatical information for Eric's
"Super Parser". Our expectation was that the Super Parser would make the
best improvement over the individual parsers when the individual parsers
behaved very differently. Being able to infer the Czech parses from the
English side of course would have met the criterion of being a very different
source of grammatical information. Moreover, to the extent that such an
exercise is possible, we would have a means of building trainable parsers
for languages for which treebanks are not available, but for which bilingual
texts are available.
The Structured Language Model consists of two main parts: a parser and a predictor. The model proceeds along a sentence in a left-to-right manner, constructing partial parses for the sentence prefix available at each word. These partial parses consist of binary branching lexical parse trees where each node is associated with a lexical item and a nonterminal or part of speech (POS) label. Each nonterminal label and lexical item pair that covers a partial parse is referred to as a headword. The predictor uses the last two exposed headwords over a sentence prefix to predict the next word in the sentence. With parameter reestimation, Chelba and Jelinek report that this technique achieves significantly lower perplexities on the UPenn Treebank of Wall Street Journal text.
Partial parses are constructed in a binary manner by considering the last two headwords and their associated tag information. The parser can choose from three moves at any given point: ADJOIN-RIGHT, ADJOIN-LEFT, or NULL. An ADJOIN-RIGHT move creates a new nonterminal that covers the last two words, percolating the right word up the parse as the headword of the new phrase. ADJOIN-LEFT acts similarly, except that the left word is percolated up. After each adjoin operation, headwords are renumbered. The parser continues to build syntactic structure over a sentence prefix in this manner until the most probable move is the NULL move, which does not change the parse structure, but passes control to the predictor, which then predicts the next word and its POS tag from the previous two headwords. The model proceeds down each sentence in this manner until the end of sentence marker is reached. Because of the large (exponential) number of possible parse trees associated with any given sentence, this model uses a multiple stack search with pruning through the space of sentence parse tree hypotheses (see Chelba, Jelinek 1998). In this manner, hypotheses with equal numbers of parser operations and predictions are compared against each other.
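The adjoin moves can be illustrated with a toy stack-based sketch (this is our simplification, not the actual SLM code; probabilities, the word predictor, and the POS labels are omitted):

```python
# Each stack entry is (headword, subtree). ADJOIN-RIGHT/LEFT merge the
# two topmost entries into a new constituent, percolating one headword
# up; NULL (here: the oracle returning None) passes control back to
# the word predictor.

def adjoin(stack, direction):
    left, right = stack[-2], stack[-1]
    head = right[0] if direction == "right" else left[0]
    return stack[:-2] + [(head, [left, right])]

def parse(words, oracle):
    """oracle(stack) returns 'right', 'left', or None (the NULL move)."""
    stack = []
    for w in words:
        stack.append((w, w))  # the predictor supplies the next word
        move = oracle(stack)
        while move is not None and len(stack) >= 2:
            stack = adjoin(stack, move)
            move = oracle(stack)
    return stack

# A head-final oracle that always adjoins right:
oracle = lambda stack: "right" if len(stack) >= 2 else None
stack = parse(["the", "dog", "barks"], oracle)
print(len(stack), stack[0][0])  # → 1 barks
```

In the real model, the choice among the three moves is of course probabilistic, which is what gives rise to the multiple-stack search described above.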
Baseline results:
No unknown word statistics. No use of external (MDt = Machine Disambiguated
tags) POS tags.
  devel   54.7%
  eval    55.5%
Use MDt tags for:

  unknown word threshold: 3
            none      unk       all
  devel     57.91%    67.42%    67.54%
  eval      57.70%    67.16%    67.45%

  unknown word threshold: 5
            none      unk       all
  devel     --        68.04%    68.18%
  eval      57.29%    68.12%    68.32%
Various directions of future research are also discussed, such as handling
crossing dependencies, parse closing strategies, optimization of the objective
function, predictor probability decomposition, changes in the data annotation
scheme, and optimization of the POS tagset.
Our ultimate goal is to combine a diverse array of mature parsers. While we are currently developing a number of parsers for Czech (see elsewhere in this report), at the moment one parser performs significantly better than the others, and appears to outperform the other parsers across all linguistic environments (see the table in the superparsing section). Therefore, we will have to wait until the other parsers improve before we can expect gains through combination.
We instead focused primarily on diversification and combination of a single parsing algorithm, namely the Collins Parser (Chapter 1 of this report) ported to Czech. To generate a set of variants of the parser, we employed a technique known as bagging. Given a training set with N instances (sentences), we generate a single bag by randomly drawing instances from the training set N times with replacement. Once we have a training bag, we can then train the parser on that bag. We can generate as many bags as we wish, with each bag containing a slightly different version of the original training corpus, containing multiple copies of some sentences and no copies of others.
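The bag-generation step can be sketched as follows (function names are ours):

```python
import random

# Each bag is a sample of N sentences drawn with replacement from an
# N-sentence training set, so a bag typically repeats some sentences
# and omits others.

def make_bags(training_set, num_bags, seed=0):
    rng = random.Random(seed)
    n = len(training_set)
    return [[training_set[rng.randrange(n)] for _ in range(n)]
            for _ in range(num_bags)]

bags = make_bags(["s1", "s2", "s3", "s4"], num_bags=3)
# each of the 3 bags contains exactly 4 (not necessarily distinct)
# sentences from the original set
```

One parser is then trained on each bag, yielding the set of parser variants to be combined.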
In the text of Chapter 5 we show the results achieved from generating
multiple parsers via bagging and then combining the outputs of these parsers.
We were consistently able to achieve an improvement in accuracy over the
single parser trained from the original training set.
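One simple combination scheme, sketched below, is per-dependency majority voting: for each word, pick the governor proposed by the most parser variants (the details of the combination method we actually used are given in Chapter 5). Note that the voted-for head sequence is not guaranteed to form a well-formed tree.

```python
from collections import Counter

# For each word, choose the governor proposed by the most parsers;
# ties are broken in favor of the first parser's choice.

def combine(parses):
    """parses: one head sequence per parser; each sequence maps word
    position -> governor position (0 = artificial root)."""
    combined = []
    for i in range(len(parses[0])):
        votes = Counter(p[i] for p in parses)
        best = max(votes, key=lambda h: (votes[h], h == parses[0][i]))
        combined.append(best)
    return combined

print(combine([[2, 0, 2], [2, 0, 1], [3, 0, 2]]))  # → [2, 0, 2]
```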
Several important problems have been identified during the work
on the various parsers which warrant future research. The most challenging
problem is the phenomenon of non-projectivity, or crossing dependency,
which makes any parser based on a Context-Free Grammar theoretically insufficient.
Also, the number of tags used for languages such as Czech is so high that
we will have to work more on tag clustering or classification to obtain
various clustering schemes suitable for various parsing techniques. Furthermore,
the Czech tagger needs to be improved, especially for the number, gender
and case categories, where its error rate of 4-6% is still too high for
parsing purposes.
Certain theoretical questions will have to be solved, too: for example, is the tree representation as chosen really the best for statistical parsing? It seems that at least for some of the parsers, some changes would help, while preserving the dependency structure which we still believe is best suited for further processing in natural language understanding tasks (i.e. after parsing is done).
And of course, the more data, the better: the annotation of the PDT
continues, and we will certainly re-run the experiments we did during the
Workshop '98 when the PDT is finished and there are more than 1 million words
for training.
The resources (data and tools) used throughout the workshop are the results (in part or in whole, and on top of the funding provided by the workshop as described in the previous paragraph) of the following institutional grants (granted to the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague, and led by Prof. Eva Hajicova): Project No. VS96151 of the Ministry of Education of the Czech Republic, and Grants No. 405/96/0198 and 405/96/K214 of the Grant Agency of the Czech Republic. The following individual grants have also contributed to the development of tools used at the workshop: Grant No. 39/94 of the Charles University, Prague, Czech Republic, and Grant No. 195/1995 of the Research Support Scheme/HESP of the Soros Open Society Fund.
The bilingual data used for the "Parsing by translation" task have been kindly prepared for use at the workshop by Martin Cmejrek and Jan Curin, now graduate students at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic. Special thanks to Reader's Digest VYBER (Prague, Czech Republic) for granting the licence for using their textual material.
The Prague Dependency Treebank data used at the workshop has been prepared and annotated by a team of people under the supervision of Jan Hajic (in alphabetical order): Alla Bemova, Eva Buranova, Jiri Karnik, Michal Kren, Petr Pajas, Jarmila Panevova, Jan Stepanek, and Zdenka Uresova. The morphological annotation of the data has been provided by a separate team, supervised by Barbora Hladka (in alphabetical order): Jiri Hana, Lenka Kebortova, Kristyna Kupkova, Pavel Kveton, Emil Jerabek, Jiri Mirovsky, Karel Skoupy and Stepan Verecky. Most of the technical work of the actual preparation of the data for the workshop has been done by (in alphabetical order): Alena Bohmova, Jan Hajic, Barbora Hladka, Kiril Ribarov and Dan Zeman, all of IFAL, Faculty of Mathematics and Physics, Charles University, Prague, before or during the workshop.
The Prague Dependency Treebank is based on raw texts provided by the Institute of Czech National Corpus, Faculty of Arts, Charles University, Prague, Czech Republic, which come from the following sources: Lidove Noviny Publishers, Mlada Fronta Publishers, Ringier CR, Academy of Science of the Czech Republic.
Last but not least, we are very grateful to the local organizers (above all to Jacob Laderman and Kimberly S. Petropoulos) who made all the technical and organizational arrangements necessary for a well-functioning and scientifically stimulating environment and thus made the Workshop'98 a very pleasant experience.