The experimental
results from NSF Workshop'99 CLSP Johns Hopkins University
Czech/English
Statistical Machine Translation
Jan Cuřín
Automatic Speech Recognition
P. Beyerlein, W. Byrne, J. M. Huerta, S. Khudanpur, B. Marthi, J. Morgan,
N. Peterek, J. Picone, W. Wang
Statistical Machine Translation Group Team
Yaser Al-Onaizan (1), Jan Cuřín (2), Michael Jahr (3), Kevin Knight (1), John Lafferty (4), Dan Melamed (5), Franz-Josef Och (6), David Purdy (7), Noah A. Smith (8), David Yarowsky (9)
(1) ISI, University of Southern California
(2) IFAL, Charles University, Prague
(3) CS, Stanford University
(4) CS Dep., Carnegie Mellon University
(5) CS Res. Dep. of West Group
(6) Technical University RWTH, Aachen
(7) Department of Defence
(8) University of Maryland
(9) CLSP, Johns Hopkins University
Automatic Speech Recognition Group Team
P. Beyerlein (1), W. Byrne (2), J. M. Huerta (3), S. Khudanpur (2), B. Marthi (4), J. Morgan (5), N. Peterek (6), J. Picone (7), W. Wang (8)
(1) Philips Research Laboratories
(2) CLSP, Johns Hopkins University
(3) Dept. ECE, Carnegie Mellon University
(4) Depts. CS and Math, University of Toronto
(5) Dept. Foreign Languages, USAMA, West Point
(6) UFAL, Charles University, Prague
(7) ISIP, Mississippi State University
(8) Dept. ECE, Rice University
Statistical
machine translation and language independent acoustic modeling were two of the topics
studied at the 1999 Johns Hopkins University Language Engineering Workshop
hosted by the Center for Language and Speech Processing. In both of these
topics Czech played the role of the target experimental language.
Automatic
translation from one human language to another using computers, better known as
machine translation (MT), is a longstanding goal of computer science.
Recently,
statistical data analysis has been used to gather MT knowledge automatically
from parallel bilingual text. Unfortunately, these techniques and tools have
not been disseminated to the scientific community in very usable form, and new
follow-on ideas have developed sporadically. In a six-week summer workshop at
Johns Hopkins University, we constructed a basic statistical MT toolkit (called
Egypt) intended for distribution to interested researchers. We describe
experiments on Czech/English statistical MT in this paper.
Language
independent acoustic modeling was one of the topics studied at the 1999 Johns
Hopkins University Language Engineering Workshop. Our work was motivated by the
need for speech recognition in languages beyond the well-studied languages of
Europe, Asia, and the Americas. The statistical techniques used for speech and
language modeling require relatively large amounts of monolingual speech and
text as training data. In the `resource-rich' languages which have such
corpora, these statistical methods have been shown to work quite well. However,
if only small amounts of training data are available in a language, these
monolingual techniques are less effective. Our goal was to address this problem
by developing techniques that reduce the amount of data needed to model
resource-poor languages by borrowing data and models from resource-rich
languages.
Automatic
translation from one human language to another using computers, better known as
machine translation (MT), is a longstanding goal of computer science. In order to
be able to perform such a task, the computer must ``know'' the two
languages: synonyms for words and phrases, grammars of the two languages, and
semantic or world knowledge. One way to incorporate such knowledge into a
computer is to use bilingual experts to hand-craft the necessary information
into the computer program. Another is to let the computer learn some of these
things automatically by examining large amounts of parallel text: documents
which are translations of each other. The Canadian government produces one such
resource, for example, in the form of parliamentary proceedings which are
recorded in both English and French. Statistical machine translation (SMT)
techniques have unfortunately not yet been applied widely in the MT community,
and the statistical approach remains very much a minority approach in the field.
This is partly because the mathematics involved was not
particularly familiar to computational linguistics researchers at the time the
techniques were first published [11].
Recently, statistical data
analysis has been used to gather MT knowledge automatically from parallel
bilingual text. Unfortunately, these techniques and tools have not been
disseminated to the scientific community in very usable form, and new follow-on
ideas have developed sporadically. In a six-week summer workshop at Johns
Hopkins University, we constructed a basic statistical MT toolkit (called
Egypt) intended for distribution to interested researchers. We also used the
toolkit as a platform for experimentation during the workshop. Our experiments
included working with distant language pairs (such as Czech/English), rapidly
porting to new language pairs, managing with small bilingual data sets,
speeding up algorithms for decoding and bilingual-text training, and
incorporating morphology, syntax, dictionaries, and cognates. We describe one
of our experiments, Czech/English statistical MT, in this paper. The toolkit
and the experiments with other language pairs are described in the final report
from the JHU workshop [10].
Our web site www.clsp.jhu.edu/ws99/projects/mt
contains downloadable SMT tools and useful MT related references.
We had available a
Czech/English corpus which is a parallel text of articles from the Reader's
Digest, years 1993-1996. The Czech part is a translation of the English one.
The Reader's Digest corpus consists of 53,000 sentence pairs from 450 articles.
Sentence pairs were first aligned automatically by the algorithm of [12],
but the resulting alignment was not of sufficient quality.
Dan Melamed realigned the corpus using SIMR/GSA [15]
during this workshop. With language-pair-specific parameter settings learned
from a small amount of word-aligned data, SIMR performance can be substantially
improved; however, this experiment simply adopted French/English settings.
There was also a lot of
manual work to do on this corpus before the workshop. Every issue of this
magazine contains only 30-60% of articles translated from English to the local
language. We had to search in the English version to find the corresponding
articles that are in the Czech version. The translations in Reader's Digest are
mostly very liberal. They include many constructions with direct speech.
Articles with culture-specific facts have been excluded.
The tools available for
Czech were: a morphological analyzer, POS tagger, and lemmatizer provided by
IFAL (Charles University, Prague) and a statistical parser for Czech developed
at a previous NLP summer workshop at Johns Hopkins University. The corpus has
been morphologically analyzed, tagged, lemmatized, and parsed by these tools.
Descriptions of these tools are in [13,14].
There is also a
Czech/English online dictionary available. This dictionary consists of 88,000
entries and covers 89% of tokens in the Czech part of the corpus.
We also experimented with a
technically-oriented Czech/English corpus from IBM. This is a huge and very
good source of Czech/English parallel data, but for a very specific domain.
This corpus consists of operating system messages and operating system guides.
These are products of localization and translation of software from English to
Czech. The translations are very literal and precise. In most cases sentences
are translated sentence by sentence. This source is not publicly available and
can be used only for internal experiments at IFAL.
Two Czech commercial
translation systems were available: PC Translator 98 and SKIK v. 4.0.
We used translations by commercial systems for evaluation purposes.
Czech, as a Slavic
language, is a highly inflectional and almost free word-order language. Most of
the functions expressed in Czech by endings (inflection) are rendered by
English word order and some function words.
For example, most Czech
nouns or personal pronouns can form singular and plural forms in 7 cases. Most
adjectives can form 4 genders, both numbers, 7 cases, 3 degrees of comparison,
and can be either of positive or negative polarity (giving 336 possibilities
for each adjective). In the corpus there are 72,000 word forms in the Czech part
against 31,000 forms in English.
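The figure of 336 is simply the product of the categories listed above, assuming they combine freely (not every combination need occur as a distinct surface form):

```latex
\[
\underbrace{4}_{\text{genders}} \times
\underbrace{2}_{\text{numbers}} \times
\underbrace{7}_{\text{cases}} \times
\underbrace{3}_{\text{degrees}} \times
\underbrace{2}_{\text{polarities}} = 336
\]
```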
Czech is a pro-drop
language. This means that the subject pronoun (I, he, they) usually has a zero
form. There are no definite and indefinite articles in Czech. The equivalent of
an English preposition can also be expressed as part of a Czech noun or pronoun
inflection. As an illustration, there are 15% more tokens in English than in
Czech in the corpus.
All these features create
problems in translation. Our implemented translation models (IBM3, IBM4) allow
only one-to-many alignments from English to Czech. Therefore it is useful to
have more words in Czech than in English. We decided to help the
translation model by preprocessing Czech into an English-like
form, ``Czech-prime,'' as we call it.
As the training program for
translation model parameters (GIZA) and decoder (from Weaver) were in
development during the workshop, we did many experiments on Czech/English
translation using the Alignment Templates system developed at the University of
Aachen [16]. This system considers whole phrases
rather than single words as the basis for the alignment models. The basic idea
is that a whole group of adjacent words in the source sentence may be aligned
with a whole group of adjacent words in the target language. As a result the
context of words has a greater influence and the changes in word order from
source to target language can be learned explicitly. The Alignment Template
approach was applied to some of the tasks considered during the Workshop. The
aim was to provide an additional baseline for the IBM3 system and to analyze
how important the modeling of word groups is for translation quality. For more
details, see [16].
The normal Czech input
containing all word forms is the baseline corpus. The next step was lemmatized
Czech input. In this version of the input, we discard information about number,
tense, gender and other features which are necessary to produce a useful
translation. In the full Czech-prime, there is information such as number or
tense attached to each lemma, which is expected to be relevant for English
translation. Furthermore, artificial words are added to the Czech corpus in
positions where they should appear in English.
An example of a Czech
sentence with artificial words (in brackets) is given in Figure 1.
Corresponding words in both languages are coindexed. There is an artificial
word [I] for the first person singular subject, a large number of artificial
articles, and an artificial preposition [of] corresponding to the Czech
genitive. Artificial words may be over-generated, as
in position 5. This over-generation can be compensated for in the
translation or language models.
English: I_1 am_2 convinced_3 that_4 []_5 team_6 work_7 is_8 the_9 key_10 for_11 the_12 realization_13 of_14 ones_15 dreams_16
Czech: [I]_1 jsem_2 přesvědčen_3, že_4 [the]_5 týmová_6 práce_7 je_8 [the]_9 klíčem_10 ke_11 [the]_12 splnění_13 [of]_14 [the]_15 snů_16
Figure 1: Addition of artificial words into Czech sentence
Here is a description of major changes for individual parts of speech in Czech:
· nouns
  o different lemma for singular and plural
  o if the noun does not govern a pronoun, an artificial article is added before the noun group (group of nouns and adjectives)
  o if the noun is in the genitive, dative, locative or instrumental case and is not governed by a preposition in the parse tree, an artificial preposition is added before the noun group
· verbs
  o different lemma for different tenses
  o if the verb does not govern a nominative noun, an artificial subject is added (artificial subjects differ for person, gender and number depending on the form of the verb)
  o special solution for the auxiliary verb to be
  o artificial word for negative verbs
· personal pronouns
  o different lemma for singular and plural
  o for the third person singular, there is a different lemma for masculine animate, feminine and others (he, she, it)
  o if the pronoun is in the genitive, dative, locative or instrumental case and is not governed by a preposition in the parse tree, an artificial preposition is added
· other pronouns
  o different lemma for singular and plural
· adjectives and adverbs
  o artificial word for the second (more) and third (the most) degree of comparison
  o artificial word for a negative adjective or adverb
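To make the noun rules above concrete, the following is a minimal sketch and not the code used at the workshop. The `Token` fields, the two parse-tree flags, the `+SG`/`+PL` lemma suffixes, and the `Nprep3`, `Nprep6`, `Nprep7` names are assumptions made for illustration; only `NtheS`, `NtheP`, and `Nprep2` appear in this paper. The noun group is simplified to the adjectives immediately preceding a noun.

```python
from dataclasses import dataclass

# Czech case numbers that trigger an artificial preposition:
# 2 = genitive, 3 = dative, 6 = locative, 7 = instrumental.
PREP_CASES = {2, 3, 6, 7}


@dataclass
class Token:
    lemma: str
    pos: str                        # "N" noun, "A" adjective, anything else passes through
    case: int = 0                   # 1..7 for nominals, 0 if not applicable
    plural: bool = False
    governs_pronoun: bool = False   # hypothetical flag read off the parse tree
    has_preposition: bool = False   # hypothetical flag read off the parse tree


def czech_prime_nouns(tokens):
    """Apply only the noun rules; return the transformed token strings."""
    out = []
    group_start = 0  # position in `out` where the current noun group begins
    for tok in tokens:
        if tok.pos == "A":
            out.append(tok.lemma)   # adjectives extend the pending noun group
            continue
        if tok.pos == "N":
            prefix = []
            if tok.case in PREP_CASES and not tok.has_preposition:
                prefix.append(f"Nprep{tok.case}")        # artificial preposition
            if not tok.governs_pronoun:
                prefix.append("NtheP" if tok.plural else "NtheS")  # artificial article
            out[group_start:group_start] = prefix        # insert before the noun group
            out.append(tok.lemma + ("+PL" if tok.plural else "+SG"))
        else:
            out.append(tok.lemma)
        group_start = len(out)      # the next noun group starts after this token
    return out


example = [
    Token("týmový", "A", case=1),
    Token("práce", "N", case=1),
    Token("k", "R"),
    Token("splnění", "N", case=3, has_preposition=True),
    Token("sen", "N", case=2, plural=True),
]
print(czech_prime_nouns(example))
```

On this fragment of the Figure 1 sentence, the sketch places the artificial preposition before the artificial article for the genitive plural noun, matching the order [of] [the] shown in the figure.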
Table 1: Examples of alignment of artificial words in the training corpus
Table 1 shows examples of artificial word
alignment in the training corpus (the first 4 cases, in order). It shows
how many times each artificial word is aligned to a particular English word.
The table contains artificial words for the singular article (NtheS),
plural article (NtheP), genitive preposition (Nprep2), first
person singular subject (Vsubj1S), third person singular subject (Vsubj3S),
and negative verb (Vnot).
Translation models for the
Alignment Templates system and the GIZA parameter estimation tool were built on
the training corpus with the Czech part preprocessed as above. The test Czech
sentences were preprocessed in the same way.
We carried out a human
evaluation of translations to observe the progress obtained with each level of
preprocessing of the Czech input. The tool for human evaluation, which allows
evaluation to be carried out over the Internet, was developed during the workshop. It
displays the original sentence (in Czech) and translations from different
translation systems. Translations are shuffled for each original sentence.
Evaluators assign marks from 1 to 5 to each translation. Mark 1 is the best,
mark 5 is the worst translation.
In our particular case the
evaluation was done by two evaluators on 66 randomly chosen sentences from the
test data. Results are in Table 2. The columns give the average counts of
assigned marks, and the rows correspond to translation systems. The average mark
assigned to each translation system is in the last column of the table.
Table 2: Human evaluation of Czech/English Translation
We
can observe the progress in translation quality obtained by the Egypt toolkit
from the baseline to the simple lemmatized version and to the English-like
version of Czech input (Czech-prime) in comparison with the two commercial
systems and the Alignment Templates system. Results on Czech/English
translation using the Alignment Templates system (AlTemp) are better than one
of the commercial systems and almost the same as the other.
The parallel corpus from
Reader's Digest is relatively small. Experience with training sets of different
sizes drawn from the Canadian Hansard corpus indicates that 50,000 sentence
pairs is only a minimal amount of data; results are significantly better
for a corpus ten times larger. Therefore, we also ran an experiment on the
strictly domain-specific data from IBM. The training set contained 1
million short sentence pairs and 10 million words in each language. The
Alignment Template system was used to train a translation model and to
translate the test part of the corpus.
Almost 34% of the sentences
from the testing data were translated exactly the same as in the reference set.
According to human evaluation on 56 randomly chosen sentences from the testing
corpus, another 30% of sentences were excellent translations, 11% were good or
acceptable, 8% of translations had bad word order, and 17% of translations were
bad.
By comparison, in the
Reader's Digest test corpus, only 1.42% of translated sentences are exactly the
same as their reference translations.
We carried out the first
experiments on statistical machine translation from Czech to English. We can
observe how the progress of translation quality depends on the preprocessing of
the Czech input. The Reader's Digest corpus output from the Alignment Template
system is comparable with translations by commercial systems. The results of
the Alignment Template system and the Egypt system are not directly comparable
as in the Egypt system the dictionary was not used as a knowledge source. In
addition, in the development of the procedure of transforming Czech to
Czech-prime we used the Alignment Template system to gather knowledge about
problematic constructions. This may have led to a bias in favor of the
Alignment Template system. Nevertheless it seems to be possible to conclude
from these results that modeling word-groups in source and target language (as
done in Alignment Templates) is important.
The results reached for the
technical, computer-oriented corpus are very good and promising. A larger amount
of data can significantly help the system. Now that the general translation tool has
been developed, it is possible to experiment with different system
parameters, such as the number of iterations of particular models, and to
adjust the translation models to better suit the Czech/English language pair.
We describe
procedures and experimental results using speech from diverse source languages
to build an ASR system for a single target language. This work is intended to
improve ASR in languages for which large amounts of training data are not
available. We have developed both knowledge based and automatic methods to map
phonetic units from the source languages to the target language. We employed
HMM adaptation techniques and Discriminative Model Combination to combine
acoustic models from the individual source languages for recognition of speech
in the target language. Experiments are described in which Czech Broadcast News
is transcribed using acoustic models trained from small amounts of Czech read
speech augmented by English, Spanish, Russian, and Mandarin acoustic models.
While in our studies we
used multiple languages simultaneously, our goal was not to build a
`multilingual' ASR system capable of recognizing several languages equally
well. We intended instead to develop a good monolingual system for a specified
target language by borrowing data and models from other languages. This is
called `language independent acoustic modeling' to suggest a similarity in
nature to speaker independent modeling. In the current state-of-the-art,
speaker independent models are first trained from multiple speakers and then
adapted to a specific speaker either before or during recognition. Analogously,
language independent modeling is a methodology that combines speech and models
from multiple source languages and transforms them for recognition in a
specific target language.
As mentioned above,
acoustic training data is only one resource needed for statistical ASR.
We have assumed that language models, pronunciations, and appropriate
acoustic processing are available for the target language, and that only
transcribed acoustic training data is in short supply. This is not a completely
unrealistic scenario, in that dictionaries with pronunciations are
available for many languages, as are on-line newspapers and other text.
We stress, however, that we address here only one aspect of language independent
modeling.
We have developed methods
to share data and acoustic models between languages. Underlying these methods
are `phone mappings' that describe the similarity of sounds in two different
languages. We obtain these phone mappings using both knowledge-based and
automatic methods. The knowledge-based methods rely only on acoustic-phonetic
categorizations of the individual languages and as such can be used if
no data at all is available in the target language. The automatic methods
derive phone mappings using small amounts of acoustic data in the target
language. By either approach we can borrow models from several languages
simultaneously to cover the phone inventory of the target language. The
automatic methods allow additional refinement by borrowing models
sub-phonetically at the HMM-state level. This can be especially valuable if the
target language contains phones not found in any of the source languages since
these techniques are free to assemble a new phone model from component states
of different source language phone models.
While both the automatic
and knowledge-based phone mappings can be used without modification to
construct recognizers in the target language by borrowing acoustic models from
the various source languages, HMM adaptation techniques can also be used to
improve the systems using the small amount of target language adaptation data
we assume is available. As a further refinement, we obtained the best
recognition performance not from individually adapted source language acoustic
models but by using Discriminative Model Combination (DMC) to combine models from
several languages simultaneously. This combination can be done at the sentence
or sub-word level, with better performance obtained using phone-level
combinations. We note in particular that DMC makes effective use of source
language acoustic models that by themselves do not perform well in transcribing
the target language.
We present below a
necessarily brief description of our experiments. Our web site www.clsp.jhu.edu/ws99/projects/asr
contains complete documentation of our work, some of the language data and
models used, and a more extensive bibliography of prior work in language
independent and multilingual acoustic modeling.
As part of our research program we established an experimental framework
for language independent acoustic modeling. Since this problem has not been
widely studied, we were not able to use previously defined training and test
sets. We therefore began by investigating ASR performance to find an
appropriate `operating point' for our experiments.
We chose Czech language Voice of America (VOA) broadcasts as our test domain
since news broadcasts contain a variety of different types of speech and are
relatively easy to obtain. We chose Czech since we have ongoing projects [2] from which we could
borrow resources. We also felt that studying Czech as a rapid-porting task was
realistic since, unlike Spanish or Mandarin, there is fairly little knowledge
of existing Czech ASR to influence our work. Our final test set consisted of
one week of news broadcasts, although due to evolution of our experiments, not
all the numbers reported below are directly comparable; see our web site for
more detailed reporting.
As our out-of-domain
acoustic training data, we used broadcast news recordings in English, Spanish,
and Mandarin obtained from the Linguistic Data Consortium. We also used read
Russian speech collected at West Point for computer aided foreign language
instruction and read Czech speech from the Charles University Corpus of
Financial News (CUCFN). All speech was down-sampled to 16 kHz as needed. The
acoustic models were trained from mel-frequency cepstral data using HTK [6]. Unless otherwise
noted, the source language acoustic models were monophone systems to simplify
cross-language mapping; full system descriptions are on our web site.
We built our initial Czech
broadcast news system from a ten hour Czech VOA acoustic training set using
techniques known to work well in other languages and domains. The language
model and pronouncing dictionary were taken from our previous work [2].
After obtaining the performance of this well-trained system, we drastically
reduced the size of the acoustic training set and retrained new,
impoverished acoustic models. Given our past experience and the reported
experience of others, we expected that training a system using approximately
one hour of acoustic training data would yield an ASR system that performed
substantially worse than the initial, well-trained 10 hour system. We would
then attempt to improve this impoverished system by borrowing from other
languages. However, as Table 1 shows, performance on
Czech VOA is relatively good despite large variations in training set size and
model complexity. This behavior appears to be due to the extremely regular and
careful speech used by Czech VOA announcers and not due to a preponderance of
speech by individual news anchors or other obvious similarities between
training and test sets. We note that we observed similar behavior in
experiments with Spanish VOA broadcasts.
Training set size | Model type | WER (%)
12.8 hour | 12 mixture, cross-word triphone | 27.1
10.0 hour | 20 mixture, monophone | 27.6
1.0 hour | 8 mixture, monophone | 30.2
0.5 hour | 20 mixture, monophone | 31.3
Table 1: Training and Testing on Czech VOA Broadcasts.
From these results we
concluded that the Czech VOA speech was too self-similar to be used as both
training and test data. We therefore investigated a cross-domain training
scenario in which a small amount of read speech from the CUCFN corpus would
serve as the Czech language training data. After comparing performance across
the mono-lingual Czech read and broadcast domains (Table 2), we decided to fix the 1.0 hour CUCFN read speech
training set as the Czech language acoustic training set and to attempt to
improve performance on the Czech VOA test data by borrowing from English,
Mandarin, Spanish and Russian. This provides a realistic and interesting
training scenario that involves cross-domain as well as multilingual factors.
Table 2: WER in Training and Testing on Czech
VOA Broadcasts and CUCFN Read Speech Using 20 Mixture Monophone Models.
These experiments with
Czech VOA are reported as a cautionary note to emphasize that language is just
one characteristic of speech and that other conditions, such as speaking style,
are significant factors in ASR performance. It is therefore critically
important to obtain diverse training and test sets for multilingual
experiments. It is also important that results of limited domain experiments,
such as training and testing with data from the same news programs, be
interpreted cautiously since performance may not carry over to more diverse
domains.
In some
applications, it is highly desirable to develop speech recognition systems
without any acoustic training data. In such situations, borrowing models from
other languages for which speech recognition technology is well-developed is an
attractive idea. The approaches presented here are referred to as knowledge-based
because they exploit linguistic knowledge of the languages and their phoneme
inventories, and because they have not been retrained on any target language
acoustic data.
Our initial experiments
involved simple mappings in which phones from the Czech target language were
mapped to their nearest neighbor in a single source language using a similarity
measure based on feature-based descriptions of the phones. This is a manual
procedure that leverages extensive knowledge of acoustic phonetics [3]. Our approach involved first describing the
phones in both the source and target languages in terms of their articulatory
positions, a process that leads to a description of the sounds using the
International Phonetic Alphabet (IPA) [4].
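The nearest-neighbor idea can be sketched as follows. The mapping described here was a manual procedure; the Jaccard overlap used below is a stand-in similarity measure chosen for illustration, and the phone labels and feature sets are simplified, hypothetical examples rather than the inventories used in the experiments.

```python
def feature_similarity(a, b):
    """Jaccard overlap between two non-empty sets of articulatory features."""
    return len(a & b) / len(a | b)


def map_phones(target, source):
    """Map each target-language phone to its most similar source-language phone.

    Both arguments are dicts of the form {phone_label: set_of_features}.
    """
    return {
        t: max(source, key=lambda s: feature_similarity(t_feats, source[s]))
        for t, t_feats in target.items()
    }


# Illustrative feature descriptions only (assumed, not taken from the paper).
czech = {
    "ř": {"consonant", "alveolar", "trill", "fricative", "voiced"},
}
english = {
    "r": {"consonant", "alveolar", "approximant", "voiced"},
    "z": {"consonant", "alveolar", "fricative", "voiced"},
    "s": {"consonant", "alveolar", "fricative", "voiceless"},
}
print(map_phones(czech, english))
```

With these toy feature sets the Czech phone shares four of five features with the English "z", so that is the neighbor selected; a different feature inventory could well produce a different mapping.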
The advantage of this
approach is that all languages can, in theory, be represented within the same
system. We determined the proximity of a sound in the target language to a
sound in the source language using this representation, and developed an
associated symbol-to-symbol mapping. While it was possible to achieve
reasonable mappings for each language, there are significant variations in the
level of detail used in the source language phonetic inventories. Spanish, for
example, only used 25 phones, while Russian used 44 phones. We used these
mappings to obtain baseline performance using acoustic models from the source
languages derived from these mappings. The procedure was quite simple:
represent each phone symbol in the Czech lexicon using a corresponding source
language phone located from these mappings. The performance of systems
constructed in this manner is given in Table 3.
Overall, we observe that performance is poor, in the range of 80% WER. It was a
great surprise to observe that the Russian acoustic models, though they were
trained on read speech, were a close match to the VOA data, especially
considering the differences in microphones, speaking style, and speaking rates.
We also observed from these experiments that performance for English and
Spanish was comparable, and that performance for Mandarin lagged the other systems.
Table 3: Performance Using Knowledge Based
Phone Mappings.
It was evident from the construction
of the mappings that a single source language did not provide optimal coverage
of Czech. Therefore, it was natural to explore a mapping that involved phones
from all source languages based on proximity in the IPA table. Since Russian
was clearly acoustically closer to Czech than any of the other source
languages, we excluded Russian from the set of source languages for this
experiment, so that it would not mask any trends in our knowledge-based
systems. Though we achieved modest improvements in performance (1.6% absolute
WER), we did not achieve performance comparable to the data-driven mapping methods
discussed next.
Our next attempt to
understand deficiencies in the knowledge-based system was to explore a series
of experiments in which the recognition system was allowed to chose the best
combination of phones at runtime. First, we explored a parallel pronunciation
approach [5] in which each item in the lexicon was represented
as a sequence of phones from a single language implemented using pronunciation
networks. Unfortunately, this approach resulted in slightly degraded
performance even though we had hoped that the additional degrees of freedom
would offset any systematic acoustic bias between the two domains. We next
tried a multiphone approach that allowed the recognition system to mix
and match phones from all source languages as an attempt to let the recognizer
find the best realization of a phone, rather than fixing this based on a priori
linguistic knowledge. We found minor improvement in performance over the
parallel pronunciation system, as expected. However, overall performance is
still below the best monolingual system, and far below the Russian monolingual
system. In these experiments we have observed that, though the overall WER is
high, performance at the phone-level appears to be quite good. The alignments
are plausible, and a majority of the words are only partially misrecognized.
Since Czech is an inflected language, this analysis raised some concerns that
our language modeling approach was not optimal. For example, a
morphologically-based approach might be better if the majority of the errors
occur on endings rather than stems - it could be the case that performance at a
morphological level is good, and hence the system would be usable for
information extraction tasks.
We
developed a general methodology to derive cross-language mappings automatically
both at phonetic and sub-phonetic levels. We call our approach the Confusion
Matrix approach to finding cross-lingual mappings. These confusion matrices
are tables of acoustic similarity between phones across languages. They are
obtained by first performing a mono-lingual phonetic labeling of the target
language acoustic data using the target language phone set - this can be done
manually or via forced-alignment using HMMs; we use the latter approach.
Phonetic recognition of this data is then performed using acoustic models from
each of the source languages; for this we used simple, unweighted, phone-loop
recognizers. This yields multiple phonetic segmentations of the target language
acoustic data in the source language phone inventories.
Once a criterion for
co-occurrence between two phonetic labelings of the acoustic segments is
defined (e.g., a minimum number of overlapping frames, etc.), we can arrange
the phones of the source language and target language into a matrix that
contains the counts of co-occurrences between the nth and kth
phones of the source and target languages, respectively, in the (n,k) entry of
the matrix. This matrix of co-occurrences is the confusion matrix.
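As a minimal sketch of this counting step, assume each labeling is a list of (phone, start_frame, end_frame) segments with an exclusive end frame, and that two segments co-occur when they share at least a chosen number of frames. The function names, the threshold of 3 frames, and the toy segments are illustrative assumptions, not values from the workshop.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Number of frames shared by two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def confusion_counts(source_segs, target_segs, min_overlap=3):
    """Count co-occurrences of (source phone, target phone) pairs.

    A pair co-occurs when the two segments share at least min_overlap frames
    (the threshold is an assumed criterion).
    """
    counts = defaultdict(int)
    for s_phone, s_start, s_end in source_segs:
        for t_phone, t_start, t_end in target_segs:
            if overlap(s_start, s_end, t_start, t_end) >= min_overlap:
                counts[(s_phone, t_phone)] += 1
    return dict(counts)


# Hypothetical labelings of one utterance (frame indices are made up).
# Target-language labeling, e.g. from forced alignment:
target = [("a", 0, 10), ("t", 10, 18), ("o", 18, 30)]
# Source-language labeling, e.g. from a phone-loop recognizer:
source = [("ah", 0, 9), ("d", 9, 16), ("t", 16, 19), ("ow", 19, 30)]

counts = confusion_counts(source, target)
```

Here the short source segment "t" overlaps no target segment by three or more frames, so it contributes nothing, and the result holds one count each for ("ah", "a"), ("d", "t"), and ("ow", "o"). Summing such counts over all utterances fills the (n,k) entries of the matrix.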
After the confusion matrix
between the phones of two languages is obtained, we derive mappings from this
matrix. Given a source phone (in the nth row), we would like to
select the phone in the target language that best matches it (i.e., choose the
best matching kth column). To do this we can simply choose the
column with the highest count. A better method takes into account the number of
times the kth source language phone was hypothesized by dividing the
counts of the bin (n,k) by the accumulated counts of the column k.
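The two selection rules can be compared on a toy matrix. The sketch below assumes rows index source phones and columns index target phones, as in the text; the phone labels and counts are invented for illustration.

```python
def best_by_raw_count(matrix, source_phones, target_phones):
    """For each source phone (row), pick the column with the highest raw count."""
    mapping = {}
    for n, src in enumerate(source_phones):
        k = max(range(len(target_phones)), key=lambda j: matrix[n][j])
        mapping[src] = target_phones[k]
    return mapping


def best_by_column_normalized(matrix, source_phones, target_phones):
    """Divide each bin (n, k) by the accumulated count of column k, then pick the best column."""
    n_rows, n_cols = len(source_phones), len(target_phones)
    col_totals = [sum(matrix[n][k] for n in range(n_rows)) for k in range(n_cols)]
    mapping = {}
    for n, src in enumerate(source_phones):
        scores = [
            matrix[n][k] / col_totals[k] if col_totals[k] else 0.0
            for k in range(n_cols)
        ]
        k = max(range(n_cols), key=lambda j: scores[j])
        mapping[src] = target_phones[k]
    return mapping


# Made-up counts: column "a" is very frequent and dominates the raw counts.
source_phones = ["ah", "eh"]
target_phones = ["a", "e"]
matrix = [
    [50, 30],   # row for source phone "ah"
    [150, 20],  # row for source phone "eh"
]

raw = best_by_raw_count(matrix, source_phones, target_phones)
normalized = best_by_column_normalized(matrix, source_phones, target_phones)
```

With raw counts both source phones map to "a". After column normalization "ah" maps to "e" instead (30/50 against 50/200), which shows how the normalization keeps a frequent column from absorbing every row.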
We extended this technique to
the state level, motivated by our intuition that some phones seemed hard to
match from one language to another. To obtain the subphonetic mapping, we broke
each HMM in the source and target languages into its constituent states and
derived an HMM from each of these states. Using these new, sub-phone HMMs we
constructed a new confusion matrix. As expected, we found that some of these
hard-to-match target language phones were modeled by assembling new models from
phonetic subunits from other languages.
We described above how we
established the best mapping for each phone/state of the target language. We
found that when many states and phones from various languages were
competing to represent any given target model, several models seemed to give
high counts and thus be close candidates for a reasonable match. We explored
the possibility of including several of these best matching candidates by
combining the Gaussian models in their mixtures after weighting them
accordingly. The weights used in this state combination are proportional to
the normalized counts associated with each mapping.
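One way to read this weighting is sketched below. It assumes each candidate state is a Gaussian mixture given as a list of (weight, mean, variance) components, with scalar means for brevity, and that each candidate's components are rescaled by that candidate's share of the total count. The function name and all numbers are hypothetical; the text does not specify the exact procedure.

```python
def combine_states(candidates):
    """Merge several candidate mixtures into one.

    candidates: list of (count, mixture) pairs, where mixture is a list of
    (weight, mean, variance) components whose weights sum to 1.
    Each component weight is scaled by count / total_count, so the merged
    mixture weights again sum to 1.
    """
    total = sum(count for count, _ in candidates)
    combined = []
    for count, mixture in candidates:
        share = count / total
        for weight, mean, variance in mixture:
            combined.append((share * weight, mean, variance))
    return combined


# Three best-matching source states for one target state (made-up counts and parameters).
candidates = [
    (60, [(0.5, 0.0, 1.0), (0.5, 1.0, 1.0)]),
    (30, [(1.0, 2.0, 0.5)]),
    (10, [(1.0, -1.0, 2.0)]),
]

merged = combine_states(candidates)
```

The merged mixture has four components with weights 0.3, 0.3, 0.3, and 0.1, so a candidate with more co-occurrence counts contributes more probability mass to the new state.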
Table 4
shows recognition experiments we conducted using mappings derived from
confusion matrices. For comparison in this experiment, monophone Czech models
trained on 1 hour of Czech give 38% WER. When mappings are obtained using the
phone-level confusion matrix approach, the word error rate drops below 70%.
State-level mappings further reduce the error rate of the English mappings.
Better results are obtained when multiple source languages are included
(English, Spanish and Mandarin), and state mappings are obtained for both
state-to-state mapping and best three states to a single Czech state (the
3-state method). The best result is below 55% WER. The 3-state methods reported
differ in the presence (54.4%) or absence (55.8%) of count normalization of the
columns in the confusion matrix.
| Source(s)/Method | WER |
| EN/Phone | 68.3 |
| SP/Phone | 68.7 |
| EN/State | 64.8 |
| SP/State | 70.0 |
| MA/State | 79.7 |
| EN+SP+MA/State | 62.3 |
| EN+SP+MA/3-State (without column normalization) | 55.8 |
| EN+SP+MA/3-State (with column normalization) | 54.4 |
Table 4: WER(%) Using Automatic Phone Mappings.
Despite the substantial differences in quality between the knowledge-based
phone mappings and the automatic state-level mappings,
adaptation using MLLR and MAP
on the 1.0 hour of Czech read speech largely compensates for these differences,
as shown in Table 5. However, although performance
improves significantly, the adapted systems do not individually improve over
the monolingual Czech systems.
| Language / Training Data | Mixtures / Type | Unadapted | MLLR+MAP |
| MA 10 hr. | 20 / monophone | 88.7 | 63.0 |
| SP 10 hr. | 20 / monophone | 71.6 | 50.9 |
| RU 3 hr. | 20 / monophone | 60.8 | 45.3 |
| EN 10 hr. | 20 / monophone | 75.7 | 47.2 |
| EN 10 hr. | 8 / triphone | | 35.1 |
| EN 72 hr. | 12 / triphone | | 32.7 |
| CZ 1 hr. | 20 / monophone | 33.4 | |
| CZ 1 hr. | 6 / triphone | 30.7 | |
Table 5: Adaptation WER(%) of Systems with Varying Complexities and
Amounts of Source Language Training Data
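To illustrate why a small amount of target-language speech can shift source-language models this far, the sketch below applies the standard MAP update for a single Gaussian mean, in which the prior mean and the adaptation data are interpolated according to how much data the Gaussian has seen. The prior weight tau, the scalar features, and the numbers are assumptions for illustration; the exact adaptation configuration used in the experiments is not given here.

```python
def map_adapt_mean(prior_mean, frames, occupancies, tau=10.0):
    """MAP estimate of a Gaussian mean.

    prior_mean:  mean of the source-language Gaussian (the prior)
    frames:      adaptation observations (scalars here for brevity)
    occupancies: posterior probability of this Gaussian for each frame
    tau:         prior weight; larger values trust the prior more (assumed value)
    """
    occupancy = sum(occupancies)
    weighted_sum = sum(g * x for g, x in zip(occupancies, frames))
    return (tau * prior_mean + weighted_sum) / (tau + occupancy)


# A Gaussian with prior mean 0.0 that sees adaptation frames clustered at 2.0.
few = map_adapt_mean(0.0, [2.0] * 10, [1.0] * 10)
many = map_adapt_mean(0.0, [2.0] * 90, [1.0] * 90)
```

With 10 frames the adapted mean moves halfway to the data (1.0); with 90 frames it moves to 1.8. Gaussians that receive little adaptation data stay close to the source-language model.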
Discriminative model combination [1]
aims at an optimal integration of all available acoustic and language models
into one log-linear posterior probability distribution. The coefficients of the
log-linear combination are estimated on training samples using discriminative
methods to obtain an optimal classifier. For example, a multilingual
combination at the sentence level of scores from Czech, Spanish, and Mandarin
acoustic models has the following form for a sentence hypothesis w given the
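acoustic data x:
$$ p_\Lambda(w \mid x) = \frac{1}{Z_\Lambda(x)} \, L_{cz}(w)^{\lambda_1} \, A_{cz}(x \mid w)^{\lambda_2} \, A_{sp}(x \mid w)^{\lambda_3} \, A_{ma}(x \mid w)^{\lambda_4} $$
where Lcz(w)
is the Czech language model likelihood, Acz(x|w), Asp(x|w), Ama(x|w) are the Czech, Spanish, and
Mandarin acoustic model likelihoods, and Z(x) normalizes over the competing hypotheses. The parameters λ are optimized to minimize WER on a held-out
set of Czech data.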
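Sentence-level rescoring with such a log-linear combination can be sketched as follows. In the log domain the combination is a weighted sum of log scores, and the normalization term does not affect which hypothesis ranks first, so it is omitted. The hypothesis texts, log scores, and weights below are invented for illustration; in practice the weights would be tuned on held-out data.

```python
def dmc_score(log_scores, lambdas):
    """Weighted sum of log scores: the log of the unnormalized log-linear posterior."""
    return sum(lambdas[name] * log_scores[name] for name in lambdas)


def rescore(nbest, lambdas):
    """Return the hypothesis with the highest combined score."""
    return max(nbest, key=lambda hyp: dmc_score(hyp["scores"], lambdas))


# Two hypotheses from an N-best list with made-up log-likelihoods.
nbest = [
    {"words": "dobry den",
     "scores": {"Lcz": -12.0, "Acz": -102.0, "Asp": -130.0, "Ama": -150.0}},
    {"words": "dobry ten",
     "scores": {"Lcz": -15.0, "Acz": -98.0, "Asp": -140.0, "Ama": -149.0}},
]

# Czech models only, then a combination that also weights the Spanish acoustic score.
czech_only = {"Lcz": 1.0, "Acz": 1.0}
multilingual = {"Lcz": 1.0, "Acz": 1.0, "Asp": 0.2, "Ama": 0.0}

baseline_choice = rescore(nbest, czech_only)["words"]
combined_choice = rescore(nbest, multilingual)["words"]
```

In this toy case the Czech-only scores prefer "dobry ten" (-113 against -114), while adding the Spanish acoustic score with weight 0.2 reverses the ranking in favor of "dobry den" (-140 against -141).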
Although the results are
not reported in detail here, we find that DMC rescoring at the sentence level
does not improve over the monolingual Czech performance. However, performance
can be improved by applying DMC at the phoneme-class level. For example, the
acoustic likelihood Acz(x|k) can be separated into the contributions of the vowel, consonant, and
silence models. Parameters can then be introduced to define a posterior
distribution based on these language-specific phonetic classes:
|
| Acoustic Scores and Phonetic Classes | WER(%) |
| N-Best oracle | 19.8 |
| first best (baseline) | 34.0 |
| Vru+Cru+Sru+Vsp+Csp+Ssp | 31.8 |
| Lcz+Acz+Aru+Asp+Aen | 29.2 |
| Lcz+Vcz+Ccz+Scz+Vru+Cru+Sru+Vsp+Csp+Ssp+Ven+Cen+Sen | 28.9 |
Table 6: DMC Rescoring of 1000-best Lists. The combination uses
knowledge-based mappings, the Czech language model, and the Czech, Spanish, Russian,
and English vowel, consonant, and silence models.
From
the results in Table 6 we conclude that the structuring
into phoneme classes improves performance over combination at the sentence
level. Furthermore, combination of multilingual phoneme-class models performs
better than the monolingual Czech systems, even when the monolingual systems
are optimized using DMC.
We have
presented a methodology for language independent acoustic modeling. We found
that both knowledge-based and automatic methods can be used to derive
cross-lingual phonetic mappings. Model adaptation and discriminative model
combination can then be used to further improve and merge systems from diverse
languages. Additional experiments, particularly in language adaptive training,
can be found on our web site.
ACKNOWLEDGMENTS
This
work was supported by the National Science Foundation under Grant No.
#IIS-9820687, and carried out at the 1999 Workshop on Language Engineering,
Center for Language and Speech Processing, Johns Hopkins University. Any
opinions, findings, and conclusions or recommendations expressed in this
material are those of the authors and do not necessarily reflect the views of
the National Science Foundation or The Johns Hopkins University. Satellite news
broadcast recordings were done under contract by the Linguistic Data
Consortium, Philadelphia, PA, USA. We thank M. Riley and F. Pereira of AT&T for
use of their large vocabulary decoder.
Thanks to the staff of IFAL, Charles University, Prague,
especially to Jan Hajic and Barbora Hladká for providing tools for Czech
morphological analysis, tagging and lemmatization, and to Michael Collins for
the possibility to use his statistical parser. The following grants have
contributed to the data preparation and development of tools: Project No.
VS96151 of the Ministry of Education of the Czech Republic, Grant No.
405/96/K214 of the Grant Agency of the Czech Republic. Thanks to Martin Cmejrek
for collaboration on parallel Czech/English data preparation and to Lenka
Kadlcáková and Martin Cmejrek for the prompt evaluation of translations.
Special thanks to Reader's Digest Výber (Prague, Czech Republic) for granting
the license for using their textual material and to IBM Czech Republic for the
chance to run test translations on their data.
[10]
Al-Onaizan, Yaser, Jan Curin, Michael Jahr, Kevin Knight, John Lafferty,
Dan Melamed, Franz-Josef Och, David Purdy, Noah A. Smith, David Yarowsky. 1999.
Statistical Machine Translation, Final Report, JHU Workshop 1999.
[11]
Brown, P. F., V. J. Della Pietra, S. A. Della
Pietra, and R. L. Mercer. 1993. The mathematics of statistical machine
translation: Parameter estimation. Computational Linguistics 19(2).
[12]
Gale, W. and K. Church. 1993. A program for aligning sentences in
bilingual corpora. Computational Linguistics 19(1).
[13]
Hajic, Jan, Eric Brill, Michael Collins, Barbora Hladká, Douglas Jones,
Cynthia Kuo, Lance Ramshaw, Oren Schwartz, Christoph Tillmann, and Daniel
Zeman. 1998. Core Natural Language Processing Technology Applicable to
Multiple Languages (Final Report, Summer Workshop'98). Tech. Rep., Center
for Language and Speech Processing, Johns Hopkins University.
[14]
Hajic, Jan and Barbora Hladká. 1998. Tagging inflective languages:
Prediction of morphological categories for a rich, structured tagset. In Proceedings
of Coling/ACL.
[15]
Melamed, I. Dan. 1996. A geometric approach to mapping bitext correspondence.
In Proceedings of the First Conference on Empirical Methods in Natural
Language Processing.
[16]
Och, F. J., C. Tillmann, and H. Ney. 1999. Improved
alignment models for statistical machine translation. In Proc. of the Joint
SIGDAT Conf. on Empirical Methods in Natural Language Processing and Very Large
Corpora.
[1]
P. Beyerlein, "Discriminative Model Combination", ICASSP, Seattle,
1998.
[2]
W. Byrne et al. "Large Vocabulary Speech Recognition for Read
and Broadcast Czech", 1999 Workshop on Text Speech and Dialog, Marianske
Lazne, Czech Republic.
[3]
D. Calvert, Descriptive Phonetics, Thieme, New York, 1986.
[4]
Handbook of the International Phonetic Alphabet, Cambridge University
Press, Cambridge, UK, 1999.
[5]
T. Schultz and A. Waibel, "Language Independent and Language Adaptive
Large Vocabulary Speech Recognition", ICSLP, Sydney, Australia, 1998.
[6]
S. Young et al. The HTK Book, Entropic, Inc. 1999.