The experimental results from NSF Workshop'99 CLSP Johns Hopkins University

Czech/English Statistical Machine Translation

Jan Cuřín


Automatic Speech Recognition

P. Beyerlein, W. Byrne, J. M. Huerta, S. Khudanpur, B. Marthi, J. Morgan, N. Peterek, J. Picone, W. Wang

 

 

Statistical Machine Translation Group Team

Yaser Al-Onaizan1, Jan Curin2, Michael Jahr3, Kevin Knight1, John Lafferty4, Dan Melamed5

Franz-Josef Och6, David Purdy7, Noah A. Smith8, David Yarowsky9

 

 

ISI, University of Southern California (1)

Technical University RWTH, Aachen (6)

IFAL, Charles University, Prague (2)

Department of Defence (7)

CS, Stanford University (3)

University of Maryland (8)

CS Dep., Carnegie Mellon University (4)

CLSP, Johns Hopkins University (9)

CS Res. Dep. Of West Group (5)

 


 

Automatic Speech Recognition Group Team

P. Beyerlein1, W. Byrne2, J. M. Huerta3, S. Khudanpur2, B. Marthi4

J. Morgan5, N. Peterek6, J. Picone7, W. Wang8

 

 

Philips Research Laboratories (1) 

CLSP, Johns Hopkins University (2) 

Dept. ECE, Carnegie Mellon University (3) 

Depts. CS and Math, University of Toronto (4) 

Dept. Foreign Languages, USAMA, West Point (5) 

UFAL, Charles University, Prague (6) 

ISIP, Mississippi State University (7) 

Dept. ECE, Rice University (8)

 

Abstract

Statistical machine translation and language independent acoustic modeling were two of the topics studied at the 1999 Johns Hopkins University Language Engineering Workshop hosted by the Center for Language and Speech Processing. In both of these topics Czech played the role of the target experimental language.

 

Automatic translation from one human language to another using computers, better known as machine translation (MT), is a longstanding goal of computer science.

Recently, statistical data analysis has been used to gather MT knowledge automatically from parallel bilingual text. Unfortunately, these techniques and tools have not been disseminated to the scientific community in very usable form, and new follow-on ideas have developed sporadically. In a six-week summer workshop at Johns Hopkins University, we constructed a basic statistical MT toolkit (called Egypt) intended for distribution to interested researchers. We describe experiments on Czech/English statistical MT in this paper.

 

Language independent acoustic modeling was one of the topics studied at the 1999 Johns Hopkins University Language Engineering Workshop. Our work was motivated by the need for speech recognition in languages beyond the well-studied languages of Europe, Asia, and the Americas. The statistical techniques used for speech and language modeling require relatively large amounts of monolingual speech and text as training data. In the `resource-rich' languages which have such corpora, these statistical methods have been shown to work quite well. However, if only small amounts of training data are available in a language, these monolingual techniques are less effective. Our goal was to address this problem by developing techniques that reduce the amount of data needed to model resource-poor languages by borrowing data and models from resource-rich languages.

 

1 Czech/English Statistical Machine Translation

1.1 Introduction

Automatic translation from one human language to another using computers, better known as machine translation (MT), is a longstanding goal of computer science. To perform such a task, the computer must "know" the two languages: synonyms for words and phrases, grammars of the two languages, and semantic or world knowledge. One way to incorporate such knowledge into a computer is to use bilingual experts to hand-craft the necessary information into the computer program. Another is to let the computer learn some of these things automatically by examining large amounts of parallel text: documents which are translations of each other. The Canadian government produces one such resource, for example, in the form of parliamentary proceedings which are recorded in both English and French. Statistical machine translation (SMT) techniques have unfortunately not yet been applied widely in the MT community, and the statistical approach remains very much a minority approach in the field. This is partly because the mathematics involved was not particularly familiar to computational linguistics researchers at the time the techniques were first published [11].

Recently, statistical data analysis has been used to gather MT knowledge automatically from parallel bilingual text. Unfortunately, these techniques and tools have not been disseminated to the scientific community in very usable form, and new follow-on ideas have developed sporadically. In a six-week summer workshop at Johns Hopkins University, we constructed a basic statistical MT toolkit (called Egypt) intended for distribution to interested researchers. We also used the toolkit as a platform for experimentation during the workshop. Our experiments included working with distant language pairs (such as Czech/English), rapidly porting to new language pairs, managing with small bilingual data sets, speeding up algorithms for decoding and bilingual-text training, and incorporating morphology, syntax, dictionaries, and cognates. In this paper we describe one of these experiments, Czech/English statistical MT. The toolkit and the experiments with other language pairs are described in the final report from the JHU workshop [10]. Our web site www.clsp.jhu.edu/ws99/projects/mt contains downloadable SMT tools and useful MT-related references.

1.2  Resources

We had available a Czech/English parallel corpus of articles from Reader's Digest, years 1993-1996. The Czech part is a translation of the English one. The Reader's Digest corpus consists of 53,000 sentence pairs from 450 articles. Sentence pairs were first aligned automatically using the algorithm of [12], but the resulting alignment was not of sufficient quality. Dan Melamed realigned the corpus using SIMR/GSA [15] during the workshop. With language-pair-specific parameter settings learned from a small amount of word-aligned data, SIMR performance can be substantially improved; however, this experiment simply adopted French/English settings.

There was also a lot of manual work to do on this corpus before the workshop. In each issue of the magazine, only 30-60% of the articles are translated from English into the local language, so we had to search the English version for the articles that correspond to those in the Czech version. The translations in Reader's Digest are mostly very liberal, and they include many constructions with direct speech. Articles with culture-specific facts were excluded.

The tools available for Czech were: a morphological analyzer, POS tagger, and lemmatizer provided by IFAL (Charles University, Prague) and a statistical parser for Czech developed at a previous NLP summer workshop at Johns Hopkins University. The corpus has been morphologically analyzed, tagged, lemmatized, and parsed by these tools. Descriptions of these tools are in [13,14].

There is also a Czech/English online dictionary available. This dictionary consists of 88,000 entries and covers 89% of the tokens in the Czech part of the corpus.

We also experimented with a technically-oriented Czech/English corpus from IBM. This is a huge and very good source of Czech/English parallel data, but for a very specific domain. This corpus consists of operating system messages and operating system guides. These are products of localization and translation of software from English to Czech. The translations are very literal and precise. In most cases sentences are translated sentence by sentence. This source is not publicly available and can be used only for internal experiments at IFAL.

Two Czech commercial translation systems were available: PC Translator 98 and SKIK v. 4.0. We used translations by commercial systems for evaluation purposes.

1.3  About the Czech Language

Czech, as a Slavic language, is a highly inflectional and almost free word-order language. Most of the functions expressed in Czech by endings (inflection) are rendered by English word order and some function words.

For example, most Czech nouns and personal pronouns have singular and plural forms in 7 cases. Most adjectives can form 4 genders, both numbers, 7 cases, 3 degrees of comparison, and can be of either positive or negative polarity (giving 336 possibilities for each adjective). In the corpus there are 72,000 word forms in the Czech part against 31,000 forms in the English part.

Czech is a pro-drop language: the subject pronoun (I, he, they) usually has a zero form. There are no definite or indefinite articles in Czech. The equivalent of an English preposition can also be expressed as part of a Czech noun or pronoun inflection. To illustrate, there are 15% more tokens in English than in Czech in the corpus.

All these features create problems in translation. Our implemented translation models (IBM3, IBM4) allow only one-to-many alignments from English to Czech, so it is useful to have more words in Czech than in English. We therefore decided to help the translation model by preprocessing Czech into an English-like form, which we call "Czech-prime."
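The figure of 336 adjective forms is simply the product of the category sizes listed above. As a quick check, here is a minimal sketch; the dictionary keys are illustrative labels, not names used by any of the workshop tools.

```python
from math import prod

# Category sizes for Czech adjectives, as stated in the text.
adjective_categories = {
    "gender": 4,
    "number": 2,
    "case": 7,
    "degree": 3,
    "polarity": 2,
}

# Every combination of category values is a potential word form.
adjective_forms = prod(adjective_categories.values())

# Nouns and personal pronouns: two numbers in seven cases.
noun_forms = 2 * 7

print(adjective_forms, noun_forms)
```

The same multiplication gives 14 potential forms per noun, which helps explain why the Czech side of the corpus has far more word forms than the English side.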

1.4  Tuning Czech-prime

As the training program for translation model parameters (GIZA) and the decoder (from Weaver) were still in development during the workshop, we did many experiments on Czech/English translation using the Alignment Templates system developed at the University of Aachen [16]. This system considers whole phrases rather than single words as the basis for the alignment models. The basic idea is that a whole group of adjacent words in the source sentence may be aligned with a whole group of adjacent words in the target language. As a result the context of words has a greater influence and the changes in word order from source to target language can be learned explicitly. The Alignment Template approach was applied to some of the tasks considered during the Workshop. The aim was to provide an additional baseline for the IBM3 system and to analyze how important the modeling of word groups is for translation quality. For more details, see [16].

The normal Czech input containing all word forms is the baseline corpus. The next step was lemmatized Czech input. In this version of the input, we discard information about number, tense, gender and other features which are necessary to produce a useful translation. In the full Czech-prime, there is information such as number or tense attached to each lemma, which is expected to be relevant for English translation. Furthermore, artificial words are added to the Czech corpus in positions where they should appear in English.

An example of a Czech sentence with artificial words (in brackets) is given in Figure 1. Corresponding words in both languages are coindexed. There is an artificial word [I] for the first person singular subject, a large number of artificial articles, and an artificial preposition [of] corresponding to the Czech genitive. Artificial words may be over-generated, as can be seen in position 5. This over-generation can be compensated for in the translation or language models.

 

I1 am2 convinced3 that4 []5 team6 work7 is8 the9 key10 for11 the12 realization13 of14 ones15 dreams16

[I] 1 jsem2 přesvědčen3, že4 [the] 5 týmová6 práce7 je8 [the] 9 klíčem10 ke11 [the] 12 splnění13 [of] 14 [the] 15 snů16

Figure 1: Addition of artificial words into Czech sentence


Here is a description of major changes for individual parts of speech in Czech:

·         nouns

o        different lemma for singular and plural

o        if the noun is not governing pronoun, the artificial article is added before the noun group (group of nouns and adjectives)

o        if the noun is in genitive, dative, locative or instrumental case, and it is not governed by a preposition in parse tree, the artificial preposition is added before the noun group

·         verbs

o        different lemma for different tenses

o        if the verb is not governing a nominative noun, the artificial subject is added (artificial subjects differ for person, gender and number depending on the form of the verb)

o        special solution for auxiliary verb to be

o        artificial word for negative verbs

·         personal pronouns

o        different lemma for singular and plural

o        for third person, singular, there is a different lemma for masculine animate, feminine and others (he, she, it)

o        if the pronoun is in genitive, dative, locative or instrumental case, and it is not governed by a preposition in the parse tree, an artificial preposition is added

·         other pronouns

o        different lemma for singular and plural

·         adjectives and adverbs

o        artificial word for second degree (more) and third degree (the most)

o        artificial word for negative adjective or adverb
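The noun and verb rules above can be sketched roughly as follows. This is an illustrative toy, not the workshop's preprocessing code: the `Token` fields, the `_S`/`_P` lemma suffixes, and every artificial word name other than those reported in the paper (NtheS, NtheP, Nprep2, Vsubj1S, Vsubj3S, Vnot) are assumptions, and the special handling of the auxiliary verb "to be", pronouns, and comparison is omitted.

```python
from dataclasses import dataclass

# Czech case numbers. Only Nprep2 (genitive) is attested in the paper;
# the other names are formed by analogy and are assumptions.
CASE_NUMBER = {"genitive": 2, "dative": 3, "locative": 6, "instrumental": 7}


@dataclass
class Token:
    lemma: str
    pos: str                              # "noun", "adj", "verb", or "other"
    number: str = "S"                     # "S" singular, "P" plural
    case: str = "nominative"
    person: int = 3
    governed_by_prep: bool = False        # a preposition governs it in the parse tree
    governs_pronoun: bool = False
    has_nominative_subject: bool = True   # verbs only
    negated: bool = False


def to_czech_prime(tokens):
    """Insert artificial words before noun groups and subjectless verbs."""
    out, pending_adjs = [], []
    for tok in tokens:
        if tok.pos == "adj":
            pending_adjs.append(tok.lemma)  # hold until the noun group closes
            continue
        if tok.pos == "noun":
            # Artificial words go before the whole noun group.
            if tok.case in CASE_NUMBER and not tok.governed_by_prep:
                out.append(f"Nprep{CASE_NUMBER[tok.case]}")
            if not tok.governs_pronoun:
                out.append(f"Nthe{tok.number}")
            out.extend(pending_adjs)
            out.append(f"{tok.lemma}_{tok.number}")  # number-specific lemma
        else:
            out.extend(pending_adjs)
            if tok.pos == "verb":
                if not tok.has_nominative_subject:
                    out.append(f"Vsubj{tok.person}{tok.number}")
                if tok.negated:
                    out.append("Vnot")
            out.append(tok.lemma)
        pending_adjs = []
    out.extend(pending_adjs)
    return out


# Lemmatized fragment of the Figure 1 sentence: "týmová práce ... ke splnění snů".
phrase = [
    Token("týmový", "adj"),
    Token("práce", "noun"),
    Token("k", "other"),
    Token("splnění", "noun", case="dative", governed_by_prep=True),
    Token("sen", "noun", number="P", case="genitive"),
]
result = to_czech_prime(phrase)
print(" ".join(result))
```

On this fragment the sketch reproduces the pattern of Figure 1: an article before each noun group and an artificial genitive preposition, followed by an article, before the ungoverned genitive noun.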

art. word : translation : % : art. word : translation : %

NtheS : the : 41.20 : Vsubj1S : I : 84.73

NtheS : a : 22.31 : Vsubj1S : my : 6.59

NtheS : , : 5.19 : Vsubj1S : NULL : 4.61

NtheS : 's : 4.06 : Vsubj1S : me : 1.28

NtheP : the : 26.61 : Vsubj3S : he : 12.61

NtheP : of : 11.68 : Vsubj3S : is : 11.22

NtheP : , : 8.02 : Vsubj3S : it : 8.02

NtheP : - : 6.83 : Vsubj3S : NULL : 7.39

Nprep2 : of : 36.85 : Vnot : n't : 27.09

Nprep2 : , : 16.89 : Vnot : not : 14.49

Nprep2 : - : 7.83 : Vnot : NULL : 12.11

Nprep2 : in : 5.53 : Vnot : no : 8.45

Table 1: Examples of alignment of artificial words in the training corpus

Table 1 shows examples of how artificial words are aligned in the training corpus (the four most frequent alignments of each word, in order). Each percentage gives the share of cases in which the artificial word is aligned to the given English word. The table contains artificial words for the singular article (NtheS), plural article (NtheP), genitive preposition (Nprep2), first person singular subject (Vsubj1S), third person singular subject (Vsubj3S), and negative verb (Vnot).

Translation models for the Alignment Templates system and the GIZA parameter estimation tool were built on the training corpus with the Czech part preprocessed as above. The test Czech sentences were preprocessed in the same way.

1.5  Evaluation

We carried out a human evaluation of translations to observe the progress obtained by each level of preprocessing of the Czech input. The tool for the human evaluation, which allows us to run an evaluation via the Internet, was developed during the workshop. It displays the original sentence (in Czech) and translations from different translation systems. Translations are shuffled for each original sentence. Evaluators assign marks from 1 to 5 to each translation, where mark 1 is the best and mark 5 is the worst.

In our particular case the evaluation was done by two evaluators on 66 randomly chosen sentences from the test data. Results are in Table 2. Rows correspond to translation systems. The columns give the average counts of assigned marks, and the last column gives the average value of the marks assigned to each translation system.

System : 1(+) : 2 : 3 : 4 : 5(-) : average

Commercial1 : 11 : 19.5 : 18 : 14.5 : 3 : 2.68182

Czech'+dict - AlTemp : 13 : 18 : 14 : 17 : 3.5 : 2.69466

Commercial2 : 7 : 18.5 : 21 : 16.5 : 3 : 2.84848

Czech' - Egypt : 6 : 16 : 16.5 : 18 : 9.5 : 3.13636

Simple lemmatized - Egypt : 3 : 12.5 : 23.5 : 23 : 4 : 3.18939

baseline - Egypt : 3.5 : 6.5 : 13.5 : 27.5 : 15 : 3.66667

Table 2: Human evaluation of Czech/English Translation

We can observe the progress in translation quality obtained by the Egypt toolkit from the baseline to the simple lemmatized version and to the English-like version of Czech input (Czech-prime), in comparison with the two commercial systems and the Alignment Templates system. The results of Czech/English translation using the Alignment Templates system (AlTemp) are better than those of one commercial system (Commercial2) and almost the same as those of the other (Commercial1).
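The averages in the last column of Table 2 can be reproduced from the mark counts as a count-weighted mean. The following is a small sketch of that arithmetic; the function and variable names are ours. Note that the AlTemp row sums to 65.5 rather than 66, so the sketch divides by the row total rather than by a fixed 66.

```python
# Counts of marks 1..5 per system, copied from Table 2.
mark_counts = {
    "Commercial1": [11, 19.5, 18, 14.5, 3],
    "Czech'+dict - AlTemp": [13, 18, 14, 17, 3.5],
    "Commercial2": [7, 18.5, 21, 16.5, 3],
    "Czech' - Egypt": [6, 16, 16.5, 18, 9.5],
    "Simple lemmatized - Egypt": [3, 12.5, 23.5, 23, 4],
    "baseline - Egypt": [3.5, 6.5, 13.5, 27.5, 15],
}


def average_mark(counts):
    """Mean mark, weighting each mark (1 = best, 5 = worst) by its count."""
    weighted = sum(mark * n for mark, n in enumerate(counts, start=1))
    return weighted / sum(counts)


averages = {name: round(average_mark(c), 5) for name, c in mark_counts.items()}

# Lower is better, so sorting ascending ranks the systems.
for name, avg in sorted(averages.items(), key=lambda item: item[1]):
    print(f"{name:28s} {avg:.5f}")
```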

1.6  Experiment with the Computer Oriented Corpus

The parallel corpus from Reader's Digest is relatively small. Experience with training sets of different sizes drawn from the Canadian Hansard corpus indicates that 50,000 sentence pairs is a bare minimum of data, and that results are significantly better for a corpus ten times larger. Therefore, we also carried out an experiment on the strictly domain-specific data from IBM. The training set contained 1 million short sentence pairs and 10 million words in each language. The Alignment Template system was used to train a translation model and to translate the test part of the corpus.

Almost 34% of the sentences from the test data were translated exactly the same as in the reference set. According to human evaluation on 56 randomly chosen sentences from the test corpus, another 30% of sentences were excellent translations, 11% were good or acceptable, 8% of translations had bad word order, and 17% of translations were bad.

By comparison, in the Reader's Digest test corpus, only 1.42% of translated sentences are exactly the same as their reference translations.

1.7  Conclusion

We carried out the first experiments on statistical machine translation from Czech to English. We can observe how the progress of translation quality depends on the preprocessing of the Czech input. The Reader's Digest corpus output from the Alignment Template system is comparable with translations by commercial systems. The results of the Alignment Template system and the Egypt system are not directly comparable as in the Egypt system the dictionary was not used as a knowledge source. In addition, in the development of the procedure of transforming Czech to Czech-prime we used the Alignment Template system to gather knowledge about problematic constructions. This may have led to a bias in favor of the Alignment Template system. Nevertheless it seems to be possible to conclude from these results that modeling word-groups in source and target language (as done in Alignment Templates) is important.

The results obtained on the technical, computer-oriented corpus are very good and promising. A larger amount of data can significantly help the system. Now that the general translation tool has been developed, it is possible to experiment with different system parameters, such as the number of iterations of particular models, and to adjust the translation models to better suit the Czech/English language pair.

2 Toward Language-Independent Acoustic Modeling

2.1  Introduction

We describe procedures and experimental results using speech from diverse source languages to build an ASR system for a single target language. This work is intended to improve ASR in languages for which large amounts of training data are not available. We have developed both knowledge based and automatic methods to map phonetic units from the source languages to the target language. We employed HMM adaptation techniques and Discriminative Model Combination to combine acoustic models from the individual source languages for recognition of speech in the target language. Experiments are described in which Czech Broadcast News is transcribed using acoustic models trained from small amounts of Czech read speech augmented by English, Spanish, Russian, and Mandarin acoustic models.

While in our studies we used multiple languages simultaneously, our goal was not to build a `multilingual' ASR system capable of recognizing several languages equally well. We intended instead to develop a good monolingual system for a specified target language by borrowing data and models from other languages. This is called `language independent acoustic modeling' to suggest a similarity in nature to speaker independent modeling. In the current state-of-the-art, speaker independent models are first trained from multiple speakers and then adapted to a specific speaker either before or during recognition. Analogously, language independent modeling is a methodology that combines speech and models from multiple source languages and transforms them for recognition in a specific target language.

As mentioned above, acoustic training data is only one resource needed for statistical ASR. We have assumed that language models, pronunciations, and appropriate acoustic processing are available for the target language, and that only transcribed acoustic training data is in short supply. This is not a completely unrealistic scenario, in that dictionaries with pronunciations are available for many languages, as are on-line newspapers and other text. We stress, however, that we address here only one aspect of language independent modeling.

We have developed methods to share data and acoustic models between languages. Underlying these methods are `phone mappings' that describe the similarity of sounds in two different languages. We obtain these phone mappings using both knowledge-based and automatic methods. The knowledge-based methods rely only on acoustic-phonetic categorizations of the individual languages and as such can be used if no data at all is available in the target language. The automatic methods derive phone mappings using small amounts of acoustic data in the target language. By either approach we can borrow models from several languages simultaneously to cover the phone inventory of the target language. The automatic methods allow additional refinement by borrowing models sub-phonetically at the HMM-state level. This can be especially valuable if the target language contains phones not found in any of the source languages since these techniques are free to assemble a new phone model from component states of different source language phone models.

While both the automatic and knowledge-based phone mappings can be used without modification to construct recognizers in the target language by borrowing acoustic models from the various source languages, HMM adaptation techniques can also be used to improve the systems using the small amount of target language adaptation data we assume is available. As a further refinement, we obtained the best recognition performance not from individually adapted source language acoustic models but by using Discriminative Model Combination (DMC) to combine models from several languages simultaneously. This combination can be done at the sentence or sub-word level, with better performance obtained using phone-level combinations. We note in particular that DMC makes effective use of source language acoustic models that by themselves do not perform well in transcribing the target language.

We present below a necessarily brief description of our experiments. Our web site www.clsp.jhu.edu/ws99/projects/asr contains complete documentation of our work, some of the language data and models used, and a more extensive bibliography of prior work in language independent and multilingual acoustic modeling.

2.2  Multilingual Training and Test Sets

As part of our research program we established an experimental framework for language independent acoustic modeling. Since this problem has not been widely studied, we were not able to use previously defined training and test sets. We therefore began by investigating ASR performance to find an appropriate `operating point' for our experiments.

We chose Czech language Voice of America (VOA) broadcasts as our test domain since news broadcasts contain a variety of different types of speech and are relatively easy to obtain. We chose Czech since we have ongoing projects [2] from which we could borrow resources. We also felt that studying Czech as a rapid-porting task was realistic since, unlike Spanish or Mandarin, there is fairly little knowledge of existing Czech ASR to influence our work. Our final test set consisted of one week of news broadcasts, although due to evolution of our experiments, not all the numbers reported below are directly comparable; see our web site for more detailed reporting.

As our out-of-domain acoustic training data, we used broadcast news recordings in English, Spanish, and Mandarin obtained from the Linguistic Data Consortium. We also used read Russian speech collected at West Point for computer aided foreign language instruction and read Czech speech from the Charles University Corpus of Financial News (CUCFN). All speech was down-sampled to 16 kHz as needed. The acoustic models were trained from mel-frequency cepstral data using HTK [6]. Unless otherwise noted, the source language acoustic models were monophone systems to simplify cross-language mapping; full system descriptions are on our web site.

We built our initial Czech broadcast news system from a ten-hour Czech VOA acoustic training set using techniques known to work well in other languages and domains. The language model and pronouncing dictionary were taken from our previous work [2]. After obtaining the performance of this well-trained system, we drastically reduced the size of the acoustic training set and retrained new, impoverished acoustic models. Given our past experience and the reported experience of others, we expected that training a system using approximately one hour of acoustic training data would yield an ASR system that performed substantially worse than the initial, well-trained ten-hour system. We would then attempt to improve this impoverished system by borrowing from other languages. However, as Table 1 shows, performance on Czech VOA is relatively good despite large variations in training set size and model complexity. This behavior appears to be due to the extremely regular and careful speech used by Czech VOA announcers and not due to a preponderance of speech by individual news anchors or other obvious similarities between training and test sets. We note that we observed similar behavior in experiments with Spanish VOA broadcasts.

Training Data : Model type : WER (%)

12.8 hour : 12 mixture, cross-word triphone : 27.1

10.0 hour : 20 mixture, monophone : 27.6

1.0 hour : 8 mixture, monophone : 30.2

0.5 hour : 20 mixture, monophone : 31.3

Table 1: Training and Testing on Czech VOA Broadcasts.

From these results we concluded that the Czech VOA speech was too self-similar to be used as both training and test data. We therefore investigated a cross-domain training scenario in which a small amount of read speech from the CUCFN corpus would serve as the Czech language training data. After comparing performance across the mono-lingual Czech read and broadcast domains (Table 2), we decided to fix the 1.0 hour CUCFN read speech training set as the Czech language acoustic training set and to attempt to improve performance on the Czech VOA test data by borrowing from English, Mandarin, Spanish and Russian. This provides a realistic and interesting training scenario that involves cross-domain as well as multilingual factors.

Training Set : Test on CUCFN : Test on VOA

1.0 hr VOA : 66.1% : 28.8%

1.0 hr CUCFN : 47.3% : 35.7%

Table 2: WER in Training and Testing on Czech VOA Broadcasts and CUCFN Read Speech Using 20 Mixture Monophone Models.

These experiments with Czech VOA are reported as a cautionary note to emphasize that language is just one characteristic of speech and that other conditions, such as speaking style, are significant factors in ASR performance. It is therefore critically important to obtain diverse training and test sets for multilingual experiments. It is also important that results of limited domain experiments, such as training and testing with data from the same news programs, be interpreted cautiously since performance may not carry over to more diverse domains.

2.3  Knowledge-Based Phone Mappings

In some applications, it is highly desirable to develop speech recognition systems without any acoustic training data. In such situations, borrowing models from other languages for which speech recognition technology is well-developed is an attractive idea. The approaches presented here are referred to as knowledge-based because they exploit linguistic knowledge of the languages and their phoneme inventories, and because they have not been retrained on any target language acoustic data.

Our initial experiments involved simple mappings in which phones from the Czech target language were mapped to their nearest neighbor in a single source language using a similarity measure based on feature-based descriptions of the phones. This is a manual procedure that leverages extensive knowledge of acoustic phonetics [3]. Our approach involved first describing the phones in both the source and target languages in terms of their articulatory positions, a process that leads to a description of the sounds using the International Phonetic Alphabet (IPA) [4].

The advantage of this approach is that all languages can, in theory, be represented within the same system. We determined the proximity of a sound in the target language to a sound in the source language using this representation, and developed an associated symbol-to-symbol mapping. While it was possible to achieve reasonable mappings for each language, there are significant variations in the level of detail used in the source language phonetic inventories. Spanish, for example, used only 25 phones, while Russian used 44 phones. We used these mappings to obtain baseline performance using acoustic models from the source languages. The procedure was quite simple: represent each phone symbol in the Czech lexicon using the corresponding source language phone given by these mappings. The performance of systems constructed in this manner is given in Table 3. Overall, we observe that performance is poor - in the range of 80% WER. It was a great surprise to observe that the Russian acoustic models, though they were trained on read speech, were a close match to the VOA data, especially considering the differences in microphones, speaking style, and speaking rates. We also observed from these experiments that performance for English and Spanish was comparable, and performance for Mandarin lags behind the other systems.
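The mapping procedure described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the workshop mappings were produced manually from IPA descriptions, whereas this sketch substitutes a Jaccard overlap between hand-written feature sets as the similarity measure, and the tiny phone inventories and the lexicon entry are invented.

```python
def jaccard(a, b):
    """Overlap between two articulatory feature sets (1.0 means identical)."""
    return len(a & b) / len(a | b)


# Toy feature descriptions; not the inventories used at the workshop.
czech_phones = {
    "p": {"consonant", "bilabial", "plosive", "voiceless"},
    "ř": {"consonant", "alveolar", "trill", "fricative", "voiced"},
    "a": {"vowel", "open", "central", "unrounded"},
}
source_phones = {
    "p": {"consonant", "bilabial", "plosive", "voiceless"},
    "r": {"consonant", "alveolar", "trill", "voiced"},
    "s": {"consonant", "alveolar", "fricative", "voiceless"},
    "a": {"vowel", "open", "central", "unrounded"},
    "e": {"vowel", "mid", "front", "unrounded"},
}


def nearest_source_phone(features, inventory):
    """Pick the source phone whose features are closest to the target phone."""
    # sorted() makes tie-breaking deterministic.
    return max(sorted(inventory), key=lambda p: jaccard(features, inventory[p]))


# Symbol-to-symbol mapping from each Czech phone to a single source phone.
mapping = {cz: nearest_source_phone(f, source_phones) for cz, f in czech_phones.items()}

# Rewrite a (toy) Czech lexicon entry in the source language phone set.
czech_lexicon = {"pař": ["p", "a", "ř"]}
mapped_lexicon = {
    word: [mapping[ph] for ph in phones] for word, phones in czech_lexicon.items()
}
print(mapping, mapped_lexicon)
```

In this toy inventory the Czech phone "ř" has no exact counterpart and falls back to the closest source phone, which is the situation that motivates drawing on several source languages at once.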

Source Language : Czech VOA WER (%) 

Russian : 60.8 

Spanish : 71.7 

English : 75.5 

Mandarin : 88.7 

Table 3: Performance Using Knowledge Based Phone Mappings.

It was evident from the construction of the mappings that a single source language did not provide optimal coverage of Czech. Therefore, it was natural to explore a mapping that involved phones from all source languages based on proximity in the IPA table. Since Russian was clearly acoustically closer to Czech than any of the other source languages, we excluded Russian from the set of source languages for this experiment, so that it would not mask any trends in our knowledge-based systems. Though we achieved modest improvements in performance (1.6% absolute WER), we did not achieve performance comparable to data-driven mapping methods discussed next.

Our next attempt to understand deficiencies in the knowledge-based system was to explore a series of experiments in which the recognition system was allowed to choose the best combination of phones at runtime. First, we explored a parallel pronunciation approach [5] in which each item in the lexicon was represented as a sequence of phones from a single language implemented using pronunciation networks. Unfortunately, this approach resulted in slightly degraded performance even though we had hoped that the additional degrees of freedom would offset any systematic acoustic bias between the two domains. We next tried a multiphone approach that allowed the recognition system to mix and match phones from all source languages as an attempt to let the recognizer find the best realization of a phone, rather than fixing this based on a priori linguistic knowledge. We found minor improvement in performance over the parallel pronunciation system, as expected. However, overall performance is still below the best monolingual system, and far below the Russian monolingual system. In these experiments we have observed that, though the overall WER is high, performance at the phone-level appears to be quite good. The alignments are plausible, and a majority of the words are only partially misrecognized. Since Czech is an inflected language, this analysis raised some concerns that our language modeling approach was not optimal. For example, a morphologically-based approach might be better if the majority of the errors occur on endings rather than stems - it could be the case that performance at a morphological level is good, and hence the system would be usable for information extraction tasks.

2.4  Automatic Generation of Phone and State Level Acoustic Mappings Across Languages

We developed a general methodology to derive cross-language mappings automatically both at phonetic and sub-phonetic levels. We call our approach the Confusion Matrix approach to finding cross-lingual mappings. These confusion matrices are tables of acoustic similarity between phones across languages. They are obtained by first performing a mono-lingual phonetic labeling of the target language acoustic data using the target language phone set - this can be done manually or via forced-alignment using HMMs; we use the latter approach. Phonetic recognition of this data is then performed using acoustic models from each of the source languages; for this we used simple, unweighted, phone-loop recognizers. This yields multiple phonetic segmentations of the target language acoustic data in the source language phone inventories.

Once a criterion for co-occurrence between two phonetic labelings of the acoustic segments is defined (e.g., a minimum number of overlapping frames, etc.), we can arrange the phones of the source language and target language into a matrix that contains the counts of co-occurrences between the nth and kth phones of the source and target languages, respectively, in the (n,k) entry of the matrix. This matrix of co-occurrences is the confusion matrix.
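As an illustration only, the counting step can be sketched in a few lines of Python. The segment format (phone, start frame, end frame), the function names, the `min_overlap` threshold of 3 frames, and the toy phone labels are all assumptions made for this sketch; they are not taken from the workshop system.

```python
from collections import Counter

def overlap(a, b):
    """Number of frames shared by two half-open segments (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def confusion_counts(source_segs, target_segs, min_overlap=3):
    """Count co-occurrences of (source phone, target phone) pairs.

    Each segmentation is a list of (phone, start_frame, end_frame) tuples
    with end_frame exclusive. A pair co-occurs when the two segments share
    at least `min_overlap` frames (a hypothetical threshold).
    """
    counts = Counter()
    for s_phone, s_start, s_end in source_segs:
        for t_phone, t_start, t_end in target_segs:
            if overlap((s_start, s_end), (t_start, t_end)) >= min_overlap:
                counts[(s_phone, t_phone)] += 1
    return counts

# Toy utterance: target-language labels from forced alignment, and
# source-language labels from a phone-loop recognizer (made-up data).
target = [("a", 0, 10), ("t", 10, 18), ("o", 18, 30)]
source = [("ah", 0, 9), ("d", 9, 19), ("ow", 19, 30)]
counts = confusion_counts(source, target)
```

In practice the counts would be accumulated over all utterances of the target-language data; the (n,k) entry of the confusion matrix is then `counts[(source_phone_n, target_phone_k)]`.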

After the confusion matrix between the phones of two languages is obtained, we derive mappings from this matrix. Given a source phone (in the nth row), we would like to select the phone in the target language that best matches it (i.e., choose the best matching kth column). To do this we can simply choose the column with the highest count. A better method takes into account the number of times the kth source language phone was hypothesized by dividing the counts of the bin (n,k) by the accumulated counts of the column k.
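The two selection rules can be compared with a small sketch. It follows the orientation used in the text (rows are source phones, columns are target phones) and divides each count by its column total; the function name `best_match` and the toy counts are hypothetical and only meant to show that normalization can change the selected match.

```python
def best_match(counts, normalize=True):
    """For each source phone (row), pick the best-matching target phone (column).

    `counts` maps (source_phone, target_phone) -> co-occurrence count.
    With normalize=True each count is divided by its column total;
    otherwise the raw count is used.
    """
    col_total = {}
    for (_, tgt), c in counts.items():
        col_total[tgt] = col_total.get(tgt, 0) + c

    best = {}
    for (src, tgt), c in counts.items():
        if col_total[tgt] == 0:
            continue  # empty column: nothing to compare
        score = c / col_total[tgt] if normalize else c
        if src not in best or score > best[src][1]:
            best[src] = (tgt, score)
    return {src: tgt for src, (tgt, _) in best.items()}

# Made-up counts: column "a" is dominated by source phone "eh".
counts = {
    ("ah", "a"): 40, ("ah", "e"): 30,
    ("eh", "a"): 160, ("eh", "e"): 20,
}
raw = best_match(counts, normalize=False)   # highest raw count per row
norm = best_match(counts, normalize=True)   # highest column-normalized count
```

With raw counts both source phones select "a"; after column normalization "ah" selects "e" instead, because most of the mass in column "a" comes from "eh".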

We extended this technique to the state level, motivated by our intuition that some phones seemed hard to match from one language to another. To obtain the subphonetic mapping, we broke each HMM in the source and target languages into its constituent states and derived an HMM from each of these states. Using these new sub-phone HMMs we constructed a new confusion matrix. As expected, we found that some of these hard-to-match target language phones were modeled by assembling new models from phonetic subunits from other languages.

We described above how we established the best mapping for each phone/state of the target language. We found out that when many states and phones from various languages were competing to represent any given target model, several models seemed to give high counts and thus be close candidates for a reasonable match. We explored the possibility of including several of these best matching candidates by combining the Gaussian models in their mixtures after weighting them accordingly. We established the weights used in this state combination in proportion to the normalized number of counts corresponding to the map.
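A minimal sketch of such a weighted combination is given below, assuming one-dimensional Gaussian components and assuming that the counts are renormalized over the candidates that are kept. The data layout, the function name `combine_states`, and the numbers are illustrative assumptions rather than a description of the actual implementation.

```python
def combine_states(candidates, top_n=3):
    """Merge the mixtures of the top_n candidate source states.

    `candidates` is a list of (count, mixture) pairs, where mixture is a
    list of (weight, mean, variance) components whose weights sum to 1.
    Each candidate is weighted by its count divided by the total count of
    the retained candidates (an assumed normalization).
    """
    top = sorted(candidates, key=lambda c: c[0], reverse=True)[:top_n]
    total = sum(count for count, _ in top)
    merged = []
    for count, mixture in top:
        state_weight = count / total
        for w, mean, var in mixture:
            merged.append((state_weight * w, mean, var))
    return merged

# Four candidate source states competing for one target state (toy values).
candidates = [
    (60, [(0.5, 0.0, 1.0), (0.5, 1.0, 1.0)]),
    (30, [(1.0, 2.0, 0.5)]),
    (10, [(1.0, -1.0, 2.0)]),
    (5,  [(1.0, 9.0, 1.0)]),
]
merged = combine_states(candidates)  # keeps the three highest-count states
```

The merged component weights again sum to one, so the result can be used as an ordinary Gaussian mixture for the target state.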

Table 4 shows recognition experiments we conducted using mappings derived from confusion matrices. For comparison in this experiment, monophone Czech models trained on 1 hour of Czech give 38% WER. When mappings are obtained using the phone-level confusion matrix approach, the word error rate drops below 70%. State-level mappings further reduce the error rate of the English mappings. Better results are obtained when multiple source languages are included (English, Spanish and Mandarin), and state mappings are obtained for both state-to-state mapping and best three states to a single Czech state (the 3-state method). The best result is below 55% WER. The 3-state methods reported differ in the presence (54.4%) or absence (55.8%) of count normalization of the columns in the confusion matrix.

| Source(s)/Method | WER |
|---|---|
| EN/Phone | 68.3 |
| SP/Phone | 68.7 |
| EN/State | 64.8 |
| SP/State | 70.0 |
| MA/State | 79.7 |
| EN+SP+MA/State | 62.3 |
| EN+SP+MA/3-State (without column normalization) | 55.8 |
| EN+SP+MA/3-State (with column normalization) | 54.4 |

Table 4: WER(%) Using Automatic Phone Mappings.

2.5  Acoustic Adaptation

Despite the substantial differences between the quality of phone mappings obtained by knowledge-based and automatic state-level phone mappings, adaptation using MLLR and MAP on the 1.0 hour of Czech read speech largely compensates for these differences, as shown in Table 5. However, while performance improves significantly, the adapted systems do not individually improve over the monolingual Czech systems.

| Source | Mixtures / Type | Unadapted | MLLR+MAP |
|---|---|---|---|
| MA 10 hr. | 20 / monophone | 88.7 | 63.0 |
| SP 10 hr. | 20 / monophone | 71.6 | 50.9 |
| RU 3 hr. | 20 / monophone | 60.8 | 45.3 |
| EN 10 hr. | 20 / monophone | 75.7 | 47.2 |
| EN 10 hr. | 8 / triphone | | 35.1 |
| EN 72 hr. | 12 / triphone | | 32.7 |
| CZ 1 hr. | 20 / monophone | 33.4 | |
| CZ 1 hr. | 6 / triphone | 30.7 | |

Table 5: Adaptation WER(%) of Systems with Varying Complexities and Amounts of Source Language Training Data

2.6  Discriminative Model Combination of Multiple Source Language Acoustic Models

Discriminative model combination [1] aims at an optimal integration of all available acoustic and language models into one log-linear posterior probability distribution. The coefficients of the log-linear combination are estimated on training samples using discriminative methods to obtain an optimal classifier. For example, a multilingual combination at the sentence level of scores from Czech, Spanish, and Mandarin acoustic models has the following form for a sentence hypothesis $w$ given the acoustic data $x$:

$$\lambda_{lm} L_{cz}(w) + \lambda_{cz} A_{cz}(x \mid w) + \lambda_{sp} A_{sp}(x \mid w) + \lambda_{ma} A_{ma}(x \mid w)$$

where $L_{cz}(w)$ is the Czech language model likelihood, and $A_{cz}(x \mid w)$, $A_{sp}(x \mid w)$, $A_{ma}(x \mid w)$ are the Czech, Spanish, and Mandarin acoustic model likelihoods. The parameters $\lambda$ are optimized to minimize WER on a held-out set of Czech data.
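The rescoring step implied by this combination can be sketched as follows. The sketch assumes that each N-best hypothesis carries its language model and acoustic log scores in a dictionary; the score names, the numeric values, and the weights are invented for illustration, and the weights are set by hand here rather than estimated discriminatively as in DMC.

```python
def dmc_score(scores, lambdas):
    """Log-linear combination: sum over models of lambda * log score."""
    return sum(lambdas[name] * value for name, value in scores.items())

def rescore(nbest, lambdas):
    """Return the text of the hypothesis with the highest combined score."""
    return max(nbest, key=lambda hyp: dmc_score(hyp["scores"], lambdas))["text"]

# Two toy hypotheses with made-up log scores from four models.
nbest = [
    {"text": "hyp A",
     "scores": {"L_cz": -12.0, "A_cz": -100.0, "A_sp": -130.0, "A_ma": -150.0}},
    {"text": "hyp B",
     "scores": {"L_cz": -14.0, "A_cz": -101.0, "A_sp": -120.0, "A_ma": -140.0}},
]

# Hand-set weights (hypothetical); DMC would tune these on held-out data.
czech_only = {"L_cz": 1.0, "A_cz": 1.0, "A_sp": 0.0, "A_ma": 0.0}
multilingual = {"L_cz": 1.0, "A_cz": 1.0, "A_sp": 0.3, "A_ma": 0.1}

best_czech = rescore(nbest, czech_only)
best_multi = rescore(nbest, multilingual)
```

In this toy case the Czech-only weights prefer "hyp A", while giving the Spanish and Mandarin acoustic scores nonzero weight moves the choice to "hyp B". The same functions apply unchanged when the score dictionary holds separate vowel, consonant, and silence terms.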

Although the results are not reported in detail here, we find that DMC rescoring at the sentence level does not improve over the monolingual Czech performance. However, performance can be improved by applying DMC at the phoneme-class level. For example, the acoustic likelihood $A_{cz}(x \mid k)$ can be separated by the contribution of vowels, consonants, and silence models. Parameters can then be introduced to define a posterior distribution based on these language-specific phonetic classes:

$$\lambda_{lm} L_{cz}(k) + \lambda_{cz,V} V_{cz}(x \mid k) + \lambda_{cz,C} C_{cz}(x \mid k) + \lambda_{cz,S} S_{cz}(x \mid k).$$

 

| Acoustic Scores and Phonetic Classes | WER(%) |
|---|---|
| N-Best oracle | 19.8 |
| first best (baseline) | 34.0 |
| Vru+Cru+Sru+Vsp+Csp+Ssp | 31.8 |
| Lcz+Acz+Aru+Asp+Aen | 29.2 |
| Lcz+Vcz+Ccz+Scz+Vru+Cru+Sru+Vsp+Csp+Ssp+Ven+Cen+Sen | 28.9 |

Table 6: DMC Rescoring of 1000-best Lists. The combination uses knowledge-based mappings, the Czech language model, and the Czech, Spanish, Russian, and English vowel, consonant and silence models.

From the results in Table 6 we conclude that the structuring into phoneme classes improves performance over combination at the sentence level. Furthermore, combination of multilingual phoneme-class models performs better than the monolingual Czech systems, even when the monolingual systems are optimized using DMC.

2.7  Conclusion

We have presented a methodology for language independent acoustic modeling. We found that both knowledge-based and automatic methods can be used to derive cross-lingual phonetic mappings. Model adaptation and discriminative model combination can then be used to further improve and merge systems from diverse languages. Additional experiments, particularly in language adaptive training, can be found on our web site.

 

ACKNOWLEDGMENTS This work was supported by the National Science Foundation under Grant No. #IIS-9820687, and carried out at the 1999 Workshop on Language Engineering, Center for Language and Speech Processing, Johns Hopkins University. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or The Johns Hopkins University. Satellite news broadcast recordings were done under contract by the Linguistic Data Consortium, Philadelphia, PA, USA. We thank M. Riley and F. Pereira of ATT for use of their large vocabulary decoder.

Thanks to staff of IFAL, Charles University, Prague, especially to Jan Hajic and Barbora Hladká for providing tools for Czech morphological analysis, tagging and lemmatization, and to Michael Collins for the possibility to use his statistical parser. The following grants have contributed to the data preparation and development of tools: Project No. VS96151 of the Ministry of Education of the Czech Republic, Grant No. 405/96/K214 of the Grant Agency of the Czech Republic. Thanks to Martin Cmejrek for collaboration on parallel Czech/English data preparation and to Lenka Kadlcáková and Martin Cmejrek for the prompt evaluation of translations. Special thanks to Reader's Digest Výber (Prague, Czech Republic) for granting the license for using their textual material and to IBM Czech Republic for the chance to run test translations on their data.

 

REFERENCES FOR SMT

[10] Al-Onaizan, Yaser, Jan Curin, Michael Jahr, Kevin Knight, John Lafferty, Dan Melamed, Franz-Josef Och, David Purdy, Noah A. Smith, and David Yarowsky. 1999. Statistical Machine Translation, Final Report, JHU Workshop 1999.

[11] Brown, P. F., V. J. Della Pietra, S. A. Della Pietra, and R. L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19(2).

[12] Gale, W. and K. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics 19(1).

[13] Hajic, Jan, Eric Brill, Michael Collins, Barbora Hladká, Douglas Jones, Cynthia Kuo, Lance Ramshaw, Oren Schwartz, Christoph Tillmann, and Daniel Zeman. 1998. Core Natural Language Processing Technology Applicable to Multiple Languages (Final Report, Summer Workshop'98). Tech. Rep., Center for Language and Speech Processing, Johns Hopkins University.

[14] Hajic, Jan and Barbora Hladká. 1998. Tagging inflective languages: Prediction of morphological categories for a rich, structured tagset. In Proceedings of Coling/ACL.

[15] Melamed, I. Dan. 1996. A geometric approach to mapping bitext correspondence. In Proceedings of the First Conference on Empirical Methods in Natural Language Processing.

[16] Och, F. J., C. Tillmann, and H. Ney. 1999. Improved alignment models for statistical machine translation. In Proc. of the Joint SIGDAT Conf. on Empirical Methods in Natural Language Processing and Very Large Corpora.

 

REFERENCES FOR ASR

[1] P. Beyerlein, "Discriminative Model Combination", ICASSP, Seattle, 1998.

[2] W. Byrne et al., "Large Vocabulary Speech Recognition for Read and Broadcast Czech", 1999 Workshop on Text Speech and Dialog, Marianske Lazne, Czech Republic.

[3] D. Calvert, Descriptive Phonetics, Thieme, New York, 1986.

[4] Handbook of the International Phonetic Alphabet, Cambridge University Press, Cambridge, UK, 1999.

[5] T. Schultz and A. Waibel, "Language Independent and Language Adaptive Large Vocabulary Speech Recognition", ICSLP, Sydney, Australia, 1998.

[6] S. Young et al., The HTK Book, Entropic, Inc., 1999.