On 23 January 2022, the ČT 24 website published an article about the CUBBITT translator, titled Trénink překladače na Mechanickém pomeranči u oficiálních dokumentů narazil, směje se jeho autor. Below is an automatic English translation of that article, produced by CUBBITT itself, which is publicly available at lindat.cz/cubbitt.

Training the translator on A Clockwork Orange backfired on official documents, laughs its author

The latest versions of commonly available machine translators amaze the public and experts alike with their high quality. How much more can they improve, how do they cope with political correctness, and can they ever reach the level of professional literary translators? These are among the topics that language expert Martin Popel from the Faculty of Mathematics and Physics of Charles University (MFF UK) has been working on for over fifteen years. His translator beats not only the systems of the world's largest companies but also translation agencies in professional competition.

Several free, publicly available translators can currently be found online. For Czech users the best known is probably Google, but DeepL, Microsoft Bing and the Czech CUBBITT, created by experts from the Institute of Formal and Applied Linguistics of MFF UK, offer comparable quality. They all work on a similar principle and differ, among other things, in the number of languages they can translate.

But which one works best? Measuring quality is a complex issue, but one possible answer is given by the annual Workshop on Machine Translation (WMT) competition, which expertly compares machine (or automated or automatic) translations in several languages, including English and Czech.

"Participants submit their translation systems. These are tested on several thousand test sentences that are not known in advance, taken from recent news and journalistic texts," says Popel, a computational linguist and the author of the Czech university translator. The translations are then evaluated anonymously, including those by human translators, which are mixed in with the machine translations.

"For a long time Google Translate was the winner. In 2018, my publicly available CUBBITT system won this competition. We beat all the tested translators and also, to my surprise, a professional translation agency," says Popel, adding: "In terms of translation accuracy, it came out significantly better than the translation agency, and in fluency worse." The following years confirmed that for English and Czech in both directions, the Czech translator achieves better quality, at least on news texts.

Translators of large companies are included in the competition by the organizers of WMT, under anonymized names. Officially, it is not known which is which, but everyone can compare the translations sent to the competition with those currently offered by the companies.

Training translators on sentence pairs

"Basically, all of today's translators are based on neural networks and deep machine learning," says Popel, elaborating on the principle: "In the training data, we have pairs of sentences (training examples), and we want the translator to learn them. Not entirely by heart, though, but so that it generalizes from them and can translate sentences it has never seen." For Czech and English, language experts at MFF UK have around sixty million sentence pairs at their disposal.

Automatic translation between most of the world's languages goes through English, even when the two languages are closely related. "For most languages, the most training pairs are with this language, although we assume that some smaller languages in South America, for example, will have the most parallel data with Spanish, perhaps with Chinese elsewhere," explains the linguist. The advantage of neural networks is that with enough high-quality training data, it is possible to build a good translator even for very distant languages.
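The idea of translating through English can be sketched in a few lines of Python. This is only a toy illustration: the dictionaries and the names `cs_to_en`, `en_to_pl` and `translate_via_pivot` are invented stand-ins for trained models, not part of CUBBITT or any other real system.

```python
# Toy sketch of pivot translation: Czech -> English -> Polish.
# The lookup tables below are hypothetical stand-ins for trained translators.

CS_TO_EN = {"dobrý den": "good day"}
EN_TO_PL = {"good day": "dzień dobry"}


def cs_to_en(text):
    """Pretend Czech-to-English translator (a dictionary lookup)."""
    return CS_TO_EN[text]


def en_to_pl(text):
    """Pretend English-to-Polish translator (a dictionary lookup)."""
    return EN_TO_PL[text]


def translate_via_pivot(text, source_to_pivot, pivot_to_target):
    """Chain two translators through a pivot language."""
    pivot_text = source_to_pivot(text)
    return pivot_to_target(pivot_text)


print(translate_via_pivot("dobrý den", cs_to_en, en_to_pl))
```

Any error made in the first step is passed on to the second, which is one reason a direct translator can be preferable where enough direct training pairs exist.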

Other methods have been used for machine translation before, but today, according to Popel, neural networks are the best performers – specifically the Transformer architecture, which Google came up with and made publicly available under a free license. "CUBBITT is my system, but it couldn't have come into existence without the work of thousands of people before me," observes the expert.

Translators are trained on very powerful computers, which he says can be a limiting factor these days. The university researchers certainly can't compete with big companies in this respect. "But even with what we have, we can compete with them," he notes.

The more quality data, the better the results

The amount and quality of training data are decisive for automatic translators. "We can't outsource those sixty million sentences to a translation agency and demand a high standard. We take everything that is available somewhere, such as film subtitles, which were sometimes translated by amateurs," reveals Popel.

Another important source is European Union documents, which must be issued in multiple languages and are released under a free licence, including legal texts and speeches by members of the European Parliament. How well various fields of human activity are represented in the training data also has a major influence on the quality of translation. "It would probably be difficult to translate from Chinese to English about traditional Chinese medicine using a translator trained on film subtitles," explains the linguist.

Part of the work of language experts is the process of filtering and "cleaning" the data, which is partly automatic. "It should be pointed out that we have very high quality training data that colleagues from the Institute of Formal and Applied Linguistics have been working on for over fifteen years," says the expert, specifying: "They were putting together a parallel Czech-English corpus, CzEng, which we use as the core of the training data."
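The article does not say which filters the CzEng authors apply. Purely as an illustration of what automatic "cleaning" of sentence pairs can look like, here is a minimal Python sketch; the function name `clean_parallel_corpus` and the threshold `max_length_ratio` are assumptions made for this example.

```python
def clean_parallel_corpus(pairs, max_length_ratio=3.0):
    """Keep (source, target) sentence pairs that pass simple sanity checks.

    The checks and the default threshold are illustrative assumptions,
    not the filters used for any real corpus.
    """
    seen = set()
    kept = []
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        # Drop pairs where either side is empty.
        if not source or not target:
            continue
        # Drop pairs whose word counts differ too much (likely misaligned).
        n_src, n_tgt = len(source.split()), len(target.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_length_ratio:
            continue
        # Drop pairs where the "translation" just copies the source.
        if source.lower() == target.lower():
            continue
        # Drop exact duplicates.
        if (source, target) in seen:
            continue
        seen.add((source, target))
        kept.append((source, target))
    return kept


example_pairs = [
    ("Dobrý den.", "Good morning."),
    ("Dobrý den.", "Good morning."),  # duplicate
    ("", "Hello"),                    # empty source
    ("Ano.", "Yes, of course, I would be very happy to do that."),  # length mismatch
    ("OK", "ok"),                     # untranslated copy
]
print(clean_parallel_corpus(example_pairs))
```

Real pipelines typically add further steps, such as language identification, but the principle of discarding suspicious pairs is the same.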

Non-standard language is not a problem, political correctness is

Machine translators also do very well with non-standard and colloquial language or idioms, given enough training data. "But you have to be careful that the translator doesn't use vulgarisms in the translation, for example, if there were none in the original sentence. You also have to be careful about introducing prejudices against gender, religion or race," notes Popel.

According to him, these prejudices are in the training data. To illustrate this, he states: "When we say 'works as a conductor', it's not clear whether it's a man or a woman. Most translators then use some stereotypes. So if the conductor was more often a man in the training data, they translate it into English as a man. We have now released a new version of CUBBITT, which translates whole documents, not just individual sentences. In the given example, it can take the surrounding sentences into account and learn from them what gender the conductor is. In other cases, the translation of ambiguous words or the coherence between sentences has been improved."

But he also remembers a more amusing example from practice. "In the beginning, the training data included subtitles from the film A Clockwork Orange. There are a lot of vulgarities, neologisms and Russianisms in it; the translation, and especially the original, is brilliant in the way it works with language. We didn't want them there when translating official documents, but you could tell that the system was trained on this film," he admits.

Automatic translation of fiction not yet in sight

Given the current quality, it might seem that automatic translators are strong competition for professional literary translators. But according to Popel, their level cannot be compared yet. "Translation of fiction is a completely different category for me. It's like comparing a house painter with an artist who paints pictures," he admits.

Even for fiction, he says, the more data there is, the better the translation results will be. "But that alone is not enough. It is necessary to work not only with a single sentence or paragraph, but also with the context of the chapter and ideally the whole book. I don't know of a translator that can cover this yet," he says. With literary fiction, the reading experience itself must also be taken into account. Here, too, automatic translators have a lot of room for improvement. At the same time, he notes that the reading experience can be spoiled by a bad translation, whether by a machine or by a human.

Martin Popel highlights the qualities of experienced translators, before whose art he feels great humility. But he also points out that much depends on how much care they give to the text. When asked if machine translation will cost translators their jobs, he says diplomatically, "I think the poor ones may well lose them."

In fiction, he says, there are still many places where machine translators make mistakes and where they will certainly continue to make them in the years to come. On the other hand, he says, it is impossible to say with certainty that they will never be able to do some things. "When I chose this field, I thought that over the next thirty years the quality would be so bad that there would still be something to improve. There still is, but I underestimated the speed of development," he admits. He says it took half that time to achieve such good results in the translation competition.

Neural poems and plays

Even the scientific research of mathematical linguists occasionally yields by-products. One of them is translator-generated poetry. He says the idea was born more or less by accident. "I once made a stupid programming mistake," says Popel. "Instead of choosing the best option, the translator chose the worst one, albeit a syntactically correct one. And translations of some sentences sounded like poems." He then developed the idea to the point where visitors to the open day at MFF UK could try out his neural poetry generator for themselves. "But it's just a pastime. It can't compete with real poets," he admits.
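The article gives no detail about what the mistake looked like in the actual code. As a hedged illustration only, the slip described (taking the lowest-scoring candidate instead of the highest-scoring one) can be as small as swapping `max` for `min`. The candidate sentences, their scores and the function `pick_translation` below are all made up for this sketch.

```python
def pick_translation(candidates, pick_worst=False):
    """Choose one translation from (text, score) pairs.

    A higher score means the model considers the text more likely.
    Setting pick_worst=True mimics the kind of bug described above.
    """
    chooser = min if pick_worst else max
    text, _score = chooser(candidates, key=lambda candidate: candidate[1])
    return text


# Hypothetical candidates with invented scores (e.g. log-probabilities).
candidates = [
    ("The sun sets over the city.", -1.2),
    ("Sun is setting on city.", -3.4),
    ("Over the city the sun lies down.", -7.5),
]

print(pick_translation(candidates))                   # intended behaviour
print(pick_translation(candidates, pick_worst=True))  # the "poetic" bug
```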

Another example of the creative use of his automatic translator is an experiment on which scientists from MFF UK collaborated with theatre practitioners from the Švanda Theatre and DAMU. Together they created and staged the first play written by artificial intelligence. The text was first generated in English and then translated into Czech using CUBBITT. The production of "AI: When a Robot Writes a Play" was performed in February last year by the Švanda Theatre, symbolically one hundred years after the world premiere of Karel Čapek's drama R.U.R., in which the word robot appeared for the first time.

PROFILE

Mgr. Martin Popel, Ph.D.

During his studies at the Faculty of Mathematics and Physics of Charles University (MFF UK), he began to focus on mathematical linguistics and machine translation. He now teaches and conducts research at the Institute of Formal and Applied Linguistics of MFF UK.
His main research interests include machine learning, deep learning and syntactic analysis.
In 2017, he started work on his own machine translator CUBBITT (Charles University Block-Backtranslation-Improved Transformer Translation), with which he won the international WMT 2018 (Conference on Machine Translation) competition a year later. He developed the topic in a study published by the prestigious scientific journal Nature Communications.
The translator is available to the public on the LINDAT/CLARIAH-CZ website, an infrastructure project to support cutting-edge research in language technology and the humanities and social sciences.