Interview with Markus Foti: How to reap the benefits of MT@EC
On 26 October 2016, the 2nd ELRC Conference took place in Brussels in conjunction with this year’s Translating Europe Forum. A key question was how public language service providers can keep up with the changing world of translation – and what benefits public online services can expect from translation technologies such as machine translation.
In a subsequent interview, Markus Foti, MT@EC/eTranslation Project Manager at the Directorate-General for Translation (DGT), gives an interesting insight into the applicability and potential of the EC’s machine translation tool MT@EC for public services across Europe. This also includes a glimpse of the data used – and needed – to train MT@EC for particular translation requirements.
Dear Mr. Foti, it is the mission of DGT to provide the European Commission (EC) with high-quality translation and other language services. Since 2013, you have been providing the statistical machine translation (MT) tool MT@EC to support the EC’s translators in their work. In which areas is MT@EC currently being used?
MT@EC has become a standard tool for translators in the Commission, but also in the other EU institutions. It covers all of the EU languages, with 73 direct engines; less-used pairs are translated through a pivot language.
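To illustrate the pivot approach just mentioned: when no direct engine exists for a language pair, the text is first translated into an intermediate language and then into the target. The engine list and the `translate_direct` stub below are invented for illustration; they are not the MT@EC API.

```python
# Illustrative sketch of pivot-based routing: if no direct engine
# covers a language pair, translate via a pivot language instead.
# DIRECT_ENGINES and translate_direct() are placeholders, not real APIs.

DIRECT_ENGINES = {("mt", "en"), ("en", "mt"), ("lv", "en"), ("en", "lv")}
PIVOT = "en"

def translate_direct(text: str, src: str, tgt: str) -> str:
    """Stand-in for a call to a single direct MT engine."""
    return f"[{src}->{tgt}] {text}"

def translate(text: str, src: str, tgt: str) -> str:
    """Use a direct engine when one exists, otherwise go via the pivot."""
    if (src, tgt) in DIRECT_ENGINES:
        return translate_direct(text, src, tgt)
    intermediate = translate_direct(text, src, PIVOT)
    return translate_direct(intermediate, PIVOT, tgt)
```

Under this sketch, a Maltese-to-Latvian request would be routed Maltese → English → Latvian, since no direct Maltese–Latvian engine is listed.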
For translation in the EU institutions, requesting MT in the form of a data file (TMX) has often been built into the workflow, so that when EU translators start to translate a document, MT output is already available for them, alongside a translation memory. Some prefer not to use the data file and instead just have a full translation beside them to use as a reference document.
But MT@EC has always had a broader scope than as an aid for translators. Non-translators use it to get the gist of documents that they are supposed to work with – sometimes the MT gets them the information they were looking for, and sometimes it just helps them to decide which bits of the 150 pages of legislation that a country has submitted in response to a question are really relevant and need careful, human translation. If you find out that only 10 pages actually need translation, you've saved time and money, and the translators will be under less pressure as well.
And from the beginning MT@EC was intended to be open to national public administrations as well, and with eTranslation being funded by CEF, this is even more the case. Universities taking part in the European Master's in Translation programme also have access, so that the up-and-coming generation of translators gains experience with various tools of the trade.
Apart from that, there are also quite a number of EU websites that connect to MT@EC to get translations for text boxes or search results, or the web pages themselves.
This machine-to-machine connection is one of the areas where we are really eager to improve the system – partly in terms of speed, but also in terms of register. The translations we used to build MT@EC were official DGT translations, so they are really high quality, but also pretty formal. I don't think you'll be surprised to hear that people don't always talk the way official EU documents sound – in the early days, at least for Greek, MT@EC could not translate the pronoun “I”. Not much legislation speaks in the first person, so we had nothing for the system to learn from! And if MT@EC, or its successor eTranslation, is to do a good job on more natural language, we need to feed it that kind of data so it can learn.
So what kind of data does and did DGT use to train MT@EC for its needs? And how did you manage to get hold of this data?
Well, the bulk of it is DGT's translations, along with those of the other EU institutions, and that was really easy to get because for over fifteen years now, completed translations have been stored in our vast internal translation memory, called Euramis, in the form of individual segments. So we extract the corresponding segments in two languages and use those to build each engine. This is a vast resource – most language pairs have around 40 million segments, and English is coming up to 100 million. But we have gaps in style and tone.
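As a rough sketch of what "extracting the corresponding segments in two languages" can look like downstream: statistical MT toolkits such as Moses train on two parallel plain-text files, where line N of the source file and line N of the target file form one translation pair. The segment pairs and file names below are invented for illustration.

```python
# Write aligned segment pairs (as extracted from a translation memory)
# into two parallel plain-text files: line N of the source file and
# line N of the target file form one translation pair.

segment_pairs = [
    ("The committee adopted the report.", "Der Ausschuss nahm den Bericht an."),
    ("The meeting was closed.", "Die Sitzung wurde geschlossen."),
]

def write_bitext(pairs, src_path, tgt_path):
    with open(src_path, "w", encoding="utf-8") as src_file, \
         open(tgt_path, "w", encoding="utf-8") as tgt_file:
        for src_seg, tgt_seg in pairs:
            # One segment per line; strip stray newlines to keep files aligned.
            src_file.write(src_seg.replace("\n", " ").strip() + "\n")
            tgt_file.write(tgt_seg.replace("\n", " ").strip() + "\n")

write_bitext(segment_pairs, "corpus.en", "corpus.de")
```

Keeping the two files line-aligned is the whole point: any dropped or extra line on one side silently shifts every pair after it.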
We are hoping that many of the CEF countries will have similar databases and that they will share them with us so we can tailor eTranslation engines to their needs.
In a couple of cases we have also been provided with data to build domain-specific engines (medical translations, specific diplomatic documents...). These came as TMX files, which are easy for us to handle.
So this means that in order to adapt MT@EC to the needs of other public service administrations, all you need is corresponding language data, such as existing translations, translation memories, lexica and so on?
That's right, but the format is quite important. We have also received data in response to our appeals that was delivered as Word or PDF files in various languages, and we never did incorporate those. The translations were good, but it would have cost us quite a bit of time to extract and then align the texts. And the thing is that doing that for 100 pages takes quite some time, but 100 pages' worth of data just won't have much of an influence on our final output. Under CEF there is some funding for processing these kinds of documents, but bilingual TMX files are really the easiest for us.
Lexica and glossaries can be quite useful, too. Basically, anything bi- or multilingual that is already aligned is the easiest for us to use.
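A minimal sketch of why TMX is "easy to handle": it is plain XML, so aligned pairs can be pulled out with nothing beyond the standard library. The function below assumes a typical TMX layout (`<tu>` translation units holding per-language `<tuv>` variants, each with one `<seg>`); it is illustrative, not a full TMX 1.4 parser (for instance, it ignores region subtags such as `en-GB`).

```python
# Sketch of reading aligned segment pairs out of a TMX file using only
# the standard library. TMX is XML: each <tu> holds one translation
# unit, with per-language <tuv> variants containing a <seg>.
import xml.etree.ElementTree as ET

# In TMX, the language lives in the xml:lang attribute, which
# ElementTree exposes under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx_pairs(path, src_lang, tgt_lang):
    pairs = []
    root = ET.parse(path).getroot()
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[lang] = seg.text
        # Keep only units that have both requested languages.
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs
```

Run over a file with English and French variants, `read_tmx_pairs(path, "en", "fr")` would return a list of (English, French) segment pairs, already aligned and ready to use.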
With eTranslation, there will be an automated translation platform that provides more services than just the machine translation tool MT@EC. Can you give us a brief insight into what additional support will be provided through eTranslation?
This has to be user-driven. We are supposed to be a “multilingualism enabler”, so as our current and future users become aware of language needs that need to be filled, we will tackle them.
One thing that has come up so far is transliteration on its own, for example, for a list of the names of people attending a conference. Another is named-entity recognition as a separate module, so you could find out that La Gioconda is La Joconde is the Mona Lisa, thereby enabling multilingual searching, for example. These things would also be added as integral functions of MT@EC/eTranslation, but users who did not need a full translation could access them separately.
For us it is an interesting challenge to do whatever we can to fill gaps that users point out to us.