Gathering Data and Building a Neural Engine Farm for all Language Combinations for eTranslation: NTEU
Language technologies are key to creating a digital single market across language barriers. The European Commission’s machine translation system eTranslation is an essential tool for this, as it allows public administrations, universities, ministries – and now also SMEs – to communicate and exchange information across languages and borders. However, some European public administrations may require on-premises installations that translate from and into their national language, whilst the data used to build the language combinations may benefit the community as a whole.
To address this situation, Pangeanic, Tilde and KantanMT joined forces for the CEF Telecom action “Neural Translation for the European Union” (NTEU), which started in September 2019. The aim of this action is to extend the coverage of eTranslation by building a neural engine farm that includes all EU official language combinations, e.g. Spanish to German, Latvian to Swedish, Croatian to Italian or French to Polish, to name a few.
According to Manuel Herranz, CEO of Pangeanic, there is an increasing need for machine translation in European public administrations and “(t)here are many reasons why, in the 21st century, a country’s public sector would need machine translation as a state technology and a service to its citizens”.
As a consequence, the NTEU action pursues an ambitious goal: creating the largest ever set of direct language-to-language engine combinations. To do so, NTEU will build direct machine translation engines between any of the 24 EU official languages, without pivoting through a third language. This will result in 23 x 23 = 529 NMT engines, which will be shared with the European Commission and public administrations of the EU Member States.
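The size of such an engine farm follows from simple counting, which the minimal sketch below illustrates. It assumes one engine per ordered source-target pair and enumerates the pairs over the ISO 639-1 codes of the 24 EU official languages; the names `EU_LANGUAGES` and `direct_pairs` are illustrative and not part of NTEU. Under that assumption the count is 24 x 23 = 552, so the figure of 23 x 23 = 529 quoted above presumably rests on a different counting convention that the text does not spell out.

```python
from itertools import permutations

# ISO 639-1 codes of the 24 EU official languages.
EU_LANGUAGES = [
    "bg", "hr", "cs", "da", "nl", "en", "et", "fi",
    "fr", "de", "el", "hu", "ga", "it", "lv", "lt",
    "mt", "pl", "pt", "ro", "sk", "sl", "es", "sv",
]


def direct_pairs(languages):
    """Return every ordered (source, target) pair with source != target.

    Assumption: one engine per direction, so es->de and de->es count
    as two separate engines.
    """
    return list(permutations(languages, 2))


pairs = direct_pairs(EU_LANGUAGES)
print(len(EU_LANGUAGES), "languages ->", len(pairs), "directed pairs")
```

The examples named earlier, such as Spanish to German (`("es", "de")`) or Latvian to Swedish (`("lv", "sv")`), are all members of `pairs`.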
Collecting the training data will, however, also be a great challenge, as NTEU will require massive amounts of viable parallel corpora from the administrative, general, medical and other domains: 15M segments per corpus for well-resourced languages, 12M for medium-resourced languages and 10M for lower-resourced languages. The consortium will draw on other projects’ data-gathering efforts and on DGT material, donate data from its own resources, and use high-quality synthetic data to obtain engines that perform at near-human quality.
The data collected by ELRC will also play a significant role in this project, as it will be used to customize machine translation engines that extend the CEF eTranslation service, alongside general data, in-domain synthetic data and other additional collections. At the same time, all data used in the NTEU project will be made available through the ELRC-SHARE repository for the benefit of the community.