Discover Automated Translation
Automated translation, also known as machine translation, allows users to instantly translate words, sentences, full documents, and websites from one language to another. Translations are performed at speeds of up to one sentence per second, far faster than any human translator can work.
Although automated translation does not provide the same level of quality and accuracy as human translation, it provides quick insight into the general meaning or “gist” of a text, thus helping to cross language barriers between nations and to facilitate multilingual communication.
To ensure quick translations of texts, automated translation systems are trained on huge amounts of existing human translations. Using sophisticated algorithms, the systems then mine this parallel data (source texts paired with their human translations) to produce instant translations of new texts.
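As a rough illustration of what "mining parallel data" can mean, the sketch below scores word pairs from a tiny invented English/French corpus with the Dice coefficient, a simple co-occurrence measure. The corpus, the function names and the choice of the Dice coefficient are all assumptions made for this example; production systems rely on far more sophisticated statistical or neural models.

```python
from collections import Counter
from itertools import product

# Toy parallel corpus, invented for illustration: (English, French) sentence pairs.
PARALLEL = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
    ("a car", "une voiture"),
]

def dice_scores(pairs):
    """Score every co-occurring (source word, target word) pair with the Dice coefficient."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)                   # sentences containing each source word
        tgt_count.update(tgt_words)                   # sentences containing each target word
        joint.update(product(src_words, tgt_words))   # sentence pairs containing both words
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }

def best_translation(word, scores):
    """Return the target word with the highest score for `word`, or None if unseen."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get) if candidates else None

scores = dice_scores(PARALLEL)
print(best_translation("house", scores))  # prints "maison"
```

Words that always appear together across sentence pairs score 1.0, while incidental co-occurrences score lower, which is why more parallel data generally gives more reliable pairings.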
Automated translation systems can be further improved by adding industry-specific terminology, linguistic rules, monolingual data, and other language resources. This effectively customizes, or tailors, a system to a particular domain or industry.
What is eTranslation?
CEF eTranslation builds on the existing MT@EC service to create a truly pan-European automated translation platform, providing high-quality translations in all official EU languages and in various domains. CEF eTranslation helps European and national public administrations exchange information across language barriers in the EU.
The main purpose of CEF eTranslation is to make all Digital Service Infrastructures (DSIs) multilingual. While CEF eTranslation is mainly intended to be integrated into other digital services, it also offers useful stand-alone services for translating documents or snippets of text.
Unlike general-purpose web translators, CEF eTranslation will be adapted to specific terminology and text types that are typical for the usage context (e.g. tender documents, legal texts, medical terminology). It enables multilingual operation of digital services and can be used to reduce the time and cost of translating documents.
ELRC WHITE PAPER
The ELRC White Paper “Sustainable Language Data Sharing to Support Language Equality in Multilingual Europe – Why Language Data Matters” provides an analysis of European practices for sharing language data and the corresponding challenges, as well as clear recommendations for policy-level decision-makers on how to overcome these challenges.
The Resource Collection Guidelines
The term language resources refers to sets of language data and descriptions in machine-readable form, including written and spoken corpora, grammars, and terminology databases. The Resource Collection Guidelines provide a step-by-step manual on how to get involved in the ELRC initiative and how to contribute language resources to improve the eTranslation DSI.
Guidelines for CEF generic services projects
This document is addressed to the consortia members of CEF-funded projects (generic services projects). It is a concise step-by-step guide to contributing language resources through the ELRC-SHARE repository (Section 2). It additionally provides recommendations for the technical and legal validation of language resources, which the CEF generic services projects may wish to adopt (Sections 3 and 4).
The term language resources refers to sets of language data and descriptions in machine-readable form, including written and spoken corpora, grammars, and terminology databases. Language resources can be used to build, improve, or evaluate natural language systems such as machine translation engines.
Why are language resources needed?
Language Resources are needed to improve the quality of machine translation in both general and specific domains. To improve the services of the CEF Automated Translation (CEF.AT) platform, the underlying automated translation systems must be trained with relevant Language Resources in all official languages of the 30 countries participating in the CEF programme. Large general-domain corpora, whether monolingual (e.g. official corpora of national languages) or multilingual, should be sought, as well as domain-specific Language Resources in fields such as consumer rights, culture, the legal domain, social security, health and public procurement. These domains will be covered by the online public services that CEF.AT is to support.
Significant amounts of valuable linguistic data are generated every day in all Member States and CEF-affiliated countries, including by non-governmental and private organisations. A large part of this data can be very valuable as Language Resources for the CEF.AT platform.
The European Language Resource Coordination action is looking for open data that can be made available for re-use through open data initiatives, but also for commercially available datasets.
Language Resources for the CEF.AT platform
Some datasets produced by public administrations can be used directly by the Automated Translation system: aligned corpora from translation memories, terminology resources, lexica and dictionaries. Many other resources are published as information documents (reports, guides, flyers, records of administrative decisions, etc.), which need additional processing to be turned into Language Resources, provided that they come in a reusable format. For example, a scanned PDF may be unusable if it was produced with simple OCR tools.
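To show why an exported translation memory counts as directly usable, here is a minimal sketch that reads aligned segment pairs from a TMX (Translation Memory eXchange) document using only the Python standard library. The text above does not name a file format, so TMX is an assumption here, and the sample content and the function name `aligned_pairs` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample: two translation units, only the first has a German variant.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Social security</seg></tuv>
      <tuv xml:lang="de"><seg>Soziale Sicherheit</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Public procurement</seg></tuv>
      <tuv xml:lang="fr"><seg>Marches publics</seg></tuv>
    </tu>
  </body>
</tmx>
"""

def aligned_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs for units that contain both languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup elements.
                segments[tuv.get(XML_LANG, "").lower()] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs

print(aligned_pairs(SAMPLE_TMX, "en", "de"))
```

Because each translation unit already pairs a source segment with its translation, no sentence alignment step is needed. A scanned report, by contrast, would first require text extraction, cleaning and alignment.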
Which types of data are useful for MT training?
Language Resources relevant to the Automated Translation requirements come in various types:
Translation memories: linguistic databases that capture translations made by humans. They can be used to facilitate future translation tasks but also for training automated translation systems
Translation/language models: statistical information that assigns a probability to a piece of unseen text, based on some training data
Corpora: monolingual and multilingual corpora, comparable, aligned, parallel documents
Lexica: monolingual and multilingual lists of words, multi-words, sentences, etc. in general or specific subject fields
Terminological resources: structured sets of concepts, with associated linguistic information in a specific subject field
Grammars: sets of rules that formalize a language
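Two of the resource types above can be made concrete with a short sketch: a translation memory queried by fuzzy matching, and a language model that assigns a probability to unseen text. Everything here is an assumption chosen for illustration: the sample data, the function names, the 0.75 similarity threshold, the use of `difflib` for matching, and the add-one-smoothed bigram model. None of it reflects how the CEF platform is implemented.

```python
import difflib
import math
from collections import Counter

# Hypothetical translation memory: source segment -> stored human translation.
TM = {
    "Submit the tender before the deadline.": "Reichen Sie das Angebot vor Ablauf der Frist ein.",
    "The contract was awarded.": "Der Auftrag wurde vergeben.",
}

def tm_lookup(segment, memory, threshold=0.75):
    """Return (stored source, translation, similarity) for the closest match, or None."""
    best, best_ratio = None, 0.0
    for source in memory:
        ratio = difflib.SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if ratio > best_ratio:
            best, best_ratio = source, ratio
    if best is None or best_ratio < threshold:
        return None
    return best, memory[best], best_ratio

def train_bigram_lm(sentences):
    """Count bigrams and return a function giving an add-one-smoothed log-probability."""
    context, bigrams, vocab = Counter(), Counter(), {"</s>", "<unk>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens[1:])                  # every word that can be predicted
        context.update(tokens[:-1])               # every word that can be a context
        bigrams.update(zip(tokens, tokens[1:]))
    size = len(vocab)

    def log_prob(text):
        tokens = ["<s>"] + text.lower().split() + ["</s>"]
        # Unseen bigrams have a count of zero, so they receive only the smoothing mass.
        return sum(
            math.log((bigrams[(a, b)] + 1) / (context[a] + size))
            for a, b in zip(tokens, tokens[1:])
        )

    return log_prob

match = tm_lookup("Submit the tender before the deadline!", TM)
log_prob = train_bigram_lm(["the contract was awarded", "the tender was submitted"])
print(match)
print(log_prob("the contract was submitted") > log_prob("submitted was contract the"))
```

The lookup returns the stored translation for a near-identical segment, and the language model ranks the fluent word order above the scrambled one even though neither sentence occurred in its training data.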
Where to look for the Language Resources?
Relevant sources of valuable Language Resources are Public Services in the EU Member States and countries associated with the CEF programme.
These could be public services with a mission at the national, regional, local, cross-border or cross-country (bilateral, multilateral) level, as well as international organisations with a European basis and mission. Examples include Head of State offices, national or federal ministries, parliaments, regional governments and local authorities. They could also be public administrations responsible for online e-government platforms and services in the CEF-relevant areas (e.g. consumer rights, culture, legal domain, social security, health, public procurement), publication offices of ministries, documentation centres, national libraries, etc.
Use Case – Reuse of Emergency Calls embedded in TV Shows
In this document, the ELRC legal helpdesk analyses the legal conditions under which audio, video and dialogue subtitles from emergency calls embedded in a German TV show can be re-used for developing AI models. The analysis reviews several legal aspects specific to German legislation, including intellectual property and copyright protection, and addresses the use and sharing of different types of data (audio, video and transcriptions of the dialogues) and their derivatives for research and commercial purposes.