What language resources are needed?
Why are language resources needed?
Language Resources are needed to improve the quality of machine translation in both general and specific domains. To improve the services of the CEF Automated Translation platform, the underlying automated translation systems must be trained with relevant Language Resources in all official languages of the 30 countries participating in the CEF programme. Large general-domain corpora, whether monolingual (e.g. official corpora of national languages) or multilingual, should be sought, as well as domain-specific Language Resources in fields such as consumer rights, culture, the legal domain, social security, health and public procurement. These domains will be covered by the online public services to be supported by CEF.AT.
Significant amounts of valuable linguistic data are generated every day in all Member States and CEF-affiliated countries, including by non-governmental and private organisations. A large part of this data can serve as Language Resources for the CEF.AT platform.
The European Language Resource Coordination action is looking for open data that can be made available for re-use through open data initiatives, but also for commercially available datasets.
Language Resources for the CEF.AT platform
Some datasets produced by public administrations can be used directly by the Automated Translation system: aligned corpora from translation memories, terminology resources, lexica and dictionaries. Many other resources are published as informational documents (reports, guides, flyers, records of administrative decisions, etc.) and need additional processing to become Language Resources, provided that they come in a reusable format. For example, a scanned PDF may be unusable if it was produced with simple OCR tools.
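To illustrate how an aligned corpus might be drawn from a translation memory, here is a minimal sketch. It assumes the memory is exported in TMX, a common XML exchange format that the text above does not name; the sample content, the `extract_pairs` helper and the language codes are all hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical TMX export; the last unit has no French translation.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1.0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Consumer rights</seg></tuv>
      <tuv xml:lang="fr"><seg>Droits des consommateurs</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Health insurance</seg></tuv>
      <tuv xml:lang="fr"><seg>Assurance maladie</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Public procurement</seg></tuv>
    </tu>
  </body>
</tmx>"""


def extract_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs found in a TMX document."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                segments[lang] = "".join(seg.itertext()).strip()
        # Keep only translation units that contain both languages.
        if segments.get(src_lang) and segments.get(tgt_lang):
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


pairs = extract_pairs(SAMPLE_TMX, "en", "fr")
for source, target in pairs:
    print(source, "|", target)
```

Translation units that lack one of the two languages are skipped, so the output contains only sentence pairs usable as parallel training data.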
Which types of data are useful for MT training?
Language Resources relevant to Automated Translation come in various types:
Translation memories: linguistic databases that capture translations made by humans. They can be used to facilitate future translation tasks and also to train automated translation systems
Translation/language models: statistical models that assign a probability to a piece of unseen text, based on training data
Corpora: monolingual and multilingual corpora, comparable, aligned, parallel documents
Lexica: monolingual and multilingual lists of words, multi-words, sentences, etc. in general or specific subject fields
Terminological resources: structured sets of concepts, with associated linguistic information in a specific subject field
Grammars: sets of rules that formalize a language
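As an illustration of the language model entry in the list above, the following sketch trains a toy bigram model with add-one smoothing and uses it to score unseen text. This is one simple way such a model can be built, not a description of the models used by the CEF.AT platform; the training sentences and function names are made up for the example.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count bigrams and their left contexts in whitespace-tokenised text."""
    contexts = Counter()
    bigrams = Counter()
    vocab = {"</s>", "<unk>"}
    for sentence in sentences:
        words = sentence.lower().split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return contexts, bigrams, vocab


def log_probability(sentence, model):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    contexts, bigrams, vocab = model
    words = [w if w in vocab else "<unk>" for w in sentence.lower().split()]
    tokens = ["<s>"] + words + ["</s>"]
    vocab_size = len(vocab)
    logp = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        logp += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
    return logp


# Hypothetical training data standing in for a monolingual corpus.
training = [
    "the citizen submits the form",
    "the office reviews the form",
    "the citizen contacts the office",
]
model = train_bigram_model(training)

fluent = log_probability("the citizen reviews the form", model)
scrambled = log_probability("form the reviews citizen the", model)
print(fluent > scrambled)
```

Neither test sentence appears in the training data, yet the model assigns the fluent word order a higher probability than the scrambled one, which is the property automated translation systems rely on when choosing among candidate outputs.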
Where to look for the Language Resources?
Relevant sources of valuable Language Resources are Public Services in the EU Member States and in the countries associated with the CEF programme.
These could be public services with a mission at the national, regional, local, cross-border or cross-country (bilateral, multilateral) level, as well as international organisations with a European basis and mission. Examples include Head of State offices, national or federal ministries, parliaments, regional governments and local authorities. They could also be public administrations responsible for online e-government platforms and services in the CEF-relevant areas (e.g. consumer rights, culture, legal domain, social security, health, public procurement), publication offices of ministries, documentation centres, national libraries, etc.