Language-centric Artificial Intelligence applications are popping up all around us in our everyday lives. Voice assistants, TV show recommenders and automated translation services, to name a few, are being used more and more. The development of such powerful applications is fuelled by three key factors: algorithms, computing power, and data. Nowadays, well-established deep learning algorithms are available and can be implemented via open-source machine learning libraries. In parallel, hardware IT infrastructures are continuously extended in terms of computing power and big data storage. The third factor concerns the acquisition of sizable data appropriate for the problem at hand (e.g. millions of translated sentence pairs in various language combinations for training Machine Translation engines).
To this end, the Institute for Language and Speech Processing / Athena Research Centre, one of the founding partners of the ELRC initiative, has set up a workflow and developed a pipeline for parallel language data acquisition from the web. The ELRC workflow supports several CEF languages in any domain. To address the EC’s requirements for domain-specific language data, the three domains in focus are currently “Health”, “Culture” and “Scientific Research”.
The process is triggered by identifying multilingual or bilingual websites with content related to the targeted domains. The main sources are websites of national agencies, international organisations and broadcasters. Then, the ILSP Focused Crawler (ILSP-FC) toolkit is used to acquire the main content of the detected websites and to identify pairs of candidate parallel documents. Depending on the format of the source data, efficient methods for text extraction are applied, including for instance OCR on PDF files. The next step leverages multilingual embeddings to extract Translation Units (TUs). Finally, a battery of criteria is applied to filter out TUs of limited or no use (e.g. sentences containing only numbers) and thus generate parallel LRs of high quality. It is worth mentioning that the constructed datasets are clustered into groups, according to the conditions of use indicated on the websites the data originated from.
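The embedding and filtering steps can be sketched in miniature as follows. This is not the actual ILSP pipeline: the `embed` function below is a placeholder bag-of-words model standing in for real multilingual sentence embeddings, and the names (`mine_translation_units`, `is_useful`, the 0.5 threshold) are illustrative assumptions. The idea it demonstrates is the same, though: pair each source sentence with its most similar target sentence by cosine similarity, then discard candidate TUs that fail simple quality criteria, such as sentences containing only numbers.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Placeholder embedding: bag-of-words term counts.

    A real pipeline would use multilingual sentence embeddings from a
    pretrained model, which map translations to nearby vectors even
    across languages; word counts only work within one language.
    """
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_useful(src, tgt):
    """One example filtering criterion: both sides must contain at
    least one letter, so number-only 'sentences' are dropped."""
    has_letters = lambda s: bool(re.search(r"[^\W\d_]", s))
    return has_letters(src) and has_letters(tgt)


def mine_translation_units(src_sents, tgt_sents, threshold=0.5):
    """Greedily pair each source sentence with its best-scoring
    target sentence, keeping only pairs above the threshold that
    also pass the filtering criteria."""
    tus = []
    for src in src_sents:
        src_vec = embed(src)
        best, best_score = None, threshold
        for tgt in tgt_sents:
            score = cosine(src_vec, embed(tgt))
            if score > best_score:
                best, best_score = tgt, score
        if best is not None and is_useful(src, best):
            tus.append((src, best))
    return tus
```

In a production setting the pairwise comparison would be replaced by an approximate nearest-neighbour search over precomputed embeddings, since comparing every source sentence against every target sentence does not scale to web-sized crawls.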
In the current COVID-19 crisis, the ELRC language data collection activities also addressed the growing demand for improved technology-enhanced multilingual access to COVID-19 information. As part of the data collection activities in the “Health” domain, efforts focused on identifying reliable sources of language data and on compiling dedicated resources on the pandemic. To this end, the relevant MEDISYS metadata collections have been parsed and harvested in order to extract pairs of parallel sentences from comparable corpora by applying the above-mentioned workflow. Parts of these datasets were offered to the MLIA-Eval initiative.
More than 40 million TUs for EN-X language pairs have been collected as part of the ELRC activities during the last two years. Further to the above, a considerable number of TUs have been identified for X-Y language pairs, where X and Y are CEF languages other than EN, while millions of TUs have been extracted from websites with multi-domain content, with the aim of clustering them into domain-specific subsets. In total, the constructed LRs comprise approximately 80 million TUs.1
Depending on their conditions of use as indicated by the source website, parts of these datasets are available for download through the ELRC-SHARE repository (https://elrc-share.eu/).
1 Note: Data acquisition and the identification of parallel sentences are work in progress. The numbers provided reflect only the current state of these tasks and are growing constantly.