ELRC participated in LREC 2022 in Marseille with an on-site booth at the HLT Village and a remote presentation of the paper "ELRC Action: Covering Confidentiality, Correctness and Cross-linguality", which describes the LT assessments performed as part of the ELRC action of the European Commission.
The assessments consist of testing various tools and techniques, documenting them in a hands-on way, performing experiments with them, and setting up proof-of-concept environments that demonstrate their potential and their challenges to EC staff and EU Member State representatives, thus facilitating their uptake by public sector users. The paper zoomed in on the two most extensive assessments (LT specifications), including a consultation round with various types of stakeholders.
- In the Automated Anonymisation specification, tools and techniques for de-identifying monolingual or bilingual texts were investigated. They aim at replacing Named Entities (NE) and specific patterns with NE labels or other words of a similar type, thus supporting the effort to make (unstructured) text GDPR-compliant for organisations that want to store and process text containing personal information and/or share it with other organisations. The sensitivity of the information influences the choice of replacement strategy: using similar words instead of NE labels might be better suited to hampering malicious attempts at re-identification, but is also more challenging from a linguistic point of view.
One of the long-term goals is to give the user of an anonymisation tool sufficient control, i.e. the possibility to create custom NE lists and patterns and, ideally, to run the tool in-house. However, a lot of exploration is still needed before MT systems can properly translate anonymised text.
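The two replacement strategies and the idea of user-controlled NE lists and patterns can be illustrated with a minimal sketch. It is not the tooling assessed in the specification: the entity list, the regular expressions, the surrogate values and the function name `anonymise` are all hypothetical and chosen only for illustration, and a real tool would rely on a trained NE recogniser rather than a fixed list.

```python
import re

# Hypothetical custom NE list (surface form -> NE label), for illustration only.
CUSTOM_ENTITIES = {
    "Jane Doe": "PERSON",
    "Acme Corp": "ORGANISATION",
}

# Hypothetical patterns for structured identifiers.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d /-]{7,}\d"),
}

# Hypothetical surrogates for the "word of a similar type" strategy.
SURROGATES = {
    "PERSON": "Alex Smith",
    "ORGANISATION": "Example Ltd",
    "EMAIL": "user@example.org",
    "PHONE": "+00 000 000 000",
}


def anonymise(text, strategy="label"):
    """Replace listed entities and pattern matches in `text`.

    strategy="label" inserts the NE label, e.g. [PERSON];
    strategy="surrogate" inserts a word of a similar type.
    """
    if strategy not in ("label", "surrogate"):
        raise ValueError("strategy must be 'label' or 'surrogate'")

    def replacement(label):
        return f"[{label}]" if strategy == "label" else SURROGATES[label]

    # Pattern-based identifiers first.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(lambda m, lab=label: replacement(lab), text)

    # Then the custom list, longest surface forms first so that a longer
    # name is not partially replaced by a shorter one it contains.
    for surface in sorted(CUSTOM_ENTITIES, key=len, reverse=True):
        label = CUSTOM_ENTITIES[surface]
        text = re.sub(
            rf"\b{re.escape(surface)}\b",
            lambda m, lab=label: replacement(lab),
            text,
        )
    return text


sample = "Jane Doe (jane.doe@acme.example, +32 2 555 01 23) works at Acme Corp."
labelled = anonymise(sample, "label")
surrogate = anonymise(sample, "surrogate")
print(labelled)
print(surrogate)
```

The label variant makes it obvious that text was removed, whereas the surrogate variant keeps the sentence fluent, which is the property that makes re-identification harder but also raises linguistic issues such as agreement and inflection in morphologically rich languages.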
- In the Multilingual Fake News Processing specification, tools and techniques for detecting articles that spread false information were investigated. The goal is to keep readers from being deceived and, as such, to help prevent damage on political or other levels. Despite the global character of disinformation, the publicly available datasets required for training deep learning models cover only a limited number of languages. Therefore, the specification concentrates on ways to increase multilingual support.
In addition to experiments with supervised classification, which make use of text-inherent as well as categorical and numerical features (such as the Alexa rank), a novel approach for unsupervised classification was proposed. This approach applies anomaly detection: a model is trained on various types of features, using only articles known to constitute true news. This strategy aims at reducing the impact of data sparsity, at the level of languages as well as topics. When the model is applied to an unseen article, the article is considered fake news if it is identified as an anomaly.
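The principle of training on true news only and flagging anomalies can be sketched as follows. This is a deliberately simple stand-in, not the model from the specification: it uses a per-feature z-score check instead of a dedicated anomaly detection algorithm, and the features, the `site_rank` values, the threshold and the example texts are all hypothetical.

```python
from math import log10
from statistics import mean, stdev


def features(text, site_rank):
    """Hypothetical feature vector: two text-inherent features, one numerical."""
    words = text.split()
    n = max(len(words), 1)
    exclamations = text.count("!") / n  # exclamation marks per word
    shouting = sum(w.isupper() and len(w) > 1 for w in words) / n  # all-caps share
    return [exclamations, shouting, log10(site_rank)]


class TrueNewsProfile:
    """Learns mean and spread per feature from true news; flags outliers."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold  # assumed cut-off, not taken from the paper

    def fit(self, vectors):
        columns = list(zip(*vectors))
        self.means = [mean(c) for c in columns]
        # Floor the spread so a constant feature does not divide by zero.
        self.stdevs = [max(stdev(c), 1e-6) for c in columns]
        return self

    def score(self, vector):
        # Largest absolute z-score over all features.
        return max(
            abs(x - m) / s for x, m, s in zip(vector, self.means, self.stdevs)
        )

    def is_anomaly(self, vector):
        return self.score(vector) > self.threshold


# Training data: articles assumed to be true news (invented examples).
true_news = [
    ("The council approved the budget on Monday.", 120),
    ("Rainfall totals were above average this spring!", 340),
    ("The EU published new guidance for public bodies.", 95),
    ("Researchers released a multilingual dataset.", 410),
]
model = TrueNewsProfile().fit([features(t, r) for t, r in true_news])

plausible = features("The ministry announced a new transport plan.", 200)
suspicious = features("SHOCKING!!! Doctors HATE this secret cure!!!", 250000)

plausible_flag = model.is_anomaly(plausible)
suspicious_flag = model.is_anomaly(suspicious)
print(plausible_flag, suspicious_flag)
```

Because no fake news examples are needed for training, the same procedure could in principle be repeated for any language or topic for which reliable true news is available, which is the data sparsity argument made above.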