The value of language data

Since the COVID-19 pandemic, the importance of language-centric AI has increased significantly, in Europe and beyond, as Language Technologies (LT) provided valuable tools and services to facilitate, and in many cases to actually enable, the exchange of information. These changes in the way we work contributed to new trends and to a greater availability and uptake of language-centric AI in general. Today, the increased use of LT is no longer limited to Machine Translation: more and more organisations have recognised the usefulness of LT tools such as fake news detection, anonymisation, speech recognition and text-to-speech, to name just a few, in facilitating their daily operations.

Language Data Management and Sharing

For all LT applications, language data plays a crucial role. This is even more true given the exponential growth of digital communication platforms, which in turn increases the need for more efficient and reliable LT. Organisations, however, can only collect the amount of language data required to develop competitive language-centric AI if they invest considerable effort, in terms of both time and resources. For this reason, data sharing is increasingly considered the best way towards truly sustainable language data management. Nonetheless, in many EU countries, sharing language data is still not common practice, even though large volumes of data are produced daily in public administrations, research and industry.

During our investigation for the 2022 edition of the ELRC White Paper, we tried to find out more about language data management and sharing in European public administrations (PAs) and SMEs.

According to the results of our 2022 survey, the value of language data is being increasingly recognised all over Europe (see Figure 1): 17 of the ELRC National Anchor Points (NAPs) stated that their organisations store language data whenever possible (dark blue), while only 4 of them indicated that this is still hardly or never the case (yellow). In the remaining countries/organisations, language data is stored at least sometimes.

Similarly, the majority of the external survey contributors (59%) indicated that language data are stored whenever possible in their organisations, but the share of those who indicated that they hardly or never store such data is not negligible (19%).

Despite the encouraging results, this confirms that awareness-raising efforts must be maintained, including on the part of European governments. This is also reflected in the NAPs' answers: although 17 NAPs know that language data are explicitly mentioned in the AI regulations of their countries, only 4 are aware of a corresponding strategic or financial plan. Moreover, 6 stated that language data are mentioned only as a side note (e.g., as a useful example of AI), while 5 indicated that language data are not mentioned at all.

However, this doesn’t necessarily mean that the topic is completely disregarded in a country’s AI regulations. For example, the external survey contributors from Spain gave answers that in part diverged completely, and the largest group of them (34%) openly stated that they don’t know whether language data are mentioned in the national AI regulations. The more likely explanation is therefore a lack of communication and information in this respect.

In fact, we found numerous best-practice examples of how language data management is addressed in European AI regulations. To name just a few:

  • The Norwegian strategy includes a full chapter about LT and language data, which highlights the crucial importance of language resources, especially for NLP systems targeting less-resourced languages like the Sami languages.
  • The Spanish AI Strategy lists boosting the National LT Plan and creating resources in the Spanish language among its action items.
  • In Ireland, the value of language data is publicised: one of the action items is to move away from US-based language data and to use sources that include the everyday language used by Irish citizens. In addition, the development of language resources for Irish is mentioned as one of the key enablers for providing digital services in Irish.

Such developments confirm that the value of language data has increased significantly and will continue to increase in Europe, within public administrations and other organisations as well as in national regulations.

The 2022 edition of the ELRC White Paper is now available for download.