Following the recent technical workshop by ELRC and DG CNECT on large language models (LLMs), it is worth taking a closer look at European developments in this area, particularly where they concern less widely supported and morphologically rich languages like Polish.
The National Information Processing Institute (Ośrodek Przetwarzania Informacji – Państwowy Instytut Badawczy, OPI PIB), a Polish interdisciplinary research institute, can boast notable achievements in this field. Experts from its Laboratory of Linguistic Engineering (LIL) developed the Polish RoBERTa large model, which was trained on the largest text corpus ever used for Polish.
The work started with the extension of an existing text corpus – a collection of about 15 GB of text data previously used to train an ELMo model. As BERT-type models have a much larger capacity and require a correspondingly large dataset to fully exploit their potential, in December 2019 OPI PIB experts began downloading data from Common Crawl, a public archive containing petabytes of web page copies. The Common Crawl data from November and December 2019 and January 2020 made it possible – after filtering and cleaning – to accumulate a sufficiently large set. The actual training of the model lasted from February to May 2020. With a corpus of 130 GB of data, equivalent to over 400 thousand books, Polish RoBERTa large became the largest model ever trained in Poland.
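The article does not describe OPI PIB's actual filtering pipeline, but web-corpus cleaning of this kind typically involves at least dropping short fragments and exact duplicates. A minimal, purely illustrative sketch (all function names and thresholds are assumptions, not the Institute's method):

```python
import hashlib

def clean_corpus(lines, min_chars=80):
    """Generic web-text cleaning sketch: drop very short lines
    (likely boilerplate or fragments) and exact duplicates.
    This is NOT the OPI PIB pipeline, only a common baseline."""
    seen = set()
    kept = []
    for line in lines:
        text = line.strip()
        if len(text) < min_chars:  # discard short fragments
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:  # discard exact duplicates
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```

Production pipelines usually add language identification, near-duplicate detection, and quality heuristics on top of such a baseline.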
The model was tested using the KLEJ benchmark (Kompleksowa Lista Ewaluacji Językowych, "Comprehensive List of Language Evaluations") developed by Allegro, which made it possible to evaluate the model's performance on nine tasks, such as sentiment analysis and semantic similarity of texts. Based on the KLEJ evaluation, the OPI PIB model took first place in the ranking.
In 2021, updated versions of the Polish RoBERTa models and a GPT-2 model designed for text generation tasks were released. The base part of their training corpus consists of high-quality texts (Wikipedia, documents of the Polish parliament, social media statements, books, articles, and longer written forms). The web part of the corpus, in turn, includes extracts from websites (the Common Crawl project) that were filtered and thoroughly cleaned beforehand.
It takes about 3–4 months to train a single neural language model, but the results are very promising. All neural models developed at OPI PIB target Polish-language text, which is particularly valuable, as most solutions of this type are developed for English. These transformer-type models allow for a precise representation of the syntax and semantics of Polish and make it possible to build advanced Polish language processing tools.
Commendably, the Institute makes the models publicly available free of charge on its website: https://opi.org.pl/modele-uczenia-maszynowego-udostepnione-przez-opi-pib/
In September, researchers from the Institute are expected to deliver a presentation at the 3rd National ELRC workshop in Warsaw.