argia.eus
EusCrawl, a giant corpus of Basque words
  • The IXA group of Basque-speaking computer scientists at the UPV/EHU has compiled the largest corpus of Basque words assembled so far, processed it with the participation of the HiTZ center (and in collaboration with the company Meta), and prepared it for reuse in different formats. The materials are available under Creative Commons licenses, under the name EusCrawl.
Sustatu, March 24, 2022

In total, the corpus contains 12.5 million documents and 423 million words, extracted by crawling documents. It is available in two formats: JSONL and TXT. The address is http://ixa.ehu.eus/euscrawl/
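As an illustration of how a corpus in JSONL format (one JSON object per line) might be read, here is a minimal Python sketch that counts documents and whitespace-separated words. The field name "text" and the sample lines are assumptions made for this example, not a description of the actual EusCrawl schema, so check the published files before relying on it.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_documents(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def corpus_stats(lines: Iterable[str], text_field: str = "text") -> Tuple[int, int]:
    """Return (document count, whitespace-separated word count).

    `text_field` is a hypothetical field name; adjust it to the real schema.
    """
    n_docs = 0
    n_words = 0
    for doc in iter_documents(lines):
        n_docs += 1
        # Documents without the field contribute zero words.
        n_words += len(doc.get(text_field, "").split())
    return n_docs, n_words


# Tiny in-memory sample standing in for a real JSONL file.
sample = [
    '{"text": "Kaixo mundua"}',
    '{"text": "Euskarazko testu corpus bat"}',
]
print(corpus_stats(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so a large corpus can be processed line by line without loading it all into memory.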

The texts come from various sources, grouped according to the license under which the content can be reused. Content under the free CC BY-SA license comes from Wikipedia, Berria and Argia. Other portions include content from Hitza and from Bilbo Hiria Irratia.

What will this large EusCrawl corpus be used for? Its main application will be in language model technology based on artificial intelligence. As explained by the IXA group, "language models are trained on a large number of texts and, by reading the text, they are able to learn the structure of the language and create new texts. Language models are at the core of current language processing applications: in search and question answering, as well as in machine translation, speech recognition, dialogue systems and chatbots. In short, language models are the engine of most applications built around language, and texts are the fuel of this engine".

Building good language models requires a very large amount of text. For languages like English, finding text is not a problem, but it still has to be collected in such quantities, and so researchers have created a corpus called the Colossal Clean Crawled Corpus (C4), with 156 billion words.

EusCrawl is small in comparison, but one has to start somewhere. Moreover, large collections of Basque text already existed, but their quality was not fully reliable: Google's mC4 (1 billion words) and Meta AI's (formerly Facebook) CC100 (416 million words) are corpora downloaded automatically from the Internet, in which the language of each document was identified by a program.

In fact, although EusCrawl is smaller than mC4 and about the same size as CC100, it has already been used to create derivative products: the IXA group has created two language models trained on EusCrawl, one of which is currently the largest model for Basque, with 355 million parameters.

In addition, IXA has reported that EusCrawl will be used in the BigScience project, which aims to build a free, giant multilingual language model using five million hours of computation. The language model created in BigScience will also know Basque.

EusCrawl has been published on the Internet and has also been presented in an academic paper by five authors. It can be described as the work of the IXA group of the UPV/EHU, but also of the company Meta (formerly Facebook), through the computer scientist Mikel Artetxe, who acts as a bridge between IXA and Meta. The paper is also signed by Itziar Aldabe, Rodrigo Agerri, Olatz Perez de Viñaspre and Aitor Soroa.

Learn more about EusCrawl at Unibertsitatea.net.