Automatically translated from Basque; the translation may contain errors.

EusCrawl, a giant corpus of Basque text

  • The IXA group of Basque computer scientists at the UPV/EHU has compiled the largest Basque text corpus built so far, processed it with the participation of the HiTZ center (and in collaboration with the company Meta), and prepared it for reuse in several formats. The materials are available under Creative Commons licenses, under the name EusCrawl.
This article is republished here under the CC BY-SA 3.0 license.

24 March 2022 - 09:10

In total, the corpus contains 12.5 million documents and 423 million words, extracted by crawling websites. It is available in two formats: JSONL and TXT. The address is http://ixa.ehu.eus/euscrawl/
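
As a rough illustration of how a JSONL corpus like this could be inspected, the sketch below counts documents and whitespace-separated words in a stream with one JSON object per line. The field name "text" and the sample records are assumptions made for the example; the actual EusCrawl files may use different field names, so check the downloaded data before relying on this.

```python
import io
import json


def corpus_stats(lines, text_field="text"):
    """Count documents and whitespace-separated words in a JSONL stream.

    `text_field` is a hypothetical field name, not confirmed for EusCrawl.
    """
    n_docs = 0
    n_words = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # one JSON object per line
        n_docs += 1
        n_words += len(record.get(text_field, "").split())
    return n_docs, n_words


# Tiny in-memory stand-in for a corpus file (invented sample records).
sample = io.StringIO(
    '{"text": "Kaixo mundua"}\n'
    '{"text": "Euskara hizkuntza bat da"}\n'
)
docs, words = corpus_stats(sample)
print(docs, words)
```

For a real file, the same function could be called on an open file handle, for example `with open(path, encoding="utf-8") as f: corpus_stats(f)`, which reads one line at a time and so avoids loading the whole corpus into memory.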

The texts come from various sources, and the license under which each can be reused depends on the source. Content from Wikipedia, Berria and Argia is under the free CC BY-SA license. Other portions include content from Hitza and from Bilbo Hiria Irratia.

What will this large EusCrawl corpus be used for? Its main application will be language models, a technology based on artificial intelligence. As explained by the IXA group, "language models are trained on a large number of texts and, by reading the text, they are able to learn the structure of the language and create new texts. Language models are at the core of current language processing applications: in search and question answering, as well as in machine translation, speech recognition, dialogue systems and chatbots. In short, language models are the engine of most applications built around language, and texts are the fuel of this engine".

The amount of text needed to build good language models is very large. Finding texts for languages like English is not a problem, but those amounts still have to be collected, and so researchers have built a corpus called the Colossal Clean Crawled Corpus (C4), with 156 billion words.

EusCrawl is small by comparison, but one has to start somewhere. Moreover, large text collections already existed for Basque, but their quality was not fully reliable: the mC4 (1 billion words) and CC100 (416 million words) corpora, from Google and Meta AI (formerly Facebook), were downloaded automatically from the Internet, with the language of each document identified by a program.
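
To put these figures side by side, the short calculation below uses only the word counts quoted in this article. The labels and the function name are illustrative, and the counts are the rounded values given in the text, not measurements.

```python
# Word counts as quoted in the article (rounded figures).
EUSCRAWL_WORDS = 423_000_000
OTHER_CORPORA = {
    "C4": 156_000_000_000,
    "mC4 (Basque figure quoted)": 1_000_000_000,
    "CC100 (Basque figure quoted)": 416_000_000,
}


def size_ratios(reference, others):
    """Return each corpus size divided by the reference corpus size."""
    return {name: count / reference for name, count in others.items()}


ratios = size_ratios(EUSCRAWL_WORDS, OTHER_CORPORA)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} times the size of EusCrawl")
```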

In fact, although EusCrawl is smaller than these, it has already been used to create derivative products: the IXA group has trained two language models on EusCrawl, one of which is the largest model for Basque to date, with 355 million parameters.

In addition, IXA has reported that EusCrawl will be used in the BigScience project, which aims to build a free, giant multilingual language model using five million hours of computation. The language model created in BigScience will also know Basque.

EusCrawl has been published on the Internet and has also been presented in an academic paper written by five people. It is the work of the IXA group of the UPV/EHU, but also of the company Meta (formerly Facebook), through the computer scientist Mikel Artetxe, who acts as a bridge between IXA and Meta. The paper is also signed by Itziar Aldabe, Rodrigo Agerri, Olatz Perez de Viñaspre and Aitor Soroa.

Learn more about EusCrawl at Unibertsitatea.net.

