Automatically translated from Basque; the translation may contain errors.

EusCrawl, giant corpus of words in Basque

  • The IXA group of Basque-speaking computer scientists at the UPV/EHU has compiled the largest Basque text corpus built so far, processed it with the participation of the Hitz center (and in collaboration with the company Meta), and prepared it for reuse in several formats. The materials are available under Creative Commons licenses, under the name EusCrawl.
This article is republished here under the CC BY-SA 3.0 license.

24 March 2022 - 09:10

In total, the corpus contains 12.5 million documents and 423 million words, extracted by crawling documents from the web. It is available in two formats: JSONL and TXT. The address is http://ixa.ehu.eus/euscrawl/
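As a rough illustration of what working with the JSONL format could look like, the sketch below counts documents and whitespace-separated words in JSONL lines. The field name `text` and the sample records are assumptions made for this example; the article does not describe the actual schema of the EusCrawl files.

```python
import json
from typing import Iterable, Tuple


def corpus_stats(lines: Iterable[str], text_field: str = "text") -> Tuple[int, int]:
    """Count documents and whitespace-separated words in JSONL lines.

    `text_field` is a hypothetical field name; adjust it to the real schema.
    """
    docs = 0
    words = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # one JSON object per line
        docs += 1
        words += len(str(record.get(text_field, "")).split())
    return docs, words


# Made-up sample records, not taken from the corpus.
sample = [
    '{"text": "Kaixo mundua"}',
    '{"text": "EusCrawl euskarazko corpus bat da"}',
]
print(corpus_stats(sample))
```

For a real file, the same function could be given an open file handle, since iterating over a text file yields one line at a time.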

The texts come from various sources, grouped by the license under which the content can be reused. Content under the free CC BY-SA license was obtained from Wikipedia, Berria and Argia. Other subsets include content from Hitza and from Bilbo Hiria Irratia.

What will this large EusCrawl corpus be used for? Its main application is language model technology based on artificial intelligence. As the IXA group explains, "language models are trained on a large number of texts and, by reading the text, they are able to learn the structure of the language and create new texts. Language models are at the core of current language processing applications: search and question answering, machine translation, speech recognition, dialogue systems and chats. In short, language models are the engine of most applications built around language, and texts are the gasoline of that engine".

The amount of text needed to build good language models is very large. Finding texts for languages like English is not a problem, but those amounts still have to be collected, and so scientists have created a corpus called the Colossal Clean Crawled Corpus (C4), with 156 billion words.

EusCrawl is small in comparison, but one has to start somewhere. In addition, large text collections already existed for Basque, but their quality has not been fully reliable: Google's mC4 (1 billion words) and Meta AI's (formerly Facebook) CC100 (416 million words) are corpora downloaded automatically from the Internet, with the language of each document identified by a program.

In fact, although EusCrawl is smaller than these, it has already been used to create derivative products: the IXA group has trained two language models on EusCrawl, one of which is the largest model for Basque to date, with 355 million parameters.

In addition, IXA has reported that EusCrawl will be used in the BigScience project, which aims to build a free, giant multilingual language model using five million hours of computation. The language model created in BigScience will also know Basque.

EusCrawl has been published on the Internet and has also been presented in an academic paper by five people from the IXA group. It can be considered the work of the IXA group of the UPV/EHU, but also of the company Meta (formerly Facebook), through the computer scientist Mikel Artetxe, who acts as a bridge between IXA and Meta. Itziar Aldabe, Rodrigo Agerri, Olatz Perez de Viñaspre and Aitor Soroa also signed the paper.

Learn more about EusCrawl at Unibertsitatea.net.

