EusCrawl, a giant Basque text corpus

  • The IXA group of Basque computer scientists at the UPV/EHU has compiled the largest Basque text corpus assembled so far, processed it with the participation of the Hitz center (and in collaboration with the company Meta), and prepared it for reuse in different formats. The materials are available under Creative Commons licenses, under the name EusCrawl.
This article has been republished under the CC BY-SA 3.0 license.

24 March 2022 - 09:10

In total, the corpus contains 12.5 million documents and 423 million words, gathered by crawling documents from the web. It is available in two formats: JSONL and TXT. The address is http://ixa.ehu.eus/euscrawl/
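
For readers who want to work with the JSONL release, the following is a minimal Python sketch of how such a file could be read and counted. It assumes one JSON object per line, which is what the JSONL format implies, and that each object keeps the document body in a field called `text`. That field name and the file name in the usage note are assumptions for illustration, not details confirmed by the article.

```python
import json
from pathlib import Path


def corpus_stats(jsonl_path, text_field="text"):
    """Count documents and whitespace-separated words in a JSONL file.

    `text_field` is an assumed field name; adjust it to the real schema.
    """
    n_docs = 0
    n_words = 0
    with Path(jsonl_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            record = json.loads(line)  # one JSON object per line
            n_docs += 1
            # Documents without the assumed field contribute zero words.
            n_words += len(record.get(text_field, "").split())
    return n_docs, n_words
```

Calling `corpus_stats("euscrawl.jsonl")` (a hypothetical file name) would return a pair of document and word counts. Splitting on whitespace is only a rough word count and will not necessarily match the figures reported by the IXA group.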

The texts come from various sources, and the license under which each can be reused depends on the source: content from Wikipedia, Berria and Argia was obtained under the free CC BY-SA license. Other subsets include content from Hitza and from Bilbo Hiria Irratia.

What will this large EusCrawl corpus be used for? Its main application will be language models, a technology based on artificial intelligence. As explained by the IXA group, "language models are trained on a large number of texts and, by reading the text, they are able to learn the structure of the language and create new texts. Language models are at the core of current language processing applications: in search and question answering, as well as in machine translation, speech recognition, dialogue systems and chatbots. In short, language models are the engine of most applications built around language, and texts are the gasoline of this engine".

The number of texts needed to build good language models is very high. Finding texts for languages like English is not a problem, but those quantities still have to be collected, and so scientists have worked on creating a corpus called the Colossal Clean Crawled Corpus (C4), with 156 billion words.

EusCrawl is small in comparison, but one has to start somewhere. Moreover, large text collections already existed for Basque, but their quality was not fully reliable: these are the mC4 (1 billion words) and CC100 (416 million words) corpora from Google and Meta AI (formerly Facebook), which were downloaded automatically from the Internet, with the language of each document identified by a program.

In fact, although EusCrawl is smaller than these, it has already been used to create derivative products: the IXA group has trained two language models on EusCrawl, one of which is the largest model for Basque to date, with 355 million parameters.

In addition, IXA has reported that EusCrawl will be used in the BigScience project, which aims to build a free, giant multilingual language model using five million hours of computation. The language model created in BigScience will also know Basque.

EusCrawl has been published on the Internet and has also been presented in an academic paper by five people from the IXA group. It can be said to be the work of the IXA group of the UPV/EHU, but also of the company Meta (formerly Facebook), through the computer scientist Mikel Artetxe, who acts as a bridge between IXA and Meta. The paper is also signed by Itziar Aldabe, Rodrigo Agerri, Olatz Perez de Viñaspre and Aitor Soroa.

Learn more about EusCrawl at Unibertsitatea.net.

