In total, the corpus contains 12.5 million documents and 423 million words, extracted by scraping documents from the web. It is available in two formats, JSONL and TXT, at http://ixa.ehu.eus/euscrawl/
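As a rough illustration of what working with the JSONL format looks like, here is a minimal sketch in Python that counts documents and words. It assumes one JSON object per line with the document body in a field called `text`. That field name is an assumption made for this example, not something confirmed by the corpus documentation, and the sample records are invented stand-ins.

```python
import io
import json


def count_docs_and_words(lines, text_field="text"):
    """Count documents and whitespace-separated words in JSONL lines.

    `text_field` is a hypothetical field name; check the real files
    before relying on it.
    """
    n_docs = 0
    n_words = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        n_docs += 1
        n_words += len(record.get(text_field, "").split())
    return n_docs, n_words


# Tiny in-memory stand-in for a corpus file (invented sample records).
sample = io.StringIO(
    '{"text": "Kaixo mundua"}\n'
    '{"text": "Euskal testu bat da hau"}\n'
)
docs, words = count_docs_and_words(sample)
print(docs, words)
```

For a real file, the same function could be given an open file handle instead of the in-memory sample. Word counts obtained this way depend on how words are split, so they will not necessarily match the published figures.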
The texts come from various sources, depending on which content can be reused under one license or another. Content from Wikipedia, Berria and Argia was obtained under the free CC BY-SA license. Other parts of the corpus include content from Hitza and from Bilbo Hiria Irratia.
What will this large EusCrawl corpus be used for? Its main application is the technology of language models based on artificial intelligence. As the IXA group explains, "language models are trained on a large number of texts and, by reading text, they are able to learn the structure of the language and create new texts. Language models sit at the core of current language processing applications: search and question answering, as well as machine translation, speech recognition, dialogue systems and chatbots. In short, language models are the engine of most applications built around language, and texts are the fuel of that engine".
The amount of text needed to build good language models is very large. Finding texts for languages like English is not a problem, but those amounts still have to be collected, and so scientists have created a corpus called the Colossal Clean Crawled Corpus (C4), with 156 billion words.
EusCrawl is small in comparison, but one has to start somewhere. Moreover, large collections of Basque text already existed, but their quality was not fully reliable: Google's mC4 (1 billion words) and Meta AI's (formerly Facebook) CC100 (416 million words) are corpora that were downloaded automatically from the Internet, with the language of each document identified by software.
In fact, although EusCrawl is smaller than these, it has already been used to create derivative products: the IXA group has built two language models trained on EusCrawl, one of which is the largest model for Basque to date, with 355 million parameters.
In addition, IXA has reported that EusCrawl will be used in the BigScience project, which aims to build a free, giant multilingual language model using five million hours of computation. The language model created in BigScience will therefore also know Basque.
EusCrawl has been published on the Internet and has also been presented in an academic paper by five authors. It can be said to be the work of the IXA group of the UPV/EHU, but also of the company Meta (formerly Facebook), through the computer scientist Mikel Artetxe, who acts as a bridge between IXA and Meta. The paper was also signed by Itziar Aldabe, Rodrigo Agerri, Olatz Perez de Viñaspre and Aitor Soroa.
Learn more about EusCrawl at Unibertsitatea.net.