argia.eus
Latxa: Hitz creates the largest free language model in Basque
  • Aina Flor, a large free language model for Catalan, was presented recently, and in last week's news we reported that Eneko Agirre, director of the Hitz Center, had announced that a Basque one was coming shortly. Just yesterday the Hitz Center made it public: Latxa. An LLM is a large language model, a kind of super-database on which artificial intelligence initiatives are built. LLMs are the basis for the versions of OpenAI's ChatGPT, for example. Now we have one of these in Basque (well, actually a set of models, three of them).
Sustatu, 30 January 2024

According to the Hitz Center, Latxa "is a family of open models" that includes the "largest language model in Basque". It is built on Llama 2, the language model from Meta (Facebook), and follows its license. Meta's technology has already shown excellent results in Basque: its Seamless M4T product can perform correct spoken machine translation in Basque. Latxa's logo is, fittingly, a cross between a llama and the Basque latxa sheep, and there is also a connection in the name (or so we think).

The Latxa family comprises models of between 7 and 70 billion parameters. To train the models, the Basque researchers used EusCrawl, a collection of Basque texts containing 1.72 million documents and 288 million words. EusCrawl was extracted from 33 quality websites, so it offers higher quality than other training corpora gathered from the Internet.
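To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It only uses the numbers quoted above; the 2 bytes per parameter (16-bit weights) is our own assumption, not something the Hitz Center has stated, and the function names are purely illustrative.

```python
# Figures quoted in the article.
DOCUMENTS = 1_720_000            # EusCrawl documents
WORDS = 288_000_000              # EusCrawl words
PARAMS_SMALLEST = 7e9            # smallest Latxa model, in parameters
PARAMS_LARGEST = 70e9            # largest Latxa model, in parameters

# Assumption (not from the article): weights stored as 16-bit numbers.
BYTES_PER_PARAM = 2


def average_words_per_document(words, documents):
    """Mean document length of the corpus, in words."""
    return words / documents


def weight_memory_gb(params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate size of the model weights alone, in gigabytes (10^9 bytes)."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    avg = average_words_per_document(WORDS, DOCUMENTS)
    print(f"Average EusCrawl document: about {avg:.0f} words")
    print(f"Smallest model weights: about {weight_memory_gb(PARAMS_SMALLEST):.0f} GB")
    print(f"Largest model weights: about {weight_memory_gb(PARAMS_LARGEST):.0f} GB")
```

Under that assumption, the weights alone would take roughly 14 GB for the smallest model and 140 GB for the largest, which helps explain why these models are, for now, aimed at engineers rather than at the general public.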

For now, Latxa is not intended for the general public; that will come later. However, the three models are available on the Hugging Face platform, and experienced engineers can use them by consulting the "model card", which contains the technical information and the instructions for getting started with the models.

Latxa is the result of a research, development and innovation initiative that is part of the IKER-GAITU project, supported by the Basque Government, in cooperation with the European EuroHPC programme.

Today's language models perform amazingly well, as ChatGPT and Bard do in English. That is not the case, however, for minority languages, Basque among them. With these models the Hitz Center has taken a step toward turning the situation around, and according to its data, Latxa responds better than other systems to prompts in Basque.

More information, here.

On Hugging Face: Latxa.