argia.eus
How has artificial intelligence been used to spread the Basque language?
  • Naiara Perez, researcher at the HiTZ Language Technology Center of the UPV/EHU; Itziar Cortes, researcher at Elhuyar; and Eli Pombo, manager at Iametza, talk about, among other things, the place Basque has in artificial intelligence, and its possibilities, challenges and difficulties.
Olaia L. Garaialde, July 10, 2024

It's on everyone's lips, and in everyday life artificial intelligence is used more than people think. But what is it? “To give a simple definition, I would say that it is a field between mathematics and computer science, in which computers are given human competencies, such as linguistic ability, vision, movement…”, explains Naiara Pérez. Artificial intelligence can be specific or general. The specific kind serves particular tasks, such as automatically switching on a car's lights when it enters a tunnel, or answering questions put to voice assistants. The general kind, by contrast, is closer to human beings and can perform more than one task, as is the case with OpenAI's chatbot ChatGPT.

Artificial intelligence works through neural networks. Neural networks try to mimic, with computational tools, the human nervous system and the way it learns. To do this, the tools must be fed data, and learning algorithms must be designed that let them learn patterns. “We haven’t got there yet, but we’ve started to really wonder whether artificial intelligence will be able to replicate the human mind in the coming years,” Pérez said.

Pombo, however, points out that artificial intelligence is a tool: “I wouldn’t let us sit back. We must use technology rationally, ethically and for truly useful purposes.”

Eli Pombo, Iametza: “I feel that we are trying not to miss the train and to make room for ourselves, but I am not pessimistic. Apart from the difficulties, I think we also have a favorable wind.”

Artificial intelligence can be applied in many sectors, including language processing. This covers, among other things, systems that recognize and translate written text on the fly, systems that take in speech and transcribe it, and systems that convert text to speech. According to Pérez, in recent years the Basque Country has had “many” researchers in language technology and development, and there is a “strong” sector: “We are emerging from us”. So there is a sector doing research on Basque, but in all this whirlwind of technology, what place does Basque have? What advantages and disadvantages does it have? What difficulties and possibilities does it face?

“Artificial intelligence is the reality around us and Basque has to be there. Otherwise, it loses opportunities in terms of digital presence,” Pombo stressed. Pombo added that this is also an opportunity for Basque to spread to other places. Pérez agreed and pointed out that large-scale tools such as ChatGPT are often in the hands of large companies: “We cannot wait for what big companies are going to do. The priority of companies like Google and Microsoft is not to take all the languages of the world into account. These companies focus on the languages with the highest number of clients, i.e. the hegemonic languages.” However, she added that companies of this kind have also begun to integrate Basque.

Pombo feels that those working with Basque in this area are “fighting”: “I feel that we are trying not to miss the train and to make room for ourselves, but I am not pessimistic. Apart from the difficulties, I think we also have a favorable wind.” Pombo also says that public institutions and citizens have "a lot of will" to get things done. In addition, according to Pombo, it is “impossible” to compete against the advances made by large companies: “We have to keep doing things with common sense and without getting frustrated.”

Elhuyar has, among others, the Elia and Entzun tools. The neural translator Elia translates plain texts and formatted documents in a few seconds. The Entzun platform processes previously recorded audio and video files and creates their transcripts and subtitles.

Technological sovereignty

Because technology is generally based on data, Pombo believes that, to avoid handing information over to large companies, it must be managed “in a sovereign way”: “Free software lets you improve what has been built, for yourself, and strengthen the local economy.” In this regard, Cortés has warned that users should be "rational" about how this data is used. For that reason, Elhuyar does not use customer data to train its tools: “We have to be vigilant; we don’t read the small print, or we aren’t warned, and we end up feeding these systems unintentionally.”

Pérez has also stressed the origin of the data and stated that in free models the ethical aspect is "cleaner". They have also underlined the impact of working from local needs and aspirations. “If we do, we will create content on the issues that interest us,” said Pérez. She added that creating “cutting-edge technology” helps “nourish” the technology sector and the research sector in the Basque Country: “If we work openly, we can promote collaboration between the research centers here.”

Naiara Perez, HiTZ Center: “We cannot wait for what big companies do. The priority of companies like Google and Microsoft is not to take all the languages of the world into account.”

Many make their tools available to others so that technology can be developed. One example is Latxa, created by the HiTZ Language Technology Center of the UPV/EHU. It is a large language model, and when you give a string of text to this kind of model, it returns the most likely next word. “Latxa does nothing on its own, it’s an engine for building other applications,” explains Pérez. For example, it is used to build spelling checkers, question-answering applications and automatic exercises for teaching: “We, for example, have entered Latxa as a player in the game Egunean Behin ("Once a Day") to give answers.” It is available online and anyone can download it.
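The idea of "returning the most likely next word" can be illustrated with a toy sketch. This is not how Latxa is built: Latxa is a large neural model, whereas the code below only counts which word follows which in a tiny sample. The function names and the three sample sentences are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for every word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def most_likely_next(counts, word):
    """Return the word seen most often after `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical mini-corpus; a real model is trained on millions of documents.
corpus = [
    "kaixo zer moduz",
    "kaixo zer berri",
    "zer moduz zaude",
]
model = train_bigram_model(corpus)
print(most_likely_next(model, "zer"))
```

With so little text the prediction is crude, which is the point the researchers make about data: the more quality text a model sees, the better its guesses become.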

Lack of data, a difficulty

Data is the treasure of artificial intelligence. In fact, training tools requires as much quality data as possible. For example, creating a system that listens to speech and transcribes it directly requires many recordings and transcripts. The researchers interviewed emphasized that one of the great difficulties Basque faces is obtaining large amounts of data. “Compared to the hegemonic languages, Basque does not have as much content, and so it is harder to get results,” said Pérez. Cortés added, however, that compared to other minority languages there is "more" content in Basque.

Most current tools are trained on unified Basque (Batua), although there are also tools that work with the dialects. The Batua translation service, for example, is based on artificial intelligence and neural networks; it is a project developed by the Vicomtech technology center and promoted by Euskaltel, Mondragonlingua and EITB. This translation service handles Basque, French, Spanish, English and Biscayan, and it can translate between Basque and all those languages. “In the case of the Basque dialects it is much more difficult to obtain data; if it is already hard for Basque in Batua, think of Biscayan or Labourdin,” said Pérez. As for the dialects, the lack of norms and the variants that exist in each area hinder the process: “Biscayan is not unified, so if we feed the machine with the varieties of Getxo, Gernika or Ondarroa, it is very difficult to establish a pattern.”

Despite the difficulties, this does not mean that quality content is not generated in Basque. In fact, Cortés stressed that "great" care is taken over the content generated in Basque: “With the data we have, we’re getting very decent results.” She added that artificial intelligence has opened other doors for Basque, since with the old systems they were "limited". She explains that with the first systems they were unable to create and offer “truly useful” systems.

Putting this into practice, she points to the advances in machine translation: “It didn’t work well before, but now it does.” In 2007, Elhuyar and the UPV created Matxin, the first free automatic translator. The system at the time was not based on artificial intelligence. “In the case of Basque, the results it gave were nowhere near those we have today. Until 2016, we did not achieve a quality system for translation between Basque and Spanish.” Today, the automatic translator is known as Elia.

By contrast, for other language pairs, for example translation between Spanish and Galician, old systems that are not based on artificial intelligence are still in use: “For closely related or similar languages, very good results were obtained. Basque has declensions, its verbs are different and its word order is free. This makes it hard to write rules for passing from one language to another. Thanks to artificial intelligence, today there are almost no limits.”

Itziar Cortes, Elhuyar: “If we use the automatic translator to translate into Basque what is not in Basque and we do not check whether the translation is good, we do Basque no favors.”

As technology is able to do more and more things, Cortés believes that we have to be “reasonable.” Otherwise, instead of being an instrument for spreading Basque, she says it can be counterproductive: “If we use it to translate content created in Basque, we are spreading content that was, after all, created in Basque. But if we use the automatic translator to translate into Basque what is not in Basque and we do not check whether the translation is good, we do Basque no favors.” She added that this will leave us with “low quality” texts in the medium term: “If we use these texts to train future systems, the quality of the Basque will be low.”

Besides reviewing the content generated, what can be done to ensure quality? The HiTZ Language Technology Center has taken several routes. First, content from Basque media published under a Creative Commons license has been used. Because that data is not enough, large archives have also been used, with their content passed through a filter. “We’ve got 4,000,000 documents, but it’s not enough to create a tool like ChatGPT; however, we’ve achieved good results.”

Another example is the Elia, Entzun and TTS tools that Elhuyar has. All three are based on artificial intelligence. The neural translator Elia translates plain texts and formatted documents in a few seconds. The Entzun platform processes previously recorded audio and video files and creates their transcripts and subtitles. The neural TTS turns text into speech. Elhuyar's technologies handle Basque, Spanish, French, English, Catalan and Galician, which means that Elia, Entzun and TTS can be used in those six languages. This is Elhuyar's latest development: “We’ve seen that customers are multilingual and want to use our technology in languages other than Basque. With Basque as the main axis, they can use a single tool in more languages.”

Elhuyar's zientzia.eus has a neural translator integrated into its website, with the notice: “Text written in Basque and automatically translated by Elia, without subsequent review.” In this example, the text appears in Catalan.

Cortés believes that all these tools can help “spread” Basque: “We don’t have to be afraid to create in Basque. If we want to reach more people, in addition to professional translators, today we have many tools.” At Elhuyar, for example, they mostly create content in Basque, but they translate some things with an automatic translator for readers outside Euskal Herria, always alerting the user and offering access to the original version.

For example, the portal zientzia.eus has an integrated automatic translator. This means that if a person searches the Internet for “Como xurdiu o sistema solar?” (“How did the solar system come about?” in Galician), the portal zientzia.eus is within their reach. The website offers the possibility to read in another language and to see the original version. “We are clear that the reader has to know that it is a text produced by an automatic translator, and not by a person,” said Cortés.

ARGIA has recently taken part in the Itzulinguru project and, like the website zientzia.eus, has integrated an automatic translator as part of the experiment proposed by the Sociolinguistics Cluster and the Innoklab research group of the UPV/EHU. The project has been supported by Elhuyar, AEK, Orai, Center for Artificial Intelligence, Osakidetza, Department of Education of the Basque Government and Hekimen.

Even when an automatic translator is used, not everything should be left in the hands of the tools. Elhuyar members review the content to ensure its quality. Sometimes researchers do so, and at other times professional translators: “We are clear: professional translators are needed, because automatic translation does not reach 100% quality. Furthermore, you cannot leave it to just anyone to judge whether a text is good or not, because we don’t all have the same criteria.”

All these tools and resources currently available can help in language learning. However, what happens if, instead of awakening the desire to learn, they take away the desire to learn the language? According to Pérez, although the desire or the need to learn a language is linked to the need to communicate, it is not limited to this: “As far as Basque is concerned, I don’t think anyone learns it just in order to communicate. It's a choice, and there's a lot closer together. Really, if you want to learn Basque, French, Arabic or any language, these kinds of tools will make the way easier, but a translator will not give you the pleasure of reading directly in Basque.”