Telecom company Veon, mobile operator Beeline Kazakhstan, the Barcelona Supercomputing Center, and the GSMA lobby group announced on Wednesday a collaborative initiative to address the “AI language gap” for under-represented languages.
Language models, such as those powering chatbots like ChatGPT, often rely on vast amounts of online data, including digital books, websites, articles, and blogs, to learn how to generate human-like responses. For many languages, however, such data and resources are scarce.
“Out of nearly 7000 languages spoken around the globe, only seven are considered high-resource languages in the digital world: English, Spanish, French, Mandarin, Arabic, German, and Japanese,” the groups stated in a joint announcement.
The collaboration aims to develop tools and language model documentation for under-represented languages, particularly those spoken in the countries where Veon operates: Pakistan, Ukraine, Bangladesh, Kazakhstan, Uzbekistan, and Kyrgyzstan. Catalan, spoken by around 10 million people, will also be included.
“The lack of resources in other languages results in an AI language gap, leading to sub-optimal user experiences in AI applications, deepening bias in AI models, and risking a wider digital divide in AI technologies,” the statement added.