Onome Amuge
Google has partnered with a consortium of African universities and research organisations to launch WAXAL, a large-scale open speech dataset aimed at accelerating the development of artificial intelligence tools for African languages.
The dataset, unveiled on Tuesday, provides foundational speech data for 21 sub-Saharan African languages spoken by more than 100 million people, including Hausa, Yoruba, Luganda and Acholi. By making the resource openly accessible, the project seeks to address the lack of high-quality language data for most of the world’s languages, which researchers consider one of the most significant structural barriers to inclusive AI.
Voice-enabled technologies such as digital assistants, automated translation and speech-to-text systems have become embedded in everyday life across much of Europe, North America and parts of Asia. However, their reach in Africa has been limited. Although the continent is home to more than 2,000 languages, only a small fraction are adequately represented in the datasets that underpin modern AI systems, effectively excluding hundreds of millions of people from interacting with technology in their native languages.
WAXAL was designed to narrow that divide. Developed over three years with funding from Google, the dataset comprises 1,250 hours of transcribed, natural speech collected across multiple countries, alongside more than 20 hours of high-quality studio recordings intended for the creation of synthetic voices. Researchers say the scale and diversity of the data make it one of the most comprehensive speech resources ever released for African languages.
Google argues that access to such data is essential if African researchers and entrepreneurs are to move beyond being passive users of imported technologies and instead shape AI systems that reflect local linguistic and cultural realities.
“The ultimate impact of WAXAL is the empowerment of people in Africa,” said Aisha Walcott-Bryant, head of Google Research Africa. She said the dataset would allow students, researchers and start-ups to build tools on their own terms, in their own languages, enabling applications ranging from educational software to voice-enabled services with commercial potential.
Unlike many earlier AI initiatives led by global technology companies, WAXAL was structured to be developed by African institutions rather than merely about African languages. Universities and community organisations including Makerere University in Uganda, the University of Ghana and Digital Umuganda in Rwanda led the data collection process, working with guidance from Google researchers. Crucially, these institutions retain ownership of the data, a governance model intended to set a precedent for more equitable AI partnerships.
The approach reflects growing scrutiny of how data is sourced and controlled in AI development, particularly in emerging markets. Critics have long argued that global technology firms extract data from the global south without sufficient local benefit. By placing African institutions at the centre of the project, Google and its partners say they are attempting to rebalance that relationship.
The dataset spans a wide linguistic range, covering languages from west, east, central and southern Africa, including Akan, Dagbani, Dholuo, Ewe, Fulani, Igbo, Kikuyu, Lingala, Malagasy, Shona, Swahili and Yoruba, among others. Researchers say this diversity increases the likelihood that tools built on WAXAL can be adapted across regions rather than remaining narrowly focused.
WAXAL is being released at a time when governments, donors and technology companies are intensifying efforts to ensure that advances in AI do not exacerbate existing inequalities. While the dataset alone will not solve Africa’s digital challenges, its backers argue that it provides the linguistic raw material without which inclusive AI systems cannot be built.
The dataset is available from Tuesday, with further details published on Google’s Africa blog.