I have been professionally involved in an AI project for a while now (see https://toolkit-digitalisierung.de/en/fair-forward/). This website, in turn, is all about understanding and fostering knowledge commons and open-knowledge peer production for human development. I would therefore like to share some thoughts on how we might democratize Artificial Intelligence (AI) globally through new commons, and to present a concrete initiative that does this as well as a new alliance on the issue.
Let’s start with some quick background: it is becoming increasingly clear that most modern machine learning (aka AI) approaches rely on massive amounts of so-called training data. Most of the time, and for most people worldwide, such training data is not available as a knowledge commons. A good example is spoken language, as this article from GIZ explains, from which I partly borrow in the next paragraphs:
Language-based AI can be used to share information in a targeted, personalised way and reach people who cannot read – e.g. through interactive voice assistance. But there’s a problem. AI can only work when it is ‘fed’ and trained with data. Suitable language data from African and Asian nations has so far been a scarce resource. Currently, the data is predominantly gathered and used by big companies like Google and Amazon. Local languages in Africa and Asia are commercially less interesting and/or more complex and are therefore often neglected – at the same time, they promise to yield high societal benefits. A classic case for a knowledge-commons approach.
Which brings me to the concrete initiative I would like to present here, brought to you by the browser company Mozilla and its foundation. They launched the Common Voice platform, which enables people to contribute voice recordings in their native language as part of a growing AI training data commons. On the platform, anyone can record and listen to sentences and check the pronunciation. So far, it has collected datasets in 60 languages and many thousands of hours of voice recordings in African languages, more than 1500 hours in Kinyarwanda alone. All of the data is available to everyone as a commons under an open data licence. The uptake of the initiative in Rwanda is largely due to the close cooperation between Mozilla, local partners such as the Rwandan start-up Digital Umuganda, and GIZ’s FAIR Forward project (disclaimer: I am involved, see above).
The development of language-based AI is particularly valuable in Rwanda, as almost 30 per cent of its citizens are illiterate. Voice assistants might therefore prove useful in this context. The data gathered in the Kinyarwanda language, which is spoken by more than 12 million people, will soon be used to provide all the country’s citizens with information on topics such as health, without them having to be able to read. This includes information about the coronavirus. The first language-based chatbot has already been launched.
So I cordially invite you to check out the Common Voice project and contribute to it (in Kinyarwanda, Luganda, Swahili or any other language you speak 🙂 ). Also, if you have other thoughts or comments, please use the comment field.
Finally, I would like to direct you to a new alliance, “Open for Good” (disclaimer: our project FAIR Forward is involved), which is tackling fundamental issues around open AI training data as a commons, including the following:
- How to increase the availability and quality of openly available training data in a systematic way
- How to foster truly localized training data for artificial intelligence and machine learning
- How to ensure representative and non-discriminatory training data (as far as possible)
- How to direct such localized AI towards improving public services, strengthening private sector development and fostering sustainable development