Ethical AI Development in Kenya: The Role of Ethical Data Sourcing and Governance
- Josephine Kaaniru
- April 10, 2025
- Artificial Intelligence
Unethical Data Sourcing Concerns Have Defined the Initial Stages of Generative AI Training
Kenya’s emergence as a hub for artificial intelligence (AI) innovation offers immense potential, especially for addressing societal challenges. For instance, AI-powered Assistive Technologies (ATs) hold transformative possibilities for disability inclusion.1 AI has also positively impacted Kenya’s education, agriculture, health and business sectors.2 Unfortunately, most of the AI tools used in these sectors originate from large foreign technology companies and organisations with the financial resources to develop them.3 Kenyan developers often rely on these foundational tools and datasets to build local solutions because quality datasets fully reflective of local populations and concerns are in short supply.4 AI policy researchers have long flagged this scarcity of high-quality, locally representative datasets as a significant problem: it exacerbates biases in AI systems, since flawed training data, often sourced from Western contexts, fails to account for diverse demographics. A striking example is the 2021 documentary Coded Bias, in which Joy Buolamwini and other AI and data stakeholders showed that AI systems struggled to detect darker-skinned faces and misclassified women.5 This revelation underscored how algorithms reflect real-world biases shaped by historical data and the limited diversity within development teams. Consequently, local datasets are recognised as a crucial opportunity for African developers to create AI solutions tailored to the continent’s unique challenges.
While stakeholders have embarked on the well-intentioned mission of collecting and preparing data for training AI systems, this process has faced challenges that have long plagued relations between Africa and the developed world.6 A key issue is the tendency of foreign companies, and even local contractors, to extract data from local populations at little or no cost, only for the resulting technology systems to be sold back to those same communities at a price. This “digital colonisation” presents itself in Africa in various forms, such as cheap digital labour, data extraction, and Africans being relegated to the role of beta testers for global platforms.7 Social media data is particularly relevant: since the rise of platforms like Facebook, Twitter, Instagram, and TikTok in Kenya, Kenyans have built a thriving digital ecosystem encompassing business, problem-solving, innovation, cultural expression, and dynamic social interaction, making their data invaluable for AI training. For instance, a Meta official confirmed that the company has scraped all public data posted on its platforms since 2007 for training AI.8 Only users in the European Union are given an opportunity to opt out of this practice.9 In contrast, the publicly posted personal data of Meta users in all other regions has proved an invaluable resource for the company.10 Beyond social media, AI systems have also been trained on vast datasets drawn from books, music, and global news archives, often without the creators’ consent.11 This widespread use of data raises serious ethical concerns around intellectual property rights, exploitative data extraction, and the reinforcement of biases in AI models. Ethical AI development depends on inclusive, well-regulated data governance mechanisms that ensure representativeness and accountability and foster responsible innovation.12
In that regard, this blog explores unethical data sourcing as it has manifested both globally and locally. It also examines what ethical data sourcing entails, while considering the dilemma presented by open data initiatives. In doing so, the article aims to identify actionable strategies for ethical AI development in Africa through the ethical sourcing and governance of data.
How Unethical Data Sourcing Has Manifested So Far Globally
OpenAI, arguably the most recognisable AI company in the world, has faced its latest battle over how it sourced data to train the GPT models. The company is now entangled in a copyright lawsuit filed by Indian book publishers, joining a growing list of legal challenges from authors, news agencies, and musicians worldwide.13 These cases highlight ongoing concerns over AI models training on proprietary content without authorisation, threatening intellectual property rights and undermining traditional revenue streams. The lawsuit in India, led by the Federation of Indian Publishers, demands that OpenAI either secure proper licensing agreements or delete datasets containing copyrighted material, reflecting broader global anxieties about the ethical sourcing of AI training data.14
Other global companies that have faced this challenge include Stability AI, sued by Getty Images for scraping millions of copyrighted photos; Anthropic, targeted by music publishers over song lyrics; and Microsoft, dragged into OpenAI-related suits through its partnership, all underscoring a mounting tension between AI innovation and the rights of content creators across industries and borders.15 Stability AI’s legal woes began in February 2023, when Getty Images accused it of using over 12 million images to train its Stable Diffusion model, claiming this unauthorised use fuelled a generative AI product that competes with Getty’s stock photo business.16 Anthropic, founded by ex-OpenAI researchers, faced a lawsuit in October 2023 from Universal Music Group and others, who alleged that its Claude model was trained on copyrighted lyrics without proper licensing and often produces near-verbatim reproductions of them.17 Meanwhile, Nvidia18 was hit with a 2024 lawsuit from authors over pirated books, and Perplexity AI was sued by Dow Jones and the New York Post19 for repurposing news articles. These cases further illustrate how an AI industry reliant on vast datasets sourced from what it terms “publicly available information” is constantly clashing with established intellectual property frameworks. The developments have raised significant questions about fair use, the sourcing of AI training data, the future of the creative industries, and intellectual property protection.
In Africa?
Like much of Africa, Kenya is experiencing extensive data extraction, often in ways that raise ethical concerns. Past research by CIPIT indicates that AI developers require vast amounts of data to train models, and much of this data is sourced through methods like web crawling and scraping. The authors note that while some datasets, such as the Demographic and Health Survey (DHS) and the Malaria Indicator Survey (MIS), are publicly available, others, like detailed reports from the Senegal National Malaria Control Program, may be copyrighted or sensitive yet still end up being used without proper authorisation. Many African countries, including Kenya, have copyright laws that do not explicitly address AI training, leaving room for exploitation. Kenya’s Copyright Act allows for fair dealing in scientific research but does not clarify whether AI model training falls under this exception.20 In contrast, South Africa’s Copyright Amendment Bill attempts to define fair use,21 though its impact on AI development remains uncertain. This lack of clear legal and ethical frameworks means AI developers often extract data without consent or transparency, which undermines privacy and erodes trust in AI systems. Such unchecked sourcing produces unethical AI systems: trained on data acquired through questionable means, they violate individual rights and reinforce existing societal inequities.22 Without stronger data governance, the extraction of African data to train AI will continue to rest on a foundation of exploitation rather than fairness and accountability.
Further exploitation in data sourcing has positioned Africa as a low-wage data annotation hub where global tech companies outsource AI training tasks. In Kenya, young, unemployed workers, often desperate for economic opportunities, are hired by companies like Meta and OpenAI through third-party firms such as Sama to label vast datasets of images, text, and videos for minimal pay, sometimes less than $2 per hour. This work frequently exposes them to disturbing content, including violence and explicit material, in order to refine AI systems used worldwide. The unethical nature of this sourcing is evident in the economic vulnerability it exploits, offering little job security or mental health support despite the psychological toll, while global companies bypass the stricter labour laws of wealthier nations by taking advantage of Kenya’s weaker regulations.23 Workers have brought legal action against Meta and Sama, citing long hours, precarious contracts, and a stark power imbalance that limits their rights and avenues for redress.24 Allegations of retaliation against workers attempting to unionise further highlight the exploitative conditions.25 This case exemplifies the digitisation of global inequality, in which countries like Kenya serve as low-wage labour centres for AI advancements,26 yet see little of the economic benefit despite their critical role in fuelling tools like OpenAI’s ChatGPT.27
Kenya is taking steps to curb the longstanding exploitation of its citizens’ data for AI training by foreign entities. The Ministry of Information, Communications, and the Digital Economy (MICDE) has acknowledged Kenya’s vulnerability to data exploitation, in which international companies harvest local data with little or no benefit returning to the population.28 To counter this, the government has emphasised national data sovereignty through the National AI Strategy, the recently released Diplomat’s Playbook on Artificial Intelligence, and proposed29 data governance legislation. These initiatives aim to halt exploitative data practices and ensure that AI development aligns with ethical standards while empowering Kenyans rather than serving solely foreign interests.
Beyond national policy efforts, equitable data practices must also extend to direct benefit-sharing with affected communities. Benefit Sharing Agreements (BSAs) can provide a structured way for African communities to receive fair compensation for their data contributions, as the Masakhane Pelargonium case illustrates. There, the Masakhane community in South Africa disputed its inclusion under a traditional authority’s BSA with Schwabe Pharmaceuticals, demanding a separate BSA to secure benefits from the harvesting of the Pelargonium sidoides and Pelargonium reniforme plants on its land, an effort rooted in its assertion of self-representation through a Communal Property Association.30 The case highlights the critical need to move beyond superficial consultations toward authentic engagement that respects community agency and delivers tangible rewards. By embracing benefit-sharing, African communities can transition from being mere sources of raw data to active beneficiaries of the technological advancements their contributions enable.
Ethical Data Sourcing
Ethical data sourcing is foundational in developing trustworthy AI systems, weaving together data privacy, informed consent, and a spectrum of digital rights tied to data governance. The performance and legitimacy of AI systems depend heavily on the quality and ethical integrity of their data sources – a principle increasingly enshrined in legal and regulatory frameworks worldwide.31 Core ethical considerations, including transparency in AI development, robust data protection, and the assurance of fairness and non-discrimination, are critical to fostering AI systems that uphold integrity.32
Informed consent, a crucial element of ethical data sourcing, is defined by the GDPR as a freely given, specific, informed and unambiguous indication of the data subject’s wishes, signified by a statement or a clear affirmative action,33 ensuring the data subject understands how their data is used. This emphasis on informed consent underscores an understanding of how one’s data is handled, a principle mirrored in Afrocentric frameworks. For instance, the Malabo Convention (African Union Convention on Cyber Security and Personal Data Protection) requires that data processing rest on the “principle of consent and legitimacy of personal data processing”, meaning processing is legitimate only with consent, save for legally permitted exceptions.34 Likewise, Kenya’s Data Protection Act requires that personal data not be processed unless the data subject consents to the processing or another lawful basis under the Act applies.35 These overlapping definitions highlight a universal commitment to empowering data subjects, though their implementation varies across jurisdictions.
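To make the consent principle concrete for developers sourcing training data, the sketch below shows, in Python, one way a data collector might record and check purpose-specific consent before processing. It is a minimal illustration only: the ConsentRecord structure, its field names, and the purposes are hypothetical and are not drawn from the GDPR, the Malabo Convention, or the Data Protection Act.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical consent record: the fields are illustrative,
# not a compliance checklist drawn from any statute.
@dataclass
class ConsentRecord:
    subject_id: str
    purpose: str                 # the specific purpose consented to, e.g. "model_training"
    informed: bool               # the subject was told how their data will be used
    freely_given: bool           # no coercion or bundling with unrelated services
    affirmative_action: bool     # a statement or clear affirmative act, not a pre-ticked box
    withdrawn_at: Optional[datetime] = None

def may_process(record: ConsentRecord, purpose: str) -> bool:
    """Allow processing only if valid, purpose-specific consent exists and has not been withdrawn."""
    return (
        record.purpose == purpose        # specific: consent covers this purpose only
        and record.informed
        and record.freely_given
        and record.affirmative_action    # unambiguous indication of the subject's wishes
        and record.withdrawn_at is None  # consent can be withdrawn at any time
    )

# Consent given for analytics does not extend to AI model training.
consent = ConsentRecord("user-001", "analytics", informed=True,
                        freely_given=True, affirmative_action=True)
print(may_process(consent, "model_training"))  # False
print(may_process(consent, "analytics"))       # True
```

In practice the same idea scales to a consent register keyed by data subject and purpose, so that a dataset assembled for one stated use cannot silently be repurposed for AI training.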
Understanding the role of data in the design and deployment of AI systems is critical, particularly in ensuring transparent processes for data collection, storage, processing, and governance.36 Bias in AI often originates from multiple sources – how data is collected, the structure of algorithms, and patterns of user interaction. These biases can lead to unequal and discriminatory outcomes, which are then embedded in machine learning models.37 Tackling such bias requires a multifaceted approach, including curating more representative datasets, using bias-aware algorithms, and incorporating robust user feedback mechanisms. Importantly, bias frequently stems from unrepresentative or incomplete data sources, making ethical data sourcing a foundational step toward building fairer and more accountable AI systems.
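As a rough illustration of how unrepresentative data can be surfaced before training, the sketch below audits the share of each demographic group in a dataset and flags groups that fall below a chosen threshold. The group attribute, the 10% threshold, and the toy data are hypothetical; real representativeness audits would use domain-appropriate attributes and statistical tests rather than a single cut-off.

```python
from collections import Counter

def representation_audit(records, group_key, min_share=0.10):
    """Flag groups whose share of the dataset falls below min_share.

    records: list of dicts describing training examples.
    group_key: the demographic attribute to audit (hypothetical name).
    min_share: illustrative threshold, not an accepted standard.
    """
    counts = Counter(r[group_key] for r in records)
    total = sum(counts.values())
    return {
        group: {
            "count": n,
            "share": round(n / total, 3),
            "under_represented": n / total < min_share,
        }
        for group, n in counts.items()
    }

# Toy dataset in which one region dominates the training data.
data = [{"region": "A"}] * 90 + [{"region": "B"}] * 7 + [{"region": "C"}] * 3
for group, stats in representation_audit(data, "region").items():
    print(group, stats)
# Regions B and C fall below the 10% threshold and would be flagged
# for further, consent-based data collection before training.
```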
The Open Data Dilemma
Africa’s scarcity of localised datasets has long hindered the development of AI and data-driven solutions tailored to its diverse populations.38 Efforts to build African-based datasets often stumble into exploitation, with concerns over inadequate consent, lack of fair compensation, and external control dominating the narrative. Open data has emerged as a potential antidote, promising to democratise access to information and spur innovation.39 In Kenya, the government’s Kenya Open Data Initiative (KODI), launched in 2011, was hailed as a pioneering effort to make public data, such as census figures and budget data, openly available to fuel transparent development.40 Proponents see open data as a way to empower local developers and reduce reliance on proprietary or unethically sourced datasets, a critical step toward ethical AI.
Yet the reality in Kenya reveals significant challenges in maintaining government-initiated open data. Updates to the portal have stalled: although by 2016 it boasted 849 datasets,41 the 2022 Global Data Barometer highlighted significant deficiencies in Kenya’s open data initiatives, pointing to outdated and poorly managed data alongside inadequate capacity that risks exposing sensitive information, ultimately undermining public trust.42 Datasets are also often outdated or incomplete, undermining their utility for AI innovation.43 Privacy concerns further complicate the landscape: publicly released datasets, such as health or demographic records, raise fears of re-identification, especially in tight-knit communities where ensuring anonymisation is more complex.44 Questions surrounding data ownership present additional challenges. Who controls data derived from communal resources, and how the benefits are shared, remain largely unresolved, highlighting the need for clearer legal and policy frameworks.
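To make the re-identification risk more tangible, the sketch below applies a basic k-anonymity style check before a dataset is released: it counts how many records share each combination of quasi-identifiers and flags combinations held by fewer than k people, since those records are the easiest to match back to individuals. The column names, the toy records, and the choice of k are hypothetical and purely illustrative.

```python
from collections import Counter

def k_anonymity_violations(records, quasi_identifiers, k=5):
    """Return quasi-identifier combinations shared by fewer than k records.

    Rare combinations (e.g. a single person of a given age band, ward and sex)
    are the easiest targets for re-identification once data is published openly.
    """
    combos = Counter(tuple(r[col] for col in quasi_identifiers) for r in records)
    return {combo: n for combo, n in combos.items() if n < k}

# Hypothetical health records about to be released as open data.
records = [
    {"age_band": "30-39", "ward": "Kilimani", "sex": "F", "diagnosis": "malaria"},
    {"age_band": "30-39", "ward": "Kilimani", "sex": "F", "diagnosis": "typhoid"},
    {"age_band": "70-79", "ward": "Kibra",    "sex": "M", "diagnosis": "diabetes"},
]
violations = k_anonymity_violations(records, ["age_band", "ward", "sex"], k=3)
print(violations)
# Both combinations appear fewer than 3 times, so the records would need
# generalisation or suppression before release.
```

Stronger guarantees exist (l-diversity, differential privacy), but even this simple check shows why small, tight-knit communities make anonymisation harder: the fewer people who share a combination of attributes, the easier it is to single someone out.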
Conclusion
As Africa advances toward ethical AI development, robust data governance within AI regulation offers a framework essential to balancing innovation, inclusivity, and privacy. Addressing exploitative data extraction, enforcing consent-driven collection practices, and ensuring that AI development benefits local communities are critical steps in this process. Open data, while a catalyst for local AI solutions, requires strong governance safeguards to prevent misuse and inequity. By embedding equitable data-sharing agreements, strengthening regional collaboration, and prioritising transparency, Africa can chart a path toward AI development that empowers its people and aligns technological advancement with ethical principles and digital sovereignty. Such a policy, grounded in strong data governance, not only drives responsible AI but also sets a global standard for fair benefit sharing and accountability.
Image by Freepik.
1 CIPIT, ‘AI Assistive Technologies (ATs) for Persons with Disabilities (PWDs) in Africa’ (2023) <https://cipit.org/ai-assistive-technologies-ats-for-persons-with-disabilities-pwds-in-africa/>.
2 CIPIT, ‘The State of AI in Africa Report 2023’ (2023) <https://cipit.strathmore.edu/wp-content/uploads/2023/05/The-State-of-AI-in-Africa-Report-2023-min.pdf>.
3 CIPIT, ‘The State of AI in Africa Report 2023’ (2023) <https://cipit.strathmore.edu/wp-content/uploads/2023/05/The-State-of-AI-in-Africa-Report-2023-min.pdf>.
4 United Nations Development Programme (UNDP) and Italian G7 Presidency, ‘AI Hub for Sustainable Development Strengthening Local AI Ecosystems through Collective Action’ (2024) <https://www.undp.org/sites/g/files/zskgke326/files/2024-07/ai_hub_report_digital.pdf>.
5 PBS, ‘Coded Bias | Films | PBS’ (Independent Lens 2021) <https://www.pbs.org/independentlens/documentaries/coded-bias/>.
6 Benedikt Erforth, ‘Data Extraction, Data Governance and Africa-Europe Cooperation: A Research Agenda’ (2024) <https://www.megatrends-afrika.de/assets/afrika/publications/MTA_working_paper/MTA_WP14_Erforth_Digital_Cooperation.pdf>.
7 Benedikt Erforth, ‘Data Extraction, Data Governance and Africa-Europe Cooperation: A Research Agenda’ (2024).
8 Jess Weatherbed, ‘Meta Fed Its AI on Everything Adults Have Publicly Posted since 2007’ (The Verge 12 September 2024) <https://www.theverge.com/2024/9/12/24242789/meta-training-ai-models-facebook-instagram-photo-post-data>.
9 Jess Weatherbed, ‘Meta Fed Its AI on Everything Adults Have Publicly Posted since 2007’ (The Verge 12 September 2024).
10 Jess Weatherbed, ‘Meta Fed Its AI on Everything Adults Have Publicly Posted since 2007’ (The Verge 12 September 2024).
11 Adil S Al-Busaidi and others, ‘Redefining Boundaries in Innovation and Knowledge Domains: Investigating the Impact of Generative Artificial Intelligence on Copyright and Intellectual Property Rights’ (2024) 9 Journal of Innovation & Knowledge 100630 <https://www.sciencedirect.com/science/article/pii/S2444569X24001690#:~:text=As%20previously%20noted%2C%20GenAI%20technologies,explicit%20permission%20of%20copyright%20holders.>.
12 Oakley Parker, ‘Data Governance and Ethical AI: Developing Legal Frameworks to Address Algorithmic Bias and Discrimination’ <https://www.researchgate.net/publication/384966994_Data_Governance_and_Ethical_AI_Developing_Legal_Frameworks_to_Address_Algorithmic_Bias_and_Discrimination>
13 Aditya Kalra, Arpan Chaturvedi and Munsif Vengattil, ‘OpenAI Faces New Copyright Case, from Global Book Publishers in India’ Reuters (24 January 2025) <https://www.reuters.com/technology/artificial-intelligence/openai-faces-new-copyright-case-global-publishers-india-2025-01-24/>.
14 Aditya Kalra, Arpan Chaturvedi and Munsif Vengattil, ‘OpenAI Faces New Copyright Case, from Global Book Publishers in India’ Reuters (24 January 2025)
15 Bruce Barcott, ‘AI Lawsuits Worth Watching: A Curated Guide | TechPolicy.Press’ (Tech Policy Press 1 July 2024) <https://www.techpolicy.press/ai-lawsuits-worth-watching-a-curated-guide/>.
16 Bruce Barcott, ‘AI Lawsuits Worth Watching: A Curated Guide | TechPolicy.Press’ (Tech Policy Press 1 July 2024).
17 Bruce Barcott, ‘AI Lawsuits Worth Watching: A Curated Guide | TechPolicy.Press’ (Tech Policy Press 1 July 2024).
18 Mark Hill and Courtney Benard, ‘Nvidia Faces Class-Action Lawsuit for Training AI Model on “Shadow Library”’ (Lexology 30 April 2024) <https://www.lexology.com/library/detail.aspx?g=3a665ce3-3db6-40a3-899e-10c2cf606a71> accessed 19 March 2025.
19 Dawn Chmielewski and Katie Paul, ‘Murdoch’s Dow Jones, New York Post Sue Perplexity AI for “Illegal” Copying of Content’ Reuters (21 October 2024) <https://www.reuters.com/legal/murdoch-firms-dow-jones-new-york-post-sue-perplexity-ai-2024-10-21/>.
20 CIPIT, ‘Artificial Intelligence (AI) Training Data and the Copyright Dilemma: Insights for African Developers – Centre for Intellectual Property and Information Technology Law’ (12 February 2025)
21 Copyright Amendment Bill (South Africa) <https://www.parliament.gov.za/storage/app/media/uploaded-files/Copyright%20Amendment%20Bill%20Draft.pdf>.
22 CIPIT, ‘Artificial Intelligence (AI) Training Data and the Copyright Dilemma: Insights for African Developers – Centre for Intellectual Property and Information Technology Law’ (12 February 2025)
23 Raksha Vasudevan, ‘A Lawsuit against Meta Shows the Emptiness of Social Enterprises’ (Wired 20 July 2022) <https://www.wired.com/story/social-enterprise-technology-africa/>.
24 Business and Human Rights Resource Centre, ‘Meta & Sama Lawsuit (Re Poor Working Conditions & Human Trafficking, Kenya) – Business & Human Rights Resource Centre’ (Business & Human Rights Resource Centre 2022) <https://www.business-humanrights.org/fr/latest-news/meta-sama-lawsuit-re-poor-working-conditions-human-trafficking-kenya/>.
25 Business and Human Rights Resource Centre, ‘Meta & Sama Lawsuit (Re Poor Working Conditions & Human Trafficking, Kenya) – Business & Human Rights Resource Centre’ (Business & Human Rights Resource Centre 2022)
26 ‘WeeTracker’ (WeeTracker 25 November 2024) <https://weetracker.com/2024/11/25/openai-sama-kenyan-workers-controversy/> accessed 9 April 2025.
27 Billy Perrigo, ‘Exclusive: OpenAI Used Kenyan Workers on Less than $2 per Hour to Make ChatGPT Less Toxic’ (Time 18 January 2023) <https://time.com/6247678/openai-chatgpt-kenya-workers/>.
28 ‘Kenya to Restrict Use of Locals’ Data for Foreign AI Training’ (The East African 21 January 2025) <https://www.theeastafrican.co.ke/tea/sustainability/innovation/kenya-to-restrict-use-of-locals-data-for-foreign-ai-training-4896508>.
29 ‘Report of the Information, Communications and the Digital Economy Sectoral Working Group Republic of Kenya Ministry of Information, Communications and the Digital Economy’ (2024) <https://ict.go.ke/sites/default/files/2024-09/MICDE%20Sector%20Working%20Group%20Report%20-%20June%202024.pdf>.
30 Zuziwe Msomi and Sally Matthews, ‘Protecting Indigenous Knowledge Using Intellectual Property Rights Law: The Masakhane Pelargonium Case’ (2016) 45 Africanus: Journal of Development Studies 62 <https://unisapressjournals.co.za/index.php/Africanus/article/download/645/432/4917>.
31 Morgan Sullivan, ‘Key Principles for Ethical AI Development’ (Transcend Blog 20 October 2023) <https://transcend.io/blog/ai-ethics> accessed 24 January 2025.
32 Swetha Sistla, ‘AI with Integrity: The Necessity of Responsible AI Governance’ (2024) Journal of Artificial Intelligence & Cloud Computing SRC/JAICC-E180 <https://doi.org/10.47363/JAICC/2024(3)E180> accessed 24 January 2025.
33 Article 4(11) General Data Protection Regulation <https://gdpr-info.eu/> accessed 24 January 2025.
34 African Union Convention on Cyber Security and Personal Data Protection (Malabo Convention), Article 13.
35 Data Protection Act, s 30
36 Mahmoud Barhamgi and Elisa Bertino, ‘Editorial: Special Issue on Data Transparency—Uses Cases and Applications’ (2022) 14(2) J Data and Information Quality art 6 https://doi.org/10.1145/3494455.
37 Emilio Ferrara, ‘Fairness and Bias in Artificial Intelligence: A Brief Survey of Sources, Impacts, and Mitigation Strategies’ (2024) 6 Sci 3 https://doi.org/10.3390/sci6010003
38 KICTANET, ‘Africa Left Behind: Lack of Local Data Hinders AI Effectiveness | KICTANet Think Tank’ (KICTANET 30 April 2024) <https://www.kictanet.or.ke/africa-left-behind-lack-of-local-data-hinders-ai-effectiveness/>.
39 ICT Works, ‘14 Barriers to Using Open Data for Better Development Decisions – ICTworks’ (ICTworks 3 April 2024) <https://www.ictworks.org/open-data-development-decisions/> accessed 19 March 2025.
40 OGP, ‘Open Data for Development (KE0034)’ (Open Government Partnership 2022) <https://www.opengovpartnership.org/members/kenya/commitments/KE0034/> accessed 28 February 2025.
41 ‘The Kenya Open Data Initiative – Centre for Public Impact’ (Centre for Public Impact 26 September 2024) <https://centreforpublicimpact.org/public-impact-fundamentals/the-kenya-open-data-initiative/> accessed 10 April 2025.
42 OGP, ‘Open Data for Development (KE0034)’ (Open Government Partnership 2022) <https://www.opengovpartnership.org/members/kenya/commitments/KE0034/> accessed 28 February 2025.
43 Ugwu Jovita Nnenna and others, ‘Challenges and Opportunities in Implementing Open Government Data Initiatives in East Africa’ (2024) 10 Journal of Social Sciences 1 <https://www.researchgate.net/publication/379534081_Challenges_and_Opportunities_in_Implementing_Open_Government_Data_Initiatives_in_East_Africa>.
44 Jude O Igumbor and others, ‘Considerations for an Integrated Population Health Databank in Africa: Lessons from Global Best Practices’ (2021) 6 Wellcome Open Research 214 <https://pmc.ncbi.nlm.nih.gov/articles/PMC8844538/> accessed 26 February 2025.