Whose Words, Whose Rights? Rethinking CC Licensing in African Language Data Sets.
- Florence Ogonjo |
- July 11, 2025 |
- Copyright,
- Intellectual Property
The rapid advancement of AI has turned nearly all content into potential training data. This shift poses unique challenges for African language datasets. While open licensing frameworks like Creative Commons were designed to democratise access, they often fail to address the extractive dynamics that emerge when data is divorced from its cultural and community context.
It was against this backdrop that Creative Commons, the Centre for Intellectual Property and Information Technology Law (CIPIT) and the Data Science Law Lab (University of Pretoria) hosted a workshop on 9th July 2025. The workshop brought together AI researchers, legal scholars, AI developers and funders to explore alternatives that balance openness and equity from an African perspective.
Central to this were discussions on the limits of traditional open licensing. Sarah Pearson of Creative Commons opened with a clarifying assessment of the CC licences in the age of AI. It was noted that, while the CC BY (Attribution) and CC0 (Public Domain) licences have enabled widespread reuse, they were not created in anticipation of governing machine learning applications. Consequently, CC licences do not address data sovereignty, privacy or economic reciprocity. These are critical gaps when data is scraped, repurposed and monetised by entities that are unfamiliar with the cultural nuances or origins of the data.
Notably, CC Signals, a new framework under development, is an attempt by Creative Commons to bridge these gaps. Grounded in reciprocity, the framework proposes standardised, machine-readable signals that allow data stewards to attach conditions to AI training uses, such as attribution requirements, direct contributions to communities or ecosystem reinvestment. Further discussion acknowledged existing tensions, in particular how such signals could be both legally enforceable and respectful of copyright boundaries. Although there was no clear solution, Ms Pearson suggested that the answer lies in technical standardisation rather than legal expansion, primarily by leveraging existing opt-out protocols from global bodies.
The session also reflected on homegrown solutions, looking at innovative licensing models developed in Africa, by Africans, for Africans. Dr. Melissa Omino presented the Nwulite Obodo License (NOODL), a three-tiered licensing model that seeks to address the gaps in the licensing of African data sets that the traditional CC licensing framework cannot address. The NOODL licence allows African data stewards to retain governance rights and impose conditions on commercial users. Distinctively, NOODL explicitly centres African agency, ensuring that those who create data benefit from its value.
In discussing open source data sets, particularly for African languages, Miguel Morachim, speaking on Mozilla's Common Voice initiative, gave insights into the challenges involved and the need for licences as a governance framework. Miguel highlighted the challenges of crowd-sourcing, noting that projects like Common Voice rely on public contributions but face tensions between open access and community control. On attribution and consent, he emphasised the need for dynamic consent mechanisms that allow contributors to update permissions as reuse contexts evolve, and stressed that licensing frameworks must balance standardisation for scalability with local adaptability to reflect community norms.
The Esethu License, discussed by Aremu Anuoluwapo, presents, much like the NOODL licence, a community-centric approach to data curation and governance that aims to ensure equitable benefit sharing from linguistic resources contributed by local communities. The licence features a six-step approach that begins with data set creation, where native speakers control the data, and proceeds through data set licensing, where community representatives ensure fair benefit distribution, data set release under a suitable licence, contributions for research and commercial use and, lastly, licensing fees received from non-African commercial entities. Sustainability for long-term impact was a key component of the discussions. Chris Emezue of Naija Voices highlighted this as he touched on considerations for key stakeholders, including understanding which stakeholders are being considered and which are left out. He also touched on balancing accessibility and sustainability. His approach would likewise centre the community by creating data sets with the community, having the community co-own the data set, applying a non-commercial default licence and using the data as given.
The discussion led by Prof Okorie brought out significant perspectives and underlying tensions in open licensing standards. While CC BY licensing and similar frameworks promote accessibility, they inadvertently perpetuate extractive dynamics by failing to address existing power asymmetries, cultural context, benefits to the community and compensation for data creators. Dr. Vukosi Marivate highlighted these dynamics in reflecting on data philanthropy, underscoring how openness mandated by funders can undermine sustainability when it is separated from equitable and responsible considerations around data collection and the accessibility of open data. Funders in the room noted that although the proposed solutions, namely NOODL, Esethu and CC Signals, attempt to address and mitigate these tensions by incorporating community-defined terms of use, key implementation challenges remain, particularly in balancing local data governance parameters with scalability while avoiding the bureaucracy likely to arise. Overall, the discussion made clear that data governance ought to approach open data, especially African language datasets, in a contextualised manner where openness is negotiated and not imposed, and where technical frameworks are developed to ensure African datasets are not commodified without reciprocity.
Image used is from Canva.