The Masakhane Project showcases the transformative power of open, collaborative efforts in advancing natural language processing (NLP) for African languages. However, its reliance on the JW300 dataset, a vast multilingual corpus primarily comprising copyrighted biblical translations, exposed significant legal and ethical challenges. These challenges centered on copyright restrictions, contract overrides, and the complexities of cross-border data use. As a result, Masakhane discontinued its use of JW300 and shifted toward community-generated data. The experience illustrates the urgent need for robust copyright exceptions, clear legal frameworks, and ethical data sourcing to foster innovation and inclusivity in global NLP research. Read the full blog post here: https://blog.knowledgegov.org/archives/46447