Generative artificial intelligence has a data problem.
For years, the typical approach to building gen AI models has been to gather as much data as possible by scraping vast swaths of the internet, training at an enormous scale and dealing with the consequences later. The result has been increasingly powerful technology, but also growing concerns about bias, consent, ownership and the uneven distribution of value created from the world’s information.
Mozilla Data Collective was created to fill the gaps in this model.
The organization, which launched last November, is attempting to create a different kind of marketplace for AI data built around community ownership, consent and what founder and Chief Executive E.M. Lewis-Jong calls “fair value exchange.”
“We need clean, abundant, contextualized, consentful datasets to build AI models worth having,” Lewis-Jong said in a recent e-mail interview. “It’s a big, structural problem, and it requires a structural solution.”
Datasets assembled through indiscriminate web scraping often reproduce the same limitations and biases found online, Lewis-Jong said. Entire languages, cultures and communities remain underrepresented in modern AI systems, while many creators have little visibility into how their content is used. Governments around the world are also increasingly scrutinizing the legal foundations of large-scale data collection, creating new compliance challenges for technology companies.
Mozilla Data Collective addresses those issues by putting communities directly into the data supply chain. Rather than treating data as a resource to be extracted, the organization views it as something that should remain under the control of the people who create it.
Rooted in speech
The idea emerged partly from Mozilla’s experience with Common Voice, its long-running initiative to collect speech data from volunteers around the world. Common Voice demonstrated that people are willing to contribute data when they believe their contributions are meaningful and they have a voice in how the project is governed. More than half a million contributors have participated across hundreds of languages, helping create one of the world’s largest publicly available voice datasets.
The rise of generative AI complicated that equation. Communities that had enthusiastically contributed data began asking tougher questions about who ultimately benefits from open datasets when they are absorbed into increasingly concentrated and opaque AI ecosystems. Some continued to favor fully open licensing models, while others wanted more transparency, control or compensation. Mozilla Data Collective has created licenses and policies to accommodate those different preferences.
In its model, sovereignty doesn’t necessarily mean restricting access. Instead, it gives communities the ability to decide for themselves how their data will be used. Contributors can choose to share data openly, require attribution, limit use to educational or research purposes, restrict access geographically or seek compensation. The critical principle is that those decisions belong to data creators rather than to an intermediary platform.
The organization argues that this approach is increasingly important as AI systems expand into languages and cultural contexts that have historically received little attention from technology companies.
Today, the collective hosts hundreds of curated datasets representing more than 300 languages. Its collection includes Hazaragi literature from Afghanistan, oral histories in the Mada language from Cameroon, and Romansh newspapers from Switzerland. Many of these resources would be difficult or impossible to find through conventional commercial data channels.
‘Mission-locked’ enterprise
The organization’s unusual governance structure is intended to reinforce that mission. Mozilla Data Collective operates as what Lewis-Jong describes as a “mission-locked British social enterprise.” That means “our purpose is baked into our governance structure at multiple levels,” Lewis-Jong said. “We exist to give communities ownership and agency over their data, and enable them to define and drive fair value exchange on their own terms.”
The structure was chosen to avoid what the organization sees as the limitations of both traditional nonprofit and conventional for-profit models. Nonprofits can struggle to build sustainable infrastructure at scale, while venture-backed startups face pressure to prioritize growth and monetization over community interests.
The collective’s success is measured both by financial performance and by mission-related objectives. Lewis-Jong said this alignment is essential because many technology companies eventually encounter tension between their stated mission and the incentives created by their revenue models. “We’re held to a double bottom line,” he said. “If we don’t hit our mission stage gates, we don’t get to exist.”
With a $10 million initial commitment from the Mozilla Foundation, Mozilla Data Collective has some wiggle room with revenue. It doesn’t take a percentage of the fees communities choose to charge for their datasets. Instead, contributors receive the full amount, while downloaders pay a separate platform fee to cover infrastructure and operating costs. The goal, Lewis-Jong said, is to encourage transparency and collective bargaining rather than the obscure brokerage arrangements that often characterize data markets.
Creator control
The organization also places significant emphasis on curation and quality control. Every participating organization and dataset is reviewed before being accepted onto the platform. Copyrighted content is rejected if uploaders don’t hold the necessary rights, and fair-use claims aren’t accepted as justification for distribution. The platform combines legal, technical and community safeguards intended to provide clear information about a dataset’s provenance and permissions.
Recently added platform capabilities give data contributors greater control over access and compensation. Among them are tools that allow dataset owners to approve access requests, a conversational assistant that helps developers discover relevant datasets and a forthcoming compensation system that will enable contributors to establish licensing terms and pricing.
The long-term vision is not necessarily to compete directly with the large data brokers that currently dominate AI training pipelines, Lewis-Jong said. Instead, the group sees itself as creating an alternative model that connects developers with communities historically overlooked by mainstream data markets. He described the platform less as a broker and more as a bridge.
Mozilla Data Collective is betting that the future of AI will require more than bigger models and larger datasets. It will require new institutions that balance innovation with consent, participation and trust to ensure the people who create the world’s data have a meaningful role in determining how it’s used.
Image: Mozilla Data Collective