

What does “open” mean in the context of AI? Must we accept hidden layers? Do copyrights and patents still hold sway? And do consumers have the right to opt out of data collection? These are the types of questions that the folks at the Open Source Initiative are trying to get to the bottom of, as part of a deep dive to define “open source AI.”
The rules around what could be considered open source in tech used to be fairly well-defined, according to Stefano Maffulli, the executive director of the Open Source Initiative. Back in the 1970s, it was generally accepted that only things generated by a human could be legally protected with a copyright or a patent. Stuff generated by a machine, such as binary code, generally could not be protected.
That began to change with the PC revolution in the 1980s and Microsoft’s massive success selling software. Following several policy changes and landmark lawsuits, people began seeking and gaining protection for things such as source code and machine-generated binary code, Maffulli says.
With the advent of massive generative AI models that are trained on public data scraped from the Internet, we find ourselves at the edge of what current copyright law can cover. In fact, according to Maffulli, we’ve likely already passed that point, and now find ourselves in dire need of new ideas and new frameworks to define what can and should be protected, and what can and should be open and accessible to all.
“When [GitHub] Copilot was announced [in October 2021], it suddenly dawned that there were new copyright issues appearing on the horizon,” Maffulli tells Datanami in a recent interview. “Then I started diving a little bit deeper into how AI [works], how machine learning, deep learning, neural networks work, and it dawned on me again that there were new artifacts, new things. And we were really at the dawn of a new era where we need new laws, we need new frameworks to understand what’s happening. And we need to do that very quickly.”
OSI ‘Deep Dive’

The OSI deep dive report on open AI is available on the OSI website.
With its “Defining Open Source Deep Dive” program, the OSI is taking a disciplined, multi-pronged approach to understanding every aspect of the question of openness in AI.
It set the process in motion earlier this year with a 20-page report on AI openness in February. In early June, it posted a public call for papers and research on the topic, followed by a set of kickoff meetings in San Francisco later that month. There were two community review workshops in July, in Oregon and Switzerland, followed by a third workshop last week in Spain.
If all goes according to schedule, OSI hopes to submit the first release candidate of its paper defining open source for AI next month. The process will continue into 2024, according to the group’s website.
The group is trying to remain open to all perspectives in coming up with its definition and policy recommendations. “It largely depends on what people want to do,” Maffulli says. “At the Open Source Initiative, we’re just driving this conversation. We’re not really forcing our opinions on anyone.”
A New Age of Data
The radical openness that defined the first 40 years of the Internet served the community well and sowed the seeds of technological progress to come. The egalitarianism of the Internet’s first phase of development fostered a community that thrived on openness and an ethos of sharing.
That started to change with the dawn of the big data age and the advent of social media and smart phones. Tech firms realized they could scrape the Internet for data freely shared by users, as well as some data not freely shared but still available (such as books), to amass huge data sets. Those data sets are now being used to train massive generative AI models that have the potential to not only reshape consumers’ relationship with technology for years to come, but also separate winners from losers on the corporate and creative battlefields.
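Publishers do have one blunt opt-out lever today: the decades-old robots.txt convention, which some AI crawlers say they honor. As a minimal illustration rather than anything drawn from the OSI process, the Python sketch below uses the standard library’s urllib.robotparser to check whether a given crawler user-agent is allowed to fetch a page; the example.com URL and the “GPTBot” user-agent string are just placeholders.

```python
# Minimal sketch: check whether a site's robots.txt permits a given
# crawler user-agent to fetch a page. Standard library only; the URL
# and user-agent strings below are illustrative placeholders.
from urllib import robotparser

def crawler_allowed(site: str, path: str, user_agent: str) -> bool:
    """Return True if robots.txt for `site` allows `user_agent` to fetch `path`."""
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{site}/robots.txt")
    rp.read()  # downloads and parses the site's robots.txt file
    return rp.can_fetch(user_agent, f"{site}{path}")

if __name__ == "__main__":
    # Example: does this site let an AI crawler read its articles?
    print(crawler_allowed("https://example.com", "/articles/", "GPTBot"))
```

Whether a signal like that carries any legal weight, and whether honoring it should be a condition of calling a model “open,” is exactly the kind of question the OSI process is wrestling with.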
One of the big questions that OSI is struggling with is: Does current copyright law still work in the age of AI? The answer hasn’t been settled yet, but the early signs suggest it won’t.
“I think we’re at the point where we should make a decision whether we want those to be covered by copyright or whether we need to create new rights and new obligations for society,” Maffulli says. “What’s the best approach?”
There are different perspectives on these questions, and each deserves to be considered. The debate touches on several aspects of intellectual property rights, including copyrights, patents, trademarks, and trade secrets. But it’s also tied up with privacy rights, security obligations, and labor law, which adds to the complexity.
Maffulli says he understands the plight of a creative worker whose past work can be harnessed to train a GenAI model that re-creates that worker’s output, potentially putting them out of work. Is there any legal recourse for them? Should they be granted legal protections? It’s tempting, he says.
“The reaction to that is to say, wait a second, you have been feeding my images, my text, into this machine and now this machine is capable of replacing me? No!” he says. “I have copyright rights on the work that I have produced. I never authorized anyone to use the archive of my work as a data mining source. Therefore, I want you to ask me for permission. I think that that’s a very fair approach, a very fair reaction.”
However, if communities and governments opt to stiffen data protections, it will naturally make it harder to obtain data to train AI models. That will not only slow the overall rate of AI innovation, but it will likely also have the side effect of entrenching the already dominant positions that OpenAI, Google, and Meta enjoy in the space, he says.
“I think the biggest threat is there will not be the possibility to have a diverse amount of players in the field,” he says. “This is a field that naturally, at every step, favors the ones with the big resources, large amounts of resources. Because the main three components are data, knowledge, and hardware.”
The tech giants already have the data, which they have been systematically scraping from the Internet for years. They have the financial resources to afford the giant GPU clusters needed to train AI models. And they naturally attract the top minds in the field as a byproduct of having massive GPU clusters and lots of data to play with.
Maffulli sounds pragmatic about the potential to enact meaningful change by strengthening copyright protections. The tech giants already have the means to bury lawsuits brought by individuals, he says. And besides, they already have all the data. In many cases, they acquired it fair and square, thanks to consumers’ tendency to click “yes” on every privacy policy dialog box they’re presented with.
‘Cat’s Out of the Bag’
For years Maffulli shared his image and title liberally across the Web. Then at one point, he tried to rein it back in by deleting his image on every major site. It’s his likeness and his right, he figured. He would force the tech giants to forget they ever saw him, he thought. At some point, he realized it was likely impossible.
That experience has informed his view on what is possible to be done with data and the open future of AI. “I think it’s better off if we just let it go,” Maffulli says. “The cat is out of the bag.”
In other words, instead of trying to put the cats back in the bag, we are better off just managing the loose cats as best we can. That means stronger operational controls on data that’s already out in the open, and better guardrails to guide those cats to happy homes.
“I do think that it cannot be solved by copyright law,” Maffulli says. “It needs to be solved by having strong policy, privacy protection laws, strong control from the individual to say ‘I don’t want to be recognized. Therefore, even if you have my face in the database, it gets deactivated. You cannot use it.’”
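What that kind of individual control might look like in practice is still an open design question. As a rough sketch only, and not something OSI or Maffulli has proposed, the snippet below models a record-level consent flag that lets a stored face entry be “deactivated” without being deleted, so any attempt to use it is refused; every name in it (FaceStore, FaceRecord, is_active) is hypothetical.

```python
# Hypothetical sketch of the "deactivation" idea described above:
# the record stays in the database, but a consent flag controls
# whether it can be used. All names are illustrative, not a real API.
from dataclasses import dataclass

@dataclass
class FaceRecord:
    person_id: str
    embedding: list[float]      # stand-in for a stored face embedding
    is_active: bool = True      # flips to False when the person opts out

class ConsentError(Exception):
    """Raised when a deactivated record is requested for use."""

class FaceStore:
    def __init__(self) -> None:
        self._records: dict[str, FaceRecord] = {}

    def add(self, record: FaceRecord) -> None:
        self._records[record.person_id] = record

    def opt_out(self, person_id: str) -> None:
        # The data is not necessarily deleted, but it can no longer be used.
        self._records[person_id].is_active = False

    def lookup(self, person_id: str) -> FaceRecord:
        record = self._records[person_id]
        if not record.is_active:
            raise ConsentError(f"{person_id} has opted out; record may not be used")
        return record
```

Enforcing a flag like that across an industry is a matter of policy and auditing rather than code, which is why Maffulli points to privacy law rather than copyright.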
There are pluses and minuses to open source and to copyright protections, and they must be weighed carefully. OSI’s policy is not to judge how practitioners use open source software, noting that it’s impossible to draw a line between moral and immoral uses. As the debate plays out over what open means in AI, that line is murkier than ever.
Related Items:
Why Truly Open Communities are Vital to Open Source Technology
Do Customers Want Open Data Platforms?
Open Data Hub: A Meta Project for AI/ML Work