Making the Leap From Data Governance to AI Governance
The topic of data governance is one that’s been well-trod, even if not all companies follow the widely accepted precepts of the discipline. Where things are getting a little hairy these days is AI governance, which is a topic on the minds of C-suite members and boards of directors who want to embrace generative AI but also want to keep their companies out of the headlines for misbehaving AI.
These are very early days for AI governance. Despite all the progress in AI technology and investment in AI programs, there really are no hard and fast rules or regulations. The European Union is leading the way with the AI Act, and President Joe Biden has issued a set of rules companies must follow in the U.S. under an executive order. But there are sizable gaps in knowledge and best practices around AI governance, which is a book that’s still largely being written.
One of the technology providers that’s looking to push the ball forward in AI governance is Immuta. Founded by Matt Carroll, who previously advised U.S. intelligence agencies on data and analytics issues, the College Park, Maryland company has long looked to governing data as the key to keeping machine learning and AI models from going off the rails.
However, as the GenAI engine kicked into high gear through 2023, Immuta customers have asked the company for more controls over how data is consumed in large language models (LLMs) and other components of GenAI applications.
Customer concerns around GenAI were laid bare in Immuta’s fourth annual State of Data Security Report. As Datanami reported in November, 88% of the 700 survey respondents said that their organization is using AI, but 50% said the data security strategy at their organization is not keeping up with AI’s rapid rate of evolution. “More than half of the data professionals (56%) say that their top concern with AI is exposing sensitive data through an AI prompt,” Ali Azhar reported.
Joe Regensburger, vice president of research at Immuta, says the company is working to address emerging data and AI governance needs of its customers. In a conversation this month, he shared with Datanami some of the areas of research his team is looking into.
One of the AI governance challenges Regensburger is researching revolves around ensuring the veracity of outcomes, of the content that’s generated by GenAI.
“It’s sort of the unknown question right now,” he says. “There’s a liability question on how you use…AI as a decision support tool. We’re seeing it in some regulations like the AI Act and President Biden’s proposed AI Bill of Rights, where outcomes become really important, and it moves that into the governance sphere.”
LLMs have a tendency to make things up out of whole cloth, which poses a risk to anyone who uses them. For instance, Regensburger recently asked an LLM to generate an abstract on a topic he researched in graduate school.
“My background is in high energy physics,” he says. “The text it generated seemed perfectly reasonable, and it generated a series of citations. So I just decided to look at the citations. It’s been a while since I’ve been in graduate school. Maybe something had come up since then?
“And the citations were completely fictitious,” he continues. “Completely. They look perfectly reasonable. They had Physical Review Letters. It had all the right formats. And at your first casual inspection it looked reasonable…It looked like something you would see on arXiv. And then when I typed in the citation, it just didn’t exist. So that was something that set off alarm bells for me.”
Getting into the LLM and figuring out why it’s making stuff up is likely beyond the capabilities of a single company, and will require an organized effort by the entire industry, Regensburger says. “We’re trying to understand all those implications,” he says. “But we’re very much a data company. And so as things move away from data, it’s something that we’re going to have to grow into or partner with.”
Most of Immuta’s data governance technology has been focused on detecting sensitive data residing in databases, and then enacting policies and procedures to ensure it’s adequately protected as it’s being consumed, primarily in advanced analytics and business intelligence (BI) tools. The governance policies can be convoluted. One piece of data in a SQL table may be allowable for one type of query, but it would be disallowed when combined with other pieces of data.
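To make the idea of a combination-sensitive policy concrete, here is a minimal sketch in Python. It is not Immuta's implementation; the column names, the `FORBIDDEN_COMBINATIONS` list, and the `is_query_allowed` helper are all hypothetical and exist only to illustrate how a column can be fine on its own yet blocked when requested alongside others.

```python
# Hypothetical policy: each entry is a set of columns that must not be
# returned together by a single query. The names are illustrative only.
FORBIDDEN_COMBINATIONS = [
    frozenset({"phone_number", "ssn"}),
    frozenset({"zip_code", "birth_date", "gender"}),
]


def is_query_allowed(requested_columns):
    """Return True unless the request contains a full forbidden combination."""
    requested = set(requested_columns)
    # A combination is violated only when every column in it is requested.
    return not any(combo <= requested for combo in FORBIDDEN_COMBINATIONS)


# A phone number alone is allowed; paired with an SSN it is not.
print(is_query_allowed(["phone_number", "city"]))
print(is_query_allowed(["phone_number", "ssn", "city"]))
```

A real policy engine would also have to account for who is asking and for what purpose, which this sketch leaves out.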
To provide the same level of governance for data used in GenAI would require Immuta to implement controls in the repositories used to house the data. The repositories, for the most part, are not structured databases, but unstructured sources like call logs, chats, PDFs, Slack messages, emails, and other forms of communication.
Working with sensitive data in structured data sources is challenging enough, but the task is much harder with unstructured data sources because the context of the information varies from source to source, Regensburger says.
“So much context is driven by it,” he says. “A telephone number is not a telephone number unless it’s associated with a person. And so in structured data, you can have principles around saying, okay, this telephone number is coincident with a Social Security number, it’s coincident with someone’s address, and then the entire table has a different sensitivity. Whereas within unstructured data, you could have a telephone number that might just be an 800 number. It might just be a company corporate account. And so these things are much harder.”
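The 800-number example can be sketched in a few lines of Python. This is only an illustration of why context matters, not a description of any vendor's detection logic. The `classify_phone_numbers` helper and its labels are hypothetical, and the regular expression handles only simple ten-digit U.S. formats such as 800-555-0199.

```python
import re

# U.S. toll-free area codes; a number with one of these is unlikely to
# identify an individual person.
TOLL_FREE_PREFIXES = {"800", "888", "877", "866", "855", "844", "833"}

# Deliberately simple pattern: three digits, three digits, four digits,
# separated by a hyphen, dot, or space.
PHONE_PATTERN = re.compile(r"\b(\d{3})[-. ](\d{3})[-. ](\d{4})\b")


def classify_phone_numbers(text):
    """Return (number, label) pairs for each phone number found in the text."""
    results = []
    for match in PHONE_PATTERN.finditer(text):
        area_code = match.group(1)
        label = "toll_free" if area_code in TOLL_FREE_PREFIXES else "possibly_personal"
        results.append((match.group(0), label))
    return results


note = "Call support at 800-555-0199 or reach Jane at 301-555-0142."
print(classify_phone_numbers(note))
```

Even this toy version shows the gap Regensburger describes: the second number is flagged only as "possibly personal," because deciding whether it really belongs to Jane requires understanding the surrounding text, not just matching a pattern.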
One place where a company could potentially gain a control point is the vector database used for prompt engineering. Vector databases house embeddings generated ahead of time by an LLM or a dedicated embedding model. At runtime, a GenAI application may retrieve indexed data from the vector database and add it to the prompt, improving the accuracy and context of the results.
“If you’re training a model off the shelf, you’ll use unstructured data, but if you’re doing it on the prompt engineering side, usually that comes from vector databases,” Regensburger says. “There’s a lot of potential, a lot of interest there in how you would apply some of these same governance principles on the vector databases as well.”
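As a rough illustration of what a control point at the vector database might look like, the Python sketch below filters stored items by the caller's role before ranking them by similarity and adding them to a prompt. Everything here is an assumption made for the example: the in-memory `DOCUMENTS` list stands in for a real vector database, the three-number vectors are invented rather than produced by a model, and the `allowed_roles` tags, `retrieve`, and `build_prompt` are hypothetical names. It does not represent a product feature.

```python
import math

# Toy in-memory stand-in for a vector database. Vectors and role tags are
# made up for illustration; real embeddings have hundreds of dimensions.
DOCUMENTS = [
    {"text": "Refund policy: 30 days.", "vector": [0.9, 0.1, 0.0],
     "allowed_roles": {"support", "finance"}},
    {"text": "Customer Jane Doe's account notes.", "vector": [0.8, 0.2, 0.1],
     "allowed_roles": {"finance"}},
    {"text": "Office holiday schedule.", "vector": [0.0, 0.1, 0.9],
     "allowed_roles": {"support", "finance"}},
]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector, role, k=2):
    """Return the k most similar texts among those the role may see."""
    # Governance step: drop anything the caller is not entitled to BEFORE ranking.
    visible = [d for d in DOCUMENTS if role in d["allowed_roles"]]
    ranked = sorted(visible,
                    key=lambda d: cosine_similarity(query_vector, d["vector"]),
                    reverse=True)
    return [d["text"] for d in ranked[:k]]


def build_prompt(question, query_vector, role):
    """Combine retrieved context with the user's question into one prompt."""
    context = "\n".join(retrieve(query_vector, role))
    return f"Context:\n{context}\n\nQuestion: {question}"


print(build_prompt("What is the refund window?", [1.0, 0.0, 0.0], "support"))
```

Filtering before ranking means restricted text never reaches the prompt, so the model cannot repeat it. The harder problem, as the rest of the article makes clear, is deciding which pieces of unstructured content deserve a restrictive tag in the first place.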
Regensburger reiterated that Immuta does not currently have plans to develop this capability, but that it’s an active area of research. “We’re looking at how we can apply some of the security principles to unstructured data,” he says.
As companies begin developing their GenAI plans and building GenAI products, the potential data security risks come into better view. Keeping private data private is a big one that’s on lots of people’s lists right now. Unfortunately, it’s far easier to say “data governance” than to actually do it, especially when working at the intersection of sensitive data and probabilistic models that sometimes behave in unexplainable ways.