November 22, 2021

It’s Time for Governance on Streaming Data, Confluent Says

One of the toughest aspects of big data is simply managing it and ensuring that it’s clean, accessible, and secure. This is hard enough when the data is at rest, but it’s another ballgame entirely when it’s moving. Now Confluent is taking that challenge up with the general release of its Stream Governance suite.

Putting data in motion is literally what Confluent is all about. For the company behind the open source Apache Kafka project, streaming data is its raison d’etre. Data in motion is mostly good, but there are occasions where it’s not so good and in need of control, says Confluent’s director of product management Dan Rosanova.

“I tell people it’s like two sides of the same coin,” Rosanova says. “One is that, oh all your information can be everywhere. That’s wonderful, right? Unless it’s my Social Security number or my bank account number. You really want to be careful where that information is going to.”

The benefits of streaming data are generally well-understood, Rosanova says. Streaming data has propelled Confluent, the developer of one of the most widely used pub/sub systems for moving data (Apache Kafka), to a $20 billion market capitalization following its IPO earlier this year. But as customers become more conscious of the security and regulatory risks of enabling that data to flow, they’re asking for more ways to restrict it.

“There’s a big wave of trying to rationalize about that data, to understand what is where, to understand what is flowing, who has access to it, how it got there,” he says. “And so we’re getting this wave, a desire to bring the concepts of data governance into streams, into data in motion.”

Confluent officially launched its Stream Governance suite at its Kafka Summit event this September. The suite includes three components: Stream Quality, Stream Catalog, and Stream Lineage. The software offerings are available on Confluent Cloud, and can also be used with Confluent Platform running in the cloud or with on-prem Kafka clusters, Rosanova says.

Stream Quality, for example, includes a set of tools designed to help customers define and enforce data quality rules. It includes a schema registry for defining rules, a validation element that enforces those rules at the topic level, and a schema linking capability (in preview) to synchronize schemas across clusters.
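To make the idea of topic-level validation concrete, here is a minimal sketch in plain Python. It is not Confluent's API: the `SCHEMAS` table, the `payments` topic, and the `validate` and `produce` functions are all hypothetical stand-ins, and a real schema registry would typically manage versioned Avro, JSON Schema, or Protobuf definitions rather than a dictionary of Python types.

```python
# Hypothetical in-memory stand-in for a schema registry (assumption, not Confluent's API).
# Maps a topic name to its required fields and the Python type each must have.
SCHEMAS = {
    "payments": {"account_id": str, "amount": float},
}


def validate(topic, record):
    """Return a list of rule violations; an empty list means the record passes."""
    schema = SCHEMAS.get(topic)
    if schema is None:
        return ["no schema registered for topic %r" % topic]
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append("missing field %r" % field)
        elif not isinstance(record[field], expected):
            errors.append("field %r should be %s" % (field, expected.__name__))
    return errors


def produce(topic, record, log):
    """Append the record to `log` only if it satisfies the topic's schema."""
    errors = validate(topic, record)
    if errors:
        raise ValueError("; ".join(errors))
    log.append((topic, record))
```

The point of enforcing rules at the write path is that a malformed record is rejected before any downstream consumer sees it, instead of being cleaned up after the fact.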

Stream Catalog, meanwhile, provides users with a centralized library where groups of users can share what data they have and search for data they need. Confluent likens it to a “digital library” for data in motion that works by centralizing all schema-related metadata and making it available for discovery via the global search bar.

Lastly, Stream Lineage was designed to give users a “big picture” view of data in motion, with an eye toward complying with data regulations. It provides a GUI for visualizing event stream flows at a high level, while also allowing users to drill down to ask specific questions about where data originated, where it’s going, and how it was transformed.
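The lineage questions described above (where data came from and how it was transformed) amount to walking a graph of streams and transformations. The following is a rough sketch under stated assumptions, not a description of how Stream Lineage is built: the `EDGES` list, the stream names, and the `upstream` function are invented for illustration.

```python
from collections import deque

# Hypothetical lineage edges: (source, transformation, destination).
EDGES = [
    ("orders_db", "cdc_connector", "orders_raw"),
    ("orders_raw", "mask_pii", "orders_clean"),
    ("orders_clean", "aggregate_hourly", "sales_dashboard"),
]


def upstream(node, edges):
    """Return every hop that feeds `node`, nearest hop first (breadth-first walk)."""
    hops = []
    seen = {node}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for src, transform, dst in edges:
            if dst == current:
                hops.append((src, transform, dst))
                if src not in seen:
                    seen.add(src)
                    queue.append(src)
    return hops
```

Calling `upstream("sales_dashboard", EDGES)` walks back through `orders_clean` and `orders_raw` to `orders_db`, listing the transformation applied at each hop.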

“Getting the data [moving] around, it’s really cool,” Rosanova says. “But if you don’t provide tooling to visualize, to reason about, and to secure it, you’re getting into a pretty precarious place.”

One of the first things that companies often do when they implement a streaming data platform like Confluent Cloud or Kafka is to build a real-time dashboard, Rosanova says. However, all too often, developers start duplicating data assets. That’s one of the reasons why Stream Governance is needed, he says.

“You don’t really have a good map of where the data is in an organization, so you go to the place you can find it,” Rosanova says. “And then you start kind engineering–you’re on an archaeological journey with your data… And you inadvertently end up usually duplicating a lot of stuff that’s already there.”

At the same time, Confluent recognizes that it is duplicating some of the data management tools that are already in the market. One doesn’t have to dive too far into the Datanami archive to read stories about the providers of data catalogs, the importance of ensuring data quality, and the need to track data lineage. These have all been well-trod themes over the past decade-plus of big data.

The problem is that those third-party tools don’t necessarily provide what Confluent’s customers are demanding, says Rosanova, who has spent many years working in the data integration middleware space.

“This is an area where I personally feel, from my own background and experience and actually doing some of this work for a long time, that the current data governance stuff has not lived up to expectations,” he says.

While third-party providers have developed data lineage, data quality, and data catalog solutions, users cannot access those capabilities directly from within the Confluent or Kafka pipes, which leaves them less than satisfied, he says.

“We all can conceptually understand the value of data governance. But if it’s my job to build this executive’s dashboard, everything that’s blocking me is just a toll, a tax,” Rosanova says. “By being the pipes, by being the conduit through which information flows, if we can make this part of the roadway, part of the flow, it’s a much lower tax and a much lower effort. So like rather than asking people to use a plug in or do extra work to make something work, it’s just in the pipes.”

The response to the Stream Governance suite has been positive, Rosanova says. “There’s been a huge demand for this,” he says. When discussing this during a Zoom call with executives at a “very large Wall Street bank” earlier this year, the executives literally leaned into the cameras when the streaming governance capabilities came up, he says.

“We talked about Kafka, all this stuff,” Rosanova says. “But this is the place where people who are responsible, who have a large amount of responsibility, were very interested because they could see very quickly the problems this solves.”

This is just the start of Confluent’s foray into governance, and the company has a lot more it can do to simplify governance on streaming data, Rosanova says.

Related Items:

Confluent Raises More Than $800M in IPO

Confluent S-1 Reveals ‘Reimagining of Business’ Theme

Real-Time Data Streaming, Kafka, and Analytics Part One: Data Streaming 101