Tackling Data Governance in a Multi-Cloud DW World
Microsoft Azure Synapse is the top destination for customers adopting cloud data warehouses over the next 12 to 24 months, followed by Databricks and Amazon Redshift, according to a new data engineering study released today by Immuta. The survey also found that more than half of companies plan to use two or more cloud data warehouses, which can complicate data governance initiatives, Immuta says.
Immuta’s inaugural Data Engineering Survey found that 75% of data engineering teams plan to adopt at least one cloud data warehouse over the next one to two years. That’s a big deal, says Steve Touw, Immuta’s CTO and co-founder, but it’s not exactly news. “Everyone is moving to cloud,” he says. “I think we know that, but it’s more validation.”
The bigger news was that 52% of the 130 data professionals who took the data governance software provider’s survey plan to adopt two or more data warehouses over that same time period.
“So it’s not that they’re moving just to Snowflake. They’re moving to Snowflake and Databricks. Or Redshift and something else,” Touw told Datanami. “We find that interesting, and a big driver of our product.”
The Immuta report jibes with Gartner’s view, expressed in its first-ever Magic Quadrant for Cloud Database Management Systems, which concluded that the majority of new database deployments will be headed to the cloud. By 2023, cloud databases will account for 50% of the total database market, Gartner found.
The cloud data warehouse market will be well diversified, with organizations using multiple data warehouses to serve multiple data needs. That goes against conventional wisdom, Touw says.
“You’d think when an organization would move to the cloud, they would try to consolidate [databases], but that has not been the case,” he says. “One data warehouse or one lake isn’t going to solve all your problems, so the data ends up in multiple places and multiple different computes.”
Controlling who has access to what data is hard enough with just one database or one data warehouse. When you throw multiple data warehouses into the mix (not to mention the need to ensure that uses of sensitive data do not violate data regulations like HIPAA, GDPR, and CCPA), the data governance challenge can quickly get out of hand.
Center for New Data
The Center for New Data, a new public policy think tank that sprang up earlier this year in response to the COVID-19 pandemic, is a perfect example of how these data governance challenges play out in the real world.
The organization, which has relationships with hundreds of data scientists and public policy researchers at major universities around the country, stores large amounts of sensitive data in a Snowflake data warehouse. For example, a trillion-row dataset of “ping” data from 50 million Americans’ cell phones is being used to determine to what extent Americans stayed home during the recent Thanksgiving holiday to curb the spread of the coronavirus (or, as it appears to be, instead traveled far and wide).
That raw data, which is composed of a device ID, lat/lon coordinates, and a timestamp, is “feature-ized” by researchers and used to create additional data sets. But due to the potential for re-identification and other data abuses, the Center must implement strict controls not only over who has access to the raw data (and derived data sets), but also over what the researchers are allowed to do with it.
The organization uses Immuta to manage data access policies, including which groups can access which data sets. It also controls what those groups can do with the data, such as which data sets can be joined and which ones can’t. The software also controls how much of the data can be accessed as plain text, and which parts of the data must be masked.
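To make the idea of group-based masking concrete, here is a minimal Python sketch of how such a policy could be evaluated against one row of ping data. It is not Immuta's API or the Center's actual configuration: the group names, the `PLAINTEXT_GROUPS` table, the masking choices (hashing the device ID, rounding coordinates to one decimal place), and the sample row are all assumptions made up for illustration.

```python
import hashlib

# Hypothetical policy table: groups allowed to see each column in plain text.
PLAINTEXT_GROUPS = {
    "device_id": {"governance_admins"},
    "lat": {"governance_admins", "mobility_researchers"},
    "lon": {"governance_admins", "mobility_researchers"},
    "timestamp": {"governance_admins", "mobility_researchers"},
}


def mask_value(column, value):
    """Mask one value: hash identifiers, coarsen coordinates, hide the rest."""
    if column == "device_id":
        # A truncated SHA-256 digest hides the raw ID but stays consistent per device.
        return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:16]
    if column in ("lat", "lon"):
        # Rounding to one decimal place blurs the exact location.
        return round(float(value), 1)
    return None


def apply_policy(row, user_groups):
    """Return a copy of the row with every column the user may not read masked."""
    result = {}
    for column, value in row.items():
        allowed = PLAINTEXT_GROUPS.get(column, set())
        if allowed & user_groups:
            result[column] = value
        else:
            result[column] = mask_value(column, value)
    return result


# Made-up sample row with the three kinds of fields the article describes.
ping = {
    "device_id": "abc-123",
    "lat": 38.8977,
    "lon": -77.0365,
    "timestamp": "2020-11-26T18:00:00Z",
}

researcher_view = apply_policy(ping, {"mobility_researchers"})
outsider_view = apply_policy(ping, {"interns"})
```

In this sketch a researcher sees exact coordinates but only a hashed device ID, while a user in no approved group sees coarsened coordinates and no timestamp. A real deployment would enforce this inside the warehouse rather than in application code.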
Because Immuta integrates closely with the data warehouses it supports (for example, Immuta can create native Snowflake “views”), the Center’s administrators can set the governance controls in Immuta, and never have to put their hands on the access and security controls that Snowflake supplies.
That makes life easier for the data engineers and administrators who set up the analytics environments for the researchers, says Ryan Naughton, co-executive director of the Center for New Data.
“Primarily we use Snowflake [but] we’ll be adding Databricks soon,” Naughton says. “As we add more and more components in a cloud database environment, it’s much cleaner than any sort of on-prem situation we’ve had before. And the ability for us to manage access control and all of that, and the new frameworks of governance, has been a breath of fresh air.”
Immuta supports traditional role-based access control (RBAC), as other data governance providers do. But it also supports “purpose-driven” access control and attribute-based access control (ABAC). That further simplifies life for Center for New Data administrators and engineers.
“It means when you’re in the Immuta governance perspective, you’re on a project, and when you’re in that project, you have access to these specific data sources with these masking policies, etc.,” Naughton says. “If you’re going to switch to using this other data set–if you switch your role in Snowflake, for example–now you lose the access to the first, and gain access to the second. But we didn’t have to do all that management in each separate database environment.”
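The project-scoped behavior Naughton describes can be sketched in a few lines of Python. This is an illustrative model only, not how Immuta or Snowflake implement it: the `Project` and `Session` classes, the project names, and the data source names are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Project:
    """A hypothetical project (purpose) and the data sources it unlocks."""
    name: str
    data_sources: FrozenSet[str]


@dataclass
class Session:
    """A user's session; access follows whichever project is currently active."""
    user: str
    active_project: Optional[Project] = None

    def switch_project(self, project: Project) -> None:
        # Switching replaces the active project, so earlier access is dropped.
        self.active_project = project

    def can_read(self, source: str) -> bool:
        if self.active_project is None:
            return False
        return source in self.active_project.data_sources


# Made-up projects and data source names.
holiday_travel = Project("holiday_travel", frozenset({"raw_pings", "county_features"}))
retail_study = Project("retail_study", frozenset({"store_visits"}))

session = Session(user="researcher_1")
session.switch_project(holiday_travel)
before_switch = session.can_read("raw_pings")

session.switch_project(retail_study)
after_switch = session.can_read("raw_pings")
gained = session.can_read("store_visits")
```

Here `before_switch` is true, `after_switch` is false, and `gained` is true: switching projects drops access to the first set of sources and grants the second, with the mapping defined once rather than in each database.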
With its dynamic purpose-driven access control in addition to ABAC, Immuta can help eliminate the need to manage hundreds of individual data access policies. In some cases, it can boil them down to fewer than 10 Immuta policies, which are defined and controlled by a data governance expert in a point-and-click manner using a GUI (thereby eliminating the need for them to be database experts too).
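Why attribute-based rules shrink the policy count can be shown with a toy comparison. The Python sketch below is an assumption-laden illustration, not Immuta's policy engine: the user attributes (`clearance`, `topics`), the data set attributes (`sensitivity`, `topic`), and all names are invented. One attribute rule stands in for the list of per-user, per-data-set grants that would otherwise have to be maintained by hand.

```python
from itertools import product

# Made-up users and data sets, each described only by attributes.
users = [
    {"name": "ana", "clearance": 2, "topics": {"mobility"}},
    {"name": "ben", "clearance": 1, "topics": {"mobility", "retail"}},
]
datasets = [
    {"name": "raw_pings", "sensitivity": 2, "topic": "mobility"},
    {"name": "county_mobility_summary", "sensitivity": 1, "topic": "mobility"},
    {"name": "store_visits", "sensitivity": 1, "topic": "retail"},
]


def abac_allows(user, dataset):
    """Single attribute-based rule: enough clearance and an approved topic."""
    return (
        dataset["sensitivity"] <= user["clearance"]
        and dataset["topic"] in user["topics"]
    )


# The equivalent explicit grants: one entry per (user, data set) pair
# that would have to be created and maintained individually.
explicit_grants = [
    (user["name"], dataset["name"])
    for user, dataset in product(users, datasets)
    if abac_allows(user, dataset)
]

print(f"1 attribute rule replaces {len(explicit_grants)} explicit grants")
```

With two users and three data sets the single rule already stands in for four grants, and the gap widens as users and data sets are added, because new entries only need the right attributes rather than new grants.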
“The goal is to focus those native integration[s] on the cloud stack data warehouse providers that have the most traction,” Touw says. The company already supports Snowflake and Databricks, and today unveiled support for Starburst and Presto.
And there’s more on the way. “We’re going to release [Microsoft Azure] Synapse and Redshift here shortly, as well as [Google Cloud] BigQuery,” he says.
As customers adopt multiple cloud data warehouses, the need for an independent third party to manage governance in that environment becomes more acute. “It’s why we love talking about multi-data warehouse and multi-cloud world, where you need a single place to manage not only your access control, but all your governance and DataOps activities,” Touw says. “That’s where we see ourselves fitting in and helping customers.”