In Search of the Modern Data Stack
The modern data stack is many things to many people. It’s multi-cloud! It’s the data mesh! It’s BI plus AI! To get better visibility into just what the modern data stack is, how it’s evolving, and why it all matters, we look to Fivetran’s “Multi-Cloud Modern Data Stack: Fireside Chat with Industry Trailblazers” for some insight.
For Ali Ghodsi, the CEO and co-founder of Databricks, things are pretty clear: The modern data stack is the open lakehouse architecture, which combines elements of a data warehouse and a data lake to provide high-quality data in support of BI and AI workloads. It’s all about making things simpler.
“What’s going to happen over the next five years,” Ghodsi says during the fireside chat, “is companies like Fivetran and Databricks and many others that are to come [are] going to re-envision how things are done in this amazing new infrastructure that we have. And it’s going to be much, much simpler. You’ll be able to actually do much more with it, and move faster.”
George Fraser, the CEO and co-founder of Fivetran, the cloud-based ELT company that hosted the chat, doesn’t necessarily share Ghodsi’s view of the lakehouse. For Fraser, many attempts at centralization have failed, which is why he has said in the past that data lakes like S3 and ADLS are legacy tech.
“I think the core of the modern data stack is really the modern cloud data warehouse, and I would include Databricks in that category,” Fraser says (adding “you can yell at me later, Ali.”). “These modern analytical stores are so much faster. You can unplug your OLAP cube. You don’t need to do that anymore…There’s all these components that used to exist that just go away.”
Martin Casado, who has overseen a16z investments in both Databricks and Fivetran, isn’t quite so sure that the modern data stack will coalesce around data warehouses like those offered by Snowflake, Databricks, Google Cloud, AWS, and Microsoft Azure.
“You kind of presuppose that the data warehouse is central, and it’s clearly important,” he tells Fraser. “But in our investigations, when we talked to a bunch of customers, it’s pretty clear to us that we’re seeing a multiplicity of data stores out there, and new emergent architectures. It may not be the data warehouse. It’s not obvious to me, certainly, that that’s going to be core.”
As the senior director of data analytics services at Google Cloud, Sudhir Hasbe gets his hands dirty in a lot of different products: BigQuery, Dataflow, Dataproc, Composer, Data Fusion, Data Catalog, Dataprep and PubSub. He unabashedly has a Google-centric view of what the modern data stack entails. “I think I would love to have all data on Google Cloud,” Hasbe quips. “It’s not going to happen.”
Actually, Google Cloud is the most progressive of the cloud giants when it comes to supporting a multi-cloud strategy. With its Dataplex offering, Google Cloud is also on the leading edge of adopting data fabric (or data mesh) approaches to federating management of data stored in different locations.
“Data is distributed in an organization across different clouds, and that’s here to stay for a very long time,” Hasbe says. “So the real question is, how can we enable organizations to leverage all data across all of these platforms, and provide capabilities that are going to be seamless?”
The logical view of the modern data stack gets more complicated when one considers two additional questions: Who is going to use it, and how is the data going to be managed? These may be afterthoughts for small teams. But in large enterprises with multiple departments that don’t necessarily see eye to eye on data (and which may actively be competing with each other), they become tough questions.
“When you have a single copy of data that can be accessed by different engines, the problem is people will create multiple copies,” Hasbe says. “People are blown away when they see how many copies are created by different users within an organization. And that creates a lot of problems in governance, management, and maintaining the compliance that organizations need.”
In Hasbe’s opinion, the best approach is to maintain a centralized data storage tier, along with a common data catalog and a consistent set of governance policies on that centralized data.
Maintaining privacy and security are critical aspects of the modern data stack, says Ghodsi. “I sometimes joke that Databricks is a privacy company,” he says. “Security, privacy, sovereignty, governance – all of that, every company I talk to, a big chunk of the conversation is about that. So that’s going to be super central.”
There are a handful of vendors working in that space, and platform providers like Databricks, Google Cloud, and Snowflake work with a fair number of them. Hasbe spoke about the work Google Cloud is doing with Collibra, while Snowflake has taken an equity position in Alation. Privacy-as-a-service vendors like Immuta, Privacera, and BigID also factor into the equation.
The question for Ghodsi is how it will all shake out. “Everybody knows that whichever vendor has control over that will have a lot of power,” he says. “So who’s going to have that? And everybody is racing toward that, and I think there’s going to be an open standard for that as well and I think that’s going to also be multi-cloud. The clear dominant winner there is not yet clear in the modern data stack. There are lots of different alternatives, but I do think that’s going to be super critical.”
Fraser’s company is all about radically simplifying the ingestion of the data. But once it’s in the database or the lake or the object store, it’s not really Fivetran’s business anymore, although it does work to ensure that all the metadata from the source system is correctly fed into the analytics destination.
“We have an interesting perspective on this at Fivetran because, in the past, this problem of governance was often solved in the data movement layer,” he says. “On the way in, you would sort of deal with governance early. At the beginning you would anonymize data and stuff like that.
“And in the modern data stack, we really moved away from that more towards let’s replicate everything, and then let’s sort it out after it gets there,” he continues. “So the data governance problem is a little bit of the Wild West now. It’s partly our fault, because we’ve been kicking it downstream. I think we’re going to arrive at a better solution, but it’s very much an evolving space right now.”
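The shift Fraser describes can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not how Fivetran or any other vendor implements governance: the function names (`pseudonymize`, `etl_load`, `elt_load`, `governed_view`), the salt, and the sample record are all hypothetical. The first loader masks sensitive fields in the movement layer, before the data lands; the second replicates everything unchanged and leaves masking to a downstream, read-time step.

```python
import hashlib


def pseudonymize(value, salt="demo-salt"):
    """Replace a value with a short salted SHA-256 digest (illustrative only)."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:16]


def etl_load(records, sensitive_fields):
    """Older pattern: mask sensitive fields on the way in, before loading."""
    return [
        {key: pseudonymize(val) if key in sensitive_fields else val
         for key, val in record.items()}
        for record in records
    ]


def elt_load(records):
    """Modern-stack pattern: replicate every field unchanged."""
    return [dict(record) for record in records]


def governed_view(table, sensitive_fields):
    """Downstream governance: apply the same masking when the data is read."""
    return etl_load(table, sensitive_fields)


source = [{"id": 1, "email": "ada@example.com", "plan": "pro"}]
sensitive = {"email"}

masked_on_ingest = etl_load(source, sensitive)   # raw email never lands
raw_table = elt_load(source)                     # raw email is in the destination
masked_on_read = governed_view(raw_table, sensitive)
```

In the second path the raw value sits in the destination, so anyone who can bypass the governed view can see it. That is the downstream governance problem the quote refers to.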
Another area of the modern data stack that needs innovation is the DataOps layer. That’s particularly true in multi-cloud environments, due to the incompatibility of cloud providers’ stacks.
“A lot of DataOps in the past has been done by sort of brute force,” Fraser says. “You move the entire data set every night, you copy it again. That’s a classic pattern that most large enterprises do today, and that pattern is really dead. You cannot do that in a multi-cloud environment. You cannot do that in an environment where you’re replicating an on-prem traditional database to a cloud analytical database… Snapshot replication is just not an option anymore.”
Google Cloud’s Hasbe agrees that, for multi-cloud customers, there are no good DataOps solutions that can provide full visibility into all of the data and the security policies surrounding it.
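To make the contrast concrete, here is a minimal Python sketch of nightly snapshot replication versus an incremental sync. It is a simplified illustration, not a description of any vendor's product: the `Row` type, the `updated_at` column, and both function names are assumptions made for the example. The incremental version tracks a high-water-mark cursor and moves only rows changed since the previous run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    id: int
    value: str
    updated_at: int  # hypothetical change timestamp maintained by the source


def snapshot_replicate(source):
    """Brute force: copy every row on every run. Returns (destination, rows moved)."""
    return {row.id: row for row in source}, len(source)


def incremental_replicate(source, destination, cursor):
    """Copy only rows changed since `cursor`. Returns (new cursor, rows moved)."""
    changed = [row for row in source if row.updated_at > cursor]
    for row in changed:
        destination[row.id] = row  # upsert by primary key
    new_cursor = max((row.updated_at for row in changed), default=cursor)
    return new_cursor, len(changed)


source = [Row(1, "a", 10), Row(2, "b", 20), Row(3, "c", 30)]

# Initial load: everything moves once.
destination, moved_full = snapshot_replicate(source)
cursor = 30

# Later, one row is updated and one is added at the source.
source[1] = Row(2, "b2", 40)
source.append(Row(4, "d", 50))

# Only the two changed rows cross the network on the next sync.
cursor, moved_incremental = incremental_replicate(source, destination, cursor)
```

A timestamp cursor like this cannot see rows deleted at the source. Handling deletes generally takes something more, such as reading the database's change log, which this sketch leaves out.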
“I think the next innovation is going to happen in data observability space,” he says. “There are a bunch of startups doing a bunch of work on how do you use data observability as a platform to do better data operations and DataOps management. I think that’s the next level of innovation that’s going to happen in this industry and I’m watching very closely in that space and seeing what comes out of that.”
You can view Fivetran’s “Multi-Cloud Modern Data Stack: Fireside Chat with Industry Trailblazers” at this link.