Matei Zaharia is a very busy man. When he’s not helping to shape the future of Databricks as its CTO, he is helping to shape the future of computer science as an assistant professor at Stanford University. He also finds time for research and helping with Apache Spark, the open source project for which he is best known.
Amid his hectic schedule with Databricks and Stanford, Zaharia was kind enough to take some time to answer questions from Datanami at the Data + AI Summit that took place in San Francisco last month. Here is a condensed version of that interview arranged by topic.
On Presto and Working in an Open Ecosystem
“People have used Presto with Databricks for a long time. It’s a bit nuanced and people sometimes get confused about it. It is true that we give you a lot of batteries included in our product. If you’re setting up a new data project, you want a compute engine, you want a UI, you want notebooks, model serving, whatever–we have those in our platform.
“But at the same time, the whole point of lakehouse and building on these open formats and open APIs like Spark is that many other people will build interesting applications and you can run those on your data too. So for example with Presto, that’s one where since the beginning you could run Presto alongside Databricks…They interface with each other. They share one copy of the data. They can both work in place on it. In many other areas, we have partners that do all kinds of workloads, from machine learning to BI to whatever else–streaming. They all integrate.”
On the Importance of Open Table Formats
“When we started Delta Lake, we started it as just a feature in our product, and we saw it was so useful because it added transactions and data versioning and features like that to your data lake that virtually every customer was adopting it. And everyone was also asking how do I get external tools to work with it? All of these open source things like PyTorch or TensorFlow–all these tools out there. That’s why we decided to make the format open.
“It’s hard to take something that’s proprietary and immediately release all of it. So for a while we had some extensions, mostly around performance, that weren’t open. But we always wanted to encourage this ecosystem and we think it’s the right next step and we invested the work to do that.”
On Developing Features Privately, Then Open Sourcing Them
“Spark started as a research project at UC Berkeley. And actually for a while we developed it at Berkeley. We didn’t have it on a public GitHub. But we had users in there and when it got good enough…we said, hey this is actually useful and we want external people to use it. We released it.
“Even Hadoop, if you remember at the beginning–a lot of the development was happening in Yahoo or in Facebook. Facebook developed Hive on its own then open sourced it after it was sort of already working. The way to think about it is, especially as an enterprise company, [is] you want to release things that you can keep supporting in the future. The worst thing is you tell someone, hey go do something this way, and then a year later, they say we’re canceling that! We want to deprecate it. So we want to make sure that things are tested and stable enough that we want to commit to them.”
On the Future of Open Source Innovation
“I think different kinds of projects can start in different ways. And for things that are established–like all the new features in Delta Lake and Spark–many of them we just build out there from the beginning. But for something that’s a whole new concept, like Delta Lake is, ‘Hey, here’s how you manage all your data.’ It’s really risky if people start adopting it and then it’s actually the wrong design and you have to tell them to migrate. That’s sort of a challenge. It’s something that companies are figuring out.”
“We’re seeing a lot of other companies that want to engage deeply in the [open source] development process. I think we were seeing that about two-thirds of contributions [in Spark] are from outside Databricks now and we expect that to increase. We also want to give them a very easy way to do that, where they know everything in there, they can plan how everything will integrate. All our roadmap is public also, so we can discuss, we want to do this at this time. And people can say, can you wait to put in something else, or whatever.”
On the Selection of The Linux Foundation Over Apache Software Foundation for Delta Lake
“They’re both great open source hosting foundations. With Linux Foundation, we saw a lot of interesting cloud and AI projects in there–for example, Kubernetes is in there–and we want to make sure we integrate well with those. That’s why we went for it. For each project, we’ll put it wherever we think it makes the most sense. For example, for a lot of stuff in Spark, obviously we’re adding modules and stuff to Apache Spark.”
On Current State of the Apache Spark Project
“There’s quite a bit going on. We’re actually talking about two efforts that we want to contribute a lot of engineering resources to. One of them is streaming, improving stream processing performance, operability, and just functionality with what we call Project Lightspeed.”
“This is a pretty surprising one to us. We had streaming on our platform for a while. We didn’t have a huge engineering team working on it. It was just kind of working. And then when we looked at the metrics for usage, we saw that it’s growing very quickly. It actually grew by a factor of nine in usage in the past three years. And it was actually growing at a faster rate than our batch jobs and interactive and other stuff, which is pretty cool for something where basically they said there’s not that much engineering going in.”
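For a sense of scale, the ninefold figure can be converted into a yearly rate. The short sketch below does that arithmetic; it assumes usage compounded evenly across the three years, which the interview does not state, so the result is only a rough approximation.

```python
# Annualized growth implied by "a factor of nine in the past three years".
total_growth = 9.0  # figure quoted in the interview
years = 3           # figure quoted in the interview

# Assumption (not from the interview): growth compounded evenly each year.
annual_factor = total_growth ** (1 / years)
annual_pct = (annual_factor - 1) * 100

print(f"{annual_factor:.2f}x per year (~{annual_pct:.0f}% annual growth)")
# prints: 2.08x per year (~108% annual growth)
```

Under that assumption, streaming usage would have roughly doubled every year.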
On Apache Flink Vs. Spark Structured Streaming
“There are definitely differences [between Spark Structured Streaming and Flink]. We’re looking closely at that. They do cater to slightly different audiences. So for Structured Streaming, as I said, we wanted to make it very easy if you start with a batch query or interactive query to just turn it into a stream, so the number one thing we prioritized is how easy you can write a job.
“With Flink, often the teams using it are more advanced. They’re engineers who want fine-grained control over everything, and they’ll often squeeze out very low latency from it. It’s usually better at latency–not at throughput, but latency–than Spark is. So we’re looking at how we can improve latency and throughput [with Project Lightspeed] while keeping the ease of use and also add operability.
“The advanced APIs are another one, the advanced windowing and so on. Those are things we didn’t use to have that we’re adding….Basically we want sub-second latency even for pretty complicated queries. Right now, it’s pretty easy to get around a minute of latency for most kinds of queries. We think we can bring many of them to sub-second.”
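The idea Zaharia describes for Structured Streaming, taking a batch query and turning it into a stream, can be illustrated with a small sketch. This is plain Python and not the Spark API: the event shape, the function names, and the micro-batch split are all hypothetical, chosen only to show one piece of query logic running over a complete data set and then incrementally over arriving chunks.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Optional

# Hypothetical record shape: {"title": str, "errors": int}
Event = Dict[str, object]


def count_errors(events: Iterable[Event], totals: Optional[Counter] = None) -> Counter:
    """The 'query': total playback errors per title."""
    totals = Counter() if totals is None else totals
    for event in events:
        totals[event["title"]] += event["errors"]
    return totals


def run_batch(events: List[Event]) -> Counter:
    """Batch mode: run the query once over all the data."""
    return count_errors(events)


def run_stream(micro_batches: Iterable[List[Event]]) -> Iterator[Counter]:
    """Streaming mode: reuse the same query, updating state per micro-batch."""
    totals = Counter()
    for batch in micro_batches:
        count_errors(batch, totals)
        yield Counter(totals)  # snapshot of the running result


events = [
    {"title": "A", "errors": 2},
    {"title": "B", "errors": 1},
    {"title": "A", "errors": 3},
]

batch_result = run_batch(events)
stream_results = list(run_stream([events[:2], events[2:]]))

# Once all the data has arrived, the stream matches the batch answer.
print(batch_result == stream_results[-1])
```

The point of the sketch is that the query logic is written once. Only the driver around it changes between batch and streaming, which is the ease-of-use property the quote emphasizes.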
On 2022 Being the Year That Streaming Data Finally Breaks Out
“It might take a while. But we are seeing pretty interesting signs of it. One thing we’re seeing is basically a double-digit percentage of our workload is streaming. That didn’t use to be the case a few years ago. So definitely increasing. There’s just the trend in more enterprises to want to build operational applications with their data. It’s not everyone.
“The thing driving it tends to be more these applications. Like say I’m running a streaming movie service and I want to recommend stuff or fix quality issues in real time, as opposed to, I think what a lot of people thought was any kind of BI or dashboard I see will magically turn into streaming and be faster. That hasn’t been as useful. And that’s kind of a nice-to-have. But for these operational ones, you kind of have to have it work. If your streaming video thing goes down for a few minutes and people just leave–that’s what’s driving it.”
On Whatever Happened to Apache Spark GraphX
“It’s still around. It’s something called GraphFrames. But there hasn’t been that much new activity in it. We still see usage of it. It’s something that could pick up more, but we haven’t done anything super major there.”
“But it is there. It actually benefits from things like Photon. Underneath the hood, it’s doing a lot of joins and SQL computation, so it does benefit from that. But we’re not doing some huge new effort there.”
On Data Gravity Vs. Data Silos
“There’s a little bit of nuance. I do think the world is fragmenting, especially geographically. It’s very hard to move any data across geographical boundaries. And it’s going to get even harder. So you do need to deploy your computations and your machine learning and all that stuff into many regions. That does bring new challenges. That’s one of the reasons we’re excited that our offering works across cloud and so on, is you can actually do that even if you have different vendors in different regions.
“At the same time though, if you think within a region, a lot of enterprise computing is moving into the cloud. And what’s really different in the cloud compared to the way you used to manage IT is all your computation, all your data is on the same really fast network inside that data center. So historically for example, maybe you had two departments that each set up a data warehouse and they each paid for it. They each had their own cluster. It would be very hard to connect the two and search across them and combine them.
“In the cloud, there’s no reason why, since they’re both just some buckets in S3—there’s no reason why you can’t have a job done, scan data in both, and combine them. That’s why we’re betting on open formats, first of all. If you have a team that’s using Databricks and one using Presto, they can both see each other’s data, and we’re just starting to give you features to federate all your data together and combine it all in one interface. So I think that is a change. There are so many companies building on these open formats, so many pieces of software–even the major cloud vendors, they all support Parquet, Delta Lake, things like that.”
On the Possibility of An On-Prem Databricks Environment
“We do support [multi-cloud]. We don’t offer Databricks itself on prem now. But we can connect through all these cloud-to-on-prem links, and you can have reasonable performance accessing that data.
“We’re always investigating whether we should have an on-prem [offering] too. And right now, we found we can get pretty far with just the ability to connect that data, and the open APIs, like Spark, where you could run the same job on prem or Databricks. But we’ll have to see.
“For multi-cloud, we’re seeing a lot of need for that. One of the things we’ve invested in is really good support for Terraform. It’s from HashiCorp. Basically it’s a way to script deployment of software into different clouds and to automate it. If you want to deploy the same application in three different cloud regions, you can write a script and it can connect to each one and it does that. That’s an open source project that we do integrate with. So we do see people managing multi-cloud deployment this way.”
On the Rise of Data Fabrics and Data Meshes
“We do try to support it…They’re more like architectures, or patterns for how organizations should manage work internally. Like how do you set up teams? Is there one central data team in your company, or are there several?
“And we’re more of a technology platform, so we want to support all these different patterns. There are some pieces of technology you need for some of them, and so we’re investing in some of those. For example, with Unity Catalog, which is our governance layer, you can delegate ownership of part of your catalog to different individuals, so they can each own their piece and still combine them. We also have our data sharing protocol, Delta Sharing. That allows you, even if you have completely different deployments of Databricks, or even other software, you can still share data between them.”
“We don’t have a specific data mesh management layer. We have the low-level kind of technology bits you can use to build a data mesh architecture…. I do think even with organizations that build data mesh, they’re going to want to put the data in the same data centers and the same cloud regions, because of the speed and the low cost of them combining across them. It’s more about ownership. That’s what it’s about. It’s a little bit like microservices in software. It used to be everyone had to add code into one giant application [that was] super slow to release stuff. Now people have these different things they own that they can each kind of manage.”