Databricks CEO on Streaming Analytics, Deep Learning, and SQL
As Apache Spark continues to gain steam, so too does Databricks, the company behind the popular distributed processing framework. At the recent Strata + Hadoop World conference, we caught up with Databricks CEO and co-founder Ali Ghodsi to get the latest scoop on advances in the Spark world.
Real-time stream processing and streaming analytics are gaining a foothold these days, so it’s no surprise to see Spark in the middle of the action. “Streaming is taking off,” Ghodsi says. “This year there’s a huge uptick in real-time streaming.”
Databricks, which drives development of Apache Spark, developed the new Structured Streaming engine in Spark 2.0 to make developing and deploying real-time applications easier. According to Ghodsi, no other framework can deliver everything that you need to build continuous applications.
“If you take Storm or existing streaming systems, they’re great for very basic statistics, but what if I want to do real-time machine learning?” he says. “How do you do that with Storm or some other system? That’s what Structured Streaming is about. It enables you to do machine learning, ETL, SQL–all those things in real time.”
A Unified API
Spark users can access the Structured Streaming functionality through the same API that they use to access other parts of Spark, including the MLlib machine learning library, the SQL engine, the R engine, and the GraphX graph processing engine, as well as new libraries that are in the works (more on that later).
“We used to have multiple APIs in Spark. Now we have one single API. It’s called DataFrame,” Ghodsi says. “That’s it. You learn that one API and you’re done.”
Ghodsi was clearly happy with the recent Spark survey that found the upstart distributed processing framework was going mainstream. “It’s the most awesome survey you can find on the planet,” he offers, “but it’s biased for Spark.”
Spark in the Cloud
The survey points to growing momentum for cloud-based Spark deployments. Sixty-one percent of the 1,600 or so survey-takers had cloud deployments of Spark, which was up 10% compared to last year’s survey. On-premise Spark deployments are down.
For Databricks, which runs its SaaS offering on Amazon Web Services EC2 infrastructure, the cloud just makes sense. “The way you use Databricks today is you get a tiny little cluster, which barely costs you anything, and as you start submitting queries and the workload increases, it just grows,” Ghodsi tells Datanami. “It can grow up to 1,000 machines. When you stop using it, in an hour or so, it shrinks down again. So it turns everything on its head.”
You could try to duplicate this type of elastic Spark setup with your on-prem cluster, perhaps by running Spark in a container technology like Docker or on a cluster manager like Apache Mesos, which like Spark came out of UC Berkeley’s AMPLab. (Databricks uses Kubernetes to virtualize Spark on AWS, Ghodsi says.)
You could do that, but it would be difficult to replicate the elasticity of true cloud computing without sacrificing performance or spending a fortune, the Databricks CEO says.
“You either have to buy more machines and overprovision for the worst case, or you go for the average case, in which case you’re going to have poor performance in the middle of the day,” he says. “We are using containers under the hood to achieve this. But if you ask [Databricks customers] ‘Are you using containers for Spark?’ They would say no because it’s a black box to them. That’s the nice thing. You as a customer don’t see that.”
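The trade-off Ghodsi describes can be put into rough numbers. The sketch below is purely illustrative: the hourly demand figures are made up, the helper names are hypothetical, and the "elastic" case assumes capacity tracks demand exactly, ignoring the hour or so Ghodsi says a cluster takes to shrink. It is not a description of how Databricks sizes clusters.

```python
# Hypothetical demand for one day: machines needed in each of 24 hours.
# These numbers are invented for illustration only.
demand = [2, 2, 2, 2, 3, 5, 10, 20, 35, 50, 60, 60,
          55, 50, 40, 30, 20, 12, 8, 5, 3, 2, 2, 2]


def machine_hours(capacity):
    """Total machine-hours paid for, given capacity per hour."""
    return sum(capacity)


def unmet_demand(capacity, demand):
    """Machine-hours of work that could not be served on time."""
    return sum(max(d - c, 0) for c, d in zip(capacity, demand))


peak = max(demand)
average = -(-sum(demand) // len(demand))  # ceiling of the mean

# Three provisioning strategies over the same day.
strategies = {
    "fixed, sized for worst case": [peak] * len(demand),
    "fixed, sized for average": [average] * len(demand),
    "elastic (idealized)": list(demand),
}

# Collect (machine-hours paid, machine-hours unmet) per strategy.
results = {
    name: (machine_hours(cap), unmet_demand(cap, demand))
    for name, cap in strategies.items()
}

for name, (paid, unmet) in results.items():
    print(f"{name}: {paid} machine-hours paid, {unmet} unmet")
```

With these made-up figures, sizing for the worst case pays for three times the machine-hours the workload actually uses, while sizing for the average leaves a large midday shortfall. That is the squeeze the elastic model is meant to avoid.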
Rapid Dev Cycle
Databricks’ cloud customers also benefit from a rapid iteration cycle driven largely by the demands of its largest customers. Organizations that get Spark through their Hadoop distributors (none of which has shipped Spark 2.0 yet) are not exposed to this cycle of change and innovation.
“There’s a new version of Spark coming out Monday, and the Monday after that and the Monday after that,” Ghodsi says. “We used to be driven by looking at the mailing list and talking to the community. Now we just look at what the customers want.”
Security is a big deal for these cloud-based Spark customers. You can expect to see more security features, including enterprise-level auditing functionality and fine-grained role-based access control (RBAC), in upcoming versions of the Databricks offering. There will also be more vertical-specific functionality added, says Kavitha Mariappan, the vice president of marketing for the San Francisco company.
Deep Learning on GPUs
GPU-based deep learning is also on the docket for Databricks. “There’s something called TensorFrames, which is a mixture of TensorFlow and DataFrames,” Ghodsi says. “We’re working with customers to do image classification using deep learning.”
At the moment, the TensorFrames work is being done outside of Spark proper, but eventually it will get its own library in the framework, right along with MLlib, Spark SQL, Spark Streaming, and GraphX. “It’s coming,” Ghodsi says.
Coordinating the deep learning work among multiple GPU-equipped machines will take some work. “Spark has always, on a single machine, used different libraries,” Ghodsi says. “For instance on a single machine, when doing matrix manipulations, we’ve always used this thing called BLAS [Basic Linear Algebra Subprograms]. And now we use TensorFlow if we need to do GPU-based gradient descent. But how do you orchestrate it? How do you get all those machines to, in parallel, work on this? That’s where Spark comes in.”
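The orchestration pattern Ghodsi is pointing at can be shown in miniature. The following is a minimal sketch in plain Python, not Spark or TensorFrames code: each data partition stands in for a machine, a thread pool stands in for the cluster, and the function names, toy dataset, and learning rate are all assumptions chosen for illustration. Each "worker" computes a partial gradient on its own partition, and a coordinator sums the partials and updates the shared model.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_gradient(partition, w):
    """Sum of d/dw (w*x - y)^2 over one partition (one 'machine')."""
    return sum(2 * (w * x - y) * x for x, y in partition)


def distributed_gradient_descent(partitions, steps=200, lr=0.01):
    """Fit y = w * x by averaging per-partition gradients each step."""
    n = sum(len(p) for p in partitions)
    w = 0.0
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        for _ in range(steps):
            # "Map": every worker computes a gradient on local data.
            grads = pool.map(lambda p: partial_gradient(p, w), partitions)
            # "Reduce": the coordinator combines them and updates w.
            w -= lr * sum(grads) / n
    return w


# Toy data generated from y = 3x, split across three pretend machines.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
partitions = [data[0:3], data[3:6], data[6:8]]

w = distributed_gradient_descent(partitions)
print(f"learned weight: {w:.4f}")
```

In a real deployment the per-partition step would be handed to a numerical library (BLAS on CPUs, or TensorFlow on GPUs, per Ghodsi), and the framework would take care of shipping data and combining results. This sketch shows only the shape of that division of labor.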
SQL, or Esperanto?
Ghodsi also offered some advice to other distributed processing frameworks regarding support for SQL. “It’s a big mistake” to not support it, he says.
The folks behind Spark debated whether or not to add SQL support back in 2012, Ghodsi says. “It was a very hefty debate,” he says. “I’m glad we did it though. People were saying SQL is the old way. Why would anyone want to use SQL? But the SQL side won out, and we added SQL and now all of our customers are using it. So it would have been a big mistake not to do it, looking back.”
Ghodsi, who hails from Sweden, compared the SQL debate to attempts to replace English as the lingua franca for people around the world.
“It’s like saying Esperanto is superior to English,” he says. “It’s superior, the grammatical structure is superior, the vocabulary is more standardized, so let’s just use that and drop English. Good luck. I think that’s what the debate looked like at Databricks.”
Today, Esperanto, which was developed by a late nineteenth century Polish ophthalmologist as an easy-to-learn common second language in the hopes of unifying people around the globe, is estimated to be spoken by 10,000 to 10 million people. English is spoken by about 1.5 billion.