Spark’s New Deep Learning Tricks
Imagine being able to use your Apache Spark skills to build and execute deep learning workflows to analyze images or otherwise crunch vast reams of unstructured data. That’s the gist behind Deep Learning Pipelines, a new open source package unveiled yesterday by Databricks.
Deep Learning Pipelines, which was announced at the Spark Summit conference in San Francisco on Tuesday, will essentially provide a way to connect the Spark MLlib library to popular deep learning frameworks like TensorFlow and Keras.
This will allow Spark users to leverage existing work they’ve done in MLlib, and to execute deep learning models directly in Spark’s existing machine learning library, says Reynold Xin, co-founder and chief architect at Databricks, the commercial outfit behind Apache Spark.
“It’s a library to integrate essentially all deep learning libraries with Spark to make deep learning substantially easier without having to actually learn about the specifics of deep learning,” Xin tells Datanami.
Deep Learning Pipelines will start out as its own open source project, separate from the Apache Spark project, Xin says. Over time, depending on how things go, it could be folded into the main Apache Spark project. “It’s possible,” he says. “We haven’t actually thought a lot about it. We want to get it out there and work with users.”
In the meantime, Databricks will include the new deep learning library in its own Spark-based software as a service (SaaS) offering. Databricks’ version will leverage the concept of transfer learning to take existing deep learning models available in the open domain and modify them to make them more applicable to its customers’ specific domains, Xin says.
“There might be a generic model for doing image classification, but maybe one of our customers wants to detect what kind of car is in a picture,” he says. “We have this technique called transfer learning built into this library that, with just a few lines of code, allows users to apply an existing model, published by pretty much anybody on the Internet, and then retrain it on a much smaller amount of data in a much faster fashion, in just a few minutes, and then get a better model for their domain.”
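To make the idea concrete, here is a minimal sketch of transfer learning in plain Python. It does not use the Deep Learning Pipelines API, which the article does not describe; every name in it (`pretrained_features`, `train_head`, `predict`) is hypothetical. A fixed function stands in for a published, pretrained network, and only a small classifier on top of it is trained on a handful of domain-specific examples.

```python
import math

def pretrained_features(pixels):
    """Hypothetical stand-in for a frozen, pretrained network.

    A real featurizer would be the inner layers of a published deep model;
    here it just summarizes the input as [mean, spread]. It is never updated.
    """
    mean = sum(pixels) / len(pixels)
    spread = max(pixels) - min(pixels)
    return [mean, spread]

def train_head(examples, labels, epochs=500, lr=0.5):
    """Train only a small logistic-regression head on the frozen features."""
    feats = [pretrained_features(x) for x in examples]
    weights = [0.0] * len(feats[0])
    bias = 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            z = sum(w * v for w, v in zip(weights, f)) + bias
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of class 1
            err = p - y                      # gradient of the log loss w.r.t. z
            weights = [w - lr * err * v for w, v in zip(weights, f)]
            bias -= lr * err
    return weights, bias

def predict(model, pixels):
    """Classify one input using the frozen features plus the trained head."""
    weights, bias = model
    f = pretrained_features(pixels)
    z = sum(w * v for w, v in zip(weights, f)) + bias
    return 1 if z > 0 else 0

# A tiny, made-up "customer" dataset: label 1 for bright inputs, 0 for dark ones.
examples = [[0.9, 0.8, 1.0], [0.8, 0.9, 0.7], [0.1, 0.2, 0.0], [0.2, 0.1, 0.3]]
labels = [1, 1, 0, 0]
model = train_head(examples, labels)
```

Because the expensive part (the featurizer) is reused as-is, only a few parameters are fitted, which is why retraining on a small dataset can be quick.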
Another cool feature that Databricks is adding with Deep Learning Pipelines is the capability to expose a trained deep learning model as a SQL function.
“With one line of code now the data scientist or data engineer who actually trains the model can make this model available as a SQL function,” Xin says. “So even a business analyst will be able to build, for example, predictions in their BI tools.”
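The pattern Xin describes, registering a model so that it can be called from SQL, can be illustrated with a small analogy. The sketch below uses SQLite from Python's standard library rather than Spark SQL or Deep Learning Pipelines, and the "model" (`predict_label`) is a hypothetical threshold rule standing in for a trained network. It shows only the general shape: one registration call, after which anyone who can write SQL can request predictions.

```python
import sqlite3

def predict_label(brightness):
    """Hypothetical stand-in for a trained model's prediction function."""
    return "day" if brightness >= 0.5 else "night"

conn = sqlite3.connect(":memory:")

# The single registration step: expose the Python function to SQL by name.
conn.create_function("predict_label", 1, predict_label)

# Made-up data that an analyst might query from a BI tool.
conn.execute("CREATE TABLE images (id INTEGER, brightness REAL)")
conn.executemany("INSERT INTO images VALUES (?, ?)", [(1, 0.9), (2, 0.1)])

# The analyst's side: plain SQL, no knowledge of how the model works.
rows = conn.execute(
    "SELECT id, predict_label(brightness) FROM images ORDER BY id"
).fetchall()
```

The person who trains the model and the person who queries it are decoupled: the analyst only needs to know the function's name and arguments.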
Deep Learning Pipelines supports TensorFlow and Keras now, but will likely be extended to support other popular deep learning frameworks. MXNet is popular on Amazon, while Theano, Torch, and Caffe are also gaining more attention as deep learning techniques become more popular.
This isn’t Spark’s first foray into deep learning or GPU computing. But the folks at Databricks are bullish that the new Deep Learning Pipelines project could revolutionize deep learning for a more general audience.
“We do see that this library has the potential to do for deep learning what Spark did for big data, to make deep learning much more accessible to everybody,” Xin says. “Deep learning is at a similar stage right now to what MapReduce was for big data. You can actually get good results if you spend an enormous amount of time with great talent, and a lot of PhDs in deep learning and machine learning. But after spending time with a lot of customers, we realized that this is just too difficult to use. Training could take weeks. And it has a very steep learning curve, so we need to make something easier.”
The commercial outfit behind Spark also used its Spark Summit conference to make two other announcements: a new serverless analytics architecture for its cloud environment and the general availability of Structured Streaming.
While some customers like fiddling with settings and optimizing the configuration for peak performance, other customers just want something that’s easy to set up so they can quickly start querying data. That’s what Databricks’ Serverless offering is designed to enable.
According to Xin, the new Databricks Serverless offering will automatically configure the environment and automatically adapt the cluster to workloads.
“It auto-scales the nodes, it auto-scales the local storage attached to the nodes, and it automatically adapts when there’s a lot of users connecting to the cluster,” he says. “In many ways I think it will be better, taking away the knobs from the users, because we believe it provides a better experience for a vast number of users.”
The service, which provides isolation via Linux containers and runs on the new Databricks Runtime version 3.0 (based on Apache Spark version 2.2 under the hood), will allow users to submit queries through their data science notebooks or through JDBC or ODBC connections. It will initially support the Spark SQL and Python DataFrame APIs, but over time could be expanded to support other Spark engines, Xin adds.
Finally, Databricks also announced that it’s fully supporting the new Spark Structured Streaming API in its cloud environment. It also made some tweaks to its implementation of Structured Streaming to eliminate the micro-batch approach to workloads, thereby bolstering response time and shrinking latency.
“There’s been a lot of criticism on Spark being micro-batch based and having high latency,” Xin says. “So we did a bunch of experiments with new changes.”
The micro-batch approach worked for data analytics jobs that could tolerate latencies of up to 100 milliseconds, which was fine for some workloads. But some clients demanded latencies as low as 1 to 5 milliseconds, which forced them to look to different streaming frameworks.
One of the changes Databricks is playing with involves replacing the micro-batch design with more of a macro-batch, or long-running setup. “What we’re doing under the hood is to allow Spark to launch a job that’s long running,” Xin says. “Instead of a micro-batch, it’s actually running super long batches.”
This approach, which Databricks calls continuous processing and which is available in Databricks Runtime 3.0 (and, presumably, Apache Spark down the road), will be useful for lightweight processing of data with latency demands down to 1 millisecond, Xin says. “This is a new mode that allows customers to get 1ms latency,” he says.
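A toy model helps show why the two designs differ so much in latency. The sketch below is not Spark code and makes simplifying assumptions that are not from Databricks: in the micro-batch case a record simply waits for the next batch boundary and processing time is ignored, while in the continuous case a long-running task handles each record on arrival at a fixed per-record cost. The function names and the sample arrival times are made up for illustration.

```python
def micro_batch_latencies(arrival_ms, batch_interval_ms):
    """Latency when each record waits for the next batch boundary.

    Assumption: batches fire at multiples of batch_interval_ms and
    processing itself takes no time.
    """
    latencies = []
    for t in arrival_ms:
        next_boundary = (t // batch_interval_ms + 1) * batch_interval_ms
        latencies.append(next_boundary - t)
    return latencies

def continuous_latencies(arrival_ms, per_record_cost_ms):
    """Latency when a long-running task handles each record as it arrives.

    Assumption: every record costs the same fixed amount to process.
    """
    return [per_record_cost_ms for _ in arrival_ms]

# Made-up arrival times, in milliseconds since the stream started.
arrivals = [5, 40, 99, 130]

batch = micro_batch_latencies(arrivals, batch_interval_ms=100)
cont = continuous_latencies(arrivals, per_record_cost_ms=1)
```

In this model the micro-batch latency depends on where in the interval a record happens to land, so it can approach the full batch interval, whereas the continuous latency is bounded by the per-record cost alone.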
Databricks was able to change the streaming approach because the Structured Streaming API was designed from the beginning to not have dependencies on the micro-batch approach. “The API never included micro-batch,” Xin says.