June 7, 2017

Spark’s New Deep Learning Tricks

Alex Woodie


Imagine being able to use your Apache Spark skills to build and execute deep learning workflows to analyze images or otherwise crunch vast reams of unstructured data. That’s the gist behind Deep Learning Pipelines, a new open source package unveiled yesterday by Databricks.

Deep Learning Pipelines, which was unveiled at the Spark Summit conference in San Francisco Tuesday, will essentially provide a way to extend the Spark MLlib library to popular deep learning frameworks like TensorFlow and Keras.

This will allow Spark users to leverage existing work they’ve done in MLlib, and to execute deep learning models directly in Spark’s existing machine learning library, says Reynold Xin, co-founder and chief architect at Databricks, the commercial outfit behind Apache Spark.

“It’s a library to integrate essentially all deep learning libraries with Spark to make deep learning substantially easier without having to actually learn about the specifics of deep learning,” Xin tells Datanami.

Deep Learning Pipelines will start out as its own open source project, separate from the Apache Spark project, Xin says. Over time, depending on how things go, it could become part of the main Apache Spark project. “It’s possible,” he says. “We haven’t actually thought a lot about it. We want to get it out there and work with users.”

In the meantime, Databricks will include the new deep learning library in its own Spark-based software as a service (SaaS) offering. Databricks’ version will leverage the concept of transfer learning to take existing deep learning models available in the open domain and modify them to make them more applicable to its customers’ specific domains, Xin says.

“There might be a generic model for doing image classification, but maybe one of our customers wants to detect what kind of car is in a picture,” he says. “We have this technique called transfer learning built into this library that, with just a few lines of code, allows users to apply an existing model, published by pretty much anybody on the Internet, and then retrain it on a much smaller amount of data in a much faster fashion, in just a few minutes, and then get a better model for their domain.”
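The idea behind transfer learning can be sketched in plain Python. This is a minimal illustration and not the Deep Learning Pipelines API: `pretrained_features` is a hypothetical stand-in for a frozen, published model, and only a small logistic-regression "head" is trained on a handful of labeled examples. The toy data (dark versus bright 16-pixel "images") and all names are assumptions made for the example.

```python
import math
import random


def pretrained_features(pixels):
    """Hypothetical stand-in for a frozen, pretrained network.

    A real featurizer would be a deep model with its final layer removed;
    here two simple statistics (mean and max) play that role.
    """
    return [sum(pixels) / len(pixels), max(pixels)]


def train_head(examples, labels, epochs=200, learning_rate=0.5):
    """Fit only a small logistic-regression head on top of the frozen features."""
    features = [pretrained_features(p) for p in examples]
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = 1.0 / (1.0 + math.exp(-z)) - y  # prediction minus label
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def predict(weights, bias, pixels):
    """Classify a new input using the frozen features plus the trained head."""
    x = pretrained_features(pixels)
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Small, domain-specific training set (toy data): label 0 = dark, label 1 = bright.
rng = random.Random(0)
examples, labels = [], []
for _ in range(10):
    examples.append([rng.uniform(0.0, 0.3) for _ in range(16)])
    labels.append(0)
    examples.append([rng.uniform(0.7, 1.0) for _ in range(16)])
    labels.append(1)

weights, bias = train_head(examples, labels)
print(predict(weights, bias, [0.9] * 16), predict(weights, bias, [0.1] * 16))
```

Because the feature extractor is never retrained, only three numbers (two weights and a bias) are learned here, which is why this style of retraining can get by with far less data and time than training a full model from scratch.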

Another cool feature that Databricks is adding with Deep Learning Pipelines is the capability to expose a trained deep learning model as SQL.

“With one line of code now the data scientist or data engineer who actually trains the model can make this model available as a SQL function,” Xin says. “So even a business analyst will be able to build, for example, predictions in their BI tools.”
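To illustrate the general pattern of exposing a model as a SQL function, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for Spark SQL. It is not the Deep Learning Pipelines API. The `predict_churn` function, its scoring rule, and the `customers` table are all hypothetical; the point is only that one registration call makes a Python scoring function callable from an ordinary SQL query.

```python
import sqlite3


def predict_churn(monthly_spend, support_tickets):
    """Hypothetical stand-in for a trained model's scoring function."""
    score = 0.1 * support_tickets - 0.01 * monthly_spend
    return 1 if score > 0 else 0


conn = sqlite3.connect(":memory:")

# The "one line": register the Python function under a SQL-visible name.
conn.create_function("predict_churn", 2, predict_churn)

conn.execute(
    "CREATE TABLE customers (name TEXT, monthly_spend REAL, support_tickets INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("alice", 80.0, 1), ("bob", 20.0, 5)],
)

# An analyst can now call the model from plain SQL.
rows = conn.execute(
    "SELECT name, predict_churn(monthly_spend, support_tickets) "
    "FROM customers ORDER BY name"
).fetchall()
print(rows)
```

Once the function is registered, anything that can issue SQL against the connection can call it, which is the property that would let a BI tool reach a model without its user writing any model code.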

Deep Learning Pipelines supports TensorFlow and Keras now, but will likely be bolstered to support other popular deep learning frameworks. MXNet is popular on Amazon, while Theano, Torch, and Caffe are also gaining more attention as deep learning techniques become more popular.

This isn’t Spark’s first foray into deep learning or GPU computing. But the folks at Databricks are bullish that the new Deep Learning Pipelines project could revolutionize deep learning for a more general audience.

“We do see that this library has the potential to do for deep learning what Spark did for big data, to make deep learning much more accessible to everybody,” Xin says. “Deep learning is at a similar stage right now to what MapReduce was for big data. You can actually get good results if you spend an enormous amount of time with great talent, and a lot of PhDs in deep learning and machine learning. But after spending time with a lot of customers, we realized that this is just too difficult to use. Training could take weeks. And it has a very steep learning curve, so we need to make something easier.”

Databricks also used its Spark Summit conference to make two other announcements: a new serverless analytics architecture for its cloud environment and the general availability of Structured Streaming.

While some customers like fiddling with settings and optimizing the configuration for peak performance, other customers just want something that’s easy to set up so they can quickly start querying data. That’s what Databricks’ Serverless offering is designed to enable.

According to Xin, the new Databricks Serverless offering will automatically configure the environment and automatically adapt the cluster to workloads.

“It auto-scales the nodes, it auto-scales the local storage attached to the nodes, and it automatically adapts when there’s a lot of users connecting to the cluster,” he says. “In many ways I think it will be better, taking away the knobs from the users, because we believe it provides a better experience for a vast number of users.”

The service, which provides isolation via Linux containers and runs on the new Databricks Runtime version 3.0 (based on Apache Spark version 2.2 under the hood), will allow users to submit queries through their data science notebooks or through JDBC or ODBC connections. It will initially support the Spark SQL and Python DataFrame APIs, but over time could be expanded to support other Spark engines, Xin adds.

Finally, Databricks also announced that it’s fully supporting the new Spark Structured Streaming API in its cloud environment. It also made some tweaks to its implementation of Structured Streaming to eliminate the micro-batch approach to workloads, thereby bolstering response time and shrinking latency.

“There’s been a lot of criticism on Spark being micro-batch based and having high latency,” Xin says. “So we did a bunch of experiments with new changes.”

The micro-batch approach worked for data analytics jobs that could tolerate latencies of 100 milliseconds or so, which was fine for some workloads. But some clients demanded latencies as low as 1 to 5 milliseconds, which forced them to look to other streaming frameworks.

One of the changes Databricks is playing with involves replacing the micro-batch design with more of a macro-batch, or long-running setup. “What we’re doing under the hood is to allow Spark to launch a job that’s long running,” Xin says. “Instead of a micro-batch, it’s actually running super long batches.”

This approach, which Databricks calls continuous processing and which is available in Databricks Runtime 3.0 (and, presumably, Apache Spark down the road), will be useful for lightweight processing of data with latency demands down to 1 millisecond, Xin says. “This is a new mode that allows customers to get 1ms latency,” he says.
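The latency difference between the two designs can be seen in a deliberately simplified model, sketched below in Python. It is not how Spark schedules work internally. It assumes that a micro-batch engine holds each record until the next batch boundary and then spends a fixed time processing it, while a long-running job handles each record as soon as it arrives. The arrival times, the 100 ms batch interval, and the 1 ms processing cost are illustrative values, and the function names are hypothetical.

```python
def micro_batch_latencies(arrivals_ms, batch_interval_ms, processing_ms):
    """Latency when each record waits for the next batch boundary."""
    latencies = []
    for t in arrivals_ms:
        # Round the arrival time up to the next multiple of the batch interval.
        next_boundary = -(-t // batch_interval_ms) * batch_interval_ms
        latencies.append(next_boundary - t + processing_ms)
    return latencies


def continuous_latencies(arrivals_ms, processing_ms):
    """Latency when a long-running job picks up each record immediately."""
    return [processing_ms for _ in arrivals_ms]


arrivals = [5, 30, 99, 100, 101]  # record arrival times in milliseconds
micro = micro_batch_latencies(arrivals, batch_interval_ms=100, processing_ms=1)
continuous = continuous_latencies(arrivals, processing_ms=1)

print("micro-batch:", micro)
print("continuous: ", continuous)
```

In this model the worst case for micro-batching is a record that arrives just after a boundary and waits almost a full interval, while the continuous job's latency is bounded by the per-record processing cost alone.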

Databricks was able to change the streaming approach because the Structured Streaming API was designed from the beginning to not have dependencies on the micro-batch approach. “The API never included micro-batch,” Xin says.
