November 19, 2018

Databricks Upgrades Spark Support, Adds ML Runtime

Databricks this week announced support for the latest version of Apache Spark, integrating version 2.4 of the cluster computing framework into the latest runtime of its enterprise analytics platform. The company also unveiled a new runtime feature aimed at simplifying deep learning.

The Spark 2.4 release, unveiled earlier this month, includes upgrades that improve the performance of distributed deep learning and machine learning frameworks running on Spark. Databricks noted that the upgraded version of its analytics platform running on Spark 2.4 includes improvements that address dependencies associated with deep learning tasks.

The Spark upgrades were consolidated in an effort called Project Hydrogen, which introduced a new scheduling mode called “barrier execution.” The mode allows developers to embed distributed deep learning training as an Apache Spark workload, San Francisco-based Databricks said.
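
The idea behind barrier execution can be illustrated with a minimal Python sketch that uses the standard library's threading.Barrier as a stand-in for Spark tasks. It is an analogy only, not Spark's API: the function name run_barrier_stage and the "setup" and "train" phases are hypothetical, chosen to show that all tasks in a stage are launched together and that none moves past the synchronization point until every task has reached it.

```python
import threading


def run_barrier_stage(num_tasks):
    """Launch all tasks together; each waits at a barrier between two phases.

    Hypothetical stand-in for a barrier stage: threads play the role of tasks.
    Returns the ordered list of (phase, rank) events.
    """
    barrier = threading.Barrier(num_tasks)
    log = []
    lock = threading.Lock()

    def task(rank):
        # Phase 1: per-task setup (for example, discovering peer workers).
        with lock:
            log.append(("setup", rank))
        # No task proceeds until every task has finished its setup.
        barrier.wait()
        # Phase 2: the coordinated step (for example, a training round).
        with lock:
            log.append(("train", rank))

    threads = [threading.Thread(target=task, args=(r,)) for r in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    for phase, rank in run_barrier_stage(4):
        print(phase, rank)
```

The order of ranks within each phase varies from run to run, but every "setup" event precedes every "train" event. In Spark 2.4 itself, the corresponding entry points are RDD.barrier() and BarrierTaskContext.barrier().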

“This is the largest change to Spark’s scheduler since the inception of the project,” said Reynold Xin, co-founder at Databricks and a Spark contributor. Xin added that the upgrades would help reduce the complexity of machine learning workloads.

The new runtime feature, dubbed HorovodRunner, is designed to simplify scaling distributed deep learning workloads from a single machine to large clusters. Previously, migrating from single-node workloads to distributed training on CPU or GPU clusters required full code rewrites, the company said. HorovodRunner reduces programming and training time from hours to minutes, Databricks claims.

Along with Horovod, the distributed training framework, Databricks said its platform provides native integrations with Keras, TensorFlow and other machine learning frameworks, as well as MLlib and GraphFrames machine learning algorithms.

Last week, Databricks announced a partnership with cloud data integrator Talend (NASDAQ: TLND) to combine Talend’s cloud service with Databricks’ analytics platform, enabling data engineers to use the cluster computing framework to process large data sets at scale.
