June 15, 2015

IBM, Databricks Join Forces to Advance Spark

IBM has jumped on the Apache Spark bandwagon, revealing it would throw its considerable weight behind the open source in-memory processing framework that has been gaining momentum over the last year.

Separately, Databricks, the company formed by the creators of the analytics engine, released Apache Spark 1.4, which includes the SparkR API, Spark’s first new language API since 2012.

IBM said Monday (June 15) it would integrate Spark software into the “core” of its analytics and commerce platforms while offering Apache Spark as a service on its Bluemix cloud application development platform.

The commitment to Apache Spark also gives IBM another vehicle besides its Watson cognitive computing platform for advancing its machine learning technology.

Along with advancing Spark’s machine learning capabilities through collaboration with Databricks, IBM also said it would open a Spark Technology Center in San Francisco while committing more than 3,500 developers and researchers to focus on Spark-related projects.

Backing for Apache Spark also includes the donation of IBM’s SystemML machine learning technology to the Spark open source project. IBM also said it would leverage current partnerships to train as many as 1 million data scientists and engineers on Apache Spark.

It also plans to host Spark applications on its Power and Z Systems infrastructure.

The partners said they plan to introduce new domain-specific algorithms to the Spark ecosystem and add new machine learning primitives to the Apache Spark project.

IBM’s full-throated endorsement of Apache Spark reflects the growing momentum of what has emerged as one of Hadoop’s most popular open-source projects. Last fall, Hortonworks outlined a similar investment in Spark aimed at moving the platform to the enterprise.

In a statement, IBM said it is “fully committed to Spark as a foundational technology platform for accelerating innovation and driving analytics across every business in a fundamental way.”

Developed by AMPLab (“Algorithms, Machines, People”) at the University of California at Berkeley in 2009, Spark was released by startup Databricks in 2013. It is described as a general-purpose data processing engine packaged to handle SQL queries and advanced analytics like machine learning. The cluster-computing framework with in-memory processing quickly gained traction in the analytics market, with hyper-scale deployments by Internet giants like Yahoo and Baidu.

Spark’s creators said their intent was to forge the next generation of analytics tools to derive insights from heterogeneous data by combining machine learning, hyper-scale computing and “human computation.”

IBM said its data scientists would begin working over the next few months with the Apache Spark open-source community to advance machine-learning capabilities. The initial goal is development of “smart business apps,” the company said.

As part of its plan to integrate Spark into its analytics and commerce platforms, IBM said it would begin offering a beta version of its “Spark-as-a-Service” on its Bluemix cloud platform.

In a blog post, Fred Reiss of IBM’s Spark Technology Center said several hundred data scientists, developers and designers would begin working at the San Francisco center over the next several months. The center was formed to speed IBM’s adoption of new Spark technologies. For example, it integrated an earlier version of Spark (version 1.3.1) into IBM’s Open Platform for Apache Hadoop.

IBM said developers have been steadily reducing Spark’s backlog of bug fixes while working to improve its performance. Reiss said the next step would be contributing new features and components to Apache Spark, with special emphasis on machine learning as the company shifts its technology to the open-source community.

It also expects to begin demonstrating business applications based on Spark in the coming weeks.

The company said more than 300 IBM engineers are already working on Hadoop and Spark open source development efforts.

Meanwhile, Databricks said its 1.4 release of Spark is available for download. The release adds window functions to Spark SQL and its DataFrame library. Databricks said window functions are increasingly popular among data analysts, allowing them to compute statistics over window ranges.
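
For readers unfamiliar with the idea, a window function computes a statistic for each row over a range of neighboring rows, rather than collapsing a group into a single result. The plain-Python sketch below illustrates the concept only; it does not use Spark or its API, and the function name `moving_average`, the row layout and the frame parameters are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical row layout: (partition key, ordering key, value).
Row = Tuple[str, int, float]


def moving_average(
    rows: List[Row], preceding: int = 1, following: int = 1
) -> List[Tuple[str, int, float, float]]:
    """Return every input row with a moving average appended.

    The average is taken per partition over a frame spanning
    `preceding` rows before and `following` rows after the current row.
    """
    # Split rows into partitions by key.
    partitions: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)

    result = []
    for key in sorted(partitions):
        # Order rows inside the partition by the ordering key.
        ordered = sorted(partitions[key], key=lambda r: r[1])
        for i, (k, order, value) in enumerate(ordered):
            # Clamp the frame to the partition boundaries.
            lo = max(0, i - preceding)
            hi = min(len(ordered), i + following + 1)
            frame = [r[2] for r in ordered[lo:hi]]
            result.append((k, order, value, sum(frame) / len(frame)))
    return result


sample = [("a", 1, 10.0), ("a", 2, 20.0), ("a", 3, 30.0), ("b", 1, 5.0)]
averaged = moving_average(sample)
```

Unlike a grouped aggregate, the output keeps one row per input row, which is what lets analysts place running totals, rankings or moving averages alongside the original data.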
