March 18, 2014

Databricks Moves to Standardize Apache Spark

Alex Woodie

Databricks, the company behind open source Apache Spark, today rolled out a certification program that creates a Spark standard that big data analytic application developers can write to, and that customers can rely on. It’s a smart move by Databricks, which is looking to avoid the forking that has clouded Hadoop’s march into the enterprise.

Spark is catching fire as the hottest alternative to MapReduce in the big data world today. The in-memory processing engine is said to run 10x to 100x faster than MapReduce, comes with a set of machine learning and graph libraries, and offers streaming and SQL capabilities to boot. The software doesn’t have to run on Hadoop, but since many customers already store large amounts of data in HDFS, most are running it there.

The promises of faster processing and easier Hadoop programming (it supports Java, Python, and Scala) are luring developers to Spark. Matei Zaharia, who created Spark while a grad student at UC Berkeley’s AMPLab, founded Databricks with his former Berkeley adviser, computer science professor Ion Stoica, to build a business around Spark.

The Databricks business model is still unfolding. It appears unlikely at this point that the company will sell an enterprise version of Spark. Zaharia says the company wants to keep Spark “100 percent open source.” Instead, it appears the company will focus on developing open source software at the Apache Software Foundation, providing training and networking, and helping to foster the adoption of the technology. The new certification program is part of that outreach effort.

Stoica tells Datanami that standards are necessary to have a strong application ecosystem around Spark. “The role of the certification is to make sure that once an application is certified to open Apache Spark, that it will run also on [other] Apache Spark distributions, which we’re also going to certify,” he says. “So that’s why we are very excited about this. We are really convinced that this is the right thing to do. We have a chance to create a really powerful application ecosystem.”

At launch time, Databricks had certified three application vendors in its new program: Adatao, Alpine Data Labs, and Tresata. Bruno Aziza, the head of marketing for analytical application provider Alpine Data Labs, says the company moved to adopt Spark for the speed boost it provides over MapReduce on Hadoop.

“We asked, ‘Is there a way to run the model faster and run more data through it?’ That translates into, ‘You’re going to have to use Spark,’” Aziza says. “It’s the best way to give them value, agility, and to get answers faster.”

Databricks foresees Spark occupying the space currently occupied by MapReduce, Hive, Storm, and other Hadoop data processing engines. Instead of using a combination of these technologies and worrying about stitching them together, developers will simply write to the Spark standard, and they won’t need to worry about whether it will run on Hadoop.

The certification program will ensure that application developers who use Spark can rest easy knowing that their software will run on all of the Hadoop platforms, says Arsalan Tavakoli-Shiraji, a Berkeley grad and ex-McKinsey partner who started working on business development at Databricks this January.

“History tells us when there’s a lot of different vendors, there’s a lot of forking and fragmentation,” Tavakoli-Shiraji says. “Look in the Hadoop space. You have application vendors saying, ‘Here’s my version that works on Cloudera. Here’s my version that works on Hadoop distribution X, Y, and Z.’ Our goal is to say we want to help you certify against one place, and so you know that your customers will have multiple commercial support options.”

As Cloudera, Hortonworks, and MapR compete for market share, there is bickering about who’s more committed to open source and who has strayed from the pure version of Apache Hadoop. On the one hand, you have Hortonworks, which is adamant about its adherence to open source and the fact that it pays the salaries of many of the developers who commit code to the Apache Hadoop project. On the other, you have MapR, which has made extensive modifications to Hadoop, and which was also voted by Forrester to have the best Hadoop distro. In the middle you have Cloudera, which has had the most commercial success thus far, which employs Hadoop creator Doug Cutting, and which just nabbed $160 million in additional venture funding in a bid to extend its lead and build out its grand Enterprise Data Hub vision of Hadoop.

So far, Cloudera is the only Hadoop distributor with a formal partnership with Databricks. But all of them are expected to jump on board. As Spark ramps up in the coming months and years, Databricks doesn’t want Hadoop politics to cause any friction or to get in the way of Spark’s growth.

“I don’t want to hear that if I want a particular Spark application that I’m obligated to get Spark support from a particular vendor,” Databricks’ Tavakoli-Shiraji says. “I want to know that regardless of where I get Spark from, through other relationships, then this catalog of applications can work on it, irrespective of what I’ve chosen.”

Spark is moving very quickly right now, and the business model hasn’t been entirely hashed out (or at least made public). Databricks, which received $14 million in venture funding last fall, looks to be a central part of that. But don’t expect the company to become the Cloudera or Hortonworks of Spark, Stoica says.

“Instead our position is everyone who wants to include the distribution of Apache open Spark, we are going to help them to be successful,” he says. “Cloudera is the first example. We’re going to help them with some level support, like level 3 support…We’re going to have some support organization, but this is mostly for helping our partners to have success around Apache Spark. We’re looking more in trying to build value around the open source Spark, rather than just support it.”

