June 30, 2014

Databricks Takes Apache Spark to the Cloud, Nabs $33M

Databricks today announced a new big data platform called Databricks Cloud that will allow users to leverage Apache Spark technology to build the end-to-end pipelines that underlie advanced analytic applications, such as recommendation systems. The company, which is hosting Spark Summit 2014 in San Francisco this week, also announced $33 million in Series B funding.

Databricks Cloud is an Apache Spark implementation that runs on Amazon Web Services (AWS) and uses Amazon S3 for storage. The idea is to provide developers with an easier way to build end-to-end analytics applications. Instead of leaving users to stitch together separate tools, Databricks is looking to automate the hard bits (such as cluster management), streamline the fun data science parts (such as data discovery and machine learning), and combine all that into an environment where users can actually build and run money-making (or money-saving) big data analytics systems in production.

The new offering builds on the Spark components, which include Spark SQL, Spark Streaming, the MLlib machine learning library, and the GraphX graph processing library, all of which are accessible via a single API that Scala, Python, and Java users can call. On top of this base, Databricks builds two additional components, called Notebooks and Dashboards.

Notebooks provides a graphical interface for performing data discovery and exploration on data sets, while Dashboards allows users to create and host dashboards based on the data discovery and the Spark operators. The data and queries that underpin these dashboards can be regularly updated and refreshed. There’s also a job launcher that allows users to execute arbitrary Spark jobs.

Databricks CEO Ion Stoica says the new cloud can be used to build the entire data pipelines that underlie advanced analytic systems, such as the recommendation systems that big Web outfits like Google, Facebook, and Yahoo have built, often on top of Hadoop. But there’s no Hadoop or HDFS in this cloud; the company brought its own parallelism to the Amazon show.

“Say your company starts a new data initiative. Based on your organization, you’re tasked with building a Hadoop cluster,” Stoica tells Datanami. “It can take six to nine months to set up a cluster on premise. Once you are done, extracting value out of the data is a long and continuous struggle.”

Once you’ve landed the data you want to analyze, you first must clean it and transform it. “Then if you want to analyze it, you need to do some ETL on top of it,” Stoica says. “Then you are going to start exploring the data and for that you use Hive to do it in HDFS. If you want interactive query you’re going to use something like Impala or Drill. Or you may move the data into SQL or NoSQL database and then use traditional database tools to query the data.”

But just data exploration and getting a bunch of dashboards is not enough, he says. “Soon you want to get deeper insight and ask business questions, like ‘Why is my engagement dropping?’ and more generally how to improve revenue and how to reduce cost,” he says. “To answer these questions, you start to employ sophisticated machine learning and graph algorithms and use systems like Mahout for machine learning and Giraph for graph processing and maybe R on a small set of data.”

“Finally,” he continues, “once you get the insights, you want to productize the insights by building data products such as recommendation systems. It’s quite complicated, very hard to put together, because you need to integrate the disparate sets of tools. Once you do that, the data is still hard to navigate and it’s even harder to develop and deploy applications.”

Databricks wants to eliminate the need for these point products with its new cloud offering

Databricks Cloud provides a single, unified platform that stitches together the various tools and allows developers to build big data analytic applications without worrying about getting the necessary hardware resources behind them. “Our vision is to free users to focus on turning data into value by dramatically simplifying data analysis and processes, to eliminate need to set up cluster, to make it easy to instantiate,” Stoica says.

There are plenty of other hosted big data analytic offerings on the market, and they’re starting to get good traction. But Databricks’ status as the company behind Apache Spark gives it serious software credentials. It’s the combination of those Spark credentials, the ease of use that hosting brings, and the other “glue” in Databricks Cloud that the company hopes will add up to more than the sum of its parts.

“When we talk to users, they say it doesn’t work if you only solve one or two of the challenges,” says Arsalan Tavakoli-Shiraji, who handles business development for Databricks. “Clusters are hard to manage. So the point we focused on is to give people complete out-of-the-box answers” to all of the challenges.

Spark has gained a lot of momentum on Hadoop recently as a potential replacement for MapReduce, but Hadoop is not required. Spark was created at UC Berkeley’s AMPLab by Stoica and his grad student Matei Zaharia, now Databricks’ CTO, with no requirement to run on Hadoop. Still, customers who develop Spark apps on Databricks Cloud can move them to on-premises Hadoop clusters if they want, Stoica says. “We will allow that,” he says.

Databricks Cloud has been in a private beta since the beginning of the year, and should be generally available later this year. Customers will be able to start with a small investment of $200 per month and pay more only when their workloads increase, Tavakoli-Shiraji says.

Meanwhile, the company announced the close of $33 million in Series B funding to go along with the $14 million it raised last fall. The latest round was led by New Enterprise Associates (NEA), with follow-on investment from original investor Andreessen Horowitz.

Related Items:

Databricks Moves to Standardize Apache Spark

Apache Spark: 3 Real-World Use Cases

Spark Graduates Apache Incubator
