Pentaho Eyes Spark to Overcome MapReduce Limitations
Pentaho today announced it’s supporting Apache Spark with its suite of data analytic tools. While supporting Spark gives Pentaho performance advantages over MapReduce when executing data transformations and running queries within Hadoop, the software company is approaching the in-memory framework for other use cases with a wary eye.
Apache Spark will be a supported engine in the next release of Pentaho's analytical platform, which is due in mid to late June. Pentaho is eyeing two main use cases with this release. The first is letting existing Spark users orchestrate their Spark jobs alongside other data assets via the Pentaho suite. The second is supporting the Spark SQL engine to execute queries created with Pentaho's front-end tools.
It was a fairly simple matter to support Spark within Pentaho, accomplished by creating a lightweight wrapper for the Pentaho execution engine that speaks the Spark API. This will benefit not only existing Spark customers who are looking for ways to manage data analytic pipelines (including integrating and blending big and varied data sources, a core Pentaho focus), but also existing Pentaho customers who are looking for faster, real-time alternatives to executing transformations and running queries in Hadoop with MapReduce, says Pentaho CTO James Dixon.
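The wrapper Dixon describes is essentially an adapter: a thin layer that replays the engine's existing transformation steps through a Spark-style API. The sketch below is purely illustrative; every class and method name here is hypothetical, and `FakeRDD` stands in for a real Spark RDD rather than using Spark itself.

```python
# Illustrative sketch of the "lightweight wrapper" idea: an adapter that
# lets an existing transformation engine run on a Spark-like API.
# All names are hypothetical, not Pentaho's or Spark's actual APIs.

class TransformStep:
    """One step in an existing (non-Spark) transformation pipeline."""
    def __init__(self, fn):
        self.fn = fn

class FakeRDD:
    """Stand-in for a Spark RDD: a collection supporting map()."""
    def __init__(self, data):
        self.data = list(data)
    def map(self, fn):
        return FakeRDD(fn(x) for x in self.data)
    def collect(self):
        return self.data

class SparkEngineWrapper:
    """Adapter: replays the engine's step list as chained map() calls."""
    def __init__(self, steps):
        self.steps = steps
    def run(self, rdd):
        for step in self.steps:
            rdd = rdd.map(step.fn)
        return rdd.collect()

pipeline = [TransformStep(lambda r: r * 2), TransformStep(lambda r: r + 1)]
result = SparkEngineWrapper(pipeline).run(FakeRDD([1, 2, 3]))
print(result)  # [3, 5, 7]
```

Because the wrapper only translates calls, the engine's step definitions stay untouched, which is what makes this approach "lightweight."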
“From what I’ve seen, [Spark] is an extremely efficient engine,” Dixon tells Datanami. “It runs our transformations as fast as is machinely possible. It adds little to no overhead on top of the code that you’re providing, in complete contrast to MapReduce, where there was tremendous latency put into place. So the technology itself, for its original case, is very sound.”
Pentaho Labs has been playing around with Spark for the better part of two years, and the company, which is in the process of being gobbled up by IT giant Hitachi Data Systems, now feels that Spark is ready to be used by its customers for these two use cases. As for the other two use cases that Pentaho has its eyes on–including push-down processing from Pentaho into Spark and support for Spark Streaming–the jury is still out.
“For the new use cases, we just have to prove whether or not it’s ready, because it’s evolving rapidly,” Dixon says. “In terms of the hype, some of the things they’re claiming it can do have not been proven. I’m skeptical about the hype, not the technology.”
One of Dixon's concerns about Spark has to do with concurrency issues, for both users and jobs. Pentaho features a multi-threaded engine that's designed to allow multiple users to simultaneously hit all sorts of backend systems–from big data stores like Hadoop to traditional data warehouse systems like Teradata and Hewlett-Packard's Vertica–and it's not clear how Spark will handle that. "We're not 100 percent certain how Spark is going to respond for concurrency with a multi-threaded engine in it," Dixon says.
It comes down to architecture. Spark was designed as a single-user system to allow a data scientist to rapidly iterate her machine learning models over a given piece of data. Some of the early Spark adopters that Pentaho spoke with–mostly firms located along the US 101 corridor on the western edge of San Francisco Bay–ran into trouble when multiple people tried to access a Spark cluster.
“If one data scientist who worked on a problem forgot to release the memory and then went to lunch, then nobody else could use the cluster…because the data is locked into memory for that one user,” Dixon says. “You take something like that and now you’re trying to deploy it as a multi-user concurrent engine to answer multiple problems at a time–it was never designed for that.”
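The contention Dixon describes can be reduced to a toy model: cached datasets pin cluster memory, so a user who caches data and walks away starves everyone else until that memory is released. The code below is a deliberately simplified illustration, not how Spark actually manages memory; the class and the gigabyte figures are invented for the example.

```python
# Toy model (not Spark) of the problem: cached datasets pin memory,
# so one user who caches and "goes to lunch" can block other users.

class ToyCluster:
    def __init__(self, memory_gb):
        self.free_gb = memory_gb
        self.cached = {}   # user -> GB pinned in cache

    def cache_dataset(self, user, size_gb):
        if size_gb > self.free_gb:
            raise MemoryError(f"{user}: only {self.free_gb} GB free")
        self.free_gb -= size_gb
        self.cached[user] = self.cached.get(user, 0) + size_gb

    def release(self, user):
        self.free_gb += self.cached.pop(user, 0)

cluster = ToyCluster(memory_gb=100)
cluster.cache_dataset("alice", 90)      # alice caches, then goes to lunch
try:
    cluster.cache_dataset("bob", 40)    # bob is blocked
except MemoryError as err:
    print(err)                          # bob: only 10 GB free
cluster.release("alice")                # memory frees only on explicit release
cluster.cache_dataset("bob", 40)        # now bob can proceed
```

The single shared memory pool with no arbitration between users is the crux: nothing in this model reclaims alice's cache on her behalf, which mirrors the "locked into memory for that one user" behavior Dixon describes.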
Of course, Hadoop started out with similar architectural objections. Back when it was just HDFS and MapReduce, Hadoop was great at doing the job for which it was designed: Indexing the Internet. Over time, Hadoop’s architecture was expanded to do other things, such as processing SQL queries like a relational data warehouse (Hive) and storing unstructured data in a NoSQL manner (HBase), Dixon says.
"That's why we're taking the approach here of carefully trying each of these use cases," the CTO says. "There's a lot of hype and a lot of hope around Spark because it's significantly faster than MapReduce, and it's significantly more flexible. It would be hard for anything to be worse than MapReduce in terms of trying to work with it. The fact that Spark has none of the latency of MapReduce and it's significantly more flexible in what you can do with it and how it works–it definitely has a lot of promise. That's why there's so much interest in Spark."
But Dixon is concerned because most of the use cases that people want to do with Spark are things that it wasn’t designed to do. “That’s where the concurrency issue comes from,” he says. “It was designed as a single-user data science tool. In a cluster, where you’ve got maybe 100 business users carving and querying it as a database or using it as the backend for a data mart or a data warehouse—that’s not something it was originally designed for, nor was Spark running multiple transformations in that environment.
"Not that it can't do those things or it won't be able to do those things soon," he says. "We need to prove that it can do this, and that's really what we feel our community of users and partners and customers is looking for us to do–to look at these different use cases and work out which ones are real, which ones are viable, which ones are safe to use today."