Hortonworks Hatches a Roadmap to Improve Apache Spark
Hortonworks today issued a broad and detailed roadmap outlining the investment it would like to see made in Apache Spark, the in-memory processing framework that has become one of Hadoop’s most popular subprojects. The plan focuses on improving how Spark runs with YARN, enabling monitoring and management of Spark, and ensuring that Spark plays nicely with Hive and other Hadoop engines.
In the blog piece, titled “An investment in Apache Spark for the Enterprise,” Hortonworks director of product management Vinay Shukla and Tim Hall, vice president of product management, outline the enhancements and improvements that Hortonworks is suggesting to the Apache Software Foundation, which governs the Spark project. Hortonworks engineers are already participants in the Apache Spark project, and the company is looking to help the project move forward in a way that benefits its customers and the open source community as a whole.
“We’ve seen this unbridled excitement around Spark really over the past eight months,” says Hortonworks director of product marketing Jim Walker. “It’s great, everybody loves it: machine learning, iterative workloads, data science, awesome. That said, it needs to be reliable… I can’t use it to only have it fall over.
“Secondly,” Walker continues, “it needs to run with the other workloads that are already running in Hadoop. The promise of running multiple engines on a single set of data: people have absolutely bought into that, and they want to make sure that Spark runs as a good citizen, as a tenant within a cluster along with Hive and HBase and everyone else.”
Finally, to run in production, Spark needs to meet the requirements of governance, security, and operations. “It has to be secure. I have to be able to provision, manage, and monitor that engine along with my other engines, and it has to be through a single console,” Walker tells Datanami.
The work around Spark will involve not just Apache Spark, but numerous other projects in the Hadoop ecosystem, from Apache YARN and MapReduce to Hive and HBase. Hortonworks has already begun making changes in some of these other Hadoop components to address the needs of running Spark in the enterprise.
The move garnered praise from Databricks, the company behind open source Spark. “We’re thrilled to see Hortonworks’ full-throated support for Spark; and not just as another component, but as a major piece of the solution for big data processing for enterprises,” Databricks business development executive Arsalan Tavakoli-Shiraji tells Datanami via email. “From the level of commitment they’re making to Spark (e.g., in terms of resources) you can see that this is something that is extremely important to their customers.”
In June, Hortonworks shipped a tech preview of its support for Spark in conjunction with the delivery of the version 2.1 release of its Hortonworks Data Platform (HDP). The plan is for Spark to be fully supported with the next major release, tentatively dubbed version 2.2, sometime this fall (an announcement at the Strata + Hadoop World conference next month is a good bet).
But Hortonworks customers won’t have to wait that long. This week the company will ship a second tech preview of Spark in HDP 2.1.3 that delivers some of the enhancements already made as part of Hortonworks’ phase one plan for delivering enterprise Spark.
Much of that work revolves around integrating with Hive, says Hortonworks co-founder and architect Arun Murthy. “A lot of people have data in Hive data warehouse. They want to come in and access the data, pull it into memory, do some modeling, analytics in Spark. They want to make it a first class citizen,” Murthy tells Datanami.
“But so far it [Spark] only works with Hive version 0.12,” he says. “We now have versions working against Hive version 0.13… and when the Apache community releases Hive version 0.14, we’ll do the same. It’s not sexy work, but something you have to do as far as the broad enterprise enablement. All this will be delivered in our preview going out this week, so we’re very excited to have this out.”
The new tech preview will also deliver support for Hive’s Optimized Row Columnar (ORC) data format. “One thing we’re doing is making sure we can give Spark applications the same level of performance when it comes to accessing data,” Murthy says. “For example, like any database, ORC supports what you call predicate pushdown.” Spark does not yet have that capability, but Hortonworks’ contributions to the project will add it. “We’re doing that so we can get efficient access to data,” Murthy says.
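For readers unfamiliar with the term, predicate pushdown means handing a query's filter to the storage layer so that chunks of data that cannot contain a match are never read. The Python sketch below illustrates the general idea only. It is not ORC's file format or Spark's API; the `Stripe` class and `scan_greater_than` function are hypothetical names invented for this illustration, and the data is made up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stripe:
    """A chunk of column values (hypothetical stand-in for a storage block)."""
    rows: List[int]

    @property
    def max_value(self) -> int:
        # A real columnar file would store this statistic in its metadata
        # instead of recomputing it; it is derived here to keep the sketch short.
        return max(self.rows)


def scan_greater_than(stripes: List[Stripe], threshold: int) -> Tuple[List[int], int]:
    """Return (matching values, number of stripes actually read)."""
    matches: List[int] = []
    stripes_read = 0
    for stripe in stripes:
        # Pushdown step: if the largest value in the stripe fails the filter,
        # no row in it can match, so the stripe is skipped without being read.
        if stripe.max_value <= threshold:
            continue
        stripes_read += 1
        matches.extend(value for value in stripe.rows if value > threshold)
    return matches, stripes_read


# Made-up data: three stripes of three values each.
stripes = [Stripe([1, 5, 9]), Stripe([12, 18, 20]), Stripe([25, 31, 40])]
matches, stripes_read = scan_greater_than(stripes, 19)
print(matches, stripes_read)
```

In this toy run the first stripe is skipped entirely because its largest value is below the filter threshold, so only two of the three stripes are read.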
The work around Hive integration and support for ORC is done and available now. Other items on Hortonworks’ short-term Spark to-do list include bolstering its integration with Ambari, the open source Apache project that provides provisioning, managing, and monitoring for Hadoop; and improving how it works with Hadoop’s security apparatus, specifically in the area of Active Directory and LDAP, the enterprise standards for user authentication. The company also wants to integrate Spark with Apache Argus, the open source security project created after Hortonworks acquired XA Secure and contributed all the software to the Apache Software Foundation.
Hortonworks also has longer term plans for bolstering Spark. One of the key areas in this “phase two” effort (expected to be delivered in the beginning of 2015) will be improving how Spark runs on YARN, specifically from a reliability and a scalability perspective. One of the consistent complaints that users have about Spark is that it currently does not scale adequately for massive Hadoop deployments. The possibility that Spark may actually give you erroneous results is limiting its uptake in the biggest Hadoop environments, including the 32,000-node Hadoop cluster at Yahoo, where Spark currently runs on a 1,000-node cluster.
One of the ways Hortonworks aims to improve Spark’s Hadoop scalability and reliability features is by more closely integrating it with YARN. The common Spark deployment model on YARN currently is slightly different from how it’s done in MapReduce, Hive, or Pig, Murthy says. “So what happens is Spark comes up, grabs a bunch of resources and keeps it forever, and doesn’t let them go until the entire application finishes,” he says. “This is potentially fine if you’re doing intensive in-memory applications. But you certainly don’t want to use what I call the service-like mode, deploying a Spark service to every user.”
Instead, Hortonworks envisions Spark becoming more like MapReduce in how it consumes cluster resources. Rather than holding onto CPU or memory until the entire job is over, Spark will acquire resources according to the needs of each subtask it performs. “So now it looks like an application model, not a service model,” Murthy says. “One of the things we did is use the YARN shuffle service in YARN, which allows us to transfer intermediate data between mappers and reducers or applications… We’ve been using a bunch of investments we made in Tez, for example, to actually do the same up and down application model. We’re really excited about this.”
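The difference between the two models Murthy describes can be illustrated with a toy calculation. The Python sketch below compares how many container-seconds a job occupies if it holds its peak allocation for the whole run (the service-like model) versus holding only what each stage needs (the application model). The stage sizes and durations are invented for illustration, the function names are hypothetical, and the sketch ignores container startup and scheduling delays; it is not how YARN or Spark account for resources.

```python
from typing import List, Tuple

# Each stage is (containers needed, duration in seconds). Values are made up.
Stage = Tuple[int, int]


def service_model_cost(stages: List[Stage]) -> int:
    """Container-seconds when the peak allocation is held for the entire run."""
    peak = max(containers for containers, _ in stages)
    total_time = sum(duration for _, duration in stages)
    return peak * total_time


def application_model_cost(stages: List[Stage]) -> int:
    """Container-seconds when each stage holds only the containers it needs."""
    return sum(containers * duration for containers, duration in stages)


# A wide first stage, a long narrow middle stage, and a medium final stage.
stages = [(100, 60), (10, 300), (50, 40)]
service = service_model_cost(stages)
application = application_model_cost(stages)
print(service, application)
```

With these made-up numbers the service-like model occupies 40,000 container-seconds and the application model 11,000, with the gap coming almost entirely from the long, narrow middle stage. That idle capacity is what other tenants on a shared cluster could otherwise use.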
The phase two plans also involve the creation of an in-memory tier in HDFS to share resilient distributed datasets (RDDs) across independent Spark apps. “We definitely want to leverage the in-memory tier in Spark, so if you cache an in memory data set in Spark…you’ll be able to use it across multiple applications in the context of Hadoop,” Murthy says.
Hortonworks developers have been participating directly in the Apache Spark community for a few weeks now, and the company is gearing up for a long-term commitment to improve Spark’s fortunes on Hadoop. The company is also in the process of deepening its partnership with Databricks.
“The Spark community has been very welcoming. Great feedback,” Murthy says. “Our aim is to make sure that whatever we do, we invest in the core of the platform and we’re expanding that definition of the core to include Spark. We just want to make sure that everybody gets the best of the core of Apache software, whether it’s Spark or YARN or HDFS and MapReduce.”