Meet Your Friendly Neighborhood Spark Sherpa
Apache Spark is the most popular big data project at the moment, with thousands of contributors cranking out code on a weekly basis. Keeping up with Spark releases is hard, and it’s why Hadoop distributor Hortonworks views itself as a Sherpa that guides customers on how best to use the explosive big data tech.
The amount of activity on Apache Spark is extraordinarily high, with hundreds of JIRA issues being addressed every week, and thousands more in the queue. The Apache project unleashed Spark version 1.6 two weeks ago, and already some Spark backers are working on Spark 2.0.0.
Don’t expect to see that version of Spark in the Hortonworks Data Platform (HDP) any time soon. Today the Hadoop distributor announced the addition of two Spark 1.5.2 components—Spark Streaming and Spark SQL—with a maintenance release of HDP. It’s the first time Hortonworks has officially supported any Spark components besides “core” Spark and MLlib, its machine learning library.
This was the fourth time that Hortonworks (NASDAQ: HDP) added new Spark features to HDP during 2015, which reflects both the remarkable demand for Spark functionality among HDP users and the relentless pace of development in the open source Apache project.
“The Spark community continues to move extraordinarily fast,” says Tim Hall, the vice president of product management at Hortonworks. “We’re trying to keep pace [and to] test, certify and provide the latest innovation for our customers so they can extract the maximum value of all the capabilities within Spark and run it at scale in the enterprise.”
Hortonworks’ support for Spark is driving sales of HDP, according to Hall, who says customers are moving their Spark initiatives forward quickly. “There were a lot of customers at the beginning of [the] year that were in tire kicking and exploratory mode [with Spark], and by the end of the year we have many more that are production deployed,” he says.
The addition of Spark Streaming and Spark SQL to HDP undoubtedly will be met with praise from Hortonworks customers looking to get real value from Spark. But the move may also spur some additional questions, such as “What took so long?”
“These are two of the most asked-for capabilities within Spark,” Hall says. “We’ve been holding off in terms of providing the GA support for them until they were ready. And the challenge for us is the [Spark] community is moving super fast. We want to make sure that our customers are going to be successful with them and we’ve been working within the community to make sure that will occur.”
Hortonworks is in a tough spot. The company is obliged to give customers the Spark features they crave, but it must also protect them from the potentially harmful effects of using technology that isn’t ready for prime time. You can put two other Spark sub-projects, the GraphX graph processing library and SparkR, in that “use at your own risk” category.
Hall takes a philosophical view of the situation. “Hortonworks is the Spark Sherpa,” he tells Datanami. “They’re looking to us to be Sherpas on their Spark journey and provide them with the appropriate guidance on readiness.”
While Spark is a great project, there’s a lot of “code churn” going on, and Hortonworks will do its best to pick only the Spark parts that are fully baked, Hall says. “[We tell customers] of course you can try to use SparkR or GraphX or these other things today,” he says. “But we’ve done some testing and we know there are sharp edges, and we want to give you the heads up on that before you get into trouble.”
The Santa Clara, California, company also announced that it’s supporting the ORC (Optimized Row Columnar) file format with Spark SQL on HDP. Spark SQL was originally developed with an affinity for the Parquet file format, so supporting ORC will enable better compatibility with Apache Hive. It also unveiled the Hortonworks Community Connection, a new portal designed to help HDP customers collaborate and ask questions of one another.