Three Things Apache Spark Needs to Out-Hadoop Hadoop
It’s only September, but it’s clear that 2014 will go down as the Year of Apache Spark. While the open source processing framework has gathered an enormous amount of momentum within the Hadoop ecosystem, there are three areas the Spark community should focus on if it’s going to shine brighter in 2015.
Apache Spark stormed the big data scene early in the year, becoming the Hot New Thing in an industry that generates Hot New Things at increasingly breakneck speeds. It would not be inaccurate to say that Spark stole the spotlight a bit from Hadoop, which previously held Hot New Thing status. But the Hadoop distributors didn’t seem to mind too much, as they jumped on the Spark bandwagon and sang its praises anyway.
Databricks, the startup that’s behind Spark, says developers are flocking to Spark because it provides an easier and more flexible way to write big data applications than traditional Hadoop (i.e., first-generation MapReduce). And while Spark is in no way tied to Hadoop–Spark can run on NoSQL or Amazon’s S3 file system too–Spark has become Hadoop’s fastest sub-project, if you will, and the brightest hope for delivering on the big data promises that the rise of Hadoop put into our collective imagination.
While Spark is sizzling at the moment, the world is a fickle place, and the spark could flame out when the next shiny new object grabs developers’ attention (look, Sqrrl!). The technology appears to have a remarkably solid base given its young age, but that’s not stopping Databricks and the larger Apache community from working hard to bolster Spark’s capabilities and build a good track record.
Here are the three things Spark needs to keep its momentum going in 2015 and beyond:
1. Prove High-End Scalability
If Spark is going to become the go-to technology in production Hadoop clusters, then it’s going to need bullet-proof scalability. Developers rely on MapReduce to process petabytes of data because it’s proven itself to be reliable at scale over the past 10 years. You can’t say the same thing about Spark, in part because it’s so young.
One of the companies that’s working to prove Spark’s scalability is Web giant Yahoo, where Hadoop originated in the mid-2000s. The company is using Spark on a limited basis, primarily for a subset of use cases related to iterative, memory-intensive, machine learning algorithms, says Yahoo vice president of engineering Peter Cnudde.
“We’re looking at Spark as a way to replace some of those [older] implementations and see if they could be a better way to implement it,” he tells Datanami. “I think Spark’s strength is around iterative algorithms, and it’s in that context that we’re evaluating it.”
Yahoo will not use Spark as a wholesale replacement for MapReduce, which remains the workhorse powering the majority of the jobs running on Yahoo’s 32,000-node Hadoop cluster. “We still write new MapReduce jobs all the time” and run them on YARN as MapReduce 2 (MR2) jobs, Cnudde says.
Up to this point, Spark has not proven to Yahoo that it can scale. “At the very large scale, it doesn’t work yet. It has challenges at the larger scale,” Cnudde says. “So for some cases, it doesn’t work for us. But for others…where the data set is smaller and it’s a bit more iterative, it has more promise.”
Yahoo was instrumental in hardening YARN before it became generally available, and the same sort of work needs to be done on the core of Apache Spark before it’s ready for production at scale, Cnudde says.
“There’s just not that many companies that have 32,000-node Hadoop deployments,” he says. “We’re one of them. There are a couple others. But there aren’t that many, and one of these companies has to do the work. And then around Spark, there are some particular issues that the community is aware of and that they’re improving on and there’s quite a bit of effort being done in the Spark community to improve it. But in the end you just need to run it and find the issues and resolve them and move on. It’s just hard work.”
2. Get On-the-Record Accounts
For all the hype that big data technologies get, there’s a surprising lack of case studies that big data promoters can point to and say “See, it works just like we said in the real world, and this customer is proof.” This lack of reference accounts has stymied the Hadoop distributors, who have sold Hadoop licenses to perhaps 1,000 companies to date, despite venture capital funding measured in the billions of dollars. And the lack of verifiable case studies is threatening to hold back Spark’s growth too.
“Spark is a fantastic technology, but I don’t know of any large implementations on Spark,” comScore CTO Mike Brown told Datanami in a recent interview. “There’s a lot of interesting papers and things, and people talking about doing this effort in Spark. But I don’t know of any [Spark applications] in production. It’s a pretty big transition to go from a proof of concept to actual production, when you have to produce data every day at a large scale. That’s the issue I have with Spark.”
Like Yahoo, comScore continues to rely heavily on MapReduce for the bulk of its Hadoop workloads (Yahoo is also a customer of comScore’s digital media analytics, but that’s another story). Its development group is following Spark, but for the moment it will continue to rely on tested MapReduce jobs running on its 400-node MapR Technologies cluster.
Databricks is aware of the perception that Spark is talked about but not used, even if it believes the perception to be false. “Stay tuned. It’s something we’re actively working on,” says Databricks business development executive Arsalan Tavakoli-Shiraji. “We are actually spending a lot of time in saying how do we surface some of these to be more referenceable.”
There’s always going to be noise and doubts about a technology and whether it scales, performs, and is mature. “But the second that there’s an extremely large ecosystem of applications that now work seamlessly on it and are using it without a second thought…those questions stop,” Tavakoli-Shiraji says.
Among the things Databricks is most proud of are the certified Spark applications, says Tavakoli-Shiraji. “One of the major trends we’re starting to see is companies who previously in the world of Hadoop had written their own entire processing engine…are ripping out their processing engine and replacing it with Spark underneath,” he says.
Until those customers come forward en masse, Spark will retain its “white unicorn” status.
3. Strive for Backward Compatibility
Hadoop version 2 is undeniably a better and more flexible framework for running a variety of big data workloads. The YARN resource manager in Hadoop 2 ensures that batch MapReduce jobs can get along with interactive Hive jobs and real-time Storm jobs. It’s also the main technology enabler for running Spark on Hadoop, which opens up the Spark Streaming, MLlib, GraphX, and Spark SQL interfaces (along with pre-alpha components like SparkR and BlinkDB) to Hadoop 2 clusters.
But for the early adopters, the path from Hadoop version 1 to Hadoop 2 may look more like a migration than a simple upgrade. Not only did the APIs change, but entire operating models and vendors changed. For customers running Hadoop in production, keeping all the various components in compatible lockstep has become a major barrier to productivity.
Apache Spark has the potential to learn from Hadoop’s baggage. “With Spark, we’re fortunate to be the second, or N-th, mover,” Tavakoli-Shiraji says. “So we get to see what things the Hadoop community did well–which was a lot of things–but also the things they struggled with.”
When Spark 1.0 became available earlier this year, the community promised “API stability” for the entire 1.x release cycle. This ensures that customers can “feel free to upgrade and innovate and get the benefits without worrying that all of these things are going to break,” Tavakoli-Shiraji says. “It sounds trivial. But it’s actually harder to do. And it’s really important to customers.”
The Apache Spark community should continue to make backward compatibility a major goal. There’s too much of a Wild West mentality in the big data analytics space at the moment, and enterprises that get burned on their first project will be hesitant to go back for more.
Spark is an enormously promising, if young, technology that has the chance to dramatically alter the big data landscape. If it can accomplish these three tasks–proving scalability, getting referenceable accounts, and maintaining backward compatibility–then it has a shot at actually fulfilling its promise–and maybe saving Hadoop’s bacon along the way.