Cutting: Spark an ‘All-Around Win’ for Hadoop
Hadoop co-creator Doug Cutting said today that Apache Spark is “very clever” and is “pretty much an all-around win” for Hadoop, adding that it will enable developers to build better and faster data-oriented applications than MapReduce ever could.
Cutting talked at length about Spark during today’s Cloudera webinar, titled “Uniting Spark and Hadoop: The One Platform Initiative.” The Hadoop distributor, which employs Cutting as its chief architect, has ramped up the pro-Spark messaging since the launch of the One Platform Initiative earlier this month.
Cloudera plans to eventually replace MapReduce with Spark, and so it called on Cutting to help evangelize for Hadoop’s new primary engine. “It really makes a lot of improvements on MapReduce,” Cutting said. “It can do pretty much everything that MapReduce [does], but it does more and it does many of the things better.”
Spark is fundamentally easier to use because it has this rich higher level API, Cutting said. “And it’s not just a Java API, as MapReduce primarily was, but there’s also Scala and Python, which are extremely popular language for data science,” he continued. “It also has an interactive shell. You can explore your data interactively and not be forced to go through a compile load loop each time you make a change.”
While Spark is slated as a replacement for MapReduce, its capabilities go above and beyond what MapReduce was ever able to do, Cutting said.
“It’s added a lot of new kinds of capabilities that weren’t present in MapReduce,” he said. “Not only does it handle batch processing well, but with a technique called micro-batching you can do streaming and have computations that are updated second by second, at a much higher rate.”
Another huge advantage that Spark holds over MapReduce is the performance, Cutting said. Applications written in Spark typically run anywhere from 10 to 100 times faster than the equivalent applications written with MapReduce as the underlying engine. Cutting provided a good description of how that performance advantage actually came to be.
“The technology that underlies Spark performance is something called an RDD, a resilient distribute dataset, an abstraction that describes a set of data that lives across the cluster,” he said. “It may be in memory, it may be on disk–it also may be recomputed if needed, because the method that it was created is attached to it…. So if you have a failure and you’ve got a data set that’s only in memory, you can recreate the portion of it that’s been lost without having to restart all the computations.
“It’s a very clever approach,” Cutting continued. “This give it a lot of its performance because it can both keep data sets in memory and flush them to disk if there’s no longer room in memory and really mange the use of memory across the cluster. As the cluster has grown and as memory has gotten cheaper, we now have huge amounts of memory across the clusters that can really be used to accelerate things. But it’s hard to manually determine that in your applications and it’s much better to get the system to manage it, so Spark is able to do that.”
Spark combines the capabilities of other systems into a single system, which provides a productivity advantage for developers. “There’s libraries for machine learning and graph processing. So you don’t need to move to another system to build systems that incorporate these,” Cutting said.
While most people probably tuned into hear Cutting talk about the technical advantages of Spark, the Father of Hadoop also took some time to tout Cloudera’s investments in Spark. Cloudera was the first Hadoop distributor to ship and support Spark, Cutting said, and together with Intel employs about 75 percent of the total number of committers to the Apache Spark project out of all the Hadoop vendors–about five times the resources invested in Spark compared to other Hadoop vendors.
All told, Cloudera has about 150 Spark customers in healthcare, financial services, insurance, advertising, and search personalization use cases, Cutting said. The largest Spark deployment among CDH users is 800 nodes, he said.
“I’m very exited about it,” Cutting said. “In healthcare we’ve seen people using the core of Spark, more batch-based, to look at diseases of various sort, but using Spark Streaming to actually predict when Sepsis might be occurring in a hospital. We’ve seen the death-from-sepsis rate from hospitals substantially reduce after they’ve deployed a system based on Spark Streaming. It’s pretty exciting to say you’re saving…lives with this stuff.”
At the end of the day, Cutting’s endorsement of Spark carries a lot of weight—especially considering the concern of some in the Hadoop community that Spark and Hadoop might not play well together. As the co-creator of Hadoop (he developed the software as an indexer for the World Wide Web with Mike Cafarella while working at Yahoo in the mid-2000s), Cutting is widely revered for his views on big data. To that end, Cutting’s endorsement of Spark speaks volumes about how
“Spark’s…really got a tremendously wide range of utility that complements the rest of the Hadoop ecosystem, building on the underlying shared resources of HDFS and YARN,” Cutting said. “If you build expertise in Spark, it means you don’t have to build expertise in five more separate sites and learn to support and maintain those, because you can combine all of those tougher into one integrated component, which makes it simpler for developers and simpler for operators who are running things.”
Cutting is absolutely on board having Spark replace MapReduce as the standard execution engine in the Hadoop ecosystem. “It’s a better tool in many ways,” he said. “It combines more functionality with greater ease of use and higher performance. So it’s pretty much an all-around win and Cloudera is very happy to embrace Spark and push Spark forward.”