Follow Datanami:
October 10, 2014

Spark Smashes MapReduce in Big Data Benchmark

Databricks today released benchmark results for Apache Spark running the Sort Benchmark, a competition for measuring the sorting performance of large clusters. Spark running on Hadoop sorted 100 TB of data in 23 minutes, three times faster than the previous record held by Yahoo using MapReduce on Hadoop. The result, Databricks says, are due to targeted improvements the Spark community made to improve performance, and should lay to rest any concerns about Spark’s scalability.

Databricks, which is the commercial outfit behind the open source Apache Spark project, ran the Daytona GraySort 100 TB benchmark on Amazon‘s EC2 cloud. It finagled 206 machines from EC2 (the i2.8xlarge instances), which sported 6,592 cores, 49 TB of memory, and 1.2 PB of solid‐state drive storage. The company says this setup provided 10 times less computing power than Yahoo used last year, when it sorted completed the 100 TB sort with MapReduce using 2,100 machines.

To further prove Spark’s scalability the company also ran the 1 PB benchmark, even though there are no official rules for that benchmark, and results are not officially logged. Under the guidance of Databricks co-founder and architect Reynold Xin, the company requisitioned 190 EC2 instances and completed the massive sort in 234 minutes. That’s about four times faster than the previous record set by Yahoo with a Hadoop-MapReduce cluster running across 3,800 machines.

Spark is an in-memory framework, but it also uses disk if needed. This benchmark proves that Spark can handle very large data sets that can’t fit into memory, says Ali Ghodsi, head of engineering at Databricks. “We picked the data set that was larger than the amount of memory we had,” he says. “There’s no way we could have cheated. There’s nothing we could do. You don’t have memory to put it in.”

The Sort Benchmark result should clear the air regarding any doubts people had about scalability, Ghodsi says. “We wanted to do this to prove that we can run at scale,” he tells Datanami.

There have been questions about Spark’s scalability at the high end. Yahoo in particular has questioned Spark’s capability to run on big clusters. Yahoo’s 32,000 node Hadoop cluster is one of the largest in the world, and the Web giant has been instrumental in hardening a variety of Hadoop technologies over the years, including YARN and Hadoop version 2.

Spark is running on Yahoo’s cluster, but it hasn’t been able to take advantage of the whole cluster. Last month, Yahoo’s vice president of engineering Peter Cnudde told Datanami that while the company is working with Spark, it’s not using them broadly for the biggest workloads. “At the very large scale, it [Spark] doesn’t work yet. It has challenges at the larger scale,” Cnudde said.

Ghodsi says he’s familiar with the questions about scalability, but that they have been addressed with changes and new features introduce with Spark version 1.1, which was released in September. The next release, version 1.2, will bring additional scalability enhancements. The Sort Benchmark is proof that those enhancements have bolstered scalability, he says.

“In the very beginning a year and a half ago, there were people who said ‘We’re having trouble running Spark at really enormous scale. It’s not always stable, or it’s fine if it’s in memory but when it doesn’t fit in memory, we’re running into problems,'” Ghodsi says. “About six months ago we had a company retreat, and we discussed what can we do about this. We said, ‘Let’s actually do something really challenge at really large scale, and make sure we can make it work for that.'”

The made several enhancements to Spark that improved its scalability, but two stand out as the most important. This includes a total re-write of the all-important shuffle function, work that was done by Spark founder Matei Zaharia himself. The other was a brand new network transport layer that can sustain 10 Gbps Ethernet speeds, which is critical to supporting shuffling.

The performance that Spark displayed on the Sort Benchmark is a direct result of those two improvements, Ghodsi says. “Shuffle is so fundamental to all these workloads. It’s used all over the place in Spark,” he says, including the ability to join multi-PB tables using SQL, as well as enabling machine learning.

The new shuffle and network transport layer are in the version 1.1 release that debuted in September, and further enhancements are on tap for the forthcoming version 1.2 release. “We found things that we needed to improve and harden to make it really scale and be robust. They’re actually now in the released versions of Spark,” Ghodsi says.

When Databricks hears about Spark having trouble scaling, the customer invariably is running an older version, Ghodsi says. “They’re running Spark 0.8 or 0.8. They’re actually using the research prototype that we developed in the AMPlab,” ,” he says. “Spark is a fast moving project…There’s a lot happening to it. And these improvement are there. But if you use the version 0.8 you’ll never see any of these benefits.”

Related Items:

Three Things Apache Spark Needs to Out-Hadoop Hadoop

Where Does Spark Go From Here?

Apache Spark: 3 Real-World Use Cases