3 Major Things You Should Know About Apache Spark 1.6
Hundreds of fixes and new features made their way into Apache Spark 1.6, which was announced last week and is expected to ship in mid-December. But here are the three main things in Spark 1.6 that are most likely to affect you, according to the head of engineering and product at Databricks.
1. Automatic Memory Tuning
Spark has always provided several tunable knobs that programmers and administrators can use to dial in the performance of their applications. Because Spark is an in-memory framework, it’s important to allocate enough memory for the actual execution of operators, but it’s also important to leave enough memory for the cache.
But setting the correct allocations was not easy. “You had to be quite an expert to know exactly which ones you should tune,” says Databricks Vice President of Engineering and Product Ali Ghodsi.
That hit-and-miss approach to Spark tuning will be a thing of the past as a result of the new automatic memory tuning capability introduced in Spark 1.6. “Spark will automatically tune a lot of these memory parameters dynamically,” Ghodsi tells Datanami. “That makes Spark for the masses much, much easier and better to use.”
Spark now automatically tunes itself in response to how it’s being used. So if the framework senses that it needs a big scratch space to join data in the morning, it can temporarily allocate 70 percent of the available memory to operators and 30 percent to the cache. And when the workload shifts to machine learning later in the day, Spark can automatically dial itself back to a 60/40 or a 50/50 split between cache and operators.
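Under the hood, this is Spark 1.6’s unified memory management: the old fixed boundaries give way to a single region that execution and storage share dynamically. For administrators who still want manual control, the relevant settings in 1.6 look roughly like this (defaults shown; values here are the documented 1.6 defaults, and most deployments can now leave them alone):

```
# Spark 1.6 unified memory settings (spark-defaults.conf)
spark.memory.fraction          0.75   # share of heap used for execution + storage combined
spark.memory.storageFraction   0.5    # portion of that region shielded from eviction for caching
spark.memory.useLegacyMode     false  # set true to fall back to the pre-1.6 static split
```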
“It can automatically re-tune itself every minute of the day,” Ghodsi says. “That’s certainly something you wouldn’t do if you were manually tuning it. If you were manually tuning it, you’d do it once per month or once per year.”
The enhancement should have an impact similar to adding more physical RAM to the Spark server. Depending on how your application was developed, it could boost performance by anywhere from a few percent to 10x or more.
“In many cases, it means you can fit more of your dataset in memory, even though you didn’t buy more memory,” Ghodsi says. “And you’ll have less risk of running out of memory and having to spill to disk, unless you tuned it really, really well.”
This cache-aware computing feature is one of the items in Apache Spark’s Project Tungsten, which we covered in-depth earlier this year.
2. Spark Streaming’s Big Speedup
There are lots of emerging use cases for Spark Streaming, the real-time streaming analytics Spark sub-project (see today’s feature story, “Spark Streaming: What Is It and Who’s Using It?” by Tathagata Das, the Databricks engineer who created Spark Streaming, for a great primer on the tech).
One of the most common uses for Spark Streaming is state management. Uber, for instance, uses it to track the location of its drivers, and fitness wearable firms use it to count the number of steps people take.
Prior to Apache Spark 1.6, state management was computed in a rather expensive manner (computationally anyway–the software, of course, is free). With this release, the folks behind Apache Spark have dramatically streamlined the state management functionality, and in the process delivered a 10x boost in Spark Streaming performance.
“The previous implementation under the hood was running a pretty expensive join operation. We improved it by keeping track of deltas. Now we keep track of exactly what changed in the state between the previous set of values that were coming in through the stream and this one. We more efficiently just keep track of what has changed, and we can do that much faster, which means much less data. That translates into about a 10x speedup, maybe more,” Ghodsi says.
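The idea is generic: rather than re-joining the entire accumulated state against each new batch, only the keys that actually changed get touched. A minimal sketch of that delta-based update in plain Python (the names and data are illustrative, not Spark’s internals):

```python
# Conceptual sketch of delta-based state management (illustrative only;
# this is not Spark's implementation, just the idea behind the speedup).

def update_with_deltas(state, batch):
    """Apply only the changed keys from the new batch to the running state."""
    for key, value in batch.items():   # work is proportional to len(batch)...
        state[key] = value             # ...not to len(state)
    return state

# Running state: driver -> last known location
state = {"driver-1": (37.77, -122.42), "driver-2": (40.71, -74.01)}

# A micro-batch arrives with a single driver movement;
# only that one key is updated, and the rest of the state is never scanned.
state = update_with_deltas(state, {"driver-1": (37.78, -122.40)})
```

The cost of each batch now scales with the size of the delta rather than the size of the full state, which is where the order-of-magnitude gain comes from when state is large and per-batch changes are small.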
Not all Spark users will see the 10x speedup in Spark Streaming performance that Databricks is touting. But considering that about 90 percent of Spark users use Spark Streaming, and that state management is one of the core functions of the sub-project, it’s likely that many Spark users will benefit substantially from the elimination of this particular bottleneck.
3. The New Datasets API
What if Spark’s original Resilient Distributed Dataset (RDD) API and the new DataFrame API that was delivered earlier this year got married and had a baby? According to Ghodsi, that baby would look a lot like the new Datasets API that’s being delivered in Spark 1.6.
The new Datasets API delivers the best features of the RDD and DataFrame APIs, and carries none of the drawbacks (which, alas, is perhaps proof that the baby analogy does not work here).
One of the reasons people like the original RDD API is that it supports static typing, which lets developers catch whole classes of errors at compile time rather than through extensive testing. However, working with the low-level RDD API is not exactly fast.
Spark addressed that time-to-value issue with the DataFrame API, a higher-level interface that helps Spark programmers get productive more quickly than the RDD API does. However, the DataFrame API doesn’t support static typing, which pushes more of those first-go-round errors out to runtime.
The new Datasets API delivers the high-level feel of the DataFrame API, but includes the static typing support that RDD users have grown to love. This has been one of the most requested features among existing Spark users, Ghodsi says, and it will particularly help Java and Scala coders, since Python doesn’t support static typing anyway.
“In the past they’ve been saying we love the RDD API. We know the programs are always correct because it has static typing. But we prefer this high level API for DataFrame–it’s awesome and we also love the fact that the DataFrame programs are faster because they go through an optimizer,” Ghodsi says. “The future seems to be with DataFrames, but we still want to do static typing. What’s your answer? In the past you have to pick one of these two–the correct one or the faster one. Now we can tell them… you get the best of both worlds.”
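In Scala, the difference amounts to very little code. Here is a rough sketch of what Dataset code looks like in Spark 1.6 (this assumes a `SQLContext` named `sqlContext` with its implicits in scope; the `Driver` case class and sample data are made up for illustration):

```scala
import sqlContext.implicits._  // brings in the toDS() / as[T] conversions

// A statically typed record: the compiler knows every field and its type.
case class Driver(name: String, trips: Long)

// Create a Dataset from local data.
val ds = Seq(Driver("ann", 12), Driver("bob", 3)).toDS()

// filter and map are checked at compile time: a typo like _.trps, or
// comparing trips to a String, fails at compilation rather than at runtime --
// the safety of the RDD API...
val busy = ds.filter(_.trips > 10).map(_.name)

// ...while the query still runs through Spark's optimizer, like a DataFrame.
busy.show()
```

The design point is exactly the trade-off Ghodsi describes: the typed lambdas give RDD-style compile-time correctness, while execution goes through the same optimizer that makes DataFrame programs fast.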