May 4, 2015

Deep Dive Into Databricks’ Big Speedup Plans for Apache Spark

Alex Woodie

Apache Spark rose to prominence within the Hadoop world as a faster, easier-to-use alternative to MapReduce. But as fast as Spark is today, it won’t hold a candle to future versions of Spark that the folks at Databricks are now developing.

Last week Databricks shared its development roadmap for Spark with the world. The company has near-term and long-term plans to boost the performance of the platform; the plans are grouped into something called Project Tungsten.

In a telephone briefing, Databricks co-founder Reynold Xin gave Datanami the low-down on the changes coming to Spark, why they’re necessary in light of enhancements made to the underlying hardware, and how they’ll impact Spark users.

One of the first changes that Databricks has planned is to improve how Spark utilizes memory. Because Spark runs on the Java Virtual Machine (JVM), it currently relies on the JVM, including its garbage collection routines, to manage the memory that applications need.

While this approach works and keeps Spark programmers out of the memory management business (which can be quite tedious), the folks at Databricks feel that the JVM and its garbage collection routines will be too computationally expensive to keep using in the long run.

“So what we’re doing as part of the Tungsten initiative is to sidestep the JVM garbage collection and try to manage memory efficiently ourselves,” Xin says. “We don’t want the overhead of garbage collection, and in particular we don’t want our users to worry about the overhead and having to trim them.”

Databricks is planning two ways to improve the memory management of Spark under the Tungsten initiative. The first involves letting Spark pre-allocate a large chunk of space in the JVM’s managed memory for applications. While this doesn’t get the JVM out of the memory management business for Spark apps, it should bring incremental improvements, because the JVM is only managing one (large) object, instead of many objects, Xin explains.
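To make the "one large object instead of many" idea concrete, here is a minimal sketch in Python rather than on the JVM. It is an illustration only, not Spark's implementation: the `RecordPage` class and the two-field record layout are hypothetical, chosen to show records being packed into a single pre-allocated buffer so the runtime tracks one allocation instead of one object per record.

```python
import struct

# Hypothetical fixed-width record: a 64-bit integer id and a 64-bit float score.
RECORD = struct.Struct("<qd")


class RecordPage:
    """Stores fixed-width records in one pre-allocated buffer (illustrative only)."""

    def __init__(self, capacity):
        # One large allocation up front; the runtime manages a single object
        # no matter how many records are written into it.
        self.buf = bytearray(capacity * RECORD.size)
        self.capacity = capacity
        self.count = 0

    def append(self, rec_id, score):
        if self.count >= self.capacity:
            raise MemoryError("page is full")
        # Write the record's bytes directly into its slot in the buffer.
        RECORD.pack_into(self.buf, self.count * RECORD.size, rec_id, score)
        self.count += 1

    def get(self, index):
        if not 0 <= index < self.count:
            raise IndexError(index)
        # Decode the record from its slot only when it is actually needed.
        return RECORD.unpack_from(self.buf, index * RECORD.size)


page = RecordPage(capacity=1000)
for i in range(1000):
    page.append(i, i * 0.5)
```

The trade-off in this sketch is that the application, not the runtime, now decides where each record lives inside the buffer.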

The second plan is to bypass the JVM completely and go entirely off-heap with Spark’s memory management, an approach that will get Spark closer to bare metal, but also test the skills of the Spark developers at Databricks and the Apache Software Foundation. “I think this is the optimum approach in the long run,” Xin says, “but it’s a riskier approach because it’s relatively untested. You need to worry about whole new things.”
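The "whole new things" Xin mentions are the usual hazards of manual memory management, such as releasing memory explicitly and guarding against use after release. The following Python sketch is only an analogy for going off-heap on the JVM: it uses an anonymous `mmap` region, whose bytes are mapped by the operating system rather than allocated as ordinary Python objects, and the `ManualBlock` class is hypothetical.

```python
import mmap
import struct

SLOT = struct.Struct("<q")  # one 64-bit integer per slot


class ManualBlock:
    """A block of memory whose lifetime the caller manages explicitly (illustrative only)."""

    def __init__(self, slots):
        # Anonymous mapping: the data region is mapped by the OS rather than
        # built from individual Python objects.
        self._mem = mmap.mmap(-1, slots * SLOT.size)
        self.slots = slots
        self.freed = False

    def put(self, i, value):
        self._check(i)
        SLOT.pack_into(self._mem, i * SLOT.size, value)

    def get(self, i):
        self._check(i)
        return SLOT.unpack_from(self._mem, i * SLOT.size)[0]

    def free(self):
        # The caller must release the block; nothing here relies on a collector.
        if not self.freed:
            self._mem.close()
            self.freed = True

    def _check(self, i):
        # Checks a garbage-collected runtime would normally make unnecessary.
        if self.freed:
            raise RuntimeError("use after free")
        if not 0 <= i < self.slots:
            raise IndexError(i)
```

The bounds and use-after-free checks in `_check` are exactly the kind of bookkeeping that moves from the runtime to the engine's developers under this approach.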

While Xin frets about the prospect of gray hairs, the risk posed to the user will be minimal. The API will not change, and programmers won’t have to do anything different. “It’s basically a major engineering investment,” he says. “I don’t think there will be a lot of downside. I think if it’s done well, I don’t think users will notice any regression. A lot of workloads very likely will see order of magnitude gain.”

SQL query processing in Spark stands to get a major speed-up, Xin says. “That’s actually a very important goal of this project,” he says. “The other thing is a lot of the advance machine learning workloads will also get faster because they are… heavily CPU bound.”

As Xin explains, these changes are needed to help Spark and Hadoop apps get more out of today’s faster hardware. “Back in the day, Hadoop was so [poor performance-wise]…that anything we did was much better,” he says. The reliance on spinning disk and 1Gb Ethernet networks meant that Spark still had headroom when it came to the CPU itself. Just moving the data in and out was the bottleneck, as Moore’s Law kept plenty of processor capacity in reserve.

That dynamic has changed with the advent of very fast SSDs, speedy 10Gb Ethernet networks, and the decay of Moore’s Law, Xin says. “The underlying hardware is actually becoming much better, compared with the CPU and memory subsystems,” he says. “So as a result, before it was fairly easy for a Spark program to saturate the network and I/O, and now when we look at it, it’s actually harder because now it’s underutilizing the I/O and memory. So the goal is to squeeze as much as possible out of the new hardware.”

Databricks has a few other tricks up its sleeve with Project Tungsten besides bypassing the JVM to boost memory management, including cache-aware computation. As Xin and Databricks engineer Josh Rosen explain in last week’s blog post, cache-aware computation will enable Spark to take advantage of today’s L1, L2, and L3 on-chip caches.

“When profiling Spark user applications, we’ve found that a large fraction of the CPU time is spent waiting for data to be fetched from main memory,” Xin and Rosen write. “As part of Project Tungsten, we are designing cache-friendly algorithms and data structures so Spark applications will spend less time waiting to fetch data from memory and more time doing useful work.”
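One technique of this kind is to sort a compact array of (key prefix, pointer) pairs instead of chasing a pointer to the full record on every comparison. The Python sketch below shows only the data layout and comparison logic; Python cannot demonstrate the cache effect itself, and the function names, the 8-byte prefix width, and the length-prefixed record format are assumptions made for illustration, not details from the article.

```python
import struct
from functools import cmp_to_key


def pack_records(keys):
    """Store length-prefixed keys back to back in one buffer; return (buffer, offsets)."""
    buf = bytearray()
    offsets = []
    for key in keys:
        data = key.encode("utf-8")
        offsets.append(len(buf))
        buf += struct.pack("<I", len(data)) + data
    return buf, offsets


def read_key(buf, offset):
    """Follow a 'pointer' (offset) into the buffer and return the full key bytes."""
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    return bytes(buf[start:start + length])


def prefix_of(key_bytes, width=8):
    # Fixed-width prefix, zero-padded so it can sit next to the pointer.
    return key_bytes[:width].ljust(width, b"\x00")


def sort_offsets(buf, offsets):
    """Sort record offsets by key, touching the full record only on prefix ties."""
    entries = [(prefix_of(read_key(buf, off)), off) for off in offsets]

    def compare(a, b):
        if a[0] != b[0]:
            # Common case: decided from the compact entries alone.
            return -1 if a[0] < b[0] else 1
        # Tie on the prefix: fall back to dereferencing both records.
        ka, kb = read_key(buf, a[1]), read_key(buf, b[1])
        return (ka > kb) - (ka < kb)

    entries.sort(key=cmp_to_key(compare))
    return [off for _, off in entries]
```

Most comparisons are settled by the prefixes, which sit contiguously in the sort array; only ties require a jump to wherever the full record lives.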

The company is also looking into code generation to further accelerate Spark. There is already some code generation for SQL and DataFrames in Spark. But with future releases, Databricks will be broadening the code generation coverage to most built-in expressions, the company says. “In addition, we plan to increase the level of code generation from record-at-a-time expression evaluation to vectorized expression evaluation, leveraging JIT’s capabilities to exploit better instruction pipelining in modern CPUs so we can process multiple records at once,” Xin and Rosen write.
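The difference between interpreting an expression for each record and generating code for it can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Spark's code generator: the tuple-based expression format and the `interpret`, `to_source`, and `compile_batch` functions are hypothetical, and the generated function merely loops over a batch of rows rather than using real SIMD instructions.

```python
import operator

FUNCS = {"add": operator.add, "mul": operator.mul}
SYMBOLS = {"add": "+", "mul": "*"}


def interpret(expr, row):
    """Record-at-a-time evaluation: walk the expression tree for every row."""
    op = expr[0]
    if op == "col":
        return row[expr[1]]
    if op == "lit":
        return expr[1]
    return FUNCS[op](interpret(expr[1], row), interpret(expr[2], row))


def to_source(expr):
    """Translate the expression tree into a Python source expression."""
    op = expr[0]
    if op == "col":
        return f"row[{expr[1]!r}]"
    if op == "lit":
        return repr(expr[1])
    if op in SYMBOLS:
        return f"({to_source(expr[1])} {SYMBOLS[op]} {to_source(expr[2])})"
    raise ValueError(f"unknown operator: {op}")


def compile_batch(expr):
    """Generate and compile a function that evaluates the expression over a batch."""
    body = to_source(expr)
    src = f"def evaluate_batch(rows):\n    return [{body} for row in rows]\n"
    namespace = {}
    exec(src, namespace)  # the tree walk happens once, at generation time
    return namespace["evaluate_batch"]


# Hypothetical expression: price + qty * 2
expr = ("add", ("col", "price"), ("mul", ("col", "qty"), ("lit", 2)))
rows = [{"price": 10, "qty": 3}, {"price": 1, "qty": 0}]
evaluate_batch = compile_batch(expr)
```

Here `interpret` pays the cost of dispatching on each tree node for every row, while `evaluate_batch` runs straight-line generated code over the whole batch.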

But wait, there’s more! Databricks is also exploring the potential to use GPUs to accelerate certain types of workloads, such as deep learning algorithms and some graph analytic workloads, Xin says. The company will be using the OpenCL framework to enable Spark to leverage GPUs in clusters when they are available. Also on the far horizon is the potential use of LLVM compiler technologies to take advantage of Single Instruction, Multiple Data (SIMD) instructions, such as the Streaming SIMD Extensions (SSE), in modern x86 chips.

Databricks will start to expose the JVM memory management enhancements with the upcoming release of Spark version 1.4. That release, which will ship in June, will contain the enhancements but not run them by default. They will start to be used by default in version 1.5, which is slated for September. Other items on the roadmap, such as code generation and cache-aware computation, will be added in future releases.

Spark has come a long way in a short amount of time. But judging by Project Tungsten and the product’s roadmap, it has a long way to go to fulfill its creators’ vision of helping developers to easily build fast data-intensive applications.
