July 7, 2020

Google Cloud’s Dataproc Gets a GPU-Powered Spark Boost

Oliver Peckham

(DANIEL CONSTANTE/Shutterstock)

Google Cloud’s Dataproc – its big data platform that allows users to run Apache Hadoop and Spark jobs – is getting a boost. Apache Spark 3 and Hadoop 3 have reached general availability, enhancing users’ data analytics capabilities with a series of new features – and naturally, those features are now available in Google Cloud’s Dataproc image version 2.0.

In a blog post, Christopher Crosbie (product manager for Google Cloud) and Igor Dvorzhak (a software engineer at Google) highlighted the new features offered in the Apache Spark 3 implementation.

Adaptive queries: Spark can now optimize a query plan while execution is occurring. This will be a big gain for data lake queries that often lack proper statistics in advance of the query processing.

Dynamic partition pruning: Avoiding unnecessary data scans is critical in queries that resemble data warehouse queries, which use a single fact table and many dimension tables. Spark 3 brings this data pruning technique to Spark.

GPU acceleration: NVIDIA has been collaborating with the open source community to bring GPUs into Spark’s native processing. This allows Spark to hand off processing to GPUs where appropriate.
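For readers who want to see what the three features above look like in practice, here is a minimal Python sketch of the Spark properties commonly associated with them. The property names are standard Spark 3 and RAPIDS Accelerator options, but the pairing is an assumption on our part: the blog post does not say which settings Dataproc applies, the GPU lines assume NVIDIA's RAPIDS Accelerator plugin is the mechanism in use and that its jar is already on the cluster, and the helper `to_submit_args` is a hypothetical name used only for illustration.

```python
# Spark 3 properties associated with the three features described above.
# Assumption: these values are illustrative, not settings quoted from the
# Google or Nvidia blog posts.
SPARK3_FEATURE_CONF = {
    # Adaptive query execution: re-optimize the query plan at runtime.
    "spark.sql.adaptive.enabled": "true",
    # Dynamic partition pruning: skip fact-table partitions that cannot
    # match the filtered dimension tables.
    "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",
    # GPU acceleration, assuming the RAPIDS Accelerator plugin is installed.
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.rapids.sql.enabled": "true",
}


def to_submit_args(conf):
    """Render a conf dict as spark-submit style `--conf key=value` arguments.

    Hypothetical helper for illustration; keys are sorted so the output
    order is stable.
    """
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the flags one could append to a spark-submit invocation.
    print(" ".join(to_submit_args(SPARK3_FEATURE_CONF)))
```

The same key/value pairs could instead be passed to a Spark session builder; the dictionary form is used here only so the sketch runs without a Spark installation.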

The headline of these features, the Googlers say, is simple: performance. “As data scientists shift from using traditional analytics to AI applications that better model complex market demands, CPU-based processing can’t keep up without compromising either speed or cost,” wrote Erik Pounds, director of product marketing for Nvidia, in his own blog post. “The growing adoption of AI in analytics has created the need for a new framework to process data quickly and cost-efficiently with GPUs.”

The blog post also explains that there will be improvements to Spark on Kubernetes, as well as some deprecated features: MLlib, GraphX, DataSource API, and Python 2.7 are no longer supported in Spark 3, though replacements like SparkGraph, DataSource V2, and Python 3 are available.

Hadoop 3 is also now available in Dataproc image version 2.0, enabling features like native GPU support in the YARN scheduler and YARN containerization. Finally, the new version of Dataproc includes some non-Apache upgrades, such as support for new software libraries, support for new shared libraries (and upgrades to existing shared libraries), and various optimizations.

According to Google, getting started with Spark 3 and Hadoop 3 in Dataproc image version 2.0 is as simple as entering a few lines of code. To read more about the new features and see how to work with the new capabilities yourself, see the Google Cloud blog post.
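As a rough idea of what those few lines might look like, the Python sketch below assembles a `gcloud dataproc clusters create` command that requests image version 2.0. The `--region` and `--image-version` flags are real gcloud options, but the cluster name, the region, and the helper `dataproc_create_command` are hypothetical placeholders, and the exact command Google recommends may differ.

```python
import shlex


def dataproc_create_command(cluster_name, region, image_version="2.0"):
    """Build a `gcloud dataproc clusters create` command as an argv list.

    Hypothetical helper: it only assembles the arguments and does not
    call gcloud or create any cloud resources.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        f"--image-version={image_version}",
    ]


if __name__ == "__main__":
    # "example-cluster" and "us-central1" are placeholder values.
    argv = dataproc_create_command("example-cluster", "us-central1")
    # shlex.join (Python 3.8+) renders the list as a shell-safe string.
    print(shlex.join(argv))
```

Building the command as a list keeps each argument separate, so it can be handed to `subprocess.run` without shell quoting concerns once the placeholders are replaced with real values.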

Applications: Enterprise Analytics

Technologies: Cloud

Vendors: Google Cloud

Tags: apache hadoop, apache spark, Dataproc, Google Cloud, Nvidia
