October 10, 2014

Apache Spark Beats Record for Fastest Processing of Big Data

BERKELEY, Calif., Oct. 10 — Databricks, the company founded by the creators of the popular open-source Big Data processing engine Apache Spark, announced today that it has broken the world record in the GraySort, a third-party industry benchmark for sorting large on-disk datasets. Databricks completed the Daytona GraySort, a distributed sort of 100 terabytes (TB) of on-disk data, in 23 minutes using 206 machines with 6,592 cores during this year’s Sort Benchmark competition. That feat beat the previous record, held by Yahoo, of 70 minutes on a large, open-source Hadoop cluster of 2,100 machines. In other words, Spark sorted the same data three times faster with one-tenth as many machines.

Additionally, while no official petabyte sort competition exists, Databricks pushed Spark further, sorting one petabyte (PB) of data on 190 machines in under four hours (234 minutes). This petabyte result beats previously reported times achieved with Hadoop MapReduce.

“Spark is well known for its in-memory performance, but Databricks and the open source community have also invested a great deal in optimizing on-disk performance, scalability, and stability,” said Ion Stoica, CEO of Databricks. “Beating this data processing record, previously set on a large Hadoop MapReduce cluster, not only validates the work we’ve done, but also demonstrates that Spark is fulfilling its promise to serve as a faster and more scalable engine for all data processing needs. This competition asked us to sort a set of data far larger than what most companies will ever have to process, meaning the value of Spark can be realized at organizations of any size, with quantities of data large and small.”

With less than one-tenth of the computing power used by Yahoo’s cluster of 2,100 machines, Databricks’ 206 machines ran Apache Spark on top of the Hadoop Distributed File System (HDFS) and sorted 100 TB of on-disk data in just 23 minutes. The cluster was hosted on Amazon EC2 and consisted of 206 i2.8xlarge instances with an aggregate of 49 TB of memory, 6,592 cores, and 1.2 PB of solid-state drive space.

“This sort benchmark is particularly challenging because sorting 100 TB of data generates 500 TB of disk I/O and 200 TB of network I/O. At the core of sorting is an operation called shuffle, which moves data across all the machines. Shuffle is also a critical operation underpinning almost all workloads, including running SQL queries on multiple data sources and doing predictive analytics,” said Reynold Xin, who leads this effort at Databricks. “A lot of development has gone into improving Spark since the 1.0 release for such demanding workloads. Some notable examples include a new sort-based shuffle module that was optimized for large workloads, and a new network transport module that was able to sustain 1.1 GB/s per node during shuffle. All these improvements will be available in Spark 1.2.”
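To make the shuffle concrete, the sketch below shows a distributed sort written against Spark’s Scala API. It is not Databricks’ benchmark code; the HDFS paths and the tab-separated record format are illustrative assumptions. The call to sortByKey is what triggers a shuffle: records are range-partitioned across the cluster so that each machine receives a contiguous key range, and each partition is then sorted locally.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD operations such as sortByKey (needed in Spark 1.x)

object SortSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sort-sketch"))

    // Read records from HDFS; assume each line is "key<TAB>value" (illustrative format and path).
    val records = sc.textFile("hdfs:///input/records")
      .map { line =>
        val Array(key, value) = line.split("\t", 2)
        (key, value)
      }

    // sortByKey triggers the shuffle: keys are range-partitioned across the
    // cluster, each machine pulls its key range over the network, and every
    // partition is then sorted locally, yielding a globally sorted dataset.
    val sorted = records.sortByKey(ascending = true)

    sorted.map { case (k, v) => s"$k\t$v" }
      .saveAsTextFile("hdfs:///output/sorted")

    sc.stop()
  }
}
```

The same shuffle machinery underlies joins, aggregations, and SQL queries, which is why the shuffle and network improvements cited above matter well beyond sorting benchmarks.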

In breaking the previous record, Databricks has demonstrated that Spark can scale to both in-memory and on-disk data processing, and can do so with one-tenth of the resources and in a fraction of the time once required.
