BigData Top100 Benchmark Nearing Completion
Researchers are closing in on finalizing the benchmark that will be used to identify the fastest big data systems in the world. The group driving this effort, BigData Top100, could settle on a technical standard for the benchmark by this summer, with a possible inaugural list of the top 100 big data systems at a Strata conference in 2015.
Nearly two years ago, a small group of researchers started work on a benchmark to identify the top performing big data systems. Today, members of the group continue to hash out the various possible approaches to measuring big data brawn in a computational context, with the goal of finding the single best way to do it.
“We’ve been at it for a couple of years now so I think there’s some level of maturity in all the different ideas,” Chaitanya Baru of the San Diego Supercomputer Center (SDSC) told HPCwire editor Nicole Hemsoth in an HPCwire Soundbite podcast last week. “I suspect the August meeting will be defining in the future direction…If you want something like the Top 100, maybe a year from now.”
BigData Top100 was founded by representatives from SDSC, the University of California, San Diego, and the Center for Large-Scale Data Systems Research. Today, it counts members from 11 vendors: Intel, IBM, HP, Oracle, Google, Seagate, Brocade, Cisco, Pivotal, Mellanox, and Facebook. “The benchmark is under development–it’s not yet public information–but I’m sure that all of the companies involved in the benchmark have been trying things out,” Baru says. “When you’re ready to define something as a benchmark standard, that’s when we all come together and maybe pick the best of what people have been working on.”
Picking a single benchmark to model all big data workloads is not an easy thing to do. A benchmark designed to measure graph processing–one type of big data workload that’s also popular in HPC–would not do well at measuring the performance of a key-value style database. Similarly, a benchmark set up to measure the performance of an in-memory stream processing system acting on, say, 50 TB of fast-moving data would probably look different from one set up to run machine learning algorithms against 5 PB of static data. And then there are the HPC applications of big data.
The folks behind the BigData Top 100 project realize this, and are adamant about the need for the benchmark to evolve. Nonetheless, one has to start somewhere, and the researchers are making their choices. Baru, formerly a database engineer with IBM who worked on TPC benchmarks for enterprise systems, went over some of the options with Hemsoth.
“There are lots of discussions. We’ve been running these big data workshops since 2012 about what are the different benchmarks you could have,” he says. “You could have micro benchmarks on I/O to measure disk performance. You could have functional benchmarks like sorting. But what we really decided–and this also comes a little from the way TPC does things–is that what’s important for end users is the application load benchmark. We want to model this as an application load.”
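Baru's taxonomy separates narrow measurements from whole-workload ones: a micro benchmark exercises one resource (disk I/O), a functional benchmark exercises one operation (sorting), and an application-level benchmark models an end-to-end workload. A minimal sketch of the middle tier, a functional sort benchmark, might look like the following. The record counts, seed, and harness here are illustrative assumptions, not anything the BigData Top100 group has specified.

```python
import random
import time

def run_sort_benchmark(n_records: int, seed: int = 42) -> float:
    """Time an in-memory sort of n_records random keys.

    This is a functional benchmark in Baru's sense: it measures one
    operation (sorting) in isolation, rather than an application load.
    """
    rng = random.Random(seed)
    records = [rng.random() for _ in range(n_records)]

    start = time.perf_counter()
    records.sort()
    elapsed = time.perf_counter() - start

    # Sanity-check the result so a broken sort can't report a fast time.
    assert all(records[i] <= records[i + 1] for i in range(len(records) - 1))
    return elapsed

if __name__ == "__main__":
    # Illustrative sizes only; a real benchmark spec would fix the
    # data generator, scale factors, and reporting rules precisely.
    for n in (10_000, 100_000, 1_000_000):
        print(f"{n:>9} records sorted in {run_sort_benchmark(n):.4f} s")
```

An application-load benchmark of the kind the group favors would instead chain many such operations (ingest, transform, query, model) and report a single end-to-end metric, which is why it is harder to standardize than a one-operation test like this.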
The huge variety of data–the unstructured and semi-structured part of it–is one of the trickiest parts of the big data phenomenon. But variety is also one of the most common constants in real-world big data problems, so chances are good that this part of the notorious Three Vs will feature heavily in the final benchmark–most likely in the context of machine learning and data mining. And don’t be surprised if the benchmark closely resembles real-world Hadoop workloads.
The group is also looking to publish its BigData Top 100 list on the same schedule as the Top500 list, which would mean two announcements per year. Baru says the group is considering making one announcement at the annual ISC meeting in Europe and one at Strata in the United States.