BigData Top100 Benchmark Nearing Completion
Researchers are closing in on finalizing the benchmark that will be used to identify the fastest big data systems in the world. The group driving this effort, BigData Top100, could settle on a technical standard for the benchmark by this summer, with a possible inaugural list of the top 100 big data systems at a Strata conference in 2015.
Nearly two years ago, a small group of researchers started work on a benchmark to identify the top performing big data systems. Today, members of the group continue to hash out the various possible approaches to measuring big data brawn in a computational context, with the goal of finding the single best way to do it.
“We’ve been at it for a couple of years now so I think there’s some level of maturity in all the different ideas,” Chaitanya Baru of the San Diego Supercomputer Center (SDSC) told HPCwire editor Nicole Hemsoth in an HPCwire Soundbite podcast last week. “I suspect the August meeting will be defining in the future direction…If you want something like the Top 100, maybe a year from now.”
BigData Top100 was founded by representatives from SDSC, the University of California, San Diego, and the Center for Large-Scale Data Systems Research. Today, it counts members from 11 vendors: Intel, IBM, HP, Oracle, Google, Seagate, Brocade, Cisco, Pivotal, Mellanox, and Facebook. “The benchmark is under development–it’s not yet public information–but I’m sure that all of the companies involved in the benchmark have been trying things out,” Baru says. “When you’re ready to define something as a benchmark standard, that’s when we all come together and maybe pick the best of what people have been working on.”
Picking a single benchmark to model all big data workloads is not an easy thing to do. A benchmark designed to measure graph processing–one type of big data workload that’s also popular in HPC–would not do well at measuring the performance of a key-value style database. Similarly, a benchmark set up to measure the performance of an in-memory stream processing system acting on, say, 50 TB of fast-moving data would probably look different from one set up to run machine learning algorithms against 5 PB of static data. And then there are the HPC applications of big data.
The folks behind the BigData Top 100 project realize this, and are adamant about the need for the benchmark to evolve. Nonetheless, one has to start somewhere, and the researchers are making their choices. Baru, formerly a database engineer with IBM who worked on TPC benchmarks for enterprise systems, went over some of the options with Hemsoth.
“There are lots of discussions. We’ve been running these big data workshops since 2012 about what are the different benchmarks you could have,” he says. “You could have micro benchmarks on I/O to measure disk performance. You could have functional benchmarks like sorting. But what we really decided–and this also comes a little from the way TPC does things–is that what’s important for end users is the application load benchmark. We want to model this as an application load.”
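Baru's taxonomy separates narrow measurements from whole-workload ones: a micro benchmark exercises one resource (disk I/O), a functional benchmark exercises one operation (sorting), and an application-level benchmark models an end-to-end workload. A minimal sketch of the middle tier, a functional sort benchmark, might look like the following. The record counts, seed, and harness here are illustrative assumptions, not anything the BigData Top100 group has specified.

```python
import random
import time

def run_sort_benchmark(n_records: int, seed: int = 42) -> float:
    """Time an in-memory sort of n_records random keys.

    This is a functional benchmark in Baru's sense: it measures one
    operation (sorting) in isolation, rather than an application load.
    """
    rng = random.Random(seed)
    records = [rng.random() for _ in range(n_records)]

    start = time.perf_counter()
    records.sort()
    elapsed = time.perf_counter() - start

    # Sanity-check the result so a broken sort can't report a fast time.
    assert all(records[i] <= records[i + 1] for i in range(len(records) - 1))
    return elapsed

if __name__ == "__main__":
    # Illustrative sizes only; a real benchmark spec would fix the
    # data generator, scale factors, and reporting rules precisely.
    for n in (10_000, 100_000, 1_000_000):
        print(f"{n:>9} records sorted in {run_sort_benchmark(n):.4f} s")
```

An application-load benchmark of the kind the group favors would instead chain many such operations (ingest, transform, query, model) and report a single end-to-end metric, which is why it is harder to standardize than a one-operation test like this.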
The huge variety of data–the unstructured and semi-structured part of it–is one of the trickiest parts of the big data phenomenon. But variety is also one of the most common constants in real-world big data problems, so chances are good that this part of the notorious Three Vs will feature heavily in the final benchmark–most likely in the context of machine learning and data mining. And don’t be surprised if the benchmark closely resembles real-world Hadoop workloads.
The group is also looking to publish its BigData Top 100 list on the same schedule as the Top500 list, which would mean two announcements per year. Baru says the group is considering making one announcement at the annual ISC meeting in Europe and one at Strata in the United States.