January 13, 2014

Toward Comprehensive Big Data Benchmarking

Nicole Hemsoth

It’s difficult enough to keep track of the ebb and flow of new tools that span the big data ecosystem, let alone keep tabs on the latest measures and standards by which to evaluate the influx. While there are standards for a number of specific application areas, architectures, and programmatic approaches, it’s hard to get a more comprehensive view across solutions and system-wide needs.

A team of researchers from the Institute of Computing Technology at the Chinese Academy of Sciences has tackled the problem of benchmarking big data with a new tool called BigDataBench.

The effort, based on input from direct research and a team of outside industrial partners, examines broad application scenarios where there are “diverse and representative datasets.” The team draws on a core of 19 existing benchmarks that cover application scenarios, the operations/algorithmic angle, data source handling, software stacks and application types. They noted that, running on a standard Xeon (E5645), there were some notable contrasts between their benchmarks and the discrete suites (PARSEC, HPCC and SPEC CPU among them). Their results can be found in more detail here.

The benchmarking suite includes six real-world data sets and 19 big data workloads, covering six application scenarios: micro benchmarks, “cloud OLTP,” relational query, search engine, social networks, and e-commerce.

BigDataBench also provides several data generation tools that produce scalable volumes of big data (e.g., PB scale) from small-scale real-world data while preserving its characteristics. A full spectrum of system software stacks for real-time analytics, offline analytics, and online services is included. The sample data sets include those from Wikipedia (over 4 million articles), a Google Web graph with 875,713 nodes and 5,105,039 edges, massive e-commerce transaction data in structured format and more. This provides a well-rounded view of different data types and puts the results in a more defined context.
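The idea of scaling up a small seed data set while preserving its characteristics can be illustrated with a toy sketch. The Python below is not BigDataBench's generator, and the function names (`fit_word_model`, `generate_docs`) and the seed sentences are hypothetical. It assumes the only characteristics worth preserving are word frequencies and document lengths, which is far simpler than what a real generator would model.

```python
import random
from collections import Counter


def fit_word_model(seed_docs):
    """Estimate word frequencies and document lengths from a small seed corpus."""
    counts = Counter(word for doc in seed_docs for word in doc.split())
    lengths = [len(doc.split()) for doc in seed_docs]
    return counts, lengths


def generate_docs(counts, lengths, n_docs, rng_seed=0):
    """Sample synthetic documents that follow the seed's word and length statistics."""
    rng = random.Random(rng_seed)  # fixed seed so runs are repeatable
    vocab = list(counts)
    weights = [counts[word] for word in vocab]
    docs = []
    for _ in range(n_docs):
        length = rng.choice(lengths)  # reuse an observed document length
        words = rng.choices(vocab, weights=weights, k=length)
        docs.append(" ".join(words))
    return docs


# Hypothetical seed corpus standing in for a small real-world sample.
seed_docs = [
    "big data systems need fair benchmarks",
    "benchmarks need diverse data and diverse workloads",
    "real data is small but systems are big",
]

counts, lengths = fit_word_model(seed_docs)
synthetic = generate_docs(counts, lengths, n_docs=1000)
```

Raising `n_docs` grows the output to any volume, while the vocabulary and the relative word frequencies stay tied to the seed sample.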

As the team notes, “considering the broad use of big data systems, for the sake of fairness, big data benchmarks must include diversity of data and workloads, which is the prerequisite for evaluating big data systems and architectures.” The problem is, most of the benchmark efforts to date only consider specific applications or software stacks and are too limited for a broader effort.

The benchmark is freely available via the open source project but does require some ramp-up time, as it’s not simple to navigate. However, for those looking to move beyond a monolithic benchmarking effort, especially if using Xeon E5-series processors, this could be a handy tool.