Big Data Benchmark Gauges Hadoop Platforms
In another indication of a maturing technology and growing demand, an industry group has released a big data analytics benchmark designed to gauge the performance of Hadoop-based systems.
The Transaction Processing Performance Council said this week its TPCx-BB benchmark for big data analytics systems covers frameworks such as MapReduce, Apache Hive, Apache Spark and Spark's Machine Learning Library, or MLlib.
According to the TPC website, the “express” benchmark measures the performance of Hadoop-based systems, including hardware and software components. The benchmark executes 30 frequently performed analytical queries in the context of “retailers with physical and online store presence.”
The queries are expressed in SQL for structured data and in machine learning algorithms for semi-structured and unstructured data. SQL queries can use Hive or Spark while machine learning algorithms use MLlib along with “user defined functions and a procedural program,” the benchmark group added.
Along with representing the three data types, the new benchmark simulates big data processing, analytics and reporting for the 30 use cases. Runtimes for the big data simulations range from seconds to hours.
The benchmark workload also addresses data set scaling and can run concurrent threads supporting multiple jobs with different characteristics running on a single cluster or via node scaling. The metric supports Hive on MapReduce as well as Hive running on either Spark or Apache Tez, a framework for building high-performance batch and interactive data processing applications.
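The idea of concurrent streams can be illustrated with a minimal Python sketch. It is not taken from the TPCx-BB kit: the `run_query` function is a hypothetical stand-in for submitting one query to a cluster, and the stream count, query numbers and sleep-based placeholder are all assumptions made for illustration. The sketch simply times several streams running in parallel threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_query(query_id: int) -> float:
    """Hypothetical stand-in for submitting one query; returns elapsed seconds."""
    start = time.perf_counter()
    time.sleep(0.01)  # placeholder for real work on the cluster
    return time.perf_counter() - start


def run_stream(stream_id: int, query_ids: list[int]) -> dict:
    """Run one stream's queries sequentially and record per-query times."""
    timings = {qid: run_query(qid) for qid in query_ids}
    return {"stream": stream_id, "timings": timings}


def run_concurrent_streams(num_streams: int, query_ids: list[int]) -> tuple[list[dict], float]:
    """Run several streams at once; return their results and total wall-clock time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        results = list(pool.map(lambda s: run_stream(s, query_ids), range(num_streams)))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run_concurrent_streams(num_streams=2, query_ids=[1, 2, 3])
    print(f"{len(results)} streams finished in {elapsed:.2f}s")
```

A real harness would replace the placeholder with calls to Hive or Spark and would likely vary the query order per stream.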
The benchmark characteristics ultimately provide performance and price metrics for determining the tradeoffs between data analytics performance and cost, the council said.
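The tradeoff between price and performance can be illustrated with a simple ratio. The Python sketch below is a generic illustration, not the official TPCx-BB metric definition; the function name, the system names and all figures are hypothetical.

```python
def price_per_performance(total_system_cost: float, queries_per_minute: float) -> float:
    """Return cost per unit of throughput (lower is better)."""
    if queries_per_minute <= 0:
        raise ValueError("throughput must be positive")
    return total_system_cost / queries_per_minute


# Hypothetical systems: (cost in dollars, throughput in queries per minute)
systems = {"system_a": (500_000.0, 250.0), "system_b": (300_000.0, 120.0)}

for name, (cost, qpm) in systems.items():
    print(f"{name}: ${price_per_performance(cost, qpm):,.2f} per query-per-minute")
```

In this made-up case the more expensive system turns out to be cheaper per unit of throughput, which is the kind of comparison a combined price and performance metric is meant to expose.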
The retail-oriented benchmark reflects a shift away from “shopping basket analysis” techniques toward detailed consumer behavior modeling, developers said. That shift has resulted in an explosion of data analytics systems, prompting the need for a new mechanism to compare disparate platforms in real-world use cases.
“With the advent of so many big data and analytics systems—from an array of hardware and software vendors—there is immediate demand for apples-to-apples, cross-platform comparison,” argued Bhaskar Gowda, chairman of the TPCx-BB committee and a senior staff engineer with Intel Corp.’s (NASDAQ: INTC) Data Center Group.
Gowda noted that Hewlett Packard Enterprise (NYSE: HPE) published results for its ProLiant DL360 and 380 servers shortly after the data analytics benchmark was released. He cited the early results as evidence of “industry demand for such a benchmark.”
TPCx-BB is the industry council’s third big data benchmark. A previous metric released in December 2015 measures the performance of SQL-based big data systems by expanding the group’s original data analytics gauge.
Besides Intel and HPE, representatives from Actian, Cisco Systems, Cloudera, Dell, Huawei, IBM, Microsoft, Oracle, Red Hat and VMware serve on the big data analytics committee.
The TPCx-BB benchmark can be downloaded from the TPC website.