TPC Crafts More Rigorous Hadoop Benchmark From TeraSort Test
While Moore’s Law has made computing and storage capacity less expensive with each passing year, the amount of data that companies store, and the number and sophistication of the algorithms they want to run against that data for analytics, are growing faster than prices are dropping. That means the bang for the buck of the underlying hardware, and of the analytics software that runs atop it, matters.
The trouble is that benchmarking systems takes far too much time and costs a considerable amount of money as well. No one knows this better than the hardware and software vendors who have run the TPC family of online transaction processing and decision support benchmarks from the Transaction Processing Performance Council, which has been supplying industry-standard, audited benchmarks for enterprise systems since its founding in 1988. At the time, the minicomputer revolution was in full swing, the Unix market was poised to explode, and benchmarking was a bit like the wild west; vendors joined the TPC consortium to bring some order to the chaos.
This is precisely what the TPC hopes to do now with the TPCx suite, a new family of tests created to be easier to administer and much less expensive to run. The TPC is taking its inspiration for new system-level benchmarks from the Standard Performance Evaluation Corp, whose SPEC tests for measuring the oomph of processors on a variety of Java application, Web serving, virtualization, and parallel processing workloads are commonly used in the industry. With the SPEC tests, all of the code is given to the companies running them; they can do some optimizations, but there is not a lot of coding work to be done. With the TPC tests, vendors get a set of specifications and have to code the benchmark themselves, which can take many months and hundreds of thousands of dollars. Porting the benchmark code to new systems software or tuning it for each successive generation of hardware also takes time and money. Moreover, it can take as long as five years to get a new benchmark through the consortium and approved by its members as they argue its merits. What this means is that by the time a new benchmark test is delivered to the market, it can be obsolete. The whole TPC pace is too slow for the modern market.
So, explains Raghunath Nambiar, chairman of the TPC big data benchmark committee and a distinguished engineer at Cisco Systems, the TPC has come up with a new benchmark framework called TPCx, with the x being short for express. The idea, says Nambiar, is to take the kinds of tests that vendors are already using to do their comparisons and to wrap a framework around them that everyone will follow, while at the same time putting a little rigor into the pricing information for the systems under test.
The first TPCx test is called TPCx-HS, which is short for Hadoop Sort, and it uses the TeraSort benchmark as its foundation. The TeraSort test is now part of the open source Apache Hadoop distribution, and the baseline machine dates from May 2008, when techies at Yahoo! ran the TeraByte Sort test on a 910-node cluster running Red Hat Enterprise Linux and the Sun Java JDK to sort 10 billion records (weighing in at 1 TB) in 209 seconds. (That is about three and a half minutes.)
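The scale of that baseline run is easy to sanity-check. Here is a quick back-of-the-envelope sketch using the figures quoted above; the 100-byte record size is the TeraSort convention:

```python
# Sanity-check of the 2008 Yahoo! TeraByte Sort baseline described above.
records = 10_000_000_000   # 10 billion records
record_bytes = 100         # bytes per record (TeraSort convention)
elapsed_s = 209            # seconds for the full sort

total_bytes = records * record_bytes              # 1e12 bytes, i.e. 1 TB
throughput_gb_s = total_bytes / 1e9 / elapsed_s   # aggregate sort throughput

print(total_bytes)                 # 1000000000000
print(round(throughput_gb_s, 2))   # 4.78 GB/s across the 910-node cluster
```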
“Almost all vendors make claims based on the TeraSort benchmark,” says Nambiar. “The challenge for customers is that these claims are not comparable.”
The TeraSort test has three components, and the TPC has added two more to create the TPCx-HS benchmark. The parts of the TeraSort test are TeraGen, a MapReduce program that generates the random data to be sorted; TeraSort itself, which samples the data and sorts it; and TeraValidate, another MapReduce program that verifies the output of TeraSort is in fact sorted. TPCx-HS wraps a pre-run data check and a post-run data check around the sorting routine, just to make sure everything is working properly. The code is all written in Java, and there is nothing for vendors to do but load it onto their Hadoop distributions, systems, and networks of choice and let it rip. The three pieces have been renamed HSGen, HSSort, and HSValidate, and they do the bulk loading, sorting, and scanning of the data. The additional checking routines verify that the data is in fact stored in triplicate.
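At toy scale, the generate-sort-validate sequence those three phases implement can be sketched in a few lines of Python. This is only an illustration of the logic, not the actual Java MapReduce code that ships with the benchmark:

```python
import random

# HSGen analogue: generate records with random 10-byte keys (toy scale,
# in memory; the real benchmark writes terabytes into HDFS).
random.seed(42)
records = [(random.randbytes(10), f"payload-{i}".encode()) for i in range(1000)]

# HSSort analogue: sort the records by key.
sorted_records = sorted(records, key=lambda rec: rec[0])

# HSValidate analogue: scan the output and confirm it is globally ordered.
keys = [rec[0] for rec in sorted_records]
assert all(keys[i] <= keys[i + 1] for i in range(len(keys) - 1))
print("validated")
```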
As is the case with TPC decision support tests like TPC-D and TPC-H, the TPCx-HS test will use scaling factors to load progressively heavier work onto a cluster. Each record in the benchmark is 100 bytes, and the capacity scaling factors range from 1 TB (the baseline for the TeraSort test) up to 10 PB; the steps are 1 TB, 3 TB, 10 TB, 30 TB, 100 TB, 300 TB, 1 PB, 3 PB, and 10 PB. Nambiar says that the 1 TB scale factor requires around 6 TB of total disk capacity to run, because the Hadoop Distributed File System keeps three copies of the data and you need space for both the input and output data sets. You can do this today on a cluster with maybe three or four server nodes. The 10 PB test would require a cluster on the order of 1,000 nodes, about the size of the Yahoo! configuration from six years ago that processed 10,000 times less data. Nambiar expects the 10 TB version of the test to be the sweet spot for TPCx-HS, with vendors loading up the benchmark on clusters with 10 to 20 nodes, depending on the storage configurations in the machines.
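The capacity arithmetic above follows directly from the replication rule: three HDFS copies of the input data set plus three copies of the output, for a factor of six. A minimal sketch:

```python
def raw_capacity_tb(scale_factor_tb: float, replication: int = 3) -> float:
    """Rough raw disk needed for a run: the input data set and the output
    data set, each stored with HDFS's default three-way replication."""
    return scale_factor_tb * replication * 2  # input copies + output copies

# Scale factors in TB (1 PB = 1,000 TB); the 1 TB step needs ~6 TB of raw disk,
# as Nambiar notes above, and 10 PB needs ~60 PB.
for sf in [1, 10, 100, 1000, 10000]:
    print(sf, raw_capacity_tb(sf))
```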
The TPCx-HS test will produce a throughput metric that divides the scaling factor by the time it takes to run the test. Like other TPC tests, vendors will have to submit the cost of the cluster on which they run the test, and a relative price/performance metric will be published. Vendors also have to state when the machine is available, since hardware and software makers alike sometimes run benchmarks on hardware that is not yet actually shipping just so they can one-up their competition.
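A rough sketch of how such a throughput figure and its price/performance companion might be derived follows; the exact formula, units, and rounding rules are defined by the TPCx-HS specification, not here, and the run time and cluster price are hypothetical:

```python
def throughput(scale_factor: float, elapsed_s: float) -> float:
    # Scale factor divided by elapsed run time, per the article's description.
    return scale_factor / elapsed_s

def price_performance(system_price: float, perf: float) -> float:
    # Lower is better: dollars per unit of throughput.
    return system_price / perf

# Hypothetical example: a 10 TB scale factor run completed in one hour
# on a cluster priced at $250,000.
perf = throughput(10, 3600)
print(round(price_performance(250_000, perf), 2))
```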
The TPCx-HS test can be run on any software stack that supports the Hadoop runtime and MapReduce, as well as any file system that is API-compatible with the Hadoop Distributed File System. Alternative file systems that do other kinds of sophisticated data sharding or protection that do not involve making triplicates of data nonetheless have to store triplicates to run the TPCx-HS test. Vendors make two runs of the test at each scaling factor, and they are not allowed to make any configuration changes, tuning changes, or reboots between the runs. Companies will be encouraged to provide energy consumption statistics so the energy efficiency of the Hadoop machines can be assessed as well. The TPCx-HS test is not designed to be run on public clouds, says Nambiar, but you know that as soon as the code is available, someone is going to do it. The first TPCx-HS results are expected later this year.