TPC Accuses Nvidia of Violating Rules in Benchmark Test
The Transaction Processing Performance Council (TPC) this week accused Nvidia of violating its fair use policy by altering a big data analytics benchmark test last year and failing to submit official results. Nvidia has denied any wrongdoing and said it is working with the TPC to submit official results and settle the disagreement.
Last June, Datanami wrote about the test results that Nvidia published for its DGX A100 systems running a pair of TPCx-BB tests, which emulate Hadoop clusters that mix SQL and machine learning workloads on structured and less-structured data.
Nvidia claimed that its GPU systems “shattered” the benchmark. It said that its DGX A100 system ran 37x faster than the previous record holder on the smaller test, which simulated a 1TB dataset, and ran 19.5x faster on the larger test, which simulated a 10TB dataset.
In a real-world setting, Nvidia claimed its systems would result in millions of dollars in savings on hardware and power, and provide results in minutes instead of hours.
Nearly seven months later, Nvidia has not provided documentation of the tests sufficient to satisfy the TPC, which said the results Nvidia published in June remain unofficial and should be considered invalid.
“Since NVIDIA has not published official TPC results, and instead compared results from derived workloads to official TPC results, the comparisons are invalid,” the organization stated in a press release on January 27, 2021.
TPC claimed that by publishing unofficial results, Nvidia violated its fair use policy. That policy, which can be read here, states: “The TPC label may be applied to only fully legitimate Results, used in a fair manner. Any use of the TPC name in conjunction with published results must ONLY be used with official published results, available on the TPC web site.”
“We are aware of the TPC’s claim that NVIDIA violated its fair use policy by comparing unofficial results to official results. We are confident that the improvements we demonstrated accurately reflect the benefit provided by the use of GPUs. We are working with the TPC on an official submission,” an Nvidia spokesperson tells Datanami.
The spokesperson further explained that the changes Nvidia made to the test were intended to enable the benchmark to run on GPUs.
In June, Nvidia engineers published a blog post on Medium in which they described how they worked for months to prepare the Rapids-based software stack for the test. That blog has since been deleted. “We implemented the TPCx-BB queries as a series of Python scripts utilizing the RAPIDS dataframe library, cuDF, the RAPIDS machine learning library, cuML, CuPy, and Dask as the primary libraries,” the Nvidia engineers wrote in the since-deleted Medium blog. “We relied on Numba to implement custom-logic in user-defined functions, and we relied on spaCy for named entity recognition. These results would not be possible without the RAPIDS community and the broader PyData ecosystem.”
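To give a sense of what such a re-implementation looks like, the sketch below expresses a SQL-style aggregation as dataframe code. It uses pandas, whose dataframe API cuDF deliberately mirrors (on a GPU system, `import cudf` would stand in for `import pandas`); the table and column names are hypothetical and are not taken from the actual TPCx-BB kit.

```python
# Illustrative sketch only: a SQL-style aggregation expressed as dataframe
# code. cuDF mirrors the pandas API, so on a GPU system `import pandas as pd`
# could be swapped for `import cudf as pd`. Table and column names are
# hypothetical, not from the TPCx-BB kit.
import pandas as pd

sales = pd.DataFrame({
    "item_id":  [1, 1, 2, 2, 3],
    "quantity": [5, 3, 7, 1, 4],
    "price":    [9.99, 9.99, 4.50, 4.50, 2.25],
})

# SQL equivalent:
#   SELECT item_id, SUM(quantity * price) AS revenue
#   FROM sales GROUP BY item_id ORDER BY revenue DESC;
sales["revenue"] = sales["quantity"] * sales["price"]
top_items = (
    sales.groupby("item_id", as_index=False)["revenue"]
         .sum()
         .sort_values("revenue", ascending=False)
)
print(top_items)
```

At scale, Dask partitions such dataframes across workers, which is how a single-node script of this shape can be spread over a cluster of GPUs.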
Disputes over server benchmark results are not unheard of, and numerous accusations of testbed manipulation have been made over the years. The disputes often arise over the testing of systems that have not yet been sold; TPC requires that tested systems, including hardware and software, be available for purchase within six months of the test publication date. The use of non-standard hardware or software configurations, and the inability of an independent body (such as the TPC) to audit or verify those configurations, are other common sources of benchmark disputes.
According to Chris Elford, who’s the chairman of the TPCx-BB subcommittee, Nvidia’s modifications of the SQL queries should have been cleared with the TPC organization before Nvidia published the results.
“You have to run that kit unmodified in order to publish it,” Elford says. “If you want to get changes to the kit, you have to go back to the TPC and get them to review it. And that would go out in a new package.”
TPC isn’t against vendors making any modifications to the SQL at the core of the benchmark. Being flexible encourages more vendors to take the test, Elford says. Several years ago, the organization spent between six months and a year working with Alibaba to hammer out a way to test its hosted Hadoop solution.
In Nvidia’s case, Elford said the company re-implemented the TPCx-BB queries using Python and the Rapids BlazingSQL engine to get the queries to run on GPUs. It’s unclear to what extent the queries were optimized during that process. According to Elford, the SQL queries that TPC uses in its test are not optimized because that’s not how SQL looks in the real world.
“What would need to happen, were they to pursue it in the committee, is we would look at the proposed syntax, and have a discussion whether it is a change that has a significant impact to the benchmark or not,” Elford tells Datanami. “We wouldn’t say we wouldn’t take any changes. But we have to be a little conservative.”
However, even if Nvidia had run the exact same SQL queries as defined by TPCx-BB, the company violated other standards when it came to publicizing the results, according to Elford.
There are three main components of the TPCx-BB benchmark: the database load time, the power run (which involves running a set of 30 queries in succession), and the throughput run (which involves running multiple query streams at the same time). But according to Elford’s understanding of Nvidia’s test, the company accounted for neither the load time nor the throughput run.
“They concentrated just on the power runtime, which, for doing engineering analysis–we do that type of thing for our own analysis. We don’t publish it, but it’s a useful metric,” says Elford, who works at Intel. “The problem is…if you cherry-pick out just the power runtimes, you’re not really doing justice to the benchmark.”
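The cherry-picking objection can be made concrete with a toy composite score that folds all three phases into one number. The formula below is an illustrative simplification, not the official TPCx-BB metric, and all timings are invented:

```python
# Illustrative only: a simplified composite score showing why the load and
# throughput phases matter. This is NOT the official TPCx-BB metric, and
# every number below is made up.
import math

def composite_score(load_s, power_s, throughput_s):
    """Lower phase times yield a higher score. Because all three phases
    enter the denominator, a fast power run alone cannot dominate."""
    return 3600.0 / (load_s + math.sqrt(power_s * throughput_s))

# System A: very fast power run, but slow load and throughput phases.
a = composite_score(load_s=900, power_s=100, throughput_s=1600)

# System B: slower power run, but balanced across all phases.
b = composite_score(load_s=200, power_s=400, throughput_s=400)

# A wins on the power run alone (100s vs 400s), yet B scores higher overall.
print(a, b)
```

Under any metric of this shape, quoting only the power runtime can flip the apparent winner, which is the substance of Elford’s complaint.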
In addition to modifying the SQL and publishing partial benchmark results without submitting official ones, Nvidia went a step further by making pricing comparisons against other TPCx-BB results, which is another no-no, according to Elford.
“If they do publicize it, they need to be very clear to follow TPC fair use policy to say ‘We’re not comparable’ and explain how…they’re accounting for the power run, not accounting for throughput, and we rewrote the queries in Python and not SQL,” Elford says. “What they did would have been fine if they had said, number one, here’s how we’re differing.”