
A New Benchmark for Big Data
Database expert Chaitanya Baru keeps one foot in the world of high performance computing via his role at the San Diego Supercomputer Center (SDSC), and the other firmly planted on enterprise big data soil.
The data-intensive systems researcher has garnered a fair amount of attention lately with his plan to build a new stable benchmark for big data—one that pulls elements from both of those worlds he inhabits.
Baru and team’s stated mission with their BigData Top 100 project is to “provide academia with a way to evaluate new techniques for big data in a realistic setting; industry with a tool to drive development; and customers with a standard way to make informed decisions about big data systems.”
However, without identifying the variables and figuring out how to reflect how dynamic and frequently changing they are, the team would simply be creating another static benchmark, useful only in certain settings. This is the foundation upon which the benchmark’s value rests, Baru told us in a conversation last week.
He explained that the project will be iterative in nature, with the first benchmark serving as the basis for the next, and so on, until a constantly-shifting benchmark is created that maintains standardization in the wake of change. Baru says this will be an open benchmark-development process, based on input from a steering committee that balances industry and academic perspectives.
The BigData Top 100 effort will culminate in an end-to-end, application-layer benchmark for measuring the performance of big data applications, with the recognition that the benchmark itself must evolve to meet the needs of ever-changing applications. The final result of this benchmark-building effort will emerge this year, following critical input from the vendor, academic, and user communities tied to the project, which will decide how to capture the evolving elements and fold them into the benchmark itself.
Baru says that any new big data benchmark should factor in the addition of new feature sets, large data sizes, large-scale and evolving system configurations, shifting loads, and heterogeneous technologies of big data platforms.
According to the team, the following are critical elements of an ideal big data benchmark:
• Simplicity: Following the dictum that “Everything should be made as simple as possible, but no simpler,” the benchmark should be technically simple to implement and execute. This is challenging, given the tendency of any software project to overload the specification and functionality, often straying from the most critical and relevant aspects.

• Ease of benchmarking: The costs of benchmark implementation/execution and any audits should be kept relatively low. The benefits of executing the benchmark should justify its expense, a criterion that is often underestimated during benchmark design.

• Time to market: Benchmark versions should be released in a timely fashion in order to keep pace with the rapid market changes in the big data area. A development time of 3 to 4 years, common for industry consortia, would be unacceptable in the big data application space. The benchmark would be outdated and obsolete before it is released!

• Verifiability of results: Verification of results is important, but the verification process must not be prohibitively expensive. Thus, to ensure correctness of results while also attempting to control audit costs, the BigData Top100 List will provide for automatic verification procedures along with a peer-review process via a benchmark steering committee (a minimal sketch of such an automated check follows this list).
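To make the verification idea concrete, here is a minimal sketch of what an automated check might look like: a submitted result file is compared against a published reference digest, and reported timings are sanity-checked, before any human peer review. The manifest format, field names, and the verify_submission helper are hypothetical illustrations, not part of the BigData Top100 specification.

```python
# Hypothetical sketch of automated result verification for a benchmark submission.
# The manifest layout and field names are assumptions, not the BigData Top100 spec.
import hashlib
import json


def sha256_of_file(path: str) -> str:
    """Stream a result file and return its SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_submission(manifest_path: str) -> bool:
    """Check each reported result against its reference digest and confirm
    the reported elapsed times are plausible (positive) numbers."""
    with open(manifest_path) as f:
        # e.g. {"results": [{"file": ..., "reference_sha256": ..., "elapsed_s": ...}]}
        manifest = json.load(f)

    for entry in manifest["results"]:
        if sha256_of_file(entry["file"]) != entry["reference_sha256"]:
            print(f"digest mismatch: {entry['file']}")
            return False
        if entry["elapsed_s"] <= 0:
            print(f"implausible timing: {entry['file']}")
            return False
    return True


if __name__ == "__main__":
    print("submission valid:", verify_submission("submission_manifest.json"))
```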
At the core, says Baru, it is the non-static, concurrent development of the benchmark that differentiates it from the slew of other static application or hardware benchmarks out there. During our chat last week, he noted that while established communities like HPC have had decades to work out their own primary measurements (FLOPS, for example), big data benchmark efforts to date have been scattered, aimed at specific application areas, and based on inconsistent factors.
For instance, the Graph 500 has become a popular benchmark for graph problems, but the results of that algorithmic test would be meaningless when looking at the performance of a key-value-based problem. The same is true of the TeraSort benchmark, which applies only to a specific subset of real-world applications. On that note, the standard measurements for distributed systems, such as those used to rank high performance computing installations, are themselves not a fit for the new data-intensive world. Baru said that even for a traditional supercomputer center like SDSC, the real concerns are around the massive wells of data produced by scientific applications, including large simulations.
Further, being able to factor in price for performance is a critical element, he argued, noting that some of the more successful big data-oriented benchmarks factor this in but are too narrow in their scope. “While the performance of traditional database systems is well understood and measured by long-established institutions such as the Transaction Processing Performance Council (TPC), there is neither a clear definition of the performance of big data systems nor a generally agreed upon metric for comparing these systems,” he claims.
The vendor angle is important here since there is something in it for them: namely, a standard for comparing performance in the context of price. With the help of a team that includes researchers from Greenplum, Oracle, IBM, Cisco and others, Baru hopes to bring a new eye to price/performance metrics for large-scale data projects. The goal is to create a new standard for “big data” vendors to begin comparing their wares along a benchmark that is tailored to reflect real data-intensive workloads.
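To illustrate the kind of price/performance comparison Baru is after, here is a hedged sketch in the spirit of TPC-style metrics such as dollars per query-per-hour. The system costs, throughput figures, and the price_performance helper are purely illustrative assumptions, not numbers from the BigData Top100 effort.

```python
# Illustrative price/performance calculation in the spirit of TPC-style metrics
# (e.g. $/QphH). All figures below are hypothetical.

def price_performance(total_system_cost_usd: float, queries_per_hour: float) -> float:
    """Return dollars per query-per-hour: lower is better."""
    return total_system_cost_usd / queries_per_hour


if __name__ == "__main__":
    # Two hypothetical big data systems running the same workload.
    system_a = price_performance(total_system_cost_usd=1_200_000, queries_per_hour=45_000)
    system_b = price_performance(total_system_cost_usd=600_000, queries_per_hour=20_000)
    print(f"System A: ${system_a:.2f} per query-per-hour")  # ~ $26.67
    print(f"System B: ${system_b:.2f} per query-per-hour")  # ~ $30.00
    # System A costs more outright, yet delivers better price/performance,
    # which is exactly the kind of comparison a price-aware benchmark enables.
```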
It should be stressed that while there is vendor support behind the initiative, this is still a very academically-rooted effort—and one that takes its cues from some of the highest-performing systems on the planet. It began as an NSF-funded workshop at the Center for Large-Scale Data Systems Research and the San Diego Supercomputer Center.
Baru is a key figure at the supercomputing center, and he was working in the field of “big data” before it ever had the mainstream moniker, both with the Data Intensive Computing Environments (DICE) Group and at his earlier post in the early-to-mid 90s leading large-scale database research. During that tenure he was one of three IBMers who led the design and development of DB2 Parallel Edition, which hit the market in 1995 and shook up the database space. As head of large-scale data-intensive efforts underway now at SDSC, he is seeing the triple-V problems of big data firsthand at an extreme scale.
While the problems of supercomputing centers’ scientific simulations and applications might sound rather removed from the real-world concerns of enterprise shops struggling to keep up with their needs for big, fast data handling, there are some useful lessons Baru is bringing over from the big box world of supercomputing—at least in concept.
“We think that data is the real context for the FLOPS,” said Baru, noting that the traditional method of looking at the power of a system was to measure its floating-point operations per second (hence the acronym). While he says that hardware optimizations are still critical, the software stack needs to continue to evolve to meet the increasingly diverse needs of the scientific community.
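For reference, the traditional measure Baru alludes to is a simple piece of arithmetic: theoretical peak FLOPS falls straight out of a handful of hardware parameters. The cluster configuration below is a hypothetical example, not an SDSC machine.

```python
# Theoretical peak FLOPS from hardware parameters; the figures are illustrative only.

def peak_flops(nodes: int, sockets_per_node: int, cores_per_socket: int,
               clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak floating-point operations per second."""
    return nodes * sockets_per_node * cores_per_socket * clock_ghz * 1e9 * flops_per_cycle


if __name__ == "__main__":
    # A hypothetical 1,024-node cluster.
    peak = peak_flops(nodes=1024, sockets_per_node=2, cores_per_socket=16,
                      clock_ghz=2.5, flops_per_cycle=16)
    print(f"Theoretical peak: {peak / 1e15:.2f} PFLOPS")  # ~ 1.31 PFLOPS
    # A data-centric benchmark would instead rate how quickly the same machine
    # can move and process data, which is the gap the BigData Top100 aims to fill.
```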