May 8, 2014

IBM Finds the Need for (SQL on Hadoop) Speed

Alex Woodie

IBM will be joining Cloudera, Hortonworks, and others in the great SQL-on-Hadoop performance race when it ships Big SQL version 3 next month. In addition to peddling unadulterated speed, IBM will be touting security, data federation, and the capability for SQL-based BI tools, like Cognos, to get full access to Hadoop.

While traditional Hadoop version 1 engines like MapReduce have demonstrated the potential of Hadoop, enterprises today demand a broader set of interfaces into Hadoop that can be used with existing business intelligence tools and programmed by mere business analysts, as opposed to elusive (and expensive) data scientists. Most notably, organizations have demanded SQL, which effectively allows Hadoop to mirror the role of a traditional data warehouse, and to operate on structured data.

Hadoop distributors have responded to this demand by giving people what they want. IBM introduced Big SQL a year ago with the launch of InfoSphere BigInsights version 2.1. Before then, Cloudera began its Impala project, while Hortonworks sought to bolster Hive through the Stinger initiative.

Last week, IBM announced the latest incarnation of its Hadoop distribution, InfoSphere BigInsights version 3.0, which also includes Big SQL version 3.0. The new release of the Big SQL engine will bring “full function” SQL capabilities, and will allow users to run SQL on Hadoop in the same way they would for a traditional relational database, without requiring any changes to their apps, IBM says.

Specifically, IBM says Big SQL version 3 brings support for the SQL 2011 language, including support for stored procedures and user-defined functions. This brings Big SQL’s capabilities up to parity with what users expect of a data warehouse, says IBM distinguished engineer Linton Ward, who works with big data analytics in his role as the chief engineer for Power for workload optimized systems.

“What that means is you can now use these tools, like Cognos, that leverage SQL, to access Hadoop data,” Ward tells Datanami. “So Hadoop will still own the data, but it allows you to get SQL interfaces.”

It’s all about giving a broader group of people access to the new Hadoop repositories. “Are statisticians the best people to be writing Java code?” Ward asks. “Maybe some of the [big data] tooling will be aided by some of the conventional SQL tools out there that have been developed over the last couple of decades.”

Data federation in Big SQL version 3 will enable users to submit SQL statements that tap into other data sources. Big SQL will automatically create the wrappers that submit the SQL query to (and pull data back from) DB2 for LUW, Oracle, and Teradata. This data federation feature will also support IBM’s data warehouse products, PureData System for Analytics and PureData System for Operational Analytics.
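To make the federation idea concrete, here is a minimal sketch in Python that uses the standard library's sqlite3 module as a stand-in. It is not Big SQL syntax and says nothing about how IBM implements its wrappers; the table names, the file name, and the data are all hypothetical. It only illustrates the general pattern of a single SQL statement that joins a local table with a table held in a separate data source.

```python
import os
import sqlite3
import tempfile


def federated_totals():
    """Join a 'local' table with a table stored in a separate database file."""
    with tempfile.TemporaryDirectory() as workdir:
        # Hypothetical stand-in for an external warehouse (e.g. DB2 or Teradata).
        remote_path = os.path.join(workdir, "warehouse.db")
        remote = sqlite3.connect(remote_path)
        remote.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
        remote.executemany(
            "INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
        )
        remote.commit()
        remote.close()

        # The in-memory database plays the role of the Hadoop-resident data.
        local = sqlite3.connect(":memory:")
        # Attach the second source first, so it is visible under the schema name "wh".
        local.execute("ATTACH DATABASE ? AS wh", (remote_path,))
        local.execute("CREATE TABLE clicks (customer_id INTEGER, n INTEGER)")
        local.executemany(
            "INSERT INTO clicks VALUES (?, ?)", [(1, 10), (1, 5), (2, 7)]
        )

        # One statement spans both sources.
        rows = local.execute(
            "SELECT c.name, SUM(k.n) "
            "FROM clicks AS k JOIN wh.customers AS c ON c.id = k.customer_id "
            "GROUP BY c.name ORDER BY c.name"
        ).fetchall()
        local.close()
        return rows


if __name__ == "__main__":
    print(federated_totals())
```

The point of the sketch is that the caller writes ordinary SQL and the engine resolves which source each table lives in; a production federation layer would also have to deal with remote authentication, type mapping, and pushing work down to the remote system.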

Big SQL will also see security enhancements. Specifically, IBM has broadened the ways that authentication can be performed, and now supports authentication through the OS, through LDAP, or through custom authentication plug-ins. The fine-grained security policy in Big SQL prevents users from seeing rows and columns of data they don’t have permission to see, IBM says. All user activity can also be tracked and audited, while support for TLS ensures data is encrypted as it moves over the network.

IBM isn’t talking a lot at this point about the performance of Big SQL 3.0. In its announcement letter, IBM says the new interface will include “scale-out parallelism performance to hundred[s] of nodes.” It also talked about “extreme performance,” which doesn’t mean much.

IBM is expected to have new performance benchmark results to talk about when the product ships, which should be in June. (IBM, in its confounding way, essentially says it will release general availability information when the product is generally available.) “I think you’re going to see some pretty exciting performance coming out of that from the software team,” Ward says.

We’re in the midst of an SQL-on-Hadoop arms race, as vendors seek to differentiate their Hadoop offerings by building the best and highest performing SQL-on-Hadoop interfaces. There’s also an element of marketing and one-upmanship involved, which it appears IBM will be unable to resist partaking in.

In January, Cloudera touted internal benchmarks that showed its Impala SQL-on-Hadoop engine ran twice as fast as an unnamed commercial data warehouse system and 24 times faster than Apache Hive version 0.12. It also claimed that Impala scaled nearly linearly, at least up to 36 nodes. The company said it was working on another Impala test that scaled up to 1,000 nodes.
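A claim of "nearly linear" scaling can be checked with simple arithmetic: divide the observed speedup by the factor by which the cluster grew. The Python sketch below shows that calculation. The function name and the timings are hypothetical and are not taken from Cloudera's benchmark.

```python
def scaling_efficiency(base_nodes, base_seconds, nodes, seconds):
    """Return observed speedup divided by ideal (linear) speedup.

    A value of 1.0 means perfectly linear scaling; values just below 1.0
    are what "nearly linear" usually refers to.
    """
    speedup = base_seconds / seconds  # how much faster the larger cluster ran
    ideal = nodes / base_nodes        # how much larger the cluster is
    return speedup / ideal


if __name__ == "__main__":
    # Hypothetical timings: the same query set on 18 nodes and on 36 nodes.
    print(round(scaling_efficiency(18, 120.0, 36, 63.0), 3))
```

With these made-up numbers, doubling the cluster cuts the run time from 120 to 63 seconds, which works out to roughly 95 percent of ideal linear scaling.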

Last month, Hortonworks announced that SQL processing in the new Tez-based version of Hive, version 0.13, ran 100 times faster than Hive version 0.10 did when the Stinger initiative started 13 months ago. (The performance benefits versus Hive version 0.12 were not as great.)

As we get closer to Hadoop Summit, the presumed venue for the big unveil of Big SQL 3.0 and InfoSphere BigInsights 3.0, we’ll revisit the latest SQL performance claims.

