September 10, 2013

IBM Ships Hadoop Appliance for the Big-Skills Challenged

Alex Woodie

One of the side effects of the massive run-up in Hadoop deployments is the creation of a big skills shortage. IBM claims to have addressed part of that shortage with the general availability of an all-in-one Hadoop appliance that will relieve users of some of the burdens of deploying, programming, and managing a Hadoop environment.

This Friday, IBM will begin delivery of its new PureData System for Hadoop, a prepackaged Hadoop appliance that it unveiled in April. The Intel X64-based system ships with a pre-installed copy of BigInsights 2.1, IBM’s Hadoop distribution. And, because it’s a member of IBM’s PureSystems line of products, it has “patterns” designed to speed the implementation process.

The appliance will be welcomed by organizations that want Hadoop but lack the confidence to deploy it, says Nancy Kopp, director of big data product strategy for IBM.

“If Hadoop is on your agenda, this is the best way to get an enterprise-class system up and running within your environment very quickly without having the issue of skills,” Kopp says. “The whole idea behind the PureData for Hadoop is to make Hadoop very consumable for the enterprise.”

Just how quickly can a PureData for Hadoop customer get up and running? As usual, your mileage will vary. But according to Kopp, it can be done in as little as 89 minutes, which was the time it took one of IBM’s first customers. “That’s a customer who’s already rolled their own system in the past,” she admits.

The PureData System for Hadoop H1001 is a stand-alone appliance that comes with 648 TB of capacity (assuming four-to-one compression on the disks). The frame is not expandable, but users can connect up to three of them together, taking the system into the petabyte range.
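
For readers who want to check the math, here is a back-of-the-envelope sketch of those capacity figures. The 648 TB and four-to-one compression numbers come from IBM; the function names are illustrative, and the sketch assumes that "up to three" means three frames in total and that a petabyte is 1,000 TB.

```python
# Figures quoted in the article.
EFFECTIVE_TB_PER_FRAME = 648   # capacity per frame, assuming 4:1 compression
COMPRESSION_RATIO = 4          # the four-to-one compression assumption
MAX_FRAMES = 3                 # assumption: "up to three" means three frames in total


def physical_tb_per_frame(effective_tb=EFFECTIVE_TB_PER_FRAME, ratio=COMPRESSION_RATIO):
    """Disk space implied for one frame once the compression assumption is removed."""
    return effective_tb / ratio


def cluster_effective_tb(frames, effective_tb=EFFECTIVE_TB_PER_FRAME):
    """Combined effective capacity of several connected frames."""
    return frames * effective_tb


if __name__ == "__main__":
    print(f"Implied disk space per frame: {physical_tb_per_frame():.0f} TB")
    total_tb = cluster_effective_tb(MAX_FRAMES)
    print(f"{MAX_FRAMES} frames: {total_tb} TB, or about {total_tb / 1000:.1f} PB")
```

Under those assumptions, a single frame holds roughly 162 TB before compression, and three connected frames reach about 1.9 PB of effective capacity, which is where the "petabyte range" comes from.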

In addition to BigInsights, the appliance comes with a bevy of related software products, including the BigSheets Web console, which provides an Excel-like front-end for data visualization; a collection of analytic “accelerators” for processing text, machine data, and data from social media; a copy of Optim, which can move data back and forth between Hadoop and a traditional data warehouse environment at speeds up to 14 GB per second; and a copy of Guardium for securing the data.

The Hadoop appliance undoubtedly spins up very quickly, which is a very nice thing if you don’t have the skills to do it yourself. One thing that the Hadoop appliance won’t automatically spin up for you is a MapReduce program. If you want to run MapReduce against your PureData for Hadoop system, you can, but you’re going to have to spend a few (hundred thousand) bucks to get the appropriately skilled programmer in place to do that for you.

However, not all customers will need MapReduce to make their Hadoop appliances worthwhile, IBM argues. As a result of the “Big SQL” capabilities in the latest release of its BigInsights Hadoop distro, IBM is enabling your standard SQL jockey to ride the Hadoop wagon.

“It’s all about consume-ability,” Kopp says of the ability to use SQL to extract meaningful information out of a Hadoop file system that’s loaded with reams of structured, semi-structured, and unstructured data. “The value proposition of that is you can actually enhance existing analytics with unstructured data in a way that you couldn’t before.”

It seems a bit odd, and slightly ironic, that Structured Query Language (SQL) could end up being one of the best ways to get at data that has little or no structure to it (which is a bit of a misnomer, because all meaningful data has some structure to it; otherwise it’s just meaningless gobbledygook). That brings us back to the looming skills shortage.

“My favorite quote of the year,” Kopp says, “was from Alistair Croll at Strata, who said that isn’t it ironic that the future of NoSQL is SQL? As we move into adoption, because of the [Hadoop and NoSQL] skills shortage, we had to find ways to leverage the existing skills base. What is the existing skill base in most shops now? It’s SQL. It’s not Java.”

IBM made a couple of other announcements that fit into this storyline, including its Information Governance Dashboard, which enables users to manage information from a single point for both SQL and non-SQL data. It also unveiled a new release of its InfoSphere Data Click software, which is used to rapidly provision sets of data for analysis purposes. With the new release, Data Click can provision data to both standard SQL and Hadoop environments.

