June 10, 2013

Using Hadoop to Augment Data Warehouses with Big Data Capabilities

Alex Woodie

If you run a data warehouse at your organization, you may be wondering how the latest big data technologies, such as Hadoop, can benefit your information analysis. According to IBM product manager Vijay Ramaiah, there are several ways that Hadoop and related tools can augment an existing data warehouse and deliver new analytical capabilities along the way.

Organizations that have already invested lots of time and money into building a data warehouse may be good candidates for augmenting their warehouse with a Hadoop-based system if they face one of several circumstances, Ramaiah, who is the product manager for IBM’s big data portfolio, says in a recent video.

When an organization is “drowning” in big data, or throwing away data because it lacks the capability to store and process it, that may be a good time to front-end an existing data warehouse with a Hadoop repository, Ramaiah says. Similarly, if an organization is using the warehouse to store all of its data, including cold or rarely accessed data, it may be better off shunting that data over to Hadoop. Organizations that want to analyze non-operational data; that want to explore large and complex sets of data; or that are looking to delay a data warehouse upgrade are also good candidates.

One effective way of using Hadoop with an existing data warehouse is to use Hadoop as a “landing zone” for big, raw data, Ramaiah says. “Instead of taking all this directly into your warehouse or other aspects of your enterprise environment, what if you could bring all this data, land it in Hadoop, use it as a place where you can do some pre-processing of this data, and then determine if you take it on to other systems?” he asks in the video.
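The landing-zone pattern Ramaiah describes can be sketched in a few lines. The example below is a hypothetical, single-process illustration (the event format, field names, and malformed record are invented): raw JSON records land as-is, a pre-processing pass discards records that fail to parse, and only the cleaned, flattened rows move on toward the warehouse.

```python
import csv
import io
import json

# Hypothetical raw feed as it might land in the Hadoop "landing zone":
# one JSON record per line, with the occasional malformed entry.
RAW_EVENTS = [
    '{"user": "u1", "action": "view", "ts": "2013-06-01T10:00:00"}',
    'not-valid-json',  # bad records are common in raw feeds
    '{"user": "u2", "action": "purchase", "ts": "2013-06-01T10:05:00"}',
]

def preprocess(lines):
    """Parse raw JSON lines, drop malformed records, and keep only
    the fields a warehouse schema would expect."""
    cleaned = []
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # leave bad records behind in the landing zone
        cleaned.append((record["user"], record["action"], record["ts"]))
    return cleaned

def to_csv(rows):
    """Render the cleaned tuples as CSV, ready for a bulk load."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

print(to_csv(preprocess(RAW_EVENTS)))
```

On a real cluster the same filter-and-flatten step would run as a distributed job over files in HDFS, but the decision it embodies is the one Ramaiah highlights: inspect and clean data in Hadoop first, then decide what is worth taking on to other systems.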

The second common job for Hadoop in existing data warehousing environments is using Hadoop to perform data discovery and analytics on combinations of structured, semi-structured, and unstructured data, including real-time streaming data (possibly in conjunction with IBM’s text analytics engine). Since most data warehouses require structured data, this is an area where Hadoop and other big data tools can bring net new capabilities to an organization.
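Since most warehouses expect structured rows, the discovery step usually means deriving structure from text. The sketch below is purely illustrative (the ticket wording, the regular expression, and the field names are all invented for the example): it pulls typed fields out of free-text records so that ordinary query tools can work with them.

```python
import re

# Invented free-text records standing in for unstructured data.
SUPPORT_TICKETS = [
    "Customer ACME reported error E1042 on 2013-05-30, severity high",
    "Customer Globex reported error E0007 on 2013-06-02, severity low",
]

# A hypothetical extraction pattern naming the fields we want as columns.
TICKET_RE = re.compile(
    r"Customer (?P<customer>\w+) reported error (?P<code>E\d+) "
    r"on (?P<date>\d{4}-\d{2}-\d{2}), severity (?P<severity>\w+)"
)

def extract(tickets):
    """Turn free-text tickets into structured rows; skip non-matches."""
    rows = []
    for ticket in tickets:
        match = TICKET_RE.match(ticket)
        if match:
            rows.append(match.groupdict())
    return rows

for row in extract(SUPPORT_TICKETS):
    print(row)
```

A production text analytics engine does far more than one regular expression, but the output is the same in kind: structured rows distilled from data a traditional warehouse could not ingest directly.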

The third common way customers with existing data warehouses use Hadoop is to run their existing query tools against the columnar data store. “It’s a very effective way to do analytics,” Ramaiah says. “The MapReduce technology provides great performance. What would previously take you weeks and days now takes minutes and hours.”
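The MapReduce performance Ramaiah cites comes from splitting one job into two small functions that can run on many nodes at once. The following is a minimal, single-process sketch of that pattern (a word count, the standard teaching example, not IBM's implementation): `map` emits key/value pairs, a shuffle groups them by key, and `reduce` aggregates each group.

```python
from collections import defaultdict

def map_fn(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """Reduce phase: aggregate all values emitted for one key."""
    return key, sum(values)

def run_mapreduce(lines):
    groups = defaultdict(list)
    for line in lines:                     # map runs per input record
        for key, value in map_fn(line):
            groups[key].append(value)      # shuffle: group values by key
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(run_mapreduce(["big data big analytics", "big warehouse"]))
# prints {'big': 3, 'data': 1, 'analytics': 1, 'warehouse': 1}
```

Because each map call sees only one record and each reduce call only one key's values, a cluster can spread both phases across hundreds of machines, which is where the weeks-to-minutes speedup comes from.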

Ramaiah advises organizations to start small with their Hadoop-based data warehouse augmentations, and grow from there. Given the large volume, velocity, and variety of big data, most projects will benefit from master data management (MDM) and data lifecycle management tools.

Organizations can assemble the various components they need as projects and budgets dictate, eliminating the need for a “big bang” big data project, according to Ramaiah. IBM’s distribution of the open source Hadoop framework, dubbed InfoSphere BigInsights, includes additional components and capabilities in the areas of text analytics, performance and workload optimization, data visualization, developer and administrative workbenches, enterprise application connectors and accelerators, and security.

Other big data products from Big Blue that might be used in a data warehouse augmentation project include InfoSphere Information Server, Optim, and Guardium.

Related items:

Hadoop Sharks Smell Blood; Take Aim at Status Quo

Hadoop Distros Orbit Around Solr

The Transformational Role of the CIO in the New Era of Analytics