October 2, 2013

Eliminating Overhead Through Data Blending at the Source

Isaac Lopez

Predictive analytics is sometimes described as the ‘killer app’ of big data. Sophisticated algorithms take data collected in the past and extrapolate it to make predictions about the future. However, getting from raw data to the predictive stage can be a complicated, cumbersome ordeal.

One company, Pentaho, says it can take the sting out of the process by providing an integrated platform that handles everything from the data origination point all the way to analysis, eliminating some of the hassle along the way. The company recently released version 5.0 of its business analytics platform, which provides data blending at the source.

“It has become very apparent that the real value of big data is not necessarily in the data, but in the combination of that data with other relevant data from existing internal and external systems and sources,” explains Chuck Yarbrough, a technical solutions manager with Pentaho.

One of the primary challenges of blending data has been the complicated ETL process, in which data is shuffled from its original landing place into enterprise data warehouses or data marts before BI tools can derive any value from it. Pentaho says it is working to eliminate this step by blending big data at the source.

“We’re rapidly moving into an era of distributed data analytic architectures where data should remain in its most optimal store,” says Yarbrough. “The structural variety and volume make it extremely difficult and time consuming, delaying any valuable use of the data, plus it’s not economically feasible to store that volume.”

This becomes increasingly true as data volumes rise and the trend toward real-time analytics picks up steam. Competitive pressures increasingly require businesses to respond to data in real time, making the clumsy processes of extracting, merging, cleansing, transforming, and re-storing data a liability.

The folks at Pentaho say they’ve recognized these trends and responded with their 5.0 release, implementing data management tools that streamline the process and eliminate many of the hassles inherent in predictive analytics workflows.

They call the process “just-in-time blending.” What’s more, they claim that by using Pentaho Data Integration they can pull relevant, disparate data sets straight from their silos (e.g., an existing data warehouse or a NoSQL database) into the analytics engine without any additional cleansing or loading.
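To make the concept concrete, here is a minimal sketch of what blending at read time looks like in spirit: rows are pulled from two separate silos and joined in memory at query time, with no intermediate staging warehouse. This is an illustration only, not Pentaho’s implementation; the table, field names, and stand-in stores are assumptions.

```python
# Hypothetical "just-in-time blending" sketch: aggregate in the store that is
# good at it (SQL), then enrich each row from a NoSQL-style document store
# as the results stream out, with no staging or re-loading step.
import sqlite3

# Silo 1: a relational warehouse table of orders (in-memory stand-in).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (customer_id TEXT, amount REAL)")
warehouse.executemany("INSERT INTO orders VALUES (?, ?)",
                      [("c1", 120.0), ("c2", 75.5), ("c1", 30.0)])

# Silo 2: a NoSQL-style document store of customer profiles (dict stand-in).
profiles = {
    "c1": {"name": "Acme Corp", "segment": "enterprise"},
    "c2": {"name": "Beta LLC", "segment": "smb"},
}

# Blend at read time: no intermediate warehouse, no transitory copy.
blended = []
for customer_id, total in warehouse.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"):
    profile = profiles.get(customer_id, {})
    blended.append({"customer": profile.get("name", customer_id),
                    "segment": profile.get("segment", "unknown"),
                    "total_spend": total})

print(blended)
```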

Aside from eliminating the complicated setup process, Pentaho says its approach carries other benefits. Because the data stays in its original landing place, it retains whatever level of data governance and security it was given when it was first stored, making audits easier. And without the need for transitory processing, the company says it can deliver real-time analysis.

According to Pentaho engineer Matt Casters, the company is able to accomplish its data blending scheme through the creative use of SQL. “At first glance it seems that the worlds of data integration and SQL are not compatible,” said Casters in a recent article. “However, SQL itself is a mini-ETL environment on its own as it selects, filters, counts and aggregates data. So we figured that it might be easiest if we would translate the SQL used by the various BI tools into Pentaho Data Integration transformations.”
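A rough sketch of the idea Casters describes: treat a BI tool’s SQL as a mini-ETL recipe whose clauses become pipeline steps (filter, group, aggregate) run directly against rows from the source. This is not Pentaho’s actual translation layer; the step names, query, and data below are illustrative assumptions.

```python
# Conceptual translation of:
#   SELECT region, SUM(sales) FROM rows WHERE product = 'widget' GROUP BY region
# into transformation steps applied one after another to rows in their source.
from collections import defaultdict

rows = [
    {"region": "east", "product": "widget", "sales": 100},
    {"region": "west", "product": "widget", "sales": 250},
    {"region": "east", "product": "gadget", "sales": 80},
]

def filter_step(stream, predicate):
    # WHERE clause -> a filtering transformation.
    return (r for r in stream if predicate(r))

def group_and_sum_step(stream, key, measure):
    # GROUP BY + SUM -> an aggregating transformation.
    totals = defaultdict(float)
    for r in stream:
        totals[r[key]] += r[measure]
    return [{key: k, f"sum_{measure}": v} for k, v in totals.items()]

filtered = filter_step(rows, lambda r: r["product"] == "widget")
result = group_and_sum_step(filtered, key="region", measure="sales")
print(result)  # [{'region': 'east', 'sum_sales': 100.0}, {'region': 'west', 'sum_sales': 250.0}]
```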

It’s a clever approach, but whether it delivers as promised remains an open question. Eliminating the prep work traditionally required for predictive analytics sounds almost too good to be true, but if it holds up, it could go a long way toward bringing predictive analytics down to earth.

Related items:

Pentaho Goes All In with Big Data Blending

Datawatch’s Big Visualization Strategy for Data

ScaleOut Ships Its Own MapReducer for Hadoop
