February 23, 2012

Pervasive Technologist Sheds Light on Predictive Analytics

Robert Gelber

With all the hype surrounding Hadoop and what it can do for big data, the platform can still be challenging for data scientists unfamiliar with MapReduce coding.

Pervasive Software’s latest offering attempts to increase the adoption of Hadoop by taking it to a higher level.

Analytics vendors have been steadily adding Hadoop support to their offerings, and many partner with multiple Hadoop distributors to make their applications simpler to operate and more functionally robust than competing products.

Today, Pervasive announced Pervasive RushAnalyzer, aimed at helping data scientists, business analysts and big data developers alike build predictive analytics. The software can be deployed on multiple platforms, including Hadoop clusters, high-performance servers and desktops.

Features include direct access to files and databases, and an architecture that, Pervasive claims, eliminates memory constraints and the need for separate data stores before analysis. Along with Hadoop compatibility, users can also work with SPSS, SAS and R.

RushAnalyzer also appears to focus on ease of use with its visual drag-and-drop environment. Prebuilt operators ship with the software to help users identify patterns and build predictive models.

Mike Hoskins, Pervasive CTO and general manager of big data products and solutions, said: “The Hadoop framework is exceptionally powerful, but the skills needed to take advantage of Hadoop’s assets are in short supply. Pervasive RushAnalyzer delivers powerful performance without the need for MapReduce coding.”

Pervasive is not the first vendor to offer higher-level Hadoop support, but the company has been gaining traction over the past year with a series of announcements meant to broaden adoption of Hadoop and of big data technology more generally.

To understand more about this announcement and Pervasive’s strategy in general, we talked with the company’s Chief Technologist, Jim Falgout.

How does this help “prepare” the data for analysis? Also, can you elaborate on the following quote: “The product’s dataflow architecture eliminates any memory constraints, as well as the need for separate stores before analytics are run”? Please give us a fairly in-depth explanation of what you mean here and what it means for the user.

RushAnalyzer has built-in operators specifically for data prep (joins, lookups, aggregations, missing values, etc.). Many analysis frameworks require that the data being worked with fit into memory. DataRush, which RushAnalyzer is built upon, is data scalable and will work with data sets of any size with virtually no limitations. DataRush has been used successfully to prepare and analyze data sets from gigabytes to multiple terabytes in size.

On the second point, many analysis frameworks require that data be imported into the framework’s own data storage system before it can be used. DataRush does not impose this requirement and can access many different types of data sources directly. When required, DataRush does support an intermediate data staging format that can be accessed in parallel. This format can be used for efficiently staging data between workflows.
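To make Falgout’s point concrete: DataRush itself is a proprietary parallel Java engine, but the reason a streaming dataflow avoids the fit-in-memory requirement can be sketched in a few lines of Python. The data source, field names and record counts below are invented for illustration; the key property is that memory use scales with the number of distinct groups, not with the number of input rows.

```python
from collections import defaultdict

def record_stream():
    # Stand-in for reading directly from a file or database cursor:
    # records are yielded one at a time, never held in memory as a whole.
    for i in range(1_000_000):
        yield {"region": f"r{i % 4}", "amount": i % 100}

def streaming_aggregate(records, key_field, value_field):
    # Keep only running totals per group. A million-row (or billion-row)
    # input needs the same memory as a hundred-row one, as long as the
    # number of distinct keys stays small.
    totals = defaultdict(int)
    for rec in records:
        totals[rec[key_field]] += rec[value_field]
    return dict(totals)

totals = streaming_aggregate(record_stream(), "region", "amount")
```

The same shape applies to joins, lookups and missing-value handling: each operator consumes a stream and emits a stream, so no separate store has to be populated before the analytics run.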

Where does this fit in with other Pervasive products, most notably your DataRush software—and more importantly, how does this fit with other common code, including open source packages like R?

RushAnalyzer is built on top of the DataRush engine, so it inherits all the performance advantages and extensibility that come with this technology. It also leverages some open source software and, specifically, allows existing R code to be executed as part of any workflow including visualizations. Additionally, RushAnalyzer reads/writes SAS data files and so could be used to supplement SAS and other tools.

You mention this is geared toward creating predictive analytics solutions on clusters, desktops, etc. Can you be more specific and put this in some kind of industry use case *hypothetical* (i.e., in a life sciences, oil and gas, or other large enterprise setting)? What does it involve setup-wise, and what is the final result that can’t be attained through similar offerings that leverage Hadoop?

We are not aware of any other product that delivers extreme performance on conventional hardware, Hadoop clusters and non-Hadoop clusters alike. RushAnalyzer is so efficient at using all the compute cycles built into today’s commodity hardware that you can easily consume and process millions of rows per second on a single 8-core box.

Use cases are broad, but they include fraud detection in healthcare, CDR analysis in telecoms, market basket analysis, etc. The same results can be attained with Hadoop, but would require 2x or more the number of nodes to churn through the data in the same time – and could not be “right-sized” for use by departments/organizations that are not ready to deploy Hadoop.

Who is something like this aimed at? You say data analysts, but can you be more specific? Is this a specialist or a more data-generalist approach? (i.e., address ease of use: who ideally would be using it, since you say part of the appeal is reducing the need for MapReduce work, etc.)

We’re targeting multiple analytics-related roles with this product because, at least in larger organizations, the end-to-end creation and use of analytics requires several skill sets. The visual environment, combined with all the ready-to-use data mining operators, is ideal for data prep staff, data analysts and data scientists (although the latter often have the skills/need to use the other interfaces, below) – they can quickly build simple and complex workflows in the drag-and-drop environment and immediately run them and share them with colleagues and business managers.

The scripting interface to the underlying DataRush analytics engine allows developers to create custom operators and make them immediately usable (and reusable) in the UI for less technical staff. The engine API goes one step further in allowing any of these analytic workflows to be embedded within operational systems so that existing enterprise applications can be enhanced with robust predictive intelligence that was never possible before.
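Falgout describes a layered model: prebuilt operators for the visual environment, a scripting interface for developers to add custom operators, and an engine API for embedding workflows. The actual DataRush interfaces are not shown in the article, so the sketch below is a hypothetical Python analogue of that pattern: operators share one streaming interface, a developer-written operator plugs in beside a “built-in” one, and a small runner chains them, which is roughly what lets less technical staff reuse custom operators in a drag-and-drop UI.

```python
from abc import ABC, abstractmethod

class Operator(ABC):
    """One step in a dataflow: consumes a record stream, yields a record stream."""
    @abstractmethod
    def process(self, records):
        ...

class FillMissing(Operator):
    # Example of a prebuilt data-prep operator: replace None with a default.
    def __init__(self, field, default):
        self.field, self.default = field, default
    def process(self, records):
        for rec in records:
            if rec.get(self.field) is None:
                rec = {**rec, self.field: self.default}
            yield rec

class ZScore(Operator):
    # Example of a developer-written custom operator: add a standardized column.
    def __init__(self, field, mean, stdev):
        self.field, self.mean, self.stdev = field, mean, stdev
    def process(self, records):
        for rec in records:
            z = (rec[self.field] - self.mean) / self.stdev
            yield {**rec, f"{self.field}_z": z}

def run_pipeline(records, operators):
    # Chain the operators lazily; records flow through one at a time,
    # so the pipeline never materializes the full data set.
    stream = iter(records)
    for op in operators:
        stream = op.process(stream)
    return list(stream)

rows = [{"amount": 10}, {"amount": None}, {"amount": 30}]
out = run_pipeline(rows, [FillMissing("amount", 0), ZScore("amount", 0.0, 10.0)])
```

Because every operator honors the same stream-in/stream-out contract, a custom operator is immediately composable with the built-in ones, and the whole pipeline can be invoked from application code, which mirrors the embedding scenario Falgout describes.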