April 16, 2013

ScaleOut Building Real-Time Bridge to Hadoop with hServer

Isaac Lopez

The real-time operational world can benefit from Hadoop’s power to analyze large data sets in parallel, says in-memory data grid company ScaleOut Software, which today launched a new platform called hServer that the company says bridges real-time analytics with Hadoop.

“The problem we see is that Hadoop is really not targeted for real-time analytics,” said Bill Bain, CEO of ScaleOut Software. “It’s targeted for offline data, and often that data is being copied in from a data warehouse or another system. It’s not sourced directly from an operational system, so it’s not data that’s changing, it’s data that is static.”

ScaleOut’s hServer, says Bain, addresses this by integrating an in-memory data grid with Hadoop, placing it between HDFS and Hadoop’s MapReduce engine. According to Bain, this makes it possible to run Hadoop-style MapReduce analysis on live data stored in the grid, as well as real-time analysis enabled by transparent caching of HDFS data.

“We’re bridging Hadoop to the real-time analytics usage model using an in-memory data grid to host the data,” explained Bain. “We’re also providing the ability to transparently cache the HDFS data so that you can do multiple ‘what if’ tests on an in-memory data set as a separate usage model from the fast changing operational data. You can use our grid to cache HDFS and speed up access to data in Hadoop so you can do these ‘what if’ tests more rapidly as you change your Hadoop algorithm.”

Chief Operating Officer David Brinker says the hServer platform is being driven by a market need for real-time analytics on fast-changing data.

“Most of our customers historically have used our product to hold fast changing data that’s being used in their operation – so live, operational data – and they’ve used it to scale their application performance,” explained Brinker. “As we’ve worked with them over the years, it’s become evident that they would also like to do analysis of that data that’s being held in the grid while it’s changing.”

Giving an example, Brinker described a hedge fund management customer that holds its hedging strategies in the grid as trading data streams in. “So you’ve got this real-time view of each of your strategies, and at the same time, there is continuous MapReduce-style analysis being run to make sure that the strategy is in range. If a particular company moves out of range and a strategy rule is violated, the system will alert the trader for action.” While this customer works with human traders, Brinker notes, the system could just as easily send a trade order to an algorithmic high-frequency trading system. He says other applications include e-commerce, reservation systems, credit card fraud detection, and more.

“The thing that makes an in-memory grid good for this is, first of all, the data is in memory, so you get in-memory speed,” comments Brinker. “The second is that we can do this continuous MapReduce-style computation without moving the data – data in motion is pretty much the enemy of high performance – you get a lot faster throughput by being able to analyze the data in place.”

Brinker is careful to note that when he references “MapReduce-style analysis,” he’s referring to ScaleOut’s own proprietary form of MapReduce, which the company calls Parallel Method Invocation and which he says is “conceptually very similar to Hadoop’s MapReduce.”

The company says that hServer includes an open source Java library with a grid record reader and a grid record writer that let mappers read key-value pairs directly from the grid and put key-value pairs back into it. In other words, a Hadoop program can pull key-value pairs straight from the grid into its mappers for analysis and then store the results back in the grid through the same Java library.
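For readers who want a concrete picture of how that wiring might look, the sketch below configures a standard Hadoop job whose mappers read from and write to the grid. It assumes Hadoop’s ordinary Java MapReduce API; the class names GridInputFormat and GridOutputFormat stand in for the grid record reader and writer described above and are illustrative placeholders, not ScaleOut’s actual API.

    // Illustrative sketch only: GridInputFormat and GridOutputFormat are
    // placeholder names for the grid record reader/writer, not ScaleOut's
    // actual classes. Everything else is standard Hadoop.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class GridBackedJob {

        // An ordinary Hadoop mapper; it neither knows nor cares that its
        // input records come from the in-memory grid rather than from HDFS.
        public static class StrategyMapper extends Mapper<Text, Text, Text, Text> {
            @Override
            protected void map(Text key, Text value, Context context)
                    throws java.io.IOException, InterruptedException {
                // ... analyze the live key-value pair and emit a result ...
                context.write(key, value);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "grid-backed analysis");
            job.setJarByClass(GridBackedJob.class);
            job.setMapperClass(StrategyMapper.class);

            // Plug the grid record reader/writer in as the job's input and
            // output formats (placeholder class names).
            job.setInputFormatClass(GridInputFormat.class);
            job.setOutputFormatClass(GridOutputFormat.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }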

“We’re using some fairly sophisticated techniques under the hood to avoid network hops so that we minimize overhead to make these work very quickly,” Brinker explained. “Also, other applications can touch these key-value pairs using the standard APIs that we’ve been shipping with our products.”

The library also includes a wrapper for the standard HDFS record reader (or any other input source), so that a regular record reader can be wrapped with what ScaleOut calls a “data set record reader.” Bain says that with a two-line change to the program, ScaleOut’s data set record reader wraps the regular record reader so that, as data is ingested by the mappers, it is also stored on the side in the grid for future access. When the program is run again, hServer automatically detects whether the collection of key-value pairs is already in the grid and whether the underlying data has changed in HDFS; if it has not, hServer serves the data directly from the grid, reducing access latency.
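A rough sketch of that wrapping idea, again using Hadoop’s standard Java API, might look like the following. The class name DatasetRecordReader and the GridCache.put call are hypothetical placeholders used only to illustrate the delegate-and-cache pattern; they are not ScaleOut’s actual classes.

    // Hypothetical illustration of a caching record reader: it delegates to
    // the normal HDFS record reader while copying each key-value pair into
    // the in-memory grid as a side effect, so a later run can read from memory.
    import java.io.IOException;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class DatasetRecordReader<K, V> extends RecordReader<K, V> {

        private final RecordReader<K, V> inner;   // the wrapped HDFS record reader

        public DatasetRecordReader(RecordReader<K, V> inner) {
            this.inner = inner;
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            inner.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            boolean hasNext = inner.nextKeyValue();
            if (hasNext) {
                // Side effect: store the pair in the grid so the next run of
                // the job can be served from memory instead of HDFS.
                // GridCache is a placeholder for whatever grid client API is used.
                GridCache.put(inner.getCurrentKey(), inner.getCurrentValue());
            }
            return hasNext;
        }

        @Override
        public K getCurrentKey() throws IOException, InterruptedException {
            return inner.getCurrentKey();
        }

        @Override
        public V getCurrentValue() throws IOException, InterruptedException {
            return inner.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return inner.getProgress();
        }

        @Override
        public void close() throws IOException {
            inner.close();
        }
    }

In this picture, the “two-line change” Bain describes would amount to constructing such a wrapper around the job’s existing record reader (typically via a wrapping input format) and pointing the job at it.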

“We’ve done some fairly exhaustive testing of this and found that we can drive the access latency down by something like a factor of 11 in the TeraSort benchmark with its key value pairs over just purely accessing it from HDFS,” commented Bain, who further notes that while this is a jump forward in terms of access latency, it’s not so remarkable in terms of overall Hadoop execution time.

“In future releases, we’ll address other overheads in the Hadoop execution cycle,” he explained. “These include the batch scheduling, the reduction time, the shuffle time, and so forth to really drive down the overall execution time to real-time computing so that you can get results in a few seconds instead of minutes or even hours.”

Brinker says the company is beginning distribution immediately, offering a free community edition (supported via a community forum), as well as several commercial editions that can be licensed.

Related items:

Ephemeral, Fast Data Finds Home in Memory 

Actian Aims at Being Disruptive in Big Data 

Baldeschwieler: Looking at the Future of Hadoop 
