October 15, 2014

DataTorrent Raises the Bar for Real-Time Streaming

There’s a lot of talk these days about real-time streaming applications. If analyzing and acting on new data is good, then doing it immediately must be better. The truth is, building real-time streaming applications is not easy work. One company that’s raising the bar in this area is DataTorrent.

DataTorrent develops a distributed Hadoop 2 application called Real Time Streaming (RTS) that enables users to act upon and analyze time-sensitive data, including call data records, log and machine data, clickstream data, or practically any other piece of data that’s accessible through a database or message bus. The YARN-compatible software is often deployed alongside Kafka or Flume, and processes each piece of data as it’s ingested in Hadoop. RTS includes a library of Java-based operators that developers can arrange within the real-time data pipeline to have the desired effect.

When RTS 1.0 became generally available earlier this year, proving scalability was the top priority. In tests on a 34-node Hadoop cluster, RTS 1.0 was able to process 1.5 billion events per second. That’s about 1,000 times faster than Apache Storm and about 100 times faster than Apache Spark Streaming, the company says.

With RTS 2.0, which was unveiled today at the Strata + Hadoop World conference in New York City, DataTorrent has concentrated on rounding out the product set and making it easier for the enterprise to use. To that end, it added 50 new operators; unveiled new auto-partitioning and auto-scaling capabilities; added a key value pair database for supporting super-fast lookups; and enabled users to make upgrades without taking the cluster down. It also announced a pair of private beta projects for products aimed at developing RTS apps in a drag-and-drop manner, and enabling users to build their own user dashboards.

The new auto-scaling capability will allow users to scale their DataTorrent clusters up or down by simply changing a configuration setting in the product. The system will automatically deploy the Hadoop resources that it needs to accomplish a task in a given amount of time. So if RTS was taking 70 seconds to pull in 10,000 events, but the customer wanted it to process those events in just a single second, a single SLA property can be changed, and the product would automatically deploy an additional 70 data processing components to fit the need.
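The arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration under a linear-scaling assumption (doubling the processing components halves the time); the function name and parameters are hypothetical and are not part of the RTS API, and the article does not describe how RTS actually sizes a deployment.

```python
import math


def partitions_needed(observed_seconds: float,
                      target_seconds: float,
                      current_partitions: int = 1) -> int:
    """Estimate the total number of parallel processing components needed
    to meet a target time, ASSUMING throughput scales linearly with the
    number of components (a simplification, not RTS's actual policy)."""
    if observed_seconds <= 0 or target_seconds <= 0 or current_partitions < 1:
        raise ValueError("times must be positive and partitions >= 1")
    required = math.ceil(current_partitions * observed_seconds / target_seconds)
    # Never report fewer components than are already running.
    return max(current_partitions, required)


# The article's example: 70 seconds observed, 1 second wanted.
print(partitions_needed(70, 1))  # 70
```

Under this simplified model, a 70-fold speedup calls for roughly 70 components working in parallel; real deployments rarely scale perfectly linearly, so the figure is best read as an order-of-magnitude estimate.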

The RTS Cascading Unifier will do this and keep everything in order, says John Fanelli, DataTorrent’s vice president of marketing. “Even though we scale to handle the data, the data still has an order that it comes in, and you must preserve the order in which the data comes in,” he says. “It does all that without impacting the scalability or performance.”

RTS 2.0 ships with 450 open source operators, about 50 more than RTS 1.0. There are now operators for running statistical analysis in R and Rapid Miner. Some of the other new ones provide filtering and pattern matching capabilities, and the capability to set thresholds on windows of time. So if RTS detects something happening once within a five-minute window, for example, it will do nothing. But if it detects something happening three times, it will fire off an alert. Some of the new features border on complex event processing (CEP), Fanelli says.
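The threshold-on-a-window idea can be illustrated with a small Python sketch. It is not a DataTorrent operator (those are written in Java); the class name, the sliding-window behavior, and the defaults of five minutes and three occurrences are assumptions chosen to mirror the example above.

```python
from collections import deque


class WindowThresholdAlert:
    """Signal an alert when an event is seen `threshold` times within the
    last `window_seconds`. Hypothetical illustration; it assumes that
    timestamps arrive in non-decreasing order."""

    def __init__(self, window_seconds: float = 300.0, threshold: int = 3):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps = deque()

    def observe(self, timestamp: float) -> bool:
        """Record one occurrence and return True if an alert should fire."""
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Discard occurrences that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


detector = WindowThresholdAlert()
for t in (0, 100, 200, 600):
    print(t, detector.observe(t))
```

With these inputs, only the third occurrence (at 200 seconds) reaches the threshold; by 600 seconds the earlier occurrences have aged out of the window.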

[Image: The capability to make changes to real-time streaming apps without downtime will enable A/B testing]

This release also supports multi-dimensional processing, Fanelli says, “or the ability to pre-compute every possible query on an event string before it’s being queried.” So if there are 7 dimensions to a particular event, then the customer can pre-compute every combination of those, which is upwards of 100 million combinations.

“It creates a cube, if you will, where all computation is pre-computed,” Fanelli says. “So when a customer makes a query or wants to look at the data differently, it enables them to have that answer at hand without having to calculate because it’s already pre-computed.” One customer is using this feature to provide real-time analysis of online advertising. Instead of waiting for a batch run to report the results of its advertising strategy, the customer can see them immediately.
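A minimal Python sketch of this kind of pre-computation follows. It is an illustration only, not DataTorrent's implementation: the function name, the sample advertising events, and the choice of a summed "clicks" measure are all assumptions. For every incoming event it updates one running total per subset of the dimensions, so any later query over those dimensions becomes a lookup.

```python
from collections import defaultdict
from itertools import combinations


def precompute_cube(events, dimensions, measure):
    """Aggregate `measure` for every combination of `dimensions`.
    Hypothetical sketch: keys are tuples of (dimension, value) pairs."""
    cube = defaultdict(float)
    for event in events:
        # Visit every subset of the dimensions, including the empty
        # subset, which holds the grand total.
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = tuple((d, event[d]) for d in dims)
                cube[key] += event[measure]
    return dict(cube)


# Made-up sample data for illustration.
events = [
    {"advertiser": "acme", "country": "US", "device": "mobile", "clicks": 3},
    {"advertiser": "acme", "country": "DE", "device": "mobile", "clicks": 2},
    {"advertiser": "zenith", "country": "US", "device": "desktop", "clicks": 5},
]
cube = precompute_cube(events, ["advertiser", "country", "device"], "clicks")

# A "query" is now a dictionary lookup rather than a scan.
print(cube[(("advertiser", "acme"),)])                    # 5.0
print(cube[(("country", "US"), ("device", "mobile"))])    # 3.0
```

Each event touches 2^n dimension subsets (8 here, 128 for 7 dimensions), while the number of stored cells also grows with the number of distinct values in each dimension, which is why the storage and update cost of a cube rises quickly as dimensions are added.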

Similarly, the new key value pair database that’s built into RTS 2.0 will allow customers to store high volume and potentially unstructured data in an HDFS-based distributed hash table for fast comparisons and pattern-matching analysis. Some early RTS adopters attempted to glue an external NoSQL database onto RTS to achieve this, so DataTorrent decided to build its own and provide a more elegant and efficient solution.

One beta tester used the key-value pair database to accelerate the de-duplication of data. Using batch processes to de-duplicate millions of incoming events is a slow process, taking potentially an hour or more. But by caching the event values in a distributed key-value database, the de-duplication can occur in less than a second.
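The de-duplication pattern can be sketched in Python with an in-memory dictionary standing in for the distributed key-value database. The class and function names are hypothetical, and a real deployment would also need to expire old keys and share the store across nodes, which this sketch does not attempt.

```python
class Deduplicator:
    """Track event keys already seen. The `store` argument is a stand-in
    for a key-value database; any dict-like object works (assumption)."""

    def __init__(self, store=None):
        self._seen = store if store is not None else {}

    def is_new(self, event_key) -> bool:
        """Return True the first time a key is seen, False afterwards."""
        if event_key in self._seen:
            return False
        self._seen[event_key] = True
        return True


def deduplicate(events, key):
    """Yield each event whose key has not been seen before, in order."""
    dedup = Deduplicator()
    for event in events:
        if dedup.is_new(key(event)):
            yield event


# Made-up sample events for illustration.
incoming = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "a"}]
unique = list(deduplicate(incoming, key=lambda e: e["id"]))
print([e["id"] for e in unique])  # [1, 2]
```

Because each event costs one lookup and at most one write, duplicates can be dropped as events arrive instead of being found later by a batch comparison over the whole data set.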

RTS 2.0 also introduces the capability to make changes to the streaming app without bringing it down. Developers can insert or delete operators from the streaming pipeline on the fly without forcing downtime (it’s assumed that you would test it, of course). This will be particularly useful for doing A/B testing, Fanelli says.

[Image: DaVinci will allow real-time streaming apps to be created in a graphical drag-and-drop manner]

DataTorrent also unveiled two betas: the visual app builder, codenamed DaVinci, and a self-service dashboard builder, codenamed Michelangelo. Michelangelo will allow users to create views based on data coming out of RTS. “For many of our customers, the dashboard is actually the application. That’s what the end user sees,” Fanelli says. “This allows them to create their own.”

DaVinci is potentially the more impactful of the two private betas. It allows a non-technical user to choose input files, select one or more operators to act upon them, configure the SLA and window-size settings (RTS looks at everything on a time-oriented basis), and then press the “launch” button. “The idea is it can be done by developers and non-developers, so you can have data scientists creating their own real-time streaming applications,” Fanelli says.

RTS 1.0 is just going into production with the first batch of customers, who are predominantly in online advertising, retail, telecommunications, and manufacturing (where Internet of Things use cases prevail). With RTS 2.0, DataTorrent is hoping to widen the potential use cases for real-time streaming, and make application development easier to do.

“We continue to push the boundaries,” Fanelli says. “Real-time streaming applications aren’t just for technical folks. We believe with RTS 2.0 and the private betas that we’re really moving the bar upstream in terms of who can use these platforms.”

Related Items:

DataTorrent RTS Clocks In at 1.5B Events per Second

It’s Sink or Swim in the IoT’s Ocean of Bigger Data

Crossing the Big Data Stream with DataTorrent

 

Datanami