The Real-Time Future of ETL
We’re on the cusp of a huge uptick in data generation thanks to the IoT, but most of that data will never be landed in a central repository or stored for any length of time. To get a handle on this morass of data, enterprises will eventually find themselves using real-time processing engines as an ETL layer to soften the downstream impact.
The analyst group IDC gave us a glimpse of our massive data future when it published its “Data Age 2025” report. The report, which was commissioned by disk maker Seagate, predicted the world will generate 160 zettabytes of data per year in 2025, up from about 16 zettabytes this year.
To keep from getting swamped by this forthcoming data deluge, organizations will need to get creative in how they handle data. Only about 15 percent of that data will be stored on a disk, a tape cartridge, or other storage medium, according to the group.
The good news is that much of the data from IoT devices will be repetitive in nature and have limited value. In many cases, the value of the data will come from identifying the anomalies within the data stream.
Damian Black, the CEO of SQLstream, predicts enterprises will need real-time processing engines to identify those anomalies and generate alerts. “Very shortly, it’s going to become not even technically possible to store the data we’re generating,” Black tells Datanami. “You need to analyze data on the fly.”
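To make the idea of analyzing data on the fly concrete, here is a minimal Python sketch of anomaly detection over a stream. It is an illustration only, not SQLstream's method: the rolling z-score technique, the function name `detect_anomalies`, and the `window`, `threshold`, and `min_std` parameters are all assumptions chosen for the example. The point is that only a small window of recent values is ever held in memory, and only the outliers are passed downstream.

```python
from collections import deque
from math import sqrt


def detect_anomalies(readings, window=20, threshold=3.0, min_std=1e-6):
    """Yield (index, value) for readings far from the recent rolling mean.

    Only the last `window` normal readings are kept; everything else is
    dropped as soon as it has been examined. All names and defaults here
    are hypothetical, picked for illustration.
    """
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(recent) == window:
            mean = sum(recent) / window
            std = sqrt(sum((x - mean) ** 2 for x in recent) / window)
            std = max(std, min_std)  # avoid a zero spread on perfectly flat data
            if abs(value - mean) > threshold * std:
                yield i, value
                continue  # keep the outlier out of the baseline
        recent.append(value)


# Repetitive sensor-style data with a single spike at index 40.
stream = [20.0, 20.1, 19.9, 20.0] * 10 + [35.0] + [20.0] * 5
alerts = list(detect_anomalies(stream))
```

In this toy run, the 45 routine readings are examined and forgotten, and only the spike would be forwarded as an alert.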
Real Time Challenges
Much of the big data processing the world has done up to this point has lived in batch, but for the reasons previously stated, that method can’t keep up with the ongoing explosion of data.
“Hadoop does not provide all the answers,” Black says. “It does not really provide answers in any way to real-time or near real-time questions. It doesn’t give you visibility into streaming data or rapidly changing data. And it doesn’t support real time, data-driven actions. Those are all things that we actually do very well.”
SQLstream subscribes to the school of thought that says the sooner the data is processed, the better. It’s a position similar to the one advocated by Kafka creator Jay Kreps, who foresees a need for an overarching dataflow architecture that has processing capacity to satisfy a range of requirements, from real-time to interactive to batch. Kreps called this the Kappa architecture, to differentiate it from the Lambda architecture that sees real-time analytic systems running separately and in parallel to big data processing systems like Hadoop.
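The difference between the two architectures is easier to see in miniature. The sketch below is a toy, in-memory illustration of the Kappa idea, assuming a single append-only log and one processing function that serves both live consumption and full reprocessing. The `Log` class and `count_by_key` function are hypothetical stand-ins and do not reflect the Kafka API.

```python
class Log:
    """A toy append-only log standing in for a durable stream (hypothetical)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record and return its offset in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset=0):
        """Iterate over the records from `offset` onward."""
        return iter(self._records[offset:])


def count_by_key(records):
    """The one piece of processing logic, used for live and replayed data alike."""
    counts = {}
    for key, _value in records:
        counts[key] = counts.get(key, 0) + 1
    return counts


log = Log()
for record in [("sensor-a", 1), ("sensor-b", 2), ("sensor-a", 3)]:
    log.append(record)

latest = count_by_key(log.read(offset=2))  # a consumer picking up only new records
rebuilt = count_by_key(log.read())         # reprocessing: replay the log from the start
```

In a Lambda-style design, the replay path would typically be a separate batch system with its own code. Here the same function handles both cases, and reprocessing simply means reading the log again from offset zero.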
In Black’s view, Kafka and Amazon’s Kinesis will be the primary data pipelines to manage the flow of data, while real-time analytic engines, such as SQLstream Blaze, Storm, Spark Streaming, Apex, Flink, and any number of other systems, provide the processing logic and operators that transform the data into useful information.
“We think with high velocity data that you’re going to see…that Kafka is very well placed,” Black says. “We view Kinesis as almost like a proprietary version of Kafka. It’s independently generated and built, but it provides very similar functionality.”
ETL On the Fly
As the amount of data generated by the IoT ramps up, enterprises will need some way to process the data near the edge, because the volumes will be too great to move to a central repository. This gives rise to the notion of real-time analytic engines performing ETL functions that today are largely processed in a periodic batch manner.
“In many ways, we see this as the future of ETL,” Black says. “From passive batch to an active real-time approach, where you can not only acquire new data, but you can analyze it on the fly and actually visualize it and you can take real time action.”
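As a rough illustration of what ETL on the fly can look like, the Python sketch below chains extract, transform, and load stages as generators, so each record flows through the pipeline as it arrives rather than waiting for a batch. It assumes JSON-encoded events arriving in timestamp order; the field names (`device`, `ts`, `temp_c`), the function names, and the 60-second tumbling window are hypothetical choices for the example, not details of any product named here.

```python
import json
from collections import defaultdict


def extract(lines):
    """Parse raw JSON lines, skipping any that are malformed."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def transform(events):
    """Keep only usable events and normalize the fields we need."""
    for e in events:
        if not isinstance(e, dict):
            continue
        try:
            yield {"device": str(e["device"]),
                   "ts": int(e["ts"]),
                   "temp_c": float(e["temp_c"])}
        except (KeyError, TypeError, ValueError):
            continue


def load_windows(events, window_seconds=60):
    """Emit (window_start, {device: average}) once per tumbling window.

    Assumes events arrive in timestamp order; only the current window's
    running sums are held in memory.
    """
    current = None
    sums = defaultdict(lambda: [0.0, 0])
    for e in events:
        start = e["ts"] - e["ts"] % window_seconds
        if current is not None and start != current:
            yield current, {d: s / n for d, (s, n) in sums.items()}
            sums.clear()
        current = start
        acc = sums[e["device"]]
        acc[0] += e["temp_c"]
        acc[1] += 1
    if current is not None:
        yield current, {d: s / n for d, (s, n) in sums.items()}


raw = [
    '{"device": "a", "ts": 0, "temp_c": 20}',
    'not json',
    '{"device": "a", "ts": 30, "temp_c": 22}',
    '{"device": "b", "ts": 59, "temp_c": 10}',
    '{"device": "a", "ts": 61, "temp_c": 30}',
]
summaries = list(load_windows(transform(extract(raw))))
```

Only the per-window summaries leave the pipeline. The raw events are never landed anywhere, which is the trade-off the edge-processing argument depends on.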
Black says falling storage and processing prices are not keeping pace with the rapid growth of data generation, which leads to an imbalance. “That means basically you’re hitting a wall,” he says. “We’re already at this stage now.”
Some big data platform providers continue to insist that they should centralize the data and store it in a big data lake, Black says. But that approach leads to too many lost opportunities, not to mention that big data storage wall, he says.
“We’re only really examining 1% of the available data,” he says. “But those ambitions are changing. We believe the way to [process more data] is to process data on the fly, and to do that, you need to do it with something that has low latency.”
SQLstream has been a bit of a dark horse in the real time analytics space, but it appears to be on the cusp of a major breakout. The company’s profile got a substantial boost last year when Amazon Web Services signed an OEM deal to utilize a portion of its Blaze platform to be the real-time engine powering the hosted Kinesis Analytics offering.
Then earlier this year, the company landed in the “Leaders” section of a Forrester Wave report on real time analytics, alongside big established names in the space, like IBM, TIBCO, SAP, Oracle, and Software AG, as well as up-and-coming real-time players like DataTorrent and Impetus Technologies.
Black sounds confident that the sudden attention is not misplaced. “We’re one to two orders of magnitude faster than alternative platforms,” he says. “I know it sounds too good to be true. But it’s frustrating that the world doesn’t necessarily know about it.”
Black says there are several technical advantages that SQLstream Blaze holds over other real-time systems, including: an avoidance of Java (the Blaze kernel is written in C++); lock-free execution (which minimizes the number of threads), including in the protocol layer; a SQL optimizer; on-the-fly schema generation; and intelligent reclamation of memory used for indexing.
“We automatically recover the index space in the data space as we go,” Black explains. “That’s the primary reason initially Amazon looked at us. They looked at most of the leading platforms out there. They wanted something that was SQL standard-based, or as close to that as possible. But what made the difference in the end was the raw performance. We’re able to process literally so much more than they were expecting. So a single instance on EC2 has incredible throughput capabilities.”
According to Black, Blaze is able to process one million records per second per core. That throughput enables customers to build pipelines that process large amounts of data with a relatively modest investment in server capacity, either on-premises or in the cloud.
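Taken at face value, that per-core figure makes for simple capacity arithmetic. The back-of-the-envelope Python sketch below assumes throughput scales linearly with cores, which real workloads rarely do exactly. The per-core rate is the vendor's claim as quoted above, while the `headroom` parameter and the five-million-records-per-second workload are hypothetical.

```python
import math

# Per-core rate as quoted by Black; a vendor claim, not independently verified.
RECORDS_PER_SECOND_PER_CORE = 1_000_000


def cores_needed(records_per_second,
                 per_core_rate=RECORDS_PER_SECOND_PER_CORE,
                 headroom=0.5):
    """Estimate the cores required for a sustained record rate.

    `headroom` is the fraction of each core held in reserve for spikes
    (a hypothetical planning margin). Assumes linear scaling across cores.
    """
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable_rate = per_core_rate * (1.0 - headroom)
    return math.ceil(records_per_second / usable_rate)


# Hypothetical workload: five million records per second.
at_peak = cores_needed(5_000_000, headroom=0.0)
with_margin = cores_needed(5_000_000)
```

Under these assumptions the hypothetical workload needs 5 cores with no margin and 10 cores with half of each core held in reserve.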
One SQLstream customer is the online ad exchange called the Rubicon Project. According to Black, it took the company about three hours to process hundreds of billions of relatively large records (2KB to 3KB each, with hundreds of fields) on a Hadoop cluster with close to 200 nodes. “We can get them real-time visibility with just 12 servers,” he says.
The amount of data that companies deal with has increased exponentially over the past decade. But that flood of data will pale in comparison to the tsunami of data that’s going to hit over the next 10 years. To keep from being overwhelmed, companies will need to move away from a centralized, batch-oriented mindset and embrace real-time processing and analytics.
Real-time data processing won’t be just a “nice to have” feature that companies can use to differentiate themselves. Eventually, it will be a matter of digital survival.