July 6, 2022

Is Real-Time Streaming Finally Taking Off?


Like commercial fusion reactors, real-time streaming is a tantalizing technology, but one that perpetually needs just a few more years (or decades) of R&D. But some in the industry are sensing that something has shifted over the past year, and that real-time streaming is finally hitting its stride.

“Every year, we’re waiting for that year where streaming workloads take off, and I think last year was it,” Databricks CEO Ali Ghodsi said during his keynote address at the Data + AI Summit last week. “We actually saw 2.5X growth in revenue for our streaming workloads last year, so I think streaming is finally happening.”

Streaming data, which some call real-time data, isn’t a new topic, of course. It’s been used in various forms for decades. With the first dot-com boom, however, valuable new types of events, such as clickstreams, became available. In the subsequent years, big data flows have been turbo-charged, and new technologies, such as Apache Kafka, have emerged to help manage them. But the means to build operational and analytical applications atop that channeled data have remained available only to the biggest organizations.

The folks at Databricks indicate this could be starting to change. But why?

“I think it’s because people are moving to the right of this data AI maturity curve,” Ghodsi said during the keynote, “and they’re having more and more AI use cases that just need to be real-time, like real-time fraud detection.”

In other words, companies are accelerating along what he calls the AI maturity curve, moving from traditional, backward-facing BI workloads toward more advanced, forward-looking AI-powered technologies. These AI-powered predictions need to be made in shorter time windows, hence the need for real-time tech.

Ali Ghodsi speaking at Data + AI Summit June 28, 2022

While we don’t have insight into the scale of Databricks’ real-time streaming data revenues, we do have an idea of the investments the company is making in that tech. In 2021, it hired Karthik Ramasamy, who co-created the Heron stream processing engine at Twitter and co-founded Streamlio, a startup built around Apache Pulsar, to head up development of Structured Streaming, the high-level Spark API for stream processing.

Ramasamy will be heavily involved in Project Lightspeed, a new initiative Databricks unveiled last week to overhaul Structured Streaming. According to a blog post written by Ramasamy and his Databricks colleagues, the major goals of Project Lightspeed include:

  • Improving the latency and ensuring it is predictable;
  • Enhancing functionality for processing data with new operators and APIs;
  • Improving ecosystem support for connectors;
  • And simplifying deployment, operations, monitoring, and troubleshooting.

Additionally, the developers will seek to get a better handle on the technical challenges of real-time streaming, including offset management, asynchronous checkpointing, and state checkpointing frequency.

Lightspeed will bring additional functionality for processing events and building real-time applications, such as stateful operators, advanced windowing, state management, and asynchronous I/O. It will also add “a powerful yet simple API for storing and manipulating state” in Python, the company says.
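For readers unfamiliar with the terms, a stateful, windowed operator can be pictured with a few lines of plain Python. The sketch below is an illustration only: it does not use Spark or any Lightspeed API, and the function name `tumbling_window_counts`, the event shape, and the card IDs are all hypothetical. It counts events per key in fixed-size ("tumbling") time windows, keeping the running counts as state between events.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical event shape for illustration: (key, event_time_in_seconds)
Event = Tuple[str, float]


def tumbling_window_counts(
    events: Iterable[Event], window_seconds: float
) -> Dict[Tuple[str, float], int]:
    """Count events per (key, window_start) using fixed-size windows.

    The `state` dict plays the role of operator state: it persists
    across events and is updated once per incoming event.
    """
    state: Dict[Tuple[str, float], int] = defaultdict(int)
    for key, event_time in events:
        # Align the event to the start of the window that contains it.
        window_start = (event_time // window_seconds) * window_seconds
        state[(key, window_start)] += 1
    return dict(state)


# Example: card swipes grouped into 10-second windows.
swipes = [("card-1", 1.0), ("card-1", 4.0), ("card-2", 7.0), ("card-1", 12.0)]
counts = tumbling_window_counts(swipes, 10.0)
print(counts)
```

A production stream processor also has to persist this state durably and decide when a window is complete, which is where concerns like checkpointing frequency come in.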

Whether real-time streaming is actually ready to go to the next level or not, it’s looking like Structured Streaming is about to get a lot better.
