Fueled by Kafka, Stream Processing Poised for Growth
Once a niche technique used only by the largest organizations, stream processing is emerging as a legitimate technique for dealing with the massive amounts of data generated every day. While it’s not needed for every data challenge, organizations are increasingly finding ways to incorporate stream processing into their plans, particularly with the rise of Kafka.
Stream processing is just what it sounds like: processing data as soon as it arrives, as opposed to processing it after it lands. The amount of processing applied to the data as it flows can vary greatly. At the low end, users may do little more than a simple transformation, such as converting temperatures from Celsius to Fahrenheit or combining one stream with another. At the upper end, stream processors may apply real-time analytics or machine learning algorithms.
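To make the low end of that range concrete, here is a minimal sketch of a per-record transformation in plain Python. It assumes a hypothetical sensor feed that emits dictionaries with `sensor` and `temp_c` fields; the feed, the field names, and the function names are illustrative and do not come from any product mentioned here.

```python
from typing import Dict, Iterable, Iterator


def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a single temperature reading."""
    return celsius * 9.0 / 5.0 + 32.0


def convert_stream(readings: Iterable[Dict]) -> Iterator[Dict]:
    """Transform each reading as it arrives instead of waiting for a full batch."""
    for reading in readings:
        yield {
            "sensor": reading["sensor"],
            "temp_f": celsius_to_fahrenheit(reading["temp_c"]),
        }


def sensor_feed() -> Iterator[Dict]:
    """Hypothetical source; a real feed would read from a sensor or a message queue."""
    yield {"sensor": "a", "temp_c": 0.0}
    yield {"sensor": "b", "temp_c": 100.0}
    yield {"sensor": "a", "temp_c": 37.0}


if __name__ == "__main__":
    # Each converted record is available as soon as its input has been read.
    for record in convert_stream(sensor_feed()):
        print(record)
```

Because `convert_stream` is a generator, nothing is buffered: a record is converted and handed downstream the moment it is read, which is the property that separates this from processing the data after it lands.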
Almost any type of data can be used in stream processing. Sources can include database events from an RDBMS or NoSQL store, sensor data from the IoT, comments made on social media, or a credit card swipe. The data’s destination can be similarly diverse: it could be headed to a traditional file system, a relational or NoSQL database, a Hadoop data lake, or a cloud-based object store.
What happens between that initial data creation event and the moment the data is written to some type of permanent repository is collectively referred to as stream processing. Initially, proprietary products from the likes of TIBCO, Software AG, IBM, and others were developed to handle streaming data. But more recently, distributed, open source frameworks have emerged to deal with the massive surge in data generation.
Apache Kafka, a distributed publish-and-subscribe message queue that’s open source and relatively easy to use, is by far the most popular of these open source frameworks. Industry insiders today see Kafka as helping to fuel the ongoing surge in demand for stream data processing tools.
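For readers unfamiliar with the publish-and-subscribe pattern, the following is a toy, single-process sketch of the idea: producers append messages to a shared log, and each consumer keeps its own position in that log. It is not Kafka’s API and leaves out everything that makes Kafka distributed; the `Topic` class, its methods, and the consumer names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


class Topic:
    """Toy append-only log that several consumers read at their own pace."""

    def __init__(self) -> None:
        self._log: List[str] = []
        # Next unread position for each consumer, starting at 0.
        self._offsets: Dict[str, int] = defaultdict(int)

    def publish(self, message: str) -> None:
        """Append a message; publishers know nothing about who will read it."""
        self._log.append(message)

    def poll(self, consumer: str) -> List[str]:
        """Return the messages this consumer has not seen yet and advance its offset."""
        start = self._offsets[consumer]
        self._offsets[consumer] = len(self._log)
        return self._log[start:]


if __name__ == "__main__":
    swipes = Topic()
    swipes.publish("swipe-1")
    swipes.publish("swipe-2")
    print(swipes.poll("fraud-check"))  # fraud-check reads the first two messages
    swipes.publish("swipe-3")
    print(swipes.poll("fraud-check"))  # only the new message
    print(swipes.poll("data-lake"))    # a second consumer still sees all three
```

The point of the design is decoupling: a source publishes once, and any number of downstream systems, such as a fraud detector or a data lake loader, can each read the full stream independently.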
Steve Wilkes, the CTO and founder of Striim, says Kafka’s popularity is helping to push stream processing onto center stage. “Kafka is driving a lot of our market,” he says. “A good majority of our customers are utilizing Kafka in one way, shape, or form.”
The underlying trend driving investment in stream processing is that customers need access to the latest data, Wilkes says. “It’s the recognition that, no matter how you’re doing analytics, whether you’re doing them in streaming fashion or whether you’re doing them after the fact through some sort of Hadoop jobs or big data analytics, you need that up-to-date data,” he tells Datanami.
Striim this week unveiled a new release of its stream data processing solution, Striim version 3.8, that features better support for Kafka. This includes the capability to automatically scale Striim to more efficiently read from and write to Kafka as users scale up their real-time streaming architecture.
Many Kafka users are using the core Kafka product, along with the open source Kafka Connect software, to rapidly move data from its source to another destination, such as Hadoop or a data lake hosted on the cloud. Fewer shops are using the Kafka Streams API to write application logic on top of the message bus, a niche that third-party vendors are moving to fill.
According to a recent report from Confluent, the company behind open source Kafka and developer of the Confluent Platform, 81% of Kafka customers are using it to build data pipelines. Other common use cases include real-time monitoring, ETL, microservices, and building Internet of Things (IoT) products.
Keeping the data lake updated with fresh data is an increasingly difficult task, and one that stream processing is being asked to take on in a sort of modern ETL role. According to Syncsort’s recent 2018 Big Data Trends survey, 75% of respondents say that keeping their data lake updated with changing data sources is either “somewhat” or “very” difficult.
Another vendor that’s seeing the Kafka impact is StreamSets, a software vendor that bills itself as the “air traffic control” for data in motion. StreamSets’ initial product was a data collector that automated some of the nitty-gritty work involved in capturing and moving data, often atop the Kafka message queue. The vendor recently debuted a low-footprint data collector that works in CPU- and network-constrained environments, and a cloud-based console for managing the entire flow of a customer’s data.
StreamSets Vice President of Marketing Rick Bilodeau says Kafka is driving a lot of the company’s business. “We do a lot of work with customers for Kafka, for real-time event streaming,” he tells Datanami. “We see fairly broad Kafka adoption as a message queue, where people are using [StreamSets software] primarily to broker data in and out of the Kafka bus.”
Some of StreamSets’ customers have a million data pipelines running at the same time, which can lead to serious management challenges. “Companies will say, ‘We built a bunch of pipelines with Kafka, but now have a scalability problem. We can’t keep throwing people at it. It’s just taking us too long to put these things together,’” Bilodeau says. “So they use data collector to accelerate that process.”
Today, StreamSets sees lots of customers implementing real-time stream processing for Customer 360, cybersecurity, fraud detection, and industrial IoT use cases. Stream processing is still relatively new, but it’s maturing rapidly, Bilodeau says.
“It’s not the first inning, for sure. It’s maybe the third inning,” he says. “On the Gartner Hype Cycle, it’s approaching early maturity. Every company seems to have something they want to do with streaming data.”
Striim’s Wilkes agrees. Fewer than half of enterprises are working with streaming data pipelines, he estimates, but that share is growing solidly. “Streaming data wasn’t even really being talked about a few years ago,” he says. “But it’s really starting to get up to speed. There is a steady progression.”
We’re still mostly in the pipeline-building phase, where identifying data sources and creating data integrations dominate real-time discussions, Wilkes says. That period will give way to more advanced use cases as people become comfortable with the technology.
“We’re seeing that a lot of customers are still at the point of obtaining streaming sources. They understand the need to get a real-time data infrastructure,” he says. “The integration piece always comes first. The next stage after you have access to the streaming data is starting to think about the analytics.”