Fresh Streaming Data: Get It While It’s Hot
Streaming data technology that’s been simmering on the back burner for the past few years will be the main entrée at this week’s Strata + Hadoop World conference in New York City.
There’s a profound shift currently underway in the big data community, as companies look for better ways to manage the huge flows of data occurring across their networks, and find faster ways to make business decisions.
While platforms like Hadoop have enabled us to efficiently process huge amounts of data at rest, companies increasingly want to analyze the data as soon as it arrives, before it gets to Hadoop or traditional data warehouses. This was possible before, but it required complex data architectures and an extensive amount of coding. Thanks to the advent of new big data frameworks, real-time stream processing is now within the grasp of average companies.
Apache Kafka is front and center in this shift. Originally developed at LinkedIn to handle billions of messages per day, Kafka is now being deployed in big clusters by companies across all industries to funnel massive flows of data from the Web, production apps, and IoT devices into big data repositories, such as Hadoop data lakes and NoSQL and NewSQL databases.
Once companies have Kafka data pipelines running, they’re turning to a number of stream processing platforms, such as Spark, Storm, Samza, Apex, and Flink (all Apache projects), to transform, analyze, and do something interesting with the data.
Neha Narkhede, the co-founder and CTO of Confluent, the company behind the open source Apache Kafka project, recently shared her insight into the industry shift with Datanami.
“We’ve seen tremendous growth in Kafka recently,” Narkhede says. “This excitement around Kafka, in my opinion, is because there’s a huge tectonic shift in the way companies are managing their data.”
A couple of years ago, Narkhede says, big data was the big trend. Companies competed to gather as much data as they could, and process it offline, often in big Hadoop clusters. The more data they gathered and processed, the more value they got out of it. That fundamental data equation is changing right before our eyes, and Kafka’s soaring popularity is a direct sign of it, Narkhede says.
“Companies are actively moving to adopt stream data, which is all about processing data in real time,” she says. “But a key insight there, and what we’re learning is, the value of data is disproportionately higher for fresher data. The key movement here is actually not the more the better, but the faster the better.”
There’s another change going on. Instead of viewing streaming data as something separate from batch processing, as described in Nathan Marz’s Lambda architecture, there’s a concerted effort to unify batch and stream processing into a single architecture, which Narkhede and her Kafka co-creator Jay Kreps call the Kappa architecture.
This Kappa viewpoint is quickly gaining steam across the industry. “The problem [with Lambda] is it leaves all the complexity of merging the results from these two very different worlds to developers. It’s operationally intensive and very error-prone. This is something we’re seeing companies in fact move away from,” Narkhede says.
Under the Kappa architecture, a batch processing job just becomes another type of stream processing job developed under one of the new frameworks, albeit one that starts and stops, instead of running continuously.
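The Kappa idea can be sketched in a few lines of plain Python (a toy in-memory log standing in for a Kafka topic; no actual Kafka involved): one replayable log, one processing function, and the “batch” job is simply that same function replayed over the full log from the beginning.

```python
# Toy sketch of the Kappa architecture: a single append-only log serves
# both the continuous "streaming" job and the bounded "batch" replay.
# The list here stands in for a Kafka topic (ordered and replayable).

log = []

def count_by_user(events):
    """The single processing job: count events per user."""
    counts = {}
    for e in events:
        counts[e["user"]] = counts.get(e["user"], 0) + 1
    return counts

# "Streaming" mode: re-derive the result each time an event arrives.
live_counts = {}
for event in [{"user": "a"}, {"user": "b"}, {"user": "a"}]:
    log.append(event)
    live_counts = count_by_user(log)

# "Batch" mode: the same job replayed over the whole log from offset 0,
# stopping at the end instead of running continuously.
batch_counts = count_by_user(log)

assert live_counts == batch_counts  # one code path, two modes
```

The point of the sketch is that there is only one `count_by_user`, not a batch implementation and a separate streaming implementation whose outputs must be merged, which is the complexity the Lambda architecture leaves to developers.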
“That’s the power of Kafka,” Narkhede says. “It’s the single layer that actually allows you to go from historical data assessment to current data assessment using a single log, which is really the common layer between these two worlds.”
This viewpoint is shared by others in the industry, including MemSQL, which develops an in-memory relational database used to provide operational analytics for next-gen transactional systems.
“In many enterprises, the historical way of managing data has been batch processing,” says MemSQL chief marketing officer Gary Orenstein. “That’s in part a relic of the architecture of legacy systems, and in part it’s just the enterprise inertia that takes over. But if you’re dealing with customer-facing aspects of business, the customer knows where things are at any given moment of time, and therefore the people that the customers are interacting with should have access to the same info.”
The enterprise batch process is ripe for disruption by emerging streaming architectures, Orenstein says.
“We see a big shift happening in adoption of streaming in classic enterprise workloads that have been too focused on batch processing,” he tells Datanami. “For these customers to move forward with digital transformation, customer 360 initiatives, and IoT activity based on real-time mobile phone input, they’re going to need to move to a streaming solution.”
There are many ways to “slice the onion” with real-time processing, Orenstein says. “But we don’t think that there’s many ways to do it with exactly-once semantics, which is that critically important nugget to getting enterprises comfortable with moving from batch to real time,” he says.
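One common way systems approach exactly-once semantics, sketched here in plain Python and not tied to any particular engine, is an idempotent consumer: under at-least-once delivery, messages may be redelivered, so the consumer deduplicates by message ID to keep the *effect* exactly-once. (The message format and field names here are illustrative, not from any specific product.)

```python
# Hedged sketch: exactly-once *effect* via an idempotent consumer.
# At-least-once delivery can repeat messages; tracking processed IDs
# ensures a redelivered message does not change the result twice.

processed_ids = set()
balance = 0

def handle(message):
    """Apply a deposit exactly once, even if the message is redelivered."""
    global balance
    if message["id"] in processed_ids:
        return  # duplicate delivery: already applied, skip
    balance += message["amount"]
    processed_ids.add(message["id"])

# Simulated delivery stream with a duplicate of message 1.
for msg in [{"id": 1, "amount": 50},
            {"id": 2, "amount": 25},
            {"id": 1, "amount": 50}]:
    handle(msg)

assert balance == 75  # the duplicate was absorbed, not double-counted
```

Real systems add more machinery (atomically committing the dedup state with the result, or transactional writes), but the core trick of turning at-least-once delivery into exactly-once processing is the same.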
The folks behind the Apache Flink project see similar factors driving the rise of real-time processing. Kostas Tzoumas, one of the originators of Flink and the CEO and co-founder of Flink distributor data Artisans, says he’s seeing a lot of interest in Flink among banks and telcos.
“One driver is building real-time products,” Tzoumas tells Datanami. “Another driver is that companies are developing pipelines which run 24/7. They want a robust, fault-tolerant way to implement those, and that’s streaming. Another driver is microservices.”
As companies get more streaming data successes under their belts, they’ll become more comfortable with the emerging paradigm and find what works and what doesn’t. One thing that’s clear, though, is that real-time data pipelines are here to stay.