March 14, 2016

Kafka Gets a Stream-Processing Makeover

George Leopold

A new library for building streaming applications seeks to shift the focus from analytics to developing the core applications used to process data streams.

Confluent Inc. announced this week a technical preview of Kafka Streams, a new feature in Apache Kafka. The preview version is part of an upcoming Kafka 0.10 release. Confluent’s preview version of Kafka Streams is available here.

Kafka Streams is a Java library for building distributed stream processing applications on Apache Kafka. Jay Kreps, Confluent’s CEO and co-founder, said Kafka Streams addresses “event-at-a-time,” stateful and distributed processing tasks. Streaming applications built with the library can update databases or make calls to external services, Kreps said.

Other open source stream processing frameworks include Apache Flink, Apache Samza, Apache Spark and Apache Storm. Comparable proprietary services include Amazon Web Services’ (NASDAQ: AMZN) Lambda and Google Cloud’s (NASDAQ: GOOG, GOOGL) Dataflow.

“The gap we see Kafka Streams filling is less the analytics-focused domain these frameworks focus on and more building core applications and micro-services that process data streams,” Kreps noted in a blog post.

Kreps was part of a team at LinkedIn that built the stream-processing framework Apache Samza. It was initially rolled out for internal applications, and then supported in production before LinkedIn turned it over to the Apache Software Foundation.

“One of the key misconceptions we had was that stream processing would be used in a way sort of like a real-time MapReduce layer,” Kreps said. “What we eventually came to realize, though, was that the most compelling applications for stream processing are actually pretty different from what you would typically do with a Hive or Spark job—they are closer to being a kind of asynchronous micro-service rather than being a faster version of a batch analytics job.”

Confluent noted that stream processing applications must go through the same processes, such as configuration, deployment and monitoring, as typical enterprise applications. Streaming applications are used to process “asynchronous event streams from Kafka instead of HTTP requests,” Kreps explained.

While current Kafka APIs are sufficient for serial message processing, Confluent said complex tasks like computing aggregations require more horsepower. Hence, a full stream-processing framework provides access to more advanced operations.
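The difference between serial message processing and an aggregation can be shown with a minimal sketch in plain Java. This is an illustration only and does not use the Kafka Streams API; the class name `EventCountSketch` and its methods are hypothetical. The `normalize` step needs only the current message, while the per-key count must keep state between messages, which is the kind of bookkeeping a stream-processing framework would be expected to manage.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; this is NOT the Kafka Streams API.
final class EventCountSketch {

    // Running count per key. This state must survive from one event to the next.
    private final Map<String, Long> counts = new HashMap<>();

    // Stateless, message-at-a-time step: the output depends only on the input.
    static String normalize(String event) {
        return event.trim().toLowerCase(Locale.ROOT);
    }

    // Stateful aggregation: updates and returns the running count for the event's key.
    long process(String event) {
        String key = normalize(event);
        return counts.merge(key, 1L, Long::sum);
    }

    // Returns a copy of the current counts so callers cannot mutate internal state.
    Map<String, Long> snapshot() {
        return new HashMap<>(counts);
    }

    public static void main(String[] args) {
        EventCountSketch sketch = new EventCountSketch();
        for (String event : Arrays.asList("Click", "view ", "click")) {
            System.out.println(normalize(event) + " -> " + sketch.process(event));
        }
    }
}
```

In this sketch the state lives in an in-memory map, so it would be lost if the process stopped; making that state durable and distributed is the harder part that a framework takes on.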

Confluent said Streams implements Kafka’s core abstractions as primitives for stream processing. “The goal is to simplify stream processing enough to make it accessible as a mainstream application programming model for asynchronous services,” Kreps said.

The framework also is intended to balance processing loads as new instances of an app are added while maintaining local state stores for tables. It also is designed for fault tolerance. From there, the company said apps could be deployed using DevOps tools like Puppet or Chef. Apps also can be packaged as Docker application container images.
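As a rough illustration of how work might be spread out when instances of an app are added, the sketch below assigns a fixed set of partitions to instances in round-robin order. The round-robin rule, the class name `RebalanceSketch` and the instance names are assumptions made for this example; the article does not describe the assignment strategy Kafka Streams actually uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; the real assignment strategy is not described in the article.
final class RebalanceSketch {

    // Spreads partitions 0..partitionCount-1 across instances in round-robin order.
    static Map<String, List<Integer>> assign(List<String> instances, int partitionCount) {
        if (instances.isEmpty()) {
            throw new IllegalArgumentException("at least one instance is required");
        }
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String instance : instances) {
            assignment.put(instance, new ArrayList<>());
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            String owner = instances.get(partition % instances.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two instances share six partitions; adding a third shrinks each share.
        System.out.println(assign(Arrays.asList("app-1", "app-2"), 6));
        System.out.println(assign(Arrays.asList("app-1", "app-2", "app-3"), 6));
    }
}
```

With two instances each one owns three of the six partitions; with three instances each owns two. Any local state tied to a partition would have to follow that partition to its new owner, which is why the article pairs load balancing with local state stores and fault tolerance.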

Confluent, Palo Alto, Calif., was founded by the team that built Kafka at LinkedIn. The startup’s twist on the technology is making, for example, company operations available as real-time Kafka streams to Hadoop and other enterprise data platforms.

