Follow Datanami:
March 26, 2016

Distributed Stream Processing with Apache Kafka

Jay Kreps

We’re only three months into 2016, but it has been an exciting year in open source and big data. With a marked jump in growth, usage and queries on Apache Kafka (Redmonk), the demand for engineering and DevOps jobs requiring Kafka talent is creating huge demand for training and skills development, as users look to leverage new features and create new deployments.

Enterprises are in the midst of a paradigm shift around how they collect and process data to make business decisions. Stream processing is causing that shift, and it runs on Kafka. Confluent is helping to drive the growth and adoption of Kafka, and I’ll be taking to the stage at Strata + Hadoop World on Wednesday at 11 a.m. to discuss how Kafka enables the modern 24/7 organization that generates data continuously to process and analyze it continuously too.

The Apache Kafka project recently added built-in functionality to enable distributed stream processing called Kafka Streams, which is a library that makes it simple to build horizontally scalable, fault-tolerant applications that react to data streams with millisecond latency.

This type of application architecture is just coming into its own and is critical for applications in finance, telecom, and the emerging domain of the Internet of Things. To date, building this type of application has been quite complex, requiring many new moving pieces beyond Kafka—additional databases, Hadoop clusters, as well as a dedicated stream processing cluster. Kafka Streams simplifies this quite a bit, moving much of the functionality directly to a simple library that does not require any additional Hadoop cluster or stream processing system to be set up.

Despite this simplicity, Kafka Streams actually significantly ups the power of a stream processing system by including native functionality for combining derived tables with streams. For example, a use case around detecting credit card fraud might involve taking a stream of transactions, joining on relevant data about the customer and the merchants, and then applying a risk-scoring model to each transaction using these attributes. Kafka Streams allows this all to be done in a simple framework, well integrated with Kafka itself.

Kafka Streams is designed to work well with the newly added Kafka Connect framework that allows the automatic capturing of data streams from many popular data systems. Connect does the integration with external systems to get the data in and out of Kafka, and Streams helps build applications that transform those streams. Kafka, now with the additional features of Connect and Streams operates as a true streaming platform—allowing data streams to be captured, stored, and processed at massive scale.

You can learn more about Kafka Streams on the Confluent blog here, or come by my Strata + Hadoop World session on Distributed Stream Processing with Apache Kafka on Wednesday, March 30 at 11 a.m. PT in room 210 C/G.


About the author: Jay Kreps is the cofounder and CEO of Confluent, a Jay Krepscompany focused on Apache Kafka. Previously, Jay was one of the primary architects for LinkedIn, where he focused on data infrastructure and data-driven products. He was among the original authors of a number of open source projects in the scalable-data-systems space, including Voldemort, Azkaban, Kafka, and Samza. Jay is also the author of I Heart Logs (O’Reilly Media 2014).