Distributed Stream Processing with Apache Kafka
We’re only three months into 2016, but it has been an exciting year in open source and big data. With a marked jump in growth, usage, and queries on Apache Kafka (RedMonk), demand for engineering and DevOps roles requiring Kafka talent is fueling a surge in training and skills development, as users look to leverage new features and create new deployments.
Enterprises are in the midst of a paradigm shift in how they collect and process data to make business decisions. Stream processing is driving that shift, and it runs on Kafka. Confluent is helping to drive the growth and adoption of Kafka, and I’ll be taking the stage at Strata + Hadoop World on Wednesday at 11 a.m. to discuss how Kafka enables the modern 24/7 organization, which generates data continuously, to process and analyze that data continuously as well.
The Apache Kafka project recently added built-in functionality for distributed stream processing called Kafka Streams: a library that makes it simple to build horizontally scalable, fault-tolerant applications that react to data streams with millisecond latency.
This type of application architecture is just coming into its own and is critical for applications in finance, telecom, and the emerging domain of the Internet of Things. To date, building this type of application has been quite complex, requiring many new moving pieces beyond Kafka—additional databases, Hadoop clusters, as well as a dedicated stream processing cluster. Kafka Streams simplifies this quite a bit, moving much of the functionality directly to a simple library that does not require any additional Hadoop cluster or stream processing system to be set up.
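Because Kafka Streams is just a library, an application is an ordinary Java program that embeds the stream processing logic directly. Here is a minimal sketch using the Kafka Streams Java DSL; the topic names ("transactions", "large-transactions") and the idea of filtering by amount are hypothetical examples, and the DSL entry points have evolved somewhat since the original 2016 preview release.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class LargeTransactionFilter {
    public static void main(String[] args) {
        // Standard configuration: an application ID (used for consumer groups
        // and internal state) and the Kafka brokers to connect to.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "large-txn-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Build the processing topology: read a stream of transaction amounts,
        // keep only those over $10,000, and write them to an output topic.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> transactions = builder.stream("transactions");
        transactions
            .filter((key, amount) -> Double.parseDouble(amount) > 10_000)
            .to("large-transactions");

        // Start the application; scaling out is just running more instances.
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

No separate cluster is involved: this runs as a plain JVM process, and running additional copies of the same program partitions the work automatically via Kafka's consumer group mechanism.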
Despite this simplicity, Kafka Streams actually significantly ups the power of a stream processing system by including native functionality for combining derived tables with streams. For example, a use case around detecting credit card fraud might involve taking a stream of transactions, joining on relevant data about the customer and the merchants, and then applying a risk-scoring model to each transaction using these attributes. Kafka Streams allows this all to be done in a simple framework, well integrated with Kafka itself.
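The fraud-detection pattern above maps directly onto a stream-table join in the DSL. The sketch below is a hedged illustration, not code from the article: the topic names, the Transaction and CustomerProfile types, the scoreRisk function, and the RISK_THRESHOLD constant are all hypothetical application-level pieces.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class FraudScoringTopology {
    static final double RISK_THRESHOLD = 0.9; // hypothetical cutoff

    // Transaction, CustomerProfile, and scoreRisk stand in for
    // application-defined types and a real risk-scoring model.
    record Transaction(String merchantId, double amount) {}
    record CustomerProfile(String homeCountry, double avgSpend) {}

    static double scoreRisk(Transaction txn, CustomerProfile profile) {
        // Toy heuristic: flag spend far above the customer's average.
        return Math.min(1.0, txn.amount() / (profile.avgSpend() * 10));
    }

    static void build(StreamsBuilder builder) {
        // Stream of card transactions, keyed by customer ID.
        KStream<String, Transaction> transactions = builder.stream("transactions");

        // Derived table of customer profiles, continuously kept up to date
        // from a changelog topic -- this is the "table" side of the join.
        KTable<String, CustomerProfile> customers = builder.table("customer-profiles");

        // Enrich each transaction with the customer's current profile,
        // apply the risk model, and emit high-risk events downstream.
        transactions
            .join(customers, FraudScoringTopology::scoreRisk)
            .filter((customerId, score) -> score > RISK_THRESHOLD)
            .to("suspected-fraud");
    }
}
```

The key point is that the customer table is itself maintained from a Kafka topic, so enrichment data and the transaction stream flow through the same system, with no side lookup against an external database on the hot path.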
Kafka Streams is designed to work well with the newly added Kafka Connect framework, which automatically captures data streams from many popular data systems. Connect handles the integration with external systems to get data in and out of Kafka, while Streams helps build the applications that transform those streams. Kafka, now with the additional features of Connect and Streams, operates as a true streaming platform, allowing data streams to be captured, stored, and processed at massive scale.
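To make the Connect side concrete, here is a hedged example of a standalone Connect worker plus a source connector that tails a log file into a Kafka topic. The file path and topic name are hypothetical; the FileStreamSourceConnector ships with Kafka as a simple built-in example connector.

```properties
# connect-standalone.properties -- worker configuration
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.file.filename=/tmp/connect.offsets

# file-source.properties -- one connector instance
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/var/log/app.log
topic=app-logs
```

Started with `bin/connect-standalone.sh connect-standalone.properties file-source.properties`, this continuously appends each new line of the file to the `app-logs` topic, where a Streams application can pick it up.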
You can learn more about Kafka Streams on the Confluent blog here, or come by my Strata + Hadoop World session on Distributed Stream Processing with Apache Kafka on Wednesday, March 30 at 11 a.m. PT in room 210 C/G.
About the author: Jay Kreps is the cofounder and CEO of Confluent, a company focused on Apache Kafka. Previously, Jay was one of the primary architects at LinkedIn, where he focused on data infrastructure and data-driven products. He was among the original authors of a number of open source projects in the scalable-data-systems space, including Voldemort, Azkaban, Kafka, and Samza. Jay is also the author of I Heart Logs (O’Reilly Media, 2014).