May 24, 2016

Kafka Creators Tackle Consistency Problem in Data Pipelines

Alex Woodie


One of the big questions surrounding the rise of real-time stream processing applications is consistency. When you have a distributed application involving thousands of data sources and data consumers, how can you be sure that the data going in one side comes out the other unchanged? That’s the challenge that Confluent is addressing with today’s launch of new software for Apache Kafka.

If you’re moving big data today, you’re probably using Apache Kafka, or at least looking at it. The distributed messaging bus effectively lets users route any piece of data from any source to any destination, and enables applications to subscribe to any number of Kafka data feeds. It’s just big data plumbing at the end of the day, but it’s relatively easy to deploy, horizontally scalable, and open source, just like Apache Hadoop, which gives it an advantage over traditional messaging systems.

Kafka co-creators Jay Kreps and Neha Narkhede, who developed Kafka while working at LinkedIn, co-founded Confluent in 2014 to build an ecosystem around this new way of thinking about data. The pair recognized that we have entered a new paradigm of data: instead of being handled in isolated batches, data would flow freely and continuously. Apache Kafka would provide the core enabling technology, while Confluent would build add-on tools to make it easier to use.

Today’s launch of Confluent Enterprise 3.0 marks the first foray into commercial software for Confluent. Confluent Enterprise 3.0 contains mostly open source code, including the latest release of Kafka, version 0.10. That release includes Kafka Connect, which enables users to create pipelines by connecting a wide variety of data sources to Kafka in an easy drag-and-drop manner.

But Confluent Enterprise 3.0 also contains key elements that are not free, including the Confluent Control Center. As Kreps explains, Control Center is aimed at giving enterprises the confidence to deploy Kafka for critical business functions without worrying that data will be lost or corrupted as it flows through newly created data pipelines.

“If you’re going to build around this in a real way you have to know the integrity of your data,” Kreps says. “You have to be able to prove that everything that came out of all those source databases that streamed its way into Hadoop or streamed its way into the stream processing jobs or made it to the application they’re subscribing–you have to somehow know that it’s correct so you can let everything that relies on that data know that it’s working.”

Control Center, which rides atop Kafka Connect, provides monitoring of message flows in Kafka. It’s similar in some respects to software Kreps and Narkhede built while at LinkedIn, but is designed to be more widely used and more easily consumed.

“In the prior world, you could do all this, but you’re doing it from the command line with low level tools,” Kreps says. “So this gives you a super easy way to do that management in a point and click fashion.”

The sort of problem that Confluent Control Center addresses will become more prevalent as more organizations adopt the new streaming data paradigm and deploy Kafka to create more data pipelines.

“Once you’re putting something like streaming data into practice, your level of uncertainty is higher because instead of having data drop in some big file at the end of the day, you have data arriving continuously,” Kreps says. “So how do you resolve that uncertainty and have some kind of proof” that the data arrived correctly?

Kreps compares it to balancing a checkbook. There are many ways for the checkbook to get out of balance, including not recording a check, an error on the part of the bank, or other situations. “But if the numbers do add up, it’s almost certainly that that is in balance. And that’s exactly how this works,” he says.
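The checkbook analogy can be made concrete with a small sketch. The Python below is not Control Center's implementation, which the article does not describe. It only illustrates the general idea of reconciling counts: tally messages per time window on the producing side and on the consuming side, then report any window where the two tallies disagree. The function names, the one-minute window, and the sample timestamps are all hypothetical.

```python
from collections import Counter


def window_counts(timestamps_ms, window_ms=60_000):
    """Count messages per fixed time window, keyed by window start (ms)."""
    counts = Counter()
    for ts in timestamps_ms:
        counts[ts - ts % window_ms] += 1
    return counts


def reconcile(produced_ts, consumed_ts, window_ms=60_000):
    """Return {window_start: (produced, consumed)} for windows that disagree."""
    produced = window_counts(produced_ts, window_ms)
    consumed = window_counts(consumed_ts, window_ms)
    mismatches = {}
    for window in sorted(produced.keys() | consumed.keys()):
        # A Counter returns 0 for a window it never saw.
        if produced[window] != consumed[window]:
            mismatches[window] = (produced[window], consumed[window])
    return mismatches


# Hypothetical event timestamps in milliseconds; one message is lost in transit.
produced = [1_000, 2_000, 61_000, 62_000, 63_000]
consumed = [1_000, 2_000, 61_000, 63_000]
print(reconcile(produced, consumed))
```

As with a checkbook, matching totals do not strictly prove every message is intact, but a mismatch pinpoints the window in which something went missing.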

The software builds on Confluent’s experience of what people who are implementing stream processing need to make it a bullet-proof part of their IT infrastructure. “If anything ever does go wrong, how can you trace and find the source of the problem without kind of gathering a meeting of everybody involved in any system that touches this pipeline and talking it through for days on end,” Kreps says. “This is a super critical tool that everybody we talk to about this problem has experienced first-hand.”

Confluent Enterprise 3.0 also marks the initial release of Kafka Streams, which adds stream data processing to the core messaging functionality enabled by Apache Kafka. As Kreps describes it, Kafka Streams isn’t meant to compete with the Apache Sparks and Apache Storms of the world. Instead, it’s designed to simplify how organizations deploy basic processing functionality, primarily on transactional data.

“It lets you take a stream of data and enrich it with side data,” Kreps says. “So you might take a stream load of clicks and join on information about the user or the customer to enrich it in some way. It allows you to handle out-of-order data. It’s actually very competitive in terms of features. But it does this in a way that’s super lightweight and simple and doesn’t introduce any dependencies in your stack beyond Kafka.”
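The enrichment Kreps describes, joining a stream of clicks against side data about the user, can be sketched in a few lines. This is plain Python for illustration only and does not use the Kafka Streams API, which is a Java library. The field names (`user_id`, `url`, `user`) and the in-memory lookup table are assumptions made for the example.

```python
def enrich_clicks(clicks, users):
    """Attach the matching user record to each click as it streams past.

    `clicks` is any iterable of dicts with a "user_id" key (the stream);
    `users` maps user_id to a profile dict (the side data). Clicks with
    no known user are passed through with "user" set to None.
    """
    for click in clicks:
        enriched = dict(click)  # copy so the input record is not mutated
        enriched["user"] = users.get(click["user_id"])
        yield enriched


# Hypothetical side table and click stream.
users = {"u1": {"name": "Ada", "plan": "pro"}}
clicks = [
    {"user_id": "u1", "url": "/pricing"},
    {"user_id": "u2", "url": "/docs"},
]

enriched = list(enrich_clicks(clicks, users))
for record in enriched:
    print(record)
```

Because the function is a generator, each click is enriched as it arrives rather than after the whole input is collected, which mirrors the record-at-a-time style of stream processing.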

Since it emerged on the big data scene a couple of years ago, Kafka has become a game changer for stream data processing. If it’s not there already, Kafka is well on its way to dominating a critical part of the big data stack related to the generation and management of data pipelines, whether the workload involves analytics or transaction processing or some mix.

The simplicity that Kafka brings to complex data pipelining tasks is truly astounding, and whether or not the new features discussed here catch on, it leaves one wondering what the folks behind Kafka will come up with next.
