March 7, 2017

How Kafka Redefined Data Processing for the Streaming Age


The Apache Kafka phenomenon reached a new high today when Confluent announced a $50 million investment from the venture capital firm Sequoia. The investment signals renewed confidence that Kafka is fast becoming a new and must-have platform for real-time data processing, says Kafka co-creator and Confluent CEO Jay Kreps.

“What we’re seeing in the world, and why Sequoia and existing investors were so excited, is the emergence of this whole new category of infrastructure around streaming, a streaming platform,” Kreps told Datanami in an interview last week.

“It’s a different category of thing,” he adds. “I don’t think there was something quite like this before. There were technologies that were precursors, but they’re pretty different technically.”

Kafka was created, like all great products, out of necessity. The social network LinkedIn had all manner of distributed systems to help process data, including Hadoop (which it often struggles to run efficiently; see today’s story on Dr. Elephant). But the company lacked a durable message bus that could reliably deliver hundreds of millions of messages a day.

Confluent CEO and co-founder Jay Kreps

So Kreps and his LinkedIn colleagues, Neha Narkhede and Jun Rao, did what they needed to do: they built a new publish-and-subscribe messaging bus for LinkedIn. The new system, dubbed Kafka, worked so well that they decided to open source it. In 2014, Kreps, Rao, and Narkhede left LinkedIn to found Confluent with the idea of building a commercial product around what was then an incubating Apache Kafka project.

Since then, Kafka has caught on like wildfire. Companies were impressed not only by Kafka’s ability to reliably deliver huge volumes of data from sources to destinations, but also by the fact that it does so without requiring large armies of skilled technicians to implement a complex distributed system. Kafka users especially appreciate that they can set up various streams of data, known as “Kafka topics,” and then allow people to subscribe to those streams. As a standalone distributed system that runs on clusters of x86 servers, Kafka is equally adept at serving up data for transactional and analytical purposes.
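
As a rough illustration of the topic-and-subscriber idea, here is a minimal in-memory sketch in Python. It is not Kafka's client API and says nothing about how Kafka is implemented; `ToyBroker`, `publish`, and `poll` are hypothetical names invented for this example. The only point it tries to show is that a topic can be treated as an append-only stream, with each subscriber keeping its own read position so that several readers can consume the same data independently.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a message bus (hypothetical, not Kafka's API)."""

    def __init__(self):
        # topic name -> append-only list of messages
        self._logs = defaultdict(list)
        # (topic, subscriber) -> index of the next message to read
        self._offsets = defaultdict(int)

    def publish(self, topic, message):
        """Append a message to a topic and return its position in the stream."""
        self._logs[topic].append(message)
        return len(self._logs[topic]) - 1

    def poll(self, topic, subscriber):
        """Return every message this subscriber has not yet seen on the topic."""
        key = (topic, subscriber)
        start = self._offsets[key]
        batch = self._logs[topic][start:]
        # Advance only this subscriber's position; other subscribers are unaffected.
        self._offsets[key] = len(self._logs[topic])
        return batch


broker = ToyBroker()
broker.publish("page-views", "user-1 viewed /home")
broker.publish("page-views", "user-2 viewed /jobs")

# Two independent subscribers each receive the full stream.
analytics_batch = broker.poll("page-views", "analytics")
billing_batch = broker.poll("page-views", "billing")
```

In this sketch a second `poll` by the same subscriber returns only messages published since its previous call, which is the property that lets one stream feed several independent consumers.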

Today, the distributed system has been adopted by more than one-third of the Fortune 500. Subscriptions to Confluent Enterprise, a version of Kafka that adds proprietary management and monitoring features, surged by 700% last year. Third-party application vendors are flocking to support Kafka and build open source connectors so their customers can partake of the digital riches flowing across these new Kafka pipes.

In short, Kafka has quickly become the de facto standard messaging bus underlying real-time stream processing, and that puts Confluent and Kreps right at ground zero of a new wave of innovation.

“It’s exciting because there aren’t that many truly new categories of infrastructure that come around,” he says. “There are a million and one databases, and I’m sure tomorrow there will be a million and two. But because there are so many, they end up being almost niches, whereas this can really be something that ties together and be the core data flow in the enterprise and the big central nervous system that everything comes off of.”

Building a digital version of a central nervous system comes with its share of pressure. If Kreps and company bungle something in Kafka, it could impact hundreds of thousands of companies and hundreds of millions of consumers who are dependent on data services that are being built on a Kafka foundation.

But Kreps and his Confluent colleagues are building a reputation for getting stuff right, for rolling things out slowly and making sure they’ve dotted their i’s and crossed their t’s before publicly stating that new releases are ready for production use. The company is nearing completion of a major new feature in Kafka, exactly once processing, that is the holy grail of distributed computing, and they’re being extra cautious about it.

Before starting to code the exactly once processing feature in Kafka, the company published its theory on how to solve the problem in a paper, with the idea that other distributed systems experts could poke holes in the idea and find flaws. Kreps and his colleagues had kicked around ideas for how to add more transactional capabilities to the distributed messaging system, and wanted to see if they were way off base.

“It’s a little like security in that way,” Kreps says. “You want people to find the flaws early on rather than when it’s running in production and it crashes. We’ve got through that process, and now we’re going through the process of implementing it.”

You can follow the development of the exactly once feature on GitHub if you so desire. It’s not yet ready for production, but Kreps says it’s on track to be added to Kafka, perhaps before the Kafka Summit takes place in New York City in May.

Kreps seems to relish tackling these sorts of distributed systems engineering challenges. It’s a high-stakes game being played out in a public venue, with tens of millions of dollars on the line. It’s not a game for the faint of heart, but the rewards are potentially much greater than even the $80 million that Confluent has received in venture capital so far.

“It’s actually technically challenging to make something so complicated actually be simple and easy to use and build on,” Kreps says. “I guess we’ll be judged based on whether or not we can make that successful, as this category emerges and as it matures and as more applications come. It’s not an easy thing to do. I’m super proud of what we’ve built so far and I hope that we continue to be perceived that way by the people adopting it.”

The baffling thing about Kafka, the enigma as it were, is that it exists on the cutting edge of what computer science can accomplish, and yet it remains a relatively simple tool, especially compared to its distributed-system sibling, Hadoop. That’s been the design principle of Kafka since the beginning, and Kreps hopes that it will remain that way into the foreseeable future.

“That’s really the challenge we’ve set for ourselves,” he says. “We’d like this to be the way that if you’re going to build asynchronous microservices in your company, this would be the tool you would reach for. Not the tool you have to reach for if your data is really, really big or the tool you’re forced to after you’ve exhausted all the other solutions, but the thing that’s the easiest and best to go to start with, and it will scale horizontally to the size of a company.”
