The Real-Time Rise of Apache Kafka
Shiny new objects are easy to find in the big data space. So when the industry’s attention shifted toward processing streams of data in real time, as opposed to the batch-style processing that was popular with first-generation Hadoop, we saw dozens of promising new technologies pop up seemingly overnight. One of them was Apache Kafka.
The interesting thing is that Kafka wasn’t actually new. Jay Kreps started writing the software, which serves as a messaging layer for moving data, when he worked at LinkedIn in 2008, and the software was contributed to open source in 2011.
Kafka was created to solve LinkedIn’s data movement problems, which were considerable. Kafka is used to route more than 800 billion messages per day at LinkedIn, amounting to more than 175TB of data, according to the company’s engineering department. There are just a handful of companies in the world that need to handle that kind of data volume, and LinkedIn is one of them.
Kafka was hardened at LinkedIn to handle insane volumes of data while maintaining features like durability, statefulness, and resilience. Soon other companies started using the open source message broker: first the other big Silicon Valley Web properties like Twitter, Netflix, and Yahoo, and then big IT shops outside of the Valley, like Cerner, Goldman Sachs, and Wal-Mart.
Word began to spread about how well Kafka solved this particular problem of moving high volumes of messages among Hadoop clusters, operational databases, search engines, and other data repositories. It wasn’t long before Kafka was being paired up with stream processing engines, such as Apache Samza (also developed by Kreps’ team at LinkedIn), to actually work on the data as it comes streaming in. Then other stream processing engines were laid atop Kafka pipes, including Apache Storm, S4, Apache Spark, Apache Apex, and Apache Flink.
Organizations increasingly began using these open source technologies to build real-time applications for a number of use cases, such as log clickstream analysis, fraud detection, and recommendation systems. Kafka’s use cases will only grow as machine data flows off cars, windmills, and medical devices hooked up to the IoT.
In late 2014, two creators of Kafka at LinkedIn, Jay Kreps and Neha Narkhede, founded the company Confluent to continue building Kafka and a surrounding ecosystem. In the 11 months immediately following the launch of Confluent, Kafka’s adoption rate increased by 7x. Internet searches for Apache Kafka doubled during that time, according to Google Trends, while the number of jobs listed on Indeed.com that include Kafka as a job requirement spiked by 1,000 percent.
“It’s amazing how the last year has brought such phenomenal growth. It’s almost like a hockey-stick curve,” Confluent CTO Narkhede told Datanami in an interview at the Strata + Hadoop World conference last week. “Since we announced Confluent it seems there’s a lot of renewed interest in Apache Kafka.”
Don’t expect Kafka to be a flash in the pan, like so many other technologies that are hot one day and cold the next. At Confluent, the company appears to be taking a pragmatic, long-term approach to its business plan, with a strong focus on the needs of the community and its customers.
Confluent has made two big product launches in the past few months that extend Kafka’s functionality and make it easier to use: Kafka Streams and Kafka Connect, both of which are available in Kafka version 0.9.
Connect and Streams
Kafka Streams is a stream processing engine that sits atop a Kafka “substrate” to provide basic processing capabilities on streaming data. Narkhede describes Kafka Streams as a lightweight library of functions that’s built atop Kafka primitives. Organizations that have already defined their data flows in Kafka topics can easily leverage Kafka Streams to do things such as filtering, joining, or mapping the data, she says.
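To make the filtering, mapping, and joining that Narkhede mentions a bit more concrete, here is a minimal sketch in plain Python. It does not use the Kafka Streams API, which is a Java library; the function names, the record layout, and the sample fields (`action`, `cents`, `country`) are all assumptions made up for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# A record is a (key, value) pair, loosely mirroring a keyed message.
# This layout is an assumption for the sketch, not Kafka's wire format.
Record = Tuple[str, dict]


def filter_records(records: Iterable[Record],
                   predicate: Callable[[dict], bool]) -> Iterator[Record]:
    """Keep only the records whose value satisfies the predicate."""
    return (r for r in records if predicate(r[1]))


def map_values(records: Iterable[Record],
               fn: Callable[[dict], dict]) -> Iterator[Record]:
    """Transform each value and leave the key untouched."""
    return ((key, fn(value)) for key, value in records)


def join_with_table(records: Iterable[Record],
                    table: Dict[str, dict]) -> Iterator[Record]:
    """Inner join: enrich each record with the table row that shares its key."""
    for key, value in records:
        if key in table:
            yield key, {**value, **table[key]}


def enrich_purchases(events: Iterable[Record],
                     users: Dict[str, dict]) -> List[Record]:
    """Chain the three steps: filter, then map, then join."""
    purchases = filter_records(events, lambda v: v["action"] == "purchase")
    in_dollars = map_values(purchases, lambda v: {"amount": v["cents"] / 100})
    return list(join_with_table(in_dollars, users))


if __name__ == "__main__":
    page_events = [
        ("alice", {"action": "view", "cents": 0}),
        ("bob", {"action": "purchase", "cents": 1250}),
        ("alice", {"action": "purchase", "cents": 400}),
        ("carol", {"action": "purchase", "cents": 500}),
    ]
    user_table = {"alice": {"country": "US"}, "bob": {"country": "DE"}}
    for record in enrich_purchases(page_events, user_table):
        print(record)
```

In this toy version, the view event is filtered out, and the purchase by `carol` is dropped at the join because there is no matching row in the user table. A real stream processor would apply the same steps continuously to unbounded topics rather than to an in-memory list.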
“We’ve depended on Kafka in a big way to build out a lot of the primitives [such as] how do you shard your data across a cluster of machines, how do you make sure data arrives in order, how do you make sure as machines fail your data isn’t lost and it’s fully replicated,” she says. “It has taken us years to stabilize that in Kafka, and now Kafka is ubiquitous.
“Our opinion is, if you like Kafka and you’ve already deployed it, then why would you build these primitives?” she continues. “We’ve followed a very incremental approach and made sure the foundation was built out right, using the primitives.”
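The primitives Narkhede lists (sharding data across machines, preserving order, and surviving machine failures through replication) can be illustrated with a toy model. The sketch below is an assumption-laden simplification in plain Python, not Kafka’s implementation: the `TinyLog` class and `partition_for` function are hypothetical, the hash is CRC32 (Kafka’s default partitioner uses a different hash), and replication here is a synchronous copy to every replica.

```python
import zlib
from typing import List, Set, Tuple


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash.

    CRC32 is used only because it ships with Python; it is an
    assumption of this sketch, not the hash Kafka uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class TinyLog:
    """A toy partitioned, replicated, append-only log (hypothetical)."""

    def __init__(self, num_partitions: int, replication_factor: int) -> None:
        self.num_partitions = num_partitions
        # replicas[r][p] is replica r's copy of partition p.
        self.replicas: List[List[List[Tuple[str, str]]]] = [
            [[] for _ in range(num_partitions)]
            for _ in range(replication_factor)
        ]

    def append(self, key: str, value: str) -> Tuple[int, int]:
        """Append to every replica; return (partition, offset)."""
        partition = partition_for(key, self.num_partitions)
        for replica in self.replicas:
            replica[partition].append((key, value))
        return partition, len(self.replicas[0][partition]) - 1

    def read(self, partition: int,
             failed_replicas: Set[int] = frozenset()) -> List[Tuple[str, str]]:
        """Read a partition from the first replica that has not failed."""
        for index, replica in enumerate(self.replicas):
            if index not in failed_replicas:
                return list(replica[partition])
        raise RuntimeError("all replicas of this partition have failed")


if __name__ == "__main__":
    log = TinyLog(num_partitions=4, replication_factor=3)
    partition, _ = log.append("user-1", "login")
    log.append("user-1", "click")
    # Two of three replicas are down; the data is still readable, in order.
    print(log.read(partition, failed_replicas={0, 1}))
```

Because records with the same key always hash to the same partition, and each partition is append-only, per-key order is preserved. Ordering across different partitions is not guaranteed in this model.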
You can expect more capabilities to be added to Kafka Streams in the coming months and years, including more fine-grained queries of the state of data. “In Kafka, a lot of our focus is on operational stability and simplicity,” Narkhede says. “That’s essentially our focus with Streams. We want to: A. help the community adopt it; B. pay a lot of attention to simplicity, ease of use, and operational stability. Then we can add fancier capabilities.”
Once Kafka Streams is stable, Confluent’s plan calls for building out vertical solutions on top of that core, including solutions for security, machine learning, and monitoring. The possibility of collecting and harnessing data from medical devices in pursuit of advanced precision medicine is not far outside of Confluent’s thoughts either.
The other new capability in version 0.9, Kafka Connect, is also having an impact on the data integration industry. One of the problems with building connectors is that they’ve nearly always been custom, one-off solutions. Standards, such as ODBC and JDBC, provide some basis to work from, but actual implementations have nearly always suffered from incompatibility and the fact that relational database vendors rarely do things the same way.
Does Kafka Connect have the power to change that? It’s still early days, but the results so far look promising.
“Connect is a framework to essentially solve all the common problems that any source or sink has,” Narkhede says. “The whole framework is structured around making sure the connector developer can just build connectors, and any two connectors will just work with each other.”
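The idea that any two connectors “just work with each other” comes down to every source and every sink agreeing on one shared contract. The following is a rough sketch of that idea in plain Python. It is not the Kafka Connect API, which is written in Java; the `Source`, `Sink`, `ListSource`, and `DictSink` classes and the `run_pipeline` function are hypothetical names invented for illustration, and the records are handed over directly rather than passing through Kafka topics as they would in Connect.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List

# The shared record shape is an assumption of this sketch.
Record = Dict[str, object]


class Source(ABC):
    """Anything that can produce batches of records."""

    @abstractmethod
    def poll(self) -> List[Record]:
        """Return the next batch, or an empty list when nothing is left."""


class Sink(ABC):
    """Anything that can consume batches of records."""

    @abstractmethod
    def put(self, records: List[Record]) -> None:
        """Write one batch to the destination system."""


class ListSource(Source):
    """Hypothetical source that drains an in-memory list in one batch."""

    def __init__(self, rows: Iterable[Record]) -> None:
        self._rows = list(rows)

    def poll(self) -> List[Record]:
        batch, self._rows = self._rows, []
        return batch


class DictSink(Sink):
    """Hypothetical sink that upserts records into a dict by key."""

    def __init__(self) -> None:
        self.store: Dict[object, object] = {}

    def put(self, records: List[Record]) -> None:
        for record in records:
            self.store[record["key"]] = record["value"]


def run_pipeline(source: Source, sink: Sink) -> int:
    """Move every record from source to sink; return how many were moved."""
    moved = 0
    while True:
        batch = source.poll()
        if not batch:
            return moved
        sink.put(batch)
        moved += len(batch)


if __name__ == "__main__":
    source = ListSource([{"key": "a", "value": 1}, {"key": "b", "value": 2}])
    sink = DictSink()
    print(run_pipeline(source, sink), sink.store)
```

Because `run_pipeline` depends only on the two abstract interfaces, swapping in a different source or sink requires no change to the pipeline itself, which is the property the quote describes.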
According to Narkhede, the Kafka community built 20 Kafka Connect connectors during the first month it was available. Anybody can go to Confluent’s Kafka Connector Hub and download connectors that integrate MySQL, HDFS, Elasticsearch, Cassandra, MongoDB, and Amazon S3.
All this development is pushing Apache Kafka into the forefront of big data processing, where it’s serving as much-needed glue to connect all the disparate systems that have cropped up. Confluent plans to encourage the community to take Streams and Connect even further with a “hackathon” during the Kafka Summit, which takes place April 26.
If you don’t have your ticket to the Kafka Summit, tough luck: the conference is already sold out. Such is life in the fast-moving world of Apache Kafka.
“It’s super exciting. The community has responded very well,” Narkhede says. “Having been through [the initial launch of] Apache Kafka, it took a couple years to get it to broad adoption. Connect and Streams have been super fast. The adoption has been just amazing.”