February 18, 2015

Cloudera Brings Kafka Under Its ‘Data Hub’ Wing

Cloudera is making Apache Kafka a supported part of its Hadoop distribution, the company announced today. While Kafka still doesn’t run on Hadoop, Cloudera says the changes it is instituting will help CDH customers build real-time analytics applications that span Hadoop and Kafka.

Kafka is an open source message broker that’s designed to handle massive flows of streaming, real-time data, such as log data. The software was originally developed at LinkedIn, which uses it to process hundreds of millions of messages a day. In November, LinkedIn spun out a separate company called Confluent to drive the software and further commercialize it.

While Kafka doesn’t run under Hadoop, it’s commonly deployed alongside the big data platform, particularly when real-time streaming analytics are involved. For that reason, Hadoop distributors like Cloudera and Hortonworks have worked to improve how Kafka works with their software.

Cloudera today announced that it’s making three important changes in how it works with Kafka. For starters, Cloudera now includes Kafka in the download for CDH, Cloudera’s Distribution of Hadoop. Customers will also be able to configure and manage their Kafka clusters using Cloudera Manager, the vendor’s management software for CDH. Finally, Cloudera will officially support Kafka as part of its technical support contracts.

It’s all about taking Kafka-CDH implementations to the next level, says Alex Gutow, product marketing manager for Cloudera. “They pair very nicely together,” Gutow tells Datanami. “There’s a growing popularity around it, especially as real-time streaming use cases are getting more and more popular and necessary.”

Kafka was the first technology accepted into Cloudera Labs, an initiative that Cloudera launched in 2014 to accelerate the hardening of big data technologies. It is now also the program’s first graduate, Gutow says.

“We’ve put in a lot of work to make it production-grade for the rest of the platform,” she says. “It’s really matured to the point where we think it would be a good addition to the platform. We’re seeing it be used for mature, production-ready use cases.”

As Cloudera’s customers get more comfortable with big data analytics, they’re increasingly looking to speed up the insights they’re generating. That push toward real-time analytics is an industry-wide phenomenon, of course. But Cloudera sees Kafka playing a big role in powering those types of applications, alongside other technologies like Spark Streaming, HBase, and Flume.

“We already had a handful of customers running Kafka in production, so as they get more mature with it and we see how their use cases expand, that’s going to be a key driver,” Gutow says. “We’re working with Confluent to see how the project continues to evolve.”

Part of that discussion with Confluent is looking at ways to run Kafka natively on Hadoop. While it’s possible to run Kafka on the same physical cluster where Hadoop is running, the two applications are not aware of each other. It would seem to be desirable to bring the technologies closer together, perhaps by making Kafka a YARN-compatible program for Hadoop version 2 clusters. That is the approach that DataTorrent is taking with Project Koya, which the company unveiled in December.

Cloudera isn’t ready to discuss possible strategies for bringing the technologies together. “We’re currently evaluating a lot of different options there, so it’s definitely something we’re keeping an eye on,” Gutow says. “We’ve been working pretty closely with the folks at Confluent, Jay Kreps and the other committers in the space. It’s something we’re having conversations around.”

Related Items:

Why Kafka Should Run Natively on Hadoop

LinkedIn Spinoff Confluent to Extend Kafka

LinkedIn Centralizing Data Plumbing with Kafka
