There is data in motion, and then there is really big data in motion. LinkedIn gave us a compelling example of the latter today when it announced that it’s using the distributed messaging system Kafka to process more than 1.1 trillion messages per day.
Kafka of course was born at LinkedIn. Around the year 2010, the social media company found it was struggling to adequately move data through batch-oriented messaging systems, and so three LinkedIn engineers—Jay Kreps, Neha Narkhede, and Jun Rao—created a new system built on a distributed platform.
LinkedIn eventually released Kafka into open source, and the three engineers left to found a separate company, Confluent, to guide the development of Kafka and provide technical support. The company has big plans for Kafka, and is keen to help the 100 or so companies that have adopted the next-gen messaging bus get a handle on big, fast-moving data.
While Kreps and company were building Confluent, Kafka continued to grow to enormous proportions. Today, the system handles peaks of a mind-boggling 4.5 million messages per second. LinkedIn’s Senior Engineering Manager of Streams Infrastructure, Kartik Paramasivam, provided more details on the growth of Kafka in a blog post:
“We started using Kafka in production at large scale in July 2011 and at that point processed about 1 billion messages per day. This ticked up to 20 billion messages per day in 2012. In July 2013 we were processing about 200 billion messages per day through Kafka.
“A few months ago, we hit a new level of scale with Kafka,” he continues. “We now use it to process more than 1 trillion published messages per day with peaks of 4.5 million messages published per second – that equals about 1.34 PB of information per week. Each message gets consumed by about four applications on average.”
The folks over at Confluent celebrated the milestone with a blog post announcing Kafka’s entry into the “four commas” club (1,100,000,000,000 messages per day). As Narkhede explains, watching Kafka evolve from being the “nervous system” at LinkedIn into a core component at giant corporations has been exciting.
(Chart: Kafka downloads are increasing every month)
“Kafka plays a critical part in shaping LinkedIn’s infrastructure as well as that for the hundreds of other companies that use Kafka – from web giants like Netflix, Uber, and Pinterest to large enterprises like Cerner, Cisco and Goldman Sachs,” writes Narkhede, who is the head of engineering at Confluent. “At these companies, Kafka powers critical data pipelines, allows data to be synced in real-time across geographically distant data centers and is a foundation for real-time stream processing and analytics.”
As companies embrace new types of data, such as user activity data, log data from GPS-enabled devices, and financial data streams, Narkhede sees Kafka playing a central role in bringing all that data together.
“New high-volume data sources…that could not be collected previously in LinkedIn’s legacy systems are now easily collected using Kafka,” she writes. “The same data that goes into the offline data warehouse and Hadoop is available for real-time stream processing and analytics in all applications. And all the data collected is available for storage or access in the various databases, search indexes, and other systems in the company through Kafka.”
As real-time analytics moves from science fiction to reality, it’s pretty clear that systems like Kafka will play a central role in it.