Apache Kafka co-creator Jun Rao at Kafka Summit September 30, 2019
Databases are powerful abstractions that allow people and programs to manipulate large amounts of data in a predictable, repeatable way. There’s a good reason why SQL is so important to modern data processing. But in some circumstances, a database is the wrong tool for the job. That was definitely the case at LinkedIn, says Apache Kafka co-creator Jun Rao.
According to Rao’s keynote speech at Kafka Summit yesterday, LinkedIn‘s data architecture in the early 2010s didn’t differ much from the typical enterprise. There was a hefty dose of relational databases and SQL, two technologies that had proven themselves in countless deployments over the past two decades.
“The initial architecture for data in LinkedIn was pretty standard, in SQL,” Rao told a couple of thousand Kafka Summit attendees Monday morning. “In the middle was a centralized database. That’s actually where LinkedIn put all its data.”
The years that Rao spent on LinkedIn’s engineering team corresponded with very high growth for the business-oriented social media property. The company was adding millions of users every year and expanding the number of services available to users at a fast clip.
As the outfit grew, the engineering team started to wonder whether the relational database was the right choice to hold all of the data that was being generated. Relational databases were designed to store normalized data that was transactional in nature. But the data that LinkedIn was generating had a different characteristic.
“Over time we realized there’s a lot more digitized information that’s not transactional in nature but it’s equally useful,” Rao said. “For example, if you click on a link, type in keywords, or interact with mobile devices — these are not really transactional data, but they carry … strong user intention.”
Apache Kafka is the second-most popular Apache project, Rao says
This event data didn’t easily fit into LinkedIn’s database, but it was still useful, Rao says. Events, such as job changes, were very important for LinkedIn’s business, but detecting that event was not a simple matter with the database paradigm. “If you can leverage that information, you can build more engaging use cases,” he says.
Eventually, Rao’s team at LinkedIn came to two important observations about the data the company was generating and how it was choosing to store it. Once it came to those realizations, Rao says, the company realized that the relational database was actually holding them back.
“The first mismatch is this: Database are really built for states, not for events,” Rao says. “If you look inside the database, the database actually has a pretty good way of capturing those event changes, which is the commit logs. But for the database, the way you interact with the database is not through that commit log. It’s acutely through the tables and indexes which are derived from that commit log and the table actually just captures the current state.”
The way users interact with this database, through SQL, poses other problems, because it only queries the current state stored in those tables, not the underlying events that were recorded in the commit log and trickled up to the index or table.
“Since SQL is the only interface available in the database, a lot of applications are getting those events through the SQL interface, which is a mismatch in terms of the interface, and is also not an efficient way to get those events,” Rao continues. “So the result is, as the number of event-driven applications grows, the database becomes overwhelmed to the extent that it’s completely unresponsive.”
The second mismatch has to do with the volume of data LinkedIn was trying to store in a relational database. “For LinkedIn to be a digital company, it needs to leverage not only that transactional data, but also non-transactional data,” Rao says. “And in terms of volume, the latter part can easily be 1,000 times bigger. And if you look at a database, it’s mostly designed for storing the transactional part of the data, and [adding] events like clickstream and keywords and application logs to the database–it’s going to be prohibitively expensive.”
Kafka was designed to manage LinkedIn’s clickstream data (whiteMocca/Shutterstock)
The underlying dilemma facing Rao and his engineering colleagues at LinkedIn was that the existing database-oriented infrastructure was simply a poor fit for what they were trying to do with it. The company attempted to patch together a solution anyway, and it was basically an ugly mess composed of local data stores, key-value databases, relational databases, and other technologies.
Rao and his colleagues put their heads together and tried to come up with a better solution. The event-oriented nature of much of LinkedIn’s data strongly suggested to them that a messaging bus-based approach might work better.
“We actually tried a few of those,” Rao says. “The short answer is none of those messaging systems acutely worked for LinkedIn’s use case. If you look inside why those systems don’t’ work, it boils down to those two fundamental issues. The first thing is a lot of the traditional messaging systems, they really are built for the transactional data, so when you’re trying to flow a thousand times more digitalized data to those systems, that’s when they start breaking.”
The second thing that LinkedIn learned is a lot of those messaging systems were not optimized for real time. “As a result, you’re doing things like caching all day in memory,” Rao says. “Now if you have to accommodate that for 1,000 more data and your downstream applications starts banging a little bit, that’s when your execution [fails].”
Rao’s colleagues, including Kafka co-creators Neha Narkhede and Jay Kreps, agreed to consider other options. “At that point we realized, there’s really no architecture or solution we can use to build those event driven [system] for all digitalized data. Maybe we should just consider building something ourselves. How hard can that be?”
Higher Abstractions, Lower Complexity in Kafka’s Future
The Real-Time Rise of Apache Kafka
The Real-Time Future of Data According to Jay Kreps