Facebook Graph: More Than a Trillion Edges Served
With more than a billion subscribers and hundreds of billions of connections among them, Facebook has a pretty big stockpile of user information, which it is opening up to outside developers to mine through its Facebook Graph service. And because the new service is based on Apache’s Giraph, the social media giant shouldn’t have to worry about scalability anytime soon.
Facebook Graph is a promising new service the company launched in January to enable developers to build applications–via the Facebook Graph API–that query user-generated data on Facebook and get back personalized results in real time. The services works by mapping the relationships between different nodes, such as people, music, food, or places. Different nodes can be connected in a variety of ways, which are defined as edges.
Facebook Graph enables developers to tap into this massive collection of points and lines, and figure out, in real time, that user Bobby X is listening to a streaming radio station Y, or that user Susy Q just liked restaurant X. (Wait, what’s that noise? Oh, it’s just the sound of thousands of advertisers drooling over the possibilities this entails.)
In an August blog post, Facebook developer Avery Ching shared some of the technical details behind Facebook Graph, and how the company picked Apache Giraph as its core underlying graphing technology. What it comes don to is pure, unadulterated scalability and performance advantages over alternative graphing engines, including Apache Hive and GraphLab.
Giraph, in case you’re not aware, is a Java graphing program that sits on Hadoop. It was originally developed at Yahoo based on a paper that Google engineers wrote describing their own advanced graphing mechanism, called Pregel. Yahoo donated the code to the Apache Software Foundation, where it is currently maintained by the likes of Facebook, LinkedIn, Twitter, and Hortonworks.
“We ended up choosing Giraph for several compelling reasons,” Ching writes in the blog. “Giraph directly interfaces with our internal version of HDFS [since Giraph is written in Java] and talks directly to Hive. Since Giraph runs as a MapReduce job, we can leverage our existing MapReduce [Corona] infrastructure stack with little operational overhead. With respect to performance, at the time of testing Giraph was faster than the other frameworks–much faster than Hive.”
How much, exactly, is “much”? According to a recent video of a presentation Ching made earlier this summer, “much” could be more properly described as “three orders of magnitude.”
The Facebook team set up a little graphing testbed to compare how long it would take Hive and Giraph to process a page rank graph that contained more than 400 billion edges, a pretty sizable graph. Giraph was able to process the work 120 times faster than Hive, and had a 26x performance advantage in terms of computing resources, according to Ching.
“This is significant because without these kinds of performance improvements, we really can’t run these applications reliably in production and at scale, and get them back in a reasonable time,” Ching says in the video. “If we’re trying to compute this value every week, and it takes a week to do it, it’s not very good.”
|Apache Giraph demonstrates near linear scalability on Facebook’s cluster.
How far could Giraph scale? After all, processing for a billion edges “isn’t cool,” Ching said in the video. “It’s boring. What’s really cool?” (You guessed it.) “A trillion edges.”
So off the Facebook engineers went to the lab to try another test. This time, they assembled a graph with more than 1 trillion edges–a massive graph, to be sure–and ran it on a cluster of 200 servers. The processing took less than four minutes per iteration.
That fantastic speed is not the result of any major tweaking beyond what Facebook did to get Giraph running well on its servers, including improving its memory optimization, reducing Java’s object allocation, and improving network communication.
“We didn’t cheat,” Ching said in the video. “People in the academic community do things like pre-partitioning the graph to reduce the network traffic and to improve locality. We didn’t do that. This is a random hashing algorithm, so we’re randomly assigning the vertices to different machines in the system. Obviously, if we do some separation and locality optimization, we can get this number down quite a bit.”
The future certainly looks promising for Facebook Graph (which is still in beta) to determine connections out of massive mounds of data. Ching says Facebook is looking to make its Graph service easier to use, and to eliminate the need to know Java. “We want to make things like streaming or WebUIs or even higher level languages that bring graph processing to the masses the same way that SQL did for MapReduce,” he says in the video.