Facebook Advances Giraph With Major Code Injection
Apache Giraph received a Facebook-sized shot in the arm, as the social network announced that they’ve injected performance enhancing code into the trunk of the open source graph analytics project, scaling the capabilities of the framework past a trillion edges.
Giraph, which mimics Google’s Pregel system, (which itself was inspired by the Bulk Synchronous Parallel model developed by Leslie Valiant in the 1980’s) is an iterative graph analytics framework built for large scale modeling. Graphs use a series of nodes (or vertices) and links between the relative nodes, called “edges.” Using graphs, companies like Google and Facebook are able to map relationships between such things as web pages, users, and their preferences (likes) – and then use that information to better target content.
With its first release in February of 2012, Giraph has now won the Facebook sweepstakes, and is looking to be a big deal in the months and years to come. After a bake-off between various graph-processing platforms, including Apache Hive, GraphLab, and Apache Giraph, the social-network opted to put their seal of approval and their considerable muscle behind the open-source Giraph.
There were several compelling reasons for choosing Giraph, explained Facebook engineer, Avery Ching, in a recent article:
- Giraph directly interfaces with Facebook’s own internal brand of HDFS (which we wrote about here)
- Giraph talks directly with Hive
- Giraph runs as a MapReduce Job, allowing them to leverage Corona, Facebook’s existing MapReduce infrastructure stack with little operational overhead
- Giraph was faster than the other frameworks (at least at the time of testing).
- Giraph’s graph-based API, supports a wide array of graph applications in a way that is easy to understand.
- Giraph added other useful features including master computation and composable computation
Once they selected Giraph, they social network went about customizing it to fit their needs at their massive scale. Ching says that they picked three production applications to drive development: label propagation, variants of page rank, and k-means clustering.
In order to run these applications at Facebook scale (over 1 billion users and hundreds of billions of friendships), the company got to work on muscling Giraph up. Among the most significant renovations that Zuckerberg’s worker bees made to the framework is the capacity for performance-boosting multi-threading in order to mitigate problems that they had sharing resources with other Hadoop tasks running on the same machine. “When Giraph takes all the task slots on a machine in a homogenous cluster, it can mitigate issues of different resource availabilities for different workers (slowest worker problem). For these reasons, we added multithreading to loading the graph, computation, and storing the computed results ,” wrote Ching on the upgrade.
Another significant improvement included memory optimization, where Ching says Giraph was a memory behemoth due to all data types being stored as separate Java objects. To address this challenge, Facebook engineers opted to serialize every vertex and its edges into a byte array, as well as messages on the server (as opposed to being stored as separate Java objects). “Reducing memory use was a big factor in enabling the ability to load and send messages to 1 trillion edges,” explained Ching.
Facebook made additional improvements, including the implementation of sharded aggregators, which they says gives them the ability to efficiently handle tens of gigabytes of aggregator data coming in from every worker, balancing them across workers as opposed to being bottlenecked by a master. They also made improvements in input and write-back flexibility, as well as the creation of HiveIO, a Hadoop I/O format style API that can be used to talk to Hive in a MapReduce job.
The ultimate outcome of their improvement is a drastically souped-up Giraph, which they say is faster, more memory efficient, and supremely scalable. “On 200 commodity machines, we are able to run an iteration of page rank on an actual 1 trillion edge social graph formed by various user interactions in under four minutes with the appropriate garbage collection and performance tuning,” Ching boasted.
The company has opted to put their code back into the trunk branch of Giraph, said Ching, giving all of these performance improvements back to the community, along with a stable API and copious documentation which includes a page rank example to get developers started. The enhancements have already been released as part of the 1.0.0 version of the Apache distribution.