August 16, 2013

Facebook Advances Giraph With Major Code Injection

Isaac Lopez

Apache Giraph received a Facebook-sized shot in the arm, as the social network announced that it has injected performance-enhancing code into the trunk of the open source graph analytics project, scaling the capabilities of the framework past a trillion edges.

Giraph, which mimics Google’s Pregel system (itself inspired by the Bulk Synchronous Parallel model developed by Leslie Valiant in the 1980s), is an iterative graph analytics framework built for large-scale modeling. Graphs consist of nodes (or vertices) and the links between those nodes, called “edges.” Using graphs, companies like Google and Facebook are able to map relationships between such things as web pages, users, and their preferences (likes) – and then use that information to better target content.
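To make the vertex-centric, Bulk Synchronous Parallel idea concrete, here is a minimal single-process sketch of one PageRank-style superstep. It is an illustration only and does not use Giraph's API: the class name, the method name, the array-based graph layout, and the damping parameter are all assumptions made for this example.

```java
// Hypothetical illustration of one BSP superstep; not Giraph's API.
final class SuperstepSketch {

    // outEdges[v] lists the vertices that vertex v links to.
    // Returns the rank of every vertex after one superstep.
    static double[] pageRankSuperstep(int[][] outEdges, double[] rank, double damping) {
        int n = rank.length;
        double[] inbox = new double[n];

        // Message phase: every vertex sends rank / out-degree along each edge.
        for (int v = 0; v < n; v++) {
            for (int target : outEdges[v]) {
                inbox[target] += rank[v] / outEdges[v].length;
            }
        }

        // Barrier, then compute phase: each vertex combines what it received.
        double[] next = new double[n];
        for (int v = 0; v < n; v++) {
            next[v] = (1.0 - damping) / n + damping * inbox[v];
        }
        return next;
    }

    public static void main(String[] args) {
        int[][] cycle = { {1}, {2}, {0} }; // 0 -> 1 -> 2 -> 0
        double[] start = { 1.0 / 3, 1.0 / 3, 1.0 / 3 };
        double[] next = pageRankSuperstep(cycle, start, 0.85);
        System.out.println(java.util.Arrays.toString(next));
    }
}
```

In a distributed framework the two loops run on different machines, and the barrier between them is what makes the model "bulk synchronous": no vertex starts computing until all messages from the previous phase have been delivered.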

First released in February of 2012, Giraph has now won the Facebook sweepstakes, and is looking to be a big deal in the months and years to come. After a bake-off between various graph-processing platforms, including Apache Hive, GraphLab, and Apache Giraph, the social network opted to put its seal of approval and its considerable muscle behind the open-source Giraph.

There were several compelling reasons for choosing Giraph, explained Facebook engineer Avery Ching in a recent article:

- Giraph directly interfaces with Facebook’s own internal brand of HDFS.
- Giraph talks directly with Hive.
- Giraph runs as a MapReduce job, allowing them to leverage Corona, Facebook’s existing MapReduce infrastructure stack, with little operational overhead.
- Giraph was faster than the other frameworks (at least at the time of testing).
- Giraph’s graph-based API supports a wide array of graph applications in a way that is easy to understand.
- Giraph added other useful features, including master computation and composable computation.

Once they selected Giraph, the social network went about customizing it to fit their needs at their massive scale. Ching says that they picked three production applications to drive development: label propagation, variants of page rank, and k-means clustering.

In order to run these applications at Facebook scale (over 1 billion users and hundreds of billions of friendships), the company got to work on muscling Giraph up. Among the most significant renovations that Zuckerberg’s worker bees made to the framework is performance-boosting multithreading, added to mitigate problems they had sharing resources with other Hadoop tasks running on the same machine. “When Giraph takes all the task slots on a machine in a homogenous cluster, it can mitigate issues of different resource availabilities for different workers (slowest worker problem). For these reasons, we added multithreading to loading the graph, computation, and storing the computed results,” wrote Ching on the upgrade.
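One way to picture multithreaded computation inside a single worker is a thread pool that processes the worker's graph partitions in parallel. The sketch below is only that, a picture: the class and method names are hypothetical, each `double[]` merely stands in for one partition's vertex values, and summing them stands in for the real per-vertex computation. It is not how Giraph itself is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration; names and structure are assumptions, not Giraph code.
final class PartitionThreadsSketch {

    // Runs one task per partition on a fixed-size pool and combines the results.
    static double computeAll(List<double[]> partitions, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> results = new ArrayList<>();
            for (double[] partition : partitions) {
                results.add(pool.submit(() -> {
                    double sum = 0.0;
                    for (double value : partition) {
                        sum += value; // stand-in for per-vertex computation
                    }
                    return sum;
                }));
            }
            double total = 0.0;
            for (Future<Double> result : results) {
                total += result.get(); // wait for each partition to finish
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("partition task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<double[]> partitions = Arrays.asList(
                new double[] {1, 2}, new double[] {3}, new double[] {4, 5});
        System.out.println(computeAll(partitions, 2));
    }
}
```

With one multithreaded worker per machine instead of several single-threaded tasks, the partitions share one JVM and one heap, which is the kind of setup the quote describes when Giraph takes all the task slots on a machine.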

Another significant improvement was memory optimization. Ching says Giraph was a memory behemoth because all data types were stored as separate Java objects. To address this challenge, Facebook engineers opted to serialize every vertex and its edges into a byte array, and to serialize messages on the server as well. “Reducing memory use was a big factor in enabling the ability to load and send messages to 1 trillion edges,” explained Ching.
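The idea of packing a vertex and its edges into one byte array, rather than keeping a separate Java object per edge, can be sketched as follows. The field layout (a `long` id, a `double` value, an edge count, then target id and weight pairs) and all names are assumptions chosen for illustration; the article does not describe Giraph's actual serialization format.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration; the layout is an assumption, not Giraph's format.
final class ByteArrayVertexSketch {

    // Packs a vertex id, its value, and its out-edges into a single byte[].
    static byte[] serialize(long id, double value, long[] targets, float[] weights) {
        int size = Long.BYTES + Double.BYTES + Integer.BYTES
                + targets.length * (Long.BYTES + Float.BYTES);
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putLong(id).putDouble(value).putInt(targets.length);
        for (int i = 0; i < targets.length; i++) {
            buf.putLong(targets[i]).putFloat(weights[i]);
        }
        return buf.array();
    }

    // Iterates over the edges in place, without creating one object per edge.
    static float sumEdgeWeights(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        buf.position(Long.BYTES + Double.BYTES); // skip id and value
        int edgeCount = buf.getInt();
        float sum = 0f;
        for (int i = 0; i < edgeCount; i++) {
            buf.getLong();         // target id, unused in this example
            sum += buf.getFloat(); // edge weight
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] packed = serialize(42L, 1.0, new long[] {7L, 9L}, new float[] {0.5f, 1.5f});
        System.out.println(packed.length + " bytes, edge weight sum " + sumEdgeWeights(packed));
    }
}
```

In this layout each edge costs 12 bytes of payload, whereas a separate object per edge would add a JVM object header and a reference on top of the same fields, and would also give the garbage collector far more objects to track.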

Facebook made additional improvements, including the implementation of sharded aggregators, which they say give them the ability to efficiently handle tens of gigabytes of aggregator data coming in from every worker by balancing the aggregators across workers rather than bottlenecking on a master. They also improved input and write-back flexibility and created HiveIO, a Hadoop I/O format-style API that can be used to talk to Hive in a MapReduce job.
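The sharding idea, as described, is that each aggregator is owned by one of the workers instead of being reduced on the master. A minimal in-process sketch follows. The hash-based ownership rule, the sum-only aggregators, and every name here are assumptions made for illustration; the article does not say how Giraph assigns aggregators to workers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; the assignment rule is an assumption, not Giraph's.
final class ShardedAggregatorSketch {

    // Picks the worker that owns a given aggregator.
    static int ownerOf(String aggregatorName, int workerCount) {
        return Math.floorMod(aggregatorName.hashCode(), workerCount);
    }

    // partials.get(w) holds worker w's local partial sums, keyed by aggregator name.
    // Returns, for each worker, the final sums of the aggregators it owns.
    static List<Map<String, Long>> reduce(List<Map<String, Long>> partials) {
        int workers = partials.size();
        List<Map<String, Long>> owned = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            owned.add(new HashMap<>());
        }
        for (Map<String, Long> partial : partials) {
            for (Map.Entry<String, Long> entry : partial.entrySet()) {
                int owner = ownerOf(entry.getKey(), workers);
                // "Send" the partial value to the owner, which combines it.
                owned.get(owner).merge(entry.getKey(), entry.getValue(), Long::sum);
            }
        }
        return owned;
    }
}
```

Because ownership is spread over all workers, the incoming aggregator traffic is divided among them; a master then only needs to collect one final value per aggregator rather than one partial value per aggregator per worker.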

The ultimate outcome of their improvements is a drastically souped-up Giraph, which they say is faster, more memory efficient, and supremely scalable. “On 200 commodity machines, we are able to run an iteration of page rank on an actual 1 trillion edge social graph formed by various user interactions in under four minutes with the appropriate garbage collection and performance tuning,” Ching boasted.

Previously, the largest reported graphs belonged to Twitter (1.5 billion edges) and the Yahoo! Altavista graph (6.6 billion edges).

The company has opted to put their code back into the trunk branch of Giraph, said Ching, giving all of these performance improvements back to the community, along with a stable API and copious documentation which includes a page rank example to get developers started. The enhancements have already been released as part of the 1.0.0 version of the Apache distribution.

