September 12, 2013

Facebook Graph: More Than a Trillion Edges Served

Alex Woodie

With more than a billion subscribers and hundreds of billions of connections among them, Facebook has a pretty big stockpile of user information, which it is opening up to outside developers to mine through its Facebook Graph service. And because the new service is based on Apache’s Giraph, the social media giant shouldn’t have to worry about scalability anytime soon.

Facebook Graph is a promising new service the company launched in January to enable developers to build applications–via the Facebook Graph API–that query user-generated data on Facebook and get back personalized results in real time. The services works by mapping the relationships between different nodes, such as people, music, food, or places. Different nodes can be connected in a variety of ways, which are defined as edges.

Facebook Graph enables developers to tap into this massive collection of points and lines, and figure out, in real time, that user Bobby X is listening to a streaming radio station Y, or that user Susy Q just liked restaurant X. (Wait, what’s that noise? Oh, it’s just the sound of thousands of advertisers drooling over the possibilities this entails.)

In an August blog post, Facebook developer Avery Ching shared some of the technical details behind Facebook Graph, and how the company picked Apache Giraph as its core underlying graphing technology. What it comes don to is pure, unadulterated scalability and performance advantages over alternative graphing engines, including Apache Hive and GraphLab.

Giraph, in case you’re not aware, is a Java graphing program that sits on Hadoop. It was originally developed at Yahoo based on a paper that Google engineers wrote describing their own advanced graphing mechanism, called Pregel. Yahoo donated the code to the Apache Software Foundation, where it is currently maintained by the likes of Facebook, LinkedIn, Twitter, and Hortonworks.

“We ended up choosing Giraph for several compelling reasons,” Ching writes in the blog. “Giraph directly interfaces with our internal version of HDFS [since Giraph is written in Java] and talks directly to Hive. Since Giraph runs as a MapReduce job, we can leverage our existing MapReduce [Corona] infrastructure stack with little operational overhead. With respect to performance, at the time of testing Giraph was faster than the other frameworks–much faster than Hive.”

How much, exactly, is “much”? According to a recent video of a presentation Ching made earlier this summer, “much” could be more properly described as “three orders of magnitude.”

The Facebook team set up a little graphing testbed to compare how long it would take Hive and Giraph to process a page rank graph that contained more than 400 billion edges, a pretty sizable graph. Giraph was able to process the work 120 times faster than Hive, and had a 26x performance advantage in terms of computing resources, according to Ching.

“This is significant because without these kinds of performance improvements, we really can’t run these applications reliably in production and at scale, and get them back in a reasonable time,” Ching says in the video. “If we’re trying to compute this value every week, and it takes a week to do it, it’s not very good.”


Apache Giraph demonstrates near linear scalability on Facebook’s cluster.

How far could Giraph scale? After all, processing for a billion edges “isn’t cool,” Ching said in the video. “It’s boring. What’s really cool?” (You guessed it.) “A trillion edges.”

So off the Facebook engineers went to the lab to try another test. This time, they assembled a graph with more than 1 trillion edges–a massive graph, to be sure–and ran it on a cluster of 200 servers. The processing took less than four minutes per iteration.

That fantastic speed is not the result of any major tweaking beyond what Facebook did to get Giraph running well on its servers, including improving its memory optimization, reducing Java’s object allocation, and improving network communication.

“We didn’t cheat,” Ching said in the video. “People in the academic community do things like pre-partitioning the graph to reduce the network traffic and to improve locality. We didn’t do that. This is a random hashing algorithm, so we’re randomly assigning the vertices to different machines in the system. Obviously, if we do some separation and locality optimization, we can get this number down quite a bit.”

The future certainly looks promising for Facebook Graph (which is still in beta) to determine connections out of massive mounds of data. Ching says Facebook is looking to make its Graph service easier to use, and to eliminate the need to know Java. “We want to make things like streaming or WebUIs or even higher level languages that bring graph processing to the masses the same way that SQL did for MapReduce,” he says in the video.

Facebook Advances Giraph With Major Code Injection

The Five Types of Hadoop Data

Applications: Data Mining

Technologies: Cloud, Systems

Sectors: Academia, Biosciences, Financial Services, Government, Healthcare, Manufacturing, Retail, Science

Tags: Apache Giraph, Facebook Graph, Hadoop, Hive, mapreduce

Only registered users may comment. Register using the form below.

Check off newsletters you would like to receive*
- HPCwire
- EnterpriseTech
- Datanami
- Technology Conferences & Events
- Advanced Computing Job Bank
- Technology Product Showcase
Email*
Name*
First Last
Organization*
Job Function*
Industry*
Country*
City*
State*
Province*
- Please check here to receive valuable email offers from Datanami on behalf of our select partners.

Facebook Graph: More Than a Trillion Edges Served

Join the discussion Cancel reply

Only registered users may comment. Register using the form below.

April 25, 2024

April 24, 2024

April 23, 2024

April 22, 2024

Sponsored Partner Content

Get your Data AI Ready – Celebrate One Year of Deep Dish Data Virtual Series!

Supercharge Your Data Lake with Spark 3.3

Learn How to Build a Custom Chatbot Using a RAG Workflow in Minutes [Hands-on Demo]

Overcome ETL Bottlenecks with Metadata-driven Integration for the AI Era [Free Guide]

Gartner® Hype Cycle™ for Analytics and Business Intelligence 2023

The Art of Mastering Data Quality for AI and Analytics

Leading Solution Providers

Tabor Network

Sponsored Whitepapers

Top 6 Strategies for Reducing Data Warehouse Costs

Building an Operational Data Warehouse for Real-time Analytics

Sponsored Multimedia

The Power of DataOps: Bring Automation to Life
No Comments

Tactical Steps for Cloud Migration
No Comments

Immuta Data Access Platform
No Comments

Data Mesh: Fact or Fiction?
No Comments

Contributors

Featured Events

AI & Big Data Expo North America 2024

AI Hardware & Edge AI Summit Europe

AI Hardware & Edge AI Summit 2024

CDAO Government 2024

Facebook Graph: More Than a Trillion Edges Served

Join the discussion Cancel reply

Only registered users may comment. Register using the form below.

April 25, 2024

April 24, 2024

April 23, 2024

April 22, 2024

Most Read Features

Most Read News In Brief

Most Read This Just In

Sponsored Partner Content

Leading Solution Providers

Tabor Network

Sponsored Whitepapers

Sponsored Multimedia

Contributors

Featured Events

Share

Copy short link