August 31, 2020

How Facebook Accelerates SQL at Extreme Scale


Serving SQL queries on a petabyte of data is one thing, but doing so at Facebook’s scale is something else entirely. Earlier this year, the social media giant integrated the Alluxio distributed file system into its massive data architecture to speed up queries while maintaining the separation of compute and storage.

Facebook has long been a proponent of open source software, and it has generated its share of open source projects in the big data space. Apache Hive and its follow-on, Presto, both emerged from Facebook, as did RocksDB and Apache Cassandra. GraphQL is one of a number of developer tools that came out of Facebook, along with PyTorch.

Facebook was an early adopter of Apache Hadoop, and remnants of Hadoop are still evident in the company’s architecture. It has moved away from storing data in Hadoop clusters, but it still uses an HDFS interface to access data in Warm Storage, the custom distributed file system that Facebook developed for itself after it moved off Hadoop. Warm Storage functions as a BLOB store atop a huge number of spinning disks, not dissimilar to Google Colossus or Amazon S3, Facebook Software Engineer James Sun tells us, and it uses synchronous Reed-Solomon error-correction coding to bolster data resiliency at massive scale.

The separation of compute and storage is a pillar of modern data architecture, and is something that many organizations are striving toward today. But for Facebook, separating compute and storage is an absolute necessity, as the costs of wasted storage or compute capacity would quickly add up.

Internally, Facebook is a big user of Presto for powering ad-hoc interactive SQL queries, A/B testing, and for serving critical dashboards, such as daily active users (DAU) and monthly active users (MAU). As a compute engine with no storage component, Presto is a great fit for companies that have separated compute from storage (that is why Facebook invented it, after all).

Facebook was torn between the pros and cons of its Raptor caching system (Image courtesy Facebook)

While Presto has worked well for Facebook, physics has not. To keep SQL query latencies in a workable range (i.e. around a second), several years ago Facebook implemented a data caching strategy. Dubbed Raptor, the system requires users to build ETL pipelines to cache data on SSDs attached to the Presto cluster.

Raptor delivered the query speedup that Facebook needed, since many Presto queries no longer needed to travel across the network. But Raptor created other problems along the way.

For starters, by coupling compute and storage back together, it eliminated the company’s ability to scale each independently. It also led to fragmentation of the data and a degraded user experience, because sometimes queries hit the giant Warm Storage data lake and sometimes they hit data cached on the local SSDs, according to Sun. And it complicated security and privacy, because Facebook engineers had to worry not just about protecting data in Warm Storage, but also data sitting on those local SSDs.

Last fall, Facebook started to implement a replacement for Raptor. Based on a customized version of Alluxio, the new system delivers query performance comparable to Raptor’s, but without the need to cache data on local SSDs.

Alluxio is a distributed file system that accelerates the flow of data from multiple storage systems to multiple compute engines

Alluxio is a virtual distributed file system that serves as the memory-based glue connecting multiple compute engines to multiple backend storage systems. It was created (originally under the name Tachyon) by Haoyuan “HY” Li at UC Berkeley’s AMPLab in 2013. Ion Stoica, an AMPLab advisor and co-founder of Databricks, once said that Alluxio is to storage as Internet Protocol (IP) is to the Internet.

Facebook worked with Alluxio, the company behind the open source software project of the same name, to develop a customized version of the Alluxio file system that would run as a Java library. That version, which Alluxio made public with the version 2.2 release, doesn’t offer all the capabilities of the full Alluxio product, but it provides the core benefits that Facebook needed.

Facebook engineer Sun explains:

“What we did is we installed an Alluxio cache to the entire Presto interactive fleet in Facebook,” he tells Datanami. “We have mostly deprecated Raptor, meaning usually [the users] don’t need to write ETL jobs or whatever pipelines in order to dump Warm Storage data into local SSDs. They just don’t do it. They just directly query Warm Storage and the Alluxio cache library will do it for you.”

Facebook stores data in the ORC file format, and makes it available to Presto via the HDFS interface (the company still uses HDFS, but no longer uses Hadoop). The Alluxio implementation can work with SSDs and use them to store data locally to speed up processing, but the local cache is not necessary and Alluxio can also speed up queries against Warm Storage when no SSDs are present, according to Sun.
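The read-through behavior Sun describes — query Warm Storage directly and let the cache library keep local copies transparently — can be sketched in a few lines. This is a minimal illustration, not Facebook’s or Alluxio’s actual code; the class name, the dictionary-backed stores, and the FIFO eviction policy are all assumptions for the example.

```python
# Minimal read-through cache sketch (illustrative only).
# A query engine asks the cache for a piece of data; on a miss, the
# cache fetches from the remote store (standing in for Warm Storage)
# and keeps a local copy, so repeat reads avoid the network hop.

class ReadThroughCache:
    def __init__(self, remote_store, capacity=1024):
        self.remote = remote_store   # backing store, e.g. Warm Storage
        self.local = {}              # stands in for the local SSD cache
        self.capacity = capacity
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.local:        # cache hit: serve locally
            self.hits += 1
            return self.local[key]
        self.misses += 1
        data = self.remote[key]      # cache miss: remote (network) read
        if len(self.local) >= self.capacity:
            # Naive FIFO eviction; a real cache would be smarter.
            self.local.pop(next(iter(self.local)))
        self.local[key] = data       # populate for the next reader
        return data

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

The point of the pattern is that users never manage the cache: unlike Raptor’s explicit ETL pipelines, the first read populates the cache as a side effect of simply querying the remote store.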

The Alluxio caching system can work with or without SSDs at Facebook (Image courtesy Facebook)

With Alluxio installed on hundreds of nodes in Facebook’s data center, Facebook users are enjoying comparable performance to what they had with Raptor. They can continue to submit large and complex queries that involve numerous joins and aggregations atop tables that may exceed a petabyte of data. Without Alluxio or Raptor, these queries could take up to 10 seconds to return an answer, which is just not feasible for Facebook. On simpler queries that hit oft-queried tables, Alluxio is delivering a 30% to 50% boost for Presto workloads, which is significant in its own right.

According to a July presentation by Sun, fellow Facebook software engineer Rohit Jain, and Alluxio founding engineer Bin Fan, the Alluxio cache has resulted in 57% fewer reads from the remote data store (Warm Storage) relative to Raptor. The hit rate for the Alluxio cache (the fraction of requests for which the data is already sitting in the local SSD cache) exceeds 90%.
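To see what a 90% hit rate buys, a back-of-envelope calculation helps: only the misses travel to Warm Storage. The 90% figure comes from the presentation; the total read count below is an assumed number for illustration.

```python
# Back-of-envelope: at a 90% cache hit rate, only one read in ten
# goes over the network to Warm Storage.

total_reads = 1_000_000   # assumed workload size, for illustration
hit_rate = 0.90           # figure cited in the July presentation
remote_reads = round(total_reads * (1 - hit_rate))
print(remote_reads)       # 100000 remote reads instead of 1,000,000
```

A single large uncached request can temporarily drag that rate down, which is the configuration problem Jain describes next.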

“As soon as a large data request which is not cached comes to it, it reduces our cache hit rate,” says Jain, who previously spearheaded the Trafodion project at HP. “We are trying to work on better configuration for things on what should be cached.”

The main benefits accrue in cost and security. Because its Presto compute engines are once again independent of data storage, Facebook can save money by not overprovisioning compute or storage. On the security front, Facebook no longer has to worry about protecting data residing in SSDs and can focus more energy on securing data in Warm Storage.

“Alluxio is not only going to boost our query performance, but it’s going to kill storage silos such as Raptor, so we don’t really need to care about security, privacy, or HA or DR anymore because Warm Storage takes care of it,” Sun says. “The [cached] data is going to be on SSD, but it’s not semantic. There’s no semantic meaning for it. It’s just bits.”

Raptor is not completely gone today, but it should be completely gone within six months, Sun says.

Related Items:

Alluxio Looks to Bring Data Closer to Presto Engine

Alluxio Bolsters Data Orchestration for Hybrid Cloud World

Meet Alluxio, the Distributed File System Formerly Known as Tachyon