Pandora’s Journey from Hadoop to MemSQL
Pandora Media, the music streaming service, transitioned two years ago from heavy reliance on Hadoop for distributed processing to MemSQL as its primary database.
The switch was prompted by the company’s inability by 2016 to keep up with evolving requirements and new features on the advertising-based side of the streaming service. “Answers about Monday weren’t available until Wednesday,” the company noted in a blog post detailing its transition to MemSQL.
While query performance was sufficient, keeping up with “24-hour period” data was not, with a roughly 20-hour delay on a large Hadoop cluster. Performance declined further as the daily haul of data grew larger, requiring more processing time.
As data backed up, Pandora engineers said they considered 30 different platforms for delivering query results in real time. “We wanted a data store that would be fast enough to support real-time queries in less than one second,” wrote software engineer Kaan Erdener.
For this, it narrowed its search to platforms supporting a dashboard that could give the Pandora ad sales team a real-time view of data from past and current campaigns. “We wanted to get away from the cycle of new data arriving, waiting for nightly jobs, [then] view the results in a dashboard,” Erdener explained.
That along with a faster data store, preferably with sub-second query times over hundreds of billions of rows of data. Moreover, the company was willing to spend whatever it would take to meet its performance requirements.
Pandora narrowed the list of candidate platforms from 30 down to three “finalists”: CitrusDB, Redshift and MemSQL. The Amazon (NASDAQ: AMZN) platform provided the performance and features Pandora wanted, but the company balked at cost along with a requirement to ship all its data to AWS datacenters. Meanwhile, CitrisDB lacked some of the features Pandora needed in early 2016 when the database transition took place.
That left MemSQL, which offered what Pandora considered the appropriate mix of features, performance and flexibility, including a distributed architecture, the ability to read and write data directly to and from memory along with the ability to run on bare metal, virtual machines or software containers.
The goal in implementing its new relational database management system was consolidating all of Pandora’s ad campaigns into a single database. Data was ingested using Hive, Postgres and data files. Once there, engineers focused on using MemSQL to boost query performance.
Among the early lessons was focusing on columnar store since rows of data are stored directly on cluster memory, significantly adding to memory usage. By contrast, columns are stored on disk.
Ultimately, Pandora emphasized columnar performance in developing its ad campaign dashboard, including “reads, writes and overall disk performance,” Erdener noted. Pandora’s rule of thumb was: “If the table has less than 1 billion rows, start with row store; more than 1 billion, start with column store,” he added.
Since implementing the MemSQL-based dashboard, Pandora said it has expanded the platform to other company data projects.
Pandora has been an early adopter of a number of emerging big data technologies, including Apache Kafka as a platform for real-time stream processing of data.