August 15, 2013

SoundCloud Liberates Data with Hadoop, Pentaho

Alex Woodie

For the folks at the music distribution website SoundCloud, answering basic questions about the organization’s business was becoming increasingly difficult with traditional tools. After the company implemented a Hadoop cluster and front-end analysis tools from Pentaho, turning data into actionable information became much easier for non-technical users.

SoundCloud is an upstart music distribution company that is increasingly being used by musicians, publishers, radio stations, and music-lovers searching for the latest sound. The German company was founded in 2007, and today has the 216th most popular website in the world, according to Alexa.

Like all companies, SoundCloud managers need answers to basic questions about the company’s effectiveness. According to a recent presentation by SoundCloud’s Ole Bahlmann, the company taught its product managers how to extract data from its MySQL database using pre-built scripts, and then distribute the results using Google Docs spreadsheets.

The approach worked early on, though Bahlmann admits he is a bit embarrassed that the company relied on the “hacked together” system. It simply couldn’t scale once the company entered a period of rapid growth beginning in 2010, Bahlmann said.

As SoundCloud ramped up, it began collecting lots of data about how its customers were using the site. Trying to build correlations between awareness campaigns and how many Facebook users added a SoundCloud URL link to their timeline, for example, was becoming difficult to do with spreadsheets.
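To make the kind of question described above concrete, here is a minimal sketch of a correlation check between campaign activity and link shares. It is not SoundCloud's method: the function name, the variable names, and the daily figures are all invented for illustration, and it uses only the Python standard library.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical daily figures, invented for illustration only.
campaign_spend = [0, 100, 200, 300, 400]
facebook_shares = [10, 30, 50, 70, 90]

r = pearson(campaign_spend, facebook_shares)
print(f"correlation between spend and shares: {r:.2f}")
```

A value near 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or none. On a handful of rows this is easy to do in a spreadsheet; the difficulty the article describes comes from running the same calculation over data that no longer fits in one.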

As the company became more popular, it entered the realm of big data, and that had repercussions for its IT infrastructure. Today more than 12 hours of music is added to SoundCloud every minute, according to Bahlmann. “For us, big data was data that didn’t fit into the MySQL database,” he said.

The company sought a solution to this problem, which led it to Hadoop, the open source framework for distributed storage and processing, and to Pentaho, a developer of Web-based tools that help users integrate, analyze, report on, visualize, and make predictions from their data.

The Hadoop- and Pentaho-based system has enabled SoundCloud to eliminate silos of information, and it is easy enough to use that non-technical product managers can explore large data sets in search of answers to their own questions without involving IT experts, Bahlmann said.

“We can load billions of rows in there and have them access thru Pentaho and have very fast access to the information,” Bahlmann said. “If you teach people and if you educate the people to use the data and your simple tools then you can actually enable them to get the golden nuggets of data.”

More recently, SoundCloud has added Amazon Redshift, Amazon’s cloud-based data warehouse service, to the mix. “Hadoop, as great as it is, if you run a query, depending on the size of your cluster, you’re still waiting three to five minutes to get the results,” Bahlmann said. “With Redshift, we can simply put all the data in there and the queries return in a couple of seconds. We have all this information available through Pentaho to all the end users at SoundCloud.”
