Mining for YouTube Gold with Hadoop and Friends
Unless you’ve been living in a cave without Wi-Fi, you’ve probably noticed a revolution occurring in the world of video. Not only are hundreds of millions of young people using Web services like YouTube, Netflix, and Amazon to watch their favorite TV shows, but they’re tuning in to new content created by thousands of independent publishers. The old media apple cart hasn’t just been turned over–it’s been run over by a train.
Or maybe, it’s been run over by an elephant. Named after Doug Cutting’s kid’s fuzzy yellow toy elephant, Hadoop is the headliner of a new group of scale-out, distributed technologies that Yahoo, Google, and other Silicon Valley firms started developing 10 years ago to index and serve data on the Internet.
YouTube, which is owned by Google, is a classic example of why these types of big data technologies are critical. Every day, people upload 1.2 million videos to YouTube, or more than 100 hours of video per minute. More than 3 billion videos are played every day across 500,000 channels. Just moving that much unstructured data requires a different sort of technology, such as a basic type of NoSQL database called a key-value store, along with a lot of other networking wizardry and advanced software (for a great in-depth discussion on how YouTube built YouTube, check out this video, which is on YouTube, naturally).
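As a back-of-the-envelope check, the two upload figures quoted above can be reconciled with simple arithmetic. The sketch below uses only the numbers cited in this article; the function name is ours, for illustration, and because the hourly figure is quoted as "more than 100," the results are lower bounds.

```python
def upload_stats(videos_per_day, hours_per_minute):
    """Derive daily upload hours and average video length from the quoted rates."""
    minutes_per_day = 24 * 60
    hours_per_day = hours_per_minute * minutes_per_day
    # Convert total hours to minutes, then spread across the daily video count.
    avg_minutes_per_video = hours_per_day * 60 / videos_per_day
    return hours_per_day, avg_minutes_per_video


hours_per_day, avg_minutes = upload_stats(videos_per_day=1_200_000, hours_per_minute=100)
print(f"{hours_per_day:,} hours uploaded per day")         # 144,000 hours uploaded per day
print(f"{avg_minutes:.1f} minutes per video on average")   # 7.2 minutes per video on average
```

In other words, the two figures imply roughly 144,000 hours of new video per day, at an average length of about seven minutes per upload.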
Similarly, analyzing and understanding the activity on a property as massive as YouTube–which reaches 1 in 7 people on the planet every month–requires something beyond a relational SQL database. It needs something massively parallel and distributed–say, something like Hadoop.
Winning on YouTube
One of the companies using Hadoop to analyze YouTube content is Tubular Labs. Founded two years ago by Rob Gabel, Tubular is sort of like Nielsen, but instead of surveying what TV shows or radio programs consumers of a certain demographic have watched or listened to lately, Tubular dishes out real-time info on who’s watching what on YouTube.
As Gabel explains, the explosive growth of online video has rewritten the rules that old-school media standbys like Nielsen lived by for decades.
“It used to be there were media companies that created newspapers and TV shows, and there were advertisers that created ads. Things are a little bit different now,” the Tubular CEO tells Datanami. “There are 130,000 channels on YouTube who have over 10,000 subscribers. They’ve built big audiences, and understanding who they are, what content they make, and which ones you should work with is a much harder challenge than it is in a TV world of 200 channels.”
Tubular ingests and analyzes hundreds of millions of pieces of data every day about the YouTube audience. Much of this information is available directly through the YouTube API, but the company also monitors Twitter, Facebook, Instagram, and Vine through publicly available data feeds. If you like a YouTube video or post a comment about it, chances are the data ends up with Tubular.
The company sucks all this data into a Hadoop cluster based on Cloudera’s CDH Enterprise. This cluster, which runs on Amazon’s cloud, tells subscribers all kinds of stuff about the demographics of certain YouTube channels, including viewers’ age and gender and the other YouTube content they’re likely to be watching. This information, which is available for a five-figure subscription, is useful not only to consumer packaged goods companies that want the best viewership for their advertisements, but also to other publishing firms looking to build their own audiences.
The Importance of Impala
Trends come and go quickly on YouTube, so getting these results in a short timeframe is critical to success. Tubular can process queries in under two seconds, providing customers with the capability to explore the YouTube audience data in something approaching real time.
Cloudera’s Impala is critical to that real-time capability, Gabel says. “Without Impala or a solution like Impala…the user wouldn’t be able to sift through in real time and see what’s been trending in the last day or two or how things have changed,” he says. “Without Impala, it would look a lot like the canned reports you’re used to, where on October 13 you get to hear about September data.”
For example, Cloudera’s SQL analytic engine for HDFS will let Tubular customers plot Thanksgiving promotional strategies using the latest demographic and audience data. If a turkey video gets hot right before the big day, Tubular will detect that and help steer customers’ YouTube advertisements to have the best result.
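The article does not describe how Tubular actually detects a video that "gets hot," so the following is only a hypothetical sketch of the general idea: compare a video's views today against its average over the preceding days and flag large jumps. The function name, the threshold, and the sample data are all made up for illustration; in a production setting this kind of aggregation would more likely be expressed as a SQL query against the cluster.

```python
from collections import defaultdict


def trending(view_events, today, min_ratio=3.0):
    """Return ids of videos whose views today are at least min_ratio times
    their average daily views over the earlier days present in the data.
    Videos with no earlier views are skipped (no baseline to compare to)."""
    today_views = defaultdict(int)
    earlier_views = defaultdict(int)
    earlier_days = set()
    for video_id, day, views in view_events:
        if day == today:
            today_views[video_id] += views
        elif day < today:
            earlier_views[video_id] += views
            earlier_days.add(day)

    n_days = max(len(earlier_days), 1)
    hot = []
    for video_id, views in today_views.items():
        baseline = earlier_views[video_id] / n_days
        if baseline > 0 and views / baseline >= min_ratio:
            hot.append(video_id)
    return sorted(hot)


# Made-up sample data: (video_id, day_number, views)
events = [
    ("turkey-recipe", 1, 1000), ("turkey-recipe", 2, 1200), ("turkey-recipe", 3, 9000),
    ("cat-video", 1, 5000), ("cat-video", 2, 5200), ("cat-video", 3, 5100),
]
hot_videos = trending(events, today=3)
print(hot_videos)  # ['turkey-recipe']
```

Here the hypothetical turkey video jumps to roughly eight times its earlier daily average and is flagged, while the steady cat video is not.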
Tubular has been working with Impala since the SQL engine was in beta. The company built its first prototype using a traditional SQL database, but soon found that it wouldn’t scale, Gabel says. “Using technologies like Cloudera’s suite lets us focus on the things that are unique to our business, the machine learning and algorithms that are specific to us” rather than worrying about the underlying database technology.
Hadoop isn’t the only next-gen technology in play at Tubular, which also uses a Redis in-memory database to bring in messages, and a Cassandra NoSQL database to help serve its website. “We switched our primary storage from SQL to Cassandra to save money, because running SQL at Amazon is expensive,” he says.
The nature of media is evolving rapidly at the moment thanks to new technologies, such as smartphones, which are changing how we consume and share content. Behind the scenes, advances in non-SQL technologies like Hadoop and Cassandra provide the scalability that is allowing startup content producers and advertisers to reach big audiences.
TV isn’t going away, but the world of video is certainly changing, Gabel says.
“Whether it’s binging for hours on end with Netflix or watching 6-minute YouTube clips on cellphones, video is evolving. Consumer behavior is evolving,” he says. “It’s global. You can share it socially and the types of people who are making video are changing. We’re at the heart of that and trying to apply big data to help these publishers and brands navigate the ever-changing world and find an audience for their content.”