How Disney Built a Pipeline for Streaming Analytics
The explosion of on-demand video content is having a huge impact on how we watch television. You can now binge watch an entire season’s worth of Grey’s Anatomy at one sitting, if that suits your fancy. For a media giant like the Walt Disney Company, streaming video provides a great opportunity to engage with customers in new ways, but it also presents formidable technical obstacles.
Disney ABC Television Group (DATG) is the Burbank, California-based television arm of the global media conglomerate. With more than 7,000 employees, DATG is responsible for producing and delivering content across the ABC Television Network, including ABC News, ABC Entertainment, Disney Channels Worldwide, and others.
In recent years, the company has developed platforms that allow it to bypass traditional cable, satellite, and over-the-air broadcasts and deliver streaming television content over the Internet. Depending on where customers are, they can view current, archived, or even live ABC TV shows via Web browsers, apps on smart mobile devices, and streaming boxes from Roku.
Delivering that much content around the globe in a reliable manner is a big task in and of itself. But beyond the challenges inherent in building a content delivery network (CDN) are business-oriented questions that every advertising-supported media company needs to answer, such as “Who is watching my show?” “Are they seeing my ads?” and “How can I get them to watch more?”
Adam Ahringer, the manager of software data engineering at DATG, shared some details about how the company went about answering that type of question during a session at the recent Strata Data Conference in San Jose, California. One thing was readily apparent right from the outset: The old mechanism for Web analytics would no longer cut it.
“Omniture data was proving to not be sufficient,” Ahringer said. “For a long time, that was the standard way people instrumented their applications or website. We do get some decent analytics from it, but they’re really not timely and they’re very difficult to make changes to.”
To answer age-old media questions in the new streaming landscape, the company decided to build its own real-time data analytics pipeline in the Amazon Web Services cloud. The pipeline would be architected to collect event data from all streaming content endpoints and feed it back into a central warehouse for analysis.
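The article doesn’t spell out what DATG’s event payloads look like, but a client-side event for this kind of pipeline might be assembled along the lines of the sketch below. Every field name here is an illustrative assumption, not DATG’s actual schema; the one design point worth noting is the unique event ID, which downstream stages can use to recognize duplicates.

```python
import json
import time
import uuid

def build_event(user_id: str, asset_id: str, action: str) -> dict:
    """Assemble one playback event (all field names are assumptions)."""
    return {
        "event_id": str(uuid.uuid4()),  # unique ID lets downstream stages spot duplicates
        "ts": time.time(),              # client-side timestamp
        "user_id": user_id,
        "asset_id": asset_id,
        "action": action,               # e.g. "play", "pause", "complete"
    }

# Events are serialized (here as JSON) before being put on the message bus.
payload = json.dumps(build_event("user-123", "show-456", "play")).encode("utf-8")
```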
Ahringer and his team put a lot of thought into how to architect the new pipeline. One of the core design principles revolved around the fact that they wanted to collect as much event data as was practicable. “We wanted to get everything that occurred,” he said. “For our use case, I was really interested in taking special care to make sure that all the events made it into the infrastructure.”
Three message buses were under consideration: Apache Kafka, Amazon Kinesis, and Google’s Cloud Pub/Sub. Since Disney doesn’t have a big presence on Google Cloud Platform, Pub/Sub was out. Ahringer didn’t see much difference between Kafka and Kinesis, except perhaps that Kafka offered exactly-once processing (alongside at-most-once and at-least-once semantics), whereas Kinesis offered only at-least-once processing. “There’s really not a horribly wrong answer here,” he said. In the end, DATG picked Kinesis.
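At-least-once delivery means a consumer can occasionally see the same record twice, so downstream code is typically made idempotent. A minimal sketch of that idea, assuming each event carries a unique `event_id` (an assumption about the schema, not something the article confirms):

```python
class IdempotentConsumer:
    """Handle at-least-once delivery by dropping replayed events.

    A real deployment would back `seen` with durable state (for example,
    a unique key in the target database) rather than in-process memory.
    """

    def __init__(self):
        self.seen = set()
        self.processed = []

    def handle(self, event: dict) -> bool:
        if event["event_id"] in self.seen:
            return False              # duplicate delivery; ignore it
        self.seen.add(event["event_id"])
        self.processed.append(event)  # business logic would go here
        return True
```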
Next, Ahringer and his team had to pick some transformation mechanisms. AWS offers pre-built mechanisms for landing and transforming raw data into something more usable, including taking data off the Kinesis stream, writing it to S3, and then landing it in Redshift. But these mechanisms didn’t suit DATG’s requirements. “It does introduce latency and you don’t actually get access to data as fast as you would like,” he said.
Instead, DATG opted to write its own AWS Lambda functions to consume the Kinesis stream. Ahringer sounded impressed with this AWS functionality, which integrated with the Kinesis Client Library (KCL). “They give you an integration library, then you can basically roll your own ingestion and applications very easily,” he said. “It’s not difficult. You can embed your own business logic. You can do any kind of special situations. You can design it to scale in a way that works for you. For us it provided the most flexibility.”
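A Lambda function wired to a Kinesis stream receives records in batches, with each record’s data base64-encoded. The handler below is a minimal sketch of that consumption pattern; the payload contents and what happens to each row (enrichment, loading into the database) are placeholders for DATG’s actual business logic.

```python
import base64
import json

def handler(event, context):
    """Sketch of a Lambda consumer for a Kinesis event batch."""
    rows = []
    for record in event["Records"]:
        # Kinesis delivers each record's payload base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Real code would transform the row and write it to the warehouse.
        rows.append(payload)
    return {"batch_size": len(rows)}
```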
Selecting a Database
To get useful information out of the real-time data, DATG needed a database. It was critical to find the right database to go with its message bus, Ahringer said. “If you have this really wonderful low-latency system for ingesting events, and you make it available for ingestion in this NoSQL key-value store, people are going to look at you like you’re crazy because they can’t use any of the tools that they’re used to using,” he said.
Likewise, using off-the-shelf SQL databases probably wouldn’t work either. The AWS Aurora relational database is a perfectly good database, Ahringer said, but its scale-up architecture and 64TB maximum database size would just be too limiting for a use case such as DATG’s. “We ended up choosing MemSQL,” he said.
Ahringer said he was impressed with several aspects of MemSQL, including its ability to handle analytics and transactions simultaneously. He had more to say about MemSQL’s column store, a disk-based storage engine designed to deliver fast reads on large data sets.
“As you add more and more data, it doesn’t really affect scale like you’d see in other databases,” he said. “If you’ve used relational databases, maybe they can handle very high concurrent inserts. But when you try to read the same data that you’re currently inserting, you’ll get locking or there’s a hot spot on the disk or something. It’s very problematic. Those kinds of databases just weren’t designed for that. MemSQL has figured out a way to get around that. You can do analytics on the data as soon as it’s been committed and there’s no contention with the data that’s getting inserted.”
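In MemSQL (now SingleStore), that column store is declared in ordinary DDL. The table and query below are a hypothetical sketch, not DATG’s schema; the point they illustrate is the pattern Ahringer describes, where an aggregate can run over rows as soon as they are committed, without contending with concurrent inserts.

```python
# Hypothetical columnstore table; names are illustrative, the
# KEY ... USING CLUSTERED COLUMNSTORE clause is MemSQL's DDL syntax.
DDL = """
CREATE TABLE IF NOT EXISTS playback_events (
    event_id  VARCHAR(36) NOT NULL,
    ts        DATETIME(6) NOT NULL,
    user_id   VARCHAR(64) NOT NULL,
    asset_id  VARCHAR(64) NOT NULL,
    action    VARCHAR(32) NOT NULL,
    KEY (ts) USING CLUSTERED COLUMNSTORE
)
"""

# An analytics query that can run against just-committed rows,
# e.g. "who is watching what right now" for a live dashboard.
CONCURRENT_VIEWERS = """
SELECT asset_id, COUNT(DISTINCT user_id) AS viewers
FROM playback_events
WHERE ts > NOW() - INTERVAL 5 MINUTE
GROUP BY asset_id
ORDER BY viewers DESC
"""
```

Both statements would be issued over MemSQL’s MySQL-compatible wire protocol with any standard MySQL client library.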
Finally, DATG needed a front-end user interface where stakeholders can query the data and view dashboards. For this role, the company chose Looker. The product integrates natively with MemSQL and it was easy to get up and running, Ahringer said. “We were able to get the dashboards up very easily,” he said. “It’s a pretty nifty tool.”
DATG is also utilizing pre-built integration between MemSQL and Apache Spark. According to Ahringer, the company has created small Spark jobs that read data directly from MemSQL and write the data as Parquet files on S3. This makes it easy to share data stored in MemSQL with other groups within ABC or Disney without giving them direct access to the underlying database.
“Maybe they have their own Hadoop cluster or other data science groups who want access to large amounts of data,” he said. “We can give them access to the data on disk without really having to worry about all these different stakeholders connecting directly to our database.”
Today DATG is routing and landing tens of thousands of events per second through its real-time data analytics pipeline. Business stakeholders have visibility into user behavior as it happens via intuitive dashboards, while offline batch interfaces allow for more fine-grained, after-the-fact analysis.
The goal was to improve the overall customer experience, and it’s working, Ahringer said. “That’s really what we’re trying to do,” he said. “Being able to have access to the data in real time, like what the user is doing in our application, what they are watching, and whether they’re experiencing any problems, really opens the door to lots of things that we haven’t been able to do before.”
For example, based on what show a user is watching and how they’re watching it, DATG can dynamically change the navigation of the application. DATG’s data scientists can use machine learning techniques to determine which customers are most likely to binge watch an entire season, or even make recommendations to help customers catch up with their favorite shows. By monitoring bitrates, DATG can detect bandwidth problems with carriers and work with them to resolve the problems, minimizing the impact on their shared customers.
“The overall goal is trying to improve the user experience and all these things are possible if you have access to the data more quickly,” Ahringer said.