How comScore Uses Hadoop and MapR to Build its Business
In conjunction with MapR, Datanami presents comScore with this month’s “Big Data All Star” award.
When comScore was founded in 1999, Mike Brown, the company’s first engineer, was immediately immersed in the world of Big Data.
The company was created to provide digital marketing intelligence and digital media analytics in the form of custom solutions in online audience measurement, e-commerce, advertising, search, video and mobile. Brown’s job was to create the architecture and design to support the founders’ ambitious plans.
It worked. Over the past 15 years comScore has built a highly successful business and a customer base composed of some of the world’s top companies – Microsoft, Google, Yahoo!, Facebook, Twitter, craigslist, and the BBC to name just a few. Overall the company has more than 2,100 clients worldwide. Measurements are derived from 172 countries with 43 markets reported.
To service this extensive client base, well over 1.8 trillion interactions are captured monthly, equal to about 40% of the monthly page views of the entire Internet. This is Big Data on steroids.
Brown, who was named CTO in 2012, continues to grow and evolve the company’s IT infrastructure to keep pace with this constantly increasing data deluge. “We were a Dell shop from the beginning. In 2002 we put together our own grid processing stack to tie all our systems together in order to deal with the fast growing data volumes,” Brown recalls.
In addition to its ongoing business, in 2009 the company embarked on a new initiative called Unified Digital Measurement (UDM), which directly addresses the frequent disparity between census-based site analytics data and panel-based audience measurement data. UDM blends these two methods into a “best of breed” approach that combines person-level measurement from the two-million-person comScore global panel with census-informed consumption to account for 100 percent of a client’s audience.
UDM helped prompt a new round of IT infrastructure upgrades. “The volume of data was growing rapidly and processing requirements were growing dramatically as well,” Brown says. “In addition, our clients were asking us to turn the data around much faster. So we looked into building our own stack again, but decided we’d be better off adopting a well accepted, open source, heavy duty processing model – Hadoop.”
With the implementation of Hadoop, comScore continued to expand its server cluster. Multiple servers also meant they had to solve the Hadoop shuffle problem. During the high volume, parallel processing of data sets coming in from around the world, data is scattered across the server farm. To count the number of events, all this data has to be gathered, or “shuffled” into one location.
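To make the shuffle step concrete, here is a minimal single-process sketch in Python of counting events that start out scattered across several servers. It is an illustration only, not comScore's or Hadoop's actual code; the server names, event names, and function names are all hypothetical.

```python
import zlib
from collections import defaultdict


def map_phase(server_logs):
    """Emit an (event_key, 1) pair for every record held on every server."""
    for records in server_logs.values():
        for event_key in records:
            yield event_key, 1


def shuffle_phase(pairs, num_reducers):
    """Route all pairs that share a key to the same reducer partition."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        # A stable hash decides which partition owns this key.
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions[index][key].append(value)
    return partitions


def reduce_phase(partitions):
    """Sum the gathered values for each key."""
    counts = {}
    for partition in partitions:
        for key, values in partition.items():
            counts[key] = sum(values)
    return counts


def count_events(server_logs, num_reducers=2):
    pairs = map_phase(server_logs)
    return reduce_phase(shuffle_phase(pairs, num_reducers))


# Hypothetical sample data: the same event types appear on different servers.
server_logs = {
    "server-a": ["page_view", "ad_click", "page_view"],
    "server-b": ["page_view", "video_play"],
    "server-c": ["ad_click", "page_view"],
}

event_counts = count_events(server_logs)
print(event_counts)
```

In a real cluster the shuffle moves data over the network between machines, which is why it tends to dominate job time as the server count grows.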
comScore needed a Hadoop platform that could not only scale but also provide data protection and high availability while remaining easy to use.
It was requirements like these that led Brown to adopt the MapR distribution for Hadoop.
He was not disappointed. By using the MapR distro, the company can more easily manage and scale its Hadoop cluster, create more files and process more data faster, and produce better streaming and random I/O results than with other Hadoop distributions. “With MapR we see a 3X performance increase running the same data and the same code – the jobs just run faster,” Brown says.
In addition, the MapR solution provides the requisite data protection and disaster recovery functions: “MapR has built in to the design an automated DR strategy,” Brown notes.
Solving the Shuffle
Brown said his team leveraged a MapR feature known as volumes to directly address the shuffle problem. “It allows us to make this process run superfast. We reduced the processing time from 36 hours to three hours – no new hardware, no new software, no new anything, just a design change. This is just what we needed to colocate the data for efficient processing.”
Using volumes to optimize processing was one of several unique solutions that Brown and his team applied to processing comScore’s massive amounts of data. Another innovation is pre-sorting the data before it is loaded into the Hadoop cluster. Sorting optimizes the data’s storage compression ratio, from the usual ratio of 3:1 to a highly compressed 8:1 with no data loss. And this leads to a cascade of benefits: more efficient processing with far fewer IOPS, less data to read from disk, and less equipment which, in turn, means savings on power, cooling and floor space.
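The effect of pre-sorting on compression can be demonstrated with a small Python sketch. It uses synthetic records and the standard zlib compressor, both of which are assumptions made for illustration; comScore's data format and codec are not described in this article, so the ratios it prints will not match the 3:1 and 8:1 figures above.

```python
import random
import zlib


def build_records(count=20000, seed=0):
    """Generate synthetic 'domain,event' lines in random arrival order."""
    rng = random.Random(seed)
    domains = [f"site{i:03d}.example" for i in range(50)]
    events = ["page_view", "ad_click", "video_play", "search", "purchase"]
    return [f"{rng.choice(domains)},{rng.choice(events)}\n" for _ in range(count)]


def compressed_size(lines):
    """Return the zlib-compressed size, in bytes, of the joined lines."""
    return len(zlib.compress("".join(lines).encode("utf-8"), 6))


records = build_records()
raw_size = len("".join(records).encode("utf-8"))

unsorted_size = compressed_size(records)
# Sorting places identical and similar lines next to each other, which
# gives the compressor long repeated runs to exploit.
sorted_size = compressed_size(sorted(records))

print(f"raw bytes:        {raw_size}")
print(f"unsorted ratio:   {raw_size / unsorted_size:.1f}:1")
print(f"pre-sorted ratio: {raw_size / sorted_size:.1f}:1")
```

The compression is lossless in both cases. Sorting changes only the order of the records, so it suits workloads such as counting, where arrival order does not matter.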
“HDFS is great internally,” says Brown. “But to get data in and out of Hadoop, you have to do some kind of HDFS export. With MapR, you can just mount HDFS as NFS and then use native tools whether they’re in Windows, Unix, Linux or whatever. NFS allowed our enterprise to easily access data in the cluster and just as easily store it in a variety of warehouse environments.”
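As a rough illustration of what “native tools” means here, the Python sketch below reads a file through ordinary filesystem calls, with no Hadoop client library involved. A temporary directory stands in for the NFS mount point so the example runs anywhere; the directory layout, file name, and function name are hypothetical.

```python
import tempfile
from pathlib import Path


def count_lines(mount_root, relative_path):
    """Count lines in a file using plain file I/O under a mounted path."""
    path = Path(mount_root) / relative_path
    with path.open("r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


# Stand-in for the cluster's NFS mount point (assumption: on a real system
# this would be wherever the administrator mounted the cluster).
with tempfile.TemporaryDirectory() as fake_mount:
    sample = Path(fake_mount) / "reports" / "daily.csv"
    sample.parent.mkdir(parents=True)
    sample.write_text(
        "site,views\nexample.com,10\nexample.org,7\n", encoding="utf-8"
    )
    line_count = count_lines(fake_mount, "reports/daily.csv")

print(line_count)
```

Because the mount looks like any other directory, the same code works whether the path points at local disk or at data stored in the cluster.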
For the near future, Brown says the comScore IT infrastructure will continue to scale to meet new customer demand. The Hadoop cluster has grown to 450 servers with 17,000 cores and more than 10 petabytes of disk.
MapR’s distro of Hadoop is also helping to support a major new product announced in 2012 and enjoying rapid growth. Known as validated Campaign Essentials (vCE), the new measurement solution provides a holistic view of campaign delivery and a verified assessment of ad-exposed audiences via a single, third-party source. vCE also allows the identification of non-human traffic and fraudulent delivery.
When asked if he had any advice for his peers in IT who are also wrestling with Big Data projects, Brown commented, “We all know we have to process mountains of data, but when you begin developing your environment, start small. Cut out a subset of the data and work on that first while testing your code and making sure everything functions properly. Get some small wins. Then you can move on to the big stuff.”