December 1, 2014

How comScore Uses Hadoop and MapR to Build its Business

Sponsored by MapR Technologies

In conjunction with MapR, Datanami presents comScore with this month’s “Big Data All Star” award.

When comScore was founded in 1999, Mike Brown, the company’s first engineer, was immediately immersed in the world of Big Data.

The company was created to provide digital marketing intelligence and digital media analytics in the form of custom solutions in online audience measurement, e-commerce, advertising, search, video and mobile. Brown’s job was to create the architecture and design to support the founders’ ambitious plans.

It worked. Over the past 15 years comScore has built a highly successful business and a customer base composed of some of the world’s top companies – Microsoft, Google, Yahoo!, Facebook, Twitter, craigslist, and the BBC to name just a few. Overall the company has more than 2,100 clients worldwide. Measurements are derived from 172 countries with 43 markets reported.

To service this extensive client base, well over 1.8 trillion interactions are captured monthly, equal to about 40% of the monthly page views of the entire Internet. This is Big Data on steroids.

Brown, who was named CTO in 2012, continues to grow and evolve the company’s IT infrastructure to keep pace with this constantly increasing data deluge. “We were a Dell shop from the beginning. In 2002 we put together our own grid processing stack to tie all our systems together in order to deal with the fast growing data volumes,” Brown recalls.

Introducing UDM

In addition to its ongoing business, in 2009 the company embarked on a new initiative called Unified Digital Measurement (UDM), which directly addresses the frequent disparity between census-based site analytics data and panel-based audience measurement data. UDM blends the two into a “best of breed” approach that combines person-level measurement from the two-million-person comScore global panel with census-informed consumption to account for 100 percent of a client’s audience.

UDM helped prompt a new round of IT infrastructure upgrades. “The volume of data was growing rapidly and processing requirements were growing dramatically as well,” Brown says. “In addition, our clients were asking us to turn the data around much faster. So we looked into building our own stack again, but decided we’d be better off adopting a well accepted, open source, heavy duty processing model – Hadoop.”

With the implementation of Hadoop, comScore continued to expand its server cluster. Running across many servers also meant the team had to solve the Hadoop shuffle problem. During high-volume parallel processing of data sets coming in from around the world, data is scattered across the server farm. To count the number of events, all this data has to be gathered, or “shuffled,” into one location.
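The shuffle step described above can be illustrated with a minimal sketch in plain Python. This is not comScore's code or the Hadoop API. The event names, the three "nodes," and the function names are all hypothetical and stand in for records scattered across a cluster.

```python
from collections import defaultdict


def map_phase(node_records):
    """Each node emits a (key, 1) pair for every event it happens to hold."""
    return [(event, 1) for event in node_records]


def shuffle_phase(mapped_per_node):
    """Gather all pairs that share a key into one place, whichever node emitted them."""
    grouped = defaultdict(list)
    for pairs in mapped_per_node:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the gathered values to get one count per event type."""
    return {key: sum(values) for key, values in grouped.items()}


def count_events(nodes):
    mapped = [map_phase(records) for records in nodes]
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical event data scattered across three nodes.
nodes = [
    ["page_view", "ad_click", "page_view"],
    ["video_play", "page_view"],
    ["ad_click", "page_view", "video_play"],
]

counts = count_events(nodes)
print(counts)
```

In a real cluster the gathering step moves data over the network between machines, which is why it tends to dominate the running time of large counting jobs.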

comScore needed a Hadoop platform that could not only scale but also provide data protection and high availability while remaining easy to use.

It was requirements like these that led Brown to adopt the MapR distribution for Hadoop.

He was not disappointed. With the MapR distro, the company can more easily manage and scale its Hadoop cluster, create more files, process more data faster, and achieve better streaming and random I/O results than with other Hadoop distributions. “With MapR we see a 3X performance increase running the same data and the same code – the jobs just run faster,” Brown says.

In addition, the MapR solution provides the requisite data protection and disaster recovery functions: “MapR has built in to the design an automated DR strategy,” Brown notes.

Solving the Shuffle

Brown says his team used a MapR feature known as volumes to directly address the shuffle problem. “It allows us to make this process run superfast. We reduced the processing time from 36 hours to three hours – no new hardware, no new software, no new anything, just a design change. This is just what we needed to colocate the data for efficient processing.”

Using volumes to optimize processing was one of several unique solutions that Brown and his team applied to processing comScore’s massive amounts of data. Another innovation is pre-sorting the data before it is loaded into the Hadoop cluster. Sorting optimizes the data’s storage compression ratio, from the usual ratio of 3:1 to a highly compressed 8:1 with no data loss. And this leads to a cascade of benefits: more efficient processing with far fewer IOPS, less data to read from disk, and less equipment which, in turn, means savings on power, cooling and floor space.
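The effect of pre-sorting on compression can be seen in a small experiment. The sketch below is an illustration only: it uses Python's `zlib` as a stand-in for whatever codec comScore uses, and the record layout, site names, and function names are hypothetical. The ratios it prints will not match the 3:1 and 8:1 figures quoted above.

```python
import random
import zlib


def build_records(n=20000, seed=7):
    """Generate hypothetical 'site,event' records in random arrival order."""
    rng = random.Random(seed)
    sites = [f"site-{i:03d}.example" for i in range(50)]
    events = ["page_view", "ad_click", "video_play"]
    return [f"{rng.choice(sites)},{rng.choice(events)}" for _ in range(n)]


def compressed_size(lines):
    """Return (raw_bytes, compressed_bytes) for the records joined by newlines."""
    payload = "\n".join(lines).encode("utf-8")
    packed = zlib.compress(payload, 6)
    # Compression is lossless: decompressing gives back the exact input.
    assert zlib.decompress(packed) == payload
    return len(payload), len(packed)


records = build_records()
raw_size, unsorted_size = compressed_size(records)
_, sorted_size = compressed_size(sorted(records))

print(f"raw bytes:           {raw_size}")
print(f"compressed unsorted: {unsorted_size}")
print(f"compressed sorted:   {sorted_size}")
```

Sorting places identical and similar records next to each other, so the compressor finds longer repeated runs. Every record is preserved, although the original arrival order is not, so this approach fits data where order carries no meaning or is recorded in a field.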

“HDFS is great internally,” says Brown. “But to get data in and out of Hadoop, you have to do some kind of HDFS export. With MapR, you can just mount HDFS as NFS and then use native tools whether they’re in Windows, Unix, Linux or whatever. NFS allowed our enterprise to easily access data in the cluster and just as easily store it in a variety of warehouse environments.”
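To make the NFS point concrete: once cluster storage is mounted as a file system, ordinary file APIs work on it with no Hadoop client involved. The sketch below is a hedged illustration. A temporary directory stands in for the mount point, and the directory layout, file names, and `count_records` helper are hypothetical.

```python
import tempfile
from pathlib import Path


def count_records(root):
    """Count lines across every *.csv file under root using plain file I/O."""
    total = 0
    for path in sorted(Path(root).rglob("*.csv")):
        with path.open("r", encoding="utf-8") as handle:
            total += sum(1 for _ in handle)
    return total


# A temporary directory stands in for an NFS mount of the cluster (assumption).
with tempfile.TemporaryDirectory() as mount:
    daily = Path(mount) / "daily"
    daily.mkdir()
    (daily / "part-0000.csv").write_text("a,1\nb,2\n", encoding="utf-8")
    (daily / "part-0001.csv").write_text("c,3\n", encoding="utf-8")
    total = count_records(mount)

print(total)
```

The same function would run unchanged against a real mount path, which is the practical benefit Brown describes: existing tools and scripts need no Hadoop-specific export step.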

For the near future, Brown says the comScore IT infrastructure will continue to scale to meet new customer demand. The Hadoop cluster has grown to 450 servers with 17,000 cores and more than 10 petabytes of disk.

MapR’s distro of Hadoop is also helping to support a major new product announced in 2012 and enjoying rapid growth. Known as validated Campaign Essentials (vCE), the new measurement solution provides a holistic view of campaign delivery and a verified assessment of ad-exposed audiences via a single, third-party source. vCE also allows the identification of non-human traffic and fraudulent delivery.

Start Small

When asked if he had any advice for his peers in IT who are also wrestling with Big Data projects, Brown commented, “We all know we have to process mountains of data, but when you begin developing your environment, start small. Cut out a subset of the data and work on that first while testing your code and making sure everything functions properly. Get some small wins. Then you can move on to the big stuff.”
