How a Web Analytics Firm Turbo-Charged Its Hadoop ETL
The Web analytics firm comScore knows a thing or two about managing big data. With tens of billions of data points added to its 400-node Hadoop cluster every day, the company is no stranger to scalability challenges. But one ETL optimization trick in particular has helped comScore save petabytes of disk and speed up data processing in the bargain.
ComScore is one of the biggest providers of Web analytics used by publishers, advertising firms, and their clients. If you want to be sure your ad dollars are being optimally spent–that is, that the right people with the right demographics are seeing them at the right time on the right websites–then it’s likely that comScore and its various data-collection and data-processing methodologies are involved in some way.
Data is the heart of comScore’s operations. Every day, consumer activity generates more than 65 billion events, or about 20TB of raw data, which flow into its MapR Technologies Hadoop cluster. Over the course of a month, that translates into nearly 2 trillion events added to the cluster, roughly equivalent to 40 percent of the total page views on the Web. With 8.3PB of disk space and 33TB of RAM, the Hadoop cluster serves as both the center of production analytic workloads and the company’s long-term archive.
With so much data coming in, it’s no surprise that compression plays a major role in comScore’s data-handling routines. The company uses the open source GZIP algorithm to compress the files as they arrive in the cluster. GZIP is able to reduce files to about 22 percent of their original size, so for an hour’s worth of comScore’s data input–or about a 2.3TB file–GZIP is able to reduce it to 509GB.
That’s pretty good. But comScore CTO Mike Brown found another tool that could deliver even better reductions in data size: the DMX family of data integration tools from SyncSort. According to Brown, using DMX to sort the data as it’s being loaded into Hadoop allows that 2.3TB file to be further compressed, via GZIP, to 324GB, or about 14 percent of the original.
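Brown’s ratios come from comScore’s production data, but the underlying effect is easy to reproduce with Python’s standard gzip module: sorting records groups repeated values together, so DEFLATE’s sliding window finds longer, closer back-references. Here is a minimal sketch on purely synthetic event lines (the field names and sizes are invented for illustration, not comScore’s schema):

```python
import gzip
import random

# Synthetic "event log": a small pool of repeated user/URL values,
# loosely mimicking web-analytics records (invented for illustration).
random.seed(42)
users = [f"user{n:05d}" for n in range(500)]
urls = [f"/page/{n}" for n in range(200)]
lines = [f"{random.choice(users)}\t{random.choice(urls)}\tGET\n"
         for _ in range(50_000)]

raw = "".join(lines).encode()
unsorted_gz = gzip.compress(raw)
# Same bytes, but sorted first so identical users sit on adjacent lines.
sorted_gz = gzip.compress("".join(sorted(lines)).encode())

print(f"raw:           {len(raw):>9} bytes")
print(f"gzip unsorted: {len(unsorted_gz):>9} bytes")
print(f"gzip sorted:   {len(sorted_gz):>9} bytes")
```

The sorted copy compresses smaller than the unsorted one on this synthetic data; in comScore’s pipeline the sort happens in DMX as the data is loaded, so the improved ratio comes essentially for free on top of the GZIP step.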
“That saves us half a petabyte of disk space in a quarter,” Brown tells Datanami. “We’d have to have more disk spend if we weren’t using this to boost the compression ratio by sorting. We’d also have to implement things in Hive or do things in Pig, which are not as efficient as implementing that in DMX.”
A ‘Ginsu Blade’ for Data
ComScore has been using the DMX family of high-speed sorting tools since 2000, one year after the company was founded. In those days, the company used massively parallel processing (MPP) data warehouses to store and process data. When the company moved to Hadoop in 2009, it brought its ETL tool (then called DMExpress) to Hadoop too.
Today, the firm is using SyncSort’s plain vanilla DMX version and a version optimized for Hadoop called DMX-h to provide optimized data sorting routines for its MapR Technologies M5 distribution as well as its 200-node EMC Greenplum data warehouse, which comScore’s data scientists use to explore data and build models.
“Basically we use DMExpress as our ginsu knife to get data prepped and ready for various environments,” Brown says. “So data comes in through our collection network, it runs through ETL, and we have SyncSort’s DMX built directly into those applications, and we use those things to load data into our Hadoop and GP clusters.”
DMX-h executes within the MapReduce framework via a pluggable sort enhancement. SyncSort actually worked with the open source Apache community to implement an enhancement, known as JIRA MAPREDUCE-2454, into the Apache Hadoop framework to enable sorting work to be executed externally to MapReduce itself. That work resulted in SyncSort being recognized as one of the biggest contributors to the Apache Hadoop codebase at the recent Hadoop Summit.
DMX-h often replaces portions of homegrown MapReduce programs that users develop in Java, Pig, or HiveQL, which can be a difficult, error-prone process. The product comes with a Windows-based development tool and a library of pre-built components for implementing common ETL tasks, such as joins, change data capture (CDC), web log aggregations, and accessing EBCDIC data on IBM mainframes. Not only does DMX-h deliver better performance, SyncSort says, but it maintains data lineage and dependencies automatically.
Proof In the (Hadoop) Pudding
DMX-h has proven itself to be an indispensable part of comScore’s operations, and has helped it manage some of the more esoteric aspects of HDFS, including how data is distributed across the nodes.
In its Hadoop journey, two specific issues have concerned comScore. The first was the fact that all the data had to go through the name node, which creates a throttling limit on one’s ability to process data, Brown says. The second was the “tiny file” problem: Hadoop’s inefficiency at storing huge numbers of small files. The combination of SyncSort and the tweaks that MapR made to its Hadoop distribution helped address both concerns.
“Hadoop stores data optimally in 256MB chunks,” Brown explains. “What you want to do is have the 256MB chunks distributed to lots of servers, so when you run a process that’s reading that data, you can get more servers reading that chunk of data.”
The SyncSort tool can automatically chunk and distribute the data to get the most use out of HDFS. “DMX-h’s chunking feature enables you to get that parallelism without having to do any code to break that file up,” Brown says. “It automatically just does it for you on the load, by checking the box. Performance-wise, it has a huge impact on your environment.”
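The chunk-and-distribute idea can be sketched in plain Python: compute byte ranges aligned to line boundaries (much as HDFS input splits do for mappers), then process each range independently. This is an illustration of the splitting concept only, not DMX-h’s or MapR’s actual implementation; the demo file and chunk size are made up:

```python
import os
import tempfile

def chunk_offsets(path, chunk_size):
    """Split a file into byte ranges aligned to line boundaries,
    mimicking how HDFS input splits let each worker read whole records."""
    size = os.path.getsize(path)
    offsets, pos = [], 0
    with open(path, "rb") as f:
        while pos < size:
            end = min(pos + chunk_size, size)
            if end < size:
                f.seek(end)
                f.readline()       # advance past the partial line
                end = f.tell()
            offsets.append((pos, end))
            pos = end
    return offsets

def count_lines(path, start, end):
    """Count rows in one byte range; in Hadoop each range would run
    on the node that holds that chunk."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).count(b"\n")

# Tiny demo file standing in for a much larger data set.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"".join(b"row %d\n" % i for i in range(10_000)))

splits = chunk_offsets(tmp.name, chunk_size=16 * 1024)
total = sum(count_lines(tmp.name, s, e) for s, e in splits)
print(len(splits), "splits,", total, "rows")
os.remove(tmp.name)
```

Because every split ends exactly at a newline, no row is ever cut in half, and the per-split counts sum to the true total; that boundary handling is the part DMX-h’s “checking the box” automates.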
Brown supplied some numbers to back up his claim. Say you have a 28GB file containing 50 million rows of data, and you want to run a simple program to count the rows. Against the uncompressed file, the count takes about 49 seconds, or 1 million rows per second, Brown says. If you first compress the file with GZIP, the same job takes 451 seconds, “because it has to go ahead and route all that decompression back to your one server, because it has to start wherever the file started.”
However, if you first use DMX-h to sort the data before compressing it, the job runs much faster. “You get the best of both worlds,” Brown says. “You get the size of a compressed file–or about 3.4GB, 12 percent of the original–and you increase the throughput to 1.3 million rows per second, so it takes 36 seconds to run.”
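The throughput arithmetic behind these figures is easy to check from the row count and wall-clock times Brown cites:

```python
# Row count and timings as reported for the 28GB / 50M-row file.
rows = 50_000_000
timings = {"uncompressed": 49, "gzip only": 451, "sorted + gzip": 36}
for label, seconds in timings.items():
    print(f"{label:>13}: {rows / seconds / 1e6:.2f}M rows/sec")
```

The sorted-and-compressed case works out to about 1.39 million rows per second, which rounds to the “1.3 million” figure Brown quotes, while the gzip-only case drops to roughly a tenth of the uncompressed throughput.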
It’s a win-win scenario. “You get parallelism back, but you still get the efficiency of using a commodity off-the-shelf compression algorithm like GZIP.”