How NoSQL Drives Analytic Agility at Nielsen
For many years, Nielsen used a standard relational database to power the big data analytic offerings used by thousands of its customers around the world. But the weight of that infrastructure became a drag on the company, until it finally switched to a NoSQL database that was better suited to the task.
Most people know Nielsen best for its television rating system, and the so-called “Nielsen Families” who have agreed to be part of a massive ongoing TV survey. But in addition to gauging TV (and radio, online, and billboard) viewership, the company also collects, aggregates, and disseminates information about the goods that people buy.
Major consumer packaged goods (CPG) manufacturers like Kraft and Procter & Gamble and retailers like Safeway use the Answers on Demand component of Nielsen’s Global Buy program to figure out what products people are buying. This information is critical in determining not only what products to sell, but how to price and market them too.
The Answers on Demand program is fueled by three primary sources of data:
- Aggregated and anonymized point of sale (POS) data that Nielsen collects from nearly every major retailer in the world;
- Data from retailers’ loyalty card programs;
- And panel data from a group of volunteer customers who scan every UPC barcode on every product they buy.
Obviously, there is a ton of data involved in Answers on Demand. During any given week, this system handles billions of POS records, hundreds of millions of loyalty card records, and millions of UPC scans. Nielsen’s ability to efficiently manage and present this data to customers is a big factor in this program’s success, as well as the success of its customers.
Stretching the Limits of Relational
Darrell Pratt, former principal architect at Nielsen, explained the efforts to simplify the Answers on Demand architecture during a Couchbase conference held in September 2013. During the event, Pratt (who now works at Cars.com) said one of the biggest problems was the way the relational database stored data.
“The relational data model overload was one of our big issues,” Pratt said. “Our database, for something you would think is fairly easy, was 12 tables to define a report. How do you put it back together and get performance out of that?”
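The article doesn’t show Nielsen’s actual schema, but the pattern Pratt describes is easy to see in miniature: a single report definition normalized across several tables must be reassembled with multiple queries or joins, while a document model returns it in one read. A minimal sketch, with entirely hypothetical table and column names:

```python
import json
import sqlite3

# Hypothetical normalized schema: one report definition split across
# multiple tables (Nielsen's real schema reportedly used 12 of them).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE report (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE report_filter (report_id INTEGER, field TEXT, value TEXT);
CREATE TABLE report_column (report_id INTEGER, position INTEGER, metric TEXT);
INSERT INTO report VALUES (1, 'Weekly POS Summary');
INSERT INTO report_filter VALUES (1, 'region', 'US'), (1, 'category', 'beverages');
INSERT INTO report_column VALUES (1, 1, 'units_sold'), (1, 2, 'revenue');
""")

# Relational path: reassembling one report takes a query per table.
name = db.execute("SELECT name FROM report WHERE id = 1").fetchone()[0]
filters = dict(db.execute(
    "SELECT field, value FROM report_filter WHERE report_id = 1"))
columns = [m for (m,) in db.execute(
    "SELECT metric FROM report_column WHERE report_id = 1 ORDER BY position")]
report = {"name": name, "filters": filters, "columns": columns}

# Document path: the same definition lives in one JSON document,
# stored and served as-is with no reassembly step.
doc = json.dumps(report)
assert json.loads(doc) == report
```

The performance question Pratt raises follows directly: every report rendered means repeating that reassembly, where the document model is a single key lookup.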
Besides the complex data schema, a heavy reliance on data transformations was also hurting Nielsen. While the company stored the data in the Oracle database as XML-based CLOBs, everything is JSON from the end-user’s point of view. Nielsen gives its customers extensive power to customize their reports, and all those configurations and filtering criteria are stored in native JSON documents.
“We really wanted to get out of the business of all those data transformations,” he said. “We’re serving up JSON. Why are we breaking it all apart into tables and different columns? It just doesn’t make sense.”
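Pratt’s complaint can be sketched in a few lines: the old path stored XML and had to parse and re-shape it into JSON on every request, while storing JSON natively makes serving a pass-through. A hypothetical illustration of the two paths (the field names are invented):

```python
import json
import xml.etree.ElementTree as ET

# Old path (sketch): report config stored as an XML CLOB, parsed and
# re-marshalled into JSON on every request.
xml_clob = "<config><market>US</market><period>week</period></config>"

def serve_from_xml(clob):
    root = ET.fromstring(clob)                       # parse the CLOB
    obj = {child.tag: child.text for child in root}  # intermediate objects
    return json.dumps(obj)                           # marshal out to JSON

# New path (sketch): the config is already a JSON document; serving it
# requires no transformation at all.
json_doc = '{"market": "US", "period": "week"}'

def serve_from_json(doc):
    return doc

# Both paths produce the same payload; only one does it without work.
assert json.loads(serve_from_xml(xml_clob)) == json.loads(serve_from_json(json_doc))
```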
The product data changes constantly for Nielsen, and keeping on top of those changes was becoming a challenge. For example, if one little characteristic of one product changed (such as changing a bottle size from 11 oz. to 13 oz.), the change could impact tens of millions of reports. To deal with it, Nielsen would kick off full table scans to ensure that the change was fully implemented (or mark the data as suspect so customers could avoid it).
“The complexity of those objects and how much they change causes so much churn,” he said. “You’re re-compiling everything, you’re re-generating everything with little changes to the database. Over time this gets to be way too much of a hassle.”
Running those full table scans was a real chore, especially considering that Pratt’s team typically performed them during system downtime on the weekend. “We kept joking that we have to come up with a new day,” he said. “Saturday and Sunday aren’t generally enough.”
A New Data Architecture
Adding another day to the weekend probably wasn’t going to fly with Nielsen’s CIO, so Pratt looked to other solutions—namely, re-architecting the database layer. The company selected Couchbase Server, a distributed document-based NoSQL database, to replace the Oracle layer. Couchbase’s native JSON support and flexible data schema meant that Nielsen could eliminate most of those data transformations and respond much more quickly to changing data and report attributes.
The fact that Nielsen is dealing with JSON natively in Couchbase is a huge efficiency boost. “We’re not doing all the transformations,” Pratt said. “We’re not building up Java objects here and then marshalling them out to JSON and sending them over the wire. We’re getting rid of stuff there, which is very important.”
Leaving the data in its native format, and having all the user-facing applications call the data in Couchbase and then put it back, has simplified the workflow for Nielsen. “XML to me is horrible. So we’re trying to get very far away from that, by pushing JSON further and further down the stack,” he said.
The new Couchbase-based system runs 50 percent faster than the old Oracle-based system, according to Arvind Jade, the current architecture lead at Nielsen, who was quoted in a recent Baseline Magazine article. “By moving the metadata to Couchbase, we were able to dramatically improve the efficiency of the system and speed data delivery,” Jade said in a Couchbase press release. “We are able to query against the index and target specific documents, something we were not able to do previously.”
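The press release doesn’t show the queries, but the pattern Jade describes—consulting an index to find matching keys, then fetching only those documents rather than scanning everything—can be sketched with a plain dict standing in for the document store (all keys and attribute names here are hypothetical):

```python
import json

# Stand-in for the document store: keys map to JSON report documents.
store = {
    "report::1": json.dumps({"market": "US", "category": "snacks"}),
    "report::2": json.dumps({"market": "EU", "category": "snacks"}),
    "report::3": json.dumps({"market": "US", "category": "dairy"}),
}

# Secondary index over a document attribute, built once and maintained
# as documents change (Couchbase does this with views/indexes).
index_by_market = {}
for key, doc in store.items():
    index_by_market.setdefault(json.loads(doc)["market"], []).append(key)

# Query the index, then fetch only the matching documents -- no full scan.
us_keys = index_by_market["US"]
us_reports = [json.loads(store[k]) for k in sorted(us_keys)]
print(us_reports)  # only report::1 and report::3
```

The contrast with the old system’s weekend-consuming full table scans is the point: a change or query touches only the documents the index identifies.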