Inside WebTrends’ Big Data Analytics Pipeline
WebTrends has been collecting and analyzing Web data on behalf of its customers since it was founded way back in 1993. Considering the exponential growth of the Net since then, it’s not a stretch to say WebTrends was doing big data before big data was a “thing.” But following the recent creation of a data analytics pipeline built with technologies like Hadoop, Spark, and Kafka, the company is taking its big data analytic services to a whole new level.
WebTrends makes its living in the digital marketing arena. Whenever you visit the website or use a mobile app of a WebTrends client, the site or app is using various techniques to track who you are and what you do. This log data is then fed into WebTrends servers, where it’s processed and presented back to the client, which uses it to do things like market segmentation, website optimization, and email re-marketing.
Big data may be in the eye of the beholder, but in WebTrends’ case, the data is really big – and fast too. Every day about 2,000 clients generate 13 billion events, which WebTrends ingests, processes, and makes available to clients for reporting purposes within about 40 milliseconds. The data is growing so fast at WebTrends that it’s adding half a petabyte of storage every quarter to its twin data centers in Portland, Oregon, and Las Vegas, Nevada.
Today, those data centers house the Hadoop, Spark, and Kafka clusters that do the bulk of the work for WebTrends’ primary digital analytics offerings. But that wasn’t the case just a few years ago, when the company still stored “micro files” on big high-speed file shares, accessing them with relational databases to create aggregate reports for clients.
The problem with the old approach was that it was inflexible, says WebTrends Director of Technology Peter Crossley. “If we wanted to make a change to the reporting, we would have to rerun the data across the system and re-generate the aggregates,” Crossley told Datanami at the recent Hadoop Summit event in San Jose, California. “We lost the ability to change them without doing reprocessing, which is kind of expensive.”
So while WebTrends customers might be able to view website or mobile traffic by location, the existing system didn’t make it easy for them to suddenly view the same data by device instead. It was possible, but it took a lot of work. “Millions and millions of files become a nightmare to manage,” Crossley said.
WebTrends adopted Hadoop about two years ago, and is now enjoying the file flexibility that HDFS affords. Crossley estimates that no longer having to reprocess aggregates cuts the cost of delivering these services by 20 to 40 percent. And it’s allowing WebTrends to expand the types of services that it offers clients.
“The main platform, called the Infinity Engine, doesn’t have any restrictions in the data shape and size. We can collect anything for customers, and because Hadoop is so flexible, it allows us to take those data sets and pull that data in and repurpose it for different verticals,” Crossley said. “Some of our competitors have limitations on how much data they can consume….We’ve removed that box with Hadoop.”
Crossley says the flexibility of Hadoop is critical looking forward, especially considering the wide breadth of solutions that will become possible with the Internet of Things. “The IoT is creating an advent of tech requirements that are going to make people say ‘There’s so much data coming to us, what do we do with it?'” he says. “We live in a digital world and we know that tomorrow we’re going to have a new problem to solve with that data, but we don’t know what it is yet. Well, Hadoop gives you that landing space to be able to make those decisions later on.”
WebTrends is also heavily investing in real-time technologies like Apache Kafka to give its customers an even bigger edge. Kafka forms the messaging backbone for WebTrends and is responsible for streaming data from the Internet into WebTrends servers. The company is also using Storm and Samza to perform analytics on the data.
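At the heart of Kafka’s role as a messaging backbone is its core abstraction: an append-only, partitioned log that independent consumers read at their own offsets. A minimal in-memory sketch of that model (purely illustrative, not WebTrends’ actual pipeline) might look like:

```python
# Minimal sketch of Kafka's append-only log model: producers append
# records, and each consumer group tracks its own read offset, so the
# same stream can feed reporting, Storm, Samza, etc. independently.
# Illustration only -- not WebTrends' production setup.

class TopicLog:
    def __init__(self):
        self._records = []   # the append-only log
        self._offsets = {}   # consumer group -> next offset to read

    def produce(self, record):
        """Append a record; return its offset in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def consume(self, group, max_records=10):
        """Return up to max_records unread records for this group."""
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

log = TopicLog()
for event in ({"page": "/home"}, {"page": "/cart"}):
    log.produce(event)

print(log.consume("analytics"))  # both events on the first read
print(log.consume("analytics"))  # empty: the group's offset advanced
```

Because each group keeps its own offset, adding a new downstream consumer never disturbs the ones already reading the stream.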
Apache Spark plays a critical role in WebTrends’ big data stack. “All the interaction with data goes through Spark,” Crossley says. “We basically have a grammar, a query language, that we’ve put on top of Spark that then translates it into Spark tasks, and then executes jobs against the data. We’ve done some really interesting processes that allow us to stream the data in and then stream the data out. We can actually see the data taking form and shape as it’s being returned.”
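The idea of a small query grammar compiled into lazily executed tasks can be sketched as a toy translator (the grammar, field names, and translation here are invented for illustration; WebTrends’ actual query language and Spark integration aren’t public):

```python
# Illustrative only: a toy query "grammar" compiled into a pipeline of
# lazy transformations, loosely mimicking how a query language might be
# translated into Spark tasks. Not WebTrends' actual implementation.

def compile_query(query):
    """Translate 'where <field>=<value> | select <field>' into callables."""
    tasks = []
    for clause in query.split("|"):
        op, _, arg = clause.strip().partition(" ")
        if op == "where":
            field, _, value = arg.partition("=")
            tasks.append(lambda rows, f=field, v=value:
                         (r for r in rows if r.get(f) == v))
        elif op == "select":
            tasks.append(lambda rows, f=arg: (r[f] for r in rows))
        else:
            raise ValueError(f"unknown operation: {op}")
    return tasks

def execute(tasks, rows):
    for task in tasks:       # chain generators: nothing runs until...
        rows = task(rows)
    return list(rows)        # ...results are materialized here

events = [
    {"device": "mobile", "page": "/home"},
    {"device": "desktop", "page": "/cart"},
    {"device": "mobile", "page": "/cart"},
]
plan = compile_query("where device=mobile | select page")
print(execute(plan, events))  # ['/home', '/cart']
```

The generator chaining mirrors Spark’s lazy evaluation: the plan is built up front, but no data moves until the results are actually requested.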
Any organization dealing with big data realizes what a pain data cleansing can be. And the problem of data variety can be an even bigger pain at a company like WebTrends, which has thrown open the doors and accepts data in all sorts of shapes and sizes. But with Spark sitting in the data pipeline, WebTrends has found a way out of this dreaded chore.
“There’s all this data cleansing that we have to do potentially. What we really learned is we don’t do it,” Crossley says. “It sounds kind of funny, but you don’t do it. You store it in its raw form, and leveraging technologies like Spark, which we’ve been invested in for a long, long time now, is allowing us to take the data out, in near real-time, stream it, emit it, and then mutate it and modify it as you need to on the outbound, and decode it, split it, or zip it or whatever you want to do with it.”
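That “store it raw, transform it on the way out” pattern is often called schema-on-read. A minimal sketch of the idea (the event fields and transform are invented for this example) might look like:

```python
import json

# Illustrative schema-on-read sketch: events are stored exactly as they
# arrive (raw JSON strings), and all decoding/reshaping happens only when
# the data is read back out. Field names are invented for the example.

raw_store = []  # stands in for raw files landed on HDFS

def ingest(raw_line):
    """Store the event untouched -- no cleansing at write time."""
    raw_store.append(raw_line)

def read_events(transform):
    """Decode and reshape lazily, applying the caller's transform."""
    for line in raw_store:
        yield transform(json.loads(line))

ingest('{"ua": "Mozilla/5.0 (iPhone)", "url": "/cart"}')
ingest('{"ua": "Mozilla/5.0 (Windows NT)", "url": "/home"}')

# Decide *at read time* how to interpret the raw data:
mobile_flags = list(read_events(
    lambda e: {"url": e["url"], "mobile": "iPhone" in e["ua"]}))
print(mobile_flags)
# [{'url': '/cart', 'mobile': True}, {'url': '/home', 'mobile': False}]
```

Because nothing is thrown away at ingest time, tomorrow’s question can be answered from yesterday’s raw data simply by reading it back with a different transform.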
WebTrends has been using Spark since the project’s very earliest days, before it was an Apache project even, and employs one of Spark’s committers, which Crossley admits “is awesome.” It also adopted Kafka before it was an official project.
“We’re pre-bleeding edge adopters of technology when we feel it’s valid technology that we can support, both from a development standpoint and a production-ready standpoint,” Crossley says. “We’ve really been able to accelerate our growth and ability within the Hadoop ecosystem.”
Being on the bleeding edge is one thing, but having technical support is nice too. So to that end, WebTrends selected Hortonworks to supply its Hadoop distribution. Crossley says Hortonworks’ closeness to the core of Apache Hadoop was the main reason why it selected HDP.