Paving the Yellow Brick Road to Behavioral Analytics
Behavioral analytics is one of the largest nuggets in the pan for prospectors in the big data gold rush. However, amid the clickstreams, transaction data, server logs, and millions of rows spread across table after table, getting from raw data to ROI is no easy feat. At Hadoop Summit this summer, PayPal discussed how it is approaching the challenge of wicking away the complexity of the foundational tools and turning data into dollars.
It all starts with the API, explained Rahul Bhartia, an engineer with the Internet transaction giant. Anyone who has been on the Internet has seen the interface a thousand times. And if you have a PayPal account, then you’ve likely been through the PayPal transaction flow. It’s a seemingly simple process, but in the background, a fraud review is going on, and once that has cleared, there is authentication, confirmation, and finally shipment, with money being withdrawn from your account.
There’s almost nothing to it as you move through it on the front end, but as one might guess, there are many systems working together to make this flow happen. And in that flow, terabytes of data are being generated from all of the different moving parts. Clickstreams, server logs, transactions – the raw data piles up fast, but in raw form, it’s useless to a business analyst.
“Our analysts need a single unified access to find behavior across these channels – across these data streams,” said Bhartia, explaining the need for the configurable Event Analytics Pipeline (EAP) that PayPal created to handle the millions of transactions (resulting in billions of server events) a day.
That the solution involves Hadoop will come as no surprise to anyone following the space at this point – that’s not what’s interesting. What is worth noting is PayPal’s approach to eliminating the handwringing that happens when business analysts think about having to work with Hadoop directly. “We started by asking ‘what are the design principles that we need to have to build a platform to enable analysts to quickly get to the data, yet not be bogged down by the complexities of Hadoop for analysts who have been using SQL for many years?’” Bhartia explained.
To attack this challenge, Bhartia says they started by building a common extensible framework in which the analyst can describe what they need from the system without the need for them to write code. In a world where “democratizing data” is becoming a buzzword, PayPal is pushing the envelope by not only giving this codeless portal to their analysts, but also having them leave breadcrumbs for the next person.
“Think of it something like crowdsourcing the metadata in the company,” explained Bhartia, noting that once the data is augmented, it is translated to a common format which anybody in the company can get access to. Through their library of tags and relations, PayPal enables the analysts to provide meaning to the rest of the company on the data they are touching.
“An analyst can go in and say ‘this is a click which is interesting to my area of the business,’ and our system will start building the collection of clicks,” he said, adding that these collections of clicks could be certified in the system, letting others know who is using the data, how important it is, and that the data is production ready for analytics.
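As a rough illustration of the idea, the sketch below shows how an analyst-defined rule could route raw events into a named collection that is later marked as certified. It is not PayPal's implementation: the `Collection` class, the `collect` function, the field names, and the sample events are all hypothetical, chosen only to make the tagging-and-certifying idea concrete.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Event = Dict[str, str]


@dataclass
class Collection:
    """A named, analyst-defined group of events (hypothetical structure)."""
    name: str
    owner: str                         # who defined and uses this collection
    matches: Callable[[Event], bool]   # the analyst's "this click is interesting" rule
    certified: bool = False            # set once the data is production ready
    events: List[Event] = field(default_factory=list)


def collect(events: List[Event], collections: List[Collection]) -> None:
    """Append each event to every collection whose rule it satisfies."""
    for event in events:
        for collection in collections:
            if collection.matches(event):
                collection.events.append(event)


# Hypothetical collection: clicks on the checkout page.
checkout_clicks = Collection(
    name="checkout_clicks",
    owner="payments-analytics",
    matches=lambda e: e.get("type") == "click" and e.get("page") == "checkout",
)

# Made-up sample events for illustration only.
raw_events = [
    {"type": "click", "page": "checkout", "user": "u1"},
    {"type": "click", "page": "home", "user": "u2"},
    {"type": "server_log", "page": "checkout", "user": "u1"},
]

collect(raw_events, [checkout_clicks])
checkout_clicks.certified = True  # signals to others that the data is ready to use
print(checkout_clicks.name, len(checkout_clicks.events), checkout_clicks.certified)
```

Keeping the rule, the owner, and the certification flag on the same object is one simple way to make the metadata travel with the data, which is the "breadcrumbs" idea described above.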
With the data translated and defined, PayPal has further built an analytical framework to allow analysts to give process context to the data. “Our users can describe common uses of patterns of data, which is taking the data, translating it from events all the way after metadata, and then figuring out the flow of events – saying what really happened in the transaction flow,” explained Bhartia.
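To make "figuring out the flow of events" concrete, here is a minimal sketch that groups events by transaction, orders them by time, and checks each resulting sequence against the steps named earlier in this article (fraud review, authentication, confirmation, shipment). The field names `txn_id`, `ts`, and `step`, the helper functions, and the sample data are assumptions for illustration, not details from PayPal's system.

```python
from collections import defaultdict
from typing import Dict, List

# The steps of the transaction flow as described in the article.
EXPECTED_FLOW = ["fraud_review", "authentication", "confirmation", "shipment"]


def build_flows(events: List[dict]) -> Dict[str, List[str]]:
    """Group events by transaction ID and order each group's steps by timestamp."""
    by_txn = defaultdict(list)
    for event in events:
        by_txn[event["txn_id"]].append(event)
    return {
        txn_id: [e["step"] for e in sorted(txn_events, key=lambda e: e["ts"])]
        for txn_id, txn_events in by_txn.items()
    }


def completed(flow: List[str]) -> bool:
    """True when a transaction went through every expected step, in order."""
    return flow == EXPECTED_FLOW


# Made-up events, deliberately out of order, as they might arrive from different systems.
events = [
    {"txn_id": "t1", "ts": 3, "step": "confirmation"},
    {"txn_id": "t2", "ts": 1, "step": "fraud_review"},
    {"txn_id": "t1", "ts": 1, "step": "fraud_review"},
    {"txn_id": "t1", "ts": 4, "step": "shipment"},
    {"txn_id": "t2", "ts": 2, "step": "authentication"},
    {"txn_id": "t1", "ts": 2, "step": "authentication"},
]

flows = build_flows(events)
for txn_id, flow in sorted(flows.items()):
    print(txn_id, flow, "completed" if completed(flow) else "incomplete")
```

In this toy data, transaction `t1` goes through all four steps while `t2` stops after authentication, which is the kind of "what really happened" question the quote refers to.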
To accomplish this, they built a module and job system on top of the input framework that enables them to run processing jobs directly connected to the enterprise data warehouse, with all querying done using SQL. Once the data is processed into usable analytics, it can be fed into Tableau, or the analyst can use PayPal’s implementation of D3 to visualize the results.
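The kind of SQL an analyst might run over processed flow data can be sketched with Python's built-in `sqlite3` module standing in for the enterprise data warehouse. The table name `flow_events`, its columns, and the rows are invented for this example; the article does not describe PayPal's actual schema.

```python
import sqlite3

# An in-memory SQLite database stands in for the data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flow_events (txn_id TEXT, step TEXT)")
conn.executemany(
    "INSERT INTO flow_events VALUES (?, ?)",
    [
        ("t1", "fraud_review"), ("t1", "authentication"), ("t1", "confirmation"),
        ("t2", "fraud_review"), ("t2", "authentication"),
        ("t3", "fraud_review"),
    ],
)

# A simple funnel: how many distinct transactions reached each step?
rows = conn.execute(
    """
    SELECT step, COUNT(DISTINCT txn_id) AS txns
    FROM flow_events
    GROUP BY step
    ORDER BY txns DESC, step
    """
).fetchall()

for step, txns in rows:
    print(step, txns)
conn.close()
```

A result set shaped like this (one row per step with a count) is the sort of output that could then be handed to a visualization tool.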
If you haven’t noticed, we haven’t really talked about Hadoop, HDFS, HBase, or any of the systems that this is all happening on. The analyst doesn’t have to be concerned with any of it and can stay laser focused on the data. From identifying the data to giving it shape and meaning, the same analyst can write processes, build workflows, query the enterprise database, and build visualizations without having to write a single line of MapReduce code or any other script.
While the complexities of Hadoop are sure to persist, especially as new features such as YARN are added into the elephant stew, solutions like the one PayPal has produced, along with others that exist in the industry, show that there are ways to get real value out of the often complicated tools available.
You can see more on the PayPal solution, and even dive deeper into the architecture of their system here.