March 10, 2014

How PayPal Makes Merchants Smarter through Data Mining

Alex Woodie

Since PayPal was founded 16 years ago, it’s become the go-to place for payment transaction processing on the Internet, particularly among smaller merchants with limited resources. A couple of years ago, PayPal launched a new Hadoop-based data mining program with the goal of helping its parent company, eBay, but also with aims toward enabling smaller merchants to connect more successfully with their customers.

PayPal has grown to become a $6-billion powerhouse in the field of ecommerce transactions. Last year, the company helped buyers and sellers exchange $180 billion in goods and services across 3 billion distinct transactions. Much of that activity is generated from 143 million active accounts, on top of a good number of “one and done” customers.

The company’s position at the crossroads of e-commerce gives it a unique view into the world’s online buying habits. That position translates into a treasure trove of data about what products people buy, who they buy them from, how they got there, and what devices they use. “It’s a humongous amount of data we’re trying to get our head around and provide some value to consumers and merchants together,” says Vamshi Ambati, a data scientist involved with PayPal’s Data Technologies team.

Ambati shared information about his team’s three main areas of focus–graph mining, text analytics, and machine learning–during the recent Hadoop Innovation Summit in San Diego, California.

Graph Processing

PayPal uses graph processing to help data scientists and marketers visually identify any big trends that are showing up in its data. “What we try to do is extract variables from the graph, or patterns from the graph, and use those as features for downstream modeling to do predictive modeling and central analysis,” Ambati says.
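As a rough illustration of what "extracting variables from the graph" can mean, the sketch below computes a few per-node features from a list of transactions. It is a minimal sketch under stated assumptions, not PayPal's pipeline: the `(buyer, seller, amount)` record layout, the function name `graph_features`, and the choice of features are all hypothetical, and plain Python stands in for the Hadoop-based tooling the article describes.

```python
from collections import defaultdict


def graph_features(transactions):
    """Compute simple per-node features from a transaction graph.

    transactions: iterable of (buyer, seller, amount) tuples
                  (hypothetical schema, assumed for illustration).
    Returns a dict mapping each node to a dict of features.
    """
    out_neighbors = defaultdict(set)  # buyer  -> distinct sellers paid
    in_neighbors = defaultdict(set)   # seller -> distinct buyers served
    spent = defaultdict(float)        # buyer  -> total amount paid

    for buyer, seller, amount in transactions:
        out_neighbors[buyer].add(seller)
        in_neighbors[seller].add(buyer)
        spent[buyer] += amount

    nodes = set(out_neighbors) | set(in_neighbors)
    return {
        node: {
            "out_degree": len(out_neighbors.get(node, ())),
            "in_degree": len(in_neighbors.get(node, ())),
            "total_spent": spent.get(node, 0.0),
        }
        for node in nodes
    }


if __name__ == "__main__":
    tx = [("u1", "m1", 10.0), ("u1", "m2", 5.0), ("u2", "m1", 7.5)]
    for node, feats in sorted(graph_features(tx).items()):
        print(node, feats)
```

Features like these could then be joined to other account data and fed to a downstream predictive model, which is the role Ambati describes for graph-derived variables.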

The company uses Intel’s graph modeler to build the graph on Hadoop, and Apache Giraph and GraphLab algorithms for actual processing. The graph will look different depending on the goals. For example, a graph that blends social media information with transaction data will be useful for matching social media activity with transactions, while mixing clickstream data atop that transaction data will show PayPal’s team how consumers got there.

Detecting fraud is the biggest use case for graph processing. The company configures the nodes in the graph to correspond to the devices that consumers use to log in to its customer and merchant accounts. If a consumer uses a different IP address or a mobile account, PayPal wants to make sure that they’re not trying to siphon money out of an account. “We’re also trying to see if there are closed communities trying to do fraud,” Ambati says. “It’s not just one particular node in the graph trying to conduct fraud but three to four nodes trying to sell or withdraw and deposit at the same time. So we want to be able to capture those as well.”
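One simple way to surface the kind of "closed communities" Ambati mentions is to link accounts that share login devices and look for connected groups. The sketch below does this with a union-find structure. It is an illustrative assumption rather than PayPal's method: the `(account_id, device_id)` login records, the function name `find_linked_groups`, and the threshold of three accounts are hypothetical, and a production system would weigh many more signals before flagging anyone.

```python
from collections import defaultdict


def find_linked_groups(logins, min_accounts=3):
    """Group accounts that are connected through shared devices.

    logins: iterable of (account_id, device_id) pairs
            (hypothetical schema, assumed for illustration).
    Returns a list of sets of account ids, keeping only groups
    with at least min_accounts members.
    """
    parent = {}

    def find(x):
        # Locate the representative of x's group, shortening paths as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    accounts = set()
    for account, device in logins:
        accounts.add(account)
        # Tag ids so an account and a device with the same id never collide.
        union(("acct", account), ("dev", device))

    groups = defaultdict(set)
    for account in accounts:
        groups[find(("acct", account))].add(account)

    return [group for group in groups.values() if len(group) >= min_accounts]


if __name__ == "__main__":
    logins = [("a", "d1"), ("b", "d1"), ("b", "d2"), ("c", "d2"), ("x", "d9")]
    print(find_linked_groups(logins))
```

Here accounts `a`, `b`, and `c` end up in one group because they are chained together through devices `d1` and `d2`, while `x` stays on its own and falls below the threshold.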

NLP Text Mining

PayPal’s Hadoop-based text mining system is a critical component for a variety of data science activities at the company, including predictive modeling, sentiment analysis, influence scoring, profile ranking, and topic modeling and clustering. “Text by itself doesn’t provide extra value, except for dashboarding,” Ambati says. “But you’d want to use text in conjunction with other data we have, or you may have as a company, to conduct more predictive modeling.”

PayPal data scientist Vamshi Ambati

The company uses natural language processing (NLP) algorithms to extract meaning from the transactions and conversations people have online, for the purpose of improving a merchant’s ability to understand and successfully sell products to consumers.

This is not a straightforward process for PayPal, however. Unlike at Netflix, where a four- or five-star rating is a clear indicator of customer preference and a good place to begin the product recommendation system, the fact that a consumer bought from a merchant is not always a clear indicator that the consumer likes the merchant, Ambati says.

“When a consumer shops at a merchant, we don’t really know if a consumer is interested or likes the merchant or not. What might be happening is he might be interested in the brand or product that’s being sold at the merchant,” he says. “So we do text mining over the product information to understand if there are particular brands that someone likes or not, then use that to start the recommendation system.”
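To make the idea concrete, the sketch below counts brand mentions in the titles of products a consumer has bought, then ranks merchants by how well their catalog matches those brands. It is a toy sketch under stated assumptions: simple keyword matching stands in for real NLP, and the brand vocabulary `KNOWN_BRANDS`, the function names, and the merchant-to-brand mapping are all hypothetical.

```python
from collections import Counter

# Hypothetical brand vocabulary, assumed for illustration only.
KNOWN_BRANDS = {"acme", "globex", "initech"}


def brand_affinity(product_titles, brands=KNOWN_BRANDS):
    """Count how many purchased product titles mention each known brand."""
    counts = Counter()
    for title in product_titles:
        tokens = set(title.lower().split())
        for brand in brands & tokens:
            counts[brand] += 1
    return counts


def recommend_merchants(product_titles, merchant_brands, top_n=1):
    """Rank merchants by overlap with the consumer's brand affinity.

    merchant_brands: dict mapping merchant name -> set of brands carried
                     (hypothetical structure).
    Merchants with no matching brand are left out.
    """
    affinity = brand_affinity(product_titles)
    scores = {
        merchant: sum(affinity[brand] for brand in carried)
        for merchant, carried in merchant_brands.items()
    }
    ranked = sorted(
        (m for m in scores if scores[m] > 0),
        key=lambda m: (-scores[m], m),  # highest score first, then by name
    )
    return ranked[:top_n]


if __name__ == "__main__":
    titles = ["Acme running shoes", "ACME socks", "Globex kettle"]
    catalog = {"shopA": {"acme"}, "shopB": {"globex"}, "shopC": {"initech"}}
    print(recommend_merchants(titles, catalog, top_n=2))
```

The point of the design is the one Ambati makes: the purchase history is read as evidence of brand interest rather than merchant loyalty, so a merchant the consumer has never visited can still rank well if it carries the right brands.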

Machine Learning

Much of the data that PayPal surfaces with its graph processing and NLP activities ends up as the basis for the company’s third core data analytics focus: data mining with machine learning algorithms. PayPal’s data mining system is largely built on machine learning algorithms written in Python and Java running on Hadoop, and is used to mine complex data models for actionable insight.

One of the common use cases for this setup is to do predictive modeling on behalf of merchant customers. PayPal has access to an enormous amount of data about consumer buying habits–much more than a small merchant selling on eBay could hope to amass. But thanks to PayPal’s data science team, that merchant can leverage PayPal’s vast data repository and its expertise to gain a competitive advantage.

“As a merchant, you may not know enough about your consumers, but PayPal really knows more about their consumers,” Ambati says. “If you think of PayPal, we don’t really have consumers. Our consumers are the merchants, and then every single consumer of the merchants is our consumer. We talk to consumers through merchants. So we want to be able to provide all these capabilities to help the merchants improve the customer experience for consumers.”

PayPal is moving to YARN and looking at ways to leverage Spark and Storm. It’s involved in some “deep learning” projects with the University of Minnesota that Ambati didn’t elaborate much on. The company also spends a fair amount of time building recommendation engines for eBay. “If you do see ads asking you to go back and shop at eBay, that’s probably us, floating some ads assuming you really like those products,” Ambati says. “If you don’t, apologies.”
