How LifeLock Uses Big Data to Fight Identity Fraud
About 9 million Americans fall victim to identity theft each year, leading to about $50 million in losses, according to the Privacy Rights Clearinghouse. While the big data explosion gives identity thieves more material to exploit, it can also make it easier to spot the fraudsters, provided you have the right tools.
One of the companies that’s invested a considerable amount of time and money building tools to fight identity fraudsters is LifeLock. The Tempe, Arizona company banks on these tools to power the identity theft and fraud detection services that it sells to American consumers and businesses.
The big data engine powering LifeLock is largely built and maintained by ID Analytics, a San Diego, California company that LifeLock first partnered with in 2009 and acquired in 2012. Dr. Stephen Coggeshall, the Chief Analytics and Science Officer at ID Analytics who oversees about a dozen data scientists, recently explained to Datanami how the company uses big data technology to detect identity theft in the real world.
When Coggeshall and several other former FICO execs left that company to found ID Analytics in 2002, the problem of transaction fraud had largely been solved, Coggeshall says. “But new account origination fraud, which is the classic identity theft, hadn’t had a real solution,” he says. “So we started the company in 2002 specifically to solve the problem.”
ID Analytics set out to solve that problem from the point of view of the business. Companies like banks and telecommunications providers wanted to be able to crack down on the various types of identity theft that criminals were starting to perpetrate.
Coggeshall and his colleagues decided that the best way to attack that problem was to get a holistic view of all the account application activity occurring across the country. “We had to go around and get people to join the ID Network, which was a consortium of account origination data,” he explains.
Hundreds of banks, credit card companies, and telecommunications firms joined, and today, the ID Network has visibility into about half the applications that occur every day. Every time somebody applies for a new credit card or a new cell phone contract, there is a one in two chance that the application is hitting ID Analytics’ proprietary identity fraud detection system.
Graphing Big Fraud
The volume of account-related data and the speed at which LifeLock’s business customers want ID Analytics to score it for identity fraud make this a big data problem. “The key to identifying identity fraud is to watch for anomalies in these events, whether they’re account origination events or address change request or other kinds of account takeover events,” says Coggeshall, who has 20 years of experience in this industry.
ID Analytics uses a combination of technologies and techniques to give it the capability and capacity to respond to requests from its business customers in less than a second. When the firm gets a request, the first thing it does is pull all past events associated with the event in question. The company maintains about a dozen “linking keys” that are critical to finding past events that are relevant to the current event. The linking keys are organized by primary data types, such as name, date of birth, Social Security number, phone numbers, and email and IP addresses, Coggeshall says.
“When we see this event, we go to our databases and in real time, we pull all associated events and then we build what we call a topology,” Coggeshall says. “We use graph analytics to build up this topological connectivity around these different events–how they connect and relate to each other with respect to Social Security numbers, names, dates of birth, email addresses, etc. We build this topological representation and we encode that into variables that become inputs into our machine learning algorithms.”
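The topology Coggeshall describes can be pictured with a small sketch. This is not ID Analytics' engine, which is proprietary; it is a minimal illustration that assumes events are plain dictionaries and that two events are connected whenever they share a value on a linking key. The key names, the `build_topology` and `topology_features` helpers, and the three derived variables are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical linking keys; the article names these data types but not a schema.
LINKING_KEYS = ("ssn", "name", "dob", "phone", "email", "ip")


def build_topology(events):
    """Connect events that share a value on any linking key.

    Returns a dict mapping an (event_id, event_id) pair to the set of
    linking keys the two events have in common.
    """
    index = defaultdict(list)  # (key, value) -> ids of events carrying it
    for ev in events:
        for key in LINKING_KEYS:
            value = ev.get(key)
            if value is not None:
                index[(key, value)].append(ev["id"])

    edges = defaultdict(set)
    for (key, _value), ids in index.items():
        for a, b in combinations(sorted(ids), 2):
            edges[(a, b)].add(key)
    return dict(edges)


def topology_features(events, edges):
    """Encode the graph as a few numeric variables for a downstream model."""
    ssns_per_name = defaultdict(set)
    for ev in events:
        if ev.get("name") and ev.get("ssn"):
            ssns_per_name[ev["name"]].add(ev["ssn"])
    return {
        "n_events": len(events),
        "n_edges": len(edges),
        "max_ssns_per_name": max((len(s) for s in ssns_per_name.values()), default=0),
    }


# Made-up events for illustration only.
events = [
    {"id": 1, "ssn": "111-22-3333", "name": "A. Smith", "phone": "555-0100"},
    {"id": 2, "ssn": "111-22-3334", "name": "A. Smith", "phone": "555-0100"},
    {"id": 3, "ssn": "111-22-3333", "name": "B. Jones", "email": "bj@example.com"},
]
edges = build_topology(events)
features = topology_features(events, edges)
```

In this toy data, one name appearing with two different Social Security numbers shows up as `max_ssns_per_name == 2`, the kind of structural signal that could be fed to a scoring model.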
It’s worth pointing out that ID Analytics builds its graphs in real time. The company does not have the luxury of time when building these graphs, which typically have 10 to 50 nodes and perhaps 10 times as many edges. The company built its own graph analytics engine, and uses high-speed buses to move data stored in the historical databases onto its graph engine.
The models built by the machine learning algorithms are able to score a credit application very quickly and accurately. If an anomaly pops up, then the application is denied, and hopefully the fraudster shrinks back into the shadows of his hollow, meaningless life. If the information on a denied application involved data that’s connected to one of LifeLock’s paying customers, then that customer is immediately notified that there’s been a fraud attempt made using their personal data. (That’s the capability that attracted LifeLock to ID Analytics in the first place.) If the consumer isn’t a paying LifeLock subscriber, then no alert is sent out (although the bank, telco, or retailer that denied the application may contact them separately).
Casting a Broad Net
ID Analytics scores roughly half of all the activity that occurs daily across credit cards, retail credit, pay-day loans, peer-to-peer payments, and mobile phones, Coggeshall says. “It’s not complete coverage,” he says. “But based on our historical visibility, we see pretty much complete coverage of all people in the United States. So when we see somebody come through the system, even if we’ve never seen that person before, we have information about their Social Security number, perhaps their name, date of birth, address, or phone number. All of those components give us information to help make decisions about whether that’s an attempted fraud.”
Besides its ability to store and transport tens of terabytes of data very quickly, ID Analytics differentiates itself from competitors through the makeup of its machine learning models. According to Coggeshall, the company’s machine learning activity runs the gamut, from neural nets and support vector machines to boosted trees and clustering. The models are frequently used against the company’s database of previous fraud attempts. With more than 3 million instances of actual fraud attempts, it’s the largest such database in the world, Coggeshall says.
“It’s mostly supervised learning, but we also have some unsupervised algorithms, especially around the clustering,” he says. “This is supervised because we have the 3 million fraud attempts in the past, so we’re able to build really high quality supervised fraud models to predict whether a new unseen event is likely fraud.”
The criminals who perpetrate identity theft are clever sons of guns, so ID Analytics employs unsupervised learning techniques to give it a wild-card factor against them. “We’re using deep learning neural nets and with that some supervised clustering, and that’s allowing us to detect some new fraud modes that we didn’t know existed before,” Coggeshall says.
For example, some fraudsters are finding that they can skirt past corporate defenses by changing their legitimate Social Security number by a digit or two, or tweaking their date of birth by a month or a year. Then there’s “synthetic fraud,” which is where people, for example, choose a random set of numbers for their Social Security number. “There are new fraud modes emerging all the time,” Coggeshall says. “The costs are spread out to all of us. That’s why they have to charge fees and higher rates, because they have to cover the losses.”
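The "digit or two" tactic lends itself to a simple check: count how many digit positions differ between two Social Security numbers. The sketch below is only an illustration of that idea, not the company's method. The function names and the default threshold of two changed digits (taken from the phrasing above) are assumptions, and the numbers used are made up.

```python
def digits_only(ssn):
    """Strip dashes and any other non-digit characters."""
    return "".join(ch for ch in ssn if ch.isdigit())


def digit_differences(ssn_a, ssn_b):
    """Count the positions at which two nine-digit SSNs differ."""
    a, b = digits_only(ssn_a), digits_only(ssn_b)
    if len(a) != 9 or len(b) != 9:
        raise ValueError("expected nine-digit Social Security numbers")
    return sum(x != y for x, y in zip(a, b))


def is_near_match(ssn_a, ssn_b, max_changed=2):
    """True when the numbers differ, but by no more than max_changed digits."""
    return 0 < digit_differences(ssn_a, ssn_b) <= max_changed


# A one-digit tweak is flagged; an identical or unrelated number is not.
tweaked = is_near_match("123-45-6789", "123-45-6780")
identical = is_near_match("123-45-6789", "123-45-6789")
unrelated = is_near_match("123-45-6789", "987-65-4321")
```

A check like this only compares two numbers at a time; applying it across millions of applications would need an index or blocking scheme, which is outside the scope of this sketch.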
Hadoop Makes an Appearance
Lately, a Hadoop cluster has started growing within ID Analytics’ primary Las Vegas, Nevada data center. The Hadoop cluster is less than 100 nodes, but it’s showing promise as an internally facing environment for running advanced machine learning models. Instead of following the linking keys out just two degrees, the distributed power of Hadoop can allow ID Analytics to follow the trail much deeper.
“With our Hadoop environment, we’re able to go out to higher order of linkages and create variables that we hadn’t been able to do before,” Coggeshall says. “It allows us to improve our algorithms and find pockets of fraud and anomalies that are difficult to spot, and it allows us to get more precise in our ability to predict whether an event is likely a fraud.”
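Following linkages out to a given number of degrees is, in graph terms, a depth-limited breadth-first search. The sketch below runs on a single machine over an in-memory adjacency map, so it says nothing about how the work is distributed on Hadoop; the `events_within` function and the event IDs are hypothetical.

```python
from collections import deque


def events_within(adjacency, start, max_degree):
    """Return {event_id: degree} for events reachable within max_degree hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_degree:
            continue  # do not expand beyond the allowed number of degrees
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return seen


# Made-up chain of linked events: e1 - e2 - e3 - e4.
adjacency = {
    "e1": ["e2"],
    "e2": ["e1", "e3"],
    "e3": ["e2", "e4"],
    "e4": ["e3"],
}
two_degrees = events_within(adjacency, "e1", 2)
four_degrees = events_within(adjacency, "e1", 4)
```

Here the two-degree search stops at `e3`, while the deeper search also reaches `e4`, which illustrates how higher-order linkages can surface events a shallow search never sees.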
Thanks to the big data work that ID Analytics is doing, the world is a slightly safer place–not just for LifeLock customers, but for all of us. While it won’t put an end to identity theft, it does put criminals on notice that the good guys and their allies are watching every step they take.