While the ability to accurately predict future crime is still reserved for science fiction, the big data analytics approaches required for such predictions are steadily being shaped and refined.
The impetus behind much of this work isn’t crime prediction so much as massive-scale fraud detection. New methods have emerged to understand criminal networks, identity thieves and fraudsters who are bilking governments, healthcare organizations, insurance companies and others out of billions, and this is becoming a booming business.
According to Jo Prichard, a lead data scientist for LexisNexis Risk Solutions’ HPCC Systems initiative, fraudsters have been pitted against some of the most comprehensive data collection efforts in history. What’s unique here is that all of this data has deep context.
The standard data wells stacked with neat rows of names and addresses have been kicked up a notch. It’s now possible to crunch personal, family and network histories via a wide swath of associations. For example, when an identity thief files a false tax return under someone else’s name, there are now hundreds of variables that help confirm whether that filer is who he claims to be. The system is, in essence, a lie detector on a global scale.
In this era of the massive graph, identifying fraudsters has moved beyond the standard personal investigation to a new “guilt by association” system of flagging. The graph’s long tentacles reach into so many sources that hiding connections and histories becomes almost impossible, especially as fresh data is fed in. Adding to the colossal data-driven fraud detection effort are increasingly smart algorithms that learn as more data is added and adapt accordingly. All of these aspects, run together on fine-tuned high performance systems, are making large-scale fraud almost impossible when the data conditions are right.
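As a rough illustration of what “guilt by association” flagging can mean on a graph, the sketch below marks everyone within a fixed number of relationship hops of a known fraudster. It is a minimal example under stated assumptions, not a description of the LexisNexis system: the function name `flag_by_association`, the `max_hops` cutoff and the sample people are all hypothetical.

```python
from collections import deque


def flag_by_association(edges, known_fraud, max_hops=2):
    """Return {person: hops} for everyone within max_hops of a known fraudster.

    Hypothetical helper for illustration only.
    """
    # Build an undirected adjacency map from (a, b) relationship pairs.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Multi-source breadth-first search starting from every known fraudster.
    hops = {person: 0 for person in known_fraud}
    queue = deque(known_fraud)
    while queue:
        current = queue.popleft()
        if hops[current] == max_hops:
            continue  # do not expand past the hop limit
        for other in neighbors.get(current, ()):
            if other not in hops:
                hops[other] = hops[current] + 1
                queue.append(other)
    return hops


# Made-up relationship data: a chain of four people plus an unrelated pair.
edges = [("ann", "bob"), ("bob", "cy"), ("cy", "dee"), ("eve", "fay")]
flags = flag_by_association(edges, known_fraud={"ann"}, max_hops=2)
```

A flag of this kind is only a lead for further review. A real system would presumably weigh the type and strength of each relationship rather than count hops alone.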
Prichard’s company, a unit within LexisNexis, is providing massive-scale fraud detection as a data service, built on the company’s long-standing platform, to the government and several large companies to help them understand the roots of fraudulent activity. He told us that they currently have data from over 20 sources feeding into a central well that houses in-depth personal, work-related and asset histories for over 270 million Americans.
“We can see how people’s lives play out in data, which gives us this backdrop to understand what we expect to see from people and the more we learn from that at the granular level of detail,” Prichard said. And in the end, the graph that this data creates yields around four billion relationships, a number that will continue to grow with the continuous updates to the system.
This resembles the much-discussed Facebook graph search, but on a far grander and more pervasive scale. Government agencies, insurance companies and banking entities are all in desperate need of fraud-fighting tools at incredible scale. What they require is a national lie detector powered by evolving, constant data streams wherein the more data that is fed in, the more granular the analysis will be.
Specifically, he described the process of forming a massive network as akin to starting with fragments of identity. These snippets of a person’s identity are culled from 20,000 public and private sources (everything from deeds and DMV records to credit reports) and must be cleansed and integrated in a fashion that accommodates the variability of formats and types of information. From this, it becomes easier to piece together the fragments of identity to complete the puzzle of one person. With that piece in place, the associations (cohabitation, shared assets, etc.) between families and networks can be built, creating a new layer of the puzzle on top of which yet another layer, such as an employment network, can be built.
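The two steps described above, resolving fragments into one person and then layering associations on top, can be sketched in a few lines. This is a simplified illustration, not the company’s pipeline: the record fields, the choice of name plus date of birth as the matching key, and the helper names `normalize`, `resolve_identities` and `cohabitation_links` are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def normalize(text):
    # Cleansing step: lowercase, drop punctuation, collapse whitespace.
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def resolve_identities(fragments):
    """Group fragments that share a (normalized name, date of birth) key."""
    people = defaultdict(list)
    for frag in fragments:
        key = (normalize(frag["name"]), frag["dob"])
        people[key].append(frag)
    return people


def cohabitation_links(people):
    """Link resolved people who appear at the same normalized address."""
    by_address = defaultdict(set)
    for key, frags in people.items():
        for frag in frags:
            if frag.get("address"):
                by_address[normalize(frag["address"])].add(key)
    links = set()
    for residents in by_address.values():
        for a, b in combinations(sorted(residents), 2):
            links.add((a, b))
    return links


# Made-up fragments from three hypothetical sources, in inconsistent formats.
fragments = [
    {"source": "deed", "name": "John  Smith", "dob": "1970-01-02", "address": "12 Oak St."},
    {"source": "dmv", "name": "john smith", "dob": "1970-01-02", "address": None},
    {"source": "credit", "name": "Mary Smith", "dob": "1972-05-06", "address": "12 oak st"},
]
people = resolve_identities(fragments)
links = cohabitation_links(people)
```

Here the deed and DMV fragments collapse into one John Smith, and the shared address yields a single cohabitation link to Mary Smith. Exact-key matching is the crudest possible choice; the learning-based linking described in the article would replace it.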
The possibilities for examining personal relationships are rather staggering. Even for someone with a very common name, the associations across the work, credit, transaction, deed and licensing histories alone help narrow down the subject. “To do all this we [use] some really smart algorithms,” says Prichard. “We have a technology called LexID, which is really a linking technology based on a learning algorithm where the more data you give it, the more it learns, and the better it is able to resolve identity in the end.”
Prichard describes this smart algorithm as an ever-hungry, multi-armed entity that is constantly stitching together bits of the data fabric. With more data, it becomes simpler for the algorithm to properly decide if someone is John Smith from Pickle, Arkansas versus the other John Smith in the same town.
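The point that more data makes the two John Smiths easier to tell apart can be shown with a toy matcher that counts agreeing attributes. This is only a sketch of the general idea, not how LexID works; the profile fields, the sample values and the functions `agreement` and `best_match` are invented for illustration.

```python
def agreement(record, profile):
    # Count fields present in both that hold the same value.
    return sum(
        1
        for field, value in record.items()
        if field in profile and profile[field] == value
    )


def best_match(record, profiles):
    """Return the best-matching profile id, or None if the top scores tie."""
    scored = sorted(
        ((agreement(record, profile), pid) for pid, profile in profiles.items()),
        reverse=True,
    )
    if not scored:
        return None
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None  # ambiguous: not enough data to separate the candidates
    return scored[0][1]


# Two hypothetical people with the same name in the same town.
profiles = {
    "smith_a": {"name": "John Smith", "town": "Pickle", "employer": "Acme Feed"},
    "smith_b": {"name": "John Smith", "town": "Pickle", "employer": "County Schools"},
}

sparse = {"name": "John Smith", "town": "Pickle"}
richer = {**sparse, "employer": "County Schools"}

sparse_result = best_match(sparse, profiles)
richer_result = best_match(richer, profiles)
```

With only a name and town the matcher cannot decide and returns `None`; one extra attribute is enough to settle on `smith_b`.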
LexisNexis Risk Solutions, which is known in the big data realm for its distributed computing platform built under the HPCC Systems name, claims that it has built a coveted fraud detection platform on time-tested technology that has served financial and other risk management needs for over a decade. Prichard told us that over time, they’ve been able to refine their graph-based analytics operation beyond the traditional rules-based engines and into a more dynamic system that is based on weighting and guidelines to help snap pieces into place.
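The difference between a rules-based engine and a weighted approach can be sketched as follows. The signal names, weights and threshold are made up for the example and are not drawn from the LexisNexis system.

```python
# Hypothetical signal weights, chosen only for illustration.
WEIGHTS = {
    "address_mismatch": 0.5,
    "new_bank_account": 0.25,
    "linked_to_flagged_person": 0.75,
}


def rule_based_flag(signals):
    # Rigid rule: flag only when these two specific conditions fire together.
    return bool(signals.get("address_mismatch") and signals.get("new_bank_account"))


def weighted_score(signals, weights=WEIGHTS):
    # Sum the weights of every signal that is present.
    return sum(weight for name, weight in weights.items() if signals.get(name))


def weighted_flag(signals, threshold=1.0):
    return weighted_score(signals) >= threshold


# A filing that the fixed rule misses but the weighted score catches.
filing = {"address_mismatch": True, "linked_to_flagged_person": True}
rule_result = rule_based_flag(filing)
score = weighted_score(filing)
weighted_result = weighted_flag(filing)
```

The fixed rule returns `False` because one of its two required conditions is absent, while the weighted score of 1.25 clears the threshold. Adding a new signal to the weighted version means adding one entry to the table rather than rewriting rules.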
LexisNexis Risk Solutions is but one entrant in the race to create a vast global network (a mega-graph) of relationship webs, which come into clearer focus as new iterations of data are processed. While their results are based on data from the United States, eventually such a global lie detector could be built as one entity, continuously snapping up feeds to round out the full extent of personal relationships, assets, history and associations. New data would be neatly plugged into the global graph to weave an intricate, broadly useful web at the international or personal level that only becomes more powerful with each new dash of data.
Other companies offer software services to help companies process their own data on a platform (versus sending it to a company like LexisNexis to process and receiving a set of scores or values back). For instance, last year around this time we talked to SAS, which demonstrated the Visual Analytics platform, which provides granular detail on complex strings of relationships and individuals. IBM and others have software that can enable similar graph-type approaches to the problem of fraud detection.
While government agents used to be the bane of fraudsters everywhere, now it’s data. With an endless stream of feeds arriving from an exhaustive list of sources, it has grown almost impossible to hide. Although there are many who feel that sharing a lot of personal information is harmful (as if it were possible to avoid anyway), Prichard reminds us that in actuality, the more data we share, the easier it is to pinpoint our own identities in the global graph, making it far more difficult for fraudsters to claim our identity.