Novetta Throws Entity Analytics Hat Into Hadoop Ring
One of the new big data analytic vendors exhibiting at the recent Strata + Hadoop World conference was Novetta, a firm that’s well known in the Washington, D.C., area for its cyber analytic offerings. But now the company is widening its reach into the commercial market with a Hadoop-based solution called Novetta Entity Analytics.
One of Novetta’s first customers in the big data space was an unnamed government security agency that was having trouble pulling useful information out of an 8-billion-record file. The agency had attempted to crunch the data using an Oracle-based cluster, but it would have taken two years to sort through all the data.
“That equals a failure of mission,” says Jenn Reed, product manager for Novetta Entity Analytics. “They came to White Oak–Novetta now–and said ‘We know you have some great data scientists. Can you help us figure out a way to find out what’s in this data, and do so in a way that enables us to make use of the data while it’s still valuable?’”
They didn’t put a timeframe on it, but “anything less than two years was fantastic,” Reed tells Datanami. When the folks at White Oak were done, they had built a parallel processing architecture designed to crunch massive amounts of data atop a cluster of commodity Linux computers. Called Wareman Pro, the solution bore some similarity to the framework that Doug Cutting and others would soon develop at Yahoo.
The software works by partitioning data across a cluster according to so-called strategy rules, which are the equivalent of a mapping job. Depending on the attributes of the data, another set of rules de-conflicts and analyzes the data using thresholds. “It looks like Hadoop–it has a name node and slave nodes,” Reed says. “It is MapReduce, without calling it MapReduce.”
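The article does not describe the strategy rules or thresholds themselves, so the following is only a minimal sketch of the general pattern Reed outlines: a map-like step that routes records into partitions, followed by a step that compares records within each partition against a threshold. The record fields, the `strategy_key` rule, the `similarity` measure, and the threshold value are all hypothetical and are not Novetta's actual logic.

```python
from collections import defaultdict

# Hypothetical sample records; the field names are illustrative only.
RECORDS = [
    {"id": 1, "name": "Jennifer Reed", "dob": "1975-03-02", "city": "McLean"},
    {"id": 2, "name": "Jenn Reed", "dob": "1975-03-02", "city": "McLean"},
    {"id": 3, "name": "Jennifer Reed", "dob": "1982-11-19", "city": "Austin"},
]


def strategy_key(record):
    """Hypothetical strategy rule: route records by surname and birth year."""
    surname = record["name"].split()[-1].lower()
    return (surname, record["dob"][:4])


def partition(records):
    """Map-like step: group records that the strategy rule sends to the same bucket."""
    buckets = defaultdict(list)
    for record in records:
        buckets[strategy_key(record)].append(record)
    return buckets


def similarity(a, b):
    """Fraction of the compared attributes on which two records agree exactly."""
    fields = ("dob", "city")
    return sum(a[f] == b[f] for f in fields) / len(fields)


def deconflict(bucket, threshold):
    """Reduce-like step: pair records in one bucket whose similarity meets the threshold."""
    matches = []
    for i, a in enumerate(bucket):
        for b in bucket[i + 1:]:
            if similarity(a, b) >= threshold:
                matches.append((a["id"], b["id"]))
    return matches


def resolve(records, threshold=1.0):
    """Run both steps and return the sorted list of matched record-id pairs."""
    pairs = []
    for bucket in partition(records).values():
        pairs.extend(deconflict(bucket, threshold))
    return sorted(pairs)


if __name__ == "__main__":
    print(resolve(RECORDS))
```

Because records are only compared within their own bucket, each bucket could in principle be handled by a different node, which is what makes the approach resemble MapReduce.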
White Oak installed Wareman Pro on a cluster of 126 Linux nodes and set out to make sense of those 8 billion records. They were able to process all of that data, predominantly structured but also unstructured, in about six hours, Reed says. “They were able to figure out the people in the organizations and their relationships,” she says. “But more importantly, they were able to figure out which individuals posed a threat and focus on those and the people connected to them, and not pay attention to the rest of the data, because the rest of the data was noise at that point.”
Over the years, Wareman Pro was adopted by many government agencies to crunch large amounts of data. When Cloudera began to get traction several years ago with commercial Hadoop, White Oak decided to get out of the business of building and maintaining parallel architectures and leave that to Cutting and company. It ported the software to run under CDH version 3, although some government customers continue to run Wareman Pro version 1. Today, the solution is certified on several Hadoop distributions.
In 2012, a group of analytic software firms–including White Oak, FGM, White Cliffs Consulting, and International Biometric–were acquired and joined to form a new entity called Novetta (which means 10 to the 27th power). Based in McLean, Virginia, Novetta and its 600 employees serve the intelligence community with three primary solutions, called Identity Analytics, Cyber Analytics, and Multi-INT (multi-intelligence) Analytics.
This January, Novetta branched out with a new offering aimed at the commercial sector. Dubbed Novetta Entity Analytics, the software helps customers isolate people, organizations, locations, and events. Trying to figure out who’s who is not as easy as it sounds.
“The more homogenous the data set, things start to repeat,” Reed says. “I like to use myself as an example. Jennifer Reed is a very common name. Even with people who share my birthday, I get a lot of collisions on that date.”
What all the Jennifer Reeds of the world don’t have, however, are the relationships that Jennifer Reed of Novetta has. “The social network or past relationships with organizations or behavioral patterns that are now transactions look very different,” Reed says. “We use that information to further separate very ambiguous data.”
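To make the idea concrete, here is a minimal sketch of how relationship data might separate records that collide on name and birthday. It scores the overlap between each record's set of known affiliations using the Jaccard index. The record labels, the affiliations, and the 0.5 threshold are hypothetical, and the article does not say which measure Novetta actually uses.

```python
def jaccard(a, b):
    """Overlap between two sets: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical records that all share a name and birthday,
# each mapped to the organizations or events it is linked to.
CANDIDATES = {
    "rec_a": {"Novetta", "White Oak", "Strata"},
    "rec_b": {"Novetta", "White Oak"},
    "rec_c": {"Acme Retail", "Austin PTA"},
}


def same_entity(left, right, threshold=0.5):
    """Treat two colliding records as one entity if their relationships overlap enough."""
    return jaccard(CANDIDATES[left], CANDIDATES[right]) >= threshold


if __name__ == "__main__":
    print(same_entity("rec_a", "rec_b"))
    print(same_entity("rec_a", "rec_c"))
```

In this toy data, the first two records share most of their affiliations and would be merged, while the third shares none and would be kept apart despite the identical name and birthday.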
Retailers and other companies running large marketing outfits are the primary customers that Novetta is targeting with Entity Analytics. Many of these companies rely on third-party outfits to crunch data for them and tell them which customers they’re reaching, with which marketing tactics, and through which channels. Novetta wants to help these customers bring those workloads back in-house, running in Hadoop.
In many cases, these companies already have the data science skills and even the business logic to pull segmentation data out of their own back-office systems. Instead of shipping this data out to a “black box” outfit that generates the reports, Novetta wants to empower them to perform this critical step for themselves.
“They want to bring it inside. It’s their logic anyway,” Reed says. “But they can also do things they have never done before, especially in customer and marketing analysis, such as being able to group not just by [end user customers], but by an event, like a campaign, as well as by the individual location of stores and customers, and understanding that interplay. This allows them to create more complicated segmentation groupings so they can be more responsive to what’s happening in the market.”
Novetta is also providing fraud detection with Entity Analytics for companies in the healthcare and financial services industries. It’s also helping oil and gas businesses with on-site security. These outfits often work in inhospitable regions of the world, such as North Africa, where local governments can’t guarantee the safety of work crews and executives. By gathering and interpreting data from social networks and the Web, Novetta can help these clients ascertain the relative safety of working in a given place at a given time.
Novetta provides the technical expertise to help customers get started with Entity Analytics, including building the entity maps for people, organizations, and locations. While many of the customers have the skills to build their own entity analysis engine in Hadoop, they don’t have the time. Plus, they don’t want to write the Pig scripts that bring it all together, Reed adds.
“If you’re trying to build it yourself, you’re going to stumble a lot before you succeed,” Reed says. “When people have stored a lot of data on their Hadoop cluster, they’ve re-created mud. Everybody says, ‘Just throw it in, we can figure out what to do with it later.’ Well guess what? Now I’ve got mud and I don’t know what the hell’s in it and I don’t know if there’s any use to me. And now even though it’s cheap, I’m still storing it, so it’s still costing me something. [Novetta can] tell you how that data is connected back to the things that drive your business, which are customers, the locations, your channels, and events.”
Most of the data sources that these organizations will run through Entity Analytics already exist within their own walls. But Novetta can also help them bring external data feeds to bear on these challenges, such as social network data, RSS feeds, credit reports, and even blog posts. There is no one-size-fits-all solution that will apply to everybody’s data situation, especially when it comes to using big data to generate new business.