July 13, 2017

Hacker Hunting: Combatting Cybercrooks with Big Data


When it comes to cybersecurity, the big data explosion represents both a liability and an asset. On the one hand, big collections of data represent a treasure trove that hackers would love to get their dirty little mitts on. But on the other hand, the capability to collect, store, and analyze huge reams of data gives the good guys a powerful tool to thwart the bad guys.

These are two ends of the same stick. If you’re building a data lake atop Hadoop, Amazon S3, MongoDB or any other big data platform, you sure as heck better be securing that data. After all, if that data is valuable to you – which is hopefully why you’re collecting it in the first place – then it will likely have value to some digital bottom dweller who steals for a living.

There are various open source and commercial tools designed to help you lock down access to that data. But there’s another element of big data security, and it involves using advanced analytics to better detect when bad guys (or bad software) are trying to do us harm.

Here’s how the good guys are using big data analytics to go after the black hats of the world.

Machine Learning De Rigueur

Machine learning techniques have been reducing spam in our email inboxes for years. Now that same technology is being applied by leaders in the cybersecurity world, including McAfee, which started using it widely with a major update to its Internet security software last fall.

“I do think it’s important to know that we have been shipping machine learning in our product at scale for many, many months,” says McAfee CTO Steve Grobman. “We did a massive launch last year in late October, November that was really retooling our detection technology to take full advantage of machine learning capabilities.”

That ML capability extends from the data lakes that McAfee uses to store examples of malware and infiltration techniques, all the way out to the millions of end-points that it protects.

“Our consumer product line, as well as our most current enterprise product, both have these technologies built into them,” Grobman tells Datanami. “We’re able to classify a much larger set of malicious scenarios using non-deterministic machine learning capabilities to find either new forms of malware or threat scenarios as compared to traditional techniques.”

Speed to Detection

While established security firms like McAfee play large roles in cybersecurity, organizations are also turning to big data analytic software companies to get a leg up on the detection of bad stuff going on in their networks.

One of those analytics firms seeing increased uptake for security use cases is Pentaho. Chuck Yarbrough, a vice president with the Hitachi subsidiary, says he’s seeing an uptick in customers using its big data software to bolster their company’s security.

“The reality is, you’re going to get intrusions,” Yarbrough says. “There are tools to help prevent intrusions…. But oftentimes the challenge is to recognize that you’ve had an intrusion, and being able to figure that out as quickly as possible.”

On average it takes organizations 205 days to discover that they’ve had an intrusion into their internal systems, Yarbrough says. That means these ticks get nearly seven months to poke their nasty little heads into organizations’ digital crevices before the host figures out that something’s amiss. That’s just too long.

“The amount of time it takes to identify a problem is crazy,” Yarbrough says. “How can we bring that number down? Ultimately we want to get it down to hours or minutes, and part of that is blending multiple data sets together.”

The typical security engagement with Pentaho starts small, with just a few data sources and analysts prepping the data. Customers then combine data sets, for example by bringing employee social data or badge swipe data to bear against log files recording network activity.

Once the data is landed, prepped, and mixed, the data scientists use machine learning algorithms, usually in R or Python, to discover anomalies that could indicate a breach. The Pentaho software helps to operationalize this entire data pipeline and run it repeatedly on Hadoop or other big data platforms.
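As a rough illustration of that blending step, here is a minimal Python sketch. Everything in it is an assumption made for the example, not Pentaho's software or any customer's pipeline: the records are made up, the function names `blend` and `flag_anomalies` are hypothetical, and a simple "five times the user's median" rule stands in for the machine learning algorithms a data scientist would actually use.

```python
from statistics import median

# Hypothetical network log: outbound megabytes per user per day (made-up values).
network_log = [
    {"user": "alice", "day": 1, "mb_out": 40},
    {"user": "alice", "day": 2, "mb_out": 38},
    {"user": "alice", "day": 3, "mb_out": 45},
    {"user": "alice", "day": 4, "mb_out": 42},
    {"user": "alice", "day": 5, "mb_out": 900},
    {"user": "bob", "day": 1, "mb_out": 60},
    {"user": "bob", "day": 2, "mb_out": 55},
    {"user": "bob", "day": 3, "mb_out": 58},
    {"user": "carol", "day": 1, "mb_out": 20},
    {"user": "carol", "day": 2, "mb_out": 22},
    {"user": "carol", "day": 3, "mb_out": 500},
]

# Hypothetical badge data: (user, day) pairs on which the user swiped into the office.
badge_swipes = {
    ("alice", 1), ("alice", 2), ("alice", 3), ("alice", 4),
    ("bob", 1), ("bob", 2), ("bob", 3),
    ("carol", 1), ("carol", 2), ("carol", 3),
}


def blend(network_log, badge_swipes):
    """Add an on_site flag to each network record using the badge data."""
    return [
        dict(rec, on_site=(rec["user"], rec["day"]) in badge_swipes)
        for rec in network_log
    ]


def flag_anomalies(records, ratio=5.0):
    """Flag off-site days whose volume exceeds ratio x the user's median volume."""
    volumes = {}
    for rec in records:
        volumes.setdefault(rec["user"], []).append(rec["mb_out"])

    flagged = []
    for rec in records:
        typical = median(volumes[rec["user"]])
        if rec["mb_out"] > ratio * typical and not rec["on_site"]:
            flagged.append((rec["user"], rec["day"]))
    return flagged


print(flag_anomalies(blend(network_log, badge_swipes)))  # prints [('alice', 5)]
```

In this toy data both alice and carol have a day with an unusually large transfer, but only alice's happened on a day with no badge swipe. The badge data is the added context that separates the two cases.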

“There are a lot of security systems out there that do parsing of data all the time, and they’re really good at it,” Yarbrough says. “But now we can parse that data in an appropriate way and blend in additional data sets to help it add that context. I build the data model on the fly so that either the data scientists or a security analyst can do some level of interactive analytics against that to find the cause, or the potential problems, and then be able to take action.”

Context Is Key

Finding context in the data is critical to stopping cybercrimes, such as fraud, says Poornima Ramaswamy, vice president of business analytics and insights at Cognizant Digital Business, a technology consultancy.

“We are basically trying to combine human science and data science to be able to create analytical solutions that are more human-based, deeply contextual, and can actually truly address the problems that we face,” Ramaswamy says.

For example, fraud conducted in-store is much different from fraud conducted over the Internet. In engagements with big box retailers, Cognizant collects data about the movement of individuals within the store, and combines that with other data to come up with a solution.

“We overlay the contextual data…on top of big data to look for patterns of human behavior,” Ramaswamy tells Datanami. “With this approach we can apply the contextual data on top of the data to get more human-based segmentation and then your analytics and insights are a lot more realistic.”

A recent Cognizant project involved researching elements around fraud, and included interviews with 11 actual fraudsters to get more insight into their mindsets. What the research discovered is that cyber thieves don’t like to take unnecessary risks.

“The biggest characteristics for fraud are speed and liquidity and efficiency,” she says. “There’s very little [research] in looking at that end-to-end chain and looking at it as an opportunity to make it less attractive, where you’re not going to be able to liquidate the assets.”

Bad Guys Using It, Too

Big data analytics is helping the good guys, but the cybercriminals are getting hip to the techniques – and some are starting to use it themselves.

According to McAfee’s Grobman, hackers are starting to use machine learning “poisoning” techniques that basically involve throwing a lot of white noise at the good guys’ data receptors with the goal of confusing the model – and thereby throwing the good guys off their trail.


“They’re making it very difficult to tease out the signal within a very noisy set of data,” he says. “An attacker can craft things that look like an attack but are actually benign in order to intentionally create false positives that become very expensive things for the company to deal with, and are forced to lower the sensitivity.”
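The cost Grobman describes can be shown with a small defender-side Python sketch. It rests entirely on assumptions made for illustration: the suspicion scores are invented, the labels are known only because this is a toy example, and the function names `reviewed` and `attack_reviewed` are hypothetical. It also assumes analysts can only look at a fixed number of the highest-scoring alerts per day.

```python
def reviewed(events, budget):
    """Return the `budget` highest-scoring events, i.e. what analysts have time for."""
    return sorted(events, key=lambda event: event[1], reverse=True)[:budget]


def attack_reviewed(events, budget):
    """True if the real attack is among the events analysts get to review."""
    return any(label == "attack" for label, _score in reviewed(events, budget))


# Made-up (label, suspicion score) pairs for one day of activity.
benign = [("benign", 0.05 + 0.01 * i) for i in range(50)]  # scores stay below 0.55
attack = [("attack", 0.80)]
decoys = [("decoy", 0.90)] * 20  # crafted to look like attacks but harmless

quiet_day = benign + attack
noisy_day = benign + attack + decoys

print(attack_reviewed(quiet_day, budget=5))  # prints True
print(attack_reviewed(noisy_day, budget=5))  # prints False
```

With the same review budget, the decoys crowd the real attack out of the analysts' queue. Making the detector less sensitive reduces the noise, but it also risks missing the genuine event.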

You can expect these evasion tactics to become more common in the years to come, as security professionals increasingly rely on machine learning and artificial intelligence to automate rote tasks.

“It’s absolutely something that’s beginning to be analyzed,” Grobman says. “We’re just now hitting the saturation point where there’s enough AI and ML in the industry for the counter measures” to work.

McAfee has also observed bad actors using big data technology and techniques to make their attacks more effective and produce a higher return on investment.

“If you look at what some of these algorithms and technology are good at, one of the things is classification,” he says. “One of the things an attacker can do is evaluate many potential victims, then use a machine learning classifier to choose victims who have a high probability of ease of breach, or they have a high probability to extract high value data.”

Think of it as Netflix for digital ne’er-do-wells. “Just as Netflix wants to suggest to you the movie you’d like the most, an attacker wants to not waste their time on victims that are difficult to break or unlikely to yield high value,” Grobman says.

The rapid evolution of digital technology has benefitted people in many ways. The incredible power of personalization has resulted in consumers now expecting what Forrester analyst Mike Gualtieri aptly calls “the celebrity experience.” And thanks to machine learning, we’re able to spot bad guys lurking in our networks faster than before.

However, this incredible flowering of technology is also making cybercriminals’ jobs easier – and has ushered in an arms race between the good guys and the bad guys. In some respects, it’s the inevitable price that we all must pay for the benefits that big data ultimately brings.
