September 1, 2015

The Dark Web Gets a Little Brighter, Thanks to Big Data

Illegal drugs. Stolen credit card numbers. Hitmen for hire. These are some of the things you can find on the Dark Web, a part of the Internet that’s not indexed by traditional search engines and where people browse in complete anonymity. Now a group of security researchers in Maryland is using big data technology to shine a light on some of the Dark Web’s seedier neighborhoods.

The public Internet can be a creepy place, but it’s nothing compared to the Dark Web, where all manner of human depravity is put on display, and some of it is traded as commerce. Not everything on the Dark Web is sinister or illegal, but enough of it is that it’s gained the attention of the FBI, which put the Dark Web on the public’s map two years ago when it took down the Silk Road marketplace.

A big data startup called Terbium Labs is now looking to shed a little light on the Dark Web. The company, whose founders come out of Johns Hopkins University’s Applied Physics Laboratory in Laurel, Maryland, isn’t indexing the Dark Web to allow people to search it. Rather, the company’s primary goal is to alert clients when any of their data has been compromised and offered for sale on forums and marketplaces located in the Dark Web.

Diving Into the Dark Web

It’s an unfortunate fact, but a good percentage of the data that hackers steal in data breaches, such as the ones that impacted Target and Home Depot, eventually makes its way to various Dark Web forums, like the Silk Road and its ilk. Credit card numbers, for instance, can be bought for pennies (Bitcoin, really), whereas somebody’s Social Security number may be had for about $1. A full medical record is currently the big prize, fetching upwards of $50 apiece, HIPAA or not.

Instead of waiting for somebody to use that stolen credit card number or medical record, the folks at Terbium decided to take a more proactive role and venture into the Dark Web themselves in search of stolen data. Considering that it often takes many months before the average company even becomes aware that it’s lost data—and that third parties are usually the ones making the discovery—staging agents in the Dark Web to monitor real-time activity would seem to be the right approach.

However, indexing the Dark Web and keeping a record of every piece of data that’s exposed there is no easy task. That’s where big data technologies, and specifically Hadoop, come into play.

Maintaining Secrecy

Because Terbium was founded to help maintain secrets, it can’t expect its clients to give up those secrets to enable them to be monitored. That would defeat the whole purpose of the exercise.

Terbium overcomes this hurdle by first masking its customers’ data. The data is run through a cryptographic hashing algorithm, which transforms the sensitive data into a hash, or a fingerprint of the data itself. Customers do this themselves on their own networks, and then upload the fingerprints to Terbium. This ensures that Terbium never has a copy of the data that it’s looking for, just the fingerprint, which is not reversible.
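To make the client-side step concrete, here is a minimal Python sketch of fingerprinting a sensitive value before it leaves the customer's network. The article does not name the hashing algorithm Matchlight uses, so SHA-256 is an assumption, as are the `fingerprint` helper, the normalization rule, and the sample card numbers.

```python
import hashlib


def fingerprint(value: str) -> str:
    """Return a one-way hex digest of a sensitive value.

    SHA-256 is an assumption; the article only says "a cryptographic
    hashing algorithm".
    """
    # Hypothetical normalization: drop spaces and dashes so the same card
    # number written two different ways yields the same fingerprint.
    normalized = value.strip().replace(" ", "").replace("-", "")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Made-up test numbers, hashed locally; only the digests would be uploaded.
records = ["4111 1111 1111 1111", "4111-1111-1111-1111"]
fingerprints = {fingerprint(record) for record in records}
```

Both sample records collapse to a single fingerprint because they normalize to the same digits, and only the set of digests ever needs to leave the customer's network.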

Terbium has also built a substantial network of Web crawlers that are constantly searching the Dark Web. These crawlers cover an estimated 90 percent of the Dark Web domains accessible through the Tor network, including password-protected websites, Terbium co-founder Michael Moore estimates. Because many of the criminals using Dark Web forums will share a sample of their stolen trove of data to prove that it’s legitimate, Terbium is able to detect when a stolen credit card number, for instance, is offered up for sale.

Terbium uses the same cryptographic hashing algorithm to create fingerprints of all the data that it finds on the Dark Web. When a fingerprint is discovered that is an exact match to the fingerprint of its client’s sensitive data, it means that its client’s sensitive data has been offered up for sale on the Dark Web.
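The matching step can be pictured as a set lookup: hash what the crawler finds and check whether the result is already in the client's fingerprint set. The sketch below is illustrative only. SHA-256, the whitespace tokenization, and the names `fingerprint` and `find_matches` are assumptions, not details from Terbium.

```python
import hashlib


def fingerprint(value: str) -> str:
    # SHA-256 is an assumption; the article does not name the algorithm.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def find_matches(crawled_text: str, client_fingerprints: set[str]) -> list[str]:
    """Return fingerprints of crawled tokens that exactly match a client's."""
    hits = []
    # Hypothetical tokenization: split the crawled page on whitespace.
    for token in crawled_text.split():
        candidate = fingerprint(token)
        if candidate in client_fingerprints:
            hits.append(candidate)
    return hits


# The client uploads fingerprints only, never the card number itself.
client_fingerprints = {fingerprint("4111111111111111")}

# A made-up forum post in which a seller shares a sample to prove the goods.
page = "fresh dump sample 4111111111111111 more on request"
matches = find_matches(page, client_fingerprints)
```

The comparison happens entirely between digests, so the party doing the matching never needs the original value.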

The Dark Elephant

The third critical part of Terbium’s product, called Matchlight, relies on Hadoop. Hadoop would seem to be an obvious fit for indexing the Dark Web. After all, that’s basically what Doug Cutting was doing a decade ago when he created Hadoop: building a parallel system capable of indexing the World Wide Web for Yahoo. So it’s no surprise that Terbium selected Hadoop (in this case, MapR’s top-of-the-line M7 distribution of Hadoop) to crawl and index the Dark Web.

“We built a very large-scale, privately protected Dark Web search engine,” Moore says. “Our business model wasn’t even possible five years ago without the big data technology that exists today.”

Terbium’s challenge is one of scale and throughput. While the M7 installation on AWS is not that big (just 240TB at the moment), it’s full of a large number of relatively small files, which runs counter to HDFS’ preference for a smaller number of large files. The Matchlight hashing algorithm works by turning 14-character chunks of data into unique fingerprints. The company is currently storing 430 billion fingerprints on its MapR cluster, and that number is growing by 15 billion fingerprints a day.
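One way to picture the 14-character chunking is a sliding window over the crawled text, with each window hashed separately. The article gives only the chunk size. Whether the windows overlap, which hash is used, and the name `chunk_fingerprints` are all assumptions made for this sketch.

```python
import hashlib

CHUNK_SIZE = 14  # characters per chunk, the one figure the article gives


def chunk_fingerprints(text: str, size: int = CHUNK_SIZE) -> list[str]:
    """Hash every `size`-character window of `text`.

    Assumes overlapping windows that advance one character at a time and
    SHA-256 as the hash; the article specifies neither.
    """
    if len(text) < size:
        return []
    return [
        hashlib.sha256(text[i : i + size].encode("utf-8")).hexdigest()
        for i in range(len(text) - size + 1)
    ]


# A made-up snippet of crawled text.
sample = "card 4111111111111111 exp 01/19"
fingerprints = chunk_fingerprints(sample)
```

Under these assumptions a text of n characters yields n - 13 fingerprints, which gives a sense of how a crawl could produce billions of small records a day.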

“It’s a very big data type of problem,” Moore says. “It’s a complicated problem in that it’s very large scale. The reason we used MapR is because we tried the other large distributed key value stores and databases and MapR is the only one who can handle our ingest load at the price I was willing to pay.”

MapR’s re-implementation of the Hadoop APIs, its tweaks to HDFS, and the work it did integrating HBase enable Terbium to keep up with the huge influx of Dark Web fingerprints without resorting to compaction routines, which would eventually leave Terbium unable to keep up with the incoming flow of fingerprints.

“The idea of trying to push [100 million] individual PUTS into a database [per minute] is very, very challenging,” Moore says. “Because our crawlers are so fast, we get so much data off the Dark Web that we can’t tolerate a compaction delay. So MapR is basically the only thing out there that will keep up with that load.”
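For a sense of scale, the two rates quoted in the article can be converted to per-second figures. This is plain arithmetic on the article's numbers. The two figures come from different passages and may not measure the same thing (a burst rate versus a daily average, for instance), so the comparison is only indicative.

```python
SECONDS_PER_MINUTE = 60
SECONDS_PER_DAY = 24 * 60 * 60

puts_per_minute = 100_000_000          # "[100 million] individual PUTS ... [per minute]"
fingerprints_per_day = 15_000_000_000  # "growing by 15 billion fingerprints a day"

puts_per_second = puts_per_minute / SECONDS_PER_MINUTE
avg_fingerprints_per_second = fingerprints_per_day / SECONDS_PER_DAY

print(f"{puts_per_second:,.0f} PUTs per second")
# prints "1,666,667 PUTs per second"
print(f"{avg_fingerprints_per_second:,.0f} fingerprints per second on average")
# prints "173,611 fingerprints per second on average"
```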

A Head Start on Data Breaches

Terbium just came out of stealth mode earlier this year and is still ramping up. But the early results show a lot of promise.

During one pilot program, a prospective customer loaded 30 million credit card numbers into Matchlight. If the system flags one of those numbers, it could give the credit card company a head start in the breach detection and remediation process, and thereby help prevent it from paying for fraudulent transactions.

Despite the fact that it provides criminals cover, the Dark Web is not going away anytime soon. The Tor browser receives a large part of its funding from the US Government, specifically the Department of Defense and the State Department, both of which place a high value on enabling people to communicate freely from anywhere in the world.

But instead of having an emotional overreaction to data breaches and the role that the Dark Web is playing, Moore encourages chief information security officers (CISOs) to think about the problem logically. “I think there’s a lot of FUD [fear, uncertainty, and doubt] going around about data breaches and compromises,” Moore says. “Certainly it’s a major problem. But it’s more commonplace now than scary.”

CISOs should approach the cyber problem as they would a problem in the physical world. There’s danger walking down a street, Moore says, but a smart person takes realistic precautions before venturing into a dark alley. “You have to do the same in the cyber world. Not be scared, but be ready,” he says.

“What’s needed in the industry is a level of seriousness and not more throwing gasoline on the fire,” Moore says. “The traditional, everyday IT security is still necessary. It’s just no longer sufficient.”

Datanami