July 19, 2013

Zettaset Puts Hadoop on Lockdown

Isaac Lopez

One of the hottest drivers right now for securing Hadoop and other distributed frameworks, according to Brian Christian, CTO and co-founder of the automation and security company Zettaset, is the Affordable Care Act’s digital record-keeping mandate which requires all healthcare records to be digital by January 1, 2014.

However, he says, many of the new age database frameworks, such as Hadoop, either don’t have security, or are taking their first steps into it. The challenges to get there are significant, says Christian, who cautions that the old ways of securing things don’t necessarily translate to the new era of big data.

How could it be that these new frameworks have come so far to challenge the old world relational database technologies with something as fundamental as security being overlooked? And what should implementation managers consider as they examine new solutions that come online?

The challenge, says Zettaset’s Christian, has to do with a combination of the way databases like Hadoop came into being and the nature of distributed databases themselves. When Google started development of the precursor technologies that became MapReduce and ultimately, Hadoop, they were storing public link data – there was no need for a high level of security.

Realizing that traditional database technology wasn’t going to work, Google set out to do it all on a distributed system. “What they needed was something quick, fast, dirty, and easy that ran on commodity hardware because at the end of the day, they weren’t worried about security at all because it was all public link data – if it got breached, so be it, they’re public links,” Christian explains.

“There’s really been no distributed system security in the wild that’s been done,” he adds, commenting that a lot of security vendors in the space have merely reverse engineered their current relational offerings, and are trying to run them on top of Hadoop. “That’s the wrong approach because stand-alone isolated database systems are fundamentally different than distributed systems.”

The challenge, says Christian, is that on a distributed system, traditional database security strategies do not scale. “The moment you introduce a product that does not scale to multiple petabytes of data, you have the potential to introduce another single point of failure because if that single system fails, you are now left with say, 200 machines and 10 PB of data that are now potentially corrupted that you can’t gain access to.”

Christian charges that while strategies like tokenization might work for small data sets, they simply don’t scale to the big data levels that such things as healthcare records require. “When you tokenize something like a social security number, you basically have to create another database – I replace that SSN with a random number, and then I match the number with the one in the database. If you try to tokenize everything on a petabyte cluster, you basically have to have another petabyte cluster to do the translation of everything that’s been tokenized on the first cluster.”
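The tokenization scheme Christian describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Zettaset’s or any vendor’s implementation: the `TokenVault` class and its method names are hypothetical, and the vault lives in memory. It shows the point he makes, which is that the lookup table gains one entry for every distinct value that gets tokenized.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory token vault (illustration only)."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so one value always maps to one token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is random, so it reveals nothing about the original value.
        token = secrets.token_hex(8)
        while token in self._token_to_value:
            token = secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Translation back requires the vault; without it the token is useless.
        return self._token_to_value[token]

    def size(self) -> int:
        # Number of stored mappings; grows with every distinct tokenized value.
        return len(self._token_to_value)
```

Because every tokenized value needs its own entry in the vault, tokenizing all the fields in a very large data set implies a translation store of comparable scale.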

Other problems exist with encryption on distributed systems, and they go back to the common distributed cluster problem: single points of failure. “If I have a crypto box, will it scale to 100 TB, or 1 PB, or 10 PBs?” he asks. “If I have five or six of these boxes, what happens if one of these boxes goes down?”

Christian says that in order to get around this there needs to be a distributed keystore across the cluster where there is a machine doing key management that’s being replicated to another machine – in case the first machine dies, there is a high availability backup failover that would take control. “Without that, if your first machine dies, it’s like having your name node die in Hadoop – you can’t get access to any of the data, or if you can’t recover the keys, then all the data that was encrypted becomes corrupted.”
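The replicated key manager described here can be illustrated with a small Python sketch. The `KeyServer` and `ReplicatedKeyStore` names are hypothetical, the servers are plain in-memory objects, and nothing here reflects how Zettaset’s product actually works. It only shows the failover idea: keys are written to both machines, reads fall through to the standby when the primary is down, and losing both machines makes the keys unrecoverable.

```python
import secrets


class KeyServer:
    """Hypothetical key-management machine holding keys in memory."""

    def __init__(self, name: str):
        self.name = name
        self.alive = True
        self.keys = {}


class ReplicatedKeyStore:
    """Writes each key to a primary and a standby; reads fail over."""

    def __init__(self, primary: KeyServer, standby: KeyServer):
        self.primary = primary
        self.standby = standby

    def create_key(self, key_id: str) -> bytes:
        key = secrets.token_bytes(32)
        # Replicate the new key to every server that is currently up.
        for server in (self.primary, self.standby):
            if server.alive:
                server.keys[key_id] = key
        return key

    def get_key(self, key_id: str) -> bytes:
        # Try the primary first, then fall back to the standby.
        for server in (self.primary, self.standby):
            if server.alive and key_id in server.keys:
                return server.keys[key_id]
        raise RuntimeError("no live key server holds this key; data is unreadable")
```

A real deployment would also need to resynchronize a server that was down while keys were created, and to protect the replication traffic between machines.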

Adding to the complexity challenges is the fact that organizations will most often have more than one person or division accessing the cluster, requiring multi-tenancy. “So now you actually have the added joy of not just handling how users interact with the cluster, how they’re authenticated, how they’re authorized, allowing what data they can see, what data they can’t see. Now you have to enforce multi-department multi-tenancy on the cluster, where Division A can’t see Division B, which can’t see Division C.”
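The tenant isolation requirement in that quote amounts to a deny-by-default access check. The Python sketch below is an assumption-laden illustration: the `User` type, the `PATH_OWNERS` table, and the directory layout are all hypothetical, and a real cluster would enforce this inside the storage layer instead of in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    tenant: str


# Hypothetical mapping of path prefix to the tenant that owns it.
PATH_OWNERS = {
    "/data/division_a/": "division_a",
    "/data/division_b/": "division_b",
    "/data/division_c/": "division_c",
}


def can_read(user: User, path: str) -> bool:
    """Allow a read only when the path belongs to the user's own tenant."""
    for prefix, tenant in PATH_OWNERS.items():
        if path.startswith(prefix):
            return user.tenant == tenant
    # Paths with no registered owner are denied by default.
    return False
```

With this table, a user in Division A can read files under `/data/division_a/` but is refused anything under the other two divisions or outside the registered prefixes.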

All of these things have to be considered when examining security on a distributed system, says Christian, who adds that Zettaset built its Hadoop security offering from the ground up. The offering was released last month, and he says the company’s approach was to build from scratch a framework that extends over a cluster and allows admins to isolate the cluster into segments.

“It’s our own proprietary mechanism which basically runs on the name node,” he explains. “It basically handles control as a traffic cop. We had to tie it into Active Directory, and into LDAP as well so that people could basically manage the security of the entire cluster through the way that they were managing security across the network – making things easier in that regard.”

As part of their response to the challenge, Christian says they’ve created a highly-available distributed key server in case of failure. However, he says, the challenge is seemingly never-ending. “The challenge becomes how do you do the syncs securely because you can’t have one machine syncing keys to another machine, so now you have to create SSL between all the nodes in the cluster – it goes on and on. This is what I mean when I say that when you have distributed systems, the complexity is much different than trying to secure one single machine.”

Christian says that Zettaset has plans to release an encryption offering later this year that is completely different than what has been done to date – with more planned after that. “There are a lot of things that I think we can do around distributed security that makes a lot of sense – data leak prevention being one of them.”

In the meantime, the clock is ticking on the industry to step up to the plate and get this right. With government regulations looming, and enterprises tapping their collective foot, it will be interesting to see the myriad approaches that start to filter in during this second half of 2013.
