In the most recent conversations about Hadoop, the attractive part of the story—finding more use cases for the platform in large-scale, mission-critical enterprise settings—is easy to tell. However, part of what enables that story involves...well, something of a less sexy angle to the tale.
We’ve reached the phase in Hadoop’s evolution where some of the discovery and wonder has given way to some rather dry, albeit demanding, details. In other words, Hadoop is all grown up now—or at least it’s arrived at the human equivalent of, say, getting an MBA without completely knowing how much it will all pay off in the end.
While the issues that the post-adolescent platform is now growing into might not be as interesting as the wild youthful days of experimentation, they are core to the long-term viability and future growth of the still-evolving space. To bring back the human metaphor, Hadoop is still struggling to take the book learning of that MBA into the real world—although it’s starting to piece things together enough to find its way into the middle management level at a Fortune 500.
This week every vendor in the distro game, including newcomer Intel, had commentary and release info on tweaks meant to boost Hadoop’s readiness for the world of big business. The goal at each company (and within the Apache community itself) is to make the platform more enterprise-robust and compliance-ready.
One of the chapters in Hadoop’s higher learning tome that is applicable to actual business environments revolves around the (again, rather unsexy) world of data governance. Throw in a little disaster recovery and compliance and you have all the makings of a boring (apologies to those of you who are over-the-moon excited about governance), but ultimately useful resource.
According to distro giant Cloudera, which made its own moves this week to enhance the platform’s enterprise viability, Hadoop’s graduation is no longer a distant goal—their pet elephant has already been working hard at several companies in mission-critical roles.
In the opinion of Cloudera’s Charles Zedlewski, until recently the missing pieces around data governance were keeping the platform out of regulated and highly policy-driven industries. When it comes to data governance, he notes that this “is a bigger issue now than ever for Hadoop, in part because these big systems are holding many disparate datasets. So what you had to do before was segregate all that data, which goes against the real value of Hadoop, which is meant to consolidate these.”
Further, most large businesses in regulated industries have extensive reporting, auditing and compliance requirements that Cloudera says hadn’t been tackled before its string of releases this week. Beyond that, on a practical operations level, most of the large-scale enterprise users the company is targeting have stringent business continuity requirements, which make disaster recovery a must.
To this end, the company announced updates to its core components, including CDH, which now supports rolling upgrades (part of that continuity story), and Impala, which has been continuously improved on the performance side to help certain industries deliver on their SLAs around data processing speeds.
At the heart of their enterprise-focused announcements, though, is that larger concept of data governance—something that, in addition to disaster recovery and rolling upgrades, provides what Cloudera says are industry firsts (although MapR would take issue with that—more on that next week). Through the new Cloudera Navigator, which is directed at governance, Zedlewski says large companies can manage all the data in their cluster with stringent auditability and access management capabilities.
“We’ve been improving security for some time,” he admits. “But the audit and access stories really haven’t been that strong.” He says that in addition to continuing to boost Impala (which is slated for the GA books around April), this marks another major area of investment. It’s all about getting users to trust more workloads to Hadoop—something that has heretofore been off-limits for regulated and policy-heavy industries like financial services and life sciences.
Zedlewski pointed to one recent example of these needs in customer Monsanto, whose requirements were part of the impetus behind the move to extend these data governance and compliance features.
The agriculture behemoth was using Hadoop to store and analyze genomic information for their seed division in an effort to identify seeds resistant to drought, pests and diseases. However, due to their strict internal policies on governing that important data, they had particular ways the info needed to be handled, which put a damper on their Hadoop plans before Cloudera was able to step in with some solutions. Much of the work for Monsanto found its way into the foundations for Navigator, according to Zedlewski.
For other businesses, however, it’s more than just a matter of remaining compliant with laws or internal regulations. Some have continuity requirements that govern disaster recovery, a matter Cloudera addressed with its BDR offering, an automated disaster recovery system built into its Hadoop platform.
What’s worth noting here is that many large-scale customers already have sophisticated compliance and governance systems in place. Zedlewski says Cloudera has no interest in replacing these (for example, Oracle’s DataGuard or Audit Vault), nor does it want to take on integrators (like Informatica) with its Navigator or CDH products. For Cloudera, not to mention its competitors, including MapR, Hortonworks, and even Intel and its long list of partners, it’s about the platform rather than taking over the entire datacenter’s management…at least for now.
While some conversations I had this week with actual users and developers I met at Strata led me to believe that Hadoop is far from enterprise-ready and is still in the experimental phase at most companies, these types of core fixes are going a long way toward extending (if not its capability) its reputation.