The evolution of Hadoop from an overflow parking lot for data into a field of analytic dreams is unfolding right before our eyes. Among the vendors trying to help the elephant along is Cloudera, which used the Strata + Hadoop World conference this week to lay out its plans to remake Hadoop as a centralized "data hub" for enterprises. The firm also launched betas for its Hadoop 2 distributions, a partnership with the company behind Apache Spark, and a new cloud program for partners.
Cloudera says its Enterprise Data Hub will provide "one place to store and work with all data." No longer will businesses have different infrastructures and systems to support operational data warehouses and backup repositories on the one hand, and specialty massively parallel databases and content management systems on the other. Instead, these various databases and file structures can be provisioned to live entirely within Cloudera Enterprise, the company's commercial distribution of Hadoop.
Cloudera says the new Enterprise Data Hub approach will give users "the flexibility to run a variety of enterprise workloads, including batch processing, interactive SQL, enterprise search and advanced analytics," all on one infrastructure. Keeping it all centralized will not only eliminate redundancy, but it will also help in the areas of integration, business continuity, security, and governance, the company says.
In practical terms, the Enterprise Data Hub strategy will mean changes in Cloudera's product packaging and pricing strategies. It's a much more enterprise-focused approach, and shows the company's desire to remake Hadoop into an uber-data machine to fulfill all of your wildest data fantasies (in a responsible and fully managed manner, of course).
"Over the last five years, we have worked closely with enterprises around the world to help them capture the value in the data they have," says Mike Olson, chairman and chief strategy officer for Cloudera. "Resoundingly, they have asked for a more secure, more reliable real-time data platform that streamlines their existing architectures and speeds up time to insight."
The company says that its Enterprise Data Hub product (yes, it's actual software in addition to a strategy) is currently available, but only for Cloudera Enterprise 4. The Cloudera Enterprise line of software, you will remember, is the enhanced version of Hadoop that includes the core CDH (Cloudera's Distribution including Hadoop) software, in addition to Cloudera Manager, a data management and monitoring layer that sits atop Hadoop.
But Cloudera also sells other add-on products, including versions of the HBase NoSQL database and Impala SQL capabilities, as well as search and backup and recovery capabilities. With the forthcoming launch of Cloudera Enterprise 5, all of these capabilities will be delivered as part of its Enterprise Data Hub offering.
Cloudera's partners will also get a seat on the hub, which gives you a better idea of where Cloudera is taking this thing. With the version of Cloudera Manager that it will ship with Cloudera Enterprise 5, the company is delivering the capability for customers to deploy, manage, and monitor third-party products from the Cloudera interface. SAS, Revolution Analytics, Syncsort, and Informatica have sidled up to the big data hub so far. It will be interesting to see what other vendors commit as we get closer to Cloudera Enterprise 5 GA in early 2014, and how the Enterprise Data Hub strategy will evolve in general.
Innovating on Hadoop
Cloudera Enterprise 5 and CDH 5 are both based on Hadoop 2, which became GA earlier this month. The inclusion of the YARN scheduler is the biggest improvement with that release, as it provides the capability to run MapReduce, HBase, and other workloads simultaneously without having to worry about the various jobs competing for hardware resources.
The software company has done some innovating of its own with Cloudera Enterprise 5, notably a data-tiering function that enables users to "pin" data from HDFS into memory, which the company says will boost MapReduce data processing performance and Cloudera Impala's analytic query response times.
The new version also brings support for user-defined functions (UDFs). UDFs give users a repeatable way to run custom query logic, written in Java or other languages, using Impala. Cloudera says it is also making the MADlib library of pre-built statistical and analytic functions available so users can perform in-database analytics.
Cloudera has also added new snapshot capabilities to protect data in HDFS and HBase, and native support for NFS version 3, which will improve data sharing. New role-based security features built on Apache Sentry have been added, and Cloudera Navigator gets new data lineage and auditing capabilities.
Sparks and Clouds
The Palo Alto, California, company launched two other programs at Strata + Hadoop World: Cloudera Connect: Innovators and Cloudera Connect: Cloud.
The Cloudera Connect: Innovators program's charter partner is Databricks, which was spun out of AMPLab at the University of California, Berkeley, and is the company behind the Apache Spark incubator project.
Spark is a Hadoop engine that aims to deliver faster analytics than MapReduce. Project backers claim it can process some types of analytic queries up to 100 times faster than MapReduce. Its use of primitives and in-memory computing reportedly makes it particularly useful for running machine learning algorithms.
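To see why in-memory computing matters for machine learning in particular, consider that many such algorithms make repeated passes over the same data. The sketch below is a toy illustration in plain Python, not the Spark API; the names load_dataset, fit_slope, and LOAD_COUNT are hypothetical, and the "expensive read" is simulated with a counter rather than real disk or HDFS access. It simply counts how often the input is read when an iterative fit reloads it on every pass versus reading it once and keeping it in memory.

```python
# Toy illustration only: plain Python, not the Spark API.
# load_dataset, fit_slope and LOAD_COUNT are hypothetical names.

LOAD_COUNT = 0


def load_dataset():
    """Stand-in for an expensive read from disk (for example, from HDFS)."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    # Five points that lie exactly on the line y = 3x.
    return [(float(x), 3.0 * x) for x in range(1, 6)]


def fit_slope(get_data, iterations=50, learning_rate=0.01):
    """Fit y = w * x by gradient descent, requesting the data on every pass."""
    w = 0.0
    for _ in range(iterations):
        data = get_data()
        gradient = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * gradient
    return w


# Disk-style processing: every iteration re-reads the input.
LOAD_COUNT = 0
w_reload = fit_slope(load_dataset)
loads_when_reloading = LOAD_COUNT

# Memory-style processing: read once, then reuse the in-memory copy.
LOAD_COUNT = 0
cached = load_dataset()
w_cached = fit_slope(lambda: cached)
loads_when_cached = LOAD_COUNT

print("reads when reloading each pass:", loads_when_reloading)
print("reads when cached in memory:", loads_when_cached)
```

Both runs arrive at the same slope; the only difference is how many times the input has to be fetched. With a real cluster and real data, each of those fetches would be the costly step, which is the intuition behind the speedups Spark's backers describe.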
The goal of the Cloudera Connect: Innovators program is to enable customers to tap into promising new technologies that aren't being offered by Cloudera, says Charles Zedlewski, vice president of products at Cloudera. "Apache Spark is a prime example," he says. "It provides excellent data processing functionality and performance and has a vibrant developer and user community."
Meanwhile, the Cloudera Connect: Cloud program provides a way for Cloudera partners to (you guessed it) run Cloudera's Hadoop software in the cloud. The four charter members are Verizon Enterprise Solutions, Savvis (a CenturyLink company), SoftLayer (an IBM company), and T-Systems.
The company also announced plans to support its software on the Amazon Web Services (AWS) cloud "in the near future." It also plans to support private cloud deployments through OpenStack and VMware. "The company's long-term vision is to embrace a hybrid model, where the Cloudera stack can operate transparently between on-premises, private cloud and public cloud deployments," Cloudera said in a statement.