Hadoop, with its impressive scalability and predictive analytics capabilities, has made a name for itself in the big data world, and Cloudera, one of the leading vendors behind that rise, is hoping to extend the open source framework's reach.
Doug Cutting, Hadoop's creator, will give the keynote address at the Hadoop World convention in New York City in October, where we will be on hand to report live. This week, in an interview, he gave a preview of the state of Hadoop.
Though Cutting works for Cloudera, so named because its founder believed it was the era of the cloud, he noted that Hadoop is facilitating a move away from the cloud for those serious about their big data. According to Cutting, it is expensive to host a lot of data in the cloud and inefficient to host some data in the cloud and some locally. As a result, many companies, especially those that constantly need access to their data, simply purchase their own Hadoop cluster.
“At Cloudera,” Cutting said, “the vast majority of our customers are self-hosting. I think a lot of folks start out doing things hosted in the cloud, but as their clusters get very big and they’re starting to use them a lot, they will bring them in-house. If you’re using your cluster 24/7, then it’s really expensive to keep it in the cloud compared to hosting it yourself.”
Companies’ integrative attitude toward Hadoop has helped make it a standard. Instead of trying to build their own Hadoop-like systems to compete with it, companies like Microsoft and Oracle incorporated Hadoop into their existing infrastructures and built on it themselves. In Cutting’s mind, this has created an open source Hadoop community that has been integral to Hadoop’s continued improvement.
“I didn’t expect Oracle and Microsoft to so quickly and readily adopt Hadoop as a component for their platform. I thought maybe they would develop their own proprietary technology and compete. They instead elected to join in and work with the open source community in building out Hadoop.”
As a result of this open source community, Hadoop is becoming more and more compatible with other systems. This stands to reason: as more people work on Hadoop, it is ported to more platforms and made accessible from more languages. “Compatibility, compatibility of data, compatibility of network protocols are all areas that we’re improving in and we need to keep improving.” This compatibility should see the number of Hadoop-based projects grow going forward, a goal Cutting is focused on for the near future.
Eventually, Cutting would also like to see Hadoop bridge the gap between big data and fast data. It is already renowned for its batch-processing model, which allows it to scale to petabytes of data and even perform a measure of predictive analytics. “Hadoop started as a batch-processing system able to economically store petabytes and process them in ways that you couldn’t before – really get through datasets that large.”
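The batch model Cutting describes boils down to MapReduce: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step aggregates each group. As a rough illustration (not taken from the interview), here is a minimal word-count sketch in the style of a Hadoop Streaming job, with the shuffle simulated in memory rather than run on a cluster:

```python
import itertools

def mapper(line):
    """Emit (word, 1) pairs, as a Hadoop Streaming mapper would write to stdout."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Sum the counts for one key, as a reducer sees them after the shuffle."""
    return word, sum(counts)

def run_job(lines):
    """Simulate map -> shuffle/sort -> reduce for a tiny in-memory 'dataset'."""
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in itertools.groupby(pairs, key=lambda kv: kv[0])
    )

counts = run_job(["big data big clusters", "data at petabyte scale"])
```

On a real cluster the same mapper and reducer logic would run in parallel across many machines, which is what lets the batch model "get through datasets that large" at petabyte scale.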
However, Hadoop is not exactly the standard when it comes to processing those petabytes quickly. That said, a significant number of people work on and run Hadoop. Since they store and analyze a great deal of their data on Hadoop anyway, it would make sense to integrate flash storage or similar technology into the system.
“I think the challenge is to see if we can meet that promise and really provide the sort of holy grail of computing, something which is scalable to arbitrarily large numbers of computers, arbitrary sizes of data sets, and arbitrary latencies. We can make it as fast or faster by adding more resources and giving the transactional consistency along with the scalability.”
While Cutting is optimistic about Hadoop's speed improving, for Hadoop to become a force in fast data its speed will need to match its scalability. Still, Hadoop has come a long way since its beginnings as a Yahoo side project under Cutting, and it should not be counted out.