July 31, 2012

Tallying Hadoop’s Hard Numbers

Datanami Staff

Hadoop is emerging as not only one of the premier big data infrastructures but also a relatively cheap one for some users. With so many distributions and a still-emerging ecosystem, it's no simple task to get a handle on how many users are floating the elephant, but some distro vendors are pointing to the numbers they can pin down.

In a recent discussion, Cloudera's VP of product management, Charles Zedlewski (who of course has a clear incentive to emphasize Hadoop's cheapness factor) talked about how well Hadoop is faring in the big data marketplace and put that success in the context of the platform's much-hyped processing and analytic prowess.

According to Zedlewski, Hadoop is starting to outclass its competition in the big data processing realm. "The number of hours it takes," he says, "to pre-process data before they can make use out of it has grown from four hours to five, six. In many cases, the amount of time is exceeding 24 hours." According to Bertolucci, who conducted the interview, Hadoop does significantly better. Of course, Hadoop is still relatively far from achieving results in real time, since that is not really what it was designed for (even though several Hadoop advocates, as opposed to users, say otherwise), but that was always going to be a difficult task when terabytes of data are involved.

Or petabytes, for that matter, a scale Hadoop is reportedly now working with. "With Hadoop," writes Bertolucci, "it's possible to store, and actually ask questions of, 100 petabytes of data." Zedlewski adds, "That's something that was never before possible, and is arguably at least 10 times more scalable than the next best alternative."

It is no secret that Hadoop is well respected in the big data storage and analysis market. Even the United States government is using it to determine the locations of terrorist land mines and to fix military helicopters proactively. However, what may come as a surprise is its relatively low cost. "The cost of a Hadoop data management system," writes Bertolucci, "including hardware, software, and other expenses, comes to about $1,000 a terabyte, about one-fifth to one-twentieth the cost of other data management technologies."

While Hadoop is probably not yet as cheap as it will get, it is already significantly cheaper than the alternatives. According to Zedlewski, the cost per terabyte for competing technologies can run from $5,000 to $15,000. "If you look at databases," he says, "data marts, data warehouses, and the hardware that supports them, it's not uncommon to talk about numbers more like $10,000 or $15,000 a terabyte."
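The cost gap Zedlewski describes is easy to put in concrete terms. A back-of-the-envelope sketch, using only the per-terabyte figures quoted in the article (the 100 TB deployment size is a hypothetical, not a number from the interview):

```python
# Per-terabyte costs as quoted in the article (illustrative, not vendor quotes).
HADOOP_COST_PER_TB = 1_000                 # ~$1,000/TB, per Bertolucci
WAREHOUSE_COSTS_PER_TB = [10_000, 15_000]  # Zedlewski's data warehouse range

capacity_tb = 100  # hypothetical deployment size

# Total spend at each price point for the same capacity.
hadoop_total = HADOOP_COST_PER_TB * capacity_tb
warehouse_totals = [cost * capacity_tb for cost in WAREHOUSE_COSTS_PER_TB]

print(f"Hadoop at 100 TB: ${hadoop_total:,}")
for cost, total in zip(WAREHOUSE_COSTS_PER_TB, warehouse_totals):
    # Ratio shows how many times more the warehouse costs than Hadoop.
    print(f"Warehouse at ${cost:,}/TB: ${total:,} ({cost // HADOOP_COST_PER_TB}x Hadoop)")
```

At the quoted prices, the same 100 TB costs $100,000 on Hadoop versus $1 million to $1.5 million on a traditional warehouse, which is where the "one-fifth to one-twentieth" framing comes from.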

Hadoop's best advantage, according to Bertolucci, may be its malleability. It can store and process a variety of data types, including documents, images, and videos. Zedlewski echoes this sentiment, saying, "This is probably the single greatest reason why people are attracted to the system." On top of that, Zedlewski notes, businesses do not have to do away with their existing BI infrastructure to use Hadoop. "It's common that Hadoop is used in conjunction with databases. In the Hadoop world, databases don't go away. They just play a different role than Hadoop does."
