Does Hadoop Need a Reality Check?
Hadoop garners a lot of attention when it comes to big data, to the point where “Hadoop” and “big data” are practically synonymous in many people’s minds. But by all accounts, few companies outside of the Fortune 1000 are using Hadoop directly, and despite the attention it receives, Hadoop is driving little actual revenue.
Hadoop was first conceived at Yahoo as a distributed file system (HDFS) and a processing framework (MapReduce) for indexing the Internet. It worked so well that other Internet firms in Silicon Valley started using the open source software too. Before long, banks and retailers discovered that the free software provided a good platform for other compute- and data-intensive tasks, such as detecting fraudulent transactions and making targeted purchase recommendations.
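The storage/processing split at Hadoop's core can be illustrated with a toy sketch of the MapReduce model. This is plain, single-machine Python standing in for Hadoop's actual Java API; the `map_fn` and `reduce_fn` names are illustrative, not Hadoop interfaces, and the shuffle that Hadoop performs across a cluster is simulated with an in-memory sort:

```python
from itertools import groupby
from operator import itemgetter

# Toy MapReduce word count. On a real cluster, Hadoop reads input splits
# from HDFS, runs map tasks in parallel, shuffles by key, then reduces;
# here every phase runs locally in one process.

def map_fn(line):
    # Map phase: emit a (word, 1) pair for every word in an input line.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce phase: sum the counts collected for one key.
    return (word, sum(counts))

def mapreduce(lines):
    # "Shuffle/sort": group mapper output by key before reducing.
    pairs = sorted(kv for line in lines for kv in map_fn(line))
    return dict(
        reduce_fn(word, [c for _, c in group])
        for word, group in groupby(pairs, key=itemgetter(0))
    )

print(mapreduce(["the quick brown fox", "the lazy dog", "the fox"]))
# {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

The appeal for banks and retailers was that the same two-phase pattern, spread over commodity machines, scales to fraud detection and recommendation workloads far larger than one database server could handle.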
Apache Hadoop, by all accounts, has been a huge success on the open source front. Thousands of people have contributed to the codebase at the Apache Software Foundation, and the Hadoop project has spawned dozens of healthy Apache projects, including Hive, Impala, Spark, HBase, Cassandra, Pig, Tez, Ambari, and Mahout. Apart from the Apache Web Server, the Apache Hadoop family is probably the ASF’s most successful effort ever.
But commercially, Hadoop just isn’t there yet. According to Wikibon’s latest market analysis, spending on Hadoop software and subscriptions accounted for a mere $187 million in 2014, or less than 1 percent of $27.4 billion in overall big data spending. Wikibon expects Hadoop spending on software and subscriptions to grow to $677 million by 2017, when the overall big data market will have grown to $50 billion. That’s just over 1 percent, and if you include professional services, it more than doubles to about 3 percent.
None of the three pure-play Hadoop distributors (Cloudera, Hortonworks, and MapR Technologies) has yet turned a profit. Cloudera, which got a head start in developing a commercial Hadoop distribution, brought in revenues of about $91 million during the 2014 calendar year, according to the analysis by Wikibon’s Jeff Kelly (although Cloudera says it topped $100 million for its FY2015, which ended in January). That’s roughly the same revenue that Hortonworks ($43 million) and MapR ($42 million) combined brought in for calendar 2014, according to the latest update to Wikibon’s Big Data Vendor Revenue and Market Forecast 2011-2020 report.
Cloudera says it had 535 unique customers at the end of its fiscal year, while Hortonworks says it had about 332; MapR, which leads them both, says it has 700. When you add the other Hadoop distributions, including those from IBM and Pivotal, and the firms running open source Apache Hadoop software, it’s estimated there are perhaps 2,000 Hadoop clusters in the world (those numbers courtesy of an October story by former EnterpriseTech editor Timothy Prickett Morgan).
This relatively slow adoption has given Hadoop’s competitors in the data warehouse space plenty of ammunition. In a recent Snowflake Computing survey of more than 300 data warehousing professionals, only 11 percent of respondents said they had a “big data pilot” in place (“big data” here can be taken as a proxy for Hadoop).
What’s more, Snowflake’s survey found that 91 percent of respondents report having concerns about Hadoop. “Only 12 percent have easy access to Hadoop expertise; in contrast, 93 percent have easy access to SQL expertise,” the survey says. (This, in a nutshell, is why the Hadoop vendors are so eager to get SQL warehouses like Impala, Hive, Hawq, Vortex, etc. running well on Hadoop.)
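The skills gap behind those numbers is easy to see in miniature: the kind of aggregation that takes a hand-written MapReduce job can be expressed in one line of SQL, which is the experience engines like Hive and Impala try to deliver over data on Hadoop. The sketch below uses Python's built-in sqlite3 purely as a stand-in for such an engine, and the `sales` table and its columns are invented for illustration:

```python
import sqlite3

# Why SQL-on-Hadoop engines matter: the GROUP BY query below is a
# one-liner any warehouse analyst can write, whereas expressing the same
# aggregation as a raw MapReduce job takes scarce Hadoop expertise.
# (In-memory SQLite here; Hive or Impala would run similar SQL over HDFS.)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

# Familiar SQL: total sales per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 150.0), ('west', 250.0)]
```

If the 93 percent of shops with SQL expertise can point that skill at Hadoop-resident data, the 12-percent Hadoop-expertise bottleneck matters far less, which is precisely the bet the vendors are making.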
Hadoop is often positioned as a replacement for traditional data warehouses. However, when Snowflake (which sells a hosted warehouse service) asked survey participants whether they consider Hadoop a replacement, 64 percent said Hadoop would be complementary to existing systems, while 32 percent said Hadoop would replace some existing systems, but not all. Only 4 percent said Hadoop would be a full replacement. (That’s probably close to the survey’s margin of error, which wasn’t disclosed.)
“This study found that big data initiatives and Hadoop have not diminished the importance of the data warehouse,” says Diane Hagglund, principal at Dimensional Research, which put together the study for Snowflake. “Data warehousing remains critically important and will not be replaced by Hadoop.”
The Data Warehousing Institute has found similarly low levels of actual adoption of Hadoop and other big data technologies like NoSQL databases. In its latest “Best Practices” report, which surveyed 450 individuals, TDWI found that 13 percent of respondents reported using a commercial Hadoop distribution. That’s up from the 10 percent who reported using Hadoop in early 2013.
While Hadoop has struggled to catch on beyond the Fortune 1000, there are indications that companies have begun to sour on big data projects in general. “The problem that developed in 2014 is too many employers have not been satisfied with the return on their sizable investments in big data initiatives,” David Foote, chief analyst and co-founder of Foote Partners, said last month.
“The dirty secret is that a significant majority of big-data projects aren’t producing any valuable, actionable results,” Michael Walker, a partner at Rose Business Technologies, told the Wall Street Journal in late 2014. The same story cited a report from Gartner that found “60% of big data projects will fail to go beyond piloting and experimentation and will be abandoned” through 2017.
The truth about Hadoop, and big data analytics in general, is that it’s not easy. In addition to data science skills, which are in notoriously short supply, organizations need the engineering skills to bring all the proper technologies to bear in the proper amounts. This is still cutting-edge stuff.
There is no doubt that some companies have gotten great results out of Hadoop and are using it to hammer petabytes of less-structured data into usable insights. But these success stories are largely confined to the biggest firms in their respective industries or to well-funded startups looking to leverage new Internet business models to disrupt existing industries. By and large, Hadoop hasn’t trickled down into the marketplace as a whole, at least not yet.
At the current rate of growth, sales of core Hadoop software will never live up to their lofty expectations. Looking at it from an accountant’s point of view, you would have a tough time justifying the $1.5 billion in venture investments that have been made in Hortonworks and Cloudera alone.
The Hadoop infrastructure software racket will probably never pan out. But what Cloudera and the other distributors clearly hope is that a bigger ecosystem of analytic application software will grow out of the foundation that “core Hadoop” (i.e., HDFS, MapReduce, YARN, Tez, etc.) is laying.
The distributors hope that application vendors, like Platfora and Tableau, and fast new frameworks, like Apache Spark, can abstract away the complexity and allow organizations to use big data analytic systems without needing data scientists or brilliant architects of their own. They hope that predictive analytics via machine learning becomes a requirement, and that Hadoop-powered analytics get built into and integrated with other offerings. This is why Cloudera chief strategy officer Mike Olson said last year that he hopes Hadoop will eventually “disappear.”
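To see what "abstracting away the complexity" looks like in practice, compare a raw map/shuffle/reduce job with the chained-transformation style that frameworks like Spark popularized. The tiny `Pipeline` class below is invented for illustration and is not Spark's actual RDD or DataFrame API; it just shows how the boilerplate collapses into a couple of chained calls:

```python
from collections import Counter

class Pipeline:
    # Toy imitation of the fluent, chained style that higher-level
    # frameworks such as Spark popularized; not a real Spark API.
    def __init__(self, data):
        self.data = list(data)

    def flat_map(self, fn):
        # Apply fn to each item and flatten the results into one stream.
        return Pipeline(x for item in self.data for x in fn(item))

    def count_by_value(self):
        # Tally occurrences of each distinct item.
        return dict(Counter(self.data))

lines = ["the quick brown fox", "the fox"]
counts = Pipeline(lines).flat_map(str.split).count_by_value()
print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 2}
```

A two-line pipeline like this, rather than hand-rolled map and reduce classes, is the experience the ecosystem is betting will pull Hadoop-style analytics beyond the Fortune 1000.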
Hadoop represents a marked shift from previous architectures and holds a lot of promise as a way to analyze huge amounts of data. However, it remains a difficult system to program and manage, which is why adoption has been so slow. The ecosystem is working hard to burn the complexity out of the system and make big data analytics accessible to a wider audience; hopefully it will succeed before the bankers and users give up.