Hosted Hadoop Gets Traction
The rise of big data and the promise of data analytics are putting unexpected kinks in the day-to-day pipeline of IT operations. While the phenomenon is driving many IT professionals painfully far outside of their comfort zones, it’s a blessing for the skilled operators of Hadoop clouds, who are starting to grow rather quickly.
In some circles, the very thought of outsourcing the source of a competitive advantage, such as a Hadoop-based analytics application that’s turning unstructured data into well-defined insight, would be viewed as an act of sheer lunacy. “The data won’t be safe!” the on-prem backers cry. “The secrets will be let out! We’ll lose our edge!”
But providers of hosted Hadoop report that these common cloud objections are melting away as the power of simple economics takes over. “Cloud is all about economies of scale,” said Ashish Thusoo, the CEO and co-founder of Hadoop cloud provider Qubole. “The larger the cloud becomes, the more economies can be passed on to the end users.”
Nobody has demonstrated the power of cloud economics as plainly as Amazon Web Services, which is clearly the big dog in the hosted big data solutions market. The company, which isn’t usually that forthcoming with details about the size of its business, let it slip in 2012 that it has more than 1 million Hadoop clusters running on Elastic MapReduce (EMR). Today, EMR is widely considered to be the biggest Hadoop provider on the planet; it likely has more Hadoop customers than all the third-party distros combined.
While Amazon EMR provides a robust Hadoop platform, it still requires users to be experts in managing Hadoop. Prospective Hadoop users who wish for a little more handholding from their vendors have several Hadoop as a service (HaaS) companies to pick from, such as Qubole.
Founded just two-and-a-half years ago, Qubole now has more than 30 customers running on Hadoop nodes that it manages, including prominent Web-based outfits like Pinterest and Quora, and a handful of online advertising specialists. Qubole, which maintains dual headquarters in Silicon Valley and Bangalore, runs its clusters on AWS and Google Cloud, and exposes a high-level Web interface for users to query their Hadoop clusters via Hive, Pig, Presto, and MapReduce.
“We process around 25PB of data every month,” Thusoo told Datanami at last week’s Hadoop Summit 2014, in San Jose, California, which attracted more than 3,200 attendees and nearly 100 vendors. “We cycle through around 230,000 nodes in the cloud in a month. The largest clusters that we brought up are on the order of 1,500 to 2,000 node clusters.”
Thusoo, a former Facebook data center manager who created Apache Hive with his Qubole co-founder, Joydeep Sen Sarma, says Qubole attracted mostly small and midsized businesses when it started running in production in January 2013. But now the company is attracting bigger enterprises with its HaaS offering, which starts at $1,250 per month for 5,000 compute hours and goes up from there.
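For a rough sense of what that entry price implies, the arithmetic can be sketched as follows. This is a back-of-the-envelope calculation only: it assumes all 5,000 included compute hours are actually used and ignores any underlying cloud infrastructure charges, which are not detailed here. The variable names are illustrative.

```python
# Entry-level plan figures quoted above.
monthly_price_usd = 1250
included_compute_hours = 5000

# Effective rate, assuming every included hour is actually used.
rate_per_hour = monthly_price_usd / included_compute_hours
print(f"${rate_per_hour:.2f} per compute hour")  # $0.25 per compute hour
```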
“If you look at the traditional model, you spend months just to get a cluster up and running, before you can actually start getting an ROI on the data. The cloud model completely disrupts that,” Thusoo said. “You can get the compute on demand and you can start using it right away.”
There is a well-publicized shortage of top-level data scientists who can make the most of data. But fewer people are aware there’s also a shortage of IT professionals who know how to build and manage a Hadoop cluster. “Managing a Hadoop cluster is not straightforward,” Thusoo said. “It’s not a platform that DBAs have been trained to deal with.”
Another HaaS vendor gaining traction is Altiscale, founded by Raymie Stata, the former CTO of Yahoo, which at one point ran a 40,000-node mega cluster. Stata saw a big gap between what was possible on Yahoo’s clusters and what customers were trying to put together on their own, so he and his partners launched a “white glove” Hadoop service in January.
Last week at Hadoop Summit, Altiscale announced that it’s now supporting Hive 0.13. The San Francisco, California-based company claims it’s the only HaaS provider to support the latest release of Hive, which features a big SQL performance increase over previous versions as a result of the Hortonworks-backed Stinger project. “Hive 0.13 is a key step toward real-time, interactive, and in-memory processing,” said Soam Acharya, head of applications architecture at Altiscale. “Customers should not be held back by their HaaS vendor in utilizing this capability.”
In the meantime, Altiscale continues to grow its business, which currently supports more than a dozen customers utilizing about 90TB of data. Customers are coming to Altiscale to get away from the hassles of managing Hadoop, Stata said. “Upgrading Hadoop is non-trivial,” he told Datanami. “They’re tired of upgrades.”
Another HaaS vendor making waves at Hadoop Summit is MetaScale, which is owned by Sears Holdings. The company is finding traction not only with HaaS, but with a series of Hadoop and NoSQL appliances that it unveiled in February. Customers can run the appliances on-premise, and leave the monitoring and management to MetaScale, which remotes into the boxes from its headquarters near Chicago, Illinois.
At last week’s show, MetaScale launched its new “Ready-to-Go Reports” program, which is designed to help clients analyze large amounts of social data. Ankur Gupta, MetaScale’s general manager, says the new offering is designed to help customers get their feet wet with Hadoop and big data analytics without breaking the bank. “Our Ready-to-Go Reports are a cost-effective solution for companies that may still be seeking to determine the real value of Hadoop and big data analytics at their firm,” Gupta said.
Hadoop leaders say 2014 is the year Hadoop clusters will go from being mere science projects to being production-ready analytic systems. As the Hadoop market grows, the HaaS sector will move with it, benefiting not only the three vendors mentioned here, but untold new HaaS vendors that will enter the market in the coming years.
People are getting more comfortable with the cloud in general, and that’s going to make the economic argument for HaaS more compelling. Until recently, the break-even point for running Hadoop in the cloud was about 2,500 nodes, and anything larger than that was cheaper to run on-premise, Qubole’s Thusoo said. “Traditionally there was a tradeoff,” he said. “Now I think that point has moved maybe to 20,000 nodes.”
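One way to picture how such a break-even point can shift is a simple linear cost model, sketched below. Everything in it is an assumption made for illustration: the model itself (a fixed on-premise cost plus a per-node cost, against a pure per-node cloud cost), the function name `break_even_nodes`, and the dollar figures, which were picked only so the results land on the two thresholds Thusoo mentions. None of it reflects actual vendor pricing.

```python
import math

def break_even_nodes(onprem_fixed_cost, onprem_cost_per_node, cloud_cost_per_node):
    """Return the cluster size at which on-premise and cloud costs are equal.

    Hypothetical model (not from the article):
      on-premise cost = onprem_fixed_cost + onprem_cost_per_node * nodes
      cloud cost      = cloud_cost_per_node * nodes
    """
    per_node_saving = cloud_cost_per_node - onprem_cost_per_node
    if per_node_saving <= 0:
        # Cloud is no more expensive per node, so on-premise never catches up.
        return math.inf
    return onprem_fixed_cost / per_node_saving

# Hypothetical yearly figures in USD, chosen only for illustration.
fixed = 5_000_000
print(break_even_nodes(fixed, onprem_cost_per_node=3_000, cloud_cost_per_node=5_000))  # 2500.0
print(break_even_nodes(fixed, onprem_cost_per_node=3_000, cloud_cost_per_node=3_250))  # 20000.0
```

Under this toy model, the break-even point moves out as the cloud's per-node price approaches the on-premise per-node cost, which is the direction of the shift Thusoo describes.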