Hosted Hadoop Gets Traction
The rise of big data and the promise of data analytics are putting unexpected kinks in the day-to-day pipeline of IT operations. While the phenomenon is driving many IT professionals painfully far outside their comfort zones, it’s a blessing for the skilled operators of Hadoop clouds, whose businesses are starting to grow rather quickly.
In some circles, the very thought of outsourcing the source of a competitive advantage, such as a Hadoop-based analytics application that’s turning unstructured data into well-defined insight, would be viewed as an act of sheer lunacy. “The data won’t be safe!” the on-prem backers cry. “The secrets will be let out! We’ll lose our edge!”
But providers of hosted Hadoop report that these common cloud objections are melting away as the power of simple economics takes over. “Cloud is all about economies of scale,” said Ashish Thusoo, the CEO and co-founder of Hadoop cloud provider Qubole. “The larger the cloud becomes, the more economies can be passed on to the end users.”
Nobody has demonstrated the power of cloud economics as plainly as Amazon Web Services, which is clearly the big dog in the hosted big data solutions market. The company, which isn’t usually forthcoming with details about the size of its business, let it slip in 2012 that it had more than 1 million Hadoop clusters running on Elastic MapReduce (EMR). Today, EMR is widely considered to be the biggest Hadoop provider on the planet; it likely has more Hadoop customers than all the third-party distros combined.
While Amazon EMR provides a robust Hadoop platform, it still requires users to be experts in managing Hadoop. Prospective Hadoop users who wish for a little more handholding from their vendors have several Hadoop as a service (HaaS) companies to pick from, such as Qubole.
Founded just two-and-a-half years ago, Qubole now has more than 30 customers running on Hadoop nodes that it manages, including prominent Web-based outfits like Pinterest and Quora, and a handful of online advertising specialists. Qubole, which maintains dual headquarters in Silicon Valley and Bangalore, runs its clusters on AWS and Google Cloud, and exposes a high-level Web interface for users to query their Hadoop clusters via Hive, Pig, Presto, and MapReduce.
“We process around 25PB of data every month,” Thusoo told Datanami at last week’s Hadoop Summit 2014, in San Jose, California, which attracted more than 3,200 attendees and nearly 100 vendors. “We cycle through around 230,000 nodes in the cloud in a month. The largest clusters that we brought up are on the order of 1,500 to 2,000 node clusters.”
Thusoo–who led Facebook’s data infrastructure team and created Apache Hive with his Qubole co-founder, Joydeep Sen Sarma–says Qubole attracted mostly small and midsized businesses when it started running in production in January 2013. But now the company is attracting bigger enterprises with its HaaS offering, which starts at $1,250 per month for 5,000 compute hours and goes up from there.
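For a sense of scale, the entry tier works out to an effective rate of a quarter per compute hour. A minimal back-of-the-envelope sketch (the monthly fee and included hours come from the figures above; everything else is simple arithmetic, not vendor pricing):

```python
# Effective hourly rate of the entry HaaS tier described in the article.
# Only the $1,250/month and 5,000-hour figures are sourced; the per-hour
# rate is derived, not quoted vendor pricing.

monthly_fee = 1_250      # USD per month, entry tier
included_hours = 5_000   # compute hours included per month

rate = monthly_fee / included_hours
print(f"Effective rate: ${rate:.2f} per compute hour")  # → $0.25
```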
“If you look at the traditional model, you spend months just to get a cluster up and running, before you can actually start getting an ROI on the data. The cloud model completely disrupts that,” Thusoo said. “You can get the compute on demand and you can start using it right away.”
There is a well-publicized shortage of top-level data scientists who can make the most of data. But fewer people are aware there’s also a shortage of IT professionals who know how to build and manage a Hadoop cluster. “Managing a Hadoop cluster is not straightforward,” Thusoo said. “It’s not a platform that DBAs have been trained to deal with.”
Another HaaS vendor gaining traction is Altiscale, which was founded by Raymie Stata, the former CTO at Yahoo, a company that at one point ran a 40,000-node mega cluster. Stata saw a big gap between what was possible with Yahoo’s clusters and what customers were trying to put together on their own, so he and his partners launched a “white glove” Hadoop service in January.
Last week at Hadoop Summit, Altiscale announced that it’s now supporting Hive 0.13. The San Francisco-based company claims it’s the only HaaS provider to support the latest release of Hive, which features a big SQL performance increase over previous versions as a result of the Hortonworks-backed Stinger project. “Hive 0.13 is a key step toward real-time, interactive, and in-memory processing,” said Soam Acharya, head of applications architecture at Altiscale. “Customers should not be held back by their HaaS vendor in utilizing this capability.”
In the meantime, Altiscale continues to grow its business, which currently supports more than a dozen customers utilizing about 90TB of data. Customers are coming to Altiscale to get away from the hassles of managing Hadoop, Stata said. “Upgrading Hadoop is non-trivial,” he told Datanami. “They’re tired of upgrades.”
Another HaaS vendor making waves at Hadoop Summit is MetaScale, which is owned by Sears Holdings. The company is finding traction not only with HaaS, but with a series of Hadoop and NoSQL appliances that it unveiled in February. Customers can run the appliances on-premise, and leave the monitoring and management to MetaScale, which remotes into the boxes from its headquarters near Chicago, Illinois.
At last week’s show, MetaScale launched its new “Ready-to-Go Reports” program, which is designed to help clients analyze large amounts of social data. Ankur Gupta, MetaScale’s general manager, says the new offering is designed to help customers get their feet wet with Hadoop and big data analytics without breaking the bank. “Our Ready-to-Go Reports are a cost-effective solution for companies that may still be seeking to determine the real value of Hadoop and big data analytics at their firm,” Gupta said.
Hadoop leaders say 2014 is the year Hadoop clusters will go from being mere science projects to being production-ready analytic systems. As the Hadoop market grows, the HaaS sector will move with it, benefiting not only the three vendors mentioned here, but untold new HaaS vendors that will enter the market in the coming years.
People are getting more comfortable with the cloud in general, and that’s going to make the economic argument of HaaS more compelling. Until recently, the break-even point for running Hadoop in the cloud was about 2,500 nodes; anything larger was cheaper to run on-premise, Qubole’s Thusoo said. “Traditionally there was a tradeoff,” he says. “Now I think that point has moved maybe to 20,000 nodes.”
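The break-even intuition Thusoo describes can be sketched as a toy cost model: cloud costs scale linearly with node-hours, while an owned cluster carries a large fixed overhead (staff, facilities) plus a lower marginal per-node cost. All the dollar figures below are hypothetical placeholders chosen only so the curves cross, not data from any vendor quoted here:

```python
# Toy model of the cloud-vs-on-premise break-even point discussed above.
# Every constant is a hypothetical placeholder; real costs vary widely
# by workload, vendor, and staffing.

def cloud_cost(nodes: int, hourly_rate: float = 0.25, hours: int = 720) -> float:
    """Pay-as-you-go: monthly cost scales linearly with node-hours."""
    return nodes * hours * hourly_rate

def onprem_cost(nodes: int, per_node: float = 150.0,
                fixed_overhead: float = 50_000.0) -> float:
    """Owned cluster: large fixed monthly overhead plus a lower
    marginal cost per node once the hardware is sunk."""
    return fixed_overhead + nodes * per_node

def break_even(max_nodes: int = 30_000) -> int:
    """Smallest cluster size at which on-premise becomes cheaper."""
    for n in range(1, max_nodes + 1):
        if onprem_cost(n) < cloud_cost(n):
            return n
    return max_nodes

print(break_even())  # → 1667 with these placeholder constants
```

As the fixed overhead of on-premise operation rises (or the cloud rate falls), the crossover point moves out, which is the direction Thusoo says the market has already traveled.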