Yahoo: We Run the Whole Company on Hadoop
Hadoop is absolutely critical to the operations of Yahoo, executives with the company said this week at the Hadoop Summit. While the company, which spun out Hortonworks in 2011, is moving away from “traditional” Hadoop components like MapReduce in favor of YARN, Tez, and Hive, the Hadoop platform remains absolutely core to its operations.
There’s a stream of thought in the wider big data community that Hadoop is old hat, that Yahoo and Google, which developed the technology behind Hadoop nearly 10 years ago to run their own businesses, have moved on to something bigger and better, and that what Hadoop vendors are hawking today are the technological leftovers of a better brew.
Hogwash, says Yahoo’s senior vice president of platform personalization products Jay Rossiter, who’s responsible for ensuring that Yahoo’s 32,000-node Hadoop cluster, possibly the biggest in the world, is serving the various parts of its business.
“We run the whole company on it, top to bottom,” Rossiter told Datanami yesterday. “We run this for everything, from intrusion detection and anti-spam to our advertising products, our personalization products, to our reporting stuff…. Everything we do sits on this family of products. It’s massively important to our business.”
Yahoo’s business depends on its ability to harness large amounts of data, just as other large Web properties do. “Our whole ability to crunch data in different ways is part and parcel of what we do… It’s the whole game,” Rossiter says. “Everybody who does this runs their business on some variant. You take the parts and might piece them together differently. Everybody does it.”
Rossiter’s Yahoo colleague Sumeet Singh, the senior director of cloud and big data platforms, took it a step further and theorized that none of the Web giants of Silicon Valley would exist as they do today were it not for Hadoop.
“If you think of any of the new Internet companies like Facebook, Netflix, you name it… I’m not sure they would have been who they are today without Hadoop,” he says.
Senior members of Yahoo’s Hadoop team were on hand at the Hadoop Summit this week in San Jose, California, to discuss the close relationship between Yahoo and Hortonworks, which is hosting the show. “People don’t know about this,” Rossiter says of the partnership. “Early on they did. But now the industry has grown and the number of participants has grown, so we should tell our story.”
That story goes like this: Yahoo created Hadoop to solve its internal data management problems, and then released it into open source. In 2011, Yahoo became concerned with the fragmentation of the product, so it spun out Hortonworks as a separate company to drive the continued development of Hadoop in close collaboration with the open source community.
While Yahoo had no interest in productizing Hadoop, it had a vested interest in seeing the software develop and evolve. “The whole goal of [spinning out Hortonworks] was to get as many people focused on creating and developing this technology as possible, so we at Yahoo could use the technology to run our business,” Rossiter says.
Nearly three years later, that approach is yielding huge gains for Yahoo through the rapid development and hardening of Hadoop components like YARN, Tez, and Hive via the Stinger initiative. “It’s exactly what we dreamed would happen,” Rossiter says. “Our participation varies with some of these technologies but our benefit doesn’t.”
Yahoo and Hortonworks have maintained a close working relationship over the last three years, directly and through the Apache Hadoop project. Representatives with the companies meet regularly to collaborate on Hadoop, says Greg Pavlik, vice president of engineering at Hortonworks.
“As a part of this joint program we’ve maintained almost a virtual working group between the two teams,” Pavlik says. “YARN itself is a byproduct of the joint work of the two companies. It originally started actually at Yahoo before the spinout even occurred.”
Yahoo’s Hadoop cluster served as a test bed for YARN, which shipped with Hadoop 2 last October. When Yahoo went live with YARN in the first quarter of 2013, it allowed the company to shrink the size of its Hadoop cluster from 40,000 nodes to 32,000 nodes. But the number of jobs doubled to 26 million per month.
The fact that Yahoo’s data scientists are bringing huge workloads to bear on emerging Hadoop technologies helps the whole community. “We have very hard-hitting use cases that really try to drive this software very hard in many different dimensions,” Rossiter says. “We beat the hell out of it.”
Yahoo’s fingerprints are also on the big SQL speedup we’re seeing in Hive 0.13. Yahoo is now opening SQL access to other departments, including marketing and finance. That has increased the number of SQL queries from half a million to 2.5 million per month. Some custom reports that previously took Yahoo six months can now be completed in a much shorter amount of time with SQL.
“Hortonworks frankly wouldn’t be able to exist without this level of hardening,” Pavlik says. “I wouldn’t be able to stand behind the product, the technology, from a support perspective, without this kind of iterative refinement of the software over time. So it’s kind of part and parcel of how we do the engineering work, but frankly also how we built the business out.”