April 15, 2015

From Spiders to Elephants: The History of Hadoop

Alex Woodie

Have you ever wondered where this thing called Hadoop came from, or even why it’s here? Marko Bonaci has wondered such things, too. In fact, he wondered about them so much that he decided to write a History of Hadoop chapter for his upcoming book, “Spark in Action.”

Bonaci’s History of Hadoop starts humbly enough in 1997, when Doug Cutting sat down to write the first edition of the Lucene search engine. In 2000, Cutting placed Lucene into the open source realm with a SourceForge project; he would contribute it to the budding Apache Software Foundation a year later.

By the end of 2001, Cutting, a Stanford graduate, would join up with University of Washington grad student Mike Cafarella to create a “Web crawler” (or spider) program to index the World Wide Web. At that time, there were about 1.7 million websites (it’s currently approaching 1 billion sites).

Cutting and Cafarella dubbed this project Apache Nutch, and deployed a proof of concept of the indexer on a single machine with about 1GB of RAM and about 1TB of disk, Bonaci writes. Nutch ran pretty well on this setup, and could index about 100 web pages per second. However, while running Nutch on a single machine simplified programming, it also limited the number of pages that it could index to about 100 million.

Faced with this scalability limit, Cutting and Cafarella expanded Nutch to run across four machines. Expanding it beyond that would create too much complexity. But even that four-fold increase didn’t give them the processing bandwidth they needed, considering it’s estimated there were close to 1 billion individual web pages at the time.

Stumped, the developers weren’t sure how to proceed, until they stumbled across an obscure paper written by Google that described the hypothetical Google File System. “When they read the paper they were astonished,” Bonaci writes. “It contained blueprints for solving the very same problems they were struggling with. Having already been deep into the problem area, they used the paper as the specification and started implementing it in Java. It took them better part of 2004, but they did a remarkable job. After it was finished they named it Nutch Distributed File System (NDFS).”

NDFS, of course, would go on to become the Hadoop Distributed File System (HDFS), and Cutting and Cafarella would go on to create the first processing tool to do actual work, called MapReduce, in 2005. By 2006, Cutting would establish a new sub-project of Apache Lucene that combined NDFS and MapReduce. He named it “Hadoop” after his son’s yellow plush elephant toy.

Meanwhile, the Web giant Yahoo was having trouble scaling its search engine. The Yahoo engineers were C++ bigots, as Bonaci reports, yet they were eager to get the same type of scalability benefits that their rivals at Google were getting with the combination of the Google File System (GFS) and MapReduce, even though Hadoop was implemented in Java. So Yahoo hired Cutting to help it adopt Hadoop, which Bonaci says may have saved the company (and given it good reason for rejecting Microsoft’s $45-billion acquisition offer two years later).

In 2007, Silicon Valley’s budding social media giants Twitter, Facebook, and LinkedIn caught wind of Hadoop and started experimenting with the system. These companies would contribute many of their creations, such as Cassandra and Hive (Facebook), Kafka (LinkedIn), and Storm (Twitter), back to the open source community.

As Hadoop continued to spread through the Valley, a group of developers from Google, Yahoo, Facebook, and BerkeleyDB got together and founded the first Hadoop distributor, Cloudera, in 2008, which Cutting would join the following year. Yahoo, concerned that it was losing too much talent to startups, spun out its own Hadoop company, Hortonworks, in 2011. Hortonworks and Yahoo remain close partners today.

The final piece of Bonaci’s history of Hadoop has to do with the development of YARN. While the combination of HDFS and MapReduce was powerful, the rigidity and batch orientation of the setup were limiting Hadoop’s usefulness. Arun Murthy, one of Hortonworks’ co-founders, had identified this as a problem as far back as 2006, Bonaci says. But it wasn’t until 2012 that the Hadoop community would move strongly to separate MapReduce from the stack and put YARN in charge of things.

The History of Hadoop is not a complete history, of course. But Cutting himself commended Bonaci’s work, calling it “well told and accurate.” It’s interesting that Bonaci chose to write about Hadoop only after beginning his upcoming book on Apache Spark. The final history has yet to be written on either Spark or Hadoop, but that’s another topic for another day.

Related Items:

Does Hadoop Need a Reality Check?

Beyond the 3 Vs: Where Is Big Data Now?

‘What Is Big Data’ Question Finally Settled?

