Hadoop is almost a dirty word today, but back in 2011, it was the cutting edge of big data coolness. The story behind Hadoop is intriguing for many reasons, and instrumental in how we came to where we are today.
Hadoop (then called Nutch) was created in 2004 by Doug Cutting and Mike Cafarella as a Java-based implementation of the Google File System and MapReduce compute framework. The product came about to solve a very pressing problem: Yahoo’s index of the World Wide Web would no longer fit on a single computer, so the company needed a low-cost data storage and processing framework.
Cutting and Cafarella launched the open source Hadoop project in 2006, and soon the Silicon Valley web giants (who weren’t nearly as big back then) took notice. Facebook, Twitter, and LinkedIn adopted Hadoop to address their own burgeoning data requirements, and they made their own contributions to the technology, including Cassandra, Hive, Kafka, and Storm. The rest of the computing world wanted to get into the act, and Cloudera responded.
Cloudera was founded in 2008 and was the first Hadoop distributor. The company was starting to ramp up its business in November 2011 when it raised $40 million in venture capital, foreshadowing a mammoth $900 million investment that would occur three years later at the peak of the Hadoop heyday.
But Cloudera soon had competitors of its own as the Hadoop ecosystem started to grow. MapR Technologies, which was founded in 2009, closed a $20 million funding round in 2011 as its 40-some employees built a proprietary version of Hadoop that supported NFS in addition to HDFS. Then in June 2011, Cloudera got another competitor when Yahoo spun out Hortonworks. Staffed with about 20 engineers who worked on Yahoo’s Hadoop system, Hortonworks aimed to align itself more closely with the open source Apache Hadoop project than either Cloudera or MapR.
The vast majority of Hadoop deployments were on-prem back then, as the notion of the public cloud was still forming (remember, Amazon Web Services wasn’t created until 2006). But Amazon gave us a glimpse of what was to come with Elastic MapReduce (EMR), the hosted Hadoop service that it launched in 2009. While Cloudera and MapR were gaining customers in retail and financial services, AWS was boasting about its tech-heavy customer list, which sported names like Etsy, Foursquare, Clickstream, and Yelp.
While Hadoop is considered a legacy technology today, thanks to the rise of cloud-native architectures that separate compute and storage, Hadoop’s impact lives on. Whenever we talked with Cloudera or Hortonworks executives about Hadoop, there was always an elephant in the room: “What do you mean by Hadoop?” they would ask.
The question was meant to prompt a discussion about the differences between Hadoop proper (i.e., the Apache Hadoop project and the core technologies that compose it, such as YARN, HDFS, and MapReduce, as well as supporting technologies that usually had their own Apache projects, including Hive, HBase, and in later years Spark) and the broader Hadoop ecosystem.
While the core Hadoop technologies are no longer in the vanguard, many of the surrounding technologies that were considered “Hadoop,” not to mention the hundreds of software vendors that developed supporting tools in the Hadoop ecosystem (and which we covered closely at Datanami), live on.
Cutting was fond of saying that Hadoop was more of an “idea” that could evolve than a hard and fast set of technologies. Cutting was hailed, rightfully, as a technological visionary, even if he never seemed totally comfortable with all the attention that fame brought. While he may have miscalculated Hadoop’s potential to be an operating system for all workloads (including transaction processing), he always seemed to recognize that Hadoop’s day in the sun would eventually pass, and that the power of random “digital mutations” in the technological ether would eventually come up with something better that would take its place.
Cutting was right, and something did take its place at the center of big data: namely, cloud-native technologies, with Kubernetes replacing YARN as the workload orchestrator and S3-compatible object stores replacing HDFS for massive data storage. But many of the remaining cast of characters (or the zoo animals, as Hadoop’s early pioneers were fond of calling them) live on.