2013 – The Flourishing Open Source Ecosystem

By 2013, a big data wildfire was sweeping the computing world, and open source software was the fuel helping it to grow. Apache Hadoop, based on Google technology that Doug Cutting and Mike Cafarella re-created in Java and implemented at Yahoo, was among the first and most influential of the open source big data projects. But it was by no means the only game in town.

Yahoo and the other Silicon Valley Web giants soon learned that open sourcing their technology was not only a good way to increase the number of eyes for spotting bugs and hands for fixing them. It also acted as a catalyst for accelerating adoption of a given open source product, and was a growth strategy unto itself.

Facebook created a SQL-based data warehouse for Hadoop called Apache Hive, and then doubled down on analytical processing with Presto, which gave rise to several companies (Starburst, Ahana, Varada). Twitter gave us a stream processing system called Apache Storm (created at BackType, which Twitter acquired), then followed that up with Apache Heron. Yahoo gave us Apache Pig. LinkedIn gave us Apache Kafka, a next-gen message bus, which Confluent will soon be riding to an IPO. It also gave us Apache Samza, a stream processing system. Airbnb gave us Airflow, a data workflow orchestrator. Google gave us Google File System, Bigtable, Kubernetes, and Beam, and many other projects too.

Not all big data projects came from tech giants, of course. Apache Spark came from U.C. Berkeley’s AMPLab, while Apache Flink came out of three German universities via the Stratosphere project. The National Security Agency created the Apache Accumulo database, and donated it to the Apache Software Foundation in September 2011.

Many of these open source projects became components of the Hadoop distributions from companies like Cloudera and Hortonworks (which merged with Cloudera in 2019). There were so many open source projects in these distros that the Hortonworks product managers joked about the “asparagus charts.” The 30-plus open source projects included in a given Hadoop distro also gave rise to questions about “what is Hadoop.”

In the end, however, ensuring compatibility among the various versions of these open source projects became a major operational issue for companies like Hortonworks. As simplified cloud data lake offerings based on object stores and Kubernetes grew in popularity, Hadoop’s influence waned, and with it the zoo animals’ heyday ended.

Buoyed by the success of open source Hadoop software, entrepreneurs founded hundreds of companies to pursue the commercial open source business model. Arguably the most visible product category grabbing onto open source was the NoSQL database. MongoDB rode its popularity with developers to the top of the NoSQL heap with its open source document database. Open source search engines, like Elasticsearch, also soared in popularity, as did data science and machine learning development environments, like Anaconda and H2O.ai.

Open source remains a powerful force in the big data community, particularly around developer tools. However, when it comes to core infrastructure components, like databases and search engines, there’s been a retrenchment of interests, at least relative to the wide-open state of open source big data software back in 2013.

For instance, while Elastic, Confluent, and MongoDB all offer their products under open source licenses, they have moved in recent years to strengthen protections and prevent unauthorized usage. In many cases, this is being driven by Amazon Web Services, which has launched many of its own hosted offerings based on open source projects created by others. Many products that are not governed by the ASF are avoiding the liberal Apache 2.0 License.

Open source has proven its value to the IT industry as a whole many times over the years. While many data and AI startups today are eschewing the pure open source model in favor of holding some stuff back behind a proprietary curtain and just shipping the binaries, there’s no denying the awesome progress the industry made in a few wild years last decade. Hadoop may be almost a bad word now, but the industry wouldn’t be where it is today if it weren’t standing on the shoulders of open source Hadoop giants.
