
2016 – Clouds, Clouds Everywhere

We are living in a “cloud first” world today in 2021. But roll the clock back 10 years, and that definitely wasn’t the case. When did things really begin to move the cloud’s way? In our book, the shift occurred around 2016, give or take.

Why 2016? Well, obviously there was no single event that made it the big year. Like most large, population-scale migrations or movements, the change was the culmination of a huge number of small factors and individual decisions occurring over an extended period of time. The important milestones aren’t always recognized as they’re occurring, but the broad patterns and inflection points are more discernible in hindsight.

So in retrospect, the core pieces were in place in 2016 for the public cloud to come to the forefront. By that time, the Big Three were already well established to one degree or another: Amazon launched AWS in 2006, Google Cloud was founded in 2008, and Microsoft launched Azure in 2010. AWS was by far the biggest cloud, a lead that Azure has since narrowed. Core compute and storage facilities were well established by the cloud giants, as were data warehousing and AI offerings. Vendors hawking analytics and data science wares were already offering their services on the clouds, but not to the extent they do today.

According to figures from Statista, 2016 was the high-water year in terms of rate of growth (see graph). Public cloud revenue grew by 51.3% in 2016, rising from $75.3 billion in 2015 to $114 billion in 2016, according to the firm. That exceeded the 33.7% growth in 2015 (from $56.3 billion in 2014) and the 35% growth in 2017 (when spending reached $154 billion).

At that time, so-called “digital native” companies had already adopted a cloud-first mentality. But bigger, more established companies were reluctant to make similar moves, and elected to run the majority of their applications and store most of their data on-prem (in fact, that is still the case, but bear with us).

Security was an oft-cited reason for slow-walking the move to the cloud. CIOs and CTOs didn’t want their companies in the headlines for a massive data breach, so they defaulted to a conservative, known position: running apps and storing data on-prem, as they always had.

But over time, the CIOs and CTOs gradually realized that security may actually be better on the public cloud. Securing data and applications takes constant effort and continual investment, and thanks to economies of scale, the cloud providers are in some ways better positioned to make those investments.

Another big factor at play in the big data-cloud revolution was Hadoop. In retrospect, we were at or slightly past “peak Hadoop” in 2016. Hortonworks had gone public in December 2014, and Cloudera wouldn’t go public until April 2017 (the two companies would merge 16 months later). Hadoop was still the platform at the forefront of people’s minds when they thought about “big data.”

But the ingredients for big data to shift into the cloud were already falling into place, and it was becoming increasingly difficult to resist the cloud’s growing data gravity. Frustrated with the difficulty of managing their own Hadoop clusters, companies were looking for easier-to-use cloud alternatives.

One option, which many companies took, was a managed Hadoop offering in the cloud with data stored in an HDFS cluster. Other companies opted to build their own data lake in the cloud atop one of the object stores, such as Amazon’s Simple Storage Service (S3), Google Cloud Storage, or Microsoft Azure Data Lake Storage (ADLS), which launched in November 2016. IBM also launched its S3-compatible Cloud Object Storage in 2016.

Eventually, the monolithic, tightly coupled Hadoop stack would be disassembled into separate cloud components, with S3-compatible object stores replacing HDFS and Kubernetes replacing YARN as the scheduler. Decoupling compute and storage lets customers scale the two independently, a core cloud-era concept that even Cloudera would eventually embrace (with caveats; YARN has some advantages and will continue to exist). But Kubernetes, the container manager that Google launched in 2014, would take a few more years to reach critical mass.
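To make that compute-storage split concrete, here is a minimal PySpark sketch of the pattern described above: a job that once read from HDFS simply points at an object store instead. It assumes a Spark environment with the Hadoop S3A connector on the classpath and AWS credentials already configured; the bucket, paths, and event_date column are hypothetical.

from pyspark.sql import SparkSession

# Spark session; cluster sizing is now independent of where the data lives.
spark = SparkSession.builder.appName("cloud-data-lake-sketch").getOrCreate()

# Old pattern: data co-located with compute in an HDFS cluster.
# events = spark.read.parquet("hdfs:///warehouse/events/")

# Cloud pattern: the same read, pointed at an S3 bucket (hypothetical name).
# Storage now scales, and is billed, independently of the compute running this job.
events = spark.read.parquet("s3a://example-datalake/events/")

daily_counts = events.groupBy("event_date").count()
daily_counts.write.mode("overwrite").parquet("s3a://example-datalake/reports/daily_counts/")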

The clouds’ rise could be felt in other areas, too. 2016 was also the year that MongoDB launched Atlas, its uber-successful NoSQL database in the cloud. Numerous other NoSQL database vendors followed suit with fully managed cloud versions of their offerings. In fact, from 2017 to 2018, two-thirds of the growth in the global database market came from cloud deployments, according to Gartner, which declared in 2019 that the cloud was “the default” for new database deployments.

Cloud deployments of all sorts of data engines were building momentum in 2016. Snowflake was steadily growing its business as disaffected Hadoop users turned to cloud-based data warehousing. Confluent wouldn’t launch its managed Kafka service in the cloud until 2017, but many Kafka deployments were taking place in the cloud anyway. In 2016, a Databricks survey found that 61% of Spark deployments were taking place in the cloud, and that on-prem Spark deployments were dropping.

The transformation to a cloud-first mentality was complete, more or less, by the end of 2017, when the cloud was everywhere at the fall Strata + Hadoop World conference in New York City. Hadoop? Well, the big yellow elephant was conspicuously absent.

Related Items:

2020 – COVID-19 — Kicking Digital Transformation Into Overdrive

2019 – DataOps: A Return to Data Engineering

2018 – GDPR and the Big Data Backlash

2017 – AI, Deep Learning, and GPUs

2015 – Spark Takes the Big Data World by Storm

2014 – NoSQL Has Its Day

2013 – The Flourishing Open Source Ecosystem

2012 – SSDs and the Rise of Fast Data

2011 – The Emergence of Hadoop
