Seeing the Big Picture on the Big Data Market Shift
Hidden from view in the “I want to be data-driven” conversation are the nitty-gritty details of how to actually become a data-driven organization. The grand hope is that artificial intelligence, in the guise of machine learning, will power our data-driven aspirations, and it’s clear that big data is the raw commodity that makes it all possible. But the tough question is how to effectively turn that raw data into something the algorithms can work with.
The truth is that many organizations are struggling to turn that raw data into high-powered AI fuel. Capturing, storing, and processing big data still poses significant challenges for those organizations that can’t afford to hire an army of data engineers to hand-build custom systems from available open source components. For those of us who want to use big data systems in the real world, as opposed to designing, building, and maintaining them, it’s worthwhile to know that the open source patterns for building distributed systems for storing and processing big data are in flux at the moment.
The Hadoop situation looms large in this discussion. Based on Google white papers and put into production by Yahoo in 2006, Hadoop represented a critical advance in the development of distributed systems. For the first time, organizations could store petabytes of data cheaply and reliably on clusters of inexpensive x86 servers. That opened up all sorts of new data services and business models that didn’t exist before.
“What Hadoop really demonstrated to the industry was you no longer need specialty hardware appliances and high-end gear and proprietary systems to deal with the data,” says Anand Babu (AB) Periasamy, the CEO and co-founder of MinIO and the co-creator of GlusterFS. “If Hadoop was not there, companies from Yahoo to Google to Uber — none of them would have existed without Hadoop-type solutions.”
Hadoop presented a giant leap forward relative to what came before it, namely rigid relational databases and more exotic kit running on specialized hardware. While it was originally developed to index the Internet, Hadoop quickly proved very useful for storing a variety of less-structured data that was flowing off Web and mobile applications, and which didn’t fit well into relational databases.
Eventually, Hadoop pushed beyond its storage roots to become a data processing platform too. Some data processing was necessary since Hadoop used a schema-on-read approach, which allowed customers to store huge amounts of data without first hammering it into shape (as you would with a database). The data would be structured after the fact through the various processing engines, starting with MapReduce and eventually Apache Spark.
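The schema-on-read idea described above can be shown in miniature: raw, heterogeneous records are stored exactly as they arrive, and structure is imposed only at read time. This is a toy pure-Python sketch, not how MapReduce or Spark implement it, and the field names are invented for illustration.

```python
import json

# Store raw events exactly as they arrive -- no upfront schema enforcement,
# mirroring how a data lake accepts heterogeneous records.
raw_events = [
    '{"user": "alice", "clicks": 3}',
    '{"user": "bob", "clicks": 7, "referrer": "ad"}',  # extra field is fine
    '{"user": "carol"}',                               # missing field is fine
]

def read_with_schema(lines):
    """Impose a schema at read time: keep only the fields we care about,
    filling defaults for records that lack them."""
    for line in lines:
        record = json.loads(line)
        yield {"user": record["user"], "clicks": record.get("clicks", 0)}

structured = list(read_with_schema(raw_events))
total_clicks = sum(r["clicks"] for r in structured)  # 3 + 7 + 0 = 10
```

A schema-on-write database would have rejected two of those three records at insert time; here the decision about shape is deferred to each processing job.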
While this approach has merit in some situations, it backfired in some respects. The Hadoop market largely failed to deliver easy-to-use mechanisms for cleaning up and preparing unstructured data for analysis using Hive, Impala, or any number of other distributed SQL query engines developed for Hadoop. Getting value out of the data necessitated an army of data engineers to run jobs, and data engineers in some places are harder to find than data scientists.
Monte Zweben, the CEO and founder of Splice Machine, says mission creep turned Hadoop into a platform for filtering, processing, and transforming data for downstream data marts and databases when it was not ready for that job.
“As a result, the data lakes ended up being a massive set of disparate compute engines, operating on disparate workloads, all sharing the same storage,” Zweben wrote in a blog post. “This is very hard to manage. The resource isolation and management tools in this ecosystem are improving but they still have a long way to go. Enterprises were not able to shift their focus away from using their data lakes as inexpensive data repositories to platforms that consume data and power mission-critical applications.”
We’re now in the midst of a major revolt against Hadoop and Hadoop-style computing as we lurch to the next generation of big data systems. At this point, the market is saying that data lake offerings on the cloud are easier to use than Hadoop, and the data is flying into clouds. The three major public clouds — Amazon Web Services, Microsoft Azure, and Google Cloud — have all built big data storage and processing systems that can do everything on-premises Hadoop does, but in a different way.
The emerging cloud architecture in today’s post-Hadoop world bears these key characteristics in both on-prem and cloud environments:
- For storage, an S3-compatible object storage system (if not Amazon S3 itself) as the data lake, replacing HDFS;
- For compute, Kubernetes for orchestrating containerized workloads (typically Docker containers), replacing Hadoop’s YARN scheduler.
In this new cloud architecture, compute and storage are separated, allowing for independent scaling of both. This is a major step forward relative to monolithic Hadoop clusters, where compute and data are co-located on the same node. Hadoop’s project architects are trying to move away from data locality, but the efforts don’t seem to be paying off.
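At the job level, the storage swap is often close to mechanical: engines like Spark can read either scheme, so pointing a job at object storage instead of cluster-local disks is largely a matter of rewriting paths. A minimal sketch of that idea, with a hypothetical bucket name (this is an illustration of the path change, not a migration tool):

```python
from urllib.parse import urlparse

def hdfs_to_s3(uri: str, bucket: str) -> str:
    """Rewrite an hdfs:// path into an s3a:// object key in the given
    bucket -- the same directory layout, but backed by object storage
    rather than HDFS datanodes."""
    parsed = urlparse(uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {uri!r}")
    key = parsed.path.lstrip("/")
    return f"s3a://{bucket}/{key}"

print(hdfs_to_s3("hdfs://namenode:8020/warehouse/events/2019/", "datalake"))
# -> s3a://datalake/warehouse/events/2019/
```

Because the object store scales independently of the compute fleet, the rewritten path no longer implies any particular cluster size, which is the separation of compute and storage in practice.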
“One common complaint I hear from customers is most of the Hadoop nodes are sitting idle just to store data,” Periasamy says. “You’re essentially using compute nodes for storage, not storage nodes for compute, and that led to huge operational problems, lots of compute nodes sitting idle.”
In the new cloud architecture, the data management tasks that plagued Hadoop — ETL/ELT, data cleansing, and format conversion — are being re-engineered around an emerging data pipeline construct, which allows more automation of data flow via APIs. Many systems in the new cloud architecture are serverless, which lowers the level of operational sophistication that customers need to manage the systems on a day-to-day basis (a major problem for Hadoop).
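The pipeline construct the paragraph describes boils down to treating each management task as a composable step that an orchestrator can schedule and rerun. A minimal sketch under that assumption, with invented stage names standing in for ELT, cleansing, and format conversion:

```python
def run_pipeline(records, stages):
    """Thread records through an ordered list of stages -- the core of
    the 'pipeline as API' idea: each step is a plain function, so the
    whole flow can be composed, scheduled, and rerun automatically."""
    for stage in stages:
        records = stage(records)
    return list(records)

# Hypothetical stages; a real pipeline would read from and write to the
# data lake rather than in-memory lists.
def drop_nulls(records):
    return (r for r in records if r.get("value") is not None)

def to_celsius(records):
    return ({**r, "value": round((r["value"] - 32) * 5 / 9, 1)} for r in records)

cleaned = run_pipeline(
    [{"sensor": "a", "value": 212.0}, {"sensor": "b", "value": None}],
    [drop_nulls, to_celsius],
)
# cleaned == [{"sensor": "a", "value": 100.0}]
```

Orchestrators in this space expose essentially this shape through their APIs: a declared ordering of small, restartable steps rather than one monolithic job.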
What’s ironic is that the new cloud architecture re-uses much of the same software that came of age during the Hadoop era. Apache Spark still shines for data science and data engineering tasks, while Apache Kafka (which was never directly tied to Hadoop) simplifies the creation of durable event-data pipelines. Hadoop lives on in the public cloud in the guise of Amazon’s Elastic MapReduce (EMR), Microsoft’s HDInsight, and Google Cloud’s Dataproc, all of which are central elements of these vendors’ big data architectures.
SQL still plays a major role in analytics, and SQL query engines like Hive, Impala, and Presto remain popular, in addition to cloud-specific offerings like Redshift, Athena, Google BigQuery, and Azure SQL Data Warehouse. There is also an array of specialized computational engines for graph analysis, time-series data, and geospatial data, which existed to some extent on Hadoop but which are now flourishing in the cloud.
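The SQL these engines accept is ordinary analytic SQL; what differs is the scale and the execution layer underneath. As a local stand-in, the same kind of aggregation can be run with Python’s built-in sqlite3 (table and column names invented here); an engine like Presto or Hive would accept near-identical SQL over files in a data lake:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("alice", 3), ("bob", 7), ("alice", 2)],
)

# The same GROUP BY would run, nearly verbatim, on a distributed SQL
# engine -- only the storage and execution underneath would differ.
rows = conn.execute(
    "SELECT user, SUM(clicks) AS total FROM events "
    "GROUP BY user ORDER BY total DESC"
).fetchall()
# rows == [("bob", 7), ("alice", 5)]
```

That portability is much of why SQL survived the Hadoop era intact: analysts’ queries outlive the engines that run them.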
Once the data has been captured from its source, landed in the data lake, transformed and cleansed using automated pipelines, and labeled by humans (or humans with a machine assist), it’s ready for machine learning, which is often conducted within frameworks like TensorFlow, PyTorch, and Spark MLlib.
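Once records are cleansed and labeled, the training step itself is routine. A toy pure-Python stand-in for what frameworks like TensorFlow, PyTorch, or Spark MLlib do at scale: fit a single threshold on labeled examples (the feature values and labels below are invented):

```python
# Labeled, cleansed examples as (feature, label) pairs -- the end product
# of the capture -> cleanse -> label pipeline described above.
labeled = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]

def fit_threshold(examples):
    """Pick the decision boundary midway between the highest negative
    and lowest positive example -- a one-parameter 'model'."""
    neg = max(x for x, y in examples if y == 0)
    pos = min(x for x, y in examples if y == 1)
    return (neg + pos) / 2

threshold = fit_threshold(labeled)  # (0.4 + 0.6) / 2 = 0.5
def predict(x):
    return int(x > threshold)
```

The point is the dependency, not the model: this last step is only as good as the pipeline that prepared and labeled the data feeding it.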
Hadoop isn’t going away. After all, IBM mainframes still roam the earth, more than six decades after they first landed here. But it’s clear that the architectural building blocks that organizations are using to build state-of-the-art data platforms are changing right beneath our feet. What used to be cutting-edge is now considered to be legacy, and a new cloud architecture has emerged.
Big data isn’t dead. It just moved to the cloud, says Ashish Thusoo, the CEO and co-founder of Qubole. “Invariably every company we talk to is doing something on the cloud,” says Thusoo, who helped create Apache Hive while at Facebook, another big Hadoop shop. “The market has moved.”
The future of big data is very bright, Thusoo says. “But you have to be on the right side of the architecture,” he adds. “The architecture has changed. The way the world was functioning in terms of the infrastructure and the way the world was functioning in 2007 when we started in this area — that has completely changed, with the advent of the cloud and the advent of separation of compute and storage.”