Microsoft Now Developing Its Own Hadoop
Hadoop might be dead, but that’s not stopping public cloud providers from using it. The latest to make a move is Microsoft Azure, which in July announced that it would begin developing its own distribution under its HDInsight brand.
Microsoft, of course, has been providing Hadoop software on its Azure cloud for many years. It was an early partner of Hortonworks, and basically had an OEM version of the Hortonworks Data Platform (HDP) for the cloud that it called HDInsight.
But Hortonworks merged with Cloudera in early 2019, and the HDP product line is no longer being developed, although it is still being supported by Cloudera, along with its legacy Hadoop distribution, Cloudera Distribution including Hadoop (CDH), until at least 2022.
Rather than form a partnership with Cloudera to use its converged Hadoop distribution, dubbed Cloudera Data Platform (CDP), Microsoft decided to head out on its own and continue developing HDInsight by itself.
Microsoft made the announcement on July 21, during its Inspire conference. “Microsoft’s supported distribution of Apache Hadoop, which will be generally available July 2020, is fully open source and compatible with the latest version of Hadoop,” the company said in its Inspire book of news.
“With the distribution, users can provision a new HDInsight cluster based on Apache code that is built and wholly supported by Microsoft,” the company continued. “Customers will be automatically migrated to the supported distribution.”
Clearly there is still room for the Hadoop family of products in the enterprise, at least on the cloud (Cloudera executives prefer not to mention the “H” word at all anymore, although its software still features the same motley gang of zoo animals that it always has).
Microsoft’s documents show that HDInsight 4.0 is based on Apache Hadoop and YARN version 3.1.1, which is a very fresh release. This data platform features the full supporting cast of Apache projects: Tez, Pig, Hive, HBase, Sqoop, Oozie, Ambari, and Zookeeper to wrangle them all.
HDInsight 4.0 customers can also add Spark and Kafka to the mix, providing a modern and powerful system for moving and analyzing big data. Storm and Mahout are also supported on HDInsight version 3.6, which Microsoft intends to support until June 2021.
Microsoft’s decision to continue investing in Hadoop shows that, despite the bad press that Hadoop has received and the (perceived and real) advantages that newer platforms based on object stores and Kubernetes hold over HDFS and YARN, respectively, Hadoop still has legs.
That means that Microsoft is in the same boat, as it were, with its cloud competitors. While the on-prem Hadoop distributors were struggling, Amazon Web Services and Google Cloud never publicly wavered from their support for Hadoop, which they provide via the Elastic MapReduce (EMR) and Cloud Dataproc offerings, respectively.
Of course, Hadoop in the cloud and Hadoop on prem have been very different beasts, with the biggest difference being that the cloud providers are on the hook for making sure the servers are serving and the application frameworks are working, not the customer who just wants to analyze some big data.
All three cloud providers have also modified their Hadoop offerings and tailored them to fit their particular clouds. That means that data in EMR is served by either HDFS or EMRFS, a file system that fronts the S3 cloud object store. Google Cloud Dataproc can also be served with data stored in HDFS, or in HDFS along with Google Cloud Storage via a connector.
Microsoft worked with Hortonworks to enable HDInsight to run on HDFS or to use Azure Blob Storage or Azure Data Lake Storage (ADLS) for storage. All three Hadoop implementations can also be used with the cloud providers’ specific column-oriented database services, namely Azure Synapse Analytics (previously SQL Data Warehouse), AWS Redshift, or Google BigQuery (although Hive is typically included in these cloud Hadoop distros, and Presto and Spark SQL are very popular).
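In practice, the storage choices described above usually show up to a job as nothing more than a different URI scheme on the input or output path. The following is a minimal Python sketch of that mapping, not anything taken from Microsoft, AWS, or Google documentation: the schemes are the ones commonly used by the Hadoop-compatible connectors, while the lookup table, the function name `describe_storage`, and the bucket and account names in the usage note are hypothetical.

```python
from urllib.parse import urlparse

# Illustrative lookup table (an assumption for this sketch): URI scheme ->
# (cloud Hadoop service, storage backend behind that scheme).
STORAGE_BACKENDS = {
    "hdfs": ("any Hadoop cluster", "HDFS"),
    "s3": ("Amazon EMR", "S3 via EMRFS"),
    "gs": ("Google Cloud Dataproc", "Google Cloud Storage via connector"),
    "wasb": ("Azure HDInsight", "Azure Blob Storage"),
    "wasbs": ("Azure HDInsight", "Azure Blob Storage"),  # TLS variant
    "adl": ("Azure HDInsight", "Azure Data Lake Storage"),
    "abfs": ("Azure HDInsight", "Azure Data Lake Storage"),
    "abfss": ("Azure HDInsight", "Azure Data Lake Storage"),  # TLS variant
}


def describe_storage(uri):
    """Return (service, backend) for a storage URI, based only on its scheme."""
    scheme = urlparse(uri).scheme
    if scheme not in STORAGE_BACKENDS:
        raise ValueError(f"unrecognized storage scheme: {scheme!r}")
    return STORAGE_BACKENDS[scheme]
```

For example, `describe_storage("wasbs://logs@exampleaccount.blob.core.windows.net/2020/07/")` reports Azure Blob Storage, while `describe_storage("s3://example-bucket/events/")` reports S3 via EMRFS. The same Hive or Spark job can often be pointed at either backend by changing only the path.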
Google Cloud has been at the front of the pack in terms of modifying traditional Hadoop and updating it with more “modern” components. About a year ago, the company announced that Cloud Dataproc could run on the Kubernetes cluster manager, in addition to YARN (shifting from YARN to Kubernetes is not an easy thing to do, not even for Google, as there are many inter-related dependencies).
Microsoft has also gone its own way and is working closely with Databricks, the company behind Apache Spark. The Databricks solutions–including Spark, Delta Lake, Delta Engine, and MLflow–run on Azure and AWS (Google Cloud is reportedly in the works). Microsoft also is an investor in Databricks.
Despite these advanced Spark-based capabilities from Databricks, Microsoft felt it was necessary to provide a Hadoop distribution that supports more frameworks and is therefore able to do more things for customers. That was the original idea behind Hadoop, which its founders believed would become an operating system for big data. That idea might not have completely panned out, but it hasn’t been a complete failure either, despite the bad press.