April 16, 2019

Microsoft Expands Hadoop on Azure

Staff report

Microsoft has upgraded its open source analytics services running on Azure with a new version of Hadoop incorporating enhancements to Apache Hive and other open source analytics frameworks.

The software giant (NASDAQ: MSFT), which completed its blockbuster acquisition of GitHub last October, continued its push into open source with this week’s release of Hadoop 3.0 on its Azure HDInsight analytics service. The release incorporates upgrades to Hive, including its data warehouse “connector” for Apache Spark, as well as new versions of HBase and Phoenix.

In addition, Microsoft said Monday (April 15) that its cloud-based Hadoop service integrates Spark IO cache, HDInsight’s data caching service designed to accelerate workloads running on Apache Spark clusters.

The Hadoop upgrade represents Microsoft’s ongoing efforts to boost support for big data analytics applications on its Azure cloud. Microsoft is positioning its Hadoop 3.0 distribution as an “enterprise-ready service for open source analytics” that can run Spark, Kafka and other open source apps. Those tools can be used for data ingestion, preparation and management along with analytics, business intelligence and data visualizations.

Microsoft promotes the Hadoop and other open source upgrades as a means of boosting the performance and availability of analytics applications running on its cloud. For instance, the addition of the latest version of Hive data warehouse software to its HDInsight service targets developers seeking to build “traditional database” applications on data lakes. The company touts that capability as helping to build big data applications that comply with data privacy rules.

Meanwhile, the Hive warehouse connector for Spark underscores how the analytics tools are merging. The link is intended to advance that integration to the query engine level, the company said.

The upgraded HBase non-relational distributed database is designed to reorganize data in a memstore data-write buffer, thereby boosting performance by reducing reads of data stored remotely in the cloud. The accompanying Apache Phoenix relational database engine, which supports transaction processing on Hadoop, brings “more visibility into queries,” Microsoft noted in a blog post. The upgrade also provides details about queries being run against a cluster.

The data caching service, also now available on Azure HDInsight, seeks to boost the performance of Spark, Hive and Apache Tez workloads. All can be run on Spark clusters.

HDInsight also supports a growing list of big data applications that includes Kyligence, the analytic processing engine based on Apache Kylin, and the WANDisco data-migration tool used with cloud-based Hadoop and Spark deployments.
