January 25, 2019

Hadoop Gets Improved Hooks to Cloud, Deep Learning

Organizations that adopt the latest version 3.2 release of Apache Hadoop will get new integration hooks into the AWS and Azure clouds, as well as access to a new deep learning project called Hadoop Submarine.

Hadoop may not excite big data enthusiasts as it once did. Nevertheless, the technology is widely deployed, and thousands of organizations around the world rely on it to store and process huge amounts of data.

On Wednesday, the Apache Software Foundation announced the release of Apache Hadoop version 3.2, which the Hadoop community says is a major release in the version 3.x line, with more than 1,000 changes.

Among the more prominent additions is improved integration with public clouds, which are replacing Hadoop data lakes in some cases. Hadoop 3.2 sports an enhanced file system connector for Azure Blob File System (ABFS), Microsoft's driver for its cloud-based object storage. With this release, Hadoop now supports the latest Azure Data Lake Storage Gen2.

The Hadoop team has also bolstered the S3A connector, which connects the framework to the Amazon Web Services S3 object store. The refreshed connector copes better when S3 and DynamoDB I/O is throttled, the community says.
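For readers who have not worked with these connectors, the short Python sketch below shows how the two stores are typically addressed from Hadoop: ABFS paths use the `abfs://` or `abfss://` scheme with a `container@account.dfs.core.windows.net` authority, and S3A paths use `s3a://bucket/key`. The helper names, along with the container, account, and bucket values, are hypothetical and chosen only for illustration; nothing here comes from the Hadoop code base itself.

```python
from urllib.parse import urlparse


def abfs_uri(container, account, path, secure=True):
    """Build an ABFS-style URI (helper name and arguments are hypothetical)."""
    scheme = "abfss" if secure else "abfs"  # abfss is the TLS variant
    return f"{scheme}://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def s3a_uri(bucket, key):
    """Build an S3A-style URI (helper name and arguments are hypothetical)."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


def connector_for(uri):
    """Map a URI scheme to the connector this article discusses."""
    scheme = urlparse(uri).scheme
    return {"abfs": "ABFS", "abfss": "ABFS", "s3a": "S3A"}.get(scheme, "unknown")


if __name__ == "__main__":
    for uri in (abfs_uri("logs", "myaccount", "/2019/01"),
                s3a_uri("my-bucket", "data/part-0")):
        print(connector_for(uri), uri)
```

In practice, a path of this form would be handed to Hadoop tools or jobs in place of an `hdfs://` path, with credentials configured separately.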

On the deep learning front, Hadoop 3.2 brings a new sub-project called Hadoop Submarine. The community says Hadoop Submarine makes it easier for data engineers to develop, train, and deploy deep learning models using TensorFlow on YARN clusters.

Version 3.2 brings several other enhancements, including:

  • Support for “in-place seamless” upgrades for long-running services using the YARN Native Services API;
  • An upgraded C++ client for HDFS that will improve asynchronous I/O, which will help downstream projects like Apache ORC;
  • Support for node attribute labels in YARN to improve management of different types of nodes;
  • A new storage policy “satisfier” that allows HDFS applications to move blocks between storage types to adhere to storage policies on files and directories.

“This is one of the biggest releases in Apache Hadoop 3.x line,” says Sunil Govindan, Apache Hadoop 3.2.0 release manager. “The Apache Hadoop community continues to go from strength to strength in further driving innovation in Big Data.”

While Hadoop has failed to match the lofty expectations that were once attached to it, the big data storage and compute framework still has a lot going for it. The Apache organization offered a long list of tier-one companies that run Hadoop in production settings, including Apple, Capital One, eBay, Hulu, The New York Times, Tesla Motors, and Uber.

“Netflix captures 500+B daily events using Apache Hadoop,” the Apache organization stated in its press release. “Twitter uses Apache Hadoop to handle 5B+ sessions a day in real time. Twitter’s 10,000+ node cluster processes and analyzes more than a zettabyte of raw data through 200B+ tweets per year. Facebook’s cluster of 4,000+ machines that store 300+ petabytes is augmented by 4 new petabytes of data generated each day. Microsoft uses Apache Hadoop YARN to run the internal Cosmos data lake, which operates over hundreds of thousands of nodes and manages billions of containers per day.”

Apache also shared some stats about the Hadoop market that were collected by Transparency Market Research. The group claims the global Hadoop market “is anticipated to rise at a staggering 29% CAGR with a market valuation of $37.7B by the end of 2023.”
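To put a compound annual growth rate in context, the sketch below works backward from an end-of-period valuation to the starting value it implies. The quoted figures of 29% and $37.7B are from the press release; the number of years in the forecast window is not stated there, so the horizons tried here are assumptions for illustration only, and the function names are hypothetical.

```python
def project(start_value, cagr, years):
    """Grow start_value at a compound annual rate for the given number of years."""
    return start_value * (1.0 + cagr) ** years


def implied_start_value(end_value, cagr, years):
    """Invert the compound-growth formula to recover the starting value."""
    return end_value / (1.0 + cagr) ** years


if __name__ == "__main__":
    END_VALUE_B = 37.7   # $B, as quoted in the press release
    CAGR = 0.29          # 29%, as quoted in the press release
    # The forecast window is NOT given in the article; these horizons are assumed.
    for years in (3, 5, 7):
        start = implied_start_value(END_VALUE_B, CAGR, years)
        print(f"{years} years of growth implies a starting market of ${start:.1f}B")
```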

In terms of open source projects, Apache Hadoop remains “one of the most active projects” at the Apache organization, with a number one ranking for code commits and a number five ranking by the size of its code repository. Hadoop is composed of more than 3.8 million lines of code, the Apache group says.

Related Items:

Hadoop Has Failed Us, Tech Experts Say

Why Hadoop Must Evolve Toward Greater Simplicity

Hadoop 3 Poised to Boost Storage Capacity, Resilience with Erasure Coding