Hadoop Past, Present, and Future
Every few years the technology industry seems to be consumed with a shiny new object that gets hyped far beyond reality. At worst, the inevitable bursting of the hype bubble leads to the disappearance of the technology from relevance (remember Internet browsing on your TV?), but more often the hype subsides until a real but narrower focus for the technology is found.
It’s been a decade since Hadoop was first created as an Apache top-level project, and during that decade we’ve certainly witnessed a lot of hype about what it can do. The hype was driven in part by real needs in the market that were not being met. It has become easier and easier for applications and devices to generate “machine data”—and lots of it. That created the need for a scalable, low-cost system that could store that data and make it readily available for people to process and analyze. The traditional options for storing data—large-scale storage arrays or data warehouses—were simply too expensive and inflexible for that purpose. Enter Hadoop.
Initially created to process web crawl data for search engines, Hadoop started seeing broader use for data processing because it could scale to store large volumes of data on inexpensive commodity servers and make that data directly accessible to programmers. The combination of a scalable, distributed filesystem (HDFS) and a flexible processing engine (MapReduce) formed the core of the Hadoop that gained interest in the market. Hadoop came to be used as a landing area for machine data created by applications and devices. To make use of that data, skilled developers wrote MapReduce programs that could transform it into more usable forms and apply machine learning algorithms to it.
However, as more and more people started experimenting with Hadoop projects, its limitations became apparent. The limited supply of skilled MapReduce programmers made it very difficult to integrate Hadoop into the broader set of tools and processes that organizations already had, which restricted its adoption. Attempts to address that led to a proliferation of projects hoping to create SQL interfaces for Hadoop, since SQL has been the de facto language of data tools and users for several decades. In spite of growing hype that this would make it possible for Hadoop to take over all data processing and displace data warehousing, these fragmented attempts remained handicapped by other fundamental challenges, both technological and business-related.
Designed as a batch processing system for search data, Hadoop was never built for high-speed, interactive analytics and reporting. Significant development work has been expended trying to overcome that fundamental limitation by building new processing engines that could supplement or replace MapReduce while still taking advantage of the cost and scalability of the HDFS filesystem. Most of these projects have gained only limited traction due to the complexity and fragmentation of these efforts.
However, one technology that has emerged is Spark. What is Spark? In a nutshell, Spark is a next-generation processing engine for data. Although developed independently of Hadoop, Spark was soon made to work with HDFS, pairing a high-performance processing engine with a scalable, distributed filesystem. Spark’s architecture was designed from the start for fast, efficient processing of large-scale data. As a result, many Hadoop distributions now include Spark as a core processing engine.
In some ways, Spark has become the latest shiny object and the focus of the hype. That has put a lot of development attention on Spark, but as Spark evolves it becomes harder to see why it should remain tied to Hadoop. Just as putting a race car engine on a bicycle isn’t the right way to get a faster vehicle, it’s becoming increasingly clear that tying Spark to Hadoop may be holding it back.
The key value of Hadoop for Spark is using the HDFS filesystem to store data, but new options have emerged that are even more scalable and resilient while being just as cost-effective. The clearest example is another technology that recently celebrated its 10-year anniversary—Amazon’s S3 storage service. S3 is quite possibly the largest storage platform in existence, thanks to its scalability, resiliency, ease of use, and economies of scale. Other cloud platforms have similar offerings that are maturing at a rapid rate. Given the availability of options like these, it makes a lot of sense for Spark to leave behind the complexity and limitations of Hadoop, giving users an easier path to Spark for complex data processing and machine learning.
Does that mean that Spark will be the technology that ultimately takes over all things data? The case of Hadoop makes it clear that a processing engine alone cannot address the full scope of what organizations need in order to use data. In particular, both Hadoop and Spark remain complex and incomplete sets of technologies that are difficult to deploy and integrate into a business. Here, the requirements for security, manageability, data quality, and compatibility that are well familiar in the world of data warehousing become paramount. As a result, the data warehouse will continue to play a key role—but one that connects to tools like Spark to take advantage of their capabilities.
About the author: Bob Muglia is the CEO of Snowflake Computing, a cloud-based provider of data warehousing services. Bob has over 20 years of management and leadership experience, including stints at Microsoft, where he led the $16 billion Server and Tools Business, and at Juniper Networks, where he was EVP of Software and Solutions. Bob holds a bachelor’s degree in computer & communication science from the University of Michigan.