How Advances in SQL on Hadoop Are Democratizing Big Data–Part 1
September 2014 marked the anniversary of Edgar F. Codd’s 1969 introduction of “A Relational Model of Data for Large Shared Data Banks”, which is a compellation of research and theories that ultimately provided the foundation for modern relational SQL databases. Since the start of widespread adoption of relational databases, analytics professionals have invested decades of experience in SQL expertise. More recently, Hadoop and other emerging platforms have disrupted the relational paradigm, introducing unfamiliar concepts for both data management and analysis. The result has been a shifting focus from how to manage Big Data to who can leverage Big Data analytics.
The power of analytic insight to transform business and deliver competitive advantage is unmistakable, enabling the quantitative decision making and informed actions necessary to achieve billions of dollars in productivity gains. The emergence of Hadoop has fundamentally changed the platform economics and feasibility of scaling analytics to deeper and broader data assets. At the same time, companies have struggled to deliver consistent timely results with Big Data technologies, constrained by the availability and productivity of engineers, analysts and statisticians to deliver sustainable momentum.
Modern information needs are pushing the boundaries of analytic scope and scale and a recent boom of new SQL on Hadoop technologies indicates that relational concepts are as relevant as ever. From open source projects such as Hive, Spark SQL, Drill, Tajo, Phoenix, and Presto to other proprietary solutions, the analytics industry is long on performant SQL on Hadoop as a must-have capability. The era of NoSQL (Not-only SQL) and NewSQL (shared-nothing OLTP) is bridging the gap between legacy analytic skillsets and modern analytic workloads to democratize Big Data for the masses.
The Importance of SQL for Big Data Analytics
For an industry quick to proclaim conventional databases a dying technology in favor of distributed key stores as the likely successor, the real question is why the sudden interest in SQL? The supply of large, complex and dynamic data sources is pushing mind-share and market-share toward Hadoop for flexible unstructured key-value storage. Equivalent pressures for responsive analysis, extensibility and interoperability have emerged supporting SQL as the preferred access path to satisfy the bulk of analytic demand. The market continues to prefer SQL as the common denominator for data management, manipulation and access, regardless of the underlying storage platform.
Even after the introduction and widespread adoption of Hive as the foundation for SQL on Hadoop, many developers continued to look to core MapReduce for mission critical analytics applications, citing greater control over performance and execution. MapReduce solutions rely on a one-size-fits-all framework driven by technical skills that are largely foreign to experienced analytic professionals whose expertise revolves around conventional platforms. The fundamental concepts of MapReduce are arguably accessible to the most tech savvy professionals. However, manually stringing together MapReduce jobs to replicate the functionality of even basic SQL queries is a tedious and error prone task resulting in code that is both difficult to manage and optimize with changing data or requirements.
The misalignment of existing skills to new platform capabilities and the productivity limitations of developing, deploying and extending low-level MapReduce code has been a driving force behind an expectation gap between the possibilities and the practicalities of enterprise Big Data analytics. With improvements to query execution, columnar storage, cost-based optimization and native extensibility to query external datasets, SQL on Hadoop is increasingly both the most productive analytic programming paradigm and the performant choice for production workloads.
Investments in performant SQL on Hadoop are creating new opportunities to both modernize legacy data platforms and tap into existing developer skillsets, making analytics on Big Data not only achievable but sustainable. Declarative SQL is a natural language for data management, manipulation and access, enabling developers to abstract business rules from procedural execution. Although various platforms support different dialects, ANSI SQL is a lingua franca, broadly portable across SQL database platforms. SQL inherently allows analytic application developers to focus on business logic and exploration rather than optimizing procedural code control and execution.
The demand for Hadoop-savvy skillsets has put pressure on analytics organizations to attract, train and retain Big Data talent. Enhanced SQL capabilities are enabling Hadoop newcomers and revitalizing the experienced developer. Analytics organizations can now augment specialized Hadoop skillsets with conventional data management expertise. This trend, along with advances in security, high availability and disaster recovery is bringing Hadoop into the fold of mature enterprise strength analytics platforms. Hadoop is no longer the rogue outlier, encouraging speculation that it may become as prevalent as Linux within the enterprise.
As relational database technology has improved since its initial adoption, SQL code remains portable, taking advantage of platform and optimizer improvements without the extensive regression testing required by other platform changes which can occur between data integration tools for example. Developers have invested in SQL precisely because of its durability and portability and it has now become a persistently relevant skillset.
Changes That Make SQL on Hadoop More Relevant
Both open source communities and proprietary software companies are accelerating the evolution of Hadoop features, creating a diverse ecosystem of Hadoop options. Each iteration of platform improvements takes advantage of a mature universe of data management capabilities. Recent advancements introduced in the Hadoop ecosystem including real-time data streams, MPP data analytics, in-memory analytics and data virtualization, all of which extend and challenge the strengths of a variety of established enterprise solutions. While these additions have been substantial with regard to capability enablement, they have gained minimal traction in enterprise environments due to rapid change and a lack of approachability. SQL on Hadoop has consistently become a unifying feature, allowing mature enterprises to achieve expanded Big Data capabilities while maintaining organizational and platform stability.
Improvements in Hive are increasingly a baseline for advancements in SQL on Hadoop maturity. While proprietary solutions have added new platforms and features encouraging platform isolation and vendor lock-in, Hive remains the common denominator across Hadoop distributions. This means that advancements in Hive are generally portable regardless of distribution vendors, although some vendors may be faster to integrate and adopt new features than others.
Driven by the Stinger and Stinger.next projects, recent advancements in Hive include achieving substantial feature and performance improvements. Broadly supported projects have made, and continue to make, broad strides towards enhanced SQL on Hadoop and are enabling Hive to enhance and enable conventional analytic database capabilities. These efforts are transforming Hive from a crude, batch-oriented platform operating on immutable data into a broadly-featured platform, accessible to developers accustomed to mature database platforms and suitable for integration with mature enterprise analytic environments.
Five SQL Improvements That Are Transforming Hadoop &Big Data Analytics
Several key Hive features introduced by the Stinger and Stinger.next projects are democratizing SQL on Hadoop and making Big Data accessible to experienced database developers. These features are bridging gaps in Hadoop’s analytic scope and extensibility, including the skills gaps required to effectively leverage Hadoop and the administrative capabilities needed to manage Hadoop platforms.
- ACID transactions (HIVE-5317): Hadoop and Hive emerged as inherently file-based platforms. Data tables could be loaded or appended, but implicitly could not support incremental updates or deletes without complex ETL-like batch transformations. Applications on Hadoop requiring updates or deletes relied on expensive restatement of the table or partitions. Introducing ACID transactions and native support for incremental updates and deletes eliminates the need to develop, maintain and manage entire sets of complex ETL/ELT code while improving performance certain types of workloads. This feature not only allows traditional database developers to engage SQL on Hadoop using familiar syntax, it reduces substantial time and effort within the development, test and operations workloads.
- Cost based query optimization (HIVE-5775): Hive originally supported simplistic rule-based optimization when transforming SQL queries into programmatic executable MapReduce code. Like many early SQL database technologies, Hive addressed gaps in query optimization using complex hints, which could be used to drive specific options within a query execution plan, for example a map-side vs. reduce-side join. Query optimization has been maturing in proprietary database technology for decades. Database developers have developed reasonable expectations that logical correctness along with a modicum of efficient organization suffice for a conventional SQL database to generate a performant execution plan. Introducing cost-based optimization moves Hive towards the capabilities of conventional databases and reduces the need for specialized skillsets for query optimization.
- Performant interactive query execution (HIVE-4660): Hive was originally built around a one-size-fits-all MapReduce execution framework optimized for batch execution. This approach sufficed for many Big Data workloads and applications, but proved inefficient for many types of workloads. Introducing Tez as an execution engine enabled optimizations such as pipelined execution of reduce steps without landing intermediate results to disk, vectorised query execution and other frameworks for query execution that step outside the MapReduce framework. This enables Hive to take on not just ETL/ELT workloads, but begins to enable new and previously impractical applications such as data discovery. The ability to work interactively with SQL on Hive expands the roles of Hive users and expedites development, testing and analysis pipelines.
- Analytic Query Support (HIVE-4197): Enhanced support for common analytic window functions moves Hive beyond rudimentary batch data movement to support complex analytics and common query techniques. Robust analytic capabilities are critical to a variety of data integration and analytic functions. Integration of these capabilities in SQL on Hadoop enables increased portability and productivity by supporting table stakes abilities of conventional RDBMS platforms and reducing the need for customized MapReduce or other low-level of procedural coding.
- Extensibility beyond SQL (HIVE-7292): As advancements on Hive continue to improve accessibility, performance and administration capabilities, extension to new execution frameworks means that SQL on Hadoop becomes natively accessible to new Hadoop ecosystem tools. Current efforts to extend Hive SQL to Spark as an execution framework means that Hive SQL and data structures are extensible to advanced analytic capabilities without adding additional specialized tools. The MLLib framework on Spark provides a dynamic, performant and easy to use toolkit for applying machine learning and advanced statistical capabilities.
Democratizing Big Data Addresses Analytic Agility
Core feature improvements in SQL on Hadoop are making Big Data analytics accessible and approachable by mature enterprise analytics organizations. These advancements are creating new opportunities and expanded capabilities that leverage decades of embedded experience. Companies are no longer solely constrained by the capacity of scarce specialized skillsets to drive analytic innovation. Integrating Hadoop specialists with experienced SQL resources elevates broader teams of engineers, analysts and statisticians to address the question of who will support sustainable Big Data agility across the entire data analytics ecosystem.