How Advances in SQL on Hadoop Are Democratizing Big Data–Part 2
In a previous article, we discussed several key advances in SQL on Hadoop that are making Big Data capabilities increasingly accessible to analytics organizations. While SQL democratizes Big Data by leveraging conventional skill sets, SQL access to data alone does not guarantee sustainable analytic agility. Architects who successfully integrate Hadoop to accelerate and expand analytic capabilities understand the value of data governance, transparency and a common vocabulary in ensuring actionable, value-adding analytic capabilities. Similarly, effective managers engaging new SQL on Hadoop capabilities are building balanced teams that combine specialized skills across multiple domains with generalized experience to maximize the impact of data management and analytic agility.
Hadoop and its most prevalent SQL access layer, Hive, were designed to meet the needs of dynamic high-tech companies, constantly pressured to adapt to evolving business needs and changes to core platforms at petabyte scale. These same features are being applied within mature industries seeking increased agility and capability, although tempered by the need to create lasting solutions that can support lifecycles of a decade or more. The extensibility and cost effectiveness of Hadoop, combined with the democratizing effect of SQL access, has the potential to contribute billions to the analytic effectiveness of the modern economy. At the same time, sacrificing stability and maintainability can dramatically increase the total cost of ownership of Big Data solutions. Engaging both planning and architectural definition at the outset ensures that the real value proposition of Hadoop is realized without creating maintenance problems for years to come.
Extend and Enhance Legacy Platforms and Skills with Balanced Data Analytics Portfolios
The next generation of SQL on Hadoop is transforming the way teams build Big Data applications. Analytic skills and capabilities have traditionally been siloed within role-specific tools and application platforms requiring highly specialized expertise. The often-touted concept of the data scientist as a heroic solo actor — part systems engineer, part statistical expert, part business analyst — has largely been replaced with team-based approaches that bring specialized talents together in close interaction to solve analytic challenges. A common language for data accelerates the agility of multidisciplinary teams, enabling more efficient interaction, communication and knowledge transfer across the entire analytic value chain. The widespread use of SQL as a common language for all stages of analytics makes it easier to support specialized experts, whether platform engineers or data scientists, with generalists, to the mutual elevation of the entire team.
The most effective analytic shops are increasingly leveraging a pipelined approach to analytic delivery, leading with agile data discovery prior to making more time-consuming investments in production code. In this scenario, an analyst may prototype analysis in SQL for presentation in a visualization tool, validate the results and requirements with business stakeholders, and then hand off this code to a data engineer for optimization and integration with data quality and governance checks, to be productionized as a regularly refreshed persistent data asset. The portability of SQL between roles reduces the opportunity for miscommunication at each stage of the delivery pipeline. This efficiency extends to quality assurance and data stewardship teams, who are able to validate requirements and data quality assumptions at each stage in the development process and ensure they continue into production operation.
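As a hypothetical sketch of this handoff (table, column and partition names are illustrative, not from any specific system), an analyst's discovery query can be promoted into a partitioned, regularly refreshed production table with minimal rewriting of the core logic:

```sql
-- Discovery: an analyst prototypes a metric ad hoc
SELECT region, SUM(order_total) AS revenue
FROM sales_raw
WHERE order_date >= '2015-01-01'
GROUP BY region;

-- Production: a data engineer promotes the validated logic into a
-- partitioned, columnar table and adds a basic quality filter
CREATE TABLE IF NOT EXISTS revenue_by_region (
  region  STRING,
  revenue DECIMAL(18,2)
)
PARTITIONED BY (order_date STRING)
STORED AS ORC;

SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE revenue_by_region PARTITION (order_date)
SELECT region,
       SUM(order_total) AS revenue,
       order_date
FROM sales_raw
WHERE order_total IS NOT NULL   -- quality check added during production hardening
GROUP BY region, order_date;
```

The point is that the business logic — the aggregation itself — survives the handoff verbatim; only storage, refresh and quality concerns are layered around it.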
Core enterprise analytic applications traditionally have a long shelf life – in many cases a decade or more. Shortcomings in development easily translate into increased maintenance and enhancement costs. SQL is an inherently declarative language, separating logical requirements from control flow. Shops that have invested heavily in fragile imperative approaches to data management and analytics frequently miss out on the benefits of platform improvements. In a rapidly evolving Hadoop ecosystem, SQL provides extensibility to new execution frameworks and platform optimizations. Just as the Tez execution engine improved performance for many Hive query workloads, emerging enhancements and alternative execution engines such as Spark will continue to elevate performance while minimizing the cost of regression testing existing code or requirements. In an industry bound by human resource availability and productivity, the ability of SQL to capture business requirements while abstracting low-level execution means more time spent delivering new insights and less time refactoring code, resulting in a lower total cost of ownership.
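Because the SQL itself is declarative, switching Hive's execution engine is a session-level configuration change rather than a code rewrite. A sketch (which engines are available depends on your Hive version and cluster installation):

```sql
-- The same query text runs unchanged under each engine
SET hive.execution.engine=mr;     -- classic MapReduce
SELECT region, COUNT(*) FROM sales_raw GROUP BY region;

SET hive.execution.engine=tez;    -- Tez, typically faster for many workloads
SELECT region, COUNT(*) FROM sales_raw GROUP BY region;

SET hive.execution.engine=spark;  -- Hive on Spark, where configured
SELECT region, COUNT(*) FROM sales_raw GROUP BY region;
```

This is exactly the separation of logical requirements from execution that protects long-lived SQL assets from platform churn.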
Tips to Enable Effective and Sustainable SQL on Hadoop For Enterprise Analytics
Even with recent advances in SQL on Hadoop solutions, there is still no magic bullet for delivering effective analytics. A few key considerations will ensure that your analytics teams sustain productivity for the long term:
- Build Balanced Data Analytics Teams: The most effective data analytics teams incorporate a multidisciplinary approach to delivery. While SQL enables a common vocabulary for accessing data across roles, improving communication and reducing time to productivity, it is important to balance business acumen with specialized technical expertise. It is unlikely that you will be able to hire enough data heroes capable of analyzing business requirements, developing efficient data management processes and extracting new features for statistical models to satisfy your business’ demand for analytics. Building balanced teams frees up specialists to elevate the capabilities of other team members to meet analytic demand.
- Avoid “System-Level Spaghetti”: It is critical to establish and enforce architecture, standards and organizational patterns in your data management applications. Agility is not a replacement for architecture and planning for long-term sustainability. A recent paper published by engineers at Google titled Machine Learning: The High-Interest Credit Card of Technical Debt describes “pipeline jungles” in data preparation that “evolve organically, as new signals are identified and new information sources added”. Code developed for data discovery is often a good template for prototyping requirements, but rarely satisfies all the needs of a production environment. Have a plan and checklist for making your data management processes enterprise strength, and avoid deploying ad-hoc code to production environments for the sake of expediency. Short-term technical debt can add millions in maintenance costs over the life of a long-lived analytic application.
- Understand and Address Current Platform Limitations with a Balanced Data Analytics Portfolio: Be aware of the limitations and optimization requirements for your platform. The Hadoop ecosystem continues to evolve and improve, but in many ways still trails the robust optimization features of more mature RDBMS solutions. Be prepared to tune queries and underlying data structures to ensure scalable performance and effective stewardship of production resources. Performance tuning and optimization is a specialized skillset that serves other members of your balanced team and should be a part of your pre-deployment checklist. Successful Hadoop solutions are rarely an outright replacement of conventional platforms and data warehouse capabilities. Adjust workloads and access patterns to access the appropriate tools at the appropriate times. In many cases this means federating an analytic ecosystem across high-volume raw data management, transformation and deep discovery on Hadoop while pushing horizontally scalable aggregate reporting workloads to conventional RDBMS and purpose-built analytic applications.
- Integrate Quantifiable Data Quality and Transparent Data Governance Into Your Applications: Democratizing access to data and analytics can have side effects beyond expediting access to information. Establish layers and guidelines within your data and information delivery environment for specific types of data analysis, and make these transparent to the consumers of analytic outputs. Enabling SQL access for new and deeper analysis is a powerful tool for accelerating analytics, but it is important to differentiate ad-hoc discovery and departmental data and metric definitions from governed metrics and standards. Wherever analysis may be directional or exploratory in nature, it is critical to provide this information to the end consumer. For productionized analytic processes, providing traceable data lineage along with ongoing stewardship and data quality monitoring ensures continued confidence in analytic output.
- Pipeline Analytics Workflows to Leverage Data Discovery: Leveraging SQL as a standard unifying language throughout the data pipeline delivers new efficiencies for analytics teams. Enabling analysts to prototype analysis and requirements through SQL provides a fail-fast approach to meet and validate business needs prior to investing in more rigorous development, performance tuning and quality assurance. This ensures that high demand analytics teams prioritize efforts on the right work to achieve actionable analytic results.
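As one illustration of the platform tuning mentioned above (table and column names are hypothetical), Hive rewards declaring storage layout explicitly — partitioning on commonly filtered columns and using a columnar format such as ORC — so that queries prune partitions instead of scanning the full dataset:

```sql
-- Queries filtering on event_date will scan only the matching partitions;
-- ORC with compression reduces I/O for column-oriented analytic reads
CREATE TABLE events (
  user_id    BIGINT,
  event_type STRING,
  payload    STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS  -- supports efficient joins and sampling
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```

Choices like partition keys and bucket counts are workload-dependent, which is why this kind of tuning belongs on a pre-deployment checklist rather than being left to defaults.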
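Quantifiable data quality checks can themselves be expressed in SQL and scheduled alongside each load. A hypothetical sketch (the `dq_audit` table and all column names are illustrative) that persists row counts, null rates and duplicate keys for stewardship dashboards:

```sql
-- A per-load-date quality audit, persisted so stewards can monitor trends
INSERT INTO TABLE dq_audit
SELECT
  '2015-06-01'                                          AS load_date,
  COUNT(*)                                              AS row_count,
  SUM(CASE WHEN order_total IS NULL THEN 1 ELSE 0 END)
    / COUNT(*)                                          AS null_rate,
  COUNT(*) - COUNT(DISTINCT order_id)                   AS duplicate_keys
FROM sales_raw
WHERE order_date = '2015-06-01';
```

Publishing these measurements next to the governed metrics they describe is one concrete way to make data quality transparent to consumers.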
Fail-Fast, But Build Solid Foundations
Common sense guidelines and a balanced approach ensure sustainable results from Hadoop investments. Mature industries are looking to Hadoop for increased agility and affordability, but these new opportunities should not come at the expense of long-term stability. Increasingly, Hadoop is less of a disruptive paradigm shift and more of an extension of core data management and analytic capabilities. New SQL on Hadoop features have democratized Big Data, enabling a broad variety of experience to contribute to the new era of data analytics. After all, a balanced approach to Big Data can provide the agility to drive new insight while providing a guiding framework to capitalize on investments over the long term.