Cloud Data Warehouses and Cloud Data Lakes: There’s No Need to Choose
Particularly with the industry spotlight on Snowflake following its recent IPO, there’s no shortage of discussion right now around cloud data warehouses, cloud data lakes, and how the two overlap – or don’t. For many enterprise data and analytics professionals trying to modernize to support ML and AI, there’s still a good deal of confusion on what each type of data solution offers and where the key differences lie.
In this primer, I’ll look at the strengths of each data platform and what each is built to excel at. While cloud data warehouse and cloud data lakes may solve disparate issues, they can – and executed right, should – complement one another. Used in tandem and backed by the power of the cloud, these two architectures can more fully harness the complete data and analytics picture to deliver the value and business insight that enterprises continue to seek out.
Entering the Cloud Data Warehouse
Cloud data warehouses are a decade-old technology that enables analytics by using a mostly relational processing engine – structuring data via tables and columns. Generally categorized as schema-on-write, writes to the data warehouse must adhere to previously established schema. This is true for any deployment style, including cloud data warehouses.
Naturally, SQL is the universal language of cloud data warehouses. JSON with SQL extensions and similar solutions can also allow for semi-structured data and schema-on-read functionality. However, these solutions add in prohibitively strict ACID transaction overhead. Many non-SQL transactions do not require this: schema-on-read can naturally support these applications, while utilizing less stringent transaction semantics and delivering better performance.
Cloud data warehouses also necessitate data to be cleaned and structured in close alignment with the questions and analysis that business applications are enlisted to solve. Any and all necessary schema changes require a long, intensive, and manual process that includes design work and landing the data in preparation for analysis processes.
Data Warehouse User-Defined Extensions (UDX)
Data warehouse relational engines enable advanced analysis by allowing application developers to write user-defined functions (UDFs) and user-defined aggregates (UDAs) – collectively known as user-defined extensions (UDXs). Leveraging UDXs can equip business analytics with a feature set surpassing what can be accomplished using standard SQL. UDXs are used in the same way as other standard SQL functions and aggregates in SQL statements. UDXs offer a full range of use cases and levels of complexity, from simply validating URLs all the way through to statistical functions, encryption/decryption, and compression/decompression.
Utilizing a Data Warehouse for Business Intelligence
Cloud data warehouses are commonly tapped to analyze historical data, support business intelligence applications, and fulfill business analysts’ needs for interactive reporting and other ad hoc tasks. For example, a data warehouse might enable a vendor to analyze their product inventory and sales by location, drilling down into data by country, region, and city. The organization can then leverage those insights to better optimize its supply chain and sales processes.
Diving into the Cloud Data Lake
Cloud Data lakes are generalized data processing platforms. They support modernization with a broader range of data and analytics processing needs when compared to SQL-based data warehouses. Data lakes are schema-on-read, with data schema determined as it arrives. Data lakes are built to handle structured, semi-structured, or unstructured data. For enterprises, data lakes offer a singular unified platform that serves a wider swath of use cases, from data science to data engineering, machine learning, and reporting. Data lakes are able to tap into this breadth of tools, analysis, and data options by leveraging SQL, NoSQL, Apache Spark, Apache Flink, and other data processing engines.
Importantly, data lakes are often incorrectly categorized as simple data stores like AWS S3 or Azure ADLS (and early data lakes did focus on large data storage). In fact now provide the range of storage, processing capabilities, and tools necessary for a complete analytical environment for enterprise modernization. Within cloud data lakes, widely-used SQL processing engines offer support for modern advanced analytics, along with newer open source SQL engines like Impala, Presto, Arrow and others. Apache Spark is popularly used for its in-memory model and ability to process large data sets quickly.
Cloud data lakes are increasingly deployed for data science, advanced analytics, ML and AI. For example, data can be pre-processed with Python or R to interact with an Apache Spark framework, then fed to a data science application to enable machine learning or predictive analytics. A manufacturer might utilize a cloud data lake to land IoT sensor data from its products in the field, and then perform data analysis to predict failures and address proactive remediation. In a case like that, a cloud data lake is essential to harness semi-structured data and apply the advanced, timely analytics required for success.
Comparing cloud data warehouse and cloud data lake governance models
With schema-on-write data warehouses, schema changes are…expensive. There’s really no great way to get around that. Therefore, data warehouses are governed by strict change control processes, overseeing all schema changes or data additions. By contrast, data lakes tend to have flexible governance models. In practice, a data lake might use a strict governance model for ingestion of core data alongside a more lenient model for quickly-ingested data, such as ad hoc datasets used for more exploratory analysis.
Data Lake and Warehouse Co-Existence
The cloud enables wholly separate provisioning and scaling of storage and compute resources. A new generation of cloud data warehouses and cloud data lakes now make the most of those capabilities to provide analytics in a flexible, scalable and, just as importantly, cost-efficient way. Data on a cloud object store can commonly be shared across both a cloud data warehouse and cloud data lake, with no duplication necessary in order for data ingestion and transformation to proceed.
With the cloud democratizing data and pressure to modernize, enterprises don’t need to choose between the strengths of cloud data warehouses and cloud data lakes: co-existence models allow them to easily realize the best of both worlds.
About the author: Venkat Chandra is a Data Architect at Cazena, which provides instant cloud data lakes for enterprises. Prior to Cazena, he was a Senior Engineer at IBM, working on data warehouses.