Cloud Data Warehousing: Understanding Your Options
Cloud data warehouses have emerged as the go-to repositories for amassing huge amounts of data and running advanced analytics and AI upon it. This is great news for customers, who no longer must worry about provisioning enough storage and compute infrastructure to handle surges in workloads. But not all cloud data warehouses are created equal, which makes it important to research the offerings so you can choose the right one.
The first step in the cloud data warehouse selection process is to get a good understanding of the different offerings that are available on the clouds today.
The big three public cloud providers–Amazon Web Services, Google Cloud, and Microsoft Azure–all offer cloud data warehouses that live atop their various cloud storage offerings: S3, Google Cloud Storage, and Azure Data Lake Storage (ADLS) Gen 2, respectively. The use of these data warehouses is growing quickly as organizations look to get out of the business of managing the computing environments underlying their business applications.
As these organizations move to the cloud, they’re naturally looking to the cloud data warehouses of the Big Three, which is where the analysis starts.
Cloud DWs: The Big Three
Google BigQuery took home top honors in a recent Forrester round-up of cloud data warehouses. Analyst Noel Yuhanna applauded BigQuery’s serverless nature and its ability to scale to hundreds of petabytes. Integration with various other Google Cloud offerings, like Cloud AI Notebooks and Bigtable, was another plus. Its support for AI and machine learning capabilities, and for Spark-based data engineering, also got high marks. GIS and BI options also await BigQuery users.
Amazon Redshift, which is based upon the ParAccel MPP database engine that is today owned by Actian, is another top performer in the Forrester Wave. Customers like Redshift because it offers a myriad of integrations with other Amazon offerings, such as SageMaker, EMR, Kinesis, and Elasticsearch Service. Customers cite integration, performance at scale, and the serverless architecture as plusses. Its SQL dialect is based on Postgres, which users also like. Redshift users who want separation of compute and storage should choose the RA3 option, which AWS launched in 2020.
Microsoft Azure Synapse Analytics brings together data warehousing, integration, and real-time analytics for a unified data analytics and AI experience. Forrester applauded the cloud-native offering (i.e. compute and storage can be scaled independently) for its support for ETL (via Azure Data Factory), business intelligence, and machine learning workloads in a single repository. Spark integration is also a plus in Forrester’s book, while customers cited product functionality, scalability, and availability among the top features.
More Cloud Data Warehouses
But limiting yourself to the offerings of the Big Three would be a mistake, as there is a wide and growing number of third-party data warehouse providers to choose from. Many, if not all, of these third-party data warehouses run across all three of the big public clouds, providing flexibility for customers who are concerned about vendor lock-in.
You may have heard of a company called Snowflake, which delivers a fully managed analytics service atop a scalable SQL engine that also supports AI and machine learning. Customers appreciate the offering’s scalability, its “time travel” feature, and its ease of administration. Data sharing is also a strength for Snowflake, which has moved strongly into the emerging ecosystem for data sharing and third-party data.
Teradata also is in the running among the top tier of cloud data warehouse providers. The company developed what’s widely considered the gold standard for on-prem data warehouses. But lately, Teradata has pivoted hard to the cloud with its Vantage platform that combines SQL analytics with machine learning, and that work appears to be paying off.
Oracle rounds out the top tier of cloud data warehouses in the Forrester Wave. Yuhanna and company applaud Oracle’s Autonomous Data Warehouse for the simplicity of provisioning, configuring, tuning, and management, as well as its built-in Web-based notebook, broad SQL access, elastic scale, and support for concurrent workloads. However, the offering, which is based on the Exadata Database Appliance, is only available on Oracle Cloud.
Cloudera is trying hard to shake its Hadoop image, and providing a cloud data warehouse is a good way to do that. The company has focused on simplifying self-service analytics with its Impala-based Cloudera Data Warehouse, and Forrester noted how its Shared Data Experience (SDX) provides consistent security and governance across the entire Cloudera Data Platform (CDP).
Vertica is another on-prem data warehouse that’s seeing new life in the cloud. Now owned by Micro Focus, Vertica in the cloud offers the same benefits that are available with “newer” offerings, like support for S3 storage, separation of compute and storage, and integrated machine learning.
SAP Data Warehouse is a unified analytics database that runs on the SAP HANA Cloud. Forrester applauded its data management capabilities (data transformation, modeling, cataloging, etc.) and how it enables collaboration among groups.
IBM is offering something called the Db2 Warehouse on Cloud, which offers SQL analytics and machine learning capabilities integrated with Hadoop and Spark platforms. Forrester says IBM’s advantage is its data management, column store, compression, and in-memory processing.
Exasol made a name for itself with an on-prem, in-memory database, but today it’s running in the cloud and can spill over to disk too. Self-tuning functions help optimize analytics and machine learning workloads, thereby minimizing maintenance, per Forrester.
Alibaba also makes Forrester’s list of top cloud data warehousing platforms. The company offers an array of analytics and machine learning capabilities. But it’s limited primarily to customers in China.
Yellowbrick also made the cut in Forrester’s comparison. Like most third-party data warehouses, Yellowbrick supports all three public clouds. Plusses of the offering include real-time streaming ingest and hybrid cloud geo-replication across data centers, which is something the Big Three cannot do.
Databricks didn’t make the cut in Forrester’s report, but we would be remiss to not mention the company and its Apache Spark-based offerings. Databricks has spearheaded the idea of a data lakehouse that seeks to combine the benefits of a centralized data warehouse that maintains highly structured data, with the scalability benefits of a data lake that stores less structured data.
Firebolt is another player in the cloud data warehouse market that customers should keep an eye on. The company, which currently supports just AWS, brings novel compression techniques to bear on data residing in S3.
There are a number of other data warehousing offerings that users should be aware of, including several that run atop Presto, an open source query engine. A key difference between Presto and traditional column-oriented databases is that Presto queries the data wherever it resides, eliminating the need to ETL the data into a central repository, which is a major source of cost and complexity in data warehousing.
Starburst develops and sells a version of Presto (which is now called Trino) that can be used for both cloud and on-prem data analytics. Starburst takes a federated approach to data analytics, whereby queries are pushed down to the source database or file system, and the results are aggregated and returned to the user.
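To make the federated idea concrete, here is a minimal Python sketch of query push-down. It is an illustration only, not how Starburst or Trino is implemented: two in-memory SQLite databases stand in for separate data sources, and the table name, the helper functions make_source and federated_total_by_region, and the sample rows are all hypothetical. The filter and partial aggregation run inside each source, and the coordinator only merges the small partial results.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory SQLite database standing in for one data source (hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


def federated_total_by_region(sources, min_amount):
    """Push the filter and partial aggregation to each source, then merge the results."""
    pushed_down = (
        "SELECT region, SUM(amount) FROM orders "
        "WHERE amount >= ? GROUP BY region"
    )
    totals = {}
    for conn in sources:
        # Each source filters and aggregates its own rows; only the
        # per-region partial sums travel back to the coordinator.
        for region, partial in conn.execute(pushed_down, (min_amount,)):
            totals[region] = totals.get(region, 0.0) + partial
    return totals


# Two independent "systems" holding the same kind of data (sample rows are made up).
warehouse = make_source([("east", 100.0), ("west", 40.0), ("east", 5.0)])
data_lake = make_source([("east", 60.0), ("west", 10.0), ("north", 25.0)])

result = federated_total_by_region([warehouse, data_lake], min_amount=20.0)
print(result)
```

The point of the design is that no rows are copied into a central repository first; each source does the heavy lifting locally, and only the per-region sums are combined.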
Another Presto company is Ahana, which launched about a year ago. Ahana offers a managed version of its Presto analytics engine running on AWS, thereby eliminating the need for users to manage Presto (which can be tricky).
AWS is also in the Presto business with Amazon Athena, which runs as a serverless service on Amazon Elastic MapReduce (EMR). Like other Presto offerings, Athena lets users query data residing in S3. It’s often deployed with other Amazon offerings, including Glue.
Finally there is Varada, which also offers a version of Presto. Like other Presto offerings, Varada layers additional functionality, such as indexing, caching, and a query optimizer, atop the core Presto engine to provide a more complete data analytics solution.
Peeking outside of the Presto community, we find Dremio, which has been gaining traction with its data analytics solution built atop Apache Arrow. Dremio has bolstered its core federated query capability with features like native query push-downs, data caching, and a query planner to improve performance. That makes it another option for cloud analytic workloads, although, like the Presto offerings, it’s not technically a data warehouse.
These are heady days for cloud data warehousing. Customers have a plethora of solutions to choose from, and the alternatives are continuously being enhanced with compelling new features. Customers are reporting good returns from their cloud data warehouse investments, and the practice is sure to grow in the coming years.