Data Lakes Are Legacy Tech, Fivetran CEO Says
By some accounts, data lakes appear poised to supplant data warehouses as the center of gravity of modern analytics systems, particularly with today’s sophisticated data virtualization capabilities. But with the advent of cloud data warehouses that separate compute and storage, companies should take a hard look at their data lakes.
That’s the message that Fivetran CEO and co-founder George Fraser delivered during his keynote address for the “Modern Data Stack Conference 2020,” his company’s virtual conference that took place two weeks ago.
“In my opinion, data lakes are not part of the modern data stack. Data lakes are legacy,” Fraser said. “There are organizational [and] quasi-political reasons why people adopt data lakes. But there are no longer technical reasons for adopting data lakes.”
Fraser is not advocating that companies (virtually) rip out their data lakes, like Amazon Web Services’ S3 or Microsoft Azure’s Data Lake Storage, and replace them with a cloud data warehouse, such as Snowflake’s offering or Google Cloud’s BigQuery.
“There are a lot of people who have data lakes because they inherited them, and that’s a perfectly valid reason to keep them going,” he said. “If you have a working system, you should keep using it. You shouldn’t take it out just for the sake of change.”
But if you were to build a new system from scratch, Fraser has a list of reasons why you should select a data warehouse rather than a data lake to live at the center of your modern data stack.
The first reason is cost. Data lakes grew in popularity because they carried a sizable cost advantage over data warehouses, Fraser said. Staging data in data lakes, which are essentially massive key-value stores accessible via a REST API, was much more affordable than storing data in a massively parallel processing (MPP) column-oriented relational database.
But with the separation of storage and compute in modern cloud data warehouses, the cost of storage has come down considerably, while simultaneously freeing customers to scale up processing to meet sudden changes in demand for massive SQL workloads. That essentially eliminated cost as a good reason for going with the data lake. And without the cost advantage, the technical shortcomings of analyzing data stored in a data lake compared to a data warehouse begin to rear their heads, Fraser said.
“Data warehouses that separate compute from storage have all of the advantages of data lakes and more,” Fraser continued. “They give you the kind of user management [you expect]. They give you even better performance [than data lakes] because they can do optimizations by controlling both the storage format and the compute format…Data warehouses are just fundamentally more user friendly than data lakes are.”
Fivetran develops software designed to load data from source systems into cloud data warehouses. This is sometimes called ETL (extract, transform, and load), except that Fivetran has adopted the ELT method, whereby the transformation of the data occurs in the data warehouse after loading (and Fivetran itself doesn’t really do much of the “T” anyway).
Armed with a cloud data warehouse, Fivetran, and a dedicated transformation tool, such as the Data Build Tool (dbt) offering from Fishtown Analytics (which, like Fivetran, is funded by Andreessen Horowitz), companies are well-equipped to meet the demands of modern data analytics, Fraser argued.
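The ELT pattern described above can be sketched in a few lines of Python. This is only an illustration, not Fivetran’s or dbt’s actual implementation: the standard-library sqlite3 module stands in for a cloud data warehouse, and the table names (raw_orders, customer_revenue), the sample records, and the function names are all hypothetical. The point is the ordering: source rows are loaded untouched, and the cleanup happens afterward as SQL running inside the warehouse.

```python
import sqlite3

# Hypothetical source records, as an extract step might deliver them.
RAW_ORDERS = [
    ("o-1", "2020-10-01", "  Alice ", "19.99"),
    ("o-2", "2020-10-01", "bob", "5.00"),
    ("o-3", "2020-10-02", "ALICE", "7.50"),
]


def load_raw(conn, rows):
    """E and L: copy source rows into the warehouse unchanged."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id TEXT, order_date TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", rows)


def transform_in_warehouse(conn):
    """T: build a cleaned model with SQL, inside the warehouse."""
    conn.execute("DROP TABLE IF EXISTS customer_revenue")
    conn.execute(
        """
        CREATE TABLE customer_revenue AS
        SELECT lower(trim(customer)) AS customer,
               ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
        FROM raw_orders
        GROUP BY lower(trim(customer))
        """
    )


def run_elt():
    # sqlite3 in memory is an assumption standing in for the warehouse.
    conn = sqlite3.connect(":memory:")
    load_raw(conn, RAW_ORDERS)
    transform_in_warehouse(conn)
    return conn.execute(
        "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    print(run_elt())
```

Because the raw table is kept as loaded, the transformation can be changed and rerun without going back to the source system, which is the practical appeal of doing the “T” last.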
The popularity of cloud data warehouses manifested itself last month with Snowflake’s massive IPO, which valued the company at $68 billion. Other cloud data warehouses, including AWS’ Redshift, Microsoft’s Synapse Analytics, Google’s BigQuery, and Databricks’ Unified Data Service (which includes SQL processing as well as support for machine learning), have also seen their popularity rise.
“The data warehouse that you should have at the center of a modern data stack should be based on MPP column-store technology,” Fraser said. The data warehouse “is the part of the modern data stack that really started the revolution. It underwent this extremely technical change that enabled everything else in the modern data stack to happen.”
Fraser said that his views on data lakes may be a bit controversial. They certainly go against what others in the big data community are saying, including the folks at Dremio, which recently updated its data virtualization service to enable large-scale SQL analytics directly against data lakes. A similar federated approach to analytics is enabled by products like Presto, which can power SQL queries against cloud data stores, file systems, databases, and Kafka, as well as Hive, which was built to run atop HDFS.
By keeping analytics and data storage separate, customers can eliminate the need for a data warehouse and all of the ETL (or ELT) that goes along with it, says Dremio co-founder and chief product officer Tomer Shiran.
“All this data movement and the need to create data marts and extracts and aggregation tables and all that creates a huge amount of cost and also a long delay,” Shiran told Datanami. “Anytime you want to change the dashboard or change the data, you have to wait weeks.”
There are multiple camps on the data centralization question, and even advocates of federated approaches admit there are cases where centralization makes sense. Fraser sees a bright future in centralizing data and simplifying data integration and ETL/ELT to the greatest extent possible. With a valuation in excess of $1 billion, Fivetran is gaining speed.
Despite the focus on cloud data warehouses, Fivetran is actively working to support data lakes with its product, Fraser said. “So despite my opinion that they’re not the optimal solution in the world of the modern data stack, we are capable of listening,” he said.
But Fraser didn’t stop with some friendly data lake bashing. He also proclaimed that, thanks to recent advances, cloud data warehouses can replace Kafka in some cases.
“With the emergence of stream processing, particularly in Snowflake, where you can create these tasks and streams to process data incrementally, you can do the kinds of workflows that previously you would need to hire an entire software engineering team and build a stream processing [system] on to a message broker, like Kafka,” he said. “Now you can do [that] with a SQL query inside of a data warehouse.”
It’s not only easier to build a stream processing system with SQL, but it’s also easier to maintain, he said. Data warehouses can’t deliver answers with millisecond latencies, he said, but they work well in use cases that require latencies measured in seconds.
To that end, Fivetran is working to boost the frequency at which it can update a cloud data warehouse. The current limit is one minute, down considerably from the past: it started at once per day, then moved to one hour, then 15 minutes. The company is working to push latency lower.
“With new features being developed in the data warehouse, latencies of seconds and tens of seconds are fundamentally possible and we’re working hard at Fivetran to keep battling every element of the pipeline, keep battling down that latency number,” Fraser said. “Because we understand that lower latency is going to enable all of these exciting use cases.”
You can view Fraser’s keynote address here.