Data Lakes Are Legacy Tech, Fivetran CEO Says
By some accounts, data lakes appear poised to supplant data warehouses as the center of gravity of modern analytics systems, particularly with today’s sophisticated data virtualization capabilities. But with the advent of cloud data warehouses that separate compute and storage, companies should take a hard look at their data lakes.
That’s the message that Fivetran CEO and co-founder George Fraser delivered during his keynote address for the “Modern Data Stack Conference 2020,” his company’s virtual conference that took place two weeks ago.
“In my opinion, data lakes are not part of the modern data stack. Data lakes are legacy,” Fraser said. “There are organizational [and] quasi-political reasons why people adopt data lakes. But there are no longer technical reasons for adopting data lakes.”
Fraser is not advocating that companies (virtually) rip out their data lakes, like Amazon Web Services’ S3 or Microsoft Azure’s Data Lake Storage, and replace them with a cloud data warehouse, such as Snowflake’s offering or Google Cloud’s BigQuery.
“There are a lot of people who have data lakes because they inherited them, and that’s a perfectly valid reason to keep them going,” he said. “If you have a working system, you should keep using it. You shouldn’t take it out just for the sake of change.”
But if you were to build a new system from scratch, Fraser has a list of reasons why you should select a data warehouse rather than a data lake to live at the center of your modern data stack.
The first reason is cost. Data lakes grew in popularity because they carried a sizable cost advantage over data warehouses, Fraser said. Staging data in data lakes, which are essentially massive key-value stores accessible via a REST API, was much more affordable than storing data in a massively parallel processing (MPP) column-oriented relational database.
But with the separation of storage and compute in modern cloud data warehouses, the cost of storage has come down considerably, while simultaneously freeing customers to scale up processing to meet sudden changes in demand for massive SQL workloads. That essentially eliminated cost as a good reason for going with the data lake. And without the cost advantage, the technical shortcomings of analyzing data stored in a data lake compared to a data warehouse begin to rear their heads, Fraser said.
“Data warehouses that separate compute from storage have all of the advantages of data lakes and more,” Fraser continued. “They give you the kind of user management [you expect]. They give you even better performance [than data lakes] because they can do optimizations by controlling both the storage format and the compute format…Data warehouses are just fundamentally more user friendly than data lakes are.”
Fivetran develops software that loads data from source systems into cloud data warehouses. The process is often called ETL (extract, transform, and load), but Fivetran follows the ELT method, in which the transformation occurs inside the data warehouse after the data has been loaded (and Fivetran itself doesn’t really do much of the “T” anyway).
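To make the ELT ordering concrete, here is a minimal sketch in Python. It is not Fivetran's or dbt's code: an in-memory SQLite database stands in for the cloud data warehouse, and the table names, sample rows, and function names are all hypothetical. The point is only the sequence: rows land in the warehouse untouched, and the transformation then runs as SQL inside it.

```python
import sqlite3

# Hypothetical source rows, as an extract-and-load tool might deliver them.
# Amounts arrive as text because nothing has been transformed yet.
RAW_ORDERS = [
    ("o1", "alice", "20.00", "2020-10-01"),
    ("o2", "bob", "5.00", "2020-10-01"),
    ("o3", "alice", "7.50", "2020-10-02"),
]


def load_raw(conn, rows):
    """E and L: copy source rows into the warehouse unchanged."""
    conn.execute(
        "CREATE TABLE raw_orders "
        "(order_id TEXT, customer TEXT, amount TEXT, order_date TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", rows)


def transform_in_warehouse(conn):
    """T: a SQL statement executed inside the warehouse, after the load."""
    conn.execute(
        """
        CREATE TABLE customer_totals AS
        SELECT customer,
               SUM(CAST(amount AS REAL)) AS total_spent,
               COUNT(*) AS order_count
        FROM raw_orders
        GROUP BY customer
        """
    )


def run_elt(rows):
    """Load first, transform second, then read back the modeled table."""
    conn = sqlite3.connect(":memory:")  # stand-in for a cloud data warehouse
    load_raw(conn, rows)
    transform_in_warehouse(conn)
    return conn.execute(
        "SELECT customer, total_spent, order_count "
        "FROM customer_totals ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    for row in run_elt(RAW_ORDERS):
        print(row)
```

In a classic ETL pipeline, the type casting and aggregation would happen in a separate system before the load. Here the raw table stays available in the warehouse, so the transformation can be changed and rerun without extracting the data again.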
Armed with a cloud data warehouse, Fivetran, and a dedicated transformation tool, such as the Data Build Tool (dbt) offering from Fishtown Analytics (which, like Fivetran, is funded by Andreessen Horowitz), companies are well-equipped to meet the demands of modern data analytics, Fraser argued.
The popularity of cloud data warehouses manifested itself last month with Snowflake’s massive IPO, which valued the company at $68 billion. Other cloud data warehouses, including AWS’ Redshift, Microsoft’s Synapse Analytics, Google’s BigQuery, and Databricks’ Unified Data Service (which includes SQL processing as well as support for machine learning), have also seen their popularity rise.
“The data warehouse that you should have at the center of a modern data stack should be based on MPP column-store technology,” Fraser said. The data warehouse “is the part of the modern data stack that really started the revolution. It underwent this extremely technical change that enabled everything else in the modern data stack to happen.”
Fraser said that his views on data lakes may be a bit controversial. They certainly go against what others in the big data community are saying, including the folks at Dremio, which recently updated its data virtualization service to enable large-scale SQL analytics directly against data lakes. A similar federated approach to analytics is enabled by products like Presto, which can power SQL queries against cloud data stores, file systems, databases, and Kafka, as well as Hive, which was built to run atop HDFS.
By keeping analytics and data storage separate, customers can eliminate the need for a data warehouse and all of the ETL (or ELT) that goes along with it, says Dremio co-founder and chief product officer Tomer Shiran.
“All this data movement and the need to create data marts and extracts and aggregation tables and all that creates a huge amount of cost and also a long delay,” Shiran told Datanami. “Anytime you want to change the dashboard or change the data, you have to wait weeks.”
There are multiple camps on the data centralization question, and even advocates of federated approaches admit there are cases where centralization makes sense. Fraser clearly sees a bright future in centralizing data and simplifying data integration and ETL/ELT to the greatest extent possible. With a valuation in excess of $1 billion, Fivetran is gaining speed.
Despite the focus on cloud data warehouses, Fivetran is actively working to support data lakes with its product, Fraser said. “So despite my opinion that they’re not the optimal solution in the world of the modern data stack, we are capable of listening,” he said.
But Fraser didn’t stop with some friendly data lake bashing. In fact, he also proclaimed that, thanks to advances in cloud data warehouses, they can also replace Kafka in some cases.
“With the emergence of stream processing, particularly in Snowflake, where you can create these tasks and streams to process data incrementally, you can do the kinds of workflows that previously you would need to hire an entire software engineering team and build a stream processing [system] on to a message broker, like Kafka,” he said. “Now you can do [that] with a SQL query inside of a data warehouse.”
It’s not only easier to build a stream processing system with SQL, but it’s also easier to maintain, he said. Data warehouses can’t deliver answers with millisecond latencies, he said, but they work well in use cases that require latencies measured in seconds.
To that end, Fivetran is currently working to boost the frequency at which it can update a cloud data warehouse. Currently, the limit is one minute. That’s down considerably from the past. It started at once per day, then moved to one hour, then 15 minutes. The company is working to push latency lower.
“With new features being developed in the data warehouse, latencies of seconds and tens of seconds are fundamentally possible and we’re working hard at Fivetran to keep battling every element of the pipeline, keep battling down that latency number. Because we understand that lower latency is going to enable all of these exciting use cases,” Fraser said.
You can view Fraser’s keynote address here.