Data Lakes Are Legacy Tech, Fivetran CEO Says
By some accounts, data lakes appear poised to supplant data warehouses as the center of gravity of modern analytics systems, particularly with today’s sophisticated data virtualization capabilities. But with the advent of cloud data warehouses that separate compute and storage, companies should take a hard look at their data lakes.
That’s the message that Fivetran CEO and co-founder George Fraser delivered during his keynote address for the “Modern Data Stack Conference 2020,” his company’s virtual conference that took place two weeks ago.
“In my opinion, data lakes are not part of the modern data stack. Data lakes are legacy,” Fraser said. “There are organizational [and] quasi-political reasons why people adopt data lakes. But there are no longer technical reasons for adopting data lakes.”
Fraser is not advocating that companies (virtually) rip out their data lakes, like Amazon Web Services’ S3 or Microsoft Azure’s Data Lake Storage, and replace them with a cloud data warehouse, such as Snowflake’s offering or Google Cloud’s BigQuery.
“There are a lot of people who have data lakes because they inherited them, and that’s a perfectly valid reason to keep them going,” he said. “If you have a working system, you should keep using it. You shouldn’t take it out just for the sake of change.”
But if you were to build a new system from scratch, Fraser has a list of reasons why you should select a data warehouse rather than a data lake to live at the center of your modern data stack.
The first reason is cost. Data lakes grew in popularity because they carried a sizable cost advantage over data warehouses, Fraser said. Staging data in data lakes, which are essentially massive key-value stores accessible via a REST API, was much more affordable than storing data in a massively parallel processing (MPP) column-oriented relational database.
But with the separation of storage and compute in modern cloud data warehouses, the cost of storage has come down considerably, while simultaneously freeing customers to scale up processing to meet sudden changes in demand for massive SQL workloads. That essentially eliminated cost as a good reason for going with the data lake. And without the cost advantage, the technical shortcomings of analyzing data stored in a data lake compared to a data warehouse begin to rear their heads, Fraser said.
“Data warehouses that separate compute from storage have all of the advantages of data lakes and more,” Fraser continued. “They give you the kind of user management [you expect]. They give you even better performance [than data lakes] because they can do optimizations by controlling both the storage format and the compute format…Data warehouses are just fundamentally more user friendly than data lakes are.”
Fivetran develops software that loads data from source systems into cloud data warehouses, a process commonly called ETL (extract, transform, and load). Fivetran, however, has adopted the ELT method, in which the data is transformed inside the data warehouse after it is loaded (and Fivetran itself doesn’t really do much of the “T” anyway).
Armed with a cloud data warehouse, Fivetran, and a dedicated transformation tool, such as the Data Build Tool (dbt) offering from Fishtown Analytics (which, like Fivetran, is funded by Andreessen Horowitz), companies are well-equipped to meet the demands of modern data analytics, Fraser argued.
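The ELT pattern described above can be illustrated with a minimal sketch. It is not Fivetran’s or dbt’s actual implementation: an in-memory SQLite database stands in for the cloud data warehouse, and the table names (`raw_orders`, `daily_revenue`) and sample rows are hypothetical. The point is only the ordering: raw rows are loaded untouched, and the transformation runs afterwards as SQL inside the warehouse.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for a cloud data warehouse.
conn = sqlite3.connect(":memory:")

# Extract + Load: copy source rows into a raw table without reshaping them.
# These rows are made-up sample data; amounts arrive as untyped strings.
source_rows = [
    ("o-1", "2020-10-01", "19.99"),
    ("o-2", "2020-10-01", "5.00"),
    ("o-3", "2020-10-02", "42.50"),
]
conn.execute(
    "CREATE TABLE raw_orders (order_id TEXT, order_date TEXT, amount TEXT)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", source_rows)

# Transform: performed after loading, as SQL inside the warehouse itself.
conn.execute(
    """
    CREATE TABLE daily_revenue AS
    SELECT order_date,
           ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
    FROM raw_orders
    GROUP BY order_date
    """
)

daily_revenue = conn.execute(
    "SELECT order_date, revenue FROM daily_revenue ORDER BY order_date"
).fetchall()
print(daily_revenue)
```

Because the raw table is kept as loaded, the transformation can be rewritten and rerun later without going back to the source system, which is the main practical difference from transforming before the load.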
The popularity of cloud data warehouses manifested itself last month with Snowflake’s massive IPO, which valued the company at $68 billion. Other cloud data warehouses, including AWS’ Redshift, Microsoft’s Synapse Analytics, Google’s BigQuery, and Databricks’ Unified Data Service (which includes SQL processing as well as support for machine learning), have also seen their popularity rise.
“The data warehouse that you should have at the center of a modern data stack should be based on MPP column-store technology,” Fraser said. The data warehouse “is the part of the modern data stack that really started the revolution. It underwent this extremely technical change that enabled everything else in the modern data stack to happen.”
Fraser said that his views on data lakes may be a bit controversial. They certainly go against what others in the big data community are saying, including the folks at Dremio, which recently updated its data virtualization service to enable large-scale SQL analytics directly against data lakes. A similar federated approach to analytics is enabled by products like Presto, which can power SQL queries against cloud data stores, file systems, databases, and Kafka, as well as Hive, which was built to run atop HDFS.
By keeping analytics and data storage separate, customers can eliminate the need for a data warehouse and all of the ETL (or ELT) that goes along with it, says Dremio co-founder and chief product officer Tomer Shiran.
“All this data movement and the need to create data marts and extracts and aggregation tables and all that creates a huge amount of cost and also a long delay,” Shiran told Datanami. “Anytime you want to change the dashboard or change the data, you have to wait weeks.”
There are multiple camps on the data centralization question, and even advocates of federated approaches admit there are cases where centralization makes sense. Fraser clearly sees a bright future in centralizing data and simplifying data integration and ETL/ELT to the greatest extent possible. With a valuation in excess of $1 billion, Fivetran is gaining speed.
Despite the focus on cloud data warehouses, Fivetran is actively working to support data lakes with its product, Fraser said. “So despite my opinion that they’re not the optimal solution in the world of the modern data stack, we are capable of listening,” he said.
But Fraser didn’t stop with some friendly data lake bashing. He also proclaimed that, thanks to recent advances, cloud data warehouses can replace Kafka in some cases.
“With the emergence of stream processing, particularly in Snowflake, where you can create these tasks and streams to process data incrementally, you can do the kinds of workflows that previously you would need to hire an entire software engineering team and build a stream processing [system] on to a message broker, like Kafka,” he said, “now you can do with a SQL query inside of a data warehouse.”
It’s not only easier to build a stream processing system with SQL, but it’s also easier to maintain, he said. Data warehouses can’t deliver answers with millisecond latencies, he said, but they work well in use cases that require latencies measured in seconds.
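The incremental idea behind the streams and tasks Fraser mentions can be sketched in a few lines. This is a pure-Python illustration, not Snowflake’s API: `ChangeStream`, `RunningTotals`, and `run_task` are hypothetical names, and a plain list stands in for an append-only table. The stream remembers how far it has read, and each task run processes only the rows added since the previous run.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeStream:
    """Hypothetical stand-in for a change stream over an append-only table."""
    table: list
    offset: int = 0  # index of the first row not yet consumed

    def consume(self) -> list:
        # Return only rows added since the last call, then advance the offset.
        new_rows = self.table[self.offset:]
        self.offset = len(self.table)
        return new_rows


@dataclass
class RunningTotals:
    """Aggregate that is updated incrementally rather than recomputed."""
    totals: dict = field(default_factory=dict)

    def apply(self, rows: list) -> None:
        for customer, amount in rows:
            self.totals[customer] = self.totals.get(customer, 0) + amount


def run_task(stream: ChangeStream, totals: RunningTotals) -> int:
    """One scheduled run: fold new rows into the aggregate, report the count."""
    rows = stream.consume()
    totals.apply(rows)
    return len(rows)


# Made-up sample events: (customer, amount).
events = [("acme", 10), ("globex", 5)]
stream = ChangeStream(events)
totals = RunningTotals()

first_batch = run_task(stream, totals)   # sees the two initial rows
events.append(("acme", 7))               # a new row arrives in the table
second_batch = run_task(stream, totals)  # sees only the new row
third_batch = run_task(stream, totals)   # nothing new to process
print(first_batch, second_batch, third_batch, totals.totals)
```

How often `run_task` is scheduled sets the end-to-end latency, which is why this style suits use cases measured in seconds rather than milliseconds.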
To that end, Fivetran is working to boost the frequency at which it can update a cloud data warehouse. The current limit is one minute, down considerably from the past: it started at once per day, then moved to one hour, then 15 minutes. The company is working to push latency lower.
“With new features being developed in the data warehouse, latencies of seconds and tens of seconds are fundamentally possible and we’re working hard at Fivetran to keep battling every element of the pipeline, keep battling down that latency number. Because we understand that lower latency is going to enable all of these exciting use cases,” Fraser said.
You can view Fraser’s keynote address here.