Data Lakes Are Legacy Tech, Fivetran CEO Says
By some accounts, data lakes appear poised to supplant data warehouses as the center of gravity of modern analytics systems, particularly with today’s sophisticated data virtualization capabilities. But with the advent of cloud data warehouses that separate compute and storage, companies should take a hard look at their data lakes.
That’s the message that Fivetran CEO and co-founder George Fraser delivered during his keynote address for the “Modern Data Stack Conference 2020,” his company’s virtual conference that took place two weeks ago.
“In my opinion, data lakes are not part of the modern data stack. Data lakes are legacy,” Fraser said. “There are organizational [and] quasi-political reasons why people adopt data lakes. But there are no longer technical reasons for adopting data lakes.”
Fraser is not advocating that companies (virtually) rip out their data lakes, like Amazon Web Services’ S3 or Microsoft Azure’s Data Lake Storage, and replace them with a cloud data warehouse, such as Snowflake’s offering or Google Cloud’s BigQuery.
“There are a lot of people who have data lakes because they inherited them, and that’s a perfectly valid reason to keep them going,” he said. “If you have a working system, you should keep using it. You shouldn’t take it out just for the sake of change.”
But if you were to build a new system from scratch, Fraser has a list of reasons why you should select a data warehouse rather than a data lake to live at the center of your modern data stack.
The first reason is cost. Data lakes grew in popularity because they carried a sizable cost advantage over data warehouses, Fraser said. Staging data in a data lake, which is essentially a massive key-value store accessible via a REST API, was much more affordable than storing data in a massively parallel processing (MPP) column-oriented relational database.
But with the separation of storage and compute in modern cloud data warehouses, the cost of storage has come down considerably, while simultaneously freeing customers to scale up processing to meet sudden changes in demand for massive SQL workloads. That essentially eliminated cost as a good reason for going with the data lake. And without the cost advantage, the technical shortcomings of analyzing data stored in a data lake compared to a data warehouse begin to rear their heads, Fraser said.
“Data warehouses that separate compute from storage have all of the advantages of data lakes and more,” Fraser continued. “They give you the kind of user management [you expect]. They give you even better performance [than data lakes] because they can do optimizations by controlling both the storage format and the compute format…Data warehouses are just fundamentally more user friendly than data lakes are.”
Fivetran develops software designed to load data from source systems into cloud data warehouses. That process is sometimes called ETL (extract, transform, and load), except that Fivetran has adopted the ELT method, whereby the transformation of the data occurs in the data warehouse after loading (and Fivetran itself doesn’t really do much of the “T” anyway).
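To make the ELT ordering concrete, here is a minimal sketch in Python. It is not Fivetran’s or dbt’s code: an in-memory SQLite database stands in for the cloud data warehouse, and the table names, column names, and sample rows are all hypothetical. The point is only that the raw rows land first and the transformation then runs as SQL inside the warehouse.

```python
import sqlite3


def load_raw(conn, rows):
    """'E' and 'L': copy source rows into the warehouse without reshaping them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(id INTEGER, customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)


def transform_in_warehouse(conn):
    """'T': runs afterwards, as SQL executed by the warehouse itself."""
    conn.execute("DROP TABLE IF EXISTS customer_totals")
    conn.execute(
        """
        CREATE TABLE customer_totals AS
        SELECT customer, SUM(amount_cents) / 100.0 AS total_dollars
        FROM raw_orders
        GROUP BY customer
        """
    )


# SQLite is only a stand-in for a cloud data warehouse (an assumption).
conn = sqlite3.connect(":memory:")
load_raw(conn, [(1, "acme", 1250), (2, "acme", 750), (3, "globex", 500)])
transform_in_warehouse(conn)
totals = dict(conn.execute("SELECT customer, total_dollars FROM customer_totals"))
print(totals)
```

Because the raw table is kept as loaded, the transformation can be changed and rerun later without going back to the source system.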
Armed with a cloud data warehouse, Fivetran, and a dedicated transformation tool, such as the Data Build Tool (dbt) offering from Fishtown Analytics (which, like Fivetran, is funded by Andreessen Horowitz), companies are well-equipped to meet the demands of modern data analytics, Fraser argued.
The popularity of cloud data warehouses manifested itself last month with Snowflake’s massive IPO, which valued the company at $68 billion. Other cloud data warehouses, including AWS’ Redshift, Microsoft’s Azure Synapse Analytics, Google’s BigQuery, and Databricks’ Unified Data Service (which includes SQL processing as well as support for machine learning), have also seen their popularity rise.
“The data warehouse that you should have at the center of a modern data stack should be based on MPP column-store technology,” Fraser said. The data warehouse “is the part of the modern data stack that really started the revolution. It underwent this extremely technical change that enabled everything else in the modern data stack to happen.”
Fraser said that his views on data lakes may be a bit controversial. They certainly go against what others in the big data community are saying, including the folks at Dremio, which recently updated its data virtualization service to enable large-scale SQL analytics directly against data lakes. A similar federated approach to analytics is enabled by products like Presto, which can power SQL queries against cloud data stores, file systems, databases, and Kafka, as well as Hive, which was built to run atop HDFS.
By keeping analytics and data storage separate, customers can eliminate the need for a data warehouse and all of the ETL (or ELT) that goes along with it, says Dremio co-founder and chief product officer Tomer Shiran.
“All this data movement and the need to create data marts and extracts and aggregation tables and all that creates a huge amount of cost and also a long delay,” Shiran told Datanami. “Anytime you want to change the dashboard or change the data, you have to wait weeks.”
Clearly, there are multiple camps on the data centralization question, and even advocates of federated approaches admit there are cases where centralization makes sense. Fraser, for his part, sees a bright future in centralizing data and simplifying data integration and ETL/ELT to the greatest extent possible. With a valuation in excess of $1 billion, Fivetran is gaining speed.
Despite the focus on cloud data warehouses, Fivetran is actively working to support data lakes with its product, Fraser said. “So despite my opinion that they’re not the optimal solution in the world of the modern data stack, we are capable of listening,” he said.
But Fraser didn’t stop with some friendly data lake bashing. He also proclaimed that, thanks to recent advances, cloud data warehouses can replace Kafka in some cases.
“With the emergence of stream processing, particularly in Snowflake, where you can create these tasks and streams to process data incrementally, you can do the kinds of workflows that previously you would need to hire an entire software engineering team and build a stream processing [system] on to a message broker, like Kafka,” he said. “Now you can do [that] with a SQL query inside of a data warehouse.”
It’s not only easier to build a stream processing system with SQL, but it’s also easier to maintain, he said. Data warehouses can’t deliver answers with millisecond latencies, he said, but they work well in use cases that require latencies measured in seconds.
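The incremental pattern Fraser describes can be illustrated with a small sketch. This is not Snowflake’s streams and tasks syntax. It emulates the idea in Python with an in-memory SQLite database: a stored offset plays the role of the stream’s position, and a hypothetical `run_task` function folds only the rows added since the last run into an aggregate table. All table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a warehouse (an assumption)
conn.executescript(
    """
    CREATE TABLE events (
        seq INTEGER PRIMARY KEY AUTOINCREMENT, user_id TEXT, clicks INTEGER);
    CREATE TABLE clicks_by_user (user_id TEXT PRIMARY KEY, clicks INTEGER);
    CREATE TABLE stream_offset (last_seq INTEGER);
    INSERT INTO stream_offset VALUES (0);
    """
)


def run_task(conn):
    """Process only the events added since the previous run."""
    (last_seq,) = conn.execute("SELECT last_seq FROM stream_offset").fetchone()
    (max_seq,) = conn.execute("SELECT COALESCE(MAX(seq), 0) FROM events").fetchone()
    window = (last_seq, max_seq)
    # Make sure every user seen in the new rows has an aggregate row.
    conn.execute(
        "INSERT OR IGNORE INTO clicks_by_user "
        "SELECT DISTINCT user_id, 0 FROM events WHERE seq > ? AND seq <= ?",
        window,
    )
    # Add the new clicks to the running totals, entirely in SQL.
    conn.execute(
        "UPDATE clicks_by_user SET clicks = clicks + ("
        "  SELECT COALESCE(SUM(e.clicks), 0) FROM events e"
        "  WHERE e.user_id = clicks_by_user.user_id AND e.seq > ? AND e.seq <= ?)",
        window,
    )
    # Advance the offset so these rows are not processed again.
    conn.execute("UPDATE stream_offset SET last_seq = ?", (max_seq,))
    conn.commit()
    return max_seq - last_seq  # rows are never deleted here, so this is a row count


insert = "INSERT INTO events (user_id, clicks) VALUES (?, ?)"
conn.executemany(insert, [("ann", 2), ("bob", 1)])
first = run_task(conn)
conn.executemany(insert, [("ann", 3)])
second = run_task(conn)
third = run_task(conn)  # nothing new has arrived
result = dict(conn.execute("SELECT user_id, clicks FROM clicks_by_user"))
print(first, second, third, result)
```

In a real warehouse the scheduling and change tracking would be handled by the platform. Here the caller invokes `run_task` by hand, which keeps the sketch self-contained.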
To that end, Fivetran is working to boost the frequency at which it can update a cloud data warehouse. The current limit is once per minute, down considerably from the past: syncs started at once per day, then moved to once per hour, then every 15 minutes. The company is working to push latency lower.
“With new features being developed in the data warehouse, latencies of seconds and tens of seconds are fundamentally possible and we’re working hard at Fivetran to keep battling every element of the pipeline, keep battling down that latency number,” Fraser said. “Because we understand that lower latency is going to enable all of these exciting use cases.”
You can view Fraser’s keynote address here.