Databricks Versus Snowflake: Comparing Data Giants
Databricks and Snowflake have emerged as predominant providers of big data analytics platforms in the cloud. While there are some similarities between the two rivals, there are also important differences in their offerings that prospective customers may care to know about.
If you’re moving your data operations to the cloud, you’re likely considering Databricks and Snowflake, which are two of the biggest and fastest-growing companies in the space. Both companies seem to be benefiting from the relative independence of not being named AWS, Google Cloud, or Microsoft Azure, as the fear of lock-in drives enterprises to embrace multi-cloud strategies and multi-cloud platforms.
There are obviously similarities between the two companies, but they have different strengths and weaknesses that could impact your decision to go with one or the other, or neither, as the case may be. This is a very dynamic industry, and new big data startups are being spawned all the time to take down the titans. It wasn’t long ago that Databricks and Snowflake were the lively startups picking fights with establishment vendors.
With that said, here’s a rundown on the key similarities and differences between the two companies and their offerings.
Snowflake offers a data warehouse delivered via the software-as-a-service (SaaS) method. It supports structured and semi-structured data (unstructured support is still immature), and runs on AWS, Google Cloud, and Microsoft Azure.
Databricks offers a data lakehouse delivered via the platform-as-a-service (PaaS) method. It supports structured, semi-structured, and unstructured data, and runs on AWS, Google Cloud, and Microsoft Azure.
Databricks began as an implementation of Apache Spark in the cloud (although today it’s much more than that) and it continues to excel in providing the type of large-scale data processing that Spark is renowned for. Today, the company promotes its data lakehouse architecture, which combines the scalability advantages of data lake storage (via object storage) with the data quality advantages of a traditional warehouse, or analytics database. It boasts of its “unified analytics platform” that combines data engineering, AI, and machine learning.
Snowflake started as an analytics database with storage decoupled from compute, which makes it simpler to scale. The company often promotes the performance and speed of its proprietary analytics database, which was designed for traditional analytics and BI workloads. In recent years, it has started adding machine learning and AI capabilities via its Snowpark offerings, expanding into unstructured data types.
The Snowflake Data Cloud is best known for delivering high-speed SQL-based data warehousing capabilities for traditional analytics and BI workloads. With UniStore, it’s mixing transactional with analytical data and workloads. Workloads scale elastically based on demand. Third-party integrations bring support for various ETL and data visualization tools. Unified data governance atop centralized storage is considered a strength. Its Snowpark developer framework brings support for Python, Java, and Scala development, as well as new capabilities for developing machine learning and AI applications on unstructured data, such as text and imagery. Its acquisition of Streamlit also provides access to tools for rapidly building Python apps.
The Databricks Lakehouse Platform offers a wide array of capabilities for data engineering, data science, and data analysis. Customers can build and run large batch jobs, real-time streaming workloads, and machine learning applications on Databricks. Development can be done via notebooks or IDEs, with SQL, Python, and Scala, as well as open source ML frameworks like PyTorch and TensorFlow. Its MLflow offering helps to manage machine learning workflows. Its Delta Lake offering supports secure data sharing, while it provides integrated data governance with its Unity Catalog. This year it rolled out Dolly, a large language model (LLM).
Databricks’ capability to scale to handle massive data workloads is considered a strength. It provides some automated query optimization capabilities through vectorization and cost-based optimization, but users will typically need some technical expertise to really dial in the performance of SQL analytics workloads. It’s more open for making changes, such as selecting certain node types. As a PaaS, Databricks is more open and invites users to plug in a variety of open-source tools.
As a SaaS offering, Snowflake is designed to be easy to get going quickly. Snowflake has done a lot of engineering under the covers to optimize performance out of the box, and its market success reflects that. There are not as many options for fine-tuning the configuration, as Snowflake intentionally shields customers from that complexity. There is no option to configure node types, for example. Snowflake in general is less open and offers fewer options to users, which reduces complexity and makes it easier for a wide group of people to use.
Snowflake manages data for customers. It supports encryption at rest and in transit, role-based access control (RBAC), and auditing. It also supports features such as AWS PrivateLink and Azure Private Link for enhanced network security, as well as data masking.
In Databricks’ cloud, customers manage their own data. Databricks supports encryption at rest and in transit and RBAC. It also supports Azure Virtual Network (VNet Injection) and network security groups (NSGs) for network isolation on the Microsoft cloud.
Databricks offers pay-as-you-go pricing as well as committed-use pricing, which brings a discount. Users are charged for the specific compute services they use (such as “All-Purpose Compute”), the number of virtual machine instances they use, how often they use them, the cloud they’re running on, and the support program (standard, premium, enterprise). Since data is managed by customers, it doesn’t charge for storage.
Snowflake also offers pay-as-you-go pricing, but since it manages customers’ data, it charges for compute time as well as data storage (storage costs are passed on from the public cloud provider to Snowflake). For Snowflake On Demand, it charges based on the amount the customer uses it, with per-second pricing. Customers can get discounts by pre-purchasing Snowflake capacity. Pricing also varies by cloud, region, and support tier (standard, enterprise, business critical, and virtual private Snowflake [VPS]).
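To make the difference between the two billing models concrete, here is a rough sketch in Python. It only illustrates the structure described above: a compute-only bill versus a compute-plus-storage bill, each with an optional discount for committed or pre-purchased capacity. Every rate, quantity, and name in it (UsageEstimate, the dollar figures, the discount fraction) is hypothetical and not taken from either vendor's price list. Applying the discount to the whole bill is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class UsageEstimate:
    """Hypothetical monthly usage; none of these rates are real vendor prices."""
    compute_seconds: int               # total billed compute time in seconds
    compute_rate_per_hour: float       # assumed price per compute-hour
    stored_tb: float = 0.0             # data stored with the vendor (0 if self-managed)
    storage_rate_per_tb: float = 0.0   # assumed monthly price per TB stored
    discount: float = 0.0              # assumed fraction off for committed capacity

    def compute_cost(self) -> float:
        # Per-second billing: pay for the exact fraction of an hour used.
        return self.compute_seconds / 3600 * self.compute_rate_per_hour

    def storage_cost(self) -> float:
        # Zero when the customer manages storage in their own cloud account.
        return self.stored_tb * self.storage_rate_per_tb

    def monthly_total(self) -> float:
        # Assumption: the discount applies to the whole bill.
        return (self.compute_cost() + self.storage_cost()) * (1 - self.discount)


# Compute-only bill (customer manages and pays for storage separately).
compute_only = UsageEstimate(compute_seconds=90_000, compute_rate_per_hour=4.0)

# Compute-plus-storage bill with a pre-purchase discount.
compute_plus_storage = UsageEstimate(
    compute_seconds=90_000,
    compute_rate_per_hour=4.0,
    stored_tb=10,
    storage_rate_per_tb=20.0,
    discount=0.25,
)

print(compute_only.monthly_total())
print(compute_plus_storage.monthly_total())
```

A compute-only bill is not necessarily cheaper overall, because the customer still pays the cloud provider directly for storage. A real comparison would also need to account for cloud, region, and support tier.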
Snowflake launched its Data Exchange in 2019, and changed the name to the Data Marketplace a year later. It currently provides more than 2,200 data products, many of which are free. The Snowflake Marketplace also offers more than 1,700 applications, which it calls Native Apps.
Databricks launched its Marketplace in 2022 as a way to share data using its Delta Sharing protocol. It currently provides more than 500 data products, including 287 free data sets.
With its roots in Apache Spark, Databricks uses open source software extensively in its platform, and contributes a lot of its work to the open source community. However, it was criticized for holding back some of its technology, such as the Delta table format, from the open source community, a move it has since reversed.
Snowflake is not a big supporter of open source, and in fact its leaders have voiced many criticisms of open source software, including the failures of Apache Hadoop. The inner workings of its proprietary database are a mystery. However, it has come out in support of open source Apache Iceberg, a competitor to Delta table.
Databricks was founded in 2013 by the group of computer scientists at Cal Berkeley’s AMPLab who were behind Apache Spark. That includes Matei Zaharia, who’s generally credited with creating Spark, as well as his two advisors Ali Ghodsi and Ion Stoica. Co-founders Reynold Xin, Patrick Wendell, Andy Konwinski, and Arsalan Tavakoli-Shiraji are also computer scientists with ties to Berkeley.
Snowflake was founded in 2012 by three data warehousing experts, including Benoît Dageville and Thierry Cruanes, who both worked as data architects at Oracle, and Marcin Żukowski, the co-founder of Vectorwise, an MPP analytics database that is now owned by Actian.
Revenue, Customer Count, and Valuation
Databricks has about 10,300 customers, according to 6sense, a company that provides insights on technologies, or “technographics.” The company, which is privately held, is reportedly valued at $43 billion, a figure cited by Bloomberg in a recent story about the company being in talks for a new funding round. That’s up from $38 billion, a figure cited two years ago during the company’s most recent funding round. In June, Databricks passed the $1 billion revenue mark for the past 12 months for the first time.
Snowflake has a market capitalization of $52.5 billion, which is down from about $123 billion in November 2021, when its stock reached an all-time high of about $392 per share. Snowflake recorded $2.07 billion in revenue for fiscal year 2023. Snowflake reported that it had more than 8,100 customers at the end of its first quarter for fiscal year 2024, which ended April 30, 2023.
Editor’s note: This article has been updated to reflect Snowflake’s current capabilities for secure network connectivity in the cloud and storage costs.