Firebolt Touts Massive Speedup in Cloud Data Warehouse
The separation of compute and storage in the cloud allows data warehouse users to elastically scale up their clusters to process more data. But this brute force approach comes at an unacceptable cost, according to the Israeli startup Firebolt, which emerged from stealth today with $37 million and a cloud data warehouse that it claims delivers an order of magnitude better price-performance on analytic workloads relative to other cloud data warehouses.
According to Firebolt co-founder and CEO Eldad Farkash, Firebolt has spent several years innovating at the file format, indexing, and query engine levels. It starts with a new file format, which incorporates a novel indexing technique that enables Firebolt to compress data at higher rates. That higher compression enables bigger data sets to be stored in RAM, which in turn is responsible for delivering extremely fast reads atop data stored in AWS S3, he claims.
“The biggest change is a new file format called the triple F format,” or F3, he says. “The purpose of this format is to tackle what we believe are the biggest problems and gaps in efficiency and speed when it comes to cloud data warehouse. And it all starts with S3.”
S3 is Firebolt’s hard drive, Farkash says. The company doesn’t support other public cloud object stores (although that may change in the future). There’s only so much one can do with S3, so Firebolt had to get creative with how data is stored once it leaves S3.
“The way to tackle S3 is to rethink the way we prune data in S3,” says Farkash, who was a co-founder and CTO at SiSense before co-founding Firebolt. “Usually, as we all know, existing platforms have some sort of partitioning, or micro-partitioning–some way to place the data in a way that will allow you to skip files.
“The problem with that is files are not granular enough,” he continues. “When we look at a file, we actually think of them as something that evolves over time. We want to merge files over time, and our goal is to increase the size of files over time so we can exploit compression and coding in a much better way because the data is ordered in that file.”
With larger, ordered files, Firebolt was able to apply a novel indexing approach, called sparse indexing. That, in turn, allows Firebolt to download much less data from S3 into RAM and still have a reasonable shot at retrieving the data that the user is looking for.
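The article does not describe how F3 lays out its index, so the following is only a generic sketch of sparse indexing over sorted data, not Firebolt's implementation. It assumes rows are sorted by a key and grouped into fixed-size blocks; the function names `build_sparse_index` and `blocks_for_range` are hypothetical. The index stores just the first key of each block, which is enough to decide which blocks a range filter needs to fetch.

```python
from bisect import bisect_left, bisect_right

def build_sparse_index(sorted_keys, block_size):
    """Keep only the first key of every block of `block_size` rows."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), block_size)]

def blocks_for_range(index, lo, hi):
    """Return the block numbers that might hold keys in [lo, hi].

    The answer is conservative: it can include a block that turns out
    to hold no matching rows, but it never omits a block that does.
    """
    # Start at the last block whose first key is strictly below `lo`;
    # that block may still end with keys equal to `lo`.
    first = max(bisect_left(index, lo) - 1, 0)
    # Stop at the last block whose first key is <= `hi`.
    last = bisect_right(index, hi) - 1
    return list(range(first, last + 1))

# Hypothetical data: 1,000 sorted keys split into blocks of 100 rows.
keys = list(range(1000))
index = build_sparse_index(keys, block_size=100)
needed = blocks_for_range(index, 250, 320)
print(f"index entries: {len(index)}, blocks to fetch: {needed}")
```

In this toy case the index holds 10 entries for 1,000 rows, and a filter on keys 250 through 320 touches 2 of the 10 blocks. The same idea applied to byte ranges within a large ordered file is one way to avoid downloading the whole file.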
Firebolt pairs this sparse indexing in the F3 format with a just-in-time, vectorized query engine. Some of this technology was first described at the Centrum Wiskunde & Informatica (CWI), the Dutch national research institute for mathematics and computer science in Amsterdam, which gave rise to the MonetDB in-memory database. (The Vectorwise and Snowflake query engines can also attribute their lineage to CWI.) But according to Farkash, the way that Firebolt implements all of these elements is distinctive.
The impact of this approach is huge, he says.
“Whenever a query starts cold, the first thing users will notice is that they scan much less data,” he says. “If you compare it to BigQuery or Snowflake, the amount of data you scan is ridiculously lower than anything else out there. The reason is we work at the range level. We would never download the whole file unless there was a range that contains that file.”
Firebolt doesn’t just download the file from S3 and cache it in RAM, Farkash says. Instead, it optimizes how the data is stored in the cache. As data is streamed into the S3 lake, say from Kafka or Kinesis, Firebolt continuously re-orders it in its cache and prepares it for analysis.
“This is not a post-ingestion process,” Farkash says. “This is a real-time, streaming environment where the data gets chunked and F3 format files are being generated. Those files are being ordered, compressed, and committed to S3. Merging happens at ingest, so we try to merge files into bigger ones, depending on the age of files and the wish of the user. Do I wish to have near real time so every record that I insert will automatically be available for querying, versus I’m just doing a copy of a hundred TB of data and I want that to be as optimized as possible.”
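The merge-by-age behavior Farkash describes can be illustrated with a small sketch. This is an assumption-laden toy, not Firebolt's ingest pipeline: the `Chunk` class, the `compact` function, and the `min_age_seconds` threshold are all hypothetical names, and each chunk is assumed to hold rows already sorted by key. Older chunks are merged into one larger sorted chunk, while fresh chunks are left alone so newly inserted records stay immediately queryable.

```python
import heapq
from dataclasses import dataclass

@dataclass
class Chunk:
    age_seconds: float
    rows: list  # assumed to be sorted by key already

def compact(chunks, min_age_seconds):
    """Merge chunks at least `min_age_seconds` old into one sorted chunk."""
    old = [c for c in chunks if c.age_seconds >= min_age_seconds]
    fresh = [c for c in chunks if c.age_seconds < min_age_seconds]
    if len(old) < 2:
        return list(chunks)  # nothing worth merging yet
    # heapq.merge streams the sorted inputs into a single sorted output.
    merged_rows = list(heapq.merge(*(c.rows for c in old)))
    merged = Chunk(age_seconds=max(c.age_seconds for c in old), rows=merged_rows)
    return [merged] + fresh

# Hypothetical state: two older chunks and one that just arrived.
chunks = [Chunk(120, [1, 4, 9]), Chunk(90, [2, 3, 10]), Chunk(5, [0, 7])]
result = compact(chunks, min_age_seconds=60)
for c in result:
    print(c.age_seconds, c.rows)
```

Raising the threshold favors freshness (more small files, queryable sooner), while lowering it favors larger ordered files that compress and prune better, which mirrors the trade-off in the quote.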
Once the data is ingested, the elasticity of the Firebolt service kicks in. The company supports multiple engines running concurrently. Different users can be doing different things on the same data, Farkash says. That’s not new. “But what is new is the efficiency and speed that we can get modern elasticity working.”
The name of the game is speed and efficiency. Tableau users will no longer need to work on an extract of the data, because they will be able to essentially load all of their data into Firebolt, according to Farkash. (The biggest use case so far is 800TB of compressed data, but Farkash indicated that it could go higher.)
As companies rely more on data analytics and as analytics becomes more complex, the overhead in orchestrating the data and moving the data from S3 into the various engines is becoming really problematic. With a single engine that can handle the biggest workloads, Firebolt has the potential to significantly simplify the analytics architecture that companies are using, according to Farkash.
“Analytic queries are not linearly scalable, just by doubling the resource. You need to be much smarter than that,” he says. “That’s why we have baked into our SQL front-end, the query optimizer, a lot of new optimizations that are really intended to deal with interactive, ad-hoc, star schemas, multi-fact table schema situations where people can’t just pre-join the data.”
The company plans to be SQL complete, with support for the full array of functions, like joins and window functions, but with just two years in development, that will take more time. The company also can work on semi-structured JSON data, as well as Parquet and Avro. It works best on normalized data, according to Farkash.
“The purpose of Firebolt is to detach you from that constant feeling of having to calculate and ask your boss whether you can spin up [an] engine to work some queries,” Farkash says. “The problem with [other architectures] in our view, is people need resource isolation. They need to be able to decide whether a specific data point or a specific report or specific query should cost $1 or $10.”
The company is not releasing benchmarks yet (those will be available in 2021, the company says). But according to Farkash, the minimum advantage over other cloud data warehouses is 3x to 4x. The bigger and more complex the data, the faster Firebolt goes, Farkash says.
“The more filters you add, the more complexity you add to the query, you can get to 100 to 1,000x faster,” he says. “Firebolt is a speedboat. It’s built to be a speedboat. And we love the fact that we’re a speedboat.”
Farkash says he has not seen a cloud data warehouse that can exceed Firebolt’s performance. “For typical use case, non ELT analytics, Firebolt will really melt the snow, seriously melt it, in a very up and in your face way,” he says. “There’s no way that Snowflake can run faster.”
The company conducted a soft launch three months ago and has more than a dozen customers. Now it’s selling access to its data warehouse running on AWS. The Tel Aviv company also today announced the completion of a $37 million round of venture financing led by Zeev Ventures, TLV Partners, Bessemer Venture Partners, and Angular Ventures.