March 15, 2024

How the FDAP Stack Gives InfluxDB 3.0 Real-Time Speed, Efficiency

(GarryKillian/Shutterstock)

The world of big data software is built on the shoulders of giants. One innovation begets another innovation, and before long, we’re running software that’s doing some amazing things. That in part explains the evolution of InfluxDB, the open source time-series database that has cranked up the performance dial in its third incarnation.

InfluxData co-founder and CTO Paul Dix recently sat down virtually with Datanami to discuss the evolution of InfluxDB’s architecture over the years, and why it changed so radically with version 3, which launched in distributed form last year and will arrive in a single-node version in 2024.

The InfluxDB story starts in 2016 with version 1.0, which excelled at storing metrics, but struggled to store other observability data, including logs, traces, and events, Dix said. With version 2.0, which debuted in late 2020, the InfluxDB development team kept the database intact, but added support for a new language they created called Flux that could be used for writing queries as well as scripting.

The market reaction to version 2 was mixed and provided important architectural lessons, Dix said.

“We learned that a lot of people just needed the core database to support the broader kinds of observational data [such as] raw event data, high cardinality data,” he said. “They needed a cheaper way to store historical data, so not on locally attached SSDs but on cheap object storage backed by spinning disks.”

InfluxDB users also wanted to scale their workloads more dynamically, which meant a separation of compute from storage was needed. And while some people loved Flux, the message from the user base was pretty clear that they wanted a language they already knew.

“We took that feedback seriously and we said, okay, with version 3, we need to support high cardinality data, we need far better query performance on analytical queries that span a lot of individual time series, we need it to all be able to store its data in object storage in this distributed way, and we wanted to support SQL,” Dix said.

“We saw all those things and were like, okay, that’s basically a totally different database,” he continued. “The architecture doesn’t match the architecture of version one or two, and all these other things are different.”

In other words, InfluxDB 3 would be a total rethink and a total rebuild of the previous releases. So in late 2019 and early 2020, Dix and a small team of engineers went back to the drawing board, and over the next six months they settled on a set of technologies they thought would deliver faster results and integrate with a broad ecosystem and community.

The Apache Arrow Ecosystem

Apache Arrow is a columnar, in-memory data format created in 2016 by Jacques Nadeau, a co-founder of Dremio, and Wes McKinney, the creator of Pandas. The pair realized that continually converting data for analysis with different engines, like Impala, Drill, or Spark, did not make sense, and that a standard in-memory data format was needed.
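For a concrete sense of what “columnar and in-memory” means in practice, here is a minimal sketch using the Rust arrow crate (the arrow-rs library that InfluxDB 3.0 builds on); the schema, column names, and values are purely illustrative.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Float64Array, StringArray, TimestampNanosecondArray};
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Arrow stores each column as a contiguous, typed array rather than as
    // rows, which is what makes scans and vectorized execution fast.
    let schema = Arc::new(Schema::new(vec![
        Field::new("time", DataType::Timestamp(TimeUnit::Nanosecond, None), false),
        Field::new("host", DataType::Utf8, false),
        Field::new("usage", DataType::Float64, false),
    ]));

    // A RecordBatch is a set of equal-length column arrays plus the schema.
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(TimestampNanosecondArray::from(vec![1_000, 2_000, 3_000])) as ArrayRef,
            Arc::new(StringArray::from(vec!["a", "b", "a"])) as ArrayRef,
            Arc::new(Float64Array::from(vec![0.5, 0.7, 0.9])) as ArrayRef,
        ],
    )?;

    println!("{} rows, {} columns", batch.num_rows(), batch.num_columns());
    Ok(())
}
```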

Over the years, a family of Arrow projects has grown around the core in-memory data format. There is Apache Arrow Flight, an RPC framework for streaming Arrow data over the network. And there’s also Apache Arrow DataFusion, a Rust-based SQL query engine originally developed by Andy Grove, who was working at Nvidia at the time.

Dix liked what he saw with the Arrow ecosystem, particularly DataFusion. However, DataFusion was pretty green. “At that point it had been developed by one guy working at Nvidia doing it in his spare time,” he said.

He looked at other query engines, including some written in C++, but they didn’t offer exactly what InfluxData needed. The fact that DataFusion was written in Rust weighed heavily in its favor.

“Whatever we adopted, we would have to be heavy contributors to it to help drive it forward,” Dix said. “And we knew that InfluxDB 3.0 was going to be written in Rust and DataFusion is also written in Rust. So we said, we’ll just adopt the project that is written in the language we want, and we will just cross our fingers and hope that it will pick up momentum along the way.”

It turned out to be a good gamble. DataFusion has since been picked up by contributors at companies like Alibaba, eBay, and Apple, which recently contributed a DataFusion-based Spark plug-in called Comet to the Apache Software Foundation.

“Over the course of the last three and a half years, DataFusion as a project has matured a ton,” Dix said. “It has a ton of functionality that just wasn’t there before. It’s a full SQL execution engine that has best-in-class performance on a number of different queries versus other columnar query engines.”
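As a rough illustration of what DataFusion provides, the sketch below uses its SessionContext to run SQL directly over a Parquet file; the file path, table name, and query are hypothetical, and exact APIs can vary across DataFusion versions.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // DataFusion supplies the SQL planner and the vectorized, Arrow-based
    // execution engine; here it queries a Parquet file registered as a table.
    let ctx = SessionContext::new();
    ctx.register_parquet("cpu", "data/cpu.parquet", ParquetReadOptions::default())
        .await?;

    let df = ctx
        .sql("SELECT host, avg(usage) AS avg_usage FROM cpu GROUP BY host")
        .await?;

    // Prints the results, which are produced as Arrow record batches.
    df.show().await?;
    Ok(())
}
```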

In addition to Arrow, Arrow Flight, and DataFusion, InfluxDB 3.0 adopted Arrow RS, the Rust implementation of Arrow; Apache Parquet, the on-disk columnar data format; and Apache Iceberg, the open table format.

Dix initially called it the FDAP stack, for Flight, DataFusion, Arrow, and Parquet, but the addition of Iceberg has him rethinking that. “I’m converting now to calling it the FIDAP stack because I believe that Apache Iceberg is going to be an important component of all of this,” he said.

(Sergey Nivens/Shutterstock)

Every component gives InfluxDB 3.0 another capability it needs, Dix said. The combination of Flight and Arrow gives the database an RPC mechanism for fast transfer of millions of rows of data. The addition of Iceberg, on top of object storage and Parquet, means that all the data ingested into InfluxDB is stored efficiently and remains available to other big data query engines.
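A rough sketch of the Flight side of that, using the arrow-flight crate’s client: the endpoint, the ticket contents, and some client details are assumptions for illustration and will differ in a real InfluxDB 3.0 deployment.

```rust
use arrow_flight::{FlightClient, Ticket};
use futures::TryStreamExt;
use tonic::transport::Channel;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to a Flight endpoint over gRPC; results stream back as Arrow
    // record batches, with no row-by-row serialization on either side.
    let channel = Channel::from_static("http://localhost:8082").connect().await?;
    let mut client = FlightClient::new(channel);

    // The ticket is an opaque request understood by the server; a SQL string
    // is used here purely as an illustration.
    let ticket = Ticket::new("SELECT * FROM cpu LIMIT 1000");

    let mut stream = client.do_get(ticket).await?;
    while let Some(batch) = stream.try_next().await? {
        println!("received {} rows", batch.num_rows());
    }
    Ok(())
}
```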

Real-Time Queries

“The tricky part is, all of our use cases are basically real time,” he said. “People write data in and they want to be able to query it immediately once it’s written in. They don’t want to have some data collection pipeline lag or going off to some whatever delayed system.

“And the queries they execute, they expect those queries to execute in sub one second, a lot of times sub a few hundred milliseconds depending on the query,” Dix continued. “And of course, no query engine built on top of object storage is really designed with those kind of performance characteristics in mind.”

To enable users to query data immediately, InfluxDB 3.0 caches newly written data in a write-ahead log that lives in RAM or on an SSD. The new database also includes logic to move colder data into Parquet files in object storage, which is typically backed by spinning disks.
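The “move colder data into Parquet” step can be pictured with the arrow and parquet crates: buffer rows as an Arrow record batch, then flush it to a Parquet file. The schema, values, and file name below are illustrative only; a production system would upload the file to object storage rather than write to local disk.

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, Float64Array, TimestampNanosecondArray};
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A buffered batch of "cold" rows, built by hand here for illustration.
    let schema = Arc::new(Schema::new(vec![
        Field::new("time", DataType::Timestamp(TimeUnit::Nanosecond, None), false),
        Field::new("usage", DataType::Float64, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(TimestampNanosecondArray::from(vec![1_000, 2_000])) as ArrayRef,
            Arc::new(Float64Array::from(vec![0.42, 0.58])) as ArrayRef,
        ],
    )?;

    // Persist the batch as a columnar Parquet file; writer properties
    // (compression, row group size, etc.) can be passed instead of None.
    let file = File::create("cold_segment.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```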

InfluxDB 3 is a very different animal than version 2, Dix said, both in terms of architecture and performance.

“There are some things that just immediately, out of the gate, are just obviously so much better than what we had before,” he said. “The ingestion performance in terms of the number of rows per second we can ingest, given a certain number of CPUs and a certain amount of RAM, in InfluxDB 3.0 is way, way better than version 1 or 2.”

Paul Dix is the co-founder and CTO of InfluxData

The storage footprint is roughly 4x to 6x smaller using Parquet, Dix said. “It’s even better than that, because you’re looking at a storage medium, which is spinning disk on object store, that’s basically 10x cheaper than a high-performance locally attached SSD with provisioned IOPS.”

The rebuild with version 3 puts InfluxDB in the same class as real-time analytics systems like Apache Druid, ClickHouse, Apache Pinot, and Rockset. All of these databases take a slightly different approach to solving the same problem: enabling fast queries on fresh data.

InfluxData gives users lots of knobs to control whether data is kept in a RAM/SSD cache or pushed out to Parquet files in object storage, where latency is higher.

“It all amounts to essentially a cost versus performance tradeoff, and what we found is there is no one-size-fits-all, because different use cases and different customers will have different sensitivities for how much money they’re willing to spend to optimize for a second or two of latency,” Dix said. “And sometimes it’s been surprising what people say.”

As InfluxDB 3.0 continues to be fleshed out (the team is working on a new write protocol to support richer data types, such as structured and nested data, arrays, and structs), the database will continue to support new workloads and applications that were impossible before. Call it the ever-upward thrust of community-developed technology. And more is on the way.

“None of this stuff was available before,” Dix said. “Arrow didn’t exist. Arrow came out in 2016. Containerization was brand new. Kubernetes wasn’t that big back then….What we’re trying to do with version 3, which is take that design pattern but bring it to real time workloads — that’s the big hurdle.”

Related Items:

InfluxData Touts Massive Performance Boost for On-Prem Time-Series Database

InfluxData Revamps InfluxDB with 3.0 Release, Embraces Apache Arrow

Arrow Aims to Defrag Big In-Memory Analytics

Editor’s note: This article was corrected. DataFusion was developed by Andy Grove. Datanami regrets the error.
