July 1, 2020

Confluent Goes ‘Infinite’ with Kafka Cloud Storage


Companies that want to store oodles of event data in Kafka but don’t want to pay oodles of dollars may be interested in the new “infinite storage” option unveiled today by Confluent. The Kafka company says the new feature, made economically feasible by newly separated compute and storage layers, will allow the storage of event data to scale automatically according to demand.

Apache Kafka has emerged as the de facto standard for storing event data, which refers to all the semi-structured and unstructured data that is generated by people and applications during the normal course of digital work and play. Kafka provides the mechanism for storing these data streams and flowing them to downstream repositories, like data lakes and warehouses. (Users can optionally analyze them in real time using something like Kafka Streams.)
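For readers unfamiliar with how that pipeline looks in code, here is a minimal Kafka Streams sketch that reads a raw event stream and routes one event type to a downstream topic. The broker address and the topic names (“events” and “purchases”) are hypothetical placeholders, not anything specific to Confluent:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class ClickFilter {
        public static void main(String[] args) {
            // Basic configuration; the broker address is a placeholder.
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-filter-demo");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            // Read the raw event stream, keep only purchase events, and write
            // them to a downstream topic (e.g., one feeding a data lake sink).
            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> events = builder.stream("events");
            events.filter((key, value) -> value.contains("\"type\":\"purchase\""))
                  .to("purchases");

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }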

Many companies have created giant Kafka clusters to store petabytes of event data, for either real-time actioning or historical analysis. However, the close linkage of storage and compute in Kafka brokers formed a technical barrier that prevented them from storing all of this data in an affordable manner.

In other words, to scale storage, Kafka customers needed to scale compute as well, even if they weren’t using all that computational horsepower, which is a waste of money. (Further complicating the math is the need to balance data throughput in Kafka.)

But thanks to a technical breakthrough unveiled today, Confluent is separating compute and storage in Confluent Cloud, enabling customers to scale storage independently of compute.

“This is a first and a big technical achievement as Kafka’s storage and compute layers are tightly coupled,” says Dan Rosanova, a group product manager at Confluent. (It’s worth noting that this separation is only available in Confluent Cloud, not Apache Kafka, at least not yet.)

With infinite retention, Confluent has removed a cap on how long and how much data could be stored in its Confluent Cloud, which is its hosted offering based on Kafka that’s available on all public clouds. This removes operational burden for customers, Rosanova says.

“This is the only fully managed event streaming service with no limits on the amount that is stored or time it is retained,” Rosanova tells Datanami via email. “It brings popular cloud attributes like automatic scaling based on traffic and being charged for only data that is used. Also real-time clients are not impacted by clients reading historical data, so doesn’t add to the operational workloads.”

There is no change in cost for data that is stored infinitely, Rosanova says. “Ultimately, infinite retention will be available to all Confluent Cloud users at no extra cost for the limitless storage capacity and retention,” he says. “They would still continue to pay for any GB of data stored like they normally do.”

It’s yet to be seen whether infinite storage will actually save customers money. Cloud services have a mixed history in that department, and there’s no reason that hosted Kafka will be any different. Freed from buying compute and storage together, it’s possible that customers will spend much more money on storage. After all, they have been storing only weeks’ or months’ worth of data up to this point. How much would storing all of the data actually cost?

But what seems pretty clear at the moment is that infinite storage will give customers much greater flexibility to retain data indefinitely, if they so desire. While companies could have hacked together their own infinite retention-like configuration in their own Kafka clusters, it would have required a bit of manual work, Rosanova says.

“People could technically choose a retention schedule of ‘-1’ to retain data indefinitely in this environment, but the tuning, administration and costs didn’t make it feasible,” he says. “With Confluent Cloud, we’ve added performance optimization features that eliminate lots of the operational overhead, so it’s now feasible and practical to store big amounts of data in Kafka.”
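The “-1” value Rosanova mentions is a standard Apache Kafka topic setting: retention.ms controls time-based deletion, retention.bytes controls size-based deletion, and -1 disables each. As a rough sketch of the do-it-yourself approach he describes, here is how a topic on a self-managed cluster could be switched to indefinite retention via Kafka’s AdminClient API (the broker address and topic name are placeholders):

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class InfiniteRetention {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker

            try (AdminClient admin = AdminClient.create(props)) {
                // retention.ms = -1 disables time-based deletion for the topic;
                // retention.bytes = -1 disables size-based deletion.
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
                List<AlterConfigOp> ops = List.of(
                    new AlterConfigOp(new ConfigEntry("retention.ms", "-1"), AlterConfigOp.OpType.SET),
                    new AlterConfigOp(new ConfigEntry("retention.bytes", "-1"), AlterConfigOp.OpType.SET));
                admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
            }
        }
    }

As Rosanova notes, the configuration itself is the easy part; the operational work of provisioning and balancing brokers as the topic grows is what Confluent says its managed service takes off customers’ hands.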


Ultimately, companies hold onto data for a reason: they want to keep it for historical analysis. But companies often want to mix historical and real-time data to improve service to their customers. Thanks to the new infinite retention feature in Confluent Cloud, the company is enabling its hosted Kafka service to function as the system of record for both real-time and historical data, says Confluent co-founder Jun Rao, who is also a co-creator of Kafka.

“The common practice today is to maintain historical data in a separate system and to direct the application to that system when historical data is needed,” Rao says in a blog post today. “This adds complexity in that every application has to deal with an additional data source other than Kafka. Developers have to use two sets of APIs, understand the performance characteristics of two different systems, reason about data synchronization when switching from one source to the other, etc.

“Imagine if the data in Kafka could be stored for months, years, or infinitely. The above problem can be solved in a much simpler way,” he continues. “All applications just need to get data from one system—Kafka—for both recent and historical data.”
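Rao’s point is that one consumer API then covers both cases. The sketch below (broker, group, and topic names are placeholders) starts at the earliest retained offset, so the same poll loop first replays history and then keeps delivering new events as they arrive:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ReplayConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("group.id", "replay-demo");             // hypothetical group
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            // With no committed offset, begin at the oldest retained record.
            props.put("auto.offset.reset", "earliest");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("events"));
                // One loop serves both historical replay and live consumption.
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                    }
                }
            }
        }
    }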

With infinite retention, Confluent is keeping data on primary storage, as opposed to pushing it off to an archive layer, which is what Confluent does with tiered storage. Tiered storage (unveiled as a preview earlier this year) allows Confluent Cloud customers to set different retention policies for different chunks of data. The platform can be configured to store the freshest data on the highest performing (but most expensive) storage layer, while pushing older data to S3 or other object storage systems, which are cheaper but slower.

Under the tiered storage paradigm, customers can expect a performance hit when recalling data stored in S3. But since infinite storage keeps data on primary storage, there should be no performance impact (although it will also cost more).
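In self-managed Confluent Platform, tiering is exposed through topic-level settings; Confluent Cloud handles this behind the scenes. As an illustration only (the topic name is a placeholder, and the confluent.tier.* settings are Confluent Platform configs, not part of Apache Kafka), a topic could keep roughly a day of data on local broker disks while older segments move to object storage:

    import java.util.Map;
    import java.util.Optional;
    import java.util.Properties;
    import java.util.Set;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class TieredTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker

            try (AdminClient admin = AdminClient.create(props)) {
                // Broker-default partition/replica counts; the configs do the tiering.
                NewTopic topic = new NewTopic("events-tiered", Optional.empty(), Optional.empty())
                    .configs(Map.of(
                        "confluent.tier.enable", "true",              // offload old segments
                        "confluent.tier.local.hotset.ms", "86400000", // ~1 day stays local
                        "retention.ms", "-1"));                       // retain everything overall
                admin.createTopics(Set.of(topic)).all().get();
            }
        }
    }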

“Infinite retention has the potential to actually increase performance because the data stays within Kafka rather than being put in another repository that requires API calls and bandwidth to get it back into Kafka,” Rosanova says. “This characteristic of Confluent Cloud is what opens the door for real-time and historical analysis use cases in the same cluster.”

Infinite storage was unveiled as part of Confluent’s Project Metamorphosis, which the company launched earlier this year. It is available today as a preview for Confluent Cloud customers running in AWS. The company plans to roll it out to other public clouds later this year.

Related Items:

Step One in Kafka’s Metamorphosis Revealed

Kafka Tops 1 Trillion Messages Per Day at LinkedIn

The Real-Time Future of Data According to Jay Kreps
