The folks behind Apache Flink just delivered something you may not have realized you needed: ACID guarantees in a stream processing framework. According to data Artisans, its new Streaming Ledger offering could even replace relational databases in some use cases, upping the ante in big data’s ongoing architectural war.
Banks and other companies with strict transactional needs have long required that their data processing systems adhere to the precepts of ACID, where transactions are Atomic, Consistent, Isolated, and Durable. Without these assurances, banks could not guarantee, for example, that money deducted from one account is deposited in another.
Delivering ACID support is no trivial matter, and typically requires a range of internal mechanisms to track the state of complex, multi-part transactions in an accurate and redundant manner. Because of the technical complexity, ACID support has traditionally been the realm of relational database management systems, and as a result most companies with ACID needs have turned to RDBMSs like Oracle, SQL Server, and DB2 to serve their needs (although NoSQL databases are adding ACID support too).
The folks at data Artisans are hoping to turn those ACID assumptions on their heads with their new Streaming Ledger offering, which is available as part of the new proprietary River Edition of Flink the company announced today at its Flink Forward event in Berlin, Germany.
ACID support builds on capabilities that the open source version of Apache Flink already offered, including its checkpointing system and support for exactly-once processing. The Streaming Ledger takes the technology up a notch, says Robert Metzger, co-founder of data Artisans and head of engineering.
“What Flink supports as of today is exactly once transactions. This allows you to perform stateful accesses on a single key,” he says. “Let’s say you have a table in a stream processor containing users, and you want to perform some updates on a user. This is basically what Flink provides with exactly once guarantees. So these updates are always consistent for the user, and if this processor fails, it will restart, and the data … will always be consistent.
“The problem is, if you’re working with data from the financial industries, where you want to transfer money between bank accounts, or in the logistics industry, where you want to change how many items you have in storage, you need to be able to do updates to multiple states at once,” Metzger continues. “You don’t want to have the system crash in the middle, where your money is removed from one bank account but not added to the other bank account. And you want to do it efficiently at large scale.”
That is basically what data Artisans has delivered with Streaming Ledger. “It’s essentially multi-key joins, joining data from multiple sources consistently,” Metzger adds. “This might be abstract, but this applies to a ton of use cases at the end of the day.”
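The all-or-nothing property Metzger describes can be pictured with a minimal sketch. The Python below is invented purely for illustration (it is not the Streaming Ledger API, and the class and account names are hypothetical): a transfer either updates both keyed states together or leaves both untouched.

```python
# Hypothetical illustration -- NOT the Streaming Ledger API.
class LedgerState:
    """Keyed balances with all-or-nothing multi-key transfers."""

    def __init__(self, balances):
        self.balances = dict(balances)

    def transfer(self, src, dst, amount):
        if self.balances.get(src, 0) < amount:
            return False              # reject: would overdraw the source
        # Stage both updates, then swap them in together, so no failure
        # can leave money removed from one account but not added to the other.
        staged = dict(self.balances)
        staged[src] -= amount
        staged[dst] = staged.get(dst, 0) + amount
        self.balances = staged        # both keys change atomically
        return True

ledger = LedgerState({"alice": 100, "bob": 50})
assert ledger.transfer("alice", "bob", 30)
assert ledger.balances == {"alice": 70, "bob": 80}
assert not ledger.transfer("alice", "bob", 1000)  # rejected outright
assert ledger.balances == {"alice": 70, "bob": 80}
```

The stage-then-swap pattern is the toy analogue of what an ACID system must guarantee at scale and across failures: a reader never observes the debit without the matching credit.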
Incoming transactions can be processed in an ACID-compliant manner with data Artisans' new Streaming Ledger
Support for ACID transactions will simplify life for developers trying to process transactions using Flink, says Kostas Tzoumas, data Artisans co-founder and CEO and one of the original creators of the stream processing system that emerged from the Stratosphere research project at the Technical University of Berlin earlier this decade.
“If you have stream processing applications that require ACID guarantees, until now you’re living with weaker guarantees,” Tzoumas tells Datanami. “So you were taking care of it in the application layer. Now you can have it with full performance and full scalability.”
The company has been working on ACID transaction support for more than a year. It collaborated with customers in the financial services industry to gather requirements and, once it had a working prototype, to verify that it worked correctly before today's announcement. Customers who need strong ACID guarantees with streaming data can benefit immediately by adopting the Streaming Ledger, Tzoumas says.
“Honestly, this is really hard stuff. You have applications developers, and they’re not necessarily distributed systems experts,” he says. “We come from a database background ourselves. When we do these things, we want to make sure that they are correct! So there was a lot of testing involved as well.”
Customers can now write Flink processes in high-level languages without becoming experts in distributed systems, concurrency controls, and database locking mechanisms, Metzger says. They also might be able to say sayonara to Oracle's database license fees, he adds.
“We are allowing companies to get rid of their really expensive and complicated Oracle databases that currently provide them with ACID transactions and move to newer technology that allows you to scale, that allows you to implement custom code,” he says. “That allows you to not be forced to do stored procedures, but to write Java code to really express your use case, to really implement your solution in, I would argue, a nice way.”
While the Streaming Ledger could replace a relational database in some circumstances, there are some obvious limitations, such as the fact that Flink is not API compatible with an Oracle database, Tzoumas says.
“This is not a request-response client-server system. There’s not a client issuing the transaction and then getting a response. It’s more like queuing the transaction into an event log,” he says. “So there are several parallels and several differences with databases, but there are a lot of similarities.”
Stream processing was never meant to be a drop-in replacement for databases, Tzoumas adds. Instead, it’s about creating new ways to think about data. “Putting together a streaming architecture and writing an application for the streaming architecture is something that people typically do when they start with a new application,” he says. “And what this really gives you is the ability to move applications to a streaming architecture that you could not do before.”
In tests, the Streaming Ledger has proven to be quite scalable, handling 2 million transactions per second in a 32-node cluster, according to Tzoumas. The scalability of a streaming ACID-compliant system will be determined ultimately by the number of “collisions” that occur on look-up keys, he says.
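A toy model makes the collision point concrete (hypothetical Python, not how Streaming Ledger actually schedules work): transactions that touch disjoint keys can proceed in the same parallel step, while transactions sharing a hot key must run one after another, capping throughput.

```python
# Hypothetical illustration of key collisions limiting parallelism.
def schedule(transactions):
    """Greedily pack transactions into rounds of non-conflicting work.

    Each transaction is the set of keys it reads/writes. Transactions in
    the same round touch disjoint keys, so each round models one parallel
    step; more rounds means more forced serialization.
    """
    rounds = []                       # list of (keys_in_round, txns_in_round)
    for keys in transactions:
        for round_keys, round_txns in rounds:
            if not (keys & round_keys):        # no collision with this round
                round_keys |= keys
                round_txns.append(keys)
                break
        else:
            rounds.append((set(keys), [keys])) # collides everywhere: new round
    return rounds

# Disjoint transfers parallelize; a hot account forces serialization.
disjoint = [{"a", "b"}, {"c", "d"}, {"e", "f"}]
hot_key  = [{"a", "b"}, {"a", "c"}, {"a", "d"}]
assert len(schedule(disjoint)) == 1   # all three in one parallel round
assert len(schedule(hot_key)) == 3    # every transaction collides on "a"
```

Under this model, a workload of mostly disjoint key sets scales close to linearly with nodes, while a skewed workload hammering a few keys does not, which matches the intuition behind Tzoumas's caveat.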
While Streaming Ledger does deliver ACID support, Tzoumas says there is a caveat on the durability aspect. Durability is handled by Flink’s checkpointing feature, which means that in between checkpoints, the transactions are not durable.
Customers can set checkpoints to occur as often as once every 200 milliseconds, Metzger says. “The only thing this affects is the time to recovery,” he says. “If the system crashes and you have a checkpoint interval of 200 milliseconds, then you will introduce a latency of 200 milliseconds at most.”
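The trade-off can be sketched in hypothetical Python (a toy model, not Flink's actual checkpointing implementation): state is snapshotted periodically, and after a crash the system restores the last snapshot and replays the event log from that point. The checkpoint interval therefore bounds how much work must be redone, not whether the results end up correct.

```python
# Hypothetical illustration of checkpoint-based durability.
import copy

class CheckpointedProcessor:
    """Toy model: periodic snapshots plus a replayable event log."""

    def __init__(self, checkpoint_every):
        self.checkpoint_every = checkpoint_every
        self.state = {}               # in-memory keyed state (volatile)
        self.snapshot = ({}, 0)       # (durable state copy, log position)

    def process(self, log, start=0):
        for pos in range(start, len(log)):
            key, delta = log[pos]
            self.state[key] = self.state.get(key, 0) + delta
            if (pos + 1) % self.checkpoint_every == 0:
                # Checkpoint: persist the state and the log position.
                self.snapshot = (copy.deepcopy(self.state), pos + 1)

    def recover_and_resume(self, log):
        # Restore the last checkpoint, then replay the log from there;
        # at most one checkpoint interval of work is redone.
        self.state, resume_at = copy.deepcopy(self.snapshot[0]), self.snapshot[1]
        self.process(log, start=resume_at)

log = [("acct", 10)] * 10             # a replayable log of 10 deposits
p = CheckpointedProcessor(checkpoint_every=4)
p.process(log[:7])                    # "crash" strikes after 7 events
p.recover_and_resume(log)             # replay from the checkpoint at event 4
assert p.state == {"acct": 100}       # no deposit lost, final state correct
```

This is why the interval shows up as recovery time rather than as a correctness gap: everything since the last snapshot is simply reprocessed from the log.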
The Streaming Ledger is data Artisans’ second proprietary product, joining the Application Manager that the company announced earlier this year with the launch of the dA Platform (which the company now spells out fully as the data Artisans Platform). While the ACID piece is not open source, the company is making the APIs that enable ACID free to download, which will let developers experiment with the technology on a small scale before committing to buying the product and implementing large-scale streaming ACID.
The company is hesitant to discuss the “secret sauce” of how it achieved ACID compliance in Streaming Ledger. “It’s not exactly MVCC,” Tzoumas says, referring to multi-version concurrency control. “It’s a form of … concurrency control that deals with a serializable schedule of incoming transactions using a variety of mechanisms that Flink offers, such as watermarks and checkpoints.”
The company believes that the technology, for which it has applied for patent protection in the United States and Europe, is a first in the industry. “We believe this is quite big news for the streaming market,” Metzger says, “and something that I hope our competition doesn’t have on their radar.”