July 3, 2017

A Peek Inside Kafka’s New ‘Exactly Once’ Feature

Here’s some great news for Apache Kafka users: The open source software will support exactly once semantics for stream processing with the upcoming version 0.11 release, thereby eliminating the need for application developers to code the important feature themselves.

Exactly once processing is considered one of the classic architectural challenges in distributed computing. Computing and network failures happen, and data is resent upon recovery from such failures, but ensuring that the application receives and processes each message only once is not a trivial task.

Without exactly once processing, users have to make a choice between inferior options when computing and network glitches invariably occur: Build a system that runs the risk of having a message delivered twice through what is known as “at least once processing,” or build a system that runs the risk of not having the message delivered at all (“at most once processing”).

Kafka supports both “at least once processing” and “at most once” approaches, with the latter being the default setting. Users requiring more consistency usually turn on “at least once processing” to ensure the message gets delivered, with the caveat that the developers must deal with the possibility of having duplicate messages. Having a duplicate message isn’t a big deal in some situations, but in some use cases, such as equities trading and banking, it’s a potential deal breaker. In situations where exactly once processing is critical, developers must take extra steps to provide the guarantee at the application level.

“It’s a pretty hard problem,” Confluent CTO Neha Narkhede told Datanami in an interview in February. “Right now the only way to solve it on top of Kafka has been solve it in the application and have every message have a unique identifier. But that puts a lot of burden on the user, which we don’t want.”
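The application-level workaround Narkhede describes can be sketched in a few lines. This is a minimal illustration, not Kafka code: the `DedupingHandler` class, the message IDs, and the account-balance example are all hypothetical, and a real system would need to store the seen IDs durably and update them atomically with the side effect.

```python
from dataclasses import dataclass, field


@dataclass
class DedupingHandler:
    """Applies each message at most once by remembering the IDs already seen."""

    seen_ids: set = field(default_factory=set)
    balance: int = 0

    def handle(self, message_id: str, amount: int) -> bool:
        # A redelivered message carries the same ID, so it is skipped.
        if message_id in self.seen_ids:
            return False
        self.seen_ids.add(message_id)
        self.balance += amount
        return True


handler = DedupingHandler()
# At-least-once delivery: "m2" arrives twice after a simulated retry.
deliveries = [("m1", 100), ("m2", 50), ("m2", 50), ("m3", -30)]
applied = [handler.handle(mid, amt) for mid, amt in deliveries]
```

Every consumer that cares about duplicates has to carry bookkeeping like this, which is the burden the new feature is meant to remove.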

In a blog post on Friday, Narkhede announced that exactly once semantics have been added to Apache Kafka with the upcoming release of version 0.11. Narkhede explained how the Confluent team built exactly once semantics into Kafka. It comes down to two main areas of focus: idempotence and atomic transactions.

Idempotence and Atomicity

Neha Narkhede is the co-founder and CTO of Confluent, co-creator of Apache Kafka, and one of Datanami’s People to Watch for 2017.

An idempotent operation is defined as one that can be performed many times without causing a different result. It’s a concept that’s often used within the field of REST Web services to build durability into applications.

With the 0.11 release of Kafka, the producer send operation is now idempotent, which means that, in the event of an error, a Kafka message will be written to the Kafka log only once, even if it’s sent by the producer multiple times. Kafka’s new idempotent send operation works similarly to Transmission Control Protocol (TCP), one of the core protocols underlying the Internet.

“Each batch of messages sent to Kafka will contain a sequence number which the broker will use to dedupe any duplicate send,” the CTO explains. However, unlike TCP, which provides guarantees only within a transient in-memory connection, the sequence number used by Kafka is persisted to the replicated log, she says. “So even if the leader fails, any broker that takes over will also know if a resend is a duplicate.”
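The sequence-number idea can be illustrated with a toy model of a single partition. This is a simplified sketch under stated assumptions, not the broker's actual implementation: the `PartitionLog` class, its return values, and the producer ID "p1" are hypothetical. The sequence number is stored in the log itself, so a replacement leader that rebuilds its state from the replicated records can still recognize a resend.

```python
class PartitionLog:
    """Toy model of one partition that dedupes by (producer_id, sequence)."""

    def __init__(self, records=None):
        # Each record is (producer_id, sequence, payload), so the dedupe
        # state can always be rebuilt from the log contents alone.
        self.records = list(records or [])
        self.last_sequence = {}
        for producer_id, sequence, _payload in self.records:
            self.last_sequence[producer_id] = sequence

    def append(self, producer_id, sequence, payload):
        last = self.last_sequence.get(producer_id, -1)
        if sequence <= last:
            return "duplicate"      # already written: acknowledge, do not append
        if sequence != last + 1:
            return "out_of_order"   # a gap means an earlier batch is missing
        self.records.append((producer_id, sequence, payload))
        self.last_sequence[producer_id] = sequence
        return "appended"


leader = PartitionLog()
results = [leader.append("p1", 0, "a"), leader.append("p1", 1, "b")]

# The leader fails; a new leader rebuilds its state from the replicated records.
new_leader = PartitionLog(leader.records)
results.append(new_leader.append("p1", 1, "b"))  # retry after a lost acknowledgment
results.append(new_leader.append("p1", 2, "c"))
payloads = [payload for _pid, _seq, payload in new_leader.records]
```

The retried batch is acknowledged but not written a second time, so the log ends up holding each payload exactly once.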

The second important area of focus for exactly once semantics is atomicity. In Kafka 0.11, a new transactions API is being introduced that allows a producer to send a batch of messages to multiple partitions in a manner such that either all the messages in a batch are eventually visible to any consumer, or none of them are ever visible.

“This feature also allows you to commit your consumer offsets in the same transaction along with the data you have processed, thereby allowing end-to-end exactly once semantics,” Narkhede explains.
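The all-or-nothing visibility rule can be sketched with another toy model. Everything here is an assumption made for illustration: the `ToyCluster` class, the partition names, and the representation of consumer offsets as one more partition. The model also simply filters out records from uncommitted transactions, which is cruder than what a real broker and consumer do.

```python
class ToyCluster:
    """Toy model: records become readable only once their transaction commits."""

    def __init__(self, partitions):
        self.logs = {p: [] for p in partitions}  # partition -> [(txn_id, payload)]
        self.status = {}                          # txn_id -> open / committed / aborted

    def begin(self, txn_id):
        self.status[txn_id] = "open"

    def send(self, txn_id, partition, payload):
        self.logs[partition].append((txn_id, payload))

    def commit(self, txn_id):
        self.status[txn_id] = "committed"

    def abort(self, txn_id):
        self.status[txn_id] = "aborted"

    def read_committed(self, partition):
        # Records from open or aborted transactions are never returned.
        return [payload for txn_id, payload in self.logs[partition]
                if self.status[txn_id] == "committed"]


cluster = ToyCluster(["out-0", "out-1", "offsets"])

cluster.begin("t1")
cluster.send("t1", "out-0", "x")
cluster.send("t1", "out-1", "y")
cluster.send("t1", "offsets", ("in-0", 42))  # consumer offset rides in the same transaction
before = cluster.read_committed("out-0")     # nothing visible while t1 is open
cluster.commit("t1")

cluster.begin("t2")
cluster.send("t2", "out-0", "z")
cluster.abort("t2")                          # "z" must never become visible

after = {p: cluster.read_committed(p) for p in cluster.logs}
```

Because the output records and the input offset become visible together, a restarted application either sees both or neither, which is what makes the read-process-write cycle exactly once.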

Performance Penalty?

The Kafka team had a variety of goals they wanted to hit when designing the exactly once processing feature into the system. First, it had to be easy to use. After all, Java developers have been working with exactly once semantics in Kafka, but at the application level, not the Kafka platform level.

But the simplicity of having exactly once enabled in Kafka by default shouldn’t come at the expense of performance. That in turn ramped up the complexity level that the Kafka team was working with, Narkhede says. “We eliminated a lot of simpler design alternatives due to the performance overhead that came with those designs,” she wrote.

The design that the Kafka team eventually settled on is more complex than earlier designs, but keeps the performance penalty to a minimum. The company says exactly once processing with in-order delivery brought a 3% performance penalty for a 1KB message sent at 100ms intervals, compared to using at-least once processing semantics with in-order delivery.

The performance hit for going exactly once with ordering guarantees increased to 20% compared to at-most once processing with no ordering guarantees, Narkhede writes. The company published its test methodology and benchmark results.

That performance penalty is relatively small, considering the durability gains that exactly once processing brings to the application, not to mention the resources saved by reducing the complexity at the application level. Just the same, though, Kafka users don’t have to use the idempotent send operations.

“In addition to ensuring low performance overhead for the new features,” Narkhede writes, “we also didn’t want to see a performance regression in applications that didn’t use the exactly-once features.” As a result, the Kafka team reworked the Kafka message format and made some other tweaks, all of which result in a 20% improvement in producer throughput and up to a 50% improvement in consumer throughput when small messages are used.

Testing, Testing

The Kafka team has been working to bring exactly once processing to the platform for more than a year. Because it is such a difficult challenge, the developers have taken pains to ensure that they’ve gotten it right.

That included publishing a 60-page design document that outlined every aspect of the design, as well as a nine-month review period during which the Confluent team endured extensive public scrutiny.

That added scrutiny resulted in some important design changes, the Confluent CTO says, including ditching consumer-side buffering for transactional reads with “smarter server-side filtering,” which resulted in avoiding a big performance penalty (see above).

“As a result,” Narkhede concludes, “we ended up with a simple design that also relies on the robust Kafka primitives to a substantial degree.”

The exactly once feature was tested extensively, including the use of “distributed chaos tests” that involved killing clients and servers to simulate worst-case scenarios. The feature worked, and data was neither lost nor duplicated.

Narkhede says the Kafka team is “beyond excited” to release the new exactly once feature into the broad Kafka community. But she warned that developers who are using the Kafka consumer API must still take steps to ensure that it works. It’s not magic pixie dust that you can just sprinkle on your apps.

However, for developers who are using the Kafka Streams API, the new exactly once feature “actually is a little bit like magic pixie dust,” she writes. “A single config change will give you the end-to-end guarantee.”
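For reference, the settings involved can be listed as follows. The property names are the configuration keys as we understand them from the Kafka 0.11 documentation; they are written as plain Python dictionaries only for illustration, since Kafka Streams itself is a Java library. The `to_properties` helper and the `transactional.id` value are hypothetical.

```python
# Producer: idempotent sends only.
PRODUCER_IDEMPOTENT = {"enable.idempotence": "true"}

# Producer: transactions (the ID value here is a made-up example).
PRODUCER_TRANSACTIONAL = {
    "enable.idempotence": "true",
    "transactional.id": "my-app-producer-1",
}

# Consumer: only read records from committed transactions.
CONSUMER_READ_COMMITTED = {"isolation.level": "read_committed"}

# Kafka Streams: the single config change mentioned in the quote.
STREAMS_EXACTLY_ONCE = {"processing.guarantee": "exactly_once"}


def to_properties(config):
    """Render a dict as Java .properties text, one key=value pair per line."""
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))


streams_properties = to_properties(STREAMS_EXACTLY_ONCE)
producer_properties = to_properties(PRODUCER_TRANSACTIONAL)
```

The contrast is the point: the Streams case is one line, while plain producer and consumer applications have to set several properties and use the transactions API correctly themselves.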
