TL;DR: Adding Kafka replicas expands the coordination surface that Flink’s two-phase commit must traverse. When a replica falls out of the ISR or a leader election happens mid-commit, the transaction can diverge and produce duplicates. Tighten transaction timeouts, enforce a minimum ISR size, and align Flink’s checkpointing so every broker agrees on each commit. The sections below walk through the steps that keep the guarantee intact.

Key Takeaways

- More replicas increase the chance of ISR churn, which can split a Flink transaction.
- Flink’s checkpoint snapshot is safe, but the final commit still depends on Kafka’s in-sync replicas.
- Properly tuned `transaction.timeout.ms`, `min.insync.replicas`, and HA JobManager settings keep exactly-once intact even with three-node replication.

The Replication Trap: Why More Isn’t Always Better

Adding more Kafka replicas seems to improve reliability, but in exactly-once pipelines it can introduce subtle failures. Each additional replica must stay in lock-step with the leader during every write.

When Flink writes a batch of records, it opens a Kafka transaction, writes to the leader, and waits for the ISR to acknowledge. If the ISR shrinks because a broker lags, the leader may still consider the write committed while that follower is behind.

# Example: Kafka topic config that tightens ISR requirements
replication.factor: 3    # chosen when the topic is created
min.insync.replicas: 2   # require at least two in-sync replicas
unclean.leader.election.enable: false

The above forces the cluster to reject `acks=all` writes unless at least two replicas (the leader plus one follower) are in sync. Without it, the leader can accept the write and then crash, leaving the follower with an incomplete copy.
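The acceptance rule can be pictured with a small model. This is a minimal sketch in plain Java, not Kafka code: the class `IsrWriteCheck` and its method names are hypothetical, and it assumes the producer writes with `acks=all` and that the minimum in-sync replica count includes the leader.

```java
// Simplified model of the broker-side check for acks=all writes.
// IsrWriteCheck and its methods are hypothetical names, not a Kafka API.
final class IsrWriteCheck {

    /** A write is accepted only while enough replicas (leader included) are in sync. */
    static boolean acceptsWrite(int inSyncReplicas, int minInsyncReplicas) {
        return inSyncReplicas >= minInsyncReplicas;
    }

    /** Number of replicas that can leave the ISR before writes start being rejected. */
    static int tolerableFailures(int replicationFactor, int minInsyncReplicas) {
        return Math.max(0, replicationFactor - minInsyncReplicas);
    }

    public static void main(String[] args) {
        int replicationFactor = 3;
        int minInsyncReplicas = 2;
        // Walk the ISR down from full strength to a lone leader.
        for (int isr = replicationFactor; isr >= 1; isr--) {
            System.out.println("ISR size " + isr + " -> write accepted: "
                    + acceptsWrite(isr, minInsyncReplicas));
        }
        System.out.println("Tolerable failures: "
                + tolerableFailures(replicationFactor, minInsyncReplicas));
    }
}
```

With three replicas and a minimum of two, the model accepts writes at ISR sizes 3 and 2 and rejects them once only the leader remains, which is the behavior the config above is meant to enforce.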

Flink’s checkpoint already captured the state before the write, so on recovery it will replay the same batch. If the follower later rejoins, the same records appear a second time downstream.

That subtle mismatch is why the “more replicas = safer” belief collapses under a Flink-driven exactly-once pipeline. The extra broker is not a passive copy; it is an active participant in the commit protocol.

Why the Obvious Fix, Adding Replicas, Fails in Practice

Two-phase commit (2PC) gives Flink a clean “prepare-then-commit” contract with Kafka. The prepare step flushes the open transaction to the partition leaders and records the transaction ID in Flink’s state. The commit step asks Kafka’s transaction coordinator to make the write visible to `read_committed` consumers.
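The prepare-then-commit contract is easier to reason about as a state machine. The following is a minimal sketch, assuming a single transaction and ignoring networking entirely; `TwoPhaseCommitSketch` is a hypothetical name and is not part of Flink or Kafka.

```java
// Toy state machine for one prepare-then-commit transaction.
// Hypothetical illustration only; it does not talk to Kafka.
final class TwoPhaseCommitSketch {
    enum State { OPEN, PREPARED, COMMITTED, ABORTED }

    private State state = State.OPEN;

    State state() {
        return state;
    }

    /** Phase 1: data is flushed; the transaction may no longer accept writes. */
    void prepare() {
        if (state != State.OPEN) {
            throw new IllegalStateException("cannot prepare from " + state);
        }
        state = State.PREPARED;
    }

    /** Phase 2: make the data visible. Repeating a commit is a no-op. */
    void commit() {
        if (state == State.COMMITTED) {
            return; // a retried commit after recovery must not fail
        }
        if (state != State.PREPARED) {
            throw new IllegalStateException("cannot commit from " + state);
        }
        state = State.COMMITTED;
    }

    /** Abandon the transaction unless it has already been committed. */
    void abort() {
        if (state == State.COMMITTED) {
            throw new IllegalStateException("cannot abort a committed transaction");
        }
        state = State.ABORTED;
    }
}
```

The idempotent `commit()` is the important design choice: after a restart the same commit may be attempted again, and the second attempt has to be harmless.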

If enough replicas remain in the ISR, the leader can commit even when one replica lags. That replica fetches the committed batch only after the leader has already marked the transaction complete.

If a failure forces Flink to roll back to the previous checkpoint, the lagging replica still holds the committed batch. When the broker catches up, the batch can be replayed, producing duplicates downstream.

# Flink Kafka connector properties
# 15 minutes
transaction.timeout.ms=900000
isolation.level=read_committed

A `transaction.timeout.ms` that is too short can expire while a replica is catching up, causing the coordinator to abort the transaction even though the leader has already accepted the write. Flink then thinks the write never happened, but the lagging broker will later deliver it, breaking exactly-once.
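A quick sanity check on the timeout can catch both failure directions before deployment. The sketch below is an assumption-laden illustration: `TransactionTimeoutCheck` is a hypothetical helper, the "checkpoint interval plus restart time" budget is a rule of thumb rather than an exact bound, and the 15-minute ceiling is Kafka's default for the broker setting `transaction.max.timeout.ms`.

```java
// Hypothetical helper that validates a producer transaction timeout
// against a time budget and the broker-side ceiling.
final class TransactionTimeoutCheck {
    enum Verdict { OK, TOO_SHORT, EXCEEDS_BROKER_MAX }

    static Verdict check(long transactionTimeoutMs,
                         long brokerMaxTimeoutMs,
                         long checkpointIntervalMs,
                         long maxRestartMs) {
        // The broker refuses producers that ask for more than its ceiling.
        if (transactionTimeoutMs > brokerMaxTimeoutMs) {
            return Verdict.EXCEEDS_BROKER_MAX;
        }
        // Assumed budget: a transaction must outlive one checkpoint interval
        // plus the time a job restart can take.
        if (transactionTimeoutMs <= checkpointIntervalMs + maxRestartMs) {
            return Verdict.TOO_SHORT;
        }
        return Verdict.OK;
    }

    public static void main(String[] args) {
        long brokerMax = 900_000L; // Kafka's default transaction.max.timeout.ms
        System.out.println(check(900_000L, brokerMax, 60_000L, 300_000L));
        System.out.println(check(1_800_000L, brokerMax, 60_000L, 300_000L));
        System.out.println(check(60_000L, brokerMax, 60_000L, 300_000L));
    }
}
```

The three calls in `main` cover a timeout that fits, one that the broker would reject, and one that would expire before a restart finishes.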

Split-brain scenarios also surface when a broker temporarily loses network connectivity. The leader may step down and a new leader is elected, yet the old leader (still alive) may finish its commit. Flink’s checkpoint does not see the second leader’s state, so it cannot reconcile the two halves.

Inside Flink’s State Management: Checkpoints vs. Two-Phase Commit

Flink’s checkpointing runs asynchronously. Every checkpoint captures the entire operator state, including the pending Kafka transaction IDs. As part of taking the snapshot, the sink flushes the records written since the previous checkpoint; this is the pre-commit.

Once the checkpoint is confirmed complete, Flink invokes the commit phase of the Kafka 2PC. At this point the ISR set decides whether the transaction truly succeeds. If the ISR has shrunk, the commit may succeed on the leader but not on all followers. Flink keeps treating the transaction as pending until the commit call returns.

// Pseudo-code showing the commit flow
public void notifyCheckpointComplete(long checkpointId) {
    // called only after the checkpoint has been confirmed complete
    kafkaProducer.commitTransaction(); // blocks until the commit is acknowledged
}

The checkpoint does not guarantee that every replica has persisted the data; it only guarantees that Flink’s own state is consistent with the leader. If the ISR changes between the snapshot and the commit, the guarantee evaporates.
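The gap between snapshot and commit can be made concrete with a small bookkeeping model. This is a sketch under stated assumptions, not Flink’s implementation: `PendingTransactionsSketch` and its method names are hypothetical, and it assumes the commit is issued only once a checkpoint is reported complete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical bookkeeping for transactions that are pre-committed at a
// checkpoint and committed once that checkpoint is confirmed.
final class PendingTransactionsSketch {
    // checkpointId -> transactional id pre-committed at that checkpoint
    private final TreeMap<Long, String> pending = new TreeMap<>();
    private final List<String> committed = new ArrayList<>();

    /** Snapshot time: remember which transaction belongs to this checkpoint. */
    void onSnapshot(long checkpointId, String transactionalId) {
        pending.put(checkpointId, transactionalId);
    }

    /** Confirmation time: commit every transaction up to and including this checkpoint. */
    void onCheckpointComplete(long checkpointId) {
        Map<Long, String> ready = pending.headMap(checkpointId, true);
        committed.addAll(ready.values());
        ready.clear(); // the view is backed by 'pending', so this removes the entries
    }

    List<String> committed() {
        return new ArrayList<>(committed);
    }

    int pendingCount() {
        return pending.size();
    }
}
```

Anything still in `pending` when a failure hits is exactly the window this section describes: the data is flushed to Kafka, but whether it becomes visible depends on what happens between the snapshot and the commit.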

When Replication Breaks Exactly-Once: Real-World Failure Patterns

Scenario A - Leader election mid-commit

Flink opens a transaction, writes to leader.
Leader crashes before all replicas acknowledge.
New leader is elected from the remaining ISR.
Old leader, still alive, finalizes the commit.
Flink’s checkpoint sees the commit succeed on the new leader, and the old leader’s write reappears when it rejoins, causing a duplicate.

Scenario B - ISR shrinkage after a broker bounce

Broker B lags, ISR drops to 2 of 3.
Flink writes a transaction; leader commits because quorum (2) is satisfied.
Broker B restarts, syncs the log, and replays the committed batch.
Flink’s checkpoint still points to the previous offset, so downstream sees the batch twice.

Both patterns surface when the ISR is not stable. The symptoms are subtle: occasional spikes in consumer lag, duplicate rows in a data lake, and a sudden jump in Flink’s `numberOfFailedCheckpoints`.

Implementation Playbook: Keeping Exactly-Once with Replicated Kafka

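Raise the transaction timeout so that it exceeds the longest checkpoint interval. The broker rejects values above its `transaction.max.timeout.ms` (15 minutes by default), so raise that broker setting as well when going higher.

```properties
# 30 minutes
transaction.timeout.ms=1800000
```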

Force a stable ISR by setting `min.insync.replicas` to at least two and disabling unclean leader elections.

```yaml
min.insync.replicas: 2
unclean.leader.election.enable: false
```

Read only committed records on every Flink consumer.

```properties
isolation.level=read_committed
```
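What `read_committed` changes for a consumer can be sketched with a toy log. This is an illustrative model only: `ReadCommittedSketch` is a hypothetical name, each record carries the state of its transaction directly, and the sketch assumes the consumer stops at the first record whose transaction is still open.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of which records a read_committed consumer would see.
final class ReadCommittedSketch {
    enum TxnState { OPEN, COMMITTED, ABORTED }

    static final class LogRecord {
        final String value;
        final TxnState txnState;

        LogRecord(String value, TxnState txnState) {
            this.value = value;
            this.txnState = txnState;
        }
    }

    /** Returns the values visible to a consumer that only reads committed data. */
    static List<String> visible(List<LogRecord> log) {
        List<String> out = new ArrayList<>();
        for (LogRecord r : log) {
            if (r.txnState == TxnState.OPEN) {
                break; // nothing past the first open transaction is delivered yet
            }
            if (r.txnState == TxnState.COMMITTED) {
                out.add(r.value); // aborted records are skipped
            }
        }
        return out;
    }
}
```

For a log of `a` (committed), `b` (aborted), `c` (committed), `d` (open), `e` (committed), the model returns `a` and `c`: the aborted record is filtered, and everything from the open transaction onward waits until that transaction resolves.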

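Pin Flink’s checkpoint to the leader ISR - configure the Kafka source to resume from committed offsets and read only committed records. The builder below has no option that fails the job when the ISR shrinks; that rejection comes from the broker once `min.insync.replicas` is no longer met.

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.kafka.clients.consumer.OffsetResetStrategy;

KafkaSource<String> source = KafkaSource.<String>builder()
    .setBootstrapServers("kafka:9092")
    .setGroupId("flink-job")
    .setTopics("events")
    // resume from committed offsets; fall back to earliest if none exist
    .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
    .setValueOnlyDeserializer(new SimpleStringSchema())
    // only surface records from committed transactions
    .setProperty("isolation.level", "read_committed")
    .build();
```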

Run the JobManager in HA mode so that Flink’s own coordinator does not suffer a split-brain.

```yaml
high-availability: zookeeper
high-availability.storageDir: hdfs://namenode:8020/flink/ha/
# also set high-availability.zookeeper.quorum to your ZooKeeper hosts
```

Monitor the right metrics - watch `kafka.consumer.lag` and Flink’s `numberOfFailedCheckpoints`. Spikes often precede an ISR-related failure.

```bash
# Example Prometheus query (the metric name depends on your exporter)
sum by (job) (kafka_consumer_lag{topic="events"})
```
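What counts as a "spike" is worth pinning down before wiring up an alert. The sketch below is one simple possibility, assuming a spike means the latest sample exceeds a multiple of the recent average; `MetricSpikeSketch`, the window, and the factor are all hypothetical choices rather than recommended values.

```java
// Hypothetical spike detector: flags a sample that exceeds a multiple
// of the average of recent samples.
final class MetricSpikeSketch {

    static boolean isSpike(double[] history, double latest, double factor) {
        if (history.length == 0) {
            return false; // no baseline yet, so nothing to compare against
        }
        double sum = 0;
        for (double v : history) {
            sum += v;
        }
        double mean = sum / history.length;
        return latest > factor * mean;
    }

    public static void main(String[] args) {
        double[] recentLag = {100, 120, 110};
        System.out.println(isSpike(recentLag, 500, 3.0));
        System.out.println(isSpike(recentLag, 200, 3.0));
    }
}
```

A relative threshold like this adapts to pipelines with different baseline lag, at the cost of being noisy when the baseline is near zero.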

These steps tighten the contract between Flink and Kafka. When every broker in the ISR must acknowledge before a commit, the exactly-once guarantee survives even a three-replica deployment.

Payoff: What Happens When Exactly-Once Holds Up Under Replication

Zero duplicate events flow into downstream lakes, warehouses, and dashboards. Analytics teams trust raw tables without writing de-duplication jobs.

Latency stays predictable because the extra replicas only add network hops, not coordination retries. Operators see the same end-to-end latency as a single-replica setup, but with the fault-tolerance of three brokers.

Operational toil drops dramatically. No more manual replay scripts, no emergency alerts about duplicate transaction IDs. The system runs unattended for months, showing that exactly-once can coexist with high replication.

Consider reviewing your Kafka and Flink settings today.

Frequently Asked Questions

Q: Can I achieve exactly-once with Kafka replication enabled?

A: Yes, but you must align Flink’s transaction timeout, checkpointing, and Kafka’s ISR settings so that all replicas agree on each commit.

Q: What is the safest replication factor for a production Kafka-Flink pipeline?

A: Three replicas is a common sweet spot; it gives fault tolerance while keeping the ISR manageable for Flink’s two-phase commit.

Q: Do I need to change my Flink job code to handle extra replicas?

A: No code change is required, but you must adjust connector properties and checkpoint configuration as described in the playbook.

Q: How does exactly-once differ from at-least-once in a replicated setup?

A: At-least-once tolerates duplicates, so leader changes are harmless. Exactly-once requires every broker in the ISR to commit the same transaction, which replication can jeopardize if not tuned.

Q: Is there a monitoring metric that tells me when exactly-once is at risk?

A: Watch `kafka.consumer.lag` and Flink’s `numberOfFailedCheckpoints`; spikes in either often precede a split-brain commit scenario.
