TL;DR
Kubernetes HA looks solid on paper, but a misconfigured leader election can silently stall stateful services. Switch from hand-rolled ConfigMap-based elections to the built-in Lease primitive and use client-go’s leaderelection library. The change turns 409 conflicts into cleanly handled outcomes, sharply narrows the window for split-brain, and makes leadership observable.
Key Takeaways
- Cluster-level HA does not protect you from application-level leader election bugs.
- Default ConfigMap elections generate 409 Conflict errors that hide behind “everything is fine”.
- A Lease-based election with proper RBAC and metrics gives atomic renewals and observable health.
Why High-Availability Claims Miss the Real Failure Point

Most SREs celebrate a rolling update strategy as proof that their stack cannot go down. The confidence feels safe, but it hides a fragile assumption. Every component inside the cluster must also respect HA. Stateful services such as databases, schedulers, and consensus daemons rely on a single active instance. If that instance loses leadership silently, the whole service stalls while the rest of the cluster keeps humming.
The gap shows up only when a pod restarts or a node is drained. The new pod thinks it is the leader because the old leader’s lease expired, yet the old pod still holds a lock in memory. Requests start timing out, logs fill with “no leader” messages, and the incident ticket lands on the SRE board. Because the failure is internal to the application, the cluster health dashboard stays green. The outage feels like a phantom. But how do you detect it before it escalates?
Why does this happen? The common pattern is to use a shared ConfigMap as a lock. Pods write a timestamp into a key, read it back, and assume leadership if the timestamp is recent. When two pods race to update the same key, the API server returns a 409 Conflict. The client retries, but the retry logic often swallows the error, treating it as a transient network glitch. The result is a period where no pod holds the lock, or worse, two pods think they own it because each saw a different successful write.
Even worse, the ConfigMap approach lacks a TTL. A pod that crashes without cleaning up leaves stale data that other pods interpret as a valid leader. The service appears healthy, its readiness probe passes, while the real work never moves forward.
The false comfort of “cluster-level HA” blinds teams to this application-level fragility. They add more replicas and tighten node pools, yet they still see intermittent stalls. The usual fixes, such as adding replicas or tweaking readiness probes, only mask the deeper issue.
The Hidden Pitfalls of Default Leader Election Settings
When you inspect a failing stateful service, the first thing you see is a flurry of `Failed to acquire leader lock` messages. Under the hood those messages are HTTP 409 responses from the API server. The conflict occurs because multiple pods try to write to the same ConfigMap key at the same time. The API server enforces optimistic concurrency: the write must include the latest `resourceVersion`. If a pod’s view is stale, the write is rejected.
Because the error is a standard API response, it blends in with normal client-side retries. Most libraries treat 409 as “try again later”, so the pod logs a warning and continues looping. No alert fires, no metric spikes, and the service silently loses its coordinator.
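To make the failure mode concrete, here is a minimal in-memory sketch of the optimistic concurrency check described above. It is not the Kubernetes API: `store`, `object`, `tryLock`, and `errConflict` are hypothetical names invented for illustration, and the mutex-guarded struct only imitates the API server's compare-and-swap on `resourceVersion`. The point is that a conflict is a definite answer ("someone else wrote first"), not a transient glitch to retry blindly.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errConflict stands in for the API server's HTTP 409 response.
var errConflict = errors.New("409 Conflict: resourceVersion is stale")

// object is a toy stand-in for a ConfigMap used as a lock.
type object struct {
	ResourceVersion int
	Holder          string
}

// store imitates the API server's compare-and-swap on resourceVersion.
type store struct {
	mu  sync.Mutex
	obj object
}

// get returns a snapshot of the current object.
func (s *store) get() object {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.obj
}

// update succeeds only if the caller read the latest resourceVersion.
func (s *store) update(seen object, holder string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if seen.ResourceVersion != s.obj.ResourceVersion {
		return errConflict
	}
	s.obj = object{ResourceVersion: seen.ResourceVersion + 1, Holder: holder}
	return nil
}

// tryLock reports whether the caller won the race. A conflict is surfaced as
// "did not win" so the caller can re-read the object and stand down.
func tryLock(s *store, seen object, id string) (bool, error) {
	err := s.update(seen, id)
	if errors.Is(err, errConflict) {
		return false, nil
	}
	return err == nil, err
}

func main() {
	s := &store{}
	// Both pods read the same version before either writes.
	seenA, seenB := s.get(), s.get()
	wonA, _ := tryLock(s, seenA, "pod-a")
	wonB, _ := tryLock(s, seenB, "pod-b")
	fmt.Println(wonA, wonB, s.get().Holder) // true false pod-a
}
```

A client that logs the conflict and keeps acting as if its write might have landed is where the trouble starts. Treating the 409 as a lost race keeps the loser from ever assuming leadership.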
Two dangerous states can emerge:
- Split-brain: Two pods each succeed in writing a ConfigMap entry with different `resourceVersion`s because they read the map at slightly different moments. Both believe they are the leader, and they start processing the same workload. Data corruption follows.
- Leader vacuum: All pods receive 409 errors and back off long enough that none ever writes a fresh timestamp. The service stops processing new requests until a manual restart.
Both scenarios are invisible until a downstream client reports missing data or duplicated actions. The root cause, misusing ConfigMap as a lock, remains hidden because the Kubernetes control plane reports no errors.
The fix is not “more retries”. It is to replace the hand-rolled lock with a primitive designed for the job. Kubernetes introduced the Lease object exactly for this purpose. A Lease is a first-class resource with `spec.holderIdentity`, `spec.renewTime`, and `spec.leaseDurationSeconds` fields. Each update is a single atomic write, and `leaseDurationSeconds` acts as a built-in TTL. The API server does not clear the lease on its own, but when the holder fails to renew before the deadline, every candidate can see that the lease has expired and may claim it.
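Switching to Lease does not make 409 responses disappear: Lease writes still pass through the same optimistic concurrency check on `resourceVersion`. What changes is how a conflict is handled. With client-go’s leaderelection library, a rejected write means another candidate got there first, so the loser re-reads the Lease and stands down instead of looping blindly. Only one write per `resourceVersion` can succeed, which removes the race that the hand-rolled ConfigMap lock suffered.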
Understanding the root cause reveals a surprisingly simple tool. What does that tool look like in code?
Using Kubernetes Lease and client-go for Reliable Elections
The Kubernetes community ships a ready-made library: `k8s.io/client-go/tools/leaderelection`. It wraps the Lease API, retries on a jittered `RetryPeriod`, and exposes a metrics hook for leadership status. Using it requires three pieces:
- A Lease object in the same namespace as the pods.
- RBAC that lets the pod `get`, `create`, `update`, and `patch` the Lease.
- Leaderelection configuration that points to the Lease and defines callbacks for `OnStartedLeading`, `OnStoppedLeading`, and `OnNewLeader`.
Here’s a minimal Lease manifest:
```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: my-service-leader
  namespace: production
spec:
  holderIdentity: ""
  leaseDurationSeconds: 15
  renewTime: null
```
And the Go snippet that drives the election:
```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func startLeaderElection() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("loading in-cluster config: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("building clientset: %v", err)
	}

	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "my-service-leader",
			Namespace: "production",
		},
		// LeaseLock takes the coordination.k8s.io/v1 client, not the whole clientset.
		Client: client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: os.Getenv("POD_NAME"), // must be unique per pod
		},
	}

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(c context.Context) {
				// runService is the application's own work loop (defined elsewhere);
				// it should return once c is cancelled.
				runService(c)
			},
			OnStoppedLeading: func() {
				log.Println("Lost leadership, shutting down")
				os.Exit(0)
			},
			OnNewLeader: func(identity string) {
				log.Printf("New leader elected: %s\n", identity)
			},
		},
	})
}
```
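client-go exposes a metrics hook for leader election, and Kubernetes’ own components use it to publish `leader_election_master_status`. A renewal-failure counter such as `leader_election_lease_renew_failure_total` may have to be exported by your own code, so confirm which metrics your build actually emits before alerting on them. Once in Prometheus, these metrics can trigger alerts when renewals fail repeatedly. Because the Lease object is the single source of truth, the window in which two pods both believe they are leader shrinks dramatically. It does not vanish entirely: a paused or clock-skewed old leader can keep working briefly after its lease lapses, so keep `RenewDeadline` comfortably below `LeaseDuration` and make leader-only work safe to repeat.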
The elegance of this approach lies in its zero-dependency nature: you don’t need an external store, a sidecar database, or a custom lock service. This simplicity reduces attack surface and operational overhead. Everything lives inside the Kubernetes API server, which is already backed by etcd and its strong consistency guarantees.
Step-by-Step: Harden Leader Election in Your Stateful Service

- Create the Lease and RBAC
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: leader-election-role
  namespace: production
rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: leader-election-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: my-service-sa
    namespace: production
roleRef:
  kind: Role
  name: leader-election-role
  apiGroup: rbac.authorization.k8s.io
```
The Role limits permissions to the Lease resource only, following the principle of least privilege.
- Deploy the Service with a Sidecar (optional)
If you cannot modify the main binary, run a sidecar that hosts the leaderelection loop and writes a file `/tmp/is-leader` that the app reads before processing.
```yaml
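containers:
  - name: app
    image: my-company/app:latest
    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
    volumeMounts:
      - name: leader-state          # shared so the app can read /tmp/is-leader
        mountPath: /tmp
  - name: leader-elector
    image: golang:1.22              # placeholder: the image must contain ./leader-elector
    command: ["./leader-elector"]
    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
    volumeMounts:
      - name: leader-state
        mountPath: /tmp
volumes:
  - name: leader-state              # containers do not share a filesystem without this
    emptyDir: {}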
```
- Add Observability
  - Export `leader_election_master_status` as a gauge (1 if leader, 0 otherwise).
  - Track `leader_election_lease_renew_failure_total`.
  - Set an alert: if failures stay above roughly 3 per 30-second window (a rate above 0.1 per second), page the on-call engineer.
```yaml
# prometheusRule.yaml
- alert: LeaderElectionStalled
  expr: rate(leader_election_lease_renew_failure_total[1m]) > 0.1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Leader election renewals are failing"
    description: "Check pod {{ $labels.pod }} for lease issues."
```
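If your build does not already emit these two series, one way to produce them is to keep the state in the election callbacks and render it in the Prometheus text format. The sketch below uses only the standard library; `electionMetrics` and `render` are hypothetical names, the real `leader_election_master_status` series normally carries labels that are omitted here, and a production service would more likely use a Prometheus client library.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// electionMetrics holds the two values the alert rule depends on.
type electionMetrics struct {
	leader        atomic.Bool   // set in OnStartedLeading, cleared in OnStoppedLeading
	renewFailures atomic.Uint64 // incremented by your own code when a renewal fails
}

// render returns the metrics in the Prometheus text exposition format.
func (m *electionMetrics) render() string {
	status := 0
	if m.leader.Load() {
		status = 1
	}
	return fmt.Sprintf(
		"# TYPE leader_election_master_status gauge\n"+
			"leader_election_master_status %d\n"+
			"# TYPE leader_election_lease_renew_failure_total counter\n"+
			"leader_election_lease_renew_failure_total %d\n",
		status, m.renewFailures.Load())
}

func main() {
	m := &electionMetrics{}
	m.leader.Store(true)
	m.renewFailures.Add(2)
	fmt.Print(m.render())
}
```

Serving the result of `render` from a `/metrics` HTTP handler is enough for Prometheus to scrape it.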
- Chaos Test the Hand-over
Use `kubectl delete pod <leader-pod>` or a Chaos Mesh experiment to kill the leader. Verify that another pod becomes leader within roughly one lease duration and that the metric `leader_election_master_status` flips accordingly.
```bash
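# Find the current leader: holderIdentity is the pod name, since Identity is set from POD_NAME
LEADER=$(kubectl get lease my-service-leader -n production -o jsonpath='{.spec.holderIdentity}')
# Kill leader pod
kubectl delete pod "$LEADER" -n production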
# Watch lease
kubectl get lease my-service-leader -n production -o yaml
```
- Validate Long-Running Stability
Run the service in a staging cluster for at least a week, monitoring the renewal metrics. Compare renewal failures, leadership changes, and request error rates against the period before the change to confirm that availability actually improved.
What Perfect Leader Election Looks Like in Production
In a well-engineered deployment, the leader’s `leader_election_master_status` gauge stays steady at `1`, and renewal failures drop to zero. When a node is drained, the hand-over happens within roughly one lease interval, about 15 seconds with the settings above. No request spikes, no error bursts, and the service’s latency curve remains flat.
Key signals of a healthy system:
- Zero unexpected restarts for the stateful workload over weeks of traffic.
- Consistent request latency (e.g., 95th-percentile unchanged) even during node failures.
- No split-brain alerts from downstream systems that rely on a single source of truth.
The modest cost is a single etcd write per renewal, which is negligible compared to the reliability gains. In practice, this translates to a few milliseconds per heartbeat.
If you’re still using ConfigMap for leadership, you’re leaving a hidden HA gap wide open. For a service that already runs an election loop, switching to Lease is a small, contained change: a new lock object, a Role, and a redeploy.
Ready to audit your own leader election setup? Start by checking for 409 Conflict spikes in your API server logs. They might already hold the clue.
Frequently Asked Questions
Q: How can I detect leader election conflicts in Kubernetes?
A: Export leader election metrics from your service and set alerts on high 409 Conflict rates or lease renewal failures; tools like Prometheus and Grafana can surface these quickly.
Q: Should I use ConfigMap or Lease for leader election?
A: Lease is the recommended primitive because it provides atomic updates and built-in TTL handling, eliminating the race conditions that plague ConfigMap-based elections.
Q: Can I retrofit an existing stateful app with the Lease-based election?
A: Yes - add a sidecar that runs the leaderelection library or modify the app to import client-go’s leaderelection package, then point it at a newly created Lease object.
Q: What monitoring should I add for leader election health?
A: Track metrics like `leader_election_master_status`, `leader_election_lease_renew_failure_total`, and watch for sudden drops in the leader count; combine with log alerts on conflict errors.
Q: Does using Lease affect my cluster's performance?
A: Lease updates are lightweight (a single etcd write per renewal) and have negligible impact compared to the reliability gains they provide.
Consider reviewing your leader election setup today. A quick audit can uncover hidden race conditions.
