TL;DR: Adding pods sounds safe, but over-replicating a StatefulSet can exhaust schedulable capacity, destabilize quorum-based leader election, and trigger cascading evictions.
Size replicas to match failure domains. Set realistic resource requests and protect quorum with PodDisruptionBudgets, HPA limits, and focused alerts.
Key Takeaways
- Over-replicating a StatefulSet can starve the scheduler and cause split-brain elections.
- A three-pod quorum spread across zones often outperforms five scattered pods.
- Combine PodDisruptionBudget, conservative HPA limits, and eviction monitoring to keep HA intact.
More Replicas ≠ Higher Availability

The mantra “more pods = higher availability” feels obvious, until the cluster stalls. Each extra pod reserves CPU and memory and needs a node with room to run it. When those resources run low on a node, the kubelet evicts pods to keep the node healthy.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: heavy-worker
spec:
  containers:
    - name: app
      image: myapp:latest
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
```
Doubling the replica count of a pod like the one above doubles the CPU and memory the scheduler has to reserve. In a tight cluster this can exhaust allocatable node resources, and the kube-scheduler then leaves the extra pods Pending until capacity frees up.
- Cascading eviction: When a node runs short of memory, the kubelet evicts pods, starting with those that exceed their requests and have lower priority. A newly created replica is often among them.
- Service disruption: The evicted pod is removed from the Service's endpoints, shrinking the pool that can answer requests.
The cluster looks over-replicated but actually loses capacity under load. How can you tell when the scheduler is struggling?
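The capacity arithmetic behind those Pending pods can be sketched in a few lines of Python. This is a rough model that looks only at resource requests, using the `heavy-worker` requests from the manifest above; the node size (2000m CPU, 3072Mi memory allocatable) and the function names are assumptions made for illustration, and the real scheduler weighs many more factors.

```python
def pods_that_fit(allocatable_cpu_m: int, allocatable_mem_mi: int,
                  request_cpu_m: int, request_mem_mi: int) -> int:
    """Pods one node can accept, judged only by resource requests."""
    return min(allocatable_cpu_m // request_cpu_m,
               allocatable_mem_mi // request_mem_mi)


def pending_pods(replicas: int, nodes: int, per_node: int) -> int:
    """Replicas left Pending once every node is full."""
    return max(0, replicas - nodes * per_node)


# Hypothetical node size; requests match the heavy-worker pod (500m, 512Mi).
per_node = pods_that_fit(2000, 3072, request_cpu_m=500, request_mem_mi=512)
print(per_node)                       # 4 (CPU is the limiting resource)
print(pending_pods(10, 3, per_node))  # 0
print(pending_pods(20, 3, per_node))  # 8
```

On this assumed three-node cluster, going from 10 to 20 replicas leaves eight pods unschedulable instead of adding capacity.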
Why Scaling StatefulSets Backfires
StatefulSets give each pod a stable network identity and a persistent volume. The clustered workloads that run in them, such as replicated databases, usually rely on a leader election process to decide which replica writes to the shared store. Pushing the replica count beyond what the cluster can schedule triggers two problems at once.
First, the scheduler obeys anti-affinity rules and tries to spread pods across zones. Too many replicas force some zones to host more pods than they can handle, creating zone-level pressure. Containers on nodes in that zone get throttled, and pods that the kubelet evicts show up with the status Failed.
Second, every replica joins the election. The protocol needs a quorum, usually a majority of pods. If pods are evicted or stuck in Pending, the quorum can disappear and no leader can be elected. Systems without strict quorum checks or fencing fare worse: they can drift into a split-brain state where more than one pod thinks it is the leader.
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: "db"
  replicas: 5 # risky if cluster is small
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: postgres
          image: postgres:13
          ports:
            - containerPort: 5432
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
```
A five-replica PostgreSQL StatefulSet on a three-node cluster forces at least two pods onto the same node. This raises eviction risk.
Worked scenario: zone failure with five replicas
Assume a two-zone cluster (zone-A, zone-B) with three nodes each. The scheduler spreads the five pods as 3-2. If zone-A loses power, only two pods survive in zone-B - insufficient for a majority of five. The leader election stalls, and the database becomes unavailable.
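The 3-2 arithmetic can be checked with a small Python sketch. It assumes a simple majority quorum; the function names and the three-zone layout with `zone-C` are hypothetical and appear only for comparison.

```python
def quorum(replicas: int) -> int:
    """Smallest strict majority of the replica count."""
    return replicas // 2 + 1


def zone_loss_outcomes(pods_per_zone: dict[str, int]) -> dict[str, bool]:
    """Map each zone to True if quorum survives losing that zone."""
    total = sum(pods_per_zone.values())
    needed = quorum(total)
    return {zone: total - count >= needed
            for zone, count in pods_per_zone.items()}


# Five replicas over two zones, as in the scenario above.
print(zone_loss_outcomes({"zone-A": 3, "zone-B": 2}))
# {'zone-A': False, 'zone-B': True}

# Hypothetical alternative: three replicas, one per zone.
print(zone_loss_outcomes({"zone-A": 1, "zone-B": 1, "zone-C": 1}))
# {'zone-A': True, 'zone-B': True, 'zone-C': True}
```

The result depends on how the pods map onto failure domains, not on the raw replica count.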
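What changes the outcome? With only two zones, no replica count helps: whichever zone holds half or more of the pods takes the quorum down with it. Add a third zone, run three replicas with one per zone, and enforce a `PodDisruptionBudget` that guarantees two pods remain during upgrades and drains. Losing any single zone now leaves two of three pods, which is still a majority, and the service stays up. Keep in mind that a PodDisruptionBudget limits voluntary disruptions only; it cannot prevent a zone outage. Curious how leader election works under the hood?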
The Mechanics That Break HA: Leader Election & Resource Limits
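Kubernetes’ built-in leader election uses a Lease object stored in the API server. The current leader records its identity in the lease and renews a timestamp at a fixed interval; another candidate takes over only when the lease has not been renewed within its duration. The lease itself stays consistent because the API server and etcd serialize updates, while databases that run their own consensus protocol additionally need a majority of members to elect a leader.
When pods are evicted because of CPU or memory pressure, lease updates stop. The remaining candidates keep trying to acquire leadership, but in a quorum-based system they cannot elect anyone while a majority is missing.
The worst case is a split-brain where two pods think they own the lease, for example when a stalled leader has not yet noticed that its lease expired and a second pod has already taken over. Both serve traffic and both try to write to the same shared store.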
Resource overcommit makes the problem worse. When a container reaches its CPU limit, which the kernel enforces through the cgroup CFS quota (`cpu.cfs_quota_us`), the Linux scheduler throttles the pod’s CPU, slowing lease renewals. A delayed renewal looks like a missed heartbeat, prompting another election.
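To make the missed-heartbeat effect concrete, here is a minimal Python model of a time-based lease. It is a simplified sketch, not the Kubernetes client implementation: the `Lease` class, the `try_acquire` helper, and the 15-second duration are assumptions for illustration, and a real client also relies on the API server rejecting conflicting writes.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    holder: str
    renew_time: float  # seconds, when the holder last renewed
    duration: float    # seconds the lease stays valid after a renewal


def is_expired(lease: Lease, now: float) -> bool:
    """A challenger may take over only after the lease has lapsed."""
    return now > lease.renew_time + lease.duration


def try_acquire(lease: Lease, candidate: str, now: float) -> Lease:
    """Return the lease as it looks after `candidate` tries to acquire it."""
    if lease.holder == candidate or is_expired(lease, now):
        return Lease(candidate, now, lease.duration)
    return lease  # still validly held by someone else


lease = Lease(holder="db-0", renew_time=0.0, duration=15.0)
# db-0 is throttled and misses its renewals; db-1 retries at t=10 and t=16.
print(try_acquire(lease, "db-1", now=10.0).holder)  # db-0
print(try_acquire(lease, "db-1", now=16.0).holder)  # db-1
```

In this model the takeover at t=16 happens even though `db-0` is still running. If `db-0` keeps acting as leader until it next checks the lease, the two pods overlap, which is the window that fencing or shorter renew deadlines are meant to close.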
Zone-aware scheduling matters because a quorum must survive a zone failure. Imagine three replicas spread across two zones (2 in zone A, 1 in zone B). If zone A goes down, the single replica in zone B cannot form a majority and the service stalls. Adding a fourth replica in zone A does not help; it only adds more pressure to the failing zone.
Key failure mechanisms
- Quorum loss - fewer than ⌊N/2⌋ + 1 healthy pods → no leader.
- Split-brain - multiple pods think they hold the lease because they cannot see each other’s updates.
- Resource throttling - CPU limits delay lease renewals, causing needless elections.
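The quorum threshold is easy to tabulate. The short Python sketch below assumes a simple majority quorum (the helper names are illustrative) and prints, for each replica count, the quorum size and how many pods can fail before the leader is lost.

```python
def quorum(replicas: int) -> int:
    """Smallest strict majority: floor(N/2) + 1."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Pods that can be lost while a majority remains."""
    return replicas - quorum(replicas)


print("replicas quorum tolerated")
for n in range(1, 8):
    print(f"{n:>8} {quorum(n):>6} {tolerated_failures(n):>9}")
```

Going from three to four replicas, or from five to six, raises the quorum size without raising the number of failures the set can absorb, which is why an extra replica often buys nothing.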
How can you guard against these failures?
Designing Safe Replica Strategies for Kubernetes HA

A safe replica plan starts with failure-domain analysis. Identify how many zones or racks your cluster spans. Then place just enough pods to survive the loss of any single domain.
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db-pdb
spec:
  minAvailable: 2 # guarantees quorum for a 3-replica set
  selector:
    matchLabels:
      app: db
```
The PDB above ensures that at most one pod can be voluntarily evicted during upgrades. This preserves a quorum of two for a three-replica StatefulSet.
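The eviction budget follows from simple subtraction. As a rough Python sketch (the function name is illustrative), the number of voluntary evictions a `minAvailable` budget permits is the count of healthy pods minus the minimum:

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions the budget permits right now."""
    return max(0, healthy_pods - min_available)


print(disruptions_allowed(healthy_pods=3, min_available=2))  # 1
print(disruptions_allowed(healthy_pods=2, min_available=2))  # 0
```

Once one pod is already down, for whatever reason, the budget drops to zero and a node drain has to wait until the pod is healthy again.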
Step-by-step checklist for a resilient StatefulSet
- Map failure domains - list the zones, racks, or nodes the cluster spans.
- Choose quorum size - set `minAvailable` to ⌊replicas/2⌋ + 1.
- Define anti-affinity - spread pods across zones:
```yaml
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: db
      topologyKey: topology.kubernetes.io/zone
```
- Set realistic resource requests - base requests on observed usage plus a 20% buffer.
- Cap horizontal scaling - use HPA with a conservative `maxReplicas`.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```
The HPA limits scaling to four replicas, preventing runaway growth that could overwhelm the cluster.
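To see the cap in action, the Python sketch below applies the core scaling rule from the Kubernetes documentation, desired = ceil(current × currentMetric / targetMetric), and then clamps the result to the configured bounds. It is a simplification: the function name is illustrative, and the real controller also applies a tolerance band and stabilization windows that are left out here.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Core HPA ratio rule, clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


# Bounds and target taken from the api-hpa manifest (2 to 4 replicas, 60% CPU).
print(desired_replicas(2, 90, 60, 2, 4))   # 3
print(desired_replicas(4, 120, 60, 2, 4))  # 4 (uncapped result would be 8)
print(desired_replicas(3, 10, 60, 2, 4))   # 2 (floor at minReplicas)
```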
Validate the leader election logic before you go live. Tools like chaos-mesh can inject pod failures and network partitions, letting you observe whether the quorum holds.
```bash
# Example: kill one replica and watch lease status
kubectl delete pod db-0
kubectl get lease db-leader -o yaml
```
Finally, monitor etcd health, pod eviction rates, and lease renewal latency. Set alerts when eviction rates exceed a few percent of total pods. Also alert when lease renewals take longer than a second. What will you see when the system is truly resilient?
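The two alert conditions can be expressed as a small check. This Python sketch is illustrative only: the function name, the 3% eviction threshold (one reading of "a few percent"), and the input values are assumptions, and in practice the rules would live in your monitoring system rather than in application code.

```python
def ha_alerts(evicted_pods: int, total_pods: int,
              lease_renewal_seconds: float,
              eviction_threshold: float = 0.03,
              renewal_threshold_seconds: float = 1.0) -> list[str]:
    """Return the names of the alert conditions that currently fire."""
    alerts = []
    if total_pods > 0 and evicted_pods / total_pods > eviction_threshold:
        alerts.append("eviction-rate")
    if lease_renewal_seconds > renewal_threshold_seconds:
        alerts.append("lease-renewal-latency")
    return alerts


print(ha_alerts(evicted_pods=1, total_pods=100, lease_renewal_seconds=0.2))
# []
print(ha_alerts(evicted_pods=5, total_pods=100, lease_renewal_seconds=1.4))
# ['eviction-rate', 'lease-renewal-latency']
```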
What Success Looks Like: Resilient Services and Business Impact
A correctly sized StatefulSet shows steady high availability without frequent restarts. The leader remains stable, and write latency stays predictable because the shared store sees only one writer.
Operational cost drops as the cluster no longer churns pods to satisfy an oversized replica count. Fewer evictions mean fewer restarts and less recovery I/O on storage, which can also reduce cloud-provider fees.
Developers gain confidence. With a stable leader, CI pipelines can roll out new versions without fearing a split-brain rollback. Release cycles shorten because the platform respects the PodDisruptionBudget during rolling updates.
The bottom line? Right-sized replicas turn “more is better” into “just enough is safe.” This delivers both technical stability and clear business value.
Frequently Asked Questions
Does adding more replicas always improve Kubernetes high availability?
No. More replicas increase resource demand and can stress leader election. This can lead to split-brain or pod evictions that reduce availability.
How many replicas are optimal for a StatefulSet?
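It depends on your failure domains. A common safe baseline is three replicas spread across three zones, combined with a PodDisruptionBudget that preserves quorum during voluntary disruptions. With only two zones, losing the zone that holds two of the three pods still costs you the quorum.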
What monitoring should I set up to detect HA degradation?
Track pod eviction rates, CPU/memory pressure alerts, etcd quorum health, and leader election churn metrics. Alert on spikes that exceed your defined thresholds.
Can I use HorizontalPodAutoscaler with StatefulSets without breaking HA?
Yes, but configure it with conservative scaling limits. A PodDisruptionBudget only governs evictions and does not constrain HPA scale-down, so set `minReplicas` at or above the quorum size to keep quorum intact.
Where can I learn more about designing PodDisruptionBudgets?
Our detailed guide on PodDisruptionBudget best practices walks through sizing, testing, and alerting strategies.
For deeper dives into Kubernetes cost dynamics, see our post on Kubernetes Costs Are Killing Your AI Budget. To learn how to design resilient microservices, read How to Architect Scalable Microservices on Kubernetes.
