TL;DR
A default-allow NetworkPolicy posture gives a false sense of safety: a single rogue pod can flood the cluster network and break high-availability guarantees. The cure isn’t “add any policy”. It’s a deliberately designed, test-driven policy set that treats NetworkPolicy as part of the high-availability control plane.
Key Takeaways
- A missing policy leaves every pod exposed, turning a single compromised container into a cluster-wide outage risk.
- Blindly adding a policy often breaks health checks, or relies on CNI plugins that ignore the rules.
- Designing isolation, redundancy, and latency-aware policies, then validating them in CI, restores HA and speeds up deployments.
Why a Default-Allow NetworkPolicy Is a Silent HA Killer

Most SRE teams assume a default-allow NetworkPolicy posture is harmless, but that blank check quietly undermines Kubernetes high availability. When no policy selects a pod, all traffic to and from it is allowed, so the network layer offers no guard against a compromised workload. Could a single compromised pod silently shut down the entire cluster?
When a pod is hijacked, it can open thousands of connections, saturate the service IPs, and starve legitimate traffic. The symptom often looks like intermittent slowness rather than a security breach, so the root cause stays hidden. Why does the cluster still appear healthy while traffic collapses?
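```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  # Empty rules match all traffic; without them this manifest would deny everything.
  ingress:
    - {}
  egress:
    - {}
```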
The above spells out the implicit behavior you already have: no rules, no restrictions. What happens when a rogue pod starts sending thousands of connections?
- Isolation gap: Without a deny-all baseline, traffic from a compromised pod is allowed wherever no policy selects the destination, so it bypasses protections you add later for other workloads.
- Failure cascade: A flood of connections can overwhelm the node's kube-proxy iptables path, causing node-level packet drops and pod restarts.
- Detection delay: Because the cluster still appears “up”, alerts fire only after latency spikes, not when the breach begins.

What happens inside the kernel when traffic spikes?
The kernel’s hidden choke point
When a pod launches a burst of connections, the Linux kernel’s connection-tracking (conntrack) table fills. Each entry consumes memory and a slot in the nf_conntrack hash. Once the table is full, new connections are dropped, even for healthy services. The result is a sudden slowdown for every workload on the node, and it looks like a network glitch rather than an attack.
You can observe the pressure with:
```bash
# Show conntrack usage on a node
sudo sysctl net.netfilter.nf_conntrack_count
sudo sysctl net.netfilter.nf_conntrack_max
```
If `count` hovers near `max`, a single noisy pod can push the whole node to its limit. A default-allow policy leaves that pod free to open connections to any destination, so nothing at the network layer keeps the table from filling.
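To make "hovers near `max`" concrete, the two values can be turned into a percentage. The following is a minimal sketch, not a monitoring tool: the function names are hypothetical, and the default threshold of 85 is an illustrative assumption rather than a recommended value.

```shell
#!/bin/sh
# Sketch: report conntrack utilization as a whole-number percentage.
# Function names and the 85 default threshold are illustrative assumptions.

# conntrack_pct COUNT MAX -> prints floor(COUNT * 100 / MAX)
conntrack_pct() {
  count=$1
  max=$2
  if [ "$max" -le 0 ]; then
    echo "max must be positive" >&2
    return 1
  fi
  echo $((count * 100 / max))
}

# conntrack_status COUNT MAX [THRESHOLD] -> prints "ok NN%" or "warn NN%"
conntrack_status() {
  pct=$(conntrack_pct "$1" "$2") || return 1
  threshold=${3:-85}
  if [ "$pct" -ge "$threshold" ]; then
    echo "warn ${pct}%"
  else
    echo "ok ${pct}%"
  fi
}

# Usage on a node where the nf_conntrack module is loaded:
#   conntrack_status "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                    "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```

Keeping the arithmetic in a function that takes plain numbers means the same check can be fed from `sysctl`, from `/proc`, or from a metrics export.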
Can a simple policy change stop the table from filling?
Why “just add a policy” fails
A common mitigation is to drop the default-allow stance and write a few allow rules. That approach misses the real problem. Why does it fail?
- Silent ignore: CNI plugins that do not implement NetworkPolicy accept the objects but never enforce them, which gives a false sense of security.
- Health-check blackout: Over-tight rules can break health probes, causing pod restarts.
- Testing shortcut: Skipping policy validation to preserve dev velocity creates hidden regressions that only surface during a traffic surge.

The real solution lies in treating NetworkPolicy as part of the high-availability control plane, not just a firewall. How do you avoid these pitfalls?
Designing NetworkPolicies for High Availability
Think of NetworkPolicy as three pillars that keep the HA ship afloat: isolation, redundancy, and minimal latency.
- Isolation - Start with a deny-all baseline. Only services that truly need to talk are permitted.
- Redundancy - Use label selectors instead of static pod names so that newly scaled pods inherit the correct rules automatically.
- Latency-aware - Allow intra-service traffic for health probes on the same service label, but keep cross-service chatter to a minimum.
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-svc-a
spec:
  podSelector:
    matchLabels:
      app: svc-a
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: svc-a # intra-service
      ports:
        - protocol: TCP
          port: 8080
    - from:
        - podSelector:
            matchLabels:
              role: health-probe
      ports:
        - protocol: TCP
          port: 10256 # health-check
```
- Label-driven rules survive scaling events.
- Health-probe allowance keeps kubelet happy without opening the whole cluster.
- Cross-service whitelist is explicit, so any accidental port exposure is caught early.
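One gap worth checking before rollout: the `deny-all` baseline lists `Egress`, so the pods it selects also lose DNS lookups unless a policy allows them. The sketch below prints one possible allowance. It assumes the cluster DNS pods run in `kube-system` with the label `k8s-app: kube-dns` and that namespaces carry the `kubernetes.io/metadata.name` label; verify both on your cluster. The function name and the policy name `allow-dns-egress` are hypothetical.

```shell
#!/bin/sh
# Sketch: print a NetworkPolicy that lets every pod in a namespace reach cluster DNS.
# Assumptions (verify on your cluster): DNS pods run in kube-system with the
# label k8s-app=kube-dns, and namespaces carry kubernetes.io/metadata.name.
# The policy name "allow-dns-egress" is hypothetical.

# dns_egress_policy [NAMESPACE] -> prints the manifest (NAMESPACE defaults to "default")
dns_egress_policy() {
  namespace=${1:-default}
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: ${namespace}
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
}

# Usage (requires cluster access):
#   dns_egress_policy my-namespace | kubectl apply -f -
```

Because `namespaceSelector` and `podSelector` sit in the same `to` entry, both must match, which keeps the allowance limited to the DNS pods rather than all of `kube-system`.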
What steps are needed to roll this out safely in production?
Step-by-Step Implementation in a Production Cluster

- Confirm CNI support - Plugins such as Calico, Cilium, or Kube-router enforce policies, while plugins that do not implement the NetworkPolicy API silently ignore them. Run a quick test: create a deny-all policy and attempt `curl` between two pods. If traffic still passes, switch CNI.

```bash
# Example check with netshoot
kubectl run netshoot --image=nicolaka/netshoot --restart=Never -- sleep 3600
# --max-time keeps a blocked request from hanging indefinitely
kubectl exec -ti netshoot -- curl -s --max-time 5 http://<target-pod-ip>
```
- Apply baseline deny-all - Deploy the `deny-all` manifest shown earlier.
- Incrementally add allow rules - For each service, create a policy that permits only required ports and the health-probe port. After each `kubectl apply -f`, run connectivity checks from `netshoot` pods.
- Automate validation - Integrate `kubeaudit` and `conftest` into your CI pipeline.

```yaml
# .github/workflows/networkpolicy.yml
name: NetworkPolicy CI
on: [push, pull_request]
jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install kubeaudit
        run: |
          curl -sSL https://github.com/Shopify/kubeaudit/releases/download/v0.19.0/kubeaudit_0.19.0_linux_amd64.tar.gz | tar -xz
      - name: Run audit
        run: ./kubeaudit all -f ./manifests/
      - name: Conftest policy check
        # Assumes conftest is already installed on the runner
        run: conftest test ./manifests/
```
- Staged rollout - Use a canary namespace. Apply policies there first, monitor metrics, then promote to production.
- Verification checklist
  - CNI supports NetworkPolicy? ✅
  - Baseline deny-all applied? ✅
  - Health-probe ports allowed? ✅
  - CI lint passes? ✅
These steps turn a risky default-allow cluster into a hardened, HA-aware environment.
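The connectivity checks in the steps above are easier to repeat when the result of each `curl` is turned into an explicit verdict. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and it relies on `kubectl exec` passing through the exit code of the command it runs. The curl exit codes used are 0 (request succeeded), 7 (could not connect), and 28 (timed out).

```shell
#!/bin/sh
# Sketch: turn curl exit codes from a netshoot probe into a pass/fail verdict.
# Exit codes used: 0 = request succeeded, 7 = could not connect,
# 28 = timed out (what a silently dropped connection usually looks like).
# Function names are hypothetical.

# classify_curl_exit CODE -> prints "allowed", "blocked", or "error"
classify_curl_exit() {
  case "$1" in
    0) echo "allowed" ;;
    7|28) echo "blocked" ;;
    *) echo "error" ;;
  esac
}

# check_expectation EXPECTED CODE -> succeeds when the observed verdict matches
check_expectation() {
  actual=$(classify_curl_exit "$2")
  if [ "$actual" = "$1" ]; then
    echo "PASS expected=$1 actual=$actual"
  else
    echo "FAIL expected=$1 actual=$actual"
    return 1
  fi
}

# Usage (requires cluster access; <target-pod-ip> stays a placeholder):
#   kubectl exec netshoot -- curl -s -o /dev/null --max-time 5 http://<target-pod-ip>:8080
#   check_expectation blocked "$?"
```

Stating the expected verdict for each source and destination pair makes a regression visible as a `FAIL` line instead of a silent change in behavior.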
What measurable benefits can you expect after the rollout?
The Payoff: Resilient, Secure, and Faster Deployments
After the rollout, teams saw a dramatic shift in reliability. The cluster no longer fell prey to a single pod flooding the network, and alerts now fire on policy violations rather than on downstream latency spikes.
Concrete observability gains:
```yaml
# alerts.yml
groups:
  - name: network-health
    rules:
      - alert: ConntrackHighUtilization
        # node_exporter exposes the table size as node_nf_conntrack_entries_limit
        expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Conntrack table > 85% on {{ $labels.instance }}"
          description: "Potential DoS from a noisy pod. Investigate traffic patterns."
```
- Grafana dashboard showing per-namespace egress bytes before and after policy enforcement.
- Deployment cadence improves. With policy linting baked into CI, engineers spend less time debugging flaky health checks.
The effort also paid off in security posture. Continuous policy testing caught misconfigurations before they reached production, keeping the attack surface minimal.
Which remaining challenges might still hide in your network?
Frequently Asked Questions
Q: Why does a missing NetworkPolicy affect high availability?
A: Without isolation, a compromised pod can flood the network, saturate service endpoints, and trigger cascading failures that reduce overall cluster availability.
Q: Can I use any CNI plugin with NetworkPolicy?
A: Only CNI plugins that implement the NetworkPolicy API (e.g., Calico, Cilium, Kube-router) enforce rules. Basic plugins silently ignore them, leaving the cluster exposed.
Q: How do I test a new NetworkPolicy without breaking traffic?
A: Apply a deny-all baseline in a canary namespace, then add allow rules incrementally. Use `netshoot` pods or `kubectl exec` to verify pod-to-pod connectivity after each change.
Q: What’s the minimal set of rules to keep HA health checks working?
A: Allow inbound traffic on the pod’s health-check port (10256 in the example above, or whatever port your probes actually target) from the same service label. Also permit egress to the node IP range, where kube-proxy runs.
Q: How fast can I roll out a secure NetworkPolicy in production?
A: With automated CI validation and a staged rollout, most teams achieve a safe policy deployment within a few weeks.
Related reads
- [Active-Active HA Is Killing Statefulness](/posts/active-active-ha-killing-statefulness)
- [When Pod Security Policies Break Kubernetes HA](/posts/pod-security-policies-break-kubernetes-ha)
Secure your cluster by treating NetworkPolicy as a core HA component.
