TL;DR

A cluster that allows all pod traffic by default gives a false sense of safety. A rogue pod can flood the cluster network and break high-availability guarantees. The cure isn’t “add any policy”; it’s a deliberately designed, test-driven policy set that treats NetworkPolicy as part of the high-availability control plane.

Key Takeaways

- A missing policy leaves every pod exposed, turning a single compromised container into a cluster-wide outage risk.
- Blindly adding a policy often breaks health checks or relies on CNI plugins that ignore the rules.
- Designing isolation, redundancy, and latency-aware policies, then validating them with CI, restores HA and speeds up deployments.

Why a Default-Allow NetworkPolicy Is a Silent HA Killer

Most SRE teams assume a default-allow network posture is harmless, but that blank check quietly undermines Kubernetes high availability. When no NetworkPolicy selects a pod, Kubernetes allows all traffic to and from it, so the network layer offers no guard against a compromised workload. Could a single compromised pod silently degrade the entire cluster?

When a pod is hijacked, it can open thousands of connections, saturate the Service IPs, and starve legitimate traffic. The symptom often looks like intermittent slowness rather than a security breach, so the root cause stays hidden. Why does the cluster still appear healthy while traffic collapses?

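apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - {}   # a single empty rule allows all inbound traffic
  egress:
    - {}   # a single empty rule allows all outbound traffic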

The above spells out the behavior you already get when no policy exists: no rules, no restrictions. What happens when a rogue pod starts opening thousands of connections?

- Isolation gap: Without a deny-all baseline, every pod can reach every other pod, so malicious traffic moves freely until a policy selects the pods involved.
- Failure cascade: A flooding pod can exhaust node-level resources, such as the connection-tracking table that kube-proxy’s iptables rules rely on, causing node-level packet drops and pod restarts.
- Detection delay: Because the cluster still appears “up”, alerts fire only after latency spikes, not when the breach begins.

What happens inside the kernel when traffic spikes?

The kernel’s hidden choke point

When a pod launches a burst of connections, the Linux kernel’s connection-tracking (conntrack) table fills. Each entry consumes memory and a slot in the nf_conntrack hash table. Once the table is full, new connections are dropped, even for healthy services on the same node. The result is a sudden slowdown that looks like a network glitch and can ripple across the cluster.

You can observe the pressure with:

# Show conntrack usage on a node
sudo sysctl net.netfilter.nf_conntrack_count
sudo sysctl net.netfilter.nf_conntrack_max

If `count` hovers near `max`, a single noisy pod can push the whole node to its limit. With no policy restricting it, that pod can open connections to any destination, which makes exhaustion far more likely. How do you detect this before it causes outages?
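The count-versus-max comparison can be turned into a small check. The following is a minimal sketch, not a hardened monitoring script: the function names `conntrack_pct` and `conntrack_status` and the 85% default threshold are assumptions made for illustration.

```bash
# conntrack_pct COUNT MAX
# Prints the integer percentage of the conntrack table in use.
conntrack_pct() {
  local count="$1" max="$2"
  if [ "$max" -le 0 ]; then
    echo "max must be positive" >&2
    return 1
  fi
  echo $(( count * 100 / max ))
}

# conntrack_status COUNT MAX [THRESHOLD]
# Prints "warn N%" when usage is at or above THRESHOLD (default 85,
# an illustrative value), otherwise "ok N%".
conntrack_status() {
  local pct threshold="${3:-85}"
  pct="$(conntrack_pct "$1" "$2")" || return 1
  if [ "$pct" -ge "$threshold" ]; then
    echo "warn ${pct}%"
  else
    echo "ok ${pct}%"
  fi
}
```

On a node with the conntrack module loaded, the inputs can come straight from procfs: `conntrack_status "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"`.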

Can a simple policy change stop the table from filling?

Why “just add a policy” fails

A common mitigation is to drop the default-allow stance and write a few allow rules. That approach misses the real problem. Why does it fail?

- Silent ignore: With a CNI plugin that does not implement NetworkPolicy, the API server still accepts the objects but nothing enforces them, which gives a false sense of security.
- Health-check blackout: Over-tight rules can block health checks. Depending on the CNI, that can include kubelet probes, which causes pod restarts.
- Testing shortcut: Skipping policy validation to protect dev velocity creates hidden regressions that only surface during a traffic surge.

The real solution lies in treating NetworkPolicy as a high-availability control plane, not just a firewall. How do you avoid these pitfalls?

Designing NetworkPolicies for High Availability

Think of NetworkPolicy as three pillars that keep the HA ship afloat: isolation, redundancy, and minimal latency.

Isolation - Start with a deny-all baseline. Only services that truly need to talk are permitted.
Redundancy - Use label selectors instead of static pod names so that newly scaled pods inherit the correct rules automatically.
Latency-aware - Allow intra-service traffic for health probes on the same service label, but keep cross-service chatter to a minimum.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-svc-a
spec:
  podSelector:
    matchLabels:
      app: svc-a
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: svc-a   # intra-service
      ports:
        - protocol: TCP
          port: 8080
    - from:
        - podSelector:
            matchLabels:
              role: health-probe
      ports:
        - protocol: TCP
          port: 10256   # health-check

- Label-driven rules survive scaling events.
- Health-probe allowance keeps kubelet happy without opening the whole cluster.
- Cross-service whitelist is explicit, so any accidental port exposure is caught early.

What steps are needed to roll this out safely in production?

Step-by-Step Implementation in a Production Cluster

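Confirm CNI support - Only CNI plugins that implement NetworkPolicy (for example Calico, Cilium, or kube-router) enforce policies. Run a quick test: create a deny-all policy and attempt `curl` between two pods. If traffic still passes, the CNI is not enforcing the policy and you need to switch to one that does.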

# Example check with netshoot
kubectl run netshoot --image=nicolaka/netshoot --restart=Never -- sleep 3600
# --max-time keeps a blocked connection from hanging the check
kubectl exec -ti netshoot -- curl -s --max-time 5 http://<target-pod-ip>
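Interpreting the result of that `curl` takes some care, because a refused connection and a dropped one mean different things. The sketch below maps curl exit codes to a rough verdict. The function names and verdict strings are hypothetical, and it assumes the CNI drops blocked packets (so a blocked flow shows up as a timeout) and that `kubectl exec` passes through the remote command's exit code.

```bash
# classify_curl_exit CODE
# Maps a curl exit code to a rough connectivity verdict.
#   0  = request succeeded, so the flow is allowed
#   7  = connection refused: packets arrived but nothing is listening
#   28 = timed out: consistent with a policy silently dropping packets
classify_curl_exit() {
  case "$1" in
    0)  echo "allowed" ;;
    7)  echo "reachable-but-refused" ;;
    28) echo "blocked-or-timeout" ;;
    *)  echo "inconclusive" ;;
  esac
}

# check_pod_connectivity POD URL
# Runs curl inside POD and prints the verdict for URL.
check_pod_connectivity() {
  local pod="$1" url="$2" rc=0
  kubectl exec "$pod" -- curl -s -o /dev/null --max-time 5 "$url" || rc=$?
  classify_curl_exit "$rc"
}
```

For example, `check_pod_connectivity netshoot http://<target-pod-ip>` should print `blocked-or-timeout` once an enforced deny-all policy is in place.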

Apply baseline deny-all - Deploy the `deny-all` manifest shown earlier.

Incrementally add allow rules - For each service, create a policy that permits only required ports and the health-probe port. After each `kubectl apply -f`, run connectivity checks from `netshoot` pods.
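The apply-then-verify loop can be scripted so that a failed check halts the rollout before the next policy lands. This is a minimal sketch under stated assumptions: `apply_policies_incrementally` and the `KUBECTL` override variable are hypothetical names, and the verification step is whatever command or function you supply.

```bash
# Command used to talk to the cluster; overridable for dry runs.
KUBECTL="${KUBECTL:-kubectl}"

# apply_policies_incrementally VERIFY_CMD FILE [FILE ...]
# Applies each policy file in order and runs VERIFY_CMD (passing the
# file name) after each one. Stops at the first failure so later
# policies are never layered on top of a broken state.
apply_policies_incrementally() {
  local verify="$1" file
  shift
  for file in "$@"; do
    echo "applying $file"
    if ! "$KUBECTL" apply -f "$file"; then
      echo "apply failed: $file" >&2
      return 1
    fi
    if ! "$verify" "$file"; then
      echo "verification failed after $file, stopping" >&2
      return 1
    fi
  done
}
```

A call might look like `apply_policies_incrementally my_connectivity_check allow-svc-a.yaml allow-svc-b.yaml`, where `my_connectivity_check` is your own function wrapping the `netshoot` checks.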

Automate validation - Integrate `kubeaudit` and `conftest` into your CI pipeline.

# .github/workflows/networkpolicy.yml
name: NetworkPolicy CI
on: [push, pull_request]
jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install kubeaudit
        run: |
          curl -sSL https://github.com/Shopify/kubeaudit/releases/download/v0.19.0/kubeaudit_0.19.0_linux_amd64.tar.gz | tar -xz
      - name: Run audit
        run: ./kubeaudit all -f ./manifests/
      - name: Conftest policy check
        # Assumes conftest is already installed on the runner
        run: conftest test ./manifests/

Staged rollout - Use a canary namespace. Apply policies there first, monitor metrics, then promote to production.

Verification checklist

- CNI supports NetworkPolicy? ✅
- Baseline deny-all applied? ✅
- Health-probe ports allowed? ✅
- CI lint passes? ✅
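The checklist above can be run as a script rather than by eye. The sketch below is one possible shape: `run_checklist` and `check_deny_all` are hypothetical helpers, and the `NAMESPACE` variable and the `deny-all` policy name are assumptions taken from the earlier manifest.

```bash
# run_checklist NAME CHECK [NAME CHECK ...]
# Runs each CHECK command and prints PASS or FAIL next to its NAME.
# Returns non-zero if any check failed.
run_checklist() {
  local failed=0 name check
  while [ "$#" -ge 2 ]; do
    name="$1"
    check="$2"
    shift 2
    if "$check" >/dev/null 2>&1; then
      echo "PASS  $name"
    else
      echo "FAIL  $name"
      failed=1
    fi
  done
  return "$failed"
}

# Hypothetical check: succeeds when a policy named deny-all exists
# in the target namespace (defaults to "default").
check_deny_all() {
  kubectl get networkpolicy deny-all --namespace "${NAMESPACE:-default}"
}
```

For instance, `NAMESPACE=canary run_checklist "Baseline deny-all applied" check_deny_all` reports on the canary namespace before you promote.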

These steps turn a risky default-allow cluster into a hardened, HA-aware environment.

What measurable benefits can you expect after the rollout?

The Payoff: Resilient, Secure, and Faster Deployments

After the rollout, teams saw a dramatic shift in reliability. The cluster no longer fell prey to a single pod flooding the network, and alerts now fire on policy violations rather than on downstream latency spikes.

Concrete observability gains:

# alerts.yml
groups:
  - name: network-health
    rules:
      - alert: ConntrackHighUtilization
        # node_exporter exposes the table size as node_nf_conntrack_entries_limit
        expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Conntrack table > 85% on {{ $labels.instance }}"
          description: "Potential DoS from a noisy pod. Investigate traffic patterns."

- Grafana dashboard showing per-namespace egress bytes before and after policy enforcement.
- Deployment cadence improves. With policy linting baked into CI, engineers spend less time debugging flaky health checks.

The effort also paid off in security posture. Continuous policy testing caught misconfigurations before they reached production, keeping the attack surface minimal.

Which remaining challenges might still hide in your network?

Frequently Asked Questions

Q: Why does a missing NetworkPolicy affect high availability?

A: Without isolation, a compromised pod can flood the network, saturate service endpoints, and trigger cascading failures that reduce overall cluster availability.

Q: Can I use any CNI plugin with NetworkPolicy?

A: Only CNI plugins that implement the NetworkPolicy API (e.g., Calico, Cilium, Kube-router) enforce rules. Plugins without that support silently ignore them, leaving the cluster exposed.

Q: How do I test a new NetworkPolicy without breaking traffic?

A: Apply a deny-all baseline, then add allow rules incrementally. Use `netshoot` pods or `kubectl exec` to verify pod-to-pod connectivity after each change.

Q: What’s the minimal set of rules to keep HA health checks working?

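A: Allow inbound traffic on the pod’s health-check port (10256 in the example above, or whatever port your probes actually target) from pods with the same service label. If egress is also denied by default, permit the destinations the service depends on, such as cluster DNS.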

Q: How fast can I roll out a secure NetworkPolicy in production?

A: With automated CI validation and a staged rollout, most teams achieve a safe policy deployment within a few weeks.

Related reads

- [Active-Active HA Is Killing Statefulness](/posts/active-active-ha-killing-statefulness)
- [When Pod Security Policies Break Kubernetes HA](/posts/pod-security-policies-break-kubernetes-ha)

Secure your cluster by treating NetworkPolicy as a core HA component.
