Why Your Kubernetes Observability Stack Is Lying

TL;DR: Your Kubernetes dashboards are showing green because they treat ephemeral, deeply correlated workloads like static VMs. The fix is a fourth signal (Events and topology) stitched to the three pillars through a correlation layer that catches cascading failures before users feel them.

Key Takeaways:
- Kubernetes observability fails when pods are treated as durable, when the three pillars stay in separate tools, and when the control plane is left dark.
- Ephemeral workloads destroy scrape-based metrics models, so entire lifecycles vanish between samples.
- A trustworthy stack correlates Events, metrics, traces, and topology so root cause surfaces in seconds, not hours.

Your Green Dashboard Is Reporting Fiction

Your Kubernetes dashboards have been showing green for weeks. Meanwhile, pods have been failing readiness probes in silence. Your trace data is too sparse to catch the cascade that just hit your checkout service.

This is the core lie. Most observability stacks treat Kubernetes like a traditional VM fleet. They scrape node-level metrics, tail container logs, and sample a fraction of traces.

The result is a status page that looks healthy. Underneath, the system is bleeding.

Three structural problems drive this:
- The three pillars are collected in isolation, so they tell contradictory stories during an incident.
- A pod can be "Running" and "Ready" while the app inside is deadlocked. It returns 500s, or it serves stale data from a cache that lost its connection.
- The control plane (etcd, kube-scheduler, controller-manager) is often completely unmonitored, and it is the single biggest source of cascading failures.

The control plane problem is the one nobody talks about. etcd latency spikes cause scheduler delays. Controller-manager restarts cause stuck reconciliations.

Without proper Kubernetes observability that covers these components, you fly blind on the brain of your cluster.

If the basic signals are there, why does the picture look wrong? The answer starts with how Kubernetes destroys your assumptions about stable infrastructure.

The Ephemeral Nature of Pods Breaks Your Data Model

Pods do not behave like servers. They live for minutes or seconds, not months. Prometheus scrapes on a fixed interval, typically 15 to 60 seconds, so a pod that starts and dies between two scrapes never appears in your time series at all.

The pod that crashed, restarted, and got replaced is a gap in your time series. It is not a data point you can analyze.
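A minimal sketch of the arithmetic behind that gap, assuming scrapes fire at fixed ticks (t = 0, interval, 2 × interval, ...) and a pod is only sampled if a tick lands inside its lifetime. The function name and the numbers are illustrative; nothing here is a Prometheus API.

```python
import math


def scrapes_during_lifetime(start_s: float, lifetime_s: float, interval_s: float) -> int:
    """Count scrape ticks that land inside [start_s, start_s + lifetime_s)."""
    if lifetime_s <= 0 or interval_s <= 0:
        return 0
    end_s = start_s + lifetime_s
    first_tick = math.ceil(start_s / interval_s)   # first tick at or after pod start
    last_tick = math.ceil(end_s / interval_s) - 1  # last tick strictly before pod end
    return max(0, last_tick - first_tick + 1)


# A pod that starts 1s after a scrape and lives for 20s is never sampled
# when the scrape interval is 30s.
print(scrapes_during_lifetime(start_s=31, lifetime_s=20, interval_s=30))  # 0
```

Any pod whose lifetime is shorter than the scrape interval can fall entirely between two ticks, which is why short-lived workloads need either a shorter interval or a signal that does not depend on scraping.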

Logs are worse. When a container crashes, the kubelet keeps only the previous instance's logs on the node, and even those disappear once the pod is deleted or evicted. Unless you have configured node-level log collection correctly (and most teams have not), those logs never reach your platform.

They die with the pod.

Trace context breaks in subtler ways. When sidecars or init containers are involved, spans get attributed to the wrong workload identity. Your trace shows a 400ms call from `checkout-service`.

The actual slow hop was a CNI DNS lookup that the SDK never saw.

The Kubernetes API itself is the source of truth for pod state. Most production monitoring stacks treat it as a secondary signal. That is backwards.

The API server is the system. Metrics and logs are just observations of it.

A well-built observability stack treats pod identity as the join key across all signals. Most teams configure their observability stack around node or service identity. That is why correlation breaks the moment HPA scales a deployment across many replicas.
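As a rough illustration of what "pod identity as the join key" means, the sketch below groups records from three signal types under one `(namespace, pod)` key. The record shapes are hypothetical plain dicts, not the format of any particular backend.

```python
from collections import defaultdict


def join_by_pod(metrics, logs, events):
    """Group records from three signal types under one (namespace, pod) key.

    Each record is assumed to be a dict carrying "namespace" and "pod" fields.
    """
    joined = defaultdict(lambda: {"metrics": [], "logs": [], "events": []})
    for signal, records in (("metrics", metrics), ("logs", logs), ("events", events)):
        for record in records:
            key = (record["namespace"], record["pod"])
            joined[key][signal].append(record)
    return dict(joined)
```

If any one signal is keyed only by node or service name, it cannot be placed in this structure without guessing, and that guess is where correlation breaks under scale.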

The same architectural mistake that breaks HA under more replicas is the one that breaks observability under scale.

Losing data on individual pods is bad enough. The deeper problem is that even the data you do retain is structured to make correlation impossible.

Metrics, Logs, and Traces Don't Talk to Each Other

Each pillar answers a different question. Metrics give you the "what" (CPU at 95%). Logs give you the "why" (OOMKilled).

Traces give you the "where" (a slow downstream call).

But each lives in a separate tool with a separate query language and a separate retention policy. During an incident, you end up with three browser tabs that disagree.

Kubernetes makes this worse through label cardinality. A single Deployment with 50 replicas, across 3 namespaces, on 10 nodes, with 2 containers each, generates thousands of label combinations.

Storage for your metrics, traces, and logs explodes while the actual signal disappears under noise.
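The numbers above multiply out as follows. This is a worst-case upper bound that assumes every combination of label values actually occurs, and it counts a single metric name; the dimension names are illustrative.

```python
import math
from typing import Mapping


def worst_case_series(label_values: Mapping[str, int]) -> int:
    """Upper bound on distinct label combinations for one metric name."""
    return math.prod(label_values.values())


dimensions = {"replica": 50, "namespace": 3, "node": 10, "container": 2}
print(worst_case_series(dimensions))  # 3000
```

Multiply that bound by the number of metric names you export per container and the storage pressure becomes clear.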

Distributed traces often miss the cluster boundary entirely. You see the app-to-app latency but not the kube-proxy, CNI, or DNS layer in between. The request that took 800ms was slow because of a CoreDNS round-robin failure.

Your trace shows only the happy path from the app's view.

The result is a predictable incident pattern:
- SRE sees a spike in 5xx errors.
- Searches logs. Finds nothing useful.
- Checks traces. Context is incomplete.
- Escalates without a root cause.

This pattern wastes hours per incident. Teams that have fixed this usually build the correlation layer before they need it. Teams that skip it pay every single time something breaks. A fourth signal, hiding in plain sight, would have caught most of these from the start.

Events and Topology: The Missing Fourth Pillar

Kubernetes Events are the most underused signal in your cluster. They capture why the scheduler made a decision, why a pod was evicted, and why a volume mount failed. They are not metrics.

They are not logs. They are causal breadcrumbs the control plane leaves for you. Most teams either drop them or store them with such short retention that they are useless.

Topology is the other missing piece. Which service talks to which. Which Deployment owns which pods. Which node hosts which workload.

Without topology, you have a pile of time series. With topology, you have an answerable question. This is the same lesson teams learn the hard way when they try to architect microservices on Kubernetes without mapping the dependency graph first.
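One way to make "an answerable question" concrete: given a dependency map, ask which services are affected when one workload fails. The sketch below is a plain breadth-first walk over inverted edges; the graph shape and service names are hypothetical.

```python
from collections import deque


def impacted_services(depends_on, failed):
    """Return every service that directly or transitively depends on `failed`.

    `depends_on` maps a service name to the list of services it calls.
    """
    # Invert the edges: for each dependency, record who calls it.
    callers = {}
    for service, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, set()).add(service)

    seen, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    seen.discard(failed)  # a dependency cycle should not list the failed service itself
    return seen
```

With a map such as `{"frontend": ["checkout"], "checkout": ["payments"], "payments": ["queue-consumer"]}`, a failure in `queue-consumer` returns all three upstream services, which is the blast radius an on-call engineer needs first.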

eBPF-based instrumentation is what makes this practical at scale. It captures network flows, DNS resolution, and system calls without requiring app changes.

Traditional APM cannot do this for Kubernetes because it instruments the application, not the cluster infrastructure. eBPF instruments the kernel.

When you correlate all four signals, root cause surfaces in seconds:
- Event: `BackoffLimitExceeded` from Job controller.
- Metric: restart count spike on the consumer Deployment.
- Trace: connection refused from upstream after the consumer crashed.
- Topology: the consumer was a critical dependency of the checkout path.

That is a root cause in three seconds, not three hours. This is what proper k8s observability looks like when it works.
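A minimal sketch of that correlation step, assuming every signal has already been normalized to a dict with `workload`, `time`, and `kind` fields. Those field names, the two-minute window, and the sample data are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def correlate(anchor, signals, window=timedelta(minutes=2)):
    """Return signals for the anchor's workload within `window` of it, oldest first."""
    related = [
        s for s in signals
        if s["workload"] == anchor["workload"]
        and abs(s["time"] - anchor["time"]) <= window
    ]
    return sorted(related, key=lambda s: s["time"])


t0 = datetime(2024, 1, 1, 12, 0, 0)
signals = [
    {"kind": "metric", "workload": "consumer", "time": t0 + timedelta(seconds=20)},
    {"kind": "event", "workload": "consumer", "time": t0},
    {"kind": "trace", "workload": "consumer", "time": t0 + timedelta(seconds=45)},
    {"kind": "metric", "workload": "billing", "time": t0 + timedelta(seconds=5)},
]
timeline = correlate(signals[1], signals)
print([s["kind"] for s in timeline])  # ['event', 'metric', 'trace']
```

The ordering matters as much as the grouping: seeing the Event first, then the metric, then the failed trace is what turns three separate observations into a causal story.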

The service mesh layer can extend this with mTLS-aware telemetry, but it is not required. eBPF gives you most of the value without sidecars. Getting the team to act on the data is often harder than wiring it up.

Rebuilding Your Stack: A Concrete Migration Path for SRE Leads

The migration is incremental. You do not need a big-bang rewrite. Here is the order that works.

Step 1: Stream the Kubernetes Events API to your existing log platform. This is free signal you are currently throwing away.

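# Receiver names used in the routes ("devnull", "stdout") must also be
# defined under a top-level receivers: list in config.yaml (not shown here).
apiVersion: v1
kind: ConfigMap
metadata:
  name: event-exporter-config
data:
  config.yaml: |
    logLevel: info
    route:
      routes:
        - match:
            - kind: "Pod"
            - kind: "Job"
            - kind: "Node"
              receiver: "devnull"
        - match:
            - receiver: "stdout"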

A simple devops observability add-on like `kubernetes-event-exporter` costs you one Deployment and unlocks a signal class most teams never see.
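Once Events are flowing, they are easier to search if each one becomes a single structured log line. The sketch below flattens an Event, represented as a plain dict with the core/v1 field names (`involvedObject`, `reason`, `message`, `type`, `lastTimestamp`), into JSON. The output keys (`ts`, `level`, and so on) are a hypothetical schema, not something your log platform requires.

```python
import json


def event_to_log_line(event: dict) -> str:
    """Flatten a core/v1 Event (as a dict) into one JSON log line."""
    obj = event.get("involvedObject", {})
    record = {
        "ts": event.get("lastTimestamp") or event.get("firstTimestamp"),
        # Kubernetes Events carry type "Normal" or "Warning".
        "level": "warn" if event.get("type") == "Warning" else "info",
        "reason": event.get("reason"),
        "message": event.get("message"),
        "kind": obj.get("kind"),
        "namespace": obj.get("namespace"),
        "name": obj.get("name"),
    }
    return json.dumps(record, sort_keys=True)
```

Keeping `kind`, `namespace`, and `name` as top-level fields is the design choice that pays off later, because those are the fields you will join against metrics and traces.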

Step 2: Deploy kube-state-metrics and configure your Prometheus scrapers for pod-level granularity. Use metric relabeling to drop high-cardinality labels you do not query on. The default is to keep everything, which is why many clusters choke on their own metrics storage.

metric_relabel_configs:
  # labeldrop removes the matching label from every scraped series.
  # (action: drop would discard the whole series instead.)
  - regex: 'pod_template_hash'
    action: labeldrop
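To see why removing a label reduces cardinality, the sketch below simulates the effect in plain Python: it strips the named labels from each series and counts the distinct label sets that remain. This is a toy model for reasoning about the change, not Prometheus code, and the sample label values are made up.

```python
def drop_labels(series, labels_to_drop):
    """Return the distinct label sets left after removing the given label names."""
    kept = set()
    for labels in series:
        kept.add(frozenset((k, v) for k, v in labels.items() if k not in labels_to_drop))
    return kept


series = [
    {"pod": "checkout", "namespace": "shop", "pod_template_hash": "7d9f"},
    {"pod": "checkout", "namespace": "shop", "pod_template_hash": "5c2a"},
    {"pod": "checkout", "namespace": "shop", "pod_template_hash": "9e11"},
]
print(len(drop_labels(series, {"pod_template_hash"})))  # 1
```

The same collapse is also the risk: if two live series become identical after a label is removed, they collide, so only drop labels that are not needed to keep series unique.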

Step 3: Adopt OpenTelemetry SDKs in your application code. Configure context propagation to span the cluster boundary. Pair this with eBPF instrumentation to capture the CNI, kube-proxy, and DNS layers your SDK cannot see.

This is especially critical in regulated environments, where missing the network layer means missing the audit trail.
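Context propagation comes down to one header travelling with every request. The sketch below builds, parses, and extends a W3C Trace Context `traceparent` value (`version-traceid-parentid-flags`) using only the standard library. In practice the OpenTelemetry SDK's propagators do this for you; the helper names here are hypothetical and exist only to show what crosses the wire.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with random trace and span IDs."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str):
    """Return trace_id, span_id, and sampled flag, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "span_id": span_id, "sampled": bool(int(flags, 16) & 1)}


def child_traceparent(parent_header: str) -> str:
    """Keep the caller's trace ID but mint a new span ID for the outgoing hop."""
    parent = parse_traceparent(parent_header)
    if parent is None:
        return new_traceparent()
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

The trace ID staying constant across hops is what lets a backend stitch application spans and network-layer data into one request path.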

Step 4: Build a correlation layer using a unified backend (Grafana, Honeycomb, Coroot) or a linking strategy where trace IDs appear in log records. The implementation choice matters less than the principle: one query should pivot across all signals.
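For the linking strategy, a minimal sketch using only Python's standard `logging` and `contextvars` modules: a filter stamps every log record with the trace ID of the request being handled. The `current_trace_id` variable and `build_logger` helper are assumptions for illustration; with an OpenTelemetry SDK you would read the ID from the active span instead.

```python
import contextvars
import logging

# Holds the trace ID for the request currently being handled ("-" when there is none).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record so the formatter can print it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger whose lines always include trace_id=<id>."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Setting `current_trace_id` at the start of each request and resetting it at the end is enough for a log search on `trace_id=...` to return exactly the lines that belong to one trace.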

Step 5: Define SLOs per service using latency, error rate, and saturation against user-facing outcomes. Stop alerting on infrastructure proxies that do not correlate to user pain.
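One common way to alert on user pain rather than infrastructure proxies is an error-budget burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below assumes a request-based availability SLO; the function name and the example numbers are illustrative.

```python
def error_budget_burn_rate(total_requests: int, failed_requests: int, slo_target: float) -> float:
    """Burn rate = observed error rate / error rate allowed by the SLO.

    slo_target is a fraction such as 0.999 for "99.9% of requests succeed".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0:
        return 0.0
    observed_error_rate = failed_requests / total_requests
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate
```

For example, 50 failures in 10,000 requests against a 99.9% target is a burn rate of roughly 5, meaning the budget is being spent about five times faster than the SLO allows. A rate at or below 1 is sustainable, which is why a CPU spike with no failed requests should not page anyone.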

What Changes When Your Observability Stops Lying

Mean time to detection drops. Events fire before metrics cross thresholds. You catch failures during the symptom phase, not the outage phase.

The kubelet can evict pods under node pressure well before readiness probes start failing. The Event is there first.

Mean time to resolution drops. Engineers arrive at the incident with correlated evidence, not three browser tabs of disconnected data.

They see the Event, the metric, the trace, and the topology in one view. Diagnosis becomes confirmation, not a hunt.

Alert fatigue decreases. You alert on user-facing SLO violations, not on CPU thresholds that fire every time a batch job runs. Your on-call rotation stops dreading pages.

Postmortems shift from guessing to confirming. You know what happened because the data is there.

Remediation work gets better over time because the patterns are visible. This compounding improvement is what makes cloud infrastructure teams durable rather than reactive.

The dashboards stop lying. The cluster is either green because it is healthy, or red because it is not. There is no longer a third state where the dashboard is green and the users are suffering.

Levitation works with teams to wire Events, topology, and correlation layers into existing Kubernetes stacks without the big-bang risk.

Frequently Asked Questions

What is the difference between monitoring and observability in Kubernetes?

Monitoring tells you whether a system is working (known unknowns, like "is CPU above 80%?"). Observability lets you ask arbitrary questions about why a system behaves a certain way, like "why are requests slow on this specific node, for this specific customer, since this specific deploy?" Kubernetes requires observability because many of its failure modes are emergent and cannot be predicted in advance.

Why are Kubernetes metrics alone insufficient for production monitoring?

Metrics are aggregated and sampled. In a cluster with hundreds of short-lived pods, the aggregation window often contains pods that no longer exist. Metrics show aggregate behavior but cannot show a specific failed request's path through the system. That is what you need during an incident.

How do you correlate metrics, logs, and traces in Kubernetes?

Propagate a `trace_id` and `span_id` through your application logs, then ensure your log platform and tracing backend share that identifier. Combined with the Kubernetes Events stream and a topology map, you can pivot from a metric anomaly to the trace that caused it, along with the related log lines and scheduling decisions.

What is the role of eBPF in Kubernetes observability?

eBPF instruments the kernel without modifying application code. It captures network flows between pods, DNS resolution times, and system calls. That gives you visibility into the cluster infrastructure layer that application-level tracing cannot see. This is critical for diagnosing issues that originate below the application layer.

How often should you scrape Kubernetes metrics?

Standard practice is 15-30 second intervals for control plane and node metrics. For high-churn workloads (HPA, serverless functions, CI runners), use 5-10 second intervals or you will miss entire pod lifecycles. Pair this with metric relabeling to control cardinality and storage costs.
