Kubernetes v1.36: New Features to Combat Controller Staleness and Boost Observability

2026-05-03 12:36:10

Kubernetes controllers rely on an up‑to‑date view of cluster state to function correctly. However, staleness—an outdated cache—can silently lead to incorrect actions, missed actions, or slow responses. The Kubernetes v1.36 release introduces targeted improvements in client‑go and the kube‑controller‑manager to mitigate staleness and provide better observability. Below, we explore these enhancements through key questions and answers.

What exactly is controller staleness, and why does it matter in Kubernetes?

Staleness refers to a controller operating on an outdated view of the cluster held in its local cache. Controllers typically maintain this cache for performance and populate it by watching the API server for changes. When a controller needs to take an action, like scaling a deployment or updating a service, it first checks its cache. If the cache is stale, the controller may make incorrect decisions, such as trying to delete a pod that no longer exists, failing to create a required resource, or delaying critical operations. Staleness often goes unnoticed until a production incident occurs, and even a brief period of outdated information can cascade into wider cluster instability. Understanding staleness is the first step toward building robust controllers that behave predictably.

How does a controller’s local cache become stale, and what problems can that cause?

A controller’s cache can become stale in several scenarios: after a controller restart (it must rebuild its cache from the API server), during API server downtime (no new events are processed), or when events arrive out of order. For example, when a controller restarts, its cache starts empty, and any operation attempted before the cache is fully populated is based on incomplete information. Similarly, if the API server experiences a brief outage, the controller cannot refresh its cache and continues with stale data. The consequences are the ones outlined above: incorrect actions, missed actions, and slow responses.

These issues are particularly dangerous for controllers managing stateful workloads or critical infrastructure.
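
The restart scenario can be illustrated with a minimal Go sketch that uses only the standard library. The `cache` type and its methods are hypothetical stand-ins, not client-go types; in a real controller, waiting on `cache.WaitForCacheSync` from client-go plays a similar role. The idea is simply to refuse to make decisions until the initial list has been installed.

```go
package main

import (
	"fmt"
	"sync"
)

// cache is a hypothetical stand-in for an informer's local store.
type cache struct {
	mu     sync.RWMutex
	synced bool
	pods   map[string]bool
}

func newCache() *cache { return &cache{pods: map[string]bool{}} }

// replace installs the result of an initial list and marks the cache synced.
func (c *cache) replace(names []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pods = make(map[string]bool, len(names))
	for _, n := range names {
		c.pods[n] = true
	}
	c.synced = true
}

// shouldCreate reports whether the named pod is missing and should be created.
// ok is false while the cache is unsynced, which tells the caller to retry
// later instead of acting on an empty view of the cluster.
func (c *cache) shouldCreate(name string) (create, ok bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if !c.synced {
		return false, false
	}
	return !c.pods[name], true
}

func main() {
	c := newCache()

	// Right after a restart the cache is empty and must not be trusted.
	_, ok := c.shouldCreate("web-0")
	fmt.Println("decision available before sync:", ok)

	// Once the initial list has landed, decisions are based on real data.
	c.replace([]string{"web-0"})
	create, ok := c.shouldCreate("web-0")
	fmt.Println("decision available after sync:", ok, "create:", create)
}
```

Without the `synced` check, the first call would report that `web-0` is missing and the controller would create a duplicate, which is exactly the kind of incorrect action described above.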

What improvements does Kubernetes v1.36 introduce to address staleness?

Kubernetes v1.36 brings meaningful improvements both in the client‑go library and in the highly contended controllers within kube-controller-manager. The key addition is the Atomic FIFO feature (feature gate: AtomicFIFO). This new approach builds on the existing FIFO queue to atomically handle batches of events—such as the initial list used to populate an informer’s cache. By processing events in an atomic manner, the queue always reflects a consistent state of the cluster, even when events arrive out of order. This eliminates a major source of cache inconsistency that plagued previous versions. In addition, the kube‑controller‑manager now uses these client‑go improvements for its own controllers, reducing staleness across the board.

How does the new Atomic FIFO feature work in client‑go?

The Atomic FIFO feature enhances client‑go’s existing FIFO queue implementation. Previously, events were added to the queue in the order they were received from the API server. If events arrived out of sequence, for example an update event before the corresponding add event, the queue could become inconsistent, leading to an inaccurate cache. With Atomic FIFO, the queue handles batches of events atomically. During a batch operation like the initial list, the entire set of objects is assembled and then applied to the queue as a single atomic unit. Even if individual events arrive out of order, the queue therefore never exposes a half-applied batch, and the cache built from it moves from one consistent snapshot to the next. Developers can enable this behavior through the AtomicFIFO feature gate in client‑go.
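
The batching idea can be sketched in a few lines of standard-library Go. This is not the client-go implementation: `event`, `store`, and `batchQueue` are hypothetical names chosen for illustration. The sketch only shows the general technique of queuing a whole batch as one entry and applying it under a single lock, so that readers see either none of the batch or all of it.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// event is a hypothetical change notification for a single object.
type event struct {
	key     string
	deleted bool
}

// store is a hypothetical cache of object keys.
type store struct {
	mu    sync.RWMutex
	items map[string]bool
}

func newStore() *store { return &store{items: map[string]bool{}} }

// applyBatch applies every event while holding the write lock, so a
// concurrent reader never observes a partially applied batch.
func (s *store) applyBatch(batch []event) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, e := range batch {
		if e.deleted {
			delete(s.items, e.key)
		} else {
			s.items[e.key] = true
		}
	}
}

// keys returns a sorted snapshot of the stored keys.
func (s *store) keys() []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make([]string, 0, len(s.items))
	for k := range s.items {
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}

// batchQueue is a hypothetical FIFO whose entries are whole batches.
type batchQueue struct {
	mu      sync.Mutex
	batches [][]event
}

// addBatch enqueues all given events as one queue entry.
func (q *batchQueue) addBatch(batch ...event) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.batches = append(q.batches, append([]event(nil), batch...))
}

// popInto removes the oldest batch and applies it to s in one step.
// It returns false when the queue is empty.
func (q *batchQueue) popInto(s *store) bool {
	q.mu.Lock()
	if len(q.batches) == 0 {
		q.mu.Unlock()
		return false
	}
	batch := q.batches[0]
	q.batches = q.batches[1:]
	q.mu.Unlock()
	s.applyBatch(batch)
	return true
}

func main() {
	q := &batchQueue{}
	s := newStore()

	// The initial list arrives as one batch; a later watch event is its own batch.
	q.addBatch(event{key: "pod-a"}, event{key: "pod-b"}, event{key: "pod-c"})
	q.addBatch(event{key: "pod-b", deleted: true})

	for q.popInto(s) {
		fmt.Println(s.keys())
	}
}
```

A per-event queue would let a reader see the store with only `pod-a` in it partway through the initial list; with batch entries, the first state a reader can observe already contains all three pods.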

How can developers leverage these client‑go improvements to build more reliable controllers?

Developers using client‑go can benefit from Atomic FIFO by enabling the feature gate for their controller. Once enabled, the informer’s cache becomes far less susceptible to ordering issues, especially during initial list operations or after reconnections. This leads to more predictable reconcile behavior, because controllers act only on consistent cache states. Additionally, v1.36 lets developers introspect the cache to determine the latest resource version it has seen. This introspection capability allows controllers to verify whether their cached data is sufficiently fresh before making decisions. For example, a controller can compare the resource version known to its cache with a version reported by the API server; if the cache is behind, it can wait or trigger a forced refresh. These tools help authors build controllers that detect and mitigate staleness proactively.
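
A freshness check of this kind might look like the following Go sketch. The helper `atLeast` is hypothetical and is not a client-go function. It also assumes that both resource versions parse as base-10 unsigned integers, which should be verified for the cluster and Kubernetes version in use, since resource versions have traditionally been documented as opaque strings.

```go
package main

import (
	"fmt"
	"strconv"
)

// atLeast reports whether the cache's resource version has reached the
// required one. Assumption: both values are base-10 unsigned integers.
func atLeast(cachedRV, requiredRV string) (bool, error) {
	cached, err := strconv.ParseUint(cachedRV, 10, 64)
	if err != nil {
		return false, fmt.Errorf("cached resource version %q: %w", cachedRV, err)
	}
	required, err := strconv.ParseUint(requiredRV, 10, 64)
	if err != nil {
		return false, fmt.Errorf("required resource version %q: %w", requiredRV, err)
	}
	return cached >= required, nil
}

func main() {
	// Hypothetical values: the cache has seen 1041, while a recent API
	// response (for example, the controller's own last write) reported 1042.
	fresh, err := atLeast("1041", "1042")
	if err != nil {
		fmt.Println("cannot compare:", err)
		return
	}
	if !fresh {
		fmt.Println("cache is behind; requeue and retry later")
	}
}
```

Returning an error for unparseable input, rather than treating it as stale or fresh, keeps the caller from silently acting on a comparison that never happened.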

What observability benefits do these changes bring for controller monitoring?

Beyond stability, v1.36 enhances observability into controller behavior. With the introduction of cache introspection, operators can now monitor the freshness of each controller’s cache. This means they can answer questions like “How up‑to‑date is my controller’s view of the cluster?” or “Has the controller processed the latest events from the API server?” Previously, such insights required custom metrics or deep debugging. Now, client‑go exposes the latest resource version known to the cache, which can be logged or exposed via metrics. This data helps identify controllers that are lagging due to high load, network issues, or bugs. Combined with the Atomic FIFO improvements, operators gain both prevention of staleness and visibility into when it might occur. This is a significant step toward running controllers with higher confidence in production environments, as teams can now detect problems before they cause downstream failures.
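
One way to turn that resource version into a monitoring signal is sketched below in standard-library Go. The `freshness` type is hypothetical and does not correspond to a client-go API or a built-in metric. It records when the observed resource version last changed, so the elapsed time can be logged or exported as a gauge by whatever metrics library the controller already uses.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// freshness is a hypothetical tracker for how recently a cache advanced.
type freshness struct {
	mu        sync.Mutex
	lastRV    string
	changedAt int64 // Unix seconds at which lastRV was first observed
}

// observe records the resource version seen at nowUnix. The timestamp moves
// only when the resource version differs from the previous observation.
func (f *freshness) observe(rv string, nowUnix int64) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if rv != f.lastRV {
		f.lastRV = rv
		f.changedAt = nowUnix
	}
}

// secondsUnchanged returns how long the resource version has stayed the same.
func (f *freshness) secondsUnchanged(nowUnix int64) int64 {
	f.mu.Lock()
	defer f.mu.Unlock()
	return nowUnix - f.changedAt
}

func main() {
	f := &freshness{}

	// In a real controller the value would come from the cache's
	// introspection call; "1042" is a placeholder.
	f.observe("1042", time.Now().Unix())
	fmt.Println("seconds since the cache last advanced:", f.secondsUnchanged(time.Now().Unix()))
}
```

A long unchanged period is only a hint: on a quiet cluster nothing may have changed, so this signal is most useful alongside a reference value from the API server.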
