The Missing Piece of Kubernetes Operators
Introduction
My journey with Kubernetes operators started more than three years ago, and I've built several since, mostly for managing external resources, because I really like the idea of treating Kubernetes as a universal API. Before getting to the point, I'd highly suggest reading Ahmet's blog post on controller pitfalls if you're not familiar with the operator pattern, or if you want to see some of the common mistakes in operator development. That post was my main motivation for understanding controller-runtime deeply and, eventually, for building kube-external-watcher, which is what this post is about.
Problem statement
Any Kubernetes operator that manages resources outside of Kubernetes (a cloud API, a bare-metal hypervisor, etc.) has to answer the same reasonable question: how do you detect out-of-band changes caused by other actors?
This is a question you should always ask yourself when building an operator, especially one that manages external resources. You can't assume your operator is the only actor that will ever touch the resource it's managing. For example, if your CRD represents an RDS instance, a human might log into the AWS console and edit it directly, or another controller might be reconciling the same resource. If your operator doesn't know how to react to those changes, you'll eventually find yourself in trouble.
Inside Kubernetes, the answer is simple. Informers watch the API server, the cache fires events, the reconciler runs. But when the object you’re managing lives outside of Kubernetes, the event bus stops at the cluster boundary. If a human edits that RDS instance in the AWS console, your operator has no way to know. At least, no native way.
How existing operators solve this problem
They all solve it the same way: periodic reconciliation at
some fixed cadence via RequeueAfter. A quick look through the major
projects:
AWS Controllers for Kubernetes (ACK)
ACK calls this pattern
drift recovery.
The default resync period is ten hours, defined as a constant in the
runtime
(runtime/reconciler.go#L57):
defaultResyncPeriod = 10 * time.Hour
Once a resource reaches ResourceSynced = true, the reconciler keeps
requeueing on that interval to re-check AWS for divergence:
rlog.Debug("requeuing", "after", r.resyncPeriod)
return latest, requeue.NeededAfter(nil, r.resyncPeriod)
The period is configurable globally or per-resource kind via the Helm
values reconcile.defaultResyncPeriod and reconcile.resourceResyncPeriods.
Crossplane
Crossplane follows the same shape on a tighter default. Its
managed-resource reconciler polls every external resource once a minute
(reconciler.go#L51):
defaultPollInterval = 1 * time.Minute
The Crossplane docs
put it plainly: managed resources “rely on polling to detect changes in
the external system.” The interval is configurable per provider pod via
the --poll-interval argument, alongside a separate --sync-interval
(one hour by default) that drives full reconciliation sweeps.
Google Config Connector
Config Connector documents its
reconciliation strategy
openly: every successful reconcile returns
reconcile.Result{RequeueAfter: jitteredPeriod}
(tf/controller.go#L278),
and the same pattern runs across all controllers
via a shared jitter.Generator. The default mean period is ten minutes
(pkg/k8s/constants.go#L35):
MeanReconcileReenqueuePeriod = 10 * time.Minute
JitterFactor = 2.0
The interval is overridable per-resource via the
cnrm.cloud.google.com/reconcile-interval-in-seconds annotation
(introduced in Config Connector 1.102). Setting it to 0 disables drift
correction for that resource once it reaches UpToDate, and per Google’s
docs that choice is irreversible.
Azure Service Operator
Azure Service Operator ships an
interval calculator
whose SyncPeriod parameter becomes the RequeueAfter on every healthy
reconcile:
// SyncPeriod is the duration after which to re-sync healthy (non-error)
// requests. If omitted, requests are not re-synced periodically.
SyncPeriod *time.Duration
The default is one hour, defined in
config/vars.go#L22:
DefaultSyncIntervalString = "1h"
It is configurable via the AZURE_SYNC_PERIOD environment variable, with
the special value "never" disabling periodic drift reconciliation
entirely.
The limitations of RequeueAfter
RequeueAfter is built-in and familiar, and it works. That’s why every
project uses it. But it has two fundamental problems:
- Every requeue is a full reconciliation. Even if nothing changed, RequeueAfter sends the CR back to the queue, and the reconciler runs through the whole loop again. If you have a lot of resources, or if your reconciliation is expensive, that adds up to a lot of unnecessary work.
- There's no cheap way to just observe. Inside Kubernetes, informers watch the API server and only wake the reconciler when something actually changes; observation and reaction are separate loops. For external state, the only tool you have to check the world is to reconcile, so every look costs a full loop. You also need to make sure your reconcile method is smart enough to detect that nothing changed and return early, which adds complexity to your reconciler.
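To make the cost difference concrete, here is a toy stdlib-only simulation (no controller-runtime, all names hypothetical) of a single CR over a hundred polling intervals where the external resource only drifts twice:

```go
package main

import "fmt"

// simulate runs `ticks` polling intervals for a single CR whose external
// resource drifts only at the ticks marked in `drifted`. It returns how
// many reconciles each strategy triggers: RequeueAfter reconciles on every
// tick, while an external watcher only enqueues when drift is observed.
func simulate(ticks int, drifted map[int]bool) (requeueRuns, eventRuns int) {
	for tick := 1; tick <= ticks; tick++ {
		requeueRuns++ // timer fires: full reconciliation, changed or not
		if drifted[tick] {
			eventRuns++ // drift event fires: reconciliation with real work
		}
	}
	return requeueRuns, eventRuns
}

func main() {
	r, e := simulate(100, map[int]bool{30: true, 70: true})
	fmt.Println("RequeueAfter reconciles:", r) // 100
	fmt.Println("event-driven reconciles:", e) // 2
}
```

Multiply that gap by thousands of CRs and a short interval, and the difference between "check the world" and "reconcile the world" stops being academic.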
RequeueAfter might be fine for a handful of CRs. But when you're managing thousands, each one running the full loop on its own timer, it gets noisy fast, especially if you want to requeue frequently.
First attempt: a custom watcher
In one of my earlier operators, I ran into exactly these issues, so I wanted to get
rid of RequeueAfter for detecting out-of-band changes. I wanted to watch those resources frequently without every check firing a reconcile and keeping the workqueue busy. I wanted to only reconcile
when something changed, not on a timer.
The approach I came up with was a custom watcher, inline in the operator’s own package. Each controller registered a watcher per CR it owned. Each watcher was a goroutine with a ticker that, on each tick, called into a pile of callbacks the controller passed in:
func StartWatcher(ctx context.Context, resource Resource,
	stopChan chan struct{},
	fetchResource FetchResourceFunc,
	updateStatus UpdateStatusFunc,
	checkDelta CheckDeltaFunc,
	handleAutoStart HandleAutoStartFunc,
	handleReconcile HandleReconcileFunc,
	deleteWatcher DeleteWatcherFunc) (ctrl.Result, error) {
	ticker := time.NewTicker(ObserveInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// handleAutoStart, fetchResource, updateStatus,
			// checkDelta, then maybe handleReconcile...
		case <-ctx.Done():
			return ctrl.Result{}, ctx.Err()
		case <-stopChan:
			return ctrl.Result{}, nil
		}
	}
}
This worked. It was also clearly the wrong place for it. A few things kept tripping me up:
- Per-controller wiring. Every new CRD meant a fresh set of callbacks, registration sites, and lifecycle plumbing.
- Tangled concerns. The watcher handled status updates, auto-start, and reconciliation triggering — work that belonged in the controller. The controller, in turn, had to know how to drive the watcher’s lifecycle. Each side was doing work that belonged to the other, and the boundaries were leaky and confusing.
- No reuse. The moment I started building another operator, I had to copy-paste the whole watcher package and redo the drift-handling wiring per controller.
Overall, I thought the watcher pattern was the right approach, but it
still felt like a hack bolted into the controllers rather than a
first-class citizen in the controller-runtime ecosystem.
Moving forward
What I actually wanted was a small, reusable component that did exactly one thing: observe external state and emit a signal when it drifts. Not reconcile. Not update status. Not manage the CR’s lifecycle. Just:
Has the resource I’m watching changed? If yes, fire an event.
The reconciler can listen for that event through a source.Channel
registered via controller-runtime’s WatchesRawSource, same as it would
listen to any other source. Polling becomes cheap (one API call per
interval, no reconciliation), and the controller queue only moves when
there is genuine work. The observation loop and the reconciliation loop
are finally separate.
To address both the limits of RequeueAfter and the shortcomings of my
first attempt, I started
kube-external-watcher:
a small library built on controller-runtime that implements exactly this
shape, plugging into the manager as a manager.Runnable. Polling happens
in a tiny goroutine, drift surfaces as events, and the reconciler only
runs when there’s actually something to do.
How it works
The library provides an ExternalWatcher that implements manager.Runnable, so it lives next to the controllers inside the manager and has access to the cache and client.
The user supplies four small methods (GetDesiredState,
FetchExternalResource, TransformExternalState,
IsResourceReadyToWatch), plus a StateComparator that decides what
“drift” means for their type.
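For a sense of the shape, here is a hypothetical sketch of what those hooks and the comparator could look like. The signatures below are illustrative guesses based on the method names above, not the library's actual API:

```go
package main

import (
	"context"
	"fmt"
	"reflect"
)

// Hypothetical shapes for the four watcher hooks described above; the real
// kube-external-watcher signatures may differ.
type ExternalStateFetcher interface {
	GetDesiredState(ctx context.Context, key string) (any, error)       // desired state, from the CR spec
	FetchExternalResource(ctx context.Context, key string) (any, error) // raw state, from the external API
	TransformExternalState(raw any) (any, error)                        // normalize raw state for comparison
	IsResourceReadyToWatch(key string) bool                             // e.g. wait until Status.InstanceID is set
}

// StateComparator decides what "drift" means for a given type.
type StateComparator interface {
	Drifted(desired, actual any) bool
}

// DeepEqualComparator is the simplest possible comparator:
// any difference at all counts as drift.
type DeepEqualComparator struct{}

func (DeepEqualComparator) Drifted(desired, actual any) bool {
	return !reflect.DeepEqual(desired, actual)
}

func main() {
	var c StateComparator = DeepEqualComparator{}
	fmt.Println(c.Drifted(map[string]int{"size": 2}, map[string]int{"size": 3})) // true
}
```

In practice you would usually want a comparator that ignores server-populated fields (timestamps, computed defaults), since deep equality on raw API responses tends to report drift that isn't yours to fix.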
The controller wires the watcher’s event channel in like any other source:
ew := watcher.NewExternalWatcher(myFetcher,
watcher.WithDefaultPollInterval(30*time.Second),
watcher.WithComparator(watcher.NewDeepEqualComparator()),
watcher.WithAutoRegister(mgr.GetCache(), &myv1.Database{},
func(obj client.Object) watcher.ResourceConfig {
cr := obj.(*myv1.Database)
return watcher.ResourceConfig{ResourceKey: cr.Status.InstanceID}
}),
)
mgr.Add(ew)
ctrl.NewControllerManagedBy(mgr).
For(&myv1.Database{}).
WatchesRawSource(source.Channel(ew.EventChannel(), &handler.EnqueueRequestForObject{})).
Complete(myReconciler)
When a resource is created, the cache informer tells the watcher. A
goroutine spawns, polls on its interval, compares desired against actual.
If drift is detected, a GenericEvent lands in the controller’s work
queue with the CR’s name and namespace. The reconciler wakes up, for real
work this time.
If nothing drifts, nothing happens. That’s the whole point.
Closing
I don’t claim to have invented anything new here. Polling an external
source and emitting events is documented in the
controller-runtime source code.
But for reasons I don’t fully understand, none of the major operators
adopt it. Instead, they all reach for RequeueAfter.
Although RequeueAfter is fine for most Kubernetes work, it’s not a
good fit here. It forces you to run a full reconcile every time you
want to check for drift. If you want to check frequently, or if your
reconciliation is expensive, that turns into a lot of unnecessary work
and noise.
So that’s the piece I keep finding missing in operators: a cheap way to just watch external resources. kube-external-watcher is my attempt at fixing it. It’s small on purpose, and I’d love to hear what you think. Issues and PRs welcome on GitHub.