Introduction#

My journey with Kubernetes operators started more than three years ago, and since then I’ve built several of them, mostly for managing external resources, because I really like the idea of treating Kubernetes as a universal API. Before getting to the point, I’d highly suggest reading Ahmet’s blog post on controller pitfalls if you’re not familiar with the operator pattern, or if you want to see some of the common mistakes in operator development. That post was the main motivator for me to understand controller-runtime deeply and, eventually, to build kube-external-watcher, which is what this post is about.

Problem statement#

Any Kubernetes operator that manages resources outside of Kubernetes (a cloud API, a bare-metal hypervisor, etc.) has to answer the same fundamental question: how do you detect out-of-band changes caused by other actors?

This is a question you should always ask yourself when building an operator, especially one that manages external resources. You can’t assume your operator is the only actor that will ever touch the resources it manages. For example, if your CRD represents an RDS instance, a human might log into the AWS console and edit it directly, or another controller might be reconciling the same resource. If your operator doesn’t know how to react to those changes, you’ll eventually find yourself in trouble.

Inside Kubernetes, the answer is simple. Informers watch the API server, the cache fires events, the reconciler runs. But when the object you’re managing lives outside of Kubernetes, the event bus stops at the cluster boundary. If a human edits that RDS instance in the AWS console, your operator has no way to know. At least, no native way.

How existing operators solve this problem#

They all solve it the same way: periodic reconciliation at some fixed cadence via RequeueAfter. A quick look through the major projects:

AWS Controllers for Kubernetes (ACK)#

ACK calls this pattern drift recovery. The default resync period is ten hours, defined as a constant in the runtime (runtime/reconciler.go#L57):

defaultResyncPeriod = 10 * time.Hour

Once a resource reaches ResourceSynced = true, the reconciler keeps requeueing on that interval to re-check AWS for divergence:

rlog.Debug("requeuing", "after", r.resyncPeriod)
return latest, requeue.NeededAfter(nil, r.resyncPeriod)

The period is configurable globally or per-resource kind via the Helm values reconcile.defaultResyncPeriod and reconcile.resourceResyncPeriods.

Crossplane#

Crossplane follows the same shape on a tighter default. Its managed-resource reconciler polls every external resource once a minute (reconciler.go#L51):

defaultPollInterval = 1 * time.Minute

The Crossplane docs put it plainly: managed resources “rely on polling to detect changes in the external system.” The interval is configurable per provider pod via the --poll-interval argument, alongside a separate --sync-interval (one hour by default) that drives full reconciliation sweeps.

Google Config Connector#

Config Connector documents its reconciliation strategy openly: every successful reconcile returns reconcile.Result{RequeueAfter: jitteredPeriod} (tf/controller.go#L278), and the same pattern runs across all controllers via a shared jitter.Generator. The default mean period is ten minutes (pkg/k8s/constants.go#L35):

MeanReconcileReenqueuePeriod = 10 * time.Minute
JitterFactor                 = 2.0

The interval is overridable per-resource via the cnrm.cloud.google.com/reconcile-interval-in-seconds annotation (introduced in Config Connector 1.102). Setting it to 0 disables drift correction for that resource once it reaches UpToDate, and per Google’s docs that choice is irreversible.

Azure Service Operator#

Azure Service Operator ships an interval calculator whose SyncPeriod parameter becomes the RequeueAfter on every healthy reconcile:

// SyncPeriod is the duration after which to re-sync healthy (non-error)
// requests. If omitted, requests are not re-synced periodically.
SyncPeriod           *time.Duration

The default is one hour, defined in config/vars.go#L22:

DefaultSyncIntervalString = "1h"

It is configurable via the AZURE_SYNC_PERIOD environment variable, with the special value "never" disabling periodic drift reconciliation entirely.

The limitations of RequeueAfter#

RequeueAfter is built-in and familiar, and it works. That’s why every project uses it. But it has two fundamental problems:

  1. Every requeue is a full reconciliation. Even if nothing changed, RequeueAfter sends the CR back to the queue, and the reconciler runs through the whole loop again. If you have a lot of resources, or if your reconciliation is expensive, this adds up to a lot of unnecessary work.
  2. There’s no cheap way to just observe. Inside Kubernetes, informers watch the API server and only wake the reconciler when something actually changes; observation and reaction are separate loops. For external state, the only tool you have for checking the world is to reconcile, so every look costs a full loop. You also need to make sure your reconcile method is smart enough to detect that nothing changed and return early, which adds complexity to your reconciler.
RequeueAfter might be fine for a handful of CRs. But when you’re managing thousands, each one running the full loop on its own timer, it gets noisy fast, especially if you want to requeue frequently.
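To make the asymmetry concrete, here is a minimal stdlib sketch of the observe-then-signal loop (names and shapes are hypothetical; the library described later wires this into controller-runtime instead of a bare channel). With RequeueAfter, every one of the five polls below would have cost a full reconcile; the observer only signals when the fetched state actually differs:

```go
package main

import (
	"fmt"
	"reflect"
)

// observe polls an external fetch function and sends on events only when the
// fetched state differs from the previous observation. The expensive
// reconciler-side work then runs per event, not per poll.
func observe(ticks int, fetch func() map[string]string, events chan<- struct{}) {
	var last map[string]string
	for i := 0; i < ticks; i++ { // one tick = one cheap API call
		cur := fetch()
		if !reflect.DeepEqual(cur, last) { // drift: wake the reconciler
			events <- struct{}{}
		}
		last = cur
	}
	close(events)
}

func main() {
	// External state changes once across five polls.
	states := []map[string]string{
		{"size": "small"}, {"size": "small"}, {"size": "large"},
		{"size": "large"}, {"size": "large"},
	}
	i := -1
	fetch := func() map[string]string { i++; return states[i] }

	events := make(chan struct{}, len(states))
	observe(len(states), fetch, events)

	n := 0
	for range events {
		n++
	}
	fmt.Println(n) // 2: the initial observation plus one real change
}
```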

First attempt: a custom watcher#

In one of my earlier operators, I ran into exactly these issues, so I wanted to get rid of RequeueAfter for detecting out-of-band changes. I wanted to watch those resources frequently without every check firing a reconcile and keeping the workqueue busy. I wanted to only reconcile when something changed, not on a timer.

The approach I came up with was a custom watcher, inline in the operator’s own package. Each controller registered a watcher per CR it owned. Each watcher was a goroutine with a ticker that, on each tick, called into a pile of callbacks the controller passed in:

func StartWatcher(ctx context.Context, resource Resource,
    stopChan chan struct{},
    fetchResource    FetchResourceFunc,
    updateStatus     UpdateStatusFunc,
    checkDelta       CheckDeltaFunc,
    handleAutoStart  HandleAutoStartFunc,
    handleReconcile  HandleReconcileFunc,
    deleteWatcher    DeleteWatcherFunc) (ctrl.Result, error) {
    ticker := time.NewTicker(ObserveInterval)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            // handleAutoStart, fetchResource, updateStatus,
            // checkDelta, then maybe handleReconcile...
        case <-ctx.Done():
            return ctrl.Result{}, ctx.Err()
        case <-stopChan:
            return ctrl.Result{}, nil
        }
    }
}

This worked. It was also clearly the wrong place for it. A few things kept tripping me up:

  • Per-controller wiring. Every new CRD meant a fresh set of callbacks, registration sites, and lifecycle plumbing.
  • Tangled concerns. The watcher handled status updates, auto-start, and reconciliation triggering — work that belonged in the controller. The controller, in turn, had to know how to drive the watcher’s lifecycle. Each side was doing work that belonged to the other, and the boundaries were leaky and confusing.
  • No reuse. The moment I started building another operator, I had to copy-paste the whole watcher package and redo the drift-handling wiring per controller.

Overall, I thought the watcher pattern was the right approach, but it still felt like a hack bolted into the controllers rather than a first-class citizen in the controller-runtime ecosystem.

Moving forward#

What I actually wanted was a small, reusable component that did exactly one thing: observe external state and emit a signal when it drifts. Not reconcile. Not update status. Not manage the CR’s lifecycle. Just:

Has the resource I’m watching changed? If yes, fire an event.

The reconciler can listen for that event through a source.Channel registered via controller-runtime’s WatchesRawSource, same as it would listen to any other source. Polling becomes cheap (one API call per interval, no reconciliation), and the controller queue only moves when there is genuine work. The observation loop and the reconciliation loop are finally separate.

To address both the limits of RequeueAfter and the shortcomings of my first attempt, I started kube-external-watcher: a small library built on controller-runtime that implements exactly this shape, plugging into the manager as a manager.Runnable. Polling happens in a tiny goroutine, drift surfaces as events, and the reconciler only runs when there’s actually something to do.

How it works#

The library provides an ExternalWatcher that implements manager.Runnable, so it lives next to the controllers inside the manager and has access to the cache and client. The user supplies four small methods (GetDesiredState, FetchExternalResource, TransformExternalState, IsResourceReadyToWatch), plus a StateComparator that decides what “drift” means for their type.
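A rough sketch of what that contract might look like, with signatures simplified to plain `any` so it stays self-contained (the library’s real interfaces are typed against client.Object and its own option types, so take the shapes below as my reconstruction, not its actual API):

```go
package main

import (
	"context"
	"fmt"
	"reflect"
)

// ExternalFetcher is a simplified guess at the four-method contract.
type ExternalFetcher interface {
	// IsResourceReadyToWatch gates polling until the CR carries enough
	// state (e.g. an instance ID) to identify its external counterpart.
	IsResourceReadyToWatch(obj any) bool
	// GetDesiredState extracts what the CR says the world should look like.
	GetDesiredState(obj any) (any, error)
	// FetchExternalResource reads the external system: one cheap API call.
	FetchExternalResource(ctx context.Context, key string) (any, error)
	// TransformExternalState normalizes the raw response so it can be
	// compared field-for-field against the desired state.
	TransformExternalState(raw any) (any, error)
}

// StateComparator decides what "drift" means for a given type.
type StateComparator interface {
	Drifted(desired, actual any) bool
}

// deepEqualComparator is the simplest possible comparator, in the spirit
// of a deep-equality default.
type deepEqualComparator struct{}

func (deepEqualComparator) Drifted(desired, actual any) bool {
	return !reflect.DeepEqual(desired, actual)
}

func main() {
	var c StateComparator = deepEqualComparator{}
	fmt.Println(c.Drifted(map[string]int{"replicas": 3}, map[string]int{"replicas": 3})) // false
	fmt.Println(c.Drifted(map[string]int{"replicas": 3}, map[string]int{"replicas": 5})) // true
}
```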

The controller wires the watcher’s event channel in like any other source:

ew := watcher.NewExternalWatcher(myFetcher,
    watcher.WithDefaultPollInterval(30*time.Second),
    watcher.WithComparator(watcher.NewDeepEqualComparator()),
    watcher.WithAutoRegister(mgr.GetCache(), &myv1.Database{},
        func(obj client.Object) watcher.ResourceConfig {
            cr := obj.(*myv1.Database)
            return watcher.ResourceConfig{ResourceKey: cr.Status.InstanceID}
        }),
)
mgr.Add(ew)

ctrl.NewControllerManagedBy(mgr).
    For(&myv1.Database{}).
    WatchesRawSource(source.Channel(ew.EventChannel(), &handler.EnqueueRequestForObject{})).
    Complete(myReconciler)

When a resource is created, the cache informer tells the watcher. A goroutine spawns, polls on its interval, compares desired against actual. If drift is detected, a GenericEvent lands in the controller’s work queue with the CR’s name and namespace. The reconciler wakes up, for real work this time.

If nothing drifts, nothing happens. That’s the whole point.

Closing#

I don’t claim to have invented anything new here. Polling an external source and emitting events is documented in the controller-runtime source code. But for reasons I don’t fully understand, none of the major operators adopt it. Instead, they all reach for RequeueAfter.

Although RequeueAfter is fine for most Kubernetes work, it’s not a good fit here. It forces you to run a full reconcile every time you want to check for drift. If you want to check frequently, or if your reconciliation is expensive, that turns into a lot of unnecessary work and noise.

So that’s the piece I keep finding missing in operators: a cheap way to just watch external resources. kube-external-watcher is my attempt at fixing it. It’s small on purpose, and I’d love to hear what you think. Issues and PRs welcome on GitHub.