Missing Piece of Kubernetes Operators
My journey with Kubernetes operators started more than three years ago, and since then I’ve built several, mostly for managing external resources, because I really like the idea of treating Kubernetes as a universal API. Before getting to the point, I’d highly suggest reading through Ahmet’s blog post on controller pitfalls if you’re not familiar with the operator pattern, or if you want to see some of the common mistakes in operator development. That post was the main motivator for me to understand controller-runtime deeply and, eventually, to build kube-external-watcher, which is what this post is about.
The problem#
Any Kubernetes operator that manages resources outside of Kubernetes (a cloud API, a bare-metal hypervisor, etc.) has to answer the same reasonable question: how do you detect out-of-band changes caused by other actors?
This is a question you should always ask yourself when building an operator, especially one that manages external resources. You can’t assume your operator is the only actor that will ever touch the resource it’s managing. For example, if your CRD represents an RDS instance, a human might log into the AWS console and edit it directly, or another controller might be reconciling the same resource. If your operator doesn’t know how to react to those changes, you’ll eventually find yourself in trouble.
Inside Kubernetes, the answer is simple. Informers watch the API server, the cache fires events, the reconciler runs. But when the object you’re managing lives outside of Kubernetes, the event bus stops at the cluster boundary. If a human edits that RDS instance in the AWS console, your operator has no way to know. At least, no native way.
How existing operators solve this problem#
They all solve it the same way: periodic reconciliation at
some fixed cadence via RequeueAfter. A quick look through the major
projects:
AWS Controllers for Kubernetes (ACK)#
ACK calls this pattern
drift recovery.
The default resync period is ten hours, defined as a constant in the
runtime
(runtime/reconciler.go#L57):
defaultResyncPeriod = 10 * time.Hour
Once a resource reaches ResourceSynced = true, the reconciler keeps
requeueing on that interval to re-check AWS for divergence:
rlog.Debug("requeuing", "after", r.resyncPeriod)
return latest, requeue.NeededAfter(nil, r.resyncPeriod)
The period is configurable globally or per-resource kind via the Helm
values reconcile.defaultResyncPeriod and reconcile.resourceResyncPeriods.
Crossplane#
Crossplane follows the same shape on a tighter default. Its
managed-resource reconciler polls every external resource once a minute
(reconciler.go#L51):
defaultPollInterval = 1 * time.Minute
The Crossplane docs
put it plainly: managed resources “rely on polling to detect changes in
the external system.” The interval is configurable per provider pod via
the --poll-interval argument, alongside a separate --sync-interval
(one hour by default) that drives full reconciliation sweeps.
Google Config Connector#
Config Connector documents its
reconciliation strategy
openly: every successful reconcile returns
reconcile.Result{RequeueAfter: jitteredPeriod}
(tf/controller.go#L278),
and the same pattern runs across all controllers
via a shared jitter.Generator. The default mean period is ten minutes
(pkg/k8s/constants.go#L35):
MeanReconcileReenqueuePeriod = 10 * time.Minute
JitterFactor = 2.0
The interval is overridable per-resource via the
cnrm.cloud.google.com/reconcile-interval-in-seconds annotation
(introduced in Config Connector 1.102). Setting it to 0 disables drift
correction for that resource once it reaches UpToDate, and per Google’s
docs that choice is irreversible.
Azure Service Operator#
Azure Service Operator ships an
interval calculator
whose SyncPeriod parameter becomes the RequeueAfter on every healthy
reconcile:
// SyncPeriod is the duration after which to re-sync healthy (non-error)
// requests. If omitted, requests are not re-synced periodically.
SyncPeriod *time.Duration
The default is one hour, defined in
config/vars.go#L22:
DefaultSyncIntervalString = "1h"
It is configurable via the AZURE_SYNC_PERIOD environment variable, with
the special value "never" disabling periodic drift reconciliation
entirely.
The limitations of RequeueAfter#
RequeueAfter is built-in and familiar, and it works. That’s why every
project uses it. But it has a few fundamental problems:
- Every requeue is a full reconciliation. Even if nothing changed, RequeueAfter sends the CR back to the queue and the reconciler runs through the whole loop again: fetch from the API server, fetch from the external system, diff, update status, write back. If you have a lot of resources, or if your reconciliation is expensive, this can lead to a lot of unnecessary work. It also means every reconciler has to be smart enough to detect “nothing changed” and return early cleanly, which is extra complexity you wouldn’t otherwise need.
- There’s no cheap way to just observe. Inside Kubernetes, informers watch the API server and only wake the reconciler when something actually changes; observation and reaction are separate loops. For external state, the only tool controller-runtime gives you to check the world is to reconcile, so every look costs a full loop.
- Picking an interval is a lose-lose. Too short and you drown the operator in pointless reconciles and burn through the external API’s rate limit. Too long and drift goes undetected for minutes or hours, which defeats the purpose. That’s partly why the major projects land on such different defaults: one minute for Crossplane, ten minutes for Config Connector, one hour for Azure Service Operator, ten hours for ACK. There isn’t a right answer, only tradeoffs.
- Drift checks compete with real work. The workqueue can’t tell the difference between “a user edited the CR” and “the periodic timer fired.” Both sit in the same queue, handled by the same workers. Crank the polling frequency up and genuine user-driven changes can end up waiting behind a backlog of drift checks.
RequeueAfter might be fine for a handful of CRs on a lazy interval.
But when you’re managing thousands of resources and want tight drift
detection, each one running the full loop on its own timer, it gets
noisy fast.
First attempt: a custom watcher#
In one of my earlier operators, I ran into exactly these issues, so I wanted to get
rid of RequeueAfter for detecting out-of-band changes. The goal was to check those resources frequently without every check firing a reconcile and keeping the workqueue busy, and to reconcile
only when something changed, not on a timer.
The approach I came up with was a custom watcher, inline in the operator’s own package. Each controller registered a watcher per CR it owned. Each watcher was a goroutine with a ticker that, on each tick, called into a pile of callbacks the controller passed in:
func StartWatcher(ctx context.Context, resource Resource,
	stopChan chan struct{},
	fetchResource FetchResourceFunc,
	updateStatus UpdateStatusFunc,
	checkDelta CheckDeltaFunc,
	handleAutoStart HandleAutoStartFunc,
	handleReconcile HandleReconcileFunc,
	deleteWatcher DeleteWatcherFunc) (ctrl.Result, error) {
	ticker := time.NewTicker(ObserveInterval)
	defer ticker.Stop() // release the ticker when the watcher exits
	for {
		select {
		case <-ticker.C:
			// handleAutoStart, fetchResource, updateStatus,
			// checkDelta, then maybe handleReconcile...
		case <-stopChan:
			return ctrl.Result{}, nil
		}
	}
}
This worked. It was also clearly the wrong place for it. A few things kept tripping me up:
- Per-controller wiring. Every new CRD meant a fresh set of callbacks, registration sites, and lifecycle plumbing.
- Tangled concerns. The watcher handled status updates, auto-start, and reconciliation triggering — work that belonged in the controller. The controller, in turn, had to know how to drive the watcher’s lifecycle. Each side was doing work that belonged to the other, and the boundaries were leaky and confusing.
- No reuse. The moment I started building another operator, I had to copy-paste the whole watcher package and redo the drift-handling wiring per controller.
Overall, I thought the watcher pattern was the right approach, but it
still felt like a hack bolted into the controllers rather than a
first-class citizen in the controller-runtime ecosystem.
What I actually wanted#
What I actually wanted was a small, reusable component that did exactly one thing: observe external state and emit a signal when it drifts. Not reconcile. Not update status. Not manage the CR’s lifecycle. Just:
Has the resource I’m watching changed? If yes, trigger a reconcile.
The reconciler can listen for that signal by registering the watcher
itself as a source.Source via controller-runtime’s WatchesRawSource,
same as it would listen to any other source. Polling becomes cheap (one API call per
interval, no reconciliation), and the controller queue only moves when
there is genuine work. The observation loop and the reconciliation loop
are finally separate.
To address both the limits of RequeueAfter and the shortcomings of my
first attempt, I started
kube-external-watcher:
a small library built on controller-runtime that implements exactly this
shape, plugging into a controller as a source.Source. Polling happens
in a tiny goroutine, drift surfaces as reconcile requests, and the
reconciler only runs when there’s actually something to do.
How it works#
The library provides an ExternalWatcher that implements source.Source, so it wires straight into a controller via WatchesRawSource and enqueues reconcile requests onto the controller’s own workqueue.
The user supplies four small methods (GetDesiredState, FetchExternalResource, TransformExternalState, IsResourceReadyToWatch), plus a StateComparator that decides what “drift” means for their type.
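To illustrate the “transform, then compare” idea in isolation, here is a standard-library sketch. None of the types or signatures below are taken from kube-external-watcher. `State`, `transform`, and `drifted` are hypothetical stand-ins showing one way to project external state onto the fields the operator manages and then deep-compare it against the desired state.

```go
package main

import (
	"fmt"
	"reflect"
)

// State is a hypothetical, flattened view of a resource's settings.
type State map[string]string

// transform keeps only the keys the operator manages, so server-assigned
// fields (endpoints, timestamps, ...) can never register as drift.
func transform(external State, managed []string) State {
	out := make(State, len(managed))
	for _, k := range managed {
		if v, ok := external[k]; ok {
			out[k] = v
		}
	}
	return out
}

// drifted reports whether the desired and actual states differ.
func drifted(desired, actual State) bool {
	return !reflect.DeepEqual(desired, actual)
}

func main() {
	desired := State{"class": "db.small", "storage": "100"}
	managed := []string{"class", "storage"}

	external := State{"class": "db.small", "storage": "100", "endpoint": "10.0.0.7"}
	fmt.Println("drift:", drifted(desired, transform(external, managed)))

	external["class"] = "db.large" // someone edits the instance out of band
	fmt.Println("drift:", drifted(desired, transform(external, managed)))
}
```

Keeping the comparison pluggable matters because “drift” is type-specific: a deep-equal check is a reasonable default, but some fields may need tolerance or normalization.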
The controller wires the watcher in like any other source:
ew := watcher.NewExternalWatcher(myFetcher,
	watcher.WithDefaultPollInterval(30*time.Second),
	watcher.WithComparator(watcher.NewDeepEqualComparator()),
	watcher.WithAutoRegister(mgr.GetCache(), &myv1.Database{},
		func(obj client.Object) watcher.ResourceConfig {
			cr := obj.(*myv1.Database)
			return watcher.ResourceConfig{ResourceKey: cr.Status.InstanceID}
		}),
)
ctrl.NewControllerManagedBy(mgr).
	For(&myv1.Database{}).
	WatchesRawSource(ew).
	Complete(myReconciler)
When a resource is created, the cache informer tells the watcher. A
goroutine spawns, polls on its interval, compares desired against actual.
If drift is detected, a reconcile.Request lands directly in the
controller’s workqueue with the CR’s name and namespace. The reconciler
wakes up, for real work this time.
If nothing drifts, nothing happens. That’s the whole point.
Closing thoughts#
I don’t claim to have invented anything here. Polling external state
and feeding reconcile requests into the workqueue is documented right in the
controller-runtime source.
What surprises me is that none of the major operators actually reach
for it; they all settle on RequeueAfter. For anything that lives
outside the cluster, I think that’s a miss.
Building this changed how I read controller-runtime. The primitives are all there but they don’t show up in any of the big operator frameworks, so unless you go looking, you’d never know they existed. That’s the real miss: not the pattern itself, but that nobody treats it as a first-class tool for external state.
kube-external-watcher
is my take on packaging that pattern. Small on purpose — no CRDs, no
lifecycle magic, no opinions about your reconciler. Just a
source.Source you wire into a controller, point at a resource, and
listen to. If you’ve hit the same wall in your own operators, I’d love
to hear about it. Issues and PRs welcome on
GitHub.
This post was updated on the 3rd of June 2026 to reflect the breaking changes introduced in kube-external-watcher’s v0.1.0.