The Missing Piece of Kubernetes Operators
Introduction
My journey with Kubernetes operators started more than three years ago, and I've built several since, mostly for managing external resources, because I really like the idea of treating Kubernetes as a universal API. Before getting to the point, I'd highly suggest reading Ahmet's blog post on controller pitfalls if you're not familiar with the operator pattern, or if you want to see some of the common mistakes in operator development. That post was my main motivation for understanding controller-runtime deeply and, eventually, for building kube-external-watcher, which is what this post is about.
Problem statement
Any Kubernetes operator that manages resources outside of Kubernetes (a cloud API, a bare-metal hypervisor, etc.) has to answer the same reasonable question: how do you detect out-of-band changes caused by other actors?
This is a question you should always ask yourself when building an operator, especially one that manages external resources. You can't assume your operator is the only actor that will ever touch the resource it's managing. For example, if your CRD represents an RDS instance, a human might log into the AWS console and edit it directly, or another controller might be reconciling the same resource. If your operator doesn't know how to react to those changes, you'll eventually find yourself in trouble.
Inside Kubernetes, the answer is simple. Informers watch the API server, the cache fires events, the reconciler runs. But when the object you’re managing lives outside of Kubernetes, the event bus stops at the cluster boundary. If a human edits that RDS instance in the AWS console, your operator has no way to know. At least, no native way.
How existing operators solve this problem
They all solve it the same way: periodic reconciliation at
some fixed cadence via RequeueAfter. A quick look through the major
projects:
AWS Controllers for Kubernetes (ACK)
ACK calls this pattern
drift recovery.
The default resync period is ten hours, defined as a constant in the
runtime
(runtime/reconciler.go#L57):
defaultResyncPeriod = 10 * time.Hour
Once a resource reaches ResourceSynced = true, the reconciler keeps
requeueing on that interval to re-check AWS for divergence:
rlog.Debug("requeuing", "after", r.resyncPeriod)
return latest, requeue.NeededAfter(nil, r.resyncPeriod)
The period is configurable globally or per-resource kind via the Helm
values reconcile.defaultResyncPeriod and reconcile.resourceResyncPeriods.
Crossplane
Crossplane follows the same shape on a tighter default. Its
managed-resource reconciler polls every external resource once a minute
(reconciler.go#L51):
defaultPollInterval = 1 * time.Minute
The Crossplane docs
put it plainly: managed resources “rely on polling to detect changes in
the external system.” The interval is configurable per provider pod via
the --poll-interval argument, alongside a separate --sync-interval
(one hour by default) that drives full reconciliation sweeps.
Google Config Connector
Config Connector documents its
reconciliation strategy
openly: every successful reconcile returns
reconcile.Result{RequeueAfter: jitteredPeriod}
(tf/controller.go#L278),
and the same pattern runs across all controllers
via a shared jitter.Generator. The default mean period is ten minutes
(pkg/k8s/constants.go#L35):
MeanReconcileReenqueuePeriod = 10 * time.Minute
JitterFactor = 2.0
The interval is overridable per-resource via the
cnrm.cloud.google.com/reconcile-interval-in-seconds annotation
(introduced in Config Connector 1.102). Setting it to 0 disables drift
correction for that resource once it reaches UpToDate, and per Google’s
docs that choice is irreversible.
Azure Service Operator
Azure Service Operator ships an
interval calculator
whose SyncPeriod parameter becomes the RequeueAfter on every healthy
reconcile:
// SyncPeriod is the duration after which to re-sync healthy (non-error)
// requests. If omitted, requests are not re-synced periodically.
SyncPeriod *time.Duration
The default is one hour, defined in
config/vars.go#L22:
DefaultSyncIntervalString = "1h"
It is configurable via the AZURE_SYNC_PERIOD environment variable, with
the special value "never" disabling periodic drift reconciliation
entirely.
The limitations of RequeueAfter
RequeueAfter is built-in and familiar, and it works. That’s why every
project uses it. But it has two fundamental problems:
- Every requeue is a full reconciliation. Even if nothing changed, RequeueAfter sends the CR back to the queue, and the reconciler runs through the whole loop again. If you have a lot of resources, or if your reconciliation is expensive, that adds up to a lot of unnecessary work.
- There's no cheap way to just observe. Inside Kubernetes, informers watch the API server and only wake the reconciler when something actually changes; observation and reaction are separate loops. For external state, the only tool you have to check the world is to reconcile, so every look costs a full loop. You also need to make sure your reconcile method is smart enough to detect that nothing changed and return early, which adds complexity to your reconciler.
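To make the cost difference concrete, here is a toy stdlib-only simulation (no controller-runtime, all names hypothetical) of a single CR over a hundred polling intervals where the external resource only drifts twice:

```go
package main

import "fmt"

// simulate runs `ticks` polling intervals for a single CR whose external
// resource drifts only at the ticks marked in `drifted`. It returns how
// many reconciles each strategy triggers: RequeueAfter reconciles on every
// tick, while an external watcher only enqueues when drift is observed.
func simulate(ticks int, drifted map[int]bool) (requeueRuns, eventRuns int) {
	for tick := 1; tick <= ticks; tick++ {
		requeueRuns++ // timer fires: full reconciliation, changed or not
		if drifted[tick] {
			eventRuns++ // drift event fires: reconciliation with real work
		}
	}
	return requeueRuns, eventRuns
}

func main() {
	r, e := simulate(100, map[int]bool{30: true, 70: true})
	fmt.Println("RequeueAfter reconciles:", r) // 100
	fmt.Println("event-driven reconciles:", e) // 2
}
```

Multiply that gap by thousands of CRs and a short interval, and the difference between "check the world" and "reconcile the world" stops being academic.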
RequeueAfter might be fine for a handful of CRs. But when you're managing thousands, each one running the full loop on its own timer, it gets noisy fast, especially if you want to requeue frequently.
First attempt: a custom watcher
In one of my earlier operators, I ran into exactly these issues, so I wanted to get
rid of RequeueAfter for detecting out-of-band changes. I wanted to watch those resources frequently without every check firing a reconcile and keeping the workqueue busy. I wanted to only reconcile
when something changed, not on a timer.
The approach I came up with was a custom watcher, inline in the operator’s own package. Each controller registered a watcher per CR it owned. Each watcher was a goroutine with a ticker that, on each tick, called into a pile of callbacks the controller passed in:
func StartWatcher(ctx context.Context, resource Resource,
	stopChan chan struct{},
	fetchResource FetchResourceFunc,
	updateStatus UpdateStatusFunc,
	checkDelta CheckDeltaFunc,
	handleAutoStart HandleAutoStartFunc,
	handleReconcile HandleReconcileFunc,
	deleteWatcher DeleteWatcherFunc) (ctrl.Result, error) {
	ticker := time.NewTicker(ObserveInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// handleAutoStart, fetchResource, updateStatus,
			// checkDelta, then maybe handleReconcile...
		case <-ctx.Done():
			return ctrl.Result{}, ctx.Err()
		case <-stopChan:
			return ctrl.Result{}, nil
		}
	}
}
This worked. It was also clearly the wrong place for it. A few things kept tripping me up:
- Per-controller wiring. Every new CRD meant a fresh set of callbacks, registration sites, and lifecycle plumbing.
- Tangled concerns. The watcher handled status updates, auto-start, and reconciliation triggering — work that belonged in the controller. The controller, in turn, had to know how to drive the watcher’s lifecycle. Each side was doing work that belonged to the other, and the boundaries were leaky and confusing.
- No reuse. The moment I started building another operator, I had to copy-paste the whole watcher package and redo the drift-handling wiring per controller.
Overall, I thought the watcher pattern was the right approach, but it
still felt like a hack bolted into the controllers rather than a
first-class citizen in the controller-runtime ecosystem.
Moving forward
What I actually wanted was a small, reusable component that did exactly one thing: observe external state and emit a signal when it drifts. Not reconcile. Not update status. Not manage the CR’s lifecycle. Just:
Has the resource I’m watching changed? If yes, fire an event.
The reconciler can listen for that event through a source.Channel
registered via controller-runtime’s WatchesRawSource, same as it would
listen to any other source. Polling becomes cheap (one API call per
interval, no reconciliation), and the controller queue only moves when
there is genuine work. The observation loop and the reconciliation loop
are finally separate.
To address both the limits of RequeueAfter and the shortcomings of my
first attempt, I started
kube-external-watcher:
a small library built on controller-runtime that implements exactly this
shape, plugging into the manager as a manager.Runnable. Polling happens
in a tiny goroutine, drift surfaces as events, and the reconciler only
runs when there’s actually something to do.
How it works
The library provides an ExternalWatcher that implements manager.Runnable, so it lives next to the controllers inside the manager and has access to the cache and client.
The user supplies four small methods (GetDesiredState,
FetchExternalResource, TransformExternalState,
IsResourceReadyToWatch), plus a StateComparator that decides what
“drift” means for their type.
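For a sense of the shape, here is a hypothetical sketch of what those hooks and the comparator could look like. The signatures below are illustrative guesses based on the method names above, not the library's actual API:

```go
package main

import (
	"context"
	"fmt"
	"reflect"
)

// Hypothetical shapes for the four watcher hooks described above; the real
// kube-external-watcher signatures may differ.
type ExternalStateFetcher interface {
	GetDesiredState(ctx context.Context, key string) (any, error)       // desired state, from the CR spec
	FetchExternalResource(ctx context.Context, key string) (any, error) // raw state, from the external API
	TransformExternalState(raw any) (any, error)                        // normalize raw state for comparison
	IsResourceReadyToWatch(key string) bool                             // e.g. wait until Status.InstanceID is set
}

// StateComparator decides what "drift" means for a given type.
type StateComparator interface {
	Drifted(desired, actual any) bool
}

// DeepEqualComparator is the simplest possible comparator:
// any difference at all counts as drift.
type DeepEqualComparator struct{}

func (DeepEqualComparator) Drifted(desired, actual any) bool {
	return !reflect.DeepEqual(desired, actual)
}

func main() {
	var c StateComparator = DeepEqualComparator{}
	fmt.Println(c.Drifted(map[string]int{"size": 2}, map[string]int{"size": 3})) // true
}
```

In practice you would usually want a comparator that ignores server-populated fields (timestamps, computed defaults), since deep equality on raw API responses tends to report drift that isn't yours to fix.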
The controller wires the watcher’s event channel in like any other source:
ew := watcher.NewExternalWatcher(myFetcher,
watcher.WithDefaultPollInterval(30*time.Second),
watcher.WithComparator(watcher.NewDeepEqualComparator()),
watcher.WithAutoRegister(mgr.GetCache(), &myv1.Database{},
func(obj client.Object) watcher.ResourceConfig {
cr := obj.(*myv1.Database)
return watcher.ResourceConfig{ResourceKey: cr.Status.InstanceID}
}),
)
mgr.Add(ew)
ctrl.NewControllerManagedBy(mgr).
For(&myv1.Database{}).
WatchesRawSource(source.Channel(ew.EventChannel(), &handler.EnqueueRequestForObject{})).
Complete(myReconciler)
When a resource is created, the cache informer tells the watcher. A
goroutine spawns, polls on its interval, compares desired against actual.
If drift is detected, a GenericEvent lands in the controller’s work
queue with the CR’s name and namespace. The reconciler wakes up, for real
work this time.
If nothing drifts, nothing happens. That’s the whole point.
Closing
I don’t claim to have invented anything new here. Polling an external
source and emitting events is documented in the
controller-runtime source code.
But for reasons I don’t fully understand, none of the major operators
adopt it. Instead, they all reach for RequeueAfter.
Although RequeueAfter is fine for most Kubernetes work, it’s not a
good fit here. It forces you to run a full reconcile every time you
want to check for drift. If you want to check frequently, or if your
reconciliation is expensive, that turns into a lot of unnecessary work
and noise.
So that’s the piece I keep finding missing in operators: a cheap way to just watch external resources. kube-external-watcher is my attempt at fixing it. It’s small on purpose, and I’d love to hear what you think. Issues and PRs welcome on GitHub.