When EvictionRequest meets Checkpointing: An idea for live pod migration on Kubernetes
For a couple of years I’ve been running stateful workloads on Kubernetes here and there, from caches to databases to message brokers. Although I’m not the most knowledgeable person about those stateful systems, I have hit the limitations and quirks of Kubernetes when it comes to running them (especially if you’re not writing your own controller/operator to manage their lifecycle). Because of that, I had to learn about the actors that can disrupt my stateful workloads, and how to mitigate those disruptions.
At last year’s KubeCon EU I attended a great talk, “The Future of Kubernetes Node Lifecycle”. One thing that stuck with me was the EvictionRequest API. The idea is simple: instead of the kubelet just removing a pod, controllers can register as interceptors of an eviction request and run pre-eviction logic on the pod before it actually goes away.
If you want the details, the KEP is the best place to start: KEP-4563: EvictionRequest API.
The moment I heard about the EvictionRequest API, container checkpoint/restore clicked into place. The idea is also simple: you can freeze a running container, dump its full state (memory, processes, open files, sockets) to disk as a checkpoint, and later restore it from that checkpoint either on the same node or somewhere else. CRIU has been doing this on Linux for a while, and Kubernetes is slowly bringing it in through the kubelet’s checkpoint API. Today it’s mostly aimed at forensics (snapshot a misbehaving container so you can poke at it later), but the same primitives are what could let you move a running pod between nodes without restarting it.
Checkpoint & Restore landed in Kubernetes 1.30 as a beta feature, enabled by default. The EvictionRequest API is planned for 1.37, so the two are still a ways off from meeting. But I keep thinking about what happens when they do.
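To make the checkpointing half concrete, here’s a rough sketch of what triggering a checkpoint looks like today: the kubelet exposes a POST /checkpoint/{namespace}/{pod}/{container} endpoint on its secure port. The node address, namespace, pod, and container names below are placeholders, and a real caller would authenticate with a client certificate rather than skipping TLS verification.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// The kubelet's checkpoint API:
	//   POST /checkpoint/{namespace}/{pod}/{container}
	// served on the kubelet's secure port (10250). All values are placeholders.
	nodeAddr := "10.0.0.12" // hypothetical address of the node running the pod
	url := fmt.Sprintf("https://%s:10250/checkpoint/default/my-stateful-pod/app", nodeAddr)

	// A real cluster needs credentials the kubelet trusts; TLS verification
	// is skipped here only to keep the sketch short.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	resp, err := client.Post(url, "application/json", nil)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// On success the kubelet replies with the path of the checkpoint archive, e.g.
	// /var/lib/kubelet/checkpoints/checkpoint-<pod>_<namespace>-<container>-<timestamp>.tar
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```

The response points at a checkpoint archive on the node’s local disk, which is exactly the artifact a migration flow would have to move somewhere the target node can reach.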
EvictionRequest API + Checkpoint & Restore = pod live migration
Here’s the flow I’m excited about. A controller creates an EvictionRequest for a stateful pod. Instead of the kubelet just terminating it, an interceptor (a controller registered for that pod) picks the request up and orchestrates a graceful migration by placing the pod in a checkpoint-restore workflow:
┌───────────────────────────┐
│ EvictionRequest │
└─────────────┬─────────────┘
│ watch
▼
┌───────────────────────────┐
│ Migration Interceptor │
│ (pod migration ctrller) │
└──┬─────────────────────┬──┘
(1) │ │ (4)
checkpoint │ │ ack eviction
▼ ▼
┌───────────────┐ ┌───────────────┐
│ Node A │ │ Node B │
│ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Pod: live │ │ │ │Pod:restore│ │
│ └─────┬─────┘ │ │ └─────▲─────┘ │
└───────┼───────┘ └───────┼───────┘
│ (2) snapshot │ (3) restore
▼ │
┌────────────────────────────┐
│ Checkpoint artifact │
│ (Shared volume) │
└────────────────────────────┘
Step by step:
- Trigger checkpoint. The interceptor calls the kubelet’s checkpoint API on Node A.
- Snapshot. Node A produces a frozen copy of the container’s memory and process state and pushes the checkpoint archive somewhere the target node can pull it from, such as a shared volume.
- Restore. A new pod is scheduled on Node B with a hint to restore from the checkpoint instead of booting from scratch (see the sketch after this list).
- Ack. Once the restored pod is healthy on Node B, the interceptor acknowledges the EvictionRequest, and the original pod on Node A is terminated.
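Nothing in today’s API expresses “restore this pod from that checkpoint” directly, and the EvictionRequest types don’t exist yet, so the restore step can only be sketched under assumptions: say the checkpoint archive has been wrapped in an OCI image that the container runtime knows how to restore from (CRI-O supports this via an image annotation), the interceptor could then create the replacement pod on Node B with client-go. The names, registry, and node below are all placeholders.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// Hypothetical "restore" pod: the image wraps the checkpoint archive
	// (built and annotated out of band so the runtime restores instead of
	// starting fresh), and the pod is pinned to the target node.
	restorePod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "my-stateful-pod-restored",
			Namespace: "default",
		},
		Spec: corev1.PodSpec{
			NodeName: "node-b", // placeholder target node
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "registry.example.com/checkpoints/my-stateful-pod:latest", // checkpoint image
			}},
		},
	}

	if _, err := clientset.CoreV1().Pods("default").Create(context.TODO(), restorePod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```

Once this pod reports Ready, the interceptor would acknowledge the EvictionRequest (step 4) and let the original pod on Node A be terminated.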
The payoff: stateful pods move between nodes without losing in-memory state, in-flight connections, or warm caches. No more hand-rolled drain logic, no more cold-start tax every time the cluster decides to rebalance.
Neither piece is fully there yet. Container checkpointing is still mostly forensic, and the EvictionRequest API is still early in its design. But once they meet, this is what falls out, and that’s why I keep an eye on both.
PS: I’ve mainly considered this from the stateful workload perspective, but from what I can see in WG discussions, it’s something that AI/ML workloads could also benefit from, since they often have long-running training jobs that would be disrupted by node maintenance or scaling events.