NVIDIA Dynamo Snapshot Delivers 21x Faster Cold-Start for GPU Inference on Kubernetes
Key Takeaways
- NVIDIA Dynamo Snapshot enables near-instant checkpoint/restore of single-GPU inference workloads, targeting production cold starts that typically take several minutes
- The system combines CRIU (host state) and cuda-checkpoint (GPU device state) with optimizations such as KV cache unmapping and a GPU Memory Service for efficient restoration
- It achieves up to a 21x reduction in startup time on large models, improving elastic scaling for inference on Kubernetes and reducing the cost of idle GPU time
Summary
NVIDIA has introduced Dynamo Snapshot, a checkpoint/restore system designed to drastically reduce cold-start latency for AI inference workloads on Kubernetes. The system addresses a critical production problem: when demand fluctuates, scaling new inference replicas can take several minutes, during which GPUs sit idle and unavailable, increasing the risk of SLA violations during traffic spikes.
Dynamo Snapshot leverages two core technologies: CRIU (Checkpoint/Restore in Userspace) to serialize host-side process state, and CUDA checkpointing to capture GPU device state. The system employs optimization techniques including KV cache unmapping, parallel memfd restore, and a GPU Memory Service (GMS) to decouple large model weights from process state for concurrent restoration.
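The two-layer flow described above can be sketched with the standalone tools. This is a minimal illustration, assuming the manual sequence documented for NVIDIA's cuda-checkpoint utility used together with CRIU; it is not Dynamo Snapshot's own code or API. The helper names, the PID, and the image directory are hypothetical, and the sketch only builds the command lines rather than running them.

```python
from typing import List


def checkpoint_commands(pid: int, image_dir: str) -> List[List[str]]:
    """Build the commands to checkpoint one CUDA process (hypothetical helper).

    Order matters: GPU state is moved into host memory first, so that CRIU
    then sees an ordinary process with no live device state.
    """
    return [
        # Suspend CUDA in the process and copy device state into host memory.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Serialize the process tree (host state) to image files on disk.
        ["criu", "dump", "--tree", str(pid), "--images-dir", image_dir, "--shell-job"],
    ]


def restore_commands(pid: int, image_dir: str) -> List[List[str]]:
    """Build the commands to restore the process (hypothetical helper).

    The order is reversed: host state first, then GPU state. CRIU restores
    the process under its original PID by default, so the same PID is reused.
    """
    return [
        # Recreate the process from the image files and detach from it.
        ["criu", "restore", "--images-dir", image_dir, "--shell-job", "--restore-detached"],
        # Copy the saved device state back to the GPU and resume CUDA.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


if __name__ == "__main__":
    # Assumed values for illustration only.
    for cmd in checkpoint_commands(1234, "/tmp/ckpt") + restore_commands(1234, "/tmp/ckpt"):
        print(" ".join(cmd))
```

The optimizations named in the article (KV cache unmapping, parallel memfd restore, GMS) sit on top of this basic sequence and are not modeled here.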
Reported results show up to a 21x reduction in startup time on large models such as gpt-oss-120b, with restores that are close to instant. The current prototype supports single-GPU workloads on Kubernetes; multi-GPU and multi-node support and TensorRT-LLM integration are planned.
The design targets SLA compliance during traffic spikes and more cost-efficient inference deployment at scale.
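To make the 21x figure concrete, the arithmetic can be sketched as below. The cold-start time and replica count are hypothetical values chosen for illustration, not measurements from NVIDIA; only the 21x factor comes from the article, and it is an upper bound ("up to"), so this is a best case.

```python
def restored_startup_seconds(cold_start_s: float, speedup: float = 21.0) -> float:
    """Startup time after applying a given speedup factor."""
    return cold_start_s / speedup


def idle_gpu_seconds_saved(cold_start_s: float, replicas: int, speedup: float = 21.0) -> float:
    """Total GPU idle time avoided when scaling up `replicas` single-GPU replicas."""
    return replicas * (cold_start_s - restored_startup_seconds(cold_start_s, speedup))


if __name__ == "__main__":
    cold_start_s = 210.0  # hypothetical cold start of 3.5 minutes
    replicas = 4          # hypothetical scale-up event

    print(restored_startup_seconds(cold_start_s))          # 10.0
    print(idle_gpu_seconds_saved(cold_start_s, replicas))  # 800.0
```

Under these assumed numbers, each replica becomes available in about 10 seconds instead of 210, and a four-replica scale-up avoids roughly 800 GPU-seconds of idle time.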
Editorial Opinion
This is a thoughtful engineering solution to a genuine production pain point. The 21x startup improvement is impressive, but the real value lies in cost-efficient elastic scaling for inference workloads, because it cuts the expensive GPU idle time incurred during scale-up. If the prototype extends beyond single-GPU workloads, it could become a critical optimization tool for organizations deploying large language models on Kubernetes.



