Checkpointing Like a Pro: Designing AI Workflows for Pre-emption
Treat interruptions as normal, not exceptional—your reward is 40–70% lower compute spend.

Mina Zhou, ML Infrastructure Lead

Pre-emption terrifies teams that were raised on fixed instances and long-running processes. In micro-compute environments, pre-emption is a design assumption, not a failure mode. The key is systematic checkpointing.
Principles
Make progress atomic. Save state at natural task boundaries: per epoch, per N steps, per video chunk.
Separate state from compute. Persist to durable storage independent of the worker (object store or replicated volume).
Idempotent replays. On restart, detect the latest valid checkpoint and resume without side effects.
Checksum and verify. Each checkpoint should include hashes for model weights, optimizer state, and data offsets.
Metadata matters. Store runtime config (seed, LR schedule, SHA of the code) alongside artifacts to guarantee reproducibility. The sketch after this list ties these principles together.
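Here is a minimal sketch of those principles, assuming PyTorch and a POSIX filesystem; the names (save_checkpoint, load_latest_checkpoint, the ckpt_* layout) are illustrative, not a specific library's API. Point ckpt_dir at durable storage, not the worker's local disk.

```python
# Minimal checkpointing sketch (PyTorch assumed; all names illustrative).
import hashlib
import json
import os
import tempfile

import torch

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def save_checkpoint(step, model, optimizer, data_offset, ckpt_dir, run_config):
    # ckpt_dir should live on durable storage (object store mount or
    # replicated volume), independent of the worker.
    os.makedirs(ckpt_dir, exist_ok=True)
    state = {
        "step": step,
        "weights": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "data_offset": data_offset,  # dataloader position
    }
    # Atomicity: write to a temp file, then rename. os.replace is atomic,
    # so a pre-emption mid-write can never publish a half-written file.
    fd, tmp = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    os.close(fd)
    torch.save(state, tmp)
    final = os.path.join(ckpt_dir, f"ckpt_{step:08d}.pt")
    os.replace(tmp, final)
    # Metadata sidecar: checksum plus the runtime config (seed, LR schedule,
    # code SHA) needed to reproduce the run.
    meta = {"checksum": sha256_of(final), "config": run_config}
    with open(final + ".json", "w") as f:
        json.dump(meta, f)
    return final

def load_latest_checkpoint(ckpt_dir):
    # Idempotent replay: scan for the newest checkpoint whose checksum
    # verifies, and silently skip anything partial or corrupt.
    names = sorted(f for f in os.listdir(ckpt_dir)
                   if f.startswith("ckpt_") and f.endswith(".pt"))
    for name in reversed(names):
        path = os.path.join(ckpt_dir, name)
        try:
            with open(path + ".json") as f:
                meta = json.load(f)
            if sha256_of(path) == meta["checksum"]:
                return torch.load(path, map_location="cpu"), meta
        except (OSError, json.JSONDecodeError, KeyError):
            continue
    return None, None
```

On restart, a worker calls load_latest_checkpoint and either resumes from the verified state or starts fresh; either way the replay has no side effects.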
Reference patterns
Training: save {weights, optimizer, dataloader position} every X minutes; validate on resume.
Batch inference: chunk the queue; write success markers; retry only missing chunks (see the first sketch after this list).
RAG pipelines: cache embeddings and retrieval indices; only recompute invalidated partitions (see the second sketch).
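For batch inference, the success-marker loop can be as simple as the sketch below; process_chunk and the chunk_* file layout are assumptions for illustration.

```python
# Success-marker sketch for batch inference (names are illustrative).
import os

def run_queue(chunks, out_dir, process_chunk):
    os.makedirs(out_dir, exist_ok=True)
    for chunk_id, chunk in enumerate(chunks):
        marker = os.path.join(out_dir, f"chunk_{chunk_id:06d}.done")
        if os.path.exists(marker):
            continue  # finished before a previous pre-emption: skip, don't redo
        result = os.path.join(out_dir, f"chunk_{chunk_id:06d}.out")
        tmp = result + ".tmp"
        process_chunk(chunk, tmp)  # write results to a temp file first
        os.replace(tmp, result)    # atomic publish of the result
        # Marker written last, so its existence implies a complete result.
        open(marker, "w").close()
```

A restarted worker simply re-runs the same loop; the markers make the replay idempotent, so only missing chunks are retried.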
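For RAG pipelines, keying the cache on a content hash means edits invalidate exactly the entries that changed; embed_fn and the on-disk layout here are illustrative assumptions.

```python
# Content-hash embedding cache sketch (embed_fn and layout are assumptions).
import hashlib
import os
import pickle

def cached_embedding(text, cache_dir, embed_fn):
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)  # unchanged content: cache hit, no recompute
    vec = embed_fn(text)           # new or edited content: recompute once
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(vec, f)
    os.replace(tmp, path)          # same atomic-write trick as checkpoints
    return vec
```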
Observability tips
Expose checkpoint latency and wasted-work metrics (redo minutes after pre-emption); a minimal tracker is sketched after this list.
Alert when redo > target (e.g., >5% of wall time).
Keep receipts—verifiable usage proofs make finance comfortable with dynamic pricing.
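As a rough illustration of the redo metric, the tracker below accumulates wasted seconds and flags when they exceed the target fraction. The class and its names are made up for this sketch; a real system would persist the counters across restarts and export them to its metrics stack.

```python
# Wasted-work tracker sketch; names and the 5% threshold are illustrative.
import time
from dataclasses import dataclass, field

@dataclass
class WastedWorkTracker:
    redo_seconds: float = 0.0
    started: float = field(default_factory=time.monotonic)

    def on_resume(self, last_checkpoint_time, preempt_time):
        # Work done between the last durable checkpoint and the pre-emption
        # is lost and will be redone.
        self.redo_seconds += max(0.0, preempt_time - last_checkpoint_time)

    def redo_fraction(self):
        wall = time.monotonic() - self.started
        return self.redo_seconds / wall if wall > 0 else 0.0

    def should_alert(self, target=0.05):
        # Alert when redo work exceeds the target share of wall time.
        return self.redo_fraction() > target
```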
Once checkpointing is boring, pre-emption stops being scary. You’ll take the cheaper capacity, accept occasional interruptions, and still finish sooner and cheaper.


