
Checkpointing Like a Pro: Designing AI Workflows for Pre-emption

Treat interruptions as normal, not exceptional; your reward is 40–70% lower compute spend.


Mina Zhou, ML Infrastructure Lead

Pre-emption terrifies teams that were raised on fixed instances and long-running processes. In micro-compute environments, pre-emption is a design assumption, not a failure mode. The key is systematic checkpointing.

Principles

  • Make progress atomic. Save state at natural task boundaries: per epoch, per N steps, per video chunk.

  • Separate state from compute. Persist to durable storage independent of the worker (object store or replicated volume).

  • Idempotent replays. On restart, detect the latest valid checkpoint and resume without side effects.

  • Checksum and verify. Each checkpoint should include hashes for model weights, optimizer state, and data offsets.

  • Metadata matters. Store runtime config (seed, LR schedule, SHA of code) alongside artifacts to guarantee reproducibility.
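To make these principles concrete, here is a minimal Python sketch of an atomic, verifiable checkpoint: artifacts are staged to a temporary directory, hashed, described by a manifest that carries the run config, and renamed into place in a single step. On resume, `latest_valid_checkpoint` walks checkpoints newest-first and returns the first one whose hashes verify, so a torn write from a pre-emption is simply skipped. The names (`save_checkpoint`, `manifest.json`, the `ckpt-` prefix) are illustrative conventions, not any framework's API, and the sketch assumes a POSIX filesystem where `os.rename` within one volume is atomic.

```python
import hashlib
import json
import os
import tempfile


def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 so large weight files never load into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def save_checkpoint(step: int, artifacts: dict, run_config: dict, root: str) -> str:
    """Stage artifacts plus a manifest in a temp dir, then rename into place.

    `artifacts` maps filenames (e.g. "weights.pt", "optimizer.pt") to their
    serialized bytes; `run_config` carries the seed, LR schedule, and code SHA.
    """
    staging = tempfile.mkdtemp(dir=root)  # readers never see half-written state
    manifest = {"step": step, "config": run_config, "hashes": {}}
    for name, blob in artifacts.items():
        path = os.path.join(staging, name)
        with open(path, "wb") as f:
            f.write(blob)
            f.flush()
            os.fsync(f.fileno())  # durable before we advertise the checkpoint
        manifest["hashes"][name] = sha256_of(path)
    with open(os.path.join(staging, "manifest.json"), "w") as f:
        json.dump(manifest, f)
        f.flush()
        os.fsync(f.fileno())
    final = os.path.join(root, f"ckpt-{step:08d}")  # zero-padded: sorts by step
    os.rename(staging, final)  # atomic on POSIX: checkpoint appears all at once
    return final


def latest_valid_checkpoint(root: str):
    """Idempotent resume: newest checkpoint whose hashes verify, else None."""
    for name in sorted(os.listdir(root), reverse=True):
        manifest_path = os.path.join(root, name, "manifest.json")
        try:
            with open(manifest_path) as f:
                manifest = json.load(f)
            if all(sha256_of(os.path.join(root, name, n)) == h
                   for n, h in manifest["hashes"].items()):
                return os.path.join(root, name)
        except (OSError, json.JSONDecodeError):
            continue  # stray temp dir or corrupt checkpoint: try the previous one
    return None
```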

Reference patterns

  • Training: save {weights, optimizer, dataloader position} every X minutes; validate on resume.

  • Batch inference: chunk the queue; write success markers; retry only missing chunks (see the sketch after this list).

  • RAG pipelines: cache embeddings and retrieval indices; only recompute invalidated partitions.
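As one way to realize the batch-inference pattern, the sketch below chunks a work queue and writes a `.done` marker only after a chunk's output is on disk; on restart, marked chunks are skipped, so a pre-empted worker redoes at most one chunk. `run_chunk` is a caller-supplied placeholder and the marker naming is a made-up convention, not part of any particular tool.

```python
import os


def process_queue(items, chunk_size: int, out_dir: str, run_chunk) -> None:
    """Chunk the queue; a pre-empted worker redoes at most one chunk on restart."""
    os.makedirs(out_dir, exist_ok=True)
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    for idx, chunk in enumerate(chunks):
        marker = os.path.join(out_dir, f"chunk-{idx:06d}.done")
        if os.path.exists(marker):
            continue  # finished before the interruption: skip it entirely
        # run_chunk(chunk, dest) is caller-supplied and must write its results
        # to dest before returning; only then do we commit the success marker.
        run_chunk(chunk, os.path.join(out_dir, f"chunk-{idx:06d}.out"))
        with open(marker, "w") as f:  # marker written last => marker implies done
            f.write("ok\n")
            f.flush()
            os.fsync(f.fileno())
```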

Observability tips

  • Expose checkpoint latency and wasted work metrics (redo minutes after pre-emption).

  • Alert when redo > target (e.g., >5% of wall time); a minimal sketch follows this list.

  • Keep receipts—verifiable usage proofs make finance comfortable with dynamic pricing.
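A small sketch of the redo-minutes metric: the `PreemptionLedger` class below is hypothetical, and you would export `over_budget()` through whatever metrics and alerting stack you already run. The 5% default mirrors the alert threshold above.

```python
import time


class PreemptionLedger:
    """Turn 'wasted work' into a number: redo seconds as a share of wall time."""

    def __init__(self, redo_budget: float = 0.05):  # 5% mirrors the alert above
        self.start = time.monotonic()
        self.redo_seconds = 0.0
        self.redo_budget = redo_budget

    def record_resume(self, last_checkpoint_step: int, interrupted_step: int,
                      seconds_per_step: float) -> None:
        """Call after a restart: steps past the last checkpoint are lost work."""
        lost_steps = max(0, interrupted_step - last_checkpoint_step)
        self.redo_seconds += lost_steps * seconds_per_step

    def over_budget(self) -> bool:
        """Feed this into alerting: True means redo exceeds the target share."""
        wall = time.monotonic() - self.start
        return wall > 0 and self.redo_seconds / wall > self.redo_budget
```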

Once checkpointing is boring, pre-emption stops being scary. You’ll take the cheaper capacity, accept occasional interruptions, and still finish sooner and cheaper.