
Checkpointing Like a Pro: Designing AI Workflows for Pre-emption

Treat interruptions as normal, not exceptional; your reward is 40–70% lower compute spend.


Mina Zhou, ML Infrastructure Lead

Pre-emption terrifies teams that were raised on fixed instances and long-running processes. In micro-compute environments, pre-emption is a design assumption, not a failure mode. The key is systematic checkpointing.

Principles

  • Make progress atomic. Save state at natural task boundaries: per epoch, per N steps, per video chunk.

  • Separate state from compute. Persist to durable storage independent of the worker (object store or replicated volume).

  • Idempotent replays. On restart, detect the latest valid checkpoint and resume without side effects.

  • Checksum and verify. Each checkpoint should include hashes for model weights, optimizer state, and data offsets.

  • Metadata matters. Store runtime config (seed, LR schedule, SHA of code) alongside artifacts to guarantee reproducibility.
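To make these principles concrete, here is a minimal Python sketch of an atomic, verifiable checkpoint: artifacts are staged to a temporary directory, hashed, described by a manifest that carries the run config, and renamed into place in a single step. On resume, `latest_valid_checkpoint` walks checkpoints newest-first and returns the first one whose hashes verify, so a torn write from a pre-emption is simply skipped. The names (`save_checkpoint`, `manifest.json`, the `ckpt-` prefix) are illustrative conventions, not any framework's API, and the sketch assumes a POSIX filesystem where `os.rename` within one volume is atomic.

```python
import hashlib
import json
import os
import tempfile


def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 so large weight files never load into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def save_checkpoint(step: int, artifacts: dict, run_config: dict, root: str) -> str:
    """Stage artifacts plus a manifest in a temp dir, then rename into place.

    `artifacts` maps filenames (e.g. "weights.pt", "optimizer.pt") to their
    serialized bytes; `run_config` carries the seed, LR schedule, and code SHA.
    """
    staging = tempfile.mkdtemp(dir=root)  # readers never see half-written state
    manifest = {"step": step, "config": run_config, "hashes": {}}
    for name, blob in artifacts.items():
        path = os.path.join(staging, name)
        with open(path, "wb") as f:
            f.write(blob)
            f.flush()
            os.fsync(f.fileno())  # durable before we advertise the checkpoint
        manifest["hashes"][name] = sha256_of(path)
    with open(os.path.join(staging, "manifest.json"), "w") as f:
        json.dump(manifest, f)
        f.flush()
        os.fsync(f.fileno())
    final = os.path.join(root, f"ckpt-{step:08d}")  # zero-padded: sorts by step
    os.rename(staging, final)  # atomic on POSIX: checkpoint appears all at once
    return final


def latest_valid_checkpoint(root: str):
    """Idempotent resume: newest checkpoint whose hashes verify, else None."""
    for name in sorted(os.listdir(root), reverse=True):
        manifest_path = os.path.join(root, name, "manifest.json")
        try:
            with open(manifest_path) as f:
                manifest = json.load(f)
            if all(sha256_of(os.path.join(root, name, n)) == h
                   for n, h in manifest["hashes"].items()):
                return os.path.join(root, name)
        except (OSError, json.JSONDecodeError):
            continue  # stray temp dir or corrupt checkpoint: try the previous one
    return None
```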

Reference patterns

  • Training: save {weights, optimizer, dataloader position} every X minutes; validate on resume.

  • Batch inference: chunk the queue; write success markers; retry only missing chunks (see the sketch after this list).

  • RAG pipelines: cache embeddings and retrieval indices; only recompute invalidated partitions.
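As one way to realize the batch-inference pattern, the sketch below chunks a work queue and writes a `.done` marker only after a chunk's output is on disk; on restart, marked chunks are skipped, so a pre-empted worker redoes at most one chunk. `run_chunk` is a caller-supplied placeholder and the marker naming is a made-up convention, not part of any particular tool.

```python
import os


def process_queue(items, chunk_size: int, out_dir: str, run_chunk) -> None:
    """Chunk the queue; a pre-empted worker redoes at most one chunk on restart."""
    os.makedirs(out_dir, exist_ok=True)
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    for idx, chunk in enumerate(chunks):
        marker = os.path.join(out_dir, f"chunk-{idx:06d}.done")
        if os.path.exists(marker):
            continue  # finished before the interruption: skip it entirely
        # run_chunk(chunk, dest) is caller-supplied and must write its results
        # to dest before returning; only then do we commit the success marker.
        run_chunk(chunk, os.path.join(out_dir, f"chunk-{idx:06d}.out"))
        with open(marker, "w") as f:  # marker written last => marker implies done
            f.write("ok\n")
            f.flush()
            os.fsync(f.fileno())
```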

Observability tips

  • Expose checkpoint latency and wasted work metrics (redo minutes after pre-emption).

  • Alert when redo > target (e.g., >5% of wall time); a minimal sketch follows this list.

  • Keep receipts—verifiable usage proofs make finance comfortable with dynamic pricing.
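A small sketch of the redo-minutes metric: the `PreemptionLedger` class below is hypothetical, and you would export `over_budget()` through whatever metrics and alerting stack you already run. The 5% default mirrors the alert threshold above.

```python
import time


class PreemptionLedger:
    """Turn 'wasted work' into a number: redo seconds as a share of wall time."""

    def __init__(self, redo_budget: float = 0.05):  # 5% mirrors the alert above
        self.start = time.monotonic()
        self.redo_seconds = 0.0
        self.redo_budget = redo_budget

    def record_resume(self, last_checkpoint_step: int, interrupted_step: int,
                      seconds_per_step: float) -> None:
        """Call after a restart: steps past the last checkpoint are lost work."""
        lost_steps = max(0, interrupted_step - last_checkpoint_step)
        self.redo_seconds += lost_steps * seconds_per_step

    def over_budget(self) -> bool:
        """Feed this into alerting: True means redo exceeds the target share."""
        wall = time.monotonic() - self.start
        return wall > 0 and self.redo_seconds / wall > self.redo_budget
```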

Once checkpointing is boring, pre-emption stops being scary. You’ll take the cheaper capacity, accept occasional interruptions, and still finish sooner and cheaper.