AI Strategy / 2025

Architecting for Fractional GPUs: MIG, MPS, and Software Slicing

How to map real workloads to partial GPUs without tanking performance.

Dev Malik, GPU Systems Engineer

“Fractional GPU” isn’t magic; it’s scheduling plus isolation. Three paths dominate:

  1. Hardware partitioning (e.g., MIG)
    Carves an accelerator into isolated instances with guaranteed memory/SM resources. Best for isolation and predictable performance.

  2. Multi-Process Service (MPS)
    Shares an accelerator among processes with low context-switch overhead. Great for inference where kernels are small and frequent.

  3. Software-level slicing
    Runtime multiplexing with fair-share policies; useful on commodity GPUs or mixed vendors.
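To make the third path concrete, here's a toy weighted fair-share scheduler of the sort a software slicer runs under the hood. The `Tenant` class, weights, and time-quantum loop are illustrative assumptions for this sketch, not any vendor's API:

```python
from dataclasses import dataclass

@dataclass
class Tenant:
    name: str
    weight: int          # fair-share weight (larger = bigger slice)
    served_ms: int = 0   # accumulated GPU time

def pick_next(tenants):
    """Weighted fair share: run the tenant furthest behind its
    entitlement (served time normalized by weight)."""
    return min(tenants, key=lambda t: t.served_ms / t.weight)

def simulate(tenants, quantum_ms=10, steps=100):
    """Grant one time quantum per step to whichever tenant is
    furthest behind; return total time served per tenant."""
    for _ in range(steps):
        pick_next(tenants).served_ms += quantum_ms
    return {t.name: t.served_ms for t in tenants}

# 3:1 weights over 1000 ms of simulated device time.
tenants = [Tenant("train", weight=3), Tenant("infer", weight=1)]
print(simulate(tenants))  # → {'train': 750, 'infer': 250}
```

The same loop generalizes: swap the time quantum for a kernel-launch token and you have the skeleton of a fair-share kernel multiplexer.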

Mapping patterns

  • Training (large batch): prefer MIG-style partitions sized to your memory footprint; scale horizontally.

  • Training (small batch / many trials): MPS or small MIGs; running many trials in parallel keeps device occupancy high.

  • Batch inference: MPS excels; allocate memory pools per model and target high request concurrency.

  • Realtime inference: give the hot model a bigger slice (MIG or pinned share) and throttle others.
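These mapping rules are mechanical enough to encode as a dispatch table. The helper below is hypothetical (function name, parameters, and return strings are mine, not a real library's), just the four bullets above as code:

```python
def pick_sharing_mode(workload: str, batch: str = "large",
                      latency_sensitive: bool = False) -> str:
    """Map a workload profile to a GPU-sharing strategy,
    following the four mapping patterns above."""
    if workload == "training":
        if batch == "large":
            return "MIG partition sized to memory footprint"
        return "MPS or small MIG instances (many parallel trials)"
    if workload == "inference":
        if latency_sensitive:
            return "larger MIG slice or pinned share for the hot model"
        return "MPS with per-model memory pools"
    raise ValueError(f"unknown workload: {workload}")

print(pick_sharing_mode("training", batch="small"))
print(pick_sharing_mode("inference", latency_sensitive=True))
```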

Practical guidance

  • Pin memory and warm kernels per slice; cold starts are the silent killer.

  • Use per-slice telemetry (utilization, achieved occupancy, H2D/D2H bandwidth).

  • Throughput > latency for most batch jobs; reserve bigger slices only when p95 demands it.

  • Treat fragmentation as a first-class metric. A good scheduler defragments by migrating short tasks first.
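One way to put a number on fragmentation, as a sketch: measure how much free capacity is stranded outside the largest contiguous free block. This is an illustrative metric I'm defining here, not a standard one:

```python
def fragmentation(free_slices_gb):
    """Fraction of free memory NOT in the largest contiguous free
    slice. 0.0 means one clean block; values near 1.0 mean free
    capacity is scattered in slivers too small to place a job."""
    total = sum(free_slices_gb)
    if total == 0:
        return 0.0
    return 1.0 - max(free_slices_gb) / total

# 30 GB free either way, but the fragmented card can't place a 20 GB job.
print(fragmentation([10, 10, 10]))  # ≈ 0.67
print(fragmentation([30]))          # 0.0
```

Track this per card over time; a rising value is the signal to start migrating short tasks and re-coalescing slices.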

Do this well and you’ll run many jobs on the same card without noisy neighbors ruining the day—and your finance partner will notice the difference.