AI Strategy / 2025

Architecting for Fractional GPUs: MIG, MPS, and Software Slicing

How to map real workloads to partial GPUs without tanking performance.

Dev Malik, GPU Systems Engineer

“Fractional GPU” isn’t magic; it’s scheduling plus isolation. Three paths dominate:

  1. Hardware partitioning (e.g., MIG)
    Carves an accelerator into isolated instances with guaranteed memory/SM resources. Best for isolation and predictable performance.

  2. Multi-Process Service (MPS)
    Shares an accelerator among processes with low context-switch overhead. Great for inference where kernels are small and frequent.

  3. Software-level slicing
    Runtime multiplexing with fair-share policies; useful on commodity GPUs or mixed vendors.
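To make the third path concrete, here's a toy weighted fair-share scheduler of the sort a software slicer runs under the hood. The `Tenant` class, weights, and time-quantum loop are illustrative assumptions for this sketch, not any vendor's API:

```python
from dataclasses import dataclass

@dataclass
class Tenant:
    name: str
    weight: int          # fair-share weight (larger = bigger slice)
    served_ms: int = 0   # accumulated GPU time

def pick_next(tenants):
    """Weighted fair share: run the tenant furthest behind its
    entitlement (served time normalized by weight)."""
    return min(tenants, key=lambda t: t.served_ms / t.weight)

def simulate(tenants, quantum_ms=10, steps=100):
    """Grant one time quantum per step to whichever tenant is
    furthest behind; return total time served per tenant."""
    for _ in range(steps):
        pick_next(tenants).served_ms += quantum_ms
    return {t.name: t.served_ms for t in tenants}

# 3:1 weights over 1000 ms of simulated device time.
tenants = [Tenant("train", weight=3), Tenant("infer", weight=1)]
print(simulate(tenants))  # → {'train': 750, 'infer': 250}
```

The same loop generalizes: swap the time quantum for a kernel-launch token and you have the skeleton of a fair-share kernel multiplexer.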

Mapping patterns

  • Training (large batch): prefer MIG-style partitions sized to your memory footprint; scale horizontally.

  • Training (small batch / many trials): MPS or small MIGs; running many trials in parallel keeps device occupancy high.

  • Batch inference: MPS excels; allocate memory pools per model and target high request concurrency.

  • Realtime inference: give the hot model a bigger slice (MIG or pinned share) and throttle others.
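These mapping rules are mechanical enough to encode as a dispatch table. The helper below is hypothetical (function name, parameters, and return strings are mine, not a real library's), just the four bullets above as code:

```python
def pick_sharing_mode(workload: str, batch: str = "large",
                      latency_sensitive: bool = False) -> str:
    """Map a workload profile to a GPU-sharing strategy,
    following the four mapping patterns above."""
    if workload == "training":
        if batch == "large":
            return "MIG partition sized to memory footprint"
        return "MPS or small MIG instances (many parallel trials)"
    if workload == "inference":
        if latency_sensitive:
            return "larger MIG slice or pinned share for the hot model"
        return "MPS with per-model memory pools"
    raise ValueError(f"unknown workload: {workload}")

print(pick_sharing_mode("training", batch="small"))
print(pick_sharing_mode("inference", latency_sensitive=True))
```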

Practical guidance

  • Pin memory and warm kernels per slice; cold starts are the silent killer.

  • Use per-slice telemetry (utilization, achieved occupancy, H2D/D2H bandwidth).

  • Throughput > latency for most batch jobs; reserve bigger slices only when p95 demands it.

  • Treat fragmentation as a first-class metric. A good scheduler defragments by migrating short tasks first.
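One way to put a number on fragmentation, as a sketch: measure how much free capacity is stranded outside the largest contiguous free block. This is an illustrative metric I'm defining here, not a standard one:

```python
def fragmentation(free_slices_gb):
    """Fraction of free memory NOT in the largest contiguous free
    slice. 0.0 means one clean block; values near 1.0 mean free
    capacity is scattered in slivers too small to place a job."""
    total = sum(free_slices_gb)
    if total == 0:
        return 0.0
    return 1.0 - max(free_slices_gb) / total

# 30 GB free either way, but the fragmented card can't place a 20 GB job.
print(fragmentation([10, 10, 10]))  # ≈ 0.67
print(fragmentation([30]))          # 0.0
```

Track this per card over time; a rising value is the signal to start migrating short tasks and re-coalescing slices.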

Do this well and you’ll run many jobs on the same card without noisy neighbors ruining the day—and your finance partner will notice the difference.