
One GPU, Many Models: What Works and What Segfaults

UD2.120 (Chavanne) | Day 1 | 13:55 - 14:15 | Speakers: YASH PANCHAL



Abstract

Serving multiple models on a single GPU sounds great until something segfaults.

Two approaches dominate for parallel inference: MIG (hardware partitioning) and MPS (software sharing). Both promise efficient GPU sharing.
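As a rough illustration of how the two approaches differ from a worker process's point of view, the sketch below builds the environment for a process pinned to one MIG slice and for a process sharing the GPU through MPS. It is not taken from the talk. The helper names and `generate.py` are hypothetical, and the MIG UUID is a placeholder for a value reported by `nvidia-smi -L`. It also assumes MIG instances have already been created and, for the MPS case, that the MPS control daemon is already running.

```python
import os
import subprocess


def mig_worker_env(mig_uuid, base_env=None):
    """Return an environment that pins a process to one MIG instance.

    CUDA accepts a MIG device UUID in CUDA_VISIBLE_DEVICES, so the process
    only sees that hardware partition.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = mig_uuid
    return env


def mps_worker_env(thread_percentage, base_env=None):
    """Return an environment for an MPS client with a compute cap.

    CUDA_MPS_ACTIVE_THREAD_PERCENTAGE limits the share of SMs the client
    may use. It is a compute limit, not a memory isolation boundary.
    """
    if not 0 < thread_percentage <= 100:
        raise ValueError("thread_percentage must be in (0, 100]")
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(thread_percentage)
    return env


def launch(command, env):
    """Start one worker with the given environment (not called on import)."""
    return subprocess.Popen(command, env=env)


# Hypothetical usage: two workers on separate MIG slices.
# launch(["python", "generate.py"], mig_worker_env("MIG-<uuid-from-nvidia-smi-L>"))
# launch(["python", "generate.py"], mig_worker_env("MIG-<another-uuid>"))
```

The MIG variant relies on the hardware partition for isolation, while the MPS variant only caps compute, so the workers still share one memory space at the hardware level.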

I tested both strategies by running video generation workloads in parallel.

This talk digs into what actually happened: where things worked, where memory isolation fell apart, which configs crashed, and what survives under load.

By the end, you'll know:

  1. How to use idle GPU capacity.
  2. How to set up MIG and MPS.
  3. Which memory issues, crashes, and failures to expect.
  4. Which configs suit which workloads.


