Supercharging LLM serving with Dynamo

UD2.120 (Chavanne) | Day 1 | 15:40 - 16:00 | Speakers: Piotr Tarasiewicz

Abstract

The explosive growth of Large Language Models (LLMs) demands highly efficient and scalable inference systems. This talk presents the key innovations that NVIDIA Dynamo (https://github.com/ai-dynamo/dynamo) adds to enable system-level optimizations while building on the performance of inference engines such as vLLM, SGLang, and TRT-LLM:

  • Smart Scheduling that routes requests based on the KV cache hit rate and load, intelligently autoscales, and disaggregates the prefill and decode phases.
  • Hierarchical Memory Management that utilizes HBM, host memory, local disk, and remote storage.
  • Low-Latency Transfer of the KV cache across nodes and the memory hierarchy.
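
To make the first point concrete, the following is a minimal sketch of KV-cache-aware routing: each request goes to the worker with the best trade-off between expected prefix cache hit rate and current load. It is an illustration only, not Dynamo's implementation or API. The names Worker, block_hashes, hit_rate, and pick_worker, the block size, and the load_weight parameter are all assumptions made for this example.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV cache block (illustrative value, an assumption)


@dataclass
class Worker:
    """Hypothetical worker record; not a Dynamo type."""
    name: str
    active_requests: int = 0
    max_requests: int = 8
    cached_blocks: set = field(default_factory=set)


def block_hashes(tokens):
    """Hash every full block chained with its predecessor, so that an equal
    hash implies an equal prefix up to and including that block."""
    hashes = []
    prev = 0
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        prev = hash((prev, tuple(tokens[i:i + BLOCK_SIZE])))
        hashes.append(prev)
    return hashes


def hit_rate(worker, hashes):
    """Fraction of the prompt's blocks already cached as a contiguous prefix."""
    if not hashes:
        return 0.0
    hits = 0
    for h in hashes:
        if h not in worker.cached_blocks:
            break
        hits += 1
    return hits / len(hashes)


def pick_worker(workers, tokens, load_weight=0.5):
    """Pick the worker maximizing (cache hit rate - load_weight * load)."""
    hashes = block_hashes(tokens)

    def score(w):
        load = w.active_requests / w.max_requests
        return hit_rate(w, hashes) - load_weight * load

    best = max(workers, key=score)
    # Record the assignment: the worker is busier and now caches this prompt.
    best.active_requests += 1
    best.cached_blocks.update(hashes)
    return best
```

With two idle workers, a prompt that shares a long prefix with an earlier request is routed back to the worker that served that request, while an unrelated prompt goes to the less loaded worker. Raising load_weight shifts the balance from cache reuse toward load spreading.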

This talk will also introduce production-grade LLM serving features of Dynamo that enable users to:

  • Find the best configuration for disaggregated serving offline.
  • Tune performance automatically based on real-time traffic.
  • Dynamically scale prefill and decode workers via topology-aware gang scheduling.
  • Leverage LLM-specific fault tolerance.

Speakers

Piotr Tarasiewicz

Senior Software Engineer – AI at NVIDIA, working on Dynamo, an advanced LLM inference framework. He specializes in developing and optimizing inference and deployment technologies for large-scale deep learning models. He holds a B.Eng. from Warsaw University of Technology and an M.Sc. from University College London, where his research focused on probabilistic models and reinforcement learning applications in robotics.


Notice: The placeholder video image is licensed under CC BY-SA 4.0. The original image can be found hereChanges made to the image are: Cropped the image to a new ratio, part of the image was cut off.