Supercharging LLM serving with Dynamo
UD2.120 (Chavanne) | Day 1 | 15:40 - 16:00 | Speakers: Piotr Tarasiewicz
Abstract
The explosive growth of Large Language Models (LLMs) demands highly efficient, scalable inference systems. This talk shares the key innovations that NVIDIA Dynamo (https://github.com/ai-dynamo/dynamo) adds to enable system-level optimizations while building on the performance of inference engines such as vLLM, SGLang, and TRT-LLM:
- Smart Scheduling that routes requests based on the KV cache hit rate and load, intelligently autoscales, and disaggregates the prefill and decode phases.
- Hierarchical Memory Management that utilizes HBM, host memory, local disk, and remote storage.
- Low-Latency Transfer of the KV cache across nodes and the memory hierarchy.
This talk will also introduce production-grade LLM serving features of Dynamo that enable users to:
- Find the best configuration for disaggregated serving offline.
- Tune performance automatically based on real-time traffic.
- Dynamically scale prefill and decode workers via topology-aware gang scheduling.
- Leverage LLM-specific fault tolerance.
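To make the Smart Scheduling idea in the abstract concrete, here is a minimal sketch of routing by KV cache overlap and load. It is not Dynamo's implementation or API: the names `Worker`, `block_hashes`, `route`, and `load_weight`, the block size, and the cost formula are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (assumed, kept small for illustration)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block chained with its prefix, so equal hashes imply equal prefixes."""
    hashes, prev = [], 0
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for i in range(0, full, block_size):
        prev = hash((prev, tuple(tokens[i:i + block_size])))
        hashes.append(prev)
    return hashes


@dataclass
class Worker:  # hypothetical stand-in for an inference engine instance
    name: str
    active_requests: int = 0
    cached_blocks: set = field(default_factory=set)

    def overlap(self, hashes):
        """Count the leading blocks of a request that this worker already caches."""
        n = 0
        for h in hashes:
            if h not in self.cached_blocks:
                break
            n += 1
        return n


def route(tokens, workers, load_weight=1.0):
    """Pick the worker with the lowest cost: uncached blocks plus a load penalty."""
    hashes = block_hashes(tokens)

    def cost(w):
        # Blocks that still need prefill, plus a penalty per in-flight request.
        return (len(hashes) - w.overlap(hashes)) + load_weight * w.active_requests

    best = min(workers, key=cost)
    best.cached_blocks.update(hashes)  # the chosen worker will now hold these blocks
    best.active_requests += 1
    return best


if __name__ == "__main__":
    workers = [Worker("a"), Worker("b")]
    shared_prefix = list(range(8))
    print(route(shared_prefix, workers).name)
    print(route(shared_prefix + [100, 101, 102, 103], workers).name)
```

In this sketch, a request that shares a prefix with earlier traffic tends to go to the worker that already holds those blocks, until that worker's load penalty outweighs the prefill work saved. Raising `load_weight` shifts the balance toward even load distribution.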
Speakers
Piotr Tarasiewicz is a Senior Software Engineer – AI at NVIDIA, working on Dynamo, an advanced LLM inference framework. He specializes in developing and optimizing inference and deployment technologies for large-scale deep learning models. He holds a B.Eng. from Warsaw University of Technology and an M.Sc. from University College London, where his research focused on probabilistic models and reinforcement learning applications in robotics.
