Name: Supercharging LLM serving with Dynamo
Start: 2026-01-31T15:40:00
End: 2026-01-31T15:40:00
Location: UD2.120 (Chavanne)

Abstract

The explosive growth of Large Language Models (LLMs) demands highly efficient and scalable inference systems. This talk presents the key innovations that NVIDIA Dynamo (https://github.com/ai-dynamo/dynamo) adds to enable system-level optimizations while building on the performance of inference engines such as vLLM, SGLang, and TRT-LLM:

Smart Scheduling that routes requests based on the KV cache hit rate and load, intelligently autoscales, and disaggregates the prefill and decode phases.
Hierarchical Memory Management that utilizes HBM, host memory, local disk, and remote storage.
Low-Latency Transfer of the KV cache across nodes and the memory hierarchy.
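
As a rough illustration of the first point, routing on KV cache hits and load can be sketched as a cost function over workers. This is a minimal sketch under stated assumptions, not Dynamo's actual router or API: the `Worker` class, `block_hashes`, `pick_worker`, the block size, and the `load_weight` parameter are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker record; not a Dynamo type."""
    name: str
    active_requests: int = 0
    # Hashes of the prefix blocks this worker is assumed to hold in its KV cache.
    cached_blocks: set = field(default_factory=set)


def block_hashes(tokens, block_size=4):
    """Chain-hash each full block so identical prefixes yield identical hashes."""
    hashes = []
    prev = 0
    full_len = len(tokens) - len(tokens) % block_size
    for i in range(0, full_len, block_size):
        prev = hash((prev, tuple(tokens[i:i + block_size])))
        hashes.append(prev)
    return hashes


def pick_worker(workers, tokens, load_weight=1.0, block_size=4):
    """Pick the worker with the lowest cost: uncached blocks plus weighted load."""
    blocks = block_hashes(tokens, block_size)

    def cost(worker):
        hits = 0
        for h in blocks:
            if h not in worker.cached_blocks:
                break  # a prefix match ends at the first missing block
            hits += 1
        blocks_to_prefill = len(blocks) - hits
        return blocks_to_prefill + load_weight * worker.active_requests

    return min(workers, key=cost)
```

With this cost, a busy worker that already caches the prompt prefix can still win over an idle one, until its queue grows large enough that recomputing the prefix elsewhere becomes cheaper. The relative weighting of cache hits and load is the tunable part of the design.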

This talk will also introduce production-grade LLM serving features of Dynamo that enable users to:

- Find the best configuration for disaggregated serving offline.
- Tune performance automatically based on real-time traffic.
- Dynamically scale prefill and decode workers via topology-aware gang scheduling.
- Leverage LLM-specific fault tolerance.

Speakers

Piotr Tarasiewicz

Senior Software Engineer – AI at NVIDIA, working on Dynamo, an advanced LLM inference framework. He specializes in developing and optimizing inference and deployment technologies for large-scale deep learning models. He holds a B.Eng. from Warsaw University of Technology and an M.Sc. from University College London, where his research focused on probabilistic models and reinforcement learning applications in robotics.
