Multimodal support in llama.cpp - Achievements and Future Directions
UD2.120 (Chavanne) | Day 1 | 10:35 - 10:55 | Speakers: Xuan-Son Nguyen
Abstract
llama.cpp has become a key tool for running LLMs efficiently on any hardware. This talk explores how multimodal features have grown in the project. It focuses on libmtmd, a library added in April 2025 to make multimodal support easier to use and to maintain in llama.cpp.
We will first cover the main achievements. These include merging the separate per-model CLI tools into a single tool called llama-mtmd-cli. Next, we will discuss how libmtmd works with llama-server and show real examples of low-latency OCR applications. We will also talk about the addition of audio support, which lets newer models summarize audio inputs. Finally, we will cover the challenges of handling legacy code while keeping the project flexible for future models.
Looking forward, the talk will share plans for new features such as video input, text-to-speech support, and image generation. Attendees will also learn how to contribute and how to use these multimodal tools in their own projects.
