Skip to main content

Multimodal support in llama.cpp - Achievements and Future Directions

UD2.120 (Chavanne) | Day 1 | 10:35 - 10:55 | Speakers: Xuan-Son Nguyen

Multimodal support in llama.cpp - Achievements and Future Directions
A picture of a devroom at FOSDEM 2024
Open in browser

Notes

Abstract

llama.cpp has become a key tool for running LLMs efficiently on any hardware. This talk explores how multimodal features have grown in the project. It focuses on libmtmd, a library added in April 2025 to make multimodal support easier to use and to maintain in llama.cpp.

We will first cover main achievements. These include combining separate CLI tools for different models into one single tool called llama-mtmd-cli. Next, we will discuss how libmtmd works with llama-server and show real examples of low-latency OCR applications. We will also talk about adding audio support, which lets newer models summarize audio inputs. Plus, we will cover the challenges of handling legacy code while keeping the project flexible for future models.

Looking forward, the talk will share plans for new features like video input, text-to-speech support, and image generation. Attendees will also learn how to contribute and use these multimodal tools in their own project.


Notice: The placeholder video image is licensed under CC BY-SA 4.0. The original image can be found hereChanges made to the image are: Cropped the image to a new ratio, part of the image was cut off.