Multimodal Language Models#
These models accept multimodal inputs (e.g., images and text) and generate text output. They augment language models with multimodal encoders.
Example Launch Command#
```bash
# Replace --model-path with the HuggingFace ID or local path of any supported model.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --host 0.0.0.0 \
  --port 30000
```
See the OpenAI APIs section for how to send multimodal requests.
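For instance, a minimal image request against the OpenAI-compatible Chat Completions endpoint might look like the sketch below; the image URL and prompt are placeholders, and the server is assumed to be running locally on port 30000 with the model from the launch command above.

```python
import requests

url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    # Placeholder URL; any publicly reachable image works.
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
    "max_tokens": 128,
}

response = requests.post(url, json=data)
print(response.text)
```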
Supported Models#
The supported models are summarized in the table below.
If you are unsure whether a specific architecture is implemented, you can search for it on GitHub. For example, to search for `Qwen2_5_VLForConditionalGeneration`, use the expression:
repo:sgl-project/sglang path:/^python\/sglang\/srt\/models\// Qwen2_5_VLForConditionalGeneration
in the GitHub search bar.
| Model Family (Variants) | Example HuggingFace Identifier | Description | Notes |
|---|---|---|---|
| Qwen-VL | | Alibaba’s vision-language extension of Qwen; for example, Qwen2.5-VL (7B and larger variants) can analyze and converse about image content. | |
| DeepSeek-VL2 | | Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs. | |
| Janus-Pro (1B, 7B) | | DeepSeek’s open-source multimodal model capable of both image understanding and generation. Janus-Pro employs a decoupled architecture with separate visual encoding paths, enhancing performance in both tasks. | |
| MiniCPM-V / MiniCPM-o | | MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices. | |
| Llama 3.2 Vision (11B) | meta-llama/Llama-3.2-11B-Vision-Instruct | Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks. | |
| LLaVA (v1.5 & v1.6) | e.g. | Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g. LLaMA2 13B) for following multimodal instruction prompts. | |
| LLaVA-NeXT (8B, 72B) | | Improved LLaVA models (with an 8B Llama3 version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks. | |
| LLaVA-OneVision | | Enhanced LLaVA variant integrating Qwen as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format. | |
| Gemma 3 (Multimodal) | | Gemma 3’s larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context. | |
| Kimi-VL (A3B) | | Kimi-VL is a multimodal model that can understand and generate text from images. | |
| Mistral-Small-3.1-24B | | Mistral Small 3.1 is a multimodal model that can generate text from text or image inputs. It also supports tool calling and structured output. | |
| Phi-4-multimodal-instruct | | Phi-4-multimodal-instruct is the multimodal variant of the Phi-4-mini model, enhanced with LoRA for improved multimodal capabilities. It supports text, vision, and audio modalities in SGLang. | |
| MiMo-VL (7B) | | Xiaomi’s compact yet powerful vision-language model featuring a native-resolution ViT encoder for fine-grained visual details, an MLP projector for cross-modal alignment, and the MiMo-7B language model optimized for complex reasoning tasks. | |
| GLM-4.5V (106B) / GLM-4.1V (9B) | | GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | Use |
| DotsVLM (General/OCR) | | RedNote’s vision-language model built on a 1.2B vision encoder and the DeepSeek V3 LLM, featuring a NaViT vision encoder trained from scratch with dynamic resolution support and enhanced OCR capabilities through structured image data training. | |
| DotsVLM-OCR | | Specialized OCR variant of DotsVLM optimized for optical character recognition tasks with enhanced text extraction and document understanding capabilities. | Don’t use |
| NVILA (8B, 15B, Lite-2B, Lite-8B, Lite-15B) | | NVILA explores the full-stack efficiency of multimodal design, achieving cheaper training, faster deployment, and better performance. | |
Video Input Support#
SGLang supports video input for Vision-Language Models (VLMs), enabling temporal reasoning tasks such as video question answering, captioning, and holistic scene understanding. Video clips are decoded, key frames are sampled, and the resulting tensors are batched together with the text prompt, allowing multimodal inference to integrate visual and linguistic context.
| Model Family | Example Identifier | Video Notes |
|---|---|---|
| Qwen-VL (Qwen2-VL, Qwen2.5-VL, Qwen3-VL, Qwen3-Omni) | Qwen/Qwen3-VL-30B-A3B-Instruct | The processor gathers |
| GLM-4v (4.5V, 4.1V, MOE) | | Video clips are read with Decord, converted to tensors, and passed to the model alongside metadata for rotary-position handling. |
| NVILA (Full & Lite) | | The runtime samples eight frames per clip and attaches them to the multimodal request when |
| LLaVA video variants (LLaVA-NeXT-Video, LLaVA-OneVision) | | The processor routes video prompts to the LlavaVid video-enabled architecture, and the provided example shows how to query it with |
Use `sgl.video(path, num_frames)` when building prompts to attach clips from your SGLang programs, as sketched below.
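A minimal sketch of this frontend-language pattern, assuming a server with a video-capable model is already running at localhost:30000; the function name `video_qa`, the local clip path, and the `answer` variable are illustrative:

```python
import sglang as sgl

# Assumes an SGLang server with a video-capable model is already running locally.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def video_qa(s, video_path, question):
    # sgl.video(path, num_frames) samples frames from the clip and attaches them to the prompt.
    s += sgl.user(sgl.video(video_path, 8) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))

state = video_qa.run(
    video_path="videos/jobs_presenting_ipod.mp4",  # illustrative local path
    question="What is happening in this video?",
)
print(state["answer"])
```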
Example OpenAI-compatible request that sends a video clip:
```python
import requests

url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s happening in this video?"},
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://github.com/sgl-project/sgl-test-files/raw/refs/heads/main/videos/jobs_presenting_ipod.mp4"
                    },
                },
            ],
        }
    ],
    "max_tokens": 300,
}

response = requests.post(url, json=data)
print(response.text)
```
Usage Notes#
Performance Optimization#
For multimodal models, you can use the `--keep-mm-feature-on-device` flag to optimize for latency at the cost of increased GPU memory usage:
- Default behavior: multimodal feature tensors are moved to the CPU after processing to save GPU memory.
- With `--keep-mm-feature-on-device`: feature tensors remain on the GPU, reducing device-to-host copy overhead and improving latency, but consuming more GPU memory.
Use this flag when you have sufficient GPU memory and want to minimize latency for multimodal inference.
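For example, adding the flag to the launch command shown earlier (the model path is reused from that example):

```bash
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --keep-mm-feature-on-device \
  --host 0.0.0.0 \
  --port 30000
```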