Query VLM with Offline Engine#
This tutorial shows how to use SGLang’s offline Engine API to query VLMs, using Qwen2.5-VL and Llama 4 as examples. It covers three calling approaches:
Basic Call: Directly pass images and text.
Processor Output: Use a HuggingFace processor for data preprocessing.
Precomputed Embeddings: Pre-calculate image features to improve inference efficiency.
Understanding the Three Input Formats#
SGLang supports three ways to pass visual data, each optimized for different scenarios:
1. Raw Images - Simplest approach#
Pass PIL Images, file paths, URLs, or base64 strings directly
SGLang handles all preprocessing automatically
Best for: Quick prototyping, simple applications
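For reference, any of these raw forms can go into the image_data list (an illustrative sketch; the variable names are placeholders and the engine itself is created later in this tutorial):
image_data = [pil_image]                                 # a PIL.Image.Image object
image_data = ["/path/to/example_image.png"]              # a local file path
image_data = ["https://example.com/example_image.png"]   # an image URL
image_data = ["iVBORw0KGgo..."]                          # a base64-encoded image string
# out = llm.generate(prompt=prompt_with_image_token, image_data=image_data)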
2. Processor Output - For custom preprocessing#
Pre-process images with HuggingFace processor
Pass the complete processor output dict with format: "processor_output"
Best for: Custom image transformations, integration with existing pipelines
Requirement: Must use input_ids instead of a text prompt
3. Precomputed Embeddings - For maximum performance#
Pre-calculate visual embeddings using the vision encoder
Pass embeddings with format: "precomputed_embedding"
Best for: Repeated queries on same images, caching, high-throughput serving
Performance gain: Avoids redundant vision encoder computation (30-50% speedup)
Key Rule: Within a single request, use only one format for all images. Don’t mix formats.
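As a rough sketch, a single request takes one of the following shapes (the variable and field names follow the examples later in this tutorial and are placeholders here):
# One request, one format -- never mix these styles within the same image_data list.

# 1. Raw images: SGLang does all preprocessing
out = llm.generate(prompt=prompt, image_data=[image])

# 2. Processor output: requires input_ids instead of a text prompt
out = llm.generate(
    input_ids=input_ids,
    image_data=[dict(processor_output, format="processor_output")],
)

# 3. Precomputed embeddings: processor output plus a "feature" tensor from the vision encoder
out = llm.generate(
    input_ids=input_ids,
    image_data=[dict(processor_output, format="precomputed_embedding", feature=features)],
)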
The examples below demonstrate all three approaches with both Qwen2.5-VL and Llama 4 models.
Querying Qwen2.5-VL Model#
[ ]:
import nest_asyncio
nest_asyncio.apply()
model_path = "Qwen/Qwen2.5-VL-3B-Instruct"
chat_template = "qwen2-vl"
[ ]:
from io import BytesIO
import requests
from PIL import Image
from sglang.srt.parser.conversation import chat_templates
image = Image.open(
BytesIO(
requests.get(
"https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
).content
)
)
conv = chat_templates[chat_template].copy()
conv.append_message(conv.roles[0], f"What's shown here: {conv.image_token}?")
conv.append_message(conv.roles[1], "")
conv.image_data = [image]
print("Generated prompt text:")
print(conv.get_prompt())
print(f"\nImage size: {image.size}")
image
Basic Offline Engine API Call#
[ ]:
from sglang import Engine
llm = Engine(model_path=model_path, chat_template=chat_template, log_level="warning")
[ ]:
out = llm.generate(prompt=conv.get_prompt(), image_data=[image])
print("Model response:")
print(out["text"])
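Generation can also be tuned by passing a sampling_params dict to Engine.generate. The keys below (temperature, top_p, max_new_tokens) are standard SGLang sampling options; the values are only illustrative.
[ ]:
out = llm.generate(
    prompt=conv.get_prompt(),
    image_data=[image],
    sampling_params={"temperature": 0.7, "top_p": 0.9, "max_new_tokens": 128},
)
print("Response with custom sampling parameters:")
print(out["text"])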
Call with Processor Output#
Use a HuggingFace processor to preprocess the text and image, then pass the processor output directly to Engine.generate.
[ ]:
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
processor_output = processor(
images=[image], text=conv.get_prompt(), return_tensors="pt"
)
out = llm.generate(
input_ids=processor_output["input_ids"][0].detach().cpu().tolist(),
image_data=[dict(processor_output, format="processor_output")],
)
print("Response using processor output:")
print(out["text"])
Call with Precomputed Embeddings#
You can pre-calculate the image features once to avoid re-running the vision encoder on every request.
[ ]:
from transformers import AutoProcessor
from transformers import Qwen2_5_VLForConditionalGeneration
processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
vision = (
Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path).eval().visual.cuda()
)
[ ]:
processor_output = processor(
images=[image], text=conv.get_prompt(), return_tensors="pt"
)
input_ids = processor_output["input_ids"][0].detach().cpu().tolist()
precomputed_embeddings = vision(
processor_output["pixel_values"].cuda(), processor_output["image_grid_thw"].cuda()
)
multi_modal_item = dict(
processor_output,
format="precomputed_embedding",
feature=precomputed_embeddings,
)
out = llm.generate(input_ids=input_ids, image_data=[multi_modal_item])
print("Response using precomputed embeddings:")
print(out["text"])
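# Sketch of the caching benefit: ask a second question about the same image while
# reusing the vision features computed above, so the vision encoder is not run again.
conv2 = chat_templates[chat_template].copy()
conv2.append_message(conv2.roles[0], f"List the main objects here: {conv2.image_token}")
conv2.append_message(conv2.roles[1], "")
reuse_output = processor(images=[image], text=conv2.get_prompt(), return_tensors="pt")
reused_item = dict(
    reuse_output,
    format="precomputed_embedding",
    feature=precomputed_embeddings,  # cached features from the forward pass above
)
out = llm.generate(
    input_ids=reuse_output["input_ids"][0].detach().cpu().tolist(),
    image_data=[reused_item],
)
print("Response reusing cached embeddings:")
print(out["text"])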
llm.shutdown()
Querying Llama 4 Vision Model#
model_path = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
chat_template = "llama-4"
from io import BytesIO
import requests
from PIL import Image
from sglang.srt.parser.conversation import chat_templates
# Download the same example image
image = Image.open(
BytesIO(
requests.get(
"https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
).content
)
)
conv = chat_templates[chat_template].copy()
conv.append_message(conv.roles[0], f"What's shown here: {conv.image_token}?")
conv.append_message(conv.roles[1], "")
conv.image_data = [image]
print("Llama 4 generated prompt text:")
print(conv.get_prompt())
print(f"Image size: {image.size}")
image
Llama 4 Basic Call#
Llama 4 requires more computational resources, so the engine is configured with multi-GPU tensor parallelism (tp_size=4) and a larger context length.
llm = Engine(
model_path=model_path,
enable_multimodal=True,
attention_backend="fa3",
tp_size=4,
context_length=65536,
)
out = llm.generate(prompt=conv.get_prompt(), image_data=[image])
print("Llama 4 response:")
print(out["text"])
Call with Processor Output#
Preprocessing the text and image with the HuggingFace processor ahead of time can reduce computational overhead during inference.
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
processor_output = processor(
images=[image], text=conv.get_prompt(), return_tensors="pt"
)
out = llm.generate(
input_ids=processor_output["input_ids"][0].detach().cpu().tolist(),
image_data=[dict(processor_output, format="processor_output")],
)
print("Response using processor output:")
print(out["text"])
Call with Precomputed Embeddings#
from transformers import AutoProcessor
from transformers import Llama4ForConditionalGeneration
processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
model = Llama4ForConditionalGeneration.from_pretrained(
model_path, torch_dtype="auto"
).eval()
vision = model.vision_model.cuda()
multi_modal_projector = model.multi_modal_projector.cuda()
print(f'Image pixel values shape: {processor_output["pixel_values"].shape}')
input_ids = processor_output["input_ids"][0].detach().cpu().tolist()
# Process image through vision encoder
image_outputs = vision(
processor_output["pixel_values"].to("cuda"),
aspect_ratio_ids=processor_output["aspect_ratio_ids"].to("cuda"),
aspect_ratio_mask=processor_output["aspect_ratio_mask"].to("cuda"),
output_hidden_states=False
)
image_features = image_outputs.last_hidden_state
# Flatten image features and pass through multimodal projector
vision_flat = image_features.view(-1, image_features.size(-1))
precomputed_embeddings = multi_modal_projector(vision_flat)
# Build precomputed embedding data item
mm_item = dict(
processor_output,
format="precomputed_embedding",
feature=precomputed_embeddings
)
# Use precomputed embeddings for efficient inference
out = llm.generate(input_ids=input_ids, image_data=[mm_item])
print("Llama 4 precomputed embedding response:")
print(out["text"])
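As with the Qwen2.5-VL engine earlier, shut down the engine when finished to release GPU resources.
[ ]:
llm.shutdown()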