Query VLM with Offline Engine#

This tutorial shows how to query vision-language models (VLMs) with SGLang's offline Engine API, using Qwen2.5-VL and Llama 4 as examples. It covers three calling approaches:

  1. Basic Call: Directly pass images and text.

  2. Processor Output: Use HuggingFace processor for data preprocessing.

  3. Precomputed Embeddings: Pre-calculate image features to improve inference efficiency.

Understanding the Three Input Formats#

SGLang supports three ways to pass visual data, each optimized for different scenarios:

1. Raw Images - Simplest approach#

  • Pass PIL Images, file paths, URLs, or base64 strings directly

  • SGLang handles all preprocessing automatically

  • Best for: Quick prototyping, simple applications

2. Processor Output - For custom preprocessing#

  • Pre-process images with HuggingFace processor

  • Pass the complete processor output dict with format: "processor_output"

  • Best for: Custom image transformations, integration with existing pipelines

  • Requirement: Must pass input_ids instead of a text prompt

3. Precomputed Embeddings - For maximum performance#

  • Pre-calculate visual embeddings using the vision encoder

  • Pass embeddings with format: "precomputed_embedding"

  • Best for: Repeated queries on same images, caching, high-throughput serving

  • Performance gain: Avoids redundant vision encoder computation (30-50% speedup)

Key Rule: Within a single request, use only one format for all images. Don’t mix formats.

The examples below demonstrate all three approaches with both Qwen2.5-VL and Llama 4 models.

Querying Qwen2.5-VL Model#

[ ]:
import nest_asyncio

nest_asyncio.apply()

model_path = "Qwen/Qwen2.5-VL-3B-Instruct"
chat_template = "qwen2-vl"
[ ]:
from io import BytesIO
import requests
from PIL import Image

from sglang.srt.parser.conversation import chat_templates

image = Image.open(
    BytesIO(
        requests.get(
            "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
        ).content
    )
)

conv = chat_templates[chat_template].copy()
conv.append_message(conv.roles[0], f"What's shown here: {conv.image_token}?")
conv.append_message(conv.roles[1], "")
conv.image_data = [image]

print("Generated prompt text:")
print(conv.get_prompt())
print(f"\nImage size: {image.size}")
image

Basic Offline Engine API Call#

[ ]:
from sglang import Engine


llm = Engine(model_path=model_path, chat_template=chat_template, log_level="warning")
[ ]:
out = llm.generate(prompt=conv.get_prompt(), image_data=[image])
print("Model response:")
print(out["text"])
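
Raw images do not have to be PIL objects: as noted above, a file path, URL, or base64 string also works, and SGLang handles the download and preprocessing itself. A minimal sketch reusing the same prompt (the URL below is the example image used above):

# Pass the image as a URL string; SGLang fetches and preprocesses it automatically.
image_url = "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
out = llm.generate(prompt=conv.get_prompt(), image_data=[image_url])
print(out["text"])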

Call with Processor Output#

Use a HuggingFace processor to preprocess the text and image, then pass the processor output directly to Engine.generate with format: "processor_output".

[ ]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
processor_output = processor(
    images=[image], text=conv.get_prompt(), return_tensors="pt"
)

out = llm.generate(
    input_ids=processor_output["input_ids"][0].detach().cpu().tolist(),
    image_data=[dict(processor_output, format="processor_output")],
)
print("Response using processor output:")
print(out["text"])
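
If you want to see exactly what is handed to the engine, inspect the processor output. For Qwen2.5-VL it typically holds the token IDs plus the vision tensors used below; the exact keys can vary across processor versions, so treat the listed names as a sketch:

# Typical entries: input_ids, attention_mask, pixel_values, image_grid_thw
for key, value in processor_output.items():
    print(key, tuple(value.shape))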

Call with Precomputed Embeddings#

You can precompute the image features once, which avoids re-running the vision encoder for repeated queries on the same image.

[ ]:
from transformers import AutoProcessor
from transformers import Qwen2_5_VLForConditionalGeneration

processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
vision = (
    Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path).eval().visual.cuda()
)
[ ]:
processor_output = processor(
    images=[image], text=conv.get_prompt(), return_tensors="pt"
)

input_ids = processor_output["input_ids"][0].detach().cpu().tolist()

precomputed_embeddings = vision(
    processor_output["pixel_values"].cuda(), processor_output["image_grid_thw"].cuda()
)

multi_modal_item = dict(
    processor_output,
    format="precomputed_embedding",
    feature=precomputed_embeddings,
)

out = llm.generate(input_ids=input_ids, image_data=[multi_modal_item])
print("Response using precomputed embeddings:")
print(out["text"])
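
Because the vision encoder already ran once above, the same multi_modal_item can be reused for further generate calls without re-encoding the image. A small sketch that only varies the sampling parameters (parameter names follow SGLang's standard sampling options):

# Reuse the cached embeddings; only the language model runs again.
for temperature in (0.0, 0.7):
    out = llm.generate(
        input_ids=input_ids,
        image_data=[multi_modal_item],
        sampling_params={"temperature": temperature, "max_new_tokens": 64},
    )
    print(f"temperature={temperature}: {out['text']}")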

llm.shutdown()

Querying Llama 4 Vision Model#

model_path = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
chat_template = "llama-4"

from io import BytesIO
import requests
from PIL import Image

from sglang.srt.parser.conversation import chat_templates

# Download the same example image
image = Image.open(
    BytesIO(
        requests.get(
            "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
        ).content
    )
)

conv = chat_templates[chat_template].copy()
conv.append_message(conv.roles[0], f"What's shown here: {conv.image_token}?")
conv.append_message(conv.roles[1], "")
conv.image_data = [image]

print("Llama 4 generated prompt text:")
print(conv.get_prompt())
print(f"Image size: {image.size}")

image

Llama 4 Basic Call#

Llama 4 requires more computational resources, so the engine is configured with multi-GPU tensor parallelism (tp_size=4) and a larger context length.

llm = Engine(
    model_path=model_path,
    enable_multimodal=True,
    attention_backend="fa3",
    tp_size=4,
    context_length=65536,
)

out = llm.generate(prompt=conv.get_prompt(), image_data=[image])
print("Llama 4 response:")
print(out["text"])
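
The engine also accepts batches. A hedged sketch is below: it assumes a list of prompts paired with a list of per-prompt image lists, which is worth verifying against the Engine API docs for your SGLang version.

# Batched generation: one prompt per request, with a matching list of
# per-request image lists (the nesting of image_data is an assumption to verify).
prompts = [conv.get_prompt(), conv.get_prompt()]
outs = llm.generate(prompt=prompts, image_data=[[image], [image]])
for o in outs:
    print(o["text"])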

Call with Processor Output#

As with Qwen2.5-VL, preprocess the text and image with the HuggingFace processor, then pass the processor output directly to Engine.generate.

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
processor_output = processor(
    images=[image], text=conv.get_prompt(), return_tensors="pt"
)

out = llm.generate(
    input_ids=processor_output["input_ids"][0].detach().cpu().tolist(),
    image_data=[dict(processor_output, format="processor_output")],
)
print("Response using processor output:")
print(out["text"])

Call with Precomputed Embeddings#

from transformers import AutoProcessor
from transformers import Llama4ForConditionalGeneration

processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto"
).eval()

vision = model.vision_model.cuda()
multi_modal_projector = model.multi_modal_projector.cuda()

print(f'Image pixel values shape: {processor_output["pixel_values"].shape}')
input_ids = processor_output["input_ids"][0].detach().cpu().tolist()

# Process the image through the Llama 4 vision encoder, casting the pixel
# values to the encoder's dtype (the model is loaded in its native precision)
image_outputs = vision(
    processor_output["pixel_values"].to("cuda", dtype=next(vision.parameters()).dtype),
    output_hidden_states=False,
)
image_features = image_outputs.last_hidden_state

# Flatten image features and pass through multimodal projector
vision_flat = image_features.view(-1, image_features.size(-1))
precomputed_embeddings = multi_modal_projector(vision_flat)

# Build precomputed embedding data item
mm_item = dict(
    processor_output,
    format="precomputed_embedding",
    feature=precomputed_embeddings
)

# Use precomputed embeddings for efficient inference
out = llm.generate(input_ids=input_ids, image_data=[mm_item])
print("Llama 4 precomputed embedding response:")
print(out["text"])
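
As in the Qwen2.5-VL example, shut down the engine once you are done to release the GPUs:

llm.shutdown()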