Quantization#
SGLang-Diffusion supports quantized transformer checkpoints. In most cases, keep the base model and the quantized transformer override separate.
Quick Reference#
Use these paths:
- --model-path: the base or original model
- --transformer-path: a quantized transformers-style transformer component directory that already contains its own config.json
- --transformer-weights-path: quantized transformer weights provided as a single safetensors file, a sharded safetensors directory, a local path, or a Hugging Face repo ID
Recommended example:
sglang generate \
--model-path black-forest-labs/FLUX.2-dev \
--transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \
--prompt "a curious pikachu"
For quantized transformers-style transformer component folders:
sglang generate \
--model-path /path/to/base-model \
--transformer-path /path/to/quantized-transformer \
--prompt "A Logo With Bold Large Text: SGL Diffusion"
NOTE: Some model-specific integrations also accept a quantized repo or local
directory directly as --model-path, but that is a compatibility path. If a
repo contains multiple candidate checkpoints, pass
--transformer-weights-path explicitly.
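As a rough illustration of how the three path flags combine, the sketch below assembles an `sglang generate` argument list in Python. The helper name `build_generate_command` is hypothetical and not part of SGLang; it only uses the flags documented above and does not run the command.

```python
import shlex


def build_generate_command(model_path, prompt, transformer_path=None,
                           transformer_weights_path=None):
    """Assemble an `sglang generate` argument list (hypothetical helper)."""
    # The base model is always passed through --model-path.
    args = ["sglang", "generate", "--model-path", model_path]
    # Optional quantized component folder that ships its own config.json.
    if transformer_path is not None:
        args += ["--transformer-path", transformer_path]
    # Optional quantized weights: file, sharded directory, or repo ID.
    if transformer_weights_path is not None:
        args += ["--transformer-weights-path", transformer_weights_path]
    args += ["--prompt", prompt]
    return args


command = build_generate_command(
    "black-forest-labs/FLUX.2-dev",
    "a curious pikachu",
    transformer_weights_path="black-forest-labs/FLUX.2-dev-NVFP4",
)
# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run` if the command should be launched from a script.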
Quant Families#
Here, quant_family means a checkpoint and loading family with shared CLI
usage and loader behavior. It is not just the numeric precision or a kernel
backend.
| quant_family | checkpoint form | canonical CLI | supported models | extra dependency | platform / notes |
|---|---|---|---|---|---|
| | Quantized transformer component folder, or safetensors with | | ALL | None | Component-folder and single-file flows are both supported |
| | NVFP4 safetensors file, sharded directory, or repo providing transformer weights | | FLUX.2 | | Blackwell can use a best-performance kit when available; otherwise SGLang falls back to the generic ModelOpt FP4 path |
| | Pre-quantized Nunchaku transformer weights, usually named | | Model-specific support such as Qwen-Image, FLUX, and Z-Image | | SGLang can infer precision and rank from the filename and supports both |
| | Pre-quantized msmodelslim transformer weights | | Wan2.2 family | None | Currently only compatible with the Ascend NPU family and supports both |
NVFP4#
Usage Examples#
Recommended usage keeps the base model and quantized transformer override separate:
sglang generate \
--model-path black-forest-labs/FLUX.2-dev \
--transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--save-output
SGLang also supports passing the NVFP4 repo or local directory directly as
--model-path:
sglang generate \
--model-path black-forest-labs/FLUX.2-dev-NVFP4 \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--save-output
Notes#
- --transformer-weights-path is still the canonical CLI for NVFP4 transformer checkpoints.
- Direct --model-path loading is a compatibility path for FLUX.2 NVFP4-style repos or local directories.
- If --transformer-weights-path is provided explicitly, it takes precedence over the compatibility --model-path flow.
- For local directories, SGLang first looks for *-mixed.safetensors, then falls back to loading from the directory.
- On Blackwell, comfy-kitchen can provide the best-performance path when available; otherwise SGLang falls back to the generic ModelOpt FP4 path.
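The local-directory lookup order from the notes above can be sketched as follows. This is a minimal illustration, not SGLang's loader: the function name `resolve_nvfp4_weights` is hypothetical, and picking the alphabetically first match when several `*-mixed.safetensors` files exist is an assumption made here for determinism.

```python
from pathlib import Path


def resolve_nvfp4_weights(local_dir):
    """Prefer a *-mixed.safetensors file; otherwise return the directory."""
    directory = Path(local_dir)
    # Assumption: when several files match, take the alphabetically first.
    mixed = sorted(directory.glob("*-mixed.safetensors"))
    if mixed:
        return mixed[0]
    # Fallback: let the caller load from the directory itself.
    return directory
```

If a directory holds more than one candidate checkpoint, passing --transformer-weights-path explicitly avoids relying on any such ordering.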
Nunchaku (SVDQuant)#
Install#
Install the runtime dependency first:
pip install nunchaku
For platform-specific installation methods and troubleshooting, see the Nunchaku installation guide.
File Naming and Auto-Detection#
For Nunchaku checkpoints, --model-path should still point to the original
base model, while --transformer-weights-path points to the quantized
transformer weights.
If the basename of --transformer-weights-path contains the pattern
svdq-(int4|fp4)_r{rank}, SGLang will automatically:
- enable SVDQuant
- infer --quantization-precision
- infer --quantization-rank
Examples:
| checkpoint name fragment | inferred precision | inferred rank | notes |
|---|---|---|---|
| | | | Standard INT4 checkpoint |
| | | | Higher-quality INT4 checkpoint |
| | | | |
| | | | Higher-quality NVFP4 checkpoint |
Common filenames:
| filename | precision | rank | typical use |
|---|---|---|---|
| | | | Balanced default |
| | | | Quality-focused |
| | | | RTX 50-series / NVFP4 path |
| | | | Quality-focused NVFP4 |
| | | | Lightning 4-step |
| | | | Lightning 8-step |
If your checkpoint name does not follow this convention, pass
--enable-svdquant, --quantization-precision, and --quantization-rank
explicitly.
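The filename convention can be illustrated with a small regular-expression sketch. It is not SGLang's actual detection code: the function name `infer_svdquant_settings` is hypothetical, and the only facts taken from this page are the `svdq-(int4|fp4)_r{rank}` pattern, the basename-only matching, and the mapping of `fp4` in filenames to the CLI value `nvfp4`.

```python
import os
import re

# Mirrors the documented naming convention svdq-(int4|fp4)_r{rank}.
_SVDQ_PATTERN = re.compile(r"svdq-(int4|fp4)_r(\d+)")


def infer_svdquant_settings(weights_path):
    """Return (precision, rank) parsed from the basename, or None."""
    # Only the basename is inspected, not the parent directories.
    basename = os.path.basename(weights_path)
    match = _SVDQ_PATTERN.search(basename)
    if match is None:
        # No match: the settings have to be passed explicitly.
        return None
    # Filenames write the NVFP4 variant as "fp4"; the CLI value is "nvfp4".
    precision = "nvfp4" if match.group(1) == "fp4" else "int4"
    return precision, int(match.group(2))


print(infer_svdquant_settings("/path/to/svdq-int4_r32-qwen-image.safetensors"))
print(infer_svdquant_settings("/path/to/custom_nunchaku_checkpoint.safetensors"))
```

A `None` result corresponds to the case where --enable-svdquant, --quantization-precision, and --quantization-rank must be supplied by hand.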
Usage Examples#
Recommended auto-detected flow:
sglang generate \
--model-path Qwen/Qwen-Image \
--transformer-weights-path /path/to/svdq-int4_r32-qwen-image.safetensors \
--prompt "a beautiful sunset" \
--save-output
Manual override when the filename does not encode the quant settings:
sglang generate \
--model-path Qwen/Qwen-Image \
--transformer-weights-path /path/to/custom_nunchaku_checkpoint.safetensors \
--enable-svdquant \
--quantization-precision int4 \
--quantization-rank 128 \
--prompt "a beautiful sunset" \
--save-output
Notes#
- --transformer-weights-path is the canonical flag for Nunchaku checkpoints. Older config names such as quantized_model_path are treated as compatibility aliases.
- Auto-detection only happens when the checkpoint basename matches svdq-(int4|fp4)_r{rank}.
- The CLI values are int4 and nvfp4. In filenames, the NVFP4 variant is written as fp4.
- Lightning checkpoints usually expect matching --num-inference-steps, such as 4 or 8.
- Current runtime validation only allows Nunchaku on NVIDIA CUDA Ampere (SM8x) or SM12x GPUs. Hopper (SM90) is currently rejected.
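The GPU restriction in the last note can be expressed as a simple compute-capability check. The sketch below is an illustration only, not SGLang's validation code; `nunchaku_supported` is a hypothetical name, and it assumes the capability is given as a `(major, minor)` tuple, which is the shape returned by `torch.cuda.get_device_capability()`.

```python
def nunchaku_supported(capability):
    """Return True for SM8x or SM12x, per the documented restriction."""
    major, _minor = capability
    # SM8x (major 8) and SM12x (major 12) are allowed; SM90 (Hopper) is not.
    return major in (8, 12)


# Example (major, minor) tuples; on a real machine the tuple could come
# from torch.cuda.get_device_capability().
for capability in [(8, 0), (8, 6), (9, 0), (12, 0)]:
    print(capability, nunchaku_supported(capability))
```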
ModelSlim#
MindStudio-ModelSlim (msModelSlim) is an offline model quantization and compression tool developed by MindStudio and optimized for Ascend hardware.
Installation
# Clone the repo and install msmodelslim:
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
Multimodal_sd quantization
Download the original floating-point weights of the model. Taking Wan2.2-T2V-A14B as an example, obtain the original model weights from the Wan2.2-T2V-A14B model page. Then install the other model-specific dependencies listed in the ModelScope model card.
Note: You can find pre-quantized validated models on modelscope/Eco-Tech.
Run quantization using one-click quantization (recommended):
msmodelslim quant \
--model_path /path/to/wan2_2_float_weights \
--save_path /path/to/wan2_2_quantized_weights \
--device npu \
--model_type Wan2_2 \
--quant_type w8a8 \
--trust_remote_code True
For more detailed quantization examples and information about model support, see the examples section in the msModelSlim repo.
Note: SGLang does not support quantized embeddings. Disable this option when quantizing with msmodelslim.
Auto-Detection and different formats
For msmodelslim checkpoints, specifying --model-path alone is enough: quantization is detected automatically for each layer by parsing the quant_model_description.json config.
For Wan2.2, only the Diffusers weight storage format is supported, whereas msmodelslim saves the quantized model in the original Wan2.2 format. To convert it, use the python/sglang/multimodal_gen/tools/wan_repack.py script:
python wan_repack.py \
--input-path {path_to_quantized_model} \
--output-path {path_to_converted_model}
After that, copy all files from the original Diffusers checkpoint except the transformer and transformer_2 folders.
Usage Example
With auto-detected flow:
sglang generate \
--model-path Eco-Tech/Wan2.2-T2V-A14B-Diffusers-w8a8 \
--prompt "a beautiful sunset" \
--save-output
Available Quantization Methods:
- [x] W4A4_DYNAMIC linear with online quantization of activations
- [x] W8A8 linear with offline quantization of activations
- [x] W8A8_DYNAMIC linear with online quantization of activations
- [ ] mxfp8 linear in progress
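To give a feel for per-layer detection from quant_model_description.json, the sketch below counts how many entries use each quantization method. The file schema is an assumption: it is taken here to be a flat JSON object mapping weight names to method strings such as W8A8, and the layer names and the FLOAT label in the sample are invented for illustration. The function `summarize_quant_description` is hypothetical and not part of SGLang or msmodelslim.

```python
import json
from collections import Counter


def summarize_quant_description(text):
    """Count entries per quantization method in a description JSON string."""
    description = json.loads(text)
    # Assumed schema: {"<weight name>": "<method>"}; skip non-string values.
    return Counter(
        value for value in description.values() if isinstance(value, str)
    )


# Invented sample; real files are produced by msmodelslim.
sample = json.dumps({
    "blocks.0.attn.q.weight": "W8A8",
    "blocks.0.attn.k.weight": "W8A8",
    "blocks.0.norm.weight": "FLOAT",
})
print(summarize_quant_description(sample))
```

A summary like this can serve as a quick sanity check that a checkpoint only uses methods marked as supported in the list above.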