mlx.core.qqmm
- qqmm(x: array, w: array, scales: array | None = None, group_size: int | None = None, bits: int | None = None, mode: str = 'nvfp4', global_scale_x: array | None = None, global_scale_w: array | None = None, *, stream: None | Stream | Device = None) -> array
Perform a matrix multiplication using a possibly quantized weight matrix w and a non-quantized input x. The input x is quantized on the fly. The weight matrix w is used as-is if it is already quantized; otherwise, it is quantized on the fly.
If w is quantized, scales must be provided, and group_size, bits, and mode must match the parameters that were used to quantize w.
Notes
If w is expected to receive gradients, it must be provided in non-quantized form.
If x and w are not quantized, their data types must be float32, float16, or bfloat16. If w is quantized, it must be packed in unsigned integers.
global_scale_x and global_scale_w are only used for nvfp4 quantization.
- Parameters:
x (array) – Input array.
w (array) – Weight matrix. If quantized, it is packed in unsigned integers.
scales (array, optional) – The scales to use per group_size elements of w if w is quantized. Default: None.
group_size (int, optional) – Number of elements in x and w that share a scale. See supported values and defaults in the table of quantization modes. Default: None.
bits (int, optional) – Number of bits used to represent each element of x and w. See supported values and defaults in the table of quantization modes. Default: None.
mode (str, optional) – The quantization mode. Default: "nvfp4". Supported modes are nvfp4 and mxfp8. See the table of quantization modes for details.
global_scale_x (array, optional) – The per-input float32 scale used for x with "nvfp4" quantization. Default: None.
global_scale_w (array, optional) – The per-input float32 scale used for w with "nvfp4" quantization. Default: None.
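The role of group_size and bits can be illustrated with a minimal pure-Python sketch. It is not MLX's implementation and does not reproduce the nvfp4 or mxfp8 element formats: a symmetric signed integer grid stands in for them. It also assumes that groups run along the last axis of each operand and that the product is x @ w.T. The helper names quantize_groups, dequantize_groups, and qqmm_sketch are hypothetical and exist only for this illustration.

```python
# Conceptual sketch only: a symmetric integer grid stands in for the real
# nvfp4/mxfp8 element formats, which are NOT reproduced here.

def quantize_groups(row, group_size, bits):
    """Quantize a 1-D list so each run of group_size elements shares one scale."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    qmax = 2 ** (bits - 1) - 1  # largest magnitude on the signed grid
    codes, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        amax = max(abs(v) for v in group)
        scale = amax / qmax if amax > 0 else 1.0  # one scale per group
        scales.append(scale)
        codes.extend(round(v / scale) for v in group)
    return codes, scales


def dequantize_groups(codes, scales, group_size):
    """Map integer codes back to floats using the scale of their group."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def qqmm_sketch(x, w, group_size=4, bits=8):
    """Quantize both operands group-wise, then compute x @ w.T (assumed layout)."""
    def roundtrip(matrix):
        out = []
        for row in matrix:
            codes, scales = quantize_groups(row, group_size, bits)
            out.append(dequantize_groups(codes, scales, group_size))
        return out

    xq, wq = roundtrip(x), roundtrip(w)
    return [[sum(a * b for a, b in zip(xr, wr)) for wr in wq] for xr in xq]


# Usage: compare the quantized product with the exact one.
x = [[1.0, -2.0, 0.5, 4.0, 0.1, 0.2, -0.3, 0.4]]
w = [
    [0.5, 0.5, 0.5, 0.5, 1.0, -1.0, 1.0, -1.0],
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0],
]
approx = qqmm_sketch(x, w)
exact = [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]
print(approx, exact)
```

Each group's scale is derived from that group's largest magnitude, so a large value in one group does not reduce the precision of the others. This is also why, in the sketch, the scales only make sense together with the group_size and bits that produced them.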
- Returns:
The result of the multiplication of quantized x with quantized w.
- Return type: