mlx.core.fast.precompiled_cuda_kernel

precompiled_cuda_kernel(*, name: str, compiled_source: bytes, inputs: collections.abc.Sequence[bool | int | float | mlx.core.array | ndarray[order='C', writable=False] | complex | mlx.core.ArrayLike], output_shapes: collections.abc.Sequence[tuple[int, ...]], output_dtypes: collections.abc.Sequence[mlx.core.Dtype], scalars: collections.abc.Sequence[object], grid: tuple[int, int, int], threadgroup: tuple[int, int, int], shared_memory: int = 0, init_value: float | None = None, ensure_row_contiguous: bool = False, stream: mlx.core.Stream | mlx.core.ThreadLocalStream | mlx.core.Device | None = None) → list[array]

Run a precompiled CUDA kernel defined from PTX or cubin.

This op is still experimental and various parts of the API may change.

Parameters:

name (str) – Name for the kernel.
compiled_source (bytes) – The precompiled kernel in raw bytes.
inputs (List[array]) – The inputs passed to the CUDA kernel.
output_shapes (List[Sequence[int]]) – The list of shapes for each output.
output_dtypes (List[Dtype]) – The list of data types for each output.
scalars (List[Union[bool, int, float]]) – A list of scalar arguments to pass to the kernel.
grid (tuple[int, int, int]) – 3-tuple specifying the grid to launch the kernel with. For compatibility with metal_kernel(), the grid is given in threads, not in thread blocks.
threadgroup (tuple[int, int, int]) – 3-tuple specifying the threadgroup size to use (the CUDA thread block dimensions).
shared_memory (int) – The dynamic shared memory to request for the kernel. A value of 0 means no dynamic shared memory. Default: 0.
init_value (float, optional) – Optional value to use to initialize all of the output arrays. By default, output arrays are uninitialized. Default: None.
ensure_row_contiguous (bool) – Whether to ensure the inputs are row contiguous before the kernel runs. Default: False.
stream (mx.stream, optional) – Stream to run the kernel on. Default: None.
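
Because grid counts threads rather than thread blocks, a block count from existing CUDA launch code has to be multiplied by the block size before it is passed in. The sketch below illustrates that conversion and one possible call. The helper names launch_dims and run_precompiled, the file add_one.ptx, and the kernel name add_one are all hypothetical. It also assumes the kernel takes the element count as its single scalar and bounds-checks its own index; the order in which inputs, outputs, and scalars reach the kernel is not stated here.

```python
from pathlib import Path


def launch_dims(n_elements, threads_per_block=256):
    """Return (grid, threadgroup) for a 1-D launch, with grid counted in threads."""
    if n_elements <= 0 or threads_per_block <= 0:
        raise ValueError("n_elements and threads_per_block must be positive")
    # Ceiling division: number of thread blocks needed to cover every element.
    blocks = -(-n_elements // threads_per_block)
    # Convert the block count into a thread count, as the grid parameter expects.
    grid = (blocks * threads_per_block, 1, 1)
    threadgroup = (threads_per_block, 1, 1)
    return grid, threadgroup


def run_precompiled(x, ptx_path="add_one.ptx", kernel_name="add_one"):
    """Hypothetical wrapper: run a precompiled kernel over one input array."""
    # Imported lazily so launch_dims stays usable without MLX installed.
    import mlx.core as mx

    grid, threadgroup = launch_dims(x.size)
    outputs = mx.fast.precompiled_cuda_kernel(
        name=kernel_name,
        compiled_source=Path(ptx_path).read_bytes(),  # raw PTX or cubin bytes
        inputs=[x],
        output_shapes=[x.shape],
        output_dtypes=[x.dtype],
        scalars=[x.size],  # assumed: the kernel uses this for bounds checking
        grid=grid,
        threadgroup=threadgroup,
        ensure_row_contiguous=True,
    )
    # The function returns a list with one array per requested output.
    return outputs[0]
```

Rounding the grid up to a whole number of thread blocks means some threads may fall past the end of the data, which is why the sketch passes the element count as a scalar for the kernel to check against.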
