Autotuning¶
TileLang includes a built‑in autotuner that searches configuration spaces for the best performing kernel, compiles candidates in parallel, validates correctness, benchmarks them, and caches the best result for reuse.
This guide covers two workflows:
- Decorator‑based: @tilelang.autotune(configs=...) stacked on @tilelang.jit
- Programmatic: AutoTuner.from_kernel(...).set_*().run()
It also explains input tensor supply, validation, caching, and environment variables that affect parallelism and cache behavior.
1) Decorator‑based Autotune¶
Use @tilelang.autotune above @tilelang.jit and expose tunable parameters as
function arguments with defaults. The autotuner overrides these parameters with
values from your config space.
import tilelang
import tilelang.language as T

def matmul_configs(M, N, K):
    # Example space — tailor to your target
    tiles = [64, 128]
    stages = [2, 3]
    threads = [128, 256]
    return [
        dict(block_M=BM, block_N=BN, block_K=BK, num_stages=S, threads=TH)
        for BM in tiles
        for BN in tiles
        for BK in [32, 64]
        for S in stages
        for TH in threads
    ]
@tilelang.autotune(configs=matmul_configs, warmup=25, rep=100, timeout=60)
@tilelang.jit(out_idx=[-1])
def matmul(M: int, N: int, K: int,
           block_M: int = 128, block_N: int = 128, block_K: int = 32,
           threads: int = 128, num_stages: int = 3,
           dtype: str = 'float16', accum_dtype: str = 'float32'):

    @T.prim_func
    def kernel(A: T.Tensor((M, K), dtype),
               B: T.Tensor((K, N), dtype),
               C: T.Tensor((M, N), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=threads) as (bx, by):
            A_s = T.alloc_shared((block_M, block_K), dtype)
            B_s = T.alloc_shared((block_K, block_N), dtype)
            C_f = T.alloc_fragment((block_M, block_N), accum_dtype)
            T.clear(C_f)
            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=num_stages):
                T.copy(A[by * block_M, ko * block_K], A_s)
                T.copy(B[ko * block_K, bx * block_N], B_s)
                T.gemm(A_s, B_s, C_f)
            T.copy(C_f, C[by * block_M, bx * block_N])

    return kernel
# Usage
# Provide inputs via context (recommended for reproducibility across configs)
import torch
M = N = K = 1024
A = torch.randn(M, K, device='cuda', dtype=torch.float16)
B = torch.randn(K, N, device='cuda', dtype=torch.float16)
C = torch.empty(M, N, device='cuda', dtype=torch.float16)
from tilelang.autotuner import set_autotune_inputs
with set_autotune_inputs(A, B, C):
    tuned_kernel = matmul(M, N, K)  # compiles, tunes, returns best kernel

tuned_kernel(A, B, C)  # run best kernel
Notes
- configs can be a list of dicts or a callable (args...) -> list[dict]. Each dict’s keys must match the tunable function arguments (e.g., block_M).
- The decorator returns a callable that runs autotune once per argument tuple and caches the resulting best kernel in‑process.
- For explicit input control during tuning, wrap the call with set_autotune_inputs(...). Otherwise, supply_type (below) is used.
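If the search space is small and fixed, configs can also be a literal list of dicts. A minimal sketch (the name matmul_fixed_space and the two configs are illustrative; the body is the same kernel as matmul above):

@tilelang.autotune(configs=[
    dict(block_M=64, block_N=64, block_K=32, num_stages=2, threads=128),
    dict(block_M=128, block_N=128, block_K=32, num_stages=3, threads=256),
])
@tilelang.jit(out_idx=[-1])
def matmul_fixed_space(M: int, N: int, K: int,
                       block_M: int = 64, block_N: int = 64, block_K: int = 32,
                       num_stages: int = 2, threads: int = 128):
    ...  # same kernel body as matmul above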
2) Programmatic Autotune¶
Use the AutoTuner class to manage configs and arguments more explicitly.
from tilelang.autotuner import AutoTuner
kernel_factory = matmul # the function above (already @tilelang.jit)
tuner = AutoTuner.from_kernel(kernel_factory(M, N, K), configs=matmul_configs(M, N, K))
tuner.set_profile_args(
    warmup=25, rep=100, timeout=60,
    supply_type=tilelang.TensorSupplyType.Auto,  # or provide supply_prog/ref_prog
    ref_prog=lambda A, B, C: torch.allclose(C, (A @ B).to(C.dtype), rtol=1e-2, atol=1e-2),
)
tuner.set_compile_args(
    target='auto',             # or 'cuda'/'hip'/'metal'
    execution_backend='auto',  # resolves per-target
    out_idx=[-1],              # which outputs to return if multiple
    pass_configs={             # optional TVM passes/flags
        # tilelang.PassConfigKey.EXAMPLE_KEY: value,
    },
)
artifact = tuner.run() # compiles + runs + validates all configs
best_kernel = artifact.kernel # JITKernel
best_latency = artifact.latency
best_config = artifact.config
# Reuse best kernel
best_kernel(A, B, C)
Example Gallery (in repo)¶
- examples/gdn/example_chunk_delta_h.py:101 — uses @autotune to sweep configs
- examples/deepseek_nsa/benchmark/benchmark_nsa_fwd.py:451 — uses @tilelang.autotune
- examples/quickstart.py:84 — profiles a tuned kernel with get_profiler
- examples/hadamard_transform/example_hadamard.py:152 — profiler with custom warmup
- examples/dynamic_shape/example_dynamic.py:94 — profiler for dynamic shapes
- examples/gemm/example_gemm_persistent.py:135 — compare persistent vs non‑persistent
Open any of these paths in the repository to compare patterns.
Input Tensor Supply¶
The tuner needs inputs to compile and benchmark kernels. Provide them in one of three ways (priority order):
1. Context manager (fixed inputs across configs)

   with set_autotune_inputs(A, B, C):
       tuned = matmul(M, N, K)

2. Custom supplier program

   def supply_prog(signature):
       # signature holds KernelParam objects describing shapes/dtypes
       # Return a list of torch tensors matching the kernel’s arguments
       return [A, B, C]

   tuner.set_profile_args(supply_prog=supply_prog)

3. Built‑in generators via supply_type

   - TensorSupplyType.Auto (default): heuristic per dtype (uniform ints / fp ranges)
   - Integer, Uniform, Normal, Randn, Zero, One
Important

- Built‑in generators require static shapes; if your PrimFunc uses symbolic dimensions (T.dyn), supply concrete inputs via (1) or (2).
- Float8 dtypes require PyTorch 2.1+ for torch.float8_* support.
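To pick a specific built‑in generator in the programmatic flow, pass it through supply_type, as in this sketch (reusing the tuner from section 2):

tuner.set_profile_args(
    warmup=25, rep=100,
    supply_type=tilelang.TensorSupplyType.Normal,  # normally distributed inputs instead of the Auto heuristic
)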
Correctness Checking and Tolerances¶
Use one of the following validation methods:
- ref_prog: Provide a reference program that receives the same inputs and checks results. You can return a boolean or raise on mismatch.
- manual_check_prog: A callable that inspects outputs and raises on mismatch.
- skip_check=True: Skip correctness checks (faster, use with caution).
Control numeric drift via:
- rtol and atol (defaults 1e‑2)
- max_mismatched_ratio (default 1%)
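As an example, the ref_prog route for the matmul above could look like this sketch (check_matmul is a hypothetical helper; returning False, or raising, marks the config as failed):

def check_matmul(A, B, C):
    # Compare the kernel output C against a float32 torch reference
    ref = (A.float() @ B.float()).to(C.dtype)
    return torch.allclose(C, ref, rtol=1e-2, atol=1e-2)

tuner.set_profile_args(ref_prog=check_matmul)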
Configuration Spaces and Best Practices¶
What to tune
- Tile sizes: block_M, block_N, block_K
- Software pipelining: num_stages
- Threads per block: threads (or an (x, y) tuple)
- Optional: dtype variants, epilogues, small scheduling knobs
Tips
- Start from a working baseline. Tune a small, meaningful space first.
- Respect hardware limits (shared memory bytes, registers per thread/block, max threads per block). Eliminate impossible configs up‑front (see the pruning sketch below).
- Keep block sizes multiples of vector widths and warp sizes when relevant.
- Use set_autotune_inputs to ensure each config is measured on identical data.
- Record your best configs and bake them as defaults when stable.
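A pruning sketch for the matmul space above, assuming float16 operands and a 96 KiB shared-memory budget (prune_configs is a hypothetical helper; adjust the budget and element size for your hardware and dtypes):

def prune_configs(configs, smem_budget=96 * 1024, elem_bytes=2):
    kept = []
    for c in configs:
        # Bytes for one A tile plus one B tile, replicated across pipeline stages
        tile_bytes = (c['block_M'] * c['block_K'] + c['block_K'] * c['block_N']) * elem_bytes
        if tile_bytes * c['num_stages'] <= smem_budget:
            kept.append(c)
    return kept

configs = prune_configs(matmul_configs(M, N, K))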
Parallel Compilation/Benchmarking and Timeouts¶
The tuner compiles configurations in parallel using a thread pool and benchmarks them with a per‑config timeout. On CUDA, each worker sets the current device to avoid context issues.
Notes
- timeout uses POSIX signals; on non‑Unix systems, it may not take effect.
- Logs are written to autotuner.log in the working directory.
Caching¶
The autotuner caches best artifacts both in‑memory (per process) and on disk under
$TILELANG_CACHE_DIR/autotuner. The cache key includes:
- TileLang version, function source, closure free‑vars
- Config list, compile args, profile args
Disk cache contents (per key)
- Best config and latency: best_config.json, latency.json
- Kernel sources and library: device_kernel.cu, host_kernel.cu, kernel_lib.so (or kernel.cubin/executable.so depending on backend)
- Function and params: function.pkl, params.pkl
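To see what has been cached, a sketch like the following can walk the autotuner cache on disk (the per-key directory layout is an assumption based on the file names above):

import json
import os

cache_root = os.path.expanduser(os.environ.get('TILELANG_CACHE_DIR', '~/.tilelang/cache'))
autotune_root = os.path.join(cache_root, 'autotuner')
if os.path.isdir(autotune_root):
    for key in os.listdir(autotune_root):
        cfg_path = os.path.join(autotune_root, key, 'best_config.json')
        if os.path.exists(cfg_path):
            with open(cfg_path) as f:
                print(key, json.load(f))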
Control via env vars (tilelang.env)
- TILELANG_CACHE_DIR (default ~/.tilelang/cache)
- TILELANG_TMP_DIR (default $TILELANG_CACHE_DIR/tmp)
- Disable all kernel caches: TILELANG_DISABLE_CACHE=1
- Disable autotune disk cache only: TILELANG_AUTO_TUNING_DISABLE_CACHE=1
CPU worker control
- TILELANG_AUTO_TUNING_CPU_UTILITIES (fraction, default 0.9)
- TILELANG_AUTO_TUNING_CPU_COUNTS (int, -1 auto)
- TILELANG_AUTO_TUNING_MAX_CPU_COUNT (int, -1 unlimited)
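For example, to cap the number of compile workers for a single run (a sketch; it assumes tilelang.env reads these variables from the process environment before tuning starts):

import os
os.environ['TILELANG_AUTO_TUNING_CPU_COUNTS'] = '8'  # use exactly 8 compile workers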
Backend notes
- NVRTC backend persists .cubin and a Python launcher.
- Torch/DLPack backend may not save artifacts to disk; in this case, only in‑memory caching applies and a warning is logged.
Alternative: Manual Sweeps with par_compile¶
If you prefer manual control, use JITImpl.par_compile to compile a batch of
configs and drive your own benchmarking:
@tilelang.jit
def factory(M, N, K, block_M=128, block_N=128, block_K=32):
    @T.prim_func
    def k(A: T.Tensor((M, K), 'float16'),
          B: T.Tensor((K, N), 'float16'),
          C: T.Tensor((M, N), 'float16')):
        ...
    return k

impl = factory  # JITImpl
cfgs = [
    dict(block_M=64, block_N=128, block_K=32),
    dict(block_M=128, block_N=128, block_K=64),
]
kernels = impl.par_compile(cfgs, num_workers=4)
# Now benchmark kernels[i](A, B, C) yourself
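A minimal manual benchmarking loop might look like this sketch (time_kernel is a hypothetical helper; it reuses A, B, C from earlier and times each compiled kernel with CUDA events):

import torch

def time_kernel(kernel, warmup=10, rep=50):
    for _ in range(warmup):
        kernel(A, B, C)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(rep):
        kernel(A, B, C)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / rep  # average milliseconds per call

best_cfg, best_kernel = min(zip(cfgs, kernels), key=lambda ck: time_kernel(ck[1]))
print('best config:', best_cfg)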
Recording and Reusing Best Configs¶
The programmatic path returns an AutotuneResult that can be saved and later
reloaded. This is useful for CI, multi‑host workflows, or shipping tuned configs.
artifact = tuner.run() # AutotuneResult
# Save to disk
from pathlib import Path
save_dir = Path('out/best/matmul_1024')
artifact.save_to_disk(save_dir, verbose=True)
# Reload later
from tilelang.autotuner.param import AutotuneResult, CompileArgs
restored = AutotuneResult.load_from_disk(save_dir, CompileArgs())
best = restored.kernel
best(A, B, C)
Notes
- DLPack/Torch execution backend may not persist compiled binaries; in that case, either re‑compile on load or use a different backend.
- The directory contains human‑readable JSONs (best config/latency) and sources.
Advanced: Config Space Callables¶
Derive config spaces from problem sizes to keep searches targeted and legal:
def matmul_configs(M, N, K):
    large = min(M, N, K) >= 1024
    tiles = [128] if large else [64, 128]
    for BM in tiles:
        for BN in tiles:
            for BK in [32, 64]:
                for S in [2, 3]:
                    for TH in [128, 256]:
                        yield dict(block_M=BM, block_N=BN, block_K=BK,
                                   num_stages=S, threads=TH)
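Because this version yields configs lazily, materialize it when a concrete list is needed, for example when passing it to the programmatic tuner (a sketch reusing the setup from section 2):

tuner = AutoTuner.from_kernel(matmul(M, N, K), configs=list(matmul_configs(M, N, K)))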
Device and Backend Selection¶
Tune compile‑time options explicitly:
- target='auto' | 'cuda' | 'hip' | 'metal' (normalized to a TVM Target)
- execution_backend='auto' | 'tvm_ffi' | 'ctypes' | 'cython' | 'nvrtc' | 'torch'
- pass_configs={...} to toggle TileLang/TVM passes for experiments
On CUDA with multiple GPUs, the tuner sets the current device per worker thread to avoid context mixups.
Troubleshooting¶
- “No configurations to tune”: Ensure configs is a non‑empty list or callable.
- Timeouts: Increase timeout; ensure inputs fit device memory; verify that your reference check isn’t the bottleneck.
- Dynamic shapes: Provide concrete inputs via set_autotune_inputs or a custom supply_prog.
- Disk cache disabled: Check TILELANG_AUTO_TUNING_DISABLE_CACHE and backend.