Autotuning¶
TileLang includes a built‑in autotuner that searches configuration spaces for the best performing kernel, compiles candidates in parallel, validates correctness, benchmarks them, and caches the best result for reuse.
This guide covers two workflows:
- Decorator‑based: @tilelang.autotune(configs=...) stacked on @tilelang.jit
- Programmatic: AutoTuner.from_kernel(...).set_*().run()
It also explains input tensor supply, validation, caching, and environment variables that affect parallelism and cache behavior.
1) Decorator‑based Autotune¶
Use @tilelang.autotune above @tilelang.jit and expose tunable parameters as
function arguments with defaults. The autotuner overrides these parameters with
values from your config space.
import tilelang
import tilelang.language as T

def matmul_configs(M, N, K):
    # Example space — tailor to your target
    tiles = [64, 128]
    stages = [2, 3]
    threads = [128, 256]
    return [
        dict(block_M=BM, block_N=BN, block_K=BK, num_stages=S, threads=TH)
        for BM in tiles
        for BN in tiles
        for BK in [32, 64]
        for S in stages
        for TH in threads
    ]
@tilelang.autotune(configs=matmul_configs, warmup=25, rep=100, timeout=60)
@tilelang.jit(out_idx=[-1])
def matmul(M: int, N: int, K: int,
           block_M: int = 128, block_N: int = 128, block_K: int = 32,
           threads: int = 128, num_stages: int = 3,
           dtype: str = 'float16', accum_dtype: str = 'float32'):

    @T.prim_func
    def kernel(A: T.Tensor((M, K), dtype),
               B: T.Tensor((K, N), dtype),
               C: T.Tensor((M, N), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=threads) as (bx, by):
            A_s = T.alloc_shared((block_M, block_K), dtype)
            B_s = T.alloc_shared((block_K, block_N), dtype)
            C_f = T.alloc_fragment((block_M, block_N), accum_dtype)
            T.clear(C_f)
            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=num_stages):
                T.copy(A[by * block_M, ko * block_K], A_s)
                T.copy(B[ko * block_K, bx * block_N], B_s)
                T.gemm(A_s, B_s, C_f)
            T.copy(C_f, C[by * block_M, bx * block_N])

    return kernel
# Usage
# Provide inputs via context (recommended for reproducibility across configs)
import torch
M = N = K = 1024
A = torch.randn(M, K, device='cuda', dtype=torch.float16)
B = torch.randn(K, N, device='cuda', dtype=torch.float16)
C = torch.empty(M, N, device='cuda', dtype=torch.float16)
from tilelang.autotuner import set_autotune_inputs
with set_autotune_inputs(A, B, C):
    tuned_kernel = matmul(M, N, K)  # compiles, tunes, returns best kernel

tuned_kernel(A, B, C)  # run best kernel
Notes
- configs can be a list of dicts or a callable (args...) -> list[dict]. Each dict’s keys must match the tunable function arguments (e.g., block_M).
- The decorator returns a callable that runs autotune once per argument tuple and caches the resulting best kernel in‑process.
- For explicit input control during tuning, wrap the call with set_autotune_inputs(...). Otherwise, supply_type (below) is used.
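If the search space is small and fixed, configs can also be a literal list of dicts. A minimal sketch (the name matmul_fixed_space and the two configs are illustrative; the body is the same kernel as matmul above):

@tilelang.autotune(configs=[
    dict(block_M=64, block_N=64, block_K=32, num_stages=2, threads=128),
    dict(block_M=128, block_N=128, block_K=32, num_stages=3, threads=256),
])
@tilelang.jit(out_idx=[-1])
def matmul_fixed_space(M: int, N: int, K: int,
                       block_M: int = 64, block_N: int = 64, block_K: int = 32,
                       num_stages: int = 2, threads: int = 128):
    ...  # same kernel body as matmul above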
2) Programmatic Autotune¶
Use the AutoTuner class to manage configs and arguments more explicitly.
from tilelang.autotuner import AutoTuner
kernel_factory = matmul # the function above (already @tilelang.jit)
tuner = AutoTuner.from_kernel(kernel_factory(M, N, K), configs=matmul_configs(M, N, K))
tuner.set_profile_args(
    warmup=25, rep=100, timeout=60,
    supply_type=tilelang.TensorSupplyType.Auto,  # or provide supply_prog/ref_prog
    ref_prog=lambda A, B, C: torch.allclose(C, (A @ B).to(C.dtype), rtol=1e-2, atol=1e-2),
)
tuner.set_compile_args(
    target='auto',             # or 'cuda'/'hip'/'metal'
    execution_backend='auto',  # resolves per-target
    out_idx=[-1],              # which outputs to return if multiple
    pass_configs={             # optional TVM passes/flags
        # tilelang.PassConfigKey.EXAMPLE_KEY: value,
    },
)
artifact = tuner.run() # compiles + runs + validates all configs
best_kernel = artifact.kernel # JITKernel
best_latency = artifact.latency
best_config = artifact.config
# Reuse best kernel
best_kernel(A, B, C)
Example Gallery (in repo)¶
- examples/gdn/example_chunk_delta_h.py:101 — uses @autotune to sweep configs
- examples/deepseek_nsa/benchmark/benchmark_nsa_fwd.py:451 — uses @tilelang.autotune
- examples/quickstart.py:84 — profiles a tuned kernel with get_profiler
- examples/hadamard_transform/example_hadamard.py:152 — profiler with custom warmup
- examples/dynamic_shape/example_dynamic.py:94 — profiler for dynamic shapes
- examples/gemm/example_gemm_persistent.py:135 — compare persistent vs non‑persistent
Open any of these paths in the repository to compare patterns.
Input Tensor Supply¶
The tuner needs inputs to compile and benchmark kernels. Provide them in one of three ways (priority order):
1. Context manager (fixed inputs across configs)

   with set_autotune_inputs(A, B, C):
       tuned = matmul(M, N, K)

2. Custom supplier program

   def supply_prog(signature):
       # signature holds KernelParam objects describing shapes/dtypes
       # Return a list of torch tensors matching the kernel’s arguments
       return [A, B, C]

   tuner.set_profile_args(supply_prog=supply_prog)

3. Built‑in generators via supply_type

   - TensorSupplyType.Auto (default): heuristic per dtype (uniform ints / fp ranges)
   - Integer, Uniform, Normal, Randn, Zero, One
Important

- Built‑in generators require static shapes; if your PrimFunc uses symbolic dimensions (T.dyn), supply concrete inputs via (1) or (2).
- Float8 dtypes require PyTorch 2.1+ for torch.float8_* support.
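To pick a specific built‑in generator in the programmatic flow, pass it through supply_type, as in this sketch (reusing the tuner from section 2):

tuner.set_profile_args(
    warmup=25, rep=100,
    supply_type=tilelang.TensorSupplyType.Normal,  # normally distributed inputs instead of the Auto heuristic
)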
Correctness Checking and Tolerances¶
Use one of the following validation methods:
- ref_prog: Provide a reference program that receives the same inputs and checks results. You can return a boolean or raise on mismatch.
- manual_check_prog: A callable that inspects outputs and raises on mismatch.
- skip_check=True: Skip correctness checks (faster, use with caution).
Control numeric drift via:
- rtol and atol (defaults 1e‑2)
- max_mismatched_ratio (default 1%)
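As an example, the ref_prog route for the matmul above could look like this sketch (check_matmul is a hypothetical helper; returning False, or raising, marks the config as failed):

def check_matmul(A, B, C):
    # Compare the kernel output C against a float32 torch reference
    ref = (A.float() @ B.float()).to(C.dtype)
    return torch.allclose(C, ref, rtol=1e-2, atol=1e-2)

tuner.set_profile_args(ref_prog=check_matmul)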
Configuration Spaces and Best Practices¶
What to tune
- Tile sizes: block_M, block_N, block_K
- Software pipelining: num_stages
- Threads per block: threads (or an (x, y) tuple)
- Optional: dtype variants, epilogues, small scheduling knobs
Tips
- Start from a working baseline. Tune a small, meaningful space first.
- Respect hardware limits (shared memory bytes, registers per thread/block, max threads per block). Eliminate impossible configs up‑front (see the pruning sketch below).
- Keep block sizes multiples of vector widths and warp sizes when relevant.
- Use set_autotune_inputs to ensure each config is measured on identical data.
- Record your best configs and bake them as defaults when stable.
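A pruning sketch for the matmul space above, assuming float16 operands and a 96 KiB shared-memory budget (prune_configs is a hypothetical helper; adjust the budget and element size for your hardware and dtypes):

def prune_configs(configs, smem_budget=96 * 1024, elem_bytes=2):
    kept = []
    for c in configs:
        # Bytes for one A tile plus one B tile, replicated across pipeline stages
        tile_bytes = (c['block_M'] * c['block_K'] + c['block_K'] * c['block_N']) * elem_bytes
        if tile_bytes * c['num_stages'] <= smem_budget:
            kept.append(c)
    return kept

configs = prune_configs(matmul_configs(M, N, K))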
Parallel Compilation/Benchmarking and Timeouts¶
The tuner compiles configurations in parallel using a thread pool and benchmarks them with a per‑config timeout. On CUDA, each worker sets the current device to avoid context issues.
Notes
- timeout uses POSIX signals; on non‑Unix systems, it may not take effect.
- Logs are written to autotuner.log in the working directory.
Caching¶
The autotuner caches best artifacts both in‑memory (per process) and on disk under
$TILELANG_CACHE_DIR/autotuner. The cache key includes:
- TileLang version, function source, closure free‑vars
- Config list, compile args, profile args
Disk cache contents (per key)
- Best config and latency: best_config.json, latency.json
- Kernel sources and library: device_kernel.cu, host_kernel.cu, kernel_lib.so (or kernel.cubin/executable.so depending on backend)
- Function and params: function.pkl, params.pkl
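To see what has been cached, a sketch like the following can walk the autotuner cache on disk (the per-key directory layout is an assumption based on the file names above):

import json
import os

cache_root = os.path.expanduser(os.environ.get('TILELANG_CACHE_DIR', '~/.tilelang/cache'))
autotune_root = os.path.join(cache_root, 'autotuner')
if os.path.isdir(autotune_root):
    for key in os.listdir(autotune_root):
        cfg_path = os.path.join(autotune_root, key, 'best_config.json')
        if os.path.exists(cfg_path):
            with open(cfg_path) as f:
                print(key, json.load(f))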
Control via env vars (tilelang.env)
- TILELANG_CACHE_DIR (default ~/.tilelang/cache)
- TILELANG_TMP_DIR (default $TILELANG_CACHE_DIR/tmp)
- Disable all kernel caches: TILELANG_DISABLE_CACHE=1
- Disable autotune disk cache only: TILELANG_AUTO_TUNING_DISABLE_CACHE=1
CPU worker control
- TILELANG_AUTO_TUNING_CPU_UTILITIES (fraction, default 0.9)
- TILELANG_AUTO_TUNING_CPU_COUNTS (int, -1 auto)
- TILELANG_AUTO_TUNING_MAX_CPU_COUNT (int, -1 unlimited)
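For example, to cap the number of compile workers for a single run (a sketch; it assumes tilelang.env reads these variables from the process environment before tuning starts):

import os
os.environ['TILELANG_AUTO_TUNING_CPU_COUNTS'] = '8'  # use exactly 8 compile workers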
Backend notes
- NVRTC backend persists .cubin and a Python launcher.
- Torch/DLPack backend may not save artifacts to disk; in this case, only in‑memory caching applies and a warning is logged.
Alternative: Manual Sweeps with par_compile¶
If you prefer manual control, use JITImpl.par_compile to compile a batch of
configs and drive your own benchmarking:
@tilelang.jit
def factory(M, N, K, block_M=128, block_N=128, block_K=32):
    @T.prim_func
    def k(A: T.Tensor((M, K), 'float16'),
          B: T.Tensor((K, N), 'float16'),
          C: T.Tensor((M, N), 'float16')):
        ...
    return k

impl = factory  # JITImpl
cfgs = [
    dict(block_M=64, block_N=128, block_K=32),
    dict(block_M=128, block_N=128, block_K=64),
]
kernels = impl.par_compile(cfgs, num_workers=4)
# Now benchmark kernels[i](A, B, C) yourself
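A minimal manual benchmarking loop might look like this sketch (time_kernel is a hypothetical helper; it reuses A, B, C from earlier and times each compiled kernel with CUDA events):

import torch

def time_kernel(kernel, warmup=10, rep=50):
    for _ in range(warmup):
        kernel(A, B, C)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(rep):
        kernel(A, B, C)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / rep  # average milliseconds per call

best_cfg, best_kernel = min(zip(cfgs, kernels), key=lambda ck: time_kernel(ck[1]))
print('best config:', best_cfg)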
Recording and Reusing Best Configs¶
The programmatic path returns an AutotuneResult that can be saved and later
reloaded. This is useful for CI, multi‑host workflows, or shipping tuned configs.
artifact = tuner.run() # AutotuneResult
# Save to disk
from pathlib import Path
save_dir = Path('out/best/matmul_1024')
artifact.save_to_disk(save_dir, verbose=True)
# Reload later
from tilelang.autotuner.param import AutotuneResult, CompileArgs
restored = AutotuneResult.load_from_disk(save_dir, CompileArgs())
best = restored.kernel
best(A, B, C)
Notes
- DLPack/Torch execution backend may not persist compiled binaries; in that case, either re‑compile on load or use a different backend.
- The directory contains human‑readable JSONs (best config/latency) and sources.
Advanced: Config Space Callables¶
Derive config spaces from problem sizes to keep searches targeted and legal:
def matmul_configs(M, N, K):
    large = min(M, N, K) >= 1024
    tiles = [128] if large else [64, 128]
    for BM in tiles:
        for BN in tiles:
            for BK in [32, 64]:
                for S in [2, 3]:
                    for TH in [128, 256]:
                        yield dict(block_M=BM, block_N=BN, block_K=BK,
                                   num_stages=S, threads=TH)
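Because this version yields configs lazily, materialize it when a concrete list is needed, for example when passing it to the programmatic tuner (a sketch reusing the setup from section 2):

tuner = AutoTuner.from_kernel(matmul(M, N, K), configs=list(matmul_configs(M, N, K)))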
Device and Backend Selection¶
Tune compile‑time options explicitly:
- target='auto' | 'cuda' | 'hip' | 'metal' (normalized to a TVM Target)
- execution_backend='auto' | 'tvm_ffi' | 'ctypes' | 'cython' | 'nvrtc' | 'torch'
- pass_configs={...} to toggle TileLang/TVM passes for experiments
On CUDA with multiple GPUs, the tuner sets the current device per worker thread to avoid context mixups.
Troubleshooting¶
- “No configurations to tune”: Ensure configs is a non‑empty list or callable.
- Timeouts: Increase timeout; ensure inputs fit device memory; verify that your reference check isn’t the bottleneck.
- Dynamic shapes: Provide concrete inputs via set_autotune_inputs or a custom supply_prog.
- Disk cache disabled: Check TILELANG_AUTO_TUNING_DISABLE_CACHE and backend.