# Autotuning

TileLang includes a built-in autotuner that searches a configuration space for the best-performing kernel: it compiles candidates in parallel, validates correctness, benchmarks them, and caches the best result for reuse. This guide covers two workflows:

- Decorator-based: `@tilelang.autotune(configs=...)` stacked on `@tilelang.jit`
- Programmatic: `AutoTuner.from_kernel(...).set_*().run()`

It also explains input tensor supply, validation, caching, and the environment variables that affect parallelism and cache behavior.

## 1) Decorator-based Autotune

Use `@tilelang.autotune` above `@tilelang.jit` and expose tunable parameters as function arguments with defaults. The autotuner overrides these parameters with values from your config space.

```python
import tilelang
import tilelang.language as T


def matmul_configs(M, N, K):
    # Example space — tailor to your target
    tiles = [64, 128]
    stages = [2, 3]
    threads = [128, 256]
    return [
        dict(block_M=BM, block_N=BN, block_K=BK, num_stages=S, threads=TH)
        for BM in tiles
        for BN in tiles
        for BK in [32, 64]
        for S in stages
        for TH in threads
    ]


@tilelang.autotune(configs=matmul_configs, warmup=25, rep=100, timeout=60)
@tilelang.jit(out_idx=[-1])
def matmul(M: int, N: int, K: int,
           block_M: int = 128, block_N: int = 128, block_K: int = 32,
           threads: int = 128, num_stages: int = 3,
           dtype: str = 'float16', accum_dtype: str = 'float32'):

    @T.prim_func
    def kernel(A: T.Tensor((M, K), dtype),
               B: T.Tensor((K, N), dtype),
               C: T.Tensor((M, N), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=threads) as (bx, by):
            A_s = T.alloc_shared((block_M, block_K), dtype)
            B_s = T.alloc_shared((block_K, block_N), dtype)
            C_f = T.alloc_fragment((block_M, block_N), accum_dtype)
            T.clear(C_f)
            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=num_stages):
                T.copy(A[by * block_M, ko * block_K], A_s)
                T.copy(B[ko * block_K, bx * block_N], B_s)
                T.gemm(A_s, B_s, C_f)
            T.copy(C_f, C[by * block_M, bx * block_N])

    return kernel


# Usage
# Provide inputs via context (recommended for reproducibility across configs)
import torch

M = N = K = 1024
A = torch.randn(M, K, device='cuda', dtype=torch.float16)
B = torch.randn(K, N, device='cuda', dtype=torch.float16)
C = torch.empty(M, N, device='cuda', dtype=torch.float16)

from tilelang.autotuner import set_autotune_inputs

with set_autotune_inputs(A, B, C):
    tuned_kernel = matmul(M, N, K)  # compiles, tunes, returns best kernel

tuned_kernel(A, B, C)  # run best kernel
```

Notes

- `configs` can be a list of dicts or a callable `(args...) -> list[dict]`. Each dict's keys must match the tunable function arguments (e.g., `block_M`).
- The decorator returns a callable that runs autotune once per argument tuple and caches the resulting best kernel in-process (see the sketch below).
- For explicit input control during tuning, wrap the call with `set_autotune_inputs(...)`. Otherwise, `supply_type` (below) is used.
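The per-argument-tuple caching noted above means repeat calls are cheap. A minimal sketch, assuming the `matmul` definition above and enough GPU memory for the larger problem size:

```python
# Hedged sketch: the decorated callable memoizes the tuning result per
# argument tuple for the lifetime of the process (in-memory cache).
k1 = matmul(1024, 1024, 1024)   # first call: runs the full search
k2 = matmul(1024, 1024, 1024)   # same arguments: reuses the cached best kernel
k3 = matmul(2048, 2048, 2048)   # new argument tuple: triggers a new search
```

Without `set_autotune_inputs`, the inputs for these searches come from the configured `supply_type` generator (see "Input Tensor Supply" below).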
## 2) Programmatic Autotune

Use the `AutoTuner` class to manage configs and arguments more explicitly.

```python
from tilelang.autotuner import AutoTuner

# Use a plain @tilelang.jit factory here (i.e., without @tilelang.autotune
# stacked on top), so calling it yields a kernel for the tuner to drive.
kernel_factory = matmul

tuner = AutoTuner.from_kernel(kernel_factory(M, N, K),
                              configs=matmul_configs(M, N, K))

tuner.set_profile_args(
    warmup=25,
    rep=100,
    timeout=60,
    supply_type=tilelang.TensorSupplyType.Auto,  # or provide supply_prog/ref_prog
    ref_prog=lambda A, B, C: torch.allclose(C, (A @ B).to(C.dtype), rtol=1e-2, atol=1e-2),
)

tuner.set_compile_args(
    target='auto',             # or 'cuda'/'hip'/'metal'
    execution_backend='auto',  # resolves per target
    out_idx=[-1],              # which outputs to return if multiple
    pass_configs={             # optional TVM passes/flags
        # tilelang.PassConfigKey.EXAMPLE_KEY: value,
    },
)

artifact = tuner.run()           # compiles + runs + validates all configs
best_kernel = artifact.kernel    # JITKernel
best_latency = artifact.latency
best_config = artifact.config

# Reuse best kernel
best_kernel(A, B, C)
```

### Example Gallery (in repo)

- examples/gdn/example_chunk_delta_h.py:101 — uses `@autotune` to sweep configs
- examples/deepseek_nsa/benchmark/benchmark_nsa_fwd.py:451 — uses `@tilelang.autotune`
- examples/quickstart.py:84 — profiles a tuned kernel with `get_profiler`
- examples/hadamard_transform/example_hadamard.py:152 — profiler with custom warmup
- examples/dynamic_shape/example_dynamic.py:94 — profiler for dynamic shapes
- examples/gemm/example_gemm_persistent.py:135 — compare persistent vs non-persistent

Open any of these paths to compare the patterns in context.

## Input Tensor Supply

The tuner needs inputs to compile and benchmark kernels. Provide them in one of three ways (priority order):

1) Context manager (fixed inputs across configs)

```python
with set_autotune_inputs(A, B, C):
    tuned = matmul(M, N, K)
```

2) Custom supplier program

```python
def supply_prog(signature):
    # signature holds KernelParam objects describing shapes/dtypes.
    # Return a list of torch tensors matching the kernel's arguments.
    return [A, B, C]

tuner.set_profile_args(supply_prog=supply_prog)
```

3) Built-in generators via `supply_type`

- `TensorSupplyType.Auto` (default): heuristic per dtype (uniform ints / fp ranges)
- `Integer`, `Uniform`, `Normal`, `Randn`, `Zero`, `One`

Important

- Built-in generators require static shapes; if your PrimFunc uses symbolic dimensions (`T.dyn`), supply concrete inputs via (1) or (2).
- Float8 dtypes require PyTorch 2.1+ for `torch.float8_*` support.

## Correctness Checking and Tolerances

Use one of the following validation methods:

- `ref_prog`: A reference program that receives the same inputs and checks results. It can return a boolean or raise on mismatch.
- `manual_check_prog`: A callable that inspects outputs and raises on mismatch.
- `skip_check=True`: Skip correctness checks (faster, use with caution).

Control numeric drift via:

- `rtol` and `atol` (defaults 1e-2)
- `max_mismatched_ratio` (default 1%)

## Configuration Spaces and Best Practices

What to tune

- Tile sizes: `block_M`, `block_N`, `block_K`
- Software pipelining: `num_stages`
- Threads per block: `threads` (or an `(x, y)` tuple)
- Optional: dtype variants, epilogues, small scheduling knobs

Tips

- Start from a working baseline. Tune a small, meaningful space first.
- Respect hardware limits (shared memory bytes, registers per thread/block, max threads per block). Eliminate impossible configs up front (see the sketch after this list).
- Keep block sizes multiples of vector widths and warp sizes when relevant.
- Use `set_autotune_inputs` to ensure each config is measured on identical data.
- Record your best configs and bake them in as defaults when stable.
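As a concrete version of the hardware-limit tip, the filter below drops configs whose staged tiles would not fit in shared memory. It is a minimal sketch: the 48 KiB budget, the 2-byte fp16 element size, and the assumption that each pipeline stage keeps one A tile and one B tile resident are illustrative values, not queried from the device.

```python
# Hedged sketch: prune configs that would overflow an assumed shared-memory
# budget before handing the list to the tuner.
def fits_shared_memory(cfg, elem_bytes=2, budget_bytes=48 * 1024):
    a_tile = cfg["block_M"] * cfg["block_K"] * elem_bytes
    b_tile = cfg["block_K"] * cfg["block_N"] * elem_bytes
    # Assumption: each pipeline stage holds one A tile and one B tile.
    return (a_tile + b_tile) * cfg["num_stages"] <= budget_bytes

pruned_configs = [cfg for cfg in matmul_configs(1024, 1024, 1024)
                  if fits_shared_memory(cfg)]
```

The pruned list can then be passed as `configs=pruned_configs` to either the decorator or `AutoTuner.from_kernel`.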
## Parallel Compilation/Benchmarking and Timeouts

The tuner compiles configurations in parallel using a thread pool and benchmarks them with a per-config timeout. On CUDA, each worker sets the current device to avoid context issues.

Notes

- `timeout` uses POSIX signals; on non-Unix systems it may not take effect.
- Logs are written to `autotuner.log` in the working directory.

## Caching

The autotuner caches best artifacts both in memory (per process) and on disk under `$TILELANG_CACHE_DIR/autotuner`. The cache key includes:

- TileLang version, function source, closure free variables
- Config list, compile args, profile args

Disk cache contents (per key)

- Best config and latency: `best_config.json`, `latency.json`
- Kernel sources and library: `device_kernel.cu`, `host_kernel.cu`, `kernel_lib.so` (or `kernel.cubin`/`executable.so` depending on backend)
- Function and params: `function.pkl`, `params.pkl`

Control via env vars (tilelang.env)

- `TILELANG_CACHE_DIR` (default `~/.tilelang/cache`)
- `TILELANG_TMP_DIR` (default `$TILELANG_CACHE_DIR/tmp`)
- Disable all kernel caches: `TILELANG_DISABLE_CACHE=1`
- Disable autotune disk cache only: `TILELANG_AUTO_TUNING_DISABLE_CACHE=1`

CPU worker control

- `TILELANG_AUTO_TUNING_CPU_UTILITIES` (fraction, default 0.9)
- `TILELANG_AUTO_TUNING_CPU_COUNTS` (int, `-1` auto)
- `TILELANG_AUTO_TUNING_MAX_CPU_COUNT` (int, `-1` unlimited)

Backend notes

- The NVRTC backend persists `.cubin` and a Python launcher.
- The Torch/DLPack backend may not save artifacts to disk; in that case, only in-memory caching applies and a warning is logged.

## Alternative: Manual Sweeps with par_compile

If you prefer manual control, use `JITImpl.par_compile` to compile a batch of configs and drive your own benchmarking:

```python
@tilelang.jit
def factory(M, N, K, block_M=128, block_N=128, block_K=32):
    @T.prim_func
    def k(A: T.Tensor((M, K), 'float16'),
          B: T.Tensor((K, N), 'float16'),
          C: T.Tensor((M, N), 'float16')):
        ...
    return k

impl = factory  # JITImpl

cfgs = [
    dict(block_M=64, block_N=128, block_K=32),
    dict(block_M=128, block_N=128, block_K=64),
]
kernels = impl.par_compile(cfgs, num_workers=4)
# Now benchmark kernels[i](A, B, C) yourself
```

## Recording and Reusing Best Configs

The programmatic path returns an `AutotuneResult` that can be saved and later reloaded. This is useful for CI, multi-host workflows, or shipping tuned configs.

```python
artifact = tuner.run()  # AutotuneResult

# Save to disk
from pathlib import Path

save_dir = Path('out/best/matmul_1024')
artifact.save_to_disk(save_dir, verbose=True)

# Reload later
from tilelang.autotuner.param import AutotuneResult, CompileArgs

restored = AutotuneResult.load_from_disk(save_dir, CompileArgs())
best = restored.kernel
best(A, B, C)
```

Notes

- The DLPack/Torch execution backend may not persist compiled binaries; in that case, recompilation is needed on load, or use a different backend.
- The directory contains human-readable JSONs (best config/latency) and the generated sources (see the sketch below).
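If you only need the winning parameters rather than the compiled kernel, the saved JSON can be read back as plain kwargs, for example to bake them in as defaults. A minimal sketch, assuming `save_to_disk` writes a `best_config.json` mirroring the cache layout described above:

```python
import json
from pathlib import Path

# Hedged sketch: recover the winning tunable parameters as a plain dict.
# The best_config.json filename is an assumption based on the cache layout
# above; adjust it if your TileLang version lays out the directory differently.
save_dir = Path('out/best/matmul_1024')
best_cfg = json.loads((save_dir / 'best_config.json').read_text())
print(best_cfg)  # e.g. {'block_M': 128, 'block_N': 128, 'block_K': 64, ...}
```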
## Advanced: Config Space Callables

Derive config spaces from problem sizes to keep searches targeted and legal:

```python
def matmul_configs(M, N, K):
    large = min(M, N, K) >= 1024
    tiles = [128] if large else [64, 128]
    for BM in tiles:
        for BN in tiles:
            for BK in [32, 64]:
                for S in [2, 3]:
                    for TH in [128, 256]:
                        yield dict(block_M=BM, block_N=BN, block_K=BK,
                                   num_stages=S, threads=TH)
```

## Device and Backend Selection

Tune compile-time options explicitly:

- `target='auto'|'cuda'|'hip'|'metal'` (normalized to a TVM Target)
- `execution_backend='auto'|'tvm_ffi'|'ctypes'|'cython'|'nvrtc'|'torch'`
- `pass_configs={...}` to toggle TileLang/TVM passes for experiments

On CUDA with multiple GPUs, the tuner sets the current device per worker thread to avoid context mix-ups.

## Troubleshooting

- “No configurations to tune”: Ensure `configs` is a non-empty list or callable.
- Timeouts: Increase `timeout`; ensure inputs fit in device memory; verify that your reference check isn’t the bottleneck.
- Dynamic shapes: Provide concrete inputs via `set_autotune_inputs` or a custom `supply_prog`.
- Disk cache disabled: Check `TILELANG_AUTO_TUNING_DISABLE_CACHE` and the execution backend in use (see the sketch below).
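For the cache-related item, a quick check of the environment variables listed in the Caching section narrows things down. A minimal sketch:

```python
import os

# Hedged sketch: print the cache-related environment variables documented
# in the Caching section; "<unset>" means the built-in default applies.
for var in ("TILELANG_CACHE_DIR",
            "TILELANG_DISABLE_CACHE",
            "TILELANG_AUTO_TUNING_DISABLE_CACHE"):
    print(f"{var} = {os.environ.get(var, '<unset>')}")
```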