Language Basics¶

This page introduces the core TileLang (tile‑lang) DSL that you’ll use to write high‑performance kernels. It focuses on how to define a kernel, express iteration, move data across memory scopes, and run it with JIT.

The examples use the conventional aliases:

import tilelang
import tilelang.language as T
from tilelang import jit

1. Defining a Kernel with @T.prim_func¶

TileLang kernels are TIR (TVM IR) functions produced by the @T.prim_func decorator. Arguments are annotated with shapes and dtypes via T.Tensor or T.Buffer.

Note on dtypes

  • You can pass dtypes as a string (e.g., 'float32'), a TileLang dtype (e.g., T.float32), or a framework dtype (e.g., torch.float32). TileLang normalizes all of these. See Type System for details.

@T.prim_func
def add_kernel(
    A: T.Tensor((N,), dtype),    # dtype could be 'float32' | T.float32 | torch.float32
    B: T.Tensor((N,), dtype),
    C: T.Tensor((N,), dtype),
):
    ...  # kernel body
  • Shapes may be concrete integers or symbolic. For symbolic, you can pass Python ints through the outer @jit wrapper (shown below), or annotate with T.dyn when you want a named symbolic dimension.

# Named symbolic dimension (optional)
K = T.dyn['K']
@T.prim_func
def uses_dyn(A: T.Tensor((K,), 'float32')):
    ...

Dynamic symbolic dimensions: two ways¶

TileLang supports two complementary ways to introduce symbolic (dynamic) dims:

  • Type-level annotations via T.dyn[...] (recommended for function signatures)

    • Use in T.Tensor((T.dyn['K'], ...), dtype) or bind once then reuse (as above).

    • Inside the kernel body, prefer reading from the buffer’s shape, e.g. M = A.shape[0].

  • Term-level variables via T.dynamic(name, dtype)

    • Creates a TIR tir.Var you can use directly in expressions/loops.

    • Handy when you need to reference the dimension symbol in the body.

# 1) Annotation-only symbol; read the bound size via shape
K = T.dyn['K']  # dtype defaults to int32
@T.prim_func
def foo(A: T.Tensor((K,), 'float32')):
    N = A.shape[0]
    for i in T.serial(N):
        ...

# 2) Explicit Var symbol usable in the body
K = T.dynamic('K', 'int32')   # dtype defaults to int32, so T.dynamic('K') is equivalent
@T.prim_func
def bar(A: T.Tensor((K,), 'float32')):
    for i in T.serial(K):
        ...

Notes

  • T.symbolic(name, dtype) is a deprecated alias of T.dynamic; prefer T.dynamic.

  • Under @jit, concrete sizes come from the actual tensor arguments at the first call.

  • Symbols in annotations do not need to be separate kernel arguments; TileLang binds them from argument shapes.
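As a concrete illustration of that binding, here is a minimal sketch that reuses the aliases from the top of this page plus torch; the factory name make_scale and the block size are illustrative, not part of the TileLang API:

import torch

@jit
def make_scale(block: int = 256):
    K = T.dynamic('K')  # int32 symbol; bound from the argument shape, never passed explicitly

    @T.prim_func
    def scale(A: T.Tensor((K,), 'float32'), B: T.Tensor((K,), 'float32')):
        with T.Kernel(T.ceildiv(K, block), threads=block) as bx:
            for i in T.Parallel(block):
                gi = bx * block + i
                if gi < K:                 # guard the tail block
                    B[gi] = A[gi] * 2.0

    return scale

kernel = make_scale()
x = torch.randn(1000, device='cuda')       # any length works; K is read from x.shape
y = torch.empty_like(x)
kernel(x, y)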

2. Launching Work with T.Kernel¶

with T.Kernel(...) declares a launch context and creates block/thread bindings. For GPU backends, specify a grid and threads per block.

with T.Kernel(grid_x, grid_y, threads=128) as (bx, by):
    ...  # bx/by are blockIdx.x/y

You rarely need raw thread indices; most kernels use structured loops (T.serial, T.unroll, T.Parallel, T.Pipelined) inside a T.Kernel.
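For instance, a 2D launch sketch (BM and BN are illustrative tile sizes, here assumed to divide M and N evenly) in which the block indices pick a tile and a structured loop covers its elements:

with T.Kernel(T.ceildiv(N, BN), T.ceildiv(M, BM), threads=128) as (bx, by):
    # (by, bx) select the tile; T.Parallel enumerates the elements inside it
    for i, j in T.Parallel(BM, BN):
        C[by * BM + i, bx * BN + j] = 2.0 * A[by * BM + i, bx * BN + j]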

3. Loops and Control Flow¶

Core loop constructs map to familiar hardware patterns:

  • T.serial(start, stop[, step]): plain for‑loop

  • T.unroll(start, stop[, step]): unrolled loop

  • T.Parallel(ext0, ext1, ...): nested parallel loops (elementwise‑friendly)

  • T.Pipelined(iters, num_stages=N): software pipelining for producer/consumer

for i in T.serial(N):
    ...

for i, j in T.Parallel(M, N):
    C[i, j] = A[i, j] + B[i, j]

for k in T.Pipelined(T.ceildiv(K, BK), num_stages=3):
    # overlap copy/compute across stages
    ...

Conditionals use standard Python if/else. Guard edges with predicates when tile sizes do not divide problem sizes evenly.
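For example, a predicated tail for a 1D tile (block is the tile size, bx the block index) is written with an ordinary Python if:

for i in T.Parallel(block):
    gi = bx * block + i
    if gi < N:                 # predicate the last block when block does not divide N
        C[gi] = A[gi] + B[gi]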

4. Memory Scopes and Allocation¶

TileLang exposes key software‑managed scopes:

  • Global: device memory (default for T.Tensor arguments)

  • Shared: on‑chip, block‑visible (T.alloc_shared(shape, dtype))

  • Fragment and scalar: per‑thread register fragments and scalar variables that you index with tile‑level coordinates, as if they were shared; the compiler partitions them across threads (T.alloc_fragment, T.alloc_var)

A_shared = T.alloc_shared((BM, BK), 'float16')
B_shared = T.alloc_shared((BK, BN), 'float16')
C_local  = T.alloc_fragment((BM, BN), 'float32')
T.clear(C_local)  # zero accumulators

5. Moving Data: T.copy¶

Use T.copy(src, dst) to move tiles between scopes. It accepts buffers, buffer regions, or buffer loads; extents are inferred or can be broadcast.

# Global -> Shared (tile copy), extents inferred from dst
T.copy(A[by * BM, ko * BK], A_shared)
T.copy(B[ko * BK, bx * BN], B_shared)

# Fragment -> Global (store back)
T.copy(C_local, C[by * BM, bx * BN])

T.copy performs coalescing and scope‑specific lowering during compilation.

6. A Minimal End‑to‑End Example (Vector Add)¶

import tilelang
import tilelang.language as T
from tilelang import jit

@jit  # infers target from tensors at first call
def add(N: int, block: int = 256, dtype: str = 'float32'):

    @T.prim_func
    def add_kernel(
        A: T.Tensor((N,), dtype),
        B: T.Tensor((N,), dtype),
        C: T.Tensor((N,), dtype),
    ):
        with T.Kernel(T.ceildiv(N, block), threads=block) as bx:
            for i in T.Parallel(block):
                gi = bx * block + i
            # An explicit bounds check (if gi < N) is optional here:
            # the LegalizeSafeMemoryAccess pass inserts a guard when an access may be OOB
                C[gi] = A[gi] + B[gi]

    return add_kernel

# Host side (PyTorch shown; NumPy/DLPack also supported)
import torch
N = 1 << 20
A = torch.randn(N, device='cuda', dtype=torch.float32)
B = torch.randn(N, device='cuda', dtype=torch.float32)
C = torch.empty(N, device='cuda', dtype=torch.float32)

kernel = add(N)
kernel(A, B, C)  # runs on GPU
torch.testing.assert_close(C, A + B)

Notes

  • The @jit wrapper (called as add(N) above) returns a callable kernel; compilation happens on the first call and is reused afterwards.

  • You can pass compile‑time tunables (tile sizes, dtypes) through the outer Python function and bake them into the generated TIR.
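For example, reusing the add factory above, different tunables simply produce a different compiled kernel (the half‑precision variant below is illustrative):

kernel_fp16 = add(N, block=512, dtype='float16')   # new tile size and dtype baked in
A16 = torch.randn(N, device='cuda', dtype=torch.float16)
B16 = torch.randn(N, device='cuda', dtype=torch.float16)
C16 = torch.empty(N, device='cuda', dtype=torch.float16)
kernel_fp16(A16, B16, C16)
torch.testing.assert_close(C16, A16 + B16)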

7. Tiled GEMM Skeleton¶

Below is a minimal pattern for a tiled GEMM using shared memory staging and a fragment accumulator. It mirrors the quickstart style found in the repository.

@T.prim_func
def gemm(
    A: T.Tensor((M, K), 'float16'),
    B: T.Tensor((K, N), 'float16'),
    C: T.Tensor((M, N), 'float16'),
):
    with T.Kernel(T.ceildiv(N, BN), T.ceildiv(M, BM), threads=128) as (bx, by):
        A_s = T.alloc_shared((BM, BK), 'float16')
        B_s = T.alloc_shared((BK, BN), 'float16')
        C_f = T.alloc_fragment((BM, BN), 'float32')
        T.clear(C_f)

        for ko in T.Pipelined(T.ceildiv(K, BK), num_stages=3):
            T.copy(A[by * BM, ko * BK], A_s)
            T.copy(B[ko * BK, bx * BN], B_s)
            T.gemm(A_s, B_s, C_f)  # lowered to tensor‑core / ISA‑specific instructions

        T.copy(C_f, C[by * BM, bx * BN])
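To actually run the skeleton, wrap it in a @jit factory exactly as in the vector‑add example. A minimal sketch, assuming the factory name matmul and the default tile sizes shown (both are illustrative, not prescriptive):

import torch

@jit
def matmul(M: int, N: int, K: int, BM: int = 128, BN: int = 128, BK: int = 32):

    @T.prim_func
    def gemm(
        A: T.Tensor((M, K), 'float16'),
        B: T.Tensor((K, N), 'float16'),
        C: T.Tensor((M, N), 'float16'),
    ):
        with T.Kernel(T.ceildiv(N, BN), T.ceildiv(M, BM), threads=128) as (bx, by):
            A_s = T.alloc_shared((BM, BK), 'float16')
            B_s = T.alloc_shared((BK, BN), 'float16')
            C_f = T.alloc_fragment((BM, BN), 'float32')
            T.clear(C_f)
            for ko in T.Pipelined(T.ceildiv(K, BK), num_stages=3):
                T.copy(A[by * BM, ko * BK], A_s)
                T.copy(B[ko * BK, bx * BN], B_s)
                T.gemm(A_s, B_s, C_f)
            T.copy(C_f, C[by * BM, bx * BN])

    return gemm

M = N = K = 1024
a = torch.randn(M, K, device='cuda', dtype=torch.float16)
b = torch.randn(K, N, device='cuda', dtype=torch.float16)
c = torch.empty(M, N, device='cuda', dtype=torch.float16)
kernel = matmul(M, N, K)
kernel(a, b, c)
ref = (a.float() @ b.float()).half()
torch.testing.assert_close(c, ref, rtol=1e-2, atol=1e-2)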

8. Debugging and Printing¶

Use T.print inside a kernel for quick introspection. For shared and fragment scopes, TileLang emits the print from a single thread to avoid flooding the output.

T.print(C_f, msg='accumulator:')
T.print(A_s, msg='A tile:')
T.print(C[0], msg='C[0] = ')
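When even that is too noisy, you can guard prints with an ordinary conditional on the block indices; a small sketch using the GEMM names from above:

if bx == 0 and by == 0:          # only block (0, 0) prints its accumulator
    T.print(C_f, msg='block (0,0) accumulator:')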

9. Where to Go Next¶

  • Control flow details: see Programming Guides → Control Flow

  • Memory topics: the basics (scopes, allocation, and copies) are covered inline above; see the Programming Guides for more

  • Autotuning tile sizes and mappings: Programming Guides → Autotuning

  • Operator examples (GEMM, GEMV, attention): see Deep Learning Operators