# Instructions
This page summarizes the core TileLang “instructions” available at the DSL level, how they map to hardware concepts, and how to use them correctly.
## Quick Categories

- **Data movement:** `T.copy`, `T.c2d_im2col`, staging Global ↔ Shared ↔ Fragment
- **Compute primitives:** `T.gemm`/`T.gemm_sp`, elementwise math (`T.exp`, `T.max`), reductions (`T.reduce_sum`, `T.cumsum`, warp reducers)
- **Control helpers:** `T.clear`/`T.fill`, `T.reshape`/`T.view`
- **Diagnostics:** `T.print`, `T.device_assert`
- **Advanced:** atomics, memory barriers, warp‑group ops
## Data Movement

Use `T.copy(src, dst, coalesced_width=None, disable_tma=False, eviction_policy=None)`
to move tiles between memory scopes. It accepts `tir.Buffer`, `BufferLoad`, or
`BufferRegion`; extents are inferred or broadcast when possible.
```python
# Global → Shared tiles (extents inferred from dst)
T.copy(A[by * BM, ko * BK], A_s)
T.copy(B[ko * BK, bx * BN], B_s)

# Fragment/Register → Global (store result)
T.copy(C_f, C[by * BM, bx * BN])
```
**Semantics**

- Extents are deduced from the arguments; a side with missing extents is broadcast to match the other side's rank.
- Access patterns are legalized and coalesced during lowering; explicit vectorization is not required in HL mode.
- Safety: the `LegalizeSafeMemoryAccess` pass inserts boundary guards when an access may be out‑of‑bounds and drops them when they are provably safe.
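When extent inference is ambiguous, the region can be spelled out with slices on both coordinates. A minimal sketch of the explicit form, equivalent to the Global → Shared copy above (`BM`/`BK` come from the surrounding tile context):

```python
# Explicit-extent form: the slice bounds state what extent inference
# would deduce from the shape of A_s.
T.copy(A[by * BM : (by + 1) * BM, ko * BK : (ko + 1) * BK], A_s)
```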
**Other helpers**

- `T.c2d_im2col(img, col, ...)`: convenience for conv‑style im2col transforms.
## Compute Primitives

**GEMM and sparse GEMM**

- `T.gemm(A_shared, B_shared, C_fragment)`: computes a tile GEMM using shared inputs and a fragment accumulator; lowered to target‑specific tensor cores.
- `T.gemm_sp(...)`: 2:4 sparse tensor core variant (see examples and README).
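`T.gemm` also accepts transpose flags for non‑standard operand layouts, as used in the repository's attention examples; a minimal sketch (buffer names are illustrative):

```python
# S = Q @ K^T: K_s is stored row-major as (N, d), so transpose_B makes
# the tile GEMM consume it as (d, N).
T.gemm(Q_s, K_s, S_f, transpose_B=True)
```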
**Reductions and scans**

- `T.reduce_sum`, `T.reduce_max`, `T.reduce_min`, `T.cumsum`, plus warp reducers (`T.warp_reduce_sum`, etc.).
- Allocate and initialize accumulators via `T.alloc_fragment` + `T.clear` or `T.fill`, as in the sketch below.
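A minimal row‑reduction sketch, assuming the `(BM, BK)` shared tile `A_s` from the copy example and the `dim` keyword used in the repository's reduction examples:

```python
# Per-row sums of A_s into a rank-1 fragment accumulator.
row_sum = T.alloc_fragment((BM,), 'float32')
T.clear(row_sum)
T.reduce_sum(A_s, row_sum, dim=1)  # reduce along the BK axis
```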
**Elementwise math**

Most math ops mirror TVM TIR: `T.exp`, `T.log`, `T.max`, `T.min`, `T.rsqrt`, `T.sigmoid`, etc. They compose freely inside loops; see the sketch below.
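A minimal sketch of a fused elementwise epilogue over a fragment, using the `T.Parallel` loop construct from the examples (`C_f` as in the GEMM tile below):

```python
# Fused ReLU applied to the accumulator before the store.
for i, j in T.Parallel(BM, BN):
    C_f[i, j] = T.max(C_f[i, j], 0.0)
```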
**Reshape/view (no copy)**

`T.reshape(buf, new_shape)` and `T.view(buf, shape=None, dtype=None)` create new views that share storage, with shape/dtype checks enforced.
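A minimal sketch (illustrative names; the dtype reinterpretation relies on matching element sizes, which the built‑in checks enforce):

```python
buf  = T.alloc_shared((BM, BK), 'float16')
flat = T.reshape(buf, (BM * BK,))    # same storage, rank-1 shape
bits = T.view(buf, dtype='uint16')   # same storage, reinterpreted element type
```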
## Synchronization (HL usage)

In HL pipelines you usually don't need to write explicit barriers: passes such as `PipelinePlanning`, `InjectSoftwarePipeline`, and `InjectTmaBarrier` orchestrate producer/consumer ordering and thread synchronization behind the scenes.

If you need debugging or explicit checks (see the sketch below):

- `T.device_assert(cond, msg='')` emits device‑side asserts on CUDA targets.
- `T.print(obj, msg='...')` prints scalars or buffers safely from one thread.
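A minimal debugging sketch inside a kernel body (condition and message are illustrative):

```python
T.device_assert(bx < T.ceildiv(N, BN), 'block index out of range')
T.print(C_f, msg='C_f tile')  # emitted from one thread, so output stays readable
```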
## Putting It Together: GEMM Tile

```python
@T.prim_func
def gemm(
    A: T.Tensor((M, K), 'float16'),
    B: T.Tensor((K, N), 'float16'),
    C: T.Tensor((M, N), 'float16'),
):
    with T.Kernel(T.ceildiv(N, BN), T.ceildiv(M, BM), threads=128) as (bx, by):
        A_s = T.alloc_shared((BM, BK), 'float16')
        B_s = T.alloc_shared((BK, BN), 'float16')
        C_f = T.alloc_fragment((BM, BN), 'float32')
        T.clear(C_f)
        for ko in T.Pipelined(T.ceildiv(K, BK), num_stages=3):
            T.copy(A[by * BM, ko * BK], A_s)  # Global → Shared
            T.copy(B[ko * BK, bx * BN], B_s)
            T.gemm(A_s, B_s, C_f)             # compute into fragment
        T.copy(C_f, C[by * BM, bx * BN])      # store back
```
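To run the tile function, compile it and call the result like a regular Python function. A hedged sketch, assuming `M`, `N`, `K` and the block sizes are bound Python constants, and following the `tilelang.compile`/`out_idx` pattern from the project README (the exact entry point may differ across versions):

```python
import tilelang
import torch

kernel = tilelang.compile(gemm, out_idx=[2])  # let the runtime allocate C

a = torch.randn(M, K, dtype=torch.float16, device='cuda')
b = torch.randn(K, N, dtype=torch.float16, device='cuda')
c = kernel(a, b)  # runs the pipelined tile GEMM above
```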
## Instruction Reference (Concise)

Below is a concise list of TileLang instructions grouped by category. For full signatures, behaviors, constraints, and examples, refer to the API Reference (autoapi/tilelang/index).
**Data movement**

- `T.copy(src, dst, ...)`: Move tiles between Global/Shared/Fragment.
- `T.c2d_im2col(img, col, ...)`: 2D im2col transform for conv.
**Memory allocation and descriptors**

- `T.alloc_shared(shape, dtype, scope='shared.dyn')`: Allocate a shared buffer.
- `T.alloc_fragment(shape, dtype, scope='local.fragment')`: Allocate a fragment.
- `T.alloc_var(dtype, [init], scope='local.var')`: Scalar var buffer (1 element).
- `T.alloc_barrier(arrive_count)`: Shared barrier buffer.
- `T.alloc_tmem(shape, dtype)`: Tensor memory (TMEM) buffer (Blackwell tcgen05).
- `T.alloc_reducer(shape, dtype, op='sum', replication=None)`: Reducer buffer.
- `T.alloc_descriptor(kind, dtype)`: Generic descriptor allocator.
- `T.alloc_wgmma_desc(dtype='uint64')`
- `T.alloc_tcgen05_smem_desc(dtype='uint64')`
- `T.alloc_tcgen05_instr_desc(dtype='uint32')`
- `T.empty(shape, dtype='float32')`: Declare function output tensors.
**Compute primitives**

- `T.gemm(A_s, B_s, C_f)`: Tile GEMM into a fragment accumulator.
- `T.gemm_sp(...)`: Sparse (2:4) tensor core GEMM.
- Reductions: `T.reduce_sum`/`max`/`min`/`abssum`/`absmax`, bitwise `and`/`or`/`xor`.
- Scans: `T.cumsum`; finalize: `T.finalize_reducer`.
- Warp reducers: `T.warp_reduce_sum`/`max`/`min`/`bitand`/`bitor`.
- Elementwise math: TIR ops (`T.exp`, `T.log`, `T.max`, `T.min`, `T.rsqrt`, …).
- Fast math: `T.__log`/`__log2`/`__log10`/`__exp`/`__exp2`/`__exp10`/`__sin`/`__cos`/`__tan`.
- IEEE math: `T.ieee_add`/`sub`/`mul`/`fmaf` (configurable rounding).
- Helpers: `T.clear(buf)`, `T.fill(buf, value)`.
- Views: `T.reshape(buf, shape)`, `T.view(buf, shape=None, dtype=None)`.
**Diagnostics**

- `T.print(obj, msg='')`: Print a scalar/buffer from one thread.
- `T.device_assert(cond, msg='')`: Device‑side assert (CUDA).
**Logical helpers**

- `T.any_of(a, b, ...)`, `T.all_of(a, b, ...)`: Multi‑term predicates.
**Annotation helpers**

- `T.use_swizzle(panel_size=..., enable=True)`: Rasterization hint.
- `T.annotate_layout({...})`: Attach explicit layouts to buffers.
- `T.annotate_safe_value(var, ...)`: Safety/const hints.
- `T.annotate_l2_hit_ratio(buf, ratio)`: Cache behavior hint.
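A minimal sketch of the two most common annotations at the top of a kernel body; the `make_swizzled_layout` helper follows the repository's examples and is an assumption here:

```python
# Rasterize thread blocks in panels to improve L2 reuse.
T.use_swizzle(panel_size=4)
# Pin an explicit swizzled layout on the shared tile A_s.
T.annotate_layout({A_s: tilelang.layout.make_swizzled_layout(A_s)})
```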
**Atomics**

- `T.atomic_add(dst, value, memory_order=None, return_prev=False, use_tma=False)`
- `T.atomic_addx2(dst, value, return_prev=False)`; `T.atomic_addx4(...)`
- `T.atomic_max(dst, value, memory_order=None, return_prev=False)`
- `T.atomic_min(dst, value, memory_order=None, return_prev=False)`
- `T.atomic_load(dst)`, `T.atomic_store(dst, value)`
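A minimal sketch of accumulating per‑block partial results into global memory (buffer names are illustrative):

```python
# Each block folds its partial tile sum into one global cell.
T.atomic_add(Out[by, bx], partial_sum)
# Fetch-and-add when the previous value is needed.
prev = T.atomic_add(Counter[0], 1, return_prev=True)
```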
**Custom intrinsics**

- `T.dp4a(A, B, C)`: 4‑element dot‑product accumulate.
- `T.clamp(x, lo, hi)`: Clamp to [lo, hi].
- `T.loop_break()`: Break out of the current loop via an intrinsic.
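A minimal sketch (`A_q`, `B_q`, `acc`, and `x` are illustrative int8/int32/float buffers, not part of the API):

```python
# acc += dot(A_q[0:4], B_q[0:4]) over packed int8 lanes.
T.dp4a(A_q, B_q, acc)
# Saturate an activation to [0, 6] (ReLU6).
y = T.clamp(x, 0.0, 6.0)
```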
**Barriers, TMA, warp‑group**

- Barriers: `T.create_list_of_mbarrier(...)`, `T.get_mbarrier(i)`.
- Parity ops: `T.mbarrier_wait_parity(barrier, parity)`, `T.mbarrier_arrive(barrier)`.
- Expect tx: `T.mbarrier_expect_tx(...)`; sugar: `T.barrier_wait(id, parity=None)`.
- TMA: `T.create_tma_descriptor(...)`, `T.tma_load(...)`, `T.tma_store_arrive(...)`, `T.tma_store_wait(...)`.
- Proxy/fences: `T.fence_proxy_async(...)`, `T.warpgroup_fence_operand(...)`.
- Warp‑group: `T.warpgroup_arrive()`, `T.warpgroup_commit_batch()`, `T.warpgroup_wait(num_mma)`, `T.wait_wgmma(id)`.
**Lane/warp index**

- `T.get_lane_idx(warp_size=None)`: Lane id within a warp.
- `T.get_warp_idx_sync(warp_size=None)`: Canonical warp id (synchronizing).
- `T.get_warp_idx(warp_size=None)`: Canonical warp id (no sync).
- `T.get_warp_group_idx(warp_size=None, warps_per_group=None)`: Warp‑group id.
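A minimal sketch of per‑warp specialization (illustrative use):

```python
lane = T.get_lane_idx()
warp = T.get_warp_idx_sync()
if lane == 0:
    T.print(warp, msg='warp id')  # one line per warp
```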
**Register control**

- `T.set_max_nreg(reg_count, is_inc)`, `T.inc_max_nreg(n)`, `T.dec_max_nreg(n)`.
- `T.annotate_producer_reg_dealloc(n=24)`, `T.annotate_consumer_reg_alloc(n=240)`.
- `T.no_set_max_nreg()`, `T.disable_warp_group_reg_alloc()`.
## Notes on Dtypes

Dtypes accept three equivalent forms, all normalized internally (see the sketch below, and Type System for details):

- String: `'float32'`
- TileLang dtype: `T.float32`
- Framework dtype: `torch.float32`
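A minimal sketch showing the three forms side by side inside a kernel body (allocation calls as introduced above):

```python
import torch

x = T.alloc_fragment((BM,), 'float32')      # string form
y = T.alloc_fragment((BM,), T.float32)      # TileLang dtype
z = T.alloc_fragment((BM,), torch.float32)  # framework (torch) dtype
```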