tilelang.intrinsics.tcgen05_macro_generator¶
Attributes¶
Classes¶
Enum where members are also (and must be) ints |
|
Intrinsic emitter for Blackwell (SM100) TCGEN5MMA instructions. |
Module Contents¶
- tilelang.intrinsics.tcgen05_macro_generator.lift¶
- class tilelang.intrinsics.tcgen05_macro_generator.SwizzleMode¶
Bases:
enum.IntEnumEnum where members are also (and must be) ints
- NONE = 0¶
- SWIZZLE_128B = 2¶
- SWIZZLE_64B = 4¶
- SWIZZLE_32B = 6¶
- swizzle_byte_size()¶
- Return type:
int
- swizzle_atom_size()¶
- Return type:
int
- class tilelang.intrinsics.tcgen05_macro_generator.TensorCoreIntrinEmitter(a_dtype=T.float16, b_dtype=T.float16, accum_dtype=T.float16, a_transposed=False, b_transposed=False, block_row_warps=2, block_col_warps=2, warp_row_tiles=8, warp_col_tiles=8, chunk=16, reduce_k=1, num_elems_per_byte=1, is_m_first=False, thread_var=None)¶
Bases:
tilelang.intrinsics.mma_macro_generator.TensorCoreIntrinEmitterIntrinsic emitter for Blackwell (SM100) TCGEN5MMA instructions.
Generates TIR macros that lower to
tcgen05.mmaPTX instructions for both the SS (Shared-Shared) and TS (TensorMemory-Shared) GEMM variants. Also provides layout helpers for tensor-memory (TMEM) buffers.- Parameters:
- tcgen05_prefix: str¶
- tcgen05mma(A_buf, B_buf, C_local_buf, mbar, clear_accum=False)¶
Emit a TCGEN5MMA operation, dispatching to SS or TS variant based on A’s memory scope.
If A_buf resides in tensor memory (
shared.tmem), the TS variant is emitted; otherwise the SS variant is used (both A and B from shared memory).- Parameters:
A_buf (Buffer) – Operand A — either in shared memory (SS) or tensor memory (TS).
B_buf (Buffer) – Operand B in shared memory.
C_local_buf (Buffer) – Accumulator buffer in tensor memory.
mbar (PrimExpr) – Memory barrier used for MMA completion signalling.
clear_accum (PrimExpr) – Whether to zero the accumulator before the first MMA.
- tcgen05mma_ts(A_buf, B_buf, C_local_buf, mbar, clear_accum=False)¶
Emit the TS (TensorMemory-Shared) variant of TCGEN5MMA.
Reads operand A directly from tensor memory (TMEM) and operand B from shared memory via a descriptor. The TMEM column offset for A is computed assuming packed storage (e.g. two
bfloat16values peruint32column) to match the output oftcgen05.st.- Parameters:
A_buf (Buffer) – Operand A residing in tensor memory (
shared.tmem).B_buf (Buffer) – Operand B in shared memory.
C_local_buf (Buffer) – Accumulator buffer in tensor memory.
mbar (PrimExpr) – Memory barrier for MMA completion signalling.
clear_accum (PrimExpr) – Whether to zero the accumulator before the first MMA.
- abstractmethod make_mma_load_layout(local_buf, matrix='A')¶
- Parameters:
local_buf (tvm.tir.Buffer)
matrix (str)
- Return type:
tilelang.language.Fragment
- make_mma_store_layout(tmem_buf)¶
Create the TCGEN5 tensor-memory layout used to store MMA accumulators.
- Parameters:
tmem_buf (tir.Buffer) – The local buffer representing tensormemory of a mma’s output
- Returns:
Layout object describing how logical (i, j) coordinates map to the swizzled tensor-memory offsets required by TCGEN5MMA.
- Return type:
- Raises:
AssertionError – If tmem_buf is not detected to be a tensor-memory buffer.
- get_tcgen5_mma_meta(m, n, k)¶
Query the FFI for TCGEN5MMA atom metadata (atom_m, atom_n, atom_k, enable_ws, enable_2cta).
- Parameters:
m (int)
n (int)
k (int)
- get_tcgen5_instr_desc(atom_m, atom_n, atom_k, a_is_k_major, b_is_k_major, scale_in_a, scale_in_b)¶
Build the 64-bit instruction descriptor for a
tcgen05.mmaPTX call.