tilelang.intrinsics.wmma_macro_generator
========================================

.. py:module:: tilelang.intrinsics.wmma_macro_generator

.. autoapi-nested-parse::

   WMMA intrinsic emitter for AMD RDNA architectures (gfx11 / gfx12).

   Only supports the f16->f32, 16x16x16 variant with warp-size=32.

   Thread-data mapping (per AMDGPU ISA):

   gfx11:
      - A/B: duplicated across the two half-waves, so each logical input
        fragment is distributed over an effective wave size of 16 lanes.
      - C/D: distributed over the full wave32 output layout.

   gfx12:
      - A/B: distributed over the full wave32 input layout.
      - C/D: distributed over the full wave32 output layout.

Attributes
----------

.. autoapisummary::

   tilelang.intrinsics.wmma_macro_generator.lift

Classes
-------

.. autoapisummary::

   tilelang.intrinsics.wmma_macro_generator.WMMAIntrinEmitter

Module Contents
---------------

.. py:data:: lift

.. py:class:: WMMAIntrinEmitter(a_dtype = 'float16', b_dtype = 'float16', accum_dtype = 'float32', a_transposed = False, b_transposed = False, block_row_warps = 2, block_col_warps = 2, warp_row_tiles = 16, warp_col_tiles = 16, chunk = 16, k_pack = 1, thread_var = None, target = None)

   Intrinsic emitter for AMD RDNA WMMA (16x16x16, warp-size=32).

   Supports:
      - fp16 -> fp32 (f32_16x16x16_f16_w32, with ``_gfx12`` codegen suffix on gfx12)

   .. py:attribute:: M_DIM
      :value: 16

   .. py:attribute:: N_DIM
      :value: 16

   .. py:attribute:: K_DIM
      :value: 16

   .. py:attribute:: WARP_SIZE
      :value: 32

   .. py:attribute:: a_dtype
      :value: 'float16'

   .. py:attribute:: b_dtype
      :value: 'float16'

   .. py:attribute:: accum_dtype
      :value: 'float32'

   .. py:attribute:: a_transposed
      :value: False

   .. py:attribute:: b_transposed
      :value: False

   .. py:attribute:: block_row_warps
      :value: 2

   .. py:attribute:: block_col_warps
      :value: 2

   .. py:attribute:: warp_row_tiles
      :value: 16

   .. py:attribute:: warp_col_tiles
      :value: 16

   .. py:attribute:: chunk
      :value: 16

   .. py:attribute:: k_pack
      :value: 1

   .. py:attribute:: thread_var
      :value: None

   .. py:attribute:: target
      :value: None

   .. py:attribute:: rdna_gen

   .. py:attribute:: micro_size_x
      :value: 16

   .. py:attribute:: micro_size_y
      :value: 16

   .. py:attribute:: micro_size_k
      :value: 16

   .. py:attribute:: local_size_a
      :value: 16

   .. py:attribute:: local_size_b
      :value: 16

   .. py:attribute:: local_size_out
      :value: 8

   .. py:attribute:: warp_rows
      :value: 1

   .. py:attribute:: warp_cols
      :value: 1

   .. py:attribute:: threads
      :value: 128

   .. py:attribute:: a_fragment_forward_fn

   .. py:attribute:: b_fragment_forward_fn

   .. py:attribute:: fragment_replicate
      :value: 2

   .. py:attribute:: store_index_map_fn

   .. py:attribute:: wmma_shape
      :value: 'f32_16x16x16_f16_w32'

   .. py:method:: get_thread_binding()

   .. py:method:: extract_thread_binding(thread_id)

      Return (lane_id, warp_n, warp_m).

   .. py:method:: get_ldmatrix_index_map(is_b = False)

      Return (forward, reverse) index maps for shared→local loading.

      The actual layout functions are chosen during ``__init__`` based on ``rdna_gen``:

      - gfx11 uses half-wave duplicated A/B input layouts (32x16 naming).
      - gfx12 uses full wave32 A/B input layouts (32x8 naming).

   .. py:method:: get_store_index_map(inverse = False)

      Return the store index map.

      The forward map is (thread_id, local_id) -> (i, j), which is affine.
      The inverse map is (i, j) -> (thread_id, local_id).

   .. py:method:: ldmatrix_a(A_local_buf, A_shared_buf, ki, rk=0)

   .. py:method:: ldmatrix_b(B_local_buf, B_shared_buf, ki, rk=0)

   .. py:method:: wmma(A_local_buf, B_local_buf, C_local_buf, k_inner = 0)

   .. py:method:: stmatrix(C_local_buf, C_buf, pid_m=None, pid_n=None)

   .. py:method:: make_wmma_load_layout(local_buf, matrix = 'A')

   .. py:method:: make_wmma_store_layout(local_buf)
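To make the ``extract_thread_binding`` contract concrete, here is a plain-Python sketch of one plausible decomposition of a flat block-level thread id into ``(lane_id, warp_n, warp_m)``. It assumes the common convention that ``warp_n`` varies fastest across warps (a ``block_row_warps x block_col_warps`` row-major warp grid, 32 lanes per wave); the emitter's actual mapping is generated as TIR expressions over ``thread_var``, so treat this as an illustration, not the implementation.

```python
# Hypothetical sketch of the (lane_id, warp_n, warp_m) decomposition.
# Constants mirror the class defaults above: wave32, a 2x2 warp grid,
# hence threads = 2 * 2 * 32 = 128 per block.
WARP_SIZE = 32
BLOCK_ROW_WARPS = 2
BLOCK_COL_WARPS = 2


def extract_thread_binding(thread_id: int) -> tuple[int, int, int]:
    """Return (lane_id, warp_n, warp_m) for one thread in the block.

    Assumes warp_n (the warp's column) varies fastest, matching the
    (lane_id, warp_n, warp_m) return order documented above.
    """
    lane_id = thread_id % WARP_SIZE          # position within the wave32
    warp_id = thread_id // WARP_SIZE         # flat warp index in the block
    warp_n = warp_id % BLOCK_COL_WARPS       # warp's column in the warp grid
    warp_m = warp_id // BLOCK_COL_WARPS      # warp's row in the warp grid
    return lane_id, warp_n, warp_m
```

With the defaults, thread 0 maps to warp (0, 0), thread 33 is lane 1 of the second warp in the first warp-row, and thread 127 is the last lane of warp (1, 1).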
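The affine forward/inverse pair described under ``get_store_index_map`` can likewise be sketched in plain Python. The layout below is an illustrative full-wave32 C/D fragment layout for a 16x16 tile with ``local_size_out = 8``: lanes 0-15 hold rows 0-7 of their column, lanes 16-31 hold rows 8-15. This is an assumption for illustration only; the emitter's real map comes from ``store_index_map_fn`` and differs between gfx11 and gfx12.

```python
# Illustrative affine store map for a 16x16 accumulator fragment held by
# 32 lanes with 8 elements ("locals") each. NOT the emitter's actual map.
M_DIM = 16
N_DIM = 16
WARP_SIZE = 32
LOCAL_SIZE_OUT = 8


def store_forward(thread_id: int, local_id: int) -> tuple[int, int]:
    """(thread_id, local_id) -> (i, j): which matrix element a lane holds."""
    j = thread_id % N_DIM                          # lane picks a column
    i = (thread_id // N_DIM) * LOCAL_SIZE_OUT + local_id  # half-wave picks row block
    return i, j


def store_inverse(i: int, j: int) -> tuple[int, int]:
    """(i, j) -> (thread_id, local_id): which lane/register holds element (i, j)."""
    thread_id = (i // LOCAL_SIZE_OUT) * N_DIM + j
    local_id = i % LOCAL_SIZE_OUT
    return thread_id, local_id
```

Because both directions are affine and the 32 x 8 = 256 (lane, local) pairs cover the 16x16 tile exactly once, the two functions are mutual inverses, which is what lets the emitter invert the forward map for ``stmatrix``-style stores.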