tilelang.contrib.cutedsl.cpasync
================================

.. py:module:: tilelang.contrib.cutedsl.cpasync


Functions
---------

.. autoapisummary::

   tilelang.contrib.cutedsl.cpasync.cp_async_gs
   tilelang.contrib.cutedsl.cpasync.cp_async_gs_conditional
   tilelang.contrib.cutedsl.cpasync.extract_tensormap_ptr
   tilelang.contrib.cutedsl.cpasync.tma_load
   tilelang.contrib.cutedsl.cpasync.tma_store
   tilelang.contrib.cutedsl.cpasync.tma_reduce
   tilelang.contrib.cutedsl.cpasync.tma_store_arrive
   tilelang.contrib.cutedsl.cpasync.tma_store_wait
   tilelang.contrib.cutedsl.cpasync.cp_async_shared_global
   tilelang.contrib.cutedsl.cpasync.prefetch_tma_descriptor
   tilelang.contrib.cutedsl.cpasync.mbarrier_wait
   tilelang.contrib.cutedsl.cpasync.mbarrier_cp_async_arrive
   tilelang.contrib.cutedsl.cpasync.fence_proxy_async
   tilelang.contrib.cutedsl.cpasync.fence_barrier_init


Module Contents
---------------

.. py:function:: cp_async_gs(size, dst, src)

.. py:function:: cp_async_gs_conditional(size, dst, src, cond)

.. py:function:: extract_tensormap_ptr(tma_atom, *, loc=None, ip=None)

   Extract the tensormap pointer from a TMA Copy Atom.

   :param tma_atom: The TMA Copy Atom
   :type tma_atom: CopyAtom

.. py:function:: tma_load(tma_desc, mbar, smem_ptr, crd, *, loc=None, ip=None)

   Load data from global memory to shared memory using TMA (Tensor Memory Accelerator).

   :param tma_desc: TMA descriptor for the tensor
   :type tma_desc: CopyAtom or tensormap_ptr or Tensor of tensormap_ptr
   :param mbar: Mbarrier pointer in shared memory
   :type mbar: Pointer
   :param smem_ptr: Destination pointer in shared memory
   :type smem_ptr: Pointer
   :param crd: Coordinate tuple for the tensor access
   :type crd: tuple[Int, ...]

.. py:function:: tma_store(tma_desc, smem_ptr, crd, *, loc=None, ip=None)

   Store data from shared memory to global memory using TMA (Tensor Memory Accelerator).
   :param tma_desc: TMA descriptor for the tensor
   :type tma_desc: TMA descriptor
   :param smem_ptr: Source pointer in shared memory
   :type smem_ptr: Pointer
   :param crd: Coordinate tuple for the tensor access
   :type crd: tuple[Int, ...]

.. py:function:: tma_reduce(tma_desc, smem_ptr, crd, *, loc=None, ip=None)

   Reduce data from shared memory into global memory using TMA with atomic ADD reduction.

   This performs an atomic add of shared-memory data into global memory using
   the TMA unit's reduce capability.

   :param tma_desc: TMA descriptor for the tensor
   :type tma_desc: TMA descriptor
   :param smem_ptr: Source pointer in shared memory
   :type smem_ptr: Pointer
   :param crd: Coordinate tuple for the tensor access
   :type crd: tuple[Int, ...]

.. py:function:: tma_store_arrive(*, loc=None, ip=None)

   Indicate arrival of the warp issuing TMA_STORE.

   Corresponds to the PTX instruction: ``cp.async.bulk.commit_group;``

.. py:function:: tma_store_wait(count, *, read=None, loc=None, ip=None)

   Wait for TMA_STORE operations to complete.

   Corresponds to the PTX instruction: ``cp.async.bulk.wait_group.read N;``

   :param count: The number of outstanding bulk async groups to wait for
   :type count: Int

.. py:function:: cp_async_shared_global(dst, src, cp_size, modifier, *, src_size=None, loc=None, ip=None)

   Asynchronously copy data from global memory to shared memory.

   :param dst: Destination pointer in shared memory
   :type dst: Pointer
   :param src: Source pointer in global memory
   :type src: Pointer
   :param cp_size: Size of the copy in bytes
   :type cp_size: Int
   :param modifier: Cache modifier
   :type modifier: Int
   :param src_size: Optional source size override
   :type src_size: Int

.. py:function:: prefetch_tma_descriptor(tma_desc, *, loc=None, ip=None)

   Prefetch a TMA descriptor.

   Corresponds to the PTX instruction: ``prefetch.tensormap;``

.. py:function:: mbarrier_wait(mbar_ptr, phase, timeout_ns=10000000, *, loc=None, ip=None)

   Wait on an mbarrier with the specified phase (blocking loop).

   Uses inline PTX to loop until the ``try_wait`` succeeds.
   The CUDA backend does: ``while (!mbar.try_wait(parity)) {}``

.. py:function:: mbarrier_cp_async_arrive(mbar_ptr, *, loc=None, ip=None)

.. py:function:: fence_proxy_async()

.. py:function:: fence_barrier_init()
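The phase-parity wait that ``mbarrier_wait`` performs can be modeled on the host as a minimal sketch. The ``MBarrier`` class and ``mbarrier_wait_model`` helper below are hypothetical illustrations of the parity semantics only, not the device-side mbarrier or this module's API:

```python
# Host-side model of phase-parity mbarrier waiting (illustration only; the
# real mbarrier lives in shared memory and is driven by PTX instructions).
class MBarrier:
    def __init__(self, expected_arrivals: int):
        self.expected = expected_arrivals
        self.count = 0
        self.phase = 0  # parity bit; flips each time all arrivals complete

    def arrive(self):
        self.count += 1
        if self.count == self.expected:
            self.count = 0
            self.phase ^= 1  # completing the barrier flips the parity

    def try_wait(self, parity: int) -> bool:
        # Succeeds once the barrier phase has moved past the awaited parity.
        return self.phase != parity


def mbarrier_wait_model(mbar: MBarrier, parity: int, arrivals):
    # Mirrors the backend loop: while (!mbar.try_wait(parity)) {}
    # `arrivals` stands in for the other threads making progress.
    arrivals = iter(arrivals)
    while not mbar.try_wait(parity):
        next(arrivals)()  # simulate another thread arriving


mbar = MBarrier(expected_arrivals=2)
mbarrier_wait_model(mbar, parity=0, arrivals=[mbar.arrive, mbar.arrive])
print(mbar.phase)  # parity flipped once both expected arrivals landed
```

A waiter polls against the parity it last observed; only after every expected arrival lands does the phase bit flip and release it, which is why ``mbarrier_wait`` takes an explicit ``phase`` argument.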