tilelang.contrib.cutedsl.cpasync

Functions

cp_async_gs(size, dst, src)

cp_async_gs_conditional(size, dst, src, cond)

extract_tensormap_ptr(tma_atom, *[, loc, ip])

Extract the tensormap pointer from a TMA Copy Atom.

tma_load(tma_desc, mbar, smem_ptr, crd, *[, loc, ip])

Load data from global memory to shared memory using TMA (Tensor Memory Accelerator).

tma_store(tma_desc, smem_ptr, crd, *[, loc, ip])

Store data from shared memory to global memory using TMA (Tensor Memory Accelerator).

tma_reduce(tma_desc, smem_ptr, crd, *[, loc, ip])

Reduce data from shared memory to global memory using TMA with atomic ADD reduction.

tma_store_arrive(*[, loc, ip])

Indicate arrival of the warp issuing TMA_STORE.

tma_store_wait(count, *[, read, loc, ip])

Wait for TMA_STORE operations to complete.

cp_async_shared_global(dst, src, cp_size, modifier, *)

Asynchronously copy data from global memory to shared memory.

prefetch_tma_descriptor(tma_desc, *[, loc, ip])

Prefetch a TMA descriptor.

mbarrier_wait(mbar_ptr, phase[, timeout_ns, loc, ip])

Waits on an mbarrier with the specified phase (blocking loop).

mbarrier_cp_async_arrive(mbar_ptr, *[, loc, ip])

fence_proxy_async()

fence_barrier_init()

Module Contents

tilelang.contrib.cutedsl.cpasync.cp_async_gs(size, dst, src)
tilelang.contrib.cutedsl.cpasync.cp_async_gs_conditional(size, dst, src, cond)
tilelang.contrib.cutedsl.cpasync.extract_tensormap_ptr(tma_atom, *, loc=None, ip=None)

Extract the tensormap pointer from a TMA Copy Atom.

Parameters:

tma_atom (cutlass.cute.CopyAtom)

Return type:

cutlass.cute.Pointer

tilelang.contrib.cutedsl.cpasync.tma_load(tma_desc, mbar, smem_ptr, crd, *, loc=None, ip=None)

Load data from global memory to shared memory using TMA (Tensor Memory Accelerator).

Parameters:
  • tma_desc (CopyAtom or tensormap_ptr or Tensor of tensormap_ptr) -- TMA descriptor for the tensor

  • mbar (Pointer) -- Mbarrier pointer in shared memory

  • smem_ptr (Pointer) -- Destination pointer in shared memory

  • crd (tuple[Int, ...]) -- Coordinates tuple for the tensor access

Return type:

None
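As a rough illustration of the semantics only (not the real device API; every name below is invented for the sketch), a TMA load copies a tile from a global tensor into shared memory at the given coordinates and signals completion through the mbarrier:

```python
# Toy pure-Python model of the tma_load semantics described above.
# The real operation runs on the GPU's TMA unit and arrives on a
# hardware mbarrier; this model just mimics the data movement.

def model_tma_load(gmem, smem, mbar, crd, tile):
    """Copy a `tile`-sized block starting at coordinates `crd` from
    `gmem` (list of lists) into `smem`, then arrive on the mbarrier."""
    row0, col0 = crd
    rows, cols = tile
    for r in range(rows):
        for c in range(cols):
            smem[r][c] = gmem[row0 + r][col0 + c]
    mbar["arrivals"] += 1  # completion is observed via the mbarrier


gmem = [[r * 4 + c for c in range(4)] for r in range(4)]
smem = [[0] * 2 for _ in range(2)]
mbar = {"arrivals": 0}
model_tma_load(gmem, smem, mbar, crd=(1, 2), tile=(2, 2))
```

On real hardware, a consumer warp would then block on the mbarrier (see mbarrier_wait below in the original sense of waiting for the transaction) before reading the tile.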

tilelang.contrib.cutedsl.cpasync.tma_store(tma_desc, smem_ptr, crd, *, loc=None, ip=None)

Store data from shared memory to global memory using TMA (Tensor Memory Accelerator).

Parameters:
  • tma_desc (TMA descriptor) -- TMA descriptor for the tensor

  • smem_ptr (Pointer) -- Source pointer in shared memory

  • crd (tuple[Int, ...]) -- Coordinates tuple for the tensor access

Return type:

None

tilelang.contrib.cutedsl.cpasync.tma_reduce(tma_desc, smem_ptr, crd, *, loc=None, ip=None)

Reduce data from shared memory to global memory using TMA with atomic ADD reduction.

This performs an atomic add of shared memory data to global memory using the TMA unit's reduce capability.

Parameters:
  • tma_desc (TMA descriptor) -- TMA descriptor for the tensor

  • smem_ptr (Pointer) -- Source pointer in shared memory

  • crd (tuple[Int, ...]) -- Coordinates tuple for the tensor access

Return type:

None
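The reduce semantics can be pictured as an element-wise atomic add of the shared-memory tile into global memory at the given coordinates. A toy pure-Python model (all names invented; the real operation is a single hardware-side atomic reduction):

```python
# Toy model of tma_reduce's atomic-ADD semantics: each element of the
# shared-memory tile is added into global memory at coordinates `crd`.

def model_tma_reduce_add(gmem, smem, crd):
    """Add the shared-memory tile `smem` into `gmem` at offset `crd`."""
    row0, col0 = crd
    for r, row in enumerate(smem):
        for c, v in enumerate(row):
            gmem[row0 + r][col0 + c] += v  # atomic on real hardware


gmem = [[1.0] * 3 for _ in range(3)]
smem = [[2.0, 3.0], [4.0, 5.0]]
model_tma_reduce_add(gmem, smem, crd=(1, 1))
```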

tilelang.contrib.cutedsl.cpasync.tma_store_arrive(*, loc=None, ip=None)

Indicate arrival of the warp issuing TMA_STORE. Corresponds to PTX instruction: cp.async.bulk.commit_group;

Return type:

None

tilelang.contrib.cutedsl.cpasync.tma_store_wait(count, *, read=None, loc=None, ip=None)

Wait for TMA_STORE operations to complete. Corresponds to PTX instruction: cp.async.bulk.wait_group.read <count>;

Parameters:

count (Int) -- Wait until at most this many bulk async groups remain outstanding

Return type:

None
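The commit/wait pair above follows the usual PTX bulk async-group counting: each tma_store_arrive closes one group, and tma_store_wait(count) blocks until at most `count` groups are still in flight. A toy pure-Python model of that counting (names invented; the real groups complete asynchronously in hardware):

```python
# Toy model of cp.async.bulk commit_group / wait_group.read counting.

class BulkAsyncGroups:
    def __init__(self):
        self.outstanding = 0

    def commit_group(self):
        """Model of tma_store_arrive: close one bulk async group."""
        self.outstanding += 1

    def wait_group(self, count):
        """Model of tma_store_wait(count): block until at most `count`
        groups remain; here the oldest groups simply complete eagerly."""
        while self.outstanding > count:
            self.outstanding -= 1


g = BulkAsyncGroups()
for _ in range(3):
    g.commit_group()   # three stores committed
g.wait_group(1)        # drain until at most one group is pending
```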

tilelang.contrib.cutedsl.cpasync.cp_async_shared_global(dst, src, cp_size, modifier, *, src_size=None, loc=None, ip=None)

Asynchronously copy data from global memory to shared memory.

Parameters:
  • dst (Pointer) -- Destination pointer in shared memory

  • src (Pointer) -- Source pointer in global memory

  • cp_size (Int) -- Size of the copy in bytes

  • modifier (Int) -- Cache modifier

  • src_size (cutlass.cute.typing.Int) -- Optional source size in bytes; when smaller than cp_size, the remaining destination bytes are zero-filled

Return type:

None
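A toy pure-Python model of the cp.async size semantics may help: cp_size bytes land in shared memory, and if a src_size smaller than cp_size is given, the tail of the destination is zero-filled (the standard PTX cp.async src-size behavior; all names below are invented for the sketch):

```python
# Toy model of cp_async_shared_global's cp_size / src_size behavior.

def model_cp_async(dst, src, cp_size, src_size=None):
    """Copy cp_size 'bytes' from src to dst; bytes past src_size
    are zero-filled, mirroring PTX cp.async with an explicit src-size."""
    if src_size is None:
        src_size = cp_size
    for i in range(cp_size):
        dst[i] = src[i] if i < src_size else 0


src = list(range(10, 18))   # 8 "bytes" of global memory
dst = [None] * 8            # shared-memory destination
model_cp_async(dst, src, cp_size=8, src_size=5)
```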

tilelang.contrib.cutedsl.cpasync.prefetch_tma_descriptor(tma_desc, *, loc=None, ip=None)

Prefetch a TMA descriptor. Corresponds to PTX instruction: prefetch.tensormap;

Return type:

None

tilelang.contrib.cutedsl.cpasync.mbarrier_wait(mbar_ptr, phase, timeout_ns=10000000, *, loc=None, ip=None)

Waits on an mbarrier with the specified phase (blocking loop).

Uses inline PTX to loop until the try_wait succeeds. The CUDA backend does: while (!mbar.try_wait(parity)) {}

Parameters:
  • mbar_ptr (cutlass.cute.typing.Pointer)

  • phase (cutlass.cute.typing.Int)

  • timeout_ns (cutlass.cute.typing.Int)

Return type:

None
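The blocking loop can be sketched in pure Python (a toy model of `while (!mbar.try_wait(parity)) {}`; `try_wait`, `max_spins`, and the state dict are invented stand-ins, with `max_spins` playing the role of the timeout_ns guard):

```python
# Toy model of mbarrier_wait's spin loop on try_wait(phase).

def model_mbarrier_wait(try_wait, phase, max_spins=1_000_000):
    """Spin until try_wait(phase) succeeds; give up after max_spins
    polls, standing in for the timeout_ns guard in the real code."""
    spins = 0
    while not try_wait(phase):
        spins += 1
        if spins >= max_spins:
            raise TimeoutError("mbarrier wait timed out")
    return spins


state = {"ticks": 0}

def try_wait(phase):
    # Stand-in for the hardware try_wait: succeeds on the 4th poll.
    state["ticks"] += 1
    return state["ticks"] > 3

spins = model_mbarrier_wait(try_wait, phase=0)
```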

tilelang.contrib.cutedsl.cpasync.mbarrier_cp_async_arrive(mbar_ptr, *, loc=None, ip=None)
Parameters:

mbar_ptr (cutlass.cute.typing.Pointer)

Return type:

None

tilelang.contrib.cutedsl.cpasync.fence_proxy_async()
tilelang.contrib.cutedsl.cpasync.fence_barrier_init()