tilelang.contrib.cutedsl.cpasync¶
Functions¶
- extract_tensormap_ptr -- Extract the tensormap pointer from a TMA Copy Atom.
- tma_load -- Load data from global memory to shared memory using TMA (Tensor Memory Accelerator).
- tma_store -- Store data from shared memory to global memory using TMA.
- tma_reduce -- Reduce data from shared memory to global memory using TMA with atomic ADD reduction.
- tma_store_arrive -- Indicate arrival of the warp issuing TMA_STORE.
- tma_store_wait -- Wait for TMA_STORE operations to complete.
- Asynchronously copy data from global memory to shared memory.
- prefetch_tma_descriptor -- Prefetch a TMA descriptor.
- mbarrier_wait -- Wait on an mbarrier with a specified phase (blocking loop).
Module Contents¶
- tilelang.contrib.cutedsl.cpasync.cp_async_gs(size, dst, src)¶
- tilelang.contrib.cutedsl.cpasync.cp_async_gs_conditional(size, dst, src, cond)¶
- tilelang.contrib.cutedsl.cpasync.extract_tensormap_ptr(tma_atom, *, loc=None, ip=None)¶
Extract the tensormap pointer from a TMA Copy Atom.
- Parameters:
tma_atom (cutlass.cute.CopyAtom) -- The TMA Copy Atom
- Return type:
cutlass.cute.Pointer
- tilelang.contrib.cutedsl.cpasync.tma_load(tma_desc, mbar, smem_ptr, crd, *, loc=None, ip=None)¶
Load data from global memory to shared memory using TMA (Tensor Memory Accelerator).
- Parameters:
tma_desc (CopyAtom or tensormap_ptr or Tensor of tensormap_ptr) -- TMA descriptor for the tensor
mbar (Pointer) -- Mbarrier pointer in shared memory
smem_ptr (Pointer) -- Destination pointer in shared memory
crd (tuple[Int, ...]) -- Coordinate tuple for the tensor access
- Return type:
None
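A TMA load copies a coordinate-addressed tile from global to shared memory and signals an mbarrier when the data has landed. The sketch below emulates that data movement on the host; `emulated_tma_load` and the dict standing in for the mbarrier are invented for illustration, not part of this module's API:

```python
def emulated_tma_load(gmem, tile_shape, crd, smem, mbar):
    """Illustrative stand-in for tma_load: copy the tile at tile-coordinates
    `crd` from a 2D "global" array into a flat "shared" buffer, then flag
    the mbarrier so waiting consumers are released."""
    th, tw = tile_shape
    r0, c0 = crd[0] * th, crd[1] * tw
    k = 0
    for r in range(r0, r0 + th):
        for c in range(c0, c0 + tw):
            smem[k] = gmem[r][c]
            k += 1
    mbar["arrived"] = True  # hardware: mbarrier completes when all bytes arrive

gmem = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 "global" tensor
smem = [0] * 4                                            # 2x2 "shared" tile
mbar = {"arrived": False}
emulated_tma_load(gmem, (2, 2), (1, 1), smem, mbar)       # fetch tile (1, 1)
```

On real hardware a consumer would block on the mbarrier (see mbarrier_wait below) instead of reading the flag directly.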
- tilelang.contrib.cutedsl.cpasync.tma_store(tma_desc, smem_ptr, crd, *, loc=None, ip=None)¶
Store data from shared memory to global memory using TMA (Tensor Memory Accelerator).
- Parameters:
tma_desc (TMA descriptor) -- TMA descriptor for the tensor
smem_ptr (Pointer) -- Source pointer in shared memory
crd (tuple[Int, ...]) -- Coordinate tuple for the tensor access
- Return type:
None
- tilelang.contrib.cutedsl.cpasync.tma_reduce(tma_desc, smem_ptr, crd, *, loc=None, ip=None)¶
Reduce data from shared memory to global memory using TMA with atomic ADD reduction.
This performs an atomic add of shared-memory data into global memory using the TMA unit's reduce capability.
- Parameters:
tma_desc (TMA descriptor) -- TMA descriptor for the tensor
smem_ptr (Pointer) -- Source pointer in shared memory
crd (tuple[Int, ...]) -- Coordinate tuple for the tensor access
- Return type:
None
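The reduce variant accumulates shared-memory data into global memory rather than overwriting it, so concurrent producers writing the same tile compose correctly. A host-side Python sketch of that semantics (the function name is hypothetical; the lock stands in for hardware atomicity):

```python
import threading

_lock = threading.Lock()

def emulated_tma_reduce_add(gmem, smem, offset):
    """Illustrative stand-in for tma_reduce: gmem[offset:offset+len(smem)] += smem,
    atomic with respect to other emulated reducers."""
    with _lock:  # hardware performs the adds atomically in the memory system
        for i, v in enumerate(smem):
            gmem[offset + i] += v

gmem = [10.0, 10.0, 10.0, 10.0]
emulated_tma_reduce_add(gmem, [1.0, 2.0], 1)  # e.g. two CTAs reducing
emulated_tma_reduce_add(gmem, [1.0, 2.0], 1)  # the same output tile
```

Because the adds are atomic, the final result is independent of the order in which producers complete.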
- tilelang.contrib.cutedsl.cpasync.tma_store_arrive(*, loc=None, ip=None)¶
Indicate arrival of the warp issuing TMA_STORE. Corresponds to PTX instruction: cp.async.bulk.commit_group;
- Return type:
None
- tilelang.contrib.cutedsl.cpasync.tma_store_wait(count, *, read=None, loc=None, ip=None)¶
Wait for TMA_STORE operations to complete. Corresponds to PTX instruction: cp.async.bulk.wait_group.read <count>;
- Parameters:
count (Int) -- The number of outstanding bulk async groups to wait for
- Return type:
None
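The arrive/wait pair mirrors the PTX commit-group model: stores are issued, tma_store_arrive commits everything issued so far into one group, and tma_store_wait(count) blocks until at most `count` groups remain outstanding. A minimal host-side sketch of that bookkeeping (the `BulkGroupTracker` class and its method names are invented for illustration):

```python
from collections import deque

class BulkGroupTracker:
    """Emulates cp.async.bulk.commit_group / wait_group.read bookkeeping."""
    def __init__(self):
        self.uncommitted = 0    # stores issued since the last commit
        self.groups = deque()   # outstanding committed groups, oldest first

    def issue_store(self):
        # tma_store: one more bulk copy in the current, uncommitted batch
        self.uncommitted += 1

    def arrive(self):
        # tma_store_arrive (commit_group): seal the batch into one group
        self.groups.append(self.uncommitted)
        self.uncommitted = 0

    def wait(self, count):
        # tma_store_wait (wait_group.read <count>): block until at most
        # `count` groups are still in flight
        while len(self.groups) > count:
            self.groups.popleft()  # hardware completes the oldest group here

t = BulkGroupTracker()
t.issue_store(); t.issue_store()
t.arrive()        # 1 outstanding group (2 stores)
t.issue_store()
t.arrive()        # 2 outstanding groups
t.wait(1)         # drain until at most 1 group remains
```

`wait(0)` is the full drain; `wait(N)` with N > 0 lets a pipeline keep N store groups in flight while reusing older shared-memory buffers.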
Asynchronously copy data from global memory to shared memory.
- Parameters:
dst (Pointer) -- Destination pointer in shared memory
src (Pointer) -- Source pointer in global memory
size (Int) -- Size of the copy in bytes
modifier (Int) -- Cache modifier
cp_size (Int) -- Optional copy size override
src_size (cutlass.cute.typing.Int)
- Return type:
None
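The point of an asynchronous copy is that issuing it returns immediately; the data is only guaranteed visible after a subsequent wait. The host-side sketch below emulates that split with a worker thread; `AsyncCopyEngine`, `issue`, and `wait_all` are invented names for this illustration, not this module's API:

```python
import threading

class AsyncCopyEngine:
    """Emulates cp.async: copies are issued now, guaranteed visible after wait."""
    def __init__(self):
        self._threads = []

    def issue(self, dst, src, size):
        # Asynchronously copy `size` elements from src ("global") to dst ("shared").
        def copy():
            dst[:size] = src[:size]
        t = threading.Thread(target=copy)
        t.start()
        self._threads.append(t)

    def wait_all(self):
        # Block until every issued copy has landed, then forget them.
        for t in self._threads:
            t.join()
        self._threads.clear()

gmem = list(range(8))   # stand-in for global memory
smem = [0] * 8          # stand-in for shared memory
eng = AsyncCopyEngine()
eng.issue(smem, gmem, 8)
eng.wait_all()
```

On the device, the equivalent completion mechanism is either commit/wait groups or an mbarrier arrival (mbarrier_cp_async_arrive below).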
- tilelang.contrib.cutedsl.cpasync.prefetch_tma_descriptor(tma_desc, *, loc=None, ip=None)¶
Prefetch a TMA descriptor. Corresponds to PTX instruction: prefetch.tensormap;
- Return type:
None
- tilelang.contrib.cutedsl.cpasync.mbarrier_wait(mbar_ptr, phase, timeout_ns=10000000, *, loc=None, ip=None)¶
Waits on an mbarrier with a specified phase (blocking loop).
Uses inline PTX to loop until the try_wait succeeds. The CUDA backend does: while (!mbar.try_wait(parity)) {}
- Parameters:
mbar_ptr (cutlass.cute.typing.Pointer) -- Pointer to the mbarrier in shared memory
phase (cutlass.cute.typing.Int) -- Phase (parity) to wait on
timeout_ns (cutlass.cute.typing.Int) -- Timeout in nanoseconds
- Return type:
None
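The blocking loop above can be pictured as a phase-parity handshake: a waiter spins on one parity until a producer's arrival flips the barrier's phase. `HostMbarrier` below is a hypothetical host-side stand-in (using a condition variable instead of spinning), not the real device mbarrier:

```python
import threading

class HostMbarrier:
    """Host-side emulation of mbarrier phase-parity waiting (illustrative only)."""
    def __init__(self):
        self.phase = 0
        self._cond = threading.Condition()

    def arrive(self):
        # Completing the barrier flips the phase, releasing current waiters.
        with self._cond:
            self.phase ^= 1
            self._cond.notify_all()

    def try_wait(self, parity):
        # Succeeds once the phase has moved past the parity being waited on.
        return self.phase != parity

    def wait(self, parity):
        # CUDA backend semantics: while (!mbar.try_wait(parity)) {}
        with self._cond:
            while not self.try_wait(parity):
                self._cond.wait()

mbar = HostMbarrier()
threading.Timer(0.05, mbar.arrive).start()  # producer arrives shortly
mbar.wait(0)                                # blocks until the phase flips
```

After each completed barrier phase the parity alternates, which is why callers track a per-buffer phase bit and pass it to mbarrier_wait.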
- tilelang.contrib.cutedsl.cpasync.mbarrier_cp_async_arrive(mbar_ptr, *, loc=None, ip=None)¶
- Parameters:
mbar_ptr (cutlass.cute.typing.Pointer)
- Return type:
None
- tilelang.contrib.cutedsl.cpasync.fence_proxy_async()¶
- tilelang.contrib.cutedsl.cpasync.fence_barrier_init()¶