tilelang.contrib.cutedsl.reduce

Reduce operations for the CuTeDSL backend. Based on tl_templates/cuda/reduce.h

Classes

SumOp

Sum reduction operator

MaxOp

Max reduction operator

MinOp

Min reduction operator

BitAndOp

Bitwise AND reduction operator

BitOrOp

Bitwise OR reduction operator

BitXorOp

Bitwise XOR reduction operator

CumSum1D

1D cumulative sum operation.

CumSum2D

2D cumulative sum operation.

NamedBarrier

Named barrier policy for AllReduce; uses bar.sync instead of __syncthreads.

Functions

min(a, b[, c])

Type-aware min: uses arith.minsi for integers, nvvm.fmin for floats.

max(a, b[, c])

Type-aware max: uses arith.maxsi for integers, nvvm.fmax for floats.

bar_sync(barrier_id, number_of_threads)

bar_sync_ptx(barrier_id, number_of_threads)

AllReduce(reducer, threads, scale, thread_offset[, ...])

AllReduce operation implementing warp/block-level reduction.

Module Contents

tilelang.contrib.cutedsl.reduce.min(a, b, c=None)

Type-aware min: uses arith.minsi for integers, nvvm.fmin for floats. Falls back to the integer path if float conversion fails (signless integer types).

tilelang.contrib.cutedsl.reduce.max(a, b, c=None)

Type-aware max: uses arith.maxsi for integers, nvvm.fmax for floats. Falls back to the integer path if float conversion fails (signless integer types).
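
The dispatch described above can be illustrated with a host-side Python sketch. This is not CuTeDSL code — `type_aware_min`/`type_aware_max` are hypothetical stand-ins that model the documented behavior: integer operands take the arith.minsi/maxsi-style signed comparison, float operands the nvvm.fmin/fmax-style path, and the optional third argument folds in one more operand as in `min(a, b[, c])`.

```python
def type_aware_min(a, b, c=None):
    """Host-side model of the type-aware min dispatch (illustration only)."""
    if c is not None:
        # Three-operand form: reduce pairwise.
        return type_aware_min(type_aware_min(a, b), c)
    if isinstance(a, int) and isinstance(b, int):
        # Integer path (models arith.minsi: signed integer compare).
        return a if a < b else b
    # Float path (models nvvm.fmin).
    return min(float(a), float(b))

def type_aware_max(a, b, c=None):
    """Host-side model of the type-aware max dispatch (illustration only)."""
    if c is not None:
        return type_aware_max(type_aware_max(a, b), c)
    if isinstance(a, int) and isinstance(b, int):
        # Integer path (models arith.maxsi).
        return a if a > b else b
    # Float path (models nvvm.fmax).
    return max(float(a), float(b))
```

In the real module the integer/float choice is made on MLIR value types rather than Python types; the sketch only conveys the two-path dispatch and the optional third operand.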

class tilelang.contrib.cutedsl.reduce.SumOp

Sum reduction operator

static __call__(x, y)
class tilelang.contrib.cutedsl.reduce.MaxOp

Max reduction operator

static __call__(x, y)
class tilelang.contrib.cutedsl.reduce.MinOp

Min reduction operator

static __call__(x, y)
class tilelang.contrib.cutedsl.reduce.BitAndOp

Bitwise AND reduction operator

static __call__(x, y)
class tilelang.contrib.cutedsl.reduce.BitOrOp

Bitwise OR reduction operator

static __call__(x, y)
class tilelang.contrib.cutedsl.reduce.BitXorOp

Bitwise XOR reduction operator

static __call__(x, y)
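
The six reducer classes share one shape: a class with a static `__call__(x, y)` combining two values, so the class itself can be passed to AllReduce as the `reducer` argument. A minimal Python sketch of that pattern (plain Python semantics, not the CuTeDSL-typed originals):

```python
class SumOp:
    """Sum reduction operator: combines two values by addition."""
    @staticmethod
    def __call__(x, y):
        return x + y

class MaxOp:
    """Max reduction operator."""
    @staticmethod
    def __call__(x, y):
        return x if x > y else y

class BitXorOp:
    """Bitwise XOR reduction operator (integer operands)."""
    @staticmethod
    def __call__(x, y):
        return x ^ y
```

Because `__call__` is static, the operator can be invoked either through the class (`SumOp.__call__(x, y)`) or through an instance (`SumOp()(x, y)`); the reduction itself carries no state.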
tilelang.contrib.cutedsl.reduce.bar_sync(barrier_id, number_of_threads)
tilelang.contrib.cutedsl.reduce.bar_sync_ptx(barrier_id, number_of_threads)
class tilelang.contrib.cutedsl.reduce.CumSum1D(threads, reverse)

1D cumulative sum operation. Based on tl::CumSum1D from reduce.h

Template params:

  • threads: Number of threads

  • reverse: Whether to cumsum in reverse order

Parameters:
  • threads (cutlass.Constexpr[int])

  • reverse (cutlass.Constexpr[bool])

threads
reverse
SEG = 32
run(src, dst, N)

Perform 1D cumulative sum.

Parameters:
  • src (cutlass.cute.Pointer) -- Source pointer

  • dst (cutlass.cute.Pointer) -- Destination pointer

  • N -- Number of elements (must be compile-time constant or small)
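
The result CumSum1D.run computes can be stated with a short host-side reference (plain Python over a list, standing in for the device pointers): `dst[i]` is the sum of `src[0..i]` in the forward direction, or of `src[i..N-1]` when `reverse` is set.

```python
def cumsum_1d_reference(src, reverse=False):
    """Host-side reference for the 1D cumulative sum semantics."""
    n = len(src)
    out = [0] * n
    acc = 0
    # Walk the elements back-to-front when reverse is set.
    order = range(n - 1, -1, -1) if reverse else range(n)
    for i in order:
        acc += src[i]
        out[i] = acc
    return out
```

The device implementation reaches the same result with a parallel warp-segment scan (note the `SEG = 32` attribute) rather than this sequential loop.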

class tilelang.contrib.cutedsl.reduce.CumSum2D(threads, dim, reverse)

2D cumulative sum operation. Based on tl::CumSum2D from reduce.h

Template params:

  • threads: Number of threads (must be a power of 2, 32-1024)

  • dim: Axis along which to cumsum (0 or 1)

  • reverse: Whether to cumsum in reverse order

Parameters:
  • threads (cutlass.Constexpr[int])

  • dim (cutlass.Constexpr[int])

  • reverse (cutlass.Constexpr[bool])

threads
dim
reverse
SEG = 32
TILE_H
run(src, dst, H, W)

Perform 2D cumulative sum.

Parameters:
  • src (cutlass.cute.Pointer) -- Source pointer

  • dst (cutlass.cute.Pointer) -- Destination pointer

  • H -- Number of rows

  • W -- Number of columns (should be <= 32 for single-segment case)
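
A host-side reference for the 2D semantics, assuming NumPy-style axis numbering (an assumption consistent with "Axis along which to cumsum (0 or 1)" but not spelled out by the source): `dim=0` scans down each column, `dim=1` scans across each row, with `reverse` flipping the scan direction.

```python
def _scan(xs, reverse):
    """Sequential inclusive scan over one line of elements."""
    acc, out = 0, [0] * len(xs)
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    for i in order:
        acc += xs[i]
        out[i] = acc
    return out

def cumsum_2d_reference(src, dim, reverse=False):
    """Host-side reference for the 2D cumulative sum (list of rows)."""
    H, W = len(src), len(src[0])
    if dim == 1:
        # Scan along axis 1: independently across each row.
        return [_scan(row, reverse) for row in src]
    # dim == 0: scan along axis 0, i.e. down each column.
    cols = [_scan([src[r][c] for r in range(H)], reverse) for c in range(W)]
    return [[cols[c][r] for c in range(W)] for r in range(H)]
```

As with CumSum1D, the device code performs this as a parallel segmented scan; the W <= 32 guidance above corresponds to the single-segment (`SEG = 32`) fast path.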

class tilelang.contrib.cutedsl.reduce.NamedBarrier(all_threads)

Named barrier policy for AllReduce; uses bar.sync instead of __syncthreads. Based on tl::NamedBarrier<all_threads> from reduce.h

all_threads
tilelang.contrib.cutedsl.reduce.AllReduce(reducer, threads, scale, thread_offset, all_threads=None)

AllReduce operation implementing warp/block-level reduction. Based on tl::AllReduce from reduce.h

Parameters:
  • reducer -- Reducer operator class (SumOp, MaxOp, etc.)

  • threads -- Number of threads participating in reduction

  • scale -- Reduction scale factor

  • thread_offset -- Thread ID offset

  • all_threads -- Total number of threads in block

Returns:

A callable object with run() and run_hopper() methods
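
Warp/block all-reduces of this kind are typically built as a butterfly (XOR-shuffle) exchange, where after log2(threads) rounds every lane holds the full reduction. The host-side sketch below models that pattern for a power-of-two thread count; it takes a reducer in the `static __call__(x, y)` shape documented above, and deliberately does not model `scale`, `thread_offset`, or `all_threads`, whose exact device-side roles the sketch cannot confirm.

```python
def all_reduce_reference(reducer, values):
    """Butterfly all-reduce model: every 'lane' ends with the full reduction.

    values: one element per participating thread; len(values) must be a
    power of two, mirroring the warp-shuffle exchange pattern.
    """
    n = len(values)
    assert n & (n - 1) == 0, "thread count must be a power of two"
    vals = list(values)
    offset = n // 2
    while offset >= 1:
        # Each lane combines with its XOR partner, halving the stride.
        vals = [reducer(vals[i], vals[i ^ offset]) for i in range(n)]
        offset //= 2
    return vals

class SumOp:
    """Sum reducer in the documented static __call__ shape."""
    @staticmethod
    def __call__(x, y):
        return x + y
```

In the real AllReduce, the exchange runs via warp shuffles (with the NamedBarrier policy used when the bar.sync path is needed across warps); run_hopper() targets the Hopper-specific variant.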