Mastering CUDA and High-Performance Computing, Part VIII

Lorenzo Bradanini and Lorenzo Tettamanti

Apr 07, 2026

Where Part VII Left Us

Part VII ended with a promise and an architectural cliffhanger.

The promise: on Hopper, the compute-to-load instruction ratio in a GEMM inner loop approaches infinity from the SMSP’s perspective.

The cliffhanger: one instruction moves a 128×128 BF16 tile, the TMA unit generates all the addresses, and something called an mbarrier replaces the __syncthreads() you have been writing since your first CUDA “hello world”.

Let us unpack exactly what that means, why NVIDIA made those choices, and what you have to understand to write, read, or debug CUTLASS 3.x kernels without feeling like you are reading a foreign language.

We will go very deep. There is no other way.

The Problem cp.async Did Not Fully Solve

Part VII established that cp.async is superior to the conventional LDG → STS path because it removes the destination registers from the scoreboard. The SMSP issues the copy, hands it off to the Async Copy Engine, and is immediately free to issue the next instruction.

This is genuinely great. But it has a hidden cost that only becomes visible when you look at the SMSP instruction stream of a real GEMM kernel.

Consider a 128×128×32 BF16 tile. Loading that tile requires 128 × 32 BF16 elements = 4096 BF16 = 8 KB. At 16 bytes per cp.async, that is 512 individual CP.ASYNC.CA.SHARED.GLOBAL instructions.

Those 512 instructions have to be fetched from the instruction cache, decoded, dispatched through the MIO unit, and tracked by the hardware. They consume SMSP instruction bandwidth even though they produce no register results.

On Ampere, the SMSP can issue roughly one 128-bit cp.async every 4 cycles per SMSP. For 512 instructions, that is approximately 2048 SMSP cycles per tile load, just for the instruction overhead. The actual data movement happens asynchronously, but the instruction stream is not free.

For large tiles this is manageable. For smaller tiles, or for architectures where you want the SMSP to spend every cycle on tensor core instructions, it is a ceiling.

Hopper (SM90, H100) was designed to remove that ceiling entirely. The answer is the Tensor Memory Accelerator.

Tensor Memory Accelerator

The TMA is a hardware unit introduced in Hopper that performs multi-dimensional tensor copies between global memory and shared memory (or distributed shared memory across a cluster, but we will get to clusters).

It accepts a tensor descriptor computed on the host and a set of coordinates computed on the device, and it handles everything else: address computation, striding, data type conversion, out-of-bounds clamping, cache policy, and transaction completion signaling.

Let us be concrete about what “everything else” means.

In a conventional tiled GEMM, for every tile you load, every thread in the warp must compute its portion of the global memory address.

That address computation involves the block index, the thread index, the tile dimensions, the matrix stride, and the element size. It is entirely deterministic arithmetic that produces the same result every time you execute the same tile iteration.

It is also arithmetic that the SMSP has to execute. On Ampere with cp.async, that arithmetic still happens in the SMSP even though the subsequent memory transaction is asynchronous.

The TMA eliminates that arithmetic from the SMSP. One thread issues one instruction with a tensor descriptor handle and a pair of (y, x) coordinates.

The TMA unit uses those coordinates and the descriptor’s metadata to compute every address needed for the entire tile transfer, scatter or gather the data, and write it to shared memory. The SMSP emitted one instruction. One.

This is not a minor optimization. It is a qualitative change in what the SMSP does during a GEMM kernel. On Hopper, the SMSP’s job is to run WGMMA.MMA_ASYNC instructions.

The TMA’s job is to move data. These two jobs happen simultaneously, on separate hardware units, and the only communication between them is an mbarrier synchronization object.

The Tensor descriptor

Before a Hopper kernel runs, the host must create a tensor descriptor using cuTensorMapEncodeIm2col or, more commonly for GEMM, cuTensorMapEncodeTiled. This is a 128-byte opaque structure stored in constant memory (or passed through a register and loaded into the L1).

The descriptor encodes:

Base pointer: the global memory address of tensor element [0, 0, 0, ...].

Global dimensions: the actual size of each dimension in the full tensor, in elements. For an M×K matrix A, this is {M, K} (or {K, M} if column-major).

Global strides: the byte stride between consecutive elements in each dimension. For a row-major matrix with K columns and BF16 elements, the stride between row i and row i+1 is K × 2 bytes. These strides allow arbitrary non-contiguous tensors.

Box dimensions: the size of the tile to be transferred in each dimension. For a 128×32 BF16 tile, this is {128, 32}.

Interleave and swizzle mode: how data should be rearranged during the transfer to produce a shared memory layout that avoids bank conflicts. This is the part that replaces all the padding arithmetic from Part VII.

Element stride and data type: how to interpret the raw bytes.

The descriptor is created once on the CPU and passed to the kernel. On the device, a single warp or even a single thread can then use this descriptor to initiate a full tile transfer with one instruction, because all the per-tile invariant information is already encoded.

This is a deliberate design choice: move the expensive computation (descriptor creation) to the host, where latency is irrelevant relative to the kernel launch overhead, so that the device-side instruction can be as cheap as possible.

The TMA instruction itself

The PTX for a 2D TMA load looks like this:

cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes
    [smem_dst], [gmem_desc, {coord_y, coord_x}], [mbar];

Let us parse every token.

cp.async.bulk means this is an asynchronous bulk copy; “bulk” distinguishes it from scalar cp.async. The transfer size is determined by the descriptor, not encoded in the instruction.

tensor.2d means the TMA will interpret the coordinates as a 2D tensor access. There are variants for 1D through 5D tensors.

shared::cluster is the destination scope: shared memory that is visible to the entire thread block cluster (more on clusters shortly). For single-CTA kernels this is simply shared memory.

global is the source: global memory, indexed via the descriptor.

mbarrier::complete_tx::bytes is the completion signaling mechanism. When the transfer completes, the TMA will signal a mbarrier object, decrementing its transaction count.

When the count reaches zero, threads waiting on the barrier are unblocked. This replaces consumer_wait() and __syncthreads() in the sense that the barrier itself tracks both the data arrival and the thread synchronization in a single primitive.

[smem_dst] is the destination address in shared memory.

[gmem_desc, {coord_y, coord_x}] is the descriptor plus coordinates. The TMA extracts the base pointer, strides, and box dimensions from the descriptor, applies the coordinates, and generates the full address range.

[mbar] is a pointer to the mbarrier object in shared memory.

In CUDA C++, the cuda::experimental::tma:: API (or __pipeline_memcpy_async for simpler cases) generates this instruction. The canonical production path is through CUTLASS 3.x’s cute::copy with a TMA copy atom, which we will examine in the CUTLASS section.

A synchronization primitive you have not seen before

__syncthreads() is a full thread block barrier. Every thread in the block must arrive before any thread proceeds.

It is implemented via a shared counter that is decremented by each arriving thread and checked by a hardware barrier mechanism. Its cost is proportional to thread block size, and it cannot distinguish between “I’m done computing” and “my data has arrived from the DMA engine”.

mbarrier (memory barrier, or more precisely, the Hopper barrier object) solves both of those problems.

An mbarrier object is a 64-bit value stored in shared memory. It has two phases, expect and arrive, and it tracks two distinct counts:

The arrival count is decremented by threads calling mbarrier.arrive or mbarrier.arrive_drop. When this count hits zero, the barrier phase flips.

The transaction count is decremented by the TMA engine itself when a bulk copy completes. This is the complete_tx::bytes in the PTX instruction above. The programmer initializes this count to the expected number of bytes that the TMA will deliver.

The barrier is “complete” when both counts reach zero: all participating threads have arrived, and all expected TMA transactions have completed.

This means you can have a consumer wait on a barrier that is signaled partly by threads and partly by hardware DMA engines, with no polling loop, no atomics in the critical path, and no __syncthreads() that serializes all 128 threads in the block.

The setup looks like this in CUDA C++:

__shared__ cuda::barrier<cuda::thread_scope_block> mbar;

// One thread initializes the barrier for N_THREADS participants
if (thread_rank == 0) {
    init(&mbar, N_THREADS);
    // Tell the barrier to also expect TMA_BYTES bytes of async data
    cuda::device::barrier_native_handle(mbar).arrive_tx(TMA_BYTES);
}
__syncthreads();  // This syncthreads is to publish the initialized mbar

// Producer thread issues TMA
if (thread_rank == 0) {
    tma_load(&mbar, smem_A, gmem_desc_A, tile_coord_m, tile_coord_k);
}

// All threads arrive at the barrier (decrement arrival count)
auto token = cuda::device::barrier_native_handle(mbar).arrive();

// Wait for both arrival count and transaction count to reach zero
cuda::device::barrier_native_handle(mbar).wait(std::move(token));

Note the asymmetry: one thread issues the TMA, all threads participate in the barrier synchronization. This is not a bug; it is the design.

The TMA is a singleton operation that one thread initiates, but the data it delivers is consumed by all threads, so all threads must synchronize on its completion.

The arrive_tx call informs the barrier that TMA bytes are expected. Without it, the barrier would complete as soon as all threads arrived, regardless of whether the DMA data had landed in shared memory. That would be a race condition.

The token returned by arrive is a phase token. mbarrier operates in alternating phases (like a double buffer at the synchronization level), and the token ensures that wait waits on the correct phase.

This is how Hopper avoids the ABA problem in barrier reuse: you cannot accidentally wait on a barrier phase that already completed in a previous iteration.

Warpgroup MMA

Part VII did not cover the compute side of Hopper in depth because the memory side was already enough to digest. Now we need to talk about WGMMA, and it is equally radical.

On Ampere, tensor core instructions are issued per-warp: HMMA.1688 or the PTX mma.sync.aligned operates on 16×8×16 tiles with 32 threads participating. Each warp independently executes its tile of the matrix multiply.

Warp-level tensor core instructions were already a significant departure from SIMT, since all 32 threads in a warp cooperate to produce a single 16×8 output tile. But the warp is still the unit of scheduling and the unit of tensor core execution.

On Hopper, the tensor core instruction is warpgroup-level: WGMMA.MMA_ASYNC operates on a group of 4 warps (128 threads) simultaneously. The input tile dimensions for BF16 are:

A: 64×16 per warpgroup (contributed from registers or shared memory)
B: 16×256 per warpgroup (always from shared memory)
C/D: 64×256 accumulator (in registers, split across the 128 threads)

A single WGMMA.MMA_ASYNC instruction computes a 64×256×16 BFGEMM, producing 64×256 = 16,384 output elements in one instruction.

For comparison, an Ampere mma.sync.aligned with the largest BF16 shape produces 16×8×16 BFGEMM, 128 output elements.

The output volume ratio is 128:1. This is what “approaching infinite compute-to-load ratio” means in practice.

The _ASYNC suffix is critical: WGMMA.MMA_ASYNC does not complete synchronously. The 4 warps issue the instruction and the result is not guaranteed to be in the accumulator registers until a WGMMA.WAIT_GROUP instruction is executed.

The hardware can overlap multiple WGMMA operations in flight simultaneously, and the programmer must insert explicit waits before reading the accumulators.

The programming model therefore looks like this at the instruction level:

WGMMA.MMA_ASYNC D, A, B   ; issue tile multiply k=0
WGMMA.MMA_ASYNC D, A, B   ; issue tile multiply k=1
WGMMA.MMA_ASYNC D, A, B   ; issue tile multiply k=2
...
WGMMA.WAIT_GROUP 0         ; wait for all outstanding WGMMAs
; D accumulator registers now hold valid results

In CUDA C++, this is exposed through the cute::wgmma abstractions in CUTLASS 3.x, or through the lower-level cuda::wgmma:: namespace. Direct PTX is also possible but strongly inadvisable outside of research contexts.

The reason B must always come from shared memory (not registers) is a hardware constraint. The tensor core units on Hopper are wired directly to the shared memory arrays.

The B operand is fetched directly from the shared memory banks by the tensor core datapath, without going through the register file.

This is why the TMA delivering B into shared memory is on the critical path, but there is no “load B from shared memory to registers” step. The tensor core reads shared memory directly.

A can come from either registers or shared memory. For the highest-performance kernels, A also comes from shared memory, which means both operands bypass the register file entirely on the compute side. The register file holds only the C/D accumulator.

Thread Block Clusters

Hopper introduced a new level of the GPU hierarchy between the thread block and the grid: the thread block cluster.

A cluster is a group of up to 8 thread blocks that are guaranteed to be co-scheduled on the same GPC (Graphics Processing Context, a group of SMs sharing an L2 slice).

Thread blocks within a cluster can access each other’s shared memory via the Distributed Shared Memory (DSMEM) mechanism, using TMA to move data between SMs without going through L2.

The PTX instruction for a cross-SM TMA transfer is:

cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes
    [smem_dst], [gmem_desc, {coord_y, coord_x}], [mbar];

This is the same instruction as a regular TMA load, with the shared::cluster scope indicating that the destination is visible cluster-wide. The TMA unit manages the inter-SM data movement transparently.

Why does this matter for GEMM? Consider a cluster of 2 CTAs, each responsible for a different row block of C. Both need access to the same columns of B.

With clusters, CTA 0 loads B into its shared memory via TMA, and CTA 1 can read CTA 0’s shared memory directly via DSMEM. B is loaded once and consumed by two CTAs. This effectively doubles the B reuse without doubling the shared memory per CTA.

For an N=8 cluster, 8 CTAs share the B tile load, amortizing the HBM bandwidth for B across 8x more compute.

This is the mechanism by which Hopper GEMM kernels approach hardware peak on large problem sizes: the cluster architecture allows the working set of the entire computation to be held in distributed shared memory, with HBM touched only once per element.

The cluster size is specified at kernel launch:

cudaLaunchConfig_t config = {};
config.gridDim = grid;
config.blockDim = block;
cudaLaunchAttribute attr;
attr.id = cudaLaunchAttributeClusterDimension;
attr.val.clusterDim.x = 2; // 2 CTAs per cluster
attr.val.clusterDim.y = 1;
attr.val.clusterDim.z = 1;
config.attrs = &attr;
config.numAttrs = 1;
cudaLaunchKernelEx(&config, my_kernel, args...);

Cluster scheduling is cooperative: the hardware will attempt to co-locate the CTAs of a cluster on the same GPC, but this is a hint, not a guarantee for clusters larger than what fits on one GPC.

On H100 SXM5 with 132 SMs organized into 7 GPCs, clusters of up to 8 are always satisfied within a single GPC.

The Persistent Kernel Model

On Ampere, a typical GEMM kernel is a “grid kernel”: each thread block handles one (M_TILE, N_TILE) output tile and exits. The CUDA runtime schedules new thread blocks as soon as SM capacity becomes available.

For large matrices this is fine: there are enough tiles that the SM scheduler is always busy.

For smaller matrices, the overhead of launching and retiring thread blocks dominates. Each thread block must load its A and B tiles from scratch, write its C tile to global memory, and terminate. The shared memory state is not reused across thread blocks.

Hopper’s memory hierarchy and cluster model make a different approach attractive: persistent kernels.

In a persistent kernel, a thread block (or warpgroup) does not terminate after processing one tile.

Instead, it loops over multiple output tiles, maintaining the A and B tiles in shared memory between iterations where the tile is reused, and fetching new tiles via TMA only when necessary. The kernel terminates only after all output tiles in its assigned partition are complete.

CUTLASS 3.x implements this via the Tile Scheduler, a device-side component that manages the assignment of output tiles to persistent CTAs.

The scheduler atomically increments a work counter stored in global memory, assigning the next available (m_tile, n_tile) pair to the requesting CTA. When all tiles are assigned, the scheduler signals completion and the CTA exits the work loop.

The advantages are concrete:

L2 reuse improves because the same CTA processes multiple adjacent tiles, and the A or B tiles they share remain in L2 (or even in shared memory) between iterations.

Thread block launch overhead is amortized: the GPU launches one wave of persistent CTAs and they run to completion, rather than launching thousands of transient blocks.

Irregular problem sizes are handled more gracefully: the final partial tile is processed by whichever CTA happens to claim it, without requiring separate epilogue kernel launches.

The disadvantage is programming complexity: you are writing a software scheduler inside a CUDA kernel, with all the attendant concerns about correctness under concurrent access and load balancing across heterogeneous tile work.

CUTLASS handles this for you, which is one reason the library exists.

The CUTLASS 3.x Architecture

CUTLASS 3.x is a complete rewrite of CUTLASS 2.x, built on a new abstraction layer called CuTe (CUDA Template library).

Understanding CUTLASS 3.x requires understanding CuTe, because CUTLASS 3.x is essentially CuTe plus a set of kernel templates that use it.

CuTe: Layouts as First-Class Objects

CuTe’s central idea is that a layout is a function from a logical coordinate space to a physical offset in memory. A layout encodes both shape (the extents of each dimension) and stride (the distance in elements between consecutive elements along each dimension).

In CuTe, a layout is written as Shape:Stride. For example, a 4×8 row-major matrix with elements of size 2 bytes has layout (4,8):(8,1), meaning: the outer dimension (rows) has stride 8 (each row is 8 elements apart), and the inner dimension (columns) has stride 1. A column-major version of the same matrix would be (4,8):(1,4).

The power of this representation is that it composes. A tiling operation is just a layout composition. A swizzle (bit permutation of addresses to avoid bank conflicts) is a layout transformation that permutes the address bits in a specific pattern.

The entire address computation for a tiled, swizzled, transposed tensor is expressed as a sequence of layout compositions that the compiler evaluates at compile time, producing a single address formula.

This is why CUTLASS 3.x can express complex access patterns without any runtime branching in the address computation.

using LayoutA = Layout<Shape<_128, _32>, Stride<_32, _1>>;  // 128x32 row-major
using LayoutA_Swizzled = ComposedLayout<Swizzle<3,3,3>, LayoutA>;

The Swizzle<B,M,S> template encodes a specific XOR-based address permutation. B bits are permuted with S bits, offset by M bits.

For BF16 with 32 banks of 4 bytes each, the correct swizzle eliminates all bank conflicts without any padding. CUTLASS ships with the correct swizzle parameters for every element type and tile dimension it supports.

The MMA Atom and Copy Atom

In CUTLASS 3.x, a tensor core instruction is an MMA atom: a typed object that describes the input/output shapes, thread-to-data mapping, and instruction to emit. The canonical Hopper MMA atom for BF16 is:

using MMA_Atom = MMA_Atom<SM90_64x256x16_F32BF16BF16F32_SS>;

The name encodes: SM90 (Hopper), 64×256×16 tile dimensions, F32 accumulator, BF16 A and B inputs, F32 output, SS meaning both A and B come from shared memory.

A TMA copy is a copy atom:

using Copy_Atom_A = Copy_Atom<SM90_TMA_LOAD, bfloat16_t>;

The CUTLASS kernel template composes these atoms with tile dimensions, cluster shapes, and pipeline stages into a complete kernel:

using CollectiveMainloop = cutlass::gemm::collective::CollectiveMma<
    cutlass::gemm::MainloopSm90TmaGmmaRmemAAccumulator<3>,  // 3-stage pipeline
    Shape<_128, _256, _64>,                                   // tile MxNxK
    bfloat16_t, LayoutA,
    bfloat16_t, LayoutB,
    TiledMma,
    GmemTiledCopyA,
    SmemLayoutA,
    SmemCopyAtomA,
    cute::identity,
    GmemTiledCopyB,
    SmemLayoutB,
    SmemCopyAtomB,
    cute::identity
>;

This is verbose, but every template parameter maps to a concrete hardware mechanism: MainloopSm90TmaGmmaRmemAAccumulator<3> means “use TMA for loads, use WGMMA for compute, keep the accumulator in registers, with 3 pipeline stages”.

The compiler resolves all of this into a kernel where the main loop body is a tight sequence of WGMMA.MMA_ASYNC instructions, interrupted only by TMA-initiated mbarrier waits at stage boundaries.

The address computation for the TMA loads is essentially absent from the device code, having been moved to the descriptor construction on the host.

The Producer-Consumer Warpgroup Model

CUTLASS 3.x on Hopper adopts a warpgroup specialization model within each CTA. A thread block of 128 threads (one warpgroup) is divided at compile time into a producer warpgroup and one or more consumer warpgroups.

The producer warpgroup is responsible for issuing TMA loads (one thread per load, the others arrive at barriers). The consumer warpgroups are responsible for issuing WGMMA.MMA_ASYNC instructions and running the epilogue (writing C to global memory via the output TMA store).

This specialization is explicit:

if (warpgroup_id == 0) {
    // Producer: issue TMA loads into shared memory stages
    collective_mainloop.load(params, smem_tensors, pipeline, pipeline_state, k_tile_count);
} else {
    // Consumer: issue WGMMA instructions, run epilogue
    collective_mainloop.mma(params, smem_tensors, accumulators, pipeline, pipeline_state, k_tile_count);
    collective_epilogue.store(params, accumulators, ...);
}

The producer and consumer warpgroups communicate exclusively through the mbarrier-protected shared memory pipeline. There is no __syncthreads() between them in steady state. The barriers are sufficient.

This is architecturally important: __syncthreads() is a full CTA barrier. In a producer-consumer model where the producer and consumer have different amounts of work to do per iteration, a full CTA barrier would force the faster group to wait for the slower one on every iteration.

The mbarrier primitive allows asymmetric synchronization: the consumer waits only for the data it needs, not for the producer to reach any particular point in its control flow.

The N-Stage Pipeline on Hopper

Part VII described double buffering (2 stages) on Ampere. On Hopper, CUTLASS uses 3 to 8 stages by default, with the optimal stage count depending on the tile size, problem size, and occupancy target.

The pipeline state machine on Hopper manages N shared memory stages, N producer mbarriers (one per stage, signaling data arrival), and N consumer mbarriers (one per stage, signaling that the consumer is done reading and the stage can be reused).

The steady-state loop looks like this conceptually:

Stage 0: [TMA load A0, B0] → [mbar_full[0] signaled] → [WGMMA on A0,B0] → [mbar_empty[0] signaled]
Stage 1: [TMA load A1, B1] → [mbar_full[1] signaled] → [WGMMA on A1,B1] → [mbar_empty[1] signaled]
Stage 2: [TMA load A2, B2] → [mbar_full[2] signaled] → [WGMMA on A2,B2] → [mbar_empty[2] signaled]
Stage 0: [TMA load A3, B3] → ...

The producer issues TMA loads into stage i and signals mbar_full[i]. The consumer waits on mbar_full[i], runs WGMMA, signals mbar_empty[i], and moves to stage (i+1) % N.

The producer waits on mbar_empty[i] before reusing that stage for the next load. This circular buffer in shared memory, managed by mbarrier pairs, is the fundamental data structure of a Hopper GEMM kernel.

The prologue loads N-1 tiles before the main loop begins (same invariant as Part VII’s double buffer prologue, just with more stages). The epilogue drains the remaining in-flight tiles after the k loop exits.

With 3 stages on an H100 with 228 KB of shared memory per SM (up from Ampere’s 192 KB), a 128×256 BF16 tile pair consumes approximately:

A tile: 128 × 64 × 2 bytes = 16 KB
B tile: 64 × 256 × 2 bytes = 32 KB
Per stage: 48 KB
3 stages: 144 KB
Remaining for mbarriers and accumulator spills: 84 KB

At 3 stages and a 128×256 tile, one CTA per SM is feasible. Two CTAs would require 288 KB, which exceeds the 228 KB shared memory limit.

Occupancy is therefore 1 CTA per SM, which is fine on Hopper because the single CTA fills the SM with WGMMA instructions and the TMA unit is fully occupied.

This is a fundamentally different occupancy philosophy from Ampere. On Ampere, you often needed 2-4 CTAs per SM to hide memory latency through warp-switching.

On Hopper, one CTA with TMA and WGMMA already achieves near-peak throughput on large tiles, because the hardware units that matter (TMA, tensor cores) are all fully occupied.

What the Profiler Shows You on Hopper

The Nsight Compute metrics shift dramatically compared to Ampere.

smsp__warp_issue_stalled_long_scoreboard approaches zero. Not because the memory is fast, but because TMA loads do not involve the scoreboard at all. The SMSP is not waiting for memory; it is not the unit that issued the memory request.

smsp__warp_issue_stalled_mio_throttle is also low. The single TMA instruction per tile barely loads the MIO unit.

smsp__warp_issue_stalled_wgmma_global_wait is the new dominant stall: this is the SMSP waiting for a WGMMA.WAIT_GROUP to complete so it can read the accumulator registers.

This stall is unavoidable for kernels that read their accumulators between WGMMA groups (e.g., for split-K partial reductions). For kernels with long K dimensions, the WGMMA pipeline fills up and this stall disappears.

sm__pipe_tensor_op_hmma_cycles_active should be 80-95% for a well-tuned Hopper GEMM. Anything below 70% suggests either a pipeline depth problem (too few stages) or a cluster scheduling problem (the GPC is not scheduling the cluster CTAs together).

l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld counts shared memory read operations. For a kernel where both A and B are read from shared memory by WGMMA (SS variant), this metric reflects tensor core throughput, not programmer-visible loads. The tensor cores are reading shared memory directly, and this shows up in the LSU metrics.

The TMA throughput metrics are in the tma namespace: tma__read_bytes and tma__read_transactions. A kernel that is achieving peak TMA throughput will show TMA bandwidth close to the theoretical HBM bandwidth, because TMA is the only thing accessing HBM.

The key diagnostic insight on Hopper: if your WGMMA utilization is high and your TMA bandwidth is high, the kernel is good. The two hardware units are the bottleneck by design. Everything else should be idle or near-idle.

The Roofline on Hopper, revisited

Part VII introduced the roofline model and noted that the useful diagnosis is hierarchical: not “memory-bound” but “memory-bound at the L2 level, achieving 60% of L2 peak”. On Hopper the hierarchy has the same levels (L1, L2, HBM) but new slopes.

H100 SXM5 roofline parameters:

HBM3 peak bandwidth: 3.35 TB/s
L2 peak bandwidth: approximately 12 TB/s (across 50 MB of L2, two slices)
Shared memory peak bandwidth: approximately 33 TB/s aggregate (SM-local)
Tensor core peak (dense BF16): 494 TFLOP/s

Ridge points:

HBM ridge: 494 / 3.35 ≈ 147 FLOP/byte
L2 ridge: 494 / 12 ≈ 41 FLOP/byte
Shared memory ridge: 494 / 33 ≈ 15 FLOP/byte

For a GEMM with arithmetic intensity of 147 FLOP/byte or above, the kernel should be compute-bound assuming the memory hierarchy is properly utilized. Below 147 FLOP/byte, it is HBM-bandwidth-bound.

Below 41, even a perfect L2 hit rate cannot save you. Below 15, the tensor core throughput is limited by shared memory bandwidth, which means either bank conflicts or tile sizes that do not saturate the WGMMA datapath.

The key new insight on Hopper: TMA changes the shape of the memory hierarchy’s contribution. The SMSP instruction bandwidth, which was a secondary bottleneck on Ampere (and a primary bottleneck for small tiles), is effectively removed from the HBM bandwidth calculation.

The raw bandwidth to shared memory is now limited only by the TMA unit’s throughput, which the H100 documentation lists at approximately 900 GB/s aggregate (across all TMA units on all SMs).

This is below the HBM bandwidth of 3.35 TB/s, so for kernels that are purely bandwidth-limited (not compute-bound), TMA is not the constraint; HBM is.

For compute-bound kernels with large tiles, TMA’s instruction offloading is what enables the SMSP to run WGMMA at full throughput.

A Brief Look at Blackwell

Blackwell (SM100, B100/B200) was announced in March 2024 and began shipping to hyperscalers in late 2024. The architectural trajectory established by Hopper continues and accelerates.

The Blackwell tensor core introduces a 5th generation MMA with FP4 support (MXFP4 and NF4 formats), enabling 20 PFLOP/s peak at the full B200 system level (dual-die). The FP8 dense throughput is approximately 9 PFLOP/s per chip.

TMA on Blackwell gains native support for im2col pattern transforms (relevant for convolutions) and transposed stores, reducing the need for separate transpose kernels.

The cluster size limit increases to 16 CTAs (from 8 on Hopper), further amortizing B tile loads across more compute.

A new fifth-generation NVLink provides 1.8 TB/s bidirectional bandwidth per GPU in NVLink-connected systems (NVL72 rack), enabling multi-GPU kernels where the “global memory” seen by a TMA operation is distributed across 72 GPUs. This is the level at which the distinction between a single-GPU kernel and a distributed compute graph begins to blur.

CUTLASS 3.x supports Blackwell through new SM100 collective templates. The programming model is the same; the numbers are larger.

Conclusion

The trajectory from Volta through Ampere to Hopper is a coherent story: every generation pushes more of the data movement machinery off the SMSP and onto dedicated hardware.

Volta gave you tensor cores, so the SMSP stopped doing the arithmetic. Ampere gave you cp.async, so the SMSP stopped waiting for loads. Hopper gave you TMA, so the SMSP stopped issuing loads entirely.

The SMSP on a well-tuned Hopper GEMM kernel is a machine that does one thing: issue WGMMA.MMA_ASYNC. Everything else has been delegated.

This is not an accident. It is the logical endpoint of the observation that matrix multiply is the kernel that matters most for modern ML workloads, and the most efficient hardware for matrix multiply is hardware where the compute units are never idle.

Every architectural innovation from 2017 onwards has been an attack on a different reason why the compute units were idle: arithmetic latency (tensor cores), memory latency (cp.async), instruction bandwidth (TMA), inter-SM bandwidth (clusters, NVLink).

The mbarrier, the tensor descriptor, the warpgroup specialization, the producer-consumer pipeline, the tile scheduler: these are not ornamental complexity.

They are the mechanisms by which a 2024 GPU running a 2024 kernel achieves 80-90% of theoretical peak on matrix multiply, a number that would have seemed implausible to practitioners writing hand-tuned BLAS routines a decade ago.

Part IX will step back from the single-GPU picture and look at multi-GPU parallelism: tensor parallelism, pipeline parallelism, NCCL, and the question of how NVLink bandwidth interacts with the per-GPU compute performance we have spent eight parts building up.

The tools change; the principle does not: find the bottleneck, route around it, measure again.

The Software Frontier

Discussion about this post

Ready for more?