Mastering CUDA and High-Performance Computing, Part VII

A Deep Dive from Compiler Internals to High-Performance Parallel Computing

Lorenzo Bradanini and Lorenzo Tettamanti

Mar 27, 2026

Where Part VI Left Us

Part VI ended with a sentence that deserves to be unpacked:

cp.async instructions do not set the long scoreboard.
The register file is not involved, so no register’s bit is marked pending.
The SMSP issues the cp.async, the copy engine takes it, and the SMSP is immediately free to issue the next instruction for that warp.

This is not a minor optimization note.

It is a description of a fundamentally different execution model: one that requires you to abandon the mental model of “instruction issues, result arrives, next instruction proceeds”, and replace it with something more like a production pipeline in a factory:

stages overlap, buffers exist between them, and throughput is determined by the slowest stage, not the sum of all stage latencies.

Before we can make cp.async do useful work, we need an accurate model of what it is hiding from: the memory hierarchy.

The Memory Hierarchy of the A100

The A100 SXM4 has six levels of memory that matter to kernel programmers. They are not equally documented, and the numbers in marketing materials are frequently not the numbers in production code.

Registers

Each SM on Ampere has a 256 KB register file, shared across the four SMSPs: 64 KB per SMSP, with a 256-bit read port per cycle.

Register file access latency is effectively 0 cycles in the bypass case; for non-bypassed reads the cost is absorbed into the 4-cycle FMA pipeline. Registers are not a latency source. They are a capacity and bandwidth source.

The capacity limit is the one that matters: each thread can use at most 255 registers.

Pressure above this causes the compiler to spill values to local memory; a per-thread private region mapped to L1/L2/DRAM.

Spills are indistinguishable from any other global memory access at the hardware level: they go through the MIO unit, set the long scoreboard, and wait 400+ cycles for DRAM. Every spilled register costs two MIO operations.

Shared Memory / L1 Cache

Ampere’s per-SM L1 is a 192 KB pool partitioned between shared memory and the hardware L1 data cache.

The split is configurable (0/192, 32/160, 64/128, 100/92, 132/60, 160/32 (shared/cache, in KB)) via cudaFuncSetAttribute with cudaFuncAttributePreferredSharedMemoryCarveout.

Shared memory has 32 banks, each 4 bytes wide.

Bank index for a byte address:

bank = (address >> 2) & 31

Access patterns where multiple threads in a warp access different addresses in the same bank serialize.

One 4-bank conflict causes 4× the latency of the conflict-free case. The conflict-free latency is approximately 23 cycles; a 4-bank conflict extends this to ~35 cycles; an 8-bank conflict to ~51 cycles. The penalty scales linearly.

The broadcast exception: if all threads in a warp access the exact same address within a bank, the hardware services this as a single read and broadcasts the result.

Thirty-two threads accessing thirty-two different addresses that all map to the same bank is not a broadcast. It is a 32-way serialization.

L2 Cache

The A100 has 40 MB of L2 cache, split into two 20 MB slices. L2 hit latency: approximately 180–200 cycles, higher than most documentation implies.

Accesses to the local slice are ~160–180 cycles; accesses to the remote slice (requiring crossbar traversal) are ~200–230 cycles.

L2 bandwidth is approximately 4 TB/s aggregate. The ratio of L2 bandwidth to HBM bandwidth is approximately 15:1. Fitting a working set in L2 is qualitatively different from spilling it to HBM.

HBM2e

The A100 SXM4 has six HBM2e stacks providing a peak theoretical bandwidth of 2 TB/s. In practice: a kernel with access pattern regularity sufficient to saturate all channels achieves 1.6–1.9 TB/s.

Irregular access patterns with row buffer conflicts: 800 GB/s–1.2 TB/s. Random byte-granularity reads: tens of GB/s, due to cache line waste.

HBM2e latency, measured with L1 and L2 bypassed: approximately 450–600 cycles at 1410 MHz. Row buffer hits land around 300–350 cycles; misses around 550–650 cycles.

The consequence at 1410 MHz: 500 cycles × 0.71 ns/cycle ≈ 355 nanoseconds of stall per warp. In that window, 500 instruction issue slots across the SM go dark.

If every resident warp has issued an HBM load and is waiting, you have a 500-cycle stall with no eligible warp to rescue you.

This is the memory wall in concrete form. The solution is not a faster memory: it is to restructure data movement so that HBM latency is overlapped with computation.

The cp.async Instruction

cp.async was introduced in Ampere (sm_80). It performs a direct DMA-like transfer from global memory to shared memory, bypassing the register file entirely:

cp.async.ca.shared.global [dst], [src], size;
cp.async.cg.shared.global [dst], [src], size;   // bypass L1

The size parameter is 4, 8, or 16 bytes. The 16-byte variant is the most important: it issues a vectorized LDG.128, achieving maximum memory interface utilization.

What “bypassing the register file” actually means

The conventional load path:

LDG.128 R4, [R2]         ; → long scoreboard set for R4,R5,R6,R7
                         ; → warp stalls on any read of R4-R7
                         ; → 450-600 cycles later, HBM returns data
STS.128 [smem_ptr], R4   ; store registers → shared memory

This requires 4 registers in transit. The load sets four long scoreboard bits. The warp is ineligible for any instruction reading R4–R7 until the HBM transaction completes.

The cp.async path:

CP.ASYNC.CA.SHARED.GLOBAL [smem_dst], [R2], 0x10
; → no scoreboard bits set (no destination register)
; → warp immediately eligible to issue next instruction
; → data arrives in shared memory asynchronously

A dedicated Ampere Asynchronous Copy Engine receives the request via the MIO unit, takes ownership of the transaction, and performs the HBM load and shared memory write independently of the SMSP. The MIO unit is freed immediately after handoff.

The commit/wait mechanism

Commit (CP.ASYNC.COMMIT_GROUP): marks all preceding cp.async instructions as a commit group. Bookkeeping only,does not wait for anything.

Wait (CP.ASYNC.WAIT_GROUP N): stalls until at most N commit groups remain pending. N=0 is complete synchronization.

N=1 allows one in-flight group to remain outstanding while you compute on the previous.

auto pipe = cuda::make_pipeline();

for (int i = 0; i < BATCH_SIZE; i++)
    cuda::memcpy_async(smem[0][i], &gmem[base + i], sizeof(float4), pipe);
pipe.producer_commit();

for (int i = 0; i < BATCH_SIZE; i++)
    cuda::memcpy_async(smem[1][i], &gmem[base + BATCH_SIZE + i], sizeof(float4), pipe);
pipe.producer_commit();

pipe.consumer_wait();   // CP.ASYNC.WAIT_GROUP 1
__syncthreads();        // mandatory: propagates visibility to all threads

compute(smem[0]);

The __syncthreads() after consumer_wait is mandatory. consumer_wait ensures the data is in shared memory from the perspective of this warp.

Other warps in the thread block may not see the writes until __syncthreads() propagates them through the SM’s coherence domain.

Omitting it is a race condition: one that produces correct results most of the time and incorrect results unpredictably under heavy memory pressure.

The Double Buffer Pattern

A standard tiled GEMM loop is fully sequential: load tile, sync, compute, sync, repeat. The timeline is a flat chain of dependencies. For smaller problems or thinner tiles where T_load / T_compute > 1, the kernel is memory-bound.

The double buffer pattern breaks that chain:

Iter k:   |-- cp.async A[k] --|-- cp.async B[k] --|-- commit --|
                                                                |-- wait(k-1) --|-- compute(k-1) --|
Iter k+1: |-- cp.async A[k+1] --|-- cp.async B[k+1] --|-- commit --|
                                                                    |-- wait(k) --|-- compute(k) --|

Loads for iteration k+1 overlap with computation of iteration k. Memory latency is hidden as long as T_load(k+1) < T_compute(k). The pipeline then runs at the compute rate with zero memory stall.

This requires two ping-pong buffers in shared memory, doubling the shared memory requirement.

Doubling shared memory per thread block halves the maximum resident thread blocks per SM, reducing occupancy. The trade-off is explicit and computable.

Diagnostic signal: if smsp__warp_issue_stalled_long_scoreboard.avg.pct_of_peak_sustained_active exceeds 20%, memory latency is not being hidden. The first intervention is higher occupancy.

The second, when occupancy is already near maximum, is cp.async pipelining, which removes the long scoreboard from the equation entirely.

The Full Kernel Pattern

constexpr int TILE_M = 128, TILE_N = 128, TILE_K = 32;
constexpr int NUM_STAGES = 2;

__global__ void gemm_async_kernel(
    const __nv_bfloat16* __restrict__ A,
    const __nv_bfloat16* __restrict__ B,
    float* __restrict__ C,
    int M, int N, int K
) {
    __shared__ __nv_bfloat16 smem_A[NUM_STAGES][TILE_M][TILE_K];
    __shared__ __nv_bfloat16 smem_B[NUM_STAGES][TILE_K][TILE_N];
    float acc[4][4] = {};

    auto pipe = cuda::make_pipeline();
    const int k_tiles = K / TILE_K;

    // PROLOGUE: issue tile 0 before the main loop
    if (k_tiles > 0) {
        int row_a = threadIdx.x / TILE_K, col_a = threadIdx.x % TILE_K;
        if (row_a < TILE_M)
            cuda::memcpy_async(&smem_A[0][row_a][col_a],
                               &A[(blockIdx.y * TILE_M + row_a) * K + col_a],
                               sizeof(__nv_bfloat16), pipe);
        pipe.producer_commit();
    }

    // MAIN LOOP
    for (int k = 1; k < k_tiles; k++) {
        const int sw = k % 2, sr = (k - 1) % 2;

        int row_a = threadIdx.x / TILE_K, col_a = threadIdx.x % TILE_K;
        if (row_a < TILE_M)
            cuda::memcpy_async(&smem_A[sw][row_a][col_a],
                               &A[(blockIdx.y * TILE_M + row_a) * K + (k * TILE_K + col_a)],
                               sizeof(__nv_bfloat16), pipe);
        pipe.producer_commit();

        pipe.consumer_wait();   // CP.ASYNC.WAIT_GROUP 1
        __syncthreads();

        for (int ki = 0; ki < TILE_K; ki++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    acc[i][j] += __bfloat162float(smem_A[sr][threadIdx.y*4+i][ki])
                               * __bfloat162float(smem_B[sr][ki][threadIdx.x*4+j]);
        __syncthreads();
    }

    // EPILOGUE
    pipe.consumer_wait();   // CP.ASYNC.WAIT_GROUP 0
    __syncthreads();
}

Three things to internalize about this structure:

The prologue is not optional. Without issuing tile 0 before the loop, the first consumer_wait blocks on a commit group that doesn’t exist. Undefined behavior. The prologue establishes the “one stage ahead” invariant that the loop depends on.

Both synchronization primitives are required. consumer_wait ensures the DMA engine has written the data to shared memory for this warp. __syncthreads() ensures all threads in the block have reached this point before any thread reads.

They solve different problems. Neither substitutes for the other.

Stage read and stage write are never equal. The modular arithmetic guarantees sw ≠ sr for NUM_STAGES = 2. The DMA engine writes to one buffer while threads read from the other.

With NUM_STAGES ≥ 3 you deepen the pipeline, more latency hidden, more shared memory consumed.

N-Stage Generalization

With N stages, you issue N tiles’ worth of cp.async before the first computation begins. The latency is hidden when T_compute(tile) > T_HBM_load / N.

CUTLASS implements up to 5-stage pipelines for its Ampere GEMM kernels, with stage count as a compile-time template parameter swept by the profiler at tuning time. The shared memory cost scales linearly with stage count.

At some crossover point the shared memory requirement forces an occupancy reduction that exceeds the pipelining benefit.

This crossover depends on the specific kernel and problem size, which is why CUTLASS exposes the parameter rather than hardcoding it.

What the Profiler Shows You

Before pipelining (conventional LDG loads):

smsp__warp_issue_stalled_long_scoreboard — 40–70%, dominant stall
smsp__pipe_fma_cycles_active — 30–60%, computation starved

After pipelining (cp.async, double buffer):

smsp__warp_issue_stalled_long_scoreboard. <5%, cp.async sets no scoreboard bits
smsp__pipe_fma_cycles_active. 70–90% for a well-tuned kernel
Watch for smsp__warp_issue_stalled_mio_throttle; if you issue cp.async faster than the MIO unit can service them (~1 per 4 cycles per SMSP for 128-bit transfers), this stall replaces the scoreboard stall.
The fix is larger tiles or accepting the throttle if MIO throughput still exceeds compute throughput.

Bank Conflicts

The 32-bank model is documented. The practical implications for matrix access patterns are not.

In a tiled GEMM, tile A is loaded into shared memory in row-major layout, then read column-wise during the multiply.

For TILE_K = 32 and BF16 elements (2 bytes each), element [j][i] sits at byte offset j × 64 + i × 2. Bank index: (j × 16 + i/2) & 31.

For a warp reading column i (i fixed, j running 0..31) every pair of threads maps to the same bank. This is a 2-way bank conflict on every column read.

The fix is padding:

__shared__ __nv_bfloat16 smem_A[TILE_M][TILE_K + 2];   // +2 BF16 = +4 bytes per row

With the pad, element [j][i] is at byte offset j × 68 + i × 2. Bank index: (j × 17 + i/2) & 31. Since gcd(17, 32) = 1, the bank indices as j runs 0..31 form a complete permutation of 0..31. Zero conflicts.

The shared memory overhead is TILE_M × 4 bytes per buffer: 512 bytes for TILE_M = 128, trivial against the 8 KB tile.

CUTLASS’s Swizzle technique achieves the same result via address bit permutation rather than linear padding, which handles non-power-of-two tile sizes cleanly.

The arithmetic underneath is identical.

L1 Cache Policy

Cache behavior on Ampere is configurable at the instruction level:

Qualifier Behavior LDG.CA Cache in L1 (default) LDG.CG Bypass L1, go to L2 LDG.CS Streaming: insert at LRU position LDG.CV Bypass all caches (almost never correct)

In CUDA: __ldg() for L1-cached, __ldcg() / __ldcs() for the bypass variants. The compiler defaults to LDG.CA when uncertain.

For kernels that process each input element exactly once, elementwise operations, reductions, anything with no reuse, __ldcg() eliminates L1 pollution and preserves L1 capacity for data that does benefit from caching.

The effect in the profiler: lower L1 hit rate, unchanged L2 hit rate. The data skips one cache level without reducing effective bandwidth at the level where reuse actually exists.

The Roofline Model

The roofline model (Williams, Waterman, Patterson, 2009) plots FLOP/s against arithmetic intensity (FLOP/byte of DRAM traffic). For the A100 in FP32:

Peak compute: ~19.5 TFLOP/s
Peak HBM bandwidth: ~2 TB/s
Ridge point: ~9.75 FLOP/byte

Below the ridge: memory-bound. Above: compute-bound. The common mistake is treating DRAM bandwidth as the only line that matters.

The L2-based roofline has a ridge at ~4.9 FLOP/byte. The L1-based roofline has a ridge at ~1 FLOP/byte.

A kernel with strong L1 reuse can be compute-bound at an arithmetic intensity that looks memory-bound on the DRAM roofline.

A kernel that thrashes L2 will underperform the DRAM roofline because its effective bandwidth is below the theoretical peak. NCU’s roofline chart shows all three simultaneously.

The correct first diagnostic is hierarchical bandwidth analysis. Not “it’s memory-bound”; that’s a category.

The useful diagnosis is “it’s memory-bound at the L2 level, achieving 60% of L2 peak, because 40% of L2 bandwidth is wasted on non-reused data evicted before second use.” That tells you the fix.

The Tensor Memory Accelerator

Ampere introduced cp.async. Hopper (sm_90, H100) introduced the Tensor Memory Accelerator (TMA), the same idea taken to its logical conclusion.

With cp.async, the programmer still computes every element’s global memory address and constructs the instruction stream.

For a 128×128 BF16 tile, that is 512 vectorized 128-bit cp.async instructions consuming SMSP instruction bandwidth, even though the transfers are asynchronous.

TMA accepts a tensor descriptor (base address, dimensions, strides, element type) and issues a single instruction:

cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes
    [smem_dst], [gmem_desc, {coord_y, coord_x}], [mbar];

One instruction. One 128×128 BF16 tile. The TMA unit generates all the addresses, manages all the transactions, and signals completion via the mbarrier primitive;

a synchronization mechanism lighter than __syncthreads(), designed for producer-consumer coordination without a full SM barrier.

The consequence: on Hopper, the compute-to-load instruction ratio in a GEMM inner loop approaches ∞ from the SMSP’s perspective.

The SMSPs run wgmma.mma_async continuously; the TMA unit handles all data movement independently. CUTLASS 3.x is designed around this model. Part VIII will cover it in full.

Conclusion

The line from Part VI, “the SMSP is immediately free to issue the next instruction for that warp”, is the hinge on which this article turns.

The memory hierarchy imposes latencies that are not negotiable in nanoseconds: 23 cycles for shared memory, 180 for L2, 500 for HBM. These numbers do not change by complaining about them.

They change by structuring code so that the latency is incurred before the result is needed: issuing the memory request while computing on previously loaded data.

cp.async is the mechanism. Software pipelining is the pattern. Double buffering is the minimum viable instance. The commit/wait protocol maintains correctness while the DMA engine and the compute engine run simultaneously.

The bank conflict analysis and the L1 bypass discussion are extensions of the same idea: minimize latency and maximize effective bandwidth at every level of the hierarchy, so that by the time data arrives at the computation, it has traveled through the hardware as efficiently as physics allows.

The limits of this approach on Ampere are what motivate TMA on Hopper: an architecture where the gap between what the programmer expresses and what the hardware executes narrows further, approaching the regime where the programmer describes what should move and the hardware decides when.

Part VIII begins there.

The Software Frontier

Discussion about this post

Ready for more?