Mastering CUDA and High-Performance Computing, Part X
A Deep Dive from Compiler Internals to High-Performance Parallel Computing
Where Part IX Left Us
Part IX ended with a provocation dressed as a summary. Training, we said, is a one-time cost. Inference is the workload that runs forever.
That sentence deserves to be interrogated before we accept it as a frame, because it contains a hidden asymmetry that shapes everything that follows.
Training a frontier model is an event: it happens once, or perhaps a handful of times with different hyperparameters, and then it stops. The cost is large and bounded.
Inference is a process: it happens billions of times per day, across hardware that may or may not resemble the training cluster, under latency constraints that the training job never had to respect, serving users who have no patience for pipeline bubbles and no interest in MFU.
The engineering discipline of inference optimization is therefore a different subject from the engineering discipline of training optimization, not merely a scaled-down version of it. The bottlenecks are different in kind. The metrics are different. The vocabulary is different. The hardware choices are sometimes deliberately different.
But the physics is the same, because physics does not have a training mode and an inference mode. Compute and memory bandwidth are always the two resources, and every optimization in this space is, at root, a claim about which of the two you are spending and whether you are spending it wisely.
What we will do in this part is work through the inference problem with the same level of precision we brought to the training problem.
We will derive the arithmetic of autoregressive decoding from first principles, establish exactly why the decode phase of transformer inference is memory-bandwidth-bound by construction, explain what that means for hardware selection and batching strategy, and then examine the tools that practitioners have developed to recover the compute utilization that the memory-bound regime takes away.
We will go into detail that most treatments of this subject avoid, because the details are where the engineering actually lives.
The two phases of transformer inference
Transformer inference for a generative model consists of two distinct phases that are so different in their computational character that they might as well be different workloads running on different hardware.
The prefill phase processes the input prompt. Given a prompt of length S tokens, the model performs a forward pass over all S tokens simultaneously.
This is a dense matrix multiply of shape [S, d_model] against the weight matrices, which is computationally equivalent to a training forward pass on a batch of S examples, with the important difference that no gradient computation happens.
The arithmetic intensity of prefill is high: the GEMM is large, the compute-to-memory ratio is favorable, and for sufficiently long prompts, prefill saturates tensor core utilization. Prefill is compute-bound.
The decode phase generates the output, one token at a time. At each step, the model processes a single new token and uses the key-value (KV) cache, which stores the key and value projections for all previously seen tokens, to compute attention over the full context without recomputing those projections.
The new token produces one row of the Q matrix, one row of the K matrix (appended to the KV cache), one row of the V matrix (also appended), and one output token.
The matrix multiply that dominates decode is therefore a matrix-vector product: a single vector of shape [1, d_model] multiplied against weight matrices of shape [d_model, d_model].
For an H100 at 494 TFLOP/s peak BF16, and a weight matrix that requires 2 × d_model² bytes to read from HBM (one load), the arithmetic intensity of this operation is:
flops = 2 × d_model² bytes = 2 × d_model² arithmetic intensity = 1 FLOP/byte
The H100’s ridge point, the arithmetic intensity at which the machine transitions from memory-bandwidth-bound to compute-bound, is approximately 494 TFLOP/s divided by 3.35 TB/s HBM3 bandwidth, which equals roughly 147 FLOP/byte.
Single-token decode has an arithmetic intensity of 1 FLOP/byte. The ridge point is at 147 FLOP/byte.
The gap is not a small inefficiency to be engineered away. It is two orders of magnitude. It is structural.
A matrix-vector product with batch size 1 will always be memory-bandwidth-bound on any hardware where compute throughput scales faster than memory bandwidth, which is every piece of hardware available today and likely every piece available for the next several years.
The H100’s tensor cores sit at 99.3% utilization waiting for data that cannot arrive fast enough. This is the inference problem, stated precisely.
What the arithmetic intensity gap actually costs
Before we can appreciate why batching is not a simple fix, we need to quantify what the gap costs in concrete terms.
Consider the weight matrices of a 70B parameter model. They occupy 140 GB in BF16. To generate a single output token, the decode phase must read essentially all of the weight matrices from HBM: every attention projection, every MLP layer, every embedding lookup.
(The KV cache is also read, but its size is proportional to context length and sequence position, not model size.) At the H100’s HBM3 bandwidth of 3.35 TB/s, reading 140 GB takes approximately 42 milliseconds.
In 42 milliseconds, a single token is produced. That is approximately 24 tokens per second per H100 for a 70B model at batch size 1.
Now read that sentence again: 24 tokens per second per H100, a machine that costs tens of thousands of dollars and can perform 494 trillion floating-point operations per second, is producing tokens at roughly the rate that a person reads them.
The 494 TFLOP/s are not being used. The H100 is acting as a very expensive, very fast HBM3 reader. The silicon that took years to design and billions of dollars to fabricate is waiting for data.
This is the central pathology of autoregressive decode, and it motivates every technique we will discuss in this part.
Batching as arithmetic intensity recovery
The solution that every inference practitioner reaches for first is batching: if you run decode for multiple requests simultaneously, the weight reads are shared across the batch, and the arithmetic intensity increases proportionally.
The arithmetic is clean. For a batch of B requests, the matrix multiply in decode is no longer a matrix-vector product but a matrix-matrix product: [B, d_model] × [d_model, d_model].
The flop count scales as B × 2 × d_model², while the bytes for the weight matrix remain 2 × d_model² (the weights are read once, regardless of batch size). Arithmetic intensity is now B FLOP/byte.
To reach the ridge point at 147 FLOP/byte, you need a batch of 147 requests running simultaneously on the same H100. At batch size 147, the tensor cores begin to saturate and further increasing the batch does not change the arithmetic intensity (you are now compute-bound, and more requests means more total compute, not more memory reads per unit time).
The batch size at which you saturate the machine is the target operating point for maximum throughput per GPU. Everything below this point is wasted hardware.
But batch size is not free. Each request in the batch has its own KV cache, and the KV cache size is proportional to the sequence length of that request. For a context length of 8192 tokens, a model with 80 layers, 8 KV heads, a head dimension of 128, and BF16 storage, the KV cache size per request is:
8192 × 80 × 2 × 8 × 128 × 2 bytes = 26,843,545,600 bytes ≈ 25 GB
Twenty-five gigabytes per request, on a GPU with 80 GB of HBM. You can serve at most three concurrent requests at 8192-token context length before HBM is exhausted and you cannot increase the batch further.
The tension is fundamental: to achieve good arithmetic intensity you need large batches, but large batches require large KV caches, and large KV caches consume the memory that large batches require.
This is the central resource allocation problem of transformer inference, and it is more constrained than it appears, because KV cache memory is not static. It grows with sequence length.
A request that has generated 100 tokens has a small KV cache; the same request after generating 4000 tokens has a KV cache that is 40× larger. The memory footprint of the batch changes continuously as generation proceeds.
Continuous batching and the end of static padding
The naive approach to batched inference is static batching: collect B requests, pad them all to the same sequence length, run them as a batch, return all B results when the longest sequence finishes.
Static batching is deeply inefficient. Consider a batch of 8 requests where one request will generate 2000 tokens and the others will generate 20 tokens each.
After the short requests finish at step 20, 7 of the 8 slots in the batch are empty, but the batch continues until step 2000 to service the one long request. The GPU is running at 1/8 occupancy for 99% of the wallclock time.
Continuous batching (also called iteration-level batching or in-flight batching), implemented in systems like vLLM, Orca, and TensorRT-LLM, solves this by removing the assumption that all requests in a batch start and finish together.
Instead, the batch is managed at the per-decoding-step level: at each step, the set of active requests is the set of requests that have not yet finished and for which memory is available.
When a request completes (generates a stop token or reaches the maximum length), its slot in the batch is immediately freed. A new waiting request is inserted into the freed slot and begins generating from its first decode step. There is no waiting for the batch to drain. The batch is always as full as memory permits.
The implementation requires that the CUDA kernels for attention and the MLP can handle variable-length sequences within a single kernel invocation, which is non-trivial because standard GEMM implementations assume a fixed batch dimension.
Paged attention (discussed shortly) is the memory management technique that makes this practical; the PagedAttention kernel from vLLM and the FlashAttention variants for variable-length sequences are the implementations that make it fast.
Continuous batching does not increase the maximum batch size (which is still limited by KV cache memory). What it does is ensure that the batch is always at or near the maximum size, eliminating the idling that static batching induces.
A system running continuous batching with maximum batch size 64 will achieve dramatically higher throughput than a system running static batching with the same maximum batch size, because the latter is almost never actually running 64 requests simultaneously.
PagedAttention and the memory management revolution
The KV cache memory problem has an analogy so precise it deserves to be stated explicitly: the KV cache is to inference systems what physical memory is to operating systems.
In an operating system, multiple processes compete for a fixed physical memory. Processes do not know their memory needs in advance (a process may allocate more memory as it runs).
Memory fragmentation is a real cost: even if the total free memory is sufficient, if it is not contiguous, an allocation may fail.
The solution that operating systems developed is virtual memory with paging: memory is divided into fixed-size pages, processes address a virtual space that the OS maps to physical pages on demand, and fragmentation is eliminated because non-contiguous physical pages can be mapped to a contiguous virtual space.
PagedAttention, introduced by Kwon et al. (2023) and implemented in vLLM, applies exactly this insight to KV cache management.
In a naive KV cache implementation, each request’s KV cache is a contiguous block of GPU memory allocated at request arrival. The maximum context length is reserved at allocation time (because the request might generate that many tokens), even if the actual generation is much shorter.
Fragmentation is severe: the gap between reserved and used memory across all requests is wasted, and new requests cannot use it.
PagedAttention divides the KV cache into fixed-size physical blocks (pages), where each block stores the keys and values for a fixed number of tokens (the block size, typically 16 or 32 tokens). When a request needs more KV cache space, it is allocated additional pages from a free pool.
Pages for a single request need not be contiguous in physical GPU memory; the PagedAttention kernel uses a block table (a small integer array per request mapping logical page indices to physical page indices) to find the right physical memory at attention computation time.
The consequences are significant. First, fragmentation falls from potentially 50% (if requests reserve maximum-length buffers but generate much shorter sequences) to under 5% (only the last partially-filled page of each sequence wastes space).
Second, sequences can share physical pages: if two requests have identical prompt prefixes (common in chat applications with a fixed system prompt), the KV cache pages for the shared prefix can be physically shared between them, eliminating redundant computation and halving the memory footprint of the prefix.
Third, memory allocation is lazy: pages are allocated only as tokens are generated, not at request arrival, which means a request does not consume its full potential KV cache until it actually generates enough tokens to need it.
The prefix sharing (also called KV cache sharing or prompt caching) deserves additional attention because its impact at scale is large. Consider a chat application where every request is prefixed with a 2000-token system prompt.
Without prefix sharing, each request independently computes and stores the KV cache for those 2000 tokens. With prefix sharing, the 2000-token prefix KV cache is computed once and shared across all concurrent requests.
For a batch of 64 requests, this eliminates 63 redundant prefill computations and reduces KV cache memory by 2000 × 64 tokens worth of activations, freeing space for a larger batch.
Speculative decoding and the bandwidth wall
Continuous batching with PagedAttention brings the inference system to the state of efficiently using available hardware at the maximum batch size the KV cache permits.
But we have still not escaped the fundamental constraint: at maximum batch size, the system is compute-bound, but getting to maximum batch size requires enough concurrent requests, which requires enough users, which is not always the case in low-traffic scenarios. At low batch sizes, we are still memory-bandwidth-bound, still producing tokens at the same rate that memory bandwidth permits.
Speculative decoding attacks this from a different angle. The observation is this: for a memory-bandwidth-bound system, the cost of generating one token and the cost of generating K tokens simultaneously in a single forward pass are approximately equal, because the bottleneck is reading the weight matrices from memory, and reading them once versus K times is the same operation if the K proposals can be evaluated in a single forward pass.
The mechanism: a small draft model (or a non-autoregressive heuristic, or a retrieval system) proposes a sequence of K candidate continuation tokens in a single forward pass.
The large target model then verifies this proposal in a single forward pass over the K tokens simultaneously. If the target model accepts all K tokens, K tokens have been generated at the cost of one large model forward pass plus one small model forward pass.
If the target model rejects some tokens, the generation rewinds to the first rejection and continues from there, having wasted some small model computation but no large model computation beyond what was necessary.
The analysis of when speculative decoding accelerates inference requires understanding the acceptance rate, which is the fraction of proposed tokens that the target model accepts.
For a good draft model generating from a similar distribution, acceptance rates of 70-90% are achievable. With an acceptance rate of α and a proposal length of K, the expected number of tokens accepted per large model forward pass is:
E[accepted] = sum_{k=0}^{K} (k+1) × αᵏ × (1 − α) + (K+1) × αᴷ
For α = 0.8 and K = 4: E[accepted] ≈ 3.3 tokens per large model forward pass, compared to 1 token per forward pass without speculation. Speedup is approximately 3.3×, reduced by the overhead of the draft model, typically 0.2-0.3 of a large model forward pass for a draft model that is 10-15× smaller.
Net speedup: approximately 3.3 / 1.2 ≈ 2.7×. This is not free, but it is substantial, and it is available even at batch size 1.
The reason speculative decoding works, and why it does not violate the memory-bandwidth constraint, is subtle. The verification pass is a prefill operation over K+1 tokens, which has higher arithmetic intensity than single-token decode.
For K=4, the verification pass processes 5 tokens simultaneously, which is 5× the arithmetic intensity of single-token decode. It is still memory-bandwidth-bound for small K, but the per-accepted-token cost of the memory reads is reduced because multiple tokens share the weight reads.
The deeper insight is that speculative decoding converts memory-bandwidth-bound decode operations into a mixture of memory-bandwidth-bound draft decode and slightly-less-memory-bandwidth-bound verification, and the mixture achieves better tokens-per-second because the draft model is smaller and therefore faster per forward pass.
The pathological case is when the draft model produces tokens that the target model almost never accepts, in which case the overhead of draft generation and failed verification exceeds the benefit.
Acceptance rates below approximately 0.5 make speculative decoding harmful, not helpful. This is why the choice of draft model matters: it should be distilled from or aligned with the target model, not simply chosen to be the fastest available.
Self-speculative decoding, where the model speculates with its own early layers (exiting at an intermediate layer for draft generation and running the full forward pass for verification) eliminates the draft model requirement at the cost of some architectural complexity.
Medusa, a multi-head speculative decoding method that adds dedicated draft heads to the target model, is a variant that achieves similar benefits with a different implementation strategy.
KV cache quantization and the memory tradeoff
Even with PagedAttention and prefix sharing, the KV cache is often the binding constraint on batch size for long-context models. A 70B model with a 128K context length and a batch of 8 requests generates a KV cache of:
128,000 × 80 × 2 × 8 × 128 × 2 bytes × 8 requests = 3.3 TB
Three terabytes for 8 requests. No H100 cluster of reasonable size holds this in HBM. The only responses are to reduce context length (not always possible), reduce batch size (reduces throughput), or reduce the precision of KV cache values.
KV cache quantization stores keys and values in INT8 or FP8 rather than BF16, halving the memory requirement at the cost of approximation error. The question is whether that approximation error materially affects output quality.
The answer, empirically, is that keys and values are more quantization-sensitive than weights, because the attention score computation amplifies outliers in the key matrix, and those outliers are common and important.
Naive INT8 quantization of the KV cache causes measurable quality degradation on tasks requiring precise retrieval over long contexts. The degradation is smaller for short contexts where fewer keys compete for attention.
Techniques that address this include grouped-query attention (GQA), which reduces the number of KV heads (and therefore the KV cache size) without proportionally reducing the expressivity of attention by sharing keys and values across groups of query heads; and mixed-precision KV caching, which stores frequently-accessed (recent) tokens in higher precision and distant-context tokens in lower precision, exploiting the empirical observation that attention weights are concentrated on nearby and highly salient tokens.
GQA deserves detailed treatment because it has become standard in essentially all modern large language models: Llama 3, Mistral, Gemma, Qwen, and their successors all use it.
The mechanism is to reduce the number of KV heads from H to H/G for a group size of G, where typically G = 8. Each KV head is shared by G query heads during attention computation. The KV cache size shrinks by a factor of G, and with G=8 on a 128-head attention, the cache is 8× smaller.
The expressivity cost is small for typical values of G because attention heads tend to specialize into redundant groups anyway: the empirical evidence across many model evaluations suggests that the quality loss from GQA at G=8 is negligible relative to the memory benefit.
The prefill-decode disaggregation architecture
We have established that prefill is compute-bound and decode is memory-bandwidth-bound. The hardware that is optimal for one is different from the hardware that is optimal for the other.
High-compute GPUs (H100, B200) with large HBM capacity are appropriate for both prefill and decode, but they are expensive. For decode in particular, the binding resource is memory bandwidth, not compute throughput. A GPU with lower compute throughput but the same memory bandwidth would serve the decode phase equally well at lower cost.
This observation motivates prefill-decode disaggregation: running prefill and decode on separate hardware pools, each sized for its actual bottleneck.
The architecture is approximately as follows: a request arrives at a scheduler, which assigns it to a prefill worker. The prefill worker (a high-compute GPU) processes the prompt and produces the initial KV cache.
That KV cache is then transferred to a decode worker (which may be a different, possibly cheaper GPU). The decode worker generates tokens autoregressively, managing the KV cache in its memory, and streams tokens back to the user.
The KV cache transfer itself is non-trivial: for a long prompt on a large model, the KV cache may be tens of gigabytes, and transferring it across a network link between prefill and decode workers takes time proportional to size. The transfer must complete before the first decode token can be generated, which adds latency to the time-to-first-token (TTFT) metric.
For applications where throughput matters more than latency, this tradeoff is acceptable: the disaggregated system produces more tokens per dollar per second because each hardware type is used for the phase it is efficient at.
For applications where TTFT is critical (real-time conversational AI), the added transfer latency may be unacceptable.
The engineering tension here is real, and production systems navigate it differently depending on workload: Splitwise (from Microsoft Research), DistServe (from Peking University and others), and production implementations at major AI serving providers all make different tradeoffs along the latency-throughput frontier.
Flash decoding and the attention bottleneck
We have spent considerable time discussing the weight matrix reads as the memory bottleneck for decode, but for long-context inference, a second bottleneck emerges: the KV cache reads during attention computation.
The attention operation during decode requires computing, for the new token’s query vector q of shape [d_head], an attention score against every key in the KV cache of shape [context_length, d_head], and then a weighted sum of every value in the KV cache of shape [context_length, d_head].
The KV cache for a single layer, a single head, at context length L is 2 × L × d_head × 2 bytes (keys and values, BF16). For L = 128,000, d_head = 128, this is 64 MB per head per layer per request.
For a model with 80 layers and 8 KV heads, the total KV cache read per decode step is 64 MB × 80 × 8 = 40,960 MB ≈ 40 GB.
At 3.35 TB/s HBM3 bandwidth, reading 40 GB takes approximately 12 milliseconds per decode step. The weight matrix reads (140 GB at 3.35 TB/s) take approximately 42 milliseconds. At 128K context length, KV cache reads are therefore about 22% of the per-step memory read time, a non-trivial contribution.
At context lengths beyond 128K (1M tokens is now a research target), KV cache reads can dominate weight reads entirely. The bottleneck for very-long-context inference is not the model weights but the growing KV cache.
Flash Decoding, introduced by Dao et al. and integrated into FlashAttention-3, addresses this by parallelizing the KV cache reads across multiple warps in a different pattern from standard FlashAttention.
In standard FlashAttention, which is designed for prefill (where Q, K, and V all have sequence dimension S), the parallelism is along the query sequence dimension.
In Flash Decoding, where the query has sequence dimension 1 but the KV cache has sequence dimension L, the parallelism instead decomposes along the key/value sequence dimension, allowing multiple warps to read different segments of the KV cache simultaneously and combine their partial softmax results using a numerically stable reduction.
The speedup from Flash Decoding is most pronounced when the KV cache length is large and the batch size is small, precisely the regime of long-context single-request inference where standard attention is most bottlenecked.
For L = 64K and batch size 1, Flash Decoding achieves 4-8× speedup over a naive attention implementation, which translates directly to 4-8× faster decode throughput for long-context requests.
The production inference stack
The mechanisms discussed above do not exist in isolation; they compose into a production inference serving system.
It is worth describing what that stack looks like end-to-end, because the interactions between components create optimization opportunities that are invisible when examining each component individually.
A production inference server for a 70B model on an 8-GPU node, circa mid-2026, runs approximately as follows.Tensor parallelism at degree 8 distributes the weight matrices across all 8 GPUs in the node, using NVLink for the all-reduce at each layer boundary.
The effective model per GPU is 70B / 8 = 8.75B parameters, requiring approximately 17.5 GB of HBM for weights in BF16. With 80 GB per GPU, this leaves 62.5 GB per GPU for KV cache.
The KV cache is managed by PagedAttention with a block size of 16 tokens. The pool of free blocks is divided among the 8 GPUs (with TP, the KV cache is also sharded, since each GPU handles a subset of the KV heads under GQA).
For a model with 8 total KV heads under 8-way TP, each GPU handles exactly 1 KV head, and the per-GPU KV cache is accordingly 1/8 of the total.
The scheduler runs continuous batching, maintaining a queue of waiting requests and a set of active requests. At each step, it admits new requests into the batch as KV cache pages become available. It preempts requests (evicts their KV cache pages, returning them to the free pool) when memory pressure is high, re-scheduling preempted requests from the beginning (or from a checkpoint) later.
Speculative decoding is enabled with a draft model of approximately 7B parameters, contributing an additional 14 GB of weight memory across the 8 GPUs (1.75 GB per GPU) and an additional 14 GB KV cache (1.75 GB per GPU), net of which there remains approximately 60 GB per GPU for the target model KV cache.
The prefill of long prompts is chunked (also called chunked prefill): instead of processing a 64K-token prompt as a single prefill operation that saturates compute for a long time and blocks decode requests from running, the prompt is processed in chunks of, say, 2048 tokens per step, interleaved with decode steps.
This trades slightly higher TTFT for dramatically better decode latency for concurrent users, a tradeoff that is almost always correct in a multi-user serving scenario.
What the series has built
Ten parts. From SMs to NVSwitch. From warp scheduling to speculative decoding. From single-GPU kernels to multi-node inference stacks.
Across all of it, one idea stays invariant: a GPU is not a “compute machine” in the naive sense. It is a latency-hiding system. Every layer of the stack exists to keep data moving while something else is waiting; warps hiding memory latency, NVLink hiding interconnect latency, batching hiding kernel inefficiency, and scheduling hiding autoregressive seriality.
At scale, the same structure repeats. Training systems hide communication under compute. Inference systems hide serial generation under parallel requests. Serving stacks hide memory fragmentation under virtualized allocation.
The details change, but the pattern does not: identify the bottleneck, then restructure the system so that bottleneck is no longer visible to the critical path.
What emerges is not a collection of optimizations, but a consistent way of thinking about hardware systems. Everything reduces to one question:
what resource is constraining progress right now (compute, memory bandwidth, or communication) and how do we prevent it from sitting idle?
The specific techniques will evolve. NVLink will be replaced, attention kernels will be rewritten, new quantization schemes will appear, and hardware will continue to shift the ridge points we computed throughout this series.
But the underlying structure will not change, because it is not a property of transformers: it is a property of physics.
That is the real object we have been studying. Not GPUs. Not transformers. But the constraints that govern how any system can compute under finite bandwidth and finite time.




Everything reduces to one question: what resource is idle right now.
Ten parts to get there, but that's the question worth asking.