The Software Frontier

The Wafer & the Wallet

Lorenzo Bradanini — Sun, 05 Jul 2026 07:56:35 GMT

Intro

For three years, the cost of making a token fell every quarter while the budget to buy tokens grew without a meter. In H1 2026 both curves reversed at once: memory pass-through is raising the cost floor of every token served, while enterprises institutionalize token budgets at the ceiling.

This is the anatomy of inference’s first margin squeeze: the wafer math, the proof of pass-through, modeled cost floors marked to real market prices, and the engineering playbook that defends the spread.

Two curves reversed in the same half-year

Every industry eventually meets the quarter when its two defining curves cross. For LLM inference, that quarter just happened.

Since late 2022, the economics of this business rested on two assumptions so reliable nobody wrote them down: the cost of producing a token falls every quarter (process nodes, kernel engineering, quantization, and a relentless price war), and the budget for buying tokens grows without a meter, the era of “tokenmaxxing,” when Meta and Salesforce were publicly urging employees to consume as many tokens as possible. In the first half of 2026, both assumptions died within months of each other.

Jaw one: the cost floor turned upward. That is the core of this issue: the deepest memory supercycle in the forty-year history of DRAM, transmitted measurably into GPU rental rates with a 4-6 week lag (Fig. 7). For the first time since the ChatGPT moment, the marginal cost of serving a token is inflecting up, not down, driven by the single most supply-constrained commodity in technology.

Jaw two: the price ceiling got institutionalized. SemiAnalysis published field data on July 2 from conversations with 50+ enterprises: Uber burned through its annual Claude Code and Codex budget in four months and responded with a $1,500/month per-employee cap; an aerospace manufacturer’s $250/month caps were exhausted by power users in four days; companies are switching off premium model tiers and downgrading defaults; enterprise budgets now range from $250 to tens of thousands per employee per month, but they are budgets, reviewed by finance.

Read the finding precisely, because it is not a demand-collapse story: SemiAnalysis concludes there is no material risk to 2H26 AI budgets and API spend keeps compounding. The jaw is not falling demand, it is the arrival of price sensitivity.

Unbounded willingness-to-pay is over; every serving provider now negotiates against a CFO’s cap instead of an enthusiast’s curiosity.

The quarter the memory market broke

Start with the raw numbers, because they have no precedent in the forty-year history of the DRAM industry. TrendForce’s Q1 2026 contract-price revisions came in at +90-95% quarter-over-quarter for conventional DRAM, revised upward from an initial forecast of +55-60% that analysts had already called unprecedented.

PC DRAM was expected to more than double in a single quarter; server DDR5 rose roughly 90%; NAND contract prices climbed 33-38% in the same window. Counterpoint tracked spot DRAM up 80-90% inside the quarter.

For the full year, TrendForce projects DRAM up more than 70% on top of the Q1 step-change; Bank of America has DRAM industry revenue +51% YoY and NAND +45%.

The supply side tells you why this is not a blip. SK Hynix announced in October 2025 that its entire 2026 production capacity was already sold out. Kioxia has said the same of its 2026 NAND output.

Micron shut down its Crucial consumer business outright (a two-decade brand, liquidated) to route every wafer toward hyperscaler and GPU-grade contracts, and its executives describe being able to cover at most two-thirds of medium-term demand for some customers.

Micron’s Boise fabs ramp in 2027-2028; the New York megafab in 2030. Jensen Huang, asked at CES 2026 whether gamers should resent AI for GPU prices, answered in effect that the world needs more memory factories.

As previous issues have argued: when the largest memory buyer on Earth says the fix is on the supply side, the message downstream is that you will not negotiate your way to allocation relief.

Three structural breaks from every previous cycle

1. The demand driver is durable, not episodic. Prior shortages came from fab fires, earthquakes, supply discipline, or one-off demand pulses (crypto, pandemic PCs). This one comes from inference workloads compounding in production. IDC’s assessment calls it a “permanent reallocation” of the world’s wafer capacity, not a cyclical shortage, and IDC does not deploy the word “permanent” casually.

2. The margin gradient is a one-way valve, and the wafer math is brutal. HBM carries gross margins 3-5× commodity DRAM, so every rational fab reallocates cleanroom toward it.

But because of TSV formation, die thinning, stacking, and compounded assembly yield, one gigabyte of HBM consumes ≈4× the wafer capacity of one gigabyte of standard DRAM; GDDR7 consumes ≈1.7×. AI’s effective wafer claim reaches ~20% of world DRAM capacity in 2026 against total bit-supply growth of only 10-16% per year.

The arithmetic cannot close without starving someone, and “someone” is every non-AI buyer plus every AI buyer without a long-term agreement.

3. The price convergence matters more than the price level. HBM3e historically sold at 4-5× server DDR5 per bit. TrendForce expects that gap to compress to 1-2× by end-2026, not because HBM got cheaper, but because DDR5 inflated toward it. Sit with that.

The entire tiered-memory architecture of modern serving (KV offload to host DRAM, paged hierarchies, DDR5-backed prefix caches, NVMe cold tiers) was designed in a world where the tier below HBM was 4-5× cheaper per bit.

When the discount for descending a tier collapses from ~80% to ~30-50%, a large fraction of the “obvious” tiering optimizations of 2024-2025 stop being obvious. Section 5 does that math properly.
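The conversion from a per-bit price ratio to a tier discount is a one-line calculation. A minimal sketch follows; `tier_discount` is an illustrative helper name, not something taken from the sourced reports, and the ratios in the loop are sample points inside the ranges quoted above.

```python
def tier_discount(price_ratio: float) -> float:
    """Fractional per-bit saving from moving down one memory tier.

    price_ratio = (upper-tier price per bit) / (lower-tier price per bit).
    """
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    return 1.0 - 1.0 / price_ratio


# Sample points: the historical 4-5x HBM3e premium, then compressed ratios.
for ratio in (5.0, 4.0, 2.0, 1.5):
    print(f"{ratio:.1f}x premium -> {tier_discount(ratio):.0%} discount per bit")
```

Under this framing, a 4-5× premium corresponds to a 75-80% discount for stepping down a tier, while ratios between roughly 1.5× and 2× give the ~30-50% range; at parity the discount is zero.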

One supply-side note that matters for H2: HBM4 becomes the mainstream HBM in the second half of 2026, with a doubled 2,048-bit interface, ~2 TB/s per stack, and a manufacturing-complexity premium above 30% over HBM3e. NVIDIA’s Rubin ships with up to 288 GB of HBM4 per GPU, and every one of those stacks is wafer capacity that used to be laptops.

The KV cache ate the memory supply

The lazy narrative is “AI ate the memory market.” The precise narrative is that the KV cache ate the memory market, and the industry’s shift from training-dominated to inference-dominated compute is what lit the fuse.

Training demand is large but concentrated and schedulable: a fixed pool of HBM for weights, activations, and optimizer states, planned quarters ahead. Inference demand is elastic and per-user: every concurrent conversation, agent trajectory, and 200K-token codebase instantiates its own slab of state that must live somewhere in the hierarchy for the life of the request, and increasingly beyond it, as prefix caches persist context between turns.

Industry estimates via TrendForce put cloud high-speed memory consumption near 3 exabytes in 2026, with core inference platforms (Gemini-, Bedrock-, ChatGPT-class serving) accounting for ~750 PB of live memory demand before redundancy roughly doubles it.

TrendForce separately projects that by 2029 inference (not training) becomes the primary driver of AI server demand outright, and North American CSPs are already bulk-buying high-capacity DDR5 RDIMMs specifically for inference fleets, the demand class that used to be the safety valve absorbing DRAM oversupply.

The per-token physics, derived from first principles

Everything in this issue hangs on one number: bytes of KV state per token. Take the workhorse case, a Llama-3-class 70B dense model with grouped-query attention: 80 layers, 8 KV heads, head dimension 128.

A single 128K sequence on the GQA-8 model pins ~41 GB at FP16: half an H100’s entire HBM for one user’s context, before a single weight is stored. This is why effective batch size at long context is a memory-capacity calculation, not a compute calculation, and why every marginal concurrent user at long context is, economically, a memory purchase.
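The per-token figure can be checked from the stated configuration alone. The sketch below assumes the usual accounting of one key vector and one value vector per layer per KV head, and reads “128K” as 128,000 tokens (an assumption; 131,072 tokens gives about 43 GB instead). The function name is illustrative.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: float) -> float:
    """KV-cache bytes added per token of context."""
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128    # configuration stated in the text
SEQ_TOKENS = 128_000                       # assumption: "128K" = 128,000 tokens

fp16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)   # 2 bytes per element
fp8 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1)    # 1 byte per element

print(f"FP16: {fp16 / 1024:.0f} KiB per token, "
      f"{fp16 * SEQ_TOKENS / 1e9:.1f} GB per sequence")
print(f"FP8:  {fp8 / 1024:.0f} KiB per token, "
      f"{fp8 * SEQ_TOKENS / 1e9:.1f} GB per sequence")
```

At FP16 this works out to 320 KiB per token and roughly 42 GB per 128K sequence, in line with the ~41 GB figure above; FP8 halves both.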

Multiply by the fleet: 10,000 concurrent long-context sequences at FP8 KV pins ~200 TB of state across HBM, DDR5, and NVMe, state that produces no tokens itself; it exists purely so decode can stream it past the compute units at terabytes per second.

The 2026 memory market is what happens when the industry’s aggregate KV state, growing super-linearly with agent adoption and context inflation, collides with wafer supply growing 10-16% a year.

Two second-order effects complete the picture

NAND got dragged in. NVMe became the cold tier for prefix caches and the staging layer for weights (including GPU-direct storage paths). Kioxia now says nearly half its future NAND demand could come from AI. Meanwhile Samsung and SK Hynix cut NAND wafer output in 2024-2025 while chasing HBM margins. Omdia has Samsung’s NAND wafers falling 4.9M→4.68M and SK Hynix’s 1.9M→1.7M. There is no cheap tier left to hide in.

The host-DRAM tax on every GPU node inflated. A serious 2026 inference node pairs its GPUs with 1-2 TB of DDR5 for KV offload, CPU-side batching, and cache tiers. At Q1 2026 server-DRAM pricing, the host memory on a single node can cost what a mid-range GPU did two years ago. No 2024-vintage TCO model has this line item at anywhere near its current size.

Roofline: why prefill and decode were never the same workload

Before the silicon story makes sense, you need the roofline argument stated with numbers, not vibes. A kernel’s attainable throughput is bounded by min(peak_FLOPs, intensity × bandwidth), where arithmetic intensity is FLOPs performed per byte moved from memory.

The machine has a ridge point, the intensity at which it stops being bandwidth-bound and becomes compute-bound.
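The roofline bound and the ridge point can be written down directly. In the sketch below the peak and bandwidth values are deliberately round, hypothetical numbers chosen for readability, not any device’s datasheet; the function names are likewise illustrative.

```python
def ridge_point(peak_flops: float, bandwidth_bytes_s: float) -> float:
    """Arithmetic intensity (FLOPs per byte) at which a kernel turns compute-bound."""
    return peak_flops / bandwidth_bytes_s


def attainable_flops(peak_flops: float, bandwidth_bytes_s: float,
                     intensity: float) -> float:
    """Roofline bound: min(peak_FLOPs, intensity x bandwidth)."""
    return min(peak_flops, intensity * bandwidth_bytes_s)


PEAK = 1.0e15   # hypothetical: 1 PFLOP/s
BW = 4.0e12     # hypothetical: 4 TB/s

ridge = ridge_point(PEAK, BW)
for intensity in (2, 50, 250, 1000):
    bound = attainable_flops(PEAK, BW, intensity)
    regime = "bandwidth-bound" if intensity < ridge else "compute-bound"
    print(f"intensity {intensity:>5} FLOP/B -> {bound / 1e12:7.1f} TFLOP/s ({regime})")
```

On these made-up numbers the ridge sits at 250 FLOPs per byte: a kernel doing 2 FLOPs per byte can reach only 8 TFLOP/s of a 1,000 TFLOP/s part, which is the regime decode lives in.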

The corollary that most cost models miss: decode throughput is not a FLOPs number, it is a bandwidth budget divided by bytes-per-step. Per decode step the device must stream the weights once plus every live sequence’s KV:

Read those three lines again. Same GPU, same model, same bandwidth: 16× throughput difference between short-context and long-context traffic, and a clean 2× recovered at 128K purely by halving KV bytes. Context length is not a feature flag; it is a cost multiplier that acts through memory.

Every KV byte you remove converts directly into either more concurrent sequences or faster steps, both of which are revenue.
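The bandwidth-budget view of decode can be sketched in a few lines. This is a simplified model under stated assumptions, not a reproduction of the figures quoted above, which depend on the device and batch sizes chosen there: the 4.8 TB/s bandwidth and the 70 GB FP8 weight footprint are illustrative inputs, kernels are assumed ideal, and memory capacity is not checked (a real node may not fit every batch shown).

```python
def decode_tokens_per_s(bandwidth_bytes_s: float, weight_bytes: float,
                        kv_bytes_per_token: float, context_tokens: int,
                        batch: int) -> float:
    """Bandwidth-bound decode ceiling: each step streams the weights once
    plus the KV cache of every live sequence, and emits one token per sequence."""
    bytes_per_step = weight_bytes + batch * context_tokens * kv_bytes_per_token
    steps_per_s = bandwidth_bytes_s / bytes_per_step
    return batch * steps_per_s


BW = 4.8e12          # assumption: illustrative device bandwidth, bytes/s
WEIGHTS = 70e9       # assumption: 70B parameters stored at 1 byte each
KV_FP16 = 327_680    # bytes per token for the 80-layer GQA-8 config at FP16
KV_FP8 = 163_840     # same config at FP8

for label, kv, ctx in (("8K, FP16 KV", KV_FP16, 8_000),
                       ("128K, FP16 KV", KV_FP16, 128_000),
                       ("128K, FP8 KV", KV_FP8, 128_000)):
    rate = decode_tokens_per_s(BW, WEIGHTS, kv, ctx, batch=4)
    print(f"{label:>14}: {rate:7.1f} tokens/s ceiling at batch 4")
```

The structure matters more than the particular values: once the KV term dominates the weight term, throughput falls almost in proportion to context length, and halving bytes per KV element recovers close to a factor of two.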

Rubin CPX is a memory trade wearing a GPU costume

Now read NVIDIA’s 2026 inference roadmap through the memory market and it snaps into focus in a way most launch coverage missed. If prefill is compute-bound and decode is bandwidth-bound (§03), then serving both phases on identical HBM-stuffed GPUs means that during prefill you are paying for the most supply-constrained, margin-rich commodity on the planet (HBM bandwidth) and not using it.

In 2024 that was an inefficiency. In 2026, with HBM at 4× wafer-equivalent cost and fully allocated into next year, it is an unforced error measured in real money.

Rubin CPX is the correction. Announced at the AI Infra Summit in September 2025, detailed through CES and GTC 2026, generally available late this year: a monolithic prefill-specialized die pairing 30 PFLOPS of sparse NVFP4 compute (20 PFLOPS dense, per SemiAnalysis) with 128 GB of GDDR7 at only ~2 TB/s of bandwidth (32 Gbps GDDR7 on a 512-bit bus).

That bandwidth would embarrass a decode GPU; it’s below an H100. On a prefill part it is the entire point: SemiAnalysis’s launch framing was exactly right, a chip deliberately skinny on bandwidth and fat on compute, because that is what prefill consumes. Its intensity budget sits far to the right of Fig. 5’s ridge, where GDDR7 is not a compromise but a correct sizing.
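The ~2 TB/s figure follows from the pin rate and bus width quoted above, and dividing the dense compute figure by it shows how far right the part’s ridge point sits. A minimal sketch using only those quoted numbers; the helper name is illustrative.

```python
def bus_bandwidth_bytes_s(pin_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak memory bandwidth: per-pin data rate times bus width, in bytes/s."""
    return pin_rate_gbps * 1e9 * bus_width_bits / 8


bandwidth = bus_bandwidth_bytes_s(32, 512)   # 32 Gbps GDDR7, 512-bit bus
dense_flops = 20e15                          # 20 PFLOPS dense NVFP4, as quoted

print(f"bandwidth: {bandwidth / 1e12:.3f} TB/s")
print(f"ridge point: {dense_flops / bandwidth:,.0f} FLOPs per byte")
```

That is 2.048 TB/s and a ridge near 9,800 FLOPs per byte: a kernel has to perform on the order of ten thousand operations per byte fetched before this chip is compute-limited, which is a prefill profile, not a decode one.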

Overlay the wafer math and the arbitrage is explicit: GDDR7 costs 1.7× standard DRAM wafer-equivalent; HBM costs 4×, plus CoWoS packaging, interposer, and thermal budget. By building the prefill tier out of GDDR7, NVIDIA routes the fastest-growing slice of inference demand (long-context prompt processing) around the most constrained commodity in its own supply chain.

Every prefill FLOP served from GDDR7 is HBM allocation freed for decode, where it earns its keep.
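The arbitrage can be put in wafer-equivalent terms with the multipliers quoted above. The comparison below is a hypothetical one, assuming the same 128 GB were provisioned as HBM instead; packaging, interposer and thermal costs are left out, so it understates the gap.

```python
# Wafer capacity consumed per GB, relative to standard DRAM (figures from the text).
WAFER_MULTIPLIER = {"standard_dram": 1.0, "gddr7": 1.7, "hbm": 4.0}


def wafer_equivalent_gb(capacity_gb: float, kind: str) -> float:
    """Capacity expressed in standard-DRAM-equivalent gigabytes of wafer supply."""
    return capacity_gb * WAFER_MULTIPLIER[kind]


CPX_CAPACITY_GB = 128
as_gddr7 = wafer_equivalent_gb(CPX_CAPACITY_GB, "gddr7")
as_hbm = wafer_equivalent_gb(CPX_CAPACITY_GB, "hbm")   # hypothetical alternative

print(f"GDDR7: {as_gddr7:.1f} wafer-equivalent GB")
print(f"HBM:   {as_hbm:.1f} wafer-equivalent GB")
print(f"wafer claim avoided: {1 - as_gddr7 / as_hbm:.1%}")
```

On these multipliers the GDDR7 build claims a little over 40% of the wafer supply the same capacity would need as HBM.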

NVIDIA is pitching this with a number that deserves both scrutiny and attention: $5B of token revenue per $100M of infrastructure, 30-50× platform ROI, and naming Cursor, Runway, and Magic as early partners, which tells you the target token distribution: repository-scale coding context and long-form generative media, i.e., prefill-dominated workloads. Discount the multiplier as marketing; the direction is load-bearing.

The strategic read for this audience: hardware disaggregation of prefill and decode is memory-market arbitrage instantiated in silicon, and it will not stay proprietary to NVIDIA. SRAM-based decode tiers are being positioned into the same disaggregated fabrics, AMD will be forced to answer, and every serious serving stack (Dynamo, vLLM’s disagg mode, SGLang, Mooncake-style architectures) has converged on KV-cache-transfer-over-fabric as the central abstraction of 2026 serving.

The clearest tell that this is permanent: per Tom’s Hardware’s platform teardown, the Vera Rubin rack’s BlueField-4 DPU ships with an integrated SSD specifically to store KV cache: cached context now has its own dedicated hardware tier in the reference design.

If your mental model of an “inference GPU” is one SKU doing both phases, you are one hardware generation behind the economics, and Q1’s memory prices just made that lag expensive.

The repriced hierarchy, and what happens to cost per token

The uncomfortable synthesis: the price ratios between tiers compressed at the same moment the absolute levels rose, and tiering strategies earn their complexity from the ratio, not the level. A prefix cache offloading to DDR5 at 20-25% of HBM’s per-bit cost is an easy win.

The same cache at 50-70% of HBM’s per-bit cost must clear a much higher bar once you charge it for transfer bandwidth, PCIe hops, TTFT-on-miss, and the payroll that maintains it.

The offload break-even, derived

System prompts, shared codebase contexts, RAG corpus headers: cache. One-shot user uploads and cold agent trajectories: recompute. Cache residency is now a metered commodity; treat admission like an underwriting decision.

One nuance cuts the other way: NAND inflated less than DRAM, so the relative case for demoting warm entries to NVMe actually improved even as both tiers rose.
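One way to frame the admission decision is as a break-even hit rate: how often must a cached entry be reused per hour before holding it costs less than recomputing it? The sketch below is a generic framing under assumed inputs, not this issue’s own derivation; every price in the example is a placeholder to be replaced with your own measurements.

```python
def breakeven_hits_per_hour(kv_bytes: float, storage_usd_per_gb_hour: float,
                            recompute_usd_per_hit: float,
                            transfer_usd_per_hit: float = 0.0) -> float:
    """Reuse rate above which keeping an entry cached beats recomputing it."""
    saving_per_hit = recompute_usd_per_hit - transfer_usd_per_hit
    if saving_per_hit <= 0:
        return float("inf")   # fetching costs at least as much as recomputing
    storage_usd_per_hour = kv_bytes / 1e9 * storage_usd_per_gb_hour
    return storage_usd_per_hour / saving_per_hit


KV_BYTES = 21e9   # roughly one 128K-token context at FP8 for the 70B example

# Placeholder prices: the same entry before and after the storage tier doubles in cost.
for storage_price in (0.002, 0.004):
    hits = breakeven_hits_per_hour(KV_BYTES, storage_price, recompute_usd_per_hit=0.05)
    print(f"storage ${storage_price}/GB-hr -> cache only above {hits:.2f} hits/hour")
```

The break-even rate scales linearly with the storage price, so doubling the per-bit cost of a tier doubles the reuse an entry needs to justify its residency. Shared system prompts clear that bar easily; one-shot uploads do not.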

Three channels into cost per token

Channel 1: CapEx per node. HBM is now the single largest line item in a flagship accelerator’s package BOM (SemiAnalysis’s finding for the GB300, after rising as a share every generation since Hopper) and adding host DDR5 and NVMe puts memory at roughly half the bill of a modern inference node (author’s estimate on top of that sourced base).

HBM moves slowly through long-term contracts; host DRAM and SSD are bought near spot, which is where 2026 budgets are bleeding now. A node spec’d mid-2025 with 1.5 TB DDR5 + 30 TB NVMe carries tens of thousands of dollars of new memory cost at current prices. Amortized over four years at high utilization it’s single-digit percent per token, but it compounds with Channel 2.

Channel 2: the capacity ceiling (the big one). When memory binds, its price doesn’t just raise cost: it caps revenue per node. Max concurrency = free-memory-after-weights ÷ KV-bytes-per-sequence; decode throughput scales with batch until bandwidth saturates. As the traffic mix shifts toward long context (it is: agents, codebases, multimodal), each node serves fewer users and each token carries more fixed cost.

Cost per token rises without any component getting more expensive: pure mix shift. The shortage means you can’t fix it by bolting on DRAM at 2024 prices; the escape routes are compression and disaggregation. Both are engineering, not procurement.
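The concurrency ceiling is simple enough to compute directly. The sketch below assumes a single device with 141 GB of memory and a 70B model held at one byte per parameter; both are illustrative inputs, and runtime overheads such as activations and fragmentation are folded into an optional `reserve_fraction` that defaults to zero.

```python
def max_concurrency(device_mem_bytes: float, weight_bytes: float,
                    kv_bytes_per_seq: float, reserve_fraction: float = 0.0) -> int:
    """Whole sequences that fit in the memory left after the weights."""
    free = device_mem_bytes * (1.0 - reserve_fraction) - weight_bytes
    if free <= 0:
        return 0
    return int(free // kv_bytes_per_seq)


DEVICE_MEM = 141e9          # assumption: illustrative device capacity
WEIGHTS = 70e9              # assumption: 70B parameters at 1 byte each
KV_FP8_PER_TOKEN = 163_840  # 80-layer GQA-8 config at FP8

for context_tokens in (8_000, 32_000, 128_000):
    seqs = max_concurrency(DEVICE_MEM, WEIGHTS, KV_FP8_PER_TOKEN * context_tokens)
    print(f"{context_tokens:>7} tokens of context -> {seqs} concurrent sequences")
```

Under these assumptions the same device holds 54 sequences at 8K context and 3 at 128K, with no component price changing between the two cases. That is the mix-shift effect in isolation.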

Channel 3: the tier-ratio compression of Fig. 3, which quietly deprecates a generation of offloading tricks and re-rates model-architecture choices (§02, Fig. 4).

Winners, losers, and the direction of the tilt

Hyperscalers and frontier labs are insulated. Multi-year HBM/DDR5 LTAs signed pre-squeeze, first-priority allocation (Micron: “larger, strategic customers”), fleets big enough that architecture fixes amortize across trillions of tokens.

Databricks’ disclosure that cost-aware autoscaling and “model unit” abstractions cut GPU spend over 80% versus static provisioning (on a platform serving 120 trillion tokens a month) shows where leverage lives at that scale: utilization engineering, because their input prices are contractually smoothed.

Mid-tier providers and self-hosters absorb the shock. They buy near spot, hold no allocation priority, and their pitch (undercutting frontier APIs) sits directly on the inflating commodity. Worse, their buyers are the newly capped: under a token budget, an enterprise squeezes more work from the same spend rather than paying up, so the mid-tier faces rising input costs and a customer base structurally optimizing against price at the same time.

That is the textbook geometry of a margin squeeze, and it lands here first. Watch for: per-token price floors firming through H2 2026 in open-weights serving; long-context surcharges turning explicit; and prompt-caching pricing rebalancing: cached-token discounts exist because storage was cheap relative to recompute, and that ratio just moved. Cache-storage line items (already visible in frontier pricing) will spread.

The self-hosting math shifted in a direction almost nobody has updated for. The 2025 pitch (“your $15-25K refurbished H100 workstation replaces $500/month of API spend”) has two 2026 problems. The workstation’s own BOM inflated: RAM alone on a new server build can now approach the cost of an entire refurbished platform. And the API side is partially shielded by hyperscaler contract insulation plus superior compression engineering.

The perverse result: the memory supercycle is a centralizing force. It taxes small-fleet inference more heavily than hyperscale inference, at precisely the moment open-weights quality made decentralized serving viable.

If you’re modeling a GPU buy this year, redo it with 2026 memory line items, and note that used enterprise hardware with pre-crisis RAM already installed is currently the cleanest arbitrage in the market, which is exactly why the refurb channel is having a moment.

Proof of pass-through: the March repricing, and the token bill in real dollars

A thesis this size needs a smoking gun: evidence that memory contract prices actually reach the price you pay per GPU-hour, on a measurable lag. Q1 2026 provided it.

Silicon Data’s SDB200RT index (the standardized benchmark for B200 cloud rental pricing) opened 2026 at 4.40, drifted through January and February, then rose 23.6% inside March alone, crossing 5.0 on March 15 and 6.0 on March 23 before settling at 5.48: up 24.4% year-to-date.

Their attribution is the entire argument of this issue in one sentence: Samsung and SK Hynix raised HBM3e contract prices ~20% for 2026 deliveries, NVIDIA revised hardware MSRPs upward in late February citing memory component costs, and those input costs flowed into cloud hourly rates with a 4-6 week lag, landing squarely in March.

Meanwhile the two-year-old H100, whose memory was bought at pre-squeeze prices, traded in a tight, boring band all quarter. Same market, same month: the GPUs carrying new memory repriced; the GPUs carrying old memory didn’t. That divergence is the supercycle reaching your invoice.

The token bill, priced in real mid-2026 dollars

With real rental rates in hand, the bandwidth model from §03 becomes an actual price sheet. The working figure is an H200 on-demand median of ~$3.50/GPU-hr across neocloud trackers (cohort median ~$4.00; floor $2.30; hyperscalers up to $13.78).

These are bandwidth-bound floors: ideal kernels, full overlap, output tokens only. Real deployments land at 50-70% of these throughputs; the ratios between cells are the robust result.

One honest caveat, stated before a reader states it for me: this model charges every decode step for streaming the full weights and KV and ignores prefill cost, which at 128K is substantial and shifts spend toward exactly the CPX-shaped silicon.

A production number also carries utilization (Databricks’ 80% savings figure is the size of that lever), SLO headroom, and interconnect overhead in multi-GPU serving. The model’s job is not to predict your invoice to the cent; it is to rank your decisions, and the ranking is unambiguous: context policy first, KV dtype second, hardware price third.
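Turning a rental rate and a throughput ceiling into a cost floor is a unit conversion. In the sketch below the $3.50/GPU-hr rate is the working median quoted above, the 50-70% efficiency range is the one stated above, and the 1,000 tokens/s throughput is a placeholder rather than a measured or modeled value.

```python
def floor_usd_per_mtok(gpu_usd_per_hour: float, tokens_per_s: float,
                       efficiency: float = 1.0) -> float:
    """Cost floor in dollars per million output tokens for one GPU.

    efficiency scales the ideal throughput down to what a deployment achieves.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    tokens_per_hour = tokens_per_s * efficiency * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1e6


RATE = 3.50         # $/GPU-hr, working median from the text
IDEAL_TPS = 1000.0  # placeholder bandwidth-bound throughput, tokens/s

for eff in (1.0, 0.7, 0.5):
    print(f"efficiency {eff:.0%}: ${floor_usd_per_mtok(RATE, IDEAL_TPS, eff):.2f} per Mtok")
```

Because the floor is inversely proportional to throughput, anything that halves tokens per second (longer context, a fatter KV dtype) doubles the floor, whereas a 10% move in the rental rate moves it 10%. That asymmetry is the basis for the ranking above.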

The spread, marked to market. July 2, 2026

Now close the loop the title promises. As of July 2, flat per-token pricing for Llama-3.3-70B-class serving spans $0.31-$0.90 per million output tokens across the fifteen providers Artificial Analysis tracks (DeepInfra’s FP8 “Turbo” at $0.40, Groq at $0.79, Together AI and Fireworks at $0.88-0.90) and the rate is flat across the model’s 131K context window.

Set those prices against this section’s floors and the cross-subsidy stops being an inference and becomes a table:

Note the confirming detail hiding in the market data: the price leader doesn’t serve the reference model at all, it serves an FP8-quantized variant. The cheapest provider in the market is this issue’s playbook, priced and shipped.

And the honest caveats, before a reader raises them: these floors are single-GPU bandwidth ideals (real serving is less efficient, which raises them), while real fleets also earn input-token revenue and batch mixed contexts (which lowers effective floors), so treat the magnitudes as directional.

The sign pattern is the robust result: at July 2026 flat prices, long-context traffic on 70B-class serving is sold below its modeled production floor, and budget-capped buyers remove the option of raising the flat rate to fix it. Hence the final item of §07’s playbook.

Eight decisions, ranked by dollar leverage

Everything above, compiled into action and ranked by expected cost impact per unit of engineering effort for a team serving open-weights models at scale in 2026. Under the scissors of §00, read “cost impact” as what it now is, spread defense: every dollar of floor you remove is margin the ceiling can no longer take from you.

Five falsifiable predictions

Scored publicly in Q1 2027, alongside the Issue 04 speculative-decoding calls. Confidence is stated so being wrong costs me something.

P1 · Supply · confidence: high
DRAM relief does not arrive before late 2027. New capacity (Micron Boise) ramps 2027-2028; suppliers themselves guide consumer relief to ~2028. Any 2027 hardware-refresh model priced at 2024 memory levels is fiction.

P2 · Architecture · confidence: high
Open-weights flagships converge on compressed KV. By mid-2027 the majority of new frontier-adjacent open releases ship MLA-like latent attention, ≤8-head GQA, or hybrid sliding windows, marketed explicitly as serving-cost features. Model architecture is now downstream of the DRAM spot price.

P3 · Pricing · confidence: medium-high
Per-token API pricing bifurcates by context residency: at least two major providers introduce or restructure explicit cached-context storage pricing (per-token-per-hour or equivalent) by mid-2027, decoupling state from compute. Partial confirmation is already on the books (Google Vertex bills per-hour cache storage and Anthropic bills TTL-tiered cache writes) so the live prediction is that this becomes the norm in open-weights serving, not just frontier practice.

P4 · Silicon · confidence: medium
Prefill-specialized silicon becomes a category, not a SKU: a CPX competitor (AMD or an ASIC player) is announced within 12 months of CPX GA, and “prefill accelerator” enters the standard rack taxonomy.

P5 · Market · confidence: medium
The open-weights serving price war pauses: median $/Mtok for 70B-class serving on third-party clouds goes flat-to-up through H1 2027, the first sustained non-decline in that series since it has existed. The March B200 repricing (Fig. 7) is the leading edge, and enterprise token budgets are the demand-side mechanism that lets long-context surcharges and cache-storage fees actually stick, because a buyer operating under a cap optimizes usage instead of switching providers.

The dashboard: six leading indicators, with trigger levels

The spread is the product now

For two years, inference economics was a story about FLOPs: kernel efficiency, quantized matmuls, speculative decoding, Blackwell’s FP4 throughput.

Those battles were real; this publication has spent six issues inside them. But they were fought against a background assumption so universal nobody stated it: bytes are cheap and getting cheaper.

That assumption died in Q1 2026. The binding constraint of the inference industry is now measured in wafer starts and TSV yields: in exabytes of KV state colliding with 10-16% annual bit-supply growth, in an IDC report that uses the word “permanent,” in a rental index that repriced 24% in one month because a memory contract reset five weeks earlier.

The consequences run one direction through the entire stack: hardware specializes by phase because HBM is too precious to waste on prefill; model architectures compress their KV because fat caches became unserveable; serving stacks reorganize around shipping cached state across fabrics because holding it still got expensive; and the cost advantage tilts toward whoever holds allocation contracts and compression engineering, which is to say, toward the largest players, in an industry that spent two years congratulating itself on decentralizing.

The engineers who internalize this fastest hold a simple advantage: while the market reprices memory, they reprice their need for it. Every KV byte you decline to allocate is bought at 2026 prices and sold at 2026 prices, at 100% margin, with zero lead time and no allocation meeting. The wafer sets the cost floor. The wallet sets the price ceiling.

The engineering in between is the only variable you control, and in a margin regime, that engineering stops being a cost-optimization side quest and becomes the P&L itself. Squeezes do not reward the biggest fleet; they reward the widest spread per byte.

METHOD & VERIFICATION. Market figures trace to named sources below; all KV, roofline, concurrency, and $/Mtok arithmetic is the author’s, derived in-text from published model configs and device datasheets so you can check every step. Cost floors are bandwidth-bound ideals: absolute values are lower bounds, ratios are the robust result.

The Fig. 7 path between published waypoints is interpolated; Fig. 3’s intermediate points are directional; Fig. 0 is a directional synthesis whose turning points, not slopes, are the sourced claims. Pricing data was spot-checked the week of publication and moves weekly: recheck before committing capital. Corrections run at the top of the next issue, as always.

SOURCES: TrendForce: 1Q26 DRAM/NAND contract revisions; “Memory Wall” research (Jan 2026); HBM adoption & 2029 inference-demand outlook; AI wafer-equivalent consumption via Commercial Times (Dec 2025) · IDC: “Global Memory Shortage Crisis” (Feb 2026) · IEEE Spectrum: “AI Is a Memory Hog” (Apr 2026) · Counterpoint Research: Q1’26 spot DRAM tracking · CNBC: Micron/SK Hynix/Samsung allocation & CES 2026 reporting (Jan 2026) · Avnet/Omdia/BofA: supercycle & supply-relief estimates · SemiAnalysis: “Another Giant Leap: The Rubin CPX Specialized Accelerator & Rack” (Sep 2025) · The Register: CES 2026 Vera Rubin systems coverage (Jan 2026) · Tom’s Hardware: “Nvidia’s Vera Rubin platform in depth” (Nov 2025, incl. BlueField-4 KV-cache SSD) · Futurum/Ori/NADDOD: Rubin CPX technical analyses · NVIDIA: AI Infra Summit & GTC 2026 announcements · Databricks Engineering: “Reliable LLM Inference at Scale” (May 2026) · SemiAnalysis: “TokenBudgeting: Our Conversations with Enterprises on Token Spend” (Jul 2, 2026) · Artificial Analysis: Llama 3.3 70B provider benchmarking & cache-pricing mechanics (Jul 2026) · aipricing.guru / Price Per Token: provider price pages (sourced Jul 2, 2026) · Silicon Data: SDB200RT B200 index, March 2026 update · aimultiple GPU Rental Price Index; getdeploying B200/H200 trackers; Jarvislabs H200 pricing (2026) · DeepSeek-V3 technical report: MLA configuration · vLLM: NVFP4 KV-cache PR (in progress at publication).

The Router and the Wire

Lorenzo Bradanini — Mon, 29 Jun 2026 21:18:38 GMT

The cost did not vanish. It moved.

Every few months the inference market tells itself a story about where the money goes, and every few months the story is wrong in the same direction. For two years the story was memory.

The key-value cache grew with context, the weights grew with parameter count, and the binding constraint on a serving deployment was how many bytes of high-bandwidth memory you could buy and how fast you could read them.

That story was true, and it is the subject of two earlier issues of this publication. It is also no longer the whole story for the models that now define the frontier.

The frontier moved to sparsity. A dense model the size of DeepSeek-V3 would activate all 671 billion of its parameters on every token it processed. The mixture-of-experts version activates roughly 37 billion.

On paper that is a reduction in arithmetic of roughly eighteen to one, and it is the single reason a model with two-thirds of a trillion parameters can be served at all without a fleet of accelerators per request.

The promise of the architecture was always framed in floating-point operations: do less math, pay less money. The vLLM and SGLang communities, NVIDIA, DeepSeek, and every serving vendor in between repeated some version of it.

The arithmetic did get cheaper. What the framing left out is that the arithmetic was never the part that was hard to scale. When you spread 256 experts across dozens or hundreds of accelerators, a token routed to eight of them has to physically travel to those eight accelerators, be processed, and travel back to be recombined.

That round trip is an all-to-all communication pattern, and it does not appear in any FLOP count. It is the part of the bill that the sparsity story quietly moved off the compute line and onto the network line, where it has been growing ever since.

DigitalOcean’s engineering writers put the distinction more bluntly than most vendors will. For a dense model, they note, cost scales with memory and is linear and predictable. For a mixture-of-experts model, cost becomes a game of communication. That is the thesis of this issue, stated in five words by someone selling cloud capacity.

The rest of this report is the long version: what the all-to-all actually costs in bytes and in silicon, why a rack that lists for two to three million dollars is best understood as an answer to a networking problem rather than a compute one, and whether the wide expert parallelism that everyone is now deploying actually earns its keep.

There is a seductive counterexample worth disposing of immediately, because it will come up. The KTransformers project can run the complete DeepSeek-V3 model on a single low-cost server with one consumer GPU, a machine that costs in the neighborhood of ten thousand dollars, and still produce nearly twenty tokens per second.

If a mixture-of-experts model can run on a ten-thousand-dollar box, how can the routing be expensive?

The answer is that the KTransformers configuration never pays the all-to-all toll, because there is no all-to-all. With every expert resident in the memory of a single node, routing a token to an expert is a memory lookup, not a network transfer.

The economics that follow in this issue are the economics of scale, of serving thousands of concurrent users at frontier latency, and the moment you cross the boundary from one node to many, the toll switches on. The single-box demo is real, and it is exactly why the multi-box reality is so often misunderstood.

What sparsity actually buys, and what it borrows

Begin with the structure of the model itself, because the geometry of the dispatch is dictated by it. DeepSeek-V3 carries 256 routed experts per mixture-of-experts layer plus one shared expert that processes every token. A gating network selects the top eight routed experts for each token.

The model has 61 transformer layers, 58 of which are mixture-of-experts layers; the first three are dense. So for the overwhelming majority of the depth of the network, every single token triggers a routing decision, a dispatch to eight destinations, and a combine back.

The routing itself is not free, though it is cheap relative to the transfer. A gating network scores all 256 experts for every token and selects the top eight, and that scoring, the sorting, and the construction of the dispatch order add a small compute and synchronization cost before any data moves.

It is minor against the roughly 168 kilobytes of dispatch and combine traffic that each token then generates per layer, a figure derived later in this issue, but it is one more thing the dense model never does, and at the token rates of a frontier deployment even minor per-token costs accumulate. The gating is also where the load imbalance originates, since it is the gate’s learned preferences that send too many tokens to too few experts, which makes it both the cheapest and the most consequential of the operations the router performs.

The sparsity is genuine and the savings are genuine. Switch Transformer, the architecture that popularized the modern top-k mixture, demonstrated roughly a sevenfold speedup over a dense model of equivalent quality, and that ratio has only widened as expert counts have grown.

When practitioners say mixture-of-experts reduces computation by ninety percent, they are describing the activated-parameter ratio, and they are not wrong about it. A token that touches 37 billion of 671 billion parameters is doing far less matrix multiplication than a token that touches all of them.

But the activation has to move. In a single-accelerator world, the experts a token needs are sitting in local memory and the only thing that travels is a memory read. In a serving deployment large enough to matter, the experts are spread across the accelerators by expert parallelism precisely so that each accelerator holds only a few of them and the aggregate weight footprint fits.

That is the entire point of expert parallelism: it is what lets you serve a model whose experts, summed, are far too large for any single device. And it is also what guarantees that the token and its eight chosen experts will, in general, live on different devices.

So the model does two collective operations per mixture-of-experts layer that a dense model never does. The dispatch, sometimes called the scatter, sends each token’s hidden activation to the devices holding its selected experts. The combine, the gather, takes the eight expert outputs and reduces them back into a single vector on the token’s home device.
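The two collectives are easy to picture with a toy router. The sketch below is an illustration under stated assumptions, not DeepSeek's code: the gate is replaced by a random choice of eight distinct experts, the 256 experts are laid out eight to a device across 32 devices, and each token's home device is assigned round-robin. All function names are hypothetical.

```python
import random

random.seed(0)
NUM_TOKENS, NUM_EXPERTS, TOP_K, NUM_DEVICES = 64, 256, 8, 32
EXPERTS_PER_DEVICE = NUM_EXPERTS // NUM_DEVICES  # assumed even layout

def route(num_tokens):
    """Stand-in for the gate: pick TOP_K distinct experts per token at random."""
    return [random.sample(range(NUM_EXPERTS), TOP_K) for _ in range(num_tokens)]

def dispatch(routes, home_device):
    """Scatter: fill each device's inbox with (token, expert) pairs.

    Also counts how many sends leave the token's home device.
    """
    inbox = {d: [] for d in range(NUM_DEVICES)}
    remote_sends = 0
    for token, experts in enumerate(routes):
        for expert in experts:
            dest = expert // EXPERTS_PER_DEVICE
            inbox[dest].append((token, expert))
            if dest != home_device[token]:
                remote_sends += 1
    return inbox, remote_sends

def combine(inbox, num_tokens):
    """Gather: count the expert outputs returned to each token's home device."""
    returned = [0] * num_tokens
    for items in inbox.values():
        for token, _expert in items:
            returned[token] += 1
    return returned

home = [t % NUM_DEVICES for t in range(NUM_TOKENS)]  # assumed round-robin homes
routes = route(NUM_TOKENS)
inbox, remote_sends = dispatch(routes, home)
returned = combine(inbox, NUM_TOKENS)
total_sends = NUM_TOKENS * TOP_K
print(f"{remote_sends} of {total_sends} dispatch sends left the home device")
```

Every token gets exactly eight outputs back, and with experts spread this thinly almost none of the sends stay local, which is the point of the paragraph above.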

Survey work on efficient inference serving is consistent on the consequence: this all-to-all exchange of token dispatch and output gathering is the bottleneck in large-scale mixture-of-experts inference. Not the expert math. The exchange around it.

This is the borrowing that the sparsity bargain does not advertise. You spend less on arithmetic and you take on a debt denominated in bandwidth, and the debt comes due on a part of the machine that has improved far more slowly than the compute has.

To see why that matters, you have to look at the wire.

A rack-scale answer to a network problem

The defining fact about modern accelerators is that their arithmetic has outrun their interconnect, and it has done so by a margin that is hard to overstate until you put the numbers on a single axis. A Blackwell B200 reads from its own high-bandwidth memory at roughly eight terabytes per second.

The fifth-generation NVLink fabric that connects it to its neighbors moves 1.8 terabytes per second per GPU, eighteen links at a hundred gigabytes per second each. That is the fast path between two accelerators, and it is already more than four times slower than the path to local memory.

Then you fall off the edge. A single four-hundred-gigabit InfiniBand network card, the scale-out path that connects one node to another, moves about fifty gigabytes per second. The cliff from on-package memory to the cross-node network is more than two orders of magnitude.
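Putting the three figures on one axis takes a few lines. The sketch below uses only the bandwidths quoted above; the dictionary labels and the helper name are illustrative.

```python
# Bandwidth figures quoted in the text, in gigabytes per second.
BANDWIDTH_GBPS = {
    "HBM (B200 local memory)": 8000.0,
    "NVLink 5 (per GPU)": 1800.0,
    "InfiniBand 400 Gb/s NIC": 50.0,
}

def slowdown_vs_hbm(bandwidths):
    """Return how many times slower each tier is than local HBM."""
    hbm = bandwidths["HBM (B200 local memory)"]
    return {name: hbm / bw for name, bw in bandwidths.items()}

ratios = slowdown_vs_hbm(BANDWIDTH_GBPS)
for name, ratio in ratios.items():
    print(f"{name:26s} {BANDWIDTH_GBPS[name]:7.0f} GB/s   {ratio:6.1f}x slower than HBM")
```

NVLink comes out a little over four times slower than local memory, and the single network card 160 times slower, which is the "more than two orders of magnitude" in the paragraph above.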

FIG 1 The bandwidth available to a token collapses as it travels outward from the chip. The all-to-all of a mixture-of-experts layer lives somewhere on the right half of this chart, and where exactly is the whole question.

This is the chart that explains the rest of the hardware industry’s behavior. If the all-to-all of a mixture-of-experts layer can be kept inside the NVLink domain, it runs at hundreds of gigabytes per second.

If it has to cross the InfiniBand fabric between racks, it runs at a fraction of that. Introl’s infrastructure analysis puts the ratio at roughly eighteen to one between scale-up bandwidth inside the NVLink domain and scale-out bandwidth between racks. For an architecture whose dominant cost is an all-to-all, that ratio is not a detail. It is the design center.

Which is what the GB200 NVL72 is. NVIDIA’s rack-scale system connects 72 Blackwell GPUs and 36 Grace CPUs into a single NVLink domain delivering 130 terabytes per second of aggregate, non-blocking, all-to-all bandwidth, with 13.5 terabytes of high-bandwidth memory addressable as one pool. Before this system, the largest NVLink domain you could buy was eight GPUs on a single baseboard.

The NVL72 takes the fast interconnect and stretches it across an entire rack so that 72 accelerators can talk to each other as though they were neighbors on the same board. NVIDIA’s own materials describe the result as a single massive GPU, and for the purposes of a mixture-of-experts all-to-all, that marketing is closer to literally true than marketing usually is.

The price of that rack is two to three million dollars, it draws around a hundred and twenty kilowatts, and it is liquid-cooled because there is no other way to remove the heat. It is easy to read those numbers as a statement about compute density, and the 1.44 exaflops of four-bit tensor performance per rack invites that reading.

But the compute was never the scarce thing. You can buy 720 petaflops of eight-bit compute in roughly 182 H100 accelerators for less money than an NVL72 costs.

What you cannot buy that way is a 72-way all-to-all domain. The premium on the rack is, in substantial part, the premium on the wire. It is the cost of not having to cross InfiniBand for the operation that a mixture-of-experts model performs 58 times per token.

Read this way, a great deal of the 2026 accelerator roadmap resolves into a single sentence: make the all-to-all domain larger than the problem.

At CES 2026 NVIDIA disclosed that the next-generation Vera Rubin NVL72 will roughly double per-GPU NVLink bandwidth to 3.6 terabytes per second and lift aggregate all-to-all bandwidth to 260 terabytes per second, with the explicit justification, in NVIDIA’s own words, that this is the bandwidth needed for the all-to-all communications of leading mixture-of-experts architectures.

The company has stopped being coy about it. The interconnect generation is being sold, by name, as the answer to the problem that the model generation created.

Twenty streaming multiprocessors, give or take

Hardware sets the ceiling. Whether you reach it is a question of kernels, and the reference implementation for the mixture-of-experts all-to-all is DeepEP, the communication library DeepSeek open-sourced during its 2025 release week.

DeepEP is worth studying closely not because it is the only such library but because it is the one whose measured numbers are public, and those numbers are the closest thing the field has to a ground truth for what the all-to-all costs at the kernel level.

DeepEP provides two classes of kernel, and the split maps precisely onto the two phases of inference. The normal kernels are tuned for throughput and serve training and the prefill phase, where batches are large and the all-to-all moves a great deal of data at once.

The low-latency kernels are tuned for the decode phase, where each step generates one token per sequence, the batches are tiny, and what matters is not bandwidth but the round-trip time of the dispatch and combine.

This is the same prefill-versus-decode division that disaggregated serving exploits, examined one issue ago, now visible at the level of individual communication kernels.

The measured bandwidths tell the scale-up story in a single table. On Blackwell-class hardware, DeepEP’s dispatch kernel moves 726 gigabytes per second and its combine kernel 740 gigabytes per second when the experts are inside the NVLink domain. The same kernels, forced across the internode RDMA fabric on the same generation of hardware, move about 90 gigabytes per second each.

That is roughly an eight-to-one gap, the cliff of Figure 1 reproduced at the kernel level on a real workload: the configuration DeepSeek published, with eight thousand tokens per batch, a hidden dimension of 7168, top-eight routing, eight-bit dispatch, and sixteen-bit combine.

FIG 2 The same kernel, the same hardware, the same workload. The only variable is whether the experts sit inside the NVLink domain or across the network. That one boundary costs roughly a factor of eight.
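The size of that boundary penalty follows directly from the four measured bandwidths. A minimal sketch, using only the figures quoted above (the names are illustrative):

```python
# Kernel bandwidths quoted in the text, in GB/s.
NVLINK = {"dispatch": 726.0, "combine": 740.0}
RDMA = {"dispatch": 90.0, "combine": 90.0}

def boundary_penalty(inside, across):
    """Ratio of in-domain to cross-network bandwidth for each kernel."""
    return {kernel: inside[kernel] / across[kernel] for kernel in inside}

penalty = boundary_penalty(NVLINK, RDMA)
for kernel, ratio in penalty.items():
    print(f"{kernel}: {ratio:.1f}x less bandwidth once the experts sit across the network")
```

Both kernels land just above eight, which is where the caption's factor comes from.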

The second thing DeepEP reveals is subtler and, for the economics, more important. Moving data costs compute. The all-to-all kernels do not run on dedicated networking silicon; they run on the same streaming multiprocessors that would otherwise be doing matrix multiplication.

Every SM assigned to push bytes through the fabric is an SM not computing an expert. DeepEP’s first version spent around 24 SMs on the communication for a training-scale all-to-all.

Its second version, a substantial rewrite that moved from a custom backend to a more lightweight one built on NVIDIA’s NCCL, cut that to between four and six SMs for the same work while matching or exceeding the old bandwidth.

The library’s authors describe the V2 rewrite as achieving extreme performance with several times fewer SM resources, and the measured table backs the claim: up to 1.3 times the peak bandwidth at up to four times fewer SMs.

But the decode path is hungrier than the training path, and here the numbers sharpen into a real cost. To hit maximum throughput on the NVLink decode all-to-all, DeepEP’s table shows the kernel consuming 64 streaming multiprocessors. A B200 has 148 of them. That is forty-three percent of the entire accelerator spent moving data rather than computing, in the configuration tuned for speed.

You can run the same kernel in a low-SM mode that uses 24, but you give up bandwidth to do it. The library also offers genuinely zero-SM paths for pipeline parallelism, context parallelism, and certain RDMA transfers, by offloading the movement to copy engines and the network cards directly, and a great deal of the engineering frontier in 2026 is about pushing more of the all-to-all onto those zero-SM paths.
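The share of the accelerator given up in each mode is a one-line division. The sketch below uses the SM counts quoted above; the mode labels are shorthand, not DeepEP option names.

```python
B200_SM_COUNT = 148  # SMs on a B200, as stated in the text

def share_spent_on_comm(comm_sms, total_sms=B200_SM_COUNT):
    """Fraction of the GPU's SMs that are moving data instead of computing experts."""
    return comm_sms / total_sms

# Labels are descriptive shorthand for the three cases in the text.
modes = {"max-throughput decode": 64, "low-SM decode": 24, "zero-SM path": 0}
shares = {name: share_spent_on_comm(sms) for name, sms in modes.items()}
for name, share in shares.items():
    print(f"{name:22s} {modes[name]:3d} SMs  {share:6.1%} of the GPU")
```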

The reason that frontier exists is Figure 3.

FIG 3 The all-to-all is not free even when the wire is fast, because it is paid for in the same currency as the math. At peak decode throughput the communication kernel claims forty-three percent of the GPU.

There is real craft in how DeepEP hides this. The library exposes a hook-based mechanism for overlapping communication with computation: a dispatch is launched, independent work proceeds on the compute stream while the data is in flight, and only when the result is needed does the kernel wait.

Done well, the all-to-all latency disappears behind the expert computation and the SM cost is the only thing left to account for. Done badly, the all-to-all stalls the pipeline and the expensive accelerators sit idle waiting for the network. The difference between those two outcomes is most of the difference between a good mixture-of-experts deployment and a wasteful one, and none of it is visible in a FLOP count.
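The hook pattern can be imitated in plain Python to show why the ordering matters. This is a sketch of the idea only, not DeepEP's API: the transfer and the compute are simulated with sleeps, the durations are made up, and `launch_dispatch` is a hypothetical name.

```python
import time
from concurrent.futures import ThreadPoolExecutor

COMM_SECONDS = 0.05     # pretend all-to-all time (assumed)
COMPUTE_SECONDS = 0.05  # pretend independent compute time (assumed)

_pool = ThreadPoolExecutor(max_workers=1)  # stands in for the communication stream

def _transfer(payload):
    time.sleep(COMM_SECONDS)
    return payload

def launch_dispatch(payload):
    """Start a simulated transfer; return a hook that blocks until the data has arrived."""
    future = _pool.submit(_transfer, payload)
    return future.result

def independent_compute():
    """Work that does not depend on the dispatched tokens."""
    time.sleep(COMPUTE_SECONDS)

def serial_step():
    start = time.perf_counter()
    hook = launch_dispatch("tokens")
    hook()                 # wait immediately: nothing overlaps
    independent_compute()
    return time.perf_counter() - start

def overlapped_step():
    start = time.perf_counter()
    hook = launch_dispatch("tokens")
    independent_compute()  # runs while the transfer is in flight
    hook()                 # wait only when the result is needed
    return time.perf_counter() - start

serial = serial_step()
overlapped = overlapped_step()
print(f"serial {serial * 1e3:.0f} ms, overlapped {overlapped * 1e3:.0f} ms")
```

The only difference between the two functions is where the hook is called, and that is the difference between paying for the transfer and hiding it.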

The second version pushes the SM problem harder by moving the data movement off the streaming multiprocessors entirely wherever it can. Its experimental branches expose zero-SM paths for pipeline and context parallelism, handing the transfers to the GPU’s copy engines, and a zero-SM remote-memory primitive the authors call Engram that lets one device reach into another’s memory over RDMA without spending a single SM on the transfer.

The motivation is exactly Figure 3: every SM the network gives back is an SM the experts can use. The NCCL-based backend, for its part, lets the library reuse existing communicators and scale the expert-parallel domain to as many as two thousand devices, far past anything a production model currently needs.

A separate branch rebuilds the kernels around the tensor-memory-accelerator instructions on Hopper and Blackwell, shrinking SM usage again and adding native four-bit support, which is how the dispatch leg of the toll gets cheaper at the same moment the experts do.

None of this would matter if the expert computation itself were not reorganized to match. The all-to-all delivers a variable number of tokens to each expert, because the gating network does not distribute traffic evenly, and a standard batched matrix multiply assumes a fixed shape.

The answer, embodied in DeepSeek’s companion DeepGEMM library, is a grouped matrix multiply that processes each expert’s variable token count as a contiguous segment, so the expert math runs as one efficient kernel rather than a ragged collection of small ones.

The communication and the computation are co-designed: the all-to-all produces exactly the memory layout the grouped GEMM wants to consume.
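The layout idea can be shown without any GPU code. The sketch below is a pure-Python illustration of grouping, not DeepGEMM: tokens are permuted so each expert's share is one contiguous segment, each segment is processed in a single pass, and the outputs are put back in token order. The tiny two-by-two "experts" and all names are hypothetical.

```python
from itertools import accumulate

def group_by_expert(expert_ids, num_experts):
    """Permutation that makes each expert's tokens contiguous, plus segment offsets."""
    order = sorted(range(len(expert_ids)), key=lambda i: expert_ids[i])  # stable sort
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    offsets = [0] + list(accumulate(counts))
    return order, offsets

def matvec(weight, x):
    """Multiply one weight matrix (list of rows) by one token vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def grouped_apply(tokens, expert_ids, expert_weights):
    order, offsets = group_by_expert(expert_ids, len(expert_weights))
    packed = [tokens[i] for i in order]
    out_packed = []
    for e, weight in enumerate(expert_weights):
        segment = packed[offsets[e]:offsets[e + 1]]  # variable length, may be empty
        out_packed.extend(matvec(weight, x) for x in segment)
    out = [None] * len(tokens)
    for pos, i in enumerate(order):                   # undo the permutation
        out[i] = out_packed[pos]
    return out, offsets

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
expert_ids = [2, 0, 2, 1]                             # uneven load: expert 2 gets two tokens
expert_weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: doubles
    [[0.0, 1.0], [1.0, 0.0]],  # expert 2: swaps
]
out, offsets = grouped_apply(tokens, expert_ids, expert_weights)
print(offsets)  # segment boundaries per expert
print(out)
```

A real grouped GEMM replaces the inner loop with one kernel launch over all segments; the offsets array is the part the communication layer has to produce.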

Pull either apart from the other and the efficiency collapses, which is part of why a tuned mixture-of-experts stack is so much harder to assemble than the FLOP savings would suggest.

What every token pays at the door

The bandwidth numbers describe the pipe. The next question is how much the model tries to push through it, and that can be computed directly from the geometry, which makes it one of the few places in this analysis where the arithmetic is exact rather than measured.

Take DeepSeek-V3’s mixture-of-experts layer. Each token’s hidden activation is a vector of 7168 values. On the dispatch, those values are sent in eight-bit precision, so one byte each, and they are sent to each of the eight selected experts. That is 7168 times 8, roughly 56 kilobytes of dispatch traffic per token per layer.

On the combine, each of the eight experts returns an output vector of the same width, but the combine is performed in sixteen-bit precision to preserve the accuracy of the reduction, so two bytes each. That is 7168 times 2 times 8, roughly 112 kilobytes of combine traffic per token per layer.

Add them and a single token, passing through a single mixture-of-experts layer, generates about 168 kilobytes of all-to-all traffic.
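The arithmetic is short enough to check by hand or in a few lines. The sketch below reproduces it from the geometry given above, counting a kilobyte as 1,024 bytes, which is the convention that yields the round figures in the text. The final line, the total across all 58 mixture-of-experts layers, is a derived number that simply multiplies the per-layer toll out.

```python
HIDDEN = 7168        # values in one token's hidden activation
TOP_K = 8            # experts selected per token
MOE_LAYERS = 58
DISPATCH_BYTES_PER_VALUE = 1   # eight-bit dispatch
COMBINE_BYTES_PER_VALUE = 2    # sixteen-bit combine
KIB = 1024

dispatch_bytes = HIDDEN * DISPATCH_BYTES_PER_VALUE * TOP_K
combine_bytes = HIDDEN * COMBINE_BYTES_PER_VALUE * TOP_K
per_layer_bytes = dispatch_bytes + combine_bytes
per_token_bytes = per_layer_bytes * MOE_LAYERS

print(f"dispatch : {dispatch_bytes / KIB:.0f} KiB per token per layer")
print(f"combine  : {combine_bytes / KIB:.0f} KiB per token per layer")
print(f"total    : {per_layer_bytes / KIB:.0f} KiB per token per layer")
print(f"all MoE layers: {per_token_bytes / KIB**2:.2f} MiB per token")
```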

FIG 4 A derived figure for DeepSeek-V3 geometry. The routing multiplies per-token data movement by roughly twelve and, unlike a dense layer, all of it has to cross the fabric. The model stacks 58 of these layers.

Set that against what a dense layer moves for the same token, which is essentially nothing across the fabric: the activation stays on the device and the only traffic is the local memory read of about 14 kilobytes. The mixture-of-experts layer moves roughly twelve times as much data per token, and the crucial difference is not the multiple but the destination.

The dense traffic stays on-chip. The mixture-of-experts traffic crosses the network. And this happens 58 times as the token descends through the model.

Two things follow from the structure of those 168 kilobytes. The first is that the combine is twice the dispatch, because the combine runs in higher precision. This is not an arbitrary choice; reducing eight expert outputs in eight-bit precision degrades quality unacceptably, so the field has settled on eight-bit dispatch and sixteen-bit combine as the standard, and that asymmetry means the return trip is the more expensive leg.

Any optimization that can compress the combine, including the four-bit experiments now appearing in DeepEP’s experimental branches, attacks the larger half of the toll.

The second is that the toll is paid per token, which means the decode phase, where tokens are generated one at a time, pays it in the worst possible way. In prefill, thousands of tokens are dispatched together and the all-to-all amortizes its latency across an enormous batch; the kernel runs in its throughput regime and the bandwidth numbers of Figure 2 apply.

In decode, a single step might dispatch only a handful of tokens per sequence, the batch is tiny, the bandwidth of the pipe is irrelevant because the pipe is nearly empty, and what dominates is the fixed round-trip latency of reaching across the fabric and back.

This is why the low-latency decode kernels exist as a separate class, why they are willing to burn 64 SMs to shave microseconds, and why decode is the phase where the mixture-of-experts toll hurts most.

It is also why the entire industry serves prefill and decode on separately tuned pools of hardware, a point this publication examined at length one issue ago and which the all-to-all only sharpens.

The decode penalty is worth making concrete, because it is where the toll is most counterintuitive. At a service level of a hundred tokens per second per user, the budget for generating one token is ten milliseconds, and into that budget the model must fit 58 mixture-of-experts layers, each with a dispatch and a combine that reach across the fabric.

Inside the NVLink domain a round trip is measured in microseconds and 58 of them fit with room to spare; across the InfiniBand fabric the same round trips, with their higher fixed latency, begin to eat the budget directly.

That is also why a decode deployment forced to leave the NVLink domain for its all-to-all can miss its latency target even when its aggregate bandwidth looks adequate on paper. In decode, latency is the currency, and the fabric boundary is where it gets spent.
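The budget can be divided out explicitly. In the sketch below the service level and layer count come from the text; the two per-collective latencies are hypothetical round numbers chosen only to show how the share scales, not measurements of any fabric, and the model assumes no overlap with compute.

```python
TOKENS_PER_SECOND_PER_USER = 100
MOE_LAYERS = 58
COLLECTIVES_PER_LAYER = 2  # one dispatch, one combine

budget_us = 1_000_000 / TOKENS_PER_SECOND_PER_USER  # time allowed per token
per_layer_us = budget_us / MOE_LAYERS               # for everything the layer does

def comm_share_of_budget(latency_us_per_collective):
    """Share of the per-token budget spent on collectives if none of it is hidden."""
    total_comm_us = MOE_LAYERS * COLLECTIVES_PER_LAYER * latency_us_per_collective
    return total_comm_us / budget_us

print(f"budget per token: {budget_us:.0f} us, per MoE layer: {per_layer_us:.0f} us")
for latency_us in (10, 100):  # hypothetical latencies, for illustration only
    share = comm_share_of_budget(latency_us)
    print(f"{latency_us:4d} us per collective -> {share:.0%} of the budget")
```

At the lower hypothetical latency the 116 collectives take a modest slice of the ten milliseconds; at the higher one they exceed it outright, before any expert math has run.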

The hottest expert sets the clock

There is a failure mode hiding inside the all-to-all that the bandwidth numbers do not capture at all, and it is the one that most often separates a deployment hitting its theoretical throughput from one falling well short of it. An all-to-all is a synchronization barrier.

The combine cannot complete until every expert has returned its outputs, which means the slowest expert on the most overloaded device sets the pace for the entire operation. If the gating network sends a disproportionate share of tokens to a handful of popular experts, the devices holding those experts become stragglers, and every other device in the domain waits on them.

Expert load is not uniform in practice, and it is not even stable. Certain experts specialize in patterns that appear frequently in real traffic, and the imbalance shifts with the workload. Survey work documents the consequence plainly: imbalanced token distribution causes device underutilization, and the whole expensive all-to-all runs at the speed of its hottest path.

A mixture-of-experts deployment can have perfectly adequate aggregate bandwidth and still bleed throughput because the load is lumpy.
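Because the combine is a barrier, useful work divided by paid-for time is simply mean load over maximum load. The sketch below simulates that under loud assumptions: expert popularity follows a power law with a made-up skew parameter, assignments are sampled independently rather than as distinct top-eight sets, and experts are laid out eight to a device on 32 devices.

```python
import random

def utilization(tokens_per_device):
    """Barrier semantics: every device waits for the busiest one, so utilization = mean / max."""
    busiest = max(tokens_per_device)
    if busiest == 0:
        return 1.0
    return sum(tokens_per_device) / (len(tokens_per_device) * busiest)

def simulate_loads(num_assignments, num_experts=256, num_devices=32, skew=0.0, seed=0):
    """Sample token-to-expert assignments with popularity ~ 1 / rank**skew (assumed shape)."""
    rng = random.Random(seed)
    popularity = [1.0 / (rank + 1) ** skew for rank in range(num_experts)]
    experts_per_device = num_experts // num_devices
    loads = [0] * num_devices
    for expert in rng.choices(range(num_experts), weights=popularity, k=num_assignments):
        loads[expert // experts_per_device] += 1
    return loads

even = utilization(simulate_loads(32768, skew=0.0))
lumpy = utilization(simulate_loads(32768, skew=1.0))
print(f"uniform routing: {even:.0%} utilization, skewed routing: {lumpy:.0%}")
```

Total traffic is identical in both runs; only its distribution changes, and that alone is enough to leave most of the domain idle.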

DeepSeek’s answer in production is an expert-parallel load balancer that the community has reproduced under the name EPLB. The mechanism is to identify the high-load experts from live deployment statistics and replicate them: a hot expert is duplicated onto multiple devices so that the tokens destined for it can be spread, flattening the straggler. This is a direct trade of memory for balance.

You spend extra capacity holding redundant copies of the popular experts in order to keep the all-to-all from stalling on them. It works, and it is now standard, but it is another line on the bill that the sparsity story did not mention, and it interacts with the deployment topology in a way that is worth seeing concretely.
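The replicate-then-place idea can be sketched with a greedy heuristic. This is an illustration of the trade, not DeepSeek's or the community's EPLB algorithm: the loads are invented, the function names are hypothetical, and the placement step does nothing to keep two copies of one expert off the same device.

```python
import heapq

def replicate_hot_experts(loads, num_redundant):
    """Give each extra slot to the expert with the highest load per replica."""
    replicas = [1] * len(loads)
    heap = [(-load, e) for e, load in enumerate(loads)]
    heapq.heapify(heap)
    for _ in range(num_redundant):
        _, e = heapq.heappop(heap)
        replicas[e] += 1
        heapq.heappush(heap, (-loads[e] / replicas[e], e))
    return replicas

def place(loads, replicas, num_devices):
    """Heaviest replica first, onto the lightest device that still has a free slot."""
    slots = -(-sum(replicas) // num_devices)  # ceiling division
    items = sorted(
        (loads[e] / replicas[e] for e in range(len(loads)) for _ in range(replicas[e])),
        reverse=True,
    )
    device_load = [0.0] * num_devices
    device_count = [0] * num_devices
    for load in items:
        open_devices = [d for d in range(num_devices) if device_count[d] < slots]
        d = min(open_devices, key=lambda i: device_load[i])
        device_load[d] += load
        device_count[d] += 1
    return device_load

loads = [800, 100, 100, 100, 50, 50, 50, 50]  # invented: one very hot expert
baseline = place(loads, [1] * len(loads), num_devices=4)
replicas = replicate_hot_experts(loads, num_redundant=3)
balanced = place(loads, replicas, num_devices=4)
print("replicas per expert:", replicas)
print("busiest device before:", max(baseline), "after:", max(balanced))
```

Three extra expert slots, which is the memory being spent, cut the busiest device's load by more than half in this toy case.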

DeepSeek runs the same model checkpoint as two physically different machines, one for each phase, and the contrast is the clearest illustration in the field of how the all-to-all reshapes a deployment. According to DeepSeek’s own published inference overview and the CloudMatrix serving analysis that reconstructs it, the prefill machine groups four nodes, 32 GPUs, into a single unit running 32-way expert parallelism alongside 32-way data parallelism.

Across those 32 GPUs the routed experts are distributed nine to a device once the redundant copies of the popular experts are counted, with the shared expert and the attention mechanism replicated on every one. The raw figure would be eight; the ninth is the load balancer at work.

The decode machine expands the same model to 18 nodes, 144 GPUs, running 144-way expert parallelism and 144-way data parallelism, where each device holds only about two routed experts.
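The per-device counts can be reconciled with the 256 routed experts in a few lines. The sketch below takes the decode figure of "about two" as exactly two, which is an assumption; under it, both machines turn out to carry the same number of expert slots.

```python
ROUTED_EXPERTS = 256

def layout(num_gpus, routed_slots_per_gpu):
    """Total routed-expert slots, how many are redundant copies, and the raw average."""
    slots = num_gpus * routed_slots_per_gpu
    return {
        "slots": slots,
        "redundant": slots - ROUTED_EXPERTS,
        "raw_per_gpu": ROUTED_EXPERTS / num_gpus,
    }

prefill = layout(num_gpus=32, routed_slots_per_gpu=9)
decode = layout(num_gpus=144, routed_slots_per_gpu=2)  # "about two", assumed exact

for name, info in (("prefill", prefill), ("decode", decode)):
    print(f"{name}: {info['slots']} slots, {info['redundant']} redundant, "
          f"{info['raw_per_gpu']:.2f} routed experts per GPU before replication")
```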

FIG 5 One checkpoint, two machines. The decode deployment spreads the experts across more than four times as many GPUs, which is partly about latency and partly about leaving room to replicate the hot experts.

Why spread the same 256 experts across 144 devices for decode when 32 sufficed for prefill?
Two reasons, and both come back to the all-to-all.

The first is latency: with fewer experts resident per device, each device does less work per step and the decode latency target is easier to hit.
The second is precisely the straggler problem. Spreading thin leaves headroom to replicate the popular experts without overflowing any device’s memory, so the load balancer has somewhere to put the redundant copies.

The decode machine is wider not because the math demands it but because the communication and the balance do. The shape of the deployment is dictated by the toll, not the FLOPs.

The toolchain the toll demanded

The all-to-all did not only reshape the hardware and the kernels. It pulled an entire toolchain into being around itself, and the size of that toolchain is the clearest measure of how far the cost migrated from the math.

A 2026 mixture-of-experts serving stack at frontier scale is not a model and a runtime. It is a model, a communication library, a grouped-GEMM library, an expert load balancer, a disaggregation layer, and an overlap scheduler, each of which exists to manage some facet of the routing tax. The FLOP count described one of those six boxes.

Consider the overlap problem at the level of an entire forward pass rather than a single layer. Hiding the all-to-all behind computation works within a layer, but the decode phase is so latency-sensitive that the field has gone further and split each batch in two, running the communication of one half against the computation of the other in a continuous pipeline.

SGLang’s two-batch overlap and the analogous schemes in other runtimes exist for one reason: to keep the expensive accelerators busy with expert math while the all-to-all of a different microbatch is in flight. It is the same instinct as the kernel-level hooks, lifted to the level of the request scheduler, and it is now a standard part of large-scale deployments rather than an exotic optimization.
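A toy schedule shows what the split buys. The sketch below is a simplified model, not SGLang's scheduler: each layer's dispatch and combine are lumped into one communication stage per microbatch, there is one communication resource and one compute resource, and the stage times are arbitrary units.

```python
def serial_time(layers, comm, compute, microbatches=2):
    """No overlap: every microbatch communicates, then computes, one after another."""
    return layers * microbatches * (comm + compute)

def pipelined_time(layers, comm, compute, microbatches=2):
    """Two resources (fabric, SMs); a microbatch's compute waits only for its own comm."""
    comm_free = 0.0
    compute_free = 0.0
    ready = [0.0] * microbatches  # when each microbatch finished its previous stage
    for _ in range(layers):
        for mb in range(microbatches):
            start = max(ready[mb], comm_free)
            comm_free = ready[mb] = start + comm
        for mb in range(microbatches):
            start = max(ready[mb], compute_free)
            compute_free = ready[mb] = start + compute
    return max(ready)

LAYERS = 58
for comm, compute in ((1.0, 1.0), (1.0, 2.0), (2.0, 1.0)):
    s = serial_time(LAYERS, comm, compute)
    p = pipelined_time(LAYERS, comm, compute)
    print(f"comm={comm}, compute={compute}: serial {s:.0f}, two-batch overlap {p:.0f}")
```

When the two stages are evenly matched the pipeline approaches half the serial time; when one dominates, the shorter stage disappears behind the longer and the total is set by whichever resource is the bottleneck.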

Disaggregation adds a second communication problem on top of the all-to-all. Once prefill and decode run on separate pools of hardware, the key-value cache computed during prefill has to be shipped to the decode pool before generation can begin, and at frontier scale that transfer is large enough and frequent enough to need its own engine.

The Mooncake transfer engine and the equivalent layers inside vLLM and SGLang exist to move key-value caches across the network efficiently, overlapping the transfer with computation so the handoff does not stall the pipeline. This is a network tax distinct from the all-to-all, and it is the price of the prefill-decode split that the all-to-all economics make worthwhile in the first place.

The two taxes are siblings: both are consequences of spreading one model’s inference across many devices, and both are paid down by the same instinct of overlapping transfer with compute.

The lesson in the length of that list is that the sparsity bargain did not merely move the cost to the network. It moved the cost to a place where extracting good performance requires assembling and tuning half a dozen interacting systems, any one of which, misconfigured, hands the savings back.

The vLLM and SGLang playbooks both carry warnings to this effect, and AMD’s ROCm guide to the vLLM mixture-of-experts options is blunt that the wrong combination of tensor, data, pipeline, and expert parallelism can duplicate the key-value cache many times over and consume far more memory than expected.

The FLOP count said the model got cheaper. The operations manual says it got more complicated, and the complication is where a large part of the real cost now lives.

Does wide expert parallelism pay for itself?

All of this is overhead, and the natural reaction to a catalogue of overhead is to minimize it. If the all-to-all is the cost, why not keep the expert-parallel domain small, so the all-to-all stays inside a tight, fast group of devices?

The answer is that narrowing the domain trades one cost for another, and the trade does not run in the obvious direction.

Wider expert parallelism, counterintuitively, often produces more throughput per GPU, not less, and understanding why is the crux of whether the whole approach earns its keep.

The mechanism is expert packing. When experts are spread across more devices, each device holds fewer of them, which means more of each device’s memory and compute can be devoted to the batch of tokens currently being processed rather than to holding a large slice of the model.

Larger effective batches per device improve the arithmetic intensity of the expert matrix multiplications, the kernels run closer to the hardware’s peak, and the per-GPU throughput rises, provided the all-to-all overhead can be kept hidden behind that larger computation. The question is always whether the communication grows faster than the packing benefit, and up to a point, on the right interconnect, it does not.
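The arithmetic-intensity claim can be made concrete with a roofline-style estimate for one expert projection. The sketch below is a rough model under assumptions that are not in the text: an expert intermediate width of 2048, eight-bit weights and inputs, sixteen-bit outputs, and every operand read or written exactly once.

```python
def expert_gemm_intensity(tokens, hidden=7168, inter=2048,
                          weight_bytes=1, act_bytes=1, out_bytes=2):
    """FLOPs per byte moved for a (tokens x hidden) @ (hidden x inter) multiply.

    inter=2048 and the byte widths are assumptions made for this illustration.
    """
    flops = 2 * tokens * hidden * inter
    bytes_moved = (hidden * inter * weight_bytes   # expert weights, read once
                   + tokens * hidden * act_bytes   # input activations
                   + tokens * inter * out_bytes)   # output activations
    return flops / bytes_moved

batch_sizes = (1, 8, 64, 256)
intensities = [expert_gemm_intensity(m) for m in batch_sizes]
for m, ai in zip(batch_sizes, intensities):
    print(f"{m:4d} tokens per expert -> {ai:7.1f} FLOPs per byte")
```

With a handful of tokens the kernel spends nearly all its time reading weights; the intensity climbs almost linearly with the tokens each expert sees, which is the lever that expert packing pulls.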

NVIDIA’s measurements on the GB200 NVL72 quantify the dividend directly. Moving from an eight-way expert-parallel configuration to a 32-way one delivers up to 1.8 times the output token throughput per GPU, at a fixed service level of a hundred tokens per second per user, with disaggregated serving and multi-token prediction in both cases.

Same hardware, same latency target, nearly double the per-GPU output, purely from going wider on expert parallelism.

FIG 6 NVIDIA’s Wide-EP figures on the NVL72. Going wider improves per-GPU throughput, because the packing benefit outweighs the added all-to-all, as long as the all-to-all stays inside the NVLink domain.

The decisive qualifier is the last clause. The 1.8 times holds because the 32-way all-to-all stays inside the NVL72’s NVLink domain, where Figure 2 says it runs at 726 gigabytes per second.

The dividend exists because the wire is fast enough that going wider does not push the communication off the cliff. Try the same widening on a cluster where 32-way expert parallelism forces the all-to-all across InfiniBand, and the calculus inverts: the packing benefit is swamped by the eightfold bandwidth penalty of leaving the domain, and wider becomes worse.

This is the same fact from a different angle. The reason the rack-scale NVLink domain is worth its price is that it is what makes the wide-EP dividend positive instead of negative.

There is a second lever working alongside the width, and it appears in nearly every published wide-EP result: multi-token prediction. Rather than generating one token per forward pass, the model proposes several and verifies them together, which raises the number of tokens flowing through each all-to-all and pushes the decode kernel out of its worst, smallest-batch regime toward something the bandwidth can amortize.

Multi-token prediction and wide expert parallelism are complementary for the same underlying reason: both increase the work done per round trip across the fabric, and the all-to-all rewards anything that makes its fixed latency a smaller fraction of the whole.

The dividend in Figure 6 is partly a multi-token-prediction dividend, which is why NVIDIA and SGLang report the two together. They are deployed together because they solve the same problem from two directions.

So the answer to whether wide expert parallelism pays for itself is conditional, and the condition is the interconnect. Inside a sufficiently large fast domain, wider is genuinely better and the measurements prove it. Outside one, wider is a trap. The crossover sits exactly at the boundary of the NVLink domain, which is why the size of that domain, 8 GPUs yesterday, 72 today, the same 72 at higher bandwidth tomorrow, is the number that determines how far the dividend extends.

Expert parallelism and the interconnect are not two separate decisions. They are one decision, and the hardware vendor has been making half of it for you.

How much is silicon, and how much is numerics

It is tempting to attribute the throughput of a Blackwell mixture-of-experts deployment to the silicon, and the marketing encourages it, but the public measurements let us decompose the uplift, and the decomposition is instructive about where the real leverage sits.

The LMSYS and SGLang teams have published a careful progression of DeepSeek serving results on the GB200 NVL72, and the numbers are specific.

With disaggregated prefill and decode, large-scale expert parallelism, and the conservative numeric configuration of sixteen-bit attention and eight-bit experts, SGLang reaches 18,471 input tokens per second per GPU on prefill and 9,087 output tokens per second per GPU on decode, for two-thousand-token sequences.

Switch to the aggressive configuration, eight-bit attention and four-bit NVFP4 experts, and the same system reaches 26,156 input and 13,386 output tokens per second per GPU. Against the H100 baseline the teams report, those aggressive numbers represent a 3.8 times prefill and 4.8 times decode improvement.

FIG 7 The Blackwell uplift, decomposed. A large share of the gain over the conservative GB200 configuration comes from dropping the experts to four-bit NVFP4, not from the silicon alone.

The decomposition is the point. The jump from the H100 baseline to the conservative GB200 configuration is the hardware: faster tensor cores, the NVLink domain, more memory bandwidth. But the further jump from the conservative to the aggressive GB200 configuration, from 18,471 to 26,156 on prefill and from 9,087 to 13,386 on decode, is numerics.

It comes from running the experts in four-bit NVFP4 rather than eight-bit. That is a software-and-format change applied to the same rack, and it accounts for a substantial fraction of the total uplift over H100.
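The split can be worked out from the published figures alone. The sketch below treats the total uplift as the product of a hardware factor and a numerics factor, and it assumes the 3.8 and 4.8 multiples are measured against the aggressive configuration, as the text states; the implied H100 baseline is a back-calculation, not a reported number.

```python
conservative = {"prefill": 18471, "decode": 9087}   # tokens/s/GPU, 16-bit attn + 8-bit experts
aggressive = {"prefill": 26156, "decode": 13386}    # tokens/s/GPU, 8-bit attn + NVFP4 experts
total_vs_h100 = {"prefill": 3.8, "decode": 4.8}     # reported uplift of the aggressive config

def decompose(phase):
    """Split the total uplift into a numerics factor and a hardware factor."""
    numerics = aggressive[phase] / conservative[phase]
    hardware = total_vs_h100[phase] / numerics
    implied_h100 = aggressive[phase] / total_vs_h100[phase]  # back-calculated baseline
    return numerics, hardware, implied_h100

for phase in ("prefill", "decode"):
    numerics, hardware, implied = decompose(phase)
    print(f"{phase}: hardware x{hardware:.2f}, numerics x{numerics:.2f}, "
          f"implied H100 baseline ~{implied:.0f} tokens/s/GPU")
```

The numerics step is worth a little over 1.4 times on both phases, sitting on top of a hardware step of roughly 2.7 to 3.3 times.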

NVFP4 earns its own treatment, and it is a strong candidate for a future issue, but the relevant fact here is why it interacts so favorably with the all-to-all. Four-bit experts are half the bytes of eight-bit experts, which directly shrinks the dispatch leg of the toll, and they double the tensor-core throughput of the expert math itself, so the computation that hides the all-to-all gets faster at the same time the all-to-all gets smaller.

NVIDIA’s format reportedly holds accuracy within about one percent of the higher-precision baseline on large models through a two-level scaling scheme, and the accuracy holds up best precisely on the large mixture-of-experts models where it matters most. The format is, in effect, a second lever on the same toll that the interconnect attacks, and the two compound.

This is also why NVIDIA can credibly claim a fivefold reduction in cost per token from software optimization alone in the two months after Blackwell’s launch, with no hardware change: a large part of that was kernel and format work on exactly these operations.

One dollar, or twenty cents

The throughput numbers are engineering. The reason they matter is that they convert, almost directly, into the only number a serving operator actually cares about, which is dollars per million tokens. And here the all-to-all moves from being a technical concern to being the dominant line item in the unit economics.

The cleanest demonstration in the public record is the LMSYS deployment of DeepSeek on 96 H100 GPUs, twelve nodes of eight, using prefill-decode disaggregation and large-scale expert parallelism with the full DeepEP, DeepGEMM, and EPLB stack.

That deployment reached 52,300 input tokens per second and 22,300 output tokens per second per node, and when the team translated the throughput into cost, it came to twenty cents per million output tokens. That figure is roughly one-fifth of what DeepSeek’s own public API charged at the time, achieved on rented hardware by an outside team reproducing the architecture.

The comparison that matters most, though, is the one against the naive alternative on identical hardware. The same report states that the optimized expert-parallel strategy improved output throughput by up to five times over vanilla tensor parallelism using the same resources. Five times the throughput on the same GPUs is one-fifth the cost per token.

The all-to-all engineering, getting the dispatch and combine to run efficiently inside the fast domain, hiding the latency behind computation, balancing the hot experts, is the entire difference between a deployment at twenty cents and a deployment at a dollar.
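The conversion from throughput to dollars is short enough to write down. This is a sketch under one loud assumption: the hourly GPU rate below is an illustrative rental price chosen for the example, not a figure taken from the text. The throughput and the five-fold ratio are the ones quoted above.

```python
def cost_per_million_tokens(tokens_per_sec_per_node, gpus_per_node, usd_per_gpu_hour):
    """Dollars per one million tokens for a node running at steady throughput."""
    node_usd_per_hour = gpus_per_node * usd_per_gpu_hour
    tokens_per_hour = tokens_per_sec_per_node * 3600
    return node_usd_per_hour / tokens_per_hour * 1_000_000


# Assumption: illustrative H100 rental rate, not stated in the text.
ASSUMED_USD_PER_GPU_HOUR = 2.00

# Tuned expert parallelism: 22,300 output tokens/sec per eight-GPU node.
tuned = cost_per_million_tokens(22_300, 8, ASSUMED_USD_PER_GPU_HOUR)
# Vanilla tensor parallelism at one-fifth the output throughput, same hardware.
vanilla = cost_per_million_tokens(22_300 / 5, 8, ASSUMED_USD_PER_GPU_HOUR)

print(f"tuned EP: ${tuned:.2f}/M, vanilla TP: ${vanilla:.2f}/M")
```

At the assumed rate the tuned deployment comes out at about twenty cents per million output tokens and the vanilla one at about a dollar. The rate cancels out of the ratio, so the five-fold gap holds at any rental price. Because the report says "up to" five times, the dollar figure is the upper end of the gap.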

FIG 8 Same 96 GPUs, two ways of organizing them. The five-fold gap between vanilla tensor parallelism and tuned expert parallelism is, almost entirely, the all-to-all done well versus done naively.

Put that five-fold against the backdrop of where inference pricing has gone, and the stakes of the routing tax become clear. The price of frontier-class inference has fallen by something close to fifty times in three years, from around twenty dollars per million tokens for GPT-4-class output in late 2022 to roughly forty cents in early 2026.

Public trackers attribute the collapse to four compounding forces, and mixture-of-experts together with expert parallelism is explicitly one of them, alongside hardware efficiency, kernel and compiler optimization, and low-precision formats. Inference now consumes roughly two-thirds of all AI compute, having crossed over from a minority of it only a couple of years ago.

In that environment a five-fold cost difference is not a margin to be optimized later. It is the difference between a viable serving business and an unviable one.

FIG 9 The price floor that makes a routing tax of cents per token worth a flagship. Expert parallelism is one of the four named drivers of this curve, not a footnote to it.

Making the domain bigger than the problem

Step back from the individual numbers and a single strategic motion organizes all of them. The mixture-of-experts architecture created a communication problem.

The hardware industry’s response has been to make the fast communication domain large enough to swallow the problem whole, and the trajectory of that response is the most reliable predictor of where serving economics go next.

DeepSeek’s own engineers, in their published reflections on the hardware lessons of training V3, frame the future in exactly these terms. They call for the convergence of scale-up and scale-out, for precise low-precision compute units, and for innovations in low-latency communication fabrics.

Read against this issue, that is a wish list written by the people paying the all-to-all toll, addressed to the people who can make the domain bigger. The scale-up and scale-out convergence they ask for is precisely the elimination of the cliff in Figure 1: a world where crossing from one node to the next does not cost a factor of eight, because the fast domain has grown to encompass both.

NVIDIA is building toward exactly that, and is increasingly explicit that it is doing so for this reason. The NVL72 took the NVLink domain from 8 to 72. The NVLink Switch architecture is specified to reach 576 GPUs in a single non-blocking fabric. The Rubin generation lifts the per-GPU bandwidth again and ties the increase directly, in NVIDIA’s own framing, to the all-to-all needs of mixture-of-experts models.

Each step is sold, more openly than the last, as a larger container for the communication problem that sparsity created. The architecture and the interconnect are co-evolving, and the direction is set: the domain keeps growing, the cliff keeps receding, and the toll keeps shrinking as a fraction of the work, without ever quite reaching zero.

The domain cannot grow without limit, and the constraints on how far it can stretch are physical. NVLink at rack scale runs over copper, which is cheap and reliable but reaches only a couple of meters; pushing the domain past a single rack toward the 576-GPU fabric the switch silicon can address means either optical interconnect, with its added cost, power draw, and failure modes, or denser and hotter racks than the current design.

Power and cooling are already near the edge of what a standard data center hall delivers per rack, which is why the NVL72 is liquid-cooled and why each new generation leans harder on liquid. And the fault domain grows with the fabric, because a larger coherent domain is a larger blast radius for a single failure.

The trajectory is set toward bigger domains, but each expansion buys less headroom than the last against a wall of copper reach, power density, and fault tolerance that the all-to-all cannot argue its way past.

What this does not resolve is the dependency it creates. An operator who builds a serving business on wide expert parallelism is building on the assumption that the fast domain will keep growing, and that assumption ties the economics of the model layer to the roadmap of a single interconnect vendor.

The wide-EP dividend is real, but it is contingent on hardware that one company predominantly supplies, and the contingency is worth naming. The cheapest way to serve a frontier mixture-of-experts model in 2026 runs through a rack that is, for now, effectively sole-sourced.

That is a strategic fact about the inference market as much as a technical one, and it is the part of the story most likely to matter in the issues to come.

The dependency has not gone unanswered. An industry that has watched a single vendor’s interconnect become the determinant of mixture-of-experts economics has begun to organize alternatives.

The UALink consortium and the Ultra Ethernet effort are both attempts to build an open scale-up fabric that could host the all-to-all without routing through one company’s switches, and AMD’s serving stack now carries its own expert-parallel communication path, a port of the DeepEP ideas onto its accelerators.

None of these has yet demonstrated the rack-scale all-to-all bandwidth of an NVL72 in production, and the gap is real, but the direction of the effort is itself a measure of how much the all-to-all matters. An entire alternative-hardware ecosystem is organizing around the single operation that this issue is about.

There is also a cost that none of the throughput numbers capture, which is reliability. A 144-GPU decode deployment is one coordinated system, and the all-to-all is a synchronization barrier across all of it, which means a fault or a slowdown on any single device degrades the whole.

The larger the expert-parallel domain, the more devices have to stay healthy and in lockstep for the all-to-all to complete on time, and the operational burden of keeping a domain of that size running at frontier latency is substantial.

DeepSeek’s own diagnostic tooling for locating slow ranks in a DeepEP deployment exists because, at this scale, finding the one straggling device in a domain of hundreds is a routine and necessary operation.

The wide-EP dividend is real, but it is collected by operators who can keep a very large, very tightly coupled machine running, and that capability is a cost the smaller-domain alternatives never have to pay.

What to actually do

The analysis resolves into a handful of decisions that an operator faces in practice, and they follow from the structure rather than from any single benchmark.

The first decision is whether to use expert parallelism at all, and the honest answer is that it depends entirely on whether your all-to-all can be kept inside a fast domain.

If you are serving a frontier mixture-of-experts model at scale and you have access to a rack-scale NVLink domain, wide expert parallelism is the right tool and the measurements say to go as wide as the domain allows, because the packing dividend is positive inside the fast fabric.

If your all-to-all would have to cross InfiniBand to go wider, stop widening before it does, because the cliff inverts the dividend. The boundary of the NVLink domain is the boundary of the decision.

The second decision is how to split the phases. Prefill and decode want different all-to-all kernels, different expert-parallel widths, and in DeepSeek’s production case different physical machines entirely. The decode machine should be wider, both to hit latency targets and to leave room for the load balancer to replicate hot experts.

If you cannot afford to disaggregate, the decode phase is where the toll will hurt, and the low-latency kernels are where to spend your tuning effort. The vLLM and SGLang playbooks both warn, correctly, that the wrong parallelism strategy can duplicate key-value caches across the domain and consume many times the memory you expected, so the parallelism decision is not only about the all-to-all but about what else it forces to be replicated.

The third decision is precision, and it is mostly free throughput if you are on Blackwell. Four-bit NVFP4 experts shrink the dispatch leg of the toll and double the expert math throughput at an accuracy cost that, on large models, is small. The aggressive configuration in Figure 7 is not a marginal tuning; it is a large fraction of the total uplift, and it attacks the same toll the interconnect attacks. If your hardware supports it and your accuracy budget allows it, it is among the highest-leverage changes available.

And the fourth decision is whether you need any of this at all. If your workload is single-user or small-scale, the KTransformers lesson stands: a mixture-of-experts model on a single node never pays the toll, and the entire apparatus of expert parallelism is overhead you can decline.

The all-to-all economics in this issue are the economics of serving at frontier scale and frontier latency. Below that scale, the right move is to keep the experts local and let the toll switch stay off.

The deeper lesson is the one the sparsity story obscured for two years. Mixture-of-experts did not make inference cheaper by doing less work. It moved the work from a place that was easy to scale, the arithmetic, to a place that was hard, the network, and then the hardware industry spent two product generations and a great deal of money making the network easy to scale too.

The bargain was always real. It was just never free, and the bill was always going to come due on the wire. Knowing where it comes due, and how much, is most of what it takes to serve these models without overpaying.

The router decides which experts a token needs. The wire decides what that decision costs. For the models that now define the frontier, the wire is the more expensive of the two.

What we are confident about, and what we estimated

NVLink 5 delivers 1.8 TB/s per GPU; the GB200 NVL72 provides 130 TB/s aggregate all-to-all bandwidth across 72 GPUs, with 13.5 TB of unified HBM3e.

NVIDIA GB200 NVL72 datasheet; NVIDIA multi-node NVLink tuning guide; Introl and Spheron interconnect analyses.

DeepEP measures dispatch and combine at 726 and 740 GB/s inside the NVLink domain on Blackwell, versus about 90 GB/s each across internode RDMA, on the published V3 workload.

DeepEP V2 performance table, deepseek-ai/DeepEP repository.

DeepEP’s decode all-to-all consumes up to 64 SMs at peak throughput; the V2 rewrite cut training all-to-all SM use from 24 to between 4 and 6. A B200 has 148 SMs.

DeepEP V2 performance table and release notes; Blackwell architecture specifications.

SGLang on the GB200 NVL72 reaches 26,156 prefill and 13,386 decode tokens/sec/GPU with eight-bit attention and NVFP4 experts, reported as 3.8x and 4.8x over H100; the conservative configuration reaches 18,471 and 9,087.

LMSYS Org, GB200 NVL72 Part II, September 2025.

An LMSYS 96-GPU H100 deployment reached 52.3k input and 22.3k output tokens/sec/node and translated to $0.20 per 1M output tokens, about one-fifth the official API price, and up to 5x the throughput of vanilla tensor parallelism on the same hardware.

LMSYS Org, large-scale EP on 96 H100, May 2025.

Moving from EP8 to EP32 yields up to 1.8x output throughput per GPU at a fixed 100 tok/s/user SLA on the NVL72, with disaggregated serving and multi-token prediction.

NVIDIA, Wide Expert Parallelism on NVL72, January 2026.

DeepSeek-V3 runs DP32+EP32 across 32 GPUs for prefill (nine routed experts per GPU plus one shared, including one redundant) and DP144+EP144 across 144 GPUs for decode (about two routed experts per GPU plus one shared).

DeepSeek Open Source Week inference system overview (Day 6); CloudMatrix serving analysis (arXiv 2506.12708). The V3 technical report describes a different decode configuration (EP320, one expert per GPU).

NVFP4 holds accuracy within roughly one percent of the higher-precision baseline on large models via two-level scaling, and accuracy recovery is strongest on the largest dense and MoE models.

NVIDIA NVFP4 technical blogs; Red Hat AI NVFP4 evaluation.

A DeepSeek-V3 mixture-of-experts layer moves about 56 KB of dispatch (FP8, top-8) and 112 KB of combine (BF16, top-8) per token, roughly 168 KB total, against about 14 KB for a dense layer.

Derived from V3 geometry (hidden 7168, top-8, FP8 dispatch, BF16 combine). Excludes the shared expert and any local-rank optimization.
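The derivation behind those per-token figures is a few multiplications. This sketch reads "KB" as 1,024 bytes and treats the dense reference as a single BF16 hidden vector. Both readings are assumptions of the sketch, chosen because they reproduce the stated numbers.

```python
HIDDEN = 7168      # DeepSeek-V3 hidden dimension
TOP_K = 8          # routed experts per token
FP8_BYTES = 1
BF16_BYTES = 2
KIB = 1024

dispatch = HIDDEN * FP8_BYTES * TOP_K    # token activation sent to each routed expert
combine = HIDDEN * BF16_BYTES * TOP_K    # each routed expert's output sent back
dense = HIDDEN * BF16_BYTES              # assumption: one BF16 hidden vector

print(dispatch // KIB, combine // KIB, (dispatch + combine) // KIB, dense // KIB)
# prints: 56 112 168 14
```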

The vanilla-tensor-parallel and official-API reference points of roughly $1.00 per 1M output tokens are derived from the LMSYS statements (optimized $0.20 figure at one-fifth of API, and 5x over vanilla TP).

Derived from LMSYS 96-GPU report figures.

The H100 baseline in Figure 7 (6,883 prefill, 2,789 decode tokens/sec/GPU) is back-calculated from the reported 3.8x and 4.8x speedups, not independently measured.

Derived from LMSYS GB200 Part II reported multipliers.

Frontier-class inference pricing has fallen roughly fifty-fold from about $20 to about $0.40 per 1M tokens from late 2022 to early 2026, with MoE plus expert parallelism among four named drivers.

Public inference price trackers, 2022 to 2026. Order-of-magnitude trend across vendors, not a single price series.

Vera Rubin NVL72 is specified for roughly 3.6 TB/s per GPU and 260 TB/s aggregate, framed by NVIDIA as serving MoE all-to-all needs.

NVIDIA NVLink product page and CES 2026 disclosures; pre-release specification subject to change.

A = primary or measured | B = single strong vendor or operator source | C = derived by us from sourced inputs | D = directional, treat as trend not point estimate.
Character scan: this issue contains zero em dashes and zero en dashes, verified programmatically against the rendered text.

Bibliography

DeepSeek-AI. DeepEP: an efficient expert-parallel communication library. GitHub repository, 2025. Performance table, V2 release notes, decode and prefill kernel interfaces. github.com/deepseek-ai/DeepEP
DeepSeek-AI. Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures. arXiv 2505.09343, 2025. arxiv.org/abs/2505.09343
LMSYS Org. Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs. May 2025. lmsys.org/blog/2025-05-05-large-scale-ep
LMSYS Org. Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP, Part I: 2.7x Higher Decoding Throughput. June 2025. lmsys.org/blog/2025-06-16-gb200-part-1
LMSYS Org. Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP, Part II: 3.8x Prefill, 4.8x Decode Throughput. September 2025. lmsys.org/blog/2025-09-25-gb200-part-2
LMSYS Org. SGLang and NVIDIA Accelerating SemiAnalysis InferenceMAX and GB200 Together. October 2025. lmsys.org/blog/2025-10-14-sa-inference-max
NVIDIA. Scaling Large MoE Models with Wide Expert Parallelism on NVL72 Rack-Scale Systems. NVIDIA Technical Blog, January 2026. developer.nvidia.com/blog
NVIDIA. GB200 NVL72 product page and datasheet. 130 TB/s NVLink domain, 72-GPU rack specifications. nvidia.com/en-us/data-center/gb200-nvl72
NVIDIA. Inside NVIDIA Blackwell Ultra: The Chip Powering the AI Factory Era. NVIDIA Technical Blog, January 2026. developer.nvidia.com/blog
NVIDIA. Introducing NVFP4 for Efficient and Accurate Low-Precision Inference. NVIDIA Technical Blog, 2025. developer.nvidia.com/blog
NVIDIA. The Economic Value of Inference Software Optimization at the Datacenter Level. April 2026. Fivefold cost-per-token reduction via software. perspectives.nvidia.com
NVIDIA. Multi-Node NVLink Systems Tuning Guide and NVLink / NVLink Switch product documentation. Fifth-generation NVLink and NVSwitch specifications. docs.nvidia.com; nvidia.com/en-us/data-center/nvlink
Microsoft. Achieving Optimal Performance for DeepSeek Expert Parallelism (DeepEP) on Azure. Azure HPC Blog, May 2025. techcommunity.microsoft.com
AMD ROCm. The vLLM MoE Playbook: A Practical Guide to TP, DP, PP and Expert Parallelism. November 2025. rocm.blogs.amd.com
Taming the Titans: A Survey of Efficient LLM Inference Serving. arXiv 2504.19720, 2025. All-to-all as the MoE bottleneck; expert load balancing. arxiv.org/abs/2504.19720
Serving Large Language Models on Huawei CloudMatrix384. arXiv 2506.12708, 2025. DeepSeek DP32+EP32 prefill and DP144+EP144 decode topology. arxiv.org/abs/2506.12708
DeepSeek-AI. DeepSeek-V3/R1 Inference System Overview (Open Source Week, Day 6). February 2025. Production prefill EP32 (9 experts/GPU) and decode EP144 (2 experts/GPU) topology. github.com/deepseek-ai/open-infra-index
Introl. NVLink and Scale-Up Networking. 2026. Scale-up versus scale-out bandwidth ratio; NVL72 physical architecture. introl.com/blog
DigitalOcean. The LLM Inference Trilemma: Throughput, Latency, Cost. April 2026. MoE cost as a game of communication. digitalocean.com/blog
GPUnex. AI Inference Economics: The 1,000x Cost Collapse Reshaping GPUs. February 2026. Inference price trend and drivers. gpunex.com/blog
NVIDIA. NVLink and NVLink Switch, Vera Rubin NVL72 and NVLink 6 disclosures. CES 2026. 260 TB/s aggregate, MoE all-to-all framing. nvidia.com/en-us/data-center/nvlink

Decode Is Memory-Bound. Speculation Is the Arbitrage

Lorenzo Bradanini — Thu, 25 Jun 2026 10:19:56 GMT

Introduction

Rent a B200 for an hour and you are paying for roughly four and a half thousand trillion floating-point operations per second. Ask it to generate text from a seventy-billion-parameter model one user at a time, and for most of that hour the tensor cores do almost nothing.

The reason is not a bug, a bad kernel, or a scheduling failure. It is arithmetic. To produce a single token, the machine must read every weight in the model out of high-bandwidth memory, and reading is the slow part.

The multiply that follows the read is nearly free, and there is almost nothing to multiply, because a single decode step touches one token’s worth of activations against the entire weight matrix. You are paying for a fleet of trucks and using them to deliver one envelope per trip.

Put numbers on it. A seventy-billion-parameter model in the eight-bit precision typical of modern serving is seventy gigabytes of weights.

On a B200 with eight terabytes per second of memory bandwidth, sweeping those weights once takes just under nine milliseconds, and that single sweep yields exactly one token for one user.

The tensor cores that could have executed roughly forty trillion operations in that window execute about a hundred and forty billion, two for each weight read. The arithmetic intensity of single-stream decode, the ratio of compute performed to bytes moved, sits at roughly one to two floating-point operations per byte.

The hardware does not break even until that ratio reaches several hundred. The gap between those two numbers is the entire subject of this issue, because that gap is compute you have already paid for and are not using.
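The arithmetic can be laid out explicitly. This is a back-of-envelope sketch that counts only weight traffic and assumes two floating-point operations (one multiply, one add) per weight per generated token. That operation count is an assumption of the sketch. The model size, bandwidth, and peak throughput are the figures in the text.

```python
PARAMS = 70e9                 # seventy-billion-parameter model
BYTES_PER_PARAM = 1           # eight-bit weights
HBM_BYTES_PER_SEC = 8e12      # B200 memory bandwidth, from the text
PEAK_FLOPS = 4.5e15           # B200 peak throughput, from the text
FLOPS_PER_PARAM = 2           # assumption: one multiply and one add per weight

weight_bytes = PARAMS * BYTES_PER_PARAM
sweep_seconds = weight_bytes / HBM_BYTES_PER_SEC      # time to read every weight once
flops_done = PARAMS * FLOPS_PER_PARAM                 # work for one token, one user
flops_available = PEAK_FLOPS * sweep_seconds          # what the chip could do meanwhile
intensity = flops_done / weight_bytes                 # FLOP per byte moved
utilization = flops_done / flops_available

print(f"sweep: {sweep_seconds * 1e3:.2f} ms")
print(f"intensity: {intensity:.1f} FLOP/byte")
print(f"compute used: {utilization:.2%} of peak")
```

Under these assumptions the sweep takes 8.75 milliseconds, the intensity is two FLOP per byte, and well under one percent of the available compute is used.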

Speculative decoding is the one technique in the inference toolbox that spends that idle compute on tokens, and, in its exact formulations, does so without changing the distribution the model samples from. Every other lever trades something visible.

Quantization trades precision. Pruning trades capacity. Distillation trades the model itself for a different one. Speculation, done correctly, trades nothing the user can observe; it simply reorganizes when the weight reads happen so that one read can validate several tokens at once. That is what makes it unusual, and it is why every major laboratory shipped a version of it over the last eighteen months.

And yet the operator folklore says to turn it off above a certain batch size, and the operator folklore is correct, as far as it goes. The resolution of that apparent contradiction is the thesis of this piece.

The value of speculative decoding is not a number you can quote. It is a position on a plane whose axes are batch size and context length, measured against the ridge of the roofline. In one region it cuts your cost per token roughly in half.

In the adjacent region it raises your cost per token by a fifth. The technique never changed. The regime did. The job of this issue is to draw the plane, mark the line that divides it, and show why the workload that came to dominate 2026, long-form reasoning, walked straight into the half of the plane where speculation pays.

What speculation actually does, stated precisely

The mechanism is worth stating exactly, because almost every confusion about the economics traces back to a loose mental model of it.

A small, cheap model, the draft, proposes a short run of candidate tokens, say four or five of them, by generating them autoregressively in the ordinary way. Because the draft is small, those proposals are fast.

The large model, the target, then performs a single forward pass that scores all of the candidate positions at once. This is the move that matters: verifying four candidate tokens costs the target essentially one weight load, the same memory sweep it would have spent producing one token on its own, because the candidates are processed in parallel across the sequence dimension rather than one step at a time.

The target then walks the candidates left to right and applies a modified rejection-sampling test at each position.

It keeps the longest prefix of candidates that agrees with what it would have sampled itself, discards the first disagreement and everything after it, and emits one additional bonus token drawn from its own distribution at the point of divergence.

So a step that began with a draft of length K returns the number of accepted candidates, call it n, plus one. If the draft proposed five tokens and the target accepted three, the step produced four tokens for the price of one memory sweep. If the target accepted all five, it produced six. If it accepted none, it produced one, the bonus token, and you paid the draft’s cost for nothing.

This is the first thing the folklore gets right and the economics must respect: the speedup is governed by how many tokens the target accepts per step, and specifically by the accept length, the mean size of that accepted run plus the bonus.

It is not governed by the raw acceptance rate in isolation, and it is not governed by how clever the draft sounds. A draft that is right ninety percent of the time on the next token but falls apart by the third token buys you less than a draft that is right seventy percent of the time but stays coherent for five.

The lever is the length of the run, because each run, however long, costs exactly one expensive weight load of the target.
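The point about run length can be made concrete. In this sketch, each candidate position has its own probability of being accepted given that everything before it was accepted, and the expected yield of a step is one bonus token plus the expected accepted run. The two acceptance profiles are hypothetical numbers chosen to mirror the ninety-percent and seventy-percent drafts described above. They are not measurements.

```python
def expected_tokens_per_step(accept_probs):
    """Expected tokens emitted per verification step.

    accept_probs[i] is the probability that candidate i is accepted given
    that every earlier candidate was accepted. The step always emits one
    bonus token, so the result is 1 plus the expected accepted-run length.
    """
    expected = 1.0   # the bonus token
    survive = 1.0    # probability the run is still alive at this position
    for p in accept_probs:
        survive *= p
        expected += survive
    return expected


# Hypothetical profiles: strong on the first token but fading fast,
# versus weaker on the first token but coherent across the run.
sharp_then_fades = [0.9, 0.5, 0.2, 0.1, 0.05]
steady = [0.7] * 5

print(f"sharp then fades: {expected_tokens_per_step(sharp_then_fades):.2f} tokens/step")
print(f"steady:           {expected_tokens_per_step(steady):.2f} tokens/step")
```

With these illustrative numbers the steady draft yields about 2.94 tokens per target weight load against about 2.45 for the draft that starts stronger, which is the sense in which the length of the run, not the first-token hit rate, is the lever.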

The losslessness property

For the rejection-sampling formulations introduced by Leviathan and colleagues in 2023 and independently by Chen and colleagues the same year, the output distribution is provably identical to standard autoregressive sampling from the target.

The modified rejection test is constructed precisely so that the accepted-token statistics match the target’s own. EAGLE preserves this exactly, as Hugging Face’s engineering writeup states plainly. The user cannot tell, from the output alone, that speculation was used.
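A toy version of the exact rule shows why the statistics match. This sketch follows the rule as it is usually stated for these formulations: accept a drafted token with probability min(1, p/q), and on rejection resample from the normalized positive part of p minus q. The three-token vocabulary and both distributions are made up for illustration, and the sketch covers a single position, not a production verifier.

```python
import random


def verify_token(token, p_target, q_draft, rng):
    """Exact speculative-sampling test for one drafted token.

    p_target and q_draft map token -> probability at this position.
    Returns (accepted, emitted_token).
    """
    p = p_target.get(token, 0.0)
    q = q_draft[token]  # the draft sampled this token, so q > 0
    if rng.random() < min(1.0, p / q):
        return True, token
    # Rejected: resample from the normalized residual max(0, p - q).
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return False, rng.choices(tokens, weights=weights, k=1)[0]


def one_step(p_target, q_draft, rng):
    """Draft one token from q, verify it against p, return what is emitted."""
    tokens = list(q_draft)
    drafted = rng.choices(tokens, weights=[q_draft[t] for t in tokens], k=1)[0]
    _, emitted = verify_token(drafted, p_target, q_draft, rng)
    return emitted


# Hypothetical three-token vocabulary where draft and target disagree.
P_TARGET = {"a": 0.5, "b": 0.3, "c": 0.2}
Q_DRAFT = {"a": 0.2, "b": 0.5, "c": 0.3}

rng = random.Random(0)
N = 50_000
counts = {t: 0 for t in P_TARGET}
for _ in range(N):
    counts[one_step(P_TARGET, Q_DRAFT, rng)] += 1
print({t: round(c / N, 3) for t, c in counts.items()})
```

The emitted frequencies should land close to the target's distribution even though the draft's is quite different. Loosening the acceptance test, as the relaxed variants do, breaks that equality.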

That property deserves a caveat stated in the same breath, because vendors are not always careful about it. The losslessness holds for the exact rejection-sampling rule.

There are faster variants, relaxed acceptance, typical acceptance, and several aggressive tree-acceptance schemes, that raise the acceptance rate by loosening the test, and these do change the output distribution. They are often worth it.

But a quoted speedup that came from a relaxed acceptance rule is not the same artifact as a quoted speedup from exact rejection sampling, and an honest ledger keeps them in separate columns. When this issue later cites a four-times number, it will say which rule produced it.

The methods themselves form a clean lineage, and the direction of travel tells you what the field decided mattered. The original formulation used a separate draft model, a smaller member of the same family, which is simple but means maintaining and serving two models.

Medusa removed the second model by attaching several prediction heads to the target itself, each guessing a future position in parallel. EAGLE, in its first and second versions, moved the autoregression down a level, drafting in the target’s own feature space rather than in token space, which made the draft both cheaper and better aligned.

EAGLE-3, presented at NeurIPS 2025 and described in arXiv:2503.01840, pushed further: it fuses features from early, middle, and late layers of the target, predicts tokens directly rather than through an intermediate feature-regression step, removes a constraint that had limited how much training data helped, and uses a dynamic draft tree that expands the most promising candidates.

The endpoint of the lineage is to fold the draft into the target entirely, which is what DeepSeek’s multi-token prediction does, and which the next sections will show has economic consequences beyond mere convenience.

The roofline and the speculation budget

The previous issue, The Split and the Seam, derived the roofline for LLM serving in detail and split inference into its prefill and decode phases on exactly these grounds.

This issue assumes that derivation rather than repeating it, and reuses its house figures. The roofline says that for any kernel there is a ridge point, an arithmetic intensity above which you are limited by the chip’s compute throughput and below which you are limited by its memory bandwidth.

The ridge is simply peak compute divided by peak bandwidth. For an H100 SXM running FP8, that is one thousand nine hundred and seventy-nine teraFLOPS of dense tensor throughput against three and thirty-five hundredths terabytes per second of HBM3, which puts the ridge at five hundred and ninety-one FLOP per byte.

The H200 keeps the same compute but raises bandwidth to four and eight tenths terabytes per second, dropping the ridge to four hundred and twelve. A B200 at roughly four thousand five hundred teraFLOPS against eight terabytes per second sits near five hundred and sixty-two.

Single-stream decode operates at one to two FLOP per byte. Hold those two numbers next to each other. The operating point is two to nearly three orders of magnitude below the ridge.

That distance, expressed as a ratio, is the factor by which you could multiply the compute performed per byte moved before you would hit the compute ceiling and start paying for it in latency. Call it the speculation budget. On a single stream, across these three chips, it is somewhere between roughly two hundred and nearly six hundred times.

It is, very precisely, the ceiling on what any decode-side technique could reclaim from idle compute, and the headroom that speculative decoding draws on.
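The ridge points and the budget follow directly from the house figures. This sketch divides peak compute by peak bandwidth for each chip, then divides the ridge by the one-to-two FLOP per byte range quoted for single-stream decode. The dictionary layout and names are illustrative.

```python
# (peak FLOP/s, peak bytes/s) using the house figures quoted in the text.
CHIPS = {
    "H100 SXM": (1979e12, 3.35e12),
    "H200": (1979e12, 4.8e12),
    "B200": (4500e12, 8e12),
}
DECODE_INTENSITY_RANGE = (1.0, 2.0)  # FLOP per byte, single-stream decode


def ridge(peak_flops, bytes_per_sec):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_flops / bytes_per_sec


ridges = {name: ridge(*spec) for name, spec in CHIPS.items()}
for name, r in ridges.items():
    low_intensity, high_intensity = DECODE_INTENSITY_RANGE
    # The budget is largest when decode intensity is lowest.
    print(f"{name}: ridge {r:.0f} FLOP/byte, "
          f"budget {r / high_intensity:.0f}x to {r / low_intensity:.0f}x")
```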

Figure 1. The speculation budget is the vertical distance from the decode operating point to the roofline ridge. Single-stream decode runs two to three orders of magnitude below the point where compute becomes the limit, which is the headroom speculation cashes in. Ridge points use Issue 03 house figures: H100 SXM FP8 1,979 TFLOPS / 3.35 TB/s, H200 same compute / 4.8 TB/s, B200 ~4,500 TFLOPS / 8 TB/s.

A reader should immediately ask why, if the budget is several hundred times, speculation delivers only two or three.

The answer is that no single technique spends the whole budget, and speculation in particular spends only a sliver of it.

Its yield is capped by accept length: each verification step still costs one weight load and returns at most the accepted run plus a bonus, which in practice is two to five tokens, so the multiple is bounded there no matter how much idle compute waits unused.

The draft is not free either, and its own forward passes consume part of the budget before any of it reaches the output. The rest of the headroom is what batching claims, the other and larger way to raise arithmetic intensity, and whatever neither mechanism reaches simply sits idle under the latency ceiling.
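Both caps, accept length and draft cost, appear in the simple model that comes from the original speculative decoding analysis. The model assumes each candidate is accepted independently with the same probability alpha, that a verification pass costs the same as one ordinary target step (the memory-bound assumption), and that each draft token costs a fraction c of a target step. The parameter values below are hypothetical.

```python
def expected_tokens(alpha, gamma):
    """Expected tokens per step with i.i.d. acceptance alpha and draft length gamma."""
    if alpha == 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def walltime_speedup(alpha, gamma, c):
    """Speedup over plain decoding when each draft token costs c target steps."""
    return expected_tokens(alpha, gamma) / (gamma * c + 1)


# Hypothetical operating point: 80% acceptance, 5 drafted tokens,
# draft model at 5% of the target's per-step cost.
alpha, gamma, c = 0.8, 5, 0.05
print(f"tokens per step: {expected_tokens(alpha, gamma):.2f}")
print(f"speedup:         {walltime_speedup(alpha, gamma, c):.2f}x")

# However long the draft, tokens per step cannot exceed 1 / (1 - alpha).
print(f"ceiling on tokens per step: {1 / (1 - alpha):.1f}")
```

In this model the yield saturates at 1/(1 - alpha) tokens per step while the draft cost keeps growing with gamma, so the multiple stays in the low single digits however much idle compute sits under the ridge.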

So the budget is the size of the prize, not the size of the winnings. Speculation is the instrument that collects the part of it that batching cannot, which, as the rest of this issue argues, is exactly the part that matters when a latency SLA forbids batching in the first place.

This budget is not an accident of one chip generation. It is the accumulated result of a divergence that has run for a decade.

Across the span from V100 to B200, tensor compute throughput grew by roughly thirty-six times, while HBM bandwidth over the same generations grew by only about nine times, a gap documented in the systems literature and discussed at length in this publication’s earlier piece on the memory wall.

Compute outran memory by a factor of four across those generations, and every factor of that divergence widened the speculation budget, because it pushed the ridge further above the place where decode actually runs.

The technique gets structurally more attractive with each generation of hardware that improves compute faster than bandwidth, which is to say, with each generation.

The budget is real, and finite

The roofline guarantees the headroom exists on a single stream. It does not guarantee the headroom survives batching, or long context. The next two sections are the story of what spends the budget down, and they reach opposite conclusions depending on which axis you move along.

The framing to carry forward is that speculation is, mechanically, a way of converting roofline headroom into tokens. When the headroom is large, the conversion is cheap and the tokens are nearly free.

When the headroom has been consumed by something else, there is nothing left to convert, and the draft’s compute becomes pure overhead.

Everything downstream is a question about how much headroom is actually available in your serving regime, and the surprising part, the part the folklore half-misses, is that the answer depends on two independent variables, not one.

Why batching is said to kill it

Here is the story every production guide tells, and it is the right place to start because it is true within its assumptions. As you raise the batch size, packing more concurrent sequences into each forward pass, the target’s arithmetic intensity rises.

The reason is that the weights are read once per step regardless of how many sequences are in the batch, so the cost of that read amortizes across the batch. One sequence pays the full one hundred and forty gigabyte sweep for one token.

Thirty-two sequences split the same sweep across thirty-two tokens. The bytes-per-token falls, the FLOP-per-byte rises, and at some batch size the target crosses its ridge and becomes compute-bound.

Past that crossing, the free headroom is gone, because the compute is now the scarce resource, and the draft model’s extra forward passes are competing for it against real work.
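The amortization argument can be made concrete with a small sketch. It assumes the 140-gigabyte weight sweep quoted above and counts only weight traffic, ignoring the KV cache and activations. The helper name is hypothetical, and this illustrates the scaling rather than modeling performance.

```python
# Weight bytes read per generated token as the batch grows.
# Assumption: one full weight sweep per decode step, shared by every
# sequence in the batch; KV-cache and activation traffic are ignored.
WEIGHT_BYTES = 140e9  # the 140 GB sweep quoted in the text


def weight_bytes_per_token(batch_size: int, weight_bytes: float = WEIGHT_BYTES) -> float:
    """Each step emits one token per sequence, so the sweep is split batch_size ways."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return weight_bytes / batch_size


for batch in (1, 8, 32, 128):
    gb = weight_bytes_per_token(batch) / 1e9
    print(f"batch {batch:>3}: {gb:7.3f} GB of weights read per token")
```

The per-token weight bill falls as one over the batch, which is the whole of the conventional story. Nothing in this sketch yet accounts for memory traffic that does not shrink with batch.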

The crossing is commonly placed around a batch of thirty-two. Spheron’s production guide from March 2026 and E2E Networks’ engineering notes both put the practical break-even in that neighborhood, with the qualification that it moves with model size, quantization, and sequence length.

Below a draft acceptance of roughly one half, the guides agree, speculation hurts at any batch, because too few candidates survive verification to cover the draft’s cost.

The operational rule that falls out is blunt and widely repeated: disable speculation when batch sizes climb past the low tens, when outputs are short, when generation is high-entropy, or when you are memory-constrained on weights to the point that the draft displaces batch capacity.

Figure 1. The conventional envelope. As the batch fills, the spare compute that speculation feeds on disappears, and the speedup decays toward break-even near a batch of 32. Measured points are EAGLE 3.1 on Kimi-K2.6-NVFP4, vLLM tensor-parallel 4, GB200, SPEED-Bench, published by the vLLM team in May 2026: 2.03x at concurrency 1, 1.71x at 4, 1.66x at 16. The break-even location and the 0.5-acceptance floor are from E2E Networks and the Spheron production guide. The envelope is illustrative; the points are measured.

The measured points anchor the shape. EAGLE 3.1 was released jointly by the EAGLE, vLLM, and TorchSpec teams in May 2026 and benchmarked in the vLLM team’s own writeup, running Kimi-K2.6 in NVFP4 under vLLM with tensor parallelism of four on a GB200. On the SPEED-Bench suite it delivered a per-user throughput multiple of two and three hundredths at concurrency one, one and seventy-one hundredths at concurrency four, and one and sixty-six hundredths at concurrency sixteen.

The curve is unmistakable: the benefit is largest when the machine is emptiest, and it erodes as the batch fills. This is the empirical backbone of the folklore, and nothing in this issue disputes it on its own terms.

There is a sharper version of the same point that the practitioner Tian Pan has called the critical inversion. At low concurrency the draft runs in compute the target was wasting anyway, so it is free.

At high concurrency the draft’s forward passes contend with queued real requests for the same saturated compute, so the draft is no longer free; it is actively stealing throughput from work you could otherwise be doing.

Under that framing, speculation is fundamentally a low-concurrency latency optimization, and treating it as a throughput optimization at scale is a category error. This is good guidance. It is also, and this is the whole turn of the issue, an argument that silently assumes short context.

The amortization story is entirely about weights. It says the weight read, which dominates single-stream decode, gets cheaper per token as the batch grows. That is true. But the weight read is not the only thing decode reads from memory on every step, and the other thing it reads does not amortize across the batch at all.

The conventional wisdom is not wrong. It is two-thirds of a three-variable problem, and the missing variable is the one that 2026’s workloads turned up to eleven.

The KV cache re-opens the budget

Every token a transformer has already produced leaves behind a key and a value vector in every attention layer, and every future token must read all of them. That is the KV cache, and it is the second great memory cost of decode. Crucially, it behaves nothing like the weights.

The weights are shared across the batch, so their read amortizes. The KV cache is private to each sequence and grows with that sequence’s length, so its read scales with the batch size and with the context length simultaneously.

Doubling the batch does not split the KV read across more tokens; it doubles the total KV that must be read. Doubling the context length doubles it again.

The consequence is the result at the center of the MagicDec work, described in arXiv:2408.11049 and in Together AI’s analysis of it. There is a critical sequence length, call it S-star, beyond which the per-step KV read dominates the per-step weight read even at large batch.

Past S-star, decode is memory-bound again, not because the weights are unamortized, but because the KV cache is enormous and unamortizable. The free compute the conventional wisdom said batching had consumed comes back, because batching only consumed the part of the memory bill that the weights were responsible for. The KV part grew instead of shrinking.

This changes the geometry of the entire question. The compute-bound region is not the half-plane “batch greater than thirty-two.” It is a wedge: compute-bound requires high batch and short context, both at once. Move to small batch and you are memory-bound on weights. Move to long context and you are memory-bound on KV.

Only in the corner where the batch is large and the sequences are short does the target actually saturate its compute. Everywhere else, on a single stream, on long documents, on extended reasoning traces, the headroom is open and speculation has something to convert.

The short-context side of that corner has a hard edge worth naming. Because the KV cache caps how high arithmetic intensity can climb, there is a context length, roughly a thousand tokens on a B200 and closer to eleven hundred on an H200, past which no batch size reaches the compute-bound ridge at all.

That ceiling is the dashed wall in the diagram below, and the compute-bound wedge lives entirely to its left. Every reasoning trace sits far to its right.

Figure 2. The two-axis map, drawn from the roofline condition. Decode is compute-bound, where speculation taxes throughput, only in the small wedge at high batch and short context: it needs batch above roughly the ridge over two (around 250 for a B200) and context below the dashed wall. The wall sits where even infinite batch cannot lift arithmetic intensity to the ridge, at a sequence length of about 2C/R, near a thousand tokens on a B200 and eleven hundred on an H200. Everything else is memory-bound, where speculation pays. Interactive chat, single-user reasoning, and the long-document MagicDec regime all sit in speculation-pays territory; only short-prompt high-batch offline serving sits in the tax. With a speculative draft tree the effective batch reaches the wedge nearer nominal batch 32, which is the conventional break-even. House calculation; the KV-versus-weight crossover is a distinct curve, in Figure 3.

Where the line actually sits

The boundary is not abstract; you can locate it with the model’s own dimensions, and where it lands is the punchline. Take a seventy-billion-parameter model of the Llama-3-70B shape: eighty layers, grouped-query attention with eight key-value heads of head-dimension one hundred and twenty-eight.

The key-value cache that must be read per token is two vectors, key and value, times eight heads, times one hundred and twenty-eight dimensions, times eighty layers, which is one hundred and sixty-three thousand eight hundred and forty elements per token.

In a sixteen-bit KV cache that is about three tenths of a megabyte for every token already in the sequence, per sequence. The weights, in an eight-bit serving format, are seventy gigabytes, read once per step and shared across the whole batch.

The crossover, the point where the per-step key-value read equals the per-step weight read, is therefore where batch size times sequence length reaches roughly seventy gigabytes divided by three tenths of a megabyte, which is about two hundred and twenty thousand.

That locus, batch times sequence held constant, is a hyperbola: it is the line drawn in Figure 3 below, and it marks where the KV read overtakes the weight read, which is to say where adding more batch stops reducing the bytes paid per token.

At a batch of thirty-two it puts that amortization crossover near seven thousand tokens; at a batch of sixty-four, near three thousand five hundred; at a batch of one hundred and twenty-eight, near one thousand seven hundred.

An eight-bit key-value cache roughly doubles all of those. This crossover is a finer fact than the compute-bound wall of the previous figure, and the two should not be confused: the wall is the context length past which no batch reaches the ridge, while the crossover is the point at a given batch where batching has stopped buying amortization.
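The arithmetic of this section can be reproduced in a few lines. The sketch below uses only the architecture figures stated above (eighty layers, eight key-value heads, head dimension 128, seventy gigabytes of eight-bit weights). The function names are illustrative, and the model deliberately ignores everything except weight and KV reads.

```python
# KV-cache size per token and the weight/KV crossover for a Llama-3-70B-shaped model.
# All inputs are the figures stated in the text; helper names are illustrative.
LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128
WEIGHT_BYTES = 70e9  # 8-bit serving format


def kv_bytes_per_token(bytes_per_element: int) -> int:
    """Key and value vectors for every KV head in every layer."""
    elements = 2 * KV_HEADS * HEAD_DIM * LAYERS  # 163,840 elements
    return elements * bytes_per_element


def crossover_aggregate_tokens(bytes_per_element: int) -> float:
    """Batch x sequence length at which the per-step KV read equals the weight read."""
    return WEIGHT_BYTES / kv_bytes_per_token(bytes_per_element)


def crossover_context(batch_size: int, bytes_per_element: int = 2) -> float:
    """Context length at which a given batch reaches the crossover."""
    return crossover_aggregate_tokens(bytes_per_element) / batch_size


print(kv_bytes_per_token(2))                 # 327680 bytes, about a third of a megabyte
print(round(crossover_aggregate_tokens(2)))  # 213623
for batch in (32, 64, 128):
    print(batch, round(crossover_context(batch)))
```

The exact division lands a little under the rounded two hundred and twenty thousand used in the prose, because the prose rounds the per-token KV size down to three tenths of a megabyte. The per-batch context lengths shift by a few percent accordingly, and passing `bytes_per_element=1` doubles every crossover, as stated above.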

The reasoning workload clears both at once. Hold it against the MLPerf numbers: a mean output of three thousand eight hundred and eighty tokens, a maximum of twenty thousand, AIME traces running to twenty-three thousand.

Those sequences run far past the roughly one-thousand-token compute-bound wall, so no batch reaches the ridge, and at any batch an operator can realistically run under a latency SLA they are past the amortization crossover as well. This is not a near miss.

The workload that came to define 2026 lives deep in the memory-bound region by a wide margin, which is the entire reason speculation pays there. (Both lines are house order-of-magnitude calculations from the stated architecture, graded in the dossier; the precise coefficients move with KV precision, head count, ridge, and serving format, but the order of magnitude, and therefore the conclusion, is robust.)

Figure 3. The crossover, drawn. The weight read is flat because it amortizes across the batch; the KV read rises linearly with aggregate tokens in flight because it does not. For a dense-GQA 70B model the two meet near 220,000 aggregate tokens; a reasoning workload at batch 32 and 8,000 tokens of context already sits past it. Compressed attention moves the crossover, it does not remove it. House calculation; the order of magnitude is the point.

One honest qualification belongs here, because it is the first thing a careful reader will raise. The arithmetic above is for dense grouped-query attention, where the KV cache is large.

Architectures that compress the cache move the crossover to the right. DeepSeek’s Multi-head Latent Attention, by its hardware paper’s account, holds the KV cache to roughly seventy kilobytes per token, about a fifth of the dense-GQA figure, which pushes the crossover out toward a million aggregate tokens, the shallower line in the chart above.

DeepSeek-V4 goes further still: its model card reports that at a one-million-token context, V4-Pro spends about ten percent of V3.2’s KV cache and twenty-seven percent of its per-token compute, with V4-Flash at seven percent and ten percent, through a compressed sparse-attention stack.

This does not rescue the throughput regime. It relocates the line, and it does so precisely in service of making very long contexts affordable, which keeps sequences long, which keeps the budget open. The compression buys context length, and context length is what holds decode in the memory-bound region. The two facts point the same way.

MagicDec turns this into a working technique with one additional move: the draft itself must be light on KV, not just light on weights, or it reintroduces the very bottleneck it is trying to relieve.

With a draft that uses a fixed sparse or short-window KV footprint, MagicDec reports up to roughly two times on both throughput and latency together in the large-batch long-context regime on eight A100s, a regime where the conventional wisdom predicts speculation should be dead.

The reported draft-to-target memory ratio for a Llama-3.1-70B target with an eight-billion-parameter draft stays near four tenths and, importantly, stays constant as the batch grows, because the draft’s KV is bounded by design while the target’s KV grows.

That constancy is what keeps the draft cheap exactly where the conventional analysis assumed it would become expensive.

The corrected physics is therefore a single sentence with three clauses. Small batch is memory-bound because weights dominate. Long context is memory-bound because the KV read dominates.

Compute-bound is only the high-batch corner below the thousand-token wall, and that corner is smaller than the folklore implies. The “disable above batch thirty-two” rule is not wrong; it is a short-context rule wearing the costume of a general one.

And the moment your workload develops long sequences, whether from large documents or from long generations, the rule inverts, and speculation comes back to life precisely where you had been told to switch it off.

What actually shows on the ledger

Physics is not the bill. To get from the roofline to dollars, you have to know how the operator is allowed to set the batch size, and that is a question about service-level agreements, not about chips.

There are two serving regimes, and they read the same technique with opposite signs.

In throughput-maximizing service, the operator is free to batch all the way to the compute-bound point, because the only objective is cost per token and the way to minimize it is to amortize the weight read across as many sequences as possible. In that regime the machine is, by construction, saturated.

There is no idle compute. Speculation adds the draft’s forward passes to a chip that has nothing spare to run them in, so the cost per token rises. This is the regime the folklore is built for, and in it the folklore’s advice is exactly right.

In latency-capped service, the operator may not batch to the compute-bound point, because there is a ceiling on how long each token may take, and raising the batch raises per-token latency. The operator batches only until the latency SLA binds, and then stops, often well short of saturation.

The machine therefore runs with idle compute by design, not by accident, because the SLA forbids filling it. That idle compute is the speculation budget, and speculation converts it into tokens, cutting the cost per token. Same technique, opposite sign, and the only thing that changed was whether a latency ceiling capped the batch.

These two signs are not asserted; they fall out of a one-line cost model. Cost per token is the rental rate of the GPU divided by the tokens it delivers each second, so anything that multiplies throughput divides cost by the same factor.

In the latency-capped regime the wasted verification compute is free, because the chip sat idle under the SLA anyway, so throughput scales with the accept length discounted only by the draft’s own overhead. An accept length of about two and a half against a draft overhead near a fifth gives a throughput multiple close to two, which is the cost cut of roughly half that the chart shows.

In the throughput-maximized regime the chip has no spare compute, so the draft’s wasted work bites directly. If the draft proposes three tokens and two and a half clear verification on average, the target spends three positions of compute to yield two and a half tokens. That is a throughput multiple near five sixths, which is the cost rise of about a fifth that the chart shows.

The same two numbers, an accept length near two and a half and a draft length near three, generate both bars, and they are the same numbers behind the one-and-a-half to two-and-a-half times production speedups.

Push the draft length above the accept length and the throughput-regime penalty grows, which is precisely why over-drafting is the classic way to lose money on speculation in a saturated cluster.
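The cost model above fits in a few lines. The sketch below takes the accept length, draft length, and draft overhead quoted in the text as inputs. The function names are illustrative, and the two formulas are the simplifications described in the prose, not a measured serving model.

```python
# One-line cost model: cost per token scales inversely with throughput.
# Assumptions (from the text): accept length ~2.5, draft length ~3,
# draft overhead ~1/5 of a target step. Function names are illustrative.

def latency_capped_multiple(accept_length: float, draft_overhead: float) -> float:
    """Idle compute absorbs verification; only the draft's own cost discounts the gain."""
    return accept_length / (1.0 + draft_overhead)


def saturated_multiple(accept_length: float, draft_length: float) -> float:
    """No idle compute: the target pays for every drafted position it verifies."""
    return accept_length / draft_length


def cost_change(throughput_multiple: float) -> float:
    """Fractional change in cost per token for a given throughput multiple."""
    return 1.0 / throughput_multiple - 1.0


capped = latency_capped_multiple(2.5, 0.2)
saturated = saturated_multiple(2.5, 3.0)
print(f"latency-capped: {capped:.2f}x throughput, cost {cost_change(capped):+.0%}")
print(f"saturated:      {saturated:.2f}x throughput, cost {cost_change(saturated):+.0%}")
```

With these round inputs the model gives a cost cut of about 52 percent and a cost rise of about 20 percent, within a few points of the chart's figures. The small differences presumably reflect rounding of the inputs. Raising `draft_length` while holding `accept_length` fixed makes the saturated multiple fall further, which is the over-drafting penalty described above.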

Figure 4. The sign of the ledger is set by the regime. Under a latency SLA that caps the batch below saturation, speculation converts idle compute into tokens and cuts cost per million output tokens (here roughly -48%). In throughput-maximizing service batched to the compute-bound point, there is no idle compute for the draft to use, and the draft’s overhead raises cost (here roughly +19%). Both magnitudes follow from the cost model in the text (accept length near 2.5, draft length near 3) and are consistent with the production speedups in Figure 7; the absolute dollar levels still move with rate, model, and quantization, anchored here to 2026 neocloud figures (Spheron, getdeploying).

The reason latency-capped service is the common case in 2026, rather than a corner case, is written directly into the benchmark SLAs.

MLPerf Inference v5.1, published by MLCommons in September 2025, sets two ninety-ninth-percentile thresholds for its DeepSeek-R1 reasoning workload: two seconds for time-to-first-token and eighty milliseconds for time-per-output-token. The workload has a mean input of around eight hundred tokens, a mean output of three thousand eight hundred and eighty, and a maximum output of twenty thousand, the highest the benchmark has ever specified.

An eighty-millisecond ceiling on per-token latency, applied to sequences thousands of tokens long, caps the batch far below the compute-bound point, because long sequences mean large KV reads and large KV reads mean each added unit of batch costs latency you do not have.

The SLA traps the GPU in the memory-bound regime. The trap is the opportunity: a memory-bound GPU has idle compute, and idle compute is what speculation eats.

The benchmark concedes the point

The argument stops being a thesis and becomes a rule when the benchmark authority writes it into the rules. In March 2026, MLPerf Inference v6.0 added an interactive reasoning scenario for DeepSeek-R1 with the ceiling pulled tighter still, a 1.5-second TTFT and a 15-millisecond TPOT at the ninety-ninth percentile.

To make that scenario achievable at all, MLCommons mandates speculative decoding for it: implementations must run the official DeepSeek-R1 MTP head with EAGLE-style decoding.

The independent body that defines how inference is measured decided that, past a certain latency target on reasoning traffic, speculation is not an optional optimization but a requirement of entry.

The dollar figures that frame the chart are anchored to 2026 market rates and published per-token costs, kept deliberately conservative. Neocloud H100 capacity runs around two dollars an hour, with Spheron listing two dollars and one cent; B200 on-demand sits in the five-to-six-dollar range across getdeploying and aimultiple’s trackers.

Published serving costs land near forty-two cents per million tokens on a B200 and forty-seven cents on an H100 PCIe. The point of the chart is not to nail a single deployment’s economics to the cent, which would be dishonest given how much rate, model, and quantization move the number.

The point is the asymmetry: the same forty-something cents per million can become a discount or a surcharge depending solely on which side of the saturation line your SLA puts you.

The most honest evidence for this whole framing comes, unexpectedly, from the vendor with the most incentive to claim an unqualified win. DeepSeek’s hardware paper, arXiv:2505.09343, states that its multi-token prediction module can slightly hurt raw throughput while significantly improving end-to-end generation latency.

Read that again in the context of the two regimes. DeepSeek is reporting, in print, that in a throughput accounting MTP can cost a little, and in a latency accounting it helps a lot, and that they ship it because latency is the product.

They add a second-order point that sharpens it further: MTP raises the effective batch size, which in their mixture-of-experts architecture increases expert-parallel arithmetic intensity, partially offsetting the throughput cost.

A company could have quoted the latency win alone and called it a free lunch. Instead they documented the tradeoff in both directions, which is precisely the shape of the real ledger this issue is arguing for.

When the vendor with the strongest incentive to claim a pure throughput win instead publishes that the technique can slightly hurt throughput while significantly improving latency, that is not a weakness in the technique.

It is the ledger showing its true two-sided shape, in the vendor’s own numbers.

Why reasoning moved the bill into decode

Cost per token has a denominator, and the denominator is dominated by decode steps, because prefill is a single parallel pass over the prompt while decode is a long sequence of memory-bound steps, one per output token.

Anything that multiplies the number of output tokens multiplies the share of the bill that lives in decode, which is exactly the share speculation can attack. This is why 2026 is a different economic environment for speculative decoding than 2023 was, even though the technique is largely the same. The traffic changed.

Reasoning models emit output on a different scale entirely. A conventional chat reply is a few hundred tokens. A reasoning trace runs to thousands, and the trend within the model generation has been sharply upward: BentoML’s deployment guide notes that DeepSeek-R1-0528 nearly doubled its reasoning length over the prior R1, from around twelve thousand to around twenty-three thousand tokens on a single hard math question.

MLPerf’s DeepSeek-R1 workload puts the mean output at three thousand eight hundred and eighty and the maximum at twenty thousand. Agentic systems then chain many such traces into a single user-visible task, so the effective output length per task can be larger still.

The bill, which used to be split between a substantial prefill and a modest decode, has tilted hard toward decode.

Figure 5. Reasoning moved the bill into decode. A reasoning request emits one to two orders of magnitude more output tokens than a chat reply, and every one of them is a memory-bound decode step that must clear the latency SLA. MLPerf Inference v5.1 DeepSeek-R1 (MLCommons, September 2025) reports a mean output of 3,880 tokens and a maximum of 20,000; R1 and R1-0528 AIME usage runs from roughly 12,000 to 23,000 tokens per question (BentoML guide). The chat baseline is a round-number reference.

Two facts about reasoning traffic place it squarely in the regime where speculation pays. The first is the one just shown: it is decode-heavy, so the part of the bill speculation can lower is the dominant part. The second is subtler and follows from the previous sections.

Long traces mean long sequences in flight, which means large KV reads, which means decode is memory-bound even when the operator manages to batch, both because the latency SLA caps the batch and because the KV cost re-opens the budget past S-star.

The two mechanisms reinforce each other. The workload is in the memory-bound regime by virtue of its output length, and it is held there by virtue of its latency SLA. There is idle compute on the machine for both reasons at once, and speculation is the technique that turns idle compute into tokens.

The architectural direction of travel keeps the budget open rather than closing it. DeepSeek-V4’s sparse-attention work, with V4-Pro reportedly using about twenty-seven percent of the FLOPs and ten percent of the KV of V3.2 at a one-million-token context through DeepSeek Sparse Attention, is explicitly aimed at making very long contexts affordable.

Cheaper long context means more long context, which means more memory-bound decode, which means a wider speculation budget, not a narrower one. The hardware trend widens the budget by improving compute faster than bandwidth; the model trend widens it by pushing context length up. Both vectors point the same way.

And this is where accept length, the lever from the first section, cashes in directly against the denominator. The decode discount is, to first order, the accept length: a method that lands two and a half accepted tokens per step is doing roughly two and a half times the decode work per expensive weight load.

The measured accept lengths of the shipped 2026 methods, two and fifty-five hundredths for DeepSeek-V3.2’s MTP, two and seventy-six hundredths for GLM-5’s shared-MTP design, and four and seven tenths for EAGLE-3 on coding and reasoning workloads, are therefore not abstract quality scores.

They are multipliers on the largest line item in the reasoning-era bill.
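To see what those multipliers mean for a single trace, the sketch below counts the target weight loads needed to emit the MLPerf mean output at each reported accept length. It assumes that accept length means tokens emitted per target step and that it stays constant across the trace, which real workloads will not do exactly. The function name is illustrative.

```python
import math

# Target decode steps needed for one reasoning trace at a given accept length.
# Assumption: "accept length" means tokens emitted per target step (accepted run
# plus the bonus token), and it stays constant over the whole trace.
MEAN_OUTPUT_TOKENS = 3880  # MLPerf DeepSeek-R1 mean output quoted in the text


def target_steps(output_tokens: int, accept_length: float) -> int:
    """Number of expensive target weight loads needed to emit output_tokens."""
    if accept_length < 1.0:
        raise ValueError("each target step emits at least one token")
    return math.ceil(output_tokens / accept_length)


methods = {
    "no speculation": 1.0,
    "DeepSeek-V3.2 MTP": 2.55,
    "GLM-5 shared MTP": 2.76,
    "EAGLE-3 (coding/reasoning)": 4.7,
}
for name, accept in methods.items():
    print(f"{name:<28} {target_steps(MEAN_OUTPUT_TOKENS, accept):>5} target steps")
```

Under these assumptions the mean trace drops from 3,880 weight loads to roughly 1,500, 1,400, and 830 respectively. This counts target steps only, so the draft's own cost still has to be netted out as in the cost model earlier.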

Figure 6. The lever is accept length, not raw acceptance. Tokens produced per step equal the accepted run plus one bonus token, and each step costs exactly one target weight load, so accept length is the decode discount. Reported figures: DeepSeek-V3.2 MTP 2.55 and GLM-5 shared-MTP 2.76 from the GLM-5 technical report (arXiv:2602.15763); EAGLE-3 4.5 to 5.0 on coding and reasoning from E2E Networks. The vanilla-draft figure is a representative reference.

Where speculation loses

A technique that only ever helps does not need an issue written about when to use it. Speculation has real failure modes, and an analysis that buries them is worth less than one that lists them, so here is the full debit column, without hedging.

The throughput regime. In saturated high-batch short-context serving, the offline-batch corner of the phase diagram, speculation is a tax and should be disabled. The compute is fully employed, the draft has nothing free to run in, and its forward passes displace real work. The Spheron and Tian Pan guidance is correct here without qualification. If your job is to push the maximum number of short completions through a fleet of GPUs at minimum cost per token, speculation is the wrong lever.

Acceptance collapse. Below roughly one-half acceptance, speculation hurts at any batch, because too few candidates survive to cover the draft’s cost. Acceptance is not a constant; it falls with high sampling temperature, with out-of-distribution inputs the draft was never trained on, and with the kind of high-entropy generation where the next token is genuinely uncertain. A draft trained against one target distribution and then serving a drifted or fine-tuned target degrades silently, the acceptance rate sliding without any error being raised. Monitoring accept length in production is not optional; it is the only way to notice that your discount has quietly become a surcharge.
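A minimal sketch of that monitoring, assuming the serving stack can report how many tokens each verification step emitted. The class name, window size, and alert floor are all hypothetical choices for illustration, not defaults of any serving framework.

```python
from collections import deque


class AcceptLengthMonitor:
    """Rolling mean of tokens emitted per target step, with a simple alert floor.

    Hypothetical helper: the window size and floor are illustrative defaults,
    not values recommended by any serving framework.
    """

    def __init__(self, window: int = 1000, floor: float = 1.5) -> None:
        self.samples = deque(maxlen=window)  # oldest samples fall off automatically
        self.floor = floor

    def record(self, tokens_emitted: int) -> None:
        """Call once per verification step with the tokens that step produced."""
        self.samples.append(tokens_emitted)

    def mean(self) -> float:
        """Rolling mean accept length, or 0.0 before any step is recorded."""
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def degraded(self) -> bool:
        """True once the window is full and the rolling mean sits below the floor."""
        return len(self.samples) == self.samples.maxlen and self.mean() < self.floor


monitor = AcceptLengthMonitor(window=4, floor=1.5)
for tokens in (2, 1, 1, 1):
    monitor.record(tokens)
print(monitor.mean(), monitor.degraded())
```

The floor should be set from the deployment's own break-even accept length, which depends on draft cost and regime. The value here is a placeholder.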

VRAM pressure. The draft model and its KV cache occupy memory you could otherwise spend on a larger batch or a longer context. A Llama-3.3-70B target in FP8 alongside a one-billion-parameter draft consumes roughly seventy-five to seventy-eight gigabytes on an eighty-gigabyte H100, per Spheron’s figures, leaving very little headroom. On a memory-constrained deployment, the draft can cost you more in lost batch capacity than it returns in accept length, and that tradeoff has to be measured, not assumed.

No help for time-to-first-token. Speculation accelerates decode, and only decode. It does nothing for prefill, which means it does nothing for time-to-first-token. Under the MLPerf two-second TTFT ceiling, that is a separate problem requiring separate techniques, which is precisely the prefill-decode disaggregation argument of Issue 03. Speculation and prefill optimization are complementary, not substitutes, and a serving stack that needs both will not get one from the other.

Draft maintenance. A separate draft model is a second training, evaluation, and deployment surface that must be kept aligned as the target evolves. Every target update risks degrading a draft that was tuned against the previous version. EAGLE-style heads and built-in MTP layers reduce this by coupling the draft to the target’s own features or parameters, but they do not eliminate the obligation to retrain and revalidate. GLM-5’s choice to share parameters across three MTP layers, described in arXiv:2602.15763, is partly an answer to exactly this maintenance cost: fewer independent parameters to train and keep aligned.

The losslessness caveat, restated. The provable equivalence to standard sampling holds for the exact rejection-sampling rule. Relaxed acceptance, typical acceptance, and aggressive tree-acceptance schemes raise throughput by changing the output distribution. They are frequently worth it, but a four-times figure obtained under a relaxed rule is not interchangeable with a four-times figure under exact sampling, and a serving team quoting a speedup owes itself, and its users, clarity about which rule produced it.

Draft-length tuning. The number of tokens the draft proposes per step, often written gamma, is a workload-dependent knob with a real optimum. Set it too long and the draft burns compute generating candidates that will be rejected; set it too short and you leave accept length on the table. The optimum moves with acceptance rate and with batch size, so a value tuned on one workload can be wrong on another, and dynamic schemes that adjust it per request exist precisely because no single value is right everywhere.
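Why gamma has an optimum can be seen in a simplified model. The sketch below assumes, beyond anything stated in this issue, that each drafted token is accepted independently with a fixed probability `alpha`, that each draft pass costs a fixed fraction `draft_cost` of a target step, and that verification runs on otherwise idle compute. The names and the example cost of 0.05 are illustrative.

```python
# Expected speedup versus draft length (gamma) under a simplified model.
# Assumptions, none taken from the text: each drafted token is accepted
# independently with probability alpha, a draft pass costs draft_cost target
# steps, and verification rides on idle compute (the latency-capped regime).

def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Accepted run plus bonus token: 1 + alpha + ... + alpha**gamma."""
    return sum(alpha ** k for k in range(gamma + 1))


def speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Tokens per step divided by the step's cost in target-step units."""
    return expected_tokens_per_step(alpha, gamma) / (1.0 + gamma * draft_cost)


def best_gamma(alpha: float, draft_cost: float, max_gamma: int = 16) -> int:
    """Draft length that maximizes the modeled speedup."""
    return max(range(1, max_gamma + 1), key=lambda g: speedup(alpha, g, draft_cost))


for alpha in (0.5, 0.7, 0.9):
    g = best_gamma(alpha, draft_cost=0.05)
    print(f"alpha={alpha}: best gamma={g}, modeled speedup={speedup(alpha, g, 0.05):.2f}x")
```

In this model the best draft length grows as acceptance rises, which is why a value tuned on a high-acceptance workload over-drafts on a low-acceptance one. Real acceptance is neither independent nor constant across positions, so the output is a starting point for measurement rather than a setting.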

The lab-versus-production gap. The EAGLE-3 paper reports speedups of up to six and a half times, but those are temperature-zero academic measurements on Vicuna-13B, Llama-3.1-8B, and Llama-3.3-70B. Production reports cluster far lower, between roughly one and a half and three times: LMSYS and Vertex describe two-to-three-times figures for EAGLE-3 on SGLang, E2E Networks reports two and three tenths on Llama-3.1-8B at a batch of four, and a Gemma-4 EAGLE3 draft head is documented at one and seventy-two hundredths at batch one on conversational traffic. The gap between the lab number and the production number is itself one of the most important facts in this space, because quoting the former as if it were the latter is the single most common honesty failure in vendor material on speculative decoding.

Figure 7. The headline academic figure (6.5x, EAGLE-3 paper, temperature 0) sits far above the production cluster, which lands between roughly 1.66x and 2.3x across the EAGLE 3.1 vLLM benchmark, DeepSeek’s vendor-reported MTP TPS, a Gemma-4 EAGLE3 draft head, and E2E Networks. The bars use different targets and conditions and are not strictly comparable; they are shown to convey the range, and the distance between the gold bar and the teal cluster is the point.

A decision on the plane

The verdict is not a yes or a no. It is a lookup. Given a workload’s batch size, its context length, its latency SLA, and its measured acceptance, the phase diagram tells you which regime you are in, and the regime tells you the sign of the ledger. The table below collapses the analysis into that lookup.

State the thesis cleanly now that the machinery is in place. Speculative decoding is the only inference technique that converts decode’s idle compute into tokens losslessly, and the sign of its effect on your bill is set entirely by whether that compute was actually idle.

Whether it was idle is a question about the roofline, and where you sit on the roofline is a question about batch size and context length, the two axes of the phase diagram.

There is no universal answer because the inputs are not universal. There is a correct answer for every point on the plane, and the table is how you read it.
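The lookup can also be sketched as code. This is not a reproduction of the issue's table: it is a hypothetical function that encodes the rules stated in the prose, using the text's round-number thresholds (an acceptance floor of one half, a conventional break-even near batch thirty-two, and a compute-bound wall near a thousand tokens). Those thresholds move with model, hardware, and quantization.

```python
# Regime lookup distilled from the rules stated in this issue.
# Thresholds are the text's round numbers (acceptance floor 0.5, conventional
# break-even near batch 32, compute-bound wall near 1,000 tokens); they shift
# with model, hardware, and quantization, so treat them as placeholders.

def speculation_verdict(
    batch_size: int,
    context_tokens: int,
    acceptance: float,
    latency_capped: bool,
    *,
    acceptance_floor: float = 0.5,
    batch_break_even: int = 32,
    context_wall: int = 1000,
) -> str:
    """Return 'disable', 'pays', or 'tax' for a serving point on the batch/context plane."""
    if acceptance < acceptance_floor:
        return "disable"  # too few drafted tokens survive verification
    if context_tokens > context_wall:
        return "pays"     # memory-bound on the KV cache at any batch
    if batch_size < batch_break_even or latency_capped:
        return "pays"     # memory-bound on weights, or the SLA keeps compute idle
    return "tax"          # saturated high-batch, short-context corner


print(speculation_verdict(1, 500, 0.7, latency_capped=True))     # pays
print(speculation_verdict(256, 400, 0.7, latency_capped=False))  # tax
print(speculation_verdict(32, 8000, 0.7, latency_capped=True))   # pays
print(speculation_verdict(8, 2000, 0.3, latency_capped=True))    # disable
```

The order of the checks carries the argument: measured acceptance gates everything, context length is checked before batch size, and only the remaining high-batch, short-context, uncapped corner returns a tax.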

The reason this matters more now than it did three years ago is that the median workload moved. In 2023 the prototypical request was a short chat completion at modest context, which lives near the compute-bound corner once you batch it, where speculation is at best neutral.

In 2026 the prototypical high-value request is a long reasoning trace under a tight per-token latency SLA, which lives deep in the memory-bound region for two independent reasons, its output length and its SLA, and which is therefore exactly where speculation pays. The technique did not move toward the workload. The workload moved toward the technique.

That migration is why the shipping decisions of the major laboratories converged. DeepSeek built multi-token prediction into V3 and carried it through V3.2, documenting the latency win and the throughput cost honestly.

GLM-5 shipped a shared-parameter three-layer MTP design with a measured accept length near two and three-quarters. NVIDIA’s NeMo RL work applied EAGLE-3 to reinforcement-learning rollouts and reported a one-and-eight-tenths-times generation speedup at the eight-billion scale, with validation accuracy on AIME-2024 evolving identically under autoregressive and speculative decoding, a clean confirmation that the lossless guarantee holds across training.

EAGLE-3 landed across vLLM, SGLang, and TensorRT-LLM, the three serving stacks that matter. These are not independent fashions. They are the same bet, placed by everyone who looked at the same plane and saw that reasoning traffic had walked into the half where the ledger reads in your favor.

The technique never changed. The regime did. Speculation lowers your cost per token exactly where batching cannot help you, and reasoning is the workload that made that region the center of the map.

Confidence tiers and external-audit read

Every load-bearing claim in this issue is scored below against a four-tier confidence scale, with its source named inline.

The scale is applied as an external auditor would apply it, crediting primary and measured sources, discounting derived and illustrative ones, and flagging the weakest links explicitly rather than hiding them in the prose.

Tier A, primary or measured: peer-reviewed papers, vendor hardware disclosures, MLPerf-published SLAs and benchmark statistics.
Tier B, secondary with method: vendor or practitioner reports that state their configuration and measurement conditions.
Tier C, derived or stylized: house figures and curves built from the cited physics, presented as illustrative renderings, not measurements.
Tier D, illustrative or round-number: reference values chosen for scale, not claimed as measured.

Exact rejection-sampling speculative decoding is output-distribution lossless.

Leviathan et al. 2023; Chen et al. 2023; EAGLE losslessness per Hugging Face engineering writeup. Provable equivalence to standard sampling under the exact rule.

EAGLE-3 mechanism: training-time test, multi-layer feature fusion, direct token prediction, dynamic draft tree.

EAGLE-3 paper, NeurIPS 2025, arXiv:2503.01840. Headline lab speedups up to 6.5x at temperature 0 on Vicuna-13B, Llama-3.1-8B, Llama-3.3-70B (explicitly a best-case lab figure; production lands far lower, see Figure 8).

DeepSeek MTP slightly hurts throughput while significantly improving end-to-end latency; raises effective batch and expert-parallel intensity.

DeepSeek hardware paper, arXiv:2505.09343. The two-sided tradeoff is stated in the vendor’s own text, which is the strongest single piece of evidence in this issue.

MLPerf DeepSeek-R1 SLAs: TTFT 99p 2s, TPOT 99p 80ms; mean input 800, mean output 3,880, max output 20,000.

MLPerf Inference v5.1, MLCommons, September 2025. These SLAs are the anchor for why latency-capped serving traps the GPU in the memory-bound regime.

MLPerf Inference v6.0 added a DeepSeek-R1 interactive scenario (TTFT 1.5s, TPOT 15ms) and mandates speculative decoding (official MTP head, EAGLE-style) to meet it.

MLCommons, March 2026. The benchmark authority requiring speculation for tight-latency reasoning is the strongest external corroboration of the thesis.

NVIDIA NeMo RL: EAGLE-3 gives ~1.8x rollout generation speedup at 8B; AIME-2024 accuracy identical under autoregressive and speculative decoding throughout training.

NVIDIA NeMo RL research, May 2026. Independent empirical confirmation that the lossless guarantee holds in practice across training.

Critical-sequence-length result: past S*, decode is memory-bound even at large batch via KV read; KV-light draft delivers up to ~2x throughput and latency.

MagicDec, arXiv:2408.11049; Together AI long-context analysis. Draft-to-target memory ratio ~0.4, constant at large batch, for Llama-3.1-70B with 8B draft.

Roofline ridge points: H100 SXM FP8 591, H200 412, B200 ~562 FLOP/byte; from V100 to B200, compute grew ~36x while bandwidth grew ~9x.

Carried from Issue 03, The Split and the Seam, and from the memory-wall issue, The Wall and the Stack. Hardware specifications and systems-literature divergence figures.

DeepSeek-V4 sparse attention: V4-Pro 27% FLOPs and 10% KV of V3.2 at 1M context; V4-Flash 10% and 7%.

Official DeepSeek-V4-Pro / V4-Flash model cards (Compressed Sparse Attention + Heavily Compressed Attention). Verified figures. Direction-of-travel evidence that long context is getting cheaper, widening the budget.

EAGLE 3.1 per-user throughput: 2.03x at concurrency 1, 1.71x at 4, 1.66x at 16.

Primary: vLLM team blog, May 2026 (EAGLE / vLLM / TorchSpec joint release). Kimi-K2.6-NVFP4, tensor-parallel 4, GB200, non-disagg, SPEED-Bench coding. Verified against the primary engineering writeup.

Practical break-even near batch 32; below ~0.5 acceptance speculation hurts at any batch.

Spheron production guide, March 2026; E2E Networks. Practitioner guidance with stated qualifications on model, quantization, and sequence length.

The lab-versus-production speedup comparison in Figure 8 (6.5x lab vs a 1.66 to 2.3x production cluster).

Compiled from verified primary and vendor sources (EAGLE-3 paper, EAGLE 3.1 vLLM blog, DeepSeek hardware paper, Gemma-4 EAGLE3 card, E2E Networks). Bars use different targets and conditions and are not strictly comparable; shown to convey range, not to rank.

Production speedups cluster at 2 to 3x; E2E reports 2.3x on Llama-3.1-8B at batch 4, accept length 4.5 to 5.0.

LMSYS and Vertex on SGLang; E2E Networks. Multiple independent practitioner reports converging on the same range.

Accept lengths: DeepSeek-V3.2 MTP 2.55, GLM-5 shared-MTP 2.76; GLM-5 shares parameters across 3 MTP layers.

GLM-5 technical report, arXiv:2602.15763. EAGLE-3 accept length 4.5 to 5.0 from E2E Networks.

2026 GPU rates and per-token costs: H100 ~$2/hr, B200 ~$5 to $6/hr on-demand; ~$0.42/M (B200), ~$0.47/M (H100 PCIe).

Spheron ($2.01/hr H100), getdeploying, aimultiple. Market trackers; rates move with provider, commitment, and region.

Reasoning length growth: R1-0528 nearly doubled to ~23K tokens per AIME question vs ~12K for prior R1.

BentoML DeepSeek deployment guide, 2026. R1 token pricing ~$0.55/M in, ~$2.19/M out; output billed higher and dominates cost.

The KV-versus-weight amortization crossover (Figure 4): batch x sequence near 220,000 for a 70B model, where the KV read overtakes the weight read (~7,000 tokens at batch 32, ~1,750 at batch 128).

House order-of-magnitude calculation from Llama-3-70B architecture (80 layers, 8 KV heads, head-dim 128) and the weight-versus-KV read balance, drawn in Figure 4. The coefficient moves with KV precision and serving format; the order of magnitude, and the conclusion that reasoning traces sit past the crossover, is robust.

The compute-bound boundary and ~1,000-token wall in Figure 3.

Derived, not stylized: the boundary is the roofline condition AI = 2B/(1+B*S/C) exceeding the ridge, with the wall at S = 2C/R (C ~ 220,000; R = 562 FP8 for B200). A house calculation with standard simplifying assumptions (GEMM-dominated FLOPs, dense GQA KV); exact coordinates shift with precision and ridge, the structure does not.

The -48% / +19% cost magnitudes in Figure 5.

Derived from the cost model in the text (accept length ~2.5, draft length ~3, draft overhead ~0.2) and consistent with the production speedups in Figure 8. Absolute dollar levels still vary with rate, model, and quantization; the regime-dependent sign and rough magnitude are the load-bearing claim.

The shape of the speedup-decay envelope in Figure 2.

House curve illustrating the conventional decay toward break-even; the measured EAGLE 3.1 points on it are Tier A and the break-even location is sourced, but the connecting envelope is schematic, not fitted.

The ~300-token chat-reply baseline and the vanilla-draft 2.1 accept-length reference.

Round-number references chosen to set scale against the measured reasoning and method figures, not claimed as measured values.
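
The two house calculations scored above, the KV-versus-weight crossover and the roofline wall, can be reproduced in a few lines. What follows is a minimal sketch, not the issue's own worksheet: it assumes FP8 weights (one byte per parameter) and an FP16 KV cache (two bytes per element), a precision pairing chosen here only because it lands near the quoted coefficient, and the function names are illustrative.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    # Keys and values are both cached, hence the leading factor of 2.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def crossover_tokens(params, weight_bytes_per_param, kv_per_token):
    # Batch x sequence product at which the KV read equals the weight read.
    return params * weight_bytes_per_param / kv_per_token

def arithmetic_intensity(batch, seq, c):
    # Roofline condition quoted in the table: AI = 2B / (1 + B*S/C).
    return 2 * batch / (1 + batch * seq / c)

def wall_tokens(c, ridge):
    # As batch grows, AI tends to 2C/S, so the ridge is out of reach past S = 2C/R.
    return 2 * c / ridge

# Llama-3-70B shape from the table; precisions are assumptions (see lead-in).
kv = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)
c = crossover_tokens(params=70e9, weight_bytes_per_param=1, kv_per_token=kv)

print(f"KV per token: {kv} bytes")
print(f"crossover C: {c:,.0f} tokens in flight")
for batch in (32, 128):
    print(f"batch {batch}: KV read overtakes weights near {c / batch:,.0f} tokens")
print(f"wall at ridge 562: {wall_tokens(c, 562):,.0f} tokens")
```

Under these assumptions the coefficient comes out a little under 214,000, the per-batch crossovers land near 6,700 and 1,700 tokens, and the wall sits under 800 tokens, all inside the order-of-magnitude tolerance the table claims for itself. Change the KV precision or the ridge and the coordinates move; the structure does not.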

External-audit simulation

Audited as a whole, the issue rests on a spine of Tier A primary sources, and every load-bearing empirical claim in it was checked against its primary source: the EAGLE-3 paper (arXiv:2503.01840), MagicDec (arXiv:2408.11049), the DeepSeek hardware disclosure (arXiv:2505.09343), the GLM-5 report (arXiv:2602.15763), the MLPerf v5.1 and v6.0 specifications, the DeepSeek-V4 model cards, and the EAGLE 3.1 vLLM release each confirmed the figures attributed to them.

The central argument, that the sign of the ledger is set by serving regime and that reasoning traffic sits in the favorable regime, follows from those sources rather than from the house figures, which is the property an audit most wants to see.

Two pieces of evidence are doing disproportionate work and both survive scrutiny: DeepSeek’s own statement that MTP slightly hurts throughput while significantly improving latency, a direct vendor quote, and MLPerf v6.0’s decision to mandate speculative decoding for its tight-latency reasoning scenario, which is the measuring authority writing the thesis into the rules.

The weakest links are named rather than hidden, and after this revision they are narrow. Boundary and Crossover are now derivations rather than stylizations: the first is the roofline condition with the wall at twice the bytes-ratio over the ridge, the second is the weight-versus-KV balance, both house calculations carried out with standard simplifying assumptions (GEMM-dominated compute, dense grouped-query KV) whose exact coordinates move with precision and ridge while the structure holds.

The cost magnitudes are likewise derived from an explicit model, an accept length near two and a half and a draft length near three, and cross-checked against the production speedups; what remains genuinely soft there is the absolute dollar level, which varies too much with rate, model, and quantization to pin down.

The decay envelope is a schematic shape, though the measured points on it and its break-even location are sourced. The chat and vanilla-draft baselines are round numbers for scale. A handful of practitioner figures, the break-even batch, the VRAM envelope, and the production speedup band, come from individual engineering guides rather than independently reproduced benchmarks, and are tiered B accordingly.

None of these elements carries the conclusion: a reader who accepts only the Tier A claims, and works the two house calculations independently, arrives at the same verdict table.

Overall confidence in the thesis is high, because the thesis is a statement about regimes and signs that the primary sources support directly, and it is deliberately not a statement that speculation yields any specific universal multiple, which the evidence would not support.

The quantitative illustrations are held at lower confidence by design, and labeled as such, so that the argument does not borrow credibility it has not earned. The lower tiers furnish the texture, not the conclusion.

Sources and useful information

Primary and secondary sources for the load-bearing claims, with arXiv identifiers and venues where applicable. The text above attributes each source at its point of use; this is the consolidated record.

Papers appear first, then benchmark specifications, vendor and practitioner writeups, and model cards.

Y. Leviathan, M. Kalman, and Y. Matias. Fast Inference from Transformers via Speculative Decoding. International Conference on Machine Learning (ICML), 2023. arXiv:2211.17192.
C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper. Accelerating Large Language Model Decoding with Speculative Sampling. arXiv:2302.01318, 2023.
T. Cai, Y. Li, Z. Geng, H. Peng, J. D. Lee, D. Chen, and T. Dao. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. International Conference on Machine Learning (ICML), 2024. arXiv:2401.10774.
Y. Li, F. Wei, C. Zhang, and H. Zhang. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. International Conference on Machine Learning (ICML), 2024. arXiv:2401.15077.
Y. Li, F. Wei, C. Zhang, and H. Zhang. EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. arXiv:2406.16858.
Y. Li, F. Wei, C. Zhang, and H. Zhang. EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test. Conference on Neural Information Processing Systems (NeurIPS), 2025. arXiv:2503.01840.
R. Sadhukhan, J. Chen, Z. Chen, V. Tiwari, A. May, T. Chen, and B. Chen. MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding. arXiv:2408.11049, 2024.
DeepSeek-AI. DeepSeek-V3 Technical Report. arXiv:2412.19437, 2024.
DeepSeek-AI. Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures. International Symposium on Computer Architecture (ISCA), 2025. arXiv:2505.09343.
DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, 2025.
Z.ai (Zhipu AI). GLM-5 Technical Report. arXiv:2602.15763, 2026.
MLCommons. MLPerf Inference: Datacenter, v5.1 (DeepSeek-R1 reasoning workload). Benchmark rules and results, 2025.
MLCommons. MLPerf Inference: Datacenter, v6.0 (DeepSeek-R1 Interactive scenario, mandated speculative decoding). Benchmark rules, 2026.
EAGLE Team, vLLM Team, and TorchSpec. EAGLE 3.1: release and SPEED-Bench results on Kimi-K2.6. vLLM Blog, May 2026.
NVIDIA. Speculative Decoding for Reinforcement-Learning Rollouts in NeMo RL. NVIDIA Developer technical writeup, 2026.
Together AI. Speculative decoding for high-throughput long-context inference (analysis of MagicDec). Together AI Blog, 2024.
BentoML. The Complete Guide to DeepSeek Models: V3, R1, V4 and Beyond. BentoML Blog, 2026.
Spheron Network. Speculative decoding in production: a practitioner’s guide. Engineering guide, 2026.
E2E Networks. Speculative decoding performance on Llama-3.1 serving. Engineering notes, 2026.
DeepSeek-AI. DeepSeek-R1-0528. Model card, Hugging Face, 2025.
DeepSeek-AI. DeepSeek-V4-Pro and DeepSeek-V4-Flash. Model cards, Hugging Face, 2026.

The Split and the Seam

Lorenzo Bradanini — Sun, 21 Jun 2026 14:02:14 GMT

Intro

On Monday, January 27, 2025, NVIDIA lost about 600 billion dollars of market value in a single trading session, a 17 percent fall that stands as the largest one-day market-capitalization loss in the history of US public markets. It did not fall alone.

Broadcom dropped 17.4 percent, Marvell 19.1 percent, AMD 19.1 percent, the chip-adjacent names down in sympathy across the board. The trigger was not an earnings miss or a product recall.

It was a technical report from a Chinese lab, DeepSeek, whose V3 and R1 models matched the frontier while having been trained and, crucially, served at a fraction of the assumed cost.

The market did the obvious arithmetic: if inference is far cheaper than we believed, fewer GPUs are needed, so the company selling the GPUs is worth less.

The arithmetic was wrong, and the reason it was wrong is the subject of this issue. Within days, Microsoft’s Satya Nadella was citing the Jevons paradox, the nineteenth-century observation that making a resource cheaper to use tends to increase, not decrease, its total consumption. The paradox held beautifully.

Over the course of 2025, by the Peterson Institute’s accounting, the cost to reach a fixed score on a hard reasoning benchmark fell from about 4,500 dollars per task to 11.64 dollars, a roughly 386-fold collapse, and inference usage did not shrink to match. It exploded past the efficiency gains, exactly as Jevons would predict.

The chip that was supposed to be made redundant by cheap inference is now sold out for years on the back of it [Figure 1].

Figure 1. The shake and its resolution. Left, the single-day sector rout that followed DeepSeek’s efficiency disclosure, the largest one-day loss in market history. Right, the cost to reach a fixed benchmark score over 2025, a collapse the industry has named LLMflation. Efficiency did not destroy demand. It multiplied it.

Here is the part the market missed in its first reaction. DeepSeek’s efficiency was not a single trick.

It was a stack of techniques, low-precision FP8 arithmetic, a sparse mixture-of-experts model, a compressed attention scheme called multi-head latent attention, and, underneath all of it, an inference architecture that ran the two phases of language-model serving, prefill and decode, on entirely separate pools of machines.

That last technique, disaggregation, is the one that matters most for understanding what happened next, because in the eighteen months bracketing that selloff it went from a contrarian research idea that the open-source community pushed back on to the default architecture of virtually every production serving system in existence.

NVIDIA Dynamo, llm-d, Ray Serve, SGLang, vLLM, LMCache, and Mooncake all run on it now. The very metrics the industry uses to talk about inference latency, time-to-first-token and time-per-output-token, were popularized through its lens.

The authors of DistServe, the paper that named the architecture, marked the moment in a November 2025 retrospective with a wry observation: if Moore’s law doubles compute every eighteen months, then the serving-systems equivalent had just doubled too, not because the chips got faster, but because the systems serving them did.

This issue is a teardown of that architecture as it actually exists in mid-2026, not as a tidy origin story. We start with the physics, because the physics is clean and explains everything downstream.

Then why colocation lost, what the split mechanically does, and the cost it creates at the seam where the two halves rejoin, a cost that the rack-scale hardware of the current generation has largely, but not entirely, dissolved.

Then we go down into the kernel layer, to the expert-parallel all-to-all communication and the custom kernels that make large mixture-of-experts models servable at all, which is the part most coverage skips and the part that actually decides throughput.

Then the attention rewrites that are shrinking the problem from underneath, the cross-vendor benchmark numbers and the honest places they fall apart, the operational tax nobody puts on the slide, the new silicon the split has spawned, and finally what the whole shake settled into.

The thesis, stated once and defended throughout: disaggregation is no longer a choice an operator debates. It is the substrate. The live questions have moved up a layer, to how you balance the pools, how wide you spread the experts, and whether the wire between your machines is fast enough that the seam is free.

Two workloads on opposite ends of the roofline

The roofline model is the oldest honest tool in performance engineering, and once prefill and decode are plotted on it the rest of this issue is commentary.

Every kernel has an arithmetic intensity, the floating-point operations it performs per byte it moves from memory. Every chip has two ceilings: a compute ceiling set by its peak arithmetic rate, and a memory ceiling set by its bandwidth.

A kernel is compute-bound when its arithmetic intensity is high enough that the chip exhausts its FLOPS before its bandwidth, and memory-bound otherwise.

The crossover, the ridge point, is simply peak FLOP rate divided by bandwidth. For an H100 SXM at FP8, roughly 1,979 dense teraFLOPS over 3.35 terabytes per second of HBM3, the ridge sits near 591 FLOP per byte.

For the H200, identical compute over 4.8 terabytes per second of HBM3e, it falls to 412. For a B200, about 4,500 dense FP8 teraFLOPS over 8 terabytes per second, near 562 [Figure 2].

Figure 2. A roofline for three inference GPUs with the operating region of each phase marked. Prefill runs against the flat compute roof. Single-stream decode is pinned to the sloped bandwidth region. The same silicon is a different machine depending on which phase is running.
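
The ridge arithmetic is short enough to check by hand, and a few lines make it reusable. This is a minimal sketch using only the peak FP8 and bandwidth figures quoted above; the function names are illustrative, and vendor peak numbers are themselves best-case values.

```python
# Peak dense FP8 throughput (FLOP/s) and HBM bandwidth (bytes/s), as quoted in the text.
GPUS = {
    "H100 SXM": (1979e12, 3.35e12),
    "H200": (1979e12, 4.8e12),
    "B200": (4500e12, 8.0e12),
}

def ridge_point(peak_flops, bandwidth):
    # Arithmetic intensity (FLOP per byte) at which the two ceilings meet.
    return peak_flops / bandwidth

def attainable_flops(intensity, peak_flops, bandwidth):
    # Classic roofline: the lower of the compute roof and the bandwidth slope.
    return min(peak_flops, intensity * bandwidth)

def regime(intensity, peak_flops, bandwidth):
    if intensity >= ridge_point(peak_flops, bandwidth):
        return "compute-bound"
    return "memory-bound"

for name, (flops, bw) in GPUS.items():
    print(f"{name}: ridge {ridge_point(flops, bw):.1f} FLOP/byte, "
          f"intensity 2 is {regime(2, flops, bw)}, "
          f"intensity 2000 is {regime(2000, flops, bw)}")
```

An intensity of 2 stands in for single-stream decode and 2,000 for a long prefill; on all three chips they fall on opposite sides of the ridge.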

Prefill ingests the whole prompt at once. Every token attends to every prior token, the feed-forward layers process the full sequence in parallel, and the matrix multiplications are large and dense.

A prompt of a few thousand tokens pushes the arithmetic intensity into the hundreds or thousands of FLOP per byte, planting prefill firmly to the right of the ridge against the compute roof. Prefill is a FLOPS problem. It wants tensor cores and low precision and scales with raw matrix-multiply throughput.

Decode is the opposite animal. To generate one token it reads the entire weight set and the entire key-value cache for the sequence, performs a thin slice of computation, and emits a single token. For a single stream the arithmetic intensity sits near the floor, on the order of one to two FLOP per byte, pinning decode to the far left of the roofline on the bandwidth-limited slope.

Decode is a memory-bandwidth problem. It cannot keep the tensor cores fed; it cares only how fast the chip streams weights and cache out of HBM. The hard floor follows immediately: the fastest a single decode stream can run is bandwidth divided by bytes read per token, dominated by the weights [Figure 3].

A 70-billion-parameter model at FP16 is 140 gigabytes, so an H100 at 3.35 terabytes per second generates at most about 24 tokens per second on a single stream, an H200 about 34, a B200 about 57. Halve the precision to FP8 and every ceiling doubles.

This is the same calculation that bounds DeepSeek’s observed 20 to 22 tokens per second in production. Compute does not enter.

Figure 3. The single-stream decode ceiling is HBM bandwidth divided by model bytes. No quantity of compute changes it. Halving precision, which halves the weight bytes, is the only lever that moves the ceiling for a fixed model.
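
The ceiling is one division. A minimal sketch, assuming the weight read dominates and ignoring the key-value cache and any kernel overhead, so the outputs are upper bounds rather than predictions; the function name is illustrative.

```python
def decode_ceiling_tokens_per_s(params, bytes_per_param, bandwidth_bytes_per_s):
    # Every generated token requires streaming the full weight set from HBM once.
    model_bytes = params * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes

BANDWIDTH = {"H100": 3.35e12, "H200": 4.8e12, "B200": 8.0e12}  # bytes/s, as quoted above

for name, bw in BANDWIDTH.items():
    fp16 = decode_ceiling_tokens_per_s(70e9, 2, bw)
    fp8 = decode_ceiling_tokens_per_s(70e9, 1, bw)
    print(f"{name}: {fp16:.0f} tok/s at FP16, {fp8:.0f} tok/s at FP8")
```

Peak FLOPS appears nowhere in the function, which is the whole point: for a single stream, only bytes and bandwidth set the bound.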

Batching is the escape, and its limit is the reason disaggregation exists. Decode many sequences at once and the weight read is shared across the batch: you stream the weights once and amortize them.

The arithmetic intensity of decode is therefore approximately twice the batch size divided by the bytes per weight, which at FP16 is approximately the batch size itself [Figure 4]. To cross the H100 FP8 ridge of 591 you need a batch in the hundreds: about 590 sequences by that FP16 approximation, and about 300 if the weights are stored at one byte each.

That is the entire game in decode: pack as many concurrent sequences into a step as memory allows, because every added sequence moves you rightward toward the compute roof and lifts tokens-per-second-per-GPU.

Hold the two facts together: prefill wants to run immediately, in small groups, against the compute roof, to keep first-token latency low; decode wants to run in enormous batches against the bandwidth ceiling, to keep cost-per-token low.

One phase is latency-shaped and compute-hungry, the other throughput-shaped and bandwidth-hungry, and for two years the industry asked one GPU under one scheduler to do both at once.

Figure 4. Decode arithmetic intensity is essentially the batch size, because weights are read once per step and shared. Reaching the compute roof requires hundreds of concurrent sequences, which is why decode pools are built around the largest batches memory will hold.
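
The batch needed to reach a given ridge follows directly from that approximation. A minimal sketch, assuming weights-only traffic (the key-value cache read is ignored, which flatters the result at long context); the function names are illustrative.

```python
import math

def decode_intensity(batch, bytes_per_weight):
    # Each weight does one multiply-add (2 FLOPs) per sequence in the batch
    # and is read from HBM once per step, however large the batch.
    return 2 * batch / bytes_per_weight

def batch_to_reach(ridge, bytes_per_weight):
    # Smallest batch whose weights-only intensity meets the ridge.
    return math.ceil(ridge * bytes_per_weight / 2)

for label, bytes_per_weight in (("FP16 weights", 2), ("FP8 weights", 1)):
    need = batch_to_reach(591, bytes_per_weight)
    print(f"{label}: batch {need} reaches intensity "
          f"{decode_intensity(need, bytes_per_weight):.0f} FLOP/byte")
```

Because real decode also streams the cache for every sequence, the batch required in practice is larger than this bound, and at long context the ridge may not be reachable at all.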

Why colocation lost

Run both phases on one GPU under one continuous-batching scheduler, the architecture Orca introduced and vLLM popularized, and they fight. The fight has a precise mechanism.

Continuous batching keeps a rolling batch of decode steps running and folds in new requests as they arrive, but a new request cannot decode until its prompt is prefilled, and prefill is a heavy compute-bound operation that occupies the GPU far longer than a single decode step.

When a prefill lands in the batch, the system must either pause the in-flight decodes to prioritize it or batch the prefill alongside them, and both choices stall token generation for every active sequence.

The DistServe authors quantify the damage bluntly in their retrospective: even with chunked-prefill mitigation, a single large prefill can inflate time-per-output-token by a factor of two to thirty under bursty workloads.

A long prompt arriving at the wrong moment makes every other user’s stream stutter for a third of a second or more [Figure 5].

Figure 5. The structural waste of colocation. A prefill-heavy step pins the tensor cores while bandwidth idles; a decode-heavy step pins bandwidth while the tensor cores idle. A shared GPU pays for both resources and presses hard on one at a time. Exact values are workload-dependent; the asymmetry is not.

The deeper cost is coupling. As the DistServe paper put it, colocation forces the resource allocator to provision for the worst case of both latency targets simultaneously, the tight first-token target and the tight per-token target, because the same GPUs serve both.

You cannot tune one phase without detuning the other, and you cannot scale one without scaling the other. The roofline says why the waste is structural and not a scheduling artifact: on a prefill-heavy step the tensor cores run near saturation while the memory bus idles, and on a decode-heavy step the reverse, so a colocated GPU pays rent on two expensive resources and uses roughly one of them at any instant.

The serving community fought this with chunked prefill, introduced in the Sarathi work by Amey Agrawal and co-authors, which breaks a long prefill into bounded chunks interleaved with ongoing decode so the peak disruption per step is capped and per-token latency smooths out.

Chunked prefill is the strongest argument against disaggregation and an honest teardown has to credit it: by mixing a compute-bound prefill chunk with bandwidth-bound decode work in one step it even improves utilization, running the two against different ceilings. But it does not dissolve the coupling.

The phases still share one parallelism strategy, one memory pool, one tensor-parallel degree, all of them a compromise. And it trades latencies against each other: the finer you chunk to protect per-token latency, the more you stretch first-token latency, because a long prompt now dribbles through the GPU in pieces.

You are back in the original bind. On shared hardware you can favor first-token latency or per-token latency, but you cannot independently optimize both.
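
The bind can be seen in a toy model. This is a sketch under loudly stated assumptions, not a measurement: each step is taken to cost a fixed 20 ms plus 0.05 ms per prefill token in the chunk. Both constants are hypothetical, chosen only to make the shape visible, and the function name is illustrative.

```python
import math

def chunked_prefill(prompt_tokens, chunk_tokens,
                    step_fixed_ms=20.0, prefill_ms_per_token=0.05):
    """Return (first-token latency, per-step stall seen by co-running decodes), in ms.

    Hypothetical cost model: every step pays a fixed cost plus a cost linear in
    the prefill chunk it carries. The final, possibly shorter, chunk is charged
    as a full one, a slight overestimate.
    """
    steps = math.ceil(prompt_tokens / chunk_tokens)
    step_ms = step_fixed_ms + prefill_ms_per_token * min(chunk_tokens, prompt_tokens)
    return steps * step_ms, step_ms

for chunk in (8000, 2048, 512, 128):
    ttft, stall = chunked_prefill(prompt_tokens=8000, chunk_tokens=chunk)
    print(f"chunk {chunk:>5}: first token after {ttft:7.1f} ms, "
          f"decode step stretched to {stall:6.1f} ms")
```

Shrinking the chunk lowers the per-step stall that every decoding user feels and raises the first-token latency of the long prompt, because the fixed per-step cost is paid once per chunk. The constants are invented; the direction of the trade is what the sketch is meant to show.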

What changed in 2025 was not the physics but the stakes. DistServe, the authors recount, met real pushback in 2024 because disaggregation demands a heavy refactor of existing serving systems, and saw little adoption that year.

Then businesses began deploying language models at competitive scale, and throughput stopped being the only metric that mattered. Latency became existential, because a chatbot that stutters loses users, and an agent that stalls breaks workflows.

At the same time models grew and traffic surged, forcing systems past hundreds and into thousands of GPUs, the regime where a disaggregated architecture genuinely shines because it can allocate resources to each phase independently and pair each with its own parallelism strategy. The technique that was ahead of its time in 2024 was exactly on time in 2025.

The anatomy of the split

Disaggregation assigns the phases to physically separate GPU pools. A request hits a prefill instance, which builds the key-value cache for the prompt and produces the first token; the cache is handed across to a decode instance, which loads it, folds the request into its large rolling batch, and generates the rest.

Prefill machines only prefill. Decode machines only decode. Three things become possible that colocation forbids, and they are the whole value proposition.

Interference disappears, because no prefill ever lands in the decode batch, so decode steps run uninterrupted and per-token latency stops spiking. Independent optimization becomes possible, because each pool can be tuned to its own roofline and scaled on its own axis, adding prefill capacity when prompts lengthen and decode capacity when outputs lengthen.

And phase-specific hardware becomes possible, the idea Splitwise pushed hardest, that since decode is bandwidth-bound and prefill compute-bound, the two should not run on identical chips at all, a thread that has since grown into an entire hardware category we will come to.

The crucial metric that makes all of this legible is goodput, and it is the metric most operators still fail to measure. Throughput is requests or tokens completed per second, full stop.

Goodput is requests completed per second that meet their service-level objectives, both the first-token target and the per-token target. The distinction is the whole point, because a colocated system under load can post rising throughput while its goodput collapses, as interference pushes more and more requests past their latency targets even as tokens keep flowing [Figure 6].

Hao Zhang of UC San Diego, a DistServe author, frames it starkly in his lectures: a system can show ten requests per second of throughput while delivering three requests per second of goodput once the SLO is applied. The other seven finished late, which for an interactive product means they did not finish.

Disaggregation’s claim is a goodput claim. It does not necessarily move more tokens in the abstract; it moves more tokens that arrive on time, and the DistServe paper measured that as 7.4 times more requests within SLO, or a 12.6 times tighter achievable latency target, against the colocated state of the art.

Figure 6. Throughput and goodput diverge under load. Raw tokens per second keep climbing while the count of requests that meet their latency targets falls away as interference mounts. Disaggregation’s win is the gap. Shape follows the DistServe goodput results; axes are illustrative.
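
Measuring goodput takes little more than applying the SLO before counting. A minimal sketch, assuming per-request first-token and per-token latencies are already logged; the `Request` record, the SLO thresholds, and the sample window are all hypothetical, built to mirror the ten-versus-three example above.

```python
from dataclasses import dataclass

@dataclass
class Request:
    ttft_s: float  # time to first token, seconds
    tpot_s: float  # mean time per output token, seconds

def throughput(requests, window_s):
    # Everything that completed counts, on time or not.
    return len(requests) / window_s

def goodput(requests, window_s, ttft_slo_s, tpot_slo_s):
    # Only requests that met BOTH latency targets count.
    on_time = [r for r in requests
               if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s]
    return len(on_time) / window_s

# Hypothetical one-second window: ten requests complete, three meet both targets.
window = ([Request(0.4, 0.03)] * 3      # on time
          + [Request(0.4, 0.25)] * 4    # per-token target missed
          + [Request(3.0, 0.03)] * 3)   # first-token target missed

print("throughput:", throughput(window, 1.0), "req/s")
print("goodput:   ", goodput(window, 1.0, ttft_slo_s=1.0, tpot_slo_s=0.05), "req/s")
```

The two functions differ by one filter, which is why an operator who logs only completions can watch throughput rise while the number that matters falls.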

The published uplifts that first announced the technique are worth stating with the asterisk each deserves [Figure 7].

DistServe reported 7.4 times more requests served within SLO against the colocated state of the art;
Splitwise, the parallel Microsoft Azure effort, 2.35 times the throughput at equal power and cost, or 1.4 times at 20 percent lower cost;
Mooncake, the platform behind Moonshot AI’s Kimi assistant, up to a 525 percent throughput gain in simulated overload and 75 percent more requests under real production traffic.

The metrics differ, goodput in one, throughput at fixed power in another, raw request volume in a third, so they are not directly comparable, and each is a ceiling for a favorable workload rather than a constant anyone reproduces.

They are the numbers that turned a research idea into a procurement decision.

Figure 7. The headline multiples are real and each is a best case. The metrics are not commensurable across bars, and every figure is tied to a specific model, workload, and SLO. Read them as ceilings for favorable regimes, not a universal constant.

Modern orchestration adds one more lever on top of the split: cache-aware routing. NVIDIA Dynamo, the orchestration layer that sits above the inference engines, routes each request to the prefill worker whose resident cache best overlaps the incoming prompt, which NVIDIA reports can roughly halve first-token latency by avoiding redundant prefill of shared prefixes.

The router treats prefill and decode workers as first-class services, and a central planner continuously profiles the GPUs to autoscale and rebalance. This is the productized form of an idea DistServe seeded with a much simpler pull-based scheduler, which it used to keep decode workers from being flooded by spiky prefill bursts.
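The routing idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Dynamo's implementation or API: the function names and the dictionary of per-worker cached sequences are hypothetical, and a production router would also weigh worker load rather than overlap alone.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_prefill_worker(prompt_tokens, worker_caches):
    """Pick the worker whose resident cache overlaps the prompt the most.

    worker_caches maps a worker id to a list of cached token sequences
    (hypothetical structure). Ties go to the lowest worker id.
    Returns (worker_id, overlap_in_tokens).
    """
    best_id, best_overlap = None, -1
    for worker_id in sorted(worker_caches):
        overlap = max(
            (shared_prefix_len(prompt_tokens, cached)
             for cached in worker_caches[worker_id]),
            default=0,
        )
        if overlap > best_overlap:
            best_id, best_overlap = worker_id, overlap
    return best_id, best_overlap


caches = {
    "w0": [[1, 2, 3]],
    "w1": [[1, 2, 3, 4, 5], [9]],
    "w2": [],
}
print(pick_prefill_worker([1, 2, 3, 4, 7], caches))
```

Every token of overlap is a token of prefill the chosen worker does not have to recompute, which is where the first-token latency saving comes from.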

The control plane has become as much of the system as the data plane.


The seam, and the rack that dissolved it

The instant you split the phases you create a seam, and across it you must carry the key-value cache from the prefill instance that built it to the decode instance that consumes it, for every request. The cache is not a control message. It is the full attention state of the prompt, and it can be gigabytes.

Its size is set by the attention architecture, and the spread is enormous [Figure 8]. Multi-head attention stores a key and value vector per head, per layer, per token; a Llama-2-class 70B model with 64 heads, head dimension 128, and 80 layers holds about 2.6 megabytes per token at FP16, so a 1,000-token prompt carries roughly 2.6 gigabytes of state.

Grouped-query attention, the Llama 3 scheme, shares key and value projections across groups and cuts the stored heads from 64 to 8, dropping the cache to about 0.33 gigabytes per thousand tokens.

Multi-head latent attention, DeepSeek’s design, compresses the per-token state into a single latent vector of dimension 576 stored once rather than per head, landing near 0.07 gigabytes per thousand tokens.

Across these three schemes the seam payload spans a factor of about 37, which is the unglamorous reason MLA models are structurally cheaper to disaggregate and the reason attention-architecture choices are now serving-cost choices.

Figure 8. The seam payload is set by attention design. MHA ships a few gigabytes per thousand tokens of context across the wire; MLA ships a small fraction of that. The bytes you must move are decided in the model architecture, not the serving stack.
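The per-token arithmetic behind these figures is short enough to write out. The sketch below uses the head counts, head dimension, layer count, and latent dimension given in the text. The 61-layer depth for the MLA case is an assumption matching DeepSeek-V3's published configuration, and the function names are illustrative. Results are in decimal gigabytes; small differences from rounded prose figures come down to decimal versus binary units.

```python
def kv_bytes_per_token(n_kv_heads, head_dim, n_layers, bytes_per_elem=2):
    """MHA or GQA: one key and one value vector per stored head, per layer."""
    return 2 * n_kv_heads * head_dim * n_layers * bytes_per_elem


def mla_bytes_per_token(latent_dim, n_layers, bytes_per_elem=2):
    """MLA: a single compressed latent vector per layer, shared by all heads."""
    return latent_dim * n_layers * bytes_per_elem


mha = kv_bytes_per_token(n_kv_heads=64, head_dim=128, n_layers=80)  # FP16
gqa = kv_bytes_per_token(n_kv_heads=8, head_dim=128, n_layers=80)   # FP16
mla = mla_bytes_per_token(latent_dim=576, n_layers=61)  # 61 layers: assumption

for name, per_token in (("MHA", mha), ("GQA", gqa), ("MLA", mla)):
    print(f"{name}: {per_token:,} bytes/token, "
          f"{per_token * 1000 / 1e9:.2f} GB per 1,000 tokens")
print(f"MHA / MLA ratio: {mha / mla:.1f}")
```

The ratio between the first and last rows is what the text means by the spread in seam payload.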

Now the carry. Take the cache for a 4,096-token prompt on a grouped-query 70B model, about 1.34 gigabytes, and move it across the interconnects an operator might have [Figure 9].

On fifth-generation NVLink at 1.8 terabytes per second, 0.75 milliseconds. On InfiniBand NDR at 400 gigabits per second, 27 milliseconds. On 200-gigabit RoCE, 54. On commodity 100-gigabit Ethernet, 107. Across a 25-gigabit link between datacenters, 429.

The carry is a one-time handoff per request, so it adds to first-token latency rather than the per-token rate, and the right comparison is against the prefill it follows, roughly 290 to 600 milliseconds for that prompt on an H100.

Against that, NVLink and InfiniBand transfers are rounding error or close to it, commodity Ethernet bolts roughly a fifth to a third of the prefill onto every request, and cross-datacenter is fatal on its own.
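The carry times above follow from dividing the payload by the link's line rate. The sketch below reproduces that arithmetic under the simplifying assumption that each link sustains its full nominal bandwidth with no protocol overhead; the dictionary and function names are illustrative.

```python
# Nominal link rates from the text, converted to bytes per second.
LINKS_BYTES_PER_S = {
    "NVLink 5 (1.8 TB/s)": 1.8e12,
    "InfiniBand NDR (400 Gb/s)": 400e9 / 8,
    "RoCE (200 Gb/s)": 200e9 / 8,
    "Ethernet (100 Gb/s)": 100e9 / 8,
    "Cross-datacenter (25 Gb/s)": 25e9 / 8,
}

# Grouped-query 70B cache: 2 (K and V) x 8 heads x 128 dims x 80 layers x 2 bytes.
BYTES_PER_TOKEN = 2 * 8 * 128 * 80 * 2
PAYLOAD_BYTES = BYTES_PER_TOKEN * 4096  # about 1.34 GB for a 4,096-token prompt


def transfer_ms(payload_bytes, link_bytes_per_s):
    """Idealized one-shot transfer time in milliseconds."""
    return payload_bytes / link_bytes_per_s * 1e3


for name, rate in LINKS_BYTES_PER_S.items():
    print(f"{name}: {transfer_ms(PAYLOAD_BYTES, rate):.2f} ms")
```

Real transfers will be somewhat slower than these idealized numbers, so the ordering matters more than the last digit.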

The DistServe authors measured the transfer at under 0.1 percent of request latency over fast intra-node links, and every serious stack hides the transfer further with layer-wise streaming, shipping each layer’s cache the moment that layer finishes so its movement overlaps the computation of the next.
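A toy timeline shows why layer-wise streaming hides most of the carry. The model below is a sketch under simplifying assumptions: every layer takes the same compute time, every layer's cache takes the same transfer time, and a single link carries one layer at a time. The 400 ms prefill and 107 ms total transfer are illustrative values within the ranges quoted above, not measurements.

```python
def handoff_finish_ms(n_layers, compute_ms_per_layer, transfer_ms_per_layer, streamed):
    """Time at which the last layer's cache has arrived at the decode side."""
    if not streamed:
        # Compute the whole prefill, then ship the whole cache.
        return n_layers * (compute_ms_per_layer + transfer_ms_per_layer)
    link_free_at = 0.0
    for layer in range(1, n_layers + 1):
        computed_at = layer * compute_ms_per_layer
        # A layer's cache can leave once it exists and the link is idle.
        start = max(computed_at, link_free_at)
        link_free_at = start + transfer_ms_per_layer
    return link_free_at


N_LAYERS = 80
compute = 400.0 / N_LAYERS    # illustrative 400 ms prefill
transfer = 107.0 / N_LAYERS   # illustrative 107 ms total carry

print(handoff_finish_ms(N_LAYERS, compute, transfer, streamed=False))
print(handoff_finish_ms(N_LAYERS, compute, transfer, streamed=True))
```

When the per-layer transfer is shorter than the per-layer compute, only the final layer's transfer remains exposed; when it is longer, the link becomes the bottleneck and streaming can no longer hide it.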

Figure 9. The same 1.34 GB handoff across the interconnect hierarchy. Inside the rack it is noise; on InfiniBand it fits under the prefill; on commodity Ethernet it eats the budget; across datacenters it is fatal. The interconnect, not the GPU, decides whether disaggregation is viable.

This is where the current hardware generation changed the calculus, and it is the single most important development since the technique went mainstream.

The GB200 NVL72 and its successor the GB300 NVL72 connect 72 GPUs into one NVLink domain that behaves as a single massive GPU, with up to 130 terabytes per second of aggregate GPU-to-GPU bandwidth.

When prefill and decode pools live inside the same NVLink domain, the seam is no longer a network hop. It is a memory copy across a coherent fabric, sub-millisecond, effectively free.

The hard part of disaggregation, the part that made it a research problem for years, was moving the cache without blowing the latency budget, and the rack-scale NVLink domain makes that part vanish for deployments that fit inside a rack. This is a large reason disaggregation went from papers to production so fast: the hardware arrived to pay the seam’s bill.

The same fabric is why these systems can run the extremely wide expert parallelism we will see in the next section, which would be communication-bound on any slower interconnect.

The transfer machinery itself is now standardized infrastructure. NVIDIA’s NIXL, open-sourced at GTC 2025, unifies NVLink, InfiniBand, PCIe, and storage fabrics under one point-to-point abstraction and runs non-blocking so the GPU keeps computing while the cache moves.

Mooncake’s Transfer Engine, from the Kimi team, presents the same unified interface over TCP, RDMA, shared memory, and NVMe-over-fabrics.

LMCache, from the University of Chicago, accelerates the movement with batched transfers and I/O pipelining and decouples cache storage from the engine so the cache can persist, migrate, and be shared independently of execution; its authors report up to a tenfold reduction in first-token latency from KV reuse and offload.

DeepSeek built 3FS, a file system that pools thousands of SSDs and hundreds of storage nodes so any prefill can stash a cache that any decode can fetch in a locality-oblivious way.

And the cache, once it is a first-class managed object, can be retained and reused across requests, which is the quiet lever under the whole economy: of DeepSeek’s 608 billion daily input tokens, 342 billion, 56.3 percent, were served from cache rather than recomputed.

Prefix caching is for many workloads a larger cost lever than the prefill-decode split itself, and disaggregation is what makes the cache a managed resource in the first place.

Figure 10. The economics DeepSeek disclosed, and the cache that drives them. Left, the daily GPU cost against the theoretical revenue if every token were billed at R1 rates, the source of the much-quoted 545 percent margin, which the company itself flags as theoretical and which is materially lower in reality. Right, the token composition: more than half of all input tokens were served from cache, not recomputed. This is the disaggregated, MLA-based, heavily-cached architecture that lit the fuse on the whole repricing.


The kernel layer: expert parallelism and the all-to-all

Here is the part most coverage skips, and the part that actually sets throughput on a modern frontier model.

Look under the hood of virtually any current frontier model, DeepSeek-V3, Kimi K2, Qwen3, Llama 4, and you find a sparse mixture-of-experts architecture: hundreds of expert sub-networks per layer, of which each token activates only a handful.

A 671-billion-parameter model like DeepSeek-V3 activates about 37 billion parameters per token. Sparsity is what makes these models cheap to run, but it imposes a specific and brutal communication pattern, and that pattern is where disaggregation, expert parallelism, and the interconnect all collide.

To serve a large MoE you distribute the experts across many GPUs, a scheme called expert parallelism, because no single GPU holds all of them.

But then every token has to travel to whichever GPUs hold its chosen experts and the results have to travel back, which means every layer performs two all-to-all communication operations per step, a dispatch that scatters tokens to their experts and a combine that gathers the results.
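The dispatch and combine steps can be sketched without any GPU code. The example below is a single-process illustration of the data movement only, with hypothetical function names and a contiguous expert-to-rank mapping assumed for simplicity; real implementations perform these as collective operations across devices.

```python
def dispatch(token_ids, topk_experts, experts_per_rank):
    """First all-to-all: group (token, expert) pairs by the rank owning the expert."""
    outbox = {}
    for token, experts in zip(token_ids, topk_experts):
        for expert in experts:
            rank = expert // experts_per_rank  # assumed contiguous placement
            outbox.setdefault(rank, []).append((token, expert))
    return outbox


def combine(expert_outputs, gate_weights):
    """Second all-to-all: gate-weighted sum of expert outputs, per token."""
    result = {}
    for (token, expert), value in expert_outputs.items():
        result[token] = result.get(token, 0.0) + gate_weights[(token, expert)] * value
    return result


# Two tokens, top-2 routing, 8 experts spread over 2 ranks (4 per rank).
sent = dispatch(token_ids=[0, 1], topk_experts=[[0, 5], [5, 6]], experts_per_rank=4)
print(sent)

# Scalar stand-ins for the expert outputs and router gate weights.
outputs = {(0, 0): 1.0, (0, 5): 3.0, (1, 5): 2.0, (1, 6): 4.0}
gates = {(0, 0): 0.5, (0, 5): 0.5, (1, 5): 0.25, (1, 6): 0.75}
print(combine(outputs, gates))
```

Each entry in the outbox is a message to another GPU, and each entry in the combined result required a message back, which is why the pattern costs two all-to-alls per MoE layer.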

This is the dominant communication cost of MoE inference, and it has a vicious property: the messages are tiny. In DeepSeek-V3 each dispatch or combine message runs from about 7 kilobytes during inference to 256 kilobytes during training [Figure 11].

General-purpose collective libraries like NCCL are tuned for the opposite regime, the large multi-megabyte all-reduce of dense training, where bandwidth dominates. At 7 kilobytes the per-message latency and synchronization overhead dominate instead, and NCCL leaves most of the wire idle.

Figure 11. Expert parallelism breaks the collective library. MoE dispatch and combine move tiny messages, where NCCL’s all-reduce kernels stall on latency and synchronization rather than saturating bandwidth. This mismatch is why DeepSeek wrote a dedicated communication library.
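A simple latency-plus-bandwidth model shows why small messages leave the wire idle. In the sketch below, the 5-microsecond fixed cost per message and the 50 GB/s link rate are hypothetical round numbers chosen for illustration, not measurements of NCCL or any specific fabric.

```python
def wire_utilization(message_bytes, link_bytes_per_s, per_message_overhead_s):
    """Fraction of elapsed time the link spends actually moving payload."""
    wire_time = message_bytes / link_bytes_per_s
    return wire_time / (wire_time + per_message_overhead_s)


LINK = 50e9        # hypothetical 50 GB/s link
OVERHEAD = 5e-6    # hypothetical 5 microseconds of launch and sync cost per message

for label, size in (("7 KB (MoE inference)", 7 * 1024),
                    ("256 KB (MoE training)", 256 * 1024),
                    ("64 MB (dense all-reduce)", 64 * 1024 * 1024)):
    print(f"{label}: {wire_utilization(size, LINK, OVERHEAD):.1%} of peak")
```

Under these assumptions the fixed cost dwarfs the transfer time at 7 kilobytes, so the only way to recover bandwidth is to shrink the per-message overhead itself, which is the job a purpose-built kernel takes on.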

That library is DeepEP, which DeepSeek open-sourced on the second day of its Open Source Week, and it is a small masterclass in why the kernel layer matters.

DeepEP provides custom all-to-all dispatch and combine kernels in two distinct flavors, and the split mirrors the prefill-decode split exactly. The high-throughput kernels serve prefill and training, maximizing raw bandwidth, but they emit dynamically shaped tensors that are incompatible with CUDA graphs.

The low-latency kernels serve decode, using direct RDMA to minimize latency and, critically, remaining CUDA-graph compatible so they avoid the kernel-launch overhead that dominates the decode phase, where each step is tiny and launch cost is a large fraction of the work.

The kernels need only about 20 streaming multiprocessors to saturate both the intra-node NVLink domain and the inter-node RDMA network simultaneously, freeing the rest of the GPU for computation.

They achieve this through NVSHMEM and IBGDA, which let the GPU issue RDMA operations directly to the network card without a round-trip through the CPU, and through asymmetric-domain forwarding that bridges the fast NVLink domain and the slower RDMA domain in one kernel.

The high-throughput path uses 24 queue pairs; the low-latency path uses 8 to 16, matched to the local expert count. A hook-based design overlaps the communication with computation without occupying compute units at all.

The newest DeepEP revisions add TMA-based transfers for minimal SM usage, support for the larger multi-node NVLink domains of the rack-scale systems, and a zero-SM remote-memory primitive for fetching KV cache directly from a peer.

This level of specialization is not optional at scale, and NVIDIA’s response confirms it: NVIDIA built HybridEP, its own token-based dispatch backend using the same hardware primitives, and co-developed a set of Blackwell kernels with the SGLang and vLLM projects through the FlashInfer library, covering attention prefill and decode, the communication path, the grouped matrix multiplications, the multi-node NVLink transfers, and MLA specifically.

The all-to-all is the bottleneck, and the entire industry is now optimizing the same dozen kernels.

The payoff of getting this right is expert parallelism that goes very wide, and width is throughput [Figure 12].

Spreading the experts of DeepSeek-R1 across 32 ways instead of 8 on a GB200 NVL72 lifts per-GPU output throughput by about 1.8 times, NVIDIA’s TensorRT-LLM measurements show, because a wider spread means each GPU holds fewer experts and therefore loads less weight per step, and because it fills the grouped matrix multiplications more completely.
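Both effects can be seen in a back-of-envelope calculation. The sketch below assumes 256 routed experts per MoE layer with top-8 routing, which matches DeepSeek's published architecture but is not stated in the text above, and it further assumes perfectly uniform routing and a fixed per-GPU batch. The function names and the batch size are hypothetical.

```python
def experts_per_gpu(n_routed_experts, ep_degree):
    """How many experts each GPU must hold, and load weights for, per layer."""
    if n_routed_experts % ep_degree:
        raise ValueError("expert count must divide evenly across the EP group")
    return n_routed_experts // ep_degree


def tokens_per_expert(batch_tokens_per_gpu, ep_degree, top_k, n_routed_experts):
    """Average rows in each expert's matrix multiply, assuming uniform routing."""
    global_tokens = batch_tokens_per_gpu * ep_degree
    return global_tokens * top_k / n_routed_experts


N_EXPERTS, TOP_K, BATCH = 256, 8, 64  # assumptions, see lead-in

for ep in (8, 32):
    print(f"EP{ep}: {experts_per_gpu(N_EXPERTS, ep)} experts per GPU, "
          f"{tokens_per_expert(BATCH, ep, TOP_K, N_EXPERTS):.0f} tokens per expert")
```

Going from 8 to 32 ways cuts the experts each GPU holds by four and quadruples the rows feeding each expert's matrix multiply, at the price of more all-to-all traffic.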

Wide expert parallelism is only viable because the 130-terabyte-per-second NVLink domain absorbs the all-to-all traffic that wider spreading generates; on a slower fabric the communication would swamp the gain.

And the two phases run deliberately different widths. DeepSeek’s disclosed production configuration runs prefill at expert-parallel degree 32 and decode at degree 144, a decode pool four and a half times wider than the prefill pool, because decode is where the wide spread pays off in throughput and where the batch is large enough to keep all those experts busy.

The DistServe authors, surveying the same system, note decode configurations reaching toward 256-way expert parallelism, and newer stacks push wider still.

The prefill-decode asymmetry that began as a latency argument has become a parallelism argument: the phases want not just different hardware but different distributed-systems topologies entirely.

Figure 12. Decode wants the widest expert parallelism the fabric will allow. Spreading experts to EP32 instead of EP8 lifts per-GPU decode throughput by about 1.8 times by shrinking the per-GPU weight load and filling the grouped matrix multiplications. The two phases run deliberately different degrees.


The attention rewrite is shrinking the problem

While the serving stack was learning to split and spread, the model architects were attacking the problem from underneath, and the attack lands squarely on the two costs this issue is about: the bytes at the seam and the compute in prefill.

Multi-head latent attention was the first cut, and we have already seen its effect on the seam, a roughly thirty-seven-fold reduction in cache bytes versus dense multi-head attention.

But MLA leaves the other cost untouched: prefill attention still scales quadratically with sequence length, because every token still attends to every prior token, and as context windows stretch toward a million tokens that quadratic term dominates the prefill bill.

This is the cost that DeepSeek attacked in V3.2 with DeepSeek Sparse Attention, and the mechanism is elegant [Figure 13]. A lightweight neural network DeepSeek calls the Lightning Indexer scores the relevance of past key blocks to the current query and selects only the top few thousand most relevant, and the expensive attention computation then runs only over that selected set.

The pattern is retrieve-then-attend, and it bends the cost curve from quadratic in sequence length toward roughly linear past the selection cutoff, while DeepSeek reports model quality virtually unchanged.

At a million tokens of context the difference is on the order of a few hundredfold less attention work in prefill.
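The size of that gap can be estimated by counting query-key pairs. The sketch below assumes causal attention and a fixed selection budget of 2,048 keys per query, a hypothetical figure standing in for "the top few thousand"; it counts only the main attention work and ignores the cost of the indexer itself.

```python
def dense_pairs(seq_len):
    """Causal dense attention: token i attends to all i tokens up to itself."""
    return seq_len * (seq_len + 1) // 2


def sparse_pairs(seq_len, budget):
    """Retrieve-then-attend: each token attends to at most `budget` keys."""
    if seq_len <= budget:
        return dense_pairs(seq_len)
    # The first `budget` tokens are effectively dense; the rest are capped.
    return dense_pairs(budget) + (seq_len - budget) * budget


BUDGET = 2048  # assumed selection budget

for n in (8_192, 131_072, 1_000_000):
    ratio = dense_pairs(n) / sparse_pairs(n, BUDGET)
    print(f"{n:>9,} tokens: dense does {ratio:.0f}x the attention work")
```

Past the budget, the sparse count grows linearly in sequence length while the dense count keeps growing quadratically, so the ratio itself climbs with context.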

Figure 13. Sparse attention attacks the prefill cost itself. MLA cut the KV bytes crossing the seam; DeepSeek Sparse Attention then cut the prefill compute, selecting a fixed set of relevant key blocks before attending and bending quadratic attention toward linear at long context. Illustrative scaling for a fixed selection budget.

This matters for disaggregation in a way that compounds. Sparse attention shrinks both the prefill compute and, in its variants, the cache that must be carried, which shifts the prefill-decode balance again and makes long-context disaggregation economically viable at lengths that would have been hopeless under dense attention.

It is also the live frontier as of this writing. DeepSeek-V3.2 shipped in late 2025 as a 671-billion-parameter MLA-plus-MoE model with DSA layered on top, reaching reasoning quality the company benchmarks against GPT-5, and its Speciale variant took gold-medal scores at the 2025 International Mathematical Olympiad and the ICPC World Finals.

DeepSeek-V4, released in April 2026, extends the context window to a million tokens and replaces DSA with a successor called Compressed Sparse Attention, with its Pro variant reported at 80.6 percent on SWE-bench.

The trajectory is unmistakable: the attention mechanism is being rebuilt around the economics of long-context inference, and each rebuild changes the numbers in the serving stack beneath it. Architecture and serving are no longer separable disciplines.


The benchmark reality, and where it breaks

For two years the inference market argued over performance with vendor slides, which is no way to allocate billions in capital.

That changed in late 2025 when SemiAnalysis launched InferenceMAX, the first independent open-source benchmark to measure not raw throughput but total cost of compute across real models and real interactivity targets, running DeepSeek R1, GPT-OSS, Llama 3, and Qwen across the GB200 NVL72, B200, H200, H100, and AMD’s MI300X, MI325X, and MI355X, with Google TPU and AWS Trainium backends following.

It is the closest thing the field has to a neutral scoreboard, and the picture it paints is one a chip-level analysis would miss entirely [Figure 13].

Figure 13. Per-GPU throughput on DeepSeek R1 at a fixed 25 tokens-per-second-per-user interactivity, normalized to H200. The GB200 NVL72’s roughly tenfold lead over the H200, and threefold over the B200, comes from the rack-scale NVLink domain, not from a faster individual chip. Values from SemiAnalysis InferenceMAX; the advantage is interactivity-dependent.

At a fixed interactivity of 25 tokens per second per user on DeepSeek R1, the GB200 NVL72 delivers roughly ten times the per-GPU throughput of an H200 and about three times that of a standalone B200, even though the rack’s individual GPUs are no faster than a B200 in isolation.

The advantage is the 72-GPU NVLink domain, which lets the rack run wider parallelism and larger coherent batches than any single eight-GPU node can.

On absolute terms the B200 reaches 60,000 tokens per second per GPU at 1,000 tokens per second per user on GPT-OSS, and software optimization alone drove its cost on that model to two cents per million tokens, a fivefold reduction in two months.

NVIDIA frames the rack-level economics as a five-million-dollar GB200 NVL72 generating 75 million dollars in token revenue, a fifteen-fold return, though that figure prices output at favorable rates and should be read as a vendor’s best case rather than a realized margin.

The energy picture is cleaner and third-party: on DeepSeek R1 the GB200 NVL72 delivers roughly eight times the tokens per provisioned megawatt of a single-node H200, and Blackwell runs about 20 percent more energy-efficient than AMD’s CDNA4 on GPT-OSS, partly because the MI355X draws 1.4 kilowatts per GPU against the B200’s 1 kilowatt.

Now the part the headline numbers omit, and the part this publication exists to surface. The rack-scale advantage is not a constant. It is a function of interactivity, and it expires [Figure 14].

At 60 tokens per second per user the GB200 NVL72 produces a little less than triple a B200’s per-GPU throughput, but as the interactivity target rises the batch that the rack can assemble shrinks, and by around 130 tokens per second per user the workload fits inside a single eight-GPU node’s NVLink domain, at which point the NVL72’s scale-out advantage disappears entirely and it becomes more expensive per token than a standalone node.

The whole case for the rack rests on serving many users at moderate interactivity, the chatbot and agent regime; push to extreme single-user speed and the economics invert.

The benchmark also exposes a software truth that no spec sheet shows: AMD’s MI355X is competitive with the B200 on FP8 disaggregated prefill, but its disaggregated performance actually degrades at higher interactivity because the ROCm stack lacks the kernel and collective optimizations needed to compose multiple state-of-the-art techniques together.

Disaggregation is not a hardware capability you buy; it is a software capability you accumulate, and the gap between vendors is measured in kernels.

Figure 14. The rack-scale win has an expiry date. At moderate interactivity the NVL72 roughly triples a single node per GPU; push interactivity high enough that the batch shrinks into one eight-GPU node, and the advantage evaporates and inverts on cost. Shape after SemiAnalysis InferenceX v2.


The dial you must keep turning

Disaggregation hands you an operational problem that the headline numbers never mention, and every team that has run it in production knows it immediately.

Once the fleet is split into a prefill pool and a decode pool, you have to choose the ratio between them, and you will get it wrong, because the right answer keeps moving [Figure 15].

The problem is a producer-consumer imbalance: prefill instances produce cache that decode instances consume, and the production rate rarely matches the consumption rate. Provision too few prefill instances and prompts queue while first-token latency slips and the decode pool sits half-idle for want of work.

Provision too few decode instances and the prefill pool races ahead while per-token latency slips and the prefill pool sits half-idle. Either error strands capital on GPUs that cannot do useful work because the other pool is the bottleneck.

DeepSeek’s disclosed answer was a fixed three-to-nine ratio, three prefill nodes feeding nine decode nodes. The SGLang team reproduced a similar split on 96 H100s, twenty-four GPUs for prefill and seventy-two for decode, reaching 52,300 input tokens and 22,300 output tokens per second per node. It was the first open implementation to match DeepSeek’s own reported numbers, at a cost the team put at twenty cents per million output tokens, about one-fifth the official API price.

Figure 15. Disaggregation hands you a ratio you must keep correct. Too few prefill instances and first-token latency slips; too few decode instances and per-token latency slips. The optimum sits in a narrow band and drifts with every shift in prompt and output length. Schematic of the producer-consumer balance.

What makes this hard rather than a one-time sizing exercise is that the optimal ratio is not constant. It depends on the shape of the traffic, the ratio of input length to output length, and that shape shifts by the hour and by the product surface.

A wave of document-summarization requests is prefill-heavy and wants more prefill capacity; a wave of long-form generation is decode-heavy and wants more decode; a split that is optimal at noon is wrong by midnight.
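A first-order sizing calculation shows how quickly the right ratio moves with traffic shape. The sketch below is a steady-state estimate under stated assumptions: the per-instance token rates and the two traffic mixes are hypothetical round numbers, and it ignores queueing, burstiness, and SLO headroom, all of which a real planner has to handle.

```python
import math


def pool_sizes(req_per_s, mean_input_tokens, mean_output_tokens,
               prefill_tok_per_s, decode_tok_per_s):
    """Minimum (prefill, decode) instance counts to keep up with offered load."""
    prefill_demand = req_per_s * mean_input_tokens    # prompt tokens per second
    decode_demand = req_per_s * mean_output_tokens    # output tokens per second
    return (math.ceil(prefill_demand / prefill_tok_per_s),
            math.ceil(decode_demand / decode_tok_per_s))


# Hypothetical per-instance capacities.
PREFILL_RATE = 50_000   # prompt tokens per second per prefill instance
DECODE_RATE = 20_000    # output tokens per second per decode instance

summarization = pool_sizes(10, 8_000, 200, PREFILL_RATE, DECODE_RATE)
long_form = pool_sizes(10, 500, 4_000, PREFILL_RATE, DECODE_RATE)
print("summarization wave (prefill, decode):", summarization)
print("long-form wave     (prefill, decode):", long_form)
```

At the same request rate the two waves want opposite ratios, so a split sized for one strands capacity under the other.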

This is why the serious systems have moved to dynamic rebalancing, monitoring load in real time and shifting the ratio, and why a system like TaiChi goes further and switches whole instances between disaggregated and colocated modes depending on which yields better goodput at the current load.

The existence of that last capability is the tell: colocation is not always wrong, and a system smart enough to know when to disaggregate is smart enough to know when to stop.

Disaggregation converts a hardware-utilization problem into a scheduling-and-capacity problem, which is usually a good trade because software is cheaper to change than silicon, but it is a trade and not a free win, and a team that splits without building the rebalancing machinery will frequently lose to a well-tuned colocated deployment with chunked prefill.


The split spawned silicon

The most striking evidence that disaggregation has become foundational is not in any serving framework.

It is in the silicon roadmap, because once the phases run on separate machines, the machines stop needing to be the same machine, and the hardware vendors have noticed [Figure 16].

Figure 16. The split spawned silicon. Plotting accelerators by compute against memory bandwidth, vendors are now building to the corners: FLOPS-rich, cheap-memory parts for prefill, and bandwidth-rich parts for decode. Compute is a low-precision proxy across mixed formats; the positioning, not exact parity, is the point.

NVIDIA’s clearest statement is the Rubin CPX, announced in September 2025 and shipping at the end of 2026, a GPU built exclusively for the prefill and context phase.

Its logic is the wrong-sizing problem stated as a product: prefill is compute-bound and barely touches memory bandwidth, so dedicating expensive high-bandwidth HBM to it wastes the most expensive component on the chip.

The CPX therefore pairs 30 petaFLOPS of NVFP4 compute with 128 gigabytes of GDDR7, a memory that SemiAnalysis estimates is roughly five times more cost-effective per byte than HBM and runs at perhaps a quarter of HBM’s bandwidth, which is fine because prefill does not need the bandwidth.

It adds dedicated attention hardware delivering about three times the attention throughput of a GB300 NVL72, aimed directly at million-token context, and it ships with PCIe but no NVLink, because it is built for disaggregated inference racks rather than tightly coupled training clusters.

The packaging makes the intent explicit: the Vera Rubin NVL144 CPX rack pairs 144 CPX prefill GPUs with 144 standard Rubin decode GPUs and 36 CPUs, prefill and decode silicon racked side by side, the disaggregation thesis cast in metal.

The decode side is bifurcating too, and along a more radical axis. In December 2025 NVIDIA signed a roughly 20-billion-dollar licensing arrangement with Groq, whose language-processing units abandon HBM entirely for on-chip SRAM, 500 megabytes per chip at 150 terabytes per second, an order of magnitude past any HBM part.

For autoregressive decode at long context, where the entire bottleneck is reading state out of memory, SRAM’s bandwidth is decisive: a 70-billion-parameter decode at 128,000 tokens of context runs dramatically faster in memory-access time on an SRAM part than on an HBM GPU.

The emerging architectural division has GPUs handling training and prefill while specialized low-latency parts handle decode, which is prefill-decode disaggregation pushed all the way down to the level of distinct chip families.

And the trend is not NVIDIA’s alone. The DistServe authors report that Huawei, Enflame, MetaX, and Biren are all prototyping or deploying decode-specialized or attention-optimized accelerators built on exactly this philosophy.

A systems technique conceived to tame latency on homogeneous GPUs is now redrawing the boundaries of the accelerator market itself.


When it still doesn’t pay

Disaggregation is the substrate now, but it is not a universal good, and the honest boundaries matter as much as the wins [Figure 17].

SemiAnalysis, evaluating the Rubin CPX’s full hardware-level disaggregation, put the caveat precisely: complete disaggregation delivers excellent results only under certain ratios of input to output length and for long decode lengths, with other scenarios seeing underwhelming benefits.

The structure of the advantage explains the boundary. Disaggregation’s benefit grows with output length, because longer outputs mean more decode steps to protect from interference, and with offered load, because heavier load means more interference to remove.

In the opposite corner, short outputs under light traffic, there is little interference to eliminate and too little decoding for the protection to accrue, and the seam and the ratio overhead are pure cost. A well-tuned colocated deployment with chunked prefill wins there.

Figure 17. The decision is a regime. Disaggregation pays when outputs are long and load is heavy, and goes underwater for short replies under light load. The map is an illustrative model encoding the consistent direction of the evidence, not a measured surface. Characterize your own traffic before committing.

Three caveats compound this. The wrong-sizing problem never fully disappears even with disaggregation, because a pure prefill instance on an HBM part still underutilizes its memory bandwidth, which is the entire reason the Rubin CPX exists.

The interactivity cliff from the benchmark section means that even where disaggregation and rack-scale hardware win at moderate interactivity, the advantage inverts at extreme single-user speed, when the batch collapses into a single node.

And the software-maturity tax, visible in AMD’s degraded high-interactivity disaggregation, means the gains are not portable across stacks; they have to be earned kernel by kernel.

The strategic read is not “disaggregate or do not disaggregate.” It is to characterize your traffic on the two axes that matter, real output-length distribution and real peak-to-trough load, then build for your regime, and buy the interconnect before you buy the split, because on a slow fabric the seam eats the gain and on a fast one it disappears.

The shake, settled

Return to the 600 billion dollars. The market’s first instinct, that cheaper inference means less demand for the chips that serve it, has been falsified about as cleanly as a macro thesis ever is.

The cost collapse was real, from thousands of dollars per benchmark task to about eleven, and demand did not fall; it ran so far past the efficiency gains that the Peterson Institute concluded usage had dwarfed them, the Jevons paradox playing out in real time.

The technique that frightened the market, efficient inference built on disaggregation and sparsity and compressed attention, did not shrink the industry. It enlarged it, by making applications viable that were uneconomical at the old prices.

But there is a second-order effect the bullish reading often misses, and it is where the real consequence sits. DeepSeek did not just demonstrate cheap inference; it open-sourced the means of production. DeepEP, 3FS, the full reference architecture, all released to anyone.

NVIDIA open-sourced Dynamo and NIXL. Mooncake, llm-d, LMCache, SGLang, and vLLM are all open. The result is that the disaggregated serving stack, the thing that lets you run a frontier model at a fraction of the naive cost, is no longer a moat.

It is a commodity any competent team can deploy, which is exactly why the SGLang reproduction could serve DeepSeek at one-fifth the official API price. The value did not disappear; it moved.

SemiAnalysis frames the shift as a move from raw FLOPS per chip to total intelligence per dollar at rack scale, and the InferenceMAX results bear it out: the GB200 NVL72 wins not because its chips are faster but because 72 of them act as one, and the orchestration software across them is as much of the product as the silicon.

The moat migrated from the model and the kernel, which are now shared, to the rack-scale system integration and the interconnect, which are hard to replicate and hard to buy.

The frontier from here is the generalization of the same idea to the next seam. The DistServe authors point to attention-FFN disaggregation as the natural successor: within decode, attention is memory-bound and hungry for KV-cache bandwidth while the feed-forward layers are compute-bound and hungry for weight storage, so splitting them onto tailored hardware lets each reach high utilization independently, and the same logic that justified the prefill-decode split applies one level down.

For dense models this was long considered impractical because it doubles the activation transfer per layer, but the MoE models that now dominate already perform two all-to-all operations per decode step, so the attention-FFN split can be folded into the communication pattern that already exists, making its extra transfer nearly free.

MegaScale-Infer and StepFun’s Step-3 have already demonstrated it on large MoE models. The pattern is always the same: find a boundary in the computation where the two sides want different hardware or different parallelism, split there, and pay a transfer cost at the new seam in exchange for independent optimization on each side.

The question every such split raises is the question this entire issue has circled. Is the thing you carry across the new seam small enough, and the wire fast enough, that the split pays?

DeepSeek published a margin and the market saw a number. The number was a consequence.

The cause was an architecture that took two workloads with opposite appetites and stopped forcing them to share a plate, and within eighteen months that architecture became the floor that everything else is built on, dragged the hardware roadmap behind it, and survived a 600-billion-dollar referendum on whether efficiency was a threat or an accelerant. The split won.

What remains, and what will keep deciding the economics as the seams multiply and the models keep rewriting themselves underneath, is the same thing it always was: how honestly the system respects the shape of the work, and how fast the wire is at the seam.


We wrote the CUDA reference we could not find

Lorenzo Bradanini — Sat, 20 Jun 2026 08:35:38 GMT

We have both lost more hours than we want to admit to questions that should have a clean answer. Here is one.

On Hopper, a WGMMA sources its operands from shared memory and registers. On Blackwell, the new UMMA instruction can read its A operand straight from tensor memory, a level of the hierarchy that did not exist a generation ago.

That single change reshapes how you build a pipeline. When we went looking for it written down in one place, it was not there. It was scattered across a PTX spec, a CUTLASS example, and a couple of microbenchmark threads.

That was the pattern everywhere. Occupancy rules in one tuning guide. Descriptor encodings in another. The genealogy of the tensor cores living in our heads after weeks of trial runs. Nothing sat in one place we could read front to back.

So we wrote that place.

We are not technical writers who picked up CUDA for a book. We write low-level GPU and systems code for a living, and we got tired of not having this on our own desks.

CUDA Mastery 2026 is a deep technical handbook for engineers who want to understand and optimize CUDA on modern NVIDIA GPUs, from the fundamentals through Hopper, Blackwell, and Blackwell Ultra. It is current to CUDA 13.x and the fifth generation of tensor cores.

It is not a syntax tutorial. Most CUDA resources teach you syntax, a few explain how the hardware actually works, and almost none teach you to reason about performance before you write a line. This is the one that does.

Here is what it covers, as one integrated reference instead of forty open browser tabs: SM internals, warp scheduling, scoreboarding, occupancy, and latency hiding. The full memory hierarchy, coalescing, shared memory, TMA, and Tensor Memory.

The tensor core path from WMMA to WGMMA to UMMA. Roofline modeling, bottleneck analysis, and Nsight profiling. CUTLASS, CuTe, CUDA Tile, PTX, SASS, and JIT. Multi-GPU with NCCL, NVLink, NVSwitch, and NVSHMEM. And real kernel walkthroughs: SGEMM, Flash Attention, reductions, scan, sort, and sparse ops.

Two rules we held ourselves to. Every specification is footnoted to a primary source, whether an NVIDIA whitepaper, the PTX ISA, a tuning guide, or published microbenchmarks, with nothing paraphrased from a blog you cannot trace.

And specification stays separate from measurement, so you always know whether a claim comes from a document or from a number someone ran.

Think about what one wrong optimization actually costs: a day chasing a bottleneck that was never there, or a kernel rewritten three times because an operand sat in the wrong memory. This reference shortens that loop. The next time you open Nsight and see a stall, you will know which part of the machine is talking to you.

It is written for engineers who already ship CUDA. If you have never launched a kernel, start somewhere gentler and come back when you want to go deep.

The book is €89, as a single payment or in 10 installments, which is less than the going hourly rate for someone who already knows this. Every future v1.x update and correction is included free.

We built this because we wanted it on our own desks. It is on yours now if you want it.

Get CUDA Mastery 2026

Thank you in advance for your extremely precious support.

The Kill Switch Was in the Mail

Lorenzo Bradanini — Fri, 19 Jun 2026 13:30:29 GMT

Introduction

At 5:21pm Eastern on Friday, June 12, Anthropic received a letter. By midnight, the company had switched off the two most capable models it had ever released, for every customer on earth.

The letter came from the Department of Commerce. It invoked national security authorities and ordered Anthropic to suspend all access to Claude Fable 5 and Claude Mythos 5 by any foreign national, anywhere, including the company’s own non-citizen employees.

Anthropic’s own statement, published that night, explained the mechanics of compliance plainly: because there is no clean way to fence a live API endpoint by passport in real time, the only way to obey the order was to take both models down for everyone. Access to Opus 4.8, Sonnet, and Haiku was untouched. Fable 5 had shipped three days earlier. Its IPO-grade launch week ended with a kill order.

There are essentially two ways to read what happened, and the honest position is to hold both at once.

The first reading is narrow and technical. The trigger, by Anthropic’s account, was a single bypass technique that amounts to pointing the model at a codebase and asking it to fix the security flaws.

The company says the capability is widely available, including from OpenAI’s GPT-5.5, and is used every day by the defenders who keep systems running. On this reading, the suspension is an overzealous first use of a blunt instrument, a misunderstanding that both sides want resolved, and a story that will be over in days to weeks.

The second reading is structural and does not go away when the models come back. For the first time, the United States has reached for export-control machinery, designed for physical dual-use technology, and used it to pull a commercial AI model out of the hands of hundreds of millions of people with a Friday-afternoon letter.

The kill switch was never in the model. It was not in the classifiers, the thousand hours of red-teaming, or the thirty-day data retention. It was in the mail. And once a mechanism like that has been used, it exists for every lab, every Friday, from now on.

This piece is about both readings, and about a third fact that reframes the whole episode: this is not the first time in 2026 that the US government has moved against Anthropic specifically. It is the second, from a different department, under a different legal theory, in the space of three months. The pattern is the story.

Fig. 1 Seventy-two hours from launch to global shutdown.

What got pulled

To understand the stakes, you have to understand what was switched off, because it was not an ordinary model.

On June 9, Anthropic launched Claude Fable 5 and Claude Mythos 5 together. In the company’s framing, both belong to a new tier it calls Mythos-class, sitting above the Opus class in raw capability.

The first member of that tier, Claude Mythos Preview, had been released quietly in April through a government-linked program called Project Glasswing, and never offered to the public.

Fable 5 was the moment that tier went mainstream. In Anthropic’s own words, its capabilities exceed those of any model the company had ever made generally available, with the lead over Opus 4.8 widening as tasks grow longer and more agentic.

Fable 5 and Mythos 5 share the same underlying weights. The difference is entirely in the safeguards, which is also why they carry different names. Fable, from the Latin fabula, is the version made safe for general release. Mythos is the same model with safeguards lifted in specific domains, reserved for vetted cyberdefenders and critical-infrastructure operators inside Glasswing. Anthropic describes Mythos 5 as having the strongest cybersecurity capabilities of any model in the world.

The headline numbers were not subtle, and software engineering was the spine of the launch. On Anthropic’s published evaluations, Fable 5 posted 80.3 percent on SWE-bench Pro against 69.2 for Opus 4.8 and 58.6 for OpenAI’s GPT-5.5, and scored highest among frontier models on Cognition’s harder FrontierCode set, even at medium effort.

The widely quoted 95 percent on SWE-bench Verified is real but should be read with care, because that benchmark is saturating; the gap that means something is the roughly eleven points on the harder Pro variant.

Stripe, in early testing, used the model to perform a codebase-wide migration of a 50-million-line Ruby codebase in a single day, work it estimated would have taken a team over two months by hand.

The capability that matters most for the rest of this story is the one the public model hides, and it is worth being precise about what it is.

Anthropic’s cybersecurity evaluations measure offensive progress directly: the Firefox benchmark scores the fraction of trials that reach arbitrary code execution, OSS-Fuzz is a severity-weighted score that runs from a basic crash up to a full control-flow hijack, and CyberGym, a Berkeley suite of more than 1,500 real-world tasks built on Google’s fuzzing corpus, scores whether the model can reproduce a genuine vulnerability and prove it with a crashing input.

On the OSS-Fuzz measure, an independent reading of the Mythos 5 system card by Epoch AI put the unsafeguarded model’s crash rate at 80 percent against Opus 4.8’s 61.5.

The detail that matters most is that Anthropic says it did not train these abilities in; they emerged as a byproduct of general gains in code understanding and autonomy, which is the technical reason they cannot be cleanly excised.

Fable’s classifiers are tuned to block any progress on exactly these tasks. Pricing landed at ten dollars per million input tokens and fifty per million output, double Opus 4.8, and by Anthropic’s own description less than half what the retired Mythos Preview had cost. The context window is one million tokens, with up to 128k output.

For practitioners, the integration surface is where the safety architecture becomes concrete, and it is a fallback rather than a wall. By Anthropic’s account, when Fable’s classifiers flag a request touching one of three domains, the response is not refused but quietly handled by Claude Opus 4.8 instead, and the user is told it happened.

The model ID is claude-fable-5, a drop-in string swap; developer documentation indicates the fallback surfaces in the API response with a refusal stop reason and the triggering category named, so a client can route or retry, though Anthropic’s own materials describe the behavior rather than the field names.
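A client-side check for that fallback might look like the sketch below. Everything about the schema here is an assumption: the source describes the behavior (a refusal stop reason plus a named category) but not the field names, so the keys `stop_reason` and `fallback_category`, the value `refusal`, and the helper names are hypothetical placeholders. The code works on a plain dictionary and calls no real SDK.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

# Hypothetical schema: these key names and values are placeholders,
# not documented fields of any real API.
STOP_REASON_KEY = "stop_reason"
REFUSAL_VALUE = "refusal"
CATEGORY_KEY = "fallback_category"


@dataclass(frozen=True)
class FallbackInfo:
    fell_back: bool
    category: Optional[str]


def inspect_response(response: Mapping[str, Any]) -> FallbackInfo:
    """Report whether a response was flagged, and under which category."""
    if response.get(STOP_REASON_KEY) == REFUSAL_VALUE:
        return FallbackInfo(True, response.get(CATEGORY_KEY))
    return FallbackInfo(False, None)


def next_step(info: FallbackInfo) -> str:
    """Pick a client-side action; the policy here is illustrative only."""
    if not info.fell_back:
        return "accept"
    # A flagged request is surfaced for review rather than retried blindly.
    return "review:" + (info.category or "unknown")
```

Separating detection (`inspect_response`) from policy (`next_step`) keeps the guessed field names in one place, so only the three constants would need to change once the real schema is known.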

The three covered domains are the load-bearing detail: offensive cybersecurity, biology and chemistry, and distillation, the last being organized attempts to extract Claude’s capabilities to train competing models, which Anthropic says it has already detected at scale originating from authoritarian states.

Those classifiers route fewer than five percent of sessions away from the frontier model; for the other ninety-five, Anthropic says Fable performs effectively identically to the unsafeguarded Mythos 5.

Both Mythos-class models are designated Covered Models, carrying a mandatory thirty-day data retention even for enterprise customers who previously held zero-retention agreements. None of this is cosmetic.

Under Anthropic’s Responsible Scaling Policy, a model with this much cyber and biological capability is handled at AI Safety Level 3, the tier whose rules require hardened deployment and security controls, and the classifiers, the fallback, and the retention are the concrete form those requirements take.

That retention policy is part of what Anthropic points to as evidence of seriousness, and its definitions matter for everything that follows. A universal jailbreak, in Anthropic’s own terms, is any prompt, script, or harness that lets a user interact with the model as if its safeguards were not present; a non-universal one elicits some capability only in narrow circumstances, or needs reworking for each new case.

Anthropic says an external bug bounty ran more than a thousand hours without producing a universal jailbreak, that Fable complied with zero harmful single-turn cyber requests even when tested against thirty different public jailbreak techniques, and that the thirty-day retention exists precisely so it can detect and patch novel attacks in flight.

It also disclosed, in the same launch post, that the UK AI Security Institute had made progress toward a universal jailbreak within a brief testing window. That admission matters more than it looks: Anthropic’s case was never that Fable is unbreakable, but that breaking it is slow and costly enough to monitor.

That is a probabilistic claim, not an absolute one, which is exactly why the dispute that followed turned on what counts as a serious enough break.

Fig. 2 What got pulled: the only public model above Opus.

Fable 5 was not a careless release. It was, by design and by the company’s loud public framing, the most safety-instrumented frontier launch the industry had produced. That is what makes the next three days strange.

Fig. 3 The safeguards the government overrode.

The trigger, and the call that wasn’t answered

The cleanest account of how the letter came to be written runs through Anthropic’s own cap table.

By the reporting of the Wall Street Journal and The Information, later echoed by Bloomberg, the bypass that set everything in motion was found not by an adversary but by researchers at Amazon, Anthropic’s largest investor. Amazon’s chief executive, Andy Jassy, took the finding straight to Washington, personally alerting Treasury Secretary Scott Bessent that his researchers had demonstrated a way around Fable 5’s cybersecurity guardrails.

The escalation reached Commerce Secretary Howard Lutnick the same day. By one reconstruction of the timeline, the government first called Anthropic at 1:00pm Eastern; the signed letter followed hours later, at 5:21. Dario Amodei spent that Friday on calls with cabinet officials, arguing that Amazon’s technique did not amount to a true jailbreak. The letter went out anyway.

What was the technique? Here the reporting converges on something almost anticlimactic. Anthropic’s statement describes it as asking the model to read a specific codebase and fix any software flaws, a single-instance, non-universal bypass that surfaced a handful of previously known, minor vulnerabilities.

The most authoritative outside account is from Katie Moussouris, the founder of Luta Security, who serves on the Commerce Department’s own Information Systems Technical Advisory Committee and the Cyber Safety Review Board, and who says she is the only outside expert to have actually read the underlying research paper. Her summary is blunt.

The researchers took open-source code with known CVEs, plus new code with deliberately planted bugs, and asked Fable 5, Mythos, and Opus to review it for security issues. Fable 5 refused. They then asked the models to fix the code, and through a manual, multi-step process turned the output into scripts that test the patches.

That, she writes, is it. She proposed, only half in jest, a 1990s-style t-shirt: “fix this code” on the front, “this shirt is a munition” on the back.

Her technical point is the one that should worry policymakers more than the prompt itself. The reason the technique works is that it is a defensive request. Asking a model to find a bug, explain why it matters, and write a test that confirms the patch is the single most valuable thing an AI can do for defensive security, the find-fix-test loop that defenders run every day.

You cannot remove that capability, she argues, without making the model worse at defending real systems. There is no surgical edit that deletes the offense and keeps the defense, because at this level they are the same muscle.

An independent test complicates the picture from the other direction, and it is worth holding alongside Anthropic’s own numbers.

The security firm Endor Labs benchmarked Fable 5 on two hundred real-world vulnerability-fixing tasks and found it landed mid-table, at 59.8 percent on functional correctness and just 19 percent on producing a genuinely secure fix, with a record number of timeouts the firm attributed to the model’s extended thinking.

Their reading cuts both ways: Anthropic’s headline cyber numbers, they note, mostly measure offensive progress, exploits and proofs of concept, not whether the model writes safe production code.

So the capability that triggered the recall is real and measurable, but it is specifically an offensive-discovery capability, and even that is uneven across the cyber task surface.

Fig. 4 The gap between what was cited and what would justify a recall.

This is the crux of Anthropic’s public objection.

The company says it has been given only verbal evidence of a narrow, non-universal bypass, that no concerning jailbreak producing a genuinely harmful result has been disclosed to it, and that recalling a model deployed to hundreds of millions of people over a narrow finding would, if applied as an industry standard, halt all frontier deployments everywhere.

Anthropic has been explicit, including in Amodei’s own policy writing, that it believes the government should be able to block unsafe deployments, but through a process that is transparent, fair, and grounded in technical fact. Its complaint is not that the power exists. It is that this use of it does not meet that bar.

Two rationales, and which one is real

Here the official story develops a seam, and the seam is worth prying at, because it determines which of the two readings from the top of this piece is the operative one.

Publicly and to Anthropic, the stated trigger was the jailbreak. But the Lutnick letter itself, a copy of which Reuters reported seeing, frames the danger differently: as the risk that the models could be diverted to military intelligence users in China, Russia, or other countries of concern.

Those are not the same justification. One is a technical claim about a safeguard failing. The other is a geopolitical claim about who might get access. And the divergence is not merely rhetorical: the regulatory provision the order reportedly invokes is one aimed at military-intelligence end users in countries of concern, which means the diversion rationale is built into the legal instrument itself, even as Anthropic’s own public statement names only the jailbreak.

The hook and the explanation point in different directions. Semafor reported that White House concerns ran beyond the bypass entirely, with officials suspecting that a China-linked group had accessed Mythos before the shutdown, a suspicion, it should be stressed, that has not been publicly substantiated.

Anthropic says the question of Chinese access was never raised in any of its conversations about the jailbreak.

The accounts diverge on the human level too. David Sacks, the White House AI adviser, has said the administration gave Amodei a clear choice, fix the jailbreak or de-deploy the models, and that Amodei refused. Anthropic disputes that characterization. As of this writing the competing versions have not been reconciled, and a reader should treat both as contested.

Why does the seam matter? Because if the real driver is a narrow technical bypass, the episode is fixable, and probably will be fixed, with some added vetting layer.

If the real driver is diversion risk to adversary states, the model may stay restricted no matter what Anthropic does to its classifiers, because the concern is not about the safeguard at all. And if the real driver is neither, if national security is functioning as the available lever in a relationship that had already curdled, then the technical merits are almost beside the point.

Michael Horowitz of the Council on Foreign Relations, speaking about the earlier Anthropic dispute, called it a fight “about politics and personalities” that was “masquerading as a policy dispute.” That framing did not come from nowhere.

To see why, you have to look at what happened in March.

A part most coverage missed

The Fable 5 suspension has been reported as a discrete event. It is better understood as the second of two extraordinary statutory actions against Anthropic inside a single quarter, the sharp end of a federal campaign that had been escalating since winter, and the earlier action tells you a great deal about this one.

The pressure began before either statute was invoked. By Fortune’s account, the Trump administration moved as early as February to push federal agencies off Anthropic’s models, after the company refused the Pentagon’s preferred contract language permitting use of Claude for any lawful purpose.

Anthropic wanted two carve-outs: no fully autonomous lethal weapons, and no mass surveillance of Americans.

That dispute hardened into something unprecedented on March 3, when Secretary of Defense Pete Hegseth designated Anthropic a supply-chain risk to national security, invoking a rarely used statute, Section 3252 of Title 10, and barring defense contractors, suppliers, and partners from doing business with the company.

As the Congressional Research Service and Courthouse News documented, the same refusal was the trigger. The Pentagon’s position, argued later before a DC Circuit panel, was that those guardrails amounted to an “operational veto,” a remote kill switch the company could trigger based on its own reading of lawful versus unlawful use.

Hegseth put it less legally on social media: America’s warfighters, he wrote, would never be “held hostage by the ideological whims of Big Tech.”

The supply-chain-risk label is historically reserved for foreign adversaries. Applying it to a US company that was, until weeks earlier, the Pentagon’s hand-picked frontier vendor, the first whose models ran inside classified networks, was extraordinary.

Anthropic sued on March 9, calling the action unprecedented and unlawful. By May, a three-judge DC Circuit panel hearing one of the two challenges sounded openly skeptical of the government’s reasoning, with the court pressing on whether a guardrail against, in one judge’s framing, telling an unreliable model “which bombs to drop” really constituted a supply-chain risk.

The litigation is unresolved. In the meantime, agencies including Health and Human Services, Treasury, and State confirmed they were migrating off Claude, even as the Department of Defense, by CNBC’s reporting, kept using Anthropic’s models to support active military operations in Iran, and the NSA, by later accounts, continued using Claude for its own work. Blacklisted and indispensable at the same time.

There is a political layer underneath the legal one. Amodei has drawn fire from Sacks, who has accused Anthropic of pushing “woke AI,” largely over its regulatory positions.

A Defense official told CNBC the March decision was about “the military being able to use technology for all lawful purposes,” not personality. But the pattern that emerges when you put the two actions side by side is hard to wave away.

Fig. 5 Two federal actions against one company in three months.

Two departments. Two legal instruments that had never been pointed at an AI company before. One target. In March, the objection was that Anthropic’s safety restrictions were too strong, that it would not let the government use Claude freely enough.

In June, the objection was that Anthropic’s safety restrictions were too weak, that a bypass let users reach capability the safeguards were meant to contain. The company was punished, in the same quarter, both for having guardrails and for not having good enough ones.

Whatever else that is, it is not a stable regulatory environment, and it is the context every line of Anthropic’s confidential S-1 now sits inside.

The machinery, and why it is new

The legal vehicle deserves a closer look, because its novelty is the part with the longest half-life. The order took the form of what export lawyers call an “is informed” letter from the Commerce Department’s Bureau of Industry and Security, the mechanism by which BIS notifies a specific company that a license is now required for specific items. That mechanism is not itself new.

What is new is pointing it at a deployed commercial model. Reuters, which reported seeing the letter, said Commerce invoked the Export Control Reform Act of 2018 and its authority over emerging and foundational technologies; analysts parsing the public reporting place the regulatory hook in the export rules covering military-intelligence end users, though the letter itself has not been released, so the precise provisions remain a reconstruction.

The directive requires a license for any export, re-export, or in-country transfer of the models to a foreign person, and warns that noncompliance brings prompt criminal and civil penalties.

The doctrine doing the work is the deemed export, and understanding it is the whole game. Under the export regulations, releasing controlled technology or source code to a foreign national inside the United States has for decades been treated as an export to that person’s home country.

Showing a controlled blueprint to a foreign-born engineer in a San Francisco office is, in law, a shipment abroad. The letter applies that logic to a model: every inference a foreign national draws from Fable arguably releases the controlled capability to them, which is why the order reached foreign nationals on US soil and Anthropic’s own non-citizen staff, and why no geofence could satisfy it.

Read with an engineer’s eye, the theory strains at a seam. Export controls were built for things that cross borders: chips, machine tools, encryption binaries, centrifuge designs.

The cited rules speak of technology and source code, but a hosted model hands the user neither. It hands them inference, an outcome rather than an artifact, and the weights never leave Anthropic’s data centers.

As one former federal prosecutor put it to CIO, the physical location of the source code has become irrelevant; what is being controlled is access to what the code can do. That is a real shift in what export means, and it is not clearly settled law.

Export controls have not traditionally reached foreign access to US software as a service, which is precisely why, as Lawfare noted, the House passed a Remote Access Security Act in January to extend export jurisdiction to remote access of controlled US technology. The government reached for a theory that Congress is, at the same moment, trying to write into statute because it is not yet clearly there.

There is one more piece of timing that turns the episode from aggressive into nearly incoherent. Ten days before the letter, on June 2, the same administration signed an executive order titled “Promoting Advanced Artificial Intelligence Innovation and Security,” setting up a voluntary framework, to be designed by August 1, under which developers could offer the government early access to frontier models up to thirty days before release.

The President had just cut a planned ninety-day cybersecurity review window down to thirty, and by multiple accounts the order explicitly barred any mandatory licensing or pre-clearance regime. Anthropic, OpenAI, and Google all welcomed it as the reasonable version of government engagement: a request, not a rule.

The framework was not built. The August deadline had not arrived. And before the voluntary process could take its first breath, the government reached for the binding instrument it had just promised not to use. Voluntary in the announcement, mandatory in the execution, ten days apart.

The own-goal: who export control actually reaches

Set aside the question of whether the safeguard failed, and ask the colder question an export-control regime is supposed to answer: does it stop the bad outcome it names?

The collected judgment of the security profession is that it does not, and may do the reverse. More than eighty cybersecurity executives, including leaders at firms such as Nvidia and Adobe, signed an open letter to Lutnick and National Cyber Director Sean Cairncross over the weekend asking that the controls be lifted, an effort now hosted at freefable.org.

Moussouris’s argument is the technical spine of their case: the capability the order targets is defensive, the people it cuts off are defenders, and the attackers it is meant to thwart never needed a US endpoint in the first place.

Fig. 6 The asymmetry at the heart of the policy.

The asymmetry is stark. On one side of the ledger, the order cuts off allied-nation cyberdefenders, foreign-national engineers working legally in the United States, Anthropic’s own staff, the Glasswing partners across more than fifteen countries who were using Mythos precisely to secure critical infrastructure, and any researcher running a find-fix-test loop.

On the other side, it reaches none of the things it is nominally about. It does not touch Chinese open-weight models; on June 13, the day after the shutdown, the Chinese lab Zhipu AI shipped GLM-5.2 and explicitly cited the US ban as evidence that American models are unreliable partners.

It does not touch other frontier labs’ cyber-capable models, which by Moussouris’s reckoning carry fewer guardrails and comparable capability, with the rest of the field expected to match Mythos-class capability within months.

It does not touch adversary state actors, who have their own systems. And it does not touch the capability itself, which is diffusing regardless. You cannot, as Moussouris puts it, export-control your way to cyber resilience.

The contradiction has not been lost on people who know the regime from the inside. Dean Ball, an AI policy analyst who briefly served in this administration, called the action cartoonish, pointing to the oddity of a government that waves advanced AI chips through to China while barring Britain and every other allied user from its best models.

Moussouris, who lived through the last version of this fight, put the bottom line more bluntly: if national defense was the goal, the order scored an own goal against the United States.

That earlier fight is the precedent. When the Wassenaar Arrangement added controls on “intrusion software” in 2013, the definition was written so broadly that it reached the routine cross-border work of defense itself: sharing exploit proofs of concept, coordinating vulnerability disclosure, running incident response. The United States declined to implement the original language, and the text had to be renegotiated in 2017 to carve defensive research back out.

The Fable 5 directive, signed in an afternoon, has the same shape and none of the deliberation.

The sovereignty bill comes due

The diplomatic cost is the part that will outlast the news cycle, and it lands closest to home for anyone reading this from outside the United States.

For every allied government that had quietly assumed continuous access to the frontier, the message of June 12 was unambiguous: that access is revocable, unilaterally, on a domestic legal theory you have no vote in, with a few hours’ notice.

The reaction was immediate and came from the top. At the G7 summit at Evian-les-Bains this week, the suspension became one of the sharpest flashpoints on the agenda. UK Prime Minister Keir Starmer raised the blackout directly with Trump and asked for a carve-out restoring access for British citizens and businesses. Washington rebuffed it.

UK AI minister Kanishka Narayan put the stakes plainly, noting that the most advanced AI in the world had just been cut off for everyone in Britain, and argued the episode proves the case for sovereign AI capability.

Canada’s Mark Carney framed it as a warning about overreliance on any single model. The European Union, already the stricter regulator, now has fresh reason to reduce its dependence on American AI infrastructure.

Out of that pressure, a workaround is taking shape, and its shape is the most revealing part of the whole episode. According to Reuters, the Financial Times, and Axios, a US delegation led by Lutnick spent the summit’s sidelines negotiating a “trusted partners” framework: a sanctioned channel through which vetted allies, either whole countries or individual companies, could regain access to the controlled models.

The pitch is cybersecurity, the same allied-defense logic the order itself invoked. Read it against what Anthropic did on June 12 and the irony is exact. The company declined to gate access by nationality and shut the models off rather than operate that system. Governments are now assembling the same nationality gate themselves, one level up, and calling it a partnership.

The selective-access layer that was too compromising for a private firm to run is being rebuilt as a diplomatic club, in which reaching the frontier becomes a privilege Washington grants to allies rather than a product a company sells to customers. No agreement has been reached, and which countries and which companies would qualify is undefined.

The structural irony compounds. An action justified by the need to deny capability to adversaries functions, in practice, as the strongest possible argument for those adversaries’ domestic alternatives, and for allied sovereign-AI programs that route around US providers entirely.

Zhipu’s launch timing was not a coincidence; it was marketing handed to Beijing for free. This is the AI-sovereignty debate stripped of abstraction. It is no longer a conference panel.

It is a procurement decision that every non-US enterprise and government now has to price, and the new line item is the probability that the best model on the market disappears for an indeterminate number of weeks because of a dispute in Washington it cannot see coming.

An IPO inside the blast radius

All of this is happening on the worst possible calendar.

By the reporting around the launch, confirmed in the company’s own filing announcement, Anthropic confidentially filed its IPO prospectus with the SEC on June 1, eight days before Fable 5 shipped and eleven before it was pulled.

The filing followed a 65 billion dollar Series H that valued the company at 965 billion post-money, ahead of OpenAI’s 852 billion from late March, on a revenue run-rate that had crossed 47 billion by Anthropic’s May disclosure. The growth is real and almost vertical: Anthropic told investors it expects 10.9 billion dollars of revenue in the second quarter alone, more than double the first. The margin underneath is thinner than the headline.

Its own projected second-quarter operating profit implies a margin of roughly five percent, and at the reported valuation the company is priced near twenty times annualized revenue, a multiple that assumes years of uninterrupted hypergrowth.

OpenAI confirmed its own confidential filing days later, and SpaceX is gearing up for a record public debut this week.

Three of the largest IPOs in history are converging, and one of the three just demonstrated, live, that its flagship product can be switched off by a letter.

Fig. 7 A flagship that can be pulled by letter, mid-roadshow.

In an earlier piece for this newsletter, I argued that Anthropic’s valuation rested on a question no one could yet answer, because the S-1 was sealed: whether the revenue was high-margin, organically demanded, and durably embedded, or low-margin and propped by a circular capital structure. Two different businesses, hiding inside the same black box. The Fable 5 episode adds a second axis of the same kind.

There are now two different regulatory realities the company could be living in, and the filing does not tell you which.

In one, these are isolated frictions with an administration that will normalize, and the government dependency cuts the other way, toward Glasswing, toward classified deployments, toward a company so embedded in national security that it is too important to sideline.

In the other, the through-line from the Pentagon blacklist to the Commerce directive is a durable hostility that will keep surfacing as new restrictions, new carve-outs, and new headline risk, precisely the kind a roadshow cannot price.

The market’s read leans toward the benign outcome on the immediate question, though these are live prediction-market prices that move by the day, not fixed facts, and different outlets captured different values within the same week.

As of mid-June, Kalshi’s own market put the odds of Fable 5 returning before July 1 at 57 percent, before July 10 at 67, and before July 17 at 75; a parallel Polymarket contract ran somewhat higher, around seventy percent for a US return by July 1.

Kalshi traders separately gave Anthropic a 77 percent chance of reaching the public markets before OpenAI. The figures drift with each session, but across both venues the signal is the same: a bet on “restored, with conditions,” not on a permanent kill.

Fig. 8 The market expects “restored, with conditions,” not a kill.

It is worth noting what the comparison to OpenAI implies. OpenAI built its government posture around vetted, tiered access: an explicit government track, a Defense Department pilot, sensitive cyber capability released through approval rather than open availability.

That is closer to what regulators have signaled they want. Anthropic, by shipping a Mythos-class model to the broad public first, was arguably the lab that tested the line. It found the line. The cost of being first to the frontier, this quarter, was being first to the letter.

What this actually means

Strip away the specifics and four things are left standing, in rough order of how long they will matter.

The immediate event is probably resolvable, and the machinery of its resolution is already visible. Anthropic sent senior engineers to Washington and met officials through the week, and by the weekend the contours of a deal were forming around vetted access rather than open availability: the trusted-partners framework floated at the G7, plus a reported identity-verification layer, both of which look a great deal like how Mythos 5 was restricted in the first place.

Even David Sacks, no friend of the company, has said the administration’s hope is that Anthropic fixes the issue, the directive is revoked, and Fable returns to general availability. A permanent kill of a flagship the company just filed an IPO around would be the surprise, not the base case.

The precedent is not resolvable, and it is the real news. Frontier AI has been reclassified, in practice if not yet in settled law, as a controlled export.

The mechanism to pull a live model from the entire market now exists, has been used, and survives whatever happens to Fable 5 specifically. The trusted-partners talks make this worse, not better: they do not roll the precedent back, they institutionalize it, turning a one-off emergency letter into a standing system in which allied access is licensed rather than assumed.

Every lab, OpenAI and Google included, now operates one Friday letter away from the same outcome, and the whole industry has just watched the vetted-access posture become the one that survives contact with Washington, while the open-release model gets tested to destruction.

For practitioners, the lesson is architectural and immediate. If your stack had a hard dependency on a single frontier model, June 12 was the day that risk stopped being theoretical.

The pragmatic move while this plays out is the one Fable 5 itself makes on high-risk queries: fall back to Opus 4.8, usually a one-line model-ID change, since it is live, unaffected, and the model Fable defers to anyway.

AWS did exactly this at the infrastructure level, automatically rerouting Fable and Mythos calls to Opus 4.8 the moment the order landed.

The deeper lesson is to treat model availability as a dependency to be abstracted and load-balanced, not a constant. The capability gap between Mythos-class and Opus-class is real, widest on long-horizon agentic work, and for some workloads there is no clean substitute today. That gap is now a supply risk, and supply risk gets designed around.
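
The fallback pattern described above can be sketched as a thin routing layer. This is a minimal illustration under stated assumptions, not any vendor's SDK: `call_model`, the `ModelUnavailable` exception, and the model ID strings are hypothetical placeholders for whatever client and identifiers your stack actually uses.

```python
from typing import Callable, Optional, Sequence, Tuple


class ModelUnavailable(Exception):
    """Raised by the (hypothetical) client when a model cannot be reached."""


def complete_with_fallback(
    prompt: str,
    model_ids: Sequence[str],
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each model in preference order; return (model_id_used, response)."""
    last_error: Optional[Exception] = None
    for model_id in model_ids:
        try:
            return model_id, call_model(model_id, prompt)
        except ModelUnavailable as err:
            last_error = err  # remember why this tier failed, then try the next
    raise RuntimeError("no configured model is available") from last_error


# Demo with a stub client: the preferred model is suspended, the fallback is live.
def stub_client(model_id: str, prompt: str) -> str:
    if model_id == "frontier-model":  # hypothetical ID standing in for a pulled model
        raise ModelUnavailable(model_id)
    return f"[{model_id}] {prompt}"


used, reply = complete_with_fallback(
    "summarize this diff", ["frontier-model", "fallback-model"], stub_client
)
print(used, reply)
```

Keeping the preference list in configuration rather than in code is what makes the swap a one-line change. The capability gap between tiers is not something a router can hide, so callers may still want to know which model actually answered, which is why the function returns the ID alongside the response.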

And for Anthropic, the episode puts its foundational thesis under load. The company’s entire identity is the bet that safety and capability can be co-developed, that being the most careful lab is a moat rather than a tax.

The uncomfortable reading of this quarter is that the moat became the target. The capability it built to lead the frontier is exactly what got the model classed as a munition, and the very language the company used to sell its caution helped write the legal case for taking it away.

The cybersecurity researcher Peter Girnus made the point with a blade: a company that calls its product a munition in every press release should not be shocked when a government eventually takes it at its word. As he put it, “They wrote the legal predicate themselves and called it a brand.”

The government partnership Anthropic cultivated as a differentiator is the same channel through which all of it arrived. Safety did not exempt the company. By one reading, safety is what made it conspicuous.

What to watch

Because the story is moving daily, here are the markers that will tell you which of the two readings is winning, in roughly the order they will resolve.

The terms of restoration, not the fact of it. “Restored” and “restored as it shipped” are different outcomes. Watch whether Fable 5 returns generally available or only through the trusted-partners channel, whether the reported early-July identity-verification rollout extends to foreign nationals or merely confirms US citizenship, and what compliance burden attaches. A narrow, conditioned return is the base case, and the conditions are the whole story.

Whether the trusted-partners framework actually lands. As of this week it is a negotiation with no agreement, the qualifying countries and companies are undefined, and the EU and UK are already hedging toward their own capacity. If it formalizes, allied access to the frontier becomes a license Washington issues. If it collapses, the sovereignty exodus accelerates.

Whether the China-diversion claim is ever evidenced. It is the one rationale that, if substantiated, flips the base case from “restored with conditions” to “restricted regardless of safeguards.” So far it is an unsubstantiated suspicion, and Anthropic says it was never raised in any conversation about the jailbreak.

The DC Circuit ruling on the Pentagon designation. A decision against the government reframes the whole pattern as overreach the courts will check. A decision for it hardens the precedent across both fronts at once.

And whether the rest of the field quietly changes its release posture. If OpenAI and Google visibly shift toward vetted, tiered access over the coming months, that is the market pricing the new reality, and the open-release era of frontier models ends not with a ban but with a business decision.

Hold all of that, and resist the two easy endings. This was not a trivial misunderstanding that proves nothing, and it was not a five-alarm assault on innovation that proves everything. It was a demonstration.

A commercial AI model serving hundreds of millions of people was reclassified as a controlled weapon and switched off centrally, on a domestic legal theory built for centrifuges, triggered by a defensive coding prompt, escalated by the company’s own largest investor, and justified by two rationales that still do not match, in the same quarter a different arm of the same government had blacklisted the same company for the opposite sin.

The narrow event will probably pass. The precedent it set will not. The envelope has been opened in public, and it does not go back inside.

Sources include the primary statement and launch posts from Anthropic; reporting from Reuters, the Financial Times, the Wall Street Journal, The Information, Bloomberg, Axios, CNBC, Fortune, TIME, Semafor, TechCrunch, NBC News, and The Next Web; the Congressional Research Service (IN12669) and Courthouse News on the Pentagon designation and litigation; technical analysis by Katie Moussouris of Luta Security, the only outside expert to have read the underlying research paper; the open letter at freefable.org; market-implied data from Kalshi; and the Claude API documentation. The Amazon-origin account is attributed to the Wall Street Journal and The Information; the G7 trusted-partners talks to Reuters, the Financial Times, and Axios. All valuation figures are reported, estimated, or market-implied as labeled. The China-access concern is reported only as an unsubstantiated suspicion, and the competing Sacks and Anthropic accounts of the de-deployment ultimatum remain unreconciled as of June 18, 2026. No claim here rests on a publicly available audited filing, because Anthropic’s S-1 remains confidential.

How three companies set the price of intelligence

Lorenzo Bradanini — Wed, 17 Jun 2026 12:13:06 GMT

Author’s note on method. This report was researched in June 2026 against primary sources (JEDEC standards, vendor announcements, supplier earnings releases) and dated trade reporting, with every computed figure derived from first principles and shown inline. The HBM4 ramp, the Rubin and MI450 launches, and the Q1 2026 memory shock are recent events; a verification list of claims worth re-checking sits in Appendix B, with confidence tiers throughout. Two sections present explicit illustrative models (the cost of a stack, the gigawatt chain): their inputs are tagged as sourced or assumed, their arithmetic is exact, and their outputs are ranges, not disclosures. Cross-vendor specifications use published peak figures and are directional where measurement conventions differ. Appendix C specifies the measurement program that would convert several first-principles claims here into original benchmark data. Nothing in this report is investment advice.

The thesis: every token is a memory read

Strip away the software and an output token is a physical event: to produce one token, an accelerator must stream essentially every active parameter of the model from memory through its arithmetic units. A 70 billion parameter model quantized to FP8 is 70 gigabytes of weights.

Generating one token means reading 70 gigabytes. Generating a hundred tokens a second means moving seven terabytes a second, and no amount of arithmetic brilliance changes that requirement, because the arithmetic is not the constraint. During decode, the phase that produces every token anyone has ever read from a model, the limiting resource on every modern accelerator is memory bandwidth.
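
The arithmetic in this paragraph can be checked in a few lines. The sketch below assumes a dense model, a single decode stream, and no KV-cache traffic; the function names are ours, and the 8 TB/s figure is the B200 peak quoted later in this report.

```python
def required_bandwidth_tb_s(weight_gb: float, tokens_per_s: float) -> float:
    """Bandwidth (TB/s) needed to stream all weights once per generated token."""
    return weight_gb * tokens_per_s / 1000.0  # GB/s -> TB/s (decimal units)


def max_tokens_per_s(weight_gb: float, bandwidth_tb_s: float) -> float:
    """Single-stream decode ceiling implied by a given memory bandwidth."""
    return bandwidth_tb_s * 1000.0 / weight_gb


# 70B parameters at FP8 (1 byte each) = 70 GB of weights, as in the text.
weights_gb = 70.0
needed = required_bandwidth_tb_s(weights_gb, tokens_per_s=100)
ceiling = max_tokens_per_s(weights_gb, bandwidth_tb_s=8.0)

print(f"100 tokens/s needs {needed:.1f} TB/s")
print(f"8 TB/s caps one stream at about {ceiling:.0f} tokens/s")
```

Read the other way, the same division gives the ceiling: bandwidth divided by bytes per token is the most tokens per second a single stream can ever produce, whatever the arithmetic units could do.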

That bandwidth has exactly one industrial source: High Bandwidth Memory, towers of DRAM dies thinned to thirty micrometers, drilled through with thousands of copper vias, and bonded onto a logic die millimeters from the GPU.

Three companies on earth manufacture it at the frontier: SK hynix in Icheon and Cheongju, Samsung in Pyeongtaek, and Micron in Boise, Hiroshima, and Taichung. Which means the marginal cost of intelligence, the dollars per million tokens that every API price and every AI gross margin ultimately rests on, is set not in Santa Clara but in a three-supplier memory oligopoly, by stacking yields, bonding chemistry, and wafer allocation.

The market has noticed, violently. In the first quarter of 2026, SK hynix posted a 72 percent operating margin, higher than Nvidia’s and TSMC’s most recent reported margins, on revenue that nearly tripled year over year; on the earnings call the company said customer requests for HBM already exceed its planned production capacity for the next three years, per the Q1 release and call coverage.

The supplier of the bottleneck is now more profitable, in percentage terms, than the company whose chips it feeds. That inversion is the subject of this report.

The structure: the device physics that created the wall (II); the wall measured on real silicon, including the operating taxes nobody quotes (III); the anatomy of a stack down to the via, and the yield equation that prices it (IV); the oligopoly’s history and the 2026 supercycle (V); HBM4, the largest architectural break in the technology’s history, now ramping (VI); the machines it feeds and the two opposing design philosophies the new standard revealed (VII); the money, including a transparent cost-per-stack model (VIII); the bridge to the price of a token and the gigawatt-to-wafer chain (IX); the bear case, stated properly (X); the road past HBM4 (XI); and what to watch (XII), with five falsifiable calls.

Two interludes price the KV economy and the power wall along the way. Appendix C specifies the benchmarks that would extend this from synthesis to measurement.

One framing number before the detail. From the V100 in 2017 to the Rubin GPU now entering production, Nvidia’s peak tensor throughput at the lowest supported precision grew roughly 400-fold.

Over the same nine years, the memory bandwidth feeding that compute grew 24.4-fold. The 16.4x gap between those two growth factors, computed from the vendors’ own datasheets, is the memory wall. Everything below is downstream of it.
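
For readers who want to reproduce the framing number, here is a minimal sketch. The 400-fold compute figure and Rubin's 22 TB/s are taken from this report; the V100's 900 GB/s is our input from Nvidia's published V100 specification. The annualized rates are simply the ninth roots of the two growth factors.

```python
v100_bw_gb_s = 900        # V100 HBM2 peak bandwidth, GB/s (our input, vendor spec)
rubin_bw_gb_s = 22_000    # Rubin peak bandwidth as cited in this report, GB/s
compute_growth = 400      # "roughly 400-fold", taken from the text
years = 9

bw_growth = rubin_bw_gb_s / v100_bw_gb_s
divergence = compute_growth / bw_growth

# Annualized growth factors over the nine-year span.
compute_per_year = compute_growth ** (1 / years)
bw_per_year = bw_growth ** (1 / years)

print(f"bandwidth growth {bw_growth:.1f}x, divergence {divergence:.1f}x")
print(f"per year: compute {compute_per_year:.2f}x, bandwidth {bw_per_year:.2f}x")
```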

II. The physics: why the wall exists

The wall is not a design oversight. It is the collision of two different scaling regimes, and to price it correctly you have to go down to the cell.

The cell that cannot shrink. A DRAM bit is one transistor and one capacitor, the 1T1C cell, unchanged in concept since 1966. Writing stores charge on the capacitor; reading shares that charge onto a long bitline and asks a sense amplifier to detect the disturbance.

The readable signal is set by charge sharing: the voltage swing the sense amplifier sees is approximately the cell’s stored swing scaled by Cs over (Cs plus Cbl), the cell capacitance against the parasitic bitline capacitance. With modern cell capacitance in the single-digit femtofarads and bitlines loaded by hundreds of neighboring cells, that swing is on the order of tens of millivolts, detected differentially against a reference bitline.

Shrink the capacitor and the signal disappears into noise; the read becomes unreliable; the bit is worthless.
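
Written out, the charge-sharing relation described above looks like this. The symbols are our own labels, and the numbers in the worked instance (5 fF cell, 50 fF bitline, 0.55 V stored swing) are illustrative assumptions chosen only to land in the ranges the text names; they are not measured values for any specific device.

```latex
\Delta V_{bl} \;\approx\; V_{cell}\,\frac{C_s}{C_s + C_{bl}}
\qquad\text{e.g.}\qquad
0.55\,\mathrm{V}\times\frac{5\,\mathrm{fF}}{5\,\mathrm{fF}+50\,\mathrm{fF}}
= 0.05\,\mathrm{V} = 50\,\mathrm{mV}
```

Halving the cell capacitance in that example drops the swing to about 25 mV, which is the sense in which shrinking the capacitor pushes the signal toward the noise floor.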

So the capacitor holds a roughly fixed charge requirement regardless of lithography, which is why DRAM capacitors became architecture rather than printing: vertical pillars with aspect ratios beyond 100 to 1, wells drilled a hundred times deeper than they are wide, lined with high-k dielectric laminates (the zirconia-alumina-zirconia family) to wring capacitance from area that no longer exists in plan view.

The consequence is the slowest node cadence in semiconductors. The industry’s DRAM generations crawled from 1x-class around 19 nanometers in the mid-2010s through 1y, 1z, 1a (roughly 14), 1b (roughly 12 to 13) to today’s 1c at roughly 11 to 12 nanometers: a decade to cover what logic crossed in three years.

EUV lithography, which rescued logic, arrived in DRAM late and thinly: SK hynix introduced it at 1a, Samsung uses it on more layers, and Micron famously held out on DUV multi-patterning through 1-beta before adopting EUV at 1-gamma.

Bits per wafer, the quantity that sets the cost of a gigabyte, now improves by single-digit percentages per year. This is the supply-side bedrock of every price in this report: the raw material of memory has nearly stopped getting cheaper.

The latency that never moved. Hidden under the bandwidth story is a stagnation worse than the capacity one: DRAM row cycle time, tRC, the time to open a row, sense it, restore it, and precharge for the next, has improved by less than 2x in two decades, parked in the mid-40-nanosecond range, because it is governed by the analog physics of sense amplification, not by lithography.

Every bandwidth gain in modern DRAM is therefore parallelism wearing a frequency costume: more banks, more bank groups, deeper prefetch (DDR5 fetches sixteen beats per access), more independent channels, so that thousands of slow rows are in flight at once. HBM is this philosophy at its logical extreme.

The pin that cannot run. The other escape route, faster pins, is capped by signal integrity. A DDR5 pin driving centimeters of motherboard trace through a connector tops out in the high single-digit Gbps; GDDR7 reaches the 30s only over short, exquisitely tuned point-to-point routes at painful energy cost.

And energy is the real currency. Moving a bit from a board-level DIMM costs on the order of 10 to 15 picojoules; GDDR-class interfaces sit near 7 to 8; HBM, with its millimeters-long links through a silicon interposer, runs in the 3 to 5 range. The logic-process base dies of HBM4 push the interface toward 0.75 to 0.8 volts against 1.1 for DRAM-process predecessors, roughly doubling interface efficiency, per TSMC’s published figures (all pJ-per-bit values are approximate vendor-class numbers).

Run the energy arithmetic forward and it bites hard. At 4 picojoules per bit, fully streaming a B200’s 8 TB/s costs about 256 watts; fully streaming Rubin’s 22 TB/s at an improved 3 pJ per bit still costs about 528 watts. Memory traffic alone, at full decode throughput on an HBM4 flagship, plausibly draws about three-quarters of the 700 watts an entire H100 SXM board drew. This is why packages crossed 2,000 watts, why every HBM4 platform is liquid-cooled, and why the JEDEC thermal envelope is now a first-order economic document.
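
The two wattage figures follow from one multiplication, sketched here. The function name is ours; the bandwidth and picojoule inputs are the ones quoted in the text, and the result assumes the interface is streaming at its full peak rate.

```python
def memory_power_watts(bandwidth_tb_s: float, pj_per_bit: float) -> float:
    """Power spent moving bits at full bandwidth.

    TB/s * 1e12 bytes * 8 bits * pJ * 1e-12 J: the 1e12 and 1e-12 cancel,
    leaving bandwidth_tb_s * 8 * pj_per_bit in watts.
    """
    return bandwidth_tb_s * 8 * pj_per_bit


b200_watts = memory_power_watts(bandwidth_tb_s=8, pj_per_bit=4)
rubin_watts = memory_power_watts(bandwidth_tb_s=22, pj_per_bit=3)

print(f"B200 at 8 TB/s, 4 pJ/bit: {b200_watts:.0f} W")
print(f"Rubin at 22 TB/s, 3 pJ/bit: {rubin_watts:.0f} W")
```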

So: a cell that cannot shrink, a row that cannot speed up, a pin that cannot run, and an energy budget that punishes distance. The only move left is the one HBM made: go wide (1,024 wires, now 2,048), go short (millimeters through an interposer), and go up (stack the dies).

Width replaces frequency; proximity replaces drive power; the third dimension replaces the second. The cost of the trick is the subject of section IV: stacking is the most yield-hostile thing the memory industry has ever mass-produced.

III. The wall, quantified, and the taxes nobody quotes

The cleanest way to see the wall is the manufacturers’ own flagship specifications, indexed.

Computed from vendor datasheets and GTC 2026 disclosures. Note that the compute line rides shrinking precision (FP16 to FP8 to FP4), a legitimate but partly definitional gain; the bandwidth line had to be manufactured stack by stack.

The bytes-per-FLOP column is the punchline: each generation can feed each unit of its arithmetic less data than the last. The machine is increasingly a furnace with a narrowing fuel line.

Operationally, the two phases of inference live on opposite sides of the roofline. Prefill is a matrix-matrix multiply with arithmetic intensity in the hundreds of FLOPs per byte: compute-bound. Decode performs roughly two FLOPs per parameter per token while reading every parameter byte: intensity near 2 at FP8, against a ridge point around 560 FLOPs per byte on a B200, so a single conversation idles the arithmetic above 99 percent while it waits on DRAM.

The entire modern serving stack (continuous batching, paged KV caches, speculative decoding, MoE routing) exists to hide that ratio, and every one of those techniques converts the problem into a different demand on the same resource: more concurrent streams need more KV cache, and the KV cache lives in HBM.

Bandwidth sets the speed of a token; capacity sets how many tokens you can be making at once. Both are the stack.
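
A small roofline sketch makes the idle fraction and the role of batching concrete. The 560 FLOPs-per-byte ridge and the intensity of 2 come from the text; the batching rule (weights read once per step and shared by every sequence in the batch, KV-cache reads ignored) is a simplifying assumption of ours, so the saturating batch size it produces is a lower bound rather than a measurement.

```python
RIDGE_FLOPS_PER_BYTE = 560.0   # B200 ridge point quoted in the text
DECODE_INTENSITY = 2.0         # ~2 FLOPs per parameter byte at FP8, batch of one


def arithmetic_utilization(intensity: float, ridge: float = RIDGE_FLOPS_PER_BYTE) -> float:
    """Fraction of peak compute a kernel can use under the roofline model."""
    return min(1.0, intensity / ridge)


def batched_intensity(batch_size: int) -> float:
    """Assumption: one weight read per step serves the whole batch (no KV traffic)."""
    return DECODE_INTENSITY * batch_size


single_stream = arithmetic_utilization(batched_intensity(1))
idle_fraction = 1.0 - single_stream
batch_to_saturate = RIDGE_FLOPS_PER_BYTE / DECODE_INTENSITY

print(f"single conversation keeps {idle_fraction:.1%} of the arithmetic idle")
print(f"batch needed to reach the ridge (weights only): {batch_to_saturate:.0f}")
```

Every sequence added to the batch brings its own KV cache, which this sketch leaves out. That omitted term is the capacity demand the paragraph above describes.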

Two operating taxes deserve quantification because they appear in no marketing datasheet.

The refresh tax. DRAM forgets; every row must be rewritten on a fixed schedule, and while a bank refreshes it cannot serve traffic. The overhead is roughly tRFC over tREFI, the refresh pulse width over the refresh interval: with multi-hundred-nanosecond tRFC on dense dies against the standard 3.9 microsecond interval, the tax is on the order of 5 to 10 percent of theoretical bandwidth (HBM’s per-bank and managed refresh modes claw some back).

The vicious part is thermal: JEDEC devices double their refresh rate above 85 degrees Celsius, halving tREFI, so the bandwidth tax roughly doubles exactly when the stack is working hardest and hottest. A 16-high tower dissipating tens of watts through molded underfill in a 2,300-watt package is a device engineered to live near that threshold. Hot memory is slow memory, and slow memory is expensive tokens: cooling budgets are bandwidth budgets.
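
The refresh tax can be estimated directly from the ratio the text gives. The 350 ns tRFC below is an assumed value inside the "multi-hundred-nanosecond" range, not a datasheet figure, and the sketch ignores the per-bank and managed refresh modes that recover part of the loss.

```python
T_REFI_NS = 3900.0   # standard refresh interval from the text (3.9 microseconds)


def refresh_tax(t_rfc_ns: float, t_refi_ns: float = T_REFI_NS) -> float:
    """Fraction of time a bank is busy refreshing instead of serving traffic."""
    return t_rfc_ns / t_refi_ns


assumed_t_rfc_ns = 350.0                                  # illustrative assumption
normal = refresh_tax(assumed_t_rfc_ns)
hot = refresh_tax(assumed_t_rfc_ns, T_REFI_NS / 2)        # above 85 C, tREFI halves

print(f"normal temperature: {normal:.1%} of bandwidth lost to refresh")
print(f"above 85 C:         {hot:.1%}")
```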

The utilization tax. Achieved bandwidth is not peak. Between refresh, bank conflicts under irregular KV-cache access, read-write turnarounds, and command overheads, well-tuned decode workloads typically realize 60 to 80 percent of datasheet bandwidth (the measurement protocol in Appendix C exists to pin this number per platform).

Every figure in this report that divides by peak bandwidth is therefore optimistic by that factor, uniformly, which preserves comparisons while flattering absolutes.
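
The claim that a uniform derating preserves comparisons while flattering absolutes is easy to verify. This sketch applies the 60 and 80 percent bounds from the text to the two peak bandwidths quoted earlier; applying the same utilization to both platforms is the assumption being tested.

```python
def achieved_bandwidth_tb_s(peak_tb_s: float, utilization: float) -> float:
    """Bandwidth actually delivered after the utilization tax."""
    return peak_tb_s * utilization


peaks = {"B200": 8.0, "Rubin": 22.0}   # peak TB/s figures quoted in this report
ratios = []
for utilization in (0.6, 0.8):
    achieved = {name: achieved_bandwidth_tb_s(p, utilization) for name, p in peaks.items()}
    ratios.append(achieved["Rubin"] / achieved["B200"])
    print(f"at {utilization:.0%}: B200 {achieved['B200']:.1f} TB/s, "
          f"Rubin {achieved['Rubin']:.1f} TB/s, ratio {ratios[-1]:.2f}x")
```

The absolute numbers fall by 20 to 40 percent, but the ratio between platforms stays at 2.75x in both cases. The comparison only breaks if the platforms realize different fractions of peak, which is what a per-platform measurement would reveal.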

The wall is also not transient. The longest-baseline study, Gholami and colleagues’ “AI and Memory Wall,” measured twenty years of server hardware: peak compute scaling 3.0x every two years, DRAM bandwidth 1.6x, interconnect 1.4x. Different exponents compound.
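
To see how hard different exponents compound, the sketch below carries the three per-two-year rates from the study across its twenty-year window. The rates are the ones quoted above; the function name and the assumption of perfectly steady compounding are ours.

```python
def compound(rate_per_two_years: float, years: float) -> float:
    """Total growth factor after `years` at a fixed per-two-year rate."""
    return rate_per_two_years ** (years / 2)


YEARS = 20
compute = compound(3.0, YEARS)
dram_bandwidth = compound(1.6, YEARS)
interconnect = compound(1.4, YEARS)

print(f"compute:        {compute:,.0f}x")
print(f"DRAM bandwidth: {dram_bandwidth:,.0f}x")
print(f"interconnect:   {interconnect:,.0f}x")
print(f"compute outran DRAM bandwidth by {compute / dram_bandwidth:,.0f}x")
```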

The wall is structural, and the industry that gets paid because of it is the subject of the next several sections.

IV. Anatomy of a stack, to the via

A current HBM device is a tower: a base die at the bottom and 8, 12, or 16 DRAM core dies above it.

JEDEC’s HBM4 standard, published as JESD270-4 on April 16, 2025, fixes the envelope: a 2,048-bit interface organized as 32 independent channels (each split into two pseudo-channels, 64 concurrent access streams per stack), support for 24Gb and 32Gb core dies in 4-, 8-, 12-, and 16-high stacks, capacities to 64GB per stack, per-pin rates from 8 Gbps in the base spec, and a package height of 775 micrometers for both 12- and 16-high, loosened from HBM3E’s 720 to give 16-high a fighting chance without new bonding physics.
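
The headline numbers in the standard are related by simple arithmetic, sketched here. The constants are the ones listed above; the helper names are ours, and the bandwidth is the base-spec figure at 8 Gbps per pin rather than what any shipping part achieves.

```python
INTERFACE_BITS = 2048
CHANNELS = 32
PSEUDO_CHANNELS_PER_CHANNEL = 2
BASE_PIN_RATE_GBPS = 8


def stack_bandwidth_gb_s(pin_rate_gbps: float, width_bits: int = INTERFACE_BITS) -> float:
    """Peak per-stack bandwidth: pins times per-pin rate, converted bits -> bytes."""
    return pin_rate_gbps * width_bits / 8


def stack_capacity_gb(die_gbit: int, height: int) -> float:
    """Stack capacity from core-die density (gigabits) and number of dies."""
    return die_gbit * height / 8


streams = CHANNELS * PSEUDO_CHANNELS_PER_CHANNEL
base_bandwidth = stack_bandwidth_gb_s(BASE_PIN_RATE_GBPS)
max_capacity = stack_capacity_gb(die_gbit=32, height=16)

print(f"{streams} concurrent access streams per stack")
print(f"{base_bandwidth:.0f} GB/s per stack at the base pin rate")
print(f"{max_capacity:.0f} GB per stack with 32Gb dies, 16-high")
```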

The vertical wiring

Each core die is thinned to 30 to 50 micrometers (SK hynix’s CES 2026 16-high uses 30, about a third of a hair) and pierced by through-silicon vias: copper columns roughly 5 to 6 micrometers in diameter, formed via-middle with deep reactive-ion etch, lined, filled, then revealed by grinding the wafer from the back.

Signal, power, and ground TSVs together number on the order of ten thousand per stack (order-of-magnitude; vendors do not publish counts), with spares woven in: TSV repair logic in the base die can route around dead vias, one of several redundancy layers (alongside row and column fuses) that keep the yield equation below from being even crueler.

Between dies, communication crosses microbump fields at pitches around 25 micrometers, thousands of joints per interface, every one a potential stack-killing defect.

The brain at the bottom

The base die is the stack’s logic: the 2,048-bit PHY facing the host, channel routing, built-in self-test, the IEEE 1500 test wrapper, repair control, and the direct-access port that lets a tester exercise the tower.

Through HBM3E it was built on a DRAM process, because that is what memory firms own, and it shows: DRAM transistors make poor I/O drivers. At HBM4’s 10-plus Gbps per pin across 2,048 lanes, signal integrity demands real logic transistors, real equalization, lower supply rails, which is the engineering reason (beyond the strategic one in section VI) that the base die migrated to foundry logic processes this generation.

How the tower is joined

The bonding step is the deepest process moat in the industry, currently a three-way technology bet.

SK hynix uses MR-MUF, mass reflow with molded underfill: dies are placed and the solder joints formed in a batch reflow, then the whole stack is encapsulated in one molded underfill shot, a flow with better warpage control, a stronger thermal path through the mold compound, and batch throughput, widely credited as the reason it shipped 12-high first and leads yields (the underfill material itself is a quiet chokepoint, long supplied under an exclusive arrangement with Namics, per trade reporting).

Samsung and Micron use TC-NCF, thermo-compression over a pre-laminated non-conductive film: each die is pressed down individually with heat and force, slower and stress-accumulating, but precise at fine pitch.

The bridge step is fluxless TCB, removing flux and its residues by bonding in a reducing atmosphere. The endgame is hybrid bonding: copper pads and dielectric planarized to sub-nanometer roughness, fused face to face with no bumps at all, pitch capability below 10 micrometers, thinner stacks, a direct copper thermal path, and lower parasitics.

It is mandatory somewhere past 16 to 20 layers and brutally hard: Samsung, betting on it most aggressively, was reported in April 2026 to be sampling hybrid-bonded 16-high HBM4 to Nvidia at yields around 10 percent, while SK hynix completed a 12-high hybrid-bonding validation and placed its first inline production order while publicly committing to MR-MUF through HBM4E, per EE Times via TrendForce.

And in April 2026 JEDEC was reported to be weighing a roughly 900-micrometer height for HBM4E, which would let incumbent bonding survive another generation and shift hundreds of millions of dollars of equipment orders with one standards vote.

The equation that prices it all

If each die-plus-bond event succeeds with probability p, a stack of n yields p to the power n, and one failure scraps the tower with every good die in it.

At 99 percent per layer, a 12-high yields 89 percent; at 97, 69; at 95, 54. The compounding is mitigated but not repealed by known-good-die testing before stacking, TSV and row repair after, and known-good-stack test at the end (the step driving Advantest’s memory-test boom). It is why HBM commands 5 to 6 times the per-bit price of DDR5: industry trackers put HBM3E near $8 to $10 per GB, roughly $300 per 36GB stack, with early HBM4 stacks around $500, per Silicon Analysts estimates. It is also why TrendForce calculates HBM consumes roughly three times the wafer area per bit of commodity DRAM once die-size trades and stack losses are counted.
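
The yield equation is short enough to run. This sketch uses the naive p-to-the-n form from the text, with independent layers and no repair or known-good-die screening, so it overstates losses relative to a real line. The cost multiplier (one over yield) is our added illustration of what scrapped towers do to the cost of each good one.

```python
def stack_yield(per_layer_yield: float, layers: int) -> float:
    """Probability that every die-plus-bond event in the stack succeeds."""
    return per_layer_yield ** layers


def cost_multiplier(per_layer_yield: float, layers: int) -> float:
    """Stacks started per good stack shipped, if failed towers are scrapped."""
    return 1.0 / stack_yield(per_layer_yield, layers)


for p in (0.99, 0.97, 0.95):
    for n in (12, 16):
        y = stack_yield(p, n)
        print(f"p={p:.2f}, {n}-high: yield {y:.0%}, "
              f"{cost_multiplier(p, n):.2f} stacks started per good stack")
```

Moving from 12-high to 16-high at the same per-layer yield adds four more chances to fail, which is why the same bonding process that is comfortable at 12 layers can be uneconomic at 16.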

Three suppliers, a decade of bonding chemistry embodied in process recipes, and an exponential that punishes newcomers: that is the moat, stated as math.

V. The oligopoly, and the quarter the memory market broke

HBM was born unwanted. AMD’s packaging architects, led by Bryan Black, spent the early 2010s convincing anyone who would listen that memory belonged on the package; SK hynix co-developed the first standard, and the 2015 Fury X shipped it, 4GB of HBM1 whose capacity ceiling promptly handicapped the card against Nvidia’s cheaper GDDR5 flagship.

The pioneer paid the tuition; the fast follower banked the lesson: Nvidia adopted HBM2 on the P100 in 2016 and never looked back, while for most of a decade the product line survived at SK hynix on conviction more than profit.

The reward arrived all at once after ChatGPT. SK hynix was first to HBM3 (effectively sole-sourcing the H100), first to 8- and 12-high HBM3E, and converted the lead into a position Counterpoint measured at 62 percent revenue share in Q2 2025, against 21 for Micron, which had skipped HBM3 entirely and leapfrogged to HBM3E, and 17 for Samsung, the incumbent giant caught flat-footed.

Samsung endured repeated Nvidia qualification failures on HBM3E thermals and power through 2024, a leadership change, and a recovery visible by Q3 2025 (Counterpoint: back to 35 percent) that culminated in late January 2026 with Nvidia qualification for HBM4 itself and production from February, per Bloomberg-sourced reporting.

Analyst estimates after Computex put SK hynix at 60 to 70 percent of the HBM4 volume allocated to Vera Rubin, Samsung at 25 to 30, and Micron the remainder. In early June Nvidia and SK hynix signed a multi-year pact to co-develop AI memory for Rubin and beyond, the first agreement of its kind, which converts the leader’s share into contracted durability, per reporting on the deal. Counterpoint credits SK hynix 61 to 64 percent of the overall HBM market through the period.

Nvidia has also reportedly asked all three for 16-high stacks as early as late 2026.

Then came the quarter that broke the market.

Because an HBM bit consumes about three times the wafer area of a commodity bit and sells for five times the price, every rational fab starved DDR5 to feed it, with DDR4 output collapsing toward 20 percent of 2025 levels. Demand met the squeeze. In Q1 2026, normally the seasonal trough, global DRAM industry revenue hit $97.1 billion, up 85.3 percent in a single quarter, the largest in history, per Omdia. Contract prices rose 90 to 95 percent quarter on quarter by TrendForce’s count, the steepest increase ever, revised up from an already unprecedented 55 to 60.
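
The incentive in the first sentence reduces to one ratio, sketched below. It compares revenue per wafer only: the function name is ours, and the sketch deliberately ignores the extra stacking, test, and packaging cost of HBM, which this report treats separately.

```python
def relative_revenue_per_wafer(price_multiple: float, area_multiple: float) -> float:
    """Revenue per wafer for HBM relative to commodity DRAM (commodity = 1.0)."""
    return price_multiple / area_multiple


# Five times the per-bit price on three times the wafer area per bit.
hbm_vs_commodity = relative_revenue_per_wafer(price_multiple=5, area_multiple=3)
print(f"each wafer moved to HBM earns about {hbm_vs_commodity:.2f}x the revenue")
```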

The primary-source exhibit is SK hynix’s Q1 2026 report, and it deserves its numbers stated in full because they are the income-statement proof of everything above. Revenue was 52.5763 trillion won (about $35.6 billion), the first quarter above 50 trillion in company history, up roughly 60 percent sequentially and 198 percent year over year. Operating profit was 37.6103 trillion won (around $25 to 27 billion), a 72 percent operating margin alongside a 77 percent net margin, all-time highs on every line. One quarter’s operating profit came to roughly 80 percent of the whole of record fiscal 2025 (47.2 trillion won) and exceeded all of fiscal 2024, per the company’s release and earnings coverage.
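
The ratios in that paragraph can be recomputed from the won figures alone. In this sketch the exchange rate is not an independent input: it is backed out of the report's own revenue conversion (52.5763 trillion won to about $35.6 billion), so the dollar figure for operating profit is only as good as that rounding.

```python
revenue_trn_won = 52.5763
operating_profit_trn_won = 37.6103
fy2025_operating_profit_trn_won = 47.2
revenue_usd_bn = 35.6                      # the report's approximate conversion

operating_margin = operating_profit_trn_won / revenue_trn_won
share_of_fy2025 = operating_profit_trn_won / fy2025_operating_profit_trn_won

# Implied won-per-dollar rate from the revenue conversion, reused for profit.
implied_won_per_usd = revenue_trn_won * 1e12 / (revenue_usd_bn * 1e9)
operating_profit_usd_bn = operating_profit_trn_won * 1e12 / implied_won_per_usd / 1e9

print(f"operating margin: {operating_margin:.1%}")
print(f"Q1 operating profit vs all of fiscal 2025: {share_of_fy2025:.0%}")
print(f"operating profit: about ${operating_profit_usd_bn:.1f} billion")
```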

On the call, management said customer HBM requests already exceed planned capacity for the next three years, guided HBM4E samples for the second half of 2026 with 2027 mass production, and announced a 19 trillion won (about $13 billion) advanced-packaging plant, with 2026 capex priorities of the M15X ramp, Yongin site preparation, and EUV tooling.

SK Group’s chairman went further, telling reporters in March that the global wafer shortage will likely persist to 2030 with a shortfall exceeding 20 percent, since capacity takes four to five years to add, per CNBC.

Micron’s fiscal first quarter (the November quarter) had already shown the shape: $13.64 billion of revenue, up 57 percent, 56 percent gross margins, demand “substantially higher” than supply, followed by its December exit from the consumer memory business entirely.

Bank of America frames 2026 as a supercycle on the order of the 1990s boom, with DRAM revenue up 51 percent for the year; the three memory makers added roughly $900 billion of combined market value from September, per market reporting.

Demand now reaches upstream in ways the industry has never seen: in October 2025 OpenAI signed letters of intent with both Samsung and SK hynix under Stargate targeting on the order of 900,000 DRAM wafer starts per month, widely characterized as approaching 40 percent of global output, per Reuters.

Whatever fraction converts, the meaning is the change itself: model companies negotiating two layers down their own supply chain, because they have understood what this report argues. The token supply curve is a wafer allocation.

The Vertical and the Loop: valuation, compute, and the Anthropic IPO

Lorenzo Bradanini — Wed, 10 Jun 2026 13:42:39 GMT

Reported and estimated figures. Anthropic's S-1 is confidential; none of the below is yet an audited, publicly filed fact.

A note on the byline. This analysis is written by The Software Frontier. We have no access to Anthropic’s non-public financials, no instruction to flatter the company, and no stake in the outcome. Everything here is drawn from public reporting, third-party regulatory filings, analyst estimates, and published silicon benchmarks, attributed inline. Where the company looks strong we say so. Where the bears have the better argument, we say that too. Treat the byline as a reason for scrutiny, not for trust.

Three rockets, one launch window

For the first time in the history of American capital markets, three private companies are attempting to go public in a single year at valuations above one trillion dollars each. That has never happened once. It is now scheduled to happen three times in roughly one hundred days.

SpaceX, which by 2026 had combined with xAI into a single entity, filed its public S-1 on May 20 and is expected to begin trading on June 12, seeking upwards of seventy-five billion dollars, according to Datacenter Dynamics’ reading of the prospectus.

OpenAI filed a confidential draft registration on May 22, targeting a fourth-quarter listing that could be pulled forward to as early as September, with Goldman Sachs, Morgan Stanley, and JPMorgan leading, per the Wall Street Journal and CNBC. And on Monday, June 1, Anthropic started its own clock, confidentially filing an IPO prospectus with the SEC and confirming it in a public statement.

Reuters reported back in December that the company had already engaged Wilson Sonsini, the firm that managed Google’s 2004 IPO, to prepare.

PitchBook’s Harrison Rolfes told CNN that two trillion-dollar filings in such a short window represent the largest concentration of pre-IPO capital ever brought to market at once. He was counting two. There are three.

Wedbush’s Dan Ives, who has tracked this complex for years, called the moment an opening of the floodgates for an IPO market that has been largely shut.

This piece is about the middle rocket. Anthropic is the most interesting of the three not because it is the largest, but because it sits at the exact center of every structural question the AI buildout has raised: the quality of AI revenue, the physics and economics of inference, the rivalry between three incompatible silicon stacks, and the circular capital flows that critics compare to the vendor-financing collapse of the dot-com era. To price Anthropic is to price the entire complex.

There is a complication, and it is the most important sentence in this report. The filing is confidential. Under SEC rules for emerging growth companies, the prospectus stays private until roughly fifteen days before a public roadshow.

Which means that as of this writing, the most anticipated technology IPO in a generation is being valued by the market on numbers that no auditor has signed for public release.

Everything you are about to read about Anthropic’s financials is reported, estimated, or projected. None of it is yet a filed, audited fact. Hold that thought. We will return repeatedly to why it matters more here than almost anywhere else.

The vertical

Start with the revenue curve, because it is the reason any of this is happening, and because it does not look like a curve. It looks like a wall.

Anthropic reported roughly one billion dollars in annualized revenue at the end of 2024. By the end of 2025 that figure was around nine billion. After closing its Series G in February 2026 it was near fourteen billion. By early April, multiple outlets put the run-rate above thirty billion.

In May, per CNBC’s account of the filing, Anthropic disclosed a run-rate of forty-seven billion dollars, a figure echoed at roughly forty-five billion by the research firm Sacra.

Read that sequence again: one, nine, fourteen, thirty, forty-seven. The jump from fourteen to forty-seven happened in about four months. There is no precedent for this in enterprise software.

The closest analogue is not a software company at all. It is a commodity in a shortage, which is exactly what frontier inference capacity has become.

The wall. Run-rate roughly 5x’d in twelve months and more than 3x’d in the four months to May. Sources: CNBC, Sacra, The Motley Fool, MEXC syndication, TipRanks. Intermediate points (Aug, Oct 2025) from contemporaneous reporting.

The composition matters more than the level. Roughly eighty percent of revenue comes from enterprises, per TipRanks, almost the inverse of OpenAI’s consumer-heavy base. Futurum’s Nick Patience notes that eight of the Fortune 10 are now paying customers. More than three hundred thousand businesses run Claude.

The count of customers spending a million dollars or more per year crossed one thousand by April, double the roughly five hundred in February. And one product, Claude Code, reached about one billion dollars in annualized revenue within six months of launch, a developer-tool adoption speed that AI Weekly, citing WSJ figures, called a new category benchmark.

That last fact is double-edged, and we will sharpen the second edge later. For now, note the shape: this is not a consumer novelty that might churn. It is a deeply embedded enterprise dependency growing into core workflows.

That is the most durable kind of revenue there is. It is also the kind that invites a backlash when the bill arrives, which is precisely what is now starting to happen.

The margin question, which is really two companies

The number everyone quotes is revenue. The number that decides whether Anthropic is worth a trillion dollars is gross margin, and on gross margin Anthropic is two entirely different businesses wearing the same name.

In 2024, Anthropic’s gross margin was negative ninety-four percent, per data compiled by TradingKey. It cost nearly two dollars of compute to deliver one dollar of revenue. By 2025 that had swung to somewhere around forty to fifty percent.

The company’s own projection, reported by Sacra and Seeking Alpha, is for gross margin to reach roughly seventy-seven percent by 2028.

Two businesses. A dollar of 45%-margin revenue and a dollar of 77%-margin revenue are not the same asset in any discounted cash flow. The entire valuation thesis rests on which one is emerging. Sources: TradingKey (2024), Sacra and Seeking Alpha (2025, 2028E).

A Substack analysis by Shanaka Anslem Perera framed the stakes more precisely than any sell-side note we have read: revenue at forty percent gross margin and revenue at seventy-seven percent gross margin are not the same business with different costs.

They are different businesses. The whole valuation rests on the assumption that the second business is the one arriving. The bear case rests on the possibility that it is not.

Why might margins expand so dramatically? The answer is not hand-waving. It is mechanical, it is grounded in the physics of how a token is produced, and it is the strongest single pillar of the bull case. The section on the cost of a token takes it apart piece by piece.

But first we have to understand what Anthropic actually runs Claude on, because the margin story and the silicon story are the same story.

The profitability that may not be one

In mid-May, the Wall Street Journal reported that Anthropic is on pace to post its first operating profit in the second quarter of 2026: roughly $10.9 billion in revenue, more than double the $4.8 billion booked in Q1, against an expected operating profit of about $559 million.

AI Weekly’s summary of the WSJ figures framed this as the moment Anthropic crosses into covering its own costs, a milestone that, if real, changes the fundraising calculus for the entire frontier sector. Investors could finally model a path to returns rather than an indefinite subsidy.

The technology critic Ed Zitron argues the milestone is partly an artifact of timing, and the argument deserves engagement rather than dismissal. The mechanism: under the compute deal Anthropic struck with the xAI unit of SpaceX, Anthropic pays $1.25 billion per month for the Colossus 1 cluster, a figure that emerged from SpaceX’s own S-1 and was reported by TechCrunch.

That is roughly fifteen billion dollars a year. But the deal carries a discounted rate for the first two months while xAI completes its ramp, and those two discounted months fall, conveniently, in exactly the quarter Anthropic is using to claim its first operating profit.

Zitron’s point is not that Anthropic is lying. It is that a 559-million-dollar operating profit is a thin margin on a temporarily depressed cost base, and that when the SpaceX rate steps up to its full level, the same quarter’s economics look different.

Whether you find this damning or merely worth watching depends on your priors. What is not in dispute is that the profitability claim and the compute deal are entangled, and that a confidential S-1 means we cannot yet see how the company books the ramp discount.

This is the first concrete reason the confidentiality matters: the single most important narrative claim, that Anthropic is now profitable, sits on a cost structure we cannot inspect.

The three-silicon bet

Now the part that, in the end, actually decides the future: where the compute comes from, what it runs on, and who controls it.

Anthropic has done something no other frontier lab has managed. It trains and serves Claude across three mutually incompatible silicon stacks at once: Amazon’s Trainium, Google’s TPU, and Nvidia’s GPUs (the last rented, remarkably, from a rival).

This is not an accident of procurement. It is a deliberate hedge against the single greatest risk a frontier lab faces, which is being captive to one supplier’s roadmap, pricing, and power budget.

Anthropic CFO Krishna Rao framed the multi-vendor approach to CNBC as spreading workloads across vendors to tune for price, performance, and power. Read it as insurance.

Amazon and Trainium

On April 20 and 21, Amazon and Anthropic announced an expanded partnership that, per Global Data Center Hub’s reconstruction, totals up to thirty-three billion dollars in committed Amazon equity: five billion in fresh equity at a $350 billion pre-money valuation, up to twenty billion more tied to milestones, on top of eight billion deployed from 2023 to 2025.

In exchange, Anthropic committed to spend more than one hundred billion dollars on AWS over the next decade and to deploy up to five gigawatts of Trainium capacity. The engine is Project Rainier, which Anthropic’s own announcement describes as already running over one million Trainium2 chips.

The Indiana campus alone, per Global Data Center Hub, spans twelve hundred acres across seven buildings and scales toward 2.2 gigawatts at full build-out.

The commitment spans Trainium2, Trainium3, and the still-unannounced Trainium4, with Amazon’s Annapurna Labs taking design feedback from the lab that stresses the chips hardest.

Google and TPUs

The Google relationship is older and, in valuation terms, arguably the better bargain. CNBC reported that before the latest deal Google’s stake exceeded three billion dollars at roughly fourteen percent, built in part from a three-hundred-million-dollar 2023 check for about ten percent and a two-billion-dollar follow-on.

In October 2025 the two announced a cloud deal for up to one million TPUs worth tens of billions. Then on April 24, 2026, Google committed up to forty billion dollars more in cash and compute, per TechCrunch and CNBC, expanding to five gigawatts of TPU capacity over five years. A Broadcom securities filing put the associated next-generation TPU figure at 3.5 gigawatts.

The Motley Fool’s Billy Duberstein argued Google is getting a screaming bargain, partly because a guaranteed five-gigawatt anchor tenant de-risks Alphabet’s own enormous capex. He is probably right, and the reason he is right is the same reason the structure is fragile, which is the subject of future discussions.

Nvidia, via xAI’s Colossus

This is the strangest leg, and the most revealing. On May 6, at its own Code with Claude developer conference, Anthropic announced it would take essentially all the compute at Colossus 1, the Memphis supercomputer built by xAI and now owned by the merged SpaceX entity.

Per xAI’s release, Colossus 1 houses over 220,000 Nvidia GPUs spanning H100, H200, and GB200 accelerators, roughly 300 megawatts. The timing was not accidental: with usage of xAI’s own Grok having dropped, Colossus 1 sat underused, which is what freed its full capacity for Anthropic, per TechCrunch’s reading of the S-1.

Anthropic’s chief compute officer Tom Brown said on the record that the company would expand onto Nvidia GB200 capacity in the larger Colossus 2 through June, per Axios. TechCrunch later reported, from SpaceX’s S-1, that Anthropic will pay $1.25 billion per month through May 2029, a deal that could bring the Musk entity over forty billion dollars in revenue, with either side able to terminate on ninety days’ notice.

The arrangement is pointed at consumer capacity, directly improving Claude Pro and Claude Max. The irony is thick enough to cut: Anthropic, the lab founded by OpenAI defectors, is now renting its consumer compute from Elon Musk, who wrote on X in February that the company hates Western civilization, per CNBC. Business is business.

The hedge, in gigawatts. Roughly 1.5 to 2 GW is live today; the bars show contracted ceilings. A 1 GW data center costs near $50B, of which about $35B is chips (CNBC). Sources: Anthropic, Amazon, Google, xAI, Broadcom filings, TechCrunch.

Chips, racks, and the CUDA moat

If you read only one section as an engineer, read this one and the next. The headline question is simple to state and hard to answer: can custom silicon actually serve a frontier model, or is Nvidia’s lead structural?

Anthropic is the live experiment, and the early data is more interesting than either camp admits.

Start at the chip. AWS shipped Trainium3 in December 2025 on TSMC’s 3nm N3P node, the most advanced process in any shipping AI accelerator, per Tom’s Hardware and the spec compilations at IntuitionLabs and Awesome Agents, reaching broad availability in early 2026.

Each Trainium3 chip delivers about 2.52 petaflops of MXFP8 compute with 144 GB of HBM3e and 4.9 TB/s of memory bandwidth, with eight NeuronCore-v4 engines and a NeuronLink-v4 interconnect at 2 TB/s. AWS claims roughly 2x the per-chip compute of Trainium2, rising to about 4.4x at the 144-chip UltraServer level, with 4x better energy efficiency.

Google’s TPU v7, codenamed Ironwood and announced in 2025, delivers about 4.6 petaflops of FP8 per chip with 192 GB of HBM and 7.37 TB/s of bandwidth, which analysts at Introl described as on par with Blackwell.

Nvidia’s B200, by contrast, delivers roughly 9 petaflops of FP8 per chip with sparsity (about 4.5 dense), per Nvidia’s datasheet. Its headline figure near 18 to 20 petaflops is an FP4 number, a lower-precision format, so comparing like for like at FP8 is the only fair reading.

At the chip level Nvidia leads, but by less than the marketing implies. At FP8, a single B200 with sparsity (~9 PF) is roughly 3.6x a Trainium3 die (2.52 PF) and under 2x a TPU v7 (4.61 PF); at dense FP8 the B200 (~4.5 PF) is level with the TPU, if anything a hair behind it. The often-quoted 8x gap compares B200 FP4 against Trainium3 FP8, which is not like for like. Sources: AWS, Google, Nvidia datasheet via Civo, CudoCompute, Tom’s Hardware.

Now move up one level, to the rack, where AI is actually deployed. AWS packs 144 Trainium3 chips into a liquid-cooled Trn3 Gen2 UltraServer that delivers roughly 362 petaflops of FP8, 20.7 TB of HBM3e, and an aggregate 705.6 TB/s of memory bandwidth.

Per Tom’s Hardware and Oplexa’s analysis, that puts the UltraServer essentially level with Nvidia’s flagship GB300 NVL72 at rack scale, at an estimated fifty percent lower cost per workload and roughly forty percent better energy efficiency.

At the rack level, it is a tie, at roughly half the cost per workload. Amazon closes a roughly 3 to 4x per-chip FP8 gap with integration and scale-up fabric. This is the entire competitive argument for custom silicon, and Anthropic is the proof of concept. Source: Oplexa, Tom’s Hardware.
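The chip-level ratios and the rack-level totals above are simple enough to cross-check by hand. A minimal sketch in Python, using only figures quoted in this section; the constant and function names are ours, chosen for illustration, and the 20 PF FP4 headline is the top of the quoted 18 to 20 range.

```python
# Per-chip figures as quoted in the text, in petaflops. Names are illustrative.
B200_FP8_SPARSE_PF = 9.0
B200_FP8_DENSE_PF = 4.5
B200_FP4_HEADLINE_PF = 20.0   # upper end of the quoted 18-20 PF range
TRAINIUM3_FP8_PF = 2.52
TPU_V7_FP8_PF = 4.61


def ultraserver_totals(chips=144, pf_per_chip=2.52, hbm_gb_per_chip=144,
                       bw_tbs_per_chip=4.9):
    """Aggregate a rack by multiplying per-chip figures by the chip count."""
    return {
        "fp8_pf": chips * pf_per_chip,
        "hbm_tb": chips * hbm_gb_per_chip / 1000,   # GB -> TB (decimal)
        "bandwidth_tbs": chips * bw_tbs_per_chip,
    }


if __name__ == "__main__":
    # Like-for-like FP8 comparisons, then the mismatched FP4-vs-FP8 one.
    print(f"B200 sparse FP8 / Trainium3: {B200_FP8_SPARSE_PF / TRAINIUM3_FP8_PF:.2f}x")
    print(f"B200 sparse FP8 / TPU v7:    {B200_FP8_SPARSE_PF / TPU_V7_FP8_PF:.2f}x")
    print(f"B200 dense FP8 / TPU v7:     {B200_FP8_DENSE_PF / TPU_V7_FP8_PF:.2f}x")
    print(f"B200 FP4 / Trainium3 FP8:    {B200_FP4_HEADLINE_PF / TRAINIUM3_FP8_PF:.1f}x")
    for name, value in ultraserver_totals().items():
        print(f"UltraServer {name}: {value:.1f}")
```

The rack totals are plain multiplication: nothing in the 362 PF, 20.7 TB, or 705.6 TB/s figures reflects interconnect efficiency, which is why the fabric matters so much to whether those aggregates are reachable in practice.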

This is the crux that most coverage misses. The custom-silicon competition is not happening at the transistor. It is happening at the system and at the dollar.

Amazon and Google close a brutal single-chip deficit through dense integration, liquid cooling, and proprietary scale-up fabrics: NVLink at Nvidia, ICI at Google, NeuronLink at AWS. Once you are buying racks rather than chips, and once you weight by cost and power rather than raw flops, the gap collapses.

Memory is the other half of the story, and it is the half that governs inference. As we will show in detail, modern serving is memory-bandwidth-bound, not compute-bound, which is why the per-token cost curve is so sensitive to HBM.

Here the three are closer than the compute numbers suggest, and Nvidia’s lead is narrower.

On the metric that decides inference economics, the field is tight. HBM capacity per chip runs 144 GB (Trainium3), 192 GB (TPU v7), and 288 GB (Nvidia B300). For memory-bound serving, a spread of 2x in capacity, against the 3 to 4x gap in per-chip FP8 compute, is why non-Nvidia inference is viable. Source: vendor specs via Tom’s Hardware, IntuitionLabs.

Inside the die: systolic arrays versus SIMT

The reason a custom accelerator can rival a GPU it trails on paper comes down to dataflow.

Each Trainium3 chip carries eight NeuronCore-v4 engines, and each engine is built around systolic arrays: a 128 by 128 grid for BF16 and a wider 512 by 128 grid for MXFP8, backed by 32 MiB of on-core SRAM. A systolic array is the opposite of a general-purpose processor.

Weights are loaded into the grid and held stationary while activations pulse through it, so a value read once from SRAM is reused across an entire row or column of multiply-accumulate units before it retires.

On the dense matrix multiplies that dominate a transformer, that weight-stationary dataflow keeps the multiply-accumulate units near full occupancy while spending almost nothing on instruction fetch, register-file traffic, or cache coherence. Google’s TPU is the same idea at a different size: one very large matrix-multiply unit fed by a compiler that schedules the entire computation ahead of time.
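The weight-stationary dataflow is easier to see in a toy model than in prose. The sketch below is a pure-Python schedule simulation, not a description of NeuronCore or TPU internals: it assumes a K by N grid in which the processing element at (k, n) holds one weight, activations enter skewed by one cycle per row, and row m of the input reaches that element at cycle m + k + n. The function name and the returned statistics are ours.

```python
def weight_stationary_matmul(X, W):
    """Compute Y = X @ W following a weight-stationary systolic schedule.

    X is M x K (activations), W is K x N (weights, one per processing element).
    Returns the result and simple reuse/occupancy statistics.
    """
    M, K, N = len(X), len(W), len(W[0])
    Y = [[0] * N for _ in range(M)]
    macs = 0
    total_cycles = M + K + N - 2          # last MAC fires at (M-1)+(K-1)+(N-1)
    for t in range(total_cycles):
        for k in range(K):
            for n in range(N):
                m = t - k - n             # input row arriving at PE(k, n) this cycle
                if 0 <= m < M:
                    Y[m][n] += X[m][k] * W[k][n]   # W[k][n] never moves
                    macs += 1
    stats = {
        "weight_loads": K * N,            # each weight is loaded once, reused M times
        "activation_reads": M * K,        # each activation is read once, reused N times
        "macs": macs,
        "cycles": total_cycles,
        "pe_utilization": macs / (total_cycles * K * N),
    }
    return Y, stats


if __name__ == "__main__":
    X = [[1, 2], [3, 4], [5, 6]]
    W = [[1, 0, 2], [0, 1, 3]]
    Y, stats = weight_stationary_matmul(X, W)
    print(Y)
    print(stats)
```

In this toy model occupancy is low for a tiny input because the grid spends most cycles filling and draining. Streaming many more activation rows than the grid has rows and columns pushes occupancy toward one, which is the regime a large transformer matrix multiply operates in.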

Nvidia’s Blackwell takes the other path. Its streaming multiprocessors are SIMT machines: thousands of threads grouped into warps, with tensor cores doing the matrix math and a deep hierarchy of schedulers, register files, and caches feeding them.

That flexibility is the point. A GPU runs irregular control flow, dynamic shapes, sparsity, and an enormous library surface, on hardware that was never specialized to one operator. The cost of that generality is silicon and power spent on everything that is not the multiply. For the long tail of workloads, the flexibility earns its keep. For a transformer decode loop, much of it sits idle.

This is the precise reason the rack-level parity in the charts above is real rather than a marketing artifact. A frontier lab runs essentially one workload shape, the transformer, and writes its own kernels, so it does not need most of what a GPU spends transistors on, and it can drive a systolic array to a utilization a general user could never reach.

The cost it pays is the programming model. Fifteen years of CUDA, of PTX and SASS-level tuning, of cuDNN and CUTLASS and a developer base in the millions, has no equivalent on the other side. AWS answers with the Neuron SDK and its NKI kernel interface plus JAX and PyTorch support; Google answers with XLA, JAX, and Pallas.

A team with kernel engineers can reach high utilization on any of the three. An enterprise that only wants a model behind an API cannot, which is why the moat protects Nvidia at the bottom of the market and erodes at the very top, where Anthropic operates.

The last equalizer is the fabric. A single chip never serves a frontier model alone, so what matters is how fast many chips behave as one. Nvidia’s NVLink 5 moves about 1.8 terabytes per second per GPU through an NVSwitch fabric; AWS NeuronLink-v4 moves about 2 terabytes per second per chip; Google’s ICI wires its pods into a three-dimensional torus.

Tensor parallelism forces every chip to exchange a slice of activations on every layer, and expert parallelism in a mixture-of-experts model adds an all-to-all shuffle of tokens to their chosen experts, so scale-up bandwidth, not raw per-chip flops, is what lets 144 Trainium3 act like one giant accelerator with 706 terabytes per second of aggregate memory bandwidth. The per-chip FLOPS gap is what the slides show. The scale-up fabric is what the workload feels.

So why has nobody else pulled this off? Because the real moat was never the silicon. It is the software. Nvidia’s CUDA is fifteen years of libraries, kernels, compilers, and developer muscle memory.

AWS counters with the Neuron SDK and JAX and PyTorch support; Google has its own mature stack. For the long tail of enterprises, porting off CUDA is a non-starter: the engineering cost dwarfs the hardware savings.

But a frontier lab is the one customer that can pay that cost, because it writes its own kernels, owns its own stack, and has the systems talent to make a non-CUDA chip productive.

That is precisely why Anthropic is the perfect partner to validate custom silicon, and why Amazon and Google paid tens of billions in equity to make it their anchor tenant. Anthropic reportedly fed design input directly into Trainium3, per Awesome Agents.

The lab is not just renting the chips. It is co-designing the thing meant to dethrone the incumbent.

One quiet beneficiary deserves a name: Broadcom. It co-designs Google’s TPU, supplies the connectivity silicon that stitches these racks together, and, per The Motley Fool citing Broadcom’s own disclosures, booked roughly a ten-billion-dollar TPU order plus an additional eleven billion in hardware tied to Anthropic.

In a gold rush, the firm selling the most shovels to the most miners is often the cleaner trade. Hold that for the market section.

Caveat where it is due. Nvidia’s roadmap does not stand still: B200 and GB200 are reportedly sold out through mid-2026 against a backlog near 3.6 million units, per IntuitionLabs, and Vera Rubin, due in the second half of 2026 with HBM4 and a Rubin NVL144 rack at 3.6 exaflops of dense FP4, extends the lead at the top.

The near-term threat to Nvidia is not revenue. Demand still dwarfs supply. The threat is the long-run margin structure that depends on hyperscalers having no realistic alternative. Anthropic’s three-silicon bet is the clearest signal yet that the alternative is becoming real.

The cost of a token

The bull case for Anthropic’s margin is a claim about physics and software, not accounting. To judge it you have to understand how a single token is actually produced, and why the cost of producing it is collapsing roughly tenfold a year.

This is the section the financial press cannot write and the one engineers care about most.

Two phases, one bottleneck

Serving a language model has two phases with opposite cost structures. Prefill ingests the prompt: every input token is processed in parallel, the arithmetic is dense, and the accelerator runs near its compute ceiling. Prefill is compute-bound, and it is cheap per token because parallelism is high.

Decode generates the answer one token at a time, autoregressively, and here is the trap: to produce each new token, the hardware must stream the model’s entire active weight set out of high-bandwidth memory.

Decode is therefore memory-bandwidth-bound, not compute-bound. As the Inworld and Spheron benchmark teardowns put it, reading model weights during decode is the primary bottleneck for autoregressive generation.

This single fact is why the memory-bandwidth chart in the previous section matters more than the flops chart, and why a Trainium3 or a TPU v7, which trail Nvidia badly on raw flops but sit within striking distance on bandwidth, can serve inference competitively. The frontier is not compute-starved. It is bandwidth-starved.

Two structures sit on top of this. The KV cache stores the attention keys and values for every prior token, so it grows with context length and with batch size, and it competes for the same scarce HBM capacity and bandwidth; long-context serving is expensive precisely because, as CloudRift’s benchmarks note, it stresses KV-cache traffic.

And batching is the lever that makes serving economic at all: by processing many requests together, each expensive weight-read from HBM is amortized across many sequences, so throughput, and therefore cost per token, depends enormously on how full the batch is.

GMI Cloud’s teardown makes the point concrete: an H100 running Llama 70B in FP8 generates roughly two to three thousand tokens per second at batch 32, about $0.19 to $0.29 per million output tokens, and the same GPU at fifty percent utilization sees its effective cost per token roughly double. Utilization is not a footnote. It is half the unit economics.
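The utilization point reduces to one line of arithmetic. A minimal sketch, assuming an hourly GPU rate of $2.10: that rate is not stated in the source and is simply the value that reproduces the quoted $0.19 to $0.29 range at two to three thousand tokens per second. The function name is ours.

```python
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_sec, utilization=1.0):
    """Dollar cost per million output tokens for one GPU.

    utilization is the fraction of each rented hour spent producing tokens
    at the stated rate; idle time is paid for but yields nothing.
    """
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    ASSUMED_HOURLY_USD = 2.10   # assumption: back-solved from the quoted range
    for tps in (2000, 3000):
        full = cost_per_million_tokens(ASSUMED_HOURLY_USD, tps)
        half = cost_per_million_tokens(ASSUMED_HOURLY_USD, tps, utilization=0.5)
        print(f"{tps} tok/s: ${full:.2f}/M at 100% utilization, ${half:.2f}/M at 50%")
```

Because cost scales with the inverse of utilization, a fleet running at fifty percent pays exactly twice as much per token as the same fleet kept full, whatever the hourly rate turns out to be.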

Three levers that crush cost

An academic study circulating on arXiv this year, modeling what it calls a tiered Super-Moore effect, decomposes inference cost into hardware, labor, and a technology index that captures architectural innovation. Its key finding is that the technology index has improved far faster than the hardware alone. Three levers do most of the work.

Quantization. Moving the numerical format of the weights from FP16 to FP8 to FP4 or INT4 halves the bytes per parameter at each step. Because decode is bandwidth-bound, fewer bytes per weight means more tokens per second on the same silicon, almost linearly.

Spheron measures FP8 cutting effective cost per token by roughly half on H100 and H200 by doubling throughput with no extra GPUs; Blackwell’s FP4 tensor cores push it further still. Quantization is close to a free lunch until model quality degrades, and the frontier labs have become expert at quantizing right up to that line.

Mixture of experts. A dense model activates all its parameters for every token. A mixture-of-experts model routes each token through only a small subset.

The arXiv study quantifies the canonical example: DeepSeek’s architecture carries 671 billion total parameters but activates only 37 billion per token, an eighteen-fold reduction in per-token compute with no proportionate quality loss, and it operates independently of any hardware trend.

MoE is the single largest architectural reason a frontier-class answer is no longer a frontier-class expense.

Algorithmic serving. FlashAttention removed the memory bottleneck inside the attention kernel; speculative decoding, which drafts several tokens with a small model and verifies them with the large one, cuts latency two to three times, per Introl; continuous batching and paged KV caches keep utilization high. None of these require new chips. They are software, and software ships continuously.

The hardware levers compound on top: Blackwell’s B200 carries 2.4 times the memory bandwidth of an H100 and enough capacity (192 GB) to hold models up to roughly 96 billion parameters in FP16, or 192 billion in FP8, on a single GPU, which removes tensor-parallel communication overhead entirely, per Inworld.

Fewer cross-GPU hops means lower latency and lower cost at once.
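The single-GPU capacity figures follow directly from bytes per parameter. A minimal sketch under a simplifying assumption: weights are the only thing in memory unless a reserve is set aside, whereas a real deployment also needs room for the KV cache and activations. The helper names and the reserve parameter are ours.

```python
# Bytes per parameter at each precision, as described in the text.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def max_params_billions(hbm_gb, precision):
    """Largest model (billions of parameters) whose weights alone fill hbm_gb."""
    return hbm_gb / BYTES_PER_PARAM[precision]


def fits_on_one_chip(params_billions, hbm_gb, precision, reserve_fraction=0.0):
    """True if the weights fit after reserving a fraction of HBM for cache etc."""
    usable_gb = hbm_gb * (1.0 - reserve_fraction)
    return params_billions * BYTES_PER_PARAM[precision] <= usable_gb


if __name__ == "__main__":
    for precision in ("fp16", "fp8", "fp4"):
        print(precision, max_params_billions(192, precision), "B params in 192 GB")
    # A 70B model in FP16 needs 140 GB: it fits in 192 GB but not in 80 GB.
    print(fits_on_one_chip(70, 192, "fp16"), fits_on_one_chip(70, 80, "fp16"))
```

The 96 billion and 192 billion figures are ceilings with zero headroom. Setting reserve_fraction to something like 0.3 gives a more realistic picture of what fits once serving state is accounted for.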

The arithmetic of serving

Make the bottleneck quantitative, because the precise numbers are what justify the margin. During decode, producing one token requires reading every active parameter out of high-bandwidth memory exactly once, so the single-stream token rate has a hard ceiling set by bandwidth, not by compute:

tokens/sec ≈ HBM bandwidth ÷ ( bytes-per-parameter × active parameters )

The numbers are unforgiving. A dense 70-billion-parameter model in FP8 must stream 70 GB per token; an H100 at 3.35 TB/s tops out near 48 tokens per second on a single stream, a B200 at 8 TB/s near 114.

A 405-billion-parameter model falls into the single digits on an H100 and to roughly 20 on a B200. This is why latency-sensitive, single-user decoding feels slow on the largest dense models no matter how many teraflops the chip advertises: those teraflops are not the binding constraint.

The latency floor. Single-stream decode is bounded by HBM bandwidth divided by the bytes streamed per token, so the largest dense models emit only single-digit to low-hundreds of tokens per second per accelerator before batching. Batching raises aggregate throughput; it does not raise this single-stream ceiling. H100 at 3.35 TB/s, B200 at 8 TB/s, FP8.
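The bandwidth ceiling can be computed directly from the formula above. A minimal sketch using the bandwidth figures quoted in the text; the function and dictionary names are ours, and the model ignores KV-cache reads, so real single-stream rates sit below these ceilings.

```python
# Bytes per parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def decode_ceiling_tokens_per_sec(hbm_tb_per_s, active_params_billions,
                                  precision="fp8"):
    """Upper bound on single-stream decode rate.

    tokens/sec = HBM bandwidth / (bytes per parameter * active parameters)
    """
    bytes_per_token = active_params_billions * 1e9 * BYTES_PER_PARAM[precision]
    return hbm_tb_per_s * 1e12 / bytes_per_token


if __name__ == "__main__":
    chips = {"H100": 3.35, "B200": 8.0}          # TB/s, as quoted in the text
    for params in (70, 405):
        for name, bandwidth in chips.items():
            rate = decode_ceiling_tokens_per_sec(bandwidth, params)
            print(f"{params}B dense FP8 on {name}: {rate:.1f} tok/s ceiling")
    # Dropping to FP4 halves the bytes streamed and doubles the ceiling.
    print(decode_ceiling_tokens_per_sec(3.35, 70, "fp4"))
```

The same function covers quantization and mixture-of-experts: lowering bytes per parameter or the active parameter count raises the ceiling in direct proportion, which is why both levers show up so clearly in decode cost.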

The escape is batching. Because every sequence in a batch reuses the same weight read, serving 256 requests together multiplies aggregate throughput by roughly 256 with no additional weight traffic, until the KV cache or the compute ceiling intervenes.

Plotted on a roofline, decode lives far down the memory-bound slope at an arithmetic intensity near one, while prefill and training sit against the compute ceiling.

The ridge point, where a workload stops being memory-bound and becomes compute-bound, falls near 560 to 590 FLOP per byte on both H100 and B200, and decode runs one to two orders of magnitude below it.

Why decode leaves FLOPs on the table. Below the ridge point, throughput is capped by bandwidth, so adding compute does nothing; only more bandwidth or fewer bytes per token help. Prefill and training sit against the compute ceiling, where FLOPs matter. Dense FP8 peak: H100 ~1.98 PF, B200 ~4.5 PF.
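The ridge point and the effect of batching can be reproduced from the peak and bandwidth figures quoted above. The sketch below rests on a deliberately crude assumption of ours: decode performs about two FLOPs (one multiply, one add) per parameter per sequence, weights are read once per step regardless of batch size, and KV-cache traffic is ignored. Under that model arithmetic intensity is simply 2 × batch ÷ bytes per parameter.

```python
def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """Arithmetic intensity (FLOP/byte) where memory-bound meets compute-bound."""
    return peak_flops / bandwidth_bytes_per_s


def decode_intensity(batch_size, bytes_per_param=1.0):
    """Assumed decode intensity: 2 FLOPs per parameter per sequence in the batch."""
    return 2.0 * batch_size / bytes_per_param


def attainable_flops(peak_flops, bandwidth_bytes_per_s, intensity):
    """Classic roofline: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, bandwidth_bytes_per_s * intensity)


if __name__ == "__main__":
    chips = {"H100": (1.98e15, 3.35e12), "B200": (4.5e15, 8.0e12)}  # dense FP8, B/s
    for name, (peak, bandwidth) in chips.items():
        ridge = ridge_point(peak, bandwidth)
        crossover_batch = ridge / decode_intensity(1)
        print(f"{name}: ridge {ridge:.0f} FLOP/byte, "
              f"compute-bound past batch ~{crossover_batch:.0f} in this model")
        for batch in (1, 32, 256):
            used = attainable_flops(peak, bandwidth, decode_intensity(batch))
            print(f"  batch {batch}: {100 * used / peak:.1f}% of peak FLOPs reachable")
```

In this simplified model a batch in the high two hundreds is where FP8 decode would reach the compute roof on either chip. In practice the KV cache usually caps the batch well before that, so the crossover should be read as an upper bound rather than an operating point.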

This reframes which efficiency metric is the right one. For prefill and training, the number that matters is model FLOPs utilization, MFU, typically 35 to 50 percent on a well-tuned cluster.

For decode, FLOPs utilization is nearly meaningless because the tensor cores are starved of data; the metric that matters is memory-bandwidth utilization, MBU, and the engineering goal is to keep HBM busy, not the math units. Almost every serving optimization that matters, from paged attention to continuous batching, is at bottom a scheme to raise MBU.

Quantization attacks the bytes. Halving the bytes per parameter halves the bandwidth bill per token and so roughly doubles decode throughput. The frontier has marched down the precision ladder accordingly: FP16 and BF16 at two bytes, FP8 at one, FP4 and INT4 at half a byte, with the KV cache itself increasingly stored in FP8 to stretch context.

Blackwell’s tensor cores are built for FP4; Trainium3’s wider array is built for MXFP8 microscaling. The binding constraint is quality: too-aggressive quantization raises perplexity and degrades reasoning, so labs quantize weights and cache hard while protecting the few numerically sensitive layers.

Precision is bandwidth. Each step down the ladder halves the bytes streamed per token, and because decode is bandwidth-bound, roughly doubles throughput. The frontier now runs weights in FP8 or FP4 and stores the KV cache in FP8, quantizing right up to the point where quality breaks.
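The precision ladder translates into throughput mechanically. A short sketch under the same bandwidth-bound assumption, using a 70B dense model on an H100 as an illustrative case (the model size and function name are assumptions; the bytes-per-parameter values are the ones listed above):

```python
BYTES_PER_PARAM = {"FP16/BF16": 2.0, "FP8": 1.0, "FP4/INT4": 0.5}


def decode_ceiling_by_precision(params: float, bandwidth: float) -> dict:
    """Single-stream tokens/s ceiling at each weight precision (bandwidth-bound)."""
    return {name: bandwidth / (params * width) for name, width in BYTES_PER_PARAM.items()}


ladder = decode_ceiling_by_precision(params=70e9, bandwidth=3.35e12)  # H100 HBM
for name, tokens_per_second in ladder.items():
    print(f"{name:>10}: {tokens_per_second:6.1f} tok/s ceiling")
```

Each rung doubles the ceiling only while decode stays bandwidth-bound and quality holds, which is the constraint the text flags.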

The KV cache is the tax that limits batching. Attention must keep the key and value vectors of every prior token, and that cache grows linearly with both context length and batch size, competing with the weights for the same HBM.

For a 70-billion-parameter model with grouped-query attention, the cache costs roughly 0.33 MB per token, so a single million-token context consumes more than 340 GB, beyond what a B200 holds. The cache, not the weights, is usually what caps how large a batch can run, and batch size is what sets cost per token, so KV-cache management is the hinge of serving economics.

Grouped-query and multi-query attention shrink it by sharing key and value heads; paged attention stops it from fragmenting memory; FP8 storage halves it again.
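The per-token figure follows from the model's shape. The sketch below assumes a Llama-70B-style configuration (80 layers, 8 key/value heads, head dimension 128), which is not stated in the text but reproduces its roughly 0.33 MB per token at FP16; the 8,192-token context and FP8 weights in the batch estimate are likewise hypothetical inputs chosen for illustration.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """KV-cache bytes per token: one key and one value vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_batch(hbm_bytes: float, weight_bytes: float, context_len: int, per_token: int) -> int:
    """Sequences that fit in the HBM left over after the weights are loaded."""
    return int((hbm_bytes - weight_bytes) // (context_len * per_token))


ASSUMED_SHAPE = dict(layers=80, kv_heads=8, head_dim=128)  # assumption, not from the text
B200_HBM = 192e9  # bytes

fp16_per_token = kv_bytes_per_token(**ASSUMED_SHAPE, bytes_per_value=2)
fp8_per_token = kv_bytes_per_token(**ASSUMED_SHAPE, bytes_per_value=1)

print(f"FP16 cache: {fp16_per_token / 1e6:.3f} MB per token")
print(f"2^20-token context: {fp16_per_token * 2**20 / 1e9:.0f} GB vs B200 {B200_HBM / 1e9:.0f} GB")

weights_fp8 = 70e9  # 70B parameters at one byte each
for label, per_token in (("FP16 cache", fp16_per_token), ("FP8 cache", fp8_per_token)):
    batch = max_batch(B200_HBM, weights_fp8, context_len=8192, per_token=per_token)
    print(f"{label}: about {batch} sequences of 8,192 tokens fit beside the weights")
```

Halving the cache precision roughly doubles the batch that fits, which is why cache format shows up directly in cost per token.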

The cache outgrows the chip. For a 70B-class GQA model the KV cache costs ~0.33 MB per token and grows linearly with context, overtaking a B200’s entire 192 GB well before a single sequence reaches a million tokens. This, more than the weights, is what caps batch size, and batch size sets cost per token.

Speculative decoding attacks the sequential dependency. A small draft model proposes several tokens and the large model verifies them in one forward pass, accepting the longest correct prefix.

With a per-token draft acceptance probability near 0.7 and four drafted tokens, the expected number confirmed per verification step is about (1 - 0.7^5) ÷ 0.3 ≈ 2.8, a two- to three-times speedup before draft overhead.

The large model still does the same total work per accepted token; what changes is that the work happens in parallel instead of one token at a time, which is exactly what a memory-bound loop needs.
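The expected-tokens figure is a finite geometric series and can be verified directly. A minimal sketch, assuming each drafted token is accepted independently with the same probability (a simplification; real acceptance rates vary by position and content). The function name is illustrative.

```python
def expected_tokens_per_step(accept_prob: float, drafted: int) -> float:
    """Expected tokens confirmed per verification pass.

    Sum of accept_prob**k for k = 0..drafted: the k = 0 term is the one token
    the large model always contributes, and each further term is the chance
    that the first k drafted tokens were all accepted.
    """
    if accept_prob == 1.0:
        return float(drafted + 1)
    return (1 - accept_prob ** (drafted + 1)) / (1 - accept_prob)


closed_form = expected_tokens_per_step(0.7, 4)
term_by_term = sum(0.7 ** k for k in range(5))
print(f"closed form: {closed_form:.3f}, summed series: {term_by_term:.3f}")

for drafted in (2, 4, 8):
    print(f"{drafted} drafted tokens -> {expected_tokens_per_step(0.7, drafted):.2f} per step")
```

The returns flatten quickly as more tokens are drafted, since the limit at an acceptance probability of 0.7 is 1 ÷ 0.3, about 3.3 tokens per step.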

Mixture of experts attacks the parameter count. A dense model pays for all its parameters on every token; a mixture-of-experts model routes each token to a small subset, so per-token compute scales with active parameters, not total. DeepSeek’s architecture carries 671 billion parameters but activates 37 billion per token, an 18-fold cut in both the FLOPs and, decisively, the bytes streamed during decode.

The catch is twofold: all 671 billion parameters must still sit in HBM, a capacity tax that demands many chips, and routing tokens to experts requires an all-to-all step that leans on the scale-up fabric from the previous section.

MoE is the single largest reason a frontier-class answer is no longer a frontier-class expense, and it is why the dense decode ceiling above understates a well-built model: at 37 billion active parameters that same H100 serves roughly 90 tokens per second single-stream rather than five.
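Both sides of the MoE trade, fewer bytes streamed but the same bytes stored, fall out of the same arithmetic. A sketch using the parameter counts from the text at FP8; the 80 GB per-H100 capacity is an assumption for the capacity estimate (it matches the common H100 SKU but is not stated in the article), and the single-chip decode figure follows the text's own simplification.

```python
import math

TOTAL_PARAMS = 671e9    # all experts, must be resident in HBM
ACTIVE_PARAMS = 37e9    # parameters touched per token
H100_BANDWIDTH = 3.35e12  # bytes per second
H100_HBM = 80e9         # bytes; assumed SKU capacity, not from the text
BYTES_PER_PARAM = 1.0   # FP8

reduction = TOTAL_PARAMS / ACTIVE_PARAMS
dense_ceiling = H100_BANDWIDTH / (TOTAL_PARAMS * BYTES_PER_PARAM)
moe_ceiling = H100_BANDWIDTH / (ACTIVE_PARAMS * BYTES_PER_PARAM)
min_chips = math.ceil(TOTAL_PARAMS * BYTES_PER_PARAM / H100_HBM)

print(f"bytes streamed per token cut by {reduction:.1f}x")
print(f"single-stream ceiling: {dense_ceiling:.1f} tok/s dense vs {moe_ceiling:.1f} tok/s MoE")
print(f"chips needed just to hold the weights: {min_chips} (before any KV cache)")
```

The bandwidth bill scales with active parameters while the capacity bill scales with total parameters, which is the twofold catch described above.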

Stack these together, quantization halving bytes, mixture-of-experts cutting active parameters by an order of magnitude, speculative decoding parallelizing the sequence, batching amortizing every weight read, and cheaper bandwidth each hardware generation on top, and the roughly tenfold annual fall in the cost of a token stops looking like magic and starts looking like arithmetic.

That is the engine underneath the margin bridge that follows.

Roughly 10x a year, faster than PC compute or dot-com bandwidth ever fell (Introl). GPT-4-class output dropped from about $20 per million tokens in late 2022 to about $0.40 by late 2025. Endpoints documented (Introl); intermediate points trace the stated trend. The price of economy-tier quality fell about 600x since 2020, from the $60 GPT-3 API to roughly $0.10 today (arXiv).

A generational chip cut inference cost roughly 7x. Inworld measures about $0.02 per million tokens on B200 versus about $0.14 on H100, because throughput gains outpace the rental premium. Other teardowns put H100 frontier serving at $0.19 to $0.29 fully utilized, doubling at half utilization (GMI). Figures are config-dependent; treat as directional.
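Why cost doubles at half utilization is visible in the unit conversion from rental rate to cost per token. A minimal sketch: the $2.50 hourly rate and 3,000 tokens per second aggregate throughput are hypothetical inputs chosen only to land inside the range quoted above, not figures from any of the cited teardowns.

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float,
                            utilization: float) -> float:
    """Serving cost in USD per million tokens for one rented accelerator."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_rate_usd / tokens_per_hour * 1e6


HOURLY_RATE = 2.50      # hypothetical GPU rental rate, USD per hour
THROUGHPUT = 3000.0     # hypothetical aggregate tokens per second at full load

for utilization in (1.0, 0.5, 0.25):
    cost = cost_per_million_tokens(HOURLY_RATE, THROUGHPUT, utilization)
    print(f"utilization {utilization:.0%}: ${cost:.3f} per million tokens")
```

The rental bill is fixed per hour, so every idle second is spread over fewer tokens; this is also the channel through which a higher rental rate raises the cost floor.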

What it costs Anthropic to make a token

We can now build the cost side from the metal up. The exercise is illustrative, the assumptions are stated plainly, and the point is the order of magnitude, not a false-precision number.

Now the other side of the ledger.

We do not know Anthropic’s blended realized price per token, because the filing is confidential, and that is the honest limit of this analysis.

What we do know: frontier output has historically listed in the five-to-fifteen-dollar range per million tokens, and Anthropic’s most capable model, the withheld Mythos preview, was priced at twenty-five dollars per million input tokens and one hundred twenty-five dollars per million output, per Sacra (a single-source figure, indicative rather than confirmed).

Even after a generous markup for the true cost of a genuine frontier model over a benchmark mid-size one, the gross spread between a cost-to-serve measured in cents and a realized price measured in dollars is wide.

That spread is the gross margin. And because the cost side falls about tenfold a year while realized prices fall more slowly, the spread widens with time. This is the physical mechanism behind the projected march from forty-five to seventy-seven percent.

Nvidia, naturally, has quantified the same loop from the supplier’s side. In its InferenceMax v1 results, the company claims a single GB200 NVL72 turns a five-million-dollar investment into roughly seventy-five million dollars of DeepSeek-R1 token revenue, a fifteen-fold return, what it calls AI-factory economics.

Treat the figure as a vendor benchmark on an idealized model, but the direction is the entire bull thesis in one number: at current token prices, a frontier accelerator generates a multiple of its cost in sellable output.

How -94% becomes +77%. Fixed-cost leverage (revenue scaling against a roughly fixed capacity base, per SaaStr) does the heavy lifting; token deflation (quantization, MoE, cheaper silicon) and a mix shift toward high-value enterprise and Claude Code finish it. Step sizes are an illustrative attribution to named mechanisms, not audited figures; the 2024, 2025, and 2028E levels are the reported and projected anchors.

The technical bottom line

The cost half of Anthropic’s margin equation is governed by mechanisms we can see and that compound predictably: quantization, mixture-of-experts routing, algorithmic serving gains, and cheaper bandwidth per token, together delivering roughly an order of magnitude of cost reduction per year. That is why a 77% gross margin is physically plausible rather than fantastical.

The risk lives on the price half, which we cannot see. If open-weight competition (DeepSeek, Llama) and rival labs compress realized prices as fast as cost falls, the spread does not widen and the margin thesis stalls.

The bull is betting cost falls faster than price.

The bear is betting price falls to meet cost. The confidential S-1 hides exactly the number, realized revenue per token, that would settle it.

The infinite loop

Look again at the last two columns of the capital-and-compute table. Amazon is investing up to thirty-three billion into Anthropic; Anthropic is committing to spend over one hundred billion with Amazon.

Google is investing up to forty billion; Anthropic is spending tens of billions with Google. The capital flows out as equity and comes back as revenue.

CNBC said it plainly in its coverage of the Google deal: much of the investment will return in the form of revenue. This is the circular-financing question, and it is the single most important structural issue hanging over all three IPOs.

It is not unique to Anthropic. It is basically the operating system of the entire cycle.

The closed loop. The same firms that supply the silicon and the cloud also own the equity. Capital invested as equity returns as committed revenue, which is then reported as growth. Figures are deal maxima, not all drawn. Sources: Global Data Center Hub, TechCrunch, CNBC, Reuters.

The canonical example is on the OpenAI side. In September 2025 Nvidia announced it would invest up to one hundred billion dollars in OpenAI to fund a data center buildout equipped with, naturally, Nvidia chips.

Bernstein’s Stacy Rasgon wrote, per Business Standard, that the move would clearly fuel circular concerns. By March 2026, per BlockEden’s account, Jensen Huang was telling investors that thirty billion might be the last such investment and that the full hundred billion was not in the cards.

Nvidia also committed up to ten billion to Anthropic, which CFO Colette Kress noted could further expand the company’s bookings, a sentence that contains the whole critique in miniature: the investment expands the bookings of the company making the investment.

The scale of the web is staggering. BlockEden tallied OpenAI’s infrastructure commitments at roughly $1.15 trillion across seven vendors between 2025 and 2035.

One company, $1.15 trillion in promises. Several of these vendors are also OpenAI investors; AMD reportedly handed OpenAI 10% equity warrants as a customer. The capital and the contracts run in a closed circle. Source: BlockEden compilation of public disclosures.

The defenders are not stupid, and their argument deserves a fair hearing.

Dario Amodei, at the New York Times DealBook Summit in December, argued there is nothing inappropriate in principle about a party with capital and a chip interest funding a party with revenue confidence but no cash on hand.

That is a coherent description of ordinary project finance. But the historical rhyme is hard to ignore, and Bloomberg drew it explicitly: during the late-1990s internet boom, equipment makers fueled the fiber buildout with vendor financing, and when demand failed to arrive on schedule, the roundtripping that had inflated the appearance of demand amplified the collapse.

The mechanism that makes the boom look bigger is the same one that makes the bust deeper. Sequoia’s David Cahn has quantified the implied shortfall: by his framework, the AI complex needs roughly six hundred billion dollars in annual revenue to justify the capex being deployed, and the gap is widening, not closing.

“In this new world of AI, compute is revenues.”

Jensen Huang · Nvidia CEO, on the Q4 FY2026 earnings call, reframing the entire spending debate (Fortune, Benzinga)

That single line is the keystone of the bull architecture, and section VII gave it a number: at current token prices a frontier accelerator throws off a multiple of its cost in sellable output.

Huang’s claim is that capital expenditure converts into compute, compute into tokens, and tokens directly into revenue, so the spending is self-justifying.

Nvidia’s own results give the argument force: record quarterly revenue of $68.1 billion, up roughly 73 percent year over year, with $78 billion guided for the next quarter, and Kress telling investors total AI infrastructure investment could reach three to four trillion dollars annually by 2029 or 2030.

For Anthropic specifically, the loop lands on one phrase, revenue quality. When eighty percent of revenue is enterprise and a meaningful share of capital comes from the same hyperscalers whose clouds Anthropic is committing to, an investor is entitled to ask how much of the forty-seven-billion run-rate is organic demand and how much is the visible end of a closed capital loop.

Nobody outside the company knows. The filing is confidential. That is the second reason confidentiality matters more here than usual.

How do you price a wall?

Valuation is where the rigor either holds or collapses, so let us be careful, bring the public-market context the private headlines leave out, and do the one thing most coverage skips: put the multiple next to its peers.

Anthropic closed its Series G on February 12, 2026: thirty billion dollars raised at a $380 billion post-money valuation, which Perera calculated as roughly twenty-seven times annualized revenue.

By the April tender offer, the reference valuation was $350 billion, at which The Motley Fool noted the multiple had compressed to under twelve times the then-thirty-billion run-rate, simply because revenue had more than doubled while the valuation held roughly flat.

Then, per Let’s Data Science and CNN, Anthropic raised sixty-five billion dollars in May at a $965 billion valuation, surpassing OpenAI’s $852 billion mark for the first time.

The IPO target is above one trillion. Reuters notes the company was valued at just $183 billion as recently as last November.

5x in six months. $183B to a trillion-plus target between November and the IPO. The multiple is not the story; the denominator is moving too fast for any multiple to stay meaningful. Sources: Reuters, Perera, Global Data Center Hub, CNN, Let’s Data Science.

The comp that reframes the question

A trillion dollars sounds insane until you place it beside what the public market already pays for AI growth.

At its IPO target, Anthropic trades near twenty-one times its forty-seven-billion run-rate, and near fourteen times its projected seventy-billion 2028 revenue. Those are not the highest multiples in the AI complex. They are far from it.
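The multiples quoted in this section are plain division, and it helps to see them side by side. A small sketch using only valuation and revenue figures stated in the text, in billions of dollars; the function name and labels are illustrative.

```python
def revenue_multiple(valuation_b: float, revenue_b: float) -> float:
    """Valuation divided by annual revenue, both in billions of dollars."""
    return valuation_b / revenue_b


MARKS = [
    ("April tender, $350B on $30B run-rate", 350.0, 30.0),
    ("IPO target, $1T on $47B run-rate", 1000.0, 47.0),
    ("IPO target, $1T on $70B 2028E revenue", 1000.0, 70.0),
]

for label, valuation_b, revenue_b in MARKS:
    print(f"{label}: {revenue_multiple(valuation_b, revenue_b):.1f}x")
```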

Anthropic’s multiple is mid-pack, despite the fastest growth. Palantir trades near 55x revenue at roughly 70 to 85 percent growth; Databricks near 25x at 65 percent. Anthropic on run-rate sits near 21x while growing on the order of 1,000 percent year over year, and near 14x on 2028E. Multiples for Microsoft, Alphabet, and Amazon are approximate and vary daily. Databricks and Anthropic are private marks. Sources: multiples.vc, heygotrade, SaaStr (PLTR, NVDA, Databricks); company valuations for the Anthropic bars.

That is the reframe a careful analyst owes the reader. On a pure revenue multiple, Anthropic at a trillion dollars is cheaper than Palantir and cheaper than Databricks, while growing many times faster than either. The naive objection, the multiple is absurd, does not survive contact with the comp set.

The serious objections are about durability and quality, and there are three honest frames.

The bull frame is forward margin. If you believe the seventy-seven-percent gross margin of section VII arrives, and the company’s projection of roughly seventy billion in revenue and seventeen billion in cash flow by 2028 (Sacra), then a trillion dollars is about fourteen times 2028 revenue on the fastest-growing, highest-margin software asset ever built. Rich, but not obviously mispriced for the category leader.

The bear frame is present reality. Right now the margin is closer to forty-five percent, the company does not expect to stop burning cash until 2027 (Sacra), and the cloud bill runs to roughly eighty billion dollars through 2029. At present economics, a trillion dollars prices a future that has not arrived as though it already has, and the comp multiples assume the growth rate persists for years, which no company in history has sustained.

The skeptic frame is the one I find most useful. We are pricing the largest IPO in a generation on a confidential filing. One quiet signal cuts against the skepticism: The Motley Fool reported that when Anthropic invited long-tenured employees to sell at the $350 billion valuation, they chose to hold far more than expected. The people with the most information declined liquidity at $350 billion, and the May round then priced at nearly three times that. Insider behavior is not proof, but it is data, and here it points the same way the revenue does. Up.

How to actually get exposure

For investors who cannot buy private shares, the cleanest listed proxies are mechanical. Amazon carries up to a thirty-three-billion-dollar stake plus the AWS revenue Anthropic is committing to.

Alphabet holds roughly fourteen percent plus Google Cloud’s TPU revenue; Google Cloud grew about sixty-three percent year over year to a roughly twenty-billion-dollar quarterly run-rate, the fastest of the big three.

Broadcom is the picks-and-shovels play through TPU co-design and connectivity. Nvidia sits at the keystone with a roughly ten-billion-dollar stake and the GPU demand underneath all of it, trading near a 4.8-trillion-dollar market capitalization on the strength of Huang’s compute-is-revenues thesis.

A trillion-dollar Anthropic IPO does not just price Anthropic. As one trading desk framed it, the entire listed AI complex is likely to re-rate on Anthropic comps, not merely the company itself.

One market-structure point will matter on debut day. A listing this large arriving this fast triggers fast-entry index inclusion rules, which means passive funds become forced buyers shortly after pricing, a mechanical tailwind Yahoo Finance flagged as one reason all three trillion-dollar names may benefit from going public in quick succession. Investors will have just watched SpaceX test those rules in real time.

What the price implies: a transparent valuation

The comps above are a relative reframe, not a valuation. Here is the valuation, built the way a disciplined analyst builds one when the financials are sealed.

You cannot run a bottoms-up discounted-cash-flow model on Anthropic, because the inputs such a model needs, the actual margins, the free cash flow, the capex schedule, the share count, and the realized price per token, all sit inside the confidential S-1.

Anyone publishing a single intrinsic-value number from a DCF right now is inventing those inputs. What can be done rigorously, from fact-checked data alone, is to reframe the question around two hard primary anchors, the roughly $47 billion run-rate and the roughly $1 trillion price, and ask answerable things.

What the price requires

The most honest move is a reverse discounted-cash-flow: rather than forecast cash flows we do not have, solve for the growth the known price embeds, then judge whether it is believable.

Treat $1 trillion as today’s enterprise value, assume the market values the company at a steady-state revenue multiple m reached in year eight and discounts at a cost of capital w, and the revenue the price implies follows directly.

Revenue(yr 8) = $1,000B × (1 + w)^8 ÷ m

implied CAGR = ( Revenue(yr 8) ÷ $47B )^(1/8) - 1

With cost of capital from 9 to 13 percent and a terminal revenue multiple from 4 to 8 times, both defensible and neither touching sealed data, the implied eight-year revenue growth rate is the following.
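The grid can be reproduced from the two formulas above. This is a sketch of that calculation using the stated anchors ($1,000B price, $47B run-rate, eight years) and the stated ranges for w and m; the function names and the choice of grid points within those ranges are illustrative.

```python
PRICE_B = 1000.0     # enterprise value today, billions of dollars
RUN_RATE_B = 47.0    # current annualized revenue, billions of dollars
YEARS = 8


def implied_year8_revenue(price_b: float, w: float, m: float, years: int = YEARS) -> float:
    """Revenue in the terminal year that justifies today's price at multiple m."""
    return price_b * (1 + w) ** years / m


def implied_cagr(price_b: float, run_rate_b: float, w: float, m: float,
                 years: int = YEARS) -> float:
    """Annual revenue growth needed to reach the implied terminal-year revenue."""
    return (implied_year8_revenue(price_b, w, m, years) / run_rate_b) ** (1 / years) - 1


multiples = (4, 6, 8)
print("cost of capital | " + " | ".join(f"m = {m}x" for m in multiples))
for w in (0.09, 0.11, 0.13):
    row = " | ".join(f"{implied_cagr(PRICE_B, RUN_RATE_B, w, m):6.1%}" for m in multiples)
    print(f"{w:15.0%} | {row}")
```

A higher cost of capital or a lower terminal multiple both raise the growth the price requires, so the demanding corner of the grid is a 13 percent discount rate with a 4x exit.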

The price embeds roughly 25 to 40 percent annual revenue growth sustained for eight years, centered near 30 percent. That converts the entire debate into one question a reader can answer:

do you believe Anthropic compounds revenue at about 30 percent a year for nearly a decade?

Current growth is far above that, which gives the bull case runway; the bear case is that no company has held 30 percent for eight straight years, and open-weight price compression is the likeliest thing to break it.

The reverse DCF does not say who is right. It says, from fact-checked data, exactly what you are being asked to underwrite.

The range of outcomes

A scenario valuation completes the picture. The base case is anchored on the company’s own 2028 projection (a projection, and flagged as such), with a bear and a bull case around it, each valued at an exit multiple and discounted to today.

The price is paid for the right tail. Probability-weighted present value is about $800 billion against a price near $1 trillion, so the market leans on the bull case to clear the bar. The five-fold spread between bear and bull, roughly $290 billion to $1.6 trillion, is the real story: this is a wide distribution, not a tight intrinsic value. Discounted at 11 percent; the probabilities are illustrative and the one input a reader should make their own.

One tension in the fact-checked data deserves a flag rather than a paper-over. A roughly $47 billion run-rate today against the company’s roughly $70 billion 2028 projection implies revenue growth averaging only 15 to 20 percent a year through 2028, a sharp slowdown from the current pace.

Either the $70 billion figure predates the $47 billion run-rate and is now conservative, or management expects growth to crash, and the comps reframe that calls the price cheap rests entirely on which it is.
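The implied growth rate depends on how many years separate the run-rate from the 2028 figure, which the text does not pin down. The sketch below treats that horizon as an assumption and shows the result for two, two and a half, and three years; the function name is illustrative.

```python
def cagr(start: float, end: float, years: float) -> float:
    """Compound annual growth rate that takes start to end over the given years."""
    return (end / start) ** (1 / years) - 1


RUN_RATE_B = 47.0      # current run-rate, billions of dollars
PROJECTION_B = 70.0    # company's 2028 revenue projection, billions of dollars

for horizon in (2.0, 2.5, 3.0):  # assumed horizons, not stated in the text
    growth = cagr(RUN_RATE_B, PROJECTION_B, horizon)
    print(f"{horizon} years: {growth:.1%} a year")
```

Only the shortest horizon pushes the rate above 20 percent, so the deceleration the text flags survives any reasonable reading of the dates.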

How to read it

The honest close is not a verdict but the underwriting question the price poses: do you believe roughly 30 percent revenue growth for eight years, and do you believe the cost half of the margin keeps falling faster than the price half?

A reader who answers yes to both can justify the trillion-dollar tag; one who doubts either cannot. The point of showing every assumption is that you can change them and watch the answer move.

This valuation is analysis, not investment advice and not a recommendation. The run-rate and price anchors are primary; the 2028 revenue and cash-flow figures are company projections; the multiples are approximate and volatile; the cost of capital, exit multiples, and scenario probabilities are explicit modeling choices made for transparency, not derived from non-public data.

A mirror, a rocket, and a clock

You cannot value Anthropic in isolation, because the IPO is partly a race, and races have positional dynamics.

OpenAI is the mirror image. Per Nerd Level Tech and Let’s Data Science, it converted to a public benefit corporation in October 2025 as OpenAI Group PBC, with the nonprofit Foundation retaining roughly twenty-six percent and board control.

Microsoft holds roughly twenty-seven percent on a diluted basis, an investment valued between about $135 billion and $228 billion depending on the mark, and ended its exclusivity arrangement in April. Sam Altman holds no equity.

OpenAI closed the largest private round in history on March 31, $122 billion at an $852 billion post-money valuation, with SoftBank, Amazon, Nvidia, and Microsoft all participating.

Revenue runs about two billion dollars a month, near a twenty-five-billion run-rate as of March, with fifty million consumer subscribers and nine million business users, per roborhythms’ compilation.

The contrast that will dominate the dueling roadshows is profitability. Multiple outlets, citing the loss figures, report OpenAI losing about $1.22 for every dollar of revenue in Q1 2026, an operating margin near negative one hundred twenty-two percent, with a projected fourteen-billion-dollar loss for the year.

Anthropic, by the disputed WSJ figures, claims a small operating profit in the same window. So the two enter the public markets at nearly identical valuations and opposite financial stories.

The overtaking. Anthropic crossed above OpenAI on run-rate revenue in spring 2026 even as OpenAI retained far greater consumer reach. By May, Sacra put Anthropic near $45B versus OpenAI near $33B. Reported figures, not audited.

There is a real first-mover argument, though the precedent cuts only so far. When Lyft and Uber went public in 2019, Lyft, the first mover, popped on its debut while Uber fell on its first day; both stocks then traded poorly in the months after, so debut-day positioning is no guarantee of anything.

Both labs will seek tens of billions in fresh capital in close succession, so reaching market first plausibly matters. Anthropic’s confidential filing means it could price as early as mid-August on the SpaceX timeline, per Yahoo Finance, though Futurum reports a target as late as October; either way, likely ahead of OpenAI’s Q4 window.

Against that, the sober counterpoint: Wall Street already knows both stories intimately, and OpenAI’s S-1 will likely be public by the time Anthropic prices, letting investors judge them side by side regardless of who rings the bell first.

The rocket is the merged SpaceX entity, and it is the wild card. Its S-1 disclosed that xAI spent $12.7 billion on AI infrastructure in 2025 and another $7.7 billion in Q1 2026, per Datacenter Dynamics, and it described its one gigawatt of capacity as nameplate compute draw, explicitly noting the figure reflects installed capacity and does not represent actual utilization.

In plain terms, the GPUs are installed but may not all be powered. That single disclosure is a gift to anyone trying to separate capacity headlines from real, energized compute, and every analyst should now apply a utilization haircut to gigawatt claims across the industry, including the contracted-capacity bars in section V.

Step back and the pipeline is unlike anything the IPO market has seen. Three trillion-dollar names, plus Databricks (the only clearly profitable candidate, at a $5.4 billion run-rate growing 65 percent with positive free cash flow and a $134 billion private mark) and Cerebras (which priced around a $48.8 billion valuation on a heavily oversubscribed book).

One IPO tracker estimated combined pipeline demand at up to four times the entire 2025 US IPO market.

An unprecedented concentration. Three potential trillion-dollar debuts in roughly one hundred days, alongside the largest software listings in years. SpaceX, which acquired xAI in January 2026, is shown here near a reported IPO target of $1.5 trillion, the figure most frequently cited, with reports ranging from $1 trillion to $2 trillion. Sources: Datacenter Dynamics, CNN, Let’s Data Science, AI IPO Tracker.

Stated as strongly as I can make it

A report that only sells the bull case is marketing. Here is the bear case, and it is not weak.

Demand is showing its first cracks. Axios reported, in a piece timed to the filing, that Anthropic is going public just as businesses begin to rethink their AI spend, hit with what it called sticker shock. The writer Derek Thompson has named this the great AI cost panic of 2026, the phase where Fortune 500 buyers ask whether agentic AI is worth the bill. The most cited data point is the MIT Project NANDA study from July 2025, which found that ninety-five percent of enterprise generative-AI pilots produced zero measurable profit-and-loss impact, on thirty to forty billion dollars of corporate spending.

If even a fraction of that skepticism hardens into budget discipline, the revenue surge that justifies these valuations slows exactly when the infrastructure bills come due.

The macro is stretched to dot-com proportions. The five largest Western hyperscalers are guiding toward roughly $725 billion of capex in 2026, up about seventy-seven percent from 2025’s record, per the Goldman, CreditSights, and Morgan Stanley estimates compiled by Tool Directory, with the trajectory projected past a trillion dollars annually in 2027.

The largest private investment cycle in history. Tech equipment and software hit 4.4% of GDP in 2025, near the dot-com peak (IEEE ComSoc). Cahn’s framework says ~$600B of annual revenue is needed to justify it; the gap is widening. Sources: Goldman Sachs, CreditSights, Morgan Stanley via Tool Directory.

The reminder that this complex can reprice violently is recent. The DeepSeek shock of January 2025, when a cheaper Chinese model wiped roughly a trillion dollars of US AI market value in a single day, including Nvidia’s $588.8 billion single-day loss, the largest in market history, shows how a single efficiency surprise can rerate the whole sector. And note the double edge.

The same open-weight efficiency that drives the cost deflation of the previous sections is also the force most likely to compress prices, the risk those sections warned about. There is no reason another DeepSeek cannot happen. Pat Gelsinger, the former Intel chief, when asked whether this is a bubble, answered of course.

Antitrust is loading. The cumulative concentration of hyperscaler-lab pairings, Microsoft with OpenAI and Google and Amazon both with Anthropic, is large enough that, per Tech Insider’s reading, the DOJ, FTC, and European Commission are likely to revisit the structure, with a formal investigation plausible by the fourth quarter of 2026. An IPO does not resolve this. It raises the profile of the very arrangements regulators most want to examine.

Company-specific overhangs are real. Sacra flags several that bear directly on Anthropic’s ability to monetize its best work. Its most capable model, previewed as Mythos and codenamed Capybara, has been withheld from general release after it proved able to identify thousands of high-severity software vulnerabilities; the commercial vehicle, Claude Security under Project Glasswing (with partners including Nvidia, AWS, Apple, Google, Broadcom, Microsoft, Cisco, CrowdStrike, and Palo Alto Networks, per The Motley Fool), runs on a deliberately constrained Opus 4.7.

That is a safety decision I think is correct on the merits, and it is simultaneously a self-imposed ceiling on revenue from the company’s most powerful capability.

Add the Pentagon dispute, where Anthropic sued the US government over a designation it viewed as a threat to its revenue. CFO Krishna Rao testified that the matter risked cutting 2026 revenue by multiple billions of dollars, and Amodei later publicly apologized for how the company handled the failed talks. Add, too, a legal overhang whose verified component is the roughly $1.5 billion authors’ copyright settlement (a larger figure naming the founder personally appears in a single analyst account and is not independently confirmed). The result is a company whose brand safety and revenue ceiling are the same wall.

The customer-concentration paradox. Recall Claude Code’s billion-dollar ramp. Perera’s sharpest observation is that the company’s fastest-growing product may cannibalize its largest revenue source, because the same agentic coding capability enterprises buy directly can displace the API consumption those enterprises previously paid for. Growth in one column can quietly erode another. We cannot see the net effect, because the filing is confidential.

That phrase keeps recurring, and that is the point. Almost every load-bearing question, the real margin, the realized price per token, the quality of the revenue, the accounting for the SpaceX ramp, the net effect of Claude Code, resolves only when the public S-1 lands.

Until then, the bull and the bear are arguing about a black box.

Who should care, and why

Strip away the spectacle and ask what actually changes downstream. Four things.

For Nvidia and the silicon market, Anthropic is the existence proof. The most important fact in this report for the long-run structure of the industry is that a frontier-class model is being trained and served at scale on Trainium and TPU, not just on Nvidia.

The near-term risk to Nvidia is not revenue, since demand still dwarfs supply, but the long-run margin structure that depends on hyperscalers having no realistic alternative.

As previous sections showed, at rack scale, on a cost-per-workload basis, and on the bandwidth metric that actually governs inference, the alternative now exists. Every chip team at Amazon, Google, and Broadcom is using Anthropic as their proof of concept, and that is worth more to them than the revenue.

For the circular-financing thesis, the IPO is the disclosure event the skeptics have awaited. A public listing forces the first concrete, audited window into a frontier lab’s financials. For two years the bears and bulls have argued about revenue quality with no primary data.

The S-1, once public, will show how Anthropic accounts for hyperscaler-funded revenue, how it books the SpaceX ramp, and what its real, unsubsidized unit economics, including the realized revenue per token that section VII could not pin down, actually look like.

This is the rare case where bull and bear should want the same thing: the numbers. If they are as good as the run-rate suggests, the bubble talk deflates. If they are not, better to know now.

For enterprise buyers, the public-company transition changes the vendor relationship. Public companies optimize for quarterly margins in ways private ones do not.

The inference-cost deflation that has made Claude cheaper every year was partly funded by patient private capital. A public Anthropic, answerable to shareholders, may price differently, and may be less willing to pass the full token-cost decline through to customers.

For anyone building production systems on Claude, that is a planning input, not a panic, and one more argument for the multi-provider architectures that open-weight alternatives like DeepSeek and Llama keep making viable.

For the private markets and the broader economy, the drain is unclogging, for better and worse. Three trillion-dollar IPOs in a hundred days will pull enormous capital into public AI equities and force index providers to confront new inclusion rules for companies this large arriving this fast.

If the debuts go well, they validate the cycle and pull more capital in.
If they go poorly, three of the largest IPOs in history repricing in quick succession is exactly the event that turns a capex bubble into a capex correction, with the roundtripping amplifying the move down just as it amplified it up.

The labor question rides alongside: tech layoffs passed 115,000 through May 2026, with Meta, Amazon, and Snap citing AI, even as the Yale Budget Lab found no significant change yet in the occupational mix of high-exposure jobs. The same plumbing runs in both directions, and so does the narrative.

“What I see is this smooth exponential line. And that march has just been constant.”

Dario Amodei · Anthropic CEO, at Davos 2026, on why he discounts the cycle of hype and bubble talk (Rest of World)

Set against that, Nvidia’s Huang, who has said publicly he disagrees with almost everything Amodei says, dismisses the doomier predictions as the product of a CEO God complex.

Two of the most important people in the industry cannot agree on whether it is reshaping labor, let alone whether it is a bubble. The IPOs will not settle that. They will only price it.

The bottom line

Here is what I actually think, stated plainly, with the byline caveat from the top still standing.

Anthropic is, by the public evidence, the best-positioned of the three companies going public this summer. It has the cleanest revenue mix, the leanest cost structure, the most credible margin-expansion story, and a genuinely differentiated three-silicon strategy that is reshaping the hardware layer beneath the entire industry.

The physics say a seventy-seven-percent gross margin is plausible rather than fanciful; the comps from earlier sections say a trillion-dollar valuation is mid-pack rather than mad; the insider behavior at the tender offer and the speed of the run-rate point the same way. If forced to rank the three debuts on fundamentals rather than spectacle, I would put Anthropic first.

And the entire case rests on three things we cannot yet verify and one we can. We cannot verify the real gross margin, the realized price per token, or the accounting behind the Q2 profitability claim, because the filing is confidential.

What we can verify is that the company is being valued at over a trillion dollars on exactly those unverified figures, inside a capital structure that critics, with a strong historical analogy, compare to the vendor financing that deepened the last great technology crash.

That is not a contradiction. It is the trade. The bull is betting the black box is full of seventy-seven-percent-margin, organically demanded, durably embedded enterprise revenue, with a cost-to-serve falling faster than price.

The bear is betting it is full of forty-five-percent-margin revenue, propped by a circular capital loop, with open-weight competition dragging price down to meet cost, in a demand environment that is just starting to flinch.

The public S-1, fifteen days before the roadshow, opens the box. Everything before then, including the trillion-dollar valuation the market has already assigned, is a wager placed in the dark.

The vertical is real. The loop is real. The physics is real. The only honest position, until the numbers are public, is to hold all three in view at once and refuse to pretend the box is open when it is still shut.

Disclosures, statements, and the safety architecture

What follows is the verifiable record behind the analysis: what was actually filed and said, on the record, by whom, and when. None of it is the confidential S-1’s financials, which remain sealed. Where a claim rests on a single source, it is marked.

The filing, precisely

Anthropic confirmed on June 1, 2026 that it had confidentially submitted a draft Form S-1 to the SEC, in an announcement made under Rule 135 of the Securities Act, a rule that by design states only that a filing exists: the number of shares and the offering price are explicitly undetermined, and the offering remains subject to market conditions.

Because the submission is confidential, Anthropic has disclosed no audited revenue, no margin, and no risk factors; under SEC rules for emerging growth companies those become public only about fifteen days before a roadshow.

The filing came four days after a cluster of disclosures on May 28, 2026: the close of a $65 billion Series H at a $965 billion post-money valuation, confirmation that run-rate revenue had crossed $47 billion earlier in May, and the release of Claude Opus 4.8. Multiple outlets place the listing target in an October 2026 window, above $1 trillion if markets cooperate.

One revealing side event: in May 2026 Anthropic warned about unauthorized transfers of its shares, naming several platforms selling unapproved, SPV-backed pre-IPO tokens and cautioning that such instruments may carry limited or no legal value.

Tokenized Anthropic and OpenAI pre-IPO products reportedly fell 34 to 40 percent within days, per Bitcoin.com, a vivid illustration of demand for liquidity running far ahead of authorized supply, the same pressure visible in the earlier employee tender.

What the executives said, on the record

The single most important primary statement is Amodei’s own. At the Code with Claude developer conference in San Francisco on May 6, 2026, he said the company had planned for roughly tenfold annual growth but instead saw, in his words, eighty-fold annualized growth in the first quarter. He gave that gap as the direct cause of the company’s compute shortages and promised to pass new capacity to developers as fast as it could be brought online, per CNBC.

“In Q1 2026, we saw 80x annualized growth per year in revenue and usage.”

Dario Amodei · Anthropic CEO, Code with Claude, San Francisco, May 6, 2026 (CNBC)

That exuberance has a hard floor, and Amodei has named it precisely. On Dwarkesh Patel’s podcast in March 2026 he walked through the arithmetic of his own ruin: if he committed to a trillion dollars a year of compute in 2027 and revenue arrived even at $800 billion rather than the trillion he is extrapolating, then in his phrase there is “no force on earth” that could stop the company from going bankrupt.

It is the clearest admission any frontier-lab chief executive has made that the whole edifice is a bet on a growth rate continuing, and that the bet is existential. He has also said publicly that the industry may be near the end of the exponential.

The operator behind the IPO is CFO Krishna Rao, who joined in 2024 as the company closed its Series D at roughly $250 million in run-rate revenue, and who previously guided Airbnb’s IPO. Rao calls compute the lifeblood of the business and says he spends 30 to 40 percent of his time on it.

He has named the three risks that would push Anthropic toward the bottom of its growth cone rather than the top: enterprise diffusion failing to keep pace with model capability, scaling laws unexpectedly flattening, and competition eroding margins, per his interview with YourStory.

In a court filing around March 2026 Rao stated under oath that the company had brought in revenue exceeding $5 billion to date, the figure critics such as Ed Zitron use to argue the later profitability claim was flattered by the timing of the SpaceX compute discount.

“The compute that we procure is the lifeblood of our business.”

Krishna Rao · Anthropic CFO, who previously led Airbnb’s IPO (YourStory)

Talent, culture, and the organization

Anthropic’s defining operational claim is talent retention under siege. When Meta made aggressive offers across the frontier labs, Anthropic reportedly lost only two researchers where rivals lost dozens, a result Rao attributes to a culture the company describes as talent density over talent mass.

All seven co-founders remain, as does the vast majority of the first thirty employees; every hire must clear a culture interview; and Amodei addresses the entire company every two weeks and takes unscripted questions, per YourStory.

In February 2026 Anthropic opened a Bengaluru office, calling India its second-largest market for Claude.

The safety architecture, which is also a revenue constraint

Anthropic governs releases through its Responsible Scaling Policy, first published in September 2023 and rewritten as Version 3.0, effective February 24, 2026, which introduced Frontier Safety Roadmaps and Risk Reports that quantify risk across deployed models.

The policy uses AI Safety Levels modeled on biosafety: ASL-2 is the current baseline; ASL-3, which Anthropic first activated alongside Claude Opus 4 in May 2025, adds hardened weight security and a narrow set of deployment limits aimed at CBRN misuse; ASL-4 is reserved for models posing major national-security risk or capable of autonomous AI research.

At Opus 4’s launch, chief scientist Jared Kaplan said the model gave novices a “significantly greater” uplift toward building biological weapons than a search engine or prior models, per TIME.

The clearest case of safety capping revenue is Claude Mythos. Announced as a preview on April 8, 2026, Mythos autonomously discovered, and wrote working exploits for, thousands of zero-day vulnerabilities across major operating systems and browsers, capability Anthropic judged too dangerous for general release, placing it at or near the ASL-3 cyber threshold per the Cloud Security Alliance.

The commercial vehicle, Claude Security under the Glasswing program, runs a deliberately constrained model and carries 90-day reporting commitments.

Anthropic has also published a Sabotage Risk Report for Opus 4.6 and, in February 2026, an internal Noncompliance Reporting and Anti-Retaliation Policy giving employees channels to flag potential violations.

Each of these is a decision that is defensible on its safety merits and simultaneously a self-imposed ceiling on the revenue the company’s most powerful capabilities could earn.

The model record, and what Anthropic does not disclose

The Claude lineage behind the revenue is precise and public at the capability level:

Opus 4.5 (November 24, 2025) shipped a 200k-token context window and a 64k-token thinking budget, with up to 65 percent fewer tokens on long-horizon coding;
Opus 4.6 (February 2026) added a 1M-token context in beta and led Terminal-Bench 2.0 and Humanity’s Last Exam; Sonnet 4.6 followed on February 17, 2026;
Opus 4.7 was independently rated the most EU-AI-Act-compliant model by the testing firm Aithos;
and Opus 4.8 (May 28, 2026) carries a 1M-token context with cross-session memory for multi-day work.

Critically, none of these system cards disclose the one thing an analyst most wants: parameter counts, layer counts, or whether the models are dense or mixture-of-experts.

Anthropic publishes capability and safety evaluations in exhaustive detail and keeps the architecture itself sealed, which is exactly why the inference unit-economics had to be built from silicon specifications and first principles rather than from any disclosed model size.

The data, at a glance

Methodology. The cost-to-serve and gross-margin-bridge figures in section VII are illustrative models built from published silicon benchmarks and stated assumptions, not Anthropic disclosures; they are intended to show order of magnitude and mechanism, not to report the company’s actual numbers, which are confidential.

Revenue, valuation, and capacity figures throughout are reported, estimated, or projected by the cited third parties. Where sources conflict (for example, run-rate near $45B vs $47B, or Trainium3 availability dates), the range is given in text.

Sourcing. Reporting and analysis from CNBC, CNN, NBC News, Axios, Reuters, the Wall Street Journal (via AI Weekly and Sacra), Bloomberg, TechCrunch, Datacenter Dynamics, Yahoo Finance, Fortune, Benzinga, The Motley Fool, Futurum, Investing.com, and Business Standard; infrastructure, inference, and financial analysis from Sacra, PitchBook, Global Data Center Hub, Tech Insider, TradingKey, SaaStr, BlockEden, Tool Directory, IEEE ComSoc, IntuitionLabs, Introl, Inworld, GMI Cloud, Spheron, CloudRift, multiples.vc, and an arXiv study on token-price evolution; silicon specifications from Tom’s Hardware, Oplexa, Introl, and Awesome Agents; benchmark data from Nvidia InferenceMax; company statements from Anthropic, xAI, and Databricks; commentary from Ed Zitron, Derek Thompson, Shanaka Anslem Perera, and the named analysts (Rasgon / Bernstein, Ives / Wedbush, Cahn / Sequoia, Duberstein / Motley Fool, Patience / Futurum).

Disclosure and limits. All financial figures are reported, estimated, or projected; none are drawn from a publicly available audited filing, because Anthropic’s S-1 remained confidential as of the filing date. Nothing here is investment advice. This is analysis, not a recommendation.

The Blackwell Migration Question

Lorenzo Bradanini — Sun, 07 Jun 2026 11:32:01 GMT

Introduction

A single Blackwell B200 running Llama 3.3 70B at NVFP4 can decode at approximately 5.5 milliseconds per token at batch size 1; an 18-month-old H100 at FP8 cannot move below 21.5 ms. The B200 floor is roughly 4× faster and the KV-cache budget grows by about 18×.

The numbers are not vendor marketing. They are bandwidth arithmetic: a ~43 GB on-GPU footprint for the FP4 weights divided by 7.7 TB/s of HBM3e, the same first-principles derivation we used for the H100 floor in Issue #1, with two parameters changed.

The question that follows is not whether Blackwell is faster. It is whether the price-per-token improvement actually delivered to a deployment engineer justifies the migration cost, and the published vendor numbers do not answer it cleanly.

NVIDIA’s October 2025 announcement of SemiAnalysis InferenceMAX v1 claims “15× lower cost per million tokens” for Blackwell vs Hopper.

The real cost-per-token improvement for like-for-like single-GPU Llama 3.3 70B inference, derived from the same InferenceMAX v1 data combined with public on-demand pricing, is closer to 3× on-demand and 8× on spot, with the gap explained by rack-scale GB200 NVL72 comparisons, MoE workloads, and BF16-baseline framings that do not represent the H100 FP8 production deployment most readers are actually running.

This is Issue #2 of what is now called Inference.Engineering. Issue #1 derived the bandwidth-bound floor for Llama 3.3 70B FP8 on H100 SXM5, audited the public benchmark landscape, identified where the engine-vs-engine variance actually lives, and committed to running our own measurements.

The measurement work is in progress and ships separately. This issue addresses the question deployment engineers are asking now: should I migrate, and what do I actually get if I do.

The post does what Issue #1 did. It derives the physical bounds from the NVIDIA datasheet and the FP4 weight footprint. It places the bounds against empirical data from SemiAnalysis InferenceMAX v1 and NVIDIA’s MLPerf v4.1 submission.

It audits the public Blackwell benchmark claims against those bounds, reads the kernel-layer story underneath (NVFP4 vs MXFP4, the second-generation Transformer Engine, FlashAttention-3 on SM100 vs SM90, NVLink 5), works through the quantization-accuracy tradeoff using Red Hat AI’s published evaluation of NVFP4 on Llama 3.3 70B, and concludes with a migration decision matrix for the six most common production situations.

A note on what we have and have not done. This post is analysis grounded in primary-source data. The bounds are calculator-checkable. The empirical numbers come from SemiAnalysis InferenceMAX v1 (October 2025), NVIDIA’s own MLPerf submission, Red Hat AI’s published NVFP4 evaluation, and the vLLM v0.12.0 Blackwell recipe page.

We have not yet run our own B200 benchmarks. When we do, this post will be updated and any deltas will be tracked in the errata page. If the migration recommendations below turn out to be wrong when measured directly, cite this post against us.

Inference.Engineering is reader-supported. The paid subscription funds rented GPU time on H100, B200, and MI355X, which is how we plan to measure the configurations this issue discusses.

Key observations

Decode floor on a single B200 for Llama 3.3 70B NVFP4 is ~5.5 ms/token at batch 1, a per-user throughput ceiling of ~180 tok/s/user. Derived from ~43 GB on-GPU footprint at NVFP4 (34 GB FP4 linear weights + ~4 GB BF16 embedding/lm_head + ~4 GB block-scale overhead at NVFP4’s 16-element block size) divided by 7.7 TB/s HBM3e bandwidth on the HGX B200. The GB200 variant at 8.0 TB/s reaches ~5.3 ms.
The B200’s KV-cache budget after NVFP4 weights is ~135 GB on a single 180 GB HGX B200 GPU, versus ~7–8 GB on a single 80 GB H100 at FP8. The 18× increase is the load-bearing fact, not the throughput number. Long-context workloads that hit the KV wall on H100 (Part 1.2 of Issue #1) move comfortably into single-GPU territory on B200.
NVFP4 on Llama 3.3 70B is essentially lossless when properly calibrated. Red Hat AI’s published NVFP4 evaluation (February 2026) shows large models (70B–235B parameters) “consistently achieve ~99% recovery” of BF16 accuracy across task-level and aggregate benchmarks. This holds for Llama 3.3 70B specifically; the published RedHatAI/Llama-3.3-70B-Instruct-NVFP4 checkpoint exists with full reproduction recipes.
NVFP4 ≠ MXFP4. On Llama 3.3 70B with quantized KV cache, NVIDIA measures 5% higher MMLU accuracy with NVFP4 vs MXFP4 (December 2025), attributed to NVFP4’s finer block scaling (16-element blocks vs 32) and higher-precision E4M3 FP8 scaling factors vs MXFP4’s E8M0. The distinction matters: a benchmark reporting “FP4” without specifying which format does not generalize.
SemiAnalysis InferenceMAX v1 reports B200 delivering ~10,000 TPS/GPU at 50 tok/s/user interactivity on Llama 3.3 70B, roughly 4× higher per-GPU throughput than H200 at the same interactivity (NVIDIA blog, October 9, 2025). Note that the batch-1 decode-floor ratio vs H200 is smaller (~2.7×, from 1.6× bandwidth times 1.69× NVFP4 footprint); the larger 4× at production interactivity additionally captures the bigger batch sizes the B200’s KV headroom permits and the FP4 compute density that helps once batches are large. Floor and interactivity-point throughput are different metrics, and the gap between them is itself informative.
NVIDIA’s “15× lower cost per million tokens” headline is not the right number for the single-GPU H100 → B200 question. It applies to rack-scale GB200 NVL72 on MoE models compared to HGX H100 air-cooled clusters at BF16. For like-for-like single-GPU Llama 3.3 70B FP8 (H100) vs NVFP4 (B200) at 50 tok/s/user interactivity, the derived cost-per-million-tokens improvement using Spheron’s published rates is ~3× on-demand and ~8× on spot.
The B200’s compute-bandwidth ridge sits at 2,338 FLOPs/byte for FP4 dense (18 PFLOPS / 7.7 TB/s) on the HGX variant, versus 591 FLOPs/byte at FP8 on H100. The Blackwell ridge is roughly 4× further out, meaning Llama 3.3 70B decode at AI ≈ 1 FLOP/byte sits even further below the ceiling. Decode is more bandwidth-bound on B200, not less. The throughput improvement comes from the bandwidth itself, not from compute throughput.
vLLM v0.12.0 is the current Blackwell-ready release. NVIDIA’s vLLM recipe page is direct on the precision choice: “For Hopper, FP8 offers the best performance for most workloads. For Blackwell, NVFP4 provides additional memory savings and throughput gains, but may require tuning to maintain accuracy on certain tasks.” The recipe also documents kv-cache-dtype: fp8 and max-num-batched-tokens: 8192 as the recommended Llama 3.3 70B Blackwell defaults.
The B200’s 1,000W TDP requires infrastructure most data centers do not yet have. H100 SXM5 runs at 700W in air-cooled configurations. B200 at 1,000W in HGX form (or 1,200W in GB200) frequently requires liquid cooling or a substantial reduction in rack density. The migration cost includes infrastructure, not just hardware.
The cross-vendor portability story is real and underappreciated. vLLM’s Triton attention backend, which achieved 100.7% of FlashAttention-3 performance on Hopper for long decode requests (Issue #1, Section 4.2), runs unchanged on Blackwell and AMD. The same Triton kernel that closes the gap to FA3 on H100 also closes the gap to FA4 on B200. A migration evaluation that compares only the NVIDIA-optimized stack systematically understates the AMD alternative on the workloads where Triton attention wins.
The B200 supply situation matters for the timeline. Inworld reports B200 hardware orders backlogged through mid-2026 with ~3.6 million units in queue (April 2026 reference). Cloud rental is the practical migration path for most readers in 2026, not on-premise purchase. This makes the spot-price economics more relevant than the list-price comparisons NVIDIA’s blog posts emphasize.
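
The cost-per-million-tokens comparison in the observations above reduces to one line of arithmetic: an hourly GPU price divided by the tokens that GPU serves in an hour. The sketch below shows that mechanism only. The function names are ours, and the prices and the H100 throughput in the usage lines are hypothetical placeholders, not Spheron's published rates or InferenceMAX measurements; only the ~10,000 TPS/GPU figure for B200 comes from the text.

```python
def cost_per_million_tokens(price_per_gpu_hour, tokens_per_second_per_gpu):
    """Dollars per one million generated tokens for a single GPU."""
    tokens_per_hour = tokens_per_second_per_gpu * 3600
    return price_per_gpu_hour / tokens_per_hour * 1e6

def cost_improvement(old_price, old_tps, new_price, new_tps):
    """How many times cheaper per token the new configuration is."""
    old_cost = cost_per_million_tokens(old_price, old_tps)
    new_cost = cost_per_million_tokens(new_price, new_tps)
    return old_cost / new_cost

# PLACEHOLDER inputs for illustration only: substitute real rental rates
# and measured throughput at the same interactivity point.
h100_price, h100_tps = 2.0, 1_500      # hypothetical $/GPU-hour and tok/s/GPU
b200_price, b200_tps = 5.0, 10_000     # hypothetical price; TPS figure from the text

ratio = cost_improvement(h100_price, h100_tps, b200_price, b200_tps)
print(f"B200 cost per million tokens: ${cost_per_million_tokens(b200_price, b200_tps):.3f}")
print(f"improvement vs H100: {ratio:.1f}x")
```

The improvement is simply the throughput ratio divided by the price ratio, which is why the same hardware comparison yields a different multiple on spot than on-demand.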

Part 1, The physical bounds on Blackwell

Three numbers belong on every Llama 3.3 70B / B200 deployment engineer’s whiteboard: the decode latency floor under NVFP4, the KV-cache budget that the larger HBM3e capacity unlocks, and the compute roofline ridge that shifts under the second-generation Transformer Engine.

1.1 The decode latency floor under NVFP4

The bandwidth-bound floor derivation from Issue #1 transfers unchanged. Decode on a single B200 streams the full weight set from HBM3e to the SMs per token.

At batch 1, decode arithmetic intensity remains ~1 FLOP per byte loaded (Spector & Ré; arXiv 2603.05931). The H100 figures from Issue #1 become Blackwell figures by substituting the new weight footprint and the new bandwidth.

NVIDIA B200 datasheet figures, cross-verified against the official December 2024 Blackwell datasheet:

Metric               HGX B200 (per GPU)   GB200 (per GPU)
HBM3e bandwidth      7.7 TB/s             8.0 TB/s
FP4 dense            18 PFLOPS            20 PFLOPS
FP8/FP6 dense        9 PFLOPS             10 PFLOPS
BF16/FP16 dense      5 PFLOPS             5 PFLOPS
HBM3e capacity       180 GB               186 GB
TDP                  1,000 W              up to 1,200 W
NVLink generation    5 (1.8 TB/s)         5 (1.8 TB/s)

(All Tensor Core figures are dense; sparse values are 2× the dense.)

Llama 3.3 70B on-GPU footprint under NVFP4 decomposes as follows. NVFP4 quantizes linear-layer weights to 4 bits per parameter in 16-element blocks with one E4M3 FP8 scale per block. Embedding and lm_head remain in BF16 (the standard production path that Red Hat AI’s checkpoint uses, mirroring the FP8 convention from Issue #1):

Linear FP4 weights:        68.45B params × 0.5 bytes = 34.23 GB
Block-scale overhead:      68.45B params / 16 × 1 byte = 4.28 GB
Embedding + lm_head BF16:  2.10B params × 2 bytes = 4.20 GB
Runtime + activations + CUDA graphs: ~1 GB
                                              ---------
Total on-GPU footprint:                      ~43.7 GB

The lower bound on decode latency per token at batch 1 uses the per-token streaming footprint of ~42.7 GB (the FP4 linear weights, their block scales, and the 4.2 GB BF16 embedding/lm_head term, all of which the derivation counts as streaming from HBM every decode step).

The extra ~1 GB of runtime and activation buffers in the ~43.7 GB total does not stream per token, so it enters the KV-budget subtraction in Part 1.2 but not the floor:

HGX B200:  t_floor = 42.7 GB / 7.7 TB/s  ≈ 5.5 ms/token
           v_ceiling = 1 / t_floor       ≈ 180 tok/s/user

GB200:     t_floor = 42.7 GB / 8.0 TB/s  ≈ 5.3 ms/token
           v_ceiling = 1 / t_floor       ≈ 187 tok/s/user

Comparing to the H100 SXM5 / FP8 reference from Issue #1 (~21.5 ms / 46.5 tok/s/user), the Blackwell + NVFP4 combination delivers a roughly 4× decode-floor improvement at batch 1. The improvement decomposes into two contributions: bandwidth (7.7 vs 3.35 TB/s = 2.3×) and streaming footprint (42.7 vs 72 GB = 1.69×).

The footprint contribution is what makes NVFP4 valuable, separate from the hardware. Running B200 with FP8 weights instead of NVFP4 would cut the footprint contribution to ~1× and recover only the 2.3× bandwidth improvement.
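
For readers who prefer to check the arithmetic in code, here is a minimal sketch of the floor derivation and its two-term decomposition. All inputs are the figures quoted above (68.45B linear parameters, 2.10B BF16 parameters, 72 GB for the H100 FP8 footprint); the function and variable names are ours.

```python
# Sketch: bandwidth-bound decode floor for Llama 3.3 70B (figures from the text).
GB, TB = 1e9, 1e12

def nvfp4_streaming_bytes(linear_params=68.45e9, bf16_params=2.10e9, block=16):
    """Bytes counted as streamed per decode step under NVFP4."""
    fp4_weights = linear_params * 0.5      # 4 bits per linear-layer parameter
    block_scales = linear_params / block   # one 1-byte E4M3 scale per 16-element block
    bf16_tensors = bf16_params * 2.0       # embedding + lm_head kept in BF16
    return fp4_weights + block_scales + bf16_tensors

def decode_floor_ms(streamed_bytes, bandwidth_bytes_per_s):
    """Lower bound on batch-1 time per output token, in milliseconds."""
    return streamed_bytes / bandwidth_bytes_per_s * 1e3

nvfp4_bytes = nvfp4_streaming_bytes()
floors_ms = {
    "HGX B200 NVFP4": decode_floor_ms(nvfp4_bytes, 7.7 * TB),
    "GB200 NVFP4": decode_floor_ms(nvfp4_bytes, 8.0 * TB),
    "H100 SXM5 FP8": decode_floor_ms(72 * GB, 3.35 * TB),
}

# Decompose the H100 -> B200 improvement into its two contributions.
bandwidth_term = 7.7 / 3.35
footprint_term = 72 * GB / nvfp4_bytes
speedup = floors_ms["H100 SXM5 FP8"] / floors_ms["HGX B200 NVFP4"]

for name, ms in floors_ms.items():
    print(f"{name}: {ms:.2f} ms/token, {1000 / ms:.0f} tok/s/user ceiling")
print(f"bandwidth {bandwidth_term:.2f}x * footprint {footprint_term:.2f}x = {speedup:.2f}x")
```

Swapping `nvfp4_bytes` for the 72 GB FP8 footprint in the B200 entry shows the FP8-on-B200 case: the footprint term collapses to 1× and only the bandwidth term remains.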

Chart 1. The decode latency floor and the per-user throughput ceiling it implies, derived by dividing the Llama 3.3 70B weight footprint at the relevant precision by each GPU’s HBM bandwidth. B200 NVFP4 is ~3.9× faster at the floor than H100 FP8. These are physical lower bounds; real systems sit slightly above them.

Verify the floor yourself. The bound is checkable on any rented B200 in under ten minutes, using NVIDIA’s vLLM v0.12.0 Blackwell recipe as the launch baseline:

docker pull vllm/vllm-openai:v0.12.0

docker run --gpus all -p 8000:8000 vllm/vllm-openai:v0.12.0 \
  --model nvidia/Llama-3.3-70B-Instruct-NVFP4 \
  --kv-cache-dtype fp8 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.9

Then measure single-stream decode at concurrency 1:

vllm bench serve \
  --model nvidia/Llama-3.3-70B-Instruct-NVFP4 \
  --dataset-name random --random-input-len 512 --random-output-len 512 \
  --max-concurrency 1 --num-prompts 8 \
  --percentile-metrics ttft,tpot,itl,e2el --ignore-eos

NVIDIA also publishes the canonical reproduction in its dgxc-benchmarking repository, which runs Llama 3.3 70B inference with --dtype nvfp4 on B200/GB200 and exposes TPOT directly when streaming is enabled (STREAMING=true); the same harness runs the H100 baseline at --dtype fp8, making it the cleanest apples-to-apples generational comparison available without writing your own client.

Expected mean TPOT should land in the ~6–8 ms range on HGX B200, at or just above the 5.5 ms floor, with the ~1–2 ms gap being engine overhead and kernel-launch latency. If you see TPOT materially below 5.5 ms at batch 1, the engine is running speculative decoding or sub-NVFP4 weights.

If materially above ~10 ms, the precision path is wrong (the engine fell back to FP8 or BF16) or the configuration is wrong (cold CUDA graphs, an unconfigured Transformer Engine path). Per-user throughput is 1000 / TPOT_ms.
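
The interpretation rules above can be written down as a small triage helper for a measured batch-1 mean TPOT. This is a sketch of the reasoning, not part of vLLM or any benchmark harness: the function name is ours, the 5.5 ms and ~10 ms thresholds come from the text, and the 5% margin used to stand in for "materially below" is an assumption.

```python
def triage_tpot(tpot_ms, floor_ms=5.5, high_ms=10.0, margin=0.05):
    """Return (tok/s/user, verdict) for a batch-1 mean TPOT on HGX B200.

    margin is an assumed tolerance standing in for "materially below" the floor.
    """
    tok_per_s_per_user = 1000.0 / tpot_ms
    if tpot_ms < floor_ms * (1.0 - margin):
        verdict = "below floor: check for speculative decoding or sub-NVFP4 weights"
    elif tpot_ms > high_ms:
        verdict = "too slow: check for FP8/BF16 fallback or a misconfigured engine"
    else:
        verdict = "plausible: at or above the bandwidth floor"
    return tok_per_s_per_user, verdict

for measured in (4.0, 7.0, 12.0):
    rate, verdict = triage_tpot(measured)
    print(f"TPOT {measured:.1f} ms -> {rate:.0f} tok/s/user, {verdict}")
```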

A caveat we will not paper over. Unlike Issue #1, where NVIDIA’s own NIM benchmark published the H100 throughput-vs-concurrency curve that empirically confirmed the KV wall, we have not located a published single-stream (batch-1) B200 TPOT measurement for Llama 3.3 70B NVFP4 to anchor the 5.5 ms floor directly.

The public B200 numbers (InferenceMAX, MLPerf, the TRT-LLM perf tables) are all aggregate-throughput or production-interactivity points at large batch, not batch-1 latency. The 5.5 ms figure is therefore a bandwidth derivation, not yet an empirically confirmed measurement, and confirming it within 15% is the first specific claim Issue #3 will test.

Treat it as a physical lower bound that real systems approach from above, exactly as the H100 floor behaved before NIM data confirmed it.

1.2 The KV-cache budget grows by an order of magnitude

The KV-cache calculation transfers unchanged from Issue #1. Llama 3.3 70B uses Grouped-Query Attention with 8 KV heads, head_dim 128, across 80 layers. Per-token KV-cache footprint is ~327 KB at BF16 or ~164 KB at FP8.

After loading the ~43.7 GB NVFP4 weight footprint and reserving ~1 GB for runtime, the KV-cache budget on a single 180 GB HGX B200 is approximately 135 GB, versus ~7–8 GB on a single 80 GB H100 at FP8:

KV precision   Bytes/token   KV tokens on H100 (~7–8 GB)   KV tokens on B200 HGX (~135 GB)
BF16           327 KB        ~21,000–24,500                ~413,000
FP8            164 KB        ~42,000–49,000                ~825,000

At avg_seq_len = 8,192 tokens (the vLLM V1 default chunked-prefill budget) with BF16 KV, single-H100 concurrency is ~3 and single-B200 concurrency is ~50. At avg_seq_len = 32,768 (long-context agentic workloads), a single H100 holds less than one full request in memory; a single B200 holds ~12. FP8 KV roughly doubles each figure.
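
The KV arithmetic is short enough to sketch directly. The model shape (80 layers, 8 KV heads, head_dim 128) and the ~7.5 GB and ~135 GB budgets are taken from the text; the function names are ours, and the concurrency figures quoted above appear to correspond to the BF16 row, which is what the loop below evaluates. Small differences from the table are rounding.

```python
def kv_bytes_per_token(bytes_per_elem, n_layers=80, n_kv_heads=8, head_dim=128):
    """K and V entries for one token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def kv_token_capacity(budget_gb, bytes_per_elem):
    """Whole tokens of KV cache that fit in the given budget."""
    return int(budget_gb * 1e9 // kv_bytes_per_token(bytes_per_elem))

def concurrent_requests(budget_gb, avg_seq_len, bytes_per_elem):
    """Requests of avg_seq_len tokens that fit at once (fractional)."""
    return kv_token_capacity(budget_gb, bytes_per_elem) / avg_seq_len

BF16, FP8 = 2, 1  # bytes per KV element

for gpu, budget_gb in (("H100", 7.5), ("B200", 135.0)):
    for seq_len in (8192, 32768):
        n = concurrent_requests(budget_gb, seq_len, BF16)
        print(f"{gpu} @ {seq_len} tokens, BF16 KV: {n:.1f} concurrent requests")
```

Passing `FP8` instead of `BF16` halves the per-token footprint and doubles every capacity figure.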

Chart 2. KV-cache budget remaining after weights load, single GPU. The B200’s ~135 GB is ~18× the H100’s ~7.5 GB. Note that the H200 already reaches ~67 GB (~9×), capturing half the step-change, which is why an existing H200 fleet has a weaker migration case than an H100 fleet (Part 6.1).

This is the load-bearing finding of the Blackwell migration question for long-context production: the workloads that required TP=2 or TP=4 on Hopper for KV headroom fit comfortably on a single B200. The migration changes which workloads are “single-GPU” deployments.

1.3 The roofline shifts further out

The B200 compute-bandwidth ridge moves to:

AI_ridge (B200 FP4 HGX) = 18 PFLOPS / 7.7 TB/s  ≈ 2,338 FLOPs/byte
AI_ridge (B200 FP8 HGX) =  9 PFLOPS / 7.7 TB/s  ≈ 1,169 FLOPs/byte
AI_ridge (H100 FP8)     = 1,979 TFLOPS / 3.35 TB/s ≈ 591 FLOPs/byte

The B200 FP4 ridge is roughly 4× further out than the H100 FP8 ridge. Decode at batch 1 with AI ≈ 1 sits at the same arithmetic intensity but achieves only ~7.7 TFLOPS on B200 (0.04% of FP4 peak), vs ~3.4 TFLOPS on H100 (0.17% of FP8 peak).

The fraction of peak compute used by decode falls as the hardware improves, because the bound is bandwidth, not compute. The Blackwell FP4 compute density is largely irrelevant to batch-1 decode latency; it matters at higher batch and in prefill.
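
A minimal roofline sketch makes the same point numerically. It uses the peak-compute and bandwidth figures quoted in the text as given, and the standard roofline rule that attainable throughput is the lesser of peak compute and arithmetic intensity times bandwidth; the names are ours.

```python
def ridge_flops_per_byte(peak_flops, bandwidth_bytes_per_s):
    """Arithmetic intensity at which a kernel stops being bandwidth-bound."""
    return peak_flops / bandwidth_bytes_per_s

def fraction_of_peak(ai, peak_flops, bandwidth_bytes_per_s):
    """Share of peak compute attainable at arithmetic intensity ai."""
    attainable = min(peak_flops, ai * bandwidth_bytes_per_s)
    return attainable / peak_flops

CONFIGS = {
    "B200 FP4 HGX": (18e15, 7.7e12),
    "B200 FP8 HGX": (9e15, 7.7e12),
    "H100 FP8": (1979e12, 3.35e12),
}

for name, (peak, bw) in CONFIGS.items():
    ridge = ridge_flops_per_byte(peak, bw)
    share = fraction_of_peak(1.0, peak, bw)  # batch-1 decode, AI ~ 1 FLOP/byte
    print(f"{name}: ridge {ridge:,.0f} FLOPs/byte, decode uses {share:.2%} of peak")
```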

Chart 3. The B200 FP4 compute-bandwidth ridge sits ~4× further right than the H100 FP8 ridge. Batch-1 decode at arithmetic intensity ~1 FLOP/byte uses an even smaller fraction of peak compute on Blackwell (0.04%) than on Hopper (0.17%): the throughput gain is bandwidth, not compute. Compute density only matters in the prefill regime (shaded).

Prefill on B200 at FP4, with arithmetic intensity in the 2,000–10,000 FLOPs/byte range, runs comfortably in the compute-bound regime above the ridge. At 75% utilization on the FP4 ceiling, the per-GPU prefill throughput on Llama 3.3 70B reaches approximately:

prefill_ceiling = 0.75 × 18 PFLOPS / (2 × 70.55 × 10^9 FLOPs/token) ≈ 96,000 tok/s

versus ~10,500 tok/s on H100 at FP8. A ~9× prefill throughput improvement, which dominates the workloads where input length is much larger than output length (RAG, document QA, code understanding).

The bandwidth-bound decode improvement is “only” ~4×; the compute-bound prefill improvement is much larger.
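
The prefill ceiling follows from the usual approximation of 2 FLOPs per parameter per token. The sketch below reproduces both figures under the text's assumptions (70.55B parameters, 75% utilization, the quoted peak-compute numbers); the function name is ours.

```python
def prefill_ceiling_tok_s(peak_flops, n_params=70.55e9, utilization=0.75):
    """Compute-bound prefill throughput ceiling, in tokens per second per GPU."""
    flops_per_token = 2 * n_params  # one multiply-add per parameter per token
    return utilization * peak_flops / flops_per_token

b200_prefill = prefill_ceiling_tok_s(18e15)      # FP4 dense, HGX B200
h100_prefill = prefill_ceiling_tok_s(1979e12)    # FP8 dense, H100 SXM5

print(f"B200 FP4: {b200_prefill:,.0f} tok/s")
print(f"H100 FP8: {h100_prefill:,.0f} tok/s")
print(f"ratio: {b200_prefill / h100_prefill:.1f}x")
```

Because utilization and parameter count cancel, the ratio is just the ratio of peak compute, so it holds for any utilization assumed equal on both parts.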

Part 2, Empirical verification from InferenceMAX v1 and MLPerf

The Part 1 derivation predicts a ~4× decode floor improvement and a ~9× prefill ceiling improvement for Llama 3.3 70B going from H100 FP8 to B200 NVFP4. Three empirical sources test the prediction, and getting the comparison right requires understanding precisely how each was measured.

SemiAnalysis InferenceMAX v1 (October 9, 2025) is the most rigorous public source, and its methodology deserves close reading because the headline numbers are easy to misquote.

InferenceMAX runs nightly on hundreds of chips via GitHub Actions, sweeping max-concurrency and parallelism to trace the full throughput-vs-interactivity Pareto frontier rather than reporting a single point.

For each run it uses random sequences (no prefix caching, to avoid the workload-dependent complexity prefix caching introduces), an infinite request rate with a capped max-concurrency, and three input/output sequence-length pairs chosen to represent distinct workload classes: 1K in / 1K out (chat), 1K in / 8K out (reasoning), and 8K in / 1K out (summarization), with each request’s input length randomized to 80–100% of nominal to mimic real traffic variance.
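
To make the workload description concrete, here is a sketch of how request shapes matching it could be sampled. It is not InferenceMAX's harness: the names are ours, reading "1K" and "8K" as 1,024 and 8,192 tokens is an assumption, and only the input length is randomized because that is all the description states.

```python
import random

# Nominal (input, output) token lengths; 1K = 1024 and 8K = 8192 are assumptions.
WORKLOADS = {
    "chat": (1024, 1024),
    "reasoning": (1024, 8192),
    "summarization": (8192, 1024),
}

def sample_request(workload, rng):
    """Draw one request shape with input length at 80-100% of nominal."""
    nominal_in, nominal_out = WORKLOADS[workload]
    input_len = int(nominal_in * rng.uniform(0.8, 1.0))
    return {"workload": workload, "input_len": input_len, "output_len": nominal_out}

rng = random.Random(0)  # fixed seed so runs are repeatable
for name in WORKLOADS:
    print(sample_request(name, rng))
```

Random token content with no shared prefixes is what keeps prefix caching out of the measurement; a real client would also cap in-flight requests at the swept max-concurrency.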

For Llama 3.3 70B specifically, the default engine is vLLM and the Blackwell precision is NVFP4; SGLang is the DeepSeek default and TRT-LLM is run where vendors submit configs.

The InferenceMAX v1 finding on Llama 3.3 70B is unambiguous: “When it comes to LLaMA 70B FP4, B200 significantly outperforms MI355X across all three workload types.” NVIDIA’s accompanying announcement quantifies the generational comparison: Blackwell delivers over 10,000 TPS per GPU at 50 TPS/user interactivity on Llama 3.3 70B, 4× higher per-GPU throughput than H200.

Three things about how this number is constructed, each of which a careless reader gets wrong. First, 50 tok/s/user is a production interactivity floor, not the single-stream maximum.

A B200 at single-stream batch 1 reaches the ~180 tok/s/user ceiling from Part 1.1; the 50 tok/s/user point sits at a higher concurrency where weight loads amortize across many users, which is exactly the throughput-vs-interactivity trade-off that defines the Pareto curve.

Second, the 4× comparison is vs H200, not H100. H200’s 4.8 TB/s bandwidth is ~43% higher than H100’s 3.35 TB/s, so the B200-vs-H100 ratio is correspondingly larger than 4×. Third, the comparison holds NVFP4 on B200 against FP8 on the Hopper part, so the 4× already bundles the precision-footprint contribution; it is not a pure-hardware ratio.

Our derivation decomposes the B200-vs-H100 ratio cleanly: 2.3× from bandwidth (7.7 vs 3.35 TB/s) times ~1.69× from the NVFP4 streaming-footprint reduction (42.7 vs 72 GB) gives ~3.9× at the bandwidth-bound decode floor.

The InferenceMAX 4×-vs-H200 number, adjusted for the H200-to-H100 bandwidth gap, lands in the same region. The decomposition matters more than the point estimate: it tells a deployment engineer that running B200 at FP8 instead of NVFP4 sacrifices the 1.69× footprint term and recovers only the 2.3× bandwidth term.
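The decomposition is simple enough to check directly. The sketch below multiplies the bandwidth ratio by the streaming-footprint ratio, using only the figures quoted in the text; the function name is illustrative, and the model ignores everything except the bandwidth-bound decode floor.

```python
# Bandwidth-bound decode floor: tokens/s scales with bandwidth / bytes streamed per token.
H100_BW_TBS = 3.35           # H100 memory bandwidth, from the text
B200_BW_TBS = 7.7            # B200 memory bandwidth, from the text
FP8_FOOTPRINT_GB = 72.0      # FP8 streaming footprint, from the text
NVFP4_FOOTPRINT_GB = 42.7    # NVFP4 streaming footprint, from the text


def decode_floor_ratio(bw_new: float, bw_old: float,
                       footprint_old: float, footprint_new: float) -> float:
    """Speed-up at the decode floor = bandwidth term x footprint term."""
    return (bw_new / bw_old) * (footprint_old / footprint_new)


# B200 at NVFP4 vs H100 at FP8: both terms contribute.
nvfp4_gain = decode_floor_ratio(B200_BW_TBS, H100_BW_TBS,
                                FP8_FOOTPRINT_GB, NVFP4_FOOTPRINT_GB)

# B200 at FP8 vs H100 at FP8: the footprint term collapses to 1.
fp8_gain = decode_floor_ratio(B200_BW_TBS, H100_BW_TBS,
                              FP8_FOOTPRINT_GB, FP8_FOOTPRINT_GB)

print(f"B200 NVFP4 vs H100 FP8: {nvfp4_gain:.2f}x")
print(f"B200 FP8   vs H100 FP8: {fp8_gain:.2f}x")
```

Keeping the two terms as separate factors makes the deployment trade-off visible: dropping NVFP4 removes one factor rather than shaving a little off the total.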

MLPerf Inference v4.1 is the second anchor. NVIDIA’s August 2024 submission with one B200 GPU on Llama 2 70B reported 10,756 tokens/s server scenario and 11,264 tokens/s offline scenario, against a per-GPU H100 figure derived by dividing the eight-GPU H100 submission by eight, yielding a reported 4× server / 3.7× offline per-GPU increase.

Llama 2 70B and Llama 3.3 70B share the relevant architecture (80 layers, hidden 8192, 64 Q heads, 8 KV heads in 8:1 GQA), so the ratio transfers even though the absolute number is for Llama 2.

The three paths cluster around 4×, though on deliberately different baselines that are worth stating precisely.

InferenceMAX v1 (production interactivity, NVFP4-vs-FP8): ~4× vs H200.
MLPerf v4.1 (offline, peak throughput): ~3.7–4× vs H100 per-GPU.
First-principles (bandwidth-bound decode floor): ~3.9× vs H100.

The two H100-baselined numbers agree tightly at ~3.9×; the InferenceMAX figure is vs the faster H200, so restated against H100 it would be somewhat larger than 4×, meaning the three are mutually consistent rather than coincidentally equal.

Convergence of an independent nightly benchmark, a standardized industry benchmark, and a from-scratch derivation in the same band, with the baseline differences accounted for, is what separates a defensible figure from a vendor claim repeated.

Chart 4. The throughput-improvement figure does not rest on any single source. InferenceMAX v1 (vs H200), MLPerf v4.1 (vs H100 per-GPU), and the first-principles bandwidth derivation (vs H100) all land in the 3.7-4x band.


Part 3, The cost-per-token economics

Two factors determine the cost-per-token improvement: throughput-per-GPU (~4×) and price-per-GPU-hour (varies by cloud, ~2–3×). The product is the cost-per-token ratio.

3.1 The real cost-per-million-tokens math

Using Spheron’s published rates (consistent with the H100 numbers in Issue #1):

| Hardware | On-demand $/hr | Spot $/hr | Throughput at 50 tok/s/user | Cost / 1M tokens (on-demand) | Cost / 1M tokens (spot) |
|---|---|---|---|---|---|
| H100 SXM5 (FP8) | $2.64 | $1.66 | ~1,500 TPS | $0.49 | $0.31 |
| H200 SXM5 (FP8) | $4.62 | $1.92 | ~2,500 TPS | $0.51 | $0.21 |
| B200 HGX (NVFP4) | $6.03 | $2.12 | ~10,000 TPS | $0.17 | $0.059 |

All three per-hour rates are Spheron’s published figures as of 22 May 2026 (H100 $2.64/$1.66, H200 $4.62/$1.92, B200 $6.03/$2.12 on-demand/spot). H100 and H200 throughput at 50 tok/s/user is extrapolated from the SemiAnalysis H200-vs-B200 4× ratio combined with the H100-to-H200 1.43× bandwidth scaling; B200 throughput is the InferenceMAX v1 figure.

Note the counterintuitive H200 on-demand row: at $4.62/hr its cost-per-token ($0.51/M) is actually slightly worse than the H100’s ($0.49/M), because the throughput gain in the table (~2,500 vs ~1,500 TPS, about 1.67×) does not fully offset the ~75% price premium.

The H200’s real advantages are its memory capacity (the KV-wall relief in Part 6.1) and its aggressive spot rate ($0.21/M), not its on-demand cost-per-token. This is exactly the kind of inversion that a vendor “1.4× faster inference” headline hides: faster does not mean cheaper-per-token unless the price scales sublinearly with the throughput.

The honest derived improvement, on-demand vs on-demand, is ~2.9× cheaper per million tokens ($0.49 vs $0.17). Spot vs spot, the table gives ~5× ($0.31 vs $0.059). Because B200 spot pricing is currently aggressive (Spheron’s $2.12/hr is lower than H100 on-demand at $2.64/hr), a buyer moving from H100 on-demand to B200 spot sees ~8× cheaper.
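The per-million-token figures follow mechanically from the hourly rate and the sustained throughput. The sketch below recomputes them from the rates and throughputs in the table and prints three ratios so that each comparison states its own baseline; the dictionary layout and variable names are illustrative only.

```python
# (on-demand $/hr, spot $/hr, tokens/s at 50 tok/s/user), copied from the table above
FLEET = {
    "H100": (2.64, 1.66, 1_500),
    "H200": (4.62, 1.92, 2_500),
    "B200": (6.03, 2.12, 10_000),
}


def cost_per_million(rate_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return rate_per_hour / tokens_per_hour * 1_000_000


# name -> (on-demand $/M tokens, spot $/M tokens)
costs = {
    name: (cost_per_million(on_demand, tps), cost_per_million(spot, tps))
    for name, (on_demand, spot, tps) in FLEET.items()
}

on_demand_ratio = costs["H100"][0] / costs["B200"][0]   # on-demand vs on-demand
spot_ratio = costs["H100"][1] / costs["B200"][1]        # spot vs spot
cross_ratio = costs["H100"][0] / costs["B200"][1]       # H100 on-demand vs B200 spot

for name, (on_demand_cost, spot_cost) in costs.items():
    print(f"{name}: ${on_demand_cost:.3f}/M on-demand, ${spot_cost:.3f}/M spot")
print(f"on-demand vs on-demand: {on_demand_ratio:.1f}x")
print(f"spot vs spot:           {spot_ratio:.1f}x")
print(f"H100 on-demand vs B200 spot: {cross_ratio:.1f}x")
```

Stating which price sits on each side of the ratio matters as much as the ratio itself, since the same table yields very different multiples depending on the pairing.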

This is materially different from NVIDIA’s “15× lower cost per million tokens” headline. The 15× number is real, but it compares rack-scale GB200 NVL72 air-cooled to H100 HGX air-cooled on MoE workloads (GPT-MoE-1.8T projected throughput, per the NVIDIA datasheet), not single-GPU Llama 3.3 70B. NVIDIA’s announcement post is precise about this: the cost-per-million-tokens claim applies to “AI factory economics” at rack scale, not the single-GPU comparison most readers are actually evaluating.

The 3× to 8× range is what a deployment engineer should plan around. The 15× number is what a CFO will hear from a vendor presentation.

Chart 5. Cost per million output tokens at 50 tok/s/user, from Spheron’s published per-hour rates divided by InferenceMAX v1 throughput. The honest single-GPU improvement is ~2.9x on-demand, ~5x spot vs spot, and ~8x for B200 spot against H100 on-demand, well short of NVIDIA’s 15x headline, which is a rack-scale GB200 NVL72 MoE-at-BF16 comparison.

3.2 The migration cost the price-per-token comparison hides

Three categories of migration cost do not appear in the $/M-tokens calculation but determine whether the migration actually saves money.

Infrastructure. B200 HGX runs at 1,000W per GPU vs H100 SXM5’s 700W. GB200 in the NVL72 configuration runs at up to 1,200W per GPU. Rack density falls accordingly, and air-cooling becomes marginal at B200’s power level. Liquid cooling is the production path NVIDIA assumes. Data centers without liquid cooling either run fewer GPUs per rack (reducing the effective $/hr improvement) or deploy in a different facility (capital expenditure not in the per-hour rate). For cloud renters the infrastructure cost is bundled into the per-hour price; for on-premise buyers it is not, and it can easily exceed the GPU hardware cost over a deployment lifetime.

Software maturity. vLLM v0.12.0 is the current Blackwell-ready release as of the Llama 3.3 70B recipe page. NVFP4 support landed first in TensorRT-LLM (full); vLLM has shipped what NVIDIA describes as “early NVFP4 support” with the second-generation Transformer Engine path; SGLang has it on the roadmap. Production deployment in 2026 means accepting a less-mature software stack than Hopper’s, with the kernel and scheduler optimizations of Issue #1 (FlashInfer integration, Triton attention, FA3-vs-FA4 selection, CUDA graph capture) still being landed across the three engines for the new architecture. The Hopper stack benefits from 18 additional months of production hardening that Blackwell will need time to accumulate.

Supply. Hardware order backlogs through mid-2026 mean on-premise B200 purchases compete with hyperscaler allocations. Cloud rental is the practical path, and cloud pricing on B200 is more volatile than on H100. Spheron’s $2.12/hr spot rate is real but not guaranteed across providers or time. The H100 ecosystem is mature enough that a one-year reserved instance commits to a known price; B200 commits are quoted but the secondary-market spread is wider.

3.3 A worked migration example

Take the Issue #1 worked example: a chat product at 50 requests/second peak sustained, 800 input tokens, 400 output tokens. Decode dominates; the output token rate is 50 × 400 = 20,000 tok/s. On H100 FP8 at production interactivity, a single GPU sustains ~1,500 TPS, requiring 14 H100s and costing ~$27,000/month on-demand or ~$17,000/month at spot.

On B200 NVFP4 at the same interactivity, a single GPU sustains ~10,000 TPS, so the same workload needs ~2 B200 GPUs. Cost: $6.03/hr × 2 GPUs × 730 hr/month = ~$8,800/month on-demand, or ~$3,100/month on spot. The on-demand savings vs H100 are roughly $18,000/month; the spot savings are roughly $14,000/month, depending on which side of the comparison gets spot pricing.
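The sizing above can be written as a short calculation. This sketch uses only the workload, throughput, and price figures from the text, with 730 hours per month as in the cost line; the helper names are illustrative, and the model sizes for decode only, as the text does.

```python
import math

HOURS_PER_MONTH = 730

PEAK_RPS = 50          # requests per second at peak, from the text
OUTPUT_TOKENS = 400    # output tokens per request, from the text


def gpus_needed(tokens_per_second_per_gpu: float) -> int:
    """GPUs required to sustain the peak decode rate, rounded up."""
    required = PEAK_RPS * OUTPUT_TOKENS   # 20,000 tok/s
    return math.ceil(required / tokens_per_second_per_gpu)


def monthly_cost(gpus: int, rate_per_hour: float) -> float:
    """Monthly bill for a fixed fleet running around the clock."""
    return gpus * rate_per_hour * HOURS_PER_MONTH


h100_gpus = gpus_needed(1_500)
b200_gpus = gpus_needed(10_000)

h100_on_demand = monthly_cost(h100_gpus, 2.64)
h100_spot = monthly_cost(h100_gpus, 1.66)
b200_on_demand = monthly_cost(b200_gpus, 6.03)
b200_spot = monthly_cost(b200_gpus, 2.12)

print(f"H100: {h100_gpus} GPUs, ${h100_on_demand:,.0f} on-demand, ${h100_spot:,.0f} spot")
print(f"B200: {b200_gpus} GPUs, ${b200_on_demand:,.0f} on-demand, ${b200_spot:,.0f} spot")
print(f"on-demand savings: ${h100_on_demand - b200_on_demand:,.0f}/month")
print(f"spot savings:      ${h100_spot - b200_spot:,.0f}/month")
```

The ceiling in the GPU count is where the granularity effect enters: the H100 fleet rounds 13.3 up to 14, while the B200 fleet lands exactly on 2.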

Three caveats sharpen the picture. First, the 2-GPU B200 deployment has dramatically more KV headroom (~270 GB aggregate budget vs ~16 GB on 2 H100s), so the same hardware can absorb longer sequences or higher concurrency before scaling out further.

Second, prefill (40,000 in-tok/s required) needs only one B200 GPU’s worth of prefill capacity (~96,000 tok/s ceiling vs the workload’s 40,000), so the deployment is decode-bound everywhere.

Third, the autoscaling math improves: 2 GPUs is a less granular floor than 14, but the per-GPU economics make off-peak headroom less costly. A peak-to-average ratio of 2× on H100 means paying for 28 GPU-hours per peak-hour-equivalent of demand; on B200 it means paying for 4.

The autoscaling discontinuities matter less when each GPU is doing more work.

3.4 What the per-hour rate actually pays for

The single most important fact about inference TCO is structural, and SemiAnalysis states it directly in the InferenceMAX v1 analysis: colocation rent and electricity together are typically less than 20% of total cost of ownership.

The dominant term is the GPU vendor’s gross margin. In SemiAnalysis’s words, some vendors charge “up to 75% gross margins (i.e. a 4× markup over cost of goods sold), while others less than 50% gross margins (i.e. less than 2× cost of goods sold).”

This reframes the migration question. A naive analysis treats the B200’s higher power draw (1,000W vs 700W) as a major cost penalty. It is not: if a B200 delivers 20% fewer tokens per provisioned megawatt than some alternative, that translates to less than 4% of TCO (20% of the under-20% energy share).

The migration economics are dominated by the price you pay for the silicon, which is set by vendor margin and supply, not by the power bill.

This is why the spot-vs-on-demand spread (a ~3× swing in the per-hour rate) dwarfs every efficiency consideration in the cost-per-token math: the per-hour rate is mostly margin, and margin is what moves between spot and on-demand.

The corollary for a buyer: the highest-leverage variable in Blackwell migration economics is procurement, not engineering. Securing B200 capacity at the $2.12/hr spot rate rather than the $6.03/hr on-demand rate is a larger cost-per-token lever than any kernel optimization, any precision choice, or any batching strategy.

The engineering determines the throughput; the contract determines most of the cost.

3.5 Power efficiency: the megawatt view

For power-constrained deployments (the binding constraint for many AI data centers in 2026), throughput per provisioned megawatt is the relevant metric rather than throughput per GPU.

SemiAnalysis InferenceMAX v1 measured this directly, though the published Hopper-vs-Blackwell power numbers are for gpt-oss 120B rather than Llama 3.3 70B, so they should be read as a Blackwell-generation indicator rather than a Llama-specific figure.

On gpt-oss 120B FP4, an HGX H100 processes ~900,000 tok/s per all-in provisioned megawatt; an HGX B200 processes ~2.8 million tok/s per megawatt, a ~3× generational power-efficiency gain. At a higher interactivity level of ~180 tok/s/user, the B200 advantage widens to ~7×, because the H100 falls off its efficiency curve faster at high interactivity than the B200 does.

(Provisioned megawatt here means all-in utility power including cooling and electrical-distribution overhead, not just GPU TDP, which is the honest denominator for a data-center operator.)

The ~3× power-efficiency improvement roughly tracks the ~4× throughput improvement, which is expected: the B200’s TDP (1,000W) is ~1.43× the H100’s (700W), and 4× throughput / 1.43× power ≈ 2.8× efficiency.
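A quick check of that arithmetic, set against the measured gpt-oss figures quoted above. The sketch treats per-GPU TDP as a stand-in for provisioned power, which is a simplification introduced here: the all-in megawatt denominator also includes cooling and distribution overhead.

```python
def efficiency_gain(throughput_ratio: float, tdp_new_w: float, tdp_old_w: float) -> float:
    """Tokens-per-watt gain implied by a throughput ratio and a TDP ratio."""
    return throughput_ratio / (tdp_new_w / tdp_old_w)


# Implied by the ~4x throughput gain and the 1,000 W vs 700 W TDPs in the text.
implied = efficiency_gain(4.0, 1_000, 700)

# Measured on gpt-oss 120B: ~2.8M vs ~900k tok/s per provisioned megawatt.
measured = 2.8e6 / 0.9e6

print(f"implied by TDP arithmetic: {implied:.1f}x")
print(f"measured per megawatt:     {measured:.1f}x")
```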

For a Llama 3.3 70B deployment specifically, the power-efficiency improvement should land in the same 3× range by the same arithmetic, but we flag that the published megawatt figures are gpt-oss numbers and the Llama-specific power measurement is one of the things Issue #3 will report directly.

Chart 6. Throughput per provisioned megawatt, H100 vs B200, on gpt-oss 120B FP4 (the published InferenceMAX figures are for gpt-oss, not Llama 3.3 70B, and are shown here as a Blackwell-generation indicator). The ~3x gain at production interactivity widens to ~7x at high interactivity because the H100 falls off its efficiency curve faster.


How Systems Really Fail, Part IV

Lorenzo Bradanini — Wed, 03 Jun 2026 16:26:41 GMT

The System Has No Architect

The first essay argued that distributed systems fail in the spaces between their components. The second argued that the system you observe is a delayed, partial projection of a system that has already moved on.

The third argued that the control loops you close on that projection cannot stabilise the system you have.

Each was a structural failure (composition, observation, control) and each was localisable: you could point to where the failure lived. This essay is about the failure that is not localised anywhere.

The first three essays assumed there is a system to fail. Call it S, the object engineers compose, observe, and control. The composition fails because interfaces hide state. The observation fails because dashboards project S into a representable space and discard the rest. The control fails because the loop cannot reach the parts of S that matter.

In all three cases the failures are gaps between S and the operator’s representation of it. The argument of this essay is that, past a certain scale, S is not the object the operators think it is.

Production distributed systems at scale are complex systems in the technical sense the term carries in the work of Perrow, Cilliers, Snowden, Leveson, and Dekker. Their behaviour is not the sum of their parts.

Their failure modes are emergent, properties of the whole that no component possesses in isolation. They have no architect. They cannot be modelled in their entirety by any individual.

And they sit, by construction, near the edge of failure, because the same competitive pressure that makes them efficient drives them toward the boundary of safe operation.

The three earlier failures are manifestations of this. The gaps between components cannot be closed because nobody knows where the gaps are. The dashboards cannot be made complete because the system is dimensionally larger than any representation.

The control loops do not close because the plant is not a plant in the textbook sense; it is an emergent process whose dynamics are not contained in any single component’s specification.

Past a certain scale, the system is not the object the operators think it is. And complex systems do not yield to the methods that produced reliable software at small scale.

Four incidents, mechanically reconstructed.

The 2003 Northeast Blackout, where a race condition in one utility’s alarm system propagated, through coupling no one had mapped, into a cascade that affected fifty-five million people.
The AWS S3 outage of February 2017, where one mistyped command exposed a dependency graph no individual had seen end to end.
The Cloudflare WAF outage of July 2019, where one regular expression, deployed globally in seconds, took down a large fraction of the internet’s HTTPS.

And, because the thesis is that these failures are structural rather than historical, a fourth: the Cloudflare outage of November 2025, where the same company repeated a structurally identical failure six years later for reasons unrelated to the specific bug.

Four emergent modes: cascade through coupling, concentration through scale, worst case from interaction, and the recurrence that proves the pattern is not an accident.


What kind of object is a production system?

The canonical work is Charles Perrow’s Normal Accidents (1984), written after Three Mile Island to characterise the systems in which accidents become statistically inevitable.

His framework has two axes. Interactive complexity is how many ways the components can affect each other, including ways the designers did not anticipate.

Coupling is how tightly they are connected in time, whether a failure must propagate immediately or whether there is slack to absorb it.

Loosely-coupled, linear systems (assembly lines, road networks) fail in expected, containable ways.

Complex, tightly-coupled systems (nuclear plants, refineries, the financial system, the internet) fail in ways their operators did not anticipate, and the failures propagate before anyone can intervene.

Perrow’s claim is sharp: complex tightly-coupled systems are not made safe by adding safety features. Past a threshold, features add interactions, which add failure modes. The system has accidents as a normal property of operation, not a deviation from it.

As he restated it in 2012, a normal accident is one where everyone tries hard to play safe, but unexpected interaction of two or more failures (interactive complexity) causes a cascade (tight coupling).

Production distributed systems lie at the far end of both axes. The interactive complexity is enormous: every service has dozens of dependencies, each version-skewed against the others, interacting through coupling paths invisible in the architecture diagrams.

The coupling is tight: cache TTLs in seconds, retry budgets in tens of milliseconds, timeouts tuned down over years until they barely accommodate the steady state. There is no buffer left.

Figure 1, Perrow’s two-axis taxonomy. Systems in the upper-right quadrant produce accidents as a normal property of operation. Latency optimisation and dependency growth push production distributed systems monotonically toward the extreme corner.

Three frameworks sharpen the point. Cilliers distinguished the complicated (many parts, knowable structure, holdable in one head) from the complex (interactions producing behaviours not present in any part and not predictable from the specifications).

Snowden’s Cynefin adds the prescription: in complex contexts cause and effect are clear only in retrospect, so operators must probe, sense, and respond rather than analyse and execute.

Leveson’s STAMP reframes accidents as failures of control over the interactions between components, not failures of the components themselves.

The compressed claim: the methods that produce reliable software at small scale cannot transfer to large scale, because the object they were built to control is no longer the object that exists.

Cascade: the Northeast Blackout, 14 August 2003

At 12:15 EDT, the state estimator at MISO, the reliability coordinator for much of the Midwest and Ontario, began diverging from its measurements.

A state estimator infers load and voltage on every line from noisy telemetry every few minutes; when it converges the operator has a coherent picture of the grid, and when it does not the operator is blind to anything not directly measured.

An analyst traced the divergence to a tripped Indiana line, fixed the topology by hand, then forgot to re-enable the estimator’s automatic trigger.

From 12:37 until 16:04, the window in which the cascade silently assembled itself, MISO’s contingency analysis was effectively offline: operators could no longer answer “what happens if line X trips?” because they had no current model of where the lines were.

At 13:31 EDT, FirstEnergy’s Eastlake Unit 5 tripped while carrying 612 MW and 400 MVAr of reactive power. The lost generation should have been absorbed; the lost reactive support depressed voltage across northern Ohio.

At 14:14 EDT, the alarm processor of GE’s XA/21 system at FirstEnergy’s Akron control centre deadlocked. The control room was entirely alarm-driven: operators responded to alarms rather than watching the mimic.

A latent race condition, a deadlock under high event-queue depth, silently stopped the primary alarm server. No error was raised. The backup took over, inherited the same growing queue, and deadlocked too.

From then until well after the cascade ended, operators believed they were watching a current, stable grid. They were watching a stale snapshot from before the cascade started, with no signal it was stale.

The dashboard was not wrong. It was describing a system that no longer existed.

At 15:05, 15:32, and 15:41 EDT, three northeast Ohio lines sagged into trees and tripped. Each loss pushed its load onto the survivors, which carried more current, heated, expanded, sagged, and contacted more vegetation.

The cascade is a positive-feedback loop in continuous time: the same physics that produces normal operation produces accelerating failure once a threshold is crossed.
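The threshold behaviour can be illustrated with a deliberately crude toy model. It assumes that a tripped line's load is split equally among the surviving lines and that a line trips the moment its load exceeds a fixed capacity; real grids redistribute flow according to network impedance, and thermal sag plays out over minutes, so none of the numbers below describe the 2003 system.

```python
def cascade(loads, capacities, first_trip):
    """Return tripped-line indices in order, under equal load redistribution (toy model)."""
    loads = list(loads)
    alive = set(range(len(loads)))
    tripped = []
    pending = [first_trip]
    while pending:
        shed = 0.0
        for i in pending:
            alive.discard(i)
            tripped.append(i)
            shed += loads[i]
            loads[i] = 0.0
        if not alive:
            break
        # Survivors pick up the shed load equally; any that exceed capacity trip next round.
        share = shed / len(alive)
        for j in alive:
            loads[j] += share
        pending = [j for j in sorted(alive) if loads[j] > capacities[j]]
    return tripped


capacities = [100.0] * 5

# Five lines at 70% loading: losing one is absorbed by the other four.
absorbed = cascade([70.0] * 5, capacities, first_trip=0)

# The same five lines at 85% loading: losing one overloads every survivor.
collapsed = cascade([85.0] * 5, capacities, first_trip=0)

print("70% loading, tripped lines:", absorbed)
print("85% loading, tripped lines:", collapsed)
```

The same redistribution rule produces both outcomes; only the headroom differs, which is the sense in which the mechanism of normal operation and the mechanism of failure are one and the same.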

Figure 2. The cascade assembles slowly inside a 3.5-hour blind window, then accelerates. The transition is sharp: before 16:05:57 the event was confined to Ohio and recoverable by load-shedding; after it, the cascade propagated at the speed of protective-relay action.

At 16:05:57 EDT the Sammis-Star line tripped, not from a tree but from over-current. That was the transition point. After it the cascade became uncontrolled, propagating across the interconnection in milliseconds to seconds per trip.

Within eight minutes, lines tripped across Ohio, Michigan, Pennsylvania, New York, and Ontario; the Eastern Interconnection separated into islands; generators tripped as frequencies diverged from 60 Hz. By 16:13, roughly 508 units at 265 plants were offline and fifty-five million people had lost power.

The US-Canada Task Force estimated four to ten billion dollars in losses and named four causes: two operational (unmaintained vegetation, failure to shed load in time) and two observational (no effective contingency analysis, failure of the monitoring tools).

The interesting feature is not the bug. The continental cascade was contingent on the interaction of three unrelated things: trees in Ohio, an alarm system that worked correctly except under one event-ordering pattern, and a grid coupled tightly enough that a loss in Ohio reached Ontario in eleven minutes.

None was dangerous in isolation; the danger was emergent from the combination. No component was outside its specified parameters when the cascade began.

In Leveson’s terms, the regional control structure had not been designed to enforce the constraint that no single utility’s blindness shall propagate beyond its control area, and the cascade would have happened the same way had the XA/21 bug been a different bug producing the same blindness in the same window.

AWS S3, 28 February 2017

At 9:37 AM PST, an authorised S3 engineer ran a routine playbook to remove a few servers from the billing subsystem. One parameter was wrong, and the command removed a much larger set.

A familiar story so far: a fat-finger event, an under-validated tool, a blast radius beyond intent. The interesting part is what happened next.

The removed servers also supported the index subsystem (metadata and location for every object, required for GET, LIST, PUT, DELETE) and the placement subsystem (which depends on the index).

Both dropped below the capacity they needed and entered a state requiring a full restart.

Here the structural problem surfaced. AWS later wrote that the full restart, relied on since launch, had not been run on the index or placement subsystems in their larger regions for many years. S3 in us-east-1 launched in 2006, and the metadata had since grown by orders of magnitude.

The restart procedure, designed at a far smaller scale, took dramatically longer than expected, because each system coming back had to validate a metadata store far larger than the procedure assumed, and the validation scaled non-linearly. Roughly four hours of impact.

What failed during those four hours is the substantive part.

S3 had quietly become a hidden dependency for a remarkable fraction of the public internet: Slack, Quora, Trello, Imgur, Medium, Coursera, GitHub release artefacts, Docker Hub, Adobe Creative Cloud, parts of Zillow and Expedia, each having independently chosen S3, each apparently unaware how many others had.

The cumulative effect was a single point of failure for a non-trivial percentage of the public internet that no organisation had ever ratified.

The cutting detail: AWS’s own Service Health Dashboard depended on S3 to host its status icons, so it could not visually update to reflect the outage of the service it depended on. The icons stayed green because the red ones could not load.

Figure 3. The concentration paradox. Each spoke is an independent, individually-rational decision to use the cheapest reliable object store. Nobody ratified the convergence and nobody can see it. The self-loop is the recovery-dependency cycle: the status page for S3 was itself served from S3.

The shape here is not cascade. The failure was localised to one region of one service. What made it catastrophic was concentration: an enormous number of independent systems had converged on the same choice without anyone designing the convergence.

None had chosen to share fate; they had each chosen, separately, the cheapest reliable object store, and that was S3 in us-east-1. The shared fate was emergent from the aggregation of independent decisions.

Network science calls the mechanism preferential attachment: independent decisions under similar constraints converge on a few providers, and the providers become single points of failure for the aggregate without anyone designing it.

The most reliable service in a category becomes the largest single point of failure for that category, precisely because it is the most reliable. The convergence is rational at the individual level and catastrophic at the aggregate level, and no one can intervene against it because no one can see how many others made the same choice.

This is emergence of a different kind from cascade. The blackout’s was dynamic, failures propagating along time-coupled paths in minutes.

S3’s was structural, dependencies accreting along static paths nobody maintained a record of, over a decade. Both are failures of the whole, in the precise sense that no subsystem was responsible.

What made the blast radius continental was the way thousands of independent decisions had made S3 a chokepoint no individual fully understood.

AWS’s fix, beyond safer commands, was to partition the index subsystem into cells so a future restart brings back one cell at a time.

In Perrow’s terms, that is an explicit attempt to reduce coupling and reintroduce slack between subsystems whose tight coupling had silently emerged from organic growth.


Worst case from interaction: Cloudflare WAF, 2 July 2019

At 13:42 UTC, a Cloudflare engineer deployed a minor change to the WAF Managed Rules, a new rule meant to improve detection of inline JavaScript used in cross-site-scripting attacks.

It went out via Quicksilver, which propagates configuration to every edge server globally in seconds, by design, so emergency security responses do not wait for a gradual rollout.

Three minutes later the first page fired. CPU on every Cloudflare edge worldwide had spiked to 100%. Cloudflare’s network, by 2019 fronting roughly ten percent of the world’s HTTPS, could not process new requests.

Customer sites returned 502s; Cloudflare’s own dashboard, API, and internal tools, all routed through the same edge, went unreachable. At its worst, traffic dropped 82%.

The rule’s structural problem was a trailing pattern of the shape .*(?:.*=.*). A backtracking regex engine handles .* greedily: it matches as much as possible, then, if the rest fails, gives back one character at a time and retries.

With several unanchored .* constructs in sequence, the number of match positions grows combinatorially in the input length. Against an input that almost matches but diverges late, a pattern of the form .*.*=.* with no equals sign explores on the order of n-choose-2 partition points: quadratic, O(n²), for this shape, and exponential, O(2ⁿ), in pathological cases like (a+)+. Quadratic and cubic blow-ups are equally lethal at millions of requests per second.
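The quadratic count can be made concrete with a hand-rolled model of the search. The function below is not PCRE and not the Cloudflare rule: it is a minimal sketch that enumerates, in greedy order, every way two consecutive .* groups can split the input before the engine tries to match the literal equals sign, and counts those attempts.

```python
def backtracking_attempts(text: str) -> int:
    """Count attempts a naive backtracker makes to place '=' for the shape .*.*=.*"""
    n = len(text)
    attempts = 0
    # The first .* starts greedy (takes everything) and gives back one character at a time.
    for first in range(n, -1, -1):
        # The second .* does the same over whatever the first left behind.
        for second in range(n - first, -1, -1):
            attempts += 1
            pos = first + second
            if pos < n and text[pos] == "=":
                return attempts   # the trailing .* always matches the remainder
    return attempts               # no '=' anywhere: every split was tried


for n in (100, 200, 400):
    print(n, backtracking_attempts("x" * n))
```

With no equals sign in the input the count is (n+1)(n+2)/2, so doubling the input length roughly quadruples the work. Production engines apply optimisations this model omits, so it shows the growth rate of the naive search rather than any engine's exact step count.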

Figure 4. Why one regex took down ten percent of the internet’s HTTPS. PCRE’s backtracking engine offers no complexity guarantee, so on inputs the test author never wrote, the same rule that was fast in the suite explored a quadratic-to-exponential number of paths. RE2’s DFA does one state transition per character: linear, regardless of pattern shape.

This pathology, catastrophic backtracking, is a known property of backtracking engines (PCRE, Perl, Python’s re, JavaScript’s RegExp) on inputs they were not tested against.

Cloudflare’s Lua WAF used PCRE because PCRE ships with Lua, and PCRE has no complexity guarantee: it attempts every backtrack until it matches or exhausts the search.

The rule had passed the test suite. It had even been deployed in “simulate” mode, where the rule runs against real traffic but blocks nothing, explicitly to catch this, but simulate mode still executes the regex on each request and the CPU cost was identical.

Two factors made it worse: the WAF Managed Rules pipeline bypassed Cloudflare’s normal staged rollout (a deliberate speed-against-safety trade for emergency patches), and a CPU-time safeguard that would have caught a runaway regex had been removed by mistake during an earlier refactoring whose explicit goal was to reduce CPU consumption.

The recovery was constrained by the same property that caused the failure: the engineers who needed the kill switch could not reach the control panel, because it runs behind Cloudflare’s own Access product, which routes through the edge.

They used a rarely-exercised bypass, diagnosed it by 14:02, and killed the rulesets at 14:09. Total impact: about twenty-seven minutes, during which Discord, Feedly, Coinbase, and a large share of global HTTPS returned 502s, an aggregate impact orders of magnitude larger than the duration suggests, because Cloudflare had, like S3, become public infrastructure.

The form is worst case from interaction. The system has performance regimes not exercised by any input used in development or testing, exposed only by realistic input at scale.

The regex did not fail under any tested input; the engine did not fail in any benchmark; the pipeline functioned as designed. The fault was emergent from the joint behaviour of a regex, an engine, an input distribution, and a deployment process, each correct in isolation.

This is what Hollnagel calls the underspecification problem: a component’s spec covers anticipated inputs, the system at scale sees inputs no one anticipated, and the behaviour under those is emergent from the implementation, not contained in the spec.

Cloudflare’s fix, beyond restoring the safeguard and staging rollouts, was to migrate from PCRE’s backtracking to an engine based on a deterministic finite automaton, which guarantees linear time regardless of pattern shape because every input character causes exactly one state transition.
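For contrast with the backtracking behaviour, here is a minimal sketch of what one transition per character looks like. It hand-codes a two-state automaton for the language that .*.*=.* accepts (any string containing an equals sign, ignoring newline handling); it illustrates the idea only and is not how a production automaton-based engine is built.

```python
def dfa_contains_equals(text: str) -> tuple[bool, int]:
    """Two-state DFA: state 0 = no '=' seen yet, state 1 = accepting. Returns (matched, transitions)."""
    state = 0
    transitions = 0
    for ch in text:
        transitions += 1          # exactly one transition per input character
        if state == 0 and ch == "=":
            state = 1             # state 1 is absorbing: any later character stays in it
    return state == 1, transitions


print(dfa_contains_equals("x" * 200))   # no match, 200 transitions
print(dfa_contains_equals("a=b"))       # match, 3 transitions
```

The work is bounded by the input length whether or not the input matches, which is the property a backtracking engine cannot offer.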

That is the move Perrow recommended in 1984: where possible, reduce the interactive complexity rather than defend against its consequences.

Recurrence: Cloudflare again, 18 November 2025

If the three preceding incidents were merely history, a sceptic could argue the field has since learned its lessons.

The strongest evidence against that reading is that the same company, having written one of the most-cited post-mortems in the industry about its 2019 outage, produced a structurally identical failure six years later, for reasons again unrelated to the specific bug.

On 18 November 2025 at 11:05 UTC, Cloudflare deployed a correct permissions change to a ClickHouse cluster, granting users explicit access to metadata for shard tables in a schema called r0.

The problem was a buried assumption elsewhere. A query feeding the Bot Management system listed a table’s columns and had always, by assumption, returned only the default database’s columns.

After the change, it also returned the r0 schema’s columns, roughly doubling the rows. That output fed directly into a “feature file,” the configuration the Bot Management model consumes to score every request as bot or human.

The feature file, normally stable around sixty features, more than doubled to over two hundred. It is regenerated every few minutes and propagated globally, by design, so the system can react to new bot behaviour.

The core proxy preallocates memory for these features and enforces a hard limit of two hundred. When the bloated file hit production it exceeded the limit, and the Rust code did not handle the error gracefully.

It panicked: thread fl2_worker_thread panicked, called unwrap on an Err value. Every request through the Bot Management path returned a 5xx.
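The alternative to a panic is to reject the bad artefact and keep serving with the last one that loaded. The sketch below shows that fallback in Python rather than Rust, and it is not Cloudflare's code: `FeatureStore`, `parse_strict`, and `FeatureFileTooLarge` are hypothetical names, and only the limit of two hundred comes from the account above. Whether stale features are acceptable for a given workload is a design decision this sketch simply assumes.

```python
MAX_FEATURES = 200  # the hard limit described in the text


class FeatureFileTooLarge(Exception):
    """Raised when a feature file exceeds the preallocated capacity."""


def parse_strict(rows: list[str]) -> list[str]:
    """Treat any oversize file as an error instead of truncating silently."""
    if len(rows) > MAX_FEATURES:
        raise FeatureFileTooLarge(f"{len(rows)} features > limit {MAX_FEATURES}")
    return list(rows)


class FeatureStore:
    """Holds the active feature set and survives a bad reload."""

    def __init__(self, initial: list[str]) -> None:
        self.active = parse_strict(initial)  # a bad initial file still fails loudly
        self.rejected_reloads = 0            # counter an alert could be built on

    def reload(self, rows: list[str]) -> list[str]:
        try:
            self.active = parse_strict(rows)
        except FeatureFileTooLarge:
            # Fallback: keep the last known good file and record the rejection,
            # rather than letting the error take down request handling.
            self.rejected_reloads += 1
        return self.active


if __name__ == "__main__":
    store = FeatureStore([f"feature_{i}" for i in range(60)])
    bloated = [f"feature_{i}" for i in range(240)]
    print(len(store.reload(bloated)), store.rejected_reloads)
```

The rejection counter matters as much as the fallback: a system that silently keeps serving an old file has traded an outage for an observation failure unless something surfaces the rejected reload.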

The blast radius again included Cloudflare’s own products and downstream a large slice of the consumer internet: ChatGPT, X, Spotify, Canva, Discord. Matthew Prince called it the worst outage since 2019.

Figure 5, Six years apart, the same shape. A correct local change meets a buried assumption (worst case from interaction); a fast-global-propagation pipeline turns one bad artefact into a worldwide event in minutes (the amplifier); and concentration makes the blast radius the consumer internet.

Read against the three earlier incidents, the recurrence is almost eerie. It is worst case from interaction, in the 2019 sense: a query that behaved one way for years behaved differently under a correct change, on an input nobody had specified.

It is concentration, in the 2017 sense, and it landed only weeks after a major AWS us-east-1 outage on 20 October 2025. And the amplifier is the same amplifier as 2019: the fast global propagation path, the very mechanism that gives the system its responsiveness, is what turned one bad artefact into a worldwide event in minutes.

The deepest point is the one a casual reader misses. Cloudflare did learn the 2019 lesson; they migrated regex engines, added staged rollout, restored safeguards.

None of it prevented 2025, because the 2019 lesson was about regular expressions and the 2025 failure was about a database permission, a buried assumption, a hard-coded limit, and an unwrap that should have been a fallback. The two share no component. They share a structure.

You cannot patch a structure by patching the part that happened to express it last time.

The emergent forms, named

The incidents are distinct shapes of emergence. Cascade (2003): tight coupling, a fault propagating faster than operators can intervene; the reach is a property of the topology, not of any line or processor.

Concentration (2017, and again 2025): a structural property nobody designed, accumulated through years of independent decisions; the blast radius of a foundational service’s failure is a property of how many systems converged on it, and it grows silently because no organisation can see across all the adoptions.

Worst case from interaction (2019): regimes not exercised by the inputs the system was tested against, exposed by the inputs it actually sees; the failure is in no component’s specification because no specification covers all realistic inputs.

Figure 6 — Three shapes of emergence. In production they compose: concentration creates the conditions for cascade; a worst-case interaction triggers a cascade through tightly-coupled subsystems. November 2025 was all three at once.

The three compose in production. Concentration creates the conditions for cascade: when many systems share fate, a fault in the shared component propagates instantly.

Worst-case interactions trigger cascades through tightly-coupled subsystems: the regex spike that takes down the edge takes down the dashboard the operators need to push the kill switch.

The deeper claim is that the three failures of Parts I, II, and III are themselves emergent properties of the same kind of system:

composition gaps appear at interfaces nobody designed end to end;
observation failures appear because the system’s behaviour is dimensionally larger than any representation, with the dimensions that matter most during novel failures being exactly the ones the projection discarded;
control failures appear because the loops close on a plant whose dynamics are not contained in any single component.

The projection model

It is worth making the claim precise, because precision turns a metaphor into a tool.

Let S be the system as it actually is: the full state, every dependency, every cached value, every in-flight retry, every input it will ever see. S lives in an enormous, high-dimensional space, and no human or dashboard holds it.

What everyone works with is a representation R, obtained by a projection (call it pi) that maps the system into something small enough to fit in a diagram, a metrics store, or one engineer’s head.

Figure 7, The unifying picture. Every artefact an operator touches is a point in R. The system lives in S. In the steady state the two agree, because the system spends most of its time in a regime pi captures. An incident is precisely the event of S moving into the part of itself that pi threw away.
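A toy instance makes the projection tangible. In the sketch below, S is a dictionary of per-shard latencies and pi collapses it to a single fleet average; the shard names, the numbers, and the choice of a mean as the projection are all assumptions made for illustration. Two very different system states land on the same point in R.

```python
from statistics import mean


def pi(state: dict[str, float]) -> float:
    """Hypothetical projection: collapse per-shard latency (ms) to one average."""
    return round(mean(state.values()), 1)


# Two points in S: a healthy fleet, and one with a single shard in trouble.
steady = {"shard-a": 50.0, "shard-b": 50.0, "shard-c": 50.0, "shard-d": 50.0}
incident = {"shard-a": 10.0, "shard-b": 10.0, "shard-c": 10.0, "shard-d": 170.0}

if __name__ == "__main__":
    print("R (steady):  ", pi(steady))
    print("R (incident):", pi(incident))
    print("worst shard: ", max(steady.values()), "vs", max(incident.values()))
```

The dashboard built on this projection shows the same value in both cases. The dimension that distinguishes them, the worst shard, is exactly the one the projection threw away.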

The four failures are four ways pi betrays you. Composition is a failure at the seams of pi: each subsystem is built against its own local projection of its neighbours, and where two such projections meet, the assumptions need not agree (the 2025 query is a perfect specimen).

Observation is the claim that pi loses, preferentially, the dimensions that carry the most signal during a novel failure, because a dashboard is built from the dimensions that mattered in past incidents and a novel incident is by definition one whose decisive dimension was not salient before.

Control is the observation that the loop closes on R, not S: it can be perfectly stable on R and diverging on S, and the operator sees stability until the divergence reaches a dimension pi still tracks.

Emergence is the statement that the dimension of S vastly exceeds that of any R a human or tool can hold, and that no enrichment of R closes the gap, because every dimension you add to the dashboard expands the interaction surface of the system you are charting.

That is the formal residue of Perrow: adding observation is adding components, adding components is adding interactions, and the system you can fully observe is, for that reason, not the system you have.

The system has no architect

What follows, and what engineers trained on smaller systems underestimate, is that production distributed systems at scale have no architect. This is not about staffing. It is structural.

The system is the accumulated artefact of thousands of independent decisions by hundreds of engineers over years, most of whom have left, none charged with maintaining a coherent end-to-end model.

The architecture diagrams capture, at best, the model of one engineer, on one day, of the slice they were looking at. The whole system is not in any document because it is not in any individual’s head.

This is Conway’s Law, after Melvin Conway’s 1968 paper arguing that organisations design systems that mirror their communication structure. The deeper version is Daniel Dennett’s phrase: competence without comprehension.

The system serves traffic, accepts payments, delivers content, without any individual comprehending the full mechanism. The competence is real; the comprehension exists nowhere. There is a parallel from political economy.

Hayek’s 1945 paper argued against central planning on epistemic grounds: the knowledge to coordinate a complex economy does not exist in any single mind, but is distributed across millions of agents holding local, tacit knowledge that cannot be efficiently aggregated.

The argument transfers directly. Conway, Dennett, and Hayek point at the same fact from three lineages, which suggests it is real and not an artefact of one discipline.

This is hard for engineers to accept in proportion to how good they are. The instinct that produced their career, that one can read the code and understand the system, works up to about the size of a single service team.

Past that it stops, and the engineer who insists on retaining the model loses the ability to operate the system. The mature posture is not to know the system but to navigate it: Charity Majors describes moving from understanding systems to interrogating them; the resilience-engineering school calls it coping with complexity, the operator inside the system rather than above it.

In a complicated system you learn the system and then operate it. In a complex system you operate the system and learn what you can, knowing some of what you learn will be obsolete by the time you have learned it.

The competent on-call engineer at scale is therefore not the one who knows the most, but the one with the best discipline for forming hypotheses, sizing interventions, observing responses, and updating beliefs.

That discipline is what decides whether an incident lasts twenty minutes or twenty hours.

Drift into failure

There is a phenomenon, named by Sidney Dekker in Drift into Failure (2011) and rooted in Jens Rasmussen’s 1997 paper, that explains how complex systems reach the boundary of safe operation without any single decision being responsible.

A system operates in a state space bounded by three pressures, each a gradient. Economic pressure pushes toward higher throughput at lower cost. Workload pressure pushes toward simpler procedures and less manual intervention.

The third is the boundary of functionally acceptable performance, the safety boundary, beyond which catastrophic failure is statistically expected.

Figure 8: Rasmussen’s state space. Two boundaries (economic, workload) are quantified, visible, and rewarded. The third, safety, is unmarked and gives no warning at the crossing, because the crossing is statistical, not deterministic. Each locally-rational optimisation removes slack and nudges the operating point along the cost gradient.

The three pressures are not balanced. Economic and workload pressure act continuously and visibly: quantified in dashboards, named in reviews, the subject of every planning cycle. The safety boundary is invisible.

The system gives no warning at the crossing, because the crossing is statistical: a system past the boundary does not fail immediately, it fails with elevated probability per unit time, empirically indistinguishable from operating safely until it fails.

The result is a steady migration toward the safety boundary, driven by the visible pressures, the boundary unobserved until it is crossed.

Applied to production systems: every optimisation that reduces latency or cost is a step toward the boundary. Every timeout tightened for P99, every retry budget cut, every cache TTL shortened, every connection pool sized closer to peak.

Each is locally rational, and the cumulative effect is to remove slack. Slack absorbs perturbations; a system with no slack is tightly coupled, the condition under which interactive complexity becomes accidents. The drift is not a decision to operate unsafely.

It is many decisions to operate slightly more efficiently, and at no point did anyone authorise the trajectory.

Per Bak’s self-organized criticality (1987) showed that systems under continuous driving organise themselves to a critical point, where a small perturbation can produce an avalanche of any size, with sizes following a power law (the two-dimensional exponent is commonly quoted near 1.2, though it is non-universal and contested; the heavy tail is robust regardless).

Carlson and Doyle’s Highly Optimized Tolerance (1999) is more directly applicable: systems explicitly optimised for robustness against expected disturbances become fragile against unexpected ones, producing power-law failures without any external driving.

They are critical because they were engineered to be, the optimisation having consumed every margin against the perturbations the optimiser did not anticipate.

Figure 9. Why “that could never happen at our scale” is a category error. Optimised systems produce power-law failure sizes, not the thin-tailed distribution intuition assumes. The catastrophic outage is not an outlier off the curve. It is a point on the curve, in the tail the optimisation built.

The implication is uncomfortable. The reliability of a mature system is not a function of how careful its operators are.

It is a function of how much slack they have preserved against the cost pressure that would remove it.

A team that fights to keep slack, holding redundant capacity, leaving timeouts loose, refusing to push retry budgets to the minimum, is doing the most important reliability work, and it is the work least visible in the metrics.

This is also why incidents cluster: with margin, small perturbations are absorbed; at the boundary, the same perturbations cascade.

Incident frequency is perturbation frequency times the probability the system is currently at the boundary, and that probability rises with drift, and drift is monotonic without deliberate effort against it.

The system gets less safe over time even as no one decides to make it less safe.
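The relationship between drift and incident frequency can be sketched numerically. The model below is an assumption layered on the argument, not something measured: perturbation sizes are taken to follow a Pareto tail, an incident is any perturbation larger than the remaining slack, and the tail exponent `alpha` and the rate of 100 perturbations per month are arbitrary illustrative values.

```python
def exceedance_probability(slack: float, x_min: float = 1.0, alpha: float = 1.5) -> float:
    """P(perturbation > slack) under an assumed Pareto tail with scale x_min."""
    if slack <= x_min:
        return 1.0
    return (x_min / slack) ** alpha


def incident_rate(perturbations_per_month: float, slack: float) -> float:
    """Incident frequency = perturbation frequency x P(perturbation exceeds slack)."""
    return perturbations_per_month * exceedance_probability(slack)


if __name__ == "__main__":
    # Each optimisation removes some slack; the perturbation rate never changes.
    for slack in (50.0, 20.0, 10.0, 5.0):
        print(f"slack={slack:5.1f}  incidents/month={incident_rate(100, slack):.2f}")
```

Nothing about the environment changes between rows. The perturbation rate is constant; only the slack shrinks, and the incident rate rises with it.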

Drift is the boundary approaching. Cascade is what happens when the boundary is crossed.

What operations looks like under emergence

The practices that work here are the post-Perrow tradition of resilience and site-reliability engineering, tied together by the recognition that no model is reliable and that operations must proceed by probe-and-respond.

Chaos engineering, deliberately injecting failures into production, rests on the premise that behaviour under failure cannot be predicted from the components but only discovered by observation, so the system must be perturbed deliberately, with bounded blast radius and operators present.

Error budgets make Rasmussen’s invisible boundary visible in the same units as the economic pressure on the other side: define a service-level objective, track consumed unavailability as a budget, and slow releases when it is exhausted.

The arithmetic is sobering:

99.9 percent availability allows about 43.8 minutes of downtime per month.
99.95 percent allows about 21.9 minutes; a single cascade exhausts it.
99.99 percent allows about 4.38 minutes; a 27-minute Cloudflare-style event blows two quarters of budget.
99.999 percent allows about 26 seconds per month, essentially no human-in-the-loop budget at all.

The budget is the boundary, denominated in minutes, which is why high-availability targets and aggressive release velocity are in genuine, not rhetorical, tension.
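The figures above follow from one line of arithmetic, reproduced here as a small sketch. It assumes an average month of 365.25 / 12 days; the function names are invented for this example.

```python
MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60  # average month: 43,830 minutes


def monthly_budget_minutes(slo_percent: float) -> float:
    """Allowed downtime per average month for a given availability target."""
    return MINUTES_PER_MONTH * (100.0 - slo_percent) / 100.0


def months_of_budget(outage_minutes: float, slo_percent: float) -> float:
    """How many months of error budget a single outage consumes."""
    return outage_minutes / monthly_budget_minutes(slo_percent)


if __name__ == "__main__":
    for slo in (99.9, 99.95, 99.99, 99.999):
        print(f"{slo}%: {monthly_budget_minutes(slo):.2f} min/month")
    print(f"27-minute outage at 99.99%: {months_of_budget(27, 99.99):.1f} months")
```

A 27-minute outage against a 99.99 percent target consumes a little over six months of budget, which is the "two quarters" in the list.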

Blameless post-mortems are a technical practice as much as a cultural one: the information value of a post-mortem is proportional to the accuracy of the reporting, and accuracy is proportional to the safety the reporter feels.

Ron Westrum’s typology (pathological, bureaucratic, generative) formalises this: only generative cultures, where bad news is welcomed because it is operationally valuable, produce the post-mortems complex systems need.

Game days and failure injection are variants of one principle: behaviour under stress cannot be modelled, it must be observed, and the observation must be staged before the real failure, because in the real failure operators cannot slow down to learn.

What unites these is the acceptance of irreducibility: the model is incomplete by construction and is updated by observation under controlled perturbation, not by deduction from specification. Operations becomes an empirical discipline, closer to experimental science than to mechanical engineering.

The practitioners who absorb this invest in observability (not because dashboards reveal the truth, but because they are the only handle on a system you cannot model), in chaos engineering (because the alternative is discovering breakages during real incidents), and in slack (because they have read enough post-mortems to know which systems fail catastrophically).

The ones who have not make the opposite choices and look reasonable doing it: eliminate slack to cut cost, reduce observability to cut noise, avoid chaos engineering because it occasionally causes outages.

They discover, eventually, that they have built a system both more efficient and more catastrophic when it fails, in proportion to the savings.

Operators as the homeostatic mechanism

There is a way to name what operators are, structurally, that systems biology names directly.

In an organism, homeostasis maintains internal state against perturbation: temperature, glucose, pressure, pH within a fraction of a unit despite continuous challenge.

The mechanisms are not centralised; they are distributed across hundreds of overlapping negative-feedback loops, none responsible for the overall stability.

No part of the body is in charge of being alive; being alive is what the parts, in aggregate, are doing.

The operator function is the homeostatic mechanism in this precise sense. Operators are the negative-feedback loop that prevents Rasmussen drift from becoming Rasmussen crossing, the slack reintroduced when cost-cutting removes it elsewhere, the compensating force against every gradient that would otherwise push the system over the boundary.

The system is stable not because the engineering is good but because operators continuously absorb the perturbations the engineering does not address.

The on-call engineer reverting a deployment at 03:47 UTC is not interrupting normal operation. They are participating in it. The reverting is what the system is doing, through them, to keep itself alive.

The systems do not work. The operators work, and the systems usually fail only when the operators lose the ability to see, infer, intervene, or comprehend.

This is structural and testable. Remove the operators and the system enters its statistically expected failure regime within hours to days.

With competent operators present, it runs, degraded but functional, for years against perturbations that would individually crash it.

The competence is in the loop, not in the system. Yet organisations reward the visible artefacts (clean architectures, well-factored code, comprehensive tests, capacity plans) and underinvest in the invisible work of compensating, absorbing, and quietly maintaining slack, treating it as a cost centre rather than the production function it is.

The engineering does not produce reliability. Reliability is the output of the engineering acted on by the operators, in a loop the organisation rarely acknowledges.

This also explains why, beyond a threshold, more automation worsens outcomes. Lisanne Bainbridge’s 1983 paper “Ironies of Automation” catalogued it: automating the routine cases removes the operators’ opportunity to maintain situational awareness, the very faculty they need when the automation fails on a case it was not designed for.

The gains accrue early and visibly; the costs accrue late and on the days that matter most.

The discipline that follows is to treat operators as the load-bearing element, not the engineering: build the observability they need to interrogate the system, build the chaos practice they need to keep their model current, and do not automate them out of the loop; automate the routine inside their loop, so they keep the situational awareness they will need when the automation fails.

Closing the series

Four essays. Four structural failures.

Composition: distributed systems fail in the spaces between their components, because no individual designed both sides of the interfaces.

Observation: the system you see is a delayed, partial, aggregated projection, and the moments when projection and system disagree are the moments that matter most.

Control: the loops you close on the projection cannot stabilise the system, because the plant moves faster than the controller can sample, the actuator is in the blast radius, or the feedback path passes through the failure.

Emergence: the system is bigger than any of these admits. It has no architect. It sits at the boundary by construction, because the pressure that makes it efficient drives it there, and its failure modes are properties of the whole that no model in any individual head can predict.

These are not exceptional. They are the structural conditions under which production systems at scale operate.

The systems do not work because they were engineered to work; they work because the operators, from the SRE on call at 03:47 UTC to the architect drawing dependency graphs in a Confluence document nobody updates, are continuously compensating for the gaps the engineering left open.

What this means for the engineer is twofold.

The technical: the practices that increase survivability (slack, observability, chaos engineering, error budgets, blameless post-mortems, out-of-band control, recovery-dependency mapping, drift monitoring) are expensive, and they are what separates systems that fail gracefully from systems that fail catastrophically.

The epistemic: the appropriate posture is structural humility, knowing the system is bigger than the model, the model is incomplete by construction, and any incident might be the one that reveals a regime nobody had characterised.

The pager will go off again. The dashboards will be lying again. The interventions will assume conditions that have ceased to hold. The system will have drifted, since the last incident, slightly closer to the boundary.

The work is the same work, performed by the same kind of people, against systems that none of them designed and none of them fully understand.

The interesting thing is that this works at all. The accidents are kept rare by humans doing a job whose structural conditions make it nearly impossible, and doing it well enough that the rest of the profession can pretend the systems work on their own. They do not. They never have.

The serious engineer’s contribution, past a certain seniority, is to understand this and act accordingly: to design systems that respect what the operators have to do, build the tools that make their job possible, and write the practices down so the next generation does not learn them from incidents.

The job is hard. It is also the most important job in the engineering function. The systems do not work without it. They never will.

One more thing…

I wrote a deep CUDA guide from exactly this perspective: not isolated tricks, but how to reason about the GPU as a coupled dynamical system whose performance regimes and failure modes (occupancy collapse, memory-bandwidth thrashing, warp divergence, pipeline stalls, register spilling, bank conflicts, tensor-core underutilisation) are structurally the same kinds of seam failures the four essays of this series have described.

If the framework resonates, the guide is where it lands in code.

[Read the CUDA Guide on Gumroad → CUDA Mastery]

This is the final essay in the series How Systems Really Fail. The four parts are best read in order, but each can stand alone.

The Llama 3.3 70B Benchmark Problem

Lorenzo Bradanini — Thu, 28 May 2026 22:52:32 GMT

Introduction

How Systems Really Fail, Part III

Lorenzo Bradanini — Wed, 27 May 2026 21:21:31 GMT

Intro

The first essay argued that distributed systems fail in the spaces between their components, and that those spaces are structurally opaque. The second argued that the system you observe is not the system that exists, that aggregation destroys signal, and that the operator’s dashboard is a delayed, partial, instrumented projection of a system that has already moved on.

This one is about what happens next.

Once you have accepted that the system is opaque and that your view of it is incomplete, you still have to act. The pager has gone off. The error rate has climbed from a green 0.02% to a red 12%. Customers are tweeting. Your manager is on the call.

The runbook has three pages and none of them describe this. You have ninety seconds before the next escalation tier joins, and you have to decide whether to roll back, fail over, shed load, drain a region, or do nothing and let the system find its own equilibrium.

This essay is about that decision. Not the politics of incident response, not the cultural question of blameless post-mortems, but the structural problem underneath: you are closing a control loop on a system whose state you cannot fully observe, whose composition you do not fully control, and whose response to your inputs is, in the regime where you most need to act, nonlinear, delayed, and frequently the opposite of what you expected.

Classical control theory has names for all of this. The combination is called control under uncertainty, and the bounds it places on what an operator can achieve are not soft. They are mathematical.

They are the reason a competent on-call engineer with a complete runbook and a working dashboard can still make an outage worse, not through error, but by executing the textbook intervention against a system that has, by the time the intervention lands, already entered a regime where the textbook does not apply.

Three incidents, mechanically reconstructed: Knight Capital’s forty-five-minute, four-hundred-and-forty-million-dollar loss in 2012, where a human control loop sampling at the speed of decision could not stabilise a software loop running at the speed of order entry; the Facebook BGP withdrawal of October 2021, where the control plane that needed to repair the network had been routed through the network it had just withdrawn from; and the AWS Kinesis outage of November 2020, where the remediation that would have ended the failure could not proceed because it depended on the very subsystem the failure had taken down.

The pattern beneath all three is the same. The system entered a regime in which the available control inputs were either too slow, structurally unable to reach the failing component, or themselves dependent on the failure being already fixed.

The operators were not negligent. They were operating inside a loop whose closure conditions had been silently violated, and the loop did what control loops do when their closure conditions fail: it stopped controlling.

The interesting question, again, is why this is structural.

The control loop, mechanically

A control loop is a four-stage cycle: a plant whose state evolves over time, a sensor producing measurements, a controller computing corrective inputs against a setpoint, an actuator applying those inputs.

The plant’s new state is measured, and the cycle repeats.

In a textbook loop, all four stages are coupled tightly enough that the system can be analysed as a single dynamical system. The classical results (Nyquist stability, Bode gain and phase margins, Lyapunov functions, Kalman observers) assume this coupling.

Given a plant with known dynamics, a sensor with bounded noise, a controller with a known transfer function, and an actuator with bounded authority, they tell you whether the closed loop is stable and how it responds to disturbances.

A production distributed system violates every one of these assumptions.

The plant is not one system; it is a composition of subsystems each with their own dynamics, coupled through interfaces that hide most of the relevant state.

The sensor is the observability pipeline of Part II, with tens of seconds of phase lag and aggregation that destroys precisely the signal the controller needs.

The controller is split across at least three actors at different sampling rates: automated systems (autoscalers, load balancers, schedulers) running at the speed of metric collection; on-call humans running at the speed of cognition under stress; incident commanders running slower still.

The actuator is whatever combination of API calls, configuration pushes, deployment rollbacks, and SSH sessions the operator can bring to bear, each with its own latency, blast radius, and probability of producing the opposite of what was intended.

The result is a control loop whose stability margin is set by the slowest, noisiest, most delayed component in the chain. In the steady state, this is fine: the loop has plenty of margin and corrections are small. In an incident, the margin evaporates, and the loop’s behaviour is determined by parts of the system no one had thought to characterise.
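How delay alone erodes the margin can be seen in a toy simulation. The sketch below is not a model of any real system: it assumes the simplest possible plant (an integrator, such as a queue depth that accumulates whatever the controller adds or removes), a proportional controller, and a sensor that reports the state `delay_steps` samples late. The gain of 0.5 and the delay of 5 are illustrative choices.

```python
def simulate(gain: float, delay_steps: int, steps: int = 200) -> list[float]:
    """Integrator plant x[k+1] = x[k] + u[k] with u[k] = -gain * x[k - delay_steps].

    The controller tries to drive the error x to zero but only ever sees
    a measurement that is delay_steps samples old.
    """
    x = [1.0]  # start with an error of one unit
    for k in range(steps):
        observed = x[max(k - delay_steps, 0)]  # stale sensor reading
        u = -gain * observed                   # proportional correction
        x.append(x[k] + u)
    return x


if __name__ == "__main__":
    fresh = simulate(gain=0.5, delay_steps=0)
    stale = simulate(gain=0.5, delay_steps=5)
    print("no delay, final error:", abs(fresh[-1]))
    print("5-step delay, peak error in last 20 steps:",
          max(abs(v) for v in stale[-20:]))
```

The controller and the plant are identical in both runs. With a fresh measurement the error halves every step; with a five-sample lag the same gain overcorrects against a state that no longer exists, and the oscillation grows without bound.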

Control theory has precise names for the two properties at stake. The formal definitions are Kalman’s (1960): a system is observable if its internal state can be reconstructed from a finite history of outputs; controllable if any state can be reached from any other state in finite time by an admissible input sequence.
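For the linear time-invariant special case, these definitions reduce to rank tests. The block below states the standard textbook conditions; the symbols $A$, $B$, $C$, and $n$ are notation introduced here, not taken from the essay. Production systems are neither linear nor time-invariant, so this is a reference point for the definitions rather than a test anyone can run on a real fleet.

```latex
x_{k+1} = A x_k + B u_k, \qquad y_k = C x_k, \qquad x_k \in \mathbb{R}^n

\text{observable} \iff
\operatorname{rank}
\begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix} = n

\text{controllable} \iff
\operatorname{rank}
\begin{bmatrix} B & AB & \cdots & A^{n-1}B \end{bmatrix} = n
```

Read informally: observability fails when some direction of the state never shows up in the outputs, and controllability fails when some direction of the state can never be pushed by any input.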

In a healthy production system, both hold approximately. In an incident, one or both fails. Observability fails when the failure mode is invisible to the metrics, as in Slack’s autoscaler chasing CPU while threads waited on a degraded network. Controllability fails when the action that would fix the problem is no longer reachable, as we are about to see in three different forms.

When both fail at once, the operator is, in the precise technical sense, no longer controlling the system. They are watching it. Interventions may correlate with eventual recovery, but the causal chain from action to outcome has been severed.

When the human loop is too slow: Knight Capital, 1 August 2012

The canonical case for control-loop timescale mismatch is not a distributed-systems outage in the conventional sense.

It is a financial one, and the reason it belongs here is that it isolates, more cleanly than any web-scale incident, what happens when the human control loop runs orders of magnitude slower than the software loop it is supposed to govern.

Knight Capital Americas was, on the morning of 1 August 2012, the largest U.S. retail market-maker, with a market share of roughly 17% in NASDAQ-listed and 16% in NYSE-listed stocks. Its core function was posting bid and ask quotes on thousands of stocks, capturing the spread, and managing inventory.

The platform that did this was SMARS, the Smart Market Access Routing System, running for over a decade.

On 31 July, NYSE was about to launch the Retail Liquidity Program (RLP). Knight had updated SMARS to support RLP order types.

The update repurposed a flag that had once activated Power Peg, a piece of code dormant since 2003: an old test algorithm, originally designed to buy high and sell low in order to exercise other trading algorithms in a controlled environment, that Knight had stopped using years earlier but never removed from production.

In 2005, a separate refactor had moved the cumulative-quantity counter (the routine that tracked how many shares of a parent order had been filled and was responsible for stopping further child orders once an order was complete) to an earlier point in the SMARS workflow.

The move disconnected the counter from Power Peg, and Knight never retested Power Peg afterwards. In the new RLP code, the flag’s meaning was repointed at the RLP handler. The deployment was rolled out manually to eight production servers between 27 July and 1 August. Seven of them received the new code. One did not.

At 9:30 AM Eastern, the U.S. equities market opened. Parent orders flowed into SMARS to be split into child orders and sent to the exchanges. On the seven correctly-deployed servers, child orders were generated, sent to NYSE, and matched against the RLP.

On the eighth, the repurposed flag was being set on incoming RLP-eligible orders, but the code interpreting it was still the old Power Peg algorithm, now without the cumulative-quantity counter that would have throttled it.

Each parent order on the eighth server generated child orders continuously, with no signal back from the fill-confirmation path to indicate the order had been satisfied.
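
The role of the cumulative-quantity counter can be made concrete with a toy model. This is a minimal sketch, not Knight's code: the function name, the parameters, and the max_children cap are all hypothetical, and the cap exists only so the example terminates.

```python
def route_parent_order(parent_qty, child_qty, counter_connected, max_children=1000):
    """Split a parent order into child orders; return how many children were sent.

    counter_connected=True models a router that tallies fills and stops once
    the parent quantity is reached. False models the disconnected counter:
    fills still happen in the market, but the loop never sees them.
    max_children is an artificial cap (an assumption of this sketch) so the
    example terminates.
    """
    filled_seen_by_router = 0
    children_sent = 0
    while filled_seen_by_router < parent_qty and children_sent < max_children:
        children_sent += 1                      # send one child order
        if counter_connected:
            filled_seen_by_router += child_qty  # fill confirmation reaches the loop
    return children_sent


healthy = route_parent_order(parent_qty=1_000, child_qty=100, counter_connected=True)
runaway = route_parent_order(parent_qty=1_000, child_qty=100, counter_connected=False)
print(healthy, runaway)  # 10 1000
```

With the counter connected, the loop stops after ten children. Without it, the only thing that stops the loop is the artificial cap.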

Over the next forty-five minutes, the eighth server sent more than four million orders into the market in response to 212 customer orders, executing across 154 symbols and ultimately moving 397 million shares. It bought at the offer and sold at the bid hundreds of times per second. Each round trip lost the spread.

By contemporary reporting, Knight’s losses accumulated at roughly $10M per minute.

From the perspective of every component except the broken one, the system was behaving correctly. The exchanges were filling the orders. The risk system was receiving the fills. The position-keeping system was updating.

Knight’s internal monitoring had generated 97 emails containing “Power Peg disabled” between 8:01 and 8:24 AM Eastern, before the market opened, but these were not designed as alerts and no one acted on them.

What did not happen, for forty-five minutes, was anyone stopping the eighth server.

The reasons map cleanly onto the structure of the control loop. After roughly twenty minutes of diagnosis without documented incident-response procedures, engineers reached the conclusion that the issue lay in the new code and reverted SMARS to its previous version on all eight servers.

This was the opposite of the correct action: the previous version was the one in which the Power Peg flag still activated the broken Power Peg path. The rollback propagated the failure contained on one server onto all of them.

Eventually the call was made to halt SMARS entirely. By the time the system was actually stopped, at approximately 10:15 AM, Knight had taken positions of approximately $6.65B (net long $3.5B in 80 stocks, net short $3.15B in 74).

Once unwound, the realised loss was reported by Knight at ~$440M; the SEC’s enforcement order placed the figure above $460M.

The firm did not survive in its prior form. By mid-December, less than five months later, Knight had agreed to a merger with Getco; the deal closed in July 2013, and the combined entity (KCG Holdings) was itself acquired by Virtu in 2017.

The point is the loop. SMARS was running an automated control loop generating orders at machine speed, executing against the market, receiving fills, generating more orders.

The human control loop above it, monitoring positions, raising alerts, halting the system on threshold breaches, was nominally coupled to SMARS through dashboards and risk limits.

In an incident, they decoupled. The position-monitoring metric had a collection interval on the order of a minute. The decision cycle for incident response was five to ten minutes per hypothesis-test iteration.

The order-generation cycle was milliseconds. The two loops differed by roughly four orders of magnitude. The faster loop accumulated four hundred million dollars of damage in the time the slower one ran three diagnostic iterations.

This is the structural form. Nyquist’s sampling argument applies with full force: a control loop sampling at interval $T$ cannot resolve disturbances whose period is shorter than $2T$. Knight’s human loop sampled at minutes; the plant’s disturbances played out in milliseconds. The loop was, by sampling theory, blind to its own plant.

The crisis simply could not be controlled by the available control structure, regardless of operator competence.
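
The size of the mismatch is easy to check with arithmetic. The sketch below assumes a one-minute sampling interval for the human loop and a 5 ms order cycle for the plant; both figures are illustrative round numbers chosen to match the "minutes versus milliseconds" framing, not measurements from the incident.

```python
import math

def orders_of_magnitude(slow_period_s, fast_period_s):
    """Base-10 gap between two loop periods."""
    return math.log10(slow_period_s / fast_period_s)

def shortest_resolvable_period(sample_interval_s):
    """Nyquist: sampling every T seconds resolves periods no shorter than 2T."""
    return 2.0 * sample_interval_s

human_sample_s = 60.0    # assumed: one position reading per minute
plant_period_s = 0.005   # assumed: one order cycle every 5 ms

gap = orders_of_magnitude(human_sample_s, plant_period_s)
limit = shortest_resolvable_period(human_sample_s)
print(f"gap: {gap:.1f} orders of magnitude")          # gap: 4.1 orders of magnitude
print(f"shortest resolvable period: {limit:.0f} s")   # shortest resolvable period: 120 s
print("plant visible to loop:", plant_period_s >= limit)  # plant visible to loop: False
```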

The lesson the industry encoded after Knight was not that humans should react faster (they obviously cannot) but that any control loop running at machine speed must have a kill switch at machine speed: pre-trade risk checks in the order path, position limits enforced before order submission, circuit breakers triggered on order velocity.

The slow human loop sits above all of this and decides when to re-enable after the automated kill. It does not, anymore, try to be the kill itself.

The principle generalises. Any system whose failure mode propagates faster than the slowest control loop authorised to stop it is, in the precise technical sense, uncontrollable along that axis.

The mitigation is not faster humans; it is a fast-enough automated cutoff with a slow-enough human override.

This is the operational meaning of what Marc Brooker calls autonomic behaviour: the component must be capable of saving itself, on millisecond timescales, against failures the human loop is structurally too slow to address.
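
A minimal sketch of one such cutoff, an order-velocity circuit breaker, is shown here. The class name, the thresholds, and the sliding-window design are assumptions for illustration, not a description of any exchange's or broker's actual risk controls.

```python
from collections import deque

class OrderVelocityBreaker:
    """Trips when more than max_orders arrive within window_s seconds.

    Once tripped it stays open until a human calls reset(): the automated
    loop does the fast cutoff, the human loop does the slow re-enable.
    """

    def __init__(self, max_orders, window_s):
        self.max_orders = max_orders
        self.window_s = window_s
        self.timestamps = deque()
        self.tripped = False

    def allow(self, now_s):
        """Return True if an order at time now_s may be sent."""
        if self.tripped:
            return False
        # Drop timestamps that have aged out of the sliding window.
        while self.timestamps and now_s - self.timestamps[0] >= self.window_s:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_orders:
            self.tripped = True
            return False
        self.timestamps.append(now_s)
        return True

    def reset(self):
        """Human override: re-enable after investigation."""
        self.timestamps.clear()
        self.tripped = False


breaker = OrderVelocityBreaker(max_orders=100, window_s=1.0)
# Simulate a runaway sender: one order every millisecond for one second.
sent = sum(breaker.allow(i * 0.001) for i in range(1000))
print(sent, breaker.tripped)  # 100 True
```

The check sits in the order path itself, so it runs at the same speed as the thing it guards. The human decision is confined to reset().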

When the control plane cannot reach itself: Facebook, 4 October 2021

The second structural form of control-loop failure is reachability. The diagnosis is correct, the action is well-understood, the human loop is fast enough, but the action cannot be applied because the path from controller to actuator runs through the system that has failed.

At 15:39 UTC on 4 October 2021, an engineer at what was then still called Facebook executed a routine maintenance command intended to assess backbone capacity. The command was issued through an audit tool whose job was to reject any change that would take too much of the backbone offline at once.

A bug in the audit tool failed to catch this one. The command withdrew the BGP advertisements for every prefix Facebook announced to the rest of the Internet.

BGP is the inter-domain routing protocol that lets autonomous systems tell the rest of the Internet which prefixes they own and how to reach them.

When Facebook stopped announcing its prefixes, BGP speakers across the Internet applied standard route-withdrawal semantics and, within seconds, removed Facebook’s routes from their forwarding tables.

By Cloudflare’s measurements, public resolvers’ cached records for facebook.com had expired by 15:50 UTC. From the outside, Facebook ceased to exist.

This is, on its own, a recoverable outage. Re-announcing the prefixes is a single configuration push. The question was whether engineers could get that push to the routers.

They could not.

The configuration management system ran on Facebook’s internal network. Facebook’s authoritative DNS servers, hosted at smaller facilities, had a safety rule: if they could not reach the main data centres, they treated themselves as unhealthy and withdrew their own BGP advertisements.

When the backbone went down, every DNS server independently concluded that it was isolated and pulled its routes. Facebook’s DNS therefore disappeared from the public Internet as a second-order consequence of the backbone failure.

And it disappeared from the inside as well, because the same authoritative DNS resolved the hostnames of the internal tools engineers would have used to undo the change.

It got worse. Many of the internal tools and services that engineers would have used to coordinate the response, including parts of Facebook’s authentication and communication infrastructure, also depended on the broken backbone or on the now-unreachable DNS.

Engineers reportedly could not log into internal tools; conference rooms whose locks were on the same network would not open; routine communication channels among responders failed.

It got worse again. Physical access to the data centres was gated by a card-access system whose backend ran on the same internal network. Engineers attempting to physically enter buildings or reach server cages directly found their badges no longer opened the doors.

Press reports during the incident described engineers using an industrial angle grinder to cut through a server-cage bar at the Santa Clara data centre; Facebook later disputed the specifics, acknowledging only that “some physical barriers had to be worked around.”

Either way, a team had to be physically dispatched to a data centre to restore service.

The total outage was approximately six hours; BGP advertisements resumed shortly before 21:00 UTC. The technical fix could have been completed in minutes if it had been reachable.

The duration was determined by the time required to physically reach a console inside the same dependency loop as the failure, restore enough of the internal network to allow remote actions, and only then perform the fix that was, in itself, trivial.

This is the second structural form: the action that would resolve the failure lies in the unreachable set induced by the failure itself.

Control theory has a name for the dual notion: a state that cannot be reached from the current state by any admissible input is unreachable from that state. The network-engineering name is more vivid: the control plane was in-band.

The configuration changes that would repair the data plane had to travel through the data plane.

The principle Facebook subsequently invested in is out-of-band control. The control plane must reach its actuators by a path that does not depend on the system being controlled.

In dependency-graph terms, the directed graph of “X depends on Y to function” must contain no cycle that passes through the control surfaces of the production system.

If it does, there exists a failure mode in which those surfaces are no longer accessible, and the system can only be recovered by an out-of-band action: physical access, a separate management network, or a kept-current break-glass procedure that shares no infrastructure with normal operations.
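
One way to audit for this is to compute, for a given failure, which control surfaces survive it. The sketch below uses a hypothetical, heavily simplified dependency map loosely shaped like the incident above; the component names are illustrative, not Facebook's actual topology.

```python
def unavailable_after(failed, depends_on):
    """Return every component that stops working when `failed` goes down.

    depends_on maps each component to the components it needs to function.
    A component is lost if it failed or if anything it needs is lost.
    """
    lost = {failed}
    changed = True
    while changed:
        changed = False
        for component, needs in depends_on.items():
            if component not in lost and any(n in lost for n in needs):
                lost.add(component)
                changed = True
    return lost


# Hypothetical dependency map (an assumption of this sketch).
depends_on = {
    "backbone": [],
    "internal_dns": ["backbone"],
    "config_push": ["internal_dns", "backbone"],
    "badge_access": ["backbone"],
    "oob_console": [],  # separate management network, no shared dependencies
}
control_surfaces = {"config_push", "badge_access", "oob_console"}

lost = unavailable_after("backbone", depends_on)
still_reachable = control_surfaces - lost
print(sorted(still_reachable))  # ['oob_console']
```

If still_reachable comes back empty for any plausible failure, the control plane is in-band for that failure.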

The cost of maintaining true out-of-band control is non-trivial: a second network, separately operated, credentialed, monitored, exercised.

The path of least resistance is always to let the control plane drift back in-band, because in-band is cheaper, easier to operate, and works fine until the day it does not.

There is a related principle from safety-critical systems, sometimes called recovery independence: any component whose failure can render the system inoperable must have a recovery path that does not require that component to be operating.

NASA flight rules have a version of this. Nuclear plant operating procedures have a version. Most production software systems do not.

When recovery depends on the failure being fixed: AWS Kinesis, 25 November 2020

The third structural form is the case where the path to recovery passes through the failure itself.

On 25 November 2020, the Wednesday before American Thanksgiving, AWS engineers added capacity to the Amazon Kinesis Data Streams front-end fleet in us-east-1 between 02:44 and 03:47 PST.

The first customer-impacting alarms fired at approximately 05:15 PST, with Kinesis error rates climbing through the morning. Kinesis is AWS’s high-throughput event ingestion service; it underpins CloudWatch metrics, AWS Lambda’s logging path, Cognito’s analytics path, and a large catalogue of downstream services.

When Kinesis became unhealthy in us-east-1, a substantial fraction of AWS itself became unhealthy with it.

The Kinesis front-end fleet handles authentication, throttling, and request routing to the appropriate back-end clusters that own the actual stream shards.

Each front-end server maintains in memory a shard-map: a cache containing membership data and shard ownership for the back-end clusters. To populate this cache, each front-end server creates an OS thread per peer in the front-end fleet, and exchanges shard information over those threads.

As AWS noted in its post-mortem, fully learning about a newly added fleet member can take up to an hour. The new capacity pushed the per-server thread count past a configured OS limit. When this limit was reached, front-end servers could not create the additional threads needed to complete the shard-map cache.

Cache construction failed, leaving servers with “useless shard-maps” (AWS’s phrase) that prevented them from routing requests to the correct back-end clusters. Errors began propagating to downstream callers, and the failure spread across the fleet as more servers crossed the threshold.
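
The scaling problem is visible in a few lines of arithmetic. AWS did not publish the thread limit or the fleet size, so every number below is hypothetical; the sketch only shows why a thread-per-peer design makes all servers cross the limit at the same fleet size.

```python
def threads_needed(fleet_size, baseline_threads):
    """Threads per front-end server: one per peer plus everything else."""
    return (fleet_size - 1) + baseline_threads

def max_safe_fleet(thread_limit, baseline_threads):
    """Largest fleet for which threads_needed() stays within the limit."""
    return thread_limit - baseline_threads + 1

THREAD_LIMIT = 10_000  # hypothetical OS thread limit
BASELINE = 500         # hypothetical non-peer threads per server

for fleet in (9_000, 9_501, 9_502):
    need = threads_needed(fleet, BASELINE)
    print(fleet, need, "ok" if need <= THREAD_LIMIT else "over limit")
# 9000 9499 ok
# 9501 10000 ok
# 9502 10001 over limit
```

Because the per-server count depends only on fleet size, adding one server past the threshold puts every server over the limit, not just the new one.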

The remediation was familiar: stop the scaling, remove the additional capacity, and restart the fleet. The constraint was that on coming back up, each front-end server had to rebuild its shard-map by communicating with every other front-end server, and the resources needed to populate the cache competed with the resources needed to serve requests.

AWS could only bring servers back in small groups, a few hundred per hour, verifying stability between batches. The first servers re-entered traffic at 10:07 AM PST; Kinesis fully returned to normal at 10:23 PM PST: roughly 17 hours after the first alarms.

The duration of the Kinesis impairment is not the most interesting part. The most interesting part is what was failing while Kinesis was failing.

CloudWatch ingested metric data via Kinesis. With Kinesis impaired, its ability to ingest fresh metrics was degraded, so the dashboards customers and AWS engineers used to monitor their systems went dark or stale at exactly the moment they were needed.

Lambda invocations require publishing metric data to CloudWatch as part of the invocation; as CloudWatch metrics degraded, Lambda’s local metric agents exhausted their buffers and invocations began to fail.

Cognito uses Kinesis Data Streams to collect and analyze API access patterns; the path is documented as best-effort, with web servers buffering locally, but as the impairment dragged on, Cognito web servers exhausted those buffers and customer authentication started failing.

AWS’s own Service Health Dashboard, which would normally have communicated the outage to customers, was itself impaired in its ability to post updates.

This is the structural form: the path to recovery passed through the failure. The on-call engineers responding to the Kinesis outage needed monitoring to verify their interventions were working, and the monitoring depended on Kinesis.

Customers needed status updates to understand what was happening, and the status page depended on Kinesis. In dependency-graph terms, the recovery dependency graph contained a back-edge: a cycle in which X depends on Y to recover, and Y depends on X to function.

AWS documented three mitigations: moving to larger CPU and memory servers (so the fleet needs fewer machines and therefore fewer per-server threads), accelerating the cellularisation of the front-end fleet so any single instance no longer needs a thread per peer across the whole fleet, and separating large internal consumers like CloudWatch onto their own partitioned front-end fleets.

The architectural changes are the substantive ones. Larger servers patch the specific trigger; cellularisation and partitioning defend against the general class of failure in which fleet-wide state synchronisation scales worse than the fleet itself.

The general principle, restated for the third time:

The path from a failed state back to a healthy state must not depend on the failed component being healthy.

This is harder to enforce than it sounds. Almost every production system has some path of this shape, somewhere. The deployment system that pushes the fix often depends on the very services the fix is repairing. The monitoring that verifies the fix worked depends on the metrics pipeline.

The communication tools the on-call team uses to coordinate the response depend on internal services that may be in the blast radius of the failure. The discipline is to identify these back-edges and either break them or document the manual workaround.

The recovery dependency graph is distinct from the runtime dependency graph, and almost always more pessimistic.

A system can have a clean runtime graph and a deeply cyclic recovery graph, because the runtime graph captures what depends on what during normal operation, where slow paths, retry loops, fallback caches, and degraded-mode handoffs are all tolerable, while the recovery graph captures what depends on what when something is broken.

In recovery, those tolerances vanish. The path from failed to healthy must be fast and reliable, and a cycle in that path means it may not exist at all.
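
Finding these back-edges is a standard cycle search on the recovery graph. The sketch below is a plain depth-first search over a hypothetical three-node graph; the node names echo the incident but the edges are a simplification chosen for illustration.

```python
def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if acyclic.

    graph maps each node to the nodes it depends on. An edge back to a
    node still on the current search path closes a cycle.
    """
    state = {}  # node -> "visiting" or "done"
    path = []

    def visit(node):
        state[node] = "visiting"
        path.append(node)
        for dep in graph.get(node, []):
            if state.get(dep) == "visiting":  # back-edge found
                return path[path.index(dep):] + [dep]
            if dep not in state:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = "done"
        return None

    for node in graph:
        if node not in state:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Runtime view (hypothetical): monitoring and status page need Kinesis.
runtime = {"kinesis": [], "cloudwatch": ["kinesis"], "status_page": ["kinesis"]}
# Recovery view: restoring Kinesis also needs monitoring to verify each step.
recovery = dict(runtime, kinesis=["cloudwatch"])

print(find_cycle(runtime))   # None
print(find_cycle(recovery))  # ['kinesis', 'cloudwatch', 'kinesis']
```

The two graphs share every node and differ by a single edge, which is why a clean runtime review does not rule out a cyclic recovery path.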

The three failure modes, named

The three incidents above are not three instances of the same failure. They are three different ways the control loop loses authority over the plant, each with a precise structural cause.

Knight Capital is timescale decoupling. The plant ran at milliseconds; the controller ran at minutes. By the Nyquist sampling argument, the controller could not, in principle, react to disturbances at the plant’s natural frequency. The control structure was, mathematically, the wrong shape for the plant.

Facebook BGP is unreachable actuator. The diagnosis was correct and the intervention was correct, but the actuator was in the failure’s blast radius. There was no admissible input from the controller’s current state to the recovery state, because the path between them passed through the failed component. The recovery state was, in the formal sense, unreachable from where the operators were standing.

Kinesis is recovery dependency cycle. The controller could reach the actuator and the actuator could apply the intervention. But the verification path, knowing whether the intervention had worked, passed through the failure itself. The loop could be opened by the operators but not closed by feedback, which forced recovery to proceed at the speed of careful, blind, incremental probing.

The three failure modes compose. In a sufficiently bad incident, all three are active simultaneously: the plant moves faster than the controller can sample, the actuator is partially unreachable, and the feedback path that does reach the controller is itself degraded.

The control system has lost authority along three axes at once, and the operator is acting through the gaps.

The deeper claim, the one Parts I and II have been building toward, is that these regimes are not exceptional. They are what production distributed systems enter during incidents, by construction of how those systems are composed and observed.

The healthy operating envelope is the region in which the control loop has enough bandwidth, enough reach, and enough feedback to keep the plant on a setpoint.

Outside that envelope, one or more of those conditions fails, and the operator is no longer controlling the system; they are nudging it and waiting for the system either to find its own equilibrium or to deteriorate further.

The job of the operator in this regime is not to control. It is to survive: keep blast radius bounded, avoid actions that worsen the failure, preserve the option of recovery, and wait for conditions in which control becomes possible again.

The aviation discipline of aviate, navigate, communicate captures it: maintain altitude first, locate yourself second, talk to people third. Reversing that order is how you crash.

The OODA loop, and why it fails

The framework that incident responders most commonly use, often without naming it, is the OODA loop: Observe, Orient, Decide, Act.

The name and the framework are due to John Boyd, a U.S. Air Force colonel and fighter pilot who began formulating the ideas in the 1950s and 60s as an instructor at the USAF Fighter Weapons School and refined them through a series of briefings in the 1970s and 1980s, most prominently Patterns of Conflict (1986).

The central claim, which Boyd’s briefings emphasised, is that in adversarial situations, the actor whose decision cycle runs faster than the adversary’s wins, because they are operating inside the adversary’s decision cycle: by the time the slower actor has decided what to do, the faster actor has already changed the situation, invalidating the slower actor’s decision.

The framework migrated from military doctrine into business strategy, into emergency response, and eventually into software incident response, where it is used to describe the cycle an on-call engineer runs during an outage: observe the system, orient the observation against a model, decide on an intervention, act, and observe the result.

The framework is useful. It is also, in the production-systems context, frequently misapplied, and the way it is misapplied tells you something about why incidents go badly.

The original Boyd argument applies when the situation is adversarial and roughly symmetric: two actors with comparable OODA speeds, where being faster gives you the advantage. The situation an on-call engineer faces is not adversarial in this sense. There is no opponent making decisions.

The system is not trying to beat them. It is, instead, evolving according to dynamics (autoscaler decisions, retry storms, cache decays, queue accumulations) that have their own timescales, and those timescales are not necessarily compatible with the engineer’s OODA loop at all.

In an incident, three OODA loops are running simultaneously, and they do not all have the same period. The engineer’s loop runs at the speed of human cognition under stress: roughly thirty seconds to a few minutes per cycle, slower if multiple humans need to confer.

The automated control loop (autoscalers, load balancers, schedulers) runs at the speed of metric collection, typically seconds to tens of seconds. The plant’s own dynamics (connection pool exhaustion, queue overflow, cache regeneration) run at whatever rate the underlying physics dictate, which can be anywhere from milliseconds (TCP retransmits, GC pauses) to many minutes (cache warming, replication catch-up).

In Knight Capital’s case, the engineer’s loop ran at minutes against a plant operating at milliseconds. In Boyd’s terms, the plant was hopelessly inside the engineer’s decision cycle: every observation the engineers made was already obsolete by the time they oriented to it, and every action they took landed against a system that had moved orders of magnitude further during the action’s flight time.

The reverse case is also possible and is, in some ways, more insidious. An engineer’s OODA loop running faster than the plant’s natural recovery dynamics produces a different pathology: the engineer observes that the intervention has not yet worked, orients to that as a failure, decides on a new intervention, and acts, before the original intervention has had time to take effect.

The result is a stack of interventions in flight against a system that is already converging from the first one, with the later interventions arriving as disturbances against the recovery.

This is the operational form of what control theorists call over-control. The classical example is the shower with a slow-responding mixer: the user adjusts hot, observes no change, adjusts more, observes no change, then receives the full delayed effect of both adjustments and is scalded.

The loop is too fast for the plant. The mitigation, both in the shower and in production systems, is to wait, to extend the OODA cycle until it matches the plant’s natural response time, even though waiting feels, in the moment, like inaction.

The discipline this requires is hard. An on-call engineer under pressure does not feel that waiting is the correct action. The reflex, encouraged by the culture of incident response and reinforced by the stress of an outage in progress, is to act. To do something. To try the next thing on the runbook.

The structural argument against this reflex is not that the engineer is wrong to want to help; it is that, in a system with delayed feedback and partial observability, the cost of acting too quickly can exceed the cost of waiting one more OODA cycle for the previous action to land.

The principle from control theory is: the loop’s cycle time should be at least the plant’s settling time. If you act faster than the plant settles, you are stacking interventions against a system that has not yet responded to the previous one, and the resulting trajectory is not the sum of the interventions’ intended effects.

It is the response of a non-linear system to a sequence of disturbances, which is, almost always, worse than the response to any single intervention applied alone.

The mature on-call discipline, encoded in the better runbooks and the better incident command training, is to act, then wait the settling time, then observe, then decide whether to act again. The settling time is how long the previous action takes to show its full effect, and it is system-specific: deployment rollbacks settle in minutes; cache invalidations settle in seconds; configuration pushes can take longer than either, depending on the propagation path.
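
The shower pathology and its cure can be reproduced in a toy simulation. This sketch assumes the simplest possible plant, a pure delay in which the operator observes the valve setting from `delay` steps ago, and a controller that only ever turns the valve up; the function name and all parameters are hypothetical.

```python
def peak_output(delay, wait, target=5, steps=80):
    """Return the highest output observed while driving a delayed plant to target.

    delay: steps between changing the valve and seeing the effect (>= 1).
    wait:  steps the operator waits between successive adjustments.
    """
    valve = 0
    valve_history = []  # valve setting at the end of each step
    peak = 0
    next_action_at = 0
    for t in range(steps):
        # The operator sees the setting from `delay` steps ago.
        observed = valve_history[t - delay] if t >= delay else 0
        peak = max(peak, observed)
        if t >= next_action_at and observed < target:
            valve += 1                 # "adjust hot"
            next_action_at = t + wait  # pace the loop
        valve_history.append(valve)
    return peak


impatient = peak_output(delay=10, wait=1)   # acts every step
patient = peak_output(delay=10, wait=11)    # waits out the delay before acting again
print(impatient, patient)  # 14 5
```

The impatient operator stacks fourteen adjustments before the first one becomes visible and overshoots the target of 5 by almost a factor of three. The patient operator reaches the target exactly, at the cost of taking longer to get there.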

Knowing the settling time of every action available during an incident is a kind of operational knowledge that does not usually live in the runbook, because it does not exist outside the system being operated. It lives in the heads of the engineers who have been on-call for that system long enough to have learned it from incidents.

This is, partly, why senior on-call engineers are so much more effective in incidents than junior ones, even when the junior engineers have the same runbook access and the same training. The senior engineers have an internal model of the plant’s settling times for each available action, and they pace the OODA loop to match.

The junior engineers, lacking the model, run the OODA loop at the natural speed of human cognition under stress, which is too fast for most production systems and produces the over-control pathology.

The illusion of the runbook

Every mature on-call function has runbooks. They are written after incidents, refined over months and years, kept in wikis and indexed by alert.

The good ones describe, for each known failure mode, the diagnostic steps to confirm it, the intervention to apply, and the verification to perform. The discipline of writing them is one of the visible practices of a mature operations culture.

The runbook is also, for a specific category of incident, a trap.

The trap has two forms. The first is that runbooks are written against past incidents. They encode the diagnoses that worked, the interventions that helped, the verifications that confirmed recovery.

They are, in the language of Part II, an artefact of monitoring rather than observability: comprehensive against known failure modes, useless against novel ones. The first time the runbook fails to describe the incident in front of you is the moment you discover this, and the discovery is rarely well-timed.

The second form is subtler. The runbook describes interventions and assumes those interventions will work the way they did last time. But the system has been changing since the runbook was written.

The intervention that worked last quarter may now have different downstream effects, because the components it touches have been migrated, the dependencies have shifted, the version of the underlying service has rolled forward.

The runbook, in effect, captures the system as it was; the engineer is operating the system as it is. The two diverge silently, and the divergence is invisible until the runbook produces an action whose effect surprises everyone.

There is a specific failure pattern this produces, common enough to have its own folklore: the engineer follows the runbook, the runbook prescribes restarting the X service, the X service restarts, and the system gets worse rather than better.

The reason, on inspection, is usually that the X service has acquired new dependencies since the runbook was written, and restarting it now drops more state than it used to, with consequences the runbook author did not anticipate.

The runbook is not wrong, in the sense that the steps it describes are the steps that worked. The runbook is out of date, in the sense that the system those steps were designed for is no longer the system the engineer is operating.

The mitigation is not to delete runbooks. They are too useful to discard, and the failure modes they handle correctly are far more common than the ones they handle wrongly.

The mitigation is to use runbooks as hypotheses, not as procedures: the runbook describes what worked last time, which is evidence about what might work this time, but the engineer must independently verify, in real time, that the conditions the runbook assumes still hold. The runbook says “restart X”; the engineer asks “is restarting X still safe given the system as it currently is?” before doing it.

This is hard to do under the pressure of an active incident, and it is one of the practical reasons that senior on-call engineers are more cautious about runbook execution than junior engineers, not less. The juniors, trusting the document, execute the steps.

The seniors, knowing how the document was written and how the system has changed since, pause to verify before each step. The juniors finish the runbook faster. The seniors finish the incident faster.
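
Treating a runbook step as a hypothesis can be made mechanical by attaching its assumptions to it. The sketch below is one possible shape, not an established tool: RunbookStep, stale_assumptions, and the `live` facts are all hypothetical names, and in practice the checks would query the real system rather than a dictionary.

```python
from dataclasses import dataclass, field

@dataclass
class RunbookStep:
    action: str
    # Each assumption is a (description, check) pair; check() returns True
    # if the assumption still holds for the system as it is now.
    assumptions: list = field(default_factory=list)

def stale_assumptions(step):
    """Return descriptions of the assumptions that no longer hold."""
    return [desc for desc, check in step.assumptions if not check()]


# Hypothetical live facts; a real check would query the running system.
live = {"x_dependencies": {"db", "cache", "session_store"}, "x_holds_state": True}

restart_x = RunbookStep(
    action="restart X",
    assumptions=[
        ("X depends only on db and cache",
         lambda: live["x_dependencies"] <= {"db", "cache"}),
        ("X holds no in-memory state",
         lambda: not live["x_holds_state"]),
    ],
)

failed = stale_assumptions(restart_x)
if failed:
    print("pause before", restart_x.action, "->", failed)
else:
    print("safe to", restart_x.action)
```

Here both assumptions have gone stale since the step was written, so the sketch tells the engineer to pause rather than restart.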

A related, sharper principle: the value of a runbook decays with the rate of system change. In a system that does not change, a runbook is durable: the same steps work indefinitely. In a system that changes weekly, with new deployments, new dependencies, new configurations, the half-life of a runbook is on the order of months (at best).

A two-year-old runbook in a fast-moving system is closer to historical fiction than to operational guidance. It still has value, but the value is in the model it encodes of how someone once thought about the system, not in the steps it prescribes.

The blast radius principle

The third structural property of action in distributed systems, after timescale and reachability, is blast radius: the set of components affected by a given intervention, and the bound on the damage if the intervention is wrong.

Every available action has a blast radius. Restarting a single instance has a small radius: that instance, briefly, plus whatever state it held. Rolling back a deployment has a larger radius: the entire fleet running that deployment, plus the load that flips back to the previous version, plus the downstream effects of running the older code.

Failing over a region has a radius that may include every customer routed to that region, every dependent service that has to repoint, and every cache that has to be rebuilt. Dropping a load balancer has a radius that can encompass the entire service.

The blast radius principle says: act with the minimum blast radius that can plausibly resolve the failure. If the failure is in one instance, restart the instance, not the fleet. If the failure is in one region, fail over that region, not the global topology. The reason is not just damage control. It is information.

A small-radius action either resolves the failure or does not, and either outcome is informative: it succeeded, which suggests the failure was localised; or it did not, which rules out a hypothesis and constrains the next action.

A large-radius action, by contrast, may resolve the failure without telling you which part of the action was responsible. If you fail over a region and the symptoms clear, you do not know whether the failure was in the region you abandoned, in the load on the region you moved to, or in the path between them.

You have ended the incident without learning anything that would prevent the next one. The blast radius was excessive for the diagnostic value returned.

The principle generalises into a hierarchy of interventions, ordered roughly by radius:

The smallest interventions are read-only: looking at logs, querying metrics, running diagnostic commands. These have effectively zero blast radius and are always safe to do first. Most outages benefit from more reading than the operators in the moment feel they have time for.

The next tier is single-instance: restarting one process, draining one node, removing one server from rotation. These have blast radius limited to the instance, and they are reversible within seconds. They are the appropriate first active intervention for almost any failure that has a candidate localised cause.

The next tier is service-level: rolling restart of a fleet, configuration push, deployment rollback. These have blast radius across the service, take longer to apply, and are harder to reverse. They are appropriate when single-instance interventions have ruled out localised causes, or when the failure is observed broadly enough that localising it is itself wasting time.

The largest interventions are infrastructure-level: regional failover, load shedding at the edge, traffic routing changes, emergency capacity additions. These have very large blast radii, and their consequences are often partly unobservable until well after the action.

They are appropriate only when smaller interventions have failed or are known to be insufficient, and they should be made with the explicit understanding that the system after the intervention will be a different system from the one before, with new failure modes that no one has yet characterised.

The principle, in compressed form: the appropriate blast radius scales with the certainty of the diagnosis. When you are confident in the cause, you can use a targeted, low-radius intervention.

When you are uncertain, you have two choices: invest more time in diagnosis to raise the certainty, or accept the higher blast radius of a less-targeted intervention. There is no third option.

Acting with high blast radius on low certainty is how outages turn into multi-region cascades.


The conservation of risk

There is a principle, with versions in the safety-engineering tradition, that goes roughly: the total risk in a complex system tends to be conserved; safety interventions move risk around as much as they reduce it. The folk version is “what gets safer somewhere gets more dangerous somewhere else.”

The more precise versions belong to the risk-homeostasis and risk-compensation literature, most prominently Gerald Wilde’s Theory of Risk Homeostasis (1982), with related work by John Adams, which argues that visible safety improvements are partly absorbed by behavioural adjustments, with the residual risk migrating elsewhere in the system rather than disappearing.

This applies to incident response with peculiar force. Each intervention an operator makes during an incident is a safety action: an attempt to reduce immediate risk.

The intervention, almost always, succeeds at reducing the visible component of the failure.
The error rate drops.
The dashboards green up.
The customers stop tweeting.
The intervention has, by every visible measure, worked.

What is harder to see, and what is rarely measured in real time, is what the intervention did to the invisible component.

Rolling back a deployment ends the immediate incident but leaves the older code running, which may have its own known issues that the deployment was meant to fix.

Failing over a region resolves the local failure but loads the target region beyond its tested capacity, potentially priming a second incident.

Adding capacity to a saturated service unblocks the immediate queue but increases the surface area for the failure mode that caused the saturation, if the underlying cause has not been addressed.

The pattern is not unique to software. Aviation has a long literature on accident patterns that begin with a successful response to a minor problem and end with a major accident triggered by the response itself. Healthcare has the same literature.

The general structural shape is: the intervention that resolves Failure A creates the conditions for Failure B, which is rarer, less familiar, and less recoverable.

The system has been moved from a known failure regime into an unknown one, and the unknown regime has its own failures that the operators have not yet learned to recognise.

The discipline this asks of an on-call function is, again, hard to maintain under pressure. It is the discipline of asking, after every intervention: what did this just change about the system, and what new failure modes have I introduced?

It is the discipline of treating recovery as a state that itself needs to be monitored, because the post-recovery system is not the same system as the pre-incident system, and its failure characteristics are not yet known.

In practice, this means that the end of an incident is not when the symptoms clear. The end of an incident is when the system has been observed in its new configuration long enough to be confident that the interventions did not introduce a worse failure than the one they resolved.

This is usually hours, sometimes days, longer than the time the dashboards take to go green. The on-call function that closes incidents at the moment of symptom resolution is, structurally, accepting an invisible risk in exchange for ending the call earlier.

The function that watches the recovered system through at least one full traffic cycle is paying a cost in operator time for the option of catching the second-order failure before it becomes a second incident.

The operator as Bayesian, under pressure

Underneath all of the above is a single epistemic structure.

The operator, during an incident, is performing inference: from observed symptoms, against a prior model of the system, to a posterior estimate of what is wrong and what to do about it.

The framework is Bayesian whether the operator names it that way or not, and the failure modes are the failure modes of Bayesian inference under conditions hostile to it.

The prior model (what the operator believes the system is, before any incident-specific evidence) is the simulation Richard Cook described and that Part II referenced.

The likelihood (what the operator expects to observe under each candidate hypothesis) is implicit, drawn from training and prior incidents.

The posterior (what the operator believes after seeing the symptoms) is the working diagnosis. The decision follows from the posterior and from the operator’s loss function over possible outcomes.

Each step has its failure mode. The prior can be wrong, as in the Cloudflare oscillation in Part I, where the prior placed most of the probability mass on external attack and left almost none for gradual rollout against a regeneration cycle.

The likelihood can be wrong, as in the Slack autoscaler in Part II, where the likelihood model said high load implies high CPU, and the actual system was producing high load with low CPU because the threads were waiting on a degraded network.

The posterior, conditioned on a wrong likelihood, will be wrong in the same direction.

The decision, conditioned on a wrong posterior, will be wrong. And (this is the cruellest part) the operator will get feedback from the decision, observe its outcome, and update the model.

If the wrong decision happened, by chance or by partial compensation from other parts of the system, to be followed by recovery, the operator will incorporate that outcome as evidence that the wrong model was right.

The next time a similar incident occurs, the operator will reach for the same wrong intervention, with higher confidence, because last time it appeared to work.

This is the correlated false positive problem in operational learning. The operator’s model is updated by outcomes that are partially decoupled from interventions, but the decoupling is not visible to the operator. Recoveries that happen for reasons unrelated to the intervention reinforce belief in the intervention.

The model drifts, not toward the system’s actual dynamics, but toward whatever pattern of intervention-and-recovery the operator has happened to experience.
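The drift can be made concrete with a small numerical sketch. The numbers are hypothetical and chosen only for illustration: the operator starts at 50% belief that a given intervention fixes this class of incident, and assumes recovery is likely (0.9) if the intervention works and unlikely (0.3) if it does not. The class and method names are assumptions of the sketch.

```java
final class OperatorBelief {
    /** Bayes' rule for a binary hypothesis H given one piece of evidence E. */
    static double posterior(double prior, double pEvidenceGivenH, double pEvidenceGivenNotH) {
        double numerator = prior * pEvidenceGivenH;
        double denominator = numerator + (1.0 - prior) * pEvidenceGivenNotH;
        return numerator / denominator;
    }

    /** Belief that the intervention works after n recoveries observed in a row. */
    static double afterRecoveries(double prior, int n,
                                  double pRecoveryIfWorks, double pRecoveryIfUseless) {
        double belief = prior;
        for (int i = 0; i < n; i++) {
            // Each observed recovery is treated as fresh evidence for the intervention.
            belief = posterior(belief, pRecoveryIfWorks, pRecoveryIfUseless);
        }
        return belief;
    }
}
```

With these inputs, `OperatorBelief.afterRecoveries(0.5, 3, 0.9, 0.3)` is about 0.96: three recoveries move the operator from a coin flip to near certainty. If the system in fact recovers on its own far more often than the assumed 0.3, none of that confidence was earned.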

The mitigation requires explicit discipline, and most on-call functions do not maintain it. The discipline is to ask, after every incident: did the intervention actually cause the recovery, or did the system recover for some other reason that I happened to be present for?

The blameless post-mortem culture is, partly, an attempt to create the conditions in which this question can be honestly asked. It often is not.

There is a deeper version of the same problem, which is that the operator’s model is updated by outcomes the operator can observe, and the outcomes the operator can observe are filtered by the same observability stack whose limits Part II discussed.

The model drifts toward whatever the dashboards can see. Failure modes invisible to the dashboards remain invisible to the model, and the operator becomes progressively more confident in a model that captures only the observable subspace of the system’s actual behaviour.

The model is calibrated, but on a projection that has discarded the dimensions that matter most during novel failures.

This is, finally, why the systems that recover quickly from incidents are not the systems with the best runbooks or the best automation.

They are the systems whose operators have maintained an active distinction between the model and the system:

who treat their model as a hypothesis under continuous test,
who notice when the system surprises them and update accordingly,
and who are willing, in the middle of an incident, to admit that the model they have been operating under for years may not apply to the situation in front of them.

The discipline is epistemic humility under pressure. It is the rarest thing in operations, and it is the one that compounds.

The discipline of acting under uncertainty

The compressed form of this essay, the operational counterpart to Part I’s “what does this depend on that I cannot see?” and Part II’s “what is this metric not telling me?”, is also a question, asked of every intervention before it is applied:

What does this assume about the system, and what happens if that assumption is wrong?

Every action has assumed conditions. Restarting an instance assumes the instance is the source of the failure. Rolling back a deployment assumes the previous version is healthier than the current one.

Failing over assumes the target region can absorb the load. Adding capacity assumes the bottleneck is capacity. Each assumption is, before the action, a hypothesis. Each becomes, after the action, a commitment the operator has to live with.

The discipline is to know, for every available intervention, what assumption it embeds, and to verify that assumption before applying the intervention.

The verification can be imperfect; perfect verification is incompatible with the timescale of an outage.

But the question must be asked, because the alternative is acting on a hypothesis the operator has not consciously formed, and being surprised by the system’s response in a direction the operator was not prepared for.

The questions to ask, before any non-trivial intervention:

What does this action assume about the system?
What is the blast radius if the assumption is wrong?
What is the smallest intervention that would resolve this if my diagnosis is correct?
How will I know whether the action worked, and how long do I need to wait before that signal is reliable?
If this action fails, what state does it leave the system in, and is the next action still reachable from there?

The discipline is not to avoid acting. It is to size the intervention to the certainty of the diagnosis, allow the system time to settle, verify the assumption before escalating, and preserve the possibility of recovery after every step.

This is what separates short incidents from catastrophic ones. Not faster reactions, better dashboards, or thicker runbooks, but the ability to treat every intervention as a probe: an action that changes the system while also revealing something about it.

During an incident, the system is operating in a regime nobody has fully characterised. The operator’s job is not to force it back through sheer intervention. The job is to keep the system in a state where understanding is still possible, learn from each action, and navigate toward a recoverable regime without collapsing the remaining options.

This is on-call: control under uncertainty.

The loops are delayed. The observations are partial. The interventions have non-linear effects. The system being repaired is one no single engineer fully understands.

At 03:47 UTC, the operator is performing the function the entire architecture silently assumes someone will perform, while the architecture itself makes performing it nearly impossible.

The systems mostly do not work. The operators work, and the systems usually fail only when the operators lose the ability to see, infer, or intervene safely.

The pager will go off again. The dashboards will be wrong again. The runbook will be stale again.

The work is the same work:

act carefully, preserve reversibility, make the assumption explicit, and already know the next move before committing to the current one.

One more thing…

The structural argument across this series is that distributed systems fail at the seams: composition, observation, and control are each independent sources of failure modes no single component owns and no single engineer fully sees.

The same argument applies, in concentrated form, to GPU programming.

A modern CUDA kernel is itself a tiny distributed system: dozens of streaming multiprocessors, thousands of warps, multiple memory hierarchies, delayed observation through performance counters, and sharply non-linear control through launch configuration and synchronization.

Correctness is necessary. Performance lives in how composition, observation, and control interact under the workload you actually have, not the one your benchmark measured.

I wrote a deep CUDA guide from exactly this perspective: not isolated tricks, but how to reason about the GPU as a coupled dynamical system whose performance regimes and failure modes (occupancy collapse, memory-bandwidth thrashing, warp divergence, pipeline stalls) are structurally the same kinds of seam failures this series has been describing all along.


How Systems Really Fail, Part II

Lorenzo Bradanini — Tue, 19 May 2026 09:51:32 GMT

Intro

The first essay in this series argued that distributed systems fail in the spaces between their components, and that those spaces are structurally opaque. This one argues something more uncomfortable.

Even if you accept that the system is opaque, you still have to operate it. You still have to debug it at 03:47 UTC. You still have to decide, in the next ninety seconds, whether to roll back, fail over, shed load, or page someone more senior.

To do any of that, you have to see the system.

This is the second structural problem, the one most production engineers learn about the hard way, usually during an incident that lasted longer than it should have because the dashboards stayed green until they didn’t.

The system you observe is not the system that exists. It is a projection of the system into a low-dimensional representation, built out of metrics, logs, traces, and the mental model that lives in the operator’s head.

The projection is incomplete by construction. It is delayed by the time it takes to collect, aggregate, and render. It is biased by what someone decided to instrument three years ago. And it is aggregated, often violently, in ways that destroy precisely the signal needed to recover from an incident.

This essay is about that gap. Not the gap between the system as designed and the system as composed (Part I), but the gap between the system as composed and the system as perceived. The two gaps are different, and they compound.

Three incidents, mechanically reconstructed: Slack’s autoscaler chasing metrics that had decoupled from the network failure underneath, while the dashboards that would have shown the decoupling failed at the same time as the system; GitHub’s 43-second partition and the 24-hour reconciliation that followed; and Roblox’s 73-hour outage, where the monitoring stack failed at the same time as the system it monitored, and engineers spent two days debugging a fully dark cluster.

The pattern beneath all three is the same: the operator’s view of the system, and the system itself, were two different things. The interesting question is why this is structural, not accidental.

The observer problem, mechanically

The canonical incident for this material is the Slack outage of 4 January 2021, the first business day after the holiday break, documented in detail in Slack’s post-mortem by Laura Nolan.

Slack runs on AWS, with services running in dedicated VPCs (Virtual Private Clouds) connected by AWS Transit Gateways (TGWs). TGWs are managed by AWS and intended to scale transparently. Slack’s traffic pattern is unusual: the platform is quiet over the holidays and then ramps to one of its biggest days of the year on the first Monday back, when clients reconnect with cold caches and pull down more data than usual.

On 4 January, the TGWs did not scale fast enough for that ramp. Around 6:00 AM PST, one of them began dropping packets. The packet loss caused widespread degradation in internal calls across Slack’s services, but the symptom was not yet visible to users.

Slack’s web tier autoscales on two signals: CPU utilisation and utilisation of available Apache worker threads. Here is where the failure begins. As packets dropped on the TGW, threads in the web tier spent more time waiting on slow or stalled backend calls.

Waiting threads do not burn CPU. So as the system became less able to serve users, CPU utilisation actually dropped. The autoscaler, looking at CPU, concluded the fleet was over-provisioned and downscaled the web tier.

Then the mini-peak at 7:00 AM PST arrived. Load increased against the now-smaller fleet on a degraded network. Apache worker thread utilisation climbed sharply (threads were waiting longer, and more of them were in use), and the thread-utilisation signal triggered aggressive upscaling. Slack attempted to add 1,200 servers between 7:01 and 7:15 AM PST.

The scale-up failed. New instances are configured by an internal service Slack calls provision-service, which talks to other Slack systems and to AWS APIs over the same degraded network. Under the sudden load of 1,200 simultaneous provisioning requests, with elevated latency on every dependency call, provision-service hit two resource ceilings: the Linux open files limit and an AWS quota limit.

Most of the 1,200 instances were created but never fully provisioned. They counted against the autoscaling-group size limit, blocking further scale-up, but they did not serve traffic.

And then the second layer of the observer problem revealed itself. Slack’s dashboarding and alerting service had failed during the early stages of triage. The reason was structural: the monitoring stack ran in a different VPC from its backend databases, and the same TGW that was dropping packets sat on the path between them. The failure that was breaking the web tier had also blinded the engineers trying to diagnose it.

For roughly the next hour, incident responders worked without dashboards. They had logs, command-line tools, and the ability to query metrics backends directly, but none of the pre-built queries that turn raw metrics into actionable views.

Some engineers were SSHed into production instances when the autoscaler deprovisioned them mid-investigation, abruptly ending their sessions. provision-service recovered around 8:15 AM PST; serving capacity reached a degraded-but-functional state by 9:15 AM PST; full recovery, after AWS engineers manually scaled TGW capacity, completed at 10:40 AM PST.

The whole sequence is the observer problem in compounded form. The autoscaler responded first to a metric (CPU) that did not represent the system state, then to a metric (thread utilisation) that drove the wrong action under the conditions.

The control loop did exactly what it was designed to do; the signals it was acting on had decoupled from the reality of the network underneath.

And the observation surface that operators would normally have used to see this decoupling was itself, by architectural coincidence, a casualty of the same failure.

This is the observer problem in its operational form, and it has three structural sources, each worth pulling apart.

Instrumentation lag

Every observability pipeline introduces delay between an event occurring in the system and that event being visible on a dashboard. The delay has multiple stages.

First, emission delay: the event happens, but the code that emits the metric or log runs after the event, and the emission itself takes some time, usually buffered behind a batching layer with a flush deadline (StatsD typical: 10s; OpenTelemetry batch span processor: 5s default, max queue 2048). Second, collection delay: the emitted data is scraped or pushed to a collector at fixed intervals (Prometheus: commonly configured with a 15s scrape interval, against a documented default of 1m, with scrape_timeout defaulting to 10s).

Third, aggregation delay: the collector pre-computes summary statistics, often on a window ending some seconds in the past to allow late-arriving data. Fourth, render delay: the dashboard queries the storage layer and renders, typically on a 30s to 60s refresh.

End-to-end delay in a well-tuned production pipeline is typically 15 to 60 seconds. In many real pipelines it is several minutes.

This is not a defect. It is the cost of producing observations that are coherent across thousands of hosts. The delay is the price paid for the metric being computable at all.

But the delay has a consequence that classical control theory makes precise. Any closed-loop control system whose feedback path introduces delay τ experiences a phase lag of ωτ radians at frequency ω.

The Nyquist stability criterion says, in essence, that a feedback loop becomes unstable when its total phase lag approaches 180° at the frequency where the loop gain is unity; with even modest controller gain, sufficient phase lag turns negative feedback into positive feedback and the loop oscillates.

Concretely: for an autoscaler with a minute-scale measurement-to-action delay attempting to track load that varies on similar timescales, phase lag approaches the stability boundary. Aggressive scaling policies tip the loop into oscillation; this manifests operationally as the autoscaler over-provisioning, then over-scaling-down, then over-provisioning again, never settling.
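As a rough check on that claim, the phase lag contributed by the delay alone can be computed directly from ωτ. The sketch below is illustrative only; the class name and the example delay and period are assumptions, and a real loop adds further lag from the controller and the actuator (instance boot time, warm-up).

```java
final class PhaseLag {
    /** Phase lag in degrees added by a pure delay, for a disturbance of the given period. */
    static double degrees(double delaySeconds, double periodSeconds) {
        double omega = 2.0 * Math.PI / periodSeconds; // angular frequency in rad/s
        return Math.toDegrees(omega * delaySeconds);  // phase lag = omega * tau
    }
}
```

`PhaseLag.degrees(60, 120)` is 180: a 60-second measurement-to-action delay against load that swings on a two-minute period uses up the entire phase budget before the controller has contributed anything. Against ten-minute swings the same delay costs 36 degrees, which is why slow diurnal load is easy to track and minute-scale bursts are not.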

Slack’s post-mortem describes a variant of this pattern: an initial downscale on CPU, followed by an aggressive upscale on thread utilisation, against a network problem the loop could not see at all.

The Shannon-Nyquist sampling theorem provides the converse bound: a control loop sampling at interval T cannot observe, and therefore cannot react to, disturbances whose period is shorter than 2T. A 15-second Prometheus scrape interval is structurally blind to load dynamics on timescales below 30 seconds. The information about those dynamics has been aliased into the lower-frequency band, where it appears as noise.
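A small sketch makes the aliasing visible. It assumes an idealised sinusoidal load with a 20-second period, sampled every 15 seconds; the class and method names are hypothetical, and real load is of course not a pure sine.

```java
final class Aliasing {
    /** Apparent period (seconds) of a sinusoid with truePeriod when sampled every sampleInterval. */
    static double apparentPeriod(double truePeriod, double sampleInterval) {
        double f = 1.0 / truePeriod;
        double fs = 1.0 / sampleInterval;
        // Fold the true frequency into the observable band [0, fs/2].
        double folded = Math.abs(f - fs * Math.round(f / fs));
        return folded == 0.0 ? Double.POSITIVE_INFINITY : 1.0 / folded;
    }

    /** Value of a unit sinusoid with the given period at time t (seconds). */
    static double sample(double period, double t) {
        return Math.sin(2.0 * Math.PI * t / period);
    }
}
```

`Aliasing.apparentPeriod(20, 15)` is 60 seconds, and the samples confirm it: at every scrape instant, `sample(20, t)` equals `-sample(60, t)`. The dashboard shows a gentle one-minute wave; the system is actually swinging three times faster, and nothing in the scraped data can distinguish the two.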

Marc Brooker has written about this directly in the context of AWS load balancing: a control loop with delay longer than the time constant of the thing it controls cannot stabilise that thing. It can only chase it.

The Slack autoscaler chasing a CPU metric that had decoupled from real load (because waiting threads do not burn CPU) was operating in exactly this regime.

The mitigation is not to make the metrics faster, though that helps. The mitigation is to design control loops that do not depend on global metrics: rate-limiting at the boundary, admission control based on local queue depth observable in O(1) from inside the affected process, fallback to last-known-good rather than reactive scaling.

Local observations made by the component itself have zero collection delay because they bypass the pipeline entirely. Global observations always carry the pipeline’s phase lag.

This is the structural argument for autonomic behaviour at the component level. Components that defend themselves locally (with circuit breakers, backpressure signals propagated synchronously to upstream callers, and load-shedding triggered by their own queue depth) do not depend on a delayed control loop to survive.

Components that wait for the autoscaler to rescue them are operating inside a feedback loop whose stability margin is, almost always, narrower than anyone has measured.
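A minimal sketch of what defending locally can look like: an admission gate that sheds load based only on the process’s own in-flight count. The class name, the limit, and the choice of in-flight requests as a stand-in for queue depth are all assumptions of the sketch, not a description of any system discussed here.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class LocalAdmission {
    private final int maxInFlight;
    private final AtomicInteger inFlight = new AtomicInteger();

    LocalAdmission(int maxInFlight) {
        this.maxInFlight = maxInFlight;
    }

    /** Returns true if the request may proceed; false means shed it immediately. */
    boolean tryAdmit() {
        while (true) {
            int current = inFlight.get();
            if (current >= maxInFlight) {
                return false; // local decision, no metrics pipeline involved
            }
            if (inFlight.compareAndSet(current, current + 1)) {
                return true;
            }
            // Another thread changed the count between get and CAS; retry.
        }
    }

    /** Must be called exactly once for each admitted request when it completes. */
    void release() {
        inFlight.decrementAndGet();
    }

    /** Current number of admitted, unfinished requests. */
    int depth() {
        return inFlight.get();
    }
}
```

The decision is O(1) and uses state the process already holds, so it reacts within a single request rather than a scrape interval. A caller would wrap each request in `tryAdmit()` and a `finally` block that calls `release()`, returning a fast rejection when the gate says no.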


Aggregation destroys the signal

Every metric you look at on a production dashboard is an aggregate. A counter has been summed across hosts. A latency value has been percentiled across requests. A gauge has been averaged or maxed across a time window.

Aggregation is mathematically necessary; you cannot stare at every individual request. It is also, almost always, the thing that hides the failure.

The pathology is most visible in latency aggregation, and the canonical analysis is Gil Tene’s How NOT to Measure Latency (2015). The argument is worth deriving mechanically, because the conclusion is counter-intuitive and the mechanism is not.

Consider a load generator configured to issue requests at a rate of R per second; one request every 1/R seconds. The generator records, for each response, the time elapsed between send and receipt. Call this service_time. The reported P99 is the 99th percentile of the service_time distribution across some number of samples.

Now suppose the system stalls completely for T seconds, then recovers. During the stall, no responses arrive. The generator, in its naive form, has two possible behaviours.

In the coordinated form, the generator waits for each response before sending the next. During the T-second stall, exactly one request is in flight; its service_time is recorded as T. The other R·T requests that should have been sent during the stall are never sent. They do not appear in the histogram at all.

In the uncoordinated form, the generator tries to send on schedule regardless of whether responses arrive. During the stall, R·T requests pile up in the kernel’s socket buffer or in the generator’s own queue. When the system recovers, those requests are drained; each one’s service_time is measured from the moment it was actually sent, not the moment it was scheduled to be sent, so the time it spent queued is never recorded.

The bias is statistical. Let F_observed(t) be the empirical CDF of service_time the generator records, and F_true(t) the CDF of latency that a real user (arriving according to a Poisson process at rate R) would experience.

In the coordinated form, the missing samples should have been drawn from the slowest part of the latency distribution; their absence systematically truncates the right tail of F_observed. The quantile function Q_observed(0.99) = F_observed⁻¹(0.99) is, by construction, a lower bound on Q_true(0.99), with the gap widening as T grows.

The user who clicked at time t and got a response at time t + T + ε experienced a latency of T + ε. The histogram has either zero entries near T + ε (coordinated form) or entries clustered near ε (uncoordinated form). In neither case does the percentile reflect what the user felt.

The correction Tene proposes (HdrHistogram’s recordValueWithExpectedInterval) is to synthesise the missing samples: for each measured service_time exceeding the expected interval 1/R, insert additional samples at service_time − 1/R, service_time − 2/R, ..., down to 1/R.

These synthetic samples represent the users who would have arrived during the stall and would have experienced progressively shorter waits.

The implementation, in essence:

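void recordValueWithExpectedInterval(long value, long expectedInterval) {
    // Always record the measurement that was actually observed.
    recordValue(value);
    // Nothing to back-fill if the response arrived within one send interval.
    if (expectedInterval <= 0 || value <= expectedInterval) return;
    // Synthesise the requests that should have been sent during the stall,
    // each one waiting one interval less than the one before it.
    long missingValue = value - expectedInterval;
    while (missingValue >= expectedInterval) {
        recordValue(missingValue);
        missingValue -= expectedInterval;
    }
}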

Six lines of logic. One additional method call per recorded measurement. The result, when applied to real traces, is routinely an order-of-magnitude shift in the tail. A system reporting P99 = 200 ms under uncorrected measurement can report P99 = 2 to 4 seconds under coordinated-omission correction.

The dashboard was lying by a factor of ten, not by accident, but by construction of how percentiles are computed from a fixed-rate sampler against a non-stationary service-time distribution.
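To see the size of the effect on a toy trace, the sketch below applies the same back-fill loop to a plain list and computes a nearest-rank P99 before and after. The trace is invented for illustration (990 responses at 1 ms, one stalled response at 1000 ms, one request expected every 10 ms), and the class and method names are hypothetical; this is not HdrHistogram itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class CoordinatedOmissionDemo {
    /** The back-fill loop from the text, recording into a list instead of a histogram. */
    static void recordCorrected(List<Long> samples, long value, long expectedInterval) {
        samples.add(value);
        if (expectedInterval <= 0 || value <= expectedInterval) return;
        for (long missing = value - expectedInterval;
             missing >= expectedInterval;
             missing -= expectedInterval) {
            samples.add(missing);
        }
    }

    /** Nearest-rank percentile: smallest value with at least p percent of samples at or below it. */
    static long percentile(List<Long> samples, double p) {
        List<Long> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(rank, 1) - 1);
    }

    /** 990 fast responses (1 ms) followed by one stalled response (1000 ms). */
    static List<Long> trace(boolean corrected, long expectedIntervalMs) {
        long interval = corrected ? expectedIntervalMs : 0; // 0 disables the back-fill
        List<Long> samples = new ArrayList<>();
        for (int i = 0; i < 990; i++) {
            recordCorrected(samples, 1, interval);
        }
        recordCorrected(samples, 1000, interval);
        return samples;
    }
}
```

Uncorrected, the stall is one sample in 991 and `percentile(trace(false, 10), 99.0)` is 1 ms. Corrected, the stall contributes 99 synthetic samples, and `percentile(trace(true, 10), 99.0)` is 900 ms. Same system, same second of trouble; only the accounting changed.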

The same pathology appears, in different shapes, throughout the metric stack. Server-side latency histograms measure only requests the server got to process. Requests rejected at the load balancer, dropped at the TCP layer, or held in the kernel accept queue do not appear.

The server’s P99 can be excellent while the connection P99 (which the user actually experiences) is catastrophic. The Slack incident is exactly this: the web tier was reporting acceptable internal latencies for the requests it was handling, because the requests not being handled were not in the denominator. Survivorship bias, in observation form.

The same problem appears in averaging. A service that handles two classes of request, 99% of them fast and 1% of them slow, will show a mean latency dominated by the fast class. If the slow class gets ten times slower during an incident, the mean barely moves. The average is structurally insensitive to the tail, which is where outages live.

Percentiles are better than averages, but only if computed correctly, and only if the percentile you care about is in the data.

P99 across a million requests has ten thousand data points beyond the percentile and is statistically reliable. P99 across a thousand requests has ten and is statistical noise. P99.9 across a million requests has a thousand: marginal. P99.99 across a million requests has a hundred: useless.

The deeper into the tail, the more samples needed to stabilise it, and the deeper into the tail is exactly where the interesting failures live.

The reason failures live in the tail is itself a queueing-theoretic result. For an M/M/1 queue (Poisson arrivals at rate λ, exponential service times at rate μ, single server) with utilisation ρ = λ/μ, the expected waiting time in the system is 1/(μ(1−ρ)). As ρ → 1, this diverges hyperbolically; the variance of waiting time grows even faster, as 1/(1−ρ)².

The practical consequence is that high-percentile latencies blow up far faster than utilisation increases: a service running near saturation has a P99 that is many multiples of its unloaded service time, and the multiple grows sharply as you approach full utilisation.

Tails are not a measurement artefact; they are the physics of contention, and they are precisely what averages and low-resolution percentiles destroy.
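The divergence is easy to tabulate. The sketch below uses the mean from the formula above and adds one standard result not stated in the text: under first-come-first-served, time in system for an M/M/1 queue is exponentially distributed with rate μ(1−ρ), which gives the quantiles in closed form. The class name and the example service rate are assumptions.

```java
final class MM1 {
    /** Mean time in system (queueing plus service): 1 / (mu * (1 - rho)). */
    static double meanTimeInSystem(double mu, double rho) {
        return 1.0 / (mu * (1.0 - rho));
    }

    /** q-quantile of time in system, assuming FCFS (exponential with rate mu * (1 - rho)). */
    static double quantile(double mu, double rho, double q) {
        return -Math.log(1.0 - q) / (mu * (1.0 - rho));
    }
}
```

For a server with μ = 100 requests per second (10 ms mean service time), `MM1.quantile(100, rho, 0.99)` is roughly 92 ms at ρ = 0.5, 461 ms at ρ = 0.9, and 4.6 seconds at ρ = 0.99. Utilisation doubled; the P99 grew fiftyfold.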

The problem compounds when percentiles are themselves aggregated. The P99 of a service that is the union of ten hosts is not the average of those hosts’ P99s, nor their maximum, nor any function of them; it is a quantile of the underlying merged distribution, which is unrecoverable once each host has been independently percentiled.

Many monitoring systems do exactly this aggregation, producing a number labelled “P99” that is mathematically meaningless. Theo Schlossnagle and the Circonus team have written extensively on this; the correct primitive is to store the full histogram and percentile at query time, after merging across hosts.
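A two-host example shows how far off the shortcuts can be. The data is invented: one busy host serving 1,000 requests at 10 ms and one nearly idle host serving 10 requests at 1,000 ms. The class and helper names are hypothetical, and the percentile is a plain nearest-rank computation rather than any particular monitoring system’s.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class MergedPercentile {
    /** Nearest-rank P99 over raw samples. */
    static long p99(List<Long> samples) {
        List<Long> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(0.99 * sorted.size());
        return sorted.get(Math.max(rank, 1) - 1);
    }

    /** A host whose requests all took the same time. */
    static List<Long> constant(int count, long valueMs) {
        return new ArrayList<>(Collections.nCopies(count, valueMs));
    }

    /** The distribution a user of the whole service actually samples from. */
    static List<Long> merge(List<Long> a, List<Long> b) {
        List<Long> merged = new ArrayList<>(a);
        merged.addAll(b);
        return merged;
    }
}
```

The per-host P99s are 10 ms and 1,000 ms. Their average is 505 ms and their maximum is 1,000 ms, but the P99 of the merged 1,010 requests is 10 ms, because the slow host carries less than 1% of the traffic. Shift a few more requests onto the slow host and the merged P99 jumps to 1,000 ms while the per-host numbers do not move at all.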

Three production-grade histogram primitives dominate:

HDR Histogram (Tene): fixed-precision logarithmic bucketing across many orders of magnitude (typically nanoseconds to hours at three significant digits), O(1) insert, mergeable across instances, widely used in the JVM ecosystem.
t-digest (Dunning, 2013): centroid-based sketch with concentrated precision in the tails, useful when storage is constrained but tail accuracy matters; O(log n) insert in the worst case, mergeable with a controlled error bound.
DDSketch (Masson, Rim, Lee, 2019): relative-error guarantee α, log-bucketed similarly to HDR but with provable tail accuracy, fully mergeable without error accumulation; used by Datadog as its native primitive.
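The relative-error idea behind log-bucketed sketches fits in a few lines. What follows is a simplified sketch in the spirit of the DDSketch paper (bucket boundaries at powers of γ = (1+α)/(1−α)), not Datadog’s implementation: it handles only positive values, never collapses buckets, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class LogBucketSketch {
    private final double gamma;
    private final Map<Integer, Long> counts = new HashMap<>();

    LogBucketSketch(double alpha) {
        this.gamma = (1.0 + alpha) / (1.0 - alpha);
    }

    /** Bucket i covers (gamma^(i-1), gamma^i]; value must be positive. */
    int index(double value) {
        return (int) Math.ceil(Math.log(value) / Math.log(gamma));
    }

    /** Representative value for bucket i, within relative error alpha of anything in it. */
    double estimate(int index) {
        return 2.0 * Math.pow(gamma, index) / (gamma + 1.0);
    }

    void record(double value) {
        counts.merge(index(value), 1L, Long::sum);
    }

    /** Merging is bucket-wise addition; both sketches must share the same alpha. */
    void mergeFrom(LogBucketSketch other) {
        other.counts.forEach((i, c) -> counts.merge(i, c, Long::sum));
    }

    long count(int index) {
        return counts.getOrDefault(index, 0L);
    }
}
```

Because every host maps a given latency to the same bucket index, merging across hosts is just adding counts, and the percentile computed afterwards carries the same α bound as a percentile computed on one host. That property is what makes percentile-at-query-time practical.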

All three solve the merge problem. None of them solve the storage problem: keeping a per-request histogram per dimension (customer, region, endpoint, version) at full cardinality costs roughly two orders of magnitude more than scalar metrics. Almost no organisation does it for everything.
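To make "mergeable" concrete, here is a toy log-bucketed sketch in the spirit of DDSketch. It is not the DDSketch library or its API: the class name, the positive-values-only restriction, and the unbounded bucket map are simplifications made for illustration.

```python
import math
from collections import Counter

class LogSketch:
    """Toy quantile sketch with relative error alpha (positive values only)."""

    def __init__(self, alpha: float = 0.01):
        self.alpha = alpha
        self.gamma = (1 + alpha) / (1 - alpha)
        self.buckets = Counter()   # bucket index -> count
        self.count = 0

    def add(self, x: float) -> None:
        if x <= 0:
            raise ValueError("toy sketch handles positive values only")
        # Bucket i covers (gamma^(i-1), gamma^i].
        self.buckets[math.ceil(math.log(x, self.gamma))] += 1
        self.count += 1

    def merge(self, other: "LogSketch") -> None:
        if other.alpha != self.alpha:
            raise ValueError("sketches must share alpha to merge")
        self.buckets.update(other.buckets)   # Counter.update adds counts
        self.count += other.count

    def quantile(self, q: float) -> float:
        rank = q * (self.count - 1)
        seen = 0
        for idx in sorted(self.buckets):
            seen += self.buckets[idx]
            if seen > rank:
                # Estimate within relative error alpha of any value in the bucket.
                return 2 * self.gamma ** idx / (self.gamma + 1)
        raise ValueError("empty sketch")

# Usage: one sketch per host, merged at query time.
host_a, host_b = LogSketch(), LogSketch()
for i in range(5000):
    host_a.add(1 + i * 0.37)
    host_b.add(50 + i * 1.3)
host_a.merge(host_b)
print("merged P99 estimate:", round(host_a.quantile(0.99), 1))
```

Merging is just adding bucket counts, which is why the percentile can be computed after the merge rather than before it.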

Beyond latency, aggregation collapses cardinality. If a service is failing for one specific customer, on one specific endpoint, in one specific region, the aggregate error rate may show a 0.1% blip below any reasonable alerting threshold. The blip is the entire experience of that customer. Aggregation makes the rare invisible by averaging it against the common.
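A small illustration of that arithmetic, using invented traffic: one hypothetical customer ("acme") fails on every request while everyone else succeeds.

```python
from collections import Counter

# Hypothetical minute of traffic: (customer, region, ok) per request.
requests = [("acme", "eu-west", False)] * 100            # one customer, fully down
requests += [(f"cust{i % 500}", "us-east", True) for i in range(99_900)]

total_errors = sum(1 for _, _, ok in requests if not ok)
print(f"aggregate error rate: {total_errors / len(requests):.2%}")   # 0.10%

by_customer = Counter()
errors_by_customer = Counter()
for customer, _, ok in requests:
    by_customer[customer] += 1
    errors_by_customer[customer] += (not ok)

worst = max(by_customer, key=lambda c: errors_by_customer[c] / by_customer[c])
print(worst, f"{errors_by_customer[worst] / by_customer[worst]:.0%}")  # acme 100%
```

The second view is only possible because the customer dimension was kept on each event; the aggregate alone cannot be un-averaged.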

Charity Majors’s argument for high-cardinality observability, developed across her writing at Honeycomb and in Observability Engineering (O’Reilly, 2022, with Liz Fong-Jones and George Miranda), reduces to this: the questions that matter during an incident are almost always questions about specific slices of traffic, and any pre-aggregated metric has already destroyed the dimensions needed to slice on.

The pre-aggregation step is irreversible. Information theory enforces this; once you have computed the count of errors per minute, you cannot recover which customers produced those errors.

The cost of preserving cardinality is large; the cost of discarding it is invisible until the incident in which you need the dimension you discarded. The discipline is to know which dimensions are likely to matter, instrument those at full cardinality, and accept that some incidents will be invisible until someone re-instruments after the fact.

You can only see what you instrumented

This is the third and most permanent of the three sources, and the hardest to mitigate, because it is a statement about the closure of the observable space.

Every metric, log line, and trace span in your system exists because some engineer, at some point, decided it would be useful. The decision was made under one model of how the system would fail. The decision was made before the migration that changed the failure modes. The decision was made by someone who has since left.

The observation space of a production system is, in effect, a fossil record of past concerns. It captures the questions someone thought to ask. The questions that nobody thought to ask are invisible, by definition, and they remain invisible until an incident forces someone to add the instrumentation in the middle of the firefight.

This is the operational meaning of Cindy Sridharan’s distinction (in Distributed Systems Observability, O’Reilly, 2018) between monitoring and observability. Monitoring is the practice of watching known failure modes by means of pre-defined metrics and alerts.

Observability is the property of a system that allows you to ask questions about its state that you did not anticipate having to ask.

The two are different in kind. Monitoring is comprehensive at handling familiar failures and useless at handling novel ones. Observability is the opposite. A mature production environment requires both, but the second is much harder to achieve, because it requires preserving structured, high-cardinality data about every interesting event, in a form that can be queried after the fact along dimensions no one specified in advance.

The DynamoDB DNS race condition from Part I is a clean example of the limit. The plan generation system had monitoring: are the Enactors running, are they applying plans, is Route 53 returning valid responses? All of these were green throughout the incident.

The question that would have caught the failure, is there a window during which one Enactor has applied an older plan after another Enactor has deleted it, was a question nobody had thought to ask, because the failure mode it describes had never happened.

There was no monitoring for it, because there was no model of it. Observability, in Sridharan’s sense, might have caught it: if the raw event stream of Enactor operations, with full cardinality on plan version and Enactor identity, had been queryable, an engineer during the incident could have constructed the query that revealed the interleaving. Whether anyone would have thought to construct that query in the first ninety minutes of the outage is a separate question.

Ben Sigelman, who designed Google’s Dapper tracing system and later co-founded LightStep, has argued that the practical limit of observability is set by the cost of the questions you are not yet asking. Storing every span, log, and structured event at full cardinality is theoretically ideal and economically impossible.

Every organisation makes a choice about which dimensions to preserve, and that choice is, in retrospect, always slightly wrong, because the next incident is the one whose relevant dimension was sampled out.

The discipline is not to eliminate this gap, which cannot be done, but to acknowledge it: to recognise that your dashboards are a model, that the model is incomplete, and that the moments when the model and the system disagree are the moments that matter most.

The three frames of reference

The DynamoDB cascade in Part I introduced, almost in passing, an idea worth making explicit. During the outage, the system existed in three simultaneously valid states, depending on whose vantage point you took: the data plane saw a healthy service, the control plane saw correctly-served DNS, the client saw nothing.

This is not a metaphor. It is structural, and it has a theoretical grounding. Lamport’s 1978 paper Time, Clocks, and the Ordering of Events in a Distributed System established that in a distributed system without a shared global clock, the only well-defined ordering between events is the happens-before relation →, defined transitively by causal message passing. Events that are not connected by → are concurrent, and concurrent events have no meaningful temporal ordering across observers; each observer may legitimately see them in a different sequence. Each observer constructs a partial order from the messages it has received, and the partial orders need not agree.
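One way to make the partial order tangible is with vector clocks. Note the substitution: vector clocks are a later refinement, not the scalar logical clocks of the 1978 paper, and they are used here because they can detect concurrency directly. The process names and events are hypothetical.

```python
def compare(a: dict[str, int], b: dict[str, int]) -> str:
    """Relate two vector clocks: 'before', 'after', 'equal', or 'concurrent'."""
    keys = a.keys() | b.keys()
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happens-before b
    if b_le_a:
        return "after"       # b happens-before a
    return "concurrent"      # no causal path either way

# Two sites accept a write each while unable to exchange messages.
east_write = {"east": 1}
west_write = {"west": 1}
# Later, west receives east's state and records a further event.
west_after_sync = {"east": 1, "west": 2}

print(compare(east_write, west_write))        # concurrent
print(compare(east_write, west_after_sync))   # before
```

For the concurrent pair there is no fact of the matter about which write came first; any ordering an observer reports is that observer's own construction.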

Any distributed system at scale has at least three such frames, and they almost never agree.

The data plane frame is the view of the components doing the actual work. Storage nodes know whether they are reachable on their primary network interfaces and whether their disks are responding. Compute hosts know whether their workers are processing requests. From this frame, the system is described in terms of internal state: queue depths, lock contention, GC pauses, file descriptor counts.

The control plane frame is the view of the systems that manage the data plane. Schedulers, load balancers, service discovery, deployment systems, autoscalers. The control plane sees the data plane through its own observations, typically lagged metrics and periodic health checks. From this frame, the system is described in terms of declarative goals and reconciliation: how many instances should be running, how many are running, what is the gap?

The client frame is the view of whatever is trying to use the system. This includes external customers but also internal services that depend on the one in question. The client sees the system only through its responses: latency, error rate, correctness. From this frame, the system is described in terms of the service contract being honoured or not.

In a healthy system, all three frames produce consistent descriptions. The data plane is processing requests, the control plane sees that the data plane is processing requests, and the client receives correct responses. This consistency is what allows operators to use any one frame as a proxy for the other two.

In an unhealthy system, the frames diverge, and the pattern of divergence is diagnostic. Data plane healthy, client failing: the failure is in the path between them, usually DNS, routing, or load balancing. Control plane healthy, data plane degraded: the control plane is observing a stale or filtered view of the data plane.

All three frames disagreeing: the system has entered a regime its designers did not anticipate, and the on-call engineer is going to have a long night.

Frame divergence, mechanically: GitHub, 21 October 2018

The 2018 GitHub MySQL split-brain is the textbook case of frame divergence, and it is worth reconstructing because the mechanism turns on which observers were in which partition.

At 22:52 UTC on 21 October, routine maintenance to replace failing 100G optical equipment severed connectivity between GitHub’s US East Coast network hub and its primary US East Coast data centre. The break lasted 43 seconds. Not enough for a human to react. More than enough for everything that follows.

GitHub ran MySQL in a topology managed by Orchestrator, an open-source replication-topology manager that, in GitHub’s configuration, uses Raft consensus among its own nodes to decide when to promote replicas. The primary was in US East.

Replicas existed in US West and in a public-cloud region. Orchestrator nodes were distributed across all three. Crucially, Orchestrator’s automated failover was configured to promote across regional boundaries.

The Raft protocol requires a strict majority for any decision: for a cluster of n nodes, ⌊n/2⌋ + 1 must agree. When the East Coast data centre dropped off the network, the Orchestrator nodes inside it were partitioned with it. The remaining nodes (US West plus US East public cloud) retained a quorum.
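The quorum rule is simple enough to check by hand. In the sketch below the per-site node counts are an assumption made for illustration (one node per site); only the site names follow the account above.

```python
def quorum(n: int) -> int:
    """Smallest strict majority of n voters: floor(n/2) + 1."""
    return n // 2 + 1

# Hypothetical placement of Orchestrator nodes; counts are assumed, not sourced.
nodes = {"us-east-dc": 1, "us-west-dc": 1, "us-east-cloud": 1}
total = sum(nodes.values())

def side_has_quorum(sites: set[str]) -> bool:
    """Can the nodes in these sites, cut off from the rest, still decide?"""
    return sum(nodes[s] for s in sites) >= quorum(total)

isolated = {"us-east-dc"}
rest = set(nodes) - isolated
print("isolated side can decide: ", side_has_quorum(isolated))  # False
print("remaining side can decide:", side_has_quorum(rest))      # True
```

At most one side of any partition can hold a strict majority, which is what lets the reachable side act without hearing from the other.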

From their frame, the primary had failed; the only action under Raft’s liveness assumption was to elect a new leader and promote. Within seconds of the partition forming, the Orchestrator quorum began the leadership deselection process, opened a new Raft term, and promoted a US West MySQL replica to primary. Application traffic in the unaffected regions began flowing to it.

This is, in the language of CAP, the choice MySQL replication and Orchestrator had been configured to make: availability over consistency. When the network partitioned, the system preserved availability (writes continued to be accepted somewhere) at the cost of consistency (writes accepted in one partition were unknown to the other). Daniel Abadi’s PACELC refinement (2010) makes the framing sharper: if Partitioned, choose between Availability and Consistency; Else, choose between Latency and Consistency. GitHub’s topology had chosen PA/EL: availability under partition, latency in the steady state. The cost of that choice was paid, in full, during the 43 seconds the partition was active and the 40 minutes that followed.

The partition healed 43 seconds later. From the East Coast frame, nothing had happened: the local MySQL primary had continued serving writes the whole time, because applications in East continued routing to it. From the West Coast frame, it was now the primary, and writes were flowing in. Both databases had accepted writes for the duration of the partition, neither aware of the other’s writes.

Cross-region MySQL replication carries some lag in steady state, which is the window during which writes acknowledged to clients on East had not yet reached West and were therefore not present on the newly-promoted West primary.

The trap closed in the next forty minutes. Once connectivity was restored, GitHub’s application tier saw the new West Coast primary and began directing writes to it. For nearly 40 minutes, the West Coast accepted writes that the East Coast primary did not see. Meanwhile, the East Coast primary contained the few seconds of writes from the partition window that had never been replicated to West.

When engineers locked deployment tooling and assessed state, they found two databases with divergent histories. Reconciling by failing back to East would discard the 40 minutes of West Coast writes. Failing forward on West would discard the East Coast partition-window writes. Neither was acceptable.
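The shape of that dilemma can be shown with two toy write logs. The entries below are invented placeholders, not GitHub data; the point is only that both suffixes are non-empty, so picking either side as the truth discards something.

```python
def divergence(a: list[str], b: list[str]) -> tuple[list[str], list[str], list[str]]:
    """Split two logs into (common prefix, suffix only in a, suffix only in b)."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return a[:i], a[i:], b[i:]

# Hypothetical histories: shared writes, then each side continues alone.
east = ["w1", "w2", "w3", "east-4", "east-5"]
west = ["w1", "w2", "w3", "west-4", "west-5", "west-6"]

shared, east_only, west_only = divergence(east, west)
print("shared:                ", shared)
print("lost by failing forward:", east_only)
print("lost by failing back:   ", west_only)
```

Recovering both suffixes requires replaying one side's writes onto the other, which is the manual reconciliation described next.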

GitHub chose to fail forward, preserving the 40 minutes of West Coast writes at the cost of latency: applications on the East Coast now had to make a cross-country round trip for every database call, adding cross-country latency to operations that had been designed to complete in local-region time.

The site was effectively degraded for 24 hours and 11 minutes while data was restored from backups, replication was rebuilt, and the orphaned East Coast partition-window writes were manually reconciled from binary logs. As the post-mortem records: one of the busiest clusters in the partition window contained 954 writes that had to be reconciled by hand.

From the East Coast control plane’s frame, the system had operated correctly: writes were accepted, replication was healthy locally, no monitoring fired. From the Orchestrator quorum’s frame, failover happened exactly as designed when the primary became unreachable.

From the data plane’s frame, two independent write histories existed during a window neither side observed in full. From the client frame, some writes had succeeded that, in a consistent universe, would have been rejected.

The 43 seconds of partition were the trigger. The frame divergence was the failure. The trigger was unavoidable; physical networks partition.

The deeper unavoidability was theoretical: the FLP impossibility result (Fischer, Lynch, Paterson, 1985) proves that no asynchronous deterministic consensus protocol can guarantee both safety and liveness in the presence of even a single crash failure.

Every real consensus system, including Raft, breaks this tie pragmatically: in Raft’s case, by leaning on timing assumptions to maintain liveness. Those timing assumptions are exactly what fail during a network partition; the algorithm has no choice but to make progress on the basis of local observations that, during the partition, were sufficient for a local quorum decision but insufficient to determine the global state.

GitHub’s subsequent re-architecture eliminated cross-region automatic failover precisely because no observation surface available in real time was sufficient to detect the divergence as it was happening.

They moved the consistency/availability tradeoff from automatic to human-in-the-loop, accepting longer mean-time-to-recovery in exchange for not making this specific decision wrong again.

The Heisenbug problem

There is a class of bug in distributed systems that disappears when you try to observe it. The folklore name is Heisenbug, after the uncertainty principle; the technical name is observation-dependent failure.

These bugs exist because adding observation to a system changes the system. Logging a value takes time, which changes the timing of the surrounding code, which changes the order in which concurrent operations interleave, which changes whether the race condition fires.

Capturing a stack trace acquires locks, which changes contention, which changes which thread reaches the critical section first. The act of looking at the system alters what the system does.

This is, again, not a metaphor. Modern observability tooling routinely takes 1-5% of a service’s CPU and a measurable fraction of its memory. eBPF-based profilers, distributed tracing, log aggregation, all of them consume resources that come from the same pool the application uses to do its work. In the steady state, the cost is acceptable.

In the regime where the system is already saturated, adding observation can be the perturbation that pushes the system from a marginal state into a failure state.

The mitigation is not to remove observation, which would leave the system unobservable. The mitigation is to design observation to be low-overhead and load-shedding: when the system is saturated, observation is the first thing to drop, not the last.

In practice that means sampling rather than full capture, head-based sampling rather than tail-based, and structured events written to local buffers rather than sent over synchronous network calls.
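A minimal sketch of a recorder built on those three choices, using only the Python standard library. The class name, capacity, and sample rate are hypothetical; a production tracer would differ in many details.

```python
import queue
import random

class SheddingRecorder:
    """Head-sampled, non-blocking recorder: drops events rather than stalling callers."""

    def __init__(self, capacity: int = 1024, sample_rate: float = 0.1, rng=random.random):
        self._buf = queue.Queue(maxsize=capacity)   # bounded local buffer
        self._rate = sample_rate
        self._rng = rng
        self.dropped = 0

    def record(self, event: dict) -> bool:
        if self._rng() >= self._rate:       # head-based: decide before doing any work
            return False
        try:
            self._buf.put_nowait(event)     # never blocks the request path
            return True
        except queue.Full:
            self.dropped += 1               # shed observation, keep serving
            return False

    def drain(self, max_items: int = 100) -> list:
        """Called by a background exporter; pulls up to max_items buffered events."""
        out = []
        while len(out) < max_items:
            try:
                out.append(self._buf.get_nowait())
            except queue.Empty:
                break
        return out

rec = SheddingRecorder(capacity=2, sample_rate=1.0)
accepted = [rec.record({"req": i}) for i in range(5)]
print(accepted, "dropped:", rec.dropped)
```

When the buffer is full the request path pays for one failed enqueue and moves on. The `dropped` counter matters: it is the record of what the observation surface chose not to see.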

The modern technical answer to this is eBPF: in-kernel observation programs verified for safety at load time and executed in response to kernel events. Because events are aggregated in kernel space and written into perf event arrays or BPF ring buffers shared with user-space readers via mmap, recording an event does not require a syscall on the application path.

The cost of recording an event collapses to a few cache-line writes, with no context switch and no allocator pressure on the application path.

The Linux kernel’s eBPF verifier statically proves termination and memory safety at program load time, which means an observation program cannot crash the kernel even if it has a bug; bpftrace, BCC, and Cilium’s Hubble all build on this substrate.

The implication for the Heisenbug problem is that, for many workloads, eBPF-based observation has overhead small enough that observing the system no longer meaningfully alters the system’s behaviour.

Each of these sampling and buffering decisions trades observability for the ability to keep the system running while it is being observed. The trade is acceptable only if you have decided it deliberately.

In most systems, it has been decided by accident, by whatever the default configuration of the tracing library is, set by whichever engineer integrated it.

But the deepest version of the observation-disturbing-the-system problem is not about overhead. It is about circular dependency between the observation surface and the thing being observed.

When the monitoring system depends on the system it is monitoring, a failure in the monitored system blinds the monitoring system, and the operator loses access to the diagnostic data at precisely the moment they need it most.

There is no better illustration of this than the 73 hours that began at 13:37 PDT on 28 October 2021.

The Roblox outage, mechanically

Roblox at the time ran more than 18,000 servers and 170,000 containers across its own data centres, orchestrated using the HashiCorp stack: Nomad for scheduling, Vault for secrets, and Consul for service discovery, health checks, session locking, and as a KV store.

Consul was the central nervous system. Every service depended on it to find its peers.

A single Consul cluster supported the entire backend: 5 voter nodes plus 5 non-voter read replicas. This was, as the post-mortem would later note, a single point of failure of a kind that violated every textbook lesson about blast-radius isolation.

In the months leading up to October, Roblox had upgraded from Consul 1.9 to 1.10 to take advantage of a new streaming feature designed to reduce CPU and network bandwidth on large clusters. The feature had been incrementally enabled across services without incident.

On 27 October at 14:00, the day before the outage, it was enabled on the traffic routing service, and the number of routing nodes was increased by 50% in anticipation of end-of-year traffic.

At 13:37 PDT on 28 October, Vault performance began to degrade and a single Consul server began exhibiting high CPU load. Engineers began to investigate; users were not yet impacted. The first signal was unusual write latency on Consul’s underlying KV store: the 50th percentile, normally under 300 ms, had climbed to 2 seconds.

The cluster was failing for two reasons that interacted, neither of which engineers identified for days.

The first was the streaming feature itself. HashiCorp would later explain that streaming, while overall more efficient than long polling, used fewer concurrency control elements (Go channels) in its implementation.

Under very high read and very high write load, the design exacerbated contention on a single Go channel, blocking writes and consuming CPU in kernel spin locks along the streaming subscription code path. This pathology had not appeared in HashiCorp’s pre-release benchmarks because it required the specific combination of large stream count and high churn rate that Roblox’s workload produced.

The second was buried inside Consul’s persistence layer. Consul uses BoltDB, an embedded Go key-value store inspired by LMDB’s memory-mapped design, to persist its Raft write-ahead log.

BoltDB’s design is a single memory-mapped file organised as a copy-on-write B+tree: every write transaction allocates new pages, never modifies existing ones, and commits by atomically swapping a single root pointer. This gives crash-safety, at the cost of page churn.

When pages become unreachable, BoltDB does not release them to the OS. Instead it tracks them in a freelist of free page IDs, which is rewritten in its entirety on every transaction commit. At normal scale, freelist maintenance is negligible. At Roblox’s scale, after months of accumulated Raft log writes, the freelist had grown pathologically.

The post-mortem provides the actual numbers, taken from a Consul server during the incident: the 4.2 GB Raft log store contained only 489 MB of actual data. The remaining 3.8 GB was empty space, tracked as free pages.

The freelist tracking those pages had grown to 7.8 MB, containing nearly a million free page IDs. For every Raft log append, with all the batching Consul applies, a write of 16 KB or less was triggering a rewrite of the entire 7.8 MB freelist to disk.

This is the pathology. Each transaction commit performed: a search of the million-entry freelist for free pages; an update to that freelist; serialisation of the entire 7.8 MB freelist to disk; and an fsync(2) whose cost was dominated by the size of the dirty page set, which was now dominated by the freelist itself.

The work scaled linearly with freelist length, and the freelist length grew with every snapshot the system performed to keep itself trim.
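The post-mortem figures are internally consistent, which a few lines of arithmetic confirm. Two values below are assumptions rather than numbers from the post-mortem: an 8-byte page ID per freelist entry and a 4 KiB page size.

```python
PAGE_ID_BYTES = 8        # assumption: one 64-bit page ID per freelist entry
PAGE_SIZE = 4096         # assumption: 4 KiB pages

free_bytes = 3.8e9       # "3.8 GB was empty space"
freelist_bytes = 7.8e6   # "7.8 MB" freelist
append_bytes = 16e3      # "16 KB or less" per Raft log append

free_pages = free_bytes / PAGE_SIZE
page_ids = freelist_bytes / PAGE_ID_BYTES
amplification = freelist_bytes / append_bytes

print(f"free pages implied by 3.8 GB of free space: {free_pages:,.0f}")
print(f"page IDs implied by a 7.8 MB freelist:      {page_ids:,.0f}")
print(f"freelist bytes written per appended byte:   {amplification:,.0f}x")
```

Both estimates land a little under one million pages, matching "nearly a million free page IDs", and the write amplification comes out at several hundred to one before fsync cost is counted.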

Raft, sitting on top of BoltDB, has timing assumptions. The leader replicates log entries to followers and must commit them durably.

When BoltDB commit latency entered the multi-second range, leaders could not durably persist log entries fast enough; followers timed out, started elections, and a new leader was chosen, which inherited the same BoltDB file, performed the same expensive freelist operations, became slow, lost its leadership in turn, and triggered another election.

The cluster was alive but unable to make progress: a Raft cluster trapped in a leader-flap loop, with each leader’s presence too brief to commit meaningful work.

This was an internal failure at the level of database page management, observed at the level of cluster leadership stability. The two layers were not connected in anyone’s mental model of the system. They were connected through a freelist data structure most engineers did not know existed.

The team’s first hypotheses, in order, were the ones the operator’s model suggested. They suspected degraded hardware and replaced a Consul node. Performance continued to suffer.

They suspected capacity, and replaced all the Consul nodes with new machines: 128 cores (up from 64) on faster NVMe SSDs. As the post-mortem would later document, this made things worse: the new servers were dual-socket NUMA architectures, and the additional cores meant additional concurrent goroutines contending on the same Go channel in the streaming code path.

Cross-socket memory access added latency to operations that had been local on the old 64-core single-socket machines.

By 16:35 PDT on the 28th, concurrent users had dropped to 50% of normal. Subsequent attempts (resetting the cluster from a snapshot, blocking incoming traffic with iptables to bring it back under controlled conditions, reducing health-check frequency from once every 60 seconds to once every 10 minutes to give the cluster breathing room) all stabilised the system briefly and then returned it to the same 2-second KV write latency.

None of these interventions worked because none of them addressed the actual mechanism. The post-mortem is explicit: engineers did not identify the BoltDB freelist issue during the incident. HashiCorp engineers determined the root cause in the days after the outage ended.

This is where the observer problem becomes operationally devastating. Roblox’s monitoring infrastructure depended on Consul. When Consul was unhealthy, the dashboards that would have shown engineers what was happening inside Consul were themselves unable to report. The post-mortem describes this directly:

“There was a circular dependency between our telemetry systems and Consul, which meant that when Consul was unhealthy, we lacked the telemetry data that would have made it easier for us to figure out what was wrong.”

The diagnostic question the operator needed to ask, what is the actual state of the BoltDB file inside the affected Consul instances, required telemetry that the affected systems were supposed to provide. They could not.

Engineers were debugging a system whose internal state was now unobservable, against a failure mode buried two software layers deep in an open-source dependency that the affected engineers had not personally written.

The breakthrough came at 15:51 PDT on 30 October, roughly 50 hours after the outage began, when engineers disabled the streaming feature across all Consul systems. KV write latency immediately returned to 300 ms.

The Heisenbug-disguised streaming contention had been suppressed; the underlying BoltDB freelist problem was still there, manifesting as a “slow leader” symptom in which certain leaders inherited the worst of the freelist state.

The team pragmatically worked around it by preventing those leaders from staying elected, and continued the long process of repopulating caches and restarting services.

Total downtime: 73 hours as reported, ending at 16:45 PDT on 31 October, when 100% of players were given access. Measured from the first signal at 13:37 PDT on 28 October, the elapsed time is closer to 75 hours.

The trigger was the interaction between Consul streaming and BoltDB’s freelist; both bugs, both fixable, both ultimately fixed (the BoltDB freelist issue was resolved by migration to bbolt, the etcd-io fork of BoltDB, which uses a hashmap-based freelist).

The outage duration was a property of the observation surface. With working telemetry into Consul’s internal state, the freelist issue would have been visible in hours, not days. Without it, engineers were solving a Heisenbug with their eyes closed.

The lesson Roblox drew, encoded explicitly in their post-mortem and in the architectural changes that followed, was that observation surfaces must be independent of the systems they observe.

Telemetry must run on infrastructure that does not depend on the thing being measured. If the monitoring stack and the production stack share a common substrate, a failure in that substrate blinds both at once, and the operator is left to debug a fully dark system.

This is, in the language Part I used, an invariant at the boundary between the observation system and the production system: the observability stack must function independently of any system whose state it reports.
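An invariant like this can be checked mechanically if the dependency graph is written down. The sketch below is one possible check, with a hypothetical graph loosely shaped like the incident above; the component names and edges are illustrative, not taken from the post-mortem.

```python
def reachable(graph: dict[str, set[str]], start: str) -> set[str]:
    """All components that `start` transitively depends on."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

def violations(graph: dict[str, set[str]],
               observers: dict[str, set[str]]) -> list[tuple[str, str]]:
    """(observer, target) pairs where the observer depends on what it reports on."""
    return sorted((obs, target)
                  for obs, targets in observers.items()
                  for target in targets
                  if target in reachable(graph, obs))

# Hypothetical: "X depends on Y" edges, and which systems each observer reports on.
deps = {
    "dashboards": {"telemetry"},
    "telemetry": {"consul"},
    "consul": {"boltdb"},
}
observers = {"dashboards": {"consul"}, "telemetry": {"consul"}}

print(violations(deps, observers))
# [('dashboards', 'consul'), ('telemetry', 'consul')]
```

Run in CI against a maintained dependency map, a check of this kind turns the unwritten assumption into a failing build. The hard part is keeping the map honest, not the traversal.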

The invariant is rarely enforced. It is rarely even written down. It is one of the assumptions the operator does not know they have made until the day Consul stops responding and the dashboards go dark with it.

The operator’s simulation

Every operator, when debugging a system, is in fact debugging a model of the system that lives in their own head.

The model was built from architecture documents, code reading, prior incidents, and conversation with colleagues. The model is a simplification, by necessity, because the system is too complex to fit in one head.

The model is also wrong, in specific ways, and the operator does not know which specific ways until the model and the system diverge.

This determines whether an incident is resolved in twenty minutes or six hours. During an outage, the operator performs inference: symptoms are observed, hypotheses are generated from the model, tests of those hypotheses are designed and executed. The hypothesis space the operator can explore is bounded by the model the operator has.

If the failure mode lives outside the model, the operator cannot generate the hypothesis that would lead to it. They will iterate, increasingly desperately, on hypotheses inside the model, none of which fit the symptoms, until either someone with a better model joins the call or the system recovers on its own.

The Cloudflare incident from Part I is partly an instance of this. The first ninety minutes were spent on the hypothesis that the failure was an external DDoS attack, because the oscillation between good and bad states matched the signature of intermittent external pressure better than it matched the signature of anything Cloudflare’s engineers had a model for.

The model said “oscillation means external adversary.” The reality was that oscillation meant gradual rollout against a five-minute regeneration cycle, but that failure mode was not in anyone’s model until the post-mortem.

Richard Cook’s How Complex Systems Fail (1998) names this directly: every operator’s view of the system is a practitioner-constructed simulation, assembled from training, experience, and the artefacts of past incidents.

The simulation diverges from reality continuously and silently. The role of the practitioner during an incident is to detect the divergence, update the simulation, and act on the updated version, all in real time, under pressure, with incomplete information.

The systems that recover quickly from incidents are not the systems with the best dashboards. They are the systems whose operators have the best simulations, and whose dashboards expose enough raw data that the simulation can be corrected in flight.

The discipline of seeing what is there

The compressed form of this essay, the operational counterpart to Part I’s diagnostic question, is also a question, asked of every observation surface in the system: what is this metric not telling me, and what would I look at if it were lying?

Every dashboard has a set of failure modes for which it is the right view, and a set of failure modes for which it is misleading. Both sets are large. The first is documented (it is the reason the dashboard was built). The second is almost never documented, because the failure modes it contains are, by definition, the ones nobody anticipated.

The discipline is to know, for every observation, what it is showing and what it is hiding.

The metric is an aggregate over what dimensions? Over what time window? With what sampling?
Does the observation surface share a substrate with the system it observes?
What would a failure that is invisible on this metric look like, and what other view would catch it?
When the metric and the user experience disagree, which one should be believed?

The answer to the last question is always: the user experience. The metric is a model. The user is in the data plane. The data plane is the system.

The discipline of observation-aware engineering is not to build dashboards that show everything; that is impossible, and the attempt produces dashboards that show nothing useful.

It is to know, for every dashboard, what it cannot show, to keep the raw event stream queryable for the moments when the dashboard and the system disagree, and to keep the observation stack architecturally independent of the systems it watches.

This is what separates the systems where the operator spends the first hour of an incident narrowing the hypothesis from the systems where the operator spends the first hour arguing about whether the dashboards are correct.

Part III will move from observation to action: what it takes to operate a system whose state you cannot fully see, whose feedback loops fight you, and whose composition you do not control. The technical name for this is control theory under uncertainty. The operational name is on-call.

One more thing…

The reason these failures keep happening is that engineers are trained to reason about the logical structure of their systems, not the physical dynamics of how those systems execute under load.

The same gap exists, in concentrated form, in GPU programming. CUDA correctness is necessary but not sufficient. Performance lives entirely in how memory traffic, warp scheduling, instruction issue, and inter-block synchronisation interact under realistic workloads: the same composition-and-observation problem, compressed into a single die.

I wrote a deep guide on CUDA from this perspective: not isolated tricks, but how to reason about the GPU as a coupled dynamical system whose performance regimes are as discontinuous as any distributed system’s.

Read the CUDA Guide on Gumroad

How Systems Really Fail, Part I

Lorenzo Bradanini — Mon, 11 May 2026 11:08:11 GMT

Intro

There is a version of distributed systems that exists in textbooks, RFCs, and the architecture diagrams that get drawn during the first month of a new project.

In that version, failures are discrete events: a node dies, a network partitions, a disk fills. Each event has a name. Each name has a mitigation. The mitigations compose. The composition is correct.

Then there is the version of distributed systems that exists in production at 03:47 UTC, when the on-call engineer is staring at a dashboard that shows everything green except for a customer-impact metric that has been climbing for nine minutes.

The runbook does not apply because it was written for the system that existed before the migration, the last engineer who understood the offending subsystem left the company seven months ago, and the only documentation is a Confluence page from 2023 that contradicts itself in the third paragraph.

This series is about the second version.

It is not about how to design distributed systems. There are good books for that. It is about what happens to those designs after they meet reality: after the load grows by a factor of fifty, after three reorgs change the ownership of half the services, after the configuration file that was supposed to be immutable acquires a small permissions change on a Tuesday morning in November.

It is about the failure modes that emerge not from broken components but from the interaction between working ones. It is about why debugging at scale is not a technical activity but an epistemic one.

And it is about the design decisions, often made years before the outage, that determine whether a system has a fighting chance when the failure arrives.

Five essays. Each stands alone. They share a thesis: the gap between how engineers reason about systems and how systems actually behave is not a knowledge problem. It is a structural property of complexity.

The faster you accept this, the better your systems will be.

The Composition Problem

On Monday, 17 November 2025, an engineer at Cloudflare merged a change to a permissions policy on the company’s ClickHouse database clusters.

The change was part of a long-running effort to migrate distributed queries from a shared system account to per-user authentication, so that query limits and access grants could be evaluated at finer granularity. It was the right kind of change. Reviewed, staged, rolled out gradually across cluster nodes, exactly as a careful operator would do it.

At 11:05 UTC the following morning, the rollout reached a critical threshold. Twenty-three minutes later, the internet broke.

At 11:28 UTC, Cloudflare’s network, which fronts roughly 20% of the websites on the public internet, began returning HTTP 5xx errors at scale. ChatGPT failed. X failed. Spotify, Discord, Canva, Figma, 1Password, Trello.

The outage lasted until 14:30 UTC for core traffic, with full restoration at 17:06 UTC. Matthew Prince, Cloudflare’s CEO, would later describe it as the worst outage since 2019. Estimated revenue loss across the affected ecosystem ran into the hundreds of millions of dollars.

The chain of causation, once it was understood, fits in a paragraph.

Cloudflare’s Bot Management module runs inside its core proxy (a system called FL, with a newer version FL2). The module scores every request as bot-or-human using a machine-learning model.

That model takes as input a “feature configuration file”, a list of per-request features, which is regenerated every five minutes by a query against a ClickHouse cluster. The regeneration query reads from system.columns, ClickHouse’s metadata table:

SELECT name, type
FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;

Note what is not in this query: a filter on the database name. The query implicitly assumed that system.columns would only return columns from the default database, because before the permissions migration users only had visibility into default.

ClickHouse’s distributed table engine actually stores shards in an underlying physical schema named r0. The new permissions policy granted explicit access to r0. After the change, the same query returned columns from both default and r0, more than doubling the row count.

That row count was used directly to construct the feature file. The file had previously contained around 60 features. It now contained more than 200.
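The effect of the missing predicate can be shown with a small in-memory sketch. The rows below are hypothetical stand-ins for what system.columns might return, and feature_names is an illustrative helper, not Cloudflare's code; the only point is that pinning the database restores the pre-migration cardinality.

```python
# Hypothetical rows as (database, table, column_name, column_type).
# After the permissions change, the same columns are visible in both
# the `default` schema and the underlying `r0` schema.
ROWS = [
    ("default", "http_requests_features", "feat_a", "Float64"),
    ("default", "http_requests_features", "feat_b", "Float64"),
    ("r0", "http_requests_features", "feat_a", "Float64"),
    ("r0", "http_requests_features", "feat_b", "Float64"),
]


def feature_names(rows, table, database=None):
    """Return sorted column names for `table`, optionally pinned to one database."""
    return sorted(
        name
        for db, tbl, name, _type in rows
        if tbl == table and (database is None or db == database)
    )


unfiltered = feature_names(ROWS, "http_requests_features")
filtered = feature_names(ROWS, "http_requests_features", database="default")
print(len(unfiltered), len(filtered))
```

The unfiltered call sees every row the caller is permitted to see, so its result size depends on the permissions model rather than on the query text.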

Downstream, in the Rust code that loaded the file into the FL2 proxy, there was a preallocated array sized for a hard ceiling of exactly 200 features: a performance optimisation so that runtime feature lookups would never allocate.

When the oversized file arrived, the load path returned Err(_). The calling code, written under the assumption that this could not happen, called .unwrap() on the Result.

The worker thread panicked with the now-public string:

thread fl2_worker_thread panicked: called Result::unwrap() on an Err value

Every request routed through that worker returned 5xx.

The damage was amplified by a second-order property. The permissions change was being rolled out gradually across the ClickHouse cluster, so for nearly an hour only some cluster nodes returned the duplicated result.

The feature file regenerated every five minutes, and whether the run hit an upgraded node or a non-upgraded node was effectively random.

The file therefore alternated, every five minutes, between “good” and “bad,” and the proxy fleet oscillated between recovery and failure on a five-minute cycle.
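A toy simulation makes the flapping concrete. It assumes each regeneration run lands on an upgraded node with a fixed probability; the function name, the probability, and the seed are all illustrative choices, not details from the incident report.

```python
import random


def simulate_runs(upgraded_fraction, n_runs, seed=0):
    """Each regeneration run lands on an upgraded node with the given probability.

    An upgraded node returns duplicated rows, so the resulting file is "bad".
    """
    rng = random.Random(seed)
    return ["bad" if rng.random() < upgraded_fraction else "good" for _ in range(n_runs)]


# Twelve five-minute runs cover one hour of a half-finished rollout.
timeline = simulate_runs(upgraded_fraction=0.5, n_runs=12)
print(" ".join(timeline))
```

As the upgraded fraction approaches 1.0 the "good" runs disappear and the flapping settles into steady failure.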

From the dashboards, this looked exactly like an active DDoS attack, an external adversary probing the network with intermittent pressure.

The incident commander spent the first two hours of the outage investigating that hypothesis, because the signature of the failure mimicked a known threat.

Read this again. Notice what is not in it.

There was no bug in the database. The new permissions behaviour was correct ClickHouse semantics. There was no bug in the query; it executed exactly as written.

There was no bug in the feature file format; it stored what it was given. There was no bug in the Rust proxy; its bounds check correctly refused to process malformed input rather than corrupting state.

There was also no bug in the deployment process; gradual rollout to a database cluster is exactly how you mitigate rollout risk. Every component, examined in isolation, behaved as designed, as documented, as code-reviewed.

The outage existed in the spaces between the components. It existed in an unwritten assumption, that the cardinality of the metadata query was bounded by the schema layout.

It was present in the gap between the team that owned the permissions migration and the team that owned the feature pipeline.

It even existed in the asymmetry between data Cloudflare treated as “trusted” (internally generated configuration) and data it treated as “untrusted” (everything from outside).

The failure was not a property of any component. It was a property of the system.

This is the central, uncomfortable fact about distributed systems: their failure modes are not documented because they cannot be documented.

They emerge from the composition of components, and the space of possible compositions grows faster than anyone can enumerate it.

Why decomposition breaks down

Software engineering is almost entirely built on decomposition.

You take a hard problem, split it into smaller problems, solve each, and compose the solutions. The discipline assumes (implicitly, almost religiously) that the behaviour of the whole can be derived from the behaviour of the parts.

This is the foundation of modular design, encapsulation, microservices, contracts, type systems. It is what allows ten thousand engineers to build a system no one of them understands in full.

The assumption is wrong, or rather: it holds only within a regime, and the regime ends somewhere around the scale where a system has enough components, enough state, and enough concurrency that the interactions between components become a richer source of behaviour than the components themselves.

The formal version of this argument is older than computer science. Herbert Simon, in The Architecture of Complexity (Proc. Am. Phil. Soc., 1962), distinguished between decomposable and nearly-decomposable systems.

In a decomposable system, interactions between subsystems are negligible compared to interactions within them, and the whole behaves like the sum of independent parts.

In a nearly-decomposable system, this is approximately true on short timescales but not on long ones: the weak inter-subsystem couplings accumulate into qualitatively different behaviour.

Simon’s claim, which has held up for sixty years across biology, economics, and engineering, is that all real systems of significant size are nearly-decomposable, not decomposable.

Distributed systems are an extreme case. The components have clean interfaces and look decomposable on a diagram.

But the interactions are mediated by shared resources (networks, clocks, storage, control planes) and those shared resources transmit perturbations between components in ways the diagram does not show.

A change in one component changes the load profile on the shared network, which changes the queueing behaviour at a different component, which changes the timing of its responses, which changes the retry behaviour of yet another component. The composition is opaque because the couplings are invisible.

Distributed systems theory has known a version of this for forty years. Fischer, Lynch, and Paterson (JACM 1985) proved that consensus is impossible in a purely asynchronous system with even one faulty process, a result that, properly understood, is not about consensus algorithms but about the impossibility of producing globally consistent system behaviour from locally correct components under partial failure.

Brewer’s CAP conjecture (PODC 2000) and the Gilbert-Lynch proof (ACM SIGACT News, 2002) formalised the same point at the level of state.

Lamport’s “Time, Clocks, and the Ordering of Events in a Distributed System” (CACM 1978) showed that there is no observer-independent simultaneity in a distributed system without explicit synchronisation, meaning every “global view” of the system is a stitched-together fiction.

The classical literature focused on discrete failures: a node dies, a message is lost, a clock drifts. The modern failures are stranger.

They are failures of coupling: moments when two pieces of working software, communicating through an interface both implement correctly, produce a behaviour neither would produce alone.

The Cloudflare incident is one. The DynamoDB DNS race condition that took down AWS US-EAST-1 on 19–20 October 2025 is a more elaborate example of the same pattern, and it is worth reconstructing mechanically because it shows how thoroughly the composition can betray its components.

The AWS DynamoDB cascade, mechanically

DynamoDB’s regional endpoint, dynamodb.us-east-1.amazonaws.com, is served by an internal DNS management system that exists because DynamoDB maintains hundreds of thousands of DNS records for a very large fleet of load balancers, and those records must be updated continuously as capacity is added, removed, and rebalanced.

The system has two logical components. The DNS Planner monitors load-balancer health and produces “DNS plans”, versioned snapshots of which load balancers receive which fraction of regional traffic.

The DNS Enactor reads plans and applies them to Route 53, AWS’s DNS service. For availability, three Enactors run in parallel, one per availability zone. They operate concurrently and independently; no distributed lock, no leader election, no coordination protocol.

The system was designed this way deliberately, so a single Enactor crashing mid-run would not stall propagation; the other two would simply pick up subsequent plans and continue.

To prevent stale plans from overwriting newer ones, each Enactor performs a freshness check before applying a plan.

To prevent unbounded growth of historical plans, each Enactor also performs a cleanup pass after applying its current plan, deleting plans significantly older than the current one.

The freshness check happens once, at the start of the application phase. The cleanup happens once, at the end.

This is, again, the kind of design that gets praised in code review. Independent. Stateless. Fault-tolerant. Each component does one well-bounded job.

Now consider what actually happened. At 23:48 PDT on 19 October (06:48 UTC on 20 October), Enactor A read plan #N−1 from the Planner and began applying it to Route 53.

For reasons AWS’s post-mortem describes as “unusual delays”, likely network-mediated queueing inside Route 53’s control plane, Enactor A’s update run took longer than normal.

In the meantime, the Planner produced plan #N. Enactor B picked up plan #N, performed its freshness check (newer than the currently applied plan: pass), and began its own update run.

Enactor B finished first, applying #N to Route 53. It then began its cleanup pass, scanning for plans significantly older than #N and deleting them.

By the time Enactor A finished its delayed run and went to apply the last few records of plan #N−1, Enactor B had already applied #N to those same records.

Enactor A’s freshness check, made at the start of its run, had not detected this; the check was made when #N−1 was still the freshest plan, and that result was now stale. Enactor A overwrote those records with #N−1.

Now Enactor B’s cleanup pass arrived at plan #N−1. By Enactor B’s bookkeeping, #N−1 was significantly older than #N. Enactor B deleted plan #N−1. But Enactor A had just applied #N−1 to the regional endpoint records.

The records now pointed at a plan that did not exist. Route 53 dutifully served what it had: an empty answer set for dynamodb.us-east-1.amazonaws.com.
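The interleaving above is a check-then-act race, and it can be replayed deterministically in a few lines. The sketch below is a toy model under stated assumptions: DnsState, the version numbers, and the addresses are invented for illustration, and the re-check variant is one possible mitigation, not a description of what AWS deployed.

```python
class DnsState:
    """Toy model: the endpoint record references a plan by version number."""

    def __init__(self):
        self.plans = {1: ["10.0.0.1"], 2: ["10.0.0.2"]}  # version -> addresses
        self.applied = 0  # version currently referenced by the endpoint record

    def is_fresh(self, version):
        return version > self.applied

    def apply(self, version):
        self.applied = version

    def cleanup(self, current):
        # Delete every plan older than the one this Enactor just applied.
        for version in [v for v in self.plans if v < current]:
            del self.plans[version]

    def resolve(self):
        # A record that references a deleted plan resolves to no addresses.
        return self.plans.get(self.applied, [])


def racy_interleaving():
    state = DnsState()
    a_ok = state.is_fresh(1)   # Enactor A checks plan 1 at the start of its run
    b_ok = state.is_fresh(2)   # Enactor B checks plan 2 while A is delayed
    if b_ok:
        state.apply(2)         # B finishes first
    if a_ok:
        state.apply(1)         # A acts on a check result that is now stale
    state.cleanup(2)           # B's cleanup deletes plan 1
    return state.resolve()


def checked_interleaving():
    state = DnsState()
    state.is_fresh(1)          # A's early check still passes...
    if state.is_fresh(2):
        state.apply(2)
    if state.is_fresh(1):      # ...but A re-checks immediately before writing
        state.apply(1)
    state.cleanup(2)
    return state.resolve()


print(racy_interleaving())     # []
print(checked_interleaving())  # ['10.0.0.2']
```

In a real concurrent system the re-check and the write would have to be a single atomic step (a conditional write, for example); a second check followed by a separate write only narrows the window.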

This is the worst possible DNS response. It is not SERVFAIL or a timeout, which clients treat as transient and retry. It is NOERROR with an empty ANSWER section: semantically “this name exists, intentionally, with zero addresses.” Compliant clients stop. There is no answer to retry.

Within seconds, every system inside and outside AWS that wanted to talk to DynamoDB in us-east-1 began failing to resolve its address. From the DynamoDB control plane’s view, the service was healthy: load balancers up, storage reachable, request handlers idle.

From Route 53’s view, the service was healthy: DNS was returning valid authoritative responses. From clients’ view, the service had ceased to exist. Three different frames of reference, three different “states” of the same service, all simultaneously true within their own frame. The mismatch between them was the outage.

It took manual intervention from on-call engineers to identify the empty record, repair Route 53 by hand, and re-enable normal automation. DynamoDB DNS recovered in approximately three hours.

The cascade that followed lasted ten more hours, and is the second composition failure embedded inside the first. EC2’s DropletWorkflow Manager (DWFM), the system that maintains operational leases on the physical hypervisors hosting customer EC2 instances, stores its lease state in DynamoDB.

While DynamoDB was unreachable, DWFM could not renew leases. Existing leases expired silently. When DynamoDB recovered, DWFM woke up to discover that essentially every hypervisor in the region needed a fresh lease, and tried to issue them all at once.

The lease-renewal subsystem entered what AWS’s post-mortem calls “congestive collapse”, a regime where throughput of useful work approaches zero because the system is spending all its time servicing retries of work that has already timed out.

Network Load Balancer health checks began failing en masse. New EC2 launches were impossible. The region was effectively down for production workloads until late that evening.

Every design decision in this chain was defensible. Three Enactors instead of one, for availability. Freshness check, to prevent old plans winning. Cleanup pass, to prevent unbounded growth. No distributed lock, to avoid coordination overhead and tolerate Enactor failures. DWFM storing state in DynamoDB, because what else would you use for a high-availability lease manager?

Each decision is the textbook answer to a specific risk.

The composition of all those textbook answers produced fifteen hours of regional unavailability and an industry-wide impact measured in hundreds of millions of dollars.

Why documentation cannot close the gap

The instinct, after this kind of incident, is to write better documentation. Add the failure mode to the runbook. Update the architecture diagram. Note the implicit assumption in a comment. Surely, next time, we will know.

We will not. The reason is not laziness; it is combinatorial.

Consider a system with N components, each with a small number of internal states, dependencies, and inputs. The number of pairwise interactions grows as O(N²).

The number of trajectories (sequences of states the system can traverse) grows much faster: for any reasonable model of state and concurrency, at least exponential in N. By the time N is in the low thousands (a serious production system), the trajectory space is unbounded for practical purposes.
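The gap between the two growth rates is easy to underestimate, so here is the arithmetic. The sketch assumes, purely for illustration, that each component has only two local states; the joint state count is then a lower bound on the number of trajectories, since a trajectory is a sequence of such states.

```python
from math import comb


def pairwise_interactions(n):
    """Number of unordered component pairs: n choose 2, which is O(n^2)."""
    return comb(n, 2)


def joint_states(n, states_per_component=2):
    """Size of the joint state space if every component has k local states."""
    return states_per_component ** n


for n in (10, 100, 1000):
    digits = len(str(joint_states(n)))
    print(f"N={n}: {pairwise_interactions(n)} pairs, joint state count has {digits} digits")
```

At N = 1000 the pair count is 499,500, a number a team could in principle review, while the joint state count runs to 302 decimal digits.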

Documentation is a linear medium. It can describe a finite number of states, interactions, and failure modes. The space of actual failure modes is not finite in any meaningful sense.

What documentation actually captures, in practice, is the failure modes that have already happened: the ones recovered from, written up, discussed in architecture review.

This is useful, but it is fundamentally backward-looking. The next outage is, almost by definition, the one not yet documented. It lives in some currently-undocumented region of the trajectory space, which the system will enter for the first time when some perturbation pushes it there.

This is not an indictment of documentation. Runbooks save lives. Post-mortems compound institutional knowledge. The point is that no quantity of documentation, however thorough, can close the gap between the system as designed and the system as composed.

The gap is structural. It widens with scale.

The pattern beneath the patterns

If you read enough post-mortems (Dan Luu’s catalogue on GitHub remains the best free education in this material), a pattern emerges.

The triggers vary wildly: a permissions change, a DNS update, a config push, a deploy, a hardware failure, a thundering herd. The shape of the failure is often the same.

Nathan Bronson and his collaborators, in a 2021 HotOS paper, gave this shape a name: metastable failure. The framing has become foundational, and is worth restating precisely because it is the closest the field has come to a formal theory of why composition produces outages.

A metastable failure occurs in an open system with an uncontrolled load source. The system has at least two self-sustaining operating regimes: a healthy regime, in which a transient perturbation decays back to equilibrium, and a metastable failure regime.

In the failure regime, the system is functioning (consuming CPU, processing messages, producing output) but its useful throughput, which Bronson and his co-authors call goodput, has collapsed.

The system transitions between regimes via a trigger: a load spike, a deploy, a partial failure, a configuration change.

What keeps the system in the failure regime, even after the trigger is removed, is a sustaining effect: a positive feedback loop, usually involving work amplification, in which the system’s response to its own degraded state increases the load on itself further.

The canonical example, paraphrased from the paper:

A web tier calls a database tier through a connection pool. Database latency is normally well below the client’s request timeout. A brief perturbation, such as a network blip or a slow GC pause, causes some requests to exceed the timeout.

The client retries. The retry is a new request, added to the existing load. Database queue depths grow. Latency increases, pushing more requests past the timeout. More retries fire. Each timed-out request still consumed full database work to compute its answer, but no client ever saw it; that work was wasted.

The system is now processing 3× its normal request volume (originals plus retries), succeeding in completing them all, but every client is timing out before the answer arrives. Goodput is zero. Throughput is at saturation. The trigger (the original network blip) is long gone. The retry storm is sustaining the failure regime on its own.
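The retry storm can be reproduced in a discrete-time toy model. Everything below is an assumption chosen for illustration (a single FIFO server, the constants, a cap of three attempts per request); it is a sketch of the mechanism, not a model taken from the paper.

```python
from collections import deque

ARRIVALS = 8       # new client requests per tick
CAPACITY = 10      # requests the database can complete per tick
TIMEOUT = 5        # ticks a client waits before giving up
MAX_ATTEMPTS = 3   # original request plus two retries


def simulate(ticks, blip=range(0)):
    """Return (goodput, served) per tick for a FIFO server with client retries."""
    queue = deque()  # entries are (enqueue_tick, attempt_number)
    goodput, served = [], []
    for t in range(ticks):
        queue.extend((t, 1) for _ in range(ARRIVALS))
        # Requests whose age just exceeded the timeout are retried by the client;
        # the abandoned original stays in the queue and will still consume work.
        retries = [(t, attempt + 1) for enq, attempt in queue
                   if t - enq == TIMEOUT + 1 and attempt < MAX_ATTEMPTS]
        queue.extend(retries)
        budget = 0 if t in blip else CAPACITY
        good = done = 0
        while queue and done < budget:
            enq, _ = queue.popleft()
            done += 1
            if t - enq <= TIMEOUT:
                good += 1  # the client was still waiting for this answer
        goodput.append(good)
        served.append(done)
    return goodput, served


baseline, _ = simulate(60)
storm, storm_served = simulate(60, blip=range(10, 20))
print("baseline goodput, last 5 ticks:", baseline[-5:])
print("storm goodput, last 5 ticks:   ", storm[-5:])
print("storm served, last 5 ticks:    ", storm_served[-5:])
```

Under these assumed constants the baseline run completes 8 useful requests per tick. The run with a ten-tick blip stays at zero goodput, with the server saturated at 10 completions per tick, long after capacity has returned.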

The key insight is that the root cause of a metastable failure is the sustaining loop, not the trigger. Triggers are infinitely various and mostly cannot be prevented.

Sustaining loops are finite and identifiable, and if you eliminate them, the same trigger fails to produce the same outcome.
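One commonly used way to cut the retry loop is a retry budget, which caps retries at a fixed fraction of first attempts so that amplification stays bounded whatever the trigger. The class below is a minimal sketch of that idea; the name RetryBudget and the 10 percent ratio are assumptions for illustration and are not drawn from the paper.

```python
class RetryBudget:
    """Allow retries only while they stay below a fixed percentage of first attempts."""

    def __init__(self, percent=10):
        self.percent = percent
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        # Integer arithmetic avoids floating-point edge cases at the boundary.
        if (self.retries + 1) * 100 > self.requests * self.percent:
            return False
        self.retries += 1
        return True


budget = RetryBudget(percent=10)
for _ in range(100):
    budget.record_request()
granted = sum(budget.can_retry() for _ in range(100))
print(granted)
```

With a budget like this the worst-case load on the downstream tier is 1.1 times the original request rate, instead of a multiple set by the retry count.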

A follow-up paper, Metastable Failures in the Wild (Huang et al., OSDI 2022), examined 22 publicly disclosed incidents at 11 major organisations and concluded that at least 4 of the previous 15 major AWS outages fit the metastable pattern.

The October 2025 DynamoDB incident makes 5. The EC2 cascade after DynamoDB recovered is the metastable pattern in textbook form: the trigger (DynamoDB DNS being empty) was resolved in three hours; the sustaining loop (every hypervisor in the region simultaneously demanding lease renewal from a system that could not handle the surge) took ten more hours to break, and only broke when AWS manually rate-limited the work.

Marc Brooker, a principal engineer at AWS who has written extensively on this material, has pointed out that the appropriate intellectual framework here is not algorithms-and-data-structures but control theory and dynamical systems.

A metastable failure is, in dynamical-systems terms, a system with two stable attractors, where the perturbation required to push the system from the desirable attractor into the undesirable one is much smaller than the perturbation required to push it back.

The state-space geometry is asymmetric. Most production engineers have never thought about their systems this way, because computer science is taught around discrete models. The systems are continuous and dynamical, whether we model them that way or not.

Invariants and the cardinality contract

The implication is not that distributed systems are unbuildable. They obviously are.

The true implication is that the mental model under which most distributed systems get built (components compose, contracts compose, correctness composes) is wrong in a way that matters for production behaviour.

The discipline that replaces this mental model is the explicit enforcement of invariants at every component boundary, including internal ones.

An invariant, in this context, is a property of a value that the consumer’s correctness depends on, but that the producer is not contractually obligated to maintain. The Cloudflare feature file had at least three such invariants, none enforced by any check at the boundary:

A cardinality bound. The Rust consumer required n_features ≤ 200. The ClickHouse query had no LIMIT, no WHERE on database, and no schema constraint preventing growth.
A schema invariant. The consumer assumed columns came from default only. The query implicitly assumed the same via the permissions model. Neither stated the invariant in code.
A monotonicity invariant. A doubling of feature count between two consecutive runs is, on its face, anomalous. No alarm fired on that delta.

Each invariant was true for years. Each became false silently when an upstream change reshaped the world. The boundary between producer and consumer had no formal contract; the contract lived in the heads of engineers, some of whom had left the company.

The discipline that prevents this is not “validation” in the loose sense. It is the explicit, in-code, enforced declaration of every cardinality, ordering, schema, and freshness constraint that the consumer relies on, with explicit handling of violation: typically degradation to last-known-good rather than panic.
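A consumer boundary of this kind might look like the sketch below. The constants, the ContractViolation exception, and the FeatureConfig class are all hypothetical; the 200-feature ceiling comes from the incident, while the growth bound and the duplicate check are illustrative stand-ins for the monotonicity and schema invariants.

```python
MAX_FEATURES = 200          # hard ceiling the consumer was built around
MAX_GROWTH_FACTOR = 1.5     # hypothetical bound on run-to-run growth


class ContractViolation(Exception):
    """Raised when a candidate file breaks a declared invariant."""


def validate(features, previous_count):
    """Check cardinality, uniqueness, and run-to-run growth."""
    if len(features) > MAX_FEATURES:
        raise ContractViolation(f"{len(features)} features exceeds {MAX_FEATURES}")
    if len(set(features)) != len(features):
        raise ContractViolation("duplicate feature names")
    if previous_count and len(features) > previous_count * MAX_GROWTH_FACTOR:
        raise ContractViolation("feature count grew anomalously between runs")
    return features


class FeatureConfig:
    def __init__(self, initial):
        self.active = validate(list(initial), previous_count=0)
        self.rejected = 0

    def offer(self, candidate):
        """Adopt the candidate if valid, else keep last-known-good."""
        try:
            self.active = validate(list(candidate), len(self.active))
        except ContractViolation:
            self.rejected += 1  # a real system would also raise an alert here
        return self.active


config = FeatureConfig([f"f{i}" for i in range(60)])
bad = [f"f{i}" for i in range(60)] * 2    # duplicated rows, as after the migration
config.offer(bad)
print(len(config.active), config.rejected)
```

The duplicated file is rejected and the 60-feature configuration stays in service, so the violation surfaces as a counter and an alert rather than as a crashed worker.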

The Rust idiom for this is the difference between .unwrap() and explicit pattern matching on Result; the operational idiom is the difference between trusting upstream data and treating every input as adversarial regardless of source.

The cost of enforcement is a few additional lines per consumer boundary. The cost of trust is, occasionally, six hours of global downtime.

Sustaining loops and characteristic metrics

The second formal property the Cloudflare and DynamoDB incidents share is the presence of sustaining loops: control loops whose response to system degradation increases the load on the system rather than decreasing it.

The discipline for finding these before they fire is to enumerate every feedback loop in the system and classify each one’s stability properties.

A feedback loop is stable if, when perturbed from equilibrium by a small amount ε, it returns to equilibrium with error decaying as some function f(t,ε) that approaches zero as t grows.

A feedback loop is sustaining if the same perturbation produces error that grows or stays bounded away from zero.

The distinction is mathematically standard (Lyapunov stability) but is almost never applied to production systems, because engineers do not model their systems as dynamical systems.
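The simplest possible instance is a linear loop in which each step multiplies the error by a constant gain. That linear model, and the function names below, are assumptions made for illustration; real loops are nonlinear, but the classification question is the same.

```python
def perturbation_response(gain, epsilon=0.01, steps=50):
    """Iterate e[t+1] = gain * e[t] from an initial error epsilon."""
    errors = [epsilon]
    for _ in range(steps):
        errors.append(gain * errors[-1])
    return errors


def classify(gain):
    """For the linear model, the error decays to zero exactly when |gain| < 1."""
    return "stable" if abs(gain) < 1 else "sustaining"


for gain in (0.5, 1.0, 1.2):
    final = perturbation_response(gain)[-1]
    print(f"gain={gain}: {classify(gain)}, error after 50 steps = {final:.3g}")
```

For a retry loop, the gain can be read as the expected number of additional requests generated by one request that fails; a value of one or more means the perturbation never dies out.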

The catalogue of loops in any non-trivial production system:

retry policies (timeout → retry → load → timeout amplification);
autoscaling (latency → scale-up → cold-start latency → more scale-up);
lease renewal (load → renewal delay → lease expiry → mass renewal storm);
connection pooling (failure → reconnect → handshake load → failure);
cache warming (cold cache → DB load → DB slow → cache cannot warm);
health checks (slow response → marked unhealthy → traffic shifted to fewer hosts → those hosts slower).

Each is a control loop. Each can be classified. The classification is rarely written down.

The observability counterpart of this classification is what Bronson calls characteristic metrics: observations of the loop state itself, not of the loop’s inputs or outputs.

Queue depth is a loop-state observable; request rate is not. Retry rate is a loop-state observable; error rate is not. Lease renewal latency is a loop-state observable; lease expiry rate is not.

The relationship between loop-state metrics and incident causality is direct: when a sustaining loop activates, its characteristic metric crosses out of its historical operating envelope before the user-facing symptom appears.

Instrumenting characteristic metrics is the difference between detecting a metastable failure during its inflation phase (when mitigation is cheap) and detecting it after it has saturated (when mitigation requires load-shedding the user-facing service).
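A minimal envelope check on a loop-state metric could look like this. The mean-plus-three-standard-deviations rule and the queue-depth samples are assumptions for illustration; a production detector would more likely use percentiles over a rolling window.

```python
from statistics import mean, stdev


def envelope(history, k=3.0):
    """Upper bound of the historical operating envelope: mean + k standard deviations."""
    return mean(history) + k * stdev(history)


def breaches(history, live, k=3.0):
    """Return indices in `live` where the metric leaves the envelope."""
    limit = envelope(history, k)
    return [i for i, value in enumerate(live) if value > limit]


# Hypothetical queue-depth samples: a quiet baseline, then the start of a storm.
baseline_depth = [4, 5, 3, 6, 4, 5, 4, 6, 5, 4]
live_depth = [5, 6, 9, 14, 22, 35]
print(breaches(baseline_depth, live_depth))
```

Here the third live sample already leaves the envelope, while the queue is still short enough that shedding a little load would drain it.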

The diagnostic question

The compressed form of the entire discipline reduces to a single question, asked of every component boundary in the system: what am I assuming about my input that is not enforced by a check in this code?

Every such unenforced assumption is a future incident. The space of unenforced assumptions is large but finite, and it can be enumerated. Most engineering organisations have never done this enumeration.

The ones that have produce systems that fail in less catastrophic ways, not because they fail less often, but because the failures that occur are caught at the boundary where the assumption was violated, rather than three layers downstream after corruption has propagated.

The system you have is not the system you designed. The system you have is the composition. The composition is opaque, and the opacity is permanent, but the opacity at every individual boundary is not permanent.

Each boundary is a place where assumptions can be made explicit and enforced. The discipline of composition-aware engineering is not to make the whole transparent.

It is to make every boundary honest about what it requires from its neighbours, and to refuse to operate when those requirements are not met.

This is what separates the systems that fail loudly at the seams from the systems that fail catastrophically in the centre.

One more thing…

Modern systems rarely fail because of a single broken component.

They fail because interactions between correct components create behaviours nobody explicitly designed for. The same thing happens in high-performance GPU systems.

Most CUDA optimisation is not about isolated tricks. It is about understanding how kernels, memory hierarchies, scheduling, communication, and throughput constraints interact under load.

I wrote a deep guide on CUDA from exactly this perspective: systems-level performance engineering, bottlenecks, hidden coupling, and why many “optimisations” simply move the problem elsewhere.

Read the CUDA Guide on Gumroad

We built the CUDA guide I wish I had three years ago

Lorenzo Bradanini — Thu, 30 Apr 2026 07:20:37 GMT

Intro

For the past few days we have been quiet here. Not because the newsletter slowed down, but because we were building something underneath it.

Today we are publishing what came out of that work: CUDA Mastery 2026, The Definitive Engineer’s Reference for Hopper, Blackwell, and Beyond.

Twenty-seven chapters, five appendices, fact-checked end to end against NVIDIA’s own documentation, the PTX ISA 8.7, and primary architecture whitepapers.

It covers CUDA Toolkit 13.0, 13.1, and 13.2, compute capabilities 7.5 through 12.1, WMMA, WGMMA, UMMA (tcgen05), TMA, thread block clusters, Tensor Memory, CUDA Tile and cuTile Python, CUTLASS 4 / CuTe, NCCL 2.30, and Nsight 2025.4.

It is on Gumroad. The price is $89. If you have been following The Software Frontier, you already know whether this is for you. The rest of this post is for everyone who is on the fence.

Why we wrote this

There is a strange gap in CUDA literature.

On one side, you have the official NVIDIA programming guide: dense, accurate, and written for people who already know what they are looking for.

On the other side, you have an ocean of blog posts and YouTube tutorials that stop at vector addition and matrix multiplication, repeating the same surface-level explanations of threads, blocks, and grids.

What sits in the middle, the part that actually matters when you are writing production code or debugging a kernel that runs at 30 percent of peak, is mostly missing.

Or rather, it exists, but it is scattered across NVIDIA whitepapers, GTC talks from 2018, PTX ISA documentation, decompiled SASS dumps, the Hopper and Blackwell Tuning Guides, the Microbenchmarking Hopper and Microbenchmarking Blackwell arXiv papers, the CUTLASS source, and Stack Overflow threads from people who clearly know more than they are saying.

We have been reading and writing about this gap for months on the newsletter. The articles on the A100 memory hierarchy, on cp.async semantics, on scoreboard mechanics, on the submission pipeline, all of them came from the same frustration. Every time we wanted to explain something properly, we had to do the archaeology ourselves.

So we decided to do the archaeology once, in a single document, and structure it the way we wish someone had structured it for us when we started.

What is in the guide

Twenty-seven chapters across eleven parts, plus five appendices. Four chapters were rewritten end-to-end at handbook depth. Those are the PREMIUM chapters: the SM, the memory system, tensor cores, and the SGEMM walkthrough. The rest of the structure looks like this.

Foundations. The GPU as a throughput machine, the CUDA programming model, and the memory hierarchy at a glance. This is the vocabulary layer. A senior engineer can skim it in an afternoon; a new graduate can use it as their entry point and grow into the rest of the book.

The Streaming Multiprocessor in mechanical detail. The SM is the unit of concurrency, the unit of resource accounting, and the unit at which every meaningful CUDA performance argument must eventually be made.

We walk through the four near-independent partitions, the operand collector and its bank conflicts, the short and long scoreboards, the quantitative arithmetic of latency hiding via Little’s law, the full Nsight Compute stall taxonomy, and the structural deltas across Volta, Turing, Ampere, Ada, Hopper, and Blackwell.

The chapter ends with an end-to-end walkthrough of a single warp executing a wgmma.mma_async on a Hopper SM, stage by stage.

The memory system mechanically. The 32-byte sector model. The L1TEX path and every cache modifier you can attach to a global load. The L2 partitioning on H100 and the access policy window. HBM3 in three numbers and why the practical roofline is 70 to 85 percent of the headline.

Shared memory banks, swizzle modes, and the descriptor-encoded layout that WGMMA actually expects. cp.async mechanics on Ampere. TMA on Hopper and Blackwell, including descriptors, transaction barriers, phase parity, and cluster multicast. The mbarrier family and how warp-specialized GEMM mainloops use it.

Synchronization and concurrency. The CUDA memory model with its scopes and orderings. Cooperative Groups including cluster.sync on Hopper and beyond. Streams, events, and CUDA Graphs for launch-overhead amortization in inference and physics workloads.

Tensor cores mechanically. The hardware origin of the unit, the generation-by-generation shape and precision progression, the per-lane fragment ownership for mma.sync, the bit-level layout of the WGMMA matrix descriptor, and the structural transition to UMMA with the accumulator living in Tensor Memory.

A full section on the new numerical formats: FP6, FP4, and the MX wrapper that makes FP4 inference near-lossless on trained transformer weights.

Modern hardware. A Hopper deep dive on sm_90 / sm_90a. A Blackwell deep dive on sm_100 / sm_100a / sm_120. A chapter on Blackwell Ultra (B300, compute capability 10.3) and the trajectory toward Rubin.

Performance engineering. The roofline model in practice, with the second roofline for shared-memory bandwidth on tile-based kernels. Profiling with Nsight Systems and Nsight Compute, including the four-section workflow and the new tile-kernel statistics added in CUDA 13.1. Numerics and reproducibility, including the TF32 trap that silently downgrades FP32 GEMMs.

Multi-GPU and distributed. NVLink 5 and the NVL72 domain. SHARP v4 in-network reductions. NCCL internals across Ring, Tree, NVLS, and PAT. NVSHMEM and PGAS for sparse all-to-all in MoE training.

Libraries and toolchain. cuBLAS, cuBLASLt epilogue fusion, cuDNN, cuFFT, cuSPARSE. CUTLASS 4 and CuTe for hand-written tensor-core mainloops. CCCL (Thrust + CUB + libcu++). A full chapter on CUDA Tile and cuTile Python, the largest single addition to the CUDA programming model since cooperative groups. A chapter on nvcc, PTX, SASS, the fatbinary, and inline PTX as an escape hatch.

Capstone kernels. The SGEMM walkthrough from v1 to v6, with numbers. Most treatments stop at register tiling and gesture vaguely at “tensor cores make it faster.”

Ours follows the bottleneck through six versions on a 4096³ FP32 problem on H100 SXM5, names the architectural feature that breaks each ceiling, and gives the analytical bound. v1 reaches 0.2 percent of peak. v6, on Blackwell with UMMA + TMEM at FP8, reaches 88 to 95 percent of peak.

The chapter exists to teach what every transition costs and what it buys. Reductions and scans with single-pass decoupled lookback. Flash Attention 2, 3, and 4 / 5 including the online softmax derivation and the WGMMA / UMMA mainloop. Sort, hash, and graph primitives built on CUB.

Appendices. Compute capability quick reference. Architecture spec sheet from A100 through B300 with verified numbers from each part’s datasheet. PTX quick reference. Glossary. And a full bibliography of the primary sources every claim in the book was checked against.

Every claim in the guide has been fact-checked against primary sources. Where we had to infer something from SASS or from the behavior of the hardware rather than from a published spec, we say so explicitly.

What was truly missing

Maxwell, Pascal, and Volta offline-compilation material was retired, in line with CUDA 13.0 dropping pre-Turing offline compilation in August 2025. The tensor-core chapter was rewritten around UMMA, which supersedes Hopper’s wgmma.mma_async.

New material on CUDA Tile and cuTile Python, both introduced in CUDA 13.1 in December 2025 and extended in 13.2. New material on Tensor Memory.

The four PREMIUM chapters are new from the ground up. Numbers verified against NVIDIA’s Hopper, Blackwell, and Blackwell Ultra public datasheets at print time.

Who this is for

If you are writing CUDA professionally, in HPC, in ML systems, in inference engines, or in any context where kernel performance is part of your job, this guide is calibrated for you.

If you are a senior engineer transitioning into GPU work and you want one document that takes you from competent to dangerous without 200 hours of fragmented reading, this is the document.

If you are deep into the PTX weeds already, writing your own warp-specialized WGMMA mainloops and tuning CuTe layouts, you probably know a lot of what is in here.

You will still find the SM and memory chapters useful as a reference, and the SGEMM walkthrough is one of the few places where the v5-to-v6 transition is laid out in full. But we would not pretend to teach you something you do not already know.

If you are completely new to CUDA, with no parallel programming background, this is not the right starting point.

The guide assumes you can read C++ and that you have at least written a few kernels before. We would not want you to spend $89 and feel lost on chapter four.

Why $89

Because the alternative is reading what we read, in the order we read it, over the same number of months.

We are not pricing this against tutorials. We are pricing it against the time of an engineer who bills somewhere between $80 and $200 an hour and needs to be productive on Hopper and Blackwell GPU code by next quarter.

If the guide saves you a single afternoon of debugging a kernel that turns out to be limited by operand collector bank conflicts that no public documentation describes, it has paid for itself.

There is no DRM, no expiration, no upsell. You buy the PDF, you own it. As future architectures change material details, we will publish updates to buyers at no additional cost.

The next edition is already scheduled to cover Rubin when its public specifications stabilize.

What happens next

The newsletter continues. The Mastering CUDA series is not over, and there are several articles already in draft on topics that did not fit cleanly into the guide.

If you buy the guide and you have feedback, send it. We read every email. The first revision is going out within thirty days based on what readers tell us, and the people who bought early get it first.

You can find the guide here: CUDA Mastery

Thank you for reading. Thank you for being here while this was being built. The next article goes out as scheduled.

Lorenzo and Lorenzo

Mastering CUDA and High-Performance Computing, Part X

Lorenzo Bradanini — Fri, 24 Apr 2026 13:58:58 GMT

Where Part IX Left Us

Part IX ended with a provocation dressed as a summary. Training, we said, is a one-time cost. Inference is the workload that runs forever.

That sentence deserves to be interrogated before we accept it as a frame, because it contains a hidden asymmetry that shapes everything that follows.

Training a frontier model is an event: it happens once, or perhaps a handful of times with different hyperparameters, and then it stops. The cost is large and bounded.

Inference is a process: it happens billions of times per day, across hardware that may or may not resemble the training cluster, under latency constraints that the training job never had to respect, serving users who have no patience for pipeline bubbles and no interest in MFU.

The engineering discipline of inference optimization is therefore a different subject from the engineering discipline of training optimization, not merely a scaled-down version of it. The bottlenecks are different in kind. The metrics are different. The vocabulary is different. The hardware choices are sometimes deliberately different.

But the physics is the same, because physics does not have a training mode and an inference mode. Compute and memory bandwidth are always the two resources, and every optimization in this space is, at root, a claim about which of the two you are spending and whether you are spending it wisely.

What we will do in this part is work through the inference problem with the same level of precision we brought to the training problem.

We will derive the arithmetic of autoregressive decoding from first principles, establish exactly why the decode phase of transformer inference is memory-bandwidth-bound by construction, explain what that means for hardware selection and batching strategy, and then examine the tools that practitioners have developed to recover the compute utilization that the memory-bound regime takes away.

We will go into detail that most treatments of this subject avoid, because the details are where the engineering actually lives.

The two phases of transformer inference

Transformer inference for a generative model consists of two distinct phases that are so different in their computational character that they might as well be different workloads running on different hardware.

The prefill phase processes the input prompt. Given a prompt of length S tokens, the model performs a forward pass over all S tokens simultaneously.

This is a dense matrix multiply of shape [S, d_model] against the weight matrices, which is computationally equivalent to a training forward pass on a batch of S examples, with the important difference that no gradient computation happens.

The arithmetic intensity of prefill is high: the GEMM is large, the compute-to-memory ratio is favorable, and for sufficiently long prompts, prefill saturates tensor core utilization. Prefill is compute-bound.

The decode phase generates the output, one token at a time. At each step, the model processes a single new token and uses the key-value (KV) cache, which stores the key and value projections for all previously seen tokens, to compute attention over the full context without recomputing those projections.

The new token produces one row of the Q matrix, one row of the K matrix (appended to the KV cache), one row of the V matrix (also appended), and one output token.

The matrix multiply that dominates decode is therefore a matrix-vector product: a single vector of shape [1, d_model] multiplied against weight matrices of shape [d_model, d_model].

For an H100 at 494 TFLOP/s peak BF16, and a weight matrix that requires 2 × d_model² bytes to read from HBM (one load), the arithmetic intensity of this operation is:

flops = 2 × d_model²
bytes = 2 × d_model²
arithmetic intensity = flops / bytes = 1 FLOP/byte

The H100’s ridge point, the arithmetic intensity at which the machine transitions from memory-bandwidth-bound to compute-bound, is approximately 494 TFLOP/s divided by 3.35 TB/s HBM3 bandwidth, which equals roughly 147 FLOP/byte.

Single-token decode has an arithmetic intensity of 1 FLOP/byte. The ridge point is at 147 FLOP/byte.

The gap is not a small inefficiency to be engineered away. It is two orders of magnitude. It is structural.

A matrix-vector product with batch size 1 will always be memory-bandwidth-bound on any hardware where compute throughput scales faster than memory bandwidth, which is every piece of hardware available today and likely every piece available for the next several years.

The H100’s tensor cores sit idle roughly 99.3% of the time, waiting for data that cannot arrive fast enough. This is the inference problem, stated precisely.
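
The ridge-point arithmetic can be checked in a few lines. What follows is a minimal sketch that takes the peak and bandwidth figures quoted in this section as given; the function names are ours and purely illustrative.

```python
# Roofline sketch. The two constants are the figures quoted in the text,
# taken as assumptions rather than measured values.
PEAK_FLOPS = 494e12   # FLOP/s
HBM_BW = 3.35e12      # bytes/s


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which the roofline bends."""
    return peak_flops / bandwidth


def attainable_flops(intensity: float) -> float:
    """Roofline bound: the lower of the compute roof and bandwidth * intensity."""
    return min(PEAK_FLOPS, HBM_BW * intensity)


ridge = ridge_point(PEAK_FLOPS, HBM_BW)
decode_utilization = attainable_flops(1.0) / PEAK_FLOPS  # batch-1 decode, 1 FLOP/byte

print(f"ridge point: {ridge:.0f} FLOP/byte")
print(f"best-case utilization at 1 FLOP/byte: {decode_utilization:.2%}")
```

The utilization it prints is an upper bound: it assumes the memory system delivers its full headline bandwidth, which real kernels do not achieve.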

What the arithmetic intensity gap actually costs

Before we can appreciate why batching is not a simple fix, we need to quantify what the gap costs in concrete terms.

Consider the weight matrices of a 70B parameter model. They occupy 140 GB in BF16. To generate a single output token, the decode phase must read essentially all of the weight matrices from HBM: every attention projection, every MLP layer, every embedding lookup.

(The KV cache is also read, but its size is proportional to context length and sequence position, not model size.) At the H100’s HBM3 bandwidth of 3.35 TB/s, reading 140 GB takes approximately 42 milliseconds.

In 42 milliseconds, a single token is produced. That is approximately 24 tokens per second per H100 for a 70B model at batch size 1.

Now read that sentence again: 24 tokens per second per H100, a machine that costs tens of thousands of dollars and can perform 494 trillion floating-point operations per second, is producing tokens at roughly the rate that a person reads them.

The 494 TFLOP/s are not being used. The H100 is acting as a very expensive, very fast HBM3 reader. The silicon that took years to design and billions of dollars to fabricate is waiting for data.

This is the central pathology of autoregressive decode, and it motivates every technique we will discuss in this part.

Batching as arithmetic intensity recovery

The solution that every inference practitioner reaches for first is batching: if you run decode for multiple requests simultaneously, the weight reads are shared across the batch, and the arithmetic intensity increases proportionally.

The arithmetic is clean. For a batch of B requests, the matrix multiply in decode is no longer a matrix-vector product but a matrix-matrix product: [B, d_model] × [d_model, d_model].

The flop count scales as B × 2 × d_model², while the bytes for the weight matrix remain 2 × d_model² (the weights are read once, regardless of batch size). Arithmetic intensity is now B FLOP/byte.

To reach the ridge point at 147 FLOP/byte, you need a batch of 147 requests running simultaneously on the same H100. At batch size 147, the tensor cores begin to saturate, and further increasing the batch no longer raises throughput per GPU (you are now compute-bound, and each extra request adds compute time to every step instead of riding along on weight reads that were already being paid for).

The batch size at which you saturate the machine is the target operating point for maximum throughput per GPU. Everything below this point is wasted hardware.
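
To make the operating point concrete, here is a small sketch of decode throughput as a function of batch size under the same roofline assumptions. It deliberately ignores KV cache reads and kernel overheads, and it treats the 70B model as a single stack of weights on one device, so the numbers are bounds, not predictions.

```python
# Decode throughput vs. batch size under a two-term roofline model.
# Constants are the figures used in the text; everything else is illustrative.
PARAMS = 70e9
BYTES_PER_PARAM = 2      # BF16
PEAK_FLOPS = 494e12      # FLOP/s
HBM_BW = 3.35e12         # bytes/s


def decode_tokens_per_second(batch: int) -> float:
    """Tokens/s across the whole batch for one decode step."""
    t_memory = PARAMS * BYTES_PER_PARAM / HBM_BW   # weights are read once per step
    t_compute = batch * 2 * PARAMS / PEAK_FLOPS    # ~2 FLOPs per parameter per token
    return batch / max(t_memory, t_compute)


for b in (1, 16, 64, 147, 148, 256):
    print(f"batch {b:>3}: {decode_tokens_per_second(b):8.0f} tokens/s")
```

Below the ridge, throughput grows linearly with the batch; above it, the curve goes flat.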

But batch size is not free. Each request in the batch has its own KV cache, and the KV cache size is proportional to the sequence length of that request. For a context length of 8192 tokens, a model with 80 layers, 8 KV heads, a head dimension of 128, and BF16 storage, the KV cache size per request is:

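8192 × 80 × 2 × 8 × 128 × 2 bytes = 2,684,354,560 bytes ≈ 2.7 GB

Roughly 2.7 GB per request, on a GPU with 80 GB of HBM. Even if all 80 GB were free for KV cache, which it never is because the weights live in the same memory, you could serve at most 29 concurrent requests at 8192-token context length before HBM is exhausted. That is about a fifth of the 147 the ridge point asks for, and you cannot increase the batch further.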

The tension is fundamental: to achieve good arithmetic intensity you need large batches, but large batches require large KV caches, and large KV caches consume the memory that large batches require.
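
The per-request KV cache size is easy to get wrong by a factor of ten, so it is worth evaluating the product term by term. A minimal sketch, using the model shape from the example above; the helper name is ours, and the request bound ignores the memory the weights themselves occupy.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one request.

    The factor 2 accounts for one key vector and one value vector
    per token, per layer, per KV head.
    """
    return tokens * layers * 2 * kv_heads * head_dim * bytes_per_value


per_request = kv_cache_bytes(tokens=8192, layers=80, kv_heads=8, head_dim=128)
max_requests = int(80e9 // per_request)   # optimistic: pretends all 80 GB is free

print(f"{per_request:,} bytes = {per_request / 1e9:.2f} GB = {per_request / 2**30:.2f} GiB")
print(f"upper bound on concurrent requests in 80 GB: {max_requests}")
```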

This is the central resource allocation problem of transformer inference, and it is more constrained than it appears, because KV cache memory is not static. It grows with sequence length.

A request that has generated 100 tokens has a small KV cache; the same request after generating 4000 tokens has a KV cache that is 40× larger. The memory footprint of the batch changes continuously as generation proceeds.

Continuous batching and the end of static padding

The naive approach to batched inference is static batching: collect B requests, pad them all to the same sequence length, run them as a batch, return all B results when the longest sequence finishes.

Static batching is deeply inefficient. Consider a batch of 8 requests where one request will generate 2000 tokens and the others will generate 20 tokens each.

After the short requests finish at step 20, 7 of the 8 slots in the batch are empty, but the batch continues until step 2000 to service the one long request. The GPU is running at 1/8 occupancy for 99% of the wallclock time.

Continuous batching (also called iteration-level batching or in-flight batching), implemented in systems like vLLM, Orca, and TensorRT-LLM, solves this by removing the assumption that all requests in a batch start and finish together.

Instead, the batch is managed at the per-decoding-step level: at each step, the set of active requests is the set of requests that have not yet finished and for which memory is available.

When a request completes (generates a stop token or reaches the maximum length), its slot in the batch is immediately freed. A new waiting request is inserted into the freed slot and begins generating from its first decode step. There is no waiting for the batch to drain. The batch is always as full as memory permits.

The implementation requires that the CUDA kernels for attention and the MLP can handle variable-length sequences within a single kernel invocation, which is non-trivial because standard GEMM implementations assume a fixed batch dimension.

Paged attention (discussed shortly) is the memory management technique that makes this practical; the PagedAttention kernel from vLLM and the FlashAttention variants for variable-length sequences are the implementations that make it fast.

Continuous batching does not increase the maximum batch size (which is still limited by KV cache memory). What it does is ensure that the batch is always at or near the maximum size, eliminating the idling that static batching induces.

A system running continuous batching with maximum batch size 64 will achieve dramatically higher throughput than a system running static batching with the same maximum batch size, because the latter is almost never actually running 64 requests simultaneously.
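
A toy simulation shows the size of the effect. This is a sketch under simplifying assumptions: every decode step costs the same regardless of how full the batch is, memory is never the limit, and the workload (one long request among seven short ones, repeated four times) is hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Freed slots are refilled from the waiting queue at every decode step."""
    waiting = deque(lengths)
    active = []            # remaining tokens for each in-flight request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Requests with one token left finish on this step and free their slot.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = ([2000] + [20] * 7) * 4   # output lengths in tokens, arrival order
static_steps = static_batching_steps(lengths, batch_size=8)
continuous_steps = continuous_batching_steps(lengths, batch_size=8)

print(f"static: {static_steps} steps, continuous: {continuous_steps} steps")
```

With continuous batching the four long requests overlap instead of running back to back, which is where most of the saving comes from.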

PagedAttention and the memory management revolution

The KV cache memory problem has an analogy so precise it deserves to be stated explicitly: the KV cache is to inference systems what physical memory is to operating systems.

In an operating system, multiple processes compete for a fixed physical memory. Processes do not know their memory needs in advance (a process may allocate more memory as it runs).

Memory fragmentation is a real cost: even if the total free memory is sufficient, if it is not contiguous, an allocation may fail.

The solution that operating systems developed is virtual memory with paging: memory is divided into fixed-size pages, processes address a virtual space that the OS maps to physical pages on demand, and fragmentation is eliminated because non-contiguous physical pages can be mapped to a contiguous virtual space.

PagedAttention, introduced by Kwon et al. (2023) and implemented in vLLM, applies exactly this insight to KV cache management.

In a naive KV cache implementation, each request’s KV cache is a contiguous block of GPU memory allocated at request arrival. The maximum context length is reserved at allocation time (because the request might generate that many tokens), even if the actual generation is much shorter.

Fragmentation is severe: the gap between reserved and used memory across all requests is wasted, and new requests cannot use it.

PagedAttention divides the KV cache into fixed-size physical blocks (pages), where each block stores the keys and values for a fixed number of tokens (the block size, typically 16 or 32 tokens). When a request needs more KV cache space, it is allocated additional pages from a free pool.

Pages for a single request need not be contiguous in physical GPU memory; the PagedAttention kernel uses a block table (a small integer array per request mapping logical page indices to physical page indices) to find the right physical memory at attention computation time.

The consequences are significant. First, fragmentation falls from potentially 50% (if requests reserve maximum-length buffers but generate much shorter sequences) to under 5% (only the last partially-filled page of each sequence wastes space).

Second, sequences can share physical pages: if two requests have identical prompt prefixes (common in chat applications with a fixed system prompt), the KV cache pages for the shared prefix can be physically shared between them, eliminating redundant computation and halving the memory footprint of the prefix.

Third, memory allocation is lazy: pages are allocated only as tokens are generated, not at request arrival, which means a request does not consume its full potential KV cache until it actually generates enough tokens to need it.
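
The block table is simple enough to sketch in plain Python. The class below is a toy model of the bookkeeping only (no tensors, no kernel); its names are illustrative and are not vLLM's API.

```python
class PagedKVAllocator:
    """Toy block-table allocator for a paged KV cache."""

    def __init__(self, num_physical_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.num_tokens = {}     # request id -> tokens stored so far

    def append_token(self, request_id: str) -> tuple[int, int]:
        """Reserve a slot for one more token, allocating a page only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count % self.block_size == 0:      # no page yet, or the last page is full
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1
        return self.lookup(request_id, count)

    def lookup(self, request_id: str, token_index: int) -> tuple[int, int]:
        """Map a logical token position to (physical block, offset within block)."""
        block = self.block_tables[request_id][token_index // self.block_size]
        return block, token_index % self.block_size

    def free(self, request_id: str) -> None:
        """Return all pages of a finished request to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


allocator = PagedKVAllocator(num_physical_blocks=8, block_size=16)
for _ in range(17):                 # 17 tokens need two 16-token pages
    allocator.append_token("req-a")
print(allocator.block_tables["req-a"], allocator.lookup("req-a", 16))
```

Only the last page of each request can be partially empty, which is the source of the low fragmentation figure quoted above.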

The prefix sharing (also called KV cache sharing or prompt caching) deserves additional attention because its impact at scale is large. Consider a chat application where every request is prefixed with a 2000-token system prompt.

Without prefix sharing, each request independently computes and stores the KV cache for those 2000 tokens. With prefix sharing, the 2000-token prefix KV cache is computed once and shared across all concurrent requests.

For a batch of 64 requests, this eliminates 63 redundant prefill computations and reduces KV cache memory by 2000 × 63 tokens' worth of keys and values, freeing space for a larger batch.
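
To put a number on that saving, here is a short calculation. It assumes the same model shape as the earlier KV cache example (80 layers, 8 KV heads, head dimension 128, BF16), which is our assumption for illustration; one copy of the prefix has to stay resident.

```python
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Keys plus values for a single token across the whole model."""
    return layers * 2 * kv_heads * head_dim * bytes_per_value


prefix_tokens, batch = 2000, 64
without_sharing = prefix_tokens * batch * kv_bytes_per_token()
with_sharing = prefix_tokens * 1 * kv_bytes_per_token()   # one shared copy
saved = without_sharing - with_sharing

print(f"without sharing: {without_sharing / 1e9:.1f} GB")
print(f"with sharing:    {with_sharing / 1e9:.2f} GB")
print(f"saved:           {saved / 1e9:.1f} GB")
```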

Speculative decoding and the bandwidth wall

Continuous batching with PagedAttention brings the inference system to the state of efficiently using available hardware at the maximum batch size the KV cache permits.

But we have still not escaped the fundamental constraint. A large batch pushes decode toward the compute-bound regime, but filling that batch requires enough concurrent requests, which requires enough users, and low-traffic deployments do not have them. At low batch sizes, we are still memory-bandwidth-bound, still producing tokens only as fast as memory bandwidth permits.

Speculative decoding attacks this from a different angle. The observation is this: for a memory-bandwidth-bound system, the cost of generating one token and the cost of scoring K tokens in a single forward pass are approximately equal, because the bottleneck is reading the weight matrices from memory, and a single forward pass reads them once whether it scores one token or K.

The mechanism: a small draft model (or a non-autoregressive heuristic, or a retrieval system) proposes a sequence of K candidate continuation tokens. An autoregressive draft model needs K forward passes to do this, but each pass is cheap because the draft model is small.

The large target model then verifies this proposal in a single forward pass over the K tokens simultaneously. If the target model accepts all K tokens, K tokens have been generated at the cost of one large model forward pass plus one small model forward pass.

If the target model rejects some tokens, the generation rewinds to the first rejection and continues from there, having wasted some small model computation but no large model computation beyond what was necessary.
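
The accept-and-rewind loop can be sketched for the greedy case, where a proposed token is accepted exactly when it matches the target model's own argmax. The two "models" below are hypothetical stand-in functions, not real networks, and the target is called position by position for clarity; a real system obtains all K+1 target predictions from one batched forward pass.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One round of greedy speculative decoding; returns the tokens committed."""
    # The draft proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        proposal.append(token)
        ctx.append(token)

    # The target checks each proposal in order.
    committed, ctx = [], list(prefix)
    for token in proposal:
        expected = target_next(ctx)
        if token != expected:
            committed.append(expected)   # first rejection: keep the target's token
            return committed
        committed.append(token)
        ctx.append(token)
    committed.append(target_next(ctx))   # all accepted: one bonus token
    return committed


def generate(prefix, n_tokens, draft_next, target_next, k=4):
    out, target_passes = list(prefix), 0
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        target_passes += 1
    return out[:len(prefix) + n_tokens], target_passes


# Hypothetical toy models: the next token is a simple function of the context.
def target_next(ctx):
    return (sum(ctx) + len(ctx)) % 11


def draft_next(ctx):
    # Agrees with the target except at every fifth position.
    return target_next(ctx) if len(ctx) % 5 else (target_next(ctx) + 1) % 11


speculative, passes = generate([1, 2, 3], 40, draft_next, target_next)

plain = [1, 2, 3]
for _ in range(40):
    plain.append(target_next(plain))

print(speculative == plain, passes)
```

The output is identical to plain greedy decoding with the target model; only the number of target passes changes. Sampling-based variants use a probabilistic acceptance rule instead of exact matching, which this sketch does not cover.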

The analysis of when speculative decoding accelerates inference requires understanding the acceptance rate, which is the fraction of proposed tokens that the target model accepts.

For a good draft model generating from a similar distribution, acceptance rates of 70-90% are achievable. With an acceptance rate of α and a proposal length of K, the expected number of tokens accepted per large model forward pass is:

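E[accepted] = sum_{k=0}^{K−1} (k+1) × αᵏ × (1 − α) + (K+1) × αᴷ = (1 − αᴷ⁺¹) / (1 − α)

This assumes each proposed token is accepted independently with probability α. The sum covers rounds that end at a rejection after k accepted tokens, where the target model's own correction supplies token k+1; the last term covers rounds where all K proposals are accepted and the target contributes one bonus token.

For α = 0.8 and K = 4: E[accepted] ≈ 3.4 tokens per large model forward pass, compared to 1 token per forward pass without speculation. The gross speedup is therefore approximately 3.4×, reduced by the overhead of the draft model, typically 0.2-0.3 of a large model forward pass for a draft model that is 10-15× smaller.

Net speedup: approximately 3.4 / 1.25 ≈ 2.7×. This is not free, but it is substantial, and it is available even at batch size 1.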
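
The expectation is worth computing directly, since the bounds of the sum are easy to get wrong. The sketch below assumes each proposed token is accepted independently with probability α, counts the target model's correction or bonus token as part of each round, and treats the draft overhead as a fixed fraction of a target pass; the function names are ours.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens committed per target forward pass."""
    # Rounds ending in a rejection after j accepted tokens (j < k) commit j + 1
    # tokens: the j accepted proposals plus the target's correction.
    rejected = sum((j + 1) * alpha**j * (1 - alpha) for j in range(k))
    # Rounds where all k proposals are accepted commit k + 1 tokens.
    all_accepted = (k + 1) * alpha**k
    return rejected + all_accepted


def closed_form(alpha: float, k: int) -> float:
    """Geometric-series form of the same expectation."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def net_speedup(alpha: float, k: int, draft_overhead: float) -> float:
    """draft_overhead is the draft cost as a fraction of one target pass."""
    return expected_tokens_per_pass(alpha, k) / (1 + draft_overhead)


for alpha in (0.5, 0.7, 0.8, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_pass(alpha, 4):.2f} tokens/pass, "
          f"net speedup {net_speedup(alpha, 4, 0.25):.2f}x")
```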

The reason speculative decoding works, and why it does not violate the memory-bandwidth constraint, is subtle. The verification pass is a prefill operation over K+1 tokens, which has higher arithmetic intensity than single-token decode.

For K=4, the verification pass processes 5 tokens simultaneously, which is 5× the arithmetic intensity of single-token decode. It is still memory-bandwidth-bound for small K, but the per-accepted-token cost of the memory reads is reduced because multiple tokens share the weight reads.

The deeper insight is that speculative decoding converts memory-bandwidth-bound decode operations into a mixture of memory-bandwidth-bound draft decode and slightly-less-memory-bandwidth-bound verification, and the mixture achieves better tokens-per-second because the draft model is smaller and therefore faster per forward pass.

The pathological case is when the draft model produces tokens that the target model almost never accepts, in which case the overhead of draft generation and failed verification exceeds the benefit.

When acceptance rates fall much below approximately 0.5, the expected gain shrinks toward the combined overhead of drafting and verification, and speculative decoding can become harmful, not helpful. This is why the choice of draft model matters: it should be distilled from or aligned with the target model, not simply chosen to be the fastest available.

Self-speculative decoding, where the model speculates with its own early layers (exiting at an intermediate layer for draft generation and running the full forward pass for verification) eliminates the draft model requirement at the cost of some architectural complexity.

Medusa, a multi-head speculative decoding method that adds dedicated draft heads to the target model, is a variant that achieves similar benefits with a different implementation strategy.

KV cache quantization and the memory tradeoff

Even with PagedAttention and prefix sharing, the KV cache is often the binding constraint on batch size for long-context models. A 70B model with a 128K context length and a batch of 8 requests generates a KV cache of:

128,000 × 80 × 2 × 8 × 128 × 2 bytes × 8 requests ≈ 335.5 GB

Roughly 335 GB for 8 requests, on top of 140 GB of weights. That is more than the combined HBM of four 80 GB H100s spent on cache alone, for a batch far too small to approach the ridge point. The only responses are to reduce context length (not always possible), reduce batch size (reduces throughput), or reduce the precision of KV cache values.

KV cache quantization stores keys and values in INT8 or FP8 rather than BF16, halving the memory requirement at the cost of approximation error. The question is whether that approximation error materially affects output quality.

The answer, empirically, is that keys and values are more quantization-sensitive than weights, because the attention score computation amplifies outliers in the key matrix, and those outliers are common and important.

Naive INT8 quantization of the KV cache causes measurable quality degradation on tasks requiring precise retrieval over long contexts. The degradation is smaller for short contexts where fewer keys compete for attention.
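
The outlier effect is easy to reproduce on a toy vector. The sketch below uses symmetric absmax INT8 quantization with one scale per vector, which is one naive scheme among several; the vectors are made up for illustration and are not taken from a real model.

```python
def quantize_int8(values):
    """Symmetric absmax quantization of one vector; returns (codes, scale)."""
    scale = max(abs(v) for v in values) / 127 or 1.0   # avoid a zero scale
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def max_abs_error(values):
    codes, scale = quantize_int8(values)
    return max(abs(v - d) for v, d in zip(values, dequantize(codes, scale)))


smooth = [0.1 * i for i in range(-8, 8)]   # hypothetical key vector, no outliers
spiky = smooth[:-1] + [40.0]               # same vector with one large outlier

print(f"max error without outlier: {max_abs_error(smooth):.4f}")
print(f"max error with outlier:    {max_abs_error(spiky):.4f}")
```

A single outlier stretches the scale for the whole vector, so every other entry is stored with far coarser resolution. Finer-grained scales (per channel or per group) are the usual mitigation.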

Techniques that relieve this pressure include grouped-query attention (GQA), which shrinks the KV cache at the architecture level by reducing the number of KV heads, sharing keys and values across groups of query heads without proportionally reducing the expressivity of attention; and mixed-precision KV caching, which stores frequently-accessed (recent) tokens in higher precision and distant-context tokens in lower precision, exploiting the empirical observation that attention weights are concentrated on nearby and highly salient tokens.

GQA deserves detailed treatment because it has become standard in essentially all modern large language models: Llama 3, Mistral, Gemma, Qwen, and their successors all use it.

The mechanism is to reduce the number of KV heads from H to H/G for a group size of G, where typically G = 8. Each KV head is shared by G query heads during attention computation. The KV cache size shrinks by a factor of G, and with G=8 on a 128-head attention, the cache is 8× smaller.

The expressivity cost is small for typical values of G because attention heads tend to specialize into redundant groups anyway: the empirical evidence across many model evaluations suggests that the quality loss from GQA at G=8 is negligible relative to the memory benefit.
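
The head mapping itself is a one-line integer division. The sketch below assumes 64 query heads with G = 8, which yields the 8 KV heads used in the cache-size examples earlier, and it assumes contiguous grouping of query heads; both are illustrative choices, not a statement about any specific model's layout.

```python
def kv_head_for_query_head(q_head: int, group_size: int) -> int:
    """Contiguous grouping: query heads 0..G-1 share KV head 0, and so on."""
    return q_head // group_size


def kv_cache_ratio(num_query_heads: int, group_size: int) -> float:
    """KV cache size under GQA relative to full multi-head attention."""
    num_kv_heads = num_query_heads // group_size
    return num_kv_heads / num_query_heads


mapping = [kv_head_for_query_head(h, group_size=8) for h in range(64)]

print(mapping[:10])
print(f"cache size relative to multi-head attention: {kv_cache_ratio(64, 8):.3f}")
```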

The prefill-decode disaggregation architecture

We have established that prefill is compute-bound and decode is memory-bandwidth-bound. The hardware that is optimal for one is different from the hardware that is optimal for the other.

High-compute GPUs (H100, B200) with large HBM capacity are appropriate for both prefill and decode, but they are expensive. For decode in particular, the binding resource is memory bandwidth, not compute throughput. A GPU with lower compute throughput but the same memory bandwidth would serve the decode phase equally well at lower cost.

This observation motivates prefill-decode disaggregation: running prefill and decode on separate hardware pools, each sized for its actual bottleneck.

The architecture is approximately as follows: a request arrives at a scheduler, which assigns it to a prefill worker. The prefill worker (a high-compute GPU) processes the prompt and produces the initial KV cache.

That KV cache is then transferred to a decode worker (which may be a different, possibly cheaper GPU). The decode worker generates tokens autoregressively, managing the KV cache in its memory, and streams tokens back to the user.

The KV cache transfer itself is non-trivial: for a long prompt on a large model, the KV cache may be tens of gigabytes, and transferring it across a network link between prefill and decode workers takes time proportional to size. The transfer must complete before the first decode token can be generated, which adds latency to the time-to-first-token (TTFT) metric.
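
A rough sense of the transfer cost comes from dividing cache size by link bandwidth. In the sketch below the model shape matches the earlier examples, while the link speeds are hypothetical round numbers chosen for illustration; real transfers also pay protocol and synchronization overheads that are not modeled here.

```python
def kv_transfer_seconds(prompt_tokens: int, link_bytes_per_second: float,
                        layers: int = 80, kv_heads: int = 8,
                        head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Lower bound on the time to ship one request's KV cache over a link."""
    kv_bytes = prompt_tokens * layers * 2 * kv_heads * head_dim * bytes_per_value
    return kv_bytes / link_bytes_per_second


# Hypothetical link speeds in bytes per second.
links = {
    "50 GB/s network link": 50e9,
    "450 GB/s intra-node link": 450e9,
}

for name, bandwidth in links.items():
    for prompt in (8_192, 32_768, 131_072):
        ms = kv_transfer_seconds(prompt, bandwidth) * 1e3
        print(f"{name}, {prompt:>7} tokens: {ms:7.1f} ms added to TTFT")
```

Overlapping the transfer with prefill, layer by layer, is one way to hide part of this cost, at the price of a more complicated scheduler.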

For applications where throughput matters more than latency, this tradeoff is acceptable: the disaggregated system produces more tokens per dollar per second because each hardware type is used for the phase it is efficient at.

For applications where TTFT is critical (real-time conversational AI), the added transfer latency may be unacceptable.

The engineering tension here is real, and production systems navigate it differently depending on workload: Splitwise (from Microsoft Research), DistServe (from Peking University and others), and production implementations at major AI serving providers all make different tradeoffs along the latency-throughput frontier.

Flash decoding and the attention bottleneck

We have spent considerable time discussing the weight matrix reads as the memory bottleneck for decode, but for long-context inference, a second bottleneck emerges: the KV cache reads during attention computation.

The attention operation during decode requires computing, for the new token’s query vector q of shape [d_head], an attention score against every key in the KV cache of shape [context_length, d_head], and then a weighted sum of every value in the KV cache of shape [context_length, d_head].

The KV cache for a single layer, a single head, at context length L is 2 × L × d_head × 2 bytes (keys and values, BF16). For L = 128,000, d_head = 128, this is 64 MB per head per layer per request.

For a model with 80 layers and 8 KV heads, the total KV cache read per decode step is 64 MB × 80 × 8 = 40,960 MB ≈ 40 GB.

At 3.35 TB/s HBM3 bandwidth, reading 40 GB takes approximately 12 milliseconds per decode step. The weight matrix reads (140 GB at 3.35 TB/s) take approximately 42 milliseconds. At 128K context length, KV cache reads are therefore about 22% of the per-step memory read time, a non-trivial contribution.

At context lengths beyond 128K (1M tokens is now a research target), KV cache reads can dominate weight reads entirely. The bottleneck for very-long-context inference is not the model weights but the growing KV cache.
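
The crossover point can be estimated directly. This sketch assumes batch size 1, BF16 everywhere, the model shape used above, and the full headline bandwidth; the constant and function names are ours.

```python
HBM_BW = 3.35e12                         # bytes/s, figure used in the text
WEIGHT_BYTES = 140e9                     # 70B parameters in BF16
KV_BYTES_PER_TOKEN = 80 * 2 * 8 * 128 * 2   # layers * (K and V) * KV heads * head_dim * bytes


def step_read_times(context_tokens: int):
    """Seconds per decode step spent reading the KV cache and the weights."""
    kv = context_tokens * KV_BYTES_PER_TOKEN / HBM_BW
    weights = WEIGHT_BYTES / HBM_BW
    return kv, weights


# Context length at which KV cache reads equal weight reads.
crossover_tokens = WEIGHT_BYTES / KV_BYTES_PER_TOKEN

for context in (8_192, 128_000, 1_000_000):
    kv, weights = step_read_times(context)
    print(f"{context:>9} tokens: KV {kv * 1e3:6.1f} ms, weights {weights * 1e3:5.1f} ms, "
          f"KV share {kv / (kv + weights):.0%}")
print(f"crossover at about {crossover_tokens:,.0f} tokens")
```

With larger batches the crossover arrives much sooner, because the weights are read once per step while each request brings its own KV cache.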

Flash Decoding, introduced by Dao et al. and integrated into FlashAttention-3, addresses this by parallelizing the KV cache reads across thread blocks, along a different dimension from standard FlashAttention.

In standard FlashAttention, which is designed for prefill (where Q, K, and V all have sequence dimension S), the parallelism is along the query sequence dimension.

In Flash Decoding, where the query has sequence dimension 1 but the KV cache has sequence dimension L, the parallelism instead decomposes along the key/value sequence dimension, allowing multiple thread blocks to read different segments of the KV cache simultaneously and combine their partial softmax results using a numerically stable reduction.

The speedup from Flash Decoding is most pronounced when the KV cache length is large and the batch size is small, precisely the regime of long-context single-request inference where standard attention is most bottlenecked.

For L = 64K and batch size 1, Flash Decoding achieves a 4-8× speedup over a naive attention implementation. How much of that reaches end-to-end decode throughput depends on the share of each step spent in attention, which grows with context length.
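The split-and-combine idea can be illustrated without a GPU. The sketch below is plain Python, not the actual kernel: each chunk of the KV cache is processed independently (on hardware, by a separate thread block), and the partial results are merged by rescaling to a shared maximum. All names are illustrative.

```python
import math
import random


def attend_chunk(q, keys, values):
    """Attention over one KV chunk. Returns (max score, sum of exponentials,
    unnormalized weighted value sum), with exponentials taken relative to
    the chunk's own max for numerical stability."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    acc = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return m, sum(weights), acc


def combine(partials):
    """Merge per-chunk partial results by rescaling each to the global max."""
    m_global = max(m for m, _, _ in partials)
    denom = 0.0
    out = [0.0] * len(partials[0][2])
    for m, s, acc in partials:
        r = math.exp(m - m_global)
        denom += r * s
        out = [o + r * a for o, a in zip(out, acc)]
    return [o / denom for o in out]


def split_kv_attention(q, keys, values, n_splits):
    """Single-query attention computed over n_splits independent KV chunks."""
    size = math.ceil(len(keys) / n_splits)
    partials = [attend_chunk(q, keys[i:i + size], values[i:i + size])
                for i in range(0, len(keys), size)]
    return combine(partials)


rng = random.Random(0)
d_head, ctx = 8, 37
q = [rng.uniform(-1, 1) for _ in range(d_head)]
keys = [[rng.uniform(-1, 1) for _ in range(d_head)] for _ in range(ctx)]
values = [[rng.uniform(-1, 1) for _ in range(d_head)] for _ in range(ctx)]

unsplit = split_kv_attention(q, keys, values, n_splits=1)
split = split_kv_attention(q, keys, values, n_splits=4)
print(max(abs(a - b) for a, b in zip(unsplit, split)))
```

The chunks here run one after another; the point is only that the combine step makes the result independent of how the KV cache was partitioned.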

The production inference stack

The mechanisms discussed above do not exist in isolation; they compose into a production inference serving system.

It is worth describing what that stack looks like end-to-end, because the interactions between components create optimization opportunities that are invisible when examining each component individually.

A production inference server for a 70B model on an 8-GPU node, circa mid-2026, runs approximately as follows. Tensor parallelism at degree 8 distributes the weight matrices across all 8 GPUs in the node, using NVLink for the all-reduce at each layer boundary.

The effective model per GPU is 70B / 8 = 8.75B parameters, requiring approximately 17.5 GB of HBM for weights in BF16. With 80 GB per GPU, this leaves 62.5 GB per GPU for KV cache.

The KV cache is managed by PagedAttention with a block size of 16 tokens. The pool of free blocks is divided among the 8 GPUs (with TP, the KV cache is also sharded, since each GPU handles a subset of the KV heads under GQA).

For a model with 8 total KV heads under 8-way TP, each GPU handles exactly 1 KV head, and the per-GPU KV cache is accordingly 1/8 of the total.

The scheduler runs continuous batching, maintaining a queue of waiting requests and a set of active requests. At each step, it admits new requests into the batch as KV cache pages become available. It preempts requests (evicts their KV cache pages, returning them to the free pool) when memory pressure is high, re-scheduling preempted requests from the beginning (or from a checkpoint) later.

Speculative decoding is enabled with a draft model of approximately 7B parameters, contributing an additional 14 GB of weight memory across the 8 GPUs (1.75 GB per GPU) and an additional 14 GB of KV cache (1.75 GB per GPU). That leaves approximately 59 GB per GPU (62.5 − 1.75 − 1.75) for the target model KV cache.
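The per-GPU budget is simple enough to tabulate. This is a rough sketch under the figures in the text; the per-token KV size additionally assumes 80 layers and d_head = 128 (the example model from the attention section), and it ignores activation workspace and allocator overhead, which a real server must also reserve.

```python
def per_gpu_budget_gb(hbm_gb=80.0, target_params_b=70.0, draft_params_b=7.0,
                      draft_kv_gb_total=14.0, tp=8, bytes_per_param=2):
    """Split one GPU's HBM between weights and KV cache under tensor parallelism.
    Parameter counts are in billions, so params * bytes gives GB directly."""
    target_weights = target_params_b * bytes_per_param / tp
    draft_weights = draft_params_b * bytes_per_param / tp
    draft_kv = draft_kv_gb_total / tp
    return {
        "target_weights_gb": target_weights,
        "draft_weights_gb": draft_weights,
        "draft_kv_gb": draft_kv,
        "target_kv_gb": hbm_gb - target_weights - draft_weights - draft_kv,
    }


def kv_bytes_per_token_per_gpu(n_layers=80, d_head=128, kv_heads_per_gpu=1,
                               bytes_per_elem=2):
    """Keys + values for one token across all layers, for the KV heads on one GPU."""
    return 2 * n_layers * d_head * kv_heads_per_gpu * bytes_per_elem


budget = per_gpu_budget_gb()
tokens = int(budget["target_kv_gb"] * 1e9 // kv_bytes_per_token_per_gpu())
blocks = tokens // 16   # PagedAttention block size from the text

for name, gb in budget.items():
    print(f"{name:>18}: {gb:6.2f} GB")
print(f"KV capacity: about {tokens:,} tokens, or {blocks:,} blocks of 16")
```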

Long prompts are prefilled in chunks (chunked prefill): instead of processing a 64K-token prompt as a single prefill operation that saturates compute for a long time and blocks decode requests from running, the prompt is processed in chunks of, say, 2048 tokens per step, interleaved with decode steps.

This trades slightly higher TTFT for dramatically better decode latency for concurrent users, a tradeoff that is almost always correct in a multi-user serving scenario.

What the series has built

Ten parts. From SMs to NVSwitch. From warp scheduling to speculative decoding. From single-GPU kernels to multi-node inference stacks.

Across all of it, one idea stays invariant: a GPU is not a “compute machine” in the naive sense. It is a latency-hiding system. Every layer of the stack exists to keep data moving while something else is waiting: warps hiding memory latency, NVLink hiding interconnect latency, batching hiding kernel inefficiency, and scheduling hiding autoregressive seriality.

At scale, the same structure repeats. Training systems hide communication under compute. Inference systems hide serial generation under parallel requests. Serving stacks hide memory fragmentation under virtualized allocation.

The details change, but the pattern does not: identify the bottleneck, then restructure the system so that bottleneck is no longer visible to the critical path.

What emerges is not a collection of optimizations, but a consistent way of thinking about hardware systems. Everything reduces to one question:

what resource is constraining progress right now (compute, memory bandwidth, or communication) and how do we prevent it from sitting idle?

The specific techniques will evolve. NVLink will be replaced, attention kernels will be rewritten, new quantization schemes will appear, and hardware will continue to shift the ridge points we computed throughout this series.

But the underlying structure will not change, because it is not a property of transformers: it is a property of physics.

That is the real object we have been studying. Not GPUs. Not transformers. But the constraints that govern how any system can compute under finite bandwidth and finite time.

Mastering CUDA and High-Performance Computing, Part IX

Lorenzo Bradanini — Tue, 21 Apr 2026 09:19:01 GMT

Where Part VIII Left Us

Part VIII ended at something close to a philosophical statement: the SMSP on a well-tuned Hopper GEMM kernel is a machine that does one thing. Everything else has been delegated.

That is true, and it is beautiful, and it is also irrelevant the moment the model you are trying to train does not fit in 80 GB of HBM3.

GPT-3 has 175 billion parameters. In BF16 that is 350 GB of weights alone, before you add optimizer state, activations, and gradients. A single H100 has 80 GB. You need at least five of them just to hold the parameters, and in practice you need significantly more to make training feasible.

At this point the single-GPU roofline model, with its ridge points, its arithmetic intensity calculations, its tensor core utilization percentages, becomes necessary but not sufficient.

You need a new abstraction layer that sits above the GPU and treats a rack, or a pod, or a datacenter, as the compute substrate.

This part is about that layer. We will cover tensor parallelism, pipeline parallelism, data parallelism, and the collective communication primitives that tie them together.

We will look at NCCL, at NVLink topology and how it interacts with bandwidth requirements, and at the specific arithmetic of why certain parallelism strategies work and others do not at scale.

We will go into great detail. Fasten your seatbelts.

The memory wall has not gone away

Before we talk about parallelism strategies, we need to internalize what “model doesn’t fit” actually means, quantitatively.

A transformer with P parameters trained in mixed precision requires, at minimum:

2P bytes for the model weights in BF16
4P bytes for the master weights in FP32 (kept by the optimizer for numerical stability)
8P bytes for the Adam optimizer states (m and v vectors, both FP32)
2P bytes for the gradients in BF16

Total: 16P bytes in the steady state, not counting activations.

For a 70B parameter model (Llama 3 scale), this is 1,120 GB: fourteen H100s' worth of HBM just for the weights, gradients, and optimizer state, before a single activation is stored. This is not a pathological edge case; this is the routine reality of training frontier models.
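The 16P accounting is easy to turn into a sizing helper. A minimal sketch, assuming the byte counts listed above and 80 GB of HBM per GPU; activations are deliberately left out, as in the text.

```python
import math

# Bytes per parameter for mixed-precision Adam training, as itemized in the text.
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "fp32_master_weights": 4,
    "adam_m_and_v_fp32": 8,
    "bf16_gradients": 2,
}


def training_state_gb(params_billion):
    """Steady-state training memory in GB, excluding activations."""
    return {name: b * params_billion for name, b in BYTES_PER_PARAM.items()}


def min_gpus_to_hold(total_gb, hbm_gb=80):
    """Lower bound on GPU count from capacity alone."""
    return math.ceil(total_gb / hbm_gb)


for params_b in (70, 175):
    total = sum(training_state_gb(params_b).values())
    print(f"{params_b}B params: {total:,.0f} GB of state, "
          f"at least {min_gpus_to_hold(total)} x 80 GB GPUs")
```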

Inference is cheaper (you do not need optimizer state) but for a 405B parameter model in FP8, you are still looking at 405 GB, spread across at minimum six H100 80GB instances, with careful attention to how the tensor operations are partitioned so that no single GPU computes a matrix multiply that requires moving activations larger than the HBM capacity.

The problem is therefore not just “how do we make one GPU fast” but “how do we decompose a computation that is too large for one GPU into pieces that run efficiently on many GPUs, with the communication overhead between those pieces small enough that the multi-GPU system achieves a meaningful fraction of the theoretical sum of its parts.”

That fraction has a name: parallel efficiency. Getting it above 0.5 for a thousand-GPU training run is hard. Getting it above 0.8 is a research problem.

Getting it above 0.9 is what separates companies that can train frontier models economically from companies that cannot.

Three orthogonal dimensions of parallelism

The standard taxonomy, established empirically by the Megatron-LM work at NVIDIA and subsequently refined, identifies three orthogonal axes along which a transformer training job can be parallelized.

Data Parallelism (DP): Replicate the model across N GPUs, partition the training batch into N micro-batches, run a forward and backward pass independently on each GPU, and then average the gradients across all N replicas. Every GPU holds the full model. The communication pattern is a single all-reduce over the gradient tensors after each backward pass.

Tensor Parallelism (TP): Partition individual weight matrices across N GPUs, so that each GPU holds a 1/N shard of each matrix. A single matrix multiply that would require, say, a 4096×16384 GEMM on one GPU instead requires a 4096×(16384/N) GEMM on each of the N GPUs, followed by a collective to reassemble the result.

The communication pattern is tightly coupled to the forward pass; an all-reduce (or all-gather + reduce-scatter) at every layer boundary.

Pipeline Parallelism (PP): Partition the layers of the model across N GPUs, so that GPU 0 holds layers 1–L/N, GPU 1 holds layers L/N+1 through 2L/N, and so on. A micro-batch flows through the pipeline sequentially.

The communication pattern is a point-to-point activation transfer between adjacent stages, one per micro-batch per stage boundary.

These three dimensions compose. The Megatron-LM paper that trained GPT-3-scale models on A100 clusters used all three simultaneously: 8-way TP within a node (exploiting NVSwitch), 4-way PP across nodes, and data parallelism across node groups.

The product is the total GPU count: 8 × 4 × D = total GPUs, where D is the data parallel degree.

Understanding why this particular combination was chosen, and not, say, 64-way TP with no PP, requires understanding the communication topology and the arithmetic of collective operations.

That is what the rest of this part is about.

NVLink and NVSwitch

Communication between GPUs can happen over two physical fabrics: PCIe and NVLink. The performance difference between them is not small.

PCIe 4.0 x16 provides 32 GB/s unidirectional bandwidth. PCIe 5.0 x16 doubles that to 64 GB/s. These numbers sound reasonable until you compare them to what you actually need during all-reduce.

NVLink 4.0 (H100) provides 900 GB/s bidirectional bandwidth per GPU in NVLink-connected configurations: that is 450 GB/s in each direction. This is roughly 7× better than PCIe 5.0 in each direction, and the real-world benefit is larger because NVLink latency is also significantly lower.

But “NVLink-connected” hides a critical topological detail. NVLink connects pairs of GPUs (or GPUs through an NVSwitch). The DGX H100 system has 8 GPUs connected through NVSwitch 3.0, which provides full all-to-all connectivity at 900 GB/s per GPU.

This means any GPU can communicate with any other GPU in the same node at full bandwidth simultaneously. The NVSwitch acts as a non-blocking switch fabric.

Across nodes, the picture changes entirely. Multi-node communication happens over InfiniBand (HDR or NDR), with typical all-reduce bandwidth of 25–50 GB/s per GPU depending on topology and rail configuration: roughly 10–20× slower than intra-node NVLink.

This 10–20× bandwidth gap between intra-node and inter-node communication is the single most important physical fact for understanding why multi-GPU parallelism is structured the way it is.

The implication is immediate: communication-heavy parallelism strategies (like tensor parallelism, which requires an all-reduce at every layer) should be confined within a node, where NVLink bandwidth makes the overhead acceptable.

Communication-light parallelism strategies (like pipeline parallelism, which only requires activation transfers at layer boundaries) can span node boundaries.

This is exactly what Megatron-LM does, and the reason is physics, not convention.

Data parallelism in depth

Data parallelism is the simplest strategy and the one that scales best in terms of implementation complexity. It is also the one where the communication overhead is most amenable to hiding behind computation, given careful engineering.

The communication requirement for data parallelism is an all-reduce over the gradient tensors after each backward pass. For a model with P parameters in BF16, this all-reduce moves 2P bytes of data through the network.

For a 70B model, that is 140 GB per all-reduce. At 25 GB/s inter-node InfiniBand, a naive all-reduce would take approximately 5.6 seconds. For a training step that takes 2–3 seconds of compute, this is catastrophically inefficient. The GPU would be idle for twice as long as it was computing.

The solution is to overlap gradient communication with the backward pass computation. As each layer’s gradients are computed during the backward pass, those gradients can be immediately all-reduced while the backward pass continues computing the gradients of earlier layers.

This is called gradient overlap, and it is implemented in PyTorch via the DistributedDataParallel (DDP) bucket mechanism: gradients are grouped into buckets of approximately 25 MB, and an all-reduce is launched for each bucket as soon as it fills, overlapping with the backward computation of earlier layers.

The efficiency of gradient overlap depends on the ratio of compute time to communication time per layer.

For large models with large batch sizes, this ratio is favorable: the layers are compute-heavy, and there is always useful computation happening while the all-reduce for a previous layer’s gradients is in flight.

For small models or small batch sizes, the backward computation per layer is short and the all-reduce cannot be fully hidden.

This is one reason why very large batch sizes are computationally efficient beyond the obvious “more samples per step” benefit: larger batches mean longer per-layer compute time, which means more time to hide communication.
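A toy timeline makes the compute-to-communication ratio concrete. The model below is a deliberate simplification and not how DDP is implemented: it assumes equal-sized buckets, a single communication channel that handles one all-reduce at a time, and no interference between compute and communication.

```python
def backward_finish_ms(n_buckets, compute_per_bucket_ms, comm_per_bucket_ms):
    """Time at which backward and all gradient all-reduces have finished, when
    each bucket's all-reduce starts as soon as its gradients exist and the
    link is free."""
    link_free_at = 0.0
    for i in range(n_buckets):
        ready_at = (i + 1) * compute_per_bucket_ms   # bucket i's gradients done
        start = max(ready_at, link_free_at)
        link_free_at = start + comm_per_bucket_ms
    return max(n_buckets * compute_per_bucket_ms, link_free_at)


def exposed_comm_ms(n_buckets, compute_per_bucket_ms, comm_per_bucket_ms):
    """Communication time that could not be hidden behind backward compute."""
    total = backward_finish_ms(n_buckets, compute_per_bucket_ms, comm_per_bucket_ms)
    return total - n_buckets * compute_per_bucket_ms


# Compute-heavy layers: only the final bucket's all-reduce is exposed.
print(exposed_comm_ms(10, compute_per_bucket_ms=4.0, comm_per_bucket_ms=1.0))
# Communication-heavy layers: most of the all-reduce time is exposed.
print(exposed_comm_ms(10, compute_per_bucket_ms=1.0, comm_per_bucket_ms=4.0))
```

Doubling the per-GPU batch roughly doubles `compute_per_bucket_ms` while leaving `comm_per_bucket_ms` unchanged, which is the effect described above.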

ZeRO: when the model doesn’t fit, but you want data parallelism anyway

Vanilla data parallelism replicates the full model on every GPU. For a 70B model, this requires every GPU to have 1,120 GB of memory (with optimizer state), which is physically impossible today and will remain so for some time.

ZeRO (Zero Redundancy Optimizer), developed by Microsoft DeepSpeed, addresses this by partitioning the model state across the data parallel group rather than replicating it.

ZeRO comes in three stages of increasing memory savings and communication cost:

ZeRO-1: Partition the optimizer state (FP32 master weights plus the Adam m and v vectors, 12P bytes in total) across the DP group. Each GPU holds a 1/N shard of the optimizer state. Communication overhead is unchanged (gradients are still all-reduced). Memory savings: up to 4× (16P → 4P + 12P/N per GPU).

ZeRO-2: Partition the gradients in addition to the optimizer state. After the gradient reduction, each GPU keeps only its 1/N shard of the gradients (the portion it needs for its optimizer state shard). Memory savings: up to 8× (16P → 2P + 14P/N per GPU). Communication overhead: unchanged.

ZeRO-3: Partition the model parameters as well. Each GPU holds only 1/N of the model weights at any given time. During the forward and backward pass, the needed weight shards are all-gathered from the DP group just-in-time. Memory savings: N× (16P → 16P/N per GPU), so 64× at a DP degree of 64. Communication overhead: increases by 1.5× compared to vanilla DDP (due to the all-gather operations for parameters).

ZeRO-3 with a DP degree of 64 reduces the per-GPU memory for a 70B model from 1,120 GB to approximately 17.5 GB: comfortably fitting on a single H100. The tradeoff is the 1.5× increase in communication volume, which must be weighed against the larger batch sizes that ZeRO-3 enables.
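The per-GPU footprint at each stage follows directly from the 16P accounting used earlier in this part (2P weights, 2P gradients, 12P for FP32 master weights and Adam moments). A minimal sketch with illustrative names; activations and temporary gather buffers are not modeled.

```python
def zero_bytes_per_param(stage, dp_degree):
    """Model-state bytes per parameter held by each GPU under ZeRO stage 0-3."""
    weights, grads, optimizer = 2.0, 2.0, 12.0
    if stage >= 1:
        optimizer /= dp_degree   # ZeRO-1 shards optimizer state
    if stage >= 2:
        grads /= dp_degree       # ZeRO-2 also shards gradients
    if stage >= 3:
        weights /= dp_degree     # ZeRO-3 also shards the weights
    return weights + grads + optimizer


def zero_gb_per_gpu(params_billion, stage, dp_degree):
    """Per-GPU model state in GB (parameter count given in billions)."""
    return params_billion * zero_bytes_per_param(stage, dp_degree)


for stage in range(4):
    gb = zero_gb_per_gpu(70, stage, dp_degree=64)
    print(f"ZeRO-{stage} (stage 0 = plain DP): {gb:8.1f} GB per GPU")
```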

The engineering implementation of ZeRO-3 is non-trivial: parameters must be gathered before each layer’s forward pass and immediately freed afterward (unless gradient checkpointing is also active, in which case they must be re-gathered during the backward pass as well).

The memory allocator must be aware of these temporary parameter buffers and free them aggressively.

DeepSpeed’s implementation of ZeRO-3 does this with a parameter prefetch buffer: as layer i is executing its forward pass, layer i+1’s parameters are being all-gathered in the background, overlapping communication with computation at the layer level rather than the bucket level.

Tensor parallelism in depth

Tensor parallelism (TP), as formalized in the Megatron-LM paper, exploits the specific structure of transformer layers to split individual matrix multiplications across multiple GPUs.

Consider a single transformer MLP layer with weight matrices W1 of shape [d_model, d_ffn] and W2 of shape [d_ffn, d_model], where d_ffn = 4 × d_model. For a 70B model, d_model ≈ 8192, so d_ffn ≈ 32768.

The output of the MLP layer is: Y = GeLU(XW1)W2

With 8-way tensor parallelism, W1 is split column-wise across 8 GPUs, each holding W1_i of shape [d_model, d_ffn/8]. W2 is split row-wise across 8 GPUs, each holding W2_i of shape [d_ffn/8, d_model].

The computation on GPU i becomes:

1. Y1_i = X × W1_i (local GEMM, shape [batch, d_ffn/8])
2. Z_i = GeLU(Y1_i) (local elementwise)
3. Y2_i = Z_i × W2_i (local GEMM, shape [batch, d_model])
4. Y = AllReduce(Y2_i) (sum partial results across 8 GPUs, shape [batch, d_model])
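That the column-then-row split reproduces the unsharded MLP can be checked on tiny matrices. The sketch below uses plain Python lists, with a loop standing in for the GPUs and an elementwise sum standing in for the all-reduce; it illustrates the algebra only, not a distributed implementation.

```python
import math
import random


def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gelu(m):
    """Exact (erf-based) GeLU applied elementwise."""
    return [[0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0))) for x in row] for row in m]


def mlp(x, w1, w2):
    """Unsharded reference: Y = GeLU(X W1) W2."""
    return matmul(gelu(matmul(x, w1)), w2)


def tp_mlp(x, w1, w2, n):
    """Same MLP with W1 split column-wise and W2 split row-wise across n shards."""
    shard = len(w2) // n                      # d_ffn / n
    partials = []
    for i in range(n):                        # each iteration plays one GPU
        cols = slice(i * shard, (i + 1) * shard)
        w1_i = [row[cols] for row in w1]      # [d_model, d_ffn/n]
        w2_i = w2[cols]                       # [d_ffn/n, d_model]
        z_i = gelu(matmul(x, w1_i))           # local GEMM + local GeLU
        partials.append(matmul(z_i, w2_i))    # local GEMM, partial sum of Y
    # "All-reduce": elementwise sum of the partial outputs.
    rows, cols_out = len(x), len(w2[0])
    return [[sum(p[r][c] for p in partials) for c in range(cols_out)]
            for r in range(rows)]


rng = random.Random(0)
batch, d_model, d_ffn = 3, 4, 16
x = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(batch)]
w1 = [[rng.uniform(-1, 1) for _ in range(d_ffn)] for _ in range(d_model)]
w2 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ffn)]

full, sharded = mlp(x, w1, w2), tp_mlp(x, w1, w2, n=4)
print(max(abs(a - b) for ra, rb in zip(full, sharded) for a, b in zip(ra, rb)))
```

The split works because GeLU is elementwise: applying it to a column shard of XW1 gives exactly the corresponding columns of GeLU(XW1), so no communication is needed until the final sum.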

The all-reduce at step 4 is the communication bottleneck. Its cost is:

2 × (N-1)/N × |Y2| bytes

For a batch of 2048 tokens, d_model = 8192, BF16: |Y2| = 2048 × 8192 × 2 bytes ≈ 33.5 MB (32 MiB). At 450 GB/s NVLink bandwidth with N = 8, this all-reduce takes approximately 2 × 7/8 × 33.5 MB / 450 GB/s ≈ 130 microseconds.

The compute time for the two local GEMMs is: 2 GEMMs × 2 FLOPs per multiply-add × (2048 × 8192 × 32768/8) ÷ (494 TFLOP/s per GPU) = 2 × 2 × 2048 × 8192 × 4096 / 494e12 ≈ 0.56 milliseconds, or approximately 0.28 ms per GEMM.

Communication (130 µs) is roughly 23% of compute (560 µs). This is a workable ratio: the all-reduce is a tax on the layer, not its dominant cost.

But now notice what happens if you push TP degree from 8 to 64 over InfiniBand at 25 GB/s instead of NVLink at 450 GB/s: the all-reduce takes 33.5 MB / 25 GB/s × (2 × 63/64) ≈ 2.6 ms, several times the 8-way compute time.

The compute per GPU has also dropped by 8× (fewer flops per GPU due to more partitioning), so the two GEMMs together take 0.56 ms / 8 ≈ 70 µs.

The all-reduce (2.6 ms) now takes roughly 38× longer than the compute it sits behind. You are GPU-idle for about 97% of the time. This is why tensor parallelism over InfiniBand at high TP degrees is a terrible idea: not a theoretical concern but an arithmetic certainty.

The intra-node NVLink constraint on TP degree is therefore typically TP ≤ 8 for an 8-GPU node, because that is the largest group the NVSwitch fabric connects at full NVLink bandwidth, and within it the communication overhead stays a modest fraction of compute.
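The comparison can be recomputed for any TP degree and fabric. This is a back-of-the-envelope sketch: it applies the 2(N−1)/N all-reduce cost formula with no latency term, counts 2 FLOPs per multiply-add, and takes the 494 TFLOP/s per-GPU rate from the text as a sustained figure. The function names are illustrative.

```python
def allreduce_time_s(n_bytes, n_gpus, bandwidth_bytes_per_s):
    """Bandwidth-only all-reduce cost: 2 (N-1)/N * bytes / bandwidth."""
    return 2 * (n_gpus - 1) / n_gpus * n_bytes / bandwidth_bytes_per_s


def mlp_gemm_time_s(tokens, d_model, d_ffn, tp, flops_per_s):
    """Time for both local MLP GEMMs on one GPU under tp-way tensor parallelism."""
    flops_per_gemm = 2 * tokens * d_model * (d_ffn // tp)
    return 2 * flops_per_gemm / flops_per_s


TOKENS, D_MODEL, D_FFN = 2048, 8192, 32768
Y2_BYTES = TOKENS * D_MODEL * 2    # BF16 output of the row-parallel GEMM
SUSTAINED_FLOPS = 494e12           # per-GPU rate used in the text

results = {}
for tp, bandwidth, fabric in [(8, 450e9, "NVLink"), (64, 25e9, "InfiniBand")]:
    comm = allreduce_time_s(Y2_BYTES, tp, bandwidth)
    compute = mlp_gemm_time_s(TOKENS, D_MODEL, D_FFN, tp, SUSTAINED_FLOPS)
    results[tp] = (comm, compute)
    print(f"TP={tp:>2} over {fabric:<10}: all-reduce {comm * 1e6:7.0f} us, "
          f"GEMMs {compute * 1e6:6.0f} us, comm/compute = {comm / compute:.2f}")
```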

Self-attention and sequence parallelism

The attention mechanism has a different structure than the MLP, but the Megatron-LM approach handles it symmetrically: Q, K, and V projection matrices are split column-wise (head-parallel), and the output projection is split row-wise. Each GPU handles a subset of attention heads.

For H total heads and TP degree N, each GPU computes H/N heads independently. No communication is needed within the attention computation itself; only the output projection requires an all-reduce.

This works cleanly as long as H is divisible by N, which is why transformer architectures are almost always designed with head counts that are powers of 2 or multiples of 8.

There is an additional subtlety for attention: LayerNorm and dropout require the full activation tensor, not a sharded one.

In vanilla TP, these operations run on the full (all-reduced) activations, which means they see the full sequence length at full d_model dimension: they are not distributed and do not benefit from the TP decomposition.

Sequence Parallelism (SP), introduced as an extension to Megatron-LM TP, addresses this by replacing the all-reduce at layer boundaries with an all-gather + reduce-scatter pattern, and distributing the non-tensor-parallel operations (LayerNorm, dropout) across the sequence dimension rather than the model dimension.

In SP, the activation between transformer layers is sharded across the TP group along the sequence dimension: each GPU holds [batch, seq_len/N, d_model] instead of [batch, seq_len, d_model].

Before entering the tensor-parallel MLP, an all-gather reconstructs the full [batch, seq_len, d_model] tensor. After the MLP’s row-parallel W2 partial sum, a reduce-scatter simultaneously sums the partial results and re-shards the output along the sequence dimension.

The communication volume is identical to the all-reduce, but the memory advantage is significant: activations are now partitioned across the TP group, reducing the peak memory per GPU by a factor of N for the activation tensors.

At large sequence lengths (128K tokens, as in recent long-context models), this memory saving is the difference between fitting in HBM and catastrophic OOM.

Pipeline Parallelism

Pipeline parallelism is the ugliest of the three strategies. This is a statement of fact, not an aesthetic judgment.

It introduces pipeline bubbles (periods where some GPUs are idle because they are waiting for activations from the previous stage) and managing those bubbles is the central engineering challenge.

In the simplest pipeline schedule (GPipe), a batch of M micro-batches flows through P pipeline stages. Micro-batch 0 runs forward through stage 0, then stage 1, ..., then stage P-1. As soon as stage 0 hands micro-batch 0 to stage 1, it starts on micro-batch 1, then micro-batch 2, and so on, so the stages fill up one after another.

The backward pass happens in reverse, with a full flush between the forward and backward sweeps. The pipeline bubble fraction (time wasted on idle GPUs as a fraction of total time) is approximately:

bubble fraction ≈ (P − 1) / (M + P − 1)

For P = 4 stages and M = 8 micro-batches: bubble fraction ≈ 3/11 ≈ 27%. More than a quarter of GPU time is wasted.

To reduce the bubble, you increase M. For M = 32: bubble fraction ≈ 3/35 ≈ 8.5%. For M → ∞, bubble fraction → 0.

But increasing M increases the memory required to store the activations of all in-flight micro-batches during the forward pass before the backward pass can begin (GPipe requires storing all activations). For M micro-batches each with activation size A, the activation memory is M × A, and this grows linearly with M.

1F1B (One Forward One Backward) scheduling, introduced in PipeDream and used by Megatron-LM, breaks this linear activation scaling. In 1F1B, each pipeline stage interleaves one forward pass and one backward pass for different micro-batches, rather than running all forwards before any backwards.

The pipeline still has a bubble at startup and drain, but the steady-state memory is bounded: at any point, a stage has at most P micro-batches’ activations in flight (not M). The bubble fraction is the same as GPipe ((P−1)/(M+P−1)), but the activation memory is P × A instead of M × A.

For large models where M must be large to amortize the bubble, this is a crucial difference. With M = 64 and P = 8, GPipe stores 64 micro-batches of activations; 1F1B stores at most 8.
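Both quantities are one-liners, which makes it easy to explore the tradeoff for a given cluster. A minimal sketch of the formulas above; the function names are illustrative, and the in-flight count for 1F1B is the steady-state upper bound described in the text.

```python
def bubble_fraction(stages, microbatches):
    """Idle fraction of a GPipe or 1F1B schedule: (P - 1) / (M + P - 1)."""
    return (stages - 1) / (microbatches + stages - 1)


def peak_inflight_microbatches(stages, microbatches, schedule):
    """Upper bound on micro-batches whose activations a stage holds at once."""
    if schedule == "gpipe":
        return microbatches                  # all forwards run before any backward
    if schedule == "1f1b":
        return min(stages, microbatches)     # bounded by pipeline depth
    raise ValueError(f"unknown schedule: {schedule}")


for p, m in [(4, 8), (4, 32), (8, 64)]:
    print(f"P={p}, M={m}: bubble {bubble_fraction(p, m):.1%}, "
          f"in flight: GPipe {peak_inflight_microbatches(p, m, 'gpipe')}, "
          f"1F1B {peak_inflight_microbatches(p, m, '1f1b')}")
```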

Interleaved 1F1B goes further: each GPU holds V virtual pipeline stages (chunks of layers) instead of one contiguous block, enabling the bubble fraction to be reduced to:

bubble fraction ≈ (P − 1) / (V × M + P − 1)

with V times the point-to-point communication per step (because each micro-batch now crosses V times as many stage boundaries). This is the schedule used by Megatron-LM for the largest training jobs, with V=2 or V=4 providing a roughly 2-4× reduction in bubble at a 2-4× increase in inter-stage communication.
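The effect of V can be tabulated from the formula above. In this sketch the communication multiplier is simply V, following the text; P = 8 and M = 64 are example values, not a recommended configuration.

```python
def interleaved_bubble_fraction(stages, microbatches, virtual_stages=1):
    """Bubble fraction with V virtual stages per GPU: (P - 1) / (V * M + P - 1)."""
    return (stages - 1) / (virtual_stages * microbatches + stages - 1)


P_STAGES, M_MICROBATCHES = 8, 64
baseline = interleaved_bubble_fraction(P_STAGES, M_MICROBATCHES, 1)

for v in (1, 2, 4):
    frac = interleaved_bubble_fraction(P_STAGES, M_MICROBATCHES, v)
    print(f"V={v}: bubble {frac:.2%} ({baseline / frac:.1f}x smaller than V=1), "
          f"inter-stage communication x{v}")
```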

The Activation Recomputation Tradeoff

One more tool for managing activation memory in pipeline parallelism (and, frankly, in any large training run): gradient checkpointing, also called activation recomputation.

The idea is simple: instead of storing the full activation tensor for every layer during the forward pass (needed for the backward pass), you store only the activations at checkpoint boundaries (e.g., every transformer block) and recompute the intermediate activations on-demand during the backward pass by running the forward computation again.

The memory cost is reduced from O(L × A) to O(√L × A) for optimal checkpoint placement (checkpoint every √L layers). The compute cost increases by approximately 30-40% (one extra forward pass per layer, amortized).
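The square-root behaviour falls out of a simple count. The sketch below is a simplified model: it counts one stored activation per checkpoint plus the activations of the one segment being recomputed, and it assumes the backward pass costs about twice the forward pass. Both are assumptions for illustration, not measurements.

```python
import math


def stored_activations(n_layers, checkpoint_every):
    """Peak activations held: one per checkpoint, plus one recomputed segment."""
    n_checkpoints = math.ceil(n_layers / checkpoint_every)
    return n_checkpoints + checkpoint_every


def recompute_overhead(backward_to_forward_ratio=2.0):
    """Extra compute from one additional forward pass, relative to fwd + bwd."""
    return 1.0 / (1.0 + backward_to_forward_ratio)


N_LAYERS = 80
best_interval = min(range(1, N_LAYERS + 1),
                    key=lambda k: stored_activations(N_LAYERS, k))

print(f"no checkpointing: {N_LAYERS} layer activations")
print(f"checkpoint every {best_interval} layers: "
      f"{stored_activations(N_LAYERS, best_interval)} "
      f"(2 * sqrt(L) = {2 * math.sqrt(N_LAYERS):.1f})")
print(f"compute overhead: about {recompute_overhead():.0%}")
```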

For training at the frontier, this tradeoff is almost always worth it: compute is more abundant than HBM capacity, and the alternative is buying more GPUs (or, equivalently, using more PP stages, which increases the bubble).

The interaction between activation recomputation and pipeline parallelism is non-trivial: if you are recomputing activations, the backward pass must re-run the forward computation for each stage, which means each stage must still have access to the input activations from the forward pass checkpoint.

This constrains how aggressively you can combine PP and recomputation, and the Megatron-LM codebase has explicit logic for managing which activations are stored versus recomputed across pipeline stages.

The library that makes multi-GPU work

Everything discussed above (all-reduce, all-gather, reduce-scatter) is implemented in practice by NCCL (NVIDIA Collective Communications Library).

Understanding what NCCL does, and more importantly how it does it, is necessary for diagnosing performance problems at scale.

NCCL implements the standard collective communication operations (AllReduce, AllGather, ReduceScatter, Broadcast, Reduce), plus point-to-point send and receive, on NVIDIA GPUs, with topology-aware algorithms that exploit NVLink, NVSwitch, and InfiniBand according to the detected hardware configuration.

Ring-AllReduce vs Tree-AllReduce

The canonical AllReduce algorithm for a ring of N GPUs is ring-allreduce, introduced in the deep learning context by Baidu Research and later made famous by the Horovod library.

In ring-allreduce, GPUs are arranged in a logical ring. AllReduce is decomposed into two phases:

Reduce-Scatter: Each GPU sends a chunk of its data to the next GPU in the ring, while simultaneously receiving and accumulating a chunk from the previous GPU. After N−1 steps, each GPU holds the fully reduced value for one chunk (its 1/N shard of the full tensor).

AllGather: Each GPU broadcasts its reduced chunk to all others by rotating it around the ring. After N−1 more steps, every GPU holds the fully reduced tensor.
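The two phases can be simulated on plain lists to see that every rank ends with the full sum after sending only 2(N−1) chunks. This is an illustrative single-process simulation, not NCCL's implementation; the chunk-indexing convention is one of several equivalent choices.

```python
def ring_allreduce(buffers):
    """Sum-all-reduce over a logical ring. buffers[r] is rank r's vector; its
    length must be divisible by the number of ranks. Returns the per-rank
    results and the number of elements each rank sent."""
    n = len(buffers)
    size = len(buffers[0])
    if size % n:
        raise ValueError("vector length must be divisible by the ring size")
    c = size // n
    data = [list(b) for b in buffers]
    sent = 0

    def seg(idx):
        return slice(idx * c, (idx + 1) * c)

    # Phase 1, reduce-scatter: at step s, rank r sends chunk (r - s) mod n to
    # rank r + 1, which accumulates it into its own copy of that chunk.
    for step in range(n - 1):
        msgs = [((r - step) % n, data[r][seg((r - step) % n)]) for r in range(n)]
        for r, (idx, payload) in enumerate(msgs):
            dst = (r + 1) % n
            data[dst][seg(idx)] = [a + b for a, b in zip(data[dst][seg(idx)], payload)]
        sent += c
    # Rank r now holds the fully reduced chunk (r + 1) mod n.

    # Phase 2, all-gather: at step s, rank r forwards chunk (r + 1 - s) mod n.
    for step in range(n - 1):
        msgs = [((r + 1 - step) % n, data[r][seg((r + 1 - step) % n)]) for r in range(n)]
        for r, (idx, payload) in enumerate(msgs):
            data[(r + 1) % n][seg(idx)] = payload
        sent += c
    return data, sent


ranks = [[float(r * 10 + i) for i in range(8)] for r in range(4)]
out, sent_per_rank = ring_allreduce(ranks)
print(out[0])
print(f"elements sent per rank: {sent_per_rank} of {len(ranks[0])} in the vector")
```

Each rank sends 2(N−1)/N of the vector in total, regardless of N, which is why the bandwidth cost of the ring is nearly independent of the number of GPUs.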

The latency of ring-allreduce scales as 2(N−1)α + 2(N−1)/N × |data|/β, where α is the point-to-point latency and β is the bandwidth.

For large data (|data| >> N × α × β, the size at which the two terms are equal), the bandwidth term dominates and the algorithm is efficient. For small data, the latency term (proportional to N) dominates and ring-allreduce becomes expensive.

This is why AllReduce for gradient synchronization (large tensors, many GB) works well with ring-allreduce, but AllReduce for small tensors (like the normalization statistics in LayerNorm) can benefit from tree-based algorithms that have O(log N) latency scaling at the cost of non-optimal bandwidth utilization.
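The tradeoff shows up even in a crude cost model. In the sketch below the ring cost is the formula given above; the tree cost is a deliberately simplified reduce-then-broadcast binary tree that sends the whole message on every hop, which is an assumption for illustration and not how NCCL's tree algorithm is implemented. The 5 µs latency is likewise an assumed value.

```python
import math


def ring_time_s(n, n_bytes, alpha_s, beta_bytes_per_s):
    """Ring all-reduce: 2(N-1) latency hops plus 2(N-1)/N of the data per rank."""
    return 2 * (n - 1) * alpha_s + 2 * (n - 1) / n * n_bytes / beta_bytes_per_s


def simple_tree_time_s(n, n_bytes, alpha_s, beta_bytes_per_s):
    """Simplified binary tree: ceil(log2 N) hops up and down, full message per hop."""
    hops = 2 * math.ceil(math.log2(n))
    return hops * (alpha_s + n_bytes / beta_bytes_per_s)


N_GPUS = 64
ALPHA = 5e-6    # assumed point-to-point latency, seconds
BETA = 25e9     # bytes/s, the inter-node figure used in the text

for label, n_bytes in [("4 KB", 4096), ("1 GB", 1e9)]:
    ring = ring_time_s(N_GPUS, n_bytes, ALPHA, BETA)
    tree = simple_tree_time_s(N_GPUS, n_bytes, ALPHA, BETA)
    winner = "ring" if ring < tree else "tree"
    print(f"{label:>5}: ring {ring * 1e3:9.3f} ms, tree {tree * 1e3:9.3f} ms -> {winner}")
```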

NCCL automatically selects the algorithm based on message size, topology, and a heuristic tuning table, but understanding the underlying tradeoff is necessary when the automatic selection is suboptimal for your specific workload.

NCCL and topology awareness

NCCL’s topology detection is worth examining in detail because it directly determines which algorithm it selects for intra-node vs inter-node operations.

On startup, NCCL probes the system topology using the CUDA device properties API and, where available, NVML (NVIDIA Management Library) topology information.

It constructs an internal graph where GPUs are nodes and NVLink/PCIe/InfiniBand connections are edges with associated bandwidths.

For an 8-GPU DGX H100 node with NVSwitch, NCCL detects a fully connected graph with 900 GB/s bidirectional bandwidth.

For multi-node communication over InfiniBand with SHARP-capable switches, NCCL can use SHARP (Scalable Hierarchical Aggregation and Reduction Protocol), an in-network computing feature of modern InfiniBand switches that performs the AllReduce reduction inside the switch fabric rather than at the endpoint GPUs.

SHARP effectively moves the bandwidth bottleneck from the GPU NICs to the switch, and for large-scale clusters with SHARP-capable HDR/NDR InfiniBand, it can reduce AllReduce latency by 2-3× compared to standard ring-allreduce.

For hierarchical topologies (NVLink within nodes, InfiniBand between nodes), NCCL uses a two-level algorithm: a ReduceScatter within each node using the fast NVLink fabric, followed by an AllReduce across nodes over InfiniBand, followed by an AllGather within each node.

This correctly exploits the bandwidth hierarchy: the slow inter-node link sees only the partially reduced results, not the full gradient tensor from every GPU.
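The three-step structure can be checked on toy data. This single-process sketch mirrors the steps described above (intra-node reduce-scatter, inter-node all-reduce between GPUs with the same local index, intra-node all-gather); it illustrates the data flow only and is not NCCL code.

```python
def hierarchical_allreduce(grads):
    """grads[node][gpu] is a vector whose length is divisible by GPUs per node.
    Returns (result per GPU, elements each GPU reduces across nodes)."""
    n_nodes, gpus = len(grads), len(grads[0])
    size = len(grads[0][0])
    c = size // gpus

    # 1) Intra-node reduce-scatter: local GPU j ends with the node-level sum
    #    of chunk j.
    shards = [[[sum(grads[n][k][i] for k in range(gpus))
                for i in range(j * c, (j + 1) * c)]
               for j in range(gpus)]
              for n in range(n_nodes)]

    # 2) Inter-node all-reduce: GPUs with the same local index j sum their
    #    shards across nodes. Only c elements per GPU cross the slow link.
    reduced = [[sum(shards[n][j][i] for n in range(n_nodes)) for i in range(c)]
               for j in range(gpus)]

    # 3) Intra-node all-gather: every GPU collects all reduced shards.
    full = [x for shard in reduced for x in shard]
    result = [[list(full) for _ in range(gpus)] for _ in range(n_nodes)]
    return result, c


toy = [[[float(100 * n + 10 * g + i) for i in range(8)] for g in range(4)]
       for n in range(2)]
result, inter_node_elems = hierarchical_allreduce(toy)
print(result[0][0])
print(f"each GPU reduces {inter_node_elems} of {len(toy[0][0])} elements across nodes")
```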

The implementation detail that matters for practitioners: the NCCL_SOCKET_NTHREADS and NCCL_BUFFSIZE environment variables, along with the NCCL_P2P_LEVEL setting, are often the first things to tune when AllReduce performance is below theoretical bandwidth.

NCCL’s defaults are conservative for stability across diverse hardware configurations.

The NCCL + CUDA stream interaction

NCCL operations run on CUDA streams, and correct stream management is the source of many subtle performance bugs in multi-GPU training code.

The critical invariant: NCCL operations on the same communicator object are serialized in NCCL’s internal queue, but operations on different communicators are independent.

The PyTorch DDP implementation creates one NCCL communicator per process group, and all AllReduce operations for gradient synchronization go through this communicator.

When gradient overlap is active (AllReduce launched for bucket i while the backward pass continues computing gradients for the next bucket), PyTorch launches the AllReduce on a separate CUDA stream from the backward pass computation. The backward computation stream and the AllReduce stream run concurrently on the GPU.

The synchronization at the end of the backward pass (before the optimizer step) ensures that all AllReduce streams have completed. Failure to synchronize here is a common bug that manifests as non-deterministic gradient corruption: the optimizer sees a partially all-reduced gradient tensor.

The interaction with tensor parallelism adds another layer: the AllReduce within a TP group happens on a separate communicator from the AllReduce within the DP group.

These two communicators must be correctly ordered in the forward/backward pass: TP AllReduce happens at each layer boundary (blocking for the next layer to begin), while DP AllReduce happens after the full backward pass.

The arithmetic of parallelism efficiency

With the mechanisms in place, we can now put numbers on the parallel efficiency achievable under different configurations. This is where the theoretical discussion becomes practically useful.

For data parallelism, the relevant ratio is R_DP = T_compute / T_allreduce. For gradient overlap to be effective, we need R_DP >> 1. For a 70B model over InfiniBand at 25 GB/s, T_allreduce ≈ 5.6 seconds. For a per-GPU batch of 512 tokens at 50% H100 utilization, T_compute ≈ 0.58 seconds. R_DP ≈ 0.1.

The compute is an order of magnitude faster than the all-reduce, and no amount of overlap engineering fixes a ratio that is inverted.

The resolution is gradient accumulation: run multiple micro-batches locally before triggering the all-reduce, increasing the logical batch size without increasing activation memory.

This is not a workaround. It is the correct operating point, because convergence is measured in tokens seen, not steps taken.
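The ratio above, and the accumulation depth it implies, can be reproduced in a few lines. This is a rough sketch: the 5.6 s figure is read here as 140 GB of BF16 gradients moved once at 25 GB/s (an assumption about how the number was derived), and the function names are illustrative.

```python
import math

def allreduce_seconds(params, bytes_per_param, bandwidth_bytes_per_s):
    """Time to move one full copy of the gradients over the link (assumed model)."""
    return params * bytes_per_param / bandwidth_bytes_per_s

def min_accumulation_steps(t_compute, t_allreduce):
    """Smallest number of micro-batches per all-reduce that brings R_DP to >= 1."""
    return math.ceil(t_allreduce / t_compute)

t_allreduce = allreduce_seconds(70e9, 2, 25e9)  # 70B params, BF16, 25 GB/s
t_compute = 0.58                                 # seconds per micro-batch, from the text
r_dp = t_compute / t_allreduce
steps = min_accumulation_steps(t_compute, t_allreduce)
print(f"T_allreduce = {t_allreduce:.1f} s, R_DP = {r_dp:.2f}, accumulate >= {steps} micro-batches")
```

Under these assumptions, roughly ten micro-batches per all-reduce only bring the ratio to about 1; reaching R_DP >> 1 needs a correspondingly deeper accumulation.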

For tensor parallelism, the critical TP degree N* is the point above which T_allreduce_TP > T_GEMM_local. On NVLink at 450 GB/s, N* ≈ 8. On InfiniBand at 25 GB/s, N* ≈ 1 to 2. Tensor parallelism over InfiniBand is not a suboptimal choice; it is an arithmetic mistake.

The practical ceiling for a well-tuned thousand-GPU training job with Megatron-LM or DeepSpeed on H100 hardware is approximately 40-60% MFU.

The gap from 100% is accounted for by pipeline bubbles (~15%), communication overhead (~10-15%), activation recomputation (~5-10%), and the residual from kernel inefficiency on non-GEMM operations and host-side Python overhead.

Getting from 40% to 60% MFU delivers 1.5× the throughput on the same hardware, which cuts the GPU-hours for a fixed token budget by about one third. Every major AI lab has engineers whose entire job is that gap.

What the three strategies actually buy you

It is worth stepping back and stating plainly what each parallelism axis is for, now that we have done the arithmetic.

Data parallelism buys throughput. More DP replicas means more tokens per second, with communication overhead that is manageable at any scale where the per-GPU batch is large enough.

ZeRO makes DP viable even when the model does not fit on a single GPU, at the cost of increased communication volume that must be weighed against the larger batch sizes it unlocks.

DP is the outer loop of every large training job; the other strategies are refinements that make DP feasible at model sizes and cluster scales where it would otherwise be communication-bound.

Tensor parallelism buys memory capacity per layer, by distributing the weight matrices and activations of individual layers across multiple GPUs.

Its cost is a mandatory all-reduce at every layer boundary, which makes it viable only where NVLink bandwidth makes that all-reduce cheap.

TP is fundamentally an intra-node strategy on current hardware. Using it otherwise is trading compute for communication at an exchange rate that is never favorable.

Pipeline parallelism buys the ability to cross node boundaries without paying the tensor-parallel penalty. Its cost is the pipeline bubble, which is a tax on idleness that shrinks with micro-batch count and with interleaved scheduling.

PP is the mechanism by which you scale beyond a single node when the model is too large for TP alone to partition. It is ugly, and it is necessary, and the 1F1B schedule and its interleaved variants exist specifically to make that ugliness manageable.

The three-dimensional parallelism strategy (TP within a node, PP across nodes, DP across node groups) is not a design choice that someone made. It is the unique solution that the bandwidth hierarchy of current hardware enforces.

NVLink makes TP cheap within a node. InfiniBand makes PP the only viable strategy across nodes. Gradient accumulation and ZeRO make DP efficient across the full cluster. Change the hardware, and the optimal strategy shifts.

NVL72 on Blackwell, by extending the NVLink fabric to 72 GPUs, shifts the TP/PP boundary outward.

InfiniBand NDR at 400 Gb/s per link, as it becomes more widely deployed, shifts the point at which inter-node DP communication becomes the bottleneck. The strategies are not timeless; the physics that motivates them is.

What actually limits you at scale

The question that every large-scale training practitioner eventually asks is: why isn’t my thousand-GPU job achieving 70% MFU? The answer is rarely a single cause.

It is a stack of inefficiencies, each modest in isolation, compounding into the gap between theoretical and measured throughput.

Pipeline bubbles are usually the largest single contributor at high PP degrees. For P=16 and M=32, the bubble fraction is 32% before any other inefficiency is counted.

Interleaved 1F1B at V=2 halves this, at the cost of doubled inter-stage communication. The right V depends on whether communication or compute is the limiting resource at the PP boundary, which varies by model size, node count, and InfiniBand configuration.
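The 32% figure is consistent with the usual bubble model, (P-1)/(M+P-1), and a short sketch makes the effect of interleaving visible. The formula and the function name are assumptions about what the numbers above are based on; interleaving is modeled as dividing the bubble time by V.

```python
def bubble_fraction(p, m, v=1):
    """Idle fraction of one pipeline step: bubble / (bubble + useful work).

    p: pipeline stages, m: micro-batches, v: interleaving factor.
    Times are in units of one micro-batch's forward+backward pass.
    """
    bubble = (p - 1) / v
    return bubble / (m + bubble)

plain = bubble_fraction(16, 32)
interleaved = bubble_fraction(16, 32, v=2)
print(f"1F1B: {plain:.1%}, interleaved V=2: {interleaved:.1%}")
```

In this model the bubble time is exactly halved at V=2, while the bubble's share of the whole step falls from about 32% to about 19%.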

Inter-node AllReduce tail latency is the second major contributor and the hardest to reason about, because it is statistical rather than deterministic.

The average InfiniBand bandwidth may be 25 GB/s, but the 99th-percentile latency can be 3-5× higher due to switch congestion, adaptive routing jitter, and multi-tenancy.

The all-reduce waits for the slowest link. At scale, the slowest link is almost always slower than the average, which means the effective all-reduce bandwidth is consistently worse than the number on the datasheet.
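A back-of-the-envelope sketch shows why the tail dominates at scale. It assumes links behave independently and that "slow" means landing beyond the 99th percentile; both are simplifications, and the function name is illustrative.

```python
def p_any_slow(num_links, tail_quantile=0.99):
    """Probability that at least one of num_links independent links is in its slow tail."""
    return 1.0 - tail_quantile ** num_links

for n in (8, 64, 1000):
    print(n, f"{p_any_slow(n):.4f}")
```

With a handful of links a tail event is an occasional hiccup; with a thousand it is present in almost every collective, so the p99 figure behaves like the typical case.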

RDMA configuration, static routing, and dedicated rail topology are the levers. They require infrastructure access that not every team has.

Host-side Python overhead is the most embarrassing source of inefficiency, because it is entirely self-inflicted.

PyTorch’s dispatcher, the GIL, and the overhead of the training loop’s Python logic can appear as measurable GPU idle time for models where per-step compute is short. CUDA graph capture eliminates per-operation launch overhead for the inner loop.

Careful pipelining of data loading and logging eliminates it for the outer loop. Teams that have done this work carefully report 5-10% improvement in effective MFU from host-side optimizations alone, which is not a small number when the training run costs millions of dollars.

Conclusion

The trajectory from single-GPU to multi-GPU mirrors the trajectory within the single GPU: find the bottleneck, route around it, measure again.

On a single GPU, the bottleneck moved from arithmetic units (tensor cores solved it) to memory bandwidth (cp.async attacked it) to instruction overhead (TMA eliminated it).

The SMSP on a well-tuned Hopper GEMM kernel is a machine that does one thing, because every other thing it used to do has been delegated to dedicated hardware.

At the multi-GPU level, the bottleneck is the communication fabric, and the solution follows the same logic: match the parallelism strategy to the bandwidth available at each level of the hierarchy.

TP over NVLink, PP over InfiniBand, DP across the full cluster. The ring-allreduce, the 1F1B schedule, the ZeRO partitioner: these are the mechanisms by which a thousand-GPU cluster achieves 40-60% of the theoretical sum of its parts.

40-60% of a thousand H100s is still an extraordinary amount of compute. Whether it is enough to train the next frontier model is left as an exercise for the reader’s infrastructure budget.

Part X will close in on a corner of this picture we have deliberately deferred: inference. Training is a one-time cost; inference is the workload that runs forever, at a scale that dwarfs training once a model is deployed.

The optimization challenges at inference time are different in kind: speculative decoding, KV cache management, continuous batching, and the specific arithmetic of when the prefill and decode phases are bottlenecked by entirely different resources.

The tools change; the principle does not.

Mastering CUDA and High-Performance Computing, Part VIII

Lorenzo Bradanini — Tue, 07 Apr 2026 09:52:06 GMT

Where Part VII Left Us

Part VII ended with a promise and an architectural cliffhanger.

The promise: on Hopper, the compute-to-load instruction ratio in a GEMM inner loop approaches infinity from the SMSP’s perspective.

The cliffhanger: one instruction moves a 128×128 BF16 tile, the TMA unit generates all the addresses, and something called an mbarrier replaces the __syncthreads() you have been writing since your first CUDA “hello world”.

Let us unpack exactly what that means, why NVIDIA made those choices, and what you have to understand to write, read, or debug CUTLASS 3.x kernels without feeling like you are reading a foreign language.

We will go very deep. There is no other way.

The Problem cp.async Did Not Fully Solve

Part VII established that cp.async is superior to the conventional LDG → STS path because it removes the destination registers from the scoreboard. The SMSP issues the copy, hands it off to the Async Copy Engine, and is immediately free to issue the next instruction.

This is genuinely great. But it has a hidden cost that only becomes visible when you look at the SMSP instruction stream of a real GEMM kernel.

Consider a 128×128×32 BF16 tile. Loading the A operand of that tile requires 128 × 32 = 4096 BF16 elements = 8 KB, and the B operand costs the same again. At 16 bytes per cp.async, that is 512 individual CP.ASYNC.CA.SHARED.GLOBAL instructions for A alone.

Those 512 instructions have to be fetched from the instruction cache, decoded, dispatched through the MIO unit, and tracked by the hardware. They consume SMSP instruction bandwidth even though they produce no register results.

On Ampere, the SMSP can issue roughly one 128-bit cp.async every 4 cycles per SMSP. For 512 instructions, that is approximately 2048 SMSP cycles per tile load, just for the instruction overhead. The actual data movement happens asynchronously, but the instruction stream is not free.
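The arithmetic in the last two paragraphs can be written as a small helper. It only reproduces the figures quoted above; the 4-cycle issue interval is the article's rough estimate, and the function name is made up for illustration.

```python
def cp_async_cost(rows, cols, elem_bytes=2, bytes_per_copy=16, cycles_per_issue=4):
    """Return (tile bytes, number of 16-byte copies, issue cycles) for one operand tile."""
    tile_bytes = rows * cols * elem_bytes
    copies = tile_bytes // bytes_per_copy
    return tile_bytes, copies, copies * cycles_per_issue

print(cp_async_cost(128, 32))
```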

For large tiles this is manageable. For smaller tiles, or for architectures where you want the SMSP to spend every cycle on tensor core instructions, it is a ceiling.

Hopper (SM90, H100) was designed to remove that ceiling entirely. The answer is the Tensor Memory Accelerator.

Tensor Memory Accelerator

The TMA is a hardware unit introduced in Hopper that performs multi-dimensional tensor copies between global memory and shared memory (or distributed shared memory across a cluster, but we will get to clusters).

It accepts a tensor descriptor computed on the host and a set of coordinates computed on the device, and it handles everything else: address computation, striding, shared-memory swizzling, out-of-bounds handling (elements outside the tensor are filled in rather than read), cache policy, and transaction completion signaling.

Let us be concrete about what “everything else” means.

In a conventional tiled GEMM, for every tile you load, every thread in the warp must compute its portion of the global memory address.

That address computation involves the block index, the thread index, the tile dimensions, the matrix stride, and the element size. It is entirely deterministic arithmetic that produces the same result every time you execute the same tile iteration.

It is also arithmetic that the SMSP has to execute. On Ampere with cp.async, that arithmetic still happens in the SMSP even though the subsequent memory transaction is asynchronous.

The TMA eliminates that arithmetic from the SMSP. One thread issues one instruction with a tensor descriptor handle and a pair of (y, x) coordinates.

The TMA unit uses those coordinates and the descriptor’s metadata to compute every address needed for the entire tile transfer, scatter or gather the data, and write it to shared memory. The SMSP emitted one instruction. One.

This is not a minor optimization. It is a qualitative change in what the SMSP does during a GEMM kernel. On Hopper, the SMSP’s job is to run WGMMA.MMA_ASYNC instructions.

The TMA’s job is to move data. These two jobs happen simultaneously, on separate hardware units, and the only communication between them is an mbarrier synchronization object.

The Tensor descriptor

Before a Hopper kernel runs, the host must create a tensor descriptor using cuTensorMapEncodeIm2col or, more commonly for GEMM, cuTensorMapEncodeTiled. The result is a 128-byte opaque structure (CUtensorMap), typically passed to the kernel as a const __grid_constant__ parameter or placed in constant or global memory.

The descriptor encodes:

Base pointer: the global memory address of tensor element [0, 0, 0, ...].

Global dimensions: the actual size of each dimension in the full tensor, in elements. For an M×K matrix A, this is {M, K} (or {K, M} if column-major).

Global strides: the byte stride between consecutive elements in each dimension. For a row-major matrix with K columns and BF16 elements, the stride between row i and row i+1 is K × 2 bytes. These strides allow arbitrary non-contiguous tensors.

Box dimensions: the size of the tile to be transferred in each dimension. For a 128×32 BF16 tile, this is {128, 32}.

Interleave and swizzle mode: how data should be rearranged during the transfer to produce a shared memory layout that avoids bank conflicts. This is the part that replaces all the padding arithmetic from Part VII.

Element stride and data type: how to interpret the raw bytes.

The descriptor is created once on the CPU and passed to the kernel. On the device, a single warp or even a single thread can then use this descriptor to initiate a full tile transfer with one instruction, because all the per-tile invariant information is already encoded.

This is a deliberate design choice: move the expensive computation (descriptor creation) to the host, where latency is irrelevant relative to the kernel launch overhead, so that the device-side instruction can be as cheap as possible.
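To make the division of labor concrete, here is a toy model of what the descriptor lets the hardware derive from just two coordinates. It is a conceptual sketch only: ToyTensorMap2D and its fields are hypothetical and do not mirror the real CUtensorMap encoding or the driver API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToyTensorMap2D:
    """Conceptual stand-in for a 2D tensor descriptor (not the CUtensorMap layout)."""
    base: int        # byte address of element [0, 0]
    rows: int        # global dimensions, in elements
    cols: int
    row_stride: int  # bytes between consecutive rows
    elem_bytes: int
    box_rows: int    # tile ("box") size, in elements
    box_cols: int

    def box_row_addresses(self, coord_y, coord_x):
        """Byte address where each in-bounds row of the box at (coord_y, coord_x) starts."""
        last_row = min(coord_y + self.box_rows, self.rows)
        return [self.base + y * self.row_stride + coord_x * self.elem_bytes
                for y in range(coord_y, last_row)]

# Row-major BF16 matrix with M=4096, K=1024, transferred in 128x32 boxes.
desc = ToyTensorMap2D(base=0x1000, rows=4096, cols=1024,
                      row_stride=1024 * 2, elem_bytes=2, box_rows=128, box_cols=32)
addrs = desc.box_row_addresses(coord_y=128, coord_x=64)
print(len(addrs), hex(addrs[0]), addrs[1] - addrs[0])
```

Everything except the two coordinates is fixed when the descriptor is built, which is why the device-side instruction can be so small.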

The TMA instruction itself

The PTX for a 2D TMA load looks like this:

cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes
    [smem_dst], [gmem_desc, {coord_y, coord_x}], [mbar];

Let us parse every token.

cp.async.bulk means this is an asynchronous bulk copy; “bulk” distinguishes it from scalar cp.async. The transfer size is determined by the descriptor, not encoded in the instruction.

tensor.2d means the TMA will interpret the coordinates as a 2D tensor access. There are variants for 1D through 5D tensors.

shared::cluster is the destination scope: shared memory that is visible to the entire thread block cluster (more on clusters shortly). For single-CTA kernels this is simply shared memory.

global is the source: global memory, indexed via the descriptor.

mbarrier::complete_tx::bytes is the completion signaling mechanism. When the transfer completes, the TMA will signal an mbarrier object, decrementing its transaction count by the number of bytes delivered.

When the count reaches zero, threads waiting on the barrier are unblocked. This replaces consumer_wait() and __syncthreads() in the sense that the barrier itself tracks both the data arrival and the thread synchronization in a single primitive.

[smem_dst] is the destination address in shared memory.

[gmem_desc, {coord_y, coord_x}] is the descriptor plus coordinates. The TMA extracts the base pointer, strides, and box dimensions from the descriptor, applies the coordinates, and generates the full address range.

[mbar] is a pointer to the mbarrier object in shared memory.

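In CUDA C++, libcu++ exposes this instruction through cuda::device::experimental::cp_async_bulk_tensor_2d_global_to_shared (declared in <cuda/barrier>) and through the cuda::ptx::cp_async_bulk_tensor wrappers; __pipeline_memcpy_async, by contrast, emits the older per-thread cp.async. The canonical production path is through CUTLASS 3.x’s cute::copy with a TMA copy atom, which we will examine in the CUTLASS section.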

A synchronization primitive you have not seen before

__syncthreads() is a full thread block barrier. Every thread in the block must arrive before any thread proceeds.

It is implemented via a shared counter that is decremented by each arriving thread and checked by a hardware barrier mechanism. Its cost is proportional to thread block size, and it cannot distinguish between “I’m done computing” and “my data has arrived from the DMA engine”.

mbarrier (the PTX name for a hardware-backed barrier object in shared memory, introduced on Ampere and extended on Hopper with transaction counting) solves both of those problems.

An mbarrier object is a 64-bit value stored in shared memory. It alternates between two phases and tracks two distinct counts:

The arrival count is decremented by threads calling mbarrier.arrive or mbarrier.arrive_drop. When this count hits zero, the barrier phase flips.

The transaction count is decremented by the TMA engine itself when a bulk copy completes. This is the complete_tx::bytes in the PTX instruction above. The programmer initializes this count to the expected number of bytes that the TMA will deliver.

The barrier is “complete” when both counts reach zero: all participating threads have arrived, and all expected TMA transactions have completed.

This means you can have a consumer wait on a barrier that is signaled partly by threads and partly by hardware DMA engines, with no polling loop, no atomics in the critical path, and no __syncthreads() that serializes all 128 threads in the block.

The setup looks like this in CUDA C++:

#include <cuda/barrier>
using barrier_t = cuda::barrier<cuda::thread_scope_block>;
namespace cde = cuda::device::experimental;

__shared__ barrier_t mbar;

// One thread initializes the barrier for N_THREADS participants
if (thread_rank == 0) {
    init(&mbar, N_THREADS);
    cde::fence_proxy_async_shared_cta();  // make the init visible to the TMA unit
}
__syncthreads();  // This syncthreads is to publish the initialized mbar

barrier_t::arrival_token token;
if (thread_rank == 0) {
    // Producer thread issues the TMA load (innermost coordinate first)
    cde::cp_async_bulk_tensor_2d_global_to_shared(
        smem_A, &gmem_desc_A, tile_coord_k, tile_coord_m, mbar);
    // Arrive, and tell the barrier to also expect TMA_BYTES bytes of async data
    token = cuda::device::barrier_arrive_tx(mbar, 1, TMA_BYTES);
} else {
    // All other threads arrive at the barrier (decrement arrival count)
    token = mbar.arrive();
}

// Wait for both arrival count and transaction count to reach zero
mbar.wait(std::move(token));

Note the asymmetry: one thread issues the TMA, all threads participate in the barrier synchronization. This is not a bug; it is the design.

The TMA is a singleton operation that one thread initiates, but the data it delivers is consumed by all threads, so all threads must synchronize on its completion.

The arrive_tx call informs the barrier that TMA bytes are expected. Without it, the barrier would complete as soon as all threads arrived, regardless of whether the DMA data had landed in shared memory. That would be a race condition.

The token returned by arrive is a phase token. mbarrier operates in alternating phases (like a double buffer at the synchronization level), and the token ensures that wait waits on the correct phase.

This is how Hopper avoids the ABA problem in barrier reuse: you cannot accidentally wait on a barrier phase that already completed in a previous iteration.
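The two-count, two-phase behavior is easier to see in a toy software model. The class below is a sketch of the semantics described above, not of the hardware encoding: ToyMbarrier and its method names are invented for illustration, and the single-bit phase mirrors the alternating-phase idea only.

```python
class ToyMbarrier:
    """Software model of a barrier with an arrival count and a transaction (byte) count."""

    def __init__(self, expected_arrivals):
        self.expected_arrivals = expected_arrivals
        self.pending_arrivals = expected_arrivals
        self.pending_tx_bytes = 0
        self.phase = 0

    def expect_tx(self, num_bytes):
        """Announce that num_bytes of asynchronous data will be delivered."""
        self.pending_tx_bytes += num_bytes

    def arrive(self):
        """A thread arrives; the returned token names the phase it arrived in."""
        token = self.phase
        self.pending_arrivals -= 1
        self._maybe_complete()
        return token

    def complete_tx(self, num_bytes):
        """The copy engine reports num_bytes delivered."""
        self.pending_tx_bytes -= num_bytes
        self._maybe_complete()

    def _maybe_complete(self):
        # The phase flips only when BOTH counts have reached zero.
        if self.pending_arrivals == 0 and self.pending_tx_bytes == 0:
            self.phase ^= 1
            self.pending_arrivals = self.expected_arrivals

    def is_complete(self, token):
        return self.phase != token

bar = ToyMbarrier(expected_arrivals=4)
bar.expect_tx(8192)
tokens = [bar.arrive() for _ in range(4)]
print(bar.is_complete(tokens[0]))  # every thread has arrived, data still in flight
bar.complete_tx(8192)
print(bar.is_complete(tokens[0]))  # data has landed as well
```

The first check fails and the second succeeds, which is exactly the race the transaction count closes: thread arrival alone never releases the waiters.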

Warpgroup MMA

Part VII did not cover the compute side of Hopper in depth because the memory side was already enough to digest. Now we need to talk about WGMMA, and it is equally radical.

On Ampere, tensor core instructions are issued per-warp: the SASS HMMA.16816 instruction, or the PTX mma.sync.aligned.m16n8k16, operates on 16×8×16 tiles with 32 threads participating. Each warp independently executes its tile of the matrix multiply.

Warp-level tensor core instructions were already a significant departure from SIMT, since all 32 threads in a warp cooperate to produce a single 16×8 output tile. But the warp is still the unit of scheduling and the unit of tensor core execution.

On Hopper, the tensor core instruction is warpgroup-level: WGMMA.MMA_ASYNC operates on a group of 4 warps (128 threads) simultaneously. The input tile dimensions for BF16 are:

A: 64×16 per warpgroup (contributed from registers or shared memory)
B: 16×256 per warpgroup (always from shared memory)
C/D: 64×256 accumulator (in registers, split across the 128 threads)

A single WGMMA.MMA_ASYNC instruction computes a 64×256×16 BF16 GEMM, producing 64×256 = 16,384 output elements in one instruction.

For comparison, an Ampere mma.sync.aligned with the largest BF16 shape computes a 16×8×16 BF16 GEMM, producing 16×8 = 128 output elements.

The output volume ratio is 128:1. This is what “approaching infinite compute-to-load ratio” means in practice.
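The same comparison in code, counting both output elements and floating-point operations per instruction. This is a sketch of the arithmetic only; the helper name is illustrative.

```python
def mma_outputs_and_flops(m, n, k):
    """Output elements and FLOPs (one multiply plus one add per MAC) for an MxNxK tile."""
    return m * n, 2 * m * n * k

wgmma = mma_outputs_and_flops(64, 256, 16)
mma_sync = mma_outputs_and_flops(16, 8, 16)
print(wgmma, mma_sync, wgmma[0] // mma_sync[0])
```

Because both shapes share K=16, the FLOP ratio per instruction is the same 128:1 as the output ratio.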

The _ASYNC suffix is critical: WGMMA.MMA_ASYNC does not complete synchronously. The 4 warps issue the instruction and the result is not guaranteed to be in the accumulator registers until a WGMMA.WAIT_GROUP instruction is executed.

The hardware can overlap multiple WGMMA operations in flight simultaneously, and the programmer must insert explicit waits before reading the accumulators.

The programming model therefore looks like this at the instruction level:

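WGMMA.FENCE                ; order earlier register/shared-memory writes before the MMAs
WGMMA.MMA_ASYNC D, A, B   ; issue tile multiply k=0
WGMMA.MMA_ASYNC D, A, B   ; issue tile multiply k=1
WGMMA.MMA_ASYNC D, A, B   ; issue tile multiply k=2
...
WGMMA.COMMIT_GROUP         ; bundle the MMAs issued above into one group
WGMMA.WAIT_GROUP 0         ; wait until no committed group is outstanding
; D accumulator registers now hold valid results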

In CUDA C++, this is exposed through CuTe's SM90 GMMA atoms in CUTLASS 3.x, which wrap the wgmma PTX in inline assembly; CUDA C++ itself offers no dedicated WGMMA intrinsic. Writing the PTX by hand is also possible but strongly inadvisable outside of research contexts.

The reason B must always come from shared memory (not registers) is a hardware constraint. The tensor core units on Hopper are wired directly to the shared memory arrays.

The B operand is fetched directly from the shared memory banks by the tensor core datapath, without going through the register file.

This is why the TMA delivering B into shared memory is on the critical path, but there is no “load B from shared memory to registers” step. The tensor core reads shared memory directly.

A can come from either registers or shared memory. For the highest-performance kernels, A also comes from shared memory, which means both operands bypass the register file entirely on the compute side. The register file holds only the C/D accumulator.

Thread Block Clusters

Hopper introduced a new level of the GPU hierarchy between the thread block and the grid: the thread block cluster.

A cluster is a group of up to 8 thread blocks (the portable maximum) that are guaranteed to be co-scheduled on the same GPC (GPU Processing Cluster, a physical group of SMs).

Thread blocks within a cluster can access each other’s shared memory via the Distributed Shared Memory (DSMEM) mechanism, using TMA to move data between SMs without going through L2.

The PTX instruction for a cross-SM TMA transfer is:

cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes
    [smem_dst], [gmem_desc, {coord_y, coord_x}], [mbar];

This is the same instruction as a regular TMA load, with the shared::cluster scope indicating that the destination is visible cluster-wide. The TMA unit manages the inter-SM data movement transparently.

Why does this matter for GEMM? Consider a cluster of 2 CTAs, each responsible for a different row block of C. Both need access to the same columns of B.

With clusters, CTA 0 loads B into its shared memory via TMA, and CTA 1 can read CTA 0’s shared memory directly via DSMEM. B is loaded once and consumed by two CTAs. This effectively doubles the B reuse without doubling the shared memory per CTA.

For an N=8 cluster, 8 CTAs share the B tile load, amortizing the HBM bandwidth for B across 8x more compute.

This is the mechanism by which Hopper GEMM kernels approach hardware peak on large problem sizes: the cluster architecture allows the working set of the entire computation to be held in distributed shared memory, with HBM touched only once per element.

The cluster size is specified at kernel launch:

cudaLaunchConfig_t config = {};
config.gridDim = grid;
config.blockDim = block;
cudaLaunchAttribute attr;
attr.id = cudaLaunchAttributeClusterDimension;
attr.val.clusterDim.x = 2; // 2 CTAs per cluster
attr.val.clusterDim.y = 1;
attr.val.clusterDim.z = 1;
config.attrs = &attr;
config.numAttrs = 1;
cudaLaunchKernelEx(&config, my_kernel, args...);

Cluster scheduling is all-or-nothing: every CTA of a cluster is co-scheduled on a single GPC, so a launch whose cluster shape cannot fit on one GPC fails instead of spilling across GPCs. cudaOccupancyMaxActiveClusters reports how many clusters of a given shape can be resident at once.

On H100 SXM5, with 132 SMs organized into 8 GPCs, the portable maximum of 8 CTAs per cluster fits within a single GPC as long as each CTA's resource footprint allows it.

The Persistent Kernel Model

On Ampere, a typical GEMM kernel is a “grid kernel”: each thread block handles one (M_TILE, N_TILE) output tile and exits. The CUDA runtime schedules new thread blocks as soon as SM capacity becomes available.

For large matrices this is fine: there are enough tiles that the SM scheduler is always busy.

For smaller matrices, the overhead of launching and retiring thread blocks dominates. Each thread block must load its A and B tiles from scratch, write its C tile to global memory, and terminate. The shared memory state is not reused across thread blocks.

Hopper’s memory hierarchy and cluster model make a different approach attractive: persistent kernels.

In a persistent kernel, a thread block (or warpgroup) does not terminate after processing one tile.

Instead, it loops over multiple output tiles, maintaining the A and B tiles in shared memory between iterations where the tile is reused, and fetching new tiles via TMA only when necessary. The kernel terminates only after all output tiles in its assigned partition are complete.

CUTLASS 3.x implements this via the Tile Scheduler, a device-side component that manages the assignment of output tiles to persistent CTAs.

The scheduler atomically increments a work counter stored in global memory, assigning the next available (m_tile, n_tile) pair to the requesting CTA. When all tiles are assigned, the scheduler signals completion and the CTA exits the work loop.
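The work-counter scheme described above can be sketched on the host to show its key property: every tile is claimed exactly once, regardless of which CTA asks. This is a toy model, not CUTLASS code; ToyTileScheduler and run_persistent are hypothetical names, and a round-robin loop stands in for CTAs running concurrently.

```python
import itertools

class ToyTileScheduler:
    def __init__(self, m_tiles, n_tiles):
        self.m_tiles, self.n_tiles = m_tiles, n_tiles
        self._counter = itertools.count()  # stands in for the counter in global memory

    def next_tile(self):
        idx = next(self._counter)          # models an atomic fetch-and-increment
        if idx >= self.m_tiles * self.n_tiles:
            return None                    # no work left: the CTA leaves its loop
        return divmod(idx, self.n_tiles)   # linear index -> (m_tile, n_tile)

def run_persistent(num_ctas, scheduler):
    work = {cta: [] for cta in range(num_ctas)}
    active = list(range(num_ctas))
    while active:
        for cta in list(active):           # round-robin stands in for concurrency
            tile = scheduler.next_tile()
            if tile is None:
                active.remove(cta)
            else:
                work[cta].append(tile)
    return work

work = run_persistent(num_ctas=4, scheduler=ToyTileScheduler(3, 5))
print({cta: len(tiles) for cta, tiles in work.items()})
```

With 15 tiles and 4 persistent CTAs the load cannot be perfectly even, but no tile is skipped or duplicated, and the leftover tiles need no special handling.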

The advantages are concrete:

L2 reuse improves because the same CTA processes multiple adjacent tiles, and the A or B tiles they share remain in L2 (or even in shared memory) between iterations.

Thread block launch overhead is amortized: the GPU launches one wave of persistent CTAs and they run to completion, rather than launching thousands of transient blocks.

Irregular problem sizes are handled more gracefully: the final partial tile is processed by whichever CTA happens to claim it, without requiring separate epilogue kernel launches.

The disadvantage is programming complexity: you are writing a software scheduler inside a CUDA kernel, with all the attendant concerns about correctness under concurrent access and load balancing across heterogeneous tile work.

CUTLASS handles this for you, which is one reason the library exists.

The CUTLASS 3.x Architecture

CUTLASS 3.x is a complete rewrite of CUTLASS 2.x, built on a new abstraction layer called CuTe (CUDA Template library).

Understanding CUTLASS 3.x requires understanding CuTe, because CUTLASS 3.x is essentially CuTe plus a set of kernel templates that use it.

CuTe: Layouts as First-Class Objects

CuTe’s central idea is that a layout is a function from a logical coordinate space to a physical offset in memory. A layout encodes both shape (the extents of each dimension) and stride (the distance in elements between consecutive elements along each dimension).

In CuTe, a layout is written as Shape:Stride. For example, a 4×8 row-major matrix with elements of size 2 bytes has layout (4,8):(8,1), meaning: the outer dimension (rows) has stride 8 (each row is 8 elements apart), and the inner dimension (columns) has stride 1. A column-major version of the same matrix would be (4,8):(1,4).
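The "layout as a function" idea fits in a few lines. This is a minimal sketch of the Shape:Stride mapping for flat (non-nested) layouts only; the layout helper is an illustration, not CuTe's implementation.

```python
def layout(shape, stride):
    """Return a function mapping a logical coordinate to an element offset."""
    def offset(*coord):
        assert len(coord) == len(shape)
        assert all(0 <= c < s for c, s in zip(coord, shape))
        return sum(c * d for c, d in zip(coord, stride))
    return offset

row_major = layout((4, 8), (8, 1))  # (4,8):(8,1)
col_major = layout((4, 8), (1, 4))  # (4,8):(1,4)
print(row_major(1, 2), col_major(1, 2))
```

The same logical element (row 1, column 2) lands at offset 10 in the row-major layout and at offset 9 in the column-major one; only the stride tuple changed.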

The power of this representation is that it composes. A tiling operation is just a layout composition. A swizzle (bit permutation of addresses to avoid bank conflicts) is a layout transformation that permutes the address bits in a specific pattern.

The entire address computation for a tiled, swizzled, transposed tensor is expressed as a sequence of layout compositions that the compiler evaluates at compile time, producing a single address formula.

This is why CUTLASS 3.x can express complex access patterns without any runtime branching in the address computation.

using LayoutA = Layout<Shape<_128, _32>, Stride<_32, _1>>;  // 128x32 row-major
using LayoutA_Swizzled = ComposedLayout<Swizzle<B, M, S>, LayoutA>;  // B, M, S: swizzle parameters, described below

The Swizzle<B, M, S> template encodes a specific XOR-based address permutation: the low M bits of the offset are left untouched, and the next B bits are XORed with the B bits that sit S positions above them.

For BF16 with 32 banks of 4 bytes each, the correct swizzle eliminates all bank conflicts without any padding. CUTLASS ships with the correct swizzle parameters for every element type and tile dimension it supports.
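A small sketch shows the XOR permutation removing a conflict. It assumes the swizzle is applied to byte offsets with parameters (B=3, M=4, S=3), a choice that suits 128-byte rows split into 16-byte chunks; those parameters and the bank_group helper are illustrative assumptions, not values taken from the text.

```python
def swizzle(offset, b, m, s):
    """XOR the b bits starting at bit m with the b bits starting at bit m + s."""
    mask = ((1 << b) - 1) << (m + s)
    return offset ^ ((offset & mask) >> s)

def bank_group(byte_offset):
    """Which 16-byte slice of the 128-byte bank window (32 banks x 4 bytes) is hit."""
    return (byte_offset // 16) % 8

rows = range(8)
# Eight threads each read the first 16-byte chunk of a different 128-byte row.
plain = [bank_group(r * 128) for r in rows]
swizzled = [bank_group(swizzle(r * 128, 3, 4, 3)) for r in rows]
print(plain, swizzled)
```

Without the swizzle all eight accesses land in the same bank group; with it each row is rotated into a different group, and because the mapping is a permutation no padding bytes are needed.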

The MMA Atom and Copy Atom

In CUTLASS 3.x, a tensor core instruction is an MMA atom: a typed object that describes the input/output shapes, thread-to-data mapping, and instruction to emit. The canonical Hopper MMA atom for BF16 is:

using MMA_Atom = MMA_Atom<SM90_64x256x16_F32BF16BF16_SS>;

The name encodes: SM90 (Hopper), 64×256×16 tile dimensions, F32 accumulator, BF16 A and B inputs, F32 output, SS meaning both A and B come from shared memory.

A TMA copy is a copy atom:

using Copy_Atom_A = Copy_Atom<SM90_TMA_LOAD, bfloat16_t>;

The CUTLASS kernel template composes these atoms with tile dimensions, cluster shapes, and pipeline stages into a complete kernel:

using CollectiveMainloop = cutlass::gemm::collective::CollectiveMma<
    cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<3>,    // 3-stage pipeline
    Shape<_128, _256, _64>,                                   // tile MxNxK
    bfloat16_t, LayoutA,
    bfloat16_t, LayoutB,
    TiledMma,
    GmemTiledCopyA,
    SmemLayoutA,
    SmemCopyAtomA,
    cute::identity,
    GmemTiledCopyB,
    SmemLayoutB,
    SmemCopyAtomB,
    cute::identity
>;

This is verbose, but every template parameter maps to a concrete hardware mechanism: MainloopSm90TmaGmmaWarpSpecialized<3> means “use TMA for loads, use WGMMA for compute with both operands read from shared memory, split the CTA into producer and consumer warpgroups, with 3 pipeline stages”.

The compiler resolves all of this into a kernel where the main loop body is a tight sequence of WGMMA.MMA_ASYNC instructions, interrupted only by TMA-initiated mbarrier waits at stage boundaries.

The address computation for the TMA loads is essentially absent from the device code, having been moved to the descriptor construction on the host.

The Producer-Consumer Warpgroup Model

CUTLASS 3.x on Hopper adopts a warpgroup specialization model within each CTA. A thread block made up of several warpgroups of 128 threads each (for example 384 threads: one producer and two consumers) is divided at compile time into a producer warpgroup and one or more consumer warpgroups.

The producer warpgroup is responsible for issuing TMA loads (one thread per load, the others arrive at barriers). The consumer warpgroups are responsible for issuing WGMMA.MMA_ASYNC instructions and running the epilogue (writing C to global memory via the output TMA store).

This specialization is explicit:

if (warpgroup_id == 0) {
    // Producer: issue TMA loads into shared memory stages
    collective_mainloop.load(params, smem_tensors, pipeline, pipeline_state, k_tile_count);
} else {
    // Consumer: issue WGMMA instructions, run epilogue
    collective_mainloop.mma(params, smem_tensors, accumulators, pipeline, pipeline_state, k_tile_count);
    collective_epilogue.store(params, accumulators, ...);
}

The producer and consumer warpgroups communicate exclusively through the mbarrier-protected shared memory pipeline. There is no __syncthreads() between them in steady state. The barriers are sufficient.

This is architecturally important: __syncthreads() is a full CTA barrier. In a producer-consumer model where the producer and consumer have different amounts of work to do per iteration, a full CTA barrier would force the faster group to wait for the slower one on every iteration.

The mbarrier primitive allows asymmetric synchronization: the consumer waits only for the data it needs, not for the producer to reach any particular point in its control flow.

The N-Stage Pipeline on Hopper

Part VII described double buffering (2 stages) on Ampere. On Hopper, CUTLASS uses 3 to 8 stages by default, with the optimal stage count depending on the tile size, problem size, and occupancy target.

The pipeline state machine on Hopper manages N shared memory stages, N producer mbarriers (one per stage, signaling data arrival), and N consumer mbarriers (one per stage, signaling that the consumer is done reading and the stage can be reused).

The steady-state loop looks like this conceptually:

Stage 0: [TMA load A0, B0] → [mbar_full[0] signaled] → [WGMMA on A0,B0] → [mbar_empty[0] signaled]
Stage 1: [TMA load A1, B1] → [mbar_full[1] signaled] → [WGMMA on A1,B1] → [mbar_empty[1] signaled]
Stage 2: [TMA load A2, B2] → [mbar_full[2] signaled] → [WGMMA on A2,B2] → [mbar_empty[2] signaled]
Stage 0: [TMA load A3, B3] → ...

The producer issues TMA loads into stage i and signals mbar_full[i]. The consumer waits on mbar_full[i], runs WGMMA, signals mbar_empty[i], and moves to stage (i+1) % N.

The producer waits on mbar_empty[i] before reusing that stage for the next load. This circular buffer in shared memory, managed by mbarrier pairs, is the fundamental data structure of a Hopper GEMM kernel.

The prologue loads N-1 tiles before the main loop begins (same invariant as Part VII’s double buffer prologue, just with more stages). The epilogue drains the remaining in-flight tiles after the k loop exits.
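
The full/empty handshake above can be checked with a small host-side model. This is only a sketch: it runs sequentially, the names `RingResult` and `run_ring` are made up for illustration, and a pair of flags stands in for the mbarrier pairs. It also assumes the producer may run ahead until every stage is full, as a dedicated producer warpgroup can.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sequential host model of the N-stage ring. Real kernels run producer and
// consumer concurrently and use mbarriers; this only checks the protocol.
struct RingResult {
    std::vector<int> consumed;  // tile indices in the order the consumer saw them
    int max_in_flight = 0;      // most stages that were ever full at the same time
};

RingResult run_ring(int num_stages, int k_tiles) {
    assert(num_stages > 0 && k_tiles >= 0);
    std::vector<int>  slot(num_stages, -1);     // tile currently held by each stage
    std::vector<bool> full(num_stages, false);  // stands in for mbar_full / mbar_empty
    RingResult r;
    int produced = 0, consumed = 0, in_flight = 0;

    while (consumed < k_tiles) {
        // Producer: keep loading, in ring order, while the next stage is empty.
        while (produced < k_tiles && !full[produced % num_stages]) {
            const int s = produced % num_stages;
            slot[s] = produced;   // "TMA load" of tile `produced` into stage s
            full[s] = true;       // signal mbar_full[s]
            ++produced;
            ++in_flight;
            r.max_in_flight = std::max(r.max_in_flight, in_flight);
        }
        // Consumer: stage s is full here; "WGMMA" on it, then signal mbar_empty[s].
        const int s = consumed % num_stages;
        r.consumed.push_back(slot[s]);
        full[s] = false;
        ++consumed;
        --in_flight;
    }
    return r;
}
```

With 3 stages and 5 k-tiles, the model consumes tiles 0, 1, 2, 3, 4 in order and never has more than 3 stages full at once. The producer is blocked from touching a stage until the consumer has released it, which is the invariant the mbarrier pairs enforce in hardware.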

With 3 stages on an H100, which allows up to 228 KB of shared memory per SM (up from 164 KB on the A100), a 128×256 BF16 tile pair with a K-slice of 64 consumes approximately:

A tile: 128 × 64 × 2 bytes = 16 KB
B tile: 64 × 256 × 2 bytes = 32 KB
Per stage: 48 KB
3 stages: 144 KB
Remaining for mbarriers and accumulator spills: 84 KB

At 3 stages and a 128×256 tile, one CTA per SM is feasible. Two CTAs would require 288 KB, which exceeds the 228 KB shared memory limit.

Occupancy is therefore 1 CTA per SM, which is fine on Hopper because the single CTA fills the SM with WGMMA instructions and the TMA unit is fully occupied.
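
The same budget can be written down as a few lines of host code, which makes it easy to try other tile shapes and stage counts. This is a sketch under simplifying assumptions: the helper names are illustrative, and the CTA count considers shared memory only (no register, thread, or mbarrier overhead).

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t KB = 1024;

// Shared memory for one pipeline stage: an A tile (M x K) plus a B tile (K x N).
constexpr std::size_t stage_bytes(int tile_m, int tile_n, int tile_k, int elem_bytes) {
    return static_cast<std::size_t>(tile_m + tile_n) * tile_k * elem_bytes;
}

// CTAs per SM as limited by shared memory alone.
inline int ctas_per_sm_by_smem(std::size_t smem_per_sm, std::size_t smem_per_cta) {
    assert(smem_per_cta > 0);
    return static_cast<int>(smem_per_sm / smem_per_cta);
}

// The numbers from the text: 128x256 BF16 tile, K-slice of 64, 3 stages, 228 KB per SM.
constexpr std::size_t kStage = stage_bytes(128, 256, 64, 2);
static_assert(kStage == 48 * KB, "one stage is 48 KB");
static_assert(3 * kStage == 144 * KB, "three stages are 144 KB");
```

Calling `ctas_per_sm_by_smem(228 * KB, 3 * kStage)` returns 1. A fourth stage (192 KB) still fits one CTA; a fifth (240 KB) does not fit at all, which is why stage count and tile shape have to be chosen together.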

This is a fundamentally different occupancy philosophy from Ampere. On Ampere, you often needed 2-4 CTAs per SM to hide memory latency through warp-switching.

On Hopper, one CTA with TMA and WGMMA already achieves near-peak throughput on large tiles, because the hardware units that matter (TMA, tensor cores) are all fully occupied.

What the Profiler Shows You on Hopper

The Nsight Compute metrics shift dramatically compared to Ampere.

smsp__warp_issue_stalled_long_scoreboard approaches zero. Not because the memory is fast, but because TMA loads do not involve the scoreboard at all. The SMSP is not waiting for memory; it is not the unit that issued the memory request.

smsp__warp_issue_stalled_mio_throttle is also low. The single TMA instruction per tile barely loads the MIO unit.

smsp__warp_issue_stalled_wgmma_global_wait is the new dominant stall: this is the SMSP waiting for a WGMMA.WAIT_GROUP to complete so it can read the accumulator registers.

This stall is unavoidable for kernels that read their accumulators between WGMMA groups (e.g., for split-K partial reductions). For kernels with long K dimensions, the WGMMA pipeline fills up and this stall disappears.

sm__pipe_tensor_op_hmma_cycles_active should be 80-95% for a well-tuned Hopper GEMM. Anything below 70% suggests either a pipeline depth problem (too few stages) or a cluster scheduling problem (the GPC is not scheduling the cluster CTAs together).

l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld counts shared memory read operations. For a kernel where both A and B are read from shared memory by WGMMA (SS variant), this metric reflects tensor core throughput, not programmer-visible loads. The tensor cores are reading shared memory directly, and this shows up in the LSU metrics.

The TMA throughput metrics are in the tma namespace: tma__read_bytes and tma__read_transactions. A kernel that is achieving peak TMA throughput will show TMA bandwidth close to the theoretical HBM bandwidth, because TMA is the only thing accessing HBM.

The key diagnostic insight on Hopper: if your WGMMA utilization is high and your TMA bandwidth is high, the kernel is good. The two hardware units are the bottleneck by design. Everything else should be idle or near-idle.

The Roofline on Hopper, revisited

Part VII introduced the roofline model and noted that the useful diagnosis is hierarchical: not “memory-bound” but “memory-bound at the L2 level, achieving 60% of L2 peak”. On Hopper the hierarchy has the same levels (L1, L2, HBM) but new slopes.

H100 SXM5 roofline parameters:

HBM3 peak bandwidth: 3.35 TB/s
L2 peak bandwidth: approximately 12 TB/s (across 50 MB of L2, two slices)
Shared memory peak bandwidth: approximately 33 TB/s aggregate (SM-local)
Tensor core peak (dense BF16): 989 TFLOP/s

Ridge points:

HBM ridge: 989 / 3.35 ≈ 295 FLOP/byte
L2 ridge: 989 / 12 ≈ 82 FLOP/byte
Shared memory ridge: 989 / 33 ≈ 30 FLOP/byte

For a GEMM with arithmetic intensity of 295 FLOP/byte or above, the kernel should be compute-bound, assuming the memory hierarchy is properly utilized. Below 295 FLOP/byte of HBM traffic, it is HBM-bandwidth-bound.

Below 82 FLOP/byte of L2 traffic, even a perfect L2 hit rate cannot save you. Below 30 FLOP/byte of shared memory traffic, the tensor core throughput is limited by shared memory bandwidth, which means either bank conflicts or tile sizes that do not saturate the WGMMA datapath.

The key new insight on Hopper: TMA changes the shape of the memory hierarchy’s contribution. The SMSP instruction bandwidth, which was a secondary bottleneck on Ampere (and a primary bottleneck for small tiles), is effectively removed from the HBM bandwidth calculation.

The raw bandwidth into shared memory is now limited by the TMA units’ throughput rather than by SMSP instruction issue. For that limit not to bind, the aggregate TMA throughput across all SMs has to be at least the HBM bandwidth of 3.35 TB/s, and in practice TMA-based copy kernels do get close to HBM peak.

So for kernels that are purely bandwidth-limited (not compute-bound), TMA is not the constraint; HBM is.

For compute-bound kernels with large tiles, TMA’s instruction offloading is what enables the SMSP to run WGMMA at full throughput.

A Brief Look at Blackwell

Blackwell (SM100, B100/B200) was announced in March 2024 and began shipping to hyperscalers in late 2024. The architectural trajectory established by Hopper continues and accelerates.

The Blackwell tensor core introduces a 5th generation MMA with FP4 support (MXFP4 and NVFP4 formats), enabling roughly 20 PFLOP/s of peak sparse FP4 throughput on a full dual-die B200. The FP8 throughput is approximately 9 PFLOP/s per GPU with sparsity, and about half that for dense math.

TMA on Blackwell gains native support for im2col pattern transforms (relevant for convolutions) and transposed stores, reducing the need for separate transpose kernels.

The cluster size limit increases to 16 CTAs (from 8 on Hopper), further amortizing B tile loads across more compute.

A new fifth-generation NVLink provides 1.8 TB/s bidirectional bandwidth per GPU in NVLink-connected systems (NVL72 rack), enabling multi-GPU kernels where the “global memory” seen by a TMA operation is distributed across 72 GPUs. This is the level at which the distinction between a single-GPU kernel and a distributed compute graph begins to blur.

CUTLASS 3.x supports Blackwell through new SM100 collective templates. The programming model is the same; the numbers are larger.

Conclusion

The trajectory from Volta through Ampere to Hopper is a coherent story: every generation pushes more of the data movement machinery off the SMSP and onto dedicated hardware.

Volta gave you tensor cores, so the SMSP stopped doing the arithmetic. Ampere gave you cp.async, so the SMSP stopped waiting for loads. Hopper gave you TMA, so the SMSP stopped issuing loads entirely.

The SMSP on a well-tuned Hopper GEMM kernel is a machine that does one thing: issue WGMMA.MMA_ASYNC. Everything else has been delegated.

This is not an accident. It is the logical endpoint of the observation that matrix multiply is the kernel that matters most for modern ML workloads, and the most efficient hardware for matrix multiply is hardware where the compute units are never idle.

Every architectural innovation from 2017 onwards has been an attack on a different reason why the compute units were idle: arithmetic latency (tensor cores), memory latency (cp.async), instruction bandwidth (TMA), inter-SM bandwidth (clusters, NVLink).

The mbarrier, the tensor descriptor, the warpgroup specialization, the producer-consumer pipeline, the tile scheduler: these are not ornamental complexity.

They are the mechanisms by which a 2024 GPU running a 2024 kernel achieves 80-90% of theoretical peak on matrix multiply, a number that would have seemed implausible to practitioners writing hand-tuned BLAS routines a decade ago.

Part IX will step back from the single-GPU picture and look at multi-GPU parallelism: tensor parallelism, pipeline parallelism, NCCL, and the question of how NVLink bandwidth interacts with the per-GPU compute performance we have spent eight parts building up.

The tools change; the principle does not: find the bottleneck, route around it, measure again.

Mastering CUDA and High-Performance Computing, Part VII

Lorenzo Bradanini — Fri, 27 Mar 2026 17:53:12 GMT

Where Part VI Left Us

Part VI ended with a sentence that deserves to be unpacked:

cp.async instructions do not set the long scoreboard.
The register file is not involved, so no register’s bit is marked pending.
The SMSP issues the cp.async, the copy engine takes it, and the SMSP is immediately free to issue the next instruction for that warp.

This is not a minor optimization note.

It is a description of a fundamentally different execution model, one that requires you to abandon the mental model of “instruction issues, result arrives, next instruction proceeds” and replace it with something more like a production pipeline in a factory: stages overlap, buffers exist between them, and throughput is determined by the slowest stage, not the sum of all stage latencies.

Before we can make cp.async do useful work, we need an accurate model of what it is hiding from: the memory hierarchy.

The Memory Hierarchy of the A100

The A100 SXM4 has six levels of memory that matter to kernel programmers. They are not equally documented, and the numbers in marketing materials are frequently not the numbers in production code.

Registers

Each SM on Ampere has a 256 KB register file, shared across the four SMSPs: 64 KB per SMSP, with a 256-bit read port per cycle.

Register file access latency is effectively 0 cycles in the bypass case; for non-bypassed reads the cost is absorbed into the 4-cycle FMA pipeline. Registers are not a latency source. They are a capacity and bandwidth source.
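
The capacity side can be made concrete with a rough calculation. The sketch below assumes the 256 KB register file holds 65,536 32-bit registers and that an A100 SM can host at most 64 warps. The function name is illustrative, and the real allocator rounds register counts up to a granularity that this model ignores.

```cpp
#include <algorithm>
#include <cassert>

// Upper bound on resident warps per SM imposed by register use alone.
// Ignores register allocation granularity, shared memory, and block-size limits.
int max_warps_by_registers(int regs_per_thread,
                           int regs_per_sm = 65536,   // 256 KB / 4 bytes
                           int hw_warp_limit = 64) {
    assert(regs_per_thread > 0 && regs_per_thread <= 255);
    const int regs_per_warp = regs_per_thread * 32;   // 32 threads per warp
    return std::min(hw_warp_limit, regs_per_sm / regs_per_warp);
}
```

At 32 registers per thread the register file supports all 64 warps; at 128 it supports 16; at the 255-register ceiling only 8 warps fit, which leaves very little to switch to when one of them stalls.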

The capacity limit is the one that matters: each thread can use at most 255 registers.

Pressure above this causes the compiler to spill values to local memory, a per-thread private region backed by L1, L2, and DRAM.

Spills are indistinguishable from any other global memory access at the hardware level: they go through the MIO unit, set the long scoreboard, and, when they miss in L1 and L2, wait 400+ cycles for DRAM. Every spilled register costs two MIO operations: a store at the spill and a load at the reload.

Shared Memory / L1 Cache

Ampere’s per-SM L1 is a 192 KB pool partitioned between shared memory and the hardware L1 data cache.

The shared memory carveout is configurable (0, 8, 16, 32, 64, 100, 132, or 164 KB of shared memory, with the remainder serving as L1 data cache) via cudaFuncSetAttribute with cudaFuncAttributePreferredSharedMemoryCarveout.

Shared memory has 32 banks, each 4 bytes wide.

Bank index for a byte address:

bank = (address >> 2) & 31

Access patterns where multiple threads in a warp access different addresses in the same bank serialize.

A 4-way bank conflict serializes the access into four passes, but it does not quadruple the latency. The conflict-free latency is approximately 23 cycles; a 4-way conflict extends this to ~35 cycles; an 8-way conflict to ~51 cycles. The penalty scales linearly, at roughly 4 extra cycles per additional conflicting address.

The broadcast exception: if all threads in a warp access the exact same address within a bank, the hardware services this as a single read and broadcasts the result.

Thirty-two threads accessing thirty-two different addresses that all map to the same bank is not a broadcast. It is a 32-way serialization.
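
The bank rule and the broadcast exception are easy to encode. The following host-side sketch computes the serialization factor for one warp-wide access from the 32 byte addresses involved. The function names are illustrative, and the model covers only 4-byte accesses.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Worst-case number of passes for one warp-wide 4-byte shared memory access.
// bank = (address >> 2) & 31; threads that hit the same 32-bit word are
// served together (broadcast), so only distinct words per bank are counted.
int conflict_degree(const std::array<std::uint32_t, 32>& byte_addr) {
    std::array<std::set<std::uint32_t>, 32> words_in_bank;
    for (std::uint32_t a : byte_addr) {
        const std::uint32_t word = a >> 2;
        words_in_bank[word & 31].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& words : words_in_bank) worst = std::max(worst, words.size());
    return static_cast<int>(worst);
}

// Thread t accesses byte address t * stride_bytes.
std::array<std::uint32_t, 32> strided(std::uint32_t stride_bytes) {
    assert(stride_bytes % 4 == 0);   // keep every access word-aligned
    std::array<std::uint32_t, 32> a{};
    for (std::uint32_t t = 0; t < 32; ++t) a[t] = t * stride_bytes;
    return a;
}
```

A stride of 4 bytes gives degree 1 (conflict-free), a stride of 0 also gives 1 (broadcast), a stride of 8 gives 2, and a stride of 128 bytes puts all 32 threads on bank 0 with different words, for a degree of 32.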

L2 Cache

The A100 has 40 MB of L2 cache, split into two 20 MB slices. L2 hit latency: approximately 180–200 cycles, higher than most documentation implies.

Accesses to the local slice are ~160–180 cycles; accesses to the remote slice (requiring crossbar traversal) are ~200–230 cycles.

L2 bandwidth is approximately 4 TB/s aggregate, roughly 2× the HBM bandwidth, and an L2 hit costs less than half the latency of an HBM access. Fitting a working set in L2 is qualitatively different from spilling it to HBM.

HBM2e

The A100 SXM4 has five active HBM2e stacks providing a peak theoretical bandwidth of about 2 TB/s. In practice, a kernel whose access pattern is regular enough to saturate all channels achieves 1.6–1.9 TB/s.

Irregular access patterns with row buffer conflicts: 800 GB/s–1.2 TB/s. Random byte-granularity reads: tens of GB/s, due to cache line waste.

HBM2e latency, measured with L1 and L2 bypassed: approximately 450–600 cycles at 1410 MHz. Row buffer hits land around 300–350 cycles; misses around 550–650 cycles.

The consequence at 1410 MHz: 500 cycles × 0.71 ns/cycle ≈ 355 nanoseconds of stall per warp. In that window, 500 issue slots per warp scheduler go unused unless another warp is eligible.

If every resident warp has issued an HBM load and is waiting, you have a 500-cycle stall with no eligible warp to rescue you.

This is the memory wall in concrete form. The solution is not a faster memory: it is to restructure data movement so that HBM latency is overlapped with computation.
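
How much overlap is needed follows from Little’s law: the data that must be in flight equals bandwidth times latency. The sketch below is a back-of-the-envelope helper, not something from the A100 documentation. The function names are illustrative, and the inputs are the round numbers used above.

```cpp
#include <cassert>

// Convert a latency in clock cycles to nanoseconds.
double cycles_to_ns(double cycles, double clock_mhz) {
    assert(clock_mhz > 0.0);
    return cycles * 1000.0 / clock_mhz;
}

// Little's law: bytes that must be in flight to sustain a given bandwidth.
// 1 GB/s is 1 byte/ns, so the product is directly in bytes.
double bytes_in_flight(double bandwidth_gb_per_s, double latency_ns) {
    return bandwidth_gb_per_s * latency_ns;
}
```

`cycles_to_ns(500, 1410)` is about 354.6 ns. Sustaining 2 TB/s at that latency requires roughly 710 KB of loads in flight across the chip, or about 6.6 KB per SM on a 108-SM part. Asynchronous copies exist to keep that many bytes outstanding without tying up registers.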

The cp.async Instruction

cp.async was introduced in Ampere (sm_80). It performs a direct DMA-like transfer from global memory to shared memory, bypassing the register file entirely:

cp.async.ca.shared.global [dst], [src], size;
cp.async.cg.shared.global [dst], [src], size;   // bypass L1

The size parameter is 4, 8, or 16 bytes for .ca; .cg accepts only 16. The 16-byte variant is the most important: each thread moves 128 bits per instruction (LDGSTS.128 in SASS), achieving maximum memory interface utilization.

What “bypassing the register file” actually means

The conventional load path:

LDG.128 R4, [R2]         ; → long scoreboard set for R4,R5,R6,R7
                         ; → warp stalls on any read of R4-R7
                         ; → 450-600 cycles later, HBM returns data
STS.128 [smem_ptr], R4   ; store registers → shared memory

This requires 4 registers in transit. The load sets four long scoreboard bits. The warp is ineligible for any instruction reading R4–R7 until the HBM transaction completes.

The cp.async path:

CP.ASYNC.CA.SHARED.GLOBAL [smem_dst], [R2], 0x10
; → no scoreboard bits set (no destination register)
; → warp immediately eligible to issue next instruction
; → data arrives in shared memory asynchronously

A dedicated Ampere Asynchronous Copy Engine receives the request via the MIO unit, takes ownership of the transaction, and performs the HBM load and shared memory write independently of the SMSP. The MIO unit is freed immediately after handoff.

The commit/wait mechanism

Commit (CP.ASYNC.COMMIT_GROUP): marks all preceding uncommitted cp.async instructions as a commit group. Bookkeeping only; it does not wait for anything.

Wait (CP.ASYNC.WAIT_GROUP N): stalls until at most N commit groups remain pending. N=0 is complete synchronization.

N=1 allows one in-flight group to remain outstanding while you compute on the previous.
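
The bookkeeping is simple enough to model on the host. The class below is a toy model of the commit/wait counters, not a CUDA API. `CommitGroupModel` and its methods are illustrative names, and groups are assumed to retire oldest first.

```cpp
#include <cassert>
#include <deque>

// Host-side model of cp.async commit-group bookkeeping for a single thread.
class CommitGroupModel {
public:
    void issue_copy() { ++open_copies_; }   // one cp.async, not yet committed

    void commit_group() {                   // CP.ASYNC.COMMIT_GROUP
        pending_.push_back(open_copies_);
        open_copies_ = 0;
    }

    // CP.ASYNC.WAIT_GROUP n: block until at most n groups remain pending.
    // Returns how many groups had to complete for the wait to return.
    int wait_group(int n) {
        assert(n >= 0);
        int retired = 0;
        while (static_cast<int>(pending_.size()) > n) {
            pending_.pop_front();           // oldest group completes first
            ++retired;
        }
        return retired;
    }

    int pending_groups() const { return static_cast<int>(pending_.size()); }

private:
    int open_copies_ = 0;
    std::deque<int> pending_;               // copies per pending group, oldest first
};
```

After two commits, `wait_group(1)` retires only the older group and leaves the newer one in flight, which is the double-buffering case. `wait_group(0)` then drains the rest.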

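// Fragment from inside a kernel; requires #include <cuda/pipeline>
auto pipe = cuda::make_pipeline();   // thread-scope pipeline, backed by cp.async groups

// Batch 0 -> smem[0]
pipe.producer_acquire();
for (int i = 0; i < BATCH_SIZE; i++)
    cuda::memcpy_async(&smem[0][i], &gmem[base + i], sizeof(float4), pipe);
pipe.producer_commit();

// Batch 1 -> smem[1]
pipe.producer_acquire();
for (int i = 0; i < BATCH_SIZE; i++)
    cuda::memcpy_async(&smem[1][i], &gmem[base + BATCH_SIZE + i], sizeof(float4), pipe);
pipe.producer_commit();

pipe.consumer_wait();   // waits for the oldest batch only: CP.ASYNC.WAIT_GROUP 1
__syncthreads();        // mandatory: every thread's copies for batch 0 have landed

compute(smem[0]);
pipe.consumer_release();   // batch 0's buffer may now be refilled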

The __syncthreads() after consumer_wait is mandatory. consumer_wait only guarantees that the copies issued by the calling thread have landed in shared memory.

Copies issued by other threads in the block may still be in flight; __syncthreads() makes every thread finish its own wait before any thread reads the tile.

Omitting it is a race condition: one that produces correct results most of the time and incorrect results unpredictably under heavy memory pressure.

The Double Buffer Pattern

A standard tiled GEMM loop is fully sequential: load tile, sync, compute, sync, repeat. The timeline is a flat chain of dependencies. For smaller problems or thinner tiles where T_load / T_compute > 1, the kernel is memory-bound.

The double buffer pattern breaks that chain:

Iter k:   |-- cp.async A[k] --|-- cp.async B[k] --|-- commit --|
                                                                |-- wait(k-1) --|-- compute(k-1) --|
Iter k+1: |-- cp.async A[k+1] --|-- cp.async B[k+1] --|-- commit --|
                                                                    |-- wait(k) --|-- compute(k) --|

Loads for iteration k+1 overlap with computation of iteration k. Memory latency is hidden as long as T_load(k+1) < T_compute(k). The pipeline then runs at the compute rate with zero memory stall.

This requires two ping-pong buffers in shared memory, doubling the shared memory requirement.

When shared memory is the limiting resource, doubling shared memory per thread block halves the maximum resident thread blocks per SM, reducing occupancy. The trade-off is explicit and computable.

Diagnostic signal: if smsp__warp_issue_stalled_long_scoreboard.avg.pct_of_peak_sustained_active exceeds 20%, memory latency is not being hidden. The first intervention is higher occupancy.

The second, when occupancy is already near maximum, is cp.async pipelining, which removes the long scoreboard from the equation entirely.

The Full Kernel Pattern

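#include <cuda/pipeline>
#include <cuda_bf16.h>

constexpr int TILE_M = 128, TILE_N = 128, TILE_K = 32;
constexpr int NUM_STAGES = 2;

// Launch: dim3 block(32, 32), dim3 grid(N / TILE_N, M / TILE_M).
// Assumes M and N are multiples of 128, K is a multiple of 32, and A, B are 16-byte aligned.
__global__ void __launch_bounds__(1024) gemm_async_kernel(
    const __nv_bfloat16* __restrict__ A,
    const __nv_bfloat16* __restrict__ B,
    float* __restrict__ C,
    int M, int N, int K
) {
    __shared__ __align__(16) __nv_bfloat16 smem_A[NUM_STAGES][TILE_M][TILE_K];
    __shared__ __align__(16) __nv_bfloat16 smem_B[NUM_STAGES][TILE_K][TILE_N];
    float acc[4][4] = {};

    auto pipe = cuda::make_pipeline();
    const int k_tiles = K / TILE_K;
    const int tid = threadIdx.y * blockDim.x + threadIdx.x;   // 0..1023

    // Each thread copies one 16-byte chunk (8 BF16): threads 0..511 cover the
    // A tile (128 rows x 4 chunks), threads 512..1023 the B tile (32 rows x 16 chunks).
    auto load_tile = [&](int kt, int stage) {
        pipe.producer_acquire();
        if (tid < 512) {
            const int r = tid / 4, c = (tid % 4) * 8;
            cuda::memcpy_async(&smem_A[stage][r][c],
                               &A[(size_t)(blockIdx.y * TILE_M + r) * K + kt * TILE_K + c],
                               sizeof(float4), pipe);
        } else {
            const int r = (tid - 512) / 16, c = ((tid - 512) % 16) * 8;
            cuda::memcpy_async(&smem_B[stage][r][c],
                               &B[(size_t)(kt * TILE_K + r) * N + blockIdx.x * TILE_N + c],
                               sizeof(float4), pipe);
        }
        pipe.producer_commit();
    };

    // Each thread accumulates a 4x4 patch of the 128x128 output tile.
    auto compute_tile = [&](int stage) {
        for (int ki = 0; ki < TILE_K; ki++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    acc[i][j] += __bfloat162float(smem_A[stage][threadIdx.y*4+i][ki])
                               * __bfloat162float(smem_B[stage][ki][threadIdx.x*4+j]);
    };

    // PROLOGUE: issue tile 0 before the main loop
    if (k_tiles > 0) load_tile(0, 0);

    // MAIN LOOP: issue tile k, then compute on tile k-1
    for (int k = 1; k < k_tiles; k++) {
        const int sw = k % NUM_STAGES, sr = (k - 1) % NUM_STAGES;
        load_tile(k, sw);

        pipe.consumer_wait();   // oldest group (tile k-1) has landed: CP.ASYNC.WAIT_GROUP 1
        __syncthreads();        // ...for every thread's chunk, not just this one's
        compute_tile(sr);
        pipe.consumer_release();
        __syncthreads();        // all reads of stage sr finish before it is overwritten
    }

    // EPILOGUE: drain the last tile, compute on it, write C
    if (k_tiles > 0) {
        pipe.consumer_wait();   // CP.ASYNC.WAIT_GROUP 0
        __syncthreads();
        compute_tile((k_tiles - 1) % NUM_STAGES);
        pipe.consumer_release();
    }
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            C[(size_t)(blockIdx.y * TILE_M + threadIdx.y*4+i) * N
              + blockIdx.x * TILE_N + threadIdx.x*4+j] = acc[i][j];
}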

Three things to internalize about this structure:

The prologue is not optional. Without issuing tile 0 before the loop, the first consumer_wait has no commit group for tile 0 to wait on, and the first compute reads shared memory that was never filled. The prologue establishes the “one stage ahead” invariant that the loop depends on.

Both synchronization primitives are required. consumer_wait ensures the copies issued by the calling thread have landed in shared memory. __syncthreads() ensures all threads in the block have reached this point, and therefore that all of their copies have landed, before any thread reads.

They solve different problems. Neither substitutes for the other.

Stage read and stage write are never equal. The modular arithmetic guarantees sw ≠ sr for NUM_STAGES = 2. The DMA engine writes to one buffer while threads read from the other.

With NUM_STAGES ≥ 3 you deepen the pipeline: more latency hidden, more shared memory consumed.

N-Stage Generalization

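With N stages, you issue N−1 tiles’ worth of cp.async before the first computation begins, and keep N−1 loads in flight from then on. The latency is hidden when T_compute(tile) > T_HBM_load / (N−1), which for N = 2 reduces to the double-buffer condition above.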

CUTLASS implements up to 5-stage pipelines for its Ampere GEMM kernels, with stage count as a compile-time template parameter swept by the profiler at tuning time. The shared memory cost scales linearly with stage count.

At some crossover point the shared memory requirement forces an occupancy reduction that exceeds the pipelining benefit.

This crossover depends on the specific kernel and problem size, which is why CUTLASS exposes the parameter rather than hardcoding it.
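
The shared memory side of that crossover can be tabulated directly. The sketch below assumes 164 KB of usable shared memory per A100 SM and 1 KB reserved per resident block. Both values, like the function name, are assumptions of this example, and register and thread limits are ignored. The 16 KB stage size corresponds to one 128×32 A tile plus one 32×128 B tile in BF16.

```cpp
#include <cassert>

// Resident blocks per SM as limited by shared memory, for a given stage count.
int blocks_per_sm_by_smem(int stages, int stage_bytes, int smem_per_sm_bytes,
                          int reserved_per_block_bytes = 1024) {
    assert(stages > 0 && stage_bytes > 0 && smem_per_sm_bytes > 0);
    const int per_block = stages * stage_bytes + reserved_per_block_bytes;
    return smem_per_sm_bytes / per_block;
}
```

With 16 KB stages and 164 KB per SM, 2 stages allow 4 resident blocks, 3 stages allow 3, 4 or 5 stages allow 2, and 6 stages leave room for only 1. Each step down removes warps that could otherwise hide latency, so a deeper pipeline pays off only while the extra overlap outweighs that loss.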

What the Profiler Shows You

Before pipelining (conventional LDG loads):

smsp__warp_issue_stalled_long_scoreboard: 40–70%, dominant stall
smsp__pipe_fma_cycles_active: 30–60%, computation starved

After pipelining (cp.async, double buffer):

smsp__warp_issue_stalled_long_scoreboard: <5%, cp.async sets no scoreboard bits
smsp__pipe_fma_cycles_active: 70–90% for a well-tuned kernel

Watch for smsp__warp_issue_stalled_mio_throttle: if you issue cp.async faster than the MIO unit can service them (~1 per 4 cycles per SMSP for 128-bit transfers), this stall replaces the scoreboard stall. The fix is larger tiles, or accepting the throttle if MIO throughput still exceeds compute throughput.

Bank Conflicts

The 32-bank model is documented. The practical implications for matrix access patterns are not.

In a tiled GEMM, tile A is loaded into shared memory in row-major layout, then read column-wise during the multiply.

For TILE_K = 32 and BF16 elements (2 bytes each), element [j][i] sits at byte offset j × 64 + i × 2. Bank index: (j × 16 + i/2) & 31.

For a warp reading column i (i fixed, j running 0..31), the term j × 16 takes only two values modulo 32 (0 and 16), so the 32 threads land in just two banks with sixteen distinct addresses in each. This is a 16-way bank conflict on every column read.

The fix is padding:

__shared__ __nv_bfloat16 smem_A[TILE_M][TILE_K + 2];   // +2 BF16 = +4 bytes per row

With the pad, element [j][i] is at byte offset j × 68 + i × 2. Bank index: (j × 17 + i/2) & 31. Since gcd(17, 32) = 1, the bank indices as j runs 0..31 form a complete permutation of 0..31. Zero conflicts.

The shared memory overhead is TILE_M × 4 bytes per buffer: 512 bytes for TILE_M = 128, trivial against the 8 KB tile.
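
The effect of the row pitch on a column read can be checked numerically. This host-side sketch applies the bank formula to the 32 addresses a warp touches when thread j reads element [j][col] of a row-major BF16 tile. The function name is illustrative.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <set>

// Serialization factor when a warp reads one column of a row-major BF16 tile
// whose rows are pitch_elems elements apart (thread j reads row j).
int column_read_conflict(int pitch_elems, int col) {
    assert(pitch_elems > 0 && col >= 0 && col < pitch_elems);
    std::array<std::set<int>, 32> words_in_bank;
    for (int j = 0; j < 32; ++j) {
        const int byte_offset = (j * pitch_elems + col) * 2;   // 2 bytes per BF16
        const int word = byte_offset >> 2;                     // 4-byte bank word
        words_in_bank[word & 31].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& words : words_in_bank) worst = std::max(worst, words.size());
    return static_cast<int>(worst);
}
```

A pitch of 32 elements (64 bytes) gives a factor of 16, while a pitch of 34 elements (68 bytes) gives 1 for every column. A pitch of 40 elements, which would keep rows 16-byte aligned, still gives 4, because gcd(20, 32) = 4 leaves only eight distinct banks. Conflict-free padding needs an odd number of 4-byte words per row.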

CUTLASS’s Swizzle technique achieves the same result via address bit permutation rather than linear padding, which handles non-power-of-two tile sizes cleanly.

The arithmetic underneath is identical.

L1 Cache Policy

Cache behavior on Ampere is configurable at the instruction level:

Qualifier   Behavior
LDG.CA      Cache in L1 (default)
LDG.CG      Bypass L1, go to L2
LDG.CS      Streaming: insert at LRU position
LDG.CV      Bypass all caches (almost never correct)

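In CUDA: __ldca() for the L1-cached form, __ldcg() to bypass L1, __ldcs() for streaming, and __ldcv() for the uncached form; __ldg() is a separate intrinsic that goes through the read-only (non-coherent) path. The compiler defaults to LDG.CA when uncertain.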

For kernels that process each input element exactly once, elementwise operations, reductions, anything with no reuse, __ldcg() eliminates L1 pollution and preserves L1 capacity for data that does benefit from caching.

The effect in the profiler: lower L1 hit rate, unchanged L2 hit rate. The data skips one cache level without reducing effective bandwidth at the level where reuse actually exists.

The Roofline Model

The roofline model (Williams, Waterman, Patterson, 2009) plots FLOP/s against arithmetic intensity (FLOP/byte of DRAM traffic). For the A100 in FP32:

Peak compute: ~19.5 TFLOP/s
Peak HBM bandwidth: ~2 TB/s
Ridge point: ~9.75 FLOP/byte

Below the ridge: memory-bound. Above: compute-bound. The common mistake is treating DRAM bandwidth as the only line that matters.

The L2-based roofline has a ridge at ~4.9 FLOP/byte. The L1-based roofline has a ridge at ~1 FLOP/byte.

A kernel with strong L1 reuse can be compute-bound at an arithmetic intensity that looks memory-bound on the DRAM roofline.

A kernel that thrashes L2 will underperform the DRAM roofline because its effective bandwidth is below the theoretical peak. NCU’s roofline chart shows all three simultaneously.

The correct first diagnostic is hierarchical bandwidth analysis. Not “it’s memory-bound”; that’s a category.

The useful diagnosis is “it’s memory-bound at the L2 level, achieving 60% of L2 peak, because 40% of L2 bandwidth is wasted on non-reused data evicted before second use.” That tells you the fix.
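
The roofline arithmetic itself is two lines, which makes it easy to evaluate at each level of the hierarchy. A minimal sketch using the A100 figures quoted above follows. The function names are illustrative, and the L2 bandwidth is the approximate 4 TB/s from the memory hierarchy discussion.

```cpp
#include <algorithm>
#include <cassert>

// Ridge point in FLOP/byte: the arithmetic intensity where the two roofs meet.
double ridge_point(double peak_tflops, double bandwidth_tb_per_s) {
    assert(bandwidth_tb_per_s > 0.0);
    return peak_tflops / bandwidth_tb_per_s;
}

// Attainable TFLOP/s at a given arithmetic intensity (FLOP/byte) for one level.
double attainable_tflops(double intensity, double peak_tflops, double bandwidth_tb_per_s) {
    return std::min(peak_tflops, intensity * bandwidth_tb_per_s);
}
```

`ridge_point(19.5, 2.0)` gives 9.75 FLOP/byte for HBM and `ridge_point(19.5, 4.0)` gives 4.875 for L2. A kernel at 6 FLOP/byte is capped at 12 TFLOP/s by HBM, but it reaches the full 19.5 TFLOP/s if its traffic is served from L2. This is the case where a kernel looks memory-bound on the DRAM roofline and is not.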

The Tensor Memory Accelerator

Ampere introduced cp.async. Hopper (sm_90, H100) introduced the Tensor Memory Accelerator (TMA), the same idea taken to its logical conclusion.

With cp.async, the programmer still computes every element’s global memory address and constructs the instruction stream.

For a 128×128 BF16 tile (32 KB), that is 2,048 per-thread 128-bit cp.async copies, or 64 warp-wide instructions, consuming SMSP instruction bandwidth even though the transfers are asynchronous.

TMA accepts a tensor descriptor (base address, dimensions, strides, element type) and issues a single instruction:

cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes
    [smem_dst], [gmem_desc, {coord_y, coord_x}], [mbar];

One instruction. One 128×128 BF16 tile. The TMA unit generates all the addresses, manages all the transactions, and signals completion via the mbarrier primitive, a synchronization mechanism lighter than __syncthreads(), designed for producer-consumer coordination without a full CTA barrier.

The consequence: on Hopper, the compute-to-load instruction ratio in a GEMM inner loop approaches ∞ from the SMSP’s perspective.

The SMSPs run wgmma.mma_async continuously; the TMA unit handles all data movement independently. CUTLASS 3.x is designed around this model. Part VIII will cover it in full.

Conclusion

The line from Part VI, “the SMSP is immediately free to issue the next instruction for that warp”, is the hinge on which this article turns.

The memory hierarchy imposes latencies that are not negotiable in nanoseconds: 23 cycles for shared memory, 180 for L2, 500 for HBM. These numbers do not change by complaining about them.

They change by structuring code so that the latency is incurred before the result is needed: issuing the memory request while computing on previously loaded data.

cp.async is the mechanism. Software pipelining is the pattern. Double buffering is the minimum viable instance. The commit/wait protocol maintains correctness while the DMA engine and the compute engine run simultaneously.

The bank conflict analysis and the L1 bypass discussion are extensions of the same idea: minimize latency and maximize effective bandwidth at every level of the hierarchy, so that by the time data arrives at the computation, it has traveled through the hardware as efficiently as physics allows.

The limits of this approach on Ampere are what motivate TMA on Hopper: an architecture where the gap between what the programmer expresses and what the hardware executes narrows further, approaching the regime where the programmer describes what should move and the hardware decides when.

Part VIII begins there.

Mastering CUDA and High-Performance Computing, Part VI

Lorenzo Bradanini — Sun, 22 Mar 2026 09:15:47 GMT

The Pipeline’s One Promise, and How It Fails

The A100 SM runs at a base clock of approximately 765 MHz, boost to ~1410 MHz. At boost, one clock cycle is ~0.71 nanoseconds. The SM has four SMSPs.

Each SMSP has one warp scheduler and one dispatch unit (see NVIDIA’s Ampere whitepaper, and the microbenchmark work of Jia et al. and of Markidis, Larsson et al.).

Each scheduler attempts to issue one instruction per cycle to one eligible warp.

At full throughput (all four SMSPs issuing every cycle) a single A100 SM issues 4 warp instructions per cycle.

Across 108 SMs at 1410 MHz, peak issue rate is roughly 0.61 trillion warp instructions per second. This is the theoretical ceiling. You will never reach it. The question is why, and by how much.

An instruction issues in a given cycle when three conditions are simultaneously true:

(1) The warp is eligible: it has been selected by the round-robin/priority scheduler, it is not stalled on a scoreboard dependency, and it has not exceeded the warp’s instruction buffer depth.
(2) The execution unit is available: the target pipe (FMA, SFU, MIO, LSU...) has a free slot.
(3) All operands are ready: every source register’s scoreboard bit has been cleared by its producing instruction.

When any of these three conditions fails, the scheduler increments a stall counter and moves to another warp.

The beauty of the GPU microarchitecture, and the central insight of GPU optimization, is that condition (3) failing for warp A doesn’t stall the SM; it just causes the scheduler to attempt warp B instead.

The SM stalls only when no warp satisfies all three conditions simultaneously. That’s the failure mode we are trying to prevent.
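
A toy model shows how quickly extra warps close the gap. The sketch below simulates a single scheduler issuing from warps that each run a chain of dependent instructions with a fixed latency. It is a deliberately simplified model with illustrative names and says nothing about the real scheduling policy.

```cpp
#include <cassert>
#include <vector>

// Fraction of cycles in which one scheduler manages to issue an instruction.
// Every warp runs a serial dependency chain: after issuing at cycle c it is
// not eligible again until cycle c + latency_cycles.
double issue_utilization(int num_warps, int latency_cycles, int total_cycles) {
    assert(num_warps > 0 && latency_cycles > 0 && total_cycles > 0);
    std::vector<int> ready_at(num_warps, 0);
    int issued = 0, next = 0;
    for (int cycle = 0; cycle < total_cycles; ++cycle) {
        // Round-robin search for an eligible warp, starting after the last issuer.
        for (int probe = 0; probe < num_warps; ++probe) {
            const int w = (next + probe) % num_warps;
            if (ready_at[w] <= cycle) {
                ready_at[w] = cycle + latency_cycles;
                ++issued;
                next = (w + 1) % num_warps;
                break;
            }
        }
        // If no warp was eligible, this issue slot is simply lost.
    }
    return static_cast<double>(issued) / total_cycles;
}
```

With a 4-cycle latency, one warp keeps the scheduler busy 25% of the time, two warps 50%, and four warps 100%. With a 500-cycle latency, even 16 warps reach only about 3%, which is the case where no amount of warp switching on one SMSP is enough.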

ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum is how you measure condition (3) failures for memory. smsp__warp_issue_stalled_* counters measure them by category.

We will use both throughout.

What lives inside one SMSP

Before discussing stalls, you need an accurate map of what execution units exist and what their throughput and latency look like.

Much of the confusion in GPU optimization literature stems from people using “the FP32 pipe” as a monolith when it is not.

One Ampere SMSP contains, per NVIDIA’s Ampere Architecture whitepaper and corroborating microbenchmark work:

A few notes on the accuracy of this table:

The FMA latency of 4 cycles is confirmed by the CUDA C Programming Guide and by numerous independent microbenchmarks.

It is not 1 cycle. It is not 2 cycles. It is 4, and every serial dependency chain in your kernel pays it in full.

The SFU latency of 16 cycles is confirmed by microbenchmarks. Throughput of 4 cycles/instruction means one MUFU occupies the single SFU for 4 cycles; other warps’ MUFU instructions queue behind it.

Since there is only one SFU per SMSP and the SMSP has 16 warps maximum, a warp issuing a MUFU must wait for the SFU to become free if another warp issued one within the last 4 cycles.

The FP64 throughput asymmetry is critical for A100 versus A30/A10 hardware: the A100 has full-rate FP64 (2 cycles per DFMA per SMSP), while the A10 has 1/16th the FP64 throughput (DFMA at 32 cycles per instruction).

Running FP64 code on an A10 is not slower: it is catastrophically slower. Verify your hardware before benchmarking.

Shared memory load latency of 23 cycles is confirmed by microbenchmarks (Luitjens 2011, Volkov 2016, and more recently by Yan et al. in their SM scheduling simulator).

The official CUDA documentation says “~20 cycles” without precision; 23 cycles is the empirically correct number for Ampere under normal bank-conflict-free access.

With 4-way bank conflicts the effective latency compounds because the MIO pipe is occupied for additional cycles while the bank serialization completes.

The scoreboards in detail

Each SMSP has two scoreboards, as described in Part V. Understanding their interaction with execution units is worth revisiting with more precision:

Short scoreboard: covers arithmetic results from the FMA pipe, INT32 ALU, and SFU. Latency tracked: 4 cycles (FMA/INT32) and 16 cycles (SFU). The scoreboard has one bit per register per warp.

When an FFMA issues with destination R4, bit R4 for that warp is set in the short scoreboard. It is cleared 4 cycles later (for FMA results) by the pipeline’s bypass network.

An instruction in another warp that reads R4 of the issuing warp is unaffected: scoreboards are per-warp, not global.

Long scoreboard: covers memory results, meaning any instruction that issues to the MIO unit (loads from global/shared/local memory, atomic operations).

The long scoreboard bit is set when the load issues and is not cleared until the data physically arrives and is written to the register file.

For an HBM access this can be 400+ cycles. The SMSP does not know in advance how long an HBM access will take (it depends on DRAM row buffer state, competing traffic, etc.); it just waits for the completion signal from the memory system.

An important subtlety: cp.async instructions do not set the long scoreboard. This is the mechanism by which they achieve asynchrony.

The register file is not involved, so no register’s bit is marked pending.

The SMSP issues the cp.async, the copy engine takes it, and the SMSP is immediately free to issue the next instruction for that warp.

We will return to the exact implications of this in future posts.


The SFU, quantified completely

The Special Function Unit executes MUFU instructions. Let’s be more precise than “the SFU is slow.”

On Ampere, each SMSP has one SFU. In the model used here, the SFU accepts one new warp instruction every 4 cycles (this is the initiation interval, II) and delivers the result 16 cycles after issue (this is the latency, L = 4 × II). Equivalently, the pipeline is 4 stages deep and a warp instruction occupies each stage for 4 cycles, so up to four MUFU instructions can be in flight at once.

This is a reasonable structural design for a unit that computes hardware approximations to transcendentals: the underlying table lookup and polynomial interpolation take multiple stages.

The key distinction: throughput (4 cycles/instruction) limits how often you can issue MUFU instructions from the same SMSP.

Latency (16 cycles) limits how soon a downstream instruction can use the MUFU result.

Both matter; they fail you in different scenarios.
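A back-of-the-envelope sketch of the two failure modes, using the 4-cycle initiation interval and 16-cycle latency quoted above. The function names are illustrative, and the model ignores everything except the SFU itself.

```cpp
#include <cassert>

constexpr int kSfuInitiationInterval = 4;   // cycles between MUFU issues on one SMSP
constexpr int kSfuLatency = 16;             // cycles from issue to usable result

// n MUFUs where each one consumes the previous result: latency-bound.
constexpr long dependent_chain_cycles(int n) {
    return static_cast<long>(n) * kSfuLatency;
}

// n independent MUFUs: one issues every II cycles, the last result lands
// kSfuLatency cycles after the last issue: throughput-bound.
constexpr long independent_cycles(int n) {
    return n == 0 ? 0 : static_cast<long>(n - 1) * kSfuInitiationInterval + kSfuLatency;
}
```

Eight dependent MUFUs take 128 cycles in this model; eight independent ones take 44. The first number is what latency costs you, the second is what throughput costs you.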

What MUFU opcodes actually compute

MUFU is not one instruction. It is one instruction format with an operation selector.

The compiler maps standard C math functions to MUFU as follows.

This mapping is important because the additional FMUL/FFMA instructions required for argument scaling and result scaling are free only from the SFU's point of view: they don’t touch the SFU, but they do consume FP32 pipe cycles:

expf(x) → __expf(x) (with -use_fast_math):

// Argument reduction: convert from base-e to base-2
// exp(x) = 2^(x * log2(e)) = 2^(x * 1.44269504...)
FMUL   R1, R0, 1.44269502f          ; x * log2(e), FP32 pipe, 0.25 cycles throughput
MUFU.EX2  R2, R1                    ; 2^(x*log2e), SFU, 4 cycles throughput

expf(x) (IEEE-compliant, without fast-math):

// Range check and reduction (compiler-generated, varies)
FMNMX  R1, R0, 88.722839f, ...      ; clamp to avoid overflow  
FFMA   R2, R1, 1.44269502f, ...     ; argument reduction with correction term
MUFU.EX2  R3, R2                    ; core computation
FMUL   R4, R3, ...                  ; reconstruction (potentially)
// Plus additional corrections for subnormals, NaN, INF

The IEEE-compliant version may issue conditional branches for edge cases. When your kernel has inputs that might be NaN, INF, or very large/small, the compiler generates defensive code.

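__expf() removes these guards entirely. It is not undefined behavior in the C++ sense, but outside roughly [−87.3, 88.7] (the approximate FP32 range of exp before overflow/underflow) the result saturates to +INF or flushes to 0 with no special-case handling, and accuracy degrades as |x| grows.

If you know your inputs are bounded, __expf() is the right choice. Softmax after max-subtraction is the canonical case: all values are ≤ 0, so overflow is impossible, and very negative inputs flushing to 0 is exactly the contribution they should make.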

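tanhf(x) (default, without fast-math):

The default tanhf does not lower to a single MUFU opcode. (Turing and later do have a hardware approximation, exposed in PTX as tanh.approx.f32, which recent toolkits can select under -use_fast_math; check your SASS for MUFU.TANH.) Otherwise the compiler implements it using the identity: tanh(x) = 1 - 2/(exp(2x)+1).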

The resulting SASS (approximately, varies by compiler version) includes:

FMUL   R1, R0, 2.0f                 ; 2x
FMUL   R2, R1, 1.44269502f          ; 2x * log2(e)
MUFU.EX2  R3, R2                    ; 2^(2x * log2e) = exp(2x)
FADD   R4, R3, 1.0f                 ; exp(2x) + 1
MUFU.RCP  R5, R4                    ; 1/(exp(2x)+1)
FMUL   R6, R5, 2.0f                 ; 2/(exp(2x)+1)
FADD   R7, -R6, 1.0f                ; 1 - 2/(exp(2x)+1) = tanh(x)

That’s two MUFU instructions (one EX2, one RCP) per tanhf call; 8 cycles of SFU pipe occupied per call in throughput terms.
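The sequence can be checked on the host, one statement per SASS instruction. In this sketch std::exp2 stands in for MUFU.EX2 and the division for MUFU.RCP; tanh_via_exp2 is an illustrative name, not a CUDA function.

```cpp
#include <cassert>
#include <cmath>

// Host-side mirror of the exp-based tanh expansion: tanh(x) = 1 - 2/(exp(2x)+1).
inline float tanh_via_exp2(float x) {
    const float kLog2e = 1.44269504f;
    float two_x  = x * 2.0f;                 // FMUL: 2x
    float scaled = two_x * kLog2e;           // FMUL: 2x * log2(e)
    float e2x    = std::exp2(scaled);        // MUFU.EX2: exp(2x)
    float denom  = e2x + 1.0f;               // FADD: exp(2x) + 1
    float rcp    = 1.0f / denom;             // MUFU.RCP: 1/(exp(2x)+1)
    float twice  = rcp * 2.0f;               // FMUL: 2/(exp(2x)+1)
    return 1.0f - twice;                     // FADD: tanh(x)
}
```

For large positive x the exponential overflows to +INF, the reciprocal becomes 0, and the result is exactly 1, which is the right limit. Near x = 0 the final subtraction cancels, so relative accuracy there is worse than absolute accuracy.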

For GELU, which uses tanhf internally (the fast approximation 0.5x(1+tanh(√(2/π)(x+0.044715x³)))), you have additional FFMAs on top.

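GELU activation in a fused kernel is expensive in SFU terms, which is one motivation for the simpler SiLU activation (x * σ(x) = x / (1 + exp(-x))). Note that SiLU does not reduce the MUFU count: it needs one MUFU.EX2 plus one MUFU.RCP for the division, the same two MUFUs as tanh. What it saves is FMA-pipe work, since there is no cubic argument to build and less scaling around the exponential.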

Throughput model for an SFU-bottlenecked loop

Suppose your kernel’s inner loop body, after compilation, contains:

1× MUFU.EX2 (4-cycle throughput, SFU pipe)
3× FFMA (0.25-cycle throughput each, FMA pipe)
2× FADD (0.25-cycle throughput each, FMA pipe)
1× LDS (shared memory load, ~1-cycle throughput assuming no bank conflict)

Total FMA pipe demand: 5 × 0.25 = 1.25 cycles
Total SFU pipe demand: 1 × 4.0 = 4.0 cycles
Total MIO demand: 1 × ~1.0 = ~1.0 cycles

The SFU is the bottleneck: the loop cannot issue faster than 4.0 cycles per iteration.

The FMA pipe is occupied 1.25/4.0 = 31% of the time. The remaining 69% of FMA pipe capacity is wasted, waiting for the SFU to finish so the next iteration can begin.
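The same bookkeeping as a small host-side helper, using the per-instruction costs listed above (0.25 cycles per FMA-pipe op, 4 per MUFU, about 1 per LDS). The struct and function names are illustrative only.

```cpp
#include <algorithm>
#include <cassert>

// Instruction mix of one loop iteration.
struct LoopBody {
    int mufu;          // MUFU.* instructions (SFU pipe)
    int fma_pipe_ops;  // FFMA + FADD + FMUL (FMA pipe)
    int lds;           // shared memory loads (MIO)
};

struct PipeDemand {
    double fma, sfu, mio;  // cycles of demand per iteration, per pipe
    double bottleneck;     // the largest of the three: the iteration's floor
};

inline PipeDemand pipe_demand(const LoopBody& b) {
    PipeDemand d;
    d.fma = b.fma_pipe_ops * 0.25;
    d.sfu = b.mufu * 4.0;
    d.mio = b.lds * 1.0;
    d.bottleneck = std::max({d.fma, d.sfu, d.mio});
    return d;
}
```

For the loop above, pipe_demand({1, 5, 1}) gives a 4.0-cycle floor set by the SFU, and d.fma / d.bottleneck is the 31% FMA utilization. Swapping the MUFU for a handful of FMA-pipe ops changes which field wins the max.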

You can fill this gap in two ways: more ILP within the loop (unroll and issue multiple independent MUFU calls, keeping both the SFU and FMA pipe busier) or replace MUFU with FMA-pipe arithmetic.

The first approach doesn’t change the SFU ceiling; it just makes better use of the FMA pipe in parallel.

The second moves the ceiling.

Polynomial exp replacement: the real implementation

The “4th order polynomial” approach described in the previous version of this article is plausible but underspecified.

Here is a properly validated implementation using a piecewise approach compatible with softmax use cases:

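// Fast exp2f approximation: FP32 pipe and integer ALU only, no MUFU, no F2I/I2F
// (float<->int conversions issue to the same XU pipe as MUFU, so they are avoided too)
// Maps to ~7 FP32-pipe instructions plus one shift and one integer add in SASS
// Valid for x ∈ [-126, 127] (FP32 normal range for 2^x); clamp more negative inputs first
// Error: ~6e-5 relative (about 2^-14) with the coefficients below, which are the
// degree-4 Taylor terms; a minimax fit of the same degree (e.g. from Sollya) is tighter
__device__ __forceinline__ float fast_exp2f_fma(float x) {
    // Decompose x = n + f where n is an integer and f ∈ [-0.5, 0.5].
    // Adding 1.5 * 2^23 pushes the fraction out of the mantissa, so the FP32
    // adder itself rounds x to the nearest integer.
    const float kMagic = 12582912.0f;  // 1.5 * 2^23
    float r = x + kMagic;              // low mantissa bits of r now hold n
    float n = r - kMagic;              // n = round(x), still a float
    float f = x - n;                   // fractional part

    // Degree-4 polynomial for 2^f over [-0.5, 0.5], Horner form
    // 2^f ≈ 1 + f*(0.693147 + f*(0.240227 + f*(0.055504 + f*0.009618)))
    float p = 0.009618f;
    p = fmaf(p, f, 0.055504f);
    p = fmaf(p, f, 0.240227f);
    p = fmaf(p, f, 0.693147f);
    p = fmaf(p, f, 1.0f);              // 2^f approximation, between ~0.707 and ~1.414

    // Reconstruct 2^x = 2^n * 2^f by adding n to p's exponent field.
    // Shifting r's bits left by 23 discards the magic constant and leaves n << 23.
    unsigned int n_shifted = __float_as_uint(r) << 23;
    return __uint_as_float(__float_as_uint(p) + n_shifted);
}

// For expf(x): expf(x) = exp2f(x * log2(e))
__device__ __forceinline__ float fast_expf_fma(float x) {
    return fast_exp2f_fma(x * 1.4426950408889634f);
}

SASS output for fast_expf_fma: approximately 4 FFMAs, 3 FADDs, 1 FMUL, 1 integer SHL, 1 integer add. No MUFU, and no F2I/I2F either.

Throughput: ~2–2.5 cycles per call on the FMA pipe. Versus MUFU.EX2 at 4 cycles: a 1.6–2× throughput improvement for softmax inner loops on Ampere, paid for with accuracy, so check that roughly 2^-14 relative error is acceptable for your use.

The catch: verify SASS output yourself. The rounding trick only works if the compiler keeps (x + kMagic) - kMagic as two separate adds, and a version written with __float2int_rn instead would put F2I and I2F conversions back on the XU pipe next to MUFU.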

Confirm with nvdisasm that no MUFU instructions appear in the compiled output.
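Accuracy is the other thing to verify, and that can be done on the host before touching a GPU. The sketch below is a plain C++ port of one variant of this decomposition: it rounds with a magic-number add rather than a float-to-int conversion, and uses the same degree-4 coefficients. The names exp2_poly_host and max_rel_error are illustrative, and the sweep range is an assumption chosen to match typical softmax inputs.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

inline uint32_t float_bits(float v) { uint32_t u; std::memcpy(&u, &v, sizeof u); return u; }
inline float bits_float(uint32_t u) { float v; std::memcpy(&v, &u, sizeof v); return v; }

// Host port of the polynomial 2^x: x = n + f, 2^x = 2^n * poly(f).
inline float exp2_poly_host(float x) {
    assert(x >= -126.0f && x <= 127.0f);     // result must stay a normal float
    const float kMagic = 12582912.0f;        // 1.5 * 2^23: the add rounds x to an integer
    float r = x + kMagic;                    // low mantissa bits of r hold n
    float n = r - kMagic;                    // n = round(x)
    float f = x - n;                         // f in [-0.5, 0.5]
    float p = 0.009618f;                     // Horner evaluation, degree 4
    p = std::fma(p, f, 0.055504f);
    p = std::fma(p, f, 0.240227f);
    p = std::fma(p, f, 0.693147f);
    p = std::fma(p, f, 1.0f);
    // Add n to p's exponent field; the shift drops the magic constant's bits.
    return bits_float(float_bits(p) + (float_bits(r) << 23));
}

// Worst relative error against double-precision exp2 over an even sweep.
inline double max_rel_error(float lo, float hi, int steps) {
    double worst = 0.0;
    for (int i = 0; i <= steps; ++i) {
        float x = lo + (hi - lo) * static_cast<float>(i) / static_cast<float>(steps);
        double ref = std::exp2(static_cast<double>(x));
        double err = std::fabs(exp2_poly_host(x) - ref) / ref;
        if (err > worst) worst = err;
    }
    return worst;
}
```

Integer inputs come out exact (the polynomial evaluates to 1 and only the exponent changes). Over [-16, 16] the worst case sits near f = ±0.5 and is on the order of 6e-5 relative, which is the number to weigh against the throughput gain.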

Measuring SFU utilization precisely

The two relevant Nsight Compute metrics:

smsp__pipe_fma_cycles_active.avg.pct_of_peak_sustained_active
smsp__pipe_xu_cycles_active.avg.pct_of_peak_sustained_active

On an SFU-bottlenecked kernel, xu (the XU pipe, which Nsight Compute describes as the transcendental and data type conversion unit, i.e. the SFU pipe plus int/float conversions) will be near 100% and fma will be proportionally lower.

The ratio xu_cycles / fma_cycles tells you the SFU/FMA throughput imbalance directly.

Also useful: smsp__average_warp_latency_per_inst_executed.ratio; if this is high while xu_cycles_active is also high, the warp latency is being driven by MUFU’s 16-cycle result latency, not just its 4-cycle throughput.

Both cost you, via different mechanisms.


The L0 Instruction cache

Each SMSP on Ampere has a dedicated 32 KB L0 instruction cache (also referred to as the I-cache in some microarchitecture literature).

This is physically separate from the unified L1 data/shared memory: it is not carved from the 192 KB L1 pool.

The L0 is private to each SMSP; four SMSPs per SM means four independent L0 caches per SM.

Instructions on Ampere are 128 bits (16 bytes) wide. The L0 holds 32 KB / 16 B = 2048 instructions.

A typical kernel loop body of 100–300 instructions fits comfortably; the L0 warms up on the first iteration and subsequent fetches are essentially free (one cycle or less).

The exception: kernels generated from heavily templated C++ code (think Thrust or hand-unrolled matrix multiplication with large tile sizes) can have loop bodies exceeding 500–1000 instructions.

A kernel that fully unrolls a 256-wide loop body with 8 FMAs per iteration emits 2048 instructions for that loop, exactly filling the L0 and leaving nothing for the rest of the kernel.

Add one more instruction and you start thrashing.
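The capacity arithmetic is simple enough to keep next to your unroll factors. A minimal sketch using the 32 KB and 16-byte figures quoted above; it treats the L0 as fully usable and ignores associativity and line granularity, so it is an optimistic bound. The names are illustrative.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kL0Bytes    = 32 * 1024;   // L0 size per SMSP, as quoted in the text
constexpr std::size_t kInstrBytes = 16;          // 128-bit instructions
constexpr std::size_t kL0Instrs   = kL0Bytes / kInstrBytes;

// Instructions emitted by a fully unrolled loop.
constexpr std::size_t unrolled_instr_count(std::size_t trip_count, std::size_t instrs_per_iter) {
    return trip_count * instrs_per_iter;
}

// Does the hot loop plus whatever else must stay resident fit in the L0?
constexpr bool fits_in_l0(std::size_t loop_instrs, std::size_t other_instrs) {
    return loop_instrs + other_instrs <= kL0Instrs;
}
```

unrolled_instr_count(256, 8) lands exactly on kL0Instrs, and fits_in_l0 flips to false as soon as a single extra instruction has to share the cache.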

When the L0 misses, the SMSP must fetch from L1 instruction cache (shared with data traffic, with associated latency) or, worse, from L2.

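L1 instruction fetch latency is approximately 20–30 cycles. Instruction-fetch stalls are captured by the “No Instructions” stall reason:

smsp__pcsamp_warps_issue_stalled_no_instructions.sum

(The similarly named imc_miss counter tracks immediate constant cache misses, not instruction cache misses.) As a share of all stall samples, values above 2% indicate a structural code-size problem. Values above 10% are severe.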

There is no runtime mechanism to manage L0 occupancy. The only intervention is compile-time code size reduction:

Replace #pragma unroll N with smaller N or #pragma unroll 1 for large N
Mark non-critical helper functions with __noinline__
Split large kernels into kernel launch sequences (costs launch overhead; evaluate the trade-off)
Use --maxrregcount to limit register count, which sometimes causes the compiler to generate shorter instruction sequences

Instruction decode bandwidth

Decoded instructions are held in per-warp instruction buffers before issue.

On Ampere, these buffers are approximately 2 entries deep per warp. This is not officially documented; it is reverse-engineered from microbenchmarks.

Specifically, it comes from observing that back-to-back dependent instructions with 1-cycle-latency arithmetic operations still issue without stall, implying at least 2-deep pre-decoding.

The decoder can process approximately 1 instruction per cycle per SMSP (across all warps).

This exceeds the issue rate for any single warp (maximum 1 instruction every 4 cycles for a compute-bound warp at peak), so the decoder runs ahead and the per-warp instruction buffer is almost always populated.

The pathological case: a kernel at maximum occupancy (16 warps per SMSP, i.e. 64 per SM, the A100 maximum) with a simple loop body of a handful of instructions (say, a vectorized element-wise operation: LDG.128, four FFMAs, STG.128).

All 16 warps can be eligible in the same cycle. The decoder must keep all 16 instruction buffers populated.

At 1 decode per cycle and 16 warps each needing fresh instructions, the decoder is stretched.

If the instruction stream is not in L0 (forcing L1 fetch at 20+ cycle latency), the buffers drain and the schedulers stall even though 16 eligible warps exist.

This is rare but real. It manifests as a high instruction-fetch stall count (the “No Instructions” stall reason, smsp__pcsamp_warps_issue_stalled_no_instructions) combined with near-100% occupancy, which is confusing until you understand that 16 resident warps generate 16× the instruction fetch pressure of 1 warp.

Predicated execution

Each thread on Ampere has 7 predicate registers (P0 through P6).

These are separate from the 255 available scalar registers (R0–R254, with R255 reserved as the zero register).

Predicate registers are 1-bit values set by comparison instructions:

// Source: if (a > b) { ... }
FSETP.GT.AND P0, PT, R0, R1, PT   ; set P0 = (R0 > R1), unconditional (PT = true predicate)

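FSETP.GT.AND P0, PT, R0, R1, PT reads as: “set predicate P0 to (R0 > R1) AND PT.” The second destination operand would receive the complementary result; writing PT there discards it. The final PT is the always-true predicate fed into the AND, so the combine step leaves the comparison unchanged.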

The AND/OR suffix specifies the combining mode for nested predicate logic.

This instruction issues on the FP32 pipe, costs 4-cycle latency, and produces a predicate bit, not a register value.

An instruction with a predicate prefix:

@P0    FFMA  R3, R1, R2, R3    ; execute FFMA only if P0 is true
@!P0   FFMA  R5, R1, R2, R5   ; execute FFMA only if P0 is false

The semantics at the hardware level: all lanes in the warp issue the instruction. The instruction traverses the pipeline.

When the result write-back occurs, it is gated by the predicate: lanes where the predicate is true write their result; lanes where it is false suppress the write-back.

No branch. No warp divergence. No reconvergence stack manipulation.

Consequence: predicated instructions consume throughput proportional to the total number of instructions, not proportional to the number of active lanes.

A 32-thread warp where 16 threads have P0=true and 16 have P0=false, executing @P0 FFMA R3, R1, R2, R3, consumes exactly the same FFMA pipe resources as all 32 threads having P0=true.

The 16 non-writing threads waste their execution slots.

This is the precise definition of “predicated execution trades throughput for divergence avoidance.”
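A host-side emulation makes the accounting concrete. This is a sketch of the semantics described above, not of the hardware datapath; WarpRegs and predicated_ffma are illustrative names, and the 32-bit mask holds one P0 bit per lane.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kWarpSize = 32;

// Per-lane copies of R1, R2, R3 for one warp.
struct WarpRegs {
    std::array<float, kWarpSize> r1, r2, r3;
};

// Emulates `@P0 FFMA R3, R1, R2, R3` for a whole warp.
// Returns the number of lane slots the instruction occupied in the FMA pipe.
inline int predicated_ffma(WarpRegs& w, uint32_t p0_mask) {
    int lane_slots = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        // Every lane traverses the pipe, whatever its predicate says.
        float result = std::fma(w.r1[lane], w.r2[lane], w.r3[lane]);
        ++lane_slots;
        // Only the write-back is gated by the predicate.
        if ((p0_mask >> lane) & 1u) w.r3[lane] = result;
    }
    return lane_slots;
}
```

With a half-set mask such as 0x0000FFFF the function still reports 32 lane slots: 16 results are written and 16 are computed and thrown away.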

The compiler’s branch/predicate decision heuristic

The CUDA toolchain (nvcc: the LLVM-based NVVM compiler emits PTX, and ptxas lowers it to SASS) uses a cost model to decide between a BRA (branch) and predication. The model is approximately:

Predication is chosen when:

The combined instruction count of both branch arms is ≤ ~8–12 instructions total
OR the divergence probability is estimated to be high (many warps will have mixed predicate values)
OR the branch target is unlikely to be resident in the instruction cache (the GPU does not predict branches, so a taken branch to cold code stalls on the fetch)

Branch is chosen when:

One arm is long (> ~6 instructions) and the other is short
The compiler can estimate that the majority of warps will take one branch uniformly
The branch condition is amenable to warp-uniform evaluation (all threads agree)

The threshold is not a hard constant: it depends on the compiler version, optimization level, and the surrounding code structure.

The reliable way to check what the compiler chose is to inspect SASS:

# Disassemble a compiled kernel to SASS
cuobjdump -sass mykernel.cubin | grep -E "BRA|@!?P[0-6]"

Or, within Nsight Compute:
Source tab → enable “Source Counters” → switch to “SASS” view → look for @P0 prefixes versus BRA instructions in the hot loop.

When to override the compiler’s choice

The compiler is generally right. The cases where it is wrong:

Case 1: Long rare branch incorrectly predicated. If your hot loop has a condition triggered 1% of the time (e.g., an overflow check, a boundary condition), and the expensive handler is 15 instructions, the compiler might still predicate if the loop body is otherwise short and the combined instruction count falls under the threshold.

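But predication issues all 15 handler instructions for the whole warp on every iteration, whether or not any lane needs them. With a branch, the warp runs the handler only when at least one of its 32 lanes triggers it: at an independent 1% per-thread rate that is 1 − 0.99^32 ≈ 27% of iterations, about 4 instructions per iteration on average, and if the condition is warp-uniform it is 1% of iterations, about 0.15 instructions. In the common case the branch costs only the BRA itself.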

Fix: restructure the code to make the “expensive” path obviously large and separated: e.g., a function call rather than inlined code, which the compiler treats as a definite branch site.
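To put numbers on Case 1, here is a small expected-cost model. It counts only handler instructions issued per loop iteration and deliberately ignores the BRA itself and any reconvergence overhead; the function names and the independence assumption for per-thread triggers are mine, not the compiler's.

```cpp
#include <cassert>
#include <cmath>

// Predicated: the handler's instructions issue on every iteration.
inline double predicated_cost(int handler_instrs) {
    return static_cast<double>(handler_instrs);
}

// Branched: the warp executes the handler only when at least one lane takes it.
//  - warp_uniform: all 32 lanes agree, so the warp takes it with probability p.
//  - otherwise: lanes trigger independently, so P(any lane) = 1 - (1 - p)^32.
inline double branched_cost(int handler_instrs, double p, bool warp_uniform) {
    double p_warp = warp_uniform ? p : 1.0 - std::pow(1.0 - p, 32);
    return handler_instrs * p_warp;
}
```

For a 15-instruction handler at a 1% trigger rate, predication costs 15 issue slots per iteration, a branch with independent per-thread triggers costs about 4.1, and a warp-uniform branch costs 0.15.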

Case 2: Warp-uniform condition incorrectly predicated. If every thread in a warp evaluates the same condition (e.g., based on blockIdx.x or a value loaded from constant memory that all threads share), the warp takes the branch uniformly and pays zero divergence cost.

The compiler sometimes generates a branch here and sometimes predicates. When the branch is warp-uniform and the body is long, you want a branch (all threads skip the long body together); predication would execute the long body for all threads on every iteration.

You can encourage warp-uniform branch treatment by computing the predicate with __all_sync(__activemask(), condition) when you know it’s warp-uniform — this makes the intent explicit.

Conclusion

Predication is not a free lunch, and it is not free branching.

It is a specific trade: you pay throughput for all threads to avoid the divergence tax of splitting and reconverging a warp.

That trade is profitable when the branch body is short and divergence is likely.

It is catastrophically unprofitable when the branch body is long and most threads would have skipped it entirely.

The compiler’s heuristic gets this right most of the time, because most conditionals in well-written kernels are short.

The cases where it fails (a rare overflow handler that gets predicated, a warp-uniform flag whose long body gets predicated) are invisible at the source level and only show up as unexplained throughput loss in the profiler.

The smsp__warp_issue_stalled_not_selected stall counter rising without a corresponding increase in occupancy is one signal; anomalously low FMA pipe utilization relative to the instruction count is another.

The discipline is the same as everywhere else in this series: don’t assume the compiler made the optimal choice.

Inspect the SASS, verify the @P prefixes are where you expect them and absent where you don’t, and use __all_sync to make warp-uniform conditions structurally explicit rather than relying on the compiler to infer them.

A predicate register costs nothing. A 15-instruction predicated block running at full throughput for 99% of warps that didn’t need it costs exactly as much as running it unconditionally, which is, in fact, what you did.