The Split and the Seam

Eighteen months ago, splitting prefill from decode was a contrarian research bet. Today it is the default that every serious stack runs, it helped erase 600 billion dollars from the company that build

Lorenzo Bradanini and Lorenzo Tettamanti

Jun 21, 2026

Intro

On Monday, January 27, 2025, NVIDIA lost about 600 billion dollars of market value in a single trading session, a 17 percent fall that stands as the largest one-day market-capitalization loss in the history of US public markets. It did not fall alone.

Broadcom dropped 17.4 percent, Marvell 19.1 percent, AMD 6.4 percent, and the chip-adjacent names fell in sympathy across the board. The trigger was not an earnings miss or a product recall.

It was a technical report from a Chinese lab, DeepSeek, whose V3 and R1 models matched the frontier while having been trained and, crucially, served at a fraction of the assumed cost.

The market did the obvious arithmetic: if inference is far cheaper than we believed, fewer GPUs are needed, so the company selling the GPUs is worth less.

The arithmetic was wrong, and the reason it was wrong is the subject of this issue. Within days, Microsoft’s Satya Nadella was citing the Jevons paradox, the nineteenth-century observation that making a resource cheaper to use tends to increase, not decrease, its total consumption. The paradox held beautifully.

Over the course of 2025, by the Peterson Institute’s accounting, the cost to reach a fixed score on a hard reasoning benchmark fell from about 4,500 dollars per task to 11.64 dollars, a roughly 386-fold collapse, and inference usage did not shrink to match. It exploded past the efficiency gains, exactly as Jevons would predict.

The chip that was supposed to be made redundant by cheap inference is now sold out for years on the back of it [Figure 1].

Figure 1. The shake and its resolution. Left, the single-day sector rout that followed DeepSeek’s efficiency disclosure, the largest one-day loss in market history. Right, the cost to reach a fixed benchmark score over 2025, a collapse the industry has named LLMflation. Efficiency did not destroy demand. It multiplied it.

Here is the part the market missed in its first reaction. DeepSeek’s efficiency was not a single trick.

It was a stack of techniques, low-precision FP8 arithmetic, a sparse mixture-of-experts model, a compressed attention scheme called multi-head latent attention, and, underneath all of it, an inference architecture that ran the two phases of language-model serving, prefill and decode, on entirely separate pools of machines.

That last technique, disaggregation, is the one that matters most for understanding what happened next, because in the eighteen months bracketing that selloff it went from a contrarian research idea that the open-source community pushed back on to the default architecture of virtually every production serving system in existence.

NVIDIA Dynamo, llm-d, Ray Serve, SGLang, vLLM, LMCache, and Mooncake all run on it now. The very metrics the industry uses to talk about inference latency, time-to-first-token and time-per-output-token, were popularized through its lens.

The authors of DistServe, the paper that named the architecture, marked the moment in a November 2025 retrospective with a wry observation: if Moore’s law doubles compute every eighteen months, then the serving-systems equivalent had just doubled too, not because the chips got faster, but because the systems serving them did.

This issue is a teardown of that architecture as it actually exists in mid-2026, not as a tidy origin story. We start with the physics, because the physics is clean and explains everything downstream.

Then why colocation lost, what the split mechanically does, and the cost it creates at the seam where the two halves rejoin, a cost that the rack-scale hardware of the current generation has largely, but not entirely, dissolved.

Then we go down into the kernel layer, to the expert-parallel all-to-all communication and the custom kernels that make large mixture-of-experts models servable at all, which is the part most coverage skips and the part that actually decides throughput.

Then the attention rewrites that are shrinking the problem from underneath, the cross-vendor benchmark numbers and the honest places they fall apart, the operational tax nobody puts on the slide, the new silicon the split has spawned, and finally what the whole shake settled into.

The thesis, stated once and defended throughout: disaggregation is no longer a choice an operator debates. It is the substrate. The live questions have moved up a layer, to how you balance the pools, how wide you spread the experts, and whether the wire between your machines is fast enough that the seam is free.

Two workloads on opposite ends of the roofline

The roofline model is the oldest honest tool in performance engineering, and once prefill and decode are plotted on it the rest of this issue is commentary.

Every kernel has an arithmetic intensity, the floating-point operations it performs per byte it moves from memory. Every chip has two ceilings: a compute ceiling set by its peak arithmetic rate, and a memory ceiling set by its bandwidth.

A kernel is compute-bound when its arithmetic intensity is high enough that the chip exhausts its FLOPS before its bandwidth, and memory-bound otherwise.

The crossover, the ridge point, is simply peak FLOP rate divided by bandwidth. For an H100 SXM at FP8, roughly 1,979 dense teraFLOPS over 3.35 terabytes per second of HBM3, the ridge sits near 591 FLOP per byte.

For the H200, identical compute over 4.8 terabytes per second of HBM3e, it falls to 412. For a B200, about 4,500 dense FP8 teraFLOPS over 8 terabytes per second, near 562 [Figure 2].

Figure 2. A roofline for three inference GPUs with the operating region of each phase marked. Prefill runs against the flat compute roof. Single-stream decode is pinned to the sloped bandwidth region. The same silicon is a different machine depending on which phase is running.
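The ridge arithmetic is easy to check by hand. A minimal sketch using only the peak-rate and bandwidth figures quoted above; the function names are ours and purely illustrative:

```python
# Ridge point = peak FLOP rate / memory bandwidth, in FLOP per byte.
# (peak FP8 dense FLOP/s, HBM bytes/s), as quoted in the text.
GPUS = {
    "H100 SXM": (1979e12, 3.35e12),
    "H200": (1979e12, 4.8e12),
    "B200": (4500e12, 8.0e12),
}

def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity at which the two ceilings meet."""
    return peak_flops / bandwidth

def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, intensity * bandwidth)

def regime(intensity, peak_flops, bandwidth):
    """Which ceiling a kernel of the given intensity hits first."""
    if intensity >= ridge_point(peak_flops, bandwidth):
        return "compute-bound"
    return "memory-bound"

for name, (peak, bw) in GPUS.items():
    print(f"{name}: ridge ~ {ridge_point(peak, bw):.1f} FLOP/byte, "
          f"intensity 2 is {regime(2, peak, bw)}, "
          f"intensity 2000 is {regime(2000, peak, bw)}")
```

An intensity of 2 stands in for single-stream decode and 2,000 for a long prefill; both are round illustrative values, not measurements.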

Prefill ingests the whole prompt at once. Every token attends to every prior token, the feed-forward layers process the full sequence in parallel, and the matrix multiplications are large and dense.

A prompt of a few thousand tokens pushes the arithmetic intensity into the hundreds or thousands of FLOP per byte, planting prefill firmly to the right of the ridge against the compute roof. Prefill is a FLOPS problem. It wants tensor cores and low precision and scales with raw matrix-multiply throughput.

Decode is the opposite animal. To generate one token it reads the entire weight set and the entire key-value cache for the sequence, performs a thin slice of computation, and emits a single token. For a single stream the arithmetic intensity sits near the floor, on the order of one to two FLOP per byte, pinning decode to the far left of the roofline on the bandwidth-limited slope.

Decode is a memory-bandwidth problem. It cannot keep the tensor cores fed; it cares only how fast the chip streams weights and cache out of HBM. The hard floor follows immediately: the fastest a single decode stream can run is bandwidth divided by bytes read per token, dominated by the weights [Figure 3].

A 70-billion-parameter model at FP16 is 140 gigabytes, so an H100 at 3.35 terabytes per second generates at most about 24 tokens per second on a single stream, an H200 about 34, a B200 about 57. Halve the precision to FP8 and every ceiling doubles.

This is the same calculation that bounds DeepSeek’s observed 20 to 22 tokens per second in production. Compute does not enter.

Figure 3. The single-stream decode ceiling is HBM bandwidth divided by model bytes. No quantity of compute changes it. Halving precision, which halves the weight bytes, is the only lever that moves the ceiling for a fixed model.
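The ceiling is one division. A small sketch that reproduces the numbers above, assuming the weights dominate the bytes read per token and ignoring the KV cache entirely:

```python
def decode_ceiling_tokens_per_s(params, bytes_per_param, hbm_bytes_per_s):
    """Upper bound for one decode stream: every token re-reads all weights."""
    weight_bytes = params * bytes_per_param
    return hbm_bytes_per_s / weight_bytes

# HBM bandwidths in bytes per second, as quoted in the text.
BANDWIDTH = {"H100": 3.35e12, "H200": 4.8e12, "B200": 8.0e12}

for name, bw in BANDWIDTH.items():
    fp16 = decode_ceiling_tokens_per_s(70e9, 2, bw)  # 70B parameters at 2 bytes
    fp8 = decode_ceiling_tokens_per_s(70e9, 1, bw)   # same model at 1 byte
    print(f"{name}: FP16 ceiling {fp16:.1f} tok/s, FP8 ceiling {fp8:.1f} tok/s")
```

Peak FLOPS appears nowhere in the function, which is the point.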

Batching is the escape, and its limit is the reason disaggregation exists. Decode many sequences at once and the weight read is shared across the batch: you stream the weights once and amortize them.

The arithmetic intensity of decode is therefore approximately twice the batch size divided by the bytes per weight, which at FP16 is approximately the batch size itself [Figure 4]. To cross the H100 FP8 ridge of 591 you need a batch of roughly 600 with FP16 weights, or roughly 300 with FP8 weights.

That is the entire game in decode: pack as many concurrent sequences into a step as memory allows, because every added sequence moves you rightward toward the compute roof and lifts tokens-per-second-per-GPU.

Hold the two facts together: prefill wants to run immediately, in small groups, against the compute roof, to keep first-token latency low; decode wants to run in enormous batches against the bandwidth ceiling, to keep cost-per-token low.

One phase is latency-shaped and compute-hungry, the other throughput-shaped and bandwidth-hungry, and for two years the industry asked one GPU under one scheduler to do both at once.

Figure 4. Decode arithmetic intensity is essentially the batch size, because weights are read once per step and shared. Reaching the compute roof requires hundreds of concurrent sequences, which is why decode pools are built around the largest batches memory will hold.
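The batching rule of thumb can be written down directly. A sketch under the same simplification as the text: weight reads dominate, each weight contributes one multiply-add (two FLOP) per sequence in the batch, and KV-cache traffic is ignored. The function names are illustrative:

```python
import math

def decode_intensity(batch_size, bytes_per_weight):
    """Approximate FLOP per byte for a batched decode step."""
    return 2 * batch_size / bytes_per_weight

def batch_to_reach(ridge, bytes_per_weight):
    """Smallest batch whose intensity meets the ridge point."""
    return math.ceil(ridge * bytes_per_weight / 2)

H100_FP8_RIDGE = 591  # FLOP per byte, from the roofline section

print(decode_intensity(1, 2))             # single stream, FP16 weights
print(decode_intensity(256, 2))           # batch of 256, FP16 weights
print(batch_to_reach(H100_FP8_RIDGE, 2))  # FP16 weights
print(batch_to_reach(H100_FP8_RIDGE, 1))  # FP8 weights
```

In practice the KV cache adds bytes that grow with batch and context, so the real intensity sits below this estimate and the batch needed is larger.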

Why colocation lost

Run both phases on one GPU under one continuous-batching scheduler, the architecture Orca introduced and vLLM popularized, and they fight. The fight has a precise mechanism.

Continuous batching keeps a rolling batch of decode steps running and folds in new requests as they arrive, but a new request cannot decode until its prompt is prefilled, and prefill is a heavy compute-bound operation that occupies the GPU far longer than a single decode step.

When a prefill lands in the batch, the system must either pause the in-flight decodes to prioritize it or batch the prefill alongside them, and both choices stall token generation for every active sequence.

The DistServe authors quantify the damage bluntly in their retrospective: even with chunked-prefill mitigation, a single large prefill can inflate time-per-output-token by a factor of two to thirty under bursty workloads.

A long prompt arriving at the wrong moment makes every other user’s stream stutter for a third of a second or more [Figure 5].

Figure 5. The structural waste of colocation. A prefill-heavy step pins the tensor cores while bandwidth idles; a decode-heavy step pins bandwidth while the tensor cores idle. A shared GPU pays for both resources and presses hard on one at a time. Exact values are workload-dependent; the asymmetry is not.

The deeper cost is coupling. As the DistServe paper put it, colocation forces the resource allocator to provision for the worst case of both latency targets simultaneously, the tight first-token target and the tight per-token target, because the same GPUs serve both.

You cannot tune one phase without detuning the other, and you cannot scale one without scaling the other. The roofline says why the waste is structural and not a scheduling artifact: on a prefill-heavy step the tensor cores run near saturation while the memory bus idles, and on a decode-heavy step the reverse, so a colocated GPU pays rent on two expensive resources and uses roughly one of them at any instant.

The serving community fought this with chunked prefill, introduced in the Sarathi work by Amey Agrawal and co-authors, which breaks a long prefill into bounded chunks interleaved with ongoing decode so the peak disruption per step is capped and per-token latency smooths out.

Chunked prefill is the strongest argument against disaggregation and an honest teardown has to credit it: by mixing a compute-bound prefill chunk with bandwidth-bound decode work in one step it even improves utilization, running the two against different ceilings. But it does not dissolve the coupling.

The phases still share one parallelism strategy, one memory pool, one tensor-parallel degree, all of them a compromise. And it trades latencies against each other: the finer you chunk to protect per-token latency, the more you stretch first-token latency, because a long prompt now dribbles through the GPU in pieces.

You are back in the original bind. On shared hardware you can favor first-token latency or per-token latency, but you cannot independently optimize both.
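The bind shows up even in a toy cost model. The sketch below assumes a colocated step costs a fixed decode time plus a per-token prefill time for whatever chunk rides along; both constants are hypothetical placeholders, not measurements from any system:

```python
import math

def chunked_prefill(prompt_tokens, chunk_tokens,
                    prefill_ms_per_token=0.1,  # hypothetical cost
                    decode_step_ms=20.0):      # hypothetical cost
    """Return (worst step time seen by co-batched decodes, first-token latency)."""
    n_chunks = math.ceil(prompt_tokens / chunk_tokens)
    largest_chunk = min(chunk_tokens, prompt_tokens)
    # Every step that carries a chunk delays each in-flight decode by this much.
    worst_step_ms = decode_step_ms + largest_chunk * prefill_ms_per_token
    # The prompt pays its full prefill cost plus one decode step per chunk.
    ttft_ms = prompt_tokens * prefill_ms_per_token + n_chunks * decode_step_ms
    return worst_step_ms, ttft_ms

for chunk in (4096, 512, 128):
    step, ttft = chunked_prefill(4096, chunk)
    print(f"chunk={chunk:5d}  worst step={step:7.1f} ms  TTFT={ttft:7.1f} ms")
```

Shrinking the chunk lowers the worst per-token stall and raises first-token latency, whatever constants you plug in.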

What changed in 2025 was not the physics but the stakes. DistServe, the authors recount, met real pushback in 2024 because disaggregation demands a heavy refactor of existing serving systems, and saw little adoption that year.

Then businesses began deploying language models at competitive scale, and throughput stopped being the only metric that mattered. Latency became existential, because a chatbot that stutters loses users, and an agent that stalls breaks workflows.

At the same time models grew and traffic surged, forcing systems past hundreds and into thousands of GPUs, the regime where a disaggregated architecture genuinely shines because it can allocate resources to each phase independently and pair each with its own parallelism strategy. The technique that was ahead of its time in 2024 was exactly on time in 2025.

The anatomy of the split

Disaggregation assigns the phases to physically separate GPU pools. A request hits a prefill instance, which builds the key-value cache for the prompt and produces the first token; the cache is handed across to a decode instance, which loads it, folds the request into its large rolling batch, and generates the rest.

Prefill machines only prefill. Decode machines only decode. Three things become possible that colocation forbids, and they are the whole value proposition.

Interference disappears, because no prefill ever lands in the decode batch, so decode steps run uninterrupted and per-token latency stops spiking. Independent optimization becomes possible, because each pool can be tuned to its own roofline and scaled on its own axis, adding prefill capacity when prompts lengthen and decode capacity when outputs lengthen.

And phase-specific hardware becomes possible, the idea Splitwise pushed hardest, that since decode is bandwidth-bound and prefill compute-bound, the two should not run on identical chips at all, a thread that has since grown into an entire hardware category we will come to.

The crucial metric that makes all of this legible is goodput, and it is the metric most operators still fail to measure. Throughput is requests or tokens completed per second, full stop.

Goodput is requests completed per second that meet their service-level objectives, both the first-token target and the per-token target. The distinction is the whole point, because a colocated system under load can post rising throughput while its goodput collapses, as interference pushes more and more requests past their latency targets even as tokens keep flowing [Figure 6].

Hao Zhang of UC San Diego, a DistServe author, frames it starkly in his lectures: a system can show ten requests per second of throughput while delivering three requests per second of goodput once the SLO is applied. The other seven finished late, which for an interactive product means they did not finish.

Disaggregation’s claim is a goodput claim. It does not necessarily move more tokens in the abstract; it moves more tokens that arrive on time, and the DistServe paper measured that as 7.4 times more requests within SLO, or a 12.6 times tighter achievable latency target, against the colocated state of the art.

Figure 6. Throughput and goodput diverge under load. Raw tokens per second keep climbing while the count of requests that meet their latency targets falls away as interference mounts. Disaggregation’s win is the gap. Shape follows the DistServe goodput results; axes are illustrative.
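The distinction fits in a few lines. A minimal sketch; the request records, SLO thresholds, and field names are made up for illustration and mirror the ten-versus-three example above:

```python
from dataclasses import dataclass

@dataclass
class Request:
    ttft_ms: float  # time to first token
    tpot_ms: float  # mean time per output token

def throughput(requests, window_s):
    """Requests completed per second, regardless of latency."""
    return len(requests) / window_s

def goodput(requests, window_s, ttft_slo_ms, tpot_slo_ms):
    """Requests per second that met BOTH latency targets."""
    on_time = [r for r in requests
               if r.ttft_ms <= ttft_slo_ms and r.tpot_ms <= tpot_slo_ms]
    return len(on_time) / window_s

# Ten requests finish in one second; only three meet both targets.
finished = (
    [Request(ttft_ms=150, tpot_ms=30)] * 3     # on time
    + [Request(ttft_ms=900, tpot_ms=30)] * 4   # first token late
    + [Request(ttft_ms=150, tpot_ms=140)] * 3  # stream stutters
)

print(throughput(finished, window_s=1.0))
print(goodput(finished, window_s=1.0, ttft_slo_ms=400, tpot_slo_ms=50))
```

A request that misses either target counts for nothing, which is why the two numbers can move in opposite directions under load.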

The published uplifts that first announced the technique are worth stating with the asterisk each deserves [Figure 7].

DistServe reported 7.4 times more requests served within SLO against the colocated state of the art;
Splitwise, the parallel Microsoft Azure effort, 2.35 times the throughput at equal power and cost, or 1.4 times at 20 percent lower cost;
Mooncake, the platform behind Moonshot AI’s Kimi assistant, up to a 525 percent throughput gain in simulated overload and 75 percent more requests under real production traffic.

The metrics differ, goodput in one, throughput at fixed power in another, raw request volume in a third, so they are not directly comparable, and each is a ceiling for a favorable workload rather than a constant anyone reproduces.

They are the numbers that turned a research idea into a procurement decision.

Figure 7. The headline multiples are real and each is a best case. The metrics are not commensurable across bars, and every figure is tied to a specific model, workload, and SLO. Read them as ceilings for favorable regimes, not a universal constant.

Modern orchestration adds one more lever on top of the split: cache-aware routing. NVIDIA Dynamo, the orchestration layer that sits above the inference engines, routes each request to the prefill worker whose resident cache best overlaps the incoming prompt, which NVIDIA reports can roughly halve first-token latency by avoiding redundant prefill of shared prefixes.

The router treats prefill and decode workers as first-class services, and a central planner continuously profiles the GPUs to autoscale and rebalance. This is the productized form of an idea DistServe seeded with a much simpler pull-based scheduler, which it used to keep decode workers from being flooded by spiky prefill bursts.

The control plane has become as much of the system as the data plane.
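The routing idea itself is simple, even if production routers are not. The sketch below is a deliberately naive illustration, not Dynamo's API: worker names and data structures are hypothetical, caches are plain token tuples, and load is ignored, where a real router would index cache blocks and weigh overlap against queue depth.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt, workers):
    """Pick the worker whose cached sequences best overlap the prompt.

    workers maps a worker name to the token sequences it has cached.
    Returns (worker_name, overlapping_tokens).
    """
    scored = {
        name: max((shared_prefix_len(prompt, c) for c in cached), default=0)
        for name, cached in workers.items()
    }
    best = max(scored, key=scored.get)
    return best, scored[best]

workers = {
    "prefill-0": [(1, 2, 3, 4, 5)],  # holds a long shared system prompt
    "prefill-1": [(1, 2, 9)],
    "prefill-2": [],                 # cold cache
}
name, overlap = route((1, 2, 3, 4, 7, 8), workers)
print(name, overlap)
```

Every overlapping token is prefill work the chosen worker can skip, which is where the first-token saving comes from.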

The seam, and the rack that dissolved it

The instant you split the phases you create a seam, and across it you must carry the key-value cache from the prefill instance that built it to the decode instance that consumes it, for every request. The cache is not a control message. It is the full attention state of the prompt, and it can be gigabytes.

Its size is set by the attention architecture, and the spread is enormous [Figure 8]. Multi-head attention stores a key and value vector per head, per layer, per token; a Llama-2-class 70B model with 64 heads, head dimension 128, and 80 layers holds about 2.6 megabytes per token at FP16 (2 × 64 × 128 × 80 × 2 bytes), so a 1,000-token prompt carries roughly 2.6 gigabytes of state.

Grouped-query attention, the Llama 3 scheme, shares key and value projections across groups and cuts the stored heads from 64 to 8, dropping the cache to about 0.33 gigabytes per thousand tokens.

Multi-head latent attention, DeepSeek’s design, compresses the per-token state into a single latent vector of dimension 576 stored once rather than per head, landing near 0.07 gigabytes per thousand tokens.

Across these three schemes the seam payload spans a factor of about 37, which is the unglamorous reason MLA models are structurally cheaper to disaggregate and the reason attention-architecture choices are now serving-cost choices.

Figure 8. The seam payload is set by attention design. MHA ships gigabytes per thousand tokens of context across the wire; MLA ships a small fraction of that. The bytes you must move are decided in the model architecture, not the serving stack.
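The three payloads follow from a per-token byte count. A sketch in decimal units at FP16; the head, dimension, and layer figures for MHA and GQA are the ones in the text, while the 61-layer depth used for MLA is DeepSeek-V3's published layer count and is our assumption here, since the text does not state it:

```python
def kv_bytes_per_token(kv_heads, head_dim, layers, bytes_per_elem=2):
    """MHA or GQA: one key and one value vector per stored head per layer."""
    return 2 * kv_heads * head_dim * layers * bytes_per_elem

def mla_bytes_per_token(latent_dim, layers, bytes_per_elem=2):
    """MLA: a single compressed latent per layer, shared by all heads."""
    return latent_dim * layers * bytes_per_elem

mha = kv_bytes_per_token(kv_heads=64, head_dim=128, layers=80)
gqa = kv_bytes_per_token(kv_heads=8, head_dim=128, layers=80)
mla = mla_bytes_per_token(latent_dim=576, layers=61)  # 61 layers: assumption

for name, per_token in (("MHA", mha), ("GQA", gqa), ("MLA", mla)):
    gb_per_1k = per_token * 1000 / 1e9
    print(f"{name}: {per_token:,} bytes/token, {gb_per_1k:.3f} GB per 1,000 tokens")

print(f"MHA / MLA ratio: {mha / mla:.1f}")
```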

Now the carry. Take the cache for a 4,096-token prompt on a grouped-query 70B model, about 1.34 gigabytes, and move it across the interconnects an operator might have [Figure 9].

On fifth-generation NVLink at 1.8 terabytes per second, 0.75 milliseconds. On InfiniBand NDR at 400 gigabits per second, 27 milliseconds. On 200-gigabit RoCE, 54. On commodity 100-gigabit Ethernet, 107. Across a 25-gigabit link between datacenters, 429.

The carry is a one-time handoff per request, so it adds to first-token latency rather than the per-token rate, and the right comparison is against the prefill it follows, roughly 290 to 600 milliseconds for that prompt on an H100.

Against that, NVLink and InfiniBand transfers are rounding error or close to it, commodity Ethernet bolts roughly a fifth to a third of the prefill time onto every request, and cross-datacenter is fatal on its own.

The DistServe authors measured the transfer at under 0.1 percent of request latency over fast intra-node links, and every serious stack hides it further with layer-wise streaming, shipping each layer’s cache the moment that layer finishes so its movement overlaps the computation of the next.

Figure 9. The same 1.34 GB handoff across the interconnect hierarchy. Inside the rack it is noise; on InfiniBand it fits under the prefill; on commodity Ethernet it eats the budget; across datacenters it is fatal. The interconnect, not the GPU, decides whether disaggregation is viable.
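These transfer times are payload divided by line rate. A sketch that reproduces them, assuming ideal links with no protocol overhead, no per-message latency, and no contention, so real transfers will be somewhat slower:

```python
# Grouped-query 70B cache: 2 (K and V) * 8 heads * 128 dims * 80 layers * 2 bytes.
BYTES_PER_TOKEN = 2 * 8 * 128 * 80 * 2
PAYLOAD_BYTES = BYTES_PER_TOKEN * 4096  # about 1.34 GB

# Line rates in bytes per second; network links are quoted in bits, so divide by 8.
LINKS = {
    "NVLink 5": 1.8e12,
    "InfiniBand NDR 400G": 400e9 / 8,
    "RoCE 200G": 200e9 / 8,
    "Ethernet 100G": 100e9 / 8,
    "Cross-datacenter 25G": 25e9 / 8,
}

def transfer_ms(payload_bytes, bytes_per_s):
    """Ideal wire time in milliseconds."""
    return payload_bytes / bytes_per_s * 1e3

for name, rate in LINKS.items():
    print(f"{name:22s} {transfer_ms(PAYLOAD_BYTES, rate):8.2f} ms")
```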

This is where the current hardware generation changed the calculus, and it is the single most important development since the technique went mainstream.

The GB200 NVL72 and its successor the GB300 NVL72 connect 72 GPUs into one NVLink domain that behaves as a single massive GPU, with up to 130 terabytes per second of aggregate GPU-to-GPU bandwidth.

When prefill and decode pools live inside the same NVLink domain, the seam is no longer a network hop. It is a memory copy across a coherent fabric, sub-millisecond, effectively free.

The hard part of disaggregation, the part that made it a research problem for years, was moving the cache without blowing the latency budget, and the rack-scale NVLink domain makes that part vanish for deployments that fit inside a rack. This is a large reason disaggregation went from papers to production so fast: the hardware arrived to pay the seam’s bill.

The same fabric is why these systems can run the extremely wide expert parallelism we will see in the next section, which would be communication-bound on any slower interconnect.

The transfer machinery itself is now standardized infrastructure. NVIDIA’s NIXL, open-sourced at GTC 2025, unifies NVLink, InfiniBand, PCIe, and storage fabrics under one point-to-point abstraction and runs non-blocking so the GPU keeps computing while the cache moves.

Mooncake’s Transfer Engine, from the Kimi team, presents the same unified interface over TCP, RDMA, shared memory, and NVMe-over-fabrics.

LMCache, from the University of Chicago, accelerates the movement with batched transfers and I/O pipelining and decouples cache storage from the engine so the cache can persist, migrate, and be shared independently of execution; its authors report up to a tenfold reduction in first-token latency from KV reuse and offload.

DeepSeek built 3FS, a file system that pools thousands of SSDs and hundreds of storage nodes so any prefill can stash a cache that any decode can fetch in a locality-oblivious way.

And the cache, once it is a first-class managed object, can be retained and reused across requests, which is the quiet lever under the whole economy: of DeepSeek’s 608 billion daily input tokens, 342 billion, 56.3 percent, were served from cache rather than recomputed [Figure 10].

Prefix caching is for many workloads a larger cost lever than the prefill-decode split itself, and disaggregation is what makes the cache a managed resource in the first place.

Figure 10. The economics DeepSeek disclosed, and the cache that drives them. Left, the daily GPU cost against the theoretical revenue if every token were billed at R1 rates, the source of the much-quoted 545 percent margin, which the company itself flags as theoretical and which is materially lower in reality. Right, the token composition: more than half of all input tokens were served from cache, not recomputed. This is the disaggregated, MLA-based, heavily-cached architecture that lit the fuse on the whole repricing.

The kernel layer: expert parallelism and the all-to-all

Here is the part most coverage skips, and the part that actually sets throughput on a modern frontier model.

Look under the hood of virtually any current frontier model, DeepSeek-V3, Kimi K2, Qwen3, Llama 4, and you find a sparse mixture-of-experts architecture: hundreds of expert sub-networks per layer, of which each token activates only a handful.

A 671-billion-parameter model like DeepSeek-V3 activates about 37 billion parameters per token. Sparsity is what makes these models cheap to run, but it imposes a specific and brutal communication pattern, and that pattern is where disaggregation, expert parallelism, and the interconnect all collide.

To serve a large MoE you distribute the experts across many GPUs, a scheme called expert parallelism, because no single GPU holds all of them.

But then every token has to travel to whichever GPUs hold its chosen experts and the results have to travel back, which means every layer performs two all-to-all communication operations per step, a dispatch that scatters tokens to their experts and a combine that gathers the results.

This is the dominant communication cost of MoE inference, and it has a vicious property: the messages are tiny. In DeepSeek-V3 each dispatch or combine message runs from about 7 kilobytes during inference to 256 kilobytes during training [Figure 11].

General-purpose collective libraries like NCCL are tuned for the opposite regime, the large multi-megabyte all-reduce of dense training, where bandwidth dominates. At 7 kilobytes the per-message latency and synchronization overhead dominate instead, and NCCL leaves most of the wire idle.

Figure 11. Expert parallelism breaks the collective library. MoE dispatch and combine move tiny messages, where NCCL’s all-reduce kernels stall on latency and synchronization rather than saturating bandwidth. This mismatch is why DeepSeek wrote a dedicated communication library.
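The dispatch-and-combine pattern is easier to see in miniature. The following is a single-process toy, not DeepEP or NCCL: the "GPUs" are dictionary keys, experts are placed in contiguous blocks, and each expert's output is a stand-in number. All names are illustrative.

```python
from collections import defaultdict

def dispatch(token_experts, experts_per_gpu):
    """Scatter: send each token to every GPU hosting one of its chosen experts.

    token_experts[i] lists the expert ids picked for token i.
    Returns {gpu_id: [(token_index, expert_id), ...]}.
    """
    inbox = defaultdict(list)
    for tok, experts in enumerate(token_experts):
        for e in experts:
            inbox[e // experts_per_gpu].append((tok, e))
    return dict(inbox)

def combine(inbox, expert_fn, num_tokens):
    """Gather: sum each token's expert outputs back at its origin."""
    out = [0.0] * num_tokens
    for items in inbox.values():
        for tok, e in items:
            out[tok] += expert_fn(tok, e)
    return out

# Two tokens, two experts each, 8 experts laid out 4 per GPU.
routing = [[0, 5], [5, 6]]
inbox = dispatch(routing, experts_per_gpu=4)
print(inbox)
print(combine(inbox, expert_fn=lambda tok, e: float(e), num_tokens=2))
```

Each tuple in an inbox corresponds to one small message on the wire, and a real layer sends them in both directions on every step.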

That library is DeepEP, which DeepSeek open-sourced on the second day of its Open Source Week, and it is a small masterclass in why the kernel layer matters.

DeepEP provides custom all-to-all dispatch and combine kernels in two distinct flavors, and the split mirrors the prefill-decode split exactly. The high-throughput kernels serve prefill and training, maximizing raw bandwidth, but they emit dynamically shaped tensors that are incompatible with CUDA graphs.

The low-latency kernels serve decode, using direct RDMA to minimize latency and, critically, remaining CUDA-graph compatible so they avoid the kernel-launch overhead that dominates the decode phase, where each step is tiny and launch cost is a large fraction of the work.

The kernels need only about 20 streaming multiprocessors to saturate both the intra-node NVLink domain and the inter-node RDMA network simultaneously, freeing the rest of the GPU for computation.

They achieve this through NVSHMEM and IBGDA, which let the GPU issue RDMA operations directly to the network card without a round-trip through the CPU, and through asymmetric-domain forwarding that bridges the fast NVLink domain and the slower RDMA domain in one kernel.

The high-throughput path uses 24 queue pairs; the low-latency path uses 8 to 16, matched to the local expert count. A hook-based design overlaps the communication with computation without occupying compute units at all.

The newest DeepEP revisions add TMA-based transfers for minimal SM usage, support for the larger multi-node NVLink domains of the rack-scale systems, and a zero-SM remote-memory primitive for fetching KV cache directly from a peer.

This level of specialization is not optional at scale, and NVIDIA’s response confirms it: NVIDIA built HybridEP, its own token-based dispatch backend using the same hardware primitives, and co-developed a set of Blackwell kernels with the SGLang and vLLM projects through the FlashInfer library, covering attention prefill and decode, the communication path, the grouped matrix multiplications, the multi-node NVLink transfers, and MLA specifically.

The all-to-all is the bottleneck, and the entire industry is now optimizing the same dozen kernels.

The payoff of getting this right is expert parallelism that goes very wide, and width is throughput [Figure 12].

Spreading the experts of DeepSeek-R1 across 32 ways instead of 8 on a GB200 NVL72 lifts per-GPU output throughput by about 1.8 times, NVIDIA’s TensorRT-LLM measurements show, because a wider spread means each GPU holds fewer experts and therefore loads less weight per step, and because it fills the grouped matrix multiplications more completely.

Wide expert parallelism is only viable because the 130-terabyte-per-second NVLink domain absorbs the all-to-all traffic that wider spreading generates; on a slower fabric the communication would swamp the gain. And the two phases run deliberately different widths.

DeepSeek’s disclosed production configuration runs prefill at expert-parallel degree 32 and decode at degree 144, a decode pool four and a half times wider than the prefill pool, because decode is where the wide spread pays off in throughput and where the batch is large enough to keep all those experts busy.

The DistServe authors, surveying the same system, note decode configurations reaching toward 256-way expert parallelism, and newer stacks push wider still.

The prefill-decode asymmetry that began as a latency argument has become a parallelism argument: the phases want not just different hardware but different distributed-systems topologies entirely.

Figure 12. Decode wants the widest expert parallelism the fabric will allow. Spreading experts to EP32 instead of EP8 lifts per-GPU decode throughput by about 1.8 times by shrinking the per-GPU weight load and filling the grouped matrix multiplications. The two phases run deliberately different degrees.
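One term of that gain, the per-GPU expert load, is plain division. A rough sketch assuming 256 routed experts per MoE layer, which is DeepSeek-V3's published configuration and not a figure from the text; it ignores the shared expert, redundant expert replicas, and attention weights:

```python
import math

ROUTED_EXPERTS = 256  # assumption: DeepSeek-V3/R1 routed experts per MoE layer

def experts_per_gpu(ep_degree, routed_experts=ROUTED_EXPERTS):
    """Routed experts each GPU must hold (and stream) at a given EP degree."""
    return math.ceil(routed_experts / ep_degree)

for ep in (8, 32, 144, 256):
    print(f"EP{ep}: {experts_per_gpu(ep)} routed experts per GPU")
```

The expert load falls fourfold from EP8 to EP32 while measured throughput rises about 1.8 times, a reminder that communication and the non-expert weights take their share.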

The attention rewrite is shrinking the problem

While the serving stack was learning to split and spread, the model architects were attacking the problem from underneath, and the attack lands squarely on the two costs this issue is about: the bytes at the seam and the compute in prefill.

Multi-head latent attention was the first cut, and we have already seen its effect on the seam, a roughly thirty-seven-fold reduction in cache bytes versus dense multi-head attention.

But MLA leaves the other cost untouched: prefill attention still scales quadratically with sequence length, because every token still attends to every prior token, and as context windows stretch toward a million tokens that quadratic term dominates the prefill bill.

This is the cost that DeepSeek attacked in V3.2 with DeepSeek Sparse Attention, and the mechanism is elegant [Figure 13]. A lightweight neural network DeepSeek calls the Lightning Indexer scores the relevance of past key blocks to the current query and selects only the top few thousand most relevant, and the expensive attention computation then runs only over that selected set.

The pattern is retrieve-then-attend, and it bends the cost curve from quadratic in sequence length toward roughly linear past the selection cutoff, while DeepSeek reports model quality virtually unchanged.

At a million tokens of context the difference is on the order of a few hundredfold less attention work in prefill.

Figure 13. Sparse attention attacks the prefill cost itself. MLA cut the KV bytes crossing the seam; DeepSeek Sparse Attention then cut the prefill compute, selecting a fixed set of relevant key blocks before attending and bending quadratic attention toward linear at long context. Illustrative scaling for a fixed selection budget.
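The size of the saving can be estimated by counting query-key pairs. A back-of-envelope sketch, assuming causal attention and a selection budget of 2,048 keys per query; the budget is our stand-in for "the top few thousand", and the indexer's own cost is ignored:

```python
def dense_pairs(n):
    """Causal dense attention: token i attends to i keys."""
    return n * (n + 1) // 2

def sparse_pairs(n, k):
    """Top-k attention: token i attends to min(i, k) keys."""
    if n <= k:
        return dense_pairs(n)
    return k * (k + 1) // 2 + (n - k) * k

K = 2048  # assumed selection budget
for n in (8_192, 131_072, 1_000_000):
    ratio = dense_pairs(n) / sparse_pairs(n, K)
    print(f"n={n:>9,}: dense/sparse attention work ~ {ratio:.0f}x")
```

Below the budget the two are identical; past it the sparse cost grows linearly while the dense cost keeps growing quadratically.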

This matters for disaggregation in a way that compounds. Sparse attention shrinks both the prefill compute and, in its variants, the cache that must be carried, which shifts the prefill-decode balance again and makes long-context disaggregation economically viable at lengths that would have been hopeless under dense attention.

It is also the live frontier as of this writing. DeepSeek-V3.2 shipped in late 2025 as a 671-billion-parameter MLA-plus-MoE model with DSA layered on top, reaching reasoning quality the company benchmarks against GPT-5, and its Speciale variant took gold-medal scores at the 2025 International Mathematical Olympiad and the ICPC World Finals.

DeepSeek-V4, released in April 2026, extends the context window to a million tokens and replaces DSA with a successor called Compressed Sparse Attention, with its Pro variant reported at 80.6 percent on SWE-bench.

The trajectory is unmistakable: the attention mechanism is being rebuilt around the economics of long-context inference, and each rebuild changes the numbers in the serving stack beneath it. Architecture and serving are no longer separable disciplines.

The benchmark reality, and where it breaks

For two years the inference market argued over performance with vendor slides, which is no way to allocate billions in capital.

That changed in late 2025 when SemiAnalysis launched InferenceMAX, the first independent open-source benchmark to measure not raw throughput but total cost of compute across real models and real interactivity targets. It runs DeepSeek R1, GPT-OSS, Llama 3, and Qwen across the GB200 NVL72, B200, H200, H100, and AMD’s MI300X, MI325X, and MI355X, with Google TPU and AWS Trainium backends following.

It is the closest thing the field has to a neutral scoreboard, and the picture it paints is one a chip-level analysis would miss entirely [Figure 13].

Figure 13. Per-GPU throughput on DeepSeek R1 at a fixed 25 tokens-per-second-per-user interactivity, normalized to H200. The GB200 NVL72’s roughly tenfold lead over the H200, and threefold over the B200, comes from the rack-scale NVLink domain, not from a faster individual chip. Values from SemiAnalysis InferenceMAX; the advantage is interactivity-dependent.

At a fixed interactivity of 25 tokens per second per user on DeepSeek R1, the GB200 NVL72 delivers roughly ten times the per-GPU throughput of an H200 and about three times that of a standalone B200, even though the B200 is a faster chip in isolation.

The advantage is the 72-GPU NVLink domain, which lets the rack run wider parallelism and larger coherent batches than any single eight-GPU node can.

In absolute terms the B200 reaches 60,000 tokens per second per GPU at 1,000 tokens per second per user on GPT-OSS, and software optimization alone drove its cost on that model to two cents per million tokens, a fivefold reduction in two months.

NVIDIA frames the rack-level economics as a five-million-dollar GB200 NVL72 generating 75 million dollars in token revenue, a fifteen-fold return, though that figure prices output at favorable rates and should be read as a vendor’s best case rather than a realized margin.

The energy picture is cleaner and third-party: on DeepSeek R1 the GB200 NVL72 delivers roughly eight times the tokens per provisioned megawatt of a single-node H200, and Blackwell runs about 20 percent more energy-efficient than AMD’s CDNA4 on GPT-OSS, partly because the MI355X draws 1.4 kilowatts per GPU against the B200’s 1 kilowatt.

Now the part the headline numbers omit, and the part this publication exists to surface. The rack-scale advantage is not a constant. It is a function of interactivity, and it expires [Figure 14].

At 60 tokens per second per user the GB200 NVL72 produces a little less than triple a B200’s per-GPU throughput, but as the interactivity target rises the batch that the rack can assemble shrinks. By around 130 tokens per second per user the workload fits inside a single eight-GPU node’s NVLink domain, at which point the NVL72’s scale-out advantage disappears entirely and the rack becomes more expensive per token than a standalone node.

The whole case for the rack rests on serving many users at moderate interactivity, the chatbot and agent regime; push to extreme single-user speed and the economics invert.

The benchmark also exposes a software truth that no spec sheet shows: AMD’s MI355X is competitive with the B200 on FP8 disaggregated prefill, but its disaggregated performance actually degrades at higher interactivity because the ROCm stack lacks the kernel and collective optimizations needed to compose multiple state-of-the-art techniques together.

Disaggregation is not a hardware capability you buy; it is a software capability you accumulate, and the gap between vendors is measured in kernels.

Figure 14. The rack-scale win has an expiry date. At moderate interactivity the NVL72 roughly triples a single node per GPU; push interactivity high enough that the batch shrinks into one eight-GPU node, and the advantage evaporates and inverts on cost. Shape after SemiAnalysis InferenceX v2.

The dial you must keep turning

Disaggregation hands you an operational problem that the headline numbers never mention, and every team that has run it in production knows it immediately.

Once the fleet is split into a prefill pool and a decode pool, you have to choose the ratio between them, and you will get it wrong, because the right answer keeps moving [Figure 15].

The problem is a producer-consumer imbalance: prefill instances produce cache that decode instances consume, and the production rate rarely matches the consumption rate. Provision too few prefill instances and prompts queue while first-token latency slips and the decode pool sits half-idle for want of work.

Provision too few decode instances and the prefill pool races ahead of what decode can absorb: finished caches wait for a decode slot, per-token latency slips, and the prefill pool ends up throttled and half-idle in turn. Either error strands capital on GPUs that cannot do useful work because the other pool is the bottleneck.

DeepSeek’s disclosed answer was a fixed three-to-nine ratio, three prefill nodes feeding nine decode nodes. The SGLang team reproduced a similar split on 96 H100s, twenty-four GPUs for prefill and seventy-two for decode, reaching 52,300 input tokens and 22,300 output tokens per second per node. It was the first open implementation to match DeepSeek’s own reported numbers, at a cost the team put at twenty cents per million output tokens, about one-fifth the official API price.
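The twenty-cent figure is easy to sanity-check from the throughput number. The sketch below is a rough reconstruction, not the SGLang team's own accounting. It assumes a rental price of two dollars per H100-hour and eight GPUs per node (consistent with the 24-and-72 split across three and nine nodes), and it charges output tokens to the decode node alone.

```python
def cost_per_million_tokens(tokens_per_second, hourly_cost_dollars):
    # Dollars spent per hour divided by millions of tokens produced per hour.
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_dollars / (tokens_per_hour / 1_000_000)


GPU_HOURLY_RATE = 2.00   # assumed H100 rental price in dollars, not from the source
GPUS_PER_NODE = 8        # assumed: 96 GPUs spread over 3 + 9 nodes
DECODE_TOKENS_PER_S = 22_300  # output tokens per second per node, as reported

node_hourly_cost = GPU_HOURLY_RATE * GPUS_PER_NODE
decode_cost = cost_per_million_tokens(DECODE_TOKENS_PER_S, node_hourly_cost)
print(f"${decode_cost:.3f} per million output tokens")
```

Under those assumptions the result lands at roughly twenty cents. The same function shows how sensitive the figure is: the cost scales linearly with the hourly rate and inversely with sustained throughput, so a fleet that runs at half its benchmark throughput pays double per token.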

Figure 15. Disaggregation hands you a ratio you must keep correct. Too few prefill instances and first-token latency slips; too few decode instances and per-token latency slips. The optimum sits in a narrow band and drifts with every shift in prompt and output length. Schematic of the producer-consumer balance.

What makes this hard rather than a one-time sizing exercise is that the optimal ratio is not constant. It depends on the shape of the traffic, the ratio of input length to output length, and that shape shifts by the hour and by the product surface.

A wave of document-summarization requests is prefill-heavy and wants more prefill capacity; a wave of long-form generation is decode-heavy and wants more decode; a split that is optimal at noon is wrong by midnight.
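A first-order way to see how far the ratio moves is to balance the work each average request places on each pool. The sketch below is a simplification under stated assumptions: the function name `pool_split` is hypothetical, per-node throughput is treated as a constant (it reuses the 52,300 and 22,300 tokens-per-second figures quoted above), and queueing, latency targets, and transfer time are ignored. The request shapes are invented for illustration.

```python
def pool_split(total_nodes, mean_input_tokens, mean_output_tokens,
               prefill_tokens_per_s, decode_tokens_per_s):
    """Split a fixed fleet so both pools saturate at the same request rate."""
    # Node-seconds of work that one average request puts on each pool.
    prefill_work = mean_input_tokens / prefill_tokens_per_s
    decode_work = mean_output_tokens / decode_tokens_per_s
    prefill_share = prefill_work / (prefill_work + decode_work)
    # Round to whole nodes and keep at least one node in each pool.
    prefill_nodes = round(total_nodes * prefill_share)
    prefill_nodes = min(total_nodes - 1, max(1, prefill_nodes))
    return prefill_nodes, total_nodes - prefill_nodes


PREFILL_TPS = 52_300  # input tokens per second per node, as quoted above
DECODE_TPS = 22_300   # output tokens per second per node, as quoted above

# Hypothetical traffic shapes: (label, mean input tokens, mean output tokens).
workloads = [
    ("balanced chat", 1_000, 1_280),
    ("document summarization", 8_000, 300),
    ("long-form generation", 500, 4_000),
]

for label, mean_in, mean_out in workloads:
    prefill, decode = pool_split(12, mean_in, mean_out, PREFILL_TPS, DECODE_TPS)
    print(f"{label:>24}: {prefill} prefill / {decode} decode")
```

On a twelve-node fleet the balanced shape reproduces a three-to-nine split, while the summarization and long-form shapes push the optimum to opposite extremes. Real throughput per node also shifts with sequence length and batch size, so a production controller would measure these rates continuously rather than assume them.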

This is why the serious systems have moved to dynamic rebalancing, monitoring load in real time and shifting the ratio, and why a system like TaiChi goes further and switches whole instances between disaggregated and colocated modes depending on which yields better goodput at the current load.

The existence of that last capability is the tell: colocation is not always wrong, and a system smart enough to know when to disaggregate is smart enough to know when to stop.

Disaggregation converts a hardware-utilization problem into a scheduling-and-capacity problem. That is usually a good trade, because software is cheaper to change than silicon, but it is a trade and not a free win: a team that splits without building the rebalancing machinery will frequently lose to a well-tuned colocated deployment with chunked prefill.

The split spawned silicon

The most striking evidence that disaggregation has become foundational is not in any serving framework.

It is in the silicon roadmap, because once the phases run on separate machines, the machines stop needing to be the same machine, and the hardware vendors have noticed [Figure 16].

Figure 16. The split spawned silicon. Plotting accelerators by compute against memory bandwidth, vendors are now building to the corners: FLOPS-rich, cheap-memory parts for prefill, and bandwidth-rich parts for decode. Compute is a low-precision proxy across mixed formats; the positioning, not exact parity, is the point.

NVIDIA’s clearest statement is the Rubin CPX, announced in September 2025 and shipping at the end of 2026, a GPU built exclusively for the prefill and context phase.

Its logic is the wrong-sizing problem stated as a product: prefill is compute-bound and barely touches memory bandwidth, so dedicating expensive high-bandwidth HBM to it wastes the most expensive component on the chip.

The CPX therefore pairs 30 petaFLOPS of NVFP4 compute with 128 gigabytes of GDDR7, a memory that SemiAnalysis estimates is roughly five times more cost-effective per byte than HBM and runs at perhaps a quarter of HBM’s bandwidth, which is fine because prefill does not need the bandwidth.
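The claim that prefill does not need the bandwidth can be made concrete with a roofline-style check. The sketch below is a simplification under stated assumptions: it models only the weight matrix multiplies, with each weight read once per pass and reused across every token in flight, and it ignores attention, activations, and KV-cache traffic. The 30 petaFLOPS figure is from the text; the 2 terabytes per second of memory bandwidth, the 4-bit weight size, and the token counts are assumptions for illustration.

```python
def ridge_point(peak_flops, bandwidth_bytes_per_s):
    # FLOPs per byte above which the chip is limited by compute, not memory.
    return peak_flops / bandwidth_bytes_per_s


def matmul_intensity(tokens_in_flight, bytes_per_weight):
    # Each weight read from memory does one multiply and one add per token.
    return 2 * tokens_in_flight / bytes_per_weight


def limiting_resource(tokens_in_flight, bytes_per_weight, peak_flops, bandwidth):
    intensity = matmul_intensity(tokens_in_flight, bytes_per_weight)
    return "compute" if intensity >= ridge_point(peak_flops, bandwidth) else "memory"


PEAK_FLOPS = 30e15       # 30 petaFLOPS of NVFP4 compute, from the text
BANDWIDTH = 2e12         # assumed memory bandwidth in bytes per second
BYTES_PER_WEIGHT = 0.5   # assumed 4-bit weights

# Prefill processes a whole prompt chunk at once; decode processes one token
# per active sequence. Both counts are hypothetical.
prefill_tokens = 8_192
decode_batch = 64

print("ridge point:", ridge_point(PEAK_FLOPS, BANDWIDTH), "FLOPs per byte")
print("prefill bound by:", limiting_resource(prefill_tokens, BYTES_PER_WEIGHT, PEAK_FLOPS, BANDWIDTH))
print("decode bound by:", limiting_resource(decode_batch, BYTES_PER_WEIGHT, PEAK_FLOPS, BANDWIDTH))
```

With thousands of prompt tokens sharing each weight read, prefill sits on the compute side of the ridge even at the assumed lower bandwidth, while a decode batch of a few dozen tokens sits far on the memory side. That asymmetry is the argument for pairing cheap memory with prefill silicon.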

It adds dedicated attention hardware delivering about three times the attention throughput of a GB300 NVL72, aimed directly at million-token context, and it ships with PCIe but no NVLink, because it is built for disaggregated inference racks rather than tightly coupled training clusters.

The packaging makes the intent explicit: the Vera Rubin NVL144 CPX rack pairs 144 CPX prefill GPUs with 144 standard Rubin decode GPUs and 36 CPUs, prefill and decode silicon racked side by side, the disaggregation thesis cast in metal.

The decode side is bifurcating too, and along a more radical axis. In December 2025 NVIDIA signed a roughly 20-billion-dollar licensing arrangement with Groq, whose language-processing units abandon HBM entirely for on-chip SRAM, 500 megabytes per chip at 150 terabytes per second, an order of magnitude past any HBM part.

For autoregressive decode at long context, where the entire bottleneck is reading state out of memory, SRAM’s bandwidth is decisive: a 70-billion-parameter decode at 128,000 tokens of context runs dramatically faster in memory-access time on an SRAM part than on an HBM GPU.

The emerging architectural division has GPUs handling training and prefill while specialized low-latency parts handle decode, which is prefill-decode disaggregation pushed all the way down to the level of distinct chip families.

And the trend is not NVIDIA’s alone. The DistServe authors report that Huawei, Enflame, MetaX, and Biren are all prototyping or deploying decode-specialized or attention-optimized accelerators built on exactly this philosophy.

A systems technique conceived to tame latency on homogeneous GPUs is now redrawing the boundaries of the accelerator market itself.

When it still doesn’t pay

Disaggregation is the substrate now, but it is not a universal good, and the honest boundaries matter as much as the wins [Figure 17].

SemiAnalysis, evaluating the Rubin CPX’s full hardware-level disaggregation, put the caveat precisely: complete disaggregation delivers excellent results only under certain ratios of input to output length and for long decode lengths, with other scenarios seeing underwhelming benefits.

The structure of the advantage explains the boundary. Disaggregation’s benefit grows with output length, because longer outputs mean more decode steps to protect from interference, and with offered load, because heavier load means more interference to remove.

In the opposite corner, short outputs under light traffic, there is little interference to eliminate and too little decoding for the protection to accrue, and the seam and the ratio overhead are pure cost. A well-tuned colocated deployment with chunked prefill wins there.

Figure 17. The decision is a regime. Disaggregation pays when outputs are long and load is heavy, and goes underwater for short replies under light load. The map is an illustrative model encoding the consistent direction of the evidence, not a measured surface. Characterize your own traffic before committing.

Three caveats compound this. The wrong-sizing problem never fully disappears even with disaggregation, because a pure prefill instance on an HBM part still underutilizes its memory bandwidth, which is the entire reason the Rubin CPX exists.

The interactivity cliff from the benchmark section means that even where disaggregation and rack-scale hardware win at moderate interactivity, the advantage inverts at extreme single-user speed, when the batch collapses into a single node.

And the software-maturity tax, visible in AMD’s degraded high-interactivity disaggregation, means the gains are not portable across stacks; they have to be earned kernel by kernel. The strategic read is not a binary choice between disaggregating and not disaggregating.

It is to characterize your traffic on the two axes that matter, real output-length distribution and real peak-to-trough load, then build for your regime, and to buy the interconnect before you buy the split, because on a slow fabric the seam eats the gain and on a fast one it disappears.

The shake, settled

Return to the 600 billion dollars. The market’s first instinct, that cheaper inference means less demand for the chips that serve it, has been falsified about as cleanly as a macro thesis ever is.

The cost collapse was real, from thousands of dollars per benchmark task to about eleven, and demand did not fall; it ran so far past the efficiency gains that the Peterson Institute concluded usage had dwarfed them, the Jevons paradox playing out in real time.

The technique that frightened the market, efficient inference built on disaggregation and sparsity and compressed attention, did not shrink the industry. It enlarged it, by making applications viable that were uneconomical at the old prices.

But there is a second-order effect the bullish reading often misses, and it is where the real consequence sits. DeepSeek did not just demonstrate cheap inference; it open-sourced the means of production. DeepEP, 3FS, the full reference architecture, all released to anyone.

NVIDIA open-sourced Dynamo and NIXL. Mooncake, llm-d, LMCache, SGLang, and vLLM are all open. The result is that the disaggregated serving stack, the thing that lets you run a frontier model at a fraction of the naive cost, is no longer a moat.

It is a commodity any competent team can deploy, which is exactly why the SGLang reproduction could serve DeepSeek at one-fifth the official API price. The value did not disappear; it moved.

SemiAnalysis frames the shift as a move from raw FLOPS per chip to total intelligence per dollar at rack scale, and the InferenceMAX results bear it out: the GB200 NVL72 wins not because its chips are faster but because 72 of them act as one, and the orchestration software across them is as much of the product as the silicon.

The moat migrated from the model and the kernel, which are now shared, to the rack-scale system integration and the interconnect, which are hard to replicate and hard to buy.

The frontier from here is the generalization of the same idea to the next seam. The DistServe authors point to attention-FFN disaggregation as the natural successor: within decode, attention is memory-bound and hungry for KV-cache bandwidth while the feed-forward layers are compute-bound and hungry for weight storage, so splitting them onto tailored hardware lets each reach high utilization independently, and the same logic that justified the prefill-decode split applies one level down.

For dense models this was long considered impractical because it doubles the activation transfer per layer, but the MoE models that now dominate already perform two all-to-all operations per decode step, so the attention-FFN split can be folded into the communication pattern that already exists, making its extra transfer nearly free.

MegaScale-Infer and StepFun’s Step-3 have already demonstrated it on large MoE models. The pattern is always the same: find a boundary in the computation where the two sides want different hardware or different parallelism, split there, and pay a transfer cost at the new seam in exchange for independent optimization on each side.

The question every such split raises is the question this entire issue has circled. Is the thing you carry across the new seam small enough, and the wire fast enough, that the split pays?

DeepSeek published a margin and the market saw a number. The number was a consequence.

The cause was an architecture that took two workloads with opposite appetites and stopped forcing them to share a plate, and within eighteen months that architecture became the floor that everything else is built on, dragged the hardware roadmap behind it, and survived a 600-billion-dollar referendum on whether efficiency was a threat or an accelerant. The split won.

What remains, and what will keep deciding the economics as the seams multiply and the models keep rewriting themselves underneath, is the same thing it always was: how honestly the system respects the shape of the work, and how fast the wire is at the seam.

The Software Frontier
