The Router and the Wire

Mixture-of-experts promised cheaper inference by doing less arithmetic. The bill did not disappear. It moved into the network, and the entire shape of a 2026 serving stack is the receipt.

Lorenzo Bradanini and Lorenzo Tettamanti

Jun 29, 2026

The cost did not vanish. It moved.

Every few months the inference market tells itself a story about where the money goes, and every few months the story is wrong in the same direction. For two years the story was memory.

The key-value cache grew with context, the weights grew with parameter count, and the binding constraint on a serving deployment was how many bytes of high-bandwidth memory you could buy and how fast you could read them.

That story was true, and it is the subject of two earlier issues of this publication. It is also no longer the whole story for the models that now define the frontier.

The frontier moved to sparsity. A dense model the size of DeepSeek-V3 would activate all 671 billion of its parameters on every token it processed. The mixture-of-experts version activates roughly 37 billion.

On paper that is a reduction in arithmetic of roughly eighteen to one, and it is the single reason a model with two-thirds of a trillion parameters can be served at all without a fleet of accelerators per request.
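The ratio is easy to check from the two parameter counts quoted above. The sketch below is only that arithmetic; it treats activated parameters as a proxy for per-token arithmetic, which is an approximation and not a measured FLOP count.

```python
# Parameter counts quoted in the text for DeepSeek-V3.
TOTAL_PARAMS = 671e9    # every parameter in the model
ACTIVE_PARAMS = 37e9    # parameters activated for one token

def activation_ratio(total: float, active: float) -> float:
    """How many times more parameters a same-size dense model touches per token."""
    return total / active

ratio = activation_ratio(TOTAL_PARAMS, ACTIVE_PARAMS)
active_share = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"dense-to-sparse ratio: about {ratio:.1f} to 1")
print(f"share of parameters active per token: {active_share:.1%}")
```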

The promise of the architecture was always framed in floating-point operations: do less math, pay less money. The vLLM and SGLang communities, NVIDIA, DeepSeek, and every serving vendor in between repeated some version of it.

The arithmetic did get cheaper. What the framing left out is that the arithmetic was never the part that was hard to scale. When you spread 256 experts across dozens or hundreds of accelerators, a token routed to eight of them has to physically travel to those eight accelerators, be processed, and travel back to be recombined.

That round trip is an all-to-all communication pattern, and it does not appear in any FLOP count. It is the part of the bill that the sparsity story quietly moved off the compute line and onto the network line, where it has been growing ever since.

DigitalOcean’s engineering writers put the distinction more bluntly than most vendors will. For a dense model, they note, cost scales with memory and is linear and predictable. For a mixture-of-experts model, cost becomes a game of communication. That is the thesis of this issue, stated in five words by someone selling cloud capacity.

The rest of this report is the long version: what the all-to-all actually costs in bytes and in silicon, why a rack that lists for two to three million dollars is best understood as an answer to a networking problem rather than a compute one, and whether the wide expert parallelism that everyone is now deploying actually earns its keep.

There is a seductive counterexample worth disposing of immediately, because it will come up. The KTransformers project can run the complete DeepSeek-V3 model on a single low-cost server with one consumer GPU, a machine that costs in the neighborhood of ten thousand dollars, and still produce nearly twenty tokens per second.

If a mixture-of-experts model can run on a ten-thousand-dollar box, how can the routing be expensive?

The answer is that the KTransformers configuration never pays the all-to-all toll, because there is no all-to-all. With every expert resident in the memory of a single node, routing a token to an expert is a memory lookup, not a network transfer.

The economics that follow in this issue are the economics of scale, of serving thousands of concurrent users at frontier latency, and the moment you cross the boundary from one node to many, the toll switches on. The single-box demo is real, and it is exactly why the multi-box reality is so often misunderstood.

What sparsity actually buys, and what it borrows

Begin with the structure of the model itself, because the geometry of the dispatch is dictated by it. DeepSeek-V3 carries 256 routed experts per mixture-of-experts layer plus one shared expert that processes every token. A gating network selects the top eight routed experts for each token.

The model has 61 transformer layers, 58 of which are mixture-of-experts layers; the first few are dense. So for the overwhelming majority of the depth of the network, every single token triggers a routing decision, a dispatch to eight destinations, and a combine back.

The routing itself is not free, though it is cheap relative to the transfer. A gating network scores all 256 experts for every token and selects the top eight, and that scoring, the sorting, and the construction of the dispatch order add a small compute and synchronization cost before any data moves.

It is minor against the roughly 168 kilobytes of all-to-all traffic that each token then generates in each layer, a figure derived later in this issue, but it is one more thing the dense model never does, and at the token rates of a frontier deployment even minor per-token costs accumulate. The gating is also where the load imbalance originates, since it is the gate’s learned preferences that send too many tokens to too few experts, which makes it both the cheapest and the most consequential of the operations the router performs.

The sparsity is genuine and the savings are genuine. Switch Transformer, the architecture that popularized sparse expert routing at scale, demonstrated roughly a sevenfold pre-training speedup over a dense baseline at the same compute budget, and the gap between total and activated parameters has only widened as expert counts have grown.

When practitioners say mixture-of-experts reduces computation by ninety percent, they are describing the activated-parameter ratio, and they are not wrong about it. A token that touches 37 billion of 671 billion parameters is doing far less matrix multiplication than a token that touches all of them.

But the activation has to move. In a single-accelerator world, the experts a token needs are sitting in local memory and the only thing that travels is a memory read. In a serving deployment large enough to matter, the experts are spread across the accelerators by expert parallelism precisely so that each accelerator holds only a few of them and the aggregate weight footprint fits.

That is the entire point of expert parallelism: it is what lets you serve a model whose experts, summed, are far too large for any single device. And it is also what guarantees that the token and its eight chosen experts will, in general, live on different devices.

So the model does two collective operations per mixture-of-experts layer that a dense model never does. The dispatch, sometimes called the scatter, sends each token’s hidden activation to the devices holding its selected experts. The combine, the gather, takes the eight expert outputs and reduces them back into a single vector on the token’s home device.

Survey work on efficient inference serving is consistent on the consequence: this all-to-all exchange of token dispatch and output gathering is the bottleneck in large-scale mixture-of-experts inference. Not the expert math. The exchange around it.
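To make the two collectives concrete, here is a toy single-process sketch of one layer: gate, dispatch, expert compute, combine. Everything in it is illustrative. The sizes, the round-robin placement in `device_of`, and the stand-in `expert_forward` are assumptions made for readability, not DeepSeek's implementation, and the "devices" are just dictionary keys.

```python
import random

NUM_EXPERTS = 16   # toy value; the model in the text has 256 routed experts
TOP_K = 2          # toy value; the model in the text routes each token to 8
NUM_DEVICES = 4    # hypothetical devices, experts placed round-robin
HIDDEN = 4         # toy hidden width

def device_of(expert: int) -> int:
    """Hypothetical placement: expert e lives on device e mod NUM_DEVICES."""
    return expert % NUM_DEVICES

def top_k_gate(scores, k):
    """Return (expert, normalized weight) pairs for the k highest scores."""
    chosen = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    total = sum(scores[e] for e in chosen)
    return [(e, scores[e] / total) for e in chosen]

def expert_forward(expert, vec):
    """Stand-in for an expert MLP: scales the vector by a per-expert constant."""
    return [(expert + 1) * x for x in vec]

def moe_layer(tokens, gate_scores):
    # Dispatch: each token's activation is sent to the devices of its chosen experts.
    inbox = {d: [] for d in range(NUM_DEVICES)}
    for t, (vec, scores) in enumerate(zip(tokens, gate_scores)):
        for expert, weight in top_k_gate(scores, TOP_K):
            inbox[device_of(expert)].append((t, expert, weight, vec))
    # Expert compute runs where the experts live; the combine then sums the
    # weighted outputs back into one vector per token on its home device.
    out = [[0.0] * HIDDEN for _ in tokens]
    for d in range(NUM_DEVICES):
        for t, expert, weight, vec in inbox[d]:
            y = expert_forward(expert, vec)
            for i in range(HIDDEN):
                out[t][i] += weight * y[i]
    sent = sum(len(messages) for messages in inbox.values())
    return out, sent

rng = random.Random(0)
tokens = [[rng.random() for _ in range(HIDDEN)] for _ in range(8)]
gate_scores = [[rng.random() + 1e-6 for _ in range(NUM_EXPERTS)] for _ in range(8)]
outputs, messages = moe_layer(tokens, gate_scores)
print(f"{messages} activations dispatched for {len(tokens)} tokens")
```

The message count is always tokens times top-k, whatever the gate decides. What the gate controls is where those messages land, which is the subject of the load-balancing discussion later in this issue.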

This is the borrowing that the sparsity bargain does not advertise. You spend less on arithmetic and you take on a debt denominated in bandwidth, and the debt comes due on a part of the machine that has improved far more slowly than the compute has.

To see why that matters, you have to look at the wire.

A rack-scale answer to a network problem

The defining fact about modern accelerators is that their arithmetic has outrun their interconnect, and it has done so by a margin that is hard to overstate until you put the numbers on a single axis. A Blackwell B200 reads from its own high-bandwidth memory at roughly eight terabytes per second.

The fifth-generation NVLink fabric that connects it to its neighbors moves 1.8 terabytes per second per GPU, eighteen links at a hundred gigabytes per second each. That is the fast path between two accelerators, and it is already more than four times slower than the path to local memory.

Then you fall off the edge. A single four-hundred-gigabit InfiniBand network card, the scale-out path that connects one node to another, moves about fifty gigabytes per second. The cliff from on-package memory to the cross-node network is more than two orders of magnitude.

Horizontal bar chart on a log scale comparing effective bandwidth: HBM3e on-package at 8000 GB/s, NVLink 5 fabric at 1800, DeepEP all-to-all over NVLink measured at 730, DeepEP all-to-all over RDMA internode at 90, and a single InfiniBand NIC at 50. — **FIG 1** The bandwidth available to a token collapses as it travels outward from the chip. The all-to-all of a mixture-of-experts layer lives somewhere on the right half of this chart, and where exactly is the whole question.

This is the chart that explains the rest of the hardware industry’s behavior. If the all-to-all of a mixture-of-experts layer can be kept inside the NVLink domain, it runs at hundreds of gigabytes per second.

If it has to cross the InfiniBand fabric between racks, it runs at a fraction of that. Introl’s infrastructure analysis puts the ratio at roughly eighteen to one between scale-up bandwidth inside the NVLink domain and scale-out bandwidth between racks. For an architecture whose dominant cost is an all-to-all, that ratio is not a detail. It is the design center.
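The ratios in this section follow from the quoted figures. The sketch below recomputes them. The last step is an assumption, not something the sources cited here state: it reads the 1.8 terabytes per second as a bidirectional total and compares one direction of NVLink against the network card, which is one way to arrive at eighteen to one.

```python
# Figures quoted in the text.
HBM_GBPS = 8000          # B200 local high-bandwidth memory, GB/s
NVLINK_LINKS = 18        # fifth-generation NVLink links per GPU
NVLINK_LINK_GBPS = 100   # GB/s per link
NIC_GBITS = 400          # one InfiniBand card, gigabits per second

nvlink_gbps = NVLINK_LINKS * NVLINK_LINK_GBPS   # per-GPU NVLink bandwidth
nic_gbps = NIC_GBITS / 8                        # gigabits to gigabytes

print(f"NVLink per GPU: {nvlink_gbps} GB/s, NIC: {nic_gbps:.0f} GB/s")
print(f"HBM vs NVLink: {HBM_GBPS / nvlink_gbps:.1f}x")
print(f"HBM vs NIC:    {HBM_GBPS / nic_gbps:.0f}x")

# Assumption: the NVLink figure counts both directions, so one direction is half.
nvlink_one_way_gbps = nvlink_gbps / 2
print(f"NVLink (one direction) vs NIC: {nvlink_one_way_gbps / nic_gbps:.0f}x")
```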

Which is what the GB200 NVL72 is. NVIDIA’s rack-scale system connects 72 Blackwell GPUs and 36 Grace CPUs into a single NVLink domain delivering 130 terabytes per second of aggregate, non-blocking, all-to-all bandwidth, with 13.5 terabytes of high-bandwidth memory addressable as one pool. Before this system, the largest NVLink domain you could buy was eight GPUs on a single baseboard.

The NVL72 takes the fast interconnect and stretches it across an entire rack so that 72 accelerators can talk to each other as though they were neighbors on the same board. NVIDIA’s own materials describe the result as a single massive GPU, and for the purposes of a mixture-of-experts all-to-all, that marketing is closer to literally true than marketing usually is.

The price of that rack is two to three million dollars, it draws around a hundred and twenty kilowatts, and it is liquid-cooled because there is no other way to remove the heat. It is easy to read those numbers as a statement about compute density, and the 1.44 exaflops of four-bit tensor performance per rack invites that reading.

But the compute was never the scarce thing. You can buy 720 petaflops of eight-bit compute in roughly 182 H100 accelerators for less money than an NVL72 costs.

What you cannot buy that way is a 72-way all-to-all domain. The premium on the rack is, in substantial part, the premium on the wire. It is the cost of not having to cross InfiniBand for the operation that a mixture-of-experts model performs 58 times per token.

Read this way, a great deal of the 2026 accelerator roadmap resolves into a single sentence: make the all-to-all domain larger than the problem.

At CES 2026 NVIDIA disclosed that the next-generation Vera Rubin NVL72 will roughly double per-GPU NVLink bandwidth to 3.6 terabytes per second and lift aggregate all-to-all bandwidth to 260 terabytes per second, with the explicit justification, in NVIDIA’s own words, that this is the bandwidth needed for the all-to-all communications of leading mixture-of-experts architectures.

The company has stopped being coy about it. The interconnect generation is being sold, by name, as the answer to the problem that the model generation created.

Twenty streaming multiprocessors, give or take

Hardware sets the ceiling. Whether you reach it is a question of kernels, and the reference implementation for the mixture-of-experts all-to-all is DeepEP, the communication library DeepSeek open-sourced during its 2025 release week.

DeepEP is worth studying closely not because it is the only such library but because it is the one whose measured numbers are public, and those numbers are the closest thing the field has to a ground truth for what the all-to-all costs at the kernel level.

DeepEP provides two classes of kernel, and the split maps precisely onto the two phases of inference. The normal kernels are tuned for throughput and serve training and the prefill phase, where batches are large and the all-to-all moves a great deal of data at once.

The low-latency kernels are tuned for the decode phase, where each step generates one token per sequence, the batches are tiny, and what matters is not bandwidth but the round-trip time of the dispatch and combine.

This is the same prefill-versus-decode division that disaggregated serving exploits, examined one issue ago, now visible at the level of individual communication kernels.

The measured bandwidths tell the scale-up story in a single table. On Blackwell-class hardware, DeepEP’s dispatch kernel moves 726 gigabytes per second and its combine kernel 740 gigabytes per second when the experts are inside the NVLink domain. The same kernels, forced across the internode RDMA fabric on the same generation of hardware, move about 90 gigabytes per second each.

That is a gap of roughly eight to one, the same cliff Figure 1 shows between its two DeepEP bars, and it was measured on a real workload: the configuration DeepSeek published, with eight thousand tokens per batch, a hidden dimension of 7168, top-eight routing, eight-bit dispatch, and sixteen-bit combine.

Grouped bar chart comparing DeepEP dispatch and combine kernel bandwidth: NVLink intranode at 726 and 740 GB/s versus RDMA internode at 90 GB/s each, roughly eight times slower off the NVLink domain. — **FIG 2** The same kernel, the same hardware, the same workload. The only variable is whether the experts sit inside the NVLink domain or across the network. That one boundary costs roughly a factor of eight.

The second thing DeepEP reveals is subtler and, for the economics, more important. Moving data costs compute. The all-to-all kernels do not run on dedicated networking silicon; they run on the same streaming multiprocessors that would otherwise be doing matrix multiplication.

Every SM assigned to push bytes through the fabric is an SM not computing an expert. DeepEP’s first version spent around 24 SMs on the communication for a training-scale all-to-all.

Its second version, a substantial rewrite that moved from a custom backend to a more lightweight one built on NVIDIA’s NCCL, cut that to between four and six SMs for the same work while matching or exceeding the old bandwidth.

The library’s authors describe the V2 rewrite as achieving extreme performance with several times fewer SM resources, and the measured table backs the claim: up to 1.3 times the peak bandwidth at up to four times fewer SMs.

But the decode path is hungrier than the training path, and here the numbers sharpen into a real cost. To hit maximum throughput on the NVLink decode all-to-all, DeepEP’s table shows the kernel consuming 64 streaming multiprocessors. A B200 has 148 of them. That is forty-three percent of the entire accelerator spent moving data rather than computing, in the configuration tuned for speed.

You can run the same kernel in a low-SM mode that uses 24, but you give up bandwidth to do it. The library also offers genuinely zero-SM paths for pipeline parallelism, context parallelism, and certain RDMA transfers, by offloading the movement to copy engines and the network cards directly, and a great deal of the engineering frontier in 2026 is about pushing more of the all-to-all onto those zero-SM paths.

The reason that frontier exists is Figure 3.

Bar chart of streaming multiprocessors consumed by communication kernels: DeepEP V1 training at 24 SMs, V2 training at 5, NVLink decode min-SM mode at 24, NVLink decode max-throughput mode at 64, against a reference line of 148 total SMs on a B200. — **FIG 3** The all-to-all is not free even when the wire is fast, because it is paid for in the same currency as the math. At peak decode throughput the communication kernel claims forty-three percent of the GPU.
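The percentages behind Figure 3 are simple division, and it can help to see what is left for the experts in each mode. This sketch uses only the SM counts quoted above; the single value of 5 for the V2 training kernel is the midpoint of the four-to-six range.

```python
TOTAL_SMS = 148  # streaming multiprocessors on a B200, per the text

COMM_SMS = {
    "V1 training": 24,
    "V2 training": 5,              # the text gives a range of four to six
    "decode, low-SM mode": 24,
    "decode, max throughput": 64,
}

def comm_share(sms: int, total: int = TOTAL_SMS) -> float:
    """Fraction of the GPU's SMs occupied by the communication kernel."""
    return sms / total

for mode, sms in COMM_SMS.items():
    left = TOTAL_SMS - sms
    print(f"{mode:24s} {sms:3d} SMs moving data ({comm_share(sms):.0%}), {left} left for experts")
```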

There is real craft in how DeepEP hides this. The library exposes a hook-based mechanism for overlapping communication with computation: a dispatch is launched, independent work proceeds on the compute stream while the data is in flight, and only when the result is needed does the kernel wait.

Done well, the all-to-all latency disappears behind the expert computation and the SM cost is the only thing left to account for. Done badly, the all-to-all stalls the pipeline and the expensive accelerators sit idle waiting for the network. The difference between those two outcomes is most of the difference between a good mixture-of-experts deployment and a wasteful one, and none of it is visible in a FLOP count.
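A back-of-the-envelope model shows what the overlap buys. It is a sketch under stated assumptions, not DeepEP's API: the function name is hypothetical and every duration is invented. A layer is split into the all-to-all, the work that does not depend on its result, and the work that does.

```python
def layer_time_us(comm_us: float, independent_us: float,
                  dependent_us: float, overlap: bool) -> float:
    """Time for one layer.

    comm_us:        all-to-all in flight
    independent_us: compute that does not need the dispatched tokens
    dependent_us:   compute that must wait for them
    """
    if overlap:
        # Communication and independent compute run side by side.
        return max(comm_us, independent_us) + dependent_us
    return comm_us + independent_us + dependent_us

def exposed_comm_us(comm_us: float, independent_us: float) -> float:
    """Communication time that overlap cannot hide."""
    return max(0.0, comm_us - independent_us)

# Invented durations, in microseconds, for illustration only.
INDEPENDENT_US, DEPENDENT_US = 60.0, 80.0

for comm_us in (40.0, 100.0):
    serial = layer_time_us(comm_us, INDEPENDENT_US, DEPENDENT_US, overlap=False)
    hidden = layer_time_us(comm_us, INDEPENDENT_US, DEPENDENT_US, overlap=True)
    print(f"comm {comm_us:5.0f} us: serial {serial:.0f} us, overlapped {hidden:.0f} us, "
          f"exposed {exposed_comm_us(comm_us, INDEPENDENT_US):.0f} us")
```

When the transfer is shorter than the independent work, it vanishes from the critical path. When it is longer, the excess is exposed, and that is the stall described above.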

The second version pushes the SM problem harder by moving the data movement off the streaming multiprocessors entirely wherever it can. Its experimental branches expose zero-SM paths for pipeline and context parallelism, handing the transfers to the GPU’s copy engines, and a zero-SM remote-memory primitive the authors call Engram that lets one device reach into another’s memory over RDMA without spending a single SM on the transfer.

The motivation is exactly Figure 3: every SM the network gives back is an SM the experts can use. The move to an NCCL-based backend pays off here as well, because it lets the library reuse existing communicators and scale the expert-parallel domain to as many as two thousand devices, far past anything a production model currently needs.

A separate branch rebuilds the kernels around the tensor-memory-accelerator instructions on Hopper and Blackwell, shrinking SM usage again and adding native four-bit support, which is how the dispatch leg of the toll gets cheaper at the same moment the experts do.

None of this would matter if the expert computation itself were not reorganized to match. The all-to-all delivers a variable number of tokens to each expert, because the gating network does not distribute traffic evenly, and a standard batched matrix multiply assumes a fixed shape.

The answer, embodied in DeepSeek’s companion DeepGEMM library, is a grouped matrix multiply that processes each expert’s variable token count as a contiguous segment, so the expert math runs as one efficient kernel rather than a ragged collection of small ones.

The communication and the computation are co-designed: the all-to-all produces exactly the memory layout the grouped GEMM wants to consume.

Pull either apart from the other and the efficiency collapses, which is part of why a tuned mixture-of-experts stack is so much harder to assemble than the FLOP savings would suggest.
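The layout idea can be shown in a few lines of plain Python. This is a sketch of the general grouped-matmul pattern, not DeepGEMM's interface: the function names are hypothetical, and the loop over experts stands in for what a real kernel would do in a single launch. Rows are sorted so that each expert's tokens form one contiguous segment, an offsets array marks the segment boundaries, and the results are scattered back to the original order.

```python
from itertools import accumulate

def group_by_expert(assignments, num_experts):
    """Return a row order that makes each expert's rows contiguous, plus segment offsets."""
    counts = [0] * num_experts
    for expert in assignments:
        counts[expert] += 1
    offsets = [0] + list(accumulate(counts))
    order = sorted(range(len(assignments)), key=lambda i: assignments[i])  # stable sort
    return order, offsets

def matvec(row, weight):
    """Multiply a row vector by a weight matrix stored as a list of rows."""
    return [sum(x * w for x, w in zip(row, column)) for column in zip(*weight)]

def grouped_matmul(rows, assignments, expert_weights):
    order, offsets = group_by_expert(assignments, len(expert_weights))
    packed = [rows[i] for i in order]
    packed_out = []
    for expert, weight in enumerate(expert_weights):
        # Each expert sees one contiguous, variable-length segment.
        segment = packed[offsets[expert]:offsets[expert + 1]]
        packed_out.extend(matvec(row, weight) for row in segment)
    out = [None] * len(rows)
    for position, original_index in enumerate(order):
        out[original_index] = packed_out[position]
    return out, offsets

# Toy data: five rows, three experts, expert e multiplies its rows by e + 1.
rows = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
assignments = [2, 0, 2, 1, 0]
expert_weights = [[[e + 1, 0], [0, e + 1]] for e in range(3)]
outputs, offsets = grouped_matmul(rows, assignments, expert_weights)
print("segment offsets:", offsets)
print("outputs:", outputs)
```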

What every token pays at the door

The bandwidth numbers describe the pipe. The next question is how much the model tries to push through it, and that can be computed directly from the geometry, which makes it one of the few places in this analysis where the arithmetic is exact rather than measured.

Take DeepSeek-V3’s mixture-of-experts layer. Each token’s hidden activation is a vector of 7168 values. On the dispatch, those values are sent in eight-bit precision, so one byte each, and they are sent to each of the eight selected experts. That is 7168 times 8, roughly 56 kilobytes of dispatch traffic per token per layer.

On the combine, each of the eight experts returns an output vector of the same width, but the combine is performed in sixteen-bit precision to preserve the accuracy of the reduction, so two bytes each. That is 7168 times 2 times 8, roughly 112 kilobytes of combine traffic per token per layer.

Add them and a single token, passing through a single mixture-of-experts layer, generates about 168 kilobytes of all-to-all traffic.

Stacked bar chart comparing bytes moved per token per layer: a dense layer at 14 KB with no routing, versus a mixture-of-experts layer at 168 KB total, split into 56 KB of eight-bit dispatch and 112 KB of sixteen-bit combine. — **FIG 4** A derived figure for DeepSeek-V3 geometry. The routing multiplies per-token data movement by roughly twelve and, unlike a dense layer, all of it has to cross the fabric. The model stacks 58 of these layers.

Set that against what a dense layer moves for the same token, which is essentially nothing across the fabric: the activation stays on the device and the only traffic is the local memory read of about 14 kilobytes. The mixture-of-experts layer moves roughly twelve times as much data per token, and the crucial difference is not the multiple but the destination.

The dense traffic stays on-chip. The mixture-of-experts traffic crosses the network. And this happens 58 times as the token descends through the model.
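Because this part of the analysis is exact, it can be reproduced in a few lines. The sketch uses only the geometry quoted above, with two assumptions worth flagging: the 14-kilobyte dense figure is taken to be one sixteen-bit activation, and every selected expert is counted as remote, which makes the total an upper bound when some experts happen to sit on the token's own device.

```python
HIDDEN = 7168                 # values per activation vector
TOP_K = 8                     # routed experts selected per token
MOE_LAYERS = 58               # mixture-of-experts layers in the model
DISPATCH_BYTES_PER_VALUE = 1  # eight-bit dispatch
COMBINE_BYTES_PER_VALUE = 2   # sixteen-bit combine
DENSE_BYTES_PER_VALUE = 2     # assumption: one sixteen-bit activation read locally

def kib(num_bytes: int) -> float:
    return num_bytes / 1024

dispatch_bytes = HIDDEN * TOP_K * DISPATCH_BYTES_PER_VALUE
combine_bytes = HIDDEN * TOP_K * COMBINE_BYTES_PER_VALUE
moe_bytes = dispatch_bytes + combine_bytes
dense_bytes = HIDDEN * DENSE_BYTES_PER_VALUE
per_token_bytes = moe_bytes * MOE_LAYERS

print(f"dispatch per token per layer: {kib(dispatch_bytes):.0f} KiB")
print(f"combine per token per layer:  {kib(combine_bytes):.0f} KiB")
print(f"total per token per layer:    {kib(moe_bytes):.0f} KiB")
print(f"multiple of the dense layer:  {moe_bytes / dense_bytes:.0f}x")
print(f"across all {MOE_LAYERS} layers:        {per_token_bytes / 1024**2:.1f} MiB per token")
```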

Two things follow from the structure of those 168 kilobytes. The first is that the combine is twice the dispatch, because the combine runs in higher precision. This is not an arbitrary choice; reducing eight expert outputs in eight-bit precision degrades quality unacceptably, so the field has settled on eight-bit dispatch and sixteen-bit combine as the standard, and that asymmetry means the return trip is the more expensive leg.

Any optimization that can compress the combine, including the four-bit experiments now appearing in DeepEP’s experimental branches, attacks the larger half of the toll.

The second is that the toll is paid per token, which means the decode phase, where tokens are generated one at a time, pays it in the worst possible way. In prefill, thousands of tokens are dispatched together and the all-to-all amortizes its latency across an enormous batch; the kernel runs in its throughput regime and the bandwidth numbers of Figure 2 apply.

In decode, a single step dispatches one token per sequence, or at most a handful when the model proposes several at once, so the batch is tiny, the bandwidth of the pipe is irrelevant because the pipe is nearly empty, and what dominates is the fixed round-trip latency of reaching across the fabric and back.

This is why the low-latency decode kernels exist as a separate class, why they are willing to burn 64 SMs to shave microseconds, and why decode is the phase where the mixture-of-experts toll hurts most.

It is also why the entire industry serves prefill and decode on separately tuned pools of hardware, a point this publication examined at length one issue ago and which the all-to-all only sharpens.

The decode penalty is worth making concrete, because it is where the toll is most counterintuitive. At a service level of a hundred tokens per second per user, the budget for generating one token is ten milliseconds, and into that budget the model must fit 58 mixture-of-experts layers, each with a dispatch and a combine that reach across the fabric.

Inside the NVLink domain a round trip is measured in microseconds and 58 of them fit with room to spare; across the InfiniBand fabric the same round trips, with their higher fixed latency, begin to eat the budget directly.

That is the trade the 64-SM decode kernel makes, and it is why a decode deployment forced to leave the NVLink domain for its all-to-all can miss its latency target even when its aggregate bandwidth looks adequate on paper. In decode, latency is the currency, and the fabric boundary is where it gets spent.
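The budget arithmetic is worth writing down. The service level and layer count come from the text; the two per-collective latencies are invented placeholders, chosen only to show how the same 116 collectives per token can fit comfortably or overrun the budget. The model assumes none of the communication is hidden behind compute, which is the worst case.

```python
TOKENS_PER_SECOND_PER_USER = 100
MOE_LAYERS = 58
COLLECTIVES_PER_LAYER = 2      # one dispatch and one combine

token_budget_us = 1_000_000 / TOKENS_PER_SECOND_PER_USER
per_layer_budget_us = token_budget_us / MOE_LAYERS

def exposed_comm_share(latency_us: float) -> float:
    """Share of the per-token budget eaten by unhidden collectives."""
    return MOE_LAYERS * COLLECTIVES_PER_LAYER * latency_us / token_budget_us

print(f"budget per token: {token_budget_us:.0f} us")
print(f"budget per layer: {per_layer_budget_us:.0f} us")

# Hypothetical fixed latencies per collective, for illustration only.
for label, latency_us in [("in-domain", 10.0), ("cross-fabric", 100.0)]:
    share = exposed_comm_share(latency_us)
    print(f"{label:12s} {latency_us:5.0f} us per collective -> {share:.0%} of the budget")
```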

The hottest expert sets the clock

There is a failure mode hiding inside the all-to-all that the bandwidth numbers do not capture at all, and it is the one that most often separates a deployment hitting its theoretical throughput from one falling well short of it. An all-to-all is a synchronization barrier.

The combine cannot complete until every expert has returned its outputs, which means the slowest expert on the most overloaded device sets the pace for the entire operation. If the gating network sends a disproportionate share of tokens to a handful of popular experts, the devices holding those experts become stragglers, and every other device in the domain waits on them.

Expert load is not uniform in practice, and it is not even stable. Certain experts specialize in patterns that appear frequently in real traffic, and the imbalance shifts with the workload. Survey work documents the consequence plainly: imbalanced token distribution causes device underutilization, and the whole expensive all-to-all runs at the speed of its hottest path.

A mixture-of-experts deployment can have perfectly adequate aggregate bandwidth and still bleed throughput because the load is lumpy.

DeepSeek’s answer in production is an expert-parallel load balancer that the community has reproduced under the name EPLB. The mechanism is to identify the high-load experts from live deployment statistics and replicate them: a hot expert is duplicated onto multiple devices so that the tokens destined for it can be spread, flattening the straggler. This is a direct trade of memory for balance.

You spend extra capacity holding redundant copies of the popular experts in order to keep the all-to-all from stalling on them. It works, and it is now standard, but it is another line on the bill that the sparsity story did not mention, and it interacts with the deployment topology in a way that is worth seeing concretely.
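A toy version of the idea fits in a screen of code. This is a simplified greedy sketch, not DeepSeek's EPLB algorithm: the function names, the loads, and the slot counts are all invented. It replicates whichever expert currently has the highest load per replica, then places replicas heaviest-first on the least-loaded device that has a free slot and does not already hold a copy.

```python
def replicate(loads, extra):
    """Hand out `extra` replicas, each to the expert with the highest load per replica."""
    replicas = [1] * len(loads)
    for _ in range(extra):
        hot = max(range(len(loads)), key=lambda e: loads[e] / replicas[e])
        replicas[hot] += 1
    return replicas

def place(loads, replicas, num_devices, slots):
    """Greedy placement; returns the token load that lands on each device."""
    items = sorted(
        ((loads[e] / replicas[e], e) for e in range(len(loads)) for _ in range(replicas[e])),
        reverse=True,
    )
    device_load = [0.0] * num_devices
    held = [set() for _ in range(num_devices)]
    for load, expert in items:
        candidates = [d for d in range(num_devices)
                      if len(held[d]) < slots and expert not in held[d]]
        d = min(candidates, key=lambda dev: device_load[dev])
        device_load[d] += load
        held[d].add(expert)
    return device_load

def imbalance(device_load):
    """Hottest device relative to the mean; 1.0 is perfectly flat."""
    return max(device_load) / (sum(device_load) / len(device_load))

# Invented token counts per expert: expert 0 is the hot one.
loads = [400, 100, 100, 100, 100, 100, 50, 50]

before = imbalance(place(loads, [1] * len(loads), num_devices=4, slots=2))
after = imbalance(place(loads, replicate(loads, extra=2), num_devices=4, slots=3))
print(f"imbalance without replication: {before:.2f}")
print(f"imbalance with two redundant copies: {after:.2f}")
```

Since the step finishes only when the hottest device does, the imbalance factor is roughly the slowdown relative to a perfectly flat distribution. The extra slot per device in the second run is the memory that buys the improvement.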

DeepSeek runs the same model checkpoint as two physically different machines, one for each phase, and the contrast is the clearest illustration in the field of how the all-to-all reshapes a deployment. According to DeepSeek’s own published inference overview and the CloudMatrix serving analysis that reconstructs it, the prefill machine groups four nodes, 32 GPUs, into a single unit running 32-way expert parallelism alongside 32-way data parallelism.

Across those 32 GPUs the routed experts are distributed nine to a device once the redundant copies of the popular experts are counted, with the shared expert and the attention mechanism replicated on every one. The raw figure would be eight; the ninth is the load balancer at work.

The decode machine expands the same model to 18 nodes, 144 GPUs, running 144-way expert parallelism and 144-way data parallelism, where each device holds only about two routed experts.

Two-panel chart. Left panel: GPUs in one expert-parallel domain, 32 for prefill versus 144 for decode. Right panel: routed experts per GPU, 8 for prefill versus 1.78 for decode, each plus one shared expert. — **FIG 5** One checkpoint, two machines. The decode deployment spreads the experts across more than four times as many GPUs, which is partly about latency and partly about leaving room to replicate the hot experts.
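The per-device counts follow from dividing 256 routed experts by the width of the domain. The sketch below also backs out how many redundant copies the quoted slot counts would imply. For decode that step assumes each device holds exactly two routed experts, where the text says only "about two".

```python
ROUTED_EXPERTS = 256

def experts_per_gpu(gpus: int, redundant: int = 0) -> float:
    """Routed experts held per device, counting any redundant copies."""
    return (ROUTED_EXPERTS + redundant) / gpus

def implied_redundant(gpus: int, slots_per_gpu: int) -> int:
    """Redundant copies implied if every device fills `slots_per_gpu` routed slots."""
    return gpus * slots_per_gpu - ROUTED_EXPERTS

prefill_raw = experts_per_gpu(32)
decode_raw = experts_per_gpu(144)
print(f"prefill, 32 GPUs: {prefill_raw:.2f} raw, "
      f"{implied_redundant(32, 9)} redundant copies at nine slots per device")
print(f"decode, 144 GPUs: {decode_raw:.2f} raw, "
      f"{implied_redundant(144, 2)} redundant copies at two slots per device (assumed)")
print(f"decode domain is {144 / 32:.1f}x wider than prefill")
```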

Why spread the same 256 experts across 144 devices for decode when 32 sufficed for prefill?
Two reasons, and both come back to the all-to-all.

The first is latency: with fewer experts resident per device, each device does less work per step and the decode latency target is easier to hit.
The second is precisely the straggler problem. Spreading thin leaves headroom to replicate the popular experts without overflowing any device’s memory, so the load balancer has somewhere to put the redundant copies.

The decode machine is wider not because the math demands it but because the communication and the balance do. The shape of the deployment is dictated by the toll, not the FLOPs.

The toolchain the toll demanded

The all-to-all did not only reshape the hardware and the kernels. It pulled an entire toolchain into being around itself, and the size of that toolchain is the clearest measure of how far the cost migrated from the math.

A 2026 mixture-of-experts serving stack at frontier scale is not a model and a runtime. It is a model, a communication library, a grouped-GEMM library, an expert load balancer, a disaggregation layer, and an overlap scheduler, each of which exists to manage some facet of the routing tax. The FLOP count described one of those six boxes.

Consider the overlap problem at the level of an entire forward pass rather than a single layer. Hiding the all-to-all behind computation works within a layer, but the decode phase is so latency-sensitive that the field has gone further and split each batch in two, running the communication of one half against the computation of the other in a continuous pipeline.

SGLang’s two-batch overlap and the analogous schemes in other runtimes exist for one reason: to keep the expensive accelerators busy with expert math while the all-to-all of a different microbatch is in flight. It is the same instinct as the kernel-level hooks, lifted to the level of the request scheduler, and it is now a standard part of large-scale deployments rather than an exotic optimization.
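The effect of splitting a batch in two can be seen in a small scheduling model. It is a sketch, not SGLang's scheduler: the function is hypothetical, the durations are invented, and it makes the optimistic assumption that halving a batch halves both its compute and its communication time. Compute units and the fabric are each modeled as a resource that serves one microbatch at a time.

```python
def makespan_us(num_layers: int, compute_us: float, comm_us: float,
                microbatches: int) -> float:
    """Finish time when each microbatch alternates compute and communication per layer."""
    compute_free = 0.0               # when the compute units are next available
    fabric_free = 0.0                # when the fabric is next available
    ready = [0.0] * microbatches     # when each microbatch can start its next stage
    for _ in range(num_layers):
        for mb in range(microbatches):
            start = max(ready[mb], compute_free)
            compute_free = start + compute_us
            ready[mb] = compute_free
        for mb in range(microbatches):
            start = max(ready[mb], fabric_free)
            fabric_free = start + comm_us
            ready[mb] = fabric_free
    return max(ready)

LAYERS = 58
# Invented per-layer durations for the full batch, in microseconds.
FULL_COMPUTE_US, FULL_COMM_US = 100.0, 100.0

one_batch = makespan_us(LAYERS, FULL_COMPUTE_US, FULL_COMM_US, microbatches=1)
two_batch = makespan_us(LAYERS, FULL_COMPUTE_US / 2, FULL_COMM_US / 2, microbatches=2)
print(f"one batch, no overlap: {one_batch:.0f} us")
print(f"two-batch overlap:     {two_batch:.0f} us")
```

With compute and communication evenly matched, the two-batch schedule approaches half the serial time, because one half is always computing while the other is on the wire. The gain shrinks as the two durations diverge, and in practice small batches run less efficiently than the linear-scaling assumption suggests.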

Disaggregation adds a second communication problem on top of the all-to-all. Once prefill and decode run on separate pools of hardware, the key-value cache computed during prefill has to be shipped to the decode pool before generation can begin, and at frontier scale that transfer is large enough and frequent enough to need its own engine.

The Mooncake transfer engine and the equivalent layers inside vLLM and SGLang exist to move key-value caches across the network efficiently, overlapping the transfer with computation so the handoff does not stall the pipeline. This is a network tax distinct from the all-to-all, and it is the price of the prefill-decode split that the all-to-all economics make worthwhile in the first place.

The two taxes are siblings: both are consequences of spreading one model’s inference across many devices, and both are paid down by the same instinct of overlapping transfer with compute.

The lesson in the length of that list is that the sparsity bargain did not merely move the cost to the network. It moved the cost to a place where extracting good performance requires assembling and tuning half a dozen interacting systems, any one of which, misconfigured, hands the savings back.

The vLLM and SGLang playbooks both carry warnings to this effect, and AMD’s ROCm guide to the vLLM mixture-of-experts options is blunt that the wrong combination of tensor, data, pipeline, and expert parallelism can duplicate the key-value cache many times over and consume far more memory than expected.

The FLOP count said the model got cheaper. The operations manual says it got more complicated, and the complication is where a large part of the real cost now lives.

Does wide expert parallelism pay for itself?

All of this is overhead, and the natural reaction to a catalogue of overhead is to minimize it. If the all-to-all is the cost, why not keep the expert-parallel domain small, so the all-to-all stays inside a tight, fast group of devices?

The answer is that narrowing the domain trades one cost for another, and the trade does not run in the obvious direction.

Wider expert parallelism, counterintuitively, often produces more throughput per GPU, not less, and understanding why is the crux of whether the whole approach earns its keep.

The mechanism is expert packing. When experts are spread across more devices, each device holds fewer of them, which means more of each device’s memory and compute can be devoted to the batch of tokens currently being processed rather than to holding a large slice of the model.

Larger effective batches per device improve the arithmetic intensity of the expert matrix multiplications, the kernels run closer to the hardware’s peak, and the per-GPU throughput rises, provided the all-to-all overhead can be kept hidden behind that larger computation. The question is always whether the communication grows faster than the packing benefit, and up to a point, on the right interconnect, it does not.

NVIDIA’s measurements on the GB200 NVL72 quantify the dividend directly. Moving from an eight-way expert-parallel configuration to a 32-way one delivers up to 1.8 times the output token throughput per GPU, at a fixed service level of a hundred tokens per second per user, with disaggregated serving and multi-token prediction in both cases.

Same hardware, same latency target, nearly double the per-GPU output, purely from going wider on expert parallelism.

Bar chart of output tokens per second per GPU, normalized to EP8 at 100: EP8 at 100, EP32 at 180, showing 1.8 times the per-GPU throughput from wider expert parallelism. **FIG 6** NVIDIA’s Wide-EP figures on the NVL72. Going wider improves per-GPU throughput, because the packing benefit outweighs the added all-to-all, as long as the all-to-all stays inside the NVLink domain.

The decisive qualifier is the last clause. The 1.8 times holds because the 32-way all-to-all stays inside the NVL72’s NVLink domain, where Figure 2 says it runs at 726 gigabytes per second.

The dividend exists because the wire is fast enough that going wider does not push the communication off the cliff. Try the same widening on a cluster where 32-way expert parallelism forces the all-to-all across InfiniBand, and the calculus inverts: the packing benefit is swamped by the eightfold bandwidth penalty of leaving the domain, and wider becomes worse.

This is the same fact from a different angle. The reason the rack-scale NVLink domain is worth its price is that it is what makes the wide-EP dividend positive instead of negative.
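The size of the cliff can be put in microseconds with a bandwidth-only model. The sketch below uses the DeepEP dispatch bandwidths cited in this issue (726 GB/s inside the NVLink domain, about 90 GB/s across internode RDMA) and the 56 KB per-token dispatch payload derived from V3 geometry. The 4,096-token batch is a hypothetical figure, and the model ignores latency, overlap with compute, and any local-rank shortcut.

```python
DISPATCH_BYTES_PER_TOKEN = 7168 * 1 * 8   # hidden 7168, FP8 (1 byte), top-8 -> 56 KiB

NVLINK_GB_S = 726   # dispatch bandwidth inside the NVLink domain
RDMA_GB_S = 90      # approximate dispatch bandwidth across internode RDMA


def dispatch_time_us(tokens: int, bandwidth_gb_s: float) -> float:
    """Bandwidth-only transfer time for one layer's dispatch, in microseconds."""
    seconds = tokens * DISPATCH_BYTES_PER_TOKEN / (bandwidth_gb_s * 1e9)
    return seconds * 1e6


tokens = 4096  # hypothetical number of tokens dispatched by one GPU
inside = dispatch_time_us(tokens, NVLINK_GB_S)
outside = dispatch_time_us(tokens, RDMA_GB_S)
print(f"inside domain:  {inside:8.1f} us per layer")
print(f"outside domain: {outside:8.1f} us per layer")
print(f"penalty: {outside / inside:.1f}x")
```

The penalty is simply the bandwidth ratio, a little over eight, and it applies to every mixture-of-experts layer on every step, which is why the packing benefit cannot absorb it.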

There is a second lever working alongside the width, and it appears in nearly every published wide-EP result: multi-token prediction. Rather than generating one token per forward pass, the model proposes several and verifies them together, which raises the number of tokens flowing through each all-to-all and pushes the decode kernel out of its worst, smallest-batch regime toward something the bandwidth can amortize.

Multi-token prediction and wide expert parallelism are complementary for the same underlying reason: both increase the work done per round trip across the fabric, and the all-to-all rewards anything that makes its fixed latency a smaller fraction of the whole.

The dividend in Figure 6 is partly a multi-token-prediction dividend, which is why NVIDIA and SGLang report the two together. They are deployed together because they solve the same problem from two directions.
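The amortization argument can be sketched with a two-term cost model: a fixed latency per all-to-all round trip plus a transfer time proportional to the tokens in flight. The 50-microsecond fixed latency is a hypothetical placeholder, not a measured DeepEP figure; the 168 KB per-token payload and the 726 GB/s bandwidth are the numbers used elsewhere in this issue, with the slightly faster combine leg rounded down to the dispatch figure.

```python
BYTES_PER_TOKEN = 168 * 1024   # dispatch plus combine per token per layer
BANDWIDTH_GB_S = 726           # NVLink-domain bandwidth, applied to both legs


def per_token_cost_us(tokens_in_flight: int, fixed_latency_us: float = 50.0) -> float:
    """All-to-all time charged to each token for one round trip, in microseconds."""
    transfer_us = tokens_in_flight * BYTES_PER_TOKEN / (BANDWIDTH_GB_S * 1e9) * 1e6
    return (fixed_latency_us + transfer_us) / tokens_in_flight


# 32 sequences per GPU; multi-token prediction adds draft tokens to each step.
for tokens_per_sequence in (1, 2, 4):
    n = 32 * tokens_per_sequence
    print(f"{n:4d} tokens per round trip -> {per_token_cost_us(n):.2f} us per token")
```

The transfer term per token is constant; only the fixed-latency term shrinks as more tokens share a round trip. Draft tokens that fail verification still pay for their trip, so the real gain depends on the acceptance rate, which this sketch does not model.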

So the answer to whether wide expert parallelism pays for itself is conditional, and the condition is the interconnect. Inside a sufficiently large fast domain, wider is genuinely better and the measurements show it. Outside one, wider is a trap. The crossover sits at the boundary of the NVLink domain, which is why the size of that domain (8 GPUs yesterday, 72 today, the same 72 at higher bandwidth tomorrow) is the number that determines how far the dividend extends.

Expert parallelism and the interconnect are not two separate decisions. They are one decision, and the hardware vendor has been making half of it for you.

How much is silicon, and how much is numerics

It is tempting to attribute the throughput of a Blackwell mixture-of-experts deployment to the silicon, and the marketing encourages it, but the public measurements let us decompose the uplift, and the decomposition is instructive about where the real leverage sits.

The LMSYS and SGLang teams have published a careful progression of DeepSeek serving results on the GB200 NVL72, and the numbers are specific.

With disaggregated prefill and decode, large-scale expert parallelism, and the conservative numeric configuration of sixteen-bit attention and eight-bit experts, SGLang reaches 18,471 input tokens per second per GPU on prefill and 9,087 output tokens per second per GPU on decode, for two-thousand-token sequences.

Switch to the aggressive configuration, eight-bit attention and four-bit NVFP4 experts, and the same system reaches 26,156 input and 13,386 output tokens per second per GPU. Against the H100 baseline the teams report, those aggressive numbers represent a 3.8 times prefill and 4.8 times decode improvement.

Grouped bar chart of tokens per second per GPU for prefill and decode across three configurations: H100 baseline at 6883 prefill and 2789 decode, GB200 with BF16 attention and FP8 experts at 18471 and 9087, and GB200 with FP8 attention and NVFP4 experts at 26156 and 13386, marked as 3.8 times and 4.8 times the baseline. **FIG 7** The Blackwell uplift, decomposed. A large share of the gain over the conservative GB200 configuration comes from dropping the experts to four-bit NVFP4, not from the silicon alone.

The decomposition is the point. The jump from the H100 baseline to the conservative GB200 configuration is the hardware: faster tensor cores, the NVLink domain, more memory bandwidth. But the further jump from the conservative to the aggressive GB200 configuration, from 18,471 to 26,156 on prefill and from 9,087 to 13,386 on decode, is numerics.

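It comes from the lower-precision configuration, chiefly running the experts in four-bit NVFP4 rather than eight-bit, with attention also dropping from sixteen-bit to eight-bit. That is a software-and-format change applied to the same rack, worth roughly 1.4 times on prefill and nearly 1.5 times on decode, and it accounts for a substantial fraction of the total uplift over H100.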
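The decomposition can be reproduced from the figures above. The sketch below splits each total speedup into a hardware factor (H100 to conservative GB200) and a numerics factor (conservative to aggressive GB200), then attributes a share to numerics on a logarithmic scale. The log-share attribution is our choice for multiplicative gains, not something the LMSYS report computes, and the H100 baseline is the back-calculated one from Figure 7.

```python
import math

# Tokens per second per GPU.
H100 = {"prefill": 6883, "decode": 2789}          # back-calculated baseline
GB200_FP8 = {"prefill": 18471, "decode": 9087}    # BF16 attention, FP8 experts
GB200_FP4 = {"prefill": 26156, "decode": 13386}   # FP8 attention, NVFP4 experts


def decompose(phase: str):
    """Return (hardware factor, numerics factor, total factor, numerics log-share)."""
    hardware = GB200_FP8[phase] / H100[phase]
    numerics = GB200_FP4[phase] / GB200_FP8[phase]
    total = hardware * numerics
    share = math.log(numerics) / math.log(total)
    return hardware, numerics, total, share


for phase in ("prefill", "decode"):
    hardware, numerics, total, share = decompose(phase)
    print(f"{phase}: hardware {hardware:.2f}x, numerics {numerics:.2f}x, "
          f"total {total:.2f}x, numerics share {share:.0%}")
```

On this accounting, the precision change contributes about a quarter of the total uplift in both phases. Note that the numerics step bundles two changes, eight-bit attention and four-bit experts, and the published figures do not separate them.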

NVFP4 earns its own treatment, and it is a strong candidate for a future issue, but the relevant fact here is why it interacts so favorably with the all-to-all. Four-bit experts are half the bytes of eight-bit experts, which directly shrinks the dispatch leg of the toll, and they double the tensor-core throughput of the expert math itself, so the computation that hides the all-to-all gets faster at the same time the all-to-all gets smaller.

NVIDIA’s format reportedly holds accuracy within about one percent of the higher-precision baseline on large models through a two-level scaling scheme, and the accuracy holds up best precisely on the large mixture-of-experts models where it matters most. The format is, in effect, a second lever on the same toll that the interconnect attacks, and the two compound.
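For readers who want a feel for what two-level scaling means, here is a minimal pure-Python sketch. It follows the NVFP4 layout as we read NVIDIA's public description: four-bit E2M1 values, one scale per block of 16 elements, and one scale per tensor. It is a simplification and not NVIDIA's implementation: the block scale is kept as an ordinary float rather than rounded to FP8, rounding is plain nearest-value, and the function names are ours.

```python
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes a 4-bit E2M1 value can hold
BLOCK = 16          # elements sharing one block scale
E4M3_MAX = 448.0    # largest finite FP8 E4M3 value, the range of a block scale
EFFECTIVE_BITS = 4 + 8 / BLOCK   # 4-bit value plus one 8-bit scale per 16 elements


def quantize_two_level(values):
    """Return (tensor_scale, block_scales, codes); codes are signed grid values."""
    amax = max(abs(v) for v in values) or 1.0
    # Level 1: one scale per tensor, chosen so every block scale fits in E4M3 range.
    tensor_scale = amax / (E2M1_GRID[-1] * E4M3_MAX)
    block_scales, codes = [], []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        block_amax = max(abs(v) for v in block)
        # Level 2: one scale per block, stored relative to the tensor scale
        # (kept as a float here; the real format rounds it to FP8).
        scale = (block_amax / E2M1_GRID[-1]) / tensor_scale or 1.0
        block_scales.append(scale)
        for v in block:
            target = abs(v) / (scale * tensor_scale)
            magnitude = min(E2M1_GRID, key=lambda g: abs(g - target))
            codes.append(magnitude if v >= 0 else -magnitude)
    return tensor_scale, block_scales, codes


def dequantize(tensor_scale, block_scales, codes):
    """Rebuild approximate values from codes and the two levels of scale."""
    return [c * block_scales[i // BLOCK] * tensor_scale for i, c in enumerate(codes)]


weights = [((i * 37) % 101 - 50) / 7.0 for i in range(64)]
restored = dequantize(*quantize_two_level(weights))
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{EFFECTIVE_BITS} bits per element, worst absolute error {worst:.3f}")
```

The per-block scale is what lets a coarse eight-level grid track local magnitude, and the per-tensor scale keeps those block scales within a narrow range. At 4.5 effective bits per element, the payload is a little over half that of an eight-bit format.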

This is also why NVIDIA can credibly claim a fivefold reduction in cost per token from software optimization alone in the two months after Blackwell’s launch, with no hardware change: a large part of that was kernel and format work on exactly these operations.

One dollar, or twenty cents

The throughput numbers are engineering. The reason they matter is that they convert, almost directly, into the only number a serving operator actually cares about, which is dollars per million tokens. And here the all-to-all moves from being a technical concern to being the dominant line item in the unit economics.

The cleanest demonstration in the public record is the LMSYS deployment of DeepSeek on 96 H100 GPUs, twelve nodes of eight, using prefill-decode disaggregation and large-scale expert parallelism with the full DeepEP, DeepGEMM, and EPLB stack.

That deployment reached 52,300 input tokens per second and 22,300 output tokens per second per node, and when the team translated the throughput into cost, it came to twenty cents per million output tokens. That figure is roughly one-fifth of what DeepSeek’s own public API charged at the time, achieved on rented hardware by an outside team reproducing the architecture.

The comparison that matters most, though, is the one against the naive alternative on identical hardware. The same report states that the optimized expert-parallel strategy improved output throughput by up to five times over vanilla tensor parallelism using the same resources. Five times the throughput on the same GPUs is one-fifth the cost per token.

The all-to-all engineering, getting the dispatch and combine to run efficiently inside the fast domain, hiding the latency behind computation, balancing the hot experts, is the entire difference between a deployment at twenty cents and a deployment at a dollar.
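The conversion from throughput to dollars is short enough to write down. The sketch below assumes a rental rate of two dollars per GPU-hour, which is our assumption, chosen because it roughly reproduces the reported twenty cents; it also charges only the nodes producing output tokens and ignores prefill capacity, so it is a back-of-envelope check rather than the report's own accounting.

```python
def cost_per_million_tokens(tokens_per_sec_per_node: float,
                            gpus_per_node: int = 8,
                            usd_per_gpu_hour: float = 2.0) -> float:
    """US dollars per million output tokens for one node at a given throughput."""
    tokens_per_hour = tokens_per_sec_per_node * 3600
    node_usd_per_hour = gpus_per_node * usd_per_gpu_hour
    return node_usd_per_hour / tokens_per_hour * 1e6


tuned = cost_per_million_tokens(22_300)         # reported output throughput per node
vanilla = cost_per_million_tokens(22_300 / 5)   # one-fifth the throughput, same node
print(f"tuned expert parallelism: ${tuned:.2f} per 1M output tokens")
print(f"vanilla tensor parallel:  ${vanilla:.2f} per 1M output tokens")
```

Because the hardware bill per hour is fixed, cost per token is inversely proportional to throughput, so the five-fold throughput gap is a five-fold cost gap whatever rental rate is plugged in.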

Horizontal bar chart of US dollars per million output tokens: vanilla tensor parallel on 96 H100 at one dollar, official DeepSeek API as a reference at one dollar, and PD plus large-scale expert parallelism self-hosted on 96 H100 at twenty cents, five times cheaper on identical hardware. **FIG 8** Same 96 GPUs, two ways of organizing them. The five-fold gap between vanilla tensor parallelism and tuned expert parallelism is, almost entirely, the all-to-all done well versus done naively.

Put that five-fold against the backdrop of where inference pricing has gone, and the stakes of the routing tax become clear. The price of frontier-class inference has fallen roughly fifty-fold in a little over three years, from around twenty dollars per million tokens for GPT-4-class output in late 2022 to roughly forty cents in early 2026.

Public trackers attribute the collapse to four compounding forces, and mixture-of-experts together with expert parallelism is explicitly one of them, alongside hardware efficiency, kernel and compiler optimization, and low-precision formats. Inference now consumes roughly two-thirds of all AI compute, having crossed over from a minority of it only a couple of years ago.

In that environment a five-fold cost difference is not a margin to be optimized later. It is the difference between a viable serving business and an unviable one.

Line chart on a log scale of US dollars per million tokens for GPT-4-class output from 2022 to 2026: 20 dollars in late 2022, 5 in 2023, 2 in 2024, 0.8 in 2025, and 0.4 in early 2026, about a fifty-fold decline, with drivers listed as hardware, kernels, mixture-of-experts plus expert parallelism, and four-bit formats. **FIG 9** The price floor that makes a routing tax of cents per million tokens worth a flagship. Expert parallelism is one of the four named drivers of this curve, not a footnote to it.

Making the domain bigger than the problem

Step back from the individual numbers and a single strategic motion organizes all of them. The mixture-of-experts architecture created a communication problem.

The hardware industry’s response has been to make the fast communication domain large enough to swallow the problem whole, and the trajectory of that response is the most reliable predictor of where serving economics go next.

DeepSeek’s own engineers, in their published reflections on the hardware lessons of training V3, frame the future in exactly these terms. They call for the convergence of scale-up and scale-out, for precise low-precision compute units, and for innovations in low-latency communication fabrics.

Read against this issue, that is a wish list written by the people paying the all-to-all toll, addressed to the people who can make the domain bigger. The scale-up and scale-out convergence they ask for is precisely the elimination of the cliff in Figure 1: a world where crossing from one node to the next does not cost a factor of eight, because the fast domain has grown to encompass both.

NVIDIA is building toward exactly that, and is increasingly explicit that it is doing so for this reason. The NVL72 took the NVLink domain from 8 to 72. The NVLink Switch architecture is specified to reach 576 GPUs in a single non-blocking fabric. The Rubin generation lifts the per-GPU bandwidth again and ties the increase directly, in NVIDIA’s own framing, to the all-to-all needs of mixture-of-experts models.

Each step is sold, more openly than the last, as a larger container for the communication problem that sparsity created. The architecture and the interconnect are co-evolving, and the direction is set: the domain keeps growing, the cliff keeps receding, and the toll keeps shrinking as a fraction of the work, without ever quite reaching zero.

The domain cannot grow without limit, and the constraints on how far it can stretch are physical. NVLink at rack scale runs over copper, which is cheap and reliable but reaches only a couple of meters; pushing the domain past a single rack toward the 576-GPU fabric the switch silicon can address means either optical interconnect, with its added cost, power draw, and failure modes, or denser and hotter racks than the current design.

Power and cooling are already near the edge of what a standard data center hall delivers per rack, which is why the NVL72 is liquid-cooled and why each new generation leans harder on liquid. And the fault domain grows with the fabric, because a larger coherent domain is a larger blast radius for a single failure.

The trajectory is set toward bigger domains, but each expansion buys less headroom than the last against a wall of copper reach, power density, and fault tolerance that the all-to-all cannot argue its way past.

What this does not resolve is the dependency it creates. An operator who builds a serving business on wide expert parallelism is building on the assumption that the fast domain will keep growing, and that assumption ties the economics of the model layer to the roadmap of a single interconnect vendor.

The wide-EP dividend is real, but it is contingent on hardware that one company predominantly supplies, and the contingency is worth naming. The cheapest way to serve a frontier mixture-of-experts model in 2026 runs through a rack that is, for now, effectively sole-sourced.

That is a strategic fact about the inference market as much as a technical one, and it is the part of the story most likely to matter in the issues to come.

The dependency has not gone unanswered. An industry that has watched a single vendor’s interconnect become the determinant of mixture-of-experts economics has begun to organize alternatives.

The UALink consortium and the Ultra Ethernet effort are both attempts to build an open scale-up fabric that could host the all-to-all without routing through one company’s switches, and AMD’s serving stack now carries its own expert-parallel communication path, a port of the DeepEP ideas onto its accelerators.

None of these has yet demonstrated the rack-scale all-to-all bandwidth of an NVL72 in production, and the gap is real, but the direction of the effort is itself a measure of how much the all-to-all matters. An entire alternative-hardware ecosystem is organizing around the single operation that this issue is about.

There is also a cost that none of the throughput numbers capture, which is reliability. A 144-GPU decode deployment is one coordinated system, and the all-to-all is a synchronization barrier across all of it, which means a fault or a slowdown on any single device degrades the whole.

The larger the expert-parallel domain, the more devices have to stay healthy and in lockstep for the all-to-all to complete on time, and the operational burden of keeping a domain of that size running at frontier latency is substantial.

DeepSeek’s own diagnostic tooling for locating slow ranks in a DeepEP deployment exists because, at this scale, finding the one straggling device in a domain of hundreds is a routine and necessary operation.

The wide-EP dividend is real, but it is collected by operators who can keep a very large, very tightly coupled machine running, and that capability is a cost the smaller-domain alternatives never have to pay.
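The way the barrier compounds individual faults can be sketched with one line of probability. The per-GPU chance of being slow on a given step (0.1 percent) is a hypothetical number chosen only to show the shape of the curve, and the model assumes devices fail independently, which real clusters do not guarantee.

```python
def p_any_straggler(num_gpus: int, p_slow_per_gpu: float) -> float:
    """Probability that at least one device in the domain is slow on a given step.

    The all-to-all is a barrier, so one slow rank delays every rank.
    """
    return 1 - (1 - p_slow_per_gpu) ** num_gpus


for domain in (8, 72, 144):
    print(f"{domain:3d} GPUs: {p_any_straggler(domain, 0.001):.1%} of steps hit a straggler")
```

The chance of a clean step falls geometrically with domain size, so a fault rate that is invisible on one eight-GPU node becomes a routine event at 144 devices. That is the arithmetic behind tooling whose only job is to find the slow rank.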

What to actually do

The analysis resolves into a handful of decisions that an operator faces in practice, and they follow from the structure rather than from any single benchmark.

The first decision is whether to use expert parallelism at all, and the honest answer is that it depends entirely on whether your all-to-all can be kept inside a fast domain.

If you are serving a frontier mixture-of-experts model at scale and you have access to a rack-scale NVLink domain, wide expert parallelism is the right tool and the measurements say to go as wide as the domain allows, because the packing dividend is positive inside the fast fabric.

If your all-to-all would have to cross InfiniBand to go wider, stop widening before it does, because the cliff inverts the dividend. The boundary of the NVLink domain is the boundary of the decision.

The second decision is how to split the phases. Prefill and decode want different all-to-all kernels, different expert-parallel widths, and in DeepSeek’s production case different physical machines entirely. The decode machine should be wider, both to hit latency targets and to leave room for the load balancer to replicate hot experts.

If you cannot afford to disaggregate, the decode phase is where the toll will hurt, and the low-latency kernels are where to spend your tuning effort. The vLLM and SGLang playbooks both warn, correctly, that the wrong parallelism strategy can duplicate key-value caches across the domain and consume many times the memory you expected, so the parallelism decision is not only about the all-to-all but about what else it forces to be replicated.

The third decision is precision, and it is mostly free throughput if you are on Blackwell. Four-bit NVFP4 experts shrink the dispatch leg of the toll and double the expert math throughput at an accuracy cost that, on large models, is small. The aggressive configuration in Figure 7 is not a marginal tuning; it is a large fraction of the total uplift, and it attacks the same toll the interconnect attacks. If your hardware supports it and your accuracy budget allows it, it is among the highest-leverage changes available.

And the fourth decision is whether you need any of this at all. If your workload is single-user or small-scale, the KTransformers lesson stands: a mixture-of-experts model on a single node never pays the toll, and the entire apparatus of expert parallelism is overhead you can decline.

The all-to-all economics in this issue are the economics of serving at frontier scale and frontier latency. Below that scale, the right move is to keep the experts local and never pay the toll at all.

The deeper lesson is the one the sparsity story obscured for two years. Mixture-of-experts did not make inference cheaper by doing less work. It moved the work from a place that was easy to scale, the arithmetic, to a place that was hard, the network, and then the hardware industry spent two product generations and a great deal of money making the network easy to scale too.

The bargain was always real. It was just never free, and the bill was always going to come due on the wire. Knowing where it comes due, and how much, is most of what it takes to serve these models without overpaying.

The router decides which experts a token needs. The wire decides what that decision costs. For the models that now define the frontier, the wire is the more expensive of the two.

What we are confident about, and what we estimated

NVLink 5 delivers 1.8 TB/s per GPU; the GB200 NVL72 provides 130 TB/s aggregate all-to-all bandwidth across 72 GPUs, with 13.5 TB of unified HBM3e.
Source: NVIDIA GB200 NVL72 datasheet; NVIDIA multi-node NVLink tuning guide; Introl and Spheron interconnect analyses.

DeepEP measures dispatch and combine at 726 and 740 GB/s inside the NVLink domain on Blackwell, versus about 90 GB/s each across internode RDMA, on the published V3 workload.
Source: DeepEP V2 performance table, deepseek-ai/DeepEP repository.

DeepEP’s decode all-to-all consumes up to 64 SMs at peak throughput; the V2 rewrite cut training all-to-all SM use from 24 to between 4 and 6. A B200 has 148 SMs.
Source: DeepEP V2 performance table and release notes; Blackwell architecture specifications.

SGLang on the GB200 NVL72 reaches 26,156 prefill and 13,386 decode tokens/sec/GPU with eight-bit attention and NVFP4 experts, reported as 3.8x and 4.8x over H100; the conservative configuration reaches 18,471 and 9,087.
Source: LMSYS Org, GB200 NVL72 Part II, September 2025.

An LMSYS 96-GPU H100 deployment reached 52.3k input and 22.3k output tokens/sec/node and translated to $0.20 per 1M output tokens, about one-fifth the official API price, and up to 5x the throughput of vanilla tensor parallelism on the same hardware.
Source: LMSYS Org, large-scale EP on 96 H100, May 2025.

Moving from EP8 to EP32 yields up to 1.8x output throughput per GPU at a fixed 100 tok/s/user SLA on the NVL72, with disaggregated serving and multi-token prediction.
Source: NVIDIA, Wide Expert Parallelism on NVL72, January 2026.

DeepSeek-V3 runs DP32+EP32 across 32 GPUs for prefill (nine routed experts per GPU plus one shared, including one redundant) and DP144+EP144 across 144 GPUs for decode (about two routed experts per GPU plus one shared).
Source: DeepSeek Open Source Week inference system overview (Day 6); CloudMatrix serving analysis (arXiv 2506.12708). The V3 technical report describes a different decode configuration (EP320, one expert per GPU).

NVFP4 holds accuracy within roughly one percent of the higher-precision baseline on large models via two-level scaling, and accuracy recovery is strongest on the largest dense and MoE models.
Source: NVIDIA NVFP4 technical blogs; Red Hat AI NVFP4 evaluation.

A DeepSeek-V3 mixture-of-experts layer moves about 56 KB of dispatch (FP8, top-8) and 112 KB of combine (BF16, top-8) per token, roughly 168 KB total, against about 14 KB for a dense layer.
Source: Derived from V3 geometry (hidden 7168, top-8, FP8 dispatch, BF16 combine). Excludes the shared expert and any local-rank optimization.

The vanilla-tensor-parallel and official-API reference points of roughly $1.00 per 1M output tokens are derived from the LMSYS statements (optimized $0.20 figure at one-fifth of API, and 5x over vanilla TP).
Source: Derived from LMSYS 96-GPU report figures.

The H100 baseline in Figure 7 (6,883 prefill, 2,789 decode tokens/sec/GPU) is back-calculated from the reported 3.8x and 4.8x speedups, not independently measured.
Source: Derived from LMSYS GB200 Part II reported multipliers.

Frontier-class inference pricing has fallen roughly fifty-fold from about $20 to about $0.40 per 1M tokens from late 2022 to early 2026, with MoE plus expert parallelism among four named drivers.
Source: Public inference price trackers, 2022 to 2026. Order-of-magnitude trend across vendors, not a single price series.

Vera Rubin NVL72 is specified for roughly 3.6 TB/s per GPU and 260 TB/s aggregate, framed by NVIDIA as serving MoE all-to-all needs.
Source: NVIDIA NVLink product page and CES 2026 disclosures; pre-release specification subject to change.

A = primary or measured | B = single strong vendor or operator source | C = derived by us from sourced inputs | D = directional, treat as trend not point estimate.
Character scan: this issue contains zero em dashes and zero en dashes, verified programmatically against the rendered text.
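The derived per-token traffic figure in the table above follows directly from the stated V3 geometry, and the arithmetic is worth showing. The sketch uses only the inputs named there (hidden size 7168, top-8 routing, FP8 dispatch, BF16 combine) and, like the table entry, excludes the shared expert and any local-rank optimization. Kilobytes here are binary (1,024 bytes), which is the convention the 56, 112, and 14 figures imply.

```python
HIDDEN = 7168      # V3 hidden size
TOP_K = 8          # routed experts per token
FP8_BYTES = 1
BF16_BYTES = 2

dispatch = HIDDEN * FP8_BYTES * TOP_K    # one FP8 copy of the token to each of eight experts
combine = HIDDEN * BF16_BYTES * TOP_K    # one BF16 result back from each of eight experts
dense = HIDDEN * BF16_BYTES              # a single BF16 activation vector, for comparison


def kib(num_bytes: int) -> float:
    """Convert bytes to binary kilobytes."""
    return num_bytes / 1024


print(f"dispatch: {kib(dispatch):.0f} KB")
print(f"combine:  {kib(combine):.0f} KB")
print(f"total:    {kib(dispatch + combine):.0f} KB")
print(f"dense:    {kib(dense):.0f} KB")
print(f"ratio:    {(dispatch + combine) / dense:.0f}x")
```

Combine is twice dispatch only because it travels at sixteen bits instead of eight, which is why lower-precision formats act on the toll directly.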

Bibliography

DeepSeek-AI. DeepEP: an efficient expert-parallel communication library. GitHub repository, 2025. Performance table, V2 release notes, decode and prefill kernel interfaces. github.com/deepseek-ai/DeepEP
DeepSeek-AI. Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures. arXiv 2505.09343, 2025. arxiv.org/abs/2505.09343
LMSYS Org. Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs. May 2025. lmsys.org/blog/2025-05-05-large-scale-ep
LMSYS Org. Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP, Part I: 2.7x Higher Decoding Throughput. June 2025. lmsys.org/blog/2025-06-16-gb200-part-1
LMSYS Org. Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP, Part II: 3.8x Prefill, 4.8x Decode Throughput. September 2025. lmsys.org/blog/2025-09-25-gb200-part-2
LMSYS Org. SGLang and NVIDIA Accelerating SemiAnalysis InferenceMAX and GB200 Together. October 2025. lmsys.org/blog/2025-10-14-sa-inference-max
NVIDIA. Scaling Large MoE Models with Wide Expert Parallelism on NVL72 Rack-Scale Systems. NVIDIA Technical Blog, January 2026. developer.nvidia.com/blog
NVIDIA. GB200 NVL72 product page and datasheet. 130 TB/s NVLink domain, 72-GPU rack specifications. nvidia.com/en-us/data-center/gb200-nvl72
NVIDIA. Inside NVIDIA Blackwell Ultra: The Chip Powering the AI Factory Era. NVIDIA Technical Blog, January 2026. developer.nvidia.com/blog
NVIDIA. Introducing NVFP4 for Efficient and Accurate Low-Precision Inference. NVIDIA Technical Blog, 2025. developer.nvidia.com/blog
NVIDIA. The Economic Value of Inference Software Optimization at the Datacenter Level. April 2026. Fivefold cost-per-token reduction via software. perspectives.nvidia.com
NVIDIA. Multi-Node NVLink Systems Tuning Guide and NVLink / NVLink Switch product documentation. Fifth-generation NVLink and NVSwitch specifications. docs.nvidia.com; nvidia.com/en-us/data-center/nvlink
Microsoft. Achieving Optimal Performance for DeepSeek Expert Parallelism (DeepEP) on Azure. Azure HPC Blog, May 2025. techcommunity.microsoft.com
AMD ROCm. The vLLM MoE Playbook: A Practical Guide to TP, DP, PP and Expert Parallelism. November 2025. rocm.blogs.amd.com
Taming the Titans: A Survey of Efficient LLM Inference Serving. arXiv 2504.19720, 2025. All-to-all as the MoE bottleneck; expert load balancing. arxiv.org/abs/2504.19720
Serving Large Language Models on Huawei CloudMatrix384. arXiv 2506.12708, 2025. DeepSeek DP32+EP32 prefill and DP144+EP144 decode topology. arxiv.org/abs/2506.12708
DeepSeek-AI. DeepSeek-V3/R1 Inference System Overview (Open Source Week, Day 6). February 2025. Production prefill EP32 (9 experts/GPU) and decode EP144 (2 experts/GPU) topology. github.com/deepseek-ai/open-infra-index
Introl. NVLink and Scale-Up Networking. 2026. Scale-up versus scale-out bandwidth ratio; NVL72 physical architecture. introl.com/blog
DigitalOcean. The LLM Inference Trilemma: Throughput, Latency, Cost. April 2026. MoE cost as a game of communication. digitalocean.com/blog
GPUnex. AI Inference Economics: The 1,000x Cost Collapse Reshaping GPUs. February 2026. Inference price trend and drivers. gpunex.com/blog
NVIDIA. NVLink and NVLink Switch, Vera Rubin NVL72 and NVLink 6 disclosures. CES 2026. 260 TB/s aggregate, MoE all-to-all framing. nvidia.com/en-us/data-center/nvlink
