Mastering CUDA and High-Performance Computing, Part IX

Lorenzo Bradanini and Lorenzo Tettamanti

Apr 21, 2026

Where Part VIII Left Us

Part VIII ended at something close to a philosophical statement: the SMSP on a well-tuned Hopper GEMM kernel is a machine that does one thing. Everything else has been delegated.

That is true, and it is beautiful, and it is also irrelevant the moment the model you are trying to train does not fit in 80 GB of HBM3.

GPT-3 has 175 billion parameters. In BF16 that is 350 GB of weights alone, before you add optimizer state, activations, and gradients. A single H100 has 80 GB. You need at least five of them just to hold the parameters, and in practice you need significantly more to make training feasible.

At this point the single-GPU roofline model, with its ridge points, its arithmetic intensity calculations, its tensor core utilization percentages, becomes necessary but not sufficient.

You need a new abstraction layer that sits above the GPU and treats a rack, or a pod, or a datacenter, as the compute substrate.

This part is about that layer. We will cover tensor parallelism, pipeline parallelism, data parallelism, and the collective communication primitives that tie them together.

We will look at NCCL, at NVLink topology and how it interacts with bandwidth requirements, and at the specific arithmetic of why certain parallelism strategies work and others do not at scale.

We will go into great detail. Tighten your seatbelts.

The memory wall has not gone away

Before we talk about parallelism strategies, we need to internalize what “model doesn’t fit” actually means, quantitatively.

A transformer with P parameters trained in mixed precision requires, at minimum:

2P bytes for the model weights in BF16
4P bytes for the master weights in FP32 (kept by the optimizer for numerical stability)
8P bytes for the Adam optimizer states (m and v vectors, both FP32)
2P bytes for the gradients in BF16

Total: 16P bytes in the steady state, not counting activations.

For a 70B parameter model (Llama 3 scale), this is 1,120 GB; fourteen H100s worth of HBM just for the optimizer state. This is not a pathological edge case; this is the routine reality of training frontier models.

Inference is cheaper (you do not need optimizer state) but for a 405B parameter model in FP8, you are still looking at 405 GB, spread across at minimum six H100 80GB instances, with careful attention to how the tensor operations are partitioned so that no single GPU computes a matrix multiply that requires moving activations larger than the HBM capacity.

The problem is therefore not just “how do we make one GPU fast” but “how do we decompose a computation that is too large for one GPU into pieces that run efficiently on many GPUs, with the communication overhead between those pieces small enough that the multi-GPU system achieves a meaningful fraction of the theoretical sum of its parts.”

That fraction has a name: parallel efficiency. Getting it above 0.5 for a thousand-GPU training run is hard. Getting it above 0.8 is a research problem.

Getting it above 0.9 is what separates companies that can train frontier models economically from companies that cannot.

Three orthogonal dimensions of parallelism

The standard taxonomy, established empirically by the Megatron-LM work at NVIDIA and subsequently refined, identifies three orthogonal axes along which a transformer training job can be parallelized.

Data Parallelism (DP): Replicate the model across N GPUs, partition the training batch into N micro-batches, run a forward and backward pass independently on each GPU, and then average the gradients across all N replicas. Every GPU holds the full model. The communication pattern is a single all-reduce over the gradient tensors after each backward pass.

Tensor Parallelism (TP): Partition individual weight matrices across N GPUs, so that each GPU holds a 1/N shard of each matrix. A single matrix multiply that would require, say, a 4096×16384 GEMM on one GPU instead requires a 4096×(16384/N) GEMM on each of the N GPUs, followed by a collective to reassemble the result.

The communication pattern is tightly coupled to the forward pass; an all-reduce (or all-gather + reduce-scatter) at every layer boundary.

Pipeline Parallelism (PP): Partition the layers of the model across N GPUs, so that GPU 0 holds layers 1–L/N, GPU 1 holds layers L/N+1 through 2L/N, and so on. A micro-batch flows through the pipeline sequentially.

The communication pattern is a point-to-point activation transfer between adjacent stages, one per micro-batch per layer boundary.

These three dimensions compose. The Megatron-LM paper that trained GPT-3-scale models on A100 clusters used all three simultaneously: 8-way TP within a node (exploiting NVSwitch), 4-way PP across nodes, and data parallelism across node groups.

The product is the total GPU count: 8 × 4 × D = total GPUs, where D is the data parallel degree.

Understanding why this particular combination was chosen, and not, say, 64-way TP with no PP, requires understanding the communication topology and the arithmetic of collective operations.

That is what the rest of this part is about.

NVLink and NVSwitch

Communication between GPUs can happen over two physical fabrics: PCIe and NVLink. The performance difference between them is not small.

PCIe 4.0 x16 provides 32 GB/s unidirectional bandwidth. PCIe 5.0 x16 doubles that to 64 GB/s. These numbers sound reasonable until you compare them to what you actually need during all-reduce.

NVLink 4.0 (H100) provides 900 GB/s bidirectional bandwidth per GPU in NVLink-connected configurations: that is 450 GB/s in each direction. This is roughly 7× better than PCIe 5.0 in each direction, and the real-world benefit is larger because NVLink latency is also significantly lower.

But “NVLink-connected” hides a critical topological detail. NVLink connects pairs of GPUs (or GPUs through an NVSwitch). The DGX H100 system has 8 GPUs connected through NVSwitch 3.0, which provides full all-to-all connectivity at 900 GB/s per GPU.

This means any GPU can communicate with any other GPU in the same node at full bandwidth simultaneously. The NVSwitch acts as a non-blocking switch fabric.

Across nodes, the picture changes entirely. Multi-node communication happens over InfiniBand (HDR or NDR), with typical all-reduce bandwidth of 25–50 GB/s per GPU depending on topology and rail configuration: roughly 10–20× slower than intra-node NVLink.

This 10–20× bandwidth gap between intra-node and inter-node communication is the single most important physical fact for understanding why multi-GPU parallelism is structured the way it is.

The implication is immediate: communication-heavy parallelism strategies (like tensor parallelism, which requires an all-reduce at every layer) should be confined within a node, where NVLink bandwidth makes the overhead acceptable.

Communication-light parallelism strategies (like pipeline parallelism, which only requires activation transfers at layer boundaries) can span node boundaries.

This is exactly what Megatron-LM does, and the reason is physics, not convention.

Data parallelism in depth

Data parallelism is the simplest strategy and the one that scales best in terms of implementation complexity. It is also the one where the communication overhead is most amenable to hiding behind computation, given careful engineering.

The communication requirement for data parallelism is an all-reduce over the gradient tensors after each backward pass. For a model with P parameters in BF16, this all-reduce moves 2P bytes of data through the network.

For a 70B model, that is 140 GB per all-reduce. At 25 GB/s inter-node InfiniBand, a naive all-reduce would take approximately 5.6 seconds. For a training step that takes 2–3 seconds of compute, this is catastrophically inefficient. The GPU would be idle for twice as long as it was computing.

The solution is to overlap gradient communication with the backward pass computation. As each layer’s gradients are computed during the backward pass, those gradients can be immediately all-reduced while the backward pass continues computing the gradients of earlier layers.

This is called gradient overlap, and it is implemented in PyTorch via the DistributedDataParallel (DDP) bucket mechanism: gradients are grouped into buckets of approximately 25 MB, and an all-reduce is launched for each bucket as soon as it fills, overlapping with the backward computation of earlier layers.

The efficiency of gradient overlap depends on the ratio of compute time to communication time per layer.

For large models with large batch sizes, this ratio is favorable: the layers are compute-heavy, and there is always useful computation happening while the all-reduce for a previous layer’s gradients is in flight.

For small models or small batch sizes, the backward computation per layer is short and the all-reduce cannot be fully hidden.

This is one reason why very large batch sizes are computationally efficient beyond the obvious “more samples per step” benefit: larger batches mean longer per-layer compute time, which means more time to hide communication.

ZeRO: when the model doesn’t fit, but you want data parallelism anyway

Vanilla data parallelism replicates the full model on every GPU. For a 70B model, this requires every GPU to have 1,120 GB of memory (with optimizer state), which is physically impossible today and will remain so for some time.

ZeRO (Zero Redundancy Optimizer), developed by Microsoft DeepSpeed, addresses this by partitioning the model state across the data parallel group rather than replicating it.

ZeRO comes in three stages of increasing memory savings and communication cost:

ZeRO-1: Partition the optimizer state (m, v vectors) across the DP group. Each GPU holds a 1/N shard of the optimizer state. Communication overhead is unchanged (gradients are still all-reduced). Memory savings: up to 4× for Adam (8P → 2P per GPU for optimizer state).

ZeRO-2: Partition the gradients in addition to the optimizer state. After the all-reduce, each GPU keeps only its 1/N shard of the gradients (the portion it needs for its optimizer state shard). Memory savings: up to 8× (12P → 1.5P per GPU for gradients + optimizer state). Communication overhead: unchanged.

ZeRO-3: Partition the model parameters as well. Each GPU holds only 1/N of the model weights at any given time. During the forward and backward pass, the needed weight shards are all-gathered from the DP group just-in-time. Memory savings: up to 64× for a large DP degree. Communication overhead: increases by 1.5× compared to vanilla DDP (due to the all-gather operations for parameters).

ZeRO-3 with a DP degree of 64 reduces the per-GPU memory for a 70B model from 1,120 GB to approximately 17.5 GB: comfortably fitting on a single H100. The tradeoff is the 1.5× increase in communication volume, which must be weighed against the larger batch sizes that ZeRO-3 enables.

The engineering implementation of ZeRO-3 is non-trivial: parameters must be gathered before each layer’s forward pass and immediately freed afterward (unless gradient checkpointing is also active, in which case they must be re-gathered during the backward pass as well).

The memory allocator must be aware of these temporary parameter buffers and free them aggressively.

DeepSpeed’s implementation of ZeRO-3 does this with a parameter fetch prefetch buffer: as layer i is executing its forward pass, layer i+1‘s parameters are being all-gathered in the background, overlapping communication with computation at the layer level rather than the bucket level.

Tensor parallelism in depth

Tensor parallelism (TP), as formalized in the Megatron-LM paper, exploits the specific structure of transformer layers to split individual matrix multiplications across multiple GPUs.

Consider a single transformer MLP layer with weight matrices W1 of shape [d_model, d_ffn] and W2 of shape [d_ffn, d_model], where d_ffn = 4 × d_model. For a 70B model, d_model ≈ 8192, so d_ffn ≈ 32768.

The output of the MLP layer is: Y = GeLU(XW1)W2

With 8-way tensor parallelism, W1 is split column-wise across 8 GPUs, each holding W1_i of shape [d_model, d_ffn/8]. W2 is split row-wise across 8 GPUs, each holding W2_i of shape [d_ffn/8, d_model].

The computation on GPU i becomes:

Y1_i = X × W1_i (local GEMM, shape [batch, d_ffn/8])
Z_i = GeLU(Y1_i) (local elementwise)
Y2_i = Z_i × W2_i (local GEMM, shape [batch, d_model])
Y = AllReduce(Y2_i) (sum partial results across 8 GPUs, shape [batch, d_model])

The all-reduce at step 4 is the communication bottleneck. Its cost is:

2 × (N-1)/N × |Y2| bytes

For a batch of 2048 tokens, d_model = 8192, BF16: |Y2| = 2048 × 8192 × 2 bytes ≈ 32 MB. At 450 GB/s NVLink bandwidth, this all-reduce takes approximately 32 MB / 450 GB/s ≈ 71 microseconds.

The compute time for the local GEMMs is: 2 × (2048 × 8192 × 32768/8) × 2 ops ÷ (494 TFLOP/s per GPU) ≈ 2 × 2048 × 8192 × 4096 × 2 / 494e12 ≈ 1.1 milliseconds per GEMM, so approximately 2.2 ms for both GEMMs.

Communication (71 µs) is roughly 3% of compute (2200 µs). This is an excellent ratio; the all-reduce is effectively free.

But now notice what happens if you push TP degree from 8 to 64 over InfiniBand at 25 GB/s instead of NVLink at 450 GB/s: the all-reduce takes 32 MB / 25 GB/s × (2 × 63/64) ≈ 2.5 ms, which is now comparable to the compute time.

The compute per GPU has also dropped by 8× (fewer flops per GPU due to more partitioning), so each GEMM takes 2.2ms / 8 ≈ 275 µs.

The all-reduce (2.5 ms) now takes 9× longer than the compute it is supposed to overlap with. You are GPU-idle for 90% of the time. This is why tensor parallelism over InfiniBand at high TP degrees is a terrible idea, not a theoretical concern but an arithmetic certainty.

The intra-node NVLink constraint on TP degree is therefore typically TP ≤ 8 for an 8-GPU node, precisely because that is the point where NVLink bandwidth makes the communication overhead negligible.

Self-attention and sequence parallelism

The attention mechanism has a different structure than the MLP, but the Megatron-LM approach handles it symmetrically: Q, K, and V projection matrices are split column-wise (head-parallel), and the output projection is split row-wise. Each GPU handles a subset of attention heads.

For H total heads and TP degree N, each GPU computes H/N heads independently. No communication is needed within the attention computation itself; only the output projection requires an all-reduce.

This works cleanly as long as H is divisible by N, which is why transformer architectures are almost always designed with head counts that are powers of 2 or multiples of 8.

There is an additional subtlety for attention: LayerNorm and dropout require the full activation tensor, not a sharded one.

In vanilla TP, these operations run on the full (all-reduced) activations, which means they see the full sequence length at full d_model dimension: they are not distributed and do not benefit from the TP decomposition.

Sequence Parallelism (SP), introduced as an extension to Megatron-LM TP, addresses this by replacing the all-reduce at layer boundaries with an all-gather + reduce-scatter pattern, and distributing the non-tensor-parallel operations (LayerNorm, dropout) across the sequence dimension rather than the model dimension.

In SP, the activation between transformer layers is sharded across the TP group along the sequence dimension: each GPU holds [batch, seq_len/N, d_model] instead of [batch, seq_len, d_model].

Before entering the tensor-parallel MLP, an all-gather reconstructs the full [batch, seq_len, d_model] tensor. After the MLP’s row-parallel W2 partial sum, a reduce-scatter simultaneously sums the partial results and re-shards the output along the sequence dimension.

The communication volume is identical to the all-reduce, but the memory advantage is significant: activations are now partitioned across the TP group, reducing the peak memory per GPU by a factor of N for the activation tensors.

At large sequence lengths (128K tokens, as in recent long-context models), this memory saving is the difference between fitting in HBM and catastrophic OOM.

Pipeline Parallelism

Pipeline parallelism is the ugliest of the three strategies. This is a statement of fact, not an aesthetic judgment.

It introduces pipeline bubbles (periods where some GPUs are idle because they are waiting for activations from the previous stage) and managing those bubbles is the central engineering challenge.

In the simplest pipeline schedule (GPipe), a batch of M micro-batches flows through P pipeline stages sequentially. The forward pass processes micro-batch 0 through stage 0, then stage 1, ..., then stage P-1. Then micro-batch 1 flows through. Then micro-batch 2. And so on.

The backward pass happens in reverse, with a full flush between the forward and backward sweeps. The pipeline bubble fraction (time wasted on idle GPUs as a fraction of total time) is approximately:

bubble fraction ≈ (P − 1) / (M + P − 1)

For P = 4 stages and M = 8 micro-batches: bubble fraction ≈ 3/11 ≈ 27%. More than a quarter of GPU time is wasted.

To reduce the bubble, you increase M. For M = 32: bubble fraction ≈ 3/35 ≈ 8.5%. For M → ∞, bubble fraction → 0.

But increasing M increases the memory required to store the activations of all in-flight micro-batches during the forward pass before the backward pass can begin (GPipe requires storing all activations). For M micro-batches each with activation size A, the activation memory is M × A, and this grows linearly with M.

1F1B (One Forward One Backward) scheduling, introduced in PipeDream and used by Megatron-LM, breaks this linear activation scaling. In 1F1B, each pipeline stage interleaves one forward pass and one backward pass for different micro-batches, rather than running all forwards before any backwards.

The pipeline still has a bubble at startup and drain, but the steady-state memory is bounded: at any point, a stage has at most P micro-batches’ activations in flight (not M). The bubble fraction is the same as GPipe ((P−1)/(M+P−1)), but the activation memory is P × A instead of M × A.

For large models where M must be large to amortize the bubble, this is a crucial difference. With M = 64 and P = 8, GPipe stores 64 micro-batches of activations; 1F1B stores at most 8.

Interleaved 1F1B goes further: each GPU holds V virtual pipeline stages (chunks of layers) instead of one contiguous block, enabling the bubble fraction to be reduced to:

bubble fraction ≈ (P − 1) / (V × M + P − 1)

with V times the point-to-point communication per step (because activation tensors must be sent between non-adjacent GPUs). This is the schedule used by Megatron-LM for the largest training jobs, with V=2 or V=4 providing a 2-4× reduction in bubble at a 2-4× increase in inter-stage communication.

The Activation Recomputation Tradeoff

One more tool for managing activation memory in pipeline parallelism (and, frankly, in any large training run): gradient checkpointing, also called activation recomputation.

The idea is simple: instead of storing the full activation tensor for every layer during the forward pass (needed for the backward pass), you store only the activations at checkpoint boundaries (e.g., every transformer block) and recompute the intermediate activations on-demand during the backward pass by running the forward computation again.

The memory cost is reduced from O(L × A) to O(√L × A) for optimal checkpoint placement (checkpoint every √L layers). The compute cost increases by approximately 30-40% (one extra forward pass per layer, amortized).

For training at the frontier, this tradeoff is almost always worth it: compute is more abundant than HBM capacity, and the alternative is buying more GPUs (or, equivalently, using more PP stages, which increases the bubble).

The interaction between activation recomputation and pipeline parallelism is non-trivial: if you are recomputing activations, the backward pass must re-run the forward computation for each stage, which means each stage must still have access to the input activations from the forward pass checkpoint.

This constrains how aggressively you can combine PP and recomputation, and the Megatron-LM codebase has explicit logic for managing which activations are stored versus recomputed across pipeline stages.

The library that makes multi-GPU work

Everything discussed above (all-reduce, all-gather, reduce-scatter) is implemented in practice by NCCL (NVIDIA Collective Communications Library).

Understanding what NCCL does, and more importantly how it does it, is necessary for diagnosing performance problems at scale.

NCCL implements the standard collective communication operations (AllReduce, AllGather, ReduceScatter, Broadcast, Reduce, Barrier) on NVIDIA GPUs, with topology-aware algorithms that exploit NVLink, NVSwitch, and InfiniBand according to the detected hardware configuration.

Ring-AllReduce vs Tree-AllReduce

The canonical AllReduce algorithm for a ring of N GPUs is ring-allreduce, introduced in the deep learning context by Baidu Research and later made famous by the Horovod library.

In ring-allreduce, GPUs are arranged in a logical ring. AllReduce is decomposed into two phases:

Reduce-Scatter: Each GPU sends a chunk of its data to the next GPU in the ring, while simultaneously receiving and accumulating a chunk from the previous GPU. After N−1 steps, each GPU holds the fully reduced value for one chunk (its 1/N shard of the full tensor).

AllGather: Each GPU broadcasts its reduced chunk to all others by rotating it around the ring. After N−1 more steps, every GPU holds the fully reduced tensor.

The latency of ring-allreduce scales as 2(N−1)α + 2(N−1)/N × |data|/β, where α is the point-to-point latency and β is the bandwidth.

For large data (|data| >> α/β × N²), the bandwidth term dominates and the algorithm is efficient. For small data, the latency term (proportional to N) dominates and ring-allreduce becomes expensive.

This is why AllReduce for gradient synchronization (large tensors, many GB) works well with ring-allreduce, but AllReduce for small tensors (like the normalization statistics in LayerNorm) can benefit from tree-based algorithms that have O(log N) latency scaling at the cost of non-optimal bandwidth utilization.

NCCL automatically selects the algorithm based on message size, topology, and a heuristic tuning table, but understanding the underlying tradeoff is necessary when the automatic selection is suboptimal for your specific workload.

NCCL and topology awareness

NCCL’s topology detection is worth examining in detail because it directly determines which algorithm it selects for intra-node vs inter-node operations.

On startup, NCCL probes the system topology using the CUDA device properties API and, where available, NVML (NVIDIA Management Library) topology information.

It constructs an internal graph where GPUs are nodes and NVLink/PCIe/InfiniBand connections are edges with associated bandwidths.

For an 8-GPU DGX H100 node with NVSwitch, NCCL detects a fully connected graph with 900 GB/s bidirectional bandwidth.

For multi-node communication over InfiniBand with a single SHARP switch, NCCL can use SHARP (Scalable Hierarchical Aggregation and Reduction Protocol), an in-network computing feature of modern InfiniBand switches that performs the AllReduce reduction inside the switch fabric rather than at the endpoint GPUs.

SHARP effectively moves the bandwidth bottleneck from the GPU NICs to the switch, and for large-scale clusters with SHARP-capable HDR/NDR InfiniBand, it can reduce AllReduce latency by 2-3× compared to standard ring-allreduce.

For hierarchical topologies (NVLink within nodes, InfiniBand between nodes), NCCL uses a two-level algorithm: a ReduceScatter within each node using the fast NVLink fabric, followed by an AllReduce across nodes over InfiniBand, followed by an AllGather within each node.

This correctly exploits the bandwidth hierarchy: the slow inter-node link sees only the partially reduced results, not the full gradient tensor from every GPU.

The implementation detail that matters for practitioners: the NCCL_SOCKET_NTHREADS and NCCL_BUFFSIZE environment variables, along with the NCCL_P2P_LEVEL setting, are often the first things to tune when AllReduce performance is below theoretical bandwidth.

NCCL’s defaults are conservative for stability across diverse hardware configurations.

The NCCL + CUDA stream interaction

NCCL operations run on CUDA streams, and correct stream management is the source of many subtle performance bugs in multi-GPU training code.

The critical invariant: NCCL operations on the same communicator object are serialized in NCCL’s internal queue, but operations on different communicators are independent.

The PyTorch DDP implementation creates one NCCL communicator per process group, and all AllReduce operations for gradient synchronization go through this communicator.

When gradient overlap is active (AllReduce launched for bucket i while the backward pass continues computing gradients for bucket i-1), PyTorch launches the AllReduce on a separate CUDA stream from the backward pass computation. The backward computation stream and the AllReduce stream run concurrently on the GPU.

The synchronization at the end of the backward pass (before the optimizer step) ensures that all AllReduce streams have completed. Failure to synchronize here is a common bug that manifests as non-deterministic gradient corruption: the optimizer sees a partially all-reduced gradient tensor.

The interaction with tensor parallelism adds another layer: the AllReduce within a TP group happens on a separate communicator from the AllReduce within the DP group.

These two communicators must be correctly ordered in the forward/backward pass: TP AllReduce happens at each layer boundary (blocking for the next layer to begin), while DP AllReduce happens after the full backward pass.

The arithmetic of parallelism efficiency

With the mechanisms in place, we can now put numbers on the parallel efficiency achievable under different configurations. This is where the theoretical discussion becomes practically useful.

For data parallelism, the relevant ratio is Rᴅᴘ = T_compute / T_allreduce. For gradient overlap to be effective, we need Rᴅᴘ >> 1. For a 70B model over InfiniBand at 25 GB/s, T_allreduce ≈ 5.6 seconds. For a per-GPU batch of 512 tokens at 50% H100 utilization, T_compute ≈ 0.58 seconds. Rᴅᴘ ≈ 0.1.

The compute is an order of magnitude faster than the all-reduce, and no amount of overlap engineering fixes a ratio that is inverted.

The resolution is gradient accumulation: run multiple micro-batches locally before triggering the all-reduce, increasing the logical batch size without increasing activation memory.

This is not a workaround. It is the correct operating point, because convergence is measured in tokens seen, not steps taken.

For tensor parallelism, the critical TP degree N* is the point above which T_allreduce_TP > T_GEMM_local. On NVLink at 450 GB/s, N* ≈ 8. On InfiniBand at 25 GB/s, N* ≈ 1 to 2. Tensor parallelism over InfiniBand is not a suboptimal choice; it is an arithmetic mistake.

The practical ceiling for a well-tuned thousand-GPU training job with Megatron-LM or DeepSpeed on H100 hardware is approximately 40-60% MFU.

The gap from 100% is accounted for by pipeline bubbles (~15%), communication overhead (~10-15%), activation recomputation (~5-10%), and the residual from kernel inefficiency on non-GEMM operations and host-side Python overhead.

Getting from 40% to 60% MFU is worth approximately a 50% reduction in training cost. Every major AI lab has engineers whose entire job is that gap.

What the three strategies actually buy you

It is worth stepping back and stating plainly what each parallelism axis is for, now that we have done the arithmetic.

Data parallelism buys throughput. More DP replicas means more tokens per second, with communication overhead that is manageable at any scale where the per-GPU batch is large enough.

ZeRO makes DP viable even when the model does not fit on a single GPU, at the cost of increased communication volume that must be weighed against the larger batch sizes it unlocks.

DP is the outer loop of every large training job; the other strategies are refinements that make DP feasible at model sizes and cluster scales where it would otherwise be communication-bound.

Tensor parallelism buys memory capacity per layer, by distributing the weight matrices and activations of individual layers across multiple GPUs.

Its cost is a mandatory all-reduce at every layer boundary, which makes it viable only where NVLink bandwidth makes that all-reduce cheap.

TP is fundamentally an intra-node strategy on current hardware. Using it otherwise is trading compute for communication at an exchange rate that is never favorable.

Pipeline parallelism buys the ability to cross node boundaries without paying the tensor-parallel penalty. Its cost is the pipeline bubble, which is a tax on idleness that shrinks with micro-batch count and with interleaved scheduling.

PP is the mechanism by which you scale beyond a single node when the model is too large for TP alone to partition. It is ugly, and it is necessary, and the 1F1B schedule and its interleaved variants exist specifically to make that ugliness manageable.

The three-dimensional parallelism strategy (TP within a node, PP across nodes, DP across node groups) is not a design choice that someone made. It is the unique solution that the bandwidth hierarchy of current hardware enforces.

NVLink makes TP cheap within a node. InfiniBand makes PP the only viable strategy across nodes. Gradient accumulation and ZeRO make DP efficient across the full cluster. Change the hardware, and the optimal strategy shifts.

NVL72 on Blackwell, by extending the NVLink fabric to 72 GPUs, shifts the TP/PP boundary outward.

InfiniBand NDR at 400 Gb/s per link, as it becomes more widely deployed, shifts the point at which inter-node DP communication becomes the bottleneck. The strategies are not timeless; the physics that motivates them is.

What actually limits you at scale

The question that every large-scale training practitioner eventually asks is: why isn’t my thousand-GPU job achieving 70% MFU? The answer is rarely a single cause.

It is a stack of inefficiencies, each modest in isolation, compounding into the gap between theoretical and measured throughput.

Pipeline bubbles are usually the largest single contributor at high PP degrees. For P=16 and M=32, the bubble fraction is 32% before any other inefficiency is counted.

Interleaved 1F1B at V=2 halves this, at the cost of doubled inter-stage communication. The right V depends on whether communication or compute is the limiting resource at the PP boundary, which varies by model size, node count, and InfiniBand configuration.

Inter-node AllReduce tail latency is the second major contributor and the hardest to reason about, because it is statistical rather than deterministic.

The average InfiniBand bandwidth may be 25 GB/s, but the 99th-percentile latency can be 3-5× higher due to switch congestion, adaptive routing jitter, and multi-tenancy.

The all-reduce waits for the slowest link. At scale, the slowest link is almost always slower than the average, which means the effective all-reduce bandwidth is consistently worse than the number on the datasheet.

RDMA configuration, static routing, and dedicated rail topology are the levers. They require infrastructure access that not every team has.

Host-side Python overhead is the most embarrassing source of inefficiency, because it is entirely self-inflicted.

PyTorch’s dispatcher, the GIL, and the overhead of the training loop’s Python logic can appear as measurable GPU idle time for models where per-step compute is short. CUDA graph capture eliminates per-operation launch overhead for the inner loop.

Careful pipelining of data loading and logging eliminates it for the outer loop. Teams that have done this work carefully report 5-10% improvement in effective MFU from host-side optimizations alone, which is not a small number when the training run costs millions of dollars.

Conclusion

The trajectory from single-GPU to multi-GPU mirrors the trajectory within the single GPU: find the bottleneck, route around it, measure again.

On a single GPU, the bottleneck moved from arithmetic units (tensor cores solved it) to memory bandwidth (cp.async attacked it) to instruction overhead (TMA eliminated it).

The SMSP on a well-tuned Hopper GEMM kernel is a machine that does one thing, because every other thing it used to do has been delegated to dedicated hardware.

At the multi-GPU level, the bottleneck is the communication fabric, and the solution follows the same logic: match the parallelism strategy to the bandwidth available at each level of the hierarchy.

TP over NVLink, PP over InfiniBand, DP across the full cluster. The ring-allreduce, the 1F1B schedule, the ZeRO partitioner: these are the mechanisms by which a thousand-GPU cluster achieves 40-60% of the theoretical sum of its parts.

40-60% of a thousand H100s is still an extraordinary amount of compute. Whether it is enough to train the next frontier model is left as an exercise for the reader’s infrastructure budget.

Part X will close in on a corner of this picture we have deliberately deferred: inference. Training is a one-time cost; inference is the workload that runs forever, at a scale that dwarfs training once a model is deployed.

The optimization challenges at inference time are different in kind: speculative decoding, KV cache management, continuous batching, and the specific arithmetic of when the prefill and decode phases are bottlenecked by entirely different resources.

The tools change; the principle does not.

The Software Frontier

Discussion about this post

Ready for more?