We built the CUDA guide we wish we had three years ago
Intro
For the past few days we have been quiet here. Not because the newsletter slowed down, but because we were building something underneath it.
Today we are publishing what came out of that work: CUDA Mastery 2026, The Definitive Engineer’s Reference for Hopper, Blackwell, and Beyond.
Twenty-seven chapters, five appendices, fact-checked end to end against NVIDIA’s own documentation, the PTX ISA 8.7, and primary architecture whitepapers.
It covers CUDA Toolkit 13.0, 13.1, and 13.2, compute capabilities 7.5 through 12.1, WMMA, WGMMA, UMMA (tcgen05), TMA, thread block clusters, Tensor Memory, CUDA Tile and cuTile Python, CUTLASS 4 / CuTe, NCCL 2.30, and Nsight 2025.4.
It is on Gumroad. The price is $89. If you have been following The Software Frontier, you already know whether this is for you. The rest of this post is for everyone who is on the fence.
Why we wrote this
There is a strange gap in CUDA literature.
On one side, you have the official NVIDIA programming guide: dense, accurate, and written for people who already know what they are looking for.
On the other side, you have an ocean of blog posts and YouTube tutorials that stop at vector addition and matrix multiplication, repeating the same surface-level explanations of threads, blocks, and grids.
What sits in the middle, the part that actually matters when you are writing production code or debugging a kernel that runs at 30 percent of peak, is mostly missing.
Or rather, it exists, but it is scattered across NVIDIA whitepapers, GTC talks from 2018, PTX ISA documentation, decompiled SASS dumps, the Hopper and Blackwell Tuning Guides, the Microbenchmarking Hopper and Microbenchmarking Blackwell arXiv papers, the CUTLASS source, and Stack Overflow threads from people who clearly know more than they are saying.
We have been reading and writing about this gap for months on the newsletter. The articles on the A100 memory hierarchy, on cp.async semantics, on scoreboard mechanics, on the submission pipeline, all of them came from the same frustration. Every time we wanted to explain something properly, we had to do the archaeology ourselves.
So we decided to do the archaeology once, in a single document, and structure it the way we wish someone had structured it for us when we started.
What is in the guide
Twenty-seven chapters across eleven parts, plus five appendices. Four chapters were rewritten end to end at handbook depth. Those are the PREMIUM chapters: the SM, the memory system, tensor cores, and the SGEMM walkthrough. The full structure looks like this.
Foundations. The GPU as a throughput machine, the CUDA programming model, and the memory hierarchy at a glance. This is the vocabulary layer. A senior engineer can skim it in an afternoon; a new graduate can use it as their entry point and grow into the rest of the book.
The Streaming Multiprocessor in mechanical detail. The SM is the unit of concurrency, the unit of resource accounting, and the unit at which every meaningful CUDA performance argument must eventually be made.
We walk through the four near-independent partitions, the operand collector and its bank conflicts, the short and long scoreboards, the quantitative arithmetic of latency hiding via Little’s law, the full Nsight Compute stall taxonomy, and the structural deltas across Volta, Turing, Ampere, Ada, Hopper, and Blackwell.
The chapter ends with an end-to-end walkthrough of a single warp executing a wgmma.mma_async on a Hopper SM, stage by stage.
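To give a flavor of the latency-hiding arithmetic mentioned above, here is a minimal sketch of Little's law applied to memory requests in flight. The function names and every number in it are assumptions chosen to be round, not datasheet values or figures taken from the guide.

```python
def bytes_in_flight(bandwidth_bytes_per_s, latency_s):
    """Little's law: required concurrency = throughput * latency."""
    return bandwidth_bytes_per_s * latency_s


def requests_in_flight_per_sm(bandwidth_bytes_per_s, latency_s, num_sms, bytes_per_request):
    """Outstanding memory requests each SM must sustain to keep the memory bus busy."""
    total = bytes_in_flight(bandwidth_bytes_per_s, latency_s)
    return total / (num_sms * bytes_per_request)


# Assumed, round numbers for illustration only (not datasheet values).
BANDWIDTH = 2.0e12    # bytes per second
LATENCY = 500e-9      # seconds of memory latency to hide
NUM_SMS = 100
REQUEST_BYTES = 128   # one fully coalesced warp-wide load of 4 bytes per lane

total_bytes = bytes_in_flight(BANDWIDTH, LATENCY)
per_sm = requests_in_flight_per_sm(BANDWIDTH, LATENCY, NUM_SMS, REQUEST_BYTES)
print(f"{total_bytes:.0f} bytes in flight, about {per_sm:.1f} requests per SM")
```

The values matter less than the product: higher bandwidth or longer latency both raise the number of independent requests a kernel has to keep outstanding.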
The memory system mechanically. The 32-byte sector model. The L1TEX path and every cache modifier you can attach to a global load. The L2 partitioning on H100 and the access policy window. HBM3 in three numbers and why the practical roofline is 70 to 85 percent of the headline.
Shared memory banks, swizzle modes, and the descriptor-encoded layout that WGMMA actually expects. cp.async mechanics on Ampere. TMA on Hopper and Blackwell, including descriptors, transaction barriers, phase parity, and cluster multicast. The mbarrier family and how warp-specialized GEMM mainloops use it.
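As a rough illustration of the 32-byte sector model, the sketch below counts how many sectors a single warp-wide load touches. It is a deliberately simplified model under our own assumptions: it ignores caching and hardware details, and the helper names are hypothetical.

```python
SECTOR_BYTES = 32
WARP_SIZE = 32


def sectors_touched(addresses, access_bytes=4):
    """Count distinct 32-byte sectors covered by one access per lane."""
    sectors = set()
    for addr in addresses:
        first = addr // SECTOR_BYTES
        last = (addr + access_bytes - 1) // SECTOR_BYTES
        sectors.update(range(first, last + 1))
    return len(sectors)


def sector_efficiency(addresses, access_bytes=4):
    """Fraction of fetched bytes that the warp actually asked for."""
    useful = len(addresses) * access_bytes
    fetched = sectors_touched(addresses, access_bytes) * SECTOR_BYTES
    return useful / fetched


# Lane i reads 4 bytes at byte offset 4*i: 128 contiguous bytes.
coalesced = [4 * lane for lane in range(WARP_SIZE)]
# Lane i reads 4 bytes at byte offset 128*i: every lane lands in its own sector.
strided = [128 * lane for lane in range(WARP_SIZE)]

print(sectors_touched(coalesced), sector_efficiency(coalesced))
print(sectors_touched(strided), sector_efficiency(strided))
```

In this model the contiguous pattern needs four sectors and uses every byte it fetches, while the strided pattern needs thirty-two sectors and uses one eighth of them.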
Synchronization and concurrency. The CUDA memory model with its scopes and orderings. Cooperative Groups including cluster.sync on Hopper and beyond. Streams, events, and CUDA Graphs for launch-overhead amortization in inference and physics workloads.
Tensor cores mechanically. The hardware origin of the unit, the generation-by-generation shape and precision progression, the per-lane fragment ownership for mma.sync, the bit-level layout of the WGMMA matrix descriptor, and the structural transition to UMMA with the accumulator living in Tensor Memory.
A full section on the new numerical formats: FP6, FP4, and the MX wrapper that makes FP4 inference near-lossless on trained transformer weights.
Modern hardware. A Hopper deep dive on sm_90 / sm_90a. A Blackwell deep dive on sm_100 / sm_100a / sm_120. A chapter on Blackwell Ultra (B300, compute capability 10.3) and the trajectory toward Rubin.
Performance engineering. The roofline model in practice, with the second roofline for shared-memory bandwidth on tile-based kernels. Profiling with Nsight Systems and Nsight Compute, including the four-section workflow and the new tile-kernel statistics added in CUDA 13.1. Numerics and reproducibility, including the TF32 trap that silently downgrades FP32 GEMMs.
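For readers who have not used the roofline model, a minimal sketch of the basic bound follows. The peak and bandwidth figures are placeholder assumptions rather than datasheet numbers, the function names are ours, and the GEMM traffic estimate assumes each matrix crosses the memory bus exactly once, which real kernels only approach.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, arithmetic_intensity):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_flops, bandwidth_bytes_per_s * arithmetic_intensity)


def gemm_arithmetic_intensity(n, bytes_per_element=4):
    """FLOPs per byte for an n x n x n GEMM with ideal reuse (read A and B, write C once)."""
    flops = 2 * n**3
    bytes_moved = 3 * n * n * bytes_per_element
    return flops / bytes_moved


# Placeholder machine parameters (assumptions, not taken from any datasheet).
PEAK_FLOPS = 60e12   # FLOP per second
BANDWIDTH = 3e12     # bytes per second

gemm_ai = gemm_arithmetic_intensity(4096)   # FP32, 4 bytes per element
vector_add_ai = 1 / 12                      # 1 FLOP per two 4-byte loads and one 4-byte store

print(gemm_ai, attainable_flops(PEAK_FLOPS, BANDWIDTH, gemm_ai))
print(vector_add_ai, attainable_flops(PEAK_FLOPS, BANDWIDTH, vector_add_ai))
```

With ideal reuse a large GEMM sits far to the right of the ridge point and is bounded by compute, while vector addition is bounded by memory bandwidth regardless of how the kernel is written.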
Multi-GPU and distributed. NVLink 5 and the NVL72 domain. SHARP v4 in-network reductions. NCCL internals across Ring, Tree, NVLS, and PAT. NVSHMEM and PGAS for sparse all-to-all in MoE training.
Libraries and toolchain. cuBLAS, cuBLASLt epilogue fusion, cuDNN, cuFFT, cuSPARSE. CUTLASS 4 and CuTe for hand-written tensor-core mainloops. CCCL (Thrust + CUB + libcu++). A full chapter on CUDA Tile and cuTile Python, the largest single addition to the CUDA programming model since cooperative groups. A chapter on nvcc, PTX, SASS, the fatbinary, and inline PTX as an escape hatch.
Capstone kernels. The SGEMM walkthrough from v1 to v6, with numbers. Most treatments stop at register tiling and gesture vaguely at “tensor cores make it faster.”
Ours follows the bottleneck through six versions on a 4096³ FP32 problem on H100 SXM5, names the architectural feature that breaks each ceiling, and gives the analytical bound. v1 reaches 0.2 percent of peak. v6, on Blackwell with UMMA + TMEM at FP8, reaches 88 to 95 percent of peak. The chapter exists to teach what every transition costs and what it buys.
Reductions and scans with single-pass decoupled lookback. Flash Attention 2, 3, and 4 / 5 including the online softmax derivation and the WGMMA / UMMA mainloop. Sort, hash, and graph primitives built on CUB.
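The online softmax mentioned above can be sketched in a few lines of scalar Python. This is only the running-maximum and running-sum recurrence, under the assumption of finite inputs; it is not the tiled kernel, and the function name is ours.

```python
import math


def online_softmax(scores):
    """Softmax whose max and normalizer are accumulated in a single streaming pass."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in scores:
        new_max = max(running_max, x)
        # Rescale the sum accumulated so far to the new maximum, then add the new term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    # A second loop normalizes; fused attention kernels fold this into the output update.
    return [math.exp(x - running_max) / running_sum for x in scores]


probs = online_softmax([1.0, 3.0, -2.0, 0.5])
print(probs, sum(probs))
```

The rescaling step is what lets the maximum change mid-stream without revisiting earlier elements, which is the property attention kernels rely on when they process keys one tile at a time.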
Appendices. Compute capability quick reference. Architecture spec sheet from A100 through B300 with verified numbers from each part’s datasheet. PTX quick reference. Glossary. And a full bibliography of the primary sources every claim in the book was checked against.
Every claim in the guide has been fact-checked against primary sources. Where we had to infer something from SASS or from the behavior of the hardware rather than from a published spec, we say so explicitly.
What was truly missing
Maxwell, Pascal, and Volta offline-compilation material was retired, in line with CUDA 13.0 dropping pre-Turing offline compilation in August 2025. The tensor-core chapter was rewritten around UMMA, which supersedes Hopper’s wgmma.mma_async.
New material on CUDA Tile and cuTile Python, both introduced in CUDA 13.1 in December 2025 and extended in 13.2. New material on Tensor Memory.
The four PREMIUM chapters are new from the ground up. All numbers were verified against NVIDIA’s Hopper, Blackwell, and Blackwell Ultra public datasheets at print time.
Who this is for
If you are writing CUDA professionally, in HPC, in ML systems, in inference engines, or in any context where kernel performance is part of your job, this guide is calibrated for you.
If you are a senior engineer transitioning into GPU work and you want one document that takes you from competent to dangerous without 200 hours of fragmented reading, this is the document.
If you are deep into the PTX weeds already, writing your own warp-specialized WGMMA mainloops and tuning CuTe layouts, you probably know a lot of what is in here.
You will still find the SM and memory chapters useful as a reference, and the SGEMM walkthrough is one of the few places where the v5-to-v6 transition is laid out in full. But we will not pretend that most of it will be new to you.
If you are completely new to CUDA, with no parallel programming background, this is not the right starting point.
The guide assumes you can read C++ and that you have at least written a few kernels before. We would not want you to spend $89 and feel lost on chapter four.
Why $89
Because the alternative is reading what we read, in the order we read it, over the same number of months.
We are not pricing this against tutorials. We are pricing it against the time of an engineer who bills somewhere between $80 and $200 an hour and needs to be productive on Hopper and Blackwell GPU code by next quarter.
If the guide saves you a single afternoon of debugging a kernel that turns out to be limited by operand collector bank conflicts that no public documentation describes, it has paid for itself.
There is no DRM, no expiration, no upsell. You buy the PDF, you own it. As future architectures change material details, we will publish updates to buyers at no additional cost.
The next edition is already scheduled to cover Rubin when its public specifications stabilize.
What happens next
The newsletter continues. The Mastering CUDA series is not over, and there are several articles already in draft on topics that did not fit cleanly into the guide.
If you buy the guide and you have feedback, send it. We read every email. The first revision is going out within thirty days based on what readers tell us, and the people who bought early get it first.
You can find the guide here: CUDA Mastery
Thank you for reading. Thank you for being here while this was being built. The next article goes out as scheduled.
Lorenzo and Lorenzo



