We wrote the CUDA reference we could not find
The handbook we wanted on our own desks. GPU performance from first principles, current to Blackwell and CUDA 13.x.
We have both lost more hours than we want to admit to questions that should have a clean answer. Here is one.
On Hopper, a WGMMA sources its operands from shared memory and registers. On Blackwell, the new UMMA instruction can read its A operand straight from tensor memory, a level of the hierarchy that did not exist a generation ago.
That single change reshapes how you build a pipeline. When we went looking for it written down in one place, it was not there. It was scattered across a PTX spec, a CUTLASS example, and a couple of microbenchmark threads.
That was the pattern everywhere. Occupancy rules in one tuning guide. Descriptor encodings in another. The genealogy of the tensor cores living in our heads after weeks of trial runs. Nothing sat in one place we could read front to back.
So we wrote that place.
We are not technical writers who picked up CUDA for a book. We write low-level GPU and systems code for a living, and we got tired of not having this on our own desks.
CUDA Mastery 2026 is a deep technical handbook for engineers who want to understand and optimize CUDA on modern NVIDIA GPUs, from the fundamentals through Hopper, Blackwell, and Blackwell Ultra. It is current to CUDA 13.x and the fifth generation of tensor cores.
It is not a syntax tutorial. Most CUDA resources teach you syntax, a few explain how the hardware actually works, and almost none teach you to reason about performance before you write a line. This is the one that does.
Here is what it covers, as one integrated reference instead of forty open browser tabs: SM internals, warp scheduling, scoreboarding, occupancy, and latency hiding. The full memory hierarchy, coalescing, shared memory, TMA, and Tensor Memory.
The tensor core path from WMMA to WGMMA to UMMA. Roofline modeling, bottleneck analysis, and Nsight profiling. CUTLASS, CuTe, CUDA Tile, PTX, SASS, and JIT. Multi-GPU with NCCL, NVLink, NVSwitch, and NVSHMEM. And real kernel walkthroughs: SGEMM, Flash Attention, reductions, scan, sort, and sparse ops.
Two rules we held ourselves to. Every specification is footnoted to a primary source, whether an NVIDIA whitepaper, the PTX ISA, a tuning guide, or published microbenchmarks, with nothing paraphrased from a blog you cannot trace.
And specification stays separate from measurement, so you always know whether a claim comes from a document or from a number someone ran.
Think about what one wrong optimization actually costs: a day chasing a bottleneck that was never there, or a kernel rewritten three times because an operand sat in the wrong memory. This reference shortens that loop. The next time you open Nsight and see a stall, you will know which part of the machine is talking to you.
It is written for engineers who already ship CUDA. If you have never launched a kernel, start somewhere gentler and come back when you want to go deep.
The book is €89, a single payment, or 10 installments, and less than the going hourly rate for someone who already knows this. Every future v1.x update and correction is included free.
We built this because we wanted it on our own desks. It is on yours now if you want it.


