<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Software Frontier]]></title><description><![CDATA[Where abstraction ends. Essays on GPU execution, kernel internals, and distributed systems at scale.]]></description><link>https://www.thesoftwarefrontier.com</link><image><url>https://substackcdn.com/image/fetch/$s_!SAY7!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54550d86-2756-4131-8818-956604f6749d_608x608.png</url><title>The Software Frontier</title><link>https://www.thesoftwarefrontier.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 22 May 2026 16:45:13 GMT</lastBuildDate><atom:link href="https://www.thesoftwarefrontier.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Lorenzo Bradanini]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[softwarefrontier@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[softwarefrontier@substack.com]]></itunes:email><itunes:name><![CDATA[Lorenzo Bradanini]]></itunes:name></itunes:owner><itunes:author><![CDATA[Lorenzo Bradanini]]></itunes:author><googleplay:owner><![CDATA[softwarefrontier@substack.com]]></googleplay:owner><googleplay:email><![CDATA[softwarefrontier@substack.com]]></googleplay:email><googleplay:author><![CDATA[Lorenzo Bradanini]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[How Systems Really Fail, Part II]]></title><description><![CDATA[The observer problem: why your dashboards lie, why aggregation destroys the signal, and the unbridgeable gap between the system and your model of 
it.]]></description><link>https://www.thesoftwarefrontier.com/p/how-systems-really-fail-part-ii</link><guid isPermaLink="false">https://www.thesoftwarefrontier.com/p/how-systems-really-fail-part-ii</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Tue, 19 May 2026 09:51:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SjgH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f753274-d624-4ee3-8c62-d59c194f83f0_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SjgH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f753274-d624-4ee3-8c62-d59c194f83f0_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SjgH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f753274-d624-4ee3-8c62-d59c194f83f0_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!SjgH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f753274-d624-4ee3-8c62-d59c194f83f0_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!SjgH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f753274-d624-4ee3-8c62-d59c194f83f0_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!SjgH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f753274-d624-4ee3-8c62-d59c194f83f0_1536x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!SjgH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f753274-d624-4ee3-8c62-d59c194f83f0_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f753274-d624-4ee3-8c62-d59c194f83f0_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3562838,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/197191625?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f753274-d624-4ee3-8c62-d59c194f83f0_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SjgH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f753274-d624-4ee3-8c62-d59c194f83f0_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!SjgH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f753274-d624-4ee3-8c62-d59c194f83f0_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!SjgH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f753274-d624-4ee3-8c62-d59c194f83f0_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!SjgH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f753274-d624-4ee3-8c62-d59c194f83f0_1536x1024.png 
1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p></p><h2>Intro</h2><p>The first essay in this series argued that distributed systems fail in the spaces <em>between</em> their components, and that those spaces are structurally opaque. 
This one argues something more uncomfortable.</p><p>Even if you accept that the system is opaque, you still have to operate it. You still have to debug it at <strong>03:47 UTC</strong>. You still have to decide, in the next ninety seconds, whether to roll back, fail over, shed load, or page someone more senior.</p><p>To do any of that, you have to <strong>see</strong> the system.</p><p>This is the second structural problem, the one most production engineers learn about the hard way, usually during an incident that lasted longer than it should have because the dashboards stayed green until they didn&#8217;t.</p><p>The system you observe is <strong>not the system that exists</strong>. It is a <em>projection</em> of the system into a low-dimensional representation, built out of metrics, logs, traces, and the mental model that lives in the operator&#8217;s head.</p><p>The projection is incomplete by construction. It is <em>delayed</em> by the time it takes to collect, aggregate, and render. It is <em>biased</em> by what someone decided to instrument three years ago. And it is <em>aggregated</em>, often violently, in ways that destroy precisely the signal needed to recover from an incident.</p><p>This essay is about that gap. 
Not the gap between the system as designed and the system as composed (Part I), but the gap between the system as composed and the system as <strong>perceived</strong>. The two gaps are different, and they compound.</p><p>Three incidents, mechanically reconstructed: <strong>Slack&#8217;s autoscaler</strong> chasing metrics that had decoupled from the network failure underneath, while the dashboards that would have shown the decoupling failed at the same time as the system; <strong>GitHub&#8217;s 43-second partition</strong> and the 24-hour reconciliation that followed; <strong>Roblox&#8217;s 73-hour outage</strong>, where the monitoring stack failed at the same time as the system it monitored, and engineers spent two days debugging a fully dark cluster.</p><p>The pattern beneath all three is the same: the operator&#8217;s view of the system, and the system itself, were two different things. The interesting question is <em>why this is structural</em>, not accidental.</p><div><hr></div><h2>The observer problem, mechanically</h2><p>The canonical incident for this material is the <strong>Slack outage of 4 January 2021</strong>, the first business day after the holiday break, documented in detail in Slack&#8217;s post-mortem by Laura Nolan.</p><p>Slack runs on AWS, with services running in dedicated VPCs (Virtual Private Clouds) connected by <strong>AWS Transit Gateways</strong> (TGWs). TGWs are managed by AWS and intended to scale transparently. Slack&#8217;s traffic pattern is unusual: the platform is quiet over the holidays and then ramps to one of its biggest days of the year on the first Monday back, when clients reconnect with cold caches and pull down more data than usual.</p><p>On 4 January, the TGWs <strong>did not scale fast enough</strong> for that ramp. Around 6:00 AM PST, one of them began dropping packets. 
The packet loss caused widespread degradation in internal calls across Slack&#8217;s services, but the symptom was not yet visible to users.</p><p>Slack&#8217;s web tier autoscales on two signals: <strong>CPU utilisation</strong> and <strong>utilisation of available Apache worker threads</strong>. Here is where the failure begins. As packets dropped on the TGW, threads in the web tier spent more time waiting on slow or stalled backend calls. </p><p>Waiting threads do not burn CPU. So as the system became less able to serve users, <em>CPU utilisation actually dropped</em>. The autoscaler, looking at CPU, concluded the fleet was over-provisioned and <strong>downscaled</strong> the web tier.</p><p>Then the mini-peak at 7:00 AM PST arrived. Load increased against the now-smaller fleet on a degraded network. Apache worker thread utilisation climbed sharply (threads were waiting longer, so more of them were in use at any moment), and the thread-utilisation signal triggered aggressive <strong>upscaling</strong>. Slack attempted to add <strong>1,200 servers between 7:01 and 7:15 AM PST</strong>.</p><p>The scale-up failed. New instances are configured by an internal service Slack calls <code>provision-service</code>, which talks to other Slack systems and to AWS APIs over the same degraded network. Under the sudden load of 1,200 near-simultaneous provisioning requests, with elevated latency on every dependency call, <code>provision-service</code> hit two resource ceilings: <strong>the Linux open files limit</strong> and an <strong>AWS quota limit</strong>. </p><p>Most of the 1,200 instances were created but never fully provisioned. They counted against the autoscaling-group size limit, blocking further scale-up, but they did not serve traffic.</p><p>And then the second layer of the observer problem revealed itself. Slack&#8217;s dashboarding and alerting service had <strong>failed during the early stages of triage</strong>. 
The reason was structural: the monitoring stack ran in a different VPC from its backend databases, and the same TGW that was dropping packets sat on the path between them. <em>The failure that was breaking the web tier had also blinded the engineers trying to diagnose it.</em></p><p>For roughly the next hour, incident responders worked without dashboards. They had logs, <strong>command-line tools,</strong> and the ability to query metrics backends directly, but none of the pre-built queries that turn raw metrics into actionable views. </p><p>Some engineers were SSHed into production instances when the autoscaler deprovisioned them mid-investigation, abruptly ending their sessions. <code>provision-service</code> recovered around 8:15 AM PST; serving capacity reached a degraded-but-functional state by 9:15 AM PST; full recovery, after AWS engineers manually scaled TGW capacity, completed at <strong>10:40 AM PST</strong>.</p><p>The whole sequence is the observer problem in compounded form. The autoscaler responded first to a metric (<strong>CPU</strong>) that did not represent the system state, then to a metric (<strong>thread utilisation</strong>) that drove the wrong action under the conditions. </p><p>The control loop did exactly what it was designed to do; the <em>signals</em> it was acting on had decoupled from the <em>reality</em> of the network underneath. </p><p>And the observation surface that operators would normally have used to see this decoupling was itself, by architectural coincidence, a casualty of the same failure.</p><p>This is the <strong>observer problem</strong> in its operational form, and it has three structural sources, each worth pulling apart.</p><div><hr></div><h2>Instrumentation lag</h2><p>Every observability pipeline introduces delay between an event occurring in the system and that event being visible on a dashboard. 
The delay has multiple stages.</p><p>First, <strong>emission delay</strong>: the event happens, but the code that emits the metric or log runs after the event, and the emitted data usually sits behind a batching layer with a flush deadline (StatsD typical: 10s; OpenTelemetry batch span processor: 5s default, max queue 2048). Second, <strong>collection delay</strong>: the emitted data is scraped or pushed to a collector at fixed intervals (Prometheus scrape interval: commonly <strong>15s</strong>, against a built-in default of 1m, with <code>scrape_timeout</code> defaulting to <strong>10s</strong>). </p><p>Third, <strong>aggregation delay</strong>: the collector pre-computes summary statistics, often on a window ending some seconds in the past to allow late-arriving data. Fourth, <strong>render delay</strong>: the dashboard queries the storage layer and renders, typically on a 30s to 60s refresh.</p><p>End-to-end delay in a well-tuned production pipeline is typically <strong>15 to 60 seconds</strong>. In many real pipelines it is several minutes.</p><p>This is not a defect. It is the cost of producing observations that are coherent across thousands of hosts. The delay is the price paid for the metric being <em>computable</em> at all.</p><p>But the delay has a consequence that classical control theory makes precise. Any closed-loop control system whose feedback path introduces delay <code>&#964;</code> experiences a <strong>phase lag</strong> of <code>&#969;&#964;</code> radians at frequency <code>&#969;</code>. 
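</p><p>As a rough illustration of that relation, the sketch below computes the phase lag that a pure delay contributes at a given disturbance period. The class name, the method names, and the delay and period values are assumptions chosen for illustration, not measurements from any real pipeline. It models the delay alone; a real loop adds controller and actuator lag on top, so its margin would be narrower than these numbers suggest.</p>

```java
// Sketch only: phase lag contributed by a pure feedback delay.
// PhaseLagSketch and its methods are hypothetical names, not part of any library.
final class PhaseLagSketch {

    /** Phase lag, in degrees, of a pure delay tauSeconds at a disturbance with the given period. */
    static double phaseLagDegrees(double tauSeconds, double periodSeconds) {
        double omega = 2.0 * Math.PI / periodSeconds; // angular frequency in rad/s
        return Math.toDegrees(omega * tauSeconds);    // phase lag = omega * tau
    }

    /** Disturbance period at which the delay alone contributes 180 degrees of lag. */
    static double periodAtHalfCycleLag(double tauSeconds) {
        return 2.0 * tauSeconds; // omega * tau = pi  =>  period = 2 * tau
    }

    public static void main(String[] args) {
        double[] delays = {15.0, 60.0, 180.0}; // assumed end-to-end pipeline delays, in seconds
        double loadPeriod = 300.0;             // assumed period of the load oscillation, in seconds
        for (double tau : delays) {
            System.out.printf("tau=%.0fs lag=%.1f deg, half-cycle period=%.0fs%n",
                    tau, phaseLagDegrees(tau, loadPeriod), periodAtHalfCycleLag(tau));
        }
    }
}
```

<p>Under these assumed numbers, a 60-second pipeline delay contributes 72&#176; of lag against a five-minute load cycle, and a full half cycle (180&#176;) against any disturbance with a two-minute period.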
</p><p>The <strong>Nyquist stability criterion</strong> says, in essence, that a feedback loop becomes unstable when its total phase lag approaches 180&#176; at the frequency where the loop gain is unity; with even modest controller gain, sufficient phase lag turns negative feedback into positive feedback and the loop oscillates.</p><p>Concretely: for an autoscaler with a minute-scale measurement-to-action delay attempting to track load that varies on similar timescales, phase lag approaches the stability boundary. Aggressive scaling policies tip the loop into oscillation; this manifests operationally as the autoscaler <strong>over-provisioning, then over-scaling-down, then over-provisioning again</strong>, never settling. </p><p>Slack&#8217;s post-mortem describes a variant of this pattern: an initial downscale on CPU, followed by an aggressive upscale on <strong>thread utilisation</strong>, against a network problem the loop could not see at all.</p><p>The Shannon-Nyquist sampling theorem provides the converse bound: a control loop sampling at interval <code>T</code> cannot observe, and therefore cannot react to, disturbances whose period is shorter than <code>2T</code>. A 15-second <strong>Prometheus scrape interval</strong> is structurally blind to load dynamics on timescales below 30 seconds. The information about those dynamics has been <strong>aliased</strong> into the lower-frequency band, where it appears as noise.</p><p>Marc Brooker has written about this directly in the context of AWS load balancing: a control loop with delay longer than the time constant of the thing it controls <em>cannot stabilise that thing.</em> It can only chase it. </p><p>The Slack autoscaler chasing a CPU metric that had decoupled from real load (because waiting threads do not burn CPU) was operating in exactly this regime.</p><p>The mitigation is not to make the metrics faster, though that helps. 
The mitigation is to design control loops that <strong>do not depend on global metrics</strong>: rate-limiting at the boundary, admission control based on <strong>local queue depth</strong> observable in <code>O(1)</code> from inside the affected process, fallback to last-known-good rather than reactive scaling. </p><p>Local observations made by the component itself have zero collection delay because they bypass the pipeline entirely. Global observations always carry the pipeline&#8217;s phase lag.</p><p>This is the structural argument for <strong>autonomic</strong> behaviour at the component level. Components that defend themselves locally (with circuit breakers, backpressure signals propagated synchronously to upstream callers, and load-shedding triggered by their own queue depth) do not depend on a delayed control loop to survive. </p><p>Components that wait for the autoscaler to rescue them are operating inside a feedback loop whose stability margin is, almost always, narrower than anyone has measured.</p><div><hr></div><h2>Aggregation destroys the signal</h2><p>Every metric you look at on a production dashboard is an aggregate. A counter has been summed across hosts. A latency value has been percentiled across requests. A gauge has been averaged or maxed across a time window.</p><p>Aggregation is mathematically necessary; you cannot stare at every individual request. It is also, almost always, the thing that hides the failure.</p><p>The pathology is most visible in <strong>latency aggregation</strong>, and the canonical analysis is Gil Tene&#8217;s <em>How NOT to Measure Latency</em> (2015). 
The argument is worth deriving mechanically, because the conclusion is counter-intuitive and the mechanism is not.</p><p>Consider a load generator configured to issue requests at a rate of <code>R</code> per second; one request every <code>1/R</code> seconds. The generator records, for each response, the time elapsed between <em>send</em> and <em>receipt</em>. Call this <code>service_time</code>. The reported P99 is the 99th percentile of the <code>service_time</code> distribution across some number of samples.</p><p>Now suppose the <strong>system stalls completely</strong> for <code>T</code> seconds, then recovers. During the stall, no responses arrive. The generator, in its naive form, has two possible behaviours.</p><p>In the <em>coordinated</em> form, the generator waits for each response before sending the next. During the <code>T</code>-second stall, exactly one request is in flight; its <code>service_time</code> is recorded as <code>T</code>. The other <code>R&#183;T</code> requests that <em>should have been sent during the stall</em> are never sent. They do not appear in the histogram at all.</p><p>In the <em>uncoordinated</em> form, the generator sends on schedule regardless of whether responses arrive. During the stall, <code>R&#183;T</code> requests pile up in the kernel&#8217;s socket buffer or in the generator&#8217;s own queue. When the system recovers, those requests are drained; each one&#8217;s <code>service_time</code> is measured from the moment it was <em>sent</em>, not the moment it was <em>scheduled to be sent</em>.</p><p>The bias is statistical. Let <code>F_observed(t)</code> be the empirical CDF of <code>service_time</code> the generator records, and <code>F_true(t)</code> the CDF of latency that a real user (arriving according to a <strong>Poisson process</strong> at rate <code>R</code>) would experience. 
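</p><p>The coordinated case described above can be made concrete with a toy simulation. This is a minimal sketch under stated assumptions, not a model of any real load generator: the class and method names are hypothetical, the service answers in a fixed 1 ms when healthy, it stalls once for <code>T</code> = 10 s, and it is assumed to clear any backlog instantly on recovery. The sketch compares what a coordinated generator records with what callers arriving on the intended schedule would have experienced.</p>

```java
import java.util.Arrays;

// Sketch of coordinated omission under an assumed toy service model.
// Every name and number here is hypothetical and chosen only for illustration.
final class CoordinatedOmissionSketch {
    static final long INTERVAL_MS = 10;        // 1/R for R = 100 requests per second
    static final long SERVICE_MS = 1;          // service time when healthy
    static final long STALL_START_MS = 30_000; // the system stalls at this time...
    static final long STALL_MS = 10_000;       // ...for T = 10 seconds
    static final long DURATION_MS = 100_000;   // length of the test run

    /** Time at which a request sent at sendMs completes in the toy model. */
    static long completionTime(long sendMs) {
        long stallEnd = STALL_START_MS + STALL_MS;
        boolean duringStall = sendMs >= STALL_START_MS && sendMs < stallEnd;
        return (duringStall ? stallEnd : sendMs) + SERVICE_MS;
    }

    /** Coordinated generator: waits for each response before sending the next request. */
    static long[] coordinatedLatencies() {
        long[] buf = new long[(int) (DURATION_MS / INTERVAL_MS)];
        int n = 0;
        long now = 0;
        while (now < DURATION_MS) {
            long done = completionTime(now);
            buf[n++] = done - now;
            // Requests scheduled while this one was outstanding are silently skipped.
            now = Math.max(done, now + INTERVAL_MS);
        }
        return Arrays.copyOf(buf, n);
    }

    /** Latency seen by callers who arrive on schedule, whether or not the system is stalled. */
    static long[] intendedLatencies() {
        int n = (int) (DURATION_MS / INTERVAL_MS);
        long[] out = new long[n];
        for (int k = 0; k < n; k++) {
            long intended = k * INTERVAL_MS;
            out[k] = completionTime(intended) - intended;
        }
        return out;
    }

    /** Nearest-rank percentile; p is in (0, 1]. */
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[] observed = coordinatedLatencies();
        long[] intended = intendedLatencies();
        System.out.println("coordinated samples: " + observed.length
                + ", P99 = " + percentile(observed, 0.99) + " ms");
        System.out.println("on-schedule samples: " + intended.length
                + ", P99 = " + percentile(intended, 0.99) + " ms");
    }
}
```

<p>In this toy model the coordinated generator records 9,001 samples and a P99 of 1 ms, because the stall contributes a single slow sample. The 10,000 callers arriving on schedule see a P99 of 9,001 ms, since a thousand of them arrived during the stall.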
</p><p>In the coordinated form, the missing samples should have been drawn from the slowest part of the latency distribution; their absence systematically truncates the right tail of <code>F_observed</code>. The quantile function <code>Q_observed(0.99) = F_observed&#8315;&#185;(0.99)</code> is, by construction, a lower bound on <code>Q_true(0.99)</code>, with the gap widening as <code>T</code> grows.</p><p>The user who clicked at time <code>t</code> and got a response at time <code>t + T + &#949;</code> experienced a latency of <code>T + &#949;</code>. The histogram has either zero entries near <code>T + &#949;</code> (coordinated form) or entries clustered near <code>&#949;</code> (uncoordinated form). In neither case does the percentile reflect what the user felt.</p><p>The correction <strong>Tene proposes</strong> (HdrHistogram&#8217;s <code>recordValueWithExpectedInterval</code>) is to <em>synthesise</em> the missing samples: for each measured <code>service_time</code> exceeding the expected interval <code>1/R</code>, insert additional samples at <code>service_time &#8722; 1/R</code>, <code>service_time &#8722; 2/R</code>, ..., down to <code>1/R</code>. </p><p>These synthetic samples represent the users who would have arrived during the stall and would have experienced progressively shorter waits.</p><p>The implementation, in essence:</p><pre><code><code>void recordValueWithExpectedInterval(long value, long expectedInterval) {
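    // Always record the sample that was actually measured.
    recordValue(value);
    // No correction is needed unless the sample overran the expected interval.
    if (expectedInterval &lt;= 0 || value &lt;= expectedInterval) return;
    // Back-fill one synthetic sample per request that should have been sent
    // during the stall, each waiting one interval less than the previous one.
    long missingValue = value - expectedInterval;
    while (missingValue &gt;= expectedInterval) {
        recordValue(missingValue);
        missingValue -= expectedInterval;
    }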
}</code></code></pre><p>A handful of lines. One additional method call per recorded measurement. The result, when applied to real traces, is routinely an order-of-magnitude shift in the tail. A system reporting P99 = 200 ms under uncorrected measurement can report P99 = 2 to 4 seconds under coordinated-omission correction. </p><p>In that case the dashboard was lying by a <strong>factor of ten or more</strong>, not by accident, but by <em>construction</em> of how percentiles are computed from a fixed-rate sampler against a non-stationary service-time distribution.</p><p>The same pathology appears, in different shapes, throughout the metric stack. Server-side latency histograms measure only requests the server got to process. Requests rejected at the <strong>load balancer</strong>, dropped at the TCP layer, or held in the kernel accept queue do not appear. </p><p>The server&#8217;s P99 can be excellent while the <em>connection</em> P99 (which the user actually experiences) is catastrophic. The <strong>Slack incident</strong> is exactly this: the web tier was reporting acceptable internal latencies for the requests it was handling, because the requests <em>not</em> being handled were not in the denominator. Survivorship bias, in observation form.</p><p>The same problem appears in <strong>averaging</strong>. A service that handles two classes of request, 99% of them fast and 1% of them slow, will show a mean latency dominated by the fast class. If the slow class gets ten times slower during an incident, the mean moves far less than the tenfold regression those users experience. The average is structurally insensitive to the tail, which is where outages live.</p><p>Percentiles are better than averages, but only if computed correctly, and only if the percentile you care about is in the data. </p><p>P99 across a million requests has ten thousand data points in its tail and is statistically reliable. P99 across a thousand requests has ten data points and is statistical noise. P99.9 across a million requests has a thousand data points, marginal. 
P99.99 across a million requests has a hundred data points, useless. </p><p>The deeper into the tail, the more samples needed to stabilise it, and the deeper into the tail is exactly where the interesting failures live.</p><p>The reason failures live in the tail is itself a queueing-theoretic result. For an <strong>M/M/1 queue</strong> (Poisson arrivals at rate <code>&#955;</code>, exponential service times at rate <code>&#956;</code>, single server) with utilisation <code>&#961; = &#955;/&#956;</code>, the expected time a request spends in the system (queueing plus service) is <code>1/(&#956;(1&#8722;&#961;))</code>. As <code>&#961; &#8594; 1</code>, this <strong>diverges hyperbolically</strong>; the variance of that time grows even faster, as <code>1/(1&#8722;&#961;)&#178;</code>. </p><p>The practical consequence is that high-percentile latencies blow up far faster than utilisation increases: a service running near saturation has a P99 that is many multiples of its bare service time <code>1/&#956;</code>, and the ratio worsens sharply as you approach full utilisation. </p><p>Tails are not a measurement artefact; they are the <strong>physics of contention</strong>, and they are precisely what averages and low-resolution percentiles destroy.</p><p>The problem compounds when percentiles are themselves aggregated. The P99 of a service that is the union of ten hosts is <strong>not</strong> the average of those hosts&#8217; P99s, nor their maximum, nor any function of them; it is a quantile of the underlying merged distribution, which is unrecoverable once each host has been independently percentiled. </p><p>Many monitoring systems nevertheless average or otherwise combine per-host percentiles, producing a number labelled &#8220;P99&#8221; that is mathematically meaningless. 
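</p><p>A small sketch makes the merge problem concrete. The host sizes and latencies below are made-up illustration data, and the class and helper names are hypothetical; concatenating raw samples stands in for merging full histograms. The point is only that the fleet-wide P99 depends on how many requests each host served, which per-host percentiles no longer carry.</p>

```java
import java.util.Arrays;

// Sketch: a fleet-wide P99 cannot be rebuilt from per-host P99s.
// Host sizes and latencies below are made-up illustration data.
final class PercentileMergeSketch {

    /** Nearest-rank percentile; p is in (0, 1]. */
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Builds a sample array from (count, latencyMs) pairs. */
    static long[] samples(long... countThenLatency) {
        int total = 0;
        for (int i = 0; i < countThenLatency.length; i += 2) {
            total += (int) countThenLatency[i];
        }
        long[] out = new long[total];
        int pos = 0;
        for (int i = 0; i < countThenLatency.length; i += 2) {
            int count = (int) countThenLatency[i];
            Arrays.fill(out, pos, pos + count, countThenLatency[i + 1]);
            pos += count;
        }
        return out;
    }

    /** Concatenates raw samples; a stand-in for merging two full histograms. */
    static long[] merge(long[] a, long[] b) {
        long[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    public static void main(String[] args) {
        long[] hostA = samples(8990, 10, 10, 500);  // busy host, mostly fast
        long[] hostB = samples(900, 20, 100, 2000); // quiet host with a slow tail
        long p99A = percentile(hostA, 0.99);
        long p99B = percentile(hostB, 0.99);
        System.out.println("per-host P99s: " + p99A + " ms, " + p99B + " ms");
        System.out.println("average of P99s: " + (p99A + p99B) / 2 + " ms");
        System.out.println("P99 of merged samples: "
                + percentile(merge(hostA, hostB), 0.99) + " ms");
    }
}
```

<p>With this made-up data the per-host P99s are 10 ms and 2000 ms, their average is 1005 ms, and the P99 of the merged samples is 500 ms. None of the per-host summaries recovers it.</p><p>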
Theo Schlossnagle and the Circonus team have written extensively on this; the correct primitive is to store the <strong>full histogram</strong> and percentile <em>at query time</em>, after merging across hosts.</p><p>Three production-grade histogram primitives dominate:</p><ul><li><p><strong>HDR Histogram</strong> (Tene): fixed-precision logarithmic bucketing across many orders of magnitude (typically nanoseconds to hours at three significant digits), <code>O(1)</code> insert, mergeable across instances, widely used in the JVM ecosystem.</p></li><li><p><strong>t-digest</strong> (Dunning, 2013): centroid-based sketch with concentrated precision in the tails, useful when storage is constrained but tail accuracy matters; <code>O(log n)</code> insert in the worst case, mergeable with a controlled error bound.</p></li><li><p><strong>DDSketch</strong> (Masson, Rim, Lee, 2019): relative-error guarantee <code>&#945;</code>, log-bucketed similarly to HDR but with provable tail accuracy, <strong>fully mergeable without error accumulation</strong>; used by Datadog as its native primitive.</p></li></ul><p>All three solve the merge problem. None of them solve the <strong>storage</strong> problem: keeping a per-request histogram per dimension (customer, region, endpoint, version) at full cardinality costs roughly two orders of magnitude more than scalar metrics. Almost no organisation does it for everything.</p><p>Beyond latency, aggregation collapses <strong>cardinality</strong>. If a service is failing for one specific customer, on one specific endpoint, in one specific region, the aggregate error rate may show a 0.1% blip below any reasonable alerting threshold. The blip is the entire experience of that customer. 
Aggregation makes the rare invisible by averaging it against the common.</p><p>Charity Majors&#8217;s argument for <strong>high-cardinality observability</strong>, developed across her writing at Honeycomb and in <em>Observability Engineering</em> (O&#8217;Reilly, 2022, with Liz Fong-Jones and George Miranda), reduces to this: the questions that matter during an incident are almost always questions about specific slices of traffic, and any pre-aggregated metric has already destroyed the dimensions needed to slice on. </p><p>The <strong>pre-aggregation step</strong> is irreversible. Information theory enforces this; once you have computed the count of errors per minute, you cannot recover <em>which customers</em> produced those errors.</p><p>The cost of <strong>preserving cardinality</strong> is large; the cost of discarding it is invisible until the incident in which you need the dimension you discarded. The discipline is to know which dimensions are likely to matter, instrument those at full cardinality, and accept that some incidents will be invisible until someone re-instruments after the fact.</p><div><hr></div><h2>You can only see what you instrumented</h2><p>This is the <strong>third and most permanent</strong> of the three sources, and the hardest to mitigate, because it is a statement about the closure of the observable space.</p><p>Every metric, log line, and trace span in your system exists because some engineer, at some point, decided it would be useful. The decision was made under one model of how the <strong>system would fail</strong>. 
The decision was made before the migration that changed the failure modes. The decision was made by someone who has since left.</p><p>The <strong>observation space</strong> of a production system is, in effect, a fossil record of past concerns. It captures the questions someone thought to ask. The questions that nobody thought to ask are invisible, by definition, and they remain invisible until an incident forces someone to add the instrumentation in the middle of the firefight.</p><p>This is the operational meaning of Cindy Sridharan&#8217;s distinction (in <em>Distributed Systems Observability</em>, O&#8217;Reilly, 2018) between <strong>monitoring</strong> and <strong>observability</strong>. Monitoring is the practice of watching known failure modes by means of pre-defined metrics and alerts. </p><p><strong>Observability</strong> is the property of a system that allows you to ask questions about its state that you did not anticipate having to ask.</p><p>The two are different in kind. <strong>Monitoring</strong> is effective at handling familiar failures and useless at handling novel ones. Observability inverts the emphasis: it exists for the failures nobody predicted. A mature production environment requires both, but the second is much harder to achieve, because it requires preserving structured, high-cardinality data about every interesting event, in a form that can be queried after the fact along dimensions no one specified in advance.</p><p>The <strong>DynamoDB DNS</strong> race condition from <a href="https://open.substack.com/pub/softwarefrontier/p/how-systems-really-fail-part-i?r=3c7w5a&amp;utm_campaign=post-expanded-share&amp;utm_medium=web">Part I</a> is a clean example of the limit. The plan generation system had monitoring: are the Enactors running, are they applying plans, is Route 53 returning valid responses? All of these were green throughout the incident. 
</p><p>The question that would have caught the failure, <em>is there a window during which one Enactor has applied an older plan after another Enactor has deleted it,</em> was a question nobody had thought to ask, because the failure mode it describes had never happened.</p><p>There was no monitoring for it, because there was no model of it. Observability, in Sridharan&#8217;s sense, might have caught it: if the raw event stream of Enactor operations, with full cardinality on plan version and Enactor identity, had been queryable, an engineer during the incident could have constructed the query that revealed the interleaving. Whether anyone would have thought to construct that query in the first ninety minutes of the outage is a separate question.</p><p>Ben Sigelman, who designed Google&#8217;s Dapper tracing system and later co-founded LightStep, has argued that the practical limit of observability is set by the <strong>cost of the questions you are not yet asking.</strong> Storing every span, log, and structured event at full cardinality is theoretically ideal and economically impossible. </p><p>Every organisation makes a choice about which dimensions to preserve, and that choice is, in retrospect, always slightly wrong, because the next incident is the one whose relevant dimension was sampled out.</p><p>The discipline is not to eliminate this gap, which cannot be done, but to <em>acknowledge</em> it: to recognise that your dashboards are a model, that the model is incomplete, and that the moments when the model and the system disagree are the moments that matter most.</p><div><hr></div><h2>The three frames of reference</h2><p>The DynamoDB cascade in Part I introduced, almost in passing, an idea worth making explicit. 
During the outage, the system existed in three simultaneously valid states, depending on whose vantage point you took: the data plane saw a healthy service, the control plane saw correctly-served DNS, the client saw nothing.</p><p>This is not a metaphor. It is structural, and it has a theoretical grounding. <strong>Lamport&#8217;s 1978 paper</strong> <em>Time, Clocks, and the Ordering of Events in a Distributed System</em> established that in a distributed system without a shared global clock, the only well-defined ordering between events is the <strong>happens-before</strong> relation <code>&#8594;</code>, defined transitively by causal message passing. Events that are not connected by <code>&#8594;</code> are <strong>concurrent</strong>, and concurrent events have no meaningful temporal ordering across observers; each observer may legitimately see them in a different sequence. Each observer constructs a partial order from the messages it has received, and the partial orders need not agree.</p><p>Any distributed system at scale has at least three such frames, and they almost never agree.</p><p><strong>The data plane frame</strong> is the view of the components doing the actual work. Storage nodes know whether they are reachable on their primary network interfaces and whether their disks are responding. Compute hosts know whether their workers are processing requests. From this frame, the system is described in terms of internal state: queue depths, lock contention, GC pauses, file descriptor counts.</p><p><strong>The control plane frame</strong> is the view of the systems that manage the data plane. Schedulers, load balancers, service discovery, deployment systems, autoscalers. The control plane sees the data plane through its own observations, typically lagged metrics and periodic health checks. 
From this frame, the system is described in terms of declarative goals and reconciliation: how many instances should be running, how many are running, what is the gap?</p><p><strong>The client frame</strong> is the view of whatever is trying to use the system. This includes external customers but also internal services that depend on the one in question. The client sees the system only through its responses: latency, error rate, correctness. From this frame, the system is described in terms of the service contract being honoured or not.</p><p>In a healthy system, all <strong>three frames</strong> produce consistent descriptions. The data plane is processing requests, the control plane sees that the data plane is processing requests, and the client receives correct responses. This consistency is what allows operators to use any one frame as a proxy for the other two.</p><p>In an unhealthy system, the frames diverge, and the <em>pattern of divergence</em> is diagnostic. <strong>Data plane healthy</strong>, client failing: the failure is in the path between them, usually DNS, routing, or load balancing. Control plane healthy, data plane degraded: the control plane is observing a stale or filtered view of the data plane. </p><p>All three frames disagreeing: the system has entered a regime its designers did not anticipate, and the on-call engineer is going to have a long night.</p><h3>Frame divergence, mechanically: GitHub, 21 October 2018</h3><p>The 2018 <strong>GitHub MySQL</strong> split-brain is the textbook case of frame divergence, and it is worth reconstructing because the mechanism turns on which observers were in which partition.</p><p>At 22:52 UTC on 21 October, routine maintenance to replace <strong>failing 100G optical equipment</strong> severed connectivity between GitHub&#8217;s US East Coast network hub and its primary US East Coast data centre. The break lasted <strong>43 seconds</strong>. Not enough for a human to react. 
More than enough for everything that follows.</p><p>GitHub ran MySQL in a topology managed by <strong>Orchestrator</strong>, an open-source replication-topology manager that, in GitHub&#8217;s configuration, uses <strong>Raft consensus</strong> among its own nodes to decide when to promote replicas. The primary was in US East. </p><p>Replicas existed in <strong>US West </strong>and in a public-cloud region. Orchestrator nodes were distributed across all three. Crucially, Orchestrator&#8217;s automated failover was configured to promote across regional boundaries.</p><p>The Raft protocol requires a <strong>strict majority</strong> for any decision: for a cluster of <code>n</code> nodes, <code>&#8970;n/2&#8971; + 1</code> must agree. When the East Coast data centre dropped off the network, the Orchestrator nodes inside it were partitioned with it. The remaining nodes (US West plus US East public cloud) retained a quorum. </p><p>From their frame, the primary had failed; the only action under Raft&#8217;s liveness assumption was to elect a new leader and promote. Within seconds of the partition forming, the Orchestrator quorum began the <strong>leadership deselection process</strong>, opened a new Raft term, and promoted a US West MySQL replica to primary. Application traffic in the unaffected regions began flowing to it.</p><p>This is, in the language of CAP, the choice MySQL replication and Orchestrator had been configured to make: <strong>availability over consistency</strong>. When the network partitioned, the system preserved availability (writes continued to be accepted somewhere) at the cost of consistency (writes accepted in one partition were unknown to the other). Daniel Abadi&#8217;s <strong>PACELC</strong> refinement (2010) makes the framing sharper: <em>if Partitioned, choose between Availability and Consistency; Else, choose between Latency and Consistency</em>. 
GitHub&#8217;s topology had chosen <code>PA/EL</code>: availability under partition, latency in the steady state. The cost of that choice was paid, in full, during the 43 seconds the partition was active and the 40 minutes that followed.</p><p>The partition healed after 43 seconds. From the East Coast frame, nothing had happened: the local MySQL primary had continued serving writes the whole time, because applications in East continued routing to it. From the West Coast frame, <em>it</em> was now the primary, and writes were flowing in. <strong>Both databases had accepted writes for the duration of the partition, neither aware of the other&#8217;s writes.</strong> </p><p>Cross-region MySQL replication carries some lag in steady state, and that lag is the window in which writes acknowledged to clients on East had not yet reached West and were therefore not present on the newly-promoted West primary.</p><p>The trap closed in the next forty minutes. Once connectivity was restored, GitHub&#8217;s application tier saw the new West Coast primary and began directing writes to it. For nearly <strong>40 minutes</strong>, the West Coast accepted writes that the East Coast primary did not see. Meanwhile, the East Coast primary contained the <strong>few seconds</strong> of writes from the partition window that had never been replicated to West.</p><p>When engineers locked deployment tooling and assessed state, they found two databases with divergent histories. Reconciling by failing back to East would discard the 40 minutes of West Coast writes. Failing forward on West would discard the East Coast partition-window writes. Neither was acceptable.</p><p>GitHub chose to <strong>fail forward</strong>, preserving the 40 minutes of West Coast writes at the cost of latency: applications on the East Coast now had to make a cross-country round trip for every database call, adding that delay to operations that had been designed to complete in local-region time. 
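</p><p>The quorum rule behind this sequence of events can be sketched in a few lines. This is a minimal illustration, not Orchestrator&#8217;s implementation: the node names, the one-node-per-site layout, and the write labels are hypothetical; only the strict-majority rule <code>&#8970;n/2&#8971; + 1</code> comes from the account above.</p>

```python
def majority(n: int) -> int:
    """Strict majority for an n-node cluster: floor(n/2) + 1."""
    return n // 2 + 1


def has_quorum(reachable: set, cluster: set) -> bool:
    """A partition side may elect a leader only if it holds a strict majority."""
    return len(reachable.intersection(cluster)) >= majority(len(cluster))


def is_prefix(a: list, b: list) -> bool:
    """True if log a is a prefix of log b, i.e. b merely extends a."""
    return a == b[:len(a)]


# Hypothetical layout: one topology-manager node per site (illustrative only).
cluster = {"us-east-dc", "us-east-cloud", "us-west"}

# The partition isolates the East Coast data centre from the other two sites.
isolated_side = {"us-east-dc"}
remaining_side = cluster - isolated_side

print(majority(len(cluster)))               # 2
print(has_quorum(remaining_side, cluster))  # True: this side may promote a replica
print(has_quorum(isolated_side, cluster))   # False: no election, but the old primary keeps serving

# Both primaries accept writes while partitioned (hypothetical write labels).
shared = ["w1", "w2"]
east_log = shared + ["e3"]         # acknowledged on East, never replicated
west_log = shared + ["x3", "x4"]   # accepted by the newly promoted West primary

# Neither log extends the other, so no simple replay reconciles them.
diverged = not is_prefix(east_log, west_log) and not is_prefix(west_log, east_log)
print(diverged)                             # True
```

<p>Whichever log is declared authoritative, the entries unique to the other have to be reconciled out of band, which is the position GitHub&#8217;s engineers were in once the partition healed.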
</p><p>The site was effectively degraded for <strong>24 hours and 11 minutes</strong> while data was restored from backups, replication was rebuilt, and the orphaned East Coast partition-window writes were manually reconciled from binary logs. As the post-mortem records: one of the busiest clusters in the partition window contained <strong>954 writes</strong> that had to be reconciled by hand.</p><p>From the East Coast control plane&#8217;s frame, the system had operated correctly: writes were accepted, replication was healthy locally, no monitoring fired. From the Orchestrator quorum&#8217;s frame, failover happened exactly as designed when the primary became unreachable. </p><p>From the <strong>data plane&#8217;s frame</strong>, two independent write histories existed during a window neither side observed in full. From the client frame, some writes had succeeded that, in a consistent universe, would have been rejected.</p><p>The 43 seconds of partition were the <strong>trigger</strong>. The frame divergence was the <strong>failure</strong>. The trigger was unavoidable; physical networks partition. </p><p>The deeper unavoidability was theoretical: the <strong>FLP impossibility result</strong> (Fischer, Lynch, Paterson, 1985) proves that no asynchronous deterministic consensus protocol can guarantee both safety and liveness in the presence of even a single crash failure. </p><p>Every real consensus system, including Raft, breaks this tie pragmatically, in Raft&#8217;s case, by leaning on timing assumptions to maintain liveness. 
Those timing assumptions are exactly what fail during a network partition; the algorithm has no choice but to make progress on the basis of local observations that, <em>during</em> the partition, are sufficient for a local quorum decision but <strong>insufficient to determine the global state</strong>.</p><p>GitHub&#8217;s subsequent <strong>re-architecture</strong> eliminated cross-region automatic failover precisely because no observation surface available in real time was sufficient to detect the divergence as it was happening. </p><p>They moved the consistency/availability tradeoff from automatic to <strong>human-in-the-loop</strong>, accepting longer mean-time-to-recovery in exchange for not getting this specific decision wrong again.</p><div><hr></div><h2>The Heisenbug problem</h2><p>There is a class of bug in distributed systems that disappears when you try to observe it. The folklore name is <strong>Heisenbug</strong>, a pun on Heisenberg, whose uncertainty principle is popularly conflated with the observer effect; the technical name is <strong>observation-dependent failure</strong>.</p><p>These bugs exist because adding observation to a system changes the system. Logging a value takes time, which changes the timing of the surrounding code, which changes the order in which <strong>concurrent operations</strong> interleave, which changes whether the race condition fires. </p><p>Capturing a stack trace can acquire locks, which changes contention, which changes which thread reaches the critical section first. The act of looking at the system alters what the system does.</p><p>This is, again, not a metaphor. 
Modern observability tooling routinely takes 1-5% of a service&#8217;s CPU and a measurable fraction of its memory. <strong>eBPF-based profilers</strong>, distributed tracing, log aggregation: all of them consume resources that come from the same pool the application uses to do its work. In the steady state, the cost is acceptable. </p><p>In the regime where the <strong>system is already saturated</strong>, adding observation can be the perturbation that pushes the system from a marginal state into a failure state.</p><p>The mitigation is not to remove observation, which would leave the system unobservable. The mitigation is to design observation to be <strong>low-overhead and load-shedding</strong>: when the system is saturated, observation is the first thing to drop, not the last. </p><p>In practice that means sampling rather than full capture, head-based sampling rather than tail-based, and structured events written to local buffers rather than synchronous network calls.</p><p>The modern technical answer to this is <strong>eBPF</strong>: in-kernel observation programs verified for safety at load time and executed in response to kernel events. Because events are aggregated in kernel space and written into <strong>perf event arrays</strong> or <strong>BPF ring buffers</strong> shared with user-space readers via mmap, recording an event requires no system call from the observed application. </p><p>The cost of recording an event collapses to a few cache-line writes, with no context switch and no allocator pressure on the application path. </p><p>The Linux kernel&#8217;s eBPF verifier statically checks termination and memory safety at program load time, which is designed to ensure that an observation program cannot crash the kernel even if it has a bug; bpftrace, BCC, and <strong>Cilium&#8217;s Hubble</strong> all build on this substrate. 
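</p><p>The load-shedding principle stated above does not depend on eBPF and can be sketched in user space. The following is a minimal sketch under stated assumptions: <code>TelemetryBuffer</code> and its methods are hypothetical names, not the API of any tracing library, and a real exporter would drain the buffer from a background thread.</p>

```python
from collections import deque
import threading


class TelemetryBuffer:
    """Hypothetical bounded event buffer: recording never waits on I/O or on a slow exporter."""

    def __init__(self, capacity: int) -> None:
        self._events = deque()
        self._capacity = capacity
        self._lock = threading.Lock()
        self.dropped = 0  # meant to be exported as a metric, so the loss is visible

    def record(self, event: dict) -> bool:
        """Keep the event if there is room, otherwise shed it. Returns True if kept."""
        with self._lock:  # short in-memory critical section only
            if len(self._events) >= self._capacity:
                self.dropped += 1
                return False
            self._events.append(event)
            return True

    def drain(self, max_items: int) -> list:
        """Remove and return up to max_items of the oldest buffered events."""
        with self._lock:
            count = min(max_items, len(self._events))
            return [self._events.popleft() for _ in range(count)]


buf = TelemetryBuffer(capacity=3)
kept = [buf.record({"request_id": i}) for i in range(5)]
print(kept)           # [True, True, True, False, False]
print(buf.dropped)    # 2
print(buf.drain(10))  # the three events that were kept, oldest first
```

<p>The <code>dropped</code> counter matters as much as the buffer: shedding observation under load is a deliberate trade only if the loss is itself recorded.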
</p><p>The implication for the Heisenbug problem is that, for many workloads, eBPF-based observation has overhead small enough that <strong>observing the system no longer meaningfully alters the system&#8217;s behaviour</strong>.</p><p>Each of the sampling and buffering decisions above trades observability for the ability to keep the system running while it is being observed. The trade is acceptable only if you have made it deliberately. </p><p>In most systems, it has been decided by accident, by whatever the default configuration of the tracing library is, set by whichever engineer integrated it.</p><p>But the deepest version of the observation-disturbing-the-system problem is not about overhead. It is about <strong>circular dependency between the observation surface and the thing being observed</strong>. </p><p>When the monitoring system depends on the system it is monitoring, a failure in the monitored system blinds the <strong>monitoring system</strong>, and the operator loses access to the diagnostic data at precisely the moment they need it most. </p><p>There is no better illustration of this than the Roblox outage that began at 13:37 PDT on 28 October 2021 and, by the company&#8217;s count, lasted 73 hours.</p><div><hr></div><h3>The Roblox outage, mechanically</h3><p>Roblox at the time ran more than <strong>18,000 servers</strong> and <strong>170,000 containers</strong> across its own data centres, orchestrated using the HashiCorp stack: <strong>Nomad</strong> for scheduling, <strong>Vault</strong> for secrets, and <strong>Consul</strong> for service discovery, health checks, session locking, and as a KV store. </p><p>Consul was the central nervous system. Every service depended on it to find its peers.</p><p>A <strong>single Consul cluster</strong> supported the entire backend: 5 voter nodes plus 5 non-voter read replicas. This was, as the post-mortem would later note, a single point of failure of a kind that violated every textbook lesson about blast-radius isolation. 
</p><p>In the months leading up to October, Roblox had upgraded from Consul 1.9 to 1.10 to take advantage of a new <strong>streaming feature</strong> designed to reduce CPU and network bandwidth on large clusters. The feature had been incrementally enabled across services without incident. </p><p>On 27 October at 14:00, the day before the outage, it was enabled on the <strong>traffic routing</strong> service, and the number of routing nodes was increased by 50% in anticipation of end-of-year traffic.</p><p>At <strong>13:37 PDT on 28 October</strong>, Vault performance began to degrade and a single Consul server began exhibiting high CPU load. Engineers began to investigate; users were not yet impacted. The first signal was unusual write latency on Consul&#8217;s underlying KV store: the <strong>50th percentile</strong>, normally <strong>under 300 ms</strong>, had climbed to <strong>2 seconds</strong>.</p><p>The cluster was failing for two reasons that interacted, neither of which engineers identified for days.</p><p>The first was the streaming feature itself. HashiCorp would later explain that streaming, while overall more efficient than long polling, used <strong>fewer concurrency control elements (Go channels)</strong> in its implementation. </p><p>Under very high read <em>and</em> very high write load, the design exacerbated contention on a single Go channel, blocking writes and consuming CPU in kernel spin locks along the streaming subscription code path. This pathology had not appeared in HashiCorp&#8217;s pre-release benchmarks because it required the specific combination of large stream count and high churn rate that Roblox&#8217;s workload produced.</p><p>The second was buried inside Consul&#8217;s persistence layer. Consul uses <strong>BoltDB</strong>, an embedded Go key-value store inspired by LMDB&#8217;s memory-mapped design, to persist its <strong>Raft write-ahead log</strong>. 
</p><p>BoltDB&#8217;s design is a single memory-mapped file organised as a copy-on-write B+tree: every write transaction allocates new pages, never modifies existing ones, and commits by atomically swapping a single root pointer. This gives crash-safety, at the cost of page churn.</p><p>When pages become unreachable, <strong>BoltDB</strong> does not release them to the OS. Instead it tracks them in a <strong>freelist</strong> of free page IDs, which is rewritten in its entirety on every transaction commit. At normal scale, freelist maintenance is negligible. At <strong>Roblox&#8217;s scale</strong>, after months of accumulated Raft log writes, the freelist had grown pathologically.</p><p>The post-mortem provides the actual numbers, taken from a Consul server during the incident: the <strong>4.2 GB Raft log store contained only 489 MB of actual data</strong>. The remaining <strong>3.8 GB was empty space</strong>, tracked as free pages. </p><p>The freelist tracking those pages had grown to <strong>7.8 MB</strong>, containing <strong>nearly a million free page IDs</strong>. For every Raft log append, with all the batching Consul applies, a write of 16 KB or less was triggering a rewrite of the entire 7.8 MB freelist to disk.</p><p>This is the pathology. Each transaction commit performed: a search of the million-entry freelist for free pages; an update to that freelist; serialisation of the entire 7.8 MB freelist to disk; and an <code>fsync(2)</code> whose cost was dominated by the size of the dirty page set, which was now dominated by the freelist itself. </p><p>The work scaled linearly with <strong>freelist length</strong>, and the freelist length grew with every snapshot the system performed to keep itself trim.</p><p>Raft, sitting on top of BoltDB, has timing assumptions. The leader replicates log entries to followers and must commit them durably. 
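</p><p>The freelist figures quoted above can be cross-checked with simple arithmetic. The sketch below is a back-of-the-envelope check, not a measurement: the 4 KiB page size and 8-byte page IDs are assumptions about the on-disk layout, sizes are read as decimal megabytes and gigabytes, and 16 KB is taken as the upper bound on a single append.</p>

```python
PAGE_SIZE = 4096         # bytes per page; assumed (typical OS page size)
PAGE_ID_SIZE = 8         # bytes per free page ID; assumed (64-bit IDs)

free_space = 3.8e9       # bytes of empty space, as quoted from the post-mortem
freelist_size = 7.8e6    # bytes of freelist, as quoted from the post-mortem
append_size = 16 * 1024  # bytes; upper bound on one batched Raft log append

# How many free pages does the empty space correspond to?
free_pages = free_space / PAGE_SIZE

# How many page IDs fit in a freelist of the quoted size?
freelist_entries = freelist_size / PAGE_ID_SIZE

# Freelist bytes rewritten for each byte of the largest append.
amplification = freelist_size / append_size

print(f"free pages implied by empty space: {free_pages:,.0f}")
print(f"page IDs that fit in the freelist: {freelist_entries:,.0f}")
print(f"freelist rewrite vs. largest append: {amplification:,.0f}x")
```

<p>Under those assumptions the three figures agree with each other: roughly 928,000 free pages, room for 975,000 page IDs in the freelist, and a freelist rewrite about 476 times larger than the largest append that triggers it.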
</p><p>When BoltDB commit latency entered the multi-second range, leaders could not durably persist <strong>log entries fast enough</strong>; followers timed out, started elections, and a new leader was chosen, whose own BoltDB file was in much the same state, so it performed the same expensive freelist operations, became slow, lost its leadership in turn, and triggered another election. </p><p>The cluster was <strong>alive but unable to make progress</strong>: a Raft cluster trapped in a leader-flap loop, with each leader&#8217;s presence too brief to commit meaningful work.</p><p>This was an internal failure at the level of database page management, observed at the level of cluster leadership stability. The two layers were not connected in anyone&#8217;s mental model of the system. They were connected through a freelist data structure most engineers did not know existed.</p><p>The team&#8217;s first hypotheses, in order, were the ones the operator&#8217;s model suggested. They suspected <strong>degraded hardware</strong> and replaced a Consul node. Performance continued to suffer. </p><p>They suspected <strong>capacity</strong>, and replaced all the Consul nodes with new machines: <strong>128 cores</strong> (up from 64) with faster NVMe SSDs. As the post-mortem would later document, this <strong>made things worse</strong>: the new servers were dual-socket NUMA architectures, and the additional cores meant additional concurrent goroutines contending on the same <strong>Go channel</strong> in the streaming code path. </p><p>Cross-socket memory access added latency to operations that had been local on the old 64-core single-socket machines.</p><p>By 16:35 PDT on the 28th, concurrent users had dropped to 50% of normal. 
Subsequent attempts, resetting the cluster from a snapshot, blocking <strong>incoming traffic</strong> with <code>iptables</code> to bring it back under controlled conditions, reducing health-check frequency from 60 seconds to 10 minutes to give the cluster breathing room, all stabilised the system briefly and then returned it to the same 2-second KV write latency. </p><p>None of these interventions worked because none of them addressed the actual mechanism. The post-mortem is explicit: engineers did not identify the BoltDB freelist issue <em>during</em> the incident. <strong>HashiCorp</strong> engineers determined the root cause in the <strong>days after</strong> the outage ended.</p><p>This is where the observer problem becomes operationally devastating. <strong>Roblox&#8217;s monitoring infrastructure depended on Consul.</strong> When Consul was unhealthy, the dashboards that would have shown engineers what was happening inside Consul were themselves unable to report. The post-mortem describes this directly:</p><blockquote><p><em>There was a circular dependency between our telemetry systems and Consul, which meant that when Consul was unhealthy, we lacked the telemetry data that would have made it easier for us to figure out what was wrong.</em></p></blockquote><p>The diagnostic question the operator needed to ask, <em>what is the actual state of the BoltDB file inside the affected Consul instances,</em> required telemetry that the affected systems were supposed to provide. They could not. </p><p>Engineers were debugging a system whose internal state was now unobservable, against a <strong>failure mode </strong>buried two software layers deep in an open-source dependency that the affected engineers had not personally written.</p><p>The breakthrough came at 15:51 PDT on <strong>30 October</strong>, roughly 50 hours after the outage began, when engineers disabled the streaming feature across all Consul systems. KV write latency immediately returned to 300 ms. 
</p><p>The streaming contention had been suppressed; the underlying <strong>BoltDB freelist problem</strong> was still there, manifesting as a &#8220;<em>slow leader</em>&#8221; symptom in which certain leaders inherited the worst of the freelist state. </p><p>The team pragmatically worked around it by preventing those leaders from staying elected, and continued the long process of repopulating caches and restarting services.</p><p>Total downtime: <strong>73 hours</strong> by Roblox&#8217;s count, ending at <strong>16:45 PDT on 31 October</strong>, when 100% of players were given access. Measured from the first signs of degradation at 13:37 PDT on 28 October, the incident ran a little over 75 hours. </p><p>The trigger was the interaction between Consul streaming and BoltDB&#8217;s freelist; both bugs, both fixable, both ultimately fixed (the BoltDB freelist issue was resolved by migration to <strong>bbolt</strong>, the etcd-io fork of BoltDB, which supports a hashmap-based freelist).</p><p>The <strong>outage duration</strong> was a property of the observation surface. With working telemetry into Consul&#8217;s internal state, the freelist issue would likely have been visible in hours, not days. Without it, engineers were solving a Heisenbug with their eyes closed.</p><p>The lesson Roblox drew, encoded explicitly in their post-mortem and in the architectural changes that followed, was that observation surfaces must be <strong>independent</strong> of the systems they observe. </p><p>Telemetry must run on infrastructure that does not depend on the thing being measured. If the monitoring stack and the production stack share a common substrate, a failure in that substrate takes out both at once, and the operator is left to debug a <strong>fully dark system</strong>.</p><p>This is, in the language Part I used, an <strong>invariant</strong> at the boundary between the observation system and the production system: <em>the observability stack must function independently of any system whose state it reports.</em> </p><p>The invariant is rarely enforced. 
It is rarely even written down. It is one of the assumptions the operator does not know they have made until the day Consul stops responding and the dashboards go dark with it.</p><div><hr></div><h2>The operator&#8217;s simulation</h2><p>Every operator, when debugging a system, is in fact debugging a <em>model</em> of the system that lives in their own head. </p><p>The model was built from <strong>architecture documents</strong>, code reading, prior incidents, and conversation with colleagues. The model is a simplification, by necessity, because the system is too complex to fit in one head.</p><p>The model is also wrong, in specific ways, and the operator does not know which specific ways until the model and the system diverge.</p><p>This determines whether an incident is resolved in twenty minutes or six hours. During an outage, the operator performs inference: <em>symptoms</em> are observed, <em>hypotheses</em> are <strong>generated from the model</strong>, <em>tests</em> of those hypotheses are designed and executed. The hypothesis space the operator can explore is bounded by the model the operator has.</p><p>If the failure mode lives outside the model, the operator cannot generate the hypothesis that would lead to it. They will iterate, increasingly desperately, on hypotheses inside the model, none of which fit the symptoms, until either someone with a better model joins the call or the system recovers on its own.</p><p>The <strong>Cloudflare incident from Part I</strong> is partly an instance of this. The first ninety minutes were spent on the hypothesis that the failure was an external DDoS attack, because the oscillation between good and bad states matched the signature of <strong>intermittent external pressure</strong> better than it matched the signature of anything Cloudflare&#8217;s engineers had a model for. 
</p><p>The model said &#8220;<em>oscillation means external adversary.</em>&#8221; The reality was that oscillation meant a gradual rollout against a five-minute regeneration cycle, but that failure mode was not in anyone&#8217;s model until the post-mortem.</p><p><strong>Richard Cook&#8217;s</strong> <em>How Complex Systems Fail</em> (1998) names this directly: every operator&#8217;s view of the system is a <em>practitioner-constructed simulation,</em> assembled from training, experience, and the artefacts of past incidents. </p><p>The simulation diverges from reality <strong>continuously and silently</strong>. The role of the practitioner during an incident is to detect the divergence, update the simulation, and act on the updated version, all in real time, under pressure, with incomplete information.</p><p>The systems that recover quickly from incidents are not the systems with the best dashboards. They are the systems whose operators have the <strong>best simulations</strong>, and whose dashboards expose enough raw data that the simulation can be <strong>corrected in flight</strong>.</p><div><hr></div><h2>The discipline of seeing what is there</h2><p>The compressed form of this essay, the operational counterpart to Part I&#8217;s diagnostic question, is also a question, asked of every observation surface in the system: <em>what is this metric not telling me, and what would I look at if it were lying?</em></p><p>Every dashboard has a set of failure modes for which it is the right view, and a set of failure modes for which it is misleading. Both sets are large. The first is documented (it is the reason the dashboard was built). The second is almost never documented, because the failure modes it contains are, by definition, the ones nobody anticipated.</p><p>The discipline is to know, for every observation, what it is showing and what it is hiding. </p><blockquote><p><em>The metric is an aggregate over what dimensions? Over what time window? With what sampling? 
</em></p><p><em>Does the observation surface share a substrate with the system it observes? </em></p><p><em>What would a failure that is invisible on this metric look like, and what other view would catch it? </em></p><p><em>When the metric and the user experience disagree, which one should be believed?</em></p></blockquote><p>The answer to the last question is always: the user experience. The metric is a model. The user is in the data plane. The data plane is the system.</p><p>The discipline of <strong>observation-aware engineering</strong> is not to build dashboards that show everything; that is impossible, and the attempt produces dashboards that show nothing useful. </p><p>It is to know, for every dashboard, <em>what it cannot show</em>, to keep the raw event stream queryable for the moments when the dashboard and the system disagree, and to keep the observation stack <strong>architecturally independent</strong> of the systems it watches.</p><p>This is what separates the systems where the operator spends the first hour of an incident <em>narrowing the hypothesis</em> from the systems where the operator spends the first hour <em>arguing about whether the dashboards are correct</em>.</p><p><strong>Part III</strong> will move from observation to action: what it takes to operate a system whose state you cannot fully see, whose feedback loops fight you, and whose composition you do not control. The technical name for this is <em>control theory under uncertainty.</em> The operational name is on-call.</p><div><hr></div><h2>One more thing&#8230;</h2><p>The reason these failures keep happening is that engineers are trained to reason about the <strong>logical</strong> structure of their systems, not the <strong>physical</strong> dynamics of how those systems execute under load.</p><p>The same gap exists, in concentrated form, in <strong>GPU programming</strong>. CUDA correctness is necessary but not sufficient. 
Performance lives entirely in how memory traffic, warp scheduling, instruction issue, and inter-block synchronisation interact under realistic workloads: the same composition-and-observation problem, compressed into a single die.</p><p>I wrote a <strong>deep guide on CUDA</strong> from this perspective: not isolated tricks, but how to reason about the GPU as a coupled dynamical system whose performance regimes are as discontinuous as any distributed system&#8217;s.</p><p><a href="https://lorenzobrada.gumroad.com/l/cuda_mastery">Read the CUDA Guide on Gumroad</a></p>]]></content:encoded></item><item><title><![CDATA[How Systems Really Fail, Part I]]></title><description><![CDATA[The hidden physics of distributed outages, metastable cascades, and the assumptions that silently destroy systems at scale.]]></description><link>https://www.thesoftwarefrontier.com/p/how-systems-really-fail-part-i</link><guid isPermaLink="false">https://www.thesoftwarefrontier.com/p/how-systems-really-fail-part-i</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Mon, 11 May 2026 11:08:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!p7K1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecc7867d-9a2e-4c72-a92c-19f6abf85478_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p7K1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecc7867d-9a2e-4c72-a92c-19f6abf85478_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!p7K1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecc7867d-9a2e-4c72-a92c-19f6abf85478_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!p7K1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecc7867d-9a2e-4c72-a92c-19f6abf85478_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!p7K1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecc7867d-9a2e-4c72-a92c-19f6abf85478_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!p7K1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecc7867d-9a2e-4c72-a92c-19f6abf85478_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p7K1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecc7867d-9a2e-4c72-a92c-19f6abf85478_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ecc7867d-9a2e-4c72-a92c-19f6abf85478_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2588602,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/195937335?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecc7867d-9a2e-4c72-a92c-19f6abf85478_1672x941.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!p7K1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecc7867d-9a2e-4c72-a92c-19f6abf85478_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!p7K1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecc7867d-9a2e-4c72-a92c-19f6abf85478_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!p7K1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecc7867d-9a2e-4c72-a92c-19f6abf85478_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!p7K1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecc7867d-9a2e-4c72-a92c-19f6abf85478_1672x941.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><h2>Intro</h2><p>There is a version of <strong>distributed systems</strong> that exists in textbooks, RFCs, and the architecture diagrams that get drawn during the first month of a new project. </p><p>In that version, failures are <strong>discrete events</strong>: a node dies, a network partitions, a disk fills. Each event has a name. Each name has a mitigation. The mitigations compose. The composition is correct.</p><p>Then there is the version of distributed systems that exists in production at 03:47 UTC, when the <strong>on-call engineer</strong> is staring at a dashboard that shows everything green except for a customer-impact metric that has been climbing for nine minutes. </p><p>The runbook does not apply because it was written for the system that existed before the migration, the last engineer who understood the offending subsystem left the company<strong> seven months ago</strong>, and the only documentation is a Confluence page from 2023 that contradicts itself in the third paragraph.</p><p>This series is about the <strong>second version.</strong></p><p>It is not about how to design distributed systems. There are good books for that. It is about what happens to those designs after they meet reality: after the load grows by a <strong>factor of fifty</strong>, after three reorgs change the ownership of half the services, after the configuration file that was supposed to be immutable acquires a small permissions change on a Tuesday morning in November. </p><p>It is about the failure modes that emerge <strong>not from broken components</strong> but from the interaction between working ones. 
It is about why debugging at scale is not a technical activity but an epistemic one. </p><p>And it is about the <strong>design decisions</strong>, often made years before the outage, that determine whether a system has a fighting chance when the failure arrives.</p><p>Five essays. Each stands alone. They share a thesis: the gap between how engineers reason about systems and <strong>how systems actually behave </strong>is not a knowledge problem. It is a structural property of complexity. </p><p>The faster you accept this, the better your systems will be.</p><div><hr></div><h2>The Composition Problem</h2><p>On Monday, 17 November 2025, an engineer at Cloudflare merged a change to a permissions policy on the company&#8217;s <strong>ClickHouse database</strong> clusters. </p><p>The change was part of a long-running effort to migrate distributed queries from a shared system account to <em>per-user authentication</em>, so that query limits and access grants could be evaluated at <strong>finer granularity.</strong> It was the right kind of change. Reviewed, staged, rolled out gradually across cluster nodes, exactly as a careful operator would do it.</p><p>At 11:05 UTC the following morning, the rollout reached a critical threshold. Twenty-three minutes later, the internet broke.</p><p>At 11:28 UTC, Cloudflare&#8217;s network, which fronts roughly 20% of the websites on the public internet, began returning<strong> HTTP 5xx errors </strong>at scale. ChatGPT failed. X failed. Spotify, Discord, Canva, Figma, 1Password, Trello. </p><p>The outage lasted until 14:30 UTC for core traffic, with full restoration at 17:06 UTC. 
<strong>Matthew Prince</strong>, Cloudflare&#8217;s CEO, would later describe it as the worst outage since 2019. Estimated revenue loss across the affected ecosystem ran into the hundreds of millions of dollars.</p><p>The <em>chain of causation</em>, once it was understood, fits in a paragraph.</p><p>Cloudflare&#8217;s <strong>Bot Management module</strong> runs inside its core proxy (a system called FL, with a newer version FL2). The module scores every request as bot-or-human using a machine-learning model. </p><p>That model takes as input a &#8220;<em>feature configuration file</em>&#8221;, a list of per-request features, which is <strong>regenerated every five minutes</strong> by a query against a ClickHouse cluster. The regeneration query reads from <code>system.columns</code>, ClickHouse&#8217;s metadata table:</p><pre><code><code>SELECT name, type
FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;</code></code></pre><p>Note what is <em>not</em> in this query: a filter on the <strong>database name.</strong> The query implicitly assumed that <code>system.columns</code> would only return columns from the <code>default</code> database, because before the permissions migration users only had visibility into <code>default</code>. </p><p>ClickHouse&#8217;s <strong>distributed table engine</strong> actually stores shards in an underlying physical schema named <code>r0</code>. The new permissions policy granted explicit access to <code>r0</code>. After the change, the same query returned columns from both <code>default</code> and <code>r0</code>, roughly doubling the row count.</p><p><strong>Those rows</strong> were used directly to construct the feature file. The file had previously contained around 60 features. It now contained more than 200.</p><p>Downstream, in the Rust code that loaded the file into the FL2 proxy, there was a preallocated array sized for a <strong>hard ceiling</strong> of exactly 200 features: a performance optimisation so that runtime feature lookups would never allocate. </p><p>When the <strong>oversized file</strong> arrived, the load path returned <code>Err(_)</code>. The calling code, written under the assumption that this could not happen, called <code>.unwrap()</code> on the Result. </p><p>The worker thread panicked with the now-public string:</p><pre><code><code>thread fl2_worker_thread panicked: called Result::unwrap() on an Err value</code></code></pre><p>Every request routed through that worker returned 5xx.</p><p>The damage was amplified by a <strong>second-order property</strong>. The permissions change was being rolled out gradually across the ClickHouse cluster, so for nearly an hour only some cluster nodes returned the duplicated result. </p><p>The feature file <strong>regenerated every five minutes</strong>, and whether the run hit an upgraded node or a non-upgraded node was effectively random. 
</p><p>The file therefore alternated unpredictably, every five minutes, between &#8220;<em>good</em>&#8221; and &#8220;<em>bad</em>,&#8221; and the proxy fleet oscillated between recovery and failure on a five-minute cycle. </p><p>From the dashboards, this looked exactly like an <strong>active DDoS attack</strong>, an external adversary probing the network with intermittent pressure. </p><p>The <strong>incident commander</strong> spent the first two hours of the outage investigating that hypothesis, because the <em>signature</em> of the failure mimicked a known threat.</p><p>Read this again. Notice what is not in it.</p><p>There was no bug in the database. The new permissions behaviour was correct <strong>ClickHouse semantics</strong>. There was no bug in the query; it executed exactly as written. </p><p>There was no bug in the feature file format; it stored what it was given. There was no bug in the <strong>Rust proxy</strong>; its bounds check correctly refused to process malformed input rather than corrupting state. </p><p>There was also no bug in the deployment process; gradual rollout to a database cluster is exactly how you <strong>mitigate rollout risk</strong>. Every component, examined in isolation, behaved as designed, as documented, as code-reviewed.</p><p>The outage existed in the spaces between the components. It existed in an <strong>unwritten assumption</strong>: that the cardinality of the metadata query was bounded by the schema layout. </p><p>It existed in the gap between the team that owned the permissions migration and the team that owned the feature pipeline. </p><p>It even existed in the asymmetry between data Cloudflare treated as &#8220;<em>trusted</em>&#8221; (internally generated configuration) and data it treated as &#8220;<em>untrusted</em>&#8221; (everything from outside). </p><p>The failure was <strong>not a property</strong> of any component. 
It was a property of the system.</p><p>This is the central, uncomfortable fact about distributed systems: their failure modes are not documented because they cannot be documented. </p><p>They emerge from the<strong> composition of components</strong>, and the space of possible compositions grows faster than anyone can enumerate it.</p><div><hr></div><h2>Why decomposition breaks down</h2><p>Software engineering is almost entirely built on decomposition. </p><p>You take a <strong>hard problem</strong>, split it into smaller problems, solve each, and compose the solutions. The discipline assumes (implicitly, almost religiously) that the behaviour of the whole can be derived from the behaviour of the parts. </p><p>This is the foundation of <em>modular design</em>, encapsulation, microservices, contracts, type systems. It is what allows <strong>ten thousand engineers</strong> to build a system no one of them understands in full.</p><p>The assumption is wrong, or rather: it holds only within a regime, and the regime ends somewhere around the scale where a system has enough components, enough state, and <strong>enough concurrency</strong> that the <em>interactions</em> between components become a richer source of behaviour than the components themselves.</p><p>The formal version of this argument is older than computer science. Herbert Simon, in <em>The Architecture of Complexity</em> (Proc. Am. Phil. Soc., 1962), distinguished between <strong>decomposable</strong> and <strong>nearly-decomposable</strong> systems. 
</p><p>In a <strong>decomposable system</strong>, interactions <em>between</em> subsystems are negligible compared to interactions <em>within</em> them, and the whole behaves like the sum of independent parts.</p><p>In a nearly-decomposable system, this is <strong>approximately true</strong> on short timescales but not on long ones: the weak inter-subsystem couplings accumulate into qualitatively different behaviour. </p><p><strong>Simon&#8217;s claim</strong>, which has held up for sixty years across biology, economics, and engineering, is that all real systems of significant size are nearly-decomposable, not decomposable.</p><p>Distributed systems are an extreme case. The components have <strong>clean interfaces</strong> and look decomposable on a diagram. </p><p>But the interactions are mediated by shared resources (networks, clocks, storage, control planes) and those shared resources transmit perturbations between components in ways the diagram does not show. </p><p>A change in one component changes the <strong>load profile</strong> on the shared network, which changes the queueing behaviour at a different component, which changes the timing of its responses, which changes the <strong>retry behaviour</strong> of yet another component. The composition is opaque because the couplings are invisible.</p><p>Distributed systems theory has known a version of this for forty years. <strong>Fischer, Lynch, and Paterson (</strong><em><strong>JACM</strong></em><strong> 1985)</strong> proved that no deterministic protocol can guarantee consensus in a purely asynchronous system with even one faulty process, a result that, properly understood, is not about <strong>consensus algorithms</strong> but about the impossibility of producing globally consistent system behaviour from locally correct components under partial failure. </p><p>Brewer&#8217;s CAP conjecture (PODC 2000) and the Gilbert-Lynch proof (<em>ACM SIGACT News</em>, 2002) formalised the same point at the level of state. 
</p><p>Lamport&#8217;s &#8220;<em>Time, Clocks, and the Ordering of Events in a Distributed System</em>&#8221; (<em>CACM</em> 1978) showed that there is <strong>no observer-independent</strong> <strong>simultaneity</strong> in a distributed system without explicit synchronisation, meaning every &#8220;global view&#8221; of the system is a stitched-together fiction.</p><p>The classical literature focused on <em>discrete</em> failures: a node dies, a message is lost, a clock drifts. The modern failures are stranger. </p><p>They are failures of <strong>coupling</strong>: moments when two pieces of working software, communicating through an interface both implement correctly, produce a behaviour neither would produce alone. </p><p>The Cloudflare incident is one. The <strong>DynamoDB DNS</strong> <strong>race condition</strong> that took down AWS US-EAST-1 on 19&#8211;20 October 2025 is a more elaborate example of the same pattern, and it is worth reconstructing mechanically because it shows how thoroughly the composition can betray its components.</p><div><hr></div><h3>The AWS DynamoDB cascade, mechanically</h3><p>DynamoDB&#8217;s regional endpoint, <code>dynamodb.us-east-1.amazonaws.com</code>, is served by an <strong>internal DNS management</strong> system that exists because DynamoDB maintains hundreds of thousands of DNS records for a very large fleet of <strong>load balancers</strong>, and those records must be updated continuously as capacity is added, removed, and rebalanced.</p><p>The system has two logical components. 
The <strong>DNS Planner</strong> monitors load-balancer health and produces &#8220;<em>DNS plans</em>&#8221;, versioned snapshots of which load balancers receive which fraction of regional traffic. </p><p>The <strong>DNS Enactor</strong> reads plans and applies them to Route 53, AWS&#8217;s DNS service. For availability, three Enactors run in parallel, one per availability zone. They operate <strong>concurrently and independently</strong>; no distributed lock, no leader election, no coordination protocol. </p><p>The system was designed this way deliberately, so a <strong>single Enactor crashing mid-run </strong>would not stall propagation; the other two would simply pick up subsequent plans and continue.</p><p>To prevent stale plans from overwriting newer ones, each Enactor performs a freshness check before applying a plan. </p><p>To prevent <strong>unbounded growth</strong> of historical plans, each Enactor also performs a cleanup pass after applying its current plan, deleting plans significantly older than the current one. </p><p>The freshness check happens once, at the start of the application phase. The cleanup happens once, at the end.</p><p>This is, again, the kind of design that gets praised in <strong>code review.</strong> Independent. Stateless. Fault-tolerant. Each component does one well-bounded job.</p><p>Now consider what actually happened. At 23:48 PDT on 19 October (06:48 UTC on 20 October), Enactor A read plan #N&#8722;1 from the Planner and began applying it to Route 53. </p><p>For reasons AWS&#8217;s post-mortem describes as &#8220;<em>unusual delays</em>&#8221;, likely network-mediated queueing inside Route 53&#8217;s control plane, <strong>Enactor A&#8217;s update</strong> run took longer than normal. </p><p>In the meantime, the <strong>Planner produced plan #N. </strong>Enactor B picked up plan #N, performed its freshness check (newer than the currently applied plan: pass), and began its own update run. 
</p><p><strong>Enactor B</strong> finished first, applying #N to Route 53. It then began its cleanup pass, scanning for plans significantly older than #N and deleting them.</p><p>By the time Enactor A finished its delayed run and went to apply the <em>last few records</em> of plan #N&#8722;1, Enactor B had already applied #N to those same records. </p><p>Enactor A&#8217;s freshness check, made at the start of its run, had not detected this; the check was made when #N&#8722;1 was still the freshest plan, and that result was now stale. Enactor A overwrote those records with #N&#8722;1.</p><p>Now Enactor B&#8217;s cleanup pass arrived at plan #N&#8722;1. By Enactor B&#8217;s bookkeeping, #N&#8722;1 was significantly older than #N. Enactor B deleted plan #N&#8722;1. But Enactor A had just applied #N&#8722;1 to the regional endpoint records. </p><p>The records now pointed at a plan that did not exist. Route 53 dutifully served what it had: an empty answer set for <code>dynamodb.us-east-1.amazonaws.com</code>.</p><p>This is the worst possible<strong> DNS response</strong>. It is not <code>NXDOMAIN</code>, which clients treat as transient and retry. It is <code>NOERROR</code> with an empty <code>ANSWER</code> section; semantically &#8220;<em>this name exists, intentionally, with zero addresses.</em>&#8221; Compliant clients stop. There is no answer to retry.</p><p>Within seconds, every system inside and outside AWS that wanted to talk to DynamoDB in us-east-1 began failing to resolve its address. From the <strong>DynamoDB control plane&#8217;s view</strong>, the service was healthy: load balancers up, storage reachable, request handlers idle. </p><p>From Route 53&#8217;s view, the service was healthy: DNS was returning valid <strong>authoritative responses</strong>. From clients&#8217; view, the service had ceased to exist. 
Three different frames of reference, three different &#8220;<em>states</em>&#8221; of the same service, all simultaneously true within their own frame. The mismatch <em>between</em> them was the outage.</p><p>It took <strong>manual intervention</strong> from on-call engineers to identify the empty record, repair Route 53 by hand, and re-enable normal automation. DynamoDB DNS recovered in approximately three hours.</p><p>The cascade that followed lasted ten more hours, and is the second composition failure embedded inside the first. EC2&#8217;s <strong>DropletWorkflow Manager</strong> (DWFM), the system that maintains operational leases on the physical hypervisors hosting customer EC2 instances, stores its lease state in DynamoDB. </p><p>While DynamoDB was unreachable, DWFM could not renew leases. Existing leases expired silently. When DynamoDB recovered, <strong>DWFM woke up </strong>to discover that essentially every hypervisor in the region needed a fresh lease, and tried to issue them all at once. </p><p>The<strong> lease-renewal</strong> subsystem entered what AWS&#8217;s post-mortem calls &#8220;<em>congestive collapse</em>&#8221;, a regime where throughput of <em>useful</em> work approaches zero because the system is spending all its time servicing retries of work that has already timed out. </p><p><strong>Network Load Balancer</strong> health checks began failing en masse. New EC2 launches were impossible. The region was effectively down for production workloads until late that evening. Every design decision in this chain was <strong>defensible</strong>. Three Enactors instead of one, for availability. </p><p>Freshness check, to prevent old plans winning. <strong>Cleanup pass</strong>, to prevent unbounded growth. No distributed lock, to avoid coordination overhead and tolerate Enactor failures. <strong>DWFM storing state</strong> in DynamoDB, because what else would you use for a high-availability lease manager. 
</p><p>Each decision is the textbook answer to a specific risk. </p><p>The composition of all those textbook answers produced <strong>fifteen hours</strong> of regional unavailability and an industry-wide impact measured in <em>hundreds of millions of dollars.</em></p><div><hr></div><h2>Why documentation cannot close the gap</h2><p>The instinct, after this kind of incident, is to <strong>write better documentation.</strong> Add the failure mode to the runbook. Update the architecture diagram. Note the implicit assumption in a comment. <em>Surely, next time, we will know.</em></p><p>We will not. The reason is not laziness; it is combinatorial.</p><p>Consider a system with <strong>N components</strong>, each with a small number of internal states, dependencies, and inputs. The number of pairwise interactions grows as <strong>O(N&#178;). </strong></p><p>The number of <em>trajectories</em> (sequences of states the system can traverse) grows much faster: for any reasonable model of state and concurrency,<strong> at least exponential in N</strong>. By the time N is in the low thousands (a serious production system), the trajectory space is unbounded for practical purposes.</p><p>Documentation is a<strong> linear medium.</strong> It can describe a finite number of states, interactions, and failure modes. The space of actual failure modes is not finite in any meaningful sense. 
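</p><p>To put rough numbers on the counting argument above, the sketch below assumes, purely for illustration, that every component has only two internal states. The helper names <code>pairwise_interactions</code> and <code>joint_states</code> are hypothetical, and the joint state count is only the space through which trajectories run, not a count of the trajectories themselves.</p>

```python
from math import comb


def pairwise_interactions(n):
    """Number of distinct component pairs: n * (n - 1) / 2, i.e. O(n^2)."""
    return comb(n, 2)


def joint_states(n, states_per_component=2):
    """Size of the joint state space if every component has the same
    (assumed) number of internal states: exponential in n."""
    return states_per_component ** n


for n in (10, 100, 1000):
    digits = len(str(joint_states(n)))
    print(f"N={n}: {pairwise_interactions(n)} pairs, "
          f"joint state count has {digits} decimal digits")
```

<p>For N = 1000 the pair count is 499,500, a number a large team could in principle review. The joint state count, under the same two-state assumption, has 302 decimal digits.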
</p><p>What documentation actually captures, in practice, is the <strong>failure modes</strong> that have already happened: the ones recovered from, written up, discussed in architecture review. </p><p>This is useful, but it is fundamentally <strong>backward-looking.</strong> The next outage is, almost by definition, the one not yet documented. It lives in some currently-undocumented region of the trajectory space, which the system will enter for the first time when some perturbation pushes it there.</p><p>This is not an <strong>indictment of documentation</strong>. Runbooks save lives. Post-mortems compound institutional knowledge. The point is that no quantity of documentation, however thorough, can close the gap between the system as designed and the system as composed. </p><p>The gap is structural. It widens with scale.</p><div><hr></div><h2>The pattern beneath the patterns</h2><p>If you read enough <strong>post-mortems</strong> (<em>Dan Luu&#8217;s catalogue on GitHub remains the best free education in this material</em>), a pattern emerges. </p><p>The triggers vary wildly: a permissions change, a DNS update, a config push, a deploy, a hardware failure, a thundering herd. The <em>shape</em> of the failure is often the same.</p><p>Nathan Bronson and his collaborators, in a 2021 HotOS paper, gave this shape a name: <strong>metastable failure</strong>. The framing has become foundational, and is worth restating precisely because it is the closest the field has come to a formal theory of why composition produces outages.</p><p>A metastable failure occurs in an <strong>open system</strong> with an uncontrolled load source. The system has at least two stable operating regimes: a <em>stable</em> regime, in which a <strong>transient perturbation</strong> decays back to equilibrium, and a <em>metastable failure</em> regime. 
</p><p>In the metastable failure regime, the system is functioning (consuming CPU, processing messages, producing output) but its useful throughput, which Bronson terms <strong>goodput</strong>, has collapsed. </p><p>The system transitions between regimes via a <strong>trigger</strong>: a load spike, a deploy, a partial failure, a configuration change. </p><p>What keeps the system in the failure regime, even after the trigger is removed, is a <strong>sustaining effect</strong>: a positive feedback loop, usually involving work amplification, in which the system&#8217;s response to its own degraded state increases the load on itself further.</p><p>The <strong>canonical example,</strong> paraphrased from the paper:</p><p>A web tier calls a database tier through a connection pool. Database latency is normally well below the client&#8217;s request timeout. A <strong>brief perturbation</strong>, such as a network blip or a slow GC pause, causes some requests to exceed the timeout. </p><p>The client retries. The retry is a <em>new</em> request, added to the existing load. Database queue depths grow. Latency increases, pushing more requests past the timeout. More retries fire. Each <strong>timed-out request</strong> still consumed full database work to compute its answer, but no client ever saw it; that work was wasted. </p><p>The system is now processing 3&#215; its normal request volume (originals plus retries) and completing them all, but every client is timing out before the answer arrives. <em>Goodput is zero. Throughput is at saturation.</em> The trigger (the original network blip) is long gone. The <strong>retry storm</strong> is sustaining the failure regime on its own.</p><p>The key insight is that the root cause of a metastable failure is the <strong>sustaining loop</strong>, not the trigger. Triggers are infinitely various and mostly cannot be prevented. 
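</p><p>The retry storm can be reproduced in a few lines. What follows is a toy discrete-time model, not code from the Bronson paper: the function name <code>simulate</code> and every parameter value (8 arrivals and 10 units of capacity per tick, a 5-tick client timeout, a 10-tick stall as the trigger) are assumptions chosen for illustration. It applies the same trigger with and without client retries and reports goodput long after the trigger has ended.</p>

```python
from collections import deque


def simulate(max_retries, ticks=200, arrivals=8, capacity=10,
             timeout=5, trigger=range(10, 20), window=20):
    """Return goodput (requests answered before their client timed out)
    summed over the last `window` ticks of the run."""
    queue = deque()  # FIFO of (enqueue_tick, attempt_number)
    goodput_per_tick = []
    for t in range(ticks):
        # Clients whose queued request has just exceeded the timeout give up
        # on it and, if they have retries left, enqueue a fresh attempt.
        retries = [attempt + 1 for (born, attempt) in queue
                   if t - born == timeout + 1 and max_retries > attempt]
        queue.extend((t, attempt) for attempt in retries)
        # New client requests arrive at a constant, uncontrolled rate.
        queue.extend((t, 0) for _ in range(arrivals))
        # The trigger: the server completes no work during a short window.
        budget = 0 if t in trigger else capacity
        good = 0
        for _ in range(min(budget, len(queue))):
            born, _attempt = queue.popleft()
            # The work is always done, but it only counts as goodput if the
            # client is still waiting for the answer.
            if timeout >= t - born:
                good += 1
        goodput_per_tick.append(good)
    return sum(goodput_per_tick[-window:])


print("no retries :", simulate(max_retries=0))
print("two retries:", simulate(max_retries=2))
```

<p>Under these assumptions, the run without retries drains its backlog and is back at full goodput well before the end of the simulation, while the run with two retries per request is still at zero goodput 180 ticks after the trigger ended. The trigger is identical in both runs; only the sustaining loop differs.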
</p><p>Sustaining loops are <em>finite </em>and <em>identifiable</em>, and if you eliminate them, the same trigger fails to produce the same outcome.</p><p>A follow-up paper, <em>Metastable Failures in the Wild</em> (Huang et al., OSDI 2022), examined 22 publicly disclosed incidents at 11 major organisations and concluded that at least 4 of the previous 15 major AWS outages fit the metastable pattern. </p><p>The October 2025 DynamoDB incident makes 5. The EC2 cascade after DynamoDB recovered is the<strong> metastable pattern</strong> in textbook form: the trigger (DynamoDB DNS being empty) was resolved in three hours; the sustaining loop (<em>every hypervisor in the region simultaneously demanding lease renewal from a system that could not handle the surge</em>) took ten more hours to break, and only broke when AWS manually rate-limited the work.</p><p><strong>Marc Brooker</strong>, a principal engineer at AWS who has written extensively on this material, has pointed out that the appropriate intellectual framework here is not algorithms-and-data-structures but <strong>control theory and dynamical systems</strong>. </p><p>A metastable failure is, in dynamical-systems terms, a system with two stable attractors, where the perturbation required to push the system from the desirable attractor into the undesirable one is much smaller than the perturbation required to push it back. </p><p>The <strong>state-space geometry</strong> is asymmetric. Most production engineers have never thought about their systems this way, because computer science is taught around discrete models. 
The systems are continuous and dynamical, whether we model them that way or not.</p><div><hr></div><h2>Invariants and the cardinality contract</h2><p>The implication is not that distributed systems are unbuildable. They are obviously buildable. </p><p>The true implication is that the <em>mental model</em> under which most distributed systems get built (components compose, contracts compose, correctness composes) is wrong in a way that matters for production behaviour. </p><p>The discipline that replaces this mental model is the explicit enforcement of <strong>invariants</strong> at every component boundary, including internal ones.</p><p>An invariant, in this context, is a <strong>property of a value</strong> that the consumer&#8217;s correctness depends on, but that the producer is not contractually obligated to maintain. The Cloudflare feature file had at least three such invariants, none enforced by any check at the boundary:</p><ol><li><p><strong>A cardinality bound.</strong> The Rust consumer required <code>n_features &#8804; 200</code>. The ClickHouse query had no <code>LIMIT</code>, no <code>WHERE</code> on database, and no schema constraint preventing growth.</p></li><li><p><strong>A schema invariant.</strong> The consumer assumed columns came from <code>default</code> only. The query implicitly assumed the same via the permissions model. Neither stated the invariant in code.</p></li><li><p><strong>A rate-of-change invariant.</strong> A doubling of feature count between two consecutive runs is, on its face, anomalous. No alarm fired on that delta.</p></li></ol><p>Each invariant was true for years. 
Each silently became false when an upstream change reshaped the world. The boundary between producer and consumer had no <strong>formal contract</strong>; the contract lived in the heads of engineers, some of whom had left the company.</p><p>The discipline that prevents this is not &#8220;<em>validation</em>&#8221; in the loose sense. It is the explicit, in-code, enforced declaration of every cardinality, ordering, schema, and <strong>freshness constraint</strong> that the consumer relies on, with explicit handling of violations: typically degradation to last-known-good rather than panic. </p><p>The Rust idiom for this is the difference between <code>.unwrap()</code> and explicit pattern matching on <code>Result</code>; the <strong>operational idiom</strong> is the difference between trusting upstream data and treating every input as adversarial regardless of source. </p><p>The cost of explicit handling is a <strong>few additional lines</strong> per consumer boundary. The cost of trusting upstream data is, occasionally, six hours of global downtime.</p><div><hr></div><h2>Sustaining loops and characteristic metrics</h2><p>The second formal property the Cloudflare and DynamoDB incidents share is the presence of <strong>sustaining loops</strong>: control loops whose response to system degradation increases the load on the system rather than decreasing it. </p><p>The discipline for finding these before they fire is to <strong>enumerate every feedback loop</strong> in the system and classify each one&#8217;s stability properties.</p><p>A feedback loop is stable if, when perturbed from equilibrium by a small amount &#949;, it returns to equilibrium with <strong>error decaying</strong> as some function f(t,&#949;) that approaches zero as t grows. </p><p>A feedback loop is sustaining if the same perturbation produces error that grows or stays bounded away from zero. 
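</p><p>A deliberately simplified sketch in Python makes the two cases concrete. It assumes a linear toy model, not drawn from either incident, in which each pass around the loop multiplies the error by a constant <code>gain</code>; real loops are nonlinear, and the function names, gain values, and tolerance here are illustrative assumptions.</p>

```python
def simulate_loop(gain, epsilon, steps=50):
    """Return the error trajectory of a toy loop e[t+1] = gain * e[t]."""
    errors = [epsilon]
    for _ in range(steps):
        errors.append(gain * errors[-1])
    return errors


def classify_loop(gain, epsilon=0.01, steps=50, tolerance=1e-6):
    """Label the toy loop 'stable' only if the error decayed below tolerance."""
    final_error = abs(simulate_loop(gain, epsilon, steps)[-1])
    return "stable" if tolerance > final_error else "sustaining"


# Hypothetical gains: a damped retry policy (0.5 retries per timeout),
# an amplifying one (1.5), and one sitting exactly on the boundary (1.0).
for gain in (0.5, 1.5, 1.0):
    print(gain, classify_loop(gain))
```

<p>Under these assumptions a gain of exactly 1.0 is classified as sustaining, in line with the definition above: the error never grows, but it stays bounded away from zero. 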
</p><p>The distinction is mathematically standard (<strong>Lyapunov stability</strong>) but is almost never applied to production systems, because engineers do not model their systems as dynamical systems.</p><p>The catalogue of loops in any non-trivial production system: </p><ul><li><p>retry policies (timeout &#8594; retry &#8594; load &#8594; timeout amplification); </p></li><li><p>autoscaling (latency &#8594; scale-up &#8594; cold-start latency &#8594; more scale-up); </p></li><li><p>lease renewal (load &#8594; renewal delay &#8594; lease expiry &#8594; mass renewal storm); </p></li><li><p>connection pooling (failure &#8594; reconnect &#8594; handshake load &#8594; failure); </p></li><li><p>cache warming (cold cache &#8594; DB load &#8594; DB slow &#8594; cache cannot warm); </p></li><li><p>health checks (slow response &#8594; marked unhealthy &#8594; traffic shifted to fewer hosts &#8594; those hosts slower). </p></li></ul><p>Each is a control loop. Each can be classified. The classification is rarely written down.</p><p>The observability counterpart of this classification is what Bronson calls <strong>characteristic metrics</strong>: observations of the <em>loop state itself</em>, not of the loop&#8217;s inputs or outputs. </p><p>Queue depth is a loop-state observable; request rate is not. Retry rate is a loop-state observable; error rate is not. Lease renewal latency is a loop-state observable; lease expiry rate is not. 
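</p><p>The difference between a loop-state observable and an input observable can be sketched with a toy queue in Python. This is a minimal illustration under assumed numbers (10 arrivals per tick, capacity of 10 or 8 per tick); the function name and parameters are hypothetical and do not come from any incident report.</p>

```python
def run_queue(arrivals_per_tick, served_per_tick, ticks):
    """Simulate a single queue; return (request_rate, queue_depth) per tick."""
    depth = 0
    samples = []
    for _ in range(ticks):
        depth += arrivals_per_tick              # new requests join the queue
        depth -= min(served_per_tick, depth)    # serve up to capacity
        samples.append((arrivals_per_tick, depth))
    return samples


# Same offered load in both runs; only the service capacity differs.
healthy = run_queue(arrivals_per_tick=10, served_per_tick=10, ticks=5)
degraded = run_queue(arrivals_per_tick=10, served_per_tick=8, ticks=5)

print("request rate:", [rate for rate, _ in healthy], [rate for rate, _ in degraded])
print("queue depth: ", [depth for _, depth in healthy], [depth for _, depth in degraded])
```

<p>In both runs the request rate is identical at every tick; only the queue depth separates the healthy run from the degraded one, which is why it is the loop-state observable worth instrumenting. 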
</p><p>The relationship between <strong>loop-state metrics</strong> and incident causality is direct: when a sustaining loop activates, its characteristic metric crosses out of its historical operating envelope before the user-facing symptom appears.</p><p><strong>Instrumenting characteristic metrics</strong> is the difference between detecting a metastable failure during its inflation phase (when mitigation is cheap) and detecting it after it has saturated (when mitigation requires load-shedding the user-facing service).</p><div><hr></div><h2>The diagnostic question</h2><p>In <strong>compressed form</strong>, the entire discipline reduces to a single question, asked of every component boundary in the system: <em>what am I assuming about my input that is not enforced by a check in this code?</em></p><p>Every such unenforced assumption is a future incident. The space of unenforced assumptions is large but finite, and it <strong>can be enumerated</strong>. Most engineering organisations have never done this enumeration. </p><p>Those that have done it produce systems that fail in less catastrophic ways, not because they fail less often, but because the failures that occur are <strong>caught at the boundary</strong> where the assumption was violated, rather than three layers downstream after corruption has propagated.</p><p>The system you have is not the system you designed. The system you have is the composition. The <strong>composition is opaque</strong>, and the opacity is permanent, but the opacity at every individual boundary is not permanent. 
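</p><p>One way to act on the diagnostic question is to write each assumption down as a check at the boundary. The sketch below, in Python, declares the three invariants discussed earlier for the feature file and degrades to last-known-good on violation. It is an illustration under stated assumptions, not the code either company runs: the bound of 200 and the <code>default</code> database come from the essay, while the row shape, the growth threshold, the <code>other_db</code> name, and every function name are hypothetical.</p>

```python
MAX_FEATURES = 200            # cardinality bound quoted in the essay
ALLOWED_DATABASE = "default"  # schema invariant quoted in the essay
MAX_GROWTH_RATIO = 1.5        # hypothetical threshold for the delta check


class InvariantViolation(Exception):
    """Raised when an input breaks an assumption the consumer relies on."""


def check_feature_rows(rows, previous_count):
    """Validate (database, feature_name) rows against the declared invariants."""
    if len(rows) > MAX_FEATURES:
        raise InvariantViolation(f"{len(rows)} features exceeds {MAX_FEATURES}")
    foreign = sorted({db for db, _ in rows if db != ALLOWED_DATABASE})
    if foreign:
        raise InvariantViolation(f"rows from unexpected databases: {foreign}")
    if previous_count and len(rows) > previous_count * MAX_GROWTH_RATIO:
        raise InvariantViolation(
            f"feature count jumped from {previous_count} to {len(rows)}"
        )
    return [name for _, name in rows]


def load_features(rows, last_known_good):
    """Return (features, accepted); fall back to last-known-good on violation."""
    try:
        return check_feature_rows(rows, len(last_known_good)), True
    except InvariantViolation as err:
        print(f"rejecting new feature file: {err}")
        return last_known_good, False


# A clean file, then a file whose rows were duplicated from a second database.
good = [("default", f"f{i}") for i in range(60)]
duplicated = good + [("other_db", f"f{i}") for i in range(60)]

features, accepted = load_features(good, last_known_good=[])
stale, accepted_dup = load_features(duplicated, last_known_good=features)
print(accepted, len(features), accepted_dup, len(stale))
```

<p>The design choice worth noting is that the duplicated file is rejected at the boundary and the consumer keeps serving the previous 60 features, rather than failing later when a fixed-size buffer overflows. 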
</p><p>Each boundary is a place where assumptions can be made explicit and enforced. The discipline of <strong>composition-aware engineering</strong> is not to make the whole transparent. </p><p>It is to make every boundary honest about what it requires from its neighbours, and to refuse to operate when those requirements are not met. </p><p>This is what separates the systems that <strong>fail loudly</strong> at the seams from the systems that fail catastrophically in the centre.</p><div><hr></div><h2>One more thing&#8230;</h2><p><strong>Modern systems</strong> rarely fail because of a single broken component.</p><p>They fail because interactions between correct components create behaviours nobody explicitly designed for. The same thing happens in high-performance GPU systems.</p><p>Most <strong>CUDA optimisation</strong> is not about isolated tricks. It is about understanding how kernels, memory hierarchies, scheduling, communication, and throughput constraints interact under load.</p><p>I wrote a <strong>deep guide on CUDA</strong> from exactly this perspective: systems-level performance engineering, bottlenecks, hidden coupling, and why many &#8220;optimisations&#8221; simply move the problem elsewhere.</p><p><a href="https://lorenzobrada.gumroad.com/l/cuda_mastery">Read the CUDA Guide on Gumroad</a></p>]]></content:encoded></item><item><title><![CDATA[We built the CUDA guide I wish I had three years ago]]></title><description><![CDATA[Intro For the past few days we have been quiet here.]]></description><link>https://www.thesoftwarefrontier.com/p/we-built-the-cuda-guide-i-wish-i</link><guid isPermaLink="false">https://www.thesoftwarefrontier.com/p/we-built-the-cuda-guide-i-wish-i</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Thu, 30 Apr 2026 07:20:37 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!dOUW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b55c88f-0b0e-48ef-bf71-1cb59dbccf28_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dOUW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b55c88f-0b0e-48ef-bf71-1cb59dbccf28_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dOUW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b55c88f-0b0e-48ef-bf71-1cb59dbccf28_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!dOUW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b55c88f-0b0e-48ef-bf71-1cb59dbccf28_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!dOUW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b55c88f-0b0e-48ef-bf71-1cb59dbccf28_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!dOUW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b55c88f-0b0e-48ef-bf71-1cb59dbccf28_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dOUW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b55c88f-0b0e-48ef-bf71-1cb59dbccf28_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b55c88f-0b0e-48ef-bf71-1cb59dbccf28_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1484626,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/195844470?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b55c88f-0b0e-48ef-bf71-1cb59dbccf28_1672x941.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dOUW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b55c88f-0b0e-48ef-bf71-1cb59dbccf28_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!dOUW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b55c88f-0b0e-48ef-bf71-1cb59dbccf28_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!dOUW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b55c88f-0b0e-48ef-bf71-1cb59dbccf28_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!dOUW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b55c88f-0b0e-48ef-bf71-1cb59dbccf28_1672x941.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>Intro</h2><p>For the past few days we have been quiet here. Not because the newsletter slowed down, but because we were building something underneath it.</p><p>Today we are publishing what came out of that work: <strong>CUDA Mastery 2026, The Definitive Engineer&#8217;s Reference for Hopper, Blackwell, and Beyond</strong>. </p><p>Twenty-seven chapters, five appendices, fact-checked end to end against NVIDIA&#8217;s own documentation, the <strong>PTX ISA 8.7</strong>, and primary architecture whitepapers. 
</p><p>It covers <strong>CUDA Toolkit 13.0, 13.1, and 13.2</strong>, compute capabilities <strong>7.5 through 12.1</strong>, <strong>WMMA</strong>, <strong>WGMMA</strong>, <strong>UMMA</strong> (<code>tcgen05</code>), <strong>TMA</strong>, <strong>thread block clusters</strong>, <strong>Tensor Memory</strong>, <strong>CUDA Tile</strong> and <strong>cuTile Python</strong>, <strong>CUTLASS 4 / CuTe</strong>, <strong>NCCL 2.30</strong>, and <strong>Nsight 2025.4</strong>.</p><p>It is on <a href="https://lorenzobrada.gumroad.com/l/cuda_mastery">Gumroad</a>. The price is $89. If you have been following <strong>The Software Frontier</strong>, you already know whether this is for you. The rest of this post is for everyone who is on the fence.</p><div><hr></div><h2>Why we wrote this</h2><p>There is a strange gap in <strong>CUDA literature.</strong></p><p>On one side, you have the official <strong>NVIDIA programming guide</strong>: dense, accurate, and written for people who already know what they are looking for. </p><p>On the other side, you have an ocean of blog posts and YouTube tutorials that stop at vector addition and matrix multiplication, repeating the <strong>same surface-level</strong> explanations of threads, blocks, and grids.</p><p>What <strong>sits in the middle</strong>, the part that actually matters when you are writing production code or debugging a kernel that runs at 30 percent of peak, is mostly missing. 
</p><p>Or rather, it exists, but it is scattered across <strong>NVIDIA whitepapers</strong>, GTC talks from 2018, <strong>PTX ISA</strong> documentation, decompiled SASS dumps, the <strong>Hopper</strong> and <strong>Blackwell Tuning Guides</strong>, the <strong>Microbenchmarking Hopper</strong> and <strong>Microbenchmarking Blackwell</strong> arXiv papers, the CUTLASS source, and Stack Overflow threads from people who clearly know more than they are saying.</p><p>We have been reading and writing about this gap for months on the newsletter. The articles on the <strong>A100 memory hierarchy</strong>, on <strong>cp.async</strong> semantics, on <strong>scoreboard mechanics</strong>, on the <strong>submission pipeline</strong>, all of them came from the same frustration. Every time we wanted to explain something properly, we had to do the archaeology ourselves.</p><p>So we decided to <strong>do the archaeology once</strong>, in a single document, and structure it the way we wish someone had structured it for us when we started.</p><div><hr></div><h2>What is in the guide</h2><p><strong>Twenty-seven chapters</strong> across eleven parts, plus five appendices. Four chapters were rewritten end-to-end at handbook depth. Those are the <strong>PREMIUM</strong> chapters: the SM, the memory system, tensor cores, and the SGEMM walkthrough. 
The rest of the structure looks like this.</p><p><strong>Foundations.</strong> The GPU as a throughput machine, the <strong>CUDA programming model, </strong>and the memory hierarchy at a glance. This is the vocabulary layer. A senior engineer can skim it in an afternoon; a new graduate can use it as their entry point and grow into the rest of the book.</p><p><strong>The Streaming Multiprocessor in mechanical detail.</strong> The SM is the unit of concurrency, the unit of resource accounting, and the unit at which every meaningful CUDA performance argument must eventually be made. </p><p>We walk through the <strong>four near-independent partitions</strong>, the <strong>operand collector</strong> and its bank conflicts, the <strong>short</strong> and <strong>long scoreboards</strong>, the quantitative arithmetic of latency hiding via Little&#8217;s law, the full Nsight Compute <strong>stall taxonomy</strong>, and the structural deltas across Volta, Turing, Ampere, Ada, Hopper, and Blackwell. </p><p>The chapter ends with an end-to-end walkthrough of a single warp executing a <code>wgmma.mma_async</code> on a Hopper SM, stage by stage.</p><p><strong>The memory system mechanically.</strong> The 32-byte sector model. The <strong>L1TEX path</strong> and every cache modifier you can attach to a global load. The L2 partitioning on H100 and the <strong>access policy window</strong>. <strong>HBM3 in three numbers</strong> and why the practical roofline is 70 to 85 percent of the headline. </p><p>Shared memory banks, <strong>swizzle modes</strong>, and the descriptor-encoded layout that <strong>WGMMA</strong> actually expects. <strong>cp.async</strong> mechanics on Ampere. <strong>TMA</strong> on Hopper and Blackwell, including descriptors, transaction barriers, phase parity, and <strong>cluster multicast</strong>. 
The mbarrier family and how warp-specialized GEMM mainloops use it.</p><p><strong>Synchronization and concurrency.</strong> The CUDA memory model with its scopes and orderings. <strong>Cooperative Groups</strong> including <strong>cluster.sync</strong> on Hopper and beyond. <strong>Streams</strong>, events, and <strong>CUDA Graphs</strong> for launch-overhead amortization in inference and physics workloads.</p><p><strong>Tensor cores mechanically.</strong> The hardware origin of the unit, the generation-by-generation shape and precision progression, the per-lane fragment ownership for <code>mma.sync</code>, the bit-level layout of the <strong>WGMMA matrix descriptor</strong>, and the structural transition to <strong>UMMA</strong> with the accumulator living in <strong>Tensor Memory</strong>. </p><p>A full section on the new numerical formats: <strong>FP6</strong>, <strong>FP4</strong>, and the <strong>MX</strong> wrapper that makes FP4 inference near-lossless on trained transformer weights.</p><p><strong>Modern hardware.</strong> A Hopper deep dive on <code>sm_90</code> / <code>sm_90a</code>. A Blackwell deep dive on <code>sm_100</code> / <code>sm_100a</code> / <code>sm_120</code>. A chapter on <strong>Blackwell Ultra</strong> (B300, compute capability 10.3) and the trajectory toward Rubin.</p><p><strong>Performance engineering.</strong> The roofline model in practice, with the <strong>second roofline</strong> for shared-memory bandwidth on tile-based kernels. Profiling with <strong>Nsight Systems</strong> and <strong>Nsight Compute</strong>, including the four-section workflow and the new tile-kernel statistics added in CUDA 13.1. Numerics and reproducibility, including the <strong>TF32 trap</strong> that silently downgrades FP32 GEMMs.</p><p><strong>Multi-GPU and distributed.</strong> <strong>NVLink 5</strong> and the NVL72 domain. <strong>SHARP v4</strong> in-network reductions. <strong>NCCL</strong> internals across Ring, Tree, NVLS, and PAT. 
<strong>NVSHMEM</strong> and PGAS for sparse all-to-all in MoE training.</p><p><strong>Libraries and toolchain.</strong> cuBLAS, cuBLASLt epilogue fusion, cuDNN, cuFFT, cuSPARSE. <strong>CUTLASS 4</strong> and <strong>CuTe</strong> for hand-written tensor-core mainloops. <strong>CCCL</strong> (Thrust + CUB + libcu++). A full chapter on <strong>CUDA Tile</strong> and <strong>cuTile Python</strong>, the largest single addition to the CUDA programming model since cooperative groups. A chapter on <code>nvcc</code>, <strong>PTX</strong>, <strong>SASS</strong>, the fatbinary, and inline PTX as an escape hatch.</p><p><strong>Capstone kernels.</strong> The <strong>SGEMM walkthrough from v1 to v6, with numbers</strong>. Most treatments stop at register tiling and gesture vaguely at &#8220;<em>tensor cores make it faster.</em>&#8221; </p><p>Ours follows the bottleneck through six versions on a 4096&#179; FP32 problem on H100 SXM5, names the architectural feature that breaks each ceiling, and gives the analytical bound. v1 reaches 0.2 percent of peak. v6, on Blackwell with <strong>UMMA + TMEM at FP8</strong>, reaches 88 to 95 percent of peak. </p><p>The chapter exists to teach what every transition costs and what it buys. <strong>Reductions and scans</strong> with single-pass decoupled lookback. <strong>Flash Attention 2, 3, and 4 / 5</strong> including the online softmax derivation and the WGMMA / UMMA mainloop. <strong>Sort, hash, and graph primitives</strong> built on CUB.</p><p><strong>Appendices.</strong> Compute capability quick reference. Architecture spec sheet from A100 through B300 with verified numbers from each part&#8217;s datasheet. PTX quick reference. Glossary. And a full bibliography of the primary sources every claim in the book was checked against.</p><p>Every claim in the guide has been fact checked against primary sources. 
Where we had to infer something from<strong> SASS </strong>or from the behavior of the hardware rather than from a published spec, we say so explicitly.</p><div><hr></div><h2>What was truly missing </h2><p>Maxwell, Pascal, and Volta offline-compilation material was retired, in line with <strong>CUDA 13.0</strong> dropping pre-Turing offline compilation in August 2025. The tensor-core chapter was rewritten around <strong>UMMA</strong>, which supersedes Hopper&#8217;s <code>wgmma.mma_async</code>. </p><p>New material on <strong>CUDA Tile</strong> and <strong>cuTile Python</strong>, both introduced in CUDA 13.1 in December 2025 and extended in 13.2. New material on <strong>Tensor Memory</strong>. </p><p>The four PREMIUM chapters are new from the ground up. Numbers verified against NVIDIA&#8217;s Hopper, Blackwell, and Blackwell Ultra public datasheets at print time.</p><div><hr></div><h2>Who this is for</h2><p>If you are writing CUDA professionally, in <strong>HPC</strong>, in <strong>ML systems</strong>, in <strong>inference engines</strong>, or in any context where kernel performance is part of your job, this guide is calibrated for you.</p><p>If you are a <strong>senior engineer </strong>transitioning into GPU work and you want one document that takes you from competent to dangerous without 200 hours of fragmented reading, this is the document.</p><p>If you are deep into the <strong>PTX</strong> weeds already, writing your own warp-specialized <strong>WGMMA</strong> mainloops and tuning <strong>CuTe</strong> layouts, you probably know a lot of what is in here. 
</p><p>You will still find the <strong>SM </strong>and memory chapters useful as a reference, and the SGEMM walkthrough is one of the few places where the v5-to-v6 transition is laid out in full. But we would not pretend to teach you something you do not already know.</p><p>If you are completely new to CUDA, with no parallel programming background, this is not the right starting point. </p><p>The guide assumes <strong>you can read C++ </strong>and that you have at least written a few kernels before. We would not want you to spend $89 and feel lost on chapter four.</p><div><hr></div><h2>Why $89</h2><p>Because the alternative is reading what we read, in the order we read it, over the same number of months.</p><p>We are not pricing this against tutorials. We are pricing it against the time of an engineer who bills somewhere between $80 and $200 an hour and needs to be productive on Hopper and Blackwell GPU code by next quarter. </p><p>If the guide saves you a single afternoon of debugging a kernel that turns out to be limited by <strong>operand collector bank conflicts</strong> that no public documentation describes, it has paid for itself.</p><p>There is no DRM, no expiration, no upsell. You <strong>buy the PDF, you own it</strong>. As future architectures change material details, we will publish updates to buyers at no additional cost. </p><p>The next edition is already scheduled to cover <strong>Rubin</strong> when its public specifications stabilize.</p><div><hr></div><h2>What happens next</h2><p>The newsletter continues. 
The <strong>Mastering CUDA</strong> series is not over, and there are several articles already in draft on topics that did not fit cleanly into the guide.</p><p>If you buy the guide and you have feedback, send it. We read every email. The first revision is going out <strong>within thirty days</strong> based on what readers tell us, and the people who bought early get it first.</p><p><strong>You can find the guide here:</strong> <a href="https://lorenzobrada.gumroad.com/l/cuda_mastery">CUDA Mastery</a></p><p>Thank you for reading. Thank you for being here while this was being built. The next article goes out as scheduled.</p><p><em>Lorenzo and Lorenzo</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Software Frontier! 
</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Mastering CUDA and High-Performance Computing, Part X]]></title><description><![CDATA[A Deep Dive from Compiler Internals to High-Performance Parallel Computing]]></description><link>https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-be0</link><guid isPermaLink="false">https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-be0</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Fri, 24 Apr 2026 13:58:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!mhE7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ed33f8-2ce3-4cb2-907c-b4bb64edba93_1402x1122.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mhE7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ed33f8-2ce3-4cb2-907c-b4bb64edba93_1402x1122.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mhE7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ed33f8-2ce3-4cb2-907c-b4bb64edba93_1402x1122.png 424w, 
https://substackcdn.com/image/fetch/$s_!mhE7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ed33f8-2ce3-4cb2-907c-b4bb64edba93_1402x1122.png 848w, https://substackcdn.com/image/fetch/$s_!mhE7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ed33f8-2ce3-4cb2-907c-b4bb64edba93_1402x1122.png 1272w, https://substackcdn.com/image/fetch/$s_!mhE7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ed33f8-2ce3-4cb2-907c-b4bb64edba93_1402x1122.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mhE7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ed33f8-2ce3-4cb2-907c-b4bb64edba93_1402x1122.png" width="1402" height="1122" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72ed33f8-2ce3-4cb2-907c-b4bb64edba93_1402x1122.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1122,&quot;width&quot;:1402,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2668910,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/193447084?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ed33f8-2ce3-4cb2-907c-b4bb64edba93_1402x1122.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!mhE7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ed33f8-2ce3-4cb2-907c-b4bb64edba93_1402x1122.png 424w, https://substackcdn.com/image/fetch/$s_!mhE7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ed33f8-2ce3-4cb2-907c-b4bb64edba93_1402x1122.png 848w, https://substackcdn.com/image/fetch/$s_!mhE7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ed33f8-2ce3-4cb2-907c-b4bb64edba93_1402x1122.png 1272w, https://substackcdn.com/image/fetch/$s_!mhE7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ed33f8-2ce3-4cb2-907c-b4bb64edba93_1402x1122.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>Where Part IX Left Us</h2><p><strong>Part IX</strong> ended with a provocation dressed as a summary. Training, we said, is a one-time cost. Inference is the workload that runs forever.</p><p>That sentence deserves to be interrogated before we accept it as a frame, because it contains a <strong>hidden asymmetry </strong>that shapes everything that follows. </p><p>Training a <strong>frontier model</strong> is an event: it happens once, or perhaps a handful of times with different hyperparameters, and then it stops. The cost is large and bounded. </p><p><strong>Inference </strong>is a process: it happens billions of times per day, across hardware that may or may not resemble the training cluster, under latency constraints that the training job never had to respect, serving users who have no patience for pipeline bubbles and <strong>no interest in MFU</strong>.</p><p>The engineering discipline of <strong>inference optimization</strong> is therefore a different subject from the engineering discipline of training optimization, not merely a scaled-down version of it. <em>The bottlenecks are different in kind. The metrics are different. The vocabulary is different. The hardware choices are sometimes deliberately different.</em></p><p>But the physics is the same, because physics does not have a training mode and an inference mode. 
<strong>Compute and memory bandwidth </strong>are always the two resources, and every optimization in this space is, at root, a claim about which of the two you are spending and whether you are spending it wisely.</p><p>What we will do in this part is work through the inference problem with the same level of precision we brought to the training problem. </p><p>We will derive the arithmetic of autoregressive decoding from first principles, establish exactly why the <strong>decode phase</strong> of transformer inference is memory-bandwidth-bound by construction, explain what that means for hardware selection and <strong>batching strategy</strong>, and then examine the tools that practitioners have developed to recover the compute utilization that the memory-bound regime takes away.</p><p>We will <strong>go into detail </strong>that most treatments of this subject avoid, because the details are where the engineering actually lives.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>The two phases of transformer inference </h2><p><strong>Transformer inference</strong> for a generative model consists of <strong>two distinct phases</strong> that are so different in their computational character that they might as well be different workloads running on different hardware.</p><p>The prefill phase processes the input prompt. Given a prompt of <strong>length S tokens</strong>, the model performs a forward pass over all S tokens simultaneously. 
</p><p>This is a dense <strong>matrix multiply</strong> of shape [S, d_model] against the weight matrices, which is computationally equivalent to a training forward pass on a batch of <strong>S examples</strong>, with the important difference that no gradient computation happens. </p><p>The arithmetic intensity of prefill is high: the <strong>GEMM is large,</strong> the compute-to-memory ratio is favorable, and for sufficiently long prompts, prefill saturates tensor core utilization. Prefill is compute-bound.</p><p>The decode phase generates the output, one token at a time. At each step, the model processes a single new token and uses the <strong>key-value (KV) cache</strong>, which stores the key and value projections for all previously seen tokens, to compute attention over the full context without recomputing those projections. </p><p>The new token produces one row of the <strong>Q matrix</strong>, one row of the <strong>K matrix</strong> (<em>appended to the KV cache</em>), one row of the V matrix (also appended), and one output token.</p><p>The matrix multiply that dominates decode is therefore a <strong>matrix-vector product:</strong> a single vector of shape [1, d_model] multiplied against weight matrices of shape [d_model, d_model]. </p><p>For an H100 at 494 TFLOP/s peak BF16, and a weight matrix that requires 2 &#215; d_model&#178; bytes to read from <strong>HBM</strong> (one load), the arithmetic intensity of this operation is:</p><p><em>flops = 2 &#215; d_model&#178;<br>bytes = 2 &#215; d_model&#178;<br>arithmetic intensity = flops / bytes = 1 FLOP/byte</em></p><p>The H100&#8217;s ridge point, the arithmetic intensity at which the machine transitions from memory-bandwidth-bound to compute-bound, is approximately <strong>494 TFLOP/s </strong>divided by 3.35 TB/s HBM3 bandwidth, which equals roughly 147 FLOP/byte.</p><p>Single-token decode has an arithmetic intensity of <strong>1 FLOP/byte. 
</strong>The ridge point is at 147 FLOP/byte.</p><p>The gap is not a small inefficiency to be engineered away. It is two orders of magnitude. It is structural. </p><p>A matrix-vector product with batch size 1 will always be memory-bandwidth-bound on any hardware where compute throughput scales faster than <strong>memory bandwidth</strong>, which is every piece of hardware available today and likely every piece available for the next several years. </p><p>The H100&#8217;s tensor cores sit idle roughly<strong> 99.3% of the time</strong> (1/147 &#8776; 0.7% utilization), waiting for data that cannot arrive fast enough. This is the inference problem, stated precisely.</p><div><hr></div><h2>What the arithmetic intensity gap actually costs</h2><p>Before we can appreciate why batching is not a simple fix, we need to quantify what the gap costs in concrete terms.</p><p>Consider the weight matrices of a <strong>70B parameter model. </strong>They occupy 140 GB in BF16. To generate a single output token, the decode phase must read essentially all of the weight matrices from HBM: every attention projection, every MLP layer, every embedding lookup. </p><p>(The KV cache is also read, but its size is proportional to context length and sequence position, not model size.) At the H100&#8217;s<strong> HBM3 bandwidth</strong> of 3.35 TB/s, reading 140 GB takes approximately 42 milliseconds.</p><p>In 42 milliseconds, a single token is produced. 
That is approximately 24 tokens per second per H100 for a 70B model at batch size 1.</p><p>Now read that sentence again: <strong>24 tokens per second per H100</strong>, a machine that costs tens of thousands of dollars and can perform 494 trillion floating-point operations per second, is producing tokens at roughly the rate that a person reads them.</p><p>The <strong>494 TFLOP/s</strong> are not being used. The H100 is acting as a very expensive, very fast HBM3 reader. The silicon that took years to design and billions of dollars to fabricate is waiting for data.</p><p>This is the central pathology of autoregressive decode, and it motivates every technique we will discuss in this part.</p><h3>Batching as arithmetic intensity recovery</h3><p>The solution that every inference practitioner reaches for first is batching: if you run decode for multiple requests simultaneously, the<strong> weight reads</strong> are shared across the batch, and the arithmetic intensity increases proportionally.</p><p>The arithmetic is clean. For a batch of B requests, the matrix multiply in decode is no longer a matrix-vector product but a <strong>matrix-matrix product</strong>: [B, d_model] &#215; [d_model, d_model]. </p><p>The flop count scales as B &#215; 2 &#215; d_model&#178;, while the bytes for the weight matrix remain 2 &#215; d_model&#178; (the weights are read once, regardless of batch size). Arithmetic intensity is now<strong> B FLOP/byte.</strong></p><p>To reach the ridge point at 147 FLOP/byte, you need a batch of 147 requests running simultaneously on the same H100. At batch size 147, the tensor cores begin to saturate and further increasing the batch does not change the arithmetic intensity (you are now <strong>compute-bound</strong>, and more requests means more total compute, not more memory reads per unit time).</p><p>The batch size at which you saturate the machine is the target operating point for <em>maximum throughput per GPU</em>. 
Everything below this point is <strong>wasted hardware.</strong></p><p>But batch size is not free. Each request in the batch has its own KV cache, and the KV cache size is proportional to the sequence length of that request. For a context length of 8192 tokens, a model with 80 layers, 8 KV heads, a head dimension of 128, and BF16 storage, the KV cache size per request is:</p><p><em>8192 &#215; 80 &#215; 2 &#215; 8 &#215; 128 &#215; 2 bytes = 2,684,354,560 bytes &#8776; 2.7 GB</em></p><p>Roughly 2.7 GB per request, on a GPU with 80 GB of HBM. Even if all 80 GB were available for KV cache, which it is not once weights are resident, you could serve at most about 29 concurrent requests at 8192-token context length before <strong>HBM is exhausted </strong>and you cannot increase the batch further. That is far short of the 147 needed to reach the ridge point.</p><p>The tension is fundamental: to achieve good arithmetic intensity you need large batches, but large batches require large KV caches, and large KV caches consume the memory that large batches require.</p><p>This is the central <strong>resource allocation problem</strong> of transformer inference, and it is more constrained than it appears, because KV cache memory is not static. It grows with sequence length. </p><p>A request that has generated 100 tokens has a small KV cache; the same request after generating 4000 tokens has a KV cache that is<strong> 40&#215; larger. 
</strong>The memory footprint of the batch changes continuously as generation proceeds.</p><div><hr></div><h2>Continuous batching and the end of static padding</h2><p>The naive approach to batched inference is static batching: collect B requests, pad them all to the same <strong>sequence length,</strong> run them as a batch, return all B results when the longest sequence finishes.</p><p>Static batching is deeply inefficient. Consider a batch of 8 requests where one request will generate <strong>2000 tokens </strong>and the others will generate 20 tokens each. </p><p>After the short requests finish at step 20, 7 of the 8 slots in the batch are empty, but the batch continues until step <strong>2000 to service</strong> the one long request. The GPU is running at 1/8 occupancy for 99% of the wallclock time.</p><p>Continuous batching (also called iteration-level batching or in-flight batching), implemented in systems like vLLM, <strong>Orca</strong>, and TensorRT-LLM, solves this by removing the assumption that all requests in a batch start and finish together. </p><p>Instead, the batch is managed at the<strong> per-decoding-step level</strong>: at each step, the set of active requests is the set of requests that have not yet finished and for which memory is available.</p><p>When a request completes (<em>generates a stop token or reaches the maximum length</em>), its slot in the batch is <strong>immediately freed</strong>. A new waiting request is inserted into the freed slot and begins generating from its first decode step. There is no waiting for the batch to drain. 
The batch is always as full as memory permits.</p><p>The implementation requires that the <strong>CUDA kernels </strong>for attention and the MLP can handle variable-length sequences within a single kernel invocation, which is non-trivial because standard <strong>GEMM implementations</strong> assume a fixed batch dimension. </p><p>Paged attention (discussed shortly) is the memory management technique that makes this practical; the <strong>PagedAttention</strong> kernel from vLLM and the FlashAttention variants for variable-length sequences are the implementations that make it fast.</p><p>Continuous batching does not increase the maximum batch size (<em>which is still limited by KV cache memory</em>). What it does is ensure that the batch is always at or near the maximum size, eliminating the idling that static batching induces. </p><p>A system running continuous batching with <strong>maximum batch size 64 </strong>will achieve dramatically higher throughput than a system running static batching with the same maximum batch size, because the latter is almost never actually running 64 requests simultaneously.</p><h3>PagedAttention and the memory management revolution</h3><p>The KV cache memory problem has an analogy so precise it deserves to be stated explicitly: the KV cache is to inference systems what<strong> physical memory</strong> is to operating systems.</p><p>In an <strong>operating system</strong>, multiple processes compete for a fixed physical memory. Processes do not know their memory needs in advance (a process may allocate more memory as it runs). </p><p>Memory fragmentation is a<strong> real cost</strong>: even if the total free memory is sufficient, if it is not contiguous, an allocation may fail. 
</p><p>The solution that operating systems developed is<strong> virtual memory </strong>with paging: memory is divided into fixed-size pages, processes address a virtual space that the OS maps to physical pages on demand, and fragmentation is eliminated because non-contiguous physical pages can be mapped to a contiguous virtual space.</p><p><strong>PagedAttention</strong>, introduced by <strong>Kwon et al.</strong> (2023) and implemented in vLLM, applies exactly this insight to KV cache management.</p><p>In a <strong>naive KV cache</strong> implementation, each request&#8217;s KV cache is a contiguous block of GPU memory allocated at request arrival. The maximum context length is reserved at allocation time (because the request might generate that many tokens), even if the actual generation is much shorter. </p><p><strong>Fragmentation is severe</strong>: the gap between reserved and used memory across all requests is wasted, and new requests cannot use it.</p><p>PagedAttention divides the <strong>KV cache</strong> into fixed-size physical blocks (pages), where each block stores the keys and values for a fixed number of tokens (the block size, typically 16 or 32 tokens). When a request needs more KV cache space, it is allocated additional pages from a free pool. </p><p>Pages for a single request need not be contiguous in physical GPU memory; the PagedAttention kernel uses a <strong>block table</strong> (a small integer array per request mapping logical page indices to physical page indices) to find the right physical memory at attention computation time.</p><p>The consequences are significant. First, fragmentation falls from potentially 50% (if requests <strong>reserve maximum-length</strong> buffers but generate much shorter sequences) to under 5% (only the last partially-filled page of each sequence wastes space). 
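</p><p>The block-table bookkeeping described above can be sketched in a few lines. This is a minimal illustration, not vLLM's implementation: the class name PagedKVAllocator, its method names, and the pool size are hypothetical, and only the accounting is modelled (no tensors, no attention kernel). It assumes the 16-token block size mentioned in the text.</p>
<pre>
```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per physical block, as in the text


@dataclass
class PagedKVAllocator:
    """Toy allocator (hypothetical): hands out physical block ids from a free pool."""
    num_blocks: int
    free: list = field(default_factory=list)
    block_tables: dict = field(default_factory=dict)  # request id -> physical block ids
    lengths: dict = field(default_factory=dict)       # request id -> tokens stored

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def append_token(self, req: str) -> int:
        """Reserve room for one more token; returns the physical block used."""
        table = self.block_tables.setdefault(req, [])
        n = self.lengths.get(req, 0)
        if n % BLOCK_SIZE == 0:  # last block is full, or no block exists yet
            if not self.free:
                raise MemoryError("no free KV blocks; caller must preempt")
            table.append(self.free.pop())
        self.lengths[req] = n + 1
        return table[-1]

    def release(self, req: str) -> None:
        """Return all blocks of a finished request to the free pool."""
        self.free.extend(self.block_tables.pop(req, []))
        self.lengths.pop(req, None)

    def wasted_slots(self, req: str) -> int:
        """Unused token slots; only the last block can be partially filled."""
        return len(self.block_tables[req]) * BLOCK_SIZE - self.lengths[req]
```
</pre>
<p>Appending 20 tokens to one request in this sketch reserves two blocks and leaves 12 unused slots in the second one, which is the only kind of waste the scheme allows: at most one partially filled block per sequence.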
</p><p>Second, sequences can share physical pages: if two requests have identical prompt prefixes (<em>common in chat applications with a fixed system prompt</em>), the KV cache pages for the shared prefix can be <strong>physically shared </strong>between them, eliminating redundant computation and halving the memory footprint of the prefix. </p><p>Third, <strong>memory allocation is lazy</strong>: pages are allocated only as tokens are generated, not at request arrival, which means a request does not consume its full potential KV cache until it actually generates enough tokens to need it.</p><p>Prefix sharing (<em>also called KV cache sharing or prompt caching</em>) deserves additional attention because its impact at scale is large. Consider a chat application where every request is prefixed with a <strong>2000-token system prompt.</strong> </p><p>Without prefix sharing, each request independently computes and stores the KV cache for those 2000 tokens. With prefix sharing, the<strong> 2000-token prefix</strong> KV cache is computed once and shared across all concurrent requests. </p><p>For a batch of 64 requests, this eliminates 63 redundant prefill computations and reduces KV cache memory by<strong> 2000 &#215; 63 tokens&#8217; </strong>worth of keys and values (64 copies of the prefix become one), freeing space for a larger batch.</p><h3>Speculative decoding and the bandwidth wall</h3><p>Continuous batching with PagedAttention brings the inference system to the point where it uses the available hardware efficiently at the maximum batch size the KV cache permits. </p><p>But we have still not escaped the fundamental constraint: at a sufficiently large batch the system can become compute-bound, but <strong>getting to maximum batch size </strong>requires enough concurrent requests, which requires enough users, which is not always the case in low-traffic scenarios. 
At<strong> low batch sizes</strong>, we are still memory-bandwidth-bound, still producing tokens only as fast as memory bandwidth permits.</p><p>Speculative decoding attacks this from a different angle. The observation is this: for a memory-bandwidth-bound system, the cost of scoring one token and the cost of scoring <strong>K tokens simultaneously</strong> in a single forward pass are approximately equal, because the bottleneck is reading the weight matrices from memory, and the weights are read once per pass no matter how many tokens that pass processes.</p><p>The mechanism: a <strong>small draft model</strong> (or a non-autoregressive heuristic, or a retrieval system) proposes a sequence of K candidate continuation tokens. A draft model does this autoregressively, with K of its own cheap forward passes. </p><p>The large target model then verifies this proposal in a single forward pass over the K <strong>tokens simultaneously</strong>. If the target model accepts all K tokens, K+1 tokens have been generated (the K drafts plus the token the target pass itself produces) at the cost of one large model forward pass plus the draft model&#8217;s K small forward passes. </p><p>If the target model rejects a token, generation keeps the accepted prefix, takes the target model&#8217;s own token at the first rejected position, and continues from there, having wasted some <strong>small model computation</strong> but no large model computation beyond what was necessary.</p><p>The analysis of when speculative decoding accelerates inference requires understanding the <strong>acceptance rate</strong>, which is the fraction of proposed tokens that the target model accepts. </p><p>For a good draft model generating from a similar distribution, acceptance rates of 70-90% are achievable. 
With an acceptance rate of &#945; (treated as independent per token) and a <strong>proposal length of K</strong>, the expected number of tokens produced per large model forward pass, counting the one token the target pass itself always contributes, is:</p><p><em>E[accepted] = sum_{k=0}^{K-1} (k+1) &#215; &#945;^k &#215; (1 &#8722; &#945;) + (K+1) &#215; &#945;^K = (1 &#8722; &#945;^{K+1}) / (1 &#8722; &#945;)</em></p><p>For &#945; = 0.8 and K = 4: E[accepted] &#8776; 3.4 tokens per large model forward pass, compared to 1 token per forward pass without speculation. <strong>Speedup</strong> is approximately 3.4&#215;, reduced by the overhead of the draft model, typically 0.2-0.3 of a large model forward pass for a draft model that is 10-15&#215; smaller.</p><p>Net speedup: approximately 3.4 / 1.2 &#8776; 2.8&#215;. This is not free, but it is substantial, and it is available even at batch size 1.</p><p>The reason speculative decoding works, and why it does not violate the <strong>memory-bandwidth</strong> constraint, is subtle. The verification pass is a prefill operation over K+1 tokens, which has higher arithmetic intensity than single-token decode. </p><p>For K=4, the verification pass processes 5 tokens simultaneously, which is 5&#215; the arithmetic intensity of single-token decode. It is still <strong>memory-bandwidth-bound </strong>for small K, but the per-accepted-token cost of the memory reads is reduced because multiple tokens share the weight reads.</p><p>The<strong> deeper insight </strong>is that speculative decoding converts memory-bandwidth-bound decode operations into a mixture of memory-bandwidth-bound draft decode and slightly-less-memory-bandwidth-bound verification, and the mixture achieves <strong>better tokens-per-second </strong>because the draft model is smaller and therefore faster per forward pass.</p><p>The pathological case is when the draft model produces tokens that the target model almost never accepts, in which case the overhead of<strong> draft generation </strong>and failed verification exceeds the benefit. 
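</p><p>The expected-token arithmetic earlier in this section can be checked numerically. The sketch below is a simplified model, not a measurement: it assumes each draft token is accepted independently with probability alpha, that every target pass emits one token of its own, and that the draft cost is a fixed fraction of a target pass. The function names are illustrative.</p>
<pre>
```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha, and that the target pass always contributes one
    token of its own (the replacement after a rejection, or a bonus
    token when every draft is accepted).
    """
    # j accepted drafts followed by a rejection emits j + 1 tokens.
    partial = sum((j + 1) * alpha**j * (1 - alpha) for j in range(k))
    # All k drafts accepted emits k + 1 tokens.
    return partial + (k + 1) * alpha**k


def net_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Speedup over plain decode; draft_cost is in units of one target pass."""
    return expected_tokens(alpha, k) / (1.0 + draft_cost)


if __name__ == "__main__":
    print(round(expected_tokens(0.8, 4), 2))   # 3.36
    print(round(net_speedup(0.8, 4, 0.2), 2))  # 2.8
```
</pre>
<p>Under these assumptions, an acceptance rate of 0.8 with K = 4 gives about 3.36 tokens per target pass, and about 2.8&#215; net once a draft overhead of 0.2 is charged. Real acceptance is not independent across positions, so measured speedups can differ from this idealized figure.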
</p><p>Acceptance rates below approximately 0.5 make speculative decoding harmful, not helpful. This is why the choice of<strong> draft model matters</strong>: it should be distilled from or aligned with the target model, not simply chosen to be the fastest available.</p><p><strong>Self-speculative decoding</strong>, where the model speculates with its own early layers (exiting at an intermediate layer for draft generation and running the full forward pass for verification), eliminates the draft model requirement at the cost of some architectural complexity. </p><p>Medusa, a <strong>multi-head speculative decoding method </strong>that adds dedicated draft heads to the target model, is a variant that achieves similar benefits with a different implementation strategy.</p><div><hr></div><h2>KV cache quantization and the memory tradeoff</h2><p>Even with PagedAttention and prefix sharing, the KV cache is often the binding constraint on batch size for <strong>long-context models</strong>. A 70B model with a 128K context length and a batch of 8 requests generates a KV cache of:</p><p><em>128,000 &#215; 80 &#215; 2 &#215; 8 &#215; 128 &#215; 2 bytes &#215; 8 requests = 335,544,320,000 bytes &#8776; 336 GB</em></p><p>Roughly a third of a terabyte for 8 requests. That is more than the combined HBM of four 80 GB H100s, before a single weight is loaded, so a small <strong>H100 cluster</strong> cannot hold it alongside the model. The only responses are to reduce context length (not always possible), reduce batch size (<em>reduces throughput</em>), or reduce the precision of KV cache values.</p><p>KV cache quantization stores keys and values in INT8 or FP8 rather than BF16, halving the memory requirement at the cost of <strong>approximation error</strong>. 
The question is whether that approximation error materially affects output quality.</p><p>The answer, empirically, is that keys and values are more quantization-sensitive than weights, because the attention<strong> score computation</strong> amplifies outliers in the key matrix, and those outliers are common and important. </p><p><strong>Naive INT8 quantization</strong> of the KV cache causes measurable quality degradation on tasks requiring precise retrieval over long contexts. The degradation is smaller for short contexts where fewer keys compete for attention.</p><p>Techniques that address this include<strong> grouped-query attention</strong> (GQA), which reduces the number of KV heads (and therefore the KV cache size) without proportionally reducing the expressivity of attention by sharing keys and values across groups of query heads; and <strong>mixed-precision</strong> KV caching, which stores frequently-accessed (recent) tokens in higher precision and distant-context tokens in lower precision, exploiting the empirical observation that attention weights are concentrated on nearby and highly salient tokens.</p><p><strong>GQA</strong> deserves detailed treatment because it has become standard in essentially all modern large language models: Llama 3, Mistral, Gemma, Qwen, and their successors all use it. </p><p>The mechanism is to reduce the number of KV heads from H to H/G for a group size of G, where typically G = 8. Each KV head is shared by G query heads during attention computation. The<strong> KV cache</strong> size shrinks by a factor of G, and with G=8 on a 128-head attention, the cache is 8&#215; smaller. 
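</p><p>The cache arithmetic behind GQA is easy to reproduce. The sketch below uses a hypothetical configuration (32 layers, 32 query heads, head dimension 128, 4096 tokens of context, BF16) chosen only for illustration; the helper names kv_cache_bytes and gqa_kv_heads are likewise assumptions, not an API from any library.</p>
<pre>
```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one request.

    The factor of 2 accounts for storing both keys and values.
    """
    return tokens * layers * 2 * kv_heads * head_dim * bytes_per_value


def gqa_kv_heads(query_heads: int, group_size: int) -> int:
    """Number of KV heads when each one is shared by group_size query heads."""
    if query_heads % group_size != 0:
        raise ValueError("query_heads must be divisible by group_size")
    return query_heads // group_size


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    tokens, layers, query_heads, head_dim = 4096, 32, 32, 128
    mha = kv_cache_bytes(tokens, layers, query_heads, head_dim)
    gqa = kv_cache_bytes(tokens, layers, gqa_kv_heads(query_heads, 8), head_dim)
    print(mha, gqa, mha // gqa)
```
</pre>
<p>With a group size of 8, the per-request cache in this example drops from 2 GiB to 256 MiB, the same factor of G described above.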
</p><p>The expressivity cost is small for typical values of G because attention heads tend to specialize into redundant groups anyway: the <strong>empirical evidence</strong> across many model evaluations suggests that the quality loss from GQA at G=8 is negligible relative to the memory benefit.</p><h3>The prefill-decode disaggregation architecture</h3><p>We have established that prefill is <strong>compute-bound</strong> and decode is <strong>memory-bandwidth-bound.</strong> The hardware that is optimal for one is different from the hardware that is optimal for the other.</p><p>High-compute GPUs (H100, B200) with large HBM capacity are appropriate for both prefill and decode, but they are expensive. For decode in particular, the binding resource is memory bandwidth, <strong>not compute throughput. </strong>A GPU with lower compute throughput but the same memory bandwidth would serve the decode phase equally well at lower cost.</p><p>This observation motivates <strong>prefill-decode disaggregation</strong>: running prefill and decode on separate hardware pools, each sized for its actual bottleneck.</p><p>The architecture is approximately as follows: a request arrives at a scheduler, which assigns it to a <strong>prefill worker. </strong>The prefill worker (<em>a high-compute GPU</em>) processes the prompt and produces the initial KV cache. </p><p>That KV cache is then transferred to a decode worker (<em>which may be a different, possibly cheaper GPU</em>). The <strong>decode worker</strong> generates tokens autoregressively, managing the KV cache in its memory, and streams tokens back to the user.</p><p>The KV cache transfer itself is non-trivial: for a long prompt on a large model, the KV cache may be tens of gigabytes, and transferring it across a network link between <strong>prefill and decode workers</strong> takes time proportional to size. 
The transfer must complete before the first decode token can be generated, which adds latency to the <em>time-to-first-token</em> (TTFT) metric.</p><p>For applications where throughput matters more than latency, this tradeoff is acceptable: the <strong>disaggregated system </strong>produces more tokens per dollar per second because each hardware type is used for the phase it is efficient at. </p><p>For applications where <strong>TTFT is critical </strong>(real-time conversational AI), the added transfer latency may be unacceptable.</p><p>The engineering tension here is real, and production systems navigate it differently depending on workload: <strong>Splitwise </strong>(from Microsoft Research), <strong>DistServe </strong>(from Peking University and others), and production implementations at major AI serving providers all make different tradeoffs along the latency-throughput frontier.</p><div><hr></div><h2>Flash decoding and the attention bottleneck</h2><p>We have spent considerable time discussing the weight matrix reads as the memory bottleneck for decode, but for<strong> long-context inference</strong>, a second bottleneck emerges: the KV cache reads during attention computation.</p><p>The attention operation during decode requires computing, for the new token&#8217;s query vector q of shape [d_head], an attention score against every key in the KV cache of shape [context_length, d_head], and then a <strong>weighted sum</strong> of every value in the KV cache of shape [context_length, d_head]. 
</p><p>The KV cache for a single layer, a <strong>single head</strong>, at context length L is 2 &#215; L &#215; d_head &#215; 2 bytes (keys and values, BF16). For L = 128,000, d_head = 128, this is 64 MB per head per layer per request.</p><p>For a model with 80 layers and 8 KV heads, the total KV cache read per decode step is <em>64 MB &#215; 80 &#215; 8 = 40,960 MB &#8776; 40 GB.</em></p><p>At <em>3.35 TB/s HBM3 bandwidth,</em> reading 40 GB takes approximately 12 milliseconds per decode step. The weight matrix reads (140 GB at 3.35 TB/s) take approximately 42 milliseconds. At 128K<strong> context length,</strong> KV cache reads are therefore about 22% of the per-step memory read time, a non-trivial contribution.</p><p>At context lengths beyond 128K (1M tokens is now a research target), KV cache reads can dominate weight reads entirely. The bottleneck for<strong> very-long-context</strong> inference is not the model weights but the growing KV cache.</p><p>Flash Decoding, introduced by Dao et al. and integrated into FlashAttention-3, addresses this by parallelizing the KV cache reads across multiple warps in a different pattern from standard <strong>FlashAttention</strong>. </p><p>In standard FlashAttention, which is designed for prefill <em>(where Q, K, and V all have sequence dimension S)</em>, the parallelism is along the query sequence dimension. </p><p>In <strong>Flash Decoding</strong>, where the query has sequence dimension 1 but the KV cache has sequence dimension L, the parallelism instead decomposes along the key/value sequence dimension, allowing multiple warps to read different segments of the KV cache simultaneously and combine their partial softmax results using a numerically stable reduction.</p><p>The speedup from Flash Decoding is most pronounced when the KV cache length is large and the batch size is small, precisely the regime of <strong>long-context single-request inference </strong>where standard attention is most bottlenecked. 
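</p><p>The split-and-reduce idea can be shown without a GPU. The sketch below is plain Python and runs the chunks in a loop rather than in parallel; it is not the Flash Decoding kernel, only the numerically stable reduction it relies on, under the assumption of a single query vector and a single head. The function names and the chunk size are illustrative.</p>
<pre>
```python
import math


def attend_chunk(q, keys, values):
    """Attention of one query over one KV chunk.

    Returns (m, s, acc): the chunk's max score, the sum of exp(score - m),
    and the exp-weighted (unnormalised) sum of the value vectors.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(x - m) for x in scores]
    acc = [sum(w * v[d] for w, v in zip(weights, values))
           for d in range(len(values[0]))]
    return m, sum(weights), acc


def merge(parts):
    """Combine per-chunk partials by rescaling each one to a shared max."""
    m_all = max(m for m, _, _ in parts)
    total = 0.0
    out = [0.0] * len(parts[0][2])
    for m, s, acc in parts:
        r = math.exp(m - m_all)  # rescale factor for this chunk
        total += r * s
        out = [o + r * a for o, a in zip(out, acc)]
    return [o / total for o in out]


def split_kv_attention(q, keys, values, chunk=4):
    """Decode-step attention computed chunk by chunk over the KV cache.

    On a GPU the chunks would be processed in parallel; here they run in a loop.
    """
    parts = [attend_chunk(q, keys[i:i + chunk], values[i:i + chunk])
             for i in range(0, len(keys), chunk)]
    return merge(parts)
```
</pre>
<p>Each chunk returns its own running maximum, exponential sum, and unnormalised output; rescaling every partial to the global maximum before summing gives the same result as a softmax over the whole cache, which is what lets the chunks be processed independently.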
</p><p>For L = 64K and batch size 1, Flash Decoding achieves <strong>4-8&#215; speedup</strong> over a naive attention implementation, which carries over to decode throughput for long-context requests to the extent that attention, rather than weight reads, dominates the step time.</p><h3>The production inference stack</h3><p>The mechanisms discussed above do not exist in isolation; they compose into a production inference serving system. </p><p>It is worth describing what that stack looks like<strong> end-to-end</strong>, because the interactions between components create optimization opportunities that are invisible when examining each component individually.</p><p>A production inference server for a 70B model on an 8-GPU node, circa mid-2026, runs approximately as follows. <strong>Tensor parallelism</strong> at degree 8 distributes the weight matrices across all 8 GPUs in the node, using NVLink for the all-reduce at each layer boundary. </p><p>The effective model per GPU is 70B / 8 = 8.75B parameters, requiring approximately 17.5 GB of HBM for weights in BF16. With 80 GB per GPU, this leaves 62.5 GB per GPU for KV cache.</p><p>The<strong> KV cache</strong> is managed by PagedAttention with a block size of 16 tokens. The pool of free blocks is divided among the 8 GPUs (with TP, the KV cache is also sharded, since each GPU handles a subset of the KV heads under GQA). </p><p>For a model with 8 total KV heads under 8-way TP, each GPU handles exactly 1 KV head, and the <strong>per-GPU KV cache </strong>is accordingly 1/8 of the total.</p><p>The scheduler runs continuous batching, maintaining a queue of waiting requests and a set of active requests. At each step, it admits new requests into the batch as KV cache pages become available. 
It <strong>preempts requests</strong> (<em>evicts their KV cache pages, returning them to the free pool</em>) when memory pressure is high, re-scheduling preempted requests from the beginning (or from a checkpoint) later.</p><p>Speculative decoding is enabled with a draft model of approximately 7B parameters, contributing an additional 14 GB of weight memory across the 8 GPUs (1.75 GB per GPU) and an additional <strong>14 GB KV cache</strong> (1.75 GB per GPU), after which approximately 59 GB per GPU remains for the target model KV cache.</p><p>The prefill of long prompts is split into chunks (<em>chunked prefill</em>): instead of processing a 64K-token prompt as a single prefill operation that saturates compute for a long time and <strong>blocks decode requests</strong> from running, the prompt is processed in chunks of, say, 2048 tokens per step, interleaved with decode steps.</p><p>This trades slightly <strong>higher TTFT</strong> for the long prompt against dramatically better decode latency for concurrent users, a tradeoff that is almost always correct in a multi-user serving scenario.</p><div><hr></div><h2>What the series has built</h2><p>Ten parts. From <strong>SMs to NVSwitch</strong>. From warp scheduling to speculative decoding. From single-GPU kernels to multi-node inference stacks.</p><p>Across all of it, one idea stays invariant: a GPU is not a &#8220;<em>compute machine</em>&#8221; in the naive sense. It is a latency-hiding system. 
Every layer of the stack exists to keep data moving while something else is waiting: <strong>warps hiding memory latency</strong>, NVLink hiding interconnect latency, batching hiding kernel inefficiency, and scheduling hiding autoregressive seriality.</p><p>At scale, the same structure repeats. Training systems hide communication under compute. <strong>Inference systems</strong> hide serial generation under parallel requests. Serving stacks hide memory fragmentation under virtualized allocation.</p><p>The details change, but the pattern does not: identify the bottleneck, then restructure the system so that bottleneck is <strong>no longer visible</strong> to the critical path.</p><p>What emerges is not a collection of optimizations, but a consistent way of thinking about hardware systems. Everything reduces to one question:</p><p><em>what resource is constraining progress right now (compute, memory bandwidth, or communication), and how do we prevent it from sitting idle?</em></p><p>The specific techniques will evolve. NVLink will be replaced, attention kernels will be rewritten, <strong>new quantization schemes</strong> will appear, and hardware will continue to shift the ridge points we computed throughout this series.</p><p>But the underlying structure will not change, because it is not a property of transformers: it is a <strong>property of physics.</strong></p><p>That is the real object we have been studying. Not GPUs. Not transformers. 
But the constraints that govern<em> how any system can compute</em> under finite bandwidth and finite time.</p>]]></content:encoded></item><item><title><![CDATA[Mastering CUDA and High-Performance Computing, Part IX ]]></title><description><![CDATA[Where Part VIII Left Us]]></description><link>https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-d2d</link><guid isPermaLink="false">https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-d2d</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Tue, 21 Apr 2026 09:19:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FyE3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a37cd0-605f-46ee-bbc8-55b3d119503c_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FyE3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a37cd0-605f-46ee-bbc8-55b3d119503c_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!FyE3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a37cd0-605f-46ee-bbc8-55b3d119503c_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!FyE3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a37cd0-605f-46ee-bbc8-55b3d119503c_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!FyE3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a37cd0-605f-46ee-bbc8-55b3d119503c_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!FyE3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a37cd0-605f-46ee-bbc8-55b3d119503c_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FyE3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a37cd0-605f-46ee-bbc8-55b3d119503c_1024x1536.png" width="1024" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/70a37cd0-605f-46ee-bbc8-55b3d119503c_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2751137,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/194498965?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a37cd0-605f-46ee-bbc8-55b3d119503c_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FyE3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a37cd0-605f-46ee-bbc8-55b3d119503c_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!FyE3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a37cd0-605f-46ee-bbc8-55b3d119503c_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!FyE3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a37cd0-605f-46ee-bbc8-55b3d119503c_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!FyE3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a37cd0-605f-46ee-bbc8-55b3d119503c_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><h2>Where Part VIII Left Us</h2><p><strong>Part VIII</strong> ended at something close to a philosophical statement: the SMSP on a well-tuned Hopper <strong>GEMM kernel</strong> is a machine that does one thing. Everything else has been delegated.</p><p>That is true, and it is beautiful, and it is also irrelevant the moment the model you are trying to train does not fit in 80 GB of <strong>HBM3.</strong></p><p>GPT-3 has <em>175 billion parameters.</em> In BF16 that is 350 GB of weights alone, before you add optimizer state, activations, and <strong>gradients</strong>. A single H100 has 80 GB. You need at least five of them just to hold the parameters, and in practice you need significantly more to make training feasible.</p><p>At this point the <strong>single-GPU roofline model</strong>, with its ridge points, its arithmetic intensity calculations, its tensor core utilization percentages, becomes necessary but not sufficient.</p><p>You need a <strong>new abstraction layer</strong> that sits above the GPU and treats a rack, or a pod, or a datacenter, as the compute substrate.</p><p>This part is about that layer. We will cover <strong>tensor parallelism</strong>, pipeline parallelism, data parallelism, and the collective communication primitives that tie them together.</p><p>We will look at <strong>NCCL</strong>, at <strong>NVLink</strong> topology and how it interacts with bandwidth requirements, and at the specific arithmetic of why certain parallelism strategies work and others do not at scale.</p><p>We will go into great detail. Fasten your seatbelts. 
</p><div><hr></div><h2>The memory wall has not gone away</h2><p>Before we talk about parallelism strategies, we need to internalize what &#8220;model doesn&#8217;t fit&#8221; actually means, quantitatively.</p><p>A transformer with <em>P</em> parameters trained in mixed precision requires, at minimum:</p><ul><li><p><strong>2P bytes</strong> for the model weights in BF16</p></li><li><p><strong>4P bytes</strong> for the master weights in FP32 (kept by the optimizer for numerical stability)</p></li><li><p><strong>8P bytes</strong> for the Adam optimizer states (m and v vectors, both FP32)</p></li><li><p><strong>2P bytes</strong> for the gradients in BF16</p></li></ul><p>Total: <strong>16P bytes</strong> in the steady state, not counting activations.</p><p>For a 70B parameter model (<em>Llama 3 scale</em>), this is 1,120 GB: fourteen H100s&#8217; worth of <strong>HBM</strong> just for the weights, gradients, and optimizer state. This is not a pathological edge case; this is the routine reality of training frontier models.</p><p>Inference is cheaper (you do not need <strong>optimizer state</strong>) but for a 405B parameter model in FP8, you are still looking at 405 GB, spread across at minimum six H100 80GB instances, with <strong>careful attention</strong> to how the tensor operations are partitioned so that no single GPU needs operands that exceed its HBM capacity.</p><p>The problem is therefore not just &#8220;<em>how do we make one GPU fast</em>&#8221; but &#8220;<em>how do we decompose a computation that is too large for one GPU into pieces that run efficiently on many GPUs, with the communication overhead between those pieces small enough that the <strong>multi-GPU system</strong> achieves a meaningful fraction of the theoretical sum of its parts.</em>&#8221;</p><p>That fraction has a name: <strong>parallel efficiency</strong>. Getting it above 0.5 for a thousand-GPU training run is hard. Getting it above 0.8 is a research problem. 
</p><p>Getting it <strong>above 0.9</strong> is what separates companies that can train frontier models economically from companies that cannot.</p><div><hr></div><h2>Three orthogonal dimensions of parallelism</h2><p>The standard taxonomy, established empirically by the <strong>Megatron-LM work</strong> at NVIDIA and subsequently refined, identifies three orthogonal axes along which a transformer training job can be parallelized.</p><p><strong>Data Parallelism (DP):</strong> Replicate the model across <em>N</em> GPUs, partition the training batch into <em>N</em> shards, run a forward and backward pass independently on each GPU, and then average the gradients across all <em>N</em> replicas. Every GPU holds the full model. The communication pattern is a <strong>single all-reduce</strong> over the gradient tensors after each backward pass.</p><p><strong>Tensor Parallelism (TP):</strong> Partition individual weight matrices across <em>N</em> GPUs, so that each GPU holds a <em>1/N</em> shard of each matrix. A single <strong>matrix multiply</strong> that would require, say, a 4096&#215;16384 GEMM on one GPU instead requires a 4096&#215;(16384/N) GEMM on each of the <em>N</em> GPUs, followed by a collective to reassemble the result.</p><p>The communication pattern is tightly coupled to the computation itself: an all-reduce (or all-gather + reduce-scatter) at every layer boundary, in both the forward and the backward pass.</p><p><strong>Pipeline Parallelism (PP):</strong> Partition the layers of the model across <em>N</em> GPUs, so that GPU 0 holds layers 1&#8211;<em>L/N</em>, GPU 1 holds layers <em>L/N</em>+1 through <em>2L/N</em>, and so on. 
A micro-batch flows through the pipeline sequentially.</p><p>The communication pattern is a <strong>point-to-point activation</strong> transfer between adjacent stages, one per micro-batch per stage boundary.</p><p>These three dimensions compose. The Megatron-LM work that trained GPT-3-scale models on <strong>A100 clusters</strong> used all three simultaneously: in a representative configuration, 8-way TP within a node (exploiting NVSwitch), <strong>4-way PP across nodes</strong>, and data parallelism across node groups.</p><p>The product is the total GPU count: 8 &#215; 4 &#215; D = total GPUs, where D is the data parallel degree.</p><p>Understanding <em>why</em> this kind of combination is chosen, and not, say, 64-way TP with no PP, requires understanding the communication topology and the arithmetic of collective operations.</p><p>That is what the rest of this part is about.</p><div><hr></div><h2>NVLink and NVSwitch</h2><p>Communication between GPUs can happen over <strong>two physical fabrics</strong>: <strong>PCIe</strong> and <strong>NVLink</strong>. The performance difference between them is not small.</p><p>PCIe 4.0 x16 provides <strong>32 GB/s unidirectional bandwidth.</strong> PCIe 5.0 x16 doubles that to 64 GB/s. These numbers sound reasonable until you compare them to what you actually need during all-reduce.</p><p>NVLink 4.0 (H100) provides <strong>900 GB/s bidirectional bandwidth</strong> per GPU in NVLink-connected configurations: that is 450 GB/s in each direction. This is roughly 7&#215; better than PCIe 5.0 in each direction, and the real-world benefit is larger because NVLink latency is also significantly lower.</p><p>But &#8220;<em>NVLink-connected</em>&#8221; hides a critical topological detail. NVLink connects pairs of GPUs (or GPUs through an NVSwitch). The <strong>DGX H100 system</strong> has 8 GPUs connected through NVSwitch 3.0, which provides full all-to-all connectivity at 900 GB/s per GPU. 
</p><p>This means any GPU can communicate with any other GPU in the same node at full bandwidth simultaneously. The NVSwitch acts as a <strong>non-blocking switch</strong> fabric.</p><p>Across nodes, the picture changes entirely. <em>Multi-node communication</em> happens over <strong>InfiniBand</strong> (HDR or NDR), with typical all-reduce bandwidth of 25&#8211;50 GB/s per GPU depending on topology and rail configuration: roughly <strong>10&#8211;20&#215; slower</strong> than intra-node NVLink.</p><p>This 10&#8211;20&#215; bandwidth gap between <strong>intra-node</strong> and <strong>inter-node</strong> communication is the single most important physical fact for understanding why multi-GPU parallelism is structured the way it is.</p><p><strong>The implication is immediate:</strong> communication-heavy parallelism strategies (like tensor parallelism, which requires an all-reduce at every layer) should be confined within a node, where NVLink bandwidth makes the overhead acceptable.</p><p><strong>Communication-light</strong> parallelism strategies (like pipeline parallelism, which only requires activation transfers at stage boundaries) can span node boundaries.</p><p>This is exactly what Megatron-LM does, and the reason is physics, not convention.</p><div><hr></div><h2>Data parallelism in depth</h2><p>Data parallelism is the simplest strategy and the one with the lowest implementation complexity. 
It is also the one where the <strong>communication overhead</strong> is most amenable to hiding behind computation, given careful engineering.</p><p>The communication requirement for data parallelism is an <strong>all-reduce</strong> over the gradient tensors after each backward pass. For a model with <em>P</em> parameters in BF16, this all-reduce operates on 2<em>P</em> bytes of gradient data.</p><p>For a <strong>70B model</strong>, that is 140 GB per all-reduce. At 25 GB/s inter-node InfiniBand, a naive, non-overlapped all-reduce would take at least <strong>5.6 seconds</strong> (140 GB / 25 GB/s, before the 2(N&#8722;1)/N traffic factor of a ring all-reduce makes it worse). For a training step that takes 2&#8211;3 seconds of compute, this is catastrophically inefficient. The GPU would be idle for roughly twice as long as it was computing.</p><p>The solution is to overlap <strong>gradient communication</strong> with the backward pass computation. As each layer&#8217;s gradients are computed during the backward pass, those gradients can be immediately all-reduced while the <strong>backward pass</strong> continues computing the gradients of earlier layers.</p><p>This is called <strong>gradient overlap</strong>, and it is implemented in PyTorch via the DistributedDataParallel (DDP) bucket mechanism: gradients are grouped into buckets of <strong>approximately 25 MB</strong> (the default), and an all-reduce is launched for each bucket as soon as it fills, overlapping with the backward computation of earlier layers.</p><p>The efficiency of gradient overlap depends on the ratio of compute time to communication time per layer.</p><p>For large models with large batch sizes, this ratio is favorable: the layers are <strong>compute-heavy</strong>, and there is always useful computation happening while the all-reduce for a previous layer&#8217;s gradients is in flight.</p><p>For small models or <strong>small batch sizes</strong>, the backward computation per layer is short and the all-reduce cannot be fully hidden. 
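</p><p>How much of the all-reduce stays exposed can be estimated with a small timeline model. The sketch below is an illustration under stated assumptions, not PyTorch&#8217;s DDP implementation: <em>exposed_comm_time</em> is a hypothetical helper, buckets are assumed to be all-reduced one at a time on a single communication stream, and the millisecond figures are invented for the example.</p>
<pre>
```python
def exposed_comm_time(compute_ms, comm_ms):
    """Time spent waiting on all-reduce after backward compute has finished.

    compute_ms[i] and comm_ms[i] are the backward compute time and the
    gradient all-reduce time of bucket i, in the order the backward pass
    produces the buckets. All values are hypothetical inputs.
    """
    compute_done = 0.0  # when the current bucket's gradients are ready
    comm_done = 0.0     # when the communication stream becomes free
    for c, r in zip(compute_ms, comm_ms):
        compute_done += c
        # An all-reduce starts once its gradients exist AND the previous
        # all-reduce has finished (single stream assumption).
        comm_done = max(comm_done, compute_done) + r
    return comm_done - compute_done


# Compute-heavy buckets: 10 ms of backward work each, 4 ms all-reduce each.
print(exposed_comm_time([10, 10, 10], [4, 4, 4]))

# Compute-light buckets: 1 ms of backward work each, same 4 ms all-reduce.
print(exposed_comm_time([1, 1, 1], [4, 4, 4]))
```
</pre>
<p>With 10 ms of compute per bucket, only the final bucket&#8217;s 4 ms all-reduce is exposed; with 1 ms of compute per bucket, 10 of the 12 ms of communication sit on the critical path.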
</p><p>This is one reason why very large batch sizes are computationally efficient beyond the obvious &#8220;<em>more samples per step</em>&#8221; benefit: larger batches mean <strong>longer per-layer compute time</strong>, which means more time to hide communication.</p><h3>ZeRO: when the model doesn&#8217;t fit, but you want data parallelism anyway</h3><p>Vanilla data parallelism replicates the <strong>full model</strong> on every GPU. For a 70B model, this requires every GPU to have 1,120 GB of memory (with optimizer state), which is physically impossible today and will remain so for some time.</p><p><strong>ZeRO</strong> (<em>Zero Redundancy Optimizer</em>), developed by Microsoft <strong>DeepSpeed</strong>, addresses this by partitioning the model state across the data parallel group rather than replicating it.</p><p>ZeRO comes in three stages of increasing memory savings and communication cost:</p><p><strong>ZeRO-1:</strong> Partition the optimizer state (FP32 master weights plus the m, v vectors) across the DP group. Each GPU holds a <em><strong>1/N</strong></em><strong> shard of the optimizer state.</strong> Communication volume is unchanged relative to a gradient all-reduce. <em>Memory savings:</em> up to 4&#215; in total model state for large <em>N</em> (16P &#8594; 4P + 12P/N per GPU).</p><p><strong>ZeRO-2:</strong> Partition the gradients in addition to the optimizer state. The gradient all-reduce becomes a reduce-scatter, after which each GPU keeps only its <em><strong>1/N</strong></em><strong> shard of the gradients</strong> (the portion it needs for its optimizer state shard). <em>Memory savings:</em> up to 8&#215; for large <em>N</em> (16P &#8594; 2P + 14P/N per GPU). Communication volume: unchanged.</p><p><strong>ZeRO-3:</strong> Partition the model parameters as well. Each GPU holds only <em>1/N</em> of the model weights at any given time. During the forward and backward pass, the needed weight shards are all-gathered from the <strong>DP group just-in-time.</strong> Memory savings: linear in the DP degree (16P &#8594; 16P/N per GPU), for example 64&#215; at N = 64. 
<em>Communication overhead</em>: increases by 1.5&#215; compared to vanilla DDP (due to the all-gather operations for parameters).</p><p>ZeRO-3 with a DP degree of 64 reduces the per-GPU model state for a 70B model from 1,120 GB to approximately 17.5 GB, comfortably within a single H100&#8217;s 80 GB. The tradeoff is the 1.5&#215; increase in <strong>communication volume</strong>, which must be weighed against the larger batch sizes that ZeRO-3 enables.</p><p>The engineering implementation of ZeRO-3 is <strong>non-trivial:</strong> parameters must be gathered before each layer&#8217;s forward pass, freed immediately afterward, and gathered again for that layer&#8217;s backward pass (and once more for the recomputed forward if <strong>gradient checkpointing</strong> is also active).</p><p>The <strong>memory allocator</strong> must be aware of these temporary parameter buffers and free them aggressively.</p><p>DeepSpeed&#8217;s implementation of ZeRO-3 does this with a parameter prefetch buffer: as layer <em>i</em> is executing its forward pass, layer <em>i+1</em>&#8217;s parameters are being <strong>all-gathered in the background</strong>, overlapping communication with computation at the layer level rather than the bucket level.</p><div><hr></div><h2>Tensor parallelism in depth</h2><p>Tensor parallelism (TP), as formalized in the Megatron-LM paper, exploits the specific structure of <strong>transformer layers</strong> to split individual matrix multiplications across multiple GPUs.</p><p>Consider a single transformer <strong>MLP layer</strong> with weight matrices W1 of shape [d_model, d_ffn] and <strong>W2</strong> of shape [d_ffn, d_model], where d_ffn = 4 &#215; d_model. For a 70B model, d_model &#8776; 8192, so d_ffn &#8776; 32768.</p><p>The output of the MLP layer is: Y = GeLU(XW1)W2</p><p>With 8-way tensor parallelism, <strong>W1</strong> is split column-wise across 8 GPUs, each holding W1_i of shape [d_model, d_ffn/8]. 
W2 is split row-wise across 8 GPUs, each holding W2_i of shape [d_ffn/8, d_model].</p><p><strong>The computation on GPU </strong><em><strong>i</strong></em><strong> becomes:</strong></p><ol><li><p><em>Y1_i = X &#215; W1_i (local GEMM, shape [batch, d_ffn/8])</em></p></li><li><p><em>Z_i = GeLU(Y1_i) (local elementwise)</em></p></li><li><p><em>Y2_i = Z_i &#215; W2_i (local GEMM, shape [batch, d_model])</em></p></li><li><p><em>Y = AllReduce(Y2_i) (sum partial results across 8 GPUs, shape [batch, d_model])</em></p></li></ol><p>The <strong>all-reduce at step 4</strong> is the communication bottleneck. For a ring all-reduce, the traffic per GPU is:</p><p>2 &#215; (N-1)/N &#215; |Y2| bytes</p><p>For a batch of 2048 tokens, d_model = 8192, BF16: |Y2| = 2048 &#215; 8192 &#215; 2 bytes &#8776; 32 MB. With N = 8 that is 2 &#215; 7/8 &#215; 32 MB = 56 MB per GPU, which at 450 GB/s NVLink bandwidth takes approximately 56 MB / 450 GB/s &#8776; 124 microseconds.</p><p>The <strong>compute time</strong> for the two local GEMMs is: 2 GEMMs &#215; (2 &#215; 2048 &#215; 8192 &#215; 32768/8) FLOPs &#247; (494 TFLOP/s per GPU) = 2 &#215; 2 &#215; 2048 &#215; 8192 &#215; 4096 / 494e12 &#8776; <strong>0.56 milliseconds</strong>, or approximately 0.28 ms per GEMM.</p><p>Communication (124 &#181;s) is roughly 22% of compute (560 &#181;s). This is a <strong>manageable ratio</strong>: the all-reduce costs a fraction of the compute it follows.</p><p>But now notice what happens if you push <strong>TP degree from 8 to 64 over InfiniBand</strong> at 25 GB/s instead of NVLink at 450 GB/s: the all-reduce takes 32 MB / 25 GB/s &#215; (2 &#215; 63/64) &#8776; 2.5 ms, which is now several times the original compute time.</p><p>The <strong>compute per GPU</strong> has also dropped by 8&#215; (fewer flops per GPU due to more partitioning), so the two GEMMs together take 0.56 ms / 8 &#8776; 70 &#181;s.</p><p>The all-reduce (2.5 ms) now takes <strong>roughly 36&#215; longer than the compute it follows</strong>. You are GPU-idle for about 97% of the time. 
This is why tensor parallelism over InfiniBand at high TP degrees is a terrible idea: not a theoretical concern but an arithmetic certainty.</p><p>The <strong>intra-node NVLink</strong> constraint on <strong>TP degree</strong> is therefore typically TP &#8804; 8 for an 8-GPU node, precisely because that is the largest group over which NVLink bandwidth keeps the communication overhead manageable.</p><h3>Self-attention and sequence parallelism</h3><p>The attention mechanism has a different structure than the MLP, but the Megatron-LM approach handles it symmetrically: <strong>Q, K, and V projection matrices</strong> are split column-wise (head-parallel), and the output projection is split row-wise. Each GPU handles a subset of attention heads.</p><p>For <em>H</em> total heads and TP degree <em>N</em>, each GPU computes <strong>H/N heads independently.</strong> No communication is needed within the attention computation itself; only the output projection requires an all-reduce.</p><p>This works cleanly as long as <em>H</em> is divisible by <em>N</em>, which is why transformer architectures are almost always designed with head counts that are <strong>powers of 2 or multiples of 8.</strong></p><p>There is an additional subtlety for attention: <strong>LayerNorm</strong> and dropout require the full activation tensor, not a sharded one. 
</p><p><strong>In vanilla TP</strong>, these operations run on the full (all-reduced) activations, which means they see the full sequence length at full d_model dimension: they are not distributed and do not benefit from the <strong>TP decomposition.</strong></p><p><strong>Sequence Parallelism</strong> (SP), introduced as an extension to Megatron-LM TP, addresses this by replacing the all-reduce at layer boundaries with an <strong>all-gather + reduce-scatter</strong> pattern, and distributing the non-tensor-parallel operations (LayerNorm, dropout) across the sequence dimension rather than the model dimension.</p><p><strong>In SP</strong>, the activation between transformer layers is sharded across the TP group along the sequence dimension: each GPU holds [batch, seq_len/N, d_model] instead of [batch, seq_len, d_model]. </p><p>Before entering the tensor-parallel MLP, an all-gather reconstructs the full [batch, seq_len, d_model] tensor. After the <strong>MLP&#8217;s row-parallel W2 partial sum</strong>, a reduce-scatter simultaneously sums the partial results and re-shards the output along the sequence dimension.</p><p>The <strong>communication volume</strong> is identical to the all-reduce, but the memory advantage is significant: activations are now partitioned across the TP group, reducing the peak memory per GPU by a factor of <em>N</em> for the activation tensors.</p><p>At <strong>large sequence lengths </strong>(128K tokens, as in recent long-context models), this memory saving is the difference between fitting in HBM and <strong>catastrophic OOM.</strong></p><div><hr></div><h2>Pipeline 
Parallelism</h2><p><strong>Pipeline parallelism is the ugliest of the three strategies.</strong> This is a statement of fact, not an aesthetic judgment.</p><p>It introduces <strong>pipeline bubbles</strong> (periods where some GPUs are idle because they are waiting for activations from the previous stage) and <strong>managing those bubbles is the central engineering challenge</strong>.</p><p>In the simplest pipeline schedule (<strong>GPipe</strong>), a batch of <em>M</em> micro-batches flows through <em>P</em> pipeline stages. Micro-batch 0 passes through stage 0, then stage 1, ..., then stage <em>P</em>-1. As soon as micro-batch 0 has left stage 0, micro-batch 1 enters it, then micro-batch 2, and so on, so the stages fill up one after another.</p><p>The backward pass happens in reverse, with a full flush between the forward and backward sweeps. The pipeline bubble fraction (time wasted on idle GPUs as a fraction of total time) is approximately:</p><p><strong>bubble fraction &#8776; (P &#8722; 1) / (M + P &#8722; 1)</strong></p><p>For P = 4 stages and M = 8 micro-batches: bubble fraction &#8776; 3/11 &#8776; 27%. More than a quarter of GPU time is wasted.</p><p>To reduce the bubble, you increase M. For M = 32: bubble fraction &#8776; 3/35 &#8776; 8.6%. For M &#8594; &#8734;, bubble fraction &#8594; 0.</p><p>But increasing M increases the memory required to store the activations of all in-flight micro-batches during the forward pass before the backward pass can begin (GPipe requires storing all activations). <strong>For </strong><em><strong>M</strong></em><strong> micro-batches each with activation size </strong><em><strong>A</strong></em><strong>, the activation memory is M &#215; A, and this grows linearly with M.</strong></p><p><strong>1F1B (One Forward One Backward)</strong> scheduling, introduced in PipeDream and used by Megatron-LM, breaks this linear activation scaling. 
In 1F1B, each pipeline stage interleaves one forward pass and one backward pass for different micro-batches, rather than running all forwards before any backwards.</p><p>The pipeline still has a bubble at startup and drain, but <strong>the steady-state memory is bounded: at any point, a stage has at most </strong><em><strong>P</strong></em><strong> micro-batches&#8217; activations in flight (not </strong><em><strong>M</strong></em><strong>)</strong>. The bubble fraction is the same as GPipe ((P&#8722;1)/(M+P&#8722;1)), but the activation memory is P &#215; A instead of M &#215; A.</p><p>For large models where M must be large to amortize the bubble, this is a crucial difference. With M = 64 and P = 8, GPipe stores 64 micro-batches of activations; 1F1B stores at most 8.</p><p><strong>Interleaved 1F1B</strong> goes further: each GPU holds <em>V</em> virtual pipeline stages (chunks of layers) instead of one contiguous block, enabling the bubble fraction to be reduced to:</p><p><strong>bubble fraction &#8776; (P &#8722; 1) / (V &#215; M + P &#8722; 1)</strong></p><p>at the cost of V times the point-to-point communication per step (because each micro-batch now crosses V times as many stage boundaries). 
This is the schedule used by Megatron-LM for the largest training jobs, with V=2 or V=4 providing a 2-4&#215; reduction in bubble at a 2-4&#215; increase in inter-stage communication.</p><h3>The Activation Recomputation Tradeoff</h3><p>One more tool for managing activation memory in pipeline parallelism (and, frankly, in any large training run): <strong>gradient checkpointing</strong>, also called <strong>activation recomputation</strong>.</p><p>The idea is simple: instead of storing the full activation tensor for every layer during the forward pass (<em>needed for the backward pass</em>), you store only the activations at checkpoint boundaries (<em>e.g., every transformer block</em>) and recompute the intermediate activations on-demand during the backward pass by running the forward computation again.</p><p><strong>The memory cost is reduced from O(L &#215; A) to O(&#8730;L &#215; A) for optimal checkpoint placement</strong> (<em>checkpoint every &#8730;L layers</em>). The compute cost increases by approximately 30-40% (one extra forward pass per layer, amortized).</p><p>For training at the frontier, this tradeoff is almost always worth it: <strong>compute is more abundant than HBM capacity</strong>, and the alternative is buying more GPUs (or, equivalently, using more PP stages, which increases the bubble).</p><p>The interaction between<strong> activation recomputation</strong> and pipeline parallelism is non-trivial: if you are recomputing activations, the backward pass must re-run the forward computation for each stage, which means each stage must still have access to the input activations from the <strong>forward pass checkpoint.</strong> </p><p>This constrains how aggressively you can combine <strong>PP and recomputation</strong>, and the Megatron-LM codebase has explicit logic for managing which activations are stored versus recomputed across pipeline stages.</p><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>The library that makes multi-GPU work</h2><p><strong>Everything discussed above (all-reduce, all-gather, reduce-scatter) is implemented in practice by NCCL (NVIDIA Collective Communications Library).</strong> </p><p>Understanding what NCCL does, and more importantly <em>how</em> it does it, is necessary for diagnosing performance problems at scale.</p><p>NCCL implements the standard collective communication operations (AllReduce, AllGather, ReduceScatter, Broadcast, Reduce), plus point-to-point Send/Recv, on NVIDIA GPUs, with <strong>topology-aware algorithms</strong> that exploit NVLink, NVSwitch, and InfiniBand according to the detected hardware configuration. It has no dedicated barrier collective; frameworks typically build one from a small AllReduce.</p><h3>Ring-AllReduce vs Tree-AllReduce</h3><p>The canonical <strong>AllReduce algorithm</strong> for <em>N</em> GPUs is <strong>ring-allreduce</strong>, introduced in the deep learning context by Baidu Research and later made famous by the Horovod library.</p><p>In <strong>ring-allreduce</strong>, GPUs are arranged in a logical ring. AllReduce is decomposed into two phases:</p><p><strong>Reduce-Scatter:</strong> Each GPU sends a chunk of its data to the next GPU in the ring, while simultaneously receiving and accumulating a chunk from the previous GPU. After N&#8722;1 steps, each GPU holds the fully reduced value for one chunk (its 1/N shard of the full tensor).</p><p><strong>AllGather:</strong> Each GPU shares its reduced chunk with all others by rotating it around the ring. 
After N&#8722;1 more steps, every GPU holds the fully reduced tensor.</p><p><strong>Total data sent per GPU: 2 &#215; (N&#8722;1)/N &#215; |data| &#8776; 2 &#215; |data| for large </strong><em><strong>N</strong></em><strong>.</strong> This is bandwidth-optimal: any AllReduce algorithm must send at least 2 &#215; (N&#8722;1)/N &#215; |data| bytes per GPU, so ring-allreduce achieves the theoretical minimum.</p><p>The total time of <strong>ring-allreduce</strong> scales as 2(N&#8722;1)&#945; + 2(N&#8722;1)/N &#215; |data|/&#946;, where &#945; is the point-to-point latency and &#946; is the bandwidth. </p><p>For large data (|data| &gt;&gt; N &#215; &#945; &#215; &#946;, the size at which the bandwidth term overtakes the latency term), the bandwidth term dominates and the algorithm is efficient. <strong>For small data, the latency term (proportional to N) dominates and ring-allreduce becomes expensive.</strong></p><p>This is why AllReduce for gradient synchronization (<em>large tensors, many GB</em>) works well with ring-allreduce, but AllReduce for small tensors (<em>like the batch statistics in synchronized BatchNorm</em>) can benefit from tree-based algorithms that have <strong>O(log N) latency scaling</strong> at the cost of non-optimal bandwidth utilization.</p><p>NCCL automatically selects the algorithm based on <strong>message size</strong>, topology, and a heuristic tuning table, but understanding the underlying tradeoff is necessary when the <strong>automatic selection</strong> is suboptimal for your specific workload.</p><h3>NCCL and topology awareness</h3><p>NCCL&#8217;s topology detection is worth examining in detail because it <strong>directly determines which algorithm it selects for intra-node vs inter-node operations</strong>.</p><p>On startup, NCCL probes the system topology using the <strong>CUDA device properties API</strong> and, where available, NVML (NVIDIA Management Library) topology information. 
</p><p>It constructs an<strong> internal graph</strong> where GPUs are nodes and NVLink/PCIe/InfiniBand connections are edges with associated bandwidths.</p><p>For an 8-GPU DGX H100 node with NVSwitch, NCCL detects a fully connected graph with 900 GB/s bidirectional bandwidth. </p><p>For multi-node communication over InfiniBand with a single SHARP switch, NCCL can use <strong>SHARP (Scalable Hierarchical Aggregation and Reduction Protocol)</strong>, an in-network computing feature of modern InfiniBand switches that <strong>performs the AllReduce reduction inside the switch fabric rather than at the endpoint GPUs</strong>.</p><p>SHARP effectively moves the bandwidth bottleneck from the GPU NICs to the switch, and for large-scale clusters with <strong>SHARP-capable HDR/NDR InfiniBand</strong>, it can reduce AllReduce latency by 2-3&#215; compared to standard ring-allreduce.</p><p>For hierarchical topologies (NVLink within nodes, InfiniBand between nodes), NCCL uses a <strong>two-level algorithm</strong>: a ReduceScatter within each node using the fast NVLink fabric, followed by an AllReduce across nodes over InfiniBand, followed by an AllGather within each node. </p><p><strong>This correctly exploits the bandwidth hierarchy: </strong>the slow inter-node link sees only the partially reduced results, not the full gradient tensor from every GPU.</p><p>The implementation detail that matters for practitioners: the <code>NCCL_SOCKET_NTHREADS</code> and <code>NCCL_BUFFSIZE</code> environment variables, along with the <code>NCCL_P2P_LEVEL</code> setting, are often the first things to tune when AllReduce performance is below theoretical bandwidth. 
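</p><p>As a hedged illustration of what a first tuning pass can look like, the fragment below sets these variables in the job&#8217;s launch environment. The values are assumptions chosen for illustration, not recommendations, and <code>NCCL_DEBUG=INFO</code> is an addition here, included so that NCCL logs the topology, transports, and algorithms it selected:</p><pre><code><code># Illustrative values only (assumptions); change one at a time and re-measure.
export NCCL_DEBUG=INFO          # log detected topology and chosen algorithms
export NCCL_SOCKET_NTHREADS=4   # CPU helper threads per socket connection (socket transport)
export NCCL_BUFFSIZE=8388608    # communication buffer size in bytes (8 MiB; default is 4 MiB)
export NCCL_P2P_LEVEL=NVL       # use direct GPU peer-to-peer only between NVLink-connected GPUs
</code></code></pre><p>Comparing the bus bandwidth reported by a collective benchmark such as <code>nccl-tests</code> before and after each change is the simplest way to tell whether a setting helped on a given fabric.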
</p><p>NCCL&#8217;s defaults are conservative for stability across diverse hardware configurations.</p><h3>The NCCL + CUDA stream interaction</h3><p>NCCL operations run on CUDA streams, and <strong>correct stream management is the source of many subtle performance bugs in multi-GPU training code</strong>.</p><p>The critical invariant: <strong>NCCL operations on the same communicator object are serialized in NCCL&#8217;s internal queue</strong>, but operations on different communicators are independent. </p><p>The <strong>PyTorch DDP</strong> implementation creates one NCCL communicator per process group, and all AllReduce operations for gradient synchronization go through this communicator.</p><p>When gradient overlap is active (AllReduce launched for bucket <em>i</em> while the backward pass continues computing gradients for bucket <em>i</em>-1), PyTorch launches the AllReduce on a separate CUDA stream from the backward pass computation. The backward computation stream and the <strong>AllReduce stream</strong> run concurrently on the GPU.</p><p>The synchronization at the end of the backward pass (before the optimizer step) ensures that all AllReduce streams have completed. <strong>Failure to synchronize here is a common bug that manifests as non-deterministic gradient corruption</strong>: the optimizer sees a partially all-reduced gradient tensor.</p><p>The interaction with tensor parallelism adds another layer: the AllReduce within a TP group happens on a separate communicator from the AllReduce within the DP group. 
</p><p>These two communicators must be correctly ordered in the forward/backward pass: <strong>TP AllReduce </strong>happens at each layer boundary (blocking for the next layer to begin), while <strong>DP AllReduce </strong>happens after the full backward pass.</p><div><hr></div><h2>The arithmetic of parallelism efficiency</h2><p>With the mechanisms in place, we can now put numbers on the parallel efficiency achievable under different configurations. This is where the theoretical discussion becomes practically useful.</p><p>For data parallelism, the relevant ratio is <strong>R&#7429;&#7448; = T_compute / T_allreduce</strong>. For gradient overlap to be effective, we need R&#7429;&#7448; &gt;&gt; 1. For a 70B model over InfiniBand at 25 GB/s, T_allreduce &#8776; 5.6 seconds. For a per-GPU batch of 512 tokens at 50% H100 utilization, T_compute &#8776; 0.58 seconds. R&#7429;&#7448; &#8776; 0.1. </p><p>The compute is <strong>an order of magnitude faster</strong> than the all-reduce, and no amount of overlap engineering fixes a ratio that is inverted.</p><p>The resolution is gradient accumulation: run multiple micro-batches locally before triggering the all-reduce, increasing the logical batch size without increasing activation memory. </p><p>This is not a workaround. It is the correct operating point, because <strong>convergence is measured in tokens seen, not steps taken</strong>.</p><p>For tensor parallelism, the critical TP degree N* is the point above which T_allreduce_TP &gt; T_GEMM_local. On NVLink at 450 GB/s, N* &#8776; 8. On InfiniBand at 25 GB/s, N* &#8776; 1 to 2. 
Tensor parallelism over InfiniBand is not a suboptimal choice; it is an arithmetic mistake.</p><p>The practical ceiling for a well-tuned thousand-GPU training job with Megatron-LM or DeepSpeed on H100 hardware is approximately <strong>40-60% MFU</strong>. </p><p>The gap from 100% is accounted for by pipeline bubbles (~15%), communication overhead (~10-15%), <strong>activation recomputation</strong> (~5-10%), and the residual from kernel inefficiency on non-GEMM operations and host-side Python overhead. </p><p>Getting from 40% to 60% MFU is a 1.5&#215; throughput gain, which cuts training time and cost by roughly a third. <strong>Every major AI lab </strong>has engineers whose entire job is that gap.</p><h2>What the three strategies actually buy you</h2><p>It is worth stepping back and stating plainly what each parallelism axis is for, now that we have done the arithmetic.</p><p><strong>Data parallelism</strong> buys throughput. More DP replicas means more tokens per second, with communication overhead that is manageable at any scale where the per-GPU batch is large enough. </p><p><strong>ZeRO makes DP viable</strong> even when the model does not fit on a single GPU, at the cost of increased communication volume that must be weighed against the larger batch sizes it unlocks. </p><p>DP is the outer loop of every large training job; the other strategies are refinements that make DP feasible at model sizes and cluster scales where it would otherwise be communication-bound.</p><p><strong>Tensor parallelism</strong> buys memory capacity per layer, by distributing the weight matrices and activations of individual layers across multiple GPUs. </p><p>Its cost is a mandatory all-reduce at every layer boundary, which makes it viable only where <strong>NVLink bandwidth</strong> makes that all-reduce cheap. </p><p>TP is fundamentally an intra-node strategy on current hardware. 
Using it otherwise is trading compute for communication at an exchange rate that is never favorable.</p><p><strong>Pipeline parallelism</strong> buys the ability to cross node boundaries without paying the tensor-parallel penalty. Its cost is the pipeline bubble, which is a tax on idleness that shrinks with micro-batch count and with interleaved scheduling. </p><p>PP is the mechanism by which you scale beyond a single node when the model is too large for TP alone to partition. It is ugly, and it is necessary, and the <strong>1F1B schedule </strong>and its interleaved variants exist specifically to make that ugliness manageable.</p><p>The <strong>three-dimensional parallelism strategy </strong>(TP within a node, PP across nodes, DP across node groups) is not a design choice that someone made. It is the unique solution that the bandwidth hierarchy of current hardware enforces. </p><p>NVLink makes TP cheap within a node. InfiniBand makes PP the only viable strategy across nodes. Gradient accumulation and<strong> ZeRO make DP efficient </strong>across the full cluster. Change the hardware, and the optimal strategy shifts. </p><p>NVL72 on Blackwell, by extending the NVLink fabric to 72 GPUs, shifts the TP/PP boundary outward. </p><p>InfiniBand NDR at 400 Gb/s per link, as it becomes more widely deployed, shifts the point at which inter-node DP communication becomes the bottleneck. 
<strong>The strategies are not timeless; the physics that motivates them is.</strong></p><div><hr></div><h2>What actually limits you at scale</h2><p>The question that every large-scale training practitioner eventually asks is: <em>why isn&#8217;t my thousand-GPU job achieving 70% MFU?</em> The answer is rarely a single cause. </p><p>It is a <strong>stack of inefficiencies</strong>, each modest in isolation, compounding into the gap between theoretical and measured throughput.</p><p>Pipeline bubbles are usually the largest single contributor at high PP degrees. For P=16 and M=32, the bubble fraction is 15/47 &#8776; 32% before any other inefficiency is counted. </p><p>Interleaved 1F1B at V=2 cuts this to 15/79 &#8776; 19%, at the cost of doubled inter-stage communication. The right V depends on whether <strong>communication or compute</strong> is the limiting resource at the PP boundary, which varies by model size, node count, and InfiniBand configuration.</p><p>Inter-node AllReduce tail latency is the second major contributor and the hardest to reason about, because it is <strong>statistical rather than deterministic</strong>. </p><p>The average InfiniBand bandwidth may be 25 GB/s, but the 99th-percentile latency can be 3-5&#215; higher than the median due to switch congestion, adaptive routing jitter, and multi-tenancy. </p><p>The all-reduce waits for the slowest link. At scale, the slowest link is almost always slower than the average, which means the <strong>effective all-reduce bandwidth</strong> is consistently worse than the number on the datasheet. </p><p><strong>RDMA configuration</strong>, static routing, and dedicated rail topology are the levers. 
They require infrastructure access that not every team has.</p><p>Host-side Python overhead is the most embarrassing source of inefficiency, because it is entirely self-inflicted. </p><p>PyTorch&#8217;s dispatcher, the <strong>GIL,</strong> and the overhead of the training loop&#8217;s Python logic can appear as measurable GPU idle time for models where per-step compute is short. <strong>CUDA graph</strong> <strong>capture </strong>eliminates per-operation launch overhead for the inner loop. </p><p>Careful pipelining of data loading and logging eliminates it for the outer loop. Teams that have done this work carefully report <strong>5-10% improvement</strong> in effective MFU from host-side optimizations alone, which is not a small number when the training run costs millions of dollars.</p><h2>Conclusion</h2><p>The trajectory from<strong> single-GPU to multi-GPU</strong> mirrors the trajectory within the single GPU: find the bottleneck, route around it, measure again.</p><p>On a single GPU, the bottleneck moved from arithmetic units (tensor cores solved it) to memory bandwidth (cp.async attacked it) to instruction overhead (TMA eliminated it). </p><p>The SMSP on a <strong>well-tuned Hopper GEMM kernel</strong> is a machine that does one thing, because every other thing it used to do has been delegated to dedicated hardware.</p><p>At the <strong>multi-GPU level</strong>, the bottleneck is the communication fabric, and the solution follows the same logic: match the parallelism strategy to the bandwidth available at each level of the hierarchy. </p><p>TP over NVLink, PP over InfiniBand, DP across the full cluster. The ring-allreduce, the 1F1B schedule, the ZeRO partitioner: these are the mechanisms by which a thousand-GPU cluster achieves 40-60% of the theoretical sum of its parts.</p><p>40-60% of a thousand H100s is still an extraordinary amount of compute. 
Whether it is enough to train the next frontier model is left as an exercise for the reader&#8217;s infrastructure budget.</p><p><strong>Part X</strong> will close in on a corner of this picture we have deliberately deferred: inference. Training is a <strong>one-time cost</strong>; inference is the workload that runs forever, at a scale that dwarfs training once a model is deployed. </p><p>The optimization challenges at inference time are different in kind: speculative decoding, <strong>KV cache management</strong>, continuous batching, and the specific arithmetic of when the prefill and decode phases are bottlenecked by entirely different resources. </p><p><em>The tools change; the principle does not.</em></p>]]></content:encoded></item><item><title><![CDATA[Mastering CUDA and High-Performance Computing, Part VIII]]></title><description><![CDATA[Where Part VII Left Us]]></description><link>https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-db8</link><guid isPermaLink="false">https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-db8</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Tue, 07 Apr 2026 09:52:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GG5h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03d2005-5a43-4f0d-8dda-41d6047144ed_1024x1536.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GG5h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03d2005-5a43-4f0d-8dda-41d6047144ed_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GG5h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03d2005-5a43-4f0d-8dda-41d6047144ed_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!GG5h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03d2005-5a43-4f0d-8dda-41d6047144ed_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!GG5h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03d2005-5a43-4f0d-8dda-41d6047144ed_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!GG5h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03d2005-5a43-4f0d-8dda-41d6047144ed_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GG5h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03d2005-5a43-4f0d-8dda-41d6047144ed_1024x1536.png" width="1024" height="1536" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f03d2005-5a43-4f0d-8dda-41d6047144ed_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2478549,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/160372692?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03d2005-5a43-4f0d-8dda-41d6047144ed_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GG5h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03d2005-5a43-4f0d-8dda-41d6047144ed_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!GG5h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03d2005-5a43-4f0d-8dda-41d6047144ed_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!GG5h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03d2005-5a43-4f0d-8dda-41d6047144ed_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!GG5h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03d2005-5a43-4f0d-8dda-41d6047144ed_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Where Part VII Left Us</h2><p><strong><a href="https://softwarefrontier.substack.com/p/mastering-cuda-and-high-performance-ea1">Part VII</a></strong> ended with a promise and an architectural cliffhanger.</p><p>The promise: on <strong>Hopper</strong>, the compute-to-load instruction ratio in a <strong>GEMM inner loop</strong> approaches infinity from the SMSP&#8217;s perspective. 
</p><p><strong>The cliffhanger</strong>: one instruction moves a 128&#215;128 BF16 tile, the <strong>TMA unit </strong>generates all the addresses, and something called an <code>mbarrier</code> replaces the <code>__syncthreads()</code> you have been writing since your first CUDA &#8220;hello world&#8221;.</p><p>Let us unpack exactly what that means, why NVIDIA made those choices, and what you have to understand to write, read, or <strong>debug CUTLASS 3.x kernels </strong>without feeling like you are reading a foreign language.</p><p>We will <strong>go very deep</strong>. There is no other way.</p><div><hr></div><h2>The Problem cp.async Did Not Fully Solve</h2><p><strong>Part VII</strong> established that <code>cp.async</code> is superior to the <strong>conventional LDG &#8594; STS path</strong> because it removes the destination registers from the scoreboard. The SMSP issues the copy, hands it off to the <strong>Async Copy Engine</strong>, and is immediately free to issue the next instruction.</p><p>This is genuinely great. But it has a hidden cost that only becomes visible when you look at the SMSP instruction stream of a <strong>real GEMM kernel.</strong></p><p>Consider a <strong>128&#215;128&#215;32 BF16 tile.</strong> Loading one 128&#215;32 operand slice of that tile requires 128 &#215; 32 BF16 elements = 4096 BF16 = 8 KB. At 16 bytes per <code>cp.async</code>, that is 512 individual <code>cp.async.ca.shared.global</code> instructions (<code>LDGSTS</code> in SASS). </p><p>Those 512 instructions have to be fetched from the instruction cache, decoded, dispatched through the <strong>MIO unit,</strong> and tracked by the hardware. They consume SMSP instruction bandwidth even though they produce no register results.</p><p>On Ampere, each SMSP can issue roughly one 128-bit <code>cp.async</code> every 4 cycles. For 512 instructions, that is approximately <strong>2048 SMSP cycles</strong> per tile load, just for the instruction overhead. 
The actual data movement happens asynchronously, but the instruction stream is not free.</p><p>For large tiles this is manageable. For smaller tiles, or for architectures where you want the SMSP to spend every cycle on tensor core instructions, it is a ceiling.</p><p>Hopper (SM90, H100) was designed to remove that ceiling entirely. The answer is the <strong>Tensor Memory Accelerator</strong>.</p><div><hr></div><h2>Tensor Memory Accelerator</h2><p>The <strong>TMA </strong>is a hardware unit introduced in <strong>Hopper</strong> that performs multi-dimensional tensor copies between global memory and shared memory (<em>or distributed shared memory across a cluster, but we will get to clusters</em>). </p><p>It accepts a <strong>tensor descriptor</strong> computed on the host and a set of <strong>coordinates</strong> computed on the device, and it handles everything else: address computation, striding, shared-memory swizzling, out-of-bounds handling (out-of-range elements are zero-filled rather than read), cache policy, and transaction completion signaling.</p><p>Let us be concrete about what &#8220;<em>everything else</em>&#8221; means.</p><p>In a <strong>conventional tiled GEMM</strong>, for every tile you load, every thread in the warp must compute its portion of the global memory address. </p><p>That address computation involves the block index, the thread index, the tile dimensions, <strong>the matrix stride</strong>, and the element size. It is entirely deterministic arithmetic that produces the same result every time you execute the same tile iteration. </p><p>It is also arithmetic that the <strong>SMSP</strong> has to execute. On Ampere with <code>cp.async</code>, that arithmetic still happens in the SMSP even though the subsequent memory transaction is asynchronous.</p><p>The TMA eliminates that arithmetic from the SMSP. One thread issues one instruction with a tensor descriptor handle and a pair of (y, x) coordinates. 
</p><p>The <strong>TMA unit</strong> uses those coordinates and the descriptor&#8217;s metadata to compute every address needed for the entire tile transfer, fetch the data, and write it to shared memory. The SMSP emitted one instruction. One.</p><p>This is not a minor optimization. It is a <strong>qualitative change</strong> in what the SMSP does during a GEMM kernel. On Hopper, the SMSP&#8217;s job is to run <code>WGMMA.MMA_ASYNC</code> instructions. </p><p>The TMA&#8217;s job is to move data. These <strong>two jobs</strong> happen simultaneously, on separate hardware units, and the only communication between them is an <code>mbarrier</code> synchronization object.</p><div><hr></div><h2>The tensor descriptor</h2><p>Before a Hopper kernel runs, the host must create a tensor descriptor using <code>cuTensorMapEncodeIm2col</code> or, more commonly for GEMM, <code>cuTensorMapEncodeTiled</code>. This is a <strong>128-byte opaque structure</strong> (a <code>CUtensorMap</code>), typically passed to the kernel as a <code>const __grid_constant__</code> parameter or placed in constant memory.</p><p>The descriptor encodes:</p><p><strong>Base pointer</strong>: the global memory address of tensor element [0, 0, 0, ...].</p><p><strong>Global dimensions</strong>: the actual size of each dimension in the full tensor, in elements, listed from the innermost (fastest-varying) dimension outward. For a row-major M&#215;K matrix A, this is {K, M}; for a column-major one, {M, K}.</p><p><strong>Global strides</strong>: the byte stride between consecutive indices of each dimension except the innermost, which is assumed to be contiguous. For a row-major matrix with K columns and BF16 elements, the stride between row i and row i+1 is K &#215; 2 bytes. 
These strides allow padded or otherwise non-packed layouts, as long as the innermost dimension is contiguous.</p><p><strong>Box dimensions</strong>: the size of the tile to be transferred in each dimension, in elements, listed innermost dimension first. For a tile of 128 rows by 32 columns of a row-major BF16 matrix, this is {32, 128}.</p><p><strong>Interleave and swizzle mode</strong>: how data should be rearranged during the transfer to produce a shared memory layout that avoids bank conflicts. This is the part that replaces all the padding arithmetic from Part VII.</p><p><strong>Element stride and data type</strong>: how to interpret the raw bytes.</p><p>The descriptor is created once on the <strong>CPU</strong> and passed to the kernel. On the device, a single warp or even a single thread can then use this descriptor to initiate a full tile transfer with one instruction, because all the <strong>per-tile invariant information</strong> is already encoded.</p><p>This is a deliberate design choice: move the expensive computation (descriptor creation) to the host, where <strong>latency is irrelevant</strong> relative to the kernel launch overhead, so that the device-side instruction can be as cheap as possible.</p><div><hr></div><h2>The TMA instruction itself</h2><p>The PTX for a 2D TMA load looks like this:</p><pre><code><code>cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes
    [smem_dst], [gmem_desc, {coord_y, coord_x}], [mbar];
</code></code></pre><p>Let us parse every token.</p><p><code>cp.async.bulk</code> means this is an <strong>asynchronous bulk copy</strong>; &#8220;bulk&#8221; distinguishes it from the per-thread <code>cp.async</code>. The transfer size is determined by the descriptor, not encoded in the instruction.</p><p><code>tensor.2d</code> means the TMA will interpret the coordinates as a <strong>2D tensor access</strong>. There are variants for 1D through 5D tensors.</p><p><code>shared::cluster</code> is the destination scope: shared memory that is visible to the entire thread block cluster (more on clusters shortly). For <strong>single-CTA kernels</strong> this is simply shared memory.</p><p><code>global</code> is the source: <strong>global memory</strong>, indexed via the descriptor.</p><p><code>mbarrier::complete_tx::bytes</code> is the completion signaling mechanism. When the transfer completes, the TMA signals an <code>mbarrier</code> object, decrementing its transaction count by the number of bytes delivered.</p><p>When the barrier&#8217;s pending counts reach zero, threads waiting on it are unblocked. This replaces <code>consumer_wait()</code> and <code>__syncthreads()</code> in the sense that the barrier itself tracks both the data arrival and the <strong>thread synchronization</strong> in a single primitive.</p><p><code>[smem_dst]</code> is the destination address in shared memory.</p><p><code>[gmem_desc, {coord_y, coord_x}]</code> is the descriptor plus coordinates. The TMA extracts the base pointer, strides, and box dimensions from the descriptor, applies the coordinates, and generates the full address range.</p><p><code>[mbar]</code> is a pointer to the <code>mbarrier</code> object in shared memory.</p><p>In <strong>CUDA C++</strong>, the <code>cuda::device::experimental::cp_async_bulk_tensor_2d_global_to_shared</code> wrapper in libcu++ generates this instruction (<code>__pipeline_memcpy_async</code>, by contrast, maps to the older per-thread <code>cp.async</code>, not to TMA). 
The canonical production path is through CUTLASS 3.x&#8217;s <code>cute::copy</code> with a <strong>TMA copy atom</strong>, which we will examine in the CUTLASS section.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>A synchronization primitive you have not seen before</h2><p><code>__syncthreads()</code> is a full thread block barrier. Every thread in the block must arrive before any thread proceeds. </p><p>It is implemented via a shared counter that is decremented by each arriving thread and checked by a <strong>hardware barrier mechanism</strong>. Its cost is proportional to thread block size, and it cannot distinguish between &#8220;<em>I&#8217;m done computing</em>&#8221; and &#8220;<em>my data has arrived from the DMA engine</em>&#8221;.</p><p><code>mbarrier</code> (memory barrier, or more precisely, the Hopper barrier object) solves both of those problems.</p><p>An <code>mbarrier</code> object is a <strong>64-bit value</strong> stored in shared memory. It has two phases, <strong>expect</strong> and <strong>arrive</strong>, and it tracks two distinct counts:</p><p>The <strong>arrival count</strong> is decremented by threads calling <code>mbarrier.arrive</code> or <code>mbarrier.arrive_drop</code>. When this count hits zero, the barrier phase flips.</p><p>The <strong>transaction count</strong> is decremented by the TMA engine itself when a bulk copy completes. This is the <code>complete_tx::bytes</code> in the PTX instruction above. 
The programmer initializes this count to the expected number of bytes that the TMA will deliver.</p><p>The barrier is &#8220;<em>complete</em>&#8221; when both counts reach zero: all participating threads have arrived, and all <strong>expected TMA transactions</strong> have completed.</p><p>This means you can have a consumer wait on a barrier that is signaled partly by threads and partly by hardware DMA engines, with <strong>no polling loop</strong>, no atomics in the critical path, and no <code>__syncthreads()</code> that serializes all 128 threads in the block.</p><p>The setup looks like this in<strong> CUDA C++</strong>:</p><pre><code><code>__shared__ cuda::barrier&lt;cuda::thread_scope_block&gt; mbar;

// One thread initializes the barrier for N_THREADS participants
if (thread_rank == 0) {
    init(&amp;mbar, N_THREADS);
    // Tell the barrier to also expect TMA_BYTES bytes of async data.
    // expect_tx raises the transaction count without counting as an arrival.
    cuda::device::barrier_expect_tx(mbar, TMA_BYTES);
}
__syncthreads();  // This syncthreads is to publish the initialized mbar

// Producer thread issues TMA (tma_load stands in for the bulk tensor copy wrapper)
if (thread_rank == 0) {
    tma_load(&amp;mbar, smem_A, gmem_desc_A, tile_coord_m, tile_coord_k);
}

// All threads arrive at the barrier (decrement arrival count)
auto token = mbar.arrive();

// Wait for both arrival count and transaction count to reach zero
mbar.wait(std::move(token));</code></code></pre><p><strong>Note the asymmetry</strong>: one thread issues the TMA, all threads participate in the barrier synchronization. This is not a bug; it is the design. </p><p>The TMA is a singleton operation that one thread initiates, but the data it delivers is <strong>consumed by all threads</strong>, so all threads must synchronize on its completion.</p><p>The <code>barrier_expect_tx</code> call informs the barrier that TMA bytes are expected. Without it, the barrier would complete as soon as all threads arrived, regardless of whether the DMA data had landed in shared memory. That would be a <strong>race condition</strong>.</p><p>The <code>token</code> returned by <code>arrive</code> is a phase token. <code>mbarrier</code> operates in alternating phases (like a double buffer at the synchronization level), and the token ensures that <code>wait</code> waits on the correct phase. </p><p>This is how Hopper avoids the <strong>ABA problem</strong> in barrier reuse: you cannot accidentally wait on a barrier phase that already completed in a previous iteration.</p><div><hr></div><h2>Warpgroup MMA</h2><p>Part VII did not cover the compute side of Hopper in depth because the memory side was already enough to digest. Now we need to talk about <code>WGMMA</code>, and it is equally radical.</p><p>On Ampere, tensor core instructions are issued per-warp: <code>HMMA.16816</code> or the PTX <code>mma.sync.aligned</code> operates on <strong>16&#215;8&#215;16 tiles</strong> with 32 threads participating. 
Each warp independently executes its tile of the matrix multiply. </p><p>Warp-level <strong>tensor core instructions</strong> were already a significant departure from SIMT, since all 32 threads in a warp cooperate to produce a single 16&#215;8 output tile. But the warp is still the unit of scheduling and the unit of tensor core execution.</p><p>On Hopper, the tensor core instruction is <strong>warpgroup-level</strong>: <code>WGMMA.MMA_ASYNC</code> operates on a group of 4 warps (128 threads) simultaneously. The input tile dimensions for BF16 are:</p><ul><li><p><em>A: 64&#215;16 per warpgroup (contributed from registers or shared memory)</em></p></li><li><p><em>B: 16&#215;256 per warpgroup (always from shared memory)</em></p></li><li><p><em>C/D: 64&#215;256 accumulator (in registers, split across the 128 threads)</em></p></li></ul><p>A single <code>WGMMA.MMA_ASYNC</code> instruction computes a 64&#215;256&#215;16 BFGEMM, producing 64&#215;256 = 16,384 output elements in one instruction. </p><p>For comparison, an Ampere <code>mma.sync.aligned</code> with the largest BF16 shape produces 16&#215;8&#215;16 BFGEMM, 128 output elements.</p><p>The output volume ratio is 128:1. This is what<em> &#8220;approaching infinite compute-to-load ratio&#8221;</em> means in practice.</p><p>The <code>_ASYNC</code> suffix is critical: <code>WGMMA.MMA_ASYNC</code> does not complete synchronously. <strong>The 4 warps issue</strong> the instruction and the result is not guaranteed to be in the accumulator registers until a <code>WGMMA.WAIT_GROUP</code> instruction is executed. </p><p>The hardware can overlap multiple <code>WGMMA</code> operations in flight simultaneously, and the programmer must insert explicit waits before reading the accumulators.</p><p><strong>The programming model</strong> therefore looks like this at the instruction level:</p><pre><code><code>WGMMA.MMA_ASYNC D, A, B   ; issue tile multiply k=0
WGMMA.MMA_ASYNC D, A, B   ; issue tile multiply k=1
WGMMA.MMA_ASYNC D, A, B   ; issue tile multiply k=2
...
WGMMA.WAIT_GROUP 0         ; wait for all outstanding WGMMAs
; D accumulator registers now hold valid results</code></code></pre><p>In CUDA C++, this is exposed through the SM90 MMA atoms in CUTLASS 3.x&#8217;s CuTe layer, which wrap the <code>wgmma.mma_async</code> PTX instruction. <strong>Direct PTX</strong> is also possible but strongly inadvisable outside of research contexts.</p><p>The reason B must always come from shared memory (not registers) is a hardware constraint. The tensor core units on Hopper are wired directly to the <strong>shared memory arrays</strong>. </p><p>The B operand is fetched directly from the <strong>shared memory banks</strong> by the tensor core datapath, without going through the register file. </p><p>This is why the TMA delivering B into shared memory is on the critical path, but there is no &#8220;<em>load B from shared memory to registers</em>&#8221; step. The tensor core reads shared memory directly.</p><p>A can come from either registers or shared memory. For the highest-performance kernels, A also comes from shared memory, which means both operands bypass the register file entirely on the compute side. The register file holds only the C/D accumulator.</p><div><hr></div><h2>Thread Block Clusters</h2><p>Hopper introduced a new level of the GPU hierarchy between the thread block and the grid: the <strong>thread block cluster</strong>.</p><p>A cluster is a group of up to 8 thread blocks that are guaranteed to be co-scheduled on the same GPC (GPU Processing Cluster, a hardware grouping of neighboring SMs). 
</p><p>Thread blocks within a cluster can access each other&#8217;s shared memory via the <strong>Distributed Shared Memory (DSMEM)</strong> mechanism, using TMA to move data between SMs without going through L2.</p><p>The<strong> PTX instruction</strong> for a cross-SM TMA transfer is:</p><pre><code><code>cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes
    [smem_dst], [gmem_desc, {coord_y, coord_x}], [mbar];
</code></code></pre><p>This is the same instruction as a regular TMA load, with the <code>shared::cluster</code> scope indicating that the destination is visible cluster-wide. The <strong>TMA unit </strong>manages the inter-SM data movement transparently.</p><p><em>Why does this matter for GEMM?</em> Consider a<strong> cluster of 2 CTAs</strong>, each responsible for a different row block of C. Both need access to the same columns of B. </p><p>With clusters, CTA 0 loads B into its<strong> shared memory</strong> via TMA, and CTA 1 can read CTA 0&#8217;s shared memory directly via DSMEM. B is loaded once and consumed by two CTAs. This effectively doubles the B reuse without doubling the shared memory per CTA.</p><p>For an N=8 cluster, 8 CTAs share the B tile load, amortizing the<strong> HBM bandwidth</strong> for B across 8x more compute. </p><p>This is the mechanism by which <strong>Hopper GEMM kernels</strong> approach hardware peak on large problem sizes: the cluster architecture allows the working set of the entire computation to be held in distributed shared memory, with HBM touched only once per element.</p><p>The cluster size is <strong>specified at kernel launch:</strong></p><pre><code><code>cudaLaunchConfig_t config = {};
config.gridDim = grid;
config.blockDim = block;
cudaLaunchAttribute attr;
attr.id = cudaLaunchAttributeClusterDimension;
attr.val.clusterDim.x = 2; // 2 CTAs per cluster
attr.val.clusterDim.y = 1;
attr.val.clusterDim.z = 1;
config.attrs = &amp;attr;
config.numAttrs = 1;
cudaLaunchKernelEx(&amp;config, my_kernel, args...);
</code></code></pre><p>Cluster scheduling is all-or-nothing: the hardware co-schedules every <strong>CTA</strong> of a cluster on the <strong>same GPC</strong>, and a cluster shape that cannot be placed on a single GPC is rejected at launch rather than being split across GPCs. </p><p>On <strong>H100 SXM5</strong>, with 132 SMs organized into 8 GPCs, clusters of up to 8 CTAs fit within a single GPC as long as the per-CTA shared memory and register footprint allows it.</p><div><hr></div><h2>The Persistent Kernel Model</h2><p>On <strong>Ampere</strong>, a typical GEMM kernel is a &#8220;<em>grid kernel</em>&#8221;: each thread block handles one (M_TILE, N_TILE) output tile and exits. The <strong>hardware block scheduler</strong> dispatches new thread blocks as soon as SM capacity becomes available. </p><p>For <strong>large matrices</strong> this is fine: there are enough tiles that the SM scheduler is always busy.</p><p>For smaller matrices, the overhead of launching and retiring thread blocks dominates. Each thread block must load its A and B tiles from scratch, write its C tile to global memory, and terminate. The <strong>shared memory state</strong> is not reused across thread blocks.</p><p>Hopper&#8217;s memory hierarchy and cluster model make a different approach attractive: <strong>persistent kernels</strong>.</p><p>In a persistent kernel, a thread block (<em>or warpgroup</em>) does not terminate after processing one tile. </p><p>Instead, it loops over multiple output tiles, maintaining the A and B tiles in <strong>shared memory</strong> between iterations where the tile is reused, and fetching new tiles via TMA only when necessary. The kernel terminates only after all output tiles in its assigned partition are complete.</p><p>CUTLASS 3.x implements this via the <strong>Tile Scheduler</strong>, a device-side component that manages the assignment of output tiles to persistent CTAs. </p><p>The scheduler atomically increments a <strong>work counter</strong> stored in global memory, assigning the next available (m_tile, n_tile) pair to the requesting CTA. 
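</p><p>The claim-by-atomic-increment pattern can be modeled on the host in a few lines. This is only a sketch of the idea, not CUTLASS&#8217;s scheduler: <code>TileScheduler</code> and <code>run_persistent</code> are hypothetical names, each <code>std::thread</code> stands in for one persistent CTA, and the counter lives in ordinary host memory rather than GPU global memory:</p><pre><code><code>#include &lt;atomic&gt;
#include &lt;thread&gt;
#include &lt;vector&gt;

// Hypothetical host-side model of a tile scheduler with a shared work counter.
struct TileScheduler {
    std::atomic&lt;int&gt; next{0};  // shared work counter
    int tiles_m, tiles_n;

    TileScheduler(int m, int n) : tiles_m(m), tiles_n(n) {}

    // Claim the next tile; returns false once every tile has been assigned.
    bool claim(int&amp; m_tile, int&amp; n_tile) {
        const int idx = next.fetch_add(1, std::memory_order_relaxed);
        if (idx &gt;= tiles_m * tiles_n) return false;
        m_tile = idx / tiles_n;
        n_tile = idx % tiles_n;
        return true;
    }
};

// Run num_ctas workers to completion; returns how often each tile was processed.
std::vector&lt;int&gt; run_persistent(int tiles_m, int tiles_n, int num_ctas) {
    TileScheduler sched(tiles_m, tiles_n);
    std::vector&lt;int&gt; visits(tiles_m * tiles_n, 0);
    std::vector&lt;std::thread&gt; ctas;
    for (int c = 0; c &lt; num_ctas; ++c) {
        ctas.emplace_back([&amp;] {
            int m = 0, n = 0;
            while (sched.claim(m, n)) {
                // Stand-in for the mainloop and epilogue of one output tile.
                // Each tile is claimed exactly once, so this element is not shared.
                ++visits[m * tiles_n + n];
            }
        });
    }
    for (auto&amp; t : ctas) t.join();
    return visits;
}</code></code></pre><p>Calling <code>run_persistent(4, 8, 6)</code> should return 32 entries that are all 1, regardless of how the six workers interleave: the atomic increment is what guarantees that no tile is skipped or processed twice.</p><p>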
When all tiles are assigned, the scheduler signals completion and the CTA exits the work loop.</p><p>The advantages are concrete:</p><p>L2 reuse improves because the same <strong>CTA processes </strong>multiple adjacent tiles, and the A or B tiles they share remain in<strong> L2</strong> (or even in shared memory) between iterations.</p><p>Thread block launch overhead is amortized: the GPU launches one wave of persistent CTAs and they run to completion, rather than launching thousands of transient blocks.</p><p><strong>Irregular problem</strong> <strong>sizes </strong>are handled more gracefully: the final partial tile is processed by whichever CTA happens to claim it, without requiring separate epilogue kernel launches.</p><p>The disadvantage is programming complexity: you are writing a software scheduler inside a CUDA kernel, with all the<strong> attendant concerns</strong> about correctness under concurrent access and load balancing across heterogeneous tile work.</p><p>CUTLASS handles this for you, which is one reason the library exists.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>The CUTLASS 3.x Architecture</h2><p>CUTLASS 3.x is a complete rewrite of CUTLASS 2.x, built on a new abstraction layer called <strong>CuTe</strong> (CUDA Template library). </p><p>Understanding CUTLASS 3.x requires understanding CuTe, because <strong>CUTLASS 3.x </strong>is essentially CuTe plus a set of kernel templates that use it.</p><h3>CuTe: Layouts as First-Class Objects</h3><p><strong>CuTe&#8217;s</strong> central idea is that a <strong>layout</strong> is a function from a logical coordinate space to a physical offset in memory. 
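</p><p>To make &#8220;a layout is a function&#8221; concrete, here is a deliberately minimal host-side sketch. <code>Layout2D</code> is a hypothetical type for illustration; CuTe&#8217;s real <code>Layout</code> is hierarchical and fully templated, and this flat two-dimensional version only shows the coordinate-to-offset mapping:</p><pre><code><code>#include &lt;cstddef&gt;

// Hypothetical minimal 2D layout: maps a logical (row, col) to an element offset.
struct Layout2D {
    std::size_t shape[2];   // extents: {rows, cols}
    std::size_t stride[2];  // element distance per step along each dimension

    constexpr std::size_t operator()(std::size_t row, std::size_t col) const {
        return row * stride[0] + col * stride[1];
    }
};

constexpr Layout2D row_major{{4, 8}, {8, 1}};  // 4x8, rows are 8 elements apart
constexpr Layout2D col_major{{4, 8}, {1, 4}};  // 4x8, columns are 4 elements apart

// The same logical coordinate lands at different physical offsets.
static_assert(row_major(2, 3) == 19, "2*8 + 3*1");
static_assert(col_major(2, 3) == 14, "2*1 + 3*4");</code></code></pre><p>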
A layout encodes both shape (<em>the extents of each dimension</em>) and stride (<em>the distance in elements between consecutive elements along each dimension</em>).</p><p><strong>In CuTe,</strong> a layout is written as <code>Shape:Stride</code>. For example, a 4&#215;8 row-major matrix with elements of size 2 bytes has layout <code>(4,8):(8,1)</code>, meaning: the outer dimension (rows) <strong>has stride 8</strong> (<em>each row is 8 elements apart</em>), and the inner dimension (columns) has stride 1. A column-major version of the same matrix would be <code>(4,8):(1,4)</code>.</p><p>The power of this representation is that it composes. A tiling operation is just a layout composition.<strong> A swizzle</strong> (bit permutation of addresses to avoid bank conflicts) is a layout transformation that permutes the address bits in a specific pattern. </p><p>The entire address computation for a tiled, swizzled, transposed tensor is expressed as a sequence of <strong>layout compositions</strong> that the compiler evaluates at compile time, producing a single address formula.</p><p>This is why CUTLASS 3.x can express <strong>complex access patterns</strong> without any runtime branching in the address computation.</p><pre><code><code>using LayoutA = Layout&lt;Shape&lt;_128, _32&gt;, Stride&lt;_32, _1&gt;&gt;;  // 128x32 row-major
using LayoutA_Swizzled = ComposedLayout&lt;Swizzle&lt;3,3,3&gt;, LayoutA&gt;;</code></code></pre><p>The <code>Swizzle&lt;B,M,S&gt;</code> template encodes a specific<strong> XOR-based address permutation</strong>. <code>B</code> bits are permuted with <code>S</code> bits, offset by <code>M</code> bits. </p><p>For BF16 with 32 banks of 4 bytes each, the correct swizzle eliminates all bank conflicts without any padding. CUTLASS ships with the correct swizzle parameters for every element type and tile dimension it supports.</p><h3>The MMA Atom and Copy Atom</h3><p>In CUTLASS 3.x, a tensor core instruction is an <strong>MMA atom</strong>: a typed object that describes the input/output shapes, thread-to-data mapping, and instruction to emit. The canonical Hopper MMA atom for BF16 is:</p><pre><code><code>using MMA_Atom = MMA_Atom&lt;SM90_64x256x16_F32BF16BF16F32_SS&gt;;</code></code></pre><p>The name encodes: SM90 (Hopper), 64&#215;256&#215;16 tile dimensions, <strong>F32 accumulator</strong>, BF16 A and B inputs, F32 output, SS meaning both A and B come from shared memory.</p><p>A TMA copy is a <strong>copy atom</strong>:</p><pre><code><code>using Copy_Atom_A = Copy_Atom&lt;SM90_TMA_LOAD, bfloat16_t&gt;;</code></code></pre><p>The CUTLASS kernel template composes these atoms with tile dimensions, cluster shapes, and pipeline stages into a<strong> complete kernel:</strong></p><pre><code><code>using CollectiveMainloop = cutlass::gemm::collective::CollectiveMma&lt;
    cutlass::gemm::MainloopSm90TmaGmmaRmemAAccumulator&lt;3&gt;,  // 3-stage pipeline
    Shape&lt;_128, _256, _64&gt;,                                   // tile MxNxK
    bfloat16_t, LayoutA,
    bfloat16_t, LayoutB,
    TiledMma,
    GmemTiledCopyA,
    SmemLayoutA,
    SmemCopyAtomA,
    cute::identity,
    GmemTiledCopyB,
    SmemLayoutB,
    SmemCopyAtomB,
    cute::identity
&gt;;</code></code></pre><p>This is verbose, but every template parameter maps to a concrete hardware mechanism: <code>MainloopSm90TmaGmmaRmemAAccumulator&lt;3&gt;</code> means &#8220;use TMA for loads, use WGMMA for compute, keep the accumulator in registers, with 3 pipeline stages&#8221;.</p><p>The compiler resolves all of this into a kernel where the main loop body is a tight sequence of <code>WGMMA.MMA_ASYNC</code> instructions, interrupted only by TMA-initiated <code>mbarrier</code> waits at stage boundaries. </p><p>The address computation for the <strong>TMA loads</strong> is essentially absent from the device code, having been moved to the descriptor construction on the host.</p><h3>The Producer-Consumer Warpgroup Model</h3><p>CUTLASS 3.x on Hopper adopts a <strong>warpgroup specialization</strong> model within each CTA. A thread block of two or three warpgroups (256 or 384 threads, at 128 threads per warpgroup) is divided at compile time into a <strong>producer warpgroup</strong> and one or more <strong>consumer warpgroups</strong>.</p><p>The producer warpgroup is responsible for issuing TMA loads (one thread per load, the others arrive at barriers). The consumer warpgroups are responsible for issuing <code>WGMMA.MMA_ASYNC</code> instructions and running the epilogue (writing C to global memory via the output TMA store).</p><p>This specialization is explicit:</p><pre><code><code>if (warpgroup_id == 0) {
    // Producer: issue TMA loads into shared memory stages
    collective_mainloop.load(params, smem_tensors, pipeline, pipeline_state, k_tile_count);
} else {
    // Consumer: issue WGMMA instructions, run epilogue
    collective_mainloop.mma(params, smem_tensors, accumulators, pipeline, pipeline_state, k_tile_count);
    collective_epilogue.store(params, accumulators, ...);
}</code></code></pre><p>The producer and consumer warpgroups communicate exclusively through the <code>mbarrier</code>-protected shared memory pipeline. There is no <code>__syncthreads()</code> between them in steady state. The barriers are sufficient.</p><p>This is architecturally important: <code>__syncthreads()</code> is a full CTA barrier. In a producer-consumer model where the producer and consumer have different amounts of work to do per iteration, a<strong> full CTA barrier</strong> would force the faster group to wait for the slower one on every iteration. </p><p>The <code>mbarrier</code> primitive allows <strong>asymmetric synchronization</strong>: the consumer waits only for the data it needs, not for the producer to reach any particular point in its control flow.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>The N-Stage Pipeline on Hopper</h2><p>Part VII described double buffering (2 stages) on Ampere. On Hopper, CUTLASS uses <strong>3 to 8 stages</strong> by default, with the optimal stage count depending on the tile size, problem size, and occupancy target.</p><p>The pipeline state machine on Hopper manages N shared memory stages, N producer mbarriers (one per stage, signaling data arrival), and <strong>N consumer mbarriers</strong> (one per stage, signaling that the consumer is done reading and the stage can be reused).</p><p>The steady-state loop looks like this conceptually:</p><pre><code><code>Stage 0: [TMA load A0, B0] &#8594; [mbar_full[0] signaled] &#8594; [WGMMA on A0,B0] &#8594; [mbar_empty[0] signaled]
Stage 1: [TMA load A1, B1] &#8594; [mbar_full[1] signaled] &#8594; [WGMMA on A1,B1] &#8594; [mbar_empty[1] signaled]
Stage 2: [TMA load A2, B2] &#8594; [mbar_full[2] signaled] &#8594; [WGMMA on A2,B2] &#8594; [mbar_empty[2] signaled]
Stage 0: [TMA load A3, B3] &#8594; ...</code></code></pre><p>The producer issues TMA loads into stage i, and their completion signals <code>mbar_full[i]</code>. The consumer waits on <code>mbar_full[i]</code>, runs WGMMA, signals <code>mbar_empty[i]</code>, and moves to stage (i+1) % N. </p><p>The producer waits on <code>mbar_empty[i]</code> before reusing that stage for the next load. This circular buffer in shared memory, managed by mbarrier pairs, is the fundamental data structure of a <strong>Hopper GEMM kernel</strong>.</p><p>The prologue loads N-1 tiles before the main loop begins (same invariant as Part VII&#8217;s double buffer prologue, just with more stages). The <strong>epilogue</strong> drains the remaining in-flight tiles after the k loop exits.</p><p>With 3 stages on an H100 with 228 KB of <strong>shared memory per SM</strong> (up from the A100&#8217;s 164 KB), a 128&#215;256 output tile with a K-depth of 64 in BF16 consumes approximately:</p><ul><li><p><em>A tile: 128 &#215; 64 &#215; 2 bytes = 16 KB</em></p></li><li><p><em>B tile: 64 &#215; 256 &#215; 2 bytes = 32 KB</em></p></li><li><p><em>Per stage: 48 KB</em></p></li><li><p><em>3 stages: 144 KB</em></p></li><li><p><em>Remaining for mbarriers and epilogue buffers: 84 KB</em></p></li></ul><p>At 3 stages and a 128&#215;256 tile, one <strong>CTA per SM</strong> is feasible. Two CTAs would require 288 KB, which exceeds the 228 KB shared memory limit. </p><p>Occupancy is therefore 1 CTA per SM, which is fine on Hopper because the single CTA fills the SM with <strong>WGMMA instructions</strong> and the TMA unit is fully occupied.</p><p>This is a fundamentally different occupancy philosophy from Ampere. 
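</p><p>The shared memory budget behind that one-CTA-per-SM figure can be reproduced with a few lines of host-side arithmetic. This is a sketch under simplifying assumptions: only the mainloop stages are counted (no mbarriers, no epilogue buffers), and <code>stage_bytes</code> and <code>ctas_per_sm</code> are hypothetical helper names:</p><pre><code><code>#include &lt;cstddef&gt;

// Shared memory needed by one pipeline stage: an A tile (tile_m x tile_k)
// plus a B tile (tile_k x tile_n).
constexpr std::size_t stage_bytes(std::size_t tile_m, std::size_t tile_n,
                                  std::size_t tile_k, std::size_t elem_bytes) {
    return (tile_m * tile_k + tile_k * tile_n) * elem_bytes;
}

// How many CTAs fit in shared memory, counting only the mainloop stages.
constexpr std::size_t ctas_per_sm(std::size_t smem_bytes, std::size_t stages,
                                  std::size_t bytes_per_stage) {
    return smem_bytes / (stages * bytes_per_stage);
}

constexpr std::size_t KB = 1024;
constexpr std::size_t per_stage = stage_bytes(128, 256, 64, 2);  // BF16

static_assert(per_stage == 48 * KB, "16 KB for A + 32 KB for B");
static_assert(3 * per_stage == 144 * KB, "three stages");
static_assert(ctas_per_sm(228 * KB, 3, per_stage) == 1, "one CTA per SM");</code></code></pre><p>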
On Ampere, you often needed <strong>2-4 CTAs </strong>per SM to hide memory latency through warp-switching.</p><p>On Hopper, one CTA with TMA and<strong> WGMMA </strong>already achieves near-peak throughput on large tiles, because the hardware units that matter (TMA, tensor cores) are all fully occupied.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>What the Profiler Shows You on Hopper</h2><p>The Nsight Compute metrics shift dramatically compared to Ampere.</p><p><code>smsp__warp_issue_stalled_long_scoreboard</code> approaches zero. Not because the memory is fast, but because <strong>TMA loads</strong> do not involve the scoreboard at all. The SMSP is not waiting for memory; it is not the unit that issued the memory request.</p><p><code>smsp__warp_issue_stalled_mio_throttle</code> is also low. The single TMA instruction per tile barely loads the MIO unit.</p><p><code>smsp__warp_issue_stalled_wgmma_global_wait</code> is the new dominant stall: this is the SMSP waiting for a <code>WGMMA.WAIT_GROUP</code> to complete so it can read the accumulator registers. </p><p>This stall is unavoidable for kernels that read their accumulators between WGMMA groups (e.g., for <strong>split-K partial reductions</strong>). For kernels with long K dimensions, the WGMMA pipeline fills up and this stall disappears.</p><p><code>sm__pipe_tensor_op_hmma_cycles_active</code> should be 80-95% for a well-tuned Hopper GEMM. 
Anything below 70% suggests either a pipeline depth problem (too few stages) or a cluster scheduling problem (<em>the GPC is not scheduling the cluster CTAs together</em>).</p><p><code>l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld</code> counts shared memory read operations. For a kernel where <strong>both A and B</strong> are read from shared memory by WGMMA (SS variant), this metric reflects tensor core throughput, not programmer-visible loads. The <strong>tensor cores</strong> are reading shared memory directly, and this shows up in the LSU metrics.</p><p>The TMA throughput metrics are in the <code>tma</code> namespace: <code>tma__read_bytes</code> and <code>tma__read_transactions</code>. A kernel that is achieving peak TMA throughput will show TMA bandwidth close to the<strong> theoretical HBM bandwidth</strong>, because TMA is the only thing accessing HBM.</p><p>The <strong>key diagnostic insight </strong>on Hopper: if your WGMMA utilization is high and your TMA bandwidth is high, the kernel is good. The <strong>two hardware units </strong>are the bottleneck by design. Everything else should be idle or near-idle.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>The Roofline on Hopper, revisited</h2><p>Part VII introduced the roofline model and noted that the useful diagnosis is hierarchical: not &#8220;<em>memory-bound</em>&#8221; but &#8220;<em>memory-bound at the L2 level, achieving 60% of L2 peak</em>&#8221;. 
On Hopper the hierarchy has the same levels (L1, L2, HBM) but new slopes.</p><p><strong>H100 SXM5 roofline parameters:</strong></p><ul><li><p><em>HBM3 peak bandwidth: 3.35 TB/s</em></p></li><li><p><em>L2 peak bandwidth: approximately 12 TB/s (across 50 MB of L2, two slices)</em></p></li><li><p><em>Shared memory peak bandwidth: approximately 33 TB/s aggregate (SM-local)</em></p></li><li><p><em>Tensor core peak (dense BF16): 989 TFLOP/s</em></p></li></ul><p><strong>Ridge points:</strong></p><ul><li><p><em>HBM ridge: 989 / 3.35 &#8776; 295 FLOP/byte</em></p></li><li><p><em>L2 ridge: 989 / 12 &#8776; 82 FLOP/byte</em></p></li><li><p><em>Shared memory ridge: 989 / 33 &#8776; 30 FLOP/byte</em></p></li></ul><p>For a GEMM with arithmetic intensity of <strong>295 FLOP/byte</strong> or above, the kernel should be compute-bound assuming the memory hierarchy is properly utilized. Below <strong>295 FLOP/byte</strong>, it is HBM-bandwidth-bound. </p><p>Below 82, even a perfect <strong>L2 hit rate</strong> cannot save you. Below 30, the tensor core throughput is limited by shared memory bandwidth, which means either bank conflicts or tile sizes that do not saturate the <strong>WGMMA datapath</strong>.</p><p>The key new insight on Hopper: TMA changes the shape of the memory hierarchy&#8217;s contribution. The <strong>SMSP instruction bandwidth</strong>, which was a secondary bottleneck on Ampere (and a primary bottleneck for small tiles), is effectively removed from the HBM bandwidth calculation. </p><p>The raw bandwidth into shared memory is now limited by the TMA units rather than by SMSP instruction issue. </p><p>In aggregate across all SMs, the TMA units can keep up with the 3.35 TB/s that HBM delivers, so for kernels that are <strong>purely bandwidth-limited</strong> (not compute-bound), TMA is not the constraint; HBM is. 
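</p><p>The ridge-point arithmetic is easy to script. The sketch below uses hypothetical helper names (<code>ridge_point</code>, <code>gemm_intensity</code>, <code>compute_bound</code>) and an idealized traffic model in which each of A, B and C crosses the memory level exactly once; real kernels move more data than that, so treat the intensities as upper bounds:</p><pre><code><code>// Ridge point in FLOP/byte: peak compute (TFLOP/s) over bandwidth (TB/s).
constexpr double ridge_point(double peak_tflops, double bandwidth_tbps) {
    return peak_tflops / bandwidth_tbps;
}

// Ideal arithmetic intensity of C = A * B with A: MxK, B: KxN, C: MxN,
// assuming every matrix crosses the memory level exactly once.
constexpr double gemm_intensity(double m, double n, double k, double elem_bytes) {
    return (2.0 * m * n * k) / ((m * k + k * n + m * n) * elem_bytes);
}

// At or above the ridge the kernel can be compute-bound; below it, bandwidth-bound.
constexpr bool compute_bound(double intensity, double ridge) {
    return intensity &gt;= ridge;
}

// Square BF16 GEMM: intensity grows as n / 3.
static_assert(gemm_intensity(4096, 4096, 4096, 2) &gt; 1365.0, "about 1365 FLOP/byte");
// Skinny GEMM (M = 16): traffic for B dominates and the intensity collapses.
static_assert(gemm_intensity(16, 4096, 4096, 2) &lt; 16.0, "about 15.9 FLOP/byte");</code></code></pre><p>Feeding the peak and bandwidth figures listed above into <code>ridge_point</code> reproduces the three ridge values. Comparing them with <code>gemm_intensity</code> shows why a large square GEMM sits comfortably above the HBM ridge while a small-M GEMM of the same width falls far below it.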
</p><p>For compute-bound kernels with large tiles, TMA&#8217;s instruction offloading is what enables the SMSP to run WGMMA at full throughput.</p><div><hr></div><h2>A Brief Look at Blackwell</h2><p>Blackwell (SM100, B100/B200) was announced in March 2024 and began shipping to hyperscalers in late 2024. The architectural trajectory established by Hopper continues and accelerates.</p><p>The Blackwell tensor core introduces a <strong>5th generation MMA</strong> with FP4 support (MXFP4 and NVFP4 formats), enabling 20 PFLOP/s peak at the full B200 system level (dual-die). The FP8 dense throughput is approximately 9 PFLOP/s per chip.</p><p>TMA on Blackwell extends the <strong>im2col</strong> modes that Hopper introduced (relevant for convolutions) and adds <strong>transposed stores</strong>, reducing the need for separate transpose kernels.</p><p>The cluster size limit increases to 16 CTAs (from 8 on Hopper), further amortizing B tile loads across more compute.</p><p>A new <strong>fifth-generation NVLink</strong> provides 1.8 TB/s bidirectional bandwidth per GPU in NVLink-connected systems (NVL72 rack), enabling multi-GPU kernels where the &#8220;global memory&#8221; seen by a TMA operation is distributed across 72 GPUs. 
This is the level at which the distinction between a single-GPU kernel and a distributed compute graph begins to blur.</p><p>CUTLASS 3.x supports Blackwell through new <strong>SM100 collective templates.</strong> The programming model is the same; the numbers are larger.</p><div><hr></div><h2>Conclusion</h2><p>The trajectory from <strong>Volta</strong> through Ampere to <strong>Hopper</strong> is a coherent story: every generation pushes more work off the SMSP and onto dedicated hardware. </p><p><strong>Volta</strong> gave you tensor cores, so the SMSP stopped doing the arithmetic. Ampere gave you <code>cp.async</code>, so the SMSP stopped waiting for loads. Hopper gave you TMA, so the SMSP stopped issuing per-element loads at all.</p><p>The SMSP on a <strong>well-tuned Hopper GEMM kernel</strong> is a machine that does one thing: issue <code>WGMMA.MMA_ASYNC</code>. Everything else has been delegated.</p><p>This is not an accident. It is the logical endpoint of the observation that matrix multiply is the kernel that matters most for <strong>modern ML workloads,</strong> and the most efficient hardware for matrix multiply is hardware where the compute units are never idle. 
</p><p>Every <strong>architectural innovation</strong> from 2017 onwards has been an attack on a different reason why the compute units were idle: arithmetic latency (tensor cores), memory latency (cp.async), instruction bandwidth (TMA), <strong>inter-SM bandwidth</strong> (clusters, NVLink).</p><p>The mbarrier, the <strong>tensor descriptor,</strong> warpgroup specialization, the producer-consumer pipeline, the tile scheduler: these are not ornamental complexity. </p><p>They are the mechanisms by which a 2024 GPU running a <strong>2024 kernel achieves 80&#8211;90% of theoretical peak</strong> on matrix multiply, a number that would have seemed implausible to practitioners writing hand-tuned BLAS routines a decade ago.</p><p>Part IX will step back from the single-GPU picture and look at multi-GPU parallelism: tensor parallelism, <strong>pipeline parallelism</strong>, NCCL, and the question of how <strong>NVLink bandwidth</strong> interacts with the per-GPU compute performance we have spent eight parts building up. 
</p><p>The tools change; the principle does not: find the bottleneck, route around it, measure again.</p>]]></content:encoded></item><item><title><![CDATA[Mastering CUDA and High-Performance Computing, Part VII]]></title><description><![CDATA[A Deep Dive from Compiler Internals to High-Performance Parallel Computing]]></description><link>https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-ea1</link><guid isPermaLink="false">https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-ea1</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Fri, 27 Mar 2026 17:53:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5Sno!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8415ac6-79d2-4b69-9c66-afb888d7a6ba_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5Sno!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8415ac6-79d2-4b69-9c66-afb888d7a6ba_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5Sno!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8415ac6-79d2-4b69-9c66-afb888d7a6ba_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!5Sno!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8415ac6-79d2-4b69-9c66-afb888d7a6ba_1024x1536.png 848w, 
https://substackcdn.com/image/fetch/$s_!5Sno!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8415ac6-79d2-4b69-9c66-afb888d7a6ba_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!5Sno!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8415ac6-79d2-4b69-9c66-afb888d7a6ba_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5Sno!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8415ac6-79d2-4b69-9c66-afb888d7a6ba_1024x1536.png" width="1024" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f8415ac6-79d2-4b69-9c66-afb888d7a6ba_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2665470,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/191312572?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8415ac6-79d2-4b69-9c66-afb888d7a6ba_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5Sno!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8415ac6-79d2-4b69-9c66-afb888d7a6ba_1024x1536.png 424w, 
https://substackcdn.com/image/fetch/$s_!5Sno!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8415ac6-79d2-4b69-9c66-afb888d7a6ba_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!5Sno!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8415ac6-79d2-4b69-9c66-afb888d7a6ba_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!5Sno!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8415ac6-79d2-4b69-9c66-afb888d7a6ba_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Where Part VI Left Us</h2><p>Part VI ended with a passage that deserves to be unpacked:</p><blockquote><p><em>cp.async instructions do not set the long scoreboard. </em></p><p><em>The register file is not involved, so no register&#8217;s bit is marked pending. </em></p><p><em>The SMSP issues the cp.async, the copy engine takes it, and the SMSP is immediately free to issue the next instruction for that warp.</em></p></blockquote><p>This is <strong>not a minor optimization note</strong>. </p><p>It is a description of a fundamentally different <strong>execution model</strong>: one that requires you to abandon the mental model of &#8220;<em>instruction issues, result arrives, next instruction proceeds</em>&#8221;, and replace it with something more like a production pipeline in a factory: <strong>stages overlap</strong>, buffers exist between them, and throughput is determined by the slowest stage, not the sum of all stage latencies.</p><p>Before we can make <code>cp.async</code> do useful work, we need an accurate model of what it is hiding: the latency of the <strong>memory hierarchy.</strong></p><div><hr></div><h2>The Memory Hierarchy of the A100</h2><p>The <strong>A100 SXM4</strong> has six levels of memory that matter to kernel programmers. 
They are not equally documented, and the numbers in marketing materials are frequently not the numbers you measure in production code.</p><h3>Registers</h3><p>Each SM on Ampere has a 256 KB register file, shared across the four SMSPs: <strong>64 KB per SMSP</strong>, with a 256-bit read port per cycle. </p><p>Register file access latency is effectively 0 cycles in the bypass case; for non-bypassed reads the cost is absorbed into the <strong>4-cycle FMA pipeline.</strong> Registers are not a latency constraint. They are a capacity and bandwidth constraint.</p><p>The capacity limit is the one that matters: each thread can use at most 255 registers. </p><p>Pressure above this causes the compiler to spill values to local memory, a per-thread private region mapped to <strong>L1/L2/DRAM</strong>. </p><p>Spills are indistinguishable from any other global memory access at the hardware level: they go through the <strong>MIO unit</strong>, set the long scoreboard, and, when they miss in L1 and L2, wait 400+ cycles for DRAM. Every spilled register costs two MIO operations: a store at the spill and a load at the reload.</p><h3>Shared Memory / L1 Cache</h3><p>Ampere&#8217;s per-SM L1 is a <strong>192 KB pool</strong> partitioned between shared memory and the hardware L1 data cache. </p><p>The split is configurable <em>(0/192, 8/184, 16/176, 32/160, 64/128, 100/92, 132/60, 164/28 (shared/cache, in KB))</em> via <code>cudaFuncSetAttribute</code> with <code>cudaFuncAttributePreferredSharedMemoryCarveout</code>.</p><p>Shared memory has 32 banks, each 4 bytes wide. </p><p><strong>Bank index</strong> for a byte address:</p><pre><code><code>bank = (address &gt;&gt; 2) &amp; 31</code></code></pre><p><strong>Access patterns</strong> where multiple threads in a warp access different addresses in the same bank serialize. </p><p>A 4-way bank conflict takes four passes through the bank, so the access delivers <strong>a quarter of the conflict-free throughput</strong>; latency grows more slowly. The conflict-free latency is approximately 23 cycles; a 4-way conflict extends this to ~35 cycles; an 8-way conflict to ~51 cycles, roughly 4 extra cycles per additional conflicting access. 
The penalty scales linearly.</p><p>The <strong>broadcast exception</strong>: if all threads in a warp access the <em>exact same address</em> within a bank, the hardware services this as a single read and broadcasts the result. </p><p><strong>Thirty-two threads</strong> accessing thirty-two <em>different</em> addresses that all map to the same bank is not a broadcast. It is a 32-way serialization.</p><h3>L2 Cache</h3><p>The A100 has 40 MB of L2 cache, split into two 20 MB slices. L2 hit latency: approximately <strong>180&#8211;200 cycles</strong>, higher than most documentation implies. </p><p>Accesses to the local slice are <strong>~160&#8211;180 cycles</strong>; accesses to the remote slice (requiring crossbar traversal) are ~200&#8211;230 cycles.</p><p>L2 bandwidth is approximately 4 TB/s aggregate. The ratio of <strong>L2 bandwidth</strong> to <strong>HBM bandwidth</strong> is approximately 2:1, and an L2 hit costs well under half the latency of an HBM access. Fitting a working set in L2 is qualitatively different from spilling it to HBM.</p><h3>HBM2e</h3><p>The A100 SXM4 has <strong>five active HBM2e stacks</strong> providing a peak theoretical bandwidth of 2 TB/s. In practice: a kernel with access pattern regularity sufficient to saturate all channels achieves 1.6&#8211;1.9 TB/s. </p><p>Irregular access patterns with row buffer conflicts: <strong>800 GB/s&#8211;1.2 TB/s.</strong> Random byte-granularity reads: tens of GB/s, due to cache line waste.</p><p><strong>HBM2e latency</strong>, measured with L1 and L2 bypassed: approximately <strong>450&#8211;600 cycles at 1410 MHz.</strong> Row buffer hits land around 300&#8211;350 cycles; misses around 550&#8211;650 cycles.</p><p>The consequence at 1410 MHz: <strong>500 cycles &#215; 0.71 ns/cycle &#8776; 355 nanoseconds</strong> of stall per warp. In that window, 500 instruction issue slots per SMSP (2,000 across the four schedulers of an SM) go dark. </p><p>If every resident warp has issued an <strong>HBM load</strong> and is waiting, you have a 500-cycle stall with no eligible warp to rescue you. 
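</p><p>The stall arithmetic above is simple enough to check directly. A minimal host-side C++ sketch, assuming one issue slot per SMSP per cycle and four SMSPs per SM; the helper names are hypothetical:</p><pre><code><code>#include &lt;cstdio&gt;

// Wall-clock length of a stall, from a cycle count and the SM clock in MHz.
double stall_ns(int cycles, double clock_mhz) {
    return cycles * 1000.0 / clock_mhz;
}

// Issue slots lost while no warp is eligible, assuming each scheduler
// (SMSP) could otherwise issue one instruction per cycle.
long lost_issue_slots(int cycles, int schedulers_per_sm) {
    return static_cast&lt;long&gt;(cycles) * schedulers_per_sm;
}

int main() {
    const int    cycles      = 500;     // representative HBM latency from the text
    const double clock_mhz   = 1410.0;  // A100 SM clock from the text
    const int    smsp_per_sm = 4;       // assumption: four schedulers per SM

    std::printf("stall: %.0f ns\n", stall_ns(cycles, clock_mhz));
    std::printf("lost issue slots per SM: %ld\n", lost_issue_slots(cycles, smsp_per_sm));
    return 0;
}</code></code></pre><p>For 500 cycles at 1410 MHz this gives about 355 ns and 2,000 lost issue slots per SM; with every resident warp waiting on HBM, nothing fills them.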
</p><p>This is the <strong>memory wall</strong> in concrete form. The solution is not faster memory: it is to restructure data movement so that <strong>HBM latency</strong> is overlapped with computation.</p><div><hr></div><h2>The cp.async Instruction</h2><p><code>cp.async</code> was introduced in <strong>Ampere (sm_80)</strong>. It performs a direct DMA-like transfer from global memory to shared memory, bypassing the register file entirely:</p><pre><code><code>cp.async.ca.shared.global [dst], [src], size;
cp.async.cg.shared.global [dst], [src], size;   // bypass L1; size must be 16</code></code></pre><p>The <code>size</code> parameter is 4, 8, or 16 bytes for the <code>.ca</code> form; the <code>.cg</code> form accepts only 16. The 16-byte variant is the most important: each thread moves 128 bits in one instruction, the same width as a <strong>vectorized LDG.128,</strong> achieving maximum memory interface utilization. Both <code>dst</code> and <code>src</code> must be aligned to <code>size</code>.</p><h3>What &#8220;bypassing the register file&#8221; actually means</h3><p>The conventional load path:</p><pre><code><code>LDG.128 R4, [R2]         ; &#8594; long scoreboard set for R4,R5,R6,R7
                         ; &#8594; warp stalls on any read of R4-R7
                         ; &#8594; 450-600 cycles later, HBM returns data
STS.128 [smem_ptr], R4   ; store registers &#8594; shared memory</code></code></pre><p>This requires <strong>4 registers</strong> in transit per thread. The load sets four long scoreboard bits. The warp is ineligible for any instruction reading <strong>R4&#8211;R7</strong> until the HBM transaction completes.</p><p>The <code>cp.async</code> path:</p><pre><code><code>CP.ASYNC.CA.SHARED.GLOBAL [smem_dst], [R2], 0x10
; &#8594; no scoreboard bits set (no destination register)
; &#8594; warp immediately eligible to issue next instruction
; &#8594; data arrives in shared memory asynchronously</code></code></pre><p>A dedicated <strong>Ampere Asynchronous Copy Engine</strong> receives the request via the MIO unit, takes ownership of the transaction, and performs the <strong>HBM load</strong> and shared memory write independently of the <strong>SMSP</strong>. The MIO unit is freed immediately after handoff.</p><h3>The commit/wait mechanism</h3><p><strong>Commit</strong> (<code>CP.ASYNC.COMMIT_GROUP</code>): bundles all of the thread&#8217;s preceding uncommitted cp.async instructions into one commit group. It is bookkeeping only; it does not wait for anything.</p><p><strong>Wait</strong> (<code>CP.ASYNC.WAIT_GROUP N</code>): stalls the thread until at most N of its most recent commit groups remain pending. <code>N=0</code> is complete synchronization. <code>N=1</code> allows one in-flight group to remain outstanding while you compute on the previous one.</p><pre><code><code>// requires #include &lt;cuda/pipeline&gt;
auto pipe = cuda::make_pipeline();   // thread-scope pipeline

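// Stage 0: first batch (smem is assumed to be declared as float4 smem[2][BATCH_SIZE])
pipe.producer_acquire();
for (int i = 0; i &lt; BATCH_SIZE; i++)
    cuda::memcpy_async(&amp;smem[0][i], &amp;gmem[base + i], sizeof(float4), pipe);
pipe.producer_commit();

// Stage 1: second batch, issued while stage 0 is still in flight
pipe.producer_acquire();
for (int i = 0; i &lt; BATCH_SIZE; i++)
    cuda::memcpy_async(&amp;smem[1][i], &amp;gmem[base + BATCH_SIZE + i], sizeof(float4), pipe);
pipe.producer_commit();

pipe.consumer_wait();   // CP.ASYNC.WAIT_GROUP 1: stage 0 has landed, stage 1 may still be in flight
__syncthreads();        // mandatory: propagates visibility to all threads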

compute(smem[0]);
pipe.consumer_release();   // stage 0 consumed; its buffer may be refilled</code></code></pre><p>The <code>__syncthreads()</code> after <code>consumer_wait</code> is mandatory. <code>consumer_wait</code> only guarantees that the copies issued by <em>the calling thread</em> have landed in shared memory. </p><p>Copies issued by other threads in the block carry no such guarantee, and their writes may not be visible until <code>__syncthreads()</code> propagates them through the <strong>SM&#8217;s coherence domain</strong>. </p><p>Omitting it is a <strong>race condition</strong>: one that produces correct results most of the time and incorrect results unpredictably under heavy memory pressure.</p><div><hr></div><h2>The Double Buffer Pattern</h2><p>A standard tiled <strong>GEMM loop</strong> is fully sequential: load tile, sync, compute, sync, repeat. The timeline is a flat chain of dependencies, so every iteration costs T_load + T_compute. For smaller problems or thinner tiles where T_load / T_compute &gt; 1, the kernel is memory-bound.</p><p>The <strong>double buffer</strong> pattern breaks that chain:</p><pre><code><code>Iter k:   |-- cp.async A[k] --|-- cp.async B[k] --|-- commit --|
                                                                |-- wait(k-1) --|-- compute(k-1) --|
Iter k+1: |-- cp.async A[k+1] --|-- cp.async B[k+1] --|-- commit --|
                                                                    |-- wait(k) --|-- compute(k) --|</code></code></pre><p>Loads for iteration k+1 overlap with computation of iteration k. Memory latency is hidden as long as <code>T_load(k+1) &lt; T_compute(k)</code>. The pipeline then runs at the compute rate with zero memory stall.</p><p>This requires <strong>two ping-pong buffers</strong> in shared memory, doubling the shared memory requirement. </p><p><strong>Doubling shared memory</strong> per thread block can halve the maximum number of resident thread blocks per SM when shared memory is the limiting resource, reducing occupancy. The trade-off is explicit and computable.</p><p><strong>Diagnostic signal</strong>: if <code>smsp__warp_issue_stalled_long_scoreboard.avg.pct_of_peak_sustained_active</code> exceeds 20%, memory latency is not being hidden. The first intervention is higher occupancy. </p><p>The second, when <strong>occupancy</strong> is already near maximum, is <code>cp.async</code> pipelining, which removes the long scoreboard from the equation entirely.</p><div><hr></div><h2>The Full Kernel Pattern</h2><pre><code><code>#include &lt;cuda_bf16.h&gt;
#include &lt;cuda/pipeline&gt;

constexpr int TILE_M = 128, TILE_N = 128, TILE_K = 32;
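constexpr int NUM_STAGES = 2;
constexpr int CHUNK = 8;   // BF16 elements per 16-byte cp.async

// Sketch only. Launch with dim3 block(32, 32) and dim3 grid(N / TILE_N, M / TILE_M).
// Assumes M, N, K are multiples of the tile sizes and A, B are 16-byte aligned.
__global__ void gemm_async_kernel(
    const __nv_bfloat16* __restrict__ A,
    const __nv_bfloat16* __restrict__ B,
    float* __restrict__ C,
    int M, int N, int K
) {
    __shared__ __align__(16) __nv_bfloat16 smem_A[NUM_STAGES][TILE_M][TILE_K];
    __shared__ __align__(16) __nv_bfloat16 smem_B[NUM_STAGES][TILE_K][TILE_N];
    float acc[4][4] = {};

    auto pipe = cuda::make_pipeline();
    const int k_tiles = K / TILE_K;
    const int tid = threadIdx.y * blockDim.x + threadIdx.x;   // 0..1023

    // One 16-byte chunk per thread per tile: threads 0..511 fill the A tile,
    // threads 512..1023 fill the B tile. Each call commits one group.
    auto load_tile = [&amp;](int stage, int kt) {
        pipe.producer_acquire();
        if (tid &lt; 512) {
            const int row = tid / 4, col = (tid % 4) * CHUNK;                     // 4 chunks per A row
            cuda::memcpy_async(&amp;smem_A[stage][row][col],
                               &amp;A[(size_t)(blockIdx.y * TILE_M + row) * K + kt * TILE_K + col],
                               cuda::aligned_size_t&lt;16&gt;(16), pipe);
        } else {
            const int row = (tid - 512) / 16, col = ((tid - 512) % 16) * CHUNK;   // 16 chunks per B row
            cuda::memcpy_async(&amp;smem_B[stage][row][col],
                               &amp;B[(size_t)(kt * TILE_K + row) * N + blockIdx.x * TILE_N + col],
                               cuda::aligned_size_t&lt;16&gt;(16), pipe);
        }
        pipe.producer_commit();
    };

    // Each thread accumulates a 4x4 patch of the 128x128 output tile.
    auto compute_tile = [&amp;](int stage) {
        for (int ki = 0; ki &lt; TILE_K; ki++)
            for (int i = 0; i &lt; 4; i++)
                for (int j = 0; j &lt; 4; j++)
                    acc[i][j] += __bfloat162float(smem_A[stage][threadIdx.y*4+i][ki])
                               * __bfloat162float(smem_B[stage][ki][threadIdx.x*4+j]);
    };

    // PROLOGUE: issue tile 0 before the main loop
    if (k_tiles &gt; 0) load_tile(0, 0);

    // MAIN LOOP: issue tile k, then compute on tile k-1
    for (int k = 1; k &lt; k_tiles; k++) {
        const int sw = k % NUM_STAGES, sr = (k - 1) % NUM_STAGES;

        load_tile(sw, k);

        pipe.consumer_wait();   // CP.ASYNC.WAIT_GROUP 1
        __syncthreads();
        compute_tile(sr);
        pipe.consumer_release();
        __syncthreads();        // all reads of stage sr finish before it is refilled
    }

    // EPILOGUE: drain the pipeline, compute the last tile, write the result
    if (k_tiles &gt; 0) {
        pipe.consumer_wait();   // CP.ASYNC.WAIT_GROUP 0
        __syncthreads();
        compute_tile((k_tiles - 1) % NUM_STAGES);
        pipe.consumer_release();
    }
    for (int i = 0; i &lt; 4; i++)
        for (int j = 0; j &lt; 4; j++)
            C[(size_t)(blockIdx.y * TILE_M + threadIdx.y*4 + i) * N
              + blockIdx.x * TILE_N + threadIdx.x*4 + j] = acc[i][j];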
}</code></code></pre><p>Three things to internalize about this structure:</p><p><strong>The prologue is not optional.</strong> Without issuing tile 0 before the loop, the first <code>consumer_wait</code> has no committed group to wait for, and the compute that follows reads shared memory that nothing was ever copied into. The result is undefined. The prologue establishes the &#8220;<em>one stage ahead</em>&#8221; invariant that the loop depends on.</p><p><strong>Both synchronization primitives are required.</strong> <code>consumer_wait</code> ensures the DMA engine has written the data to shared memory <em>for the calling thread&#8217;s own copies</em>. <code>__syncthreads()</code> ensures all threads in the block have reached this point before any thread reads. </p><p>They solve different problems. Neither substitutes for the other.</p><p><strong>Stage read and stage write are never equal.</strong> The modular arithmetic guarantees <code>sw &#8800; sr</code> for <code>NUM_STAGES = 2</code>. The DMA engine writes to one buffer while threads read from the other. </p><p>With <code>NUM_STAGES &#8805; 3</code> you deepen the pipeline: more latency hidden, more shared memory consumed.</p><div><hr></div><h2>N-Stage Generalization</h2><p>With <strong>N stages</strong>, you issue <strong>N&#8722;1 tiles&#8217;</strong> worth of cp.async before the first computation begins (one tile in the double-buffered case above). The latency is hidden when <code>T_compute(tile) &gt; T_HBM_load / (N &#8722; 1)</code>, which for N = 2 reduces to the double-buffer condition.</p><p>CUTLASS implements up to <strong>5-stage pipelines</strong> for its Ampere GEMM kernels, with stage count as a compile-time template parameter swept by the profiler at tuning time. The shared memory cost scales linearly with stage count. </p><p>At some crossover point the <strong>shared memory requirement</strong> forces an occupancy reduction that exceeds the pipelining benefit. 
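</p><p>The shared-memory side of that trade-off can be tabulated before running anything. A minimal host-side C++ sketch, assuming BF16 tiles of the sizes used in this article and a 164 KB shared-memory ceiling per SM (an assumption for the A100; other limits such as registers and thread count are ignored); the function names are hypothetical:</p><pre><code><code>#include &lt;cstdio&gt;

// Shared memory per block for an N-stage pipeline holding one A tile
// (tile_m x tile_k) and one B tile (tile_k x tile_n) per stage, in BF16.
int smem_bytes(int tile_m, int tile_n, int tile_k, int stages) {
    const int bytes_per_elem = 2;  // BF16
    return stages * (tile_m * tile_k + tile_k * tile_n) * bytes_per_elem;
}

// Upper bound on resident blocks per SM from shared memory alone.
// Registers, thread count, and block slots can impose tighter limits.
int max_blocks_by_smem(int smem_per_sm, int smem_per_block) {
    return smem_per_sm / smem_per_block;
}

int main() {
    const int smem_per_sm = 164 * 1024;  // assumed per-SM shared-memory ceiling
    for (int stages = 2; stages &lt;= 5; ++stages) {
        const int bytes = smem_bytes(128, 128, 32, stages);
        std::printf("stages=%d  smem/block=%d KB  blocks/SM&lt;=%d\n",
                    stages, bytes / 1024, max_blocks_by_smem(smem_per_sm, bytes));
    }
    return 0;
}</code></code></pre><p>For 128&#215;128&#215;32 tiles the shared-memory bound on resident blocks falls from 5 at two stages to 3 at three and 2 at four or five. Where that drop costs more than the extra stage hides is the crossover.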
</p><p>This crossover depends on the specific kernel and problem size, which is why <strong>CUTLASS</strong> exposes the parameter rather than hardcoding it.</p><h3>What the Profiler Shows You</h3><p><strong>Before pipelining (conventional LDG loads):</strong></p><ul><li><p><code>smsp__warp_issue_stalled_long_scoreboard</code>: 40&#8211;70%, dominant stall</p></li><li><p><code>smsp__pipe_fma_cycles_active</code>: 30&#8211;60%, computation starved</p></li></ul><p><strong>After pipelining (cp.async, double buffer):</strong></p><ul><li><p><code>smsp__warp_issue_stalled_long_scoreboard</code>: &lt;5%, cp.async sets no scoreboard bits</p></li><li><p><code>smsp__pipe_fma_cycles_active</code>: 70&#8211;90% for a well-tuned kernel</p></li><li><p>Watch for <code>smsp__warp_issue_stalled_mio_throttle</code>: if you issue cp.async faster than the MIO unit can service them (~1 per 4 cycles per SMSP for 128-bit transfers), this stall replaces the scoreboard stall. </p><p>The fix is larger tiles or accepting the throttle if MIO throughput still exceeds compute throughput.</p></li></ul><div><hr></div><h2>Bank Conflicts</h2><p>The 32-bank model is documented. The practical implications for <strong>matrix access patterns</strong> are not.</p><p>In a <strong>tiled GEMM</strong>, tile A is loaded into shared memory in row-major layout, then read column-wise during the multiply. </p><p>For <strong>TILE_K</strong> = 32 and BF16 elements (2 bytes each), element <code>[j][i]</code> sits at byte offset <code>j &#215; 64 + i &#215; 2</code>. Bank index: <code>(j &#215; 16 + i/2) &amp; 31</code>.</p><p>For a <strong>warp reading column</strong> <code>i</code> (<code>i</code> fixed, <code>j</code> running 0..31), the row term <code>j &#215; 16</code> takes only two values modulo 32 (0 and 16), so the 32 threads land on just two banks, sixteen threads on each. This is a 16-way bank conflict on every column read.</p><p>The fix is padding:</p><pre><code><code>__shared__ __nv_bfloat16 smem_A[TILE_M][TILE_K + 2];   // +2 BF16 = +4 bytes per row</code></code></pre><p>With the pad, element <code>[j][i]</code> is at byte offset <code>j &#215; 68 + i &#215; 2</code>. Bank index: <code>(j &#215; 17 + i/2) &amp; 31</code>. Since <code>gcd(17, 32) = 1</code>, the bank indices as <code>j</code> runs 0..31 form a complete permutation of 0..31. <strong>Zero conflicts.</strong> </p><p>The shared memory overhead is <code>TILE_M &#215; 4</code> bytes per buffer: 512 bytes for TILE_M = 128, trivial against the 8 KB tile. One caveat: a 68-byte row stride is no longer a multiple of 16 bytes, so 16-byte <code>cp.async</code> writes into this layout would be misaligned; the +2 pad pairs with 4-byte copies, or with a swizzle in place of padding.</p><p><strong>CUTLASS&#8217;s Swizzle technique</strong> achieves the same result via address bit permutation rather than linear padding, which handles non-power-of-two tile sizes cleanly and wastes no shared memory. </p><p>The goal underneath is identical: spread each column read across all 32 banks.</p><h3>L1 Cache Policy</h3><p>Cache behavior on <strong>Ampere</strong> is configurable at the instruction level, through a qualifier on the load:</p><ul><li><p><code>LDG.CA</code>: cache in L1 (default)</p></li><li><p><code>LDG.CG</code>: bypass L1, go to L2</p></li><li><p><code>LDG.CS</code>: streaming, insert at LRU position</p></li><li><p><code>LDG.CV</code>: bypass all caches (almost never correct)</p></li></ul><p>In <strong>CUDA</strong>: <code>__ldca()</code> for L1-cached loads (<code>__ldg()</code> uses the read-only data path), <code>__ldcg()</code> and <code>__ldcs()</code> for the L2-only and streaming variants, and <code>__ldcv()</code> for the uncached one. The compiler defaults to <code>LDG.CA</code> when uncertain.</p><p>For kernels that process each input element exactly once (elementwise operations, reductions, anything with no reuse), <code>__ldcg()</code> eliminates <strong>L1 pollution</strong> and preserves L1 capacity for data that does benefit from caching. </p><p>The effect in the profiler: lower L1 hit rate, <strong>unchanged L2 hit rate</strong>. 
The data skips one cache level without reducing effective bandwidth at the level where reuse actually exists.</p><h3>The Roofline Model</h3><p>The roofline model (Williams, Waterman, Patterson, 2009) plots <strong>FLOP/s against arithmetic intensity</strong> (FLOP/byte of DRAM traffic). For the A100 in FP32:</p><ul><li><p><em>Peak compute: ~19.5 TFLOP/s</em></p></li><li><p><em>Peak HBM bandwidth: ~2 TB/s</em></p></li><li><p><em>Ridge point: ~9.75 FLOP/byte</em></p></li></ul><p>Below the ridge: memory-bound. Above: compute-bound. The common mistake is treating <strong>DRAM bandwidth</strong> as the only line that matters. </p><p>The L2-based roofline has a ridge at ~4.9 FLOP/byte (19.5 TFLOP/s over ~4 TB/s). The L1-based roofline has a ridge at ~1 FLOP/byte.</p><p>A kernel with <strong>strong L1 reuse</strong> can be compute-bound at an arithmetic intensity that looks memory-bound on the DRAM roofline. </p><p>A kernel that thrashes L2 will underperform the DRAM roofline because its effective bandwidth is below the theoretical peak. <strong>NCU&#8217;s roofline</strong> chart shows all three simultaneously.</p><p>The correct first diagnostic is hierarchical <strong>bandwidth analysis</strong>. Not &#8220;it&#8217;s memory-bound&#8221;; that&#8217;s a category. </p><p>The useful diagnosis is &#8220;<em>it&#8217;s memory-bound at the L2 level, achieving 60% of L2 peak, because 40% of L2 bandwidth is wasted on non-reused data evicted before second use.</em>&#8221; That tells you the fix.</p><div><hr></div><h2>The Tensor Memory Accelerator</h2><p>Ampere introduced cp.async. 
Hopper (sm_90, H100) introduced the <strong>Tensor Memory Accelerator (TMA)</strong>, the same idea taken to its logical conclusion.</p><p>With cp.async, the programmer still computes the <strong>global memory address</strong> of every chunk and constructs the instruction stream. </p><p>For a 128&#215;128 BF16 tile (32 KB), that is <strong>2,048 vectorized</strong> 128-bit cp.async copies, or 64 warp-wide instructions, consuming SMSP instruction bandwidth, even though the transfers are asynchronous.</p><p>TMA accepts a tensor descriptor (base address, dimensions, strides, element type) and issues a single instruction:</p><pre><code><code>cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes
    [smem_dst], [gmem_desc, {coord_y, coord_x}], [mbar];</code></code></pre><p>One instruction. One <strong>128&#215;128 BF16 tile.</strong> The TMA unit generates all the addresses, manages all the transactions, and signals completion via the <code>mbarrier</code> primitive, a synchronization mechanism lighter than <code>__syncthreads()</code>, designed for producer-consumer coordination without a full block-wide barrier.</p><p>The consequence: on Hopper, the <strong>compute-to-load instruction ratio</strong> in a GEMM inner loop approaches &#8734; from the SMSP&#8217;s perspective. </p><p>The SMSPs run <code>wgmma.mma_async</code> continuously; the TMA unit handles all data movement independently. <strong>CUTLASS 3.x</strong> is designed around this model. Part VIII will cover it in full.</p><div><hr></div><h2>Conclusion</h2><p>The line from <strong>Part VI,</strong> <em>&#8220;the SMSP is immediately free to issue the next instruction for that warp&#8221;</em>, is the hinge on which this article turns.</p><p>The memory hierarchy imposes latencies that are not negotiable: 23 cycles for shared memory, <strong>180 for L2, 500 for HBM</strong>. These numbers do not change by complaining about them. </p><p>What changes is when you pay them: <strong>structure code</strong> so that the latency is incurred before the result is needed, issuing the memory request while computing on previously loaded data.</p><p>cp.async is the mechanism. Software pipelining is the pattern. Double buffering is the minimum viable instance. 
The <strong>commit/wait protocol</strong> maintains correctness while the DMA engine and the compute engine run simultaneously.</p><p>The bank conflict analysis and the <strong>L1 bypass</strong> discussion are extensions of the same idea: <strong>minimize latency</strong> and maximize effective bandwidth at every level of the hierarchy, so that by the time data arrives at the computation, it has<strong> traveled through the hardware</strong> as efficiently as physics allows.</p><p>The limits of this approach on Ampere are what motivate <strong>TMA</strong> on <strong>Hopper: </strong>an architecture where the gap between what the programmer expresses and what the hardware executes narrows further, approaching the regime where the programmer describes <em>what</em> should move and the hardware decides <em>when</em>.</p><p><strong>Part VIII begins there.</strong></p>]]></content:encoded></item><item><title><![CDATA[Mastering CUDA and High-Performance Computing, Part VI]]></title><description><![CDATA[A Deep Dive from Compiler Internals to High-Performance Parallel Computing]]></description><link>https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-f02</link><guid isPermaLink="false">https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-f02</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Sun, 22 Mar 2026 09:15:47 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!zwA0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa736b8ce-69b2-47bd-988e-7f0c8336a4ff_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zwA0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa736b8ce-69b2-47bd-988e-7f0c8336a4ff_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zwA0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa736b8ce-69b2-47bd-988e-7f0c8336a4ff_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!zwA0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa736b8ce-69b2-47bd-988e-7f0c8336a4ff_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!zwA0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa736b8ce-69b2-47bd-988e-7f0c8336a4ff_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!zwA0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa736b8ce-69b2-47bd-988e-7f0c8336a4ff_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zwA0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa736b8ce-69b2-47bd-988e-7f0c8336a4ff_1024x1536.png" width="1024" height="1536" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a736b8ce-69b2-47bd-988e-7f0c8336a4ff_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4291409,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/191590077?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa736b8ce-69b2-47bd-988e-7f0c8336a4ff_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zwA0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa736b8ce-69b2-47bd-988e-7f0c8336a4ff_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!zwA0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa736b8ce-69b2-47bd-988e-7f0c8336a4ff_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!zwA0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa736b8ce-69b2-47bd-988e-7f0c8336a4ff_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!zwA0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa736b8ce-69b2-47bd-988e-7f0c8336a4ff_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>The Pipeline&#8217;s One Promise, and How It Fails</h2><p>The <strong>A100 SM</strong> runs at a base clock of approximately 765 MHz and boosts to ~1410 MHz. At boost, one clock cycle is <strong>~0.71 nanoseconds</strong>. The SM has four SMSPs. </p><p>Each <strong>SMSP</strong> has one warp scheduler and one dispatch unit (<em>per NVIDIA&#8217;s Ampere whitepaper; see also the microbenchmark work of Jia et al. and of Markidis, Larsson et al.</em>). </p><p>Each scheduler attempts to issue one instruction per cycle to one eligible warp.</p><p>At full throughput (the scheduler in <strong>all four SMSPs</strong> issuing every cycle) a single A100 SM issues <strong>4 warp instructions per cycle</strong>. </p><p>Across 108 SMs at 1410 MHz, peak issue rate is roughly <strong>0.6 trillion warp instructions per second</strong>. 
This is the theoretical ceiling. You will never reach it. The question is why, and by how much.</p><p>An instruction issues in a given cycle when three conditions are simultaneously true:</p><ol><li><p><strong>The warp is eligible</strong>: it has been selected by the round-robin/priority scheduler, it is not stalled on a scoreboard dependency, and it has not exceeded the warp&#8217;s instruction buffer depth.</p></li><li><p><strong>The execution unit is available</strong>: the target pipe (FMA, SFU, MIO, LSU...) has a free slot.</p></li><li><p><strong>All operands are ready</strong>: every source register&#8217;s scoreboard bit has been cleared by its producing instruction.</p></li></ol><p>When any of these three conditions fails, the scheduler increments a stall counter and moves to another warp. </p><p>The beauty of the <strong>GPU microarchitecture</strong>, and the central insight of GPU optimization, is that condition (3) failing for warp A doesn&#8217;t stall the SM; it just causes the scheduler to attempt warp B instead. </p><p>The <strong>SM stalls</strong> only when <em>no</em> warp satisfies all three conditions simultaneously. That&#8217;s the failure mode we are trying to prevent.</p><p><code>ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum</code> reports the global-load sector traffic that drives condition (3) failures for memory. The <code>smsp__warp_issue_stalled_*</code> counters break the resulting stalls down by category. 
</p><p>We will use both throughout.</p><div><hr></div><h2>What lives inside one SMSP</h2><p>Before discussing stalls, you need an accurate map of what execution units exist and what their throughput and latency look like. </p><p>Much of the confusion in GPU optimization literature stems from people using &#8220;<strong>the FP32 pipe</strong>&#8221; as a monolith when it is not.</p><p>One Ampere SMSP contains, per NVIDIA&#8217;s Ampere Architecture whitepaper and corroborating microbenchmark work:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SsON!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68b499ca-7fc9-445f-9651-6799a54b7136_919x500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SsON!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68b499ca-7fc9-445f-9651-6799a54b7136_919x500.png 424w, https://substackcdn.com/image/fetch/$s_!SsON!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68b499ca-7fc9-445f-9651-6799a54b7136_919x500.png 848w, https://substackcdn.com/image/fetch/$s_!SsON!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68b499ca-7fc9-445f-9651-6799a54b7136_919x500.png 1272w, 
https://substackcdn.com/image/fetch/$s_!SsON!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68b499ca-7fc9-445f-9651-6799a54b7136_919x500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SsON!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68b499ca-7fc9-445f-9651-6799a54b7136_919x500.png" width="919" height="500" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/68b499ca-7fc9-445f-9651-6799a54b7136_919x500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:919,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74614,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/191590077?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd277843f-ea6d-4dcd-99c5-d7dc54570383_919x500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SsON!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68b499ca-7fc9-445f-9651-6799a54b7136_919x500.png 424w, https://substackcdn.com/image/fetch/$s_!SsON!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68b499ca-7fc9-445f-9651-6799a54b7136_919x500.png 848w, 
https://substackcdn.com/image/fetch/$s_!SsON!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68b499ca-7fc9-445f-9651-6799a54b7136_919x500.png 1272w, https://substackcdn.com/image/fetch/$s_!SsON!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68b499ca-7fc9-445f-9651-6799a54b7136_919x500.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>A few notes on the accuracy of this table:</p><p>The <strong>FMA latency of 4 cycles</strong> is confirmed by the CUDA C Programming 
Guide and by numerous independent microbenchmarks. </p><p>It is not 1 cycle. It is not 2 cycles. It is 4, and every serial dependency chain in your kernel pays it in full.</p><p>The <strong>SFU latency of 16 cycles</strong> is confirmed by microbenchmarks. Throughput of 4 cycles/instruction means one MUFU occupies the single SFU for <strong>4 cycles</strong>; other warps&#8217; MUFU instructions queue behind it. </p><p>Since there is only <strong>one SFU per SMSP</strong> and the SMSP has 16 warps maximum, a warp issuing a MUFU must wait for the SFU to become free if another warp issued one within the last 4 cycles.</p><p>The <strong>FP64 throughput asymmetry</strong> is critical for A100 versus A10 hardware: the <strong>A100</strong> has full-rate FP64 (2 cycles per DFMA per SMSP), while the A10 has 1/16th the FP64 throughput (DFMA at 32 cycles per instruction). </p><p>Running FP64 code on an <strong>A10</strong> is not merely slower: it is catastrophically slower. Verify your hardware before benchmarking.</p><p><strong>Shared memory load latency of 23 cycles</strong> is consistent with microbenchmarks (<em>Luitjens</em> 2011, <em>Volkov</em> 2016, and more recently <em>Yan et al.</em> in their SM scheduling simulator). </p><p>The <strong>official CUDA documentation</strong> says &#8220;~20 cycles&#8221; without precision; 23 cycles is the empirically correct number for Ampere under normal bank-conflict-free access.</p><p>With <strong>4-way bank conflicts</strong> the effective latency grows because the MIO pipe is occupied for additional cycles while the bank serialization completes.</p><h3>The scoreboards in detail</h3><p>Each SMSP has two scoreboards, as described in <strong>Part V</strong>. Understanding their interaction with execution units is worth revisiting with more precision:</p><p><strong>Short scoreboard</strong>: covers arithmetic results from the FMA pipe, <strong>INT32 ALU</strong>, and SFU. 
Latency tracked: 4 cycles (FMA/INT32) and 16 cycles (SFU). The scoreboard has one bit per register per warp. </p><p>When an <strong>FFMA issues</strong> with destination R4, bit R4 for that warp is set in the short scoreboard. It is cleared 4 cycles later (for FMA results) by the pipeline&#8217;s bypass network. </p><p>An instruction in another warp that reads its own R4 is unaffected: scoreboards are <strong>per-warp</strong>, not global.</p><p><strong>Long scoreboard</strong>: covers memory results, that is, any instruction that issues to the <strong>MIO unit</strong> (loads from global/shared/local memory, atomic operations). </p><p>The long scoreboard bit is set when the load issues and is <strong>not cleared</strong> until the data physically arrives and is written to the register file. </p><p>For an HBM access this can be <strong>400+ cycles</strong>. The SMSP does not know in advance how long an HBM access will take (it depends on <strong>DRAM row buffer state</strong>, competing traffic, etc.); it just waits for the completion signal from the memory system.</p><p>An important subtlety: <code>cp.async</code> instructions <strong>do not set the long scoreboard</strong>. This is the mechanism by which they achieve asynchrony. </p><p>The register file is not involved, so no register&#8217;s bit is marked pending. </p><p>The SMSP issues the <code>cp.async</code>, the copy engine takes it, and the SMSP is immediately free to issue the next instruction for that warp. 
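</p><p>As a rough illustration of that asynchrony, here is a minimal sketch using the <code>__pipeline_memcpy_async</code>, <code>__pipeline_commit</code>, and <code>__pipeline_wait_prior</code> primitives from <code>cuda_pipeline.h</code>, which on sm_80 and newer are expected to lower to <code>cp.async</code>. The kernel name, the 128-thread block size, and the arithmetic are assumptions made for illustration; they are not code from this series:</p><pre><code><code>#include &lt;cuda_pipeline.h&gt;

// Illustrative only: assumes blockDim.x == 128 and a 16-byte-aligned input.
__global__ void stage_then_compute(const float4* __restrict__ in,
                                   float* __restrict__ out,
                                   float alpha, int n4) {
    __shared__ float4 tile[128];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i &lt; n4) {
        // Global -&gt; shared copy with no destination register:
        // there is no register bit for the long scoreboard to track.
        __pipeline_memcpy_async(&amp;tile[threadIdx.x], &amp;in[i], sizeof(float4));
    }
    __pipeline_commit();              // close the current batch of async copies

    float bias = alpha * alpha;       // independent work can issue while the copy is in flight

    __pipeline_wait_prior(0);         // wait until this thread's committed copies have landed

    if (i &lt; n4) {
        // Each thread reads only the slot it copied itself.
        float4 v = tile[threadIdx.x];
        out[i] = alpha * (v.x + v.y + v.z + v.w) + bias;
    }
}</code></code></pre><p>Because each thread only reads the slot it copied itself, waiting on its own pipeline is enough; a tile shared across threads would additionally need a <code>__syncthreads()</code> after the wait.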
</p><p>We will return to the exact implications of this in future posts.</p><div><hr></div><h2>The SFU, quantified completely</h2><p>The <strong>Special Function Unit</strong> executes MUFU instructions. Let&#8217;s be more precise than &#8220;the SFU is slow.&#8221;</p><p>On Ampere, one SMSP has one SFU. The SFU pipeline is <strong>4 stages deep</strong>, and a warp instruction occupies each stage for 4 cycles: one new instruction can enter the pipeline every 4 cycles (this is the initiation interval, II), and the result is available 16 cycles after issue (this is the latency, L = 4 &#215; II). </p><p>This is a <strong>reasonable structural design</strong> for a unit that computes hardware approximations to transcendentals: the underlying <strong>table-lookup and interpolation</strong> steps take multiple stages.</p><p>The key distinction: <strong>throughput</strong> (4 cycles/instruction) limits how often you can <em>issue</em> MUFU instructions from the same SMSP. </p><p><strong>Latency</strong> (16 cycles) limits how soon a downstream instruction can <em>use</em> the MUFU result. </p><p>Both matter; they fail you in different scenarios.</p><h3>What MUFU opcodes actually compute</h3><p>MUFU is not one instruction. It is one instruction format with an operation selector.</p><p>The compiler maps <strong>standard C math</strong> functions to MUFU as follows. 
</p><p>This mapping is important because the additional <strong>FP32 instructions</strong> (FMUL/FFMA) required for argument scaling and result scaling never touch the SFU, yet they are not free: they consume FP32 pipe cycles:</p><p><code>expf(x)</code><strong> &#8594; </strong><code>__expf(x)</code><strong> (with </strong><code>-use_fast_math</code><strong>):</strong></p><pre><code><code>// Argument reduction: convert from base-e to base-2
// exp(x) = 2^(x * log2(e)) = 2^(x * 1.44269504...)
FMUL   R1, R0, 1.44269502f          ; x * log2(e), FP32 pipe, 0.25 cycles throughput
MUFU.EX2  R2, R1                    ; 2^(x*log2e), SFU, 4 cycles throughput
</code></code></pre><p><code>expf(x)</code><strong> (IEEE-compliant, without fast-math):</strong></p><pre><code><code>// Range check and reduction (compiler-generated, varies)
FMNMX  R1, R0, 88.722839f, ...      ; clamp to avoid overflow  
FFMA   R2, R1, 1.44269502f, ...     ; argument reduction with correction term
MUFU.EX2  R3, R2                    ; core computation
FMUL   R4, R3, ...                  ; reconstruction (potentially)
// Plus additional corrections for subnormals, NaN, INF
</code></code></pre><p>The <strong>IEEE-compliant version</strong> may issue conditional branches for edge cases. Because the inputs might be NaN, INF, or very large/small, the compiler generates defensive code.</p><p><code>__expf()</code> removes these guards entirely: its results are <strong>not dependable for inputs outside [&#8722;87.3, 88.7]</strong> (the approximate FP32 range of exp before overflow/underflow). </p><p>If you know your <strong>inputs are bounded</strong>, and in softmax after max-subtraction they are, since all values are &#8804; 0 and an underflowing term simply rounds toward zero, <code>__expf()</code> is the right choice.</p><p><code>tanhf(x)</code><strong> (any mode):</strong></p><p>tanh has no single MUFU opcode. The compiler implements it using the identity: <code>tanh(x) = 1 - 2/(exp(2x)+1)</code>. </p><p>The resulting <strong>SASS</strong> (approximately, varies by compiler version) includes:</p><pre><code><code>FMUL   R1, R0, 2.0f                 ; 2x
FMUL   R2, R1, 1.44269502f          ; 2x * log2(e)
MUFU.EX2  R3, R2                    ; 2^(2x * log2e) = exp(2x)
FADD   R4, R3, 1.0f                 ; exp(2x) + 1
MUFU.RCP  R5, R4                    ; 1/(exp(2x)+1)
FMUL   R6, R5, 2.0f                 ; 2/(exp(2x)+1)
FADD   R7, -R6, 1.0f                ; 1 - 2/(exp(2x)+1) = tanh(x)
</code></code></pre><p>That&#8217;s <strong>two MUFU instructions</strong> (one EX2, one RCP) per <code>tanhf</code> call, or 8 cycles of SFU pipe occupancy per call in throughput terms. </p><p>For GELU, which uses <code>tanhf</code> internally (the fast approximation <code>0.5x(1+tanh(&#8730;(2/&#960;)(x+0.044715x&#179;)))</code>), you have additional FFMAs on top.</p><p><strong>GELU</strong> activation in a fused kernel is expensive in SFU terms. The simpler SiLU activation (<code>x * &#963;(x) = x / (1 + exp(-x))</code>) is cheaper mainly on the FMA side: it still needs one MUFU.EX2 and a reciprocal for the division, but it skips the cubic argument polynomial and the extra scaling that GELU&#8217;s tanh form requires.</p><h3>Throughput model for an SFU-bottlenecked loop</h3><p>Suppose your <strong>kernel&#8217;s inner loop body</strong>, after compilation, contains:</p><ul><li><p>1&#215; MUFU.EX2 (4-cycle throughput, SFU pipe)</p></li><li><p>3&#215; FFMA (0.25-cycle throughput each, FMA pipe)</p></li><li><p>2&#215; FADD (0.25-cycle throughput each, FMA pipe)</p></li><li><p>1&#215; LDS (shared memory load, ~1-cycle throughput assuming no bank conflict)</p></li></ul><p>Total FMA pipe demand: 5 &#215; 0.25 = <strong>1.25 cycles</strong><br>Total SFU pipe demand: 1 &#215; 4.0 = <strong>4.0 cycles</strong><br>Total MIO demand: 1 &#215; ~1.0 = <strong>~1.0 cycles</strong></p><p>The SFU is the bottleneck: the loop cannot issue faster than <strong>4.0 cycles per iteration</strong>. </p><p>The FMA pipe is occupied 1.25/4.0 = <strong>31% of the time</strong>. The remaining 69% of FMA pipe capacity is wasted, waiting for the SFU to finish so the next iteration can begin.</p><p>You can fill this gap in two ways: <strong>more ILP within the loop</strong> (unroll and issue multiple independent MUFU calls, keeping both the SFU and FMA pipe busier) or <strong>replace MUFU with FMA-pipe arithmetic</strong>. </p><p>The first approach doesn&#8217;t change the SFU ceiling; it just makes better use of the FMA pipe in parallel. 
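</p><p>A minimal sketch of the first approach, assuming a softmax-style pass that sums <code>exp(x - row_max)</code>; the kernel name, the grid-stride layout, and the unroll factor of 4 are illustrative assumptions:</p><pre><code><code>// Illustrative sketch: partial must hold gridDim.x * blockDim.x floats.
__global__ void exp_sum_unrolled(const float* __restrict__ x,
                                 float* __restrict__ partial,
                                 float row_max, int n) {
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Four independent accumulators: no MUFU result feeds another MUFU.
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;

    int i = tid;
    for (; i + 3 * stride &lt; n; i += 4 * stride) {
        acc0 += __expf(x[i]              - row_max);
        acc1 += __expf(x[i + stride]     - row_max);
        acc2 += __expf(x[i + 2 * stride] - row_max);
        acc3 += __expf(x[i + 3 * stride] - row_max);
    }
    // Remainder: fewer than four strided elements left for this thread.
    for (; i &lt; n; i += stride) {
        acc0 += __expf(x[i] - row_max);
    }
    partial[tid] = (acc0 + acc1) + (acc2 + acc3);
}</code></code></pre><p>The four accumulators give a single warp four independent MUFU.EX2 results in flight, so the 16-cycle result latency of one overlaps the issue of the next. Whether this beats the compiler&#8217;s own unrolling is something to confirm in SASS.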
</p><p>The second moves the ceiling.</p><h3>Polynomial exp replacement: the real implementation</h3><p>The &#8220;<em>4th order polynomial</em>&#8221; approach described in the previous version of this article is plausible but underspecified. </p><p>Here is a more completely specified implementation using a <strong>piecewise approach</strong> compatible with softmax use cases:</p><pre><code><code>// Fast exp2f approximation: pure FP32, no SFU
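// Polynomial core maps to 4 FFMAs in SASS (one per fmaf below)
// Valid for x &#8712; [-126, 127] (FP32 normal range for 2^x)
// Error: relative error up to roughly 2^-14 with the coefficients below
// (degree-4 truncation at |f| = 0.5); often acceptable for softmax, but measure it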
__device__ __forceinline__ float fast_exp2f_fma(float x) {
    // Decompose x = n + f where n is integer, f &#8712; [-0.5, 0.5]
    float n = (float)__float2int_rn(x);   // round to nearest int (F2I), then back to float (I2F)
    float f = x - n;                  // fractional part

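    // Degree-4 polynomial for 2^f over [-0.5, 0.5], in Horner form
    // Coefficients equal the Taylor terms ln2, ln2^2/2, ln2^3/6, ln2^4/24 to six digits;
    // a true minimax fit (e.g. from Sollya) would differ slightly and lower the peak error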
    // 2^f &#8776; 1 + f*(0.693147 + f*(0.240227 + f*(0.055504 + f*0.009618)))
    float p = 0.009618f;
    p = fmaf(p, f, 0.055504f);
    p = fmaf(p, f, 0.240227f);
    p = fmaf(p, f, 0.693147f);
    p = fmaf(p, f, 1.0f);             // 2^f approximation

    // Reconstruct 2^x = 2^n * 2^f via integer exponent manipulation
    // Pack n into FP32 exponent bits: (int)(n + 127) &lt;&lt; 23
    int e = __float2int_rn(n) + 127;
    float scale = __int_as_float(e &lt;&lt; 23); // exact power of 2, no error

    return p * scale;
}
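
// --- Illustrative accuracy check (hypothetical helper, not part of the original listing) ---
// Sweeps `count` samples (count &gt;= 2) uniformly over [lo, hi] and records the relative
// error of fast_exp2f_fma against the library exp2f, so the error bound can be measured
// rather than assumed. Launch with at least `count` threads; rel_err holds `count` floats.
__global__ void fast_exp2f_rel_error(float lo, float hi, int count, float* rel_err) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i &gt;= count) return;
    float x   = lo + (hi - lo) * ((float)i / (float)(count - 1));
    float ref = exp2f(x);                      // reference from the CUDA math library
    rel_err[i] = fabsf(fast_exp2f_fma(x) - ref) / ref;
}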

// For expf(x): expf(x) = exp2f(x * log2(e))
__device__ __forceinline__ float fast_expf_fma(float x) {
    return fast_exp2f_fma(x * 1.4426950408889634f);
}</code></code></pre><p>Expected SASS for <code>fast_expf_fma</code>: 4 FFMAs for the Horner chain, plus F2I/I2F conversions, an integer add and SHL, and a few FMUL/FADD instructions. No MUFU. </p><p>Throughput: ~2&#8211;2.5 cycles per call on the FMA pipe. Versus MUFU.EX2 at 4 cycles: a genuine <strong>1.6&#8211;2&#215; throughput improvement</strong> for softmax inner loops on Ampere.</p><p>The catch: verify SASS output yourself. The compiler has latitude with <code>__int_as_float</code> and <code>__float2int_rn</code>. </p><p>Confirm with <code>nvdisasm</code> that no MUFU instructions appear in the compiled output.</p><h3>Measuring SFU utilization precisely</h3><p>The two relevant Nsight Compute metrics:</p><pre><code><code>smsp__pipe_fma_cycles_active.avg.pct_of_peak_sustained_active
smsp__pipe_xu_cycles_active.avg.pct_of_peak_sustained_active
</code></code></pre><p>In an <strong>SFU-bottlenecked kernel</strong>, <code>xu</code> (XU = execution unit, NVIDIA&#8217;s internal name for the SFU pipe) will be near 100% and <code>fma</code> will be proportionally lower. </p><p>The ratio <code>xu_cycles / fma_cycles</code> tells you the SFU/FMA throughput imbalance directly.</p><p>Also useful: <code>smsp__average_warp_latency_per_inst_executed.ratio</code>; if this is high while <code>xu_cycles_active</code> is also high, the warp latency is being driven by MUFU&#8217;s 16-cycle result latency, not just its 4-cycle throughput. </p><p>Both cost you, via different mechanisms.</p><div><hr></div><h2>The L0 instruction cache</h2><p>Each SMSP on Ampere has a <strong>dedicated 32 KB L0 instruction cache</strong> (also referred to as the I-cache in some microarchitecture literature). </p><p>This is <strong>physically separate</strong> from the unified L1 data/shared memory: it is not carved from the 192 KB L1 pool. </p><p>The <strong>L0 is private</strong> to each SMSP; four SMSPs per SM means four independent L0 caches per SM.</p><p>Instructions on Ampere are 128 bits (16 bytes) wide. The L0 holds 32 KB / 16 B = <strong>2048 instructions</strong>. 
</p><p>A typical kernel loop body of 100&#8211;300 instructions fits comfortably; the L0 warms up on the <strong>first iteration</strong> and subsequent fetches are essentially free (one cycle or less).</p><p>The exception: kernels generated from heavily templated <strong>C++ code</strong> (think Thrust or hand-unrolled matrix multiplication with large tile sizes) can have loop bodies exceeding 500&#8211;1000 instructions. </p><p>A kernel that fully unrolls a <strong>256-iteration loop</strong> with 8 FMAs per iteration emits 2048 instructions for that loop, exactly filling the L0 and leaving nothing for the rest of the kernel. </p><p>Add one more instruction and you start thrashing.</p><p>When the L0 misses, the SMSP must fetch from L1 <strong>instruction cache</strong> (shared with data traffic, with associated latency) or, worse, from L2. </p><p>L1 instruction fetch latency is approximately 20&#8211;30 cycles. The miss is captured by:</p><pre><code><code>smsp__pcsamp_warps_issue_stalled_imc_miss.sum</code></code></pre><p>When this stall reason exceeds about 2% of samples, you have a structural code-size problem. Above 10% it is severe.</p><p>There is no runtime mechanism to manage L0 occupancy. The only intervention is <strong>compile-time code size reduction</strong>:</p><ul><li><p>Reduce N in <code>#pragma unroll N</code>, or use <code>#pragma unroll 1</code> to stop large loops from being unrolled</p></li><li><p>Mark non-critical helper functions with <code>__noinline__</code></p></li><li><p>Split large kernels into kernel launch sequences (costs launch overhead; evaluate the trade-off)</p></li><li><p>Use <code>--maxrregcount</code> to limit register count, which sometimes causes the compiler to generate shorter instruction sequences (it can also add spill instructions, so compare the SASS size)</p></li></ul><h3>Instruction decode bandwidth</h3><p>Decoded instructions are held in per-warp instruction buffers before issue. 
</p><p>On Ampere, these buffers are approximately <strong>2 entries deep per warp</strong>. This is not officially documented; it is reverse-engineered from microbenchmarks, specifically from observing that back-to-back dependent instructions with <strong>1-cycle-latency arithmetic operations</strong> still issue without stall, implying at least 2-deep pre-decoding.</p><p>The decoder can process approximately <strong>1 instruction per cycle per SMSP</strong> (across all warps). </p><p>This exceeds the issue rate for any single warp (maximum 1 instruction every 4 cycles for a compute-bound warp at peak), so the decoder runs ahead and the per-warp instruction buffer is almost always populated.</p><p>The <strong>pathological case</strong>: a kernel at very high occupancy (16 warps per SMSP, the A100 maximum) with a simple loop body of 3 instructions (say, a vectorized element-wise operation: LDG.128, FFMA.x4, STG.128). </p><p>All 16 warps are eligible every cycle. The decoder must keep all 16 instruction buffers populated. </p><p>At <strong>1 decode per cycle</strong> and 16 warps each needing fresh instructions, the decoder is stretched. </p><p>If the instruction stream is not in L0 (forcing L1 fetch at 20+ cycle latency), the buffers drain and the scheduler stalls even though 16 eligible warps exist.</p><p>This is rare but real. It manifests as a high <code>smsp__pcsamp_warps_issue_stalled_imc_miss</code> combined with near-100% occupancy, which is confusing until you understand that 16 resident warps generate 16&#215; the instruction fetch pressure of 1 warp.</p><div><hr></div><h2>Predicated execution</h2><p>Each thread on Ampere has <strong>7 predicate registers</strong> (P0 through P6). </p><p>These are separate from the 255 available scalar registers (R0&#8211;R254, with R255 reserved as the zero register). 
</p><p>Predicate registers are<strong> 1-bit values</strong> set by comparison instructions:</p><pre><code><code>// Source: if (a &gt; b) { ... }
FSETP.GT.AND P0, PT, R0, R1, PT   ; P0 = (R0 &gt; R1) AND PT; the second destination is PT, so the complement is discarded</code></code></pre><p><code>FSETP.GT.AND P0, PT, R0, R1, PT</code> reads as: &#8220;set predicate P0 to (R0 &gt; R1) AND the trailing PT; write the complementary result to the second destination, which here is PT and is therefore discarded.&#8221; Because PT is the always-true predicate, the comparison passes through unchanged. </p><p>The <strong>AND/OR suffix</strong> specifies the combining mode for nested predicate logic. </p><p>This instruction issues on the FP32 pipe, costs 4-cycle latency, and produces a predicate bit, not a register value.</p><p>An instruction with a predicate prefix:</p><pre><code><code>@P0    FFMA  R3, R1, R2, R3    ; execute FFMA only if P0 is true
@!P0   FFMA  R5, R1, R2, R5   ; execute FFMA only if P0 is false</code></code></pre><p>The semantics at the hardware level: <strong>all lanes in the warp issue the instruction</strong>. The instruction traverses the pipeline. </p><p>When the result write-back occurs, it is gated by the predicate: lanes where the predicate is true write their result; lanes where it is false suppress the write-back.</p><p>No branch. No warp divergence. No reconvergence stack manipulation.</p><p>Consequence: predicated instructions consume <strong>throughput proportional to the total number of instructions</strong>, not proportional to the number of active lanes.</p><p>A 32-thread warp where 16 threads have P0=true and 16 have P0=false, executing <code>@P0 FFMA R3, R1, R2, R3</code>, consumes exactly the same FFMA pipe resources as all 32 threads having P0=true. </p><p>The 16 non-writing threads waste their execution slots.</p><p>This is the precise meaning of &#8220;predicated execution trades throughput for divergence avoidance.&#8221;</p><h3>The compiler&#8217;s branch/predicate decision heuristic</h3><p>The CUDA compiler (nvcc: the LLVM-based NVVM front end emits PTX, and <code>ptxas</code> lowers it to SASS) uses a cost model to decide between a BRA (branch) and predication. 
The model is approximately:</p><p><strong>Predication is chosen when:</strong></p><ul><li><p>The combined instruction count of both branch arms is &#8804; ~8&#8211;12 instructions total</p></li><li><p>OR the divergence probability is estimated to be high (many warps will have mixed predicate values)</p></li><li><p>OR taking the branch is expected to cost an instruction fetch (the target is unlikely to be resident in the instruction cache, and the SM has no branch predictor to hide that latency)</p></li></ul><p><strong>Branch is chosen when:</strong></p><ul><li><p>One arm is long (&gt; ~6 instructions) and the other is short</p></li><li><p>The compiler can estimate that the majority of warps will take one branch uniformly</p></li><li><p>The branch condition is amenable to warp-uniform evaluation (all threads agree)</p></li></ul><p>The threshold is <strong>not a hard constant</strong>: it depends on the compiler version, optimization level, and the surrounding code structure. </p><p>The reliable way to check what the compiler chose is to inspect SASS:</p><pre><code><code># Disassemble a compiled kernel to SASS
nvdisasm --print-code mykernel.cubin | grep -E 'BRA|@!?P[0-9]'</code></code></pre><p>Or, within Nsight Compute:<br><strong>Source</strong> tab &#8594; enable &#8220;Source Counters&#8221; &#8594; switch to &#8220;SASS&#8221; view &#8594; look for <code>@P0</code> prefixes versus <code>BRA</code> instructions in the hot loop.</p><h3>When to override the compiler&#8217;s choice</h3><p>The compiler is generally right. The cases where it is wrong:</p><p><strong>Case 1: Long rare branch incorrectly predicated.</strong> If your hot loop has a condition triggered 1% of the time (e.g., an overflow check, a boundary condition), and the expensive handler is 15 instructions, the compiler might still predicate if the loop body is otherwise short and the combined instruction count falls under the threshold. </p><p>Predicated, all 15 instructions issue on every iteration of every warp. Branched, they issue only when at least one lane in the warp takes the rare path. If lanes trigger independently at 1%, that is 1 &#8722; 0.99<sup>32</sup> &#8776; 27.5% of warp-iterations, or about 0.275 &#215; 15 &#8776; 4 instructions per iteration on average; if the condition is rare per warp rather than per lane, the branched cost falls towards 0.01 &#215; 15 = 0.15 instructions. Predication pays the full 15 either way, while a branch costs almost nothing in the common case.</p><p>Fix: restructure the code to make the &#8220;expensive&#8221; path obviously large and separated: e.g., a function call rather than inlined code, which the compiler treats as a definite branch site.</p><p><strong>Case 2: Warp-uniform condition incorrectly predicated.</strong> If every thread in a warp evaluates the same condition (e.g., based on <code>blockIdx.x</code> or a value loaded from constant memory that all threads share), the warp takes the branch uniformly and pays zero divergence cost.</p><p>The compiler sometimes generates a branch here and sometimes predicates. 
When the condition is warp-uniform and the body is long, you want a branch (all threads skip the long body together); predication would execute the long body for all threads on every iteration.</p><p>You can encourage warp-uniform branch treatment by computing the predicate with <code>__all_sync(__activemask(), condition)</code> when you know it&#8217;s warp-uniform; this makes the intent explicit.</p><div><hr></div><h2>Conclusion</h2><p>Predication is not a free lunch, and it is not free branching. </p><p>It is a <strong>specific trade</strong>: you pay throughput for all threads to avoid the divergence tax of splitting and reconverging a warp. </p><p>That trade is profitable when the branch body is short and divergence is likely. </p><p>It is catastrophically unprofitable when the <strong>branch body</strong> is long and most warps would have skipped it entirely.</p><p>The compiler&#8217;s heuristic gets this right most of the time, because most conditionals in well-written kernels are short. </p><p>The cases where it fails (a rare overflow handler that gets predicated, a warp-<strong>uniform load flag</strong> that gets predicated instead of branched) are invisible at the source level and only show up as unexplained throughput loss in the profiler. </p><p>The <code>smsp__warp_issue_stalled_not_selected</code> stall counter rising without a corresponding increase in occupancy is one signal; <strong>anomalously low FMA pipe </strong>utilization relative to the instruction count is another.</p><p>The discipline is the same as everywhere else in this series: don&#8217;t assume the compiler made the optimal choice. </p><p>Inspect the SASS, verify the <code>@P</code> prefixes are where you expect them and absent where you don&#8217;t, and use <code>__all_sync</code> to make <strong>warp-uniform conditions</strong> structurally explicit rather than relying on the compiler to infer them. </p><p>A predicate register costs nothing. 
A 15-instruction predicated block running at full throughput for <strong>99% of warps</strong> that didn&#8217;t need it costs exactly as much as running it unconditionally; which is, in fact, what you did.</p>]]></content:encoded></item><item><title><![CDATA[Mastering CUDA and High-Performance Computing, Part V]]></title><description><![CDATA[A Deep Dive from Compiler Internals to High-Performance Parallel Computing]]></description><link>https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-ebc</link><guid isPermaLink="false">https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-ebc</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Fri, 20 Mar 2026 15:31:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!N76I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc69b2260-c501-45aa-976c-c0b042fa3f7e_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 
is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N76I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc69b2260-c501-45aa-976c-c0b042fa3f7e_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N76I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc69b2260-c501-45aa-976c-c0b042fa3f7e_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!N76I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc69b2260-c501-45aa-976c-c0b042fa3f7e_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!N76I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc69b2260-c501-45aa-976c-c0b042fa3f7e_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!N76I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc69b2260-c501-45aa-976c-c0b042fa3f7e_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N76I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc69b2260-c501-45aa-976c-c0b042fa3f7e_1024x1536.png" width="1024" height="1536" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c69b2260-c501-45aa-976c-c0b042fa3f7e_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3897081,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/189894383?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc69b2260-c501-45aa-976c-c0b042fa3f7e_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!N76I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc69b2260-c501-45aa-976c-c0b042fa3f7e_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!N76I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc69b2260-c501-45aa-976c-c0b042fa3f7e_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!N76I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc69b2260-c501-45aa-976c-c0b042fa3f7e_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!N76I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc69b2260-c501-45aa-976c-c0b042fa3f7e_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>The instruction pipeline</h2><p>There is a moment in <strong>GPU optimization</strong> work that comes after the first profiling session, after you&#8217;ve internalized roofline arithmetic and stopped chasing occupancy for its own sake.</p><p>You&#8217;ve fixed the obvious things. Coalescing is clean. Shared memory is tiled. Bank conflicts are gone. <strong>Register pressure</strong> is measured and tolerated. Arithmetic intensity sits above the ridge point. You run the kernel.</p><p>The profiler still shows stalls.</p><p>Different stalls. The memory stalls are reduced to acceptable levels, but something else is happening. 
</p><p>The SM is doing work, not waiting on <strong>DRAM</strong>, and yet it is slower than the theoretical ceiling by a factor the memory model doesn&#8217;t explain.</p><p>This is when the focus shifts from data movement to instruction flow. From where data lives to how instructions are scheduled, issued, and retired.</p><p>The second real lesson of <strong>GPU programming</strong> is this: the instruction pipeline is not transparent. </p><p>It has depth, hazards, and throughput ceilings that are<strong> entirely independent</strong> of memory bandwidth. It can be the bottleneck even when memory is not.</p><p>Understanding it requires going deeper into the SM than most<strong> CUDA tutorials</strong> ever go.</p><div><hr></div><h2>The instruction stream beneath the kernel</h2><p>Every CUDA kernel you write is compiled to <strong>PTX </strong>(NVIDIA&#8217;s Parallel Thread eXecution intermediate representation) and from there to SASS: Streaming <strong>ASSembler</strong>, the native instruction set of the GPU.</p><p>PTX is portable across GPU generations within broad families. SASS is not. It is tied to a specific architecture, encodes <strong>pipeline-specific</strong> timing constraints directly into instruction control bits, and is the only thing that actually executes on the hardware.</p><p>Most <strong>CUDA developers</strong> never look at SASS. 
This is a mistake: not because you need to write it, but because it is the only place you can verify what the compiler actually produced, confirm that<strong> instruction selection </strong>matches your intent, and identify pipeline bottlenecks that the high-level model cannot expose.</p><p>The translation path:</p><pre><code><code>CUDA source (.cu)
     &#8595;  [nvcc / clang front-end: C++ parsing, template instantiation]
PTX  (target-independent virtual ISA; open, documented)
     &#8595;  [ptxas: machine-specific optimization, register allocation, SASS generation]
SASS (cubin; architecture-locked, closed binary)
     &#8595;  [CUDA runtime: kernel loading, parameter binding, SM dispatch]
SM execution units</code></code></pre><p>To <strong>inspect SASS</strong> for any compiled binary, use:</p><pre><code><code># From a compiled CUDA binary
cuobjdump --dump-sass my_kernel.cubin
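# Assumption, not from the original: --function limits the dump to one kernel ("my_kernel" is an illustrative name; C++ kernels usually need the mangled name)
cuobjdump --dump-sass --function my_kernel my_kernel.cubin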

# Or directly from a PTX file compiled for a target
ptxas -arch=sm_80 kernel.ptx -o kernel.cubin
cuobjdump --dump-sass kernel.cubin</code></code></pre><p>When you write <code>float x = a * b + c</code>, the compiler emits a single <code>FFMA</code><strong> instruction </strong>(fused multiply-add) which performs both operations in one pipeline pass with a single rounding step. </p><p>This is semantically different from two separate operations: <code>a * b + c</code> computed with FFMA rounds once at the end; a separate multiply followed by a separate add rounds twice. Splitting the expression across statements is not enough to get the two-rounding form, because the compiler may still contract <code>float t = a * b; t = t + c;</code> into one FFMA; the intrinsics <code>__fmul_rn</code> and <code>__fadd_rn</code> are never fused. </p><p>The compiler fuses by default. Disable with <code>-fmad=false</code> if numerical reproducibility across implementations matters, accepting up to roughly halved throughput on FMA-dominated <strong>compute-bound code.</strong></p><p>The SASS instruction stream is <strong>not a one-to-one transcription</strong> of your source. It is the compiler&#8217;s best attempt to map your intent to the hardware&#8217;s actual execution model. </p><p>Understanding that model is what makes the difference between reading SASS as noise and reading it as a diagnostic.</p><div><hr></div><h2>Latency versus throughput</h2><p>The single most important conceptual distinction in pipeline-level reasoning is between instruction <strong>latency </strong>and instruction <strong>throughput</strong>. They govern different bottlenecks, require different fixes, and<em> are frequently confused</em>.</p><p><strong>Latency</strong> is the number of clock cycles between when an instruction is issued and when its result is available to a subsequent instruction that depends on it. 
A dependent instruction cannot issue until this interval has elapsed.</p><p><strong>Throughput</strong> is the inverse of the rate at which independent instructions can flow through an execution unit, expressed as cycles per instruction. It represents the pipeline&#8217;s <strong>steady-state capacity,</strong> ignoring data dependencies entirely.</p><p>The <strong>following table</strong> gives measured values for the Ampere architecture (A100, sm_80), derived from microbenchmarking work by <strong>Abdelkhalik et al. (2022)</strong> and corroborated across multiple independent sources:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;47282565-e295-4af1-a90b-09517172ec2b&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Instruction              Latency (cycles)   Throughput (cyc/instr, per SMSP)
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
FFMA (FP32)                     4                  0.25  (4 FMAs/cycle)
FADD / FMUL (FP32)              4                  0.25
IMAD / IADD3 (INT32)            4                  0.25
DFMA (FP64)                     8                  2.0
MUFU.RCP / RSQ                 16                  4.0
MUFU.SIN / COS                 16                  4.0
MUFU.EX2 / LG2                 16                  4.0
HFMA2 (FP16&#215;2)                  4                  0.25
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
Shared memory load             23                  ~1.0
Shared memory store            19                  ~1.0
L1 cache hit (LDG)            ~33                  ~1.0
L2 cache hit (LDG)           ~200                  ~1.0
HBM uncached (LDG)         ~290&#8211;566&#8224;               ~1.0</code></pre></div><p><em>&#8224;The range reflects two different measurement methodologies: pointer-chasing through fully resident arrays (290 cycles, Abdelkhalik et al.) versus pointer-chasing through a working set larger than L2, which forces full HBM round-trips and measured ~566 cycles on A100 (Shi et al., 2025). The lower number approximates L2-miss latency; the upper number is closer to the true cold-DRAM round-trip experienced in bandwidth-saturated conditions.</em></p><p>The <strong>MUFU </strong>throughput of <strong>4 cycles</strong> per instruction requires emphasis. MUFU executes transcendental functions (<code>sinf</code>, <code>cosf</code>, <code>expf</code>, <code>logf</code>, <code>rsqrtf</code>, <code>rcpf</code>) via the <strong>Special Function Unit</strong>, a separate pipeline from the FP32 FMA units. </p><p>On A100, each SM sub-partition has one SFU. That SFU can issue one MUFU instruction every 4 cycles, while the FP32 pipe can issue four <strong>FMAs </strong>per cycle. </p><p>A kernel that mixes heavy transcendental usage with FP32 arithmetic will hit SFU throughput long before it approaches <strong>FP32 throughput.</strong> This matters for ML activation functions (<code>tanhf</code>, <code>expf</code> in softmax) and any scientific kernel using trigonometry.</p><p>Now the fundamental insight: <strong>FFMA latency</strong> is 4 cycles, but FFMA throughput is 1 instruction per 0.25 cycles (4 per cycle). Four independent FP32 FMAs can enter the pipeline simultaneously every cycle. </p><p>If no dependency links them, all four execute in parallel, and the pipeline issues one per cycle. If every instruction reads the result of the previous one, the pipeline must wait 4 cycles between each issue. The <strong>throughput ceiling </strong>goes unrealized.</p><p>This is not hypothetical. 
It is the dominant performance regime for naive accumulation loops.</p><div><hr></div><h2>The SMSP: the real unit of execution</h2><p>A critical detail that the previous part&#8217;s SM diagram abstracts away: on Ampere (and Volta and Turing), the SM is not monolithic. It is divided into four <strong>SM sub-partitions</strong>, each designated <code>SMSP</code> in the <strong>Nsight Compute </strong>metric namespace.</p><p>Each <strong>SMSP </strong>contains:</p><ul><li><p>One warp scheduler (one instruction issued per cycle)</p></li><li><p>One dispatch unit</p></li><li><p>An L0 instruction cache (private to the SMSP)</p></li><li><p>A 16K&#215;32-bit register file (64 KB, 16,384 32-bit registers per SMSP; four SMSPs yield 65,536 total per SM)</p></li><li><p>16 FP32 CUDA cores (on GA100, 64 per SM; gaming Ampere GA10x has 32 per SMSP, half of them shared with INT32)</p></li><li><p>16 INT32 cores</p></li><li><p>8 FP64 cores</p></li><li><p>1 third-generation Tensor Core</p></li><li><p>8 Load/Store Units</p></li></ul><p>The SMSP, not the <strong>SM</strong>, is the scheduling atom. A warp is assigned to a specific <strong>SMSP </strong>at launch and remains there for its entire lifetime. It does not migrate between SMSPs. </p><p>All the warp stall metrics in <strong>Nsight Compute</strong> that carry the <code>smsp__</code> prefix are per-SMSP counters; <code>sm__</code> metrics aggregate across all four. This subdivision matters for two reasons.</p><p>First, it explains the per-SMSP warp pool size. On <strong>Ampere</strong> (A100), each SMSP can host up to 16 warps. Four SMSPs yield the <strong>64-warp-per-SM </strong>maximum. The scheduler operates per SMSP, picking one eligible warp per cycle from its local pool of 16. </p><p>The practical ceiling for latency hiding is thus determined per SMSP, not per SM. 
A kernel with <strong>32 resident warps </strong>per SM has 8 per SMSP, which is often sufficient for latency hiding when HBM latency is the bottleneck.</p><p>Second, it exposes the register file partitioning. The 65,536 registers per SM are physically distributed across <strong>four SMSPs</strong>, 16,384 each. </p><p>When you compute that 32 registers per thread allows 2,048 threads per SM at full occupancy, those <strong>2,048 threads</strong> are physically spread 512 per SMSP, each holding 32 registers, fully consuming the 16,384 available. </p><p>The constraint is per-SMSP, and violating it in one SMSP limits the entire SM.</p><p>The Nsight Compute metric <code>smsp__warps_active.avg.pct_of_peak_sustained_active</code> reports active warp fraction per SMSP averaged over time. </p><p>It is more informative than <code>sm__warps_active</code> for diagnosing <strong>occupancy limits</strong> because it reflects the actual scheduling capacity of the unit that does the scheduling.</p><div><hr></div><h2>The hardware that enforces the dependency graph</h2><p>Before a warp can issue its next instruction, the<strong> warp scheduler </strong>must verify that all source operands for that instruction are available. </p><p>The hardware mechanism for this is the <strong>scoreboard</strong>, a per-SMSP tracking structure that records which registers have outstanding writes from in-flight instructions.</p><p>Every issued instruction that writes to a register marks that register as &#8220;<em>pending</em>&#8221; in the scoreboard. When the instruction completes and the result is written to the register file, the mark is cleared. 
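</p><p>The pending-mark bookkeeping can be sketched as a toy timing model in which an instruction that reads a pending register waits until the mark clears. This is a minimal sketch under stated assumptions, not a description of the real scheduler: <code>issue_cycles</code> and its parameters are hypothetical names, the model follows a single warp with at most one issue per cycle, and every result stays pending for a fixed <code>latency</code>.</p><pre><code><code>// Hypothetical toy model in plain C++ (also valid as host code under nvcc).
// Returns the number of cycles one warp needs to issue n accumulate
// instructions spread round-robin over `chains` independent registers.
// Assumption: chains is between 1 and 32.
int issue_cycles(int n, int chains, int latency) {
    int ready[32] = {0};  // ready[c]: first cycle at which register c is no longer pending
    int slot = 0;         // next free issue slot (at most one issue per cycle)
    for (int k = 0; k != n; ++k) {
        int c = k % chains;
        int t = (ready[c] > slot) ? ready[c] : slot;  // wait while the operand is pending
        ready[c] = t + latency;  // the new write marks the register pending again
        slot = t + 1;
    }
    return slot;  // cycles consumed issuing all n instructions
}</code></code></pre><p>Under these assumptions, <code>issue_cycles(16, 1, 4)</code> evaluates to 61 while <code>issue_cycles(16, 4, 4)</code> evaluates to 16: a register that is read again before its mark clears costs the full latency on every instruction, and spreading the work over independent registers removes that wait.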
</p><p>If the <strong>scheduler </strong>selects a warp to issue and the warp&#8217;s next instruction reads a register that is still marked pending, the warp is stalled. It is not eligible. The scheduler moves to the next warp.</p><p><strong>Nsight Compute</strong> distinguishes two scoreboard domains based on the source of the pending result:</p><p><strong>Short scoreboard</strong> (Nsight metric: <code>smsp__pcsamp_warps_issue_stalled_short_scoreboard</code>): tracks instructions with latency short enough that the hardware uses a fixed countdown timer rather than a completion signal. </p><p>This covers: FP32/INT32/FP16 arithmetic from the<strong> FMA pipe</strong>, shared memory loads (23-cycle latency), <strong>SFU/MUFU </strong>results (16 cycles), indexed constant loads, and warp-level vote instructions. </p><p>The hardware knows the exact cycle at which the result will be ready and unblocks the scoreboard entry on that cycle.</p><p>One caveat when reading the profiler: Nsight Compute&#8217;s own documentation describes the short scoreboard in terms of MIO operations (shared memory, special-function math, dynamic branches) and reports waits on fixed-latency arithmetic under a separate reason, <code>stall_wait</code>. On a dependency-bound FP32 loop, check both counters.</p><p><strong>Long scoreboard</strong> (Nsight metric: <code>smsp__pcsamp_warps_issue_stalled_long_scoreboard</code>): tracks instructions whose completion time is not fixed in advance. </p><p>This covers all loads from global memory, whether they hit in L1 (~33 cycles), hit in L2 (~200 cycles), or go to HBM (~290&#8211;566 cycles), and anything else that crosses the L1TEX pipeline. </p><p>The hardware cannot predict when the data will arrive; it waits for an explicit writeback signal from the memory subsystem.</p><p>This split has a crucial diagnostic implication. High <code>stall_long_scoreboard</code> means threads are waiting on global-memory loads, in the worst case all the way from <strong>DRAM</strong>. The fix is occupancy (more warps to swap in while waiting), prefetching, or restructured data layout. </p><p>High <code>stall_short_scoreboard</code> means threads are waiting on arithmetic results or shared memory: a dependency bottleneck in the instruction stream itself. 
The fix is instruction-level parallelism, not occupancy.</p><p>A third stall class: <strong>MIO throttle</strong> (<code>stall_mio_throttle</code>). This appears when the input<strong> FIFO</strong> to the Memory I/O pipeline is full: too many outstanding memory requests are already in flight and new ones cannot be accepted. </p><p>It is distinct from <code>stall_long_scoreboard</code>: a long-scoreboard stall means a warp is waiting for a specific result, whereas MIO throttle means a warp cannot even submit a new request yet. </p><p>MIO throttle is the signature of heavy but poorly coalesced global memory access, where many independent <strong>32-transaction </strong>memory operations are flooding the queue simultaneously. (Nsight Compute reports the same queue-full condition for local and global memory instructions under a separate reason, <code>stall_lg_throttle</code>, so check both.)</p><p>And the fourth: <strong>math pipe throttle</strong> (<code>stall_math_pipe_throttle</code>). This appears when a warp is ready to issue an FP32 (or FP64, or tensor) instruction but the execution pipeline is already occupied by instructions from other warps. </p><p>This is the good stall: it means arithmetic throughput, not memory or dependencies, is the actual ceiling. On a well-tuned <strong>compute-bound kernel</strong>, <code>stall_math_pipe_throttle</code> should dominate the stall breakdown.</p><p>The differential diagnosis in Nsight Compute&#8217;s Warp State Statistics section is the most powerful <strong>single analytical tool</strong> available after the roofline. The dominant stall reason maps directly onto the class of optimization required:</p><ul><li><p><code>stall_long_scoreboard</code> dominant &#8594; memory latency. Fix: more warps, better tiling, async prefetch.</p></li><li><p><code>stall_short_scoreboard</code> dominant &#8594; arithmetic dependency chain. Fix: ILP, loop unrolling, independent accumulators.</p></li><li><p><code>stall_mio_throttle</code> dominant &#8594; memory request queue saturation. 
Fix: coalescing, vectorized loads, reduced memory instruction count.</p></li><li><p><code>stall_math_pipe_throttle</code> dominant &#8594; compute-bound. Fix: tensor cores if using scalar FP16/FP32, mixed-precision, or just accept that you&#8217;ve reached peak.</p></li><li><p><code>stall_barrier</code> dominant &#8594; synchronization structure. Fix: reduced barrier scope, better work distribution, cooperative groups.</p></li><li><p><code>stall_not_selected</code> dominant &#8594; too many eligible warps, not enough issue bandwidth. Fine above ~20%; unusually high means you&#8217;re register-rich with very high ILP and can potentially increase complexity per thread.</p></li></ul><div><hr></div><h2>The dependency chain</h2><p>Consider a canonical example: a dot product over a fixed array, unrolled.</p><pre><code><code>float acc = 0.0f;
for (int i = 0; i &lt; N; i++) {
    acc += a[i] * b[i];
}</code></code></pre><p>After loading all <code>a[i]</code> and <code>b[i]</code> into registers (ignoring the loads themselves for now), the inner loop body compiles to a sequence of <strong>FFMA instructions</strong>, each writing to <code>acc</code> and reading the result of the previous one:</p><pre><code><code>FFMA R4, R6, R8, R4    // acc += a[0]*b[0]; R4 depends on R4
FFMA R4, R10, R12, R4  // acc += a[1]*b[1]; R4 depends on R4 from prior
FFMA R4, R14, R16, R4  // acc += a[2]*b[2]; R4 depends on R4 from prior
...</code></code></pre><p>Each FFMA writes to <code>R4</code> and reads from <code>R4</code>. The short scoreboard marks <code>R4</code> as pending for 4 cycles after each issue. The next instruction reads <code>R4</code>, finds it pending, and stalls. The effective issue rate is <strong>1 FFMA per 4 cycles</strong>.</p><p>The FP32 pipeline on an <strong>A100 SMSP</strong> can issue 4 FMAs per cycle when fed independent work. This naive accumulation uses 1/16 of that capacity.</p><p>The fix is <strong>independent accumulators</strong>, which break the serial chain into parallel chains that the hardware can interleave:</p><pre><code><code>float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
for (int i = 0; i &lt; N; i += 4) {
    acc0 += a[i+0] * b[i+0];
    acc1 += a[i+1] * b[i+1];
    acc2 += a[i+2] * b[i+2];
    acc3 += a[i+3] * b[i+3];
}
float acc = (acc0 + acc1) + (acc2 + acc3);
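
// Sketch, not from the original: the unrolled loop above assumes N is a
// multiple of 4. dot4 is a hypothetical helper that keeps the four independent
// chains and adds a scalar tail; mark it __device__ to call it from a kernel.
float dot4(const float* a, const float* b, int N) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; N - i > 3; i += 4) {      // main loop: four independent FFMA chains
        s0 += a[i+0] * b[i+0];
        s1 += a[i+1] * b[i+1];
        s2 += a[i+2] * b[i+2];
        s3 += a[i+3] * b[i+3];
    }
    for (; N > i; i++) {             // tail: at most three leftover elements
        s0 += a[i] * b[i];
    }
    return (s0 + s1) + (s2 + s3);    // same pairwise final reduction as above
}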
</code></code></pre><p>The <strong>SASS </strong>now has four independent dependency chains. After <code>FFMA R4, ..., R4</code>, the scheduler can immediately issue <code>FFMA R5, ..., R5</code>, <code>FFMA R6, ..., R6</code>, and <code>FFMA R7, ..., R7</code>.</p><p>When the first <strong>FFMA&#8217;s 4-cycle latency </strong>expires and R4 is ready, the next <code>FFMA R4</code> instruction can issue. The pipeline runs at near peak throughput.</p><p>This is not a micro-optimization on the margin. On compute-bound kernels with long accumulation chains (reductions, dot products, <strong>small GEMMs</strong> written without tensor cores) the difference between serial and independent-accumulator form is frequently <strong>4&#8211;8&#215; in throughput</strong>.</p><p>The compiler performs this unrolling automatically in many cases when <code>#pragma unroll</code> is used or when the loop trip count is known at compile time and is small. </p><p>It does <strong>not</strong> reliably unroll across complex loop bodies, through function call boundaries, or when accumulators are accessed via pointers (which the compiler may assume alias).</p><p>To verify: inspect the SASS. If you see <code>FFMA R4, ..., R4</code> repeating with the same destination register, the dependency chain is <strong>serialized</strong>. If you see <code>FFMA R4</code>, <code>FFMA R5</code>, <code>FFMA R6</code>, <code>FFMA R7</code> cycling, the compiler found the independent accumulator form. </p><p>The profiler can confirm via <code>smsp__pcsamp_warps_issue_stalled_short_scoreboard</code> &#8212; if this metric is non-trivial on a compute-bound kernel, the<strong> dependency graph</strong> is restricting throughput.</p><div><hr></div><h2>Register bank conflicts</h2><p>The register file on each SMSP is a four-banked SRAM, where each bank is 4 bytes wide. An<strong> FFMA instruction</strong> reads up to three source registers (two multiplicands and an addend) and one destination. 
</p><p>If two or more source registers of a<strong> single instruction map</strong> to the same bank, the reads serialize. Each register at index <code>R</code> maps to bank <code>R % 4</code> (on Ampere; earlier architectures used different moduli and widths).</p><p>For an FFMA with sources R4, R8, R12: banks are 0, 0, 0. All three reads conflict. The register file must issue three <strong>sequential read operations </strong>to bank 0, adding latency to the instruction even if no data dependency exists.</p><p>For an FFMA with sources R4, R5, R6: banks are 0, 1, 2. No conflict. All three reads issue in parallel.</p><p>This has a direct consequence for accumulation loop unrolling. Consider four accumulators in R4, R5, R6, R7. If the multiplicands for each land in:</p><pre><code><code>FFMA R4, R8,  R12, R4   // banks: 0, 0, 0, 0 &#8594; conflict on R8, R12, R4
FFMA R5, R9,  R13, R5   // banks: 1, 1, 1, 1 &#8594; conflict
</code></code></pre><p>All three sources of each <strong>FFMA</strong> map to the same bank (for the first instruction, 8%4=0, 12%4=0, 4%4=0). Every instruction stalls internally. The throughput benefit of ILP is partially eaten by register bank conflicts.</p><p>The compiler&#8217;s <strong>register allocator</strong> is aware of this and attempts to assign registers to avoid conflicts in hot instruction sequences. But it operates under <strong>register pressure constraints</strong> and cannot always do so.</p><p>Manual register assignment for <strong>peak performance</strong> (as in hand-tuned sgemm kernels) requires explicit attention to bank distribution.</p><p>To check: in SASS output from <strong>Nsight Compute&#8217;s Source view</strong>, bank conflict indicators appear per instruction when the hardware counter <code>smsp__sass_data_bank_conflicts_pipe_fma_cycles_active</code> is nonzero.</p><p>Values above 5% on a compute-bound kernel suggest register allocation is constraining pipeline throughput.</p><div><hr></div><h2>Control divergence</h2><p>Divergence is well understood at the conceptual level but frequently mis-modeled quantitatively.</p><p>The SIMT execution model on post-Pascal NVIDIA GPUs (Volta and later) uses <strong>independent thread scheduling</strong>: each thread has its own program counter and call stack, and the hardware can reconverge threads that diverged. 
</p><p>This replaced the <strong>pre-Volta model </strong>where threads were locked to warp-level lockstep with explicit SIMD masking.</p><p>What<strong> independent thread scheduling </strong>provides:</p><ul><li><p><em>Threads can diverge without being permanently trapped in separate execution paths until an explicit reconvergence point.</em></p></li><li><p><em>The scheduler can interleave instructions from different sub-groups within a warp to improve utilization.</em></p></li></ul><p>What it <strong>does not provide:</strong></p><ul><li><p><em>SIMT execution is still 32-wide. When a warp diverges, the hardware executes the taken path and the not-taken path <strong>serially</strong>, masking inactive threads. Independent thread scheduling changes when reconvergence can happen, not whether both paths must execute.</em></p></li></ul><p>The true cost of a divergent conditional is not &#8220;<em>50% efficiency if threads split 50/50.</em>&#8221; It is the <strong>sum of execution time </strong>for all distinct paths through the conditional, not the maximum. </p><p>If half the threads take path A (<em>10 instructions</em>) and half take path B (<em>20 instructions</em>), total warp execution time is <strong>~30 instruction-cycles</strong>, not ~20.</p><p>For nested conditionals, the cost compounds. A kernel with a two-level nested conditional where each level has<strong> 50/50 divergence</strong> may see a 4&#215; slowdown compared to fully converged execution.</p><p>The <strong>SASS opcode</strong> <code>BRA</code> (branch) is preceded by a predicate evaluation. All threads evaluate the predicate; the warp then issues along the taken path with non-predicated threads masked. 
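</p><p>The serialization arithmetic above can be written down as a small host-side model. This is a minimal sketch under stated assumptions, not a measurement: the helper name <code>divergent_warp_cycles</code> is hypothetical, each instruction is assumed to cost one issue cycle, and branch and reconvergence overhead are ignored.</p><pre><code><code>// Host-side C++ model of SIMT divergence cost for one warp (illustration only).
// Assumption: every instruction costs one issue cycle; branch and
// reconvergence overhead are ignored.

// Hypothetical helper, not part of any CUDA API.
// threads_on_a: number of the warp's 32 threads that take path A.
// len_a, len_b: instruction counts of path A and path B.
// A path costs its full length as soon as one thread takes it,
// so the warp pays the sum of the taken paths, not the maximum.
constexpr int divergent_warp_cycles(int threads_on_a, int len_a, int len_b) {
    return (threads_on_a != 0 ? len_a : 0)    // path A runs if any thread takes it
         + (threads_on_a != 32 ? len_b : 0);  // path B runs if any thread skips A
}

// The example from the text: a 16/16 split over 10 and 20 instructions.
static_assert(divergent_warp_cycles(16, 10, 20) == 30, "both paths serialize");
// One straggler on path A costs as much as an even split.
static_assert(divergent_warp_cycles(1, 10, 20) == 30, "one thread is enough");
// Converged warps pay for a single path.
static_assert(divergent_warp_cycles(32, 10, 20) == 10, "all threads on A");
static_assert(divergent_warp_cycles(0, 10, 20) == 20, "all threads on B");
</code></code></pre><p>The <code>static_assert</code> lines check the numbers at compile time with any C++11 compiler, or as host code under <code>nvcc</code>. In this model a single straggler thread costs the warp as much as an even split, which is why the thread ratio alone says little about the penalty.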
</p><p>On Volta and later, the reconvergence point is encoded in <code>BSSY</code> and <code>BSYNC</code> instructions in the SASS, inserted by the compiler (pre-Volta SASS used <code>SSY</code> and <code>SYNC</code>).</p><p>The diagnostic metric is <code>smsp__sass_thread_inst_executed_op_control.sum</code> relative to <strong>total instructions</strong>: high control overhead relative to arithmetic instructions indicates either heavy divergence or loop overhead.</p><p>Nsight Compute&#8217;s <strong>Source Counters section</strong> shows per-instruction thread execution counts; instructions on either side of a divergent branch show fewer thread executions than instructions outside the branch.</p><p>The <em>practical implication</em>: for kernels with data-dependent branching (parsing, tree traversal, sparse format processing), minimizing the number of <strong>distinct per-warp execution paths</strong> matters more than minimizing the total number of conditionals.</p><p>Count paths, not branches: a single conditional that splits a warp 32 ways serializes 32 paths, while eight independent 2-way conditionals executed one after another serialize 16.</p><div><hr></div><h2>Conclusion</h2><p>After <strong>roofline analysis</strong>, memory coalescing, and careful tiling, most GPU kernels are no longer limited by bandwidth: they are limited by the instruction pipeline itself. 
</p><p>Understanding the pipeline means looking <strong>beyond PTX </strong>and into SASS: the real instruction stream executed by the hardware, with its latencies, throughput ceilings, and resource partitions.</p><p><strong>Instruction-level bottlenecks</strong> (dependency chains, register bank conflicts and control divergence) often dominate performance even when memory stalls are minimal. </p><p>Metrics like short and long scoreboard stalls, math pipe utilization, and <strong>SMSP-level warp activity</strong> provide the only reliable window into these hidden limits.</p><p>The practical lesson is simple but profound: to push a <strong>compute-bound kernel</strong> to its theoretical peak, you must treat the instruction pipeline as first-class terrain. </p><p>Unroll loops into independent accumulators, balance register allocation to avoid bank conflicts, <strong>minimize divergent paths</strong>, and account for specialized units like the SFU. </p><p>Only by reasoning at the level of instruction flow, rather than just data movement, can you approach the true limits of GPU performance.</p>]]></content:encoded></item><item><title><![CDATA[Mastering CUDA and High-Performance Computing, Part IV]]></title><description><![CDATA[A Deep Dive from Compiler Internals to High-Performance Parallel Computing]]></description><link>https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-de3</link><guid 
isPermaLink="false">https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-de3</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Wed, 11 Mar 2026 12:03:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xC2g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863a440a-6e80-4066-a12a-4d34fa8b45c4_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xC2g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863a440a-6e80-4066-a12a-4d34fa8b45c4_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xC2g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863a440a-6e80-4066-a12a-4d34fa8b45c4_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!xC2g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863a440a-6e80-4066-a12a-4d34fa8b45c4_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!xC2g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863a440a-6e80-4066-a12a-4d34fa8b45c4_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!xC2g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863a440a-6e80-4066-a12a-4d34fa8b45c4_1024x1536.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!xC2g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863a440a-6e80-4066-a12a-4d34fa8b45c4_1024x1536.png" width="1024" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/863a440a-6e80-4066-a12a-4d34fa8b45c4_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3646556,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/190600683?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863a440a-6e80-4066-a12a-4d34fa8b45c4_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xC2g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863a440a-6e80-4066-a12a-4d34fa8b45c4_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!xC2g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863a440a-6e80-4066-a12a-4d34fa8b45c4_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!xC2g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863a440a-6e80-4066-a12a-4d34fa8b45c4_1024x1536.png 1272w, 
https://substackcdn.com/image/fetch/$s_!xC2g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863a440a-6e80-4066-a12a-4d34fa8b45c4_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>The moment arithmetic stops mattering</h2><p>Every <strong>CUDA program</strong> eventually reaches the same reckoning.</p><p>You&#8217;ve written the kernel. You&#8217;ve structured the grid. The launch configuration looks reasonable. The <strong>arithmetic is correct</strong>. 
You submit it, measure wall time, and feel like a competent engineer.</p><p>Then you open a profiler.</p><p><strong>Nsight Compute </strong>loads the metrics. You look at the first column: SM utilization. You expect something respectable: 70%, 80%, maybe more. What you see instead is a number that doesn&#8217;t make sense.</p><p>12%. Sometimes a little lower.</p><p>The tensor cores are almost entirely idle. The floating-point pipelines are stalled. A chip whose transistor count would have represented the entire <strong>global semiconductor output </strong>of the mid-1990s is sitting mostly quiet, doing almost nothing, waiting.</p><p>Waiting for data.</p><p>This is the first real lesson of GPU programming, and it is one of the most counterintuitive in all of systems engineering: <strong>modern GPUs are not, in practice, compute-limited. They are memory-limited.</strong> The bottleneck isn&#8217;t the arithmetic: it&#8217;s the infrastructure that feeds the arithmetic. </p><p>And that infrastructure is governed <strong>not by logic design</strong>, not by clock frequency, but by the literal physics of moving electric signals through metal on a substrate the width of a few dozen silicon atoms.</p><p>The GPU was designed to compute at enormous scale. But before it can compute anything, it needs data. And moving that data, from <strong>DRAM</strong>, through cache hierarchies, across the chip, into registers, costs time. 
Often far more time than the computation itself.</p><p>Understanding why, and what to do about it, is the <strong>entire subject</strong> of this article.</p><div><hr></div><h2>The physics of moving bits through silicon</h2><p>Let&#8217;s start with something that almost never appears in <strong>CUDA tutorials</strong> but underpins everything else: the actual physical constraints on data movement inside a chip.</p><p>An electric signal traveling through a metal interconnect on silicon propagates at <strong>roughly two-thirds</strong> the speed of light in vacuum. That sounds fast. It is fast. But it&#8217;s not the bottleneck.</p><p>The bottleneck is capacitance.</p><p>Every wire connecting two points on a chip behaves as a <strong>tiny capacitor</strong>. Moving a signal through that wire requires charging or discharging that capacitance. That costs energy, and more importantly for our purposes, it takes time. </p><p>The longer and wider the wire, the greater the capacitance, the slower the signal edges, the more time the<strong> receiving circuit</strong> spends waiting for the voltage to settle.</p><p>At <strong>1 GHz</strong>, one clock cycle is 1 nanosecond. Light travels about 30 cm in that time, which is longer than most GPU dies. Signal propagation across the chip isn&#8217;t the problem. </p><p>The problem is the full pipeline required to actually deliver data: address generation, routing through the memory subsystem, cache tag lookup, <strong>DRAM row activation</strong>, column access, data return bus, write to register file. Each of these steps adds cycles. Latencies compound.</p><p>When you add them all up, reaching off-chip DRAM from an execution unit costs <strong>between 400 and 800 clock cycles</strong> on a modern GPU.</p><p>That number is worth sitting with.</p><p>At 1.5 GHz, <strong>600 cycles is 400 nanoseconds</strong>. In that same window, a modern NVIDIA SM could theoretically issue hundreds of independent arithmetic instructions. 
</p><p>Instead, if it&#8217;s waiting for DRAM, it issues zero. The arithmetic units sit idle. The watt-hours tick over. Nothing useful happens.</p><p>This gap (<em>the chasm between compute speed and memory latency</em>) is the central engineering problem of <strong>GPU design</strong>. Everything else in GPU architecture exists to paper over this gap. </p><p>The warp scheduler, the cache hierarchy, shared memory, asynchronous pipelines, tensor cores: all of it is infrastructure built around one uncomfortable physical fact: <strong>moving data costs time, and that time is long</strong>.</p><p>DRAM hasn&#8217;t gotten meaningfully faster in latency terms for decades. It has gotten wider, more parallel, higher bandwidth. </p><p>But the fundamental latency of a DRAM access (row activation, <strong>column select</strong>, sense amplifiers settling) is governed by the same physics it always was. </p><p><strong>HBM2e</strong> has extraordinary bandwidth. Its latency is still measured in hundreds of nanoseconds.</p><p>You cannot optimize your way out of physics.</p><div><hr></div><h2>Inside the Streaming Multiprocessor</h2><p>Modern GPUs solve this problem through a specific architectural pattern: massive parallelism layered over a <strong>deep memory hierarchy</strong>, orchestrated by a scheduling engine designed to hide latency rather than eliminate it. </p><p>To understand how, we need to look at the machine&#8217;s basic unit of execution. The <strong>SM </strong>is, in fact, the atom of GPU execution. Every thread you launch lands in one, executes inside one, and exits from one. 
</p><p>Everything else (<em>grid geometry, block dimensions, global memory layout</em>) is scaffolding around the SM. </p><p>Understanding what&#8217;s <strong>physically inside</strong> is the prerequisite for reasoning about performance. Here is a schematic of an Ampere SM:</p><pre><code><code>&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474;                     Streaming Multiprocessor (SM)                &#9474;
&#9474;                                                                  &#9474;
&#9474;   &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;  &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;  &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;  &#9484;&#9472;&#9472;&#9472;&#9472;&#9488;  &#9474;
&#9474;   &#9474; Warp Sched 0 &#9474;  &#9474; Warp Sched 1 &#9474;  &#9474; Warp Sched 2 &#9474;  &#9474; W3 &#9474;  &#9474;
&#9474;   &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;  &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;  &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;  &#9492;&#9472;&#9516;&#9472;&#9472;&#9496;  &#9474;
&#9474;          &#9474;                 &#9474;                 &#9474;             &#9474;      &#9474;
&#9474;   &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9660;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9660;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9660;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9660;&#9472;&#9472;&#9488;  &#9474;
&#9474;   &#9474;                   Dispatch / Issue Logic                    &#9474;  &#9474;
&#9474;   &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;  &#9474;
&#9474;                                                                  &#9474;
&#9474;   &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;  &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;  &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488; &#9474;
&#9474;   &#9474;  FP32 Cores     &#9474;  &#9474;  INT32 Units    &#9474;  &#9474;  Tensor Cores    &#9474; &#9474;
&#9474;   &#9474;  (128 per SM)   &#9474;  &#9474;  (128 per SM)   &#9474;  &#9474;  (4 per SM)      &#9474; &#9474;
&#9474;   &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;  &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;  &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496; &#9474;
&#9474;                                                                  &#9474;
&#9474;   &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;  &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;                       &#9474;
&#9474;   &#9474;  FP64 Cores     &#9474;  &#9474;  Special Func.  &#9474;                       &#9474;
&#9474;   &#9474;  (64 per SM)    &#9474;  &#9474;  Units (SFU)    &#9474;                       &#9474;
&#9474;   &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;  &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;                       &#9474;
&#9474;                                                                  &#9474;
&#9474;   &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;   &#9474;
&#9474;   &#9474;              Load / Store Units (32 LSU)                  &#9474;   &#9474;
&#9474;   &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;   &#9474;
&#9474;                                                                  &#9474;
&#9474;   &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;   &#9474;
&#9474;   &#9474;              Register File (~256 KB)                      &#9474;   &#9474;
&#9474;   &#9474;              (65,536 &#215; 32-bit registers)                  &#9474;   &#9474;
&#9474;   &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;   &#9474;
&#9474;                                                                  &#9474;
&#9474;   &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;   &#9474;
&#9474;   &#9474;       L1 Cache / Shared Memory (192 KB unified)          &#9474;   &#9474;
&#9474;   &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;   &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
</code></code></pre><p>The arithmetic units (<em>FP32 cores, INT32 units, tensor cores</em>) get all the attention in marketing materials. They are not where performance is determined.</p><p>Performance is determined by the four warp schedulers at the top.</p><h3>The warp scheduler&#8217;s decision tree</h3><p>Every cycle, each of the four warp schedulers examines its pool of eligible warps and must select one to issue. A warp is eligible if it satisfies four conditions simultaneously:</p><ol><li><p><em>It has a valid next instruction</em></p></li><li><p><em>All source operands for that instruction are available (no outstanding data dependency)</em></p></li><li><p><em>The required execution unit is not occupied by another instruction</em></p></li><li><p><em>No synchronization barrier is blocking it</em></p></li></ol><p>If a warp fails any condition, it is <strong>ineligible</strong>. The scheduler ignores it and checks the next one. The scheduler doesn&#8217;t understand priority, criticality, or deadlines. It is a simple priority-free selector: find the first <strong>eligible warp</strong>, issue it, repeat.</p><p>If <strong>no</strong> warp is eligible (<em>all are waiting for memory, or all have unresolved register dependencies</em>) the SM stalls. Not one warp stalls. The entire SM stalls. All four schedulers sit idle. </p><p>The execution pipelines emit nothing. Every watt powering the chip produces zero useful work. This is the stall you see in the profiler as &#8220;<em>no eligible warp selected.</em>&#8221;</p><p>On Ampere, the <strong>theoretical maximum</strong> is 64 resident warps per SM, 4 schedulers each issuing 1 warp per cycle. In the best case, the SM issues 4 independent instructions simultaneously from 4 different warps. In the worst case, when <strong>all 64 warps</strong> are blocked on DRAM, it issues zero.</p><p>The entire architecture of <strong>GPU programming</strong> is a battle to keep this scheduler fed. 
Everything else (<em>tiles, pipelines, occupancy tuning, coalescing</em>) is in service of that one goal.</p><div><hr></div><h2>The memory hierarchy</h2><p>Each layer of the GPU memory hierarchy represents a different <strong>engineering tradeoff</strong> between latency, bandwidth, capacity, and programmer control.</p><p>Understanding each layer&#8217;s character (<em>not just its numbers</em>) determines whether you can make intelligent decisions about where data should live at any moment during execution.</p><h3>Registers: the fastest storage on the chip</h3><p>Registers live in the SM register file: on Ampere, a 256 KB SRAM array containing <strong>65,536 32-bit registers</strong> per SM.</p><p>Access latency is <strong>1 clock cycle.</strong> Bandwidth between the register file and execution units is effectively unlimited for the purposes of performance analysis.</p><p>Registers are not a cache. They do not hold data that might be needed. They hold data that is <em>actively being computed on</em>.</p><p>The compiler allocates registers <strong>deterministically</strong> at compile time. Every local variable, intermediate computation, and loop accumulator that doesn&#8217;t get spilled lives in the register file.</p><p>But registers introduce the most important constraint in <strong>CUDA performance engineering</strong>: their consumption directly limits occupancy.</p><p>The register file must be divided among all threads resident on the SM simultaneously. 
If each thread requires <strong>64 registers</strong>, the 65,536 available registers accommodate at most <strong>1,024 threads</strong>: 32 warps, half the Ampere maximum.</p><p>With 32 warps instead of 64, the scheduler has half as <strong>many options</strong> when a warp stalls. Latency hiding degrades.</p><p>This is why <code>--maxrregcount</code> exists in the CUDA compiler toolchain: it caps registers per thread so that more warps fit on the SM. If the cap is lower than what the kernel needs, the compiler spills to <strong>local memory</strong>, which is DRAM-backed and cached through L1/L2, with all the attendant latency.</p><p>The tradeoff is: more concurrent warps (<em>better latency hiding</em>) at the cost of some extra memory traffic (<em>more memory pressure</em>). Whether it helps depends entirely on the kernel&#8217;s specific balance of compute and memory.</p><p>The profiler metric to monitor is <code>sm__warps_active.avg.pct_of_peak_sustained_active</code>: achieved occupancy, the average number of active warps per active cycle as a percentage of the SM&#8217;s maximum.</p><p>Values below 50% on <strong>memory-bound kernels</strong> often indicate occupancy is limiting performance. Values above 75% suggest you&#8217;re probably fine.</p><h3>Shared memory: the programmable cache</h3><p>Shared memory is physically the same SRAM as <strong>L1 cache</strong>; on Ampere, a unified 192 KB block that software partitions between the two.</p><p>Typical configurations: 128 KB shared / 64 KB L1, or 64 KB shared / 128 KB L1, selectable per kernel.</p><p>The critical property that distinguishes shared memory from every other memory type on the GPU: <strong>it is explicitly managed by software.</strong> L1 is automatic. L2 is automatic. HBM is automatic.</p><p>Shared memory requires the programmer to decide what to load, when to load it, and when to <strong>synchronize threads</strong> after loading. 
The hardware does exactly what the programmer specifies and nothing more.</p><p>This explicitness is simultaneously the source of its power and its most common source of subtle bugs.</p><p>Latency is roughly <strong>20 clock cycles</strong>, an order of magnitude faster than L2, 20&#8211;40&#215; faster than DRAM. Intra-SM bandwidth reaches several terabytes per second. </p><p>For workloads with structured access patterns and significant data reuse within a thread block, shared memory is the <strong>most important performance</strong> tool available.</p><h4>Bank conflicts: the hidden serialization</h4><p>Shared memory is internally divided into 32 banks, each 4 bytes wide. In a single clock cycle, the hardware can service 32 simultaneous<strong> 4-byte accesses </strong>(<em>one per bank</em>) as long as no two accesses target the same bank. </p><p>This design provides very high bandwidth when accesses are distributed. The bank index for a 32-bit word at byte address <code>addr</code> is:</p><pre><code><code>bank = (addr / 4) % 32</code></code></pre><p>If two or more threads in the same warp access different addresses mapping to the same bank, those accesses serialize. This is a <strong>bank conflict</strong>. The hardware issues them sequentially, one per cycle, multiplying the effective latency by the conflict degree. </p><p>A <strong>32-way bank conflict</strong>, 32 threads hitting the same bank simultaneously, effectively transforms a 20-cycle operation into a <strong>640-cycle one</strong>. On shared memory.</p><p>The canonical example is column access in a row-major 2D shared memory array:</p><pre><code><code>__shared__ float tile[32][32];
float val = tile[threadIdx.x][threadIdx.y];  // accessing column threadIdx.y</code></code></pre><p>In a row-major layout, <code>tile[row][col]</code> is stored at byte offset <code>(row * 32 + col) * 4</code>. For a fixed column <code>c</code>, the elements <code>tile[0][c]</code>, <code>tile[1][c]</code>, <code>tile[2][c]</code>... are stored at offsets <code>c*4</code>, <code>(32+c)*4</code>, <code>(64+c)*4</code>... </p><p>The bank index for <code>tile[i][c]</code> is <code>(i*32 + c) % 32 = c</code>. Every row has the same bank for column <code>c</code>. </p><p>If 32 threads <strong>simultaneously </strong>access column <code>c</code>, all 32 accesses hit bank <code>c</code>. Complete serialization. 32&#215; slowdown. The fix is padding by one element:</p><pre><code><code>__shared__ float tile[32][33];  // one extra float per row</code></code></pre><p>Now <code>tile[i][j]</code> is at byte offset <code>(i*33 + j) * 4</code>, so its bank index is <code>(i*33 + j) % 32 = (i + j) % 32</code>. Because <code>33 % 32 = 1</code>, incrementing <code>i</code> by 1 advances the bank by 1. The 32 threads reading column <code>j</code> therefore touch 32 different banks. No conflicts.</p><p>One unused float per row, 128 bytes of wasted SRAM, eliminates a <strong>32&#215; serialization penalty.</strong> This is one of the cheapest performance wins in CUDA.</p><h3>L1 cache: automatic but controllable</h3><p>The portion of the unified SRAM not allocated to shared memory functions as an automatic L1 data cache for global memory accesses. Access latency is <strong>28&#8211;33 cycles</strong> on Ampere.</p><p>Unlike shared memory, L1 is not explicitly managed. The hardware decides what to cache based on access patterns. 
For streaming workloads with <strong>no temporal reuse</strong>, L1 caching is<strong> actively harmful</strong>; it evicts potentially useful data to cache data that will never be accessed again.</p><p><strong>CUDA</strong> provides tools to limit this damage. The <code>__ldg()</code> intrinsic issues a load through the read-only data cache path. On older architectures that was a separate texture cache, which kept such loads out of L1; on Volta and later, where L1 and the texture cache are unified, it mainly tells the hardware that the data is immutable for the lifetime of the kernel. </p><p>Cache-bypass load modifiers (<code>.cs</code> for streaming, <code>.cg</code> for L2-only) are available in PTX and exposed in CUDA C++ through the <code>__ldcs()</code> and <code>__ldcg()</code> intrinsics. Using them correctly on <strong>streaming data </strong>can meaningfully improve L1 hit rates for other kernel data.</p><p>L1 is most valuable for <strong>irregular access patterns</strong> with temporal reuse: hash table lookups, graph traversals, sparse matrix operations, embedding lookups. </p><p>For structured <strong>compute kernels </strong>with predictable access patterns, explicit shared memory almost always dominates.</p><h3>L2 cache: the shared reservoir</h3><p>L2 is a resource shared across all SMs: on the A100, a 40 MB unified L2 with approximately 4 TB/s aggregate read bandwidth. <strong>Access latency</strong> is roughly 200 cycles.</p><p>The bandwidth matters more than the latency for <strong>most workloads. </strong>If data fits in L2 and is accessed repeatedly across different thread blocks, L2 reuse can dramatically reduce <strong>DRAM traffic</strong> without any explicit shared memory management. </p><p>This is the main performance lever for workloads with inter-block reuse: embedding tables, <strong>small lookup matrices</strong>, bias vectors applied to many thread blocks.</p><p><strong>CUDA 11.0</strong> and Ampere introduced explicit L2 residency controls via <code>cudaStreamSetAttribute</code> with <code>cudaStreamAttributeAccessPolicyWindow</code>. 
</p><p>This allows developers to mark a specific memory region as <strong>high-priority</strong> for L2 retention: the hardware will attempt to keep it resident across thread block launches. </p><p>For embedding lookups or frequently-accessed <strong>read-only tables,</strong> this can reduce DRAM bandwidth consumption by as much as an order of magnitude.</p><h3>HBM: the distant reservoir</h3><p>High Bandwidth Memory sits physically separate from the <strong>SM die</strong>, stacked on the <strong>GPU package</strong> via silicon interposer. </p><p>Access requires leaving the SM die entirely, traversing the memory controller, and accessing <strong>DRAM cells</strong> across the interposer.</p><p>Current numbers:</p><ul><li><p>V100 HBM2, 32 GB, 900 GB/s </p></li><li><p>A100 HBM2e, 80 GB, 2.0 TB/s </p></li><li><p>H100 SXM HBM3, 80 GB, 3.35 TB/s </p></li><li><p>H100 NVL HBM3, 94 GB per GPU, 3.9 TB/s</p></li></ul><p>These bandwidth numbers are genuinely impressive. They are still not enough.</p><p>The A100&#8217;s peak <strong>FP16 tensor core</strong> throughput is 312 TFLOPS. With no reuse, a typical deep learning layer performs approximately 2 FLOPs per weight element loaded (one multiply, one accumulate). </p><p>To keep tensor cores saturated at that ratio, you&#8217;d need to deliver 156 trillion weights per second. Even at one byte per weight that is 156 TB/s, 78&#215; more than HBM2e can provide.</p><p>This isn&#8217;t a design failure. It&#8217;s physics. The solution is arithmetic intensity: load each weight once, perform <strong>many operations</strong> on it before it is evicted from registers or shared memory. 
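</p><p>The arithmetic behind that gap can be sketched at compile time. This is a minimal illustration, not a measurement: the constants are the A100 figures quoted above, and <code>required_bandwidth</code> and <code>bandwidth_gap</code> are hypothetical helper names introduced here. It also spells out one assumption: the 78&#215; figure corresponds to one byte per weight, and two-byte FP16 weights double it. Compiling the snippet is enough to check every assertion.</p><pre><code><code>// Hypothetical helpers; constants are the A100 figures quoted in the text.
constexpr double kPeakFlops    = 312e12;  // FP16 tensor core peak, FLOP/s
constexpr double kHbmBandwidth = 2e12;    // HBM2e bandwidth, bytes/s

// Bytes per second that must arrive to keep the tensor cores busy when every
// loaded element (bytes_per_element wide) feeds flops_per_element operations.
constexpr double required_bandwidth(double flops_per_element, double bytes_per_element) {
    return kPeakFlops / flops_per_element * bytes_per_element;
}

// How many times more bandwidth is required than HBM can deliver.
constexpr double bandwidth_gap(double flops_per_element, double bytes_per_element) {
    return required_bandwidth(flops_per_element, bytes_per_element) / kHbmBandwidth;
}

// No reuse, one byte per weight: the 156 TB/s and 78x quoted in the text.
static_assert(required_bandwidth(2.0, 1.0) == 156e12, "156 TB/s");
static_assert(bandwidth_gap(2.0, 1.0) == 78.0, "78x gap");
// No reuse, two-byte FP16 weights: the gap doubles.
static_assert(bandwidth_gap(2.0, 2.0) == 156.0, "156x gap");
// Reusing each FP16 weight for 512 FLOPs brings the requirement under HBM peak.
static_assert(1.0 > bandwidth_gap(512.0, 2.0), "fits within HBM bandwidth");</code></code></pre><p>The last assertion is the point of the exercise: nothing about the hardware changed, only the number of operations performed per loaded element. 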
</p><p>The entire science of GPU kernel optimization is the science of achieving sufficient arithmetic intensity to bridge this <strong>78&#215; gap</strong>.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>Warps and coalescing</h2><p>Everything we&#8217;ve discussed about the memory hierarchy becomes concrete in the behavior of individual memory instructions.</p><h3>The anatomy of a memory transaction</h3><p>When a warp executes a load instruction:</p><pre><code><code>float x = data[idx];</code></code></pre><p>The hardware doesn&#8217;t see 32 independent loads. It sees 32 addresses simultaneously, one per thread, and must service them with as <strong>few memory transactions</strong> as possible. </p><p>The memory controller coalesces these<strong> 32 addresses</strong> into the minimum set of cache line requests that covers all of them.</p><p>DRAM is accessed in 128-byte cache lines on current NVIDIA GPUs. L2 sector granularity is 32 bytes. The hardware merges<strong> warp addresses</strong> into the minimum number of 128-byte requests covering all requested bytes.</p><h3>Coalesced access: ideal</h3><p>Thirty-two threads accessing consecutive floats:</p><pre><code><code>Thread  0 &#8594; data[base + 0]
Thread  1 &#8594; data[base + 1]
...
Thread 31 &#8594; data[base + 31]</code></code></pre><p>Total footprint: 128 bytes, exactly one cache line. One transaction. Every byte fetched is used. Memory efficiency: 100%. </p><p>This is exactly the access pattern the hardware was designed for.</p><h3>Strided access: degraded</h3><p>Stride-2 access:</p><pre><code><code>Thread  0 &#8594; data[base + 0]
Thread  1 &#8594; data[base + 2]
...
Thread 31 &#8594; data[base + 62]</code></code></pre><p>Address range: 252 bytes, spanning two 128-byte cache lines. Two transactions. 256 bytes transferred, 128 bytes used. Memory efficiency: 50%.</p><p>Stride 4: four cache lines, 25% efficiency. With stride 32, each thread&#8217;s access can fall in a different cache line, up to 32 transactions, 3% efficiency. The kernel is now burning 32&#215; the memory bandwidth for the same amount of useful data delivered.</p><h3>Random access</h3><p><strong>Scatter/gather</strong>, hash lookups, pointer chasing, patterns where 32 thread addresses bear no spatial relationship to each other. </p><p>The worst case: one transaction per thread. 32 transactions, each returning 128 bytes to deliver 4 bytes. </p><p>Bytes-transferred-to-bytes-used ratio: 32:1. The kernel consumes <strong>97% of its memory bandwidth </strong>fetching data it will immediately discard.</p><p>This doesn&#8217;t just hurt the kernel itself. It saturates the<strong> L2 </strong>and memory controllers, degrading bandwidth for every other kernel running concurrently on the chip.</p><p>The profiler metric to inspect is the ratio of <code>l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum</code> to <code>l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum</code>, sectors per request. </p><p>For 4-byte loads, perfectly coalesced: 4 (one 128-byte line fetched as four 32-byte sectors). Mild inefficiency: 5&#8211;8. Significant inefficiency: 8&#8211;16. &#8220;<em>Please reconsider your life choices</em>&#8221;: above 16, approaching the worst case of 32.</p><div><hr></div><h2>Occupancy: the arithmetic of latency hiding</h2><p>The GPU hides memory latency by switching between warps when one stalls. This only works if there are enough warps to switch to. If every warp is stalled waiting for DRAM, the scheduler has nothing to do. The <strong>SM stalls</strong>. Occupancy is the metric that captures this.</p><p><strong>Occupancy</strong> is the ratio of active warps on an SM to the architectural maximum. On Ampere: maximum 64 warps, 2,048 threads. At 50% occupancy: 32 warps. 
At 25%: 16 warps.</p><p>Occupancy is constrained by three limits, all of which must be satisfied simultaneously on the SM:</p><p><strong>1. Registers.</strong> If each thread uses <code>R</code> registers, maximum concurrent threads = <code>65,536 / R</code>. For full occupancy (2,048 threads), each thread can use at most 32 registers. Kernels routinely use 64&#8211;128. At 64 registers: 50% occupancy. At 128: 25%.</p><p><strong>2. Shared memory.</strong> On the A100, at most 164 KB of the 192 KB SRAM can be configured as shared memory. If a kernel uses <code>S</code> bytes of shared memory per block with <code>T</code> threads per block, maximum concurrent blocks = <code>floor(164 KB / S)</code>. Maximum warps = <code>floor(164 KB / S) &#215; (T / 32)</code>. A kernel using 64 KB per block with 256 threads per block: 2 concurrent blocks &#215; 8 warps = 16 warps. 25% occupancy.</p><p><strong>3. Block limits.</strong> The A100 supports at most 32 concurrent blocks per SM. A kernel with 32-thread blocks (1 warp per block) hits this limit at 32 warps, 50% occupancy, regardless of register or shared memory pressure.</p><h3>The occupancy-performance relationship</h3><p>High occupancy is not synonymous with high performance. A compute-bound kernel with very few memory instructions doesn&#8217;t need <strong>64 warps</strong> to keep the scheduler busy; 4 ready warps might be sufficient if they never stall on memory. </p><p><strong>Chasing occupancy</strong> for its own sake can force the compiler to spill registers to local memory, adding memory traffic that makes performance worse.</p><p>The right mental model: occupancy matters proportionally to <strong>memory latency exposure</strong>. The more frequently your kernel stalls on DRAM, the more warps you need to hide that latency. If your kernel rarely touches memory, occupancy barely matters.</p><p><code>cudaOccupancyMaxActiveBlocksPerMultiprocessor</code> gives you the theoretical maximum number of resident blocks for a given kernel and block size. 
The gap between that theoretical maximum and <strong>what you observe </strong>in the profiler (<code>sm__warps_active</code>) tells you how much of that capacity the kernel actually sustains at run time. </p><p>A kernel at 25% theoretical occupancy and 24% achieved occupancy is using everything its resource limits allow. A kernel at 75% theoretical occupancy but 20% achieved occupancy has a structural problem: too few blocks to fill the SMs, uneven work between blocks, or long tails in which most warps have already exited.</p><div><hr></div><h2>Putting hard numbers on constraints</h2><p>The roofline model is the most <strong>useful analytical tool</strong> in GPU performance engineering, and it is underused. </p><p>It doesn&#8217;t tell you <em>how</em> to optimize. It tells you <em>what</em> optimization is even possible, which is more valuable.</p><p>The central quantity is <strong>arithmetic intensity</strong>: floating-point operations performed per byte transferred from main memory.</p><pre><code><code>Arithmetic Intensity (I) = FLOPs / bytes_transferred</code></code></pre><p>Performance is bounded by the minimum of two constraints:</p><pre><code><code>Attainable Performance = min(Peak_FLOPs, Peak_Bandwidth &#215; I)</code></code></pre><p>For the A100 FP16:</p><ul><li><p>Peak tensor throughput: 312 TFLOPS</p></li><li><p>Peak HBM bandwidth: 2 TB/s</p></li></ul><p>The <strong>ridge point</strong> is the minimum arithmetic intensity required to be compute-bound rather than memory-bound:</p><pre><code><code>I_ridge = (312 &#215; 10&#185;&#178;) / (2 &#215; 10&#185;&#178;) = 156 FLOPs/byte</code></code></pre><p>To be compute-bound on the A100, every byte you load from HBM must be used for at least 156 floating-point operations. Most <strong>CUDA kernels </strong>don&#8217;t come close.</p><p>Consider a naive vector addition:</p><pre><code><code>FLOPs:        1 (one addition)
Bytes moved:  12 (two float32 reads + one write)
I =           1/12 &#8776; 0.08 FLOPs/byte</code></code></pre><p>This kernel sits roughly 1,872&#215; below the ridge point (156 &#215; 12). Not 10% below. Not 50% below. <strong>Nearly 2,000&#215; below</strong>. No amount of launch configuration tuning, thread count adjustment, or arithmetic reorganization will move the needle. </p><p>The kernel is physically limited by how fast you can move <strong>12 bytes per FLOP</strong> through the memory hierarchy. That&#8217;s a fundamental property of the algorithm.</p><p>Dense matrix multiplication is different. A square matrix multiply of dimension N performs <code>2N&#179;</code> FLOPs and, in the ideal case where each element crosses the memory bus once, moves <code>3N&#178;</code> elements (reading A and B, writing C). With 4-byte elements, arithmetic intensity grows as <code>2N&#179; / (3N&#178; &#215; 4 bytes) &#8776; N/6</code>. </p><p>For N=1,024: roughly 170 FLOPs/byte, above the A100 ridge point. <strong>Large matrix multiply</strong> on Ampere is compute-bound. This is why it saturates tensor cores; there&#8217;s enough arithmetic per byte to keep them fed.</p><p>The roofline tells you which side of the ridge your kernel sits on, and therefore which type of optimization is worth pursuing. <strong>Memory-bound kernels</strong> benefit from better coalescing, data reuse, and smaller datatypes. </p><p>Compute-bound kernels benefit from better instruction throughput, occupancy, and <strong>reduced arithmetic latency</strong>. 
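</p><p>The classification above can be written down as a small compile-time sketch. It assumes the A100 FP16 figures used throughout this section and the idealized intensities derived in the text; <code>ridge_point</code>, <code>is_memory_bound</code>, <code>attainable_flops</code>, and the two intensity helpers are hypothetical names introduced for illustration, not part of any CUDA API. Compiling the snippet is enough to check every assertion.</p><pre><code><code>// Roofline sketch. Hypothetical helper names; A100 FP16 figures from the text.
constexpr double kPeakFlops     = 312e12;  // FLOP/s
constexpr double kPeakBandwidth = 2e12;    // bytes/s

// Minimum arithmetic intensity (FLOPs/byte) needed to be compute-bound.
constexpr double ridge_point() { return kPeakFlops / kPeakBandwidth; }

// A kernel is memory-bound when its intensity sits below the ridge point.
constexpr bool is_memory_bound(double intensity) { return ridge_point() > intensity; }

// Attainable performance = min(Peak_FLOPs, Peak_Bandwidth * I).
constexpr double attainable_flops(double intensity) {
    return is_memory_bound(intensity) ? kPeakBandwidth * intensity : kPeakFlops;
}

// Idealized intensities from the text, assuming 4-byte elements.
constexpr double vector_add_intensity() { return 1.0 / 12.0; }  // 1 FLOP per 12 bytes
constexpr double matmul_intensity(double n) {
    return (2.0 * n * n * n) / (3.0 * n * n * 4.0);             // roughly n / 6
}

static_assert(ridge_point() == 156.0, "ridge point is 156 FLOPs/byte");
static_assert(is_memory_bound(vector_add_intensity()), "vector add is memory-bound");
static_assert(!is_memory_bound(matmul_intensity(1024.0)), "N = 1024 matmul is compute-bound");
static_assert(attainable_flops(matmul_intensity(1024.0)) == kPeakFlops, "capped by the compute roof");
// Vector add tops out at 2e12 / 12 FLOP/s, under 0.06% of the 312 TFLOPS peak.
static_assert(kPeakFlops * 0.0006 > attainable_flops(vector_add_intensity()), "far below peak");</code></code></pre><p>Under these idealized assumptions the crossover for a square multiply sits near N = 936. Real kernels typically move more bytes than the ideal, so the practical threshold is higher. 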
</p><p>Applying compute-bound optimizations to a memory-bound kernel is how engineers spend weeks achieving nothing.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>The architecture&#8217;s logic, finally visible</h2><p>After tracing the memory hierarchy from physics to <strong>TMA</strong>, a coherent picture emerges.</p><p>The GPU is not a calculator that operates on data. It is a machine for orchestrating the movement of data at <strong>massive scale</strong>. </p><p>Its job is to deliver that data just fast enough, and in just the right form, for computation to occur at the rate the arithmetic units can sustain. Everything in the architecture, the four warp schedulers, the unified 192 KB SRAM, <code>cp.async</code>, TMA, WGMMA, exists in service of that orchestration.</p><p>The fastest <strong>GPU programs</strong> are not the ones that perform the most arithmetic per line of code. They are the ones that most efficiently move data through the hierarchy:</p><pre><code><code>HBM (2 TB/s, ~600 ns)
  &#8595;  [TMA or cp.async]
Shared Memory (several TB/s, ~20 ns)
  &#8595;  [warp-level loads]
Register File (~1 ns)
  &#8595;  [tensor core instructions]
Accumulators
  &#8595;  [store]
HBM</code></code></pre><p>Each arrow is a potential bottleneck. Each transition must be managed so that no stage sits idle waiting for the stage that feeds it.</p><p>There is a <strong>mental model shift</strong> that separates engineers who write fast GPU code from those who don&#8217;t. It isn&#8217;t knowledge of specific APIs or familiarity with PTX. It&#8217;s the habit of thinking about data location. </p><p>At every point during <strong>kernel execution</strong>: </p><blockquote><p><em>Where is this data right now? </em></p><p><em>Where does it need to be in three instructions? </em></p><p><em>How many cycles will it cost to move it there? </em></p><p><em>Is there computation I can usefully perform in the meantime?</em></p></blockquote><p>This is how the <strong>FlashAttention team</strong> found their key insight: the bottleneck in naive attention wasn&#8217;t the matrix multiplies; it was repeated HBM round-trips for the attention matrix. </p><p>The arithmetic didn&#8217;t change. The <strong>data choreography</strong> did. That&#8217;s the entire optimization.</p><p>Once the hardware model is <strong>genuinely internalized</strong>, the techniques follow naturally. The padding that eliminates bank conflicts isn&#8217;t a trick you memorize; it falls directly out of understanding how the banking hardware works. </p><p>The <code>cp.async</code> pipeline isn&#8217;t a template you copy; it&#8217;s the obvious solution once you understand that <strong>synchronous loads </strong>are serializing your kernel for no reason.</p><p>That&#8217;s the real skill. Not writing <strong>fast arithmetic</strong>. 
Writing fast data movement, and just enough arithmetic to justify it.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Mastering CUDA and High-Performance Computing, Part III]]></title><description><![CDATA[A Deep Dive from Compiler Internals to High-Performance Parallel Computing]]></description><link>https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-204</link><guid isPermaLink="false">https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-204</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Wed, 04 Mar 2026 18:01:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pohw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e0edde-9e30-4e89-9608-b7ab52d6dd5e_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pohw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e0edde-9e30-4e89-9608-b7ab52d6dd5e_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pohw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e0edde-9e30-4e89-9608-b7ab52d6dd5e_1024x1536.png 424w, 
https://substackcdn.com/image/fetch/$s_!pohw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e0edde-9e30-4e89-9608-b7ab52d6dd5e_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!pohw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e0edde-9e30-4e89-9608-b7ab52d6dd5e_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!pohw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e0edde-9e30-4e89-9608-b7ab52d6dd5e_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pohw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e0edde-9e30-4e89-9608-b7ab52d6dd5e_1024x1536.png" width="1024" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b3e0edde-9e30-4e89-9608-b7ab52d6dd5e_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4131291,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/189889954?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e0edde-9e30-4e89-9608-b7ab52d6dd5e_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!pohw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e0edde-9e30-4e89-9608-b7ab52d6dd5e_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!pohw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e0edde-9e30-4e89-9608-b7ab52d6dd5e_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!pohw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e0edde-9e30-4e89-9608-b7ab52d6dd5e_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!pohw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e0edde-9e30-4e89-9608-b7ab52d6dd5e_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>The system contract beneath the kernel</h2><p>In <strong>Part I</strong>, we descended from C++ into<strong> LLVM IR</strong>, tracing each loop, phi node, and virtual register.</p><p>In<strong> Part II</strong>, we dissolved PTX into warps, registers, and latency-hiding strategies, exploring how threads collide and cooperate in the Streaming Multiprocessor pipelines.</p><p>While I was studying all of this, a few months ago, I genuinely thought I understood the <strong>atomic unit</strong> of performance. I was totally wrong.</p><p>The atomic unit is not the kernel. A kernel is not computation. It is a <strong>submission</strong>: a meticulously packaged descriptor, assembled by the driver, serialized over the PCIe or NVLink bus, and handed to a firmware-managed command processor.</p><p>This processor does not execute &#8220;<em>kernels</em>&#8221; the way you imagine. It schedules thread blocks onto independently clocked<strong> Streaming Multiprocessors</strong>, arbitrates shared registers and memory banks, manages multiple <strong>DMA engines</strong>, and interleaves thousands of warps to hide latency.</p><p>Every instruction, every <strong>FMA</strong>, every predicated branch in your kernel is meaningless until the submission reaches the scheduler. 
Until you reason at this level, you are tuning the wrong layer.</p><p>Optimizing PTX, unrolling loops, or<strong> balancing registers</strong> matters only after the submission has carved a feasible execution plan through firmware, SM resources, and memory subsystems.</p><p>This is where performance truly lives: </p><p>at the intersection of host, bus, firmware, and microarchitecture, where the<strong> invisible choreography</strong> of scheduling, arbitration, and allocation transforms a static descriptor into teraflops of actual computation.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>The atomic unit of GPU execution</h2><p>When you write a kernel with the triple-chevron syntax, you are not &#8220;<em>launching computation</em>&#8221;. You are calling into the <strong>CUDA Runtime API</strong>, specifically <code>cudaLaunchKernel</code>. </p><p>Consider what that call site looks like after the compiler lowers it:</p><pre><code><code>// Source-level kernel invocation
myKernel&lt;&lt;&lt;gridDim, blockDim, sharedMemBytes, stream&gt;&gt;&gt;(arg0, arg1, arg2);

// What the compiler actually emits (simplified)
void* args[] = { &amp;arg0, &amp;arg1, &amp;arg2 };
cudaLaunchKernel(
    (const void*)&amp;myKernel,  // fat binary function handle
    gridDim,                 // dim3
    blockDim,                // dim3
    args,                    // pointer to argument array
    sharedMemBytes,          // dynamic shared memory in bytes
    stream                   // cudaStream_t
);</code></code></pre><p>That call is deceptively simple. Beneath it lies a cascade of complex, orchestrated operations that transform your high-level kernel into a<strong> GPU-executable submission.</strong></p><p>The runtime layer, <code>libcudart</code>, performs several critical tasks before the GPU ever sees a single instruction. First, it resolves the <strong>function pointer</strong> for your kernel. This pointer does not point to machine code directly. </p><p>It points into a <strong>fat binary</strong> (<code>fatbin</code>) section embedded in the host ELF executable under the <code>.nv_fatbin</code> section, a container that may hold multiple artifacts. You can inspect this directly:</p><pre><code><code># Dump all ELF sections to verify the fat binary is embedded
readelf -S ./my_binary | grep nv_fatbin

# Or use cuobjdump to list all embedded architectures
cuobjdump --list-elf ./my_binary
# Output example:
# ELF file    1: my_kernel.sm_80.cubin
# ELF file    2: my_kernel.sm_86.cubin

# Disassemble the SASS for a specific architecture
cuobjdump --dump-sass --gpu-architecture sm_80 ./my_binary
</code></code></pre><p>The fat binary container holds one or more <strong>cubin</strong> objects, each containing <strong>SASS</strong> (Shader ASSembly, the actual machine-level ISA) for a specific compute capability, and possibly <strong>PTX</strong> (Parallel Thread Execution, a stable virtual ISA) for forward compatibility. </p><p><strong>SASS</strong> is versioned by SM architecture: <code>sm_80</code> targets Ampere A100, <code>sm_86</code> targets Ampere RTX 30-series, <code>sm_90</code> targets Hopper H100.</p><p>If a cubin matches the active device&#8217;s compute capability, the driver loads it directly into device memory. </p><p>Otherwise, the embedded <strong>JIT compiler</strong> in the driver compiles PTX into SASS tailored to the exact SM microarchitecture, applying instruction scheduling, register allocation, and latency-hiding optimizations on-the-fly, then caches the result. </p><p>The cache lives at:</p><pre><code><code># Default JIT cache location on Linux
# (override with the CUDA_CACHE_PATH environment variable)
#   ~/.nv/ComputeCache/

# Inspect the cache
ls -lh ~/.nv/ComputeCache/

# Force JIT recompilation by clearing the cache
rm -rf ~/.nv/ComputeCache/

# Disable the cache entirely (forces JIT on every launch)
export CUDA_CACHE_DISABLE=1

# Inspect PTX embedded in the binary (requires PTX to have been included)
cuobjdump --dump-ptx ./my_binary
</code></code></pre><p>Next, the runtime constructs the <strong>parameter buffer</strong>. All kernel arguments are serialized into a contiguous memory region, respecting ABI alignment and padding rules. </p><p>The <strong>CUDA ABI</strong> mandates that arguments are packed in declaration order, each placed at an offset that satisfies its natural alignment (16 bytes for vector types such as <code>float4</code>). This buffer is critical: it decouples the host representation of arguments from the GPU&#8217;s execution context. </p><p>The parameter buffer, along with metadata about grid dimensions, block dimensions, shared memory allocation, and the kernel entry point, forms the <strong>launch descriptor</strong>.</p><p>Control now transitions from <code>libcudart</code> into the <strong>CUDA Driver API</strong>, <code>libcuda</code>. At this point, you leave the comfort of the runtime&#8217;s abstractions. The driver constructs a command packet fully describing the submission. </p><p>You can see the <strong>Driver API </strong>equivalents directly, bypassing the Runtime API entirely:</p><pre><code><code>// Equivalent kernel launch using the Driver API directly
CUfunction kernel;
CUmodule module;

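// Assumes cuInit(0) has been called and a context is current on this thread.
// Load a cubin directly, skipping the fat binary resolution step
cuModuleLoad(&amp;module, "my_kernel.sm_80.cubin");
// "myKernel" resolves only if the kernel is declared extern "C";
// otherwise pass the mangled symbol name.
cuModuleGetFunction(&amp;kernel, module, "myKernel");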

// Pack arguments manually
void* kernelArgs[] = { &amp;arg0, &amp;arg1, &amp;arg2 };

// Launch via Driver API
cuLaunchKernel(
    kernel,
    gridDim.x,  gridDim.y,  gridDim.z,   // grid dimensions
    blockDim.x, blockDim.y, blockDim.z,   // block dimensions
    sharedMemBytes,                        // dynamic shared memory
    stream,                                // CUstream
    kernelArgs,                            // argument array
    NULL                                   // extra options (NULL = unused)
);
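
// --- Illustrative sketch, not part of the original listing ---
// An alternative to the launch above that packs the parameter buffer by hand,
// to make the alignment rule concrete. Assumes arg0, arg1, and arg2 are the
// variables used above and that 256 bytes is enough to hold them (an
// assumption); needs #include &lt;cstring&gt; for memcpy.
alignas(16) char paramBuffer[256];
size_t paramBytes = 0;
auto pack = [&amp;](const void* src, size_t size, size_t align) {
    paramBytes = (paramBytes + align - 1) / align * align;  // pad up to this argument's alignment
    memcpy(paramBuffer + paramBytes, src, size);
    paramBytes += size;
};
pack(&amp;arg0, sizeof(arg0), alignof(decltype(arg0)));
pack(&amp;arg1, sizeof(arg1), alignof(decltype(arg1)));
pack(&amp;arg2, sizeof(arg2), alignof(decltype(arg2)));

void* launchConfig[] = {
    CU_LAUNCH_PARAM_BUFFER_POINTER, paramBuffer,
    CU_LAUNCH_PARAM_BUFFER_SIZE,    &amp;paramBytes,
    CU_LAUNCH_PARAM_END
};
cuLaunchKernel(
    kernel,
    gridDim.x,  gridDim.y,  gridDim.z,
    blockDim.x, blockDim.y, blockDim.z,
    sharedMemBytes, stream,
    NULL,            // kernelParams stays NULL when the buffer is passed via extra
    launchConfig
);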
</code></code></pre><p>The driver then executes a <strong>system call</strong> into kernel mode, a user-to-kernel <strong>privilege transition</strong>. This step is measurable. You can observe it with <code>strace</code> on Linux:</p><pre><code><code># Trace system calls during a CUDA kernel launch
strace -f -y -e trace=ioctl ./my_cuda_binary 2&gt;&amp;1 | grep "/dev/nvidia"
# With -y, strace prints the path behind each file descriptor, so you will see
# ioctl() calls into device nodes such as /dev/nvidiactl and /dev/nvidia0.
# strace does not decode the request codes; they correspond to driver operations
# such as context management, memory mapping, and command buffer allocation

# Profile the full launch overhead including driver roundtrip
# (nvprof is a legacy tool and does not support compute capability 8.0 and newer)
nvprof --print-gpu-trace ./my_cuda_binary
# Or with Nsight Systems (preferred for modern CUDA)
nsys profile --trace=cuda,nvtx --output=report ./my_cuda_binary
nsys stats report.nsys-rep
</code></code></pre><p>Inside the kernel-mode driver (<code>nvidia.ko</code> on Linux), the GPU submission is materialized. A GPU <strong>command buffer</strong> entry is created, encapsulating all launch metadata. </p><p>This entry is written into a memory region visible to the GPU, either pinned host memory accessible via PCIe DMA or device <strong>BAR (Base Address Register) mapped space</strong>, a CPU-accessible window into device memory through MMIO. </p><p>A <strong>memory-mapped I/O (MMIO) write</strong> to the GPU&#8217;s doorbell register signals that new work is ready. </p><p>This write traverses the interconnect, typically PCI Express or NVLink, crossing from CPU memory controllers to the GPU&#8217;s front-end hardware queue manager.</p><p>From this moment, the <strong>CPU </strong>relinquishes control. The GPU&#8217;s front-end processor fetches the command packet from the submission queue, decodes it, and begins orchestrating thread blocks across SMs.</p><p><strong>Launch latency</strong>, typically 5 to 15 microseconds on a modern discrete GPU, is dominated not by grid size but by runtime and driver overhead, mode switch cost, command serialization, and <strong>MMIO signaling</strong>. You can verify this invariance empirically:</p><pre><code><code>// Microbenchmark: measure launch latency vs. grid size
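// A trivially empty kernel (define it at namespace scope, before the host code)
__global__ void emptyKernel() {}

// Host code: event timing measures GPU-side elapsed time between the two events
cudaEvent_t start, stop;
cudaEventCreate(&amp;start);
cudaEventCreate(&amp;stop);

// Warm-up launch: the first launch also pays for context setup and module loading
emptyKernel&lt;&lt;&lt;1, 256&gt;&gt;&gt;();
cudaDeviceSynchronize();

for (int blocks : {1, 256, 65535}) {
    cudaEventRecord(start);
    emptyKernel&lt;&lt;&lt;blocks, 256&gt;&gt;&gt;();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&amp;ms, start, stop);
    printf("blocks=%d  latency=%.3f ms\n", blocks, ms);
    // Expect a similar fixed cost for small and medium grids; the largest grid
    // adds some block-dispatch time on top of the submission path.
    // The dominant cost is the submission path, not the computation
}

cudaEventDestroy(start);
cudaEventDestroy(stop);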
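
// --- Illustrative sketch, not part of the original listing ---
// Events time the GPU side. To see the host-side cost of the launch call
// itself, time a batch of launches with std::chrono (needs #include &lt;chrono&gt;).
// The batch size of 1000 is an arbitrary choice.
cudaDeviceSynchronize();                      // start from an idle device
const int launches = 1000;
auto t0 = std::chrono::steady_clock::now();
for (int i = 0; i &lt; launches; ++i) {
    emptyKernel&lt;&lt;&lt;1, 1&gt;&gt;&gt;();        // asynchronous: returns once the work is queued
}
auto t1 = std::chrono::steady_clock::now();
cudaDeviceSynchronize();                      // drain the queue before moving on
double usPerLaunch =
    std::chrono::duration&lt;double, std::micro&gt;(t1 - t0).count() / launches;
printf("host-side cost per launch: %.2f us\n", usPerLaunch);
// Caveat: if the launch queue fills up, the host blocks and this number starts
// to reflect GPU throughput rather than pure submission cost.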
</code></code></pre><p>This fact fundamentally shifts how you reason about <strong>GPU performance</strong>: kernels are not units of computation. They are descriptors. </p><p>Your optimization focus must move away from the code inside the kernel and toward the <strong>orchestration</strong> of submissions, streams, and resource contention at the firmware and microarchitectural level.</p><div><hr></div><h2>Inside the GPU front-end</h2><p>GPUs are not passive execution units. They are orchestrators of massive parallelism, with hardware designed to manage, schedule, and feed thousands of threads simultaneously. </p><p>At the apex of this architecture lies the <strong>front-end command processor</strong> (closely associated with <strong>FECS</strong>, the Front End Context Switch microcontroller found on NVIDIA GPUs since Fermi), a microarchitectural state machine tasked with transforming host-submitted work into actionable instructions for <strong>Streaming Multiprocessors</strong>.</p><p>When the CPU writes to the doorbell register via <strong>MMIO</strong>, the signal propagates across the PCIe interconnect into the GPU&#8217;s front-end. </p><p>The command processor wakes, fetches the command packet from its <strong>pushbuffer</strong> (the ring buffer used for command submission, a concept originating in GPU architecture from the early 2000s and still present in modern CUDA), and begins parsing the launch descriptor. </p><p>This descriptor is more than metadata. 
It encodes the grid dimensions, block dimensions, parameter buffer pointers, <strong>kernel entry points</strong>, shared memory allocation sizes, and stream association.</p><p>Crucially, the front-end does not immediately partition work across SMs. There is no pre-distribution. </p><p>Instead, the <strong>GigaThread Engine</strong> (NVIDIA&#8217;s term for the global thread block scheduler present since Fermi) maintains internal hardware work queues, constantly tracking SM occupancy, shared memory usage, register pressure, and warp slot availability. </p><p>Thread blocks are dynamically dispatched to <strong>SMs </strong>only when resources allow. This is why occupancy and resource usage matter at launch: <em>a block cannot begin execution until the SM has enough free registers, shared memory bytes, and warp slots.</em></p><p>You can query the resource limits of any SM directly:</p><pre><code><code># Query device properties relevant to block scheduling
nvidia-smi --query-gpu=name,compute_cap --format=csv

# Or from CUDA code
cudaDeviceProp prop;
cudaGetDeviceProperties(&amp;prop, 0);
printf("Max threads per SM:       %d\n", prop.maxThreadsPerMultiProcessor);
printf("Max blocks per SM:        %d\n", prop.maxBlocksPerMultiProcessor);
printf("Shared memory per SM:     %zu bytes\n", prop.sharedMemPerMultiprocessor);
printf("Registers per SM:         %d\n", prop.regsPerMultiprocessor);
printf("Warp size:                %d\n", prop.warpSize);
printf("Number of SMs:            %d\n", prop.multiProcessorCount);
// On an A100 (sm_80):
// Max threads per SM:       2048
// Max blocks per SM:        32
// Shared memory per SM:     167936 bytes (with carveout config)
// Registers per SM:         65536
// Warp size:                32
// Number of SMs:            108
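
// --- Illustrative sketch, not part of the original listing ---
// Two derived limits that follow from the fields above, plus the copy engine count.
int maxWarpsPerSM      = prop.maxThreadsPerMultiProcessor / prop.warpSize;
int maxResidentThreads = prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount;
printf("Max warps per SM:         %d\n", maxWarpsPerSM);
printf("Max resident threads:     %d\n", maxResidentThreads);
printf("Async copy engines:       %d\n", prop.asyncEngineCount);
// With the A100 figures listed above: 2048 / 32 = 64 warps per SM and
// 2048 * 108 = 221184 resident threads device-wide.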
</code></code></pre><p>This <strong>dynamic dispatch</strong> has profound architectural implications. Each block must be fully independent: it cannot rely on any global synchronization that spans other blocks. </p><p>If it did, the <strong>GigaThread Engine</strong> would risk deadlock whenever the number of dispatched-but-not-yet-scheduled blocks exceeded the number of concurrently resident blocks across all SMs. CUDA eliminates this hazard by design. </p><p>Blocks are the unit of forward progress, and cross-block synchronization is limited to device-wide barriers using <strong>Cooperative Groups</strong> with <code>grid.sync()</code>, which requires explicit opt-in via <code>cudaLaunchCooperativeKernel</code> and hardware support (Pascal and later):</p><pre><code><code>#include &lt;cooperative_groups.h&gt;
namespace cg = cooperative_groups;

__global__ void cooperativeKernel(float* data) {
    cg::grid_group grid = cg::this_grid();

    // Phase 1: every block writes
    data[blockIdx.x * blockDim.x + threadIdx.x] = threadIdx.x;

    // Device-wide barrier: ALL blocks must reach this point
    // Only safe when the kernel is launched with cudaLaunchCooperativeKernel
    // and grid size &lt;= number of SMs * max resident blocks per SM
    grid.sync();

    // Phase 2: every block reads from another block's data
    float val = data[(blockIdx.x + 1) % gridDim.x * blockDim.x + threadIdx.x];
}

// Launch must use cudaLaunchCooperativeKernel, not the chevron syntax
void* args[] = { &amp;data };
cudaLaunchCooperativeKernel(
    (void*)cooperativeKernel,
    gridDim, blockDim,
    args, 0, stream
);
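
// --- Illustrative sketch, not part of the original listing ---
// Checks worth running BEFORE the cooperative launch above: device support and
// whether the whole grid can be resident at once. Device 0 is an assumption.
int supportsCoop = 0, blocksPerSM = 0, numSMs = 0;
cudaDeviceGetAttribute(&amp;supportsCoop, cudaDevAttrCooperativeLaunch, 0);
cudaDeviceGetAttribute(&amp;numSMs, cudaDevAttrMultiProcessorCount, 0);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &amp;blocksPerSM, cooperativeKernel,
    blockDim.x * blockDim.y * blockDim.z,    // threads per block
    0                                        // dynamic shared memory per block
);
int maxCoopBlocks = blocksPerSM * numSMs;
int requested     = gridDim.x * gridDim.y * gridDim.z;
if (!supportsCoop || requested &gt; maxCoopBlocks) {
    printf("cooperative launch not possible: %d blocks requested, %d can be resident\n",
           requested, maxCoopBlocks);
}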
</code></code></pre><p>Within the front-end, multiple execution engines work in parallel. </p><p>One engine fetches blocks from the queue, another evaluates resource availability per SM, a third schedules warps to the SM <strong>instruction dispatch units</strong> (each SM on Ampere has four warp schedulers, each capable of issuing one instruction per clock to its assigned warp pool). </p><p>All of this happens in tens of nanoseconds, invisible to the CPU but essential to sustaining throughput.</p><div><hr></div><h2>Where kernels meet silicon</h2><p>I used to think a kernel was the atomic unit of performance. I was wrong.</p><p>A kernel is not computation. It is a descriptor, a carefully structured packet, handed from the CPU, through a driver, across a bus, and into a firmware-managed execution engine that schedules thread blocks onto independently clocked <strong>Streaming Multiprocessors</strong> sharing a memory subsystem with multiple DMA engines. </p><p>That sentence is not dramatic. It is mechanically accurate. And until you reason at that level, you are optimizing the wrong layer.</p><p>The interconnect is the immutable physical boundary that every kernel submission must cross. Most discrete GPUs connect over <strong>PCI Express</strong>, a packetized, credit-based, arbitrated serial fabric. </p><p>PCIe 4.0 x16 (the most common configuration as of Ada Lovelace) provides approximately 32 GB/s of bandwidth in each direction, with roughly 1 to 3 microseconds of round-trip latency per transaction. </p><p>PCIe 5.0 x16 doubles that to roughly 64 GB/s in each direction while keeping similar latency characteristics. 
You can observe the actual topology and bandwidth on your system:</p><pre><code><code># Inspect PCIe topology and bandwidth
nvidia-smi topo --matrix
# Output shows P2P access type between GPUs and CPU:
# NV# = NVLink (higher bandwidth), PIX = PCIe same switch, etc.

# Benchmark PCIe bandwidth directly
# (from CUDA samples: bandwidthTest)
./bandwidthTest --mode=shmoo --memory=pinned
# Example output on PCIe 4.0 x16:
# Host to Device: ~25 GB/s
# Device to Host: ~26 GB/s
# (theoretical max 32 GB/s; overhead from protocol reduces effective bandwidth)
</code></code></pre><p>Systems with <strong>NVLink</strong> (A100 uses NVLink 3.0, H100 uses NVLink 4.0) fundamentally change this picture. </p><p>NVLink 3.0 provides 600 GB/s total bidirectional GPU-to-GPU bandwidth across 12 links on the A100, with sub-microsecond latency. NVLink 4.0 on H100 scales to 900 GB/s. </p><p>The principle remains the same: data movement occurs over a finite physical fabric. Move 20 GB per iteration over a 25 GB/s effective PCIe bandwidth, and the lower bound is 0.8 seconds regardless of what your kernels compute. </p><p>No amount of register blocking or shared memory tiling can bypass that limit. Physics dominates.</p><p>To address this, GPUs expose dedicated <strong>asynchronous copy engines</strong>, entirely separate from the SM compute pipelines. </p><p>On the A100, there are two copy engines capable of overlapping host-to-device (H2D) and device-to-host (D2H) transfers simultaneously with kernel execution. </p><p>On the H100, the copy engine count increases to three. When you call <code>cudaMemcpyAsync</code> with pinned memory in a non-default stream, the driver programs a <strong>DMA descriptor</strong> and the copy engine pulls data directly from<strong> host memory </strong>using bus-mastering DMA while SMs continue executing kernels on a separate stream:</p><pre><code><code>// Correct pattern for overlapping compute and transfer
// Requires: pinned memory, separate streams, no implicit sync

float *h_input, *h_output, *d_input, *d_output;
const size_t N = 1 &lt;&lt; 24;
const size_t bytes = N * sizeof(float);

// Allocate pinned (page-locked) host memory
cudaMallocHost(&amp;h_input,  bytes);  // pinned
cudaMallocHost(&amp;h_output, bytes);  // pinned
cudaMalloc(&amp;d_input,  bytes);
cudaMalloc(&amp;d_output, bytes);

cudaStream_t computeStream, transferStream;
cudaStreamCreate(&amp;computeStream);
cudaStreamCreate(&amp;transferStream);

// Issue H2D copy on transferStream (handled by copy engine)
cudaMemcpyAsync(d_input, h_input, bytes, cudaMemcpyHostToDevice, transferStream);

// Issue kernel on computeStream (handled by SMs, independently).
// processKernel and previousData are placeholders for work that does not depend on d_input.
processKernel&lt;&lt;&lt;N/256, 256, 0, computeStream&gt;&gt;&gt;(d_output, previousData);

// Insert event-based dependency: computeStream waits for transferStream
cudaEvent_t transferDone;
cudaEventCreate(&amp;transferDone);
cudaEventRecord(transferDone, transferStream);
cudaStreamWaitEvent(computeStream, transferDone, 0);

// Now safe to launch the kernel that depends on d_input
dependentKernel&lt;&lt;&lt;N/256, 256, 0, computeStream&gt;&gt;&gt;(d_input, d_output);

// DO NOT call cudaDeviceSynchronize() here unless absolutely necessary.
// It drains ALL streams and ALL engines, collapsing the pipeline to serial.
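
// --- Illustrative sketch, not part of the original listing ---
// The same idea as a chunked pipeline: each chunk gets its own stream, so the
// copy of chunk i+1 can overlap the kernel of chunk i. processChunk is a
// hypothetical kernel and the chunk count of 4 is arbitrary.
const int kChunks = 4;
const size_t chunkElems = N / kChunks;
const size_t chunkBytes = chunkElems * sizeof(float);
cudaStream_t streams[kChunks];
for (int i = 0; i &lt; kChunks; ++i) cudaStreamCreate(&amp;streams[i]);

for (int i = 0; i &lt; kChunks; ++i) {
    const size_t off = i * chunkElems;
    cudaMemcpyAsync(d_input + off, h_input + off, chunkBytes,
                    cudaMemcpyHostToDevice, streams[i]);
    processChunk&lt;&lt;&lt;chunkElems / 256, 256, 0, streams[i]&gt;&gt;&gt;(d_input + off, d_output + off);
    cudaMemcpyAsync(h_output + off, d_output + off, chunkBytes,
                    cudaMemcpyDeviceToHost, streams[i]);
}

// Wait per stream instead of draining the whole device
for (int i = 0; i &lt; kChunks; ++i) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}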
</code></code></pre><p>The distinction between pageable host memory from <code>malloc</code> and pinned host memory from <code>cudaMallocHost</code> is not stylistic. It is architectural. </p><p>The copy engine requires <strong>page-locked pages with stable physical addresses</strong> to issue DMA transfers without CPU intervention. </p><p>Pageable memory allocated with standard <code>malloc</code> forces the driver to stage through a temporary pinned <strong>bounce buffer</strong> first, adding one full extra copy and destroying any possibility of true overlap.</p><div><hr></div><h2>Pageable vs. page-locked memory and the orchestration of streams</h2><p>I used to think memory transfers were trivial: just a <code>cudaMemcpy</code> and the GPU would magically have the data. That is not the case.</p><p>At the hardware level, not all host memory is created equal. <strong>Pageable memory</strong>, the default for every allocation from <code>malloc</code>, <code>new</code>, or <code>std::vector</code>, cannot be accessed directly by a GPU DMA engine. </p><p>The OS may migrate, swap, or remap pageable pages at any moment. The DMA engine requires pages that are <strong>physically stable</strong>: each page must keep a fixed physical address for the duration of the transfer, because the DMA engine places that address directly into the PCIe transactions it issues. The pages do not have to be contiguous with one another, only locked in place. 
</p><p>The copy engine does not walk the host OS page tables during a DMA transfer.</p><p>When you pass a pointer to pageable memory to any <code>cudaMemcpy</code> variant, the driver silently performs a <strong>three-step process</strong>. First, it obtains a temporary pinned staging buffer in driver-managed memory. </p><p>Second, it performs a CPU-side <code>memcpy</code> from your pageable buffer into the pinned staging buffer. Third, it programs the <strong>DMA engine</strong> to transfer from the pinned staging buffer to device memory. </p><p>Two copies of the <strong>entire dataset</strong>, two chances for latency to accumulate, two pressure events on the CPU&#8217;s memory subsystem, multiplied by every transfer in your pipeline.</p><p><strong>Pinning host memory</strong> eliminates this staging copy entirely. You can pin memory in several ways:</p><pre><code><code>// Method 1: allocate pinned memory from the start (preferred)
float* h_data;
cudaMallocHost(&amp;h_data, bytes);          // pinned, CUDA-managed
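// or equivalently (use one call or the other, not both on the same pointer):
// cudaHostAlloc(&amp;h_data, bytes, cudaHostAllocDefault);
// Release either kind of allocation with cudaFreeHost(h_data)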

// Method 2: pin an existing pageable allocation (useful for legacy code)
float* existing_ptr = (float*)malloc(bytes);
cudaHostRegister(existing_ptr, bytes, cudaHostRegisterDefault);
// ... use existing_ptr in cudaMemcpyAsync ...
cudaHostUnregister(existing_ptr);        // must unpin before free
free(existing_ptr);

// Method 3: write-combined memory (good for H2D only, uncached on CPU side)
float* h_wc;
cudaHostAlloc(&amp;h_wc, bytes, cudaHostAllocWriteCombined);
// Write-combined memory bypasses the CPU cache hierarchy,
// reducing cache pollution on the host but making CPU reads very slow.
// Only use when the CPU writes sequentially and the GPU reads.

// Verify that a pointer is pinned and get its device-visible address
cudaPointerAttributes attr;
cudaPointerGetAttributes(&amp;attr, h_data);
printf("Memory type: %d (1 = cudaMemoryTypeHost, i.e. pinned)\n", (int)attr.type);
printf("Device-visible address: %p\n", attr.devicePointer);
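
// --- Illustrative sketch, not part of the original listing ---
// Time the same H2D copy from pageable and from pinned memory. Assumes bytes
// and the pinned h_data from Method 1; needs #include &lt;cstdlib&gt; and &lt;cstring&gt;.
float* h_pageable = (float*)malloc(bytes);
memset(h_pageable, 0, bytes);              // touch the pages so they are resident
float* d_buf;
cudaMalloc(&amp;d_buf, bytes);

cudaEvent_t t0, t1;
cudaEventCreate(&amp;t0);
cudaEventCreate(&amp;t1);
auto timeCopyMs = [&amp;](const float* src) {
    float ms = 0.0f;
    cudaEventRecord(t0);
    cudaMemcpy(d_buf, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&amp;ms, t0, t1);
    return ms;
};
float pageableMs = timeCopyMs(h_pageable);
float pinnedMs   = timeCopyMs(h_data);
printf("pageable: %.2f ms, pinned: %.2f ms\n", pageableMs, pinnedMs);

cudaEventDestroy(t0);
cudaEventDestroy(t1);
cudaFree(d_buf);
free(h_pageable);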
</code></code></pre><p>But pinning comes at a real<strong> OS-level</strong> cost. Each pinned page is <strong>wired</strong>: the OS cannot reclaim it for paging, cannot map it to another physical address, and cannot swap it to disk. </p><p>On <strong>Linux</strong>, wired pages count against the locked memory limit (<code>RLIMIT_MEMLOCK</code>). High-performance systems typically raise this limit:</p><pre><code><code># Check current locked memory limit
ulimit -l
# Default is often 64 KB (far too low for GPU workloads)

# Raise the limit for the current session (cannot exceed the hard limit unless you are root)
ulimit -l unlimited

# Or permanently via /etc/security/limits.conf (requires root; applies from the next login)
echo "* hard memlock unlimited" | sudo tee -a /etc/security/limits.conf
echo "* soft memlock unlimited" | sudo tee -a /etc/security/limits.conf

# Verify pinned memory usage on the system
grep -i mlocked /proc/meminfo
# Mlocked: shows currently wired pages in KB
</code></code></pre><p>Even with pinned buffers, transferring data efficiently requires orchestrated submission. <strong>Streams</strong> are the mechanism.</p><p>A CUDA stream is an ordered sequence of operations: kernel launches, memory copies, or event markers. Within a stream, operations execute in issue order, serialized by the hardware. </p><p>Across streams, no ordering is guaranteed unless the programmer explicitly inserts synchronization with events, using <code>cudaEventRecord</code> and <code>cudaStreamWaitEvent</code>. </p><p>Internally, each stream maps to a logical command queue in the driver, which maps to a <strong>hardware channel</strong> on the GPU front-end.</p><p>The GPU scheduler interprets these queues at the hardware front-end, deciding dynamically which operation to issue to which engine. </p><p>The <strong>concurrency decisions</strong> depend on four things: SM capacity (how many blocks can reside simultaneously on each SM given register and shared memory consumption), warp slot availability (each SM on A100 supports up to <strong>64 resident warps</strong> regardless of block count), copy <strong>engine availability</strong> (two engines on A100, three on H100, each handling one direction of transfer at a time), and memory bandwidth headroom (HBM2e on A100 provides roughly 2 TB/s; GDDR6X on RTX 4090 provides roughly 1 TB/s).</p><p>Streams are not independent parallel lanes. They represent <strong>potential parallelism</strong>, a promise the hardware can choose to fulfill when resources allow. </p><p>You can use <strong>Nsight Systems</strong> to visualize whether your multi-stream design is actually achieving overlap:</p><pre><code><code># Profile with stream-level visibility
nsys profile \
    --trace=cuda \
    --cuda-memory-usage=true \
    --output=timeline \
    ./my_application

# Open the report
nsys-ui timeline.nsys-rep
# The timeline view will show whether kernels and memcpy operations
# from different streams truly overlap, or serialize due to resource contention
</code></code></pre><div><hr></div><h2>Resource-constrained block residency</h2><p>I used to think that once a block is launched, it simply executes. That is not how the hardware sees it. The Streaming Multiprocessor treats blocks as units of <strong>resource allocation</strong>, not just threads to execute.</p><p><strong>Residency</strong>, the number of blocks and warps that can coexist on an SM, is determined by the most constrained of four interlocking resources. </p><ul><li><p>First, register consumption: each SM has 65,536 32-bit registers on Ampere. If your kernel uses 64 registers per thread and launches 256-thread blocks, that block consumes 256 &#215; 64 = 16,384 registers, allowing at most 65,536 / 16,384 = 4 blocks to reside simultaneously. </p></li><li><p>Second, static and dynamic shared memory: an A100 SM has up to 164 KB of configurable shared memory (in a combined L1/shared memory array of 192 KB). A kernel that uses 48 KB per block therefore fits at most 3 blocks per SM, since 164 / 48 rounds down to 3. </p></li><li><p>Third, the hardware-imposed maximum of 2,048 resident threads per SM on A100. </p></li><li><p>Fourth, the architectural maximum of 32 resident blocks per SM on A100.</p></li></ul><p>The CUDA <strong>Occupancy Calculator</strong> (available as both a spreadsheet and as API calls) computes the binding constraint:</p><pre><code><code>// Query theoretical occupancy from the runtime
int minGridSize, blockSize;

// Let the runtime choose an optimal block size to maximize occupancy
cudaOccupancyMaxPotentialBlockSize(
    &amp;minGridSize,
    &amp;blockSize,
    myKernel,
    0,        // dynamic shared memory per block
    0         // block size limit (0 = no limit)
);
printf("Suggested block size: %d, min grid size: %d\n", blockSize, minGridSize);

// Compute occupancy for a specific configuration
int numBlocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &amp;numBlocks,
    myKernel,
    256,   // threads per block
    0      // dynamic shared memory per block
);
printf("Max active blocks per SM: %d\n", numBlocks);
// occupancy = numBlocks * threadsPerBlock / maxThreadsPerSM
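// --- Illustrative sketch, not part of the original listing ---
// The formula above as code, for device 0 (an assumption) and the 256-thread
// blocks used in the query.
cudaDeviceProp prop;
cudaGetDeviceProperties(&amp;prop, 0);
float occupancy = (float)(numBlocks * 256) / prop.maxThreadsPerMultiProcessor;
printf("Theoretical occupancy: %.0f%%\n", occupancy * 100.0f);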

// Inspect register and shared memory usage of a compiled kernel
// (from command line)
// nvcc --ptxas-options=-v mykernel.cu
// Output:
// ptxas info: Used 32 registers, 4096 bytes smem, 400 bytes cmem[0]
</code></code></pre><p>You can also inspect <strong>register usage</strong> directly in the compiled binary:</p><pre><code><code># Check register usage per thread in a compiled cubin
cuobjdump --dump-resource-usage ./my_binary
# Prints one entry per kernel (listed under its mangled name) with its
# REG, STACK, SHARED, LOCAL, and CONSTANT usage

# Alternatively, pass verbose flags to ptxas at compile time
nvcc -Xptxas -v -arch=sm_80 mykernel.cu -o mykernel
# ptxas info: compiling entry function 'myKernel' for 'sm_80'
# ptxas info: Function properties for myKernel:
#     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
# ptxas info: Used 48 registers, 8192 bytes smem, 400 bytes cmem[0]
</code></code></pre><p><strong>Register spilling</strong> occurs when the compiler cannot fit a thread&#8217;s live variables into its per-thread register budget (at most 255 registers per thread, or fewer under <code>__launch_bounds__</code> or <code>--maxrregcount</code>) and must evict some to <strong>local memory</strong> (a per-thread region backed by global DRAM). It is one of the most expensive side effects of chasing occupancy: local memory accesses are cached in <strong>L1/L2</strong>, but every miss consumes <strong>DRAM</strong> bandwidth and pays global-memory latency. </p><p>You can force a register cap and observe the effect:</p><pre><code><code>// Annotate a kernel to cap its register usage
// The compiler spills values that do not fit under the cap into local memory
__global__ void __launch_bounds__(256, 4) myKernel(float* data) {
    // 256 = maximum threads per block the kernel will be launched with
    // 4   = desired minimum resident blocks per SM (the compiler limits
    //       registers per thread, spilling if necessary, to make this possible)
    ...
}

// Or via nvcc flag (applies globally to all kernels in the translation unit)
// nvcc --maxrregcount=32 mykernel.cu
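
// --- Illustrative sketch, not part of the original listing ---
// The register budget implied by __launch_bounds__(256, 4), assuming the
// 65,536-register SM described in this article and ignoring allocation granularity.
constexpr int kRegsPerSM       = 65536;
constexpr int kThreadsPerBlock = 256;
constexpr int kMinBlocksPerSM  = 4;
constexpr int kRegsPerThread   = kRegsPerSM / (kThreadsPerBlock * kMinBlocksPerSM);
static_assert(kRegsPerThread == 64, "4 blocks of 256 threads leave 64 registers per thread");
// A kernel that needs more than 64 live registers per thread must spill to
// reach the requested residency.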
</code></code></pre><p>Residency is not just about thread count. It directly determines the SM&#8217;s ability to hide memory latency by switching among resident warps (<strong>warp-level latency hiding</strong>). An Ampere SM has four warp schedulers. Each scheduler selects a ready warp every clock cycle. </p><p>If a warp issues a global memory load (with a latency of roughly 290 clock cycles to HBM on A100), the warp scheduler immediately switches to another ready warp. To fully hide that 290-cycle latency, you need enough resident warps to keep all four schedulers busy during the wait. </p><p>With too few resident warps due to register or <strong>shared memory pressure</strong>, stalls become visible in the instruction pipeline, and throughput collapses.</p><p>Low residency at the block level also reduces system-level concurrency. A single <strong>resource-heavy kernel</strong> can monopolize SMs, eliminating the scheduler&#8217;s ability to overlap operations from multiple streams. </p><p>Kernel design cascades upward through the entire submission pipeline.</p><div><hr></div><h2>The anatomy of GPU execution</h2><p>Looking back across this journey, from kernel launches to PTX, from SM occupancy to DMA engines, from streams to page migration, the lesson is clear: the GPU is a tightly orchestrated ecosystem, not a black box that executes kernels on demand.</p><p>A kernel is not pure computation. 
It is a descriptor, a submission packet traveling from <code>cudaLaunchKernel</code> through <code>libcudart</code>, through the Driver API in <code>libcuda</code>, across the user-to-kernel privilege boundary into <code>nvidia.ko</code>, across the <strong>PCIe fabric</strong> or NVLink interconnect, into the pushbuffer, through the front-end command processor and the GigaThread Engine, and only then into the SM warp schedulers. </p><p>Execution is layered: submission queues, DMA engines, front-end command processors, <strong>SM schedulers</strong>, warp instruction buffers, and physical register files all contribute to the effective performance of even a trivially simple operation.</p><p>Hardware constraints propagate upward across the entire stack. <strong>Register pressure</strong> on a single kernel reduces SM residency, which reduces warp count, which reduces latency-hiding capability, which reduces SM throughput, which reduces the scheduler&#8217;s ability to overlap streams, which reduces system-level concurrency. </p><p>Memory is not homogeneous: pageable versus pinned allocation, Unified Memory page fault overhead, DMA copy engine count, and <strong>HBM bandwidth</strong> all determine whether your kernels can achieve real parallel throughput. </p><p>Every microsecond is measurable. Mode switches, MMIO signaling, <strong>JIT compilation</strong>, and page migrations all introduce latency invisible in source code but entirely observable with <code>nsys</code>, <code>nvprof</code>, <code>cuobjdump</code>, and <code>/proc/driver/nvidia-uvm/stats</code>.</p><p>Optimizing a GPU is not a matter of loop unrolling, warp-level FMA scheduling, or shared memory bank conflict elimination in isolation. </p><p>It is engineering at <strong>multiple layers</strong> simultaneously, understanding how software abstractions translate to hardware realities at every level of the submission path. 
</p><p>Only by reasoning across the entire execution path, from host-side memory allocation strategy to SM warp scheduler residency, can one design systems that realize the full potential of GPU hardware.</p><p>The <strong>atomic unit of GPU performance is submission</strong>. Everything else, threads, blocks, warps, registers, is consequence, not cause. </p><p>Every optimization, every kernel redesign, every memory allocation decision must respect the architecture-wide <strong>resource constraints</strong>, the interconnect physics, and the hardware scheduler&#8217;s discretion.</p>]]></content:encoded></item><item><title><![CDATA[Mastering CUDA and High-Performance Computing, Part II]]></title><description><![CDATA[A Deep Dive from Compiler Internals to High-Performance Parallel Computing]]></description><link>https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-9e2</link><guid isPermaLink="false">https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-9e2</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Fri, 27 Feb 2026 14:31:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!UY0I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F152a535a-5d52-4497-916a-a1619e1b3202_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UY0I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F152a535a-5d52-4497-916a-a1619e1b3202_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UY0I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F152a535a-5d52-4497-916a-a1619e1b3202_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!UY0I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F152a535a-5d52-4497-916a-a1619e1b3202_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!UY0I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F152a535a-5d52-4497-916a-a1619e1b3202_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!UY0I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F152a535a-5d52-4497-916a-a1619e1b3202_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UY0I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F152a535a-5d52-4497-916a-a1619e1b3202_1024x1536.png" width="1024" height="1536" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/152a535a-5d52-4497-916a-a1619e1b3202_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3280946,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/189125595?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F152a535a-5d52-4497-916a-a1619e1b3202_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UY0I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F152a535a-5d52-4497-916a-a1619e1b3202_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!UY0I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F152a535a-5d52-4497-916a-a1619e1b3202_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!UY0I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F152a535a-5d52-4497-916a-a1619e1b3202_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!UY0I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F152a535a-5d52-4497-916a-a1619e1b3202_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>Introduction</h2><p>For a long time, I approached <strong>CUDA </strong>and <strong>GPU </strong>performance the way most engineers do: <em>through fragments</em>. </p><p>I read the CUDA Programming Guide, the <strong>PTX ISA </strong>specification, sections of the LLVM Language Reference, scattered forum posts, backend source files, and many whitepapers describing streaming multiprocessors and warp schedulers. </p><p>Each document was precise, internally consistent, and technically rigorous; but none of them, in isolation, explained the system. The <strong>PTX manual</strong> described instructions but not why they existed. </p><p>The CUDA guide described memory hierarchies but not how code became those memory accesses. 
<strong>LLVM documentation </strong>described passes and SSA form, but without grounding them in the physical constraints of a GPU executing 64 warps simultaneously. </p><p>I could see every layer, but I could not see how they connected. Performance tuning still felt empirical: inspect PTX, adjust launch parameters, add <code>__restrict__</code>, benchmark again. </p><p>The compiler remained a black box that emitted artifacts, not a system whose behavior I could predict.</p><p>The shift came when I stopped treating these as <strong>independent specifications</strong> and instead focused on the underlying invariants: </p><blockquote><p><em>how computation is represented</em></p><p><em>how hardware executes dependency graphs</em></p><p><em>how those representations evolved alongside processor architecture. </em></p></blockquote><p>I began reading compiler design notes, backend implementations, and especially the writings and technical explanations of Chris Lattner, tracing how<strong> LLVM&#8217;s SSA model</strong>, register allocation, and instruction selection were explicitly designed to map abstract programs onto finite physical machines. </p><p>In parallel, I studied the evolution of hardware itself: from in-order scalar pipelines to superscalar <strong>out-of-order execution</strong>, and eventually to GPUs, where thousands of threads exist not as independent programs but as replicated instances of the same dependency graph. </p><p>It became clear that hardware had stopped becoming fundamentally more complex in its <strong>execution model</strong>; instead, compilers had absorbed that complexity. </p><p>The compiler was no longer just translating syntax: it was actively restructuring computation to satisfy <strong>register file limits</strong>, memory latency constraints, and instruction throughput requirements. 
</p><p>Modern performance was no longer determined solely by hardware capability, but by how effectively the compiler could expose <strong>parallelism </strong>within the constraints of that hardware.</p><p>That realization reframed everything. CUDA was not a separate <strong>programming model </strong>layered on top of GPUs; it was a frontend into <em>LLVM&#8217;s transformation pipeline</em>, and LLVM itself was the mechanism that reshaped high-level intent into a form the GPU could physically sustain. </p><p>The compiler had become the <strong>critical mediator</strong> between software and silicon, encoding assumptions about latency, bandwidth, register pressure, and execution width directly into the structure of the program.</p><p> Understanding CUDA performance therefore required understanding LLVM: not just its syntax, but its internal passes, its <strong>SSA graph semantics</strong>, and its register allocation strategies. </p><p>Only then did the abstraction barrier dissolve, revealing that what ultimately runs on a GPU is not the original kernel, nor even its <strong>PTX representation</strong>, but the final physical realization of an LLVM-optimized dependency graph constrained by the realities of hardware.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>From CUDA kernels to silicon</h2><p>I remember the exact moment the<strong> abstraction barrier</strong> finally dissolved into dust. 
</p><p>Until then, CUDA performance tuning felt like an endless cycle of micro-optimizations: insert a <code>__restrict__</code> here, manually unroll a loop there, inspect the PTX output, rerun benchmarks, adjust launch parameters, and hope for better occupancy. </p><p>PTX felt like the canonical truth, the visible, low-level artifact you could reason about, but it was only the tip of the iceberg. </p><p>Reading the design notes and deep technical writings of <strong>Chris Lattner</strong>, combined with the LLVM Language Reference, the NVVM backend specification, and numerous LLVM source dives, revealed a far more rigorous reality: <strong>CUDA kernels never become GPU programs directly. </strong></p><p>They first exist as <strong>LLVM SSA graphs.</strong> Every subsequent transformation, such as <strong>NVVM lowering</strong>, <strong>PTX emission</strong>, instruction scheduling, and register allocation, is a constrained projection of that <strong>SSA graph</strong> onto the finite hardware resources of an <strong>SM</strong>.</p><p>When Clang compiles a CUDA kernel, the &#8220;<em>threads</em>&#8221; you write in source code are abstracted away; the compiler produces a single SSA function parameterized by the thread index. </p><p>Within that function, values are immutable and data dependencies are fully explicit. Consider a canonical <strong>SAXPY kernel</strong>:</p><pre><code><code>__global__ void saxpy(float a, float* __restrict x,
                      float* __restrict y,
                      float* __restrict out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = a * x[i] + y[i];
}
</code></code></pre><p><strong>Clang</strong> then lowers this into an <strong>SSA function</strong> resembling:</p><pre><code><code>define void @saxpy(float %a,
                   float addrspace(1)* noalias %x,
                   float addrspace(1)* noalias %y,
                   float addrspace(1)* noalias %out) {

entry:
  %tid   = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %ntid  = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()

  %block_offset = mul i32 %ctaid, %ntid
  %i32          = add i32 %block_offset, %tid
  %idx          = sext i32 %i32 to i64

  %x_ptr   = getelementptr float, float addrspace(1)* %x, i64 %idx
  %y_ptr   = getelementptr float, float addrspace(1)* %y, i64 %idx
  %out_ptr = getelementptr float, float addrspace(1)* %out, i64 %idx

  %vx = load float, float addrspace(1)* %x_ptr, !noalias !0
  %vy = load float, float addrspace(1)* %y_ptr, !noalias !1

  %mul = fmul contract float %a, %vx
  %sum = fadd contract float %mul, %vy

  store float %sum, float addrspace(1)* %out_ptr, !noalias !2
  ret void
}
</code></code></pre><p>Several structural observations are immediately apparent:</p><ol><li><p><strong>SSA nodes encode all dependencies explicitly.</strong> Every intermediate <code>%value</code>, such as <code>%vx</code>, <code>%mul</code>, or <code>%sum</code>, occupies an immutable virtual register. Lifetimes are explicit; LLVM can identify dead values and reason about register reuse long before physical registers exist.</p></li><li><p><strong>Thread identity is parameterized, not instantiated.</strong> <code>%tid</code>, <code>%ctaid</code>, and <code>%ntid</code> are just SSA inputs. The compiler sees a single, deterministic function, not thousands of concurrently executing threads. Hardware parallelism emerges only when the same SSA graph is instantiated across SM warps.</p></li><li><p><strong>Memory hierarchy is embedded in the type system.</strong> <code>addrspace(1)</code> denotes global memory, <code>addrspace(3)</code> shared memory, and <code>addrspace(5)</code> per-thread local memory. This distinction allows alias analysis and load-store motion passes to apply different reordering guarantees depending on latency and side effects.</p></li></ol><p>The role of <code>__restrict__</code> is now immediately visible in LLVM IR, both as the <code>noalias</code> attribute on the pointer parameters and as metadata on individual accesses:</p><pre><code><code>%vx = load float, float addrspace(1)* %x_ptr, !noalias !0
</code></code></pre><p>This metadata allows the compiler to disambiguate memory locations, enabling <strong>Global Value Numbering (GVN)</strong>, load forwarding, and redundant load elimination. </p><p>On a GPU, the difference is dramatic: a single <strong>unprovable alias</strong> in global memory could serialize hundreds of cycles of latency across a warp.</p><p>Another early and profound transformation is the <code>mem2reg</code> pass. Stack allocations introduced in <strong>IR construction</strong> are converted into pure SSA values:</p><p><strong>Before mem2reg:</strong></p><pre><code><code>%tmp = alloca float
store float %vx, float* %tmp
%val = load float, float* %tmp
%mul = fmul float %val, %val
</code></code></pre><p><strong>After mem2reg:</strong></p><pre><code><code>%mul = fmul float %vx, %vx
</code></code></pre><p>The load and store vanish entirely. On <strong>GPUs</strong>, this matters because an <code>alloca</code> that survives optimization is lowered to per-thread local memory (<code>addrspace(5)</code>), which is backed by device memory and can cost hundreds of cycles per access when it misses in cache. SSA immutability and live-range analysis now become the primary mechanism controlling register pressure downstream.</p><p>The next critical stage is <strong>SelectionDAG</strong>, where LLVM converts SSA IR into a directed acyclic graph of operations:</p><pre><code><code>        fadd
       /    \
     fmul    load y
     /  \
    a   load x
</code></code></pre><p>This graph is <strong>target-independent</strong> but already encodes all data dependencies. DAG nodes are matched against target-specific instruction patterns using heuristics that account for latency, throughput, and register use. For example, the LLVM IR:</p><pre><code><code>%offset = mul i64 %i, 4
%addr   = add i64 %base, %offset
</code></code></pre><p>becomes PTX:</p><pre><code><code>mul.lo.s64 %rd2, %rd1, 4;
add.s64 %rd3, %rdBase, %rd2;
</code></code></pre><p>but at the SASS stage, NVIDIA&#8217;s backend can fold this into a single scaled addressing instruction. </p><p><strong>Instruction count is not the primary metric</strong>; minimizing dependency depth and live-range length dominates warp occupancy and latency hiding.</p><p>Machine IR (MIR) after instruction selection but before register allocation exposes these constraints:</p><pre><code><code>%3:gpr64 = IMUL64ri32 %1, 4
%4:gpr64 = ADD64rr %base, %3
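</code></code></pre><p>Here, <code>%3</code> and <code>%4</code> are still virtual registers of class <code>gpr64</code> (the opcode and register-class names in this snippet are x86-style stand-ins; the NVPTX backend uses its own). On targets where LLVM assigns physical registers, the allocator builds an <strong>interference graph</strong> and maps virtual registers onto the finite physical set. For NVPTX, the emitted PTX still uses virtual registers and <code>ptxas</code> performs the final assignment, but the live ranges it has to color are the ones LLVM&#8217;s transformations produced. </p><p>On an A100-class Ampere SM, the 65536-entry register file bounds occupancy: a kernel requiring <strong>32 registers</strong> per thread allows 65536 / 32 = 2048 resident threads, while 96 registers per thread allows at most 65536 / 96 &#8776; 682, and in practice 672 or fewer because registers are allocated to whole 32-thread warps. </p><p>Correlating LLVM live-range dumps with <code>ptxas --verbose</code> reports showed that register pressure is largely determined at the LLVM IR level, long before <strong>PTX emission</strong>.</p><p>Finally, <strong>loop transformations</strong> illustrate LLVM&#8217;s deterministic orchestration of latency hiding. Consider a reduction:</p><pre><code><code>for (int i = 0; i &lt; 1024; i++)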
    sum += x[i];
</code></code></pre><p>The initial IR contains a loop-carried dependency:</p><pre><code><code>%sum_next = fadd float %sum, %val
</code></code></pre><p>which enforces strict serialization. After <strong>partial unrolling</strong>:</p><pre><code><code>%v0 = load float, ptr
%v1 = load float, ptr+4
%v2 = load float, ptr+8
%v3 = load float, ptr+12

%s1 = fadd float %sum, %v0
%s2 = fadd float %s1, %v1
%s3 = fadd float %s2, %v2
%s4 = fadd float %s3, %v3
</code></code></pre><p>LLVM&#8217;s scheduler can now interleave loads and arithmetic:</p><pre><code><code>load x[i]
load x[i+1]
fadd sum, x[i-2]
fadd sum, x[i-1]
</code></code></pre><p>Because the unrolled loads do not depend on one another, they can all be issued before the first addition needs its operand. By the time PTX is emitted, <strong>all parallelism, instruction ordering, and memory dependencies </strong>have already been determined; the warp scheduler is merely executing what LLVM has exposed. </p><p>Global memory latency is hidden not by runtime heuristics, but by compiler-scheduled instruction independence.</p><p>Ultimately, <strong>CUDA performance is a function of SSA graph transformations, not PTX heuristics</strong>. Every critical metric (<em>register pressure, warp occupancy, memory coalescing, instruction-level parallelism</em>) originates in LLVM IR. </p><p>Once I understood this, CUDA tuning ceased being black magic: by inspecting IR and tracing passes, I could predict <strong>register allocation, spilling, latency hiding, and throughput</strong> before compiling to PTX or running on hardware. </p><p>The <strong>abstraction barrier</strong> had vanished, and performance became a deterministic, analyzable function of compiler-driven SSA transformations.</p><div><hr></div><h2>Tracing LLVM Transformations to Hardware</h2><p>Once I understood that CUDA kernels first exist as LLVM SSA graphs, the next natural question emerged: <em>how exactly do these immutable, dependency-annotated IR values map all the way down to warp execution on the SM?</em> </p><p>To answer this, I traced a simple kernel (<em>again, SAXPY</em>) through every stage: SSA &#8594; NVVM &#8594; PTX &#8594; SASS &#8594; warp execution.</p><h4>LLVM SSA Graph</h4><p>The SSA graph encodes <strong>pure computation and memory 
dependencies</strong>:</p><pre><code><code>%vx  = load float, float addrspace(1)* %x_ptr, !noalias !0
%vy  = load float, float addrspace(1)* %y_ptr, !noalias !1
%mul = fmul contract float %a, %vx
%sum = fadd contract float %mul, %vy
store float %sum, float addrspace(1)* %out_ptr, !noalias !2
</code></code></pre><p>Key points:</p><ul><li><p><strong>Each </strong><code>%value</code><strong> is immutable.</strong> There is no &#8220;overwriting&#8221; of registers; instead, values flow along edges in a DAG.</p></li><li><p><strong>Dependencies are explicit.</strong> <code>%sum</code> depends on <code>%mul</code>, which depends on <code>%vx</code> and <code>%a</code>.</p></li><li><p><strong>Memory type is explicit.</strong> <code>addrspace(1)</code> informs alias analysis, reordering, and coalescing passes.</p></li></ul><p>LLVM passes operate entirely on this SSA graph:</p><ul><li><p><code>mem2reg</code> removes unnecessary stack loads/stores.</p></li><li><p><code>GVN</code> eliminates redundant calculations.</p></li><li><p><code>LICM</code> hoists loop-invariant computations and loads out of the loop body, so they execute once and their results stay in registers.</p></li><li><p><code>LoopUnroll</code> exposes independent operations for the scheduler.</p></li><li><p><code>SLPVectorizer</code> packs independent scalar arithmetic into vector operations where the target supports them (on NVPTX, for example, packed <code>&lt;2 x half&gt;</code> arithmetic).</p></li></ul><p>At this stage, <strong>parallelism is implicit</strong>, coming from parameterization over <code>%tid</code> and <code>%ctaid</code>, not from threads or warps.</p><h4>NVVM Lowering and PTX Generation</h4><p>LLVM&#8217;s NVPTX backend (the open-source counterpart of NVIDIA&#8217;s NVVM) converts SSA into <strong>PTX pseudo-assembly</strong>, a virtual instruction set for NVIDIA GPUs:</p><pre><code><code>// Global memory load
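ld.global.f32 %f1, [%rd1];   // %rd1 holds the address of x[i]
ld.global.f32 %f4, [%rd2];   // %rd2 holds the address of y[i]
// Arithmetic
mul.f32 %f2, %a, %f1;
add.f32 %f3, %f2, %f4;
// Store
st.global.f32 [%rd3], %f3;   // %rd3 holds the address of out[i]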
</code></code></pre><p>Observations:</p><ul><li><p>SSA immutability is preserved: <code>%f1</code>, <code>%f2</code>, <code>%f3</code> correspond 1:1 with SSA nodes.</p></li><li><p>Memory hierarchy is respected. Loads and stores to <code>addrspace(1)</code> become <code>.global</code> accesses; <code>addrspace(3)</code> would become <code>.shared</code>.</p></li><li><p>Instruction reordering is limited only by <strong>LLVM metadata and alias analysis</strong>, not PTX syntax. PTX reflects <em>the order LLVM has already chosen</em>.</p></li></ul><p>PTX is <strong>still virtual</strong>: final scheduling, physical register allocation, and instruction fusion are performed by <code>ptxas</code> when it generates SASS.</p><h4>SASS: Physical Instruction Selection</h4><p>The SASS stage maps PTX to the native <strong>hardware instructions</strong> of a specific SM architecture:</p><pre><code><code>IMAD R3, R1, 4, RBase    // Compute address
LD.E R4, [R3]            // Load x[i]
LD.E R5, [R3_y]          // Load y[i]
FFMA R6, R4, Ra, R5      // Fused multiply-add
ST.E [R3_out], R6        // Store result
</code></code></pre><p>Key points:</p><ul><li><p>LLVM&#8217;s <strong>DAG scheduling</strong> shapes the order of these operations. Instruction count is not minimized; dependency depth and live-range length are.</p></li><li><p>Register pressure is now concrete: each <code>%value</code> is mapped to a <strong>physical register</strong>. The allocator&#8217;s interference graph ensures no two simultaneously live values share the same physical register.</p></li><li><p>Instruction fusion appears here as <code>FFMA</code>, the SASS fused multiply-add; the <code>contract</code> flag on <code>fmul</code> and <code>fadd</code> in the IR is what permits it.</p></li></ul><p>By now, <strong>warp-level execution semantics are encoded</strong>: independent instructions are adjacent, arithmetic can overlap outstanding loads, and dependency chains are minimized.</p><h4>Warp Execution</h4><p>Finally, the SM executes the same SASS code across the 32 threads of each warp:</p><pre><code><code>Thread 0: load x[0], load y[0], fma, store
Thread 1: load x[1], load y[1], fma, store
...
Thread 31: load x[31], load y[31], fma, store
</code></code></pre><ul><li><p><strong>Warp scheduler</strong> selects ready instructions from the DAG instantiated across 32 threads.</p></li><li><p><strong>Global memory latency hiding</strong> emerges naturally: because LLVM already separated independent loads from dependent arithmetic, the scheduler always finds instructions to issue, overlapping memory and compute.</p></li><li><p><strong>Register pressure</strong> limits occupancy: if LLVM IR produced too many live <code>%value</code>s per thread, fewer threads can execute concurrently.</p></li></ul><p>The crucial insight: <strong>hardware doesn&#8217;t create parallelism.</strong> It executes what LLVM has already structurally exposed. </p><p>Warp scheduling, latency hiding, and coalescing are <em>emergent properties</em> of SSA DAG transformations combined with physical constraints.</p><h4>Loop Unrolling in SSA &#8594; PTX &#8594; SASS</h4><p>Consider a reduction:</p><pre><code><code>for (int i=0; i&lt;4; i++)
    sum += x[i];
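</code></code></pre><p>SSA after full unrolling (the trip count of 4 is known at compile time):</p><pre><code><code>%p1 = getelementptr float, float addrspace(1)* %x, i64 1
%p2 = getelementptr float, float addrspace(1)* %x, i64 2
%p3 = getelementptr float, float addrspace(1)* %x, i64 3

%v0 = load float, float addrspace(1)* %x
%v1 = load float, float addrspace(1)* %p1
%v2 = load float, float addrspace(1)* %p2
%v3 = load float, float addrspace(1)* %p3

%s1 = fadd float %sum, %v0
%s2 = fadd float %s1, %v1
%s3 = fadd float %s2, %v2
%s4 = fadd float %s3, %v3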
</code></code></pre><p>PTX:</p><pre><code><code>ld.global.f32 %f0, [%x];
ld.global.f32 %f1, [%x+4];
ld.global.f32 %f2, [%x+8];
ld.global.f32 %f3, [%x+12];

add.f32 %sum1, %sum, %f0;
add.f32 %sum2, %sum1, %f1;
add.f32 %sum3, %sum2, %f2;
add.f32 %sum4, %sum3, %f3;
</code></code></pre><p>SASS:</p><pre><code><code>LD.E R0, [R1]
LD.E R2, [R1+4]
LD.E R4, [R1+8]
LD.E R6, [R1+12]

FADD R8, Rsum, R0
FADD R8, R8, R2
FADD R8, R8, R4
FADD R8, R8, R6
ST.E [Rout], R8
</code></code></pre><ul><li><p><strong>Instruction interleaving</strong>: the SM scheduler can issue independent loads before dependent FADDs.</p></li><li><p><strong>Latency hiding</strong>: memory loads from global memory overlap with prior arithmetic.</p></li><li><p><strong>Predictable register pressure</strong>: each <code>%vN</code> maps to R0&#8211;R6; FADDs reuse registers, keeping occupancy within hardware limits.</p><p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p></li></ul><div><hr></div><h2>Merging History, Architecture, and Software</h2><p>I remember stepping a little bit back and seeing the full picture: <strong>GPU evolution</strong>, compiler transformations, and kernel performance are inseparable. </p><p>Every decision in a CUDA kernel exists in the context of decades of <strong>architectural evolution.</strong></p><p>In the 1990s, GPUs were<strong> fixed-function pipelines:</strong> vertices and pixels flowed through rigid stages, parallelism implicit but inaccessible. </p><p>Programmable shaders in the early 2000s allowed tiny per-pixel or <strong>per-vertex programs</strong> (<em>SIMD-style warps, predication, and texture-aware memory</em>) but general-purpose computation was still a hack.</p><p>Then CUDA arrived in 2006, exposing threads, blocks, and explicit memory hierarchies: registers, <strong>shared memory</strong>, global memory. For the first time, developers could directly reason about latency, coalescing, and warp-level execution. 
</p><p>Yet kernels were still only the tip of the iceberg: <strong>true performance lives in LLVM IR</strong>, long before PTX or hardware sees it.</p><p>Every kernel begins as an <strong>SSA graph</strong>: pure, immutable, fully explicit in dependencies. <code>%tid</code>, <code>%ctaid</code>, <code>%ntid</code> are function parameters, not threads. </p><p>Address spaces encode <strong>memory hierarchy</strong>: <code>addrspace(1)</code> for global, <code>addrspace(3)</code> for shared, <code>addrspace(5)</code> for local. <code>__restrict__</code> metadata allows disambiguation, enabling GVN, load forwarding, and aggressive reordering. </p><p>Loop unrolling, <strong>scalar replacement</strong>, and DAG scheduling expose independent operations, determining instruction-level parallelism and register pressure.</p><p>When LLVM lowers to PTX and eventually SASS, the graph&#8217;s structure dictates occupancy, warp scheduling, and latency hiding. The GPU does not invent parallelism; it realizes the<strong> parallelism</strong> the compiler has already exposed. </p><p>Every optimization (register allocation, memory coalescing, interleaved arithmetic) is preordained by SSA transformations.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>Compiler, Hardware, and the Deterministic Birth of Performance</h2><p>Stepping back, it becomes clear that GPU performance is <strong>never accidental</strong>. </p><p>Every evolution, starting from fixed-function pipelines to shaders, from CUDA to modern heterogeneous SMs, is a layer that shaped how we write, reason about, and execute kernels. 
</p><p>But the real magic lies in <strong>the compiler</strong>, particularly LLVM: it transforms human code into SSA graphs where dependencies, memory hierarchy, and live ranges are explicit, deterministic, and fully analyzable. </p><p>PTX and SASS are not optimization stages; they are <strong>projections </strong>of these decisions onto silicon.</p><p>Performance emerges not at runtime but <strong>at compile time</strong>, where LLVM orchestrates instruction scheduling, loop transformations, memory disambiguation, and register allocation. </p><p>The GPU simply executes what <strong>LLVM </strong>has already exposed: warps, latency hiding, coalesced accesses, and parallelism are consequences of carefully structured SSA graphs. </p><p>Every tweak in IR ripples through PTX, affects occupancy, and determines throughput. Understanding <strong>this chain</strong> (<em>history, architecture, compiler</em>) is what allows a developer to predict, reason about, and ultimately master GPU performance.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-9e2?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance-9e2?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Mastering CUDA and High-Performance Computing, Part I]]></title><description><![CDATA[A Deep Dive from Compiler Internals to High-Performance Parallel Computing]]></description><link>https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance</link><guid 
isPermaLink="false">https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Wed, 25 Feb 2026 12:23:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SFw_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42713eba-91d5-4e6e-acb0-291b02cf991f_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SFw_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42713eba-91d5-4e6e-acb0-291b02cf991f_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SFw_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42713eba-91d5-4e6e-acb0-291b02cf991f_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!SFw_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42713eba-91d5-4e6e-acb0-291b02cf991f_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!SFw_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42713eba-91d5-4e6e-acb0-291b02cf991f_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!SFw_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42713eba-91d5-4e6e-acb0-291b02cf991f_1024x1536.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!SFw_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42713eba-91d5-4e6e-acb0-291b02cf991f_1024x1536.png" width="1024" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42713eba-91d5-4e6e-acb0-291b02cf991f_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3951875,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/189117904?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42713eba-91d5-4e6e-acb0-291b02cf991f_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SFw_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42713eba-91d5-4e6e-acb0-291b02cf991f_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!SFw_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42713eba-91d5-4e6e-acb0-291b02cf991f_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!SFw_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42713eba-91d5-4e6e-acb0-291b02cf991f_1024x1536.png 1272w, 
https://substackcdn.com/image/fetch/$s_!SFw_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42713eba-91d5-4e6e-acb0-291b02cf991f_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>A Personal Journey into GPU Computing</h2><p>I still remember <strong>November 12, 2024</strong>, with a clarity that surprises even me to this day, even though more than a year has already passed.</p><p>I had spent the evening lost in <strong>LLVM internals</strong>, tracing the intricate dance between front-end parsing, IR transformations, 
optimization passes, and backend code generation. </p><p>I explored <strong>FunctionPassManager sequences</strong>, observed <strong>GVN</strong>, <strong>SROA</strong>, <strong>LoopVectorization</strong>, <strong>Instruction Combining</strong>, and <strong>Dead Code Elimination</strong>, and tried to correlate these transformations with <strong>register pressure</strong>, <strong>live interval analysis</strong>, and <strong>instruction scheduling heuristics</strong>. </p><p>Every detail, every pass, seemed like a miniature universe of logic, optimization, and constraint satisfaction.</p><p>I had always been fascinated by the intersection of <strong>software, hardware, and system infrastructure</strong>; but until that night, I had never realized just how foundational and deep that connection could be. </p><p>I was thinking about compilers, IR, and abstract execution models, but I had no sense yet of how these abstractions directly map to silicon at massive scale.</p><p>Then I stumbled upon an article about <strong>DeepSeek R1</strong>, the <strong>PTX intermediate representation</strong>, and the orchestration of thousands of threads across<strong> NVIDIA GPUs&#8217; Streaming Multiprocessors (SMs)</strong>. </p><p>It felt like stepping through a door into a world I had only glimpsed. Almost instantly, I found myself in the middle of a <strong>global conversation</strong>: for days, forums, blogs, and even financial news outlets buzzed about DeepSeek. </p><p>Developers dissected PTX, debated its <strong>scheduling optimizations</strong>, speculated about the next wave of AI workloads, and NVIDIA&#8217;s stock price swung dramatically.</p><p>While most people were caught up in hype, speculation, or high-level geopolitical and financial implications, I became obsessed with the <strong>core layer</strong>: the base of the software, the invisible threads connecting high-level code to the hardware executing it. 
</p><p>I devoured every whitepaper, blog post, SDK guide, and forum discussion I could find, trying to understand exactly <strong>how PTX bridges high-level CUDA kernels </strong>with the underlying<strong> SM pipelines</strong>, how <strong>execution is scheduled across warps</strong>, and how <strong>memory hierarchies</strong> (<em>register files, shared memory banks, L1/L2 caches, and DRAM</em>) are orchestrated at a microscopic level.</p><p>At that time, I barely understood what a GPU did beyond simple graphics acceleration or using <strong>high-level CUDA</strong> <strong>frameworks</strong>. But the article illuminated something crucial: modern computational performance isn&#8217;t just about clever algorithms.</p><p>It is about <strong>how software maps onto hardware through multiple layers of compilation and execution</strong>, aspects that most programmers, including myself until that night, rarely consider. </p><p>I realized that to truly understand <strong>high-performance GPU computing</strong>, I needed to trace the <strong>entire execution path</strong>: from <strong>CUDA C++ kernels</strong>, through <strong>LLVM IR transformations</strong>, into <strong>PTX</strong>, and finally into the <strong>Streaming Multiprocessor pipelines</strong> themselves.</p><p>Suddenly, my abstract fascination with compilers and infrastructure became a tangible, almost physical journey. </p><p>Every kernel, every loop, every thread had a story. I could begin to see how <strong>thread indexing</strong> (<em>threadIdx, blockIdx, blockDim</em>) maps logically to warps, how <strong>memory coalescing</strong> affects throughput, how <strong>shared memory bank conflicts</strong> serialize execution, and how <strong>predication</strong> avoids warp divergence penalties. 
</p><p>I was no longer just reading about GPUs; I was entering their world <strong>thread by thread, instruction by instruction</strong>, tracing the invisible logic that transforms high-level abstractions into thousands of coordinated instruction streams flowing through silicon.</p><p>That night marked the start of a transformation in my understanding. It wasn&#8217;t merely academic curiosity anymore; it was a journey into the <strong>layered reality of modern computing</strong>, where compiler theory, PTX abstractions, and microarchitectural details converge to define what is actually possible on a GPU. </p><p>And it was the beginning of the series you&#8217;re reading: a deep dive into <strong>CUDA, LLVM IR, PTX, and SM execution</strong>, told from the perspective of someone who has traced every layer, experimentally and obsessively, to understand not just how GPUs compute, but <strong>why</strong>.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>My LLVM odyssey</h2><p>Before GPUs became the core focus of my studies, I was already deeply immersed in <strong>software infrastructure</strong>, with a singular fascination: <strong>compilers</strong>. 
</p><p>Their ability to transform high-level code into highly optimized machine instructions, <strong>managing dependencies</strong>, scheduling, and register allocation, had always captivated me: almost like engineering alchemy, where abstract logic crystallizes into finely tuned execution on silicon.</p><p>I spent months dissecting LLVM&#8217;s internals, tracing each stage of the pipeline.</p><p>At the front-end, I studied <strong>clang&#8217;s parsing</strong>, type-checking, and <strong>LLVM IR generation</strong>, observing how control flow graphs, SSA form, and metadata are preserved to represent program semantics accurately. </p><p>LLVM IR became a playground of possibilities, allowing multiple layers of analysis, transformation, and optimization without committing to a target architecture.</p><p>Next came the optimization passes. I followed the <strong>FunctionPassManager</strong>, analyzing pass ordering and dependencies. </p><p>I explored <strong>Global Value Numbering (GVN)</strong> to identify equivalent computations, <strong>Scalar Replacement of Aggregates (SROA)</strong> to break down complex structures, and <strong>LoopVectorization</strong> to exploit SIMD execution patterns. </p><p><strong>Instruction Combining</strong> and <strong>Dead Code Elimination (DCE)</strong> revealed how small IR-level transformations propagate through instruction scheduling, register pressure, and memory accesses.</p><p>Diving deeper, I mapped IR onto hardware targets through <strong>InstructionSelector DAGs</strong>, examining how LLVM matches abstract operations to <strong>target-specific opcodes</strong>, considers <strong>RegisterClass constraints</strong>, and queries <strong>TargetTransformInfo</strong> to evaluate instruction latency, throughput, and memory cost models. 
</p><p>I traced <strong>live interval analysis</strong> in relation to <strong>register allocation heuristics</strong>, noting how spilling, rematerialization, and coalescing decisions affect execution efficiency: critical insights for architectures with thousands of concurrent threads.</p><p>Predication and <strong>control-flow lowering</strong> fascinated me: conditional branches in IR could be transformed into predicated instructions, minimizing pipeline stalls. </p><p>Loop transformations (<strong>unrolling, interchange, fusion, and vectorization</strong>) demonstrated the delicate balance between instruction-level parallelism, memory alignment, and cache behavior. </p><p>Each IR pass could increase or decrease register pressure, affect instruction scheduling, or modify memory footprint, directly influencing runtime performance.</p><p>Even then, I did not anticipate how this knowledge would later intersect with <strong>GPU architectures. </strong></p><p>LLVM IR transformations, optimization passes, and backend heuristics were abstract exercises; but they had prepared me to reason about CUDA kernels, PTX <strong>intermediate representation</strong>, warp scheduling, shared memory layouts, and SM pipelines. </p><p>Every abstraction, every instruction, and every <strong>scheduling decision</strong> became a lens through which I could understand how software maps efficiently onto massively parallel hardware.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>The evening of discovery: PTX and DeepSeek</h2><p><strong>January 20, 2025.</strong> That evening remains etched in my memory. 
After hours lost in LLVM&#8217;s inner workings, tracing IR transformations, exploring <strong>FunctionPassManager</strong> sequences, and correlating register allocation with live interval analysis, I stumbled upon the <strong>DeepSeek R1</strong> article.</p><p>The moment felt surreal: everything I had been exploring in software infrastructure suddenly collided with the raw power of hardware parallelism. </p><p>The article unpacked <strong>PTX</strong>, NVIDIA&#8217;s virtual ISA, and detailed how thousands of threads execute as <strong>warps</strong> on a <strong>Streaming Multiprocessor (SM)</strong>. </p><p>Each warp, a bundle of <strong>32 threads</strong> executing in SIMT fashion, is issued by the SM&#8217;s warp scheduler, which hides latency by rapidly switching among warps that are ready to run.</p><p>PTX abstracts this complexity: developers write kernels in CUDA C++ and compile them with NVCC, which emits PTX; the <strong>NVIDIA driver</strong> can then JIT-compile that PTX into device-specific machine code, packaged as cubins, that runs on the SM pipelines.</p><p>PTX fascinated me for several reasons:</p><ul><li><p><strong>Forward compatibility</strong>: <em>A PTX kernel compiled today could run on next-generation GPUs via JIT compilation, bridging software longevity and hardware evolution.</em></p></li><li><p><strong>Hardware abstraction</strong>: <em>Kernels could be written without intimate knowledge of specific SM layouts, register files, or shared memory banking.</em></p></li><li><p><strong>Optimization opportunities</strong>: <em>Despite abstraction, peak performance still demanded deep awareness of occupancy, memory coalescing, shared memory conflicts, warp divergence, and pipeline latencies.</em></p></li></ul><p>The parallels with LLVM were striking. LLVM transforms <strong>high-level C++ </strong>into <em>IR</em>, applies passes like <em>GVN, SROA, LoopVectorization, Instruction Combining,</em> and <em>Dead Code Elimination</em>, and lowers code to target-specific instructions. 
</p><p>PTX, on the other hand, expresses <strong>parallel execution semantics independently of final hardware mapping</strong>. Both systems separate abstraction from execution yet require deep knowledge to optimize performance.</p><p>I sat tracing mental parallels between LLVM&#8217;s DAG-based instruction selection and PTX&#8217;s warp-scheduled execution. LLVM must reason about <strong>register pressure</strong>, instruction latency, predication, and control-flow lowering. </p><p>PTX, in turn, demands understanding warp occupancy, coalesced memory accesses, shared memory bank conflicts, and <strong>L1/L2 cache</strong> interplay with DRAM.</p><p>That night, <strong>DeepSeek and PTX became more than concepts</strong>. They bridged my compiler obsession with high-performance GPU computing. </p><p>I realized that to truly understand CUDA, I needed to follow the full path: <strong>high-level CUDA kernels &#8594; NVCC &#8594; LLVM IR &#8594; PTX &#8594; cubins &#8594; SM pipelines</strong>, where thousands of threads coordinate in lockstep to perform computations that would overwhelm any CPU.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>Tracing CUDA Kernels, PTX, and the Hardware Dance</h2><p>When I first started writing CUDA kernels, my experiments were deliberately simple: matrix addition,<strong> vector scaling,</strong> small convolution operations. I wanted to see, concretely, how high-level code transformed as it moved down the compilation chain. </p><p>I compiled these kernels and examined the PTX emitted by NVCC. 
The experience was revelatory, almost like peeling <strong>back a layer of abstraction</strong> to reveal a hidden parallel universe of computation.</p><p>PTX exposed the fundamental building blocks of GPU execution:</p><ul><li><p><strong>Thread indexing:</strong> The CUDA built-ins <code>threadIdx</code>, <code>blockIdx</code>, and <code>blockDim</code>, which PTX exposes as the special registers <code>%tid</code>, <code>%ctaid</code>, and <code>%ntid</code>, define each thread&#8217;s unique identity in a multidimensional grid. Reading the PTX, I could see how each thread calculates its global index as <code>blockIdx * blockDim + threadIdx</code>. </p></li><li><p><strong>Memory coalescing:</strong> PTX made it clear how critical access patterns are. When the threads of a warp access contiguous, aligned addresses, the hardware combines their requests into a few wide transactions, dramatically increasing throughput. Misaligned or strided accesses instead split into many transactions, wasting bandwidth and stalling the warp. </p></li><li><p><strong>Shared memory and bank conflicts:</strong> Threads in a block share an on-chip memory that is divided into banks. When several threads of a warp access different addresses in the same bank, those accesses are serialized, a subtle yet crucial bottleneck.</p></li><li><p><strong>Predication:</strong> Divergent branches are another warp-level concern. PTX can express short conditionals as predicated instructions: both paths are issued, and results are masked for the threads whose predicate is false. Seeing divergent branches lowered into predicated instructions showed me how compiler heuristics, IR transforms, and PTX scheduling work together for correct, parallel execution.</p></li></ul><p>I remember embedding <strong>inline PTX</strong> inside simple kernels, controlling instruction ordering, memory patterns, and even warp-level operations that CUDA C++ itself did not expose directly. 
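</p><p>As one hedged illustration of that kind of inline PTX, the sketch below reads the <code>%laneid</code> special register, a thread&#8217;s position within its warp, for which CUDA C++ offers no built-in variable. The helper <code>laneId</code> and the kernel <code>writeLaneIds</code> are hypothetical names chosen for this example; they are not taken from the experiments described here.</p><pre><code><code>// Hypothetical helper: returns this thread's lane index within its warp.
__device__ unsigned laneId() {
    unsigned lane;
    // "%%laneid" is the PTX special register %laneid; "=r" binds a 32-bit output operand.
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    return lane;
}

// Hypothetical kernel: each thread records its lane index.
__global__ void writeLaneIds(unsigned* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i &lt; n) out[i] = laneId();
}</code></code></pre><p>For a one-dimensional block, the value matches <code>threadIdx.x % warpSize</code>; the point of the sketch is the <code>asm</code> operand syntax, in which <code>%0</code> names the first operand and <code>%%</code> escapes a literal percent sign. 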
</p><p>It was a hands-on lesson: the GPU is not just a black box for parallelism; it is a layered ecosystem of<strong> instruction streams</strong>, memory banks, and scheduling heuristics, all orchestrated to maximize throughput.</p><div><hr></div><h2>Matrix Multiplication and Warp-Level Optimization</h2><p>After mastering vector addition, I wanted a more challenging experiment &#8212; something that truly tested the GPU&#8217;s compute and memory subsystems. I chose <strong>matrix multiplication</strong>, the classic computational kernel in high-performance computing, AI, and scientific simulations.</p><p>The first na&#239;ve implementation looked like this:</p><pre><code><code>__global__ void matMulNaive(float* A, float* B, float* C, int N) {
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    int col = threadIdx.x + blockIdx.x * blockDim.x;

    if(row &lt; N &amp;&amp; col &lt; N){
        float sum = 0.0f;
        for(int k = 0; k &lt; N; ++k){
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}</code></code></pre><p>At a high level, this seems straightforward: each thread computes a single element of the result matrix. </p><p>But my experience with vector addition had taught me to <strong>look deeper</strong>, beyond correctness, into how this kernel would map onto PTX and SM pipelines.</p><div><hr></div><h2>LLVM IR insights</h2><p>I compiled the kernel with <strong>NVCC</strong> targeting PTX and examined the device-side <strong>LLVM IR</strong> (NVCC lowers through NVVM IR, an LLVM-based representation). Immediately, several patterns emerged:</p><ul><li><p><strong>Only the inner loop is an IR loop</strong>: The iteration over rows and columns does not appear as loops at all; it is folded into thread indexing arithmetic. The inner loop over <code>k</code> is a canonical loop with <code>phi</code> nodes tracking <code>k</code> and the accumulator <code>sum</code>.</p></li><li><p><strong>SSA form and virtual registers</strong>: Every value (<code>row</code>, <code>col</code>, <code>k</code>, <code>sum</code>) lives in a virtual register. Optimization passes can combine or eliminate these values, and register allocation later decides which of them stay in physical registers and which are spilled.</p></li><li><p><strong>Load/store separation</strong>: Every access to <code>A[row * N + k]</code> and <code>B[k * N + col]</code> generates explicit <code>load</code> instructions. LLVM can apply <strong>scalar replacement</strong> or <strong>loop-invariant code motion</strong>, hoisting computations such as <code>row * N</code> out of the loop to reduce redundant instructions.</p></li></ul><p>I remember one night sitting with my IR and a notebook, tracing the <code>phi</code> nodes and loop unrolling transformations. I observed how a small tweak in IR, <em>say unrolling the inner loop by 2</em>, could double the number of live registers per thread. 
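</p><p>To make that tweak concrete, here is a minimal source-level sketch of unrolling the inner loop by 2, assuming the same row-major layout as <code>matMulNaive</code>. The kernel name <code>matMulUnroll2</code> and the two-accumulator structure are illustrative assumptions, not the exact transformation LLVM performs; splitting the sum also changes the floating-point summation order, so results may differ in the last bits.</p><pre><code><code>// Hypothetical variant of matMulNaive with the k-loop unrolled by 2.
__global__ void matMulUnroll2(const float* A, const float* B, float* C, int N) {
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    int col = threadIdx.x + blockIdx.x * blockDim.x;

    if (row &lt; N &amp;&amp; col &lt; N) {
        // Two independent accumulators: more instruction-level parallelism,
        // but also more values live at the same time.
        float sum0 = 0.0f, sum1 = 0.0f;
        int k = 0;
        for (; k + 1 &lt; N; k += 2) {
            sum0 += A[row * N + k]     * B[k * N + col];
            sum1 += A[row * N + k + 1] * B[(k + 1) * N + col];
        }
        if (k &lt; N)  // leftover iteration when N is odd
            sum0 += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum0 + sum1;
    }
}</code></code></pre><p>Compiling both versions with <code>nvcc -Xptxas -v</code> prints the register count of each kernel, which makes the cost of the extra accumulator and addresses visible. 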
</p><p>I immediately understood that <strong>naive unrolling without accounting for register pressure could reduce occupancy</strong>, a lesson painfully learned with vector addition.</p><div><hr></div><h2>PTX, a parallel symphony</h2><p>After IR, I looked at the generated PTX. Simplified for readability, with symbolic register names, the arithmetic, loads, and stores became explicit <strong>thread-local instructions</strong>:</p><pre><code><code>// Simplified sketch: .reg declarations and the parameter loads of A, B, C, N are omitted.
mov.u32 %tid_y, %tid.y;                    // threadIdx.y
mov.u32 %bid_y, %ctaid.y;                  // blockIdx.y
mov.u32 %bdim_y, %ntid.y;                  // blockDim.y
mad.lo.u32 %row, %bid_y, %bdim_y, %tid_y;  // row = blockIdx.y * blockDim.y + threadIdx.y

mov.u32 %tid_x, %tid.x;                    // threadIdx.x
mov.u32 %bid_x, %ctaid.x;                  // blockIdx.x
mov.u32 %bdim_x, %ntid.x;                  // blockDim.x
mad.lo.u32 %col, %bid_x, %bdim_x, %tid_x;  // col = blockIdx.x * blockDim.x + threadIdx.x

setp.ge.u32 %p0, %row, %N;
setp.ge.u32 %p1, %col, %N;
or.pred %p2, %p0, %p1;
@%p2 bra DONE;                             // out-of-bounds threads skip the loop

mov.f32 %sum, 0.0;
mov.u32 %k, 0;                             // loop counter starts at zero
LOOP:
mad.lo.u32 %a_idx, %row, %N, %k;           // row * N + k
mad.wide.u32 %a_addr, %a_idx, 4, %A;       // byte address of A[row * N + k]
ld.global.f32 %a, [%a_addr];
mad.lo.u32 %b_idx, %k, %N, %col;           // k * N + col
mad.wide.u32 %b_addr, %b_idx, 4, %B;       // byte address of B[k * N + col]
ld.global.f32 %b, [%b_addr];
fma.rn.f32 %sum, %a, %b, %sum;             // sum = a * b + sum
add.u32 %k, %k, 1;
setp.lt.u32 %cond, %k, %N;
@%cond bra LOOP;
mad.lo.u32 %c_idx, %row, %N, %col;         // row * N + col
mad.wide.u32 %c_addr, %c_idx, 4, %C;       // byte address of C[row * N + col]
st.global.f32 [%c_addr], %sum;
DONE:
</code></code></pre><p>I was fascinated by several features:</p><ul><li><p><strong>Fused multiply-add (FMA) instructions</strong>: By default, NVCC contracts a multiply followed by an add into a single <code>fma.rn.f32</code>, performing two floating-point operations per instruction with a single rounding.</p></li><li><p><strong>Predicated branches</strong>: The bounds check becomes a predicate register guarding a branch (<code>@%p2 bra DONE</code>), so out-of-bounds threads skip the loop while the rest of the warp continues.</p></li><li><p><strong>Thread indexing arithmetic</strong>: Derived directly from <code>threadIdx</code>, <code>blockIdx</code>, and <code>blockDim</code>. Every later address computation depends on these indices, which is what keeps thousands of threads working on distinct elements.</p></li></ul><p>But even this PTX was naive. Each thread individually loaded elements of <code>A</code> and <code>B</code> from global memory, so the same elements were fetched again and again by different threads. I knew from my vector addition experiments that <strong>misaligned global accesses </strong>and <strong>low arithmetic intensity </strong>would throttle performance.</p><div><hr></div><h2>Shared memory tiling</h2><p>The next step was introducing <strong>shared memory tiling</strong>, a standard GPU optimization. I rewrote the kernel to load tiles of <code>A</code> and <code>B</code> into shared memory, perform the multiply-accumulate locally, and then write back the result:</p><pre><code><code>__global__ void matMulTiled(float* A, float* B, float* C, int N){
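    // Assumes a 32x32 thread block (blockDim.x == blockDim.y == 32).
    __shared__ float As[32][32];
    __shared__ float Bs[32][32];

    int row = threadIdx.y + blockIdx.y*blockDim.y;
    int col = threadIdx.x + blockIdx.x*blockDim.x;
    float sum = 0.0f;

    // Round up so a partial final tile is covered when N is not a multiple of 32.
    for(int tile=0; tile &lt; (N + 31)/32; ++tile){
        int aCol = tile*32 + threadIdx.x;
        int bRow = tile*32 + threadIdx.y;
        // Out-of-range elements are zero-filled so they add nothing to the sum.
        As[threadIdx.y][threadIdx.x] = (row &lt; N &amp;&amp; aCol &lt; N) ? A[row*N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow &lt; N &amp;&amp; col &lt; N) ? B[bRow*N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before any thread reads it

        for(int k=0; k&lt;32; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // all reads finished before the next tile overwrites it
    }
    // Every thread reaches both barriers; only in-range threads store a result.
    if(row &lt; N &amp;&amp; col &lt; N)
        C[row*N + col] = sum;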
}</code></code></pre><p>Here, I could <strong>directly control shared memory layout</strong>, a factor invisible at the high-level IR stage. I experimented with <strong>bank conflicts</strong>, adjusting tile strides so that the threads of a warp did not access the same bank simultaneously. </p><p>The performance gains were immediate: <strong>bandwidth utilization improved, and warp stalls due to serialized accesses dropped dramatically</strong>.</p><div><hr></div><h2>Warp scheduling, occupancy, and register pressure</h2><p>Next, I traced the PTX generated by the tiled kernel. The number of live registers increased because each thread now held <code>sum</code>, multiple shared memory references, and loop counters. I correlated this with <strong>SM occupancy</strong>:</p><ul><li><p><strong>Registers per thread</strong>: Raising usage to ~64 per thread reduced active warps per SM from 64 to 32. On an SM with 65,536 registers, 64 registers &#215; 32 threads &#215; 32 warps already exhausts the register file.</p></li><li><p><strong>Warp interleaving</strong>: Fewer active warps reduced the hardware&#8217;s ability to hide memory latency.</p></li><li><p><strong>Shared memory allocation</strong>: Large tiles per block reduced the number of blocks that could reside concurrently on an SM.</p></li></ul><p>These experiments mirrored what I had seen in LLVM: <strong>IR-level decisions cascade all the way to warp execution and memory latency hiding</strong>.</p><div><hr></div><h2>Inline PTX Experiments in Matrix Multiplication</h2><p>I pushed further, embedding <strong>inline PTX inside tiled kernels</strong>. Inside the inner loop, the operands are bound to the PTX through <code>asm</code> constraints:</p><pre><code><code>float fA, fB;  // operands loaded from the shared tiles
unsigned aAddr = (unsigned)__cvta_generic_to_shared(&amp;As[threadIdx.y][k]);  // 32-bit shared-space address
unsigned bAddr = (unsigned)__cvta_generic_to_shared(&amp;Bs[k][threadIdx.x]);
asm volatile(
    "ld.shared.f32 %1, [%3];\n\t"
    "ld.shared.f32 %2, [%4];\n\t"
    "fma.rn.f32 %0, %1, %2, %0;\n\t"
    : "+f"(sum), "=f"(fA), "=f"(fB)      // %0 accumulator, %1 and %2 loaded values
    : "r"(aAddr), "r"(bAddr) : "memory"  // %3 and %4 shared-memory addresses
);</code></code></pre><p>This allowed me to:</p><ul><li><p>Test <strong>instruction ordering</strong> and <strong>latency hiding</strong> manually.</p></li><li><p>Experiment with <strong>predication and divergent threads</strong>, observing how small PTX-level changes could improve warp execution.</p></li><li><p>Measure <strong>shared memory bank conflicts</strong> in real time, adjusting indexing to maximize throughput.</p></li></ul><p>I could see the subtle interplay between <strong>LLVM IR choices, PTX transformations, and hardware execution</strong>, solidifying my understanding that GPU performance is a multi-layered orchestration.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>My deep experiments</h2><p>I approached CUDA like a lab. I experimented relentlessly:</p><ul><li><p>Observing how LLVM IR loop unrolling affects <strong>PTX instruction</strong> counts and warp efficiency.</p></li><li><p>Measuring occupancy changes as <strong>kernel launch parameters</strong> and register usage were modified.</p></li><li><p>Embedding inline PTX to explore <strong>instruction scheduling</strong>, shared memory bank conflicts, and predication.</p></li><li><p>Mapping thread blocks to <strong>SMs </strong>to study warp interleaving and latency hiding.</p></li></ul><p>These experiments made abstract compiler and microarchitectural concepts tangible. Every IR pass, PTX instruction, and <strong>memory layout</strong> decision was observable in execution behavior. 
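</p><p>The occupancy measurements can be approximated from the host with the CUDA runtime&#8217;s occupancy API. The sketch below is a minimal, hedged example: the kernel <code>scaleKernel</code> and the block sizes tried are assumptions made for illustration, not the kernels measured in these experiments, and error checking is omitted for brevity.</p><pre><code><code>#include &lt;cstdio&gt;
#include &lt;cuda_runtime.h&gt;

// Hypothetical stand-in kernel; any __global__ function can be queried the same way.
__global__ void scaleKernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i &lt; n) data[i] *= factor;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&amp;prop, 0);  // properties of device 0

    const int blockSizes[] = {128, 256, 512, 1024};
    for (int blockSize : blockSizes) {
        int blocksPerSM = 0;
        // Resident blocks per SM for this kernel at this block size,
        // with no dynamic shared memory requested.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&amp;blocksPerSM, scaleKernel, blockSize, 0);

        int activeWarps = blocksPerSM * blockSize / prop.warpSize;
        int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
        printf("block %4d: %d blocks/SM, %d of %d warps\n",
               blockSize, blocksPerSM, activeWarps, maxWarps);
    }
    return 0;
}</code></code></pre><p>Because the query accounts for the kernel&#8217;s register and shared memory usage, recompiling with a register cap such as <code>-maxrregcount</code> and rerunning shows how the resident warp count responds. 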
</p><p>I saw clearly how software abstractions orchestrate silicon, and how performance is a delicate balance of register allocation, memory coalescing, and warp-level scheduling.</p><div><hr></div><h2>Why this series exists</h2><p>By the end of this journey, I knew why I had to write this series. I wanted more than a cursory &#8220;launch a kernel&#8221; tutorial. My goal is a narrative-driven, technically rigorous exploration of CUDA:</p><ul><li><p><em>LLVM IR transformations and compiler mechanics.</em></p></li><li><p><em>PTX intermediate representation and optimization strategies.</em></p></li><li><p><em>CUDA kernel design, thread blocks, and warps.</em></p></li><li><p><em>GPU microarchitecture, SM pipelines, and memory hierarchy.</em></p></li><li><p><em>Real-world performance strategies, from occupancy tuning to memory coalescing.</em></p></li></ul><p>This series is my attempt to <strong>bridge software and hardware</strong>, guiding readers through the journey I experienced. </p><p>Curiosity sparked by DeepSeek, insights crystallized by LLVM, experiments refined in PTX, and understanding solidified by GPU microarchitecture. </p><p>Readers will trace kernels from CUDA C++ to LLVM IR, into PTX, and finally to SM execution, experiencing the choreography of parallel computation firsthand.</p><p>You will see not only <strong>how</strong> to write CUDA code, but <strong>why</strong> each instruction executes as it does, <em>how</em> threads collaborate within warps, and <em>how </em>to extract maximum performance by understanding the deep interplay of compiler optimizations, PTX abstractions, and hardware realities.</p><p>Enjoy this incredible journey! 
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/p/mastering-cuda-and-high-performance?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[How Meta turned the Linux Kernel into a planet-scale Load Balancer. Part III]]></title><description><![CDATA[A deep architectural narrative on XDP, eBPF, stateless routing, and why hyperscale traffic outgrew proxies.]]></description><link>https://www.thesoftwarefrontier.com/p/how-meta-turned-the-linux-kernel-f39</link><guid isPermaLink="false">https://www.thesoftwarefrontier.com/p/how-meta-turned-the-linux-kernel-f39</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Fri, 20 Feb 2026 13:03:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!IdzI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de3d029-9b20-4aa9-a21d-739a300271b1_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IdzI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de3d029-9b20-4aa9-a21d-739a300271b1_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!IdzI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de3d029-9b20-4aa9-a21d-739a300271b1_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!IdzI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de3d029-9b20-4aa9-a21d-739a300271b1_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!IdzI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de3d029-9b20-4aa9-a21d-739a300271b1_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!IdzI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de3d029-9b20-4aa9-a21d-739a300271b1_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IdzI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de3d029-9b20-4aa9-a21d-739a300271b1_1024x1536.png" width="1024" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1de3d029-9b20-4aa9-a21d-739a300271b1_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3278182,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/187942914?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de3d029-9b20-4aa9-a21d-739a300271b1_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IdzI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de3d029-9b20-4aa9-a21d-739a300271b1_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!IdzI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de3d029-9b20-4aa9-a21d-739a300271b1_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!IdzI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de3d029-9b20-4aa9-a21d-739a300271b1_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!IdzI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de3d029-9b20-4aa9-a21d-739a300271b1_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Introduction</h2><p>Engineers who have spent many years in the trenches of <strong>distributed systems</strong> develop an instinctive radar for complexity. </p><p>It&#8217;s a radar that basically never sleeps. It picks up on the faintest signs that a system is about to balloon beyond comprehension: the silent growth of <strong>memory tables</strong>, the small code change that cascades into unpredictable failure, the queues that inflate exponentially under the first unexpected burst. </p><p>At small or medium scale, these things are manageable. You <strong>add retries</strong>, per-flow connection tracking, dynamic allocation: you feel just a bit more clever. But scale up to planetary levels, and these &#8220;<em>clever</em>&#8221; additions become the system&#8217;s <strong>Achilles&#8217; heel.</strong></p><p> Every bit of mutable state, every <strong>dynamic structure</strong>, every per-flow table is a ticking time bomb. And this is precisely where Katran finds its elegance.</p><p>Katran strips away all unnecessary complexity. It reduces load balancing to <strong>pure, deterministic computation</strong>, executed at the first possible moment a packet enters the kernel. </p><p>There are <strong>no per-flow tables</strong>. There are no loops. There are no dynamic allocations. There is no memory of the past. Every packet is an independent, self-contained entity, evaluated as a function: headers in, backend out, <strong>no storage</strong>, no cleanup, no drama. 
</p><p>It is, in a sense, <strong>load balancing as mathematics</strong>, executed at line rate.</p><div><hr></div><h2>Encountering the XDP Program</h2><p>When you first lay eyes on <strong>Katran&#8217;s XDP program</strong>, disbelief is natural. </p><p>You instinctively start looking for the infrastructure you expect from a &#8220;<em>normal</em>&#8221; load balancer: socket managers, <strong>memory pools</strong>, per-flow hash tables, connection cleanup threads. </p><blockquote><p><em>Where are the timers that expire idle flows? </em></p><p><em>Where is the heap allocation for buffering partial packets? </em></p><p><em>Where is the messy user-space coordination? </em></p></blockquote><p>The answer, surprisingly, is: none of that exists. Every omission is intentional, dictated by the <strong>eBPF verifier&#8217;s strict execution model</strong> and the program&#8217;s architectural goal: stateless, line-rate forwarding.</p><p>The <strong>eBPF verifier</strong> acts as a pedantic static analyzer for the kernel. At load time, it simulates every possible execution path through the program, tracking all pointers, offsets, and integers. </p><p>It ensures that no memory access exceeds the bounds of the packet buffer (<code>ctx-&gt;data</code> to <code>ctx-&gt;data_end</code>), no pointer arithmetic escapes the range the verifier has proven safe, and no <strong>uninitialized value</strong> is ever read. </p><p>If any path could read or write past <code>data_end</code>, or touch memory outside the regions the verifier can prove safe (the packet, the stack, a BPF map value), the verifier refuses to load the program. 
</p><p>This is why XDP code is highly defensive: explicit casts, pointer comparisons, early exits, and carefully sequenced checks are everywhere. In short, <strong>you can&#8217;t crash the kernel because the verifier won&#8217;t allow it</strong>.</p><p>Katran leverages this discipline elegantly. Every packet is validated before it touches the forwarding logic. Parsing begins at the Ethernet layer:</p><pre><code><code>struct ethhdr *eth = data;
if ((void *)(eth + 1) &gt; data_end)
    return XDP_DROP;

if (eth-&gt;h_proto != htons(ETH_P_IP))
    return XDP_PASS;
</code></code></pre><p>Here, <code>data</code> and <code>data_end</code> are the packet bounds taken from <code>ctx-&gt;data</code> and <code>ctx-&gt;data_end</code>, and <code>(eth + 1)</code> calculates the memory address immediately after the Ethernet header. Comparing this against <code>data_end</code> ensures that the packet buffer fully contains the <strong>Ethernet header</strong>. </p><p>If not, the packet is dropped (<code>XDP_DROP</code>). Non-IPv4 traffic is passed upstream (<code>XDP_PASS</code>) because this simplified walkthrough only routes IPv4 packets. </p><p>Notice that <strong>no memory allocation occurs</strong>, and <strong>no state is stored</strong>; the check is purely functional, evaluating the packet as it exists in memory.</p><p>Next, the program parses the IP header:</p><pre><code><code>struct iphdr *ip = data + sizeof(*eth);
if ((void *)(ip + 1) &gt; data_end)
    return XDP_DROP;

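if (ip-&gt;ihl != 5)    /* IP options present: the fixed sizeof(*ip) offset used next would be wrong */
    return XDP_DROP;  /* assumption in this sketch: reject such packets rather than parse options */

if (ip-&gt;protocol != IPPROTO_TCP)
    return XDP_PASS;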
</code></code></pre><p>Adding <code>sizeof(*eth)</code> to <code>data</code> places <code>ip</code> exactly at the start of the IP header, assuming an untagged Ethernet frame. The explicit comparison guarantees that <code>(ip + 1)</code> does not exceed <code>data_end</code>, and the verifier refuses to load the program if that comparison is missing. Any truncated packet is dropped immediately. </p><p>Non-TCP packets are passed upstream. Again, <strong>no dynamic structures are created</strong>, no per-flow tables are touched, no queues are used. Every packet is an independent entity, evaluated purely by its content.</p><p>The final layer of parsing handles <strong>TCP headers:</strong></p><pre><code><code>struct tcphdr *tcp = (void *)ip + sizeof(*ip);
if ((void *)(tcp + 1) &gt; data_end)
    return XDP_DROP;
</code></code></pre><p>The program performs the same bounds validation at the TCP layer. </p><p>The cast <code>(void *)ip</code> is required because C scales pointer arithmetic by the size of the pointed-to type: without it, <code>ip + sizeof(*ip)</code> would advance by 20 whole <code>struct iphdr</code> elements (400 bytes) instead of 20 bytes. Casting to <code>void *</code>, whose arithmetic GCC and Clang treat as byte-wise, removes this <strong>type-dependent scaling</strong>. The fixed <code>sizeof(*ip)</code> offset also assumes an IPv4 header without options; a header with options is longer, so a parser must either honor <code>ip-&gt;ihl</code> or reject such packets. </p><p>Every field access is verified at load time, ensuring memory safety. </p><p>By the time the packet reaches this stage, Katran has guaranteed that Ethernet, IP, and TCP headers are complete and accessible: <strong>all without allocating a single byte of memory</strong>.</p><h3>Deterministic Hashing</h3><p>With headers validated, Katran computes a deterministic <strong>5-tuple hash</strong>:</p><pre><code><code>__u64 hash = hash_5tuple(ip-&gt;saddr, ip-&gt;daddr,
                         tcp-&gt;source, tcp-&gt;dest,
                         ip-&gt;protocol);
</code></code></pre><p>This hash forms the cornerstone of the system. It alone determines the backend to which the packet will be forwarded. </p><p>The kernel holds <strong>no history of past flows</strong>; there is no connection table. There are no retries, no queues, no cleanup tasks. Every packet is a self-contained function:</p><pre><code><code>Packet headers &#8594; 5-tuple hash &#8594; ring map lookup &#8594; backend map &#8594; packet rewrite &#8594; XDP_TX
</code></code></pre><h3>eBPF Map Lookups</h3><p>The program then queries two <strong>BPF maps</strong>: <code>ring_map</code> and <code>backend_map</code>. The <code>ring_map</code> implements a <strong>Maglev-style consistent hash ring</strong>. </p><p>Conceptually, it&#8217;s a fixed-size array where each slot points to a backend index, precomputed in the control plane. The <code>backend_map</code> holds the IP and port for each backend. </p><p>Both maps are <strong>read-only from the kernel&#8217;s perspective during packet processing</strong>; the control plane updates them asynchronously.</p><pre><code><code>__u32 slot = hash % RING_SIZE; /* ring_map is keyed by slot index, not by the raw 64-bit hash */
__u32 *backend_idx = bpf_map_lookup_elem(&amp;ring_map, &amp;slot);
if (!backend_idx)
    return XDP_DROP;

struct backend *b = bpf_map_lookup_elem(&amp;backend_map, backend_idx);
if (!b)
    return XDP_DROP;

ip-&gt;daddr = b-&gt;ip;
tcp-&gt;dest = b-&gt;port;

recalc_ip_checksum(ip);
recalc_tcp_checksum(ip, tcp);

return XDP_TX;
</code></code></pre><p>Notice the simplicity and determinism. The kernel reads the slot atomically, rewrites the packet headers, recalculates checksums, and transmits. </p><p>No per-flow state, no locks, no dynamic weighting: all O(1), purely functional. The <strong>XDP_TX return code</strong> sends the frame straight back out through the TX ring of the NIC it arrived on, bypassing sockets entirely. </p><p>The packet never enters the <strong>kernel networking stack</strong>, avoiding context switches, scheduling delays, or buffer copies.</p><h3>Line-Rate Safety and Verifier Compliance</h3><p>At first glance, Katran&#8217;s code looks almost painfully minimalist. But every defensive check, every explicit cast, and every early return exists because the <strong>eBPF verifier demands total memory safety</strong>. </p><p>The verifier never sees live traffic. At load time it walks each possible execution path symbolically, treating the packet contents as unknown. If any access could exceed <code>data_end</code> or touch uninitialized memory, the program would fail to load. </p><p>This ensures that Katran&#8217;s XDP program is <strong>guaranteed to execute safely and in bounded time</strong>, independent of traffic patterns, packet sizes, or maliciously malformed headers. Line rate then follows from how little work each packet requires, not from the verifier itself.</p><p>By combining <strong>careful pointer arithmetic</strong>, <strong>deterministic 5-tuple hashing</strong>, and <strong>atomic BPF map reads</strong>, Katran transforms each packet into a <strong>stateless, functional computation</strong>. </p><p>No connections are remembered. No resources are dynamically allocated. No loops or locks slow the datapath. 
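</p><p>The whole chain can be exercised outside the kernel. The following is a minimal user-space sketch, not Katran&#8217;s code: the byte offsets assume an untagged Ethernet frame and option-less IPv4 and TCP headers, <code>hash_5tuple</code> is a hypothetical FNV-1a fold (the article does not define its hash function), and <code>has_room</code> and <code>pick_slot</code> are illustrative names. It applies the same check-before-read discipline to a plain byte buffer and reduces the 5-tuple to a ring slot:</p><pre><code><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define RING_SIZE 65537u
#define ETH_HLEN  14u  /* untagged Ethernet header */
#define IP_HLEN   20u  /* IPv4 header without options */
#define TCP_HLEN  20u  /* TCP header without options */

/* Same question the XDP code asks with (void *)(hdr + 1) &gt; data_end, phrased with
 * lengths so that user-space C never forms an out-of-range pointer. */
static int has_room(const uint8_t *p, size_t need, const uint8_t *data_end)
{
    return (size_t)(data_end - p) &gt;= need;
}

/* Hypothetical hash: 64-bit FNV-1a folded over the five fields in a fixed order. */
static uint64_t fnv1a(uint64_t h, const void *p, size_t n)
{
    const uint8_t *b = p;
    for (size_t i = 0; i &lt; n; i++) {
        h ^= b[i];
        h *= 0x100000001b3ULL;  /* FNV prime */
    }
    return h;
}

static uint64_t hash_5tuple(uint32_t saddr, uint32_t daddr,
                            uint16_t sport, uint16_t dport, uint8_t proto)
{
    uint64_t h = 0xcbf29ce484222325ULL;  /* FNV offset basis */
    h = fnv1a(h, &amp;saddr, sizeof saddr);
    h = fnv1a(h, &amp;daddr, sizeof daddr);
    h = fnv1a(h, &amp;sport, sizeof sport);
    h = fnv1a(h, &amp;dport, sizeof dport);
    return fnv1a(h, &amp;proto, sizeof proto);
}

/* Returns the ring slot for a frame, or -1 when the frame is truncated
 * or is not option-less IPv4 carrying TCP. */
static long pick_slot(const uint8_t *data, size_t len)
{
    const uint8_t *data_end = data + len;
    const uint8_t *ip, *tcp;
    uint32_t saddr, daddr;
    uint16_t sport, dport;

    if (!has_room(data, ETH_HLEN, data_end)) return -1;
    if (data[12] != 0x08 || data[13] != 0x00) return -1;  /* EtherType != IPv4 */

    ip = data + ETH_HLEN;
    if (!has_room(ip, IP_HLEN, data_end)) return -1;
    if ((ip[0] &amp; 0x0f) != 5) return -1;                   /* IHL != 5: options present */
    if (ip[9] != 6) return -1;                            /* protocol != TCP */

    tcp = ip + IP_HLEN;
    if (!has_room(tcp, TCP_HLEN, data_end)) return -1;

    memcpy(&amp;saddr, ip + 12, sizeof saddr);
    memcpy(&amp;daddr, ip + 16, sizeof daddr);
    memcpy(&amp;sport, tcp, sizeof sport);
    memcpy(&amp;dport, tcp + 2, sizeof dport);
    return (long)(hash_5tuple(saddr, daddr, sport, dport, ip[9]) % RING_SIZE);
}
</code></code></pre><p>Calling <code>pick_slot</code> twice on the same frame returns the same slot, and a frame cut short by a single byte is rejected before any TCP field is read. The real datapath gets the same property statically, because the verifier will not load a program that skips one of these checks.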
</p><p>This is <strong>minimalism enforced by both design and verifier</strong>; the kernel&#8217;s version of &#8220;<em>if it isn&#8217;t necessary, it doesn&#8217;t exist</em>&#8221;.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>Maglev-Style Hash Ring Construction</h2><p>Katran&#8217;s stateless elegance rests on one deceptively simple concept: a <strong>precomputed Maglev-style consistent hash ring</strong>. </p><p>At first glance, it&#8217;s just an array, a circle of <code>RING_SIZE</code> slots. But in reality, it is the linchpin that allows packets to be routed <strong>deterministically, without storing per-flow state, and without any runtime iteration over backends</strong>. </p><p>Each backend server occupies multiple slots on this ring, proportionally to its assigned weight. Packets are mapped to slots via the <strong>5-tuple hash</strong> computed in the kernel. </p><p>Once the hash is computed, the kernel does nothing more than perform a BPF map lookup: the heavy lifting has already been done by the <strong>control plane. </strong></p><p>The XDP program <strong>never recomputes the ring</strong>, never tracks the flow history, and never allocates memory: it simply consults the map, rewrites headers, and transmits.</p><h3>Virtual Nodes and Weighted Distribution</h3><p>The magic lies in <strong>virtual nodes</strong>. Each backend is assigned a number of virtual nodes proportional to its weight. </p><p>A backend that is twice as capable as another receives twice as many virtual nodes, which in turn means it will receive roughly twice the traffic. 
The control plane distributes these virtual nodes <strong>pseudo-randomly but deterministically</strong> across the ring. </p><p>Deterministic here means that given the <strong>same backend configuration </strong>and weight, every calculation produces the same ring layout: a property essential for consistent flow mapping and minimal disruption during updates.</p><p>When a backend fails, only the slots associated with its virtual nodes are removed. The remaining slots for <strong>other backends</strong> are untouched. </p><p>This <strong>minimal disruption</strong> ensures that the vast majority of traffic continues to flow to existing backends without rerouting, avoiding cache thrashing, queue spikes, or packet reordering at the network level. </p><p>It&#8217;s a conceptually simple idea, but its effects on <strong>reliability </strong>and <strong>predictability </strong>at planetary scale are profound.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>Control Plane Construction</h2><p>Here is a distilled view of how the control plane constructs the ring:</p><pre><code><code>#define RING_SIZE 65537
#define VIRTUAL_NODES_PER_BACKEND 100

struct backend {
    __u32 ip;
    __u16 port;
    __u32 weight;
};

__u32 ring[RING_SIZE]; // 0 = empty, 1..N = backend indices

for (int b = 0; b &lt; num_backends; b++) {
    int virtual_nodes = backends[b].weight * VIRTUAL_NODES_PER_BACKEND;
    for (int v = 0; v &lt; virtual_nodes; v++) {
        __u64 hash = hash_fn(backends[b].ip, backends[b].port, v);
        __u32 offset = hash % RING_SIZE;
        __u32 skip = (hash % (RING_SIZE - 1)) + 1;

        while (ring[offset] != 0) {
            offset = (offset + skip) % RING_SIZE;
        }
        ring[offset] = b + 1; // backend index + 1 because 0 = empty
    }
}
</code></code></pre><p>At first glance, it looks <strong>almost na&#239;ve:</strong> a nested loop, some arithmetic, and a while loop to resolve collisions. But every line has a precise purpose:</p><ol><li><p><code>hash_fn</code> combines the backend&#8217;s IP, port, and virtual node index to produce a deterministic pseudo-random value.</p></li><li><p><code>offset</code> determines the initial slot in the ring for this virtual node.</p></li><li><p><code>skip</code> ensures that collisions are resolved deterministically: if a slot is already occupied, we advance by <code>skip</code> until an empty slot is found. Because <code>RING_SIZE</code> is prime, every <code>skip</code> value eventually visits all slots, so the probe terminates as long as at least one slot is still empty.</p></li><li><p>The final assignment writes the backend index plus one, because zero is reserved to indicate &#8220;<em>empty</em>.&#8221;</p></li></ol><p>The result is a <strong>deterministic hash ring</strong>. As written, the loop claims only one slot per virtual node, so any slot that no virtual node lands on stays <code>0</code>, and the datapath must treat that value as &#8220;<em>no backend</em>.&#8221; The ring is <em>fully populated</em> only if the control plane keeps assigning slots until none is empty, which is what the original Maglev algorithm does by letting backends take turns claiming their next preferred slot. Either way, the control plane ends up with a complete map of which slot belongs to which backend, and crucially, this computation happens <strong>entirely outside the kernel</strong>, asynchronously from packet forwarding.</p><h3>Pushing the Ring into the Kernel</h3><p>Once the ring is computed, the control plane copies it into a <strong>BPF map</strong> in the kernel, one slot per call:</p><pre><code><code>for (int i = 0; i &lt; RING_SIZE; i++) {
    __u32 key = i;
    __u32 val = ring[i];
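    /* From user space the map is addressed by file descriptor; ring_map_fd is assumed to be obtained when the program is loaded. */
    bpf_map_update_elem(ring_map_fd, &amp;key, &amp;val, BPF_ANY);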
}
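</code></code></pre><p>Because the loop above writes one slot per call, a packet that arrives mid-update can observe a ring that is partly old and partly new. One common remedy is double buffering: build the next ring in a spare copy, then switch readers over in a single step. In BPF this is typically done with a map-in-map whose inner map is replaced by one update. The user-space sketch below only models the idea with two arrays and a C11 atomic index; <code>publish_ring</code>, <code>lookup_backend</code>, and <code>active</code> are hypothetical names, not part of Katran or libbpf:</p><pre><code><code>#include &lt;stdatomic.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define RING_SIZE 65537

static uint32_t rings[2][RING_SIZE];  /* two complete copies of the ring */
static _Atomic unsigned active;       /* index of the copy readers should use */

/* Writer: fill the inactive copy, then flip the index with one atomic store. */
static void publish_ring(const uint32_t *next)
{
    unsigned spare = 1u - atomic_load(&amp;active);
    memcpy(rings[spare], next, sizeof rings[spare]);
    atomic_store(&amp;active, spare);     /* readers switch to the new ring as a whole */
}

/* Reader: one atomic load picks the copy, one array read picks the backend. */
static uint32_t lookup_backend(uint64_t hash)
{
    unsigned cur = atomic_load(&amp;active);
    return rings[cur][hash % RING_SIZE];
}

/* Caveat: a reader that loaded `cur` just before two back-to-back publishes could
 * read a copy that is being rewritten. A real implementation must wait for such
 * readers (in the kernel, an RCU grace period) before reusing the spare copy. */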
</code></code></pre><p>From this moment, the XDP program doesn&#8217;t touch the ring logic. Each incoming packet:</p><ol><li><p>Computes its <strong>5-tuple hash.</strong></p></li><li><p>Calculates the slot index via <code>hash % RING_SIZE</code>.</p></li><li><p>Looks up the slot in the <code>ring_map</code> to find the backend index.</p></li><li><p>Looks up the<strong> backend IP</strong> and port in the <code>backend_map</code>.</p></li><li><p>Rewrites headers and transmits.</p></li></ol><p>The ring is <strong>immutable from the kernel&#8217;s perspective</strong> during packet processing. Updates, weight adjustments, or backend removals are handled entirely by the control plane, and applied atomically. </p><p>This allows Katran to handle <strong>failures and reconfigurations without ever stalling packet forwarding</strong>.</p><h3>Why This Matters</h3><p>The beauty of this approach is that the XDP datapath is <strong>entirely stateless and deterministic</strong>, yet fully aware of backend weights. </p><p>There are <strong>no loops in the kernel</strong>, no per-flow tracking, no locks or contention. The cost of adding more backends or adjusting weights is absorbed by the control plane; the kernel simply reads the updated maps. </p><p>This separation allows <strong>line-rate forwarding at planetary scale</strong>, where a single ingress could be handling millions of flows per second.</p><p>The Maglev ring is the perfect example of <strong>precomputation as a scaling strategy</strong>: do the complex work once in user-space, push a static, deterministic representation to the kernel, and let the stateless datapath execute at hardware speed. </p><p>Every packet becomes a pure function evaluation: <em>input headers &#8594; hash &#8594; slot &#8594; backend &#8594; rewritten packet &#8594; transmit. 
</em></p><p>No memory allocations, no state, no concurrency hazards; just <strong>physics applied to packets</strong>.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>Lookup and Deterministic Forwarding </h2><p>Once the packet has passed through Ethernet, IP, and TCP validation, the XDP program performs the most crucial step: <strong>deterministic forwarding using the precomputed Maglev-style hash ring</strong>. </p><p>In the kernel, this process is astonishingly simple: and yet it underpins the system&#8217;s ability to operate at <strong>line rate with planetary-scale fan-in</strong>.</p><pre><code><code>__u32 slot = hash % RING_SIZE;
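__u32 *backend_idx = bpf_map_lookup_elem(&amp;ring_map, &amp;slot);
if (!backend_idx) return XDP_DROP;  /* the verifier rejects any use of an unchecked lookup result */
struct backend *b = bpf_map_lookup_elem(&amp;backend_map, backend_idx);
if (!b) return XDP_DROP;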
</code></code></pre><p>Here, a single modulo operation maps the packet&#8217;s 5-tuple hash to a slot. </p><p>A <strong>single map lookup</strong> retrieves the backend index, and a second map lookup fetches the backend&#8217;s IP and port. That is all. </p><p>No loops, no runtime weight calculations, no dynamic per-flow state. <strong>O(1) complexity, deterministic results, purely functional evaluation.</strong></p><p>Because the datapath only reads the maps during packet processing, and each read is atomic, this architecture scales close to linearly with <strong>CPU cores</strong> and <strong>NIC queues. </strong></p><p>There are no locks, no contention, and no thread coordination. Each core can process its <strong>RX queue</strong> independently, and the control plane can update <code>ring_map</code> or <code>backend_map</code> asynchronously. </p><p><strong>Every packet is resolved against whatever the maps contain at that instant</strong>, whether it belongs to a new flow or an established one. An established flow keeps its backend only because the owner of its slot did not change, not because anything remembered it: all without disrupting the kernel datapath.</p><div><hr></div><h2>Failure as a First-Class Design Input </h2><p>Statelessness is not just an optimization; it is a <strong>failure-handling strategy baked into the architecture</strong>. Because the kernel holds no flow tables or connection metadata, failure is <strong>trivial to contain</strong>.</p><p>When a backend server fails, the control plane simply removes its virtual nodes from the hash ring. From the <strong>kernel&#8217;s perspective</strong>, nothing about the computation changes: packets still hash to the same slots, but the slots that belonged to the failed backend now point to surviving ones. 
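</p><p>The effect can be measured offline. The sketch below is an illustration under stated assumptions rather than Katran&#8217;s control plane: it uses the permutation-based population from the original Maglev design (backends take turns claiming their next preferred empty slot) instead of the probing loop shown earlier, a small prime ring of 251 slots for readability, and a hypothetical <code>mix</code> function to derive each backend&#8217;s <code>offset</code> and <code>skip</code>. It builds a ring for a given membership and counts how many slots changed owner between two rings:</p><pre><code><code>#include &lt;stdint.h&gt;

#define M 251            /* ring size; prime, so every skip visits all slots */
#define MAX_BACKENDS 16  /* assumption: populate() is called with n &lt;= MAX_BACKENDS */

/* Hypothetical integer mixer (splitmix64 finalizer) standing in for a real hash. */
static uint64_t mix(uint64_t x)
{
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x &gt;&gt; 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x &gt;&gt; 27)) * 0x94d049bb133111ebULL;
    return x ^ (x &gt;&gt; 31);
}

/* Fill ring[0..M-1] with backend ids; ids[] lists the n live backends. */
static void populate(const uint32_t *ids, int n, int32_t *ring)
{
    uint32_t offset[MAX_BACKENDS], skip[MAX_BACKENDS], next[MAX_BACKENDS];
    int filled = 0;

    if (n &lt;= 0) return;
    for (int i = 0; i &lt; n; i++) {
        offset[i] = (uint32_t)(mix(ids[i]) % M);
        skip[i]   = (uint32_t)(mix(ids[i] ^ 0xabcdefULL) % (M - 1)) + 1;
        next[i]   = 0;
    }
    for (int s = 0; s &lt; M; s++) ring[s] = -1;  /* -1 marks an empty slot */

    while (filled &lt; M) {
        for (int i = 0; i &lt; n &amp;&amp; filled &lt; M; i++) {
            uint32_t c = (offset[i] + next[i] * skip[i]) % M;
            while (ring[c] &gt;= 0) {  /* slot taken: try this backend's next preference */
                next[i]++;
                c = (offset[i] + next[i] * skip[i]) % M;
            }
            ring[c] = (int32_t)ids[i];
            next[i]++;
            filled++;
        }
    }
}

/* Count slots whose owner differs between two rings, ignoring slots owned by `gone`. */
static int moved_slots(const int32_t *before, const int32_t *after, uint32_t gone)
{
    int moved = 0;
    for (int s = 0; s &lt; M; s++)
        if (before[s] != (int32_t)gone &amp;&amp; before[s] != after[s])
            moved++;
    return moved;
}
</code></code></pre><p>With backends 1 to 5 and backend 3 removed, <code>populate</code> is run once per membership and <code>moved_slots(before, after, 3)</code> counts the slots that changed owner even though they never belonged to backend 3. Maglev keeps that number small but not necessarily zero; it accepts a little extra disruption in exchange for almost perfectly even slot counts.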
</p><p>There are no stale connection entries to purge, no retries to manage, no cache warm-ups required.</p><p>When a Katran node itself fails, upstream Anycast automatically redirects traffic. Each surviving node continues forwarding packets using the <strong>last-known-good maps</strong>, oblivious to the failure. </p><p>There is no need for global coordination, distributed state reconciliation, or multi-step failover. <strong>Failure domains remain local, predictable, and instantly recoverable.</strong></p><p>This is <strong>architectural serenity</strong>: the system is designed to survive failures because the kernel never held state that could be lost. Statelessness transforms complexity into predictability.</p><div><hr></div><h2>Visualizing Katran&#8217;s Packet Flow </h2><p>Understanding 5-tuple hashing, virtual node assignment, and <strong>BPF map lookups</strong> is abstractly simple, but seeing the full flow clarifies the elegance:</p><pre><code><code>[Client Packet]
    |
    v
[NIC RX Queue / RSS]
    |
    v
[XDP Parsing &amp; Validation]
    |
    v
[5-Tuple Hash Computation]
    |
    v
[Ring Map Lookup -&gt; Slot]
    |
    v
[Backend Map Lookup -&gt; Backend IP/Port]
    |
    v
[Packet Rewrite &amp; XDP_TX]
    |
    v
[Backend Server / Direct Server Return]

Control Plane (User Space)
    |
    |-- Computes virtual nodes per backend (weight-based)
    |-- Calculates offsets &amp; skips for Maglev ring
    |-- Updates ring_map in kernel atomically
    |-- Updates backend_map with IPs &amp; ports
</code></code></pre><p>Figure 1 illustrates this flow: the kernel operates entirely <strong>statelessly and deterministically</strong>, while the control plane orchestrates updates asynchronously. </p><p>The XDP program is <strong>immutable during packet processing</strong>, and the system naturally adapts to changes without disruption.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>Direct Server Return (DSR)</h2><p>Katran very often employs the so-called <strong>Direct Server Return (DSR)</strong>. </p><p>Instead of proxying replies through the <strong>load balancer,</strong> it rewrites only the destination IP and port, leaving the source IP intact. Backends reply directly to clients, bypassing the balancer entirely.</p><p>This approach eliminates double traversal of the datapath, removes additional memory and CPU overhead, and allows the system to handle <strong>planetary-scale fan-in</strong>. </p><p>When combined with <strong>Receive-Side Scaling (RSS)</strong> across NIC queues, each CPU core can process independent flows in parallel, achieving near-linear throughput across thousands of simultaneous connections: all <strong>without storing a single flow or connection state in the kernel</strong>.</p><h3>Physics vs. Semantics</h3><p>It is tempting to ask: &#8220;<em>Could Katran replace proxies?</em>&#8221; The shortest answer I could give you right now is: hell no. </p><p>Proxies operate in the <strong>semantic domain</strong>, handling TLS termination, header inspection, authentication, authorization, retries, circuit breaking, and protocol translation. 
They reason about requests, responses, cookies, tokens, and application-layer meaning.</p><p>Katran operates entirely in the <strong>physics domain</strong>. Its responsibility is deterministic packet routing, load distribution, and weight-based selection. </p><p>By isolating the concerns, Meta ensures that <strong>each layer can scale according to its own constraints</strong>: the kernel moves packets at line rate, and proxies reason about requests at application speed. </p><p>Collapse these layers, and the system suffers both <strong>CPU saturation</strong> and state management nightmares. </p><p>Katran redraws the boundary between <strong>raw throughput and semantic complexity</strong>, allowing both layers to excel.</p><h3>Computation at the Data</h3><p>Katran is also part of a larger architectural movement: perform computation <strong>where the data resides</strong>. Modern examples abound:</p><ul><li><p><strong>eBPF firewalls</strong> replacing iptables, operating in-kernel without context switches.</p></li><li><p><strong>SmartNICs</strong> offloading encryption and filtering.</p></li><li><p><strong>Programmable switches</strong> performing aggregation in-network.</p></li><li><p><strong>io_uring</strong> bypassing syscalls for low-latency I/O.</p></li><li><p><strong>RDMA</strong> eliminating kernel involvement for high-throughput transfers.</p></li></ul><p>The principle is <strong>simple but profound</strong>: reduce copies, avoid context switches, and specialize each layer for its constraints. </p><p>Katran embodies this philosophy at the <strong>kernel&#8217;s ingress point</strong>, turning every packet into a deterministic computation before it ever reaches user-space.</p><div><hr></div><h2>Quiet Minimalism at Hyperscale</h2><p>Ultimately, Katran&#8217;s brilliance is disciplined minimalism. It does not parse HTTP. It does not retry requests. It does not track connections or manage caches. 
</p><p>It simply <strong>forwards packets deterministically</strong>, guided by the Maglev-style hash ring materialized in kernel maps.</p><p>Combined with <strong>DSR</strong>, RSS, and atomic map updates, Katran achieves <strong>planetary-scale throughput</strong>. Proxies are freed from the burden of raw packet fan-in and can focus entirely on semantics.</p><p>Katran does not render proxies obsolete; it makes them <strong>viable at hyperscale</strong>. In doing so, it delivers one of the most profound lessons in distributed systems: at extreme scale, <strong>less complexity per operation is not a feature: it is survival</strong>. </p><p>The kernel, stripped to pure functionality, becomes a deterministic engine for traffic physics, leaving higher layers to reason about meaning.</p><p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/p/how-meta-turned-the-linux-kernel-f39?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/p/how-meta-turned-the-linux-kernel-f39?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[How Meta turned the Linux Kernel into a planet-scale Load Balancer. 
Part II]]></title><description><![CDATA[A deep architectural narrative on XDP, eBPF, stateless routing, and why hyperscale traffic outgrew proxies.]]></description><link>https://www.thesoftwarefrontier.com/p/how-meta-turned-the-linux-kernel-3e4</link><guid isPermaLink="false">https://www.thesoftwarefrontier.com/p/how-meta-turned-the-linux-kernel-3e4</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Sat, 14 Feb 2026 13:56:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TpBo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7393c592-b91b-40e7-8cb3-bc47c94c3867_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TpBo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7393c592-b91b-40e7-8cb3-bc47c94c3867_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TpBo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7393c592-b91b-40e7-8cb3-bc47c94c3867_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!TpBo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7393c592-b91b-40e7-8cb3-bc47c94c3867_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!TpBo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7393c592-b91b-40e7-8cb3-bc47c94c3867_1024x1536.png 1272w, 
https://substackcdn.com/image/fetch/$s_!TpBo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7393c592-b91b-40e7-8cb3-bc47c94c3867_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TpBo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7393c592-b91b-40e7-8cb3-bc47c94c3867_1024x1536.png" width="1024" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7393c592-b91b-40e7-8cb3-bc47c94c3867_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3651815,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/187197806?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7393c592-b91b-40e7-8cb3-bc47c94c3867_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TpBo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7393c592-b91b-40e7-8cb3-bc47c94c3867_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!TpBo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7393c592-b91b-40e7-8cb3-bc47c94c3867_1024x1536.png 848w, 
https://substackcdn.com/image/fetch/$s_!TpBo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7393c592-b91b-40e7-8cb3-bc47c94c3867_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!TpBo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7393c592-b91b-40e7-8cb3-bc47c94c3867_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>The Hidden Constraint Stateless Systems Cannot Escape</h2><p>By the end of <strong><a 
href="https://softwarefrontier.substack.com/p/how-meta-turned-the-linux-kernel">Part I</a></strong>, we discovered how Katran had collapsed load balancing into something almost &#8220;<em>offensively simple</em>&#8221;: a <strong>pure function </strong>executed at the earliest point in the kernel receive path. </p><p>A packet arrived via <strong>DMA </strong>into host memory. XDP intercepted it before socket allocation, before skb creation, before <strong>TCP state machines</strong> or retransmission queues even existed. </p><p>Katran read a fixed set of bytes, computed a hash over the 5-tuple, performed a constant-time lookup inside an <strong>eBPF map</strong>, rewrote the destination fields in place, fixed the checksums, and transmitted the packet. </p><blockquote><p><em>No heap allocation. </em></p><p><em>No locks. </em></p><p><em>No connection tracking. </em></p><p><em>No memory proportional to traffic volume. </em></p></blockquote><p>Routing had been reduced to deterministic evaluation over immutable input.</p><p>But this<strong> architectural purity</strong> comes with a constraint that stateful systems quietly absorb without thinking: </p><blockquote><p><em>if the system refuses to remember past decisions, it must be able to reproduce them exactly, indefinitely, using only computation and configuration as inputs.</em></p></blockquote><p>Katran doesn&#8217;t operate in a calm, static environment. Backends <strong>fail silently,</strong> networks partition, nodes drain for rolling upgrades, and autoscaling stretches and contracts continuously. </p><p>At Meta scale (tens of thousands of machines) these events happen constantly, while packets flow without pause.</p><p>Naive approaches, like <strong>modulo hashing</strong>, cannot survive this churn: millions of flows would be reassigned simultaneously, TLS sessions break, caches go cold, and tail latency explodes. </p><p>Katran survives by <strong>remembering nothing</strong>. 
Each packet recomputes its destination independently, deterministically, relying only on the current configuration and immutable header fields. </p><p>Continuity cannot come from memory; it must come from <strong>mathematics</strong>. Formally, the routing decision is a function: </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\nf : (\\mathit{flow\\_identity}, \\mathit{backend\\_set}, \\mathit{configuration}) \\to \\mathit{backend}\n\n\n\n\n\n\n&quot;,&quot;id&quot;:&quot;AJCSHUCWTQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <code>flow_identity</code> is the 5-tuple and <code>backend_set</code> reflects the current fleet encoded in <em>eBPF maps</em>. Every <strong>backend change</strong> alters the function. A naive mapping over a changing set would instantly reshuffle all flows, because Katran does not store previous decisions. </p><p>There is no per-flow state, no hash table of prior assignments, no flow table to synchronize. Stateful systems externalize continuity into memory; Katran externalizes it into <strong>math</strong>.</p><p>This is where consistent hashing becomes essential. It<strong> distributes flows</strong> across the ring such that small changes in the backend set affect only a fraction of traffic, preserving mapping continuity without storing anything. </p><p>Every packet independently recomputes its backend, yet the global distribution remains stable, predictable, and local.</p><p>In other words: continuity is no longer a property of memory; it is a property of the function itself. </p><p>Statelessness, determinism, and resilience are inseparable. 
This is why consistent hashing is not an optimization: it is <strong>the backbone of stateless load balancing at hyperscale</strong>.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>Why modulo hashing fails under dynamic membership</h2><p>At first glance, <strong>stateless routing</strong> appears trivial to implement. The packet already carries an identifier for its flow: the 5-tuple. </p><p>Applying a <strong>high-quality</strong> hash function such as Jenkins hash, MurmurHash, or Toeplitz hash produces a uniformly distributed <strong>32-bit</strong> or <strong>64-bit</strong> value. </p><p>Mapping that value into a backend index can be done using a simple modulo operation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{backend\\_index} = \\operatorname{hash}(\\text{flow\\_tuple}) \\bmod N\n&quot;,&quot;id&quot;:&quot;AGBTWGIGLL&quot;}" data-component-name="LatexBlockToDOM"></div><p>This approach satisfies nearly all of Katran&#8217;s architectural constraints. It is deterministic. It requires no per-flow state. It executes in constant time. </p><p>It is also trivial to implement in eBPF. It produces <strong>uniform load distribution</strong>, assuming a well-behaved hash function.</p><p>And yet it fails catastrophically the moment the backend set changes.</p><p>The problem lies in the divisor.</p><p>Modulo arithmetic gives no guarantee that an index survives a change of modulus. When <strong>N</strong> changes to <strong>N+1</strong>, the mapping of hash values to indices changes globally. </p><p>A hash value that previously mapped to backend 2 may now map to backend 7. 
Another may map to backend 0. There is no locality preservation property.</p><p>To see this concretely, consider a simplified example with N=4 backends. Hash values map into backend indices like this:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\text{hash} \\bmod 4 \\;\\rightarrow\\; \\{0, 1, 2, 3\\}&quot;,&quot;id&quot;:&quot;FDHTAWCUKI&quot;}" data-component-name="LatexBlockToDOM"></div><p>Now add a fifth backend, so <strong>N=5</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\text{hash} \\bmod 5 \\;\\rightarrow\\; \\{0, 1, 2, 3, 4\\}\n&quot;,&quot;id&quot;:&quot;CRPFOHKNFZ&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>For any given hash value <strong>H</strong>, the probability that:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\nH \\bmod 4 = H \\bmod 5&quot;,&quot;id&quot;:&quot;AYQPBOAXQI&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>is exactly:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\frac{1}{5}\n\n&quot;,&quot;id&quot;:&quot;KGSUQSQWGN&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>This means <strong>80%</strong> of all flows are reassigned immediately when a single backend is added. At hyperscale, this is not a theoretical inconvenience. It is an <strong>operational disaster.</strong></p><p>Every reassigned flow results in packets arriving at backend machines that have never seen that flow before. TCP stacks reject packets with unknown sequence numbers. </p><p><strong>TLS sessions</strong> fail because cryptographic state does not exist on the new backend. HTTP/2 multiplexed streams break because connection context is lost. 
Clients initiate retransmissions, exponential backoff, and eventually full connection re-establishment.</p><p>This cascades upward.</p><blockquote><p><em>Connection pools maintained by clients become invalid.<br>Application-layer caches lose locality.<br>Backend CPU cache warmth disappears.<br>NUMA locality is destroyed.<br>Kernel routing caches become irrelevant.</em></p></blockquote><p>Traffic patterns that were previously thermally stable suddenly destabilize.</p><p>This creates transient <strong>overload conditions</strong> that amplify latency precisely when the system is undergoing change, which is the worst possible time for instability.</p><p>Even more dangerously, this reshuffling occurs regardless of whether the backend change is an addition, removal, or replacement. Removing a single backend causes the same global <strong>redistribution effect</strong>, forcing nearly all flows to move.</p><p>The system becomes <strong>topologically fragile.</strong> Minor topology changes trigger global traffic churn.</p><p>This is <strong>fundamentally incompatible</strong> with Katran&#8217;s stateless model.</p><p>Because Katran has no memory, it cannot preserve continuity explicitly. Therefore continuity must be preserved implicitly by the mapping function.</p><p>Modulo hashing does not provide this guarantee.</p><p>Katran required a routing function whose output changes minimally when the backend set changes. 
A function whose continuity properties are intrinsic, not emergent.</p><p>A function designed not merely for distribution, but for stability under mutation.</p><p>This is exactly what <strong>consistent hashing</strong> provides.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>The missing half of the equation</h2><p>Up to this point, Katran&#8217;s model appears almost <strong>suspiciously incomplete.</strong> A packet arrives, a deterministic hash selects a backend, the destination IP is rewritten, and the packet is forwarded. </p><ul><li><p>No connection table is created. </p></li><li><p>No session state is remembered. </p></li><li><p>No record is kept anywhere. </p></li></ul><p>The <strong>load balancer</strong> forgets the packet the instant it leaves the NIC. Which raises an immediate and deeply practical question: </p><blockquote><p><em>how does the response find its way back?</em></p></blockquote><p>Traditional load balancers solve this by remembering everything. They allocate <strong>connection entries</strong>, store source and destination tuples, and use that memory later to reverse the transformation. </p><p>Return traffic is not discovered: it is <strong>looked up</strong>. Every response packet must re-enter the load balancer, consult state, and be translated back into the client-visible address space.</p><p>Katran does something far more radical. It eliminates the need to remember in the first place.</p><p>This works because Katran operates in <strong>two distinct forwarding modes</strong>, each designed to preserve the illusion of a single virtual service IP while minimizing the amount of work Katran itself must perform. 
</p><p>Both modes rely on the<strong> same invariant:</strong> the load balancer&#8217;s job is not to own the connection, but merely to place the packet correctly at the beginning. Once placed, the network itself can do the rest.</p><p>This is the difference between supervising every conversation and simply introducing two parties who can speak directly. Katran prefers introductions.</p><div><hr></div><h2>The art of disappearing</h2><p>Direct Server Return is the purer expression of Katran&#8217;s philosophy. In <strong>DSR mode</strong>, Katran modifies only the minimum amount of information required to steer the packet to the correct backend. </p><p>Specifically, it rewrites the <strong>destination MAC address</strong> so the frame reaches the selected server at layer 2, while leaving the IP layer logically consistent with the virtual service abstraction.</p><p>From the client&#8217;s perspective, the packet was sent to the <strong>VIP</strong>. From the backend&#8217;s perspective, the packet appears to be addressed to that same VIP, because the backend is explicitly configured to accept traffic destined for that address via a loopback interface. </p><p>This configuration is deliberate sleight of hand: multiple machines simultaneously claim <strong>ownership </strong>of the same IP, but only Katran determines which one actually receives each packet.</p><p>Once the backend processes the request, it responds directly to the client using normal <strong>IP routing</strong>. The response does not pass back through Katran. It does not need translation, correction, or approval. </p><p>The backend already knows the <strong>client&#8217;s source IP</strong>, and the VIP is already a valid source address on the backend.</p><p>Katran never sees the response. And that absence is the entire point.</p><p>By removing itself from the return path, Katran immediately cuts its packet processing load in half. 
Every byte of outbound response traffic bypasses the load balancer entirely, freeing CPU cycles, memory bandwidth, <strong>PCIe bandwidth</strong>, and NIC queue capacity. </p><p>Latency improves because an entire network hop disappears. Throughput increases because Katran&#8217;s processing budget is now devoted exclusively to ingress traffic.</p><p>The load balancer becomes a <strong>one-way valve.</strong></p><p>This has profound scaling implications. In traditional NAT-based load balancers, throughput is constrained by <strong>bidirectional</strong> processing capacity. Every request and every response must traverse the same CPU, the same queues, the same kernel structures. </p><p>In<strong> DSR mode</strong>, Katran&#8217;s throughput ceiling effectively doubles, not because the hardware changed, but because half the work vanished.</p><p>The <em>most efficient packet</em> is the one you never have to touch.</p><p>Of course, this requires backend cooperation. Each backend must be configured with the VIP on a <strong>loopback interface</strong>, must disable certain reverse path filtering protections, and must ensure routing policies allow responses to exit with the VIP as the source address. 
</p><p>These adjustments sound invasive, but <em>at Meta&#8217;s scale</em>, where fleets are centrally managed and automatically provisioned, such configuration is routine infrastructure hygiene.</p><p>In exchange, Katran achieves something remarkable: it participates only in the part of the connection where its intelligence is actually required.</p><p>The rest of the time, it actually &#8220;<em>disappears</em>&#8221;.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>The compatibility layer</h2><p>Not every environment can support Direct Server Return.<strong> Legacy systems</strong>, asymmetric routing environments, and networks with strict topology constraints often require the load balancer to remain in the return path. </p><p>For these cases, Katran supports traditional NAT mode, where both inbound and outbound packets are rewritten to preserve the virtual service abstraction.</p><p>In NAT mode, Katran performs <strong>two symmetric transformations</strong>. On ingress, it rewrites the destination IP from the VIP to the backend&#8217;s real address. On egress, it performs the inverse transformation, rewriting the source IP from the backend&#8217;s address back to the VIP before forwarding the packet to the client.</p><p>From the outside, the illusion remains intact. The client believes it is communicating with a single <strong>logical endpoint</strong>. The backend receives routable packets addressed to itself. Katran acts as the translator between these two realities.</p><p>What makes Katran&#8217;s NAT implementation unusual is what it does not do.</p><p>It does not allocate connection objects. 
It does not maintain per-flow state. It does not synchronize session tables across nodes. Instead, it relies entirely on <strong>deterministic recomputation</strong>.</p><p>Because the consistent hashing function produces the same backend selection for every packet in a flow, Katran can reapply the same <strong>translation logic</strong> independently to each packet without remembering anything about previous packets.</p><p>Each packet carries sufficient information to rediscover its correct path. This preserves the defining properties of the system:</p><p>Routing decisions remain stateless, lock-free, and purely functional. There are no shared data structures that grow with connection volume. </p><p>Memory usage does not scale with traffic, and even <strong>CPU cost</strong> remains constant per packet. Failure recovery requires no state reconstruction because no state exists to reconstruct.</p><p>Even when performing NAT, Katran refuses to become stateful. Statelessness is not a feature. It is a constraint the system refuses to violate.</p><div><hr></div><h2>Where intelligence is allowed to exist</h2><p>At first glance, Katran&#8217;s data plane appears almost aggressively unintelligent. It does not monitor <strong>backend health</strong>. It does not detect overload. It does not adapt dynamically. It does not even know what services exist in any meaningful sense. </p><p>It simply reads precomputed values from eBPF maps and executes deterministic transformations on packets.</p><p>This is not a limitation. It is a deliberate <strong>architectural boundary</strong>. All intelligence lives elsewhere.</p><p>The control plane operates entirely in user space, where <strong>complexity</strong> is cheap and mistakes are survivable. It continuously observes the backend fleet, performing health checks, capacity measurements, deployment coordination, and traffic engineering decisions. 
</p><p>It computes consistent <strong>hashing rings</strong>, assigns weights to reflect backend capacity, removes unhealthy nodes, and introduces new ones during scaling events.</p><p>Once computed, these decisions are materialized into eBPF maps inside the kernel. This is the only <strong>communication channel</strong> between intelligence and execution.</p><p>The kernel data plane never queries user space. It never performs system calls. It never waits for locks held by other threads. It never allocates dynamic memory. It never blocks on<strong> I/O.</strong> </p><p>It simply reads from maps that already exist <strong>in memory</strong> and executes instructions that have already been verified.</p><p>The relationship between control plane and <strong>data plane</strong> is asynchronous and one-directional. The control plane writes new configurations when necessary. The data plane continues executing the previous configuration until the new one appears. </p><p>There is no synchronization barrier. There is no pause in <strong>packet processing.</strong> There is no transitional state where routing becomes uncertain.</p><p>This separation produces one of Katran&#8217;s most important operational properties:</p><p>If the control plane crashes, traffic continues uninterrupted.</p><p>The <strong>kernel</strong> retains the last valid configuration indefinitely. Packets continue to be routed correctly. Services remain reachable. The system does not degrade gradually: it simply stops evolving until the control plane returns.</p><p>The <strong>inverse failure</strong> is equally unremarkable.</p><p>If a Katran node fails entirely, upstream Anycast routing automatically shifts traffic to other <strong>Katran nodes</strong> advertising the same VIP. Because no per-connection state exists, there is nothing to migrate. 
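</p><p>The ring computation attributed to the control plane above can be sketched in a few lines. What follows is a minimal Python illustration, not Katran&#8217;s actual table-building code: the helper names (<code>h32</code>, <code>build_ring</code>, <code>ring_lookup</code>, <code>modulo_lookup</code>), the SHA-256-based hash, the 100 virtual nodes per backend, and the synthetic flow strings are all assumptions made for the example. It compares how many flows change backend under modulo hashing and under a hash ring when a fifth backend is added, and checks that two rings built independently from the same backend set are identical.</p>
<pre><code>
```python
import bisect
import hashlib


def h32(key):
    """Stable 32-bit hash of a string (SHA-256 prefix; an arbitrary choice)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


def build_ring(backends, vnodes=100):
    """Place `vnodes` points per backend on the ring, sorted by position."""
    pairs = sorted((h32(f"{b}#{i}"), b) for b in backends for i in range(vnodes))
    points = [p for p, _ in pairs]
    owners = [b for _, b in pairs]
    return points, owners


def ring_lookup(ring, flow):
    """Pick the first point at or after the flow's hash, wrapping around."""
    points, owners = ring
    return owners[bisect.bisect_left(points, h32(flow)) % len(points)]


def modulo_lookup(backends, flow):
    """The naive scheme from the text: hash mod N."""
    return backends[h32(flow) % len(backends)]


def moved_fraction(before, after):
    """Share of flows whose backend differs between two assignments."""
    return sum(a != b for a, b in zip(before, after)) / len(before)


old = ["b0", "b1", "b2", "b3"]
new = old + ["b4"]  # one backend added
flows = [f"198.51.100.{i % 250}:{10000 + i}" for i in range(10000)]  # synthetic flow keys

mod_moved = moved_fraction([modulo_lookup(old, f) for f in flows],
                           [modulo_lookup(new, f) for f in flows])

ring_old, ring_new = build_ring(old), build_ring(new)
ring_before = [ring_lookup(ring_old, f) for f in flows]
ring_after = [ring_lookup(ring_new, f) for f in flows]
ring_moved = moved_fraction(ring_before, ring_after)

# Every flow that moved on the ring moved to the new backend, never between old ones.
only_to_new = all(b == "b4" for a, b in zip(ring_before, ring_after) if a != b)

# A second "load balancer" building its ring from the same set, listed in a
# different order, ends up with an identical ring and so makes identical choices.
same_ring = build_ring(list(reversed(new))) == ring_new

print(f"modulo moved: {mod_moved:.1%}")
print(f"ring moved:   {ring_moved:.1%}")
print("moves only to b4:", only_to_new, "| rings identical:", same_ring)
```
</code></pre>
<p>Under these assumptions, the modulo scheme should move close to 80% of flows, in line with the 1/5 agreement probability derived earlier, while the ring should move a far smaller share (about one fifth in expectation, with some variance from virtual-node placement), all of it to the new backend. The ring is rebuilt from configuration alone, which is why a node that takes over after an Anycast shift can reach the same decisions without any state transfer.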
</p><p>New packets arrive at different load balancers, which recompute the same consistent <strong>hash decisions</strong> and forward traffic to the same backends.</p><p>There is no recovery phase. Recovery is instantaneous because nothing was lost.</p><p>This is resilience not through redundancy, but through the absence of fragile state.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>Watching decisions at line rate</h2><p>Running routing logic inside the kernel would traditionally be an observability nightmare. <strong>Kernel code</strong> is opaque, difficult to instrument, and extremely dangerous to modify. </p><p>Debugging typically relies on indirect signals: application logs, sampled traces, or inferred metrics reconstructed from incomplete information.</p><p><strong>eBPF </strong>changes this completely.</p><p>Because Katran is implemented as an eBPF program, it can <strong>export telemetry </strong>directly from the exact instruction path that processes each packet. </p><p>Every <strong>routing decision</strong> can increment counters<em>, update histograms,</em> or emit structured events into shared maps or perf buffers, all without leaving kernel space or introducing meaningful overhead.</p><p>This provides <strong>observability </strong>at the point of truth.</p><p>Operators can measure per-service packet rates exactly as they are forwarded. They can<strong> observe backend distribution </strong>and detect imbalance immediately. They can identify drops caused by invalid packets, resource exhaustion, or configuration transitions. 
</p><p>They can monitor<strong> NIC queue</strong> utilization and detect early signs of saturation before packet loss begins.</p><blockquote><p><em>Nothing is inferred.</em></p><p><em>Nothing is sampled.</em></p><p><em>Nothing is reconstructed.</em></p></blockquote><p>The measurements are produced by the same instructions that forward the packet.</p><p>Even more importantly, these programs can be updated dynamically. New instrumentation can be deployed without <strong>kernel recompilation</strong>, without system reboot, and without interrupting traffic. </p><p>The kernel effectively becomes a programmable execution environment, capable of evolving while the system remains live.</p><p>This represents a<strong> fundamental shift</strong> in how infrastructure behaves.</p><p>The kernel is no longer a static artifact frozen at boot time. It becomes a runtime, executing safe, <strong>verified programs</strong> that can be replaced as requirements evolve.</p><p>Katran is not merely a load balancer.</p><p>It is an example of what happens when <strong>packet processing</strong> stops being a fixed function of the operating system and becomes software again.</p><div><hr></div><h2>Statelessness as resilience</h2><p>Katran&#8217;s architecture repeatedly demonstrates a central theme: <strong>statelessness is not merely a performance optimization; </strong>it is the<strong> source of resilience</strong>. </p><p>By eliminating memory, per-flow state, and mutable tables from the data plane, Katran transforms a potentially fragile, <strong>memory-bound </strong>system into a deterministic, local, lock-free computation.</p><p>Every section we&#8217;ve explored, return traffic handling, control plane separation, observability, and <strong>minimal kernel programs,</strong> converges on the same insight: the network does not need to remember. 
</p><p>The network needs only to compute, <em>consistently</em>, <em>independently</em>, and <em>correctly</em>.</p><p>In hyperscale systems, that subtle <strong>architectural shift</strong>, from storing what happened to recomputing what must happen, does more than improve speed. </p><p>It restores simplicity, predictability, and operational sanity to a layer of infrastructure that was<strong> traditionally overburdened </strong>with complexity it never required.</p><p>Katran does not just move packets faster. It <strong>redefines what moving packets efficiently even means</strong>. It vanishes in the network, leaving only math, determinism, and physics-aware computation behind. </p><p>And in doing so, it proves that sometimes the most radical engineering move is to refuse to remember anything at all.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share The Software Frontier&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share The Software Frontier</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[How Meta turned the Linux Kernel into a planet-scale Load Balancer. 
Part I]]></title><description><![CDATA[A deep architectural narrative on XDP, eBPF, stateless routing, and why hyperscale traffic outgrew proxies.]]></description><link>https://www.thesoftwarefrontier.com/p/how-meta-turned-the-linux-kernel</link><guid isPermaLink="false">https://www.thesoftwarefrontier.com/p/how-meta-turned-the-linux-kernel</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Sat, 07 Feb 2026 16:20:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!woiB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8858502-d9da-40d8-b5f2-50ae94480fe3_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!woiB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8858502-d9da-40d8-b5f2-50ae94480fe3_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!woiB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8858502-d9da-40d8-b5f2-50ae94480fe3_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!woiB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8858502-d9da-40d8-b5f2-50ae94480fe3_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!woiB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8858502-d9da-40d8-b5f2-50ae94480fe3_1536x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!woiB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8858502-d9da-40d8-b5f2-50ae94480fe3_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!woiB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8858502-d9da-40d8-b5f2-50ae94480fe3_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8858502-d9da-40d8-b5f2-50ae94480fe3_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3018387,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/185625450?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8858502-d9da-40d8-b5f2-50ae94480fe3_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!woiB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8858502-d9da-40d8-b5f2-50ae94480fe3_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!woiB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8858502-d9da-40d8-b5f2-50ae94480fe3_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!woiB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8858502-d9da-40d8-b5f2-50ae94480fe3_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!woiB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8858502-d9da-40d8-b5f2-50ae94480fe3_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>The Night Load Balancers stopped feeling like plumbing</h2><p>The first time I saw a production system degrade because of its 
load balancer, nothing spectacular happened.</p><blockquote><p><em>No outages.<br>No alarms.<br>No crashes.</em></p></blockquote><p>Latency just started creeping upward, first <strong>p95</strong>, then <strong>p99</strong>, until everything felt sticky. Requests weren&#8217;t failing. They were waiting. Queues lengthened. Retries multiplied. Backends looked saturated despite idle <strong>CPUs</strong>. </p><p>What should have been a minor networking blip slowly metastasized into systemic instability.</p><blockquote><p><em>The system wasn&#8217;t compute-bound.<br>It wasn&#8217;t storage-bound.<br>It wasn&#8217;t even network-bandwidth-bound.</em></p></blockquote><p>It was <strong>packet-movement-bound</strong>.</p><p>At the time, that realization felt wrong. Load balancers were supposed to be solved infrastructure, plumbing you deploy once and forget. You scale them horizontally, tune some buffers, maybe add another tier, and move on.</p><p>That night changed how I thought about them. I realized that load balancing isn&#8217;t plumbing at all.</p><p>It&#8217;s distributed systems engineering hiding inside the network layer.</p><p>Years later, when I first read about Facebook&#8217;s Katran, that same dissonance came rushing back, not because<strong> Katran</strong> was fast (many systems are fast), but because it treated the problem so differently from everything else I&#8217;d seen. </p><p>It didn&#8217;t try to be a better proxy. It didn&#8217;t try to optimize user-space networking. It didn&#8217;t try to understand protocols, sessions, or requests.</p><p>It tried to disappear.</p><p>Katran&#8217;s ambition was not to sit in front of traffic, but to move packets <strong>before the operating system itself even noticed they existed</strong>. 
Not as an optimization, but as a redefinition of what load balancing actually is.</p><p>This piece is an attempt to unpack that redefinition, not as a feature tour, but as a system story: </p><ol><li><p>how Meta ended up building a kernel-level, stateless, line-rate load balancer;</p></li><li><p>why proxies stopped scaling;</p></li><li><p>how <strong>eBPF</strong> and <strong>XDP</strong> turned Linux into a programmable switch;</p></li><li><p>and what this architecture tells us about where infrastructure is heading.</p></li></ol><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>When proxies became the bottleneck</h2><p>At small scale, load balancers feel like coordination infrastructure. </p><p>They terminate <strong>TLS</strong>, parse HTTP, enforce policies, retry requests, expose metrics, and act as programmable gateways between clients and services.</p><p>At medium scale, they become <strong>throughput machines</strong>. You start tuning epoll loops, socket buffers, thread pools, kernel parameters, and memory allocators. But everything still feels tractable.</p><p>At hyperscale, at Meta&#8217;s scale, something breaks conceptually.</p><blockquote><p><em>Traffic no longer looks like &#8220;clients making requests.&#8221;<br>It looks like &#8220;millions of packets per second, continuously, forever.&#8221;</em></p></blockquote><p>In that regime, the bottleneck stops being <em>computation</em> and becomes <em>movement</em>. The system isn&#8217;t struggling to compute responses. 
It&#8217;s struggling to <strong>move bytes</strong> through layers of software fast enough.</p><p>A traditional proxy pipeline looks roughly like this:</p><pre><code><code>NIC &#8594; kernel networking stack &#8594; socket buffers
    &#8594; user-space context switch &#8594; protocol parsing
    &#8594; routing decision &#8594; connection pool lookup
    &#8594; buffering &#8594; write syscall &#8594; kernel &#8594; NIC</code></code></pre><p>Every arrow carries a real cost:</p><ul><li><p>Cache misses.</p></li><li><p>Branch mispredictions.</p></li><li><p>Memory allocation and deallocation.</p></li><li><p>Kernel &#8596; user context switches.</p></li><li><p>Lock contention.</p></li><li><p>Scheduling delays.</p></li><li><p>Buffer copies.</p></li><li><p>Queue management overhead.</p></li></ul><p>At tens of thousands of requests per second, this is noise. At millions of packets per second, it is overhead. At tens of millions, it becomes physics.</p><p>At Meta scale, the <strong>load balancer</strong> fleet itself grew into one of the company&#8217;s largest compute clusters, not because it was doing intellectually difficult work, but because it was burning an enormous number of <strong>CPU cycles</strong> just forwarding bytes.</p><p>Worse, that cost surfaced primarily as <strong>tail latency</strong>, the kind that cascades invisibly through distributed systems. Slight delays trigger retries. Retries amplify load. Load amplifies queueing. Queueing amplifies latency variance. 
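</p><p>The retry loop can be put in rough numbers. What follows is a minimal sketch in C under an idealized model that is an assumption, not data from Meta: each attempt independently times out with probability <code>p</code>, and a client gives up after <code>max_retries</code> retries. The function name and parameters are hypothetical.</p><pre><code><code>/* Expected attempts per request under the assumed model: every attempt
 * times out independently with probability p, and the client retries at
 * most max_retries times. This is the geometric series
 * 1 + p + p^2 + ... + p^max_retries. */
double expected_attempts(double p, int max_retries)
{
    double total = 0.0;
    double term = 1.0;   /* probability that attempt k is needed at all */

    for (int k = 0; k &lt;= max_retries; k++) {
        total += term;
        term *= p;
    }
    return total;
}
</code></code></pre><p>With <code>p = 0.5</code> and two retries, each request costs 1.75 attempts on average, so backends see 75% more traffic than clients meant to send. In a real system timeouts are correlated with load rather than independent, which is what makes the loop self-reinforcing.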
</p><p>What starts as <strong>packet-processing</strong> overhead becomes application-level instability.</p><p>This is where the framing changed.</p><p>Meta engineers stopped asking:</p><blockquote><p><em>How do we build a faster proxy?</em></p></blockquote><p>And instead started asking:</p><blockquote><p><em>Why are we proxying at all?</em></p></blockquote><div><hr></div><h2>The conceptual break</h2><p><strong>Katran</strong> begins with a deceptively simple idea:</p><p>Load balancing is not about requests.<br>It&#8217;s about packets.</p><p>This sounds trivial until you follow it all the way down.</p><p>Most systems reason about traffic at <strong>Layer 7</strong>, the application layer. We think in terms of HTTP verbs, paths, cookies, headers, auth tokens, retries, sessions, and request lifecycles. 
Load balancers become policy engines and application routers.</p><p>But stripped to fundamentals, routing is just:</p><blockquote><p><em>A packet arrives.<br>It must be forwarded somewhere else.</em></p></blockquote><p>That decision does not require:</p><ul><li><p>Parsing payloads.</p></li><li><p>Understanding application protocols.</p></li><li><p>Tracking sessions.</p></li><li><p>Maintaining connection state.</p></li><li><p>Allocating memory.</p></li><li><p>Buffering streams.</p></li></ul><p>In fact, it requires only:</p><ul><li><p>Reading a few header fields.</p></li><li><p>Mapping them deterministically to a backend.</p></li><li><p>Rewriting destination addresses.</p></li></ul><p>Once that&#8217;s accepted, the rest of Katran&#8217;s architecture follows almost inevitably.</p><p>If none of that work needs user space, an uncomfortable question follows: why enter user space at all?</p><blockquote><p><em>If routing doesn&#8217;t require sockets, why traverse the TCP stack?<br>If it doesn&#8217;t require state, why store state?<br>If it doesn&#8217;t require buffering, why buffer?</em></p></blockquote><p>Most traditional networking pipelines do all of these things, not because routing needs them, but because the systems we built around routing do. </p><p>Over time, <strong>layers accreted</strong>: sockets, connection tracking, session tables, queues, retries, buffers. 
Routing became entangled with everything else, until forwarding a packet meant walking through half the operating system.</p><p>Katran asks a much <strong>simpler question</strong>.</p><p>If routing is just computation, a deterministic function of <strong>packet headers</strong>, then <em>why</em> not perform that computation at the earliest possible moment, the instant the packet enters the machine, and then get out of the way?</p><p>That&#8217;s exactly what Katran does.</p><p>It runs in the <strong>kernel datapath</strong>, before the TCP stack, before sockets, before user space, before state is allocated, before buffers pile up. </p><p>A packet arrives. Its tuple is hashed. A backend is chosen. The destination is rewritten. The packet moves on.</p><p>No sessions. No per-flow tables. <strong>No queues</strong>. No per-flow memory.</p><p>Just math, applied at line rate. Katran doesn&#8217;t optimize routing.</p><p>It eliminates everything routing never needed in the first place.</p><div><hr></div><h2>The physics of packet movement</h2><p>Before diving deeper, it&#8217;s worth stepping back and looking at the constraint Katran is truly built to respect: <strong>physics</strong>.</p><p>At hyperscale, performance isn&#8217;t just a matter of elegant algorithms or clever data structures. It&#8217;s about the <strong>raw realities</strong> of moving bits through silicon. </p><p>Every design choice carries a cost in CPU cycles, memory bandwidth, and latency. At this scale, even small inefficiencies multiply into catastrophic slowdowns.</p><p>Every packet copied between kernel and user space consumes memory bandwidth. Every crossing of a <strong>privilege boundary</strong> introduces pipeline stalls. </p><p>Every memory allocation risks cache thrashing or fragmentation. Every context switch adds jitter. Every queue introduces contention; every interrupt triggers scheduling overhead. 
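</p><p>The size of the per-packet budget is easy to work out. This is a small sketch in C; the 3 GHz clock and the 10 million packets per second per core are assumed round numbers for illustration, not figures from Meta, and the function name is hypothetical.</p><pre><code><code>/* Cycles one core can spend on each packet if it must sustain
 * packets_per_sec at a clock rate of core_hz. */
double cycles_per_packet(double core_hz, double packets_per_sec)
{
    return core_hz / packets_per_sec;
}
</code></code></pre><p><code>cycles_per_packet(3e9, 10e6)</code> is 300: a few hundred cycles to receive, decide, rewrite, and transmit. A single cache miss that goes all the way to DRAM can consume a large share of that.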
</p><p>NUMA topology, <strong>DMA latency</strong>, cache line locality, branch predictability; all of it suddenly matters, not in theory, but in the measured tail latencies of millions of concurrent flows.</p><p>Traditional proxies accumulate these costs by design. They are built to understand packets: to parse headers, <strong>terminate TLS</strong>, enforce policies, track sessions. </p><p>Each feature is useful at a semantic level, but each comes at a physical price. And at planetary scale, that price dominates everything else.</p><p>Katran takes the opposite approach. It doesn&#8217;t want to know what the packet <em>means</em>. It doesn&#8217;t parse HTTP, it doesn&#8217;t validate TLS, it doesn&#8217;t track sessions. All it cares about is <strong>where the packet should go</strong>. That singular focus allows it to escape the usual costs.</p><p>By moving routing decisions to the earliest possible point, the kernel&#8217;s first receive path, before the <strong>networking stack</strong> even wakes up, Katran eliminates most of the overhead proxies can&#8217;t avoid. </p><p>The packet hits the <strong>NIC</strong>, the tuple is hashed, a backend is chosen, the destination is rewritten, and the packet continues. No user-space copies. No socket buffers. No queues. No connection tables. 
Almost no memory traffic beyond the computation itself.</p><p>In other words, Katran doesn&#8217;t just optimize for speed: it <strong>co-locates computation</strong> with the physics of the system, letting the network move at line rate while avoiding the hidden costs that would crush any traditional proxy at hyperscale.</p><div><hr></div><h2>XDP and eBPF</h2><p>None of this would have been remotely feasible a decade ago. </p><p><strong>Kernel-level packet processing</strong> used to be the domain of fear and ritual. You wrote kernel modules in C, recompiled kernels, rebooted machines, and prayed nothing would panic. </p><p>Debugging meant sprinkling <code>printk</code> calls throughout the code and hoping the kernel didn&#8217;t crash before you could see the output. </p><p>It was brittle, slow, and incompatible with the dynamic, constantly changing demands of modern production infrastructure.</p><p>Then came <strong>eBPF</strong>, the extended Berkeley Packet Filter, and <strong>XDP</strong>, the eXpress Data Path. Together, they changed the rules of the game.</p><p>eBPF introduced something <strong>unprecedented</strong>: safe, sandboxed programs that could run directly inside the kernel. </p><p>You could load them dynamically, update them on the fly, and remove them at runtime, all without touching the <strong>kernel&#8217;s core</strong> or risking catastrophic panics. </p><p>The verifier checked every program before it ran, rejecting unbounded loops and unchecked memory accesses, so that any program it accepted was guaranteed to terminate. 
Suddenly, the kernel became <strong>programmable</strong>, without the traditional risks that had made kernel development a high-stakes gamble.</p><p>XDP tied this capability to the earliest possible <strong>packet hook</strong>, right after the NIC&#8217;s DMA engine has copied the packet into memory, long before the kernel networking stack wakes up, before socket buffers are allocated, before protocol parsing begins. </p><p>At that instant, the packet is nothing more than bytes in memory. And at that instant, you get to run code on it.</p><p>This is the environment where <strong>Katran lives</strong>. It&#8217;s not a load balancer in the traditional sense. It&#8217;s a programmable <strong>L3/L4 forwarding pipeline</strong>, implemented in software but operating at the same layer as a hardware switch. </p><p>It doesn&#8217;t parse requests, terminate TLS, or track sessions. It doesn&#8217;t worry about retries, circuit breakers, or HTTP headers. </p><p>It computes packet destinations at line rate and moves on. Everything else (<em>proxies, policies, application logic</em>) happens after the packet has already been <strong>routed correctly</strong> and efficiently.</p><p>From this perspective, Katran isn&#8217;t an optimization. It&#8217;s a reframing. 
</p><p>By combining eBPF and XDP, Katran elevates routing to the earliest stage possible, removes the overhead <strong>traditional systems</strong> carry, and brings software routing into the same performance envelope as hardware switches, all without sacrificing the <strong>flexibility</strong>, safety, or dynamism that modern production environments demand.</p><div><hr></div><h2>The Katran data plane</h2><p>Let&#8217;s follow a single <strong>SYN packet</strong> entering a Katran node, step by step, to see exactly what makes this system so different. </p><p>Picture it in your mind: every nanosecond counts, every memory access is a potential bottleneck, and every decision must be deterministic.</p><p>The packet arrives at the NIC. The network card&#8217;s DMA engine deposits it into memory. The kernel is still asleep for the most part, unaware of what just arrived. At this moment, the <strong>XDP hook fires</strong>, and Katran&#8217;s eBPF program takes control.</p><p>From here, the packet&#8217;s journey is unlike anything a traditional stack would do. Katran doesn&#8217;t allocate memory. It doesn&#8217;t touch sockets. It doesn&#8217;t consult a connection table. </p><p>It doesn&#8217;t parse HTTP headers or check <strong>TLS state</strong>. It doesn&#8217;t wait for a user-space process to be scheduled. It doesn&#8217;t buffer or queue.</p><p>All it does is look at a few bytes of the headers: the classic <strong>5-tuple</strong> that uniquely identifies the flow.</p><pre><code><code>struct iphdr *ip = data + ETH_HLEN;
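struct tcphdr *tcp = data + ETH_HLEN + sizeof(*ip);   /* assumes IPv4 with no IP options */
if ((void *)(tcp + 1) &gt; data_end)                   /* data_end: end-of-packet pointer from the XDP context (not shown) */
    return XDP_PASS;                                   /* the verifier requires this bounds check before any header read */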

__u32 src_ip = ip-&gt;saddr;
__u32 dst_ip = ip-&gt;daddr;
__u16 src_port = tcp-&gt;source;
__u16 dst_port = tcp-&gt;dest;
__u8 proto = ip-&gt;protocol;
</code></code></pre><p>Those five fields (source IP, destination IP, source port, destination port, and protocol) become the <strong>entire input</strong> to Katran&#8217;s routing decision. Nothing else matters.</p><p>Next comes the heart of the <strong>computation</strong>: hashing and consistent mapping to a backend. </p><p>Katran uses the 5-tuple to produce a deterministic key, which indexes directly into a consistent hashing ring stored in an <strong>eBPF map</strong>.</p><pre><code><code>__u64 key = hash_5tuple(src_ip, dst_ip, src_port, dst_port, proto);
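__u32 *slot = bpf_map_lookup_elem(&amp;backend_ring, &amp;key);  /* returns a pointer to the value, or NULL on a miss */
if (!slot) return XDP_PASS;                                  /* the verifier rejects an unchecked dereference */
__u32 backend_idx = *slot;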
</code></code></pre><p>That index points to another map that contains backend IP addresses and ports:</p><pre><code><code>struct backend backend = backends[backend_idx];
</code></code></pre><p>Katran then rewrites the packet headers in place:</p><pre><code><code>ip-&gt;daddr = backend.ip;
tcp-&gt;dest = backend.port;
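</code></code></pre><p>Changing a header field invalidates the checksum that covers it, but a full recomputation is not needed. RFC 1624 gives an incremental update, HC' = ~(~HC + ~m + m'), where m is the old 16-bit word and m' the new one. The following is a user-space C sketch of that formula, not Katran&#8217;s code; the function names are hypothetical, and words are treated as plain 16-bit integers. The same update applies to the TCP checksum, whose pseudo-header includes the IP addresses.</p><pre><code><code>#include &lt;stdint.h&gt;

/* Fold a 32-bit accumulator into 16 bits with end-around carry,
 * as one's-complement addition requires. */
static uint16_t fold16(uint32_t sum)
{
    while (sum &gt;&gt; 16)
        sum = (sum &amp; 0xffff) + (sum &gt;&gt; 16);
    return (uint16_t)sum;
}

/* RFC 1624, equation 3: HC' = ~(~HC + ~m + m') for one 16-bit word. */
uint16_t csum_update16(uint16_t check, uint16_t old_word, uint16_t new_word)
{
    uint32_t sum = (uint16_t)~check;
    sum += (uint16_t)~old_word;
    sum += new_word;
    return (uint16_t)~fold16(sum);
}

/* A 32-bit address is two 16-bit words; apply the update to each half. */
uint16_t csum_update32(uint16_t check, uint32_t old_addr, uint32_t new_addr)
{
    check = csum_update16(check, (uint16_t)(old_addr &gt;&gt; 16), (uint16_t)(new_addr &gt;&gt; 16));
    return csum_update16(check, (uint16_t)(old_addr &amp; 0xffff), (uint16_t)(new_addr &amp; 0xffff));
}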
</code></code></pre><p>Checksums are recomputed using <strong>kernel helpers</strong>, which are optimized for in-place, zero-copy updates.</p><p>And then, this is the magical moment, the packet is sent immediately:</p><pre><code><code>return XDP_TX;
</code></code></pre><blockquote><p><em>No socket buffers.<br>No TCP stack.<br>No connection tables.<br>No memory allocations.<br>No context switches.<br>No scheduling delays.</em></p></blockquote><p>Just:</p><pre><code><code>Packet arrives &#8594; compute backend &#8594; rewrite headers &#8594; transmit
</code></code></pre><p>From the backend&#8217;s perspective, the packet looks like it came directly from the client. From the client&#8217;s perspective, it came from the VIP. From the kernel&#8217;s perspective&#8230; the packet essentially <strong>never existed</strong> outside this tiny, deterministic computation.</p><p>The routing decision executes in a fixed, predictable number of instructions. It takes <strong>microseconds</strong>. There is no jitter from GC pauses, no locks to contend over, no queues to saturate. </p><p>The system scales linearly with <strong>CPU cores</strong>, not with the number of flows or the size of the connection table.</p><p>But the performance, impressive as it is, is only part of the story. The deeper shift is <strong>architectural</strong>: the network stops tracking state. There is no per-flow memory, no NAT tables, no sticky sessions. Routing becomes a pure function:</p><pre><code><code>flow tuple &#8594; hash &#8594; backend &#8594; forward
</code></code></pre><p>Everything else (<em>failure recovery, backend updates, and scaling</em>) can be handled by updating <strong>the small, shared hash ring</strong>. </p><p>There is no explosion of per-flow state, no per-flow coordination across nodes, and no warm-up after failures.</p><p>In other words, Katran achieves hyperscale not through faster processors, clever batching, or hardware acceleration, although it uses those wisely, but by <strong>eliminating everything routing doesn&#8217;t actually need</strong>.</p><p>It&#8217;s a profound simplification. And once you internalize it, you begin to see why Katran doesn&#8217;t just move packets faster: it <strong>reshapes the way we think about packet routing entirely</strong>.</p><div><hr></div><h2>Routing as pure computation</h2><p>Traditional load balancers remember things.</p><p>They remember which backend a connection was assigned to, which <strong>NAT mapping</strong> applies to a flow, which sessions are active, which backends are draining, which retries are still in flight. </p><p>A packet doesn&#8217;t just <strong>pass through</strong> the system: it leaves a trace. A new entry in a table. A new piece of memory that must be managed, synchronized, persisted, migrated, invalidated, and eventually garbage-collected.</p><p>At small scale, this feels manageable. At large scale, it becomes suffocating.</p><p>State grows with traffic. Traffic grows with success. And in distributed systems, state is not just complexity: it is <strong>fragility</strong>. </p><p>It must survive failures. It must remain consistent across replicas. 
It must be reconstructed after crashes. Over time, the system stops being a<strong> load balancer </strong>and becomes a distributed state machine that happens to forward packets.</p><p>Katran refuses this entire model. Instead of remembering:</p><blockquote><p><em>&#8220;Flow A maps to backend X.&#8221;</em></p></blockquote><p>Katran recomputes:</p><blockquote><p><em>&#8220;Given tuple A, backend X is the correct destination.&#8221;</em></p></blockquote><p>Every time. From scratch. Deterministically.</p><p>Same input &#8594; same output.</p><p>No memory. No eviction. No synchronization. No garbage collection.<br>No coordination.</p><p>Routing stops being a mutable process and becomes a pure function.</p><p>This is not a performance trick; it is a <strong>philosophical shift.</strong> Once routing is expressed as computation rather than storage, failure semantics collapse into simplicity. </p><p>A node can restart and instantly resume forwarding traffic, because nothing was ever stored. Scaling no longer requires <strong>migrating flow tables</strong>. </p><p>Draining no longer requires tracking sessions. There is no warm-up, no replay, no reconstruction; only recomputation.</p><p>This is where<strong> consistent hashing</strong> becomes central.</p><p>Instead of tracking where flows went, Katran hashes the flow tuple and maps it deterministically<strong> onto a backend </strong>via a hash ring. Every packet independently computes its destination. Every node running Katran executes the same function. 
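</p><p>The idea that every node computes the same answer with no per-flow state can be made concrete. The sketch below is user-space C, not Katran&#8217;s implementation. It swaps in rendezvous (highest-random-weight) hashing for the ring described here, because it fits in a few lines and has the same key property; the struct, the function names, and the splitmix64-style mixer are all assumptions made for illustration.</p><pre><code><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Hypothetical flow key: the 5-tuple from the text, as plain integers. */
struct flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* 64-bit mixer (splitmix64 finalizer), standing in for a production hash. */
static uint64_t mix64(uint64_t x)
{
    x ^= x &gt;&gt; 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x &gt;&gt; 27; x *= 0x94d049bb133111ebULL;
    x ^= x &gt;&gt; 31;
    return x;
}

/* Hash the tuple field by field so struct padding never reaches the hash. */
static uint64_t hash_flow(const struct flow *f)
{
    uint64_t h = mix64(((uint64_t)f-&gt;src_ip &lt;&lt; 32) | f-&gt;dst_ip);
    h = mix64(h ^ (((uint64_t)f-&gt;src_port &lt;&lt; 16) | f-&gt;dst_port));
    return mix64(h ^ f-&gt;proto);
}

/* Score every backend id against the flow and return the index of the
 * highest score, or -1 when the list is empty. Pure function of its inputs. */
int pick_backend(const struct flow *f, const uint32_t *ids, size_t n)
{
    uint64_t fh = hash_flow(f);
    uint64_t best = 0;
    int winner = -1;

    for (size_t i = 0; i &lt; n; i++) {
        uint64_t score = mix64(fh ^ mix64(ids[i]));
        if (winner &lt; 0 || score &gt; best) {
            best = score;
            winner = (int)i;
        }
    }
    return winner;
}
</code></code></pre><p>Two nodes given the same <code>ids</code> list always agree, and removing one backend only moves the flows that had picked it. The price is a loop over all backends for every packet, which is one reason a precomputed ring or lookup table suits a hot path better.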
</p><p>As long as they share the same configuration, they produce the same routing decisions: without coordination, without memory, without state.</p><p>Routing becomes <strong>mathematics</strong>.</p><p>Not &#8220;<em>remember what happened</em>&#8221;, but &#8220;<em>recompute what must happen</em>&#8221;.</p><p>And that single shift, from stored decisions to deterministic computation, is the architectural heart of Katran.</p><div><hr></div><h2>Epilogue</h2><p>Katran isn&#8217;t just a faster load balancer. It&#8217;s a <strong>redefinition of what load balancing means at scale</strong>.</p><p>Traditional systems treat routing as <strong>storage</strong>: track flows, maintain tables, buffer packets, reconcile state. At small scale, it works. </p><p>At Meta scale, it collapses under its own complexity. Latency creeps up. Queues grow. Retries multiply. The plumbing leaks, invisibly, into application performance.</p><p>Katran treats routing as <strong>pure computation</strong>. A packet arrives, a hash is computed, a backend is chosen, the headers are rewritten, and the packet moves on. No state. No coordination. No memory overhead. No jitter. </p><p>The network itself becomes stateless, deterministic, and trivially scalable. Failure recovery, draining, scaling; all of it reduces to updating a <strong>simple hash ring</strong>. The system doesn&#8217;t need to remember. It only needs to compute.</p><p>By moving routing into the kernel&#8217;s earliest receive path and stripping away everything that isn&#8217;t essential, Katran <strong>co-locates computation</strong> with the physics of packet movement. </p><p>It eliminates the<strong> overhead proxies</strong> carry by design, brings software routing close to hardware performance, and reshapes the architectural landscape for planetary-scale infrastructure.</p><p>In the end, Katran is less about &#8220;<em>doing load balancing faster</em>&#8221; and more about <strong>rethinking what load balancing is</strong>. 
</p><p>It&#8217;s a system that vanishes in the network, leaving only <strong>mathematics and determinism</strong> behind; a quiet revolution in how packets flow through the world&#8217;s largest data centers.</p><p>At hyperscale, that subtle shift doesn&#8217;t just improve performance. It restores simplicity, <strong>reliability</strong>, and sanity to a layer of the network that, for too long, had been treated as plumbing when it was always a distributed system.</p><p>Katran reminds us that sometimes, the best way to handle complexity isn&#8217;t to manage it: it&#8217;s to <strong>eliminate it entirely</strong>.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/p/how-meta-turned-the-linux-kernel?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/p/how-meta-turned-the-linux-kernel?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Can Time Be Computed? 
Part II]]></title><description><![CDATA[Why causality might be the real computational primitive]]></description><link>https://www.thesoftwarefrontier.com/p/can-time-be-computed-part-ii</link><guid isPermaLink="false">https://www.thesoftwarefrontier.com/p/can-time-be-computed-part-ii</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Sat, 24 Jan 2026 15:07:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!HR2L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9710b2a7-3987-41da-aadb-c53e562ee6e7_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HR2L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9710b2a7-3987-41da-aadb-c53e562ee6e7_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HR2L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9710b2a7-3987-41da-aadb-c53e562ee6e7_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!HR2L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9710b2a7-3987-41da-aadb-c53e562ee6e7_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!HR2L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9710b2a7-3987-41da-aadb-c53e562ee6e7_1024x1536.png 1272w, 
https://substackcdn.com/image/fetch/$s_!HR2L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9710b2a7-3987-41da-aadb-c53e562ee6e7_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HR2L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9710b2a7-3987-41da-aadb-c53e562ee6e7_1024x1536.png" width="1024" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9710b2a7-3987-41da-aadb-c53e562ee6e7_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2910081,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/185290011?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9710b2a7-3987-41da-aadb-c53e562ee6e7_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HR2L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9710b2a7-3987-41da-aadb-c53e562ee6e7_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!HR2L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9710b2a7-3987-41da-aadb-c53e562ee6e7_1024x1536.png 848w, 
https://substackcdn.com/image/fetch/$s_!HR2L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9710b2a7-3987-41da-aadb-c53e562ee6e7_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!HR2L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9710b2a7-3987-41da-aadb-c53e562ee6e7_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>When halting becomes optional</h2><p>I remember the afternoon clearly. 
It was like every other <strong>debugging afternoon</strong>, except this one felt heavier. </p><p>I was chasing a <strong>race condition</strong> in a distributed system: one of those bugs that appears only when enough users click enough buttons fast enough, vanishes if you trace it, and refuses to reproduce in any deterministic way.</p><p>A <strong>value</strong> would appear in my logs <em>before it was supposed to exist</em>. A future bleeding into the present. </p><p>And for a moment, I realized (not dramatically, just quietly, like noticing a stain on your favorite shirt) that the problem wasn&#8217;t logic at all. It was <strong>time</strong>.</p><p>Every failure I&#8217;d seen in computation (deadlocks, <strong>Heisenbugs</strong>, eventual-consistency anomalies) was a failure of ordering, not of truth.</p><p>And that thought carried me somewhere I had never expected: into the idea that maybe time itself is a kind of <strong>computation</strong>. </p><p>Or maybe computation is a shadow cast by time. Or maybe neither is <strong>fundamental</strong>, and what we think of as execution is just a trick our brains play on us.</p><div><hr></div><h2>Waiting for the future</h2><p>The <strong>Halting Problem</strong>, which we&#8217;ve already covered, is usually presented like a brick wall:</p><blockquote><p><em>&#8220;No algorithm can decide, for every possible program and input, whether it halts.&#8221;</em></p></blockquote><p>It feels eternal. <strong>Immutable</strong>. Absolute.</p><p>But take a closer look. Its &#8220;<em>magic</em>&#8221; relies on something subtle: <em>time</em>. 
A program halts if, eventually, a halting state <strong>s<sub>h</sub></strong> is reached. If it never is, simply running the program will <strong>never tell you so</strong> in finite time.</p><p>In more formal terms, let a program&#8217;s execution be a sequence of states:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\{ s_n \\}_{n \\in \\mathbb{N}}\n&quot;,&quot;id&quot;:&quot;BRPSHZHTMZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Halting is essentially the predicate:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;H(P) = \\exists n \\in \\mathbb{N} : s_n = s_h\n&quot;,&quot;id&quot;:&quot;FXRGAGCKES&quot;}" data-component-name="LatexBlockToDOM"></div><p>Non-halting is simply:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\neg H(P) \\implies \\forall n \\in \\mathbb{N}, \\; s_n \\neq s_h\n&quot;,&quot;id&quot;:&quot;UETMTGHRKE&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Notice the asymmetry: the &#8220;<em>yes</em>&#8221; case is witnessed in finite time, while the &#8220;<em>no</em>&#8221; case is witnessed only <em>after infinite time</em>. </p><p>That waiting, the &#8220;<em>eventually</em>,&#8221; is baked into the problem. Remove time, and halting loses meaning.</p><div><hr></div><h2>Physics as constraint, not execution</h2><p>In physics, especially modern physics, &#8220;<em>waiting</em>&#8221; isn&#8217;t guaranteed. 
Einstein&#8217;s equations don&#8217;t evolve; rather, they constrain (written here in units where G = c = 1):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;G_{\\mu\\nu}(x) = 8 \\pi T_{\\mu\\nu}(x)\n&quot;,&quot;id&quot;:&quot;HWQWYDSIHF&quot;}" data-component-name="LatexBlockToDOM"></div><p>No &#8220;<em>next step</em>&#8221;. No &#8220;<em>after</em>&#8221;. There&#8217;s only a <strong>four-dimensional block</strong> satisfying relational constraints.</p><p>Quantum mechanics complicates it further. The <strong>Schr&#246;dinger equation</strong> provides smooth evolution:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;i \\hbar \\frac{\\partial}{\\partial t} \\psi(x,t) = \\hat{H} \\, \\psi(x,t)\n&quot;,&quot;id&quot;:&quot;FUJMMUPNXV&quot;}" data-component-name="LatexBlockToDOM"></div><p>&#8230;but measurement collapses states discontinuously. And what about quantum gravity? The Hamiltonian constraint known as the <strong>Wheeler&#8211;DeWitt equation</strong> removes time entirely:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\hat{H} \\, \\Psi[h_{ij}, \\phi] = 0\n&quot;,&quot;id&quot;:&quot;WISHUDJGTT&quot;}" data-component-name="LatexBlockToDOM"></div><p>No <strong>t</strong>. No sequence. No &#8220;<em>eventually</em>.&#8221; Just a set of solutions that exist or don&#8217;t.</p><p>Physics does not <em>run</em>; it <em>satisfies</em>. </p><p>Execution is our interface, not the universe&#8217;s method.</p><div><hr></div><h2>Computation without steps</h2><p>I first realized this when reading about closed timelike curves (<strong>CTCs</strong>). </p><p>Imagine a world where the past can interact with the future: a universe with loops in <strong>causality</strong>. </p><p>A computer on a CTC doesn&#8217;t run sequentially. It must satisfy a self-consistency condition. 
</p><p><strong>Deutsch </strong>formalized it in a beautiful way:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\rho = \\Phi(\\rho)\n&quot;,&quot;id&quot;:&quot;LIEDWCSJKM&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Here, <strong>&#961; </strong>is a density matrix representing the state of the system, and <strong>&#934; </strong>is a completely positive trace-preserving map describing interaction with its own past.</p><p>Execution disappears. There is no <strong>s<sub>0</sub> &#8594; s<sub>1</sub> &#8594; s<sub>2</sub></strong>. Only a state that <em>works</em>. </p><p>Problems that are usually thought to be <strong>intractable</strong> suddenly collapse to solutions, not by computing faster, but because the problem itself has been reframed: <em>run replaced by exist</em>.</p><p>I love this because it feels like <strong>cheating</strong>, except that the universe isn&#8217;t cheating. </p><p>We just assumed a rule (sequential steps) that no longer applies.</p><div><hr></div><h2>Halting as satisfiability</h2><p>So&#8230; 
What happens to the Halting Problem, then?</p><p>Instead of asking:</p><blockquote><p><em>&#8220;Does the machine eventually reach s<sub>h</sub>?&#8221;</em></p></blockquote><p>we ask:</p><blockquote><p><em>&#8220;Does there exist a configuration s* satisfying the transition constraints and the halting condition?&#8221;</em></p></blockquote><p>Symbolically:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\exists s^* \\;:\\; \\forall s \\in S, \\; s \\to s^* \\;\\text{and}\\; s^* = s_h\n&quot;,&quot;id&quot;:&quot;RRKRQFKISU&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is no longer halting. It&#8217;s pure satisfiability.<strong> SAT</strong>.</p><p><strong>Undecidability</strong>, once thought eternal, collapses. Not because the universe is super-Turing. But because the universe doesn&#8217;t need a clock to decide.</p><p>I remember leaning back in my chair and whispering to no one:<br><em>&#8220;Oh&#8230; so this is what time was doing all along.&#8221;</em></p><div><hr></div><h2>The missing clock</h2><p>Chaos theory feels inevitable. <strong>Sensitive </strong>dependence on initial conditions:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\lvert \\delta x(t) \\rvert \\sim e^{\\lambda t} \\, \\lvert \\delta x(0) \\rvert\n&quot;,&quot;id&quot;:&quot;OSHBVGRMPN&quot;}" data-component-name="LatexBlockToDOM"></div><p>But if there is no <strong>t</strong>, there is no exponential divergence. Only complex relational structure.</p><p>Chaos becomes not a <strong>dynamical phenomenon</strong> but a structural one: &#8220;<em>I cannot know the whole from partial information</em>&#8221;. The unpredictability persists, but it is <em>epistemic</em>, not temporal.</p><p>Time was doing invisible work again. 
Remove it, and the <strong>phenomenon</strong> mutates.</p><div><hr></div><h2>The structural version</h2><p><strong>Wolfram </strong>tells us that some systems can only be understood by running them. But &#8220;<em>running</em>&#8221; is a temporal concept.</p><p>In a <strong>timeless universe</strong>, irreducibility morphs. It becomes clear that:</p><blockquote><p><em>&#8220;The only way to know the global structure is to analyze the entire relational network.&#8221;</em></p></blockquote><p>Formally, let a system be defined by a set of constraints <strong>C </strong>over states <strong>S</strong>. Irreducibility is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\forall S' \\subsetneq S : C(S') \\not\\models C(S)\n&quot;,&quot;id&quot;:&quot;CJWBHKHJSM&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>You cannot shortcut. Not because of time, not at all. But because structure itself resists local reasoning.</p><div><hr></div><h2>Why Time Feels Real</h2><p>So why does time <em>feel</em> so real? Why does <strong>computation </strong>work at all? Why do programs execute reliably in the world we inhabit, instead of collapsing into incoherence the moment we press &#8220;run&#8221;?</p><p>At first glance, it&#8217;s almost miraculous. </p><p>The universe, a jumble of<strong> quantum fields </strong>and fluctuating geometries, somehow behaves as though it were sequential, as though there were an arrow pointing from &#8220;<em>before</em>&#8221; to &#8220;<em>after</em>.&#8221; </p><p>But the secret, I realized, lies in three deeply interwoven ingredients:</p><ol><li><p><strong>Decoherence</strong> &#8212; the quantum-to-classical alchemy that stabilizes records. Without decoherence, &#8220;<em>memory</em>&#8221; cannot exist. A bit in <strong>RAM</strong>, a log entry, even a thought; all of these are ephemeral superpositions that would collapse if left unobserved. 
<strong>Decoherence </strong>carves out islands of classicality from the quantum sea, giving us something we can call a <em>state</em> at all.</p></li><li><p><strong>Entropy gradients</strong> &#8212; the thermodynamic arrow. Low entropy in the past and higher entropy in the future gives <strong>directionality</strong>. Irreversibility emerges. A program halts not because it <em>must</em>, but because the universe conspires to make irreversible transitions happen. Without entropy gradients, the notion of &#8220;<em>progress</em>&#8221; would evaporate. Your loops could oscillate forever without any emergent asymmetry.</p></li><li><p><strong>Causal stability</strong> &#8212; macroscopic spacetime, the approximate global hyperbolicity that gives us a reliable causal order. <strong>Lightcones </strong>don&#8217;t randomly rotate. Past and future are roughly separable. Partial orders are locally well-behaved. Without this, even classical computation would collapse into ambiguity: a signal could arrive before it&#8217;s sent, and your neat sequence of state transitions would have no semantic meaning.</p></li></ol><p>Together, these<strong> three factors</strong> carve out pockets of reality where <em>causal order, memory, and execution make sense</em>. </p><p>Within those pockets, <strong>computation </strong>works reliably. It feels fundamental, but it&#8217;s a phase, not a law. 
</p><p>A convenient feature of the universe&#8217;s current &#8220;<em>operating mode</em>,&#8221; not a guarantee for all of existence.</p><div><hr></div><h3>Execution is Fragile</h3><p>And then, like a subtle shock, I realized something <strong>terrifyingly </strong>mundane: I had been staring at the same principle in my own work all along.</p><p>Every time I debugged a <strong>distributed system</strong>, the failures were rarely about wrong values. They were about ordering:</p><ul><li><p><strong>Deadlocks</strong> &#8212; processes waiting on each other forever, trapped by causal dependencies.</p></li><li><p><strong>Races</strong> &#8212; nondeterministic outcomes arising from subtle shifts in ordering.</p></li><li><p><strong>Livelocks</strong> &#8212; systems spinning endlessly without making progress, alive but unproductive.</p></li><li><p><strong>Eventual consistency anomalies</strong> &#8212; the illusion of stability until the system &#8220;<em>catches up</em>&#8221;&#8230; or doesn&#8217;t.</p></li></ul><p>Every one of these problems is a temporal problem. Remove time, or<strong> scramble it</strong>, and the systems collapse in ways logic alone cannot predict.</p><p>Execution is fragile. It relies on scaffolding we rarely notice: stable memory, an arrow of time, a consistent partial order. </p><p>Take away any piece, and the <strong>machinery of computation</strong>, the thing we assume is rock-solid, ceases to make sense.</p><p>The universe doesn&#8217;t <em>compute</em>. It <em>exists</em>.</p><p>Execution, halting, irreducibility: these are <strong>projections</strong>. 
They are interfaces for observers embedded in a particular causal phase of reality. </p><p>Halting becomes <strong>satisfiability</strong>. Complexity suddenly becomes constraint. </p><p>The deep laws of nature don&#8217;t &#8220;<em>run</em>&#8221; programs; they encode global <strong>consistency conditions</strong> that, to us, appear as sequential evolution.</p><div><hr></div><h3>Beyond Undecidability</h3><p>So if halting collapses into <strong>satisfiability</strong>, chaos into structural opacity, irreducibility into relational depth, then<em> what remains? What are the true limits of understanding?</em></p><p>They are <strong>no longer temporal.</strong> They are <em>structural</em>. They are about existence.</p><p>Formally, we can describe them as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\exists X \\text{ satisfying } R(X)\n&quot;,&quot;id&quot;:&quot;XCFGDAJTAS&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>where <strong>R(X)</strong> is a set of relational or constraint-based rules defining a system.</p><p>Notice the shift: the question is no longer <em>&#8220;Can this program be computed?&#8221;</em> but <em>&#8220;Can this global structure exist without contradiction?&#8221;</em></p><p>A program that fails to halt in <strong>classical computation</strong> is harmless: it just spins forever. </p><p>But in physics, <strong>inconsistency </strong>is catastrophic. A relational network with no self-consistent configuration simply <em>cannot exist</em>. </p><p>The universe cannot &#8220;<em>run</em>&#8221; an impossible program. Non-existence is the ultimate computational limit.</p><p>Consider a simple example. 
Imagine a set of fields <strong>&#981;<sub>i</sub> </strong>with constraints:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;R(\\phi) =\n\\begin{cases}\n\\phi_1 + \\phi_2 = \\phi_3 \\\\\n\\phi_2 \\cdot \\phi_4 = \\phi_1 \\\\\n\\phi_3 - \\phi_4 = \\phi_2\n\\end{cases}\n&quot;,&quot;id&quot;:&quot;BNVRYCDBOY&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Does there exist a set of values <strong>{&#981;<sub>1</sub>, &#981;<sub>2</sub>, &#981;<sub>3</sub>, &#981;<sub>4</sub>}</strong> satisfying all constraints? That is the fundamental question. (Here the answer happens to be yes: &#981;<sub>1</sub> = &#981;<sub>4</sub> = 0 with &#981;<sub>3</sub> = &#981;<sub>2</sub> satisfies all three.) </p><p>Not whether a <strong>sequence of updates</strong> will reach a solution, but whether the solution exists <em>at all</em>.</p><p>And that, I realized, is the fundamental limit of reality: <em>structural consistency</em>. Not undecidability, not<strong> non-termination</strong>. Just existence itself.</p><p>Every time we think we are bumping against the &#8220;<em>hard limits</em>&#8221; of computation (P vs NP, the Halting Problem, irreducibility) we are really bumping <strong>against the limits </strong>of the phase of reality we inhabit. </p><p>Outside <strong>decoherence</strong>, outside the entropy gradient, outside causal stability, those limits may dissolve. </p><p>Halting, chaos, and irreducibility are mere <strong>artifacts </strong>of our temporally embedded perspective.</p><div><hr></div><h2>The radical implications</h2><p>Execution depends on time. Time depends on physical conditions. 
And <strong>undecidability </strong>depends on execution.</p><p>Step back, and it all becomes a <strong>chain of contingencies:</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Existence} \\;\\Rightarrow\\; \\text{Time emerges} \\;\\Rightarrow\\; \\text{Execution possible} \\;\\Rightarrow\\; \\text{Undecidability arises}\n&quot;,&quot;id&quot;:&quot;ATKUOHCWLI&quot;}" data-component-name="LatexBlockToDOM"></div><p>Remove the first link, and the rest vanish.</p><p>The universe doesn&#8217;t step forward. It <em>satisfies</em>. It <em>exists</em>. It is a structure, not a computation. </p><p>And what we perceive as <strong>halting</strong>, chaos, or irreducibility is simply the shadow of our perspective, projected onto a universe that doesn&#8217;t fundamentally need time to exist.</p><p>The <strong>deepest limits</strong>, it turns out, are not computational. They are ontological. Not about &#8220;what can be done,&#8221; but about &#8220;what <em>can exist</em>.&#8221; </p><p>And in that realization, every programmer, <strong>physicist</strong>, and philosopher finds a quiet, unnerving thrill: the universe doesn&#8217;t need to run to be infinite.</p><div><hr></div><h2>Seeing the universe as constraint</h2><p>I started Part I with a <strong>debugging </strong>story. I end Part II the same way.</p><p>Systems fail not because they <strong>execute incorrectly</strong>, but because they try to execute in a regime where only <em>existence</em> matters.</p><p>The universe does not run.<br>It <strong>satisfies</strong>.</p><p>Time is not its clock. It is our interface.</p><p>Computation is not its language. It&#8217;s our <strong>projection</strong>.</p><p>Undecidability, chaos, irreducibility; these are features of temporally embedded observers. 
They are <strong>not fundamental.</strong></p><p>And the question that lingers, the one I cannot shake, is this:</p><p>If the universe is a solution, not a process, then <em>what determines which solutions exist at all?</em></p><p>Not what happens. But<strong> what is allowed.</strong></p>]]></content:encoded></item><item><title><![CDATA[Can Time Be Computed? Part I]]></title><description><![CDATA[Why causality might be the real computational primitive]]></description><link>https://www.thesoftwarefrontier.com/p/can-time-be-computed-part-i</link><guid isPermaLink="false">https://www.thesoftwarefrontier.com/p/can-time-be-computed-part-i</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Wed, 21 Jan 2026 12:53:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6lsS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d8eedb0-29ea-4e63-979d-c816ddc11908_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6lsS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d8eedb0-29ea-4e63-979d-c816ddc11908_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6lsS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d8eedb0-29ea-4e63-979d-c816ddc11908_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!6lsS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d8eedb0-29ea-4e63-979d-c816ddc11908_1024x1536.png 848w, 
https://substackcdn.com/image/fetch/$s_!6lsS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d8eedb0-29ea-4e63-979d-c816ddc11908_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!6lsS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d8eedb0-29ea-4e63-979d-c816ddc11908_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6lsS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d8eedb0-29ea-4e63-979d-c816ddc11908_1024x1536.png" width="1024" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8d8eedb0-29ea-4e63-979d-c816ddc11908_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3970200,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/185285702?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d8eedb0-29ea-4e63-979d-c816ddc11908_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6lsS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d8eedb0-29ea-4e63-979d-c816ddc11908_1024x1536.png 424w, 
https://substackcdn.com/image/fetch/$s_!6lsS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d8eedb0-29ea-4e63-979d-c816ddc11908_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!6lsS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d8eedb0-29ea-4e63-979d-c816ddc11908_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!6lsS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d8eedb0-29ea-4e63-979d-c816ddc11908_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>I didn&#8217;t arrive at this question through theoretical physics. I arrived through <strong>debugging</strong>.</p><p>Late afternoon. Half-broken distributed system. A <strong>race condition</strong> that only appeared under load, vanished under tracing, and refused to reproduce deterministically. A value read before it had causally stabilized. A future leaking into the present.</p><p>Not a <strong>logic bug. </strong>A temporal one.</p><p>And I remember thinking, not dramatically, just with the quiet unease of someone who has seen this movie too many times:</p><p>Every serious failure mode in computation is a failure of <strong>ordering</strong>, not of truth.</p><ul><li><p>Deadlocks.</p></li><li><p>Non-termination.</p></li><li><p>Eventual consistency anomalies.</p></li><li><p>Heisenbugs.</p></li></ul><p>They&#8217;re not about wrong values; they&#8217;re about wrong <strong>partial orders</strong>.</p><p>Which slowly pushed me toward a question that felt, at first, like a category error:</p><blockquote><p>Is time something computation happens <em>inside</em>?<br>Or is it something physics itself must <em>compute</em>?</p></blockquote><p>Because every algorithm presupposes a <strong>causal structure.</strong></p><p>And physics, increasingly, does not.</p><p>That&#8217;s the fault line I want to explore.</p><p>Not metaphorically.<br>Not <strong>operationally</strong>.<br>But where computation stops being a tool, and starts being a claim about reality.</p><div><hr></div><h2>The unspoken 
axiom of computation</h2><p>Every formal model of computation begins with a hidden axiom:</p><p>There exists a <strong>well-founded causal ordering</strong> over computational steps.</p><p>A Turing machine defines a sequence:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(s_0, q_0) \\to (s_1, q_1) \\to (s_2, q_2) \\to \\cdots\n&quot;,&quot;id&quot;:&quot;CBXZQMVBLX&quot;}" data-component-name="LatexBlockToDOM"></div><p>A circuit defines a directed acyclic graph with layers. Lambda calculus assumes <strong>&#946;-reduction </strong>sequences.</p><p>Even concurrent and <strong>nondeterministic models</strong> rely on partial orders, things like Mazurkiewicz traces, event structures, causal cones.</p><p>But nowhere in computation theory do we define <em>time</em>.</p><p>We <strong>assume it</strong> and we treat temporal succession as primitive, not derived.</p><p>Which is fine, unless we believe computation is physical, because physics does not give you a <strong>global ordering</strong> for free.</p><div><hr></div><h2>When physics revoked the clock</h2><p>Classical mechanics offers a global time parameter <em>t</em> &#8712; &#8477;, and evolution equations of the form:<br></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\frac{d}{dt} x(t) = F(x(t)).\n&quot;,&quot;id&quot;:&quot;OBJLMTCDHZ&quot;}" data-component-name="LatexBlockToDOM"></div><p><br>This fits computation beautifully. <strong>Discretize time.</strong> Simulate. Iterate.</p><p>But relativity removes global simultaneity. Spacetime is actually a <strong>Lorentzian manifold</strong> (<em>M</em>, <em>g</em>) with only local causal structure: lightcones, not layers. </p><p>There exists no preferred foliation into spacelike <strong>hypersurfaces.</strong> Different observers disagree on temporal order for spacelike-separated events.</p><p>Still, maybe computation survives by picking a frame.</p><p><strong>Quantum mechanics </strong>destabilizes that further. 
Time evolution is unitary:<br></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n|\\psi(t)\\rangle = e^{-iHt} |\\psi(0)\\rangle.\n\n&quot;,&quot;id&quot;:&quot;ZKDLZCAFPQ&quot;}" data-component-name="LatexBlockToDOM"></div><p><br>but measurement is discontinuous, stochastic, and not generated by the <strong>Schr&#246;dinger equation. </strong>There is no closed-form dynamical law for collapse, which is already a fracture between time and dynamics.</p><p>Then quantum gravity removes time entirely. In canonical quantum gravity, the <strong>Wheeler&#8211;DeWitt </strong>equation reads:<br></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\hat{H}\\Psi[g_{ij}, \\phi] = 0.\n&quot;,&quot;id&quot;:&quot;MUUWMRZLOL&quot;}" data-component-name="LatexBlockToDOM"></div><p><br>with no time derivative. The universe is described by a stationary wavefunctional over <strong>spatial geometries</strong> and matter fields.</p><blockquote><p>No evolution parameter.<br>No <em>t</em>.<br>No &#8220;<em>next</em>&#8221;.</p></blockquote><p>Which leaves us staring at something structural:</p><p>Computation assumes time as a<strong> primitive ordering</strong>. Fundamental physics does not.</p><p>So either computation is not fundamental, or time is not. 
Either way, something we thought primitive isn&#8217;t.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>Can you compute without temporal order?</h2><p>At first glance, this seems <strong>incoherent.</strong> Computation <em>is</em> ordered state transition:<br></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\ns_0 \\to s_1 \\to s_2 \\to \\dots\n\n&quot;,&quot;id&quot;:&quot;USFPFIIMXO&quot;}" data-component-name="LatexBlockToDOM"></div><p>Remove well-founded succession, and execution collapses. But physics forces us to consider precisely that regime.</p><p>In quantum gravity and emergent spacetime models:</p><ul><li><p><strong>Geometry </strong>emerges from entanglement structure (AdS/CFT, tensor networks).</p></li><li><p>Causal relations fluctuate.</p></li><li><p>The distinction between &#8220;<em>earlier</em>&#8221; and &#8220;<em>later</em>&#8221; may not exist at Planck scale.</p></li></ul><p>Which means there may be <strong>physically admissible</strong> regions of reality where no global or even local temporal ordering exists.</p><p>And in such regions, computation, as defined by <strong>state transition</strong> systems indexed by N, is simply undefined.</p><p>Not because machines fail.</p><p>Because the semantic preconditions of &#8220;<em>execution</em>&#8221; are absent. 
This pushed me toward an unsettling reframing:</p><p>Maybe computation is not primitive.</p><p>Maybe <strong>causality</strong> is.</p><blockquote><p>Not clocks.<br>Not time parameters.<br>Not sequences.</p></blockquote><p>Just the partial order relation:<br></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;A \\prec B \\quad \\text{(A can influence B)}.\n\n&quot;,&quot;id&quot;:&quot;OKOFTYKKZZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>And everything else (time, algorithms, dynamics) is emergent structure layered on top of that relation when it happens to be <strong>acyclic</strong>, well-founded, and stable.</p><p>Which is not guaranteed by physics.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>Maybe time is computation</h2><p>There&#8217;s a competing thesis.</p><p>Maybe time doesn&#8217;t underlie computation. 
Maybe time <em>is</em> computation.</p><p>This idea recurs across physics:</p><ul><li><p><strong>Thermodynamics</strong>: entropy increase corresponds to irreversible information erasure.</p></li><li><p><strong>Quantum information:</strong> unitary evolution corresponds to reversible computation.</p></li><li><p><strong>Holography</strong>: spacetime geometry emerges from entanglement patterns.</p></li><li><p><strong>Tensor network models</strong>: geometry is literally computational wiring.</p></li></ul><p>From this view, temporal ordering is nothing but dependency ordering between informational degrees of freedom.</p><p>Time is not background, it is output.</p><p><strong>Computation </strong>presupposes time, and time emerges from computation, which creates a loop that suggests neither is truly fundamental, leaving only consistency as the primitive. </p><p>Not evolution, not process, not execution, but the <strong>satisfaction </strong>of constraints over relational structures, exactly as modern physics is written. 
</p><p>The universe is not something that unfolds like a movie but a <strong>solution </strong>that exists because all conditions fit together.</p><div><hr></div><h2>Causal structure as a computational resource</h2><p>In standard computational models, causal structure is trivial: a total or partial order on steps.</p><p>But in physics, causal structure is dynamical.</p><p>In relativity, causal order depends on metric structure:<br></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p \\prec q \\iff q \\in J^+(p),\n\n&quot;,&quot;id&quot;:&quot;GDUMYERWKB&quot;}" data-component-name="LatexBlockToDOM"></div><p><br>and <strong><em>J</em><sup>+</sup>(<em>p</em>)</strong>, the causal future of <em>p</em>, depends on spacetime geometry.</p><p>In quantum mechanics, entangled systems violate classical factorization:<br></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\rho_{AB} \\neq \\rho_A \\otimes \\rho_B,\n\n&quot;,&quot;id&quot;:&quot;GOQUTFNHIL&quot;}" data-component-name="LatexBlockToDOM"></div><p><br>producing nonlocal correlations without signaling.</p><p>In quantum gravity, even the <strong>causal order relation</strong> may not be well-defined.</p><p>Which means that causality is not free.</p><p>It is constrained by physical law.</p><p>And if computation is structured causality (state transitions respecting dependency) then physics is not merely the <strong>substrate of computation</strong>.</p><p>Physics defines what computation <em>means</em>.</p><p>Which leads to a sharper version of the Physical Church&#8211;Turing Thesis:</p><blockquote><p>Every physically realizable <strong>causal structure</strong> admits an equivalent computational representation.</p></blockquote><p>And that claim is radically non-obvious.</p><p>Because some causal structures are cyclic.<br>Some are indefinite.<br>Some are <strong>globally inconsistent.</strong></p><p>And in such worlds, computation (in the Turing sense) does not merely become inefficient. 
It becomes<strong> ill-posed</strong>.</p><div><hr></div><h2>Closed timelike curves and the collapse of execution</h2><p>General relativity admits solutions with closed timelike curves (CTCs), where causal order contains cycles:<br></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p \\prec p.\n\n&quot;,&quot;id&quot;:&quot;KHKNVKKMLY&quot;}" data-component-name="LatexBlockToDOM"></div><p>In such spacetimes, execution semantics fail. There is <strong>no acyclic dependency </strong>graph. No notion of &#8220;<em>before state</em>&#8221; and &#8220;<em>after state</em>&#8221;.</p><p>But when computer scientists studied computation in the presence of CTCs, notably <strong>Deutsch (1991)</strong>, and later <strong>Aaronson&#8211;Watrous</strong>, something unexpected happened.</p><p>Computational power increased, not incrementally, but qualitatively.</p><p>Problems in <strong>PSPACE</strong> become solvable in polynomial time.</p><p>Not because you can iterate faster, but because computation no longer proceeds sequentially.</p><p>Instead, the system must satisfy a fixed-point constraint:<br></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\rho = \\Phi(\\rho),\n\n&quot;,&quot;id&quot;:&quot;YBZEQVKQCZ&quot;}" data-component-name="LatexBlockToDOM"></div><p><br>where <strong>&#934;</strong> is a completely positive trace-preserving map representing the circuit interacting with its own past 
state.</p><p>The computation is not:<br></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;s_0 \\to s_1 \\to s_2 \\to \\dots\n\n&quot;,&quot;id&quot;:&quot;OJEFOBEYZK&quot;}" data-component-name="LatexBlockToDOM"></div><p>It is:<br></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Find } s \\text{ such that } F(s) = s.\n&quot;,&quot;id&quot;:&quot;XECAVYDDDN&quot;}" data-component-name="LatexBlockToDOM"></div><p>Execution is replaced by <strong>global consistency</strong>, which is precisely how fundamental physics operates.</p><p><strong>Einstein&#8217;s equations:</strong><br></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;G_{\\mu\\nu} = 8 \\pi T_{\\mu\\nu}\n\n&quot;,&quot;id&quot;:&quot;OTDQIDFGZB&quot;}" data-component-name="LatexBlockToDOM"></div><p><br>do not evolve geometry: they constrain the entire spacetime manifold.</p><p><strong>Quantum path</strong> integrals:<br></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\langle \\psi_f | \\psi_i \\rangle = \\int \\mathcal{D}[x(t)] \\, e^{i S[x]/\\hbar}\n\n&quot;,&quot;id&quot;:&quot;XJURBJYWXV&quot;}" data-component-name="LatexBlockToDOM"></div><p><br>do not generate <strong>trajectories</strong>; they sum over all histories consistent with boundary conditions.</p><p>Physics does not run but satisfies, suggesting that the deepest form of &#8220;<em>computation</em>&#8221; in nature is not algorithmic but <strong>static</strong>, constraint-based, and atemporal.</p><div><hr></div><h2>The day I stopped thinking the 
universe executes</h2><p>I used to imagine the universe as a machine.</p><blockquote><p>Initial conditions.<br>Evolution law.<br>Future states.</p></blockquote><p>A cosmic program. Then I suddenly noticed: that&#8217;s not how the equations are written.</p><p>They don&#8217;t say:<br></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\nx(t + \\Delta t) = F(x(t)).\n&quot;,&quot;id&quot;:&quot;CRZOADCLFY&quot;}" data-component-name="LatexBlockToDOM"></div><p>They say instead:<br></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\mathcal{E}[g, \\phi] = 0,\n\n&quot;,&quot;id&quot;:&quot;MIRMVYMHDC&quot;}" data-component-name="LatexBlockToDOM"></div><p><br>a global constraint over fields on spacetime.</p><p>The universe does <strong>not step forward</strong>; it satisfies equations over four-dimensional structures. </p><p>What we call time is <strong>not execution</strong> but traversal, a worldline slicing a static solution, and what we call dynamics is not computation but projection. </p><p>Time is not the <strong>medium of physics</strong>; it is the interface exposed to embedded observers.</p><div><hr></div><h2>What does &#8220;halting&#8221; mean without time?</h2><p>Now things become truly unstable: the Halting Problem assumes a machine starts in state <strong>s<sub>0</sub></strong>, executes a sequence of transitions, and may or may not reach a halting state <strong>s<sub>h</sub></strong>. </p><p>In a timeless universe, there is no start, no sequence, no eventually: only the question of whether a <strong>globally consistent </strong>configuration exists. </p><p>Halting becomes satisfiability, non-halting becomes inconsistency, and there is no <strong>semi-decidability</strong>, no asymmetry, no waiting: just the existence or non-existence of fixed points. </p><p>This suggests something radical: classical undecidability, like the <strong>Halting Problem</strong>, may not be a fundamental feature of physical law. 
</p><p>This is not because physics violates Turing limits, but because physics does not instantiate the <strong>temporal semantics</strong> those limits presuppose. </p><p>Undecidability may emerge only from <strong>timeful computation</strong>, not from reality itself.</p><div><hr></div><h2>The limits of execution</h2><p><em>Where does this leave us?</em></p><p>If computation presupposes time, and time is not fundamental, then &#8220;<em>algorithm</em>&#8221; becomes provisional. </p><p>It works inside<strong> pockets of reality </strong>where causality is well-behaved, decoherence stabilizes information, and <strong>entropy </strong>defines an arrow. </p><p>Outside those pockets, stepwise execution may simply not exist.</p><p>The universe may not &#8220;<em>run</em>&#8221; at all. </p><p>It may simply <strong>exist</strong>: a globally consistent solution to relational constraints, where evolution, dynamics, and causality are emergent features perceived by observers. </p><p>Execution is projection. Halting is satisfiability. Complexity is constraint.</p><p>Classical computer science tells us what can be computed if time flows like a river. </p><p>Physics tells us what is <strong>consistent</strong> if time is undefined. 
</p><p>Between the two lies a gap: one about algorithms, the other about existence.</p><p><strong>No algorithm</strong>, however clever, can escape the possibility that time itself is emergent, and computation exists only because the universe provides the scaffolding to support it.</p><p>In <strong>Part II</strong>, we will explore whether undecidability, chaos, and computational irreducibility are artifacts of timeful computation, or whether they emerge from the very structure of reality itself.</p>]]></content:encoded></item><item><title><![CDATA[Beyond the Semantic Layer]]></title><description><![CDATA[How analytics systems fail, why the semantic layer was inevitable, and why it is not enough]]></description><link>https://www.thesoftwarefrontier.com/p/beyond-the-semantic-layer</link><guid isPermaLink="false">https://www.thesoftwarefrontier.com/p/beyond-the-semantic-layer</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Mon, 12 Jan 2026 12:18:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!U2lK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5216aa01-1e25-45d4-86e1-fa760bd8fb4e_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!U2lK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5216aa01-1e25-45d4-86e1-fa760bd8fb4e_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U2lK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5216aa01-1e25-45d4-86e1-fa760bd8fb4e_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!U2lK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5216aa01-1e25-45d4-86e1-fa760bd8fb4e_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!U2lK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5216aa01-1e25-45d4-86e1-fa760bd8fb4e_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!U2lK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5216aa01-1e25-45d4-86e1-fa760bd8fb4e_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!U2lK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5216aa01-1e25-45d4-86e1-fa760bd8fb4e_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5216aa01-1e25-45d4-86e1-fa760bd8fb4e_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2704408,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/180533705?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5216aa01-1e25-45d4-86e1-fa760bd8fb4e_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!U2lK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5216aa01-1e25-45d4-86e1-fa760bd8fb4e_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!U2lK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5216aa01-1e25-45d4-86e1-fa760bd8fb4e_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!U2lK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5216aa01-1e25-45d4-86e1-fa760bd8fb4e_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!U2lK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5216aa01-1e25-45d4-86e1-fa760bd8fb4e_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>The day I realized data was not the problem</h2><p>There is a particular <strong>kind of frustration</strong> that only appears once a data system is, by all reasonable standards, <em>successful</em>.</p><blockquote><p><em>The pipelines run.<br>The warehouse scales.<br>The dashboards load.<br>The executives nod.</em></p></blockquote><p>The incident reports stop.<br>The on-call pager stays quiet.<br>The <strong>architecture diagrams</strong> finally look clean.</p><p>And yet, something feels deeply wrong.</p><p>Not loudly wrong. Nothing is on fire. No red alerts. No missing data. The system does what it is supposed to do. </p><p>It <strong>produces numbers</strong>, charts, trends. It answers questions: at least the ones it was designed to answer.</p><p>The <strong>discomfort </strong>is quieter than that. 
Harder to name.</p><p>I remember the moment clearly. I was reading yet another thoughtful, well-written deep dive on the <strong>semantic layer</strong>. </p><p>It explained metrics, dimensions, abstractions, self-serve analytics. </p><p>It talked about centralizing logic, eliminating ambiguity, empowering business users. I nodded along. Everything made sense. The arguments were sound. The tooling was impressive.</p><p>This was, by any standard, <em>good work</em>.</p><p>Eight hours later, I closed my laptop with an uncomfortable realization I couldn&#8217;t quite shake:</p><blockquote><p><em>If all of this is so well understood,<br>why does analytics still feel so brittle?</em></p></blockquote><p>Not broken. Brittle.</p><p>The kind of brittle where nothing fails outright, but everything feels <strong>slightly unstable. </strong>Where confidence depends on context. Where every number comes with a mental footnote. Where trust is provisional.</p><p>A single new question could fracture consensus.<br>A <strong>small change</strong> in definition could invalidate months of careful work.<br>Two dashboards could disagree without either being wrong, and without anyone being able to explain, cleanly, why.</p><p>Everyone involved was competent.<br>Everyone followed <strong>best practices.</strong><br>And yet, the system could not absorb change without stress.</p><p>That was the moment the problem stopped feeling merely technical.</p><p>This essay is <strong>my own attempt </strong>to sit with that discomfort and follow it all the way down.</p><ul><li><p>Not to criticize tools.</p></li><li><p>Not to promote frameworks.</p></li><li><p>Not to propose yet another layer for its own sake.</p></li></ul><p>But to understand what analytics systems are actually missing,<br>and why the hardest problems only seem to appear <em>after</em> everything is supposedly working.</p><div><hr></div><h2>The illusion of maturity</h2><p>For most of the history of data engineering, scale was the enemy.</p><ol><li><p><em>Could we ingest fast enough?</em></p></li><li><p><em>Could we store cheaply enough?</em></p></li><li><p><em>Could we query quickly enough?</em></p></li></ol><p>These were clear, honest problems. They had <strong>benchmarks</strong>. They had roadmaps. They had satisfying answers, usually involving more cores and a larger bill.</p><p>And, eventually, we won.</p><blockquote><p><em>Cloud storage removed scarcity.<br>Columnar warehouses removed fear.<br>Distributed systems removed ceilings.</em></p></blockquote><p>We learned how to move data around at impressive speeds, how to store absurd volumes of it, and how to scan it so fast that no one even pretends to <strong>count rows </strong>anymore. </p><p>We casually talk about <strong>terabytes </strong>the way people once talked about megabytes, which should probably concern someone.</p><p>By almost any technical measure, the modern data stack worked. And yet, the promised land never quite arrived.</p><p>Not because the <strong>systems failed</strong> (everything was green), but because once scale stopped being the problem, something else quietly took its place.</p><p>Meaning.</p><p>The remaining problems were no longer about throughput or latency. They were about understanding. About <strong>interpretation</strong>. About agreement. 
</p><p>The kind of problems that don&#8217;t show up in monitoring dashboards and cannot be solved by adding another node.</p><p>Because at its core, <strong>analytics</strong> is not really about numbers.</p><p>It is about translation.</p><blockquote><p><em>From events to facts.<br>From facts to metrics.<br>From metrics to decisions.</em></p></blockquote><p>Each step looks straightforward on a slide deck. Each one feels like it should be automatable. And each one, in practice, smuggles in a <strong>handful of assumptions</strong> while no one is looking.</p><ul><li><p>An event becomes a fact once we decide it matters.</p></li><li><p>A fact becomes a metric once we decide how to aggregate it.</p></li><li><p>A metric becomes a decision once someone decides to trust it; often after asking who built the dashboard.</p></li></ul><p>Every translation introduces interpretation, and it reflects a <strong>precise point of view.</strong></p><p>And interpretation, left unmanaged, drifts.</p><p>Slowly. Politely. With good intentions.</p><p><strong>Definitions evolve</strong>. Exceptions accumulate. A metric that once meant something precise becomes a rough suggestion. </p><p>The system still <strong>answers questions</strong>: it just answers them with a quiet asterisk.</p><p>Scale used to hide this. When data was scarce, ambiguity was tolerable. When queries were expensive, people didn&#8217;t ask<strong> follow-ups</strong>. When dashboards were rare, contradictions stayed invisible.</p><p>But once analytics became cheap, fast, and everywhere, meaning became the bottleneck.</p><p>And that&#8217;s when a <strong>data system</strong> that looks mature on paper starts to feel oddly fragile in practice: like a very impressive machine held together by <strong>shared assumptions </strong>and a mutual agreement not to ask <em>that</em> question.</p><div><hr></div><h2>When precision isn&#8217;t enough</h2><p><strong>SQL </strong>is one of the most successful languages ever designed. 
</p><p>It scales from hobby projects to<strong> trillion-row warehouses</strong>, and entire careers are built on fluency. And yet, it answers <em>how</em>, not <em>why</em>.</p><pre><code><code>SELECT SUM(amount)
FROM transactions
WHERE status = 'settled'</code></code></pre><p>Syntactically, this is flawless. The query planner understands it, the optimizer executes it efficiently. <em>Semantically</em>? <strong>Empty</strong>.</p><blockquote><p><em>What does &#8220;settled&#8221; mean? Settled with the customer, accounting, or the bank? </em></p><p><em>Why are refunds excluded, because they were reversed later, or because someone copied this query from last quarter? </em></p><p><em>Which FX rate applies; spot, month-end, or the one that makes finance look good? </em></p></blockquote><p>The <strong>database </strong>does not know. SQL executes logic; it cannot encode meaning, context, or assumptions.</p><p>Every <strong>analytical query </strong>carries this hidden cognitive tax: <em>is this table canonical? Is this join correct? Is this filter business-approved? </em></p><p>Multiply that across dozens of analysts, <strong>hundreds of dashboards</strong>, and ad-hoc queries running before meetings, and what looks like efficiency is really entropy: quietly accumulating semantic drift.</p><p>The<strong> semantic layer</strong> emerged as an honest response. Metrics became objects. Dimensions became curated. Business logic was centralized. Joins, filters, and aggregations were standardized. </p><p>Dashboards could finally speak a <strong>common language. </strong>This was progress.</p><p>But notice what it does <em>not</em> stabilize: assumptions, temporal validity, domain disagreement. The layer freezes a moment in time while the business continues to evolve. </p><p><strong>Products change</strong>, markets shift, regulations update: and suddenly a once-&#8220;<em>canonical</em>&#8221; metric feels negotiable.</p><p>Because metrics are not facts. Invoices, payments, FX rates: those are facts. </p><p>Metrics are interpretations: stories we tell about those facts, often encoded as SQL transformations, <strong>dbt models</strong>, or metrics-layer objects. 
And like all stories, they are context-dependent and time-sensitive.</p><p>Meaning, it turns out, does not live in the warehouse. It lives in<strong> analyst intuition</strong>, tribal knowledge, and half-maintained documentation. The warehouse stores raw events. </p><p>The semantic layer stores calculated objects. But understanding, what the numbers <em>actually signify</em>, lives elsewhere. </p><p>And until we make that explicit, even the <strong>most precise analytics </strong>system will remain impressively brittle.</p><div><hr></div><h2>The Cognitive Data Interface</h2><p>Ask a large language model to answer a <strong>business question</strong>, and you get a revelation: it guesses.</p><ul><li><p><strong>It guesses joins.</strong> Should it join <code>orders</code> to <code>payments</code> or <code>transactions</code>? Which key is canonical?</p></li><li><p><strong>It guesses metrics.</strong> Does &#8220;<em>net revenue</em>&#8221; include refunds? Discounts? FX conversions?</p></li><li><p><strong>It guesses intent.</strong> Is the user asking for last quarter&#8217;s revenue, or a forecast-adjusted figure?</p></li></ul><p>This is not because LLMs are careless or sloppy. It is because <strong>we never encoded meaning explicitly</strong>. </p><p>All the rules, edge cases, and assumptions that analysts carry in their heads are invisible to machines. <strong>SQL queries</strong>, dbt models, and metrics layers are only partially helpful: they are brittle shells without context.</p><p>Humans quietly compensate for ambiguity. 
Analysts know that <code>settled = completed minus refunded</code>, even if that logic is scattered across Slack threads, meeting notes, and half-written docs. </p><p>Machines do not. AI exposes these cracks, forcing us to <strong>formalize </strong>what we never bothered to codify.</p><h3>AI did not create the problem, it exposed it</h3><p>This is an important distinction. AI did not invent ambiguity. It only <strong>surfaced what was already there</strong>.</p><p>Every analytical system has been silently<strong> drifting under layers</strong> of human assumptions. Analysts and dashboards absorb ambiguity without complaint; machines cannot. </p><p>AI does not &#8220;<em>understand</em>&#8221; business rules: it follows logic. If the logic is incomplete, inconsistent, or implicit, the <strong>AI will guess</strong>, and those guesses highlight exactly where meaning was never formalized.</p><h3>From metrics to knowledge</h3><p>The semantic layer asks a reasonable question:</p><blockquote><p><em>How do we define metrics once, so that dashboards, analysts, and AI agree?</em></p></blockquote><p>But a deeper question is far <strong>more consequential:</strong></p><blockquote><p><em>How does an organization know what it knows?</em></p></blockquote><p>This is<strong> not a data engineering</strong> <strong>problem </strong>in the classical sense. It is <strong>epistemic</strong>. It is about encoding the logic, context, and history of decisions in a way that both humans and machines can consume reliably. 
</p><p>Without this, every query, model, and dashboard remains an <strong>approximation</strong>, and AI only makes those approximations painfully visible.</p><div><hr></div><h2>Definition of the Cognitive Data Interface (CDI)</h2><p>The <strong>Cognitive Data Interface (CDI)</strong> is an architectural layer designed to make organizational knowledge <strong>explicit, versioned, and machine-actionable</strong>. It is built around four core principles:</p><ul><li><p><strong>Separates facts from interpretations</strong><br>Facts are immutable raw events, e.g., <code>(order_id, amount, currency, event_ts)</code>; interpretations are calculated metrics, e.g., <code>RecognizedRevenue</code>, which depends on facts plus business logic.</p></li><li><p><strong>Versions meaning over time</strong><br>Metrics and definitions are not static. Each calculation carries temporal metadata so that historical analyses can reproduce what <em>was believed at the time</em>.</p></li><li><p><strong>Encodes domain knowledge explicitly</strong><br>Business rules, edge cases, assumptions, and constraints are captured as executable objects rather than scattered in documentation or analyst intuition.</p></li><li><p><strong>Serves humans and machines equally</strong><br>Dashboards, analysts, and AI agents query the CDI through the same interface, ensuring consistency across all consumers.</p></li></ul><p>The CDI treats meaning as <strong>first-class, executable infrastructure</strong>, not documentation. It&#8217;s the bridge between raw data and trusted insight, for humans and AI alike.</p><h3>Facts vs. 
Assertions</h3><p>Consider this concrete example:</p><p><strong>Facts (raw events):</strong></p><pre><code><code>order_id | amount | currency | event_ts
--------------------------------------
1001     | 100    | USD      | 2026-01-01
1002     | 200    | EUR      | 2026-01-02
</code></code></pre><p><strong>Assertions (interpretations):</strong></p><pre><code><code>RecognizedRevenue :=
  SUM(amount)
  WHERE status = 'completed'
  USING fx_rate_at(event_ts)
</code></code></pre><ul><li><p>Facts persist: they are<strong> immutable observations</strong> of what happened.</p></li><li><p>Assertions evolve: they encode<strong> business logic</strong> that may change as definitions, rules, or assumptions change.</p></li></ul><p>This separation allows you to version, audit, and reason about both <em>what happened</em> and <em>what we believed happened</em>, without conflating the two.</p><h3>Time as a First-Class Semantic Dimension</h3><p>Every metric, rule, and definition has a lifespan. The CDI tracks two types of time:</p><ul><li><p><strong>Valid time:</strong> when the fact or metric applies in the real world.</p></li><li><p><strong>Knowledge time:</strong> when the organization codified, recognized, or made a decision based on that fact or rule.</p></li></ul><p>This enables questions like:</p><blockquote><p><em>&#8220;What did we believe was our net revenue for Q4 2025, as of January 1, 2026?&#8221;</em></p></blockquote><p>Even if definitions of &#8220;completed&#8221; orders, FX rates, or discount rules changed later, the CDI reproduces the answer faithfully. </p><p>Dual-temporal modeling ensures metrics are <strong>reproducible, context-aware, and auditable</strong>, allowing both humans and AI to reason about the past without guessing or hallucinating.</p><div><hr></div><h2>Architecture in Practice</h2><p>At the heart of the Cognitive Data Interface (CDI) lies the <strong>semantic graph</strong>, a live, versioned, and queryable representation of organizational knowledge. </p><p>Its <strong>nodes</strong> represent entities, metrics, dimensions, and policies. 
</p><blockquote><p><strong>Entities</strong> might be <code>customer</code><strong>, </strong><code>order</code>, or <code>subscription</code>; </p><p><strong>metrics</strong> could be <code>RecognizedRevenue</code> or <code>NetRecurringRevenue</code>; </p><p><strong>dimensions</strong> include <code>region</code>, <code>industry_segment</code>, or <code>customer_size</code>; </p></blockquote><p>and <strong>policies </strong>encode data access rules or approval constraints. </p><p>Edges define relationships such as <code>depends_on</code> for dependencies between metrics or entities, <code>aggregates</code> for roll-ups from facts to metrics, and <code>constrains</code> to enforce policies on metrics or entities. </p><p>Versioning ensures that any <strong>historical query</strong> can reproduce exactly the state of definitions at the time, allowing meaning, not just data, to become first-class infrastructure.</p><h3>From Intent to SQL</h3><p>The CDI allows queries to originate from <strong>intent</strong> rather than pre-written SQL. </p><p>For example, when a user asks, &#8220;<em>Net recurring revenue growth for mid-market customers last quarter</em>,&#8221; the CDI first resolves the<strong> metric definitions</strong> to determine what qualifies as recurring revenue and which invoice types are included. </p><p>It then applies <strong>segmentation rules</strong> to identify which accounts historically fell into the mid-market bracket, and finally enforces temporal validity, accounting for the FX rates, accounting policies, or definitions in effect last quarter. </p><p>Only after these steps does SQL emerge as an <strong>execution artifact</strong>; it is bytecode, not the source of truth. 
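</p><p>As a rough illustration of the resolution order described above (metric definition first, then segmentation, then temporal validity, and only then SQL), here is a minimal Python sketch. Everything in it is hypothetical and invented for illustration: the names <code>MetricVersion</code>, <code>REGISTRY</code>, <code>SEGMENTS</code>, <code>resolve_metric</code>, and <code>compile_intent</code>, the <code>invoices</code> table with its <code>invoice_type</code>, <code>discount</code>, and <code>customer_size</code> columns, and the two example definitions with their adoption dates. It also simplifies segmentation to a static rule, whereas the text calls for historically accurate segment membership.</p>

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MetricVersion:
    """One versioned definition of a metric (hypothetical schema)."""
    name: str
    expression: str    # SQL aggregate expression
    predicate: str     # SQL filter that encodes the business rule
    known_from: date   # knowledge time: when this definition was adopted


# Hypothetical registry: two successive definitions of the same metric.
REGISTRY = [
    MetricVersion("net_recurring_revenue", "SUM(amount)",
                  "invoice_type = 'subscription'", date(2025, 1, 1)),
    MetricVersion("net_recurring_revenue", "SUM(amount - discount)",
                  "invoice_type IN ('subscription', 'renewal')", date(2025, 10, 1)),
]

# Hypothetical segmentation rule (static here, for brevity).
SEGMENTS = {"mid_market": "customer_size BETWEEN 100 AND 999"}


def resolve_metric(name: str, as_of: date) -> MetricVersion:
    """Return the latest definition of `name` adopted on or before `as_of`."""
    candidates = [m for m in REGISTRY if m.name == name and as_of >= m.known_from]
    if not candidates:
        raise LookupError(f"no definition of {name!r} known as of {as_of}")
    return max(candidates, key=lambda m: m.known_from)


def compile_intent(metric: str, segment: str, start: date, end: date, as_of: date) -> str:
    """Resolve definitions first; emit SQL only as the final execution artifact."""
    version = resolve_metric(metric, as_of)
    return (
        f"SELECT {version.expression} AS {metric}\n"
        f"FROM invoices\n"
        f"WHERE {version.predicate}\n"
        f"  AND {SEGMENTS[segment]}\n"
        f"  AND event_ts BETWEEN DATE '{start.isoformat()}' AND DATE '{end.isoformat()}'"
    )


if __name__ == "__main__":
    # Same question, two knowledge times: the emitted SQL differs.
    for as_of in (date(2025, 9, 30), date(2026, 1, 1)):
        print(f"-- as of {as_of}")
        print(compile_intent("net_recurring_revenue", "mid_market",
                             date(2025, 7, 1), date(2025, 9, 30), as_of))
```

<p>In this sketch, asking the same question with a different <code>as_of</code> date produces different SQL, which is one way to read the claim that SQL is an artifact rather than the source of truth.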
</p><p>Humans and AI alike can rely on the output without ever inspecting the SQL itself.</p><div><hr></div><h2>Governance without friction</h2><p>In the CDI, <strong>access control </strong>is not merely a filter: it is part of meaning. </p><p>Who can view a metric shapes how it is interpreted. </p><p>A <strong>revenue metric</strong> visible only to finance but hidden from sales carries different implications than the same metric shared across departments. </p><p>Encoding access rules as first-class semantic objects ensures that <strong>semantics, security, and intent converge</strong>, maintaining both trust and compliance across the organization.</p><h3>Lineage as Explanation</h3><p>Lineage is not just a directed acyclic graph. </p><p>In the CDI, it becomes a <strong>narrative</strong>:<em> &#8220;This number exists because we aggregated completed invoices from these entities, applied FX rates from these dates, and excluded refunded transactions per the finance policy of Q4 2025.&#8221; </em></p><p>Every metric carries its reasoning along with its formula, making lineage an executable, auditable, and <strong>machine-readable explanation </strong>that preserves context for both humans and AI.</p><div><hr></div><h2>Failure Modes</h2><p>Not all meaning can or should be encoded. Over-formalization risks rigidity: overly prescriptive rules can stifle <strong>decision-making</strong> and produce brittle systems. </p><p>The CDI must balance <strong>formalization </strong>with flexibility, allowing analysts to override or annotate interpretations when context demands. 
</p><p>Judgment, intuition, and situational awareness remain human responsibilities.</p><h3>Organizational Truth</h3><p>The CDI mirrors the organization. When teams disagree on <strong>definitions</strong>, or when policies evolve differently across departments, the model will fracture visibly. </p><p>That visibility is not a flaw; it is a <strong>feature</strong>. </p><p>Analytics systems should reflect organizational <strong>complexity </strong>rather than pretending it does not exist, ensuring that inconsistencies become actionable insights rather than silent errors.</p><div><hr></div><h2>After the Semantic Layer</h2><p>The semantic layer was truly <strong>inevitable</strong>. </p><p>Centralizing logic, metrics, and definitions was necessary to establish consistency and trust. But it was <strong>never the destination. </strong></p><p>The next bottleneck is not scale, nor tooling, but <strong>meaning</strong> itself. Until analytics systems learn to represent what organizations <em>believe</em> (their assumptions, <strong>interpretations</strong>, and evolving definitions), they will remain fragile.</p><p>The Cognitive Data Interface is not a product. It is a recognition: <strong>data was never the hard part</strong>. 
</p><p>The true challenge lies in capturing, versioning, and operationalizing <strong>organizational knowledge</strong>, ensuring that humans and machines alike can reason about the past, present, and future with confidence.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.thesoftwarefrontier.com/p/beyond-the-semantic-layer?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.thesoftwarefrontier.com/p/beyond-the-semantic-layer?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[If the Universe Computes, What Can It Not Compute? ]]></title><description><![CDATA[G&#246;del, Turing, Quantum Chaos, and the Ultimate Algorithmic Limits of Reality]]></description><link>https://www.thesoftwarefrontier.com/p/if-the-universe-computes-what-can</link><guid isPermaLink="false">https://www.thesoftwarefrontier.com/p/if-the-universe-computes-what-can</guid><dc:creator><![CDATA[Lorenzo Bradanini]]></dc:creator><pubDate>Thu, 08 Jan 2026 21:36:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!OC44!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dd1244c-6d13-4e7f-b23d-72a96f702376_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OC44!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dd1244c-6d13-4e7f-b23d-72a96f702376_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!OC44!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dd1244c-6d13-4e7f-b23d-72a96f702376_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!OC44!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dd1244c-6d13-4e7f-b23d-72a96f702376_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!OC44!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dd1244c-6d13-4e7f-b23d-72a96f702376_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!OC44!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dd1244c-6d13-4e7f-b23d-72a96f702376_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OC44!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dd1244c-6d13-4e7f-b23d-72a96f702376_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9dd1244c-6d13-4e7f-b23d-72a96f702376_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3415248,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://softwarefrontier.substack.com/i/181689372?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dd1244c-6d13-4e7f-b23d-72a96f702376_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OC44!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dd1244c-6d13-4e7f-b23d-72a96f702376_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!OC44!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dd1244c-6d13-4e7f-b23d-72a96f702376_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!OC44!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dd1244c-6d13-4e7f-b23d-72a96f702376_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!OC44!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dd1244c-6d13-4e7f-b23d-72a96f702376_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p></p><h2>The universe whispers in code</h2><p>The universe has a <strong>rhythm </strong>that humans barely perceive.  </p><p>Everywhere I looked, I saw <strong>pattern</strong>, <strong>law</strong>, and <strong>constraint</strong>: not metaphorical, not anthropocentric, but structural and inescapable.</p><p>For centuries, <strong>humanity </strong>has been intoxicated by the same dream: that this lawfulness is totally predictable. </p><p><strong>Laplace</strong>, in 1814, imagined an intellect that, knowing all positions and velocities of particles, could compute the entire past and future. </p><p>Ignorance was a problem of <strong>calculation</strong>, not principle. Newtonian mechanics, with its continuous trajectories and deterministic equations, seemed to guarantee this possibility.</p><p>Yet, even before <strong>quantum mechanics</strong>, subtle cracks appeared. Henri Poincar&#233;, studying the three-body problem in celestial mechanics, discovered <strong>chaotic sensitivity</strong>: tiny variations in initial conditions could lead to wildly diverging outcomes. </p><p><strong>Determinism </strong>remained, but predictability faltered. Later, the discovery of strange attractors and turbulence in fluid dynamics hinted that lawfulness does not guarantee <strong>understandable evolution</strong>.</p><p>The real breakthrough, however, came with <strong>computation</strong>. The mid-20th century revealed that predictability might have a formal ceiling. </p><p>The universe, it seemed, might compute: but it could also refuse to compute answers for us. 
</p><p>This is the second part of a two-part journey exploring the intersection of <strong>physics </strong>and <strong>computation</strong>: two fields that have fascinated me for years. </p><p>I hope that reading it gives you the same sense of wonder, <strong>curiosity</strong>, and awe that I felt while writing and researching it, and perhaps even inspires you to look at the universe as a vast, <strong>subtle computation</strong> in motion.</p><div><hr></div><h2>G&#246;del, Turing, and the genesis of limits</h2><p>In 1931, <strong>Kurt G&#246;del</strong> demonstrated that formal systems of arithmetic contain <strong>intrinsic limits</strong>. </p><p>His incompleteness theorems showed that in <strong>any consistent, effectively axiomatized system</strong> capable of expressing basic arithmetic, there are true statements that cannot be proved within that system. </p><p>For the first time in its <strong>millennia-long history</strong>, mathematics, long seen as a sanctuary of certainty, was revealed to harbor fundamental ignorance.</p><p>A few years later, <strong>Alan Turing</strong> formalized the concept of computation. His Turing machine, essentially a read/write head scanning an infinite tape under a finite set of control rules, provided a model of <strong>mechanical reasoning</strong>. </p><p>Turing proved that the <strong>Halting Problem</strong> is undecidable: no algorithm exists that can determine, for every program and input, whether the program halts or runs forever. 
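</p><p>The argument behind this result fits in a few lines. As a sketch, assume a hypothetical decider <strong>H</strong> and a program <strong>D</strong> built from it (both names are illustrative, not Turing&#8217;s own notation):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{equation}\nH(P, x) = 1 \\text{ if } P \\text{ halts on } x, \\quad H(P, x) = 0 \\text{ otherwise}; \\qquad D(P) \\text{ halts} \\iff H(P, P) = 0\n\\end{equation}\n&quot;,&quot;id&quot;:&quot;RMXKQTZLDW&quot;}" data-component-name="LatexBlockToDOM"></div><p>Running <strong>D</strong> on its own description gives: D(D) halts exactly when H(D, D) = 0, that is, exactly when D(D) does not halt. The contradiction means that no such <strong>H</strong> can exist. </p><p>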
Some truths are <strong>logically uncomputable</strong>, no matter the resources or ingenuity of the observer.</p><p>At first, these were abstract results, inhabiting the realm of pure mathematics. But physical systems can implement computation. </p><p><strong>Von Neumann&#8217;s</strong> self-replicating cellular automata, Fredkin and Toffoli&#8217;s billiard-ball models, and contemporary quantum spin chains show that the laws of physics can encode <strong>universal computation</strong>. </p><p>Once embedded in reality, undecidability migrates from the abstract to the tangible: physical questions about system evolution inherit <strong>computational limits</strong>.</p><p>Consider a classical billiard-ball system arranged to implement a<strong> Turing machine.</strong> Each collision represents a logic operation; the trajectory of balls encodes tape symbols. </p><p>The system obeys <strong>deterministic </strong>Newtonian mechanics. Yet, ask whether a particular ball will ever reach a given location, and you face <strong>undecidability</strong>. </p><p>The universe has <strong>obeyed the laws</strong> faithfully, but no observer can compute the outcome faster than watching the system evolve step by step.</p><p>This is the subtle shift: the universe is lawful, deterministic, and continuous, yet <strong>it may refuse to reveal some answers</strong>. </p><p>Undecidability, once a property of formal systems, is now <strong>woven into the fabric of reality</strong>.</p><div><hr></div><h2>Chaos: lawful but unpredictable</h2><p>Chaos is seductive because it masquerades as pure <strong>randomness</strong>. </p><p>In the early 20th century, Henri Poincar&#233;&#8217;s work on the<strong> three-body problem</strong> revealed a profound truth: deterministic systems could behave in ways that defied prediction. </p><p>A slight perturbation in initial positions (<em>a millimeter, a fraction of a gram</em>) could radically alter long-term trajectories. 
This is the essence of <strong>sensitive dependence on initial conditions</strong>, the hallmark of chaos.</p><p>Yet chaos is a <strong>partial limitation</strong>. It is epistemic: our inability to measure initial conditions with infinite precision prevents practical prediction. </p><p>If a <strong>Laplacean intellect</strong> could know the positions and velocities of every particle to infinite accuracy, the unpredictability would vanish, leaving pure determinism.</p><p>But deeper than chaos is the concept of <strong>computational irreducibility</strong>, formalized in the late 20th century by Stephen Wolfram. A system is computationally irreducible if <strong>no shortcut exists</strong> to predicting its state at time <strong>t</strong>. </p><p>The only way to know the outcome is to let the system unfold, step by step. This principle applies not only to simple<strong> cellular automata</strong> but also to complex natural systems: turbulent fluids, interacting planetary systems, or quantum spin chains.</p><p>Consider a turbulent fluid described by the<strong> Navier&#8211;Stokes </strong>equations:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{equation}\n\\frac{\\partial \\mathbf{u}}{\\partial t} + (\\mathbf{u} \\cdot \\nabla) \\mathbf{u} = -\\frac{1}{\\rho} \\nabla p + \\nu \\nabla^2 \\mathbf{u} + \\mathbf{f}\n\\end{equation}\n&quot;,&quot;id&quot;:&quot;HZJPIRYRBK&quot;}" data-component-name="LatexBlockToDOM"></div><p>Although the equations are deterministic, predicting the velocity field <strong>u(x,t) </strong>at a future time is widely thought to be computationally irreducible. </p><p>Tiny uncertainties amplify exponentially due to nonlinear terms. Approximations are possible, but <strong>exact outcomes remain inaccessible</strong>, a physical analogue of the limits exposed by the Halting Problem.</p><p>This insight is critical: <strong>lawful systems</strong> may be inherently opaque. </p><p>Determinism is not synonymous with predictability. 
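</p><p>One way to make this gap concrete is the standard estimate for a chaotic system. As a sketch, assume an initial uncertainty &#948;&#8320;, a largest Lyapunov exponent &#955;, and a tolerated error &#916; (symbols introduced here for illustration only):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{equation}\n\\delta(t) \\approx \\delta_0 \\, e^{\\lambda t} \\quad \\Longrightarrow \\quad t_\\mathrm{pred} \\approx \\frac{1}{\\lambda} \\ln \\frac{\\Delta}{\\delta_0}\n\\end{equation}\n&quot;,&quot;id&quot;:&quot;PLYCNVHEAO&quot;}" data-component-name="LatexBlockToDOM"></div><p>Improving the initial measurement by a factor of a million extends the prediction horizon by only ln(10^6)/&#955; &#8776; 13.8/&#955;. The law is exact, yet each extra stretch of foresight costs exponentially more precision. </p><p>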
The universe may compute, but some of its outputs are <strong>forever hidden</strong> from any observer, not by chance but by computational structure itself.</p><div><hr></div><h2>Nature as a computational engine</h2><p>Quantum mechanics takes unpredictability to a new level. In the classical world, determinism reigns, and randomness is epistemic. </p><p>In the quantum world, randomness is intrinsic. The evolution of a closed system follows the <strong>Schr&#246;dinger equation</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{equation}\ni \\hbar \\frac{\\partial}{\\partial t} \\lvert \\psi(t) \\rangle = \\hat{H} \\lvert \\psi(t) \\rangle\n\\end{equation}\n&quot;,&quot;id&quot;:&quot;LDTTOZLTMU&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>where <strong>&#8739;&#968;(t)&#10217;</strong> is the system&#8217;s wavefunction and &#292; is its <strong>Hamiltonian</strong>. Unitary evolution is deterministic, yet measurement introduces probabilistic collapse. </p><p>Superposition allows a quantum system to &#8220;<em>exist</em>&#8221; in many states simultaneously, and <strong>interference</strong> lets amplitudes reinforce or cancel, giving quantum computation its unique power.</p><p>Richard Feynman, in the 1980s, argued that <strong>classical computers cannot, in general, efficiently simulate quantum systems</strong>. A system of N qubits has a state space of 2^N dimensions. </p><p>Explicit simulation on a classical Turing machine scales exponentially in N, quickly becoming infeasible. 
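</p><p>A rough estimate shows how quickly this happens. As a sketch, assume each complex amplitude is stored in 16 bytes (two double-precision floats, an assumption about the simulator rather than anything fundamental), and write M(N) for the memory needed to hold the full state vector:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{equation}\nM(N) = 16 \\cdot 2^{N} \\text{ bytes}, \\qquad M(30) = 2^{34} \\text{ bytes} = 16 \\text{ GiB}, \\qquad M(50) = 2^{54} \\text{ bytes} \\approx 1.8 \\times 10^{16} \\text{ bytes}\n\\end{equation}\n&quot;,&quot;id&quot;:&quot;BQSTVMEMGJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Thirty qubits fit in the memory of a well-equipped workstation; fifty already demand about 16 PiB for the state vector alone, and every additional qubit doubles the cost. </p><p>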
Feynman&#8217;s insight birthed the idea of <strong>quantum computers</strong>: devices exploiting quantum law to compute naturally.</p><p>Two canonical examples illustrate quantum advantage:</p><ol><li><p><strong>Shor&#8217;s Algorithm (1994):</strong> No known classical algorithm factors a large integer N in time polynomial in its number of digits; the best known methods scale super-polynomially. </p><p>Shor showed a quantum algorithm can factor in <strong>polynomial time</strong>, exploiting the <strong>quantum Fourier transform</strong> to extract periodicity from superposition:</p></li></ol><blockquote><p>Time complexity: <strong>O((log N)^3)</strong></p></blockquote><p>This has profound implications: cryptography, algorithmic efficiency, and the realization that physical law itself can accelerate computation.</p><ol start="2"><li><p><strong>Grover&#8217;s Algorithm (1996):</strong> Searching an unstructured database of size N classically requires O(N) queries. Grover demonstrated a quantum search algorithm requiring only O(&#8730;N) queries: a <strong>quadratic speedup</strong> achievable via amplitude amplification and interference.</p></li></ol><p>Yet quantum mechanics does <strong>not break the Church&#8211;Turing limit</strong>. It does not enable hypercomputation; it cannot solve the Halting Problem or decide arbitrary G&#246;delian statements. It reshapes <strong>complexity</strong>, not computability.</p><p>The universe is a <strong>computational engine</strong>. Superposition encodes parallel states, interference cancels the amplitudes of wrong answers, and entanglement links distant subsystems nonlocally. The physical laws themselves are a kind of <strong>hardware</strong> optimized for certain classes of computation.</p><h3>Feynman, Deutsch, and the Church&#8211;Turing thesis</h3><p>David Deutsch, in the 1980s, proposed the <strong>Physical Church&#8211;Turing Thesis</strong>: every function realizable by physical law can, in principle, be computed by a universal Turing machine. 
</p><p><strong>Quantum mechanics</strong> challenges this thesis subtly. While quantum computers can accelerate some computations, they respect the fundamental boundary of Turing computability. </p><p>Yet they also illuminate a <strong>deeper truth</strong>: complexity is physically grounded. Some problems are intractable not by lack of ingenuity but because <strong>physics enforces resource constraints</strong>: time, energy, entanglement, and coherence.</p><div><hr></div><h2>The Lure and Limits of analog computation</h2><p>In the 1980s and 1990s, computer scientists and mathematicians began to ask: what if the universe could compute <strong>beyond discrete machines</strong>? </p><p>The continuous nature of physics suggested tantalizing possibilities. Consider a pendulum, a flowing fluid, or the precise trajectories of planets: these are <strong>analog systems</strong> governed by continuous variables. </p><blockquote><p><em>Could a clever arrangement exploit this continuity to solve problems a Turing machine cannot?</em></p></blockquote><p>Mathematically, this is plausible. Blum, Shub, and Smale formalized <strong>real-number computation</strong>: a model where machines operate over R rather than discrete symbols. </p><p>In principle, a single real number with an <strong>infinite decimal expansion</strong> could encode the solution to the Halting Problem or other undecidable statements. 
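</p><p>A standard construction makes the idea explicit. As a sketch, fix an enumeration of all Turing machines (the symbols M_n, h_n, and r are illustrative):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{equation}\nr = \\sum_{n=1}^{\\infty} h_n \\, 2^{-n}, \\qquad h_n = 1 \\text{ if } M_n \\text{ halts on empty input}, \\quad h_n = 0 \\text{ otherwise}\n\\end{equation}\n&quot;,&quot;id&quot;:&quot;WHNRLEKUZF&quot;}" data-component-name="LatexBlockToDOM"></div><p>Reading the n-th binary digit of r would decide whether the n-th machine halts, so r is a well-defined real number whose digits no algorithm can compute. 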
</p><p>Differential equations, with infinite precision initial conditions, could act as <strong>hypercomputers</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{equation}\n\\frac{dx}{dt} = f(x)\n\\end{equation}\n&quot;,&quot;id&quot;:&quot;JVEUOQQYCD&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>If <strong>x(0) </strong>encodes an infinite sequence representing a computation, the trajectory <strong>x(t) </strong>could &#8220;<em>output</em>&#8221; solutions unreachable by any finite Turing machine.</p><p>Yet nature pushes back. The Heisenberg <strong>uncertainty principle</strong> forbids exact simultaneous knowledge of conjugate variables:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{equation}\n\\Delta x \\, \\Delta p \\geq \\frac{\\hbar}{2}\n\\end{equation}\n&quot;,&quot;id&quot;:&quot;VMPQMCTRYW&quot;}" data-component-name="LatexBlockToDOM"></div><p>No physical system can manifest truly infinite precision. </p><p>Thermodynamic limits impose bounds on information density: a finite-energy system can store only finitely many bits. Decoherence, noise, and thermal fluctuations amplify tiny errors, making infinite precision <strong>operationally impossible</strong>.</p><p>The tantalizing promise of hypercomputation exists in equations and thought experiments, but <strong>nature enforces hard ceilings</strong>. </p><p>Real numbers may exist in the abstract, but no device or system can access all their digits. </p><p>Even a flowing fluid or an analog circuit obeys constraints that <strong>limit the universe to Turing-computable processes</strong> in practice.</p><div><hr></div><h2>Undecidability emerges in physical systems</h2><p>Even without hypercomputation, undecidability appears naturally in physical systems capable of <strong>universal computation</strong>. 
</p><p>Consider a lattice of interacting spins, such as a<strong> quantum spin chain</strong>. Each spin can interact with its neighbors according to a Hamiltonian H. Questions like:</p><ul><li><p><em>Will a given spin reach a particular state at a future time?</em></p></li><li><p><em>Will the system settle into a stable configuration?</em></p></li></ul><p>can be <strong>provably undecidable</strong>. The rules are simple; the system is local and deterministic. Yet the global behavior can encode a <strong>Turing machine. </strong>Predicting its evolution is then equivalent to solving the Halting Problem.</p><p>Similarly, classical cellular automata, like <strong>Conway&#8217;s Game of Life</strong>, can simulate universal computation. The question </p><blockquote><p><em>&#8220;Will this configuration ever produce a glider?&#8221; </em></p></blockquote><p>is undecidable in general. Nature, by embedding computation, can <strong>hide answers from any observer</strong>, not due to measurement limitations, but because <strong>no algorithm exists that can predict them</strong>.</p><p>Even chaotic classical systems can exhibit computational irreducibility. </p><p>Turbulent flows, multi-body gravitational systems, and <strong>non-equilibrium thermodynamics</strong> generate trajectories that are lawful yet opaque: predicting their future requires simulating each step, with no shortcut.</p><p>This is where the physical Church&#8211;Turing thesis gains traction: the <strong>universe may compute, but computation itself is bounded by physical law</strong>. </p><p>Some processes are fundamentally inaccessible, not because of chaos or complexity, but because <strong>computational structure forbids shortcut solutions</strong>.</p><p>A thought experiment clarifies this: imagine a marble rolling in a complex, frictionless landscape designed to encode a Turing machine. 
</p><p>Predicting which exit the marble takes is equivalent to asking whether the simulated machine halts. </p><p>Deterministic laws govern the marble, yet <strong>no observer can compute the outcome in advance</strong>. This is nature embedding <strong>undecidability in the real world</strong>.</p><div><hr></div><h2>Black Holes, entropy, and the geometry of computation</h2><p>Black holes are the ultimate laboratories for exploring the boundaries of computation. </p><p>Their defining characteristic is simplicity: mass, charge, and angular momentum. Yet within their event horizons, complexity reaches its extremes. The <strong>Bekenstein-Hawking entropy formula</strong> gives the maximum information content a black hole can hold:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{equation}\nS_\\mathrm{BH} = \\frac{k c^3 A}{4 G \\hbar}\n\\end{equation}\n&quot;,&quot;id&quot;:&quot;UKJENSBFGE&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, A is the area of the event horizon, not the volume of the black hole. This is startling: the <strong>informational capacity of a region of space scales with its surface area</strong>, not its volume. </p><p>The universe seems to be telling us that <strong>computation is fundamentally geometric</strong>. There is a ceiling to how much information can exist in any finite region.</p><p>The &#8220;<strong>fast-scrambling conjecture</strong>&#8221; further illuminates these limits. It suggests that black holes mix information faster than almost any other physical system:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{equation}\nt_\\mathrm{scramble} \\sim \\frac{\\beta}{2\\pi} \\ln N\n\\end{equation}\n&quot;,&quot;id&quot;:&quot;ACIQFOPJKF&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>where <strong>N </strong>is the number of degrees of freedom and <strong>&#946; </strong>the inverse temperature. 
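</p><p>For a sense of scale, here is a rough estimate for a black hole of one solar mass M. As a sketch, assume SI units are restored with a factor of &#295;, the Hawking temperature is used for &#946;, and N is taken to be of order the entropy in units of k (about 10^77 for one solar mass); all three are assumptions layered on the formula above:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{equation}\nt_\\mathrm{scramble} \\sim \\frac{\\hbar \\beta}{2\\pi} \\ln N = \\frac{4 G M}{c^3} \\ln N \\approx (2 \\times 10^{-5} \\, \\mathrm{s}) \\times 177 \\approx 3.5 \\times 10^{-3} \\, \\mathrm{s}\n\\end{equation}\n&quot;,&quot;id&quot;:&quot;TSCRMBXEYO&quot;}" data-component-name="LatexBlockToDOM"></div><p>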
Information becomes <strong>effectively irretrievable</strong> almost immediately. </p><p>Even if a black hole perfectly encodes a computation, the <strong>spacetime fabric itself enforces limits on what can be extracted</strong>, demonstrating that computation is <strong>constrained by physical law</strong> at the most extreme scales.</p><p>These insights are reinforced by the <strong>holographic principle</strong>. The <strong>AdS/CFT </strong>correspondence suggests that the dynamics of an entire volume of space can be represented by degrees of freedom on its boundary. </p><p>In other words, the <strong>bulk of spacetime is computationally reducible to its &#8220;skin&#8221;</strong>. Space itself imposes a limit on computation: the deeper the region, the more entangled and inaccessible the information becomes.</p><p>Black holes thus teach a profound lesson: computation is <strong>not just abstract math</strong>. It is a <strong>physical process constrained by energy, curvature, and causality</strong>. </p><p>Undecidability, irreducibility, and operational limits are <strong>written into the geometry of the universe itself</strong>.</p><div><hr></div><h2>Predictive science in a bounded universe</h2><p>The implications for science are pretty radical. </p><p>Classical science assumes that given enough data and clever modeling, phenomena are predictable. But if the universe enforces computational boundaries, some predictions are <strong>provably impossible</strong>. </p><p>Deterministic laws may exist, yet outcomes may remain forever hidden:</p><ul><li><p><em>Turbulent fluids evolve in ways that cannot be shortcut.</em></p></li><li><p><em>Quantum many-body systems produce correlations that defy tractable simulation.</em></p></li><li><p><em>Black hole interiors encode information that is physically irretrievable.</em></p></li></ul><p>Science must shift from prediction to <strong>constraint analysis</strong>. 
Rather than asking &#8220;<strong>What will happen exactly?</strong>&#8221;, we ask:</p><ul><li><p>Which behaviors are <strong>typical</strong>?</p></li><li><p>Which invariants are <strong>structurally enforced</strong> by physical law?</p></li><li><p>How does <strong>computational complexity</strong> shape observable outcomes?</p></li></ul><p>This shift is not academic. It is already implicit in statistical mechanics, quantum information theory, and cosmology. Systems are described in terms of <strong>entropy, decoherence, and information flow</strong>, rather than exact trajectories. </p><p>Complexity is a <strong>physical quantity</strong>, bounded by energy, entropy, and the geometry of spacetime.</p><p>Even our most ambitious intellectual projects, the search for a <strong>Theory of Everything</strong>, face a new framing. </p><p>Equations may exist that perfectly describe reality, but <strong>solving them may be impossible</strong>, not practically, but in principle. The universe is lawful yet <strong>provably unknowable in regions</strong>. </p><p>Some truths are forever beyond computation, not due to chaos, but because <strong>physical law forbids the processes needed to extract them</strong>.</p><div><hr></div><h2>The ontology of ignorance</h2><p>The deepest realization of this journey is ontological: <strong>ignorance is embedded in reality itself</strong>. </p><p>It is not merely a gap in human knowledge or a technical limitation of computers; it is a feature of the universe. The <strong>undecidability </strong>that arises in formal systems manifests physically. </p><p>Computational irreducibility, chaos, and quantum uncertainty are not quirks, they are <strong>laws of nature</strong>.</p><p>Consider again the quantum spin chain or a cellular automaton encoding a Turing machine. Each step is lawful, deterministic in its rules, yet predicting a specific outcome may be impossible. 
</p><p>The universe computes, yet some computations <strong>cannot be shortcut, cannot be solved faster than the system unfolds</strong>. Nature is selective in what it reveals. Some patterns remain forever inaccessible.</p><p>Even black holes illustrate this principle. The information within the horizon is maximally scrambled, beyond any retrieval process allowed by the laws of physics. </p><p>The <strong>geometry of spacetime itself enforces limits</strong>, making some computations permanently opaque. Entropy, causality, and energy constraints combine to form a <strong>ceiling on possible knowledge</strong>.</p><p>This shifts our understanding of ignorance. It is <strong>ontological</strong>, not merely epistemic. Some truths are unknowable <strong>in principle</strong>, no matter how clever, patient, or well-equipped the observer. </p><p>This is a <strong>radical departure</strong> from centuries of scientific optimism, which assumed that lawfulness implied predictability.</p><div><hr></div><h2>Understanding in the Age of Computational Limits</h2><p>What remains is to ask: <strong>what does it mean to understand the universe in a world of intrinsic limits?</strong></p><p>Understanding can no longer be equated with prediction. 
Instead, it becomes:</p><ul><li><p><strong>Constraint recognition:</strong> identifying what can and cannot occur, what is physically allowed or forbidden.</p></li><li><p><strong>Structural insight:</strong> discovering invariants, symmetries, and universal behaviors that emerge despite unpredictability.</p></li><li><p><strong>Complexity appreciation:</strong> understanding that some outcomes require simulation in full, step by step, with no shortcuts.</p></li></ul><p>In other words, science becomes less about computing exact futures and more about <strong>mapping the landscape of possibility</strong>, recognizing <strong>the boundaries imposed by physics and computation</strong>.</p><p>This perspective unites G&#246;del, Turing, chaos theory, quantum mechanics, and black holes: each shows a <strong>fundamental limitation on what can be known, computed, or predicted</strong>. </p><p>Yet these limits are themselves law-like. They are constraints that the universe imposes, shaping the very nature of reality.</p><p>The universe computes, yes; but it does so under <strong>strict algorithmic rules</strong>, bounded by energy, entropy, and geometry. Some computations are accessible, many are tractable, and others are forever closed. </p><p>Prediction, in its absolute form, may be impossible. Knowledge, in its classical sense, is <strong>provably bounded</strong>.</p><p>This is both unsettling and liberating. Uncertainty, irreducibility, and incompleteness are not failures of human intelligence; they are <strong>features of the cosmos</strong>. </p><p>They are signals that reality is richer, deeper, and more intricate than any algorithm we could ever encode. 
</p><p>And in acknowledging these boundaries, we begin to understand not just what the universe is, but <strong>how it allows us to know it, and where it insists on remaining hidden</strong>.</p><div><hr></div><h2>Closing reflections</h2><p>From <strong>G&#246;del&#8217;s theorems</strong> to Turing machines, from chaotic fluids to quantum entanglement, and from analog computation to black hole entropy, the journey reveals a consistent lesson: the universe is a computation <strong>with ceilings</strong>. </p><p>Not all computations are allowed, not all outcomes are <strong>predictable</strong>, and some truths are fundamentally unreachable.</p><p>Physics and computation converge into a single insight: <strong>the structure of law imposes the limits of knowledge</strong>. Some questions will forever remain unanswerable; not because we lack cleverness, resources, or patience, but because <strong>nature itself forbids the processes needed to resolve them</strong>.</p><p>The universe whispers in code, and we have learned to listen. But even the best listening will sometimes meet silence. </p><p>That silence is not chaos; it is law. </p><p>And therein lies the ultimate lesson: reality is <strong>provably unknowable in places</strong>, and that unknowability is part of its deepest structure.</p>]]></content:encoded></item></channel></rss>