How Systems Really Fail, Part III
The control problem: why the loop you close cannot stabilise the system you have, why every remediation has a half-life, and what it means to act on a system whose state you cannot see.
Intro
The first essay argued that distributed systems fail in the spaces between their components, and that those spaces are structurally opaque. The second argued that the system you observe is not the system that exists, that aggregation destroys signal, and that the operator’s dashboard is a delayed, partial, instrumented projection of a system that has already moved on.
This one is about what happens next.
Once you have accepted that the system is opaque and that your view of it is incomplete, you still have to act. The pager has gone off. The error rate has climbed from a green 0.02% to a red 12%. Customers are tweeting. Your manager is on the call.
The runbook has three pages and none of them describe this. You have ninety seconds before the next escalation tier joins, and you have to decide whether to roll back, fail over, shed load, drain a region, or do nothing and let the system find its own equilibrium.
This essay is about that decision. Not the politics of incident response, not the cultural question of blameless post-mortems, but the structural problem underneath: you are closing a control loop on a system whose state you cannot fully observe, whose composition you do not fully control, and whose response to your inputs is, in the regime where you most need to act, nonlinear, delayed, and frequently the opposite of what you expected.
Classical control theory has names for all of this. The combination is called control under uncertainty, and the bounds it places on what an operator can achieve are not soft. They are mathematical.
They are the reason a competent on-call engineer with a complete runbook and a working dashboard can still make an outage worse, not through error, but by executing the textbook intervention against a system that has, by the time the intervention lands, already entered a regime where the textbook does not apply.
Three incidents, mechanically reconstructed: Knight Capital’s forty-five-minute, four-hundred-and-forty-million-dollar loss in 2012, where a human control loop sampling at the speed of decision could not stabilise a software loop running at the speed of order entry; the Facebook BGP withdrawal of October 2021, where the control plane that needed to repair the network had been routed through the network it had just withdrawn from; and the AWS Kinesis outage of November 2020, where the remediation that would have ended the failure could not proceed because it depended on the very subsystem the failure had taken down.
The pattern beneath all three is the same. The system entered a regime in which the available control inputs were either too slow, structurally unable to reach the failing component, or themselves dependent on the failure being already fixed.
The operators were not negligent. They were operating inside a loop whose closure conditions had been silently violated, and the loop did what control loops do when their closure conditions fail: it stopped controlling.
The interesting question, again, is why this is structural.
The control loop, mechanically
A control loop is a four-stage cycle: a plant whose state evolves over time, a sensor producing measurements, a controller computing corrective inputs against a setpoint, and an actuator applying those inputs.
The plant’s new state is measured, and the cycle repeats.
In a textbook loop, all four stages are coupled tightly enough that the system can be analysed as a single dynamical system. The classical results (Nyquist stability, Bode gain and phase margins, Lyapunov functions, Kalman observers) assume this coupling.
Given a plant with known dynamics, a sensor with bounded noise, a controller with a known transfer function, and an actuator with bounded authority, they tell you whether the closed loop is stable and how it responds to disturbances.
A production distributed system violates every one of these assumptions.
The plant is not one system; it is a composition of subsystems each with their own dynamics, coupled through interfaces that hide most of the relevant state.
The sensor is the observability pipeline of Part II, with tens of seconds of phase lag and aggregation that destroys precisely the signal the controller needs.
The controller is split across at least three actors at different sampling rates: automated systems (autoscalers, load balancers, schedulers) running at the speed of metric collection; on-call humans running at the speed of cognition under stress; incident commanders running slower still.
The actuator is whatever combination of API calls, configuration pushes, deployment rollbacks, and SSH sessions the operator can bring to bear, each with its own latency, blast radius, and probability of producing the opposite of what was intended.
The result is a control loop whose stability margin is set by the slowest, noisiest, most delayed component in the chain. In the steady state, this is fine: the loop has plenty of margin and corrections are small. In an incident, the margin evaporates, and the loop’s behaviour is determined by parts of the system no one had thought to characterise.
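The margin claim can be made concrete with the textbook treatment of a pure delay, offered here as a sketch rather than a model of any real system. A delay of $\tau$ seconds anywhere in the loop leaves the loop gain untouched but adds phase lag proportional to frequency. The symbols $\omega_c$ (the gain-crossover frequency, in radians per second) and $\phi_0$ (the phase margin the loop would have with no delay) are introduced for illustration and appear nowhere else in this essay.

$$
\phi_m = \phi_0 - \omega_c \tau, \qquad \tau_{\max} = \frac{\phi_0}{\omega_c}
$$

Delays add along the chain, so $\tau$ is the sum of sensing, decision, and actuation latency. Once that sum exceeds $\tau_{\max}$ the margin is negative, and the closed loop is unstable however well the controller is tuned.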
Control theory has precise names for the two properties at stake. The formal definitions are Kalman’s (1960): a system is observable if its internal state can be reconstructed from a finite history of outputs, and controllable if any state can be reached from any other state in finite time by an admissible input sequence.
In a healthy production system, both hold approximately. In an incident, one or both fail. Observability fails when the failure mode is invisible to the metrics, as in Slack’s autoscaler chasing CPU while threads waited on a degraded network. Controllability fails when the action that would fix the problem is no longer reachable, as we are about to see in three different forms.
When both fail at once, the operator is, in the precise technical sense, no longer controlling the system. They are watching it. Interventions may correlate with eventual recovery, but the causal chain from action to outcome has been severed.
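For the linear time-invariant special case, Kalman’s two definitions reduce to rank tests. This is the textbook form, given only as a reference point: the matrices $A$, $B$, $C$ and the state dimension $n$ are not part of the argument above, and a production system is neither linear nor time-invariant.

$$
\dot{x} = Ax + Bu, \qquad y = Cx
$$

$$
\text{controllable} \iff \operatorname{rank}\begin{bmatrix} B & AB & \cdots & A^{n-1}B \end{bmatrix} = n
$$

$$
\text{observable} \iff \operatorname{rank}\begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix} = n
$$

Read operationally, a rank deficiency in the first matrix is a direction in state space that no input can push, and a deficiency in the second is a direction that no output reveals.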
When the human loop is too slow: Knight Capital, 1 August 2012
The canonical case for control-loop timescale mismatch is not a distributed-systems outage in the conventional sense.
It is a financial one, and the reason it belongs here is that it isolates, more cleanly than any web-scale incident, what happens when the human control loop runs orders of magnitude slower than the software loop it is supposed to govern.
Knight Capital Americas was, on the morning of 1 August 2012, the largest U.S. retail market-maker, with a market share of roughly 17% in NASDAQ-listed and 16% in NYSE-listed stocks. Its core function was posting bid and ask quotes on thousands of stocks, capturing the spread, and managing inventory.
The platform that did this was SMARS, the Smart Market Access Routing System, running for over a decade.
On 31 July, NYSE was about to launch the Retail Liquidity Program (RLP). Knight had updated SMARS to support RLP order types.
The update repurposed a flag that had previously activated Power Peg, a piece of code dormant since 2003: an old test algorithm, originally designed to buy high and sell low in order to exercise other trading algorithms in a controlled environment, that Knight had stopped using years earlier but never removed from production.
In 2005, a separate refactor had moved the cumulative-quantity counter (the routine that tracked how many shares of a parent order had been filled and was responsible for stopping further child orders once an order was complete) to an earlier point in the SMARS workflow.
The move disconnected the counter from Power Peg, and Knight never retested Power Peg afterwards. In the new RLP code, the flag’s meaning was repointed at the RLP handler. The deployment was rolled out manually to eight production servers between 27 July and 1 August. Seven of them received the new code. One did not.
At 9:30 AM Eastern, the U.S. equities market opened. Parent orders flowed into SMARS to be split into child orders and sent to the exchanges. On the seven correctly-deployed servers, child orders were generated, sent to NYSE, and matched against the RLP.
On the eighth, the repurposed flag was being set on incoming RLP-eligible orders, but the code interpreting it was still the old Power Peg algorithm, now without the cumulative-quantity counter that would have throttled it.
Each parent order on the eighth server generated child orders continuously, with no signal back from the fill-confirmation path to indicate the order had been satisfied.
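The mechanism can be illustrated with a toy order splitter. This is a hypothetical sketch, not a reconstruction of SMARS: the function name, the `counter_connected` switch, the assumption that every child order fills completely, and the `max_children` cap (which exists only so the broken case terminates) are all invented for illustration.

```python
def child_orders(parent_qty, child_qty, counter_connected, max_children=1000):
    """Yield child order sizes for a single parent order.

    Hypothetical model: every child is assumed to fill in full.
    """
    filled = 0  # cumulative quantity filled so far
    for _ in range(max_children):
        remaining = parent_qty - filled
        # The stop condition exists only when the counter is wired in.
        if counter_connected and remaining <= 0:
            return
        size = min(child_qty, remaining) if counter_connected else child_qty
        yield size
        filled += size


healthy = list(child_orders(1000, 100, counter_connected=True))
broken = list(child_orders(1000, 100, counter_connected=False))

print(len(healthy), sum(healthy))  # 10 1000
print(len(broken), sum(broken))    # 1000 100000
```

With the counter connected, ten children satisfy the parent and the loop ends. Without it, the only thing that stops the loop in this sketch is the artificial cap.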
Over the next forty-five minutes, the eighth server sent more than four million orders into the market in response to 212 customer orders, executing across 154 symbols and ultimately moving 397 million shares. It bought at the offer and sold at the bid hundreds of times per second. Each round trip lost the spread.
By contemporary reporting, Knight’s losses accumulated at roughly $10M per minute.
From the perspective of every component except the broken one, the system was behaving correctly. The exchanges were filling the orders. The risk system was receiving the fills. The position-keeping system was updating.
Knight’s internal monitoring had generated 97 emails containing “Power Peg disabled” between 8:01 and 8:24 AM Eastern, before the market opened, but these were not designed as alerts and no one acted on them.
What did not happen, for forty-five minutes, was anyone stopping the eighth server.
The reasons map cleanly onto the structure of the control loop. After roughly twenty minutes of diagnosis without documented incident-response procedures, engineers concluded that the issue lay in the new code and reverted SMARS to its previous version on the seven servers that had received it.
This was the opposite of the correct action: the previous version was the one in which the flag still activated the broken Power Peg path. The rollback propagated the failure contained on one server onto all of them.
Eventually the call was made to halt SMARS entirely. By the time the system was actually stopped, at approximately 10:15 AM, Knight had taken positions of approximately $6.65B (net long $3.5B in 80 stocks, net short $3.15B in 74).
Once unwound, the realised loss was reported by Knight at ~$440M; the SEC’s enforcement order placed the figure above $460M.
The firm did not survive in its prior form. By mid-December, less than five months later, Knight had agreed to a merger with Getco; the deal closed in July 2013, and the combined entity (KCG Holdings) was itself acquired by Virtu in 2017.
The point is the loop. SMARS was running an automated control loop generating orders at machine speed, executing against the market, receiving fills, generating more orders.
The human control loop above it (monitoring positions, raising alerts, halting the system on threshold breaches) was nominally coupled to SMARS through dashboards and risk limits.
In an incident, they decoupled. The position-monitoring metric had a collection interval on the order of a minute. The decision cycle for incident response was five to ten minutes per hypothesis-test iteration.
The order-generation cycle was milliseconds. The two loops differed by roughly four orders of magnitude. The faster loop accumulated four hundred million dollars of damage in the time the slower one ran three diagnostic iterations.
This is the structural form. Nyquist’s sampling argument applies with full force: a control loop sampling at interval $T$ cannot resolve, let alone react to, a disturbance whose period is shorter than $2T$. Knight’s human loop sampled at minutes; the plant’s disturbance evolved over milliseconds. The loop was, by sampling theory, blind to its own plant.
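A few lines of arithmetic show what the sampling bound means in practice. This is a sketch under assumed numbers: the one-minute sampling interval matches the collection interval described above, but the 4 ms plant period is a stand-in for "milliseconds", and the function names are invented for illustration.

```python
import math


def shortest_resolvable_period(sample_interval_s):
    """Nyquist bound: at least two samples per cycle are needed."""
    return 2 * sample_interval_s


def sample_sine(period_s, sample_interval_s, n):
    """What a sampler sees of a unit sine disturbance with the given period."""
    return [
        math.sin(2 * math.pi * k * sample_interval_s / period_s)
        for k in range(n)
    ]


HUMAN_INTERVAL_S = 60.0  # assumed: one position reading per minute
PLANT_PERIOD_S = 0.004   # assumed: a 4 ms order-generation cycle

seen = sample_sine(PLANT_PERIOD_S, HUMAN_INTERVAL_S, 10)
gap = math.log10(HUMAN_INTERVAL_S / PLANT_PERIOD_S)

print(shortest_resolvable_period(HUMAN_INTERVAL_S))  # 120.0
print(max(abs(v) for v in seen) < 1e-6)              # True
print(round(gap, 1))                                 # 4.2
```

Every sample lands on the same point of the cycle, so a full-amplitude oscillation aliases to a flat line. Under these assumed numbers the two timescales sit a little over four orders of magnitude apart.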
The crisis simply could not be controlled by the available control structure, regardless of operator competence.
The lesson the industry encoded after Knight was not that humans should react faster (they obviously cannot) but that any control loop running at machine speed must have a kill switch at machine speed: pre-trade risk checks in the order path, position limits enforced before order submission, circuit breakers triggered on order velocity.
The slow human loop sits above all of this and decides when to re-enable after the automated kill. It does not, anymore, try to be the kill itself.
The principle generalises. Any system whose failure mode propagates faster than the slowest control loop authorised to stop it is, in the precise technical sense, uncontrollable along that axis.
The mitigation is not faster humans; it is a fast-enough automated cutoff with a slow-enough human override.
This is the operational meaning of what Marc Brooker calls autonomic behaviour: the component must be capable of saving itself, on millisecond timescales, against failures the human loop is structurally too slow to address.
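A fast automated cutoff with a slow human override might look like the following. It is a minimal sketch, not any exchange’s or broker’s actual risk check: the class name, the sliding-window rule, and the thresholds are all hypothetical. The essential property is the latch: once tripped, the breaker stays tripped until a human resets it.

```python
from collections import deque


class OrderVelocityBreaker:
    """Trips when too many orders arrive within a time window.

    Hypothetical sketch. Once tripped it stays tripped until human_reset().
    """

    def __init__(self, max_orders, window_s):
        self.max_orders = max_orders
        self.window_s = window_s
        self.recent = deque()  # timestamps of recently admitted orders
        self.tripped = False

    def allow(self, now_s):
        """Called in the order path before submission; False blocks the order."""
        if self.tripped:
            return False
        # Drop timestamps that have aged out of the window.
        while self.recent and now_s - self.recent[0] >= self.window_s:
            self.recent.popleft()
        if len(self.recent) >= self.max_orders:
            self.tripped = True  # latch: no automatic re-enable
            return False
        self.recent.append(now_s)
        return True

    def human_reset(self):
        """The slow loop decides when to re-enable."""
        self.tripped = False
        self.recent.clear()


breaker = OrderVelocityBreaker(max_orders=100, window_s=1.0)
admitted = sum(breaker.allow(k * 0.001) for k in range(1000))  # 1 ms apart
still_blocked = not breaker.allow(10.0)
breaker.human_reset()
reenabled = breaker.allow(10.0)

print(admitted, still_blocked, reenabled)  # 100 True True
```

The check runs inside the order path, so it acts on the plant’s timescale. The human decision is confined to the one step that can afford to be slow.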
When the control plane cannot reach itself: Facebook, 4 October 2021
The second structural form of control-loop failure is reachability. The diagnosis is correct, the action is well-understood, the human loop is fast enough, but the action cannot be applied because the path from controller to actuator runs through the system that has failed.
At 15:39 UTC on 4 October 2021, an engineer at what was then still called Facebook executed a routine maintenance command intended to assess backbone capacity. The command was issued through an audit tool whose job was to reject any change that would take too much of the backbone offline at once.
A bug in the audit tool failed to catch this one. The command withdrew the BGP advertisements for every prefix Facebook announced to the rest of the Internet.
BGP is the inter-domain routing protocol that lets autonomous systems tell the rest of the Internet which prefixes they own and how to reach them.
When Facebook stopped announcing its prefixes, BGP speakers across the Internet, applying standard route-withdrawal semantics, removed Facebook’s routes from their forwarding tables within seconds.
By Cloudflare’s measurements, public resolvers’ cached records for facebook.com had expired by 15:50 UTC. From the outside, Facebook ceased to exist.
This is, on its own, a recoverable outage. Re-announcing the prefixes is a single configuration push. The question was whether engineers could get that push to the routers.
They could not.
The configuration management system ran on Facebook’s internal network. Facebook’s authoritative DNS servers, hosted at smaller facilities, had a safety rule: if they could not reach the main data centres, they treated themselves as unhealthy and withdrew their own BGP advertisements.
When the backbone went down, every DNS server independently concluded that it was isolated and pulled its routes. Facebook’s DNS therefore disappeared from the public Internet as a second-order consequence of the backbone failure.
And it disappeared from the inside as well, because the same authoritative DNS resolved the hostnames of the internal tools engineers would have used to undo the change.
It got worse. Many of the internal tools and services engineers would have used to coordinate the response (parts of Facebook’s authentication and communication infrastructure) also depended on the broken backbone or on the now-unreachable DNS.
Engineers reportedly could not log into internal tools; conference rooms whose locks were on the same network would not open; routine communication channels among responders failed.
It got worse again. Physical access to the data centres was gated by a card-access system whose backend ran on the same internal network. Engineers attempting to physically enter buildings or reach server cages directly found their badges no longer opened the doors.
Press reports during the incident described engineers using an industrial angle grinder to cut through a server-cage bar at the Santa Clara data centre; Facebook later disputed the specifics, acknowledging only that “some physical barriers had to be worked around.”
Either way, a team had to be physically dispatched to a data centre to restore service.
The total outage was approximately six hours; BGP advertisements resumed shortly before 21:00 UTC. The technical fix could have been completed in minutes if it had been reachable.
The duration was determined by the time required to physically reach a console inside the same dependency loop as the failure, restore enough of the internal network to allow remote actions, and only then perform the fix that was, in itself, trivial.
This is the second structural form: the action that would resolve the failure lies in the unreachable set induced by the failure itself.
Control theory has a name for this (a state that cannot be reached from the current state by any admissible input is uncontrollable from that state), but the network-engineering name is more vivid: the control plane was in-band.
The configuration changes that would repair the data plane had to travel through the data plane.
The principle Facebook subsequently invested in is out-of-band control. The control plane must reach its actuators by a path that does not depend on the system being controlled.
In dependency-graph terms, the directed graph of “X depends on Y to function” must contain no cycle that passes through the control surfaces of the production system.
If it does, there exists a failure mode in which those surfaces are no longer accessible, and the system can only be recovered by an out-of-band action: physical access, a separate management network, or a kept-current break-glass procedure that shares no infrastructure with normal operations.
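The in-band condition can be checked mechanically once the dependencies are written down. The sketch below is illustrative only: the component names loosely follow the narrative above, the edges are simplified guesses rather than Facebook’s real topology, and `oob_console` stands for a hypothetical separately operated management path.

```python
def unavailable(depends_on, failed):
    """Return the failed component plus everything that transitively needs it."""
    down = {failed}
    changed = True
    while changed:
        changed = False
        for node, deps in depends_on.items():
            if node not in down and any(d in down for d in deps):
                down.add(node)
                changed = True
    return down


# "X depends on Y to function" (hypothetical, simplified)
DEPENDS_ON = {
    "backbone": [],
    "internal_dns": ["backbone"],
    "config_push": ["backbone", "internal_dns"],
    "auth": ["internal_dns"],
    "badge_readers": ["backbone"],
    "oob_console": [],  # assumed to share nothing with production
}
RECOVERY_TOOLS = {"config_push", "badge_readers", "oob_console"}

down = unavailable(DEPENDS_ON, "backbone")
usable = RECOVERY_TOOLS - down

print(sorted(usable))  # ['oob_console']
```

Remove `oob_console` from the graph and the usable set is empty: every recovery tool sits downstream of the thing it is meant to repair.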
The cost of maintaining true out-of-band control is non-trivial: a second network, separately operated, credentialed, monitored, exercised.
The path of least resistance is always to let the control plane drift back in-band, because in-band is cheaper, easier to operate, and works fine until the day it does not.
There is a related principle from safety-critical systems, sometimes called recovery independence, that any component whose failure can render the system inoperable must have a recovery path that does not require that component to be operating.
NASA flight rules have a version of this. Nuclear plant operating procedures have a version. Most production software systems do not.
When recovery depends on the failure being fixed: AWS Kinesis, 25 November 2020
The third structural form is the case where the path to recovery passes through the failure itself.
On 25 November 2020, the Wednesday before American Thanksgiving, AWS engineers added capacity to the Amazon Kinesis Data Streams front-end fleet in us-east-1 between 02:44 and 03:47 PST.
The first customer-impacting alarms fired at approximately 05:15 PST, with Kinesis error rates climbing through the morning. Kinesis is AWS’s high-throughput event ingestion service; it underpins CloudWatch metrics, AWS Lambda’s logging path, Cognito’s analytics path, and a large catalogue of downstream services.
When Kinesis became unhealthy in us-east-1, a substantial fraction of AWS itself became unhealthy with it.
The Kinesis front-end fleet handles authentication, throttling, and request routing to the appropriate back-end clusters that own the actual stream shards.
Each front-end server maintains in memory a shard-map: a cache containing membership data and shard ownership for the back-end clusters. To populate this cache, each front-end server creates an OS thread per peer in the front-end fleet, and exchanges shard information over those threads.
As AWS noted in its post-mortem, fully learning about a newly added fleet member can take up to an hour. The new capacity pushed the per-server thread count past a configured OS limit. When this limit was reached, front-end servers could not create the additional threads needed to complete the shard-map cache.
Cache construction failed, leaving servers with “useless shard-maps” (AWS’s phrase) that prevented them from routing requests to the correct back-end clusters. Errors began propagating to downstream callers, and the failure spread across the fleet as more servers crossed the threshold.
The remediation was familiar: stop the scaling, remove the additional capacity, and restart the fleet. The constraint was that on coming back up, each front-end server had to rebuild its shard-map by communicating with every other front-end server, and the resources needed to populate the cache competed with the resources needed to serve requests.
AWS could only bring servers back in small groups, a few hundred per hour, verifying stability between batches. The first servers re-entered traffic at 10:07 AM PST; Kinesis fully returned to normal at 10:23 PM PST, roughly 17 hours after the first alarms.
The duration of the Kinesis impairment is not the most interesting part. The most interesting part is what was failing while Kinesis was failing.
CloudWatch ingested metric data via Kinesis. With Kinesis impaired, its ability to ingest fresh metrics was degraded, so the dashboards customers and AWS engineers used to monitor their systems went dark or stale at exactly the moment they were needed.
Lambda invocations require publishing metric data to CloudWatch as part of the invocation; as CloudWatch metrics degraded, Lambda’s local metric agents exhausted their buffers and invocations began to fail.
Cognito uses Kinesis Data Streams to collect and analyze API access patterns; the path is documented as best-effort, with web servers buffering locally, but as the impairment dragged on, Cognito web servers exhausted those buffers and customer authentication started failing.
AWS’s own Service Health Dashboard, which would normally have communicated the outage to customers, was itself impaired in its ability to post updates.
This is the structural form: the path to recovery passed through the failure. The on-call engineers responding to the Kinesis outage needed monitoring to verify their interventions were working, and the monitoring depended on Kinesis.
Customers needed status updates to understand what was happening, and the status page depended on Kinesis. In dependency-graph terms, the recovery dependency graph contained a back-edge: a cycle in which X depends on Y to recover, and Y depends on X to function.
AWS documented three mitigations: moving to larger CPU and memory servers (so the fleet needs fewer machines and therefore fewer per-server threads), accelerating the cellularisation of the front-end fleet so any single instance no longer needs a thread per peer across the whole fleet, and separating large internal consumers like CloudWatch onto their own partitioned front-end fleets.
The architectural changes are the substantive ones. Larger servers patch the specific trigger; cellularisation and partitioning defend against the general class of failure in which fleet-wide state synchronisation scales worse than the fleet itself.
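The scaling argument is simple arithmetic. AWS did not publish fleet sizes or the thread limit, so every number below is hypothetical, and the cellular variant assumes a server needs a thread per peer only within its own cell.

```python
def threads_per_server(fleet_size, baseline_threads):
    """One thread per peer in the fleet, on top of the server's other threads."""
    return baseline_threads + (fleet_size - 1)


def threads_per_server_cellular(cell_size, baseline_threads):
    """Assumes a thread per peer only within the server's own cell."""
    return baseline_threads + (cell_size - 1)


THREAD_LIMIT = 10_000  # hypothetical OS limit
BASELINE = 500         # hypothetical threads used for everything else

before = threads_per_server(9_000, BASELINE)            # 9499
after = threads_per_server(9_600, BASELINE)             # 10099
in_cells = threads_per_server_cellular(200, BASELINE)   # 699

print(before <= THREAD_LIMIT, after <= THREAD_LIMIT, in_cells <= THREAD_LIMIT)
# True False True
```

In the uncellularised design the per-server count grows with the fleet, so adding capacity moves every existing server closer to the limit at once. With cells, the per-server count depends on the cell size and stays put as the fleet grows.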
The general principle, restated for the third time:
The path from a failed state back to a healthy state must not depend on the failed component being healthy.
This is harder to enforce than it sounds. Almost every production system has some path of this shape, somewhere. The deployment system that pushes the fix often depends on the very services the fix is repairing. The monitoring that verifies the fix worked depends on the metrics pipeline.
The communication tools the on-call team uses to coordinate the response depend on internal services that may be in the blast radius of the failure. The discipline is to identify these back-edges and either break them or document the manual workaround.
The recovery dependency graph is distinct from the runtime dependency graph, and almost always more pessimistic.
A system can have a clean runtime graph and a deeply cyclic recovery graph, because the runtime graph captures what depends on what during normal operation, where slow paths, retry loops, fallback caches, and degraded-mode handoffs are all tolerable, while the recovery graph captures what depends on what when something is broken.
In recovery, those tolerances vanish. The path from failed to healthy must be fast and reliable, and a cycle in that path means it may not exist at all.
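Finding back-edges in a recovery graph is an ordinary depth-first search. The sketch below uses a deliberately simplified pair of graphs: the node names echo the Kinesis incident, but the edges are illustrative, and the single added recovery edge (restarting Kinesis safely requires monitoring) is an assumption made for the example.

```python
def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {n: WHITE for n in graph}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for dep in graph.get(node, []):
            state = colour.get(dep, WHITE)
            if state == GREY:  # back-edge: dep is already on the current path
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for n in graph:
        if colour[n] == WHITE:
            found = visit(n)
            if found:
                return found
    return None


# "X depends on Y" during normal operation (simplified)
RUNTIME = {"cloudwatch": ["kinesis"], "status_page": ["kinesis"], "kinesis": []}

# Same graph plus one assumed recovery-time edge.
RECOVERY = {"kinesis": ["cloudwatch"], "cloudwatch": ["kinesis"], "status_page": ["kinesis"]}

print(find_cycle(RUNTIME))   # None
print(find_cycle(RECOVERY))  # ['kinesis', 'cloudwatch', 'kinesis']
```

The runtime graph passes the check and the recovery graph does not, which is why the two have to be audited separately.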
The three failure modes, named
The three incidents above are not three instances of the same failure. They are three different ways the control loop loses authority over the plant, each with a precise structural cause.
Knight Capital is timescale decoupling. The plant ran at milliseconds; the controller ran at minutes. By the Nyquist sampling argument, the controller could not, in principle, react to disturbances at the plant’s natural frequency. The control structure was, mathematically, the wrong shape for the plant.
Facebook BGP is an unreachable actuator. The diagnosis was correct and the intervention was correct, but the actuator was in the failure’s blast radius. There was no admissible input from the controller’s current state to the recovery state, because the path between them passed through the failed component. The recovery state was, in the formal sense, unreachable from where the operators were standing.
Kinesis is a recovery dependency cycle. The controller could reach the actuator and the actuator could apply the intervention. But the verification path, knowing whether the intervention had worked, passed through the failure itself. The loop could be opened by the operators but not closed by feedback, which forced recovery to proceed at the speed of careful, blind, incremental probing.
The three failure modes compose. In a sufficiently bad incident, all three are active simultaneously: the plant moves faster than the controller can sample, the actuator is partially unreachable, and the feedback path that does reach the controller is itself degraded.
The control system has lost authority along three axes at once, and the operator is acting through the gaps.
The deeper claim, the one Parts I and II have been building toward, is that these regimes are not exceptional. They are what production distributed systems enter during incidents, by construction of how those systems are composed and observed.
The healthy operating envelope is the region in which the control loop has enough bandwidth, enough reach, and enough feedback to keep the plant on a setpoint.
Outside that envelope, one or more of those conditions fails, and the operator is no longer controlling the system; they are nudging it and waiting for the system either to find its own equilibrium or to deteriorate further.
The job of the operator in this regime is not to control. It is to survive: keep blast radius bounded, avoid actions that worsen the failure, preserve the option of recovery, and wait for conditions in which control becomes possible again.
The aviation discipline of aviate, navigate, communicate captures it: maintain altitude first, locate yourself second, talk to people third. Reversing that order is how you crash.
The OODA loop, and why it fails
The framework that incident responders most commonly use, often without naming it, is the OODA loop: Observe, Orient, Decide, Act.
The name and the framework are due to John Boyd, a U.S. Air Force colonel and fighter pilot who began formulating the ideas in the 1950s and 60s as an instructor at the USAF Fighter Weapons School and refined them through a series of briefings in the 1970s and 1980s, most prominently Patterns of Conflict (1986).
The central claim, which Boyd’s briefings emphasised, is that in adversarial situations, the actor whose decision cycle runs faster than the adversary’s wins, because they are operating inside the adversary’s decision cycle: by the time the slower actor has decided what to do, the faster actor has already changed the situation, invalidating the slower actor’s decision.
The framework migrated from military doctrine into business strategy, into emergency response, and eventually into software incident response, where it is used to describe the cycle an on-call engineer runs during an outage: observe the system, orient the observation against a model, decide on an intervention, act, and observe the result.
The framework is useful. It is also, in the production-systems context, frequently misapplied, and the way it is misapplied tells you something about why incidents go badly.
The original Boyd argument applies when the situation is adversarial and roughly symmetric: two actors with comparable OODA speeds, where being faster gives you the advantage. The situation an on-call engineer faces is not adversarial in this sense. There is no opponent making decisions.
The system is not trying to beat them. It is, instead, evolving according to dynamics (autoscaler decisions, retry storms, cache decays, queue accumulations) that have their own timescales, and those timescales are not necessarily compatible with the engineer’s OODA loop at all.
In an incident, three OODA loops are running simultaneously, and they do not all have the same period. The engineer’s loop runs at the speed of human cognition under stress: roughly thirty seconds to a few minutes per cycle, slower if multiple humans need to confer.
The automated control loop (autoscalers, load balancers, schedulers) runs at the speed of metric collection, typically seconds to tens of seconds. The plant’s own dynamics (connection pool exhaustion, queue overflow, cache regeneration) run at whatever rate the underlying physics dictates, which can be anywhere from milliseconds (TCP retransmits, GC pauses) to many minutes (cache warming, replication catch-up).
In Knight Capital’s case, the engineer’s loop ran at minutes against a plant operating at milliseconds. In Boyd’s terms, the plant was hopelessly inside the engineer’s decision cycle: every observation the engineers made was already obsolete by the time they oriented to it, and every action they took landed against a system that had moved orders of magnitude further during the action’s flight time.
The reverse case is also possible and is, in some ways, more insidious. An engineer’s OODA loop running faster than the plant’s natural recovery dynamics produces a different pathology: the engineer observes that the intervention has not yet worked, orients to that as a failure, decides on a new intervention, and acts, before the original intervention has had time to take effect.
The result is a stack of interventions in flight against a system that is already converging from the first one, with the later interventions arriving as disturbances against the recovery.
This is the operational form of what control theorists call over-control. The classical example is the shower with a slow-responding mixer: the user adjusts hot, observes no change, adjusts more, observes no change, then receives the full delayed effect of both adjustments and is scalded.
The loop is too fast for the plant. The mitigation, both in the shower and in production systems, is to wait, to extend the OODA cycle until it matches the plant’s natural response time, even though waiting feels, in the moment, like inaction.
The discipline this requires is hard. An on-call engineer under pressure does not feel that waiting is the correct action. The reflex, encouraged by the culture of incident response and reinforced by the stress of an outage in progress, is to act. To do something. To try the next thing on the runbook.
The structural argument against this reflex is not that the engineer is wrong to want to help; it is that, in a system with delayed feedback and partial observability, the cost of acting too quickly can exceed the cost of waiting one more OODA cycle for the previous action to land.
The principle from control theory is: the loop’s cycle time should be at least the plant’s settling time. If you act faster than the plant settles, you are stacking interventions against a system that has not yet responded to the previous one, and the resulting trajectory is not the sum of the interventions’ intended effects.
It is the response of a non-linear system to a sequence of disturbances, which is, almost always, worse than the response to any single intervention applied alone.
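The stacking effect can be reproduced in a toy model. The sketch below is illustrative only: the pure-delay plant, the proportional operator, and every number in it (delay, gain, target) are assumptions chosen to make the shape visible, not measurements of any real system.

```python
def simulate(period: int, delay: int = 5, gain: float = 0.5,
             target: float = 10.0, steps: int = 60) -> list[float]:
    """Trace of a plant whose response to each adjustment arrives `delay` ticks late.

    The operator observes and acts every `period` ticks, correcting by
    `gain` times the error visible at that moment.
    """
    in_flight = [0.0] * delay      # adjustments made but not yet felt
    level = 0.0
    trace = []
    for tick in range(steps):
        level += in_flight.pop(0)  # the adjustment from `delay` ticks ago lands now
        adjustment = gain * (target - level) if tick % period == 0 else 0.0
        in_flight.append(adjustment)
        trace.append(level)
    return trace


impatient = simulate(period=1)  # acts every tick, before earlier actions have landed
paced = simulate(period=5)      # waits one full delay between actions

print(f"impatient peak: {max(impatient):.1f} (target 10.0)")
print(f"paced peak:     {max(paced):.1f} (target 10.0)")
```

In this toy, the impatient operator keeps adding corrections while the first ones are still in flight and overshoots the target several times over. The paced operator, using the same gain, approaches the target from below and never exceeds it.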
The mature on-call discipline, encoded in the better runbooks and the better incident command training, is to act, then wait the settling time, then observe, then decide whether to act again. The settling time is, loosely, the half-life of the previous action: the interval after which its effect can be told apart from its absence. It is system-specific: deployment rollbacks settle in minutes; cache invalidations settle in seconds; configuration pushes can take longer than either, depending on the propagation path.
Knowing the settling time of every action available during an incident is a kind of operational knowledge that does not usually live in the runbook, because it does not exist outside the system being operated. It lives in the heads of the engineers who have been on-call for that system long enough to have learned it from incidents.
This is, partly, why senior on-call engineers are so much more effective in incidents than junior ones, even when the junior engineers have the same runbook access and the same training. The senior engineers have an internal model of the plant’s settling times for each available action, and they pace the OODA loop to match.
The junior engineers, lacking the model, run the OODA loop at the natural speed of human cognition under stress, which is too fast for most production systems and produces the over-control pathology.
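One way to move that pacing knowledge out of individual heads is to write the settling times down and let tooling remind the operator how long to wait. The sketch below assumes a hypothetical table of settling times; the action names and the numbers are placeholders, and the real values would have to be learned from the system being operated.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical settling times in seconds. Placeholders only: the essay's point
# is that the real values are system-specific and learned from incidents.
SETTLING_SECONDS = {
    "cache_invalidation": 30.0,
    "deployment_rollback": 600.0,
    "config_push": 900.0,
}


@dataclass
class Pacer:
    """Tracks the last action taken and how long until its effect is readable."""
    last_action: Optional[str] = None
    last_action_at: float = 0.0

    def record(self, action: str, now: float) -> None:
        self.last_action = action
        self.last_action_at = now

    def seconds_until_observable(self, now: float) -> float:
        """Time left before the previous action's outcome should be judged."""
        if self.last_action is None:
            return 0.0
        settles_at = self.last_action_at + SETTLING_SECONDS[self.last_action]
        return max(0.0, settles_at - now)


pacer = Pacer()
pacer.record("deployment_rollback", now=0.0)
print(pacer.seconds_until_observable(now=120.0))  # still inside the settling window
```

A non-zero answer does not forbid acting; it only makes explicit that any new action taken now lands on a system that has not yet finished responding to the last one.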
The illusion of the runbook
Every mature on-call function has runbooks. They are written after incidents, refined over months and years, kept in wikis and indexed by alert.
The good ones describe, for each known failure mode, the diagnostic steps to confirm it, the intervention to apply, and the verification to perform. The discipline of writing them is one of the visible practices of a mature operations culture.
The runbook is also, for a specific category of incident, a trap.
The trap has two forms. The first is that runbooks are written against past incidents. They encode the diagnoses that worked, the interventions that helped, the verifications that confirmed recovery.
They are, in the language of Part II, an artefact of monitoring rather than observability: comprehensive against known failure modes, useless against novel ones. The first time the runbook fails to describe the incident in front of you is the moment you discover this, and the discovery is rarely well-timed.
The second form is subtler. The runbook describes interventions and assumes those interventions will work the way they did last time. But the system has been changing since the runbook was written.
The intervention that worked last quarter may now have different downstream effects, because the components it touches have been migrated, the dependencies have shifted, the version of the underlying service has rolled forward.
The runbook, in effect, captures the system as it was; the engineer is operating the system as it is. The two diverge silently, and the divergence is invisible until the runbook produces an action whose effect surprises everyone.
There is a specific failure pattern this produces, common enough to have its own folklore: the engineer follows the runbook, the runbook prescribes restarting the X service, the X service restarts, and the system gets worse rather than better.
The reason, on inspection, is usually that the X service has acquired new dependencies since the runbook was written, and restarting it now drops more state than it used to, with consequences the runbook author did not anticipate.
The runbook is not wrong, in the sense that the steps it describes are the steps that worked. The runbook is out of date, in the sense that the system those steps were designed for is no longer the system the engineer is operating.
The mitigation is not to delete runbooks. They are too useful to discard, and the failure modes they handle correctly are far more common than the ones they handle wrongly.
The mitigation is to use runbooks as hypotheses, not as procedures: the runbook describes what worked last time, which is evidence about what might work this time, but the engineer must independently verify, in real time, that the conditions the runbook assumes still hold. The runbook says “restart X”; the engineer asks “is restarting X still safe given the system as it currently is?” before doing it.
This is hard to do under the pressure of an active incident, and it is one of the practical reasons that senior on-call engineers are more cautious about runbook execution than junior engineers, not less. The juniors, trusting the document, execute the steps.
The seniors, knowing how the document was written and how the system has changed since, pause to verify before each step. The juniors finish the runbook faster. The seniors finish the incident faster.
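Treating a runbook step as a hypothesis can be made concrete by attaching the step's assumptions to the step itself. The following is a minimal sketch under that assumption; `RunbookStep`, `failed_assumptions`, and the example checks are hypothetical names for illustration, not part of any real runbook tool.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunbookStep:
    action: str
    # Each entry pairs a human-readable assumption with a check that returns
    # True when the assumption still holds for the system as it is now.
    preconditions: list[tuple[str, Callable[[], bool]]] = field(default_factory=list)


def failed_assumptions(step: RunbookStep) -> list[str]:
    """Return the assumptions that no longer hold; empty means the step's premises check out."""
    return [description for description, holds in step.preconditions if not holds()]


# Hypothetical checks; in practice these would query the live system.
restart_x = RunbookStep(
    action="restart X",
    preconditions=[
        ("X holds no state that is lost on restart", lambda: False),
        ("X has healthy replicas to absorb traffic", lambda: True),
    ],
)

broken = failed_assumptions(restart_x)
if broken:
    print(f"Do not '{restart_x.action}' yet; stale assumptions: {broken}")
```

The checks will be imperfect and incomplete, but a step that carries its assumptions makes the question "is this still safe?" part of the document rather than something the engineer must remember to ask.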
A related, sharper principle: the value of a runbook decays with the rate of system change. In a system that does not change, a runbook is durable: the same steps work indefinitely. In a system that changes weekly, with new deployments, new dependencies, new configurations, the half-life of a runbook is on the order of months (at best).
A two-year-old runbook in a fast-moving system is closer to historical fiction than to operational guidance. It still has value, but the value is in the model it encodes of how someone once thought about the system, not in the steps it prescribes.
The blast radius principle
The third structural property of action in distributed systems, after timescale and reachability, is blast radius: the set of components affected by a given intervention, and the bound on the damage if the intervention is wrong.
Every available action has a blast radius. Restarting a single instance has a small radius: that instance, briefly, plus whatever state it held. Rolling back a deployment has a larger radius: the entire fleet running that deployment, plus the load that flips back to the previous version, plus the downstream effects of running the older code.
Failing over a region has a radius that may include every customer routed to that region, every dependent service that has to repoint, and every cache that has to be rebuilt. Dropping a load balancer has a radius that can encompass the entire service.
The blast radius principle says: act with the minimum blast radius that can plausibly resolve the failure. If the failure is in one instance, restart the instance, not the fleet. If the failure is in one region, fail over that region, not the global topology. The reason is not just damage control. It is information.
A small-radius action either resolves the failure or does not, and either outcome is informative: it succeeded, which suggests the failure was localised; or it did not, which rules out a hypothesis and constrains the next action.
A large-radius action, by contrast, may resolve the failure without telling you which part of the action was responsible. If you fail over a region and the symptoms clear, you do not know whether the failure was in the region you abandoned, in the load on the region you moved to, or in the path between them.
You have ended the incident without learning anything that would prevent the next one. The blast radius was excessive for the diagnostic value returned.
The principle generalises into a hierarchy of interventions, ordered roughly by radius:
The smallest interventions are read-only: looking at logs, querying metrics, running diagnostic commands. These have effectively zero blast radius and are always safe to do first. Most outages benefit from more reading than the operators in the moment feel they have time for.
The next tier is single-instance: restarting one process, draining one node, removing one server from rotation. These have blast radius limited to the instance, and they are reversible within seconds. They are the appropriate first active intervention for almost any failure that has a candidate localised cause.
The next tier is service-level: rolling restart of a fleet, configuration push, deployment rollback. These have blast radius across the service, take longer to apply, and are harder to reverse. They are appropriate when single-instance interventions have ruled out localised causes, or when the failure is observed broadly enough that localising it is itself wasting time.
The largest interventions are infrastructure-level: regional failover, load shedding at the edge, traffic routing changes, emergency capacity additions. These have very large blast radii, and their consequences are often partly unobservable until well after the action.
They are appropriate only when smaller interventions have failed or are known to be insufficient, and they should be made with the explicit understanding that the system after the intervention will be a different system from the one before, with new failure modes that no one has yet characterised.
The principle, in compressed form: the appropriate blast radius scales with the certainty of the diagnosis. When you are confident in the cause, you can use a targeted, low-radius intervention.
When you are uncertain, you have two choices: invest more time in diagnosis to raise the certainty, or accept the higher blast radius of a less-targeted intervention. There is no third option.
Acting with high blast radius on low certainty is how outages turn into multi-region cascades.
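The hierarchy and the minimum-radius rule can be sketched as a small selection function. The tiers follow the four levels described above; the intervention names and the hypotheses they address are invented for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    """Intervention tiers, ordered by blast radius (smaller value, smaller radius)."""
    READ_ONLY = 0
    SINGLE_INSTANCE = 1
    SERVICE = 2
    INFRASTRUCTURE = 3


@dataclass(frozen=True)
class Intervention:
    name: str
    tier: Tier
    addresses: frozenset[str]  # hypotheses this action could plausibly resolve


def smallest_plausible(options: list[Intervention], hypothesis: str) -> Optional[Intervention]:
    """Pick the lowest-tier intervention that could resolve the current hypothesis."""
    candidates = [o for o in options if hypothesis in o.addresses]
    return min(candidates, key=lambda o: o.tier, default=None)


# Hypothetical catalogue of available actions.
OPTIONS = [
    Intervention("regional failover", Tier.INFRASTRUCTURE, frozenset({"bad_instance", "bad_deploy", "bad_region"})),
    Intervention("deployment rollback", Tier.SERVICE, frozenset({"bad_instance", "bad_deploy"})),
    Intervention("restart one instance", Tier.SINGLE_INSTANCE, frozenset({"bad_instance"})),
]

choice = smallest_plausible(OPTIONS, "bad_instance")
print(choice.name if choice else "no targeted action; keep diagnosing")
```

When no intervention addresses the current hypothesis, the function returns nothing, which corresponds to the first of the two choices above: invest more time in diagnosis before acting.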
The conservation of risk
There is a principle, with versions in the safety-engineering tradition, that goes roughly: the total risk in a complex system tends to be conserved; safety interventions move risk around as much as they reduce it. The folk version is “what gets safer somewhere gets more dangerous somewhere else.”
The more precise versions belong to the risk-homeostasis and risk-compensation literature, most prominently Gerald Wilde’s Theory of Risk Homeostasis (1982), with related work by John Adams, which argues that visible safety improvements are partly absorbed by behavioural adjustments, with the residual risk migrating elsewhere in the system rather than disappearing.
This applies to incident response with peculiar force. Each intervention an operator makes during an incident is a safety action: an attempt to reduce immediate risk.
The intervention, almost always, succeeds at reducing the visible component of the failure. The error rate drops. The dashboards green up. The customers stop tweeting. The intervention has, by every visible measure, worked.
What is harder to see, and what is rarely measured in real time, is what the intervention did to the invisible component.
Rolling back a deployment ends the immediate incident but leaves the older code running, which may have its own known issues that the deployment was meant to fix.
Failing over a region resolves the local failure but loads the target region beyond its tested capacity, potentially priming a second incident.
Adding capacity to a saturated service unblocks the immediate queue but increases the surface area for the failure mode that caused the saturation, if the underlying cause has not been addressed.
The pattern is not unique to software. Aviation has a long literature on accident patterns that begin with a successful response to a minor problem and end with a major accident triggered by the response itself. Healthcare has the same literature.
The general structural shape is: the intervention that resolves Failure A creates the conditions for Failure B, which is rarer, less familiar, and less recoverable.
The system has been moved from a known failure regime into an unknown one, and the unknown regime has its own failures that the operators have not yet learned to recognise.
The discipline this asks of an on-call function is, again, hard to maintain under pressure. It is the discipline of asking, after every intervention: what did this just change about the system, and what new failure modes have I introduced?
It is the discipline of treating recovery as a state that itself needs to be monitored, because the post-recovery system is not the same system as the pre-incident system, and its failure characteristics are not yet known.
In practice, this means that the end of an incident is not when the symptoms clear. The end of an incident is when the system has been observed in its new configuration long enough to be confident that the interventions did not introduce a worse failure than the one they resolved.
This is usually hours, sometimes days, longer than the time the dashboards take to go green. The on-call function that closes incidents at the moment of symptom resolution is, structurally, accepting an invisible risk in exchange for ending the call earlier.
The function that watches the recovered system through at least one full traffic cycle is paying a cost in operator time for the option of catching the second-order failure before it becomes a second incident.
The operator as Bayesian, under pressure
Underneath all of the above is a single epistemic structure.
The operator, during an incident, is performing inference: from observed symptoms, against a prior model of the system, to a posterior estimate of what is wrong and what to do about it.
The framework is Bayesian whether the operator names it that way or not, and the failure modes are the failure modes of Bayesian inference under conditions hostile to it.
The prior model (what the operator believes the system is, before any incident-specific evidence) is the simulation Richard Cook described and that Part II referenced.
The likelihood (what the operator expects to observe under each candidate hypothesis) is implicit, drawn from training and prior incidents.
The posterior (what the operator believes after seeing the symptoms) is the working diagnosis. The decision follows from the posterior and from the operator’s loss function over possible outcomes.
Each step has its failure mode. The prior can be wrong, as in the Cloudflare oscillation in Part I, where the prior placed most of the probability mass on external attack and left almost none for gradual rollout against a regeneration cycle.
The likelihood can be wrong, as in the Slack autoscaler in Part II, where the model of the evidence said high load implies high CPU, and the actual system was producing high load with low CPU because the threads were waiting on a degraded network.
The posterior, conditioned on a wrong likelihood, will be wrong in the same direction.
The decision, conditioned on a wrong posterior, will be wrong. And (this is the cruellest part) the operator will get feedback from the decision, observe its outcome, and update the model.
If the wrong decision happened, by chance or by partial compensation from other parts of the system, to be followed by recovery, the operator will incorporate that outcome as evidence that the wrong model was right.
The next time a similar incident occurs, the operator will reach for the same wrong intervention, with higher confidence, because last time it appeared to work.
This is the correlated false positive problem in operational learning. The operator’s model is updated by outcomes that are partially decoupled from interventions, but the decoupling is not visible to the operator. Recoveries that happen for reasons unrelated to the intervention reinforce belief in the intervention.
The model drifts, not toward the system’s actual dynamics, but toward whatever pattern of intervention-and-recovery the operator has happened to experience.
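The drift can be shown with a few lines of arithmetic. The sketch below applies Bayes' rule to the belief "this intervention works" after each observed recovery. All probabilities are invented for illustration; the only thing that differs between the two runs is how often the operator believes the system recovers when the intervention did nothing.

```python
def posterior_fix_works(prior: float, p_recovery_if_works: float,
                        p_recovery_if_not: float) -> float:
    """Bayes' rule: P(intervention works | recovery was observed)."""
    evidence = prior * p_recovery_if_works + (1.0 - prior) * p_recovery_if_not
    return prior * p_recovery_if_works / evidence


def belief_after(incidents: int, p_recovery_if_not: float,
                 prior: float = 0.5, p_recovery_if_works: float = 0.9) -> float:
    """Belief after `incidents` recoveries that each followed the intervention."""
    belief = prior
    for _ in range(incidents):
        belief = posterior_fix_works(belief, p_recovery_if_works, p_recovery_if_not)
    return belief


# Assumed numbers: the naive operator treats unaided recovery as rare (5%);
# the honest estimate for this hypothetical system is that it is common (60%).
naive = belief_after(3, p_recovery_if_not=0.05)
honest = belief_after(3, p_recovery_if_not=0.60)

print(f"naive belief after three recoveries:  {naive:.3f}")
print(f"honest belief after three recoveries: {honest:.3f}")
```

The same three recoveries leave the naive operator close to certain and the honest one only moderately more confident than before. The evidence is identical; the difference is entirely in the unexamined assumption about how often the system recovers on its own.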
The mitigation requires explicit discipline, and most on-call functions do not maintain it. The discipline is to ask, after every incident: did the intervention actually cause the recovery, or did the system recover for some other reason that I happened to be present for?
The blameless post-mortem culture is, partly, an attempt to create the conditions in which this question can be honestly asked. Often it still is not.
There is a deeper version of the same problem, which is that the operator’s model is updated by outcomes the operator can observe, and the outcomes the operator can observe are filtered by the same observability stack whose limits Part II discussed.
The model drifts toward whatever the dashboards can see. Failure modes invisible to the dashboards remain invisible to the model, and the operator becomes progressively more confident in a model that captures only the observable subspace of the system’s actual behaviour.
The model is calibrated, but on a projection that has discarded the dimensions that matter most during novel failures.
This is, finally, why the systems that recover quickly from incidents are not the systems with the best runbooks or the best automation.
They are the systems whose operators have maintained an active distinction between the model and the system: who treat their model as a hypothesis under continuous test, who notice when the system surprises them and update accordingly, and who are willing, in the middle of an incident, to admit that the model they have been operating under for years may not apply to the situation in front of them.
The discipline is epistemic humility under pressure. It is the rarest thing in operations, and it is the one that compounds.
The discipline of acting under uncertainty
The compressed form of this essay, the operational counterpart to Part I’s “what does this depend on that I cannot see?” and Part II’s “what is this metric not telling me?”, is also a question, asked of every intervention before it is applied:
What does this assume about the system, and what happens if that assumption is wrong?
Every action has assumed conditions. Restarting an instance assumes the instance is the source of the failure. Rolling back a deployment assumes the previous version is healthier than the current one.
Failing over assumes the target region can absorb the load. Adding capacity assumes the bottleneck is capacity. Each assumption is, before the action, a hypothesis. Each becomes, after the action, a commitment the operator has to live with.
The discipline is to know, for every available intervention, what assumption it embeds, and to verify that assumption before applying the intervention.
The verification can be imperfect; perfect verification is incompatible with the timescale of an outage.
But the question must be asked, because the alternative is acting on a hypothesis the operator has not consciously formed, and being surprised by the system’s response in a direction the operator was not prepared for.
The questions to ask, before any non-trivial intervention:
What does this action assume about the system?
What is the blast radius if the assumption is wrong?
What is the smallest intervention that would resolve this if my diagnosis is correct?
How will I know whether the action worked, and how long do I need to wait before that signal is reliable?
If this action fails, what state does it leave the system in, and is the next action still reachable from there?
The discipline is not to avoid acting. It is to size the intervention to the certainty of the diagnosis, allow the system time to settle, verify the assumption before escalating, and preserve the possibility of recovery after every step.
This is what separates short incidents from catastrophic ones. Not faster reactions, better dashboards, or thicker runbooks, but the ability to treat every intervention as a probe: an action that changes the system while also revealing something about it.
During an incident, the system is operating in a regime nobody has fully characterised. The operator’s job is not to force it back through sheer intervention. The job is to keep the system in a state where understanding is still possible, learn from each action, and navigate toward a recoverable regime without collapsing the remaining options.
This is on-call: control under uncertainty.
The loops are delayed. The observations are partial. The interventions have non-linear effects. The system being repaired is one no single engineer fully understands.
At 03:47 UTC, the operator is performing the function the entire architecture silently assumes someone will perform, while the architecture itself makes performing it nearly impossible.
The systems mostly do not work. The operators work, and the systems usually fail only when the operators lose the ability to see, infer, or intervene safely.
The pager will go off again. The dashboards will be wrong again. The runbook will be stale again.
The work is the same work: act carefully, preserve reversibility, make the assumption explicit, and already know the next move before committing to the current one.
One more thing…
The structural argument across this series is that distributed systems fail at the seams: composition, observation, and control are each independent sources of failure modes no single component owns and no single engineer fully sees.
The same argument applies, in concentrated form, to GPU programming.
A modern CUDA kernel is itself a tiny distributed system: dozens of streaming multiprocessors, thousands of warps, multiple memory hierarchies, delayed observation through performance counters, and sharply non-linear control through launch configuration and synchronization.
Correctness is necessary. Performance lives in how composition, observation, and control interact under the workload you actually have, not the one your benchmark measured.
I wrote a deep CUDA guide from exactly this perspective: not isolated tricks, but how to reason about the GPU as a coupled dynamical system whose performance regimes and failure modes (occupancy collapse, memory-bandwidth thrashing, warp divergence, pipeline stalls) are structurally the same kinds of seam failures this series has been describing all along.



