How Systems Really Fail, Part II
The observer problem: why your dashboards lie, why aggregation destroys the signal, and the unbridgeable gap between the system and your model of it.
Intro
The first essay in this series argued that distributed systems fail in the spaces between their components, and that those spaces are structurally opaque. This one argues something more uncomfortable.
Even if you accept that the system is opaque, you still have to operate it. You still have to debug it at 03:47 UTC. You still have to decide, in the next ninety seconds, whether to roll back, fail over, shed load, or page someone more senior.
To do any of that, you have to see the system.
This is the second structural problem, the one most production engineers learn about the hard way, usually during an incident that lasted longer than it should have because the dashboards stayed green until they didn’t.
The system you observe is not the system that exists. It is a projection of the system into a low-dimensional representation, built out of metrics, logs, traces, and the mental model that lives in the operator’s head.
The projection is incomplete by construction. It is delayed by the time it takes to collect, aggregate, and render. It is biased by what someone decided to instrument three years ago. And it is aggregated, often violently, in ways that destroy precisely the signal needed to recover from an incident.
This essay is about that gap. Not the gap between the system as designed and the system as composed (Part I), but the gap between the system as composed and the system as perceived. The two gaps are different, and they compound.
Three incidents, mechanically reconstructed: Slack’s autoscaler chasing metrics that had decoupled from the network failure underneath, while the dashboards that would have shown the decoupling failed at the same time as the system;
GitHub’s 43-second partition and the 24-hour reconciliation that followed; Roblox’s 73-hour outage, where the monitoring stack failed at the same time as the system it monitored, and engineers spent two days debugging a fully dark cluster.
The pattern beneath all three is the same: the operator’s view of the system, and the system itself, were two different things. The interesting question is why this is structural, not accidental.
The observer problem, mechanically
The canonical incident for this material is the Slack outage of 4 January 2021, the first business day after the holiday break, documented in detail in Slack’s post-mortem by Laura Nolan.
Slack runs on AWS, with services running in dedicated VPCs (Virtual Private Clouds) connected by AWS Transit Gateways (TGWs). TGWs are managed by AWS and intended to scale transparently. Slack’s traffic pattern is unusual: the platform is quiet over the holidays and then ramps to one of its biggest days of the year on the first Monday back, when clients reconnect with cold caches and pull down more data than usual.
On 4 January, the TGWs did not scale fast enough for that ramp. Around 6:00 AM PST, one of them began dropping packets. The packet loss caused widespread degradation in internal calls across Slack’s services, but the symptom was not yet visible to users.
Slack’s web tier autoscales on two signals: CPU utilisation and utilisation of available Apache worker threads. Here is where the failure begins. As packets dropped on the TGW, threads in the web tier spent more time waiting on slow or stalled backend calls.
Waiting threads do not burn CPU. So as the system became less able to serve users, CPU utilisation actually dropped. The autoscaler, looking at CPU, concluded the fleet was over-provisioned and downscaled the web tier.
Then the mini-peak at 7:00 AM PST arrived. Load increased against the now-smaller fleet on a degraded network. Apache worker thread utilisation climbed sharply, threads were waiting longer, and more of them were in use, and the thread-utilisation signal triggered aggressive upscaling. Slack attempted to add 1,200 servers between 7:01 and 7:15 AM PST.
The scale-up failed. New instances are configured by an internal service Slack calls provision-service, which talks to other Slack systems and to AWS APIs over the same degraded network. Under the sudden load of 1,200 simultaneous provisioning requests, with elevated latency on every dependency call, provision-service hit two resource ceilings: the Linux open files limit and an AWS quota limit.
Most of the 1,200 instances were created but never fully provisioned. They counted against the autoscaling-group size limit, blocking further scale-up, but they did not serve traffic.
And then the second layer of the observer problem revealed itself. Slack’s dashboarding and alerting service had failed during the early stages of triage. The reason was structural: the monitoring stack ran in a different VPC from its backend databases, and the same TGW that was dropping packets sat on the path between them. The failure that was breaking the web tier had also blinded the engineers trying to diagnose it.
For roughly the next hour, incident responders worked without dashboards. They had logs, command-line tools, and the ability to query metrics backends directly, but none of the pre-built queries that turn raw metrics into actionable views.
Some engineers were SSHed into production instances when the autoscaler deprovisioned them mid-investigation, abruptly ending their sessions. provision-service recovered around 8:15 AM PST; serving capacity reached a degraded-but-functional state by 9:15 AM PST; full recovery, after AWS engineers manually scaled TGW capacity, completed at 10:40 AM PST.
The whole sequence is the observer problem in compounded form. The autoscaler responded first to a metric (CPU) that did not represent the system state, then to a metric (thread utilisation) that drove the wrong action under the conditions.
The control loop did exactly what it was designed to do; the signals it was acting on had decoupled from the reality of the network underneath.
And the observation surface that operators would normally have used to see this decoupling was itself, by architectural coincidence, a casualty of the same failure.
This is the observer problem in its operational form, and it has three structural sources, each worth pulling apart.
Instrumentation lag
Every observability pipeline introduces delay between an event occurring in the system and that event being visible on a dashboard. The delay has multiple stages.
First, emission delay: the event happens, but the code that emits the metric or log runs after the event, and the emission itself takes some time, usually buffered behind a batching layer with a flush deadline (StatsD typical: 10s; OpenTelemetry batch span processor: 5s default, max queue 2048). Second, collection delay: the emitted data is scraped or pushed to a collector at fixed intervals (Prometheus default scrape interval: 15s, with scrape_timeout typically 10s).
Third, aggregation delay: the collector pre-computes summary statistics, often on a window ending some seconds in the past to allow late-arriving data. Fourth, render delay: the dashboard queries the storage layer and renders, typically on a 30s to 60s refresh.
End-to-end delay in a well-tuned production pipeline is typically 15 to 60 seconds. In many real pipelines it is several minutes.
This is not a defect. It is the cost of producing observations that are coherent across thousands of hosts. The delay is the price paid for the metric being computable at all.
But the delay has a consequence that classical control theory makes precise. Any closed-loop control system whose feedback path introduces delay τ experiences a phase lag of ωτ radians at frequency ω.
The Nyquist stability criterion says, in essence, that a feedback loop becomes unstable when its total phase lag approaches 180° at the frequency where the loop gain is unity; with even modest controller gain, sufficient phase lag turns negative feedback into positive feedback and the loop oscillates.
Concretely: for an autoscaler with a minute-scale measurement-to-action delay attempting to track load that varies on similar timescales, phase lag approaches the stability boundary. Aggressive scaling policies tip the loop into oscillation; this manifests operationally as the autoscaler over-provisioning, then over-scaling-down, then over-provisioning again, never settling.
Slack’s post-mortem describes a variant of this pattern: an initial downscale on CPU, followed by an aggressive upscale on thread utilisation, against a network problem the loop could not see at all.
The Shannon-Nyquist sampling theorem provides the converse bound: a control loop sampling at interval T cannot observe, and therefore cannot react to, disturbances faster than 2T. A 15-second Prometheus scrape interval is structurally blind to load dynamics on timescales below 30 seconds. The information about those dynamics has been aliased into the lower-frequency band, where it appears as noise.
Marc Brooker has written about this directly in the context of AWS load balancing: a control loop with delay longer than the time constant of the thing it controls cannot stabilise that thing. It can only chase it.
The Slack autoscaler chasing a CPU metric that had decoupled from real load (because waiting threads do not burn CPU) was operating in exactly this regime.
The mitigation is not to make the metrics faster, though that helps. The mitigation is to design control loops that do not depend on global metrics: rate-limiting at the boundary, admission control based on local queue depth observable in O(1) from inside the affected process, fallback to last-known-good rather than reactive scaling.
Local observations made by the component itself have zero collection delay because they bypass the pipeline entirely. Global observations always carry the pipeline’s phase lag.
This is the structural argument for autonomic behaviour at the component level. Components that defend themselves locally (with circuit breakers, backpressure signals propagated synchronously to upstream callers, and load-shedding triggered by their own queue depth) do not depend on a delayed control loop to survive.
Components that wait for the autoscaler to rescue them are operating inside a feedback loop whose stability margin is, almost always, narrower than anyone has measured.
Aggregation destroys the signal
Every metric you look at on a production dashboard is an aggregate. A counter has been summed across hosts. A latency value has been percentiled across requests. A gauge has been averaged or maxed across a time window.
Aggregation is mathematically necessary; you cannot stare at every individual request. It is also, almost always, the thing that hides the failure.
The pathology is most visible in latency aggregation, and the canonical analysis is Gil Tene’s How NOT to Measure Latency (2015). The argument is worth deriving mechanically, because the conclusion is counter-intuitive and the mechanism is not.
Consider a load generator configured to issue requests at a rate of R per second; one request every 1/R seconds. The generator records, for each response, the time elapsed between send and receipt. Call this service_time. The reported P99 is the 99th percentile of the service_time distribution across some number of samples.
Now suppose the system stalls completely for T seconds, then recovers. During the stall, no responses arrive. The generator, in its naive form, has two possible behaviours.
In the coordinated form, the generator waits for each response before sending the next. During the T-second stall, exactly one request is in flight; its service_time is recorded as T. The other R·T requests that should have been sent during the stall are never sent. They do not appear in the histogram at all.
In the uncoordinated form, the generator sends on schedule regardless of whether responses arrive. During the stall, R·T requests pile up in the kernel’s socket buffer or in the generator’s own queue. When the system recovers, those requests are drained; each one’s service_time is measured from the moment it was sent, not the moment it was scheduled to be sent.
The bias is statistical. Let F_observed(t) be the empirical CDF of service_time the generator records, and F_true(t) the CDF of latency that a real user (arriving according to a Poisson process at rate R) would experience.
In the coordinated form, the missing samples should have been drawn from the slowest part of the latency distribution; their absence systematically truncates the right tail of F_observed. The quantile function Q_observed(0.99) = F_observed⁻¹(0.99) is, by construction, a lower bound on Q_true(0.99), with the gap widening as T grows.
The user who clicked at time t and got a response at time t + T + ε experienced a latency of T + ε. The histogram has either zero entries near T + ε (coordinated form) or entries clustered near ε (uncoordinated form). In neither case does the percentile reflect what the user felt.
The correction Tene proposes (HdrHistogram’s recordValueWithExpectedInterval) is to synthesise the missing samples: for each measured service_time exceeding the expected interval 1/R, insert additional samples at service_time − 1/R, service_time − 2/R, ..., down to 1/R.
These synthetic samples represent the users who would have arrived during the stall and would have experienced progressively shorter waits.
The implementation, in essence:
void recordValueWithExpectedInterval(long value, long expectedInterval) {
recordValue(value);
if (expectedInterval <= 0 || value <= expectedInterval) return;
long missingValue = value - expectedInterval;
while (missingValue >= expectedInterval) {
recordValue(missingValue);
missingValue -= expectedInterval;
}
}Six lines. One additional method call per recorded measurement. The result, when applied to real traces, is routinely an order-of-magnitude shift in the tail. A system reporting P99 = 200 ms under uncorrected measurement reports P99 = 2 to 4 seconds under coordinated-omission correction.
The dashboard was lying by a factor of ten, not by accident, but by construction of how percentiles are computed from a fixed-rate sampler against a non-stationary service-time distribution.
The same pathology appears, in different shapes, throughout the metric stack. Server-side latency histograms measure only requests the server got to process. Requests rejected at the load balancer, dropped at the TCP layer, or held in the kernel accept queue do not appear.
The server’s P99 can be excellent while the connection P99 (which the user actually experiences) is catastrophic. The Slack incident is exactly this: the web tier was reporting acceptable internal latencies for the requests it was handling, because the requests not being handled were not in the denominator. Survivorship bias, in observation form.
The same problem appears in averaging. A service that handles two classes of request, 99% of them fast and 1% of them slow, will show a mean latency dominated by the fast class. If the slow class gets ten times slower during an incident, the mean barely moves. The average is structurally insensitive to the tail, which is where outages live.
Percentiles are better than averages, but only if computed correctly, and only if the percentile you care about is in the data.
P99 across a million requests has ten thousand data points and is statistically reliable. P99 across a thousand requests has ten data points and is statistical noise. P99.9 across a million requests has a thousand data points, marginal. P99.99 across a million requests has a hundred data points, useless.
The deeper into the tail, the more samples needed to stabilise it, and the deeper into the tail is exactly where the interesting failures live.
The reason failures live in the tail is itself a queueing-theoretic result. For an M/M/1 queue (Poisson arrivals at rate λ, exponential service times at rate μ, single server) with utilisation ρ = λ/μ, the expected waiting time in the system is 1/(μ(1−ρ)). As ρ → 1, this diverges hyperbolically; the variance of waiting time grows even faster, as 1/(1−ρ)².
The practical consequence is that high-percentile latencies blow up far faster than utilisation increases: a service running near saturation has a P99 that is many multiples of its mean, and the ratio worsens sharply as you approach full utilisation.
Tails are not a measurement artefact; they are the physics of contention, and they are precisely what averages and low-resolution percentiles destroy.
The problem compounds when percentiles are themselves aggregated. The P99 of a service that is the union of ten hosts is not the average of those hosts’ P99s, nor their maximum, nor any function of them; it is a quantile of the underlying merged distribution, which is unrecoverable once each host has been independently percentiled.
Many monitoring systems do exactly this aggregation, producing a number labelled “P99” that is mathematically meaningless. Theo Schlossnagle and the Circonus team have written extensively on this; the correct primitive is to store the full histogram and percentile at query time, after merging across hosts.
Three production-grade histogram primitives dominate:
HDR Histogram (Tene): fixed-precision logarithmic bucketing across many orders of magnitude (typically nanoseconds to hours at three significant digits),
O(1)insert, mergeable across instances, widely used in the JVM ecosystem.t-digest (Dunning, 2013): centroid-based sketch with concentrated precision in the tails, useful when storage is constrained but tail accuracy matters;
O(log n)insert in the worst case, mergeable with a controlled error bound.DDSketch (Masson, Rim, Lee, 2019): relative-error guarantee
α, log-bucketed similarly to HDR but with provable tail accuracy, fully mergeable without error accumulation; used by Datadog as its native primitive.
All three solve the merge problem. None of them solve the storage problem: keeping a per-request histogram per dimension (customer, region, endpoint, version) at full cardinality costs roughly two orders of magnitude more than scalar metrics. Almost no organisation does it for everything.
Beyond latency, aggregation collapses cardinality. If a service is failing for one specific customer, on one specific endpoint, in one specific region, the aggregate error rate may show a 0.1% blip below any reasonable alerting threshold. The blip is the entire experience of that customer. Aggregation makes the rare invisible by averaging it against the common.
Charity Majors’s argument for high-cardinality observability, developed across her writing at Honeycomb and in Observability Engineering (O’Reilly, 2022, with Liz Fong-Jones and George Miranda), reduces to this: the questions that matter during an incident are almost always questions about specific slices of traffic, and any pre-aggregated metric has already destroyed the dimensions needed to slice on.
The pre-aggregation step is irreversible. Information theory enforces this; once you have computed the count of errors per minute, you cannot recover which customers produced those errors.
The cost of preserving cardinality is large; the cost of discarding it is invisible until the incident in which you need the dimension you discarded. The discipline is to know which dimensions are likely to matter, instrument those at full cardinality, accept that some incidents will be invisible until someone re-instruments after the fact.
You can only see what you instrumented
This is the third and most permanent of the three sources, and the hardest to mitigate, because it is a statement about the closure of the observable space.
Every metric, log line, and trace span in your system exists because some engineer, at some point, decided it would be useful. The decision was made under one model of how the system would fail. The decision was made before the migration that changed the failure modes. The decision was made by someone who has since left.
The observation space of a production system is, in effect, a fossil record of past concerns. It captures the questions someone thought to ask. The questions that nobody thought to ask are invisible, by definition, and they remain invisible until an incident forces someone to add the instrumentation in the middle of the firefight.
This is the operational meaning of Cindy Sridharan’s distinction (in Distributed Systems Observability, O’Reilly, 2018) between monitoring and observability. Monitoring is the practice of watching known failure modes by means of pre-defined metrics and alerts.
Observability is the property of a system that allows you to ask questions about its state that you did not anticipate having to ask.
The two are different in kind. Monitoring is comprehensive at handling familiar failures and useless at handling novel ones. Observability is the opposite. A mature production environment requires both, but the second is much harder to achieve, because it requires preserving structured, high-cardinality data about every interesting event, in a form that can be queried after the fact along dimensions no one specified in advance.
The DynamoDB DNS race condition from Part I is a clean example of the limit. The plan generation system had monitoring: are the Enactors running, are they applying plans, is Route 53 returning valid responses? All of these were green throughout the incident.
The question that would have caught the failure, is there a window during which one Enactor has applied an older plan after another Enactor has deleted it, was a question nobody had thought to ask, because the failure mode it describes had never happened.
There was no monitoring for it, because there was no model of it. Observability, in Sridharan’s sense, might have caught it: if the raw event stream of Enactor operations, with full cardinality on plan version and Enactor identity, had been queryable, an engineer during the incident could have constructed the query that revealed the interleaving. Whether anyone would have thought to construct that query in the first ninety minutes of the outage is a separate question.
Ben Sigelman, who designed Google’s Dapper tracing system and later co-founded LightStep, has argued that the practical limit of observability is set by the cost of the questions you are not yet asking. Storing every span, log, and structured event at full cardinality is theoretically ideal and economically impossible.
Every organisation makes a choice about which dimensions to preserve, and that choice is, in retrospect, always slightly wrong, because the next incident is the one whose relevant dimension was sampled out.
The discipline is not to eliminate this gap, which cannot be done, but to acknowledge it: to recognise that your dashboards are a model, that the model is incomplete, and that the moments when the model and the system disagree are the moments that matter most.
The three frames of reference
The DynamoDB cascade in Part I introduced, almost in passing, an idea worth making explicit. During the outage, the system existed in three simultaneously valid states, depending on whose vantage point you took: the data plane saw a healthy service, the control plane saw correctly-served DNS, the client saw nothing.
This is not a metaphor. It is structural, and it has a theoretical grounding. Lamport’s 1978 paper Time, Clocks, and the Ordering of Events in a Distributed System established that in a distributed system without a shared global clock, the only well-defined ordering between events is the happens-before relation →, defined transitively by causal message passing. Events that are not connected by → are concurrent, and concurrent events have no meaningful temporal ordering across observers; each observer may legitimately see them in a different sequence. Each observer constructs a partial order from the messages it has received, and the partial orders need not agree.
Any distributed system at scale has at least three such frames, and they almost never agree.
The data plane frame is the view of the components doing the actual work. Storage nodes know whether they are reachable on their primary network interfaces and whether their disks are responding. Compute hosts know whether their workers are processing requests. From this frame, the system is described in terms of internal state: queue depths, lock contention, GC pauses, file descriptor counts.
The control plane frame is the view of the systems that manage the data plane. Schedulers, load balancers, service discovery, deployment systems, autoscalers. The control plane sees the data plane through its own observations, typically lagged metrics and periodic health checks. From this frame, the system is described in terms of declarative goals and reconciliation: how many instances should be running, how many are running, what is the gap?
The client frame is the view of whatever is trying to use the system. This includes external customers but also internal services that depend on the one in question. The client sees the system only through its responses: latency, error rate, correctness. From this frame, the system is described in terms of the service contract being honoured or not.
In a healthy system, all three frames produce consistent descriptions. The data plane is processing requests, the control plane sees that the data plane is processing requests, and the client receives correct responses. This consistency is what allows operators to use any one frame as a proxy for the other two.
In an unhealthy system, the frames diverge, and the pattern of divergence is diagnostic. Data plane healthy, client failing: the failure is in the path between them, usually DNS, routing, or load balancing. Control plane healthy, data plane degraded: the control plane is observing a stale or filtered view of the data plane.
All three frames disagreeing: the system has entered a regime its designers did not anticipate, and the on-call engineer is going to have a long night.
Frame divergence, mechanically: GitHub, 21 October 2018
The 2018 GitHub MySQL split-brain is the textbook case of frame divergence, and it is worth reconstructing because the mechanism turns on which observers were in which partition.
At 22:52 UTC on 21 October, routine maintenance to replace failing 100G optical equipment severed connectivity between GitHub’s US East Coast network hub and its primary US East Coast data centre. The break lasted 43 seconds. Not enough for a human to react. More than enough for everything that follows.
GitHub ran MySQL in a topology managed by Orchestrator, an open-source replication-topology manager that, in GitHub’s configuration, uses Raft consensus among its own nodes to decide when to promote replicas. The primary was in US East.
Replicas existed in US West and in a public-cloud region. Orchestrator nodes were distributed across all three. Crucially, Orchestrator’s automated failover was configured to promote across regional boundaries.
The Raft protocol requires a strict majority for any decision: for a cluster of n nodes, ⌊n/2⌋ + 1 must agree. When the East Coast data centre dropped off the network, the Orchestrator nodes inside it were partitioned with it. The remaining nodes (US West plus US East public cloud) retained a quorum.
From their frame, the primary had failed; the only action under Raft’s liveness assumption was to elect a new leader and promote. Within seconds of the partition forming, the Orchestrator quorum began the leadership deselection process, opened a new Raft term, and promoted a US West MySQL replica to primary. Application traffic in the unaffected regions began flowing to it.
This is, in the language of CAP, the choice MySQL replication and Orchestrator had been configured to make: availability over consistency. When the network partitioned, the system preserved availability (writes continued to be accepted somewhere) at the cost of consistency (writes accepted in one partition were unknown to the other). Daniel Abadi’s PACELC refinement (2010) makes the framing sharper: if Partitioned, choose between Availability and Consistency; Else, choose between Latency and Consistency. GitHub’s topology had chosen PA/EL: availability under partition, latency in the steady state. The cost of that choice was paid, in full, during the 43 seconds the partition was active and the 40 minutes that followed.
The partition healed 43 seconds later. From the East Coast frame, nothing had happened: the local MySQL primary had continued serving writes the whole time, because applications in East continued routing to it. From the West Coast frame, it was now the primary, and writes were flowing in. Both databases had accepted writes for the duration of the partition, neither aware of the other’s writes.
Cross-region MySQL replication carries some lag in steady state, which is the window during which writes acknowledged to clients on East had not yet reached West and were therefore not present on the newly-promoted West primary.
The trap closed in the next forty minutes. Once connectivity restored, GitHub’s application tier saw the new West Coast primary and began directing writes to it. For nearly 40 minutes, the West Coast accepted writes that the East Coast primary did not see. Meanwhile, the East Coast primary contained the few seconds of writes from the partition window that had never been replicated to West.
When engineers locked deployment tooling and assessed state, they found two databases with divergent histories. Reconciling by failing back to East would discard the 40 minutes of West Coast writes. Failing forward on West would discard the East Coast partition-window writes. Neither was acceptable.
GitHub chose to fail forward, preserving the 40 minutes of West Coast writes at the cost of consistency: applications in the East Coast now had to make a cross-country round trip for every database call, adding cross-country latency to operations that had been designed to complete in local-region time.
The site was effectively degraded for 24 hours and 11 minutes while data was restored from backups, replication was rebuilt, and the orphaned East Coast partition-window writes were manually reconciled from binary logs. As the post-mortem records: one of the busiest clusters in the partition window contained 954 writes that had to be reconciled by hand.
From the East Coast control plane’s frame, the system had operated correctly: writes were accepted, replication was healthy locally, no monitoring fired. From the Orchestrator quorum’s frame, failover happened exactly as designed when the primary became unreachable.
From the data plane’s frame, two independent write histories existed during a window neither side observed in full. From the client frame, some writes had succeeded that, in a consistent universe, would have been rejected.
The 43 seconds of partition were the trigger. The frame divergence was the failure. The trigger was unavoidable; physical networks partition.
The deeper unavoidability was theoretical: the FLP impossibility result (Fischer, Lynch, Paterson, 1985) proves that no asynchronous deterministic consensus protocol can guarantee both safety and liveness in the presence of even a single crash failure.
Every real consensus system, including Raft, breaks this tie pragmatically, in Raft’s case, by leaning on timing assumptions to maintain liveness. Those timing assumptions are exactly what fail during a network partition; the algorithm has no choice but to make progress on the basis of local observations that, during the partition, were sufficient for a local quorum decision but insufficient to determine the global state.
GitHub’s subsequent re-architecture eliminated cross-region automatic failover precisely because no observation surface available in real time was sufficient to detect the divergence as it was happening.
They moved the consistency/availability tradeoff from automatic to human-in-the-loop, accepting longer mean-time-to-recovery in exchange for not making this specific decision wrong again.
The Heisenbug problem
There is a class of bug in distributed systems that disappears when you try to observe it. The folklore name is Heisenbug, after the uncertainty principle; the technical name is observation-dependent failure.
These bugs exist because adding observation to a system changes the system. Logging a value takes time, which changes the timing of the surrounding code, which changes the order in which concurrent operations interleave, which changes whether the race condition fires.
Capturing a stack trace acquires locks, which changes contention, which changes which thread reaches the critical section first. The act of looking at the system alters what the system does.
This is, again, not a metaphor. Modern observability tooling routinely takes 1-5% of a service’s CPU and a measurable fraction of its memory. eBPF-based profilers, distributed tracing, log aggregation, all of them consume resources that come from the same pool the application uses to do its work. In the steady state, the cost is acceptable.
In the regime where the system is already saturated, adding observation can be the perturbation that pushes the system from a marginal state into a failure state.
The mitigation is not to remove observation, which would leave the system unobservable. The mitigation is to design observation to be low-overhead and load-shedding: when the system is saturated, observation is the first thing to drop, not the last.
Sampling rather than full capture, head-based sampling rather than tail-based, structured events written to local buffers rather than synchronous network calls.
The modern technical answer to this is eBPF: in-kernel observation programs verified for safety at load time and executed in response to kernel events. Because the aggregation happens in kernel space, written into perf event arrays or BPF ring buffers shared with user-space readers via mmap, the observation path bypasses the syscall boundary entirely.
The cost of recording an event collapses to a few cache-line writes, with no context switch and no allocator pressure on the application path.
The Linux kernel’s eBPF verifier statically proves termination and memory safety at program load time, which means an observation program cannot crash the kernel even if it has a bug; bpftrace, BCC, and Cilium’s Hubble all build on this substrate.
The implication for the Heisenbug problem is that, for many workloads, eBPF-based observation has overhead small enough that observing the system no longer meaningfully alters the system’s behaviour.
Each of these decisions trades observability for the ability to keep the system running while it is being observed. The trade is acceptable only if you have decided it deliberately.
In most systems, it has been decided by accident, by whatever the default configuration of the tracing library is, set by whichever engineer integrated it.
But the deepest version of the observation-disturbing-the-system problem is not about overhead. It is about circular dependency between the observation surface and the thing being observed.
When the monitoring system depends on the system it is monitoring, a failure in the monitored system blinds the monitoring system, and the operator loses access to the diagnostic data at precisely the moment they need it most.
There is no better illustration of this than the 73 hours that began at 13:37 PDT on 28 October 2021.
The Roblox outage, mechanically
Roblox at the time ran more than 18,000 servers and 170,000 containers across its own data centres, orchestrated using the HashiCorp stack: Nomad for scheduling, Vault for secrets, and Consul for service discovery, health checks, session locking, and as a KV store.
Consul was the central nervous system. Every service depended on it to find its peers.
A single Consul cluster supported the entire backend: 5 voter nodes plus 5 non-voter read replicas. This was, as the post-mortem would later note, a single point of failure of a kind that violated every textbook lesson about blast-radius isolation.
In the months leading up to October, Roblox had upgraded from Consul 1.9 to 1.10 to take advantage of a new streaming feature designed to reduce CPU and network bandwidth on large clusters. The feature had been incrementally enabled across services without incident.
On 27 October at 14:00, the day before the outage, it was enabled on the traffic routing service, and the number of routing nodes was increased by 50% in anticipation of end-of-year traffic.
At 13:37 PDT on 28 October, Vault performance began to degrade and a single Consul server began exhibiting high CPU load. Engineers began to investigate; users were not yet impacted. The first signal was unusual write latency on Consul’s underlying KV store: the 50th percentile, normally under 300 ms, had climbed to 2 seconds.
The cluster was failing for two reasons that interacted, neither of which engineers identified for days.
The first was the streaming feature itself. HashiCorp would later explain that streaming, while overall more efficient than long polling, used fewer concurrency control elements (Go channels) in its implementation.
Under very high read and very high write load, the design exacerbated contention on a single Go channel, blocking writes and consuming CPU in kernel spin locks along the streaming subscription code path. This pathology had not appeared in HashiCorp’s pre-release benchmarks because it required the specific combination of large stream count and high churn rate that Roblox’s workload produced.
The second was buried inside Consul’s persistence layer. Consul uses BoltDB, an embedded Go key-value store inspired by LMDB’s memory-mapped design, to persist its Raft write-ahead log.
BoltDB’s design is a single memory-mapped file organised as a copy-on-write B+tree: every write transaction allocates new pages, never modifies existing ones, and commits by atomically swapping a single root pointer. This gives crash-safety, at the cost of page churn.
When pages become unreachable, BoltDB does not release them to the OS. Instead it tracks them in a freelist of free page IDs, which is rewritten in its entirety on every transaction commit. At normal scale, freelist maintenance is negligible. At Roblox’s scale, after months of accumulated Raft log writes, the freelist had grown pathologically.
The post-mortem provides the actual numbers, taken from a Consul server during the incident: the 4.2 GB Raft log store contained only 489 MB of actual data. The remaining 3.8 GB was empty space, tracked as free pages.
The freelist tracking those pages had grown to 7.8 MB, containing nearly a million free page IDs. For every Raft log append, with all the batching Consul applies, a write of 16 KB or less was triggering a rewrite of the entire 7.8 MB freelist to disk.
This is the pathology. Each transaction commit performed: a search of the million-entry freelist for free pages; an update to that freelist; serialisation of the entire 7.8 MB freelist to disk; and an fsync(2) whose cost was dominated by the size of the dirty page set, which was now dominated by the freelist itself.
The work scaled linearly with freelist length, and the freelist length grew with every snapshot the system performed to keep itself trim.
Raft, sitting on top of BoltDB, has timing assumptions. The leader replicates log entries to followers and must commit them durably.
When BoltDB commit latency entered the multi-second range, leaders could not durably persist log entries fast enough; followers timed out, started elections, and a new leader was chosen, which inherited the same BoltDB file, performed the same expensive freelist operations, became slow, lost its leadership in turn, and triggered another election.
The cluster was alive but unable to make progress: a Raft cluster trapped in a leader-flap loop, with each leader’s presence too brief to commit meaningful work.
This was an internal failure at the level of database page management, observed at the level of cluster leadership stability. The two layers were not connected in anyone’s mental model of the system. They were connected through a freelist data structure most engineers did not know existed.
The team’s first hypotheses, in order, were the ones the operator’s model suggested. They suspected degraded hardware and replaced a Consul node. Performance continued to suffer.
They suspected capacity, and replaced all the Consul nodes with new machines: 128 cores (up from 64) on faster NVME SSDs. As the post-mortem would later document, this made things worse: the new servers were dual-socket NUMA architectures, and the additional cores meant additional concurrent goroutines contending on the same Go channel in the streaming code path.
Cross-socket memory access added latency to operations that had been local on the old 64-core single-socket machines.
By 16:35 PDT on the 28th, concurrent users had dropped to 50% of normal. Subsequent attempts, resetting the cluster from a snapshot, blocking incoming traffic with iptables to bring it back under controlled conditions, reducing health-check frequency from 60 seconds to 10 minutes to give the cluster breathing room, all stabilised the system briefly and then returned it to the same 2-second KV write latency.
None of these interventions worked because none of them addressed the actual mechanism. The post-mortem is explicit: engineers did not identify the BoltDB freelist issue during the incident. HashiCorp engineers determined the root cause in the days after the outage ended.
This is where the observer problem becomes operationally devastating. Roblox’s monitoring infrastructure depended on Consul. When Consul was unhealthy, the dashboards that would have shown engineers what was happening inside Consul were themselves unable to report. The post-mortem describes this directly:
There was a circular dependency between our telemetry systems and Consul, which meant that when Consul was unhealthy, we lacked the telemetry data that would have made it easier for us to figure out what was wrong.
The diagnostic question the operator needed to ask, what is the actual state of the BoltDB file inside the affected Consul instances, required telemetry that the affected systems were supposed to provide. They could not.
Engineers were debugging a system whose internal state was now unobservable, against a failure mode buried two software layers deep in an open-source dependency that the affected engineers had not personally written.
The breakthrough came at 15:51 PDT on 30 October, roughly 50 hours after the outage began, when engineers disabled the streaming feature across all Consul systems. KV write latency immediately returned to 300 ms.
The Heisenbug-disguised streaming contention had been suppressed; the underlying BoltDB freelist problem was still there, manifesting as a “slow leader” symptom in which certain leaders inherited the worst of the freelist state.
The team pragmatically worked around it by preventing those leaders from staying elected, and continued the long process of repopulating caches and restarting services.
Total downtime: 73 hours, from 13:37 PDT on 28 October to 16:45 PDT on 31 October, when 100% of players were given access.
The trigger was the interaction between Consul streaming and BoltDB’s freelist; both bugs, both fixable, both ultimately fixed (the BoltDB freelist issue was resolved by migration to bbolt, the etcd-io fork of BoltDB, which uses a hashmap-based freelist).
The outage duration was a property of the observation surface. With working telemetry into Consul’s internal state, the freelist issue would have been visible in hours, not days. Without it, engineers were solving a Heisenbug with their eyes closed.
The lesson Roblox drew, encoded explicitly in their post-mortem and in the architectural changes that followed, was that observation surfaces must be independent of the systems they observe.
Telemetry must run on infrastructure that does not depend on the thing being measured. If the monitoring stack and the production stack share a common substrate, a failure in that substrate blinds both at once, and the operator is left to debug a fully dark system.
This is, in the language Part I used, an invariant at the boundary between the observation system and the production system: the observability stack must function independently of any system whose state it reports.
The invariant is rarely enforced. It is rarely even written down. It is one of the assumptions the operator does not know they have made until the day Consul stops responding and the dashboards go dark with it.
The operator’s simulation
Every operator, when debugging a system, is in fact debugging a model of the system that lives in their own head.
The model was built from architecture documents, code reading, prior incidents, and conversation with colleagues. The model is a simplification, by necessity, because the system is too complex to fit in one head.
The model is also wrong, in specific ways, and the operator does not know which specific ways until the model and the system diverge.
This determines whether an incident is resolved in twenty minutes or six hours. During an outage, the operator performs inference: symptoms are observed, hypotheses are generated from the model, tests of those hypotheses are designed and executed. The hypothesis space the operator can explore is bounded by the model the operator has.
If the failure mode lives outside the model, the operator cannot generate the hypothesis that would lead to it. They will iterate, increasingly desperately, on hypotheses inside the model, none of which fit the symptoms, until either someone with a better model joins the call or the system recovers on its own.
The Cloudflare incident from Part I is partly an instance of this. The first ninety minutes were spent on the hypothesis that the failure was an external DDoS attack, because the oscillation between good and bad states matched the signature of intermittent external pressure better than it matched the signature of anything Cloudflare’s engineers had a model for.
The model said “oscillation means external adversary.“ The reality was that oscillation meant gradual rollout against a five-minute regeneration cycle, but that failure mode was not in anyone’s model until the post-mortem.
Richard Cook’s How Complex Systems Fail (1998) names this directly: every operator’s view of the system is a practitioner-constructed simulation, assembled from training, experience, and the artefacts of past incidents.
The simulation diverges from reality continuously and silently. The role of the practitioner during an incident is to detect the divergence, update the simulation, and act on the updated version, all in real time, under pressure, with incomplete information.
The systems that recover quickly from incidents are not the systems with the best dashboards. They are the systems whose operators have the best simulations, and whose dashboards expose enough raw data that the simulation can be corrected in flight.
The discipline of seeing what is there
The compressed form of this essay, the operational counterpart to Part I’s diagnostic question, is also a question, asked of every observation surface in the system: what is this metric not telling me, and what would I look at if it were lying?
Every dashboard has a set of failure modes for which it is the right view, and a set of failure modes for which it is misleading. Both sets are large. The first is documented (it is the reason the dashboard was built). The second is almost never documented, because the failure modes it contains are, by definition, the ones nobody anticipated.
The discipline is to know, for every observation, what it is showing and what it is hiding.
The metric is an aggregate over what dimensions? Over what time window? With what sampling?
Does the observation surface share a substrate with the system it observes?
What would a failure that is invisible on this metric look like, and what other view would catch it?
When the metric and the user experience disagree, which one should be believed?
The answer to the last question is always: the user experience. The metric is a model. The user is in the data plane. The data plane is the system.
The discipline of observation-aware engineering is not to build dashboards that show everything; that is impossible, and the attempt produces dashboards that show nothing useful.
It is to know, for every dashboard, what it cannot show, to keep the raw event stream queryable for the moments when the dashboard and the system disagree, and to keep the observation stack architecturally independent of the systems it watches.
This is what separates the systems where the operator spends the first hour of an incident narrowing the hypothesis from the systems where the operator spends the first hour arguing about whether the dashboards are correct.
Part III will move from observation to action: what it takes to operate a system whose state you cannot fully see, whose feedback loops fight you, and whose composition you do not control. The technical name for this is control theory under uncertainty. The operational name is on-call.
One more thing…
The reason these failures keep happening is that engineers are trained to reason about the logical structure of their systems, not the physical dynamics of how those systems execute under load.
The same gap exists, in concentrated form, in GPU programming. CUDA correctness is necessary but not sufficient. Performance lives entirely in how memory traffic, warp scheduling, instruction issue, and inter-block synchronisation interact under realistic workloads: the same composition-and-observation problem, compressed into a single die.
I wrote a deep guide on CUDA from this perspective: not isolated tricks, but how to reason about the GPU as a coupled dynamical system whose performance regimes are as discontinuous as any distributed system’s.



