How Meta turned the Linux Kernel into a planet-scale Load Balancer. Part III
A deep architectural narrative on XDP, eBPF, stateless routing, and why hyperscale traffic outgrew proxies.
Introduction
Engineers who have spent many years in the trenches of distributed systems develop an instinctive radar for complexity.
It’s a radar that basically never sleeps. It picks up on the faintest signs that a system is about to balloon beyond comprehension: the silent growth of memory tables, the small code change that cascades into unpredictable failure, the queues that inflate exponentially under the first unexpected burst.
At small or medium scale, these things are manageable. You add retries, per-flow connection tracking, dynamic allocation: you feel just a bit more clever. But scale up to planetary levels, and these “clever” additions become the system’s Achilles’ heel.
Every bit of mutable state, every dynamic structure, every per-flow table is a ticking time bomb. And this is precisely where Katran finds its elegance.
Katran strips away all unnecessary complexity. It reduces load balancing to pure, deterministic computation, executed at the first possible moment a packet enters the kernel.
There are no per-flow tables. There are no loops. There are no dynamic allocations. There is no memory of the past. Every packet is an independent, self-contained entity, evaluated as a function: headers in, backend out, no storage, no cleanup, no drama.
It is, in a sense, load balancing as mathematics, executed at line rate.
Encountering the XDP Program
When you first lay eyes on Katran’s XDP program, disbelief is natural.
You instinctively start looking for the infrastructure you expect from a “normal” load balancer: socket managers, memory pools, per-flow hash tables, connection cleanup threads.
Where are the timers that expire idle flows?
Where is the heap allocation for buffering partial packets?
Where is the messy user-space coordination?
The answer, surprisingly, is: none of that exists. Every omission is intentional, dictated by the eBPF verifier’s strict execution model and the program’s architectural goal: stateless, line-rate forwarding.
The eBPF verifier acts as a pedantic static analyzer for the kernel. At load time, it simulates every possible execution path through the program, tracking all pointers, offsets, and integers.
It ensures that no memory access exceeds the bounds of the packet buffer (ctx->data to ctx->data_end), no pointer arithmetic overflows, and no uninitialized value is ever dereferenced.
If a program could potentially write past data_end or access memory outside a BPF map safely, the verifier refuses to load it.
This is why XDP code is highly defensive: explicit casts, pointer comparisons, early exits, and carefully sequenced checks are everywhere. In short, you can’t crash the kernel because the verifier won’t allow it.
Katran leverages this discipline elegantly. Every packet is validated before it touches the forwarding logic. Parsing begins at the Ethernet layer:
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end)
return XDP_DROP;
if (eth->h_proto != htons(ETH_P_IP))
return XDP_PASS;
Here, (eth + 1) calculates the memory address immediately after the Ethernet header. Comparing this against data_end ensures that the packet buffer fully contains the Ethernet header.
If not, the packet is dropped (XDP_DROP). Non-IPv4 traffic is passed upstream (XDP_PASS) because Katran only routes IPv4 packets.
Notice that no memory allocation occurs, and no state is stored; the check is purely functional, evaluating the packet as it exists in memory.
Next, the program parses the IP header:
struct iphdr *ip = data + sizeof(*eth);
if ((void *)(ip + 1) > data_end)
return XDP_DROP;
if (ip->protocol != IPPROTO_TCP)
return XDP_PASS;
Pointer arithmetic ensures that ip points exactly at the start of the IP header. The verifier confirms that (ip + 1) does not exceed data_end. Any malformed or truncated packet is dropped immediately.
Non-TCP packets are passed upstream. Again, no dynamic structures are created, no per-flow tables are touched, no queues are used. Every packet is an independent entity, evaluated purely by its content.
The final layer of parsing handles TCP headers:
struct tcphdr *tcp = (void *)ip + sizeof(*ip);
if ((void *)(tcp + 1) > data_end)
return XDP_DROP;
The program performs the same bounds validation at the TCP layer.
The cast (void *)ip is required because pointer arithmetic on struct iphdr * could otherwise introduce type-dependent scaling that violates verifier expectations.
Every field access is verified at load time, ensuring memory safety.
By the time the packet reaches this stage, Katran has guaranteed that Ethernet, IP, and TCP headers are complete, aligned, and accessible: all without allocating a single byte of memory.
Deterministic Hashing
With headers validated, Katran computes a deterministic 5-tuple hash:
__u64 hash = hash_5tuple(ip->saddr, ip->daddr,
tcp->source, tcp->dest,
ip->protocol);
This hash forms the cornerstone of the system. It alone determines the backend to which the packet will be forwarded.
The kernel holds no history of past flows; there is no connection table. There are no retries, no queues, no cleanup tasks. Every packet is a self-contained function:
Packet headers → 5-tuple hash → ring map lookup → backend map → packet rewrite → XDP_TX
eBPF Map Lookups
The program then queries two BPF maps: ring_map and backend_map. The ring_map implements a Maglev-style consistent hash ring.
Conceptually, it’s a fixed-size array where each slot points to a backend index, precomputed in the control plane. The backend_map holds the IP and port for each backend.
Both maps are read-only from the kernel’s perspective during packet processing; the control plane updates them asynchronously.
__u32 *backend_idx = bpf_map_lookup_elem(&ring_map, &hash);
if (!backend_idx)
return XDP_DROP;
struct backend *b = bpf_map_lookup_elem(&backend_map, backend_idx);
if (!b)
return XDP_DROP;
ip->daddr = b->ip;
tcp->dest = b->port;
recalc_ip_checksum(ip);
recalc_tcp_checksum(ip, tcp);
return XDP_TX;
Notice the simplicity and determinism. The kernel reads the slot atomically, rewrites the packet headers, recalculates checksums, and transmits.
No per-flow state, no locks, no dynamic weighting — all O(1), purely functional. The XDP_TX return code transmits directly back to the NIC TX ring, bypassing sockets entirely.
The packet never enters the kernel networking stack, avoiding context switches, scheduling delays, or buffer copies.
Line-Rate Safety and Verifier Compliance
At first glance, Katran’s code looks almost painfully minimalist. But every defensive check, every explicit cast, and every early return exists because the eBPF verifier demands total memory safety.
The verifier effectively simulates the NIC’s RX ring in software, reasoning about each possible execution path. If any access could exceed data_end or touch uninitialized memory, the program would fail to load.
This ensures that Katran’s XDP program is guaranteed to execute safely at line rate, independent of traffic patterns, packet sizes, or maliciously malformed headers.
By combining careful pointer arithmetic, deterministic 5-tuple hashing, and atomic BPF map reads, Katran transforms each packet into a stateless, functional computation.
No connections are remembered. No resources are dynamically allocated. No loops or locks slow the datapath.
This is minimalism enforced by both design and verifier; the kernel’s version of “if it isn’t necessary, it doesn’t exist”.
Maglev-Style Hash Ring Construction
Katran’s stateless elegance rests on one deceptively simple concept: a precomputed Maglev-style consistent hash ring.
At first glance, it’s just an array, a circle of RING_SIZE slots. But in reality, it is the linchpin that allows packets to be routed deterministically, without storing per-flow state, and without any runtime iteration over backends.
Each backend server occupies multiple slots on this ring, proportionally to its assigned weight. Packets are mapped to slots via the 5-tuple hash computed in the kernel.
Once the hash is computed, the kernel does nothing more than perform a BPF map lookup: the heavy lifting has already been done by the control plane.
The XDP program never recomputes the ring, never tracks the flow history, and never allocates memory: it simply consults the map, rewrites headers, and transmits.
Virtual Nodes and Weighted Distribution
The magic lies in virtual nodes. Each backend is assigned a number of virtual nodes proportional to its weight.
A backend that is twice as capable as another receives twice as many virtual nodes, which in turn means it will receive roughly twice the traffic. The control plane distributes these virtual nodes pseudo-randomly but deterministically across the ring.
Deterministic here means that given the same backend configuration and weight, every calculation produces the same ring layout: a property essential for consistent flow mapping and minimal disruption during updates.
When a backend fails, only the slots associated with its virtual nodes are removed. The remaining slots for other backends are untouched.
This minimal disruption ensures that the vast majority of traffic continues to flow to existing backends without rerouting, avoiding cache thrashing, queue spikes, or packet reordering at the network level.
It’s a conceptually simple idea, but its effects on reliability and predictability at planetary scale are profound.
Control Plane Construction
Here is a distilled view of how the control plane constructs the ring:
#define RING_SIZE 65537
#define VIRTUAL_NODES_PER_BACKEND 100
struct backend {
__u32 ip;
__u16 port;
__u32 weight;
};
__u32 ring[RING_SIZE]; // 0 = empty, 1..N = backend indices
for (int b = 0; b < num_backends; b++) {
int virtual_nodes = backends[b].weight * VIRTUAL_NODES_PER_BACKEND;
for (int v = 0; v < virtual_nodes; v++) {
__u64 hash = hash_fn(backends[b].ip, backends[b].port, v);
__u32 offset = hash % RING_SIZE;
__u32 skip = (hash % (RING_SIZE - 1)) + 1;
while (ring[offset] != 0) {
offset = (offset + skip) % RING_SIZE;
}
ring[offset] = b + 1; // backend index + 1 because 0 = empty
}
}
At first glance, it looks almost naïve: a nested loop, some arithmetic, and a while loop to resolve collisions. But every line has a precise purpose:
hash_fncombines the backend’s IP, port, and virtual node index to produce a deterministic pseudo-random value.offsetdetermines the initial slot in the ring for this virtual node.skipensures that collisions are resolved deterministically: if a slot is already occupied, we advance byskipuntil an empty slot is found.The final assignment writes the backend index plus one; because zero is reserved to indicate “empty.”
The result is a fully populated, deterministic hash ring. The control plane now has a complete map of which slot belongs to which backend, and crucially, this computation happens entirely outside the kernel, asynchronously from packet forwarding.
Pushing the Ring into the Kernel
Once the ring is computed, the control plane atomically populates a BPF map in the kernel:
for (int i = 0; i < RING_SIZE; i++) {
__u32 key = i;
__u32 val = ring[i];
bpf_map_update_elem(&ring_map, &key, &val, BPF_ANY);
}
From this moment, the XDP program doesn’t touch the ring logic. Each incoming packet:
Computes its 5-tuple hash.
Calculates the slot index via
hash % RING_SIZE.Looks up the slot in the
ring_mapto find the backend index.Looks up the backend IP and port in the
backend_map.Rewrites headers and transmits.
The ring is immutable from the kernel’s perspective during packet processing. Updates, weight adjustments, or backend removals are handled entirely by the control plane, and applied atomically.
This allows Katran to handle failures and reconfigurations without ever stalling packet forwarding.
Why This Matters
The beauty of this approach is that the XDP datapath is entirely stateless and deterministic, yet fully aware of backend weights.
There are no loops in the kernel, no per-flow tracking, no locks or contention. The cost of adding more backends or adjusting weights is absorbed by the control plane; the kernel simply reads the updated maps.
This separation allows line-rate forwarding at planetary scale, where a single ingress could be handling millions of flows per second.
The Maglev ring is the perfect example of precomputation as a scaling strategy: do the complex work once in user-space, push a static, deterministic representation to the kernel, and let the stateless datapath execute at hardware speed.
Every packet becomes a pure function evaluation: input headers → hash → slot → backend → rewritten packet → transmit.
No memory allocations, no state, no concurrency hazards; just physics applied to packets.
Lookup and Deterministic Forwarding
Once the packet has passed through Ethernet, IP, and TCP validation, the XDP program performs the most crucial step: deterministic forwarding using the precomputed Maglev-style hash ring.
In the kernel, this process is astonishingly simple: and yet it underpins the system’s ability to operate at line rate with planetary-scale fan-in.
__u32 slot = hash % RING_SIZE;
__u32 *backend_idx = bpf_map_lookup_elem(&ring_map, &slot);
struct backend *b = bpf_map_lookup_elem(&backend_map, backend_idx);
Here, a single modulo operation maps the packet’s 5-tuple hash to a slot.
A single map lookup retrieves the backend index, and a second map lookup fetches the backend’s IP and port. That is all.
No loops, no runtime weight calculations, no dynamic per-flow state. O(1) complexity, deterministic results, purely functional evaluation.
Because all kernel accesses are atomic and read-only during packet processing, this architecture scales linearly with CPU cores and NIC queues.
There are no locks, no contention, and no thread coordination. Each core can process its RX queue independently, and the control plane can update ring_map or backend_map asynchronously.
New flows automatically observe updated backends, while old flows continue to use the last-known-good mapping: all without disrupting the kernel datapath.
Failure as a First-Class Design Input
Statelessness is not just an optimization; it is a failure-handling strategy baked into the architecture. Because the kernel holds no flow tables or connection metadata, failure is trivial to contain.
When a backend server fails, the control plane simply removes its virtual nodes from the hash ring. From the kernel’s perspective, packets that would have landed on that backend now hash to other slots.
There are no stale connection entries to purge, no retries to manage, no cache warm-ups required.
When a Katran node itself fails, upstream Anycast automatically redirects traffic. Each surviving node continues forwarding packets using the last-known-good maps, oblivious to the failure.
There is no need for global coordination, distributed state reconciliation, or multi-step failover. Failure domains remain local, predictable, and instantly recoverable.
This is architectural serenity: the system is designed to survive failures because the kernel never held state that could be lost. Statelessness transforms complexity into predictability.
Visualizing Katran’s Packet Flow
Understanding 5-tuple hashing, virtual node assignment, and BPF map lookups is abstractly simple, but seeing the full flow clarifies the elegance:
[Client Packet]
|
v
[NIC RX Queue / RSS]
|
v
[XDP Parsing & Validation]
|
v
[5-Tuple Hash Computation]
|
v
[Ring Map Lookup -> Slot]
|
v
[Backend Map Lookup -> Backend IP/Port]
|
v
[Packet Rewrite & XDP_TX]
|
v
[Backend Server / Direct Server Return]
Control Plane (User Space)
|
|-- Computes virtual nodes per backend (weight-based)
|-- Calculates offsets & skips for Maglev ring
|-- Updates ring_map in kernel atomically
|-- Updates backend_map with IPs & ports
Figure 1 illustrates this flow: the kernel operates entirely statelessly and deterministically, while the control plane orchestrates updates asynchronously.
The XDP program is immutable during packet processing, and the system naturally adapts to changes without disruption.
Direct Server Return (DSR)
Katran very often employs the so-called Direct Server Return (DSR).
Instead of proxying replies through the load balancer, it rewrites only the destination IP and port, leaving the source IP intact. Backends reply directly to clients, bypassing the balancer entirely.
This approach eliminates double traversal of the datapath, removes additional memory and CPU overhead, and allows the system to handle planetary-scale fan-in.
When combined with Receive-Side Scaling (RSS) across NIC queues, each CPU core can process independent flows in parallel, achieving near-linear throughput across thousands of simultaneous connections: all without storing a single flow or connection state in the kernel.
Physics vs. Semantics
It is tempting to ask: “Could Katran replace proxies?” The shortest answer I could give you right now is: hell no.
Proxies operate in the semantic domain, handling TLS termination, header inspection, authentication, authorization, retries, circuit breaking, and protocol translation. They reason about requests, responses, cookies, tokens, and application-layer meaning.
Katran operates entirely in the physics domain. Its responsibility is deterministic packet routing, load distribution, and weight-based selection.
By isolating the concerns, Meta ensures that each layer can scale according to its own constraints: the kernel moves packets at line rate, and proxies reason about requests at application speed.
Collapse these layers, and the system suffers both CPU saturation and state management nightmares.
Katran redraws the boundary between raw throughput and semantic complexity, allowing both layers to excel.
Computation at the Data
Katran is also part of a larger architectural movement: perform computation where the data resides. Modern examples abound:
eBPF firewalls replacing iptables, operating in-kernel without context switches.
SmartNICs offloading encryption and filtering.
Programmable switches performing aggregation in-network.
io_uring bypassing syscalls for low-latency I/O.
RDMA eliminating kernel involvement for high-throughput transfers.
The principle is simple but profound: reduce copies, avoid context switches, and specialize each layer for its constraints.
Katran embodies this philosophy at the kernel’s ingress point, turning every packet into a deterministic computation before it ever reaches user-space.
Quiet Minimalism at Hyperscale
Ultimately, Katran’s brilliance is disciplined minimalism. It does not parse HTTP. It does not retry requests. It does not track connections or manage caches.
It simply forwards packets deterministically, guided by the Maglev-style hash ring materialized in kernel maps.
Combined with DSR, RSS, and atomic map updates, Katran achieves planetary-scale throughput. Proxies are freed from the burden of raw packet fan-in and can focus entirely on semantics.
Katran does not render proxies obsolete; it makes them viable at hyperscale. In doing so, it delivers one of the most profound lessons in distributed systems: at extreme scale, less complexity per operation is not a feature: it is survival.
The kernel, stripped to pure functionality, becomes a deterministic engine for traffic physics, leaving higher layers to reason about meaning.



