The Wafer & the Wallet

Memory costs are inflecting up as enterprise token budgets slam shut. The wafer math, the pass-through proof, and the engineering playbook that defends the spread.

Lorenzo Bradanini and Lorenzo Tettamanti

Jul 05, 2026

Intro

For three years, the cost of making a token fell every quarter while the budget to buy tokens grew without a meter. In H1 2026 both curves reversed at once: memory pass-through is raising the cost floor of every token served, while enterprises institutionalize token budgets at the ceiling.

This is the anatomy of inference’s first margin squeeze: the wafer math, the proof of pass-through, modeled cost floors marked to real market prices, and the engineering playbook that defends the spread.

Two curves reversed in the same half-year

Every industry eventually meets the quarter when its two defining curves cross. For LLM inference, that quarter just happened.

Since late 2022, the economics of this business rested on two assumptions so reliable nobody wrote them down: the cost of producing a token falls every quarter (process nodes, kernel engineering, quantization, and a relentless price war), and the budget for buying tokens grows without a meter, the era of “tokenmaxxing,” when Meta and Salesforce were publicly urging employees to consume as many tokens as possible. In the first half of 2026, both assumptions died within months of each other.

Jaw one: the cost floor turned upward. That is the core of this issue: the deepest memory supercycle in the forty-year history of DRAM, transmitted measurably into GPU rental rates with a 4-6 week lag (Fig. 7). For the first time since the ChatGPT moment, the marginal cost of serving a token is inflecting up, not down, driven by the single most supply-constrained commodity in technology.

Jaw two: the price ceiling got institutionalized. SemiAnalysis published field data on July 2 from conversations with 50+ enterprises: Uber burned through its annual Claude Code and Codex budget in four months and responded with a $1,500/month per-employee cap; an aerospace manufacturer’s $250/month caps were exhausted by power users in four days; companies are switching off premium model tiers and downgrading defaults; enterprise budgets now range from $250 to tens of thousands per employee per month, but they are budgets, reviewed by finance.

Read the finding precisely, because it is not a demand-collapse story: SemiAnalysis concludes there is no material risk to 2H26 AI budgets and API spend keeps compounding. The jaw is not falling demand; it is the arrival of price sensitivity.

Unbounded willingness-to-pay is over; every serving provider now negotiates against a CFO’s cap instead of an enthusiast’s curiosity.

The quarter the memory market broke

Start with the raw numbers, because they have no precedent in the forty-year history of the DRAM industry. TrendForce’s Q1 2026 contract-price revisions came in at +90-95% quarter-over-quarter for conventional DRAM, revised upward from an initial forecast of +55-60% that analysts had already called unprecedented.

PC DRAM was expected to more than double in a single quarter; server DDR5 rose roughly 90%; NAND contract prices climbed 33-38% in the same window. Counterpoint tracked spot DRAM up 80-90% inside the quarter.

For the full year, TrendForce projects DRAM up more than 70% on top of the Q1 step-change; Bank of America has DRAM industry revenue +51% YoY and NAND +45%.

The supply side tells you why this is not a blip. SK Hynix announced in October 2025 that its entire 2026 production capacity was already sold out. Kioxia has said the same of its 2026 NAND output.

Micron shut down its Crucial consumer business outright (a two-decade brand, liquidated) to route every wafer toward hyperscaler and GPU-grade contracts, and its executives describe being able to cover at most two-thirds of medium-term demand for some customers.

Micron’s Boise fabs ramp in 2027-2028; the New York megafab in 2030. Jensen Huang, asked at CES 2026 whether gamers should resent AI for GPU prices, answered in effect that the world needs more memory factories.

As previous issues argued: when the largest memory buyer on Earth says the fix is on the supply side, the message downstream is that you will not negotiate your way to allocation relief.

Three structural breaks from every previous cycle

1. The demand driver is durable, not episodic. Prior shortages came from fab fires, earthquakes, supply discipline, or one-off demand pulses (crypto, pandemic PCs). This one comes from inference workloads compounding in production. IDC’s assessment calls it a “permanent reallocation” of the world’s wafer capacity, not a cyclical shortage, and IDC does not deploy the word “permanent” casually.

2. The margin gradient is a one-way valve, and the wafer math is brutal. HBM carries gross margins 3-5× commodity DRAM, so every rational fab reallocates cleanroom toward it.

But because of TSV formation, die thinning, stacking, and compounded assembly yield, one gigabyte of HBM consumes ≈4× the wafer capacity of one gigabyte of standard DRAM; GDDR7 consumes ≈1.7×. AI’s effective wafer claim reaches ~20% of world DRAM capacity in 2026 against total bit-supply growth of only 10-16% per year.

The arithmetic cannot close without starving someone, and “someone” is every non-AI buyer plus every AI buyer without a long-term agreement.

3. The price convergence matters more than the price level. HBM3e historically sold at 4-5× server DDR5 per bit. TrendForce expects that gap to compress to 1-2× by end-2026, not because HBM got cheaper, but because DDR5 inflated toward it. Sit with that.

The entire tiered-memory architecture of modern serving (KV offload to host DRAM, paged hierarchies, DDR5-backed prefix caches, NVMe cold tiers) was designed in a world where the tier below HBM was 4-5× cheaper per bit.

When the discount for descending a tier collapses from ~80% to ~30-50%, a large fraction of the “obvious” tiering optimizations of 2024-2025 stop being obvious. Section 5 does that math properly.
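The two ratios in this list are easy to turn into arithmetic. The sketch below uses the wafer multipliers quoted above (HBM ≈4×, GDDR7 ≈1.7×) to convert a share of bits into a share of wafer capacity, and converts a per-bit price ratio into the discount earned by descending a tier. The bit mix is a hypothetical input chosen only to show the amplification, not sourced data, and the function names are mine.

```python
# Wafer capacity consumed per gigabyte, relative to standard DRAM (multipliers quoted in the text).
WAFER_MULTIPLIER = {"standard": 1.0, "gddr7": 1.7, "hbm": 4.0}


def wafer_share(bit_share: dict) -> dict:
    """Convert each memory type's share of bits shipped into its share of wafer capacity."""
    weighted = {kind: share * WAFER_MULTIPLIER[kind] for kind, share in bit_share.items()}
    total = sum(weighted.values())
    return {kind: value / total for kind, value in weighted.items()}


def tier_discount(price_ratio: float) -> float:
    """Per-bit saving from descending to a tier that is price_ratio times cheaper."""
    return 1.0 - 1.0 / price_ratio


# Hypothetical bit mix, chosen only to illustrate the amplification; not sourced data.
bit_mix = {"standard": 0.92, "gddr7": 0.02, "hbm": 0.06}
shares = wafer_share(bit_mix)
print({kind: round(value, 3) for kind, value in shares.items()})

# Discount for dropping from HBM to DDR5 at the historical 5x ratio and at a compressed 2x ratio.
print(tier_discount(5.0), tier_discount(2.0))
```

Under that illustrative mix, a single-digit share of bits claims roughly a fifth of the wafers, and a 5× price gap shrinking to 2× takes the tiering discount from 80% to 50%.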

One supply-side note that matters for H2: HBM4 becomes the mainstream HBM in the second half of 2026, with a doubled 2,048-bit interface, ~2 TB/s per stack, and a manufacturing-complexity premium above 30% over HBM3e. NVIDIA’s Rubin ships with up to 288 GB of HBM4 per GPU, and every one of those stacks is wafer capacity that used to be laptops.

The KV cache ate the memory supply

The lazy narrative is “AI ate the memory market.” The precise narrative is that the KV cache ate the memory market, and the industry’s shift from training-dominated to inference-dominated compute is what lit the fuse.

Training demand is large but concentrated and schedulable: a fixed pool of HBM for weights, activations, and optimizer states, planned quarters ahead. Inference demand is elastic and per-user: every concurrent conversation, agent trajectory, and 200K-token codebase instantiates its own slab of state that must live somewhere in the hierarchy for the life of the request, and increasingly beyond it, as prefix caches persist context between turns.

Industry estimates via TrendForce put cloud high-speed memory consumption near 3 exabytes in 2026, with core inference platforms (Gemini-, Bedrock-, ChatGPT-class serving) accounting for ~750 PB of live memory demand before redundancy roughly doubles it.

TrendForce separately projects that by 2029 inference (not training) becomes the primary driver of AI server demand outright, and North American CSPs are already bulk-buying high-capacity DDR5 RDIMMs specifically for inference fleets, the demand class that used to be the safety valve absorbing DRAM oversupply.

The per-token physics, derived from first principles

Everything in this issue hangs on one number: bytes of KV state per token. Take the workhorse case, a Llama-3-class 70B dense model with grouped-query attention: 80 layers, 8 KV heads, head dimension 128.

A single 128K sequence on the GQA-8 model pins ~41 GB at FP16: half an H100’s entire HBM for one user’s context, before a single weight is stored. This is why effective batch size at long context is a memory-capacity calculation, not a compute calculation, and why every marginal concurrent user at long context is, economically, a memory purchase.

Multiply by the fleet: 10,000 concurrent long-context sequences at FP8 KV pins ~200 TB of state across HBM, DDR5, and NVMe, state that produces no tokens itself; it exists purely so decode can stream it past the compute units at terabytes per second.
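The per-token figure can be checked from the configuration given above. A minimal sketch, assuming "128K" means 128,000 tokens and decimal gigabytes (both are my reading, and either convention moves the result by a few percent):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV state one token pins: one K and one V vector per KV head, per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


# Configuration from the text: 80 layers, 8 KV heads (GQA), head dimension 128.
kv_fp16 = kv_bytes_per_token(80, 8, 128, dtype_bytes=2)
kv_fp8 = kv_bytes_per_token(80, 8, 128, dtype_bytes=1)

CONTEXT = 128_000  # assumption: "128K" read as 128,000 tokens
SEQUENCES = 10_000  # fleet size used in the text

per_sequence_fp16_gb = kv_fp16 * CONTEXT / 1e9
fleet_fp8_tb = kv_fp8 * CONTEXT * SEQUENCES / 1e12

print(kv_fp16)                         # 327680 bytes, i.e. 320 KiB per token
print(round(per_sequence_fp16_gb, 1))  # 41.9
print(round(fleet_fp8_tb))             # 210
```

Both results land within a few percent of the rounded figures quoted above; the structure (linear in layers, KV heads, head dimension, dtype width, and context) is the part that carries the argument.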

The 2026 memory market is what happens when the industry’s aggregate KV state, growing super-linearly with agent adoption and context inflation, collides with wafer supply growing 10-16% a year.

Two second-order effects complete the picture

NAND got dragged in. NVMe became the cold tier for prefix caches and the staging layer for weights (including GPU-direct storage paths). Kioxia now says nearly half its future NAND demand could come from AI. Meanwhile Samsung and SK Hynix cut NAND wafer output in 2024-2025 while chasing HBM margins. Omdia has Samsung’s NAND wafers falling 4.9M→4.68M and SK Hynix’s 1.9M→1.7M. There is no cheap tier left to hide in.

The host-DRAM tax on every GPU node inflated. A serious 2026 inference node pairs its GPUs with 1-2 TB of DDR5 for KV offload, CPU-side batching, and cache tiers. At Q1 2026 server-DRAM pricing, the host memory on a single node can cost what a mid-range GPU did two years ago. No 2024-vintage TCO model has this line item at anywhere near its current size.

Roofline: why prefill and decode were never the same workload

Before the silicon story makes sense, you need the roofline argument stated with numbers, not vibes. A kernel’s attainable throughput is bounded by min(peak_FLOPs, intensity × bandwidth), where arithmetic intensity is FLOPs performed per byte moved from memory.

The machine has a ridge point, the intensity at which it stops being bandwidth-bound and becomes compute-bound.
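The roofline bound and its ridge point fit in a few lines. This is a sketch under stated assumptions: the peak and bandwidth values are approximate H100 SXM datasheet figures that I am supplying for illustration, and the two intensities are stand-ins for batch-1 decode (about one FLOP per byte of 16-bit weights) and a long prefill.

```python
def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound: min(peak compute, arithmetic intensity x memory bandwidth)."""
    return min(peak_flops, intensity * bandwidth)


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Intensity (FLOPs per byte) at which a kernel stops being bandwidth-bound."""
    return peak_flops / bandwidth


# Assumed, approximate datasheet values for an H100 SXM: dense BF16 peak and HBM3 bandwidth.
PEAK_FLOPS = 989e12   # FLOP/s
BANDWIDTH = 3.35e12   # bytes/s

ridge = ridge_point(PEAK_FLOPS, BANDWIDTH)

# Illustrative intensities: batch-1 decode reuses each weight byte about once,
# while a long prefill reuses it across thousands of prompt tokens.
decode_bound = attainable_flops(1.0, PEAK_FLOPS, BANDWIDTH)
prefill_bound = attainable_flops(1000.0, PEAK_FLOPS, BANDWIDTH)

print(round(ridge))                # about 295 FLOPs per byte
print(decode_bound / PEAK_FLOPS)   # fraction of peak reachable at intensity 1
print(prefill_bound / PEAK_FLOPS)  # 1.0: compute-bound
```

At an intensity of 1 the device can reach well under one percent of its peak FLOPs, which is the whole case for treating decode as a bandwidth problem.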

The corollary that most cost models miss: decode throughput is not a FLOPs number, it is a bandwidth budget divided by bytes-per-step. Per decode step the device must stream the weights once plus every live sequence’s KV:

Read those three lines again. Same GPU, same model, same bandwidth: 16× throughput difference between short-context and long-context traffic, and a clean 2× recovered at 128K purely by halving KV bytes. Context length is not a feature flag; it is a cost multiplier that acts through memory.

Every KV byte you remove converts directly into either more concurrent sequences or faster steps, both of which are revenue.
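A minimal sketch of that bandwidth budget, with the batch sized to fill whatever memory is left after weights. The device figures (141 GB, 4.8 TB/s, H200-class), the 70 GB of FP8 weights, and the 8K versus 128K contexts are my illustrative assumptions; the fractional batch is a fleet-average idealization rather than a schedulable number.

```python
def decode_tokens_per_s(bandwidth, weight_bytes, batch, context, kv_bytes_per_token):
    """Bandwidth-bound ceiling: each step streams the weights once plus every live
    sequence's KV, and emits one token per live sequence."""
    bytes_per_step = weight_bytes + batch * context * kv_bytes_per_token
    return batch * bandwidth / bytes_per_step


def memory_bound_batch(free_bytes, context, kv_bytes_per_token):
    """Sequences that fit in memory left after weights (fractional: a fleet average)."""
    return free_bytes / (context * kv_bytes_per_token)


# Illustrative assumptions, not the article's exact inputs.
BANDWIDTH = 4.8e12           # bytes/s, H200-class HBM
HBM_BYTES = 141e9            # device memory
WEIGHT_BYTES = 70e9          # 70B parameters at FP8
FREE_BYTES = HBM_BYTES - WEIGHT_BYTES
KV_FP16 = 2 * 80 * 8 * 128 * 2   # bytes per token, GQA-8 70B config from the text
KV_FP8 = KV_FP16 // 2


def ceiling(context, kv_bytes):
    batch = memory_bound_batch(FREE_BYTES, context, kv_bytes)
    return decode_tokens_per_s(BANDWIDTH, WEIGHT_BYTES, batch, context, kv_bytes)


short_fp16 = ceiling(8_000, KV_FP16)
long_fp16 = ceiling(128_000, KV_FP16)
long_fp8 = ceiling(128_000, KV_FP8)

print(round(short_fp16 / long_fp16, 3))  # 16.0
print(round(long_fp8 / long_fp16, 3))    # 2.0
```

When memory is the binding constraint, bytes per step are pinned at device capacity, so throughput scales as one over KV bytes per sequence: 16× the context costs 16× the throughput, and halving the KV dtype buys back exactly 2×.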

Rubin CPX is a memory trade wearing a GPU costume

Now read NVIDIA’s 2026 inference roadmap through the memory market and it snaps into focus in a way most launch coverage missed. If prefill is compute-bound and decode is bandwidth-bound (§03), then serving both phases on identical HBM-stuffed GPUs means that during prefill you are paying for the most supply-constrained, margin-rich commodity on the planet (HBM bandwidth) and not using it.

In 2024 that was an inefficiency. In 2026, with HBM at 4× wafer-equivalent cost and fully allocated into next year, it is an unforced error measured in real money.

Rubin CPX is the correction. Announced at the AI Infra Summit in September 2025, detailed through CES and GTC 2026, generally available late this year: a monolithic prefill-specialized die pairing 30 PFLOPS of sparse NVFP4 compute (20 PFLOPS dense, per SemiAnalysis) with 128 GB of GDDR7 at only ~2 TB/s of bandwidth (32 Gbps GDDR7 on a 512-bit bus).

That bandwidth would embarrass a decode GPU; it’s below an H100’s. On a prefill part it is the entire point: SemiAnalysis’s launch framing was exactly right, a chip deliberately skinny on bandwidth and fat on compute, because that is what prefill consumes. Its intensity budget sits far to the right of Fig. 5’s ridge, where GDDR7 is not a compromise but a correct sizing.
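The bandwidth figure and the implied ridge point follow directly from the quoted specs. A quick check, using only the 32 Gbps pin rate, 512-bit bus, and 20 PFLOPS dense figure cited above (the function name is mine):

```python
def bus_bandwidth_bytes_per_s(pin_rate_gbps: float, bus_width_bits: int) -> float:
    """Aggregate memory bandwidth: per-pin data rate times bus width, bits converted to bytes."""
    return pin_rate_gbps * 1e9 * bus_width_bits / 8


# Specs quoted in the text for Rubin CPX.
cpx_bandwidth = bus_bandwidth_bytes_per_s(32, 512)
CPX_DENSE_FLOPS = 20e15  # dense NVFP4, per SemiAnalysis as cited above

# Ridge point: FLOPs the chip can execute per byte it can fetch.
cpx_ridge = CPX_DENSE_FLOPS / cpx_bandwidth

print(cpx_bandwidth / 1e12)  # 2.048 TB/s
print(round(cpx_ridge))      # 9766 FLOPs per byte
```

A ridge near ten thousand FLOPs per byte only makes sense for a workload that reuses every fetched byte thousands of times, which is a description of long-prompt prefill and of nothing in decode.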

Overlay the wafer math and the arbitrage is explicit: GDDR7 costs 1.7× standard DRAM wafer-equivalent; HBM costs 4×, plus CoWoS packaging, interposer, and thermal budget. By building the prefill tier out of GDDR7, NVIDIA routes the fastest-growing slice of inference demand (long-context prompt processing) around the most constrained commodity in its own supply chain.

Every prefill FLOP served from GDDR7 is HBM allocation freed for decode, where it earns its keep.

NVIDIA is pitching this with a number that deserves both scrutiny and attention ($5B of token revenue per $100M of infrastructure, a 30-50× platform ROI) and naming Cursor, Runway, and Magic as early partners, which tells you the target token distribution: repository-scale coding context and long-form generative media, i.e., prefill-dominated workloads. Discount the multiplier as marketing; the direction is load-bearing.

The strategic read for this audience: hardware disaggregation of prefill and decode is memory-market arbitrage instantiated in silicon, and it will not stay proprietary to NVIDIA. SRAM-based decode tiers are being positioned into the same disaggregated fabrics, AMD will be forced to answer, and every serious serving stack (Dynamo, vLLM’s disagg mode, SGLang, Mooncake-style architectures) has converged on KV-cache-transfer-over-fabric as the central abstraction of 2026 serving.

The clearest tell that this is permanent: per Tom’s Hardware’s platform teardown, the Vera Rubin rack’s BlueField-4 DPU ships with an integrated SSD specifically to store KV cache: cached context now has its own dedicated hardware tier in the reference design.

If your mental model of an “inference GPU” is one SKU doing both phases, you are one hardware generation behind the economics, and Q1’s memory prices just made that lag expensive.

The repriced hierarchy, and what happens to cost per token

The uncomfortable synthesis: the price ratios between tiers compressed at the same moment the absolute levels rose, and tiering strategies earn their complexity from the ratio, not the level. A prefix cache offloading to DDR5 at 20-25% of HBM’s per-bit cost is an easy win.

The same cache at 50-70% of HBM’s per-bit cost must clear a much higher bar once you charge it for transfer bandwidth, PCIe hops, TTFT-on-miss, and the payroll that maintains it.

The offload break-even, derived

System prompts, shared codebase contexts, RAG corpus headers: cache. One-shot user uploads and cold agent trajectories: recompute. Cache residency is now a metered commodity; treat admission like an underwriting decision.
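One way to frame that underwriting decision is a break-even hit rate: keep an entry resident only if the recompute it avoids per hour exceeds what the tier charges to hold it. The sketch below is my illustrative framing, and every price in it is a hypothetical placeholder rather than a sourced figure.

```python
def breakeven_hits_per_hour(prefix_tokens, kv_bytes_per_token, tier_usd_per_gb_hour,
                            recompute_usd_per_token, reload_usd_per_gb):
    """Hits per hour above which caching a prefix beats recomputing it on every request."""
    size_gb = prefix_tokens * kv_bytes_per_token / 1e9
    holding_cost_per_hour = size_gb * tier_usd_per_gb_hour
    # Each hit avoids a prefill but pays to move the cached KV back to the GPU.
    saving_per_hit = prefix_tokens * recompute_usd_per_token - size_gb * reload_usd_per_gb
    if saving_per_hit <= 0:
        return float("inf")  # reloading costs more than recomputing: never cache
    return holding_cost_per_hour / saving_per_hit


# Hypothetical placeholder inputs for a 32,000-token shared prefix at FP16 KV.
KV_FP16 = 2 * 80 * 8 * 128 * 2
before = breakeven_hits_per_hour(32_000, KV_FP16, tier_usd_per_gb_hour=0.002,
                                 recompute_usd_per_token=1e-7, reload_usd_per_gb=1e-5)
after = breakeven_hits_per_hour(32_000, KV_FP16, tier_usd_per_gb_hour=0.004,
                                recompute_usd_per_token=1e-7, reload_usd_per_gb=1e-5)

print(round(before, 2), round(after, 2))
```

The point of the shape, not the numbers: the required hit rate scales linearly with the tier's price, so a DRAM tier that doubles in cost evicts every entry whose traffic sat between the old and new thresholds.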

One nuance cuts the other way: NAND inflated less than DRAM, so the relative case for demoting warm entries to NVMe actually improved even as both tiers rose.

Three channels into cost per token

Channel 1: CapEx per node. HBM is now the single largest line item in a flagship accelerator’s package BOM (SemiAnalysis’s finding for the GB300, after rising as a share every generation since Hopper), and adding host DDR5 and NVMe puts memory at roughly half the bill of a modern inference node (author’s estimate on top of that sourced base).

HBM moves slowly through long-term contracts; host DRAM and SSD are bought near spot, which is where 2026 budgets are bleeding now. A node spec’d mid-2025 with 1.5 TB DDR5 + 30 TB NVMe carries tens of thousands of dollars of new memory cost at current prices. Amortized over four years at high utilization it’s single-digit percent per token, but it compounds with Channel 2.

Channel 2: the capacity ceiling (the big one). When memory binds, its price doesn’t just raise cost; it caps revenue per node. Max concurrency = free-memory-after-weights ÷ KV-bytes-per-sequence; decode throughput scales with batch until bandwidth saturates. As traffic mix shifts long-context (it is: agents, codebases, multimodal), each node serves fewer users and each token carries more fixed cost.

Cost per token rises without any component getting more expensive: pure mix shift. The shortage means you can’t fix it by bolting on DRAM at 2024 prices; the escape routes are compression and disaggregation. Both are engineering, not procurement.

Channel 3: the tier-ratio compression of Fig. 3, which quietly deprecates a generation of offloading tricks and re-rates model-architecture choices (§02, Fig. 4).
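Channel 2's concurrency formula is worth running against a traffic mix, because the mix shift is where the damage shows. A sketch assuming a hypothetical 8-GPU node with 141 GB per device, FP16 weights for a 70B model, and two made-up traffic mixes; none of those inputs are sourced.

```python
def max_concurrency(total_mem_bytes, weight_bytes, context, kv_bytes_per_token) -> int:
    """Max concurrency = free memory after weights / KV bytes per sequence."""
    free = total_mem_bytes - weight_bytes
    return int(free // (context * kv_bytes_per_token))


def blended_concurrency(total_mem_bytes, weight_bytes, kv_bytes_per_token, mix) -> float:
    """Sequences a node holds when a fraction mix[context] of live sequences run at that context."""
    mean_kv_per_sequence = sum(frac * ctx * kv_bytes_per_token for ctx, frac in mix.items())
    return (total_mem_bytes - weight_bytes) / mean_kv_per_sequence


# Hypothetical node: 8 x 141 GB devices, 70B weights at FP16, GQA-8 KV at FP16.
NODE_MEM = 8 * 141e9
WEIGHTS = 140e9
KV_FP16 = 2 * 80 * 8 * 128 * 2

# Hypothetical traffic mixes: mostly short context, then an even split with long context.
short_heavy = blended_concurrency(NODE_MEM, WEIGHTS, KV_FP16, {8_000: 0.9, 128_000: 0.1})
long_heavy = blended_concurrency(NODE_MEM, WEIGHTS, KV_FP16, {8_000: 0.5, 128_000: 0.5})

print(max_concurrency(NODE_MEM, WEIGHTS, 8_000, KV_FP16))    # all-short ceiling
print(max_concurrency(NODE_MEM, WEIGHTS, 128_000, KV_FP16))  # all-long ceiling
print(round(short_heavy), round(long_heavy), round(short_heavy / long_heavy, 1))
```

No component price appears anywhere in that calculation, yet the same node's fixed cost is spread over several times fewer users once long-context traffic reaches half the mix.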

Winners, losers, and the direction of the tilt

Hyperscalers and frontier labs are insulated. Multi-year HBM/DDR5 LTAs signed pre-squeeze, first-priority allocation (Micron: “larger, strategic customers”), fleets big enough that architecture fixes amortize across trillions of tokens.

Databricks’ disclosure that cost-aware autoscaling and “model unit” abstractions cut GPU spend over 80% versus static provisioning (on a platform serving 120 trillion tokens a month) shows where leverage lives at that scale: utilization engineering, because their input prices are contractually smoothed.

Mid-tier providers and self-hosters absorb the shock. They buy near spot, hold no allocation priority, and their pitch (undercutting frontier APIs) sits directly on the inflating commodity. Worse, their buyers are the newly capped: under a token budget, an enterprise squeezes more work from the same spend rather than paying up, so the mid-tier faces rising input costs and a customer base structurally optimizing against price at the same time.

That is the textbook geometry of a margin squeeze, and it lands here first. Watch for: per-token price floors firming through H2 2026 in open-weights serving; long-context surcharges turning explicit; and prompt-caching pricing rebalancing: cached-token discounts exist because storage was cheap relative to recompute, and that ratio just moved. Cache-storage line items (already visible in frontier pricing) will spread.

The self-hosting math shifted in a direction almost nobody has updated for. The 2025 pitch (“your $15-25K refurbished H100 workstation replaces $500/month of API spend”) has two 2026 problems. The workstation’s own BOM inflated: RAM alone on a new server build can now approach the cost of an entire refurbished platform. And the API side is partially shielded by hyperscaler contract insulation plus superior compression engineering.

The perverse result: the memory supercycle is a centralizing force. It taxes small-fleet inference more heavily than hyperscale inference, at precisely the moment open-weights quality made decentralized serving viable.

If you’re modeling a GPU buy this year, redo it with 2026 memory line items, and note that used enterprise hardware with pre-crisis RAM already installed is currently the cleanest arbitrage in the market, which is exactly why the refurb channel is having a moment.

Proof of pass-through: the March repricing, and the token bill in real dollars

A thesis this size needs a smoking gun: evidence that memory contract prices actually reach the price you pay per GPU-hour, on a measurable lag. Q1 2026 provided it.

Silicon Data’s SDB200RT index (the standardized benchmark for B200 cloud rental pricing) opened 2026 at 4.40, drifted through January and February, then rose 23.6% inside March alone, crossing 5.0 on March 15 and 6.0 on March 23 before settling at 5.48, up 24.4% year-to-date.

Their attribution is the entire argument of this issue in one sentence: Samsung and SK Hynix raised HBM3e contract prices ~20% for 2026 deliveries, NVIDIA revised hardware MSRPs upward in late February citing memory component costs, and those input costs flowed into cloud hourly rates with a 4-6 week lag, landing squarely in March.

Meanwhile the two-year-old H100, whose memory was bought at pre-squeeze prices, traded in a tight, boring band all quarter. Same market, same month: the GPUs carrying new memory repriced; the GPUs carrying old memory didn’t. That divergence is the supercycle reaching your invoice.

The token bill, priced in real mid-2026 dollars

With real rental rates in hand (H200 working median ~$3.50/GPU-hr on-demand across neocloud trackers; cohort median ~$4.00; floor $2.30; hyperscalers to $13.78), the bandwidth model from §03 becomes an actual price sheet.

These are bandwidth-bound floors: ideal kernels, full overlap, output tokens only. Real deployments land at 50-70% of these throughputs; the ratios between cells are the robust result.
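Converting a rental rate and a throughput ceiling into a cost floor is a single division. In the sketch below the rental rates are the ones quoted above; the 1,000 tokens/s ceiling is a hypothetical stand-in for whatever the bandwidth model returns for a given cell, and the efficiency factors apply the 50-70% realization range.

```python
def usd_per_million_tokens(gpu_hour_usd: float, tokens_per_s: float,
                           efficiency: float = 1.0) -> float:
    """Cost floor per million output tokens for one GPU at a given realized throughput."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    tokens_per_hour = tokens_per_s * efficiency * 3600
    return gpu_hour_usd / tokens_per_hour * 1e6


# Rental rates quoted in the text (USD per GPU-hour, H200).
RATES = {"floor": 2.30, "working median": 3.50, "cohort median": 4.00, "hyperscaler high": 13.78}

# Hypothetical bandwidth-bound ceiling for one cell of the model, for illustration only.
CEILING_TOKENS_PER_S = 1000.0

for label, rate in RATES.items():
    ideal = usd_per_million_tokens(rate, CEILING_TOKENS_PER_S)
    realistic = usd_per_million_tokens(rate, CEILING_TOKENS_PER_S, efficiency=0.5)
    print(f"{label}: ideal ${ideal:.2f}/Mtok, at 50% efficiency ${realistic:.2f}/Mtok")
```

Because efficiency divides throughput, landing at 50% of the ideal doubles the floor; the ratios between cells survive that correction, the absolute values do not.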

One honest caveat, stated before a reader states it for me: this model charges every decode step for streaming the full weights and KV and ignores prefill cost, which at 128K is substantial and shifts spend toward exactly the CPX-shaped silicon.

A production number also carries utilization (Databricks’ 80% savings figure is the size of that lever), SLO headroom, and interconnect overhead in multi-GPU serving. The model’s job is not to predict your invoice to the cent; it is to rank your decisions, and the ranking is unambiguous: context policy first, KV dtype second, hardware price third.

The spread, marked to market. July 2, 2026

Now close the loop the title promises. As of July 2, flat per-token pricing for Llama-3.3-70B-class serving spans $0.31-$0.90 per million output tokens across the fifteen providers Artificial Analysis tracks (DeepInfra’s FP8 “Turbo” at $0.40, Groq at $0.79, Together AI and Fireworks at $0.88-0.90), and the rate is flat across the model’s 131K context window.

Set those prices against this section’s floors and the cross-subsidy stops being an inference and becomes a table:

Note the confirming detail hiding in the market data: the price leader doesn’t serve the reference model at all; it serves an FP8-quantized variant. The cheapest provider in the market is this issue’s playbook, priced and shipped.

And the honest caveats, before a reader raises them: these floors are single-GPU bandwidth ideals (real serving is less efficient, which raises them), while real fleets also earn input-token revenue and batch mixed contexts (which lowers effective floors), so treat the magnitudes as directional.

The sign pattern is the robust result: at July 2026 flat prices, long-context traffic on 70B-class serving is sold below its modeled production floor, and budget-capped buyers remove the option of raising the flat rate to fix it. Hence the final item of §07’s playbook.
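The same comparison can be run in reverse using only the quoted numbers: given a flat price and a rental rate, how many output tokens per second must one GPU sustain just to cover its own rent? A sketch, assuming the ~$3.50/GPU-hr working median from earlier and ignoring input-token revenue, which makes the bar conservative.

```python
def breakeven_tokens_per_s(price_usd_per_mtok: float, gpu_hour_usd: float) -> float:
    """Output tokens per second one GPU must sustain for token revenue to equal its rental cost."""
    return gpu_hour_usd / 3600 / price_usd_per_mtok * 1e6


GPU_HOUR_USD = 3.50  # H200 working median quoted in the text

# Flat output-token prices quoted in the text (USD per million tokens).
PRICES = {"market low": 0.31, "DeepInfra FP8 Turbo": 0.40, "Groq": 0.79, "market high": 0.90}

for label, price in PRICES.items():
    needed = breakeven_tokens_per_s(price, GPU_HOUR_USD)
    print(f"{label} (${price:.2f}/Mtok): {needed:,.0f} tokens/s per GPU to break even")
```

Set the required throughput next to what a bandwidth-bound device can deliver at a given context length: wherever the requirement exceeds the ceiling, that traffic is being sold below its floor.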

Eight decisions, ranked by dollar leverage

Everything above, compiled into action and ranked by expected cost impact per unit of engineering effort for a team serving open-weights models at scale in 2026. Under the scissors of §00, read “cost impact” as what it now is, spread defense: every dollar of floor you remove is margin the ceiling can no longer take from you.

Five falsifiable predictions

Scored publicly in Q1 2027, alongside the Issue 04 speculative-decoding calls. Confidence is stated so being wrong costs me something.

P1 · Supply · confidence: high
DRAM relief does not arrive before late 2027. New capacity (Micron Boise) ramps 2027-2028; suppliers themselves guide consumer relief to ~2028. Any 2027 hardware-refresh model priced at 2024 memory levels is fiction.

P2 · Architecture · confidence: high
Open-weights flagships converge on compressed KV. By mid-2027 the majority of new frontier-adjacent open releases ship MLA-like latent attention, ≤8-head GQA, or hybrid sliding windows, marketed explicitly as serving-cost features. Model architecture is now downstream of the DRAM spot price.

P3 · Pricing · confidence: medium-high
Per-token API pricing bifurcates by context residency: at least two major providers introduce or restructure explicit cached-context storage pricing (per-token-per-hour or equivalent) by mid-2027, decoupling state from compute. Partial confirmation is already on the books (Google Vertex bills per-hour cache storage and Anthropic bills TTL-tiered cache writes), so the live prediction is that this becomes the norm in open-weights serving, not just frontier practice.

P4 · Silicon · confidence: medium
Prefill-specialized silicon becomes a category, not a SKU: a CPX competitor (AMD or an ASIC player) is announced within 12 months of CPX GA, and “prefill accelerator” enters the standard rack taxonomy.

P5 · Market · confidence: medium
The open-weights serving price war pauses: median $/Mtok for 70B-class serving on third-party clouds goes flat-to-up through H1 2027, the first sustained non-decline in that series since it has existed. The March B200 repricing (Fig. 7) is the leading edge, and enterprise token budgets are the demand-side mechanism that lets long-context surcharges and cache-storage fees actually stick, because a buyer operating under a cap optimizes usage instead of switching providers.

The dashboard: six leading indicators, with trigger levels

The spread is the product now

For two years, inference economics was a story about FLOPs: kernel efficiency, quantized matmuls, speculative decoding, Blackwell’s FP4 throughput.

Those battles were real; this publication has spent six issues inside them. But they were fought against a background assumption so universal nobody stated it: bytes are cheap and getting cheaper.

That assumption died in Q1 2026. The binding constraint of the inference industry is now measured in wafer starts and TSV yields: in exabytes of KV state colliding with 10-16% annual bit-supply growth, in an IDC report that uses the word “permanent,” in a rental index that repriced 24% in one month because a memory contract reset five weeks earlier.

The consequences run one direction through the entire stack: hardware specializes by phase because HBM is too precious to waste on prefill; model architectures compress their KV because fat caches became unserveable; serving stacks reorganize around shipping cached state across fabrics because holding it still got expensive; and the cost advantage tilts toward whoever holds allocation contracts and compression engineering, which is to say, toward the largest players, in an industry that spent two years congratulating itself on decentralizing.

The engineers who internalize this fastest hold a simple advantage: while the market reprices memory, they reprice their need for it. Every KV byte you decline to allocate is bought at 2026 prices and sold at 2026 prices, at 100% margin, with zero lead time and no allocation meeting. The wafer sets the cost floor. The wallet sets the price ceiling.

The engineering in between is the only variable you control, and in a margin regime, that engineering stops being a cost-optimization side quest and becomes the P&L itself. Squeezes do not reward the biggest fleet; they reward the widest spread per byte.

METHOD & VERIFICATION. Market figures trace to named sources below; all KV, roofline, concurrency, and $/Mtok arithmetic is the author’s, derived in-text from published model configs and device datasheets so you can check every step. Cost floors are bandwidth-bound ideals: absolute values are lower bounds, ratios are the robust result.

The Fig. 7 path between published waypoints is interpolated; Fig. 3’s intermediate points are directional; Fig. 0 is a directional synthesis whose turning points, not slopes, are the sourced claims. Pricing data was spot-checked the week of publication and moves weekly: recheck before committing capital. Corrections run at the top of the next issue, as always.

SOURCES: TrendForce: 1Q26 DRAM/NAND contract revisions; “Memory Wall” research (Jan 2026); HBM adoption & 2029 inference-demand outlook; AI wafer-equivalent consumption via Commercial Times (Dec 2025) · IDC: “Global Memory Shortage Crisis” (Feb 2026) · IEEE Spectrum: “AI Is a Memory Hog” (Apr 2026) · Counterpoint Research: Q1’26 spot DRAM tracking · CNBC: Micron/SK Hynix/Samsung allocation & CES 2026 reporting (Jan 2026) · Avnet/Omdia/BofA: supercycle & supply-relief estimates · SemiAnalysis: “Another Giant Leap: The Rubin CPX Specialized Accelerator & Rack” (Sep 2025) · The Register: CES 2026 Vera Rubin systems coverage (Jan 2026) · Tom’s Hardware: “Nvidia’s Vera Rubin platform in depth” (Nov 2025, incl. BlueField-4 KV-cache SSD) · Futurum/Ori/NADDOD: Rubin CPX technical analyses · NVIDIA: AI Infra Summit & GTC 2026 announcements · Databricks Engineering: “Reliable LLM Inference at Scale” (May 2026) · SemiAnalysis: “TokenBudgeting: Our Conversations with Enterprises on Token Spend” (Jul 2, 2026) · Artificial Analysis: Llama 3.3 70B provider benchmarking & cache-pricing mechanics (Jul 2026) · aipricing.guru / Price Per Token: provider price pages (sourced Jul 2, 2026) · Silicon Data: SDB200RT B200 index, March 2026 update · aimultiple GPU Rental Price Index; getdeploying B200/H200 trackers; Jarvislabs H200 pricing (2026) · DeepSeek-V3 technical report: MLA configuration · vLLM: NVFP4 KV-cache PR (in progress at publication).

The Software Frontier
