Scenario Planning
for Generative AI
Six currents. One habit. Your strategy.
The Question
The AI industry has committed over $600 billion in capital expenditure for 2026 alone. That money is already flowing into data centers, GPU clusters, and training runs. It will produce outcomes.
The question isn’t whether AI will change — it’s which forces to plan against.
A word on the title. Classic scenario planning — the discipline this booklet borrows its method from — builds a few distinct, internally-consistent future worlds and asks how you’d fare in each. This booklet does something deliberately different: instead of whole future worlds, it tracks six currents — the forces those worlds would be built from. In a field moving this fast, the forces are more stable, and more observable, than any single future. The method is still scenario planning — foresee, watch triggers, adjust — but the units you apply it to are currents, not scenarios.
This booklet sketches six underlying currents — forces actively moving the field over the next 2–3 years. They are not predictions and they are not a matrix. They are planning lenses, each with its own data, its own thesis, and its own trigger signals to watch.
The six currents are: Continued Scaling (the $700B capex bet on the next capability staircase), Efficiency Revolution (frontier capability becoming a commodity), Financial Correction (whether the investment timeline matches the revenue timeline), Sovereignty (what happens when a vendor’s access collides with a sovereign’s authority), From Lab to Production (the deployment gap that is now the binding constraint), and Hours and Dollars (the two units that will decide displacement — how long an agent can work, and at what cost). They reinforce and contradict each other in specific ways — that’s the point of holding several open at once.
Each current is anchored by either a visualization or a presenter card with key figures, followed by a written chapter that unpacks the data and arguments. You can use this booklet in two ways: as a presentation tool (lead with the visual or card, trigger discussion), or as a standalone reading experience (read the chapters for the full picture). Both work — design your session around your audience.
At the end, a short synthesis describes how the currents interact, a note explains how to use this in practice, and a workshop exercise helps you translate insight into action.
Continued Scaling
“Where Does the Money Go?”
The money is already spent
When people debate whether AI will “live up to the hype,” they often miss a crucial fact: the investment decisions have already been made. The Big Five hyperscalers — Alphabet, Amazon, Apple, Meta, and Microsoft — spent a combined $228 billion on capital expenditure in 2024, up 62% from $140 billion in 2023. For 2025, guided spending reaches $416 billion. For 2026, the trajectory points to $700 billion or more, with Oracle adding another $50 billion. Adding up 2025–2027, Goldman Sachs projects total hyperscaler capex of $1.15 trillion. This money is flowing into GPU clusters, power infrastructure, and data centers — and the vast majority is earmarked specifically for AI.
The reason this matters for planning is that capex doesn’t translate into capability instantly. Building a data center takes 12–24 months. Procuring the chips takes 6–12 months. Training a frontier model takes another 6–12 months. Post-training, safety testing, and deployment add 3–6 more. Each year’s spending splits into three bets running at different timescales: inference capacity arriving in roughly 2 years, training runs for models 3 years out, and research compute powering breakthroughs 4+ years away.
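The lag can be made concrete with a few lines of arithmetic. The sketch below sums the stage durations quoted above, assuming the stages run strictly one after another. That is a simplifying assumption of this example: in practice, chip procurement and construction can overlap, which shortens the total. The names `STAGES` and `sequential_lag` are illustrative only.

```python
# Stage durations in months (min, max), taken from the paragraph above.
STAGES = {
    "build data center": (12, 24),
    "procure chips": (6, 12),
    "train frontier model": (6, 12),
    "post-training, safety testing, deployment": (3, 6),
}


def sequential_lag(stages):
    """Sum min and max durations, assuming no overlap between stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


low, high = sequential_lag(STAGES)
print(f"capex-to-capability lag: {low} to {high} months")
```

Under the no-overlap assumption the range is 27 to 54 months, which brackets the two-, three-, and four-year horizons described above.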
The staircase pattern
Looking backward, AI capability has advanced in a staircase pattern: a major jump every roughly two years, followed by refinement within that generation. GPT-3 (2020) was dramatically surpassed by GPT-4 (2023), which was then refined through GPT-4 Turbo, GPT-4o, and eventually GPT-4o-mini — each iteration better and cheaper, but not a fundamentally new capability tier. The same pattern appears with Claude 3 Opus giving way to Claude 3.5 Sonnet and then Claude Opus 4, and now Claude Opus 4.6.
If this pattern holds, the $228 billion spent in 2024 is currently producing the training infrastructure for models that will ship in 2026–2027. The $416 billion committed for 2025 funds models arriving in 2027–2028. And the $700 billion planned for 2026 is investing in capabilities whose architectures may not even be designed yet — research compute for ideas that haven’t been conceived. The clearest demonstration of this lag came on March 24, 2026, when OpenAI completed pre-training of GPT-6 (“Spud”) at the Abilene Stargate facility — a model whose existence was funded by 2024’s capex, roughly three years before its expected public launch. Anthropic’s next-generation “Mythos” sits in a similar lane, in limited testing with cybersecurity defenders as of Q1 2026.
What is being built
The scale of individual projects is staggering. Elon Musk’s xAI deployed Colossus, a cluster of 100,000 GPUs, in just 122 days in late 2024. Amazon’s Project Rainier targets 500,000 chips. The Stargate site in Abilene aims for 450,000 GPUs. The Stargate project (a joint venture between OpenAI, Oracle, and SoftBank) plans clusters exceeding one gigawatt of power, the equivalent of a nuclear power plant dedicated to AI. At these scales, the limiting factor shifts from chip availability to raw electrical power: the Three Mile Island nuclear facility is being restarted specifically to supply AI data centers.
The inference bet
A common misunderstanding is that all this money is about training bigger models. In reality, the labs are increasingly betting that inference — running models at scale for millions of users — will dominate AI compute, accounting for roughly 70% by 2030. Training a frontier model is a one-time cost; serving it to every enterprise customer, every developer, every consumer product is a continuous cost that scales with adoption. The capex surge is as much about building the serving infrastructure for AI-powered products as it is about training the next generation of models.
This is the core planning insight of the Continued Scaling current: even if you believe today’s generation of AI is “good enough,” the investment already committed will produce outcomes over the next 2–4 years. Those outcomes (faster models, cheaper inference, new capabilities) will change what’s possible and what’s expected. Your plans need to account for a moving target, not a snapshot.
Trigger signals — what to watch for
- Next-generation models (GPT-5, Claude 5) show a large, undeniable capability jump over predecessors
- Enterprise AI revenue growth accelerates — the $500B+ revenue gap begins to close
- Hyperscaler capex guidance continues rising >30% year-over-year through 2027
- New model architectures emerge that can efficiently use the massive clusters being built
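The capex trigger in the list above can be checked mechanically. The sketch below computes year-over-year growth from the Big Five figures quoted in this chapter and compares each year against the 30% threshold. The 2025 and 2026 values are guidance and trajectory, not reported actuals, and the names `CAPEX_BN` and `yoy_growth` are illustrative.

```python
# Big Five capex in billions of USD, as quoted in this chapter.
# 2025 is guided spending and 2026 is the projected trajectory, not actuals.
CAPEX_BN = {2023: 140, 2024: 228, 2025: 416, 2026: 700}
THRESHOLD = 0.30  # the ">30% year-over-year" trigger from the list above


def yoy_growth(series):
    """Return {year: fractional growth over the prior year}."""
    years = sorted(series)
    return {
        year: series[year] / series[prev] - 1
        for prev, year in zip(years, years[1:])
    }


for year, growth in yoy_growth(CAPEX_BN).items():
    status = "above" if growth > THRESHOLD else "below"
    print(f"{year}: {growth:+.0%} ({status} the 30% trigger)")
```

On these figures every year clears the threshold by a wide margin, so the signal to watch is deceleration toward 30%, not the level itself.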
Implications by role
Data: company earnings reports & guidance • Big 5 = Alphabet, Amazon, Apple, Meta, Microsoft
Efficiency Revolution
“How Much Does GPT-4 Cost?”
| Model | Org | Released | Training cost | Capability claim |
|---|---|---|---|---|
| GPT-4 | OpenAI | Mar 2023 | >$100M | Frontier — set the “GPT-4 class” bar |
| Llama 3.1 405B | Meta | Jul 2024 | ~$60M (compute) | Matches GPT-4 on most public benchmarks |
| DeepSeek V3 | DeepSeek | Dec 2024 | $5.6M (final pre-training run) | Matches/beats GPT-4o on key benchmarks |
| DeepSeek R1 | DeepSeek | Jan 2025 | +$294K (RL on V3 base) | Matches OpenAI o1 on reasoning |
| GLM-5.1 | Zhipu | Apr 2026 | undisclosed | 744B MoE / 40B active, MIT license — 58.4% SWE-Bench Pro (beats GPT-5.4 and Opus 4.6) |
| Mistral Medium 3.5 | Mistral | Apr 2026 | undisclosed | 128B dense, self-hostable on Hugging Face — 77.6% SWE-Bench Verified |
| DeepSeek V4-Pro | DeepSeek | Apr 2026 | undisclosed | 1.6T total / 49B active, hybrid attention — 80.6% SWE-Bench Verified |

Price per million output tokens (USD):

| Date | Frontier tier (closed) | Sub-frontier closed | Open-source equivalent |
|---|---|---|---|
| Mar 2023 | GPT-4: $60 | — | — |
| Nov 2023 | GPT-4 Turbo: $30 | — | — |
| May 2024 | GPT-4o: $15 | — | — |
| Jul 2024 | — | GPT-4o-mini: $0.60 | Llama 3 70B (Groq): ~$0.79 |
| Oct 2024 | GPT-4o (cut): $10 | — | — |
| Jan 2025 | — | — | DeepSeek V3: $0.42 |
| Apr 2026 | Opus 4.7: ~$75 | — | DeepSeek V4-Pro: $2.48 (~10× cheaper at frontier-equivalent) |
| Apr 2026 | — | — | Qwen 3.6 35B-A3B: self-host (single RTX 4090, 73.4% SWE-Bench) |
Mistral CEO Arthur Mensch’s thesis (paraphrased from early-2026 interviews): generic intelligence will commoditize, so competitive advantage moves to specialized systems built around your specific data and domain. The layers he points to are fine-tuning and domain adaptation, data pipelines and retrieval-augmented generation, tooling and orchestration, and domain expertise. Segment widths in the visual are equal: this is a stake-in-the-ground for discussion, not a measured value distribution.
Discussion: If the model is free, which of these layers is your team actually investing in — and which would Mensch say you should be?
The training cost freefall
In March 2023, OpenAI trained GPT-4 for an estimated $63–100 million — Sam Altman confirmed publicly that the cost exceeded $100 million including research and development. By July 2024, Meta had trained Llama 3.1 405B, an open-weight model matching GPT-4 on most benchmarks, for roughly $60 million in compute (30.84 million H100 GPU-hours). Then in December 2024, DeepSeek released V3 — a model that matched or exceeded GPT-4o on key benchmarks for just $5.6 million in GPU time.
That is a roughly 95% cost reduction in 20 months. A month later, DeepSeek released R1, which matched OpenAI’s o1 on reasoning tasks for an incremental $294,000 in training cost.
The caveats matter: DeepSeek’s $5.6 million figure covers only the final pre-training run. Their parent company High-Flyer invested over $500 million in Nvidia GPUs total, and the full cost from base model to R1 is estimated at $6–7 million by Epoch AI. But even the generous estimate represents a 90%+ reduction from GPT-4. The key innovations enabling this — FP8 mixed-precision training, mixture-of-experts architectures with load balancing, and custom CUDA kernels achieving 85%+ GPU utilization versus the industry average of 55–65% — are algorithmic, not hardware-dependent. They can be replicated.
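The reduction percentages above follow directly from the quoted figures. The sketch below uses $100 million as the GPT-4 baseline (the lower bound of the "exceeded $100 million" statement) and compares the later training costs against it. The names `LATER_RUNS` and `reduction` are illustrative, and the figures measure different things (compute only, final run only, full base-to-R1 estimate), so the comparison is indicative at best.

```python
GPT4_COST_USD = 100e6  # lower bound of the ">$100M" figure quoted above

# Later training-cost figures quoted in this chapter, in USD.
LATER_RUNS = {
    "Llama 3.1 405B (compute)": 60e6,
    "DeepSeek V3 (final pre-training run)": 5.6e6,
    "DeepSeek base-to-R1 (upper estimate)": 7e6,
}


def reduction(baseline, cost):
    """Fractional cost reduction relative to the baseline."""
    return 1 - cost / baseline


for name, cost in LATER_RUNS.items():
    print(f"{name}: {reduction(GPT4_COST_USD, cost):.1%} below GPT-4")
```

Even the upper base-to-R1 estimate stays above a 90% reduction against this baseline, which is the point the caveats paragraph makes.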
The open-source convergence
The Stanford HAI 2025 AI Index Report documented the most important shift in the AI landscape: the performance gap between the best open-weight and proprietary models, measured by Chatbot Arena Elo ratings, shrank from 8.04% in January 2024 to 1.7% by February 2025 — a 79% reduction in a single year. On MMLU specifically, the gap between US and Chinese models collapsed from 17.5 percentage points to just 0.3 between the end of 2023 and the end of 2024.
Llama 3.1 405B was the first open model to match or exceed GPT-4 across multiple benchmarks in July 2024, roughly 16 months after GPT-4’s release. By early 2025, that lag had compressed further. Open-source models now represent 62.8% of all models by count, and the best open LLMs lag closed ones by 5–22 months on benchmarks — with the gap narrowing rapidly. One analysis projected open-closed parity by Q2 2026.
Inference pricing in freefall
The pricing evolution of OpenAI’s own API tells the commoditization story in dollar terms. GPT-4 launched at $60 per million output tokens in March 2023. GPT-4 Turbo brought that down to $30 in November 2023. GPT-4o launched at $15 in May 2024, then was cut to $10 in October. Meanwhile, GPT-4o-mini offered GPT-4-class performance at $0.60 per million tokens — a 99% reduction from GPT-4’s launch price in under two years.
Open-source alternatives are even cheaper. Llama 3.3 70B via Groq costs $0.71 per million output tokens. DeepSeek V3 is available at $0.42. Self-hosted 70B models on H100 hardware can reach approximately $0.07 per million tokens at full utilization. On average, open-source models cost 7.3 times less than their proprietary equivalents.
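To put the price points above on one scale, the sketch below expresses each as a decline from GPT-4’s launch price. All prices are the ones quoted in this chapter, in USD per million output tokens. The self-hosted figure is approximate and assumes full utilization, and the names `PRICES` and `decline_from_launch` are illustrative.

```python
# USD per million output tokens, as quoted in this chapter.
PRICES = [
    ("GPT-4 (Mar 2023)", 60.00),
    ("GPT-4 Turbo (Nov 2023)", 30.00),
    ("GPT-4o (May 2024)", 15.00),
    ("GPT-4o (Oct 2024 cut)", 10.00),
    ("GPT-4o-mini", 0.60),
    ("DeepSeek V3", 0.42),
    ("Self-hosted 70B on H100, full utilization", 0.07),
]
LAUNCH_PRICE = PRICES[0][1]


def decline_from_launch(price):
    """Fractional price decline relative to GPT-4's launch price."""
    return 1 - price / LAUNCH_PRICE


for name, price in PRICES:
    print(f"{name:45s} ${price:6.2f}  {decline_from_launch(price):6.1%}")
```

The table mixes models of different capability, so it shows the price of roughly comparable output falling, not the price of one fixed product.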
The April 2026 wave
In April 2026, a wave of open-weight coding models shipped. Three frontier-class releases (GLM-5.1, Kimi K2.6, and DeepSeek V4-Pro) landed within an 18-day window, alongside Qwen 3.6 and Mistral Medium 3.5:
- GLM-5.1 (Zhipu, April 7): 744B MoE / 40B active under MIT license. It posts 58.4% on SWE-Bench Pro, ahead of GPT-5.4 and Opus 4.6 on that benchmark.
- Qwen 3.6 (Alibaba, April 16): split into variants. The 35B-A3B open variant runs on a single RTX 4090 with quantization and posts 73.4% SWE-Bench Verified, the throughput sweet spot for solo developers and on-prem deployments.
- Kimi K2.6 (Moonshot, April 20): 1T total / 32B active. It introduces a 300-agent swarm primitive for parallel exploration on hard tickets, an agentic precursor that connects to Hours and Dollars in Section 6.
- DeepSeek V4-Pro (April 24): 80.6% SWE-Bench Verified at $0.28 input / $2.48 output per million tokens with a 1M-token context window. That is roughly an order of magnitude cheaper than Opus 4.6 at comparable coding capability, using hybrid attention (Compressed Sparse + Heavily Compressed) at about 27% of V3.2’s per-token FLOPs.
- Mistral Medium 3.5 (April 29): a 128B dense model, self-hostable from Hugging Face, with 77.6% SWE-Bench Verified and configurable reasoning effort per request. It is the cleanest EU-data-residency narrative on the market, paired with Mistral’s $400M ARR (January 2026) at a $13.8B valuation.
By May 2026, the buyer question is no longer “is open-weight good enough.” It is which open-weight per workload, which hosting stack, and which sovereign deployment shape. For EU enterprises in particular, these intersect directly with the sovereignty story in Section 4 — for the first time, “self-hostable frontier-equivalent” isn’t a euphemism.
Where does value go when the model is free?
Mistral CEO Arthur Mensch has been the most articulate voice on this shift. Across early-2026 interviews — the Big Technology Podcast in January, Davos and Bloomberg the same month, the Economic Times in February — he framed AI as becoming infrastructure, “a utility” measured by efficiency, capital discipline, and reliable delivery rather than novelty. His most quoted line: “My generation of engineers has more or less succeeded in commoditizing its own profession.” The corollary, which he argues consistently, is that competitive advantage will increasingly accrue not to whoever has the largest model, but to whoever builds the most specialized system around their specific data and domain.
If Mensch is right, and the cost data supports the argument, then the model itself becomes a commodity layer, and value migrates to the layers around it: fine-tuning and domain adaptation, data pipelines and retrieval-augmented generation, tooling and orchestration (agent frameworks, MCP servers, evaluation pipelines), and ultimately domain expertise. The organizations that win in this scenario are not those with the best model, but those with the best understanding of their own problems.
Trigger signals — what to watch for
- Open-source model matches frontier proprietary within weeks of release, not months
- Major enterprise shifts production workloads from proprietary APIs to open-source alternatives
- Inference costs drop below $0.10 per million tokens for GPT-4-class output
- Hyperscaler capex growth decelerates because efficiency gains reduce hardware requirements
Implications by role
Data: OpenAI API pricing history • DeepSeek technical reports • Stanford HAI 2025 AI Index • Epoch AI
Financial Correction
“Have We Seen This Before?”
Survivors vs. Casualties — then and now
The dot-com precedent
On March 10, 2000, the Nasdaq Composite reached an all-time closing high of 5,048.62. By October 9, 2002, it had fallen to 1,114, a 78% decline that destroyed over $5 trillion in market value. The index did not set a new record close until April 23, 2015, a recovery that took fifteen years. At the peak, venture capital investment had surged from roughly $7 billion in 1995 to nearly $100 billion in 2000, with internet companies absorbing 80% of all venture capital. Telecom companies invested more than $500 billion in infrastructure in the five years following the 1996 Telecommunications Act.
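The asymmetry between a decline and the gain needed to undo it is part of why the recovery took so long. The sketch below computes both from the two index levels quoted above. The function names `drawdown` and `gain_to_recover` are illustrative.

```python
PEAK = 5048.62   # Nasdaq Composite high, March 10, 2000
TROUGH = 1114.0  # level on October 9, 2002, as quoted above


def drawdown(peak, trough):
    """Fractional decline from peak to trough."""
    return (peak - trough) / peak


def gain_to_recover(peak, trough):
    """Fractional gain from the trough needed to return to the peak."""
    return peak / trough - 1


print(f"decline: {drawdown(PEAK, TROUGH):.1%}")
print(f"gain needed to recover: {gain_to_recover(PEAK, TROUGH):.0%}")
```

A decline of about 78% requires a gain of roughly 350% from the trough to get back to the peak, so the damage is larger than the headline percentage suggests.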
The lesson that most people take from this period is: “it was a bubble and it burst.” The more useful lesson is that the survivors and casualties were distinguished by one thing — not the quality of their technology, but whether they had real revenue, real customers, and cash to survive a funding drought.
The bubble argument has matured
The most visible bear voice of the past two years — Ed Zitron — has been notable not for being right but for how the argument has had to shift. His original case, sustained across blog posts and his “Better Offline” podcast, was economic: AI was a value-destruction machine, hyperscaler capex was insane, and the unit economics simply did not work. Some of that case has aged well (the ROI gap is real). Most of it has not. Inference costs have fallen 99% over two years. Anthropic crossed $30B+ ARR by spring 2026 and reportedly overtook OpenAI on business adoption in May. Cost decline plus revenue growth made the original economic argument harder to sustain in its strongest form. Kelsey Piper, writing in The Argument, documented the shift: Zitron’s case has migrated from “the economics don’t work” toward fraud and accounting allegations against OpenAI and the hyperscalers.
The bear case is still alive, and parts of it remain sharp. But the goalposts moved — and that itself is a signal worth weighing. A bubble argument that survives a 99% cost decline by switching from economics to fraud is a weaker argument than one that didn’t have to switch. Hold the correction scenario open; don’t hold this particular version of it as the bear case.
Amazon vs. Pets.com
Amazon’s stock fell 94% from roughly $106 in December 1999 to about $5.51 in late 2001. Yet its revenue grew every single year through the crash: $2.76 billion in 2000, $3.12 billion in 2001, $5.26 billion in 2003, $8.49 billion in 2005. It posted its first profitable quarter in Q4 2001 and its first full profitable year in 2003, with $35 million net income on $5.26 billion revenue. The key decision was a well-timed $1.25 billion bond offering that gave Amazon $1 billion in cash to survive the drought. Today it is worth roughly $2.5 trillion — over 800 times its trough market cap.
Pets.com raised $300 million total, spent over $70 million on advertising while generating only $619,000 in revenue, and shut down 268 days after its IPO. Webvan burned through $1.5 billion building automated warehouses before filing bankruptcy. Boo.com raised $135 million, burned it in 18 months, and sold its assets for under $2 million. The common thread: negative unit economics, no path to profitability, and complete dependence on the next funding round.
AI investment has entered unprecedented territory
The combined Big Five capex (Alphabet, Amazon, Meta, Microsoft, Apple) grew from roughly $140 billion in 2022 to $251 billion in 2024 (+62% year-over-year) to a projected $388–443 billion in 2025 and $600–640 billion in 2026. Capital intensity has reached 45–57% of revenue — historically unprecedented for these companies. Venture funding has concentrated similarly: global AI VC funding grew from roughly $45–50 billion in 2022 to $211 billion in 2025, the first year AI startups captured more than half (52.7%) of all global venture deal value.
OpenAI reached an $852 billion post-money valuation (reported) after its $122 billion funding round in March 2026. Annualized revenue hit $25 billion by February 2026 (reported), up from roughly $2 billion in 2023. But the company projects a $14–17 billion loss in 2026 (projected), is not expected to be profitable until 2029 at the earliest, and has committed $600 billion in compute spending through 2030. Anthropic reached a $380 billion valuation with its $30 billion Series G in February 2026 (reported), with revenue growing from $1 billion ARR in December 2024 to $19–30 billion ARR by early 2026 (estimated).
The revenue gap
Sequoia partner David Cahn published “AI’s $600B Question” in June 2024, calculating that the AI infrastructure buildout requires roughly $600 billion in annual end-user revenue to justify itself. At the time, actual AI product revenue was roughly $100 billion — a $500 billion annual gap. Since then, both spending and revenue have grown, but spending has grown far faster: capex roughly tripled while the revenue gap has likely widened, not narrowed. Barclays estimated that current capex levels would require the equivalent of 12,000 ChatGPT-sized products to break even.
Personal value is clear. Enterprise value is the open question.
Two ROI stories sit on top of each other and are routinely conflated. Individual subscribers buying Claude or ChatGPT at $20–$200 a month report value clearly and stickily: paid consumer plans for the two leaders together cross tens of millions of seats by mid-2026, churn is unremarkable, and surveys consistently show personal users describing meaningful time savings. That part of the market has answered. The enterprise market has not.
Omdia’s October 2025 survey of 350 mid-to-large enterprises reported “very good” to “extraordinary” ROI from most respondents, a genuinely positive signal. Accenture, in parallel, found 61% of enterprise AI subscriptions underutilized due to poor integration. The MIT NANDA study reported 95% of organizations seeing zero return, though with significant measurement caveats (no baselines, six-month cutoff, parallel-pilot designs). Reconcile these and the picture is: enterprises that have integrated AI into workflows are extracting real value; the majority that are still trying are not. The reason for the gap isn’t capability. The model can do the task. The bottleneck is how the model gets wired into the workflow. That is the subject of Section 5.
Vendor concentration is the under-discussed risk
Q1 2026 saw AI venture funding concentrate to a degree that has no recent precedent in software. OpenAI, Anthropic, and xAI accounted for roughly 67.3% of all AI venture funding across more than 1,500 deals. OpenAI’s $122 billion round at an $852 billion valuation consumed a non-trivial share of global venture capacity. Microsoft, Meta, Amazon, and Alphabet collectively guided investors toward ~$700 billion of capex in 2026. Three foundation-model labs sit on top of a stack the rest of the industry rents from.
Concentration this extreme is usually argued as safety — the giants won’t fail. The Anthropic-Pentagon situation (covered in Section 4) is the case to study before agreeing. A single sovereign decision in February 2026 severed access to a major AI vendor for the entire US federal government, mid-contract, with little notice. The technology kept working. The vendor didn’t fail. The buyer simply couldn’t buy. That is a vendor-concentration failure mode the dot-com analogy didn’t have. Stress-testing your AI strategy against vendor severance is now first-class planning work, not paranoia.
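One way to start that stress test is a simple inventory check: for each workload, list the providers that could serve it and ask which workloads have nowhere to go if a given vendor becomes unavailable. The sketch below is a minimal illustration of that idea. The workload names, vendor names, and the `stranded_by` helper are all hypothetical.

```python
# Hypothetical workload inventory: names and vendors are illustrative only.
WORKLOADS = {
    "support-chat": ["vendor_a", "self_hosted_open_weight"],
    "code-review": ["vendor_a"],
    "doc-search": ["vendor_b", "vendor_a"],
}


def stranded_by(severed_vendor, workloads):
    """Workloads with no remaining provider if one vendor is severed."""
    return sorted(
        name
        for name, providers in workloads.items()
        if all(p == severed_vendor for p in providers)
    )


for vendor in ("vendor_a", "vendor_b"):
    print(vendor, "->", stranded_by(vendor, WORKLOADS))
```

In this made-up inventory, losing `vendor_a` strands `code-review` while losing `vendor_b` strands nothing. A real exercise would also need to test whether each listed fallback actually works for the workload, which this sketch does not attempt.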
Why the parallel breaks — and why it might not matter
There are important differences from the dot-com era. Today’s leading AI investors are massively profitable companies spending from earnings, not startups burning venture capital. Nasdaq forward price-to-earnings ratios are approximately 26 times versus 60 times at the dot-com peak. Enterprise adoption is far more advanced: 87% of large enterprises have implemented AI in some form. But the core structural risk — investment dramatically outpacing revenue realization — is identical. And new risks have emerged: AI-related corporate debt has ballooned to $1.2 trillion (JPMorgan), GPU rental prices have already fallen roughly 70% from peak, and the real useful life of GPU infrastructure may be 2–3 years rather than the 5–6 years used for accounting depreciation.
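The depreciation point can be illustrated with straight-line arithmetic. The sketch below assumes straight-line depreciation and a hypothetical fleet cost of 100 units, both assumptions of this example and not figures from any company’s accounts, and compares the annual expense under the accounting lives and the shorter real-world lives mentioned above.

```python
def annual_depreciation(cost, useful_life_years):
    """Straight-line depreciation: equal expense in each year of life."""
    return cost / useful_life_years


COST = 100.0  # hypothetical GPU fleet cost, in arbitrary units

SCENARIOS = [
    ("accounting, 6 yr", 6),
    ("accounting, 5 yr", 5),
    ("real life, 3 yr", 3),
    ("real life, 2 yr", 2),
]

for label, years in SCENARIOS:
    print(f"{label:18s} {annual_depreciation(COST, years):5.1f} per year")
```

Depending on which ends of the two ranges apply, the annual expense would be somewhere between about 1.7 and 3 times what the accounting schedule shows.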
The question for your planning is not whether AI is valuable. It is. The question is whether your specific vendors, tools, and providers are the Amazon or the Pets.com of this cycle.
Trigger signals — what to watch for
- OpenAI or Anthropic IPO valuations correct significantly (>30%) within 6 months of listing
- Hyperscaler capex guidance flattens or declines for the first time since 2022
- Multiple AI-native startups fail or get acqui-hired in a single quarter (Inflection, Character.AI pattern)
- GPU rental prices continue falling — H100 rates already down ~70% from peak
- Major AI-related debt defaults or CoreWeave-style stranded asset writedowns
Implications by role
Data: Nasdaq historical data • Sequoia “AI’s $600B Question” • Barclays Research • MIT NANDA, Deloitte, Omdia, Accenture enterprise surveys • Kelsey Piper / The Argument
Sovereignty
“What if your vendor isn’t allowed to sell to you?”
Two things changed in the last four months: a major US AI vendor was severed from its largest federal customer by executive action, and the Chinese open-weight stack closed the gap on coding and reasoning at roughly 10× lower cost. Sovereignty stopped being a paranoid’s concern.
Five anchors. The first is the failure mode; the next four are the alternatives that now exist.
Two collisions, one pattern
Two events in spring 2026 turned sovereignty from hypothetical to operational. The first: on February 27, 2026, the US Defense Secretary designated Anthropic a “supply chain risk,” and the Trump administration ordered federal agencies to stop using Claude. Anthropic and the Pentagon had signed a $200 million contract in July 2025 under Anthropic’s acceptable-use policy; the Pentagon wanted “all lawful purposes” access without limitation, and Anthropic refused to remove restrictions on autonomous weapons and domestic mass surveillance. Anthropic sued, won a preliminary injunction in late March (Judge Rita Lin called the designation “Orwellian” and First Amendment retaliation), then lost an appeals court bid in early April. As of late April, the White House was reportedly developing a workaround to let federal agencies use new Anthropic models, sidestepping the supply-chain designation. The technology never stopped working. The buyer simply could not buy.
The second: on April 8, Meta launched Muse Spark, its first proprietary closed-weight model, from Meta Superintelligence Labs under Alexandr Wang. After nearly a decade of public commitment to open frontier AI, Meta’s frontier development is now closed. Existing Llama models remain available but no longer evolve. The combination of $115–135B in 2026 capex, competitive pressure from Chinese labs building commercial products on top of Llama, and the strategic goal of a deeply integrated “personal superintelligence” tied to Meta’s user data drove the shift. Yann LeCun, Meta’s most visible open-source advocate, departed in November 2025. The “Linux of AI” thesis did not survive contact with $100B+ compute economics — at the Western frontier.
The Chinese open-weight wave fills the gap
Within 18 days in April 2026, three frontier-class open-weight coding models shipped from Chinese labs: GLM-5.1 (Zhipu, MIT-licensed 744B MoE / 40B active, 58.4% SWE-Bench Pro), Kimi K2.6 (Moonshot, 1T total / 32B active with a 300-agent swarm primitive), and DeepSeek V4-Pro (1.6T total / 49B active, 80.6% SWE-Bench Verified at roughly an order of magnitude lower output cost than Opus 4.6). Qwen 3.6 (Alibaba, April 16) split into variants; the 35B-A3B open variant runs on a single RTX 4090 with quantization. By workload as of May 2026: DeepSeek V4-Pro for cheap large-context coding agents, Kimi K2.6 for hard multi-step tickets with its swarm primitive, GLM-5.1 for self-hosted production where MIT licensing matters, Qwen 3.6-35B-A3B for local laptop or single-GPU deployment. Llama 4 remains integration-default but is no longer evolving.
What this means for EU enterprises
The buyer question has shifted from “is open-weight good enough” to a multi-part procurement question: which open-weight per workload, which hosting stack, which sovereign deployment shape. Mistral Medium 3.5 (128B dense, self-hostable, $400M ARR at a $13.8B valuation) is the cleanest EU-data-residency narrative on the market — dense rather than MoE, easier to deploy than the Chinese stacks, sovereign-aligned. Self-hosted Chinese open-weight is the other major option, with two caveats worth flagging: geopolitical exposure if procurement frameworks tighten, and dataset-provenance questions that some EU regulators are starting to ask.
The deeper point: for the first time in this booklet’s lifetime, “self-hostable frontier-equivalent” is not a euphemism. Procurement teams that previously assumed one or two US vendors had no realistic alternative now have several. The work shifts from negotiating with one vendor to architecting around the choice. That choice depends on which sovereign failure modes you weight highest.
Trigger signals — what to watch for
- Additional supply-chain designations or AUP collisions between frontier labs and sovereign buyers
- New sovereign-AI regulations requiring on-shore inference, training data, or model weights
- Additional Chinese open-weight releases matching closed-frontier capability within weeks of release
- EU enterprises shifting production workloads from US-hosted APIs to EU-sovereign or self-hosted stacks at material volume
- Counter-trigger: a US/EU/CN diplomatic settlement that re-normalizes cross-jurisdictional vendor access
Implications by role
Data: court filings (Anthropic v. DoD) • vendor releases (Meta Muse Spark, Mistral Medium 3.5) • open-weight model cards (DeepSeek V4-Pro, GLM-5.1, Qwen 3.6, Kimi K2.6) • Kelsey Piper / The Argument
From Lab to Production
“What we learned from 2015 — and what’s different now”
The capability ceiling overstates what you can deploy. The deployment floor is where the gap actually sits — and that floor is what enterprise AI roadmaps now hit first.
Five lab-to-real-world gaps. The pattern is consistent across modalities — capability is not the binding constraint.
The 2015 parallel
Anyone who lived through machine learning’s enterprise adoption between 2014 and 2018 has seen this shape before. Statisticians and data scientists arrived from math and statistics backgrounds, fluent in modeling but uneven in software engineering. They built models in notebooks; the models worked in the notebook; the models did not ship. The gap was real and load-bearing — not a fashionable complaint. The eventual resolution was a decade of work on MLOps, feature stores, model registries, and cross-functional teams pairing data scientists with software engineers and platform people. The capability was always there. The path from capability to production took the better part of ten years to build out.
The LLM gap has the same shape and is hitting enterprises hard right now. Capability has run ahead of the operational maturity to deploy it. Pilots multiply; production deployments lag. The MIT NANDA “95% zero ROI” figure has measurement issues, but even with conservative reframings the underlying message is correct: most enterprises haven’t finished the deployment side. The 61% of AI subscriptions Accenture identified as “underutilized due to poor integration” is the same story stated more carefully. The same talent and tooling gap. Same response: build the bridge.
What’s different this time
The 2015 parallel is the right scaffold but it isn’t a copy-paste. Two things are materially different, and they should reshape how teams budget the bridge.
First, far less data-pipeline work. The 2015 ML era spent enormous effort on data engineering — ETL, feature pipelines, training-serving skew, feature stores. LLMs invert most of that. They generate outputs from unstructured inputs rather than processing structured data; the data layer is retrieval and context assembly, not feature transformation; high-volume ETL is largely not the bottleneck. Teams that assume their LLM deployment needs an MLOps-shaped data team will mis-budget. The work is real but it sits elsewhere.
Second, far more testing and validation work. This is the part most enterprises systematically underestimate. An LLM can confabulate plausibly, an agent can take actions, output reaches end-users directly, and the damage potential of an undetected failure is qualitatively higher than “our regression test set drifted.” The work that was once 10–15% of MLOps spend — evaluation, monitoring, output review — becomes first-class infrastructure. Eval pipelines, red-teaming, calibration of human review, behavior change-management when a model upgrades: these are not afterthoughts. They are the deployment work itself. Teams that staff the bridge with the MLOps shape will discover the bridge is built wrong.
The benchmark-to-deployment gap, quantified
The same pattern shows up in every domain that has been measured carefully. ChatGPT-4 achieved 92% diagnostic accuracy in controlled medical studies, but a meta-analysis across 83 studies found only 52.1% overall AI diagnostic accuracy in real-world settings — nearly a 40-point gap. On the SWE-Lancer benchmark of real freelance coding tasks, even top models succeed only 26.2% of the time despite near-perfect HumanEval scores. On RE-Bench long-horizon tasks, AI systems score 4× higher than humans at 2 hours but humans outperform AI 2:1 at 32 hours. Deloitte’s 2026 enterprise survey found 20% of enterprises reporting AI-driven revenue, with two-thirds still stuck in pilot. None of these numbers describe a capability problem. They describe a deployment problem.
Regulation as a secondary force raising the floor
Regulation isn’t the headline of this current, but it is a real second-order force, and one piece of news clarifies how to weight it. On May 7, 2026, EU negotiators reached provisional political agreement on the Digital Omnibus on AI: Annex III high-risk obligations are postponed from August 2, 2026 to December 2, 2027 (a 16-month deferral), and Annex I product-regulated high-risk obligations are deferred from August 2, 2027 to August 2, 2028. Watermarking and AI-content transparency shift by only three months, to December 2, 2026.
The delay should not reduce urgency for buyers. General-purpose AI model obligations under Articles 51–55 are unchanged and continue on the original schedule. Article 5 prohibitions are already in force. The Article 4 AI literacy obligation is already binding. Standards and guidance will still be published close to the new deadlines. The Code of Practice on synthetic content is expected to be finalised in May or June 2026. What the omnibus moved was the most expensive, most operationally heavy obligations, precisely the ones tied to deployment of high-risk systems. The deployment gap is the headline; regulation is the floor underneath it, which the omnibus moved but did not remove.
Copyright runs in parallel. The Bartz v. Anthropic case produced a $1.5 billion class-wide settlement. The New York Times v. OpenAI multi-district litigation had summary judgment due in April 2026. There are 56+ ongoing copyright lawsuits against AI companies. Every settlement raises the floor of data-provenance and due-diligence work required to deploy an LLM in production. Treat this as friction on the deployment side, not as a separate force.
What teams that bridge the gap actually look like
The 2015 resolution was cross-functional teams — data scientist plus software engineer plus platform engineer. The 2026 resolution looks similar in shape but reweighted: evaluation engineers, red-team specialists, and workflow designers become first-class roles. Less feature-store work; more behavioral testing. Less ETL; more output review. Less concept drift; more model upgrades that change personality. The teams that ship LLM features into production at scale in 2026 are the ones that have already staffed this shape. Most haven’t.
Trigger signals — what to watch for
- First EU AI Act enforcement actions under remaining-on-schedule obligations (GPAI / Article 5)
- Major copyright ruling against AI training (NYT v. OpenAI summary judgment, expected H1 2026)
- Your own internal pilot-to-production conversion rate stays below 20% across two quarters
- Evaluation / red-team roles become standard line items in enterprise AI procurement
- Counter-trigger: a turnkey LLM-deployment toolchain (eval + monitoring + retrieval) becomes commoditized in the way MLOps did between 2018 and 2022
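The internal conversion-rate trigger above is the one signal on this list a team can compute for itself. A minimal sketch, assuming you log pilots started and pilots promoted to production per quarter; the record shape, function name, and sample numbers are hypothetical, and only the 20% threshold and the two-quarter window come from the trigger list.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class QuarterStats:
    """Pilot outcomes for one quarter (hypothetical record shape)."""
    label: str
    pilots_started: int
    pilots_in_production: int

    @property
    def conversion_rate(self) -> float:
        # A quarter with no pilots counts as zero conversion.
        if self.pilots_started == 0:
            return 0.0
        return self.pilots_in_production / self.pilots_started


def conversion_trigger_fired(quarters: list[QuarterStats],
                             threshold: float = 0.20,
                             consecutive: int = 2) -> bool:
    """True if the most recent `consecutive` quarters are all below `threshold`."""
    if len(quarters) < consecutive:
        return False
    recent = quarters[-consecutive:]
    return all(q.conversion_rate < threshold for q in recent)


# Illustrative numbers only, not data from any survey cited here.
history = [
    QuarterStats("Q4", pilots_started=10, pilots_in_production=3),
    QuarterStats("Q1", pilots_started=12, pilots_in_production=2),
    QuarterStats("Q2", pilots_started=15, pilots_in_production=2),
]
fired = conversion_trigger_fired(history)
print("Trigger fired:", fired)
```

The check looks only at the most recent quarters, so one good quarter resets it. That matches the "across two quarters" wording, but a stricter team might prefer a rolling average.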
Implications by role
Data: SWE-Lancer / SWE-Bench leaderboards • medical AI meta-analysis (83 studies) • Deloitte 2026, Accenture, MIT NANDA enterprise surveys • EU AI Act + Digital Omnibus (May 2026) • Bartz v. Anthropic settlement
Hours and Dollars
“The two units that will decide displacement”
Stop arguing about IQ benchmarks. The displacement curve to watch is hours of undisturbed work on the X axis, and cost per autonomous task-hour on the Y axis. That is how an employer will price an agent against a person.
Today, a frontier-Opus autonomous coding hour costs roughly an order of magnitude less than the human hour it would replace, before review overhead. After review overhead and retries, the net gap is narrower — but still wide enough to matter.
Two units
The conversation about AI capability is changing because the people who buy capability are not benchmark researchers. Employers do not care whether a model added two points on MMLU. They care about two numbers. First: how many hours an agent can work on something autonomously before a human needs to step in. Second: what an hour of that autonomous work costs, in API tokens or compute, compared to the loaded hourly rate of the person who would otherwise do it. Those two numbers, multiplied, are the displacement math. The first is improving observably on a roughly four-month doubling cadence. The second is collapsing on the curve Section 2 covers. The product of the two is what will decide which work moves and which doesn’t.
Where the autonomy is today
The first unit is no longer hypothetical. The observable artifacts as of May 2026: Claude Code routinely runs multi-hour autonomous coding sessions; Anthropic’s Computer Use lets an agent drive a desktop directly; Cursor (with Composer 2.5) and Windsurf (bundling Devin Cloud) sell agentic coding by the hour, not by the demo. Google demonstrated Antigravity 2.0 by having 93 sub-agents generate 2.6 billion tokens to build the core framework of an operating system in roughly 12 hours — theatre, yes, but the artifact existed. Gemini Spark, announced at I/O on May 19, is a personal agent that runs cloud-side 24/7 even when the user’s device is off. Anthropic’s “Dreaming” feature gives models memory consolidation across long-running work. None of these tools clear the 8-hour undisturbed-work threshold reliably yet, but several can sustain hours of useful autonomous work in narrow domains. That was not true a year ago.
The cost comparison
The displacement framing only works if the second unit lines up with the first. Take a representative knowledge-worker task that takes a senior practitioner about eight hours: a mid-complexity coding ticket with tests, or a structured analysis with document synthesis. Today’s math, using May 2026 prices and observed Claude Code token telemetry:
- Claude Opus 4.7 at $5 input / $25 output per million tokens. A moderately loaded autonomous coding agent burns on the order of 1M tokens per hour (typically ~700K input, ~300K output, with most input cached). At cached-input pricing, that lands at roughly $8 per agent-hour. An 8-hour autonomous run lands near $65–$80. A heavy multi-tool autonomous workload pushing 3–5M tokens per hour pushes per-hour cost into the $25–$45 range.
- Claude Sonnet 4.6 at $3 input / $15 output per million tokens. Same workload at moderate load: roughly $5 per agent-hour, since the ~300K output tokens alone cost about $4.50; an 8-hour run lands near $35–$40. Heavy load reaches $15–$25 per hour.
- Senior developer comparator: in the US, loaded hourly cost typically lands at $130–$245 per hour (base $110–$175 plus 20–40% loaded for benefits, taxes, overhead). In Western Europe the equivalent runs €100–€150 fully loaded; in CEE roughly half that. Per 8-hour day, that’s $1,000–$2,000 US / €800–€1,200 Western EU.
The raw ratio, before any overhead, is striking: an autonomous Opus 4.7 hour costs roughly 15–30× less than the senior US developer hour it would replace; an autonomous Sonnet 4.6 hour costs roughly 25–50× less. Most observers stop here, get excited, and reach the wrong conclusion. The honest number adds review and retry overhead: every autonomous hour today realistically needs roughly 20–40 minutes of human supervision, evaluation runs, and retry cycles before the work ships. That overhead compresses the effective gap into something more like 3–10× cheaper for the right workflow, and to break-even or worse for workflows where the model still gets stuck.
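The per-hour arithmetic above can be reproduced with a few lines, which makes it easy to swap in your own token mix and prices. This is a sketch under stated assumptions, not a pricing calculator: the function names are hypothetical, all input is treated as cached, and cached input is assumed to be billed at 10% of the base input price (check your vendor's current price sheet). The token counts and list prices are the ones quoted in the Opus bullet.

```python
def agent_hour_cost(input_tokens: float, output_tokens: float,
                    input_price: float, output_price: float,
                    cached_fraction: float = 1.0,
                    cache_read_multiplier: float = 0.1) -> float:
    """Dollar cost of one agent-hour; prices are per million tokens.

    cache_read_multiplier is an ASSUMPTION (cached input billed at 10%
    of the base input price), not a figure taken from the booklet.
    """
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    input_cost = (uncached * input_price
                  + cached * input_price * cache_read_multiplier) / 1_000_000
    output_cost = output_tokens * output_price / 1_000_000
    return input_cost + output_cost


def gross_ratio(human_rate: float, agent_cost: float) -> float:
    """How many times cheaper the agent-hour is, before any overhead."""
    return human_rate / agent_cost


# Moderate load from the text: ~700K input, ~300K output tokens per hour.
opus_hour = agent_hour_cost(700_000, 300_000,
                            input_price=5.0, output_price=25.0)
print(f"Opus agent-hour: ${opus_hour:.2f}")

# Loaded US senior-developer range quoted in the text.
for human_rate in (130.0, 245.0):
    ratio = gross_ratio(human_rate, opus_hour)
    print(f"vs ${human_rate:.0f}/h human: {ratio:.0f}x cheaper (gross)")
```

Under these assumptions the Opus figure comes out just under $8 per agent-hour, and almost all of it is output tokens. The output price and output volume are the inputs worth measuring most carefully in your own workflow.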
Section 2 explains why the underlying token gap closes faster than people expect: frontier-equivalent inference costs fell ~99% in two years and the April 2026 open-weight wave (DeepSeek V4-Pro at $0.28/$2.48 per MTok) is another factor of 10 below paid frontier. The crossover for any specific workflow depends mostly on two things: how much human supervision overhead it still needs, and what the loaded-cost comparator actually is in your geography. Pick one workflow your team actually does. Estimate both numbers. Track the ratio quarterly. That ratio, more than any benchmark, will tell you when displacement becomes economic.
The METR data point
One supporting data point is worth keeping in view. METR (Model Evaluation & Threat Research) publishes a measured benchmark called Time Horizon: the duration of human work a model can complete with 50% reliability. Their 2019–2025 dataset showed the frontier doubling roughly every 7 months. Their TH 1.1 update (January 2026) used more tasks and far more 8-hour-plus measurements. The 2024–25 trajectory looks closer to a 4-month doubling. Claude 3.7 Sonnet’s 50% time horizon sits around 50 minutes. If the 4-month doubling holds, frontier models cross the 8-hour threshold by late 2026 or early 2027. If the 7-month doubling reasserts, that slips to 2028. Either way, the trajectory is the input to the displacement curve, not the headline number.
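The sensitivity to the doubling period is easy to see in a small projection. If the time horizon grows as h(t) = h0 · 2^(t/d), with h0 the current horizon and d the doubling period in months, the time to reach a target horizon H is t = d · log2(H / h0). The sketch below implements that formula. The function name is hypothetical and the starting horizon is a placeholder chosen for round numbers, not a METR measurement; substitute the latest published value.

```python
import math


def months_to_threshold(current_minutes: float, target_minutes: float,
                        doubling_months: float) -> float:
    """Months until an exponentially growing horizon reaches the target."""
    if current_minutes <= 0 or target_minutes <= 0 or doubling_months <= 0:
        raise ValueError("all arguments must be positive")
    if current_minutes >= target_minutes:
        return 0.0  # already at or past the threshold
    doublings = math.log2(target_minutes / current_minutes)
    return doublings * doubling_months


# PLACEHOLDER starting point: a 2-hour horizon, against the 8-hour threshold.
start_minutes = 120.0
target_minutes = 8 * 60.0

for doubling in (4.0, 7.0):
    months = months_to_threshold(start_minutes, target_minutes, doubling)
    print(f"{doubling:.0f}-month doubling: {months:.0f} months to 8 hours")
```

The gap between the two cadences widens with every doubling still needed. That is why the choice of doubling period moves the projected date more than the exact starting horizon does.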
What this implies for capex
If “hours of autonomous work” is the right capability axis, the capex picture from Section 1 changes shape. The training portion of hyperscaler spend — the largest pre-training runs — becomes harder to justify on its own; smaller models with strong post-training, RL, and tool use can match the capabilities of larger ones for many workflows. The inference portion, already projected at roughly 70% of AI compute by 2030, becomes more justified, because every long-running agent is a multi-token, multi-call, often multi-hour inference workload. Reasoning models with extended thinking can use 10–100× more tokens per task than chat-style models. Capex is not wrong. Its allocation between training and serving is what shifts.
Trigger signals — what to watch for
- 8h+ undisturbed autonomy on routine knowledge work at net agent-cost below 25% of the loaded human comparator after review and retry overhead (today the gross gap is large; the net gap closes when supervision overhead drops)
- Multi-day autonomous agents enter production at major enterprises with measurable, audited ROI
- Hyperscaler capex shifts visibly from training to inference and tool-ecosystem infrastructure
- Vendors begin publishing agent cost-per-hour benchmarks alongside accuracy benchmarks
- Counter-trigger: net agent cost-per-hour (after supervision overhead) stays above human comparator on representative tasks for two consecutive frontier releases
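The first trigger and the counter-trigger both turn on net cost, so it helps to fix a definition before the numbers arrive. The sketch below is one simple model, not the only reasonable one. It assumes one agent-hour replaces one human-hour of output, that supervision time is billed at the same loaded human rate, and that retries multiply the raw agent cost. All function and parameter names are hypothetical, and the sample inputs are illustrative values inside the ranges quoted earlier.

```python
def net_agent_cost_per_hour(agent_cost: float, supervision_minutes: float,
                            human_rate: float,
                            retry_factor: float = 1.0) -> float:
    """Agent cost plus the human time spent supervising one agent-hour."""
    supervision_cost = (supervision_minutes / 60.0) * human_rate
    return agent_cost * retry_factor + supervision_cost


def displacement_trigger_fired(net_cost: float, human_rate: float,
                               threshold: float = 0.25) -> bool:
    """Trigger: net agent cost below 25% of the loaded human comparator."""
    return net_cost / human_rate < threshold


def counter_trigger_fired(net_cost: float, human_rate: float) -> bool:
    """Counter-trigger condition for one release: net cost above the human rate."""
    return net_cost > human_rate


human_rate = 180.0   # illustrative, inside the $130-$245 range in the text
agent_cost = 8.0     # roughly the moderate-load Opus figure in the text

today = net_agent_cost_per_hour(agent_cost, supervision_minutes=30,
                                human_rate=human_rate)
later = net_agent_cost_per_hour(agent_cost, supervision_minutes=5,
                                human_rate=human_rate)

print("30 min supervision:", today, displacement_trigger_fired(today, human_rate))
print(" 5 min supervision:", later, displacement_trigger_fired(later, human_rate))
```

In this model the supervision minutes dominate the result; the token price barely registers. That is the same point the first trigger makes in words: the net gap closes when supervision overhead drops.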
Implications by role
Data: METR Time Horizons (TH 1.0 Mar 2025, TH 1.1 Jan 2026) • SWE-bench Verified / SWE-Lancer leaderboards • Antigravity 2.0 demo writeups • Anthropic API pricing (May 2026) • Claude Code usage telemetry • Index.dev / MarsDevs developer hourly-rate surveys 2026
How the Currents Interact
The six currents are not independent. They reinforce and contradict each other in specific ways — and the value of holding them all open at once is mostly in seeing those interactions clearly. Three patterns are worth naming.
Reinforcement. Efficiency Revolution (Current 2) accelerates Sovereignty (Current 4): cheap, frontier-equivalent open-weight inference is what makes EU-onshore deployment a real procurement choice rather than a slogan. Hours and Dollars (6) reinforces From Lab to Production (5): longer autonomy widens the gap between what an agent can do in a demo and what teams can actually deploy reliably, because eval and red-team work scales with task duration. Continued Scaling (1) reinforces Hours and Dollars (6) on the inference side: the capex shift from training to serving makes longer agentic runs cheaper.
Contradiction — sometimes only apparent. Continued Scaling (1) and Efficiency Revolution (2) pull capex in opposite directions but resolve via the training-vs-inference split: train-cluster spend gets harder to defend, serving-cluster spend gets easier. Financial Correction (3) and Continued Scaling (1) look opposed but aren’t mutually exclusive — both can be true at once: the technology works, individual products generate real revenue, and the investment timeline doesn’t match the revenue timeline. Sovereignty (4) cuts diagonally across (2) and (3): commoditization makes sovereign options viable, while vendor concentration risk is the failure mode that makes them necessary.
The stance. No current is “the answer.” The strongest planning position is the one that performs adequately under all six, not the one that bets everything on whichever current seems most live this quarter. Lock to none; watch triggers in all; let currents go stale when their triggers haven’t fired in 18 months and replace them with ones that better describe the world you’re actually living in. That discipline is what this booklet exists to make routine.
How to Use This in Practice
If you take one thing from this booklet, take this: currents are useful only if you commit to a habit. The habit is foresee, watch triggers, adjust. The currents are scaffolding for that habit, not a forecast.
Pre-commit to triggers, not predictions
The most useful artifact in this booklet is the trigger list under each current. Far more than the prose or the synthesis, the triggers are what tell you something has shifted. Decide now — before the headlines — what would update you. “If frontier agent cost-per-hour drops below $40, I will pilot the displacement workflow.” “If GPU rental prices fall another 30%, I will renegotiate our vendor contract.” The point is to short-circuit the response time between observing a signal and acting on it.
Review on a cadence
Quarterly is probably right for most teams. Faster than that and you’re reading noise; slower and you miss real shifts. Each review, walk through the trigger list and ask: has any trigger fired? Has any disconfirming signal landed? What changed in our environment? Update your stance accordingly. The output of the review is rarely “we were wrong”; more often it’s “we should weight this current more heavily than we did last quarter.”
Let currents go stale
If a current’s triggers haven’t fired in 18 months and its disconfirming evidence has been steadily accumulating, the current probably isn’t live anymore. Retire it. Replace it with one that better describes the world you’re actually living in. This booklet is a snapshot of mid-2026; by mid-2027 at least one of these currents will likely need replacing. That’s the system working, not failing.
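The retirement rule can be written down as a check to run at each quarterly review. This is a sketch under assumptions: the function names are hypothetical, and "steadily accumulating" disconfirming evidence is reduced to a simple count with a hypothetical minimum of two signals, which a real review would replace with judgment. Only the 18-month window comes from the text.

```python
from datetime import date
from typing import Optional


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months from `earlier` to `later` (day of month ignored)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def is_stale(last_trigger_fired: Optional[date], tracking_since: date,
             today: date, disconfirming_signals: int,
             stale_after_months: int = 18,
             min_disconfirming: int = 2) -> bool:
    """True if no trigger fired within the window AND evidence has piled up.

    If no trigger has ever fired, the clock starts when tracking began.
    min_disconfirming is a HYPOTHETICAL stand-in for "steadily accumulating".
    """
    reference = last_trigger_fired or tracking_since
    quiet_months = months_between(reference, today)
    return (quiet_months >= stale_after_months
            and disconfirming_signals >= min_disconfirming)


review_day = date(2026, 6, 1)  # illustrative review date

quiet = is_stale(last_trigger_fired=date(2024, 11, 1),
                 tracking_since=date(2024, 1, 1),
                 today=review_day, disconfirming_signals=3)
live = is_stale(last_trigger_fired=date(2026, 1, 15),
                tracking_since=date(2024, 1, 1),
                today=review_day, disconfirming_signals=3)

print("Quiet current stale:", quiet)
print("Live current stale:", live)
```

Both conditions are required on purpose. A current that has been quiet for 18 months with no contrary evidence may simply be slow, and the text asks for both before retiring it.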
What I think
I want to be precise about my view: I think this approach — currents plus triggers plus periodic adjustment — is the healthy way to navigate a technology that changes this fast. I do not think any one of the six currents in this booklet is the right one to bet on. People are eager to jump to conclusions, and prediction makes good copy, but conclusions in a domain moving at this rate go stale before they’re acted on. The discipline of holding several currents open simultaneously, watching what fires, and updating without ego is, I believe, the actual skill worth building.
Trigger Drill
This worksheet is built around triggers, not predictions. For each of the six currents, answer three questions. The point isn’t to be right about which current dominates — it’s to know what you’d notice and what you’d do.
Current 1 — Continued Scaling
Current 2 — Efficiency Revolution
Current 3 — Financial Correction
Current 4 — Sovereignty
Current 5 — From Lab to Production
Current 6 — Hours and Dollars
One action that works under all six currents
What is one thing you can do this month that improves your position regardless of which current dominates?
Interactive versions of all visualizations: demos.barcik.training
Full research and data: publications.barcik.training
© 2026 Robert Barcik · LearningDoe s.r.o. · barcik.training