The Agent Horizon
A Strategic Guide to the Enterprise Agent Development Stack
April 2026
By Robert Barcik
LearningDoe s.r.o.
Contact: robert@barcik.training
About This Booklet
Every few weeks, a new agent framework launches. A new protocol gets announced. Another vendor SDK promises to make things easier. For an engineer or business stakeholder planning a 2026–2028 roadmap, the signal-to-noise ratio is terrible — and the stakes are high.
This booklet is a conceptual map, not a feature comparison. Its goal is to help you see what sits where in the emerging agent stack, why each layer exists, and how the pieces will most likely play out over the next two to three years. The scaffold is an analogy you already know: the cloud transition. The parallels aren’t perfect, but mapping agent primitives onto familiar cloud concepts gives you a mental model that survives the next rebranding cycle.
When you finish, you should be able to answer four questions with confidence. What is MCP, and why does everyone treat it as settled? What’s the real difference between Google ADK, the OpenAI Agents SDK, the Claude Agent SDK, and LangGraph — and when does each make sense? Where does the lock-in live, and how much should you care? And for a European enterprise specifically, is the sensible bet to ride the vendor wave first and worry about portability later, or to invest in agnostic infrastructure now?
The booklet closes with a worked case study: a regulated European bank resolving the 5-question framework into a specific stack — including the one decision where the framework said one thing and we did another, and why that was right.
No hype. No breathless predictions. Just a map and a specific example.
Who This Booklet Is For
- Enterprise engineers evaluating agent frameworks for production deployment
- Architects designing multi-agent systems that need to outlive a single vendor
- Technology leaders shaping a 2026–2028 roadmap
- Business stakeholders trying to understand what their engineers are arguing about
If you’ve heard the terms MCP, ADK, LangGraph, or A2A used in conversation and nodded along while quietly wondering which is a protocol and which is a framework — this booklet is for you.
How to Read It
Chapters 1 and 2 set up the mental model. Read these first even if you’re deep in the space. Chapter 3 covers the two settled protocols (MCP and A2A). Chapter 4 introduces the orchestration layer; Chapters 5 and 6 survey its two families (vendor and agnostic). Chapter 7 addresses observability. Chapters 8 through 10 cover strategy — lock-in, EU angle, timeline. Chapter 11 brings it all together with a decision framework, a worked bank case study, and a short epilogue.
Table of Contents
- The Cloud Parallel
- The Layer Cake
- The Protocol Layer — MCP and A2A
- The Orchestration Layer
- The Vendor Frameworks
- The Agnostic Frameworks
- Observability, Evaluation, and Cost
- The Lock-In Question
- The EU Angle
- Will the Timeline Actually Squeeze?
- Picking Your Stack — with a Worked Case
Chapter 1: The Cloud Parallel
If you’ve been in enterprise IT for more than a decade, the way agent development is being debated in 2026 should feel suspiciously familiar.
Three foundation-model vendors — Google, OpenAI, Anthropic — are pushing polished, opinionated development kits that make it easy to build an agent in an afternoon, provided you stay in their ecosystem. A smaller cluster of vendor-neutral frameworks — LangGraph, CrewAI — insists the only responsible choice is portable abstractions that outlive a model-provider change. In between, a protocol called the Model Context Protocol (MCP) has quietly become the default way for agents to reach tools and data. It’s open, foundation-governed, already at tens of millions of SDK downloads per month, and most of the industry has stopped arguing about it.
Swap names and this is almost exactly the conversation we had between 2010 and 2015. The polished vendor kits were AWS, Azure, GCP. The neutral frameworks were Kubernetes and Docker. The quiet protocol was HTTP. The debate was not whether to go to the cloud — it was whether to commit to one cloud and accept the lock-in, or build on portable abstractions and pay the abstraction tax upfront.
The claim of this booklet is that the agent landscape is re-running the cloud playbook. Not in every detail, and not at the same speed. But closely enough that a working mental model of how the cloud transition played out already gets you most of the way to a working mental model of how the agent transition will.
Foundation models are the new compute. MCP is the new HTTP. Vendor agent SDKs are the new Platform-as-a-Service. Agnostic frameworks like LangGraph are the new Kubernetes. That is the one-sentence version of the rest of this book.
Why the Parallel Holds
A handful of hyperscale vendors have an unassailable cost advantage at the bottom of the stack. Nobody could match AWS’s per-core price in 2012 because AWS had amortised its infrastructure across millions of customers. Nobody can match OpenAI’s or Anthropic’s per-token price today because they’ve amortised training costs across a similarly massive user base. The cost curve is structural, not temporary. Renting is cheaper than owning, except for a narrow set of workloads where compliance, latency, or data sovereignty force a different choice.
The vendors selling the cheap bottom layer are also trying to sell you the layer on top. AWS didn’t just sell EC2 — it pushed Elastic Beanstalk, Lambda, SageMaker, and dozens of other managed services that are extraordinarily convenient as long as you never want to leave AWS. Foundation-model vendors are doing the same thing: Google pushes ADK + Vertex, OpenAI pushes the Agents SDK with hosted sandboxes, Anthropic pushes the Claude Agent SDK with built-in computer-use tools. The gravitational pull is identical.
And in both cases, a neutral middle layer emerged in response. Not because the vendor offerings were bad, but because a critical mass of enterprises decided that being able to swap the bottom layer without rewriting the top layer was worth the engineering cost.
Where the Parallel Breaks
A good analogy is one you can stress-test. This one has two cracks worth flagging up front.
The timeline is compressed. The cloud transition took about a decade to reach its settled shape. The agent stack has gone from “interesting experiment” (late 2022) to “settled protocol plus competing frameworks” (early 2026) in under four years. Whether that means the final shape arrives in another four years or whether we’re in the equivalent of 2010 with another decade of churn ahead is an open question. Chapter 10 takes it seriously.
The lock-in is deeper. When an enterprise migrated off AWS, PostgreSQL was still PostgreSQL and Java was still Java. Vendor-specific parts — queues, DNS, identity — were replaceable. In the agent world, when an enterprise builds on Claude’s computer-use capability, that capability isn’t a portable abstraction; it’s baked into how Anthropic’s model was trained. You cannot run the same workflow through GPT-4o and expect it to behave. Vendor lock-in in the cloud era was mostly about surrounding services. Vendor lock-in in the agent era can reach all the way down to model behaviour itself.
File both caveats and keep them in mind as you read. The cloud analogy is scaffolding, not blueprint.
The EU Wrinkle
One cloud-era feature is worth calling out early because it will likely replay. The European cloud transition was slower than the US transition. GDPR wasn’t fully in force yet, but data-protection norms made cross-border data transfer a live engineering concern, not just a legal one. The predictable consequence: by the time European enterprises moved seriously to the cloud, they could skip the painful early lessons. Multi-cloud was already a recognised pattern; vendors had already been forced to offer portability tools. The European market effectively leapfrogged Phase 1.
There is a credible argument that the EU does this again with agents. The AI Act is a stronger forcing function than GDPR was for cloud. Sovereignty concerns are sharper, not softer. And the sovereign-AI movement across Europe is pushing an architectural pattern — model-agnostic routing with on-prem or EU-region execution for regulated data — that looks a lot like a mature agent stack rather than an early one.
Whether the leapfrog actually happens depends on things that are genuinely unknowable in April 2026. But the pattern is strong enough that any European enterprise planning its agent strategy should at least ask: are we about to repeat the cloud cycle, or skip ahead? Chapter 9 develops this directly.
The rest of this booklet is an elaboration of the picture above. The goal through Chapter 11 is to be specific enough that when you finish, you can tell a colleague what MCP is, why ADK and LangGraph are not in the same category, and what your organisation should actually do about any of it.
Chapter 2: The Layer Cake
The Most Common Mistake
If you read technology press about agent development, you’ll run into sentences like “should you use ADK or MCP?” or “companies are choosing between the Claude Agent SDK and the Model Context Protocol.” These sentences are nonsense. They treat a protocol and a framework as competing options, which they are not. They live on different layers of the stack and do different jobs.
This isn’t a pedantic complaint. It’s the single most load-bearing clarification in the booklet. If you leave this chapter with one thing, let it be this: the agent stack has layers, and the pieces you hear debated in the press are not always on the same one. Framework decisions happen at one layer, protocol decisions at another, model decisions at a third. You don’t “choose between” items from different layers. You choose one of each and combine them.
The cloud-era equivalent would be asking “should we use AWS or HTTP?” AWS is a cloud provider; HTTP is a protocol. You use both. They aren’t competing decisions — they’re complementary decisions at different levels of the stack.
The Layers
Read this picture top-down, from what the user experiences to what actually does the work.
┌─────────────────────────────────────────────────────┐
│ ORCHESTRATION LAYER │
│ "When do we call what, with what state, │
│ under what guardrails, across how many agents?" │
│ │
│ ADK · LangGraph · CrewAI · OpenAI Agents SDK · │
│ Claude Agent SDK · AWS Strands · Azure AI Agent │
├─────────────────────────────────────────────────────┤
│ LLM LAYER │
│ The reasoning engine that generates tool calls, │
│ plans, and responses. │
│ │
│ Gemini · Claude · GPT · Mistral · Llama · ... │
├─────────────────────────────────────────────────────┤
│ TOOL/AGENT ACCESS LAYER │
│ How the LLM reaches tools, data, and other agents. │
│ │
│ MCP (tools & data) · A2A (agent-to-agent) · │
│ Native function calling · Direct SDK calls │
├─────────────────────────────────────────────────────┤
│ ACTUAL SYSTEMS │
│ Databases, APIs, files, SaaS apps, other agents │
└─────────────────────────────────────────────────────┘
Four layers. Each has a distinct role. You pick something from each layer, then compose them.
Orchestration layer. The agent’s “brain.” Deciding which tool to call, in what order, what to do on failure, how to pass state step to step, when to stop. In a simple agent this might be a while loop that keeps calling the model until it says “done.” In a complex agent it’s a multi-step state machine with branches, retries, parallel execution, and handoffs to sub-agents. Google ADK, LangGraph, CrewAI, OpenAI Agents SDK, Claude Agent SDK, AWS Strands, Azure AI Agent Service — all live here. They disagree about whether the right abstraction is a graph, a team of agents with roles, a sequence of handoffs, or a tree of sub-agents. They’re all solving the same fundamental problem.
LLM layer. The model that does the thinking. Gemini, Claude, GPT, Mistral, Llama. Framework decides when to call the model and what to do with the result; model decides what to say when called. Framework is the project manager; model is the specialist being asked for an opinion.
Tool and agent access layer. How the agent reaches tools, data, and other agents. Two protocols matter: MCP for tools and data (how does my agent talk to a database or API), A2A for agent-to-agent (how does my agent talk to someone else’s agent). Both live at this layer. Chapter 3 covers them together.
Actual systems. Databases, APIs, SaaS apps, internal tools. The layer you already know. Agents don’t replace it; they plug into it.
Why “ADK vs MCP” Is a Category Error
With the layers in front of you, the question dissolves.
ADK lives at the orchestration layer. It decides what an agent does. MCP lives at the access layer. It defines how an agent — built with ADK or anything else — reaches a tool. An ADK agent can be an MCP client: when ADK wants to call a tool and that tool happens to be exposed as an MCP server, ADK uses MCP to make the call. When the same tool is a plain Python function, ADK calls the function directly. MCP is one of several ways ADK reaches tools — not a competitor to ADK.
Conversely, MCP doesn’t care which framework is on the other end. An MCP server your data team ships doesn’t know whether the agent talking to it was built with ADK, LangGraph, CrewAI, or a plain Python loop. It just sees a client speaking MCP.
The same logic applies to every apparent “framework vs protocol” debate: Claude Agent SDK vs MCP, LangGraph vs A2A, OpenAI Agents SDK vs MCP. All three are category errors. The framework does orchestration; the protocol does access. They compose.
The corrected mental model in one sentence: Frameworks sit at the orchestration layer and decide what an agent does; protocols sit at the access layer and decide how the agent reaches the outside world. They compose. They do not compete.
A useful discipline: when you read about a new agent technology, ask what layer does this sit on? before you form an opinion. If you can’t answer, you don’t understand it well enough yet. Most of the apparent complexity of the agent landscape evaporates once you can slot each piece into the layer it belongs on.
What to take from this chapter: The agent stack has four layers — orchestration, LLM, access (MCP + A2A), actual systems. Frameworks, models, and protocols are compositional, not competitive. The single most common mistake in agent-development discourse is treating a framework (ADK, LangGraph) as if it were competing with a protocol (MCP). It isn’t. Everything in this booklet is an elaboration of that picture.
Chapter 3: The Protocol Layer — MCP and A2A
Two protocols sit at the access layer of the agent stack. MCP handles agent-to-tool and agent-to-data traffic. A2A handles agent-to-agent. They’re peer standards, and almost every enterprise architecture conversation about agents eventually comes back to one or both.
MCP is settled. A2A is close to settled but still contested at the edges. This chapter covers both.
MCP: The HTTP of the Agent Era
Most technology standards spend years in a messy middle period where competing protocols fight for adoption. MCP had an unusually short one. Announced by Anthropic in late 2024, it went from “interesting open-source proposal” to “effective industry default” in about eighteen months.
By early 2026 the numbers are hard to argue with: more than 97 million SDK downloads per month across the Python and TypeScript implementations, more than 10,000 publicly indexed MCP servers, native support in Claude, ChatGPT, Cursor, every major IDE that ships an AI feature, and every framework covered in this booklet. In November 2025, Anthropic donated MCP to the Linux Foundation, which formed the Agentic AI Foundation to govern it — co-founded by Block and OpenAI, with additional backing from Google, Microsoft, AWS, Cloudflare, and Bloomberg. That’s not the roster of a contested standard. That’s the roster of a settled one.
The comparison everyone reaches for is HTTP. It’s a good comparison. HTTP is also not something you “choose” — it’s the ambient protocol that lets any browser reach any server. MCP is becoming that for agents: the ambient protocol that lets any agent reach any tool or data source. Increasingly, not supporting it is more expensive than supporting it.
What MCP Actually Standardises
Three things, not one. Enterprises that treat it as a simple API gateway miss most of the value.
Tools — executable functions the agent can invoke. A tool has a name, a short description, an input schema with field-level descriptions, and structured error responses. The agent calls tools/list to see the catalog, then tools/call to invoke a specific tool with arguments. The important design choice is that tool descriptions are written for a language model to reason about, not for a human reading docs. A well-designed MCP server is closer to a prompt-engineered API than to a conventional REST endpoint — the linguistic quality of tool descriptions is part of the server’s correctness.
Resources — read-only data the server makes available. Documents, database rows, configuration files, policy knowledge. The agent doesn’t “call” a resource; it fetches it and places the content into its own context. For enterprise deployments, resources often matter more than tools. An internal policy bot wants a resource tree (policies/hr/parental-leave.md, policies/security/acceptable-use.md) the agent can browse and pull from — not a tool called get_policy_document(id) it has to guess how to invoke.
Prompts — reusable prompt templates the server offers to clients. The least-used of the three, but the reason it exists is principled: tools are things to do, resources are things to read, prompts are things to say. A complete MCP server can offer all three.
The Handshake
MCP is a client-server protocol over a small JSON-RPC vocabulary. Initial connection does one round of negotiation (initialize): protocol version, capabilities, identity. After that, it’s just method calls — tools/list, tools/call, resources/list, resources/read, prompts/list, prompts/get. The vocabulary is small on purpose. Most of the interesting design work happens inside the server — how you model your domain, how you write your tool descriptions, how you design your resource tree — not in the protocol itself.
What MCP Looks Like in an Enterprise
The shape is almost always the same. A handful of internal MCP servers sit in front of existing systems — the CRM, the ticketing platform, the data warehouse, the internal knowledge base, the identity system. Each server is maintained by the team that owns the underlying system, because that team understands the domain semantics best. Any number of agents — built on any framework, using any model — connect as MCP clients.
This gives three enterprise-grade properties that explain most of the adoption curve. One: re-use across agents (one server, many consumers — dramatically better than the pre-MCP world where every agent integrated every backend separately). Two: re-use across frameworks (switch from LangGraph to ADK next year and the servers don’t need to change). Three: foundation governance (no single commercial interest can break compatibility or shift licensing).
The 2026 Roadmap — Enterprise Readiness
The Agentic AI Foundation’s 2026 roadmap lists four priority areas, and the first is explicitly enterprise readiness. Concretely: identity and access integration (OAuth 2.1, enterprise SSO, scoped tokens), management and observability (gateway behaviour, audit trails, admin consoles), and transport and configuration portability (streaming, cancellation, resilience under enterprise network conditions).
The subtext is that MCP is consciously being reshaped from a developer protocol into an enterprise protocol. It’s the same transition HTTP went through from 1993 to 1999, compressed into about a year.
What MCP Does Not Do
MCP is deliberately dumb about a lot of things. It doesn’t orchestrate — it doesn’t know which tool to call, in what order, or what to do on failure. That’s the orchestration layer’s job. It doesn’t authenticate end users on its own — the auth model is an enterprise deployment decision. It doesn’t handle agent-to-agent communication — that’s A2A, below. And it doesn’t replace your existing APIs; it sits in front of them.
One prediction worth making explicit: in eighteen months, if you have a non-trivial internal platform, you’ll almost certainly have MCP servers in front of it — written by your own team, maintained as part of normal platform engineering work. External vendors will ship MCP servers for their own products (GitHub, Linear, Notion already do), but your internal systems are yours to wrap. The enterprise-grade art of writing good MCP servers — tight tool surfaces, precise descriptions, strong auth, clean resource hierarchies — is an emerging platform-engineering craft that didn’t exist two years ago.
A2A: The Other Protocol
If MCP answers “how does my agent talk to a tool,” A2A answers “how does my agent talk to another agent.” These are the two traffic directions at the access layer, and they deserve symmetric attention. Most enterprise projects focus on tools and data first because that’s where the immediate wins live. Agent-to-agent sounds like a future problem — until, around the twelfth agent an organisation builds, it stops being a future problem and becomes urgent.
Why It Becomes Urgent Faster Than Teams Expect
Specialisation: a general-purpose support agent hits its limits, and the team carves out a billing specialist, a technical specialist, a compliance specialist. Organisational boundaries: sales builds a sales agent, HR builds an HR agent, finance builds a finance agent, and when a sales rep asks about commission policy the three have to cooperate. External counterparts: a vendor’s agent negotiates with your procurement agent, and the only viable way for them to cooperate is an open protocol. Composition: the microservices pattern is replaying at the agent level, and systems of agents need a protocol.
What A2A Does
Three things. Capability discovery — an A2A-compatible agent advertises an “agent card” describing what it can do, what inputs it expects, what outputs it produces, how to authenticate. Task delegation — a structured way to hand off a task, supporting both synchronous request-response and longer-running asynchronous interactions with streaming progress. Authentication and trust — hooks for carrying authentication context across the delegation, so downstream agents can make their own authorisation decisions rather than trusting upstream blindly. The third point is the one enterprises underweight. Cross-agent calls can cross organisational or even company boundaries, and whose authority is being exercised at each hop becomes an interesting question.
Where A2A Sits in the Landscape
Messier than MCP’s. Google ADK has native A2A support and auto-generates agent cards — one of the strongest reasons to take ADK seriously for multi-agent architectures. CrewAI also supports A2A, reflecting its multi-agent-first design. LangGraph, OpenAI Agents SDK, Claude Agent SDK have partial or emerging support; all three have published roadmap items. Several other agent-to-agent protocols exist (Microsoft’s semantic-kernel-based coordination, the Agent Communication Protocol from academia) but A2A has the strongest momentum.
Our read in April 2026: A2A is the leading contender but not yet the slam-dunk consensus MCP is. The sensible bet for most enterprises is to design with A2A in mind, treat cross-agent protocols as an area where some rework may be needed in 2027, and avoid building a proprietary in-house variant.
What to Do About A2A Today
Most enterprises should not over-invest in A2A infrastructure yet. If you’re building your first agent, focus on MCP, focus on the orchestration layer, and treat A2A as something you’ll adopt when it becomes relevant. Premature A2A usually means designing a multi-agent system before the business has a use case for one.
Do not rule out A2A by accident. When you pick a framework, check its A2A story. A framework with no credible A2A roadmap is a quiet bet that multi-agent architecture won’t matter for your use case. Defensible for a single-purpose agent; risky for a platform.
Start writing agent cards. Even without formal A2A, the habit of documenting — for every agent you build — what it does, what it needs, what it returns is useful discipline. That discipline is cheap to formalise later.
What to take from this chapter: MCP is the settled protocol for agent-to-tool and agent-to-data access. A2A is the emerging protocol for agent-to-agent — less settled but gaining consensus. Both live at the access layer of the stack, underneath the orchestration layer. Treat MCP as ambient infrastructure: consume the vendor servers, build your own for internal systems. Treat A2A as an architectural concern you’ll act on when multi-agent needs become real — but pick frameworks with a credible A2A roadmap even if you’re single-agent today.
Chapter 4: The Orchestration Layer
With both protocols covered, the stack’s bottom floors are in place. The LLM can reason. MCP can reach tools and data. A2A can reach other agents. But something still has to decide what the agent actually does — which tool to call first, what to do if it fails, when to stop, when to ask the user, when to hand off.
This is the orchestration layer. It’s where the brain lives. And it’s the layer where most of the interesting framework arguments happen — because unlike the protocol layer (settled around MCP) and the model layer (a handful of dominant vendors), the orchestration layer is still genuinely contested.
Before we survey frameworks in the next two chapters, a more fundamental question: do you need a framework at all?
The LLM-in-a-Loop Baseline
The simplest possible agent architecture has no framework. It’s a while-loop, a language model with function-calling support, and a list of tools:
- Send the user’s request, the tool catalogue, and any prior conversation to the model.
- Model either returns a final answer (stop) or a tool call (continue).
- If tool call, execute it, append the result to the conversation, go to 1.
That’s it. No framework. No graph. No handoffs. For a surprisingly large class of enterprise use cases, this is adequate. It handles most chatbot-style assistants, most “agent over a specific tool surface” cases, most short-running tasks with fewer than a dozen tools.
This matters because the industry’s instinct is to reach for a framework immediately — often before the use case requires one. A framework has real costs: learning curve, abstraction tax, a layer of indirection between you and the model, production archaeology when something goes wrong. If you don’t need those costs yet, don’t pay them.
The first honest question in any agent project is whether you’ve actually tried an LLM-in-a-loop with good prompts and a clean tool surface. If not, you don’t yet know whether you need a framework.
In the cloud analogy, an LLM-in-a-loop is a plain EC2 instance with your own scripts. Not sophisticated, not impressive in a design review, often exactly the right tool for the job, and underused because it’s unfashionable.
When the Baseline Breaks
The baseline starts to hurt in a predictable set of scenarios. When you hit these, a framework earns its keep.
Non-trivial control flow. The model needs to plan, execute steps in parallel, then synthesise. Or a retry policy with backoff for one specific tool. Or a branch where the model decides between two sub-workflows. Expressing these cleanly in a loop gets ugly fast.
State and memory across turns. A loop with a long conversation keeps stuffing everything into the prompt until the context window overflows. A framework can maintain explicit state, summarise older history, checkpoint progress, and resume from a saved state. For any agent that lives longer than a single session, state management isn’t optional.
Multi-agent coordination. Once you have more than one agent, the baseline becomes wrong. Frameworks offer structured patterns for supervisor/worker hierarchies, specialist teams, A2A-mediated delegation. Building these without a framework is possible but rarely a good use of effort.
Guardrails and callbacks. Production agents need hooks. “Before any tool call, check permissions.” “After the model responds, run a bias/PII filter.” “If the agent spends more than five euros, stop and ask.” Frameworks give you named lifecycle points. A loop forces you to sprinkle the same checks throughout the code, which rots quickly.
Durability. A thirty-second process is fine in a loop. An eight-hour process isn’t — if the server restarts, you lose everything. LangGraph offers durable execution: state checkpointed, long-running agents pause and resume, crashes recoverable. Serious engineering concern for agents that do real work at scale.
Observability and evaluation. Production agents need traces, token-cost attribution, quality metrics, replayability. Frameworks either provide this or integrate with tools (LangSmith, Langfuse, Phoenix) that do. Rolling your own is a sizeable project — and one covered in its own chapter (Chapter 7).
Hit several of these at once and a framework stops being nice-to-have and becomes necessary infrastructure. Hit none and it’s mostly dead weight.
The Two Families
If you need a framework, the core decision of this booklet arrives: which one?
As of 2026, the landscape has clarified into two broad families.
The vendor frameworks. Google ADK, OpenAI Agents SDK, Claude Agent SDK, AWS Strands, Azure AI Agent Service. Each built by an infrastructure or foundation-model vendor, each optimised for its creator’s ecosystem. Pitch: developer velocity. Play the role AWS Elastic Beanstalk and Google App Engine played — opinionated, fast, and vendor-aligned.
The agnostic frameworks. LangGraph, CrewAI, and a small number of quieter contenders. Model-agnostic and cloud-agnostic. Pitch: portability and control. Play the role Kubernetes and Docker Compose played — more control, more work, more future-proofing.
Neither family is “better.” They solve different problems. Vendor frameworks are for teams that want to ship quickly and have made peace with a vendor commitment. Agnostic frameworks are for teams that want long-term portability and will pay the abstraction tax for it.
The next two chapters cover each family in turn: Chapter 5 surveys the vendors; Chapter 6 covers the agnostic ones and engages with the “models got too good” counter-argument that’s been gaining currency.
One More Mental-Model Correction
It’s easy — especially for developers coming from traditional software — to think of the orchestration layer as “the agent.” It isn’t. The orchestration layer is the manager. The actual intelligence is in the LLM layer below it. The actual capability is in the tools and data exposed via MCP. The actual value is produced by the systems at the bottom. A good framework is valuable the way a good project manager is valuable: it makes a team of smart specialists work well together. And no framework, however polished, rescues weak specialists.
The corollary: when agent projects fail, the instinct is often to switch frameworks. This is almost always wrong. The failure is usually in the tool surface, the prompt design, the evaluation harness, or the model choice — not the orchestrator. Diagnose first. If your first instinct when an agent misbehaves is to reach for a different framework, you’re probably treating the wrong disease.
Chapter 5: The Vendor Frameworks
Every major AI vendor now ships an agent development framework. Google has ADK. OpenAI has the Agents SDK. Anthropic has the Claude Agent SDK. AWS has Strands. Microsoft has Azure AI Agent Service. In cloud terms, each is something close to Platform-as-a-Service: an opinionated environment that makes building agents extraordinarily fast provided you stay within the vendor’s walled garden.
This chapter is a tour. For each framework: the design philosophy, what it’s genuinely good at, and the gravitational pull it exerts toward its parent vendor. Feature-by-feature comparisons go out of date in weeks. What’s stable is what kind of tool each framework is and what kind of bet you’re making when you pick it. Chapter 8 handles the switching-cost question directly — here the frameworks get described on their own terms.
Google ADK
Hierarchical agent trees. A root agent receives the user’s request and delegates to sub-agents, which may delegate further. Execution is managed by structural primitives ADK calls Sequential, Parallel, and Loop agents. The agent system is a tree; the framework runs the tree.
Three genuine strengths. Visual debugging — ADK ships with a CLI and a web UI where you chat with your agent, watch its internal reasoning, and step through execution. For complex multi-agent deployments, one of the better developer experiences in the market. Native A2A support — ADK auto-generates agent cards and handles the protocol plumbing. If cross-boundary multi-agent work is on your roadmap, ADK gives you the smoothest on-ramp. Multimodal capability — ADK agents natively process images, audio, and video through Gemini’s multimodal API, opening visual inspection, voice-based customer support, and document-understanding use cases.
Gravitational pull: Gemini, Vertex AI, BigQuery, Google Cloud. Technically model-agnostic, but every friction point in the ecosystem quietly points back to Gemini. This isn’t a criticism — vendor frameworks are supposed to do this.
Take ADK seriously if you’re already on Google Cloud, prioritise multi-agent with cross-boundary communication, have meaningfully multimodal use cases, or find the visual debugging accelerates you more than the framework’s opinions slow you down.
OpenAI Agents SDK
Explicitly anti-graph. Where LangGraph wants you to draw a state machine, OpenAI wants you to define a small number of agents, each with a clear specialty, and let them hand off to each other as needed. Mental model: a team of specialists with a receptionist who routes calls, not a flowchart. Four primitives — Agents, Tools, Handoffs, Guardrails. That’s the whole vocabulary.
Strengths. Developer velocity — fast to learn, fast to read, fast to maintain. For a team that wants an agent architecture they can fit in one file, the framework that most respects your time. Hosted tools and sandboxing — web search, file search, code interpreter run on OpenAI’s infrastructure with no setup. For agents that need to write and run code, the managed sandbox is a real differentiator. Voice and multimodal — the Realtime API is first-class, GPT-4o’s multimodality exposed cleanly.
Gravitational pull: OpenAI models, hosted infrastructure, structured-output reliability tuned for OpenAI. Swap models via routing libraries and you keep the control-flow semantics but lose most of the managed infrastructure that made the framework attractive.
Take it seriously if you’re committed to OpenAI models, voice or code execution matters, and you want the fastest possible path from concept to running agent without architectural ceremony.
Claude Agent SDK
Different tack. Built around the assumption that the agent will operate in a computer-like environment — reading files, running shell commands, writing code, searching the web. Ships with eight built-in tools out of the box (Read, Write, Edit, Bash, Glob, Grep, WebSearch, WebFetch). Design mentality: give the agent a computer and let it work.
Orchestration model: hooks and subagents. Hooks intercept lifecycle events (“before tool call,” “after model response”) so you can enforce guardrails or track behaviour. Subagents delegate tasks to child agents with their own tool surfaces and instructions. Where OpenAI organises work by handoffs between peers, Claude organises by delegation to children.
Strengths. Long-running autonomous work — tasks that take hours or days rather than seconds. Context compaction, state checkpointing, asynchronous execution are baked in. For “review this codebase and produce a migration plan” or “analyse the last year of tickets and propose top five automation candidates,” the framework that handles the long-running shape most naturally. Built-in tool surface — eight tools means agents start with real capabilities rather than empty registries. Meaningful head-start for developer-assistant use cases. Hooks as control surface — precise control over agent behaviour at lifecycle points, which enterprises appreciate for compliance and observability reasons.
Gravitational pull: this is the deepest coupling in the vendor-framework category. Claude has been specifically trained on computer-use tasks — file systems, shell commands, browsers. Other models have no equivalent training. Running the same SDK through a non-Claude model produces noticeably worse results. This is model-level behavioural coupling, not just ecosystem affinity.
Take it seriously if you’re building engineering-heavy workloads (coding assistants, system-administration agents), long-running autonomous tasks, and want a framework with real opinions about computer-use safety.
AWS Strands
The newest and most explicitly experimental. Leans heavily on letting the LLM drive rather than constraining it. Where LangGraph makes you define edges in a graph, Strands makes you define goals in natural language and relies on the model to decide how to achieve them. A bet that models are now capable enough to handle orchestration autonomously, and the framework’s job is to provide safe execution + AWS integration — not to impose control flow.
Strengths. AWS integration — deep wiring into Bedrock (models), Lambda (tools), DynamoDB (state). If your infra is AWS-native, Strands removes a lot of plumbing. Flexibility within Bedrock — native access to Claude, Llama, Mistral, and others; more model flexibility than the foundation-vendor frameworks within the constraint that you’re using Bedrock for model access. Experimental primitives — “AI Functions” (describe a goal in natural language, framework generates validation logic for the model’s output) — interesting but not yet battle-tested.
Gravitational pull: AWS infrastructure, not a single model. Inverted lock-in from the foundation-vendor frameworks.
Take it seriously if your infrastructure centre of gravity is AWS and Bedrock-mediated model flexibility is concretely useful.
Microsoft Azure AI Agent Service
AutoGen’s multi-agent patterns absorbed into Microsoft’s production offering. Emphasises integration with the Microsoft enterprise ecosystem: agents that trigger from Azure events, read SharePoint, post to Teams, coordinate with Microsoft 365 copilots.
Strengths. Integration breadth — for organisations on Microsoft 365, Dynamics, SharePoint, Power Platform, depth no other framework matches. Identity and compliance posture — decades of Microsoft enterprise compliance infrastructure inherited natively. SSO, conditional access, audit trails, data residency, sovereign cloud support all there from day one. Absorbed AutoGen patterns — multi-agent conversation patterns (debate, consensus, hierarchical coordination) carried forward from Microsoft’s open-source research into a managed service.
Gravitational pull: Microsoft ecosystem. Default model is OpenAI through the partnership, default runtime is Azure, default integrations are Microsoft 365. Live in that world and the framework accelerates you; don’t and you’re paying for integrations you can’t use.
Take it seriously if you’re a Microsoft-shop organisation, need the compliance posture, or are building agents that heavily interact with Microsoft 365 data and workflows.
Summary
| Framework | The Pitch in One Line |
|---|---|
| Google ADK | Best-in-class multi-agent + A2A, best debugging, deep Gemini/GCP pull |
| OpenAI Agents SDK | Fastest path from zero to running agent, OpenAI ecosystem, handoffs model |
| Claude Agent SDK | Strongest computer-use and long-running task story, deepest model coupling |
| AWS Strands | AWS-native, Bedrock-mediated model flexibility, most experimental primitives |
| Azure AI Agent Service | Deepest Microsoft 365 integration, strongest enterprise compliance posture |
Each is a reasonable choice for the organisation it was built for. None is a reasonable choice for every organisation.
What to take from this chapter: Vendor frameworks are the PaaS of the agent era — opinionated, fast, and deeply aligned with the vendor that built them. Each has a genuine strength and a specific gravitational pull. The right choice depends on which ecosystem you already live in and how much portability you’ll trade for speed. Chapter 8 handles lock-in consequences per vendor; this chapter established what each framework is on its own terms.
Chapter 6: The Agnostic Frameworks
There’s a smaller — and noisier — corner of the framework landscape where the defining feature is not being aligned with a model vendor. These are the agnostic frameworks. They assume you’ll want to swap models, clouds, and tool surfaces, and they optimise for that flexibility even at the cost of developer velocity.
Two dominate: LangGraph and CrewAI. Both predate most of the vendor SDKs, both have larger communities than any single vendor framework, and both are currently positioning themselves as the neutral middle layer — the Switzerland — that large enterprises will eventually want between themselves and the foundation-model vendors.
The cloud analogy holds tightly here. If the vendor frameworks are PaaS, LangGraph is Kubernetes and CrewAI is Docker Compose.
LangGraph as Kubernetes
Agent workflows as state machines. Nodes (steps). Edges (transitions). State flows through the graph. The framework is explicit about persistence: at every step the state is checkpointed, so if the server crashes or the agent pauses for human approval, the workflow resumes from exactly where it left off. Its insistence on structural explicitness is what gives it power — and what generates most of the complaints.
Strengths.
Durable execution. LangGraph’s defining capability. Long-running agents can pause for hours or days, wait for human input, survive server restarts, and resume without losing state. For regulatory approvals, multi-step workflows with human-in-the-loop steps, agents that run overnight — often the only tractable solution.
Observability and audit. LangGraph pairs naturally with LangSmith, which provides deep traces, evaluation harnesses, and audit-grade records of every model call, tool invocation, and state transition. For regulated enterprises, this paper trail can be the difference between a deployable agent and a blocked one.
Model agnosticism. LangGraph doesn’t care whether the underlying model is Claude, GPT, Gemini, Mistral, or a local Llama. The framework’s primitives are model-neutral. You can swap the model layer without rewriting the workflow — which is the entire point.
Production scale. The most mature production story of any agent framework. More than 400 production deployments publicly documented, including high-scale cases (Klarna’s customer support agent at 85 million users, 80% reduction in resolution time). The boring reliability features — connection pooling, retry semantics, backoff, rate limiting — that matter more in production than in demos.
The cost is a real learning curve. Developers coming from “just write the agent code” find LangGraph’s graph formalism verbose for simple cases. Critics call it “a very fancy if-else statement” — and for a three-step linear agent, they have a point. LangGraph’s value shows up at scale, in production, under edge conditions, not in demos.
Take LangGraph seriously if you’re in regulated industries, running long or human-in-the-loop agents, doing multi-model routing, facing board-level anti-lock-in mandates, or large enough to absorb the learning curve.
CrewAI as Docker Compose
Work organised the way a human team does. Create agents, give each a role (“Senior Data Analyst”), a goal (“find trends in Q2 sales data”), and a backstory. Assemble into a “crew” and give the crew a task. CrewAI orchestrates the team. Business stakeholders can read a CrewAI agent definition and understand what’s happening — that friendliness is both the greatest strength and the greatest weakness.
Strengths.
Prototyping speed. Nothing in either framework category can get a multi-agent prototype running as fast. For workshops, proofs-of-concept, demos, often the fastest path.
Role-based reasoning. For tasks where the natural breakdown is “a team of specialists” — research pipelines, content creation, analysis workflows — the role-based abstraction is elegant. Fits the shape.
Protocol support. Native MCP and A2A support, earlier than most frameworks. Multi-agent-first design.
Community momentum. 44,000+ GitHub stars, active Discord, a growing commercial ecosystem. Wealth of examples you can borrow.
The cost: abstractions optimised for prototyping, not production. Thinner state management, limited checkpointing. For a long-running mission-critical agent, CrewAI forces you to solve enterprise concerns outside the framework — at which point its main value (quick prototyping) has been outgrown.
Take CrewAI seriously if you’re running rapid POCs, workshops, or multi-agent workflows where role-based decomposition fits naturally and time-to-demo matters more than time-to-five-nines.
“But Models Got Too Good” — the Counter-Argument
Before anyone adopts an agnostic framework, they should understand the strongest argument against doing so.
Agnostic frameworks exist in part because early LLMs were weak. Context windows were small, so frameworks added summarisation and memory management. Reasoning was unreliable, so frameworks added structured control flow. Function calling was crude, so frameworks added tool-validation scaffolding. Hallucinations were frequent, so frameworks added retry and validation layers. A significant portion of what LangGraph does was developed to patch model limitations.
By 2026, underlying models are dramatically better. Context windows in millions of tokens, not thousands. Reasoning reliable enough that a single well-prompted model call handles tasks that required a multi-step graph eighteen months ago. Function calling precise. Structured output native. For a meaningful slice of use cases, “do I need a framework to orchestrate this?” now answers “probably not, if the model is good enough.”
This argument has real weight. A developer building an agent in 2026 has at least three options that didn’t exist in 2024: a plain LLM-in-a-loop (works for more cases than it used to), a vendor SDK with aggressive hosted features (eliminates most integration plumbing), and an AI coding assistant that generates custom agent scaffolding in ten minutes. All three compress the space where an agnostic framework is the right answer.
The abstraction tax is real too. LangGraph adds layers between developer and model. When something fails, you have to dig through those layers to find the prompt that actually produced the bad output — “production archaeology.” For simple agents, that cost exceeds the portability benefit.
The honest version of the pro-agnostic case in 2026 is narrower than it was eighteen months ago. Agnostic frameworks still win decisively when you have: durability needs vendor SDKs don’t offer, multi-model routing driven by compliance or cost, deep regulated-industry audit needs, or a serious organisational mandate against vendor lock-in. Outside those, the case for paying the abstraction tax is weaker than it used to be.
When the Agnostic Case Is Strongest
Regulated industries where swap-ability is a compliance concern. When a regulator asks “can you demonstrate that this system doesn’t depend on a single vendor’s continued good behaviour?”, you want a framework that lets you answer yes.
Multi-model routing. When one workflow requires a local, open-weight model for compliance reasons and another benefits from a frontier API, a framework designed for routing is vastly easier than one assuming a single model.
Long-running, high-stakes, human-in-the-loop. Agents that pause for days or weeks, need bulletproof resumability, and carry audit-grade state through complex approval flows. LangGraph’s durable execution is hard to replicate from scratch.
Board-level lock-in mandates. In some enterprises the “must be model-agnostic” requirement is a C-suite directive with legal and commercial weight. The conversation about whether the vendor SDK is “good enough technically” is moot.
Large engineering organisations with bandwidth to absorb the learning curve. Easier to tolerate the abstraction tax with ten engineers contributing to the platform than two trying to ship before quarter-end.
For a typical enterprise in 2026, the honest answer is: probably a vendor SDK today, with a realistic option to migrate to an agnostic framework in 2027 if scale, regulatory environment, or vendor relationship demands. For a regulated enterprise — particularly in Europe, where Chapter 9’s compliance concerns bite — the answer tips more strongly toward the agnostic camp from the start.
Insurance policies have costs. They pay off in specific scenarios. Whether they’re worth it depends on how much risk you’re carrying and how much you believe the scenarios will come to pass. That’s the agnostic case, honestly stated.
Chapter 7: Observability, Evaluation, and Cost
Walk into a room of engineers debating agent frameworks and you’ll hear about control flow, state management, and multi-agent patterns. Walk into the room where the same engineers explain their agent deployment to the CFO and you’ll hear three questions: does it work, how do we know, and how much is it costing us?
The first room is where technical arguments happen. The second is where budget approvals happen. Enterprises that succeed with agents have figured out that the second room is where most of their engineering effort actually needs to land. This chapter is about the layer that sits across every framework, every model, and every protocol — and that becomes the difference between a deployed pilot and a shelved one.
In cloud terms, this is the Datadog layer. The Splunk layer. The layer that in mature cloud deployments represents a significant fraction of total infrastructure spend — and in mature agent deployments will do the same.
Why Agent Observability Is Harder Than Microservice Observability
In a classical web service, observability is well-understood. Log requests, trace distributed calls, measure latencies, alert on errors, compute SLOs. The surface is stable. The error modes are known. Agents break most of these assumptions.
The “correct” output isn’t well-defined. A web service either returns 200 or it doesn’t. An agent returns natural language, a tool call, a partial answer, a confidently-wrong answer. No single status code for “the agent was wrong.”
Execution is non-deterministic. Same agent, same input, different tools called, in different orders, with different arguments across runs. Debugging by reproducing the failing case is harder.
The feedback loop is slow. A bug in a service produces an immediate alert. A quality bug in an agent may not surface until a user flags an inaccurate answer a week later — and by then the offending model version, prompt, and conversation may all be different.
Cost is attached to quality. A verbose, hallucination-prone agent isn’t just wrong — it’s also expensive, because it calls more tools, retries more often, and burns more tokens per interaction. Quality and cost are entangled.
The implication: classical observability is necessary but not sufficient. You need traces and errors like you do for a microservice. You also need an agent-specific layer — trajectory recording, evaluation, token-cost attribution, human-review workflows — with no equivalent in traditional tools.
The Four Pillars
The agent-specific tools that emerged in 2025-2026 — LangSmith, Langfuse, Phoenix, Braintrust — are different attempts at the same problem. They organise around four pillars.
Traces. Every agent interaction produces a trace: the sequence of model calls, tool invocations, sub-agent delegations, state transitions that led to the output. A good trace lets you replay exactly what happened, with every prompt, response, and intermediate decision visible. For debugging, non-negotiable. For audit, often mandatory under regulatory frameworks like the EU AI Act.
Evaluation. An agent without an eval harness is an agent you can’t improve. You can change the prompt, swap the model, tweak the graph, and hope — or run changes against a corpus of representative inputs with known expected outputs and measure the difference. Evaluation is the least glamorous part of agent engineering and one of the highest-leverage. Teams that invest in good eval sets ship faster, iterate with more confidence, and catch regressions before users do.
The trickiest part of agent eval is that the “correct” output is often a range, not a string. That’s where LLM-as-judge evaluation comes in: using one model to grade another’s outputs against rubrics. Done well, scales evaluation dramatically. Done badly, measures nothing while looking rigorous.
Cost attribution. Agents produce costs at multiple layers: model inference, tool invocations (paid APIs), orchestration compute (LangGraph durability storage, OpenAI hosted sandbox), human review. Attributing these costs by user, workflow, feature, and team is what separates a deployment that stays under budget from one that burns through the quarterly AI budget in six weeks. Tooling is still early — most enterprises are building internal dashboards rather than buying off-the-shelf — but it’s improving fast.
A specific warning: the token cost of a single agent interaction can vary by an order of magnitude depending on tool calls, context pulled in, and retries. Cost observability needs to be per-interaction, not just per-month, or the long tail will bite you.
Quality signals. Beyond structured evaluation, production agents need lightweight continuous signals. User thumbs-up/thumbs-down, drop-off rates, follow-up message patterns (“that’s wrong,” “no, I meant…”), time to resolution. The agent equivalents of error rate and latency percentiles. Capturing them and feeding them back into eval sets and prompt iteration is the machinery of continuous improvement.
The Tool Landscape
| Tool | Positioning |
|---|---|
| LangSmith | LangChain/LangGraph ecosystem. Deepest integration with LangGraph, strongest eval + trace story in the agnostic camp. |
| Langfuse | Open-source alternative, vendor-agnostic, strong self-host story for data-sensitive deployments. |
| Phoenix (Arize) | Evaluation-centric, broad model support, ties to ML observability tooling. |
| Braintrust | Evaluation-first with focus on LLM-as-judge at scale. |
| W&B Weave | Weights & Biases extension into LLM observability. |
| Vendor native | Each vendor SDK ships its own basic observability. Serviceable for single-vendor, weak for multi-vendor. |
The strategic point: the agent observability layer is quickly becoming its own software category, analogous to APM for cloud. Enterprises will spend real money on the tools that make the difference between operational and dysfunctional.
Cost Governance Is Not Optional
One of the easiest-to-ignore failures in early agent deployments is runaway spend. A well-designed agent calling three tools per interaction at €0.003 each is cheap. The same agent under pressure — more retries, more context, more tool calls, more self-reflection — can easily 10x its cost without anyone noticing until the invoice arrives.
A small set of practices separates disciplined deployments from undisciplined ones. Per-interaction cost budgets (the agent knows its own cost limit and stops when approaching it). Per-user or per-tenant caps (an abusive user or buggy integration shouldn’t burn the monthly AI budget in a day). Model routing for cost (the expensive model for hard questions, the cheap model for routing and classification — savings compound quickly). Tool-call budgeting (if an agent calls five tools when two would do, that’s both a quality and a cost issue). Compaction and context hygiene (context compaction, prompt caching, disciplined prompt engineering can cut costs by 3x+ without touching model quality).
This isn’t exotic material. It’s the same discipline cloud engineers developed around reserved instances, autoscaling, and tag-based chargeback. The agent era will develop its own version. Enterprises that build this discipline early will spend materially less per unit of agent value.
The Regulatory Forcing Function
For European enterprises especially — and Chapter 9 returns to this — observability isn’t just developer convenience. It’s a regulatory requirement. The EU AI Act requires deployers of high-risk AI systems to maintain logs allowing traceability throughout the system’s lifecycle, retain those logs for at least six months, and demonstrate human oversight. You cannot satisfy those requirements without an observability layer.
The implication: for regulated enterprises, the observability stack is compliance infrastructure before it’s quality infrastructure. The choices — trace granularity, retention periods, access controls, audit workflows — have legal consequences, not just operational ones. Chapter 9 covers the architecture of a compliant EU deployment in detail.
Very few enterprises in 2026 have a mature observability practice in place. Most are somewhere between “we log model calls” and “we have a dashboard but nobody looks at it.” The gap between those two states and “this works” is the single biggest predictor of whether an agent program matures into something strategic.
Chapter 8: The Lock-In Question
The framework vendors say you can swap models. The agnostic frameworks say you can’t — or, rather, that you can in theory but not in practice without broken features and degraded behaviour. Who is right?
This chapter takes the question seriously, one vendor at a time. For each major vendor SDK: if you build on it today with the default model, and tomorrow you decide to swap the model, how much of your agent still works? And what breaks first?
The answer is more variable than the abstract lock-in debate suggests. Some vendor SDKs are almost genuinely model-agnostic. Some market themselves as agnostic but have deep hidden couplings. One is explicitly trained into model behaviour and shouldn’t even be called agnostic. The nuances matter because they determine the real engineering cost of a migration, not the theoretical one.
The lock-in honesty test: if I swap the underlying model in this framework, which of the framework’s headline capabilities still work? The honest answer ranges from “most” to “almost none.”
Google ADK — Medium Lock-In (Ecosystem, Not Model)
Nominally model-agnostic, and it is in the basic case. But three headline capabilities degrade on a swap.
Multimodal — deeply integrated into ADK through Gemini’s API. Swap to a text-only model and you lose a class of use cases entirely. Swap to a multimodal model from another vendor and you pay for custom integration work to reach parity.
Agent-card generation — ADK auto-generates A2A cards based on Gemini’s function-calling behaviour. Other models produce less predictable function-call outputs, which makes the auto-generated cards less reliable. You can fix it, but it becomes manual rather than automatic.
Vertex integrations — ADK’s most frictionless integrations are with Vertex AI for deployment, BigQuery for data, Google Cloud for compute. These don’t go away on a model swap, but they become less natural if you’re also moving off Google Cloud.
The orchestration structure (hierarchical agent tree, Sequential/Parallel/Loop primitives, visual debugger) continues to function on other models — just with more manual work at the edges.
Verdict: medium lock-in. The framework itself is moderately portable; the ecosystem around it isn’t.
OpenAI Agents SDK — High Lock-In (Hosted Features)
Partially agnostic. The SDK can be pointed at non-OpenAI models through routing proxies like LiteLLM. Mechanism exists. What breaks on a swap:
Hosted tools — web search, file search, code interpreter are OpenAI-hosted. Disappear the moment you swap providers.
Sandbox execution — the managed sandbox for code execution runs on OpenAI infrastructure. Swap providers and you either lose the sandbox or rebuild it yourself at non-trivial cost.
Handoff reliability — the handoff mechanism relies on OpenAI’s structured-output and function-calling reliability. Other models handle structured output, but not always with the same reliability profile. Subtle changes can make previously-working handoffs flaky.
Voice and Realtime API — OpenAI-specific. Voice use cases are effectively OpenAI-only.
Basic agent and handoff primitives continue to function on other models. Simple text-only agents with custom tools run on non-OpenAI models with modest friction.
Verdict: high lock-in, but more about hosted features than model behaviour. If you’re not using the hosted tools, sandbox, or voice, the SDK is more portable than its reputation. If you are — and most compelling OpenAI-SDK use cases are — the migration cost is substantial.
Claude Agent SDK — Very High Lock-In (Model Behaviour)
Not agnostic, and Anthropic doesn’t pretend otherwise. The SDK is named for Claude, built around Claude’s training, presumes Claude underneath.
Computer-use fidelity — Claude has been specifically trained on computer-use tasks (reading screens, running commands, navigating file systems). Other models haven’t had equivalent training. Running Claude Agent SDK workflows through a non-Claude model produces unpredictable output — hallucinated screen coordinates, misunderstood Bash semantics, failed file manipulation.
Built-in tools — the eight tools (Read, Write, Edit, Bash, Glob, Grep, WebSearch, WebFetch) are designed around prompt patterns Claude responds to well. They work with other models but precision and safety degrade noticeably.
Long-running session behaviour — Claude’s context compaction is trained into the model. Other models handle long contexts differently, sometimes worse.
Hooks and subagents — these structural primitives work with any model, but the benefit depends on model reliability at following hook contracts, which is Claude-specific training.
Verdict: the deepest lock-in of any framework in this chapter — and honestly, that’s by design. Not a generic agent framework that happens to come from Anthropic. A framework specifically built to exploit Claude’s training. Pick it and you’re picking Claude. That’s a reasonable bet if you’ve already made that decision; it’s a bad bet if you need portability.
AWS Strands — Cloud Lock-In, Model Flexibility
Inverted. Uses Bedrock for model access, which means native support for Claude, Llama, Mistral, and other Bedrock-hosted models.
Swapping within Bedrock is relatively painless — one of Strands’s strongest design features. Want to run the same agent on Claude today and Llama tomorrow? Strands supports it more gracefully than any foundation-vendor SDK.
Swapping off Bedrock breaks the framework. The cost is at the cloud layer, not the model layer. Strands assumes Bedrock for models, Lambda for tools, DynamoDB for state. Leave AWS and you’re rewriting the deployment from scratch.
Verdict: inverted lock-in from the foundation-vendor frameworks. Flexible on model (within Bedrock), locked to cloud (AWS). For AWS-native enterprises, often the best fit in the vendor-framework category. For cloud-portable enterprises, the worst.
Azure AI Agent Service — Microsoft Ecosystem Lock-In
Partially model-agnostic. Leans on OpenAI models through the Microsoft partnership, but non-OpenAI options exist.
Microsoft 365 integrations — SharePoint, Teams, Outlook, the whole M365 surface. Main reason you chose the service. Don’t depend on a specific model but are useless outside the Microsoft ecosystem.
Compliance and identity — Microsoft’s enterprise compliance infrastructure is a feature of the service. Don’t lose it on a model swap, lose it entirely leaving Azure.
Managed runtime — runs on Azure. The managed service doesn’t port.
Verdict: ecosystem lock-in, not model lock-in. Valuable to the extent you’re committed to the Microsoft enterprise ecosystem. If you are, a feature. If you’re trying to stay neutral, a trap.
Summary
| Framework | Nominal | Actual Depth | What Breaks First on Model Swap |
|---|---|---|---|
| Google ADK | Yes | Medium (ecosystem) | Multimodal, auto-A2A cards, Vertex integrations |
| OpenAI Agents SDK | Partial | High (hosted features) | Hosted tools, sandbox, voice, handoff reliability |
| Claude Agent SDK | No | Very High (model behaviour) | Computer-use, built-in tools, long-session behaviour |
| AWS Strands | Yes within Bedrock | Cloud lock-in | Leaves AWS and nothing survives |
| Azure AI Agent Service | Partial | Microsoft ecosystem | M365 integrations, Azure runtime |
| LangGraph | Yes | Low | Minimal — this is the design goal |
| CrewAI | Yes | Low | Minimal — this is the design goal |
Lock-In Is Not Always a Problem
Before declaring all vendor frameworks disqualified, a symmetric point. Lock-in isn’t automatically bad. It’s a trade.
Enterprises that locked into AWS in 2010 had a more expensive migration in 2018 than enterprises that used Kubernetes from day one. They also shipped faster in 2010, 2011, 2012, 2013, 2014 — and captured business value during those years that the more portable enterprises were still writing architecture documents about. In many cases, the velocity advantage compounded faster than the lock-in cost accumulated.
Same logic today. An enterprise that picks OpenAI Agents SDK in 2026 and ships three customer-facing agents by Q2 2027 has captured value an enterprise still arguing about LangGraph vs CrewAI hasn’t. If the eventual migration off OpenAI is expensive — and it might be — that’s a future cost to weigh against a present benefit.
The lock-in question, honestly asked, is not “will there be a cost?” The answer is yes. The question is “how does the cost of this future migration compare to the value I capture in the meantime?” For many enterprises the math is favourable. For others — particularly regulated, where the future migration may be involuntary and urgent — the math is unfavourable, and they should invest in portability from day one.
When Lock-In Becomes Structural
Four conditions tip the math toward portability being worth the cost.
Regulatory risk of forced migration. A new regulation could plausibly force you off a specific vendor. You need the portability before the regulation, not after.
Pricing power of the vendor. Your workload becomes dependent on a single vendor, and they can raise prices unilaterally. You’ve handed over your margin. Portability is leverage.
Strategic importance of the workload. A stake-the-company deployment has higher migration risk than departmental automation. Further up the business-critical spectrum, the more portability insurance is worth.
Data sovereignty. Your regulatory environment may require sovereign infrastructure. Vendor frameworks that assume their own infrastructure underneath become liabilities.
Outside these conditions, vendor lock-in is a real but manageable cost. Inside them, the cost is strategic, not operational — and strategic costs are the ones that put CEOs in uncomfortable board meetings.
What to take from this chapter: “Model-agnostic” in vendor-framework marketing usually means “you can technically swap the model” — not “the framework’s headline capabilities survive the swap.” Lock-in ranges from low (LangGraph, CrewAI) to medium (ADK) to high (OpenAI Agents SDK) to very high (Claude Agent SDK). AWS and Azure invert the pattern: model-flexible, cloud-locked. For most enterprises, vendor lock-in is a price worth paying for velocity, until it isn’t — and the conditions under which it isn’t are regulatory risk, vendor pricing power, workload criticality, and data sovereignty. Know which of those apply to you before committing.
Next: Chapter 9 — The EU Angle
Chapter 9: The EU Angle
Chapter 1 gestured at the idea that the EU might leapfrog the vendor-lock-in phase of the agent transition, the same way it leapfrogged the worst of the cloud-lock-in phase a decade ago. This chapter makes the case explicitly. It is the core strategic argument of the booklet for European readers.
The claim: the combination of the EU AI Act, GDPR, data sovereignty norms, and the sovereign AI movement gives European enterprises both the motivation and the top-cover to adopt agent architectures that are model-agnostic, multi-region, and observability-heavy from day one — rather than going through deep vendor commitment and painful migration.
This is a real option. Not the only option. And whether a given enterprise should take it depends on specifics. But the structural forces point one way.
The AI Act as a Forcing Function
The EU AI Act, which began phased enforcement in early 2025 and reaches its most consequential deadlines in August 2026 and August 2027, is not primarily a framework decision. It’s a forcing function that shapes the architecture around frameworks.
The provisions that matter most apply to “deployers” of high-risk AI systems — which, crucially, most enterprises using agents will be. Deployers must: ensure human oversight of the system’s decisions, maintain logs allowing traceability throughout the system’s lifecycle, monitor the system and report serious incidents, retain logs for at least six months, and conduct a fundamental rights impact assessment for certain categories.
Translate these into architectural implications.
Traceability requires observability. A production agent without a structured trace log doesn’t meet the logging requirement. Chapter 7’s observability stack isn’t optional investment — it’s compliance infrastructure.
Human oversight requires integration hooks. The agent can’t be a black box. Humans must inspect, override, intervene. Frameworks with strong callback and hook models have an easier time satisfying this than frameworks expecting autonomous execution.
Log retention drives data sovereignty. Six-month retention, particularly for agents handling personal data, invites the question: where are logs stored, who has access. Storing them in a US-vendor’s infrastructure creates cross-border data transfer problems. European infrastructure, under European legal control, is the default safe answer.
Impact assessment requires transparency. For any high-risk use case, describe what the system does, how, what rights it affects. Opaque black boxes — from the framework or the model vendor — make this harder than it should be.
Taken together, this pushes toward a specific shape of agent architecture: observability-heavy, human-in-the-loop, sovereignty-respecting, auditable end-to-end. That shape aligns with the agnostic-framework + European-infrastructure position more naturally than with the deeply-integrated vendor-SDK position.
A Worked Example: AI Act Logging → Architecture
An agent deployed for credit advisory in a European retail bank sits squarely under “high-risk” classification. The Act’s logging-and-traceability requirement translates step by step as follows.
Which framework hooks. You need a pre-model, post-model, pre-tool, and post-tool hook — every event tagged with a session-scoped trace ID and a user identifier. In LangGraph this is one callback handler attached to the graph. In OpenAI Agents SDK it’s the hooks parameter plus a custom guardrail. In Claude Agent SDK it’s the built-in hooks API. ADK exposes lifecycle events via its agent tree. The framework determines how much of this you write versus configure.
Which observability storage. Traces go to an append-only store with legal-hold capability. Langfuse self-hosted on Azure EU region, LangSmith self-hosted, or a custom object-store + query layer. The store must support PII redaction rules at write time and selective replay at read time (for audit) without re-hydrating redacted fields.
Which retention policy. AI Act minimum is six months. For credit advisory, internal banking regulation pushes it to seven years. Retention tiers — hot for 90 days, warm for 12 months, cold for seven years — mapped to storage costs roughly 1× / 0.3× / 0.05× per GB/month.
Which access control. The trace data is regulated personal data. Access requires a ticket + approver + logged read. This is identity-layer infrastructure — not the agent framework’s job — but it must bolt onto the observability store cleanly.
Three paragraphs that framework-choice discussions rarely get to. In regulated deployments they’re the first paragraphs.
Data Sovereignty: Sharper Than It Used to Be
For cloud adoption, data sovereignty was a slow, quiet concern that mattered for some workloads. For AI adoption, it has hardened.
Three reasons. AI training and inference are more entangled with data than compute traditionally was — when you send a request to OpenAI’s API, you send not just a query but the context, system prompt, tool outputs, and any data the model should consider. For enterprise agents this context routinely includes personal data, confidential business data, or regulated information. The privacy surface of agent usage is structurally larger than the privacy surface of running a web server.
National AI strategies have elevated the issue politically. Every major European country has articulated a sovereign-AI posture — that critical AI infrastructure shouldn’t be entirely dependent on US or Chinese vendors. Not just rhetoric; it’s producing concrete funding, infrastructure, and regulatory action. For enterprises in regulated sectors, aligning with national AI strategy is increasingly part of being a good corporate citizen.
The EU is actively investing in sovereign alternatives. Public-sector funding for European models and sovereign-cloud providers makes the “European alternative” story more credible than for cloud. Whether they’ll be competitive at the frontier is uncertain. Whether they’ll be adequate for a wide range of enterprise use cases is less uncertain — they probably will.
Net: running all your agent workloads through US-hosted APIs is a more politically charged decision in 2026 than running your cloud workloads through US-hosted infrastructure was in 2016. That charge affects strategic choices, even when the letter of the law doesn’t require a specific architecture.
Sidebar: The Sovereign-AI Landscape
A compressed orientation to who’s actually shipping sovereign alternatives, because many architects assume the field is thinner than it is.
Model providers. Mistral (France) — the most frontier-credible European lab, with Mistral Large and a growing open-weight family. Aleph Alpha (Germany) — enterprise-focused, with Pharia-class models designed for regulated deployment and strong German-language performance. Stability AI (UK) — image and text models with liberal licensing. Silo AI (Finland, acquired by AMD) — multilingual European models. Plus the usual open-weight incumbents that can be run on European infrastructure: Meta’s Llama family, the Qwen series, Gemma.
Sovereign cloud and inference. OVHcloud (France), Scaleway (France), IONOS (Germany), Hetzner (Germany), Exoscale (Switzerland) — all offering EU-only inference regions with contractual data-residency guarantees that US hyperscalers increasingly match via their EU sovereign offerings but don’t always start from. Several national-cloud initiatives (Germany’s Delos, France’s Bleu via Orange + Capgemini + Microsoft) target strictly the public sector.
European observability. Langfuse is the notable open-source option, self-hostable on European infrastructure. LangSmith self-hosted is available but newer. A handful of sovereign-cloud-native observability vendors are emerging.
The sovereign-AI story isn’t perfect — frontier-capability gaps remain and will persist for some workloads — but it’s credible enough that “we can only use US APIs” is, in 2026, usually a statement about budget or convenience rather than about availability.
The EU Routing Pattern
A specific architecture is gathering adherents across European enterprise AI programmes. Worth naming explicitly.
One. A neutral orchestration layer. Usually LangGraph, sometimes CrewAI, occasionally a custom lightweight layer. The important property is that the framework doesn’t bind the architecture to a specific model.
Two. A routing decision per interaction. For each task, the architecture decides which model based on data sensitivity, task complexity, cost profile, and sometimes language. Sensitive personal data → locally-hosted Llama or Mistral. Hard reasoning with non-sensitive data → Claude or GPT via API. Simple routing decisions → a small local model. The decision is explicit and auditable.
Three. European observability and audit infrastructure. Langfuse self-hosted, LangSmith self-hosted, or custom audit layer — running on European infrastructure, owned by the enterprise, with full control over retention and access.
This is the agent-era equivalent of the hybrid-cloud pattern that mature European enterprises adopted in the late 2010s: use the public cloud where it’s the right answer, keep the sensitive core under direct control, route per workload rather than committing everything to one provider. Not the fastest architecture to build. The architecture that survives most political and regulatory weather.
What European Leaders Should Actually Do
Five compressed directions.
Assume observability and audit are non-negotiable. Budget for them from day one, whatever framework you pick.
Pick frameworks with a credible path to agnosticism. Either an agnostic framework out of the gate, or a vendor SDK whose model-layer dependency you own as a deliberate strategic choice.
Design with routing in mind. Even if you use one model today, structure the system so per-interaction model choice is a config change, not an architecture change.
Keep audit data in Europe. The retention requirement isn’t the hard part. The sovereignty of retention is. Put traces, eval data, and agent state somewhere that won’t become a cross-border data-transfer problem.
Watch the regulatory posture. Actively. The 2026-2027 period will produce the first meaningful AI Act enforcement actions, and those actions will shape industry norms.
For most European enterprises, the resulting architecture costs slightly more in year one and materially less over five years, relative to a naive vendor-SDK approach. For regulated enterprises, the vendor-SDK approach may not even be legally viable by the time their agent programmes mature. The posture this chapter describes is the defensible default.
What to take from this chapter: The EU has structural reasons — AI Act, data sovereignty, sovereign-AI policy, the memory of the cloud lock-in cycle — to skip “deep vendor commitment followed by painful migration” and go straight to hybrid, agnostic, observability-heavy agent architectures. Not every European enterprise should take this option, but more should than currently are. Framework trade-offs tilt more decisively toward agnostic in Europe than elsewhere. The sovereign-AI landscape is thinner than US narratives suggest but not threadbare — credible enough to plan on. The EU Routing Pattern (agnostic orchestration + per-interaction model choice + European observability) is the architecture that survives most regulatory weather.
Chapter 10: Will the Timeline Actually Squeeze?
Everything so far has assumed the agent transition will move faster than the cloud transition did, and that European enterprises especially will have reason to jump ahead to the mature architecture rather than live through the lock-in phase. That assumption is baked into the advice this booklet gives. It is also genuinely contestable. This chapter owes the reader a forecast — not a both-sides essay, a clear call about which way the evidence points and what would have to happen for the call to be wrong.
Two Scenarios
The Leapfrog Scenario. Enterprises, particularly in Europe, move quickly past vendor lock-in and settle on hybrid, agnostic, routing-heavy architectures by 2027-2028. MCP-style protocols dominate the access layer. LangGraph-style agnostic frameworks become the reference orchestration layer for regulated industries. Vendor SDKs persist as acceleration tools for less-regulated verticals but don’t become the dominant enterprise default. Observability and audit tooling become a distinct enterprise software category analogous to APM. The cycle the cloud industry took twelve years completes in five.
The Pilot-Purgatory Scenario. Most enterprises get stuck in the same trap that’s already catching 95% of AI pilots: the technology works, the pilots are interesting, scaling never happens. Models keep getting better, which paradoxically makes frameworks feel less necessary, which keeps architectures small and informal. Vendor SDKs win by default because they’re the path of least resistance. Agnostic frameworks remain a specialty concern for a narrow slice of regulated enterprises. The lock-in cycle resembles the cloud cycle — a long messy intermediate phase taking most of a decade.
The Call
The likeliest outcome is bifurcated, tilted toward Leapfrog for regulated EU and Pilot Purgatory for everyone else. Regulated European enterprises (banking, insurance, public sector, healthcare, defence) will Leapfrog because the AI Act is a direct forcing function and the architectures they need to meet compliance look like the mature end-state. Everyone else will spend time in Pilot Purgatory — not because the technology fails, but because organisational machinery (data quality, exec sponsorship, evaluation discipline) isn’t ready. The vendor SDKs will do most of the quiet heavy lifting for the pilots that succeed. The agnostic frameworks plus observability will dominate the long-term architecture for the workloads that matter most.
In cloud-parallel terms: Leapfrog looks like the EU enterprise in 2012 that skipped AWS-mono and went straight to hybrid cloud with Kubernetes. Pilot Purgatory looks like the US mid-market enterprise in 2015 still running parallel systems in three clouds trying to figure out a coherent strategy. Both existed; both were rational responses to specific conditions.
Why This Might Still Be Wrong — Named Leading Indicators
A forecast should be falsifiable. Here are six specific 2026-2027 indicators. If they trend as listed, Leapfrog-for-regulated-EU holds. If they don’t, I’m wrong.
1. MCP reaches 200M monthly SDK downloads by Q4 2026. Currently ~97M. Continued doubling confirms the protocol layer is truly settled. Flattening below 150M by Q4 means protocol adoption is stalling and the thesis weakens.
2. The EU AI Act produces at least one publicly announced high-risk enforcement action by Q2 2027. Not a warning letter — an actual fine or order. Absent that, the forcing-function premise is weaker than I’ve claimed.
3. LangSmith crosses 1,000 paying enterprise seats by Q3 2026. Concrete, trackable (LangChain publishes milestone counts). If the observability category doesn’t monetise, the “becomes its own software category like APM” prediction is off.
4. Mistral, Aleph Alpha, or a similarly-positioned EU lab ships a model within 10% of the frontier (on a named benchmark — say, GPQA or SWE-Bench) by mid-2027. If the gap widens instead, the routing pattern collapses for reasoning-heavy workloads and EU enterprises will be forced to choose between frontier access and sovereignty. Pilot-Purgatory odds rise materially.
5. At least one of OpenAI, Google, or Anthropic ships a sovereign-cloud data-residency guarantee for EU customers — contractually binding, not just regional availability — by end of 2026. If they do, vendor SDKs stay in play for European regulated workloads and the agnostic case weakens. If they don’t, vendor SDKs are effectively disqualified from a chunk of the regulated EU market.
6. A2A (or a direct successor) reaches 10k+ public agent cards in a discoverable registry by end of 2027. That’s the indicator that multi-agent interop is becoming ambient rather than theoretical. If A2A traffic stays internal to single enterprises, the “multi-agent architecture becomes mainstream” premise is deferred and the timing of Leapfrog slips.
These aren’t the usual indicators. Nobody else is tracking them as a coherent set. If three of the six move as described, the forecast holds. If three or more don’t, I’m wrong about the timing or the shape — probably both.
Our forecast in one sentence: Regulated European enterprises will Leapfrog by roughly 2028; most other enterprises will live through a compressed but real version of the cloud-era Lock-In Cycle; vendor SDKs will win the short term and the agnostic frameworks plus their observability ecosystem will win the long term for workloads that matter most — unless three of the six named indicators above don’t move as described, in which case I’ve mis-read the cycle.
Chapter 11: Picking Your Stack
The landscape is mapped. MCP is the ambient protocol at the access layer. A2A is the emerging protocol for agent-to-agent. The orchestration layer splits into vendor SDKs and agnostic frameworks. Observability is a first-class concern. Lock-in has sharp per-vendor answers. The EU has specific reasons to pursue a different architecture. The forecast has six falsifiable indicators.
What this chapter does is compress the decision. Not into a ranking — rankings age badly. Into five questions where your honest answers determine the architecture that fits you, plus a decision tree that renders the answers visually, plus a worked case study of a regulated European bank, plus a short epilogue.
The Five Questions
Work through them in order. Each narrows the field.
1. Where does your cloud allegiance already lie? All-in on Google Cloud → ADK is the default candidate. All-in on AWS → Strands. All-in on Azure or Microsoft 365 → Azure AI Agent Service. Cloud-portable or multi-cloud by policy → the agnostic frameworks (LangGraph, CrewAI) become the natural centre. Building against the grain of your cloud ecosystem costs months of unnecessary integration work.
2. How hard is your model-swap requirement? Hard (must route by compliance / cost / language, or board-level anti-lock-in) → agnostic. LangGraph is the defensible default. Claude Agent SDK is immediately disqualified; OpenAI Agents SDK is marginal. Soft (prefer portability, wouldn’t rebuild everything) → vendor SDKs are viable, especially ADK and OpenAI Agents. None → all frameworks on the table; pick based on the other questions.
3. How much regulatory or audit weight is on this deployment? Heavy (banking, insurance, healthcare, public sector, defence, high-risk under AI Act) → observability and audit are non-negotiable. LangGraph + LangSmith (or Langfuse self-hosted) is the most commonly defensible architecture. Medium (GDPR, some sector rules) → vendor SDKs remain viable with supplemental observability. Light (internal productivity, non-sensitive) → governed by other questions.
4. How much do you need multi-agent coordination? Yes, core → ADK leads, with native A2A and hierarchical structure. CrewAI is a strong second for prototyping; LangGraph handles multi-agent but requires more explicit engineering. Maybe later → pick a framework with a credible multi-agent + A2A roadmap. Single-agent, probably always → LLM-in-a-loop baseline may even suffice. The most common mistake here is over-estimating the need — many enterprises ship three-agent systems where one well-prompted agent would handle the work.
5. How much in-house AI engineering talent do you have? Deep → full range open; agnostic frameworks more attractive because you can pay their cost. Moderate → vendor SDKs absorb more engineering burden; agnostic viable but consumes more capacity than you expect. Limited → vendor SDKs are the correct default. An agnostic framework without a team to drive it is a failed project waiting to happen. Be unflinching here — enterprises are most tempted to answer optimistically.
The Decision Tree
The tree is a scanning aid, not a substitute for thinking. The case study below shows how the questions actually resolve in practice — and where the tree’s answer was wrong.
A Worked Case: A Regulated European Bank
Let me walk through a composite — details are amalgamated from real engagements, specifics changed.
A mid-sized European retail bank. Roughly 4,000 employees across four EU countries. Retail products (mortgages, consumer credit, cards), a wealth-advisory arm, no investment-banking division. Tech stack: Azure-dominant, some on-prem mainframe for core banking (the usual European bank architecture). In-house dev capability: solid on Java/.NET, nascent on AI. Executive sponsorship from the COO, who has been told by the board that the bank “must ship something meaningful with AI in 2026” and is being careful about which meaningful thing.
The use case: an internal advisor assistant for relationship managers. Summarise a client’s portfolio, flag anomalies, surface relevant product offers, prepare meeting notes, draft follow-up emails. Not customer-facing. Not making credit decisions. But touching personal data continuously and, for the meeting-prep portion, adjacent to regulated advisory workflows.
Walking the Five Questions
Q1 Cloud allegiance. Azure-primary but with a sovereignty overlay: client data for regulated workflows has to execute in EU regions and the internal security team is openly hostile to any architecture that hard-binds to one US vendor. Initial instinct: Azure AI Agent Service. The tree disagrees (see below).
Q2 Model swap. Hard. Security policy explicitly requires that PII-bearing inference can be moved to a different model provider inside four weeks if a specific vendor becomes unavailable or non-compliant. This isn’t theoretical — the team has been burned by an abrupt vendor policy change in the past on a different product.
Q3 Regulatory weight. Heavy. The advisory-adjacent workflows likely classify as high-risk under the AI Act (even though we’re still waiting on case law around “advisory” scope). Six-month log retention is a floor; internal banking regulation pushes it to seven years for anything touching advisory content. Annual internal audit + quarterly external compliance review.
Q4 Multi-agent. Meaningfully yes. The final architecture wants three specialists: a retrieval agent (pulls client data and product catalog), an advisor agent (reasons about recommendations), a compliance agent (checks outputs against policy and flags anything that needs human review). Routing between them is structured, not ad-hoc.
Q5 Talent. Moderate-trending-toward-limited. The team has two engineers who have built LLM applications before. Neither has run LangGraph in production. Capacity to absorb learning curve is real but bounded by quarterly delivery pressure.
What the Tree Said
Heavy regulation + hard swap → LangGraph + self-hosted Langfuse, EU-region Azure, multi-model routing. That’s the blue leaf.
The architecture would be: LangGraph for orchestration; Langfuse self-hosted on Azure North Europe for observability/audit; MCP servers in front of the CRM, product catalog, and policy library; per-interaction model routing — Claude via Azure (Anthropic’s Azure partnership, with a contractual EU-region guarantee) for reasoning-heavy tasks; locally-hosted Mistral for tasks touching personal data; A2A between the three specialist agents with agent cards registered in an internal registry.
That’s the architecturally correct answer. It’s also where the framework’s answer and what the team actually did diverged.
Where We Overrode the Framework
The bank shipped v1 on the OpenAI Agents SDK via Azure OpenAI — not LangGraph.
The reason was Q5. The two AI-capable engineers didn’t have the bandwidth to simultaneously learn LangGraph, stand up a self-hosted Langfuse, configure multi-model routing, and deliver a pilot in the quarter the COO had committed to. The framework-correct answer was infeasible given the organisational reality. And shipping something good-enough in the committed window mattered more strategically than shipping the architecturally-perfect thing six months late.
What we did instead: OpenAI Agents SDK for orchestration, Azure OpenAI with EU-region Claude (which Microsoft offers via Anthropic through the Azure marketplace) as the default model, Azure Monitor + a thin internal tracing wrapper as the observability layer, vendor-native guardrails, MCP servers for the internal systems. We wrote down — explicitly, in the architecture decision record — that this was a temporary choice, that the migration target was LangGraph + Langfuse, and that certain features (A2A-mediated multi-agent delegation, durable execution for long-running compliance reviews) would be deferred until the migration.
We migrated to LangGraph in month 9. The migration took seven weeks including the observability cutover. It would have taken longer if we hadn’t designed the v1 architecture knowing it was temporary — specifically, if we hadn’t kept the prompts and tool surfaces as framework-agnostic as possible and invested early in MCP servers (which were the one piece that didn’t need to change at all). The prompts and the MCP servers moved verbatim. The orchestration rewrote cleanly once the team had the bandwidth.
What the Case Teaches
The framework-correct answer is often not the timing-correct answer. A v1 that ships on a compromised stack and gets migrated is frequently better than a v1 that’s architecturally pristine and ships eight months late. The decision tree gives you the destination. It doesn’t always give you the sequence.
Preserving optionality costs less than people think, if you design for it. The two things that made the migration tractable — MCP servers for internal systems, and prompts written to be framework-agnostic — cost the v1 team roughly 10% more engineering time than the fully-vendor-coupled alternative. That 10% saved 60% on the migration.
Observability is the hardest thing to retrofit. The weakest part of the v1 architecture was the thin internal tracing layer. When we needed to audit a specific advisory recommendation from month 4 for a compliance review in month 11, the traces existed but weren’t searchable in the way a real observability platform would have offered. If I were doing it again, I’d spend the extra three weeks to stand up Langfuse even in v1, even on the vendor SDK. Everything else you can retrofit. Trace history you can’t.
The “must migrate by Q3” clock worked. Writing the temporary nature into the ADR, with a named migration target and a named date, is what kept the team from drifting into “the vendor SDK is working fine, why migrate?” stasis. The ADR had teeth because three senior stakeholders had signed it. Without that, the v1 stack would probably still be running.
Epilogue: What Survives
This booklet has been a set of mental models. The layer cake. The cloud parallel. The two protocols. The two framework families. The lock-in ranges. The EU leapfrog hypothesis. The forecast with its six indicators. The five-question framework. The case study above.
In eighteen months, which of these will still be useful?
The layer cake will. The question “what layer does this sit on?” is a durable habit that pays off every time a new piece of technology gets announced. MCP will still matter — more, not less. The observability layer will be larger and more mature. A2A will either be ambient or will have been replaced by something that solves the same problem under a different name; in either case, the concept survives.
The specific frameworks are harder. LangGraph will very likely still be the agnostic default. CrewAI’s future is more uncertain. ADK will continue because Google’s incentives don’t change. The Claude Agent SDK will have either spread or narrowed dramatically depending on how much the broader market wants Claude-specific computer-use. The OpenAI Agents SDK is the hardest to forecast — it depends on decisions inside OpenAI that we can’t see. Azure and AWS will persist because their parent companies need them to.
The specific numbers will all be wrong. The 97M MCP downloads will be some larger number. The 44k CrewAI stars will have moved. The exact vendor capabilities will have drifted. Every cost figure will need revision. This is fine. The numbers are there to anchor the mental model, not to be load-bearing on their own.
What I’d tell a colleague asking how to use this book in 2027: start with the layer cake, believe the MCP default, own your observability, respect lock-in consciously, and when you come to pick a stack, run the five questions and then be honest about which answer you can actually execute this quarter. The rest you can update as the horizon moves.
The specifics will change. The discipline will not.
End of booklet.