The Token Economics

A Strategic Guide for EU IT Services Providers Navigating GenAI

April 2026

By Robert Barcik LearningDoe s.r.o.

About This Booklet

The generative AI revolution has created a strategic inflection point for IT services providers across Europe. Companies that built successful businesses on hosting, infrastructure management, and operational expertise now face a fundamental question: does their business model survive the shift from compute to intelligence?

This booklet examines that question through hard economics. We run the numbers on self-hosting large language models versus using commercial APIs. We explore which new business models work and which do not. We look at where EU-based providers have genuine advantages and where they face structural disadvantages. And we provide a practical roadmap for the next 18 months.

The analysis is grounded in real hardware costs, current API pricing (as of April 2026), and the operational realities of running AI infrastructure. No hype, no hand-waving — just the math and what it means for your business.

Who This Booklet Is For

IT services company leaders evaluating their GenAI strategy
Infrastructure and operations teams considering self-hosted AI
Sales and business development professionals packaging AI services for clients
Technical architects designing AI deployment strategies
Anyone in the EU IT services ecosystem trying to understand the economics

How to Read This Booklet

Chapters 1-4 build the economic foundation — read these in order. Chapter 5 covers the vendor ecosystem path most providers will take first. Chapters 6-8 explore three more independent business model pivots. Chapter 9 turns the lens inward on your own delivery model. Chapters 10-14 cover market dynamics, regulation, pricing, talent, and practical planning.

The GenAI Moment for IT Services Providers
How Large Language Models Actually Run
The Cost Equation: Self-Hosting vs API at Every Scale
Why Hyperscalers Win on Price
The Vendor Ecosystem Play
Business Model: The Privacy Proxy
Business Model: Local Deployment on Employee Devices
Business Model: Testing, Security, and Agentic Infrastructure
When AI Transforms Your Own Delivery
The Lock-In Power Shift
EU AI Act: Your Compliance Opportunity
Pricing Models and Packaging
Talent and the CEE Market
The “Do Nothing” Scenario and Your 18-Month Roadmap

Chapter 1

Chapter 1: The GenAI Moment for IT Services Providers

A Business That Made Perfect Sense

For twenty years, the math worked beautifully.

A mid-sized European IT services provider — the kind with 50 to 500 employees, operating from Prague or Bratislava or Warsaw or Munich — built its business on a layered proposition. At the base: hosting and infrastructure services generating steady recurring revenue. On top: professional services — architecture consulting, systems integration, security audits, compliance work — where the real margins lived.

This distinction matters. Many IT services providers make 30-60% gross margins on professional services and consulting, while infrastructure hosting runs at 15-30%. The hosting was often the anchor that got you into the client relationship; the services revenue was what made the business profitable. A typical portfolio included co-location and hosting, managed infrastructure operations with SLAs, cloud migration projects, and a growing layer of advisory and integration work on top.

This model was resilient. When the cloud wave hit in the early 2010s, providers adapted. Instead of selling physical server space, they resold cloud capacity from AWS, Azure, or Google Cloud, adding management, migration, and optimization services on top. The infrastructure margins compressed, but the services layer expanded — cloud architecture consulting, cost optimization, multi-cloud management. The total economics still worked.

When the mobile revolution came, it barely touched the core business. Mobile apps needed backends. Backends needed hosting. The cycle continued.

Even the shift to DevOps and containers, while demanding new skills, did not fundamentally threaten the model. Kubernetes clusters need to run somewhere. Someone needs to operate them. The value chain shifted, but the underlying logic — we operate infrastructure so you don’t have to — remained intact.

This is no longer the case.

Why GenAI Breaks the Middleman Model

Every previous technology shift preserved a basic economic structure: the IT services provider sat between the technology vendor and the client, adding value by reducing complexity and providing operational expertise. The provider’s margin came from the difference between what they paid for compute and what they charged clients for managed compute.

Generative AI inverts this structure, and understanding why requires looking at how the economics actually work.

When you resell cloud hosting, you buy compute at a bulk rate and sell it at a markup. A virtual machine that costs you EUR 200 per month from a hyperscaler might go to your client at EUR 300-350 with management included. The client pays the premium willingly because they are buying your operational expertise, your monitoring, your SLA guarantees — your human labour wrapped around the compute.

Now consider what happens with large language models. The leading API providers — OpenAI, Anthropic, Google, Mistral — operate at scales that produce extraordinary unit economics. A million tokens processed through a capable mid-tier model like Gemini 2.5 Flash costs $0.30 on input and $2.50 on output through the API. A flagship model like Claude Opus 4.6 costs $5.00 per million input tokens and $25.00 per million output tokens. These prices have been falling steadily and show no signs of stopping.

Here is the uncomfortable arithmetic. If you wanted to self-host a comparable open-source model — say, a 120-billion parameter model running on your own GPU infrastructure — the hardware alone for serving 100 concurrent users would cost between $600,000 and $1.2 million to purchase. Renting the equivalent cloud GPU capacity runs $25,000 to $50,000 per month. And that is before you account for the ML engineering talent to operate it, the inference optimization work, the model updates, and the inevitable hardware refresh cycle.

For most workloads at most scales that EU IT services providers operate at, the API is 5 to 10 times cheaper than self-hosting. Not slightly cheaper. Not marginally cheaper. Dramatically, structurally cheaper.

The core problem: In traditional IT services, the middleman was cheaper than the vendor because the middleman aggregated demand. In GenAI, the vendor is cheaper than the middleman because the vendor aggregates supply at a scale no middleman can match.

This is not a temporary market condition. It is a structural consequence of how large language models work. Training a frontier model costs hundreds of millions of dollars, but once trained, the marginal cost of serving one more request is tiny — and it gets tinier as the provider’s infrastructure scales. The hyperscalers who train and serve these models operate at a scale where they have already amortized their training costs across millions of paying users. You cannot replicate that cost structure with a rack of GPUs in a Frankfurt data centre.

Two Kinds of Clients, Two Different Realities

Before concluding that the old model is dead, we need to make an important distinction. Not all clients are the same, and the economics play out very differently depending on which kind you serve.

Cloud-comfortable clients can send data to external APIs. They may have some data governance policies, but their core business data is already in AWS or Azure. For these clients, the middleman economics described above apply in full force. They can sign up for an OpenAI API key tomorrow, and the case for your self-hosted infrastructure is a hard sell on price.

On-premises-required clients cannot or will not send data to external API providers. This includes banks under strict regulatory frameworks, healthcare providers handling patient data, defense contractors, public sector organizations, law firms with client confidentiality obligations, and any enterprise whose compliance or legal teams have drawn a hard line against external AI APIs. For these clients, the hyperscaler API price is irrelevant — it is not an option they can choose.

This distinction matters enormously, because for on-prem clients, the economic comparison is not “your self-hosted price vs. the API price.” It is:

Your managed AI service vs. the client building their own GPU infrastructure and hiring their own ML team
Your managed AI service vs. the client having no AI at all

That is traditional IT services economics. And it works. A client who needs on-prem AI and does not have the expertise to run it themselves will pay a reasonable premium for your operational expertise — just as they did for managed servers, managed databases, and managed Kubernetes clusters.

The proportion of your client base that falls into each category will determine how much of your traditional business model survives the GenAI shift. In heavily regulated EU markets — particularly Central and Eastern Europe, where banking, insurance, healthcare, and public sector clients often have strict data residency requirements — the on-prem segment may be larger than you think.

Key distinction: For cloud-comfortable clients, the API providers are 5-10x cheaper than you, and the middleman model is broken. For on-prem-required clients, you are not competing with API providers — you are competing with the client’s alternative of doing it themselves or going without. These are two fundamentally different economic games, and you need to know which one you are playing with each client.

We explore both sets of economics in detail in Chapter 3. For now, understand that the picture is more nuanced than “self-hosting never works.” It depends entirely on who you are serving and why.

The Previous Playbook Does Not Transfer (For Half Your Clients)

IT services providers have survived technology transitions before, and there is a tempting pattern to fall into: “We adapted to cloud. We adapted to containers. We will adapt to AI.”

For cloud-comfortable clients, this confidence is misplaced. GenAI is not primarily a new infrastructure category — another thing to host, another thing to manage. The infrastructure layer is increasingly commoditized by the model providers themselves. OpenAI, Anthropic, and Google do not just sell models — they sell fully managed inference infrastructure. There is no server for you to manage. There is no cluster for you to optimize. The client can sign up for an API key in five minutes and start sending requests.

For on-prem-required clients, however, the adaptation instinct is actually correct. These clients still need someone to procure GPU hardware, deploy models, optimize inference, handle updates, and monitor production systems. This is operational expertise wrapped around infrastructure — exactly the service you have been selling for decades. The technology changes (GPUs instead of CPUs, vLLM instead of Apache), but the relationship is the same: you operate complex infrastructure so the client does not have to.

The challenge is that even for on-prem clients, the skill requirements have shifted. Managing GPU clusters and ML inference pipelines is different from managing virtual machines and databases. We cover this transition in detail in Chapter 11.

What the client actually needs now

The real challenges clients face with GenAI are different from what they faced with cloud or mobile:

Which model do I use for which task? The landscape changes monthly. A model that was best for code generation six months ago may now be outperformed by three competitors at half the price.
How do I keep my data private? Many European organizations, especially in regulated industries, cannot send customer data to US-based API providers without careful architectural work.
How do I integrate AI into my existing workflows? This is not an infrastructure question — it is a systems integration and business process question.
How do I evaluate quality? Unlike a web server that either responds or does not, an LLM can produce subtly wrong, biased, or hallucinated output. Testing and validation require entirely new approaches.
How do I comply with the EU AI Act? The regulatory requirements for deploying AI systems in Europe are real and growing, and most clients have no idea where to start.

Notice what these questions have in common: none of them are primarily about hosting or infrastructure. They are about expertise, integration, evaluation, and compliance. This is the shift — from selling compute to selling intelligence about intelligence.

The Five-Year Question

Here is the question every IT services provider leadership team in Europe should be sitting with right now:

In five years, will your primary revenue come from infrastructure you operate, or from expertise you deliver?

This is not a rhetorical question, and the answer is not obvious. Both paths can work, but they require fundamentally different investments, different talent, different client relationships, and different pricing models.

The infrastructure path is broader than many commentators suggest — particularly in the European market. Regulated industries that cannot use external APIs, organizations with strict data sovereignty requirements, high-volume workloads where the cost curve tips in favour of owned hardware, and edge deployment scenarios all represent real demand for managed AI infrastructure. In some EU markets, this segment may represent the majority of enterprise AI demand. We examine these economics in detail in Chapters 3 and 4.

The expertise path is broader but demands transformation. If your organization pivots toward AI advisory, integration, testing, compliance, and managed intelligence services, the addressable market is large and growing. But this is a different business than the one you have been running. It requires different people, different sales motions, and a willingness to let go of the comfortable predictability of infrastructure-based recurring revenue.

Most providers will end up with some combination of both. The question is which one leads.

There Are Viable Paths — But Not the Old One

Let us be direct about what this booklet is and is not arguing.

We are not arguing that EU IT services providers are doomed. The European IT services market is large, growing, and shaped by regulatory and cultural factors that create genuine competitive advantages for local providers. Data sovereignty concerns, EU AI Act compliance requirements, language and cultural specificity, and the sheer complexity of integrating AI into existing enterprise workflows — these all create demand that hyperscalers alone cannot satisfy.

We are also not arguing that every provider must become an AI company overnight. The transformation is a spectrum, and the right position on that spectrum depends on your current capabilities, your client base, and your appetite for risk.

What we are arguing is this: the old playbook needs fundamental modification — but not wholesale abandonment. For cloud-comfortable clients, selling compute at a markup does not work against hyperscaler API prices. For on-prem-required clients, managed AI infrastructure is a natural and profitable extension of your existing business. For both, adding expertise, integration, compliance, and evaluation services on top of the infrastructure creates significantly more value than the infrastructure alone.

Providers who recognize the dual nature of this market — and invest accordingly — have a window of opportunity. The GenAI market is still young enough that expertise is scarce, best practices are not yet established, and clients are genuinely uncertain about how to proceed. That uncertainty is your opportunity. For some clients, it means “we will host it for you, because you cannot go to the cloud.” For others, it means “we will make AI work in your context, regardless of where the model runs.”

What This Booklet Will Show You

The chapters that follow build the case systematically.

Chapters 2-4 lay the economic foundation. We walk through how large language models actually run at the hardware level, then build a detailed cost comparison between self-hosting and API usage at different scales — for both cloud-comfortable and on-prem-required clients. We explain why the hyperscaler cost advantage is structural, not temporary, and identify where self-hosting still makes economic sense.

Chapter 5 examines the path of least resistance: reselling and implementing embedded AI from your existing vendor partners — Microsoft Copilot, SAP Joule, ServiceNow AI, and others. For many providers, this is the fastest route to AI revenue.

Chapters 6-8 explore three more independent business models: the Privacy Proxy (routing AI through a compliant European layer), Local Deployment on Employee Devices (a growing market as on-device models improve), and Testing, Security, and Agentic Infrastructure (where the provider’s operational expertise maps directly to new AI requirements).

Chapter 9 addresses the uncomfortable internal question: how AI transforms your own service delivery model. If AI can handle 40-60% of L1 tickets, that changes your cost structure, your staffing, and your pricing.

Chapters 10-14 cover the broader strategic landscape: how lock-in dynamics are shifting, how the EU AI Act creates a genuine compliance opportunity, how to price and package AI services, where to find talent in the Central and Eastern European market, and what happens if you do nothing. The final chapter provides a concrete 18-month roadmap.

Throughout, we use real numbers. The pricing data referenced in this booklet is current as of April 2026 and drawn from public pricing pages, published GPU rental rates, and hardware market prices. Where we estimate, we show our assumptions. Where the numbers are uncertain, we say so.

What to take from this chapter: The GenAI shift creates two distinct realities. For cloud-comfortable clients, the API providers are 5-10x cheaper than you at serving AI — the middleman markup model is broken. For on-prem-required clients, managed AI infrastructure is a viable, profitable business that builds on your existing expertise. Most EU IT services providers will serve both segments, and the winners will be those who understand which economic game they are playing with each client. The rest of this booklet gives you the numbers and the strategies for both.

Next: Chapter 2 — How Large Language Models Actually Run

Chapter 2

Chapter 2: How Large Language Models Actually Run

You already know how to size a database server. You know that a PostgreSQL instance handling 500 concurrent connections needs a certain amount of RAM for shared buffers, work memory, and connection overhead. You can estimate that a 2TB database with heavy read traffic needs specific IOPS and a certain number of CPU cores.

Running a large language model is the same kind of engineering problem — just with different hardware. The bottleneck moves from CPU and RAM to GPUs and VRAM, the workload shifts from disk I/O to matrix multiplication, and the scaling unit changes from “connections” to “tokens per second.” But the thinking process is identical: understand the resource demands, match them to hardware, and plan for concurrent users.

This chapter gives you that understanding.

Parameters, Precision, and Memory

A large language model is, at its core, a massive collection of numerical weights — called parameters — that encode everything the model learned during training. When someone sends a prompt, the model multiplies input data through these weights layer by layer to produce an output. Every single parameter must be loaded into GPU memory before the model can process a single token.

This is the fundamental constraint. Unlike a traditional application where you can page data in and out of RAM from disk, an LLM’s parameters need to sit in VRAM (the GPU’s dedicated memory) with extremely fast access. The entire model must be resident, all the time, for every request.

The memory footprint depends on two things: the number of parameters and the numerical precision used to store each one.

Precision Formats

Each parameter is a number. How many bytes you use to store that number is called its precision:

FP16 (half precision): 2 bytes per parameter — full quality, no accuracy loss
INT8 (8-bit quantization): 1 byte per parameter — minimal quality loss for most tasks
INT4 (4-bit quantization): 0.5 bytes per parameter — noticeable quality reduction on complex reasoning, but viable for many production use cases

Think of it like audio bitrate. A 320kbps MP3 is nearly indistinguishable from a CD. A 128kbps MP3 is fine for background music. A 64kbps file works for voice calls. The “right” quality depends on the use case.

Memory Math for Real Models

Here is what this means for two representative model sizes — a large frontier-class model (120B parameters) and a capable mid-size model (20B parameters):

Model Size	FP16 (2 bytes)	INT8 (1 byte)	INT4 (0.5 bytes)
120B parameters	~240 GB VRAM	~120 GB VRAM	~60-70 GB VRAM
20B parameters	~40 GB VRAM	~20 GB VRAM	~10-12 GB VRAM

A 120B model at full precision needs 240 GB of VRAM just for the weights. No single GPU on the market has that much memory, which means you must spread the model across multiple GPUs. A 20B model at INT4, on the other hand, fits comfortably on a single consumer-grade GPU with 24 GB of VRAM.

Key takeaway: Model weights are the baseline memory cost — your “minimum RAM” equivalent. But just like a database server needs memory beyond the data files, an LLM needs VRAM beyond the model weights. The biggest additional consumer is the KV cache.

The KV Cache: Where Concurrent Users Hit You

Here is where things get interesting for anyone thinking about multi-user deployments.

When a model processes a conversation, it computes intermediate values called keys and values (KV) for every token in the context. These get cached so the model does not have to recompute them for each new token it generates. This is the KV cache, and it grows with every token in every active conversation.

If you have run a database, think of the KV cache as the equivalent of connection-level session memory. Each active user consumes a share of memory proportional to the length of their conversation.

The Math Gets Serious at Scale

Consider a realistic enterprise scenario: 100 concurrent users working with a 120B parameter model. Some are having straightforward Q&A sessions (4K-8K context). Others are running agentic workflows — code generation, document analysis, multi-step reasoning — that push to 32K-128K tokens per session.

A conservative average of 16K active context tokens across 100 users means 1.6 million tokens of KV cache state that must live in VRAM simultaneously. For a 120B model, that translates to roughly 80-150 GB of additional VRAM on top of the model weights, depending on the model architecture and precision.

Let that sink in: the KV cache for 100 users can require as much VRAM as the model weights themselves.

Component	120B at FP16	120B at INT8
Model weights	240 GB	120 GB
KV cache (100 users, 16K avg context)	80-150 GB	80-150 GB
Runtime overhead (activations, buffers)	20-40 GB	15-30 GB
Total VRAM needed	340-430 GB	215-300 GB

Notice that quantizing the model weights helps with the first row, but the KV cache does not shrink proportionally — it depends on the model’s hidden dimensions and number of attention heads, not the weight precision. This is why quantization alone does not solve the multi-user scaling problem.

Key takeaway: When sizing GPU infrastructure, the model weights are the floor, not the ceiling. For multi-user deployments, the KV cache often dominates your memory planning. Every additional concurrent user with a long context window costs real VRAM.

Throughput: Tokens Per Second Per User

Memory determines whether a model fits. Throughput determines whether the experience is acceptable.

A good interactive experience requires 30-50 tokens per second per user. Below 20 tokens/sec, users perceive noticeable lag. Above 50, the output appears essentially instant — the bottleneck becomes reading speed, not generation speed.

For 100 concurrent users, that means your infrastructure must sustain 3,000-5,000 tokens per second in aggregate. This is the equivalent of sizing network bandwidth for concurrent connections — each user needs a guaranteed minimum, and the infrastructure must handle the aggregate peak.

Throughput depends on GPU compute power (measured in TFLOPS), memory bandwidth (how fast data moves between VRAM and compute units), and how efficiently the serving software schedules work across multiple requests.

GPU Hardware: A Practical Comparison

If you are used to comparing Xeon versus EPYC processors and DDR4 versus DDR5 memory, this table is your GPU equivalent:

GPU	VRAM	Memory Bandwidth	FP16 TFLOPS	Purchase Price (per unit)	Typical Use Case
NVIDIA H100 SXM	80 GB HBM3	3.35 TB/s	989	$25,000-40,000	Frontier models, high-throughput production
NVIDIA H200 SXM	141 GB HBM3e	4.8 TB/s	989	$30,000-45,000	Large models needing maximum VRAM
NVIDIA A100 SXM	80 GB HBM2e	2.0 TB/s	312	$15,000-17,000	Previous gen, good price/performance
NVIDIA A100	40 GB HBM2e	1.6 TB/s	312	$10,000-12,000	Budget production, smaller models
NVIDIA L40S	48 GB GDDR6X	864 GB/s	362	$7,000-10,000	Inference-optimized, data center
NVIDIA RTX 4090	24 GB GDDR6X	1.0 TB/s	330	$1,600-2,000	Development, light production

A few things stand out. The H100 and H200 are in a different league on memory bandwidth — 3-5x faster than the L40S. For LLM inference, memory bandwidth is often the bottleneck, because generating each token requires reading the entire model weights from memory. The H200’s 141 GB of VRAM is also notable: it can hold a 120B model at INT8 on a single GPU (though you would still need multiple GPUs for throughput at scale).

The RTX 4090 deserves attention for a different reason. At roughly $1,800, it delivers surprisingly capable inference performance for smaller models. Its 24 GB of VRAM limits what it can run, but for a quantized 20B model, it is a legitimate option.

Concrete Configurations: What Serves 100 Users

Let us put the pieces together with specific hardware configurations.

Configuration 1: 120B Model for 100 Users

A 120B model at INT8 needs ~120 GB for weights plus 80-150 GB for KV cache. You need substantial aggregate VRAM and compute.

Hardware: 8x H100 80GB node (640 GB total VRAM, NVLink interconnect)

One such node — costing $200,000-400,000 — provides enough VRAM and bandwidth to serve 20-30 concurrent users at good throughput. The model weights consume about 120 GB (at INT8), leaving ~520 GB for KV cache, activations, and batching overhead. That sounds generous until you account for long-context agentic sessions eating 1-2 GB of KV cache each.

For 100 concurrent users, plan for 3-4 nodes — a total investment of $600,000-1,600,000 in GPU hardware alone, before racks, networking, power, and cooling.

Configuration 2: 20B Model for 100 Users

A 20B model is a fundamentally different proposition. At FP16, the weights need ~40 GB. At INT8, ~20 GB. At INT4, ~10-12 GB.

Setup	Hardware	Estimated Cost	Concurrent Users
Full precision	2x H100 80 GB	$50,000-80,000	~100 users
INT8 quantized	4x A6000 or L40S (48 GB each)	$28,000-40,000	~100 users
INT4 quantized	2x RTX 4090 (24 GB each)	$3,200-4,000	Lighter loads, 20-40 users
Single GPU	1x H100 or A100 80 GB	$15,000-40,000	50-80 users

A single H100 or A100 80 GB can comfortably hold a 20B model at FP16 with ample room left for KV cache, serving 50-80 concurrent users at good throughput. Two H100s at FP16 handle 100 users with headroom.

The economics here are striking. Where a 120B model requires over half a million dollars in GPUs for 100 users, a 20B model can serve the same user count for under $80,000 — and with INT8 quantization on L40S cards, under $40,000.

Key takeaway: The jump from 20B to 120B is not a 6x cost increase — it is closer to 10-20x when you factor in KV cache, multi-node networking, and the premium pricing of top-tier GPUs. The question for your clients is whether that quality difference justifies the cost difference for their specific use case.

Serving Software: The Engine Room

Having the right GPUs is necessary but not sufficient. The software layer that sits between the model and incoming requests makes an enormous difference in how many users your hardware can actually serve. This is analogous to the difference between running a raw MySQL binary and running it behind a properly configured connection pooler with query optimization.

Key Serving Frameworks

vLLM is the current standard for production LLM serving. Its key innovation is PagedAttention — a memory management technique for the KV cache that works like virtual memory paging in an operating system. Instead of pre-allocating maximum context length for every request, it allocates KV cache memory in pages and reclaims them dynamically. This alone can improve throughput by 2-4x compared to naive serving.

Text Generation Inference (TGI) from Hugging Face is another solid production option, particularly well-integrated with the Hugging Face model ecosystem. It supports quantization, tensor parallelism, and continuous batching out of the box.

llama.cpp takes a different approach — it is optimized for running quantized models on consumer hardware, including CPU-only inference. Performance is lower than GPU-native frameworks, but it runs anywhere and is remarkably efficient for its weight class.

MLX is Apple’s framework for running models on Apple Silicon. If your clients have fleets of M2/M3/M4 MacBooks or Mac Studios, MLX enables local inference using the unified memory architecture. A Mac Studio with 192 GB of unified memory can run a 70B model — something we will explore in Chapter 6.

Three Techniques That Matter

Tensor parallelism splits a model across multiple GPUs within a single node. Each GPU holds a slice of every layer and they communicate over high-speed NVLink interconnects during each forward pass. This is how you run a 120B model across 8 H100s — the model is too large for any single GPU, so you partition it. Think of it as RAID striping, but for neural network layers instead of disk blocks.

Continuous batching is what makes multi-user serving economically viable. Instead of processing one request at a time (or waiting to fill a fixed batch), the serving framework dynamically adds new requests to the running batch and removes completed ones. A user who asks a short question gets their response without waiting for another user’s 4,000-token generation to finish. This is the LLM equivalent of HTTP/2 multiplexing — interleaving multiple streams on the same connection.

Speculative decoding uses a small, fast “draft” model to predict several tokens ahead, then verifies them in a single pass through the large model. When the predictions are correct (which is often the case for routine text), you get multiple tokens for the compute cost of one verification step. The speedup is typically 1.5-2.5x for suitable workloads. It is essentially branch prediction for language models — speculate, verify, and accept or reject.

What This Means for Your Infrastructure Business

If you currently manage server infrastructure for clients, everything in this chapter maps to skills you already have. Capacity planning, performance monitoring, memory management, multi-node orchestration — these are your core competencies applied to new hardware.

The critical differences are:

Capital intensity is higher. A well-equipped database server costs $20,000-50,000. A single 8-GPU inference node costs $200,000-400,000. The stakes per deployment are an order of magnitude larger.
The workload is memory-bound, not compute-bound. Traditional servers often have idle RAM. GPU inference is almost always constrained by VRAM — you will spend more time optimizing memory allocation than CPU utilization.
The model-hardware fit matters enormously. Choosing a 120B model where a fine-tuned 20B would suffice does not just waste money — it can make the entire business case collapse. Model selection is now an infrastructure decision.

The next chapter takes these hardware realities and turns them into a full cost comparison against commercial APIs. When does self-hosting make sense? At what user count? For which workloads? The answer, as you might expect, depends entirely on the numbers.

Chapter 3

Chapter 3: The Cost Equation — Self-Hosting vs API at Every Scale

This is the chapter where we stop talking in abstractions and start talking in euros. If you take away one thing from this booklet, it should be the numbers on these pages. They will either confirm your strategic direction or force you to change it.

We will walk through the full cost of self-hosting large language models at different scales, compare those costs against current API pricing, and — critically — show you that the right comparison depends on which clients you serve. For cloud-comfortable clients, the API comparison is what matters. For on-premises-required clients, the comparison that matters is entirely different: your managed service versus the client doing it themselves. Both economics are covered in this chapter.

Self-Hosting a 120B Parameter Model (100 Users)

Let us start with the scenario many IT services providers instinctively imagine: hosting a large, frontier-class open-weight model for a mid-size enterprise client. Something like Llama 3.1 405B quantized to fit, or a full-precision 120B parameter model such as Mistral Large or a fine-tuned derivative.

Compute Requirements

A 120B parameter model at FP16 requires roughly 240 GB of GPU VRAM just for the weights. Add KV-cache for concurrent users, activation memory, and operational headroom, and you need 3 to 4 nodes of 8xH100 (80 GB each) to serve 100 concurrent users with acceptable latency.

The Monthly Bill

Cost Component	Monthly Estimate
GPU cloud rental (3-4 nodes, 8xH100 each)	$30,000 - $50,000
Networking and storage	$1,500 - $3,000
Monitoring and observability	$1,000 - $2,000
DevOps / ML engineering staff (fractional)	$8,000 - $15,000
On-call and incident management	$3,000 - $5,000
Application hosting (API gateway, auth, logging)	$2,000 - $4,000
Security and compliance overhead	$2,000 - $4,000
Total monthly cost	$47,500 - $83,000

Per-User Reality

Metric	Value
Raw compute per user	$300 - $500/month
Compute + operations per user	$475 - $830/month
Realistic all-in per user	$600 - $1,000/month

The “realistic all-in” figure includes a margin buffer, amortized setup costs, and the inevitable underutilization during off-peak hours. GPU clusters do not scale to zero.

Key takeaway: At $600-1,000 per user per month, a self-hosted 120B model is essentially unsellable. Enterprise AI seats from OpenAI, Anthropic, and Google run $20-30 per user per month for standard tiers and $200 per seat for the most premium enterprise packages. You cannot compete on price. You would need to sell on a value proposition so compelling that customers pay 3-5x the market rate. For most IT services providers, this is not a viable business.

Self-Hosting a 20B Parameter Model at Different Scales

The more realistic play for an IT services provider is a smaller, more efficient model — 20B parameters or fewer. Models like Mistral Small, Llama 3.1 8B/70B (quantized), or domain-specific fine-tunes in the 7-20B range. These models can deliver strong performance on focused enterprise tasks while running on significantly less hardware.

10 Users — Dedicated Per-Customer Deployment

This is the “private AI appliance” scenario. One customer, one deployment, full data isolation.

Cloud option:

Component	Monthly Cost
1x A6000 or L40S (INT8 quantized)	$2,000 - $3,000
Ops overhead (monitoring, support, patching)	$500 - $1,000
Total	$2,500 - $4,000
Per user	$250 - $400

Hardware appliance option:

Component	Cost
Upfront hardware (server + GPU)	$10,000 - $15,000 (one-time)
Monthly software license + remote support	$1,000 - $2,000
Amortized per user (over 36 months)	$100 - $200/month

The appliance model is interesting because it shifts the cost structure. The upfront capital expense is significant, but the ongoing per-user cost drops considerably. This works best for regulated industries where data must stay on-premises — healthcare, legal, financial services.

100 Users — Shared Departmental Deployment

Component	Monthly Cost
2x H100 (handling concurrency and throughput)	$5,000 - $8,000
Ops overhead (scaled)	$5,000 - $8,000
Total	$10,000 - $16,000
Per user	$100 - $160

At this scale, the operations overhead is roughly equal to compute. You need proper monitoring, a deployment pipeline, someone on call, and a process for model updates and security patches. The GPU may run itself, but the system around it does not.

500 Users — Business Unit or Mid-Size Enterprise

Component	Monthly Cost
2-3x H100 (with load balancing)	$7,500 - $12,000
Ops overhead (dedicated team fraction)	$3,500 - $6,000
Total	$11,000 - $18,000
Per user	$22 - $36

This is where self-hosting starts to become competitive. Utilization improves dramatically with more users — 500 users generate enough traffic to keep GPU clusters reasonably busy throughout the business day. The per-user ops cost drops because you are spreading the same monitoring, support, and tooling across more seats.

1,000 Users — Large Enterprise or Multi-Tenant Platform

Component	Monthly Cost
3-4x H100 (high-throughput serving)	$11,000 - $18,000
Ops overhead (dedicated team)	$7,000 - $9,000
Total	$18,000 - $27,000
Per user	$18 - $27

At 1,000 users, the economics tilt decisively. You are now in the range where self-hosting a smaller model can undercut API pricing for mid-tier models — and you retain full data sovereignty. This is the sweet spot for providers who can aggregate demand across multiple clients.

API Pricing Landscape (April 2026)

Before comparing self-hosting to APIs, let us establish what the commercial API market actually charges today. Prices are listed per million tokens for input and output respectively.

Frontier and Mid-Tier Model Pricing

Provider	Model	Input (per M tokens)	Output (per M tokens)
OpenAI	GPT-4.1	$2.00	$8.00
	GPT-4o	$2.50	$10.00
	GPT-4o-mini	$0.15	$0.60
Anthropic	Claude Haiku 4.5	$1.00	$5.00
	Claude Sonnet 4.6	$3.00	$15.00
	Claude Opus 4.6	$5.00	$25.00
Google	Gemini Flash-Lite	$0.10	$0.40
	Gemini Flash	$0.30	$2.50
	Gemini Pro	$1.25	$10.00
Mistral	Small	$0.20	$0.60
	Medium	$1.00	$3.00
	Large	$2.00	$6.00
Llama (hosted)	8B	$0.05	$0.08
	Maverick	$0.15	$0.60
	70B	$0.70	$0.90

Several patterns jump out of this table.

First, the price floor keeps dropping. Google’s Flash-Lite at $0.10/$0.40 and Llama 8B at $0.05/$0.08 are nearly free for most business use cases. A year ago, these price points did not exist for models of comparable capability.

Second, there is a 50-100x spread between the cheapest and most expensive models. A Gemini Flash-Lite call costs roughly 1/50th of a Claude Opus 4.6 call. For most routine enterprise tasks — summarization, classification, extraction, simple Q&A — the cheaper models are more than adequate.

Third, output tokens are 3-5x more expensive than input tokens across most providers. This matters for your cost modeling: a chatbot that produces long, detailed responses will cost significantly more than one that gives concise answers.

GPU Rental Prices (April 2026)

For those planning self-hosted deployments, here are current cloud GPU rental rates. Prices vary significantly by provider, commitment level, and availability.

GPU	Hourly Rate Range	Monthly Estimate (730 hrs)
NVIDIA H100 (80 GB)	$1.49 - $6.98	$1,088 - $5,095
NVIDIA H200 (141 GB)	$2.29 - $10.60	$1,672 - $7,738
NVIDIA A100 (80 GB)	$0.78 - $2.50	$569 - $1,825
NVIDIA A6000 (48 GB)	$0.50 - $1.20	$365 - $876
NVIDIA L40S (48 GB)	$0.60 - $1.80	$438 - $1,314

The lower end of these ranges reflects spot pricing or long-term reservations with smaller GPU cloud providers (Lambda, RunPod, Vast.ai). The higher end reflects on-demand pricing from the major hyperscalers (AWS, Azure, GCP). For production workloads requiring reliability and SLAs, budget toward the mid-to-upper range.

Key takeaway: GPU rental prices have declined roughly 30-40% year-over-year as supply expanded, but they remain substantial. A single H100 at mid-range pricing ($2,500-$3,500/month) costs more per month than many traditional server configurations. This is GPU-as-a-premium-commodity, not GPU-as-a-utility.

API Cost Analysis: 100 Users Scenario

Now let us make the comparison concrete. Take our 100-user scenario and run the numbers through API pricing.

Assumptions

100 users, each generating approximately 1 million tokens per day (a mix of input and output)
That is 100 million tokens per day, or roughly 3 billion tokens per month
Blended rate assumes a typical 3:1 input-to-output token ratio

API Cost at Different Price Points

Model Tier	Blended Rate (per M tokens)	Monthly Cost (3B tokens)	Per User
Gemini Flash-Lite	~$0.18	$540	$5.40
GPT-4o-mini	~$0.30	$900	$9.00
Llama 70B (hosted)	~$0.75	$2,250	$22.50
Claude Haiku 4.5	~$2.00	$6,000	$60.00
Mistral Medium	~$1.50	$4,500	$45.00
GPT-4o	~$4.40	$13,200	$132.00
Claude Sonnet 4.6	~$6.00	$18,000	$180.00

Self-Hosting Cost for the Same 100 Users

From our earlier analysis of a self-hosted 20B model:

Component	Monthly Cost
Compute (2x H100)	$5,000 - $8,000
Operations overhead	$5,000 - $8,000
Total	$10,000 - $16,000
Per user	$100 - $160

The Comparison

Deployment	Monthly Cost	Per User
API: GPT-4o-mini	$900	$9
API: Claude Haiku 4.5	$6,000	$60
API: Mistral Medium	$4,500	$45
Self-hosted 20B model	$10,000 - $16,000	$100 - $160
API: GPT-4o	$13,200	$132
API: Claude Sonnet 4.6	$18,000	$180

The pattern is clear. Self-hosting is more expensive than small and mid-tier API models, and only approaches cost parity with frontier API models. At 100 users, you are paying $13,000-$16,000 per month for a 20B parameter model that is objectively less capable than Claude Sonnet or GPT-4o — which cost about the same via API.

Key takeaway: The traditional hosting logic is inverted. In traditional IT, self-hosting at moderate scale (100+ users) was almost always cheaper than buying a managed service. With AI, the hyperscaler API providers benefit from massive GPU fleets, custom silicon, and unbeatable economies of scale. Self-hosting only wins on cost at 500+ users with a smaller model — and even then, only if you are comparing against mid-tier API pricing. If your client is happy with GPT-4o-mini or Gemini Flash-Lite, no self-hosted deployment can compete on price. The case for self-hosting must rest on something other than cost.

The Complete Scale Comparison

Here is the summary view that every IT services provider should have pinned to their wall. We compare self-hosting a 20B model against API costs across four deployment scales.

Self-Hosted 20B Model vs API — Monthly Total Cost

Scale	Self-Hosted (Total)	API: Budget Tier	API: Mid Tier	API: Frontier
10 users	$2,500 - $4,000	$54	$600	$1,800
100 users	$10,000 - $16,000	$540	$6,000	$18,000
500 users	$11,000 - $18,000	$2,700	$30,000	$90,000
1,000 users	$18,000 - $27,000	$5,400	$60,000	$180,000

Budget tier: Gemini Flash-Lite / GPT-4o-mini class (~$0.18-$0.30/M blended). Mid tier: Claude Haiku / Mistral Medium class (~$2.00/M blended). Frontier: Claude Sonnet / GPT-4o class (~$6.00/M blended). Assumes 1M tokens/user/day.

Self-Hosted 20B Model vs API — Per-User Monthly Cost

Scale	Self-Hosted	API: Budget	API: Mid	API: Frontier
10 users	$250 - $400	$5.40	$60	$180
100 users	$100 - $160	$5.40	$60	$180
500 users	$22 - $36	$5.40	$60	$180
1,000 users	$18 - $27	$5.40	$60	$180

API pricing is perfectly linear — the per-user cost does not change with scale. Self-hosting costs drop dramatically as you add users. The crossover points are:

Self-hosting beats mid-tier APIs at roughly 400-500 users
Self-hosting beats frontier APIs at roughly 100-200 users
Self-hosting never beats budget-tier APIs at any realistic scale

The Crossover Chart

If you plotted per-user cost against number of users, you would see two very different curves. The API line is flat — $5.40 per user whether you have 10 or 10,000. The self-hosting curve is a steep hyperbola, starting at $250-$400 for 10 users and asymptotically approaching $15-$20 for very large deployments.

The lines cross at different points depending on which API tier you are comparing against:

Comparison	Crossover Point
Self-hosted vs Budget API (Flash-Lite, 4o-mini)	Never (self-hosting is always more expensive)
Self-hosted vs Mid-tier API (Haiku, Mistral Medium)	~400-500 users
Self-hosted vs Frontier API (Sonnet, GPT-4o)	~100-200 users
Self-hosted vs Premium API (Opus, GPT-4.1 + heavy usage)	~50-80 users

The On-Premises Economics: A Different Comparison

Everything above compares your self-hosting costs against API prices. That is the right comparison for clients who can use cloud APIs. But for a significant segment of the EU enterprise market — banking, healthcare, defense, legal, public sector, and any organization whose compliance or legal teams have ruled out external AI APIs — the API price is irrelevant. It is not an option they can choose.

For these clients, the comparison that matters is entirely different:

Your managed AI service vs. the client building and running it themselves
Your managed AI service vs. the client having no AI at all

This is traditional IT services economics, and the numbers look much more favorable.

What It Costs a Client to Do It Themselves

Consider a mid-sized European bank that wants to run a 20B model on-premises for 100 internal users. If they build and manage the infrastructure themselves, they need:

Cost Component	Annual Cost
GPU hardware (2x H100, amortized over 3 years)	$17,000 - $23,000
Server infrastructure, networking, cooling	$8,000 - $12,000
ML engineer (1 FTE, EU market)	$80,000 - $130,000
DevOps/infrastructure engineer (0.5 FTE)	$30,000 - $50,000
Software licensing, monitoring, security tools	$10,000 - $20,000
Training and upskilling	$5,000 - $10,000
Total annual cost (client DIY)	$150,000 - $245,000
Monthly equivalent	$12,500 - $20,400
Per user per month	$125 - $204

The dominant cost is not hardware — it is people. An ML engineer who can deploy, optimize, and maintain LLM inference infrastructure commands a significant salary in the EU market, and the client needs at least one full-time. Many will need more, especially during the initial setup phase.

What You Can Charge as a Managed Service

As an IT services provider, you have advantages the individual client does not:

Shared expertise. Your ML engineer serves multiple clients, not one. The cost is spread across your customer base.
Reusable tooling. Your deployment pipelines, monitoring dashboards, and update processes are built once and used for every client.
Operational maturity. You have been managing infrastructure for decades. The client’s newly-hired ML engineer is figuring it out for the first time.
Vendor relationships. You negotiate GPU procurement and cloud pricing at volume.

These advantages let you deliver the same service at a lower cost than the client can achieve alone — the exact same economics that made traditional IT managed services profitable.

Deployment Scale	Your Cost	You Charge	Client DIY Cost	Your Margin
10 users (dedicated)	$2,500 - $4,000/mo	$5,000 - $8,000/mo	$7,000 - $12,000/mo	40-55%
100 users (shared infra)	$10,000 - $16,000/mo	$18,000 - $28,000/mo	$12,500 - $20,400/mo	40-50%
500 users (platform)	$11,000 - $18,000/mo	$22,000 - $35,000/mo	$20,000 - $35,000/mo	45-55%

At 10 users, the economics are especially compelling. A small client cannot justify a full-time ML engineer for 10 users, but they still need someone to manage the infrastructure. Your shared-expertise model gives them enterprise-grade AI ops at a fraction of the cost of doing it themselves.

At 100 users, you price competitively with the client’s DIY cost while capturing a healthy margin. The client gets operational reliability, SLAs, and the ability to focus their internal team on using AI rather than running it.

At 500+ users, the client starts to have enough scale to justify their own team — but even then, your platform approach (serving multiple clients on shared infrastructure, with isolated data) can remain cost-competitive.

Key takeaway: For on-prem-required clients, your competition is not OpenAI or Google. It is the client’s internal IT team. And you beat internal IT teams the same way you always have: through operational specialization, shared costs across multiple clients, and mature tooling. The margin structure looks like traditional managed services — 40-55% — not the razor-thin margins of trying to compete with hyperscaler API pricing.

The Market Size Question

How large is the on-prem segment? No published data answers this precisely for GenAI specifically, but several indicators suggest it is substantial in the EU market:

Banking and financial services: ECB and national regulators increasingly scrutinize cloud concentration risk. Many EU banks maintain strict policies requiring sensitive data processing on-premises or in private cloud.
Healthcare: Patient data under GDPR has stringent processing requirements. Many EU healthcare systems have explicit policies against external AI APIs for clinical data.
Public sector: Government organizations across EU member states frequently require on-premises or sovereign cloud deployment for sensitive workloads.
Legal: Attorney-client privilege and professional confidentiality obligations create strong incentives for on-prem AI.
Defense and critical infrastructure: By definition, these sectors require controlled environments.

For a typical EU IT services provider whose client base skews toward regulated industries, the on-prem segment could represent 30-60% of potential AI service revenue. This is not a niche — it may be the core market.

The Combined Picture

The reality for most EU IT services providers is that they will serve both segments simultaneously:

Client Segment	Your Role	Revenue Model	Margin
On-prem required	Managed AI infrastructure provider	Monthly retainer + per-user fees	40-55%
Cloud-comfortable	AI solutions integrator	Project fees + API passthrough + support	25-40%
Both	Compliance and evaluation overlay	Assessment fees + monitoring retainer	50-65%

The healthiest business combines all three: infrastructure margins from on-prem clients, integration and advisory revenue from cloud-comfortable clients, and compliance services layered across both. Do not make the mistake of focusing exclusively on one segment when the other may be equally or more lucrative.

What These Numbers Do Not Capture

The tables above cover the direct, measurable costs. But several factors shift the equation in ways that are harder to quantify:

Factors favoring APIs: - Zero setup time — you can be live in hours, not weeks - Automatic model upgrades — when GPT-4.1 improves, you get it free - Elastic scaling — pay nothing during weekends and holidays - No GPU procurement risk — you never own depreciating hardware

Factors favoring self-hosting: - Data sovereignty — tokens never leave your infrastructure - Regulatory compliance — some EU regulations and client contracts require on-premises processing - Customization — fine-tuning, custom system prompts with no rate limits, specialized tooling - Predictable costs — no surprise bills from a prompt injection attack or runaway agent loop - Independence — no vendor can deprecate your model or change your terms overnight

For EU IT services providers, the data sovereignty and compliance arguments are often the strongest justification for self-hosting. The cost math alone rarely supports it. But combine cost with a genuine compliance requirement, and the self-hosting premium starts to look like a reasonable price for regulatory peace of mind.

Key takeaway: Do not build your business case on cost savings from self-hosting. Build it on the value of data sovereignty, regulatory compliance, and customization. Then use these numbers to price your offering realistically — knowing exactly what premium you are asking customers to pay, and why that premium is worth it.

Practical Implications for Your Pricing Strategy

These numbers lead to three immediate conclusions for how you should think about pricing:

1. Do not try to undercut API providers on price. You will lose. OpenAI, Google, and Anthropic are spending billions on custom silicon and infrastructure. Your cost per token will always be higher than theirs for equivalent model quality.

2. For on-prem clients, price against the client’s DIY cost — not against API pricing. A managed AI infrastructure service at $180/user/month is expensive compared to a $5.40/user/month API call, but it is a bargain compared to the $125-$204/user/month it would cost the client to build and staff it themselves. Frame your pricing against the right benchmark.

3. Price on value, not on cost-plus. If your service provides data sovereignty, compliance assurance, or specialized fine-tuning, price those outcomes directly. A $180/user/month service that keeps patient data on-premises is a different product from a $5.40/user/month API call that sends data to US servers.

4. Consider hybrid architectures. Route sensitive queries through your self-hosted infrastructure and non-sensitive queries through cheap APIs. This keeps your GPU utilization high on the work that actually requires privacy, while keeping costs down on everything else. We will explore this model in detail in Chapter 5.

The numbers tell two stories. For cloud-comfortable clients, the strategy is not about running models cheaper than the hyperscalers — it is about delivering expertise, integration, and compliance on top of their APIs. For on-prem-required clients, you are still in the infrastructure business, and the economics work in your favour — as long as you price against the right comparison.

Chapter 4

Chapter 4: Why Hyperscalers Win on Price

In Chapter 3, we ran the numbers on self-hosting versus API access. The result was not close. Even under generous assumptions — cheap CEE electricity, aggressive hardware negotiation, high utilization — the cost-per-token gap between your on-premises setup and a hyperscaler API call ranged from 5x to 15x, depending on model class and workload profile.

Many readers will look at those numbers and think: this is temporary. Prices will come down. Open-source models will catch up. We will optimize. And some of that is true — prices are dropping, open models are improving, and there are real optimizations to make.

But the core pricing gap is structural. It is not a market inefficiency waiting to be corrected. It is the result of at least five compounding advantages that hyperscalers have and you do not. Understanding these advantages is not defeatist — it is the foundation of every viable strategy we discuss in Chapters 5 through 7.

Custom Silicon: The Single Biggest Factor

When you run inference on NVIDIA H100s, you are not just paying for silicon. You are paying for NVIDIA’s gross margins, which have consistently exceeded 75% since the AI boom began. For every dollar you spend on GPU hardware, roughly 75 cents is NVIDIA’s profit. That profit margin is baked into every self-hosted token you produce.

Google does not pay that margin. Its TPU (Tensor Processing Unit) chips are designed in-house, fabricated at cost through TSMC, and deployed exclusively in Google data centers. There is no external vendor extracting a 75% margin from the silicon. The same logic applies to Amazon’s Trainium and Inferentia chips and to Microsoft’s Maia accelerator, which entered production in late 2025.

The performance characteristics differ — TPUs are optimized for matrix operations and large batch inference, not general-purpose GPU compute — but for the specific workload of running transformer models at scale, custom silicon is not merely competitive with NVIDIA hardware. It is dramatically cheaper per useful operation.

Conservative estimates put the cost-per-FLOP advantage of custom silicon at 3-5x compared to purchasing NVIDIA GPUs at market price. Some internal analyses suggest the advantage is even larger for inference specifically, because these chips can be architected for exactly the memory bandwidth and compute ratio that transformer inference demands, rather than the general-purpose design that NVIDIA must maintain to serve gaming, scientific computing, and training workloads simultaneously.

Key takeaway: When you buy NVIDIA GPUs, you are financing NVIDIA’s 75%+ gross margins. When Google uses TPUs, that margin disappears from the cost structure. This single factor accounts for a 3-5x cost difference before anything else is considered.

Utilization Rates: The Economics of an Empty GPU at 3am

A GPU that is not running inference is a GPU that is burning electricity and depreciation while producing zero tokens. This is the utilization problem, and it hits self-hosted deployments harder than almost any other factor.

A typical enterprise deployment serving a single company or a small cluster of clients will see dramatic demand variation. Peak hours might saturate the hardware. Nights, weekends, and holidays leave it idle. Realistic average utilization for a well-managed enterprise GPU cluster sits between 30% and 40%. Poorly managed ones — common among companies new to AI infrastructure — can drop below 20%.

Hyperscalers operate at 80-90%+ average utilization. They achieve this through three mechanisms that are simply unavailable at smaller scale:

Geographic demand smoothing. When Europe sleeps, the Americas are working. When the Americas sleep, Asia-Pacific picks up. A global customer base across all time zones flattens the demand curve in ways a regional provider never can.

Customer diversity. Millions of API users with uncorrelated workload patterns create natural statistical smoothing. Your batch processing job fills the gap left by another customer’s real-time chat application winding down.

Continuous batching. Modern inference engines do not process one request at a time. They dynamically batch thousands of concurrent requests, filling GPU compute capacity to its theoretical maximum. The KV-cache management and scheduling algorithms required to do this efficiently at scale represent years of engineering investment.

The math is straightforward. If your hardware runs at 35% utilization and a hyperscaler runs at 85%, the hyperscaler extracts 2.4x more useful tokens from the same dollar of hardware investment. This is not an optimization you can engineer away with better scheduling software. It is a function of scale and demand diversity.

Key takeaway: Self-hosted GPUs typically achieve 30-40% utilization. Hyperscalers run at 80-90%+. Same hardware, same power draw, but 2-3x more useful output per dollar — purely from having millions of diverse users across global time zones.

Model-Specific Optimizations: The Engineering Gap

When you self-host an open-source model, you typically run it through an off-the-shelf serving framework — vLLM, TGI, or similar. These are good tools. They implement PagedAttention, basic continuous batching, and standard quantization. They represent the state of the art for general-purpose open-source inference.

Hyperscalers do not use general-purpose tools for their flagship models.

Mixture-of-Experts (MoE) architectures. GPT-4o is almost certainly a Mixture-of-Experts model, as are Gemini and likely several other frontier systems. An MoE model might have 200 billion total parameters but only activate 20-30 billion for any given token. You get output quality comparable to a dense 200B model at the computational cost of a 30B model. This is an architectural advantage that the model provider captures but the self-hosting provider cannot replicate for proprietary models — and the best open-source MoE models (Mixtral, DBRX) still lag frontier quality.

Custom CUDA kernels and inference pipelines. Google, OpenAI, and Anthropic each maintain thousands of engineer-hours of custom inference code. Flash Attention variants tuned for their specific hardware. Custom memory management that exploits known access patterns. Speculative decoding implementations where a small draft model predicts likely continuations, allowing the large model to verify multiple tokens in parallel. Per-layer quantization schemes that selectively reduce precision where quality loss is minimal.

Hardware-software co-design. When you control both the chip and the software stack, you can optimize in ways that are impossible with off-the-shelf components. Google’s TPU software stack is co-designed with the hardware. The compiler, the runtime, the scheduling — everything is optimized for the specific silicon it runs on.

The cumulative effect of these optimizations is an additional 3-5x efficiency gain over what you can achieve with open-source tooling on commodity hardware. Some industry benchmarks suggest the gap may be even wider for the largest models.

Key takeaway: Hyperscalers run custom MoE architectures, proprietary CUDA kernels, speculative decoding, and co-designed hardware-software stacks. You run a dense model on vLLM. The engineering gap alone is worth another 3-5x in cost efficiency.

Scale Amortization: The Marginal Cost of One More User

Building a frontier LLM requires hundreds of millions to billions of dollars. GPT-4’s training run reportedly cost over $100 million. The research teams, the data pipelines, the RLHF infrastructure, the safety testing, the platform engineering — these are fixed costs that must be recovered.

When you spread those costs across millions of paying API users, the per-user burden becomes trivial. The marginal cost of adding one more API customer — until you need to provision additional GPU nodes — is effectively zero. The infrastructure is already running. The model is already loaded in memory. One more request in the batch changes nothing.

This is classical economies of scale, but the magnitude is unusual. The fixed-to-variable cost ratio in LLM serving is extreme. A hyperscaler’s cost structure is dominated by capital expenditure (hardware) and R&D (model development), both of which are fixed. The variable costs — electricity for incremental compute, network bandwidth — are tiny per request.

For a self-hosting provider, the math inverts. You bear the full fixed cost of hardware and operations, but spread it across a much smaller user base. Your per-token overhead from fixed costs can be 100x or 1000x higher than a hyperscaler’s, simply because you are dividing by thousands of users instead of millions.

Strategic Underpricing: The Land Grab

Here is the factor that makes the cost comparison even more lopsided than the structural advantages alone would suggest: current API prices are not cost-reflective. They are market-capture prices.

Google sells Gemini 2.0 Flash input tokens at $0.10 per million. At that price, it is plausible — perhaps likely — that Google is selling at or below cost, even on its own optimized infrastructure. Why? Because every developer who builds on Gemini Flash is a developer locked into Google Cloud Platform, consuming Vertex AI services, storing data in GCS, and running adjacent workloads on GCE. The LLM API is a loss leader for the cloud ecosystem.

OpenAI prices aggressively because market share matters more than profit in the land-grab phase of a platform market. Backed by Microsoft’s investment and its own multi-billion-dollar funding rounds, OpenAI can sustain below-cost pricing for years. Anthropic operates under similar logic with Amazon’s backing.

The numbers tell the story clearly. From early 2024 to early 2026, LLM API prices dropped approximately 80% for equivalent capability. GPT-4-class output that cost $30 per million tokens in early 2024 now costs $2.50-5.00 through GPT-4o. Small model pricing has collapsed even further — GPT-4o-mini and Gemini Flash offer capable output at $0.10-0.60 per million tokens.

These prices are not the floor. But they are also not sustainable reflections of true cost. They are the result of tens of billions of dollars in venture capital and strategic investment subsidizing the growth phase of the largest platform shift since cloud computing itself.

Key takeaway: Current API prices are market-capture prices, not cost-reflective prices. Google, OpenAI, and Anthropic are backed by billions in strategic investment and are pricing to win market share, not to maximize margin. You are competing against subsidized prices on top of structural cost advantages.

What Things Actually Cost

Cutting through the strategic pricing and working from hardware costs, energy, and engineering estimates, here is what frontier inference likely costs the major providers on their optimized infrastructure:

Frontier models (GPT-4o, Claude Sonnet, Gemini Pro) — actual provider cost is probably $1-3 per million output tokens on fully optimized custom silicon with high utilization. They sell at $2.50-15.00, meaning margins range from thin to healthy depending on the model and provider.

Small/fast models (GPT-4o-mini, Gemini Flash, Claude Haiku) — actual provider cost is probably $0.05-0.20 per million tokens at hyperscaler scale. They sell at $0.10-0.60, meaning some of these are genuinely near-cost or below-cost offerings.

Now compare those numbers to what self-hosting costs. From Chapter 3, a well-run on-premises deployment of a 70B open-source model lands at roughly $8-15 per million output tokens when you honestly account for hardware depreciation, energy, operations, and realistic utilization.

You are not competing with the API price. You are not even competing with the provider’s actual cost. You are operating in a fundamentally different cost regime.

The Compounding Effect

These advantages do not merely add up — they compound. Consider the full chain:

Advantage	Cost Multiplier
Custom silicon vs. NVIDIA margins	3-5x
Utilization (85% vs. 35%)	2-2.5x
Model optimizations (MoE, custom kernels)	3-5x
Scale amortization	2-5x
Combined theoretical advantage	36-300x

The real-world gap is smaller than the theoretical maximum because these factors overlap and interact. But a 10-30x all-in cost advantage is realistic for the largest providers. Even a conservative 5-10x gap is devastating if you are trying to compete on price.

When the Hyperscaler Advantage Does Not Matter

Everything above is true — and for clients who can freely choose between your self-hosted service and a hyperscaler API, it is devastating. But there is a large and important segment of the EU enterprise market where the hyperscaler cost advantage is irrelevant, because the hyperscaler API is not an option the client can select.

Consider a European bank whose compliance team has determined that customer financial data cannot be processed by external AI providers. Or a defense contractor handling classified information. Or a healthcare system where patient data governance rules out any external API, regardless of the provider’s data processing agreements.

For these clients, the comparison is not “your GPU cluster vs. Google’s TPU farm.” The comparison is:

Your managed AI infrastructure vs. the client building it themselves
Your managed AI infrastructure vs. no AI at all

In this comparison, the hyperscaler advantages we have catalogued — custom silicon, utilization rates, MoE architectures, scale amortization — are advantages that the client also cannot access. The client faces the same GPU purchase prices, the same utilization challenges, the same open-source model limitations that you do. You are competing on equal footing.

And on equal footing, the IT services provider wins — for the same reasons you have always won against internal IT departments: operational specialization, shared costs across multiple clients, mature tooling, and the ability to attract and retain skilled engineers more effectively than a bank or hospital can.

As Chapter 3 showed, the managed AI infrastructure margin for on-prem clients looks remarkably like traditional managed services: 40-55%. This is a profitable, sustainable business — and in the EU market, where regulated industries represent a substantial share of enterprise IT spending, it may be the largest addressable opportunity for many IT services providers.

Key takeaway: The hyperscaler cost advantages in this chapter apply when clients have a choice. Many EU enterprise clients do not. For regulated industries that require on-premises AI, you are not competing against Google’s TPUs — you are competing against the client’s internal team. That is a competition you can win, at margins you can build a business on.

The Uncomfortable Conclusion (For Cloud-Comfortable Clients)

For clients who can use cloud APIs, the cost-efficiency gap between hyperscaler APIs and self-hosted inference is probably the widest structural gap in all of enterprise software today. It is wider than the gap between on-premises email and Gmail. It is wider than the gap between running your own CDN and using CloudFlare. It is wider because the underlying technology — GPU/TPU inference at massive scale — has uniquely extreme returns to scale.

This gap cannot be closed by:

Cheaper CEE labor. Your operations team could work for free and it would not close a 10x cost gap that is rooted in silicon and utilization.
Lower margins. Even at zero margin, your cost structure does not reach their selling price.
Better open-source models. The model quality gap is narrowing. The infrastructure efficiency gap is not.
Waiting for prices to stabilize. Prices will stabilize eventually. They will stabilize at a level that reflects hyperscaler cost structures, not yours.

For this client segment, competing on infrastructure price is not viable. But this is only part of the story. The next three chapters explore business models that work across both client segments: the privacy proxy for cloud-comfortable clients with compliance concerns (Chapter 5), local deployment on employee devices (Chapter 6), and testing, security, and agentic infrastructure services (Chapter 7). For on-prem-required clients, the infrastructure business remains viable — and the services in Chapters 5-7 add margin on top.

Key takeaway: Know which game you are playing. For cloud-comfortable clients, the hyperscaler cost advantage is structural and permanent — compete on expertise, not on compute. For on-prem-required clients, the infrastructure business works because the hyperscaler alternative does not exist for them. Most EU IT services providers will serve both segments, and the winners will be those who price and position correctly for each.

Chapter 5 examines the first viable business model: acting as a privacy and compliance proxy between your clients and the hyperscaler APIs they need but cannot use directly.

Chapter 5

Chapter 5: The Vendor Ecosystem Play

Chapters 2 through 4 delivered a clear message: on raw compute economics, the hyperscalers win. For cloud-comfortable clients, competing on infrastructure price is a losing proposition. For on-prem-required clients, the managed infrastructure business works, but it serves a specific segment.

So where does the broadest, fastest near-term AI revenue come from?

For most EU IT services providers, the honest answer is not glamorous. It is reselling and implementing the AI features that your existing vendor partners are embedding into products your clients already use. Microsoft Copilot. SAP Joule. ServiceNow Now Assist. GitHub Copilot. The vendors you have partnered with for years are shipping AI into every product, and your clients need help turning it on, making it work, and measuring whether it was worth the money.

This is the path of least resistance. It is also, for many providers, the path of most revenue — at least in the near term.

The Vendor AI Landscape in 2026

Every major enterprise software vendor has now embedded AI into their core product suite. The approach varies — some charge per user, some use consumption-based models, some have absorbed the cost into base license increases — but the direction is universal. AI is no longer an add-on. It is becoming the product.

Here is what the landscape looks like as of early 2026.

Microsoft Copilot for Microsoft 365

Microsoft has made the most aggressive push, and for most EU IT services providers with Microsoft partnerships, this is the largest immediate opportunity.

Tier	Price
Business (promotional)	EUR 18/user/month
Business (standard annual)	EUR 21/user/month
Business (month-to-month)	EUR 25.20/user/month
Enterprise	EUR 30/user/month

Adoption has been significant but uneven. 70% of Fortune 500 companies have adopted Copilot for Microsoft 365, though “adopted” here often means pilot programs or phased departmental rollouts rather than wall-to-wall deployment. The UK government ran a pilot with 20,000 civil servants and reported an average of 26 minutes saved per employee per day — a number that, if sustained at scale, represents a compelling ROI story. PwC has deployed Copilot to over 230,000 users across more than 100 countries, making it one of the largest enterprise AI rollouts to date.

For partners, the economics improved materially in early 2026 when Microsoft increased partner incentives by approximately 50%. Investment funds now cover up to 20% of engagement costs for consultancy, planning, adoption, rollout, and education activities. Microsoft also introduced the Copilot certification (AB-900) and new specialization tracks, signalling that they want a certified partner ecosystem driving adoption, not just license sales.

SAP Joule / Business AI

SAP’s approach is consumption-based rather than per-seat. The core unit is the “AI Unit,” priced at approximately EUR 7 per unit, with a minimum commitment of 100 units per year (EUR 700 base). The model has two tiers:

Joule Base: included at no additional cost with SAP cloud products — covers basic AI assistant capabilities.
Joule Premium: advanced skills, autonomous agents, and premium scenarios — this is where the AI Units are consumed.

For SAP partners, the consumption model changes the implementation conversation. Instead of a fixed per-user cost, you are helping clients right-size their AI unit consumption based on actual usage patterns. This creates a natural advisory engagement that pure per-seat licensing does not.

ServiceNow Now Assist / AI Agents

ServiceNow uses a consumption model based on “Assist” tokens, with custom quotes required for each deployment. The baseline requirement is a Pro Plus or higher license for the underlying ServiceNow platform.

Usage varies significantly by action: a single incident summarization consumes 1 Assist token, while creating an entire application through AI consumes 20. This granularity creates complexity — and complexity, for a services partner, means opportunity. Clients need help forecasting consumption, optimizing usage, and deciding which workflows justify the AI spend.

GitHub Copilot

Tier	Price
Business	$19/user/month
Enterprise	$39/user/month

The Enterprise tier includes knowledge bases and custom model capabilities, making it relevant for organizations that want Copilot trained on their proprietary codebase. For IT services providers with development teams, GitHub Copilot also changes your own cost structure — a topic worth considering separately from the resale opportunity.

Google Gemini for Workspace

Google took a different approach entirely. Rather than charging a separate AI add-on fee, Gemini capabilities were absorbed into a 17-22% increase in base Workspace pricing during 2025. Business Standard at $16.80/user/month now includes full Gemini AI functionality.

For partners, this means the AI is already paid for — the conversation shifts entirely to adoption and value realization rather than license justification.

VMware Cloud Foundation (Broadcom)

Broadcom has positioned VMware Cloud Foundation as an AI-native platform, integrating AI capabilities into the core infrastructure rather than offering them as paid add-ons. The platform includes private cloud AI deployment capabilities, and Broadcom has publicly stated that private AI demand “far exceeded early expectations.”

For providers with VMware practices, this creates a bridge between the vendor ecosystem play described in this chapter and the local/private deployment models discussed in Chapter 7. VMware’s AI integration means that some clients can deploy AI capabilities within their existing private cloud infrastructure, managed by your existing VMware team.

The Comparison at a Glance

Vendor	Pricing Model	Entry Cost	Partner Opportunity
Microsoft Copilot M365	Per user/month	EUR 18-30/user	Highest volume, strongest incentives
SAP Joule	Consumption (AI Units)	~EUR 700/year minimum	Advisory on right-sizing, workflow design
ServiceNow Now Assist	Consumption (Assist tokens)	Custom quote	Complex forecasting, workflow optimization
GitHub Copilot	Per user/month	$19-39/user	Developer productivity, code security
Google Gemini Workspace	Bundled in base price	$16.80/user (included)	Pure adoption/training play
VMware Cloud Foundation	Included in platform	Platform license	Private AI deployment services

The Real Economics: Licenses Are the Anchor, Services Are the Profit

The license resale margin on embedded AI products is modest. Standard AI reseller programs offer 20-40% revenue sharing, depending on the vendor, your partnership tier, and volume. On a Copilot Enterprise license at EUR 30/seat, a 20% margin yields EUR 6/seat/month. For a 500-seat client, that is EUR 3,000 per month — EUR 36,000 annually. Decent recurring revenue, but not transformative.

The real margin is in the professional services that surround the license.

A 500-seat Copilot deployment is not a license activation. It is a project. It requires a readiness assessment of the client’s Microsoft 365 environment, data governance review, pilot design with selected user groups, phased rollout planning, workflow-specific training (Copilot in Excel is a different conversation than Copilot in Teams), change management to overcome adoption resistance, and measurement frameworks to demonstrate ROI. That engagement runs EUR 50,000-80,000 over three to four months, at professional services margins of 50-70%.

The math is stark:

Revenue Source	Annual Revenue (500 seats)	Margin
License resale (20% margin)	EUR 36,000	20% = EUR 7,200
Implementation services	EUR 50,000-80,000	60% = EUR 30,000-48,000
Ongoing optimization (quarterly)	EUR 20,000-40,000	65% = EUR 13,000-26,000

The services revenue is four to ten times the license margin. This is not a new pattern — it is exactly how the ERP implementation model worked for decades. SAP licenses were the entry point; Accenture made billions on the implementation. The AI embedded in enterprise software creates the same dynamic.

Key takeaway: Do not sell Copilot licenses. Sell AI-powered workplace transformation. The license is the anchor that creates the client engagement. The assessment, implementation, training, and ongoing optimization are where you make money.

The Strategic Risk — Said Honestly

Before you build your entire AI practice around vendor-embedded AI, you need to understand what you are signing up for.

You are a pass-through. The vendor holds the product, the roadmap, the client relationship at the platform level, and ultimately the lock-in. Your value exists in the gap between what the vendor ships and what the client can absorb. That gap is real today. It may not always be this wide.

Your margin is not yours. If Microsoft changes partner terms — reduces incentive funds, adjusts revenue sharing, or introduces direct-to-customer adoption tools — your license margin can evaporate overnight. You have no control over this. The early 2026 incentive increase was welcome, but the same lever that raised your margin by 50% could cut it by 50% just as easily.

You are not building proprietary capability. Every implementation methodology you develop, every training program you create, every adoption framework you design — your competitors are building the same things. There is nothing about a Copilot deployment at a 500-person company in Munich that cannot be replicated by another Microsoft partner in Munich. The differentiation is in execution quality and client relationships, not in proprietary technology.

The wrapper economy is struggling. Across the broader AI market, 95% of businesses built as thin wrappers around someone else’s AI capability are struggling to sustain revenue. The vendor ecosystem play is a higher-quality version of this — you are wrapping established enterprise products, not raw APIs — but the underlying dynamic is the same: you are renting someone else’s capability and adding a services layer.

Clients increasingly want your brand, not the vendor’s. White-label demand is growing — 68% of enterprise AI inquiries now request provider-branded solutions rather than visibly resold vendor products. Clients want “your AI-powered analytics platform,” not “we will help you set up Copilot.” This tension between what the vendor ecosystem offers (branded vendor AI) and what clients want (your differentiated solution) will only intensify.

The Real Opportunity: Making Adoption Actually Work

Here is the counterargument, and it is a strong one.

Ninety-five percent of enterprise AI pilots fail to deliver measurable ROI. Not because the technology does not work, but because organizations cannot bridge the gap between “we turned it on” and “it changed how we work.” That gap is your entire business opportunity.

The analyst data reinforces this. Gartner projects that 40% of enterprise applications will include task-specific AI agents by the end of 2026, up from less than 5% at the start of 2025. That is an extraordinary adoption curve — and Forrester estimates that three out of four firms attempting advanced agentic architectures independently will fail.

Your value is not the license. It is making the technology actually work in the client’s context.

This means structured adoption programs:

Readiness assessment. Evaluate the client’s data quality, process maturity, and organizational readiness before any license is purchased. Many clients are not ready for AI-assisted workflows, and telling them so — before they waste six months of license fees — builds trust that no competitor pitch can match.
Pilot design. Identify the three to five workflows where embedded AI will deliver measurable impact, design controlled pilots with clear success criteria, and run them for 60-90 days before committing to full rollout.
Workflow integration. The difference between “Copilot is available” and “Copilot is embedded in our procurement approval process” is the difference between a gadget and a tool. Integration into existing business processes is where most deployments stall and where your consulting expertise creates the most value.
Training and change management. Users who receive generic vendor training use AI features for two weeks and stop. Users who receive role-specific training tied to their actual daily tasks sustain adoption. Building and delivering that training is a recurring engagement.
Measurement and optimization. Quarterly reviews of usage data, productivity metrics, and ROI calculations — tied back to the business case that justified the investment. This is the engagement that renews indefinitely.

There is also a compliance dimension that connects directly to Chapter 11 of this booklet. Every vendor AI deployment in the EU needs an AI Act compliance overlay. Which Copilot features constitute “AI systems” under the Act? What transparency obligations apply when SAP Joule generates procurement recommendations? Who is responsible for bias in ServiceNow’s AI-generated incident categorization? These questions do not have default answers in the vendor documentation, and your clients need help navigating them.

Key takeaway: 95% of AI pilots fail to deliver ROI. That failure rate is your market. The providers who can turn vendor AI products into measurable business outcomes — not just activated licenses — will command premium services fees regardless of what happens to license margins.

How to Position This

The positioning matters as much as the capability. Selling “Copilot licenses” puts you in a commodity market where every Microsoft partner competes on the same product at the same price. Selling “AI-powered workplace transformation” puts you in an advisory market where your methodology, your track record, and your client relationships are the differentiators.

The bundle should look like this:

Assessment (fixed fee, EUR 8,000-15,000): evaluate readiness, identify high-impact use cases, build a business case with projected ROI.
Pilot (fixed fee, EUR 15,000-25,000): deploy to selected user groups, measure results against baseline, refine approach.
Rollout (project fee, EUR 25,000-40,000): phased deployment across the organization with role-specific training and workflow integration.
Optimization (retainer, EUR 5,000-10,000/quarter): ongoing usage analysis, new feature adoption, ROI reporting, compliance updates.

The license sits underneath all of this, flowing through as a line item. The client sees a transformation program. You see a services engagement with embedded recurring license revenue.

This is the ERP implementation model, updated. SAP licenses were always the entry point. The implementation, customization, training, and ongoing optimization were always the business. The same logic applies to embedded AI — the license is the anchor, the services are the profit.

The Honest Assessment

Let us be direct about where this chapter’s strategy sits in the broader landscape of options.

This is the lowest-risk, fastest-time-to-revenue path into AI services. You are leveraging existing vendor relationships, existing client accounts, and existing partnership infrastructure. The vendors are actively incentivizing you to do this. The clients are already asking for it. You do not need to build proprietary technology, hire ML engineers, or take on infrastructure risk.

It is also the least differentiated path. Every other partner in your vendor ecosystem is pursuing the same strategy, attending the same partner events, earning the same certifications, and pitching the same adoption frameworks. In a market where every Microsoft Solutions Partner offers Copilot deployment services, the competitive advantage reduces to execution quality and client trust — valuable, but not structural.

Use it as your entry point, not your destination. The revenue from vendor AI implementation funds the transformation described in the rest of this booklet. While your Copilot practice generates cash flow and builds AI credibility with clients, you invest in the more differentiated capabilities covered in Chapters 6 through 8: the privacy proxy, local deployment, and testing and agentic infrastructure. These models require more investment and longer time to market, but they build proprietary capability that a vendor cannot take away with a partner program change.

The vendor ecosystem play is the pragmatic first move. It gets you into the AI conversation with every client in your portfolio, generates immediate revenue, and teaches your team how enterprises actually adopt AI — lessons that transfer directly to every other model in this booklet.

But if it is your only move three years from now, you have a problem. The adoption gap that makes this strategy valuable today will narrow as vendors improve their own onboarding, as clients build internal competence, and as the technology becomes more self-service. The providers who thrive long-term will be those who used the vendor ecosystem revenue to build something the vendors cannot replicate.

Key takeaway: The vendor ecosystem play is the right first move for most EU IT services providers. It is fast, low-risk, and directly monetizes your existing partnerships. But it is a bridge to more differentiated capabilities, not a destination. The license margin pays the bills. The services margin funds the transformation. And the transformation is what keeps you relevant when the vendors inevitably make adoption easier without your help.

Chapter 6 examines the first independent business model: acting as a privacy and compliance proxy between your clients and the frontier AI models they need but cannot use directly.

Chapter 6

Chapter 6: Business Model: The Privacy Proxy

Chapter 5 described the most accessible path into AI revenue: reselling and implementing the AI features that your existing vendor partners ship. It works, it generates cash flow, and it builds credibility. But it leaves you dependent on the vendor’s roadmap, the vendor’s pricing, and the vendor’s partner program terms.

This chapter and the next two explore more independent business models — ways to build proprietary capability that a vendor cannot take away with a program change. We start with the model that feels most natural to European IT services providers: sitting between your clients and the public AI APIs, acting as a privacy and compliance intermediary.

The pitch is simple. Your client wants to use Claude, GPT-4o, or Gemini. They cannot — or believe they cannot — send their data directly to these APIs because of GDPR obligations, internal data governance policies, or contractual restrictions with their own customers. You build a proxy layer that strips personally identifiable information before it reaches the API, anonymizes sensitive business data, and re-injects the necessary context when the response comes back. The client gets frontier model intelligence. You handle the compliance headache. Everyone sleeps at night.

It is an appealing concept. It is also more complicated — and more fragile — than it first appears.

The Architecture

The privacy proxy sits as a stateless processing layer between the client’s application and the AI provider’s API. The flow looks like this:

The client’s application sends a prompt containing potentially sensitive data to your proxy endpoint.
Your proxy scans the prompt, identifies PII and sensitive business information, replaces it with anonymized placeholders, and logs the mapping.
The sanitized prompt goes to the AI API — OpenAI, Anthropic, Google, or whichever provider the client prefers.
The response comes back referencing the placeholders.
Your proxy re-injects the original values and forwards the completed response to the client.

The client never interacts with the AI API directly. From the AI provider’s perspective, they only ever see anonymized data. From the client’s perspective, they get full frontier model capability as if no proxy existed.

You add compliance value on top: audit logs showing exactly what data was sent and when, data residency guarantees (your proxy runs in the EU, the API call may go elsewhere but carries no identifiable data), and documentation that satisfies DPOs and regulators during audits.

The Economics

This is where the model looks attractive on a spreadsheet.

Per-Client Cost Structure

Component	Monthly Cost
Client’s API usage (pass-through)	~$5,000
Your proxy infrastructure (compute, networking)	$500 - $1,000
Your compliance premium (10% of API spend)	~$500
Client pays total	~$6,000
Your gross margin	~$500 per client

The proxy infrastructure itself is cheap. You are running a stateless processing layer — no GPU inference, no model hosting, no large storage requirements. A few well-configured containers behind a load balancer handle the PII detection, placeholder substitution, and re-injection. The compute is modest. The networking cost scales linearly with API call volume but remains a fraction of the API cost itself.

At Scale

Number of Clients	Monthly Proxy Margin	Monthly Infrastructure Cost	Net Monthly Margin
10	$5,000	$3,000 - $5,000	$0 - $2,000
25	$12,500	$5,000 - $8,000	$4,500 - $7,500
50	$25,000	$8,000 - $12,000	$13,000 - $17,000
100	$50,000	$12,000 - $18,000	$32,000 - $38,000
200	$100,000	$18,000 - $28,000	$72,000 - $82,000

The infrastructure costs do not scale linearly with client count because the proxy layer is fundamentally lightweight and shares resources well. At 50 clients, you are looking at $13,000-$17,000 per month in net margin — roughly $160,000-$200,000 annually. Respectable, but not a business that funds itself at small scale. You need volume.

There is also staff cost to consider. Running a privacy proxy is not zero-touch. You need engineers maintaining the PII detection rules, monitoring for false negatives (sensitive data that slipped through), updating the system as new data patterns emerge, and responding when a client’s compliance team has questions. Budget at least two to three full-time engineers for a production service. At European salaries, that is $200,000-$400,000 annually — which means you need 50+ clients just to break even on the dedicated staff, before accounting for sales, management overhead, and the engineering effort to build the platform in the first place.

Key economics: The privacy proxy is a thin-margin, volume-dependent business. At 10 clients, you lose money. At 50, you break even. At 100+, the economics start working. The question is whether you can acquire and retain 100+ clients for a service that faces significant competitive pressure from the very vendors you are proxying.

The Technical Reality

The concept is clean. The implementation is where things get difficult.

PII Detection Is Harder Than It Looks

The naive approach — regex patterns matching email addresses, phone numbers, social security formats, credit card numbers — catches the obvious cases. Tools exist for this: Microsoft Presidio is open-source and handles structured PII patterns well. Private AI and Protecto offer commercial PII detection with higher accuracy. These are reasonable starting points.

But the hard cases are not structured patterns. They are context-dependent.

“The patient in room 412 responded well to the treatment.” No PII by regex standards. But if the client is a hospital and only one patient occupied room 412 on a given day, that sentence identifies an individual. “Revenue from the Hamburg project exceeded projections by 40%.” No names, no identifiers — but if the client has only one project in Hamburg, this is commercially sensitive information that a competitor could use. “Send the follow-up to the person who complained about the delivery last Tuesday.” No PII, but the context makes re-identification trivial within the client’s organization.

Context-dependent sensitivity is a genuinely hard problem. It requires understanding the client’s data landscape, not just pattern-matching against a list of PII formats. The further you go down this path, the more your “lightweight proxy” starts to look like a bespoke consulting engagement for every client.

Re-Injection Is Genuinely Hard

Stripping PII from the outbound prompt is the easier half. Re-injecting it into the response is where things break.

If the prompt says “Summarize the performance review for [PERSON_1]” and the response says “The review for [PERSON_1] was generally positive,” the re-injection is trivial — find the placeholder, replace it with the original value.

But what if the response says “The employee demonstrated strong leadership qualities and was recommended for the senior management track”? The model understood that [PERSON_1] is a person and generated a response that refers to them indirectly without using the placeholder. Your re-injection logic has no placeholder to replace. The response is correct but now disconnected from the original identity in ways that may confuse the end user or break downstream processing.

Complex outputs — tables, multi-step analyses, documents with cross-references — make this worse. The more sophisticated the AI’s response, the more likely it is to paraphrase, restructure, or indirectly reference the anonymized entities in ways that your placeholder substitution cannot handle cleanly.

Latency

Every proxy hop adds latency. Your PII detection runs before the API call. Your re-injection runs after. For simple requests, the overhead might be 50-200 milliseconds — negligible when the API call itself takes 2-5 seconds. For high-throughput applications or streaming responses, the overhead becomes more noticeable and harder to manage. Streaming in particular is painful: you need to buffer enough of the response to identify and replace placeholders before forwarding, which defeats the purpose of streaming for the end user.

The Honest Problems

The technical challenges are solvable with enough engineering effort. The strategic problems are harder.

The Vendors Are Closing This Gap

Azure already offers zero data retention options and EU data boundary capabilities. Anthropic offers regional data processing. Google Cloud provides data residency controls. Every major AI provider has recognized that enterprise data handling is a first-order concern, and they are investing heavily in solving it at the platform level.

Each vendor announcement that improves their native data handling erodes your value proposition. When Microsoft announces that Azure OpenAI Service processes and stores all data within the EU with zero retention and full audit logging — and that announcement is a matter of when, not if — your compliance premium gets harder to justify. The client can go direct and get the same guarantees without the proxy overhead.

One Announcement Can Undercut You

This is the fragility at the heart of the model. Your entire business depends on a gap between what the AI providers offer natively and what your clients’ compliance teams require. That gap is real today. But it is closing, and it can close suddenly. A single product announcement from Microsoft, Google, or Anthropic about enhanced EU data residency, provable data deletion, or compliance certification can eliminate the core value proposition for a significant portion of your client base in a single quarter.

You cannot build a durable business on a gap that the party on the other side of that gap is actively working to close.

The 10% Premium Is Thin

A 10% premium on API spend gives you $500/month on a $5,000/month client. That is a real number, but it is a small number. If the client’s API usage drops — because they optimize their prompts, switch to a cheaper model, or reduce usage — your revenue drops proportionally. You have no floor.

Compare this to the local deployment model in Chapter 7, where your per-user software license creates predictable recurring revenue regardless of usage volume. Or the vendor ecosystem play in Chapter 5, where professional services fees are decoupled from the underlying license cost. The proxy model ties your revenue directly to a variable you do not control: how much the client spends on API calls.

Competition From Dedicated Players

You are not the only one who sees this opportunity. Dedicated privacy middleware companies — Private AI, Protecto, Skyflow, and others — are building exactly this capability as their core product. They have deeper ML expertise in PII detection, more sophisticated anonymization techniques, and the ability to invest their entire engineering budget in improving accuracy. You are building a proxy as one of several service offerings. They are building it as their entire company.

When a client evaluates your privacy proxy against a dedicated solution from a company whose entire reputation depends on getting PII detection right, the comparison is not flattering unless you bring something the dedicated players cannot: the broader relationship, the compliance consulting, the integration services.

Where It Actually Works

Given all of the above, where does the privacy proxy model create genuine, defensible value?

As a Feature, Not a Product

The privacy proxy works best as one layer in a larger platform — not as a standalone offering. If you are already providing managed AI services, compliance consulting, integration work, and ongoing optimization, the privacy proxy becomes a natural component of the overall service. It adds value without needing to carry the full weight of a standalone business case.

Combined With Compliance Consulting

The proxy alone is a commodity. The proxy combined with a data protection impact assessment, ongoing compliance monitoring, EU AI Act classification support, and regulatory reporting — that is a consulting engagement at consulting margins. The proxy is the delivery mechanism for a broader compliance service that cannot be replicated by a software tool alone.

For Highly Regulated Industries

Healthcare, financial services, public sector, legal — industries where regulatory uncertainty itself is the problem. These clients do not just want data residency. They want an accountable partner who will testify during an audit that data handling met regulatory requirements. They want contractual guarantees backed by a local entity subject to local jurisdiction. They want someone to call when a regulator asks questions.

For these clients, the proxy is the technical implementation of a trust relationship. The 10% premium is trivial compared to the cost of regulatory non-compliance — fines under GDPR can reach 4% of global annual turnover. You are not selling a proxy. You are selling provable compliance and audit-readiness.

For Clients Needing Audit Trails

Some organizations need to demonstrate, with evidence, exactly what data was sent to an AI system, when it was sent, what was returned, and how PII was handled at every step. This is not a technical preference — it is a legal and contractual obligation. Insurance companies answering to regulators, law firms managing client confidentiality, government agencies subject to freedom-of-information requirements.

Your proxy generates these audit trails as a by-product of its core function. The logs, the anonymization records, the data flow documentation — these have standalone value for organizations that would otherwise need to build this instrumentation themselves.

Key takeaway: The privacy proxy creates the most value when it is embedded in a broader compliance and advisory relationship — not when it is sold as a standalone middleware product. The technology is the delivery mechanism. The trust, the accountability, and the regulatory expertise are the actual product.

The Recommendation

Build the privacy proxy as a layer, not a company.

If you are already serving enterprise clients who need AI capabilities but face genuine compliance constraints, a privacy proxy adds real value to your service portfolio. It solves an immediate problem for the client, generates incremental margin on API spend, and deepens the relationship by making you the trusted intermediary for their AI usage.

But do not build a standalone business around it. The margins are too thin to sustain a dedicated company. The competitive threats — from vendors closing the data handling gap, from dedicated privacy middleware companies, from evolving platform capabilities — are too numerous and too unpredictable. A single product announcement from a major AI provider can materially damage your revenue in a quarter.

Instead, treat the proxy as one component of a broader managed AI services offering:

Year 1: Build the proxy capability, deploy it for your most compliance-sensitive clients, learn what actually matters in production PII handling.
Year 2: Integrate it with your compliance consulting practice, bundle it with EU AI Act readiness assessments, make it part of your standard enterprise AI onboarding.
Year 3: The proxy is a feature of your platform, not a product. It differentiates your managed AI service from competitors who do not offer it, but it does not need to carry its own P&L.

The providers who build their entire strategy around the privacy proxy will find themselves squeezed between vendors solving the problem natively and dedicated middleware companies solving it better. The providers who build it as one layer of a comprehensive service — the compliance wrapper around the technical wrapper — will find it a durable, if modest, source of differentiation and margin.

Key takeaway: The privacy proxy is viable as an add-on and fragile as a standalone business. Build it as a layer in your managed AI services stack. Combine it with compliance consulting, integration services, and audit-readiness support. Do not bet the company on a gap that the AI providers are actively working to close — but do use it to deepen client relationships while the gap remains open.

Chapter 7 examines the most technically independent model: deploying open-source AI directly on employee devices, where no data ever leaves the hardware your client already owns.

Chapter 7

Chapter 7: Business Model: Local Deployment on Employee Devices

Chapter 6 described the privacy proxy — a model with genuine value but structural fragility. Your business depends on a compliance gap between what AI providers offer natively and what clients require, and that gap is closing. The proxy adds a layer of trust. It does not eliminate the fundamental issue: your client’s data still leaves the building.

This chapter describes a model where the data never leaves the device. Not “we promise not to store it.” Not “we process it in the EU.” Not “we offer zero data retention.” The data literally never touches a network interface. The model runs on the employee’s laptop, the prompt stays on the laptop, the response is generated on the laptop, and nothing — not a single token — is transmitted anywhere.

This is the local deployment model. And of all the business model pivots described in this booklet, it is the one with the most compelling long-term trajectory.

The Concept

The idea is straightforward. You take an open-source large language model — Llama, Mistral, Phi, Qwen, Gemma, or any of the dozens now available — quantize it to 4-bit precision (INT4), and deploy it to run natively on hardware the client already owns or can acquire at consumer pricing.

The enabling technology stack has matured rapidly. On Apple Silicon Macs, llama.cpp and Apple’s own MLX framework provide optimized inference that fully utilizes the unified memory architecture. On Windows and Linux machines with discrete GPUs, the same llama.cpp runtime with CUDA or Vulkan backends delivers comparable throughput on NVIDIA and AMD hardware. The tooling has reached a point where a competent engineer can have a quantized 8B model running on a MacBook in under an hour. The question is no longer whether this works. It is who packages, deploys, maintains, updates, and supports it across hundreds or thousands of employee devices.

That is where you come in.

Your service is not “install Ollama and hand over the laptop.” Your service is a managed local AI deployment: curated model selection for the client’s use cases, quantization and optimization for their specific hardware fleet, a lightweight management layer for pushing model updates and configuration changes, guardrail configuration to prevent misuse, integration with the client’s existing applications and workflows, and ongoing support when something breaks or a better model becomes available.

This is, fundamentally, endpoint management — the business many IT services providers have been running for years. You are adding an AI layer to a service delivery model you already understand.

The Economics

The economics of local deployment are the inverse of every other model in this booklet. Instead of managing the tension between compute costs and client willingness to pay, you are working with hardware the client already owns and software that is free. Your entire revenue is margin.

Hardware: What Clients Already Have (or Can Afford)

The hardware requirements are modest and getting more modest every year.

Hardware	RAM	Models Supported	Approximate Cost
MacBook Air M2 (16GB)	16 GB	7-8B models at INT4	~$1,200 (often already owned)
MacBook Pro M3/M4 (24GB)	24 GB	8-13B models at INT4	~$2,000-2,500
MacBook Pro M3/M4 (36GB)	36 GB	Up to 30B models at INT4	~$2,800-3,200
Windows laptop + RTX 4060 (8GB VRAM)	8 GB VRAM	7-8B models at INT4	~$1,200-1,500
Windows workstation + RTX 4090 (24GB VRAM)	24 GB VRAM	8-13B models at INT4	~$2,500-3,500

Many of your clients’ employees already have 16GB or 24GB MacBooks — the standard corporate procurement spec for knowledge workers has been trending upward for years. For those who need an upgrade, a 24GB MacBook Pro at approximately $2,000-2,500 is a normal laptop refresh cost, not a special AI hardware investment. The client’s procurement team barely blinks.

Your Revenue Model

You charge a per-user monthly fee for the managed local AI service.

Component	Monthly Cost
Software license + managed service	$20-50/user/month
Your infrastructure cost per user	~$0
Your support and maintenance cost (amortized)	$3-8/user
Your gross margin per user	$12-47/user

Read that infrastructure line again. Zero. You are not hosting anything. You are not paying for compute. You are not buying API tokens. You are not running a proxy layer. The model runs on the client’s hardware, consuming the client’s electricity, using the client’s memory. Your cost is the engineering time to build and maintain the platform, amortized across your entire client base.

The Comparison That Matters: Local vs API Total Cost of Ownership

Here is where the economics become genuinely compelling.

Consider a knowledge worker who uses AI heavily — a consultant, analyst, developer, or content creator. Not a casual user asking one question per day, but someone who has integrated AI into their workflow and runs dozens of sessions daily.

Usage Level	Daily Tokens	API Cost (GPT-4o class)	API Cost (Claude Sonnet class)	Annual API Cost
Light user	~100K tokens	$0.30-0.50/day	$0.25-0.45/day	$100-180/year
Moderate user	~1M tokens	$3-5/day	$2.50-4.50/day	$900-1,800/year
Heavy user	~5M tokens	$15-25/day	$12-22/day	$4,400-9,000/year
Power user	~10M+ tokens	$30-50/day	$25-45/day	$9,000-18,000/year

A MacBook Pro at $2,500, serving three to four years of daily use, has an annualized hardware cost of $625-835. Add your $30/month managed service fee — $360 annually — and the total annual cost is approximately $1,000-1,200. For a heavy user who would otherwise spend $4,400-9,000 per year on API calls, the local deployment saves $3,200-7,800 annually. The hardware pays for itself in months, not years.

And this is the key economic insight: there is zero marginal cost per token. Once the model is loaded into memory, the user can generate a million tokens or ten million tokens and your cost does not change. Their cost does not change. There is no meter running. No bill shock at the end of the month. No procurement approval needed when the team’s usage exceeds the forecasted API budget.

Margin Structure at Scale

Number of Users	Monthly Revenue ($35/user avg)	Monthly Cost (engineering + support)	Monthly Gross Margin	Gross Margin %
50	$1,750	$1,200	$550	31%
200	$7,000	$2,500	$4,500	64%
500	$17,500	$4,000	$13,500	77%
1,000	$35,000	$5,500	$29,500	84%
2,000	$70,000	$8,000	$62,000	89%

The margin structure improves dramatically with scale because adding user 101 or user 1,001 costs you almost nothing in compute. Your costs are engineering headcount (platform development, model testing, update preparation) and support staff. These grow sub-linearly with the user base. At 500+ users, you are operating at 77%+ gross margins — comparable to a SaaS business, but without the hosting bill.

Key economics: Local deployment is the only AI business model in this booklet where your compute cost is literally zero. The client owns the hardware. The models are open-source. Your entire fee is margin minus engineering and support labor. At scale, this delivers 70-85% gross margins with no infrastructure risk.

The Defensibility

Every business model needs a moat. Local deployment has several, and they reinforce each other.

Zero Data Exfiltration — The Only True Guarantee

This is the strongest selling point, and it deserves emphasis. The privacy proxy in Chapter 6 anonymizes data before sending it to an API. That is good. But the data still travels over a network, still reaches a third-party server in some form, and still requires trust that the anonymization was complete and the provider honored their data handling commitments.

Local deployment eliminates the entire chain. The data does not leave the device. There is no network call to intercept. There is no third-party server to trust. There is no data processing agreement to negotiate because no data is being processed by anyone other than the employee’s own machine. For industries where data sensitivity is not a preference but a legal requirement — defense contractors, intelligence agencies, law firms handling privileged communications, healthcare providers with patient data, financial institutions with trading strategies — this is not a nice-to-have. It is the only architecture that satisfies the requirement.

No other deployment model can make this claim. Not the privacy proxy. Not Azure’s EU data boundary. Not Anthropic’s regional processing. Only local.

No Hosting Costs for You

You bear no infrastructure cost. No servers to provision. No GPUs to lease. No cloud bills that spike when usage spikes. Your cost structure is entirely labor-based and predictable. This means you can price aggressively during acquisition and still maintain healthy margins as the client relationship matures.

Scales Beautifully

Adding a new user means deploying the model to one more laptop. There is no backend capacity to plan, no API rate limits to manage, no inference queue to optimize. Each device is its own self-contained inference server. User 1,001 gets the same performance as user 1, regardless of what the other 1,000 are doing. There is no shared resource contention.

Works Offline

Employees on airplanes, at client sites without reliable WiFi, in secure facilities that prohibit external network connections, in regions with poor connectivity — they all have full AI capability. For consulting firms whose people spend half their time at client sites, for field engineers, for traveling executives, this is a practical advantage that cloud-based AI cannot match.

Natural Extension of Your Existing Business

If you manage endpoints today — deploy software, push updates, enforce security policies, maintain configurations across a fleet of corporate devices — then local AI deployment is a natural extension of that capability. You already have the MDM (Mobile Device Management) infrastructure, the deployment pipelines, the support processes, and the client relationships. Adding a managed AI layer to your existing endpoint management service is an upsell, not a pivot.

The Trajectory Argument

This is the most important section of this chapter, because it addresses the obvious objection: “But local models are not as good as Claude or GPT-4o.”

That is true today. A quantized 8B model running on a MacBook is noticeably less capable than GPT-4o or Claude Sonnet at complex multi-step reasoning, long-document analysis, nuanced coding tasks, and sophisticated creative writing. The quality gap is real and your clients will notice it.

But consider the trajectory.

The best 8-13B models available in early 2026 — Llama 3.3 8B, Mistral 7B v0.4, Phi-4, Qwen 2.5 — are already substantially better than GPT-3.5 was when it launched ChatGPT and ignited the entire generative AI revolution. GPT-3.5 was good enough to onboard 100 million users in two months. Today’s local models exceed that capability, running entirely on a laptop, with no internet connection required.

And the trajectory is accelerating from both sides: models are getting better at smaller sizes, and hardware is getting more powerful.

The Model Side

Every major AI lab is investing heavily in efficient small models. The techniques that make this possible — knowledge distillation from larger models, more efficient training data curation, architectural improvements like mixture-of-experts at smaller scales, improved quantization methods that reduce precision loss — are advancing rapidly. The gap between a 10B parameter model and a 100B parameter model in 2026 is meaningfully smaller than the same gap was in 2024.

In two to three years, a 30-40B model will fit comfortably in the 32-48GB of unified memory that mid-range Apple Silicon laptops will ship with. A 30-40B model in 2028, trained with the techniques of 2028, will be competitive with the frontier models of 2026 for the vast majority of business tasks. Not for cutting-edge research. Not for the hardest reasoning benchmarks. But for drafting emails, summarizing documents, analyzing spreadsheets, generating reports, writing code, answering questions about internal documentation — the tasks that constitute 90% of enterprise AI usage.

The Hardware Side

Apple, Qualcomm, and Intel are all pushing Neural Processing Unit (NPU) performance as a primary competitive differentiator. Apple’s M-series chips already provide the best consumer-grade AI inference performance per watt. Qualcomm’s Snapdragon X Elite brought competitive NPU performance to Windows laptops in 2024, and subsequent generations are closing the gap. Intel’s Lunar Lake and Arrow Lake architectures include substantially improved NPU capabilities.

The trend is unmistakable: every chip vendor is optimizing for on-device AI inference. They are doing this because they see the same market opportunity you do. Every improvement they ship makes your local deployment service more capable, at no additional cost to you.

The Strategic Implication

Building local deployment muscle now — the tooling, the expertise, the deployment processes, the client relationships, the model evaluation methodology — gives you a massive head start. When 30-40B models run smoothly on standard corporate laptops in two to three years, the providers who have been deploying and managing local AI since 2026 will have years of operational experience, established client relationships, refined update processes, and a reputation for making this work. The providers who waited will be starting from scratch in a market where the early movers have already locked up the most sophisticated clients.

Key takeaway: The intelligence gap between local and cloud models is real today and closing fast. Building local deployment capability now is not about what 8B models can do today — it is about being the established provider when 30-40B models run on every laptop in two to three years. The providers who start now will own this market. The providers who wait will be competing on price against incumbents with years of operational advantage.

The Honest Problems

The trajectory is encouraging. The present has real limitations. Your sales team and your clients need to understand both.

Quality Gap Is Noticeable

A user who has experienced Claude Sonnet or GPT-4o will notice the difference when using a local 8B model. Complex multi-step reasoning degrades. Nuanced coding tasks produce more errors. Long-context analysis — the kind where a user pastes a 50-page contract and asks for a summary — may exceed the local model’s effective context window or produce less accurate results. Creative writing lacks the polish of frontier models.

This is not a subtle difference. Users will compare, and the comparison will not always favor the local model. Your positioning must be honest about what the local model excels at (fast responses, data privacy, offline availability, unlimited usage) and where users should expect to use a cloud model for demanding tasks.

Users Will Compare to ChatGPT

This is the consumer expectations problem. Your client’s employees use ChatGPT or Claude at home. They know what frontier models feel like. When you give them a local model that stumbles on a complex query, their instinct is not “this is a reasonable trade-off for data privacy.” Their instinct is “this is worse.” Managing that expectation requires proactive communication, honest capability documentation, and — critically — the hybrid approach described below.

Model Update and Guardrail Management

When a better model becomes available — and this happens every few months — who tests it, validates it against the client’s use cases, ensures the guardrails still work, and pushes it to 500 laptops without disrupting anyone’s workflow? This is an MDM-like challenge, and it is genuinely hard. Models are not operating system patches. A model update can change the behavior of every AI interaction the employee has. Testing and validation before deployment is essential, and the tooling for this is still immature.

You also need guardrails without a server. Content filtering, usage policies, and output restrictions typically rely on a server-side layer that inspects requests and responses. In a local deployment, that layer must run locally as well, consuming additional resources and adding complexity to the deployment. Getting this right — especially in regulated industries where guardrail failures have compliance implications — requires real engineering effort.

Windows Fragmentation

Apple Silicon provides a uniform, predictable platform for local AI inference. Every M2/M3/M4 Mac has unified memory that the model can fully utilize, and the performance characteristics are well-understood and consistent.

Windows is a different story. Some corporate laptops have discrete NVIDIA GPUs with enough VRAM. Some have AMD GPUs with different driver requirements. Some have only integrated graphics and rely entirely on CPU inference, which is dramatically slower. Some have Qualcomm NPUs. The hardware diversity means you need to test and optimize for multiple configurations, maintain multiple deployment profiles, and support users whose experience varies significantly based on the hardware lottery of their corporate procurement.

For clients with a homogeneous Mac fleet, local deployment is clean. For clients with heterogeneous Windows hardware, expect fragmentation headaches and plan your support costs accordingly.

The Hybrid Approach: Best of Both Worlds

The smartest deployment is not pure local. It is local-first with cloud fallback.

The architecture works like this: the local model handles everyday tasks — email drafting, document summarization, quick code generation, Q&A against internal knowledge bases, meeting note processing, routine analysis. These represent 80-90% of a typical knowledge worker’s AI interactions, and a good 8-13B model handles them well.

When the user encounters a task that requires frontier model capability — complex multi-step reasoning, long-document analysis, sophisticated code refactoring, nuanced creative work — the system routes that request to a cloud API. The user experiences a seamless transition. The local model handles the volume. The cloud model handles the peaks.

This hybrid approach offers three advantages:

Cost optimization. The vast majority of tokens are generated locally at zero marginal cost. Only the genuinely demanding tasks incur API charges, reducing the client’s cloud AI spend by 80-90% compared to full cloud deployment.
Quality management. Users get frontier model quality when they need it, without the “this is worse than ChatGPT” frustration. The system intelligently routes based on task complexity, or the user can explicitly request cloud processing for important tasks.
Graceful degradation. When the user is offline, the local model handles everything. The experience degrades gracefully rather than failing entirely. For employees who travel or work in low-connectivity environments, this is the difference between having AI and not having it.

Your managed service includes configuring the routing logic, managing the API integration for cloud fallback, and optimizing the split between local and cloud based on the client’s usage patterns and budget. This routing optimization itself becomes a recurring advisory engagement — reviewing monthly usage data, adjusting thresholds, recommending model upgrades, and ensuring the client gets maximum value from both tiers.

The Model Lifecycle: Your Recurring Revenue Engine

Open-source models are superseded every few months. Llama 3 replaced Llama 2. Mistral v0.3 replaced v0.2. Phi-4 replaced Phi-3. Qwen 2.5 replaced Qwen 2. Each new release brings meaningful improvements in capability, efficiency, or both.

For an individual user running Ollama on their personal laptop, a model update is a download and a restart. For an enterprise with 500 employees relying on local AI for daily work, a model update is a project.

Someone needs to evaluate the new model against the client’s specific use cases. Someone needs to test the guardrails with the new model’s behavior characteristics. Someone needs to validate that the quantized version maintains acceptable quality. Someone needs to plan the rollout — all 500 devices at once, or a phased deployment with a canary group? Someone needs to handle the exceptions — the devices that fail to update, the users who report regressions, the edge cases where the new model behaves differently on a task the old model handled well.

That someone is you. And that lifecycle management is recurring revenue that renews every time a meaningful new model is released — which, at the current pace of open-source AI development, means quarterly at minimum.

The revenue streams compound:

Model evaluation and recommendation: Quarterly assessment of new models against the client’s requirements. Advisory engagement at consulting rates.
Fine-tuning transfer: If the client has invested in fine-tuning the current model on their domain data, that fine-tuning needs to be transferred or recreated for the new base model. This is specialized work that commands premium pricing.
Deployment and rollout: The actual push of the new model to all devices, including testing, staging, and production deployment. Project-based revenue.
Guardrail reconfiguration: Each new model may require updated content filtering rules, output formatting adjustments, and compliance validation. Ongoing maintenance revenue.
Performance optimization: Tuning inference parameters, adjusting quantization settings, optimizing memory usage for the specific hardware fleet. Technical services revenue.

This lifecycle creates a durable, recurring revenue relationship that deepens over time. The longer you manage a client’s local AI deployment, the more institutional knowledge you accumulate about their use cases, their hardware fleet, their users’ preferences, and their compliance requirements. Switching to another provider means losing all of that accumulated context — a meaningful switching cost that protects your revenue without contractual lock-in.

Key takeaway: The model lifecycle is not a burden — it is your business model. Every new open-source release creates a managed upgrade engagement. Every fine-tuning investment creates transfer work. Every guardrail update requires validation. The pace of open-source AI development is not a threat to your business. It is the engine that drives recurring revenue.

The Recommendation

Build local deployment capability now. Do not wait for the models to get better — they will, and when they do, you want to be the established provider with operational experience, not the newcomer trying to catch up.

Start with your clients who have the clearest need: those with strict data sovereignty requirements, those with homogeneous Apple Silicon fleets, those whose employees already use AI heavily and are generating large API bills, and those in regulated industries where the “zero data exfiltration” guarantee has immediate compliance value.

Your initial offering should include:

Assessment (fixed fee): evaluate the client’s hardware fleet, identify target use cases, recommend models, and build a deployment plan.
Deployment (project fee): configure, test, and push the local AI stack to all target devices, including guardrails, integrations, and user training.
Managed service (monthly per-user fee): ongoing model updates, performance monitoring, support, and quarterly model lifecycle reviews.
Hybrid integration (optional add-on): cloud API fallback configuration, routing optimization, and cost management for the local+cloud tier.

The pricing — $20-50 per user per month for the managed service — positions you well below per-user API costs for moderate to heavy users while delivering margins that improve dramatically with scale. At 500 users paying an average of $35/month, you are generating $210,000 in annual recurring revenue at 77% gross margins. That is a real business built on open-source software, commodity hardware, and operational expertise.

This is the good news chapter. Of all the business model pivots available to EU IT services providers, local deployment is the one where your existing skills — endpoint management, device fleet support, software deployment, security configuration — translate most directly. The economics are favorable. The trajectory is in your favor. The defensibility is structural. And the competitive landscape is wide open because most providers have not yet realized that managing AI on a laptop is fundamentally the same business as managing everything else on that laptop.

Key takeaway: Local deployment is the most naturally defensible AI business model for IT services providers. Zero data exfiltration, zero compute cost, strong margins at scale, and a trajectory that turns today’s adequate local models into tomorrow’s good-enough-for-everything models. Start building this capability now. The providers who establish local deployment expertise in 2026 will own the market when on-device AI becomes the default enterprise deployment model in 2028 and beyond.

Chapter 8 examines the third independent business model: providing the testing, security, and agentic infrastructure that every organization deploying AI — whether in the cloud or locally — needs but almost none have built.

Chapter 8

Chapter 8: Business Model: Testing, Security, and Agentic Infrastructure

The previous two chapters examined business models built around a single core capability — the privacy proxy in Chapter 6, local model deployment in Chapter 7. This chapter covers three related opportunities that share an important characteristic: they are emerging as distinct service categories that enterprises will need but cannot easily build internally.

Testing and validation, GenAI-specific security, and agentic infrastructure each represent high-margin work for providers who invest in the right skills early. They span a wide range of difficulty, time to revenue, and alignment with existing IT services capabilities. The common thread is that none are about running models. They are about everything that surrounds the model — the evaluation, the protection, and the plumbing that makes AI systems work in production.

Testing and Validation

The EU AI Act entered into force in August 2024, with obligations phasing in through 2027. For IT services providers, the most commercially significant requirement is conformity assessment for high-risk AI systems — and the ongoing monitoring that follows.

What the Regulation Requires

High-risk AI systems must undergo conformity assessments before deployment, evaluating accuracy, robustness, cybersecurity, transparency, human oversight, and non-discrimination. These are not one-time events. The Act requires continuous monitoring: detecting drift, identifying emerging biases, and documenting everything.

This is not a checkbox exercise. Testing for distributional bias across protected characteristics, evaluating robustness against adversarial inputs, and measuring calibration under real-world conditions goes well beyond what most enterprises can staff internally.

The Economics

Service	Price Range	Frequency
Initial conformity assessment	$20,000 - $50,000 per system	One-time (per deployment)
Ongoing monitoring and evaluation	$3,000 - $10,000/month per system	Continuous

A provider with 20 clients, each running two to three high-risk AI systems, generates $120,000 to $300,000 in initial assessments and $1.4 to $7.2 million annually in ongoing monitoring. The margins are high — the primary cost is skilled labour, not infrastructure.

The Talent Problem

This is not traditional IT operations work. Proper AI testing requires people who understand ML evaluation methodology — not just “run the test suite” but the statistical reasoning behind why certain approaches are valid and others are misleading. You need people who can design adversarial test cases, determine whether a 2% accuracy difference across demographic groups is noise or systematic bias, and navigate the fact that different fairness definitions are often mutually exclusive.

This is research and engineering talent. These people are scarce, expensive, and currently employed at AI labs, universities, and the handful of companies with dedicated AI safety teams. Recruiting them requires a compelling pitch — interesting work, competitive compensation, and genuine commitment to building a practice.

Key economics: Testing and validation offers the highest margins and the highest barriers of the three opportunities in this chapter. Budget 12-18 months from initial investment to meaningful revenue. The regulatory demand is certain — the EU AI Act is law. The question is whether you can build the team fast enough to capture it.

GenAI-Specific Security

If you run a security practice today, you have a head start — but a smaller one than you might think. GenAI security is a distinct discipline. The threat surface is different, the attack vectors are novel, and the defensive tools are still maturing.

The Threat Landscape

GenAI systems are vulnerable to traditional security threats plus a set with no direct analogue in conventional IT:

Prompt injection — an attacker crafts input that causes the model to ignore its instructions and follow the attacker’s instead. This is a structural property of how language models process instructions and data in the same channel. Defences exist but none are complete.

Data exfiltration through outputs — a model inadvertently reveals training data, RAG context, or system prompts in its responses. Carefully constructed queries can coax a model into surfacing internal documents, confidential instructions, or patterns from fine-tuning data.

Model poisoning via fine-tuning — fine-tuning data introduces malicious behaviour. A poisoned model might behave normally on most inputs but produce subtly wrong outputs in attacker-chosen scenarios. Detection requires evaluation beyond standard accuracy metrics.

Supply chain risks with open-source weights — the AI equivalent of dependency vulnerabilities. When you download a model from Hugging Face, you trust that weights have not been tampered with and training data was not poisoned. The open-source AI ecosystem lacks the maturity of tools like npm audit or Dependabot.

Jailbreaking — bypassing safety guardrails to produce harmful or policy-violating outputs. New techniques emerge faster than providers can patch them.

Security Audits and Red-Teaming

A security audit for a GenAI deployment covers prompt injection resistance, data leakage testing, access control, logging adequacy, and policy alignment. A mid-complexity audit runs $15,000 to $40,000.

Red-teaming is more intensive — a dedicated team spends days or weeks attempting to break the system through adversarial inputs, extraction attacks, and exploitation of tool-use capabilities. Red-teaming engagements run $30,000 to $80,000 and are increasingly expected in regulated industries.

The Skills Transfer

Your existing security engineers understand threat modelling, attack surfaces, and defence in depth. That foundation transfers. What does not transfer is the specific knowledge of how language models fail — the mechanics of prompt injection, statistical methods for detecting data leakage, evaluation frameworks for model robustness.

Expect a six to twelve month ramp-up. This is faster than building testing from scratch because the security mindset — think like an attacker, assume vulnerability, verify rather than trust — is already present. The domain-specific knowledge layers on top.

Key takeaway: GenAI security is a natural extension of existing security practices, but requires significant new technical knowledge. The threat surface is genuinely different, and the tools are still immature. Providers who invest in upskilling their security teams now will own a market that every enterprise deploying AI will need to buy from.

Agentic Infrastructure as a Service

Of the three opportunities, agentic infrastructure is the most accessible for IT services providers — and most likely to generate revenue within the first year.

The Shift to Agentic AI

Enterprise GenAI is moving beyond simple prompt-response interactions toward tool use, multi-step workflows, autonomous agents, and orchestration layers. The model itself — Claude, GPT-4o, Gemini, or open-source — is just one component. Increasingly, it is the least differentiated component.

The value is in everything around the model:

MCP servers (Model Context Protocol) provide standardized interfaces between AI models and external data sources, tools, and services. Building and maintaining MCP servers for enterprise environments — connecting models to databases, document repositories, ticketing systems, CRM platforms — is integration work that IT services providers have done for decades in a new protocol.

RAG pipelines require document ingestion, chunking strategies, embedding model selection, vector database management, and continuous evaluation of retrieval quality. A RAG pipeline that works in a demo and one that works reliably in production with millions of documents are entirely different engineering challenges.

Tool integrations give agents the ability to act — creating tickets, querying databases, updating records, triggering workflows. Each integration requires authentication, error management, rate limiting, audit logging, and guardrails preventing unauthorized actions.

Workflow orchestration coordinates multiple agents, tools, and data sources into multi-step processes. A customer service agent that looks up an order, checks inventory, initiates a return, updates the CRM, and sends a confirmation is a choreographed sequence of tool uses, conditional logic, and human-in-the-loop checkpoints.

Guardrails — input and output filtering, content policies, usage limits, and safety boundaries — are table-stakes infrastructure every enterprise deployment needs.

Why This Maps to Existing Skills

MCP servers are API integrations. RAG pipelines are data engineering. Tool integrations are systems integration. Workflow orchestration is business process automation. Guardrails are policy enforcement. The specific technologies are new — embedding models and vector databases instead of ETL and relational databases — but the underlying disciplines are the same. Your team that connects Salesforce to SAP can learn to connect Claude to your client’s document repository.

The model is just an API call. Your value is the plumbing.

The Economics

Phase	Revenue	Duration
Discovery and architecture	$15,000 - $40,000	2-4 weeks
Build and deploy	$50,000 - $200,000	2-4 months
Ongoing management and optimization	$5,000 - $20,000/month	Continuous

The ongoing management is where recurring revenue lives. RAG pipelines need tuning as document corpora change. Tool integrations need maintenance as APIs evolve. Guardrails need updating as new threats emerge. This is managed services work at higher margins than traditional infrastructure monitoring because the skills are more specialized.

Key takeaway: Agentic infrastructure is the most natural transition for IT services providers. The model is just an API call — your value is in the MCP servers, RAG pipelines, tool integrations, workflow orchestration, and guardrails that make it useful. This maps directly to integration and automation skills you already have.

Multi-Model Management

No serious enterprise will run a single model for all use cases. The emerging pattern:

Frontier API models (Claude, GPT-4o, Gemini) for complex reasoning and customer-facing interactions
Smaller API models (Claude Haiku, GPT-4o mini) for high-volume, lower-complexity tasks
Locally deployed open-source models for sensitive data that cannot leave the organization
Fine-tuned models for domain-specific tasks — medical terminology, legal analysis, industry classification

Each model has different capabilities, costs, latency, data handling guarantees, and update cycles. Managing this complexity across dozens of AI-powered applications is a significant operational challenge.

The Multi-Cloud Analogy

This is structurally similar to multi-cloud management — a market that has sustained viable businesses for over a decade. The services include:

Model lifecycle management — tracking deployments, managing version updates, coordinating migrations when providers deprecate models.

Evaluation and testing — when a provider releases a new model version, someone must evaluate whether it performs better or worse on the organization’s specific use cases. This requires systematic A/B testing, regression evaluation, and benchmarking against actual workloads.

Cost optimization — routing requests to the most cost-effective model that meets quality requirements. Intelligent routing can reduce costs by 30-50% without degrading quality.

Unified observability — consistent logging, monitoring, and alerting across all models. When a model produces degraded outputs, you need to detect it regardless of provider.

Multi-model management is the connective tissue tying together agentic infrastructure, testing, and security. Like multi-cloud management, this is a complexity tax that enterprises will pay someone to manage.

The Talent Reality

These opportunities span a wide range of talent requirements, and honesty is critical for planning.

Testing and validation requires the scarcest talent: ML evaluation expertise, statistical rigour, and research mindset. Hard to find, expensive to hire, longest to become productive. Budget 12-18 months.

GenAI security requires traditional security expertise plus AI-specific knowledge. The security mindset transfers; the AI knowledge is genuinely new. Budget 6-12 months to upskill.

Agentic infrastructure is most accessible. API integration, data engineering, workflow automation — these skills already exist. New knowledge (MCP, embeddings, vector databases, prompt engineering) can be learned in three to six months. Budget 3-6 months to first engagements.

Multi-model management builds on all three and emerges naturally with experience.

Opportunity	Key Talent	Ramp Time	Margin	Barrier
Testing & validation	ML evaluation, statistics, adversarial testing	12-18 months	Highest	Highest
GenAI security	Security + AI-specific knowledge	6-12 months	High	High
Agentic infrastructure	Integration, data engineering, automation	3-6 months	Moderate-high	Moderate
Multi-model management	Operational + AI breadth	Builds over time	High	Moderate-high

The Recommendation

Start with agentic infrastructure. Not because it is the most valuable — testing and validation commands higher margins long-term — but because it is the natural transition from what you already do. Your integration engineers can start building MCP servers and RAG pipelines within months. Revenue comes faster because the skills gap is smallest.

Use agentic infrastructure engagements to build credibility and client relationships. As you deploy and maintain AI systems, you will naturally encounter testing, security, and multi-model management challenges — each an opportunity to expand.

In parallel, invest in testing and security:

Months 1-6: Deliver first agentic infrastructure engagements. Designate one to two people to build testing and security expertise. Send security staff to AI-specific training.
Months 6-12: Offer basic AI security audits alongside infrastructure deployments. Pilot testing services with existing clients.
Months 12-18: Launch formal testing and validation practice. Offer EU AI Act conformity assessments. Position multi-model management as a managed services extension.
Months 18-24: All four capabilities operate as an integrated practice. Agentic infrastructure generates client relationships. Testing and security generate the highest margins. Multi-model management generates the stickiest recurring revenue.

What to take from this chapter: Three related opportunities — testing, security, and agentic infrastructure — plus multi-model management offer high-margin services that enterprises cannot easily build in-house. Start with agentic infrastructure because it maps closest to existing skills and generates revenue fastest. Invest in testing and security in parallel — regulatory demand is certain and margins are highest, but talent requirements are steeper and time to revenue is longer. The model is just an API call. Everything around it is your business.

Chapter 9 examines how the same AI you are learning to sell to clients is simultaneously transforming how you deliver your existing services — and why that internal disruption may be the most important strategic challenge you face.

Chapter 9

Chapter 9: When AI Transforms Your Own Delivery

Chapters 5 through 8 examined how to sell AI services to clients — privacy proxies, local deployments, testing, security, agentic infrastructure, and the shifting dynamics of lock-in. All of that matters. But there is a conversation most IT services providers are not having, and it is the one that will determine whether they are still competitive in three years.

AI is not just something you sell. It is something that is transforming how you deliver the services you already offer.

If you run a managed services practice — a service desk, a NOC, a SOC, a monitoring operation — AI is coming for your delivery model whether you plan for it or not. The providers who recognize this and act first will expand their margins and scale their businesses. The ones who ignore it will find themselves undercut by competitors who automated what they still do manually.

This chapter is about the internal disruption nobody wants to talk about. It may be uncomfortable. It should be.

The Service Desk Is Already Changing

The most immediate impact is at the service desk — the L1 support function that forms the foundation of most managed services practices. The numbers are no longer speculative. They are operational reality at scale.

AI agents now deflect over 45% of incoming B2B customer queries, with sectors like retail and travel exceeding 50%. Well-designed AI systems consistently achieve 40-60% deflection rates, and the upper end of the market is pushing further: up to 80% of routine inquiries handled automatically, with no human involvement.

These are not lab results. These are production deployments at major enterprises:

Company / Platform	Metric	Result
Moveworks at Broadcom	Autonomous resolution rate	88%
Moveworks at Equinix	Ticket deflection	68%
Moveworks at Equinix	Autonomous resolution	43%
Aisera customers	Ticket deflection	75%
Aisera customers	Support staffing cost savings	35%
Unity	Tickets deflected	8,000 tickets, saving $1.3 million
NIB Health Insurance	Cost reduction	60%, saving $22 million

The impact extends beyond deflection. AI-assisted agents — humans working alongside AI tools — resolve issues 47% faster with 25% higher first-contact resolution rates. This means even the tickets that do reach a human are handled more efficiently.

Sit with those numbers for a moment. If you run a 20-person service desk and AI can deflect 50% of incoming tickets while making the remaining agents 47% faster — you are looking at a fundamentally different staffing model.

The uncomfortable math: A 50% ticket deflection rate plus a 47% improvement in agent efficiency means your service desk could handle roughly three times its current volume with the same headcount. That is either a massive threat or a massive opportunity, depending on how quickly you move.

NOC and SOC: Burnout Meets Automation

If the service desk transformation is about efficiency, the NOC/SOC transformation is about survival. The staffing crisis in security operations is not a future risk — it is a present emergency.

71% of SOC analysts report burnout. 64% are considering leaving within a year. 67% of organizations report being short-staffed in their security operations. These are not numbers from a pessimistic outlier survey — they represent the structural reality of an industry that generates more alerts than humans can process.

AI is filling the gap, and it is filling it fast:

SOC/NOC Function	AI Adoption	Impact
Alert triage and prioritization	73% have automated	67% say biggest immediate AI impact
Alert enrichment	68% have automated	Reduces manual research per alert
Investigation time reduction	60%+ of AI adopters	At least 25% reduction, with 21% achieving >50%
Phishing response	AI-assisted	From 1 hour to 10 minutes

The phishing response metric deserves emphasis. Reducing response time from one hour to ten minutes is not an incremental improvement — it is a category change. In the time a human analyst would investigate one phishing incident, an AI-assisted workflow handles six.

For managed security services providers, this changes the economics of every SOC contract. If your SOC analysts can handle three to five times the alert volume with AI assistance, you can either serve more clients with the same team or deliver dramatically better service at the same price point. Either way, the provider still running a purely manual SOC is at a structural disadvantage.

Self-Healing Infrastructure: The End of Routine Alerts

Beyond the service desk and SOC, AI is transforming infrastructure monitoring itself. Self-healing systems — automated workflows that detect, diagnose, and resolve common infrastructure issues without human intervention — are moving from niche automation to standard practice.

ConnectWise Automate’s self-healing workflows already handle 30-40% of routine alerts without human intervention. By 2026, over 60% of large enterprises are moving toward self-healing systems powered by AIOps. The AIOps market is projected to exceed $40 billion by 2026.

What does this mean for a managed services provider? It means a significant portion of the routine monitoring and remediation work that justifies your monthly retainer is being automated away. Server ran out of disk space? Self-healing clears the logs. Service crashed? Self-healing restarts it. Certificate expiring? Self-healing renews it. These are the bread-and-butter tickets that keep NOC teams busy — and they are disappearing.

Key takeaway: Self-healing infrastructure does not eliminate the need for managed services. It eliminates the need for the type of managed services most providers currently deliver. The value shifts from “we watch your screens and fix routine problems” to “we architect, deploy, and optimize the AI systems that watch your screens and fix routine problems.”

The Revenue Model Threat

Here is where the discomfort becomes financial.

Traditional MSP pricing is built on inputs: per-user fees, per-ticket charges, per-incident rates. These models assume a relatively stable relationship between the number of users or systems and the amount of work required to support them.

AI breaks that assumption.

If you charge per ticket and AI resolves 50% of tickets, you just lost 50% of that revenue stream. The work disappeared, and so did the revenue. Per-user rates are predicted to drop 25% in the next two years due to automation — not because clients are being unreasonable, but because the cost of delivering the service is genuinely falling, and clients know it.

The market is already responding. MSP M&A activity increased 50% in 2024, as providers who cannot achieve automation efficiency become acquisition targets for those who can. The managed security services market faces predicted consolidation from roughly 200 top MSSPs to approximately 120 by 2028. That is a 40% reduction in the number of independent providers.

This consolidation is not random. It follows a clear pattern: providers with advanced automation acquire those without, absorb their client bases, and serve the combined portfolio at lower cost. If you are the provider being acquired, you are getting a fraction of the value you built. If you are the one acquiring, you are buying revenue at a discount because you know you can deliver the same service with fewer people.

Threat	Timeline	Impact
Per-user rate compression	Next 24 months	25% predicted decline
Ticket volume decline from AI deflection	Happening now	40-60% of routine tickets
MSP market consolidation	Through 2028	~200 top MSSPs to ~120
M&A acceleration	2024 onward	50% increase in MSP deals

The revenue threat in one sentence: If your pricing model charges for inputs (tickets, hours, incidents) and AI reduces the inputs, your revenue shrinks while your fixed costs remain — unless you change the model first.

The Opportunity Flip: Why This Is Actually Good News

Now for the part that makes this chapter worth reading rather than merely frightening.

AI is a margin expander, not just a headcount reducer — if you manage the transition deliberately. The data from providers who have already adopted AI internally is striking:

66% of MSPs cite automation as a way to scale without adding staff
76% noted increased efficiency; 40% citing lower labour costs
78% of professional services clients saw increased billable hours (because AI handles the non-billable administrative work)
MSPs report operational cost reductions of 30-50%

The math is straightforward: if AI cuts your delivery cost by 40% but you only reduce prices by 15%, your margin grows. You are more profitable per client while simultaneously being more competitive on price. This is the rare scenario where you can improve margins and market position simultaneously.

Consider a concrete example. You run a managed services desk with 10 analysts, each costing you EUR 45,000 fully loaded, supporting 50 clients at EUR 3,000 per month each.

Metric	Before AI	After AI
Analysts	10	6 (4 redeployed to higher-value roles)
Clients supported	50	75 (same quality, 50% more capacity)
Monthly revenue	EUR 150,000	EUR 210,000 (75 clients at EUR 2,800 — a 7% price cut)
Monthly staff cost	EUR 37,500	EUR 22,500 + EUR 5,000 AI tooling
Monthly margin	EUR 112,500 (75%)	EUR 182,500 (87%)

You cut prices, grew revenue by 40%, improved your margin by 12 percentage points, and redeployed four people to higher-value work. The clients are happy because they pay less. Your team is happy because the redeployed analysts do more interesting work. Your business is stronger on every metric.

That is the opportunity — but only if you move before the market forces your hand.

Pricing Model Evolution: From Inputs to Outcomes

The transition from input-based to outcome-based pricing is not optional. It is the natural consequence of automation making inputs irrelevant as a measure of value.

The pioneers are already demonstrating what this looks like. Intercom’s Fin AI agent charges $0.99 per AI resolution — not per seat, not per agent hour, but per resolved conversation. This aligns the provider’s revenue with the client’s outcome. More resolutions means more revenue for the provider and more value for the client.

For managed services providers, the evolution follows a clear path:

From: Per-ticket, per-user, per-hour pricing that penalizes efficiency.

To: Outcome-based pricing that rewards it.

The practical structures include:

Blended base fees with AI-linked outcome metrics: A base retainer covering the service, plus bonus components tied to automation rates, mean time to resolution (MTTR), and SLA performance
Pricing corridors that flex as AI handles more work: Monthly fees that adjust within defined bands as the automation rate increases — the client pays less per ticket, but you handle more tickets profitably
Outcome guarantees: Sell the result, not the activity. 99.9% uptime. Less than 15-minute MTTR. 95% first-contact resolution rate. These commitments are what the client actually cares about — and with AI, they are commitments you can actually keep

The pricing insight: Sell outcomes — uptime, resolution speed, first-contact resolution rates — not inputs like hours, tickets, or seats. When AI makes your inputs cheap, input-based pricing is a race to the bottom. Outcome-based pricing lets you capture the value of what you deliver, not the cost of how you deliver it.

The Internal Transformation Playbook

Knowing the landscape is not enough. Here is what to actually do, in order:

1. Adopt AI in your own operations first. Eat your own cooking. Deploy AI triage on your own service desk before selling it to clients. Implement AI-assisted alert enrichment in your own SOC before proposing it to prospects. If you have not transformed your own delivery, you have no credibility telling clients to transform theirs.

2. Measure everything. Ticket deflection rates. MTTR improvements. Cost per resolution. Analyst utilization before and after AI. Automation rates by ticket category. These numbers are not just operational metrics — they are your future sales collateral.

3. Use the data to build your external pitch. “We reduced our own resolution time by 47% and our cost per ticket by 35% — here is how we will do the same for you.” This is infinitely more compelling than a vendor slide deck. It is proof, not a promise.

4. Retrain displaced L1 staff for higher-value work. AI oversight, complex escalation handling, client advisory, AI system tuning, prompt engineering for operational workflows. The people who understood your service desk best are the ones who can manage the AI that replaces the routine parts of it. Losing them is a waste of institutional knowledge.

5. Redesign pricing models before clients ask you to. If you wait until a client says “Why am I paying for tickets that AI resolves?” you are negotiating from weakness. If you proactively propose an outcome-based model that saves the client money while protecting your margin, you are negotiating from strength.

The Strategic Imperative

Let us be direct about the stakes.

If you do not adopt AI internally, a competitor will — and they will undercut you on price while delivering better service. This is not a hypothetical. It is the consolidation pattern already visible in MSP M&A data.

The providers who transform their own delivery first will have the most credible pitch to clients. They will have the metrics, the case studies, and the operational maturity that no amount of marketing can substitute for. They will also have the margin structure to invest in growth while competitors are still trying to cover their costs.

The broader trajectory is unmistakable. IDC projects that by 2030, 0% of IT work will be done by humans without AI assistance, 75% will be done by humans augmented with AI, and 25% will be done by AI alone. Gartner forecasts that 40% of enterprise applications will include task-specific AI agents by end of 2026 — not 2030, next year.

This is not a question of whether AI will transform your delivery model. It is a question of whether you will lead the transformation or be caught by it.

What to take from this chapter: The same AI you are learning to sell to clients is simultaneously transforming how you deliver your existing services. The providers who adopt it internally first — measuring the impact, retraining their teams, and redesigning their pricing — will expand their margins, scale their capacity, and build the most credible sales pitch in the market. The providers who wait will find themselves on the wrong side of a consolidation wave that is already underway. This is not a future problem. The numbers are already real, the tools are already available, and your competitors are already moving.

Next: Chapter 10 — EU AI Act: Your Compliance Opportunity

Chapter 10

Chapter 10: The Lock-In Power Shift

Every IT services provider understands lock-in. You may not call it that in client meetings — you say “deep partnership” or “institutional knowledge” — but the mechanism is the same. The more entangled your systems become with the client’s operations, the harder it is for them to leave. The harder it is for them to leave, the more predictable your revenue.

This chapter is about what happens when that lock-in migrates away from you and toward the model vendors. It is a shift many providers have not fully reckoned with, because the daily work feels familiar even as the power dynamics underneath are quietly rearranging.

The Old World: Lock-In Favored the IT Provider

Let us be honest about how the traditional IT services model actually worked.

A mid-sized European company hires you to manage their infrastructure. Over the first year, you learn their environment — the legacy ERP system that requires a specific JDK version, the VPN configuration that was set up by a contractor who left in 2019, the backup schedule that accounts for a batch job running every Thursday at 3 AM. You document some of this. Much of it lives in the heads of your operations team.

By year two, the client is deeply dependent on you. Not because your technology is superior, but because the switching cost is enormous. A competing provider would need months to understand the environment. The migration risk is real. Your contract renewal conversations are comfortable, because both sides know the alternative is painful.

This was the business moat. The more complex the client’s environment, the stickier the relationship. Providers who accumulated institutional knowledge about their clients’ systems built durable businesses with high retention rates and healthy margins. The lock-in was not malicious — it was a natural consequence of infrastructure complexity. The client was not locked in because you tricked them. They were locked in because the work was hard, and you were the ones who knew how to do it.

The old lock-in formula: Complexity of the client’s environment multiplied by your accumulated knowledge of it equaled switching cost. High switching cost equaled sticky revenue. This was not a side effect of the business model — it was the business model.

The New World: Lock-In Runs Toward the Model Vendors

Now consider what happens when a client’s most important technology interaction is not a managed server fleet but an API call to a large language model.

A European logistics company wants GenAI for customer support automation and document processing. A developer writes integration code using OpenAI’s function calling format, defines tool schemas in OpenAI’s JSON structure, and builds workflows around the Assistants API. The prompts are engineered for GPT-4o’s strengths. The evaluation metrics are calibrated against GPT-4o’s output patterns.

Lock-in has migrated. The logistics company is now deeply dependent on OpenAI — their tool definitions, their API schema, their model’s behavioral quirks. But the IT services provider who helped set this up? Far more replaceable. Another provider could read the API documentation and take over in weeks. The complexity that created switching cost has been abstracted away by the model vendor.

The platform captured the lock-in that used to belong to the middleman.

This shift is structural. In traditional IT, the abstraction layer sat low — close to the hardware. Managing it required deep operational expertise, which is exactly what IT services providers sold. With GenAI APIs, the abstraction is much higher. A competent developer can integrate the OpenAI API in an afternoon. The “we will manage it for you” pitch loses force when the thing being managed is a REST API call, not a multi-server deployment with failover and disaster recovery.

Where Lock-In Now Lives

To understand the new competitive landscape, map where lock-in has migrated. It now lives in the model-specific choices that accumulate as organizations build AI-powered workflows.

Model-specific prompt engineering. Prompts that work well with Claude do not necessarily work well with GPT-4o. Organizations that invest months refining system prompts, few-shot examples, and chain-of-thought templates for a specific model have created assets that are partially non-portable.

Tool and function calling formats. This is one of the most concrete lock-in vectors. OpenAI’s function calling schema differs from Anthropic’s tool use format, which differs from Google’s function declarations. If a client has built 50 tool definitions in OpenAI’s format, migrating to Claude requires more than a format conversion — it requires retesting every tool interaction because different models interpret tool descriptions differently.

Context window strategies. An application designed around Claude’s 200K context window works differently from one designed around GPT-4o’s 128K window. Switching to a model with a smaller effective window means rearchitecting the pipeline — chunking strategies, retrieval augmentation, summarization layers.

Evaluation pipelines. Perhaps the most subtle form of lock-in. Organizations that build serious AI applications develop test suites and quality benchmarks calibrated to a specific model’s output patterns. Switching models means recalibrating what “good” looks like — expensive, time-consuming, and a source of real organizational resistance.

Fine-tuning investments. If a client has invested in fine-tuning a model — curating training data, running training jobs, evaluating iterations — that investment is entirely locked to the provider’s platform. A fine-tuned GPT-4o cannot be ported to Claude. The training data might be portable, but the training investment is not.

Where lock-in lives now: It is in prompt libraries, tool schemas, context window architectures, evaluation pipelines, and fine-tuning investments. The more you build around one model, the harder it is to switch. And none of these lock-in vectors benefit the IT services provider in the middle — they all benefit the model vendor.

Defensive Strategies for IT Services Providers

Understanding the power shift is step one. Step two is building a defensible position despite it. The new lock-in landscape creates specific opportunities for providers who think architecturally rather than operationally.

Abstract the model layer. This is the single most important architectural decision you can make on behalf of your clients. Build an abstraction layer between the client’s application logic and the model provider’s API. Tool definitions stored in a provider-neutral format, translated at the integration layer. Prompts templated with model-specific variants. No hardcoded references to api.openai.com. When you control the abstraction layer, you control the switching capability — and switching capability is negotiating power.

Own the integration layer. Your value is not in the model. It is in connecting the model to the client’s business systems — their ERP, CRM, document management, compliance workflows. This integration work is genuinely complex, deeply client-specific, and hard for a competitor to replicate quickly. It creates healthy lock-in that favors you, not the model vendor.

Build switching capability as a service. If you can swap Claude for GPT-4o for Gemini within days rather than months, you have something valuable. The client gets resilience against price increases, model deprecation, or quality regressions. You get a defensible position as the provider who ensures vendor independence. The old value proposition reframed: complexity management, but now vendor abstraction rather than server management.

Own the data layer. RAG pipelines, knowledge bases, vector databases, fine-tuning datasets — these make AI work in a specific business context. Design them model-agnostic and they become portable assets that you manage. The client depends on your knowledge of their data architecture and embedding strategies. This is the new institutional knowledge — the GenAI equivalent of knowing where the legacy VPN config lives.

Build evaluation frameworks. If you can objectively compare model performance for a client’s specific use case — measuring quality, latency, and cost across providers — you become the trusted advisor. This is a defensible, recurring advisory relationship that no model vendor can replicate, because the model vendor has an inherent conflict of interest in the comparison.

The Uncomfortable Truth

Let us not sugarcoat this. Defensive strategies are real and valuable, but they do not change the fundamental math for every provider.

Some IT services providers will have fewer clients who need them, regardless of how well they execute. When a solo developer can integrate an LLM API in an afternoon, the pool of clients who need a managed services provider shrinks. Not to zero, but meaningfully.

The “we will manage it for you” pitch requires redefining what “it” means. If “it” is API integration and prompt engineering, the pitch is weak. If “it” is multi-model orchestration, compliance architecture, evaluation frameworks, and ongoing optimization across a portfolio of AI applications — that is a different proposition entirely. But it demands capabilities most traditional IT services providers do not currently have.

The value must shift from “we run your stuff” to “we make AI work in your specific context.” That implies understanding the client’s business domain, not just their infrastructure. It implies advisory capability, not just operational capability. And it implies willingness to be measured on outcomes rather than uptime.

The uncomfortable truth: The structural simplification of AI infrastructure means some clients will not need an IT services provider at all. The providers who thrive will be the ones whose value clearly exceeds what the client can do with an API key and an afternoon of reading documentation.

The Opportunity in Abstraction

There is a genuine, defensible business in being the layer between organizations and model vendors — but it looks nothing like traditional infrastructure management.

Multi-model orchestration as a service. Route different request types to different models based on complexity, cost, and quality requirements. A customer support chatbot handling routine questions uses a fast, cheap model. The same system escalating to complex reasoning dynamically routes to a more capable model. Building and operating this routing layer — with quality monitoring, cost tracking, and continuous optimization — is real, recurring work.

Vendor management and cost optimization. When a client uses three model providers across a dozen applications, someone needs to track spend, negotiate enterprise agreements, monitor rate limits, and detect when a provider’s price change or model update breaks a workflow. This is procurement and operations expertise applied to a new domain.

Model evaluation and selection. The model landscape changes quarterly. A provider who maintains current benchmarks across models for common enterprise use cases — and can advise clients on when to switch, when to stay, and when to hedge — provides ongoing strategic value.

These services share a characteristic: they are more valuable the more model providers exist and the faster the market moves. In a world with five or six competitive providers releasing new models every quarter — which is the world we are in — the complexity of navigating the landscape is itself a source of value.

The opportunity: The same market fragmentation that threatens your old business model creates demand for your new one. Multi-model complexity is the new infrastructure complexity. If you can manage it, you have a business.

From Passive to Active Indispensability

In the old model, you were indispensable because leaving you was painful — passive lock-in. In the new model, you must be indispensable because the alternative is worse — the client managing multi-model orchestration, evaluation, compliance, and optimization on their own. This is active indispensability: the client stays because your contribution is visible and measurable, not because switching is hard.

Active indispensability is harder to build but more durable. It depends on expertise the client can see and value, not on information asymmetry. The providers who internalize this earliest will have a significant head start — not because the work is impossibly complex, but because the transformation from infrastructure operators to intelligence advisors takes time. The window is open now.

Next: Chapter 9 — EU AI Act: Your Compliance Opportunity

Chapter 11

Chapter 11: EU AI Act: Your Compliance Opportunity

Every technology regulation creates two groups: those who see it as a cost to be minimized, and those who see it as a service to be sold. The EU AI Act is the most significant piece of AI regulation in the world, and for EU-based IT services providers, it is firmly in the second category.

This chapter is not a legal primer. You can hire a lawyer for that. What it is: a practical guide to the specific implementation work the Act demands, why that work is technical rather than legal, and how to position your firm as the one who does it.

Just Enough Context: What the Act Actually Requires

The EU AI Act entered into force on 1 August 2024. It follows a phased enforcement timeline:

February 2025: Prohibitions on unacceptable-risk AI systems take effect (social scoring, real-time biometric identification in public spaces with narrow exceptions, manipulation of vulnerable groups).
August 2025: Obligations for general-purpose AI (GPAI) model providers kick in — transparency, documentation, copyright compliance, and for models with systemic risk, additional safety evaluations.
August 2026: The big one. Obligations for high-risk AI systems become enforceable. This is where most of the implementation work lives and where most of your clients’ obligations begin.
August 2027: Extended deadline for high-risk AI systems that are safety components of products already regulated under existing EU sectoral legislation (medical devices, machinery, aviation, vehicles).

The Act classifies AI systems into four risk tiers:

Unacceptable risk — banned outright. Social scoring, manipulative AI targeting vulnerable groups, untargeted facial recognition databases, emotion recognition in workplaces and schools. Your clients should not be building these. If they are, the conversation is with legal, not with you.

High risk — heavily regulated. This is where the money is. High-risk systems include AI used in recruitment and hiring, credit scoring and financial assessments, law enforcement and border control, critical infrastructure management (energy, water, transport), education (exam scoring, student assessment), access to essential services, and migration/asylum processing. Any AI system used as a safety component in products covered by existing EU product safety legislation also falls here.

Limited risk — transparency obligations. Chatbots must disclose they are AI. Deepfakes must be labelled. Emotion recognition systems must inform users. These are lighter requirements but still need implementation.

Minimal risk — no specific obligations. Spam filters, AI in video games, most internal productivity tools. The vast majority of AI systems fall here.

Providers vs Deployers: The Distinction That Matters

The Act draws a critical line between providers (those who develop or place AI systems on the market) and deployers (those who use AI systems in a professional capacity). Most of your enterprise clients will be deployers. Some may also become providers if they fine-tune or substantially modify an AI system.

Deployers of high-risk AI systems must:

Use the system in accordance with the provider’s instructions
Ensure human oversight by qualified personnel
Monitor the system’s operation and report serious incidents
Conduct a fundamental rights impact assessment (for certain categories)
Maintain logs generated by the system for at least six months
Inform employees and their representatives that they are subject to AI systems
Ensure input data is relevant and representative

These are not abstract policy requirements. They are operational obligations that need technical systems, processes, and infrastructure to fulfil. Someone has to build those systems. That someone should be you.

Key takeaway: Most of your enterprise clients will be “deployers” under the EU AI Act. By August 2026, they need functioning human oversight, monitoring, logging, incident reporting, and impact assessment processes — not just policies on paper, but working technical implementations.

Why This Is Implementation Work, Not Legal Work

Here is the crucial insight that many providers miss: EU AI Act compliance is roughly 20% legal interpretation and 80% technical implementation. Lawyers will tell your clients what they need to do. You will build the systems that actually do it.

Consider what a deployer of a high-risk AI system — say, a bank using AI for credit scoring — actually needs:

A risk management system. Not a document titled “Risk Management Policy.” An actual system that continuously identifies, evaluates, and mitigates risks throughout the AI system’s lifecycle. This means monitoring pipelines, alerting infrastructure, risk scoring dashboards, and integration with the bank’s existing risk management frameworks. This is engineering work.

Data governance. The training and validation data used for any fine-tuning or customization needs documentation: provenance, preprocessing steps, bias analysis, representativeness assessment. If your client is fine-tuning models on their own data, they need a data governance pipeline — versioning, quality checks, bias testing, lineage tracking. This is data engineering work.

Technical documentation and a technical file. High-risk systems require detailed documentation of the system’s purpose, design, development process, testing methodology, and performance metrics. For a deployer who has customized or integrated an AI system, this means documenting the full integration architecture, the prompt engineering decisions, the evaluation results, the failure modes. This is technical writing backed by engineering analysis.

Human oversight mechanisms. The Act requires that high-risk AI systems can be effectively overseen by natural persons. In practice, this means building interfaces and workflows where human reviewers can inspect AI decisions, override them when necessary, and intervene in real-time for certain use cases. This is UX design and systems integration work.

Logging and monitoring infrastructure. Systems must generate logs that allow traceability throughout the AI system’s lifecycle. For production AI systems, this means structured logging of inputs, outputs, model versions, confidence scores, and human override decisions — stored securely, retained for the required period, and accessible for audit. This is infrastructure engineering.

Incident reporting. Serious incidents must be reported to market surveillance authorities. This requires detection mechanisms (how do you know something went wrong?), classification logic (is it serious?), and reporting workflows integrated with the client’s existing incident management. This is DevOps and process engineering.

A law firm cannot build any of this. A management consultancy can write the policies but cannot implement the systems. The work sits squarely in the domain of technical services providers who understand both AI systems and enterprise infrastructure — which is exactly what you are.

Key takeaway: Compliance with the EU AI Act is primarily a technical implementation challenge, not a legal one. Lawyers define obligations; you build the systems that meet them. Risk management pipelines, logging infrastructure, human oversight interfaces, bias testing frameworks, incident detection — this is your domain.

Eight Service Lines You Can Build Today

Here are the specific service opportunities, ordered roughly by how quickly you can bring them to market:

1. AI System Inventory and Classification

Before a client can comply, they need to know what AI systems they are actually using. Many enterprises have no comprehensive inventory. Shadow AI — departments buying API access or using AI tools without IT’s knowledge — is rampant. The first engagement is often a discovery exercise: what AI systems exist, who uses them, for what purpose, and which risk category they fall under.

This is low-cost, high-value work. It requires no deep AI expertise — just systematic assessment skills and knowledge of the Act’s classification criteria. And it is a natural entry point for every other service on this list.

2. Risk Management System Implementation

Building the continuous risk assessment infrastructure for high-risk AI deployments. This includes defining risk metrics, building monitoring dashboards, setting up alerting pipelines, and integrating with the client’s existing risk and compliance frameworks. If you have experience with ISO 27001 or similar management systems, the structure is familiar — the content is just different.

3. Bias Testing and Fairness Evaluation

High-risk AI systems must be tested for bias across protected characteristics. This is a recurring service, not a one-time engagement. Models drift. Data distributions shift. New edge cases emerge. A quarterly or monthly bias audit, with documented methodology and results, is exactly the kind of retained service that builds recurring revenue.

The technical work involves building evaluation datasets, running structured tests across demographic groups, statistical analysis of outcomes, and clear reporting. If your team has data science capabilities, this is a natural fit.

4. Monitoring and Logging Infrastructure

Production AI systems need structured, auditable logging that captures inputs, outputs, model versions, latency, confidence scores, and human intervention events. This is classic infrastructure work — the kind your operations team already knows how to build — applied to a new domain.

The key difference from traditional application logging: AI system logs need to support traceability of individual decisions, not just system health metrics. This means richer data capture, longer retention, and query capabilities that support after-the-fact analysis of specific outputs.

5. Human Oversight Mechanism Design

Building the interfaces and workflows that enable meaningful human oversight. For a hiring AI, this might mean a review dashboard where recruiters can see the AI’s reasoning, override decisions, and flag concerns. For a credit scoring system, it might mean an escalation workflow that routes edge cases to human analysts.

This is UX and systems integration work. It is also deeply specific to each client’s domain and processes, which makes it hard to commoditize and hard for a hyperscaler to offer as a generic service. That specificity is your advantage.

6. Technical Documentation and Conformity Assessment Preparation

High-risk AI systems need a technical file that would make a medical device regulator feel at home. System architecture, design decisions, training data documentation, testing methodology, performance benchmarks, known limitations, deployment specifications. For providers, this also includes conformity assessment — either self-assessment or third-party assessment depending on the use case.

Most clients will need help preparing these materials. The work combines technical depth (you need to understand the system to document it) with regulatory awareness (you need to know what the documentation must cover). This is consulting at premium rates.

7. Data Governance for Training and Fine-Tuning

If your client fine-tunes or adapts AI models on their own data, they need governance frameworks covering data provenance, quality assessment, bias analysis, consent management (where personal data is involved), and version control. This intersects heavily with existing GDPR data governance — another area where your EU presence is an advantage.

8. Post-Market Monitoring and Incident Reporting

Once a high-risk AI system is in production, the obligations do not stop. Deployers must monitor performance, detect degradation or drift, identify serious incidents, and report them to authorities. Building the systems that do this — automated performance tracking, anomaly detection, incident classification, and reporting workflows — is ongoing infrastructure work with a natural retainer model.

Key takeaway: The EU AI Act creates at least eight distinct technical service lines, from initial inventory and classification through to post-market monitoring. Most of these are recurring engagements, not one-off projects. Start with AI system inventory — it is the lowest barrier to entry and the natural gateway to everything else.

Data Sovereignty as a Premium Service

In Chapter 5, we discussed the privacy proxy model and its economics. The EU AI Act adds a regulatory dimension that strengthens the case for EU-hosted AI services in specific segments.

Some clients do not just prefer to keep data in the EU. They are legally required to. The combination of GDPR data transfer restrictions, sector-specific regulations, and the EU AI Act’s requirements around data governance and system monitoring creates scenarios where sending data to US-based AI providers is genuinely not an option:

Banking and financial services under ECB and national supervisory authority oversight, where outsourcing to non-EU processors triggers additional regulatory requirements that can exceed the cost savings.
Healthcare in jurisdictions with strict patient data localization requirements — Germany’s health data infrastructure regulations are a prime example.
Defence and national security, where data classification rules prohibit external processing entirely.
Public sector in specific member states with data sovereignty mandates — France’s SecNumCloud qualification, Germany’s IT-Grundschutz requirements, and similar frameworks.

For these clients, the economics we outlined in Chapters 3 and 4 — where self-hosting is 5-15x more expensive than API access — are irrelevant. The relevant comparison is not “self-hosted vs API.” It is “self-hosted vs not using AI at all.” And against that alternative, the self-hosting premium is easily justified.

This is the one scenario where the infrastructure business model from the old world transfers cleanly to GenAI. You host the models. You operate the infrastructure. You guarantee that data never leaves your EU-based facilities. And you charge a premium that reflects the regulatory constraint, not the commodity cost of compute.

The market is real. But it is narrower than the marketing materials of most EU cloud providers suggest. Do not plan your entire business around data sovereignty clients. Plan a profitable service line for them, and build the rest of your AI practice around the broader compliance and integration opportunities.

Key takeaway: Data sovereignty is a genuine premium service opportunity for banking, healthcare, defence, and certain public sector clients who literally cannot use US-based AI APIs. Price it as regulatory necessity, not cost-plus infrastructure. But recognize this is a profitable niche, not a mass market.

How to Position This

The biggest mistake you can make is to sell “EU AI Act compliance” as a standalone product. Here is why: compliance is a cost in the client’s mind. Nobody wakes up excited to buy compliance. It is a thing they have to do, and they will try to do it as cheaply as possible. If you sell compliance as a line item, you are inviting price competition from every consultancy, law firm, and freelancer who can read the regulation.

Instead, integrate compliance into your AI deployment offering. The pitch is not:

“We will help you comply with the EU AI Act.”

The pitch is:

“We deploy AI in your organization — and every deployment we do is EU AI Act compliant from day one.”

The difference is profound. In the first pitch, you are a cost centre. In the second, you are an enabler who happens to remove a major risk. The compliance is bundled into the value, not sold as overhead.

This positioning works especially well when combined with the capabilities from earlier chapters:

Privacy proxy (Chapter 5) + EU AI Act compliance = “We route your AI usage through EU infrastructure with full regulatory compliance built in.”
Local deployment (Chapter 6) + EU AI Act compliance = “We deploy AI on your employees’ devices — no data leaves your organization, and every deployment meets EU AI Act requirements.”
Testing and security (Chapter 7) + EU AI Act compliance = “We test and monitor your AI systems for quality, security, and regulatory compliance as a single managed service.”

Each of these is a stronger proposition than any component sold alone. The compliance layer makes the technical offering more valuable, and the technical offering makes the compliance tangible rather than theoretical.

Training Your Team

This is the investment that pays off fastest. The EU AI Act is new enough that genuine expertise is scarce. If your team understands both the regulation’s requirements and how to implement them technically, you have a real differentiator — one that will persist for at least 18-24 months as the market catches up.

The knowledge you need is not deep legal expertise. It is practical understanding of:

Which systems fall into which risk categories
What deployers specifically need to do (and by when)
What conformity assessment involves for different system types
How to structure technical documentation that meets the Act’s requirements
What monitoring and logging infrastructure needs to capture

A team of three to five engineers who understand these requirements and can implement the corresponding systems is more valuable than a team of fifty who can build generic cloud infrastructure. The Act creates a knowledge premium that rewards early investment.

The Timeline Advantage

The enforcement timeline creates a specific strategic window. Most enterprises are in one of three states right now:

Unaware. They use AI but have not connected it to the EU AI Act’s obligations. They do not know they are deployers of potentially high-risk systems. A surprising number of companies fall here — particularly those that adopted AI tools informally, without a centralized procurement process.

Aware but paralyzed. They know the Act exists. They may have had a lawyer present an overview to the board. But they have no concrete implementation plan, no internal expertise, and no budget allocated. They are waiting for someone to tell them what to do in practical terms.

Actively preparing. A small minority, mostly large enterprises and those in heavily regulated sectors. They have begun compliance programmes but are discovering that the implementation work exceeds their internal capacity.

All three groups need help, but the middle group — aware but paralyzed — is the largest and the most receptive. They have the urgency (the August 2026 deadline for high-risk system obligations is not far away) but not the capability. A provider who can walk in with a clear assessment methodology, a concrete implementation roadmap, and demonstrated technical capability will win these engagements.

And here is the strategic upside: compliance partnerships are sticky. Once you have built a client’s risk management system, documented their AI deployments, implemented their monitoring infrastructure, and set up their incident reporting workflows, switching to another provider is painful and expensive. The client would need to re-document everything, retrain staff on new tools, and rebuild trust with a new partner — all while the compliance clock keeps ticking.

This is the kind of structural stickiness that infrastructure hosting used to provide. Except instead of being locked in by data gravity and migration costs, the client is retained by compliance continuity and institutional knowledge. It is a better form of lock-in because it is driven by value delivered, not by switching costs imposed.

Key takeaway: The enforcement timeline creates a narrow window — roughly now through August 2026 — where providers who build EU AI Act implementation expertise will establish themselves as trusted partners. Once embedded in a client’s compliance infrastructure, these relationships are naturally sticky. The early movers will have a durable advantage.

The Public Sector Procurement Reality

Many EU IT services providers derive 30-40% of their revenue from government and public sector contracts. If this describes your business, GenAI strategy is not just a technology question — it is a procurement question.

Public sector IT across the EU is typically procured through framework agreements and multi-year tenders. Adding “AI services” to an existing framework contract is rarely as simple as updating a service catalogue. In most jurisdictions, it requires a new procurement process, new evaluation criteria, and often new certifications from the provider.

What this means practically:

Procurement cycles are 12-18 months. If you want to sell AI services to a government client in 2027, you need to be responding to tenders and framework renewals now.
Existing frameworks are your entry point. If you already hold a managed services or consulting framework agreement with a government client, explore whether AI services can be positioned under existing service categories (e.g., “IT consulting,” “systems integration,” “infrastructure management”). This is often faster than a new procurement.
Certifications matter. Some EU member states are developing specific AI-related certifications or standards for government suppliers. Being early to certify is a competitive advantage in tender evaluations.
Security clearances and data classification. Government AI deployments often involve classified or sensitive data. If your team already holds relevant security clearances, this is a significant barrier to entry that protects your position.
The incumbent advantage is real. A government client with an existing relationship is far more likely to extend your mandate to include AI services than to run a separate procurement for a new provider. Use this.

The EU AI Act amplifies the public sector opportunity. Government bodies are themselves deployers of AI systems and must comply with the Act — often at the high-risk classification level (law enforcement, immigration, public benefits, critical infrastructure). They need implementation partners who understand both the technology and the regulatory requirements, and they strongly prefer working with providers they already trust.

Key takeaway: If public sector represents a significant share of your revenue, start positioning AI services within existing procurement vehicles now. The procurement cycle means that opportunities missed today will not return for 12-18 months. Your incumbent status is an asset — use it before the next tender cycle lets in new competitors.

What This Chapter Means for Your Strategy

The EU AI Act is not a burden to endure. It is a market to serve. The regulation creates mandatory demand for technical implementation work that sits squarely in the competency zone of IT services providers. It favours EU-based providers who share their clients’ regulatory environment. It creates recurring revenue through ongoing monitoring and reporting obligations. And it provides natural stickiness that protects against the commoditization pressures we discussed in earlier chapters.

The concrete moves:

Invest in training now. Get three to five people fluent in the Act’s practical requirements within the next quarter. This is not a six-month project — the core material can be absorbed in weeks by technical people with compliance experience.
Start with AI inventory engagements. Offer existing clients a discovery exercise: what AI systems are you using, and which ones trigger EU AI Act obligations? This is low-risk, low-cost, and opens the door to everything else.
Bundle, do not unbundle. Sell compliance as part of your AI deployment and management services, not as a standalone product. The margin is better and the positioning is stronger.
Build the data sovereignty service line for clients who need it. Price it as a premium offering. Do not apologize for the markup — the client has no cheaper alternative.
Target the August 2026 deadline. Every high-risk AI system deployer needs functioning compliance infrastructure by then. That deadline is your best sales tool for the next several months.

The regulation is complex. The opportunity is straightforward.

Next: Chapter 12 — Pricing Models and Packaging

Chapter 12

Chapter 12: Pricing Models and Packaging

You have built the capability. You understand the infrastructure economics, the business models, the compliance landscape. Now comes the question that determines whether any of it generates revenue: how do you actually price this?

Pricing GenAI services is harder than pricing traditional IT services, and the reason is simple. Your client has a reference point, and that reference point is devastating. ChatGPT Team costs $30 per seat per month. Claude Pro costs $20. Microsoft Copilot costs $30. These are frontier-level AI products, built by companies with tens of billions in infrastructure investment, offered at prices that would not cover your team’s coffee budget.

If what you are selling looks anything like “access to an AI chatbot,” you have already lost. No amount of positioning, no sales deck, no carefully worded value proposition will overcome the basic arithmetic: why would a client pay you $80 per user per month when they can get ChatGPT for $30?

The answer, of course, is that you are not selling an AI chatbot. You are selling a solution to a specific business problem, and the language model is one component of that solution — your cost of goods sold, not your product. The model cost is to your AI service what flour is to a bakery. Nobody walks into a bakery and says “I can buy flour for $0.50 per kilo, so this bread should cost $0.60.” But if your storefront looks like a flour shop, that is exactly the comparison they will make.

The pricing principle: You are not selling LLM access. You are selling an AI-powered solution. The moment a client can compare your offering line-by-line with a $30/seat consumer product, you have a positioning problem, not a pricing problem.

This chapter walks through five pricing models, a packaging framework, and the service-level price ranges you need to build a viable commercial practice.

Five Pricing Models

There is no single correct way to price GenAI services. The right model depends on your delivery architecture, your client’s risk tolerance, and where you sit on the spectrum from infrastructure provider to solution partner. Here are the five models that work in practice, with honest assessments of each.

1. Per-Seat Subscription

Structure: $40-100 per user per month, flat rate regardless of usage.

This is the model most familiar to enterprise buyers. It is easy to budget, easy to compare, and easy to procure. The client knows exactly what they will spend: 200 users at $50 per seat equals $10,000 per month, no surprises.

Where it works: Per-seat pricing works best when your underlying cost structure is predominantly fixed — which means it is a natural fit for the local deployment model discussed in Chapter 6. If you have deployed a model running on the client’s own hardware or on your managed infrastructure, your costs do not scale meaningfully with per-user activity. A user who sends 500 queries per day and a user who sends 5 queries per month cost you roughly the same in infrastructure. The fixed subscription captures this reality cleanly.

A practical example: $50 per user per month for a managed local AI deployment including guardrails, model updates, basic RAG over company documents, and 8/5 support. At 100 users, that is $5,000 per month in revenue against perhaps $1,500-2,000 in infrastructure and support costs. The margins are healthy because the heavy users are subsidized by the light ones.

The risk: That subsidy cuts both ways. If your client’s usage pattern is heavily skewed — a small group of power users generating 80% of the load — the light users may question why they are paying the same rate. And if a competitor offers a usage-based alternative, the light users have a reason to leave while the heavy users (who are expensive to serve) stay. This is classic adverse selection, and it can erode your margins quietly.

Mitigation: Tier your per-seat pricing. A “standard” tier at $40 per user with reasonable usage limits and a “power user” tier at $80 per user with higher limits and priority support. This segments demand without abandoning the subscription model’s predictability.

2. Per-Token Passthrough with Markup

Structure: Client pays actual API costs plus a 20-40% margin.

This is the most transparent model and, for that reason, the most dangerous. The client can see exactly what the underlying API calls cost, exactly what your markup is, and exactly how much they would save by going direct.

Where it works: The passthrough model makes sense for the privacy proxy architecture (Chapter 5), where your value-add is demonstrably not the model itself but the compliance layer, PII stripping, and audit trail you wrap around it. A client paying $5,000 per month in API costs plus a $1,500 compliance premium understands they are paying for the privacy infrastructure, not for a more expensive version of the same API.

It also works during early-stage engagements where the client wants to experiment with AI without committing to a fixed subscription. “Pay for what you use, plus our management fee” is a low-friction way to start a relationship.

The risk: Margins are thin and structurally capped. If you are marking up API costs by 30%, your gross margin on a $10,000 monthly API spend is $3,000. Out of that $3,000 you need to cover your engineering time, support, infrastructure, and sales costs. At the typical EU IT services cost structure, you need a substantial number of clients to make this model viable as a standalone offering.

Worse, the client has a permanent incentive to disintermediate you. Every time they look at their invoice, they see the API cost and the markup as separate line items. When they eventually hire someone who can call an API, you lose the account.

Mitigation: Never present token passthrough as your only value. Bundle it with monitoring, cost optimization (you can often reduce their API costs by 30-50% through model routing and prompt optimization), compliance, and support. The markup should be the smallest visible part of a larger service fee.

3. Fixed Monthly Retainer

Structure: $5,000-25,000 per month covering infrastructure, support, and a defined usage tier.

The retainer model shifts usage risk from the client to you, and in exchange gives you predictable monthly recurring revenue. The client pays a flat fee; you deliver a defined service level including a certain amount of AI capacity, monitoring, support, and regular updates.

Where it works: Retainers are the natural pricing model for managed AI infrastructure engagements with enterprise clients. The client wants a budget line item they can plan around. They do not want to think about tokens, GPU hours, or API calls. They want “our AI works, someone competent is making sure it keeps working, and we know what it costs.”

A practical structure:

Retainer tier	Monthly fee	Included capacity	Support level
Standard	$5,000-8,000	Up to 50 users, standard models	Business hours, 4-hour response
Professional	$10,000-15,000	Up to 200 users, premium models, RAG	Extended hours, 1-hour response
Enterprise	$18,000-25,000	Unlimited users, custom models, full integration	24/7, 15-minute response for critical

The risk: You absorb usage spikes. If a client on a $10,000 retainer suddenly doubles their AI usage because they rolled out a new internal tool, your costs jump while your revenue stays flat. You can manage this with fair-use clauses and usage caps, but enforcing those caps damages the client relationship.

Mitigation: Define usage tiers clearly in the contract, include overage pricing for usage beyond the included tier (at a per-token rate, but positioned as an exception rather than the norm), and build a 20-30% buffer into your pricing to absorb normal variation.

4. Project-Based Fee Plus Ongoing Retainer

Structure: $20,000-50,000 implementation project plus $3,000-10,000 per month ongoing.

This is the highest total-value pricing model and the one that most naturally aligns with how enterprise AI deployments actually work. There is an upfront phase — discovery, architecture, integration, testing, deployment — followed by an ongoing phase of maintenance, monitoring, updates, and optimization.

Where it works: This model is a natural fit for compliance-heavy engagements (Chapter 9), custom RAG implementations, and any deployment that requires significant integration with existing client systems. The project fee covers your intensive engineering effort during setup; the retainer covers the long tail of keeping it running, keeping it compliant, and keeping it current as models evolve.

Example engagement: - Phase 1 — Assessment and Architecture (4-6 weeks): $15,000-25,000. Discovery, data audit, architecture design, compliance gap analysis. - Phase 2 — Implementation (8-12 weeks): $30,000-60,000. Model selection, deployment, RAG pipeline, guardrails, integration with client systems, testing. - Phase 3 — Ongoing Management: $5,000-10,000 per month. Monitoring, model updates, compliance documentation maintenance, support.

Over a 24-month engagement, the total value ranges from $165,000 to $325,000. This is meaningful revenue from a single client, with the ongoing retainer providing the recurring base that makes the business sustainable.

The risk: Longer sales cycles. Enterprise procurement for a six-figure engagement involves more stakeholders, more approvals, and more competition than a simple subscription sale. You need to budget 3-6 months from initial conversation to signed contract, and you need a pipeline large enough to absorb the deals that stall or fall through.

Mitigation: Start small. Offer the assessment phase as a standalone engagement at $5,000-15,000. This gives the client a low-risk entry point and gives you an opportunity to demonstrate competence before asking for the larger commitment. Most implementation contracts grow from successful assessments, not from cold proposals.

5. Outcome-Based and Value-Based Pricing

Structure: Price linked to a measurable business outcome — documents processed, tickets resolved, hours saved, accuracy achieved.

This is the model with the highest potential margin and the hardest execution. Instead of pricing your inputs (time, tokens, infrastructure), you price your outputs (business results). If your AI-powered document processing system handles 10,000 invoices per month that previously required 3 full-time employees, you charge based on the value created, not the compute consumed.

Where it works: Outcome-based pricing works for mature, well-tested vertical applications where you have high confidence in the solution’s reliability and can clearly measure the outcome. If you have deployed the same invoice processing solution for five similar clients and you know it consistently achieves 95%+ accuracy, you can price at, say, $0.50 per invoice processed — delivering clear ROI to the client while capturing margins far above your actual compute costs.

The risk: You are betting on your solution’s performance. If accuracy drops, if the client’s data is messier than expected, if edge cases multiply — you are still committed to the outcome while your costs spiral. You also need robust measurement and attribution: both you and the client must agree on what constitutes a “processed document” or a “resolved ticket,” and that agreement needs to survive contact with messy operational reality.

Mitigation: Only offer outcome-based pricing for solutions you have deployed at least 2-3 times successfully. Include a pilot period (60-90 days) with traditional time-and-materials pricing before switching to outcome-based. Define the metrics precisely in the contract, including exclusions for edge cases and data quality issues.

Packaging: The Three-Tier Framework

Individual pricing models work for individual engagements, but building a scalable practice requires packaging — predefined bundles that clients can evaluate, compare, and buy without starting from scratch each time.

The three-tier model is not original, but it is effective. Here is a framework calibrated for EU IT services providers selling GenAI solutions:

	Starter	Professional	Enterprise
Target	SMBs, 10-50 users	Mid-market, 50-200 users	Enterprise, 200+ users
Deployment	Local AI on existing hardware	Local + cloud hybrid	Full managed AI infrastructure
Models	Standard open-source models, quarterly updates	Premium open-source + API access, monthly updates	Custom fine-tuned models, continuous updates
Features	Basic guardrails, standard RAG	Custom guardrails, advanced RAG, document processing	Full compliance suite, custom integrations, analytics
Support	Email, next business day	8/5 phone + email, 4-hour response	24/7, dedicated account manager, 1-hour critical response
Compliance	Basic documentation	EU AI Act risk assessment, GDPR documentation	Full compliance management, audit support, regulatory liaison
Price range	$20-40/user/month	$50-80/user/month	$100-200/user/month or custom retainer
Minimum commitment	Monthly	Annual	Multi-year

Packaging principle: Never sell “AI hosting” in isolation. The LLM is one component of a vertical solution. Bundle the model with integration, compliance, support, and domain expertise. The bundle is what creates value; the model alone is a commodity.

The tiers serve a dual purpose. They give the client a clear upgrade path (start at Starter, grow into Professional as usage matures), and they give your sales team anchoring (the Enterprise tier at $200 per user makes the Professional tier at $60 look reasonable by comparison).

Vertical Bundles

Beyond the horizontal tiers, consider packaging vertical solutions for specific industries:

“AI-Powered Document Processing for Legal”: Local model deployment + RAG over case law and precedents + GDPR-compliant data handling + EU AI Act compliance documentation + integration with document management systems. Price: $15,000-25,000 implementation + $8,000-15,000 per month.
“Internal Knowledge Assistant for Manufacturing”: Local model on-premise + safety guardrails for operational procedures + integration with ERP and maintenance systems + multilingual support for factory floor staff. Price: $20,000-40,000 implementation + $5,000-10,000 per month.
“Compliant Customer Service AI”: API proxy with privacy layer + compliance documentation + conversation monitoring and quality scoring + integration with CRM and ticketing. Price: $10,000-20,000 implementation + $5,000-12,000 per month.

Vertical bundles command higher prices because they solve a complete problem. A legal firm does not want “an AI model” — they want a system that helps their associates research case law faster while maintaining client confidentiality. That is a different sale at a different price point.

Service-Level Price Guide

Beyond packaged products, you will sell professional services. Here are realistic price ranges for the EU market as of 2026, reflecting rates that Central and Eastern European providers can charge while remaining competitive against Western European consultancies:

Service	Price range	Duration	Notes
Initial AI assessment	$5,000-15,000	2-4 weeks	Use case identification, feasibility analysis, architecture recommendation. Often the entry point for larger engagements.
Implementation and integration	$20,000-80,000	6-16 weeks	Full deployment including model selection, infrastructure setup, RAG pipeline, integration with client systems, testing.
Fine-tuning engagement	$10,000-30,000	4-8 weeks	Data preparation, fine-tuning runs, evaluation, deployment. Requires ML engineering capability.
Monthly managed service	$3,000-25,000/month	Ongoing	Infrastructure monitoring, model updates, support, optimization. The recurring revenue engine.
EU AI Act compliance assessment	$20,000-50,000	6-12 weeks	Risk classification, documentation, conformity assessment support. High-value, expertise-intensive.
Training and workshops	$2,000-5,000/day	1-5 days	Staff enablement, executive briefings, hands-on technical training. Good relationship builder.

A note on rates: These ranges assume delivery from Central or Eastern European teams. If you are operating from Western Europe with higher cost structures, adjust upward by 30-50%. The ranges also assume the client is a mid-market or enterprise organization — SMB pricing is typically 30-40% lower.

Cost Structure and Margin Dynamics

Different pricing models interact differently with your cost structure, and understanding these dynamics is essential for maintaining healthy margins.

Local deployment model: High fixed costs during development and setup, low variable costs during operation. Your initial investment in engineering, hardware procurement, and deployment is significant — $50,000-100,000 per client for a full implementation. But once deployed, the incremental cost of serving additional users is minimal. No per-token charges, no API bills that scale with usage. This means your margin improves with scale and time: the longer the engagement lasts and the more users adopt the system, the better your economics get. Per-seat subscription pricing captures this dynamic well.

API passthrough model: Low fixed costs (your proxy infrastructure is lightweight), but variable costs that scale linearly with client usage. Every query the client sends costs you tokens, and that cost rises in direct proportion to adoption. Your margin stays roughly flat regardless of scale — you earn your percentage on every dollar of API spend, but you never benefit from the operating leverage that makes the local deployment model attractive. Token passthrough with markup is the honest pricing model here, but the margins are permanently thin.

Retainer model: Both your costs and your revenue are predictable month to month, which makes this the easiest model to manage from a financial planning perspective. The risk is in the mismatch between contracted price and actual cost of delivery — if you price the retainer too low relative to the service level required, you eat the difference. Build your retainers with a 25-30% margin buffer above your expected cost of delivery.

Model	Fixed costs	Variable costs	Margin trend	Best pricing approach
Local deployment	High	Low	Improves with scale	Per-seat subscription
API passthrough	Low	High (linear)	Stays flat	Token markup + service fee
Managed retainer	Medium	Medium	Stable if well-priced	Fixed retainer with tiers
Project + retainer	High (front-loaded)	Low (ongoing phase)	High on retainer after payback	Project fee + monthly retainer

The Bundling Imperative

If there is one commercial lesson in this chapter, it is this: never sell a component when you can sell a solution.

The AI model is a component. Infrastructure is a component. Compliance documentation is a component. Support is a component. Individually, each of these can be compared to a cheaper alternative or done in-house. Bundled into a solution that solves a specific business problem, they become something the client cannot easily replicate or replace.

Your client does not want to buy an LLM, a RAG pipeline, a compliance audit, and a support contract separately. They want to buy “our legal research is now 3x faster and fully GDPR compliant.” Price accordingly.

This means your sales team needs to stop talking about technology and start talking about outcomes. Not “we deploy Llama 4 locally with RAG” but “we make your engineers’ knowledge searchable and keep your proprietary data on your premises.” Not “we provide EU AI Act compliance documentation” but “we make sure you pass the audit.” The pricing follows the positioning: outcome-oriented packaging supports premium pricing in a way that component-level pricing never will.

What to take from this chapter: Your biggest pricing risk is not charging too much — it is being compared to $30/seat consumer AI products. Avoid this by selling solutions, not components. Start new clients with a low-risk assessment ($5,000-15,000), grow into implementation ($20,000-80,000), and anchor long-term relationships with managed service retainers ($3,000-25,000/month). Bundle aggressively. Price on value delivered, not tokens consumed. And remember: the model is your cost of goods sold, not your product.

Next: Chapter 11 — Talent and Team Structure

Chapter 13

Chapter 13: Talent and the CEE Market

Everything we have discussed so far — the business models, the pricing strategies, the compliance opportunities — depends on one thing: having people who can actually deliver.

This is where many IT services providers in Central and Eastern Europe hit the wall. Not because the strategy is wrong, but because the team that built and maintains traditional infrastructure is not the same team that builds and delivers AI services. The skills overlap is real, but it is incomplete. And the gap between what you have and what you need cannot be closed by sending everyone to a weekend workshop.

The Skills Gap Is Real — And Specific

Take a typical mid-sized IT services company in CEE. You have system administrators, network engineers, cloud architects, a helpdesk team, project managers, and developers building internal tools or client-facing applications. Competent people doing real work.

GenAI services need a different profile: ML engineers who understand model architectures and fine-tuning, prompt engineers who design guardrails and evaluate outputs, evaluation specialists who measure hallucination rates, AI security testers who understand adversarial attacks on language models, and solution architects who can design AI-integrated systems for real business problems.

Some of your current team can grow into these roles. A DevOps engineer who automates everything has the mindset for AI operations. A developer who builds integrations has the foundation for connecting AI to existing systems. A technical lead who understands client problems is halfway to being an AI solutions architect.

But halfway is not there. The retraining path takes 12 to 18 months of serious investment — not as an afterthought alongside existing project delivery, but as a genuine commitment of time and money. And the market window is now. Clients are making decisions about AI partners today.

The core tension: Building AI capability takes 12-18 months. Client decisions are happening now. You need a strategy that addresses both timelines simultaneously.

The Five Roles You Actually Need

A functional GenAI practice needs five distinct roles. Not all need to be full-time hires from day one, but you need access to all five capabilities.

1. AI Solutions Architect — sits between client and technical team, translating business needs into buildable AI designs. Natural evolution of your best senior engineers or existing architects. They need to learn the model landscape, common AI architecture patterns (RAG, agents, fine-tuning), cost estimation, and enough about the EU AI Act for credible compliance conversations. Upskilling path: 3-6 months. Your most realistic internal promotion.

2. ML/AI Engineer — implements the actual AI components: fine-tuning pipelines, RAG systems, inference optimization. This is the hardest role to fill and the most expensive. Good ML engineers in CEE are often already employed by Western companies paying Western salaries remotely. You are not going to upskill into this role quickly — hire one or two at competitive salaries. They become the nucleus around which you build the rest of the team.

3. Integration Engineer — connects AI capabilities to client systems: CRM, ERP, databases, legacy applications. This is where your existing talent shines. Your current developers already do this work, just not with AI as an endpoint. The new skills — model APIs, streaming responses, token budgets — are learnable in weeks. Upskilling path: 4-8 weeks.

4. AI Operations / DevOps — deploys, monitors, and maintains AI systems in production. Natural evolution of existing DevOps roles. Engineers who manage Kubernetes and CI/CD pipelines have 80% of what they need. The remaining 20% — model-specific metrics, GPU management, AI failure modes — can be learned alongside the ML engineer. Upskilling path: 2-4 months.

5. Prompt Engineer / AI Trainer — designs system prompts, builds evaluation frameworks, tests edge cases, iterates on quality. Does not require a traditional engineering background. Domain experts who deeply understand client business can be very effective here. A helpdesk team lead who knows how support conversations work might design better prompts than an ML researcher who has never fielded a ticket. Upskilling path: 4-8 weeks.

The CEE Talent Market: Advantages and Realities

An ML engineer in Prague or Warsaw currently earns EUR 45,000 to 80,000 annually, compared to EUR 90,000 to 150,000 in Western European hubs. This cost differential means you can build a small AI team for what a Western company spends on two or three senior hires. In the early stages — experimenting, building internal tools, running pilots — you can afford to try things and fail without it being existential.

But the gap is closing. Remote work lets Western companies hire CEE talent at near-Western salaries. A strong ML engineer in Bratislava can work remotely for a London firm and earn London-adjacent compensation. You are competing not just with local companies, but with every remote-friendly tech company in Europe.

Technical universities across CEE — Charles University, Warsaw University of Technology, Comenius University, Budapest University of Technology — are expanding AI programmes. The graduates are strong technically, but the volume is still small relative to demand, and fresh graduates need 6-12 months before they are productive on client projects. Build relationships with these universities now: internships, thesis sponsorships, guest lectures. Low-cost investments that build a pipeline.

Here is an uncomfortable truth: the best AI talent does not dream of working for a managed hosting company. They want interesting problems, open-source contributions, conference talks. If your public identity is “we manage servers,” you will struggle to attract AI talent regardless of salary. Build a credible AI narrative through real projects, then talk about them publicly.

On talent attraction: Your ability to hire AI talent is directly correlated with the visibility of your AI work. Ship something real, then talk about it publicly.

Upskilling: What Actually Works

The programmes that produce competence — not just certificates — share common traits.

Pick the right candidates. Not everyone is suited for AI upskilling. The best candidates are experienced enough for strong engineering fundamentals, curious enough to experiment on their own, comfortable with ambiguity, and interested in business problems beyond pure technology. Do not try to turn a system administrator content with network configurations into an ML engineer. It will not work, and you will lose a good sysadmin.

Prioritize doing over watching. The gap between completing an online ML course and building a production RAG system is enormous. What works: internal AI projects with real users and feedback, hackathons with real constraints (“here is a client problem — prototype a solution in two days”), pairing your hired ML engineer with upskilling candidates on real work, conference attendance, and open-source contributions that build both depth and public credibility.

Use fractional expertise while you build. Hire a part-time AI advisor, contract with an ML consultancy for initial implementations, bring in an AI security specialist for assessments. More expensive per hour than internal staff, but you start delivering now. The key: ensure knowledge transfer is part of every external engagement. If a consultant builds something and leaves without your team understanding it, you bought a deliverable, not a capability.

Organizational Change: The Uncomfortable Parts

Sales Cannot Sell What They Do Not Understand

AI sales conversations are fundamentally different from selling hosting. The client does not know what they need. The scope is unclear. The pricing model is different. Your sales team needs to understand AI capabilities well enough to have credible discovery conversations — to hear a client describe a problem and recognize whether AI can help.

Bring your technical AI team into sales conversations from the start. Run internal workshops with live demos. Create frameworks — not scripts — that help salespeople ask the right questions. Accept that the first few AI deals will be slower and messier than what you are used to.

Delivery Must Shift from Maintenance to Evolution

Traditional IT delivery is about stability: set it up, keep it running, fix it when it breaks. AI delivery is different. Models improve quarterly. What worked three months ago might now be outperformed at half the price. Delivery teams need to shift from “maintain what is running” to “continuously improve and evolve.” This means iterative project structures, outcome-based metrics instead of uptime, and comfort with the fact that AI systems are probabilistic.

Leadership Must Accept the Valley

The transition will involve a period where old revenue declines and new revenue has not yet materialized. The temptation is to hedge — keep the old business untouched while running AI as a side experiment. This does not work. It signals to your best people that AI is not a priority, so they leave. It signals to clients that you are not serious, so they find AI partners elsewhere.

On organizational change: The hardest part is not learning new technology. It is accepting that revenue may dip before it grows, and that the transition will be uncomfortable for everyone.

The Partnership Alternative

You do not have to build everything in-house. Focus on what you already own — client relationships, domain knowledge, integration expertise. Partner with specialized AI companies for deep ML work: model selection, fine-tuning, complex RAG architectures.

Your value is the last mile. The AI model comes from a partner, but you understand the client’s legacy systems, their data formats, their compliance requirements, their organizational politics. You do the integration, deployment, operations, and client management.

This is not a fallback. It is a legitimate strategy. The key: be honest about which parts you own and which you source, ensure knowledge transfer from partners, and shift the balance toward internal capability over time.

What Your Existing Team Can Learn in 6 Months vs What Requires a New Hire

Not every AI role requires external hiring. Some skills build naturally on existing expertise; others represent a genuine gap. Being realistic about which is which will save you both money and frustration.

Existing Role	Can Learn in 6 Months	Requires New Hire or Partner
Senior developer	RAG pipeline development, API integration, prompt engineering, basic fine-tuning	Advanced ML architecture, custom model training, inference optimization
DevOps engineer	vLLM/TGI deployment, GPU monitoring, model serving pipelines, basic MLOps	ML-specific CI/CD, model evaluation frameworks, A/B testing infrastructure
Systems architect	AI solution design, model selection, hybrid architecture planning	ML system design at scale, custom inference optimization
Security engineer	Prompt injection testing, AI output monitoring, data leakage audits	Adversarial ML, model robustness evaluation, red-teaming methodology
Service desk manager	AI ticket triage configuration, automation workflow design, quality monitoring	Custom AI agent development, NLU model fine-tuning
Project manager	AI project scoping, iterative delivery with AI, client expectation management	— (this is fully learnable)

The practical upskilling path:

Week 1-4: Fundamentals. Every candidate completes a structured introduction to LLMs, prompt engineering, and RAG. Use hands-on labs, not just courses — build a working chatbot over company documentation by end of month one.
Month 2-3: Specialization. Developers focus on API integration and RAG pipelines. DevOps learns model serving (vLLM, llama.cpp) and GPU monitoring. Architects work through reference architectures for hybrid local/cloud deployments.
Month 4-6: Real project. The upskilling cohort builds something real — either an internal tool or a supervised component of a client engagement. This is where abstract knowledge becomes practical capability.

The cost is modest: EUR 2-5K per person in training materials and cloud compute for labs, plus 20-30% of their time over six months. For a cohort of five people, that is EUR 10-25K in direct costs plus the opportunity cost of reduced availability — far less than hiring five new AI engineers.

Key takeaway: A senior developer with six months of focused upskilling can handle 70-80% of the integration and RAG work that makes up most early AI engagements. Reserve new hires for the genuinely specialized roles: ML engineering, model evaluation, and inference optimization. Build the team around a core of upskilled veterans plus one or two experienced AI hires who can mentor the rest.

A Realistic 12-Month Talent Plan

Months 1-3: Assessment and First Hires. Audit your team. Identify 5-10 upskilling candidates based on aptitude and interest — forced retraining does not work. Hire 1-2 AI-experienced engineers at competitive salaries. Start the upskilling programme above. Engage a fractional AI advisor if you cannot hire full-time immediately.

Months 4-6: Internal Project. Your AI engineers and upskilling candidates build something for internal use — an AI knowledge base, an automated report generator, or an AI-enhanced service desk pilot (see Chapter 9). Full cycle: architecture, implementation, deployment, iteration. Low stakes, real learning.

Months 7-9: First Client Pilot. Select a friendly client with tolerance for imperfection. Small, well-defined scope. Clear success criteria. Build in extra time and support. The project must succeed well enough for a reference case.

Months 10-12: Scale and Systematize. Refine team structure and delivery process. Develop repeatable patterns. Start second and third engagements. Build the public narrative — blog posts, case studies, conference talks.

The 12-month reality check: At the end of this plan, you will not have a mature AI practice. You will have a functioning team, a couple of real engagements, and a much clearer picture of what your AI business looks like. Maturity takes another 12-18 months beyond that.

The Cost of Waiting

Every month you delay, the gap widens in two ways. First, the AI talent market gets more competitive — the engineers you could hire today will cost more next year. Second, your best people are watching. If they see their company ignoring AI, the curious and ambitious ones will leave for employers who take it seriously.

The talent transformation is not something you defer until the business case is proven. It is part of proving the business case. You cannot land AI projects without AI capability, and you cannot build capability overnight. The time to start is not when you are ready. It is now, precisely because you are not ready.

Next: Chapter 14 — The “Do Nothing” Scenario and Your 18-Month Roadmap

Chapter 14

Chapter 14: The “Do Nothing” Scenario and Your 18-Month Roadmap

The Comfortable Default

Doing nothing feels reasonable right now. Your hosting contracts are still renewing. Nobody in your client base has called to say they are cancelling because you do not offer AI services. The monthly recurring revenue continues to arrive, predictable and familiar.

So the temptation is to treat generative AI like every other technology wave: watch from a distance, wait for the market to mature, step in when the path is clear. This approach worked with cloud, containers, and DevOps.

It will not work this time.

The Slow Erosion You Will Not See Coming

Your infrastructure revenue will probably not collapse in 2026 or 2027. Hosting contracts have inertia. But here is what happens over 3-5 years if you do nothing.

Your clients start exploring AI. A junior developer builds a prototype, or a department head sees what a competitor is doing and asks the CTO to investigate. They need someone to help evaluate models, build integrations, handle EU AI Act compliance, and eventually operate AI systems at scale. They look at you first — you are their IT services provider, you understand their environment. And when your answer is “not yet,” they call someone else.

The risk is not that infrastructure dies. It is that AI becomes the entry point for the client relationship. Whoever helps a client with their first AI deployment earns trust and positions themselves as the strategic technology partner. That partner then says: “By the way, we also do infrastructure management. Want us to take a look at your hosting?”

The real threat is not that AI replaces your business. It is that AI becomes the front door to the client relationship, and whoever walks through that door first takes the hosting contract with them.

This played out with cloud migration consulting in the 2010s. The firms that helped clients move to AWS captured not just the migration project but the ongoing management, security monitoring, and eventually the entire IT relationship.

Your Competitors Are Already Moving

The IT services provider in your market who starts today will, by the time you begin, have 18+ months of learning, 2-3 client references, and engineers who have been through the full cycle of deploying, debugging, and improving AI systems in production. You cannot buy that experience on the open market. You cannot shortcut it with a strategy deck. It comes only from doing the work.

The cost of starting late is always higher than the cost of starting imperfectly. A failed pilot with a friendly client is infinitely more valuable than a flawless plan that never executes. And the learning compounds — each project makes the next one faster, cheaper, and better scoped.

What Not to Do: Lessons from Early Movers

Before we lay out the roadmap, here are the mistakes that early movers have already made across the European IT services market. These are not hypothetical.

Do not try to compete with OpenAI on price. You will lose. Chapter 4 laid out why the hyperscaler cost advantage is structural. Your value is better integration, compliance, and client context — not cheaper inference.
Do not promise “we will build you your own GPT.” Clients hear this and expect frontier-level capability running privately. What they get is a noticeably less powerful system that costs more to operate. Set expectations honestly.
Do not underestimate the ops burden of self-hosting. Running GPU inference infrastructure is not like hosting a web server. The failure modes, monitoring requirements, and talent are all different. Several providers have learned this the hard way.
Do not assume your current team can transition without significant upskilling. ML inference optimization, prompt engineering, and AI evaluation are genuinely different disciplines from cloud operations. Plan for a 6-12 month learning curve.
Do not ignore shadow AI. Your clients’ employees are already using ChatGPT and Claude with company data, without IT’s knowledge, without compliance review. Solving this is an excellent entry point for broader AI services.
Do not build around a single model or provider. The landscape shifts quarterly. Build model-agnostic architectures from the start.
Do not start with the hardest use case. Pick low-risk, high-visibility wins first — a chatbot over company docs, document classification, meeting summaries. Build credibility before attempting agentic workflows.
Do not sell “AI” as a buzzword. “We will add AI to your business” means nothing. “We will classify incoming support tickets and route them automatically, reducing response time by 40%” means everything.

The 18-Month Roadmap

What follows is a phased action plan designed for an EU IT services provider with 30-500 employees, existing infrastructure clients, and little to no current AI capability. Adjust the specifics to your situation, but respect the sequencing — each phase builds on the one before it.

Phase 1: Foundation (Months 1-6)

The goal of this phase is not revenue. It is learning.

Conduct an honest internal skills assessment. Map your team against what AI services require — ML engineering, prompt engineering, evaluation methodology, AI compliance. Be unflinching about the gaps.
Identify 2-3 team members for AI upskilling. Choose people who combine technical aptitude with curiosity. They do not need to be your most senior engineers — they need to be your most adaptable. Invest in structured training, certifications, and hands-on workshops.
Build one internal AI tool for your own company. This is the most important item on this list. Eat your own cooking. A chatbot over internal docs, automated ticket classification, AI-assisted proposal writing — it does not matter what. What matters is going through the full cycle of building, deploying, and iterating. This single project will inform every client engagement that follows.
Start with API integration, not self-hosting. Use frontier APIs from OpenAI, Anthropic, Google, or Mistral. Learn what the models can do before you worry about hosting them. The application layer is where most of the value sits. Infrastructure comes later.
Research EU AI Act obligations relevant to your clients’ industries. This knowledge becomes a differentiator faster than any technical capability.
Have conversations with 3-5 clients about their AI pain points. Do not sell anything yet. Listen. These conversations will shape your offerings more than any market report.

Budget: EUR 20,000-50,000 for training, tools, API credits, and experimentation time.

Phase 2: First Client Engagements (Months 7-12)

The goal is learning with real clients. Revenue is welcome but secondary.

Launch 2-3 pilot projects with friendly clients. Choose clients who trust you and are forgiving of imperfection. Offer favourable pricing in exchange for patience and detailed feedback.
Start small. A chatbot over company documentation, automated document processing, an internal knowledge base with natural language search. Well-understood use cases with manageable complexity and high visibility.
Use frontier APIs — do not self-host yet. Focus on building good prompt templates, evaluation pipelines, clean integrations, and monitoring that catches quality issues before the client does.
Build your first repeatable delivery template. Document the discovery process, architecture decisions, implementation steps, and handoff procedure. The second pilot should be faster than the first.
Develop initial pricing models and packaging. Test different approaches — per-user fees, project-based pricing, token pass-through, or hybrids. Review Chapter 10.
Hire or contract 1-2 AI-experienced engineers if budget allows. Even one person who has deployed AI in production dramatically accelerates your team’s learning.
Start evaluating local deployment feasibility. Test llama.cpp and MLX on standard hardware. Benchmark open-source models against the APIs you have been using, so you are ready for Phase 3.

Budget: EUR 40,000-80,000 for hiring or contracting, client pilots, tools, and API costs.

Phase 3: Scale and Specialize (Months 13-18)

The goal is productizing what you have learned.

Double down on what works. Your pilots will reveal which use cases resonate, which delivery approaches are efficient, and which pricing sticks. Deprioritize what did not work — sunk cost discipline matters.
Develop 1-2 packaged offerings. Based on the business models explored in Chapters 5-7, build repeatable service packages with clear scope, pricing, and deliverables. Local deployment plus compliance for privacy-sensitive clients, or agentic infrastructure monitoring — whatever your pilots validated.
Build model-agnostic architecture capabilities. Ensure your framework can swap models without rebuilding the application. This protects clients from lock-in and protects you from the next model landscape shift.
Consider your first local deployment client. If you have the talent, the client need is genuine, and the hardware is adequate — execute it. This adds a high-value capability that few competitors will have.
Formalize EU AI Act compliance services. By month 13, you should have enough practical experience to offer compliance advisory with credibility.
Train your sales team. They need to articulate business value and handle objections, not explain transformer architectures.
Target: 5-10 paying clients by month 18. A mix of pilot conversions, new clients, and upsells to existing infrastructure clients.

Budget: EUR 60,000-120,000 for team growth, marketing, sales enablement, and infrastructure.

Key Metrics to Track

You cannot manage what you do not measure. From month one:

Revenue from AI-related services — should be greater than zero by month 12
Number of client pilots and conversion rate — a healthy rate is 40-60%; lower means your pilots may be too ambitious or pricing misaligned
Team capabilities — how many people can independently scope, deliver, and support an AI engagement?
Client NPS on AI services specifically — measure separately from infrastructure; early satisfaction predicts referrals
Pipeline of AI-related opportunities — your leading indicator of future revenue

Infrastructure or Expertise: The Five-Year Question

We return, in this final chapter, to the question posed in Chapter 1:

In five years, will your company’s primary revenue come from infrastructure you operate, or from expertise you deliver?

The eleven chapters between that question and this answer have built the case from every angle — the economics of self-hosting, the structural advantages of hyperscalers, the business models that work, the regulatory landscape, the talent market, the pricing dynamics. Here is what all of it points to.

The infrastructure margin in generative AI is being compressed to near zero for anyone who is not a hyperscaler. The GPU is becoming a commodity input, not the product. The companies that try to compete on compute will find themselves in an unwinnable race against providers with trillion-dollar capital bases and global-scale infrastructure.

The expertise margin — knowing how to make AI work in a specific client’s context, how to evaluate its output, how to keep it safe, how to integrate it into workflows, how to navigate the EU AI Act, how to manage the model lifecycle — that is where value is migrating. And it is migrating fast.

This is actually good news.

Because you own the client relationship. You understand their environment, their constraints, their industry, their regulatory obligations, and their organizational culture. A hyperscaler in Mountain View or Seattle does not have that knowledge and cannot easily acquire it. An AI startup in San Francisco does not have your 15 years of trust with the CTO.

Your value is in the last mile — getting AI to work reliably in a specific context, for a specific client, with their specific data and their specific compliance requirements. That last mile is messy, complex, context-dependent, and enormously valuable. It is also exactly the kind of work that IT services providers have always done well.

You just need to stop thinking of yourself as an infrastructure company. You are an expertise company that sometimes operates infrastructure. The distinction matters, because it changes what you invest in, what you hire for, what you sell, and how you price it.

The companies that internalize this shift early will thrive. They will capture the advisory relationships, build the delivery muscle, and establish the client references that create compounding competitive advantage. The ones that keep trying to sell compute — that keep looking for a way to mark up GPU hours the way they once marked up virtual machines — will find the margins thinner every year and the competition more intense.

Start This Week

The best strategic plan is worthless without the first step. And the first step is small enough to take this week.

Pick one AI project. It does not need to be for a client — it can be for your own company. Build a chatbot over your internal wiki. Automate the weekly status report that nobody enjoys writing. Set up a document classifier for your support tickets. Give two engineers API keys and a week to experiment.

The project will be imperfect. The first prototype will be clumsy. The model will hallucinate something embarrassing. The integration will break in a way that makes everyone laugh. Good. That is learning. That is what month one looks like. And month eighteen looks nothing like month one — but only if month one actually happens.

The window for building AI capability in the EU IT services market is open right now. The technology is accessible. The client demand is real and growing. The regulatory environment, for once, favours local providers who understand European requirements. And the competition, while moving, has not yet locked up the market.

Eighteen months from now, you will either be a provider with AI capabilities, client references, and a growing pipeline — or you will be a provider watching from the sidelines as your competitors capture the relationships that used to be yours.

The math is clear. The path is laid out. The first step is yours to take.

End of The Token Economics: A Strategic Guide for EU IT Services Providers Navigating GenAI

The Token Economics

A Strategic Guide for EU IT Services Providers Navigating GenAI

About This Booklet

Who This Booklet Is For

How to Read This Booklet

Table of Contents

Chapter 1: The GenAI Moment for IT Services Providers

A Business That Made Perfect Sense

Why GenAI Breaks the Middleman Model

Two Kinds of Clients, Two Different Realities

The Previous Playbook Does Not Transfer (For Half Your Clients)

What the client actually needs now

The Five-Year Question

There Are Viable Paths — But Not the Old One

What This Booklet Will Show You

Chapter 2: How Large Language Models Actually Run

Parameters, Precision, and Memory

Precision Formats

Memory Math for Real Models

The KV Cache: Where Concurrent Users Hit You

The Math Gets Serious at Scale

Throughput: Tokens Per Second Per User

GPU Hardware: A Practical Comparison

Concrete Configurations: What Serves 100 Users

Configuration 1: 120B Model for 100 Users

Configuration 2: 20B Model for 100 Users

Serving Software: The Engine Room

Key Serving Frameworks

Three Techniques That Matter

What This Means for Your Infrastructure Business

Chapter 3: The Cost Equation — Self-Hosting vs API at Every Scale

Self-Hosting a 120B Parameter Model (100 Users)

Compute Requirements

The Monthly Bill

Per-User Reality

Self-Hosting a 20B Parameter Model at Different Scales

10 Users — Dedicated Per-Customer Deployment

100 Users — Shared Departmental Deployment

500 Users — Business Unit or Mid-Size Enterprise

1,000 Users — Large Enterprise or Multi-Tenant Platform

API Pricing Landscape (April 2026)

Frontier and Mid-Tier Model Pricing

GPU Rental Prices (April 2026)

API Cost Analysis: 100 Users Scenario

Assumptions

API Cost at Different Price Points

Self-Hosting Cost for the Same 100 Users

The Comparison

The Complete Scale Comparison

Self-Hosted 20B Model vs API — Monthly Total Cost

Self-Hosted 20B Model vs API — Per-User Monthly Cost

The Crossover Chart

The On-Premises Economics: A Different Comparison

What It Costs a Client to Do It Themselves

What You Can Charge as a Managed Service

The Market Size Question

The Combined Picture

What These Numbers Do Not Capture

Practical Implications for Your Pricing Strategy

Chapter 4: Why Hyperscalers Win on Price

Custom Silicon: The Single Biggest Factor

Utilization Rates: The Economics of an Empty GPU at 3am

Model-Specific Optimizations: The Engineering Gap

Scale Amortization: The Marginal Cost of One More User

Strategic Underpricing: The Land Grab

What Things Actually Cost

The Compounding Effect

When the Hyperscaler Advantage Does Not Matter

The Uncomfortable Conclusion (For Cloud-Comfortable Clients)

Chapter 5: The Vendor Ecosystem Play

The Vendor AI Landscape in 2026

Microsoft Copilot for Microsoft 365

SAP Joule / Business AI

ServiceNow Now Assist / AI Agents

GitHub Copilot

Google Gemini for Workspace

VMware Cloud Foundation (Broadcom)

The Comparison at a Glance

The Real Economics: Licenses Are the Anchor, Services Are the Profit

The Strategic Risk — Said Honestly