Open-Weight Model Families & Model Selection
A Decision Framework for On-Prem Inference
April 2026
By Robert Barcik
LearningDoe s.r.o.
About This Booklet
You have GPUs. You have a model running in production. It works. Tickets get resolved, code gets reviewed, documents get summarised. Then someone on your team opens HuggingFace, sees a new model with better benchmark scores, and asks: should we switch?
You do not know. Not because you are uninformed, but because the question is unanswerable without context. Switch for which task? On which GPU? At what quantisation? Serving how many users? Under what license? The model is one variable in a system of constraints, and the leaderboard answers none of them.
This booklet does not tell you which model to pick today. By the time you read this, the rankings will have shifted. What it gives you is the machinery to answer the question yourself — this time and every time after. Part 1 maps the five major families: who builds them, why, and what that means for you. Part 2 connects the landscape to your actual hardware. Part 3 gives you a decision framework and four scenarios to stress-test it.
The numbers are current as of April 2026. The framework should outlast them.
Who This Is For
- Operations and services professionals deploying AI on company hardware
- IT teams managing GPU infrastructure for internal or client workloads
- Technical decision-makers evaluating which models to run on-prem
- Anyone deploying LLMs on datacenter or workstation GPUs who needs to make model choices without a dedicated ML research team
How to Read
Read in order for the full picture. If you already know the model families and just need the practical selection guidance, skip to Part 2. If you want the decision framework and exercise, jump to Part 3 — but you will get more from it having read Parts 1 and 2 first.
Table of Contents
Part 1: Model Families
- The Landscape in 2026
- Family by Family
- Beyond LLMs
Part 2: Practical Selection
- The Hardware Reality
- What Fits Where
- Quantization
- Inference Frameworks
- Memory & Throughput
Part 3: Decision Framework
- The Decision Framework
- Scenario Exercise
- What to Watch
Why Not Just a Comparison Table?
In December 2024, the safe recommendation was straightforward: Llama 3.3 70B, run it on an H100, done. Six months later, Qwen 3 had overtaken it on most benchmarks. Six months after that, Gemma 4 arrived with Apache 2.0 licensing and Mistral Small 4 rewrote what a 6B-active-parameter model could do. Anyone who built their strategy around a December 2024 comparison table spent 2025 on the wrong model.
The problem is not that comparison tables are wrong when published. The problem is that they are right for about eight weeks. The mean engagement duration for any model on HuggingFace is approximately six weeks before it is superseded. Building your model strategy around a snapshot is like choosing a server vendor based on last quarter’s benchmark — the number was accurate, the decision was not.
What does not change every eight weeks: your hardware constraints, your use case requirements, the licensing implications, and the inference cost structure. A team that understands these variables can read a model card on release day and reach a defensible decision by end of day. A team that waits for someone else’s recommendation table is always two months behind.
The goal: When the next model drops, you evaluate it yourself — against your hardware, your constraints, your use cases — without waiting for someone else’s table.
The Landscape in 2026
Eighteen months ago, if someone asked you to run an LLM on your own hardware, the answer was some variant of Llama. Today, five organisations compete to be the default choice, and their models have converged on a licensing model that would have seemed unlikely in 2024: genuine permissive open licensing.
The Families
Five families dominate the open-weight landscape: Llama (Meta), Gemma (Google DeepMind), Qwen (Alibaba Cloud), Mistral (Mistral AI, Paris), and Phi (Microsoft). Between them, they cover parameter counts from 600 million to 680 billion, every major modality, and nearly every viable on-prem deployment scenario.
These families are more stable reference points than specific model versions. A version (Llama 3.3, Qwen 3.5) might be superseded in months. The family behind it — its philosophy, its licensing approach, its community ecosystem, its preferred architecture — evolves more slowly and tells you more about where your investment will end up.
Two Architecture Eras
Understanding one distinction will help you make sense of everything that follows: dense versus Mixture of Experts (MoE).
A dense model activates all its parameters for every token. A 70B dense model does 70 billion computations per token and needs enough memory to hold all 70 billion parameters. Straightforward.
A MoE model has a large total parameter count but activates only a fraction per token. Mistral Small 4 has 119B total parameters but activates only ~6B per token. You get frontier-quality outputs at small-model inference cost — but you still need enough memory to store all 119B parameters. MoE trades memory for compute efficiency.
Every frontier open-weight release since mid-2025 uses MoE: DeepSeek R1, Qwen 3.5, Mistral Large 3, Llama 4, Gemma 4’s 26B variant. Dense models (27–32B) remain the sweet spot for single-GPU deployment, but the trend is clear.
The Licensing Convergence
The biggest practical shift of the past year is not a model — it is a license. Apache 2.0 has become the default for new open-weight releases. Gemma 4, Qwen 3/3.5, Mistral’s current lineup, and DeepSeek V3 all use it. Microsoft’s Phi uses MIT (even more permissive). The only major holdout is Meta’s Llama, which uses a custom community license with branding requirements, a 700M monthly-active-user cap, and — critically for EU-based teams — restrictions on multimodal models in the EU.
For an IT services company, this means that four of the five families impose zero commercial restrictions on your use. You can deploy them for clients, embed them in products, fine-tune them, and redistribute derivatives without legal friction. That was not the case 18 months ago.
Family by Family
Do not read this section as a catalog. Read it as context for decisions you will make in Part 3. For each family, the question that matters is: under what circumstances would I choose this one? The corporate strategy behind each is a single sentence — enough to understand their commitment to open weights. The rest is what you need for model selection.
Specific model versions (Llama 3.3, Gemma 4, Qwen 3.5) will be superseded. The families, their licensing philosophies, and the “when to choose” criteria are more stable — those are what to retain.
Meta Llama — the incumbent you already know
If you are running open-weight models today, you are probably running Llama. That is its greatest strength — ecosystem depth, battle-tested tooling, the most derivative models on HuggingFace (~85K) — and increasingly its only differentiator, as competitors match or exceed its quality.
Llama 3.3 70B remains the practical workhorse: 405B-class quality through distillation, fits on a single H100 at FP8 (~70GB), runs with every inference engine. Llama 4 (April 2025) pivoted to multimodal MoE — Scout (109B total, 17B active, 10M context) fits on a single H100 at INT4, but the ecosystem around Llama 4 has been slower to mature than Llama 3.x.
The risk: Meta released Muse Spark in April 2026 — their first proprietary model. The community reads this as a potential signal that future frontier releases may not be open. Meta open-weights to prevent platform lock-in by other AI providers; if that strategic calculus changes, so does Llama’s future.
- You value ecosystem maturity over cutting-edge benchmarks — every tool, tutorial, and fine-tune assumes Llama compatibility
- License caveat: Not Apache 2.0. Requires “Built with Llama” branding. EU entities cannot use Llama 4 multimodal models (text-only models are fine)
- Fits on: Llama 3.3 70B on H100 at FP8. Llama 4 Scout at Q4 on H100 or DGX Spark
Google Gemma — distilled Gemini, now with a real license
Gemma models are distilled from Google’s closed Gemini research, which gives them outsized capability per parameter. The move to Apache 2.0 with Gemma 4 (April 2026) removed the licensing friction that had held back enterprise adoption of earlier versions — this was a direct competitive response to Qwen and Mistral.
The 26B MoE is the standout: it activates only 3.8B parameters per token yet delivers near-31B dense quality (#6 on Arena AI). The 31B dense model ranks #3 on Arena AI among all open models. Specialised variants exist for medical, translation, security, and embeddings — Google is building a complete task-specific portfolio, which matters if you want models that work together.
- You want the best quality at 27–31B with a clean Apache 2.0 license and no geopolitical complications
- Speed caveat: Inference is slower than competitors — the 26B MoE ran at ~11 tok/s where Qwen 3.5 hit 60+ on the same hardware. Community engine optimisations are catching up
- License trap: Gemma 1–3 used a restrictive custom license. Verify you are using Gemma 4
- Fits on: 31B dense or 26B MoE on H100 or DGX Spark at Q4 (~18–20GB)
Alibaba Qwen — technically best, geopolitically complicated
By the numbers, Qwen wins. Most-forked family on HuggingFace (180K+ derivatives). Widest size range (0.6B to 480B). 201+ languages versus Llama’s ~8. Strongest reasoning model at the 32B tier (QwQ-32B). Most aggressive release cadence. Qwen 3.5 introduced Gated DeltaNet — linear attention with constant memory complexity, a genuine architectural step beyond standard transformers. Alibaba open-weights to drive cloud revenue; their AI business has grown at triple-digit rates for eight consecutive quarters.
The complication is not technical. China’s National Intelligence Law obligates all Chinese companies to cooperate with government intelligence operations. The EU AI Act requires training data transparency — Qwen provides none. For EU teams: always self-host on EU infrastructure (never Alibaba Cloud API for GDPR data), avoid high-risk AI Act categories without legal review, and treat Qwen as a strong default for non-sensitive workloads where its technical lead justifies the compliance conversation.
- You need multilingual, coding, or reasoning capability and the workload is not in a regulated/sensitive category
- You need a size the others do not offer — only Qwen covers 0.6B to 480B
- Compliance overhead: Chinese origin means extra due diligence for EU regulated, defence, and public sector workloads
- Fits on: QwQ-32B or Qwen 3 8B/14B on DGX Spark. Qwen 3 235B MoE on two H200s at FP8
Mistral AI — the EU sovereignty option
The only major AI lab headquartered in the EU. Full EU jurisdiction, no US CLOUD Act exposure. The French Ministry of Armed Forces signed a 2026–2030 deployment framework. France and Germany have established joint public administration AI frameworks using Mistral. If your client’s procurement requires EU-origin software, Mistral may be the only choice that passes legal review.
Mistral Small 4 (March 2026) is architecturally clever: 119B total, ~6B active, with a configurable reasoning_effort parameter that lets one model serve as both a fast chatbot and a step-by-step reasoner. Ministral 3 covers the compact tier (3B/8B/14B). Licensing lesson: Mistral went from Apache 2.0 (original 7B) to non-commercial (Codestral) to research-only (Large 2) and back to Apache 2.0 after community backlash. Always check the license on the specific version, not the family reputation.
- EU data sovereignty is a hard requirement — regulated industries, public sector, defence
- You want a single model that flexes between fast chat and deep reasoning (Mistral Small 4’s reasoning_effort parameter)
- HW requirement: Mistral Small 4 needs 2×H200 or 4×H100. Ministral 3 8B/14B fits anywhere
- Watch for: Mistral Compute (their own EU cloud, 18K Blackwell chips, launching 2026)
Microsoft Phi — the specialist that ignores general knowledge
Phi is not trying to be a general-purpose assistant. It is a reasoning engine trained on synthetic “textbook quality” data that deliberately removes factual content to preserve reasoning capacity. The result: Phi-4-reasoning-plus (14B) outperforms DeepSeek-R1 (671B) on AIME math problems. Phi-4-multimodal (5.6B) tops the HuggingFace speech recognition leaderboard. Phi-4-mini (3.8B) runs on CPU. Microsoft open-weights Phi to drive ONNX Runtime adoption and serve as the developer on-ramp to Azure.
The trade-off is explicit. TriviaQA scores are low. The 16K context window on base Phi-4 rules out large-document tasks. Multilingual is weak (primarily English). Agentic capability is “very limited.” If you need Phi for knowledge-heavy work, pair it with RAG.
- Your task is math, code, or scientific reasoning and you want the smallest model that handles it
- You need something that runs on CPU or minimal hardware (Phi-4-mini at 3.8B)
- License: MIT — zero restrictions of any kind. The cleanest legal option
- Not for: General chat, multilingual, long-context, agentic workflows
Licensing at a Glance
| Family | License | Commercial use | Key restriction |
|---|---|---|---|
| Qwen 3+ | Apache 2.0 | Unrestricted | Chinese origin → compliance overhead for regulated workloads |
| Mistral (current) | Apache 2.0 | Unrestricted | Devstral 2: $20M/month revenue gate |
| Gemma 4 | Apache 2.0 | Unrestricted | Gemma 1–3 used restrictive license — check version |
| Phi | MIT | Unrestricted | None |
| Llama | Meta Community | Under 700M MAU | “Built with Llama” branding; EU multimodal restriction |
Beyond LLMs: When Smaller Models Win
The default instinct — “just use the LLM for everything” — is one of the most expensive mistakes an operations team can make. An LLM is a generalist. For defined, repeatable tasks, a specialised small model will be faster, cheaper, and often more accurate.
Consider a ticket routing task. A fine-tuned DistilBERT (66M parameters) achieves 95%+ accuracy at 2–5ms latency. A 70B LLM achieves 85–92% accuracy at 500ms–5s latency. At a million classifications per month, the cost difference is roughly $50 versus $5,000–50,000 in GPU infrastructure.
Here is what the specialised model landscape looks like:
- Embedding models (BGE-M3, Qwen3-Embedding, Nomic Embed) — convert text to vectors for semantic search and RAG retrieval. 100–600M parameters. Essential for any RAG pipeline. e5-small (118M) achieved 100% Top-5 retrieval accuracy, outperforming an 8B embedding model at 16ms versus 195ms
- NER / entity extraction (GLiNER, 100–350M) — zero-shot Named Entity Recognition. Define entity types at runtime (server names, error codes, IPs) without retraining. Outperforms ChatGPT on NER benchmarks
- Classifiers (DistilBERT, 66M) — ticket routing, sentiment analysis, priority detection. 2–5ms latency
- Rerankers (BGE-reranker-v2-m3, 568M) — second-stage scoring after embedding retrieval. Improves RAG quality by up to 48%. Reranking 50 documents takes ~1.5 seconds
- Speech recognition (Whisper large-v3-turbo, 809M) — 5.4× faster than Whisper large-v3 with near-equivalent accuracy. Self-hostable on a single GPU
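What the embedding-plus-reranker steps buy you is ranking by semantic similarity rather than keyword overlap. Here is a toy sketch of the retrieval step, using 3-dimensional stand-ins for real embedding vectors (a model like BGE-M3 emits roughly 1,000-dimensional ones); the documents and vectors are invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-d "embeddings"; a real embedding model emits ~1,000-d vectors
query = [0.9, 0.1, 0.0]
docs = {
    "how to reset a password": [0.8, 0.2, 0.1],
    "printer is jammed":       [0.0, 0.1, 0.9],
}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # "how to reset a password"
```

A reranker then re-scores only the top handful of retrieved candidates with a more expensive model, which is why it can afford ~1.5 seconds for 50 documents.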
The Right-Sizing Principle
The optimal architecture uses each model for what it does best. A practical IT operations pipeline:
| Step | Model type | Params | Latency | Purpose |
|---|---|---|---|---|
| 1. Classify | DistilBERT | 66M | 3ms | Route ticket, detect priority |
| 2. Extract | GLiNER | 100–350M | 10ms | Server names, error codes, IPs |
| 3. Embed | BGE-M3 | 568M | 20ms | Semantic search for similar incidents |
| 4. Rerank | bge-reranker | 110M | 50ms | Refine top-k results |
| 5. Generate | LLM (7B–70B) | 7–70B | 500ms+ | Human-readable resolution summary |
Steps 1–4 use approximately 1.2GB VRAM total and complete in under 100ms. The LLM in step 5 is called only when needed, with pre-filtered context, reducing LLM costs by 80%+ while improving answer quality.
The principle: Do not use a 70B model for a task a 300M model handles better. Reserve the LLM for reasoning and generation. Use specialised models for classification, extraction, search, and ranking.
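The pipeline table reads as control flow. Below is a minimal orchestration sketch; the stub functions are illustrative stand-ins for the real models (DistilBERT, GLiNER, BGE-M3, a reranker, an LLM), and only the pattern matters: cheap steps first, the LLM last and optional.

```python
def classify(text):      # stand-in for a 66M DistilBERT classifier
    return "network" if "vpn" in text.lower() else "general"

def extract(text):       # stand-in for GLiNER entity extraction
    return [w for w in text.split() if w.isupper()]

def search(text, k=50):  # stand-in for BGE-M3 embedding retrieval
    return [f"incident-{i}" for i in range(k)]

def rerank(text, docs):  # stand-in for a cross-encoder reranker
    return docs

def handle_ticket(text, llm=None):
    result = {
        "category": classify(text),                 # step 1, ~3ms
        "entities": extract(text),                  # step 2, ~10ms
        "similar": rerank(text, search(text))[:5],  # steps 3-4
    }
    # Step 5: the LLM runs only when generation is needed, and only
    # on the pre-filtered context produced by steps 1-4
    if llm is not None:
        result["summary"] = llm(text, context=result["similar"])
    return result

print(handle_ticket("VPN tunnel to SRV01 is down"))
```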
The Hardware Reality
Before you pick a model, you need to know what your hardware can actually do. Not what the marketing says — what happens when you load a model and start generating tokens.
Two numbers determine your experience: memory capacity (can the model fit?) and memory bandwidth (how fast can it generate?). The second one is less intuitive but more important for user experience.
- H100 (80GB): 3.35 TB/s memory bandwidth
- H200 (141GB): 4.8 TB/s memory bandwidth
- DGX Spark (128GB): 273 GB/s memory bandwidth
Why Bandwidth Matters More Than You Think
During token generation, the GPU must stream the entire model from memory for every single token. The formula is simple:
tokens/second ≈ memory bandwidth ÷ model size in bytes
This is why the DGX Spark, despite having 128GB of memory (more than an H100), generates tokens 12× slower than the H100 for models that fit on both. The Spark’s 273 GB/s bandwidth is just 8% of the H100’s 3.35 TB/s. It can load enormous models. It cannot run them fast.
The practical threshold for interactive use is roughly >10 tokens/second. Below 5 tok/s, users will find the experience frustrating. This directly constrains which models are viable for which hardware — not just whether they fit, but whether they are usable.
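The formula turns into a one-line sanity check. Here is a sketch using the bandwidth figures from this section; treat the result as a ceiling, since real throughput is lower after framework overhead:

```python
def decode_tok_per_s(bandwidth_gb_s: float, params_b: float,
                     bytes_per_weight: float) -> float:
    """Upper bound on decode speed: memory bandwidth / model size in bytes.
    Real throughput is lower; this is a ceiling, not a prediction."""
    model_gb = params_b * bytes_per_weight
    return bandwidth_gb_s / model_gb

# 70B dense at Q4 (~35 GB) on DGX Spark (273 GB/s) vs H100 (3,350 GB/s)
print(round(decode_tok_per_s(273, 70, 0.5), 1))   # 7.8 tok/s ceiling
print(round(decode_tok_per_s(3350, 70, 0.5)))     # 96 tok/s ceiling
```

The Spark result sits below the 10 tok/s interactive threshold before any overhead is counted, which is why 70B dense models are not viable there.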
The VRAM Budget
A model’s memory footprint is not just its weights. You also need room for the KV cache (which grows with context length and concurrent users) and framework overhead.
The base formula: weight memory = parameters × bytes per weight. At FP16 (2 bytes per parameter), a 70B model needs ~140GB just for weights. At Q4 (0.5 bytes), it drops to ~35GB. The 1.3–1.5× multiplier accounts for KV cache and overhead at moderate concurrency.
The KV cache deserves special attention. For Llama 3.1 70B at BF16, it consumes roughly 0.31 MB per token per request. At 8K context with 4 concurrent users, that is ~10GB — manageable. At 128K context with 4 users, it is ~160GB — exceeding the model weights. Long context and high concurrency are the hidden VRAM killers.
What Fits Where
Here is the table that matters. For each model size, it shows the VRAM required at different precision levels and whether it fits on your three hardware targets. “Fits” means loads with room for KV cache at moderate usage. “Tight” means it loads but leaves minimal headroom.
The VRAM column values are durable — they follow from the formula (params × bytes). The specific model names will rotate; the size tiers and memory math will not.
| Model size | FP16 | FP8 / INT8 | Q4 / INT4 | H100 (80GB) | DGX Spark (128GB) | H200 (141GB) |
|---|---|---|---|---|---|---|
| 3–4B (Phi-4-mini, Ministral 3B) | ~7 GB | ~4 GB | ~2 GB | Fits easily | Fits easily | Fits easily |
| 7–8B (Llama 3.1 8B, Qwen 3 8B) | ~16 GB | ~8 GB | ~4.5 GB | Fits easily | Fits easily | Fits easily |
| 14B (Phi-4, Qwen 3 14B, Ministral 14B) | ~28 GB | ~14 GB | ~8 GB | Fits | Fits | Fits |
| 27–32B (Gemma 4 31B, QwQ-32B, Qwen 3 32B) | ~64 GB | ~32 GB | ~18 GB | FP8: tight. Q4: fits | FP16: fits. FP8: fits | Fits at any precision |
| 70B (Llama 3.3 70B) | ~140 GB | ~70 GB | ~35 GB | FP8: tight. Q4: fits | Q4: fits but slow (~4–8 tok/s) | FP8: fits with headroom |
| 109B MoE (Llama 4 Scout, 17B active) | ~218 GB | ~109 GB | ~55 GB | Q4: fits | Q4: fits, fast (MoE) | FP8: fits |
| 119B MoE (Mistral Small 4, 6B active) | ~238 GB | ~119 GB | ~60 GB | Q4: tight | Q4: fits, fast (MoE) | FP8: fits |
| 235B MoE (Qwen 3 235B, 22B active) | ~470 GB | ~235 GB | ~118 GB | Does not fit | NVFP4: tight (or 2× Spark) | 2×H200 at FP8 |
The rule of thumb: multiply billion parameters by 2 to get GB at FP16. Divide by 2 for FP8, by 4 for Q4. Then multiply by 1.3–1.5× for practical deployment with KV cache headroom.
Key insight: MoE models are the DGX Spark’s best friend. A 120B MoE that activates only 20B parameters per token runs at 40–55 tok/s on the Spark — because the bandwidth bottleneck scales with active parameters, not total. A 70B dense model on the same hardware crawls at 4–8 tok/s. If you are deploying on the Spark, think MoE first.
Quantization: The Compression Trade-off
Quantization reduces the precision of model weights to fit larger models in less memory and run them faster. The question operations teams care about: how much quality do you actually lose?
The short answer: at 4-bit, almost nothing. Below 4-bit, a lot.
- FP16 / BF16 — use when: you have the VRAM and need maximum quality (fine-tuning, quality-critical production)
- FP8 / INT8 — use when: production GPU serving. This is the default for H100/H200 deployments
- Q4 / INT4 — use when: you need a larger model to fit on limited hardware. The quality loss is imperceptible for most operational tasks
- Q2–Q3 — use when: emergency only; the model absolutely must fit and nothing else will work
The Format Zoo, Simplified
Five quantization formats matter in practice. Do not worry about the rest.
- GGUF (Q4_K_M, Q5_K_M, etc.) — the universal format. Works on CPU, NVIDIA, AMD, Apple Silicon. Single self-contained file. Your default for Ollama and local deployment
- AWQ — activation-aware 4-bit for GPU production serving. Excellent vLLM integration via Marlin kernels. Your default for GPU production with vLLM or SGLang
- GPTQ — the only 4-bit format that supports LoRA adapters in vLLM. Use it when you need multi-LoRA production deployments
- NVFP4 — NVIDIA’s Blackwell-native FP4 format. Native Tensor Core execution on DGX Spark, no dequantisation step. Your default on DGX Spark via TensorRT-LLM
- EXL2 — mixed-precision, best quality-per-bit. NVIDIA-only, single-user. Use for personal development on RTX GPUs
Where to Get Pre-Quantised Models
You almost never need to quantise yourself. On HuggingFace, look for bartowski and Unsloth for GGUF models (every quant level, importance-matrix calibrated). Check the nvidia/ namespace for NVFP4 models. Model publishers increasingly provide official AWQ/GPTQ variants.
When evaluating pre-quantised models: prefer K-quants (Q4_K_M) over legacy quants (Q4_0), check download counts for community validation, and verify the upload date — quantisation of a brand-new model may have bugs in the first week.
Inference Frameworks
You have picked a model, chosen a quantisation level, and confirmed it fits your hardware. Now: what software do you use to actually run it?
The right answer depends on one question: how many concurrent users do you need to serve?
The Five That Matter
- Ollama — the 30-second start. Run ollama run llama3.3 and you are serving. Docker-style model management, OpenAI-compatible API. Pre-installed on DGX Spark. Limitation: no continuous batching. At 50 concurrent users, time-to-first-token climbs to 3,200ms versus 145ms on vLLM. Use for 1–5 users
- vLLM — the production workhorse. PagedAttention eliminates 60–80% of KV cache memory waste. Continuous batching handles concurrent requests efficiently. Prometheus metrics, health endpoints, Helm charts for Kubernetes. Powers production at Stripe, Meta, Mistral AI. Use for 5+ concurrent users
- SGLang — vLLM’s newer challenger. RadixAttention automatically reuses shared prefixes across requests (75–90% cache hit rates for multi-turn chat). 29% higher throughput than vLLM on batch inference. Use when your workload is dominated by multi-turn conversations or RAG pipelines
- TensorRT-LLM — maximum NVIDIA performance. 15–30% above vLLM in matched benchmarks. Native NVFP4 support, EAGLE-3 speculative decoding. Cost: requires compiling a TensorRT engine per GPU architecture (~28 min for 70B). Use when maximum throughput on NVIDIA hardware justifies the setup investment
- llama.cpp — CPU and hybrid workloads. The C/C++ engine behind Ollama. Hybrid CPU+GPU offloading via --n-gpu-layers N. Use for edge, air-gapped, or always-on workloads where latency is not critical
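One practical consequence of this lineup: Ollama, vLLM, and SGLang all expose an OpenAI-compatible HTTP endpoint, so client code stays portable and only the base URL and model name change. A stdlib-only sketch; the URLs and model names in the comments are placeholders for your own deployment, not recommendations:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    """OpenAI-compatible chat payload, accepted by Ollama/vLLM/SGLang."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}]}

def chat(base_url: str, model: str, prompt: str) -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Same client against different engines (placeholder URLs/models):
# chat("http://localhost:11434", "llama3.3", "Hello")       # Ollama
# chat("http://localhost:8000", "Qwen/Qwen3-32B", "Hello")  # vLLM
```

This portability is what makes the "prototype on Ollama, graduate to vLLM" path cheap: the serving layer changes, the application code does not.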
The Decision in Practice
| Your scenario | Use this | Why |
|---|---|---|
| Getting started / prototyping | Ollama | 30-second install, zero configuration |
| DGX Spark quick start | Ollama | Pre-installed, officially partnered |
| DGX Spark maximum performance | SGLang or TensorRT-LLM | Leverages NVFP4 and Blackwell kernels |
| Multi-user production API | vLLM | Continuous batching, mature monitoring |
| Multi-turn chat / RAG | SGLang | 75–90% prefix cache hit rate |
| Maximum throughput (H100/H200) | SGLang or TensorRT-LLM | 29% (SGLang) to 30% (TRT-LLM) above vLLM |
| CPU-only or hybrid | llama.cpp | The only real option for mixed compute |
| Minimum ops overhead | vLLM | Richest Prometheus metrics, Helm charts |
One note on HuggingFace TGI: it entered maintenance mode in December 2025. HuggingFace now officially recommends vLLM or SGLang for new deployments. If you see TGI mentioned in tutorials, know that it is a legacy recommendation.
Memory & Throughput Realities
A model that “fits” is not the same as a model that “performs.” Here is what your hardware actually delivers at the token level.
Single-User Decode Speed
Snapshot data — April 2026. These numbers change with firmware updates, framework versions, and new model architectures. The formula (bandwidth ÷ model size) is durable; the specific tok/s values are not. Re-benchmark when your stack changes.
This is the number that determines whether a model feels interactive. Measured in tokens per second during generation:
| Model | Quant | H100 | H200 (est.) | DGX Spark |
|---|---|---|---|---|
| 8B (Llama 3.1 8B) | FP8 | ~250–300 | ~360–430 | ~20–34 |
| 8B | NVFP4 | — | — | ~38–40 |
| 14B (Qwen 3 14B) | NVFP4 | ~180 | ~260 | ~22 |
| 27–32B (Gemma 3 27B) | Q4 | ~80–110 | ~120–160 | ~9–11 |
| 70B (Llama 3.3 70B) | Q4 | ~60–96 | ~90–137 | ~4–8 |
| 70B | FP8 | ~35–48 | ~48–69 | ~2.7–4 |
| 120B MoE (GPT-OSS, 20B active) | MXFP4 | — | — | ~41–55 |
The pattern is clear. On the DGX Spark:
- 8B–14B dense models at 20–40 tok/s — comfortable for interactive use
- 27–32B dense models at 9–11 tok/s — borderline interactive
- MoE models up to 120B at 40–55 tok/s — comfortable, because bandwidth scales with active parameters
- 70B dense at 4–8 tok/s — not viable for interactive use
The H200 Advantage
Moving from H100 to H200 gives you a consistent ~1.4× speedup from the bandwidth increase (4.8 versus 3.35 TB/s). But the bigger win is capacity: Llama 70B at FP8 (~84GB with overhead) fits on one H200 but needs two H100s with tensor parallelism. That eliminates inter-GPU communication overhead and effectively doubles your serving capacity per model.
Batching Changes Everything
The single-user numbers above tell only half the story. Batched inference shifts the bottleneck from bandwidth to compute. Llama 3.1 8B on the DGX Spark at batch size 32 achieved 368 tok/s total throughput via SGLang — an 18× increase over single-user. If your use case is a queue of requests rather than an interactive chat, smaller models batched efficiently can handle surprisingly high volumes.
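The trade-off is easy to quantify: aggregate throughput rises roughly with batch size while per-request speed falls. Using this section's Spark numbers (approximate, taking the low end of the single-user range):

```python
single_user = 20.0   # tok/s, 8B at FP8 on the Spark (low end of range)
batch_total = 368.0  # tok/s aggregate at batch size 32 via SGLang
batch_size = 32

print(round(batch_total / single_user))    # ~18x aggregate gain
print(round(batch_total / batch_size, 1))  # ~11.5 tok/s per request
```

Each individual request is slower than a solo one, but still above the interactive threshold, while total volume jumps an order of magnitude. That is exactly the profile a queued, overnight, or API-style workload wants.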
Practical guidance: An H100 will run circles around the DGX Spark for any model that fits in 80GB. The Spark’s value is running models that do not fit on an H100 (120B MoE, prototyping 200B+) and serving as a quiet, desk-side development environment. When a model graduates from prototyping to production, move it to datacenter GPUs behind vLLM.
The Decision Framework
When a new model drops, do not start with the benchmarks. Start with your constraints. The following sequence will get you to a defensible choice in under an hour.
Six Questions to Ask Every Time
1. What is the task? Chat, code generation, document summarisation, RAG, translation, classification, entity extraction? If the task is narrow and repeatable (classification, NER, routing), check whether a specialised small model handles it better than an LLM
2. What hardware do I have? Calculate your VRAM ceiling. Use the formula: params × bytes per weight × 1.3–1.5. This eliminates models that cannot fit before you waste time evaluating them
3. How many concurrent users? 1 user? Ollama is fine. 5–20 users? You need vLLM or SGLang with continuous batching. 50+? You need either a smaller model, more hardware, or both
4. What quality floor is acceptable? Can you tolerate Q4 quality loss (~3–5%)? For internal tools and prototyping, almost always yes. For client-facing quality-critical output, you may want FP8 or higher
5. What licensing constraints apply? Is this for a regulated client? EU public sector? Defence? Mistral gives you EU sovereignty. Llama’s branding requirement may be a problem. Qwen requires a compliance conversation for sensitive workloads
6. What does the ecosystem look like? How many fine-tuned variants exist on HuggingFace? Is the model well-supported by your chosen inference framework? A technically superior model with no community tooling is a risky choice
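The hardware question is mechanical enough to script as a pre-filter. A toy sketch using the booklet's formula; the candidate list and the 1.4× headroom factor are illustrative:

```python
BYTES_PER_WEIGHT = {"FP16": 2.0, "FP8": 1.0, "Q4": 0.5}

def fits(params_b: float, precision: str, vram_gb: float,
         headroom: float = 1.4) -> bool:
    """params x bytes/weight x 1.3-1.5 headroom must fit in VRAM."""
    return params_b * BYTES_PER_WEIGHT[precision] * headroom <= vram_gb

candidates = [("Phi-4 14B", 14), ("Qwen 3 32B", 32), ("Llama 3.3 70B", 70)]
viable = [name for name, p in candidates if fits(p, "FP8", 80)]
print(viable)  # ['Phi-4 14B', 'Qwen 3 32B']
```

Note that 70B at FP8 fails the headroom test on a single 80GB H100, which matches the "FP8: tight" entry in the fit table; with no headroom factor it would squeak in, which is exactly what "tight" means.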
When to Re-Evaluate
You do not need to re-evaluate every time a new model drops. Re-evaluate when:
- Your hardware changes (new GPUs, new memory capacity)
- Your use case requirements change (new language, higher concurrency, new modality)
- A new model family appears (not a version update — a genuinely new player)
- Your current model’s family changes licensing (check release notes, not just headlines)
- You see a >20% improvement on your specific task in independent benchmarks (not leaderboard games — your actual evaluation set)
Version updates within a family (Qwen 3 → Qwen 3.5) are worth a quick test on your evaluation set. They are not worth a full re-architecture.
Scenario Exercise
Four scenarios. For each, use the decision framework to choose a model, quantisation level, and inference framework. Justify your choice. Then reveal the suggested approach and compare.
There is no single correct answer — the point is whether your reasoning follows the framework and whether you can defend your choice against the constraints.
Your client wants an on-prem chatbot for their internal IT helpdesk. The bot should answer common questions about password resets, VPN setup, printer configuration, and software installation. It needs to handle 10 concurrent users during peak hours. Conversations are in English and German. The client insists on EU data residency.
Available hardware: DGX Spark (128GB unified memory)
Your task: Choose a model, quantisation level, and inference framework. Justify each choice.
Suggested model: Mistral Small 4 (119B MoE, ~6B active) at Q4
Why this model: The task is conversational helpdesk — it needs good instruction following, multilingual capability (English + German), and reasonable speed for 10 concurrent users. Mistral Small 4 activates only ~6B parameters per token despite 119B total, so on the DGX Spark it will deliver decent MoE throughput. At Q4, it needs ~60GB of the 128GB — leaving room for KV cache. Mistral’s EU origin satisfies the data residency requirement cleanly.
Why not alternatives: Qwen 3 14B would be faster on the Spark (~22 tok/s at NVFP4) but the Chinese origin creates unnecessary compliance friction given the EU residency requirement. Llama 3.3 70B at Q4 fits but at 4–8 tok/s single-user it is too slow for 10 concurrent users. Gemma 4 31B at Q4 (~18GB) would be lighter and fast enough at ~15 tok/s — a legitimate alternative, especially if Mistral Small 4’s MoE throughput proves insufficient on the Spark.
Framework: SGLang — the multi-turn chat workload benefits from RadixAttention prefix caching. Ollama would struggle at 10 concurrent users (no continuous batching). vLLM is also viable here.
Fallback: If Mistral Small 4 is too slow for 10 users on the Spark, drop to Gemma 4 31B dense at Q4 or Qwen 3 14B at FP8. Test both and measure actual throughput at target concurrency before committing.
Your development team wants an on-prem code review assistant. It should review pull requests, suggest improvements, and flag potential security issues. Single-user at a time is acceptable. The codebase is primarily Python and TypeScript. Some PRs are large — up to 32K tokens of context needed.
Available hardware: H100 (80GB)
Your task: Choose a model, quantisation level, and inference framework. Justify each choice.
Suggested model: Qwen 3 32B at FP8, or QwQ-32B at FP8
Why this model: Code review is a reasoning-heavy task that benefits from larger models. At FP8, a 32B model needs ~32GB + overhead — fits comfortably on the H100 with ~40GB free for KV cache at 32K context. Qwen 3 is the strongest coding family at this size range, and QwQ-32B adds chain-of-thought reasoning that helps with security analysis. Both support 128K+ context windows, more than enough for 32K token PRs.
Why not alternatives: Llama 3.3 70B at FP8 (~70GB) would fit on the H100 but leaves only ~10GB for KV cache, which at 32K context might cause problems. Phi-4 14B excels at code but its 16K context window is too short. Devstral 2 (123B dense) is purpose-built for code, but even at Q4 its weights alone take roughly 62–74GB, leaving too little KV-cache headroom for 32K-token PRs on an 80GB card.
Framework: vLLM with FP8 — single-user so no batching pressure, but vLLM’s PagedAttention efficiently manages the large KV cache at 32K context. Ollama would also work fine for single-user.
Quantisation: FP8 — with 80GB available and a 32B model, there is no reason to drop below FP8. You keep ~99% quality and have headroom for the context window.
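The KV-cache headroom claim can be sanity-checked with the standard formula: 2 (K and V) × layers × KV heads × head dim × bytes per element, per token. A sketch — the architecture numbers below are illustrative assumptions for a 32B-class model with grouped-query attention, not confirmed specs for any particular release:

```python
def kv_bytes_per_token(layers: int, kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """KV cache bytes per token: a K and a V vector for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Assumed (illustrative) 32B-class GQA shape: 64 layers, 8 KV heads,
# head_dim 128. FP8 KV cache = 1 byte per element.
per_token = kv_bytes_per_token(64, 8, 128, 1)
ctx_gb = per_token * 32_768 / 1e9
print(per_token)  # 131072 bytes, ~0.13 MB per token
print(ctx_gb)     # ~4.3 GB for one full 32K-token context
```

With ~40GB free after weights, a single 32K context costing ~4GB leaves an order-of-magnitude margin — which is why FP8 is the right call here rather than dropping to Q4.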
A client needs to extract structured data from scanned invoices — vendor name, invoice number, line items, amounts, tax, and total. The documents are PDFs that have been OCR’d. Processing happens overnight in batch — no real-time requirement. Volume: 5,000 documents per night. The output must be structured JSON.
Available hardware: DGX Spark (128GB unified memory)
Your task: Choose the right approach. This one might not need an LLM at all — or it might.
Suggested approach: pipeline of specialised models, not a single LLM
Step 1 — NER extraction: Use GLiNER (100–350M) to extract entity types directly: vendor name, invoice number, amounts, dates. Define entity types at runtime. Runs on CPU at ~10ms per document. For 5,000 docs: ~50 seconds total.
Step 2 — Structured output: For invoices where GLiNER’s extraction is clean (likely 70–85% of documents), map directly to JSON. No LLM needed.
Step 3 — LLM fallback: For the remaining 15–30% where extraction is ambiguous (multi-page invoices, unusual formats), use Qwen 3 8B at NVFP4 on the DGX Spark to parse the full text and produce structured JSON. At ~38 tok/s single-user, generating ~200 output tokens for each of roughly 1,500 fallback documents takes about 2 hours of batch processing.
Why not just use the LLM for everything: 5,000 documents × (~500 input + ~200 output tokens each) ≈ 3.5M tokens, of which ~1M are generated output. At ~38 tok/s single-user decode on the Spark, the output tokens alone take 7+ hours for an 8B model, before prefill and per-request overhead. With the pipeline approach, the 70–85% of clean documents finish in under a minute, and the LLM handles only the hard cases. Total processing time drops to ~2–3 hours.
Framework: GLiNER runs standalone (pip install, no GPU needed). For the LLM fallback, use Ollama for simplicity (batch queue, no concurrent users) or vLLM with batching for faster throughput.
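The routing between Step 2 and Step 3 can be a few lines of Python. A sketch under stated assumptions: the entity shape matches what GLiNER's `predict_entities` returns, but the required field names and the confidence threshold are hypothetical values to tune on your own invoices:

```python
# Assumed required fields and threshold -- tune both on real documents.
REQUIRED = {"vendor", "invoice_number", "total"}
MIN_SCORE = 0.8

def route(entities: list[dict]) -> tuple[str, dict]:
    """Decide per document: map GLiNER entities straight to JSON,
    or flag the document for the LLM fallback.

    `entities` is a list of {"label": ..., "text": ..., "score": ...}
    dicts, the shape GLiNER's predict_entities produces.
    """
    confident = {
        e["label"]: e["text"] for e in entities if e["score"] >= MIN_SCORE
    }
    if REQUIRED <= confident.keys():
        return "direct", confident        # clean extraction -> JSON now
    return "llm_fallback", confident      # ambiguous -> queue for the LLM

# A clean document goes straight through without touching the GPU:
decision, fields = route([
    {"label": "vendor", "text": "Acme GmbH", "score": 0.95},
    {"label": "invoice_number", "text": "INV-2041", "score": 0.91},
    {"label": "total", "text": "1,180.00 EUR", "score": 0.88},
])
print(decision)  # direct
```

Everything routed to `"llm_fallback"` goes into the overnight batch queue; everything else is already structured JSON after Step 2.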
Your client was using a hosted API running Llama 3.3 70B for a customer-facing Q&A system. The API provider just announced they are shutting down in one week. The client needs to self-host a replacement immediately. The system handles ~20 concurrent users during business hours. Responses must be in English. Quality cannot noticeably degrade — customers are used to the current output quality.
Available hardware: H100 (80GB)
Your task: What do you deploy, and what do you tell the client about the trade-offs?
Suggested model: Llama 3.3 70B at Q4 (GGUF Q4_K_M or AWQ 4-bit)
Why the same model: The requirement is “quality cannot noticeably degrade.” The fastest way to guarantee that is to run the same model. Llama 3.3 70B at Q4 needs ~46GB total, fitting on the H100 with ~34GB headroom for KV cache and batching.
Why Q4 and not FP8: At FP8, the ~70GB of weights plus runtime overhead comes to ~84GB, which exceeds the 80GB card before any KV cache is allocated for 20 concurrent users. At Q4, you have the headroom. The ~3–5% quality loss from Q4 is less noticeable than the latency degradation from running out of KV cache space.
Framework: vLLM with AWQ quantisation — this is not optional at 20 concurrent users. Ollama would queue requests and create unacceptable wait times. vLLM with continuous batching and PagedAttention can serve 20 users at ~60–96 tok/s system throughput at Q4. Enable FP8 KV cache (kv_cache_dtype="fp8") to double your KV cache capacity.
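The claim that an FP8 KV cache gives 20 users workable context on the remaining ~34GB can be checked with the same per-token arithmetic used elsewhere in this booklet. Llama 3.3 70B uses grouped-query attention with 80 layers, 8 KV heads, and head dim 128:

```python
def kv_bytes_per_token(layers: int, kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """KV cache bytes per token: a K and a V vector for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Llama 3.3 70B GQA shape: 80 layers, 8 KV heads, head_dim 128.
fp16 = kv_bytes_per_token(80, 8, 128, 2)   # FP16 cache: 327,680 B/token
fp8 = kv_bytes_per_token(80, 8, 128, 1)    # FP8 cache: half of that

headroom_gb = 34            # 80 GB card minus ~46 GB for Q4 weights
tokens = headroom_gb * 1e9 / fp8
print(int(tokens))          # ~207,000 cacheable tokens total
print(int(tokens / 20))     # ~10,000 tokens of context per concurrent user
```

~10K tokens of context per user is comfortable for customer Q&A; at FP16 cache it would halve to ~5K, which is why `kv_cache_dtype="fp8"` is worth enabling here.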
What to tell the client:
- Output quality will be ~95–97% of what they had (Q4 quantisation). Most users will not notice
- Latency per response will likely be similar or slightly slower than the hosted API, depending on their previous provider’s hardware
- They are now responsible for uptime, monitoring, and model updates. Set up Prometheus metrics from vLLM’s /metrics endpoint and NVIDIA DCGM Exporter for GPU monitoring
- If you upgrade to H200 later, migrate to FP8 for full quality restoration and better concurrent user handling
Timeline: vLLM + AWQ Llama 3.3 70B can be serving in production within a day. The pre-quantised AWQ model is available on HuggingFace. The remaining time is for testing, monitoring setup, and load testing at 20 concurrent users.
What to Watch
The landscape changes fast. Here is what to monitor so you know when to re-evaluate.
Signals That Matter
- New model releases from the five families. Not every release requires action. Check: is it a new architecture or just a version bump? Does it target a size range you use? Does it change the licensing? A new Qwen 4 with a different license matters. A Qwen 3.5.1 patch does not
- Hardware changes. When new hardware arrives — H200s, next-gen Blackwell, a second DGX Spark — re-run the “what fits” analysis. Models that were too large or too slow may suddenly become viable. Two H200s (282GB combined) open up the 235B MoE tier
- Quantisation breakthroughs. Recent advances in 2-bit quantisation (like QuIP# and AQLM) are narrowing the quality gap below Q4. If 2-bit reaches Q4-level quality, it roughly doubles the parameter count you can fit on existing hardware
- Inference framework updates. SGLang and vLLM release roughly monthly. Major features (better MoE support, new quantisation backends, speculative decoding improvements) can change throughput by 30–50%
- Licensing changes. Track the specific model version’s license, not the family’s reputation. Mistral went from Apache 2.0 to restrictive and back. Gemma went from restrictive to Apache 2.0. The next shift could go either direction for any family
- EU regulatory developments. The EU AI Act is now enforceable. Watch for guidance on training data transparency requirements (affects Qwen), high-risk AI system classifications, and any new restrictions on model deployment in regulated sectors
Where to Look
- Chatbot Arena (LMArena.ai) — the most reliable quality ranking. 6M+ blind human A/B votes. More trustworthy than any benchmark leaderboard
- HuggingFace model pages — check download counts, derivative models, and community discussions for practical deployment experience
- r/LocalLLaMA — the fastest signal for new model quality, quantisation issues, and inference framework bugs
- Model family blogs/release notes — Mistral blog, Google AI blog, Qwen GitHub, Meta Llama blog, Microsoft Research blog
- Your own evaluation set — the most important signal. Build a set of 50–100 representative prompts from your actual use cases and evaluate new models against it. No external benchmark replaces this
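A minimal scoring harness for such an evaluation set might look like the sketch below. The contains-a-key-phrase metric is a deliberately crude, assumed stand-in for whatever quality check fits your use case, and the prompts and phrases are toy examples:

```python
def score(outputs: dict[str, str], expected: dict[str, list[str]]) -> float:
    """Fraction of prompts whose model output contains at least one
    expected key phrase. Crude, but consistent across models, which
    is what matters for A-vs-B comparisons."""
    hits = sum(
        any(phrase.lower() in outputs.get(pid, "").lower()
            for phrase in phrases)
        for pid, phrases in expected.items()
    )
    return hits / len(expected)

# Toy evaluation set -- replace with 50-100 prompts from real tickets,
# PRs, or documents, and keep it fixed between model comparisons.
expected = {
    "vpn-01": ["split tunnel", "company VPN"],
    "pw-02": ["self-service portal"],
}
outputs = {
    "vpn-01": "Enable split tunnel mode in the VPN client settings...",
    "pw-02": "Contact the helpdesk to reset it.",
}
print(score(outputs, expected))  # 0.5
```

Run every candidate model over the same fixed set and compare scores; a new release only earns a migration when it beats the incumbent on your prompts, not on a leaderboard.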
The open-weight landscape in April 2026 offers more options, better licensing, and more capable models than at any point in the history of accessible AI. Five families compete fiercely. Apache 2.0 is the norm. A single H100 runs models that would have required a data centre three years ago.
But the models are not the hard part. The hard part is matching the right model to the right task on the right hardware with the right trade-offs — and having the confidence to make that call yourself when the next release drops.
That is what the framework is for. Use it.
Robert Barcik · barcik.training · LearningDoe s.r.o. · April 2026