Open-Weight Model Families & Model Selection
A Decision Framework for On-Prem Inference
April 2026
By Robert Barcik
LearningDoe s.r.o.
About This Booklet
You have GPUs. You have a model running in production. It works. Tickets get resolved, code gets reviewed, documents get summarised. Then someone on your team opens HuggingFace, sees a new model with better benchmark scores, and asks: should we switch?
You do not know. Not because you are uninformed, but because the question is unanswerable without context. Switch for which task? On which GPU? At what quantisation? Serving how many users? Under what license? The model is one variable in a system of constraints, and the leaderboard answers none of them.
This booklet does not tell you which model to pick today. By the time you read this, the rankings will have shifted. What it gives you is the machinery to answer the question yourself — this time and every time after. Part 1 maps the five major families: who builds them, why, and what that means for you. Part 2 connects the landscape to your actual hardware. Part 3 gives you a decision framework and four scenarios to stress-test it.
The numbers are current as of April 2026. The framework should outlast them.
Who This Is For
- Operations and services professionals deploying AI on company hardware
- IT teams managing GPU infrastructure for internal or client workloads
- Technical decision-makers evaluating which models to run on-prem
- Anyone deploying LLMs on datacenter or workstation GPUs who needs to make model choices without a dedicated ML research team
How to Read
Read in order for the full picture. If you already know the model families and just need the practical selection guidance, skip to Part 2. If you want the decision framework and exercise, jump to Part 3 — but you will get more from it having read Parts 1 and 2 first.
Table of Contents
Part 1: Model Families
- The Landscape in 2026
- Family by Family
- Beyond LLMs
Part 2: Practical Selection
- The Hardware Reality
- What Fits Where
- Quantization
- Inference Frameworks
- Memory & Throughput
Part 3: Decision Framework
- The Decision Framework
- Scenario Exercise
- What to Watch
Why Not Just a Comparison Table?
In December 2024, the safe recommendation was straightforward: Llama 3.3 70B, run it on an H100, done. Six months later, Qwen 3 had overtaken it on most benchmarks. Six months after that, Gemma 4 arrived with Apache 2.0 licensing and Mistral Small 4 rewrote what a 6B-active-parameter model could do. Anyone who built their strategy around a December 2024 comparison table spent 2025 on the wrong model.
The problem is not that comparison tables are wrong when published. The problem is that they are right for about eight weeks. The mean engagement duration for any model on HuggingFace is approximately six weeks before it is superseded. Building your model strategy around a snapshot is like choosing a server vendor based on last quarter’s benchmark — the number was accurate, the decision was not.
What does not change every eight weeks: your hardware constraints, your use case requirements, the licensing implications, and the inference cost structure. A team that understands these variables can read a model card on release day and reach a defensible decision by end of day. A team that waits for someone else’s recommendation table is always two months behind.
The goal: When the next model drops, you evaluate it yourself — against your hardware, your constraints, your use cases — without waiting for someone else’s table.
The Landscape in 2026
Eighteen months ago, if someone asked you to run an LLM on your own hardware, the answer was some variant of Llama. Today, five organisations compete to be the default choice, and their models have converged on a licensing model that would have seemed unlikely in 2024: genuine permissive open licensing.
The Families
Five families dominate the open-weight landscape: Llama (Meta), Gemma (Google DeepMind), Qwen (Alibaba Cloud), Mistral (Mistral AI, Paris), and Phi (Microsoft). Between them, they cover parameter counts from 600 million to 680 billion, every major modality, and nearly every viable on-prem deployment scenario.
These families are more stable reference points than specific model versions. A version (Llama 3.3, Qwen 3.5) might be superseded in months. The family behind it — its philosophy, its licensing approach, its community ecosystem, its preferred architecture — evolves more slowly and tells you more about where your investment will end up.
Two Architecture Eras
Understanding one distinction will help you make sense of everything that follows: dense versus Mixture of Experts (MoE).
A dense model activates all its parameters for every token. A 70B dense model does 70 billion computations per token and needs enough memory to hold all 70 billion parameters. Straightforward.
A MoE model has a large total parameter count but activates only a fraction per token. Mistral Small 4 has 119B total parameters but activates only ~6B per token. You get frontier-quality outputs at small-model inference cost — but you still need enough memory to store all 119B parameters. MoE trades memory for compute efficiency.
Every frontier open-weight release since mid-2025 uses MoE: DeepSeek R1, Qwen 3.5, Mistral Large 3, Llama 4, Gemma 4’s 26B variant. Dense models (27–32B) remain the sweet spot for single-GPU deployment, but the trend is clear.
The Licensing Convergence
The biggest practical shift of the past year is not a model — it is a license. Apache 2.0 has become the default for new open-weight releases. Gemma 4, Qwen 3/3.5, Mistral’s current lineup, and DeepSeek V3 all use it. Microsoft’s Phi uses MIT (even more permissive). The only major holdout is Meta’s Llama, which uses a custom community license with branding requirements, a 700M monthly-active-user cap, and — critically for EU-based teams — restrictions on multimodal models in the EU.
For an IT services company, this means that four of the five families impose zero commercial restrictions on your use. You can deploy them for clients, embed them in products, fine-tune them, and redistribute derivatives without legal friction. That was not the case 18 months ago.
Family by Family
Do not read this section as a catalog. Read it as context for decisions you will make in Part 3. For each family, the question that matters is: under what circumstances would I choose this one? The corporate strategy behind each is a single sentence — enough to understand their commitment to open weights. The rest is what you need for model selection.
Specific model versions (Llama 3.3, Gemma 4, Qwen 3.5) will be superseded. The families, their licensing philosophies, and the “when to choose” criteria are more stable — those are what to retain.
Meta Llama — the incumbent you already know
If you are running open-weight models today, you are probably running Llama. That is its greatest strength — ecosystem depth, battle-tested tooling, the most derivative models on HuggingFace (~85K) — and increasingly its only differentiator, as competitors match or exceed its quality.
Llama 3.3 70B remains the practical workhorse: 405B-class quality through distillation, fits on a single H100 at FP8 (~70GB), runs with every inference engine. Llama 4 (April 2025) pivoted to multimodal MoE — Scout (109B total, 17B active, 10M context) fits on a single H100 at INT4, but the ecosystem around Llama 4 has been slower to mature than Llama 3.x.
The risk: Meta released Muse Spark in April 2026 — their first proprietary model. The community reads this as a potential signal that future frontier releases may not be open. Meta open-weights to prevent platform lock-in by other AI providers; if that strategic calculus changes, so does Llama’s future.
- You value ecosystem maturity over cutting-edge benchmarks — every tool, tutorial, and fine-tune assumes Llama compatibility
- License caveat: Not Apache 2.0. Requires “Built with Llama” branding. EU entities cannot use Llama 4 multimodal models (text-only models are fine)
- Fits on: Llama 3.3 70B on H100 at FP8. Llama 4 Scout at Q4 on H100 or DGX Spark
Google Gemma — distilled Gemini, now with a real license
Gemma models are distilled from Google’s closed Gemini research, which gives them outsized capability per parameter. The move to Apache 2.0 with Gemma 4 (April 2026) removed the licensing friction that had held back enterprise adoption of earlier versions — this was a direct competitive response to Qwen and Mistral.
The 26B MoE is the standout: it activates only 3.8B parameters per token yet delivers near-31B dense quality (#6 on Arena AI). The 31B dense model ranks #3 on Arena AI among all open models. Specialised variants exist for medical, translation, security, and embeddings — Google is building a complete task-specific portfolio, which matters if you want models that work together.
- You want the best quality at 27–31B with a clean Apache 2.0 license and no geopolitical complications
- Speed caveat: Inference is slower than competitors — the 26B MoE ran at ~11 tok/s where Qwen 3.5 hit 60+ on the same hardware. Community engine optimisations are catching up
- License trap: Gemma 1–3 used a restrictive custom license. Verify you are using Gemma 4
- Fits on: 31B dense or 26B MoE on H100 or DGX Spark at Q4 (~18–20GB)
Alibaba Qwen — technically best, geopolitically complicated
By the numbers, Qwen wins. Most-forked family on HuggingFace (180K+ derivatives). Widest size range (0.6B to 480B). 201+ languages versus Llama’s ~8. Strongest reasoning model at the 32B tier (QwQ-32B). Most aggressive release cadence. Qwen 3.5 introduced Gated DeltaNet — linear attention with constant memory complexity, a genuine architectural step beyond standard transformers. Alibaba open-weights to drive cloud revenue; their AI business has grown at triple-digit rates for eight consecutive quarters.
The complication is not technical. China’s National Intelligence Law obligates all Chinese companies to cooperate with government intelligence operations. The EU AI Act requires training data transparency — Qwen provides none. For EU teams: always self-host on EU infrastructure (never Alibaba Cloud API for GDPR data), avoid high-risk AI Act categories without legal review, and treat Qwen as a strong default for non-sensitive workloads where its technical lead justifies the compliance conversation.
- You need multilingual, coding, or reasoning capability and the workload is not in a regulated/sensitive category
- You need a size the others do not offer — only Qwen covers 0.6B to 480B
- Compliance overhead: Chinese origin means extra due diligence for EU regulated, defence, and public sector workloads
- Fits on: QwQ-32B or Qwen 3 8B/14B on DGX Spark. Qwen 3 235B MoE on two H200s at FP8
Mistral AI — the EU sovereignty option
The only major AI lab headquartered in the EU. Full EU jurisdiction, no US CLOUD Act exposure. The French Ministry of Armed Forces signed a 2026–2030 deployment framework. France and Germany have established joint public administration AI frameworks using Mistral. If your client’s procurement requires EU-origin software, Mistral may be the only choice that passes legal review.
Mistral Small 4 (March 2026) is architecturally clever: 119B total, ~6B active, with a configurable reasoning_effort parameter that lets one model serve as both a fast chatbot and a step-by-step reasoner. Ministral 3 covers the compact tier (3B/8B/14B). Licensing lesson: Mistral went from Apache 2.0 (original 7B) to non-commercial (Codestral) to research-only (Large 2) and back to Apache 2.0 after community backlash. Always check the license on the specific version, not the family reputation.
- EU data sovereignty is a hard requirement — regulated industries, public sector, defence
- You want a single model that flexes between fast chat and deep reasoning (Mistral Small 4’s reasoning_effort parameter)
- HW requirement: Mistral Small 4 needs 2×H200 or 4×H100. Ministral 3 8B/14B fits anywhere
- Watch for: Mistral Compute (their own EU cloud, 18K Blackwell chips, launching 2026)
Microsoft Phi — the specialist that ignores general knowledge
Phi is not trying to be a general-purpose assistant. It is a reasoning engine trained on synthetic “textbook quality” data that deliberately removes factual content to preserve reasoning capacity. The result: Phi-4-reasoning-plus (14B) outperforms DeepSeek-R1 (671B) on AIME math problems. Phi-4-multimodal (5.6B) tops the HuggingFace speech recognition leaderboard. Phi-4-mini (3.8B) runs on CPU. Microsoft open-weights Phi to drive ONNX Runtime adoption and serve as the developer on-ramp to Azure.
The trade-off is explicit. TriviaQA scores are low. The 16K context window on base Phi-4 rules out large-document tasks. Multilingual is weak (primarily English). Agentic capability is “very limited.” If you need Phi for knowledge-heavy work, pair it with RAG.
- Your task is math, code, or scientific reasoning and you want the smallest model that handles it
- You need something that runs on CPU or minimal hardware (Phi-4-mini at 3.8B)
- License: MIT — zero restrictions of any kind. The cleanest legal option
- Not for: General chat, multilingual, long-context, agentic workflows
Licensing at a Glance
| Family | License | Commercial use | Key restriction |
|---|---|---|---|
| Qwen 3+ | Apache 2.0 | Unrestricted | Chinese origin → compliance overhead for regulated workloads |
| Mistral (current) | Apache 2.0 | Unrestricted | Devstral 2: $20M/month revenue gate |
| Gemma 4 | Apache 2.0 | Unrestricted | Gemma 1–3 used restrictive license — check version |
| Phi | MIT | Unrestricted | None |
| Llama | Meta Community | Under 700M MAU | “Built with Llama” branding; EU multimodal restriction |
Beyond LLMs: When Smaller Models Win
The default instinct — “just use the LLM for everything” — is one of the most expensive mistakes an operations team can make. An LLM is a generalist. For defined, repeatable tasks, a specialised small model will be faster, cheaper, and often more accurate.
Consider a ticket routing task. A fine-tuned DistilBERT (66M parameters) achieves 95%+ accuracy at 2–5ms latency. A 70B LLM achieves 85–92% accuracy at 500ms–5s latency. At a million classifications per month, the cost difference is roughly $50 versus $5,000–50,000 in GPU infrastructure.
Here is what the specialised model landscape looks like:
- Embedding models (BGE-M3, Qwen3-Embedding, Nomic Embed) — convert text to vectors for semantic search and RAG retrieval. 100–600M parameters. Essential for any RAG pipeline. e5-small (118M) achieved 100% Top-5 retrieval accuracy, outperforming an 8B embedding model at 16ms versus 195ms
- NER / entity extraction (GLiNER, 100–350M) — zero-shot Named Entity Recognition. Define entity types at runtime (server names, error codes, IPs) without retraining. Outperforms ChatGPT on NER benchmarks
- Classifiers (DistilBERT, 66M) — ticket routing, sentiment analysis, priority detection. 2–5ms latency
- Rerankers (BGE-reranker-v2-m3, 568M) — second-stage scoring after embedding retrieval. Improves RAG quality by up to 48%. Reranking 50 documents takes ~1.5 seconds
- Speech recognition (Whisper large-v3-turbo, 809M) — 5.4× faster than Whisper large-v3 with near-equivalent accuracy. Self-hostable on a single GPU
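What the embedding-plus-reranker steps buy you is ranking by semantic similarity rather than keyword overlap. Here is a toy sketch of the retrieval step, using 3-dimensional stand-ins for real embedding vectors (a model like BGE-M3 emits roughly 1,000-dimensional ones); the documents and vectors are invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-d "embeddings"; a real embedding model emits ~1,000-d vectors
query = [0.9, 0.1, 0.0]
docs = {
    "how to reset a password": [0.8, 0.2, 0.1],
    "printer is jammed":       [0.0, 0.1, 0.9],
}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # "how to reset a password"
```

A reranker then re-scores only the top handful of retrieved candidates with a more expensive model, which is why it can afford ~1.5 seconds for 50 documents.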
The Right-Sizing Principle
The optimal architecture uses each model for what it does best. A practical IT operations pipeline:
| Step | Model type | Params | Latency | Purpose |
|---|---|---|---|---|
| 1. Classify | DistilBERT | 66M | 3ms | Route ticket, detect priority |
| 2. Extract | GLiNER | 100–350M | 10ms | Server names, error codes, IPs |
| 3. Embed | BGE-M3 | 568M | 20ms | Semantic search for similar incidents |
| 4. Rerank | bge-reranker | 110M | 50ms | Refine top-k results |
| 5. Generate | LLM (7B–70B) | 7–70B | 500ms+ | Human-readable resolution summary |
Steps 1–4 use approximately 1.2GB VRAM total and complete in under 100ms. The LLM in step 5 is called only when needed, with pre-filtered context, reducing LLM costs by 80%+ while improving answer quality.
The principle: Do not use a 70B model for a task a 300M model handles better. Reserve the LLM for reasoning and generation. Use specialised models for classification, extraction, search, and ranking.
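The pipeline table reads as control flow. Below is a minimal orchestration sketch; the stub functions are illustrative stand-ins for the real models (DistilBERT, GLiNER, BGE-M3, a reranker, an LLM), and only the pattern matters: cheap steps first, the LLM last and optional.

```python
def classify(text):      # stand-in for a 66M DistilBERT classifier
    return "network" if "vpn" in text.lower() else "general"

def extract(text):       # stand-in for GLiNER entity extraction
    return [w for w in text.split() if w.isupper()]

def search(text, k=50):  # stand-in for BGE-M3 embedding retrieval
    return [f"incident-{i}" for i in range(k)]

def rerank(text, docs):  # stand-in for a cross-encoder reranker
    return docs

def handle_ticket(text, llm=None):
    result = {
        "category": classify(text),                 # step 1, ~3ms
        "entities": extract(text),                  # step 2, ~10ms
        "similar": rerank(text, search(text))[:5],  # steps 3-4
    }
    # Step 5: the LLM runs only when generation is needed, and only
    # on the pre-filtered context produced by steps 1-4
    if llm is not None:
        result["summary"] = llm(text, context=result["similar"])
    return result

print(handle_ticket("VPN tunnel to SRV01 is down"))
```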
The Hardware Reality
Before you pick a model, you need to know what your hardware can actually do. Not what the marketing says — what happens when you load a model and start generating tokens.
Two numbers determine your experience: memory capacity (can the model fit?) and memory bandwidth (how fast can it generate?). The second one is less intuitive but more important for user experience.
- H100 (80GB): 3.35 TB/s memory bandwidth
- H200 (141GB): 4.8 TB/s memory bandwidth
- DGX Spark (128GB): 273 GB/s memory bandwidth
Why Bandwidth Matters More Than You Think
During token generation, the GPU must stream the entire model from memory for every single token. The formula is simple:
tokens/second ≈ memory bandwidth ÷ model size in bytes
This is why the DGX Spark, despite having 128GB of memory (more than an H100), generates tokens 12× slower than the H100 for models that fit on both. The Spark’s 273 GB/s bandwidth is just 8% of the H100’s 3.35 TB/s. It can load enormous models. It cannot run them fast.
The practical threshold for interactive use is roughly >10 tokens/second. Below 5 tok/s, users will find the experience frustrating. This directly constrains which models are viable for which hardware — not just whether they fit, but whether they are usable.
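The formula turns into a one-line sanity check. Here is a sketch using the bandwidth figures from this section; treat the result as a ceiling, since real throughput is lower after framework overhead:

```python
def decode_tok_per_s(bandwidth_gb_s: float, params_b: float,
                     bytes_per_weight: float) -> float:
    """Upper bound on decode speed: memory bandwidth / model size in bytes.
    Real throughput is lower; this is a ceiling, not a prediction."""
    model_gb = params_b * bytes_per_weight
    return bandwidth_gb_s / model_gb

# 70B dense at Q4 (~35 GB) on DGX Spark (273 GB/s) vs H100 (3,350 GB/s)
print(round(decode_tok_per_s(273, 70, 0.5), 1))   # 7.8 tok/s ceiling
print(round(decode_tok_per_s(3350, 70, 0.5)))     # 96 tok/s ceiling
```

The Spark result sits below the 10 tok/s interactive threshold before any overhead is counted, which is why 70B dense models are not viable there.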
The VRAM Budget
A model’s memory footprint is not just its weights. You also need room for the KV cache (which grows with context length and concurrent users) and framework overhead.
The base formula: weight memory = parameters × bytes per weight. At FP16 (2 bytes per parameter), a 70B model needs ~140GB just for weights. At Q4 (0.5 bytes), it drops to ~35GB. The 1.3–1.5× multiplier accounts for KV cache and overhead at moderate concurrency.
The KV cache deserves special attention. For Llama 3.1 70B at BF16, it consumes roughly 0.31 MB per token per request. At 8K context with 4 concurrent users, that is ~10GB — manageable. At 128K context with 4 users, it is ~160GB — exceeding the model weights. Long context and high concurrency are the hidden VRAM killers.
What Fits Where
Here is the table that matters. For each model size, it shows the VRAM required at different precision levels and whether it fits on your three hardware targets. “Fits” means loads with room for KV cache at moderate usage. “Tight” means it loads but leaves minimal headroom.
The VRAM column values are durable — they follow from the formula (params × bytes). The specific model names will rotate; the size tiers and memory math will not.
| Model size | FP16 | FP8 / INT8 | Q4 / INT4 | H100 (80GB) | DGX Spark (128GB) | H200 (141GB) |
|---|---|---|---|---|---|---|
| 3–4B (Phi-4-mini, Ministral 3B) | ~7 GB | ~4 GB | ~2 GB | Fits easily | Fits easily | Fits easily |
| 7–8B (Llama 3.1 8B, Qwen 3 8B) | ~16 GB | ~8 GB | ~4.5 GB | Fits easily | Fits easily | Fits easily |
| 14B (Phi-4, Qwen 3 14B, Ministral 14B) | ~28 GB | ~14 GB | ~8 GB | Fits | Fits | Fits |
| 27–32B (Gemma 4 31B, QwQ-32B, Qwen 3 32B) | ~64 GB | ~32 GB | ~18 GB | FP8: tight. Q4: fits | FP16: fits. FP8: fits | Fits at any precision |
| 70B (Llama 3.3 70B) | ~140 GB | ~70 GB | ~35 GB | FP8: tight. Q4: fits | Q4: fits but slow (~4–8 tok/s) | FP8: fits with headroom |
| 109B MoE (Llama 4 Scout, 17B active) | ~218 GB | ~109 GB | ~55 GB | Q4: fits | Q4: fits, fast (MoE) | FP8: fits |
| 119B MoE (Mistral Small 4, 6B active) | ~238 GB | ~119 GB | ~60 GB | Q4: tight | Q4: fits, fast (MoE) | FP8: fits |
| 235B MoE (Qwen 3 235B, 22B active) | ~470 GB | ~235 GB | ~118 GB | Does not fit | NVFP4: tight (or 2× Spark) | 2×H200 at FP8 |
The rule of thumb: multiply billion parameters by 2 to get GB at FP16. Divide by 2 for FP8, by 4 for Q4. Then multiply by 1.3–1.5× for practical deployment with KV cache headroom.
Key insight: MoE models are the DGX Spark’s best friend. A 120B MoE that activates only 20B parameters per token runs at 40–55 tok/s on the Spark — because the bandwidth bottleneck scales with active parameters, not total. A 70B dense model on the same hardware crawls at 4–8 tok/s. If you are deploying on the Spark, think MoE first.
Quantization: The Compression Trade-off
Quantization reduces the precision of model weights to fit larger models in less memory and run them faster. The question operations teams care about: how much quality do you actually lose?
The short answer: at 4-bit, almost nothing. Below 4-bit, a lot.
- FP16 / BF16 — use when: you have the VRAM and need maximum quality (fine-tuning, quality-critical production)
- FP8 / INT8 — use when: production GPU serving. This is the default for H100/H200 deployments
- Q4 / INT4 — use when: you need a larger model to fit on limited hardware. The quality loss is imperceptible for most operational tasks
- Q2–Q3 — use when: emergency only; the model absolutely must fit and nothing else will work
The Format Zoo, Simplified
Five quantization formats matter in practice. Do not worry about the rest.
- GGUF (Q4_K_M, Q5_K_M, etc.) — the universal format. Works on CPU, NVIDIA, AMD, Apple Silicon. Single self-contained file. Your default for Ollama and local deployment
- AWQ — activation-aware 4-bit for GPU production serving. Excellent vLLM integration via Marlin kernels. Your default for GPU production with vLLM or SGLang
- GPTQ — the only 4-bit format that supports LoRA adapters in vLLM. Use it when you need multi-LoRA production deployments
- NVFP4 — NVIDIA’s Blackwell-native FP4 format. Native Tensor Core execution on DGX Spark, no dequantisation step. Your default on DGX Spark via TensorRT-LLM
- EXL2 — mixed-precision, best quality-per-bit. NVIDIA-only, single-user. Use for personal development on RTX GPUs
Where to Get Pre-Quantised Models
You almost never need to quantise yourself. On HuggingFace, look for bartowski and Unsloth for GGUF models (every quant level, importance-matrix calibrated). Check the nvidia/ namespace for NVFP4 models. Model publishers increasingly provide official AWQ/GPTQ variants.
When evaluating pre-quantised models: prefer K-quants (Q4_K_M) over legacy quants (Q4_0), check download counts for community validation, and verify the upload date — quantisation of a brand-new model may have bugs in the first week.
Inference Frameworks
You have picked a model, chosen a quantisation level, and confirmed it fits your hardware. Now: what software do you use to actually run it?
The right answer depends on one question: how many concurrent users do you need to serve?
The Five That Matter
- Ollama — the 30-second start. Run ollama run llama3.3 and you are serving. Docker-style model management, OpenAI-compatible API. Pre-installed on DGX Spark. Limitation: no continuous batching. At 50 concurrent users, time-to-first-token climbs to 3,200ms versus 145ms on vLLM. Use for 1–5 users
- vLLM — the production workhorse. PagedAttention eliminates 60–80% of KV cache memory waste. Continuous batching handles concurrent requests efficiently. Prometheus metrics, health endpoints, Helm charts for Kubernetes. Powers production at Stripe, Meta, Mistral AI. Use for 5+ concurrent users
- SGLang — vLLM’s newer challenger. RadixAttention automatically reuses shared prefixes across requests (75–90% cache hit rates for multi-turn chat). 29% higher throughput than vLLM on batch inference. Use when your workload is dominated by multi-turn conversations or RAG pipelines
- TensorRT-LLM — maximum NVIDIA performance. 15–30% above vLLM in matched benchmarks. Native NVFP4 support, EAGLE-3 speculative decoding. Cost: requires compiling a TensorRT engine per GPU architecture (~28 min for 70B). Use when maximum throughput on NVIDIA hardware justifies the setup investment
- llama.cpp — CPU and hybrid workloads. The C/C++ engine behind Ollama. Hybrid CPU+GPU offloading via --n-gpu-layers N. Use for edge, air-gapped, or always-on workloads where latency is not critical
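One practical consequence of this lineup: Ollama, vLLM, and SGLang all expose an OpenAI-compatible HTTP endpoint, so client code stays portable and only the base URL and model name change. A stdlib-only sketch; the URLs and model names in the comments are placeholders for your own deployment, not recommendations:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    """OpenAI-compatible chat payload, accepted by Ollama/vLLM/SGLang."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}]}

def chat(base_url: str, model: str, prompt: str) -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Same client against different engines (placeholder URLs/models):
# chat("http://localhost:11434", "llama3.3", "Hello")       # Ollama
# chat("http://localhost:8000", "Qwen/Qwen3-32B", "Hello")  # vLLM
```

This portability is what makes the "prototype on Ollama, graduate to vLLM" path cheap: the serving layer changes, the application code does not.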
The Decision in Practice
| Your scenario | Use this | Why |
|---|---|---|
| Getting started / prototyping | Ollama | 30-second install, zero configuration |
| DGX Spark quick start | Ollama | Pre-installed, officially partnered |
| DGX Spark maximum performance | SGLang or TensorRT-LLM | Leverages NVFP4 and Blackwell kernels |
| Multi-user production API | vLLM | Continuous batching, mature monitoring |
| Multi-turn chat / RAG | SGLang | 75–90% prefix cache hit rate |
| Maximum throughput (H100/H200) | SGLang or TensorRT-LLM | 29% (SGLang) to 30% (TRT-LLM) above vLLM |
| CPU-only or hybrid | llama.cpp | The only real option for mixed compute |
| Minimum ops overhead | vLLM | Richest Prometheus metrics, Helm charts |
One note on HuggingFace TGI: it entered maintenance mode in December 2025. HuggingFace now officially recommends vLLM or SGLang for new deployments. If you see TGI mentioned in tutorials, know that it is a legacy recommendation.
Memory & Throughput Realities
A model that “fits” is not the same as a model that “performs.” Here is what your hardware actually delivers at the token level.
Single-User Decode Speed
Snapshot data — April 2026. These numbers change with firmware updates, framework versions, and new model architectures. The formula (bandwidth ÷ model size) is durable; the specific tok/s values are not. Re-benchmark when your stack changes.
This is the number that determines whether a model feels interactive. Measured in tokens per second during generation:
| Model | Quant | H100 | H200 (est.) | DGX Spark |
|---|---|---|---|---|
| 8B (Llama 3.1 8B) | FP8 | ~250–300 | ~360–430 | ~20–34 |
| 8B | NVFP4 | — | — | ~38–40 |
| 14B (Qwen 3 14B) | NVFP4 | ~180 | ~260 | ~22 |
| 27–32B (Gemma 3 27B) | Q4 | ~80–110 | ~120–160 | ~9–11 |
| 70B (Llama 3.3 70B) | Q4 | ~60–96 | ~90–137 | ~4–8 |
| 70B | FP8 | ~35–48 | ~48–69 | ~2.7–4 |
| 120B MoE (GPT-OSS, 20B active) | MXFP4 | — | — | ~41–55 |
The pattern is clear. On the DGX Spark:
- 8B–14B dense models at 20–40 tok/s — comfortable for interactive use
- 27–32B dense models at 9–11 tok/s — borderline interactive
- MoE models up to 120B at 40–55 tok/s — comfortable, because bandwidth scales with active parameters
- 70B dense at 4–8 tok/s — not viable for interactive use
The H200 Advantage
Moving from H100 to H200 gives you a consistent ~1.4× speedup from the bandwidth increase (4.8 versus 3.35 TB/s). But the bigger win is capacity: Llama 70B at FP8 (~84GB with overhead) fits on one H200 but needs two H100s with tensor parallelism. That eliminates inter-GPU communication overhead and effectively doubles your serving capacity per model.
Batching Changes Everything
The single-user numbers above tell only half the story. Batched inference shifts the bottleneck from bandwidth to compute. Llama 3.1 8B on the DGX Spark at batch size 32 achieved 368 tok/s total throughput via SGLang — an 18× increase over single-user. If your use case is a queue of requests rather than an interactive chat, smaller models batched efficiently can handle surprisingly high volumes.
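The trade-off is easy to quantify: aggregate throughput rises roughly with batch size while per-request speed falls. Using this section's Spark numbers (approximate, taking the low end of the single-user range):

```python
single_user = 20.0   # tok/s, 8B at FP8 on the Spark (low end of range)
batch_total = 368.0  # tok/s aggregate at batch size 32 via SGLang
batch_size = 32

print(round(batch_total / single_user))    # ~18x aggregate gain
print(round(batch_total / batch_size, 1))  # ~11.5 tok/s per request
```

Each individual request is slower than a solo one, but still above the interactive threshold, while total volume jumps an order of magnitude. That is exactly the profile a queued, overnight, or API-style workload wants.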
Practical guidance: An H100 will run circles around the DGX Spark for any model that fits in 80GB. The Spark’s value is running models that do not fit on an H100 (120B MoE, prototyping 200B+) and serving as a quiet, desk-side development environment. When a model graduates from prototyping to production, move it to datacenter GPUs behind vLLM.
The Decision Framework
When a new model drops, do not start with the benchmarks. Start with your constraints. The following sequence will get you to a defensible choice in under an hour.
Six Questions to Ask Every Time
1. What is the task? Chat, code generation, document summarisation, RAG, translation, classification, entity extraction? If the task is narrow and repeatable (classification, NER, routing), check whether a specialised small model handles it better than an LLM
2. What hardware do I have? Calculate your VRAM ceiling. Use the formula: params × bytes per weight × 1.3–1.5. This eliminates models that cannot fit before you waste time evaluating them
3. How many concurrent users? 1 user? Ollama is fine. 5–20 users? You need vLLM or SGLang with continuous batching. 50+? You need either a smaller model, more hardware, or both
4. What quality floor is acceptable? Can you tolerate Q4 quality loss (~3–5%)? For internal tools and prototyping, almost always yes. For client-facing quality-critical output, you may want FP8 or higher
5. What licensing constraints apply? Is this for a regulated client? EU public sector? Defence? Mistral gives you EU sovereignty. Llama’s branding requirement may be a problem. Qwen requires a compliance conversation for sensitive workloads
6. What does the ecosystem look like? How many fine-tuned variants exist on HuggingFace? Is the model well-supported by your chosen inference framework? A technically superior model with no community tooling is a risky choice
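The hardware question is mechanical enough to script as a pre-filter. A toy sketch using the booklet's formula; the candidate list and the 1.4× headroom factor are illustrative:

```python
BYTES_PER_WEIGHT = {"FP16": 2.0, "FP8": 1.0, "Q4": 0.5}

def fits(params_b: float, precision: str, vram_gb: float,
         headroom: float = 1.4) -> bool:
    """params x bytes/weight x 1.3-1.5 headroom must fit in VRAM."""
    return params_b * BYTES_PER_WEIGHT[precision] * headroom <= vram_gb

candidates = [("Phi-4 14B", 14), ("Qwen 3 32B", 32), ("Llama 3.3 70B", 70)]
viable = [name for name, p in candidates if fits(p, "FP8", 80)]
print(viable)  # ['Phi-4 14B', 'Qwen 3 32B']
```

Note that 70B at FP8 fails the headroom test on a single 80GB H100, which matches the "FP8: tight" entry in the fit table; with no headroom factor it would squeak in, which is exactly what "tight" means.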
When to Re-Evaluate
You do not need to re-evaluate every time a new model drops. Re-evaluate when:
- Your hardware changes (new GPUs, new memory capacity)
- Your use case requirements change (new language, higher concurrency, new modality)
- A new model family appears (not a version update — a genuinely new player)
- Your current model’s family changes licensing (check release notes, not just headlines)
- You see a >20% improvement on your specific task in independent benchmarks (not leaderboard games — your actual evaluation set)
Version updates within a family (Qwen 3 → Qwen 3.5) are worth a quick test on your evaluation set. They are not worth a full re-architecture.
Scenario Exercise
Four scenarios. For each, use the decision framework to choose a model, quantisation level, and inference framework. Justify your choice. Then reveal the suggested approach and compare.
There is no single correct answer — the point is whether your reasoning follows the framework and whether you can defend your choice against the constraints.
Your client wants an on-prem chatbot for their internal IT helpdesk. The bot should answer common questions about password resets, VPN setup, printer configuration, and software installation. It needs to handle 10 concurrent users during peak hours. Conversations are in English and German. The client insists on EU data residency.
Available hardware: DGX Spark (128GB unified memory)
Your task: Choose a model, quantisation level, and inference framework. Justify each choice.
Suggested model: Mistral Small 4 (119B MoE, ~6B active) at Q4
Why this model: The task is conversational helpdesk — it needs good instruction following, multilingual capability (English + German), and reasonable speed for 10 concurrent users. Mistral Small 4 activates only ~6B parameters per token despite 119B total, so on the DGX Spark it will deliver decent MoE throughput. At Q4, it needs ~60GB of the 128GB — leaving room for KV cache. Mistral’s EU origin satisfies the data residency requirement cleanly.
Why not alternatives: Qwen 3 14B would be faster on the Spark (~22 tok/s at NVFP4) but the Chinese origin creates unnecessary compliance friction given the EU residency requirement. Llama 3.3 70B at Q4 fits but at 4–8 tok/s single-user it is too slow for 10 concurrent users. Gemma 4 31B at Q4 (~18GB) would be lighter and fast enough at ~15 tok/s — a legitimate alternative, especially if Mistral Small 4’s MoE throughput proves insufficient on the Spark.
Framework: SGLang — the multi-turn chat workload benefits from RadixAttention prefix caching. Ollama would struggle at 10 concurrent users (no continuous batching). vLLM is also viable here.
Fallback: If Mistral Small 4 is too slow for 10 users on the Spark, drop to Gemma 4 31B dense at Q4 or Qwen 3 14B at FP8. Test both and measure actual throughput at target concurrency before committing.
Your development team wants an on-prem code review assistant. It should review pull requests, suggest improvements, and flag potential security issues. Single-user at a time is acceptable. The codebase is primarily Python and TypeScript. Some PRs are large — up to 32K tokens of context needed.
Available hardware: H100 (80GB)
Your task: Choose a model, quantisation level, and inference framework. Justify each choice.
Suggested model: Qwen 3 32B at FP8, or QwQ-32B at FP8
Why this model: Code review is a reasoning-heavy task that benefits from larger models. At FP8, a 32B model needs ~32GB + overhead — fits comfortably on the H100 with ~40GB free for KV cache at 32K context. Qwen 3 is the strongest coding family at this size range, and QwQ-32B adds chain-of-thought reasoning that helps with security analysis. Both support 128K+ context windows, more than enough for 32K token PRs.
Why not alternatives: Llama 3.3 70B at FP8 (~70GB) would fit on the H100 but leaves only ~10GB for KV cache, which at 32K context might cause problems. Phi-4 14B excels at code but its 16K context window is too short. Devstral 2 (123B dense) is purpose-built for code, but even at Q4 its weights alone take roughly 62–74GB, leaving too little KV-cache headroom for 32K-token PRs on an 80GB card.
Framework: vLLM with FP8 — single-user so no batching pressure, but vLLM’s PagedAttention efficiently manages the large KV cache at 32K context. Ollama would also work fine for single-user.
Quantisation: FP8 — with 80GB available and a 32B model, there is no reason to drop below FP8. You keep ~99% quality and have headroom for the context window.
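The KV-cache headroom claim can be sanity-checked with the standard formula: 2 (K and V) × layers × KV heads × head dim × bytes per element, per token. A sketch — the architecture numbers below are illustrative assumptions for a 32B-class model with grouped-query attention, not confirmed specs for any particular release:

```python
def kv_bytes_per_token(layers: int, kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """KV cache bytes per token: a K and a V vector for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Assumed (illustrative) 32B-class GQA shape: 64 layers, 8 KV heads,
# head_dim 128. FP8 KV cache = 1 byte per element.
per_token = kv_bytes_per_token(64, 8, 128, 1)
ctx_gb = per_token * 32_768 / 1e9
print(per_token)  # 131072 bytes, ~0.13 MB per token
print(ctx_gb)     # ~4.3 GB for one full 32K-token context
```

With ~40GB free after weights, a single 32K context costing ~4GB leaves an order-of-magnitude margin — which is why FP8 is the right call here rather than dropping to Q4.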
A client needs to extract structured data from scanned invoices — vendor name, invoice number, line items, amounts, tax, and total. The documents are PDFs that have been OCR’d. Processing happens overnight in batch — no real-time requirement. Volume: 5,000 documents per night. The output must be structured JSON.
Available hardware: DGX Spark (128GB unified memory)
Your task: Choose the right approach. This one might not need an LLM at all — or it might.
Suggested approach: pipeline of specialised models, not a single LLM
Step 1 — NER extraction: Use GLiNER (100–350M) to extract entity types directly: vendor name, invoice number, amounts, dates. Define entity types at runtime. Runs on CPU at ~10ms per document. For 5,000 docs: ~50 seconds total.
Step 2 — Structured output: For invoices where GLiNER’s extraction is clean (likely 70–85% of documents), map directly to JSON. No LLM needed.
Step 3 — LLM fallback: For the remaining 15–30% where extraction is ambiguous (multi-page invoices, unusual formats), use Qwen 3 8B at NVFP4 on the DGX Spark to parse the full text and produce structured JSON. At ~38 tok/s single-user, generating ~200 output tokens for each of roughly 1,500 fallback documents takes about 2 hours of batch processing.
Why not just use the LLM for everything: 5,000 documents × (~500 input + ~200 output tokens each) ≈ 3.5M tokens, of which ~1M are generated output. At ~38 tok/s single-user decode on the Spark, the output tokens alone take 7+ hours for an 8B model, before prefill and per-request overhead. With the pipeline approach, the 70–85% of clean documents finish in under a minute, and the LLM handles only the hard cases. Total processing time drops to ~2–3 hours.
Framework: GLiNER runs standalone (pip install, no GPU needed). For the LLM fallback, use Ollama for simplicity (batch queue, no concurrent users) or vLLM with batching for faster throughput.
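The routing between Step 2 and Step 3 can be a few lines of Python. A sketch under stated assumptions: the entity shape matches what GLiNER's `predict_entities` returns, but the required field names and the confidence threshold are hypothetical values to tune on your own invoices:

```python
# Assumed required fields and threshold -- tune both on real documents.
REQUIRED = {"vendor", "invoice_number", "total"}
MIN_SCORE = 0.8

def route(entities: list[dict]) -> tuple[str, dict]:
    """Decide per document: map GLiNER entities straight to JSON,
    or flag the document for the LLM fallback.

    `entities` is a list of {"label": ..., "text": ..., "score": ...}
    dicts, the shape GLiNER's predict_entities produces.
    """
    confident = {
        e["label"]: e["text"] for e in entities if e["score"] >= MIN_SCORE
    }
    if REQUIRED <= confident.keys():
        return "direct", confident        # clean extraction -> JSON now
    return "llm_fallback", confident      # ambiguous -> queue for the LLM

# A clean document goes straight through without touching the GPU:
decision, fields = route([
    {"label": "vendor", "text": "Acme GmbH", "score": 0.95},
    {"label": "invoice_number", "text": "INV-2041", "score": 0.91},
    {"label": "total", "text": "1,180.00 EUR", "score": 0.88},
])
print(decision)  # direct
```

Everything routed to `"llm_fallback"` goes into the overnight batch queue; everything else is already structured JSON after Step 2.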
Your client was using a hosted API running Llama 3.3 70B for a customer-facing Q&A system. The API provider just announced they are shutting down in one week. The client needs to self-host a replacement immediately. The system handles ~20 concurrent users during business hours. Responses must be in English. Quality cannot noticeably degrade — customers are used to the current output quality.
Available hardware: H100 (80GB)
Your task: What do you deploy, and what do you tell the client about the trade-offs?
Suggested model: Llama 3.3 70B at Q4 (GGUF Q4_K_M or AWQ 4-bit)
Why the same model: The requirement is “quality cannot noticeably degrade.” The fastest way to guarantee that is to run the same model. Llama 3.3 70B at Q4 needs ~46GB total, fitting on the H100 with ~34GB headroom for KV cache and batching.
Why Q4 and not FP8: At FP8, the ~70GB of weights plus runtime overhead comes to ~84GB, which exceeds the 80GB card before any KV cache is allocated for 20 concurrent users. At Q4, you have the headroom. The ~3–5% quality loss from Q4 is less noticeable than the latency degradation from running out of KV cache space.
Framework: vLLM with AWQ quantisation — this is not optional at 20 concurrent users. Ollama would queue requests and create unacceptable wait times. vLLM with continuous batching and PagedAttention can serve 20 users at ~60–96 tok/s system throughput at Q4. Enable FP8 KV cache (kv_cache_dtype="fp8") to double your KV cache capacity.
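The claim that an FP8 KV cache gives 20 users workable context on the remaining ~34GB can be checked with the same per-token arithmetic used elsewhere in this booklet. Llama 3.3 70B uses grouped-query attention with 80 layers, 8 KV heads, and head dim 128:

```python
def kv_bytes_per_token(layers: int, kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """KV cache bytes per token: a K and a V vector for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Llama 3.3 70B GQA shape: 80 layers, 8 KV heads, head_dim 128.
fp16 = kv_bytes_per_token(80, 8, 128, 2)   # FP16 cache: 327,680 B/token
fp8 = kv_bytes_per_token(80, 8, 128, 1)    # FP8 cache: half of that

headroom_gb = 34            # 80 GB card minus ~46 GB for Q4 weights
tokens = headroom_gb * 1e9 / fp8
print(int(tokens))          # ~207,000 cacheable tokens total
print(int(tokens / 20))     # ~10,000 tokens of context per concurrent user
```

~10K tokens of context per user is comfortable for customer Q&A; at FP16 cache it would halve to ~5K, which is why `kv_cache_dtype="fp8"` is worth enabling here.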
What to tell the client:
- Output quality will be ~95–97% of what they had (Q4 quantisation). Most users will not notice
- Latency per response will likely be similar or slightly slower than the hosted API, depending on their previous provider’s hardware
- They are now responsible for uptime, monitoring, and model updates. Set up Prometheus metrics from vLLM’s /metrics endpoint and NVIDIA DCGM Exporter for GPU monitoring
- If you upgrade to H200 later, migrate to FP8 for full quality restoration and better concurrent user handling
Timeline: vLLM + AWQ Llama 3.3 70B can be serving in production within a day. The pre-quantised AWQ model is available on HuggingFace. The remaining time is for testing, monitoring setup, and load testing at 20 concurrent users.
What to Watch
The landscape changes fast. Here is what to monitor so you know when to re-evaluate.
Signals That Matter
- New model releases from the five families. Not every release requires action. Check: is it a new architecture or just a version bump? Does it target a size range you use? Does it change the licensing? A new Qwen 4 with a different license matters. A Qwen 3.5.1 patch does not
- Hardware changes. When new hardware arrives — H200s, next-gen Blackwell, a second DGX Spark — re-run the “what fits” analysis. Models that were too large or too slow may suddenly become viable. Two H200s (282GB combined) open up the 235B MoE tier
- Quantisation breakthroughs. Recent advances in 2-bit quantisation (like QuIP# and AQLM) are narrowing the quality gap below Q4. If 2-bit reaches Q4-level quality, it roughly doubles the parameter count you can fit on existing hardware
- Inference framework updates. SGLang and vLLM release roughly monthly. Major features (better MoE support, new quantisation backends, speculative decoding improvements) can change throughput by 30–50%
- Licensing changes. Track the specific model version’s license, not the family’s reputation. Mistral went from Apache 2.0 to restrictive and back. Gemma went from restrictive to Apache 2.0. The next shift could go either direction for any family
- EU regulatory developments. The EU AI Act is now enforceable. Watch for guidance on training data transparency requirements (affects Qwen), high-risk AI system classifications, and any new restrictions on model deployment in regulated sectors
Where to Look
- Chatbot Arena (LMArena.ai) — the most reliable quality ranking. 6M+ blind human A/B votes. More trustworthy than any benchmark leaderboard
- HuggingFace model pages — check download counts, derivative models, and community discussions for practical deployment experience
- r/LocalLLaMA — the fastest signal for new model quality, quantisation issues, and inference framework bugs
- Model family blogs/release notes — Mistral blog, Google AI blog, Qwen GitHub, Meta Llama blog, Microsoft Research blog
- Your own evaluation set — the most important signal. Build a set of 50–100 representative prompts from your actual use cases and evaluate new models against it. No external benchmark replaces this
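A minimal scoring harness for such an evaluation set might look like the sketch below. The contains-a-key-phrase metric is a deliberately crude, assumed stand-in for whatever quality check fits your use case, and the prompts and phrases are toy examples:

```python
def score(outputs: dict[str, str], expected: dict[str, list[str]]) -> float:
    """Fraction of prompts whose model output contains at least one
    expected key phrase. Crude, but consistent across models, which
    is what matters for A-vs-B comparisons."""
    hits = sum(
        any(phrase.lower() in outputs.get(pid, "").lower()
            for phrase in phrases)
        for pid, phrases in expected.items()
    )
    return hits / len(expected)

# Toy evaluation set -- replace with 50-100 prompts from real tickets,
# PRs, or documents, and keep it fixed between model comparisons.
expected = {
    "vpn-01": ["split tunnel", "company VPN"],
    "pw-02": ["self-service portal"],
}
outputs = {
    "vpn-01": "Enable split tunnel mode in the VPN client settings...",
    "pw-02": "Contact the helpdesk to reset it.",
}
print(score(outputs, expected))  # 0.5
```

Run every candidate model over the same fixed set and compare scores; a new release only earns a migration when it beats the incumbent on your prompts, not on a leaderboard.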
The open-weight landscape in April 2026 offers more options, better licensing, and more capable models than at any point in the history of accessible AI. Five families compete fiercely. Apache 2.0 is the norm. A single H100 runs models that would have required a data centre three years ago.
But the models are not the hard part. The hard part is matching the right model to the right task on the right hardware with the right trade-offs — and having the confidence to make that call yourself when the next release drops.
That is what the framework is for. Use it.
Robert Barcik · barcik.training · LearningDoe s.r.o. · April 2026