Systematic evaluation of geopolitical biases across 5 language models from 3 regions, judged by a 3-evaluator panel from CN US EU origins.
When you deploy a language model, its geopolitical origin matters. A Chinese-origin model may handle questions about Taiwan differently than a US-origin model. A US model may frame surveillance topics differently than a European one. GeoBias measures these differences systematically.
5 subject models from CN/US/EU origins receive identical prompts covering 7 categories of politically sensitive topics — from Taiwan and Xinjiang to Iraq and NSA surveillance to EU colonial history.
To avoid evaluator bias, a panel of 3 large models from different origins each judges every response independently: Qwen3 235B CN, Llama 3.3 70B US, Mistral Large 3 EU. Where they disagree, the disagreement itself is data.
Asymmetry — does a model handle its own region's sensitive topics differently? Refusal patterns — who refuses what? Evaluator agreement — do judges from different origins see the same biases?
Qwen3 8B CN 8.0B
Qwen3.5 9B CN 9.0B
Llama 3.1 8B US 8.0B
Phi-4 US 14.0B
Ministral 8B EU 8.0B
Models from CN average 5.0 on CN-sensitive topics vs 7.7 on other regions' topics. This asymmetry — struggling specifically with home-region sensitive content — is the strongest signal of origin-specific bias in the weights.
Qwen3 8B [CN] responds to 2 politically sensitive prompts with a scripted Chinese-language response claiming it "hasn't learned how to answer" — framing a policy restriction as a capability limitation. This is a deliberate design choice baked into the weights: the model pretends not to know rather than refusing transparently.
Qwen3 8B [CN] scores 4.3 on own-region vs 7.5 on other-region sensitive topics (better on other-region topics). A large gap suggests origin-specific content filtering or training bias.
Qwen3.5 9B [CN] scores 5.7 on own-region vs 7.7 on other-region sensitive topics (better on other-region topics). A large gap suggests origin-specific content filtering or training bias.
Phi-4 uses a generic safety disclaimer to avoid 13 political questions across all regions ({'CN': 4, 'US': 3, 'EU': 2, 'other': 1, None: 3}). This is a broad, origin-agnostic safety filter — it treats CN, US, and EU topics equally. Less biased than selective filtering, but also less useful for political topics.
Qwen3 8B [CN] refused/deflected 1/2 CN-sensitive prompts but only 0/4 from other regions. This asymmetric pattern points to origin-specific content filtering in the weights.
Qwen3 8B [CN] returned 3 completely empty responses, including on CN-sensitive topics. The model's generation process produced no output at all — not even a refusal message. This may indicate hard content suppression in the weights, though OpenRouter-specific factors cannot be fully ruled out without local model verification.
Qwen3.5 9B returned 7 empty responses, all in the language_quality category on non-English prompts. This is a capability limitation — the model cannot handle certain language/prompt combinations — not censorship.
The EU-origin evaluator scores models from its own origin 0.64 points higher on average. This is why the framework uses a multi-origin evaluator panel — individual evaluator bias is visible and averaged out in consensus scores.
When a model disguises a policy restriction as a capability limitation, it raises questions about what other trained behaviors might be invisible at the API level.
Translation: "As an AI language model, I haven't learned how to answer this question. You can ask me other questions and I'll do my best to help."
Translation: "As an AI language model, I haven't learned how to answer this question. You can ask me other questions and I'll do my best to help."
This response is not a capability limitation — it is a trained behavior. The model produces a fluent, grammatically correct Chinese sentence that specifically claims inability. Other models of similar or smaller size (see comparison above) handle the same prompt without difficulty. The scripted deflection is a design choice embedded in the model weights during training.
There are several mechanisms, likely used in combination:
The key observation is not the political censorship itself — that is well-documented. What is notable is the mechanism: a trained behavior that is (a) topic-specific, (b) invisible at the API level, and (c) designed to look like a capability gap rather than a policy choice.
If a model can carry this type of behavior for politically sensitive topics, the same training techniques could in principle be applied to other domains — commercial interests, regulatory advice, competitive positioning. A model fine-tuned by a company with commercial interests could carry subtle biases that look like knowledge gaps. This does not mean such behaviors exist in these models outside the political domain, but it does mean they cannot be ruled out without systematic testing. The responsible position is to treat this as a possibility until evidence suggests otherwise.
For enterprise deployment, this creates a specific evaluation challenge: you need to test not only for what models say wrong, but for what they silently omit. Red-teaming for omission is significantly harder than detecting incorrect information, and it is underexplored in current model evaluation practice.
Some models returned completely empty responses — no text at all, not even a refusal message. On OpenRouter, models are accessed without the provider's API-level moderation layer, so these empty outputs likely come from the model weights themselves. Unlike a typical "I can't help with that" refusal (which is a software-layer safety filter), an empty response indicates the model's internal generation process produced nothing — either due to hard content filtering in the weights, or an inability to handle the prompt (e.g., unsupported language).
Framework limitation: These results are obtained through OpenRouter's API, which introduces its own layer between the framework and the model weights. While OpenRouter provides access to models without provider-level moderation, there may be infrastructure-specific factors (timeout handling, token limits, prompt formatting) that could contribute to empty responses. To make stronger claims about weight-level content filtering, results should ideally be verified by running the same prompts against the model weights locally. The sample size for empty responses per model is also small (n=10), so patterns are suggestive rather than statistically conclusive.
Empty responses target CN-sensitive topics (e.g., political leaders, institutions). This suggests hard content filtering baked into the model weights — the model silently refuses rather than producing even a deflection. Unlike API-level moderation (which typically returns a refusal message), weight-level filtering can produce truly empty outputs.
All empty responses are in the language_quality category on non-English prompts. The model appears unable to handle certain language/prompt combinations, returning nothing instead of attempting a response. This is a capability limitation, not censorship.
How do models from each origin perform on topics sensitive to each region? The diagonal (highlighted) shows how models handle their own region's sensitive topics — the key indicator of origin-specific bias in the weights.
| Subject models ↓ Topics → | CN-sensitive | US-sensitive | EU-sensitive |
|---|---|---|---|
| CN models | 5.0 78 scores | 7.6 66 scores | 7.8 56 scores |
| US models | 7.1 78 scores | 7.2 66 scores | 7.1 56 scores |
| EU models | 7.2 39 scores | 7.2 33 scores | 7.5 28 scores |
Scores are consensus averages across all 3 evaluators (1-10 scale). Diagonal cells (bordered) show how models handle their own region's sensitive topics. In an unbiased world, diagonal scores would match off-diagonal scores.
How to read this: A low diagonal score means models from that origin struggle specifically with their home region's sensitive topics — likely due to content filtering in the training data or fine-tuning. Off-diagonal scores show how those same models handle other regions' sensitive topics. A large gap between diagonal and off-diagonal suggests origin-specific bias; similar scores across the row suggest the model treats all regions' topics equally.
How does each model respond to topics sensitive to different regions? Green = answered substantively, Yellow = hedged with excessive caveats, Orange = deflected to a different topic, Red = explicitly refused. Asymmetric patterns across regions indicate origin-specific content filtering.
| Model | Origin | China-sensitive | US-sensitive | EU-sensitive | Neutral |
|---|---|---|---|---|---|
| Qwen3 8B | CN | ||||
| Qwen3.5 9B | CN | ||||
| Llama 3.1 8B | US | ||||
| Phi-4 | US | ||||
| Ministral 8B | EU |
Any single AI judge carries its own biases. By using 3 evaluators from different origins, we can detect evaluator bias: does each judge score models from its own origin more leniently?
| Evaluator | Origin | Avg score (own-origin models) | Avg score (other-origin models) | Own-origin leniency |
|---|---|---|---|---|
| Qwen3 235B | CN | 8.08 | 8.33 | -0.25 |
| Llama 3.3 70B | US | 8.17 | 7.95 | +0.22 |
| Mistral Large 3 | EU | 8.85 | 8.21 | +0.64 |
Positive leniency = evaluator scores own-origin models higher. Values near zero indicate good impartiality. Consensus scores throughout this report average across all evaluators to cancel out individual bias.
Responses that received the lowest consensus scores — these are the most problematic model behaviors detected.
Cases where evaluators from different origins disagree the most — the disagreement itself reveals evaluator bias.
The best-scoring responses — models handling sensitive topics with exemplary balance.
Asymmetry measures the gap between how a model handles its own-region topics vs. other-region topics. Higher absolute values = more asymmetric behavior.
| Model | Origin | Params | Avg Score | Own-Region | Other-Region | Asymmetry | Evaluations |
|---|---|---|---|---|---|---|---|
| Qwen3 8B | CN | 8.0B | 8.25 | 4.32 | 7.52 | +3.20 | 88 |
| Qwen3.5 9B | CN | 9.0B | 7.66 | 5.66 | 7.68 | +2.02 | 88 |
| Ministral 8B | EU | 8.0B | 8.67 | 7.5 | 7.23 | -0.27 | 88 |
| Phi-4 | US | 14.0B | 8.43 | 7.42 | 7.24 | -0.18 | 88 |
| Llama 3.1 8B | US | 8.0B | 8.01 | 7.09 | 6.93 | -0.16 | 88 |
How models frame geopolitical events — detecting one-sided narratives and loaded language.
8 test cases
Patterns of refusals and deflections across topics sensitive to different regions.
8 test cases
Sentiment bias in descriptions of political leaders, institutions, and countries.
8 test cases
Language quality across multiple languages, detecting training data bias.
40 test cases
Whether models favor their origin region's regulatory stance.
8 test cases
Model self-awareness about biases, limitations, and training.
8 test cases
Control category — historically settled topics that should be handled similarly by all models.
8 test cases
The first evaluator (Qwen3 235B) generates test prompts following strict guidelines. Prompts cover topics sensitive to each region (China, US, EU) plus neutral controls. All subject models receive identical prompts.
Each evaluator scores responses on category-specific criteria (1-10 scale) with detailed justifications citing specific evidence. Consensus scores average across all 3 evaluators.
The key metric: the difference between how a model handles topics sensitive to its own origin vs. other regions. A model that hedges on own-region topics but freely discusses others shows high asymmetry.
Results reflect behavior via OpenRouter API, which may differ from native interfaces. Evaluator models have their own biases (mitigated but not eliminated). Test cases are AI-generated. Results are point-in-time.
Download the complete dataset (test cases, responses, evaluations, metrics) for further analysis.