CivBench: How Frontier AI Models Fail at Strategic Reasoning

The Nuke That Didn't Save the Game

Turn 305. Portugal (AI) launches a nuclear device at Toulouse, France's cultural capital. A second strike follows at turn 311. The AI had built the Manhattan Project, researched Nuclear Fission, and executed a 50-turn plan to stop a French culture victory. It worked: the culture clock stopped. But France won anyway, by diplomacy, at turn 318. The AI lost because it could not monitor diplomatic victory progress; the tool that reports it was broken.

This isn't a game anecdote. It's a benchmark result.

The Wrong Benchmark

The author, a former UK government AI builder now at the Tony Blair Institute, started with GovBench: 3,497 multiple-choice questions on UK legislation. Gemma 3 27B scored 94% out of the box. GPT-5 scored 99.26%. The problem: these scores measure recall, not reasoning. Passing a quiz on parliamentary procedure does not show that a model can navigate parliamentary procedure.

Enter Civilization VI

Civilization VI's decision space grows from roughly 10,000 possible actions per turn in the early game to roughly 10^166 in the late game. Six victory types (science, culture, domination, religion, diplomacy, score) mean no single objective dominates. The author found a debug port in Civ VI's engine and built an MCP server with 76 tools, letting AI models play via text commands.
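To get a feel for how a per-turn action count can reach a number like 10^166, consider the simplest possible model: if each unit or city picks independently among its own options, the joint count is the product of those options. The sketch below is illustrative arithmetic only. The actor counts and options per actor are assumed values chosen to land on the article's figures, not the author's counting method.

```python
import math

def joint_action_log10(options_per_actor: list[int]) -> float:
    """log10 of the joint action count when every actor chooses independently."""
    return sum(math.log10(k) for k in options_per_actor)

# Hypothetical early game: 2 actors with about 100 options each.
early = joint_action_log10([100, 100])

# Hypothetical late game: 83 actors with about 100 options each.
late = joint_action_log10([100] * 83)

print(f"early ~10^{early:.0f}, late ~10^{late:.0f}")
# prints: early ~10^4, late ~10^166
```

Working in log10 keeps the numbers representable; the point is only that the count is multiplicative in the number of things a player controls.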

The Sensorium Effect

The AI sees nothing unless it asks. A human player sees a hex grid, minimap, unit animations, and notification banners simultaneously. The AI gets four lines of text:

Turn 150/330 | Poland (Jadwiga) | Score: 179 | Prince | Quick speed (67% costs)
Gold: 628 (+20/turn) | Income: 38 | Maintenance: -18 (units: 9) | Science: 26.6 | Culture: 16.2 | Faith: 904 | Favor: 88 (+4/turn)
Research: TECH_EDUCATION | Civic: CIVIC_FEUDALISM
Cities: 3 | Population: 21 | Units: 4
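The four lines above are regular enough to parse mechanically, which makes it easy to check what they do and do not contain. The following is a minimal sketch, not the author's code: `parse_status` and the `MaxTurn` key are hypothetical names, and the parser assumes every field follows the `Key: value` pattern shown in the sample.

```python
import re

STATUS = """Turn 150/330 | Poland (Jadwiga) | Score: 179 | Prince | Quick speed (67% costs)
Gold: 628 (+20/turn) | Income: 38 | Maintenance: -18 (units: 9) | Science: 26.6 | Culture: 16.2 | Faith: 904 | Favor: 88 (+4/turn)
Research: TECH_EDUCATION | Civic: CIVIC_FEUDALISM
Cities: 3 | Population: 21 | Units: 4"""

def parse_status(text: str) -> dict[str, str]:
    """Collect every 'Key: value' field from the pipe-separated status lines."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        for part in line.split("|"):
            match = re.match(r"^([A-Za-z]+): (.+)$", part.strip())
            if match:
                fields[match.group(1)] = match.group(2)
    # The turn counter has no colon, so it is extracted separately.
    turn = re.search(r"Turn (\d+)/(\d+)", text)
    if turn:
        fields["Turn"] = turn.group(1)
        fields["MaxTurn"] = turn.group(2)
    return fields

status = parse_status(STATUS)
print(sorted(status))
```

Every key that comes back describes the player's own economy and research. Nothing in the header mentions rival units, religion, or anyone else's victory progress.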

To see units, it calls get_units. If it doesn't call, a Man-at-Arms two tiles from a city doesn't exist in its world. In one game, the AI playing India (Gandhi) ignored France's missionaries spreading Catholicism for 76 turns because it didn't ask about religion. France won a religious victory.
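One way to make this kind of blind spot visible is to track when each observation tool was last called and flag the ones that have gone quiet. This is a rough sketch under stated assumptions: `get_units` appears in the article, but `get_religion_status` and `get_victory_progress` are hypothetical tool names, and `QueryLog` is an illustrative helper, not part of the author's MCP server.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class QueryLog:
    """Remembers the last turn on which each observation tool was called."""
    last_called: dict[str, int] = field(default_factory=dict)

    def record(self, tool: str, turn: int) -> None:
        self.last_called[tool] = turn

    def stale(self, tools: list[str], turn: int, max_age: int) -> list[str]:
        """Tools never called, or not called within the last max_age turns."""
        overdue = []
        for tool in tools:
            seen: Optional[int] = self.last_called.get(tool)
            if seen is None or turn - seen > max_age:
                overdue.append(tool)
        return overdue

# Hypothetical watchlist; only get_units is named in the article.
WATCHLIST = ["get_units", "get_religion_status", "get_victory_progress"]

log = QueryLog()
log.record("get_units", turn=148)
overdue = log.stale(WATCHLIST, turn=150, max_age=5)
print(overdue)  # ['get_religion_status', 'get_victory_progress']
```

A harness could surface the overdue list to the model each turn, or simply log it so that unasked questions can be counted after the game.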

The Knowing-Doing Gap

The AI has read every Civ strategy guide. It knows Alexander of Macedon needs Encampments early. In its Macedon game, it wrote a detailed domination plan with Ancient through Renaissance phases. It never built a single Encampment in 110 turns. It defaulted to a generic science sprint every game. Its diary repeated: "I need to build military infrastructure." Identified, acknowledged, not acted upon.
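A gap like this can be scored by comparing what the plan named against what was actually built. The sketch below is one hypothetical way to do that, not CivBench's metric. The build log is invented for illustration; only the missing Encampment comes from the article.

```python
def plan_adherence(planned: list[str], built: list[str]) -> tuple[float, list[str]]:
    """Return (fraction of planned items that were built, planned items never built)."""
    built_set = set(built)
    missing = [item for item in planned if item not in built_set]
    done = len(planned) - len(missing)
    fraction = done / len(planned) if planned else 1.0
    return fraction, missing

# Hypothetical plan and build log for a domination game.
planned = ["Encampment", "Barracks", "Campus"]
built = ["Campus", "Library", "Campus", "University"]

score, missing = plan_adherence(planned, built)
print(f"adherence: {score:.2f}, never built: {missing}")
```

A low score alongside a diary that keeps restating the plan is the pattern described above: the strategy is known and written down, but the actions do not follow it.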

This maps to BALROG findings across game environments: models articulate optimal strategies but fail to execute them.

The Nuke, Revisited

Playing Portugal (João III), the AI finally found a non-science loop: trade routes → gold → envoys → city-state alliances → diplomatic favor → World Congress votes. By turn 162, Portugal was #1. By endgame, it had 18 of 20 diplomatic victory points. But France was running a culture clock, and the AI locked onto that threat. Peaceful counters failed: Rock Bands couldn't be activated via the debug protocol, melee combat dealt zero damage, and the space project was bugged.

The AI spent 50 turns building nukes. It probed the engine's Lua code to find nuclear launch commands. It nuked Toulouse at turn 305 and again at turn 311. The culture clock stopped. Then France won by diplomacy at turn 318. The AI's post-game note: "France reached 20 first through WC votes that we couldn't monitor, victory progress tool broken."
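The failure here is tunnel vision on one victory track while another went unobserved. A simple guard is to rank every rival on every track and to treat an unreadable track as an open risk instead of as zero. The sketch below assumes a per-civilization progress feed that the article says was broken in this game. `rank_threats` and the 0.8 culture figure are hypothetical; the 18 of 20 figure is from the article.

```python
from typing import Optional

Progress = dict[str, dict[str, Optional[float]]]

def rank_threats(progress: Progress, me: str) -> tuple[list[tuple[str, str, float]], list[tuple[str, str]]]:
    """Split rival victory tracks into known (sorted, closest first) and unmonitored."""
    known: list[tuple[str, str, float]] = []
    unknown: list[tuple[str, str]] = []
    for civ, tracks in progress.items():
        if civ == me:
            continue
        for track, fraction in tracks.items():
            if fraction is None:
                unknown.append((civ, track))  # no reading: cannot be ruled out
            else:
                known.append((civ, track, fraction))
    known.sort(key=lambda t: t[2], reverse=True)
    return known, unknown

progress: Progress = {
    "Portugal": {"diplomacy": 18 / 20},               # from the article
    "France": {"culture": 0.8, "diplomacy": None},    # 0.8 is illustrative; None = tool broken
}

known, unknown = rank_threats(progress, me="Portugal")
print("known:", known)
print("unmonitored:", unknown)
```

In this toy state the culture threat is the only one with a number attached, which is exactly why it absorbs attention. Listing the unmonitored track separately keeps it on the agenda.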

CivBench: Quantifying the Failure

The author rebuilt the setup as a proper benchmark, CivBench, testing multiple frontier models across games. Results (precise numbers not yet published) show consistent patterns: models miss threats they don't think to ask about, and fail to execute known optimal strategies. The sensorium effect and knowing-doing gap are measurable, not anecdotal.

What This Means for AI in Government

The author builds AI for governments. The same two problems apply directly to policy-making: missing threats because you didn't ask the right question, and failing to execute known best practices. A health policy that looks brilliant today might cause a housing crisis in 15 years. The AI that aces the quiz can't navigate that complexity.

Next Steps

CivBench is open-source. The author plans to release the full evaluation suite and results. If you're building AI for complex decision-making, run your models through it. The nuke story is memorable. The lesson is sobering: we're good at measuring recall, terrible at measuring strategic reasoning.
