The Nuke That Didn't Save the Game
Turn 305. Portugal, played by an AI, launches a nuclear device at Toulouse, France's cultural capital, and strikes again on turn 311. The AI had built the Manhattan Project, researched Nuclear Fission, and executed a 50-turn plan to stop a French culture victory. It worked: the culture clock stopped. But France won anyway, by diplomacy, on turn 318. The AI lost because its victory progress tool was broken, so it could not monitor France's diplomatic victory progress.
This isn't a game anecdote. It's a benchmark result.
The Wrong Benchmark
The author, a former UK government AI builder now at the Tony Blair Institute, started with GovBench: 3,497 multiple-choice questions on UK legislation. Gemma 3 27B scored 94% out of the box. GPT-5 scored 99.26%. The problem: these scores measure recall, not reasoning. A model that passes a quiz on parliamentary procedure cannot necessarily navigate parliamentary procedure.
Enter Civilization VI
Civilization VI's decision space explodes from ~10,000 possible actions per turn in the early game to ~10^166 per turn in the late game. Six victory types (science, culture, domination, religion, diplomacy, score) mean no single objective dominates. The author found a debug port in Civ VI's engine and built an MCP (Model Context Protocol) server with 76 tools, letting AI models play via text commands.
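To make "playing via text commands" concrete, here is a minimal sketch of what a single tool invocation might look like on the wire. MCP tool calls are JSON-RPC 2.0 requests using the `tools/call` method. The tool name `get_units` appears later in this piece, but the empty argument object and the helper name `make_tool_call` are assumptions for illustration, not the author's actual schema.

```python
import json
from itertools import count

# Monotonic request ids, as JSON-RPC expects each request to carry one.
_ids = count(1)

def make_tool_call(name, arguments=None):
    """Build a JSON-RPC 2.0 request for MCP's tools/call method.

    `name` is the tool to invoke; `arguments` is a dict whose shape
    depends on the server (assumed empty here).
    """
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }

# Hypothetical call: ask the server to list the player's units.
request = make_tool_call("get_units")
print(json.dumps(request, indent=2))
```

Every piece of game state the model learns arrives as the answer to a request like this one, which is why the choice of what to ask matters so much in the sections that follow.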
The Sensorium Effect
The AI sees nothing unless it asks. A human player sees a hex grid, minimap, unit animations, and notification banners simultaneously. The AI gets four lines of text:
Turn 150/330 | Poland (Jadwiga) | Score: 179 | Prince | Quick speed (67% costs)
Gold: 628 (+20/turn) | Income: 38 | Maintenance: -18 (units: 9) | Science: 26.6 | Culture: 16.2 | Faith: 904 | Favor: 88 (+4/turn)
Research: TECH_EDUCATION | Civic: CIVIC_FEUDALISM
Cities: 3 | Population: 21 | Units: 4
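As a rough illustration of how little structure that overview carries, the sketch below parses the second status line into numbers and checks that the reported net gold follows from income and maintenance. The line is copied from the example above. The parsing approach and the function name `parse_yields` are assumptions, not part of the author's tooling.

```python
import re

STATUS_LINE = (
    "Gold: 628 (+20/turn) | Income: 38 | Maintenance: -18 (units: 9) | "
    "Science: 26.6 | Culture: 16.2 | Faith: 904 | Favor: 88 (+4/turn)"
)

def parse_yields(line):
    """Map each 'Name: value' field to the first number after the colon.

    Parenthetical details such as '(+20/turn)' are ignored.
    """
    yields = {}
    for field in line.split("|"):
        name, _, rest = field.strip().partition(":")
        match = re.search(r"-?\d+(?:\.\d+)?", rest)
        if match:
            yields[name] = float(match.group())
    return yields

yields = parse_yields(STATUS_LINE)

# Maintenance is reported as a negative number, so net = income + maintenance.
net_gold = yields["Income"] + yields["Maintenance"]
print(net_gold)  # 20.0, matching the "+20/turn" shown next to Gold
```

Everything outside these few dozen numbers, such as unit positions, religious pressure, or rival victory progress, has to be requested separately.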
To see units, it calls get_units. If it doesn't call, a Man-at-Arms two tiles from a city doesn't exist in its world. In one game, the AI playing India (Gandhi) ignored France's missionaries spreading Catholicism for 76 turns because it didn't ask about religion. France won a religious victory.
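One way to picture the failure is as a staleness problem: information the agent has not requested recently is effectively invisible. The sketch below tracks when each information tool was last called and lists the ones that are overdue. Only `get_units` is named in the source. The other tool names, the 10-turn threshold, and the `QueryLog` class are hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class QueryLog:
    """Tracks the last turn on which each information tool was called."""
    max_age: int = 10  # assumed threshold, in turns
    last_called: dict = field(default_factory=dict)

    def record(self, tool, turn):
        self.last_called[tool] = turn

    def stale(self, tools, turn):
        """Return tools never called, or not called within max_age turns."""
        return [
            t for t in tools
            if t not in self.last_called
            or turn - self.last_called[t] > self.max_age
        ]

# Hypothetical tool names, except get_units which the article mentions.
TOOLS = ["get_units", "get_religion_status", "get_victory_progress"]

log = QueryLog()
log.record("get_units", 148)
log.record("get_victory_progress", 120)

overdue = log.stale(TOOLS, turn=150)
print(overdue)  # ['get_religion_status', 'get_victory_progress']
```

A harness could surface a list like this to the model each turn. Whether that would have changed the India game is untested here; the sketch only shows that the blind spot itself is cheap to detect.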
The Knowing-Doing Gap
The AI has read every Civ strategy guide. It knows Alexander of Macedon needs Encampments early. In its Macedon game, it wrote a detailed domination plan with Ancient through Renaissance phases. It never built a single Encampment in 110 turns. It defaulted to a generic science sprint every game. Its diary repeated: "I need to build military infrastructure." Identified, acknowledged, not acted upon.
This matches findings from the BALROG benchmark across game environments: models articulate optimal strategies but fail to execute them.
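The gap between stated plan and actual behavior can in principle be checked mechanically. The following is a minimal sketch that compares the items a plan commits to against a build log and reports what was never built. The data is invented for illustration (only the missing Encampment reflects the Macedon game described above), and `unexecuted_items` is a hypothetical helper, not part of CivBench.

```python
def unexecuted_items(planned, build_log):
    """Return planned items that never appear in the build log, in plan order."""
    built = {entry["item"] for entry in build_log}
    return [item for item in planned if item not in built]

# Illustrative data: the plan calls for an Encampment, the log never shows one.
planned = ["Encampment", "Campus"]
build_log = [
    {"turn": 12, "item": "Campus"},
    {"turn": 40, "item": "Library"},
]

missing = unexecuted_items(planned, build_log)
print(missing)  # ['Encampment']
```

A check like this turns "identified, acknowledged, not acted upon" from a diary observation into something a benchmark can count.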
The Nuke, Revisited
Playing Portugal (João III), the AI finally found a non-science loop: trade routes → gold → envoys → city-state alliances → diplomatic favor → World Congress votes. By turn 162, Portugal was #1. By the endgame, it had 18 of the 20 diplomatic victory points needed to win. But France was running a culture clock, and the AI locked onto that threat. Peaceful counters failed: Rock Bands couldn't be activated via the debug protocol, melee combat dealt zero damage, and the space project was bugged.
The AI spent 50 turns building nukes. It probed the engine's Lua code to find the nuclear launch commands. It nuked Toulouse on turn 305 and again on turn 311. The culture clock stopped. Then France won by diplomacy on turn 318. The AI's post-game note: "France reached 20 first through WC votes that we couldn't monitor, victory progress tool broken."
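The post-game note suggests a simple design lesson: a reading that cannot be obtained should be treated as an open risk, not as the absence of a threat. The sketch below separates known victory progress from unknown progress so that the unknowns are flagged explicitly. The 18 and 20 come from the account above. Representing France's unreadable diplomatic progress as `None`, and the function `assess_threats` itself, are assumptions for illustration.

```python
def assess_threats(progress, targets):
    """Split victory progress into known gaps and unknown readings.

    progress: {civ: {victory_type: points, or None if the reading failed}}
    targets:  {victory_type: points needed to win}
    Returns (known, unknown): known is sorted by points remaining, and
    unknown lists every (civ, victory_type) that could not be read.
    """
    known, unknown = [], []
    for civ, tracks in progress.items():
        for vtype, points in tracks.items():
            if points is None:
                unknown.append((civ, vtype))
            else:
                known.append((targets[vtype] - points, civ, vtype))
    known.sort()
    return known, unknown

targets = {"diplomacy": 20}
progress = {
    "Portugal": {"diplomacy": 18},
    "France": {"diplomacy": None},  # assumed: the progress tool returned nothing
}

known, unknown = assess_threats(progress, targets)
print(known)    # [(2, 'Portugal', 'diplomacy')]
print(unknown)  # [('France', 'diplomacy')]
```

Under this framing, France's diplomatic track would have shown up every turn as unverified rather than silently dropping out of the AI's picture.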
CivBench: Quantifying the Failure
The author rebuilt the setup as a proper benchmark, CivBench, testing multiple frontier models across games. Results (precise numbers not yet published) show consistent patterns: models miss threats they don't think to ask about, and fail to execute known optimal strategies. The sensorium effect and knowing-doing gap are measurable, not anecdotal.
What This Means for AI in Government
The author builds AI for governments. The same problems — missing threats because you didn't ask the right question, failing to execute known best practices — apply directly to policy-making. A health policy that looks brilliant today might cause a housing crisis in 15 years. The AI that aces the quiz can't navigate that complexity.
Next Steps
CivBench is open-source. The author plans to release the full evaluation suite and results. If you're building AI for complex decision-making, run your models through it. The nuke story is memorable. The lesson is sobering: we're good at measuring recall, terrible at measuring strategic reasoning.

