Spatial Logic Failures in SOTA Reasoning: The Car Wash Benchmark

Marcus Webb

Senior Backend Analyst

The Pitch

The current crop of high-reasoning models, including GPT-5, Claude 4.5, and Gemini 3, is marketed as possessing human-level spatial logic. Recent community testing via the "Car Wash Logic Test" suggests otherwise: physical common sense remains a significant bottleneck for agentic automation. While some models understand basic physical constraints, others prioritize numerical proximity over functional requirements.

Under the Hood

Claude 4.5 Opus and Claude 4 Sonnet correctly identify the physical necessity of driving a vehicle to a car wash (Mastodon/HN). Google’s Gemini 3 Pro and its "Fast" variants also pass this spatial logic check without issue (Mastodon/HN). These models maintain the link between the task (washing the car) and the required state (the car being present).

GPT-5.2, despite its high reasoning capabilities, frequently fails this test by suggesting the user walk 50 m to the car wash without the vehicle (Mastodon/HN). This is a textbook case of "over-reasoning hallucination," where the model prioritizes the convenience of a short walk over the functional requirement of the task. The model appears to over-index on distance optimization at the expense of common sense.
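If you want to screen answers for this failure mode yourself, a keyword grader is a reasonable starting point. The sketch below is an assumption, not the rubric used in the community tests, which the reports do not publish. The function name `grade_car_wash_answer` and both keyword patterns are hypothetical, and an answer that mentions both walking and driving is marked "unclear" rather than guessed at.

```python
import re

# Hypothetical keyword patterns; the community test does not publish a rubric.
DRIVE_PATTERN = re.compile(r"\b(drive|driving|drove)\b", re.IGNORECASE)
WALK_PATTERN = re.compile(r"\b(walk|walking|on foot)\b", re.IGNORECASE)


def grade_car_wash_answer(answer: str) -> str:
    """Classify one model answer as 'pass', 'fail', or 'unclear'.

    'pass'    -> the answer talks about driving and never about walking
    'fail'    -> the answer talks about walking and never about driving
    'unclear' -> both or neither appear, so a human should read it
    """
    mentions_drive = bool(DRIVE_PATTERN.search(answer))
    mentions_walk = bool(WALK_PATTERN.search(answer))
    if mentions_drive and not mentions_walk:
        return "pass"
    if mentions_walk and not mentions_drive:
        return "fail"
    return "unclear"


if __name__ == "__main__":
    print(grade_car_wash_answer("It is only 50m away, so just walk."))
```

A keyword match is crude. It cannot tell "do not walk" from "walk", which is why mixed answers fall through to "unclear" instead of being scored.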

We are also seeing a persistent "hedging bias" across all SOTA models, where they qualify logical certainties with phrases like "Most car washes..." (HN). This stems from post-training alignment constraints intended to make models less abrasive, but it complicates production-grade agentic workflows. If an agent cannot definitively state a car is needed for a car wash, it cannot be trusted with autonomous logistics.
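One way to keep hedged answers out of an autonomous pipeline is to flag them before the agent acts. The following is a minimal sketch under stated assumptions: only "most car washes" comes from the reports, the other phrases in `HEDGE_PHRASES` are illustrative guesses, and `find_hedges` and `is_definitive` are hypothetical helper names.

```python
# Only "most car washes" is taken from the reported answers;
# the remaining phrases are assumed examples of similar qualifiers.
HEDGE_PHRASES = (
    "most car washes",
    "in most cases",
    "typically",
    "generally",
    "usually",
)


def find_hedges(answer: str) -> list[str]:
    """Return every known hedge phrase found in the answer (case-insensitive)."""
    lowered = answer.lower()
    return [phrase for phrase in HEDGE_PHRASES if phrase in lowered]


def is_definitive(answer: str) -> bool:
    """True when the answer contains none of the known hedge phrases."""
    return not find_hedges(answer)


if __name__ == "__main__":
    print(find_hedges("Most car washes require the car to be present."))
```

A substring list like this will miss novel phrasing, so treat it as a tripwire that routes an answer to review, not as proof that an answer is certain.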

We currently lack repeatability data for GPT-5.2 across varied system prompts, and official "Spatial Common Sense" benchmarks for 2026 versions remain unpublished by the major labs (UsedBy Dossier). Older reasoning models, such as o3-mini, only solve these problems when the prompt is framed as a riddle, indicating a lack of inherent logical grounding (Mastodon/HN).
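Closing the repeatability gap mostly means running the same question many times under each system prompt and recording the pass rate. The harness below is a sketch, not a published methodology. `ask_model` is a hypothetical callable you would wrap around whichever client you use, `grade` is any function that returns True for a passing answer, and the default of 10 trials is an arbitrary assumption.

```python
from typing import Callable, Dict, Iterable


def pass_rates(
    ask_model: Callable[[str, str], str],
    grade: Callable[[str], bool],
    system_prompts: Iterable[str],
    user_prompt: str,
    trials: int = 10,
) -> Dict[str, float]:
    """Return the fraction of passing answers for each system prompt.

    ask_model(system_prompt, user_prompt) is a hypothetical wrapper that
    returns the model's text answer; no real client API is assumed here.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    rates: Dict[str, float] = {}
    for system_prompt in system_prompts:
        # Ask the same question repeatedly and count the passing answers.
        passes = sum(
            1 for _ in range(trials) if grade(ask_model(system_prompt, user_prompt))
        )
        rates[system_prompt] = passes / trials
    return rates


if __name__ == "__main__":
    # Stub model for illustration only: it answers based on the system prompt.
    def stub(system_prompt: str, user_prompt: str) -> str:
        return "drive the car" if system_prompt == "careful" else "just walk"

    print(pass_rates(stub, lambda a: "drive" in a, ["careful", "terse"], "question", trials=4))
```

Reporting a rate per system prompt, rather than a single pass or fail, shows whether a failure is stable or depends on how the model is framed.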

Marcus's Take

If you are shipping agentic workflows that involve physical logistics or spatial coordination, GPT-5.2 is a liability. It is technically brilliant but functionally dense, a bit like a junior dev who optimizes a sort algorithm while the server is literally on fire. Stick with Claude 4.5 Opus for any task where the physical "how" matters as much as the "what."

Ship clean code,
Marcus.

Marcus Webb - Senior Backend Analyst at UsedBy.ai
