Why I Bothered

For those who know me, I am a tried and true OpenAI critic. I don't like Sam Altman. I don't like the product he ships. I don't like the way he treats consumers or, implicitly, his investors. I have an upcoming post dissecting why OpenAI's finances don't make any sense, so I'll save that for later. But I kept hearing interesting things about 5.3-Codex, and intellectual honesty demanded I at least give it a fair shot. So I did.

The Benchmark

I started by running Codex 5.3 against my private physics benchmark. The eval tests a model's ability to reason about points in 3D space: understanding vectors, interpreting natural language movement instructions like "move 8 spaces northwest," and computing resulting positions. I originally built this as a tool for tracking and moving points in motion capture work, but it turned out to be a surprisingly discriminating benchmark for spatial reasoning in language models.
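To make the task shape concrete, here's a minimal sketch of the kind of problem the eval poses. The benchmark itself is private, so the direction vocabulary and axis conventions below are my own illustrative assumptions, not its actual rules.

```python
import math

# Illustrative assumptions: +x is right/east, +y is up/north, and "northwest"
# is the unit diagonal between left and up. The real benchmark's conventions
# are private and may differ.
DIRECTIONS = {
    "right":     (1, 0, 0),
    "left":      (-1, 0, 0),
    "up":        (0, 1, 0),
    "down":      (0, -1, 0),
    "northwest": (-1 / math.sqrt(2), 1 / math.sqrt(2), 0),
}

def move(point, direction, spaces):
    """Apply a 'move N spaces <direction>' instruction to a 3D point."""
    dx, dy, dz = DIRECTIONS[direction]
    x, y, z = point
    return (x + spaces * dx, y + spaces * dy, z + spaces * dz)

# "move 8 spaces northwest" starting from the origin:
print(move((0, 0, 0), "northwest", 8))  # ≈ (-5.66, 5.66, 0.0)
```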

I recently rewrote the benchmark to be harder across the board. The new version gives each model one corrective prompt per tier, shared context, and live score reporting across all five tiers. A model can see how it's doing in real time, and if it gets a question wrong, it gets one penalty-free retry for that tier. Five tiers total, scaling from trivial single-axis movement up to complex multi-step reasoning in large coordinate spaces.
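For a sense of the mechanics, here's a rough sketch of how the one-retry-per-tier scoring with live reporting could be wired up. This is my reconstruction of the description above, not the benchmark's actual harness; `ask_model` and `grade` are hypothetical stand-ins.

```python
# Hypothetical reconstruction of the per-tier retry and live scoring described
# above. ask_model() and grade() are stand-in callables, not real interfaces.
def run_tier(tier_name, questions, ask_model, grade, corrective_prompt):
    correct, retry_used = 0, False
    for i, question in enumerate(questions, start=1):
        ok = grade(question, ask_model(question))
        if not ok and not retry_used:
            # One penalty-free retry per tier, delivered with a corrective prompt.
            retry_used = True
            ok = grade(question, ask_model(f"{corrective_prompt}\n\n{question}"))
        if ok:
            correct += 1
        # Live running score the model can be shown as the tier progresses.
        print(f"{tier_name}: {correct}/{i} ({100 * correct / i:.0f}%)")
    return correct / len(questions)
```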

Here's the top-level picture as of February 7, 2026 (all results rounded up to the nearest whole percentage):

[Chart: overall benchmark scores by model, as of February 7, 2026]

A 21-point gap. The headline number alone doesn't say much about how Codex's capacity to reason about basic spatial relationships has changed. The details are more revealing.

Tier 1: The Floor

Tier 1 measures single-dimensional movement in the x, y, or z axis. A sample question: "If you are at point (0,0,0) and move 6 spaces to the right, assuming (0,0,0) is the bottom left corner of a front-view cube, what is the new position?" The answer is (6,0,0). This is the easiest category in the benchmark. It is, in a meaningful sense, the floor for spatial reasoning.

Under the new benchmark terms, Codex 5.3 scored 44% on Tier 1. That's a marginal improvement over 5.2, which scored 28%. Whether that counts as progress depends on where you set the bar. Here is where everyone else landed:

[Chart: Tier 1 scores by model]

Anthropic's entire Claude lineup, from Haiku 3.5 through Opus 4.6, scored 100%. Gemini 3 Flash scored 100%. Gemini 3 Pro scored 98%. Codex 5.3 scored 44% on the easiest tier. This is not a competitive gap. This is a categorical failure on spatial reasoning fundamentals.

A Note on Methodology

One wrinkle worth flagging: as of February 8, 2026, GPT-5.3-Codex is not yet available via the API, so I could not test it the same way I test other models. Instead, I ran Codex 5.3 through the Codex CLI and the Claude models through Claude Code. It is possible the system prompt or CLI scaffolding introduces overhead that affects performance; I can't rule that out.

Once the API drops, I'll rerun everything from the API directly and update this post if the results change. That said, all models were tested under equivalent conditions within their respective tools, so the comparison remains internally consistent.

Distance Computation: Where Codex Falls Apart

My working theory is that OpenAI has not prioritized physics reasoning outside of textbook-style problems. Looking through what Codex 5.3 actually got right, it handles direction-following tasks acceptably. These are the kinds of spatial questions that map well onto instruction-tuning. But it scored 0% on distance-between-points problems in Tier 1. Each tier includes 12 such problems, which explains the depressed overall scores.

The dropoff across tiers tells the more interesting story. Every model degrades as problems get harder. That's expected. But the shape of the degradation curve reveals something about each model's underlying spatial reasoning capacity.

For context: Tier 1 has a max of 2 movements within a question before asking for a distance. Tier 5 has a max of 20 chained movements, and places points within a 125×125×125 grid cube. Maintaining comprehension across those chains is where all models struggle, but some struggle from a much higher starting point than others.
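To give a feel for what the longer chains demand, here's a sketch of a chained-movement problem that ends in the same Euclidean distance computation the distance-between-points questions require. The move vocabulary and the choice to clamp positions to the grid boundary are my assumptions, not the benchmark's actual rules.

```python
import math

# Illustrative move set; the real benchmark's vocabulary may differ.
STEP = {
    "right": (1, 0, 0), "left": (-1, 0, 0),
    "up":    (0, 1, 0), "down": (0, -1, 0),
    "forward": (0, 0, 1), "back": (0, 0, -1),
}

def follow(start, moves, grid=125):
    """Apply chained moves, clamping to the grid cube (an assumption)."""
    x, y, z = start
    for direction, n in moves:
        dx, dy, dz = STEP[direction]
        x = min(max(x + n * dx, 0), grid - 1)
        y = min(max(y + n * dy, 0), grid - 1)
        z = min(max(z + n * dz, 0), grid - 1)
    return (x, y, z)

start = (10, 10, 10)
end = follow(start, [("right", 30), ("up", 12), ("forward", 45), ("left", 7)])
# Euclidean distance: sqrt((x2-x1)^2 + (y2-y1)^2 + (z2-z1)^2)
print(end, round(math.dist(start, end), 2))  # (33, 22, 55) 51.94
```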

[Chart: scores by tier for each model]

Opus 4.5 and 4.6 both start at 100% and degrade along roughly similar curves, with 4.6 showing a slight improvement at Tier 2 (92% vs. 86%) and a notable edge at Tier 5 (6% vs. 2%). These are punishing problems at Tier 5, and even single-digit percentages represent real reasoning capacity at that difficulty level. For reference, Opus 4.6 also edged out Gemini 3 Flash at Tier 5 (5.43% to 4.29%), which represents the current frontier for this kind of spatial reasoning.

On the distance-computation problems, Codex 5.3 flatlines near zero across all five tiers. It doesn't degrade because it never gets off the ground. This is a model that cannot compute Euclidean distance between two points in 3D space under any of the conditions I tested.

Beyond Physics: Full-Stack Application Development

Physics benchmarks test a specific and narrow capacity. Maybe spatial reasoning isn't what OpenAI chose to optimize for, and that could be a defensible trade-off. So I tried something more representative: I built identical full-stack applications in both Claude Code (Opus 4.6) and the Codex macOS app (GPT-5.3-Codex).

On frontend/UI: I will say 5.3 has gotten slightly better at UI generation, but it feels like OpenAI hasn't iterated much on the templates they've had since GPT-5. That makes sense given the model family, and 5.3 is probably just tuned on more up-to-date libraries. Opus still produces substantially better UI. To keep the comparison fair, I gave both models the same frontend-design skill file; even so, I consistently preferred the feel and personality of what Opus 4.6 produced. There is a quality I'd describe as "soul" in its design outputs that 5.3 doesn't match.

On backend: 5.3 can one-shot a lot of backend tasks, and Opus 4.6 needs more steering by comparison. Opus 4.6 is a meaningful jump from 4.5 in this area, but it still benefits from iterative refinement where 5.3 sometimes just gets it right on the first pass.

That said, Opus 4.6 feels better to actually work with. There's a workflow dynamic here that's worth thinking about honestly. When 5.3 one-shots something and there's no bug, the task is done. When Opus gives you a solid draft with a few rough edges, you find yourself thinking "this bug is actually reminding me of a feature I could add here instead." The iterative loop produces better final results because it keeps you engaged with the code.

Opus is not a one-shot model. It's more of a "here's a strong draft, let's keep refining" model. For some people that's a drawback. For me, it matches how I actually work, so it fits.

Conclusion

I much prefer Opus 4.6 for my work. GPT-5.3-Codex is a meaningful step forward for OpenAI, and I'm not going to pretend otherwise. But it isn't enough to change my position. It doesn't offer the level of raw intelligence I need, whether measured against Gemini 3 Flash or Opus 4.6. The spatial reasoning results are concerning because they suggest gaps in foundational capability rather than just tuning choices. And the coding experience, while competent, lacks the depth and iterative quality that makes Opus feel like a collaborator rather than a tool.

I'll update this post when the 5.3 API is available and I can run a clean comparison without CLI overhead. If the results change materially, I'll say so. But based on what I've seen so far, the gap is real and it is wide.