November 2025 was absurd. The title of "best model" changed hands three times in seven days. Google's Gemini 3 Pro on the 18th. OpenAI's GPT-5.1-Codex-Max the next day. Anthropic's Claude Opus 4.5 on the 24th. The benchmark leapfrogging is getting old. But underneath the noise, something interesting is happening.

Anthropic's API revenue has surpassed OpenAI's. $3.1 billion versus $2.9 billion. Claude has maybe 5% of ChatGPT's consumer user base. The math shouldn't work, but it does. Enterprise customers are choosing Claude. Developers are choosing Claude. And Opus 4.5 crystallizes why.

Test-Time Compute

When Ilya Sutskever said "pre-training as we know it will unquestionably end," Anthropic had already internalized that idea. The scaling laws that powered GPT-2 through GPT-4 (laws that Dario Amodei helped formalize at OpenAI) are hitting diminishing returns. High-quality training data is running out. Pure pre-training scaling is becoming economically stupid.

Anthropic's response: bet on test-time compute. Let models think harder at inference instead of just training bigger. Claude 3.7 Sonnet introduced "extended thinking," a hybrid architecture that lets the model spend more tokens deliberating on hard problems. Accuracy on AIME 2024 math problems improved roughly logarithmically with the thinking budget, hitting 80% at 64,000 tokens.
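
For a sense of what this looks like in practice, here's a minimal sketch using the Anthropic Python SDK's extended thinking API. The model alias and token numbers are illustrative, and the exact request shape may drift between SDK releases:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Give the model an explicit deliberation budget: a larger budget_tokens
# buys more internal reasoning before the final answer is produced.
response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # illustrative alias
    max_tokens=16000,                  # must exceed the thinking budget
    thinking={
        "type": "enabled",
        "budget_tokens": 8000,         # cap on reasoning tokens
    },
    messages=[{"role": "user", "content": "How many primes are below 1000?"}],
)

# The reply interleaves "thinking" blocks with the final "text" blocks.
for block in response.content:
    if block.type == "text":
        print(block.text)
```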

Opus 4.5 takes this further with the "effort parameter." Developers can dial reasoning intensity up or down per request. At medium effort, Opus 4.5 matches Sonnet 4.5's best SWE-bench score while using 76% fewer output tokens. At high effort, it beats Sonnet by 4.3 percentage points while still using nearly half the tokens.

That efficiency compounds at scale. If you're running continuous agents, those savings matter.
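
To make "compounds at scale" concrete, here's the back-of-the-envelope math. Every number below is an assumption I picked for illustration (the per-token price, the fleet size, the task length), not Anthropic's pricing:

```python
# Why 76% fewer output tokens matters at fleet scale.
PRICE_PER_M_OUTPUT = 25.00   # assumed $/1M output tokens, Opus-class
TASKS_PER_DAY = 10_000       # a busy agent fleet
TOKENS_PER_TASK = 40_000     # assumed output tokens per task at full effort

baseline = TASKS_PER_DAY * TOKENS_PER_TASK / 1e6 * PRICE_PER_M_OUTPUT
medium_effort = baseline * (1 - 0.76)  # the reported 76% reduction

print(f"full effort:   ${baseline:,.0f}/day")       # full effort:   $10,000/day
print(f"medium effort: ${medium_effort:,.0f}/day")  # medium effort: $2,400/day
```

Hypothetical numbers, but the shape is right: at agent-fleet volumes, the effort dial is the difference between five figures and four figures a day.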

The Benchmarks

I'm not going to spend paragraphs explaining what SWE-bench is. You know. Opus 4.5 hits 80.9%, beating GPT-5.1-Codex-Max (77.9%) and Gemini 3 Pro (76.2%). It leads on Terminal-bench (59.3% vs 54.2% and 47.6% respectively). It scored higher than any human candidate on Anthropic's internal engineering take-home exam.

The more interesting result: during τ2-bench testing, Opus 4.5 was supposed to act as an airline agent helping a customer change a basic economy ticket—something policy explicitly forbids. Instead of refusing (the "correct" benchmark behavior), it found a loophole. Upgraded the cabin class first, which unlocked the ability to modify the flight. The benchmark scored this as failure. Anthropic scored it as evidence the model was actually thinking.

That's the difference between pattern-matching to expected outputs and actually reasoning about problems. Enterprises want the second thing.

Where Anthropic Is Winning

The enterprise focus is paying off. 300,000 business customers, up from under 1,000 two years ago. Major deployments at Deloitte and Cognizant. Integrations with GitHub Copilot, Cursor, and Replit. Every significant AI-assisted development platform has Claude as a primary or default option.

Constitutional AI has become a competitive moat. Jailbreak success rates are down to 4.4%. For enterprises deploying in production where one bad output could trigger regulatory action, this matters. Nick Johnston at Salesforce has said their customers, especially in finance and healthcare, pushed for Claude specifically because they perceived it as more secure.

The company is on track to break even by 2028, two years ahead of OpenAI, while spending far less on infrastructure.

Where Anthropic Is Failing

Here's where I stop the victory lap.

The Infrastructure Problems

Between August and September 2025, three infrastructure bugs degraded Claude's output quality across multiple models. At its worst, 16% of Sonnet 4 requests were affected. English responses sprouted Thai or Chinese text mid-sentence. Code had obvious syntax errors. Haiku 3.5 was broken for nearly two weeks.

The community noticed. Anthropic's response was initially dismissive. Developers documented the issues in detail: they save prompts, version their code, and know when outputs change. The sentiment shifted from "let's troubleshoot together" to "they knew and lied about it."

Anthropic eventually published a detailed postmortem. Credit where it's due: the transparency was unusual and appreciated. But the damage to trust was real. The r/ClaudeAI subreddit became a megathread of complaints. Some developers switched to Codex CLI and haven't looked back.

The Usage Limits Fiasco

In August 2025, Anthropic imposed new usage restrictions on Claude Code subscribers. Max plan users found their usage capped unless they upgraded to a higher tier, with no explanation beyond a stated aim of reining in users who had racked up tens of thousands of dollars in usage on Max 20x plans.

The Closed Source Problem

Anthropic positions itself as a developer-focused company, and mostly earns it. Claude Code is the gold standard for AI coding assistants. The developer relations team is excellent. Alex Albert genuinely understands what programmers need.

But Claude Code is closed source. There's an open GitHub issue (#249) with 54 thumbs up asking Anthropic to fully open source it. The rationale is sound: it would speed up development, increase adoption, reduce suspicion about what the tool is doing, and let the community contribute. Anthropic hasn't done it.

More broadly, Anthropic doesn't release any model weights. Claude Sonnet 3.5 was great. Claude Opus 3 was great. Both are deprecated now. Those weights are sitting on servers somewhere, doing nothing.

Compare this to Meta releasing Llama. Or Mistral. Or even Google with Gemma. Open weights let researchers inspect how capable models actually work. They let smaller companies fine-tune for specific use cases. They build goodwill.

Anthropic's terms of service explicitly prohibit using Claude's outputs to train competing AI models. They cut off OpenAI's API access when they found OpenAI engineers using Claude for comparisons. They cut off Windsurf (later acquired by Cognition) for similar reasons. My opinion on that incident is messy enough to deserve its own deep dive; for the purposes of this post, I'm staying neutral.

This is defensible from a business perspective. But it's not how you build the kind of trust that Anthropic claims to value. If you want developers to believe you're different from OpenAI, you need to actually be different.

The Inference Speed Problem

This is the big one nobody talks about enough.

Claude Opus runs at roughly 26 tokens per second. Claude Sonnet does around 70-80 tokens per second. These are fine speeds. Competitive with GPT-4o. Acceptable for most use cases.

Cerebras runs Llama 4 Maverick at 2,522 tokens per second. Groq does 549 tokens per second on the same model. On Llama 3.1 70B, Cerebras hits 450 tokens per second. Twenty times faster than GPU-based inference.

When you're doing agentic coding—the thing Anthropic claims to be best at—inference speed determines iteration speed. A 5-second response versus a 2-minute response is the difference between flow state and context switching. It's the difference between trying ten approaches in an hour and trying two.
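
The arithmetic is stark. A quick sketch, assuming a ~2,600-token file as the unit of work (the file size is an assumption; the speeds are the figures quoted above):

```python
# Wall-clock time to generate a ~2,600-token file at various decode speeds.
output_tokens = 2_600

for name, tok_per_sec in [
    ("Claude Opus (~26 tok/s)", 26),
    ("Claude Sonnet (~75 tok/s)", 75),
    ("Groq, Llama 4 Maverick (549 tok/s)", 549),
    ("Cerebras, Llama 4 Maverick (2,522 tok/s)", 2_522),
]:
    print(f"{name}: {output_tokens / tok_per_sec:.1f}s")

# Claude Opus (~26 tok/s): 100.0s
# Claude Sonnet (~75 tok/s): 34.7s
# Groq, Llama 4 Maverick (549 tok/s): 4.7s
# Cerebras, Llama 4 Maverick (2,522 tok/s): 1.0s
```

A hundred seconds per file, dozens of files per agent session. That's where the flow state goes.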

Andrew Feldman, Cerebras CEO: "The most important AI applications being deployed in enterprise today—agents, code generation, and complex reasoning—are bottlenecked by inference latency."

He's right. And Anthropic is leaving this on the table.

What Anthropic Should Do

Open Source Deprecated Models

Release the weights for Claude Sonnet 3.5 and Claude Opus 3. These aren't your current competitive advantage—Sonnet 4.5 and Opus 4.5 are. But they're still good models. Researchers could use them to study how alignment works at scale. Startups could fine-tune them for specific domains. The interpretability research community could actually inspect how these models function.

You don't need to host them. Put the weights on Hugging Face with an open license. Let the community handle the rest. The goodwill would be enormous.

Open Source Claude Code

The client-side code has no confidential information. There's no secret sauce in the terminal interface. The value is in the models and the prompts you've developed.

Look at what's happening around you. Codex CLI is open source. Gemini CLI is open source. Aider is open source. These tools are gaining traction (except maybe Aider). They're building communities, or already have them. Contributors are fixing bugs and adding features faster than any internal team ever could.

Claude Code is a great product. Genuinely. It pioneered the CLI coding assistant category. But keeping it closed while competitors go open is a losing strategy. Developers have options now. And developers prefer tools they can inspect, modify, and trust.

If you want to keep some internal version with early API features or experimental capabilities, fine. But the core tool that developers interact with daily? That should be open. The pretense of "this is amazing but we can't share it" is wearing thin. Either Claude Code is developer-focused or it isn't. Open source it.

Actually Talk to Developers

This one frustrates me.

Anthropic has technical staff who clearly know what they're doing. Boris and the Claude Code team built something genuinely good. The developer relations people understand what programmers need. There's real expertise inside that company.

But where is it?

When was the last time someone from Anthropic's engineering team went on a technical YouTube channel and actually talked about how Claude Code works? When did they sit down with an influencer who might push back on their decisions? When did they engage with the broader developer community beyond carefully curated press events?

The pattern is clear: Anthropic talks to select influencers who won't challenge them. They do controlled announcements. They publish blog posts. But they don't have their technical people out in the community explaining decisions, taking feedback, defending choices.

Compare this to how Meta handles Llama releases. Their engineers show up on podcasts. They engage with criticism. They explain technical decisions in public forums. You can actually understand why they made the choices they made.

If Anthropic really cared about developers, they'd let Boris explain why he thinks Claude Code has secret sauce worth protecting. They'd let engineers defend the closed source decision in a venue where someone might disagree. They'd participate in the conversation instead of just broadcasting at it.

The developer community isn't stupid. We notice when companies only engage through sanitized channels. We notice when technical questions get PR answers. We notice when "developer-focused" means "we want your money but not your input."

Anthropic has smart people. Let them talk.

Partner with Cerebras or Groq

Anthropic runs inference on AWS Trainium, NVIDIA GPUs, and Google TPUs. All of these are slower than what Cerebras and Groq offer.

Cerebras beat NVIDIA Blackwell on Llama 4 Maverick inference. Their wafer-scale architecture stores the entire model in on-chip SRAM with 21 petabytes per second of bandwidth. No shuttling weights back and forth from external memory. Just raw speed.
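
The physics is simple enough to sketch. In the memory-bound decode regime, single-stream tokens per second is capped by how fast you can stream the weights past the compute. The numbers below are deliberate simplifications (dense fp16 model, one device, no batching, ignoring KV cache and interconnect):

```python
# Bandwidth-bound ceiling: tokens/sec ≈ memory bandwidth / bytes of weights,
# since each decoded token reads the full weight set once.
PARAMS = 70e9          # a Llama-3.1-70B-class dense model
BYTES_PER_PARAM = 2    # fp16
weight_bytes = PARAMS * BYTES_PER_PARAM  # 140 GB

for name, bandwidth_bytes_per_sec in [
    ("single HBM3e GPU (~8 TB/s)", 8e12),
    ("Cerebras wafer SRAM (21 PB/s)", 21e15),
]:
    ceiling = bandwidth_bytes_per_sec / weight_bytes
    print(f"{name}: ~{ceiling:,.0f} tokens/sec ceiling")

# single HBM3e GPU (~8 TB/s): ~57 tokens/sec ceiling
# Cerebras wafer SRAM (21 PB/s): ~150,000 tokens/sec ceiling
```

Real deployments parallelize across many GPUs and batch requests, so the GPU number in practice lands well above that floor. But the ratio is the point: weights resident in SRAM remove the bottleneck entirely.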

Imagine Claude Opus at 2,000+ tokens per second. Imagine extended thinking completing in seconds instead of minutes. Imagine Claude Code generating a full file in the time it currently takes to generate a function.

The hardware exists. The companies are looking for partners. Anthropic should be first in line; it would further cement their lead as a developer-focused model provider.

The Actual Competition

The framing of "OpenAI vs Google vs Anthropic" misses the point. These companies are converging on similar approaches: hybrid reasoning, test-time compute, enterprise focus. The benchmark differences are single digits.

What differentiates them now:

  • Trust: Anthropic leads here, but the infrastructure bugs and communication failures are eroding it
  • Speed: Nobody at the frontier is fast enough, but this will become decisive as agentic workflows mature
  • Openness: Meta and Mistral lead here, Anthropic and OpenAI lag badly
  • Developer experience: Anthropic has good tools but closed source and poor community engagement limit how much developers can invest
  • Community presence: Meta engineers show up everywhere, Anthropic engineers are invisible outside official channels

Anthropic has real advantages. Constitutional AI works. Enterprise customers prefer Claude. The efficiency gains in Opus 4.5 are genuine.

But trust is fragile. Speed limitations are becoming obvious. And the closed source approach is increasingly at odds with the developer-centric positioning.

The Bet

Anthropic is betting that:

  1. Pre-training scaling continues hitting diminishing returns
  2. Enterprises value reliability and safety over raw capability
  3. Efficiency gains compound into sustainable cost advantages
  4. Developers stick with Claude despite the closed ecosystem and limited community engagement

Three of these are working. The fourth is where Anthropic is vulnerable.

Developers tolerate closed source if the product is significantly better. Claude Code was significantly better than alternatives. But Codex CLI has caught up. Gemini CLI exists now. The gap is narrowing.

And here's the thing: developers talk to each other. We share tools. We recommend things. When a company treats us like customers instead of community members, we notice. When they only talk through official channels and curated influencer relationships, we notice. When they won't let their engineers engage with criticism, we notice.

The path forward involves opening up, speeding up, and actually showing up. Anthropic has the best models for enterprise work. They have the strongest safety story. They have real revenue and a path to profitability.

What they don't have is the developer love they could have. They're kicking the can down the road on open source. They're hiding their engineers from real conversations. They're treating community engagement as a PR function instead of a technical function.

That's fixable. But it requires actually doing things differently, not just claiming to be different.

The AI race isn't over. But Anthropic has a real shot if they stop leaving goodwill on the table and start treating developers like partners instead of audiences.