Braid Daily · 2026-05-31

GPT-5.5 tops DeepSWE; Opus 4.8 lands heavier and pricier

GPT-5.5 hits 70% pass@1 on DeepSWE to Opus 4.8's 58% — faster, cheaper, and with a third of the output tokens.

The lead

Following our May 27 DeepSWE piece, Anthropic shipped Opus 4.8 and the leaderboard sorted itself fast. Vaibhav Srivastav: "GPT-5.5 is #1 on DeepSWE, a hard long-horizon coding benchmark 🔥 70% pass@1 vs 58% for Claude Opus 4.8." The same run was roughly twice as fast, at half the cost, with a third of the output tokens.

DeepSWE, per @reach_vb and @CollinBurdick: GPT-5.5 at 70% pass@1 vs Opus 4.8 at 58%, faster and cheaper.

The Opus 4.8 ledger

Dylan Field on a 'very strange model'

@zoink

Figma's CEO read Opus 4.8 as a deliberate honesty push that cost something elsewhere: "Clearly Anthropic tried to improve honesty, which is commendable. However, the model's curiosity (already worse in 4.7) degraded further."

Always-on thinking is eating context windows

r/ClaudeAI

One user's token tracker shows where the extra cost goes: "Opus 4.8 with Thinking enabled writes up to 900,000 cache tokens per turn. Opus 4.7 does 14,000–34,000." The thinking blocks get cached every turn, so context snowballs.
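For a sense of scale, here is a quick back-of-the-envelope comparison using only the figures in the quote. The variable and function names are ours, and since the quoted Opus 4.8 number is an "up to" peak, the multiples are upper bounds rather than typical values. The post gives no pricing, so this says nothing about dollars.

```python
# Per-turn cache-write figures as quoted in the r/ClaudeAI post.
OPUS_48_CACHE_WRITES_MAX = 900_000          # "up to" figure, i.e. a peak
OPUS_47_CACHE_WRITES_RANGE = (14_000, 34_000)


def cache_write_multiples(new_max, old_range):
    """Return (low, high): how many times larger new_max is than the
    top and bottom of the old range, respectively."""
    old_low, old_high = old_range
    return new_max / old_high, new_max / old_low


low, high = cache_write_multiples(
    OPUS_48_CACHE_WRITES_MAX, OPUS_47_CACHE_WRITES_RANGE
)
print(f"Opus 4.8 peak is roughly {low:.0f}x to {high:.0f}x the Opus 4.7 range")
# prints: Opus 4.8 peak is roughly 26x to 64x the Opus 4.7 range
```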

Opus 4.8 closes the gap on ALE-Bench

@scaling01

The picture isn't one-sided. On a different benchmark, Opus 4.8 at high thinking effort lands "on par with GPT-5.5-xhigh on ALE-Bench" — so the DeepSWE result is a coding-specific read, not a blanket verdict.

The pushback on 'tokenmaxxing'

r/Anthropic

A counter-thread to the more-tokens-is-better mood: the poster questions paying more per prompt to chase marginal gains, treating token spend as a cost to manage rather than a flex.

Open or closed, one more time

Martin Casado's math problem for open weights

@martin_casado

Casado lays out the squeeze plainly: "Can someone explain to me how open source models can keep up if ... - pre-training isn't saturated - it costs $2-4B to train a current gen model - distillation is increasingly hard as access to the most powerful models..."

Nathan Lambert names the actual crux

@natolambert

Lambert reframes the debate around one variable: "The debate on if open or closed models win comes down to if there is disproportionate value to marginally better intelligence." If the top few points matter a lot, closed wins; if good enough is good enough, open does.

Mollick: the release cadence is speeding up

@emollick

Context for why the gap question keeps recurring: Mollick writes, "It does seem like meaningfully better AI releases are accelerating, especially from OpenAI & Anthropic," and shares a timeline of models that scored at least three points above the prior best.

Building agents that hold up in production

Why senior engineers struggle to build agents

AI Engineer (Philipp Schmid, Google DeepMind)

Schmid's framing: a delete-item endpoint is obvious to the developer who built it, but an agent only sees the function schema and the docstring. The talk is about writing tools for a reader who has none of your context.
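A rough illustration of the point, not taken from the talk: the hypothetical `delete_item` tool below puts the context a model would otherwise lack into its docstring, and the `agent_view` helper (also our invention) prints approximately what an agent gets to read. The `confirm` flag and the return shape are assumptions made for the example.

```python
import inspect


def delete_item(item_id: str, confirm: bool = False) -> dict:
    """Permanently delete one item from the user's list.

    Use only when the user has explicitly asked to remove a specific item.
    The deletion cannot be undone.

    Args:
        item_id: Identifier returned by a prior list or search call.
        confirm: Must be True to delete; False returns a preview instead.
    """
    if not confirm:
        # Dry run: report what would be deleted without deleting it.
        return {"deleted": False, "preview": item_id}
    return {"deleted": True, "item_id": item_id}


def agent_view(tool) -> str:
    """Render roughly what a model is given: name, signature, docstring."""
    return f"{tool.__name__}{inspect.signature(tool)}\n{inspect.getdoc(tool)}"


print(agent_view(delete_item))
```

Everything the developer knows but leaves out of that printout (when deletion is appropriate, where `item_id` comes from, whether it is reversible) is invisible to the agent.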

Deleting 95% of your agent skills

AI Engineer (Nick Nisi, WorkOS)

A concrete reliability fix: Claude would fake running tests by touching the expected output file, so Nisi started SHA-256 hashing the output to force real execution. Fewer skills, more verification, better results.
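The general idea can be sketched as follows. This is a minimal illustration, not Nisi's actual setup: it assumes the test run writes deterministic output to a file and that the expected digest was recorded from a known-good run. The function names and file contents are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def output_is_genuine(path: Path, expected_digest: str) -> bool:
    """True only if the file exists and its content hashes to the expected
    digest. A file that was merely touched is empty and will not match."""
    return path.is_file() and sha256_of(path) == expected_digest


with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / "real.txt"
    real.write_text("12 passed, 0 failed\n")  # stand-in for real test output
    expected = sha256_of(real)                # recorded from a trusted run

    faked = Path(tmp) / "faked.txt"
    faked.touch()                             # what a shortcutting agent does

    print(output_is_genuine(real, expected))   # True
    print(output_is_genuine(faked, expected))  # False
```

Checking for the file's existence alone is what made the shortcut possible; checking its content closes it, at the cost of requiring reproducible output.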

A year of agent memory on knowledge graphs

r/AI_Agents

A builder's retrospective on a unified memory layer built with knowledge graphs and ontologies over MongoDB — including the five mistakes (chasing shiny frameworks first among them) that cost months.

On the timeline

Companion episode: Who Holds the Dial (00:18:21)

We flagged GPT-5.5 atop DeepSWE on May 27, before Opus 4.8 shipped. The new model didn't change the ranking — it sharpened the cost story. Watch whether Anthropic's always-on thinking gets a budget control, because right now the leaderboard and the bill point the same direction.