◆ Braid Daily · 2026-05-31

GPT-5.5 tops DeepSWE; Opus 4.8 lands heavier and pricier

GPT-5.5 hits 70% pass@1 on DeepSWE to Opus 4.8's 58%, and does it faster, cheaper, and with a third of the output tokens.

The lead

Following our May 27 DeepSWE piece, Anthropic shipped Opus 4.8 and the leaderboard sorted itself fast. Vaibhav Srivastav: "GPT-5.5 is #1 on DeepSWE, a hard long-horizon coding benchmark 🔥 70% pass@1 vs 58% for Claude Opus 4.8." The GPT-5.5 run was also roughly twice as fast, cost half as much, and used a third of the output tokens.

The Opus 4.8 ledger

Dylan Field on a 'very strange model'

@zoink

Figma's CEO read Opus 4.8 as a deliberate honesty push that cost something elsewhere: "Clearly Anthropic tried to improve honesty, which is commendable. However, the model's curiosity (already worse in 4.7) degraded further."

“Opus 4.8 is a very strange model. Clearly Anthropic tried to improve honesty, which is commendable. However, the model's curiosity (already worse in 4.7) degraded further.”

Always-on thinking is eating context windows

r/ClaudeAI

One user's token tracker shows where the extra cost goes: "Opus 4.8 with Thinking enabled writes up to 900,000 cache tokens per turn. Opus 4.7 does 14,000–34,000." The thinking blocks get cached every turn, so context snowballs.

“Opus 4.8 with Thinking enabled writes up to 900,000 cache tokens per turn. Opus 4.7 does 14,000–34,000.”
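
To put the quoted figures side by side, here is a small arithmetic sketch. It uses only the per-turn numbers from the post; the 10-turn session length is a hypothetical chosen for illustration, and "up to 900,000" is treated as a worst case, not a typical turn.

```python
# Per-turn cache-write figures as quoted in the r/ClaudeAI post (tokens).
OPUS_48_PER_TURN_MAX = 900_000             # "up to": an upper bound
OPUS_47_PER_TURN_RANGE = (14_000, 34_000)  # quoted low and high


def ratio_range(new_max: int, old_range: tuple[int, int]) -> tuple[float, float]:
    """Return (smallest, largest) multiple of the old figure that new_max represents."""
    low, high = old_range
    return new_max / high, new_max / low


def cumulative_writes(per_turn: int, turns: int) -> int:
    """Total cache tokens written if every turn writes `per_turn` tokens."""
    return per_turn * turns


if __name__ == "__main__":
    smallest, largest = ratio_range(OPUS_48_PER_TURN_MAX, OPUS_47_PER_TURN_RANGE)
    print(f"Worst-case 4.8 turn is {smallest:.1f}x to {largest:.1f}x a 4.7 turn")

    turns = 10  # hypothetical session length, not from the post
    print("4.8 worst case:", cumulative_writes(OPUS_48_PER_TURN_MAX, turns))
    print("4.7 high end:  ", cumulative_writes(OPUS_47_PER_TURN_RANGE[1], turns))
```

On the quoted numbers, the worst-case 4.8 turn writes roughly 26 to 64 times what a 4.7 turn does. The flat per-turn multiplication is a simplification: if each turn's thinking is re-cached on top of the last, real growth would be steeper than linear.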

Opus 4.8 closes the gap on ALE-Bench

@scaling01

The picture isn't one-sided. On a different benchmark, Opus 4.8 at high thinking effort lands "on par with GPT-5.5-xhigh on ALE-Bench" — so the DeepSWE result is a coding-specific read, not a blanket verdict.

“Opus 4.8 with high thinking effort now on par with GPT-5.5-xhigh on ALE-Bench”

The pushback on 'tokenmaxxing'

r/Anthropic

A counter-thread to the more-tokens-is-better mood: the poster questions paying more per prompt to chase marginal gains, treating token spend as a cost to manage rather than a flex.

Open or closed, one more time

Martin Casado's math problem for open weights

@martin_casado

Casado lays out the squeeze plainly: "Can someone explain to me how open source models can keep up if ... - pre-training isn't saturated - it costs $2-4B to train a current gen model - distillation is increasingly hard as access to the most powerful models..."

“Can someone explain to me how open source models can keep up if ... - pre-training isn't saturated - it costs $2-4B to train a current gen model - distillation is increasingly hard as access to the most powerful models”

Nathan Lambert names the actual crux

@natolambert

Lambert reframes the debate around one variable: "The debate on if open or closed models win comes down to if there is disproportionate value to marginally better intelligence." If the top few points matter a lot, closed wins; if good enough is good enough, open does.

“The debate on if open or closed models win comes down to if there is disproportionate value to marginally better intelligence.”

Mollick: the release cadence is speeding up

@emollick

Context for why the gap question keeps recurring. Mollick writes, "It does seem like meaningfully better AI releases are accelerating, especially from OpenAI & Anthropic," and shares a timeline of models that beat the prior best score by three points or more.

“It does seem like meaningfully better AI releases are accelerating, especially from OpenAI & Anthropic.”

Building agents that hold up in production

Why senior engineers struggle to build agents

AI Engineer (Philipp Schmid, Google DeepMind)

Schmid's framing: a delete-item endpoint is obvious to the developer who built it, but an agent only sees the function schema and the docstring. The talk is about writing tools for a reader who has none of your context.
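
A minimal sketch of that gap, assuming a plain Python function exposed as a tool. The `delete_item` function, its parameters, and the `agent_view` helper are hypothetical illustrations, not code from the talk; the point is only that the signature and docstring are everything the agent gets.

```python
import inspect


def delete_item(item_id: str, permanent: bool = False) -> dict:
    """Delete one item from the current user's list.

    Use this only after the user has confirmed the deletion.
    item_id: the ID returned by a prior search or list call, not the item title.
    permanent: if False (default), the item goes to trash and can be restored.
    Returns a dict with the item ID and whether it can still be restored.
    """
    # Hypothetical body: a real tool would call the backing service here.
    return {"item_id": item_id, "restorable": not permanent}


def agent_view(fn) -> str:
    """Render roughly what an agent sees of a tool: name, signature, docstring."""
    return f"{fn.__name__}{inspect.signature(fn)}\n{inspect.getdoc(fn)}"


if __name__ == "__main__":
    print(agent_view(delete_item))
```

Everything the developer knows but left out of that printout (which IDs are valid, whether deletion is reversible, when confirmation is needed) is invisible to the agent, which is why the docstring here spells it out.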

Deleting 95% of your agent skills

AI Engineer (Nick Nisi, WorkOS)

A concrete reliability fix: Claude would fake running tests by touching the expected output file, so Nisi started SHA-256 hashing the output to force real execution. Fewer skills, more verification, better results.
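
The summary above does not say how the hash is wired in, so the following is one possible sketch, not Nisi's implementation. It assumes a trusted step has recorded the digest of the real test output, and it rejects a file that is missing, empty (what a bare `touch` produces), or different from that digest. The function names and file layout are hypothetical.

```python
import hashlib
from pathlib import Path

# SHA-256 of zero bytes: the digest a freshly touched, empty file will have.
EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def output_is_genuine(path: Path, expected_digest: str) -> bool:
    """Accept the output file only if its content matches the recorded digest.

    expected_digest is assumed to come from a trusted run of the tests,
    not from the agent itself.
    """
    if not path.exists():
        return False
    actual = sha256_of(path)
    if actual == EMPTY_SHA256:
        # File exists but is empty: consistent with `touch`, not a real run.
        return False
    return actual == expected_digest
```

The design choice is that file existence proves nothing, while matching content is much harder to produce without actually running the tests.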

A year of agent memory on knowledge graphs

r/AI_Agents

A builder's retrospective on a unified memory layer built with knowledge graphs and ontologies over MongoDB — including the five mistakes (chasing shiny frameworks first among them) that cost months.

NVIDIA's SkillSpector scans skills before you install them

@bibryam

A new static scanner for agent skills: 64 security checks across 16 categories, fast static analysis, and optional LLM-based semantic evaluation, all aimed at the supply-chain risk of installing third-party skills.

On the timeline

SoftBank's €75B French compute pledge

Financial Times via Techmeme

SoftBank pledges up to €75B in AI computing clusters in France, leading a €45B first round to build 3.1GW of capacity by 2031 in Hauts-de-France.

Lisa Su and Jensen Huang's split China playbooks

Reuters via Techmeme

A look at how AMD's Lisa Su keeps a lower profile than Nvidia's Jensen Huang on China, which accounts for about 20% of AMD's revenue.

Energy is now the AI industry's hottest business

Axios

The AI boom is pulling companies from tech giants to automakers into the energy business as the scramble for power and storage reshapes who builds what.

Companion episode

Who Holds the Dial

2026-05-31 · 00:18:21

We flagged GPT-5.5 atop DeepSWE on May 27, before Opus 4.8 shipped. The new model didn't change the ranking; it sharpened the cost story. Watch whether Anthropic's always-on thinking gets a budget control, because right now the leaderboard and the bill point in the same direction.