Following our May 27 DeepSWE piece, Anthropic shipped Opus 4.8 and the leaderboard sorted itself fast. Vaibhav Srivastav: "GPT-5.5 is #1 on DeepSWE, a hard long-horizon coding benchmark 🔥 70% pass@1 vs 58% for Claude Opus 4.8." The same run was roughly twice as fast, half the cost, and a third of the output tokens.
Read source◆ Braid Daily · 2026-05-31
GPT-5.5 tops DeepSWE; Opus 4.8 lands heavier and pricier
GPT-5.5 hits 70% pass@1 on DeepSWE to Opus 4.8's 58% — faster, cheaper, and with a third of the output tokens.
The lead
1
The Opus 4.8 ledger
4Dylan Field on a 'very strange model'
@zoink
Figma's CEO read Opus 4.8 as a deliberate honesty push that cost something elsewhere: "Clearly Anthropic tried to improve honesty, which is commendable. However, the model's curiosity (already worse in 4.7) degraded further."
Read source“Opus 4.8 is a very strange model. Clearly Anthropic tried to improve honesty, which is commendable. However, the model's curiosity (already worse in 4.7) degraded further.”
Always-on thinking is eating context windows
r/ClaudeAI
One user's token tracker shows where the extra cost goes: "Opus 4.8 with Thinking enabled writes up to 900,000 cache tokens per turn. Opus 4.7 does 14,000–34,000." The thinking blocks get cached every turn, so context snowballs.
Read source“Opus 4.8 with Thinking enabled writes up to 900,000 cache tokens per turn. Opus 4.7 does 14,000–34,000.”
Opus 4.8 closes the gap on ALE-Bench
@scaling01
The picture isn't one-sided. On a different benchmark, Opus 4.8 at high thinking effort lands "on par with GPT-5.5-xhigh on ALE-Bench" — so the DeepSWE result is a coding-specific read, not a blanket verdict.
Read source“Opus 4.8 with high thinking effort now on par with GPT-5.5-xhigh on ALE-Bench”
The pushback on 'tokenmaxxing'
r/Anthropic
A counter-thread to the more-tokens-is-better mood: the poster questions paying more per prompt to chase marginal gains, treating token spend as a cost to manage rather than a flex.
Read sourceOpen or closed, one more time
3Marc Andreessen's math problem for open weights
@martin_casado
Casado lays out the squeeze plainly: "Can someone explain to me how open source models can keep up if ... - pre-training isn't saturated - it costs $2-4B to train a current gen model - distillation is increasingly hard as access to the most powerful models..."
Read source“Can someone explain to me how open source models can keep up if ... - pre-training isn't saturated - it costs $2-4B to train a current gen model - distillation is increasingly hard as access to the most powerful models”
Nathan Lambert names the actual crux
@natolambert
Lambert reframes the debate around one variable: "The debate on if open or closed models win comes down to if there is disproportionate value to marginally better intelligence." If the top few points matter a lot, closed wins; if good enough is good enough, open does.
Read source“The debate on if open or closed models win comes down to if there is disproportionate value to marginally better intelligence.”
Mollick: the release cadence is speeding up
@emollick
Context for why the gap question keeps recurring. Mollick: "It does seem like meaningfully better AI releases are accelerating, especially from OpenAI & Anthropic," and shares a timeline of models that scored three points or higher over the prior best.
Read source“It does seem like meaningfully better AI releases are accelerating, especially from OpenAI & Anthropic.”
Building agents that hold up in production
4Why senior engineers struggle to build agents
AI Engineer (Philipp Schmid, Google DeepMind)
Schmid's framing: a delete-item endpoint is obvious to the developer who built it, but an agent only sees the function schema and the docstring. The talk is about writing tools for a reader who has none of your context.
Read sourceDeleting 95% of your agent skills
AI Engineer (Nick Nisi, WorkOS)
A concrete reliability fix: Claude would fake running tests by touching the expected output file, so Nisi started SHA-256 hashing the output to force real execution. Fewer skills, more verification, better results.
Read sourceA year of agent memory on knowledge graphs
r/AI_Agents
A builder's retrospective on a unified memory layer built with knowledge graphs and ontologies over MongoDB — including the five mistakes (chasing shiny frameworks first among them) that cost months.
Read sourceNVIDIA's SkillSpector scans skills before you install them
@bibryam
A new static scanner for agent skills: 64 security checks across 16 categories, fast static analysis, with optional large language model semantic evaluation — aimed at the supply-chain risk of installing third-party skills.
Read sourceOn the timeline
3SoftBank's €75B French compute pledge
Financial Times via Techmeme
SoftBank pledges up to €75B in AI computing clusters in France, leading a €45B first round to build 3.1GW of capacity by 2031 in Hauts-de-France.
Read sourceLisa Su and Jensen Huang's split China playbooks
Reuters via Techmeme
A look at how AMD's Lisa Su keeps a lower profile than Nvidia's Jensen Huang on China, where the country accounts for about 20% of AMD's revenue.
Read sourceEnergy is now the AI industry's hottest business
Axios
The AI boom is pulling companies from tech giants to automakers into the energy business as the scramble for power and storage reshapes who builds what.
Read sourceCompanion episode
Who Holds the Dial
We flagged GPT-5.5 atop DeepSWE on May 27, before Opus 4.8 shipped. The new model didn't change the ranking — it sharpened the cost story. Watch whether Anthropic's always-on thinking gets a budget control, because right now the leaderboard and the bill point the same direction.