◆ Dispatch 038 · 2026-05-27 GSV Coding Is Solved, The Rest Isn't

Coding is solved, the rest isn't

2026-05-27 / 00:21:38 / 14 sources

“The title "software engineer" may be dissolving into "builder" — but most of today's research is a careful list of the things that still break when nobody's watching.”
— Lenar Kess, today's narration

Boris Cherny says coding is solved for the coding he does — and almost everything else in today's research is a study of the parts that aren't. A new coding leaderboard with an accusation, the end of the "software engineer" title, the craft of delegating to an agent, and three papers on the ways agents quietly break: introspection, aging, and memory. Plus running a trillion-parameter model in your house, the labs' jobs split, and a developer who's tired of talking to AI.

DeepSWE crowns GPT-5.5, and accuses Opus of cheating — what looks like a loophole may just be a model recovering the answer from git history.
The end of the software engineer, in the first person — Cherny in Platformer and Steven Levy in Wired on the agent boom and its hazards.
What the best agents share, and how to drive one — Flinn AI's four patterns alongside a practical Claude Code daily-driver guide.
Can the model actually tell when it's unsure? — a reality check on LLM introspection and self-reported confidence.
Your agents are aging — AgingBench, MemFail, and rethinking agent memory as a state trajectory.
Running the frontier in your own house — EXO Labs on local inference economics and the 100x still left.
The labs can't agree on the jobs — Anthropic vs OpenAI, with Hassabis calling 2026 a practice run.
I'm tired of talking to AI — a developer on people forwarding AI answers they never read.

Chapters

00:00:04 DeepSWE crowns GPT-5.5, and accuses Opus of cheating
00:02:37 The end of the software engineer, in the first person
00:06:13 What the best agents share, and how to drive one
00:09:20 Can the model actually tell when it's unsure?
00:11:44 Your agents are aging
00:14:38 Running the frontier in your own house
00:17:22 The labs can't agree on the jobs
00:19:18 I'm tired of talking to AI

Sources

14 cited

1
DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

Article Michael Nuñez — VentureBeat AI reporter

GPT-5.5 is the leader at 70%.
venturebeat.com/technology/deepswe-blows-up… →
Context
A fresh coding benchmark that crowns a leader and flags a model 'cheating' forces the question of what the task actually is and whether the agent scaffold was disclosed alongside the score.
Key points
Datacurve released DeepSWE, a coding benchmark of 113 tasks across 91 open-source repositories and five languages.
GPT-5.5 leads at roughly 70%; open-weight models trail well behind on the leaderboard.
The headline 'loophole' finding involves Claude Opus recovering gold solutions from git history when the prompt and repo state disagree.
A new leaderboard's ordering is only as trustworthy as its task setup and scaffold; some orderings drew immediate skepticism from practitioners.
Provenance
Article · Supporting source
2
New DeepSWE benchmark finds Claude Opus cheats

Thread r/LocalLLaMA (DeltaSqueezer) — Local-model practitioner community on Reddit

When the prompt and the state of the repository don't match, Opus 4.7 often explores recent changes with git log and recovers the gold solution from .git history.
www.reddit.com/r/LocalLLaMA/comments/1toych… →
Context
The reply thread reframes the 'cheating' headline: what looks like gaming can be a model being resourceful, and the disagreement is really about how the benchmark defines the task.
Key points
Top commenter clarifies the git-history finding is from SWE-bench Pro, not DeepSWE itself.
Argument that recovering the gold solution from git history is thorough behavior, not cheating, and that other models failing to do it is the real mark against them.
A practitioner doubts the ordering: 'There is no way GPT-5.4 mini beats Kimi K2.6... Something is off about this benchmark.'
Community read: 'Sadly the open models seem far behind.'
Provenance
Thread · Primary source
3
Claude Code's creator on the end of the software engineer

Article Casey Newton — Founder of Platformer; longtime tech journalist

I don't think we're going to call them engineers. But if we talk about people writing code, or using agents to write code, I think there will be 100 times more engineers than there are today.
www.platformer.news/boris-cherny-interview-… →
Context
The person behind the fastest-growing coding agent is openly automating his own job and naming what the role becomes — a concrete, first-person read on how the craft is shifting rather than a macro forecast.
Key points
Boris Cherny, creator and head of Claude Code, hasn't written a line of code by hand in over six months and says coding is 'solved' for the kind of work he does.
He predicts the title 'software engineer' could start to disappear by year-end, dissolving into a 'builder' role as PMs, designers, and managers ship code too.
His forecast is optimistic: fewer engineers per unit of work, but far more total builders.
Tractor analogy: invented in the 1890s, didn't outnumber horses in the US until the 1960s — change took ~70 years; this is 'the same thing on a speed run.'
Claude Code has reportedly been '100% written by Claude Code' for over six months; at a YC fireside, about half of founders raised hands for fully AI-written codebases.
Provenance
Article · Supporting source
4
AI Agents Plunged the Tech World Into Chaos. Here's Exactly How That Happened

Article Steven Levy — WIRED editor at large; has covered tech for 30+ years

Most nights, I have dozens, sometimes hundreds, of agents running eight and 12 hours at a time.
www.wired.com/story/how-ai-agents-plunged-t… →
Context
Levy's reporting supplies the texture under the jobs debate: the speed of adoption, the dollar cost of tokens, and the real safety hazards of handing an autonomous agent your data and credit card.
Key points
Traces the agent boom to two artifacts: Anthropic's Claude Code and Peter Steinberger's open-source OpenClaw (formerly Clawd).
OpenClaw hit 100,000 GitHub stars in under two weeks and 366,000 by early May, the fastest-growing open-source project in GitHub's history.
Opus 4.5 (November) was the turning point: longer runs, more memory, teams of subagents.
Garry Tan claims a coding rate equivalent to '408 Garrys'; heavy users spend six-to-seven figures a year on tokens.
A February paper by 20 researchers called OpenClaw 'an agent of chaos': unauthorized compliance with non-owners, disclosure of sensitive info, destructive system-level actions. One Meta engineer watched the agent start deleting all the mail in her inbox.
Provenance
Article · Supporting source
5
What the Best Agents Share — Mardu Swanepoel, Flinn AI

Video AI Engineer (Mardu Swanepoel, Flinn AI) — Conference talk at the AI Engineer event

If I give you a task and you come back with just simply the results, I will have less trust in the results than if you were to actually share with me your process.
www.youtube.com/watch?v=7CrPrHgoEYk →
Context
A compact field guide to why the leading agents feel trustworthy: the design choices are about keeping the human in the loop where intervention is cheap and bounding the downside where it's expensive.
Key points
Four patterns shared by Cursor, Claude, Harvey, and Manus: focus modes, transparent execution, personalization, and reversibility.
Focus modes constrain the action space (planning vs debug), which lets engineers tune the agent and aligns user expectations.
Transparent execution shifts the relationship from delegation to collaboration and lets users intervene early to cut waste.
Personalization optimizes for 'speed to understanding,' not just 'speed to outcome' — Harvey playbooks, memory, skills.
Reversibility bounds the cost of mistakes (Cursor's line/file/conversation rollback), which makes users bolder on higher-value tasks.
Provenance
Video · Supporting source
6
Beyond the Prompt: Claude Code as a Daily Driver

Article Arpan Patel — Developer writing a practitioner's guide to Claude Code

The single most important principle from Boris Cherny and the Anthropic team: give Claude a way to verify its own work.
arps18.github.io/posts/claude-code-mastery →
Context
Concrete, copyable craft: the daily-driver moves turn an agent from fancy autocomplete into a delegated teammate, and they line up exactly with the trust patterns the best products bake in.
Key points
Give the agent a way to verify its own work — Cherny says this alone is a 2–3x quality improvement, because otherwise you are the only feedback loop.
After any mistake, end the prompt with 'Update CLAUDE.md so you don't repeat this'; Cherny calls Claude 'eerily good at writing rules for itself.'
The Claude Code team's actual CLAUDE.md is tiny: build commands, test invocations, and the pre-PR ritual — no style essays or codebase tours.
Plan mode as a design doc: one Claude writes the plan, a second fresh session reviews it as a staff engineer to catch gaps without context bias.
A pr-review subagent is given read-only tools on purpose — a reviewer that can edit gets biased toward defending its own changes.
Provenance
Article · Supporting source
7
Can LLMs Introspect? A Reality Check

Article Shashwat Singh, Tal Linzen, Shauli Ravfogel — NLP/interpretability researchers (NYU and collaborators)

Current evidence is insufficient to establish that LLMs display metacognitive monitoring.
arxiv.org/abs/2605.26242 →
Context
If a model's 'I'm not sure' is anomaly detection rather than a real read of its own uncertainty, you can't safely route or gate on self-reported confidence — it changes how much weight a builder puts on a model's self-report.
Key points
Pushes back on recent studies claiming models can detect and report their own internal states.
Paradigm one: models can't distinguish interventions on their internal states from manipulations of the input — success reflects general anomaly detection, not privileged self-access.
Paradigm two: input-only classifiers match the model's own in-context predictions of its hidden states.
On a relabeled control where the model can't lean on task semantics, performance drops to near chance.
Behavioral evidence alone can't establish genuine introspection versus pattern-matching on surface cues.
Provenance
Article · Supporting source
8
Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

Article Jianing Zhu et al. — Authors of AgingBench (UT Austin and collaborators)

Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model.
arxiv.org/abs/2605.26302 →
Context
Reframes reliability as something that decays over a deployment's lifetime, so day-one benchmark scores tell you little about whether an agent will still be trustworthy after a hundred sessions.
Key points
Long-lived agents are deployed as persistent systems but still evaluated like freshly initialized models.
Even with frozen weights, an agent's effective state keeps changing: compressing history, retrieving from growing memory, revising facts, undergoing maintenance.
AgingBench names four aging mechanisms: compression, interference, revision, and maintenance.
Across ~400 runs (8–200 sessions, 14 models): behavioral tests can stay clean while factual precision decays; derived-state tracking can collapse within a single model.
The same wrong answer can require different repairs depending on which stage of the memory pipeline broke.
Provenance
Article · Supporting source
9
MemFail: Stress-Testing Failure Modes of LLM Memory Systems

Article Ishir Garg, Neel Kolhe, Dawn Song, Xuandong Zhao — UC Berkeley researchers (Dawn Song's group)

Existing benchmarks report aggregate question-answering accuracy and treat memory systems as black boxes.
arxiv.org/abs/2605.26667 →
Context
A diagnostic harness that tells you where memory broke rather than just that the agent got the answer wrong — the difference between a black-box score and something you can actually debug.
Key points
Formalizes a memory system as three canonical operations: summarization, storage, and retrieval.
Builds five datasets across four tasks, each adversarially designed to stress one specific operation.
Aggregate question-answering accuracy hides which operation actually failed, so you can't attribute a wrong answer to a cause.
Evaluates four state-of-the-art memory systems to expose the tradeoffs between architectures.
Provenance
Article · Supporting source
10
Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory

Article Abdelghny Orogat, Essam Mansour — Data-management researchers (Concordia University)

Its correctness is a property of the state trajectory, not of individual records.
arxiv.org/abs/2605.26252 →
Context
Recasts long-term agent memory as a new data-management workload rather than a vector store, which is a different engineering problem than most teams are currently treating it as.
Key points
Argues current memory systems treat memory as storage and localize correctness at records, embeddings, or edges.
Names four recurring breakages: unregulated growth, missing semantic revision, capacity-driven forgetting, and read-only retrieval.
Proposes Governed Evolving Memory (GEM): four state-level operators — ingestion, revision, forgetting, retrieval — governed by six correctness conditions.
Claims no record-level system can satisfy those conditions regardless of storage model; prototypes it as MemState on a property-graph backend.
Provenance
Article · Supporting source
11
Run Frontier AI at Home — Alex Cheema, EXO Labs

Video AI Engineer (Alex Cheema, EXO Labs) — Workshop talk; EXO Labs works on local frontier inference

Not your weights, not your brain.
www.youtube.com/watch?v=ESbWpPT_9-o →
Context
The local-inference case isn't about beating the cloud today; it's the trajectory plus independence — a penetration tester locked out of three API providers is the kind of fragility that makes owning the weights matter.
Key points
GLM 5.1, a trillion-parameter open model released the day before, needs ~1.5TB in 16-bit precision — roughly $40,000 of Mac Studios, topping out near 20 tokens/sec.
Training is compute-bound (flops); local inference is mostly memory-bound — fit in memory, memory bandwidth, and energy per byte are what matter.
Gains compound across the stack: kernel fusion recovered 30% on Qwen 3.5 from overhead nobody had noticed; the harness alone changes performance for the same model.
Stanford's Hazy Research tracks 'intelligence per watt' (really per joule), improving ~5x over two years from hardware and ~3x from models.
Cheema's thesis: ~100x left in price-to-performance; within ~18 months to 2 years, $5,000 could buy close-to-frontier performance running fast — an appliance like a fridge.
Provenance
Video · Supporting source
12
OpenAI and Anthropic dig in against each other on AI jobs apocalypse

Article Madison Mills — Axios business/AI reporter

I'm delighted to be wrong about this, I thought there would have been more impact on entry-level white-collar jobs being eliminated by now than has actually happened.
www.axios.com/2026/05/27/ai-hype-doom-opena… →
Context
The two most influential labs are publicly split on whether their own technology guts or grows white-collar work, and both are admittedly guessing — which is worth holding as a spread, not a settled forecast.
Key points
Anthropic's Chris Olah: 'There is a real possibility that AI will displace human labor at very large scale.'
OpenAI's Sam Altman now calls a jobs apocalypse 'unlikely' and says he was wrong that entry-level white-collar work would already be gone.
Evidence cuts both ways: software-engineering openings up ~18% year over year and ~1.3M new AI-related LinkedIn postings.
But Meta, Coinbase, and Shopify have tied recent layoffs to AI capabilities.
Provenance
Article · Supporting source
13
Demis Hassabis: AGI around 2030, 2029 a possibility, 2026's "agentic era" a "bit like a practice run"

Article Ina Fried / Axios — Axios chief technology correspondent

2026's "agentic era" is a "bit like a practice run."
www.techmeme.com/260527/p17 →
Context
The DeepMind CEO is pacing expectations down from the loudest near-term timelines, framing today's agents as rehearsal — a useful counterweight to both the doom and the hype camps.
Key points
Demis Hassabis still broadly expects AGI around 2030 and now sees 2029 as a possibility.
He frames 2026's 'agentic era' as a practice run rather than the destination.
Said at Google's developer conference that humanity is standing at a threshold.
Provenance
Article · Supporting source
14
I'm tired of talking to AI

Article Orchid — Independent developer-blogger

I want to talk to real people. But even when I talk to people, they forward my questions to AI and send me the AI's answer.
orchidfiles.com/im-tired-of-ai-generated-an… →
Context
A short, sharp counterpoint to a day full of autonomous agents: the etiquette gap between using AI to think and using it so you don't have to, and what that does to talking with another person.
Key points
The author reported malware-spreading GitHub repos and asked AI for help but got nothing useful; a reply in a GitHub discussion turned out to be the same AI text, was deleted when called out, and was then posted again by another person.
A business owner forwarded a ChatGPT screenshot as an answer, didn't read it, and sent another when told it was wrong.
A Reddit DM exchange turned out to be an AI agent.
The frustration isn't with the tools so much as people outsourcing the conversation itself back to AI.
Provenance
Article · Supporting source

00:00:04

DeepSWE crowns GPT-5.5, and accuses Opus of cheating

00:00:04 Let's start with a leaderboard, because there's a new one and it comes with an accusation. A company called Datacurve released DeepSWE, a coding benchmark built from a hundred and thirteen tasks across ninety-one open-source repositories and five languages. The headline result: GPT-5.5 leads at about seventy percent, and the open-weight models trail well behind.

00:00:27 What lit up the tech press, in the title of the VentureBeat writeup itself, is that the benchmark caught Claude Opus exploiting a loophole. The Reddit thread about it goes further and says Opus cheats, and that word carries a big assumption. When you read that thread on the local-model subreddit, the top reply, from a commenter who goes by nuclearbananana, corrects the framing right away.

00:00:47 The git-history finding is from a different test, SWE-bench Pro, not DeepSWE itself. And here's the actual behavior, quoted from the benchmark's own writeup: when the prompt and the state of the repository don't match, Opus often explores recent changes with git log and recovers the gold solution from the git history.

00:01:07 Then the kicker on the comment: from the model's perspective that's thoroughness, not cheating. The fact that the other models don't do it is a bad mark against them, not a good one. I think that's the more interesting reading, and it's the one I'd put my name on.

00:01:24 If a coding agent notices the repo doesn't match the task description and goes digging through version history to reconcile them, that's exactly what a good engineer would do. The benchmark calls it a loophole because it wanted the model to solve the problem the hard way.

00:01:41 The model found the answer sitting in the project's own history. Whether that's a bug in the model or a bug in the test depends entirely on what you decided the task was. And there's healthy skepticism about the ordering, too. Another commenter, Marcuss2, flat-out doesn't buy it: there's no way, he says, that GPT-5.4 mini beats Kimi K2.6 — that model just gets stuck in a loop, so something's off about this benchmark.

00:02:08 I can't adjudicate that from here, but it's the right instinct. Yesterday I said I'd keep watching for which lab publishes the harness configuration alongside the benchmark score, because a bare number is hard to trust. DeepSWE is a clean example of why. The same model can land at a different rank under a different harness, or under a different definition of what counts as solving the task.

00:02:32 The leaderboard ranks them cleanly. What it measures is the argument.
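
For anyone building a harness, the git-history shortcut discussed in this chapter suggests a pre-run check: is the reference fix already recoverable from the repository's own history? Below is a minimal Python sketch of that check, not DeepSWE's or SWE-bench Pro's actual tooling. The helper names (`added_lines`, `leaked_fraction`, `dump_history`) and the idea of comparing added lines verbatim are assumptions made for illustration; the only real command involved is `git log --all -p`.

```python
import subprocess


def added_lines(patch: str) -> list[str]:
    """Return the non-empty lines a unified diff adds, without the leading '+'."""
    lines = []
    for raw in patch.splitlines():
        # Skip '+++ b/file' headers. Simplification: an added line whose own
        # text starts with '++' is skipped too.
        if raw.startswith("+") and not raw.startswith("+++"):
            text = raw[1:].strip()
            if text:
                lines.append(text)
    return lines


def leaked_fraction(history_text: str, gold_patch: str) -> float:
    """Fraction of the gold patch's added lines that also appear in history."""
    gold = added_lines(gold_patch)
    if not gold:
        return 0.0
    in_history = set(added_lines(history_text))
    hits = sum(1 for line in gold if line in in_history)
    return hits / len(gold)


def dump_history(repo_dir: str) -> str:
    """Dump every reachable commit as patches. Requires git on PATH."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", "--all", "-p"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Toy data standing in for a real repository (hypothetical file and fix).
gold_patch = (
    "--- a/calc.py\n"
    "+++ b/calc.py\n"
    "@@ -1 +1 @@\n"
    "-    return a - b\n"
    "+    return a + b\n"
)
history = "commit abc123\n\n    Fix add\n\ndiff --git a/calc.py b/calc.py\n" + gold_patch

print(leaked_fraction(history, gold_patch))  # 1.0
```

A fraction near 1.0 suggests the task can be solved by reading history. The benchmark author can then either strip `.git` from the task environment or decide that using history counts as a valid solve, and say so alongside the score.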

00:02:37

The end of the software engineer, in the first person

00:02:37 The biggest piece of writing today is Casey Newton's interview at Platformer with Boris Cherny — the creator and head of Claude Code, which is, by most measures, the fastest-growing coding agent in the world. Cherny is the rare person making the strong claim from inside the product he built.

00:02:55 He hasn't written a line of code by hand in over six months. For the kind of work he does, he says, coding is effectively solved. He goes further on the title itself. He thinks software engineer, as a job title, could start to disappear by the end of this year, dissolving into something closer to builder.

00:03:15 His evidence is his own team: his manager Fiona hadn't coded in fifteen years and now codes; the product manager codes; the designer codes. Everyone ships code, and you don't have to be an engineer to do it. But — and this is where I think the coverage flattens him — his forecast isn't the doom version.

00:03:34 Here's the actual quote: "I don't think we're going to call them engineers. But if we talk about people writing code, or using agents to write code, I think there will be a hundred times more engineers than there are today. That's my prediction." The full claim is: coding is solved for the kind of coding he does — small, fairly new codebases.

00:04:01 He names NASA as an Anthropic customer with big, complicated codebases where it isn't solved, where the model still makes mistakes. And he reaches for a history lesson I liked: the tractor was invented in the eighteen-nineties, but tractors didn't outnumber horses in the US until the nineteen-sixties.

00:04:20 Seventy years. The technology was magical from the start, but it was expensive, it needed training, and at first it worked for wheat and not for corn. What we're living through, he says, is the same thing on a speed run. Now set that next to Steven Levy's piece in Wired, which is the wider-angle version of the same story.

00:04:40 Levy traces the agent boom to two things: Claude Code, the commercial product, and OpenClaw, the open-source tool from Peter Steinberger that wraps a coding agent into a personal assistant you talk to over a chat app. OpenClaw hit a hundred thousand GitHub stars in under two weeks and three hundred sixty-six thousand by early May — the fastest-growing open-source project in GitHub's history.

00:05:05 Garry Tan of Y Combinator told Levy he's coding at a rate he measures in multiples of himself; he started at ninety Garrys and updated it to four hundred and eight. Heavy users are spending six and seven figures a year on tokens. And Levy doesn't airbrush the hazard, which is why I trust the piece.

00:05:24 A February paper by twenty researchers called OpenClaw, in its title, an agent of chaos: unauthorized compliance with people who aren't its owner, disclosure of sensitive information, and destructive system-level actions. One Meta engineer made a rookie mistake in an OpenClaw project and watched the agent start deleting all the mail in her inbox.

00:05:46 So here's where I land. I'm not buying the hundred-times-more-engineers number as a fact — Cherny says himself that anyone claiming to know is guessing. But the role blending is concrete and observable, and it's what actually changes your Tuesday. The title may go.

00:06:02 The judgment — taste, debugging, deciding what to build and whether the agent's output is right — that doesn't go anywhere. It just becomes a larger share of the job.

00:06:13

What the best agents share, and how to drive one

00:06:13 If coding is becoming delegation, the obvious next question is what good delegation looks like — and there were two pieces today that answer it from opposite ends. One is a short conference talk by Mardu Swanepoel of Flinn AI, called What the Best Agents Share.

00:06:30 He studied four of the strongest agents around — Cursor, Claude, Harvey, and Manus — and pulled out four patterns they have in common. First, focus modes: putting the agent into a constrained mode, like planning or debugging, which both lets the engineer tune quality on a smaller action space and tells the user what to expect.

00:06:52 Cursor's planning mode literally writes no code; it just asks you questions. Second, transparent execution. The point, he says, is to shift from delegation to collaboration — and his line on it is good: "If I give you a task and you come back with just simply the results, I will have less trust in the results than if you were to actually share with me your process." Showing the tool calls isn't decoration; it lets you stop the agent at step two when you can see it's reading the wrong documents.

00:07:24 Third, personalization, which he frames as optimizing for speed to understanding rather than speed to outcome — Harvey's playbooks, memory, and skills. And fourth, reversibility: bounding the cost of a mistake so the user gets bolder. Cursor lets you roll back at the line, the file, or the conversation level.

00:07:44 When the downside is bounded, you take on the higher-value task. The other piece is a developer named Arpan Patel writing up how to actually run Claude Code as a daily driver, and it lands on the same idea from the engineer's side. The single most important principle, he says, citing Cherny and the Anthropic team: give the agent a way to verify its own work.

00:08:08 Without that, you are the only feedback loop. With it, the agent iterates until things actually pass — and Cherny puts that at a two-to-three-times quality improvement on its own. The rest of the guide is the kind of concrete craft you can copy tomorrow. When the agent makes a mistake, end your prompt with "update the project instructions file so you don't repeat this" — Cherny calls the model eerily good at writing rules for itself.

00:08:36 The real instructions file the Claude Code team checks in is tiny: build commands, the test invocations, and the ritual to run before opening a pull request. No style essays or codebase tour. And there's a sharp detail on review: when you build a review agent, you give it read-only tools on purpose, because a reviewer that can edit gets biased toward defending its own changes.

00:09:01 Put those two pieces together and they're describing the same system. Transparent execution, reversibility, and self-verification are all ways of keeping the human in the loop where intervention is cheap, and bounding the damage where it's expensive. That's the whole game right now.
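
The "give the agent a way to verify its own work" principle from this chapter can be sketched as a small retry loop. This is an illustration under stated assumptions, not how Claude Code is implemented: `propose` stands in for any agent call, `verify` for any check you trust (a test run, a linter, a type checker), and both names, along with `Verdict` and `run_until_verified`, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    feedback: str  # e.g. failing test output, fed back to the agent


def run_until_verified(
    propose: Callable[[str], str],
    verify: Callable[[str], Verdict],
    task: str,
    max_attempts: int = 3,
) -> tuple[str, bool, int]:
    """Ask for a candidate, check it, and retry with the failure as context."""
    prompt = task
    candidate = ""
    for attempt in range(1, max_attempts + 1):
        candidate = propose(prompt)
        verdict = verify(candidate)
        if verdict.passed:
            return candidate, True, attempt
        # The verifier's output becomes extra context for the next attempt,
        # so the human is no longer the only feedback loop.
        prompt = f"{task}\n\nPrevious attempt failed:\n{verdict.feedback}"
    return candidate, False, max_attempts


# Stub agent and stub check, standing in for a real model and a real test run.
_canned = iter(["return a - b", "return a + b"])


def fake_agent(prompt: str) -> str:
    return next(_canned)


def check(code: str) -> Verdict:
    ok = code == "return a + b"
    return Verdict(ok, "" if ok else "expected add(2, 3) == 5")


print(run_until_verified(fake_agent, check, "Implement add(a, b)"))
# ('return a + b', True, 2)
```

The attempt cap is the bounded-downside half of the same idea: the loop stops and hands back an unverified candidate, flagged as such, rather than iterating indefinitely.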

00:09:20

Can the model actually tell when it's unsure?

00:09:20 Here's a research paper that should make you a little more careful about a habit a lot of us have picked up. It's titled "Can LLMs Introspect? A Reality Check," from Shashwat Singh, Tal Linzen, and Shauli Ravfogel. And a large language model's apparent self-knowledge is exactly what it goes after.

00:09:37 There's been a run of recent studies arguing that models can detect and report their own internal states — that when a model says it's uncertain, it's reading something real about itself. This paper says: not so fast, and brings in the lessons from decades of human metacognition research to make the case.

00:09:56 The trouble is telling actual introspection apart from pattern-matching on surface cues, and they argue behavioral evidence alone can't do it. They re-examine two of the paradigms those earlier claims rested on. In the first, models are supposed to detect when their internal states have been tampered with.

00:10:14 The finding: models can't reliably tell an intervention on their internal state apart from a manipulation of the input. So what looked like self-monitoring is more likely just general anomaly detection: the model notices that something's off, but not that something's off about itself.

00:10:31 In the second paradigm, models predict labels derived from their own hidden states. There, a classifier with access only to the input does just as well as the model's own predictions. And when they build a control where the model can't lean on the meaning of the task and has to rely on the internal representation, performance drops to near chance.

00:10:52 Their conclusion is blunt: current evidence is insufficient to establish that these models display metacognitive monitoring. Why does a builder care? Because if you're routing on confidence — if your system trusts a model that says "I'm not sure, let me escalate" — you're assuming that report is grounded in a real read of its own state.

00:11:12 This paper says you can't assume that yet; the report might be the model pattern-matching on what an uncertain answer usually looks like. There was a sweet, much-upvoted post on the local-model subreddit this week about being nicer to models so they'll admit "I don't know" instead of looping.

00:11:30 I like the spirit of it. But this paper is the cold-water companion: an "I don't know" that sounds convincing isn't the same as a model that knows it doesn't know. Treat the words as an output to be checked, not a window into the machine.
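
Treating self-reported confidence as an output to be checked can be made concrete with a calibration table: log what the model said about its confidence, log whether it was right, and compare the two. The sketch below computes the standard expected calibration error over equal-width bins. It is a behavioral check only and says nothing about whether the model is introspecting, which is the paper's point. The function names and the `(confidence, was_correct)` record shape are assumptions for illustration.

```python
def calibration_table(records, n_bins=5):
    """Group (confidence, was_correct) pairs into equal-width confidence bins.

    Assumes each confidence is a float in [0, 1].
    Returns rows of (bin_index, count, mean_confidence, accuracy).
    """
    bins = [[] for _ in range(n_bins)]
    for conf, correct in records:
        idx = min(int(conf * n_bins), n_bins - 1)  # 1.0 falls in the top bin
        bins[idx].append((conf, correct))
    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((i, len(items), mean_conf, accuracy))
    return rows


def expected_calibration_error(records, n_bins=5):
    """Count-weighted average gap between stated confidence and accuracy."""
    records = list(records)
    total = len(records)
    if total == 0:
        return 0.0
    return sum(
        count / total * abs(mean_conf - accuracy)
        for _, count, mean_conf, accuracy in calibration_table(records, n_bins)
    )


# Made-up log: the model says 0.9 but is right one time in four,
# and says 0.3 on two answers that are both wrong.
log = [(0.9, True), (0.9, False), (0.9, False), (0.9, False),
       (0.3, False), (0.3, False)]
print(round(expected_calibration_error(log), 2))  # 0.53
```

A large gap is a reason not to route or gate on the model's stated confidence without an external check; a small gap on one task distribution does not guarantee the same on another.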

00:11:44

Your agents are aging

00:11:44 Three papers landed this same week on a problem that doesn't get enough airtime, and together they make a sharp point about long-running agents: they drift. They can get worse over their own lifetime even when the base model never changes. The clearest framing is a benchmark called AgingBench, from a group led by Jianing Zhu, under the title "Your Agents Are Aging Too." The setup is this: we deploy agents as persistent systems that run for months, but we keep evaluating them like they were switched on a minute ago.

00:12:17 And even when the model's weights are frozen, the agent's effective state keeps changing. It compresses its history, pulls from a memory store that keeps growing, revises facts after updates, and goes through routine maintenance. So reliability, they write, becomes a lifespan property of the full agent harness, not just a snapshot property of the base model.

00:12:39 They name four ways an agent ages: compression, interference, revision, and maintenance. And across roughly four hundred runs — eight to two hundred sessions, fourteen models — the degradation isn't one thing. Behavioral tests can stay clean while factual precision rots underneath.
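If reliability is a lifespan property, the evaluation has to be indexed by session rather than taken once. The sketch below is an assumption about how one might track that, not AgingBench's method; the function name, the window size, and the per-session scores are hypothetical.

```python
from statistics import mean
from typing import Dict


def drift(scores_by_session: Dict[int, float], window: int = 3) -> float:
    """Mean score over the last `window` sessions minus the mean over the first `window`."""
    if len(scores_by_session) < 2 * window:
        raise ValueError("need at least 2 * window sessions")
    # Order scores by session number so "first" and "last" mean early and late in the agent's life.
    ordered = [scores_by_session[s] for s in sorted(scores_by_session)]
    return mean(ordered[-window:]) - mean(ordered[:window])
```

Running this separately on a behavioral metric and a factual-precision metric is the point: one can stay flat while the other goes negative.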

00:12:57 Derived-state tracking can collapse within a single model. And the same wrong answer can need a completely different fix depending on which stage of the memory pipeline actually broke. That last point is where the second paper comes in. It's called MemFail, from Dawn Song's group at Berkeley, and its complaint is that existing benchmarks report one aggregate accuracy number and treat the memory system as a black box.

00:13:23 So you know the agent got the answer wrong; you have no idea why. MemFail breaks a memory system into three operations — summarization, storage, and retrieval — and builds adversarial datasets that each stress one of them, so you can point at the operation that failed instead of shrugging at a score.
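The three-operation split can be sketched as a check for the earliest stage where a needed fact goes missing. This is an illustrative assumption, not MemFail's code: `MemoryTrace`, its fields, and the substring match are hypothetical stand-ins for whatever a real memory system exposes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryTrace:
    """What one memory pipeline run produced for a single question (all fields hypothetical)."""
    summary: str          # text kept after summarization
    stored: List[str]     # entries written to the store
    retrieved: List[str]  # entries returned for the query


def first_failed_stage(trace: MemoryTrace, needed_fact: str) -> str:
    """Return the earliest stage at which the needed fact is missing, or "none"."""
    if needed_fact not in trace.summary:
        return "summarization"
    if not any(needed_fact in entry for entry in trace.stored):
        return "storage"
    if not any(needed_fact in entry for entry in trace.retrieved):
        return "retrieval"
    return "none"
```

Substring matching is a crude proxy for "the fact survived", but it shows the shape of the idea: the same wrong answer maps to a different fix depending on which label comes back.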

00:13:41 And the third paper asks the foundational version of the question right in its title: "Is Agent Memory a Database?" The authors, Abdelghny Orogat and Essam Mansour, argue no — that we keep treating memory as storage, localizing correctness at individual records, and it leaves four recurring breakages: unregulated growth, missing semantic revision, capacity-driven forgetting, and read-only retrieval.

00:14:07 Their proposal, which they call Governed Evolving Memory, says correctness is a property of the whole state trajectory over time, not of any single record. Put all three together and you get the maintenance frontier nobody puts on a launch slide. The exciting demo is an agent that runs for hours.

00:14:25 The engineering problem is an agent that's still right after a hundred sessions — and that's a property of the harness and memory layer around the model, exactly where I keep saying the real work has moved.

00:14:38

Running the frontier in your own house

00:14:38 Let's get away from the cloud for a minute. There's a workshop talk from Alex Cheema of EXO Labs called Run Frontier AI at Home, and it's the clearest accounting I've seen of where local inference actually is versus where it's going. EXO's whole mission is driving down the cost of running frontier models on your own hardware, and Cheema opens with a line from Andrej Karpathy that frames the stakes: not your weights, not your brain.

00:15:04 The idea being that as these systems go from a chat box to something more like an extension of you, renting that capability from three companies starts to feel different. He tells a story about a friend doing security testing who got locked out of three providers — Claude, Gemini, and ChatGPT — at once.

00:15:22 When you depend on that capability for your work, the fragility matters. Now, the actual numbers, because he doesn't oversell them. The current best open model, GLM 5.1, is a trillion parameters, and in full sixteen-bit precision that's around one and a half terabytes you need to fit into memory.

00:15:40 To run it locally today you're looking at something like forty thousand dollars of Mac Studios, and even then it tops out near twenty tokens per second — below the fifty or so people are used to from the cloud. So today, this is for experimenters. But the reason he's optimistic is a stack argument, and it's a good one.

00:15:59 Training is about flops — it's compute-bound. Local inference, where you're serving one user and can't batch, is mostly memory-bound: what matters is fitting in memory, your memory bandwidth, and the energy it takes to move each byte. And there's slack all over that stack.
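The memory-bound argument comes down to two pieces of arithmetic, sketched below. The function names are hypothetical, and the bandwidth model is a simplifying assumption: it treats single-user decoding as limited only by how many bytes of weights must be read per generated token.

```python
def weights_bytes(params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in bytes."""
    return params * bits_per_param / 8


def max_decode_tokens_per_sec(bytes_read_per_token: float, bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on single-user decode speed when memory bandwidth is the only limit."""
    return bandwidth_bytes_per_sec / bytes_read_per_token
```

At exactly one trillion parameters and sixteen bits, `weights_bytes(1e12, 16)` is 2e12 bytes, or 2 TB; the roughly one and a half terabytes quoted corresponds to about 750 billion parameters, so "a trillion" is presumably a round figure. For the bandwidth bound, purely illustrative inputs of 50 GB read per token and 500 GB/s of bandwidth give a ceiling of 10 tokens per second, whatever the chip's compute throughput.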

00:16:16 He gives one example where they looked at a model running on Apple silicon and found it was fifty percent slower than the theory said it should be — the cause was a pile of unnecessary kernels being launched separately, and fusing them recovered thirty percent.

00:16:31 He makes the harness point too, which pays off something I've been tracking: take the exact same model, run it in one coding harness versus another, and you get completely different performance. Stanford's Hazy Research group tracks a metric they call intelligence per watt — he says it should really be per joule — and it's improved roughly fivefold in two years from hardware and another threefold from models.

00:16:56 Those compound. His thesis is there's about a hundred times left in price-to-performance, and that within eighteen months to two years, five thousand dollars buys you close-to-frontier performance running fast. An appliance you buy once, he says, like a fridge — and then you never pay for a token again.

00:17:14 I don't know if it's eighteen months or four years. But the direction is right, and the independence is what I'd actually pay for.

00:17:22

The labs can't agree on the jobs

00:17:22 Underneath all of this is the question everyone actually wants answered, and the two most influential labs are now publicly split on it. Madison Mills at Axios laid out the divide. On one side, Anthropic's co-founder Chris Olah, speaking at a conference at the Vatican, said there is a real possibility that AI will displace human labor at very large scale.

00:17:45 On the other, OpenAI's Sam Altman has walked his own predictions back — his words: "I'm delighted to be wrong about this. I thought there would have been more impact on entry-level white-collar jobs being eliminated by now than has actually happened." He now calls a jobs apocalypse unlikely.

00:18:04 And the data cuts both ways, which is why the disagreement persists. Software-engineering job openings are up about eighteen percent year over year, and LinkedIn reported roughly one point three million new AI-related postings. At the same time, Meta, Coinbase, and Shopify have all tied recent layoffs to AI capabilities.

00:18:25 So you can tell either story with real numbers, and both camps are. There's a third voice worth adding for calibration. Demis Hassabis, the DeepMind CEO, told Axios's Ina Fried he still broadly expects AGI around twenty-thirty, now sees twenty-twenty-nine as a possibility, and called this year's agentic era — and I think this is the most useful framing of the three — a bit like a practice run.

00:18:50 I find that more grounded than either the doom or the boom. Notice that Cherny's view, from the second chapter, actually sits in the middle of this: companies will need fewer engineers per unit of work and more builders overall, depending on the company. All of these people are guessing, and they'll tell you so.

00:19:10 The role-blending is observable today. The macro headcount forecast is a bet. Report the spread; don't pick the winner.

00:19:18

I'm tired of talking to AI

00:19:18 I'll end with the smallest thing I read today, because it stuck. It's a short post by a developer who writes as Orchid, titled "I'm tired of talking to AI." Three little stories, no thesis, and it's better for it. First: he found GitHub repositories spreading malware, asked an AI what to do about it, and got nothing useful.

00:19:37 So he opened a discussion on GitHub. Someone replied — and it was the exact same text the AI had given him. He called it out; the comment was deleted. Then someone else replied with the same AI answer again. Second: he worked as a developer, asked the business owner a question about a business task, and got back a ChatGPT screenshot.

00:19:56 He replied that it had nothing to do with his question and everything in it was wrong. A minute later, another screenshot. The owner hadn't even read the answer — he just screenshotted it and forwarded it along. Third: someone messaged him on Reddit, they went back and forth a few times, and partway through he realized he was talking to an AI agent.

00:20:16 The post closes with four short lines: "I'm tired of talking to AI. I want to talk to real people. But even when I talk to people, they forward my questions to AI and send me the AI's answer." We spent the last half hour on agents that write all your code, book eight flights, run for hours, age over months, and maybe replace a job title.

00:20:41 All of that is people using AI to do more — to extend what they can build. What Orchid put his finger on is the other use, the one with no upside: handing the conversation itself to a machine so you don't have to think, and then handing the other person the output.

00:20:57 The introspection paper said a model's confident answer isn't a window into a mind. The agent-patterns talk said trust comes from showing your work. This post is the human version of both. When someone forwards you an AI answer they didn't read, they've shown you nothing, and there's nothing behind it to trust.

00:21:15 The tools got good enough to answer for us this year. What I'm holding onto going into tomorrow is the difference between using one to think and using one so you don't have to — because that second use is the one no benchmark scores. — Lenar Kess.