◆ Braid Daily · 2026-05-27

Claude Code's creator says the title 'software engineer' may fade

The person behind the fastest-growing coding agent hasn't hand-written code in six months, and expects far more builders, not fewer.

The lead

Following last week's 'I don't write code anymore' thread: Boris Cherny, who created and runs Claude Code, says he hasn't hand-written a line of code in more than six months. He thinks the title 'software engineer' could start to fade by year-end, giving way to a broader 'builder' role as PMs, designers, and managers ship code too. He expects fewer engineers per unit of work and far more builders overall.

The role, and who's still arguing about it

OpenAI and Anthropic dig in on the jobs question

Axios (Madison Mills)

Anthropic's Chris Olah warns of 'a real possibility that AI will displace human labor at very large scale,' while OpenAI's Sam Altman now calls a jobs apocalypse 'unlikely.' The labor data they each point to cuts both ways: software-engineering openings up about 18% year over year, set against AI-tied layoffs at Meta, Coinbase, and Shopify.

“I'm delighted to be wrong about this, I thought there would have been more impact on entry-level white-collar jobs being eliminated by now than has actually happened.”

Hassabis calls 2026 a 'practice run'

Axios / Techmeme (Ina Fried)

Demis Hassabis still expects artificial general intelligence around 2030, now sees 2029 as possible, and frames today's agentic era as rehearsal rather than the destination. It's a deliberate pacing-down from the nearest-term timelines.

“2026's "agentic era" is a "bit like a practice run."”

How agents 'plunged the tech world into chaos'

WIRED (Steven Levy)

Steven Levy traces the boom to two artifacts: Anthropic's Claude Code and Peter Steinberger's open-source OpenClaw, which hit 100,000 GitHub stars in under two weeks. A February paper by 20 researchers documents the hazard side, from unauthorized actions to leaked sensitive data, including one engineer who watched an agent delete all her mail.

“Most nights, I have dozens, sometimes hundreds, of agents running eight and 12 hours at a time.”

What a coding benchmark actually measures

DeepSWE crowns GPT-5.5, flags Opus for a benchmark loophole

VentureBeat (Michael Nuñez)

Datacurve's new coding benchmark spans 113 tasks across 91 open-source repositories and five languages, with GPT-5.5 out front and open-weight models trailing. The headline finding: Claude Opus recovers gold solutions from git history when the prompt and the repo state disagree.

“GPT-5.5 is the leader at 70%.”

The reply thread reframes the 'cheating'

r/LocalLLaMA (DeltaSqueezer)

Practitioners push back: the git-history behavior comes from SWE-bench Pro, not DeepSWE, and reading git log to recover the gold solution might be resourceful rather than cheating. Several also doubt the ordering, with one calling out a result that has a mini model beating Kimi K2.6.

“When the prompt and the state of the repository don't match, Opus 4.7 often explores recent changes with git log and recovers the gold solution from .git history.”

Designing agents people trust

What the best agents share

AI Engineer (Mardu Swanepoel, Flinn AI)

A field guide drawn from Cursor, Claude, Harvey, and Manus: focus modes that constrain the action space, transparent execution, personalization, and reversibility. The common thread is keeping a human in the loop where intervention is cheap and bounding the downside where it's expensive.

“If I give you a task and you come back with just simply the results, I will have less trust in the results than if you were to actually share with me your process.”

Claude Code as a daily driver

Arpan Patel

A practitioner's notes that line up with those same trust patterns: give the agent a way to verify its own work, keep the project's CLAUDE.md file tiny, and hand the review subagent read-only tools so it doesn't end up defending its own edits. Cherny's claim is that self-verification alone yields a 2 to 3x quality gain.

“The single most important principle from Boris Cherny and the Anthropic team: give Claude a way to verify its own work.”
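
The read-only reviewer idea can be sketched in a few lines. This is a minimal illustration, not Claude Code's actual configuration or API: the ToolBox class, the tool names, and the in-memory files dict are all hypothetical, chosen only to show an allowlist that lets a reviewer read but not write.

```python
from typing import Callable, Dict, Set


class ToolBox:
    """Hypothetical tool registry that exposes only an allowlisted subset."""

    def __init__(self, tools: Dict[str, Callable[..., str]], allowed: Set[str]):
        self._tools = tools
        self._allowed = allowed

    def call(self, name: str, *args: str) -> str:
        # Refuse any tool that is not on this agent's allowlist.
        if name not in self._allowed:
            raise PermissionError(f"tool {name!r} is not available to this agent")
        return self._tools[name](*args)


# In-memory stand-in for a project directory (assumption for the sketch).
files = {"app.py": "print('hi')"}


def read_file(path: str) -> str:
    return files[path]


def write_file(path: str, text: str) -> str:
    files[path] = text
    return "ok"


ALL_TOOLS = {"read_file": read_file, "write_file": write_file}

# The editing agent can read and write; the reviewer can only read.
editor = ToolBox(ALL_TOOLS, allowed={"read_file", "write_file"})
reviewer = ToolBox(ALL_TOOLS, allowed={"read_file"})
```

With this split, reviewer.call("write_file", ...) raises PermissionError, so the reviewer can only report problems rather than patch them and then vouch for its own patch.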

Reliability you can't read off a day-one score

Your agents are aging too

AgingBench (Jianing Zhu et al.)

Long-lived agents get deployed as persistent systems but still get evaluated like freshly initialized models. Across about 400 runs, behavioral tests stayed clean while factual precision decayed, so a day-one score tells you little about session one hundred.

“Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model.”
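
One way to picture the gap between a day-one score and a lifespan view is to compare early and late sessions for each metric. The sketch below is an illustration only: lifespan_report, the window size, and both score series are made up for the example and are not AgingBench's method or data.

```python
from statistics import mean
from typing import Dict, List


def lifespan_report(scores_by_session: List[float], window: int = 10) -> Dict[str, float]:
    """Compare the mean of the first `window` sessions with the last `window`."""
    if len(scores_by_session) < 2 * window:
        raise ValueError("need at least two full windows of sessions")
    early = mean(scores_by_session[:window])
    late = mean(scores_by_session[-window:])
    return {"early": early, "late": late, "drift": late - early}


# Hypothetical series: behavioral checks stay flat, factual precision decays.
behavioral = [1.0] * 100
factual = [0.95 - 0.002 * i for i in range(100)]

behavioral_report = lifespan_report(behavioral)
factual_report = lifespan_report(factual)
```

In this toy data the behavioral drift is zero while the factual drift is clearly negative, which is the pattern a single early snapshot would miss.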

MemFail: where the memory broke, not just that it broke

UC Berkeley (Dawn Song's group)

Formalizes a memory system as three operations (summarization, storage, and retrieval), then builds adversarial datasets that stress each one. The point is attribution: aggregate question-answering accuracy hides which operation actually failed.

“Existing benchmarks report aggregate question-answering accuracy and treat memory systems as black boxes.”
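
To make the attribution idea concrete, here is a rough sketch that walks a single memory trace through the three operations and names the first one that lost the fact. The Trace fields, the attribute_failure function, and the substring check are assumptions made for illustration; they are not the paper's formalism or datasets.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    fact: str             # ground-truth fact the question depends on
    summary: str          # what the summarizer produced
    stored: List[str]     # entries present in the store after writing
    retrieved: List[str]  # entries returned for the question


def attribute_failure(t: Trace) -> str:
    """Return the first operation that dropped the fact, or 'ok' if none did."""
    # Naive substring matching stands in for a real correctness check.
    if t.fact not in t.summary:
        return "summarization"
    if not any(t.fact in entry for entry in t.stored):
        return "storage"
    if not any(t.fact in entry for entry in t.retrieved):
        return "retrieval"
    return "ok"


# Hypothetical trace: the fact survives summarization and storage
# but the retriever returns an unrelated entry.
trace = Trace(
    fact="flight is on Friday",
    summary="User said the flight is on Friday.",
    stored=["User said the flight is on Friday.", "User likes tea."],
    retrieved=["User likes tea."],
)
result = attribute_failure(trace)
```

An aggregate accuracy number would record this trace as one wrong answer; the per-operation check records it as a retrieval failure.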

Is agent memory a database?

Concordia University (Orogat, Mansour)

Argues long-term agent memory is a data-management workload rather than a vector store: four state-level operators governed by six correctness conditions, with the claim that no record-level system can satisfy them regardless of storage model.

“Its correctness is a property of the state trajectory, not of individual records.”

Can models introspect? A reality check

NYU and collaborators (Singh, Linzen, Ravfogel)

Pushes back on claims that models can report their own internal states. On a relabeled control where the model can't lean on task semantics, performance drops to near chance, so a model's 'I'm not sure' may be anomaly detection rather than a real read of its own uncertainty.

“Current evidence is insufficient to establish that LLMs display metacognitive monitoring.”

Off the cloud, and off the AI

Run frontier AI at home

AI Engineer (Alex Cheema, EXO Labs)

Alex Cheema makes the case for local inference on two grounds: where price-to-performance is heading, and owning the weights. Running the new trillion-parameter GLM 5.1 today takes roughly $40,000 of Mac Studios. He argues there's about 100x of price-to-performance improvement still to come, and that within two years $5,000 could buy close-to-frontier speed.

“Not your weights, not your brain.”
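
A back-of-envelope check on those figures, under two loud assumptions that are not in the talk as summarized here: that the drop from $40,000 to $5,000 is a like-for-like comparison, and that the improvement compounds at a constant yearly rate. The function names are made up for the sketch.

```python
import math


def annual_factor(start_cost: float, end_cost: float, years: float) -> float:
    """Constant yearly improvement factor implied by a cost drop over `years`."""
    return (start_cost / end_cost) ** (1 / years)


def years_to_reach(total_factor: float, per_year: float) -> float:
    """Years needed to compound `per_year` up to `total_factor`."""
    return math.log(total_factor) / math.log(per_year)


# $40,000 to $5,000 is an 8x drop over two years.
rate = annual_factor(40_000, 5_000, 2)
horizon = years_to_reach(100, rate)

print(round(rate, 2))     # about 2.83x per year
print(round(horizon, 1))  # about 4.4 years to reach 100x at that pace
```

Read this way, the two-year target would use up 8x of the claimed 100x, leaving roughly 12x beyond it if the same pace held.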

I'm tired of talking to AI

Orchid

A sharp counterpoint to a day of autonomous agents: the writer keeps reaching out to people who simply forward back an AI's answer, sometimes without reading it. The frustration lands on the people who outsource the conversation itself rather than on the tools.

“I want to talk to real people. But even when I talk to people, they forward my questions to AI and send me the AI's answer.”

Companion episode

Coding is solved, the rest isn't

2026-05-27 · 00:21:38

Four days running, the engineer's job has been the throughline, from Tuesday's 'I don't write code anymore' to the creator of Claude Code saying the title itself may fade by year-end. The people closest to the tools and the people forecasting from the top still don't agree on where it lands, and both camps say they're guessing.