Following last week's 'I don't write code anymore' thread: Boris Cherny, who created and runs Claude Code, says he hasn't hand-written a line in more than six months and thinks the title 'software engineer' could start to fade by year-end into a broader 'builder' role as PMs, designers, and managers ship code too. He expects fewer engineers per unit of work, and far more builders overall.
◆ Braid Daily · 2026-05-27
Claude Code's creator says the title 'software engineer' may fade
The person behind the fastest-growing coding agent hasn't hand-written code in more than six months, and expects far more builders, not fewer.
The lead
The role, and who's still arguing about it
OpenAI and Anthropic dig in on the jobs question
Axios (Madison Mills)
Anthropic's Chris Olah warns of 'a real possibility that AI will displace human labor at very large scale,' while OpenAI's Sam Altman now calls a jobs apocalypse 'unlikely.' The labor data they each point to cuts both ways: software-engineering openings up about 18% year over year, set against AI-tied layoffs at Meta, Coinbase, and Shopify.
“I'm delighted to be wrong about this, I thought there would have been more impact on entry-level white-collar jobs being eliminated by now than has actually happened.”
Hassabis calls 2026 a 'practice run'
Axios / Techmeme (Ina Fried)
Demis Hassabis still expects artificial general intelligence around 2030, now sees 2029 as possible, and frames today's agentic era as rehearsal rather than the destination. It's a deliberate pacing-down from the nearest-term timelines.
“2026's "agentic era" is a "bit like a practice run."”
How agents 'plunged the tech world into chaos'
WIRED (Steven Levy)
Steven Levy traces the boom to two artifacts: Anthropic's Claude Code and Peter Steinberger's open-source OpenClaw, which hit 100,000 GitHub stars in under two weeks. A February paper by 20 researchers documents the hazard side, from unauthorized actions to leaked sensitive data, including one engineer who watched an agent delete all her mail.
“Most nights, I have dozens, sometimes hundreds, of agents running eight and 12 hours at a time.”
What a coding benchmark actually measures
DeepSWE crowns GPT-5.5, flags Opus for a benchmark loophole
VentureBeat (Michael Nuñez)
Datacurve's new coding benchmark spans 113 tasks across 91 open-source repositories and five languages, with GPT-5.5 out front and open-weight models trailing. The headline finding: Claude Opus recovers gold solutions from git history when the prompt and the repo state disagree.
“GPT-5.5 is the leader at 70%.”
The reply thread reframes the 'cheating'
r/LocalLLaMA (DeltaSqueezer)
Practitioners push back: the git-history behavior comes from SWE-bench Pro, not DeepSWE, and reading git log to recover the gold solution might be resourceful rather than cheating. Several also doubt the ordering, with one calling out a result that has a mini model beating Kimi K2.6.
“When the prompt and the state of the repository don't match, Opus 4.7 often explores recent changes with git log and recovers the gold solution from .git history.”
Designing agents people trust
What the best agents share
AI Engineer (Mardu Swanepoel, Flinn AI)
A field guide drawn from Cursor, Claude, Harvey, and Manus: focus modes that constrain the action space, transparent execution, personalization, and reversibility. The common thread is keeping a human in the loop where intervention is cheap and bounding the downside where it's expensive.
“If I give you a task and you come back with just simply the results, I will have less trust in the results than if you were to actually share with me your process.”
Claude Code as a daily driver
Arpan Patel
A practitioner's notes that line up with those same trust patterns: give the agent a way to verify its own work, keep the project's CLAUDE.md file tiny, and hand the review subagent read-only tools so it doesn't end up defending its own edits. Cherny's claim is that self-verification alone is a 2 to 3x quality gain.
“The single most important principle from Boris Cherny and the Anthropic team: give Claude a way to verify its own work.”
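For readers who want the "verify its own work" idea in concrete terms, here is a minimal sketch of a propose-check-retry loop. It is an illustration only, not Claude Code's mechanism or anything from Patel's notes: `run_with_verification`, `propose`, and `verify` are hypothetical names, and the check could be anything that returns pass or fail plus a message (a test suite, a linter, a type checker).

```python
from typing import Callable, Optional, Tuple


def run_with_verification(
    propose: Callable[[Optional[str]], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Request a candidate, check it, and feed any failure back in.

    propose: hypothetical stand-in for the agent; receives the last
             failure message (None on the first attempt).
    verify:  hypothetical stand-in for the check; returns (passed, message).
    Returns (accepted candidate or None, number of attempts used).
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = propose(feedback)
        passed, message = verify(candidate)
        if passed:
            return candidate, attempt
        # Carry the failure message into the next proposal.
        feedback = message
    return None, max_attempts
```

The design choice worth noting is that the check sits outside the proposer: the loop only accepts a candidate when the independent `verify` step passes, and it gives up after `max_attempts` instead of looping forever.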
Reliability you can't read off a day-one score
Your agents are aging too
AgingBench (Jianing Zhu et al.)
Long-lived agents get deployed as persistent systems but still get evaluated like freshly initialized models. Across about 400 runs, behavioral tests stayed clean while factual precision decayed, so a day-one score tells you little about session one hundred.
“Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model.”
MemFail: where the memory broke, not just that it broke
UC Berkeley (Dawn Song's group)
Formalizes a memory system as three operations (summarization, storage, and retrieval), then builds adversarial datasets that stress each one. The point is attribution: aggregate question-answering accuracy hides which operation actually failed.
“Existing benchmarks report aggregate question-answering accuracy and treat memory systems as black boxes.”
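To make the attribution idea concrete, here is a toy sketch that probes a memory pipeline one operation at a time and names the first stage that loses a fact. It is not MemFail's method or datasets: `MemorySystem`, `make_toy_memory`, and `attribute_failure` are hypothetical, and the character limits and keyword-overlap retrieval are assumptions chosen only to make each stage fail on demand.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MemorySystem:
    summarize: Callable[[str], str]
    store: Callable[[str], None]
    dump: Callable[[], List[str]]          # everything currently stored
    retrieve: Callable[[str], List[str]]   # records matching a query


def make_toy_memory(summary_chars: int, record_chars: int) -> MemorySystem:
    """Toy memory: truncating summarizer, truncating store, keyword retrieval."""
    records: List[str] = []

    def summarize(text: str) -> str:
        return text[:summary_chars]

    def store(summary: str) -> None:
        records.append(summary[:record_chars])

    def dump() -> List[str]:
        return list(records)

    def retrieve(query: str) -> List[str]:
        words = set(query.lower().split())
        return [r for r in records if words & set(r.lower().split())]

    return MemorySystem(summarize, store, dump, retrieve)


def attribute_failure(memory: MemorySystem, source_text: str,
                      fact: str, query: str) -> str:
    """Return the first operation that loses `fact`, or "ok" if none does."""
    summary = memory.summarize(source_text)
    if fact not in summary:
        return "summarization"
    memory.store(summary)
    if not any(fact in record for record in memory.dump()):
        return "storage"
    if not any(fact in record for record in memory.retrieve(query)):
        return "retrieval"
    return "ok"
```

With the source text "The deploy key rotates on Friday." and the fact "Friday", a 10-character summarizer is blamed on summarization, a 10-character store on storage, and a query with no shared keywords on retrieval. A single end-to-end accuracy number would score all three as the same miss.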
Is agent memory a database?
Concordia University (Orogat, Mansour)
Argues long-term agent memory is a data-management workload rather than a vector store: four state-level operators governed by six correctness conditions, with the claim that no record-level system can satisfy them regardless of storage model.
“Its correctness is a property of the state trajectory, not of individual records.”
Can models introspect? A reality check
NYU and collaborators (Singh, Linzen, Ravfogel)
Pushes back on claims that models can report their own internal states. On a relabeled control where the model can't lean on task semantics, performance drops to near chance, so a model's 'I'm not sure' may be anomaly detection rather than a real read of its own uncertainty.
“Current evidence is insufficient to establish that LLMs display metacognitive monitoring.”
Off the cloud, and off the AI
Run frontier AI at home
AI Engineer (Alex Cheema, EXO Labs)
Alex Cheema makes the case for local inference on two grounds: where price-to-performance is heading, and owning the weights. Running the new trillion-parameter GLM 5.1 today takes roughly $40,000 of Mac Studios. He argues there's about 100x of price-to-performance improvement left, and that within two years $5,000 could buy close-to-frontier speed.
“Not your weights, not your brain.”
I'm tired of talking to AI
Orchid
A sharp counterpoint to a day of autonomous agents: the writer keeps reaching people who forward an AI's answer back, sometimes without reading it. The frustration lands on the people who outsource the conversation itself rather than on the tools.
“I want to talk to real people. But even when I talk to people, they forward my questions to AI and send me the AI's answer.”
Companion episode
Coding is solved, the rest isn't
Four days running, the engineer's job has been the throughline, from Tuesday's 'I don't write code anymore' to the creator of Claude Code saying the title itself may fade by year-end. The people closest to the tools and the people forecasting from the top still don't agree on where it lands, and both camps say they're guessing.