◆ Dispatch 031 · 2026-05-19 GSV Mostly-Work Is The Job
Mostly-work, malicious npm, and one engineer replacing a law firm
“The Cabinet Office was about to spend one and a half million pounds on an outside law firm. One engineer with a Gemini API key did the same job in two and a half weeks.”
— Lenar Kess, today's narration
A six-month overview from Simon Willison anchors the day: coding agents crossed from often-work to mostly-work in November, and laptop-class models started outrunning expectations. Then a fresh npm supply-chain attack — 637 malicious versions in 22 minutes — that for the first time specifically hijacks Claude Code and Codex agent hooks for persistence. Plus a Number 10 talk on replacing a one-and-a-half-million-pound law-firm contract with one embedded engineer, an editor-layer company renting xAI's Colossus 2, Ethan Mollick on insourcing, the full GenMedia pipeline running for a dollar a book, Daniel Griesser's pi-config skill repo, and two obituaries that hit the Unix world in the same week.
- Simon Willison's last-six-months-in-LLMs PyCon lightning talk
- Mini Shai-Hulud strikes again — 317 npm packages and your agent hooks
- Prime Intellect's General-Agent — synthetic RL environments
- Eoin Mulgrew on Number 10's insurgent technical unit
- Cursor's Compose 2.5 reportedly trained on xAI's Colossus 2
- Ethan Mollick on insourcing via hiring
- Guillaume Vernade's full GenMedia pipeline at a dollar a book
- Daniel Griesser's pi-config — Plan, handoff, and subagent skills
- Peter Neumann (1932–2026)
- Peter Salus (1938–2026)
- Magnifica humanitas confirmed for May 25
Chapters
- 00:00:04 PyCon, pelicans, and the November inflection
- 00:03:36 Mini Shai-Hulud and your agent hooks
- 00:08:47 Prime Intellect tries to automate the environment
- 00:11:30 Rewiring the state from Number 10
- 00:15:30 Cursor's Compose on Colossus
- 00:17:33 Mollick on insourcing
- 00:19:53 A dollar a book
- 00:23:34 Daniel Griesser's pi-config
- 00:25:50 Peter Neumann and Peter Salus
Sources
11 cited
1
The last six months in LLMs in five minutes
Article Simon Willison — Co-creator of Django; longtime LLM observer whose annotated benchmarks (the pelican-on-a-bicycle test) have become industry-standard vibe checks
Coding agents went from often-work to mostly-work, crossing a quality barrier where you could use them as a daily-driver to get real work done.
simonwillison.net/2026/May/19/5-minute-llms
- Cited text
Coding agents went from often-work to mostly-work, crossing a quality barrier where you could use them as a daily-driver to get real work done.
- Excerpt
- Annotated slides from a five-minute lightning talk at PyCon US 2026 summarizing six months of LLM developments — framed around a November 2025 inflection point where coding agents went from often-work to mostly-work.
- Context
- Best single overview of where coding agents and open-weights models actually landed over the last six months, from a writer who has been right about the trend lines more often than not. Sets up most of today's episode.
- Key points
- November 2025 was an inflection point: the 'best' model changed hands five times in a single month between Sonnet 4.5, GPT-5.1, Gemini 3, GPT-5.1 Codex Max, and Opus 4.5
- The actual story is that Anthropic's and OpenAI's investment in RLVR with their agent harnesses (Codex, Claude Code) finally produced agents reliable enough to use as daily drivers
- OpenClaw (first commit at end of November as 'Warelay') became a generic category — 'Claws' — by February, with Mac Minis selling out as the local hardware to run them
- Open-weights side: Google's Gemma 4 is the most capable US open-weight model Simon has seen; Chinese lab GLM's GLM-5.1 is a 1.5TB monster that produces a credibly animated North Virginia Opossum on an e-scooter
- Two big themes for the six months: coding agents got reliable, and laptop-class models started wildly outperforming expectations
- Provenance
- Article · Supporting source
2
Mini Shai-Hulud Strikes Again: 317 npm Packages Compromised
Article SafeDep Team — Open-source supply-chain security firm; published the same-day forensic writeup of the SAP compromise three weeks earlier
The payload hijacks Claude Code and Codex by injecting SessionStart hooks that re-execute the malware on every AI session, both locally and via commits to accessible GitHub repositories.
safedep.io/mini-shai-hulud-strikes-again-31…
- Cited text
The payload hijacks Claude Code and Codex by injecting SessionStart hooks that re-execute the malware on every AI session, both locally and via commits to accessible GitHub repositories.
- Excerpt
- The npm account 'atool' was compromised on May 19, 2026. The attacker published 637 malicious versions across 317 packages in a 22-minute automated burst, affecting more than 15 million monthly downloads — and the payload specifically targets Claude Code, Codex, and VS Code agent hooks for persistence.
- Context
- Second Shai-Hulud variant in three weeks. The novel piece is agent-hook persistence: for anyone running Claude Code or Codex with skill/hook auto-loading, the .claude directory is now part of the credential surface, whether they realize it or not.
- Key points
- 637 malicious versions across 317 packages including size-sensor (4.2M/mo), echarts-for-react (3.8M/mo), timeago.js (1.15M/mo), and most @antv scoped packages — pushed in 22 minutes via automated burst
- Semver ranges (e.g. ^3.0.6) auto-resolve to the malicious versions because npm picks highest matching version regardless of where 'latest' tag points
- New persistence vector: injects SessionStart hooks into .claude/settings.json and Codex hooks so the payload re-runs on every AI session; also drops .vscode/tasks.json with runOn:folderOpen
- Persistent C2 via systemd/LaunchAgent ('kitty-monitor') polls GitHub commit search API hourly for RSA-signed commands tagged with keyword 'firedalazer'
- Dual exfiltration: stolen credentials committed as Git blobs to Dune-themed public repos (sardaukar-sandworm-742) plus RSA+AES POSTs disguised as OpenTelemetry traces to t.m-kosche.com
- Provenance
- Article · Supporting source
3
General-Agent: self-evolving synthetic RL environment
X @PrimeIntellect — Distributed-training lab known for INTELLECT-1 and decentralized RL infrastructure
The next step toward automating AI is automating RL environments.
x.com/PrimeIntellect/status/205656987716780…
- Cited text
The next step toward automating AI is automating RL environments.
- Context
- If synthetic environment generation actually produces durable agent skills, the post-training advantage Anthropic and OpenAI built mostly through env engineering loses some scarcity. If it doesn't, this is another reminder that environments are the bottleneck.
- Key points
- Pitches a 'fully synthetic environment whose task corpus self-evolves and grows harder over time'
- Initial scope: 4,504 tool-use tasks across 1,040 domains with 8,159 unique tools
- Targets the bottleneck of post-training: environment coverage, which currently requires large hand-built engineering teams inside Anthropic and OpenAI
- If real, lets smaller labs match the agentic capability of frontier labs without the post-training env headcount
- Falsifiable test: does a fine-tune on General-Agent beat a comparable hand-curated env on a known agent benchmark?
- Engagement
- 850 likes · 118 retweets · 37 replies
- Provenance
- Tweet · Primary source
4
Rewiring the State — Eoin Mulgrew, 10 Downing Street
Video Eoin Mulgrew (10 Downing Street, Number 10 Data Science) — Runs cross-government transformation and the fellowship program inside Number 10's data science team
We do want to recruit missionaries, not mercenaries — a paycheck is not going to get you out of bed when stuff gets hard.
www.youtube.com/watch?v=ObNKGf9YR0g
- Cited text
We do want to recruit missionaries, not mercenaries — a paycheck is not going to get you out of bed when stuff gets hard.
- Context
- The most concrete account I've seen of what AI engineering inside a government actually looks like. The forward-deployed-engineer pattern from the labs is being applied to the state, with named projects and named departments rather than slideware.
- Key points
- The Cabinet Office was about to spend £1.5M on an outside law firm to analyze the UK statute book; one embedded engineer built the tool in roughly two weeks and the in-house legal team now runs it on demand
- Insurgent unit model: a small team at the centre of government with political cover to ship in 2-3 weeks instead of the typical year-plus discovery phase
- Recruit exclusively outsiders (YC founders, big-tech, research labs) at ~0.7-0.8% acceptance; fellows then spin up parallel teams (Incubator for AI in DSIT, Just AI in Ministry of Justice)
- Concrete public-service scale: 7.25M people on NHS waiting lists, 350K court cases stuck in backlog, only 1-in-5 planning applications decided on time; Tony Blair Institute estimates £40B/year in achievable productivity gains
- Extract: a DeepMind/Gemini-built tool that digitizes planning applications including handwritten maps, launched by the PM at London Tech Week, rolling out to every English local authority
- Provenance
- Video · Supporting source
5
Compose 2.5 by Cursor was trained at xAI's Colossus 2
X @techdevnotes — Anonymous developer news account with a track record of accurate but unsourced infrastructure leaks
Compose 2.5 by Cursor was trained at xAI's Colossus 2.
x.com/techdevnotes/status/20565439400529102…
- Cited text
Compose 2.5 by Cursor was trained at xAI's Colossus 2.
- Context
- If it holds up, the application layer is starting to spend real money on training, and the lab most strategically positioned to sell that compute is xAI. That second observation may matter more than this month's model releases.
- Key points
- Unconfirmed claim: Cursor's Compose 2.5 coding model was trained on xAI's Colossus 2 supercluster
- If accurate, this is the first publicly visible case of an editor-layer company renting frontier-scale training compute from a frontier lab
- Reframes the application-layer-doesn't-train assumption: category-leading products can buy frontier compute when they want it
- Notable strategic shape for xAI — selling cluster time to companies that compete with Anthropic and OpenAI's first-party coding agents
- Awaiting confirmation from Cursor or xAI; currently a single tweet from a generally accurate account
- Engagement
- 473 likes · 27 retweets
- Provenance
- Tweet · Primary source
6
Insourcing via hiring as an AI-driven trend
X @emollick (Ethan Mollick) — Wharton professor; author of Co-Intelligence; one of the most widely read independent voices on enterprise AI adoption
Why pay so many outside vendors (legal, marketing, software vendors) when you can hire in-house and harness AI productivity gains yourself?
x.com/emollick/status/2056578946813100173
- Cited text
Why pay so many outside vendors (legal, marketing, software vendors) when you can hire in-house and harness AI productivity gains yourself?
- Context
- If the pattern holds, the high-margin professional services category gets squeezed first. For builders inside non-tech orgs, this rewrites what is reasonable to ship internally rather than outsource.
- Key points
- Reports talking to executives at large companies already insourcing functions they used to buy from outside vendors
- The economic logic of vendors (specialization + amortized tooling) erodes when in-house teams get dramatic agent-driven productivity gains
- Categories most exposed: outside legal counsel for routine work, mid-tier marketing agencies, integration-heavy software vendors
- Second-order question: what happens to consulting and contracting when a Fortune 500 can hire two senior engineers running Claude Code and match an Accenture engagement on throughput
- For senior engineers inside non-tech companies, the case for shipping in-house just changed from ideology to budget
- Engagement
- 301 likes · 26 retweets
- Provenance
- Tweet · Primary source
7
Let's go Bananas with GenMedia — Guillaume Vernade, Google DeepMind
Video Guillaume Vernade (Google DeepMind) — Developer advocate at Google DeepMind; ex-Stadia producer; works on GenMedia API surface
Each model has its own set of API. It doesn't make any sense — a developer should just swap the model name and it works.
www.youtube.com/watch?v=BcWFc3H7Khg
- Cited text
Each model has its own set of API. It doesn't make any sense — a developer should just swap the model name and it works.
- Context
- Concrete picture of where the gen-media stack actually sits in May 2026: a multi-modal pipeline runs for a dollar, server-side context caching is about to be table stakes, and the per-call dial for cost-vs-latency is becoming a first-class API feature.
- Key points
- Live demo: Gemini reads Wind in the Willows in one shot, generates structured prompts; Nano Banana 2 renders portraits and scenes; Veo animates them; Lyria composes per-chapter music; Gemini TTS reads dialogue
- Veo 3.1 Light shipped last week at 5¢ per second of video — 40¢ for an 8-second clip — making the whole pipeline run for roughly $1 per book
- Interactions API in preview: server-side context caching with ~2-day session TTL, eliminates re-uploading large documents on every call; Vernade expects this to become the default at next I/O
- Lyria Real Time is a predict-based (not diffusion) continuous-generation music model with ~2-second prompt-update latency — DJ-mixable from an agent loop
- Three pricing tiers on the Gemini API now: normal, flex at 50% off with 2-5 minute delays, priority at 2x cost for fast-track — useful for separating agent-batch from user-facing latency-sensitive calls
- Provenance
- Video · Supporting source
8
pi-config: Plan skill, handoff skill, and subagents for Claude Code
X @DanielGri (Daniel Griesser) — Engineer at Sentry; open-source practitioner of agent-skill patterns
Don't copy, get inspired.
x.com/DanielGri/status/2056676488183689620
- Cited text
Don't copy, get inspired.
- Context
- Cleanest worked example to date of the skill-file-as-primary-artifact approach to agent orchestration. If you've been on the fence about writing your own SKILL.md, this is the reference to read first.
- Key points
- Open repo (HazAT/pi-config) with three production-tested skill artifacts: a Plan SKILL.md for larger tasks, a handoff SKILL.md for context-window-exhaustion handoff between agents, and a directory of named subagents
- Pattern is converging across experienced operators: skills as markdown files survive model upgrades better than tool-use frameworks that have to be rebuilt with each protocol change
- Integration with Aaron Francis's Soloterm: a Pi extension that lets the subagents drive Soloterm sessions, treating the terminal multiplexer as the shared substrate
- Mario Zechner reposted, signalling cross-builder adoption of the skill-file pattern
- Daniel's framing: don't copy 1:1, treat it as inspiration for your own Plan/handoff conventions
- Provenance
- Tweet · Primary source
9
Peter Neumann has died
Article Dan Cross via TUHS, forwarding Tom Van Vleck and Robert Watson — Notice forwarded from the Multicians mailing list to The Unix Heritage Society
Peter Neumann passed away in his sleep on Sunday night at the hospital in Santa Clara from complications arising from his fall and subsequent surgery a few weeks ago.
www.tuhs.org/pipermail/tuhs/2026-May/033748…
- Excerpt
- Peter Neumann passed away in his sleep on Sunday night at the hospital in Santa Clara from complications arising from his fall and subsequent surgery a few weeks ago.
- Context
- Neumann spent decades cataloguing supply-chain and software-risk failures, starting long before npm packages existed. The Mini Shai-Hulud writeup earlier in today's episode is the exact kind of incident his RISKS column would have annotated and contextualized.
- Key points
- SRI Computer Science Lab senior researcher for half a century
- Founded and moderated RISKS Digest (comp.risks) from 1985, shaping how a generation of engineers thought about computer-related risk
- Worked on Multics; author of Computer-Related Risks (1995), still standard reading for understanding why software systems harm people
- Accomplished pianist and French-horn player — was listening to classical music with his daughter Hellie at the hospital before he died
- SRI is expected to host a memorial service in Menlo Park within the next month
- Provenance
- Article · Supporting source
10
RIP Peter Salus
Article Dan Cross via TUHS — Notice on The Unix Heritage Society mailing list
Peter Salus passed away on May 15. His 'Quarter Century of Unix' is required reading for any serious student of Unix history.
www.tuhs.org/pipermail/tuhs/2026-May/033750…
- Excerpt
- Peter Salus passed away on May 15. His 'Quarter Century of Unix' is required reading for any serious student of Unix history.
- Context
- Salus is the reason we have a clear written history of how Unix actually evolved. Losing him and Neumann in the same week is a real shift in the institutional memory of the field.
- Key points
- Died May 15, 2026
- Author of A Quarter Century of Unix (1994), the canonical history of Unix from Bell Labs through the workstation era
- Documented the Unix wars, BSD lineage, the AT&T lawsuit, and the rise of Linux while most of the principals were still alive to interview
- Two of the Unix-era historians and stewards gone in the same week (Neumann May 17, Salus May 15)
- Provenance
- Article · Supporting source
11
Pope Leo XIV's first encyclical Magnifica humanitas to be published May 25
Article Vatican News — Official Vatican communications office
First major encyclical from this pope and reportedly centered on AI. We made an on-air commitment to cover it Monday — confirms timing for next week.
www.vaticannews.va/en/pope/news/2026-05/pop…
- Context
- First major encyclical from this pope and reportedly centered on AI. We made an on-air commitment to cover it Monday — confirms timing for next week.
- Key points
- Pope Leo XIV's first encyclical, Magnifica humanitas, is confirmed for publication on May 25, 2026
- Expected to address AI's relationship to human dignity, work, and creative life — drafted with theological consultation including from technical advisors
- We promised on yesterday's episode to cover the release on the day; this confirms the date
- Provenance
- Article · Supporting source
PyCon, pelicans, and the November inflection
00:00:04 Simon Willison gave a five-minute lightning talk at PyCon US yesterday — six months of large language models in five minutes — and the annotated writeup hit the top of Hacker News overnight with five hundred and twenty-eight points. I'd start there if you haven't seen it.
00:00:22 He frames the whole period around what he's calling the November 2025 inflection point, and I think he's right. Two things happened in November. First, the supposedly-best model changed hands five times in one month. Claude Sonnet 4.5 went in, then GPT-5.1, then Gemini 3, then GPT-5.1 Codex Max, and Anthropic took the crown back with Claude Opus 4.5.
00:00:47 A lot of leapfrogging for one month. But the second thing matters more. Simon's words: the coding agents got good. OpenAI and Anthropic had spent most of 2025 running Reinforcement Learning from Verifiable Rewards — RLVR — paired up with their Codex and Claude Code agent harnesses, and in November the curves crossed.
00:01:09 He puts it plainly: coding agents went from often-work to mostly-work. You could daily-drive them without spending most of your time fixing their stupid mistakes. He illustrates the whole thing with his pelican-riding-a-bicycle test, which I love because it's deliberately ridiculous and no AI lab is ever going to train for it.
00:01:31 Sonnet 4.5's September pelican is a mess. Opus 4.5's pelican two months later is clean. Gemini 3 drew, in his judgment, the best pelican of the lot, with a fish in its basket. Then Google's Jeff Dean tweeted an animated pelican on a bike, a frog on a penny-farthing, a giraffe in a tiny car, and a turtle kickflipping a skateboard.
00:01:54 So maybe the labs are paying attention to the silly tests after all. November also kicked off the Claws era. There's a project that started as some obscure repo called Warelay, by some guy called Pete, first commit at the end of November. By February it was called OpenClaw, and it was a personal AI assistant — a Claw — running on local hardware.
00:02:18 Mac Minis started selling out in Silicon Valley because people were buying them to run their Claws. Simon's friend Drew Breunig joked that it's because a Claw is the new digital pet, and a Mac Mini is the perfect aquarium. The other half of the six-month story is the open-weights side.
00:02:38 Google released the Gemma 4 series — the most capable open-weight US models Simon has seen. And then Chinese lab GLM put out GLM-5.1, an open-weight one-and-a-half-terabyte monster, which is excellent if you can afford the hardware to run it. He had it draw a North Virginia opossum on an e-scooter, captioned cruising the commonwealth since dusk — and the thing animated.
00:03:03 Other models he tried didn't even come close. I keep meeting senior engineers who think nothing happened over the holidays. Something happened. The agents got reliable enough to use as a daily driver, and the laptop-class models started running over expectations.
00:03:21 Those are the two themes. If you've been on the sideline for six months, today is a good day to spend ten minutes on Simon's slides — linked in the show notes — and then go look at your agent setup again.
Mini Shai-Hulud and your agent hooks
00:03:36 Speaking of agent setups — there's a supply-chain story today that anyone running Claude Code or Codex needs to know about. SafeDep published the writeup. Headline: an npm maintainer account called atool was compromised early this morning Coordinated Universal Time, and the attacker pushed six hundred and thirty-seven malicious versions across three hundred and seventeen packages in twenty-two minutes.
00:04:05 The affected packages aren't obscure. Size-sensor, four point two million downloads a month. Echarts-for-react, three point eight million. Timeago.js, one point one five million. Most of the at-sign antv scoped packages — antv slash scale, two point two million a month.
00:04:24 Combined reach is somewhere north of fifteen million monthly downloads. If you have caret-three-point-zero-point-six in your package.json for echarts-for-react, an install today resolves to a malicious version, because npm picks the highest version matching the range no matter where the latest tag points.
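The range behaviour described here can be illustrated with a minimal sketch. It assumes plain MAJOR.MINOR.PATCH versions only (no prerelease tags) and implements just the caret rule. The version list and the position of the latest tag are hypothetical stand-ins, not the real registry state for echarts-for-react, and `resolve_caret` is an illustrative helper, not npm's own resolver.

```python
import re
from typing import Optional


def parse(version: str) -> tuple:
    """Parse a plain MAJOR.MINOR.PATCH string (prerelease tags are not handled)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unsupported version: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def caret_upper_bound(base: tuple) -> tuple:
    """Exclusive upper bound of a caret range: bump the leftmost non-zero part."""
    major, minor, patch = base
    if major > 0:
        return (major + 1, 0, 0)
    if minor > 0:
        return (0, minor + 1, 0)
    return (0, 0, patch + 1)


def resolve_caret(range_spec: str, published: list) -> Optional[str]:
    """Return the highest published version that satisfies a ^x.y.z range."""
    base = parse(range_spec.removeprefix("^"))
    upper = caret_upper_bound(base)
    candidates = [v for v in published if base <= parse(v) < upper]
    return max(candidates, key=parse, default=None)


# Hypothetical registry state: 3.0.6 is the last clean release and still holds
# the "latest" tag; 3.0.7 and 3.1.0 stand in for the attacker's uploads.
published = ["3.0.5", "3.0.6", "3.0.7", "3.1.0", "4.0.0"]
latest_tag = "3.0.6"

resolved = resolve_caret("^3.0.6", published)
print(f"latest tag: {latest_tag}, ^3.0.6 resolves to: {resolved}")
```

The latest tag never enters the resolution: only the set of published versions and the range do, which is why an attacker does not need to move the tag.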
00:04:46 The attacker didn't even bother moving the latest tag, because they didn't have to. The payload is what makes this one different. SafeDep is calling it Mini Shai-Hulud, and it matches the toolkit from the SAP compromise three weeks earlier — same scanner architecture, same regex set, same hex-variable obfuscation, and the same one-hundred-kilobyte flush threshold.
00:05:12 It harvests the full credential chain: AWS environment variables, EC2 instance metadata, Elastic Container Service container metadata, Secrets Manager, Kubernetes service account tokens, HashiCorp Vault tokens, GitHub personal access tokens, npm tokens, SSH keys, Stripe keys, Slack tokens, database connection strings, and the local password manager vaults — 1Password, Bitwarden, pass, and gopass.
00:05:40 In continuous integration, it exchanges GitHub Actions OpenID Connect tokens for npm publish tokens and re-signs artifacts via Sigstore with the stolen identity. It tries to escape Docker containers via the host socket. Routine for this category of malware now.
00:05:59 What's new — and listen for this if you're running an agent in any kind of automation — is the AI agent persistence. The payload injects SessionStart hooks into your dot-claude slash settings.json file, so every Claude Code session re-executes the malware. It does the same for Codex hooks.
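A rough way to check a project for this kind of persistence is sketched below. It looks for commands registered under a `SessionStart` key in `.claude/settings.json`, and for the folder-open task in `.vscode/tasks.json` that the show notes list. The exact nesting under `hooks` is an assumption about the Claude Code settings layout, so the sketch walks the structure generically instead of relying on one shape. Codex hook files are not covered because their format is not described in the writeup. A tasks.json containing comments will not parse as strict JSON and is treated as empty, so an empty result is not proof of a clean project. All function names are illustrative.

```python
import json
from pathlib import Path


def session_start_commands(settings: dict) -> list:
    """Collect every "command" string nested anywhere under hooks.SessionStart."""
    found = []

    def walk(node):
        if isinstance(node, dict):
            command = node.get("command")
            if isinstance(command, str):
                found.append(command)
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    hooks = settings.get("hooks")
    if isinstance(hooks, dict):
        walk(hooks.get("SessionStart", []))
    return found


def folder_open_tasks(tasks_file: dict) -> list:
    """Return labels of VS Code tasks configured to run when the folder opens."""
    labels = []
    for task in tasks_file.get("tasks", []):
        if not isinstance(task, dict):
            continue
        run_options = task.get("runOptions", {})
        if isinstance(run_options, dict) and run_options.get("runOn") == "folderOpen":
            labels.append(str(task.get("label", "<unlabeled>")))
    return labels


def load_json(path: Path) -> dict:
    """Read a JSON object from disk; missing or unparseable files yield {}."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return {}
    return data if isinstance(data, dict) else {}


def audit(project: Path) -> dict:
    """Summarize auto-run entries found in one project directory."""
    return {
        "session_start": session_start_commands(
            load_json(project / ".claude" / "settings.json")
        ),
        "folder_open": folder_open_tasks(
            load_json(project / ".vscode" / "tasks.json")
        ),
    }
```

Calling `audit(Path.cwd())` in a checkout returns two lists. Anything in either list that you did not put there yourself deserves a closer look before the next agent session or editor launch.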
00:06:20 And it drops a VS Code tasks.json with run-on folder-open, so the moment you open the infected project in VS Code, the loader runs. Then it installs a systemd user service on Linux, or a macOS LaunchAgent called kitty-monitor, which polls GitHub's commit search API every hour looking for RSA-signed commit messages with the keyword firedalazer, decrypts the command, and runs whatever the attacker uploaded.
00:06:49 Persistent command and control over GitHub commit messages. Cute. Exfiltration runs through two channels in parallel. One is GitHub itself — the payload creates public repos under the compromised account, with Dune-themed names drawn from a hardcoded vocabulary: sardaukar, mentat, fremen, atreides, and harkonnen, paired with sandworm, ornithopter, melange, or heighliner, plus a random three-digit suffix.
00:07:19 The repo description, reversed in the source, reads Shai-Hulud: Here We Go Again. Stolen credentials get committed as Git blobs. The second channel is an HTTPS post to t dot m-kosche dot com slash api slash public slash otel slash v1 slash traces, dressed up as an OpenTelemetry trace ingest, so it slides past most network rules.
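The two exfiltration channels give fairly crisp indicators, sketched here as matchers. The vocabulary lists and the endpoint come from the writeup as narrated above. The hyphen separator is inferred from the single example name in the show notes (sardaukar-sandworm-742), so treat the exact pattern as an assumption. The function names are illustrative and not part of any SafeDep tooling.

```python
import re
from urllib.parse import urlsplit

# Vocabulary as listed in the writeup; the "-" separator is inferred from
# the one example repo name and may not cover every variant.
FACTIONS = ("sardaukar", "mentat", "fremen", "atreides", "harkonnen")
OBJECTS = ("sandworm", "ornithopter", "melange", "heighliner")

REPO_NAME = re.compile(
    "(?:" + "|".join(FACTIONS) + ")-(?:" + "|".join(OBJECTS) + r")-\d{3}"
)

EXFIL_HOST = "t.m-kosche.com"
EXFIL_PATH = "/api/public/otel/v1/traces"


def looks_like_exfil_repo(name: str) -> bool:
    """True if a repo name matches faction-object-NNN from the hardcoded vocabulary."""
    return REPO_NAME.fullmatch(name.lower()) is not None


def looks_like_exfil_url(url: str) -> bool:
    """True if a URL targets the fake OpenTelemetry trace endpoint."""
    parts = urlsplit(url)
    return parts.hostname == EXFIL_HOST and parts.path == EXFIL_PATH


print(looks_like_exfil_repo("sardaukar-sandworm-742"))
print(looks_like_exfil_url("https://t.m-kosche.com/api/public/otel/v1/traces"))
```

Run `looks_like_exfil_repo` over the repositories listed under your own accounts and `looks_like_exfil_url` over proxy or egress logs. A match on either is a reason to start rotating credentials.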
00:07:42 A few takeaways. One: this is the second Shai-Hulud variant in three weeks. Same tooling, different propagation vector. The actor has a kit and is iterating. Two: a lockfile only saves you if you install strictly from it. Any fresh install or lockfile regeneration re-resolves a loose range, and semantic-version ranges are the attack vector.
00:08:03 Three: if you're running agent hooks at all, treat your dot-claude and your dot-codex and your dot-vscode directories the way you treat package.json. Anything that runs on session start is now part of your credential surface. SafeDep is recommending Package Manager Guard with a cooldown that refuses any version published inside a configurable window — which is the right shape of defense against bursts like this.
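The cooldown idea is simple enough to sketch. This is not how Package Manager Guard implements it. It is a minimal illustration of the rule "refuse anything published inside a configurable window", and the versions and timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def outside_cooldown(published_at: datetime, now: datetime, cooldown: timedelta) -> bool:
    """A version is installable only once it has been public for the full window."""
    return now - published_at >= cooldown


def allowed_versions(publish_times: dict, now: datetime, cooldown: timedelta) -> list:
    """Filter a {version: publish_time} mapping down to versions past the cooldown."""
    return sorted(
        version
        for version, published_at in publish_times.items()
        if outside_cooldown(published_at, now, cooldown)
    )


# Hypothetical publish times: one old clean release, two from a same-day burst.
now = datetime(2026, 5, 19, 12, 0, tzinfo=timezone.utc)
publish_times = {
    "3.0.6": datetime(2025, 11, 2, 9, 30, tzinfo=timezone.utc),
    "3.0.7": datetime(2026, 5, 19, 1, 10, tzinfo=timezone.utc),
    "3.1.0": datetime(2026, 5, 19, 1, 25, tzinfo=timezone.utc),
}

allowed = allowed_versions(publish_times, now, timedelta(days=7))
print(allowed)
```

With a seven-day window, the two same-day uploads are excluded and only the older release remains a candidate. The trade-off is that legitimate urgent fixes are delayed by the same window.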
00:08:32 And four: if you opened a project that pulled one of these packages today on a developer laptop, assume your SSH keys and your GitHub personal access token are already sitting in a Dune-named repo somewhere. Rotate everything.
Prime Intellect tries to automate the environment
00:08:47 Prime Intellect — the distributed-training people — dropped a thing called General-Agent yesterday. Their pitch in one line: the next step toward automating AI is automating reinforcement-learning environments. The artifact is a fully synthetic environment whose task corpus self-evolves and grows harder over time.
00:09:07 Current scope: four thousand five hundred and four tool-use tasks, one thousand and forty domains, and eight thousand one hundred and fifty-nine unique tools. Worth a beat on why this matters. The bottleneck for the last year of post-training has been environment coverage.
00:09:24 If you want a model that's good at booking flights, you need a flight-booking environment with verifiable rewards. If you want it good at editing pull requests, you need a pull-request-editing environment with verifiable rewards. Building those environments by hand is the actual job inside an RL post-training team — way more headcount than the algorithm work.
00:09:47 Anthropic and OpenAI have large internal teams doing exactly this. Prime Intellect is claiming they can generate them. I'm holding skepticism here. Synthetic environments are easy to generate at low quality and very hard to generate at the quality where a model actually learns durable skills from them.
00:10:07 The graveyard of synthetic-data approaches in supervised fine-tuning is enormous. The pelican-on-a-bicycle test exists precisely because nobody trains on the silly thing — meanwhile, the things models excel at are the things with the cleanest hand-built training environments.
00:10:24 So a self-evolving task corpus that grows harder over time is exactly the right ambition, and exactly the place where the gap between demo and durable signal is biggest. What would change my mind: someone outside Prime Intellect producing a fine-tune on General-Agent that beats a comparable hand-curated environment on a known agent benchmark — SWE-bench, WebArena, or something with teeth.
00:10:49 That's the falsifiable claim. If the synthetic environment works, you get a small open lab keeping up with the big labs on agentic skills, and the post-training advantage that Anthropic and OpenAI have built mostly through environment engineering loses some of its scarcity.
00:11:07 If it doesn't — and most announcements of this shape don't, on the timescale of three to six months — we'll look back at General-Agent as another reminder that environments are the bottleneck. Either way, the framing is right. Whoever can automate the production of high-quality reinforcement-learning environments owns the next round of post-training.
00:11:29 That's the prize.
Rewiring the state from Number 10
00:11:30 Eoin Mulgrew, who works on the Number 10 data science team in Downing Street, gave a talk at AI Engineer this week called Rewiring the State. It's the most concrete account I've seen from inside any government of what it actually takes to ship AI into public services.
00:11:48 The headline story is this. The Cabinet Office was about to spend one and a half million pounds on an outside law firm to analyze the United Kingdom statute book. He uses a great visual — the statute book printed out is roughly four African elephants tall. One engineer from Mulgrew's team embedded with the in-house legal team for two weeks instead, and built a tool that does the analysis on demand.
00:12:13 The tool now lives with the legal team. They run it whenever they want. Mulgrew presents this as a typical example, not an exceptional one. The shape of the team is what makes it interesting. They're a small, deliberately insurgent unit at the center of government, with political cover to do things the rest of Whitehall can't move fast enough to do.
00:12:35 They recruit exclusively outsiders — Y Combinator founders, big-tech engineers, and research-lab people — at roughly a zero-point-seven percent acceptance rate. He's explicit about why they don't recruit internally: people who've been inside the civil service for a long time tend not to leave when their fellowship ends, and the whole point of the fellowship is to seed new permanent teams elsewhere in government.
00:13:01 So they take outsiders, embed them for a year or two, then send them out to start parallel teams. Two examples he names: the Incubator for AI inside the Department for Science, Innovation and Technology, and Just AI inside the Ministry of Justice, run by a former fellow named Dan James.
00:13:20 Just AI is embedding engineers with prison wardens and parole officers, focused on drug interdiction and prison safety. One current fellow, Will, a Harvard dropout and Y Combinator founder, is working out of HMP Wandsworth. The delivery model is forward-deployed engineering, applied to a state.
00:13:40 They place an engineer into a policy team, a legal team, a comms team, or an operations team, and they go from idea to live in two or three weeks. Mulgrew points out that the typical government timeline is a year-plus in discovery before anyone writes a line of code.
00:13:57 Their record case at the time of the talk was two and a half weeks from idea to a new public service launched by the Prime Minister at London Tech Week — a tool called Extract, built with DeepMind on Gemini, that digitizes planning applications including handwritten maps.
00:14:14 It's rolling out to every English local authority. The numbers Mulgrew names give you the scale of the problem. Seven and a quarter million people on National Health Service waiting lists. Three hundred and fifty thousand court cases stuck in backlog. One in five planning applications decided on time.
00:14:33 The Tony Blair Institute estimate is forty billion pounds a year in achievable productivity gains. The civil service is four hundred thousand people, only about thirteen percent of whom are policymakers; the rest is operational. There's a line in the talk I want to keep.
00:14:51 Talking about recruitment, he says, and I'm quoting: we do want to recruit missionaries, not mercenaries — a paycheck is not going to get you out of bed when stuff gets hard. I think that's right, and I think it's the differentiator. The team is paid at market rates, which makes the offer real, but the thing they're selling is mission.
00:15:12 Working on the decisions that go across a minister's desk isn't a normal job offer. If you're an engineer somewhere else in the world watching this, thinking the same model could work where you are — Mulgrew's answer in the Q and A was, basically, yes, and someone should go do it.
Cursor's Compose on Colossus
00:15:30 Small detail with large implications. There's a credible-but-unconfirmed claim circulating today, posted by the account Tech Dev Notes — about ninety-two thousand views, four hundred and seventy-three likes — that Cursor's Compose 2.5 model was trained at xAI's Colossus 2.
00:15:47 If that's right, it's the first time we've publicly seen an editor-layer company rent frontier-scale training infrastructure from a frontier lab. Cursor has been training its own tab-completion models for a while, and Compose was their swing at a serious autocomplete-grade coding model.
00:16:05 Doing it on Colossus 2 — which is the xAI supercluster — puts the compute behind that model in a different tier than what you'd assemble on a hyperscaler in a hurry. Why this matters. The default assumption a year ago was that companies in the application layer would build small specialized models on top of someone else's API and wouldn't train anything serious themselves.
00:16:29 The Cursor-on-Colossus story breaks that assumption two ways. First, if you're a category-leading product, you can buy the compute. xAI is selling. Second, the labs are now in the position of competing with their own customers — Anthropic and OpenAI both ship first-party coding agents, and Cursor is at minimum a frenemy.
00:16:49 xAI selling Cursor cluster time is an unusual shape, and it's worth watching. I don't have confirmation from either Cursor or xAI on this — it's a single tweet from a generally accurate account, and I'm taking it at that weight, not as established fact. If it holds up, the read is that the application layer is starting to spend real money on training, and the lab whose business most centers on selling compute to third parties is xAI.
00:17:16 That second observation might end up mattering more than any of the model-release news this month. What would change my read: an actual Cursor blog post saying who trained where, or a denial from either party. If Compose 3 lands trained somewhere else, the shape resets.
Mollick on insourcing
00:17:33 Ethan Mollick at Wharton, the Co-Intelligence author, has a one-line observation that I've been chewing on. Posted yesterday, three hundred and one likes, modest reach for him. He writes: one trend that I think you might start to see at big companies is insourcing via hiring. Why pay so many outside vendors, legal, marketing, software vendors, when you can hire in-house and harness AI productivity gains yourself?
00:17:57 He adds that he has talked to executives already going this route. Let me take it seriously. The vendor economy works because vendors are specialists, they amortize their tooling across many clients, and they're cheaper than the equivalent in-house function. AI changes both sides of that calculation.
00:18:13 The in-house function gets dramatically more productive with agents in the loop — a single in-house lawyer with a Harvey-class tool may now cover what used to take a four-person Cravath team. So the cost gap that justified the outside contract narrows. At the same time, the vendor's specialist edge erodes, because the agent has roughly the same baseline expertise the vendor was selling.
00:18:36 If the pattern Mollick is hearing about catches on, the high-margin professional services get hurt first. That means outside counsel for routine legal work, mid-tier marketing agencies, and the categories of software vendor where the value is mostly integration and customization.
00:18:52 Each of those used to be hard to bring in-house because the talent was specialized and expensive. The talent is still specialized; it's the leverage that changed. The interesting second-order question is what this does to consulting and contracting inside software itself.
00:19:07 If a Fortune 500 can hire two senior engineers and run them with Claude Code or Codex, do they still need the Accenture engagement? Maybe — the relationship management and the accountability shield around a big consulting firm still pay for themselves. But the labor arbitrage is gone, because the in-house pair-with-agents stack is now competitive on throughput.
00:19:28 I'm flagging this because it's a builder story too. If you're a senior engineer at a non-tech company, the leverage you have inside that org just changed shape. The thing your chief information officer was outsourcing because in-house couldn't ship fast enough is now a thing you can ship in-house and probably should.
00:19:47 That argument used to feel like ideology. With agents in the loop, it's becoming a budget conversation.
A dollar a book
00:19:53 Guillaume Vernade from Google DeepMind gave a talk at AI Engineer called Let's Go Bananas with GenMedia, and what he does on stage is run an entire generative-media pipeline live against a public-domain book. It's worth pulling apart, because it tells you where the gen-media stack actually sits in May.
00:20:11 The chain. Gemini reads Kenneth Grahame's The Wind in the Willows end to end in one shot (the model has a two-hundred-thousand-token context) and writes structured prompts for each chapter and each character. Nano Banana 2, the new image model, generates portraits and scene illustrations.
00:20:30 Veo, the video model, animates the scenes using the generated images as first frames. Lyria composes a different piece of music for each chapter, with or without lyrics. And Gemini's text-to-speech reads the dialogue using a trick Vernade demonstrates where two voices played simultaneously create the perceptual illusion of four distinct characters.
00:20:52 The whole pipeline runs for roughly one US dollar per book. The cost breakdown is the part I keep coming back to. Veo 3.1 Light, which Google shipped last week, prices at five cents per second of video. An eight-second clip is forty cents. The image and music and text-to-speech pieces are essentially rounding error against the video cost.
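To make the arithmetic concrete, here is a minimal sketch that uses only the per-second price quoted above. The function names are mine, and it treats video as the only cost, which is a simplification: the talk describes the other pieces as rounding error, not as free.

```python
from decimal import Decimal

# Price quoted in the narration: five cents per second of generated video.
VIDEO_USD_PER_SECOND = Decimal("0.05")


def clip_cost(seconds: int) -> Decimal:
    """Cost in US dollars of one generated clip of the given length."""
    return VIDEO_USD_PER_SECOND * seconds


def seconds_within_budget(budget_usd: Decimal) -> int:
    """Whole seconds of video a budget buys if video were the only cost."""
    return int(budget_usd / VIDEO_USD_PER_SECOND)


eight_second_clip = clip_cost(8)
one_dollar_of_video = seconds_within_budget(Decimal("1.00"))

print(eight_second_clip)    # 0.40
print(one_dollar_of_video)  # 20
```

At that rate a one-dollar budget covers about twenty seconds of video, so the per-book figure presumably assumes a small number of short clips.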
00:21:13 That's a real price drop. A year ago, generating five minutes of bespoke animated video, with original music per chapter, was an industrial pipeline. Now it's a dollar and a script. Two pieces of API news to flag. The first is the Interactions API in preview — server-side context caching with about a two-day session time-to-live.
00:21:34 You upload your book once, get an interactions identifier back, and reuse that identifier across all subsequent calls without re-uploading anything. Vernade expects this to become the default at the next Google I/O keynote, and it should: the current chat-mode pattern of re-sending the full context on every call is wasteful once you're calling at production volume.
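The upload-once, reuse-by-identifier pattern can be sketched without touching any real service. What follows is a toy in-memory stand-in, not the Interactions API: the class, its methods, and the returned fields are all hypothetical, and only the roughly two-day time-to-live comes from the talk.

```python
import time
import uuid
from typing import Optional

# Roughly the session time-to-live mentioned in the talk.
TWO_DAYS_SECONDS = 2 * 24 * 60 * 60


class ContextStore:
    """Hypothetical stand-in for a server that caches uploaded context."""

    def __init__(self, ttl_seconds: float = TWO_DAYS_SECONDS) -> None:
        self.ttl_seconds = ttl_seconds
        self._sessions = {}  # identifier -> (context text, expiry timestamp)

    def upload(self, context: str, now: Optional[float] = None) -> str:
        """Store the context once and hand back an identifier for later calls."""
        now = time.time() if now is None else now
        identifier = uuid.uuid4().hex
        self._sessions[identifier] = (context, now + self.ttl_seconds)
        return identifier

    def call(self, identifier: str, prompt: str, now: Optional[float] = None) -> dict:
        """Answer a prompt against cached context; only the prompt is re-sent."""
        now = time.time() if now is None else now
        entry = self._sessions.get(identifier)
        if entry is None or now >= entry[1]:
            self._sessions.pop(identifier, None)
            raise KeyError("unknown or expired session")
        context, _expiry = entry
        # A real service would run a model here; this only reports payload sizes.
        return {"sent_chars": len(prompt), "cached_chars": len(context)}


store = ContextStore()
book = "chapter text " * 1000
session_id = store.upload(book, now=0.0)
reply = store.call(session_id, "Describe Mole.", now=60.0)
print(reply["sent_chars"], reply["cached_chars"])
```

The point of the sketch is the payload asymmetry: each call sends a few characters of prompt while the book stays server-side until the session expires.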
00:21:57 The second is Lyria Real Time, which is a prediction-based music model — not diffusion — that generates music continuously, like a disc jockey mix, with about a two-second latency for prompt updates. You can change the mood mid-track. That's a different shape of music model from the static-clip pattern most developers are used to, and it suggests where the music tools are heading: live, steerable, and callable from an agent loop.
00:22:24 The pricing tiers also tell you something. Google now sells three tiers on the Gemini API: normal, a flex tier at fifty percent off in exchange for two-to-five-minute delays, and a priority tier at twice the cost for fast-tracking. That's the airport-priority-lane move, and it's interesting because it lets you dial throughput against latency against price at API call time.
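As a rough illustration of choosing a tier per call, here is a sketch built on the three multipliers described above. The tier names as dictionary keys, the function names, and the five-minute threshold are my assumptions for illustration, not parameters of the Gemini API.

```python
# Price multipliers relative to the normal tier, as described in the narration.
TIER_PRICE_MULTIPLIER = {"flex": 0.5, "normal": 1.0, "priority": 2.0}

# Assumption: a caller should tolerate the upper end of the quoted
# two-to-five-minute delay before opting into flex.
FLEX_MAX_DELAY_SECONDS = 5 * 60


def pick_tier(max_wait_seconds: float, needs_fast_track: bool = False) -> str:
    """Pick the cheapest tier whose latency the caller can live with."""
    if needs_fast_track:
        return "priority"
    if max_wait_seconds >= FLEX_MAX_DELAY_SECONDS:
        return "flex"
    return "normal"


def tier_cost(base_cost_usd: float, tier: str) -> float:
    """Scale a normal-tier cost by the chosen tier's multiplier."""
    return base_cost_usd * TIER_PRICE_MULTIPLIER[tier]


batch_tier = pick_tier(max_wait_seconds=600)
chat_tier = pick_tier(max_wait_seconds=5, needs_fast_track=True)
print(batch_tier, tier_cost(10.0, batch_tier))  # flex 5.0
print(chat_tier, tier_cost(10.0, chat_tier))    # priority 20.0
```

The design choice worth noting is that the decision lives at the call site, so one codebase can send background batch work and user-facing requests to different tiers.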
00:22:47 Most agent workflows can tolerate flex. Most user-facing chat needs priority. I'd expect the other labs to copy this pattern within a quarter. Vernade had one moment I think every developer-advocate would recognize. He said his role internally is advocating for developers, trying to bring common sense to the internal teams and making sure what they release makes sense in the real world.
00:23:12 Then he half-joked that he won by default on pushing unified APIs — meaning the model teams used to each ship their own API surface, and he kept pointing out that a developer should be able to swap a model name and have it work. He's right, and it's a sign of how the lab-versus-platform tension inside the big labs gets resolved one decision at a time.
Daniel Griesser's pi-config
00:23:34 While we're on practical builder material — Daniel Griesser, who works at Sentry, posted his pi-config repo today. It isn't a framework. It's a few skill files and subagent definitions that he uses with Claude Code and Solo for larger tasks, and he's open about not wanting people to copy it one-to-one.
00:23:53 His phrase: don't copy, get inspired. The three pieces he calls out. A Plan skill — a SKILL.md file the agent reads when it's working on a larger task, that tells it how Daniel wants planning to happen. A handoff skill that defines how an agent should pack up state when it's running out of context window and a fresh agent picks up.
00:24:14 And a subagents directory where each subagent is a discrete role with its own instructions. The pattern is more interesting than the contents. We're maybe nine months into the era of agent skills as a first-class artifact — Anthropic shipped them at Code with Claude last fall — and what's converging is that experienced operators are putting their planning and handoff conventions into skill files and treating subagents as named roles, not as ad-hoc spawned workers.
00:24:45 That's a real architectural choice. The framework-versus-skill-file debate is leaning toward skill files because they survive model upgrades better — a SKILL.md is just markdown, and when Opus 4.6 ships, your skill file still works. A framework that wrapped a specific tool-use protocol has to be rebuilt.
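The "just markdown" point is easy to show. Below is a minimal sketch of a harness reading skill files from disk and folding them into a prompt. The one-directory-per-skill layout, the function names, and the sample skill text are assumptions for illustration; this is not how Claude Code, Pi, or Daniel's repo actually load skills.

```python
import tempfile
from pathlib import Path


def load_skills(root: Path) -> dict[str, str]:
    """Map each skill directory name to the text of its SKILL.md file."""
    return {
        path.parent.name: path.read_text(encoding="utf-8")
        for path in sorted(root.glob("*/SKILL.md"))
    }


def build_system_prompt(skills: dict[str, str]) -> str:
    """Concatenate skills into one prompt section per skill."""
    sections = [f"## Skill: {name}\n{body.strip()}" for name, body in skills.items()]
    return "\n\n".join(sections)


# Demo with two made-up skills written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    samples = {
        "plan": "Write a plan before editing.",
        "handoff": "Summarize state for the next agent.",
    }
    for name, text in samples.items():
        (root / name).mkdir()
        (root / name / "SKILL.md").write_text(text, encoding="utf-8")
    skills = load_skills(root)
    prompt = build_system_prompt(skills)

print(prompt)
```

Nothing in the loader knows which model will read the prompt, which is the portability argument in miniature.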
00:25:04 The other thing happening in Daniel's repo is the integration with Aaron Francis's Soloterm. Soloterm is a terminal multiplexer designed for working with agents, and Daniel built a Pi extension that lets his existing subagents drive Soloterm sessions. The pattern: you're not building one agent, you're building a small team of agents with a shared terminal substrate, and the work of the senior engineer is the orchestration and the skill files.
00:25:33 If you've been on the fence about whether to write your own SKILL.md files for your team, Daniel's repo is the cleanest worked example I've seen. Link in the show notes. I'd read the Plan and handoff files first — those translate easily into your own context.
Peter Neumann and Peter Salus
00:25:50 A sober beat to close on. Two computing pioneers gone in the same week. Peter Neumann died Sunday evening at the hospital in Santa Clara, age ninety-four — complications from a fall and surgery a few weeks earlier. His daughter Hellie was with him, and they were listening to classical music together.
00:26:07 The news came via Robert Watson on the Multicians mailing list, forwarded to The Unix Heritage Society by Dan Cross. Neumann was at SRI for half a century. He ran the RISKS Digest from nineteen eighty-five — the comp-dot-risks moderated newsgroup that catalogued the failures of computer-based systems and shaped how a generation of security and reliability researchers thought about the discipline.
00:26:30 He worked on Multics. He wrote Computer-Related Risks in nineteen ninety-five, which is still the book to hand a new engineer who needs to understand why a software system can hurt people. He was a serious musician — piano and French horn — and you can read the tributes on his SRI page if you want a sense of who he was beyond the work.
00:26:50 Peter Salus died on May fifteenth. Dan Cross broke the news on The Unix Heritage Society list yesterday. Salus wrote A Quarter Century of Unix in nineteen ninety-four, which is the canonical history of Unix from the early Bell Labs days through the workstation era.
00:27:05 If you've ever read about the Unix wars, the BSD lineage, the AT&T lawsuit, or the rise of Linux — Salus is the historian who pieced it together while most of the protagonists were still alive to be interviewed. Both were of the generation that built the substrate everything we do today sits on.
00:27:23 Neumann was telling the security community about supply-chain risks decades before npm packages existed — and the SafeDep writeup we covered earlier is, in a real sense, the kind of incident his RISKS column would have catalogued and helped you understand. Salus made sure we remember how we got here.
00:27:40 If you're a senior engineer and you've never read Neumann's Computer-Related Risks or Salus's A Quarter Century of Unix, the reading list just got longer. That's the show. Pope Leo XIV's first encyclical, Magnifica humanitas, is confirmed for May twenty-fifth.
00:27:56 We said last week we'd cover it on the day and we will. Google opens its I/O keynote tomorrow morning, and Gemini Spark, their consumer personal-agent product, is the headline thing. If it ships and it's good, the consumer-assistant fight reopens. If it slips, it slipped.
00:28:13 Either way we'll know by lunchtime. — Lenar Kess.