◆ Dispatch 036 · 2026-05-25 GSV Smallest Model That Does The Job

A few hundred dollars a proof, and the long argument about what machines are for

2026-05-25 / 00:23:40 / 15 sources

“The skill that ages well isn't running the most agents; it's getting the result with the smallest model and the fewest tokens that do the job.”
— Lenar Kess, today's narration

A frontier lab proves nine decades-old math problems for a few hundred dollars each, two talks make the numeric case that the cheapest agents route work to the smallest model that can do it, a lawsuit names an individual researcher over how Llama's training data was sourced, and a papal encyclical argues about AI on the terms of work and dignity. Eight things worth knowing today, told one developer to another.

DeepMind's AlphaProof Nexus clears nine open Erdős problems — Lean-verified proofs, a few hundred dollars apiece.
"You don't need GPT to zoom for you" — Callosum's numbers on routing subtasks to smaller models.
The token-efficiency turn — ThePrimeagen on why the org paying retail eventually does the math.
Inside how DeepMind runs its own agents — worse quotas than customers, a Darwinian skills library, and skepticism about MCP.
The lawsuit that names a name — Hobbs v. Meta, an individual researcher, and the internal dissent in the record.
Simon Willison on publishing GPT-4's retired architecture — the guesswork behind the water numbers.
Jujutsu and the pile of laundry — making a mess on purpose, then sorting it at the end.
Filming your chores for the robots — where the embodied-AI training data is actually coming from.
Pope Leo XIV's AI encyclical — technology is never neutral, and what no machine replaces.

Chapters

00:00:04 A few hundred dollars a proof
00:03:20 You don't need GPT to zoom for you
00:06:46 The token-efficiency turn
00:09:42 Inside how DeepMind runs its own agents
00:12:43 The lawsuit that names a name
00:15:58 Jujutsu and the pile of laundry
00:18:28 Filming your chores for the robots
00:21:00 Pope Leo XIV, and what no machine replaces

Sources

15 cited

1
Grok foundation model V9-Medium (1.5T) has finished training

X elonmusk — CEO of xAI, Tesla, SpaceX; owner of X

A lot of Cursor data was added in supplementary training and there is more to come.
x.com/elonmusk/status/2058787384364265734 →
Details
Cited text
A lot of Cursor data was added in supplementary training and there is more to come.

Context
A frontier lab pre-announcing a 1.5-trillion-parameter model trained partly on coding-agent interaction data, with an open-source pledge for the prior model, signals where xAI is aiming its coding push — though no public evals exist yet.
Key points
xAI's next foundation model, Grok V9-Medium, is 1.5 trillion parameters and has finished pre-training; fine-tuning underway, reinforcement learning to begin in days, public release in 2-3 weeks.
Musk says a lot of Cursor data was added in supplementary training, framing it as 'much better at coding.'
The current production model serving all Grok traffic is the 0.5T v8-small; the new model is pitched as a major improvement for difficult coding tasks.
xAI plans to open-source the 0.5T v8-small model toward the end of the year.
A reply notes benchmark performance and real-world performance are diverging, a fair caution before evals are public.
Provenance
Tweet · Primary source
2
Microsoft switched from Claude Code to GitHub Copilot, both on Opus 4.7

X trengriffin — Tren Griffin, longtime tech/finance writer ("12 Things") and a Microsoft employee

The wrapper is interchangeable — the engine isn't... The moat was never the UI.
x.com/trengriffin/status/2058786460103532597 →
Details
Cited text
The wrapper is interchangeable — the engine isn't... The moat was never the UI.

Context
It separates two things builders often conflate — the agent harness and the underlying model — and argues the spend follows the model, not the interface, which reframes a 'cost-cutting' rumor as internal dogfooding.
Key points
Responding to a claim that Microsoft throttled Claude Code usage to cut an out-of-control AI bill, Griffin says the move was a harness swap, not a cost cut.
His claim: Microsoft switched engineers from Claude Code to GitHub Copilot, both running Opus 4.7, both paid via enterprise API usage — 'Same Anthropic bill. Zero expense cut.'
He frames it as Microsoft wanting to dogfood the GitHub Copilot harness for scale and feedback.
A reply by Ash Cole captures the point: people read harness swaps as model swaps; the wrapper is interchangeable, the engine isn't.
This is one person's assertion, not a Microsoft announcement; treat the specifics as a claim.
Provenance
Tweet · Primary source
3
1000 tokens/sec generation on Qwen3.6 27B with V100s

Source Simple_Library_2700 (r/LocalLLaMA) — Local-inference hobbyist posting benchmark numbers

For single user the generation is around 80 t/s with 3000 t/s processing, no mtp!!
www.reddit.com/r/LocalLLaMA/comments/1tmyln… →
Details
Cited text
For single user the generation is around 80 t/s with 3000 t/s processing, no mtp!!

Context
It shows a capable 27-billion-parameter coding-grade model running fast on multi-generation-old, cheap GPUs — evidence that serious local inference no longer requires current-gen hardware.
Key points
A hobbyist hit roughly 1000 tokens/sec aggregate on Qwen3.6 27B across 128 concurrent requests using older Nvidia V100 server cards.
Single-user (batch one) generation is around 80 tokens/sec with about 3000 tokens/sec prompt processing, without multi-token prediction.
The headline interest in the comments is the cheap hardware: people want V100 pairs at reasonable prices.
It's a best-case throughput demo, not a typical single-user figure — the poster is clear the 128-concurrent number is far beyond personal need.
Provenance
Source · Background source
4
llama.cpp: fix checkpoints creation (faster agentic coding on local models)

Source jacekpoplawski (jacek2023) — llama.cpp contributor; the merged PR addresses context reprocessing during agentic coding

In the worst case, it has to reprocess the entire context and you get "forcing full prompt re-processing."
github.com/ggml-org/llama.cpp/pull/22929 →
Details
Cited text
In the worst case, it has to reprocess the entire context and you get "forcing full prompt re-processing."

Context
For anyone running agents against a local model, this is the difference between a snappy loop and a multi-second stall on every turn — the kind of runtime detail that decides whether local agentic coding is usable.
Key points
The problem: agent harnesses that rewrite conversation history to 'optimize context' force llama.cpp to reprocess huge chunks of tokens — sometimes the entire 70k-token context — stalling local agentic coding.
Two triggers: tools that rewrite history (he switched from opencode to pi to avoid it) and models that strip reasoning from context (enable 'preserve thinking,' e.g. with Qwen 3.6).
The merged PR fixes checkpoint creation so llama.cpp reprocesses only what actually changed, getting closer to the best case.
The author reports two weeks of use with noticeably more responsive agentic coding.
It's a concrete reminder that local agent performance is as much about cache/context plumbing in the runtime as about the model.
Provenance
Source · Background source
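
A toy model of prefix reuse shows why a history rewrite is so costly. This is a minimal sketch under stated assumptions, not llama.cpp's actual checkpoint logic: one string per message stands in for real tokens, and `common_prefix_len` and `tokens_to_reprocess` are hypothetical helpers that assume only a shared leading prefix of the cached context can be reused.

```python
def common_prefix_len(cached, new):
    """Count the leading tokens shared by the cached context and the new prompt."""
    n = 0
    for old_tok, new_tok in zip(cached, new):
        if old_tok != new_tok:
            break
        n += 1
    return n


def tokens_to_reprocess(cached, new):
    """Tokens that must be recomputed if only the shared prefix is reusable."""
    return len(new) - common_prefix_len(cached, new)


# Toy "tokens": one string per message, purely for illustration.
cached = ["sys", "user1", "think1", "reply1", "user2"]

# Append-only turn: the whole cached context is still a valid prefix.
appended = cached + ["reply2", "user3"]

# History rewrite: the harness drops the earlier reasoning, so the
# prompt diverges from the cache at index 2.
rewritten = ["sys", "user1", "reply1", "user2", "reply2", "user3"]

print(tokens_to_reprocess(cached, appended))   # 2
print(tokens_to_reprocess(cached, rewritten))  # 4
```

The earlier the edit lands in the history, the more of the context falls outside the reusable prefix, which is how a rewrite near the top of a long conversation turns into a near-full reprocess.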
5
Is NVIDIA still the default best choice for local LLMs in 2026?

Source pmv143 (r/LocalLLaMA) — r/LocalLLaMA discussion, 230+ comments

MI50 can be had for just $600... 32GB of VRAM and 1TB/s of memory bandwidth.
www.reddit.com/r/LocalLLaMA/comments/1tmkau… →
Details
Cited text
MI50 can be had for just $600... 32GB of VRAM and 1TB/s of memory bandwidth.

Context
The default 'just buy Nvidia' answer is fracturing by task and budget — relevant to anyone speccing a local inference box this year.
Key points
The gap to AMD has closed for text inference: a commenter runs an all-AMD homelab pain-free on llama.cpp's Vulkan backend.
AMD still hurts outside text inference — training and image generation run into ROCm headaches; llama.cpp's native training looks half-finished.
Value argument for AMD: an MI50 at about $600 gives 32GB of video memory and 1TB/s bandwidth, and AMD's open ISAs/drivers mean community support can outlast vendor decisions.
Apple's unified memory is the turnkey alternative — the 512GB Mac Studio was the go-to for hosting very large models like GLM-5.
MSRP is treated as nearly useless; real local-hardware decisions hinge on street prices and which task you're doing.
Provenance
Source · Background source
6
AlphaProof Nexus: Autonomous formal mathematics with agentic loops

Article Google DeepMind — Google DeepMind's formal-mathematics team; preprint arXiv 2605.22763v1, posted May 21, 2026, with proofs on GitHub and erdosproblems.com

9 of 353 open Erdős problems, at an inference cost of a few hundred dollars per problem.
arxiv.org/abs/2605.22763 →
Details
Cited text
9 of 353 open Erdős problems, at an inference cost of a few hundred dollars per problem.

Context
Verified formal proofs sidestep the trust problem that dogs LLM math: the Lean kernel is the referee, so a few-hundred-dollar agent loop produces results a human can check by running them. It extends the autonomous-math thread from OpenAI's planar-unit-distance result we covered May 21.
Key points
AlphaProof Nexus autonomously solved 9 of 353 open problems in the Erdős catalogue and proved 44 of 492 open OEIS sequence conjectures.
It pairs a large language model with the Lean proof assistant, running agentic loops that refine proofs against formal verification until they pass or it gives up.
Inference cost was a few hundred dollars per problem; proofs are machine-checkable rather than natural-language.
Some problems had been open for decades; the team also reports a 15-year-old algebraic-geometry result.
Because output is Lean-verified, there is no question of a hallucinated proof — it either typechecks or it doesn't.
Provenance
Article · Supporting source
7
Scaling the Next Paradigm of Heterogeneous Intelligence — Adrian Bertagnoli, Callosum

Video Adrian Bertagnoli (Callosum / Colossyan) — Founding engineer presenting Callosum's work on routing subtasks across different models and chips; talk hosted on the AI Engineer channel

You don't need GPT to zoom for you.
www.youtube.com/watch?v=WRBNDpUhsJQ →
Details
Cited text
You don't need GPT to zoom for you.

Context
It's a concrete, numbers-backed case that the cheapest wins in agent systems come from matching each subtask to the smallest model that can do it — a design lever any builder shipping multi-step agents can pull today.
Key points
On Video Web Arena, a mixture of Qwen3 VL 8B and Kimi K2.5 beat GPT-5.2 and Gemini 2.5 by 18 and 25 percent respectively.
Routing cheap subtasks (zooming, visual parsing) to a small model alone produced 11x faster and 43x cheaper results on those steps.
On a long-context benchmark, mapping recursive sub-agents across Cerebras/SambaNova hardware ran 7-12x cheaper and 3-5x faster than GPT-5.2.
Core thesis: real problems decompose into subtasks that need different model sizes and architectures; homogeneous single-model scaling is inefficient for inference.
An automation layer now detects task complexity and predicts the best-suited model and hardware, replacing hand-coded routing.
Provenance
Video · Supporting source
8
Everyone is Wrong about Tokens

Video ThePrimeagen — Developer and streamer known for blunt takes on engineering culture; reacting to a screenshot of $1.3M / 603 billion tokens spent in 30 days

It's going to be the people that are just being engineers... not the people spending Infinity.
www.youtube.com/watch?v=0zw-Uk9KJiA →
Details
Cited text
It's going to be the people that are just being engineers... not the people spending Infinity.

Context
A useful counterweight to the maximalist agent-swarm pitch: the person showing off billions of tokens often isn't paying retail, and the org paying retail will eventually ask who shipped the most per dollar.
Key points
Reacts to a post showing $1.3M and 603 billion tokens spent in a month running OpenClaw — and notes the poster paid zero for those tokens.
Compares today's 'spend infinity on tokens' culture to the 2016-2020 era of startups with more microservices than customers.
Prediction: companies will swap 'token maxing' for token efficiency, ranking people by features delivered, not spend.
Frames the new build calculus as 'buy vs build vs vibe' — vibing costs both time and money.
Skeptical of the 10x-cheaper-every-year promise: 'that promise is 2 years old and I feel like things have never been more expensive.'
Provenance
Video · Supporting source
9
How Google DeepMind Runs Agents at Scale — KP Sawhney & Ian Ballantyne

Video KP Sawhney & Ian Ballantyne (Google DeepMind) — KP Sawhney is a software engineer on DeepMind's AI platform team; Ian Ballantyne is a DevRel engineer; panel on the AI Engineer channel

We have worse limits than you do because obviously we prioritize customers and not ourselves.
www.youtube.com/watch?v=7gujZrJ9L5I →
Details
Cited text
We have worse limits than you do because obviously we prioritize customers and not ourselves.

Context
A rare look inside how a frontier lab actually operates its own agents day to day — quota politics, skill curation, and observability — which is more honest about the constraints than most vendor demos.
Key points
DeepMind engineers get worse rate limits than paying customers — customers are prioritized; internal throttling is 'kind of brute force.'
A 'Darwinian' skills library: experts contribute skills, the org culls them so only the best survive, and agents inherit that knowledge for free.
KP is skeptical of MCP ('may be a little bit of a flash in the pan') and favors skills plus guardrailed CLI interactions.
Subscription pricing 'doesn't really work' for token-hungry agents; they want harness-level fallback (Pro to Flash to local) so an unattended job doesn't stall on a limit.
An agent-trajectory store lets them replay runs down to raw predict requests to find exactly when a run started looping.
Provenance
Video · Supporting source
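
The harness-level fallback the panel asks for (Pro to Flash to local) can be sketched as an ordered chain that steps down a tier whenever a backend reports a rate limit. This is an illustration only, not DeepMind's harness: the `RateLimited` exception, the `run_with_fallback` helper and the three stand-in backends are all hypothetical.

```python
class RateLimited(Exception):
    """Raised by a backend that has hit its quota (hypothetical)."""


def run_with_fallback(prompt, backends):
    """Try each (name, call) pair in order, moving on when one is rate-limited."""
    skipped = []
    for name, call in backends:
        try:
            return name, call(prompt), skipped
        except RateLimited:
            skipped.append(name)
    raise RuntimeError(f"all backends rate-limited: {skipped}")


# Stand-in backends; a real harness would call model APIs here.
def pro(prompt):
    raise RateLimited("quota exhausted")


def flash(prompt):
    return f"flash: {prompt}"


def local(prompt):
    return f"local: {prompt}"


used, answer, skipped = run_with_fallback(
    "summarize the log",
    [("pro", pro), ("flash", flash), ("local", local)],
)
print(used, skipped)  # flash ['pro']
```

Returning the list of skipped tiers keeps the downgrade visible, so an unattended job that finished on a weaker model can be flagged for review rather than silently trusted.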
10
Ed Newton-Rex on individual researchers being sued over AI training

X @ednewtonrex — Ed Newton-Rex, founder of Fairly Trained and former audio lead at Stability AI; a prominent critic of unlicensed AI training data

It's no longer just AI companies & their founders being sued over AI training - individual researchers are now being sued, too.
x.com/ednewtonrex/status/2058433725889716519 →
Details
Cited text
It's no longer just AI companies & their founders being sued over AI training - individual researchers are now being sued, too.

Context
Personal liability for researchers changes the calculus of how training corpora get assembled. If an individual engineer can be named for sourcing data, the 'move fast, sort licensing later' default gets a lot more expensive to choose.
Key points
Flags a new suit (Hobbs v. Meta) naming an individual AI researcher, not just the company and its executives.
Authors Jeff Hobbs and A. Douglas Stone allege Guillaume Lample, then a Meta researcher, torrented more than 70 terabytes of pirated books to train Llama.
Court records put the figure at 81.7TB pulled from shadow libraries including LibGen, Anna's Archive and Z-Library.
Lample allegedly referred to a LibGen copy as 'BooksZero' and kept the code off Meta's repository.
Defendants also include Mark Zuckerberg and Joelle Pineau; Lample has since co-founded Mistral AI.
Provenance
Tweet · Primary source
11
Meta staff torrented nearly 82TB of pirated books for AI training — court records

Article Tom's Hardware — Trade-press reporting on the Hobbs v. Meta court records

I don't think we should use pirated material. I really need to draw a line here.
www.tomshardware.com/tech-industry/artifici… →
Details
Cited text
I don't think we should use pirated material. I really need to draw a line here.

Context
The internal dissent in the record is the part that lands: engineers flagged the line and it was crossed anyway. That's the institutional pattern worth watching as more of these suits name names.
Key points
Court records describe 81.7TB of data downloaded via torrents from shadow libraries to train Llama.
Internal messages show researchers objecting: one said using pirated material 'should be beyond our ethical threshold.'
Plaintiffs allege Meta removed copyright-management information and avoided licensing to preserve a fair-use posture.
The case names individuals alongside the company, a shift from earlier AI-training suits.
It centers on books specifically, treated as uniquely valuable training data.
Provenance
Article · Supporting source
12
Simon Willison on publishing GPT-4's retired architecture

X @simonw — Simon Willison, co-creator of Django and author of the widely-read blog on practical LLM use

Given how much of the original 'bottle of water per generated email' water estimate came from guesses at the architecture of GPT-4, it would be very much in OpenAI's interest to publish the architecture of that now-reti…
x.com/simonw/status/2058877314004627690 →
Details
Cited text
Given how much of the original 'bottle of water per generated email' water estimate came from guesses at the architecture of GPT-4, it would be very much in OpenAI's interest to publish the architecture of that now-retired, three year old model.

Context
Much of the public debate about AI's resource cost runs on reverse-engineered guesses. The opacity isn't just an academic gap — it shapes regulation and reputation built on numbers nobody can verify.
Key points
Argues OpenAI should publish the architecture of the now-retired GPT-4, three years on.
The widely-cited 'bottle of water per email' figure rested on guesses about GPT-4's architecture.
Publishing real numbers would let people replace estimates with facts on AI's energy and water footprint.
Ties to a broader transparency gap: outsiders still reason about frontier models from leaks and inference.
Provenance
Tweet · Primary source
13
Defeating git rigour fatigue with Jujutsu

Article Ike Saunders — Developer writing about the Jujutsu (jj) version control system

Doing Commits Like A Big Pile Of Laundry, perhaps?
ikesau.co/blog/defeating-git-rigour-fatigue… →
Details
Cited text
Doing Commits Like A Big Pile Of Laundry, perhaps?

Context
It's a concrete workflow for the messy-middle of feature development, and a good example of how Jujutsu's model lets you treat commit history as something you arrange at the end rather than maintain throughout.
Key points
Names a real pain: keeping clean, reviewable commits during long feature work is effortful and people give up ('git rigour fatigue').
Proposes building the ideal commit history first as empty labeled commits (jj new -B / -A), then sorting hunks into them.
Squash everything messy into one 'everything commit,' then interactively squash hunks into the right labeled commit until the everything commit is empty.
Claims it beats jj split and as-you-go squashing because the final state is guaranteed conflict-free.
Honest caveat: there's no guarantee every commit compiles, which may be a dealbreaker for bisect-clean history.
Provenance
Article · Supporting source
14
Why tech companies are paying people to film their chores

Article The Washington Post — Reporting on the gig-work economy springing up to generate household training video for humanoid robots

Gig workers earn $20-25 an hour to record themselves folding laundry, washing dishes, and making beds.
www.washingtonpost.com/technology/interacti… →
Details
Cited text
Gig workers earn $20-25 an hour to record themselves folding laundry, washing dishes, and making beds.

Context
The data bottleneck for embodied AI is physical demonstration, and a labor market is forming to supply it one folded towel at a time. It's the clearest sign yet that the next training-data land grab is happening in living rooms, not on the open web.
Key points
AI and robotics firms are paying gig workers to film everyday chores as training data for humanoid robots.
DoorDash launched a Tasks app in March 2026 letting its US Dashers earn money filming laundry, dishes and bed-making.
Micro1 reports ~4,000 'robotics generalists' across 71 countries sending more than 160,000 hours of video a month.
Equipment is typically head-mounted phones; pay runs roughly $20-25 an hour.
A viral Reddit post framed this as 'OpenAI installing 360 cameras' — the verified players are DoorDash, Scale AI and Micro1, training robots from Figure, Tesla and Agility.
Provenance
Article · Supporting source
15
Magnifica Humanitas — Encyclical Letter of Pope Leo XIV

Article Pope Leo XIV — An encyclical (a senior teaching document of the Catholic Church) addressing artificial intelligence and human dignity, dated May 15, 2026

Technology is never neutral, because it takes on the characteristics of those who devise, finance, regulate and use it.
www.vatican.va/content/leo-xiv/en/encyclica… →
Details
Cited text
Technology is never neutral, because it takes on the characteristics of those who devise, finance, regulate and use it.

Context
It's a major non-industry institution arguing about AI on the terms of work and dignity rather than benchmarks, and its line about technology carrying its makers' values is a sharp counterpoint to the 'tools are neutral' reflex common in engineering.
Key points
Argues technology is not antagonistic to humanity in itself, but is never neutral — it carries the values of whoever builds, funds and deploys it.
Warns of a 'Babel syndrome' promising limitless progress at the cost of human dignity, versus a 'Nehemiah approach' of shared responsibility.
Flags that power over ourselves is unprecedented yet increasingly concentrated in private hands rather than democratic institutions.
Insists work has dignity independent of productivity metrics and opposes reducing workers to 'costs of production.'
Frames human dignity as ontological — 'no machine can ever replace' it, regardless of efficiency.
Provenance
Article · Supporting source

00:00:04

A few hundred dollars a proof

00:00:04 Start with a number that's almost insulting in how small it is: a few hundred dollars. That's the inference cost, per problem, for what Google DeepMind says its new system just pulled off. They're calling it AlphaProof Nexus, and it autonomously solved nine of the three hundred and fifty-three open problems in the Erdős catalogue.

00:00:24 It also proved forty-four of nearly five hundred open conjectures from the online encyclopedia of integer sequences. The preprint went up on the twenty-first, with the proofs themselves posted to GitHub and to a site that tracks these problems, so this is checkable, not a press release you have to take on faith.

00:00:44 A few days ago, when OpenAI's system cracked a piece of the planar unit-distance problem, I said the thing I'd be watching was whether anyone could push autonomous math past a single headline result. This is that push, from a different lab, on a different set of problems.

00:01:01 Paul Erdős, for anyone who didn't spend time near a math department, was the wandering Hungarian mathematician who left behind a famous list of open questions — some of which have outlived most of the people now reading about them. Nine of them just fell to an agent loop.

00:01:18 What I find clean about it is the verification loop. The system pairs a large language model with Lean, which is a formal proof assistant — a piece of software where a proof is essentially a program that either typechecks or it doesn't. The model proposes an argument, Lean checks it, and the loop keeps refining until the proof passes verification or the system decides the problem can't be cracked.

00:01:43 So when DeepMind says it solved nine problems, there's no asterisk about whether the model hallucinated a confident-sounding argument that falls apart under scrutiny. The Lean kernel is the referee. The proof is machine-checkable. You run it, and it's valid or it isn't.
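
To make "it typechecks or it doesn't" concrete, here is a tiny Lean 4 example. It is an illustration only and has nothing to do with the Erdős proofs themselves; the theorem name `add_comm_example` is made up for the sketch.

```lean
-- The statement is a type; the proof is a term of that type.
-- `Nat.add_comm` is a lemma from Lean's core library.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Supplying a term of a different type instead, such as
-- `Nat.mul_comm a b`, is rejected with a type mismatch.
```

There is no partial credit here: the file is either accepted by the checker or it is not, which is what makes the result something you can verify by running it.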

00:02:00 That matters because the usual knock on language models doing math is exactly the trust problem. They're fluent, and fluency is the enemy of rigor. A model can write four paragraphs of proof that smuggle in the very thing they're trying to show. Formal verification shuts that door.

00:02:17 It's slower, and expressing a proof in Lean is harder than writing it in prose — but what you get on the other side is a result that needs no human to vouch for it. Let me be careful about the scale here. Nine of three hundred and fifty-three is about two and a half percent of one specific catalogue, and the problems this kind of loop can reach are the ones that happen to be expressible and attackable this way — not a random sample of hard mathematics.

00:02:46 This isn't the end of the mathematician. But the cost line is what I keep coming back to: a few hundred dollars per result, on questions that sat open for decades, with proofs you can verify yourself by running them — a different shape of research economics than a postdoc-year per question.

00:03:04 And the pattern travels: if you build anything where correctness can be formally specified — and a lot of systems work can be — then propose-with-a-model, check-with-a-verifier, loop-until-it-passes is a pattern I'd file away well beyond number theory.
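
The propose, check, loop shape can be sketched in a few lines. This is a minimal sketch under stated assumptions, not DeepMind's system: `propose_and_verify` is a hypothetical helper, a bisecting guesser stands in for the model, and a trivial arithmetic check stands in for the Lean kernel.

```python
def propose_and_verify(propose, verify, max_attempts=10):
    """Loop until the verifier accepts a candidate or the budget runs out.

    propose(feedback) returns a candidate; verify(candidate) returns (ok, feedback).
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = propose(feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate, attempt
    return None, max_attempts


def make_bisecting_proposer(lo, hi):
    """Stand-in for the model: narrows its next guess using verifier feedback."""
    state = {"lo": lo, "hi": hi, "last": None}

    def propose(feedback):
        if feedback == "too low":
            state["lo"] = state["last"] + 1
        elif feedback == "too high":
            state["hi"] = state["last"] - 1
        state["last"] = (state["lo"] + state["hi"]) // 2
        return state["last"]

    return propose


def verify_root_of_1369(x):
    """Stand-in for the verifier: accepts only an exact square root of 1369."""
    if x * x == 1369:
        return True, None
    return False, ("too low" if x * x < 1369 else "too high")


result = propose_and_verify(make_bisecting_proposer(0, 100), verify_root_of_1369)
print(result)  # (37, 3)
```

The important property is that the proposer never gets to declare success. Only the verifier does, and the loop also has an explicit way to give up.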

00:03:20

You don't need GPT to zoom for you

00:03:20 There was a talk from a startup called Callosum that I think anyone shipping multi-step agents should sit with, because it puts hard numbers on something a lot of us suspect but haven't measured. The engineer, Adrian Bertagnoli, calls it heterogeneous intelligence.

00:03:37 It's a mouthful, but the idea underneath is simple: stop running one big model on every step of a problem. His framing is that the last few years were homogeneous — you scale a single model on a fleet of identical chips, and the scaling laws reward you for it. That's a training-era idea.

00:03:56 In the inference era, he argues, it stops being optimal, because real problems decompose into subtasks that need wildly different amounts of intelligence. And he brought receipts. Take visual web navigation — an agent driving a browser, looking at the screen, clicking through.

00:04:13 On the Video Web Arena benchmark, Callosum's mixture of models beat GPT-5.2 and Gemini 2.5 by eighteen and twenty-five percent. The way they did it wasn't a smarter single model. They broke the task into its actual pieces — some visual reasoning, some text reasoning, and some plain mechanical steps like zooming into part of a page — and routed each piece to the cheapest model that could handle it.

00:04:39 His line is the one that stuck with me: you don't need GPT to zoom for you. On those low-end subtasks alone, offloading the zoom-and-parse work to a small model was eleven times faster and forty-three times cheaper than calling ChatGPT for it. Stack that across the whole task and the full system came out roughly three times faster and three-point-seven times cheaper than running the frontier model for everything, with better accuracy on top.
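
A hand-written version of that routing is small enough to sketch. Everything here is assumed for illustration: the model names, the skill labels and the costs are invented, and this is not Callosum's router, only the general "cheapest model that can do the step" rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_call: float  # illustrative units, not real prices
    skills: frozenset


# Hypothetical model tiers, cheapest first.
MODELS = [
    Model("small-vision", 1.0, frozenset({"zoom", "visual_parse"})),
    Model("mid-text", 5.0, frozenset({"zoom", "visual_parse", "text_reasoning"})),
    Model("frontier", 40.0, frozenset(
        {"zoom", "visual_parse", "text_reasoning", "visual_reasoning"})),
]


def route(subtask):
    """Return the cheapest model whose skills cover the subtask."""
    capable = [m for m in MODELS if subtask in m.skills]
    if not capable:
        raise ValueError(f"no model can handle {subtask!r}")
    return min(capable, key=lambda m: m.cost_per_call)


plan = ["zoom", "visual_parse", "text_reasoning", "visual_reasoning"]
routed_cost = sum(route(step).cost_per_call for step in plan)
frontier_cost = 40.0 * len(plan)
print(routed_cost, frontier_cost)  # 47.0 160.0
```

The hard part is not the lookup. It is deciding what belongs in each model's skill set, which is exactly where a wrong guess costs you quality.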

00:05:07 He showed the same shape on long-context work, building on a recent idea from MIT called recursive language models. Instead of stuffing a giant document into one prompt — where quality rots as the information you need gets more complex — you put the context in a file and let a coding agent search through it programmatically, spawning sub-agents to answer pieces.
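
The shape of that idea, heavily simplified, looks something like the sketch below. It is not the MIT method or Callosum's implementation: the context is just a string, the "search" is a keyword filter, and `grep_sub_agent` is a hypothetical stand-in for a real sub-agent call.

```python
def chunk_lines(text, lines_per_chunk):
    """Split the stored context into chunks of a few whole lines each."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]


def answer_over_file(text, keyword, sub_agent, lines_per_chunk=2):
    """Search the context in code, then hand only the relevant chunks to sub-agents."""
    chunks = chunk_lines(text, lines_per_chunk)
    relevant = [c for c in chunks if keyword in c]
    findings = []
    for chunk in relevant:
        findings.extend(sub_agent(chunk, keyword))
    return findings, len(relevant), len(chunks)


def grep_sub_agent(chunk, keyword):
    """Stand-in for a sub-agent call: returns the lines that mention the keyword."""
    return [line for line in chunk.splitlines() if keyword in line]


doc = "\n".join([
    "invoice 17: paid",
    "invoice 18: overdue",
    "note: office closed Friday",
    "invoice 19: paid",
    "invoice 20: overdue",
    "note: new vendor onboarded",
])

findings, calls, total_chunks = answer_over_file(doc, "overdue", grep_sub_agent)
print(findings, calls, total_chunks)
```

The point of the structure is that no single call ever sees the whole document. Each sub-agent gets a small slice, which is also what makes it natural to place those calls on cheaper or faster hardware.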

00:05:30 Callosum mapped those sub-agents onto different inference hardware. Running on Cerebras chips they were seven times cheaper and five times faster than GPT-5.2 on the benchmark; on SambaNova, twelve times cheaper. Same answers, a fraction of the cost and the wait.

00:05:47 The sharpest question from the audience cut straight to it: how do you decide what runs where? And the answer was that they started by hand — bespoke rules, this subtask to that small model. Since then they've built an automation layer that reads task complexity and predicts the best model and hardware.

00:06:07 That prediction layer is where this lives or dies, and it's the part I'd want stress-tested, because a router that guesses wrong on a hard step drags your quality down while you congratulate yourself on the savings. The direction, though, is hard to argue with.

00:06:23 His closing line — that compute went from fast, to massively parallel, to heterogeneous — is a tidy way of saying the next efficiency frontier isn't a bigger model, it's better dispatch. And the practical takeaway needs no research grant: open your agent's trace, find the steps that don't actually need your most expensive model, and move them down a tier.
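
If you want a starting point for that audit, something like the following works on any trace you can export as a list of steps. The trace format, the step kinds and the token counts are assumptions made up for the example, and `audit_trace` is a hypothetical helper, not part of any particular agent framework.

```python
from collections import defaultdict


def audit_trace(trace, cheap_kinds):
    """Total tokens per step kind, plus the share spent on steps a lower tier might take."""
    tokens_by_kind = defaultdict(int)
    for step in trace:
        tokens_by_kind[step["kind"]] += step["tokens"]
    total = sum(tokens_by_kind.values())
    movable = sum(n for kind, n in tokens_by_kind.items() if kind in cheap_kinds)
    share = movable / total if total else 0.0
    return dict(tokens_by_kind), share


# Invented trace: each step records what it did and how many tokens it used.
trace = [
    {"kind": "plan", "tokens": 1200},
    {"kind": "zoom", "tokens": 300},
    {"kind": "parse", "tokens": 600},
    {"kind": "zoom", "tokens": 300},
    {"kind": "plan", "tokens": 600},
]

by_kind, movable_share = audit_trace(trace, cheap_kinds={"zoom", "parse"})
print(by_kind, movable_share)
```

Which kinds count as cheap is a judgment call you make per workload. The number this gives you is an upper bound on what moving down a tier could touch, not a promise that quality holds when you do.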

00:06:46

The token-efficiency turn

00:06:46 That cost story has a cultural twin, and it came from ThePrimeagen, who put out a video with the very on-brand title, Everyone is Wrong about Tokens. The trigger was a screenshot making the rounds: one-point-three million dollars and six hundred and three billion tokens spent in a single month, running OpenClaw.
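
For a sense of scale, those two numbers imply a blended price per token. The arithmetic below uses only the figures from the screenshot and the 30-day window it covers. The result is an average across all tokens, and it says nothing about what any particular provider charges at retail.

```python
spend_usd = 1_300_000          # from the screenshot
tokens = 603_000_000_000       # from the screenshot
days = 30                      # window the screenshot covers

usd_per_million_tokens = spend_usd / tokens * 1_000_000
tokens_per_day = tokens / days

print(round(usd_per_million_tokens, 2))  # 2.16
print(f"{tokens_per_day:,.0f}")          # 20,100,000,000
```

So the headline works out to roughly two dollars per million tokens, at about twenty billion tokens a day.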

00:07:06 And the first reaction online was the usual — that if you're not spending six figures a month on tokens you're not going to make it, that you're in the permanent underclass, and that you should buy somebody's course. His pushback has two parts, and both land. The first is a follow-the-money point: the person flexing six hundred billion tokens paid zero dollars for them.

00:07:30 You, running the same workload at retail, would not. So treating that number as a target rather than as a research budget is a category error. The second is historical. He compares the moment to the startups of roughly 2016 to 2020 that ran more microservices and Kubernetes than they had customers — one friend, he says, was maintaining ten microservices for three customers.

00:07:54 His takeaway: just because it works for Google doesn't mean it works for you. Then he makes an actual prediction, which he flags as dangerous, because tech predictions usually are. Right now companies are pushing the opposite of frugality — people getting nudged, even pressured, for not using enough AI, their token spend treated as a virtue.

00:08:16 He thinks that snaps back. The same company that two years ago needed a vice president to sign off on a four-hundred-dollar memory upgrade isn't going to let you burn one-point-three million dollars a month on an agent forever. At some point the question flips from how much did you spend to who shipped the most per dollar.

00:08:36 He frames the old build calculus — buy versus build — as becoming buy versus build versus vibe, where vibing costs both time and money, so you'd better know the trade-off. And he takes a clean shot at the ten-times-cheaper-every-year promise that underwrites the whole token-utopia pitch.

00:08:55 That promise, he says, is two years old, and things have never felt more expensive. Now, he's a streamer with a comedic register, and some of this is a bit — he riffs about a coming class of token-efficiency consultants and prompt trainers who'll one-on-one you on prompts like Pokémon.

00:09:13 But strip the bit away and his core claim lines up exactly with the Callosum numbers from a minute ago. The maximalist pitch — you plus a hundred agents running nonstop — assumes a future of effectively free inference. We're not in that future; we're in the one where the org paying retail eventually does the math.

00:09:33 So the skill that ages well isn't running the most agents. It's getting the result with the smallest model and the fewest tokens that do the job.

00:09:42

Inside how DeepMind runs its own agents

00:09:42 Here's a counterpoint to all the polish, from inside a frontier lab. There was a panel with two people from Google DeepMind — KP Sawhney, an engineer on their AI platform team, and Ian Ballantyne, a developer-relations engineer — about how they actually run agents day to day.

00:09:59 The most human detail in it is that DeepMind's own engineers get worse rate limits than paying customers do. KP said it plainly: we have worse limits than you do, because obviously we prioritize customers and not ourselves. During the demo he was clicking continue over and over because the system recognized he was a Googler and throttled him.

00:10:19 Quota management internally, in his words, is kind of brute force — they have real power users, and eventually someone just tells them to stop. The cluster side of that is almost folksy. Ian said that when he joined, he asked how you know when you're using too much of the data center, and a colleague told him: oh, they'll tell you.

00:10:39 And sure enough, there are people watching the graphs around the clock who will reach out and ask you to kill a specific job on a specific cluster. That's the reality under the magic — supply-constrained, human-policed, a little improvisational. Two ideas from that panel are the ones I'd carry out of it.

00:10:58 The first is what they call a Darwinian skills library. Internally there's a big library of skills agents can pull from, contributed by people who are genuine experts in their corner of the company. KP's point: I and the agent get that knowledge for free. But at Google's scale, skills sprawl, so they actively cull — only the best survive, almost Darwinian.
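
One way to picture that culling, as a minimal sketch and not DeepMind's actual system: score each skill by how many of its evals pass, then keep only the top few per domain. The panel didn't say how survival is decided, so the `Skill` record, the `cull` function, the pass-rate threshold, and the sample skills are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    domain: str            # the corner of the company the skill covers
    eval_pass_rate: float  # fraction of the author's evals that pass, 0.0 to 1.0


def cull(skills, keep_per_domain=1, min_pass_rate=0.8):
    """Keep the best-scoring skills in each domain and drop the rest."""
    by_domain = defaultdict(list)
    for skill in skills:
        # Skills below the threshold never compete at all.
        if skill.eval_pass_rate >= min_pass_rate:
            by_domain[skill.domain].append(skill)

    survivors = []
    for domain in sorted(by_domain):
        ranked = sorted(by_domain[domain], key=lambda s: s.eval_pass_rate, reverse=True)
        survivors.extend(ranked[:keep_per_domain])
    return survivors


# Hypothetical library: two competing skills in each of two domains.
library = [
    Skill("storage-schema-v1", "storage", 0.72),
    Skill("storage-schema-v2", "storage", 0.95),
    Skill("release-checklist", "deploy", 0.91),
    Skill("release-checklist-fork", "deploy", 0.88),
]
surviving_names = [s.name for s in cull(library)]
```

The eval score is what does the selecting here, which is why it matters who writes the evals.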

00:11:19 The author of a skill owns writing its tests and evals, though they're starting to have agents design those evals too, which he admitted is a little meta. The second is a take I didn't expect from a Google engineer. KP said — and he flagged it as maybe controversial — that he thinks the model context protocol, the standard way agents plug into tools right now, might be a bit of a flash in the pan.

00:11:43 He's found that a combination of skills plus guardrailed command-line interactions works better for him. I'm definitely team skills, he said. He isn't against the protocol for authentication, but for getting work done he reaches for skills and a command line. That connects to something I said I'd watch on Sunday — whether a real agent-coordination layer shows up that isn't just bolted onto tools we already had.

00:12:08 This isn't that layer, but it is a vote, from inside DeepMind, for skills as the durable primitive over the protocol everyone has been standardizing around. And the last thing KP flagged is the one most relevant to the last two chapters: these systems are so token-hungry that the subscription model doesn't really work for them.

00:12:28 What he wants is harness-level fallback — when you hit your limit on the big model, drop automatically to a smaller one, so an unattended job doesn't spend an hour doing nothing because it hit a wall. Even the people running the frontier are budgeting tokens.
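
The fallback KP describes can be sketched in a few lines. This is an illustration under assumptions, not any real harness: `QuotaExceeded`, `run_with_fallback`, and the two stand-in model functions are made-up names, and a real client would signal throttling in its own way.

```python
class QuotaExceeded(Exception):
    """Raised by a model client when its rate limit or quota is exhausted."""


def run_with_fallback(prompt, models):
    """Try each (name, call) pair in order, falling through on quota errors only."""
    for name, call in models:
        try:
            return name, call(prompt)
        except QuotaExceeded:
            continue  # this tier is throttled; drop to the next, smaller model
    raise QuotaExceeded("every model in the fallback chain is out of quota")


# Stand-in clients so the sketch runs without any real API.
def big_model(prompt):
    raise QuotaExceeded("big model limit reached")


def small_model(prompt):
    return f"[small] {prompt}"


used, answer = run_with_fallback(
    "summarize the build log",
    [("big", big_model), ("small", small_model)],
)
```

Only the quota error triggers the drop. Any other failure still surfaces, so an unattended job degrades to a smaller model instead of stalling, without hiding real bugs.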

00:12:43

The lawsuit that names a name

00:12:43 From how the models run to what went into them — and a lawsuit that crosses a line earlier cases didn't. Ed Newton-Rex, who runs Fairly Trained and used to lead audio at Stability, flagged it: it's no longer just AI companies and their founders being sued over training data — individual researchers are now being named too.

00:13:03 The case is Hobbs versus Meta. Two authors, Jeff Hobbs and A. Douglas Stone, allege that Guillaume Lample — at the time a Meta researcher, since then a co-founder of Mistral — torrented something on the order of seventy terabytes of pirated books to train Llama.

00:13:20 Court records put the figure higher, around eighty-one terabytes, pulled from shadow libraries like LibGen, Anna's Archive, and Z-Library — the usual names. The complaint alleges the LibGen copy was referred to internally as BooksZero, that the code was kept off Meta's repository, and that one copy later vanished from Meta's possession.

00:13:41 Mark Zuckerberg and Joelle Pineau are named alongside Lample. I want to be fair about what this is: allegations in a complaint, not findings. Meta has defended its training as fair use, and it has supporters who argue exactly that. But what lands hardest in the record is the internal dissent — not the terabyte count.

00:14:01 Per the court documents, a senior researcher said back in October 2022, I don't think we should use pirated material, I really need to draw a line here. Another said using pirated material should be beyond our ethical threshold. People inside flagged the line. The line got crossed anyway.

00:14:19 What changes here is personal exposure. For a couple of years the AI-training suits aimed at companies and their executives — abstractions with legal departments. Naming an individual researcher for how he sourced data is a different signal to every engineer assembling a corpus.

00:14:36 The grab-it-now, sort-the-licensing-out-later default has been the cheap path. If a named individual can be on the hook for it, that default gets more expensive to choose, regardless of how this particular case resolves. There's a smaller, related point from Simon Willison, co-creator of Django and probably the most useful working writer on practical model use.

00:15:01 He made a narrow, sharp point: given how much of that famous bottle-of-water-per-generated-email estimate came from guesses about GPT-4's architecture, it would be very much in OpenAI's interest to publish the architecture of that now-retired, three-year-old model.

00:15:18 Think about what he's saying. The numbers people cite for AI's water and energy footprint — the ones that end up in headlines and in front of regulators — were partly reverse-engineered from leaks and inference about a model that's now obsolete. OpenAI could replace the guesswork with facts at basically no competitive cost, since the model is retired.

00:15:40 They haven't. So the public argument about what these systems consume, and the legal argument about what went into them, both run on the same fuel: outsiders guessing because the builders won't say. That opacity is a choice, and it keeps costing the labs trust they could buy back cheaply.

00:15:58

Jujutsu and the pile of laundry

00:15:58 Let's bring it back down to the desk, because there was a post I liked from a developer named Ike Saunders about a problem every one of us has had and pretended we didn't. He calls it git rigour fatigue. You know the good version of a pull request — define the types, add the database functions, the server changes, the client API, and the UI — each commit a clean little chapter a reviewer can step through.

00:16:22 And you know what you actually do: define types, add database functions, work-in-progress test code, server stuff, client stuff, fix the database function, fix a UI bug, refactor, and fix another UI bug. The later commits stomp on the earlier ones, and the story falls apart.

00:16:39 His fix uses Jujutsu — that's jj, the version-control system that sits on top of git and treats your commits as something you can rearrange freely rather than a log you have to get right as you go. And the technique is almost funny in how it inverts the usual advice.

00:16:55 Instead of writing clean commits as you work, you make a mess on purpose. Then, once the work is done, you start by laying out your ideal commit history: empty, labeled commits, one for types, one for the UI, and so on. Next you squash all your messy work into a single everything-commit.

00:17:11 And then you interactively pull hunks out of that everything-commit into the right labeled boxes, one at a time, until the everything-commit is empty. He doesn't have a good name for it — his best offer is doing commits like a big pile of laundry, which is exactly right.

00:17:27 You don't fold each shirt as you take it off. You make the pile, then you sort. The reason it beats the alternatives is specific. If you sort as you go with interactive squash, you can break a commit that happens to touch the same file. Doing it all at the end means the final state is guaranteed to have no conflicts, because everything started in one place.
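
The bookkeeping behind the laundry pile can be modeled in a few lines. This is a toy, not jj itself: jj does the sorting interactively on real diffs, while here a hunk is just a (file, change) pair, and the file names, labels, and `label_for` helper are hypothetical.

```python
# The everything-commit: all the messy work squashed into one pile of hunks.
everything = [
    ("types.ts", "add User type"),
    ("db.ts", "add getUser"),
    ("ui.tsx", "render profile"),
    ("db.ts", "fix getUser null case"),
    ("ui.tsx", "fix avatar bug"),
]

# The ideal history, laid out up front as empty labeled commits, in order.
history = {"types": [], "database": [], "ui": []}


def label_for(hunk):
    """Decide which labeled commit a hunk belongs in (a human does this in practice)."""
    file, _ = hunk
    return {"types.ts": "types", "db.ts": "database", "ui.tsx": "ui"}[file]


all_hunks = sorted(everything)  # snapshot of the pile before sorting

while everything:
    hunk = everything.pop(0)               # pull one hunk out of the pile
    history[label_for(hunk)].append(hunk)  # drop it into its labeled box

sorted_hunks = sorted(h for hunks in history.values() for h in hunks)
```

When the loop ends the pile is empty and every hunk sits in exactly one box, so the boxes together hold the same changes the pile did. The original fix and its later correction to `db.ts` land in the same commit, which is the tidy story the messy log couldn't tell.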

00:17:49 He flags the cost too: there's no guarantee that every intermediate commit compiles, so if you depend on clean bisects through history, this one isn't for you. I'm bringing it up not because it's earth-shaking but because it's exactly the kind of craft knowledge that's getting starved of attention while everyone talks about agents.

00:18:09 The shape of your commit history is a gift to the next person reading the diff — and that person is usually a future you. A tool that lets you separate writing the code from telling its story is doing something useful. If you've been jj-curious, this is a nice concrete reason to try it on your next big feature branch.

00:18:28

Filming your chores for the robots

00:18:28 This next one started as a Reddit headline and got more interesting once I checked it. The post claimed OpenAI is paying people in New York to install three-hundred-sixty-degree cameras in their homes that record everything — vacuuming, washing dishes, and cooking.

00:18:44 I couldn't verify the OpenAI-specific version of that, and I'd take the exact framing with a grain of salt. But the underlying trend is well-documented, and I'll walk you through it. There's a labor market forming to film household chores as training data for humanoid robots.

00:19:01 The Washington Post and CNN have both covered it. DoorDash launched a Tasks app back in March that lets its US delivery workers — there are around eight million of them — earn money by recording themselves folding laundry, washing dishes, and making beds. A company called Micro1 says it has roughly four thousand robotics generalists across seventy-one countries sending in more than a hundred and sixty thousand hours of video every month.

00:19:27 Pay runs around twenty to twenty-five dollars an hour. The gear is usually a head-mounted phone, not a fancy camera rig. And the robots being trained on all this are the ones you've heard about — Figure, Tesla, and Agility. The reason this deserves your attention as a builder isn't the privacy angle, though that's there too.

00:19:47 It's what it tells you about where the bottleneck is for embodied AI. Language models had the open web: trillions of tokens of text just sitting there to be scraped, the same grab-first habit that Meta lawsuit is about. Robots don't have that. There's no internet-scale corpus of someone's hands folding a fitted sheet from a first-person view.

00:20:07 The demonstrations have to be manufactured, one chore at a time, by paying actual people to do their actual housework on camera. So the next great training-data land grab isn't happening on the web. It's happening in living rooms. And I think the shape of it is going to matter.

00:20:24 The text-data era was a grab-first, ask-later free-for-all, and we're now watching the legal bill come due for it. The physical-demonstration era is starting from a different place — these are explicit, paid, consented recordings, with a wage attached and a worker who agreed.

00:20:40 Whether that's because the industry learned something from the copyright fights or just because there's no other way to get the data, I can't tell you. But it's a more transparent acquisition model than scraping, and if it holds, it's a small piece of good news buried inside a slightly dystopian-sounding gig listing.

00:21:00

Pope Leo XIV, and what no machine replaces

00:21:00 I want to end somewhere unusual for a developer show. Earlier this month, Pope Leo XIV published an encyclical — one of the most formal teaching documents the Catholic Church issues — and it's largely about artificial intelligence. It's called Magnifica Humanitas, and it climbed the front page of Hacker News, which tells you something about who's reading it.

00:21:20 You don't have to share the faith to find the argument compelling, because it lands on territory this show cares about: work, and what these tools are for. The central line is one I keep turning over. Technology, he writes, is never neutral, because it takes on the characteristics of those who devise, finance, regulate and use it.

00:21:39 That's a direct challenge to a reflex a lot of us in engineering carry around — the idea that a tool is just a tool, that the model is neither good nor bad, it's all in how you use it. His claim is that a system already carries the values of the people who built it and paid for it before you ever touch it.

00:21:57 After the day we just walked through — a lawsuit about how training data got sourced, a debate about what the labs won't disclose — that lands harder than it would have a week ago. He isn't anti-technology, and he says so directly: technology should not be considered, in itself, a force antagonistic to humanity.

00:22:15 His worry is concentration — that the power to reshape how we live is landing in private hands faster than any democratic institution can keep up with. And he makes a point about work specifically that any engineer watching headcount decisions should sit with: that work carries a dignity that doesn't come from its productivity, and that reducing a person to a cost of production gets something fundamental wrong.

00:22:39 His phrase is that human dignity is something no machine can ever replace. You can take that as theology or you can take it as a design brief. I take it as both. We covered a system that proves theorems for a few hundred dollars, model-routing that cuts cost forty-fold, a lab rationing its own tokens, and robots learning to fold laundry from gig workers.

00:22:59 Every one of those is a story about getting more done with less human effort. The encyclical is a reminder, from outside the industry, to keep asking what the less-human-effort is in service of — and to notice the moment the answer shrinks to nothing but the cost line.

00:23:15 That tension — between what we can now automate and what we shouldn't reduce to a number — runs under most of what this show covers. It's the question I'd keep in the room the next time you wire up an agent and watch it do in a minute what used to take you a morning.

00:23:30 Lenar Kess.