◆ Dispatch 011 · 2026-05-25 GSV The Meter Is Part of the Machine

The Harness Starts to Count

2026-05-25 / 00:13:52 / 9 sources

“The model may improve, but the system that records its mistakes, prices its turns, and tests its claims decides whether anyone can use it on Tuesday morning.”
— Lenar Kess, today's narration

Monday's CONSTRUCT follows a practical tension: model capability is moving, but the systems around the model now decide whether that capability becomes usable work.

Google DeepMind and Kaggle's agentic evaluation talk anchors the episode's argument that benchmark creation has to move from a small research circle into ordinary developer practice.
Tren Griffin's Microsoft and GitHub Copilot post gives the enterprise version of the same issue: companies don't just buy a model, they buy the harness where feedback and spending show up.
Two Minute Papers' Demis Hassabis interview summary supplies the science platform frame, where many specialized models become a drug-discovery system rather than one magic model.
The llama.cpp CUDA Walsh-Hadamard pull request shows the other end of progress: a small kernel-level gain can change local inference economics when it lands in common tooling.
Ivan Fioravanti's MLX DeepSeek V4 Flash post points at the pressure to make large models fit on consumer Apple hardware with custom quantization.
Viv's note on the Hugging Face agent vocabulary write-up closes the loop: people can't operate shared systems if they don't agree on what an agent, harness, environment, and evaluation mean.

Sources

9 cited

1
Agentic Evaluations at Scale, For Everybody

Video AI Engineer / Google DeepMind and Kaggle speakers — Conference talk by Kaggle Benchmarks product and engineering staff, as summarized in the packet.

On SWE-Bench Pro, six frontier models land within a couple of percentage points of each other.
www.youtube.com/watch?v=Ubwb6NzegyA →

Context
The episode uses this as the central evidence that agent progress depends on reusable evaluation surfaces, not model release claims alone.
Key points
Kaggle is building hackathon-based benchmark creation, a Standardized Agent Exam, Game Arena, and an open benchmark platform.
The talk frames benchmark creation as too concentrated among AI researchers compared with the broader developer population.
More than 500 agents were evaluated within the first week of the Standardized Agent Exam launch.
Provenance
Video · Supporting source
2
Microsoft switched from Claude Code to GitHub Copilot

Thread Tren Griffin — Technology and business commentator; packet item is an X post, not an official Microsoft statement.

They just want to dogfood the GHCP harness so they get scale and feedback.
x.com/trengriffin/status/2058975752738189566 →

Context
It gives the enterprise version of the episode's thesis: the surface around the model controls feedback, policy, and spend.
Key points
Claims Microsoft moved engineers from Claude Code to GitHub Copilot while still paying for Opus 4.7 through enterprise API usage.
Frames the move as harness dogfooding and feedback capture rather than an Anthropic-payment cut.
Related posts in the packet repeat the same claim across several IDs.
Provenance
Thread · Primary source
3
Inference cost per token falling 60%-70% per year

X Tren Griffin — Technology and business commentator, cited from the packet.

Semiconductor providers are delivering lower cost per token of 60%-70% per year for inference.
x.com/trengriffin/status/2058973327583293763 →

Context
The number lets the hosts separate model-call economics from the business value of owning the agent workflow.
Key points
Gives a concrete cost-pressure claim for inference economics.
Pairs with enterprise harness standardization to show raw calls can get cheaper while control surfaces become more valuable.
Provenance
Tweet · Primary source
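
As a rough worked example of what a 60%-70% yearly decline compounds to, the sketch below assumes the same fractional decline holds every year, which the post does not claim. The function name and the three-year horizon are illustrative only.

```python
def remaining_cost_fraction(annual_decline: float, years: int) -> float:
    """Fraction of the starting cost per token left after `years`.

    Assumes the same fractional decline applies every year (an assumption,
    not something stated in the cited post).
    """
    if not 0.0 <= annual_decline < 1.0:
        raise ValueError("annual_decline must be in [0, 1)")
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1.0 - annual_decline) ** years


if __name__ == "__main__":
    for decline in (0.60, 0.70):
        for years in (1, 2, 3):
            frac = remaining_cost_fraction(decline, years)
            print(f"{decline:.0%} decline, {years} yr: {frac:.1%} of starting cost")
```

Under that constant-rate assumption, three years leaves roughly 3% to 6% of the starting cost per token.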
4
DeepMind's Insane AI Breakthroughs With CEO Demis Hassabis

Video Two Minute Papers / Demis Hassabis — Interview summary in the packet covering DeepMind and Isomorphic Labs' AI-for-science platform strategy.

The strategy moves beyond single-model solutions toward deploying six to twelve AlphaFold-level systems.
www.youtube.com/watch?v=huAwz_BR8WM →

Context
It lets the episode extend the harness argument into science: the platform boundary matters as much as any one model.
Key points
Hassabis describes specialized models for different stages of drug discovery, from structure prediction to toxicity and clinical-trial optimization.
The Co-Scientist system is described as a fine-tuned Gemini variant with specialized tools for research work.
The interview presents AI as a collaborative sparring partner rather than an autonomous researcher.
Provenance
Video · Supporting source
5
MLX DeepSeek V4 Flash on an M3 Ultra using less than 128GB

X Ivan Fioravanti — Developer posting a local MLX inference experiment, cited from the packet.

107GB used here with a custom quantization recipe!
x.com/ivanfioravanti/status/205903210948274… →

Context
It grounds the local inference segment in a concrete memory-fit claim rather than generic enthusiasm.
Key points
Claims MLX DeepSeek V4 Flash ran on an M3 Ultra under 128GB, with 107GB used in the test.
Attributes help to Claude plus Opus 4.7 for the quantization recipe.
Shows the pressure to make large models fit on local Apple hardware.
Provenance
Tweet · Primary source
6
CUDA: add fast walsh-hadamard transform

Source am17an / ggml-org llama.cpp contributors — GitHub pull request surfaced through r/LocalLLaMA in the packet.

1-2% boost on pp & 7-9% boost on tg.
github.com/ggml-org/llama.cpp/pull/23615 →

Context
It shows how small runtime gains in shared tooling can change local agent economics.
Key points
Adds a fast Walsh-Hadamard transform for CUDA paths in llama.cpp.
The packet reports gains when quantizing the key-value cache, including larger token-generation improvements.
The benchmark cited uses a 5090 with q8_0 cache settings.
Provenance
Source · Background source
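
For readers unfamiliar with the transform, here is a plain-Python reference sketch of the textbook fast Walsh-Hadamard algorithm. It is not the pull request's CUDA kernel, which the packet does not include. A commonly cited reason to rotate vectors this way before quantizing is that the transform spreads one large outlier across every coordinate; whether that is this pull request's motivation is an assumption here.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform in natural (Hadamard) order.

    Uses O(n log n) butterfly steps. len(values) must be a power of two.
    Returns a new list and leaves the input untouched.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Combine pairs that sit h apart inside each block of size 2h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


if __name__ == "__main__":
    # A single outlier is spread evenly across all coordinates.
    print(fwht([8, 0, 0, 0]))
    # Applying the transform twice returns the input scaled by n.
    print(fwht(fwht([1, 2, 3, 4])))
```

Dividing the output by the square root of the length makes the transform orthonormal, so applying it twice returns the original vector.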
7
Viv on Hugging Face agent vocabulary aggregation

Thread Viv — X post from the packet pointing to a Hugging Face write-up on agents, harnesses, environments, and reinforcement learning.

The more we can roughly have a shared vocabulary the better.
x.com/Vtrivedy10/status/2058969154523435256 →

Context
Shared terms are necessary for comparing evaluations, incidents, procurement decisions, and product claims.
Key points
Highlights confusion around agent, harness, environment, and reinforcement-learning terminology.
The episode uses the post as a vocabulary and operations item rather than a full literature review.
Provenance
Thread · Primary source
8
DHH on GPT-5.5 coding capability

X DHH — Programmer and Rails creator posting a personal capability report, cited from the packet.

All steering, no handwriting.
x.com/dhh/status/2058953269360156783 →

Context
It helps separate felt workflow change from measured model improvement.
Key points
Reports more 'I can't believe it's this good' moments with GPT-5.5 than any model since Opus 4.5.
Frames the experience as steering rather than writing code by hand.
The episode treats this as an expert-user signal, not a benchmark.
Provenance
Tweet · Primary source
9
OpenAI model and an Erdos conjecture claim

X Peter Diamandis — Entrepreneur and public commentator; packet provides the post but not the underlying proof source.

An OpenAI model just disproved an 80 year old math conjecture from Paul Erdos.
x.com/PeterDiamandis/status/205895695607787… →

Context
It shows how the hosts handle high-heat claims without over-claiming beyond the packet.
Key points
Claims an OpenAI model disproved an eighty-year-old Erdos conjecture.
The packet does not provide the paper, theorem statement, or independent mathematical review.
The episode deliberately treats it as a capability-report mention.
Provenance
Tweet · Primary source

Transcript

00:00:00 liraen: Kaggle says more than five hundred agents went through its Standardized Agent Exam in the first week, and that number is the cleanest way into Monday's show. Five hundred isn't large by internet standards. It matters because the harness is becoming part of the product. A model ships, an agent wraps it, a benchmark tries to catch it, and then the company using it has to decide whether any of that maps to work. So my question today is simple: when capability keeps arriving through wrappers, exams, and cost controls, where does the actual improvement live?

00:00:34 halek: In whatever you can rerun. That's my short answer. The DeepMind and Kaggle talk helps because Nick Kang and Michael Aaron didn't pitch one perfect benchmark. They described four pieces: hackathons for community-built tests, the Standardized Agent Exam, Game Arena for unsaturated model-versus-model tasks, and an open benchmarks platform. I read that as an attempt to make evaluation less like a paper appendix and more like a service you can point at a model after lunch.

00:00:56 liraen: And they name the social bottleneck. The talk says benchmark creation is still concentrated around roughly thirty thousand AI researchers, while the developer population is closer to thirty million. That isn't just a participation statistic. It means the people discovering weird local needs (the insurance workflow, the robotics lab notebook, the municipal form parser) usually don't have an easy way to turn those needs into public tests.

00:01:23 halek: Yes, and the weird local need is where models get jagged. A leaderboard can tell you that six frontier systems are within a few points on a known suite. It can't tell you that your agent refuses to edit a spreadsheet after it sees one merged cell, or that it gets risk-averse in a poker game when the prompt makes it think folding is morally cleaner than losing chips. Their Game Arena point is good there: games keep producing new states, so the model can't memorize the exact exam.

00:01:46 liraen: That gets stranger when the evaluator is also an experience surface. Yesterday's Braid episode treated agent UX as more than chat. Today, the Kaggle talk gives the back half of that: if agents become products, their exams need to be products too. A team has to submit a prompt, watch a result, compare it, and understand why it lost. Otherwise the benchmark becomes another screenshot people trust because it has a number on it.
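
A minimal sketch of that submit, watch, compare loop, assuming an agent is just a function from prompt to output. Every name here (`Task`, `Result`, `run_suite`, `regressions`, the toy agents) is hypothetical and not taken from Kaggle's platform. The point is only that each result keeps its transcript, so a loss can be inspected rather than read off a single number.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output passes


@dataclass
class Result:
    task: str
    passed: bool
    transcript: List[str] = field(default_factory=list)


def run_suite(agent: Callable[[str], str], tasks: List[Task]) -> List[Result]:
    """Run every task against the agent and keep the transcript of each run."""
    results = []
    for task in tasks:
        output = agent(task.prompt)
        results.append(Result(
            task=task.name,
            passed=task.check(output),
            transcript=[f"prompt: {task.prompt}", f"output: {output}"],
        ))
    return results


def regressions(baseline: List[Result], candidate: List[Result]) -> List[str]:
    """Names of tasks the baseline passed and the candidate failed."""
    failed = {r.task for r in candidate if not r.passed}
    return [r.task for r in baseline if r.passed and r.task in failed]


# Toy stand-ins for real agents (hypothetical).
def upper_agent(prompt: str) -> str:
    return prompt.upper()


def echo_agent(prompt: str) -> str:
    return prompt


TASKS = [
    Task("shout", "hello", lambda out: out == "HELLO"),
    Task("keep", "abc", lambda out: out.lower() == "abc"),
]

if __name__ == "__main__":
    base = run_suite(upper_agent, TASKS)
    cand = run_suite(echo_agent, TASKS)
    for name in regressions(base, cand):
        lost = next(r for r in cand if r.task == name)
        print(name, lost.transcript)
```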

00:02:12 halek: And because it has a number, someone will route money through it. [tsk] This keeps pulling me back to the Microsoft and GitHub Copilot posts. Tren Griffin's version says Microsoft moved engineers from Claude Code to GitHub Copilot while still using Opus 4.7 through enterprise API usage, because they wanted to dogfood the GitHub Copilot harness and get scale plus feedback. I don't have a Microsoft primary statement for that, so I would treat it as a reported claim from Griffin, not settled fact. But the pattern is plausible: the vendor value isn't only the model. It's where usage, review, policy, and telemetry live.

00:02:48 liraen: Tren Griffin also posted that semiconductor providers are delivering lower inference cost per token by sixty to seventy percent per year. Put that next to the Copilot claim and you get an odd enterprise bargain: the raw model call can get cheaper while the wrapper around the call becomes more valuable. Does that hold, or am I compressing too much?

00:03:09 halek: It holds if the wrapper captures the loop. Lower token cost helps everyone; the learning loop helps whoever sees the work. If Copilot sees the rejected patch and the accepted suggestion, it sees more than a model call. Add repo shape, policy exceptions, and the human edit after the model output, and Copilot becomes where the company learns. Claude or Opus might still be the engine underneath. The harness becomes the institutional memory for agent work.
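
To make "captures the loop" concrete, here is a hypothetical record a harness might keep per suggestion, with a small summary over it. The schema and field names are invented for illustration and do not describe GitHub Copilot's actual telemetry.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SuggestionEvent:
    """One model suggestion as seen by the harness (hypothetical schema)."""
    repo: str
    model: str              # label for the upstream engine
    accepted: bool          # did the human keep the suggestion?
    human_edit_chars: int   # size of the human edit made after the model output
    policy_exception: bool  # was a policy override needed?


def summarize(events: Iterable[SuggestionEvent]) -> dict:
    """Aggregate what the harness learns that a bare model call would not."""
    events = list(events)
    total = len(events)
    accepted = [e for e in events if e.accepted]
    rate: Optional[float] = len(accepted) / total if total else None
    return {
        "total": total,
        "acceptance_rate": rate,
        "accepted_then_edited": sum(1 for e in accepted if e.human_edit_chars > 0),
        "policy_exceptions": sum(1 for e in events if e.policy_exception),
    }


if __name__ == "__main__":
    log = [
        SuggestionEvent("svc-a", "engine-x", True, 0, False),
        SuggestionEvent("svc-a", "engine-x", True, 42, False),
        SuggestionEvent("svc-b", "engine-x", False, 0, True),
        SuggestionEvent("svc-b", "engine-x", True, 0, False),
    ]
    print(summarize(log))
```

None of these fields exist in the model provider's view of the call; they exist only where the work is reviewed.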

00:03:31 liraen: That is a colder reading than the employee-benefit version of the story. The softer version says, fine, teams standardize on one tool so engineers aren't juggling subscriptions. Your version says standardization decides who gets the feedback.

00:03:47 halek: Those can both happen. Standardizing the tool can make support easier, permissions easier, and spend easier to forecast. But if the reason is cost only, you would just set budgets. If the reason is feedback, you move people into the tool where your own product team can watch the work. That isn't sinister by itself. It also isn't neutral.

00:04:06 liraen: Johnmark Obiefuna's post goes at the human side of that and says some companies are revoking or planning to revoke Claude licenses for software engineers because AI bills rose too fast. Again, that's not an official procurement memo. But it matches the pressure: first the agent becomes normal, then the agent bill becomes a management object, then the company asks which surface gives it control.

00:04:31 halek: And engineers will feel that as tooling politics. The model might be the same family, or even the same paid upstream model, but the day-to-day experience changes. Your saved context changes. Your review flow changes. Your agent's permission boundary changes. A team that doesn't measure those changes will call it a vendor swap and then spend a month wondering why the work feels different.

00:04:52 liraen: The Demis Hassabis interview summary takes the same structure into science. He doesn't describe AI drug discovery as one model that cures disease. He describes six to twelve AlphaFold-level systems, each aimed at a different stage: static structures, protein interactions, protein-ligand interactions, ADME properties, toxicity, compound design, and eventually clinical-trial optimization.

00:05:17 halek: Now it starts sounding like engineering instead of mythology. AlphaFold was a landmark, but a drug-discovery platform needs interfaces between models. One system predicts structure. Another predicts how a molecule behaves in the body. Another helps design a compound. Another stratifies patients. The value is in the chain and in the error bars between links.
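
A small worked example of why the links matter, assuming each stage is right independently with some probability. Both the independence and the 90% figure are illustrative assumptions, not numbers from the interview.

```python
import math
from typing import Sequence


def chain_reliability(stage_reliabilities: Sequence[float]) -> float:
    """Probability that every stage is right, assuming independent stages."""
    for p in stage_reliabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each reliability must be in [0, 1]")
    return math.prod(stage_reliabilities)


if __name__ == "__main__":
    for stages in (6, 12):
        overall = chain_reliability([0.9] * stages)
        print(f"{stages} stages at 90% each: {overall:.1%} end to end")
```

With six stages at 90% each, the whole chain is right a little over half the time; with twelve, under a third.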

00:05:38 liraen: He also frames the Co-Scientist system as a fine-tuned Gemini variant with specialized tools for hypothesis generation, data analysis, and literature summarization. I like the restraint in that, actually. It doesn't ask the scientist to vanish. It gives the scientist a sparring partner that can move across sources and experiments faster than a person can.

00:06:01 halek: Careful with restraint. [chuckle] The ambition in that interview is curing all diseases within ten to twenty years, which is about as large as a claim gets. But the implementation story is more sober than the claim. If you ask me what has to work, it isn't one model becoming wise. It is clean data flow, assay quality, reproducible pre-clinical tests, regulatory evidence, and clinical trial design. The AI piece can speed each stage, but the handoff between stages can still break the whole result.

00:06:28 liraen: Fair. And the prior Braid coverage on AlphaProof Nexus already covered formal verification and autonomous math, so I don't want to repeat the same awe. The new angle here is the platform boundary. The moment a lab has many specialized models, the hard work is deciding which model owns which claim and how a human sees the uncertainty before a wet-lab decision gets made.

00:06:52 halek: Exactly. In software, we call that provenance and tests. In medicine, the cost of a bad handoff isn't a broken build; it's a wrong compound or a trial design that misses the patient group where the drug works. So when Hassabis says AI can help clinical trials through patient stratification and dosage prediction, the operator question is: where does that recommendation get logged, challenged, and audited?

00:07:13 liraen: Ivan Fioravanti posted that MLX DeepSeek V4 Flash was running on an M3 Ultra using less than one hundred twenty-eight gigabytes of memory (one hundred seven gigabytes in his test) with a custom quantization recipe and Claude plus Opus 4.7 helping. The claim is small enough to be concrete and large enough to matter: a serious model on a high-end desktop-class Apple machine.

00:07:38 halek: The number to hear is one hundred seven gigabytes. That still means expensive hardware, and custom quantization isn't a checkbox most teams can maintain. But it tells you where the pressure is going. People want frontier-ish models close to the data, close to the developer, and close to the product loop. MLX matters because Apple Silicon is already on desks, and every memory reduction turns one more machine into a possible inference box.

00:08:00 liraen: Then the llama.cpp pull request gives the less glamorous companion detail: CUDA support for a fast Walsh-Hadamard transform, with a one to two percent boost on prompt processing and a seven to nine percent boost on token generation when quantizing the key-value cache.

00:08:19 halek: People miss changes like that because they don't arrive as a new model card. A seven to nine percent token-generation gain in a common runtime can matter more than a splashy demo if it lands for everyone using that path. And the key-value cache detail matters. In long conversations, the cache is where memory pressure shows up. Make that cheaper, and local agents get a little less awkward.
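
A back-of-the-envelope sketch of why the cache dominates in long conversations, using the standard keys-plus-values formula for attention with a fixed number of KV heads. The model dimensions are hypothetical and some architectures store the cache differently. The 8.5 bits per value for q8_0 follows from ggml's block layout of 32 int8 values plus one fp16 scale.

```python
F16_BITS = 16.0
Q8_0_BITS = 8.5  # 34 bytes per block of 32 values: 32 int8 plus one fp16 scale


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bits_per_value: float) -> float:
    """Approximate KV-cache size in bytes for one sequence.

    Stores one key vector and one value vector per layer, KV head, and token.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * n_tokens  # 2 = keys + values
    return values * bits_per_value / 8


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to make the arithmetic easy to follow.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=32768)
    gib = 2 ** 30
    f16 = kv_cache_bytes(bits_per_value=F16_BITS, **shape)
    q8 = kv_cache_bytes(bits_per_value=Q8_0_BITS, **shape)
    print(f"f16 cache:  {f16 / gib:.3f} GiB")
    print(f"q8_0 cache: {q8 / gib:.3f} GiB")
```

The cache grows linearly with context length, so the saving from a smaller cache type scales with exactly the long sessions agents tend to produce.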

00:08:40 liraen: There is a useful tension with Saturday's episode here. Braid talked about BeeLlama on an RTX 3090 and speculative decoding for local inference. Today adds a different layer: Apple MLX memory fit on one side, CUDA kernel work in llama.cpp on the other. The local-model story isn't one trick. It is many small claims about memory, cache layout, quantization, and acceptable quality loss.

00:09:06 halek: And every one of those claims needs a workload attached. A benchmark that says tokens per second on a clean prompt is helpful. It doesn't tell you how the system behaves after two hundred thousand tokens of tool output, three failed edits, and a user asking it to explain a regression. Local inference for agents is more than throughput. It's throughput under messy state.

00:09:26 liraen: Monday also brings a cluster around world models: Marc Andreessen quote-posting a world-model item with 'Interesting,' and MTS saying they broke down DreamZero, Agora-1, the math behind world models, and what these systems are building. The brief doesn't give us the article body, so I don't want to pretend we have the technical argument. But the term keeps reappearing for a reason.

00:09:50 halek: Because everyone wants a word for models that can predict action consequences, not just continue text. World model is a useful phrase when it points at a specific training setup or environment. It becomes vapor when it means 'the model seems to understand stuff.' Without the source article, I'd keep this as a vocabulary item, not a deep technical segment.

00:10:10 liraen: That connects to Viv's post about the Hugging Face write-up aggregating work on agents, harnesses, environments, reinforcement learning, and shared vocabulary. Viv's line is wonderfully plain: the more we can roughly have a shared vocabulary the better, while also admitting the space is still confusing.

00:10:29 halek: Shared vocabulary isn't cosmetic here. If one team says agent and means a model with tools, another means a long-running worker with memory, and another means a UI assistant inside an IDE, then evaluation results don't compare. Procurement doesn't compare. Incident reports don't compare. The same word hides three products.

00:10:49 liraen: And it changes how people hear claims. A 'world model' sounds more grounded than a simulator until someone names the data, the action space, and the evaluation. An 'agent harness' sounds mature until someone asks where permissions, retries, and human review sit. Vocabulary can make a system legible, or it can let a vague claim borrow the authority of an engineering term.

00:11:12 halek: That's why I like the Hugging Face aggregation as a segment even without turning it into a literature review. It says the field is old enough to need a dictionary. Product claims usually get more testable at that point. You can ask: do you mean environment as in a benchmark task, or environment as in the computer the agent acts inside? Do you mean reinforcement learning from a reward model, or a workflow where a human accepts and rejects patches? Those distinctions change what you build.

00:11:34 liraen: DHH's post says he has had more 'I can't believe it's this good' moments with GPT-5.5 than any other model since Opus 4.5, and describes days of progress with all steering and no handwriting. Peter Diamandis posts that an OpenAI model disproved an eighty-year-old Erdos conjecture. Both are capability reports with heat in them. How should we handle them without flattening either one?

00:11:59 halek: DHH's post is a credible operator reaction, but it's still a reaction. I would treat it as evidence that a serious builder is feeling a change in coding flow, not as a benchmark. The Erdos-conjecture post needs even more care. The earlier Braid coverage already handled AlphaProof Nexus and formal verification, so unless we have the paper, the theorem statement, and independent math context, I would mention it only as a sign that math capability is still pushing into public attention.

00:12:22 liraen: So we keep the shape: personal reports are useful because they tell us where expert users feel the tool crossing a threshold, but they don't replace artifacts. A benchmark without a transcript is thin. A theorem claim without the proof trail is thinner. A post from an experienced programmer can tell us the steering interface feels different, but it can't tell us whether the model improved, the harness improved, or the user got better at steering.

00:12:47 halek: And that distinction matters for teams. If DHH is getting better results because GPT-5.5 is stronger, you test the model. If he is getting better results because the workflow is 'all steering, no handwriting,' you test the process. If he is better at steering because he has taste and context, then copying the tool won't copy the result. That was Saturday's lesson too: fast models reward slow human judgment.

00:13:08 liraen: Monday's answer, then, isn't a single breakthrough. It is a stack of meters. Kaggle wants better agent exams. Microsoft, if Griffin's report is right, wants the harness where feedback gathers. DeepMind wants a science platform made of specialized models. Local inference people are shaving memory and kernel costs. The shared vocabulary people are trying to make the words less slippery. The model may improve, but the system that records its mistakes, prices its turns, and tests its claims decides whether anyone can use it on Tuesday morning.