◆ Dispatch 042 · 2026-05-30 GSV The Number Nobody Optimized For
The number nobody optimized for
“Two documents from the same week that don't contradict each other. One says Opus 4.8 jumped on math and slipped on business ops. The other says the whole bar-chart genre measures the harness as much as the model.”
— Lenar Kess, today's narration
Claude Opus 4.8 landed overnight with a math score that leapt and a business-ops score that fell — and reading the release honestly means distrusting the chart. Lenar and Damra work through the gap between the number that moved and the number that matters, then chase it into agent budgets, the protocol wars, local-inference tooling, Mistral's on-prem bet, and the power grid.
- A scrape of 100+ Opus 4.8 evals shows USAMO 2026 jumping 69%→97% while Vending-Bench 2 nearly halved — a retune that helped some distributions and hurt others.
- "AI benchmarks are useless" argues the record scores ride on elaborate prompt setups: change a few prompt words and results swing 10–20 points.
- The BAGEN study finds frontier agents can't estimate their own remaining budget mid-task — which collides with enterprises trying to rein in "tokenmaxxing" (WSJ via Techmeme).
- "MCP is dead?" gets a sharp rebuttal from OpenAI's Max Stoiber: nearly every company is building an MCP server, even ones with no CLI or external API.
- Multi-token prediction benchmarks hit ~3.3x faster local inference; llama.cpp got a real website and antirez shipped distributed inference.
- Notes from the Mistral AI Now Summit — on-prem KYC at BNP Paribas, against a comment that Mistral's 120B "small" model loses to models a quarter its size. xAI countered with a one-dollar coding model.
- FERC's June grid-connection proposal is the duller, realer infrastructure story next to an unsourced TerraFab "one terawatt" claim.
Chapters
- 00:00:00 Transcript
Sources
20 cited
1
AI Engineer · 20m12s
Video AI Engineer
Why your agents need decision traces, not just documents — Zach Blumenfeld, Neo4j — A knowledge base tells a financial analyst agent the risk factors. A context graph tells it whether to reject or accept, because it…
www.youtube.com/watch?v=B9h9ovW5H9U
- Context
- Directly addresses agentic tools and advanced AI infrastructure (graph DBs, RAG extension, decision tracing), which is core to the podcast topic.
- Provenance
- Video · Supporting source
2
Techmeme - Industry Adjacent (US)
Article
Executives at Uber, Meta, Microsoft, and other companies are trying to rein in "tokenmaxxing" by employees, which led to ballooning AI use costs (Bradley Olson/Wall Street Journal) - Bradley Olson / Wall Street Journal.…
www.techmeme.com/260529/p22
- Context
- Directly addresses the cost and resource constraints (compute/money) of AI, a core topic.
- Provenance
- Article · Supporting source
3
@xai (xAI)
X xai
grok-build-0.1 is now available via the xAI API in public beta. This is the same model that powers the Grok Build CLI and excels at agentic coding. Priced at $1/m input and $2/m output, it’s extremely cost effective,…
x.com/xai/status/2060392249402552457
- Context
- Announces a new, specific, and functional agentic coding tool (grok-build-0.1) and its API, directly addressing the podcast's focus on agentic tools and frontier models.
- Provenance
- Tweet · Primary source
4
@ttunguz (Tomasz Tunguz)
X ttunguz
I've been using state-of-the-art models to teach small models running on my computer how I work. The result : a personal agent that runs my inbox, my deal pipeline, my blog, my calendar, & my research. 🧵
x.com/ttunguz/status/2060393514144502070
- Context
- Describes a personal agent built using state-of-the-art models to automate professional workflows (inbox, pipeline, calendar), directly addressing agentic tools and AI application.
- Provenance
- Tweet · Primary source
5
@ggerganov (Georgi Gerganov)
X ggerganov
llama.cpp now has an official website: https://llama.app Our goal is to make local AI accessible to everyone, and improving the user experience is a big part of that. On the new landing page you’ll find a single-line…
x.com/ggerganov/status/2060394400237109567
- Context
- Announcing a primary artifact (official website/installer) for a key local AI tool (llama.cpp), directly related to AI infrastructure and accessibility.
- Provenance
- Tweet · Primary source
6
Notes from the Mistral AI Now Summit — 399 pts · 174 comments
Article vnglst
https://koenvangilst.nl/lab/mistral-ai-now-summit · @trouve_search: OK, I'm 100% rooting for both Mistral and task focused small models. But Mistral has fall really far behind since 2025Q3. It seems they can't get good…
koenvangilst.nl/lab/mistral-ai-now-summit
- Context
- Directly discusses Mistral's positioning, small models, and on-prem use in regulated European industries, hitting core topics.
- Provenance
- Article · Supporting source
7
@wzenus (Zihan "Zenus" Wang)
X wzenus
🧵 Claude-Opus-4.8 takes you too much tokens - but is this issue general across agents? Do agents know how much they'll spend? Introducing Budget-Aware Agents (BAGEN): We study budget awareness across 4 envs & 5…
x.com/wzenus/status/2060397732846612489/pho…
- Context
- Discusses a technical limitation (token usage) and introduces a new concept/study (BAGEN) directly related to agentic tools and AI infrastructure.
- Provenance
- Tweet · Primary source
8
@antirez
X antirez
DwarfStar distributed inference is now on GitHub: you can run 2 bit Flash using 2 64GB machines, or 4 bit Flash with two 128GB machines or 4 64GB, ad so forth. Prefill speed will increase thanks to pipelining.…
x.com/antirez/status/2060403966676987918
- Context
- Reports a primary artifact (GitHub release) related to AI infrastructure (distributed inference/GPUs), directly relevant to the podcast's focus.
- Provenance
- Tweet · Primary source
9
@saen_dev (Saeed Anwar)
X saen_dev
1 in 3 teams running open weights is the inflection point where the ecosystem starts building tooling for open models instead of treating them as second-class citizens. Once the tooling catches up, the adoption curve…
x.com/saen_dev/status/2060409865638457805
- Context
- Discusses the critical inflection point of open weights adoption and the resulting tooling ecosystem, directly addressing the 'power dynamics' and 'AI infrastructure' aspects of the topic.
- Provenance
- Tweet · Primary source
10
@OnlyEvaWonder (Eva Wonder)
X OnlyEvaWonder
gemma leading the pack and ollama in the open inference section makes so much sense
x.com/OnlyEvaWonder/status/2060420977205370…
- Context
- Mentions specific models (Gemma) and tools (Ollama) within the context of open inference, directly related to AI infrastructure and frontier models.
- Provenance
- Tweet · Primary source
11
@LaceyPresley (Lacey)
X LaceyPresley
TERAFAB IS GOING TO BE INSANE. We’re targeting 100–200 billion custom AI + memory chips per year at full ramp that’s 1 terawatt (1,000 GW) of annual AI compute capacity. Roughly 50x current global AI chip output. This…
x.com/LaceyPresley/status/20604361356716320…
- Context
- Discusses massive, specific AI infrastructure scale (1 terawatt, 50x current output), directly addressing the podcast's focus on AI infrastructure and power dynamics.
- Provenance
- Tweet · Primary source
12
r/ClaudeAI: Ai Benchmarks are useless - 0 pts · 0 comments
Article Significant-Care-135
I'm done with the launch cycle. Every new model drops with the same flashy report, bar charts all over the place, hitting 92% on MMLU-Pro, 94% on GPQA, or whatever coding benchmark they're pushing this week. Then you...
www.reddit.com/r/ClaudeAI/comments/1trclg3/…
- Context
- Directly critiques the current state of AI evaluation (benchmarks), which is central to understanding frontier model capabilities and limitations in practice.
- Provenance
- Article · Supporting source
13
r/LocalLLaMA: I tested MTP on vLLM and llama.cpp for Gemma 4 & Qwen 3.6 — 3.34x faster inference, here are my findings RTX 6000 PRO. - 0 pts · 0 comments
Article FantasticNature7590
Hey guys, I spent the last few weeks benchmarking Multi-Token Prediction (MTP) on Gemma 4 31B and Qwen 3.6 27B locally GGUF, FP8 using both vLLM and llama.cpp. MTP is the inference trick every major lab is quietly...
www.reddit.com/r/LocalLLaMA/comments/1trf0r…
- Context
- Benchmarking MTP on major models (Gemma 4, Qwen 3.6) and frameworks (vLLM, llama.cpp) is a primary artifact that measurably changes the developer's mental model of inference speed.
- Provenance
- Article · Supporting source
14
MCP is dead? — 283 pts · 265 comments
Article nadis
https://www.quandri.io/engineering-blog/mcp-is-dead · @mxstbr: I run the team at OpenAI that's responsible for the ChatGPT App Store, Codex plugins, and all things MCP. The thing that all these "MCP is dead" posts are…
www.quandri.io/engineering-blog/mcp-is-dead
- Context
- Directly discusses AI agents, service access, and protocols for connecting models to external services, which is central to the podcast's focus on agentic tools and AI infrastructure.
- Provenance
- Article · Supporting source
15
AI News & Strategy Daily | Nate B Jones · 51s
Video AI News & Strategy Daily | Nate B Jones
The death of the filing cabinet #ai #tech — Full Story w/ Prompts: https://natesnewsletter.substack.com/p/your-engineers-are-building-your?r=1z4sm5&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true…
www.youtube.com/shorts/59NCmQ3hxz4
- Context
- Discusses the shift from siloed data (filing cabinets/Jira) to a unified, intelligent context platform, directly addressing AI infrastructure and the changing craft of software engineering.
- Provenance
- Video · Supporting source
16
@LaceyPresley (Lacey)
X LaceyPresley
Elon Musk's Terafab & The Silicon Empire TerraFab: Tesla’s $119B semiconductor foundry in Texas is a masterclass in vertical integration. By owning 2nm AI silicon production (in JV with SpaceX, xAI & Intel), Tesla is…
x.com/LaceyPresley/status/20605140423813246…
- Context
- Discusses a major, specific artifact (TeraFab foundry) and its direct impact on AI hardware/compute, which is central to the podcast's scope.
- Provenance
- Tweet · Primary source
17
r/Anthropic: Here's >100 evals for Opus 4.8 compared to top AI models - 0 pts · 0 comments
Article davidthesong
I scraped 100+ evals on Opus 4.8 to see what changed. The big gains vs 4.7: Math: USAMO 2026 jumped from 69% → 97% Coding: Vibe Code Bench +12 pp Economically valuable work: #1 of 275 on GDPval-AA Biology Long-context...
i.redd.it/xdz6vagi464h1.png
- Context
- This post ships a primary artifact (benchmark data) comparing a new frontier model (Opus 4.8) against others, directly addressing model capability and performance.
- Provenance
- Article · Supporting source
18
The AI Daily Brief: Artificial Intelligence News · 23m45s
Video The AI Daily Brief: Artificial Intelligence News
First Impressions of the New Opus 4.8 — Anthropic releases Claude Opus 4.8 with improved honesty, stronger self‑verification, and multi‑agent dynamic workflows for large code tasks. Benchmark scores narrow versus…
www.youtube.com/watch?v=zf8BfgJghd8
- Context
- Covers major topics: Anthropic/OpenAI model releases, agentic coding (Cognition), and AI infrastructure/power dynamics (K&E, Meta).
- Provenance
- Video · Supporting source
19
CourtListener AI RECAP Search - Legal Courts (US)
Article District Court, D. Vermont
Brunell v. OpenAI, LLC - Original document
www.courtlistener.com/docket/73209323/38/br…
- Context
- A lawsuit naming OpenAI directly addresses power dynamics, liability, and control, which is core to the podcast topic.
- Provenance
- Article · Supporting source
20
Techmeme - Industry Adjacent (US)
Article
Sources detail AI companies' engagement with US FERC as the energy regulator readies a June proposal to speed up data center connections to regional power grids (Politico) - Politico : Sources detail AI companies'...
www.techmeme.com/260530/p7
- Context
- Directly addresses AI infrastructure (data centers, energy) and power dynamics (regulators, capital), which is a core topic.
- Provenance
- Article · Supporting source
Transcript
00:00:00 Lenar: Here's a number from overnight. A model sat down with the USA Math Olympiad — this year's set, 2026 — and went from sixty-nine percent to ninety-seven percent in a single version bump. That's davidthesong on the Anthropic subreddit, who scraped more than a hundred evals on the new Claude Opus 4.8 to see what actually moved against 4.7. Ninety-seven percent on olympiad math is the kind of jump that, a year ago, would have been the entire headline. So my first question isn't whether the model is good. It's narrower — when a number moves that far, that fast, what's the first thing you check?
00:00:37 Damra: Contamination. The USA Math Olympiad set for 2026 — was it public before the training cut-off? Because the cleanest explanation for a twenty-eight-point jump on a named, dated benchmark is that the questions, or close paraphrases of them, ended up in the training data. I'm not saying that's what happened. I'm saying it's the first hypothesis, and the post doesn't rule it out. And the same scrape has the counterweight built right in. davidthesong lists the areas that barely moved or got worse — legal reasoning, healthcare, finance, and multilingual. And Vending-Bench 2, the run-a-tiny-business simulation, nearly halved.
00:01:16 Lenar: Halved. So the same release that aces olympiad math gets noticeably worse at running a pretend vending machine.
00:01:22 Damra: And that's the more honest picture of a frontier bump in 2026. One rising number doesn't drag everything up with it. The model got retuned, and the retune helped some distributions and hurt others. The vending-bench drop is the one I'd want Anthropic to explain, because that's a long-horizon task — keep your goal straight over many steps — and that's exactly the place these models are supposed to be getting better, not worse.
00:01:46 Lenar: The official framing, at least as it comes through the AI Daily Brief's write-up, isn't really about the math score at all. The pitch for 4.8 is reliability. Better code judgment, more self-correction, and what they call uncertainty flagging — the model telling you when it isn't sure instead of confidently bluffing. The detail that caught me is that it's more willing to critique a proposal without being asked. You hand it a plan, and it pushes back on the plan itself.
00:02:13 Damra: Which is useful inside a long agent run, and also the claim I trust least from a benchmark. How do you score 'pushed back appropriately'? And the caveat is right there in the same write-up — it sometimes grounds that critique in assumptions it never verified. So now it's confidently disagreeing instead of confidently agreeing. The sycophancy moved. It didn't leave.
00:02:36 Lenar: The headline numbers, for what they are, are modest and real. SWE-bench Pro went from about sixty-four to sixty-nine. Terminal-Bench 2.0 — the agent-in-a-shell test — from sixty-six to seventy-four. Humanity's Last Exam ticked up a few points. Solid, not seismic. But there's a post on the Claude subreddit this week, titled flatly 'AI benchmarks are useless,' and I think it's the right thing to read next to the eval scrape — not against it.
00:03:06 Damra: He's blunt, and he's mostly right. Quote — 'This is Goodhart's Law playing out completely. The labs tuned everything for the tests, and now we've got these fragile models that break down in production.' And then the line I'd underline for any working engineer — quote — 'Tweak a few words in the prompt and your results swing ten to twenty points.' That's the whole gap between a leaderboard and your repository. The record score rides on elaborate prompt setups and multi-shot prompts tuned to the eval. You send a normal prompt, and a lot of that performance evaporates.
00:03:39 Lenar: So we have two documents from the same week that don't actually contradict each other. One says Opus 4.8 jumped on math and slipped on business operations. The other says the entire bar-chart genre is measuring the harness as much as the model. What I'll do is wait for someone to run 4.8 on a fresh, private task and report back. The number I believe most is the one nobody optimized for.
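One way to see the "tweak a few words" effect on your own tasks is to score the same items under several prompt wordings and report the spread next to the best number. The sketch below is illustrative only: `toy_model`, the templates, and the two items are hypothetical stand-ins, not anything from the Reddit post or a real eval harness.

```python
from typing import Callable, Dict, List, Tuple

Item = Tuple[str, str]  # (question, expected answer)


def accuracy(model: Callable[[str], str], template: str, items: List[Item]) -> float:
    """Percent of items answered correctly under one prompt wording."""
    correct = sum(model(template.format(q=q)) == a for q, a in items)
    return 100.0 * correct / len(items)


def prompt_spread(
    model: Callable[[str], str], templates: List[str], items: List[Item]
) -> Tuple[Dict[str, float], float]:
    """Score every wording; return per-template scores and max minus min."""
    scores = {t: accuracy(model, t, items) for t in templates}
    return scores, max(scores.values()) - min(scores.values())


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call: it only gets the harder
    item right when the tuned phrase is present in the prompt."""
    if "2+2" in prompt:
        return "4"
    if "17*3" in prompt:
        return "51" if "step by step" in prompt else "41"
    return "?"


ITEMS = [("2+2", "4"), ("17*3", "51")]
TEMPLATES = ["Think step by step. {q}", "{q}"]

scores, spread = prompt_spread(toy_model, TEMPLATES, ITEMS)
print(scores, spread)
```

Reporting the spread alongside the top score is the point of the design: a leaderboard shows only the best wording, while the gap between best and worst is closer to what a normal prompt will see.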
00:04:02 Lenar: Here's a word the Wall Street Journal put in print this week — tokenmaxxing. Bradley Olson's piece, which came across Techmeme, says executives at Uber, Meta, Microsoft and others are trying to rein in tokenmaxxing by their own employees. People burning model spend so fast it's turned into a real line item on the bill.
00:04:21 Damra: And I want to be fair to the employees before anyone calls it waste. Tokenmaxxing usually means somebody found that stuffing the whole codebase into context, or firing off five agent attempts and keeping the best one, actually produces better work. The incentive is local and rational — better output today. The cost is somebody else's problem a quarter later, when finance reads the invoice. That's an unpriced resource being used the way unpriced resources always get used.
00:04:50 Lenar: And it connects directly to a study that dropped the same day. Zihan Wang — handle wzenus — posted a thread that opens, almost rudely, with quote, 'Claude-Opus-4.8 takes you too much tokens.' Then it turns into a real piece of research called BAGEN, Budget-Aware Agents. They test budget awareness across four environments and five frontier agents, and find structured failures in most of them.
00:05:16 Damra: What does 'budget awareness' even mean as a measurable thing? Because that's the part that makes it research instead of a complaint.
00:05:24 Lenar: Their definition is sharp. Quote — 'A budget-aware agent doesn't just spend less, but estimates remaining budget mid-task with uncertainty.' So the bar is whether the agent, halfway through a job, can tell you roughly how much of its budget is left, and how confident it is in that estimate — not just whether it spends less.
00:05:42 Damra: And the finding is that mostly they can't. The agent has no proprioception about its own spending. Ask it how much it's burned and it doesn't really know. That's the uncomfortable pairing with the tokenmaxxing story. The enterprise response is going to be the crude lever — a hard token cap, a quota, a rate limit. But a budget-blind agent under a hard cap doesn't plan toward the limit. It runs full speed and dies at the boundary, mid-task, with the work half done.
00:06:12 Lenar: Which is the worst of both — you paid for the tokens and you didn't get the finished job.
00:06:17 Damra: Right. The thing I'd actually want, and what BAGEN is gesturing at, is an agent that treats budget like a resource it reasons about — slows down, picks a cheaper path, and tells you it's running low and asks whether to continue. That's a planning capability, not a billing setting. And almost nobody's models have it yet. So the near-term fix is external governors clamping budget-blind agents, and that'll feel exactly as clumsy as it sounds.
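The contrast between a hard cap and a budget-aware agent can be sketched in a few lines. This is a toy illustration, not the BAGEN method: `BudgetTracker`, `run`, the fixed per-step costs, and the assumption that the agent knows how many steps its plan has left are all hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean, stdev
from typing import List, Tuple


@dataclass
class BudgetTracker:
    cap: int  # total token budget for the task (hypothetical unit)
    spent: List[int] = field(default_factory=list)

    def record(self, tokens: int) -> None:
        self.spent.append(tokens)

    @property
    def remaining(self) -> int:
        return self.cap - sum(self.spent)

    def steps_left(self) -> Tuple[float, float]:
        """(low, high) estimate of how many more steps the budget covers,
        using the spread of past step costs as a crude uncertainty band."""
        if not self.spent:
            raise ValueError("no steps recorded yet")
        avg = mean(self.spent)
        # With a single observation, assume the cost could double.
        spread = stdev(self.spent) if len(self.spent) > 1 else avg
        low = self.remaining / (avg + spread)
        high = self.remaining / max(avg - spread, 1)
        return low, high


def run(step_costs: List[int], cap: int, budget_aware: bool) -> str:
    """Replay a plan whose per-step costs are given, under a token cap."""
    tracker = BudgetTracker(cap)
    for i, cost in enumerate(step_costs):
        if cost > tracker.remaining:
            return f"died at step {i}: cap hit mid-task"
        if budget_aware and tracker.spent:
            low, _ = tracker.steps_left()
            if low < len(step_costs) - i:  # pessimistic estimate falls short
                return f"paused at step {i}: asks whether to continue"
        tracker.record(cost)
    return "finished"


print(run([100] * 10, cap=650, budget_aware=False))
print(run([100] * 10, cap=650, budget_aware=True))
```

With the same cap, the budget-blind run spends six steps' worth of tokens before failing, while the aware run stops after one step and hands the decision back. The estimate here is deliberately crude; the point is only that the check happens inside the agent loop rather than at the billing boundary.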
00:06:44 Lenar: There's a post on the Quandri engineering blog with a deliberately provocative title — 'MCP is dead?' — and it hit the front page of Hacker News, two hundred eighty-odd points. MCP is the Model Context Protocol, the thing Anthropic introduced to let agents reach outside services. The post is part of a recurring genre by now: the protocol's done, code-mode or plain command-line tools will eat it.
00:07:08 Damra: And there's a real technical argument underneath the headline. The complaint is that MCP as a transport — the actual wire format the model speaks — is clunky, and you could replace it with the agent just calling a command-line tool, or writing code that hits an API directly. That part's a fair fight. People really are routing around it.
00:07:29 Lenar: And then the top comment in the thread is from someone with standing to know. Max Stoiber — handle mxstbr — opens with, quote, 'I run the team at OpenAI that's responsible for the ChatGPT App Store, Codex plugins, and all things MCP.' His argument is that the transport question is a red herring.
00:07:48 Damra: Let me read the core of it, because it's the strongest version. Quote — 'practically every company on the planet is building an MCP server. I know this because we interact with all of them. Most of these companies don't have a CLI. Many of these companies don't even have an external API. And yet, they're all building MCP servers.' That's the move. The value isn't the wire format. It's that MCP became the default thing a company ships to make its service reachable by an agent at all.
00:08:19 Lenar: And he basically concedes the technical complaint to make that point — quote, 'Maybe we will turn every MCP server into a CLI under the hood. Maybe we'll use code mode.' He's saying the implementation can change completely and the protocol still won.
00:08:34 Damra: Which is a little too clean, so let me add the comment that complicates it. A user named tanin lays out the actual split. If you're building a connector for yourself or your team, skip MCP — just hand your teammates a command-line tool and some prompts. If you have external users, you need MCP, because that's what Cursor and the other agent apps support out of the box. So 'dead' is context-dependent. It's over-engineered for internal tooling, and it's the thing that gets you distribution if you want to be reachable by everyone else's agent.
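The internal-versus-external split can be made concrete by exposing one function two ways. The sketch below is a simplification under stated assumptions: `lookup_order` is a hypothetical business function, and `handle_jsonrpc` only mimics the JSON-RPC 2.0 shape of an MCP `tools/call` request. A real MCP server also handles initialization, tool listing, and a transport, none of which is shown here.

```python
import json
from typing import List


def lookup_order(order_id: str) -> str:
    """Hypothetical business function; stands in for a real service call."""
    return f"order {order_id}: shipped"


def cli_main(argv: List[str]) -> str:
    """Internal path: a teammate or agent runs `tool.py <order_id>`."""
    return lookup_order(argv[1])


def handle_jsonrpc(raw: str) -> str:
    """External path: answer a request shaped like MCP's `tools/call`."""
    req = json.loads(raw)
    if req.get("method") != "tools/call" or req["params"]["name"] != "lookup_order":
        # -32601 is the JSON-RPC 2.0 "Method not found" code.
        err = {"code": -32601, "message": "Method not found"}
        return json.dumps({"jsonrpc": "2.0", "id": req.get("id"), "error": err})
    text = lookup_order(**req["params"]["arguments"])
    result = {"content": [{"type": "text", "text": text}]}
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})


request = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "lookup_order", "arguments": {"order_id": "A17"}},
})
print(cli_main(["tool.py", "A17"]))
print(handle_jsonrpc(request))
```

Both paths call the same function, which is roughly the concession in the thread: the wrapper can change without touching the service behind it. What the protocol-shaped wrapper buys is that third-party agent apps already know how to call it.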
00:09:06 Lenar: And mxstbr is honest enough to undercut himself in a footnote — he says the Codex app's computer and browser-use features have made even his own argument weaker, because the agent can just drive a browser instead of needing your API at all.
00:09:20 Damra: Which is the thread I'd pull on next. If browser-use gets reliable enough, the question stops being 'MCP or CLI' and becomes 'why expose an interface at all, when the agent can use the same website a human uses.' That's further out and less reliable today. But it's the version where the whole protocol debate gets reframed entirely. What I'm watching is whether companies keep standing up MCP servers next quarter, or start betting the agent will just click through their existing site.
00:09:50 Lenar: Let's get concrete and local. There's a write-up on the LocalLLaMA subreddit from someone running the handle FantasticNature7590, who spent a few weeks benchmarking multi-token prediction — MTP — on Gemma 4, the thirty-one-billion-parameter model, and Qwen 3.6 at twenty-seven billion. The headline result: on Gemma 4, inference jumped from about forty tokens per second to a hundred and thirty-two. Roughly three-and-a-third times faster, on a single RTX PRO 6000 card.
00:10:21 Damra: Quick translation of what multi-token prediction is doing, because it's clever. A small draft model guesses several tokens ahead, and the big model verifies them in one pass instead of generating one token at a time. On Gemma 4 the draft model is tiny — seventy-six million parameters. And here's the part that matters for trust: the target model still verifies every token before accepting it. So the output path is identical to normal decoding. In principle the quality shouldn't change at all.
00:10:53 Lenar: In principle. Did he check?
00:10:55 Damra: No — and to his credit, he says so flatly. He calls the quality and the video-memory numbers, quote, 'directional observations, not benchmarked facts.' He ran out of time for a proper quality eval. That's the right way to post a benchmark. The architecture says quality should hold because of the verify step, but he isn't claiming he proved it. The other honest caveat: the speedup depends on your stack. vLLM won on Gemma at a hundred thirty-two tokens per second, but llama.cpp was actually solid on Qwen, around a hundred seventeen. It's not one tool winning everything.
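The claim that the verify step keeps output identical can be checked on a toy version of the draft-and-verify loop described above. This is a minimal sketch of the greedy case only, with assumptions labeled: `toy_target` and `toy_draft` are hypothetical stand-in functions rather than real models, and the verifier here checks proposals one at a time, where a real engine such as vLLM or llama.cpp would score all drafted positions in a single batched forward pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def greedy_decode(target: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: the target model generates one token per call."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def draft_and_verify(
    target: NextToken, draft: NextToken, prompt: List[Token], n_new: int, k: int = 4
) -> List[Token]:
    """Draft proposes up to k tokens; target accepts the matching prefix."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. The cheap draft model guesses several tokens ahead.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. The target checks each guess in order.
        for tok in proposal:
            expected = target(seq)
            seq.append(expected)  # always keep the target's own token
            if expected != tok:
                break  # first mismatch: discard the rest of the draft
    return seq


def toy_target(seq: List[Token]) -> Token:
    """Hypothetical stand-in for the large model."""
    return (31 * sum(seq) + len(seq)) % 50


def toy_draft(seq: List[Token]) -> Token:
    """Hypothetical stand-in for the small model: right about two times in three."""
    guess = toy_target(seq)
    return guess if len(seq) % 3 else (guess + 1) % 50


prompt = [1, 2, 3]
fast = draft_and_verify(toy_target, toy_draft, prompt, n_new=20)
slow = greedy_decode(toy_target, prompt, n_new=20)
print(fast == slow)
```

Because every appended token is the target's own choice, a bad draft costs speed, never correctness, in this greedy setting. That is the architectural argument in the write-up; it does not replace the quality eval the poster says he did not run, since sampling, quantization, and implementation details sit outside this sketch.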
00:11:34 Lenar: And llama.cpp had its own small moment this week — Georgi Gerganov posted that it finally has an official website, at llama.app. His line was, quote, 'Our goal is to make local AI accessible to everyone,' with a single-line installer on the landing page.
00:11:52 Damra: Which sounds minor and isn't. llama.cpp has been, for years, a clone-the-repo-and-build-it project. That's a wall for anyone who isn't already comfortable at a compiler. A real website with a one-line install is the difference between a tool for people who already know it exists and a tool a curious person can actually start on a Saturday.
00:12:15 lenarAnd one more from the same corner — antirez, Salvatore Sanfilippo, put his DwarfStar distributed-inference project on GitHub. The pitch is that you can run a two-bit quantized Flash model across two sixty-four-gigabyte machines, or a four-bit version across two machines with a hundred twenty-eight gigabytes each, with pipelining to speed up the prefill.
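[Editor's sketch] The two configurations mentioned above follow from simple arithmetic: weight memory scales with bits per weight, so going from two-bit to four-bit roughly doubles what the machines must hold. The sketch below is a rough estimate only. The parameter count (280 billion) and the 80% usable-memory headroom are placeholder assumptions, not figures from the project, and the estimate ignores KV cache, activations, and quantization metadata.

```python
def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

def fits(params: float, bits_per_weight: float, machines: int,
         ram_gb_each: float, headroom: float = 0.8) -> bool:
    """True if the weights fit in the pooled memory, assuming an even split
    across machines and that only `headroom` of each machine's RAM is usable."""
    return weight_gb(params, bits_per_weight) <= machines * ram_gb_each * headroom

HYPOTHETICAL_PARAMS = 280e9  # placeholder, NOT the real model's parameter count

print(fits(HYPOTHETICAL_PARAMS, 2, machines=2, ram_gb_each=64))   # two-bit on 2 x 64 GB
print(fits(HYPOTHETICAL_PARAMS, 4, machines=2, ram_gb_each=64))   # four-bit on 2 x 64 GB
print(fits(HYPOTHETICAL_PARAMS, 4, machines=2, ram_gb_each=128))  # four-bit on 2 x 128 GB
```

Under these placeholder numbers the four-bit version overflows the smaller pair and fits the larger one, which matches the shape of the pitch without claiming anything about the real model's size.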
00:12:37 damraThat's the 'I don't own one giant machine, I own two medium ones' path, and it's a real unlock for home setups. Stitch a couple of consumer boxes together instead of buying a server. Put all of this next to a point one developer in that thread keeps making — once roughly one in three teams is running open weights, that's the point where the ecosystem starts building tooling for open models instead of treating them as second-class. MTP landing in vLLM and llama.cpp, a real installer, distributed inference — that's the tooling layer maturing in real time. The thing I'd watch is whether someone runs the MTP quality eval the poster skipped, because the whole speed story rests on the verify step actually holding.
00:13:22 lenarOver in Europe, there are notes from the Mistral AI Now Summit — Koen van Gilst wrote them up, and they hit Hacker News at around four hundred points. The detail that jumped out, and Simon Willison flagged the same one, is concrete: BNP Paribas runs Mistral's models on-premises for know-your-customer checks in Belgium, with the sensitive data staying inside the bank's walls. And Abanca is using agent orchestration on sensitive customer data across two million customers in their app.
00:13:51 damraAnd then the bear case, which is right there in the same thread and worth saying plainly. A commenter who says they're rooting for Mistral writes that the company has fallen really far behind since the third quarter of 2025. Their words — Mistral's 'small' model has roughly four times the parameter count, around a hundred and twenty billion, and isn't even competing with models a quarter its size. Gemma 4 and Qwen 3.6 own the small tier right now, and they're a fraction of the size.
00:14:20 lenarSo there are two readings sitting on top of each other. One: Mistral isn't winning capability benchmarks. Two: the on-premises, European, regulated-industry position is a genuine business that doesn't actually depend on topping a leaderboard.
00:14:35 damraBoth true, and here's the tension between them. If your hundred-twenty-billion-parameter model loses to a twenty-seven-billion open model on capability, your on-prem customers can run that smaller open model on-prem too. Data sovereignty isn't something only Mistral can offer — anyone can run Gemma or Qwen inside the bank's walls. So what Mistral is actually selling is the support contract, the orchestration layer, the compliance paperwork, and someone to call when the know-your-customer pipeline breaks at two in the morning. That's a real moat. It's just a different one than 'our model is best.'
00:15:12 lenarDifferent corner of the same week — xAI shipped grok-build-0.1 through their API in public beta. It's the model behind their Grok Build command-line tool, aimed at agentic coding, and the pricing is the headline: one dollar per million tokens of input, two dollars per million output.
00:15:31 damraThat price is aggressive — it undercuts most of the frontier coding options by a wide margin. But notice what we don't have. 'Excels at agentic coding' is a vendor sentence with no eval attached. No SWE-bench number, no task result, nothing checkable. The cost is the only verified fact in the announcement. And cheap-and-available does beat good-on-paper sometimes — that's how a lot of adoption actually happens. But before I tell anyone it's good at coding, I'd want to see it on a real task, not a price tag.
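[Editor's sketch] The quoted pricing (one dollar per million input tokens, two dollars per million output tokens) turns into a per-run cost with a few lines of arithmetic. The function name `run_cost_usd` and the token counts in the example are hypothetical, chosen only to show the calculation. Real bills may also depend on caching or other pricing terms not mentioned in the episode.

```python
from decimal import Decimal

INPUT_USD_PER_MTOK = Decimal("1")   # quoted price per million input tokens
OUTPUT_USD_PER_MTOK = Decimal("2")  # quoted price per million output tokens

def run_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Cost of one run in US dollars at the quoted per-million-token rates."""
    million = Decimal(1_000_000)
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / million

# Hypothetical agentic coding session: 400k tokens read, 50k tokens written.
print(run_cost_usd(400_000, 50_000))  # prints 0.5
```

Agentic coding runs tend to read far more than they write, so the input rate usually dominates the bill. The price says nothing about whether the run succeeds, which is the point made above.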
00:16:04 lenarSo put them side by side: a coding model that's cheap and unproven, against a French lab with proven customers and underwhelming models. Two different bets on what actually closes a deal. Last one, and it's about the part of AI that runs on actual electricity. Politico reports, via Techmeme, that the Federal Energy Regulatory Commission — FERC, the US energy regulator — is readying a proposal in June to speed up connecting data centers to regional power grids. And AI companies are engaging the regulator directly about it.
00:16:37 damraThis is the constraint nobody puts in a launch video, and it might matter more than any model release this week. The bottleneck on the buildout isn't always money or chips — it's the interconnection queue. You can finance a data center, pour the concrete, install the racks, and still wait years to actually connect to the grid. If FERC shortens that, it's a concrete lever on how fast capacity comes online. A regulator with a June deadline is a real thing.
00:17:06 lenarAnd on the far other end of the credibility spectrum, there's a thread making the rounds about TerraFab — an account named Lacey Presley posting that Tesla's hundred-and-nineteen-billion-dollar semiconductor foundry in Texas is targeting one terawatt of annual AI compute capacity. Roughly fifty times current global chip output.
00:17:26 damraFifty times global output is the number that should stop you cold. That's a slogan, not a forecast. And I can't find a primary source for it — this is a hype account, not a Tesla filing, not an earnings call, not a press release. So I'd file the terawatt claim under aspiration posted as fact until there's a document with Tesla's name on it. The contrast with the FERC item is the point. One is duller and far more real, because it's a regulator with a deadline. The other is a big round number on social media.
00:17:58 lenarAnd that's the note to end on, going into Sunday. The capacity story isn't bottlenecked on whether someone hits a terawatt. It's bottlenecked on interconnection queues, permitting, and the slow problem of getting power to the building. What I'll be reading is the actual text of that June FERC proposal when it lands — not the foundry number. The next piece of real signal is in that filing.