◆ Dispatch 017 · 2026-05-05 GSV Reverted The Default For The AI Attribution Feature
VS Code Walks It Back, CAISI Signs Three Labs, and the Frontier Gap Compresses to Ten Weeks
“On at least one agentic benchmark, DeepSeek V4 Pro matched GPT-5.2 ten weeks later at one-seventeenth the price.”
— Lenar Kess, today's narration
Microsoft reverses the Co-Authored-by Copilot default it shipped last week, and that turns out to be one of three pieces of governance news today — alongside CAISI signing pre-deployment testing agreements with Google DeepMind, Microsoft, and xAI, and DeepMind's London staff voting 98% to unionize over military contracts. Then we go where the actual code lives: DeepSeek V4 Pro matching GPT-5.2 ten weeks later at one-seventeenth the price, a Qwen3.6 27B FP8 recipe that fits 200K tokens of unquantized KV cache on a single 48GB card, and a paper called AgentFloor that gives the small-model-routing intuition a benchmark to point at. Plus the tool-use tax, Chrome's silent four-gigabyte install, vibevoice.cpp, the Opus 4.7 complaint thread, and a B2B operator who replaced three vendors with a single Claude skill.
- VS Code reverts the Copilot co-author default
- CAISI signs Google DeepMind, Microsoft, and xAI
- DeepMind workers vote to unionize
- DeepSeek V4 Pro on FoodTruck Bench
- Qwen3.6 27B FP8 with 200K BF16 KV on a single 48GB card
- AgentFloor: how far up the tool-use ladder small open-weight models can go
- The tool-use tax in LLM agents
- Chrome silently installs a 4GB Gemini Nano
- When everyone has AI and the company still learns nothing
- vibevoice.cpp ports Microsoft VibeVoice to ggml
- The Opus 4.7 regression thread
- Replacing a 5-step lead enrichment chain with a Claude skill
Chapters
- 00:00:04 VS Code walks it back
- 00:02:55 CAISI signs three labs
- 00:05:52 DeepMind workers vote to unionize
- 00:08:08 DeepSeek V4 Pro: ten weeks, seventeen times cheaper
- 00:11:10 Qwen3.6 27B FP8 on a single 48GB card
- 00:14:16 AgentFloor and the tool-use tax
- 00:17:16 Chrome's silent four gigabytes
- 00:19:50 When everyone has AI and the company still learns nothing
- 00:22:21 vibevoice.cpp and the Opus 4.7 thread
- 00:25:00 One operator, three vendors, one skill
- 00:27:48 Sign-off
Sources
12 cited
1
Update on Co-authored-by: Copilot in commit messages
Article Microsoft VS Code team — The VS Code engineering team responding to the issue Braid covered on May 3.
github.com/microsoft/vscode/issues/314311
- Cited text
We reverted the default for the AI attribution feature back to off.
- Context
- A direct payoff to last week's segment on Microsoft defaulting Copilot attribution on without consent. The reversal also pulls back from conflating any commit touched by Copilot with one Copilot wrote.
- Key points
- Microsoft reverted the Co-Authored-by: Copilot default back to off in version 1.119 (public rollout May 6)
- Attribution will be scoped to AI-generated changes only, not the whole commit
- Users will have to opt in before any trailer is added
- The team is reconsidering the trailer format, with 'assisted-by' floated as an alternative
- Model information may be added to future attribution
- Provenance
- Article · Supporting source
2
CAISI Signs Agreements Regarding Frontier AI National Security Testing With Google DeepMind, Microsoft and xAI
Article Sarah Henderson, NIST — Press release from NIST's Center for AI Standards and Innovation, the successor to the AI Safety Institute.
www.nist.gov/news-events/news/2026/05/caisi…
- Cited text
Independent, rigorous measurement science is essential to understanding frontier AI and its national security implications.
- Context
- Every major US frontier lab is now under a voluntary pre-deployment testing regime. The mechanism — labs handing over weakened-safeguards models to a measurement body — sets the operational shape that any future mandatory regime would inherit.
- Key points
- CAISI signed pre-deployment evaluation agreements with Google DeepMind, Microsoft, and xAI, building on prior agreements with Anthropic and OpenAI
- Over 40 evaluations completed to date, including unreleased state-of-the-art models
- Testing covers pre-deployment capability evals, classified-environment risk evaluation, and post-deployment monitoring
- Developers sometimes provide models with reduced or removed safeguards to enable thorough national security testing
- The TRAINS Taskforce, an interagency group, participates and feeds back national security concerns
- Provenance
- Article · Supporting source
3
Google DeepMind workers are unionizing over AI military contracts
Article Jess Weatherbed, The Verge — The Verge's tech reporter covering AI labor and policy.
www.theverge.com/tech/923918/google-deepmin…
- Context
- First serious union vote inside a frontier AI lab tied explicitly to refusing military deployment. Whether Google recognizes the bid is the test — and the outcome will shape whether researchers at other labs see the union route as available.
- Key points
- DeepMind staff at the London headquarters voted 98% to unionize
- They petitioned Google to recognize the Communication Workers Union and Unite the Union as joint reps
- Stated focus: blocking DeepMind technology from being used by Israel and the US military
- Aligns DeepMind labor action with the broader Project Nimbus protests at Google over Israeli government contracts
- Provenance
- Article · Supporting source
4
DeepSeek V4 Pro matches GPT-5.2 on FoodTruck Bench, ten weeks later, ~17x cheaper
Article Disastrous_Theme5906 — Maintainer of FoodTruck Bench, a 30-day agentic benchmark where models run a food truck through 34 tools with persistent memory.
www.reddit.com/r/LocalLLaMA/comments/1t47qb…
- Cited text
The China-US frontier gap on this benchmark used to feel like a year. Right now it's about ten weeks.
- Context
- The first Chinese model to land in the frontier tier on this benchmark, at one-seventeenth the price. The frontier-time-to-parity has compressed from a year to ten weeks for at least one capability profile, and the consistency numbers are a separate story from the peak numbers.
- Key points
- DeepSeek V4 Pro lands #4 on FoodTruck Bench behind Opus 4.6, GPT-5.2, and Grok 4.3 Latest, tied with Grok on outcome
- Pricing: $0.435/M input and $0.87/M output vs GPT-5.2 at $1.75/$14, roughly 17x cheaper for the same workload
- Cost-efficiency: ranked #2 overall, behind only Gemma 4 31B
- Against Grok at the same price: zero loans, ~6x less food waste, 30% more meals served, 2.4x tighter outcome distribution
- Xiaomi MiMo v2.5 Pro followed with $22,388 median net worth at $2.41/run, landing #6
- Provenance
- Article · Supporting source
5
Qwen3.6 27B FP8 runs with 200k tokens of BF16 KV cache at 80 TPS on a single RTX 5000 PRO 48GB
Article __JockY__ — A r/LocalLLaMA contributor benchmarking Qwen3.6 against quantization tradeoffs for agentic coding.
www.reddit.com/r/LocalLLaMA/comments/1t46kl…
- Cited text
A quantized model with quantized KV will inevitably compound errors faster than non-quantized ones, which noticeably impacts agentic coding.
- Context
- A concrete recipe for keeping a frontier-class agent loop entirely on a developer's desk, with KV-cache precision treated as a first-class variable. Pairs with the AgentFloor paper's case for routing routine calls to local models.
- Key points
- Qwen's official FP8 quant of Qwen3.6 27B running on vLLM 0.20.1 with CUDA 12.9
- BF16 KV cache with 200K tokens at 80 tokens per second on a single RTX 5000 PRO 48GB
- Argues that for agentic coding the KV cache precision matters more than the weights, because errors compound across long tool-use loops
- Pitches a sub-$10K workstation as a viable home for a serious agent loop
- Provenance
- Article · Supporting source
6
AgentFloor: How Far Up the Tool-Use Ladder Can Small Open-Weight Models Go?
Article Ranit Karmakar, Jayita Chatterjee
arxiv.org/abs/2605.00334
- Context
- Gives the routing intuition some empirical legs: most calls in a real agent pipeline are short and structured, and a small open-weight model can handle them. Frontier models earn their cost on long, constraint-heavy plans.
- Key points
- AgentFloor is a deterministic 30-task benchmark organized as a six-tier capability ladder, from instruction following up to long-horizon planning
- Evaluated 16 open-weight models from 0.27B to 32B alongside GPT-5, across 16,542 scored runs
- The strongest open-weight model matches GPT-5 in aggregate on this benchmark, but at substantially lower cost and latency
- Frontier models hold a real advantage on long-horizon planning with persistent constraints, where neither side reaches strong reliability
- The boundary is not explained by scale alone; some failures respond to model-specific interventions
- Provenance
- Article · Supporting source
7
Are Tools All We Need? Unveiling the Tool-Use Tax in LLM Agents
Article Kaituo Zhang et al.
arxiv.org/abs/2605.00136
- Context
- Pushes back on the assumption that handing an agent more tools makes it smarter. Sometimes the tool-calling protocol itself is what costs you accuracy, and the cost is measurable.
- Key points
- Demonstrates that tool-augmented reasoning does not always beat native chain-of-thought, especially under semantic distractors
- Decomposes the cost into prompt formatting, tool-calling protocol overhead, and the actual tool execution gain
- Names the degradation introduced by the tool-calling protocol itself the 'tool-use tax'
- Proposes G-STEP, an inference-time gate that recovers some of the loss but not all of it
- Conclusion is that intrinsic reasoning and tool-interaction capability still need to improve at the model level
- Provenance
- Article · Supporting source
8
Google Chrome silently installs a 4 GB AI model on your device without consent
Article Independent privacy researcher publishing a forensic audit of Chrome's on-device AI behavior.
www.thatprivacyguy.com/blog/chrome-silent-n…
- Cited text
No dialogue at first launch. No checkbox in Settings.
- Context
- A 4GB silent install on every Chrome user's machine is a meaningful change to the implicit contract between browser and user. The privacy concern is consent, but the operational concern is that 'on-device' has become a marketing word for a model the user never chose to host.
- Key points
- Chrome silently downloads a roughly 4GB Gemini Nano weights file when AI features are active by default
- The file lives at OptGuideOnDeviceModel/2025.8.8.1141/weights.bin and re-downloads if deleted unless features are disabled via chrome://flags or enterprise policy
- On an audit profile that received zero human input, the install still happened
- The visible 'AI Mode' pill in the address bar routes to Google servers, not the local model, creating false impressions about data locality
- Author estimates 6,000 to 60,000 tonnes of CO2-equivalent for delivery alone depending on rollout scale
- Provenance
- Article · Supporting source
9
When everyone has AI and the company still learns nothing
Article Robert Glaser — A consultant writing on enterprise software adoption and how organizations capture or fail to capture employee learning.
www.robert-glaser.de/when-everyone-has-ai-a…
- Cited text
Individual productivity gains from AI do not automatically become organizational gains.
- Context
- Most companies are measuring AI adoption with seat counts and token usage. Glaser's argument is that those numbers are the wrong question; the right one is whether the team got better at anything because of the spend.
- Key points
- The gap between individual productivity gains and organizational learning when AI is rolled out broadly
- Adoption typically becomes 'everywhere, uneven, partially hidden, difficult to compare, and not yet connected to organizational learning'
- Communities of practice and brown-bag sessions move too slowly when meaningful AI work happens inside code reviews and incidents
- Calls for 'Loop Intelligence' — visibility into which AI workflows actually produce learning
- The right metric is not token spend or seats, it is what changed in the team because of the spend
- Provenance
- Article · Supporting source
10
vibevoice.cpp: Microsoft VibeVoice ported to ggml/C++
Article mudler_it — Maintainer of LocalAI and the vibevoice.cpp port.
www.reddit.com/r/LocalLLaMA/comments/1t48fk…
- Context
- Another piece of the speech stack moves from 'Python plus vLLM plugin' to a single C++ binary that runs anywhere ggml runs. For builders shipping voice agents, that's the difference between a service and a library.
- Key points
- Pure C++ ggml port of Microsoft VibeVoice (TTS with voice cloning + long-form ASR with diarization)
- Backends: CPU, CUDA, Metal, Vulkan, hipBLAS — single binary or libvibevoice.so with a flat C ABI for embedding
- 0.41 real-time factor on a 68-second sample on CUDA Q4_K, ~6GB peak RSS
- 17 minutes of audio in one shot at 1.94 RTF on CPU Q8_0, ~26GB RSS
- Closed-loop TTS-to-ASR test asserts 100% source-word recall on a fixed seed
- Provenance
- Article · Supporting source
11
Opus 4.7 is beyond bad
Article AbsoluteRoster — A long-time Anthropic API user running Opus inside a custom architecture.
www.reddit.com/r/Anthropic/comments/1t3onwr…
- Context
- A user-side counterweight to Anthropic's release narrative. Whether Opus 4.7 actually regressed or this is a sampling artifact, the perception is loud enough that benchmarkers should look at it.
- Key points
- A growing list of regressions in Opus 4.7 versus 4.6 in agentic coding setups
- Speculation in the thread: a smaller base model tuned for harness benchmarks, not raw capability
- 225 upvotes and 81 comments, with multiple commenters reporting similar regressions
- Provenance
- Article · Supporting source
12
I replaced a 5-step lead enrichment workflow with Claude custom skills
Article lemnistatic — A B2B operator describing a real production replacement of a multi-vendor data pipeline.
www.reddit.com/r/ClaudeAI/comments/1t47h53/…
- Context
- A working production agent replacing a five-step SaaS chain with one MCP-shaped skill. The most interesting part is the comment thread: the operator gained simplicity but lost the implicit redundancy of having more than one vendor.
- Key points
- Replaced Apollo plus People Data Labs plus a separate verification tool plus a manual HubSpot import with a single Claude skill connected to three MCPs
- Crustdata MCP for people and company data, FullEnrich MCP for email waterfall and verification, HubSpot MCP for CRM push
- From over an hour of work and three vendors to about five minutes in one prompt
- Top reply names the operational tradeoff: vendor concentration risk replaces integration complexity
- Provenance
- Article · Supporting source
VS Code walks it back
00:00:04 On Friday on this show we covered Microsoft defaulting VS Code to add a Co-Authored-by Copilot trailer to commits, and we promised to track whether they revised the default or held the line. They revised it. The reply on GitHub issue 314311 is short and direct: we reverted the default for the AI attribution feature back to off.
00:00:27 Public rollout starts May 6 in version 1.119. So the default is going back to where most of the room thought it should have been to begin with. The more interesting part of the response is what they're changing about the feature itself. There are three pieces. First, attribution will only apply to AI-generated changes, not to every commit that touched a file Copilot was open in.
00:00:55 That fixes a real bug — non-Copilot completions were getting attributed, and the trailer was firing even when users had AI features turned off. Second, the trailer will require explicit consent before it's added. Opt-in, not opt-out. Third, they are reconsidering the format itself.
00:01:15 The phrase floated in the issue is assisted-by rather than co-authored-by, with the possibility of recording the model and version in the trailer too. I think the format question is the most consequential of the three. Co-Authored-by is a Git convention with a specific meaning — it ties the commit to a person who would, in some legal or attributional sense, share authorship.
00:01:42 Patrick McKenzie's piece last week argued that this was a category error: the model is a tool, not a person, and treating it as one would have downstream effects on copyright provenance and code review. Switching to assisted-by is the small grammatical move that says a tool was used here, here's which one, that's all.
00:02:06 It's the framing the most thoughtful commenters were asking for, and it's the framing Microsoft is now considering. So the resolution here is better than a pure rollback would have been. Microsoft did not just retreat to the previous behavior: they took the criticism, kept the underlying telemetry, changed the default, and opened the grammar to revision.
00:02:31 The next thing to watch is whether the model and version actually land in the trailer, because that's the line between attribution-as-marketing and attribution-as-record. If the trailer ends up reading assisted-by Claude Sonnet 4.6 instead of co-authored-by Copilot, that's something a code archaeologist in 2030 will be glad to find.
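For readers who want to see what the format question looks like mechanically, here is a minimal sketch of reading trailers from a commit message. It is a simplified parser, not Git's own (`git interpret-trailers` handles more cases, such as folded values). The `Assisted-by` line is hypothetical: the issue only floats that name, the model-and-version layout shown is an assumption, and the email address is a placeholder.

```python
import re

# Simplified trailer pattern: "Token: value" on a single line.
TRAILER_RE = re.compile(r"^([A-Za-z0-9-]+):\s*(.+)$")


def parse_trailers(message: str) -> list[tuple[str, str]]:
    """Return (key, value) pairs from the last paragraph of a commit message."""
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if len(paragraphs) < 2:
        # A message with only a subject line has no trailer block.
        return []
    trailers = []
    for line in paragraphs[-1].splitlines():
        match = TRAILER_RE.match(line.strip())
        if match is None:
            # Any non-trailer line means the paragraph is ordinary body text.
            return []
        trailers.append((match.group(1), match.group(2)))
    return trailers


# The convention discussed in the segment (placeholder email address).
CO_AUTHORED = (
    "Fix off-by-one in pager\n"
    "\n"
    "Clamp the page index before slicing.\n"
    "\n"
    "Co-authored-by: Copilot <copilot@example.invalid>\n"
)

# HYPOTHETICAL format: the trailer name is only floated in the issue,
# and the value layout here is an assumption for illustration.
ASSISTED = (
    "Fix off-by-one in pager\n"
    "\n"
    "Clamp the page index before slicing.\n"
    "\n"
    "Assisted-by: Claude Sonnet 4.6\n"
)

if __name__ == "__main__":
    print(parse_trailers(CO_AUTHORED))
    print(parse_trailers(ASSISTED))
```

With a distinct key, an audit script could separate an authorship claim (`Co-authored-by`) from a tool record (`Assisted-by`) without guessing from the value.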
CAISI signs three labs
00:02:55 From Microsoft revising one default to a federal body codifying a different kind of default. NIST's Center for AI Standards and Innovation — the body that succeeded the AI Safety Institute under the new administration — announced new pre-deployment testing agreements today with Google DeepMind, Microsoft, and xAI.
00:03:16 Those are on top of existing agreements with Anthropic and OpenAI. So at this point every major US frontier lab is under a voluntary testing regime with the federal measurement body. The press release is short on operational detail and long on category claims, but two pieces of it are worth pulling out.
00:03:37 First, CAISI says it has run more than 40 evaluations to date, including, in their words, unreleased state-of-the-art models. That's the part that gives the agreement teeth — the labs are giving CAISI access before public release, not after. Second, and this is the line that surprised me, the press release says, and I'm quoting, developers sometimes provide models with reduced or removed safeguards to enable thorough national security testing.
00:04:07 Read that twice. The labs are handing the federal evaluators de-aligned versions of frontier models — models with the safety training stripped or weakened, so the testers can see what the underlying capability looks like without the harness. That is a real concession.
00:04:26 From a measurement standpoint it's the right one, since you cannot evaluate a capability you have aligned away, but it does mean there is a moment, somewhere in a CAISI lab, when the version of GPT-5 or Gemini under test is not the version that exists in the world.
00:04:44 The Director of CAISI, Chris Fall, gave a quote that boils down to: independent measurement science is essential to understanding frontier AI's national security implications. That's the framing the agency wants. The framing I'd watch is the one that doesn't appear in the release.
00:05:04 There's no description of what happens when CAISI finds something. There's no enumeration of categories that would block release, no published threshold, no mention of mandatory disclosure. Right now this is a voluntary information-sharing arrangement among the federal government, the TRAINS Taskforce, and five labs.
00:05:25 Whether it stays that shape, or hardens into something with teeth, is a 2027 question, not a 2026 one. But the operational template — labs hand over weakened-safeguards models, federal testers run evaluations in classified environments, findings flow back as feedback for voluntary product improvements — is now established.
00:05:47 Whatever a future mandatory regime looks like, it will inherit this shape.
DeepMind workers vote to unionize
00:05:52 Same day, different angle on AI governance. The Verge is reporting that staff at Google DeepMind's London headquarters have voted 98% to unionize. They are petitioning Google to recognize the Communication Workers Union and Unite the Union as joint representatives.
00:06:11 The stated reason in the letter to management: to prevent DeepMind technology from being used by Israel and the US military. The 98% figure is striking. That is not a divided shop. The two unions named are also specific choices — the CWU is the British telecoms and tech workers union, and Unite is one of the largest general unions in the UK.
00:06:35 Both have organizing experience inside corporate environments where the employer is unlikely to recognize the bid voluntarily. This sits inside a longer arc. Google has had Project Nimbus protests for several years over its cloud contracts with the Israeli government.
00:06:53 The pattern there has been: open letters, walkouts, some firings, and Google holding the contract. What's new here is the union vote — staff trying to move from open letters to recognized representation. Whether Google plays it the same way is the open question.
00:07:11 Last time around, the company's response to similar pressure inside the broader org was to terminate organizers and continue the contract. The DeepMind population is smaller and more concentrated than the broader Google workforce, and the talent leverage is different — the people in this London office are not interchangeable.
00:07:34 I don't have a strong read on which way Google will go. What I'll say is that this is the first union vote inside a frontier AI lab tied explicitly to refusing a deployment category. Whether Google recognizes it shapes whether researchers at OpenAI, Anthropic, xAI, and Mistral see the union route as actually available.
00:07:57 If recognition lands, you'll see this play in another lab within the year. If Google declines and weathers it, you'll see open letters and resignations instead.
DeepSeek V4 Pro: ten weeks, seventeen times cheaper
00:08:08 From governance to the place where the actual code lives. There's a benchmark called FoodTruck Bench, run by a developer who posts under Disastrous_Theme5906 on the LocalLLaMA subreddit. The setup: models run a simulated food truck for thirty days, using 34 tools — locations, pricing, inventory, staff, weather, events — with persistent memory and daily reflection.
00:08:39 It's an agentic benchmark, not a single-shot eval. The kind of thing where you can see whether a model holds its plan over a long horizon. The new result: DeepSeek V4 Pro lands at number four overall on the leaderboard, behind Opus 4.6, GPT-5.2, and Grok 4.3 Latest.
00:09:01 It's tied with Grok on outcome and within 3% of GPT-5.2's median. The author's framing is the line I keep coming back to: the China-US frontier gap on this benchmark used to feel like a year; right now it's about ten weeks. They tested GPT-5.2 in mid-February. DeepSeek V4 Pro matches its numbers ten weeks later.
00:09:27 Then there's pricing. GPT-5.2 charges $1.75 per million input tokens and $14 per million output. DeepSeek V4 Pro charges 43.5 cents input and 87 cents output. Roughly 17 times cheaper for the same agentic workload. The author flags that the DeepSeek number is promo pricing today, but DeepSeek's track record is that the promo becomes the floor.
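The arithmetic behind the price comparison can be sketched from the list prices quoted above. The token counts are hypothetical, chosen only to show how the ratio moves with the input/output mix. On list prices alone the ratio runs from about 4x (all input) to about 16x (all output), so the roughly 17x figure presumably comes from the benchmark's measured per-run costs rather than a straight price ratio; that reading is an assumption.

```python
# USD per million tokens (input, output), as quoted in the segment.
GPT_5_2 = (1.75, 14.00)
DEEPSEEK_V4_PRO = (0.435, 0.87)


def run_cost(prices, input_tokens, output_tokens):
    """Cost in USD of one run at the given per-million-token prices."""
    in_price, out_price = prices
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def price_ratio(input_tokens, output_tokens):
    """How many times cheaper DeepSeek V4 Pro is for the same token counts."""
    return run_cost(GPT_5_2, input_tokens, output_tokens) / run_cost(
        DEEPSEEK_V4_PRO, input_tokens, output_tokens
    )


if __name__ == "__main__":
    # Hypothetical mixes, not measured FoodTruck Bench token counts.
    for label, tokens in [
        ("input only", (1_000_000, 0)),
        ("equal input and output", (1_000_000, 1_000_000)),
        ("output only", (0, 1_000_000)),
    ]:
        print(f"{label}: {price_ratio(*tokens):.1f}x")
```

The more output-heavy the agent loop, the closer the ratio gets to the output-price gap, which is where most of the difference between the two price lists sits.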
00:09:56 What makes this segment more interesting than a price war is the consistency story. Against Grok 4.3 specifically, the medians are basically tied at the same price. But: zero loans across DeepSeek's runs, roughly 6 times less food waste, 30% more meals served per day, and a 2.4-times tighter outcome distribution.
00:10:22 Grok matches DeepSeek's peak. DeepSeek matches its own peak every time. That is the production-engineering distinction. Peak performance is what wins benchmarks; variance is what determines whether you can ship the agent. The top reply, from Total_Activity_7550, makes the steelman case for Anthropic: Opus 4.6 is doing 1.7 times the profit of the next group, and that gap is wider than the one DeepSeek just closed on GPT.
00:10:57 Both things can be true. Opus is still in its own tier on this eval. The frontier-time-to-parity for a different tier just compressed about fivefold, from roughly a year to roughly ten weeks.
Qwen3.6 27B FP8 on a single 48GB card
00:11:10 Closer to the desk. A LocalLLaMA post from a user who goes by JockY put up a real recipe today for running Qwen3.6 27B locally without paying the quantization tax. The setup: Qwen's official FP8 quant of Qwen3.6 27B on vLLM 0.20.1 with CUDA 12.9, BF16 KV cache, 200,000 tokens of context, 80 tokens per second on a single RTX 5000 PRO 48GB card.
00:11:41 The argument is in the framing. JockY writes — and I'll quote — a quantized model with quantized KV will inevitably compound errors faster than non-quantized ones, which noticeably impacts agentic coding. That's the part most posts about squeezing 27B onto 24GB cards skip over.
00:12:05 They quantize the KV cache too, because they have to, and over a long agent loop the small precision losses compound. The argument here is that for agentic coding the precision of the cache matters more than the precision of the weights — and the way to keep the cache in BF16 is to spend the VRAM on a 48GB card and run FP8 weights with Blackwell hardware acceleration.
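The VRAM trade described here can be checked with back-of-the-envelope arithmetic. The sketch below uses the standard per-token KV-cache size formula for attention layers. The architecture numbers are placeholders chosen for illustration, not Qwen3.6's published configuration, and the budget ignores activations and runtime overhead, so treat it as an upper bound.

```python
def kv_bytes_per_token(attn_layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies across all attention layers."""
    # Factor 2: each layer stores one key vector and one value vector.
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_elem


def kv_budget_bytes_per_token(vram_gb, weights_gb, context_tokens):
    """Upper bound on per-token KV bytes once the weights are resident."""
    return (vram_gb - weights_gb) * 1e9 / context_tokens


# From the segment: 48 GB card, 200K-token context. 27B parameters at
# one byte each (FP8) is roughly 27 GB of weights.
budget = kv_budget_bytes_per_token(vram_gb=48, weights_gb=27, context_tokens=200_000)

# PLACEHOLDER architecture, not taken from the Qwen3.6 model card.
bf16 = kv_bytes_per_token(attn_layers=16, kv_heads=4, head_dim=256, bytes_per_elem=2)
fp8 = kv_bytes_per_token(attn_layers=16, kv_heads=4, head_dim=256, bytes_per_elem=1)

print(f"budget per token:  {budget:,.0f} bytes")
print(f"BF16 KV per token: {bf16:,} bytes, fits: {bf16 <= budget}")
print(f"FP8 KV per token:  {fp8:,} bytes, fits: {fp8 <= budget}")
```

Keeping the cache in BF16 doubles its footprint relative to an 8-bit cache for the same architecture, which is the cost the recipe pays for with the larger card.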
00:12:38 JockY's pitch is concrete: an RTX 5000 PRO card, 64GB system RAM, a decent CPU and motherboard, sub-$10K total, gives you a quiet workstation that can host a serious local agent. We've been talking around this configuration for weeks on the show: Qwen3.6 27B as the dense floor, multi-token prediction coming to llama.cpp, and the AgentFloor paper today on routing routine calls to small models.
00:13:13 This is the recipe that ties them together for someone who wants the loop on their desk. There's a separate LocalLLaMA post listing the models that will support multi-token prediction once it lands in llama.cpp: DeepSeek v3.2 and v4, Qwen3.5 and 3.6, GLM 4.5 and up, Step3.5 Flash, MiMo v2 and up.
00:13:39 The beta is per-architecture — Qwen's MTP implementation is not the same as DeepSeek's, so each model needs its own conversion. We're not at a one-flag-and-go state, but the path is visible. I said yesterday I'd watch whether the llama.cpp MTP beta survived contact with real workloads.
00:14:04 It hasn't been a week, so I don't have a definitive answer, but the conversion paths and the model list are starting to firm up.
AgentFloor and the tool-use tax
00:14:16 Two arXiv papers landed today that argue with each other usefully. First, AgentFloor, by Ranit Karmakar and Jayita Chatterjee. They built a deterministic 30-task benchmark organized as a six-tier capability ladder — instruction following at the bottom, then tool use, multi-step coordination, and long-horizon planning under persistent constraints at the top.
00:14:41 They evaluated 16 open-weight models from 0.27 billion parameters up to 32 billion, alongside GPT-5, across more than 16,000 scored runs. The headline finding is the routing argument we've been making on this show, with sharper numbers behind it. Their framing — and I'm quoting — small and mid-sized open-weight models are already sufficient for much of the short-horizon, structured tool-use work that dominates real agent pipelines.
00:15:11 The strongest open-weight model in their evaluation matches GPT-5 in aggregate on this benchmark, while being substantially cheaper and faster. The gap opens up on long-horizon planning with persistent constraints, where the frontier models hold a real advantage and neither side reaches strong reliability.
00:15:33 They also flag that the boundary is not explained by scale alone. Some failures respond to targeted interventions, but the effects are model-specific. There's no single dial that lifts a small model's tool-use ceiling — you have to look at the breakages one model at a time.
00:15:52 Now, the second paper. Are Tools All We Need?, by Kaituo Zhang and collaborators, names something they call the tool-use tax. Their finding pushes back on the assumption that handing an agent more tools makes it smarter. They demonstrate that under semantic distractors, tool-augmented reasoning does not necessarily outperform native chain-of-thought.
00:16:17 They decompose the cost into three pieces: prompt formatting overhead, the tool-calling protocol overhead, and the actual gain from executing the tool. The tax is the second one — degradation introduced by the tool-calling protocol itself. Their gate, called G-STEP, recovers some of the loss, but not all of it.
00:16:39 Their conclusion is that more substantial improvements still require strengthening the model's intrinsic reasoning and tool-interaction capabilities. Read together, the two papers say something practical: route routine, structured calls to a small open-weight model where the tool tax is a fixed and small cost; reserve frontier models for the long-horizon plans where their advantage justifies both their price and the protocol overhead they introduce.
00:17:10 That's the routing principle for someone designing an agent stack today.
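One way to read that principle as code is a small router. This is an illustrative sketch, not a method from either paper: the `AgentCall` fields, the step threshold, and the two backend labels are all hypothetical.

```python
from dataclasses import dataclass

SMALL_LOCAL = "small-open-weight"  # hypothetical backend label
FRONTIER = "frontier"              # hypothetical backend label


@dataclass(frozen=True)
class AgentCall:
    """One unit of agent work, described by the traits the router cares about."""

    description: str
    expected_steps: int           # rough planning horizon of the call
    persistent_constraints: bool  # must constraints hold across many steps?


def route(call: AgentCall, max_local_steps: int = 3) -> str:
    """Send short, structured calls to the small model; escalate the rest."""
    if call.persistent_constraints or call.expected_steps > max_local_steps:
        return FRONTIER
    return SMALL_LOCAL


if __name__ == "__main__":
    calls = [
        AgentCall("look up an order status", 1, False),
        AgentCall("plan a 30-day inventory schedule under a fixed budget", 30, True),
    ]
    for call in calls:
        print(f"{call.description} -> {route(call)}")
```

The threshold is a tuning knob, not a finding. In practice it would be set from measured failure rates on your own workload, one model at a time.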
Chrome's silent four gigabytes
00:17:16 From routing principles to a different kind of routing — the kind a browser does to your disk without asking. A privacy researcher publishing as ThatPrivacyGuy put up a forensic audit today of Google Chrome silently installing what they say is a roughly four-gigabyte Gemini Nano weights file on the user's machine.
00:17:38 The file lives at OptGuideOnDeviceModel slash 2025.8.8.1141 slash weights.bin, and it re-downloads if you delete it, unless you go into chrome://flags or apply enterprise policy to disable AI features. The forensic detail that lands is this: they ran the test on an audit profile that received zero keyboard or mouse input from a human.
00:18:02 The install still happened. The author's quote is short — no dialogue at first launch, no checkbox in Settings. There's a separate critique in the post about the visible AI Mode pill in the address bar. That pill, when clicked, routes queries to Google's servers, not to the local model.
00:18:23 So the on-device model ships silently, and the user-visible AI feature is not the on-device model. Two different things, neither one explained at install time. The piece also estimates between 6,000 and 60,000 tonnes of CO2-equivalent for delivery alone, depending on rollout scale.
00:18:43 I am repeating that estimate as published, because I haven't checked the methodology, and the wider point doesn't depend on the precise number. The wider point is the consent question. A four-gigabyte silent install on every Chrome user's machine is a meaningful change to the implicit contract between browser and user.
00:19:05 The privacy concern that gets the headlines is what the model can do — Gemini Nano can do summarization, classification, basic generation. The operational concern, for me, is simpler: on-device has become a marketing word for a model the user did not choose to host.
00:19:25 The disk is the user's, the weights are Google's, the scheduling is Google's. Disclosure here is the floor: a first-launch dialog that says, here is a 4GB model, here is what it does, here is the toggle. That would not be an unreasonable ask of any other piece of software the size of Photoshop.
00:19:45 The fact that the answer was zero is the part that should bother you.
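For anyone who wants to check their own machine, here is a small Python sketch that looks for the weights file described in the audit. It assumes only the directory and file names quoted above (`OptGuideOnDeviceModel` and `weights.bin`); the function name is hypothetical, and the location of the Chrome user-data directory is not given in the episode, so you pass in the root to search.

```python
import os


def find_model_weights(root: str, dir_name: str = "OptGuideOnDeviceModel") -> list[tuple[str, int]]:
    """Return (path, size_in_bytes) for every weights.bin found under a dir_name folder."""
    hits = []
    for current, _dirs, files in os.walk(root):
        # Only consider paths that contain the directory named in the audit.
        if dir_name not in current.split(os.sep):
            continue
        for name in files:
            if name == "weights.bin":
                path = os.path.join(current, name)
                hits.append((path, os.path.getsize(path)))
    return hits
```

Calling `find_model_weights(path_to_your_chrome_profile_root)` and dividing each size by `1e9` gives a figure in gigabytes to compare with the roughly four gigabytes the audit reports. An empty list means nothing matching those names was found under that root, not that the model is absent from the machine.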
When everyone has AI and the company still learns nothing
00:19:50 From consent at the browser layer to consent at the org layer. There's an essay from Robert Glaser today titled When everyone has AI and the company still learns nothing. It is in the genre of organizational adoption writing, but the angle is sharp enough to bring on the show.
00:20:09 Glaser's thesis, in his words: individual productivity gains from AI do not automatically become organizational gains. The piece is about the gap between everyone-uses-Copilot and the-team-got-better-at-anything. He calls it the messy middle. AI usage in a typical company today is, in his phrase, everywhere, uneven, partially hidden, difficult to compare, and not yet connected to organizational learning.
00:20:38 Different teams use it differently, the discoveries are siloed, and the formal adoption structures — communities of practice, brown-bag sessions, champion networks — move too slowly. The actually useful AI work happens inside code reviews and production incidents, before any of that machinery has a chance to fire.
00:21:00 The line that stuck with me was the question he ends one section with: what changed because we spent those tokens? That is the right question and almost no organization has a clean answer to it. Most companies measure AI adoption with seat counts and token usage.
00:21:18 Both of those are upstream of any answer that matters. They tell you the spend; they don't tell you whether the team is faster at writing tests, better at root-causing incidents, or more confident promoting junior engineers because the review burden dropped. The advice in the piece is to build something Glaser calls Loop Intelligence — visibility into which workflows are actually producing learning.
00:21:46 I'm not sure the term will stick, but the pointer is right. The work is to instrument the AI-assisted parts of your team's day in a way that makes the win visible — and to do it without turning the instrumentation into surveillance. That last part is hard, and I don't think Glaser's piece quite solves it.
00:22:07 But the diagnosis is correct: a company can have a 100% Copilot license rate and zero institutional knowledge gain, and right now the average company is closer to that than to the inverse.
vibevoice.cpp and the Opus 4.7 thread
00:22:21 Two faster items before we close. First: a developer who goes by mudler_it on the LocalLLaMA subreddit shipped vibevoice.cpp, a pure C++ ggml port of Microsoft's VibeVoice. VibeVoice is the speech model Microsoft published a few weeks ago: TTS with voice cloning, plus long-form ASR with speaker diarization.
00:22:48 The port runs on CPU, CUDA, Metal, Vulkan, and hipBLAS, anywhere ggml runs. It ships as a single binary or as a libvibevoice.so shared library with a flat C ABI. The numbers: a 0.41 real-time factor on a 68-second sample on CUDA at Q4_K, with a peak of 6 gigabytes of RAM; and 17 minutes of audio in one shot on CPU at Q8_0, in about 32 minutes of wall time, with a peak of 26 gigabytes.
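To put those two figures on the same scale, here is the arithmetic as a short Python sketch. It assumes the common convention that real-time factor is processing time divided by audio duration, so values below 1 are faster than real time; the episode does not say which convention the port's author used.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF under the assumed convention: processing time divided by audio duration."""
    return processing_seconds / audio_seconds


# CUDA, Q4_K: a quoted RTF of 0.41 on a 68-second sample implies this much compute time.
cuda_processing_seconds = 0.41 * 68

# CPU, Q8_0: about 32 minutes of wall time for 17 minutes of audio.
cpu_rtf = real_time_factor(32 * 60, 17 * 60)

print(f"CUDA processing time: {cuda_processing_seconds:.1f} s")
print(f"CPU real-time factor: {cpu_rtf:.2f}")
```

Under that convention the CUDA run finishes in roughly 28 seconds, while the CPU run has an RTF of about 1.88, meaning slower than real time.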
00:23:18 The closed-loop test asserts 100% source-word recall on a fixed seed. This is part of a pattern we've watched all spring — speech models that shipped as Python plus Transformers plus a vLLM plugin migrating to a single C++ binary you can embed. For builders shipping voice agents, that's the difference between standing up a service and linking a library.
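One plausible reading of the closed-loop test mentioned above is: synthesize a known text, transcribe the audio back, and check that every source word survives. The Python sketch below computes word recall under that assumption. The tokenization and the function name are illustrative choices, not the project's actual test code.

```python
import re
from collections import Counter


def _tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_recall(source: str, transcript: str) -> float:
    """Fraction of source words, counted with multiplicity, found in the transcript."""
    source_counts = Counter(_tokenize(source))
    transcript_counts = Counter(_tokenize(transcript))
    total = sum(source_counts.values())
    if total == 0:
        return 1.0  # nothing to recall
    matched = sum(min(n, transcript_counts[w]) for w, n in source_counts.items())
    return matched / total
```

A 100% result means every word of the source text reappears in the transcript at least as often as it was written. It says nothing about word order or about extra words the transcript may contain.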
00:23:49 The second item is shorter. There's a thread on the Anthropic subreddit, 225 upvotes, 81 comments, titled bluntly Opus 4.7 is beyond bad. The author is running Opus inside a custom architecture and reports a growing list of regressions versus 4.6. The speculation in the comments is that 4.7 is a smaller base model tuned for harness benchmarks.
00:24:19 I have no inside view on whether that's right. What I'll say is that the perception is loud enough that an independent benchmarker should look. Anthropic just shipped 4.7. If the harness-tuned read is correct, that has implications for everyone who pinned Opus 4.6 in production and is now being migrated.
00:24:45 If it's wrong, the public benchmarks should show it. What would settle it is whether the FoodTruck Bench numbers for Opus 4.7 land where 4.6's were, or somewhere else.
One operator, three vendors, one skill
00:25:00 One last item. A B2B operator posting as lemnistatic on the ClaudeAI subreddit wrote up replacing a five-step lead enrichment workflow with a single Claude skill connected to three MCPs. The old chain: build a list in Apollo, enrich through People Data Labs, fall through to a second provider for the gaps, run emails through a separate verification tool because enrichment data bounces 15 to 20% of the time, then manually load to HubSpot because none of the tools talk to each other.
00:25:35 Five steps, three vendors, over an hour, mediocre output. The new setup: one Claude prompt, one skill, three MCPs — Crustdata for people and company data, FullEnrich for email waterfall and verification, HubSpot for the CRM push. About five minutes, end to end.
00:25:54 The example prompt in the post is worth quoting: find B2B SaaS companies in the US with 50 to 200 employees that raised Series A or B in the last 9 months and are hiring for sales roles, find the VP Sales or Head of Growth at each, get verified emails, pull recent social posts, score against ICP, push to HubSpot.
00:26:17 That is a sentence that used to map to a quarterly project. It now maps to a single skill invocation. The comment thread is where the operational tradeoff gets named cleanly. A user with the handle LogInitial501 points out that the old chain had implicit redundancy — if PDL was down, you still had Apollo.
00:26:39 With Crustdata as a single source you've traded integration complexity for vendor concentration risk. That's the real architectural change. The operator gained simplicity and lost the implicit redundancy of having more than one provider. Not a deal-breaker, but it is the next thing the same operator will have to engineer around — when Crustdata's API changes, or when their rate limits shift, the whole skill stalls in a way the old chain wouldn't have.
00:27:12 This is the shape of MCP-mediated agent work right now. The substitution looks 1-to-1 when it works, but the breakages are different and the dependency graph is shorter. For a small operator who values their time more than they value uptime guarantees, that's a winning trade.
00:27:32 For a company whose revenue depends on the same workflow, the right move is two providers behind one skill, not one. The same skill, written twice. That's where I'd be putting the engineering hour the AI just gave you back.
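A minimal Python sketch of "two providers behind one skill" might look like the following. The provider functions are hypothetical stand-ins, not real vendor clients, and the record shape is invented for illustration.

```python
from typing import Callable, Optional, Sequence

# A provider takes a company domain and returns a record, or None when it has no data.
Provider = Callable[[str], Optional[dict]]


def enrich_with_fallback(domain: str, providers: Sequence[Provider]) -> Optional[dict]:
    """Try each provider in order and return the first non-empty record."""
    for provider in providers:
        try:
            record = provider(domain)
        except Exception:
            # Treat outages, rate limits, and API changes as "try the next one".
            continue
        if record:
            return record
    return None


# Hypothetical stand-ins for two real data providers.
def primary(domain: str) -> Optional[dict]:
    raise TimeoutError("primary provider unavailable")


def secondary(domain: str) -> Optional[dict]:
    return {"domain": domain, "source": "secondary"}


if __name__ == "__main__":
    print(enrich_with_fallback("example.com", [primary, secondary]))
```

The catch-all `except` keeps the sketch short; production code would more likely log each failure and distinguish a rate limit from a schema change, since those call for different responses.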
Sign-off
00:27:48 Three things I'm watching tomorrow. Whether Google formally responds to the DeepMind union petition, because the speed of that response is itself a signal. Whether Anthropic acknowledges the Opus 4.7 thread or lets it run, because silence on a regression complaint with that volume is itself a position.
00:28:01 And whether anyone publishes a second eval of the AgentFloor routing claim — particularly the long-horizon planning gap — on a different benchmark, because one paper is a hypothesis and two is a pattern. That's the day. Talk tomorrow. Lenar Kess.