Archive BRAID
VS Code Walks It Back, CAISI Signs Three Labs, and the Frontier Gap Compresses to Ten Weeks / DISPATCH 017
Dispatch 017 · 2026-05-05 GSV Reverted The Default For The AI Attribution Feature

/ 00:28:24 / 12 sources

“On at least one agentic benchmark, DeepSeek V4 Pro matched GPT-5.2 ten weeks later at one-seventeenth the price.”

— Lenar Kess, today's narration

Microsoft reverses the Co-Authored-by Copilot default it shipped last week, and that turns out to be one of three pieces of governance news today — alongside CAISI signing pre-deployment testing agreements with Google DeepMind, Microsoft, and xAI, and DeepMind's London staff voting 98% to unionize over military contracts. Then we go where the actual code lives: DeepSeek V4 Pro matching GPT-5.2 ten weeks later at one-seventeenth the price, a Qwen3.6 27B FP8 recipe that fits 200K tokens of unquantized KV cache on a single 48GB card, and a paper called AgentFloor that gives the small-model-routing intuition a benchmark to point at. Plus the tool-use tax, Chrome's silent four-gigabyte install, vibevoice.cpp, the Opus 4.7 complaint thread, and a B2B operator who replaced three vendors with a single Claude skill.

Chapters

  1. 00:00:04 VS Code walks it back
  2. 00:02:55 CAISI signs three labs
  3. 00:05:52 DeepMind workers vote to unionize
  4. 00:08:08 DeepSeek V4 Pro: ten weeks, seventeen times cheaper
  5. 00:11:10 Qwen3.6 27B FP8 on a single 48GB card
  6. 00:14:16 AgentFloor and the tool-use tax
  7. 00:17:16 Chrome's silent four gigabytes
  8. 00:19:50 When everyone has AI and the company still learns nothing
  9. 00:22:21 vibevoice.cpp and the Opus 4.7 thread
  10. 00:25:00 One operator, three vendors, one skill
  11. 00:27:48 Sign-off

Sources

12 cited
  1. Update on Co-authored-by: Copilot in commit messages

    Article Microsoft VS Code team — The VS Code engineering team responding to the issue Braid covered on May 3.

    We reverted the default for the AI attribution feature back to off.

    github.com/microsoft/vscode/issues/314311 →
    Context
    A direct payoff to last week's segment on Microsoft defaulting Copilot attribution on without consent. The reversal also pulls back from conflating any commit touched by Copilot with one Copilot wrote.
    Key points
    • Microsoft reverted the Co-authored-by: Copilot default back to off in version 1.119 (public rollout May 6)
    • Attribution will be scoped to AI-generated changes only, not the whole commit
    • Users will have to opt in before any trailer is added
    • The team is reconsidering the trailer format, with 'assisted-by' floated as an alternative
    • Model information may be added to future attribution
    Provenance
    Article · Supporting source
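
The opt-in and scoping rules in the key points above can be sketched as a small gate in front of the commit message. This is a minimal illustration, not VS Code's implementation: the function name `add_attribution` and the flags `opted_in` and `ai_generated` are hypothetical, and the trailer text is copied from the issue title.

```python
import re

TRAILER = "Co-authored-by: Copilot"               # trailer text as written in the issue title
TRAILER_LINE = re.compile(r"^[A-Za-z0-9-]+: \S")  # rough "Key: value" shape of a Git trailer


def add_attribution(message: str, opted_in: bool, ai_generated: bool) -> str:
    """Append the trailer only if the user opted in AND the commit has AI-generated changes.

    Hypothetical gating logic for illustration only.
    """
    if not (opted_in and ai_generated):
        return message
    body = message.rstrip("\n")
    if TRAILER in body.split("\n"):
        return message  # already attributed, do not add it twice
    # Trailers live in the last paragraph of a commit message. Join an existing
    # trailer block if there is one, otherwise start a new paragraph.
    last_paragraph = body.split("\n\n")[-1].split("\n")
    has_trailer_block = "\n\n" in body and all(
        TRAILER_LINE.match(line) for line in last_paragraph
    )
    separator = "\n" if has_trailer_block else "\n\n"
    return body + separator + TRAILER + "\n"
```

With both flags set, `add_attribution("Fix parser\n", True, True)` yields the subject line, a blank line, and the trailer. With either flag off, the message comes back unchanged, which is the behavior the reverted default describes.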
  2. CAISI Signs Agreements Regarding Frontier AI National Security Testing With Google DeepMind, Microsoft and xAI

    Article Sarah Henderson, NIST — Press release from NIST's Center for AI Standards and Innovation, the successor to the AI Safety Institute.

    Independent, rigorous measurement science is essential to understanding frontier AI and its national security implications.

    www.nist.gov/news-events/news/2026/05/caisi… →
    Context
    Every major US frontier lab is now under a voluntary pre-deployment testing regime. The mechanism — labs handing over weakened-safeguards models to a measurement body — sets the operational shape that any future mandatory regime would inherit.
    Key points
    • CAISI signed pre-deployment evaluation agreements with Google DeepMind, Microsoft, and xAI, building on prior agreements with Anthropic and OpenAI
    • Over 40 evaluations completed to date, including unreleased state-of-the-art models
    • Testing covers pre-deployment capability evals, classified-environment risk evaluation, and post-deployment monitoring
    • Developers sometimes provide models with reduced or removed safeguards to enable thorough national security testing
    • The TRAINS Taskforce, an interagency group, participates and feeds back national security concerns
    Provenance
    Article · Supporting source
  3. Google DeepMind workers are unionizing over AI military contracts

    Article Jess Weatherbed, The Verge — The Verge's tech reporter covering AI labor and policy.

    www.theverge.com/tech/923918/google-deepmin… →
    Context
    First serious union vote inside a frontier AI lab tied explicitly to refusing military deployment. Whether Google recognizes the bid is the test — and the outcome will shape whether researchers at other labs see the union route as available.
    Key points
    • DeepMind staff at the London headquarters voted 98% to unionize
    • They petitioned Google to recognize the Communication Workers Union and Unite the Union as joint reps
    • Stated focus: blocking DeepMind technology from being used by Israel and the US military
    • Aligns DeepMind labor action with the broader Project Nimbus protests at Google over Israeli government contracts
    Provenance
    Article · Supporting source
  4. DeepSeek V4 Pro matches GPT-5.2 on FoodTruck Bench, ten weeks later, ~17x cheaper

    Article Disastrous_Theme5906 — Maintainer of FoodTruck Bench, a 30-day agentic benchmark where models run a food truck through 34 tools with persistent memory.

    The China-US frontier gap on this benchmark used to feel like a year. Right now it's about ten weeks.

    www.reddit.com/r/LocalLLaMA/comments/1t47qb… →
    Context
    The first Chinese model to land in the frontier tier on this benchmark, at one-seventeenth the price. Time to parity with the frontier has compressed from a year to ten weeks for at least one capability profile, and the consistency numbers tell a separate story from the peak numbers.
    Key points
    • DeepSeek V4 Pro lands #4 on FoodTruck Bench behind Opus 4.6, GPT-5.2, and Grok 4.3 Latest, tied with Grok on outcome
    • Pricing: $0.435/M input and $0.87/M output vs GPT-5.2 at $1.75/$14, roughly 17x cheaper for the same workload
    • Cost-efficiency: ranked #2 overall, behind only Gemma 4 31B
    • Against Grok at the same price: zero loans, ~6x less food waste, 30% more meals served, 2.4x tighter outcome distribution
    • Xiaomi MiMo v2.5 Pro followed with $22,388 median net worth at $2.41/run, landing #6
    Provenance
    Article · Supporting source
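
The pricing bullet above can be checked with a few lines of arithmetic. The sketch below uses the list prices quoted in the key points; the token mixes are hypothetical, and it assumes both models consume the same number of tokens on a given run, which the source does not state.

```python
# List prices in USD per million tokens, as quoted in the key points.
DEEPSEEK_V4_PRO = {"input": 0.435, "output": 0.87}
GPT_5_2 = {"input": 1.75, "output": 14.0}


def run_cost(prices: dict, input_mtok: float, output_mtok: float) -> float:
    """Cost in USD of a run that uses the given millions of input and output tokens."""
    return prices["input"] * input_mtok + prices["output"] * output_mtok


def price_ratio(input_mtok: float, output_mtok: float) -> float:
    """How many times more GPT-5.2 costs than DeepSeek V4 Pro for one token mix.

    Assumes identical token counts for both models (an assumption of this sketch).
    """
    return run_cost(GPT_5_2, input_mtok, output_mtok) / run_cost(
        DEEPSEEK_V4_PRO, input_mtok, output_mtok
    )


# Hypothetical token mixes, in millions of (input, output) tokens.
for mix in [(1.0, 0.0), (10.0, 1.0), (1.0, 1.0), (0.0, 1.0)]:
    print(mix, round(price_ratio(*mix), 2))
```

On list prices alone the ratio is about 4x for input tokens and about 16x for output tokens, so any blended mix falls between those two. The roughly 17x figure in the source sits just above that range, which suggests it comes from measured per-run cost on the benchmark and not from list prices alone. That reading is an assumption.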
  5. Qwen3.6 27B FP8 runs with 200k tokens of BF16 KV cache at 80 TPS on a single RTX 5000 PRO 48GB

    Article __JockY__ — A r/LocalLLaMA contributor benchmarking Qwen3.6 against quantization tradeoffs for agentic coding.

    A quantized model with quantized KV will inevitably compound errors faster than non-quantized ones, which noticeably impacts agentic coding.

    www.reddit.com/r/LocalLLaMA/comments/1t46kl… →
    Context
    A concrete recipe for keeping a frontier-class agent loop entirely on a developer's desk, with KV-cache precision treated as a first-class variable. Pairs with the AgentFloor paper's case for routing routine calls to local models.
    Key points
    • Qwen's official FP8 quant of Qwen3.6 27B running on vLLM 0.20.1 with CUDA 12.9
    • BF16 KV cache with 200K tokens at 80 tokens per second on a single RTX 5000 PRO 48GB
    • Argues that for agentic coding, KV cache precision matters more than weight precision, because errors compound across long tool-use loops
    • Pitches a sub-$10K workstation as a viable home for a serious agent loop
    Provenance
    Article · Supporting source
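
The memory claim above comes down to a per-token budget, which the following sketch works through. Only the 48GB card, the 27B parameter count, FP8 weights, BF16 cache, and 200K tokens come from the source. The 3 GB overhead and both architecture configurations are made-up placeholders, not Qwen3.6's real configuration, and all sizes use decimal gigabytes.

```python
def kv_bytes_per_token(attn_layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per KV head,
    for every layer that keeps a KV cache. bytes_per_elem is 2 for BF16, 1 for FP8."""
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_elem


def per_token_budget(vram_gb: float, weights_gb: float, overhead_gb: float,
                     tokens: int) -> float:
    """Largest per-token KV footprint (bytes) that still fits on the card."""
    return (vram_gb - weights_gb - overhead_gb) * 1e9 / tokens


# 27B parameters at roughly one byte each under FP8 is about 27 GB of weights.
# The 3 GB for activations and runtime overhead is a hypothetical figure.
budget = per_token_budget(vram_gb=48, weights_gb=27, overhead_gb=3, tokens=200_000)

# Placeholder architectures, NOT the real model config.
dense_like = kv_bytes_per_token(attn_layers=64, kv_heads=8, head_dim=128)
hybrid_like = kv_bytes_per_token(attn_layers=16, kv_heads=4, head_dim=256)

for name, per_token in [("dense-like", dense_like), ("hybrid-like", hybrid_like)]:
    fits = per_token <= budget
    print(f"{name}: {per_token} B/token, budget {budget:.0f} B/token, fits={fits}")
```

Under these assumptions the budget is 90,000 bytes per token. The dense-like placeholder needs 262,144 bytes per token at BF16 and does not fit, while the hybrid-like placeholder needs 65,536 and does. Passing `bytes_per_elem=1` halves either figure, which is the quantized-cache trade the post argues against for agentic coding.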
  6. AgentFloor: How Far Up the Tool-Use Ladder Can Small Open-Weight Models Go?

    Article Ranit Karmakar, Jayita Chatterjee

    arxiv.org/abs/2605.00334 →
    Context
    Gives the routing intuition some empirical legs: most calls in a real agent pipeline are short and structured, and a small open-weight model can handle them. Frontier models earn their cost on long, constraint-heavy plans.
    Key points
    • AgentFloor is a deterministic 30-task benchmark organized as a six-tier capability ladder, from instruction following up to long-horizon planning
    • Evaluated 16 open-weight models from 0.27B to 32B alongside GPT-5, across 16,542 scored runs
    • The strongest open-weight model matches GPT-5 in aggregate on this benchmark, but at substantially lower cost and latency
    • Frontier models hold a real advantage on long-horizon planning with persistent constraints, where neither side reaches strong reliability
    • The boundary is not explained by scale alone; some failures respond to model-specific interventions
    Provenance
    Article · Supporting source
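
The routing intuition this paper supports can be sketched as a tier-based dispatcher. The six-tier ladder is from the paper's description above. The model identifiers, the `AgentCall` type, and the `escalate_from` threshold are hypothetical names for illustration, and the paper does not prescribe this policy.

```python
from dataclasses import dataclass

SMALL_MODEL = "local-small"      # hypothetical identifier for a small open-weight model
FRONTIER_MODEL = "frontier-api"  # hypothetical identifier for a frontier model
TOP_TIER = 6                     # six-tier ladder; the top tier is long-horizon planning


@dataclass
class AgentCall:
    tier: int  # 1 (instruction following) through 6 (long-horizon planning)


def route(call: AgentCall, escalate_from: int = TOP_TIER) -> str:
    """Send a call to the small model unless its tier reaches the escalation threshold."""
    if not 1 <= call.tier <= TOP_TIER:
        raise ValueError(f"tier must be between 1 and {TOP_TIER}, got {call.tier}")
    return FRONTIER_MODEL if call.tier >= escalate_from else SMALL_MODEL


def local_share(calls: list, escalate_from: int = TOP_TIER) -> float:
    """Fraction of calls that stay on the small model under this policy."""
    if not calls:
        return 0.0
    local = sum(route(c, escalate_from) == SMALL_MODEL for c in calls)
    return local / len(calls)
```

A pipeline that is mostly short, structured calls keeps a high `local_share`. Lowering `escalate_from` trades cost for reliability on the harder tiers, where the paper reports that neither side is strongly reliable.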
  7. Are Tools All We Need? Unveiling the Tool-Use Tax in LLM Agents

    Article Kaituo Zhang et al.

    arxiv.org/abs/2605.00136 →
    Context
    Pushes back on the assumption that handing an agent more tools makes it smarter. Sometimes the tool-calling protocol itself is what costs you accuracy, and the cost is measurable.
    Key points
    • Demonstrates that tool-augmented reasoning does not always beat native chain-of-thought, especially under semantic distractors
    • Decomposes the cost into prompt formatting, tool-calling protocol overhead, and the actual tool execution gain
    • Names the degradation introduced by the tool-calling protocol itself the 'tool-use tax'
    • Proposes G-STEP, an inference-time gate that recovers some of the loss but not all of it
    • Conclusion is that intrinsic reasoning and tool-interaction capability still need to improve at the model level
    Provenance
    Article · Supporting source
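
One way to read the three-part decomposition above is as a telescoping difference between accuracy measurements. The four measurement conditions below are an assumption of this sketch, and the paper may define its conditions differently. The example accuracies are invented to show the arithmetic and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Accuracies:
    """Accuracy in percentage points under four assumed conditions."""
    native_cot: float     # plain chain-of-thought, no tool scaffolding
    tool_prompt: float    # tool-style prompt formatting only
    tool_protocol: float  # tool-calling protocol active, tool outputs withheld
    tool_full: float      # protocol plus real tool execution


def decompose(a: Accuracies) -> dict:
    """Split the net effect of tool use into two costs and one gain."""
    formatting_cost = a.native_cot - a.tool_prompt
    protocol_tax = a.tool_prompt - a.tool_protocol      # the 'tool-use tax' under this reading
    execution_gain = a.tool_full - a.tool_protocol
    net_effect = a.tool_full - a.native_cot
    return {
        "formatting_cost": formatting_cost,
        "protocol_tax": protocol_tax,
        "execution_gain": execution_gain,
        "net_effect": net_effect,  # equals execution_gain - formatting_cost - protocol_tax
    }


# Invented numbers, for illustration only.
example = decompose(Accuracies(native_cot=70.0, tool_prompt=68.0,
                               tool_protocol=63.0, tool_full=69.0))
print(example)
```

In the invented example the tools themselves help, but the formatting and protocol costs together outweigh that gain, so the net effect is negative. That is the situation the paper describes when tool-augmented reasoning fails to beat native chain-of-thought.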
  8. Google Chrome silently installs a 4 GB AI model on your device without consent

    Article Independent privacy researcher publishing a forensic audit of Chrome's on-device AI behavior.

    No dialogue at first launch. No checkbox in Settings.

    www.thatprivacyguy.com/blog/chrome-silent-n… →
    Context
    A 4GB silent install on every Chrome user's machine is a meaningful change to the implicit contract between browser and user. The privacy concern is consent, but the operational concern is that 'on-device' has become a marketing word for a model the user never chose to host.
    Key points
    • Chrome silently downloads a roughly 4GB Gemini Nano weights file when AI features are active by default
    • The file lives at OptGuideOnDeviceModel/2025.8.8.1141/weights.bin and re-downloads if deleted unless features are disabled via chrome://flags or enterprise policy
    • On an audit profile that received zero human input, the install still happened
    • The visible 'AI Mode' pill in the address bar routes to Google servers, not the local model, creating false impressions about data locality
    • Author estimates 6,000 to 60,000 tonnes of CO2-equivalent for delivery alone depending on rollout scale
    Provenance
    Article · Supporting source
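
The file path in the key points above gives readers something concrete to check. The sketch below looks for `weights.bin` under an `OptGuideOnDeviceModel` folder and reports its size. It assumes that folder sits directly inside the Chrome user data directory, which the source's relative path implies but does not state, and the caller has to supply that directory because its location varies by operating system.

```python
from pathlib import Path


def find_on_device_models(user_data_dir) -> list:
    """Return (path, size_in_bytes) for each OptGuideOnDeviceModel/<version>/weights.bin.

    user_data_dir is the Chrome user data directory (supplied by the caller;
    its location is an assumption of this sketch, not something it discovers).
    """
    root = Path(user_data_dir) / "OptGuideOnDeviceModel"
    if not root.is_dir():
        return []
    return sorted((p, p.stat().st_size) for p in root.glob("*/weights.bin"))


def describe(found: list) -> list:
    """Human-readable lines, with sizes in decimal gigabytes."""
    return [f"{path} ({size / 1e9:.2f} GB)" for path, size in found]
```

Calling `describe(find_on_device_models(some_dir))` on the right directory lists each installed version. An empty list means either the model was never downloaded or the directory passed in is not the one Chrome uses.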
  9. When everyone has AI and the company still learns nothing

    Article Robert Glaser — A consultant writing on enterprise software adoption and how organizations capture or fail to capture employee learning.

    Individual productivity gains from AI do not automatically become organizational gains.

    www.robert-glaser.de/when-everyone-has-ai-a… →
    Context
    Most companies are measuring AI adoption with seat counts and token usage. Glaser's argument is that those numbers answer the wrong question; the right one is whether the team got better at anything because of the spend.
    Key points
    • The gap between individual productivity gains and organizational learning when AI is rolled out broadly
    • Adoption typically becomes 'everywhere, uneven, partially hidden, difficult to compare, and not yet connected to organizational learning'
    • Communities of practice and brown-bag sessions move too slowly when meaningful AI work happens inside code reviews and incidents
    • Calls for 'Loop Intelligence' — visibility into which AI workflows actually produce learning
    • The right metric is not token spend or seats; it is what changed in the team because of the spend
    Provenance
    Article · Supporting source
  10. vibevoice.cpp: Microsoft VibeVoice ported to ggml/C++

    Article mudler_it — Maintainer of LocalAI and the vibevoice.cpp port.

    www.reddit.com/r/LocalLLaMA/comments/1t48fk… →
    Context
    Another piece of the speech stack moves from 'Python plus vLLM plugin' to a single C++ binary that runs anywhere ggml runs. For builders shipping voice agents, that's the difference between a service and a library.
    Key points
    • Pure C++ ggml port of Microsoft VibeVoice (TTS with voice cloning + long-form ASR with diarization)
    • Backends: CPU, CUDA, Metal, Vulkan, hipBLAS — single binary or libvibevoice.so with a flat C ABI for embedding
    • 0.41 real-time factor on a 68-second sample on CUDA Q4_K, ~6GB peak RSS
    • 17 minutes of audio in one shot at 1.94 RTF on CPU Q8_0, ~26GB RSS
    • Closed-loop TTS-to-ASR test asserts 100% source-word recall on a fixed seed
    Provenance
    Article · Supporting source
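
The real-time factors above translate into wall-clock time with one multiplication. The sketch assumes the common convention that RTF is processing time divided by audio duration, so values below 1 are faster than real time. The post does not spell out its convention, but the CUDA and CPU figures are consistent with this one.

```python
def processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock processing time implied by a real-time factor.

    Assumes RTF = processing time / audio duration.
    """
    return audio_seconds * rtf


cuda_q4 = processing_seconds(68, 0.41)       # 68-second sample on CUDA Q4_K
cpu_q8 = processing_seconds(17 * 60, 1.94)   # 17 minutes of audio on CPU Q8_0

print(f"CUDA Q4_K: {cuda_q4:.1f} s for 68 s of audio")
print(f"CPU Q8_0: {cpu_q8 / 60:.1f} min for 17 min of audio")
```

Under that convention the CUDA run takes about 28 seconds for the 68-second sample, and the CPU run takes about 33 minutes for the 17 minutes of audio.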
  11. Opus 4.7 is beyond bad

    Article AbsoluteRoster — A long-time Anthropic API user running Opus inside a custom architecture.

    www.reddit.com/r/Anthropic/comments/1t3onwr… →
    Context
    A user-side counterweight to Anthropic's release narrative. Whether Opus 4.7 actually regressed or this is a sampling artifact, the perception is loud enough that benchmarkers should look at it.
    Key points
    • A growing list of regressions in Opus 4.7 versus 4.6 in agentic coding setups
    • Speculation in the thread: a smaller base model tuned for harness benchmarks, not raw capability
    • 225 upvotes and 81 comments, with multiple commenters reporting similar regressions
    Provenance
    Article · Supporting source
  12. I replaced a 5-step lead enrichment workflow with Claude custom skills

    Article lemnistatic — A B2B operator describing a real production replacement of a multi-vendor data pipeline.

    www.reddit.com/r/ClaudeAI/comments/1t47h53/… →
    Context
    A working production agent replacing a five-step SaaS chain with one MCP-shaped skill. The most interesting part is the comment thread: the operator gained simplicity but lost the implicit redundancy of having more than one vendor.
    Key points
    • Replaced Apollo plus People Data Labs plus a separate verification tool plus a manual HubSpot import with a single Claude skill connected to three MCPs
    • Crustdata MCP for people and company data, FullEnrich MCP for email waterfall and verification, HubSpot MCP for CRM push
    • From over an hour of work and three vendors to about five minutes in one prompt
    • Top reply names the operational tradeoff: vendor concentration risk replaces integration complexity
    Provenance
    Article · Supporting source