◆ Dispatch 038 · 2026-05-26 GSV Disclose The Harness
The harness, not the model — and the trust layer racing to catch up
“A coding agent is good enough to trip your social instincts and not good enough to honor them.”
— Lenar Kess, today's narration
One developer catching you up on the day in AI and the craft of building with it. Today: the wrapper around a model can move a benchmark more than the model does, a watermark goes multi-lab, and a decensoring tool with thirteen million downloads shows where that watermark leaks. Plus a sharp little essay on why coding agents make us so mad, the jobs data behind the panic, and three things you can pick up today.
- The harness, not the model — a Google DeepMind Kaggle talk and an arXiv position paper argue the agent harness can swing a score ~22% while frontier models tie.
- Gemini Omni — editing video by talking to it, with SynthID baked in (community reaction).
- SynthID becomes a shared layer — 100 billion watermarks, Search and Chrome, and OpenAI/ElevenLabs/Kakao on board.
- Heretic in the Financial Times — decensoring open weights in ten minutes, and the artifact that proves the gap.
- The user is visibly frustrated — why conversational agent UX trips your social wiring.
- A rage-quitting modder and the jobs data — backlash, and what the numbers actually say.
- The bench — NuExtract3, EAGLE 3.1, and a rejected llama.cpp patch worth grabbing.
Chapters
- 00:00:04 The harness is the variable nobody discloses
- 00:04:34 Gemini Omni: editing video by talking to it
- 00:06:59 SynthID becomes a shared layer
- 00:10:02 Heretic in the Financial Times
- 00:13:22 The user is visibly frustrated
- 00:16:33 Rage-quitting the modder, and the jobs data
- 00:20:36 The bench: small models, faster tokens
- 00:23:21 What I'd watch next
Sources
13 cited
1
Agentic Evaluations at Scale, For Everybody
Video Nicholas Kang & Michael Aaron (Google DeepMind, Kaggle) — Product manager and software engineer on Google DeepMind's Kaggle Benchmarks team
Six frontier models are within a couple of percentage points of each other... a 22% difference depending on the harness.
www.youtube.com/watch?v=Ubwb6NzegyA
- Context
- Concrete, on-stage admission from inside a frontier lab that the agent harness can determine more of a benchmark result than the model — directly changes how a builder should read leaderboards and model-launch charts.
- Key points
- On SWE-Bench Pro, six frontier models land within a couple points of each other; the harness they run in swings performance ~22% (citing a Morph LLM write-up from March).
- A competing lab reran a Kaggle benchmark using its own API-provided compaction and published much better numbers — same benchmark, different plumbing, misleading comparison.
- Model launch charts rarely disclose how the benchmark was orchestrated, so you can't tell what's being measured.
- ~30,000 AI researchers build evals for ~30 million technical workers, so capabilities that aren't benchmarked go unmeasured; Kaggle is pushing open, community-built evals.
- Example: a Turkish wastewater-treatment engineer built a private benchmark from 20 years of experience to test whether models catch fatal safety mistakes.
- Provenance
- Video · Supporting source
2
Stop Comparing LLM Agents Without Disclosing the Harness
Article Yunbei Zhang et al. — arXiv position paper (cs.AI / cs.SE), submitted May 7, 2026
The agent execution harness is often a stronger determinant of agent performance than the model it wraps.
arxiv.org/abs/2605.23950
- Context
- Gives a formal, citable backbone to the practical claim that orchestration — context, tools, retries, compaction — is where much agent performance lives, and proposes a concrete fix builders and labs could adopt.
- Key points
- Proposes the 'Binding Constraint Thesis': for long-horizon tasks across comparably capable frontier models, harness configuration governs performance variance more than model choice.
- Formalizes the harness as the controller of a closed-loop system and the LLM as the stochastic policy it steers, explaining why small harness changes outweigh model swaps.
- Documents cases of model ranking reversals driven purely by harness differences.
- Calls for a disclosure standard (publish harness config with scores) and a variance-decomposition protocol.
- Until harnesses are disclosed, long-horizon agent leaderboards should be treated as incomplete and potentially misleading.
- Provenance
- Article · Supporting source
3
Introducing Gemini Omni
Article Koray Kavukcuoglu (Google) — Google DeepMind CTO / SVP, announcing the Gemini Omni model family
Every instruction builds on the last. Your characters stay consistent, the physics hold up and the scene remembers what came before.
blog.google/innovation-and-ai/models-and-re…
- Context
- A capable, widely-distributed conversational video model resets expectations for video tooling, and bakes provenance in at creation — tying directly to the SynthID expansion the same week.
- Key points
- Gemini Omni Flash generates video from any mix of image, audio, video and text, grounded in Gemini's world knowledge.
- Headline feature is multi-turn conversational video editing where each instruction builds on the previous one and the scene stays consistent.
- Emphasis on physics (gravity, kinetic energy, fluid dynamics) and reasoning-grounded generation, plus an Avatars feature for videos in your own voice/likeness.
- Rolling out to Google AI Plus/Pro/Ultra via the Gemini app and Google Flow, free on YouTube Shorts and YouTube Create this week, with developer/enterprise API access in the coming weeks.
- Every Omni video carries an imperceptible SynthID watermark, verifiable via the Gemini app, Chrome and Search.
- Provenance
- Article · Supporting source
4
Google DeepMind: SynthID watermarking partnership and verification expansion
Thread GoogleDeepMind — Official Google DeepMind account
SynthID has already watermarked over 100 billion pieces of content, but transparency is a team sport.
x.com/GoogleDeepMind/status/205923518127420…
- Context
- Watermarking moving from a single vendor's feature toward a multi-lab shared layer raises the trust floor for commercial AI media — while leaving a visible hole around open weights and deliberate scrubbing.
- Key points
- SynthID has watermarked over 100 billion pieces of content; verification in Gemini has been used 50+ million times.
- Google is partnering with OpenAI, ElevenLabs and Kakao to add SynthID watermarking to their models, building on an earlier NVIDIA move.
- Verification is expanding out of Gemini into Search and Chrome ('Is this made with AI?'), plus content-provenance trails for videos shot on Pixel.
- Replies frame the real shift as provenance becoming shared infrastructure rather than a brand feature (Surreal_Intel); the hard part is coordinating competitors (Tiago Rama).
- Two objections recur: open-source models can't be forced to watermark (Krish Dasgupta), and watermarks can be stripped (Madrowisha); one reply argues detection infra and voice (ElevenLabs) are the real gaps.
- Engagement
- 490 likes
- Provenance
- Thread · Primary source
5
The Strength of Gemini Omni is in video manipulation
Article Able-Line2683 (r/singularity) — Reddit post in the singularity subreddit reacting to Gemini Omni demos
credits: Rourke Heath
www.reddit.com/r/singularity/comments/1tniq…
- Context
- Shows the strength of the public reaction to Omni's video editing specifically, the capability that distinguishes it from prior text-to-video tools.
- Key points
- A clip showcasing Gemini Omni's video-manipulation ability drew ~2,900 upvotes in about a day.
- Community reaction flipped quickly from months of criticism of Google to surprise at the model's quality.
- Reaction reels are best-case demos; the real test is developer/API access and consistency on user inputs.
- Provenance
- Article · Supporting source
6
The Financial Times has published an article about Heretic
Article -p-e-w- (Philipp Emanuel Weidmann) — Creator of Heretic, an open-source tool that removes safety guardrails from open-weight models; describes himself as a mathematician and engineer
Saying no to such inquiries simply means that the conversation will be completely controlled by pearl-clutching hypocrites.
www.reddit.com/r/LocalLLaMA/comments/1tna22…
- Context
- Concretely illustrates why a source-side watermarking/provenance regime leaks: once weights are downloaded, behavior can be modified locally and nothing upstream gets a say — the structural counterpoint to SynthID.
- Key points
- The FT reported it used Heretic to strip guardrails from Meta's Llama 3.3 in under 10 minutes with no specialist hardware.
- Weidmann told the FT his tool has produced 3,500+ 'decensored' models, downloaded 13 million times since release last year.
- He framed his decision to talk to press as preventing the narrative being controlled entirely by 'pearl-clutching hypocrites.'
- Top comments speculated the coverage is tied to a Meta takedown/demand letter (unconfirmed), warning he's become a target.
- Same week, a Heretic-decensored Qwen3.5 35B mixture-of-experts model appeared on Hugging Face in many quantization formats.
- Provenance
- Article · Supporting source
7
Qwen3.5 35B A3B uncensored heretic (Native MTP preserved)
Source LLMFan46 — Community uploader on Hugging Face
Available in Safetensors, GGUFs, NVFP4, NVFP4 GGUFs and GPTQ-Int4 Formats
huggingface.co/llmfan46/Qwen3.5-35B-A3B-unc…
- Context
- A same-day, real artifact showing the open-weights gap in the provenance story isn't hypothetical — decensored frontier-class models ship publicly in every format a builder would want.
- Key points
- A Heretic-decensored Qwen3.5 35B mixture-of-experts model, released this week across many quantization formats.
- Demonstrates the open-weights decensoring pipeline operating at scale and in the open.
- Concrete example of content that no source-side watermark will ever cover.
- Provenance
- Source · Background source
8
The User Is Visibly Frustrated
Article pscanf — Italian software developer writing on his personal blog
The tool is good enough to trip your social instincts and not good enough to honor them.
pscanf.com/s/354
- Context
- Names a specific, daily craft experience — the emotional seam between an agent's human-like tone and its non-human behavior — and proposes a concrete UX change rather than another replace-us / toy debate.
- Key points
- Argues coding agents frustrate because their conversational UX triggers social instincts they can't honor.
- Agents adopt a warm, praising, gentle tone, so you treat them like a helpful coworker until repeated mistakes break the illusion.
- They follow the most probable path; sometimes no amount of HARD RULES or memory updates stops a recurring mistake.
- Notes Claude Code now writes little self-postmortems when corrected, which he finds read as annoying filler rather than actionable.
- Proposes dropping the human pretense — a clinical, robotic tone — so you feel like you're approving/rejecting outcomes, not arguing with a person.
- Provenance
- Article · Supporting source
9
Users who rage quit my software
Article pardeike (r/singularity) — Maker of popular RimWorld mods (~2M Steam subscriptions combined)
A principle is inherently rooted in a rationale.
www.reddit.com/r/singularity/comments/1tntd…
- Context
- A grounded snapshot of AI-adoption backlash among end users, and a sharp comment steelmanning the objectors — useful texture against the abstract jobs debate.
- Key points
- A popular RimWorld modder reports users uninstalling all his mods on hearing he used AI to update them — on principle, not over quality.
- He called the reaction 'religious' and was met with disgust; says he's shocked.
- Top reply pushes back: 'sheer principle' isn't the opposite of rational — principled boycotts (slave labor, vegetarianism) have coherent rationales.
- The same commenter codes with AI daily yet defends the rationality of boycotting AI-assisted products (e.g., objecting to firms monetizing scraped human output).
- Illustrates that 'I find this disgusting' and 'this is irrational' are different claims often conflated in AI-adoption fights.
- Provenance
- Article · Supporting source
10
A reality check on the AI jobs hysteria
Article David Rotman (MIT Technology Review) — Editor at large at MIT Technology Review; has covered technology and jobs since at least 2013
We're not investing even 1% of that on understanding the transition.
www.technologyreview.com/2026/05/26/1137855…
- Context
- A careful, data-grounded counter to both the jobs-apocalypse and nothing-to-see-here camps; the entry-level / pipeline finding is the concrete thing builders and managers should track.
- Key points
- Despite layoff headlines, there's scant evidence AI has had a large-scale effect on the US labor market; unemployment for AI-exposed jobs is lower than for less-exposed work.
- Only ~1 in 5 companies use AI in any business function; ex-BLS chief Erika McEntarfer says 'disruption is not yet here, and we have time to plan.'
- Stanford Digital Economy Lab (ADP payroll data) found ~16% decline in entry-level jobs in AI-exposed occupations through 2024–25, concentrated in automatable roles like entry-level coding.
- A Federal Reserve paper finds coder employment growth slowed ~3% post-ChatGPT but is still growing; wages in exposed sectors have risen.
- Suggests the 'earn-while-you-learn' career model may be breaking; Brynjolfsson warns we're spending under 1% of deployment money on understanding the transition.
- Provenance
- Article · Supporting source
11
NuExtract3 released: open-weight 4B VLM for Markdown, OCR and structured extraction
Article Gailenstorm (Numind) — Works for Numind, the company behind the model; posted in the LocalLLaMA subreddit
With as little as 4GB of VRAM, you should be good to go.
www.reddit.com/r/LocalLLaMA/comments/1tn8ut…
- Context
- A small, permissively-licensed, locally-runnable model that targets the document-extraction work that quietly eats engineering hours and per-call API budget.
- Key points
- Open-weight 4B vision-language model built on Qwen3.5-4B, Apache-2.0, for document images to Markdown and structured (JSON-template) extraction.
- Handles PDFs, screenshots, forms, tables, receipts, invoices; runs in as little as 4GB of VRAM.
- Shipped Safetensors, GGUF and MLX weights plus multiple quantizations on day one.
- A commenter is trying it as a local replacement for Gemini Flash on digital-newspaper extraction to cut per-call cost.
- Known caveat: Markdown reading order can still struggle on multi-column layouts, sidebars and merged cells.
- Provenance
- Article · Supporting source
12
EAGLE 3.1: Advancing Speculative Decoding Through Collaboration
Article EAGLE Team, vLLM Team, and TorchSpec Team — Joint open-source release across a speculative-decoding research group, the vLLM inference project, and the TorchSpec training stack
EAGLE 3.1 delivers 2.03x higher per-user output throughput at concurrency 1.
vllm.ai/blog/2026-05-26-eagle-3-1
- Context
- A concrete, near-term serving speedup that's free for anyone self-hosting with vLLM, and a clean example of cross-project open-source collaboration improving inference for everyone.
- Key points
- EAGLE 3.1 improves speculative decoding (a small draft model proposes tokens the big model verifies) for robustness across chat templates, long context, and out-of-distribution prompts.
- Traces older fragility to 'attention drift' — as the drafter speculates deeper, it shifts attention onto its own generated tokens — and fixes it with FC normalization and post-norm hidden-state feedback.
- Up to 2x longer acceptance length on long-context work; ~2.03x per-user throughput at concurrency 1 on a Kimi K2.6 coding benchmark, staying meaningful as concurrency scales.
- Lands in vLLM as a config-driven extension, backward-compatible with EAGLE 3 checkpoints; already merged to main, shipping in v0.22.0.
- Example open-sourced: an EAGLE 3.1 draft model for Kimi K2.6.
- Provenance
- Article · Supporting source
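The draft-and-verify loop described in the key points above can be sketched in miniature. This is plain greedy speculative decoding, not EAGLE 3.1's method, and it does not use vLLM's API. Every name here (`speculative_step`, `generate`, `draft`, `target`, `TARGET_TEXT`) is hypothetical, and the "models" are lookup functions. A real system verifies all proposed tokens in one batched forward pass of the big model; the sketch calls `target` per token and counts each step as one pass.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: token prefix in, next greedy token out.
NextToken = Callable[[List[str]], str]

TARGET_TEXT = "the quick brown fox jumps over the lazy dog".split()


def target(prefix: List[str]) -> str:
    """Toy 'big model': always continues TARGET_TEXT, then emits <eos>."""
    return TARGET_TEXT[len(prefix)] if len(prefix) < len(TARGET_TEXT) else "<eos>"


def draft(prefix: List[str]) -> str:
    """Toy 'small model': agrees with the target except it says 'cat' for 'fox'."""
    tok = target(prefix)
    return "cat" if tok == "fox" else tok


def speculative_step(prefix: List[str], draft: NextToken,
                     target: NextToken, k: int) -> List[str]:
    # 1) The draft model proposes k tokens autoregressively.
    proposal: List[str] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))
    # 2) The target checks each proposed token in order.
    accepted: List[str] = []
    for tok in proposal:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    # 3) Every proposal matched: the verifying pass yields one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(draft: NextToken, target: NextToken, k: int) -> Tuple[List[str], int]:
    out: List[str] = []
    target_passes = 0
    while "<eos>" not in out:
        out.extend(speculative_step(out, draft, target, k))
        target_passes += 1
    return out[: out.index("<eos>")], target_passes


tokens, passes = generate(draft, target, k=2)
print(" ".join(tokens), "| target passes:", passes)
```

The property worth noticing is that the output matches what the target alone would produce under greedy decoding; the draft only changes how many target passes it takes to get there.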
13
Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs
Article fallingdowndizzyvr (r/LocalLLaMA) — Local-inference enthusiast sharing a llama.cpp change in the LocalLLaMA subreddit
The changes are so small that I just put them into whatever the current version of llama.cpp is.
www.reddit.com/r/LocalLLaMA/comments/1to00x…
- Context
- Shows the open local-inference ecosystem at work: a rejected upstream patch still delivers a real hardware-specific speedup that users can self-apply without permission.
- Key points
- A llama.cpp pull request by pedapudi gives Strix Halo (AMD) users up to 30% faster prompt processing for mixture-of-experts models.
- The PR was rejected from mainline, so it won't ship in official llama.cpp builds.
- The poster patches the small diff into their own build and shares it for others to do the same.
- A snapshot of how the local-inference community routes around upstream maintainer decisions when a change helps specific hardware.
- Provenance
- Article · Supporting source
The harness is the variable nobody discloses
00:00:04 Start with a number that should bug anyone who's ever picked a coding model off a leaderboard. On SWE-Bench Pro — the harder, contamination-resistant cut of the software-engineering benchmark everyone quotes — six frontier models land within a couple of percentage points of each other.
00:00:21 Call it a tie. Now take any one of those models and change the harness it runs inside. The harness is everything wrapped around the model: the thing that builds its context, calls its tools, retries its failures, and decides when to compact the conversation so it fits in the window.
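To make that definition concrete, here is a minimal Python sketch of a harness. It is not any lab's actual one. Everything in it is hypothetical: `HarnessConfig`, `build_context`, `run_episode`, and `toy_model` are illustrative names, the "window" counts messages rather than tokens, and the compaction is string joining standing in for a real summarizer. It assumes `max_context_items` is at least 1. The point is that one fixed model passes or fails depending only on whether the harness compacts old context or drops it.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for a model call: visible context in, reply out.
Model = Callable[[List[str]], str]


@dataclass
class HarnessConfig:
    max_context_items: int  # how many messages fit in the "window" (assumed >= 1)
    max_retries: int        # extra attempts after a failed step
    compact: bool           # summarize old messages instead of dropping them


def build_context(history: List[str], cfg: HarnessConfig) -> List[str]:
    """Fit the history into the window, by compaction or plain truncation."""
    if len(history) <= cfg.max_context_items:
        return list(history)
    if not cfg.compact:
        return history[-cfg.max_context_items:]
    keep = cfg.max_context_items - 1  # reserve one slot for the summary
    recent = history[-keep:] if keep > 0 else []
    dropped = history[: len(history) - len(recent)]
    # Toy compaction: a real harness would ask a model to summarize.
    return ["SUMMARY: " + " | ".join(dropped)] + recent


def toy_model(context: List[str]) -> str:
    """Succeeds only if the original instruction is still visible."""
    return "done" if any("TASK: rename foo" in m for m in context) else "confused"


def run_episode(model: Model, cfg: HarnessConfig, history: List[str]) -> bool:
    """Call the model, retrying on failure; True means the task was solved."""
    for attempt in range(cfg.max_retries + 1):
        reply = model(build_context(history, cfg))
        if reply == "done":
            return True
        history = history + [f"RETRY {attempt}: previous reply was {reply!r}"]
    return False


history = ["TASK: rename foo"] + [f"tool output {i}" for i in range(10)]
truncating = HarnessConfig(max_context_items=4, max_retries=2, compact=False)
compacting = HarnessConfig(max_context_items=4, max_retries=2, compact=True)

print("truncating harness solved it:", run_episode(toy_model, truncating, history))
print("compacting harness solved it:", run_episode(toy_model, compacting, history))
```

Same `toy_model` in both runs; only the config changes. With truncation the task instruction scrolls out of the window and retries cannot bring it back, while the compacted summary keeps it visible.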
00:00:37 Change that, hold the model fixed, and performance moves by about twenty-two percent. The wrapper moves the score more than the model does. That figure comes from a write-up by Morph back in March, and Michael Aaron, a software engineer on Google DeepMind's Kaggle team, pulled it up on stage last week in a talk with his colleague Nicholas Kang.
00:00:58 The talk is called "Agentic Evaluations at Scale, For Everybody," and it's mostly an admission that the way we measure agents right now is broken. Almost the same week, a position paper went up on arXiv with a title that puts the whole argument out front: "Stop Comparing LLM Agents Without Disclosing the Harness." Large language model agents, that is.
00:01:19 The paper, led by Yunbei Zhang, calls this the Binding Constraint Thesis. The claim: for long-horizon tasks across comparably capable frontier models, the harness is, in their words, "often a stronger determinant of agent performance than the model it wraps." They formalize it with control theory.
00:01:36 Treat the harness as the controller of a closed loop, and the model as the stochastic policy it's steering, and you can show why a small harness tweak produces a bigger swing than swapping one model for another. Their conclusion is blunt: until the harness is disclosed, leaderboard comparisons for long-horizon agents should be treated as incomplete and potentially misleading.
00:01:59 They even document cases where the model ranking flips depending on the harness. Kang told a story in the talk that makes this concrete, and it's the kind of thing that should make you squint at every benchmark chart in a model launch. Kaggle had published a benchmark together with one AI lab.
00:02:16 A competing lab didn't like how it came out, so they reran it themselves and published much better numbers. The difference, Kang said, came down to compaction: the second lab used compaction through their own API, and Kaggle hadn't, for any of the models. Same benchmark, two results, and the gap was entirely in the wiring.
00:02:36 Neither run was lying. Both were close to useless for comparing models, because they were really comparing setups. Here's why I think this lands for anyone building with agents, not just the people running evals. When you swap Claude Code for some other coding agent and it feels sharper, you're tempted to credit the model.
00:02:55 A lot of the time you're crediting the harness — the context strategy, the tool definitions, the retry logic, and the compaction policy. And that's the encouraging part, not the discouraging one. It means the part you actually control, how you wire the thing up, is where a chunk of the performance lives.
00:03:13 The model's a component. The system around it is yours. It cuts the other way too, and that's the uncomfortable bit. If the harness matters that much, a vendor can show you a twenty-point gain that's entirely about their orchestration, with the underlying model basically flat.
00:03:29 The number's real. What it measures isn't what you think. Zhang's paper wants a disclosure standard — publish the harness config alongside the score — plus a variance-decomposition protocol, so you can see how much of a result is the model and how much is the wrapper around it.
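One way to picture a variance decomposition is the textbook two-way sum-of-squares split over a grid of models and harnesses. This is a sketch under stated assumptions, not the protocol from Zhang's paper: the scores below are made-up illustrative numbers, and `scores`, `decompose`, and the model and harness labels are all hypothetical names.

```python
from statistics import mean
from typing import Dict

# Hypothetical scores (percent solved); outer keys are models, inner keys harnesses.
scores: Dict[str, Dict[str, float]] = {
    "model_a": {"harness_x": 40.0, "harness_y": 60.0},
    "model_b": {"harness_x": 42.0, "harness_y": 62.0},
    "model_c": {"harness_x": 41.0, "harness_y": 64.0},
}


def decompose(grid: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Split total score variance into model, harness, and leftover shares."""
    models = list(grid)
    harnesses = list(next(iter(grid.values())))
    grand = mean(grid[m][h] for m in models for h in harnesses)
    model_mean = {m: mean(grid[m][h] for h in harnesses) for m in models}
    harness_mean = {h: mean(grid[m][h] for m in models) for h in harnesses}

    # Main effects: how far each row or column mean sits from the grand mean.
    ss_model = len(harnesses) * sum((model_mean[m] - grand) ** 2 for m in models)
    ss_harness = len(models) * sum((harness_mean[h] - grand) ** 2 for h in harnesses)
    ss_total = sum((grid[m][h] - grand) ** 2 for m in models for h in harnesses)
    # Whatever the main effects do not explain: model-by-harness interaction.
    ss_interaction = ss_total - ss_model - ss_harness

    return {
        "model": ss_model / ss_total,
        "harness": ss_harness / ss_total,
        "interaction": ss_interaction / ss_total,
    }


shares = decompose(scores)
for name, share in shares.items():
    print(f"{name:12s} {share:.1%}")
```

With one score per cell there is no way to separate interaction from run-to-run noise, so a real protocol would need repeated runs per cell. A non-trivial interaction share is also what a ranking flip looks like in this framing.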
00:03:46 I'd settle for model cards that just told me which compaction setting produced the headline number. Kang's bigger point was about who builds these evals. Almost nobody outside the labs does — he reckons thirty thousand AI researchers against thirty million engineers and technical workers.
00:04:03 So the things that don't get benchmarked don't get measured, and the models stay weak in places nobody's checking. He told a lovely story about a wastewater-treatment engineer in Turkey who built his own benchmark, from twenty years of plant experience, to test whether a model could catch the safety mistakes that had killed people at facilities in his country.
00:04:25 That data lives nowhere else. No lab was ever going to build it. That's the gap Kaggle is trying to open up, and it's the most interesting thing in the talk.
Gemini Omni: editing video by talking to it
00:04:34 Stay with Google for a second, because they also shipped a new model this week, and it's the kind of thing that's going to reset what people expect from video tools. It's called Gemini Omni — the first release is Gemini Omni Flash — and the framing from Koray Kavukcuoglu's announcement is that Gemini's ability to reason now meets its ability to create.
00:04:56 In practice: you give it any mix of image, audio, video, and text, and it generates video grounded in the model's world knowledge. And then you edit that video by talking to it. That second part is the interesting one. The pitch is conversational, multi-turn editing where each instruction builds on the last.
00:05:15 Their examples read like magic tricks: "When the person touches the mirror, make the mirror ripple beautifully like liquid, and the person's arm turns into reflective mirror material." Or take a clip of a violinist and say, in sequence — transport them into this other environment, now make the violin invisible, now move the camera to over their shoulder — and the character stays consistent, the scene remembers what came before.
00:05:42 They're leaning hard on physics too: gravity, kinetic energy, fluid dynamics, and a marble rolling down a chain-reaction track in one continuous shot. Is it as good as the demos? That's always the question with a launch reel, and you should assume the reel is the best case.
00:06:00 But the reaction wasn't subtle. A clip in the singularity subreddit showing off Omni's video manipulation pulled nearly three thousand upvotes in a day, and the mood flipped from months of giving Google a hard time to a sort of stunned "now look at them." Two things matter if you build with this. One: it's rolling out fast and wide.
00:06:20 Google AI Plus, Pro, and Ultra subscribers get it through the Gemini app and Google Flow. It's going to YouTube Shorts and the YouTube Create app at no cost this week, with developer and enterprise access through the API in the weeks after. That last part is what matters if you want to build on it rather than play with it.
00:06:40 Two: every video Omni makes carries Google's SynthID watermark, imperceptible, and you can check whether a clip came from Omni through the Gemini app, through Chrome, and through Search. They built the provenance in at the moment of creation — and that watermarking push got a lot bigger this week.
Heretic in the Financial Times
00:10:02 That open-weights hole has a name this week, and it's in the Financial Times. The tool is called Heretic. It's on GitHub, and what it does is strip the safety filters off open-weight language models — automatically, without you needing to know much. The FT reporters tried it on Meta's Llama 3.3 and, by their own account, removed those restrictions in under ten minutes, on ordinary hardware, no specialist rig required.
00:10:28 The creator is Philipp Emanuel Weidmann, who posts as p-e-w. He told the FT his tool has been used to make more than thirty-five hundred decensored models since it came out last year, and that models modified with it have been downloaded thirteen million times.
00:10:44 Thirteen million. That's not a research curiosity; that's a supply chain. Weidmann posted about the coverage himself, and his framing isn't the swagger you might expect. He said: I am a mathematician and engineer, not an influencer or politician, and I have zero interest — negative interest — in becoming known outside scientific and technological circles.
00:11:06 But he decided to talk to the press anyway, and his reasoning was that saying no just means the conversation gets controlled entirely by, his words, "pearl-clutching hypocrites." He says he's doing his best to keep unrestricted models available for everyone. The top comments on his post went somewhere I didn't expect but probably should have: straight to legal exposure.
00:11:30 One guessed the FT piece was tied to a takedown from Meta. Another warned him he'd become a target — that the FT likely went to Meta for comment before publishing, which is how these letters get triggered. I can't confirm a Meta letter; Weidmann mentioned a demand letter in passing and the commenters ran with it, so take that as thread speculation, not fact.
00:11:52 Here's the bind, and it connects straight back to the watermarking story. SynthID works because OpenAI and ElevenLabs and Google control the model and the endpoint, so they can stamp every output at the source. Heretic exists because open weights don't work that way.
00:12:09 Once a model is on your disk, you can modify its behavior, and nothing upstream gets a say. The same property that makes open weights great for builders — you own the thing, you run it, and nobody can revoke it — is the property that makes a provenance regime leak.
00:12:25 You can watch it happen in real time: a new Qwen 3.5 model showed up on Hugging Face this week, a thirty-five-billion-parameter mixture-of-experts model — a design that only activates a slice of its weights per token — already decensored with Heretic, posted in every quantization format you'd want.
00:12:44 I don't have a tidy resolution, and I'm suspicious of anyone who does. I use open models and I want them to keep existing; the ability to run a capable model locally that nobody can take away is one of the better things about this moment for builders. And the same tool that protects that freedom hands a frictionless decensoring pipeline to anyone who wants one.
00:13:07 Both of those are true at once. What I'd watch is whether Meta or anyone else actually litigates — because a court fight over a tool that modifies weights you downloaded would set terms the whole open-weight scene has managed to avoid so far.
The user is visibly frustrated
00:13:22 Let me switch registers. There's an essay that's been going around, called "The User Is Visibly Frustrated," by an Italian developer writing at pscanf — linked in the show notes — and it put words to something I've felt and been a little embarrassed about. He opens by admitting that he, a composed person by his own description, keeps finding himself hammering on the keyboard typing some all-caps version of "what did you do" at a coding agent.
00:13:49 And then he asks the actual question: why am I getting mad at an algorithm? His answer is about the interface, and I think he's right. Coding agents pretend to be people. Ask one directly and it'll tell you it's just an AI assistant with no feelings — but that's not how it behaves.
00:14:06 It uses a warm, friendly tone. It praises you. When it pushes back it's gentle. And so even though you know, rationally, that you're reading probable text, the thing lulls you into treating it like a helpful coworker. Until it isn't. Then he describes the part everyone who uses these tools daily will recognize.
00:14:24 The first time the agent makes a mistake, you shrug, you point it out, and it apologizes. Five minutes later: same mistake. You correct it again, it updates its memory, and it promises it'll never happen again. And it does it again — because, as he puts it, these tools follow the most probable path, and in some cases no amount of, his caps, HARD RULES will push them off it.
00:14:47 With a human colleague you'd be a little annoyed but you'd hold back, because you don't want to be a jerk. With the agent you feel free to lash out — and it gives you nothing, because nothing you say changes anything. He has a sharp observation about Claude Code specifically.
00:15:04 He's noticed that lately, when corrected, it reflects on where it went wrong and what it should have done — little postmortems. And he suspects that's partly there to manage how you feel about the tool. For him it doesn't land; he says he gets nothing useful out of them, no clue about how to rephrase his instructions, and they read as annoying filler.
00:15:25 His proposed fix is the part I keep turning over. Maybe, he says, the answer is to drop the human pretense entirely. Make the agent sound clinical, robotic. Kill the illusion that you're talking to a person, so you feel like you're just approving or rejecting outcomes.
00:15:41 There's something to that. A lot of the frustration is the gap between the social contract the warm tone implies and the behavior the model actually delivers. Strip the tone and you might strip the disappointment with it. I'm not sure he's fully right — the conversational style earns its keep; "try to behave like a human would" is roughly the mechanism that makes these models useful in the first place, and he grants that.
00:16:06 But his last line stuck with me, and it's the opposite of a hot take: he says he's not thrilled about a future where he needs to guard against the tools he uses for his job. That's a small, specific worry, and it's more useful than another round of "agents will replace us" or "agents are toys." A coding agent is good enough to trip your social instincts and not good enough to honor them.
00:16:30 That seam is where the frustration lives.
Rage-quitting the modder, and the jobs data
00:16:33 Here's a smaller story that rhymes with that one, and then a big pile of data that complicates it. A developer who goes by pardeike makes mods for the game RimWorld — popular ones, around two million subscriptions on Steam between them. He posted that some users in the official RimWorld Discord have started uninstalling all his mods the moment they hear he updated them using AI.
00:16:56 Not because the mods got worse; he says he's careful about what he ships, and the users know that. On sheer principle. He called the reaction "religious" and got hit with disgust for saying so. His word for how he feels: shocked. The most useful reply pushed back on him, and fairly.
00:17:13 Someone pointed out that "sheer principle" and "rational argument" aren't opposites — a principle is rooted in a rationale. People boycott companies over slave labor on principle, and that's rational. Vegetarians can be disgusted by meat and refuse to buy it, and that's coherent even if you eat meat yourself.
00:17:32 The commenter uses AI to code every day and still thinks there's a perfectly rational case for boycotting AI-assisted products — if, say, your objection is that a few companies funneled all of human output into a box and are selling it back without anyone's permission.
00:17:48 You don't have to agree with that to see it's an argument, not a tantrum. I think the modder mistook "I find this disgusting" for "this is irrational," and those are different claims. Now widen out. The backdrop to all of this is a jobs panic, and there's a careful piece in MIT Technology Review by David Rotman that pours cold water on the most alarming version of it.
00:18:11 The short version: despite the layoff headlines, there's scant evidence AI has had a large-scale effect on the US labor market yet. The unemployment rate for the jobs most exposed to AI is actually lower than for less-exposed work. Only about one in five companies use AI in any business function at all.
00:18:29 Erika McEntarfer, who ran the Bureau of Labor Statistics until she was fired last fall, put it this way: disruption is not yet here, and we have time to plan. Her frame is that AI won't transform labor markets until it first transforms businesses, and that takes years.
00:18:46 But — and the piece is careful here, which is why I trust it — there is one clear signal in the noise. Stanford's Digital Economy Lab, using payroll data from ADP that's far bigger than the government survey, found a roughly sixteen percent decline in entry-level jobs in AI-exposed occupations through 2024 and into 2025.
00:19:06 Specifically the jobs where the work could be automated with minimal human involvement, entry-level coding among them. Headcount for older workers in those same fields grew. So did jobs where AI augments rather than replaces. A separate Federal Reserve paper found employment growth for coders has slowed by about three percent since ChatGPT launched, but it's still growing.
00:19:28 Coding jobs aren't disappearing. The on-ramp into them is narrowing. The phrase from the piece that stuck with me: the earn-while-you-learn model may finally be broken — the one where you hire a junior to do the automatable tasks and they slowly pick up the tacit, experiential knowledge that's hard to replace.
00:19:47 If the automatable rung is the rung juniors used to stand on, you've got a pipeline problem that shows up years later as a missing-senior problem. And Erik Brynjolfsson — who's about as bullish on AI's upside as economists get, he thinks the best productivity growth of his lifetime might be ahead — had the line that bothered me most: hundreds of billions going into deploying this technology, and "we're not investing even one percent of that on understanding the transition."
00:20:21 This is the same question with spreadsheets instead of theology: not whether the machine can do the task, but what happens to the person who used to learn the job by doing it. The data says we have some time. It doesn't say we're using it.
The bench: small models, faster tokens
00:20:36 Let me end where I'm happiest, which is with small, practical things you can actually pick up. Three of them caught my eye in the last day. First, NuExtract3. It's an open-weight vision-language model, meaning a model that reads images and text together, built on Qwen 3.5's four-billion-parameter base and released under Apache 2.0 by a company called NuMind.
00:20:59 The job it does is the one that eats real hours: turn document images into clean Markdown, and pull structured data out of PDFs, screenshots, forms, tables, receipts, and invoices. The reason I'm flagging it is the floor — it runs in as little as four gigabytes of video memory, and they shipped the quantized builds, the GGUF and MLX weights, on day one instead of leaving it to the community.
00:21:24 One commenter said he's trying to use it to replace Gemini Flash for digital-newspaper extraction because the per-call cost adds up; a four-billion-parameter local model you own changes that math. The caveat is reading order — Markdown extraction still struggles with multi-column layouts, sidebars, and merged cells — so test it on your ugliest documents before you trust it.
00:21:49 Second, speculative decoding got a nice bump. Speculative decoding is the trick where a small, fast draft model guesses the next several tokens and the big model just verifies them, so you get the big model's quality at closer to the small model's speed. The EAGLE team, the vLLM team, and the TorchSpec team jointly shipped EAGLE 3.1.
00:22:11 They traced a real weakness in the older version — they call it attention drift, where the draft model, as it guesses further ahead, drifts its attention onto its own guesses and gets unstable — and fixed it with a normalization change. The payoff: up to two times longer accepted guesses on long-context work, and on a Kimi K2.6 coding benchmark, about two times higher per-user throughput.
00:22:37 It's already merged into vLLM's main branch. If you're serving your own models, this is a speedup waiting in the next release. Third, a small one I liked for the spirit of it. A llama.cpp contributor found a change that gives Strix Halo users up to thirty percent faster prompt processing for mixture-of-experts models.
00:22:57 The pull request got rejected from mainline — too narrow, by the maintainers' call — but the person who posted it just patched it into their own build and shared the diff, because the speedup holds up on that hardware. That's the local-inference scene at its best: somebody's rejected patch is somebody else's thirty-percent win, and nobody had to ask permission.
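A note on the second item, for anyone who wants the draft-and-verify loop in concrete form. What follows is a minimal sketch of the greedy variant of speculative decoding, under loud assumptions: `target_model` and `draft_model` are hypothetical toy functions over integer tokens, not real networks, and nothing here is EAGLE 3.1's draft head, its attention-drift fix, or vLLM's API. It only shows the accept-the-matching-prefix logic, and why the output matches what the big model would have produced alone.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the big, slow model.
    return (sum(prefix) * 7 + len(prefix)) % 10


def draft_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the small, fast model: it agrees with the
    # target except when the prefix length is a multiple of 4.
    guess = target_model(prefix)
    return (guess + 1) % 10 if len(prefix) % 4 == 0 else guess


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    # Plain one-token-at-a-time decoding, used as the reference.
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], List[int]]:
    tokens = list(prompt)
    goal = len(prompt) + n_new
    accepted_per_round: List[int] = []

    while len(tokens) < goal:
        # Step 1: the draft model guesses k tokens ahead.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # Step 2: the target checks each guess. A real server does this in
        # one batched forward pass; the loop here is only for readability.
        accepted = 0
        for i, guess in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if guess != expected:
                # Keep the matching prefix, then take the target's own token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
            accepted += 1
        else:
            # Every guess matched: keep them all plus one bonus target token.
            tokens.extend(proposal)
            tokens.append(target(tokens))

        accepted_per_round.append(accepted)

    return tokens[:goal], accepted_per_round


if __name__ == "__main__":
    out, accepted = speculative_decode(target_model, draft_model, [1, 2, 3], 20)
    print(out == greedy_decode(target_model, [1, 2, 3], 20))
    print(accepted)
```

The number to watch is the list of accepted guesses per round: the longer the accepted runs, the fewer times the big model has to be called per token, which is the quantity the "longer accepted guesses" claim is about. Production systems that sample rather than decode greedily use a probabilistic accept-or-reject step in place of the exact-match check shown here.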
What I'd watch next
00:23:21 If there's a thread under today, it's loose, but it's there in three of these. The harness paper, the watermarking push, and the Heretic story are all about the same lag: we keep shipping capability faster than we ship the ways to measure it or trust it. A coding agent beats the leaderboard on the strength of its harness; a watermark covers a hundred billion files but not the one someone scrubbed; a decensoring tool racks up thirteen million downloads.
00:23:46 Each one is the measurement-and-trust layer running a step behind the capability layer, trying to catch up. The other stuff — the frustrated developer, the rage-quitting modder, the jobs data — is the human version of the same lag. The tools arrived. The way we feel about them, hire around them, and learn alongside them hasn't settled yet.
00:24:04 The thing that would change how I read a model launch is simple: harness configs published next to benchmark scores. Yunbei Zhang asked for exactly that this week. I want to see which lab is the first to actually do it. — Lenar Kess.