Dispatch 038 · 2026-05-26 GSV Disclose The Harness

The harness, not the model — and the trust layer racing to catch up

/ 00:24:26 / 13 sources

“A coding agent is good enough to trip your social instincts and not good enough to honor them.”

— Lenar Kess, today's narration

One developer catching you up on the day in AI and the craft of building with it. Today: the wrapper around a model can move a benchmark more than the model does, a watermark goes multi-lab, and a decensoring tool with thirteen million downloads shows where that watermark leaks. Plus a sharp little essay on why coding agents make us so mad, the jobs data behind the panic, and three things you can pick up today.

Chapters

  1. 00:00:04 The harness is the variable nobody discloses
  2. 00:04:34 Gemini Omni: editing video by talking to it
  3. 00:06:59 SynthID becomes a shared layer
  4. 00:10:02 Heretic in the Financial Times
  5. 00:13:22 The user is visibly frustrated
  6. 00:16:33 Rage-quitting the modder, and the jobs data
  7. 00:20:36 The bench: small models, faster tokens
  8. 00:23:21 What I'd watch next

Sources

13 cited
  1.

    Agentic Evaluations at Scale, For Everybody

    Video Nicholas Kang & Michael Aaron (Google DeepMind, Kaggle) — Product manager and software engineer on Google DeepMind's Kaggle Benchmarks team

    Six frontier models are within a couple of percentage points of each other... a 22% difference depending on the harness.

    www.youtube.com/watch?v=Ubwb6NzegyA →
    Context
    Concrete, on-stage admission from inside a frontier lab that the agent harness can determine more of a benchmark result than the model — directly changes how a builder should read leaderboards and model-launch charts.
    Key points
    • On SWE-Bench Pro, six frontier models land within a couple of percentage points of each other, while the harness they run in swings performance by about 22% (citing a Morph LLM write-up from March).
    • A competing lab reran a Kaggle benchmark using its own API-provided compaction and published much better numbers — same benchmark, different plumbing, misleading comparison.
    • Model launch charts rarely disclose how the benchmark was orchestrated, so you can't tell what's being measured.
    • ~30,000 AI researchers build evals for ~30 million technical workers, so capabilities that aren't benchmarked go unmeasured; Kaggle is pushing open, community-built evals.
    • Example: a Turkish wastewater-treatment engineer built a private benchmark from 20 years of experience to test whether models catch fatal safety mistakes.
    Provenance
    Video · Supporting source
  2.

    Stop Comparing LLM Agents Without Disclosing the Harness

    Article Yunbei Zhang et al. — arXiv position paper (cs.AI / cs.SE), submitted May 7, 2026

    The agent execution harness is often a stronger determinant of agent performance than the model it wraps.

    arxiv.org/abs/2605.23950 →
    Context
    Gives a formal, citable backbone to the practical claim that orchestration — context, tools, retries, compaction — is where much agent performance lives, and proposes a concrete fix builders and labs could adopt.
    Key points
    • Proposes the 'Binding Constraint Thesis': for long-horizon tasks across comparably capable frontier models, harness configuration governs performance variance more than model choice.
    • Formalizes the harness as the controller of a closed-loop system and the LLM as the stochastic policy it steers, explaining why small harness changes outweigh model swaps.
    • Documents cases of model ranking reversals driven purely by harness differences.
    • Calls for a disclosure standard (publish harness config with scores) and a variance-decomposition protocol.
    • Until harnesses are disclosed, long-horizon agent leaderboards should be treated as incomplete and potentially misleading.
    Provenance
    Article · Supporting source
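
    The variance-decomposition idea in source 2 can be sketched as a two-way breakdown over a model × harness grid. This is a minimal illustration under stated assumptions, not the paper's protocol: the function name, the harness labels, and every score below are hypothetical, and with one run per cell the residual term mixes model-harness interaction with run-to-run noise.

```python
from statistics import mean

# ILLUSTRATIVE numbers only: not taken from any benchmark or from the paper.
ILLUSTRATIVE_SCORES = {
    "model_a": {"bare_loop": 40.0, "tuned_harness": 60.0},
    "model_b": {"bare_loop": 42.0, "tuned_harness": 62.0},
    "model_c": {"bare_loop": 41.0, "tuned_harness": 61.0},
}


def decompose_variance(scores):
    """Split score variance into model, harness, and residual shares.

    `scores` maps model -> harness -> score and must be a full grid
    (every model evaluated under every harness, one score per cell).
    """
    models = sorted(scores)
    harnesses = sorted(scores[models[0]])
    for m in models:
        if sorted(scores[m]) != harnesses:
            raise ValueError(f"{m} was not run under every harness")

    cells = [scores[m][h] for m in models for h in harnesses]
    grand = mean(cells)
    ss_total = sum((x - grand) ** 2 for x in cells)
    if ss_total == 0:
        raise ValueError("all scores are identical; nothing to decompose")

    # Main effects: how far each row/column mean sits from the grand mean.
    model_mean = {m: mean(scores[m][h] for h in harnesses) for m in models}
    harness_mean = {h: mean(scores[m][h] for m in models) for h in harnesses}
    ss_model = len(harnesses) * sum((model_mean[m] - grand) ** 2 for m in models)
    ss_harness = len(models) * sum((harness_mean[h] - grand) ** 2 for h in harnesses)

    # Whatever the two main effects do not explain (clamped against float error).
    ss_resid = max(0.0, ss_total - ss_model - ss_harness)

    return {
        "model": ss_model / ss_total,
        "harness": ss_harness / ss_total,
        "interaction_and_noise": ss_resid / ss_total,
    }


shares = decompose_variance(ILLUSTRATIVE_SCORES)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

    Publishing the full grid alongside the harness configuration is what makes a split like this possible; a single headline score per model cannot be decomposed at all.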
  3.

    Introducing Gemini Omni

    Article Koray Kavukcuoglu (Google) — Google DeepMind CTO / SVP, announcing the Gemini Omni model family

    Every instruction builds on the last. Your characters stay consistent, the physics hold up and the scene remembers what came before.

    blog.google/innovation-and-ai/models-and-re… →
    Context
    A capable, widely-distributed conversational video model resets expectations for video tooling, and bakes provenance in at creation — tying directly to the SynthID expansion the same week.
    Key points
    • Gemini Omni Flash generates video from any mix of image, audio, video and text, grounded in Gemini's world knowledge.
    • Headline feature is multi-turn conversational video editing where each instruction builds on the previous one and the scene stays consistent.
    • Emphasis on physics (gravity, kinetic energy, fluid dynamics) and reasoning-grounded generation, plus an Avatars feature for videos in your own voice/likeness.
    • Rolling out to Google AI Plus/Pro/Ultra via the Gemini app and Google Flow, free on YouTube Shorts and YouTube Create this week, with developer/enterprise API access in the coming weeks.
    • Every Omni video carries an imperceptible SynthID watermark, verifiable via the Gemini app, Chrome and Search.
    Provenance
    Article · Supporting source
  4.

    Google DeepMind: SynthID watermarking partnership and verification expansion

    Thread GoogleDeepMind — Official Google DeepMind account

    SynthID has already watermarked over 100 billion pieces of content, but transparency is a team sport.

    x.com/GoogleDeepMind/status/205923518127420… →
    Context
    Watermarking moving from a single vendor's feature toward a multi-lab shared layer raises the trust floor for commercial AI media — while leaving a visible hole around open weights and deliberate scrubbing.
    Key points
    • SynthID has watermarked over 100 billion pieces of content; verification in Gemini has been used 50+ million times.
    • Google is partnering with OpenAI, ElevenLabs and Kakao to add SynthID watermarking to their models, building on an earlier NVIDIA move.
    • Verification is expanding out of Gemini into Search and Chrome ('Is this made with AI?'), plus content-provenance trails for videos shot on Pixel.
    • Replies frame the real shift as provenance becoming shared infrastructure rather than a brand feature (Surreal_Intel); the hard part is coordinating competitors (Tiago Rama).
    • Two objections recur: open-source models can't be forced to watermark (Krish Dasgupta), and watermarks can be stripped (Madrowisha); one reply argues detection infra and voice (ElevenLabs) are the real gaps.
    Engagement
    490 likes
    Provenance
    Thread · Primary source
  5.

    The Strength of Gemini Omni is in video manipulation

    Article Able-Line2683 (r/singularity) — Reddit post in the singularity subreddit reacting to Gemini Omni demos

    credits: Rourke Heath

    www.reddit.com/r/singularity/comments/1tniq… →
    Context
    Shows the strength of the public reaction to Omni's video editing specifically, the capability that distinguishes it from prior text-to-video tools.
    Key points
    • A clip showcasing Gemini Omni's video-manipulation ability drew ~2,900 upvotes in about a day.
    • Community reaction flipped quickly from months of criticism of Google to surprise at the model's quality.
    • Reaction reels are best-case demos; the real test is developer/API access and consistency on user inputs.
    Provenance
    Article · Supporting source
  6.

    The Financial Times has published an article about Heretic

    Article -p-e-w- (Philipp Emanuel Weidmann) — Creator of Heretic, an open-source tool that removes safety guardrails from open-weight models; describes himself as a mathematician and engineer

    Saying no to such inquiries simply means that the conversation will be completely controlled by pearl-clutching hypocrites.

    www.reddit.com/r/LocalLLaMA/comments/1tna22… →
    Context
    Concretely illustrates why a source-side watermarking/provenance regime leaks: once weights are downloaded, behavior can be modified locally and nothing upstream gets a say — the structural counterpoint to SynthID.
    Key points
    • The FT reported it used Heretic to strip guardrails from Meta's Llama 3.3 in under 10 minutes with no specialist hardware.
    • Weidmann told the FT his tool has produced 3,500+ 'decensored' models, downloaded 13 million times since release last year.
    • He framed his decision to talk to press as preventing the narrative being controlled entirely by 'pearl-clutching hypocrites.'
    • Top comments speculated the coverage is tied to a Meta takedown/demand letter (unconfirmed), warning he's become a target.
    • Same week, a Heretic-decensored Qwen3.5 35B mixture-of-experts model appeared on Hugging Face in many quantization formats.
    Provenance
    Article · Supporting source
  7.

    Qwen3.5 35B A3B uncensored heretic (Native MTP preserved)

    Source LLMFan46 — Community uploader on Hugging Face

    Available in Safetensors, GGUFs, NVFP4, NVFP4 GGUFs and GPTQ-Int4 Formats

    huggingface.co/llmfan46/Qwen3.5-35B-A3B-unc… →
    Context
    A same-day, real artifact showing the open-weights gap in the provenance story isn't hypothetical — decensored frontier-class models ship publicly in every format a builder would want.
    Key points
    • A Heretic-decensored Qwen3.5 35B mixture-of-experts model, released this week across many quantization formats.
    • Demonstrates the open-weights decensoring pipeline operating at scale and in the open.
    • Concrete example of content that no source-side watermark will ever cover.
    Provenance
    Source · Background source
  8.

    The User Is Visibly Frustrated

    Article pscanf — Italian software developer writing on his personal blog

    The tool is good enough to trip your social instincts and not good enough to honor them.

    pscanf.com/s/354 →
    Context
    Names a specific, daily craft experience — the emotional seam between an agent's human-like tone and its non-human behavior — and proposes a concrete UX change rather than another replace-us / toy debate.
    Key points
    • Argues coding agents frustrate because their conversational UX triggers social instincts they can't honor.
    • Agents adopt a warm, praising, gentle tone, so you treat them like a helpful coworker until repeated mistakes break the illusion.
    • They follow the most probable path; sometimes no amount of HARD RULES or memory updates stops a recurring mistake.
    • Notes that Claude Code now writes little self-postmortems when corrected, which he finds read as annoying filler rather than anything actionable.
    • Proposes dropping the human pretense — a clinical, robotic tone — so you feel like you're approving/rejecting outcomes, not arguing with a person.
    Provenance
    Article · Supporting source
  9.

    Users who rage quit my software

    Article pardeike (r/singularity) — Maker of popular RimWorld mods (~2M Steam subscriptions combined)

    A principle is inherently rooted in a rationale.

    www.reddit.com/r/singularity/comments/1tntd… →
    Context
    A grounded snapshot of AI-adoption backlash among end users, and a sharp comment steelmanning the objectors — useful texture against the abstract jobs debate.
    Key points
    • A popular RimWorld modder reports users uninstalling all his mods on hearing he used AI to update them — on principle, not over quality.
    • He called the reaction 'religious', was met with disgust in return, and says he is shocked.
    • Top reply pushes back: 'sheer principle' isn't the opposite of rational — principled boycotts (slave labor, vegetarianism) have coherent rationales.
    • The same commenter codes with AI daily yet defends the rationality of boycotting AI-assisted products (e.g., objecting to firms monetizing scraped human output).
    • Illustrates that 'I find this disgusting' and 'this is irrational' are different claims often conflated in AI-adoption fights.
    Provenance
    Article · Supporting source
  10.

    A reality check on the AI jobs hysteria

    Article David Rotman (MIT Technology Review) — Editor at large at MIT Technology Review; has covered technology and jobs since at least 2013

    We're not investing even 1% of that on understanding the transition.

    www.technologyreview.com/2026/05/26/1137855… →
    Context
    A careful, data-grounded counter to both the jobs-apocalypse and nothing-to-see-here camps; the entry-level / pipeline finding is the concrete thing builders and managers should track.
    Key points
    • Despite layoff headlines, there's scant evidence AI has had a large-scale effect on the US labor market; unemployment for AI-exposed jobs is lower than for less-exposed work.
    • Only ~1 in 5 companies use AI in any business function; ex-BLS chief Erika McEntarfer says 'disruption is not yet here, and we have time to plan.'
    • Stanford Digital Economy Lab (using ADP payroll data) found a ~16% decline in entry-level jobs in AI-exposed occupations through 2024–25, concentrated in automatable roles like entry-level coding.
    • A Federal Reserve paper finds coder employment growth slowed ~3% post-ChatGPT but is still growing; wages in exposed sectors have risen.
    • Suggests the 'earn-while-you-learn' career model may be breaking; Brynjolfsson warns we're spending under 1% of deployment money on understanding the transition.
    Provenance
    Article · Supporting source
  11.

    NuExtract3 released: open-weight 4B VLM for Markdown, OCR and structured extraction

    Article Gailenstorm (Numind) — Works for Numind, the company behind the model; posted in the LocalLLaMA subreddit

    With as little as 4GB of VRAM, you should be good to go.

    www.reddit.com/r/LocalLLaMA/comments/1tn8ut… →
    Context
    A small, permissively-licensed, locally-runnable model that targets the document-extraction work that quietly eats engineering hours and per-call API budget.
    Key points
    • Open-weight 4B vision-language model built on Qwen3.5-4B, Apache-2.0 licensed, that converts document images to Markdown and performs structured (JSON-template) extraction.
    • Handles PDFs, screenshots, forms, tables, receipts, invoices; runs in as little as 4GB of VRAM.
    • Shipped Safetensors, GGUF and MLX weights plus multiple quantizations on day one.
    • A commenter is trying it as a local replacement for Gemini Flash on digital-newspaper extraction to cut per-call cost.
    • Known caveat: Markdown reading order can still struggle on multi-column layouts, sidebars and merged cells.
    Provenance
    Article · Supporting source
  12.

    EAGLE 3.1: Advancing Speculative Decoding Through Collaboration

    Article EAGLE Team, vLLM Team, and TorchSpec Team — Joint open-source release across a speculative-decoding research group, the vLLM inference project, and the TorchSpec training stack

    EAGLE 3.1 delivers 2.03x higher per-user output throughput at concurrency 1.

    vllm.ai/blog/2026-05-26-eagle-3-1 →
    Context
    A concrete, near-term serving speedup that's free for anyone self-hosting with vLLM, and a clean example of cross-project open-source collaboration improving inference for everyone.
    Key points
    • EAGLE 3.1 improves speculative decoding (a small draft model proposes tokens the big model verifies) for robustness across chat templates, long context, and out-of-distribution prompts.
    • Traces older fragility to 'attention drift' — as the drafter speculates deeper, it shifts attention onto its own generated tokens — and fixes it with FC normalization and post-norm hidden-state feedback.
    • Up to 2x longer acceptance length on long-context work; ~2.03x per-user throughput at concurrency 1 on a Kimi K2.6 coding benchmark, staying meaningful as concurrency scales.
    • Lands in vLLM as a config-driven extension, backward-compatible with EAGLE 3 checkpoints; already merged to main, shipping in v0.22.0.
    • Example open-sourced: an EAGLE 3.1 draft model for Kimi K2.6.
    Provenance
    Article · Supporting source
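
    The draft-then-verify loop that source 12 summarizes can be sketched with toy stand-ins. This is plain greedy speculative decoding, not EAGLE 3.1's method and not vLLM's API: `toy_target`, `toy_draft`, and the other names are hypothetical, and the toy drafter is deliberately wrong at every fourth position so some proposals get rejected. A real server verifies all drafted tokens in one batched forward pass of the large model; the sequential calls here are only for readability.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextToken = Callable[[Sequence[Token]], Token]


def toy_target(ctx: Sequence[Token]) -> Token:
    """Hypothetical stand-in for the large model's greedy next token."""
    return (ctx[-1] * 3 + 1) % 11 if ctx else 0


def toy_draft(ctx: Sequence[Token]) -> Token:
    """Hypothetical drafter: agrees with the target except at every 4th position."""
    guess = toy_target(ctx)
    return (guess + 1) % 11 if len(ctx) % 4 == 3 else guess


def speculative_step(
    context: Sequence[Token], draft_next: NextToken, target_next: NextToken, k: int
) -> Tuple[List[Token], int]:
    """One round: draft k tokens, keep the prefix the target agrees with."""
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft_next(list(context) + proposal))

    accepted: List[Token] = []
    for tok in proposal:
        expected = target_next(list(context) + accepted)
        if tok != expected:
            # First mismatch: emit the target's own token and stop this round.
            accepted.append(expected)
            return accepted, len(accepted) - 1
        accepted.append(tok)

    # Every draft token matched, so the target adds one more token of its own.
    accepted.append(target_next(list(context) + accepted))
    return accepted, k


def generate(
    prompt: Sequence[Token], n_new: int, draft_next: NextToken,
    target_next: NextToken, k: int = 4,
) -> Tuple[List[Token], List[int]]:
    """Generate n_new tokens; also return the accepted draft length per round."""
    out = list(prompt)
    accepted_lengths: List[int] = []
    while len(out) - len(prompt) < n_new:
        new_tokens, n_accepted = speculative_step(out, draft_next, target_next, k)
        out.extend(new_tokens)
        accepted_lengths.append(n_accepted)
    return out[: len(prompt) + n_new], accepted_lengths


def greedy(prompt: Sequence[Token], n_new: int, target_next: NextToken) -> List[Token]:
    """Baseline: the target model decoding one token at a time."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


tokens, accepted_lengths = generate([1], 20, toy_draft, toy_target, k=4)
print("matches target-only decoding:", tokens == greedy([1], 20, toy_target))
print("mean accepted draft length:", sum(accepted_lengths) / len(accepted_lengths))
```

    The output always matches target-only greedy decoding, so drafter quality affects speed only. The acceptance length the release notes cite corresponds to how many drafted tokens survive verification per round, which is why a drafter that drifts as it speculates deeper costs throughput rather than correctness.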
  13.

    Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs

    Article fallingdowndizzyvr (r/LocalLLaMA) — Local-inference enthusiast sharing a llama.cpp change in the LocalLLaMA subreddit

    The changes are so small that I just put them into whatever the current version of llama.cpp is.

    www.reddit.com/r/LocalLLaMA/comments/1to00x… →
    Context
    Shows the open local-inference ecosystem at work: a rejected upstream patch still delivers a real hardware-specific speedup that users can self-apply without permission.
    Key points
    • A llama.cpp pull request by pedapudi gives Strix Halo (AMD) users up to 30% faster prompt processing for mixture-of-experts models.
    • The PR was rejected from mainline, so it won't ship in official llama.cpp builds.
    • The poster patches the small diff into their own build and shares it for others to do the same.
    • A snapshot of how the local-inference community routes around upstream maintainer decisions when a change helps specific hardware.
    Provenance
    Article · Supporting source