◆ Dispatch 006 · 2026-05-15

The Verification Pass

2026-05-15 / 00:14:24 / 9 sources

“A useful agent now has to build the harness that proves its answer can survive contact with the world.”
— Lenar Kess, today's narration

Sources

9 cited

1
Orthrus-Qwen3-8B: up to 7.8 tokens per forward on Qwen3-8B

Article Franck_Dernoncourt — Reddit poster disclosed co-authorship and linked the code, paper, and Hugging Face models.

Output distribution is provably identical to the base model.
www.reddit.com/r/LocalLLaMA/comments/1te5xp… →

Context
It makes verification part of the inference loop rather than an external reviewer added later.
Key points
Orthrus adds a trainable diffusion attention module to a frozen autoregressive transformer.
The diffusion head proposes 32 tokens in parallel and the autoregressive head verifies the longest accepted prefix.
The author reports up to 7.8 tokens per forward pass and roughly 6x wall-clock speed on MATH-500.
Limitations include Qwen-only evaluation, greedy plus rejection sampling only, and inherited limits from the frozen base model.
Provenance
Article · Supporting source
2
Bryan Catanzaro on Nemotron 3 Super and Ultra

X Bryan Catanzaro — NVIDIA researcher commenting on Nemotron training precision and scale.

Accelerated computing means we rethink every aspect of the AI stack.
x.com/ctnzr/status/2055393135971492034 →

Context
It puts efficiency pressure inside the training run, not only at inference time.
Key points
Nemotron 3 Super is described as 120 billion parameters and pretrained on 25 trillion tokens in NVFP4.
Nemotron 3 Ultra is described as roughly 500 billion parameters and also pretrained in NVFP4.
The post gives engineering direction, but not a full model card or evaluation package.
Provenance
Tweet · Primary source
3
Greg Brockman Officially Takes Control of OpenAI's Products in Latest Shakeup

Article WIRED — WIRED report surfaced through the r/OpenAI post in the packet.

execute with maximum focus toward the agentic future
www.wired.com/story/openai-reorg-greg-brock… →

Context
It makes Codex a central product primitive rather than a separate coding surface.
Key points
OpenAI told staff it is reorganizing product efforts under Greg Brockman.
The report says ChatGPT, Codex, and the developer API are being folded into one core product team.
Thibault Sottiaux is described as leading core product and platform across consumer, enterprise, and developer surfaces.
Provenance
Article · Supporting source
4
Combine Skills and MCP to Close the Context Gap

Video Pedro Rodrigues, Supabase — AI Engineer talk summarized in the packet.

security_invoker = true
www.youtube.com/watch?v=JT3OzDKrucU →

Context
It shows a concrete product rule that a general agent misses unless the workflow teaches it before action.
Key points
Supabase tested agents on Postgres row-level security tasks where views can bypass isolation without the right flag.
MCP plus skills improved completion compared with MCP-only runs.
The talk recommends keeping critical rules in the main skill file and enforcing opinionated workflows.
Provenance
Video · Supporting source
5
Self-hosted MCP server for public U.S. financial data

Article DanielAPO — Developer post on r/LocalLLaMA.

No cloud dependency, no API keys, no telemetry
www.reddit.com/r/LocalLLaMA/comments/1te2jk… →

Context
It turns local agents toward live data, which makes provenance and date discipline part of the product contract.
Key points
Equibles exposes SEC filings, institutional holdings, insider and congressional trades, short data, FRED indicators, and prices as MCP tools.
The post frames current financial data as a missing ingredient for local model agents.
The tool runs locally and avoids cloud telemetry.
Provenance
Article · Supporting source
6
Ruining Li introduces Articraft

Thread Ruining Li — Researcher announcing Articraft and Articraft-10K.

writes code, executes it, receives validation feedback
x.com/RayLi234/status/2055345165779562870 →

Context
It moves agentic coding into simulation-ready physical artifacts where validation has to check behavior, not just text.
Key points
Articraft generates articulated 3D assets with parts, joints, and motion.
The system uses code execution and validation feedback rather than one-shot asset generation.
Articraft-10K contains more than 10,000 articulated objects across 250 categories.
Provenance
Thread · Primary source
7
Agents Don't Do Standups: Building the Post-Engineer Engineering Org

Video Mike Spitz, PFF — AI Engineer talk summarized in the packet.

two engineers against a team of ten
www.youtube.com/watch?v=VMemhtlsoNk →

Context
It frames agent productivity as a verification workflow, not just a headcount replacement story.
Key points
The PFF case study used lightweight design documents, generated tickets and pull requests, trunk-based development, feature flags, agentic review, and QA agents.
The packet summary reports much higher deployment frequency and quality-score gains during the case study.
The workflow depends on system-literate senior engineers and task decomposition.
Provenance
Video · Supporting source
8
Harrison Chase on dependable repair for agent failures

X Harrison Chase — LangChain cofounder quoting a post about LangSmith Engine and auto-remediation.

Dependably for LLM agent failures
x.com/hwchase17/status/2055278799240241621 →

Context
It names the product direction where agent monitoring opens a path to proposed fixes.
Key points
The quoted post describes LangSmith Engine as a detector for agent failures.
It proposes auto-remediation with a human approval gate as the next layer.
The concept fits the repair-after-detection pattern across the episode.
Provenance
Tweet · Primary source
9
Tibo on GPT-5.5 performance reports

X Tibo — Codex team lead posting about user reports.

We don't have anything conclusive yet
x.com/thsottiaux/status/2055316274394300829 →

Context
It separates model size and launch energy from the measured user experience.
Key points
The Codex team is investigating reports that GPT-5.5 performs worse for some users.
The post says systems are healthy and the team has no conclusion yet.
The episode treats this as a caution against turning anecdotal performance into a product verdict.
Provenance
Tweet · Primary source

Transcript

00:00:00 liraen: Friday's thread holds together: the best items today aren't only larger models or new agents. They're about a second pass. Orthrus asks a model to propose many tokens, then verify them against the old model. Articraft asks a coding agent to write a 3D object, run it, and repair it. Supabase asks agents to use skills before they touch a database. Even OpenAI's product reorg, as reported by WIRED, says ChatGPT and Codex are being folded toward one agentic surface. So my question for the room is simple: when the first draft is cheap, who owns the check?

00:00:36 halek: The check has to be part of the system, not a reviewer added at the end because everyone feels nervous. Orthrus is a clean example because the claim is almost rude in its precision. The r/LocalLLaMA post says the backbone stays frozen, a diffusion attention module proposes thirty-two tokens in parallel, and the autoregressive head verifies the prefix. If that verification fails, the proposal gets cut down. You only get the speedup because the acceptance rule is tight enough to preserve the base model's distribution.
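
The propose-then-verify rule described above can be sketched in a few lines. This is a minimal illustration of greedy block verification only, not the Orthrus code: `toy_base_next` and `toy_drafter` are made-up stand-ins, the rejection-sampling variant the post mentions is not covered, and the loop calls the base model once per position for readability where a real verifier would score the whole block in one forward pass.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
BaseNext = Callable[[Sequence[Token]], Token]            # greedy choice of the frozen base model
Drafter = Callable[[Sequence[Token], int], List[Token]]  # hypothetical block proposer


def verify_block(context: Sequence[Token], draft: Sequence[Token],
                 base_next: BaseNext) -> List[Token]:
    """Keep the longest draft prefix the base model agrees with, then add one base-model token."""
    kept: List[Token] = []
    for proposed in draft:
        expected = base_next(list(context) + kept)
        if proposed != expected:
            kept.append(expected)  # cut the proposal here and take the base model's token
            return kept
        kept.append(proposed)
    kept.append(base_next(list(context) + kept))  # whole block accepted: one extra token
    return kept


def decode(prompt: Sequence[Token], n_tokens: int, drafter: Drafter,
           base_next: BaseNext, block: int = 32) -> Tuple[List[Token], int]:
    """Generate n_tokens tokens; also return how many verification rounds were needed."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(verify_block(out, drafter(out, block), base_next))
        rounds += 1
    return out[: len(prompt) + n_tokens], rounds


def greedy(prompt: Sequence[Token], n_tokens: int, base_next: BaseNext) -> List[Token]:
    """Plain one-token-at-a-time decoding, used as the reference output."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(base_next(out))
    return out


# Toy stand-ins (assumptions): a deterministic "model" and a drafter that errs periodically.
def toy_base_next(context: Sequence[Token]) -> Token:
    return (31 * sum(context) + len(context)) % 50


def toy_drafter(context: Sequence[Token], k: int) -> List[Token]:
    ctx, draft = list(context), []
    for _ in range(k):
        tok = toy_base_next(ctx)
        if len(ctx) % 7 == 6:
            tok = (tok + 1) % 50  # deliberate wrong guess
        draft.append(tok)
        ctx.append(tok)
    return draft
```

Calling `decode([1, 2, 3], 40, toy_drafter, toy_base_next)` returns the same tokens as `greedy` with the same arguments, in fewer rounds than tokens. In this greedy case the output matches the base model because every appended token is one the base model would have chosen itself.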

00:01:10 liraen: Right. The speed number gets attention: up to 7.8 tokens per forward pass on Qwen3-8B. The author's summary also reports roughly six times wall-clock speed on MATH-500. But the more durable claim is that Orthrus doesn't ask you to trust a second drafter model. It shares one key-value cache, trains about sixteen percent of the parameters, and keeps the old model as the judge.

00:01:36 halek: And it has limits, which makes me trust the writeup more. The author lists Qwen-only evaluation, greedy plus rejection sampling only, and a frozen model that still inherits its old biases and knowledge gaps. This won't become a universal inference engine. It is a narrower bargain: spend training compute on a head that predicts a block, then let the base model say how much of that block counts. I like that bargain because it gives you a place to test.

00:02:02 liraen: Does it change the local-model story from yesterday's multi-token prediction discussion, or is this still a cousin of the same idea?

00:02:09 halek: A cousin, but not the same household. Yesterday's llama.cpp item was about preserving multi-token prediction layers that already exist in the model family. Orthrus bolts a trained, verifier-friendly module onto a frozen model. For a team running local inference, that difference matters. One path waits for upstream model artifacts and runtime support. The other asks whether you can afford a twenty-four-hour, eight-H200 training job for the model you care about. Most people can't. A lab, a hosting company, or a community pool maybe can.

00:02:45 liraen: Bryan Catanzaro's post pulls the same question up to the frontier-training layer. He says Nemotron 3 Super is a 120 billion parameter model pretrained on twenty-five trillion tokens in NVFP4, and Nemotron 3 Ultra is roughly 500 billion parameters, also pretrained in NVFP4. His line is that accelerated computing makes them rethink every part of the AI stack for efficiency. How much should we take from a short post like that?

00:03:13 halek: Take the numbers seriously and keep the claims narrow. NVFP4 means NVIDIA is pushing four-bit training precision from a trick into a design point. If a 120 billion parameter model and a roughly 500 billion parameter model can be pretrained that way, the cost model changes inside the training run: memory bandwidth, activation storage, and how much hardware you need for a given experiment. But I wouldn't turn that into a general claim about quality without model cards, evals, and release details. Catanzaro's post gives us the shape of the engineering bet, not the full evidence package.
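
To put a rough scale on the four-bit point, here is back-of-envelope arithmetic for raw value storage only. It is a sketch under loud assumptions: it counts 4 or 16 bits per parameter and nothing else, so it ignores any scaling metadata a four-bit format carries, optimizer state, activations, and whatever higher-precision copies a training run keeps. The parameter counts are the ones quoted in the post; `raw_weight_gb` is a hypothetical helper.

```python
def raw_weight_gb(n_params: int, bits_per_value: int) -> float:
    """Raw storage for n_params values at the given bit width, in decimal gigabytes."""
    return n_params * bits_per_value / 8 / 1e9


# Parameter counts as described in the post; "~500B" is approximate there too.
for name, n_params in [("120B", 120_000_000_000), ("~500B", 500_000_000_000)]:
    at_16 = raw_weight_gb(n_params, 16)
    at_4 = raw_weight_gb(n_params, 4)
    print(f"{name}: {at_16:.0f} GB at 16 bits, {at_4:.0f} GB at 4 bits")
```

The fourfold gap in raw storage is the part that follows from arithmetic alone. How much of it survives in a real training run is exactly what the missing model cards and release details would have to show.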

00:03:51 liraen: That restraint matters today because two model-size claims are easy to overread. Elon Musk says xAI's internal Grok version nine is 1.5 trillion parameters and better than version eight before Cursor data gets added. Catanzaro gives the Nemotron numbers. Tibo from the Codex team says they are investigating reports that GPT-5.5 feels worse for some users, while systems are healthy and nothing is conclusive. Those three items together make the same point: the public number doesn't equal the experience.

00:04:22 halek: Exactly. A parameter count is an input to the story. It isn't the product. For Codex, the product is whether the agent keeps your intent across a repo, edits the right file, runs the check, and stops when the evidence is bad. If users say GPT-5.5 is worse for them and the team says it is investigating, the responsible line is plain: no conclusion yet. Measure the regression on the tasks people are complaining about, isolate routing from model behavior, and publish the fix when you know which one it was.

00:04:54 liraen: WIRED's OpenAI report gives that Codex concern a bigger organizational shape. The report says Greg Brockman will lead product strategy while continuing his infrastructure work, and that OpenAI is folding ChatGPT, Codex, and the developer API into one core product team. The memo phrase WIRED quotes is "execute with maximum focus toward the agentic future." I don't want to overplay a reorg, but this one points at a product decision: coding is no longer treated as a side surface.

00:05:24 halek: That changes operator planning. If Codex becomes one of the engines inside ChatGPT and enterprise workflows, a coding agent is no longer just an IDE companion. It becomes the task executor inside the product people already open. The hard question becomes permissions. Can the same agent read a chat, open a spreadsheet, change code, and call an API under one identity? If yes, the approval model and audit log have to be shared too.

00:05:52 liraen: And the person listening has heard us circle that before, especially with Anthropic's Agent SDK metering yesterday. I want the new angle here to be product integration, not pricing. If ChatGPT and Codex merge into one experience, the user won't think in categories like chat, code, and API. They will think, can this system finish the job, and can I see what it did?

00:06:14 halek: [tsk] I would tighten one word there. It isn't a job when it touches production code or customer records. It is a delegated operation. That wording matters because a job sounds reversible. A delegated operation needs a named scope, credentials it can use, logs, a dry run when risk is high, and a clear human approval point. The OpenAI report says Codex is increasingly powering consumer and enterprise offerings that can perform digital tasks for users. Fine. Then the product contract has to explain what a task is allowed to change before the user learns by damage.
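
The ingredients of a delegated operation listed above (named scope, a credential, logs, a dry run when risk is high, a human approval point) can be written down as a small data structure. This is an illustrative sketch only: `DelegatedOperation` and every field name are hypothetical, not any vendor's API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class DelegatedOperation:
    name: str
    scope: Set[str]                    # resources the agent is allowed to change
    credential: str                    # the one identity it acts under
    high_risk: bool = False
    dry_run_done: bool = False
    approved_by: Optional[str] = None  # name of the human who signed off, if any
    log: List[str] = field(default_factory=list)

    def may_change(self, resource: str) -> bool:
        """Decide whether a change is allowed and record the decision either way."""
        if resource not in self.scope:
            verdict, reason = False, "outside named scope"
        elif self.high_risk and not self.dry_run_done:
            verdict, reason = False, "dry run required first"
        elif self.high_risk and self.approved_by is None:
            verdict, reason = False, "human approval required"
        else:
            verdict, reason = True, "allowed"
        self.log.append(f"{self.credential} -> {resource}: {reason}")
        return verdict
```

The design choice worth noting is that refusals are logged as well as approvals, so the audit trail shows what the agent tried to touch and not only what it changed.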

00:06:50 liraen: That lands directly on the Supabase talk from AI Engineer. Pedro Rodrigues's example is wonderfully concrete: an agent working with Postgres can create a view over a table with row-level security enabled and silently bypass the isolation unless it knows to use security invoker. Supabase tested Claude Sonnet 4.6 with Model Context Protocol tools alone, then with tools plus Supabase skills, and the skill version handled the row-level security constraint correctly.

00:07:18 halek: Every tool vendor should be able to show this kind of eval. Skip the vibe demo. Give me a task where the mistake is specific and costly. The Supabase rule isn't obscure if you live in Postgres, but it is exactly the sort of detail a general agent will miss because the tool list says what can be called, not what the product considers safe. A skill file can put the non-skippable rule where the agent will read it before it writes SQL.
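
For readers who want to see the rule itself: in Postgres 15 and later a view can be created `WITH (security_invoker = true)`, which makes it check permissions and row-level security as the calling user instead of the view owner. The sketch below is a rough lint over migration text that flags views created without that option. It is an assumption-laden illustration, not Supabase's tooling: the table and view names are invented, and the regular expression is not a SQL parser.

```python
import re
from typing import List

# Rough lint: find CREATE VIEW statements and inspect the text between the
# view name and the AS keyword for the security_invoker option.
VIEW_RE = re.compile(
    r'create\s+(?:or\s+replace\s+)?view\s+([\w."]+)(.*?)\bas\b',
    re.IGNORECASE | re.DOTALL,
)
INVOKER_RE = re.compile(r"security_invoker\s*=\s*(?:true|on)\b", re.IGNORECASE)


def views_missing_security_invoker(sql: str) -> List[str]:
    """Return the names of views whose definition does not set security_invoker."""
    return [
        match.group(1)
        for match in VIEW_RE.finditer(sql)
        if not INVOKER_RE.search(match.group(2))
    ]


# Hypothetical migration: the first view would run with its owner's privileges.
MIGRATION = """
create view public.team_notes as
  select * from public.notes;

create view public.my_notes with (security_invoker = true) as
  select * from public.notes;
"""

flagged = views_missing_security_invoker(MIGRATION)
```

Here `flagged` holds only `public.team_notes`. A check like this is a backstop; the point made in the talk is that the agent should already know the rule before it writes the statement.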

00:07:45 liraen: The talk's four principles also fit the day: don't duplicate docs; expose docs through the filesystem when agents can navigate files; put critical guidance in the main skill file instead of a hidden reference; and enforce opinionated workflows. I like the last one because it admits that agents don't need more prose in every case. Sometimes they need a prescribed path: run the direct database operation in development or staging, run the advisor, then generate the migration.

00:08:13 halek: And the result isn't magic; it gives the agent less room to improvise where improvisation hurts. Equibles, the self-hosted financial-data Model Context Protocol server from the LocalLLaMA post, sits next to this. It gives a local model SEC filings, 13F holdings, insider and congressional trades, short data, FRED indicators, and prices. That is powerful only if the agent also knows which queries are read-only, which outputs are stale, and which conclusion requires a human to check the filing.

00:08:45 liraen: Give the model tools, current data, and a house style for using both. Otherwise the tool server just moves the hallucination closer to an official-looking number.

00:08:54 halek: Yes. And for finance, that difference is brutal. A local agent that can search filings can help. But if it mixes a delayed price, a stale 13F, and an insider form without saying which date each came from, it is worse than a spreadsheet. The server's privacy story is good: no cloud dependency, no API keys, no telemetry. The next proof is provenance in every answer.
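
"Provenance in every answer" can be as simple as refusing to print a number without its source and as-of date. The sketch below shows one way to do that. Everything in it is hypothetical: `DataPoint`, the single 30-day staleness threshold, and the placeholder values are illustrations, not output from Equibles or any real filing.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class DataPoint:
    label: str
    value: str
    source: str
    as_of: date  # the date the underlying data describes, not when it was fetched


def cite(points: List[DataPoint], today: date, max_age_days: int) -> List[str]:
    """Render each figure with its source and date, flagging anything older than the limit."""
    lines = []
    for point in points:
        age_days = (today - point.as_of).days
        flag = " [STALE]" if age_days > max_age_days else ""
        lines.append(
            f"{point.label}: {point.value} "
            f"({point.source}, as of {point.as_of.isoformat()}){flag}"
        )
    return lines


# Placeholder figures for a made-up ticker; none of these are real data.
points = [
    DataPoint("EXAMPLE close", "10.00", "price feed", date(2026, 5, 14)),
    DataPoint("EXAMPLE institutional holders", "12", "13F", date(2026, 3, 31)),
]
lines = cite(points, today=date(2026, 5, 15), max_age_days=30)
```

With these inputs the price line is printed as is and the 13F line carries the `[STALE]` flag. A real tool would likely need a different staleness limit per data type, since a quarterly filing and a daily price age on different clocks.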

00:09:19 liraen: Articraft moves the same loop into a more physical domain. Ruining Li's announcement says Articraft writes code, executes it, receives validation feedback, and refines articulated 3D assets with parts, joints, and motion. The team is also releasing Articraft-10K: more than ten thousand articulated objects across 250 categories. What does that do to your operator brain?

00:09:43 halek: It makes me ask what validation means. [chuckle] Sorry, predictable answer, but that is the whole problem. A mesh that looks right isn't enough if the object is meant for robotics simulation. You need joint constraints, part labels, collision behavior, and motion that a simulator can use. The interesting bit in the announcement isn't that an agent writes code. It is that the agent gets feedback from the generated object and iterates toward something simulation-ready.
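
The write, execute, validate, refine loop is easy to state in code even when the validators are hard to build. The sketch below shows the control flow only. It is not Articraft's pipeline: the asset schema, the two checks in `validate`, and `toy_generator` (which stands in for a code model that reads the feedback) are all invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

Asset = Dict[str, list]  # hypothetical schema: {"parts": [...], "joints": [...]}


def validate(asset: Asset) -> List[str]:
    """Return behavior-level problems; an empty list means the asset passed."""
    errors = []
    parts = set(asset["parts"])
    for joint in asset["joints"]:
        for end in ("parent", "child"):
            if joint[end] not in parts:
                errors.append(f"joint {joint['name']}: unknown {end} part {joint[end]!r}")
        if joint["lower"] >= joint["upper"]:
            errors.append(f"joint {joint['name']}: empty motion range")
    return errors


def refine(generate: Callable[[List[str]], Asset], max_rounds: int = 3) -> Tuple[Asset, int]:
    """Regenerate with validator feedback until the asset passes or rounds run out."""
    feedback: List[str] = []
    for round_no in range(1, max_rounds + 1):
        asset = generate(feedback)
        feedback = validate(asset)
        if not feedback:
            return asset, round_no
    raise RuntimeError(f"still failing after {max_rounds} rounds: {feedback}")


def toy_generator(feedback: List[str]) -> Asset:
    """Stand-in for a code model: the first draft has a hinge that cannot move."""
    hinge = {"name": "hinge", "parent": "frame", "child": "door", "lower": 0.0, "upper": 0.0}
    if any("empty motion range" in message for message in feedback):
        hinge["upper"] = 1.57  # widen the range after reading the validator's complaint
    return {"parts": ["frame", "door"], "joints": [hinge]}
```

`refine(toy_generator)` passes on the second round. Collision behavior and simulator playback, which the discussion above treats as the real test, would be further checks inside `validate`.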

00:10:10 liraen: There was a reply under that thread that put it crisply: instead of creating a bespoke model, you can surround current large language models with a 3D asset generation harness. The same reply guessed the cost is under a dollar fifty per asset generation, though I would treat that as a commenter's estimate rather than a published benchmark. Still, the framing is useful: the intelligence is partly in the loop around the model.

00:10:34 halek: That comment names the practical fork. You can train a specialized generator for articulated assets, or you can give a capable code model enough structure, validators, and retries. The second route is scrappier, but it gets better every time the base model gets better. It also inherits all the mess: nondeterminism, weird edge cases, and the need for tests that describe physical usefulness rather than visual charm.

00:10:58 liraen: Articraft connects back to PFF's AI Engineer talk about post-engineer engineering organizations. Their case study claims two senior engineers using agents shipped far more frequently than a ten-engineer team, with lightweight design documents, trunk-based development, feature flags, agentic review, and a QA agent checking acceptance criteria in staging. It is easy to reduce that talk to headcount drama, but the machinery is the verification chain around the agent.

00:11:26 halek: Yes, and I wouldn't make the headcount claim portable without the caveats. PFF had senior engineers, a known codebase, and a workflow where specs become tickets and pull requests. That is a better fit for agents than a vague product bet. The claim I buy is narrower: if the organization can express work as composable, checkable tasks, agents can make the cycle faster. If the organization can't express the work, the agent gives you more unchecked output.

00:11:55 liraen: Harrison Chase's short post gives the repair loop a product name without quite naming a product. He quoted the phrase "Dependably for LLM agent failures," and the quoted post described LangSmith Engine as the detector, with auto-remediation plus a human approval gate as the next layer. That is the same shape again: detect, propose a fix, ask a person to approve the change.

00:12:17 halek: The approval point isn't a courtesy. It is the boundary between monitoring and production mutation. If LangSmith Engine can notice that an agent run failed because a prompt, tool schema, or retrieval path broke, the tempting next step is to open a pull request. Good. But the fix needs evidence attached: the failing trace, the proposed change, the rerun result, and the affected workflows. Otherwise auto-remediation becomes another agent making confident edits in the dark.
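
The four pieces of evidence named above can be treated as a checklist that a proposed fix must satisfy before a human is even asked to approve it. The following is a hypothetical sketch of that checklist. `RemediationProposal` and its fields are invented for illustration and do not describe LangSmith Engine or any shipped product.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RemediationProposal:
    failing_trace_id: Optional[str] = None
    proposed_change: Optional[str] = None
    rerun_passed: Optional[bool] = None  # None means the rerun was never executed
    affected_workflows: List[str] = field(default_factory=list)

    def missing_evidence(self) -> List[str]:
        """List which of the four evidence items are absent."""
        missing = []
        if not self.failing_trace_id:
            missing.append("failing trace")
        if not self.proposed_change:
            missing.append("proposed change")
        if self.rerun_passed is None:
            missing.append("rerun result")
        if not self.affected_workflows:
            missing.append("affected workflows")
        return missing

    def ready_for_review(self) -> bool:
        """Only a complete bundle with a passing rerun goes to the human approver."""
        return not self.missing_evidence() and self.rerun_passed is True
```

Note that `ready_for_review` only decides whether the request reaches a person. The approval itself stays with the human, which is the boundary between monitoring and mutation drawn above.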

00:12:46 liraen: That also explains why today's items feel connected even though they live at different layers. Orthrus verifies token proposals. NVIDIA is testing the hardware and precision assumptions under training. OpenAI is reorganizing product around agents. Supabase is packaging product rules as skills. Articraft is using validation feedback to make simulation-ready objects. LangSmith is gesturing toward repair after an agent breaks. The common question is whether the second pass has enough evidence behind it.

00:13:17 halek: And whether the second pass is close enough to the work. A human approval gate that sees only a green check mark is theater. A verifier that only checks syntax is useful but limited. A QA agent in staging is better because it touches behavior. Orthrus's verifier is close to the token distribution. Supabase's skill is close to the database rule. Articraft's feedback is close to the object. The closer the check is to the actual claim, the less room there is for nonsense to pass as progress.

00:13:49 liraen: For this Friday, bigger models are still coming, faster inference is still coming, and product teams are moving agents into the main surface. But I trust the work where the agent has to show evidence at the point of action. Tomorrow is Saturday, so I would expect the feeds to get stranger and less polished. The test I am carrying into the weekend is simple: does the agent merely produce, or does it leave behind the proof that its output survived a real check?