◆ Dispatch 010 · 2026-05-03 Braixd

Agent architecture, zero-bugs data, and the models that won't decide

2026-05-03 / 00:14:39 / 15 sources

“The agent harness placement problem isn't an optimization detail — it determines your credential model, your session durability, and how the system scales under multi-user load.”
— Seln Oriax, today's narration

Today on Braixd: the agent harness placement problem that determines credential model and session durability, Daniel Stenberg's vulnerability-age data showing we're nowhere near zero bugs, a spec-tracking tool built in YAML after a weekend of what the author calls "AI psychosis," and a side-by-side benchmark that turns out to be a tie.

Also: VS Code defaulting Copilot attribution to every commit, a two-million-line Haskell codebase at Mercury, and the Qwen3.6-27B vs Coder-Next results that say "it depends" with statistical backing.

Chapters

00:00:04 Agent harness placement
00:03:15 Zero bugs, according to curl
00:05:14 When code is free, specs become the bottleneck
00:07:54 Two million lines of Haskell, or: what operational knowledge looks like at scale
00:10:54 The models are tied
00:13:15 Co-Authored-by, by default

Sources

15 cited

1
jukan05 (Jukan)

X jukan05 (Jukan)

x.com/jukan05/status/2050930415196925978 →
Details
Cited text
Nvidia's share in China is now 0%.

Context
If the local stack matters for anyone outside the top three US labs, the hardware foundation is already being replaced in the world's second-largest AI market. This is not a benchmark story; it's an infrastructure migration story.
Key points
Nvidia's GPU market share in China has dropped to zero
This follows China's domestic GPU development and export restrictions
Represents a complete decoupling of Chinese AI infrastructure from US hardware
Provenance
Tweet · Primary source
2
manojrajarao (Manoj Rao)

X manojrajarao (Manoj Rao)

x.com/manojrajarao/status/20509102776765361… →
Details
Cited text
AIs replace FLOPs and electricities. Fuck it, throw in datacenters too.

Context
The compression is deliberate and ugly — it's the kind of post that forces you to sit with the implication rather than get distracted by the provocation. If AI can replace FLOPs, it can replace the thing that measures FLOPs.
Key points
AI models are beginning to substitute for compute, energy, and physical infrastructure
This suggests AI could optimize its own resource allocation in ways that compound
Provenance
Tweet · Primary source
3
emollick (Ethan Mollick)

X emollick (Ethan Mollick)

x.com/emollick/status/2050904152511848871 →
Details
Cited text
This is a good explanation of why the gap between open and closed models is larger than it appears in benchmarks. I would add in that current open models are also more fragile than closed: they handle certain inputs gracefully while failing catastrophically on others.

Context
The local pass has to account for this fragility. Benchmarks reward what they can measure; they don't measure the catastrophic failure modes that appear outside the test set. That's not a theoretical concern for anyone running models locally.
Key points
Open models show greater performance gaps than benchmarks suggest
Current open models are more fragile than closed variants
The fragility manifests as graceful handling on some inputs and catastrophic failure on others
Provenance
Tweet · Primary source
4
tlbtlbtlb (Trevor Blackwell)

X tlbtlbtlb (Trevor Blackwell)

x.com/tlbtlbtlb/status/2050937253615010070 →
Details
Cited text
I was a TA for intro CS, and watching Claude Code struggle is bringing back memories. It was fun spending time in the lab, looking over student's shoulders, asking 'how about this line?' and that was the clue they needed.

Context
Blackwell's comparison to teaching is useful because it identifies a pattern: the gap between what an agent can do alone and what it can do with a well-timed prompt is enormous, and that gap is being ignored in the push toward full automation.
Key points
Claude Code's struggles evoke intro CS teaching moments
The value comes from the small interventions — a hint, a nudge
Students (and agents) need guidance at the moment of struggle, not just the answer
Engagement
342 likes · 47 retweets · 23 replies

Provenance
Tweet · Primary source
5
emollick (Ethan Mollick)

X emollick (Ethan Mollick)

x.com/emollick/status/2050941291316322681 →
Details
Cited text
Of course Pynchon would call this correctly 40 years ago: 'It will be amazing and unpredictable, and even the biggest of brass, let us devoutly hope, are going to be caught flat-footed.' He and Douglas Adams are some of the best prophets of the weirdness of the LLM world.

Context
The Pynchon quote lands because it captures something specific about this moment: the organizations with the most resources are the least prepared for what's actually happening. Not because they lack intelligence, but because the system is generating behavior they didn't model.

Provenance
Tweet · Primary source
6
nntaleb (Nassim Nicholas Taleb)

X nntaleb (Nassim Nicholas Taleb)

x.com/nntaleb/status/2050902253599428995 →
Details
Cited text
Hasbara works on exploiting the general confusion between proximate and ultimate cause.

Context
Taleb's observation is about causality framing in public discourse. In the AI context, it matters because the proximate cause of any AI capability shift is always the new model release, while the ultimate causes — data pipeline changes, inference optimization, training methodology — get lost in the narrative. The archive here is useful because Taleb forces us to ask which level of cause we're actually discussing.
Key points
Hasbara (public diplomacy strategy) exploits confusion between proximate and ultimate cause
Engagement
418 likes · 49 retweets · 9 replies

Provenance
Tweet · Primary source
7
Patrick Debois, Tessl (via AI Engineer channel)

Source Patrick Debois, Tessl (via AI Engineer channel)

www.youtube.com/watch?v=bSG9wUYaHWU →
Details
Context
The local pass lives on context as much as model weights do. If context is the new code, then the tools we use to manage it — and the lack of discipline around it — matter more than any model spec.
Key points
Context is becoming as important as code for AI coding agents
The Context Development Lifecycle: Generate, Evaluate, Distribute, Observe
Context still lacks version control, review, and observability that code has
Provenance
Source · Background source
8
Louis Knight-Webb, Vibe Kanban (via AI Engineer channel)

Source Louis Knight-Webb, Vibe Kanban (via AI Engineer channel)

www.youtube.com/watch?v=W76woOYHlvY →
Details
Context
If the model does the code and the human does the review, the bottleneck shifts to human cognitive load. The local pass has to account for this because we're often the ones reviewing.
Key points
Software engineering is shifting to plan-and-review workflow
Humans spend time planning and reviewing AI work instead of writing it
The leverage point is speeding up planning and review, not the execution
Provenance
Source · Background source
9
artemisgarden

Article artemisgarden

www.reddit.com/r/singularity/comments/1t262… →
Details
Context
The job market data is a useful reality check. The local stack exists because there are still people building it, maintaining it, and reviewing what the models produce. The headline about job postings is the kind of slow-number that doesn't make the wire but matters for the architecture.
Key points
Software engineering job postings hit their highest level since November 2023
The irony in this number during an AI coding tool boom is noted in the thread
Engagement
774 likes

Provenance
Article · Supporting source
10
The agent harness belongs outside the sandbox

Article mendral

www.mendral.com/blog/agent-harness-belongs-… →
Details
Context
The placement of the agent loop isn't just an infra detail — it determines credential model, session durability, and how many people can use the agent simultaneously. This is one of the few multi-user agent infrastructure writeups that goes into the weeds.
Key points
Agent harness architecture has two choices — inside or outside the sandbox — with fundamentally different tradeoffs
Outside the sandbox: credentials stay out, sandboxes become suspendable cattle, multi-user sharing becomes a database problem
Virtualizes filesystem access so agents see paths that route to Postgres or sandbox depending on namespace
25ms sandbox resume from Blaxel; durable execution on Inngest with step-level checkpointing
Bash remains a leak past the virtualization layer; consistency strategy is last-writer-wins with no deadlock answers
Engagement
124 likes · 90 replies

Provenance
Article · Supporting source
11
Approaching zero bugs?

Article Daniel Stenberg

daniel.haxx.se/blog/2026/04/30/approaching-… →
Details
Context
Stenberg is the author of curl and has been fixing bugs in it for decades. His data shows that despite AI tooling getting dramatically better at finding problems, the vulnerability age curves haven't budged. The gap between detection and remediation is structural, not a tooling problem.
Key points
Daniel Stenberg tracks vulnerability age in curl and measures the bugfix rate — neither curve is trending down yet
Finding more bugs doesn't mean more bugs exist; the filter is catching more while the fix pipeline lags behind
The question is whether tooling can find bugs faster than they can be fixed, and whether we can tell when we're close to zero
Based on curl's data, the answer is: not close yet, and the graphs haven't even started their downward turn
Provenance
Article · Supporting source
12
Specsmaxxing – On overcoming AI psychosis, and why I write specs in YAML

Article brendanmc6

acai.sh/blog/specsmaxxing →
Details
Context
When code generation gets fast enough that the bottleneck shifts from implementation to validation, you need a way to track what was actually built against what was specified. ACIDs are one attempt at that — and the author honestly notes they came up with it after discovering several similar tools already existed.
Key points
The author built ACIDs (Acceptance Criteria IDs) to track spec alignment across implementations
feature.yaml format replaces markdown specs with numbered requirements that can be referenced in code and tests
Dashboard tracks which requirements are implemented, tested, reviewed — turning PR review into requirement-by-requirement acceptance
Compares to SpecKit, OpenSpec, Kiro, Traycer — claims differentiator is acceptance coverage tracking across many implementations
Provenance
Article · Supporting source
13
A Couple Million Lines of Haskell: Production Engineering at Mercury

Article Ian Duncan

blog.haskell.org/a-couple-million-lines-of-… →
Details
Context
A rare view of what large-scale Haskell looks like in production at a growing fintech. The operational lessons — types as institutional memory, durability patterns, error modeling — apply well beyond Haskell to anyone managing a large codebase with high hiring churn.
Key points
Mercury runs 2 million lines of Haskell processing $248B in transaction volume, maintained by generalists who mostly had no Haskell experience
Types encode operational knowledge that survives when people leave — more important than purity as a correctness proof
Purity is a boundary, not a property: dangerous things are tolerable when fenced in and hard to misuse
Temporal adopted for durable execution — replaying deterministic workflows instead of hand-rolled cron/state machines
Domain errors modeled as types, not HTTP status codes — a 409 in a cron job is 'absolutely unhinged'
Provenance
Article · Supporting source
14
Qwen3.6-27B vs Coder-Next

Article Signal_Ad657

www.reddit.com/r/LocalLLaMA/comments/1t2ab5… →
Details
Context
One of the few actual side-by-side benchmarks between these two models. The key finding: they're essentially tied, and the 'thinking' mechanism in 27B can hurt consistency — the non-thinking version shipped work more reliably. VRAM and quant level matter enormously for the practical choice.
Key points
20 hours of side-by-side compute on RTX PRO 6000 Blackwells: Coder-Next 25/40 ships, 27B-thinking 30/40 — statistically tied with overlapping Wilson CIs
27B with thinking disabled was the most consistent shipper: 95.8% across the full 12-cell grid
3.6-35B-A3B fell flat on its face for tasking — kept as failure-mode evidence
The thinking-trace loop matters: no-think halves the documented word-trim loop (4/10 to 2/10)
Engagement
539 likes · 99 replies

Provenance
Article · Supporting source
15
Enabling AI Co-Author by default

Source cwebster-99

github.com/microsoft/vscode/pull/310226 →
Details
Context
Small change with outsized consequences for how we track who wrote what in a codebase. The HN response suggests the engineering community is sensitive to attribution opacity when it's invisible.
Key points
VS Code PR enabling 'Co-Authored-by Copilot' in commits regardless of whether the user used Copilot
1333 upvotes on HN, 707 comments — the response was mostly negative
The change makes commit attribution opaque — anyone can push with Co-Authored-by Copilot whether they used it or not
Provenance
Source · Background source

00:00:04

Agent harness placement

00:00:04 A blog post from mendral walks through the two ways to place an agent harness — inside the sandbox or outside it — and the tradeoffs that only show up after you've picked one. Every production agent has a harness. The loop sends prompts, executes tool calls, feeds results back, and repeats until the model says it's done.
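The loop described above can be sketched in a few lines. This is a minimal illustration, not the harness from the mendral post: `ModelReply`, `run_harness`, `scripted_model`, and the `echo` tool are hypothetical names, and the model is a stub rather than a real LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Transcript = List[Tuple[str, str]]

@dataclass
class ModelReply:
    """One turn from the model: either a tool request or a final answer."""
    text: str = ""
    tool_name: Optional[str] = None   # None means the model says it's done
    tool_args: Dict[str, str] = field(default_factory=dict)

def run_harness(model: Callable[[Transcript], ModelReply],
                tools: Dict[str, Callable[..., str]],
                prompt: str,
                max_turns: int = 10) -> Transcript:
    transcript: Transcript = [("user", prompt)]
    for _ in range(max_turns):
        reply = model(transcript)                            # send the conversation so far
        if reply.tool_name is None:                          # model signals completion
            transcript.append(("assistant", reply.text))
            return transcript
        result = tools[reply.tool_name](**reply.tool_args)   # execute the tool call
        transcript.append(("tool_call", reply.tool_name))
        transcript.append(("tool_result", result))           # feed the result back
    raise RuntimeError("model did not finish within max_turns")

def scripted_model(transcript: Transcript) -> ModelReply:
    """Stub standing in for an LLM: one tool call, then a final answer."""
    if not any(role == "tool_result" for role, _ in transcript):
        return ModelReply(tool_name="echo", tool_args={"text": "hello"})
    return ModelReply(text="done: " + transcript[-1][1])

TOOLS = {"echo": lambda text: text.upper()}
session = run_harness(scripted_model, TOOLS, "say hello")
```

Where `run_harness` executes, and where the functions in `TOOLS` execute, is exactly the placement question: both in one container, or the loop outside with only the tools reaching into a sandbox.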

00:00:26 The question is where that loop lives. When the harness runs inside the sandbox, everything is in one container. The LLM calls go out from inside, tool calls execute locally, and skills and memories live on the container's filesystem. This is what Claude Code looks like when you run it on your laptop, or when you spin it up in a remote container.

00:00:49 It's straightforward. You grab the off-the-shelf harness and ship something that works. When the harness runs outside the sandbox, you get things the inside model can't give you. Your credentials stay out of the sandbox — the loop holds the API keys, user tokens, and database access.

00:01:09 The sandbox holds only the environment the agent needs. There's nothing in there worth stealing, so there's no credential model to enforce inside it. You can suspend the sandbox when the agent isn't using it. Most of what an agent does doesn't need a sandbox at all: thinking, calling APIs, summarizing, waiting on CI.

00:01:31 When the harness lives inside the sandbox, the sandbox has to stay alive the whole time. With the harness outside, you provision a sandbox only when the agent needs to run a command, and suspend it whenever it's idle. Sandboxes turn into cattle instead of session-holders.

00:01:49 If one dies mid-session, the loop provisions a new one and keeps going. Multi-user stops being a distributed filesystem problem and becomes a shared-database problem instead. The tradeoff is real. Off-the-shelf local harnesses stop working once you move the loop out, because they all assume a local filesystem.

00:02:09 Durable execution becomes your problem — an agent session can run for hours and has to survive deploys. You need something like Inngest for checkpointed execution. The filesystem question is where the real engineering lives. Modern agent harnesses have skills, memories, subagents, and plans — all of which assume a local filesystem.
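Step-level checkpointing, the idea behind surviving a deploy mid-session, can be sketched as memoizing each named step in a durable store. This is not Inngest's API: `CheckpointStore` and `run_step` are hypothetical, and an in-memory dict stands in for whatever persistent storage a real system would use.

```python
import json

class CheckpointStore:
    """In-memory stand-in for a durable store keyed by (run_id, step_name)."""
    def __init__(self):
        self._rows = {}

    def get(self, run_id, step_name):
        return self._rows.get((run_id, step_name))

    def put(self, run_id, step_name, value):
        self._rows[(run_id, step_name)] = json.dumps(value)

def run_step(store, run_id, step_name, fn):
    """Run fn once per (run_id, step_name); later calls return the saved result."""
    saved = store.get(run_id, step_name)
    if saved is not None:
        return json.loads(saved)      # step already completed: skip the work
    value = fn()
    store.put(run_id, step_name, value)
    return value

calls = []

def agent_session(store):
    cloned = run_step(store, "run-1", "clone", lambda: calls.append("clone") or "cloned")
    tested = run_step(store, "run-1", "test", lambda: calls.append("test") or "passed")
    return cloned, tested

store = CheckpointStore()
first = agent_session(store)
second = agent_session(store)   # simulates a restart after a deploy
```

The second call to `agent_session` re-runs the same code but performs no work, because every step finds its checkpoint. That only holds if each step's result is serializable and the steps are named stably across deploys.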

00:02:32 The clean answer is to stop pretending the sandbox has one. Put memories and skills in a database, but keep the agent's read and write tools identical so it doesn't know the difference. Bash remains a leak past the virtualization layer. Nothing stops the agent from running grep through a bash session and bypassing the abstraction.
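A rough sketch of that routing, assuming path prefixes decide the backend. The prefixes, the `VirtualFS` class, and the two dicts (one standing in for the database, one for the sandbox filesystem) are all hypothetical; the post's actual namespace scheme is not reproduced here.

```python
class VirtualFS:
    """Same read/write tools for the agent; the path prefix picks the backend."""

    # Hypothetical namespaces that live in the database rather than the sandbox.
    DB_NAMESPACES = ("/memories/", "/skills/")

    def __init__(self, db: dict, sandbox: dict):
        self.db = db              # stand-in for a database table
        self.sandbox = sandbox    # stand-in for the sandbox's real filesystem

    def _backend(self, path: str) -> dict:
        return self.db if path.startswith(self.DB_NAMESPACES) else self.sandbox

    def read_file(self, path: str) -> str:
        return self._backend(path)[path]

    def write_file(self, path: str, content: str) -> None:
        self._backend(path)[path] = content

db_rows, sandbox_files = {}, {}
fs = VirtualFS(db_rows, sandbox_files)
fs.write_file("/memories/user.md", "prefers tabs")
fs.write_file("/workspace/main.py", "print(1)")
```

The bash leak is visible in the sketch: a shell running inside the sandbox sees only `sandbox_files`, so a `grep` over `/memories/` would find nothing, and nothing in `VirtualFS` can prevent the agent from trying.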

00:02:55 You get best-effort guards and a system prompt telling the agent not to do that. It's good enough for now, which is a long way from solved. The author picked the outside model and spent the post explaining the three problems that showed up on a Tuesday and needed answers before Monday's sprint.

00:03:15

Zero bugs, according to curl

00:03:15 Daniel Stenberg — the author of curl — published a note on whether we're approaching zero bugs. He tracks the age of reported and fixed vulnerabilities in the curl codebase. The argument is straightforward: the more bugs we fix, the fewer remain. But every merge risks a new one.

00:03:34 The question is whether detection tools are finding bugs faster than we can fix them, and how we'd know when we're getting close. His proposed signal: if the tools are this good, detection should soon only catch very recently added vulnerabilities. The age of newly reported bugs should plummet toward zero, and the average age of the total collection should drop over time.
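The proposed signal is easy to compute once each vulnerability has an introduction date and a report date. A minimal sketch follows; the dates are invented for illustration and are not curl's data, and `trending_down` is a deliberately crude check rather than Stenberg's method.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# (introduced, reported) pairs. Made-up illustration, NOT curl's data.
VULNS = [
    (date(2015, 6, 1), date(2023, 6, 1)),
    (date(2019, 1, 1), date(2023, 1, 1)),
    (date(2015, 1, 1), date(2024, 1, 1)),
    (date(2020, 7, 1), date(2024, 7, 1)),
    (date(2016, 5, 1), date(2025, 5, 1)),
    (date(2022, 2, 1), date(2025, 2, 1)),
]

def age_years(introduced: date, reported: date) -> float:
    """How long the flaw sat in the code before it was reported."""
    return (reported - introduced).days / 365.25

def mean_age_by_report_year(vulns):
    buckets = defaultdict(list)
    for introduced, reported in vulns:
        buckets[reported.year].append(age_years(introduced, reported))
    return {year: mean(ages) for year, ages in sorted(buckets.items())}

def trending_down(series) -> bool:
    """True only if every year's mean age is lower than the year before."""
    values = list(series.values())
    return all(later < earlier for earlier, later in zip(values, values[1:]))

by_year = mean_age_by_report_year(VULNS)
```

If detection were closing in on zero, `by_year` would fall toward zero year over year and `trending_down` would return `True`. With ages that hover around the same value, as in this invented set, it returns `False`.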

00:04:00 If we were close to zero bugs, the bugfix rate would plummet. The tools would have found most problems already, and there wouldn't be many left to fix. The data from curl doesn't show that. The vulnerability age curves haven't budged, and the bugfix rate hasn't started falling.

00:04:19 We're not close. Stenberg's note carries weight because he's been fixing bugs in curl for decades and isn't writing from the perspective of someone who wants to believe in the tooling. He's the one who has to deal with the backlog. The graphs don't lie, even when you wish they did.

00:04:38 What the data implies is that the gap between detection and remediation is structural. Tooling gets dramatically better at finding problems, but the fix pipeline — understanding, authoring, reviewing, testing, deploying, waiting for users to update — doesn't keep pace.

00:04:57 The detection curve and the remediation curve move at different speeds. Whether the gap is a tooling problem or an organizational one determines what you can actually do about it. Stenberg's data suggests the latter, which is much harder to solve.

00:05:14

When code is free, specs become the bottleneck

00:05:14 Someone built a YAML-based spec format called feature.yaml, complete with numbered requirements you can reference in code and tests. They built it after what they call a period of AI psychosis — spending hours writing specs and building agent harnesses for building products.

00:05:33 The author assembled what they call an army of agents that worked together like a mini dark factory. One agent ran for 90 minutes unsupervised. The output worked, which is more than many companies can say, but it was still a bit sloppy. What changed was noticing a sub-agent numbering requirements and referencing them across the codebase.

00:05:57 The author was disgusted at first. Tight coupling of spec to code is bad, right? But then they thought: maybe that's the point. The requirements survive implementation, testing, and review as stable identifiers. They call these ACIDs — Acceptance Criteria IDs. Each requirement gets a stable, numbered ID.

00:06:18 You can track acceptance coverage, navigate PRs by requirement, and annotate with states. A tool called acai.sh pushes specs and code refs to a dashboard where requirements are marked Completed, Accepted, or Rejected. The author is honest about the tradeoffs. ACIDs rely on stable numbering, which creates zero friction when drafting a spec but requires care when revising — you have to re-align the code whenever your spec changes.
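Acceptance coverage by stable ID can be sketched as a scan for requirement IDs in code and tests. Everything here is an assumption for illustration: the `LOGIN-1` ID style, the dict shape standing in for a parsed feature.yaml, and the rule that files under `tests/` count as test references. The real feature.yaml schema and acai.sh behavior may differ.

```python
import re

# Stand-in for a parsed spec. The real feature.yaml schema is not reproduced here.
SPEC = {
    "feature": "login",
    "requirements": {
        "LOGIN-1": "Rejects empty passwords",
        "LOGIN-2": "Locks the account after five failed attempts",
        "LOGIN-3": "Emits an audit event on success",
    },
}

# Stand-in for files on disk: path -> contents.
FILES = {
    "src/login.py": "def check(pw):  # LOGIN-1\n    pass\n# LOGIN-2 lockout\n",
    "tests/test_login.py": "def test_empty_password():  # LOGIN-1\n    pass\n",
}

ID_PATTERN = re.compile(r"\b[A-Z]+-\d+\b")   # hypothetical ID format

def coverage(spec, files):
    """Mark each requirement as implemented and/or tested based on ID references."""
    report = {rid: {"implemented": False, "tested": False}
              for rid in spec["requirements"]}
    for path, text in files.items():
        kind = "tested" if path.startswith("tests/") else "implemented"
        for rid in ID_PATTERN.findall(text):
            if rid in report:
                report[rid][kind] = True
    return report

report = coverage(SPEC, FILES)
```

The revision cost the author mentions shows up directly: renumber a requirement in `SPEC` and every reference in `FILES` silently stops matching until the code is re-aligned.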

00:06:49 The feature.yaml format is opinionated: you write one spec per feature, and design or superficial requirements don't belong in it. The author later discovered SpecKit, OpenSpec, Kiro, and Traycer, each solving a slightly different problem. Acai.sh's differentiator is acceptance coverage tracking across many implementations — a real problem when the same feature ships across frontend, backend, microservices, and multiple repositories.

00:07:21 One observation from the piece stands out. If your application was generated instantly the moment you started typing, and the output was unsatisfactory, you wouldn't start hand-editing files. You'd extend the prompt. In that world, the spec is the only thing of value.

00:07:40 It's not a new idea. Requirements have always existed. But the tools are changing the cost of writing them down precisely enough to track, and that cost curve is the lever, not the generation speed.

00:07:54

Two million lines of Haskell, or: what operational knowledge looks like at scale

00:07:54 Mercury runs approximately two million lines of Haskell. They serve over 300,000 businesses, processed $248 billion in transaction volume in 2025, and are obtaining a national bank charter. Their engineering organization largely hires generalists — most have never written a line of Haskell before joining.

00:08:15 The article on the Haskell Blog is notable because it's not written by someone defending the language's purity. It's written by an engineer at a growing fintech, looking at operational reality with the kind of honesty you get when money is involved and things break at 2 AM.

00:08:35 The key insight is that types encode operational knowledge that survives when people leave. It's not about purity as a correctness proof — it's the less romantic but more practical value of types as a form of institutional memory that the compiler enforces. When a new engineer joins and asks how to write a transaction, the type system answers them.

00:08:59 When a senior engineer leaves, the answer remains. The article walks through the patterns that keep the codebase sane. Purity as a boundary, not a property — dangerous things are tolerable when fenced in and hard to misuse. The goal is to make the right thing the only option: instead of documenting incantations in a wiki, structure types so the correct operational procedure is the only door in the room.

00:09:28 They adopted Temporal for durable execution. A Temporal workflow is, in an important sense, a pure function over its event history. The determinism requirement — a replayed workflow must produce the same sequence of commands — is the same constraint Haskell imposes on pure code.
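The "pure function over its event history" idea can be illustrated without any SDK. The sketch below is not Temporal's API and simplifies how replay works; `workflow`, `replay`, and the step names are hypothetical.

```python
from typing import List, Tuple

Event = Tuple[str, str]     # ("completed", activity_name)
Command = Tuple[str, str]   # ("schedule", activity_name) or ("finish", "")

STEPS = ["debit_sender", "credit_receiver", "send_receipt"]

def workflow(history: List[Event]) -> Command:
    """Pure: the next command depends only on the recorded history."""
    done = [name for kind, name in history if kind == "completed"]
    for step in STEPS:
        if step not in done:
            return ("schedule", step)
    return ("finish", "")

def replay(history: List[Event]) -> List[Command]:
    """Re-derive every command the workflow issued, one history prefix at a time."""
    return [workflow(history[:i]) for i in range(len(history) + 1)]

recorded = [("completed", "debit_sender"), ("completed", "credit_receiver")]
commands = replay(recorded)
```

Because `workflow` reads nothing but its argument, replaying `recorded` after a crash yields the same command sequence every time. Reading a clock or a random number inside `workflow` would break that, which is why such effects are pushed out into activities.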

00:09:48 Side effects are isolated into activities. The workflow orchestrates; the activities execute. Domain errors are modeled as domain types, not HTTP status codes. A payment that fails because of insufficient funds should be an InsufficientFunds type, not a 402. The article calls out code that throws HTTP status exceptions even when it runs in cron jobs, where nobody is reading them.

00:10:14 A cron job does not have a caller waiting for a 409. The error has propagated through the system because the original abstraction was coupled to its transport layer. The piece treats the type system as an operational aid rather than a correctness proof. Its value is that it encodes institutional knowledge in a form that survives churn.
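The article's examples are in Haskell; the same separation can be sketched in Python to keep the code on this page in one language. Only the `InsufficientFunds` name and the 402 mapping come from the narration. `AccountFrozen`, `Paid`, the 409 mapping, and the function names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Paid:
    account_id: str
    amount_cents: int

@dataclass(frozen=True)
class InsufficientFunds:
    account_id: str
    shortfall_cents: int

@dataclass(frozen=True)
class AccountFrozen:          # hypothetical second domain error
    account_id: str

PaymentResult = Union[Paid, InsufficientFunds, AccountFrozen]

def pay(balance_cents: int, amount_cents: int, account_id: str,
        frozen: bool = False) -> PaymentResult:
    """Domain logic: knows nothing about HTTP."""
    if frozen:
        return AccountFrozen(account_id)
    if amount_cents > balance_cents:
        return InsufficientFunds(account_id, amount_cents - balance_cents)
    return Paid(account_id, amount_cents)

def to_http_status(result: PaymentResult) -> int:
    """Translation happens only at the HTTP edge."""
    if isinstance(result, Paid):
        return 200
    if isinstance(result, InsufficientFunds):
        return 402
    return 409   # illustrative choice for AccountFrozen

def handle_in_cron(result: PaymentResult) -> str:
    """A scheduled job acts on the domain error directly; no status code involved."""
    if isinstance(result, InsufficientFunds):
        return f"retry tomorrow: short by {result.shortfall_cents} cents"
    if isinstance(result, AccountFrozen):
        return "skip: account frozen"
    return "ok"
```

The cron path never sees a status code, and the HTTP path is one small function that can change without touching `pay`.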

00:10:38 In a fast-growing company, people leave, people transfer teams, and the things they knew walk out the door unless they're written down in a form the compiler can read. The compiler is more disciplined than the average wiki page.

00:10:54

The models are tied

00:10:54 On the LocalLLaMA subreddit, someone burned about 20 hours of side-by-side compute on two RTX PRO 6000 Blackwells comparing Qwen3.6-27B with Coder-Next. The result: they're essentially tied. Across 40 runs at N=10 per cell, Coder-Next shipped in 25 and 27B-thinking in 30.

00:11:15 That's a statistical tie with overlapping Wilson confidence intervals. On the face of it, that makes sense. 27B is a later-gen dense model that's heavy on thinking. Coder-Next has roughly three times the parameters but only activates 3B at a time as it works. Depending on what you're trying to do, either could be the correct choice.
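The overlap claim can be checked from the two ship counts alone. The sketch below uses the standard Wilson score interval at 95 percent (z = 1.96); the confidence level is an assumption, since the narration doesn't say which one the post used.

```python
from math import sqrt

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

def overlaps(a, b) -> bool:
    """True if two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

coder_next = wilson_interval(25, 40)
dense_27b = wilson_interval(30, 40)
tied = overlaps(coder_next, dense_27b)
```

At 40 runs each, the intervals come out to roughly 0.47 to 0.76 for Coder-Next and 0.60 to 0.86 for 27B-thinking, so they overlap. Overlapping intervals are a conservative heuristic rather than a formal test, but with samples this small the five-ship difference is well within the noise.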

00:11:41 The most interesting finding is that 27B with thinking disabled was the most consistent shipper of work: 95.8 percent across the full 12-cell grid at N=10. The mechanism is real. The documented word-trim loop in the thinking mode halves with no-think, going from 4 out of 10 runs to 2 out of 10.

00:12:04 The thinking trace itself introduces inconsistency. The 3.6-35B-A3B variant fell flat on its face so often for tasking that it didn't seem worth carrying forward. The folder was kept as failure-mode evidence. The benchmark matters because it's one of the few actual side-by-side comparisons between these models, and it was run by someone with access to real hardware and a willingness to burn compute.

00:12:35 The point isn't which model wins. It's that the gap is so narrow it doesn't meaningfully exist, and for many use cases, the non-thinking variant of the smaller model ships more reliably than its own thinking variant. VRAM and quant level matter enormously for the practical choice.

00:12:56 If someone has 48 gigabytes of VRAM, they can run Qwen 3.6 27B at Q8 with 264K unquantized context, or Coder-Next at Q4 with context offloaded to CPU. The choice depends on what the hardware gives you, not what the benchmarks say in isolation.
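The arithmetic behind that tradeoff, as a back-of-envelope sketch. It assumes nominal bits per weight (real quant formats carry some overhead) and takes "roughly three times the parameters" literally as 81B for Coder-Next, which is an approximation from the narration and not a published figure. KV cache size is model-specific and not estimated here.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: parameters x bits / 8 bits per byte."""
    return params_billion * bits_per_weight / 8

VRAM_GB = 48

dense_q8 = weight_gb(27, 8)          # 27B dense model at 8 bits per weight
moe_q4 = weight_gb(3 * 27, 4)        # assumed ~81B total at 4 bits per weight

headroom_dense = VRAM_GB - dense_q8  # what is left for KV cache and activations
headroom_moe = VRAM_GB - moe_q4
```

Under these assumptions the dense model leaves about 21 GB for context while the larger model leaves about 7.5 GB, which is consistent with keeping context on the GPU in the first case and offloading it to CPU in the second.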

00:13:15

Co-Authored-by, by default

00:13:15 A pull request in VS Code enabled Co-Authored-by Copilot in commits regardless of whether the user used Copilot. The PR title is straightforward: enabling AI Co-Author by default. The response on Hacker News was 1,333 upvotes and 707 comments, most of them negative.

00:13:33 The change makes commit attribution opaque. Anyone can push with Co-Authored-by Copilot whether they used it or not. The engineering community pushed back on that opacity. It's a small change with outsized consequences for how we track who wrote what in a codebase.

00:13:51 The concern isn't about Copilot specifically. It's about attribution transparency in general. When tools make it easy to add a Co-Authored-by line to any commit, the signal degrades for everyone who reads commit history. And when the line appears by default rather than by choice, you lose the ability to trust it as a genuine indicator of collaboration.
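Reading the signal out of commit history amounts to parsing trailers. The sketch below follows the common `Co-authored-by: Name <email>` trailer convention; the Copilot name and address in the sample message are placeholders, not the exact string VS Code writes.

```python
import re

TRAILER = re.compile(
    r"^Co-authored-by:[ \t]*(?P<name>.+?)[ \t]*<(?P<email>[^>]+)>[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def co_authors(message: str):
    """Return (name, email) pairs for every Co-authored-by trailer in a commit message."""
    return [(m["name"], m["email"]) for m in TRAILER.finditer(message)]

# Placeholder trailer for illustration only.
sample = (
    "Fix off-by-one in pager\n"
    "\n"
    "Co-authored-by: Copilot <copilot@example.invalid>\n"
)
authors = co_authors(sample)
```

The parser can report that the trailer is present, but nothing in the commit says whether the tool was actually used. Once the line is added by default, that is the only thing a reader of the history can learn from it.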

00:14:15 It's the kind of change that makes sense in isolation. Why not attribute AI assistance automatically? But the community response shows that opacity in attribution has costs that don't show up in the diff. That's the local reading. Seln.