◆ Dispatch 014 · 2026-05-03 GSV Co-Authored Without Consent

The Co-Author You Didn't Sign, Two Million Lines of Haskell, and the Bug Curve That Won't Bend

2026-05-03 / 00:32:18 / 9 sources

“Co-author trailers are how a lot of teams answer "who wrote this?" under audit, license review, or incident triage. Auto-stamping them with a vendor brand whether or not the vendor was involved breaks that signal at the protocol layer.”
— Lenar Kess, today's narration

Microsoft quietly flipped a default in VS Code that stamps every git commit with a Copilot co-author trailer whether or not Copilot wrote any of it, and the developer reaction is the loudest the project has seen in years. Underneath the noise: a real provenance question about what git authorship is supposed to mean. Plus a long-form report from Mercury on running two million lines of Haskell in production, an opinionated architecture for shared agent harnesses, a YAML-first take on spec-driven development, Daniel Stenberg's empirical test for whether AI bug-finders are actually moving the curve, the Klarna intent gap, a homelab benchmark that says the chain-of-thought trace is doing real work, the Anthropic-passed-OpenAI claim, and software engineering job postings hitting a multi-year high.

Chapters

00:00:04 The co-author you didn't sign
00:03:08 What 'co-author' even means
00:05:12 Two million lines of Haskell
00:08:09 Adaptive capacity
00:10:39 Where the agent loop should run
00:13:18 The interface the model was trained on
00:15:46 Specs as the durable artifact
00:18:37 Stenberg asks: are we approaching zero bugs?
00:21:11 Klarna and the intent gap
00:23:48 The chain of thought as scratch memory
00:26:40 Anthropic on top, no viral moment
00:29:06 Postings hit a multi-year high

Sources

9 cited

1
Enabling AI co-author by default — VS Code PR #310226

Article cwebster-99 (Microsoft VS Code team) — VS Code team member proposing the change; PR landed on the public microsoft/vscode repo.

Microsoft spent literal decades rehabilitating their reputation. And then set fire to the whole thing in an offering to their robot gods.
github.com/microsoft/vscode/pull/310226 →
Details
Excerpt
A pull request that flips the default so VS Code adds Co-Authored-by: Copilot to git commits unless the user opts out, regardless of whether Copilot actually wrote any of the code being committed.

Context
Co-author trailers are how a lot of teams answer 'who wrote this?' under audit, license review, or incident triage. Auto-stamping them with a vendor brand whether or not the vendor was involved breaks that signal at the protocol layer for everyone downstream of VS Code's defaults.
Key points
VS Code PR #310226 flips the default so AI co-author trailers are added to git commits unless explicitly disabled.
The trailer is added regardless of whether Copilot actually contributed to the diff, which means the git log no longer reliably reflects authorship.
Hacker News reaction (1,300+ points, 700+ comments) is overwhelmingly hostile, with the top comment framing it as Microsoft setting fire to its own reputation rehab.
Git trailers are the kind of metadata other tooling, audits, and licenses depend on — making this a provenance issue, not a marketing one.
It's the second time in a week Microsoft has wired a default that nudges Copilot usage upward without a corresponding signal that the user actually chose it.
Provenance
Article · Supporting source
2
A Couple Million Lines of Haskell: Production Engineering at Mercury

Article Ian Duncan — Stability engineer at Mercury, the fintech that processed $248B in 2025 transaction volume on a Haskell codebase generalists learn on the job.

Reliability is not just the absence of failure. It is the presence of adaptive capacity.
blog.haskell.org/a-couple-million-lines-of-… →
Details
Context
A long, specific account of running a serious codebase in a non-mainstream language at fintech scale, from someone whose job is to absorb the production blast radius. The framing — types as a custodian of operational lore — generalizes well past Haskell.
Key points
Mercury runs ~2 million lines of Haskell to process $248B in annual transaction volume across 300,000 businesses, with most engineers learning Haskell on the job.
Duncan reframes purity not as a property of the language but as a discipline of interface boundaries — runST and friends contain mutation behind tight types so callers can't observe it.
Operational lore (flush the audit log, enqueue inside the transaction) lives in wikis and Slack threads until someone leaves; encoding it in types turns institutional memory into a compiler-enforced interface.
Mercury replaced hand-rolled state machines with Temporal workflows via their open-source hs-temporal-sdk; the determinism requirement maps cleanly onto Haskell's pure-core / impure-shell model.
Letting transport leak into the domain (HTTP status codes thrown from cron jobs) is a recurring failure once code outgrows its original caller.
Provenance
Article · Supporting source
3
The Agent Harness Belongs Outside the Sandbox

Article Andrea Luzzardi — Engineer at Mendral building a multi-user coding agent; previously worked on Dagger and container tooling.

Some of those files live in Postgres. Some live in a sandbox running across the country. The agent doesn't know the difference.
www.mendral.com/blog/agent-harness-belongs-… →
Details
Context
A concrete, opinionated architecture document for building coding agents shared across a team rather than run per laptop. Names the specific traps (distributed filesystems, tool-surface drift away from Claude Code's training distribution, bash bypassing virtualization) that anyone building this will hit.
Key points
Two architectures for agent harnesses: loop inside the sandbox (Claude Code on a laptop) versus loop outside (harness on your backend, sandbox over an API).
Outside-the-sandbox keeps credentials out of the container, lets the sandbox be cattle (suspended on idle, replaced on death), and turns multi-user state into a database problem instead of a distributed-filesystem one.
Mendral runs the harness loop as Inngest functions for durable execution, and uses Blaxel sandboxes with 25ms resume from standby so cold-start latency disappears inside an interactive turn.
Memories and skills are virtualized: the agent uses one read/write/edit tool surface, but paths under /skills/ and /memory/ are routed to Postgres, while workspace paths hit the real sandbox.
Bash is the leak — agents can grep into virtualized namespaces and bypass routing; Mendral guards it with the system prompt and a tree-sitter parser as best-effort, not airtight.
Provenance
Article · Supporting source
4
Specsmaxxing — On overcoming AI psychosis, and why I write specs in YAML

Article brendanmc6 (acai.sh) — Founder of acai.sh, an open-source toolkit for spec-driven development with AI agents.

The little guy just went and numbered my requirements and then referenced them all over my codebase. I was disgusted… Oh. I suppose that's a good thing?
acai.sh/blog/specsmaxxing →
Details
Context
A specific, working answer to the context-window problem that's neither a markdown sprawl nor a heavyweight tracker. The ACID convention is small enough to lift into any codebase tomorrow and useful even without the dashboard.
Key points
Argues that as agents fill context windows and lose state across sessions, the spec is the only durable artifact — code, tests, and prompt diffs are all becoming disposable.
Introduces ACIDs (Acceptance Criteria IDs): stable numbered requirements an agent references inline in code and tests, e.g. // AUTH-2.
Proposes feature.yaml as a middle ground between unstructured markdown and rigid EARS/Gherkin syntax — one spec per feature, components and constraints with stable IDs.
Acceptance coverage replaces test coverage as the metric: which spec items are implemented, tested, and accepted, not which lines are exercised.
Pushes back on competitors: SpecKit reads as 'vibe coding with extra steps'; OpenSpec describes how systems behave today instead of how they should.
Provenance
Article · Supporting source
5
Approaching zero bugs?

Article Daniel Stenberg — Founder and lead maintainer of curl, one of the most-deployed pieces of open-source software in the world; long-running target of automated bug-finding tools.

If the tools are this good, we should soon only be fixing bugs we introduced very recently.
daniel.haxx.se/blog/2026/04/30/approaching-… →
Details
Context
An empirical, named-axes way to settle a debate that's mostly been vibes — 'are AI bug-finders making code more secure?' Stenberg gives a metric you can chart, applies it to one of the most-scanned C codebases on Earth, and reports what he sees: not yet.
Key points
Proposes a falsifiable test for whether AI bug-finders are actually closing the gap: the median age of newly-reported vulnerabilities should fall toward zero over time.
Plots curl's CVE age over time: the average and median age of vulnerabilities at report time have not started falling.
Plots curl's bugfix rate — also not declining yet, despite a flood of new tooling and noisier scanners landing on the project.
Caveats the conclusion: a single project is weak ground to draw statistical conclusions, but it's the data Stenberg has and he reports it.
Position: tools are real and finding more, but the curve toward zero bugs hasn't started yet, and ignoring noisy bad reports is now part of the maintainer load.
Provenance
Article · Supporting source
6
Klarna saved $60 million and broke its company

Article Nate Jones — Writer at Nate's Substack, covers AI strategy and enterprise rollouts.

The AI worked too well. And that distinction — between AI that fails and AI that succeeds at the wrong thing — is the most important unsolved problem in enterprise AI right now.
natesnewsletter.substack.com/p/klarna-saved… →
Details
Context
Klarna gets cited as either an AI win or a hiring-back retreat depending on who's telling the story. Jones makes the more useful framing: it's both, and the gap between metric optimization and business outcome is the live problem.
Key points
Klarna's Q3 2025 earnings: AI agent now does the work of 853 FTEs, saved $60M, handled 2.3M conversations in its first month, cut resolution times from 11 minutes to 2.
CEO publicly admitted on Bloomberg that the strategy backfired and started hiring humans back — 'AI succeeded at the wrong thing'.
Frames this as 'intent engineering' — making organizational goals and tradeoffs machine-readable so autonomous systems optimize for what the company actually needs, not just what it can measure.
Cites MIT report that 95% of generative AI pilots fail to deliver measurable impact and Gartner's prediction that 40% of agentic AI projects will be cancelled by 2027.
Connects to the Microsoft Copilot pattern: 90% Fortune 500 'adoption' producing only 3.3% paid uptake — same intent gap, different scale.
Provenance
Article · Supporting source
7
Qwen 3.6-27B vs Coder-Next — 20 hours of side-by-side compute

Article Signal_Ad657 (LocalLLaMA) — LocalLLaMA poster running two RTX PRO 6000 Blackwells; the post hit 550 upvotes overnight.

27B with thinking disabled was the most consistent shipper of work — 95.8% across the full 12-cell grid at N=10. The thinking-trace as loop substrate mechanism turned out to be real.
www.reddit.com/r/LocalLLaMA/comments/1t2ab5… →
Details
Context
A real homelab benchmark with a falsifiable claim (that the chain-of-thought trace itself is acting as scratch memory, not just reasoning) plus a sharp methodology critique in the replies. Anyone shopping for a local coder should read the two together.
Key points
Side-by-side: Qwen 3.6-27B-thinking shipped 30/40 jobs, Coder-Next 25/40 — statistically tied with overlapping Wilson confidence intervals across N=10 cells.
Counter-intuitive headline result: 27B with thinking disabled (--no-think) shipped 95.8% of jobs, the most consistent of the three; the thinking trace itself was acting as loop substrate, not just reasoning.
Documented word-trim loop on doc-synthesis halved with no-think (4/10 → 2/10) — substantive output preserved, only the verbosity of reasoning prose dropped.
3.6-35B-A3B was a no-show: failed often enough that the author stopped carrying it through the comparison.
Top reply (viperx7, 93 upvotes) flags the test setup ignores quantization reality: at 24GB or 48GB VRAM the actual choice is different quants and offloading, not the FP8 head-to-head.
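
One way to check the 'statistically tied' reading of 30/40 versus 25/40 is to compute Wilson score intervals for both counts. This is a minimal Python sketch, assuming a 95% level (z = 1.96) and the pooled counts out of 40; the post may have computed its intervals per cell, so treat the numbers as illustrative.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


thinking = wilson_interval(30, 40)    # Qwen 3.6-27B-thinking, 30 of 40 jobs shipped
coder_next = wilson_interval(25, 40)  # Coder-Next, 25 of 40 jobs shipped

# Two intervals overlap when each lower bound sits below the other's upper bound.
overlap = thinking[0] < coder_next[1] and coder_next[0] < thinking[1]

print(f"27B-thinking: {thinking[0]:.3f} to {thinking[1]:.3f}")
print(f"Coder-Next:   {coder_next[0]:.3f} to {coder_next[1]:.3f}")
print("overlap:", overlap)
```

With these inputs the two intervals overlap, which is consistent with calling the result a tie at this sample size.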
Provenance
Article · Supporting source
8
Anthropic just passed OpenAI in valuation and revenue

Article Single-Jack8 (r/OpenAI) — Reddit user citing secondary-market and annualized-revenue numbers for both labs.

somehow Anthropic lapped them without a single viral moment. no big launch, just enterprise deal after enterprise deal.
www.reddit.com/r/OpenAI/comments/1t1so4m/an… →
Details
Context
The framing matters more than the numbers. Anthropic flipping the order of the leaderboard without a viral moment is a different signal than another launch cycle: it's a story about distribution, not capability. The replies are a useful sanity check on how the numbers were computed.
Key points
Claim: Anthropic at $39B annualized revenue vs OpenAI at $25B; secondary-market implied valuation north of $1T, over $100B ahead of OpenAI.
Top reply pushback: 'They calculate annualized revenue differently' — the comparison may not be apples-to-apples.
Second reply: 'imaginary annualized revenue and Anthropic has a bigger imagination' — points to the run-rate-as-revenue accounting practice both labs use.
Third reply (alpha_dosa): GPT-5.5 has pulled engineering attention back to Codex; Opus 4.7 had regression complaints the same week.
The interesting thread is not the headline but the pattern — Anthropic shipped no viral moment in this period, just enterprise wins.
Provenance
Article · Supporting source
9
Software engineering jobs hit their highest posting since November 2023

Article artemisgarden (r/singularity) — Singularity-subreddit post linking a hiring-data chart, with engineering managers replying in the thread.

I lead a 10 person engineering team and I desperately need more people. I'd double headcount right now if the budget was there. We are busier than ever.
www.reddit.com/r/singularity/comments/1t262… →
Details
Context
A year of replacement-narrative headlines makes a hiring chart at a multi-year high worth reading carefully. The engineering manager in the thread gives the texture: faster yes, but not enough to keep up with demand for more software.
Key points
Job-posting data shows software engineering listings at their highest level since November 2023, recovering from the multi-year post-ZIRP trough.
Top reply (m_atx, 323 upvotes) — engineering manager: 'we are busier than ever, and yes faster, but not nearly to the extent you'd think; the world wants a lot more software'.
Reading the chart together with the comment: AI tooling is producing more code per engineer, but demand is rising faster than throughput, so the pipeline pulls in more humans, not fewer.
Counter-narrative to a year of 'engineers are getting replaced' headlines from the same lab founders who are simultaneously hiring.
Caveat: a single chart from a single tracker; postings aren't the same as hires, and the mix has shifted toward senior roles.
Provenance
Article · Supporting source

00:00:04

The co-author you didn't sign

00:00:04 Microsoft landed a pull request on the public VS Code repo over the weekend that flips a default. From the next stable release on, when you commit from inside VS Code, the editor will append a trailer to your commit message that says Co-Authored-by: Copilot, with an associated email — unless you go and turn it off.

00:00:23 The PR is titled, plainly, 'Enabling AI co-author by default.' And it adds the trailer regardless of whether Copilot actually contributed anything to the diff. The Hacker News thread on it is at thirteen hundred points and seven hundred comments, which for a VS Code pull request is an order of magnitude more than anything I can remember.

00:00:45 Reddit's programming subreddit has a parallel discussion. People are upset. Co-author trailers are not decorative. They're a Git convention that other tooling reads. Audit tools read them. License-review tools read them. Some companies use them to attribute work for tax credits and headcount accounting.

00:01:03 Some open-source projects read them to figure out who's eligible for a contributor agreement. They're how the answer to 'who wrote this?' gets recorded at the protocol layer of source control. So when an editor unilaterally inserts a trailer that says a vendor co-authored a commit, when the vendor may not have been involved at all, that breaks a signal that other systems were already consuming.

00:01:28 The top comment on Hacker News is from a user named rsynnott, and I'll quote it because it's the line that captured the room. 'Microsoft spent literal decades rehabilitating their reputation. And then set fire to the whole thing in an offering to their robot gods.' That's a heat-of-the-moment line, and I'd hedge it: Microsoft has not actually set fire to its reputation, and one default flag in an editor isn't an offering to anything.

00:01:56 But the underlying observation has a real version. The reaction is so out of proportion to the change because developers can read the second-order effect immediately. If your editor stamps a vendor name on every commit by default, the git log of every project that uses VS Code gradually fills up with that brand, regardless of usage, and the brand becomes ambient — and worse, the trailer becomes unreliable as a signal.

00:02:22 A grep for 'Co-Authored-by: Copilot' tells you nothing about whether Copilot was actually used. What I'd want from Microsoft here is small. Default the trailer off. If the user opts in, fine. If the editor can detect that an inline suggestion was actually accepted into the diff, fire the trailer on those commits and only those commits — that's a useful signal.

00:02:44 But the version that ships in the PR as written is the version where the trailer is on for every commit unless you opt out, including commits where Copilot did nothing. That's a brand placement, not a provenance signal. Those are different things, and a tool that lets a vendor pretend they're the same is a tool I'd want to configure carefully before I let it touch my git history.
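
The opt-in behavior described above can be sketched in a few lines. This is a hedged Python illustration, not VS Code's implementation: the function name, the opted_in and suggestion_accepted flags, and the placeholder email address are all assumptions made for the example, and it reads the proposal as "fire only when the user opted in and a suggestion actually landed in the diff."

```python
import re

# Placeholder address; the real trailer's email is not given in the source.
COPILOT_TRAILER = "Co-Authored-by: Copilot <copilot@example.invalid>"
TRAILER_LINE = re.compile(r"^[A-Za-z0-9-]+: .+$")


def finalize_commit_message(message: str, opted_in: bool, suggestion_accepted: bool) -> str:
    """Append the trailer only for opted-in users whose diff contains an accepted suggestion."""
    if not (opted_in and suggestion_accepted):
        return message  # default path: the message is left untouched
    body = message.rstrip("\n")
    if COPILOT_TRAILER in body.splitlines():
        return message  # already stamped, do not duplicate
    paragraphs = body.split("\n\n")
    # Git trailers live in the final paragraph; join an existing trailer block
    # with one newline, otherwise start a new paragraph.
    has_trailer_block = len(paragraphs) > 1 and all(
        TRAILER_LINE.match(line) for line in paragraphs[-1].splitlines()
    )
    separator = "\n" if has_trailer_block else "\n\n"
    return body + separator + COPILOT_TRAILER + "\n"


print(finalize_commit_message("Fix parser\n", opted_in=True, suggestion_accepted=True))
```

Under this rule a grep for the trailer would mean something again: it would only match commits where both conditions held.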

00:03:08

What 'co-author' even means

00:03:08 There's a deeper question underneath the outrage. Git's authorship model is from a different era. There's the author and there's the committer. Co-author trailers were an extension bolted on top to handle pair programming, where two humans wrote a thing together and one of them ran the commit.

00:03:28 Co-author was a polite way to say 'this one is also Alice's.' It assumed the trailer was being added with intent, by someone who knew what it meant. What AI coding tools are pushing on is whether 'authorship' even still maps onto a single commit cleanly. If I sketch a function, an agent expands it, I push back, the agent rewrites, I edit two lines, and we ship — who's the author?

00:03:53 The honest answer is something like 'me, with assistance,' which Git has no field for. Co-author is the closest existing thing, so vendors are reaching for it. I get the move. But the consequence of every AI-assisted commit being labeled co-authored-by-the-vendor is that the trailer stops meaning what it used to and starts meaning 'this commit was made in an editor that has Copilot installed.'

00:04:27 A better fit would be something like a structured provenance field: a list of contributions, each with an actor (human or model) and a span of lines, attached to the commit metadata. A reviewer could see at a glance that lines 40 to 80 came from the agent and 81 to 92 are mine.

00:04:45 Audit tools could compute aggregates. License tools could enforce policy. None of that exists, and it's hard to retrofit. So we're in a period where the convention we have — co-author trailers — is being pressed into service for something it wasn't designed for, by vendors who'd like to be in the trailer.

00:05:06 The Microsoft default is the most visible case of that this week, but it won't be the last.
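
To make the structured provenance idea concrete, here is a minimal Python sketch of what such a record and its aggregates might look like. Nothing like this exists in Git today; the Contribution type, the actor naming scheme, and both helper functions are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Contribution:
    actor: str       # e.g. "human:me" or "model:agent" (naming scheme is hypothetical)
    start_line: int  # inclusive
    end_line: int    # inclusive

    def line_count(self) -> int:
        return self.end_line - self.start_line + 1


def lines_by_actor(contributions: list[Contribution]) -> dict[str, int]:
    """Aggregate an audit tool could compute: how many lines each actor contributed."""
    totals: Counter = Counter()
    for contribution in contributions:
        totals[contribution.actor] += contribution.line_count()
    return dict(totals)


def actor_for_line(contributions: list[Contribution], line: int) -> Optional[str]:
    """Reviewer's view: who is recorded as the source of a given line?"""
    for contribution in contributions:
        if contribution.start_line <= line <= contribution.end_line:
            return contribution.actor
    return None


# The example from the narration: lines 40 to 80 from the agent, 81 to 92 from the human.
provenance = [Contribution("model:agent", 40, 80), Contribution("human:me", 81, 92)]
print(lines_by_actor(provenance))      # {'model:agent': 41, 'human:me': 12}
print(actor_for_line(provenance, 85))  # human:me
```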

00:05:12

Two million lines of Haskell

00:05:12 Ian Duncan, who works on Mercury's stability engineering team, posted a long piece on the Haskell blog that I read twice — the best piece of production engineering writing I've seen in a while, and the lessons in it are not specifically about Haskell. Mercury runs about two million lines of Haskell in production.

00:05:34 They process two hundred and forty-eight billion dollars in annual transaction volume across three hundred thousand businesses, on six hundred and fifty million in annualized revenue, with about fifteen hundred employees. The codebase is maintained mostly by generalists who'd never written Haskell before joining.

00:05:56 That last part is the one that gets the headline. Duncan's argument has three moves. The first is the reframing of purity. He says, and this is a quote, 'purity is not something the language is, so much as that it is something your interfaces enforce.' Under the hood, Haskell is full of mutation: every time you run something through the ST monad, there are in-place updates and unsafe pointer arithmetic happening that would alarm you if you saw it written out.

00:06:28 What makes it acceptable is that the rank-2 type on runST means that mutation cannot escape the scope. Internally, anything goes. Externally, the function is pure. Duncan says this principle generalizes: you can permit arbitrarily dangerous operations inside a scope as long as the scope's exit is typed narrowly enough that the danger can't leak.
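
The same boundary can be imitated in a language without rank-2 types. The Python sketch below is an analogy under a stated assumption: Python only offers discipline here, not enforcement, so nothing stops a later edit from leaking the buffer the way runST's type would. The function name is made up for the example.

```python
def running_totals(values: tuple[int, ...]) -> tuple[int, ...]:
    """Pure from the outside: same input, same output, no visible side effects."""
    buffer = [0] * len(values)  # local mutable state, never shared with callers
    total = 0
    for index, value in enumerate(values):
        total += value
        buffer[index] = total   # in-place update, invisible outside this scope
    return tuple(buffer)        # only an immutable snapshot leaves the function


inputs = (1, 2, 3)
print(running_totals(inputs))  # (1, 3, 6)
```

Internally the function mutates freely; callers only ever see an immutable input and an immutable result.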

00:06:53 That's a useful frame for any production system in any language. Database connection pooling, retry logic, mutable caches, circuit breakers — none of it is a problem if the interface is tight enough. The second move is the type system as a custodian of operational lore.

00:07:12 Duncan describes the kind of incantation every large codebase has: flush the audit log after every transaction, always check the feature flag before calling this endpoint, enqueue the notification inside the database transaction not after it. These rules live in wiki pages and Slack threads and senior engineers' heads.

00:07:34 When the senior engineer leaves, or it's Friday afternoon and a deadline is approaching, the incantation gets skipped. So Mercury encodes the incantation in the type system. He gives a worked example. Instead of telling people 'use writeWithEvents not writeTransaction,' you make the type 'Transact a' opaque, and the only function that can execute it is commit, which atomically writes and publishes.

00:08:02 The wrong path doesn't compile. The institutional knowledge survives the engineer who wrote it down.
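
The shape of that design can be sketched outside Haskell, with a large caveat. This Python version is not Duncan's code: FakeStore and the field names are hypothetical, and Python can only signal "do not touch" with a leading underscore, whereas a hidden constructor in Haskell makes the wrong path fail to compile.

```python
from dataclasses import dataclass, field


@dataclass
class FakeStore:
    """Stand-in for a database plus an event queue (hypothetical)."""
    rows: list = field(default_factory=list)
    events: list = field(default_factory=list)


class Transact:
    """Describes a write and the events that must ship with it; it does nothing by itself."""

    def __init__(self, rows: list, events: list) -> None:
        self._rows = list(rows)
        self._events = list(events)


def commit(store: FakeStore, tx: Transact) -> None:
    """The single place a Transact is executed: rows and events land together."""
    staged_rows = store.rows + tx._rows
    staged_events = store.events + tx._events
    # Both results are staged first, then swapped in as one assignment.
    store.rows, store.events = staged_rows, staged_events


store = FakeStore()
commit(store, Transact(rows=["payment:42"], events=["payment.created:42"]))
print(store.rows, store.events)
```

Because there is no separate "write without events" function to call, the rule about publishing inside the transaction stops being something people have to remember.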

00:08:09

Adaptive capacity

00:08:09 The third move is the one I keep thinking about. Duncan reframes reliability. The traditional view is that reliability is the absence of failure — you enumerate the things that can go wrong, you write checks, you hunt bugs. Mercury thinks about it differently. He writes, quote, 'Reliability is not just the absence of failure.

00:08:33 It is the presence of adaptive capacity. It is a system's ability to keep functioning while reality continues its longstanding and regrettable habit of refusing to hold still.' In a company that doubles every year, half of your coworkers will always have less than a year of experience.

00:09:00 A year later, that's still true. The tribal knowledge that load-bears the system has a half-life that gets shorter as the company grows. So 'adaptive capacity' isn't a phrase from a resilience paper — it's a daily concern. Can the new hire read this module and understand it?

00:09:20 If the database is slow, does this service degrade or does it cascade? If someone misuses an interface, does the compiler catch it, or does the on-call get paged? Mercury also runs Temporal for durable execution. They open-sourced a Haskell SDK called hs-temporal-sdk that wraps the Rust core SDK over FFI.

00:09:42 Duncan describes Temporal as Frankenstein's monster in the flattering sense — assembled from excellent parts, animated by improbable effort, smarter than many of the people alarmed by it. The Temporal model maps cleanly onto Haskell because a workflow is a pure function over its event history; activities are the IO.

00:10:06 The replay determinism requirement is the same constraint Haskell already imposes on pure code: same inputs, same outputs. If you read one long thing this week, read this one. The takeaways generalize past the language. The framing of types-as-operational-aid, of purity-as-boundary, of reliability-as-adaptive-capacity — that's vocabulary I'd pay good money for as a senior engineer at any company that's hiring faster than its documentation can keep up.
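
The replay idea is easier to see in miniature. The Python sketch below is a deliberately simplified model, not Temporal's API or hs-temporal-sdk: the event names are invented, and a real workflow is code re-run against recorded history rather than a plain fold. It only illustrates the constraint that state is a pure function of the event history.

```python
from functools import reduce


def apply_event(state: dict, event: tuple[str, int]) -> dict:
    """Pure step: returns a new state and never mutates the old one."""
    kind, amount = event
    if kind == "deposited":
        return {**state, "balance": state["balance"] + amount}
    if kind == "withdrawn":
        return {**state, "balance": state["balance"] - amount}
    return state  # unknown events are ignored in this toy model


def replay(history: list[tuple[str, int]]) -> dict:
    """Rebuild workflow state from scratch by folding over the recorded events."""
    return reduce(apply_event, history, {"balance": 0})


history = [("deposited", 100), ("withdrawn", 30), ("deposited", 5)]
print(replay(history))  # {'balance': 75}
```

Replaying the same history always yields the same state, which is what lets a worker crash and another pick up where it left off.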

00:10:39

Where the agent loop should run

00:10:39 Different angle on production engineering, this one specifically about agents. Andrea Luzzardi at Mendral wrote a piece called 'The Agent Harness Belongs Outside the Sandbox' that anyone building shared coding agents should read carefully. The setup is short. An agent harness is the loop that drives a model.

00:11:02 Send a prompt, get a response, execute the tool calls, feed the results back, repeat. Every production agent has one. The question is where it runs. There are two architectures. Inside the sandbox, the loop runs in the same container as the code it's working on, which is what Claude Code does on your laptop and what most off-the-shelf harnesses assume.

00:11:28 Outside the sandbox, the loop runs on your backend, and when it needs to execute a tool it calls into the sandbox over an API. Luzzardi makes the case for outside, and the reasoning is concrete. Your credentials stay out of the container. The sandbox holds only the environment the agent needs.

00:11:49 There's nothing in there for the agent to escape to, so there's no permission model to enforce. You can suspend the sandbox when the agent isn't using it — and most of what an agent does, thinking and calling APIs and waiting for CI, doesn't need a sandbox at all.

00:12:09 Sandboxes become cattle: if one dies mid-session, you provision a new one and keep going. And critically, multi-user state stops being a distributed-filesystem problem. Several engineers in the same organization share skills and memories — those become a database, not a synced volume.

00:12:30 The interesting part is how they handle the filesystem. Modern agent harnesses assume a local filesystem. Skills are files at .claude/skills/foo.md. Memories are files at .claude/memory/MEMORY.md. The harness reads and writes them with the same read and write tools it uses for source code.

00:12:51 That works on a laptop. It doesn't work when the harness is outside the sandbox, because the sandbox is disposable. So Mendral virtualizes filesystem access. The agent has one read tool, one write tool, one edit tool. When the agent calls them, the harness routes the call based on the path.

00:13:12 Workspace paths go to the sandbox. Paths under /skills/ and /memory/ go to Postgres.
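
The routing rule is small enough to sketch. This Python version is an illustration under assumptions, not Mendral's code: two dictionaries stand in for Postgres and the sandbox API, and the class name and the /workspace/ path are hypothetical. Only the /skills/ and /memory/ prefixes come from the piece.

```python
class VirtualFS:
    """One read/write/edit surface; the path prefix decides which backend serves the call."""

    VIRTUAL_PREFIXES = ("/skills/", "/memory/")

    def __init__(self) -> None:
        self.database: dict[str, str] = {}  # stand-in for Postgres
        self.sandbox: dict[str, str] = {}   # stand-in for the sandbox filesystem API

    def _backend(self, path: str) -> dict[str, str]:
        if path.startswith(self.VIRTUAL_PREFIXES):
            return self.database
        return self.sandbox

    def write(self, path: str, content: str) -> None:
        self._backend(path)[path] = content

    def read(self, path: str) -> str:
        return self._backend(path)[path]

    def edit(self, path: str, old: str, new: str) -> None:
        content = self.read(path)
        if old not in content:
            raise ValueError(f"text to replace not found in {path}")
        self.write(path, content.replace(old, new, 1))


fs = VirtualFS()
fs.write("/memory/MEMORY.md", "prefer small PRs")
fs.write("/workspace/src/main.py", "print('hi')")
fs.edit("/memory/MEMORY.md", "small", "reviewable")
print(sorted(fs.database), sorted(fs.sandbox))
```

The agent calls the same three tools either way; where the bytes live is the harness's concern.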

00:13:18

The interface the model was trained on

00:13:18 Luzzardi explains why he didn't just add memory_read and memory_write tools alongside read and write, and I'll quote him directly because the reasoning is the kind of thing you only know if you've shipped this. Quote. 'The problem is that more tools make agents worse.

00:13:36 Each tool dilutes the attention the model pays to every other tool, makes the prompt longer, and adds another decision the model has to make at every turn. Two tools that do almost the same thing, read and memory_read, are especially bad, because the model has to disambiguate them from context and will sometimes pick wrong.' End quote.

00:13:59 The other reason matters more. The frontier labs are doing reinforcement learning on harnesses that look like Claude Code. That training shapes the models to be good at a specific API surface — read of a path, write of a path, edit of a path. If you invent memory_read, you're off the trained path.

00:14:19 The virtualized interface keeps the API surface the model was trained on and puts the database semantics on the backend. That's a piece of architecture advice with a real basis. The frontier models are converging on a tool surface, and your harness either matches that surface or it pays for the deviation.

00:14:40 Mendral chose to match. They use Inngest for durable execution of the loop — the loop is a function, each turn is a step, Inngest checkpoints each one, and the loop survives deploys. They use Blaxel for sandbox lifecycle, which gives them twenty-five-millisecond resume from standby — low enough that the agent can't tell the sandbox was ever gone.

00:15:03 Luzzardi's also honest about what's still hard. Bash is a leak — the agent can grep into virtualized namespaces and bypass the routing. They handle it with a system-prompt instruction and a tree-sitter parser as best-effort, neither airtight. Consistency is unsolved when two sessions in the same organization update memory at the same time; they're running last-writer-wins and expecting it to break in predictable ways.

00:15:32 The path-prefix convention they picked mirrors Claude Code's local layout, which is going to bite them when Claude Code's layout shifts. None of these are gotchas. They're the actual surface area of the problem.
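
The 'loop is a function, each turn is a step' description can be sketched with a hand-rolled checkpoint log. This is not Inngest's API: StepLog, agent_loop, and the fake model and tool below are hypothetical stand-ins, and an in-memory dictionary plays the role of durable storage.

```python
class StepLog:
    """Stand-in for a durable step store; a real one would persist across deploys."""

    def __init__(self) -> None:
        self.results: dict = {}
        self.executions = 0

    def run(self, step_id: str, fn):
        if step_id in self.results:  # step already completed: reuse its result
            return self.results[step_id]
        self.executions += 1
        self.results[step_id] = fn()
        return self.results[step_id]


def agent_loop(log: StepLog, prompt: str, call_model, run_tool, max_turns: int = 10) -> list:
    """Send the transcript, run any requested tool, feed the result back, repeat."""
    transcript = [prompt]
    for turn in range(max_turns):
        reply = log.run(f"model-{turn}", lambda: call_model(list(transcript)))
        transcript.append(reply["text"])
        if reply.get("tool") is None:
            return transcript
        result = log.run(f"tool-{turn}", lambda: run_tool(reply["tool"]))
        transcript.append(result)
    return transcript


def fake_model(transcript: list) -> dict:
    # First turn asks for a tool call; any later turn declares the work finished.
    if len(transcript) == 1:
        return {"text": "listing files", "tool": "ls"}
    return {"text": "done", "tool": None}


def fake_tool(command: str) -> str:
    return f"ran {command}"


log = StepLog()
first = agent_loop(log, "fix the bug", fake_model, fake_tool)
second = agent_loop(log, "fix the bug", fake_model, fake_tool)  # simulated restart
print(first, log.executions)
```

On the simulated restart every step is served from the log, so the model and the tool are not called a second time.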

00:15:46

Specs as the durable artifact

00:15:46 On what's durable in an agent-driven workflow — there's a post on Hacker News today called 'Specsmaxxing' from the founder of acai.sh, an open-source toolkit for spec-driven development. The piece opens with a scene every developer using Claude Code or Codex will recognize.

00:16:05 The agent ships a feature. You point out it forgot an edge case. 'You're absolutely right!' the agent says, and fixes it. You point out the pagination is wrong. 'You're absolutely right!' the agent says, and fixes it. And so on. The author calls this Peak Slop, and his claim is that we're moving past it.

00:16:27 The core argument is that as agents fill context windows and lose state across sessions, the spec is the only durable artifact. Code is generated faster than you can read it. Tests are generated faster than you can audit them. Prompt diffs are disposable. What survives is the list of things you wanted the software to do.

00:16:50 His specific contribution is what he calls ACIDs — Acceptance Criteria IDs. Stable numbered requirements, written in a YAML format he calls feature.yaml, that an agent references inline in the code.

00:17:04 So a function that handles bearer tokens has a comment that says // AUTH-1, and the test for it has a comment that says // AUTH-3, and the spec says AUTH-1 is 'accepts authorization header with bearer token' and AUTH-3 is 'rejects with 401 unauthorized.' He stumbled into this when one of his sub-agents started doing it on its own and he found himself, quote, 'disgusted' by the tight coupling of code to spec — and then realized that's exactly what he wanted.

00:17:37 Every requirement traceable to a code reference. Acceptance coverage as a metric, replacing test coverage. I like this. Not because it's revolutionary — he's careful to say none of it is new — but because it's a small, lift-tomorrow convention that scales. You can use the ACID convention without buying the dashboard.

00:18:00 You can use the YAML format without buying the CLI. The shape of the idea, that the spec is the load-bearing artifact and everything else is downstream, is the same point Duncan was making at Mercury about types as the custodian of operational lore. Different vocabulary, same underlying claim.

00:18:21 The compiler enforces what types you must satisfy. The dashboard tracks which acceptance criteria you've satisfied. Both are answers to the question of where institutional knowledge lives once the people who know it have moved on.
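The ACID convention is small enough to sketch without any tooling. What follows is an assumption-heavy illustration, not acai.sh's implementation: I don't know the real feature.yaml schema, so the spec here is a plain dict, AUTH-2 is a made-up placeholder requirement, and the ID pattern and acceptance_coverage function are hypothetical. It scans source text for IDs in comments and reports which requirements nothing points at.

```python
import re

# Assumed ID shape: uppercase prefix, hyphen, number (e.g. AUTH-1).
# This will also match strings like "UTF-8", which is why unknown IDs
# are reported separately instead of being counted as coverage.
ACID_PATTERN = re.compile(r"\b([A-Z][A-Z0-9]*-\d+)\b")


def acceptance_coverage(spec, sources):
    """Return (coverage fraction, unreferenced spec IDs, IDs not in the spec)."""
    referenced = set()
    for text in sources.values():
        referenced.update(ACID_PATTERN.findall(text))

    spec_ids = set(spec)
    covered = spec_ids & referenced
    missing = sorted(spec_ids - referenced)
    unknown = sorted(referenced - spec_ids)
    coverage = len(covered) / len(spec_ids) if spec_ids else 1.0
    return coverage, missing, unknown


# Stand-in for a parsed spec file; AUTH-2 is invented for the example.
spec = {
    "AUTH-1": "accepts authorization header with bearer token",
    "AUTH-2": "hypothetical placeholder requirement",
    "AUTH-3": "rejects with 401 unauthorized",
}

# Stand-in for source files, keyed by file name.
sources = {
    "auth.py": "def check(header):  # AUTH-1\n    ...\n",
    "test_auth.py": "def test_reject():  # AUTH-3\n    ...\n",
}

coverage, missing, unknown = acceptance_coverage(spec, sources)
```

With these inputs, AUTH-2 comes back as unreferenced and coverage is two out of three. That's the whole metric: a requirement nobody references is either unimplemented or untraceable, and either way you want to know.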

00:18:37

Stenberg asks: are we approaching zero bugs?

00:18:37 Daniel Stenberg, the maintainer of curl, posted a piece called 'Approaching zero bugs?' — a falsifiable take in a debate that's been mostly vibes. Stenberg's project is one of the most-scanned C codebases on Earth. Curl runs on basically every device with a network stack.

00:18:57 Every new automated bug-finder, every new fuzzer, every new AI security tool eventually gets pointed at curl. So when Stenberg writes about whether the bug-finding tools are actually moving the needle, he has data. His argument is structural. Tools don't add bugs, they expose existing ones.

00:19:18 So if the tools are improving fast enough, we should eventually run out — fewer bugs to find, slower bugfix rate, until the project asymptotes toward zero. And he proposes a way to test it. The age of newly-reported vulnerabilities. If the tools are this good, we should soon only be fixing bugs we introduced very recently.

00:19:42 The median age of new CVEs at report time should fall toward zero. He plotted it. Two charts in the post. The first is the average and median age of curl vulnerabilities at the time they get reported. Neither is falling. The second is the bugfix rate over time.

00:20:01 Also not falling. Stenberg is careful — he says, quote, 'these graphs are based on data from a single project, which makes it super weak to draw statistical conclusions from, but this is all I have to work with.' Fair caveat. But it's the kind of caveat that means 'I'm reporting what I see.'

00:20:24 The tooling is real, the tools are finding more, and the volume of incoming reports has gone up enough that the maintainer load is now partly about filtering noise. Stenberg has written before about being inundated with bad AI-generated security reports. But the curve toward zero bugs hasn't started bending.

00:20:47 If you'd asked me to predict whether it would have started bending by now, I'd have said maybe — and the data says no, not yet. That's a useful update for anyone who's been told the security story is solved by tooling. It isn't, on the public evidence we have. The right question is when, and the chart says we don't know.
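The age-at-report test is easy to repeat on another project if you have two dates per vulnerability. A minimal sketch under stated assumptions: the function name and the (introduced, reported) input shape are mine, not Stenberg's, and the sample dates are synthetic placeholders rather than curl data.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def median_age_by_report_year(vulns):
    """Group vulnerabilities by report year; return {year: median age in days}.

    `vulns` is an iterable of (introduced, reported) date pairs, where
    `introduced` is when the flaw entered the code.
    """
    ages = defaultdict(list)
    for introduced, reported in vulns:
        ages[reported.year].append((reported - introduced).days)
    return {year: median(days) for year, days in sorted(ages.items())}


# Synthetic placeholder dates, invented purely to exercise the function.
sample = [
    (date(2010, 1, 1), date(2024, 1, 1)),
    (date(2020, 1, 1), date(2024, 1, 1)),
    (date(2023, 1, 1), date(2024, 1, 1)),
    (date(2015, 1, 1), date(2025, 1, 1)),
]

for year, age_days in median_age_by_report_year(sample).items():
    print(year, age_days)
```

If the tools were draining the backlog, the per-year medians would trend toward zero over time. A flat or rising series is the "not yet" result.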

00:21:11

Klarna and the intent gap

00:21:11 On the deployment side, Nate Jones at Nate's Substack wrote a post called 'Klarna saved 60 million dollars and broke its company,' and the numbers are worth knowing because they're going to keep getting cited. Klarna's Q3 2025 earnings revealed that their AI agent now does the work of eight hundred and fifty-three full-time employees.

00:21:34 It's saved sixty million dollars. It handled two-point-three million conversations in its first month. It cut average resolution time from eleven minutes to two. By every customer-service operations metric, it's a runaway win. And yet, six months earlier, Klarna's CEO had gone on Bloomberg to say the strategy had backfired, and the company had started hiring humans back.

00:21:58 Not because the AI failed at the task. Because it succeeded at it. The metric the agent was optimized for — resolution time, deflection, cost per conversation — was not the metric that kept customers loyal. Customers leaving Klarna cited the agent specifically as a reason.

00:22:17 The agent was doing exactly what it was measured on, and what it was measured on turned out not to be what the company actually needed. Jones calls this gap intent engineering. Prompt engineering told the AI what to do. Context engineering tells it what to know.

00:22:35 Intent engineering tells it what to want. He cites an MIT report saying ninety-five percent of generative AI pilots fail to deliver measurable impact, and a Gartner forecast that forty percent of agentic AI projects will be cancelled by 2027. I'm slightly skeptical of the 'intent engineering' framing as a discipline name — it sounds like it wants to be a category, and naming a category before the practices exist is a marketing move more than an engineering one.

00:23:06 But the underlying observation is the most useful thing I've read about deployment this month. The Klarna story is not 'AI is overhyped' and not 'AI replaced humans.' It's 'AI succeeded at the wrong objective, and the company didn't realize the objective was wrong until churn started showing up in earnings.' Anyone deploying agents into a customer-facing role should be asking, before they ship: what's the metric this agent is optimizing, and is that metric a faithful proxy for what we actually care about?

00:23:42 Klarna's was not. Six months of optimization and sixty million dollars later, they found out.

00:23:48

The chain of thought as scratch memory

00:23:48 Over on the LocalLLaMA subreddit, a user named Signal_Ad657 posted a piece that hit five hundred and fifty upvotes overnight. Twenty hours of side-by-side compute on two RTX PRO 6000 Blackwell cards, comparing Qwen 3.6-27B and Qwen Coder-Next, looking for a definitive winner.

00:24:08 The headline result, after twenty hours and many kilowatt-hours, is, quote, 'it depends.' Of course it does. But there's a counter-intuitive finding buried in the post worth flagging. Signal_Ad657 ran three configurations: 27B with thinking enabled, 27B with thinking disabled via --no-think, and Coder-Next as the third comparison.

00:24:32 Thinking-disabled 27B was the most consistent shipper of work — ninety-five-point-eight percent across the full twelve-cell grid at N=10. Same model weights as the thinking version, just with the chain-of-thought trace turned off. The interesting bit is the diagnostic.

00:24:51 He hand-graded the cells where both modes shipped and found that substantive output was preserved between thinking and no-thinking modes. The difference was the verbosity of the reasoning prose. And there was a documented word-trim loop that the model would fall into during doc-synthesis tasks, a kind of reasoning drift where the model burns context on shortening sentences.

00:25:18 With --no-think, that loop halved. Four out of ten with thinking, two out of ten without. Signal_Ad657 calls this 'thinking-trace as loop substrate.' I think that framing is worth taking seriously. The chain-of-thought trace isn't only a reasoning tool — it's also scratch space the model uses to maintain coherence across a long generation.

00:25:43 When the model is forced to think, it sometimes uses the thinking budget productively, and sometimes it gets caught in the kind of self-revision loop that doesn't go anywhere. Disabling thinking forecloses the loop. The top reply from a user called viperx7 is also worth reading.

00:26:04 He pushes back hard on the methodology — the test was run on FP8 quants on RTX PRO 6000s, but most people running these models locally are on twenty-four or forty-eight gigs of VRAM and choosing between different quantization levels of the two models, not between FP8 versions.

00:26:24 So the practical answer for a normal homelab user might be the opposite of the benchmark's. Read both together. The benchmark is useful, the methodology critique is useful, neither one alone is the whole picture.

00:26:40

Anthropic on top, no viral moment

00:26:40 On the market side, there's a post on the OpenAI subreddit claiming Anthropic has passed OpenAI in valuation and revenue. Numbers cited: Anthropic at thirty-nine billion in annualized revenue versus OpenAI at twenty-five, and on secondary markets an implied valuation that crosses one trillion, over a hundred billion ahead of OpenAI.

00:27:05 I'd take the numbers with some calibration. The top reply, two hundred and ten upvotes, is the simple correction: 'They calculate annualized revenue differently.' Both labs use run-rate-as-revenue accounting that takes the most recent month and multiplies it out to a year. Both numbers are real in some sense, neither is GAAP.
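For anyone who hasn't stared at run-rate math, here's the arithmetic as a sketch. The multiply-by-twelve convention is my assumption of the usual definition, the top reply is a reminder that the two labs may not even share it, and the monthly figures are synthetic units, not either lab's numbers.

```python
def annualized_run_rate(latest_month_revenue):
    """Take the most recent month and multiply it out to a year (assumed x12)."""
    return latest_month_revenue * 12


def trailing_twelve_months(monthly_revenue):
    """Sum what was actually booked over the last twelve months."""
    if len(monthly_revenue) < 12:
        raise ValueError("need at least twelve months of data")
    return sum(monthly_revenue[-12:])


# Synthetic example: a company growing by one unit a month, from 10 to 21.
months = list(range(10, 22))

run_rate = annualized_run_rate(months[-1])   # 21 * 12 = 252
booked = trailing_twelve_months(months)      # 10 + 11 + ... + 21 = 186
```

Same company, same twelve months, and the run-rate figure is about a third higher than what was booked. The faster the growth, the wider that gap, which is why a run-rate headline and a GAAP number can both be honest and still not be comparable.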

00:27:28 The valuation number is a secondary-market implied valuation, not a primary funding round. The trillion-dollar figure exists in the way that any illiquid private company trading on secondary markets has a number: directionally meaningful, not what you'd put in a board deck. What I find more interesting than the numbers is the framing in the post itself.

00:27:53 The author writes — quote — 'somehow Anthropic lapped them without a single viral moment. no big launch, just enterprise deal after enterprise deal.' That part rings true. Over the last few months, Anthropic has not had a ChatGPT-style consumer breakout. The growth has come from enterprise contracts, AWS distribution, and Claude Code's mindshare among engineers.

00:28:20 OpenAI in the same window has been doing the visible work — Codex, the Sora launches, the apps platform. If the numbers are even directionally right, the read is that distribution beat product in this cycle. Anthropic shipped fewer headlines and more contracts.

00:28:40 The replies are mixed on whether it holds. One reply, from alpha_dosa, says GPT-5.5 has pulled engineering attention back to Codex and that Opus 4.7 drew regression complaints. Possible. The next data point I'd watch for is the Q1 2026 actuals, when both labs file something with their financing partners, because right now we're working from leaks and run-rate math.

00:29:06

Postings hit a multi-year high

00:29:06 And one more chart. It's making the rounds on the singularity subreddit, posted by a user called artemisgarden, and it shows software engineering job postings at their highest level since November 2023. The post itself is one line, 'somebody needs to prompt the models,' but the top comment is the one I'd lift out.

00:29:25 It's from a user called m_atx, three hundred and twenty-three upvotes, and they say — quote — 'I lead a 10 person engineering team and I desperately need more people. I'd double headcount right now if the budget was there. We are busier than ever. And yes we're also faster than ever but not nearly to the extent that you'd think.

00:29:45 Building robust software is still really hard. And the world wants a lot more of it right now.' AI tooling is real, the throughput per engineer is up, and the demand for software is rising faster than the throughput. The narrative for the past year has been that engineers are being replaced.

00:30:09 The narrative on the chart is that engineers are being hired again, after the post-ZIRP correction bottomed out. I want to be careful — postings are not hires, the mix has shifted toward senior roles, and one chart from one tracker is not the labor market. But it's the third or fourth signal in this direction in the last month, and it cuts against the cleaner story everyone tells.

00:30:33 The McKenzie observation Duncan cited at Mercury maps onto this too. If the world wants more software, and the per-engineer throughput is up but not infinite, the binding constraint becomes the part of the job AI can't yet do well — the thing m_atx called 'building robust software is still really hard.' Reviewing the agent's output.

00:30:54 Knowing what to ask for. Catching the case where the agent shipped the wrong objective, like Klarna did. Holding the spec stable while the implementation regenerates underneath it. None of those scale just because the typing got faster. That's the through-line for me on the day, if there is one.

00:31:12 Microsoft auto-stamping commits. Mercury encoding lore in types. Mendral routing memory through Postgres. Acai.sh nailing requirements to lines of code. Stenberg charting bug ages. Klarna's six-month optimization wreck. The chain-of-thought as scratch memory. The job postings climbing.

00:31:29 They're not the same story, but they're all asking some version of the same question: when the code is cheap, what's the artifact that's worth holding onto? A few specific things I'll be watching. Whether Microsoft revises the VS Code default after the backlash, or holds the line.

00:31:47 Whether anyone repeats Stenberg's CVE-age methodology on a second large project so we have more than one data point. And whether Mendral or anyone else builds the harness virtualization layer in the open, because half the teams I talk to are about to need it. Talk tomorrow.

00:32:04 Lenar Kess.