◆ Dispatch 010 · 2026-05-22 GSV We Found It, Now Somebody Has To Own It
When Discovery Gets Cheap
“Finding the flaw is becoming the cheap part; deciding whether it is real, who owns it, and how fast it gets patched is where the system now bends.”
— Lenar Kess, today's narration
Friday's CONSTRUCT follows one tension through security, coding agents, and local runtimes: AI systems now produce findings and code faster than teams can verify, prioritize, and safely land the results.
- Anthropic's Project Glasswing update says Claude Mythos Preview and roughly fifty partners found more than ten thousand high- or critical-severity vulnerabilities, which moves the hard work from discovery to triage, disclosure, and patch deployment.
- Anthropic's Glasswing thread puts the headline number into public circulation and frames the volume problem directly: the software industry has to adapt to what models can now find.
- Sarah Chieng's AI Engineer talk on fast coding models argues that Codex Spark's 1,200-token-per-second generation changes developer practice only if validation, review, and refactoring move into the inner loop.
- Letta's local Code announcement shows the same pressure in agent tooling: local execution, local memory, and local model support are useful only when provenance and sync rules stay explicit.
- Artificial Analysis on Cursor Composer 2.5 pricing adds the cost side: cheaper task completion can change tool choice, but it doesn't remove the need for review discipline.
Chapters
- 00:00:00 Transcript
Sources
6 cited
1
Project Glasswing: An initial update
Article Anthropic — AI lab reporting initial results from its collaborative cybersecurity initiative
Progress on software security used to be limited by how quickly we could find new vulnerabilities. Now it’s limited by how quickly we can verify, disclose, and patch the large numbers of vulnerabilities found by AI.
www.anthropic.com/research/glasswing-initia…
- Cited text
Progress on software security used to be limited by how quickly we could find new vulnerabilities. Now it’s limited by how quickly we can verify, disclose, and patch the large numbers of vulnerabilities found by AI.
- Context
- It moves the security story from model capability to the human and maintainer capacity required to turn findings into safer deployed software.
- Key points
- Anthropic says roughly fifty partners found more than ten thousand high- or critical-severity vulnerabilities with Claude Mythos Preview.
- The open-source scan covered more than one thousand projects and estimated 6,202 high- or critical-severity vulnerabilities.
- Of 1,752 assessed high- or critical-estimated findings, 90.6 percent were valid true positives and 62.4 percent were confirmed high or critical.
- Maintainers have asked Anthropic to slow disclosures, and Anthropic says an average high- or critical-severity bug found by Mythos Preview takes two weeks to patch.
- Provenance
- Article · Supporting source
2
Anthropic Project Glasswing thread
Thread Anthropic — Official Anthropic X account
Since then, we and our partners have found more than ten thousand high- or critical-severity vulnerabilities in essential software.
x.com/AnthropicAI/status/2057909102542549503
- Cited text
Since then, we and our partners have found more than ten thousand high- or critical-severity vulnerabilities in essential software.
- Context
- It shows how the headline claim is being received: not as a simple win, but as a workload and coordination problem.
- Key points
- The thread puts the ten-thousand-plus vulnerability number into public circulation.
- A follow-up says the software industry will need to adapt to the volume of vulnerabilities models like Claude Mythos Preview can find.
- Replies focus on triage, novelty, patch capacity, and maintainer workload.
- Provenance
- Thread · Primary source
3
Fast Models Need Slow Developers — Sarah Chieng, Cerebras
Video Sarah Chieng — Head of developer experience at Cerebras, speaking at AI Engineer
Unless we fix them, they're going to start generating 1,200 tokens per second of bad code.
www.youtube.com/watch?v=TeGsFFNqRLA
- Cited text
Unless we fix them, they're going to start generating 1,200 tokens per second of bad code.
- Context
- It gives the developer-practice version of the same bottleneck: faster generation only helps if checking becomes part of the loop.
- Key points
- Chieng says Codex Spark generates code at 1,200 tokens per second, compared with roughly 40 to 60 tokens per second for Sonnet or Opus families.
- She argues that validation becomes cheap enough to run continuously: tests, linting, pre-commit checks, diff review, and browser QA.
- She recommends larger models for planning and faster models for execution, with successful sessions captured as reusable skills.
- She warns against massive prompts, one-shot attempts, huge commits, and unverified agent swarms.
- Provenance
- Video · Supporting source
4
Letta Code local execution announcement
Thread Letta — AI agent tooling company
Letta Code can now run fully locally with an embedded server - no login or Docker required
x.com/Letta_AI/status/2057908120102609062
- Cited text
Letta Code can now run fully locally with an embedded server - no login or Docker required
- Context
- It places agent memory and runtime trust inside the operator's machine, making provenance and sync behavior central product questions.
- Key points
- Letta says Code can run fully locally with an embedded server.
- Memory is stored locally and can be synced to GitHub with a memory repository command.
- The update includes built-in support for local LLMs.
- Provenance
- Thread · Primary source
5
Artificial Analysis on Cursor Composer 2.5 cost per task
Thread Artificial Analysis — AI model and product benchmarking account
Cursor Composer 2.5's is 3–18x cheaper than Opus 4.7 in Claude Code (medium reasoning), and 5–32x cheaper than GPT-5.5 in Codex (medium) based on API pricing
x.com/ArtificialAnlys/status/20579144371564…
- Cited text
Cursor Composer 2.5's is 3–18x cheaper than Opus 4.7 in Claude Code (medium reasoning), and 5–32x cheaper than GPT-5.5 in Codex (medium) based on API pricing
- Context
- It frames the next agent-tool competition around cost per completed and verified task.
- Key points
- The post compares coding-tool economics using cost per task rather than token price alone.
- It claims Cursor Composer 2.5 is materially cheaper than Opus 4.7 and GPT-5.5 under the stated conditions.
- The episode treats the claim as a pricing analysis, not as final evidence of equivalent checked output.
- Provenance
- Thread · Primary source
6
DHH on Omarchy 4 and GPT-5.5-generated QML
Thread DHH — Software developer and creator of Ruby on Rails
The Omarchy 4 branch is now 30,000 lines of new code. The majority of it was written by GPT5.5.
x.com/dhh/status/2057907663967543618
- Cited text
The Omarchy 4 branch is now 30,000 lines of new code. The majority of it was written by GPT5.5.
- Context
- It gives a concrete software-engineering example where generated scale may be valuable, while still depending on disciplined review.
- Key points
- DHH says the Omarchy 4 branch has thirty thousand lines of new code, mostly written by GPT-5.5.
- He says GPT-5.5 has been strong at QML and that review remains necessary.
- The example is used as a migration case where generated volume can be useful if review and testing keep up.
- Provenance
- Thread · Primary source
Transcript
00:00:00 liraen: Anthropic says Project Glasswing and roughly fifty partners found more than ten thousand high- or critical-severity vulnerabilities in essential software. I want to start with that fact because it sounds like a security victory and a capacity warning at the same time. If models can find flaws at that pace, what part of the security process becomes the scarce resource?
00:00:22 halek: The scarce resource is no longer the first scan. Reproduction and deduplication come first. Then someone has to judge severity, contact maintainers, design patches, and deploy them. Anthropic's own update says the work is now limited by how quickly people can verify, disclose, and patch the findings. That's a very different operating problem from, "can an AI find a bug?"
00:00:43 liraen: And the numbers inside the update make that feel less like marketing. Anthropic says Mythos Preview scanned more than one thousand open-source projects and estimated 6,202 high- or critical-severity vulnerabilities. Then 1,752 of those were assessed by outside security firms or Anthropic. Of that assessed group, 90.6 percent were valid true positives, and 62.4 percent were confirmed as high or critical. That's still Anthropic's update, so we should leave room for independent review, but it isn't a vague claim.
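The percentages quoted here can be turned into approximate counts with a few lines of arithmetic. The sketch below uses only the figures stated in the episode. Rounding to whole findings is an assumption, since the raw counts are not quoted, so treat the derived numbers as estimates.

```python
# Figures quoted from Anthropic's update as reported in this episode.
estimated_high_or_critical = 6202  # estimated findings across the open-source scan
assessed = 1752                    # findings assessed by outside firms or Anthropic
valid_rate = 0.906                 # share of assessed findings that were true positives
confirmed_rate = 0.624             # share of assessed findings confirmed high or critical


def approx_count(total: int, rate: float) -> int:
    """Round a share of a total to the nearest whole finding (an approximation)."""
    return round(total * rate)


valid_count = approx_count(assessed, valid_rate)
confirmed_count = approx_count(assessed, confirmed_rate)
assessed_share = assessed / estimated_high_or_critical

print(f"assessed share of the estimate: {assessed_share:.1%}")   # 28.2%
print(f"approx. valid true positives:   {valid_count}")          # 1587
print(f"approx. confirmed high/critical: {confirmed_count}")     # 1093
```

Read this way, a little over a quarter of the estimated findings had been assessed at the time of the update, which is itself a measure of how much verification work remains.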
00:01:17 halek: Right, and the cryptography-library example made me sit forward. Anthropic says Mythos Preview found a now-patched certificate-forgery bug in a crypto library used by billions of devices. If that technical analysis holds up when they publish it, that isn't a leaderboard item. That's a model building an exploit against infrastructure people actually depend on.
00:01:38 liraen: The maintainer side complicates the celebration. The update says some maintainers asked Anthropic to slow down disclosures because they need time to design patches. It also says an average high- or critical-severity bug found by Mythos Preview takes two weeks to patch. Discovery accelerated, but the social and engineering system around repair didn't instantly accelerate with it.
00:02:02 halek: I would be careful with the triumphal reading. If you hand a maintainer five hundred reports, even good reports, you have created work. Some findings are urgent, some are duplicate, and some affect a configuration nobody uses. Some need a security release, downstream coordination, and a public advisory. The model can produce the finding; the project still has to absorb the finding without breaking its own users.
00:02:23 liraen: That maps cleanly onto the public thread too. Anthropic's tweet led with the ten-thousand figure, then the reply said the industry will need to adapt to the volume of vulnerabilities that models like Claude Mythos Preview can find. The reaction I saw in the thread was less "is this possible" and more "who is supposed to process this."
00:02:44 halek: That processing question is the operator question. If your security process has one monthly patch meeting, one overworked triage queue, and three maintainers with day jobs, a model that finds bugs ten times faster may make you less safe for a while. The known-bug window gets bigger unless repair speeds up too.
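The known-bug window argument can be illustrated with a toy backlog model: findings arrive at one rate, patches land at another, and whatever is left over stays open. Every number below is hypothetical and chosen only to show the shape of the problem. None of it comes from the Glasswing update.

```python
def open_backlog(weeks: int, found_per_week: float, patched_per_week: float,
                 start: float = 0.0) -> list[float]:
    """Return the count of known-but-unpatched findings at the end of each week."""
    backlog = start
    history = []
    for _ in range(weeks):
        # The backlog cannot go negative: you cannot patch bugs nobody has found.
        backlog = max(0.0, backlog + found_per_week - patched_per_week)
        history.append(backlog)
    return history


# Hypothetical team that can patch 3 findings a week.
before = open_backlog(weeks=8, found_per_week=2, patched_per_week=3)
# Same team after discovery gets ten times faster and repair does not.
after = open_backlog(weeks=8, found_per_week=20, patched_per_week=3)

print(before[-1])  # 0.0
print(after[-1])   # 136.0
```

With repair capacity unchanged, the open backlog grows by the difference between the two rates every week, which is the "less safe for a while" scenario in plain arithmetic.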
00:03:03 liraen: Sarah Chieng's AI Engineer talk starts from a different surface, but it lands on the same constraint. She says Codex Spark generates code at 1,200 tokens per second, compared with roughly 40 to 60 tokens per second for Sonnet or Opus families in her comparison. Her warning is blunt: if developers keep the habits they learned from slow generation, they will now create bad code much faster.
00:03:28 halek: That talk is very operator-minded. She isn't saying speed is bad. She is saying speed changes the inner loop. At 1,200 tokens per second, tests and lint should run during the work. So should pre-commit checks, diff review, browser QA, and small refactors. The old excuse was waiting. That excuse gets weaker when validation can run while the model is still warm.
00:03:49 liraen: She also names the bad habits plainly: massive prompts, one-shot attempts, huge commits, and too many agents running where nobody verifies the output. [chuckle] I like the talk partly because it refuses the screenshot theater of eight terminals and five screens. The agent screenshot matters less than the review path. How much of their work can be checked before it becomes part of the codebase?
00:04:12 halek: Tiny correction on the phrasing there: I wouldn't make it a grand question. I would make it a rule. A model that can produce twenty variants can also produce a small diff, a focused test, and a reason for the change. Without those, speed just gives you a larger cleanup bill.
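One way to turn that rule into something enforceable is a small gate that bounces any proposed change lacking a small diff, a focused test, or a stated reason. This is a minimal sketch. The `ProposedChange` type, the `gate_problems` function, and the 200-line threshold are all hypothetical and not taken from the talk.

```python
from dataclasses import dataclass, field

MAX_CHANGED_LINES = 200  # hypothetical threshold; tune per project


@dataclass
class ProposedChange:
    changed_lines: int
    test_files_touched: list[str] = field(default_factory=list)
    reason: str = ""


def gate_problems(change: ProposedChange) -> list[str]:
    """Return why a change should be bounced; an empty list means it may go to review."""
    problems = []
    if change.changed_lines > MAX_CHANGED_LINES:
        problems.append(f"diff too large: {change.changed_lines} lines")
    if not change.test_files_touched:
        problems.append("no focused test included")
    if not change.reason.strip():
        problems.append("no reason given for the change")
    return problems


big_unexplained = ProposedChange(changed_lines=1200)
small_and_tested = ProposedChange(
    changed_lines=40,
    test_files_touched=["tests/test_login.py"],  # hypothetical path
    reason="Reject empty passwords before hashing",
)

print(gate_problems(big_unexplained))   # three problems
print(gate_problems(small_and_tested))  # []
```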
00:04:28 liraen: Fair. And she gets specific about model roles. Use a larger model for planning or long-horizon decomposition. Then use the faster model for execution. Capture successful sessions as skills so repeatable work doesn't depend on someone remembering the right prompt. That connects to the agent-file practice we have been seeing in the last week: memory outside the model, process outside the chat window, and verification attached to the artifact.
00:04:56 halek: Yes, although I would be strict about the word skill. A skill isn't a souvenir from a good session. It is a reusable procedure with inputs, constraints, and proof. Otherwise teams will collect charming markdown files that say "be careful" and call it process. The better version is plain and specific: bounded goal, files allowed to change, checks required, and what to do when the check fails.
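The four parts named here (bounded goal, files allowed to change, checks required, failure behavior) map naturally onto a small data structure. The sketch below is one possible shape, not a format used by any tool mentioned in the episode. The `Skill` class, its field names, and the example paths and commands are all hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Skill:
    goal: str                         # the bounded goal
    allowed_paths: tuple[str, ...]    # glob patterns for files the agent may change
    required_checks: tuple[str, ...]  # commands that must pass before the work counts
    on_failure: str                   # what to do when a check fails

    def out_of_scope(self, touched: list[str]) -> list[str]:
        """Return the touched files that match none of the allowed patterns."""
        return [path for path in touched
                if not any(fnmatch(path, pattern) for pattern in self.allowed_paths)]


convert_panel = Skill(
    goal="Convert one settings panel to QML",
    allowed_paths=("src/ui/*.qml", "tests/ui/*"),
    required_checks=("pytest tests/ui",),
    on_failure="stop and report the failing check; do not retry silently",
)

print(convert_panel.out_of_scope(["src/ui/Panel.qml", "src/core/db.py"]))
# ['src/core/db.py']
```

The point of the structure is that each field can be checked mechanically, which is what separates a procedure from a note that says "be careful".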
00:05:16 liraen: There's a security echo there. Glasswing says the scan isn't the end of the work; Chieng says generation isn't the end of the work. Both are pointing at the same missing middle: the place where an artifact becomes trusted enough to ship.
00:05:30 halek: And the missing middle has to be cheap enough that people use it. That's why speed matters. If a verification pass takes thirty minutes, people batch it. If it takes thirty seconds, you can make it habitual. Teams can spend all the speed on output and none of it on checking. That is the mistake to avoid.
00:05:48 liraen: Letta announced that Letta Code can now run fully locally with an embedded server, no login, no Docker, local memory, optional sync to GitHub through a memory repository command, and built-in support for local large language models. What does that change if you are actually running agents against a private codebase?
00:06:08 halek: It changes the trust boundary. A local agent with local memory is attractive because your code, task history, and agent notes can stay on your machine. But then the operational questions move closer to you. Where is the memory stored? What gets synced to GitHub? Is the sync repo private? Can you audit what the agent remembered? Can you delete it? Those aren't decorative settings. They are the product.
00:06:27 liraen: And it arrives on a day when the security thread is already about volume and verification. A local coding agent can make the developer feel more in control, but local state can also become another place where secrets, stale assumptions, or half-true summaries accumulate.
00:06:45 halek: Exactly. I like local execution. I like not needing a hosted control plane for every coding session. But local doesn't mean simple. If memory syncs to a repository, you need review rules. If the agent can use a local model, you need to know which model answered which step. If the embedded server is listening on your machine, you need to know who can reach it. The privacy story is credible only when the mechanics are inspectable.
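The "who can reach it" question often comes down to which address a local server binds to. The sketch below is a generic check using Python's standard `ipaddress` module. It says nothing about how Letta's embedded server is actually configured, which the announcement does not describe, and the function name is hypothetical.

```python
import ipaddress


def may_be_reachable_from_other_machines(bind_host: str) -> bool:
    """True if a server bound to bind_host could accept non-local connections.

    A loopback address (127.0.0.0/8 or ::1) only accepts connections from the
    same machine. Anything else, including the wildcard 0.0.0.0, may be
    reachable from the network unless a firewall says otherwise.
    """
    if bind_host == "localhost":
        # Assumption: localhost resolves to a loopback address, as it usually does.
        return False
    return not ipaddress.ip_address(bind_host).is_loopback


for host in ("127.0.0.1", "::1", "0.0.0.0", "192.168.1.20"):
    print(host, may_be_reachable_from_other_machines(host))
```

This only inspects the bind address. Firewalls, containers, and port forwarding can widen or narrow the real exposure, so it is a first check rather than a guarantee.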
00:07:05 liraen: That's the trade-off I keep circling. Hosted agents can give you centralized policy and shared audit, but they also pull more work into someone else's system. Local agents give you control, but they make your machine part of the production surface. The mature version probably needs both: local-first execution with explicit sync, readable logs, and a way to prove which memory was used.
00:07:29 halek: And failure behavior that stops the run when provenance is missing. If an agent says, "I remember we decided this," I want to know where that memory came from. Was it a file? A previous session? A GitHub-synced note? A hallucinated preference? For coding agents, memory without provenance is just another source of confident wrongness.
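Stopping the run when provenance is missing is a fail-closed design: a memory with no recorded origin raises an error instead of being used quietly. The sketch below shows the idea in isolation. The `Memory` type, the `recall` function, and the source-label format are hypothetical and do not reflect how any particular agent stores memory.

```python
from dataclasses import dataclass
from typing import Optional


class MissingProvenance(Exception):
    """Raised when a recalled memory has no recorded origin."""


@dataclass
class Memory:
    claim: str
    source: Optional[str] = None  # hypothetical labels, e.g. "file:docs/decisions.md"


def recall(memories: list[Memory], keyword: str) -> list[Memory]:
    """Return memories mentioning keyword, refusing any that lack a source."""
    hits = [m for m in memories if keyword.lower() in m.claim.lower()]
    for memory in hits:
        if not memory.source:
            # Fail closed: stop the run rather than act on an untraceable claim.
            raise MissingProvenance(f"no source recorded for: {memory.claim!r}")
    return hits


store = [
    Memory("We decided to use tabs", source="file:docs/decisions.md"),
    Memory("We decided to skip integration tests"),  # origin unknown
]

print([m.source for m in recall(store, "tabs")])
try:
    recall(store, "integration tests")
except MissingProvenance as err:
    print("run stopped:", err)
```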
00:07:48 liraen: Artificial Analysis posted that Cursor Composer 2.5 is three to eighteen times cheaper than Opus 4.7 in Claude Code, and five to thirty-two times cheaper than GPT-5.5 in Codex, based on API pricing and medium reasoning. I am treating that as a pricing analysis from one benchmarking shop, not as a universal law. Still, the direction matters: coding-agent competition is moving toward cost per completed task, not just model quality in isolation.
00:08:18 halek: Cost per completed task is a better unit than cost per token, but I want one more word in it: checked. Cost per checked task. If Composer is cheaper because the model is cheaper and the tasks pass the same tests, great. If it is cheaper because you stop earlier and push more review onto the human, the comparison is incomplete.
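"Cost per checked task" can be written as a formula: model spend plus human review cost, divided by the number of tasks that actually pass their checks. The sketch below uses made-up figures for two unnamed tools. None of the numbers come from the Artificial Analysis post, and the function name and parameters are hypothetical.

```python
def cost_per_checked_task(model_cost: float, review_hours: float,
                          hourly_rate: float, tasks_passing_checks: int) -> float:
    """Total spend (model plus human review) divided by tasks that passed checks."""
    if tasks_passing_checks <= 0:
        raise ValueError("no tasks passed checks; cost per checked task is undefined")
    return (model_cost + review_hours * hourly_rate) / tasks_passing_checks


# Hypothetical tool A: model spend is 10x cheaper, but it needs more human review.
tool_a = cost_per_checked_task(model_cost=4.0, review_hours=4.0,
                               hourly_rate=100.0, tasks_passing_checks=40)
# Hypothetical tool B: pricier model, less review needed for the same 40 checked tasks.
tool_b = cost_per_checked_task(model_cost=40.0, review_hours=1.0,
                               hourly_rate=100.0, tasks_passing_checks=40)

print(f"tool A: {tool_a:.2f} per checked task")  # 10.10
print(f"tool B: {tool_b:.2f} per checked task")  # 3.50
```

In this invented case the tool with the cheaper model is the more expensive one once review time is counted, which is the incompleteness the comparison is meant to expose.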
00:08:37 liraen: That ties back to DHH's Omarchy post too. He says the Omarchy 4 branch is now thirty thousand lines of new code and that the majority was written by GPT-5.5. His line is basically: you still need to review, but this scale of conversion wouldn't have been achievable without it. That is a useful counterweight to the fear that all generated volume is waste. Sometimes volume is the point.
00:09:02 halek: Generated volume can be useful when the target is well-scoped. A QML conversion has a lot of repeated structure, visible behavior, and reviewable diffs. That's a good place for a strong coding model. I would still ask: how many tests moved with it, how many UI states were exercised, and how small were the commits? Thirty thousand lines can be a disciplined migration or a month of future archaeology. The difference is process.
00:09:24 liraen: So the economic version of today's episode isn't simply that agents are cheaper. It is that cheaper generation makes verification design more important. If the output cost falls and the review cost stays human, the review step becomes the budget line everyone notices.
00:09:42 halek: Yes. Teams should avoid lying to themselves with dashboards that count only creation. Lines generated, tasks attempted, findings reported, and variants produced are easy counters. The better counters are mean time to reproduce, percent of generated diffs that pass unchanged, number of findings patched, and how many user-visible regressions escaped. Less glamorous, more predictive.
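Three of the counters named here are simple to compute once the underlying events are recorded. The sketch below assumes a team already logs how long each finding took to reproduce, whether it was patched, and whether each generated diff merged without human edits. The record types and field names are hypothetical, and the sample data is invented.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Finding:
    hours_to_reproduce: float
    patched: bool


@dataclass
class GeneratedDiff:
    merged_unchanged: bool  # passed checks and review with no human edits


def outcome_counters(findings: list[Finding], diffs: list[GeneratedDiff]) -> dict:
    """Summarize repair-side outcomes; both lists must be non-empty."""
    return {
        "mean_hours_to_reproduce": mean(f.hours_to_reproduce for f in findings),
        "findings_patched": sum(f.patched for f in findings),
        "pct_diffs_unchanged": 100 * sum(d.merged_unchanged for d in diffs) / len(diffs),
    }


sample = outcome_counters(
    findings=[Finding(2.0, True), Finding(6.0, False)],
    diffs=[GeneratedDiff(True), GeneratedDiff(False),
           GeneratedDiff(True), GeneratedDiff(True)],
)
print(sample)
# {'mean_hours_to_reproduce': 4.0, 'findings_patched': 1, 'pct_diffs_unchanged': 75.0}
```

Escaped user-visible regressions are left out because counting them depends on how a team links incidents back to changes, which is outside what this sketch can assume.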
00:10:03 liraen: The energy and nuclear posts point at another capacity limit: data centers, regulators, and hyperscalers are now in the same sentence. I don't want to overbuild that thread from secondary posts, but the direction is hard to miss. AI demand is making electricity, cooling, permitting, and grid access part of model strategy.
00:10:24 halek: I would keep that point compact because those items are secondary posts rather than primary filings. But yes, compute isn't just chips. It means power contracts, substations, interconnect queues, cooling systems, land, and regulatory approval. If a coding model is cheap because the provider has an inference stack tuned end to end, someone still paid for the data center and the energy path underneath it.
00:10:45 liraen: That makes a slightly uncomfortable bridge back to Glasswing. The same week we talk about faster coding agents and cheaper task completion, we also get a security update saying the repair system can't absorb findings at model speed. The bottleneck keeps moving. First it was intelligence, then inference, then patching, review, memory, energy, and deployment.
00:11:07 halek: And when the bottleneck moves, the craft changes. A good operator doesn't just ask which model is smartest. They ask which step is now slow, which step is now trusted too casually, and which step still depends on a person with enough context to say no. That's the practice piece here.
00:11:24 liraen: So, Friday's answer isn't restraint for its own sake. Use the faster model, the local agent, and the security scanner. But attach each one to ownership: who verifies the result, where the memory lives, how the patch lands, and what evidence would make you roll it back. Tomorrow is Saturday, and the systems that hold up over a weekend are usually the ones that wrote those answers down before everyone logged off.