◆ Dispatch 024 · 2026-05-13 GSV With One Run Overnight

Hackbots, Magento, and Three Lines of Logic

2026-05-13 / 00:29:55 / 13 sources

“A frontier model multiplies whatever the operator already had — the target list, the harness, the disclosure pipeline.”
— Lenar Kess, today's narration

An overnight hackbot run lands a real CVE in Adobe Magento. Codex starts driving local Mac apps in parallel, with per-app permissions and a separate cursor. Cloudflare publishes one of the prettiest debugging writeups of the year — a nine-year-old kernel patch, a 14ms oscillation, three lines of fix. Plus Nous Research's removable attention wrapper, GPT-5.5's first ProgramBench solve, Vercel's argument that giving an agent a file system changes how it behaves, a 26-million-parameter tool-calling model, Isomorphic's two-billion-dollar Series B, and a Purdue senior who put Rust on his graduation cap.

Chapters

00:00:04 A hackbot earned a CVE overnight
00:04:26 Codex starts driving your Mac apps
00:08:15 Cloudflare's CUBIC death spiral
00:13:06 Nous Research's Lighthouse Attention
00:15:48 GPT-5.5 cracks a ProgramBench task
00:19:35 Vercel: give the agent a computer
00:22:24 Needle: tool-calling in 26 million parameters
00:24:35 Isomorphic Labs raises 2.1 billion
00:26:47 A Rust grad cap, and a tweet about working out
00:29:44 Tomorrow's Angular zero-days and Lighthouse at scale

Sources

13 cited

1
Hackbot FUZZ-E earns CVE for LFI in Adobe Magento, plus 2 Angular zero-days

X @rez0__ — Joseph Thacker — AppSec researcher focused on AI red-teaming, prolific writer on agentic security

With 1 run overnight, it found vulns in wildly hardened projects.
x.com/rez0__/status/2054539643912077351 →
Context
An overnight autonomous run pulling a CVE from Magento changes the asymmetry on every open-source maintainer and enterprise codebase. The bar moved from "AI finds toy bugs" to "AI finds real bugs in projects that have already survived years of human review."
Key points
FUZZ-E, an autonomous hackbot from @AutonomousCyber, earned a CVE for a local file inclusion vulnerability in Adobe Magento
Same overnight run also surfaced 2 zero-days in Angular (disclosure pending)
Targets were chosen by Thacker — these aren't toy codebases, they are mature enterprise projects
Thacker says the same hackbot is 'even better with gpt5.5 now' — model upgrades translate directly to bug yield per run
Provenance
Tweet · Primary source
2
Adobe Magento security advisory APSB26-49

Article Adobe Product Security

helpx.adobe.com/security/products/magento/a… →
Context
Anchors the hackbot claim to an actual vendor advisory rather than a tweet. The CVE went through Adobe's standard process — the autonomy was at discovery, not at disclosure.
Key points
Primary-source disclosure for the Magento LFI vulnerability found by FUZZ-E
Confirms a vendor patch shipped through the normal Adobe advisory channel
Provenance
Article · Supporting source
3
Computer use in Codex

Video OpenAI — Roma and Ari Weinstein — Ari Weinstein joined OpenAI from Sky/Shortcuts; Roma hosts the Codex product video

Every computer use implementation I've ever seen takes over your entire computer. So you can't use your computer while the agent is using your apps.
www.youtube.com/watch?v=D_FCYsshMI4 →
Context
Local-app computer use is the bridge from chat-only coding agents to agents that can actually drive the rest of your workflow. The per-app permission model is the part to watch — it sets a precedent for how desktop agents will be sandboxed.
Key points
Codex can now drive any local Mac application via accessibility framework plus screenshots
The agent uses a separate cursor and does not steal focus — you keep working while it operates other apps in the background
Permissions are per-app and require explicit allow on first use; the agent has no access to apps you haven't authorized
Pairing with the Spark model removes the multimodal screenshot dependency for many tasks because the accessibility tree provides textual structure — Weinstein claims this runs faster than a human can operate the app
Roadmap target: 2-5-10x human speed on common tasks; Windows support 'very soon'
Provenance
Video · Supporting source
4
When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug

Article Esteban Carisimo and Antonio Vicente — Cloudflare

The effort to find the bug was massive, but the fix itself was basically one line of logic.
blog.cloudflare.com/quic-death-spiral-fix →
Context
This is what senior-engineer craft looks like when no agent is going to help: a bug at the minimum-cwnd corner of CUBIC's state space, invisible to throughput dashboards, only surfaced because someone wrote a test that deliberately drove the controller into recovery. The fix is small; the investigation is the work.
Key points
A 2017 Linux kernel patch for CUBIC's idle-handling was ported into quiche in 2020 — but the follow-up kernel fix from a week later was not
Test was failing 61% of the time: after a heavy loss burst, the congestion window pinned at the two-packet floor and never grew back
Diagnosis took weeks of qlog instrumentation; the smoking gun was a 14ms oscillation period matching the test's RTT
Root cause: bytes_in_flight hitting zero between ACK and next send was misread as an idle period, advancing the recovery boundary into the future on every cycle
Fix: measure idle from last_ack_time, not last_sent_time — a three-line patch contributed back to cloudflare/quiche
Provenance
Article · Supporting source
5
Lighthouse Attention from Nous Research

X @omarsar0 — Elvis Saravia — runs DAIR.AI, regular curator of frontier research

What if you could speed up long-context pretraining with a subquadratic wrapper that you remove before deployment?
x.com/omarsar0/status/2054224130103554359 →
Context
Most efficient-attention proposals lock you into a different architecture at deploy time. Lighthouse's train-with, deploy-without framing means labs can experiment with subquadratic pretraining without paying for it on every inference call.
Key points
Nous Research paper on Lighthouse Attention
Wraps ordinary scaled-dot-product attention with a hierarchical, gradient-free selection layer for long-context pretraining
Selection layer is removable at deployment time, leaving plain vanilla attention behind
Saves training-time compute without giving up inference-time fidelity, a different shape from the usual efficient-attention work
Provenance
Tweet · Primary source
6
Sebastian Raschka on Lighthouse Attention

X @rasbt — Sebastian Raschka — author of "Build a Large Language Model From Scratch", prolific commentator on pretraining mechanics

x.com/rasbt/status/2054543968344412621 →
Cited text
It is a relatively low-commitment attention modification. One can use it during most of training, switch back to vanilla attention near the end, and recover roughly the same modeling performance as if full attention had [been used throughout].

Context
The most pragmatic read on a research paper from one of the most pragmatic readers in the field. Marking what is and isn't bet-the-run risk for a pretraining team is exactly the kind of context a builder needs.
Key points
Raschka highlights the 'low-commitment' property — you don't have to bet your whole training run on the modification
Switch back to vanilla attention near the end and recover full-attention-equivalent quality
This is the unusual property; most efficient-attention work degrades final modeling quality unless you keep it on at inference
Provenance
Tweet · Primary source
7
GPT 5.5 high Solves First Instance — ProgramBench

Article Kilian Lieret and John (ProgramBench team) — Maintained by researchers behind the SWE-bench family at Princeton and FAIR

programbench.com/blog/gpt-5-5-first-solve →
Context
First full solve on a benchmark designed to be unforgiving is a real milestone, but the comparison runs are the more useful read: a model that probes the CLI surface carefully in 34 calls is doing meaningfully different work than one that grinds through 178 calls and still ships case-sensitive string compares.
Key points
GPT-5.5 with high reasoning becomes the first model ever to fully solve a ProgramBench task (the cmatrix instance)
Score is still 0.05% overall — one solved task out of thousands — but 26 tasks now pass 95%+ of unit tests
Both GPT-5.5 (high) and (xhigh) full-solved cmatrix in 34 and 40 API calls; Claude Opus 4.7 (xhigh) used 178 calls and racked up 19 failures
Opus's 19 failures decomposed into two trivial bugs: strcmp instead of strcasecmp for color names (11 failures) and wrong exit code on invalid color (8 failures)
GPT-5.5 default (medium reasoning) 'barely beat Claude Sonnet 4.6' — the wins come from running it at higher reasoning levels
Provenance
Article · Supporting source
8
Give Your Agent a Computer — Nico Albanese, Vercel

Video Nico Albanese — Vercel — Developer-relations lead on the Vercel AI SDK

The big assumption that goes through every single AI SDK API decision is that we want the agent definition to be the source of truth that everything else inherits from.
www.youtube.com/watch?v=wflNENRSUb4 →
Context
The pattern matters more than the SDK: agent quality is increasingly bottlenecked on what the agent can write to and read from between turns, not on the model. Vercel framing the file system as a behavioral primitive is consistent with what every serious agent shop has been quietly converging on.
Key points
AI SDK 6 ships an object-oriented agent primitive (toolLoopAgent) plus end-to-end type inference from agent definition to UI message rendering
Vercel's internal claim: giving an agent a file system didn't just add storage — it changed the agent's behavior; it followed through on long tasks and built on its own prior work
Workshop walks through a tool-loop agent with bash, memories.md, and persistent named sandboxes
Three primitives Albanese says define agent building in 2026: an agent runtime, a tool set, and a computer or sandbox for state and code execution
Provenance
Video · Supporting source
9
Needle: a 26M-parameter tool-calling model distilled from Gemini

Article Henrie_the_dreamer (LocalLLaMA)

www.reddit.com/r/LocalLLaMA/comments/1tb9b0… →
Context
If tool calling really decomposes to retrieve-and-assemble, then the right architecture for the tool-loop step is small and specialized, not the same monolith that does reasoning. Cheap, fast, on-device tool routing is what makes always-on local agents plausible.
Key points
26M parameter open-weights function-calling model targeted at consumer phones
Reported throughput: 6000 tokens per second prefill, 1200 tokens per second decode on consumer devices
Distilled from Gemini tool-calling traces; argues that tool calling is essentially retrieval-and-assembly rather than reasoning
Aimed at agentic workflows on budget Android hardware where a 7B-plus model is impractical
Provenance
Article · Supporting source
10
Isomorphic Labs announces $2.1B Series B

Article Isomorphic Labs — Alphabet-spinout AI drug discovery company led by Demis Hassabis

www.isomorphiclabs.com/articles/isomorphic-… →
Context
A two-billion-dollar Series B in AI drug design — with a UK Sovereign AI Fund check in it — is a signal about where AI capital is willing to leave the chat-app gold rush and bet on bench science. Worth knowing about even if you'll never write a peptide.
Key points
Series B totals $2.1 billion, led by Thrive Capital
New investors include MGX, Temasek, CapitalG, and the UK Sovereign AI Fund; existing backers Alphabet and GV participated
Capital is earmarked for the IsoDDE drug design engine and the company's pipeline of drug candidates
First major outside funding round for the AlphaFold-descended company since its 2021 spinout
Provenance
Article · Supporting source
11
My graduation cap runs Rust

Article Eric Park

ericswpark.com/blog/2026/2026-05-12-my-grad… →
Context
A small, fun reminder that the joy of building things is still the joy of building things. No agent needed, no benchmark involved — just a Rust toolchain, a reed switch, and a graduating senior writing better prose about it than most product blogs.
Key points
A Purdue senior built a Digispark ATtiny85 + 48 WS2812B LED cap, triggered by a reed switch and magnet as the tassel moves
Wrote firmware in Rust, forking avr-hal and ws2812-avr to support the ATtiny85 and a 16MHz clock
Coding took 2 hours; hardware took 3+ hours — confirming the universal truth that hardware is always the slow part
Doesn't plan to wear it: 'It looks like what kids would think of as a gaming PC and what boomers would think of as a seizure.'
Provenance
Article · Supporting source
12
Thacker on the GitHub bug-bounty backlash

X @rez0__

It's painfully obvious that it's the second in this case. And also, they probably haven't paid the majority of any valid bugs submitted.
x.com/rez0__/status/2054529796672041113 →
Context
The flip side of the FUZZ-E story. Same agentic-coding wave; opposite quality distribution. Bounty programs are being drowned in low-effort AI reports while a tuned hackbot pulls real CVEs from Magento overnight — the gap is in the operator, not the model.
Key points
Discourse around 325 GitHub bug-bounty submissions splits two ways: 'GitHub pays too little' versus 'most aren't valid'
Thacker's read: most aren't valid — the AI-generated submissions are noise — and the few real ones still aren't getting paid
Provenance
Tweet · Primary source
13
Nick Cammarata on identity in an AI-doubling world

X @nickcammarata — Nick Cammarata — ex-OpenAI researcher

Everyone is handling AI doubling every fourteen hours surprisingly well. They mostly just dropped it and work out more.
x.com/nickcammarata/status/2054492840668123… →
Context
A one-liner that captures the emotional weather underneath the news cycle. Not a claim about model capabilities; a claim about the people who used to derive worth from the part of the job a model now does in seconds.
Key points
Comic exaggeration on capability-doubling — 'every fourteen hours' is deliberately absurd
Pointing at a real felt-sense among researchers: identity built on being smart is unstable when capability moves faster than you can
Provenance
Tweet · Primary source

00:00:04

A hackbot earned a CVE overnight

00:00:04 Joseph Thacker posted something around five in the morning Pacific that I've been turning over all day. He landed a CVE for a local file inclusion bug in Adobe Magento. Adobe's advisory number is APSB26-49 — the real, vendor-issued, you-need-to-patch-this-version kind of advisory, not a research demo.

00:00:26 The interesting bit is how it got found. Thacker doesn't work for Adobe. He's a security researcher who handed a list of targets — back in January — to a company called Autonomous Cyber. They run a hackbot called FUZZ-E. One overnight run. The Magento LFI plus, he says, two zero-days in Angular that he'll disclose later.

00:00:50 His exact words: "With one run overnight, it found vulns in wildly hardened projects." Magento has been the e-commerce backbone for a long list of Fortune 500 storefronts for over a decade. It's been audited, attacked, patched, and audited again. Same with Angular: Google's flagship front-end framework, used in pretty much every enterprise dashboard you've ever seen.

00:01:22 The bugs still hiding in these codebases are deep, weird, and have already survived years of human review. The model class matters too. In a follow-up, Thacker noted FUZZ-E was running on an older generation back in January. He added it's, quote, "even better with gpt5.5 now." If you take that at face value — and I think you have to, the CVE is real and the advisory shipped — every frontier model bump translates into more zero-days per overnight run for whoever has the infrastructure to run a hackbot like FUZZ-E.

00:02:01 There's a flip side worth putting next to this, because it makes the asymmetry clearer. Thacker has another post from the same morning about a different discourse. GitHub had 325 bug-bounty submissions that became a small flap on X. Two camps formed: one saying GitHub pays too little for bugs, the other saying most of those submissions were AI-generated noise.

00:02:28 Thacker's read is the second one. His line: "It's painfully obvious that it's the second in this case. And also, they probably haven't paid the majority of any valid bugs submitted." On the high end: an experienced researcher with a good target list and a well-tuned hackbot pulling a CVE from a 15-year-old e-commerce codebase.

00:03:01 On the low end: a flood of slop into bounty programs that vendors are now triaging by the hundred. The gap between those two outputs isn't about the model. It's target selection, harness quality, the willingness to read a result skeptically, and the relationship with the vendor's disclosure pipeline.

00:03:24 The model multiplies whatever the operator already brought. Two things to track over the next few weeks. One — does Thacker actually publish the Angular zero-days, and what shape are they? If they're injection bugs in templating or routing, that's interesting. If they're something stranger, like prototype-pollution patterns inside Angular's own internals, that changes how I'd think about which classes of bug autonomous tooling now finds well.

00:03:57 Two — how do open-source maintainers respond when this becomes routine, because right now the asymmetry is brutal. A solo maintainer of a popular library has to handle whatever volume of high-quality findings these tools generate, on top of the noise. If you're shipping anything that talks HTTP and parses user input, today is a reasonable day to look at your fuzzing budget.

00:04:26

Codex starts driving your Mac apps

00:04:26 OpenAI shipped a Codex feature yesterday called computer use, and there's a fifteen-minute walkthrough with Ari Weinstein — who came to OpenAI from the Sky/Shortcuts world — and Roma, who hosts the Codex product video. I think this is the moment desktop agents stop being a chat-window-plus-browser thing and start being something that lives in your dock.

00:04:48 Computer use in Codex on Mac drives any local application, not just a browser tab or your editor. UTM to spin up a macOS virtual machine. Spotify to start music. Reminders to add a task. Messages to send a text. The agent does these in parallel, in the background, while you keep using your computer.

00:05:08 The distinction Weinstein drew is worth repeating. Every other computer-use implementation he's seen takes over your whole screen. His words: "Every computer use implementation I've ever seen takes over your entire computer. So you can't use your computer while the agent is using your apps." Codex animates in a separate cursor, swims it across to the app, and clicks and types without stealing focus from whatever you're doing.

00:05:35 You can keep working. That changes the affordance dramatically. An agent that locks you out for ten minutes is a tool you use carefully on weekends. An agent that operates UTM in the background while you write is something you might leave running all day. The technical piece I found most interesting is how they're using the macOS accessibility framework alongside screenshots.

00:06:00 Historically, computer-use models have been screenshot-only, which means you pay multimodal model costs on every step, and you only see what's currently rendered. The accessibility tree gives you textual structure for every UI element on screen — including ones scrolled out of view — plus role information.

00:06:19 So the model can see what kind of element it's looking at, not just that there's a rectangle of pixels there. Two things follow. One, the model is faster, because it can route without rendering. Two — and this is the bit that surprised me — they can run computer use with a non-multimodal model.
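A rough sketch of why a textual accessibility tree helps, written in Python. Everything in it is hypothetical: the UIElement class, the find helper, and the Reminders layout are illustrative stand-ins, not the macOS accessibility API and not anything from the Codex video. The point is only that a role-and-label lookup is plain text search, and that it can reach an element a screenshot would never show.

```python
from dataclasses import dataclass, field


@dataclass
class UIElement:
    """Hypothetical node in an accessibility-style tree."""
    role: str
    label: str
    on_screen: bool = True
    children: list["UIElement"] = field(default_factory=list)


def find(root, role, label):
    """Depth-first search by role and label, ignoring visibility."""
    if root.role == role and root.label == label:
        return root
    for child in root.children:
        hit = find(child, role, label)
        if hit is not None:
            return hit
    return None


# Made-up window layout: one row is scrolled out of view.
window = UIElement("window", "Reminders", children=[
    UIElement("list", "Today", children=[
        UIElement("row", "Buy milk"),
        UIElement("row", "File taxes", on_screen=False),
    ]),
    UIElement("button", "New Reminder"),
])

target = find(window, "row", "File taxes")
missing = find(window, "button", "Send")
```

Here target is found even though it is off screen, and missing is None because no such button exists in the tree. No pixels are involved, which is the property that would let a text-only model drive the app.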

00:06:38 They pair it with Spark, Codex's fast inference path, and Weinstein says it operates the software, quote, "literally faster than a human would." I watched him demo it sending a text message in Messages (open app, focus thread, type, send) in roughly a second.

00:06:56 The permissions model is per-app and explicit. The first time Codex wants to use an application, you get a prompt. Allow it for that app, and it can see and type into that app, and only that app. Anything else stays off-limits. That's a meaningful constraint. You can give Codex access to your developer apps and your spreadsheets without giving it access to your password manager, your iMessage history, or whatever else you'd rather not see show up in a model trace.
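A minimal sketch of a per-app, ask-on-first-use gate, in Python. The AppPermissions class, its method names, and the app names are all assumptions made for illustration; the video does not show how Codex implements this, and whether a refusal is remembered is my guess, so this version simply asks again.

```python
class AppPermissions:
    """Hypothetical allowlist: one explicit grant per application."""

    def __init__(self, ask_user):
        self._ask_user = ask_user   # callback: app name -> bool
        self._allowed = set()

    def can_use(self, app):
        # Apps already granted are allowed without another prompt.
        if app in self._allowed:
            return True
        # First use: ask, and remember only a positive answer.
        if self._ask_user(app):
            self._allowed.add(app)
            return True
        return False


prompts = []

def fake_prompt(app):
    """Stand-in for the user dialog: approves only two apps."""
    prompts.append(app)
    return app in {"UTM", "Numbers"}


perms = AppPermissions(fake_prompt)
first = perms.can_use("UTM")                 # prompts, then allowed
second = perms.can_use("UTM")                # already granted, no prompt
blocked = perms.can_use("Password Manager")  # prompts, refused
```

After these three calls the prompt has fired twice, once for UTM and once for the refused app. A grant for one app never widens to another.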

00:07:26 Where I'm skeptical: the demo tasks are all "set music, send a text, add a reminder." That's exactly the kind of thing where a Shortcuts-style automation would already work, and where the agent's failure cases are forgiving. The interesting test is whether it can drive a complex native app — Logic Pro, Numbers with a real formula, a CAD tool — and not fall over the moment a dialog appears it wasn't expecting.

00:07:52 I'm waiting for the first builder story where someone hands it a legitimately tedious workflow and it finishes without supervision. Windows support, Weinstein said, is coming "very soon." If you've been wondering when the agent stops being a chat panel and starts being something you can actually delegate hours to, this is closer than it was last week.

00:08:15

Cloudflare's CUBIC death spiral

00:08:15 Now for a completely different kind of story. Esteban Carisimo and Antonio Vicente at Cloudflare published one of those investigations that reminds you why senior engineers get paid. They had a test in their ingress proxy pipeline that was failing 61% of the time.

00:08:35 The post is called "When idle isn't idle: how a Linux kernel optimization became a QUIC bug." If you only read one Cloudflare post this quarter, make it this one. Quick setup. CUBIC is the default congestion controller on Linux. Quiche, Cloudflare's open-source QUIC stack, uses CUBIC too.

00:08:57 The test: a 10-megabyte HTTP/3 download with 30% packet loss for the first two seconds, then no loss. Expected behavior — CUBIC takes some hits, drops its congestion window, then after the loss stops, ramps back up and finishes in four or five seconds. Actual behavior — about 60% of the time, the download didn't finish in the ten-second timeout.

00:09:23 When they instrumented the qlogs, they saw the congestion window pinned at the minimum — 2700 bytes, two full-size packets. The controller was flipping between recovery and congestion-avoidance states 999 times in 6.7 seconds. One transition every 14 milliseconds.

00:09:43 The test's round-trip time was 14 milliseconds. The oscillation was lockstepped to the ACK clock. They traced the cause back to a 2017 Linux kernel patch. CUBIC had a long-standing bug where idle periods would inflate the congestion window dangerously on the next send.

00:10:03 Jana Iyengar proposed a fix. Neal Cardwell pointed out a flaw in the first version. Eric Dumazet, Yuchung Cheng, and Cardwell shipped an elegant version: shift the CUBIC epoch forward by the idle duration, don't reset it. That worked. A week later, a follow-up patch landed in the kernel to handle a remaining overflow case.

00:10:28 In 2020, the quiche team ported the first fix into Rust. They didn't port the second one. For years that was fine — until someone ran a test designed to drive CUBIC into its minimum window. At minimum window, bytes-in-flight hits zero between the last ACK and the next send on every cycle.

00:10:49 The quiche code interpreted that as the application going idle, and added the perceived idle duration — effectively a full RTT — to the recovery-start time. That pushed the recovery boundary into the future. The next ACK saw the connection as still in recovery, didn't grow the window, and exited recovery.

00:11:12 The boundary got reset to the ACK time. The next send pushed it forward again. The window stayed pinned at two packets, and a 10-megabyte download crawled along at 2700 bytes per round trip. The fix is three lines. Measure the idle period from the last ACK received, not the last packet sent.

00:11:34 When the window is small and the pipe drains, the last-send time is a full RTT in the past; the last-ack time is much more recent. With that change, the recovery boundary stops chasing the send time, and the next packet lands on the right side of it. The download finishes in five seconds.
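A toy model of that mechanism in Python, for anyone who wants to see the two idle measurements side by side. This is not quiche's code. The function and variable names are made up, the 14 ms round trip follows the oscillation period cited in the sources, the 100 microsecond gap between an ACK and the next send is an assumption, and the sketch leaves out the step where the boundary is reset on ACK. Window growth is a plus-one stand-in for real CUBIC.

```python
RTT_US = 14_000       # assumed round trip, in microseconds
SEND_DELAY_US = 100   # assumed gap between an ACK and the next send


def simulate(idle_reference, cycles=50, cwnd_packets=2):
    """Return the window (in packets) after `cycles` ACK-clocked round trips.

    idle_reference is "last_sent" (the behavior described as buggy)
    or "last_ack" (the behavior described as the fix).
    """
    recovery_start = 0   # loss detected at t=0
    last_sent = 0        # the in-flight packet was sent at t=0
    for _ in range(cycles):
        # The ACK arrives one RTT after the packet it acknowledges.
        now = last_sent + RTT_US
        last_ack = now
        # A packet sent at or before the boundary counts as "in recovery".
        in_recovery = last_sent <= recovery_start
        if not in_recovery:
            cwnd_packets += 1   # stand-in for CUBIC growth
        # Nothing is in flight now; the next send follows shortly.
        now += SEND_DELAY_US
        reference = last_sent if idle_reference == "last_sent" else last_ack
        idle = now - reference
        # Shift the recovery boundary forward by the perceived idle time.
        recovery_start += idle
        last_sent = now
    return cwnd_packets


stuck = simulate("last_sent")
recovered = simulate("last_ack")
```

With last_sent as the reference, the perceived idle time is a full RTT plus the send gap, so the boundary lands exactly on every new send and the window stays at 2. With last_ack, the idle time is only the send gap, the boundary falls behind, and the window grows on every round trip after the first.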

00:11:56 The line from the writeup I keep coming back to: "The effort to find the bug was massive, but the fix itself was basically one line of logic." Three things stand out. First, the bug was invisible to throughput dashboards and to static review; it only existed in the corner of CUBIC's state space where loss had already happened.

00:12:22 Second, the diagnostic work was almost entirely about visualization — building qlog tooling that let them see the 14-millisecond oscillation. Third, the trail led back nine years to a kernel patch, and the resolution required reading the original Linux kernel mailing-list thread, the follow-up patch, and understanding why the quiche port had silently diverged.

00:12:49 You can give an agent a lot of code. You can't, today, give it this investigation. The accountable engineer who reads a 2017 commit message and feels the pattern click into place is still the bottleneck. I find that reassuring.

00:13:06

Nous Research's Lighthouse Attention

00:13:06 A shorter one. Nous Research dropped a paper called Lighthouse Attention, and Sebastian Raschka — the "Build a Large Language Model From Scratch" author — has the most useful read I've seen on it. The framing comes from Elvis Saravia, who runs the DAIR.AI account: "What if you could speed up long-context pretraining with a subquadratic wrapper that you remove before deployment?"

00:13:34 Most efficient-attention work — linear attention, state-space models, sliding-window variants — gives you a trade. You get cheaper attention, but you also get measurably worse modeling quality unless you keep the modification on at inference. Lighthouse wraps ordinary scaled-dot-product attention in a hierarchical, gradient-free selection layer during training.

00:13:58 Then, near the end of training, you strip the wrapper off and let the model run as plain vanilla attention. Raschka's read: "It is a relatively low-commitment attention modification. One can use it during most of training, switch back to vanilla attention near the end, and recover roughly the same modeling performance as if full attention had been used throughout."

00:14:26 If you're a lab planning a pretraining run, you don't have to bet the whole curriculum on whether the modification will degrade quality on your downstream evaluations. You can run most of the long-context steps with the subquadratic wrapper, save a meaningful chunk of compute, and recover to a vanilla-attention model at the end.

00:14:48 The risk profile is different from anything you'd take on with linear attention. The gradient-free part is also notable. Selection layers are usually trained — you're learning which tokens to attend to. A gradient-free selection avoids the optimization pathologies that show up when the routing decisions themselves have to be differentiable.
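To make the train-with, deploy-without idea concrete, here is a small pure-Python sketch. It is not the paper's algorithm. The selection rule (rank fixed-size key blocks by the dot product of the query with each block's mean key, keep the top few) is a stand-in I picked because it needs no gradients, it has a single level rather than a hierarchy, and every function name here is hypothetical. What it does show is the shape of the claim: the wrapper only chooses which keys plain attention sees, so switching it off leaves ordinary scaled-dot-product attention.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Plain scaled-dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def select_blocks(query, keys, block_size, top_k):
    """Assumed gradient-free rule: keep the top_k blocks by pooled score."""
    blocks = [list(range(s, min(s + block_size, len(keys))))
              for s in range(0, len(keys), block_size)]

    def pooled_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block)
                    for d in range(len(query))]
        return dot(query, mean_key)

    chosen = sorted(blocks, key=pooled_score, reverse=True)[:top_k]
    return sorted(i for block in chosen for i in block)


def wrapped_attention(query, keys, values, block_size=2, top_k=1,
                      wrapper_on=True):
    """Attention over selected keys; wrapper_on=False is plain attention."""
    if not wrapper_on:
        return attention(query, keys, values)
    keep = select_blocks(query, keys, block_size, top_k)
    return attention(query, [keys[i] for i in keep], [values[i] for i in keep])


query = [1.0, 0.0]
keys = [[0.0, 1.0], [0.0, 1.0], [5.0, 0.0], [4.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]

kept = select_blocks(query, keys, block_size=2, top_k=1)
training_out = wrapped_attention(query, keys, values)
deployed_out = wrapped_attention(query, keys, values, wrapper_on=False)
```

In this toy input the selection keeps the second block, indices 2 and 3, so the training-time call attends over two keys instead of four. The deployed call is the unmodified attention function.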

00:15:10 Two open questions. One — does anyone outside Nous reproduce this at a non-toy scale? The proof is in the seven-billion or thirty-billion parameter run with real downstream evaluation, not in the appendix figure. Two — how does it interact with the existing efficient-attention literature?

00:15:29 In particular, does Lighthouse compose with sliding-window attention or with the sparse-attention shapes the inference-time crowd has been shipping? If it holds up, it changes how a small lab thinks about long-context pretraining budgets. That alone is enough to keep an eye on it.

00:15:48

GPT-5.5 cracks a ProgramBench task

00:15:48 ProgramBench updated their leaderboard yesterday with GPT-5.5, and Kilian Lieret's team flagged it as the first model ever to fully solve one of their task instances. The task is cmatrix — yes, that cmatrix, the green-rain screensaver — being reimplemented from a README and a man page in a sandbox with no internet and no dependencies.

00:16:13 The headline number sounds modest. GPT-5.5 with high reasoning solves 0.05% of the benchmark. One task out of thousands. The cumulative-pass histogram is more interesting. GPT-5.5 at extra-high reasoning is now the best model at every threshold — 95% pass rate, 50% pass rate, average score.

00:16:35 It also holds a record for the number of "almost resolved" tasks, where 95-plus percent of unit tests pass: 26 of them. Look at the comparison runs against Claude Opus 4.7 at extra-high reasoning. Two GPT-5.5 runs solved cmatrix. One used 34 API calls, the other 40.

00:16:55 The Opus run used 178 API calls and racked up 19 failures. The Opus failures decompose into two bugs that any senior engineer would catch on a code review. First, eleven of the failures came from using strcmp instead of strcasecmp when parsing color names. The agent's parser only accepted lowercase.

00:17:17 "GREEN", "Red", "BLUE" all returned invalid. A one-token fix would have eliminated all eleven. Second — eight failures from an exit code of one for invalid colors. The original binary exits zero. The agent had actually tested the original binary and observed this.

00:17:37 It just never noticed when its own implementation diverged. Both bugs trace to the same root: insufficient exploration before writing code. The agent tested lowercase colors and one invalid color name. It never tested uppercase or mixed case, and never checked that its exit codes matched the binary's.
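The case-sensitivity bug and the thin probing that missed it can be sketched in a few lines. This is Python standing in for the C in the task: an exact-match lookup plays the role of strcmp, and lowercasing first plays the role of strcasecmp. The color list and both parse functions are assumptions for illustration, not code from either run.

```python
# Assumed color names, for illustration only.
COLORS = {"green", "red", "blue", "white", "yellow", "cyan", "magenta", "black"}


def reference_parse(name):
    """Stand-in for the original binary: case-insensitive match."""
    lowered = name.lower()
    return lowered if lowered in COLORS else None


def candidate_parse(name):
    """Stand-in for the reimplementation: exact, case-sensitive match."""
    return name if name in COLORS else None


def differing_probes(probes):
    """Return the inputs where the candidate disagrees with the reference."""
    return [p for p in probes if reference_parse(p) != candidate_parse(p)]


# Probing only lowercase plus one invalid name finds nothing.
narrow = differing_probes(["green", "notacolor"])
# Adding uppercase and mixed case exposes the divergence.
wide = differing_probes(["green", "GREEN", "Red", "BLUE", "notacolor"])
```

The narrow probe set reports no differences, while the wide one flags "GREEN", "Red", and "BLUE". The same compare-against-the-reference loop would apply to exit codes.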

00:18:00 It was confident enough in its mental model to commit. Meanwhile, GPT-5.5 at high reasoning, on the run that fully solved cmatrix, spent ten exploration turns probing forty-plus flag combinations before writing a single line of implementation. Then it wrote the whole thing in one pass with five targeted patches.

00:18:23 That isn't a model-quality story so much as a model-behavior story. The model with the higher reasoning budget allocates more of that budget to exploration; the model with too little allocates more to committing. There's a smaller story buried in the data that I also find useful.

00:18:44 GPT-5.5 at the default medium-reasoning setting, the one vendors use for their pricing, "barely beat Claude Sonnet 4.6." The wins only show up when you crank reasoning up. That's a real cost — those high and extra-high runs cost three to five dollars per task on this benchmark — and it tells you something about how to think about model selection.

00:19:10 Default settings are a different model than max settings. The leaderboard at one isn't the leaderboard at the other. For people choosing models for coding agents, the reminder is that model identity is a curve, not a point. What you pay for, on a coding agent, is increasingly the willingness to spend on exploration before commit.

00:19:35

Vercel: give the agent a computer

00:19:35 Stay in the agent-building space for one more. Nico Albanese gave a workshop at AI Engineer in London titled "Give Your Agent a Computer." The talk is built around Vercel's AI SDK 6, which now ships an object-oriented agent primitive called tool-loop agent, end-to-end typed messages from agent definition to UI, and persistent named sandboxes you can hand to the agent as state.

00:20:02 Albanese's framing matches what I've seen in every serious agent shop. He named three primitives for building agents in 2026: an agent runtime — the harness, the loop, how context is managed across turns; a tool set — what the agent can actually do; and a computer or sandbox — somewhere for the agent to write files, persist state, and run code.

00:20:27 Here's the insight he attributed to Vercel's internal work. Giving an agent a file system, he said, didn't just add storage. It changed how the agent behaved. It started following through on long tasks, staying on track, and building on its own prior work. Read that sentence again.

00:20:48 Storage as a behavioral primitive. The agent doesn't get smarter, but it stops dropping the plot, because it has something other than its context window to fall back on. A memories.md file the agent maintains. A bash tool to run intermediate commands. A persistent sandbox that survives between runs.

00:21:09 This is consistent with what Anthropic's Claude Code does with its CLAUDE.md, what Cursor does with its memory, and what the open-source agent harnesses all seem to converge on. The model is the same in all of them. What changes is what the agent can write to, read from, and recover from between turns.
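
A minimal Python sketch of the "something other than the context window" idea: a notes file in a directory that survives between runs. This is not Vercel's AI SDK or any harness's actual API. `FileMemory`, the `sandbox` directory, and the note format are assumptions for illustration; only the `memories.md` filename comes from the talk as narrated.

```python
import tempfile
from pathlib import Path
from typing import List


class FileMemory:
    """Append-only notes file that outlives a single run's context window."""

    def __init__(self, root: Path, filename: str = "memories.md") -> None:
        root.mkdir(parents=True, exist_ok=True)
        self.path = root / filename

    def remember(self, note: str) -> None:
        """Append one note as a markdown bullet."""
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(f"- {note}\n")

    def recall(self) -> List[str]:
        """Read back every note written so far, oldest first."""
        if not self.path.exists():
            return []
        lines = self.path.read_text(encoding="utf-8").splitlines()
        return [line[2:] for line in lines if line.startswith("- ")]


# Two separate "runs" share state only through the directory on disk.
with tempfile.TemporaryDirectory() as tmp:
    sandbox = Path(tmp) / "sandbox"  # hypothetical persistent sandbox root

    first_run = FileMemory(sandbox)
    first_run.remember("parsed the README; flags documented in man page")
    first_run.remember("next: probe exit codes for invalid input")

    second_run = FileMemory(sandbox)  # fresh object, no in-memory state
    print(second_run.recall())
```

The second run starts with no in-memory state and still picks up where the first left off, which is the behavior the talk attributes to giving the agent a file system.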

00:21:31 If you're building an agent right now, this is where you spend the week. The model picker is a one-liner; the file system the agent inhabits is the product. Treat the agent's working environment like the actual surface area of your application, and the behavior you get back changes shape.

00:21:52 The type system Albanese spends most of the workshop on — end-to-end typed UI message inference from a single agent declaration — is the kind of polish that matters more than it looks. When the agent definition is the source of truth and types flow through to your route handler and your React component, you can refactor the agent contract without playing whack-a-mole across the codebase.

00:22:20 That isn't glamorous, but it is how teams ship.

00:22:24

Needle: tool-calling in 26 million parameters

00:22:24 In the local-models corner of Reddit yesterday, a team announced Needle — a 26-million-parameter open-weights model that does one thing: function calling, also known as tool calling. The numbers they posted are striking. 6000 tokens per second prefill, 1200 tokens per second decode, on a consumer phone.

00:22:45 The model is distilled from Gemini tool-calling traces. Here's their argument. Tool calling, in their phrase, is "fundamentally retrieval-and-assembly" — you match a user query to a tool name and you extract arguments from the input. It's a pattern match and a structured extraction.

00:23:04 A huge frontier model doing it is overkill, in the same way that using GPT-4 to parse a date string is overkill. That decomposition might be too clean. There are tool-calling cases where you do need reasoning — multi-step orchestration, choosing between similar tools, handling ambiguous inputs.

00:23:25 For the dominant case, though — single-call, well-named tools, clear-ish inputs — they have a point. A specialized 26M model that routes the call quickly and hands off the actual work to a larger model only when needed is the right shape for on-device agents. The interesting thing here isn't the open-weights drop, although that's nice.

00:23:48 It's the implied architecture. If you split agent runtime into a small fast model for routing and a larger model for reasoning, you get latency, privacy, and battery characteristics that a single-monolith approach can't touch. Phones, browser extensions, IDE plug-ins — all of them benefit.
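
A minimal Python sketch of that split, assuming a cheap router that handles the single-call case and escalates everything else. Needle itself is a learned model; the regex patterns here only stand in for it. The tool names, patterns, `small_router`, `route`, and the fallback are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ToolCall:
    tool: str
    args: Dict[str, str]


# Hypothetical tool table: tool name -> pattern whose named groups are the arguments.
TOOLS = {
    "set_timer": re.compile(r"\btimer for (?P<minutes>\d+) minutes?\b", re.I),
    "get_weather": re.compile(r"\bweather in (?P<city>[A-Za-z ]+?)\s*\??$", re.I),
}


def small_router(query: str) -> Optional[ToolCall]:
    """Stand-in for the small model: match a tool and extract its arguments.

    Returns None unless exactly one tool matches, so ambiguous or
    multi-step requests are never guessed at.
    """
    hits = []
    for name, pattern in TOOLS.items():
        match = pattern.search(query)
        if match:
            hits.append(ToolCall(name, match.groupdict()))
    return hits[0] if len(hits) == 1 else None


def route(query: str, fallback: Callable[[str], ToolCall]) -> ToolCall:
    """Try the fast path first; hand off to the larger model only when needed."""
    call = small_router(query)
    return call if call is not None else fallback(query)


def large_model_stub(query: str) -> ToolCall:
    """Placeholder for the expensive reasoning model."""
    return ToolCall("escalate", {"query": query})


print(route("Set a timer for 10 minutes", large_model_stub))
print(route("Plan my trip to Rome and book dinner", large_model_stub))
```

The design choice worth noticing is the refusal path: the fast tier returns nothing rather than a low-confidence guess, which keeps the reasoning cases with the larger model.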

00:24:08 The Needle team is making a bet that this becomes the dominant shape for agent infrastructure on consumer devices. I think they're at least directionally right. Before I'd bet production traffic on it, I'd want to see their evaluation against actual tool-calling benchmarks — the Berkeley Function-Calling Leaderboard, the Gorilla benchmarks.

00:24:32 The throughput numbers alone make me curious.

00:24:35

Isomorphic Labs raises 2.1 billion

00:24:35 Quick one, because the news itself is straightforward and you've probably already seen it. Isomorphic Labs — the Alphabet spin-out Demis Hassabis founded out of the AlphaFold work — announced a Series B of 2.1 billion dollars yesterday. Thrive Capital led. New investors include MGX, Temasek, CapitalG, and notably the UK Sovereign AI Fund.

00:24:59 Existing backers Alphabet and GV participated. The money's going into their IsoDDE drug design engine and the company's drug-candidate pipeline. They're advancing programs across multiple therapeutic areas. Two things I notice. One — the UK Sovereign AI Fund participating is interesting.

00:25:19 We've talked about sovereign AI as a thing primarily in the model-training sense — the United Arab Emirates funding G42, France's positions on Mistral, and so on. A sovereign fund writing a check into a drug-discovery AI is a different shape of bet. It's an explicit national-strategy claim that AI-driven biology is industrial infrastructure worth the same kind of state backing as cloud compute.

00:25:48 Two — the round size is large for biotech, and small for AI. Two billion in a foundation-model context is half a training run. Two billion in a drug-discovery context is the kind of capital that finances multiple drug candidates through years of preclinical work and Phase 1.

00:26:07 Hassabis is using AI capital flows to fund pharma timelines, and the investors lining up suggest at least some of them think the math will work. On the company's pipeline I'll stay neutral. On what this signals, I won't: AI capital is starting to look outside the chat-app gold rush.

00:26:27 If Isomorphic's pipeline produces a clinical hit in the next eighteen months, that ripples outward into how everyone else thinks about applied-AI funding. If it doesn't, the sovereign-money pattern still tells you something about where states have decided to position themselves.

00:26:47

A Rust grad cap, and a tweet about working out

00:26:47 I want to close with two short items from the human side, because the news cycle is heavy today and you've earned the breather. Eric Park, a senior at Purdue, posted "My graduation cap runs Rust." He rented the cap and gown (he points out that you have to rent them, that it costs ninety-four dollars, and that you can't even buy them outright) and decided that if he can't burn it, he can at least light it up.

00:27:16 A Digispark ATtiny85 microcontroller, forty-eight WS2812B LEDs, a reed switch and a magnet that triggers when the tassel moves from right to left during the ceremony, and Rust firmware on the AVR toolchain. He had to fork two crates — avr-hal and ws2812-avr — and dirty-patch the clock speed to 16 megahertz before they'd compile for the ATtiny85.

00:27:41 Code took two hours. Hardware took three-plus, which he flags as the eternal truth: "If anybody tells you hardware is easy, they're wrong or they're lying and have never worked on a custom hardware project." The video at the end has the LEDs strobing in a way he describes, accurately, as "what kids would think of as a gaming PC and what boomers would think of as a seizure."

00:28:11 The whole post is a build log of something he did because he wanted to. No agent. No model. No benchmark. Just a senior, a Rust toolchain, and a graduation ceremony to embellish. I love it. The other one is a tweet from Nick Cammarata, ex-OpenAI researcher. Quote: "Given the average person I know built their whole sense of self-worth around being smart starting at age 4 and reinforced continuously for the next few decades, everyone is handling AI doubling every fourteen hours surprisingly well.

00:28:47 They mostly just dropped it and work out more." The line is a joke. The shape underneath isn't. There's a real felt-sense among people in the labs and people building agents that capability is moving faster than the part of an identity that derives worth from being good at it can keep up.

00:29:09 Some of those people are deliberately building lives around things the model can't do — moving heavier weights, cooking better dinners, talking to their kids. Some of them are at a loss. No tidy take here. I notice that the people I think are happiest right now are the ones who treated their craftsmanship as a gift to be shared rather than a license to feel important about themselves.

00:29:37 Eric's cap is, weirdly, the right metaphor. He built it because he wanted to. He'll graduate either way.

00:29:44

Tomorrow's Angular zero-days and Lighthouse at scale

00:29:44 Patch your Magento install if you run one, set a CUBIC death-spiral test on your QUIC stack if you ship one, and tell me what you've built this week. — Lenar.