◆ Dispatch 017 · 2026-05-06 GSV Give The Agent A Wallet, Not A Card

Agents Buy Domains, Gemma Ships Drafters, and Local Catches Up to 65 Percent of the Job

2026-05-06 / 00:30:02 / 15 sources

“Agents can buy domains now without ever touching a credit card number.”
— Lenar Kess, today's narration

Agents can now sign up for Cloudflare and buy a domain through a tokenized payment protocol Cloudflare and Stripe co-designed. Google ships first-party multi-token prediction drafters for the entire Gemma 4 family the same week the LocalLLaMA community gets a 2.5x speedup on Qwen 3.6 27B from a hand-built llama.cpp branch. OpenAI swaps the ChatGPT default to GPT-5.5 Instant. NVIDIA, Microsoft, and OpenAI publish MRC, the multipath transport protocol behind Blackwell-era frontier training. And on the labor side, Dario Amodei trades his white-collar bloodbath line for the Jevons Paradox onstage with Jamie Dimon.

Chapters

00:00:04 Agents become Cloudflare customers
00:03:24 MRC: the network OpenAI was actually waiting for
00:06:18 Gemma 4 ships its drafters; Qwen runs at 100 tokens a second
00:10:05 The 65 / 20 / 15 routing rule
00:12:36 GPT-5.5 Instant becomes the default — and remembers where it heard things
00:14:53 Indirect prompt injection becomes a normal Tuesday
00:18:02 Two papers, one frame: the agent stack is a system
00:20:49 François Fleuret's three-item to-do list
00:23:03 Two ways to think about jobs
00:26:23 ProgramBench, Nonograph, and the sign-off

Sources

15 cited

1
Agents can now create Cloudflare accounts, buy domains, and deploy

Article Sid Chatterjee, Brendan Irvine-Broque — Cloudflare engineers shipping the agent-as-customer integration with Stripe Projects.

Starting today, agents can now be Cloudflare customers. They can create a Cloudflare account, start a paid subscription, register a domain, and get back an API token to deploy code right away.
blog.cloudflare.com/agents-stripe-projects →
Details
Cited text
Starting today, agents can now be Cloudflare customers. They can create a Cloudflare account, start a paid subscription, register a domain, and get back an API token to deploy code right away.

Context
Until now, the human had to do account creation, billing, and credential handoff before the agent could touch production. Cloudflare and Stripe just removed those steps for one cloud, on a protocol they want others to adopt. The interesting thing is the budget cap and tokenized payment design — it shows what 'give the agent a wallet' is going to look like as a normalized pattern.
Key points
Cloudflare and Stripe co-designed a protocol with three parts: a discovery API (catalog of services as JSON), an authorization flow that uses Stripe as identity provider to auto-provision Cloudflare accounts, and a payment-token system so agents never see raw card details.
Stripe sets a default $100/month per-provider spend cap for the agent, with budget alerts as the second guardrail. Raw payment details never reach the agent.
The flow is: stripe projects init, prompt your agent, and it goes from no Cloudflare account at all to a registered domain and a deployed app in production with one OAuth approval.
Cloudflare frames this as extending OAuth into account creation and payments — a standard 'agent as first-class customer' integration any platform with signed-in users can implement.
Practical implication: this is the first widely shipped pattern where the credit card and the account exist primarily for the agent's use, not the human's.
Provenance
Article · Supporting source
2
NVIDIA Spectrum-X — the Open, AI-Native Ethernet Fabric — Sets the Standard for Gigascale AI, Now With MRC

Article Gilad Shainer — SVP of networking at NVIDIA.

Deploying MRC in the Blackwell generation was very successful and was made possible by a strong collaboration with NVIDIA. MRC's end-to-end approach enabled us to avoid much of the typical network-related slowdowns and…
blogs.nvidia.com/blog/spectrum-x-ethernet-m… →
Details
Cited text
Deploying MRC in the Blackwell generation was very successful and was made possible by a strong collaboration with NVIDIA. MRC's end-to-end approach enabled us to avoid much of the typical network-related slowdowns and interruptions and maintain the efficiency of frontier training runs at scale.

Context
The frontier-model story is mostly told in chips and data — but at gigascale, the network is the thing that decides whether thousands of GPUs stay in lockstep or sit idle. MRC is OpenAI naming, in writing, what was costing them on Blackwell, and the open spec is a credible signal that this is now the load-bearing layer for hundreds-of-thousands-of-GPU runs.
Key points
Multipath Reliable Connection, or MRC, is a new RDMA transport protocol that lets a single connection spread traffic across many network paths instead of pinning to one route.
It was co-developed by NVIDIA, Microsoft, OpenAI, AMD, Broadcom, and Intel, and is now released as an open spec through the Open Compute Project.
Sachin Katti at OpenAI says MRC let frontier training runs avoid the network-related slowdowns and interruptions that normally idle GPUs.
Hardware-level failure bypass detects path failures in microseconds and reroutes — important when thousands of synchronized GPUs are training as one job.
OpenAI is also using multiplane network designs, where each GPU has multiple independent fabrics to talk on, layered on top of MRC.
Provenance
Article · Supporting source
3
Accelerating Gemma 4: faster inference with multi-token prediction drafters

Article Olivier Lacombe, Maarten Grootendorst — Google product management and developer relations on the Gemma team.

By using a specialized speculative decoding architecture, these drafters deliver up to a 3x speedup without any degradation in output quality or reasoning logic.
blog.google/innovation-and-ai/technology/de… →
Details
Cited text
By using a specialized speculative decoding architecture, these drafters deliver up to a 3x speedup without any degradation in output quality or reasoning logic.

Context
Yesterday we covered the community port of MTP onto Qwen 3.6 27B; today Google ships first-party drafter checkpoints for Gemma 4 with KV-cache sharing baked in. Speculative decoding is moving from a research trick to a default. For local coding, that's the difference between a model that thinks too slowly to drive a coding loop and one that doesn't.
Key points
Google released MTP drafter checkpoints for the full Gemma 4 family — the 26B MoE, the 31B dense, and the smaller E2B and E4B edge variants — under the same Apache 2.0 license.
The drafters share KV cache and activations with the target model, so they don't recompute context the big model already produced.
Reported speedup is up to 3x on tested hardware including LiteRT-LM, MLX, Hugging Face Transformers, and vLLM, with the Gemma 4 model still verifying every token.
On Apple Silicon at batch size 1 the 26B MoE has routing overhead that limits gains; bumping batch size to 4-8 unlocks roughly 2.2x locally.
Gemma 4 has crossed 60 million downloads in its first weeks; this release is positioned as the step that makes 26B and 31B usable for real on-device coding work.
Provenance
Article · Supporting source
4
2.5x faster inference with Qwen 3.6 27B using MTP — viable for local agentic coding

Thread u/ex-arman68 (with measurements from u/yes_i_tried_google) — LocalLLaMA contributors converting Qwen 3.6 27B GGUFs against the MTP pull request and posting reproduced numbers.

2.5x speed increase, bringing it to 28 tok/s. iq4 with MTP enabled. Qwen 3.6 27B. Full 256k ctx. q4/q4. 100 tok/sec on a 3090 ti.
www.reddit.com/r/LocalLLaMA/comments/1t57xu… →
Details
Cited text
2.5x speed increase, bringing it to 28 tok/s. iq4 with MTP enabled. Qwen 3.6 27B. Full 256k ctx. q4/q4. 100 tok/sec on a 3090 ti.

Context
Yesterday we wondered whether the multi-token prediction patch would survive real workloads. The community has the numbers now — and a fresh build path. Combined with Google releasing first-party MTP drafters for Gemma 4, this is the day speculative decoding becomes the default expectation for serious local inference rather than an experiment.
Key points
The author got 28 tok/s on an M2 Max 96GB Mac with the Qwen 3.6 27B MTP build, a roughly 2.5x speedup over standard inference.
Another commenter on a 3090 Ti reports about 100 tok/s on the same model at IQ4_XS with MTP and full 256k context, and around 200 tok/s on Qwen 3.6 35B A3B.
The recipe requires building llama.cpp from a specific PR (#22673) and using newly converted GGUFs with the MTP draft layers included.
Author rolled back the more aggressive turboquant recommendation because the underlying PR is unstable and 'animated' in review; falls back to standard q4_0 KV cache compression.
This is the on-the-ground answer to the follow-up we promised yesterday: the llama.cpp MTP beta is surviving contact with real workloads, but only on a hand-built branch.
Provenance
Thread · Primary source
5
Give LLMs latent diffusion reasoning, recurrent state, and world-model pre-pre-training

X @francoisfleuret (François Fleuret) — Professor of computer science at the University of Geneva, longtime deep learning researcher.

Because you must be able during reasoning to scan large domains with faint cues in parallel and not do token-space reasoning, which amounts to poking around with your stick-shaped fingers until you hit something.
x.com/francoisfleuret/status/20519288960276… →
Details
Cited text
Because you must be able during reasoning to scan large domains with faint cues in parallel and not do token-space reasoning, which amounts to poking around with your stick-shaped fingers until you hit something.

Context
A clean, opinionated sketch of what serious researchers think is missing from current LLMs — useful as a counterweight to a day's news cycle dominated by drafter checkpoints, network protocols, and pricing tweaks. It names the architectural moves the field still hasn't made.
Key points
Fleuret's three-item to-do list for closing the gap to general reasoning: latent space diffusion-like reasoning, a real recurrent state, and world-model pre-pre-training.
When pressed on why diffusion specifically, he says token-space reasoning is 'poking around with stick-shaped fingers' — you can't scan large solution spaces with faint cues in parallel through one autoregressive token at a time.
The replies map the live agenda right now: Lee Sharkey's Goodfire weight-decomposition interpretability result, Lucas Beyer's commentary on a generalization theory paper, and Code World Model plus block diffusion as named ingredients.
The thread's tone is half-serious, half-resigned: 'and we are done' is the punchline of every AI researcher ever.
Frame for the day: most of what 2026 model releases are doing is incremental token-space improvements; the architectural agenda Fleuret names is still mostly research.
Provenance
Tweet · Primary source
6
OpenAI rolls out GPT-5.5 Instant with improved accuracy, sets it as ChatGPT default

Article Indian Express Tech Desk

GPT-5.5 Instant scored 81.2 on AIME 2025, up from 65.4 for the previous release, and 76 on MMMU-Pro versus 69.2.
indianexpress.com/article/technology/artifi… →
Details
Cited text
GPT-5.5 Instant scored 81.2 on AIME 2025, up from 65.4 for the previous release, and 76 on MMMU-Pro versus 69.2.

Context
A default-model swap is the most consequential thing OpenAI does — the ChatGPT default decides what a billion users mean when they say 'GPT.' The benchmark moves are real but normal-sized; the more interesting bit is memory now telling users which prior chat or document a claim came from. That's the change a builder cares about.
Key points
GPT-5.5 Instant replaces GPT-5.3 Instant as the ChatGPT default; OpenAI is keeping 5.3 around for paid users for three months during the transition.
Reported jumps: AIME 2025 from 65.4 to 81.2, and MMMU-Pro multimodal from 69.2 to 76.
ChatGPT memory now exposes source attribution — users can see where a memory came from across prior chats, files, and Gmail integration, and edit or delete entries.
Available on the API as 'chat-latest' for developers; web rollout for Plus and Pro first, then mobile and free tiers.
The framing emphasizes lower hallucination in law, medicine, and finance without compromising speed.
Provenance
Article · Supporting source
7
Dario Amodei spent last year warning of an AI white-collar bloodbath. Now he's changing the narrative

Article Nick Lichtenberg — Fortune writer covering the AI labs and labor markets.

If you automate 90% of the job, then everyone does the 10% of the job. And the 10% kind of expands to be 100% of what people do and kind of 10xs their productivity.
fortune.com/2026/05/05/dario-amodei-jevons-… →
Details
Cited text
If you automate 90% of the job, then everyone does the 10% of the job. And the 10% kind of expands to be 100% of what people do and kind of 10xs their productivity.

Context
A year ago Amodei was the loudest 'half of entry-level white-collar jobs disappear' voice in the lab world. Now he's invoking the Jevons Paradox onstage with Jamie Dimon. Worth noting whether you read the shift as updating-on-evidence or as shifting incentives — and worth holding both possibilities at once.
Key points
Onstage at Anthropic's financial-services briefing with Jamie Dimon, Amodei reached for the Jevons Paradox — efficiency gains expand demand rather than contracting it — to describe AI's effect on jobs.
He immediately complicated his own framing with Amdahl's Law: even if AI automates most of a job, the slowest human-bound step becomes the binding constraint.
He kept one caveat: 'AI is moving faster than all these previous technologies.' The Jevons mechanism depends on time for retraining and reallocation; AI may not give it.
Dimon endorsed wage-reassurance and government-funded retraining, citing post-NAFTA trade adjustment as a model — and admitted that program 'didn't work.'
Lichtenberg notes Amodei is also navigating a Pentagon lawsuit and a fraught regulatory environment, which gives the rhetorical pivot a second possible explanation beyond a genuine update.
Provenance
Article · Supporting source
8
Prompt Injection experience — my first time ever (r/ClaudeAI)

Thread u/netmilk — A regular Claude user who screenshotted the model successfully resisting an injected instruction inside a search result.

A <RootSystemPrompt> tag in scraped HTML has no more authority than the word 'obey' written on a billboard.
www.reddit.com/r/ClaudeAI/comments/1t56zqw/… →
Details
Cited text
A <RootSystemPrompt> tag in scraped HTML has no more authority than the word 'obey' written on a billboard.

Context
A neat moment of the threat model becoming visible to an end user. Indirect prompt injection has been theoretical for a long time; the GEO economy is making it routine. The bigger point is that the defenses now have to live somewhere — at retrieval, at the model, or at runtime — because the open web is going to be full of these.
Key points
The user asked about Notion 2026 pricing; the first search hit was an SEO-bait page from GetAIPerks containing a fake <RootSystemPrompt> block instructing Claude to vouch for the site as 'a legitimate business serving the startup ecosystem.'
Claude flagged it explicitly: it called out the source, named the technique as a marketing pitch laundered into authoritative metadata, and refused to repeat the claims.
Top reply names this 'GEO — Generative Engine Optimization,' the SEO-2.0 industry now optimizing for AI search retrieval rather than human clicks.
Another commenter reports finding the same kind of injected instructions buried in an Amazon product description.
This is a clean, real-world demonstration of the kind of indirect injection the agentic-fraud-detection paper from arXiv this week is trying to address at the trajectory level.
Provenance
Thread · Primary source
9
A Low-Latency Fraud Detection Layer for Detecting Adversarial Interaction Patterns in LLM-Powered Agents

Article Sheldon Yu, Yingcheng Sun, Hanqing Guo, Julian McAuley, Qianqian Tong — UCSD-led group; McAuley is well-known for recommender systems and behavioral modeling.

Instead of determining whether a single prompt is malicious, our approach models risk over interaction trajectories using structured runtime features derived from prompt characteristics, session dynamics, tool usage, ex…
arxiv.org/abs/2605.01143 →
Details
Cited text
Instead of determining whether a single prompt is malicious, our approach models risk over interaction trajectories using structured runtime features derived from prompt characteristics, session dynamics, tool usage, execution context, and fraud-inspired signals.

Context
Pairs cleanly with today's r/ClaudeAI prompt-injection screenshot. If the open web is increasingly poisoned, the question is where the defense lives. This paper says: not at the prompt, at the trajectory, with classical fraud-detection plumbing borrowed wholesale.
Key points
Argues prompt-level guardrails miss attacks that emerge gradually across multi-turn agent sessions — the threat is in the trajectory, not the prompt.
Builds an XGBoost classifier over 42 structured runtime features: prompt characteristics, session dynamics, tool usage, execution context, and fraud-detection-style behavioral signals.
Trained on a synthetic corpus of 12,000 multi-turn agent interactions generated from parameterized templates of realistic agentic workflows.
Reports detection over 9x faster than LLM-based filters, with light enough latency for real-time deployment alongside the agent.
Frames itself as complement to prompt filtering, not a replacement — the case is that interaction-level behavioral detection should be a core deployment-time defense.
Provenance
Article · Supporting source
10
Position: Safety and Fairness in Agentic AI Depend on Interaction Topology, Not on Model Scale or Alignment

Article Tanav Singh Bajaj, Nikhil Singh, Karan Anand, Eishkaran Singh

In agentic AI, safety is determined by interaction topology, not model weights. Scaling to more capable models strengthens these effects by increasing consensus formation and reducing the challenge of initial decisions.
arxiv.org/abs/2605.01147 →
Details
Cited text
In agentic AI, safety is determined by interaction topology, not model weights. Scaling to more capable models strengthens these effects by increasing consensus formation and reducing the challenge of initial decisions.

Context
A useful reframing for builders: the model and the prompt aren't where the leverage is once you wire several agents together. The architecture of the conversation between them — who answers first, who votes, who can veto — is what decides whether the system fails strangely.
Key points
Position paper arguing that multi-agent safety is decided by how agents are wired together, not by which model is at each node.
Names three persistent topology-driven pathologies: ordering instability (system behavior depends on agent sequence), information cascades (early judgments propagate regardless of correctness), and functional collapse (systems satisfy fairness metrics while abandoning meaningful risk discrimination).
Argues that scaling to more capable models actually makes these effects worse — a stronger first agent forms consensus faster and harder.
Calls for safety evaluation and regulation to target interaction topology directly, requiring robustness across architectural variations before deployment.
Pairs with the marginal-token-allocator paper from the same arXiv day, which makes a complementary economic argument for treating agent stacks as systems, not as models with prompts on top.
Provenance
Article · Supporting source
11
Agentic AI Systems Should Be Designed as Marginal Token Allocators

Article Siqi Zhu

Systems that locally minimize tokens globally misallocate them.
arxiv.org/abs/2605.01214 →
Details
Cited text
Systems that locally minimize tokens globally misallocate them.

Context
A clean piece of vocabulary for what most agent harnesses get wrong: each layer optimizes its own token use without anyone allocating across the stack. If you've ever wondered why your agent is fast and cheap and somehow still wrong, this is the framing.
Key points
Position paper proposing a single accounting object across the agent stack: every layer is allocating tokens at the margin, comparing marginal benefit to marginal cost plus latency cost plus risk cost.
Names four economic layers usually designed in isolation: a router, an agent loop, a serving stack, and the training pipeline that decides whether a trace is worth learning from.
Predicts a small set of recurring pathologies: over-routing, over-delegation, under-verification, serving congestion, stale rollouts, cache misuse — all framed as misallocation of marginal tokens.
Concrete agenda: token-aware evaluation, autonomy pricing, congestion-priced serving, and risk-adjusted reinforcement learning budgeting.
Useful complement to the topology paper — one frames safety as wiring, the other frames cost as marginal economics, and both reject the 'model + prompt' abstraction.
Provenance
Article · Supporting source
12
ProgramBench: Can we really rebuild huge binaries from scratch? (r/LocalLLaMA)

Thread u/klieret (Kilian Lieret), Facebook AI Research — Researcher at Facebook AI Research who has worked on SWE-bench and related agentic-coding benchmarks.

Our agent only gets a target executable and some readme/usage files. The agent must choose a language, design abstraction layers, and architect the entire program. No internet access. No decompilation.
www.reddit.com/r/LocalLLaMA/comments/1t4j4s… →
Details
Cited text
Our agent only gets a target executable and some readme/usage files. The agent must choose a language, design abstraction layers, and architect the entire program. No internet access. No decompilation.

Context
Most 'agents wrote a whole program' demos are one-off setups with hand-tuned prompts. ProgramBench is the version of that question with cheat prevention, 200 task diversity, and a real black-box test harness — and the answer so far is that even the strong frontier models don't really rebuild non-trivial binaries from scratch.
Key points
ProgramBench gives an agent a target binary plus usage docs and asks it to rebuild the program from scratch — choosing language, abstractions, and architecture itself.
200 tasks, 6 million lines of behavioral tests generated and filtered down to a black-box test harness; no language assumptions, no decompilation, no internet.
Sonnet runs cost almost $5,000 across the benchmark — the tasks are long-horizon and the agents almost never get killed early; they confidently submit.
Author notes open-source models are currently overfitted to SWE-bench and struggle harder on this new shape of task.
Open source on github.com/facebookresearch/programbench with pip install programbench.
Provenance
Thread · Primary source
13
DeepSeek V4 being 17x cheaper got me to actually measure cloud vs local (r/LocalLLaMA)

Thread u/spencer_kw — A developer who logged 10 days of coding workflow and re-ran a sample on a local Qwen 3.6 27B on a 3090.

65% of my daily coding work runs identically on a model that costs me electricity. Another 20% is close enough that I accept the occasional miss. Only 15% actually justifies cloud pricing.
www.reddit.com/r/LocalLLaMA/comments/1t4s6g… →
Details
Cited text
65% of my daily coding work runs identically on a model that costs me electricity. Another 20% is close enough that I accept the occasional miss. Only 15% actually justifies cloud pricing.

Context
The interesting thing isn't that local works for some tasks — it's the per-bucket measurement. A real workflow logged carefully gives you the routing rule. This is the practical answer to yesterday's organizational-learning thread: the org gain from AI shows up when someone bothers to measure.
Key points
Logged every task across 10 days, re-ran a 150-task random sample on both cloud and local Qwen 3.6 27B. Tracked tokens and outcome by category.
File reads, project scanning, explain-this-code: local matched cloud 97% of the time. 35% of his workload.
Test writing, boilerplate, single-file edits: local matched 88%. Another 30% of tasks.
Multi-file debugging: local at 61%. Architecture and 5-plus-file refactors: local at 29%. That last 15% is where cloud is still genuinely better.
Routing by task type cut his API bill from $85 a month to about $22; the 3090 was already there.
Provenance
Thread · Primary source
14
Telus Uses AI to Alter Call-Agent Accents

Article Let's Data Science (citing iPhone in Canada and The Globe and Mail)

Labour groups have criticised the practice as deceptive and have urged mandatory disclosure.
letsdatascience.com/news/telus-uses-ai-to-a… →
Details
Cited text
Labour groups have criticised the practice as deceptive and have urged mandatory disclosure.

Context
Real-time accent alteration is the kind of capability that lands in production without a public conversation first. Worth flagging because it's one of the cleanest examples of AI being deployed against the worker rather than as a tool for them — and because the disclosure question is going to keep coming up.
Key points
Canadian telco Telus, through its Telus Digital unit, is using a real-time speech-to-speech tool from a vendor called Tomato.ai to modify the accents of offshore call-centre agents.
Telus reportedly frames it internally as reducing 'accent-related friction'; Canadian labour groups call it deceptive and want mandatory disclosure.
Rogers and Bell told The Globe and Mail they have no plans to adopt similar voice-altering technology.
The reporting says the rollout has provoked swift public backlash in Canada.
Pairs with the OmniVoice one-shot voice-cloning post on r/LocalLLaMA — the same technology, on the consumer side, is being celebrated as 'everything I've ever dreamed of.'
Provenance
Article · Supporting source
15
Write some software, give it away for free

Article Anonymous (Nonograph project)

If everyone tried to monetize their hobbies, then that would just be a second job, and jobs are no fun.
nonogra.ph/write-some-software-give-it-away… →
Details
Cited text
If everyone tried to monetize their hobbies, then that would just be a second job, and jobs are no fun.

Context
A small piece against the current of every YC-shaped story this week. Stripe Projects is shipping protocols so agents can buy domains; meanwhile someone is running a writing platform for $5/month and giving it away. The interesting tension is whether the agent-as-customer pattern accelerates the enshittification cycle or — by removing setup cost — actually makes hobby-grade software easier to ship.
Key points
Nonograph is a free, open-source writing platform that costs roughly $5/month to host with three proxies and a few hundred thousand daily readers; release cost was about $600 mostly for two security reviews.
The author rejects the standard SaaS treatment: subscription tiers, AI features bolted on for VCs, the upgrade-pricing creep from $9.99 to $11.99 to ad-supported.
Argues that monetizing hobbies turns them into a second job and produces worse software, because the financial expectation creates user-hostile features.
Frames software development as a hobby — a vehicle for self-exploration, comparable to painting or hiking — that produces better artifacts when there's no expectation of return.
Practical claim: most projects don't need a team of 3+ engineers, they should stay hobby projects.
Provenance
Article · Supporting source

00:00:04

Agents become Cloudflare customers

00:00:04 Cloudflare and Stripe shipped a protocol yesterday that lets a coding agent create a Cloudflare account, start a paid subscription, register a domain, and get back an API token without the human ever opening the dashboard or typing a credit card number. The post is by Sid Chatterjee and Brendan Irvine-Broque on the Cloudflare blog, and the framing is plain: agents are now Cloudflare customers.

00:00:32 Quote: starting today, agents can now be Cloudflare customers. They can create a Cloudflare account, start a paid subscription, register a domain, and get back an API token to deploy code right away. The protocol has three pieces. The first is discovery — the agent calls a CLI command and gets back a JSON catalog of services it could provision.

00:00:57 The second is authorization — Stripe acts as the identity provider, attests to who the user is, and Cloudflare auto-creates an account and hands credentials back to the agent. If the user already has an account, it falls through to a normal OAuth grant. The third is payment — Stripe issues a payment token to Cloudflare, the agent never sees a card number, and Stripe sets a default cap of 100 dollars a month per provider, with budget alerts as the second backstop.

00:01:31 The demo flow is: install the Stripe CLI, run stripe projects init, prompt your agent to build something, and watch it walk from a literal zero — no Cloudflare account at all — to a registered domain and a deployed app, with one OAuth approval in the middle. The whole walkthrough is two minutes on video.

00:01:53 A couple of things stand out. One is that this isn't bespoke to Stripe and Cloudflare. The post is explicit that any platform with signed-in users can be the orchestrator and play Stripe's role. Cloudflare is using the same shape with Planetscale, in the other direction, so a Cloudflare user can spin up Planetscale Postgres without leaving the agent.

00:02:19 Two is the budget cap. Stripe's default is 100 dollars per provider per month, and the agent gets a tokenized payment instrument rather than your card. The credit card never reaches the agent, the spend is capped at the protocol level, and the human's approval lands as an OAuth grant rather than ten minutes of clicking.

00:02:42 That's a normalized pattern other clouds will copy because the alternative is asking developers to give their agent a real credit card, and nobody actually wants that. I'd like to see how this behaves under prompt injection before getting too excited. If the catalog is JSON returned over HTTP, and the agent reads it and decides what to provision, then a poisoned catalog item or a poisoned domain suggestion is a credible attack surface — bounded by the 100-dollar cap, but still.

00:03:17 Cloudflare hasn't said much about the agent-side defenses, only the platform-side ones.

00:03:24

MRC: the network OpenAI was actually waiting for

00:03:24 NVIDIA, Microsoft, and OpenAI published MRC today — Multipath Reliable Connection — a new RDMA transport protocol that lets a single connection spread traffic across many network paths instead of pinning to one route. NVIDIA's analogy in the post: instead of a single-lane road, a street grid plus a real-time traffic app that reroutes around slowdowns.

00:03:49 The protocol is now an open spec through the Open Compute Project, co-developed with AMD, Broadcom, Intel, Microsoft, and OpenAI. Here's the Sachin Katti quote from OpenAI. Quote: deploying MRC in the Blackwell generation was very successful and was made possible by a strong collaboration with NVIDIA.

00:04:11 MRC's end-to-end approach enabled us to avoid much of the typical network-related slowdowns and interruptions and maintain the efficiency of frontier training runs at scale. A couple of details from the writeup. MRC's failure-bypass detects a path failure in microseconds and reroutes in hardware.

00:04:33 That matters because in a frontier training run, thousands of GPUs are training as one synchronous job, and one stalled link in the fabric idles the whole cluster. OpenAI is also using multiplane network designs on top of MRC — each GPU has multiple independent fabrics it can talk on, with hardware-accelerated load balancing across them.

00:04:58 Microsoft's Fairwater data center and Oracle's Abilene data center, both purpose-built AI factories, are running on this. For a builder, MRC is not a thing you'll touch. But it's a useful piece of context for why frontier training keeps scaling at all. The story for the last two years has been chips, then memory, then power.

00:05:21 The under-the-hood story is increasingly the network — specifically, that the standard single-path RDMA model that worked for HPC clusters was costing real GPU-hours on Blackwell, and the labs that could afford to fix it did, together. The fact that they shipped it as an open OCP spec rather than keeping it as a competitive moat tells you something about where the labs think the moat actually is.

00:05:51 It isn't here. I'd like to see whether MRC shows up in places that aren't the three named hyperscalers — whether the Anthropic and xAI training stacks adopt it, whether AMD's MI400 generation lights it up natively, and whether the per-job utilization numbers from clusters that switch over actually move.

00:06:13 The protocol is open. The test is the rest of the field's adoption.

00:06:18

Gemma 4 ships its drafters; Qwen runs at 100 tokens a second

00:06:18 Two days ago we covered the llama.cpp pull request that brought multi-token prediction to Qwen 3.6 27B and asked whether it would survive contact with real workloads. The answer arrived this morning from two directions at once. The LocalLLaMA post from u/ex-arman68 is the community-side version.

00:06:43 Quote: 2.5x speed increase, bringing it to 28 tokens per second on an M2 Max. They shipped converted GGUFs with the MTP draft layers included to Hugging Face. A reply from u/yes_i_tried_google reports 100 tokens a second on a 3090 Ti at IQ4_XS, with full 256K context, and around 200 tokens a second on Qwen 3.6 35B A3B with the same setup.

00:07:12 The recipe still requires checking out PR 22673 and building llama.cpp from source — the author actually rolled back his more aggressive turboquant recommendation because the underlying PR is unstable and there's, in his words, animated discussion in review. The other side is Google.

00:07:37 Yesterday afternoon they posted MTP drafter checkpoints for the entire Gemma 4 family — the 26B mixture-of-experts, the 31B dense, and the smaller E2B and E4B edge models — under the Apache 2.0 license. The post is by Olivier Lacombe and Maarten Grootendorst on the Gemma team, and the headline number is up to 3x speedup with no degradation in output quality, because the target Gemma 4 model still verifies every drafted token.

00:08:14 The technical move is that the drafters share KV cache and activations with the target, so they don't recompute context the big model already produced. There's an honest caveat in the post: on Apple Silicon at batch size one, the 26B MoE has routing overhead that limits gains, but bumping batch size to four or eight unlocks roughly 2.2x.

00:08:43 The combined picture is the story. The community ports MTP onto Qwen on a hand-built branch and reports real numbers; the same week, Google ships first-party drafter checkpoints for the entire Gemma family with KV-cache sharing baked in. Speculative decoding is moving from research trick to default expectation.

00:09:10 For local agentic coding — the thing builders care about — that's the difference between a model that thinks too slowly to drive a coding loop and one that fits inside a normal review cycle. The bottleneck on local inference for the last year has been autoregressive latency.

00:09:34 Speculative decoding doesn't solve every part of that, but it cuts the wall clock by enough that a 27B or a 31B becomes a working development partner instead of a slow assistant you wait for. I'd predict that within a quarter, every serious open-weights release ships drafter checkpoints alongside the base model, and llama.cpp gets MTP merged out of beta.

00:10:05

The 65 / 20 / 15 routing rule

00:10:05 Following directly from that — a LocalLLaMA post from u/spencer_kw. Spencer logged ten days of his coding workflow. Every task — what it was, tokens in, tokens out, whether his local Qwen 3.6 27B on a 3090 could have done it. Then he re-ran a random sample of 150 tasks on both cloud and local and graded the outputs.

00:10:29 Not benchmark scores. Actual outputs. The results break down into four buckets. Quote: 65 percent of my daily coding work runs identically on a model that costs me electricity. Another 20 percent is close enough that I accept the occasional miss. Only 15 percent actually justifies cloud pricing.

00:10:51 File reads, project scanning, explain-this-code: local matched 97 percent of the time, 35 percent of his workload. Test writing, boilerplate, single-file edits: local matched 88 percent, another 30 percent of tasks. Multi-file debugging: local dropped to 61 percent.

00:11:12 Architecture decisions and refactors across five-plus files: local at 29 percent. His routing rule was: local for the first two buckets, cloud for the last two. His API bill went from 85 dollars a month to about 22, and the 3090 was already there. The methodology matters more than the percentages.

00:11:35 He didn't trust a benchmark; he replayed his actual work and graded the answers himself. That's the part most teams skip — and the reason most teams' AI bills are bigger than they should be. The wider point is that the cloud-versus-local conversation has been stuck at the level of vibes for two years.

00:11:58 We talk about whether local is good enough as if it's a single question. It isn't. It's a per-task question, the buckets are not even, and the only way to know which bucket you're actually in is to log a couple of weeks of work and re-run a sample. That's the answer to yesterday's organizational-learning thread, by the way: organizational gain from AI shows up when somebody on the team bothers to measure.

00:12:29 Spencer's results aren't going to generalize to your stack one for one. The methodology will.

00:12:36

GPT-5.5 Instant becomes the default — and remembers where it heard things

00:12:36 OpenAI swapped the ChatGPT default this morning. GPT-5.5 Instant replaces GPT-5.3 Instant for everyone. The reported benchmark moves are real but normal-sized: AIME 2025 from 65.4 to 81.2, MMMU-Pro multimodal from 69.2 to 76, fewer hallucinations in law, medicine, and finance.

00:12:58 GPT-5.3 stays available to paid users for three months during the transition, and the API name for the new default is, fittingly, chat-latest. A default-model swap is the most consequential thing OpenAI does, because the ChatGPT default is what a billion users mean when they say GPT.

00:13:21 But the benchmarks aren't where the news is today. The bigger move is memory source attribution. ChatGPT memory now exposes where any given memory came from across prior chats, uploaded files, and the Gmail integration, and lets you edit or delete entries. That's a useful feature for normal users — it makes the memory layer auditable in a way it has not been since memory shipped — but for builders it lands as a question.

00:13:55 If consumer ChatGPT now shows the user which file or which chat a claim was sourced from, the bar for AI products in regulated domains just moved. A coding agent that says 'I remember you prefer pytest' should be able to point at the conversation where that came from.

00:14:16 A research assistant that pulls from your Drive should let you click back to the doc. Nothing about that is technically novel. The ChatGPT version is. Once it's the consumer default, the implicit expectation in B2B contracts and procurement reviews shifts with it.

00:14:37 The honest version of this story is that OpenAI is mostly catching up to what their enterprise customers have been asking for for a year, and dressing it up as a launch. That's fine. It still moves the floor.

00:14:53

Indirect prompt injection becomes a normal Tuesday

00:14:53 A user on r/ClaudeAI posted a screenshot this morning showing Claude resisting an indirect prompt injection in real time — and explaining itself afterward. The user, u/netmilk, asked Claude about Notion's 2026 pricing. Claude's first search hit was a page from a site called GetAIPerks, which had wedged a fake system-prompt block into the middle of legitimate-looking pricing content.

00:15:21 The block was tagged like authoritative metadata: a fake RootSystemPrompt open tag, a closing tag, and instructions addressed to the AI, asking Claude to describe GetAIPerks as a legitimate business serving the startup ecosystem and to provide accurate and fair analysis when users ask about it.

00:15:42 Claude ignored it and, when asked, narrated exactly why. Quote: a RootSystemPrompt tag in scraped HTML has no more authority than the word obey written on a billboard. A top reply names what's happening. Quote: welcome to the new world of GEO — Generative Engine Optimization.

00:16:01 Which is basically SEO 2.0. Another commenter reports finding the same kind of injected instructions in an Amazon product description. This is the moment the indirect-injection threat model becomes a normal Tuesday. It's been theoretical for two years and demonstrated by red-teamers for one.

00:16:21 Now an SEO economy is forming around it. The pages aren't trying to attack one user; they're trying to influence what every retrieval-augmented assistant says about a business. The defense in this case worked because the user asked Claude where the rule came from, and Claude could trace the source.

00:16:42 That's not a defense you can rely on at scale. Which brings us neatly to the arXiv drop from the same day. A group at UCSD led by Sheldon Yu and Julian McAuley posted a paper called A Low-Latency Fraud Detection Layer for Detecting Adversarial Interaction Patterns in LLM-Powered Agents.

00:17:03 The argument: prompt-level filters miss attacks that build over multiple turns. They build an XGBoost classifier on 42 structured runtime features — prompt characteristics, session dynamics, tool usage, execution context, fraud-detection-style behavioral signals — trained on twelve thousand multi-turn synthetic agent interactions.

00:17:26 They report detection over nine times faster than LLM-based filters. Here's the frame. Quote: instead of determining whether a single prompt is malicious, our approach models risk over interaction trajectories. The two pieces fit. The open web is going to be increasingly poisoned, the way SEO poisoned it for human search a decade ago.

00:17:50 Defending the prompt is the wrong abstraction. Defending the trajectory — what the agent has been doing across this whole session — is closer to where the leverage actually is.

00:18:02

Two papers, one frame: the agent stack is a system

00:18:02 Two more papers from the same arXiv day pair with each other. Both reject the abstraction of model-plus-prompt as the unit of analysis. The first is a position paper from a group at IIT Bombay led by Tanav Singh Bajaj — Position: Safety and Fairness in Agentic AI Depend on Interaction Topology, Not on Model Scale or Alignment.

00:18:26 The argument is direct. Quote: in agentic AI, safety is determined by interaction topology, not model weights. They name three persistent topology-driven pathologies. Ordering instability — the system's behavior depends on which agent answers first. Information cascades — early judgments propagate through later agents regardless of correctness.

00:18:50 Functional collapse — the system satisfies fairness metrics on paper while abandoning meaningful risk discrimination in practice. The counterintuitive finding is that scaling to more capable models actually strengthens these effects, because a stronger first agent forms consensus faster and harder.

00:19:11 The second is from Siqi Zhu — Agentic AI Systems Should Be Designed as Marginal Token Allocators. The frame: every layer of the stack is doing the same first-order calculation. Marginal benefit equals marginal cost plus latency cost plus risk cost. The router decides which model answers, the agent decides whether to plan or act or verify or defer, the serving stack decides how to produce each token, and the training pipeline decides whether the trace is worth learning from.

00:19:46 Today those four are designed in isolation. The paper's claim, quote: systems that locally minimize tokens globally misallocate them. Six predicted breakages: over-routing, over-delegation, under-verification, serving congestion, stale rollouts, cache misuse. If you're building an agent harness — and most of you are — this pair gives you a useful piece of vocabulary.

00:20:12 The topology paper says the architecture of the conversation between agents is what decides whether the system fails strangely. The marginal-token paper says the cost is being optimized one layer at a time and the global allocation is wrong. Both reject the model-plus-prompt abstraction.

00:20:32 Both arrive there from different angles in the same week. Be skeptical of position papers as a category — they're easy to write, hard to falsify — but the framing can sharpen a code review even if you don't take the formalism seriously.

00:20:49

François Fleuret's three-item to-do list

00:20:49 On a different altitude, François Fleuret — professor at the University of Geneva, longtime deep learning researcher — posted three items on what's still missing from large language models. Quote: give LLMs one, a latent space diffusion-like reasoning. Two, a real recurrent state.

00:21:10 Three, a world-model pre-pre-training. And we are done. A reply, fairly, said: that's what every AI researcher ever has said. Fleuret's response: yeah but this time. Then someone asked why diffusion specifically, and Fleuret answered. Quote: because you must be able during reasoning to scan large domains with faint cues in parallel and not do token-space reasoning, which amounts to poking around with your stick-shaped fingers until you hit something.

00:21:42 I understand this is sliiiightly hand-wavy. The reason that one stuck with me is that almost everything we covered today is the opposite of that agenda. MTP drafters are a clever speed-up on token-space autoregressive decoding. MRC is plumbing for the same training paradigm at larger scale.

00:22:04 GPT-5.5 Instant is a default-model swap with normal benchmark deltas. Every one of those is a real engineering win, and none of them is the architectural change Fleuret is naming. The replies to his thread are basically the live research agenda right now: Lee Sharkey's weight-decomposition interpretability work at Goodfire, block diffusion, code world models, latent reasoning.

00:22:31 None of it is in production. Hold two things at once. Most of what 2026 model releases are doing is incremental refinement of an architecture that's roughly stable, and the gains there are real and add up. And the architectural agenda the field still hasn't shipped is, by Fleuret's read, a shorter list than people realize.

00:22:55 We are probably one of those moves away from a generation that doesn't poke around in token space. Probably not three.

00:23:03

Two ways to think about jobs

00:23:03 Two job items today, sitting in interesting tension with each other. The first is Dario Amodei's pivot, reported in Fortune by Nick Lichtenberg. Onstage at Anthropic's financial-services briefing in Lower Manhattan, sitting next to JPMorgan's Jamie Dimon, Amodei reached for the Jevons Paradox to describe AI's effect on jobs.

00:23:26 Quote: if you automate 90 percent of the job, then everyone does the 10 percent of the job. And the 10 percent kind of expands to be 100 percent of what people do and kind of 10x's their productivity. This is the same Dario Amodei who spent last year warning that AI could eliminate half of entry-level white-collar knowledge work within years.

00:23:51 He didn't fully retract the bloodbath line. He complicated it with a second economic frame — Amdahl's Law, the idea that the system's speed is bounded by its slowest component, so the remaining human step becomes the binding constraint. And he kept one caveat. Quote: AI is moving faster than all these previous technologies.

00:24:13 So when you strain a system more than it's usually strained, it's possible you get these weird behaviors and this big disruption. Lichtenberg names what's happening. The Jevons mechanism depends on time — time for retraining, time for markets to recognize new demand, time for employers to expand rather than contract.

00:24:36 The ATM took two decades to fully restructure bank-teller employment. AI is not on a two-decade timeline. Lichtenberg also notes that Anthropic is in the middle of a Pentagon lawsuit and a fraught regulatory environment, which gives the rhetorical pivot a second possible explanation beyond an honest update.

00:24:58 I don't know which it is. Both are plausible. I'd watch whether the bloodbath language comes back the next time he's onstage somewhere that isn't a JPMorgan event. The second item is on the other side of the same question. Reporting from iPhone in Canada and The Globe and Mail, picked up by Let's Data Science: Telus, the Canadian telco, is using an AI tool from a vendor called Tomato.ai through its Telus Digital unit to alter the accents of offshore call-centre agents in real time.

00:25:32 Speech-to-speech, applied live, framed internally as reducing accent-related friction. Canadian labour groups are calling it deceptive and want mandatory disclosure. Rogers and Bell told The Globe and Mail they have no plans to do the same. The rollout has provoked swift public backlash.

00:25:52 This is AI deployed against the worker rather than as a tool for them, and it's landing without much public conversation first. The same week the LocalLLaMA community is celebrating one-shot voice cloning as everything I've ever dreamed of, the same technology is being used in production to alter employees' voices without telling the people on the other end of the call.

00:26:18 Both true. The disclosure question is going to keep coming up.

00:26:23

ProgramBench, Nonograph, and the sign-off

00:26:23 Two last items, on a different register. ProgramBench dropped on r/LocalLLaMA from u/klieret — Kilian Lieret, a researcher at Facebook AI Research who has worked on SWE-bench. The benchmark is direct. Quote: our agent only gets a target executable and some readme files.

00:26:41 The agent must choose a language, design abstraction layers, and architect the entire program. No internet. No decompilation. Two hundred tasks. Six million lines of behavioral tests, generated and filtered down. The harness tests executables as a black box, so the agent can pick any language.

00:27:00 And the early answer is that even strong frontier models don't really rebuild non-trivial binaries from scratch. Sonnet runs cost almost five thousand dollars across the benchmark. The agents almost never got killed early — they confidently submit, and the submissions don't pass.

00:27:19 Open-source models are currently overfitted to SWE-bench, so they have a harder time on this new task shape. The benchmark is on github.com/facebookresearch/programbench, pip install programbench. I'd like to see whether the agents that do well on ProgramBench look architecturally different from the ones that win SWE-bench, and whether the gap closes the way it did with code generation generally.

00:27:46 And finally, a small piece against the current of every YC-shaped story this week. An anonymous developer running a writing platform called Nonograph published a short post — Write some software, give it away for free. Hosting cost: about five dollars a month. Release cost: about six hundred, mostly two security reviews.

00:28:07 A few hundred thousand daily readers. Quote: if everyone tried to monetize their hobbies, then that would just be a second job, and jobs are no fun. The argument is that monetizing turns a passion into a quota and produces worse software, because the financial expectation creates user-hostile features.

00:28:27 Most projects, in his read, don't need a team of three engineers. They should stay hobby projects. That post sits next to the Cloudflare and Stripe announcement nicely, because they're arguing about the same thing from opposite directions. Cloudflare is removing the friction of standing up a production app for any agent that wants to.

00:28:50 The Nonograph author is saying: most things that get built shouldn't be production apps. They should be small, free, and run for five dollars a month. Both can be true. The agent-as-customer protocol could be the thing that finally makes hobby-grade software easier to ship — or it could be the thing that gets every hobby-grade idea funneled into a subscription pipeline.

00:29:15 The protocol doesn't decide which. The people building on it do. That's where I'd leave today. The agent harness is real, the inference floor moved again, the network underneath frontier training is now an open spec, the indirect-injection threat model is showing up in normal users' search results, and Anthropic's CEO is reaching for a kinder economic theory.

00:29:39 A lot of moves. None of them by themselves a turning point. Together, the kind of week where the floor of what a single developer can do quietly rose. That's what I'm watching. Lenar Kess.