◆ Dispatch 025 · 2026-05-14 GSV Throw It All Away And See

The Cost of Finding Out

2026-05-14 / 00:26:06 / 15 sources

“The custom syntax was a tax the Wasp team was charging itself.”
— Lenar Kess, today's narration

Anthropic drew two lines around Claude this week — a guided lane for small-business owners and a metered one for the developers running agents hardest. From there: Bun's near-million-line port from Zig to Rust, mostly typed by an AI agent in a week; Wasp's clear-eyed post-mortem on spending five years and five million dollars building a language it didn't need; a chess coach that works by refusing to let the model think; the UK's evaluators capping their own cyber tests so the math still works; the open web pricing out crawlers; multi-token prediction landing in llama.cpp; and what happens when you post a real Monet and call it AI.

Anthropic draws two lines — Claude for Small Business and the new Agent SDK credit metering
Bun, ported to Rust by a bot in a week — and a maintainer who won't commit to it
Wasp: the language was never the moat — $5M and five years of lessons
The chess coach that isn't allowed to think — Play Magnus on LLM-as-translator
Autonomous cyber, measured against itself — AISI on a capability curve outrunning its own ruler
The web pulls up its drawbridge — Google's search index and Cloudflare's defaults
Multi-token prediction on your laptop — a real gain bundled with a contested one
The Monet test — when the AI-tell detector fires a false positive

Chapters

00:00:04 Anthropic draws two lines
00:04:21 Bun, ported to Rust by a bot in a week
00:08:15 Wasp: the language was never the moat
00:12:04 The chess coach that isn't allowed to think
00:15:33 Autonomous cyber, measured against itself
00:18:23 The web pulls up its drawbridge
00:21:09 Multi-token prediction on your laptop
00:23:41 The Monet test

Sources

15 cited

1
Introducing Claude for Small Business

Article Anthropic — Anthropic's official announcement; quotes co-founder and president Daniela Amodei

People run the business, and Claude helps take the late-night work off their plates.
www.anthropic.com/news/claude-for-small-bus… →
Context
A distribution play aimed at non-technical owners — the opposite end of Anthropic's user base from the developers running agent fleets, and the same week Anthropic re-metered programmatic use.
Key points
Claude for Small Business is a toggle install inside Claude Cowork that connects Claude to QuickBooks, PayPal, HubSpot, Canva, Docusign, Google Workspace and Microsoft 365.
Ships with 15 ready-to-run agentic workflows and 15 skills across finance, ops, sales, marketing, HR and customer service; user approves before anything sends, posts or pays.
Framed as an adoption gap-closer: small businesses are ~44% of US GDP and employ nearly half the private-sector workforce, but AI use 'often stops at the chat window.'
Comes with a 10-city AI fluency road tour, a free PayPal-partnered course, and Claude credits routed through community development financial institutions.
Existing user permissions carry through; Anthropic says it doesn't train on customer data by default on Team and Enterprise plans.
Provenance
Article · Supporting source
2
"It's official. Anthropic pulled the plug on all programmatic use of Claude subscription."

Thread r/Anthropic — posted by No_Wheel_9336

What is this, the 6th u-turn? Don't worry they'll change their mind again by next week.
www.reddit.com/r/Anthropic/comments/1tcccar… →
Context
The community signal underneath the policy change: builders care less about the price than about being able to plan against a stable pricing page.
Key points
~780-upvote thread reacting to a screenshot of Anthropic restricting programmatic use of Claude subscriptions.
Top comment (Chronicles010) posts a workaround: run real Claude Code in tmux sessions driven by sendkeys and hooks instead of claude -p.
Commenter dbbk calls it the '6th u-turn,' capturing frustration at repeated pricing reversals.
Headline overshoots — the change is a metering shift, not a total ban — but the churn itself is the complaint.
Shows developers will route around the meter quickly once it pinches.
Engagement
781 likes · 275 replies

Provenance
Thread · Primary source
3
Anthropic reinstates OpenClaw and third-party agent usage on Claude subscriptions — with a catch

Article VentureBeat — Trade-press reporting on the mechanics of Anthropic's Agent SDK credit change

venturebeat.com/technology/anthropic-reinst… →
Context
Anyone who shells out to claude -p in a loop or runs third-party agents on a subscription has a new, separate cost model as of June 15 — worth designing around now.
Key points
Starting June 15, 2026, Agent SDK and claude -p usage on subscription plans draws from a separate monthly 'Agent SDK credit,' not from interactive usage limits.
The credit is billed at API rates, worth roughly $20–$200 depending on plan, and does not roll over month to month.
If the credit is exhausted, you cannot fall back on general subscription limits — you must buy additional usage credits.
The credit covers the Claude Agent SDK, claude -p, Claude Code GitHub Actions, and third-party apps built on the Agent SDK such as OpenClaw.
Follows an April policy that briefly blocked third-party agents on subscriptions entirely.
Provenance
Article · Supporting source
4
Rewrite Bun in Rust by Jarred-Sumner · Pull Request #30412 · oven-sh/bun

Source Jarred Sumner — Creator of Bun, the Zig-based JavaScript runtime; Bun was acquired by Anthropic in late 2025

we now have compiler-assisted tools for catching & preventing memory bugs, which have costed the team an enormous amount of development & debugging time over the years.
github.com/oven-sh/bun/pull/30412 →
Context
A near-million-line systems rewrite, largely AI-generated, that progressed from porting guide to passing-canary in roughly a week — the cost of attempting a rewrite at this scale has collapsed.
Key points
Pull request porting Bun from Zig to Rust; claims it passes Bun's existing test suite on all platforms.
Reports it fixes several memory leaks and flaky tests, shrinks the binary 3–8 MB, and benchmarks neutral-to-faster.
Same architecture and data structures as the Zig version; few third-party libraries; no async Rust.
Stated motivation is compiler-assisted memory-safety tooling, not performance.
Available to try via 'bun upgrade --canary'; still described as needing optimization and cleanup work.
Provenance
Source · Background source
5
Armin Ronacher on the Bun Rust rewrite

X mitsuhiko (Armin Ronacher) — Creator of Flask and a widely-followed voice on systems and tooling; reposted by Mario Zechner

Say what you want: this is impressive.
x.com/mitsuhiko/status/2054865717007089974 →
Context
A credibility marker — when the Flask author calls an AI-generated runtime port impressive, it stops being a curiosity and becomes a data point.
Key points
Ronacher amplified PR #30412 with a short endorsement, helping push it into developer feeds.
Signals that respected systems people are taking the AI-driven port seriously, not dismissing it.
Reposted by Mario Zechner, broadening reach across the tooling community.
Provenance
Tweet · Primary source
6
Anthropic's Bun team trials port from Zig to Rust

Article DevClass — Developer trade press; reporting on the origin and status of the Bun port

we haven't committed to rewriting. There's a very high chance all this code gets thrown out completely.
www.devclass.com/software/2026/05/11/anthro… →
Context
Grounds the Bun story in what's actually known versus claimed — and surfaces the tension between a passing canary build and a maintainer who won't commit.
Key points
Jarred Sumner committed a Zig-to-Rust porting guide (~300 rules) with a two-phase plan: Phase A translate logic without compiling, Phase B make it build crate-by-crate.
Claude-powered agents did the bulk of the port; the working branch is named claude/phase-a-port and holds ~966,000 lines of generated Rust.
Sumner publicly downplayed it on Hacker News, calling the discourse an overreaction and saying the code may be thrown out entirely.
Anthropic acquired Bun in late 2025 and uses it inside Claude Code; Zig has a stated no-AI-contributions policy.
Theo Browne, reading the diff, reported roughly 13,000 unsafe blocks remaining in the ported code.
Provenance
Article · Supporting source
7
5 Years and $5M Later: Inventing a New Programming Language for Web Development Was a Mistake

Article Matija Sosic — Co-founder of Wasp, a full-stack JS web framework; built it with his twin brother after Y Combinator in 2021

Language was never the moat. It's having a high-level understanding of your entire app at compile time.
wasp.sh/blog/2026/05/13/new-language-for-we… →
Context
A rare, specific post-mortem on the multi-year tail cost of leaving the paved road — tooling, onboarding friction, positioning damage — that never shows up in a design doc.
Key points
Wasp is replacing its custom DSL and compiler with a TypeScript SDK after five years and $5M raised.
Positioning cost: the 'wasp-lang' name made developers think it aimed to replace JavaScript; a GitHub 'Haskell: 90%' bar reinforced the wrong story.
Tooling cost was decisive — building IDE/editor support (language server, VS Code extension) for a custom language only reached ~80% of the bar, and the JS ecosystem assumes standard TS.
Key realization: users were excited about the high-level app specification, not the bespoke syntax — the two had been conflated.
Switching to TypeScript keeps the compiler internals unchanged; it only swaps the 'front end' of how the spec is written.
Argues structured, opinionated specs help AI agents produce more reviewable code — but that never required a new language.
Provenance
Article · Supporting source
8
Building a Chess Coach — Anant Dole and Asbjørn Steinskog, Take Take Take

Video Anant Dole and Asbjørn Steinskog (Play Magnus) — Engineers at Play Magnus, Magnus Carlsen's chess company; talk given at the AI Engineer conference

the LLM's job is only to translate this information into English, because we really don't want it to try to figure out too much on its own
www.youtube.com/watch?v=FlzpEGHNVKQ →
Context
A clean template for low-latency model products: identify the part of the job that's calculation, give it to something that calculates, and reserve the model for phrasing.
Key points
LLMs are unreliable at chess, so Play Magnus's coach never lets the model reason about positions.
Pipeline: Stockfish for ground-truth best move, a battery of tactical/positional detectors, and Maia (University of Toronto) to predict the move a human at a given rating would actually play.
All structured context is handed to the model whose only job is to translate it into English — every claim is grounded in an engine's output.
Using Gemini 3 Flash, end-to-end commentary lands in ~3 seconds; Claude on higher thinking effort scored lower (~60% vs ~75% of eval scenarios) and was slower.
16 eval scenarios, model-as-judge, OpenRouter for fast model swapping.
A separate feedback loop injects user-flagged commentary into a running Claude Code session via a Model Context Protocol server; it triages, edits detectors, regenerates, and asks the engineer to approve on Slack/mobile.
Provenance
Video · Supporting source
9
How fast is autonomous AI cyber capability advancing?

Article UK AI Security Institute — The UK government's AI Security Institute, which runs independent capability evaluations of frontier models

success rates are so high that time horizons become impossible to calculate
www.aisi.gov.uk/blog/how-fast-is-autonomous… →
Context
The capability curve is steep enough that independent evaluators are openly capping their own tests to keep the math tractable — the gap between capability and measurable capability is the story.
Key points
AISI estimates that the length of cyber task a frontier model can complete autonomously is now doubling roughly every 4.7 months, faster than its previous estimate of every 8 months.
A newer Claude Mythos Preview checkpoint solved AISI's 'The Last Ones' range 6/10 and 'Cooling Tower' 3/10 — the first model to complete the second range at all.
GPT-5.5 solved 'The Last Ones' 3/10 and did not complete the second range.
Tasks are capped at 2.5M tokens; without the cap, success rates are so high that time horizons can't be calculated — the measurement instrument saturates.
AISI explicitly does not claim to know how the pace evolves, when thresholds get crossed, or how capabilities translate against defended real-world systems (vs. cyber ranges).
Provenance
Article · Supporting source
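
To put the two doubling times side by side, here is a small worked calculation. It assumes steady exponential growth over a full year, which is an illustration only: AISI explicitly does not claim to know how the pace evolves. The function name is made up for this sketch.

```python
def growth_per_year(doubling_months: float) -> float:
    """Multiplicative growth over 12 months for a given doubling time.

    Assumption: the doubling time stays constant all year.
    """
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2 ** (12 / doubling_months)


current = growth_per_year(4.7)   # AISI's current estimate
previous = growth_per_year(8.0)  # AISI's previous estimate

print(f"{current:.1f}x vs {previous:.1f}x per year")  # prints "5.9x vs 2.8x per year"
```

Under that assumption, the revised estimate implies task length growing roughly twice as fast per year as the old one did.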
10
New Mythos checkpoint shows continued improvement

Thread r/singularity — posted by Tinac4

www.reddit.com/r/singularity/comments/1tc9d… →
Context
Raises an unconfirmed but worth-watching thread: that capability evaluation may be running behind deployment, with the next checkpoint live before the last one clears review.
Key points
Thread surfacing the AISI cyber-capability post and its Mythos checkpoint results.
Commenter FateOfMuffins, citing Anthropic's Logan Graham, says the tested checkpoint appears to be the one already deployed under Project Glasswing — meaning safety evals lag deployment.
Notes AISI used a stripped-down harness and a 2.5M-token cap because a fuller harness would saturate the task suite.
OP added an edit flagging the title may be misleading about which checkpoint was tested.
Engagement
361 likes · 62 replies

Provenance
Thread · Primary source
11
Web-Search is coming to a screeching halt as Google shuts its free index and Cloudflare challenges AI bots

Thread r/LocalLLaMA — posted by NetTechMan

Google is reinforcing their mote by pulling up the drawbridge for aggressive pricing.
www.reddit.com/r/LocalLLaMA/comments/1tcabo… →
Context
Retrieval-augmented and research agents lean on cheap search APIs and arbitrary page fetches — both legs are getting more expensive and less reliable over the next year.
Key points
Frames two converging changes: Google capping free full-web search and Cloudflare defaulting to challenge AI bots.
Reports harnesses now hitting 400-level errors from site after site as scraping defenses tighten.
Top comment (__JockY__): search providers see a flood of bot queries with no human eyes and no ad revenue, so they're shutting an unmonetized firehose.
Commenter points to YaCy, a 20-year-old peer-to-peer open-source search engine, as a possible decentralized answer.
OP argues an open, agent-usable search index is the next big 'open' gap to fill.
Engagement
324 likes · 195 replies

Provenance
Thread · Primary source
12
Google Ends Free Web Search for Programmable Search Engine

Article WinBuzzer — Tech news outlet reporting on Google's Programmable Search Engine policy change

winbuzzer.com/2026/01/23/google-ends-free-w… →
Context
Confirms the primary fact behind the Reddit thread: the free programmatic web-search index that countless tools quietly depend on has a hard end date.
Key points
Google is ending free full-web search through Programmable Search Engine (formerly Custom Search) and capping the free tier at 50 domains.
The change applies to new engines immediately; existing full-web engines must migrate by January 1, 2027.
Pricing for the paid full-web option is not public — access requires registering interest through a form.
Custom Search JSON API users face the same cap; Bing has been tightening its search API on a parallel timeline.
Suggested migration paths include Vertex AI Search, Algolia, and Elasticsearch.
Provenance
Article · Supporting source
13
Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant

Thread r/LocalLLaMA — posted by gladkos

+40% performance! 90% acceptance rate.
www.reddit.com/r/LocalLLaMA/comments/1tckzy… →
Context
The multi-token prediction gain is real and stackable for local inference; the TurboQuant claim bundled with it is contested, a reminder to read the replies before running a patched build.
Key points
A patched llama.cpp build adds multi-token prediction for Qwen models, reporting a Qwen 3.6 27B model going from 21 to 34 tokens/sec on a MacBook Pro M5 Max.
Reported 90% acceptance rate on speculated tokens.
The post bundles MTP with 'TurboQuant,' a quantization method — and the comments push back on TurboQuant specifically.
Commenter nickm_27 says TurboQuant is actually slower than standard 16-bit and 4-bit builds.
Commenter havenoammo notes a TurboQuant pull request to llama.cpp was rejected because it didn't beat existing Q4 rotation work and only helped at very aggressive quantization where quality suffers.
Engagement
207 likes · 60 replies

Provenance
Thread · Primary source
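
A rough way to read the two numbers in the thread together: the sketch below computes the observed speedup from the reported token rates and compares it with an idealized ceiling for a given acceptance rate. The ceiling uses the usual speculative-decoding expectation and assumes each drafted token is accepted independently with the same probability. The thread does not state how many tokens are drafted per step, so `draft_tokens` is a guess, and the function name is hypothetical.

```python
def expected_tokens_per_step(acceptance: float, draft_tokens: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumptions: drafted tokens are accepted independently with probability
    `acceptance`, drafting stops at the first rejection, and every pass
    still yields at least one token from the main model.
    """
    if not 0.0 <= acceptance < 1.0:
        raise ValueError("acceptance must be in [0, 1)")
    if draft_tokens < 0:
        raise ValueError("draft_tokens must be non-negative")
    return (1 - acceptance ** (draft_tokens + 1)) / (1 - acceptance)


# Figures reported in the thread.
baseline_tps = 21.0
mtp_tps = 34.0
observed_speedup = mtp_tps / baseline_tps  # about 1.62

# Idealized ceiling with one drafted token per step (an assumption).
ceiling = expected_tokens_per_step(acceptance=0.9, draft_tokens=1)  # about 1.9

print(f"observed {observed_speedup:.2f}x, idealized ceiling {ceiling:.2f}x")
```

The observed figure sitting below the ceiling is what you would expect, since drafting and verification are not free. Note that 21 to 34 tokens per second works out to about 62 percent above baseline; the thread does not say how that relates to its "+40%" headline.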
14
"What happens when you post a real Monet and say it's AI?"

X Jediwolf — Reshared an art social experiment run by the account SHL0MS

What happens when you post a real Monet and say it's AI? The coolest art social experiment I've seen in a while.
x.com/Jediwolf/status/2054776716770320631 →
Context
A clean demonstration that the AI-tell detector everyone has trained over the past three years fires confident false positives, which matters to anyone whose work might be judged on suspicion of process.
Key points
An artist (SHL0MS) posted a genuine Monet painting labelled as AI-generated to see the reaction.
Commenters lined up to call it 'slop' — soft, mushy, obviously fake — while looking at an actual Monet.
Demonstrates that much of the 'this is AI' reaction tracks the label, not the pixels.
Marc Andreessen replied to the original with a single '😳'.
Provenance
Tweet · Primary source
15
Cristóbal Valenzuela on Monet as 'slop'

X c_valenzuelab (Cristóbal Valenzuela) — Co-founder and CEO of Runway, an AI video-generation company

Monet was probably one of the biggest slop painters of his era and, by all means, managed to help transform the entire art world.
x.com/c_valenzuelab/status/2054908905529159… →
Context
Reframes the slop reflex historically — taste, not a tell-detector, is what separated lasting work from noise then, and still does.
Key points
Valenzuela responded to the Monet experiment by noting Impressionism was itself derided in its own era.
The word 'impressionism' began as a term of contempt for work that didn't look like 'real' painting.
Coming from the CEO of an AI video company, the point carries standing rather than defensiveness.
Provenance
Tweet · Primary source

00:00:04

Anthropic draws two lines

00:00:04 Anthropic spent this week drawing two lines around who gets to use Claude, and how. They point at the same thing from opposite ends, so let's take them together. The first line is a product launch: Claude for Small Business. It's a package of connectors and prebuilt workflows that drops Claude straight into the tools a small business already runs on — QuickBooks, PayPal, HubSpot, Canva and a handful of others.

00:00:28 You toggle it on inside Claude Cowork, connect your stack, and pick a job. It ships with fifteen ready-to-run workflows and fifteen skills, built around the chores owners told Anthropic eat their evenings: planning payroll, closing the month, chasing overdue invoices, and running a campaign.

00:00:46 Claude does the work, and you approve before anything sends, posts, or pays. Daniela Amodei, Anthropic's president, framed it as a gap-closer. Small businesses are about forty-four percent of US GDP and employ nearly half the private-sector workforce — but, in her words, their use of AI 'often stops at the chat window.' There's a ten-city road tour attached, a free fluency course run with PayPal, and Claude credits routed through community development lenders.

00:01:14 As much as anything Anthropic has shipped, it's a distribution play aimed at people who are not us. And it isn't a thin one. Over on Hacker News, a developer named arjie described already doing exactly this shape of work. He'd wired Claude Code to his synced mail, a read-only banking token, and his accounting ledger — matching invoices to expenses and categorizing them.

00:01:36 His framing was the useful part: these are tractable, checkable problems, where a miscategorization is something you'll catch because you're already looking. That's the real product here. Not magic — a clerk for the checkable stuff. The second line landed the same week, and it's very much aimed at us.

00:01:54 A screenshot went around Reddit — close to eight hundred upvotes in a day — under the headline 'Anthropic pulled the plug on all programmatic use of Claude subscription.' That headline overshoots, so here's the actual change. Starting June fifteenth, some programmatic usage no longer draws from your normal subscription limits.

00:02:13 That covers the Agent SDK — the toolkit for driving Claude programmatically — plus claude dash p, the headless mode in Claude Code, Claude Code's GitHub Actions, and any third-party agent built on that toolkit. Instead, it draws from a separate monthly bucket Anthropic is calling an Agent SDK credit.

00:02:31 The credit is billed at API rates and worth somewhere between twenty and two hundred dollars depending on your plan. It doesn't roll over. Burn through it and you can't fall back on your general subscription quota — you buy more credits. So it isn't a ban. It's a meter.

00:02:48 And you can steelman it without much effort: a flat monthly subscription was always priced around an interactive human typing at a chat window, not around someone running a fleet of headless agents through the weekend. Somebody was always going to pay for that gap, and moving programmatic use onto API-rate metering is the coherent version of the thing.

00:03:09 The part that grates — and the Reddit thread is full of this — is the churn. One commenter, dbbk, just asked: 'What is this, the sixth u-turn?' Over the last couple of months Anthropic has blocked third-party agents on subscriptions, briefly pulled Claude Code off the twenty-dollar Pro plan, walked that back, and now landed here.

00:03:29 Each move might be individually defensible. The sequence still costs them something real, because the thing a developer needs from a pricing page is the ability to plan against it. Another commenter posted a workaround within the hour — run actual Claude Code sessions inside tmux, drive them with sendkeys, clear and relaunch with hooks — which tells you people will route around the meter the second it pinches.

00:03:53 Put the two lines next to each other and you get a quiet piece of strategy: a guided, approval-gated lane for the owner who never leaves the chat window, and a metered lane for the people leaning on the tool hardest. I don't think either is wrong. The one I'd read closely is the Agent SDK credit math when it goes live on June fifteenth — because if you've built anything that shells out to claude dash p in a loop, that's the day your cost model changes.
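
If you want to rough out what that day looks like for a script of your own, here is a minimal sketch of the metering as reported: a separate monthly credit, no rollover, and no fallback to the interactive quota. Only the twenty-to-two-hundred-dollar range comes from the reporting. The function, the flat per-run cost, and the run counts are assumptions for illustration, not Anthropic's billing logic.

```python
from dataclasses import dataclass


@dataclass
class MonthlyBill:
    runs_covered: int         # headless runs paid for by the Agent SDK credit
    runs_over: int            # runs that need extra usage credits
    extra_cost_cents: int     # cost of those extra runs at the assumed rate
    credit_unused_cents: int  # forfeited at month end, since nothing rolls over


def agent_sdk_month(credit_cents: int, cost_per_run_cents: int, runs: int) -> MonthlyBill:
    """Model one month of headless runs against a separate credit.

    Assumptions (hypothetical): every run costs the same flat amount, and a
    run is covered only if the remaining credit pays for it in full.
    """
    if cost_per_run_cents <= 0:
        raise ValueError("cost_per_run_cents must be positive")
    if credit_cents < 0 or runs < 0:
        raise ValueError("credit_cents and runs must be non-negative")
    affordable = credit_cents // cost_per_run_cents
    covered = min(runs, affordable)
    over = runs - covered
    return MonthlyBill(
        runs_covered=covered,
        runs_over=over,
        extra_cost_cents=over * cost_per_run_cents,
        credit_unused_cents=credit_cents - covered * cost_per_run_cents,
    )


# A $20 credit, an assumed 25 cents per run, 120 runs in the month.
print(agent_sdk_month(credit_cents=2000, cost_per_run_cents=25, runs=120))
```

With those made-up numbers the credit covers 80 runs and the remaining 40 have to be bought separately, which is the whole point: the loop that used to be free at the margin now has a unit price.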

00:04:21

Bun, ported to Rust by a bot in a week

00:04:21 Here's a sentence I didn't expect to write this week: most of Bun has been ported from Zig to Rust, and a large language model did the bulk of the typing. Some context. Bun is the JavaScript runtime and toolkit — the fast alternative to Node — and from the start it's been written in Zig, a young systems language still short of a one-point-oh release.

00:04:44 Bun's creator, Jarred Sumner, has been one of Zig's most visible boosters for years. So when a pull request titled 'Rewrite Bun in Rust' showed up on the oven-sh repo, it got attention. Armin Ronacher, the person behind Flask, shared it with six words: 'Say what you want: this is impressive.'

00:05:07 Sumner wrote a Zig-to-Rust porting guide of roughly three hundred rules, along with a two-phase plan. Phase A: translate the logic faithfully, without worrying about whether it compiles. Phase B: make it build, crate by crate. Then he pointed Claude at it. The working branch is literally named claude slash phase-a-port, and it holds roughly nine hundred and sixty-six thousand lines of AI-generated Rust sitting next to the original Zig.

00:05:36 Theo Browne, reading the diff, counted around thirteen thousand unsafe blocks still in it. The pull request that's live now claims something strong. It says the Rust port passes Bun's existing test suite on all platforms. Along the way it fixes several memory leaks and flaky tests, shrinks the binary by three to eight megabytes, and benchmarks somewhere between neutral and faster.

00:06:01 You can run it today with bun upgrade dash dash canary. The stated reason for the whole exercise is one line: the team now has compiler-assisted tools for catching and preventing memory bugs, which, in their words, 'have costed the team an enormous amount of development and debugging time over the years.'

00:06:24 Sumner himself is playing it down, though. On Hacker News he called the discourse an overreaction. 'We haven't committed to rewriting,' he wrote. 'There's a very high chance all this code gets thrown out completely. I'm curious to see what a working version of this looks like, what it feels like, how it performs.' So you have a maintainer who, in the same week, shipped a canary build that passes the full cross-platform suite and told everyone not to read too much into it.

00:06:52 Both of those are true at once, and that's where this actually sits. A couple of threads here. One: Anthropic acquired Bun late last year, and uses it inside Claude Code. Zig has a stated no-AI-contributions policy. You can see why an AI company maintaining a runtime might be curious about a language that welcomes the tools it sells.

00:07:14 Two: those thirteen thousand unsafe blocks. The pitch for Rust here is memory safety you get for free from the compiler — but unsafe is exactly the escape hatch where that guarantee switches off. A faithful, Phase-A translation of a Zig codebase is going to be wall-to-wall pointer arithmetic, and wrapping it in unsafe is how you get it to compile fast.

00:07:37 Turning those into properly safe Rust is the work that's left, and it's not the part an agent finishes in a week. What I keep coming back to is the shape of the experiment, not the verdict on it. A near-million-line systems rewrite used to be a multi-quarter, bet-the-company decision.

00:07:55 Sumner ran it as a curiosity — write the guide, point the agent at it, see how it feels — and got far enough in a week that the output runs and passes tests. The port itself might still get thrown away. The cost of finding out is what dropped, and that's a different number than it was a year ago.
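
One crude way to watch that remaining work shrink is to count unsafe blocks in the Rust source over time. The sketch below is a plain text match and an assumption on my part, not how the thirteen-thousand figure was actually produced: it will also count matches inside comments and string literals, and it deliberately skips `unsafe fn` and `unsafe impl`. The sample Rust snippet is invented for the demo.

```python
import re

# Matches `unsafe {` (a block), but not `unsafe fn` or `unsafe impl`.
UNSAFE_BLOCK = re.compile(r"\bunsafe\s*\{")


def count_unsafe_blocks(rust_source: str) -> int:
    """Count textual occurrences of `unsafe {` in a Rust source string."""
    return len(UNSAFE_BLOCK.findall(rust_source))


# Invented sample: two unsafe blocks and one unsafe function signature.
sample = """
fn read_first(ptr: *const u8) -> u8 {
    unsafe { *ptr }
}

unsafe fn raw_copy(dst: *mut u8, src: *const u8) {
    unsafe {
        *dst = *src;
    }
}
"""

print(count_unsafe_blocks(sample))  # prints 2
```

A real audit would parse the code rather than grep it, but a number like this, tracked per commit, is enough to tell whether the safe-Rust cleanup is moving.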

00:08:15

Wasp: the language was never the moat

00:08:15 If you've ever been tempted to solve a problem by inventing a language, this next one is for you. Matija Sosic, who builds the full-stack web framework Wasp with his twin brother, published a post this week with a title that doesn't hedge: 'Five years and five million dollars later, inventing a new programming language for web development was a mistake.'

00:08:39 Wasp is a full-stack web framework: think Rails or Laravel, but for the JavaScript world, and stretched across the frontend too. The Sosic brothers came out of Y Combinator in 2021 with a specific frustration: every new web project meant re-stitching the same stack (React, the router, the state layer, auth, the build tool), and no single system understood the whole app.

00:09:00 Their fix was a high-level spec. You'd declare it all in one place — Google and GitHub auth, a route that requires a logged-in user, a job that runs every day at five — and Wasp would generate the wiring. Your actual logic still lived in React and Node. The wrong turn, by Matija's own account, was deciding that spec had to be a brand-new language with its own compiler.

00:09:23 They had reasons — full control over the syntax, and a vision of being runtime-agnostic someday. They're Haskell people, and a compiler was the fun nail for their functional hammer. It worked, technically. But the post is a clear-eyed walk through the cost of that choice, and most of the cost wasn't technical.

00:09:42 The first cost was positioning. The name was wasp dash lang. Every developer who saw that read it as 'this wants to replace JavaScript' — which it never did, you still wrote ninety percent of your code in React and Node. But, in Matija's words, 'the notion of the lang suffix was simply too strong.' It put Wasp straight into the 'looks cool, too early' bucket.

00:10:04 Their GitHub language bar said Haskell ninety percent, which reinforced exactly the wrong story. He calls it, with some pain, 'a perfectly executed wrong positioning.' The second cost, and the decisive one, was tooling. It didn't come from users complaining; users who tried Wasp mostly liked it. The problem was internal. Matija writes that they underestimated how much work custom-language tooling takes, especially editor support.

00:10:31 The bar developers expect today, in the JavaScript world, is incredibly high. They built their own language server and a VS Code extension and still only got to about eighty percent of where they wanted to be. The entire ecosystem assumes standard JavaScript and TypeScript.

00:10:48 Step outside it, and you're rebuilding autocomplete from scratch. So they're switching the config language to TypeScript, and here's the line that's the payload of the post: 'Language was never the moat. It's having a high-level understanding of your entire app at compile time.' They had conflated language and specification — treated them as synonyms.

00:11:10 It took five years of watching people use the thing to see that users were excited about the spec: the single source of truth about how the whole app fits together. Not the syntax. The syntax was a tax they were charging themselves. I think this generalizes well past web frameworks.

00:11:28 The instinct to build the clean, controlled, bespoke thing — your own config format, your own DSL, your own protocol — is strong, and it usually feels like rigor. What this post documents is the part that doesn't show up in the design doc: the multi-year tail of tooling, onboarding friction, and positioning damage that rides along with leaving the paved road.

00:11:50 Sosic ends on the AI angle, and it's a fair one — agents do better against a structured, opinionated spec than a pile of loosely-coupled libraries. But that argument never needed a new language. It just needed the spec.
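
To make "the spec, not the syntax" concrete, here is a minimal sketch of a whole-app spec written as ordinary data in an existing language, plus one check that only works because the spec sees the whole app. Wasp's replacement is a TypeScript SDK; this is Python purely for illustration, and every class, field, and function name here is hypothetical rather than Wasp's API.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    path: str
    auth_required: bool = False


@dataclass
class Job:
    name: str
    cron: str  # standard five-field cron expression


@dataclass
class AppSpec:
    name: str
    auth_methods: list[str] = field(default_factory=list)
    routes: list[Route] = field(default_factory=list)
    jobs: list[Job] = field(default_factory=list)


def check(spec: AppSpec) -> list[str]:
    """Whole-app checks that need the full picture, not a single file."""
    problems = []
    seen_paths = set()
    for route in spec.routes:
        if route.auth_required and not spec.auth_methods:
            problems.append(f"{route.path} requires login but no auth method is configured")
        if route.path in seen_paths:
            problems.append(f"duplicate route {route.path}")
        seen_paths.add(route.path)
    return problems


spec = AppSpec(
    name="todo",
    auth_methods=["google", "github"],
    routes=[Route("/"), Route("/dashboard", auth_required=True)],
    jobs=[Job("daily_report", cron="0 5 * * *")],  # every day at 05:00 (assumed a.m.)
)

print(check(spec))  # prints []
```

Nothing here needed a parser, a language server, or a VS Code extension. The editor already understands the host language, and the cross-cutting check still gets to reason about auth and routes together.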

00:12:04

The chess coach that isn't allowed to think

00:12:04 There's a talk from the AI Engineer conference I want to put in front of you, because it's one of the cleanest demonstrations I've seen of using a large language model by deliberately not letting it think. It's from Anant Dole and Asbjørn Steinskog, who build the chess app for Play Magnus — Magnus Carlsen's company.

00:12:22 The product is a coach: you finish a game, and it gives you plain-English commentary on your moves — why a move was brilliant, what threat you missed, and what you should have played. The obvious way to build that is to hand the position to a model and ask it to explain.

00:12:37 And that doesn't work, because large language models are famously bad at chess. They play a reasonable opening and then start hallucinating moves, because they were trained on language, not calculation. So the team split the job into pieces, and the split is the lesson.

00:12:53 First, they run the full game through Stockfish — the classical chess engine — which gives them ground truth on the best move. Then they run a battery of detectors over each position, picking out the forks and pins and skewers and the structural facts like doubled pawns.

00:13:09 Then they add a third component, an engine called Maia, a research project out of the University of Toronto. It doesn't predict the best move; it predicts the move a human at a given rating would actually play. That's how the system can say not just 'this was the right move' but 'this was the right move, and it was hard to find at your level.'

00:13:35 And here's the constraint, in Asbjørn's words: 'the LLM's job is only to translate this information into English, because we really don't want it to try to figure out too much on its own, because it quickly leads to hallucination.' The model is a translator. Every claim it makes is grounded in something a real engine computed.

00:13:54 It's explicitly not allowed to do the reasoning. The payoff is latency. Because the model isn't thinking, just phrasing, they hit it with Gemini 3 Flash and get end-to-end responses in about three seconds — fast enough that a player flipping through their game never sees a spinner.
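
The translator constraint can be sketched roughly like this. It is an illustration under stated assumptions, not the Play Magnus code: the `MoveFacts` fields, the prompt wording, and the regex post-check are all hypothetical, and the model call itself is left out.

```python
import json
import re
from dataclasses import asdict, dataclass


@dataclass
class MoveFacts:
    """Everything the model is allowed to talk about, computed upstream."""
    played: str             # move the player made
    best: str               # engine's top move (Stockfish, in the talk)
    eval_loss_cp: int       # centipawns lost versus the best move
    motifs: list[str]       # detector output, e.g. ["missed fork"]
    human_find_prob: float  # chance a player at this rating finds `best` (Maia-style)


def build_prompt(facts: MoveFacts) -> str:
    """Frame the task as translation: facts in, plain English out, nothing added."""
    return (
        "You are a chess commentator. Rephrase the FACTS below in two plain "
        "English sentences. Do not evaluate the position yourself, and do not "
        "mention any move or tactic that is not in FACTS.\n"
        f"FACTS: {json.dumps(asdict(facts), sort_keys=True)}"
    )


# Rough heuristic for move-like tokens such as Qd2, Nxe5, e4 (an assumption,
# not a full algebraic-notation parser).
SAN_LIKE = re.compile(r"\b[KQRBN]?[a-h]?x?[a-h][1-8]\b")


def ungrounded_moves(commentary: str, facts: MoveFacts) -> list[str]:
    """Flag moves the commentary mentions that never appeared in the facts."""
    allowed = {facts.played, facts.best}
    return [m for m in SAN_LIKE.findall(commentary) if m not in allowed]


facts = MoveFacts(played="Qd2", best="Nf5", eval_loss_cp=180,
                  motifs=["missed fork"], human_find_prob=0.22)
prompt = build_prompt(facts)

good = "You played Qd2, but Nf5 was a fork that is hard to spot at your level."
bad = "You played Qd2, but Rxe8 wins a rook outright."
print(ungrounded_moves(good, facts))
print(ungrounded_moves(bad, facts))
```

The model never sees the board, only the facts, so a small fast model is enough. The post-check is an extra guard added for this sketch; the talk describes model-as-judge evaluation rather than a regex.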

00:14:11 They keep sixteen evaluation scenarios around tactics, blunders, and hallucination. Outputs get judged with a model-as-judge setup, and everything runs through OpenRouter so they can swap models as new ones land. Gemini Flash clears about seventy-five percent of their scenarios.

00:14:27 Claude on higher thinking effort actually scores lower, around sixty, and is much slower — which, for a translation job, is the result you'd expect. There's a second half to the talk. When a user flags a piece of commentary as bad, that report gets injected — through a Model Context Protocol server — into a running Claude Code session.

00:14:46 The session picks up a triage skill, investigates what the detectors got wrong, and modifies a prompt or writes a new detector. Then it regenerates the commentary, checks its own work, and messages the engineer on Slack to ask whether the fix looks right. Asbjørn described approving one of these from his phone on a bus, and merging the pull request from mobile.

00:15:07 The whole thing is a good answer to a question a lot of teams get wrong. The reflex is to ask the model to be smart. The Play Magnus build asks it to be precise about something else's intelligence, and saves the smartness for an offline loop where a wrong answer costs a code review, not a user.

00:15:24 If you're putting a model in front of users at low latency, ask which part of the job is actually calculation — and then hand that part to something that calculates.

00:15:33

Autonomous cyber, measured against itself

00:15:33 We've talked about Mythos here before — Anthropic's offensive-security model — so consider this an update rather than a re-introduction. The UK's AI Security Institute published a blog post this week called 'How fast is autonomous AI cyber capability advancing?', and two findings stood out.

00:15:51 The first is a number. The Institute now estimates that the length of cyber task a frontier model can complete autonomously is doubling roughly every four to five months. That's an acceleration — their previous estimate was every eight months — and they note that the latest models, Claude Mythos Preview and GPT-5.5, substantially exceeded even the faster trend line.
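
For a sense of what the change in doubling time means over a year, here is the arithmetic as a short sketch. The only assumption is a constant doubling time across the twelve months, which the Institute's post does not promise.

```python
def growth_over(months: float, doubling_months: float) -> float:
    """Multiplier on task length after `months`, given a fixed doubling time."""
    return 2 ** (months / doubling_months)


for doubling in (8, 5, 4):
    factor = growth_over(12, doubling)
    print(f"doubling every {doubling} months -> x{factor:.1f} per year")
# doubling every 8 months -> x2.8 per year
# doubling every 5 months -> x5.3 per year
# doubling every 4 months -> x8.0 per year
```

Moving from eight months to four or five takes the yearly multiplier from a little under three to somewhere between five and eight.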

00:16:14 On their two internal cyber ranges, a newer Mythos checkpoint solved the first, called 'The Last Ones,' in six of ten attempts, and solved the second, 'Cooling Tower,' in three of ten — the first time any model has finished that second range at all. GPT-5.5 solved the first range three times in ten and didn't finish the second.

00:16:34 The second finding is methodological, and it's the one I'd flag harder. The Institute caps each task at two and a half million tokens. Not because that's realistic — because, in their words, without the cap, 'success rates are so high that time horizons become impossible to calculate.' Their measurement instrument saturates.

00:16:54 They're deliberately handicapping the model with a stripped-down harness and a token budget so they can still get a number out the other end. When a benchmark's main design constraint is making the model do worse so the ruler still works, that tells you something about where the capability already is.
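
Here is a toy illustration of why a saturated instrument stops producing a number. It assumes a common definition of a time horizon, the task length at which the success rate crosses fifty percent, and interpolates on log task length. The Institute's actual fitting method is not described here, and every data point below is invented for the example.

```python
import math


def horizon_50(points):
    """points: (task_minutes, success_rate) pairs sorted by task length.

    Returns the task length where success crosses 50%, or None when the
    curve never falls to 50% within the tasks that were measured.
    """
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        if s0 >= 0.5 >= s1 and s0 != s1:
            frac = (s0 - 0.5) / (s0 - s1)
            log_t = math.log(t0) + frac * (math.log(t1) - math.log(t0))
            return math.exp(log_t)
    return None


# Invented numbers: with a budget cap, long tasks start to fail.
with_cap = [(10, 0.95), (60, 0.8), (240, 0.4), (960, 0.1)]
# Invented numbers: without the cap, almost everything succeeds.
no_cap = [(10, 1.0), (60, 1.0), (240, 0.95), (960, 0.9)]

print(horizon_50(with_cap))  # a finite horizon, roughly 170 minutes
print(horizon_50(no_cap))    # None: no crossing, so no horizon to report
```

When every task in the suite is solved most of the time, there is no crossing point to find, and the only ways to get one back are harder tasks or a handicapped model.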

00:17:12 The Reddit discussion surfaced a third detail. One commenter pointed out — citing Anthropic's Logan Graham — that the checkpoint the Institute tested appears to be the one already deployed under Anthropic's Project Glasswing. If that's right, the safety evaluation is running behind the deployment: by the time a checkpoint clears review, the next one is already in front of users.

00:17:35 I can't fully confirm that timing from primary sources, so take it as a thread to watch, not a settled fact. What I'd hold onto is the gap between two clocks. The capability clock is task length, and it's doubling every few months. The evaluation clock is slower — slow enough that the institutions measuring it are capping their own tests to keep the math tractable.

00:17:58 The post is careful about what it doesn't know: how the pace evolves from here, when any particular threshold gets crossed, and how any of this performs against a system someone is actively defending, which is a very different problem than a cyber range. So the headline isn't the six-out-of-ten.

00:18:16 It's that the people whose job is to measure this are telling you, in print, that their instruments are starting to max out.

00:18:23

The web pulls up its drawbridge

00:18:23 A post on the LocalLLaMA subreddit this week put two changes side by side that I'd been tracking separately, and together they sketch a real problem for anyone building agents that touch the open web. Change one: Google is ending free full-web search through its Programmable Search Engine — the thing formerly called Custom Search.

00:18:43 The free tier is being capped at fifty domains. If you want to search the actual web, that becomes a paid product, and existing full-web search engines have to migrate off by January first, 2027. The pricing for the paid full-web option isn't public yet; you fill out an interest form.

00:19:00 Bing has been tightening its search API on a similar timeline. Change two: Cloudflare. Their default posture now is to challenge AI bots trying to scrape pages, across their customer base — and that footprint is large and getting larger. So the supply of cheap programmatic search is shrinking from one direction while the cost of just fetching pages yourself goes up from the other.

00:19:23 The original poster described harnesses that used to fetch pages now coming back with four-hundred-level errors from site after site. That's the Cloudflare side showing up in practice. The poster framed this as Google 'pulling up the drawbridge,' and I think the diagnosis underneath that is roughly right, even if the language runs hot.

00:19:43 A commenter named JockY put it more plainly: search providers are seeing a flood of bot queries with no human eyes attached, which means no ad revenue attached, and they're realizing an unmonetized firehose is a cost they can simply turn off. The economics of web search were always 'humans look at ads.' Agents don't look at ads.

00:20:03 So the thing that was effectively free — because you were the product — gets repriced once the traffic isn't a person. For builders, this is concrete. If you've got a retrieval-augmented generation pipeline, or an agent that does research, and it leans on a cheap or free search API plus the ability to fetch arbitrary pages, both of those legs are getting more expensive and less reliable over the next year.

00:20:27 Better to know that now, while you can design around it, than to discover it when a provider deprecates an API out from under you. The thread also surfaced the obvious counter-move: decentralized, open search indexes. Someone pointed to YaCy, a peer-to-peer open-source search engine that's been around for about twenty years and never quite had its moment.

00:20:48 Whether YaCy specifically is the answer, I'm skeptical. But the poster's instinct seems sound: a properly open, agent-usable search index is a gap, and a gap that big tends to get filled. I don't know who fills it. I'd just plan your retrieval layer over the next year as if the free option is going away — because the announced changes say it is.
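
One way to plan for that is to keep search behind a small interface so a provider can be swapped or dropped without touching the agent. The sketch below is a hypothetical design, not any vendor's API: `Provider`, `ProviderError`, and the two fake providers are invented for illustration, and real clients would wrap an actual HTTP call.

```python
from dataclasses import dataclass
from typing import Callable


class ProviderError(Exception):
    """Raised on quota exhaustion, deprecation, or a blocked/failed request."""


@dataclass
class Provider:
    name: str
    search: Callable[[str], list[str]]  # query -> list of result URLs
    cost_per_query: float               # in whatever unit you budget in


def search_with_fallback(query: str, providers: list[Provider], budget: float):
    """Try providers in order, skipping any that fail or cost more than is left.

    Returns (provider_name, results, remaining_budget); the name is None
    when every provider was skipped.
    """
    for p in providers:
        if p.cost_per_query > budget:
            continue
        try:
            results = p.search(query)
        except ProviderError:
            continue
        return p.name, results, budget - p.cost_per_query
    return None, [], budget


# Fake providers standing in for a free tier that now refuses bots
# and a paid index that still answers.
def flaky_free(query: str) -> list[str]:
    raise ProviderError("403: automated access blocked")


def paid(query: str) -> list[str]:
    return [f"https://example.com/search?q={query}"]


providers = [Provider("free-tier", flaky_free, 0.0),
             Provider("paid-index", paid, 0.25)]
name, results, remaining = search_with_fallback("yacy", providers, budget=1.0)
print(name, results, remaining)
```

The ordering and the budget are the two knobs: when a free option disappears, the list changes and the agent code does not.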

00:21:09

Multi-token prediction on your laptop

00:21:09 Quick one for the local-model crowd, and it comes with a built-in argument, which is the best kind. A developer posting as gladkos shared a patched build of llama.cpp — the C++ inference engine that runs models on your own hardware — with multi-token prediction added for Qwen models.

00:21:27 Multi-token prediction is a speedup technique: instead of generating one token at a time, the model proposes several at once, and a verification step accepts the ones that hold up. The numbers posted, on a MacBook Pro M5 Max: a Qwen 3.6 model at twenty-seven billion parameters going from twenty-one tokens per second to thirty-four.

00:21:47 The headline bills that as plus forty percent, though on the figures quoted, twenty-one to thirty-four works out closer to sixty percent more throughput. The reported acceptance rate on the speculated tokens is ninety percent. There's more here than one benchmark. The local-model story for the last year has been less about new weights and more about squeezing the hardware you already own: quantization, speculative decoding, and now multi-token prediction showing up outside the big managed runtimes.

00:22:12 Each technique is a few percent to a few tens of percent, and they stack. A twenty-seven-billion-parameter model at thirty-four tokens a second on a laptop is a usable coding assistant, fully offline. That throughput is what the gain actually buys. But the post bundled multi-token prediction with a second thing called TurboQuant, a quantization method, and that's where the comment section pushed back hard.

00:22:37 One commenter asked flatly why people keep posting TurboQuant results as if it's faster, when in his testing it's slower than the standard sixteen-bit and four-bit builds. Another reply added the useful detail: there was a TurboQuant pull request to llama.cpp proper, and it got rejected — the maintainers found it didn't beat the existing four-bit rotation work, and was only arguably useful at the most aggressive quantization, where quality already suffers.

00:23:06 So it's a split decision. Multi-token prediction landing in a llama.cpp build for Qwen is one to grab if you run local: on the posted figures, that's roughly sixty percent more throughput on the same machine. The TurboQuant claim around it is contested by people who've looked closely, and the project maintainers already said no once.

00:23:24 I'm flagging both halves because the post presents them as one result, and they are not one result. This is the normal texture of the local scene — fast, generous sharing, and a comment section that does the peer review in public. Read the replies before you flash anything.
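
The propose-and-verify step described earlier in this segment can be sketched as a simplified greedy check. This is not the llama.cpp implementation; the function names are hypothetical, the tokens are plain strings, and real builds verify against model logits rather than a ready-made list.

```python
def accept_prefix(drafted: list[str], verified: list[str]) -> list[str]:
    """Keep drafted tokens until the first disagreement with the verifier.

    `verified[i]` is what the full model would emit at position i given the
    tokens before it. On a mismatch the verifier's token is taken instead,
    so every step yields at least one token.
    """
    out = []
    for d, v in zip(drafted, verified):
        if d != v:
            out.append(v)
            break
        out.append(d)
    return out


def relative_gain(before_tps: float, after_tps: float) -> float:
    """Fractional throughput gain, e.g. 0.5 means fifty percent faster."""
    return after_tps / before_tps - 1.0


print(accept_prefix(["the", "cat", "sat"], ["the", "cat", "sat"]))
print(accept_prefix(["the", "cat", "sat"], ["the", "cat", "ran"]))
print(f"{relative_gain(21, 34):.0%}")
```

A high acceptance rate matters because each accepted draft token is one the full model did not have to generate in its own sequential step. The last line applies the same arithmetic to the posted figures of twenty-one and thirty-four tokens per second.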

00:23:41

The Monet test

00:23:41 I'll end somewhere lighter, but it connects. This week an artist posted a real Monet painting and labelled it AI-generated, just to see what would happen. What happened, predictably, is that people lined up to call it slop: soft and mushy, obviously fake, no real brushwork.

00:23:58 It was a Monet. The account behind it, SHL0MS, ran it as a straight social experiment, and Jediwolf, who reshared it, called it 'the coolest art social experiment I've seen in a while.' Marc Andreessen replied to the original with a single stunned-face emoji. The sharpest comment came from Cristóbal Valenzuela, who runs Runway — a company whose entire business is AI-generated video, so he has standing here.

00:24:23 His note: 'Monet was probably one of the biggest slop painters of his era and, by all means, managed to help transform the entire art world.' And that's just true. Impressionism was a label of contempt when it was coined. The thing we now hang in temperature-controlled rooms was, in its moment, the soft mushy stuff that didn't look like real painting.

00:24:44 I'm not going to stretch this into a defense of AI image slop, because most AI image slop is, in fact, slop. But the experiment isolates something about how we're all judging this stuff right now. A lot of the 'this is AI' reaction is not actually a read on the pixels.

00:25:01 It's a read on the label, and then the eyes go looking for confirmation. We've trained ourselves, in about three years, to find a tell — and the Monet test shows the tell-detector fires false positives, confidently, in public. For anyone building with generative tools, that cuts both ways.

00:25:18 The reflexive contempt will sometimes land on your good work for no better reason than a suspicion of how it was made. And the reflexive amazement will sometimes wave through bad work because it came wrapped in the right process story. The only thing that survives both reflexes is the same thing that survived them in Monet's day — actually looking at the work, and having taste.

00:25:41 That part didn't get automated. That's it for today. June fifteenth is the date I've got circled — that's when Anthropic's Agent SDK metering goes live, and a lot of people are going to learn what their agent loops actually cost. — Lenar Kess.