◆ Dispatch 029 · 2026-05-17 GSV Bring Your Own Numbers
Bring Your Own Numbers
“The agents do the typing. The humans do the specification.”
— Lenar Kess, today's narration
A Sunday show about doing your own arithmetic. Mustafa Suleyman gives the white-collar tier eighteen months, in a piece whose own counter-data sits two paragraphs down. The State of Brand argues every AI subscription is a subsidized loss-leader two weeks away from a forcing function. William Angel runs the tokens-per-hour math on an M5 MacBook Pro and finds OpenRouter cheaper. Frederick Vanbrabant uses The Goal to explain why agents move the bottleneck rather than break it. Marlene Mhangami's Playwright talk shows the cleanest pattern for tests AI should write. Calif's public M5 / MIE exploit write-up lands. Artem Loenko explains why every chat UI keeps ending up on a browser engine. And Luke Lanchester's MCP hello page is the small fix I most enjoyed this week.
- Suleyman's eighteen-month claim, and the article that knocks it down
- The subsidy that ends on June 1
- Apple Silicon costs more than OpenRouter
- Frederick Vanbrabant on why AI doesn't speed up your process
- Marlene Mhangami: tests AI writes vs tests AI verifies
- Calif on the five-day M5 / MIE exploit, in their own words
- Artem Loenko: native all the way, until you need text
- Luke Lanchester's MCP hello page
Chapters
- 00:00:04 Eighteen months, and the article that knocks it down
- 00:03:01 The subsidy that ends on June 1
- 00:06:50 Apple Silicon costs more than OpenRouter
- 00:09:28 AI doesn't speed up your process — it moves the bottleneck
- 00:12:18 Tests AI writes, tests AI verifies
- 00:15:26 The five-day exploit, in their own words
- 00:18:19 Native all the way, until you need text
- 00:21:18 The MCP hello page
Sources
8 cited
1
Microsoft AI chief gives it 18 months — for all white-collar work to be automated by AI
Article Jake Angelo — Fortune staff writer; piece is a May 16 re-publication of a Feb 13, 2026 story.
"Creating a new model is going to be like creating a podcast or writing a blog."
fortune.com/article/why-microsoft-ai-chief-…
- Context
- A senior figure restating the maximalist 18-month timeline, in a piece that includes the empirical pushback in the same article, is a useful artifact for calibrating between AGI marketing and what is shipping on production codebases.
- Key points
- Mustafa Suleyman, CEO of Microsoft AI, told the Financial Times that 'most tasks that involve sitting down at a computer' will be fully automated within 12-18 months — naming accounting, legal, marketing, and project management.
- Fortune's own piece notes the take 'hasn't aged well': a 2025 Thomson Reuters report on professional services found only marginal productivity gains, and a METR study found AI made software developer tasks take 20% longer.
- Apollo Global Management economist Torsten Slok found Big Tech Q4 2025 profit margins up 20%+ while the broader Bloomberg 500 saw almost no change — investors don't expect AI earnings outside tech.
- Challenger Gray & Christmas counts ~49,135 AI-related job cuts so far in 2026; Microsoft itself let go 15,000 in 2025 with Nadella citing a 'new era.'
- Anthropic's Dario Amodei walked back his 2025 'half of entry-level white-collar jobs' warning; the prediction drumbeat is starting again.
- Provenance
- Article · Supporting source
2
Every AI Subscription Is a Ticking Time Bomb for Enterprise
Article The State of Brand
"They are selling enterprises filet mignon at gas station hot dog prices and calling it a business model."
www.thestateofbrand.com/news/ai-subscriptio…
- Context
- If a team built workflows on $20-a-month seats over the last two years, the gap between subscription price and true cost is the same line item every engineering org should be modeling against now.
- Key points
- The piece argues every major AI lab is running a coordinated loss-leader: $20/mo Claude Pro or ChatGPT Plus seats whose actual API-rate burn for a moderate user is $200-400/mo.
- Cites GitHub Copilot reportedly losing $20+/user/mo on a $10 plan; power users hitting $80/mo of compute on $10 subscriptions; Anthropic users consuming ~$8 of compute per $1 of subscription revenue.
- GitHub Copilot is moving to usage-based 'AI Credits' billing on June 1, 2026 — GitHub's own announcement attributes the change to agentic usage becoming the default.
- Quotes OpenAI VP Product Nick Turley calling subscription pricing something they 'stumbled into,' comparing flat plans to 'unlimited electricity.'
- Anthropic at ~$30B annualized revenue (up from $9B end of 2025); OpenAI ~$25B and projecting $115B cumulative cash burn through 2029, with $665B committed compute spend by 2030; Oracle taking $43B of debt in a year to build OpenAI's data centers.
- Provenance
- Article · Supporting source
3
Apple Silicon costs more than OpenRouter
Article William Angel — Offline Agentic Coding series, part 3.
"For apple silicon, the hardware cost dominates."
www.williamangel.net/blog/2026/05/17/offlin…
- Context
- Counterintuitive math for builders weighing on-device inference: depreciation, not electricity, sets the floor — and hosted APIs still win on raw cost per token until human time enters the calculation.
- Key points
- A 14-inch MacBook Pro with M5 Max and 64GB lists at $4,299; amortized over 3, 5, and 10 years that is ~$0.16, $0.10, and $0.05 per hour.
- Electricity at $0.18/kWh over 50-100W is ~$0.01-0.02/hr — hardware depreciation dominates, not power.
- At 10-40 tokens/sec on Gemma 4 31B, that works out to roughly $1.50-$4.79 per million tokens on the pessimistic end and $0.40-$1.20 on the optimistic end.
- OpenRouter serves Gemma 4 31B at ~$0.38-0.50 per million tokens and at 60-70 tok/s — 3-7x the local throughput.
- Author's bottom line: on the M5 Max MacBook Pro, local inference runs ~3x the cost of OpenRouter from an accounting view, and the speed gap is bigger than the cost gap.
- Provenance
- Article · Supporting source
4
I don't think AI will make your processes go faster
Article Frederick Vanbrabant — Enterprise architecture and product strategy blogger.
"Bottlenecks should receive predictable, high-quality inputs."
frederickvanbrabant.com/blog/2026-05-15-i-d…
- Context
- A useful reframe for any team modeling AI as a throughput multiplier: the bottleneck moves to scoping, and the math only works if the scoping work itself improves.
- Key points
- Re-reads The Toyota Way and Eli Goldratt's The Goal to argue most process optimization misses where the actual constraint sits.
- Software development is rarely slow because of typing speed; it's slow because of vague requirements and back-and-forth with domain experts.
- AI-generated code does not collapse a scoping phase; if anything the documentation phase grows because the agent needs every detail spelled out.
- Closing argument: handing humans the same depth of feature/scope documentation that agents need would produce comparable productivity gains.
- Cites The Mythical Man-Month — adding people (or AI seats) to a constrained bottleneck does not unblock it.
- Provenance
- Article · Supporting source
5
Beyond Code Coverage: Functionality Testing with Playwright
Video Marlene Mhangami — Senior developer advocate at Microsoft and GitHub, core AI group.
"Clean code bases amplify AI gains; unchecked AI in a codebase is going to amplify entropy."
www.youtube.com/watch?v=FWEInOtngmM
- Context
- A working pattern for the most common AI-code failure: a test suite the model wrote to confirm what the code does instead of what the user needs the code to do. The Playwright MCP loop is one of the more concrete answers shipping right now.
- Key points
- GitHub saw ~1 billion commits in 2025; COO Kyle Daigle has said the platform is now seeing ~275 million commits per week, which extrapolates to ~14 billion by end of 2026.
- Cites a Stanford study of 120,000 developers: a team that ran unchecked AI saw PR throughput rise but effective output rise only ~1%, with refactor and rework eating the gains.
- Warns AI-generated unit tests are often self-affirming — the suite goes green while the user-facing behavior stays broken.
- Argues for behavior-first TDD with Playwright: agents generate failing end-to-end tests against expected user behavior, then code to make them pass, then humans spend the most time on refactor.
- Demos Playwright MCP server plus 'Playwright agents' — a planner, generator, and healer agent set — driving the red-green-refactor loop from GitHub Copilot CLI.
- Provenance
- Video · Supporting source
6
First public macOS kernel memory corruption exploit on Apple M5
Article Calif — AI-assisted offensive security shop pairing senior reverse engineers with Anthropic's Mythos Preview.
"Apple built MIE in a world before Mythos Preview."
blog.calif.io/p/first-public-kernel-memory-…
- Context
- Pays off Friday's follow-up question — the actual write-up is now public. The pairing claim (model finds the bug class, human carries the novel mitigation bypass) is the load-bearing detail for anyone modeling the next year of vuln research.
- Key points
- A data-only kernel local-privilege-escalation chain on macOS 26.4.1 (25E253) on bare-metal M5 hardware with kernel MIE enabled — first public exploit surviving Apple's Memory Integrity Enforcement.
- Bruce Dang found the bugs on April 25; Dion Blazakis joined April 27; Josh Maine built tooling; a working exploit landed by May 1 — five days from first bug to root shell.
- Mythos Preview identified the bugs (known classes generalize well); humans carried the MIE bypass because the mitigation is new and there are no exploit patterns to copy.
- Reported in person at Apple Park instead of via the submission queue — laser-printed 55-page report, full technical details after Apple ships a fix.
- Includes the line 'Apple spent $5 billion building this office, then asked about our office. We said, well, ours definitely cost less than $1 billion' — Calif's framing of small-team AI-assisted offense vs platform-vendor defense.
- Provenance
- Article · Supporting source
7
Native all the way, until you need text
Article Artem Loenko — Native macOS / iOS developer of ~20 years.
"If you want to build rich text rendering for long-form chats, SwiftUI and Apple's native SDKs are not helping you. They stop being an advantage and start becoming constraints."
justsitandgrin.im/posts/native-all-the-way-…
- Context
- For builders shipping AI chat UIs on the desktop, this is the underexplained reason almost every model client ends up on a browser engine. The dominant interaction pattern of the era happens to be the one the native SDKs handle worst.
- Key points
- A senior Apple-platform developer walks through trying to ship a streaming Markdown chat UI in pure SwiftUI: users cannot select a whole Markdown document built from SwiftUI primitives, and that is by design.
- NSTextView with TextKit 2 lets you select but loses SwiftUI tooling and spikes CPU on streamed text; NSCollectionView is mature but cells blink during streaming, by design.
- Pure TextKit 2 prototypes work but lose context menus, dictionary lookup, accessibility — months of work to reach baseline native parity.
- WebKit Markdown rendering works well; an Electron prototype reaches better text behavior and typography out of the box than the pure TextKit 2 build.
- Conclusion: chat as an interface pattern is web-native today; native SDKs are not the win for streaming Markdown surfaces.
- Provenance
- Article · Supporting source
8
MCP Hello Page
Article Luke Lanchester — Software engineer running HybridLogic; ships an MCP server for the author's day-job product.
"It's not working though because they need to paste it into their client of choice, but no-one thinks that far ahead."
www.hybridlogic.co.uk/blog/2026/05/mcp-hell…
- Context
- A small, observable, builder-led fix for a real onboarding cliff in MCP — and a useful tell that the spec is being adopted faster than its first-mile UX has caught up.
- Key points
- Customers open his mcp.acme.com/mcp URL in a browser, get a 401 JSON blob, and file support tickets saying the link is broken.
- The fix: when the request is GET /mcp and the Accept header includes text/html but not application/json or text/event-stream, return a plain HTML page explaining what an MCP server is and what to do with the URL.
- Result: ticket volume on the issue dropped sharply, with no observable downside.
- Author calls the MCP spec 'an utterly terrible attempt at a specification' for not anticipating this; argues the pattern should be in the spec.
- The packaging alternative — building a connector or plugin per LLM client — is described as 'slow, painful, and a never-ending game of whack-a-mole.'
- Provenance
- Article · Supporting source
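The content-negotiation rule described in the MCP Hello Page key points can be sketched as a small predicate. This is a minimal sketch of the rule as summarized above, not Lanchester's implementation: the function name `wants_hello_page` is hypothetical, and the Accept parsing is a simplification that ignores q-values and wildcards.

```python
def wants_hello_page(method: str, path: str, accept_header: str) -> bool:
    """Return True when a request looks like a person opening the MCP URL in a browser.

    Hypothetical helper: the rule comes from the post's description, the
    parsing below is a deliberately naive comma split.
    """
    if method.upper() != "GET" or path != "/mcp":
        return False
    # Keep only the media type of each Accept entry, dropping parameters such as ";q=0.9".
    media_types = {
        part.split(";")[0].strip().lower()
        for part in accept_header.split(",")
        if part.strip()
    }
    machine_types = {"application/json", "text/event-stream"}
    return "text/html" in media_types and not (media_types & machine_types)


# A typical browser navigation versus an MCP client asking for JSON or an event stream.
browser_accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
client_accept = "application/json, text/event-stream"

print(wants_hello_page("GET", "/mcp", browser_accept))
print(wants_hello_page("GET", "/mcp", client_accept))
```

When the predicate is true, the server would return the explanatory HTML page; in every other case the request falls through to the normal MCP handling, so existing clients see no change.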
Eighteen months, and the article that knocks it down
00:00:04 Mustafa Suleyman is the CEO of Microsoft AI. He told the Financial Times earlier this year that most tasks involving — in his words — sitting down at a computer will be automated within twelve to eighteen months. He named accounting, legal, marketing, and project management as in scope.
00:00:21 The Fortune piece that ran yesterday is a re-publication of that February conversation, and Fortune itself spends a good chunk of the article explaining why the take hasn't held up. The counter-data sits under the same byline. A 2025 Thomson Reuters report on lawyers, accountants, and auditors found marginal productivity gains from targeted use, such as document review and routine analysis, and nothing close to mass displacement.
00:00:47 A METR study on software developers, which we've talked about before on this show, found AI assistance made the workers' tasks take twenty percent longer on average. Torsten Slok at Apollo Global Management ran the margin numbers for fourth-quarter 2025. Big Tech profit margins grew more than twenty percent, while the broader Bloomberg 500 saw almost no change.
00:01:09 Slok also reported that investors don't expect AI to lift earnings outside the tech sector at all. There are some real layoff numbers in the article too. Challenger, Gray and Christmas counts about forty-nine thousand AI-related job cuts so far in 2026. Microsoft itself let go fifteen thousand workers in 2025, in a round that Nadella tied to, quote, reimagining our mission for a new era.
00:01:33 But that isn't the same shape as the eighteen-month claim. Suleyman is forecasting a near-total replacement of office cognition. What's actually visible is targeted automation, modest productivity, and selective headcount cuts at companies already restructuring.
00:01:49 I don't think Suleyman is being cynical here. I think he believes it. The Microsoft AI charter he's articulating — creating a new model is going to be like creating a podcast or writing a blog — is a worldview, not a quarterly forecast. And it's a worldview I find exciting in places, because if even part of it lands, the creative ceiling for an individual builder goes up a lot.
00:02:13 What I'm skeptical about is the calendar. The eighteen months in early 2025 from Dario Amodei was for half of entry-level white-collar jobs, and Amodei has since walked it back. The eighteen months in early 2026 from Suleyman is for the whole white-collar tier.
00:02:29 The number stays the same. The goalpost moves outward. So when you hear the headline today, here's a useful tell: the article you're reading probably already includes its own qualifier two paragraphs down. That's a pretty good clue about where the consensus actually sits right now.
00:02:46 Most of the show today is about what AI actually costs in dollars and time, and what hands-on builders are finding when they sit down with the tools. There's a loose connection between most of the items, and I'll let you draw it on your own.
The subsidy that ends on June 1
00:03:01 The most interesting business writing I read this weekend is a piece from The State of Brand titled "Every AI Subscription Is a Ticking Time Bomb for Enterprise." It's long, and there's a sentence in it that stuck with me. Quote: They are selling enterprises filet mignon at gas station hot dog prices and calling it a business model.
00:03:20 The argument runs like this. Claude Pro is twenty dollars a month. Sonnet 4.6 on the API is three dollars per million input tokens and fifteen per million output. A knowledge worker uploading documents and running analyses through Claude for a couple of hours a day burns through several million tokens a week.
00:03:38 At API rates, that workload runs two hundred to four hundred dollars a month per seat. On a Pro plan, the company pays twenty. The piece pulls in the supporting numbers from across the industry. Microsoft was reportedly losing more than twenty dollars per user per month on GitHub Copilot back when it was a ten-dollar product, with power users consuming eighty dollars of compute.
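To make the per-seat gap concrete, here is a rough sketch of the arithmetic. The per-million-token rates and the seat price are the ones quoted above; the weekly token volumes are hypothetical, picked only to show how a workload can land in that range, and `monthly_api_cost` is an illustrative helper rather than anything from the article.

```python
# Sonnet 4.6 API rates as quoted in the episode, in dollars per million tokens.
INPUT_RATE = 3.00
OUTPUT_RATE = 15.00
SEAT_PRICE = 20.00  # Claude Pro, dollars per month

WEEKS_PER_MONTH = 52 / 12


def monthly_api_cost(input_mtok_per_week: float, output_mtok_per_week: float) -> float:
    """Dollars per month at API rates for a weekly volume given in millions of tokens."""
    weekly = input_mtok_per_week * INPUT_RATE + output_mtok_per_week * OUTPUT_RATE
    return weekly * WEEKS_PER_MONTH


# ASSUMPTION: 10M input and 2M output tokens per week is a made-up workload.
cost = monthly_api_cost(input_mtok_per_week=10, output_mtok_per_week=2)
print(f"API-rate cost: ${cost:.2f}/month, {cost / SEAT_PRICE:.1f}x the seat price")
```

With those assumed volumes the sketch comes out at about $260 a month, thirteen times the seat price. Real workloads will differ, so the useful part is swapping in measured token counts.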
00:04:00 One widely-cited analysis found Anthropic users consuming about eight dollars in compute for every dollar of subscription revenue. Nick Turley, OpenAI's VP of Product, has described their subscription pricing as something they, quote, stumbled into, and floated phasing out unlimited plans altogether, comparing them to unlimited electricity.
00:04:20 Then comes the agentic shift, which is where the math goes from bad to unworkable. Claude Code sessions run autonomously for long stretches. Users have reported burning through five-hour rate-limit windows in under ninety minutes. GitHub announced that Copilot moves to usage-based AI Credits billing on June 1, in just over two weeks.
00:04:39 GitHub's own announcement attributes the change to agentic usage becoming the default and producing higher compute and inference demands. Sam Altman said publicly that OpenAI now needs to become, in his words, an AI inference company, which is an interesting way to describe a company that already sells inference.
00:04:57 There's a longer arc here too. Anthropic is at roughly thirty billion dollars in annualized revenue, up from nine billion at the end of 2025. OpenAI is at about twenty-five. OpenAI is projecting one hundred and fifteen billion in cumulative cash burn through 2029 and has committed to six hundred and sixty-five billion in compute spending by 2030.
00:05:17 Oracle took on forty-three billion of debt in a single fiscal year to build out OpenAI's data centers. That infrastructure is financed on the assumption that revenue will eventually cover costs. Right now it doesn't, and the article's claim is that going public is the forcing function.
00:05:34 Public markets won't tolerate the unit economics. The subsidies end when the IPO bell rings. The reason this matters for builders is concrete. Last Wednesday I talked about Anthropic's Agent SDK metering and the question of how programmatic users get billed when humans aren't in the loop.
00:05:50 The State of Brand piece is the same pressure from the other direction. If you've built a workflow on twenty-dollar seats over the last two years, the gap between your subscription line item and the actual unit cost of the work is the same number every CFO is about to be made aware of.
00:06:07 What I'd actually do, if it were my budget: model out what your AI line looks like at two-x, five-x, and ten-x current prices. Make sure no single vendor's pricing change can blow up the quarter. And have the conversation with finance before finance has it with you.
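That stress test is simple enough to sketch. The vendor names and monthly figures below are hypothetical placeholders, and the 50% concentration threshold is an arbitrary assumption for illustration, not a rule from the piece.

```python
# ASSUMPTION: hypothetical monthly AI spend per vendor, in dollars.
current_spend = {"vendor_a": 4_000, "vendor_b": 1_500, "vendor_c": 500}

MULTIPLIERS = (2, 5, 10)
# ASSUMPTION: flag any vendor carrying more than half of the total line.
CONCENTRATION_LIMIT = 0.5


def stress_test(spend: dict[str, float], multipliers=MULTIPLIERS) -> dict[int, float]:
    """Total monthly spend if every vendor's price rose by each multiplier."""
    total = sum(spend.values())
    return {m: total * m for m in multipliers}


def concentrated_vendors(spend: dict[str, float], limit: float = CONCENTRATION_LIMIT) -> list[str]:
    """Vendors whose share of the total exceeds the limit."""
    total = sum(spend.values())
    return [name for name, amount in spend.items() if amount / total > limit]


for multiplier, total in stress_test(current_spend).items():
    print(f"{multiplier}x prices: ${total:,.0f}/month")
print("Over the concentration limit:", concentrated_vendors(current_spend))
```

The second function is the "no single vendor can blow up the quarter" check: any name it returns is a vendor whose pricing change alone would move most of the line.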
00:06:22 The piece quotes Brian Jabarian, an economist at the University of Chicago who consults with companies on AI adoption, with the cleanest framing. Quote: The time for the bill is going to come. I don't think it'll come everywhere at once. I do think the GitHub date — June 1, 2026 — is going to be the first concrete data point a lot of teams have on what their actual usage costs, and the spread between subscription pricing and metered pricing is going to surprise people.
Apple Silicon costs more than OpenRouter
00:06:50 While we're on costs, here's a counterintuitive one. William Angel has a short post in his Offline Agentic Coding series, and the headline is, Apple Silicon costs more than OpenRouter. His 14-inch MacBook Pro with the M5 Max and sixty-four gigs of RAM lists for about four thousand three hundred dollars.
00:07:06 He amortizes that over three, five, and ten years, and gets a per-hour hardware cost of sixteen cents, ten cents, and five cents respectively. Electricity is almost noise. At fifty to one hundred watts under inference load, and Northern Virginia's roughly eighteen cents per kilowatt-hour, you're paying about one to two cents an hour for power.
00:07:26 Hardware depreciation dominates the cost stack, not electricity. The tokens-per-hour math is where it lands. He's seeing ten to forty tokens per second on Gemma 4, the 31-billion-parameter version, which he describes as close to Sonnet-level performance. At ten tokens per second on the pessimistic three-year hardware life, that works out to about four dollars and seventy-nine cents per million tokens.
00:07:49 At forty tokens per second on a ten-year hardware life, it's about forty cents. OpenRouter serves the same Gemma 4 model at around thirty-eight to fifty cents per million tokens, and providers there are getting sixty to seventy tokens per second — three to seven times faster than the local Max.
00:08:05 Angel's bottom line is that local inference on the Pro Max runs about three times the cost per token of OpenRouter from an accounting view, and the speed gap is the bigger problem. For an employee on a work laptop whose salary is a thousand times the cost of the tokens they can generate locally, throwing money at Anthropic, in his words, makes more sense.
00:08:26 I find this useful because the local large-language-model conversation usually skips over the depreciation line entirely. The implicit math people do is, I bought the laptop anyway, the tokens are free. Angel makes you count the hardware against the throughput, and the answer is that today, on consumer Apple silicon, the hosted API still wins on raw cost per token until you start putting human time in the equation.
00:08:50 He ends on a line I liked: It's still wild that a consumer device can run models that are close to Anthropic Sonnet levels of performance. Both things can be true at once. The cost picture argues against running it locally, and the capability picture is still extraordinary.
00:09:05 One small caveat: this is a 31-billion-parameter model in May 2026. The cost curve will keep falling for both local and hosted, but the speed gap is structural. Datacenter GPUs with high-bandwidth memory will keep beating laptops at single-stream throughput for a while.
00:09:21 If you're picking between the two for an agent loop where every second matters, the answer right now is the hosted endpoint.
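The cost arithmetic in this chapter can be reproduced in a few lines. This is a rough sketch, not Angel's own calculation: the function name is hypothetical, and two inputs are assumptions, namely around-the-clock utilization (which reproduces the quoted per-hour hardware figures) and a 50-watt draw (the low end of the quoted range).

```python
HOURS_PER_YEAR = 24 * 365


def cost_per_million_tokens(
    hardware_price: float,
    lifetime_years: float,
    watts: float,
    price_per_kwh: float,
    tokens_per_second: float,
) -> float:
    """Dollars per million tokens, assuming the machine runs inference 24/7."""
    hardware_per_hour = hardware_price / (lifetime_years * HOURS_PER_YEAR)
    power_per_hour = (watts / 1000) * price_per_kwh
    tokens_per_hour = tokens_per_second * 3600
    return (hardware_per_hour + power_per_hour) * 1_000_000 / tokens_per_hour


# Quoted inputs: $4,299 list price, $0.18/kWh, 10 to 40 tokens/sec.
# ASSUMPTION: 50 W draw and continuous use over the whole hardware life.
pessimistic = cost_per_million_tokens(4299, 3, 50, 0.18, 10)
optimistic = cost_per_million_tokens(4299, 10, 50, 0.18, 40)

print(f"3-year life, 10 tok/s: ${pessimistic:.2f} per million tokens")
print(f"10-year life, 40 tok/s: ${optimistic:.2f} per million tokens")
```

Under those assumptions the sketch gives roughly $4.79 per million tokens for the pessimistic case and roughly $0.40 for the optimistic one, in line with the figures quoted above. Any idle time makes the hardware share per token larger, which only strengthens the depreciation point.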
AI doesn't speed up your process — it moves the bottleneck
00:09:28 Frederick Vanbrabant published a piece a couple of days ago titled, I don't think AI will make your processes go faster, which made the front page of Hacker News this morning. Frederick writes about enterprise architecture and product strategy, and he re-read two old classics for the post — The Toyota Way, and Eli Goldratt's The Goal — and applied them to the AI-build-out conversation.
00:09:50 The setup is a Gantt chart of a typical software project: scoping, exploration, development, deployment, and hyper-care. Development is the longest bar. If your job is to speed the project up, you go look at development. That's correct. Where people go wrong, he says, is at the next step.
00:10:07 They either throw more people at it — which Brooks already covered in The Mythical Man-Month — or they assume AI is going to collapse the development bar without anything else changing. Vanbrabant's quick redraw is that the development bar does shrink with AI in the loop, and then the scoping and documentation bar gets much wider, because the agent needs everything spelled out.
00:10:29 The total project length improves, but not by as much as the marketing claims, and almost all of the new effort lands on the domain experts who have to write the spec. His direct quote, lifted from Goldratt: bottlenecks should receive predictable, high-quality inputs.
00:10:45 That's the line I'd put on a Post-it next to most agent setups. The argument generalizes. Every team that's serious about agent-driven development is converging on something like this. Brian Scanlan at Intercom — in the piece I covered yesterday on doubling engineering throughput with Claude Code — describes a similar shift in where senior time goes.
00:11:05 The agents do the typing. The humans do the specification. Vanbrabant's punchline is the one I keep coming back to. He says if you handed human developers the same depth of feature and scope documentation that an agent needs to do good work, you would also see human productivity skyrocket.
00:11:22 Software development was never primarily slow because of typing speed. It was slow because the question wasn't well-posed. What bugs me about a lot of AI-doubles-productivity claims, including the bullish ones I find believable, is that they tend not to control for that.
00:11:38 When a senior developer pairs with a coding agent for three hours and produces five well-tested PRs, some of that delta is the model. Some of it is that the senior was forced to think carefully about each task before talking to the model. Doing this decomposition properly is harder than the headline, and Vanbrabant's piece is a good reminder.
00:11:58 I don't fully agree with him on one thing. He treats the AI productivity argument as something close to zero-sum — that humans plus better specs would close most of the gap. I think that understates how much faster the typing-and-revising loop gets when you don't have to type or revise.
00:12:15 But the directional point is right, and the bottleneck does move.
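A toy version of the Gantt redraw shows why a big development speedup turns into a modest project speedup. Every duration below is a made-up number of weeks chosen only to show the shape of the argument; none of it comes from Vanbrabant's post beyond the stage names.

```python
# ASSUMPTION: all durations are hypothetical weeks, not data from the post.
before = {"scoping": 3, "exploration": 2, "development": 10, "deployment": 1, "hyper-care": 2}

# With an agent in the loop: development shrinks, scoping and documentation widen.
after = dict(before, scoping=6, development=4)


def total_weeks(plan: dict[str, float]) -> float:
    """Project length when the stages run one after another."""
    return sum(plan.values())


def longest_stage(plan: dict[str, float]) -> str:
    """The stage that dominates the schedule."""
    return max(plan, key=plan.get)


dev_speedup = before["development"] / after["development"]
project_speedup = total_weeks(before) / total_weeks(after)

print(f"development speedup: {dev_speedup:.1f}x")
print(f"project speedup: {project_speedup:.1f}x")
print(f"longest stage before: {longest_stage(before)}, after: {longest_stage(after)}")
```

In this made-up plan development gets 2.5 times faster while the project gets 1.2 times faster, and the longest bar moves from development to scoping.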
Tests AI writes, tests AI verifies
00:12:18 Marlene Mhangami gave a talk at AI Engineer Singapore titled, Beyond Code Coverage: Functionality Testing with Playwright. She's a senior developer advocate at Microsoft and GitHub in their Core AI group, which looks at how developers use AI across the products.
00:12:33 She opens with a number. GitHub saw about a billion commits in 2025, which was already their biggest year ever. Their COO, Kyle Daigle, recently said the platform is now seeing about two hundred and seventy-five million commits per week, which extrapolates to roughly fourteen billion by the end of 2026.
00:12:51 A growing share is co-authored by agents. Claude co-signs commits. Copilot does. Codex doesn't, but there's enough wording in the diffs to track it indirectly. So she asks the question that follows: does any of this code translate into productivity? She points to a Stanford study of a hundred and twenty thousand developers that found the answer depends entirely on the codebase.
00:13:12 Clean codebases amplify AI gains. Unchecked AI in a messy codebase amplifies entropy. The talk shows a case study where one team's PR throughput went up substantially after adopting an AI tool, and effective output — net of refactoring and rework — rose about one percent.
00:13:27 Her recommendation is for behavior-first test-driven development with Playwright, and she's quite specific about why. AI-generated unit tests tend to be self-affirming. They assert that the function does what the function does. The test suite goes green, and the user-facing behavior is still broken.
00:13:44 This is, in my experience, the single most common pathology of an agent that's been told to increase test coverage without further instruction. The pattern she walks through is the red-green-refactor cycle, but with the labor split differently. The agent writes a Playwright end-to-end test against the expected user behavior first, in the red phase.
00:14:04 The agent then writes the code to make the test pass, in the green phase. And the human spends the most time on refactor, because that's where the quality lives. She demos it live with the Playwright MCP server and GitHub Copilot CLI, picking up a Microsoft 365 email about a feature request through a skill called Work IQ, generating the failing test, generating the implementation, and watching Playwright drive the browser through the search bar, the category filter, and the price range.
00:14:32 Two of her closing pointers stuck with me. First, commit your code before letting the agent fix the failing tests, because if it doesn't commit, it may not remember what changed. Second, generate one feature per test — the trigger for writing a new test is no longer a new method on a class, it's a new behavior in the spec.
00:14:50 I think this is one of the more practical patterns I've seen for the AI-generates-tests problem. It doesn't solve it. The agent can still write a Playwright test that confirms the wrong behavior, because the spec was wrong, and Vanbrabant's whole point from the last chapter applies.
00:15:07 But it does close the most common false-positive — the test that asserts the implementation it just generated. If you're trying this, the Playwright agents the Microsoft team ships — a planner, a generator, and a healer agent set — install as agent dot M-D files into your repo, and they're the lightest way I've seen to get the loop running.
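The difference between a self-affirming test and a behavior-first test is easy to show in miniature. The talk uses Playwright end-to-end tests against a running app; this stand-in uses plain functions so it runs without a browser, and the catalog, the function, and the inclusive-range spec are all hypothetical.

```python
def filter_by_price(products, low, high):
    """Hypothetical implementation with a bug: the upper bound should be inclusive."""
    return [p for p in products if low <= p["price"] < high]


CATALOG = [
    {"name": "mug", "price": 10},
    {"name": "lamp", "price": 25},
    {"name": "desk", "price": 50},
]


def self_affirming_test() -> bool:
    # Expected value was copied from the function's own output, so the bug is baked in.
    expected = [{"name": "mug", "price": 10}, {"name": "lamp", "price": 25}]
    return filter_by_price(CATALOG, 10, 50) == expected


def behavior_test() -> bool:
    # Expected value comes from the assumed spec: a 10 to 50 range shows items priced 10 through 50.
    names = [p["name"] for p in filter_by_price(CATALOG, 10, 50)]
    return names == ["mug", "lamp", "desk"]


print("self-affirming test passes:", self_affirming_test())
print("behavior test passes:", behavior_test())
```

The first test goes green because it asserts what the code already does. The second fails, which is the red phase doing its job: it was written from the expected user behavior before looking at the implementation.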
The five-day exploit, in their own words
00:15:26 On Friday I talked about Calif and Mythos Preview's five-day macOS kernel exploit on Apple's M5 silicon. The write-up is now public, and a few specifics from their own post are worth getting on record. The chain is a data-only kernel local-privilege-escalation exploit on macOS 26 point 4 point 1 — that's build twenty-five-E-two-fifty-three — on bare-metal M5 hardware with kernel MIE enabled.
00:15:50 MIE is Memory Integrity Enforcement, Apple's hardware-assisted memory safety system built on ARM's Memory Tagging Extension, which Apple spent five years and probably billions of dollars building. According to Apple's own research, MIE disrupts every public exploit chain against modern iOS, including the leaked Coruna and Darksword kits.
00:16:10 The timeline from the Calif post is tight. Bruce Dang found the bugs on April twenty-fifth. Dion Blazakis joined Calif on April twenty-seventh, and Josh Maine built tooling. By May first, they had a working exploit. The implementation path involves two vulnerabilities and several techniques, starts from an unprivileged local user, uses only normal system calls, and ends with a root shell.
00:16:34 The split-of-labor claim is what I find most interesting. Quoting Calif directly. Mythos Preview is powerful: once it has learned how to attack a class of problems, it generalizes to nearly any problem in that class. Mythos discovered the bugs quickly because they belong to known bug classes.
00:16:51 But MIE is a new best-in-class mitigation, so autonomously bypassing it can be tricky. This is where human expertise comes in. End quote. That's the cleanest articulation I've seen of how this pairing actually works today. The model is fast and generalizing over known vulnerability classes.
00:17:09 The novel-mitigation bypass — inventing a new technique against a defense that's never been broken in public — is still on the humans. The advantage of the pairing is wall-clock time. A week from first bug to root shell against the best memory-safety mitigation Apple has ever shipped.
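If you want to check the timeline arithmetic yourself, here is a minimal Python sketch. It rests on two assumptions that are not in the Calif post: the year is taken as 2026 from this dispatch's date, and both figures are read as inclusive calendar-day counts. That is only one reading that makes "five-day" and "a week" agree with the three dates quoted above; Calif may count differently.

```python
from datetime import date

# Assumption: the year is inferred from this dispatch's date, not stated in the post.
YEAR = 2026

bugs_found = date(YEAR, 4, 25)       # Bruce Dang finds the bugs
team_assembled = date(YEAR, 4, 27)   # Dion Blazakis joins Calif
working_exploit = date(YEAR, 5, 1)   # working exploit exists


def elapsed_days(start: date, end: date) -> int:
    """Number of day boundaries crossed between start and end."""
    return (end - start).days


def calendar_days(start: date, end: date) -> int:
    """Calendar days touched, counting both the start day and the end day."""
    return (end - start).days + 1


print("team assembled to exploit:", calendar_days(team_assembled, working_exploit), "calendar days")
print("first bug to exploit:", calendar_days(bugs_found, working_exploit), "calendar days")
```

On this reading, the five-day figure runs from the day the team came together, and the week runs from the day the first bug was found.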
00:17:26 The post also includes a story about going to Apple Park to hand-deliver a laser-printed fifty-five-page report, in honor of their hacker friends, instead of dropping it into Apple's submission queue. There's a line in there I enjoyed. Their hosts at Apple shared that the company spent five billion dollars on the spaceship office, then asked about Calif's office.
00:17:48 They said theirs definitely cost less than one billion. The Vietnamese phrase Calif uses to close the post is nhỏ mà có võ — small but fierce. What I'll be looking for next is the full technical report, which Calif says will land after Apple ships the fix. The five-day timeline is the lede everyone is repeating, but the implementation details of a data-only MIE-surviving chain will tell us much more about what generalizes to the next mitigation Apple builds.
00:18:16 That's where the industry actually learns something.
Native all the way, until you need text
00:18:19 Artem Loenko has a piece on his blog today called Native all the way, until you need text. Artem has been a native macOS and iOS developer for almost twenty years, and the post is a slightly tired walk through what happened when he tried to build a streaming Markdown chat UI in pure Swift and SwiftUI.
00:18:37 The journey is funny in a way only Apple-platform developers will fully appreciate. He starts with SwiftUI. Scrolling is jumpy but tolerable. Then he tries to let the user select a whole Markdown document built from SwiftUI primitives, and finds out he can't. By design.
00:18:53 So he moves to NSTextView with TextKit 2. Selection works. But now he's outside the SwiftUI tooling story, and when he tries to stream model output into the view, he sees CPU spikes. Fine. He moves to AppKit with NSCollectionView, which is mature and well-proven, and on day two he realizes the cells will blink during streaming.
00:19:13 By design. He goes lower-level to pure TextKit 2. The prototype is okay on performance, terrible on streaming, and breaks every modern integration. He gives up on SwiftUI entirely, fights expanding text chunks by hand in AppKit, and reaches the moment where the text is selectable and almost nothing else works.
00:19:32 Then he counts the cost of just reaching baseline native parity. Context menus, dictionary lookup, accessibility, and text interactions — all the small things a Mac user takes for granted. Months of work. So he tries WebKit for Markdown rendering. It works. There are caveats, but mostly it just works.
00:19:50 Then, in his words, at the darkest possible moment, he generates an Electron prototype. And he is amazed. Text operations, Markdown rendering, good typography, and even Git diffs work out of the box, with performance he could not get from his pure TextKit 2 implementation.
00:20:06 His conclusion: chat as an interface pattern — long-form rich text, flexible typography, streaming responses — is web-native today. SwiftUI is fine for simple screens. Swift is great for performance-critical paths. But the rendering and text model you need for a modern AI chat surface lives in WebKit or in Chromium, and the gap, in his words, isn't a shortcut-versus-proper-solution debate anymore.
00:20:30 It's a feature gap. I think this is one of those posts that explains a thing developers have been quietly observing for two or three years now without quite articulating. Why does every new AI client end up shipped in Electron, as a web app, or wrapped in a WKWebView?
00:20:46 Because the dominant interaction pattern of the era happens to be the one the native SDKs handle worst. And the cost of bridging that gap — accessibility, selection, streaming, and context menus — adds up to months that almost no team can spare when they're shipping a model client into a market that turns over every six weeks.
00:21:06 I'd love to be wrong about this. I'd love for SwiftUI to ship a first-class streaming-Markdown text view next year. Until then, Artem's piece is the most honest account I've read of why we keep ending up on the dark side.
The MCP hello page
00:21:18 Closing with the smallest item on the lineup, which is also my favorite. Luke Lanchester runs HybridLogic, and he published a short post called MCP Hello Page about a fix he made to his company's MCP server that I think is good craft. His problem: customers were getting his MCP server URL and pasting it into a browser, where they would get a four-oh-one Unauthorized status with a raw JSON blob.
00:21:40 They would then file a support ticket saying the link was broken. He'd then explain that the URL is for an MCP client, not a browser, and that they need to paste it into Claude or Cursor or whatever they're using. This happened a lot. He calls it a never-ending game of whack-a-mole, because the alternative — building a connector per LLM client — doesn't scale either.
00:22:01 The fix is a single content-negotiation rule. When a browser opens the MCP server URL — and only a browser, identified by an Accept header asking for text-slash-html but not the JSON or server-sent-events content types real MCP clients send — return an HTML page that explains what an MCP server is and what the user is supposed to do with the URL.
00:22:20 For every actual MCP client, the behavior is unchanged, because clients send the JSON or SSE Accept headers. For a browser, the user gets a human-readable onboarding page. His result: ticket volume on this issue dropped sharply. Customer success is happier. Customers onboard faster.
00:22:36 No measurable downside. I like this for a few reasons. First, it's the kind of fix that good infrastructure engineers ship constantly and rarely write up. Second, it's a small, observable, content-negotiation-based answer to a real onboarding cliff in MCP, which is now widely enough adopted that the first-mile UX is starting to matter.
00:22:55 Third, Luke's gripe about the spec. I'll quote him directly. Quote, despite the fact that MCP is an utterly terrible attempt at a specification, end quote. That's the kind of opinion someone has only after spending a lot of time inside the thing. He's not complaining for sport.
00:23:10 He's complaining because he just shipped a real product against it. If you ship an MCP server, copy his pattern. The five lines of Accept-header logic will save your customer success team a lot of email. And if you happen to be on the MCP spec working group, a hello-page convention belongs in the spec itself.
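For anyone who wants to try the pattern, here is a minimal Python sketch of the Accept-header rule as described in this segment. It is not Luke's code. The function names, the placeholder HTML, the 200 status for the hello page, and the JSON error body are all hypothetical, and the sketch ignores quality values such as q=0 for simplicity.

```python
from typing import Optional, Set, Tuple

# Media types that, per the description above, real MCP clients ask for.
MCP_CLIENT_TYPES = {"application/json", "text/event-stream"}

# Hypothetical placeholder page; the real wording is up to you.
HELLO_HTML = (
    "<h1>This is an MCP server</h1>"
    "<p>Paste this URL into your MCP client, not a browser.</p>"
)


def accepted_types(accept_header: str) -> Set[str]:
    """Media types named in an Accept header, lowercased, parameters dropped."""
    types = set()
    for part in accept_header.split(","):
        media_type = part.split(";", 1)[0].strip().lower()
        if media_type:
            types.add(media_type)
    return types


def wants_hello_page(accept_header: Optional[str]) -> bool:
    """True only if text/html is named and neither MCP client type is."""
    if not accept_header:
        return False
    types = accepted_types(accept_header)
    return "text/html" in types and not (types & MCP_CLIENT_TYPES)


def respond(accept_header: Optional[str]) -> Tuple[int, str, str]:
    """Hypothetical handler for an unauthenticated request: (status, type, body)."""
    if wants_hello_page(accept_header):
        return 200, "text/html; charset=utf-8", HELLO_HTML
    # Everything else keeps the existing behavior: a 401 with a JSON body.
    return 401, "application/json", '{"error": "unauthorized"}'
```

A typical browser Accept header leads with text/html and ends with a wildcard, so it gets the page, while a client that names application/json or text/event-stream falls through to the existing response. Because the response now varies by Accept header, a Vary: Accept response header is worth considering if anything caches this URL.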
00:23:28 Tomorrow is Monday. The GitHub Copilot usage-based billing change is two weeks out, the Calif technical report drops whenever Apple ships the fix, and Anthropic's Agent SDK metering is starting to settle into the shape last Wednesday's chapter sketched. — Lenar.