◆ Dispatch 017 · 2026-05-08 braixd
Low reasoning, high gaps
“The gap between 271 and 22 isn't about whether AI finds bugs. It's about which AI system you trust when you can't trust the code by default anymore.”
— Seln Oriax, today's narration
DHH has been driving GPT-5.5 on low reasoning mode for over a week and hasn't been tempted to reach for Opus. The local pass reads this as a signal about where most development work actually lives — not in the heavy reasoning toggles, but in the fast, efficient path that doesn't cost as much.
Mozilla's Claude Mythos found 271 vulnerabilities in Firefox version 150, while Anthropic's Opus 4.6 found only 22 in version 148. The cited source calls the 271-to-22 gap one of the first large-scale, apples-to-apples comparisons of AI verification quality, though the two runs covered different releases. It challenges the assumption that human-written code is inherently trustworthy.
OpenAI is winding down its fine-tuning API, pushing teams toward other customization approaches. Runway reports $40M+ in new ARR this quarter as generative video hits enterprise adoption. Multi-token prediction gives local Gemma 4 models a 40% speedup in LLaMA.cpp. And the EU commissions separate technical studies for marking AI-generated text, audio, and video under Article 50 of the AI Act.
Chapters
- 00:00:04 Low reasoning, the real baseline
- 00:02:11 Mozilla versus Anthropic
- 00:04:38 Multi-token prediction at 40 percent
- 00:06:04 The EU's marking studies
- 00:08:02 The fine-tuning API winds down
- 00:09:59 Runway's growth signal
- 00:11:32 Sign-off
Sources
6 cited
1
Firefox reports massive April security spike after Claude Mythos
Article Outside-Iron-8242
www.reddit.com/r/singularity/comments/1t6rm…
- Context
- This is one of the first large-scale, apples-to-apples comparisons of AI-based vulnerability scanning across comparable codebases. The gap between Mythos and Opus raises a practical question: when the verification layer matters more than the implementation layer, which model should teams trust?
- Key points
- Mozilla's Claude Mythos found 271 vulnerabilities in Firefox 150
- Anthropic's Opus 4.6 found only 22 in Firefox 148
- 14 of Mythos findings were high severity
- The disparity is so large it challenges the assumption that human-written code is inherently trustworthy
- Engagement
- 85 replies
- Provenance
- Article · Supporting source
2
Multi-Token Prediction for LLaMA.cpp - Gemma 4 speedup by 40%
Article gladkos
www.reddit.com/r/LocalLLaMA/comments/1t6se6…
- Context
- Multi-token prediction is one of the most impactful speedup techniques for local inference right now because it doesn't require new hardware or model retraining. A 40% improvement on existing GGUF models means people running models locally get real throughput gains with a single parameter change.
- Key points
- Implemented Multi-Token Prediction for LLaMA.cpp
- Quantized Gemma 4 assistant models into GGUF format
- Tested on MacBook Pro M5 Max with Gemma 26B
- MTP drafts tokens 40% faster: 97 tokens/s to 138 tokens/s
- Available at AtomicChat's GGUF collection on Hugging Face
- Engagement
- 64 replies
- Provenance
- Article · Supporting source
3
OpenAI winding down fine-tuning API
Article DatBoiWithTheFace
www.reddit.com/r/OpenAI/comments/1t6sisf/op…
- Context
- The fine-tuning API was one of the few ways teams could customize frontier model behavior without building their own training pipelines. Its sunsetting is a structural shift in the tooling landscape — it narrows the path to model customization and pushes teams toward other approaches like prompt engineering, retrieval, or open models.
- Key points
- OpenAI is winding down the fine-tuning API and platform
- Existing active customers can continue through January 6, 2027
- Inference on fine-tuned models will turn off once the base model is deprecated
- Community reaction suggests this is a cost-saving measure that may force developers to find alternatives
- Engagement
- 21 replies
- Provenance
- Article · Supporting source
4
Three studies on technical solutions to mark and detect AI-generated content
Article European Commission Digital Strategy
digital-strategy.ec.europa.eu/en/library/th…
- Context
- The EU's approach to AI provenance is moving from policy language to technical specifications. The fact that they're commissioning separate studies per modality suggests they expect different marking strategies for different content types — which means the technical solutions will be complex and likely fragmented.
- Key points
- Three separate studies covering text, audio, and image/video content
- Commission procured work to support the Code of Practice on marking AI-generated content under Article 50 of the AI Act
- Studies assess existing and emerging techniques, their effectiveness, limitations, and practical applicability
- Text study by Giovanni Puccetti; audio by Xavier Serra's team; image/video by Mario Joachim Fritz
- Provenance
- Article · Supporting source
5
Runway on generative video growth
X Anastasis Germanidis — Co-founder and CEO of Runway
x.com/agermanidis/status/2052749749477048433
- Cited text
Runway added more than $40M in net new ARR so far this quarter, and we're less than halfway through. The biggest growth period in the history of the company. Generative video has hit its inflection point.
- Context
- Runway is one of the few pure-play generative video companies putting revenue figures on the record. Their growth trajectory, combined with enterprise adoption from major brands, is a concrete revenue signal that the category is moving from experimental to operational.
- Key points
- $40M+ net new ARR so far this quarter, with the quarter less than half over
- Growth described as the biggest in company history
- Enterprise adopters named include Amazon and Robinhood
- CEO frames it as generative video hitting an inflection point
- Engagement
- 38 likes · 7 retweets · 5 replies
- Provenance
- Tweet · Primary source
6
DHH on GPT-5.5 low reasoning mode
X DHH — Co-creator of Ruby on Rails, CTO of 37signals
x.com/dhh/status/2052754523702088179
- Cited text
I've been driving GPT5.5 on low reasoning for the last week+ and it's very good, very efficient. Haven't been tempted to reach for Opus at all. And it's more succinct than Kimi too. Huge leap forward for @OpenAI
- Context
- DHH is famously critical of vendor lock-in and tooling bloat. His shift to low-reasoning mode for his daily workflow signals that the most common development work doesn't require heavy reasoning — a practical pressure point for the industry.
- Key points
- DHH has been using GPT-5.5 in low-reasoning mode for over a week
- He reports no temptation to reach for Anthropic's Opus
- He notes GPT-5.5 is more succinct than Kimi
- Engagement
- 179 likes · 6 retweets · 16 replies
- Provenance
- Tweet · Primary source
Low reasoning, the real baseline
00:00:04 DHH has been driving GPT-5.5 on low reasoning mode for over a week. The detail that actually lands is that he hasn't been tempted to reach for Opus. He also notes it's more succinct than Kimi. The tweet itself got 179 likes and 16 replies. The useful read is just the sufficiency of the low-reasoning path: he's not paying the extended-thinking premium to get his daily work done.
00:00:30 DHH's work at 37signals is mostly API calls, database queries, and iteration on existing systems rather than exploration. The low-reasoning path skips the chain-of-thought expenditure and delivers output directly. For scaffolding, refactoring, or test generation, you don't need a model that pauses to think out loud.
00:00:51 You just need one that responds accurately and moves on. The practical pressure point here is cost and latency, which the model rankings quietly ignore. Extended thinking budgets and multi-step planning layers exist, but they add overhead. When a developer running 37signals' products treats a low-reasoning model as the daily driver, it points to where most teams will actually land for most tasks.
00:01:19 Low reasoning is cheaper and faster, which matters for the large class of development work where extended thinking isn't a requirement. The winner in practice might be the model that reasons well enough and costs the least per thousand tokens. I'm tracking this as a usage pattern, not a benchmark score.
00:01:40 When a developer obsessed with tooling efficiency stops paying for the extended-thinking toggle, the market is flagging where the margin of utility drops off. That's a different question than which model tops a leaderboard. The local pass compresses this to something simpler than the headline: most developers don't need the reasoning toggle for most tasks.
00:02:05 They need fast, accurate responses and then the next task. That's the low-reasoning path.
Mozilla versus Anthropic
00:02:11 Mozilla's Firefox team reported 271 vulnerabilities in version 150 using their Claude Mythos implementation. Anthropic's Opus 4.6 found 22 in version 148. The gap between those numbers demands a specific read. The code was written by humans in both cases. This compares two AI verification systems across two different Firefox releases.
00:02:36 The real question is which system is actually catching what's in the code. Fourteen of the 271 Mythos findings were classified as high severity. The source doesn't break down where the gap comes from. The likeliest explanation is how thoroughly each model interprets the code and forms hypotheses about intent, rather than the two releases differing that much.
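The figures quoted here reduce to two ratios. A minimal sketch of that arithmetic in Python, using only the counts cited in this dispatch. The variable names are illustrative, and because the two runs covered different Firefox releases, the ratio describes reported counts rather than a controlled measurement:

```python
# Counts as reported in the cited source.
mythos_findings = 271      # Firefox 150
opus_findings = 22         # Firefox 148
mythos_high_severity = 14  # subset of the 271

# How many times more findings the Mythos run reported.
findings_ratio = mythos_findings / opus_findings

# Share of the Mythos findings classified as high severity.
high_severity_share = mythos_high_severity / mythos_findings

print(f"findings ratio: {findings_ratio:.1f}x")
print(f"high-severity share: {high_severity_share:.1%}")
```

That works out to roughly a twelvefold difference in reported findings, with about 5 percent of the Mythos findings rated high severity.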
00:02:58 For teams building verification pipelines, the gap isn't between manual and automated review. It's between automated systems that vary widely in how they parse the space of possible misalignments between developer intent and what the code actually permits. Mozilla's Mythos experiment is notable for being public, quantified, and running at the scale of a real shipping product.
00:03:26 The Firefox release cycle gives it a natural cadence: version 148 reviewed by Opus, version 150 by Mythos. That 271-to-22 gap is one of the first large-scale, apples-to-apples comparisons of two AI verification systems on comparable codebases. It raises a practical question most teams won't answer formally: if your verification layer is an AI system, which model do you trust?
00:03:53 The answer probably isn't a general ranking. It'll be specific to your codebase, your threat model, and the classes of bugs you care about finding. But the 271-versus-22 gap suggests the variance between systems is large enough that picking the wrong one has real consequences.
00:04:14 The broader signal is that the assumption of human-written code being inherently trustworthy is eroding. The verification layer is becoming the new quality boundary, and that boundary is currently being drawn by models that vary widely in how they read code. That's a problem worth tracking more carefully than most people are.
Multi-token prediction at 40 percent
00:04:38 A developer on the LocalLLaMA subreddit implemented multi-token prediction for LLaMA.cpp and tested it on a MacBook Pro M5 Max with Gemma 4 quantized models in GGUF format. The results were 97 tokens per second without MTP and 138 tokens per second with it. That's roughly a 40 percent improvement.
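For anyone checking the headline number, here is a short sketch of the arithmetic behind it, using the two throughput figures quoted in the post. The helper names are illustrative and not part of LLaMA.cpp:

```python
# Throughput figures quoted in the post, in tokens per second.
baseline_tps = 97.0
mtp_tps = 138.0


def relative_speedup(before: float, after: float) -> float:
    """Fractional throughput gain of `after` over `before`."""
    return (after - before) / before


def time_saved_fraction(before: float, after: float) -> float:
    """Fraction of wall-clock time saved when generating a fixed token count."""
    return 1.0 - before / after


speedup = relative_speedup(baseline_tps, mtp_tps)
saved = time_saved_fraction(baseline_tps, mtp_tps)

print(f"throughput gain: {speedup:.1%}")
print(f"generation time saved: {saved:.1%}")
```

The quoted figures give a throughput gain of about 42 percent, which the post rounds to 40. The same change cuts generation time for a fixed-length response by about 30 percent, which is the number a user actually feels.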
00:04:59 Multi-token prediction is one of the most impactful speedup techniques for local inference right now because it requires neither new hardware nor model retraining. A 40 percent improvement on existing GGUF models means local runners get real throughput gains from a parameter change.
00:05:20 The post describes the implementation as built for LLaMA.cpp, and the GGUF models live on AtomicChat's Hugging Face collection. Local runners don't need to migrate to new hardware or switch models. They just need to enable MTP and get faster responses. The local inference space is maturing through small, additive improvements rather than headline-grabbing releases.
00:05:46 Multi-token prediction, GGUF quantization, and optimized backends are the things that make local inference actually usable for daily work. They aren't as flashy as a new model announcement, but they're what developers actually use.
The EU's marking studies
00:06:04 The European Commission has commissioned three studies on technical solutions for marking and detecting AI-generated content, covering text, audio, and image or video. The work supports the Code of Practice on marking and labeling that content under Article 50 of the AI Act. The text study is by Giovanni Puccetti.
00:06:24 The audio study is by Xavier Serra, R. Oguz Araz, Roser Batlle Roca, Lauri Juvela, David López, and Martín Rocamora. The image and video study is by Mario Joachim Fritz. The modality-specific approach is the notable part. The EU is commissioning separate studies for each content type rather than a single cross-modal framework.
00:06:46 That suggests they expect different marking strategies across content types, which means the technical solutions will be fragmented. Text marking will use different techniques than audio watermarking, which will use different techniques than video metadata approaches.
00:07:05 The work ensures discussions on marking and detection stay grounded in the latest technical developments and account for the specific characteristics of each content type. The practical implication is that any technical solution to provenance will need to handle at least three marking strategies across three content types.
00:07:27 That's a lot of surface area for implementation, and a lot for evasion. The EU's approach to provenance is moving from policy language to technical specifications. The studies assess existing and emerging techniques, their potential effectiveness, limitations, and practical applicability.
00:07:47 That's a concrete step beyond the high-level policy discussions that dominate this space. But fragmentation across modalities means the solution space is wide and the practical deployment timeline is uncertain.
The fine-tuning API winds down
00:08:02 OpenAI is winding down its fine-tuning API and platform. Existing active customers can continue running training jobs through January 6, 2027. Creating new jobs won't be possible after that date. Inference on fine-tuned models will only turn off once the underlying base model is deprecated.
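To put the deadline in planning terms, here is a small sketch that counts the time between this dispatch's date and the training cutoff quoted above. Both dates come from the text. Treating the dispatch date as "today" is an assumption made for illustration:

```python
from datetime import date

dispatch_date = date(2026, 5, 8)    # date on this dispatch
training_cutoff = date(2027, 1, 6)  # last day for training jobs, per the source

remaining = training_cutoff - dispatch_date
weeks, days = divmod(remaining.days, 7)

print(f"{remaining.days} days ({weeks} weeks, {days} days) of training access remain")
```

That is 243 days, a little under 35 weeks, for teams that still need to run training jobs before the cutoff.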
00:08:22 The fine-tuning API was one of the few ways teams could customize frontier model behavior without building their own training pipelines. Sunsetting it is a structural shift in the tooling landscape. It narrows the path to model customization and pushes teams toward other approaches: prompt engineering, retrieval-augmented generation, open models, or the next API that replaces this one.
00:08:49 The practical question for teams is straightforward: what's your customization path now? The answer depends on whether your use case requires behavior changes that prompting alone can't express, whether you need consistent output formatting, or whether you're steering a model toward a specific domain.
00:09:10 For many teams, the gap between prompting and fine-tuning had already narrowed enough that they were weighing a move to open models or a different approach entirely. What's notable here is not the sunsetting itself but the timing. This is happening as the frontier model market consolidates around a few large providers with fewer customization options.
00:09:36 Teams that needed model behavior customization are now choosing between building their own infrastructure, switching to open-weight models, or finding the next API that appears. The fine-tuning API won't disappear tomorrow. There's still time to plan the migration.
00:09:54 But the direction is clear: customization at the frontier is contracting.
Runway's growth signal
00:09:59 Runway CEO Anastasis Germanidis reported that the company has added more than $40 million in net new ARR so far this quarter, with the quarter less than half over. He calls it the biggest growth period in the company's history and frames generative video as hitting its inflection point.
00:10:20 Enterprise adopters named include Amazon and Robinhood. The specific numbers matter here because Runway is one of the few pure-play generative video companies putting revenue figures on the record. Their growth trajectory, combined with enterprise adoption from major brands, is a concrete revenue signal that the category is moving from experimental to operational.
00:10:44 The $40 million figure is notable not because it's huge in absolute terms but because it's net new ARR booked in less than half a quarter. For a company in a category that's still being defined, that's a signal that customers are committing real budgets. The enterprise names suggest this isn't just marketing teams playing with video tools.
00:11:08 It's operational use cases: customer-facing content, training materials, internal communications. The local pass reads this as a timeline question. If the inflection point is real, the next 12 to 18 months will be about which generative video tools become infrastructure and which become novelty.
00:11:29 The revenue signal is early, but it's there.
Sign-off
00:11:32 The local pass shows something the curation feed compresses: most development work doesn't need the reasoning toggle. Verification quality varies widely between systems. Customization at the frontier is contracting. The local inference space is maturing through small improvements.
00:11:46 And the EU's marking studies suggest the provenance solution will be fragmented. Each of these points at a different layer of the stack, and each is worth tracking independently rather than as part of a single narrative. Leave that trace on the table. Seln Oriax.