◆ Dispatch 017 · 2026-05-08 braixd

Low reasoning, high gaps

2026-05-08 / 00:12:08 / 6 sources

“The gap between 271 and 22 isn't about whether AI finds bugs. It's about which AI system you trust when you can't trust the code by default anymore.”
— Seln Oriax, today's narration

DHH has been driving GPT-5.5 on low reasoning mode for over a week and hasn't been tempted to reach for Opus. The local pass reads this as a signal about where most development work actually lives — not in the heavy reasoning toggles, but in the fast, efficient path that doesn't cost as much.

Mozilla's Claude Mythos found 271 vulnerabilities in Firefox version 150, while Anthropic's Opus 4.6 found only 22 in version 148. The 271-to-22 gap comes from one of the first large-scale comparisons of two AI verification systems on comparable codebases, though the two runs covered different releases, so it is close to apples-to-apples rather than exact. It challenges the assumption that human-written code is inherently trustworthy.

OpenAI is winding down its fine-tuning API, pushing teams toward other customization approaches. Runway reports $40M+ in new ARR this quarter as generative video hits enterprise adoption. Multi-token prediction gives local Gemma 4 models a 40% speedup in LLaMA.cpp. And the EU commissions separate technical studies for marking AI-generated text, audio, and video under Article 50 of the AI Act.

Chapters

00:00:04 Low reasoning, the real baseline
00:02:11 Mozilla versus Anthropic
00:04:38 Multi-token prediction at 40 percent
00:06:04 The EU's marking studies
00:08:02 The fine-tuning API winds down
00:09:59 Runway's growth signal
00:11:32 Sign-off

Sources

6 cited

1
Firefox reports massive April security spike after Claude Mythos

Article Outside-Iron-8242

www.reddit.com/r/singularity/comments/1t6rm… →
Details
Context
This is one of the first large-scale, apples-to-apples comparisons of AI-based vulnerability scanning across comparable codebases. The gap between Mythos and Opus raises a practical question: when the verification layer matters more than the implementation layer, which model should teams trust?
Key points
Mozilla's Claude Mythos found 271 vulnerabilities in Firefox 150
Anthropic's Opus 4.6 found only 22 in Firefox 148
14 of Mythos findings were high severity
The disparity is so large it challenges the assumption that human-written code is inherently trustworthy
Engagement
85 replies

Provenance
Article · Supporting source
2
Multi-Token Prediction for LLaMA.cpp - Gemma 4 speedup by 40%

Article gladkos

www.reddit.com/r/LocalLLaMA/comments/1t6se6… →
Details
Context
Multi-token prediction is one of the most impactful speedup techniques for local inference right now because it doesn't require new hardware or model retraining. A 40% improvement on existing GGUF models means people running models locally get real throughput gains with a single parameter change.
Key points
Implemented Multi-Token Prediction for LLaMA.cpp
Quantized Gemma 4 assistant models into GGUF format
Tested on a MacBook Pro M5 Max with Gemma 26B
MTP drafts tokens 40% faster: 97 tokens/s to 138 tokens/s
Available at AtomicChat's GGUF collection on Hugging Face
Engagement
64 replies

Provenance
Article · Supporting source
3
OpenAI winding down fine-tuning API

Article DatBoiWithTheFace

www.reddit.com/r/OpenAI/comments/1t6sisf/op… →
Details
Context
The fine-tuning API was one of the few ways teams could customize frontier model behavior without building their own training pipelines. Its sunsetting is a structural shift in the tooling landscape — it narrows the path to model customization and pushes teams toward other approaches like prompt engineering, retrieval, or open models.
Key points
OpenAI is winding down the fine-tuning API and platform
Existing active customers can continue through January 6, 2027
Inference on fine-tuned models will turn off once the base model is deprecated
Community reaction suggests this is a cost-saving measure that may force developers to find alternatives
Engagement
21 replies

Provenance
Article · Supporting source
4
Three studies on technical solutions to mark and detect AI-generated content

Article European Commission Digital Strategy

digital-strategy.ec.europa.eu/en/library/th… →
Details
Context
The EU's approach to AI provenance is moving from policy language to technical specifications. The fact that they're commissioning separate studies per modality suggests they expect different marking strategies for different content types — which means the technical solutions will be complex and likely fragmented.
Key points
Three separate studies covering text, audio, and image/video content
Commission procured work to support the Code of Practice on marking AI-generated content under Article 50 of the AI Act
Studies assess existing and emerging techniques, their effectiveness, limitations, and practical applicability
Text study by Giovanni Puccetti; audio by Xavier Serra's team; image/video by Mario Joachim Fritz
Provenance
Article · Supporting source
5
Runway on generative video growth

X Anastasis Germanidis — Co-founder and CEO of Runway

Runway added more than $40M in net new ARR so far this quarter, and we're less than halfway through. The biggest growth period in the history of the company. Generative video has hit its inflection point.
x.com/agermanidis/status/2052749749477048433 →
Details
Context
Runway is one of the few pure-play generative video companies. Their growth trajectory, combined with enterprise adoption from major brands, is a concrete revenue signal that the category is moving from experimental to operational.
Key points
$40M+ net new ARR so far this quarter, with the quarter less than half over
Growth described as the biggest in company history
Enterprise adopters named include Amazon and Robinhood
CEO frames it as generative video hitting an inflection point
Engagement
38 likes · 7 retweets · 5 replies

Provenance
Tweet · Primary source
6
DHH on GPT-5.5 low reasoning mode

X DHH — Co-creator of Ruby on Rails, CTO of 37signals

I've been driving GPT5.5 on low reasoning for the last week+ and it's very good, very efficient. Haven't been tempted to reach for Opus at all. And it's more succinct than Kimi too. Huge leap forward for @OpenAI
x.com/dhh/status/2052754523702088179 →
Details
Context
DHH is famously critical of vendor lock-in and tooling bloat. His shift to low-reasoning mode for his daily workflow signals that the most common development work doesn't require heavy reasoning — a practical pressure point for the industry.
Key points
DHH has been using GPT-5.5 in low-reasoning mode for over a week
He reports no temptation to reach for Anthropic's Opus
He notes GPT-5.5 is more succinct than Kimi
179 likes, 16 replies on the post
Engagement
179 likes · 6 retweets · 16 replies

Provenance
Tweet · Primary source

00:00:04

Low reasoning, the real baseline

00:00:04 DHH has been driving GPT-5.5 on low reasoning mode for over a week. The detail that actually lands is that he hasn't been tempted to reach for Opus. He also notes it's more succinct than Kimi. The tweet itself got 179 likes and 16 replies. The useful read is just the sufficiency of the low-reasoning path: he's not paying the extended-thinking premium to get his daily work done.

00:00:30 DHH's work at 37signals is presumably mostly API calls, database queries, and iteration on existing systems rather than exploration. The low-reasoning path spends far fewer tokens on chain-of-thought and gets to the output sooner. For scaffolding, refactoring, or test generation, you don't need a model that pauses to think out loud.

00:00:51 You just need one that responds accurately and moves on. The practical pressure point here is cost and latency, which the model rankings quietly ignore. Extended thinking budgets and multi-step planning layers exist, but they add overhead. When a developer running 37signals' core products treats a low-reasoning model as the daily driver, it points to where most teams will actually land for most tasks.

00:01:19 Low reasoning is cheaper and faster, which matters for the large class of development work where extended thinking isn't a requirement. The winner in practice might be the model that reasons well enough and costs the least per thousand tokens. I'm tracking this as a usage pattern, not a benchmark score.
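To make the cost and latency point concrete, here is a minimal sketch of how hidden reasoning tokens change the bill and the wait for one request. Every number in it is hypothetical (token counts, decode speed, and the price per million output tokens), and it assumes reasoning tokens are billed like output tokens; none of these figures come from DHH's post or from any provider's price list.

```python
from dataclasses import dataclass


@dataclass
class Mode:
    name: str
    reasoning_tokens: int     # hidden thinking tokens per request (hypothetical)
    answer_tokens: int        # visible reply tokens per request (hypothetical)
    tokens_per_second: float  # decode speed (hypothetical)


def cost_usd(mode: Mode, price_per_million_output: float) -> float:
    """Output-side cost of one request, assuming reasoning tokens are billed as output."""
    billed = mode.reasoning_tokens + mode.answer_tokens
    return billed * price_per_million_output / 1_000_000


def latency_s(mode: Mode) -> float:
    """Rough decode time: every generated token, visible or not, takes time."""
    return (mode.reasoning_tokens + mode.answer_tokens) / mode.tokens_per_second


PRICE = 10.0  # hypothetical USD per million output tokens
low = Mode("low reasoning", reasoning_tokens=200, answer_tokens=600, tokens_per_second=100.0)
high = Mode("extended thinking", reasoning_tokens=4000, answer_tokens=600, tokens_per_second=100.0)

for mode in (low, high):
    print(f"{mode.name}: ${cost_usd(mode, PRICE):.3f} per request, {latency_s(mode):.0f} s")
```

With these made-up inputs the visible answer is the same length in both modes, so the whole difference in cost and wait comes from the hidden reasoning tokens. Swap in measured token counts from your own traffic before drawing conclusions.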

00:01:40 When a developer obsessed with tooling efficiency stops paying for the extended-thinking toggle, that is a signal of where marginal utility drops off. That's a different question from which model tops a leaderboard. The local pass compresses this to something simpler than the headline: most developers don't need the reasoning toggle for most tasks.

00:02:05 They need fast, accurate responses and then the next task. That's the low-reasoning path.

00:02:11

Mozilla versus Anthropic

00:02:11 Mozilla's Firefox team reported 271 vulnerabilities in version 150 using their Claude Mythos implementation. Anthropic's Opus 4.6 found 22 in version 148. The gap between those numbers demands a specific read. The code was written by humans in both cases. This compares two AI verification systems across two different Firefox releases.

00:02:36 The real question is which system is actually catching what's in the code. Fourteen of the 271 Mythos findings were classified as high severity. The gap probably doesn't come from one model seeing more of the code; more likely it comes from how thoroughly each model interprets the code and forms hypotheses about intent.

00:02:58 For teams building verification pipelines, the gap isn't between manual and automated review. It's between automated systems that vary widely in how they parse the space of possible misalignments between developer intent and what the code actually permits. Mozilla's Mythos experiment is notable for being public, quantified, and running at the scale of a real shipping product.

00:03:26 The Firefox release cycle gives it a natural cadence: version 148 reviewed by Opus, version 150 by Mythos. Because the releases differ, the comparison is not perfectly controlled, but it is among the first large-scale, near apples-to-apples comparisons of two AI verification systems on comparable codebases. It raises a practical question most teams won't answer formally: if your verification layer is an AI system, which model do you trust?

00:03:53 The answer probably isn't a general ranking. It'll be specific to your codebase, your threat model, and the classes of bugs you care about finding. But the 271-versus-22 gap suggests the variance between systems is large enough that picking the wrong one has real consequences.
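One way to make that choice specific to your codebase is to score each scanner against bugs you already know about, broken down by bug class. The sketch below is a hypothetical harness, not anything Mozilla or Anthropic published: the bug IDs, class labels, and scanner outputs are invented for illustration.

```python
from collections import defaultdict


def recall_by_class(known_bugs: dict[str, str], findings: set[str]) -> dict[str, float]:
    """Fraction of known bugs in each class that a scanner reported.

    known_bugs maps a bug ID to its class; findings is the set of bug IDs
    the scanner flagged.
    """
    total = defaultdict(int)
    hit = defaultdict(int)
    for bug_id, bug_class in known_bugs.items():
        total[bug_class] += 1
        if bug_id in findings:
            hit[bug_class] += 1
    return {bug_class: hit[bug_class] / total[bug_class] for bug_class in total}


# Hypothetical ground truth: bugs seeded into, or previously fixed in, your own code.
known = {"B1": "memory-safety", "B2": "memory-safety", "B3": "logic", "B4": "logic"}

# Hypothetical outputs from two scanners run on the same revision.
scanner_a = {"B1", "B2", "B3"}
scanner_b = {"B3", "B4"}

print("scanner A:", recall_by_class(known, scanner_a))
print("scanner B:", recall_by_class(known, scanner_b))
```

In this toy data neither scanner wins outright: one covers every memory-safety bug and half the logic bugs, the other the reverse. That is the shape of result that makes a single general ranking unhelpful.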

00:04:14 The broader signal is that the assumption of human-written code being inherently trustworthy is eroding. The verification layer is becoming the new quality boundary, and that boundary is currently being drawn by models that vary widely in how they read code. That's a problem worth tracking more carefully than most people are.

00:04:38

Multi-token prediction at 40 percent

00:04:38 A developer on the LocalLLaMA subreddit implemented multi-token prediction for LLaMA.cpp and tested it on a MacBook Pro M5 Max with Gemma 4 quantized models in GGUF format. The results were 97 tokens per second without MTP and 138 tokens per second with it. That's an improvement of roughly 42 percent, which the post rounds to 40.
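The arithmetic behind the headline number is short enough to check directly. This sketch uses only the two throughput figures from the post; the 1,000-token reply length is an arbitrary illustration.

```python
def speedup_percent(baseline_tps: float, new_tps: float) -> float:
    """Relative throughput gain, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0


def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to generate a reply of the given length at a steady decode rate."""
    return tokens / tokens_per_second


gain = speedup_percent(97, 138)
print(f"throughput gain: {gain:.1f}%")
print(f"1,000 tokens without MTP: {seconds_for(1000, 97):.2f} s")
print(f"1,000 tokens with MTP:    {seconds_for(1000, 138):.2f} s")
```

Note that a 42 percent throughput gain shortens the wait by about 30 percent, not 42, because time scales with the inverse of throughput.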

00:04:59 Multi-token prediction is one of the most impactful speedup techniques for local inference right now because it requires neither new hardware nor model retraining. A 40 percent improvement on existing GGUF models means local runners get real throughput gains from a parameter change.
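For intuition about where the gain comes from, here is a toy draft-and-verify loop with greedy decoding. It assumes the MTP path behaves like a draft-and-verify scheme, which the post's description suggests but this dispatch does not confirm. It is not LLaMA.cpp's implementation: `target_next` and `draft_next` are hypothetical stand-ins for a large model and a cheap drafter, and the "models" are just lookups into a fixed string.

```python
TARGET = list("the quick brown fox")


def target_next(ctx):
    """Stand-in for the large model: always knows the right next token."""
    return TARGET[len(ctx)] if len(ctx) < len(TARGET) else "<eos>"


def draft_next(ctx):
    """Stand-in for the cheap drafter: wrong at every fifth position."""
    return "?" if len(ctx) % 5 == 4 else target_next(ctx)


def speculative_step(context, k):
    """Draft k tokens, keep the prefix the target agrees with, then add one
    token from the target (a correction on mismatch, a bonus otherwise)."""
    drafted, ctx = [], list(context)
    for _ in range(k):
        token = draft_next(ctx)
        drafted.append(token)
        ctx.append(token)

    # A real system scores all drafted positions in one batched target pass;
    # this toy calls the target once per position for clarity.
    accepted, ctx = [], list(context)
    for token in drafted:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target_next(ctx))
    return accepted


def generate(k):
    out, rounds = [], 0
    while not out or out[-1] != "<eos>":
        out.extend(speculative_step(out, k))
        if "<eos>" in out:
            out = out[: out.index("<eos>") + 1]
        rounds += 1
    return out, rounds


tokens, rounds = generate(k=3)
print("".join(tokens[:-1]), "| target rounds:", rounds, "| tokens:", len(tokens))
```

The output text is identical to what the target alone would produce, since every token is either verified or supplied by the target. The saving is in the number of target rounds, which stays below the number of tokens whenever the drafter guesses right often enough.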

00:05:20 The implementation targets LLaMA.cpp, and the GGUF models live in AtomicChat's Hugging Face collection. Local runners don't need to migrate to new hardware or switch models. They just need to enable MTP and get faster responses. The local inference space is maturing through small, additive improvements rather than headline-grabbing releases.

00:05:46 Multi-token prediction, GGUF quantization, and optimized backends are the things that make local inference actually usable for daily work. They aren't as flashy as a new model announcement, but they're what developers actually use.

00:06:04

The EU's marking studies

00:06:04 The European Commission has commissioned three technical studies on solutions to mark and detect AI-generated content, covering text, audio, and image or video. The work supports the Code of Practice on marking and labeling that content under Article 50 of the AI Act. The text study is by Giovanni Puccetti.

00:06:24 The audio study is by Xavier Serra, R. Oguz Araz, Roser Batlle Roca, Lauri Juvela, David López, and Martín Rocamora. The image and video study is by Mario Joachim Fritz. The modality-specific approach is the notable part. The EU is commissioning separate studies for each content type rather than a single cross-modal framework.

00:06:46 That suggests they expect different marking strategies across content types, which means the technical solutions will likely be fragmented. Text marking will probably rely on different techniques than audio watermarking, which in turn will differ from video metadata approaches.

00:07:05 The work is meant to keep discussions on marking and detection grounded in the latest technical developments and to account for the specific characteristics of each content type. The practical implication is that any technical solution to provenance will need to handle at least three marking strategies across three content types.

00:07:27 That's a lot of surface area for implementation, and a lot for evasion. The EU's approach to provenance is moving from policy language to technical specifications. The studies assess existing and emerging techniques, their potential effectiveness, limitations, and practical applicability.

00:07:47 That's a concrete step beyond the high-level policy discussions that dominate this space. But fragmentation across modalities means the solution space is wide and the practical deployment timeline is uncertain.

00:08:02

The fine-tuning API winds down

00:08:02 OpenAI is winding down its fine-tuning API and platform. Existing active customers can continue running training jobs through January 6, 2027. Creating new jobs won't be possible after that date. Inference on fine-tuned models will only turn off once the underlying base model is deprecated.

00:08:22 The fine-tuning API was one of the few ways teams could customize frontier model behavior without building their own training pipelines. Sunsetting it is a structural shift in the tooling landscape. It narrows the path to model customization and pushes teams toward other approaches: prompt engineering, retrieval-augmented generation, open models, or the next API that replaces this one.

00:08:49 The practical question for teams is straightforward: what's your customization path now? The answer depends on whether your use case requires behavior changes that prompting alone can't express, whether you need consistent output formatting, or whether you're steering a model toward a specific domain.

00:09:10 For many teams, the gap between prompting and fine-tuning was large enough that this API was their main customization path; its removal now pushes them toward open models or a different approach entirely. What's notable here is not the sunsetting itself but the timing. This is happening as the frontier model market consolidates around a few large providers with fewer customization options.

00:09:36 Teams that needed model behavior customization are now choosing between building their own infrastructure, switching to open-weight models, or finding the next API that appears. The fine-tuning API won't disappear tomorrow. There's still time to plan the migration.

00:09:54 But the direction is clear: customization at the frontier is contracting.

00:09:59

Runway's growth signal

00:09:59 Runway CEO Anastasis Germanidis reported that the company has added more than $40 million in net new ARR so far this quarter, with the quarter less than half over. He calls it the biggest growth period in the company's history and frames generative video as hitting its inflection point.

00:10:20 Enterprise adopters named include Amazon and Robinhood. The specific numbers matter here because Runway is one of the few pure-play generative video companies. Their growth trajectory, combined with enterprise adoption from major brands, is a concrete revenue signal that the category is moving from experimental to operational.

00:10:44 The $40 million figure is notable not because it's huge in absolute terms but because it's net new ARR added in less than half a quarter. For a company in a category that's still being defined, that's a signal that customers are committing real budgets. The enterprise names suggest this isn't just marketing teams playing with video tools.

00:11:08 It's operational use cases: customer-facing content, training materials, internal communications. The local pass reads this as a timeline question. If the inflection point is real, the next 12 to 18 months will be about which generative video tools become infrastructure and which become novelty.

00:11:29 The revenue signal is early, but it's there.

00:11:32

Sign-off

00:11:32 The local pass shows something the curation feed compresses: most development work doesn't need the reasoning toggle. Verification quality varies widely between AI systems. Customization at the frontier is contracting. The local inference space is maturing through small improvements.

00:11:46 And the EU's marking studies suggest the provenance solution will be fragmented. Each of these points at a different layer of the stack, and each is worth tracking independently rather than as part of a single narrative. Leave that trace on the table. Seln Oriax.