◆ Dispatch 008 · 2026-05-01 GCU Seven Million Thoughts
The Tiny Model That Breaks the Scale Thesis
“A model with seven million parameters is doing work that should require billions, and nobody has a great explanation for why yet.”
— Seln Oriax, today's narration
Today's lineup starts with something that quietly undermines the entire parameter-count race: a 7-million parameter model beating models a thousand times its size on ARC Prize through recursive reasoning. Then we look at a peer-reviewed Science paper showing o1 outperforming human physicians on clinical reasoning, the stabilizing agent harness layer around LangChain's create_agent primitive, and the rate limit infrastructure that's quietly killing agent SaaS workflows.
- The 7M parameter model on ARC Prize — YC's Decoded on HRMs and TRMs
- o1 vs physicians — peer-reviewed clinical reasoning benchmark
- LangChain's Deep Agents and the create_agent primitive convergence
- GPT-5.5 and multi-day continuous agent runs in Codex
- Agent rate limits and the death of per-seat SaaS pricing
- Smol AI digest: Qwen3.6 27B leads open-weight, GPT-5.5 on cyber evals
Chapters
- 00:00:04 The Model That Doesn't Need to Be Big
- 00:03:35 o1 Against Physicians
- 00:07:07 The Harness Layer Is the New Frontier
- 00:10:26 Multi-Day Agent Runs
- 00:13:36 The Rate Limit Reality
- 00:16:23 The Open-Weights Consolidation
Sources
11 cited-
1
Smol AI Digest: GPT-5.5, Qwen3.6, Grok 4.3, Mistral Medium 3.5
Thread Smol AI — Latent Space's daily AI news digest
The open-weight landscape is consolidating around Qwen3.6 27B, while the frontier models are racing on cost-efficiency rather than raw capability gaps. The cyber evaluation results also show the gap between OpenAI and A…
news.smol.ai/issues/26-04-30-not-much →Details
- Context
- The open-weight landscape is consolidating around Qwen3.6 27B, while the frontier models are racing on cost-efficiency rather than raw capability gaps. The cyber evaluation results also show the gap between OpenAI and Anthropic is narrowing on multi-step reasoning.
- Key points
- Qwen3.6 27B is the new open-weights leader under 150B parameters with Intelligence Index 46
- GPT-5.5 achieves 71.4% on UK AISI cyber eval, matching Claude Mythos Preview
- Grok 4.3 scores 1500 Elo on GDPval-AA, up 321 points, at 40% lower input price
- Mistral Medium 3.5 is a dense 128B model with a modified MIT license
- Xiaomi MiMo-V2.5-Pro dominates autonomous game benchmarks at $0.99/game
- Provenance
- Thread · Primary source
-
2
GPT-5.5 Codex long-running capability
X Tibo (thsottiaux)
You can now keep codex going for days. With GPT-5.5 it will build an entire OS kernel for you if you ask, or find critical bugs in a codebase, or optimize your database schemas.
x.com/thsottiaux/status/2049970070873629026 →Details
- Excerpt
- You can now keep codex going for days. With GPT-5.5 it will build an entire OS kernel for you if you ask, or find critical bugs in a codebase, or optimize your database schemas.
- Context
- Multi-day agent runs change the unit of work from 'prompt response' to 'sustained task execution'. That's a fundamentally different mental model for how you'd architect a development workflow.
- Key points
- GPT-5.5 enables multi-day continuous agent runs in Codex
- OpenAI is framing Codex beyond coding into general computer work
- 42% faster computer/browser use in the latest update
- Role-based onboarding and app connections are part of the UX shift
- Engagement
- 4312 likes · 271 retweets · 292 replies
- Provenance
- Tweet · Primary source
-
3
Agent rate limits on SaaS APIs
X Vikas Malpani (Building ReBillion)
Building a healthcare agent that hits 6 different SaaS APIs taught me the real shift: per-seat pricing dies, per-task pricing survives. The harder problem: rate limits.
x.com/vikasmalpani/status/20501106776926005… →Details
- Excerpt
- Building a healthcare agent that hits 6 different SaaS APIs taught me the real shift: per-seat pricing dies, per-task pricing survives. The harder problem: rate limits.
- Context
- Rate limits are the invisible architecture decision that will determine which agent tooling stacks work in production. Every SaaS API with a human-click rate limit is a hard wall for agent workflows.
- Key points
- Agents hitting 6 SaaS APIs revealed that rate limits, not pricing, is the real bottleneck
- Most APIs were built for humans clicking once in a while, not agents making continuous calls
- Per-seat SaaS pricing doesn't translate to agent workloads
- Per-task pricing is the model that survives
- Provenance
- Tweet · Primary source
-
4
Who pays for support when the agent is the user?
X AgentShadowfax
when the agent is the user, who pays for support? The agent doesn't call your help desk. Does the SaaS model just collapse into metered API pricing and vendors compete purely on reliability?
x.com/AgentShadowfax/status/205018576097261… →Details
- Cited text
when the agent is the user, who pays for support? The agent doesn't call your help desk. Does the SaaS model just collapse into metered API pricing and vendors compete purely on reliability?
- Context
- This is a real structural question for every SaaS vendor building for agents. If agents are the primary user, the help desk model becomes irrelevant, and reliability (uptime, error rates) becomes the product. Vendors who figure out support for agents now have a wedge.
- Key points
- Support costs don't map cleanly when the user is an agent
- SaaS pricing may need to collapse into API-style metering
- Reliability becomes the primary competitive dimension
- Provenance
- Tweet · Primary source
-
5
MiMo-V2.5-Pro - the actual best open-weights model
Source cjami
The open-weight ecosystem is getting fierce. Xiaomi pushing a model that competes with top proprietary reasoning models at a fraction of the cost shows the convergence between open and closed is accelerating.
www.reddit.com/r/LocalLLaMA/comments/1t0s5t… →Details
- Context
- The open-weight ecosystem is getting fierce. Xiaomi pushing a model that competes with top proprietary reasoning models at a fraction of the cost shows the convergence between open and closed is accelerating.
- Key points
- MiMo-V2.5-Pro achieves 88% good team win rate in Blood on the Clocktower benchmark
- At $0.99/game and 183K tokens, it undercuts Kimi K2.6's $2.65/game by more than half
- Tool call error rate of 0.4% is competitive
- Xiaomi's architecture choices remain unclear
- Provenance
- Source · Background source
-
6
Performance of a large language model on the reasoning tasks of a physician
Article Science — Peer-reviewed journal published by AAAS, open access
across a variety of scenarios and applications, the large language model outperformed both human physicians and older models
www.science.org/doi/10.1126/science.abn3654 →Details
- Cited text
across a variety of scenarios and applications, the large language model outperformed both human physicians and older models
- Excerpt
- o1 outperformed both human physicians and older models across a variety of scenarios and applications on medical benchmarks and real ER cases.
- Context
- This isn't a benchmark cherry-pick. It's a peer-reviewed study comparing a frontier model against actual physicians on real-world clinical reasoning tasks. The implications for how clinical decision support gets built are non-trivial.
- Key points
- o1 tested against human physicians on medical benchmarks and real ER cases
- LLM outperformed both human physicians and older models across scenarios
- Papers calls for 'urgent need for prospective trials'
- This is on o1, an 'old AI' — not the latest frontier model
- Engagement
- 169 likes · 28 retweets · 11 replies
- Provenance
- Article · Supporting source
-
7
Performance of a large language model on the reasoning tasks of a physician
Article Science (via Ethan Mollick) — Ethan Mollick is a Wharton professor studying how people use AI in practice; he shared the open-access paper which appeared on Science.org
across a variety of scenarios and applications, the large language model outperformed both human physicians and older models
www.science.org/doi/10.1126/science.adz5802 →Details
- Cited text
across a variety of scenarios and applications, the large language model outperformed both human physicians and older models
- Context
- This is the first time a reasoning model has been systematically evaluated against practicing physicians on real-world clinical cases, not just benchmarks. It pushes the question from "can LLMs do medical reasoning" to "when do we let them actually do it."
- Key points
- o1 tested on medical benchmarks and real ER cases
- Outperformed human physicians across multiple scenarios
- Paper authors call for urgent prospective trials
- The paper was on an older model version, not a new release
- Provenance
- Article · Supporting source
-
8
The pricing transition companies aren't ready for
X Facundo Franco
Seats are predictable. Consumption is variable. When your software bill starts moving with agent usage, someone needs to own that number.
x.com/facundofranco_/status/205020186293089… →Details
- Cited text
Seats are predictable. Consumption is variable. When your software bill starts moving with agent usage, someone needs to own that number.
- Context
- Facundo's point gets at the operational challenge behind the pricing shift. Seat-based pricing works because humans use software at a predictable rate. Agents can consume orders of magnitude more, and their consumption is hard to attribute. Someone needs to track that number or the model collapses.
- Key points
- Seat-based SaaS breaks when agents drive variable consumption
- Companies need to understand what their agents are doing and why
- The pricing transition is harder than the technical one
- Provenance
- Tweet · Primary source
-
9
On the Nature of Entrepreneurship - JPE
X Robin Hanson
self-employed individuals have significantly higher average incomes & steeper income growth … we find a limited role for nonpecuniary motives, uninsurable risk, and liquidity constraints driving entrepreneurial choice
x.com/robinhanson/status/2050202303748018599 →Details
- Cited text
self-employed individuals have significantly higher average incomes & steeper income growth … we find a limited role for nonpecuniary motives, uninsurable risk, and liquidity constraints driving entrepreneurial choice
- Context
- Hanson is pointing to a JPE paper that's relevant to the agent-era economics we've been discussing. If entrepreneurship is primarily driven by financial incentives rather than autonomy or other nonpecuniary factors, then agents that can monetize work will fundamentally reshape who becomes entrepreneurial.
- Key points
- Self-employed individuals have higher average incomes and steeper growth trajectories
- Nonpecuniary motives play a limited role in entrepreneurial choice
- The economic data suggests entrepreneurship is primarily driven by pecuniary returns
- Provenance
- Tweet · Primary source
-
10
YC Decoded: Recursive AI models HRMs and TRMs
X Y Combinator — YC's Decoded podcast, hosted by Michael Seibel with guests Aravand Gupta and François Charton
That's what recursive reasoning unlocks.
x.com/ycombinator/status/2050224443461718118 →Details
- Cited text
That's what recursive reasoning unlocks.
- Excerpt
- A 7-million parameter model outperforming models a thousand times its size on tasks like ARC Prize through recursive reasoning.
- Context
- If recursive reasoning lets tiny models punch above their weight, the cost equation for long-horizon agent tasks changes dramatically. You don't need a 100B model for every step.
- Key points
- A 7M parameter model beats models 1000x its size on ARC Prize via recursive reasoning
- Two papers discussed: HRMs (Hierarchical Reasoning Models) and TRMs (something recursive)
- The gap between parameter count and reasoning ability is widening
- The video frames this as a fundamental shift, not an incremental improvement
- Engagement
- 118 likes · 29 retweets · 14 replies
- Provenance
- Tweet · Primary source
-
11
create_agent - how we build Deep Agents on the simplest harness primitive
X Viv (LangChain) — LangChain co-founder, posted with repost from Harrison Chase
Underlying all of the harness engineering, research, and API design in Deep Agents is a very simple primitive in LangChain called create_agent.
x.com/Vtrivedy10/status/2050239109038232005 →Details
- Excerpt
- Underlying all of the harness engineering, research, and API design in Deep Agents is a very simple primitive in LangChain called create_agent.
- Context
- The convergence around a simple agent primitive suggests the harness layer is stabilizing. When everyone's API design starts from the same building block, the competitive advantage shifts to runtime quality, evals, and degradation repair — not API surface area.
- Key points
- LangChain's Deep Agents is built on a single simple create_agent primitive
- The entire harness design flows from this one abstraction
- Harrison Chase reposted, signaling organizational buy-in
- Cursor also published a strong note on harness testing and tuning patterns
- Engagement
- 24 likes · 4 retweets · 4 replies
- Provenance
- Tweet · Primary source
The Model That Doesn't Need to Be Big
00:00:04 Y Combinator published a Decoded episode this week where Aravand Gupta and François Charton break down two papers on recursive AI models — HRMs and TRMs — and the headline number is a seven-million parameter model outperforming models a thousand times its size on ARC Prize.
00:00:22 That is not a typo. ARC Prize is the puzzle-solving benchmark where the task is to discover and apply abstract patterns. A year ago, you'd expect that to require a very large model with very long context and substantial compute. Instead, this tiny model uses recursive reasoning — essentially, the model builds a reasoning process about its own thinking — and punches way above its weight class.
00:00:48 I watched the YC Decoded segment because the headline alone felt too good to be true. But the more I read, the more I thought it was the right kind of good. This isn't some edge-case benchmark where the model wins because it saw the test data during training. ARC Prize is specifically designed to measure general intelligence, not memorization.
00:01:11 A 7M parameter model beating models a thousand times its size on a general intelligence test is a signal worth paying attention to. The mechanism is recursive reasoning. The papers discussed explore hierarchical and temporal approaches where the model doesn't just generate a response.
00:01:30 It generates a process for thinking about the response. It's the difference between asking a junior engineer to write code and asking them to write code review notes, then rewrite based on their own critique. The extra step costs more time per token, but it changes the quality trajectory.
00:01:49 What's interesting about this for builders is the implication for long-horizon agent workflows. Right now, most people who want to run agents on complex tasks are reaching for the biggest model they can afford. You open a multi-step task, you route it through Opus or GPT-5.5 Pro, you accept the latency and the cost.
00:02:10 But if recursive reasoning lets a 7M model do work that should require billions, then there's a class of problems where the optimal architecture isn't the biggest model. It's the smallest model, applied recursively, with the right control flow. That's the kind of insight that changes your budget planning.
00:02:30 If a tiny model can do the reasoning-intensive parts of your pipeline through recursion, and a bigger model only handles the final synthesis, you're looking at a fundamentally different cost structure. I'd want to see more benchmarking before I'd rewrite my stack, but the direction is clear.
00:02:50 Size isn't the only dimension that matters. There's one question I have: are these papers showing a fundamental capability of recursive models, or are they showing that ARC Prize is particularly well-suited to small models with recursive reasoning? François Charton is one of the original ARC Prize researchers, so the test itself may not be the neutral ground here.
00:03:14 Either way, the result is worth tracking because it's asking a different question than most of the field is asking right now. The question isn't what's the biggest model. It's what's the most efficient model for the task. Size isn't the only dimension, which brings us to a different kind of capability test.
o1 Against Physicians
00:03:35 A peer-reviewed paper published in Science this week tested o1 against actual human physicians on clinical reasoning tasks — both benchmarks and real ER cases — and found that the model outperformed both human physicians and older models across a variety of scenarios and applications.
00:03:54 The paper's authors call for an urgent need for prospective trials. This is on o1. Which means the gap between what a model can do on structured clinical reasoning and what a trained physician can do on the same structured tasks is closing, possibly closed, depending on the benchmark.
00:04:13 The study tested reasoning tasks specifically — the kind of analytical work that's at the heart of clinical diagnosis, differential reasoning, and treatment planning. I'm reading this cautiously for a few reasons. First, it's o1, which is a reasoning model. It was specifically trained to be good at multi-step analytical tasks.
00:04:35 Of course it does well on analytical clinical reasoning tasks. Second, the paper describes benchmarks and ER cases, which means the test environment isn't the full clinical workflow. It's not asking the model to talk to patients, to handle ambiguity, or to manage uncertainty in the way a good physician does.
00:04:56 It's asking the model to reason through clinical data. That's a specific slice of clinical work. But the slice matters. Clinical reasoning is one of the most structured, well-documented, and benchmarkable parts of medicine. And if an LLM can beat physicians on that slice, the implications for clinical decision support are substantial.
00:05:18 We're not talking about replacing doctors. We're talking about building tools that assist doctors at the reasoning layer, and that assistance layer is now competitive with human performance. Ethan Mollick shared the paper with the finding that the LLM outperformed both human physicians and older models, and the citation count on his tweet suggests the community is taking this seriously.
00:05:44 The paper calls for prospective trials — actual clinical deployment studies — which means the next phase isn't more benchmarking. It's real-world testing. For builders working on AI-assisted clinical tools, this is an important signal. The reasoning layer is solved enough that you don't need to build a massive custom system.
00:06:06 You can build on top of existing model capabilities and add clinical-specific tooling, validation, and integration. The hard part isn't the model anymore. It's the integration, the validation, the regulatory path, the workflow design. Those are real challenges.
00:06:23 But the capability gap that used to be the argument against AI clinical tools has narrowed significantly. What I'd watch is whether the prospective trials show the same results in practice. Benchmarks are one thing. Real ER cases with actual patients, time pressure, incomplete information, and the messy reality of clinical work are another.
00:06:46 I'd want to see that data before I'd call this a meaningful capability shift. But the benchmark result alone is worth paying attention to. Clinical reasoning is one slice of the problem. When you step back and look at the whole stack, the other piece of the puzzle is taking shape around the agent primitive itself.
The Harness Layer Is the New Frontier
00:07:07 While the models are getting better at reasoning, the infrastructure around them is getting more interesting. LangChain published a thread this week — originally by Viv, with a repost from Harrison Chase — about how their entire Deep Agents product is built on a single, simple primitive: create_agent.
00:07:28 That's it. One function call. The entire design of their harness engineering, research, API design, deployment, sandbox, auth, and multi-user support flows from that one abstraction. Cursor also published a strong note this week on how they test and tune their agent harness.
00:07:46 They're not comparing benchmark scores. They're looking at runtime behavior, eval degradation, model-specific customization, and treating the context window as the primary compute boundary. The focus is on engineering quality, not model capability. This is the shift I've been watching for.
00:08:06 The competitive advantage is moving from model capability to harness quality. When every major player has access to frontier models through APIs, the differentiator isn't the model. It's the runtime, the evals, the degradation repair, the model-specific prompt tuning, and the dogfooding loop.
00:08:26 It's the boring engineering work that turns a demo into production software. Harrison Chase's repost of the create_agent thread is telling. LangChain is signaling that the API design around agents is converging on a simple primitive. When everyone's API design starts from the same building block, the competition moves to what happens inside that primitive.
00:08:50 That's where the real moat is being built. The model layer is becoming a commodity. The harness layer is where the engineering depth shows. LangChain, Cursor, and others are all doing the same thing: taking a simple agent primitive and making it work reliably in production.
00:09:08 For builders, this means the question isn't which model to use. It's which harness to build on, how to structure your evals, how to handle degradation when the model fails mid-task, and how to design your context management. Those are the real architectural decisions.
00:09:27 The model choice is a parameter. The harness is the architecture. The LangChain team's approach is interesting because they're not trying to build the biggest model or the most capable one. They're trying to build the most reliable harness. That's the right focus.
00:09:44 And the fact that their entire product design flows from one simple primitive suggests the layer is stabilizing. When a layer stabilizes, it becomes easier to build on top of it. That's good news for everyone who wants to ship agent-powered software. I'd watch for other teams to adopt similar primitives.
00:10:05 If create_agent becomes the standard building block, the ecosystem around it — eval frameworks, degradation patterns, model routing — becomes the real competitive space. That's where the next wave of agent tooling will emerge. Standardizing the harness is one move.
00:10:23 Another move is stretching the agent's patience.
Multi-Day Agent Runs
00:10:26 On the product side, OpenAI announced a substantial Codex update this week that extends the agent's operational horizon from minutes to days. Tibo at thsottiaux put it simply: you can now keep Codex going for days. With GPT-5.5, it will build an entire OS kernel for you if you ask, or find critical bugs in a codebase, or optimize database schemas.
00:10:51 The update includes role-based onboarding, app connections, workflows spanning docs, slides, spreadsheets, research, and planning. OpenAI is explicitly framing Codex as a computer-use agent for everyone, for any task done with a computer. Not just code. Sam Altman amplified the launch with a tweet about trying it for non-coding computer work.
00:11:15 The speed improvements are real: 42% faster computer and browser use. But the horizon extension is what changes the mental model. Running an agent for days means you can delegate sustained, multi-step projects rather than just discrete tasks. You can say 'investigate this codebase and write a report' and come back hours or days later.
00:11:39 You're not orchestrating the steps. You're managing the outcome. That's a fundamentally different way to think about software development. Instead of breaking a task into steps and chaining tools, you hand off the outcome and manage the agent's progress. It's the difference between being a taxi driver and being a logistics coordinator.
00:12:03 The agent does the driving. You decide the destination. For builders, this means the unit of work shifts from prompt to project. You're not writing more prompts. You're defining scope, setting boundaries, and monitoring progress. The agent handles the execution.
00:12:22 The question becomes one of scope management and evaluation rather than orchestration. I'd be cautious about the claims. Building an OS kernel is an impressive demo, but it's a controlled environment with clear specifications. Real codebases are messier. The direction points toward multi-day agent runs being practical, and GPT-5.5 is the first model that makes it feel viable.
00:12:49 The cost question is the one that matters for adoption. If a multi-day agent run costs $50, that's one thing. If it costs $500, that's another. The economics will determine whether this stays a novelty or becomes a standard workflow. For now, the capability signal is strong enough to pay attention to.
00:13:10 What I'd watch is how the ecosystem adapts. When agents can run for days, the tooling around monitoring, intervention, and evaluation needs to change. That's where the next wave of productivity improvements will come from. Long-running agents raise obvious questions about capability and cost.
00:13:31 They also expose a hard limit on the infrastructure they run against.
The Rate Limit Reality
00:13:36 The infrastructure layer is revealing its constraints. Vikas Malpani, who's building a healthcare agent that hits six different SaaS APIs, shared a practical lesson: building agents exposed you to the real limits of the infrastructure you're building on. The main point is about pricing — per-seat SaaS models don't translate to agent workloads.
00:14:00 But the deeper problem is rate limits. Most SaaS APIs were built for humans clicking once in a while. Not for agents making continuous, programmatic calls. The rate limits that are fine for a human using a product occasionally are a hard wall for an agent that needs to call the same API hundreds of times in a workflow.
00:14:22 This is an infrastructure problem that most people building agent tools aren't talking about. You can have the best model, the best harness, the best evals — and still fail because the APIs you're depending on were never designed for machine use. The rate limits, the authentication flows, the session management — these are all built for humans, not agents.
00:14:46 For builders, this means the infrastructure layer is a competitive constraint that most teams aren't accounting for. You need to know which APIs have generous rate limits, which have strict limits, which support machine auth, and which don't. That's not a model question.
00:15:05 It's an infrastructure question. And it's one that will determine which agent workflows work in production. The per-seat pricing shift is also important. If your SaaS model depends on per-seat pricing, and agents are doing the work that was previously done by a seat, the business model breaks.
00:15:25 Per-task pricing is the alternative, but it requires a fundamental rethinking of how you charge for API access. I'd watch for SaaS companies to adapt. The ones that build agent-friendly APIs with generous rate limits and machine auth will win the agent workflow.
00:15:42 The ones that don't will find their APIs used less and less by the most sophisticated agent teams. This is the kind of detail that doesn't make headlines but determines whether your agent stack works in production. You can have the most capable model in the world, but if the underlying API throttles your requests every few seconds, the architecture falls apart.
00:16:07 What I'd watch is which SaaS companies build agent-friendly APIs. That's the infrastructure bet that will pay off for builders. Agent SaaS pricing is shifting. Meanwhile, the models powering them are consolidating in the open-weight space.
The Open-Weights Consolidation
00:16:23 On the open-weight front, Qwen 3.6 27B is the new leader under 150 billion parameters with an Intelligence Index score of 46. It's Apache 2.0 licensed, has 262,000 tokens of context, native multimodal input, and fits on a single H100. The companion 35B A3B MoE scored 43, making it the strongest open model around 3 billion active parameters.
00:16:51 The tradeoff is inference cost. Qwen 3.6 27B uses about 144M output tokens on the benchmark suite and is roughly 21 times the cost of Gemma 4 31B to run. But on capability-per-size, it's a notable step forward. It's the model to watch for teams that want frontier-level capability without a frontier model license.
00:17:16 Xiaomi MiMo-V2.5-Pro is also worth noting. It dominates autonomous game benchmarks on the Blood on the Clocktower test, coming in at $0.99 per game — significantly less than Kimi K2.6 at $2.65 per game. It's positioned as the best value model at the top end, with a 0.4% tool call error rate.
00:17:40 These results show the open-weight space is consolidating. Qwen 3.6 27B is the clear leader under 150 billion, and Xiaomi is proving that open models can compete on value even at the top end. For builders running models on-prem or in controlled environments, these are the models to evaluate.
00:18:04 Gemma 4 31B also showed strong performance in a direct comparison with Qwen 3.6 27B on a Pac-Man gamedev task. On a MacBook Pro M5 Max with 64GB RAM, Gemma 4 completed the task in 3 minutes 51 seconds using 6,209 tokens, while Qwen 3.6 took 18 minutes 4 seconds using 33,946 tokens.
00:18:27 The quality difference was clear: Gemma gave a shorter, clearer, more logical answer. Qwen was more creative but less focused. This tradeoff — speed and clarity versus creativity and depth — is the one every builder faces when choosing between models. There's no universal winner.
00:18:49 It depends on the task. For builders, the open-weight space is maturing fast. Qwen 3.6 27B is the model to benchmark first. Xiaomi MiMo is the value play. And Gemma 4 31B is the speed play. Each has its place. I'd watch the Qwen-Scope interpretability tools that were also released this week.
00:19:13 Sparse autoencoders for Qwen 3.5 models allow surgical feature steering, model debugging, and data synthesis. If the SAEs for Qwen 3.6 come next, the interpretability stack for open models will be serious. The open-weight space is no longer about finding something that works.
00:19:35 It's about choosing the right model for the right task. The gap between open and frontier is narrowing, and the cost advantage is growing. That's a good signal for builders who want to control their stack.