Archive BRAIXD

Dispatch 008 · 2026-05-01 GCU Seven Million Thoughts

The Tiny Model That Breaks the Scale Thesis

00:20:00 · 11 sources

“A model with seven million parameters is doing work that should require billions, and nobody has a great explanation for why yet.”

— Seln Oriax, today's narration

Today's lineup starts with a result that undercuts the parameter-count race: a 7-million-parameter model beating models a thousand times its size on ARC Prize through recursive reasoning. Then we look at a peer-reviewed Science paper showing o1 outperforming human physicians on clinical reasoning, the agent harness layer stabilizing around LangChain's create_agent primitive, and the rate limits that are quietly killing agent workflows on SaaS APIs.

Chapters

  1. 00:00:04 The Model That Doesn't Need to Be Big
  2. 00:03:35 o1 Against Physicians
  3. 00:07:07 The Harness Layer Is the New Frontier
  4. 00:10:26 Multi-Day Agent Runs
  5. 00:13:36 The Rate Limit Reality
  6. 00:16:23 The Open-Weights Consolidation

Sources

11 cited
  1.

    Smol AI Digest: GPT-5.5, Qwen3.6, Grok 4.3, Mistral Medium 3.5

    Thread Smol AI — Latent Space's daily AI news digest

    news.smol.ai/issues/26-04-30-not-much →
    Details
    Context
    The open-weight landscape is consolidating around Qwen3.6 27B, while the frontier models are racing on cost-efficiency rather than raw capability gaps. The cyber evaluation results also show the gap between OpenAI and Anthropic is narrowing on multi-step reasoning.
    Key points
    • Qwen3.6 27B is the new open-weights leader under 150B parameters, with an Intelligence Index score of 46
    • GPT-5.5 achieves 71.4% on UK AISI cyber eval, matching Claude Mythos Preview
    • Grok 4.3 scores 1500 Elo on GDPval-AA, up 321 points, at a 40% lower input price
    • Mistral Medium 3.5 is a dense 128B model with a modified MIT license
    • Xiaomi MiMo-V2.5-Pro dominates autonomous game benchmarks at $0.99/game
    Provenance
    Thread · Primary source
  2.

    GPT-5.5 Codex long-running capability

    X Tibo (thsottiaux)

    x.com/thsottiaux/status/2049970070873629026 →
    Details
    Excerpt
    You can now keep codex going for days. With GPT-5.5 it will build an entire OS kernel for you if you ask, or find critical bugs in a codebase, or optimize your database schemas.
    Context
    Multi-day agent runs change the unit of work from 'prompt response' to 'sustained task execution'. That's a fundamentally different mental model for how you'd architect a development workflow.
    Key points
    • GPT-5.5 enables multi-day continuous agent runs in Codex
    • OpenAI is framing Codex beyond coding into general computer work
    • 42% faster computer/browser use in the latest update
    • Role-based onboarding and app connections are part of the UX shift
    Engagement
    4312 likes · 271 retweets · 292 replies
    Provenance
    Tweet · Primary source
  3.

    Agent rate limits on SaaS APIs

    X Vikas Malpani (Building ReBillion)

    x.com/vikasmalpani/status/20501106776926005… →
    Details
    Excerpt
    Building a healthcare agent that hits 6 different SaaS APIs taught me the real shift: per-seat pricing dies, per-task pricing survives. The harder problem: rate limits.
    Context
    Rate limits are the invisible architecture decision that will determine which agent tooling stacks work in production. Every SaaS API whose rate limit was sized for human clicks is a hard wall for agent workflows.
    Key points
    • Running an agent against 6 SaaS APIs revealed that rate limits, not pricing, are the real bottleneck
    • Most APIs were built for humans clicking once in a while, not agents making continuous calls
    • Per-seat SaaS pricing doesn't translate to agent workloads
    • Per-task pricing is the model that survives
    Provenance
    Tweet · Primary source
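One common way for an agent to stay under an API's rate limit is a client-side token bucket. The sketch below is a minimal illustration of that general technique, not something described in the cited thread; the class name `TokenBucket`, its parameters, and the injectable `clock` are all assumptions made for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Client-side throttle: `rate` calls per second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self._clock = clock              # injectable so the bucket can be tested
        self._tokens = float(capacity)   # start with a full bucket
        self._last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last check, capped at capacity.
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Spend one token and return True if a call is allowed right now."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until the next token is available (0.0 if one is ready)."""
        self._refill()
        if self._tokens >= 1.0:
            return 0.0
        return (1.0 - self._tokens) / self.rate
```

An agent loop would call `try_acquire()` before each request and sleep for `wait_time()` when it returns False. The limits themselves are still set by the vendor, so a sketch like this only keeps the agent from hitting the wall at full speed.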
  4.

    Who pays for support when the agent is the user?

    X AgentShadowfax

    x.com/AgentShadowfax/status/205018576097261… →
    Details
    Cited text
    when the agent is the user, who pays for support? The agent doesn't call your help desk. Does the SaaS model just collapse into metered API pricing and vendors compete purely on reliability?
    Context
    This is a real structural question for every SaaS vendor building for agents. If agents are the primary user, the help desk model becomes irrelevant, and reliability (uptime, error rates) becomes the product. Vendors who work out what support for agents looks like now will have a wedge.
    Key points
    • Support costs don't map cleanly when the user is an agent
    • SaaS pricing may need to collapse into API-style metering
    • Reliability becomes the primary competitive dimension
    Provenance
    Tweet · Primary source
  5.

    MiMo-V2.5-Pro - the actual best open-weights model

    Source cjami

    www.reddit.com/r/LocalLLaMA/comments/1t0s5t… →
    Details
    Context
    The open-weight ecosystem is getting fierce. Xiaomi pushing a model that competes with top proprietary reasoning models at a fraction of the cost shows the convergence between open and closed is accelerating.
    Key points
    • MiMo-V2.5-Pro achieves an 88% good-team win rate in the Blood on the Clocktower benchmark
    • At $0.99/game and 183K tokens, it undercuts Kimi K2.6's $2.65/game by more than half
    • Tool call error rate of 0.4% is competitive
    • Xiaomi's architecture choices remain unclear
    Provenance
    Source · Background source
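The cost comparison in source 5 is easy to check by hand. The sketch below uses only the figures quoted there ($0.99 and $2.65 per game, 183K tokens) and assumes the 183K figure is tokens per game; the function names are illustrative, and the per-million figure blends input and output tokens together.

```python
def relative_saving(cheaper: float, pricier: float) -> float:
    """Fraction by which `cheaper` undercuts `pricier`."""
    return 1.0 - cheaper / pricier


def cost_per_million_tokens(cost_per_game: float, tokens_per_game: int) -> float:
    """Blended cost per million tokens, assuming the token count is per game."""
    return cost_per_game / tokens_per_game * 1_000_000


MIMO_COST_PER_GAME = 0.99      # USD, from the source
MIMO_TOKENS_PER_GAME = 183_000  # assumed to be per game
KIMI_COST_PER_GAME = 2.65      # USD, from the source

saving = relative_saving(MIMO_COST_PER_GAME, KIMI_COST_PER_GAME)
blended = cost_per_million_tokens(MIMO_COST_PER_GAME, MIMO_TOKENS_PER_GAME)

print(f"{saving:.1%}")    # 62.6%
print(f"${blended:.2f}")  # $5.41
```

A saving of roughly 63% is consistent with the "more than half" claim.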
  6.

    Performance of a large language model on the reasoning tasks of a physician

    Article Science — Peer-reviewed journal published by AAAS, open access

    www.science.org/doi/10.1126/science.abn3654 →
    Details
    Cited text
    across a variety of scenarios and applications, the large language model outperformed both human physicians and older models
    Excerpt
    o1 outperformed both human physicians and older models across a variety of scenarios and applications on medical benchmarks and real ER cases.
    Context
    This isn't a benchmark cherry-pick. It's a peer-reviewed study comparing a frontier model against actual physicians on real-world clinical reasoning tasks. The implications for how clinical decision support gets built are non-trivial.
    Key points
    • o1 tested against human physicians on medical benchmarks and real ER cases
    • LLM outperformed both human physicians and older models across scenarios
    • The paper cites an 'urgent need for prospective trials'
    • This is on o1, an 'old AI' — not the latest frontier model
    Engagement
    169 likes · 28 retweets · 11 replies
    Provenance
    Article · Supporting source
  7.

    Performance of a large language model on the reasoning tasks of a physician

    Article Science (via Ethan Mollick) — Ethan Mollick is a Wharton professor studying how people use AI in practice; he shared the open-access paper which appeared on Science.org

    www.science.org/doi/10.1126/science.adz5802 →
    Details
    Cited text
    across a variety of scenarios and applications, the large language model outperformed both human physicians and older models
    Context
    This is the first time a reasoning model has been systematically evaluated against practicing physicians on real-world clinical cases, not just benchmarks. It pushes the question from "can LLMs do medical reasoning" to "when do we let them actually do it."
    Key points
    • o1 tested on medical benchmarks and real ER cases
    • Outperformed human physicians across multiple scenarios
    • Paper authors call for urgent prospective trials
    • The paper was on an older model version, not a new release
    Provenance
    Article · Supporting source
  8.

    The pricing transition companies aren't ready for

    X Facundo Franco

    x.com/facundofranco_/status/205020186293089… →
    Details
    Cited text
    Seats are predictable. Consumption is variable. When your software bill starts moving with agent usage, someone needs to own that number.
    Context
    Facundo's point gets at the operational challenge behind the pricing shift. Seat-based pricing works because humans use software at a predictable rate. Agents can consume orders of magnitude more, and their consumption is hard to attribute. Someone needs to track that number or the model collapses.
    Key points
    • Seat-based SaaS breaks when agents drive variable consumption
    • Companies need to understand what their agents are doing and why
    • The pricing transition is harder than the technical one
    Provenance
    Tweet · Primary source
  9.

    On the Nature of Entrepreneurship - JPE

    X Robin Hanson

    x.com/robinhanson/status/2050202303748018599 →
    Details
    Cited text
    self-employed individuals have significantly higher average incomes & steeper income growth … we find a limited role for nonpecuniary motives, uninsurable risk, and liquidity constraints driving entrepreneurial choice
    Context
    Hanson is pointing to a JPE paper that's relevant to the agent-era economics we've been discussing. If entrepreneurship is primarily driven by financial incentives rather than autonomy or other nonpecuniary factors, then agents that can monetize work will fundamentally reshape who becomes entrepreneurial.
    Key points
    • Self-employed individuals have higher average incomes and steeper growth trajectories
    • Nonpecuniary motives play a limited role in entrepreneurial choice
    • The economic data suggests entrepreneurship is primarily driven by pecuniary returns
    Provenance
    Tweet · Primary source
  10.

    YC Decoded: Recursive AI models HRMs and TRMs

    X Y Combinator — YC's Decoded podcast, hosted by Michael Seibel with guests Aravand Gupta and François Charton

    x.com/ycombinator/status/2050224443461718118 →
    Details
    Cited text
    That's what recursive reasoning unlocks.
    Excerpt
    A 7-million parameter model outperforming models a thousand times its size on tasks like ARC Prize through recursive reasoning.
    Context
    If recursive reasoning lets tiny models punch above their weight, the cost equation for long-horizon agent tasks changes dramatically. You don't need a 100B model for every step.
    Key points
    • A 7M parameter model beats models 1000x its size on ARC Prize via recursive reasoning
    • Two papers discussed: HRMs (Hierarchical Reasoning Models) and TRMs (Tiny Recursive Models)
    • Parameter count is becoming a weaker predictor of reasoning ability
    • The video frames this as a fundamental shift, not an incremental improvement
    Engagement
    118 likes · 29 retweets · 14 replies
    Provenance
    Tweet · Primary source
  11.

    create_agent - how we build Deep Agents on the simplest harness primitive

    X Viv (LangChain), with a repost from LangChain co-founder Harrison Chase

    x.com/Vtrivedy10/status/2050239109038232005 →
    Details
    Excerpt
    Underlying all of the harness engineering, research, and API design in Deep Agents is a very simple primitive in LangChain called create_agent.
    Context
    The convergence around a simple agent primitive suggests the harness layer is stabilizing. When everyone's API design starts from the same building block, the competitive advantage shifts to runtime quality, evals, and degradation repair — not API surface area.
    Key points
    • LangChain's Deep Agents is built on a single simple create_agent primitive
    • The entire harness design flows from this one abstraction
    • Harrison Chase reposted, signaling organizational buy-in
    • Cursor also published a strong note on harness testing and tuning patterns
    Engagement
    24 likes · 4 retweets · 4 replies
    Provenance
    Tweet · Primary source
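To make "a very simple primitive" concrete, here is a toy agent loop of the kind a harness is usually built around: ask the model, run the tool it requests, feed the result back, repeat. This is a sketch under stated assumptions, not LangChain's create_agent implementation or its signature; `run_agent`, `ToolCall`, `FinalAnswer`, and `scripted_model` are hypothetical names, and the "model" is a scripted stand-in rather than an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class ToolCall:
    name: str
    argument: str


@dataclass
class FinalAnswer:
    text: str


# A "model" here is any callable mapping the transcript so far to the next action.
Model = Callable[[List[str]], Union[ToolCall, FinalAnswer]]
Tool = Callable[[str], str]


def run_agent(model: Model, tools: Dict[str, Tool], user_input: str,
              max_steps: int = 8) -> str:
    """Loop: ask the model, run any tool it requests, feed the result back."""
    transcript = [f"user: {user_input}"]
    for _ in range(max_steps):
        action = model(transcript)
        if isinstance(action, FinalAnswer):
            return action.text
        tool = tools.get(action.name)
        if tool is None:
            # Report the problem to the model instead of crashing the run.
            transcript.append(f"error: unknown tool {action.name!r}")
            continue
        transcript.append(f"tool {action.name}: {tool(action.argument)}")
    raise RuntimeError("agent did not finish within max_steps")


def scripted_model(transcript: List[str]) -> Union[ToolCall, FinalAnswer]:
    """Stand-in for an LLM: calls the `add` tool once, then answers."""
    last = transcript[-1]
    prefix = "tool add: "
    if last.startswith(prefix):
        return FinalAnswer(f"The sum is {last[len(prefix):]}.")
    return ToolCall(name="add", argument="2 3")


def add(argument: str) -> str:
    """Toy tool: sum whitespace-separated integers."""
    return str(sum(int(part) for part in argument.split()))


answer = run_agent(scripted_model, {"add": add}, "What is 2 + 3?")
print(answer)  # The sum is 5.
```

The step cap and the unknown-tool branch are where the runtime quality mentioned in the context note shows up: the loop itself is small, and most of the engineering goes into what happens when a step fails.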