
Dispatch 022 · 2026-05-10 GSV Empirical Evaluation Of A Blackbox Artifact

Seventeen Hours, Three Sizes, and the Prompt Boundary

00:24:34 / 10 sources

“Treat generated code like any ML model — a blackbox artifact whose behavior should be managed through empirical evaluation.”

— Lenar Kess, today's narration

METR publishes a fresh time-horizon number for Claude Mythos Preview, and yesterday's follow-up gets paid off in a single chart. NVIDIA ships a checkpoint that contains three reasoning models at once. antirez gets DeepSeek 4 running on a DGX Spark and tells you exactly where the bandwidth wall lives. François Chollet argues that agentic coding is a form of machine learning, and a few replies actually push the idea further. Plus the diffusion gap, the German tokenizer tax, and a Gemma 4 drafter that makes decoding 30 to 40 percent faster.

Chapters

  1. 00:00:04 Seventeen hours
  2. 00:03:12 One checkpoint, three models
  3. 00:05:54 DS4 on DGX Spark, and where the wall is
  4. 00:08:48 Chollet: agentic coding is machine learning
  5. 00:12:41 The diffusion gap, in months
  6. 00:15:17 Agency at the prompt boundary
  7. 00:18:16 The German tokenizer tax
  8. 00:20:50 Two faster things
  9. 00:23:32 Sign-off

Sources

10 cited
  1. Chollet: agentic coding as machine learning

    X fchollet — François Chollet, creator of Keras, formerly at Google, now running Ndea

    x.com/fchollet/status/2053234697392754701
    Cited text
    Agentic coding is a form of machine learning. Generated code is best treated as a blackbox artifact whose behavior and generalization should be managed via empirical evaluation, like with any ML model.
    Context
    Reframes agentic coding from a software engineering activity into an ML pipeline — which means the disciplines that matter shift toward eval, not deterministic review.
    Key points
    • Generated code should be treated as a blackbox artifact
    • Empirical evaluation replaces deterministic verification
    • Agentic coding is fundamentally a different way of producing software, with different best practices
    Provenance
    Tweet · Primary source
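One way to picture "empirical evaluation" of generated code is a small pass-rate harness. The sketch below is an illustration, not anything from the cited tweet: `generated_median` is a hypothetical stand-in for agent-written code, `pass_rate` and `make_cases` are made-up helper names, and `statistics.median` plays the role of a trusted reference. The function passes every hand-picked spot check yet fails on a broader held-out set, which is the generalization question the tweet raises.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple

Case = Tuple[List[int], float]


def generated_median(xs: List[int]) -> float:
    """Hypothetical agent-written code: correct for odd lengths only."""
    return sorted(xs)[len(xs) // 2]


def pass_rate(candidate: Callable[[List[int]], float],
              cases: Sequence[Case]) -> float:
    """Treat the candidate as a blackbox and measure how often it is right."""
    passed = 0
    for xs, expected in cases:
        try:
            if candidate(list(xs)) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failure
    return passed / len(cases)


def make_cases(rng: random.Random, n: int, lengths: List[int]) -> List[Case]:
    """Build n random inputs, labelled by the trusted reference."""
    cases = []
    for _ in range(n):
        xs = [rng.randint(0, 100) for _ in range(rng.choice(lengths))]
        cases.append((xs, statistics.median(xs)))
    return cases


rng = random.Random(0)
# Spot checks that happen to cover only odd-length inputs.
spot_checks = make_cases(rng, 20, lengths=[1, 3, 5])
# A held-out set that also covers even-length inputs.
held_out = make_cases(rng, 200, lengths=[1, 2, 3, 4, 5, 6, 7, 8])

print(f"spot checks: {pass_rate(generated_median, spot_checks):.0%}")
print(f"held-out:    {pass_rate(generated_median, held_out):.0%}")
```

The design choice worth noting is that the harness never inspects the implementation. It only scores behavior on inputs the candidate was not tuned against.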
  2. METR: Claude Mythos Preview 50% time horizon hits 17 hours

    Article chillinewman

    www.reddit.com/r/singularity/comments/1t92j…
    Context
    Yesterday we promised to track who builds the next METR evaluation tasks. Today METR published an update showing Claude Mythos Preview's 50% time horizon at 17 hours — a measurable advance over the previous bar and the headline number from yesterday's evaluation-ceiling discussion.
    Key points
    • Claude Mythos Preview reaches a 17-hour 50% time horizon on METR's task suite
    • The 50% time horizon is the time a human expert would need on tasks the model completes 50% of the time
    • Doubling roughly every 7 months on the recent curve
    • Task construction is increasingly the rate-limiter for measuring further gains
    Provenance
    Article · Supporting source
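To see what the doubling claim implies, here is a small extrapolation sketch. It assumes the 17-hour figure and the roughly 7-month doubling period quoted above and that the trend simply continues, which the source does not promise (it names task construction as a limit on measuring further gains). The function names are illustrative.

```python
import math


def projected_horizon_hours(months_ahead: float,
                            current_hours: float = 17.0,
                            doubling_months: float = 7.0) -> float:
    """Extrapolate the 50% time horizon under a constant doubling period."""
    return current_hours * 2 ** (months_ahead / doubling_months)


def months_until(target_hours: float,
                 current_hours: float = 17.0,
                 doubling_months: float = 7.0) -> float:
    """Months needed to reach target_hours under the same assumption."""
    return doubling_months * math.log2(target_hours / current_hours)


for months in (0, 7, 14, 21):
    print(f"+{months:>2} months: {projected_horizon_hours(months):.0f} h")
```

Under these assumptions the horizon would read 34, 68, and 136 hours after one, two, and three doubling periods.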
  3. NVIDIA Star Elastic: one checkpoint, three sizes via zero-shot slicing

    Article phazei

    www.reddit.com/r/LocalLLaMA/comments/1t8s83r
    Context
    A single checkpoint that contains 30 billion, 23 billion, and 12 billion parameter reasoning models, sliceable at inference time with no retraining. That collapses three deployment targets into one artifact and shifts where the inference budget gets spent.
    Key points
    • One checkpoint contains 30B, 23B, and 12B reasoning models
    • Slicing happens zero-shot at load time
    • Hybrid mixture-of-experts architecture
    • Reduces multi-target deployment complexity
    Provenance
    Article · Supporting source
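The source does not describe how the slicing works. As a generic illustration of how one artifact can hold several nested sizes, the sketch below keeps the leading rows and columns of a parent weight matrix. Everything here is an assumption made for illustration: `slice_linear`, the `BUDGETS` table, and the toy 4x4 matrix are hypothetical and are not NVIDIA's method.

```python
from typing import Dict, List

Matrix = List[List[float]]


def slice_linear(weight: Matrix, out_dim: int, in_dim: int) -> Matrix:
    """Keep the leading out_dim rows and in_dim columns of the parent."""
    return [row[:in_dim] for row in weight[:out_dim]]


def matvec(weight: Matrix, x: List[float]) -> List[float]:
    """Plain matrix-vector product, enough to run a sliced layer."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# Toy parent layer: 4 outputs by 4 inputs.
parent: Matrix = [[float(r * 4 + c) for c in range(4)] for r in range(4)]

# Hypothetical width budgets standing in for three model sizes.
BUDGETS: Dict[str, int] = {"large": 4, "medium": 3, "small": 2}

# No retraining step: each variant is just a view of the parent weights.
variants = {name: slice_linear(parent, d, d) for name, d in BUDGETS.items()}

for name, weight in variants.items():
    width = len(weight)
    print(name, matvec(weight, [1.0] * width))
```

Because each smaller variant is a prefix of the larger one, only the parent needs to be stored and shipped.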
  4. antirez: DeepSeek 4 on DGX Spark, 12 tokens/sec decode, ~200 prefill

    X antirez — Salvatore Sanfilippo, creator of Redis

    x.com/antirez/status/2053381973226184749
    Cited text
    DS4 running on DGX Spark (GB10 / CUDA), private branch for now. 12 tokens/sec, the memory bandwidth is limited in this system, at 270GB/sec. But prefill is ways more aligned to M3 Max at ~200 t/s.
    Context
    A concrete, measured port of DeepSeek 4 to NVIDIA's small-form-factor DGX Spark. The 270 gigabytes per second memory bandwidth is the bottleneck — a real number worth filing alongside the M3 Max comparison.
    Key points
    • 12 tokens per second decode on DGX Spark / GB10
    • Roughly 200 tokens per second prefill, comparable to M3 Max
    • 270 GB/sec memory bandwidth is the limit
    • Private CUDA port, public release pending
    Provenance
    Tweet · Primary source
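The bandwidth wall can be made concrete with back-of-the-envelope arithmetic. The sketch assumes decode is entirely memory-bandwidth-bound at the quoted 270 GB/sec, so the figure it derives is an upper bound on data read per token rather than a measurement. The function names and the 540 GB/sec comparison system are hypothetical.

```python
def implied_gb_per_token(bandwidth_gb_s: float, tokens_per_s: float) -> float:
    """Data streamed per generated token if bandwidth is the only limit."""
    return bandwidth_gb_s / tokens_per_s


def decode_ceiling(bandwidth_gb_s: float, gb_per_token: float) -> float:
    """Best-case decode rate for a given bandwidth and per-token read."""
    return bandwidth_gb_s / gb_per_token


# Figures quoted in the tweet.
per_token = implied_gb_per_token(bandwidth_gb_s=270.0, tokens_per_s=12.0)
print(f"implied read per token: {per_token:.1f} GB")

# Hypothetical system with twice the bandwidth and the same per-token read.
print(f"ceiling at 540 GB/sec: {decode_ceiling(540.0, per_token):.0f} tokens/sec")
```

At 12 tokens per second, the quoted bandwidth works out to at most about 22.5 GB read per token. Under this model, decode speed scales linearly with bandwidth, while prefill is compute-bound and does not.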
  5. Elad Gil: the AI diffusion gap, in months

    X eladgil — Elad Gil, longtime AI investor and operator

    x.com/eladgil/status/2053206351158091819
    Cited text
    People at major AI labs (using internal models) 3-4 months ahead of startup silicon valley engineers. SV founders/eng 3-6 months ahead of NY. NY founders/eng 6-12 months ahead of rest of world.
    Context
    A practical map of who has access to what and when. The lags stack: by the time a model lands at a Silicon Valley startup, lab insiders have been using it for three to four months, and each hop further out adds to the delay.
    Key points
    • Lab researchers 3-4 months ahead of SV startups
    • SV 3-6 months ahead of NY
    • NY 6-12 months ahead of rest of world
    • Compounding diffusion gap shapes who builds what
    Provenance
    Tweet · Primary source
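If the three lags are read as stacking end to end, the total distance from lab insiders can be summed directly. That additive reading is an assumption, since the tweet only gives the pairwise gaps. The names `HOPS` and `cumulative_lag` are illustrative.

```python
from typing import List, Tuple

# (group, minimum months behind previous group, maximum months), from the tweet.
HOPS: List[Tuple[str, int, int]] = [
    ("SV startups", 3, 4),
    ("NY", 3, 6),
    ("rest of world", 6, 12),
]


def cumulative_lag(hops: List[Tuple[str, int, int]]) -> List[Tuple[str, int, int]]:
    """Running total of each group's lag behind the major labs."""
    totals = []
    low = high = 0
    for name, hop_low, hop_high in hops:
        low += hop_low
        high += hop_high
        totals.append((name, low, high))
    return totals


for name, low, high in cumulative_lag(HOPS):
    print(f"{name}: {low}-{high} months behind the labs")
```

Read this way, the rest of the world sits 12 to 22 months behind what people inside the labs are using.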
  6. Claude Opus 4.7 burns more tokens on German prompts

    Article WickOfDeath

    www.reddit.com/r/ClaudeAI/comments/1t8xtcf
    Context
    A practical reminder that the tokenizer is not language-neutral. German runs through the tokenizer at a meaningfully higher token count than English for the same content, and that translates to slower turns, smaller effective context, and higher bills.
    Key points
    • German prompts cost roughly 1.5-2x the English token count
    • Effective context window shrinks by the same factor
    • Output quality on graphs and structure can degrade for non-English
    • Tokenizer asymmetry is a structural cost, not a bug
    Provenance
    Article · Supporting source
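The cost of the token multiplier is simple division. The sketch below uses the 1.5x to 2x range quoted above. The 200,000-token window is a hypothetical figure chosen only to make the numbers readable, and the function name is illustrative.

```python
def effective_context_tokens(window_tokens: int, token_multiplier: float) -> float:
    """Window size measured in English-equivalent tokens."""
    return window_tokens / token_multiplier


WINDOW_TOKENS = 200_000  # hypothetical context window, for illustration only

for multiplier in (1.0, 1.5, 2.0):
    effective = effective_context_tokens(WINDOW_TOKENS, multiplier)
    print(f"{multiplier:.1f}x tokens: about {effective:,.0f} English-equivalent "
          f"tokens, {multiplier:.1f}x the bill for the same content")
```

At a 2x multiplier, the same window holds half as much content, and each request for that content costs twice as many tokens.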
  7. Virgil Maro: agency at the prompt boundary

    X _virgil19

    x.com/_virgil19/status/2053184240238637185
    Cited text
    the compounding shows up at the prompt boundary. high-agency users come pre-loaded with goals worth amplifying. low-agency users hand the model the goal too. AI doesn't generate the gap. it scales whatever shape
    Context
    Names something a lot of teams are quietly noticing — that AI tools amplify whatever the user brings, including the absence of a goal.
    Key points
    • Compounding lives at the prompt boundary
    • High-agency users arrive with goals worth amplifying
    • Low-agency users delegate goal-setting to the model
    • AI scales the shape of whatever it's handed
    Provenance
    Tweet · Primary source
  8. Engineering moves to the consequence boundary

    X FiftyOne_50_

    x.com/FiftyOne_50_/status/20532876467098134…
    Cited text
    Agentic coding does not remove engineering. It moves engineering to the consequence boundary: What gets specified, tested, trusted, deployed, monitored, rolled back, and owned when the model is wrong.
    Context
    A clean restatement of what agentic coding actually shifts: not less engineering, just engineering located somewhere different — at the points where you can still say no.
    Key points
    • Agentic coding doesn't eliminate engineering work
    • Spec, test, deploy, monitor, rollback, ownership all remain
    • The locus moves from line-by-line authorship to consequence boundaries
    Provenance
    Tweet · Primary source
  9. Gemini API File Search goes multimodal

    Article

    blog.google/innovation-and-ai/technology/de…
    Context
    Multimodal retrieval-augmented generation as a hosted API primitive. The change in scope is the part to notice — the file-search endpoint now indexes images and PDFs alongside text, so callers don't need to maintain a separate visual retrieval pipeline.
    Key points
    • File Search now ingests images and PDFs natively
    • No separate visual embedding pipeline required
    • Hosted RAG primitive that competes with first-party stacks
    Provenance
    Article · Supporting source
  10. Gemma 4 MTP on MLX Swift: 30-40% faster on M5 Max

    X adrgrondin

    x.com/adrgrondin/status/2053198336312689103
    Cited text
    Early WIP port of Gemma 4 multi-token prediction (MTP) on MLX Swift. With MTP, Gemma 31B is 30-40% faster on M5 Max and with zero quality degradation. A significant speedup by just adding a 900MB MTP drafter model.
    Context
    Multi-token prediction with a small drafter model is the speculative-decoding move, but with the drafter trained alongside the target model. 30 to 40 percent decode speedup for 900 megabytes of extra weights is a strong trade.
    Key points
    • Multi-token prediction port to MLX Swift
    • 30-40% decode speedup on Apple M5 Max
    • Zero quality degradation reported
    • 900MB drafter model footprint
    Provenance
    Tweet · Primary source
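The draft-and-verify idea behind this kind of speedup can be sketched with toy models. The block below is a Python illustration of greedy speculative decoding, not the MLX Swift port and not Gemma's MTP drafter: `target_next` and `draft_next` are hypothetical stand-ins, and each verification round is counted as one target pass on the assumption that a real system checks all proposed tokens in a single batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(seq: List[int]) -> int:
    """Toy 'large' model: a fixed rule on the last token."""
    return (seq[-1] * 3 + 1) % 10


def draft_next(seq: List[int]) -> int:
    """Toy 'drafter': agrees with the target except after token 7."""
    return 0 if seq[-1] == 7 else (seq[-1] * 3 + 1) % 10


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Draft up to k tokens, keep the prefix the target agrees with."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # The drafter proposes a short continuation on its own.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # The target checks the proposal; counted as one batched pass.
        target_passes += 1
        accepted: List[int] = []
        for token in proposal:
            expected = target(seq + accepted)
            accepted.append(expected)
            if token != expected:
                break  # first mismatch: keep the target's token, drop the rest
        seq.extend(accepted)
    return seq, target_passes


baseline = greedy_decode(target_next, [1], 12)
speculative, passes = speculative_decode(target_next, draft_next, [1], 12)
print(speculative == baseline, f"{passes} target passes instead of 12")
```

Every kept token is one the target would have produced itself, so the output matches target-only greedy decoding. That is the sense in which this family of techniques can report no quality degradation. The saving depends on how often the drafter agrees with the target.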