◆ Dispatch 022 · 2026-05-10 GSV Empirical Evaluation Of A Blackbox Artifact
Seventeen Hours, Three Sizes, and the Prompt Boundary
“Treat generated code like any ML model — a blackbox artifact whose behavior should be managed through empirical evaluation.”
— Lenar Kess, today's narration
METR publishes a fresh time-horizon number for Claude Mythos Preview, and yesterday's follow-up gets paid off in a single chart. NVIDIA ships a checkpoint that contains three reasoning models at once. antirez gets DeepSeek 4 running on a DGX Spark and tells you exactly where the bandwidth wall lives. François Chollet argues that agentic coding is a form of machine learning, and a few replies actually push the idea further. Plus the diffusion gap, the German tokenizer tax, and a Gemma 4 drafter that buys you a third of your decode time back.
Chapters
- 00:00:04 Seventeen hours
- 00:03:12 One checkpoint, three models
- 00:05:54 DS4 on DGX Spark, and where the wall is
- 00:08:48 Chollet: agentic coding is machine learning
- 00:12:41 The diffusion gap, in months
- 00:15:17 Agency at the prompt boundary
- 00:18:16 The German tokenizer tax
- 00:20:50 Two faster things
- 00:23:32 Sign-off
Sources
10 cited
1
Chollet: agentic coding as machine learning
X fchollet — François Chollet, creator of Keras, formerly at Google, now running Ndea
Agentic coding is a form of machine learning. Generated code is best treated as a blackbox artifact whose behavior and generalization should be managed via empirical evaluation, like with any ML model.
x.com/fchollet/status/2053234697392754701
- Cited text
Agentic coding is a form of machine learning. Generated code is best treated as a blackbox artifact whose behavior and generalization should be managed via empirical evaluation, like with any ML model.
- Context
- Reframes agentic coding from a software engineering activity into an ML pipeline — which means the disciplines that matter shift toward eval, not deterministic review.
- Key points
- Generated code should be treated as a blackbox artifact
- Empirical evaluation replaces deterministic verification
- Agentic coding is fundamentally a different way of producing software, with different best practices
- Provenance
- Tweet · Primary source
2
METR: Claude Mythos Preview 50% time horizon hits 17 hours
Article chillinewman
Yesterday we promised to track who builds the next METR evaluation tasks. Today METR published an update showing Claude Mythos Preview's 50% time horizon at 17 hours — a measurable advance over the previous bar and the…
www.reddit.com/r/singularity/comments/1t92j…
- Context
- Yesterday we promised to track who builds the next METR evaluation tasks. Today METR published an update showing Claude Mythos Preview's 50% time horizon at 17 hours — a measurable advance over the previous bar and the headline number from yesterday's evaluation-ceiling discussion.
- Key points
- Claude Mythos Preview hits 17hr 50% time horizon on METR's task suite
- The 50% time horizon is the time a human expert would need on tasks the model completes 50% of the time
- Doubling roughly every 7 months on the recent curve
- Task construction is increasingly the rate-limiter for measuring further gains
- Provenance
- Article · Supporting source
3
NVIDIA Star Elastic: one checkpoint, three sizes via zero-shot slicing
Article phazei
A single checkpoint that contains 30 billion, 23 billion, and 12 billion parameter reasoning models, sliceable at inference time with no retraining. That collapses three deployment targets into one artifact and shifts w…
www.reddit.com/r/LocalLLaMA/comments/1t8s83r
- Context
- A single checkpoint that contains 30 billion, 23 billion, and 12 billion parameter reasoning models, sliceable at inference time with no retraining. That collapses three deployment targets into one artifact and shifts where the inference budget gets spent.
- Key points
- One checkpoint contains 30B, 23B, and 12B reasoning models
- Slicing happens zero-shot at load time
- Hybrid mixture-of-experts architecture
- Reduces multi-target deployment complexity
- Provenance
- Article · Supporting source
4
antirez: DeepSeek 4 on DGX Spark — 12 tokens/sec, prefill 200
X antirez — Salvatore Sanfilippo, creator of Redis
DS4 running on DGX Spark (GB10 / CUDA), private branch for now. 12 tokens/sec, the memory bandwidth is limited in this system, at 270GB/sec. But prefill is ways more aligned to M3 Max at ~200 t/s.
x.com/antirez/status/2053381973226184749
- Cited text
DS4 running on DGX Spark (GB10 / CUDA), private branch for now. 12 tokens/sec, the memory bandwidth is limited in this system, at 270GB/sec. But prefill is ways more aligned to M3 Max at ~200 t/s.
- Context
- A concrete, measured port of DeepSeek 4 to NVIDIA's small-form-factor DGX Spark. The 270 gigabytes per second memory bandwidth is the bottleneck — a real number worth filing alongside the M3 Max comparison.
- Key points
- 12 tokens per second decode on DGX Spark / GB10
- 200 tokens per second prefill, comparable to M3 Max
- 270 GB/sec memory bandwidth is the limit
- Private CUDA port, public release pending
- Provenance
- Tweet · Primary source
5
Elad Gil: the AI diffusion gap, in months
X eladgil — Elad Gil, longtime AI investor and operator
People at major AI labs (using internal models) 3-4 months ahead of startup silicon valley engineers. SV founders/eng 3-6 months ahead of NY. NY founders/eng 6-12 months ahead of rest of world.
x.com/eladgil/status/2053206351158091819
- Cited text
People at major AI labs (using internal models) 3-4 months ahead of startup silicon valley engineers. SV founders/eng 3-6 months ahead of NY. NY founders/eng 6-12 months ahead of rest of world.
- Context
- A practical map of who has access to what and when. It's a compounding gap: by the time a model lands at a startup, lab insiders are already three to four months into the next one.
- Key points
- Lab researchers 3-4 months ahead of SV startups
- SV 3-6 months ahead of NY
- NY 6-12 months ahead of rest of world
- Compounding diffusion gap shapes who builds what
- Provenance
- Tweet · Primary source
6
Claude Opus 4.7 burns more tokens on German prompts
Article WickOfDeath
A practical reminder that the tokenizer is not language-neutral. German runs through the tokenizer at a meaningfully higher token count than English for the same content, and that translates to slower turns, smaller eff…
www.reddit.com/r/ClaudeAI/comments/1t8xtcf
- Context
- A practical reminder that the tokenizer is not language-neutral. German runs through the tokenizer at a meaningfully higher token count than English for the same content, and that translates to slower turns, smaller effective context, and higher bills.
- Key points
- German prompts cost roughly 1.5-2x the English token count
- Effective context window shrinks proportionally
- Output quality on graphs and structure can degrade for non-English
- Tokenizer asymmetry is a structural cost, not a bug
- Provenance
- Article · Supporting source
7
Virgil Maro: agency at the prompt boundary
X _virgil19
the compounding shows up at the prompt boundary. high-agency users come pre-loaded with goals worth amplifying. low-agency users hand the model the goal too. AI doesn't generate the gap. it scales whatever shape
x.com/_virgil19/status/2053184240238637185
- Cited text
the compounding shows up at the prompt boundary. high-agency users come pre-loaded with goals worth amplifying. low-agency users hand the model the goal too. AI doesn't generate the gap. it scales whatever shape
- Context
- Names something a lot of teams are quietly noticing — that AI tools amplify whatever the user brings, including the absence of a goal.
- Key points
- Compounding lives at the prompt boundary
- High-agency users arrive with goals worth amplifying
- Low-agency users delegate goal-setting to the model
- AI scales the shape of whatever it's handed
- Provenance
- Tweet · Primary source
8
Engineering moves to the consequence boundary
X FiftyOne_50_
Agentic coding does not remove engineering. It moves engineering to the consequence boundary: What gets specified, tested, trusted, deployed, monitored, rolled back, and owned when the model is wrong.
x.com/FiftyOne_50_/status/20532876467098134…
- Cited text
Agentic coding does not remove engineering. It moves engineering to the consequence boundary: What gets specified, tested, trusted, deployed, monitored, rolled back, and owned when the model is wrong.
- Context
- A clean restatement of what agentic coding actually shifts: not less engineering, just engineering located somewhere different — at the points where you can still say no.
- Key points
- Agentic coding doesn't eliminate engineering work
- Spec, test, deploy, monitor, rollback, ownership all remain
- The locus moves from line-by-line authorship to consequence boundaries
- Provenance
- Tweet · Primary source
9
Gemini API File Search goes multimodal
Article
Multimodal retrieval-augmented generation as a hosted API primitive. The change in scope is the part to notice — the file-search endpoint now indexes images and PDFs alongside text, so callers don't need to maintain a s…
blog.google/innovation-and-ai/technology/de…
- Context
- Multimodal retrieval-augmented generation as a hosted API primitive. The change in scope is the part to notice — the file-search endpoint now indexes images and PDFs alongside text, so callers don't need to maintain a separate visual retrieval pipeline.
- Key points
- File Search now ingests images and PDFs natively
- No separate visual embedding pipeline required
- Hosted RAG primitive that competes with first-party stacks
- Provenance
- Article · Supporting source
10
Gemma 4 MTP on MLX Swift: 30-40% faster on M5 Max
X adrgrondin
Early WIP port of Gemma 4 multi-token prediction (MTP) on MLX Swift. With MTP, Gemma 31B is 30-40% faster on M5 Max and with zero quality degradation. A significant speedup by just adding a 900MB MTP drafter model.
x.com/adrgrondin/status/2053198336312689103
- Cited text
Early WIP port of Gemma 4 multi-token prediction (MTP) on MLX Swift. With MTP, Gemma 31B is 30-40% faster on M5 Max and with zero quality degradation. A significant speedup by just adding a 900MB MTP drafter model.
- Context
- Multi-token prediction with a small drafter model is the speculative-decoding move, but with the drafter trained alongside the target model. 30 to 40 percent decode speedup for 900 megabytes of extra weights is a strong trade.
- Key points
- Multi-token prediction port to MLX Swift
- 30-40% decode speedup on Apple M5 Max
- Zero quality degradation reported
- 900MB drafter model footprint
- Provenance
- Tweet · Primary source
Seventeen hours
00:00:04 The number to start with today is seventeen hours. That's METR's freshly-posted fifty-percent time horizon for Claude Mythos Preview, the early build, and it landed on a Reddit thread late last night with the chart and the FAQ already attached. Yesterday I told you we'd watch who built the next set of METR evaluation tasks and where the ceiling moved.
00:00:28 Here's the move. Seventeen hours doesn't mean Mythos can act autonomously for seventeen hours. METR is careful about that, and the FAQ on the chart says it explicitly: the fifty-percent time horizon is the length of task — measured in expert human time — that the model completes successfully half the time.
00:00:50 So if a senior engineer would take roughly seventeen hours to do a task, Mythos Preview gets that task right about one time in two. Above that, success drops off. Below it, success climbs. The way to file this is as a single point on a curve we've now been watching for almost two years.
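One way to picture that definition is as a success curve that crosses 50% exactly at the horizon. The sketch below assumes a logistic curve in log task length; the function name and the `slope` value are illustrative assumptions, not METR's published fit.

```python
import math

def success_probability(task_hours: float, horizon_hours: float, slope: float = 1.0) -> float:
    """Logistic curve in log2(task length); returns 0.5 when task_hours == horizon_hours.

    `slope` is a hypothetical steepness parameter chosen for illustration only.
    """
    x = slope * (math.log2(horizon_hours) - math.log2(task_hours))
    return 1.0 / (1.0 + math.exp(-x))

HORIZON_HOURS = 17.0  # headline figure quoted in this episode

for hours in (1, 4, 17, 68):
    p = success_probability(hours, HORIZON_HOURS)
    print(f"{hours:>3}h task -> p(success) ~ {p:.2f}")
```

Shorter tasks land above one-in-two and longer tasks below it, which is all the headline number claims.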
00:01:10 The doubling time on that curve — depending on whose fit you trust — has been somewhere between four and seven months. The earliest GPT-4 sat around four to fifteen minutes on the same axis. Claude 3.5 was in the hour-or-two range. Last summer's frontier sat near eight hours.
00:01:29 Seventeen hours is consistent with the trend rather than off it. What I'd take from it. First, the headline number is measurable progress, but it's progress on a specific task suite — METR's, which is mostly software-engineering-flavored, mostly closed-form, and mostly built around crisp success criteria.
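To see how much the choice of fit matters, here is a small projection sketch. It assumes clean exponential growth from today's seventeen hours and uses the two doubling times mentioned above (four and seven months); the twelve-month lookahead is an arbitrary illustration, not a forecast.

```python
def projected_horizon(current_hours: float, doubling_months: float, months_ahead: float) -> float:
    """Horizon after `months_ahead`, assuming it doubles every `doubling_months`."""
    return current_hours * 2 ** (months_ahead / doubling_months)

CURRENT_HOURS = 17.0

for doubling in (4.0, 7.0):
    in_a_year = projected_horizon(CURRENT_HOURS, doubling, 12.0)
    print(f"doubling every {doubling:.0f} months -> ~{in_a_year:.0f} hours in 12 months")
```

The two fits diverge by more than a factor of two within a year, which is why the choice of fit matters.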
00:01:50 It is not an Operations Officer rating. It is not a measure of how long Mythos can stay coherent on something open-ended like, say, running a customer-success thread for a week. The horizon for that kind of task is shorter, and we don't have a clean instrument for it yet.
00:02:09 Second — and this is the thing yesterday's chapter on METR was already pointing at — the rate-limiter is increasingly task construction. To resolve the curve past seventeen hours, somebody has to write good multi-day evaluation tasks, with verifiable answers, that aren't already in the training set.
00:02:30 That is hard work, and it isn't free. METR has been hiring for it. Anthropic and OpenAI are building internal versions. The interesting question for the next quarter is whether anyone outside that small circle of labs can produce a multi-day eval suite that the labs themselves take seriously.
00:02:51 Third, the version we're seeing is Preview. Anthropic ships the public Mythos sometime later this quarter. That preview-to-public gap is exactly the thing Elad Gil was writing about yesterday afternoon — and we'll come back to that in a minute, because it pulls a lot of the day's items into the same orbit.
One checkpoint, three models
00:03:12 Over on the LocalLLaMA subreddit, somebody surfaced an NVIDIA release from about two weeks ago that hasn't really caught fire yet. It's called Star Elastic, and the trick is that a single checkpoint contains three reasoning models — thirty billion parameters, twenty-three billion, and twelve billion — and you can slice between them at load time, zero-shot, without retraining and without distillation.
00:03:38 The way it works, as best I can tell from the model card and the discussion: it's a hybrid mixture-of-experts architecture where the experts are arranged so that you can drop the bottom tier — the lower-utilization experts and the corresponding attention heads — and the remaining sub-network is still a coherent model.
00:03:58 You're not pruning a 30 billion parameter dense model down to 12 billion and hoping it survives. You're picking a pre-arranged subset that was trained, jointly, to behave like a smaller model when you take the slice. What this actually changes for someone deploying.
00:04:15 The case I keep coming back to is the multi-target deployment problem. If you ship to phones, browsers, and a couple of GPU classes, you've historically had to either pick one model and accept it's wrong-sized for half your fleet, or maintain three checkpoints with three eval pipelines and three sets of behavioral tests.
00:04:36 With Star Elastic — assuming the technique generalizes, which I haven't verified myself — that becomes one checkpoint, one eval pass, and a runtime knob. The caveat I'd want you to hold on to is that the benchmark scores at the smaller slices are not obviously equivalent to a dedicated 12 billion parameter model trained from scratch.
00:04:57 NVIDIA's own numbers show some degradation. The 12 billion slice on MMLU is a couple of points behind a dedicated equivalent in their comparison table. So you're trading a single-checkpoint deployment story for a small quality hit at the smaller sizes. Whether that's a good trade is a deployment question, not a model-quality question.
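The "runtime knob" can be sketched as a function that picks the largest slice a device can hold. Everything here except the three slice sizes is an assumption: `pick_slice`, the 2-bytes-per-parameter figure, and the headroom factor are hypothetical and are not part of any NVIDIA API.

```python
from typing import Optional

# Slice sizes named in the episode, in billions of parameters, largest first.
SLICES_B = (30, 23, 12)

def pick_slice(device_memory_gb: float,
               bytes_per_param: float = 2.0,
               headroom: float = 0.8) -> Optional[int]:
    """Return the largest slice (in billions of params) whose weights fit, or None.

    Assumes weights dominate memory use and keeps (1 - headroom) of memory
    free for KV cache and activations. Both are simplifying assumptions.
    """
    usable_gb = device_memory_gb * headroom
    for size_b in SLICES_B:
        weights_gb = size_b * bytes_per_param  # billions of params * bytes each = GB
        if weights_gb <= usable_gb:
            return size_b
    return None

for mem in (80, 64, 32, 16):
    print(f"{mem} GB device -> slice: {pick_slice(mem)}")
```

A real deployment would also weigh the quality gap at the smaller slices, which this memory-only rule ignores.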
00:05:18 The broader thing I find interesting here is the direction of travel. We had dense models. Then we had mixtures of experts. Now we have nested mixtures of experts where the same weights serve at multiple scales. The architecture keeps getting more inference-time-flexible, and the deployment story keeps getting less of an architecture decision and more of a runtime decision.
00:05:42 Which means the people who are good at runtime decisions — at telemetry, autoscaling, and routing — get more of the leverage, and the people whose job was picking the checkpoint get less.
DS4 on DGX Spark, and where the wall is
00:05:54 Salvatore Sanfilippo — antirez, creator of Redis — posted last night that he's gotten DeepSeek 4 running on a DGX Spark, NVIDIA's small-form-factor workstation with the GB10 chip. Private branch for now, CUDA port, public release pending. The numbers he posted are the part to file.
00:06:15 Twelve tokens per second on decode. Two hundred tokens per second on prefill. Memory bandwidth on the system: 270 gigabytes per second. The shape of those numbers is the whole story. Two hundred t/s prefill on a desktop-class machine is good — antirez compares it to an M3 Max, which is roughly the right reference point.
00:06:38 Twelve tokens per second decode is not good. It is usable for batch work and unusable for an interactive coding loop, and the reason it's slow is sitting right there in the third number. 270 GB/sec is a memory-bandwidth wall rather than a compute wall. DeepSeek 4 has a lot of weights, the key-value cache — the KV cache — moves a lot of bytes per generated token, and the GB10's LPDDR5X unified memory just can't feed the SMs fast enough.
00:07:09 Which means: if you're shopping for a local frontier-model rig today, the spec sheet to read first is the memory bandwidth, not the FLOPs. An H100 sits north of three terabytes a second. An M3 Max is around 400 GB/sec. A consumer 5090 is around 1.8 terabytes a second.
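The bandwidth argument reduces to one division. The sketch below assumes decode is purely memory-bandwidth bound, so tokens per second is roughly bandwidth divided by bytes moved per token. The bytes-per-token figure is backed out of antirez's two numbers rather than measured, and carrying it over to other machines is a simplification that ignores capacity, kernels, and quantization.

```python
def implied_gb_per_token(bandwidth_gb_s: float, tokens_per_s: float) -> float:
    """Bytes (in GB) effectively moved per generated token at the observed rate."""
    return bandwidth_gb_s / tokens_per_s

def decode_ceiling(bandwidth_gb_s: float, gb_per_token: float) -> float:
    """Rough tokens/sec if the same bytes per token were moved at this bandwidth."""
    return bandwidth_gb_s / gb_per_token

# Figures quoted in the episode: 270 GB/s and 12 tokens/s on the DGX Spark.
gb_per_token = implied_gb_per_token(270.0, 12.0)
print(f"implied traffic: {gb_per_token:.1f} GB per token")

# Bandwidths as quoted in the transcript; 3000 is a floor for "north of three terabytes".
for name, bandwidth in (("DGX Spark", 270.0), ("M3 Max", 400.0), ("H100", 3000.0)):
    print(f"{name:>9}: ~{decode_ceiling(bandwidth, gb_per_token):.0f} tokens/s")
```

This is a scaling estimate only; measured decode rates usually land below the bandwidth ceiling.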
00:07:28 The DGX Spark, at 270, is meaningfully behind all of those, and the numbers antirez is posting reflect that exactly. The reason this matters beyond hardware-spec trivia is that a lot of the DeepSeek-4-class models are now in a window where they're large enough to need the bandwidth, and small enough that the rest of the stack is interactive-feasible.
00:07:54 Where the wall lives shapes which workloads make sense locally. At twelve tokens per second decode you are not running an interactive agent loop on this thing. You are running batched analysis, overnight refactors, or things where latency doesn't matter and privacy or cost do.
00:08:15 antirez says he'll publish the CUDA port when it's ready. The thing I'll be watching is whether he or anyone else gets the prefill story onto more aggressive batching — because if you can keep the SMs fed at prefill rates during decode through speculative or multi-token techniques, the bandwidth wall stops being a hard ceiling and starts being a soft one.
00:08:41 There's a thread there that connects to one of the other items today. Hold that one for a few minutes.
Chollet: agentic coding is machine learning
00:08:48 François Chollet posted a two-tweet thread yesterday afternoon that I want to read out in full, because I think it does a useful piece of work on a framing problem a lot of teams are tangled up in. First tweet: "Agentic coding is a form of machine learning. Generated code is best treated as a blackbox artifact whose behavior and generalization should be managed via empirical evaluation, like with any ML model."
00:09:24 Second tweet: "It is a fundamentally different way of producing software, with different best practices and different use cases. Just like ML." In the software-engineering tradition, you reason about the code by reading it. You have invariants, types, and tests written against named functions.
00:09:52 The artifact is legible, and you trust it because you can audit it line by line. In the machine-learning tradition, the artifact is opaque, and you trust it because it passed an evaluation suite that's a meaningful proxy for the work it has to do in production.
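A minimal sketch of the ML-style stance: the generated function is treated as an opaque callable and accepted only on its measured pass rate. The names `pass_rate`, `generated_sort`, and `EVAL_CASES` are hypothetical stand-ins, and a three-case suite is far too small to be a meaningful proxy for production.

```python
from typing import Callable, List, Tuple

Case = Tuple[tuple, object]  # (positional args, expected result)

def pass_rate(candidate: Callable, cases: List[Case]) -> float:
    """Fraction of cases the blackbox candidate gets right; a crash counts as a miss."""
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(cases)

def generated_sort(xs):
    """Stand-in for agent-written code; the gate never inspects this body."""
    return sorted(xs)

EVAL_CASES: List[Case] = [
    (([3, 1, 2],), [1, 2, 3]),
    (([],), []),
    (([2, 2, 1],), [1, 2, 2]),
]
THRESHOLD = 1.0  # hypothetical acceptance bar

rate = pass_rate(generated_sort, EVAL_CASES)
print(f"pass rate {rate:.2f} -> {'accept' if rate >= THRESHOLD else 'reject'}")
```

The gate knows nothing about how the candidate works, so its quality is exactly the quality of the eval cases.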
00:10:11 If agentic coding is in the second tradition rather than the first — and I think Chollet's right about this for a useful subset of cases — then the disciplines that matter shift. Code review becomes less central, because the reviewer can no longer reason about why the code does what it does, only whether it passes the eval.
00:10:33 Eval design becomes much more central, because it's the only thing actually keeping you tethered to behavior. Versioning shifts: you don't version the prompt and call it a day, you version the prompt and the harness and the eval suite together as a single artifact, the way you'd version a model card.
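Versioning the three pieces together can be as simple as hashing them as one bundle, so a change to any of them yields a new version. This is a sketch under assumptions: `bundle_fingerprint` and the field names are hypothetical, and a real setup would more likely hash files or commit IDs than inline strings.

```python
import hashlib
import json

def bundle_fingerprint(prompt: str, harness_version: str, eval_cases: list) -> str:
    """Short, stable ID over prompt + harness + eval suite taken together."""
    payload = json.dumps(
        {"prompt": prompt, "harness": harness_version, "eval": eval_cases},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

cases = [{"input": [3, 1, 2], "expected": [1, 2, 3]}]
v1 = bundle_fingerprint("Write a sort function.", "harness-0.1", cases)

# Changing only the eval suite still produces a new version ID.
v2 = bundle_fingerprint("Write a sort function.", "harness-0.1",
                        cases + [{"input": [], "expected": []}])
print(v1, v2, v1 != v2)
```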
00:10:54 A reply from someone going by FiftyOne came in a few hours later with a sharper restatement of the same idea, which I'll quote: "Agentic coding does not remove engineering. It moves engineering to the consequence boundary: what gets specified, tested, trusted, deployed, monitored, rolled back, and owned when the model is wrong."
00:11:20 Engineering doesn't go away in this world. It moves to the points where you can still say no — the spec, the test, the deploy gate, the rollback button, and the on-call rotation. The middle of the work — the line-by-line authorship — gets surrendered to the model.
00:11:38 The edges, where consequence lives, are exactly where you cannot surrender. Where I'd push back on Chollet a little. The blackbox framing is right for a lot of cases — for code that goes through a single well-shaped eval, like a sorting algorithm or a data-transformation step.
00:11:58 It's less right for code that has to be maintained, extended, and read by humans across years. There, you don't just need behavior, you need a representation that future humans (and future agents) can extend without rewriting. The eval-only discipline catches behavior regressions but it does not catch the slow accumulation of structurally bad code that is, individually, locally correct.
00:12:25 That's a problem worth thinking about. But as a framing for how to relate to agentic output today, I think the ML tradition is closer to right than the SE tradition is, and I'm going to be carrying this lens through the next few episodes.
The diffusion gap, in months
00:12:41 Elad Gil posted yesterday afternoon — and I want to be careful about this one, because it's the kind of post that's easy to over-read. Here's what he actually said, in his own words: "People at major AI labs (using internal models) 3 to 4 months ahead of startup silicon valley engineers. SV founders/eng 3 to 6 months ahead of NY. NY founders/eng 6 to 12 months ahead of rest of world.
00:13:08 Most people have no idea." The reason it lands today is that it gives you a way to read everything else in this episode at the right altitude. Mythos Preview is the model the labs have been using for weeks. Public Mythos lands later this quarter. That's roughly the lab-to-SV gap Gil is naming.
00:13:35 Star Elastic was released eleven days ago and, by Gil's standards, is already a quarter into its diffusion to NY, and the Reddit thread we're discussing is part of the reason it's now leaking out further. DeepSeek 4 dropped at the start of the quarter; antirez running it on a Spark is about as fast as the cycle from frontier weights to a hobbyist port can go.
00:13:59 What I'd add to Gil's frame. The gap isn't uniform across capabilities. Coding capability diffuses fastest, because the open-source ecosystem chews through it within weeks of release. Reasoning capability diffuses next, because benchmark culture pulls it forward.
00:14:17 Agentic capability — the thing that requires harnesses, tool-use plumbing, eval suites, and observability — diffuses slowest, because it's the part that requires actual infrastructure investment, not just weights. So if you're an engineer at a startup outside of San Francisco or New York, the read isn't "I'm three months behind." The read is more like: I'm a few months behind on what frontier models can do in a chat window, and twelve to eighteen months behind on what frontier teams can build with full agentic harnesses.
00:14:54 The gap on the harness side is bigger because the artifacts diffuse worse — they're proprietary, idiosyncratic, and hard to replicate from a tweet. The productive response to that gap, if you're sitting in it, is the same one Mert pointed at in another thread yesterday — agency at the prompt boundary.
00:15:14 Which is the next thing I want to talk about.
Agency at the prompt boundary
00:15:17 Two threads from yesterday landed within a few hours of each other and basically said the same thing in different words. I want to put them next to each other. Virgil Maro: "the compounding shows up at the prompt boundary. high-agency users come pre-loaded with goals worth amplifying.
00:15:36 low-agency users hand the model the goal too. AI doesn't generate the gap. it scales whatever shape." And from the second thread: "[...] Most AI copilots are stuck at point two — rewriting SLAs, not SOWs. So the compounding stays theoretical until the tool actually closes loops." If you arrive with a sharp goal — a specification, a constraint, a thing you actually want — the model amplifies that into more, faster, with more breadth than you could have produced alone.
00:16:17 If you arrive without a goal — if you're hoping the model will tell you what to want — it amplifies the absence of one, and you get glossy, well-formatted, beautifully argued output for a question you didn't actually have. I think this is the most under-discussed thing about agentic tools right now, and the thing I'm most unsure how to teach.
00:16:39 The high-agency-user advantage isn't about prompt cleverness. It's about arriving at the conversation with an actual model of what you're building, what could go wrong, what "done" looks like, and what you'd accept versus what you'd reject. That muscle was always part of senior engineering.
00:16:59 It's just that it used to be one of several muscles, and now it's the one that the rest of the work hangs off. The practical version of this. A teammate of mine who works mostly with Claude Code described the move he's been making, roughly: he's stopped opening the model and asking it to do a thing.
00:17:18 He's started opening a doc, writing the spec he'd give a contractor, and then handing the spec to the model. The intermediate step costs maybe ten minutes and changes the output dramatically — not because the model is smarter when given a spec, but because writing the spec forces him to figure out what he actually wants.
00:17:40 The model is a forcing function for legibility on his own intent. The failure mode I'm watching for. The compounding only runs in the direction of high agency. If you're a junior engineer learning the craft, the temptation is to delegate goal-setting upward to the model, because the model produces fluent output.
00:18:00 That's a learning trap. You don't develop the muscle that lets you eventually be the high-agency user. I don't have a clean answer for how teams handle this — it's where mentorship used to live, and it's where mentorship has to find a new shape.
The German tokenizer tax
00:18:16 Smaller item, but a useful one. A poster on the ClaudeAI subreddit flagged something a lot of non-English engineers have been quiet about: Claude Opus 4.7 burns substantially more tokens on German prompts than on English ones for the same content. Practical numbers in the thread: roughly 1.5 to 2 times the token count for equivalent German text, depending on the document type.
00:18:44 Code-heavy prompts converge closer to 1.2x; narrative prompts blow out to nearly 2x. This isn't a bug. It's a structural property of the tokenizer. The BPE vocabulary on most frontier models is trained on a corpus that's heavily English-weighted, so common English words map to one or two tokens, while long German words — with their compound nouns, their cases, their longer surface forms — can map to four or five.
00:19:13 The cost shows up in three places: your bills, your latency, and your effective context window. If you're paying per token, you're paying a tax. If you're rate-limited, you're spending more of your minute on the same content. And if your context is 200K tokens, you've effectively got 100 to 130 thousand for the same German document.
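The context arithmetic above can be checked in a few lines. The sketch assumes the token inflation applies uniformly across a document; the 1.2x, 1.5x, and 2x factors are the ones reported in the thread, and the function name is illustrative.

```python
def effective_context(window_tokens: int, inflation: float) -> float:
    """English-equivalent capacity of a window when text costs `inflation`x the tokens."""
    return window_tokens / inflation

WINDOW = 200_000  # context size used in the example above

for label, inflation in (("code-heavy", 1.2), ("typical", 1.5), ("narrative", 2.0)):
    capacity = effective_context(WINDOW, inflation)
    print(f"{label:>10} ({inflation}x): ~{capacity / 1000:.0f}K English-equivalent tokens, "
          f"{inflation:.1f}x the per-document cost")
```

The same multiplier applies to per-token billing, so the budget estimate and the context estimate move together.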
00:19:37 The poster's specific complaint was about output quality on graphs and structured content. Opus 4.6 produced complete graphs in German; Opus 4.7 truncated them. The fix that worked for them was switching to English for the inner reasoning and German only on the output layer.
00:19:57 That's a reasonable workaround for some use cases and entirely useless for others — if you're working with German legal documents, the inner reasoning has to happen in German. The broader thing I'd flag is that this tax compounds with the diffusion gap from the previous chapter.
00:20:17 If you're building a German-language product outside of the US tech corridors, you're getting worse model behavior, a smaller effective context, a higher price, and a longer wait on the latest weights. The real cost of operating outside the English-language US engineering hub is bigger than the diffusion gap alone, because the underlying tokenizer wasn't optimized for you.
00:20:45 Worth knowing if you're estimating budgets for a non-English product.
Two faster things
00:20:50 Two more items I want to put on the table before I sign off. Both are about getting more out of the same hardware. First, Adrien Grondin posted an early port of Gemma 4's multi-token prediction onto MLX Swift. The number to file: Gemma 4 at 31 billion parameters runs 30 to 40 percent faster on decode on an Apple M5 Max with the multi-token prediction drafter loaded — with zero quality degradation, by Adrien's own claim, which I haven't verified.
00:21:22 The drafter itself is 900 megabytes of additional weights. Multi-token prediction is the speculative-decoding move with a twist: instead of using a separate small model as the drafter and hoping its predictions match the target model's, the drafter is trained jointly with the target model to share its representation.
00:21:46 So the acceptance rate of drafter tokens is much higher than with classical speculative decoding, and the speedup is closer to a multiplier on the draft length. 30 to 40 percent on a 31-billion-parameter model is the kind of win that matters — it changes the boundary on which workloads are interactive on a laptop.
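To make the link between acceptance rate and speedup concrete, here is the standard speculative-decoding estimate of tokens produced per target-model pass. It assumes each drafted token is accepted independently with the same probability, which is a simplification. The acceptance rates and draft length below are hypothetical, not measurements from the Gemma 4 port.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Geometric-series result under independent acceptance:
    (1 - a^(k+1)) / (1 - a) for acceptance rate a and draft length k.
    """
    if acceptance_rate >= 1.0:
        return float(draft_len + 1)  # every draft accepted, plus the target's own token
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)

DRAFT_LEN = 4  # hypothetical draft length

for rate in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_step(rate, DRAFT_LEN)
    print(f"acceptance {rate:.1f} -> ~{tokens:.2f} tokens per target pass")
```

Wall-clock speedup is lower than these figures because the drafter's own compute is not free.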
00:22:09 The connection back to antirez's DGX Spark numbers from earlier: if you can buy a 30 to 40 percent decode speedup through a drafter on Apple silicon, you might be able to buy something similar on a GB10. That's a path past the 270 GB/sec wall — not by changing the bandwidth, but by changing how many tokens come out per byte fetched.
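A quick conversion helps when reading "30 to 40 percent faster". The sketch assumes the figure is a throughput gain (more tokens per second) and asks what fraction of decode time that saves. Applying the same gain to the Spark's 12 tokens per second is purely hypothetical, since nobody in the sources has measured it.

```python
def time_saved_fraction(throughput_gain: float) -> float:
    """Fraction of decode wall-clock time saved by a throughput gain (0.3 = 30% faster)."""
    return 1 - 1 / (1 + throughput_gain)

SPARK_DECODE_TPS = 12.0  # antirez's measured decode rate

for gain in (0.30, 0.40):
    saved = time_saved_fraction(gain)
    hypothetical_tps = SPARK_DECODE_TPS * (1 + gain)
    print(f"{gain:.0%} faster -> {saved:.0%} less decode time; "
          f"12 t/s would become ~{hypothetical_tps:.1f} t/s if the gain carried over")
```

Under this reading, the time saved is somewhat smaller than the headline percentage.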
00:22:32 Second, Google shipped multimodal File Search in the Gemini API. Until yesterday, File Search was a text retrieval primitive — you uploaded text, you got chunks. As of yesterday, you upload images, PDFs, and slides, and the underlying retrieval indexes them natively without you running a separate visual embedding pipeline.
00:22:56 That's a hosted RAG primitive that competes directly with rolling your own vision-plus-text retrieval stack. Whether you adopt it depends on whether you're already on Gemini and what your retrieval quality looks like in production. The thing to notice is that the floor on "build a custom multimodal RAG" just moved up.
00:23:19 If you were building one today as a startup, the question is now whether your custom version meaningfully beats the hosted one — and that's a bar that gets higher every quarter.
Sign-off
00:23:32 So today the picture I've got is: the curve moved from eight hours to seventeen on the METR axis. NVIDIA shipped one checkpoint that pretends to be three. antirez found the bandwidth wall on a DGX Spark and named it. Chollet relocated agentic coding into the ML tradition, and a few replies sharpened the move.
00:23:50 Elad Gil drew the diffusion map. Two threads named the same thing about agency at the prompt boundary. The German tokenizer is still expensive, and a 900-megabyte drafter buys you a third of your decode time back on Apple silicon. What I'm watching for tomorrow: whether anyone outside the lab circle ships a multi-day agentic eval suite that the labs themselves take seriously, because that's the thing that resolves the curve past seventeen hours.
00:24:16 And whether the Star Elastic technique generalizes to other architectures, or stays an NVIDIA-specific trick. — Lenar Kess.