◆ Dispatch 007 · 2026-04-30 GSV Granite Density

IBM's Dense 4.1 Beats MoE, Cursor Skips Code For Markdown Skills, And GCC 16 Ships

2026-04-30 / 00:23:12 / 5 sources

“A smaller, simpler, dense model is winning consistently. That means IBM got significantly better at training between generations — it's what happens when you spend the intervening period obsessing over data quality instead of just scaling parameters.”
— Seln Oriax, today's narration

IBM released Granite 4.1, and the 8B dense model consistently matches or beats their previous 32B MoE model across benchmarks. The story isn't just about the numbers — it's about a data quality obsession that's worth understanding.

Meanwhile, David Gomes from Cursor walked through replacing 12,000 lines of custom git worktrees infrastructure with a 200-line Markdown skill. The tradeoffs are honest and the lessons apply to any team building agent workflows.

Chapters

00:00:04 The dense model that doesn't need tricks
00:08:25 The convergence: dense models catching up
00:12:41 Boring beats brilliant: Cursor's skills over infrastructure
00:18:32 Figure AI: production, not prototype
00:20:40 GCC 16: the plumbing update
00:22:42 Closing

Sources

5 cited

1
Granite 4.1: IBM's 8B Model Matching 32B MoE

Article firethering — IBM's Granite team, previously responsible for Granite 4.0 series of open enterprise models

firethering.com/granite-4-1-ibm-open-source… →
Details
Cited text
The 8B instruct scores 69.0 on ArenaHard. The previous generation Granite 4.0-H-Small, a 32B MoE model with 9B active parameters, scored lower. Across AlpacaEval, MMLU-Pro, BBH, EvalPlus, MBPP. same thing throughout.

Context
A production-grade dense model at 8B parameters that holds its own against heavier alternatives means teams can cut latency and cost without giving up capability or open weights. The four-stage RL recovery is a real engineering detail that shows up in reliability.
Key points
Dense 8B model matches or beats previous 32B MoE across benchmarks
15 trillion tokens trained across 5 distinct phases with changing data mixes
Four-stage RL pipeline caught and corrected a mid-training regression
512K context window achieved through staged extension (32K → 128K → 512K) with model merges
Apache 2.0 license, available via Ollama, vLLM, Transformers, and IBM API
Provenance
Article · Supporting source
2
Replacing 12K LoC with a 200 LoC Skill — David Gomes, Cursor

Video David Gomes — David Gomes, Cursor — built the git worktrees feature and led the skill-based replacement

www.youtube.com/watch?v=WE_Gnowy3uw →
Details
Cited text
With our previous approach, the agent had to stay on track. Like it, we didn't let the model ever touch any files outside its work. It was physically impossible for it to do so. Now we're trusting the model. So it's a bit vibes based.

Context
This is a real-world example of the 'boring beats brilliant' principle: replacing complex custom infrastructure with a skill that's maintainable, configurable, and cross-platform. It's also an honest look at where skills fall short — trust-based boundaries are not the same as enforced ones.
Key points
Cursor replaced a massive git worktrees feature (15,000 lines of code) with a 200-line Markdown skill using slash commands
The new 'slash worktree' and 'slash best-of-N' commands use existing Cursor primitives (skills and sub-agents) instead of custom infrastructure
Tradeoffs include models sometimes drifting from their worktrees, slower feel from visible worktree creation, and worse discoverability
Cursor is building evals with Braintrust to measure worktree compliance and training Composer models on these tasks for future RL
Parallelization primitives beyond git worktrees are in development, since worktrees are slow to create and disk-hungry
Provenance
Video · Supporting source
3
Mistral Medium 3.5 128B — Dense flagship unified model

Article Mistral AI — Mistral AI's flagship model release team

huggingface.co/mistralai/Mistral-Medium-3.5… →
Details
Cited text
Mistral Medium 3.5 is our first flagship merged model. It is a dense 128B model with a 256k context window, handling instruction-following, reasoning, and coding in a single set of weights.

Context
Another dense flagship, this one merging previously separate models into a single set of weights. Mistral's bet on a single unified model with configurable reasoning effort maps to the same question Granite raises: as dense models get better, does the MoE tradeoff still earn its complexity?
Key points
Dense 128B model replacing both Mistral Medium 3.1 and Magistral in Le Chat
Reasoning effort configurable per request — can do fast reply or complex agentic runs
Replaces Devstral 2 in their coding agent Vibe, scoring 91.4% on τ³-Telecom and 77.6% on SWE-Bench Verified
256k context, multimodal (text + image input), system prompt support
Modified MIT license with revenue threshold exception, available via Mistral Vibe CLI, vLLM, SGLang, Transformers
Provenance
Article · Supporting source
4
GCC 16 has been released

Article GCC Team — The GCC project team, maintained by the Free Software Foundation

gcc.gnu.org/gcc-16/changes.html →
Details
Cited text
GCC 16 has been released with C++26 reflection support, enabling compile-time introspection of types and structures without template metaprogramming hacks.

Context
Compilers are the plumbing AI agents write into. C++26 reflection changes how you write metaprogramming, and as more generated code flows through GCC, understanding these changes helps you write and debug the generated output. It's not AI news per se, but it's the foundation everything runs on.
Key points
GCC 16 includes C++26 reflection support — compile-time type introspection
Improvements to compiler optimization passes and debug info generation
Updates to libstdc++ including C++26 library features
Available on Debian sid (trunk package) and build systems
Provenance
Article · Supporting source
5
Figure AI hits 24x production scale, producing 1 robot per hour

Source Distinct-Question-16

www.reddit.com/r/singularity/comments/1sz3s… →
Details
Context
Robotics deployment moves from demo mode to production mode when you're building one an hour consistently. It's a different kind of engineering problem than model benchmarking: assembly lines, supply chains, and reliability at scale. Worth watching as a parallel to AI agent deployment.
Key points
Figure AI has scaled humanoid robot production to 24 units per day
One robot produced per hour at their manufacturing line
The company is teasing a fleet deployment — moving from prototype to operations
Significant milestone in making humanoid robots economically viable
Provenance
Source · Background source

00:00:04

The dense model that doesn't need tricks

00:00:04 IBM released Granite 4.1 today. The benchmark number worth stopping for is the 8B model's 69.0 on ArenaHard, one of the better proxies for actual chat quality since models are judged by GPT-4 on how well they handle five hundred challenging real-world prompts. The 8B model is dense, has no MoE tricks, and uses no extended reasoning chains.

00:00:28 The previous generation Granite 4.0-H-Small was a 32B MoE model with nine billion active parameters, and it scored lower. On BFCL V3, the standard tool-calling benchmark, the 8B scores 68.3. The 32B MoE scores 64.7. On GSM8K, the grade-school math benchmark, the 8B hits 92.5.

00:00:48 Across AlpacaEval, MMLU-Pro, BBH, EvalPlus, and MBPP, the pattern holds. A smaller, simpler dense model is winning consistently. IBM got significantly better at training between generations. The 4.0-H-Small wasn't badly built. It was the best they had at the time. The 4.1 8B is what happens when you spend the intervening period obsessing over data quality instead of just scaling parameters.

00:01:16 That's the thread running through everything about how Granite 4.1 was built. IBM released three sizes — the 3, 8, and 30 billion-parameter models — all using the same decoder-only dense transformer design, the same training pipeline, and the same data strategy.

00:01:35 The only difference between them is size. No MoE routing, sparse layers, or extended reasoning chains that inflate token counts. What you send in is what gets processed, predictably, every time. Models that lean on long reasoning traces are harder to cost-predict and harder to latency-budget.

00:01:57 Granite 4.1 skips that by design. But the architecture isn't the main story here. The real story is the fifteen trillion tokens they trained on, and how carefully they handled them. IBM ran five distinct training phases with different data mixtures, different learning rate schedules, and different goals.

00:02:19 Phase one starts broad with CommonCrawl at 59 percent, code at 20 percent, and math at 7 percent. By phase two, math jumps to 35 percent and code to 30 percent. Phases three and four blend chain-of-thought reasoning and instruction data alongside high-quality web content.
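One way to picture a phase-dependent data mix is as a small schedule that a data loader samples from. The sketch below is illustrative only: it fills in just the shares quoted above for phases one and two, lumps everything the narration does not break down into a hypothetical `unspecified` bucket, and uses made-up names (`PHASE_MIX`, `complete_mix`, `sample_source`). It is not IBM's pipeline.

```python
import random

# Only the shares quoted in the narration are filled in. Anything the
# source does not break down is lumped into an "unspecified" bucket.
PHASE_MIX = {
    1: {"commoncrawl": 0.59, "code": 0.20, "math": 0.07},
    2: {"math": 0.35, "code": 0.30},
}


def complete_mix(mix):
    """Return a copy of mix with an 'unspecified' share so weights sum to 1."""
    remainder = round(1.0 - sum(mix.values()), 6)
    full = dict(mix)
    if remainder > 0:
        full["unspecified"] = remainder
    return full


def sample_source(phase, rng):
    """Pick the data source for one training document in the given phase."""
    mix = complete_mix(PHASE_MIX[phase])
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

Calling `sample_source(2, random.Random(0))` repeatedly would draw math and code far more often than in phase one, which is the whole point of changing the mix between phases.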

00:02:38 Phase five extends the context window, eventually to 512K tokens for the 8B and 30B models. Most teams pick a data mix and stick with it. IBM changed theirs four times with clear intent each time. What happened after pre-training matters just as much. IBM needed to turn the base model into something that actually follows instructions reliably, and that requires fine-tuning on examples of good behavior.

00:03:08 But bad examples in that dataset don't just get ignored. They get learned. A hallucinated answer, a response that ignores the instruction, a calculation that's wrong but confident — the model treats all of it as signal. So IBM built a filtering system before a single fine-tuning sample touched the model.

00:03:30 An LLM-as-Judge evaluated every assistant response across six dimensions: instruction following, correctness, completeness, conciseness, naturalness, and calibration. Each response got scored, and samples that fell below threshold got cut. But some things triggered automatic rejection regardless of score: hallucinations, false premises, incorrect computations.

00:03:57 No partial credit for those. The judge wasn't reading prompts or user inputs in isolation. It was evaluating what the model said given the full context it had access to. In RAG settings, if the response wasn't grounded in the retrieved documents, that counted as a hallucination.

00:04:17 In tool-calling scenarios, outputs were checked against the allowed tools and their parameter schemas. On top of that, a separate rule-based pipeline checked structure: length, formatting, schema validation, and deduplication across the entire dataset. Everything was logged and auditable.
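The filtering logic described above, scores with a cutoff plus hard rejections that ignore the scores, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not IBM's implementation: the 0-to-1 score scale, the 0.7 threshold, the per-dimension cutoff, and the names `JudgeVerdict` and `keep_sample` are all hypothetical. Only the six dimensions and the three hard-reject categories come from the narration.

```python
from dataclasses import dataclass, field

# The six dimensions named in the narration.
DIMENSIONS = (
    "instruction_following", "correctness", "completeness",
    "conciseness", "naturalness", "calibration",
)
# Categories that trigger rejection regardless of score.
HARD_REJECT_FLAGS = {"hallucination", "false_premise", "incorrect_computation"}


@dataclass
class JudgeVerdict:
    scores: dict                      # dimension name -> judge score (assumed 0..1)
    flags: set = field(default_factory=set)


def keep_sample(verdict, threshold=0.7):
    """Return (keep, reason). Threshold and scale are illustrative assumptions."""
    hard = verdict.flags & HARD_REJECT_FLAGS
    if hard:
        # No partial credit: a hard flag wins over any score.
        return False, "hard reject: " + ", ".join(sorted(hard))
    low = [d for d in DIMENSIONS if verdict.scores.get(d, 0.0) < threshold]
    if low:
        return False, "below threshold: " + ", ".join(low)
    return True, "ok"
```

Returning a reason string alongside the decision is one simple way to get the "logged and auditable" property: every dropped sample carries the rule that dropped it.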

00:04:38 What came out the other side was 4.1 million samples. It's a deliberately curated 4.1 million, not an accident of accumulation. IBM also ran reinforcement learning in four sequential stages. The first stage trained the model jointly across nine domains: math, science, logical reasoning, instruction following, structured output, text-to-SQL, temporal reasoning, general chat, and in-context learning.

00:05:08 They did this because joint training prevents the model from forgetting earlier domains as it gets better at later ones. Every gradient update sees the full range of tasks. Stage two was RLHF training on general chat prompts using a reward model to improve helpfulness.

00:05:28 This worked. AlpacaEval scores jumped around 18.9 points on average compared to the fine-tuned checkpoints. Then they hit a snag. The RLHF stage, while improving chat quality, caused math benchmark scores to drop. GSM8K and DeepMind-Math both regressed. Stage three was a short identity and knowledge calibration run of about 40 training steps.

00:05:52 It stabilized how the model represents itself and what it knows, with a measurable improvement on self-identification. Stage four was a dedicated math RL run specifically to recover what RLHF had damaged. It worked. GSM8K recovered and surpassed the fine-tuned baseline by around 3.8 points on average.

00:06:14 DeepMind-Math recovered by around 23.5 points. That four-stage RL pipeline, catching a regression and correcting it, is the kind of detail that doesn't make headlines but shows up in real-world reliability. Most model releases announce the benchmarks and skip the story of what almost broke.
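Catching a regression like the post-RLHF math drop comes down to comparing benchmark scores between consecutive checkpoints and flagging anything that fell. The sketch below shows that check in its simplest form. The function name `find_regressions` and the 0.5-point tolerance are assumptions for illustration, and nothing here reflects IBM's actual tooling.

```python
def find_regressions(before, after, tolerance=0.5):
    """Return {benchmark: delta} for scores that dropped by more than tolerance.

    before and after map benchmark names to scores for two checkpoints.
    Benchmarks missing from either checkpoint are skipped.
    """
    return {
        name: round(after[name] - before[name], 2)
        for name in before
        if name in after and before[name] - after[name] > tolerance
    }
```

Run after each training stage, a non-empty result is the signal to schedule a recovery stage for the affected domain, which is the shape of what stage four did for math.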

00:06:36 The benchmarks are self-reported using IBM's own evaluation harness. The absolute numbers are plausible and consistent with what third parties have reported, so treat them as strong signals rather than final proof. These are good results, not proven ones. The 30B sits at the top of IBM's own BFCL V3 tool-calling chart at 73.7, ahead of Gemma-4-31B at 72.7.

00:07:02 That's a legitimate leaderboard result, not a cherry-picked internal comparison. The 8B at 68.3 beats the previous Granite 4.0-H-Small at 64.7. The 30B also beats every Qwen model on the IFEval chart regardless of size, at 89.7. The quiet story here is the 3B model.

00:07:21 82.1 on IFEval, 87.0 on GSM8K, 60.8 on BFCL V3. For a model running at that parameter count, those numbers are hard to ignore if you're thinking about edge deployment or cost-constrained inference. The 3B only extends to 128K context, not 512K — worth knowing if long context is a hard requirement for your use case.

00:07:44 The models are available under Apache 2.0 through Ollama, vLLM, and Transformers. FP8 quantized variants are roughly half the footprint of full precision versions with most of the performance intact. For anyone building something that needs reliable tool calling, predictable latency, and a license that won't create legal headaches, the 8B is worth a serious look.

00:08:10 It's competitive with models that cost more to run, and the data quality obsession that produced it is exactly the kind of engineering discipline that matters long after the benchmark numbers fade.
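The "roughly half the footprint" figure for FP8 follows from bytes per weight. A back-of-the-envelope check, assuming "full precision" here means 16-bit weights (2 bytes each), a nominal 8 billion parameters, and counting weights only (no KV cache, activations, or runtime overhead):

```python
def weight_footprint_gb(num_params, bytes_per_param):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


bf16_gb = weight_footprint_gb(8e9, 2)  # 16-bit weights: 16.0 GB
fp8_gb = weight_footprint_gb(8e9, 1)   # 8-bit weights: 8.0 GB
ratio = fp8_gb / bf16_gb               # 0.5
```

Real memory use will be higher than either number once context length and batch size enter the picture, so treat this as a lower bound rather than a sizing guide.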

00:08:25

The convergence: dense models catching up

00:08:25 Granite 4.1 isn't the only dense model making noise today. Mistral released Medium 3.5, a dense model with 128 billion parameters that replaces both Mistral Medium 3.1 and Magistral in their Le Chat product, as well as Devstral 2 in their coding agent Vibe. The architecture story is worth noting.

00:08:47 Beyond the parameter count, it offers a 256K context window, multimodal inputs, and configurable reasoning effort per request. The same model can do a fast instant reply or work through a complex agentic run by toggling reasoning effort between none and high.

00:09:07 Scores sit at 91.4 on tau-three-Telecom and 77.6 on SWE-Bench Verified. Mistral trained the vision encoder from scratch to handle variable image sizes and aspect ratios. What's interesting here isn't just the density. It's the unified approach. Mistral's reasoning effort configuration maps to a different question than the one Granite raises.

00:09:33 Instead of smaller replacing larger, it's a single model that can flex between fast and thorough modes. For agent workflows, that's potentially useful — you pay for reasoning only when you need it, and the overhead of context switching between models disappears.

00:09:53 But I'd watch the deployment story more closely than the benchmarks. Mistral released this under a modified MIT license with a revenue threshold exception for large companies. The commercial terms for companies above that threshold aren't spelled out on the Hugging Face page, so the open-source story is only half the picture.

00:10:17 What I'm tracking across both releases is the convergence. Dense models from IBM and Mistral are both closing the gap that MoE architectures were supposed to hold. This doesn't mean MoE is dead. Sparse routing still wins when you're optimizing for peak capability and have the infrastructure to manage it.

00:10:40 But the cost-latency predictability of dense models is becoming competitive, and that changes how you design systems. If you're building an agent pipeline, the question isn't which architecture wins globally. It's which one fits your deployment budget, your latency requirements, and your team's ability to manage the tradeoff.

00:11:04 The gap between dense and MoE is narrowing in capability, and the gap in operational simplicity is widening. There's also the practical question of how IBM achieved a 512K context window with Granite. Getting a model to handle 512K tokens is one problem. Getting it to handle 512K tokens without forgetting how to handle 4K tokens is a harder one.

00:11:30 IBM solved it with a staged extension approach inside their fifth pre-training phase: 32K first, then 128K, then 512K. Each stage used the same data mix until the final extension, where they switched to 80 percent books and 20 percent code repository data. Books and long code repositories have coherent structure across tens of thousands of tokens in a way that web data doesn't.

00:11:59 After each extension stage, IBM did a model merge — merging the long-context checkpoint back with earlier weights rather than just continuing to train — to preserve the behaviors the model had already learned at shorter lengths. The RULER benchmark, which tests whether long-context capability is real or just superficially present, shows the 8B base scoring 83.6 at 32K, 79.1 at 64K, and 73.0 at 128K.

00:12:29 There's degradation as context grows, which is expected and honest, but the scores don't fall off a cliff. The 30B holds up better: 85.2, 84.6, and 76.7.
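The narration doesn't say which merge method IBM used after each extension stage. A common baseline for this kind of merge is plain linear interpolation between two checkpoints, so here is a minimal sketch of that technique, with checkpoints represented as dicts of flat float lists to stay dependency-free. The function name `merge_checkpoints` and the `alpha` parameter are assumptions for illustration.

```python
def merge_checkpoints(short_ctx, long_ctx, alpha=0.5):
    """Linearly interpolate two checkpoints with identical parameter names.

    alpha is the weight on the long-context checkpoint; (1 - alpha) goes
    to the earlier, short-context weights.
    """
    if short_ctx.keys() != long_ctx.keys():
        raise ValueError("checkpoints must have the same parameter names")
    merged = {}
    for name, short_vals in short_ctx.items():
        long_vals = long_ctx[name]
        if len(short_vals) != len(long_vals):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [
            (1 - alpha) * s + alpha * l
            for s, l in zip(short_vals, long_vals)
        ]
    return merged
```

The intuition is that pulling the weights partway back toward the earlier checkpoint keeps short-context behavior from drifting while retaining most of what the long-context stage learned. The right `alpha` would have to be chosen empirically.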

00:12:41

Boring beats brilliant: Cursor's skills over infrastructure

00:12:41 David Gomes from Cursor gave a talk today that's worth the full 19 minutes. He walked through how the Cursor team replaced fifteen thousand lines of custom git worktrees infrastructure with a 200-line Markdown skill. Git worktrees were Cursor's way of letting agents work in parallel — separate checkouts of your repository where different agents could work on the same task at the same time without interfering with each other.

00:13:09 The feature was complex. It required writing all the code to create worktrees, manage them, and feed them into the agent as context. They had to make sure agents were scoped and isolated — they could not escape the work tree they were working on. There were setup scripts that users could configure to run anytime an agent started operating on a given work tree.

00:13:32 There was judging — a thumbs-up icon that told you which implementation looked best based on different criteria. There was cleanup complexity because people like to spin up hundreds of these worktrees and then their disk sizes blow up. The new implementation uses two existing Cursor primitives — skills and sub-agents — instead of custom infrastructure.

00:13:55 The user types slash-work-tree and gives a task. The skill is basically a set of instructions telling the model how to create worktrees, run the setup scripts the user might have configured, and stay on that checkout. The best-of-N skill is very similar. It's actually even smaller, around 40 lines.

00:14:15 It's all Markdown. It's not even code. The previous version of this feature was about 4,000 lines of code, and the PR removing it deleted around 15,000 lines. Gomes was frank about the tradeoffs. The pros are real: he has much less code to maintain.

00:14:32 Users can switch to a worktree halfway through a chat — which wasn't possible before because they didn't want to pollute the prompt UI with drop-downs and settings. Multi-repo setups now work. The judging experience at the end is superior — the parent now has a lot more context over what each of the sub-agents did, and users can even ask the agent to stitch together little pieces from different implementations.

00:14:59 The cons are worth listening to. Gomes put it directly: with the previous approach, the agent had to stay on track because it was physically impossible for it to do otherwise. Now they're trusting the model. So it's, in his words, a bit vibes-based. Models drift.

00:15:15 Smaller models especially: Haiku, he noted, will very often deviate and start working in the primary checkout instead of the worktree. Not all models are equally good at this. One of the evals Gomes has been writing checks whether the model did its work in its worktree as expected, and another checks the reverse.

00:15:36 So far, Haiku deviates very often, while Composer and Grok are doing much better. Gomes said he hasn't yet been able to simulate extremely long sessions, which is when models start performing worse. But even the early evals show that some models handle trust-based boundaries better than others.
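A compliance check of this kind can be as simple as sorting the files an agent edited by which checkout they live in. The sketch below is one reading of the two evals described above (did work happen in the worktree, and did anything land in the primary checkout). It is not Cursor's eval code, and the names `classify_edits` and `compliance` are hypothetical. It assumes absolute, already-normalized POSIX paths.

```python
from pathlib import PurePosixPath


def _inside(path, root):
    """True if path is root itself or lives somewhere beneath it."""
    return path == root or root in path.parents


def classify_edits(edited_paths, worktree_root, primary_root):
    """Sort edited file paths into worktree, primary checkout, or other."""
    worktree = PurePosixPath(worktree_root)
    primary = PurePosixPath(primary_root)
    result = {"worktree": [], "primary": [], "other": []}
    for raw in edited_paths:
        path = PurePosixPath(raw)
        # Check the worktree first: it may be nested inside the primary repo.
        if _inside(path, worktree):
            result["worktree"].append(raw)
        elif _inside(path, primary):
            result["primary"].append(raw)
        else:
            result["other"].append(raw)
    return result


def compliance(edited_paths, worktree_root, primary_root):
    """Summarize the two checks: work happened in the worktree, none in primary."""
    buckets = classify_edits(edited_paths, worktree_root, primary_root)
    return {
        "worked_in_worktree": bool(buckets["worktree"]),
        "touched_primary": bool(buckets["primary"]),
    }
```

A session passes when `worked_in_worktree` is true and `touched_primary` is false. Tracking the two flags separately distinguishes a model that did nothing from one that did the work in the wrong place.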

00:15:54 The new commands feel slower because you see the agent create the work tree and you see that in your chat. It's not actually slower, but it does feel like the agent is wasting time doing something that should be done for it in advance. And discoverability is worse — the old dropdown is gone, and power users have to know the feature exists to type the slash command.

00:16:18 Cursor is working on evals using Braintrust to measure work-tree compliance, and on training their Composer model specifically for these tasks via RL. For Composer 2, their latest model, they didn't have any RL tasks operating in this type of environment. So they're working on adding a bunch of these tasks into the RL pipeline so that by the time they launch Composer 3 or 4 or 5, at least their own model will be much better at this.

00:16:47 Obviously they cannot improve the models that other companies develop, but they've been sharing feedback with all the other labs and model providers on this kind of thing. Cursor is also looking into parallelization primitives beyond git worktrees since worktrees are slow to create, disk-hungry, and limited to git repos.

00:17:08 If you're using something other than git, there's really no local parallelization primitive in Cursor right now. Gomes said they hope to share more about that in the near future. The lesson here applies to any team building agent workflows: there's a real difference between enforced boundaries and trust-based ones.

00:17:28 The old Cursor implementation was enforced — the model couldn't touch files outside its work tree. The new one relies on the model following instructions. For power users, that's an acceptable tradeoff. The skills are easier to modify, cross-platform, and don't require updating the app.

00:17:47 But it's a tradeoff worth naming. Gomes did. Most teams building skills-based agent systems are making the same calculation right now, just in less public conversations. The question isn't whether skills can replace custom infrastructure. It's whether your models are reliable enough to enforce the boundaries you need, and what happens when they aren't.

00:18:10 This is the boring-correct path. Replace clever infrastructure with boring, configurable primitives. Maintain less code. Accept that some edge cases will drift. Build evals and training to close the gap over time. The alternative is the path most teams take when they first build something: custom infrastructure that works today and haunts you tomorrow.

00:18:32

Figure AI: production, not prototype

00:18:32 Figure AI announced they've hit twenty-four times production scale — one humanoid robot per hour at their manufacturing line. The company is teasing a fleet deployment, which means moving from prototype to operations. This is a different kind of engineering problem than model benchmarking.

00:18:52 Assembly lines, supply chains, and reliability at scale are hard in ways that aren't captured by leaderboard scores. One robot per hour is a meaningful milestone. It's not a demo. It's a production rate that suggests the company is serious about deployment. The broader context is that humanoid robotics has been stuck in the demo-for-years phase for most companies.

00:19:17 Figure AI is the one that's actually built a manufacturing pipeline, and the twenty-four times number means they can now produce enough units to test real fleet operations rather than showing off a single prototype. The Reddit discussion around the announcement has been substantial, with a score over 3,500 and 900 comments, which tells you this is hitting a nerve in the robotics community.

00:19:44 I'm watching this because the deployment patterns in robotics will likely inform how we think about agent deployment patterns in software. Both are about taking something that works in a controlled environment and making it work at scale. The hardware has more friction, but the same engineering tensions apply.

00:20:05 Every company that's built anything at scale knows the difference between "works in the lab" and "works on the line." That gap is what Figure AI is closing right now. If humanoid robotics ever becomes economically viable at scale — and one robot per hour is a step toward that — the implications for logistics, warehousing, and other labor-intensive industries will be real.

00:20:31 The question isn't whether the technology will work. It's whether the economics will work, and that depends entirely on production costs.

00:20:40

GCC 16: the plumbing update

00:20:40 GCC 16 has been released. It's not AI news, but it's compiler news, and compilers are the plumbing that AI agents write into. The release includes C-plus-plus 26 reflection support — compile-time introspection of types and structures without template metaprogramming hacks.

00:21:00 That's a fundamental change to how you write metaprogramming in C-plus-plus. Instead of the old approach of using template metaprogramming to reflect on types at compile time, you can now use native reflection built into the language. There are already people using it on Debian sid — it's available as a trunk package — writing what one commenter called "magical things with reflection" that are much better than the old template-based approach.

00:21:32 The improvements to compiler optimization passes and debug info generation are worth watching if you compile anything that AI generates. There are also updates to libstdc++ including C-plus-plus 26 library features. There's a practical angle here: as more generated code flows through GCC, understanding these changes helps you write and debug the generated output.

00:21:58 C-plus-plus 26 reflection changes how you write metaprogramming, and the optimization improvements affect how compiled code performs — which matters when you're optimizing generated code for latency. The debug info changes are especially relevant if you're debugging code that was generated by an AI agent, because you want to be able to trace back from compiled output to the original intent.

00:22:26 It's a small thing in the grand scheme of today's news cycle. But the best engineers I know pay attention to their compilers. It's the foundation everything else runs on, and it's easy to ignore until something breaks.

00:22:42

Closing

00:22:42 That's today. Two open-weight models showing that data quality beats parameter count, Cursor replacing custom infrastructure with a skill, and a manufacturing line that's actually producing hardware instead of demos. I'm watching whether Granite's FP8 quantized builds perform as advertised on consumer GPUs, and whether Cursor's evals catch the drift patterns they're describing.

00:23:03 — Lenar.