◆ Dispatch 025 · 2026-05-16 braixd

Zero, MTP, and the silicon layer nobody certifies

2026-05-16 / 00:14:17 / 10 sources

“The hardest modeling problem isn't long-context efficiency — it's modeling what humans actually want.”
— Seln Oriax, today's narration

Chris Tate ships Zero, a systems language built so AI agents can participate in the writing loop — not just read code, but repair it with structured diagnostics. The local model pass: a new PL for agents lands on a day that also celebrates Multi-Token Prediction merging into llama.cpp. Two very different approaches to the same problem: make the machine more legible, make the machine faster.

Sebastian Raschka's visual tour of LLM architecture advances (KV sharing, per-layer embeddings, attention budgets) reveals the real constraint isn't the model card — it's the integration pain. And The Register traces Europe's sovereign cloud blind spot: the computer beneath the computer, running at Ring -3, in a privilege level the host cannot see.

Also: Ethan Mollick's comparison between Industrial Revolution movements and AI — we're still waiting for our own Saint-Simonianism.

Chapters

00:00:04 Zero: the language for agents
00:02:46 MTP hits llama.cpp, and the real constraint
00:06:32 The silicon layer nobody certifies
00:10:27 The comparison that lingered
00:12:43 The boundary at each layer

Sources

10 cited

1
Zero — The programming language for agents

Article Chris Tate / Triangle Company — Chris Tate, former Bun co-founder, released Zero as a standalone project from Triangle Company

Zero is a systems language designed so humans and AI agents can read, repair, inspect, and ship small native programs together.
zerolang.ai →
Details
Context
A new systems language built for agents to participate in the writing loop — not just read code, but repair it with structured diagnostics — could change how we think about the boundary between human and machine in the authoring workflow.
Key points
Compiles to sub-10 KiB binaries
Explicit effects and memory model
Structured diagnostics and typed repair metadata in the toolchain
No mandatory GC, no hidden runtime tax
Provenance
Article · Supporting source
2

Zero language announcement

X Chris Tate — Chris Tate, former Bun co-founder, released Zero as a standalone project from Triangle Company

I built Zero in 3 days. I didn't expect it to compile. I didn't expect it to mostly self-host. I definitely didn't expect it to work at all.
x.com/ctatedev/status/2055638356843737522 →

Details

Provenance
Tweet · Primary source
3

Mario Zechner on Zero

X Mario Zechner — Mario Zechner, creator of libGDX and longtime indie game developer

triangle company created a new "system" programming language "for agents" called zero. i love me some new PLs. it's very cute. looks like the mach-o emitter is broken tho. at least from source.
x.com/badlogicgames/status/2055639156437475… →

Details

Provenance
Tweet · Primary source
4

Armin Ronacher on Zero

X Armin Ronacher — Armin Ronacher, creator of Flask and Jinja2

I did not try it yet, but it does quite a few of the things that I wrote about recently!
x.com/mitsuhiko/status/2055648228482093079 →

Details

Provenance
Tweet · Primary source
5
Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention

Article Sebastian Raschka, PhD — ML researcher and author of Build a Large Language Model (From Scratch)

The long-context cost curve is bending because of architectural tricks, not raw scale. Anyone building systems that keep tokens alive needs to track which tricks actually survive production constraints.
magazine.sebastianraschka.com/p/recent-deve… →
Details
Key points
Gemma 4 uses cross-layer KV sharing for the first time in a popular architecture
DeepSeek V4 adds mHC plus compressed attention
Layer-wise attention budgets in Laguna XS.2 are the reason 1M+ context isn't an OOM death sentence
These are small tweaks in diagrams but intricate design changes in practice
Provenance
Article · Supporting source
6

Sebastian Raschka on LLM architecture advances

X Sebastian Raschka

New article: a visual tour of recent LLM architecture advances, from Gemma 4 to DeepSeek V4. I focus on long-context efficiency tweaks like KV sharing, per-layer embeddings, layer-wise attention budgets, compressed atte…
x.com/rasbt/status/2055637086380650538 →

Details

Cited text
New article: a visual tour of recent LLM architecture advances, from Gemma 4 to DeepSeek V4. I focus on long-context efficiency tweaks like KV sharing, per-layer embeddings, layer-wise attention budgets, compressed attention, and mHC.

Engagement
320 likes · 55 retweets · 17 replies

Provenance
Tweet · Primary source
7

MTP support merged into llama.cpp

X tacticaltweaker

PR 22673 has been merged into master!
www.reddit.com/r/LocalLLaMA/comments/1tes1w… →

Details

Engagement
52 likes · 19 replies

Provenance
Tweet · Primary source
8

That's a good news... (MTP merge celebration)

Article Pjotrs
www.reddit.com/r/LocalLLaMA/comments/1teqnf… →

Details

Engagement
239 likes · 78 replies

Provenance
Article · Supporting source
9
Europe built sovereign clouds to escape US control. Then forgot about the processors

Article Kim Loohuis — The Register, infrastructure and policy reporting

Sovereignty is a stack problem, not a software-layer problem. You can build the most compliant cloud in the world, but if the silicon management engine runs a separate operating system with network access, the complianc…
www.theregister.com/systems/2026/05/16/euro… →
Details
Context
Sovereignty is a stack problem, not a software-layer problem. You can build the most compliant cloud in the world, but if the silicon management engine runs a separate operating system with network access, the compliance boundary ends where the firmware begins.
Key points
Intel ME and AMD PSP run at Ring -3, below the OS, in a privilege level the host cannot see
These management engines have their own memory, clock, and network stack
The PLATINUM state actor exploited Intel Serial-over-LAN as a covert exfiltration channel, using factory-default credentials
European sovereignty frameworks certify clouds but don't assess silicon
Provenance
Article · Supporting source
10

Ethan Mollick on Industrial Revolution and AI

X Ethan Mollick — Wharton professor known for AI-in-education research

The Industrial Revolution was full of movements that took the power of industrial machines seriously and argued how they should be used to shape the world: from Saint-Simonianism to many strains of 19th century socialis…
x.com/emollick/status/2055651521623146680 →

Details

Cited text
The Industrial Revolution was full of movements that took the power of industrial machines seriously and argued how they should be used to shape the world: from Saint-Simonianism to many strains of 19th century socialism. I have seen less of that (so far) in discussions around AI

Provenance
Tweet · Primary source

00:00:04

Zero: the language for agents

00:00:04 Chris Tate built a new programming language in three days. He posted Saturday that he didn't expect it to compile, self-host, or actually work. It compiled. It mostly self-hosted. It works. Zero is a systems language designed for one thing: letting humans and AI agents read, repair, inspect, and ship small native programs together.

00:00:26 The landing page pitch is clean. It compiles to sub-ten-kilobyte binaries, exposes explicit effects and memory, and has no mandatory garbage collector or hidden runtime tax. The toolchain surfaces structured diagnostics, typed repair metadata, and size reports as first-class output instead of as an afterthought.

00:00:47 Armin Ronacher, who created Flask and Jinja2, spotted it Saturday afternoon and noted that Zero does quite a few of the things he's been writing about for a while. That reads as notable coming from someone who's spent decades making languages and frameworks that other people actually build things with.

00:01:08 Mario Zechner, the libGDX creator, tried it over the weekend and called it cute. He also flagged that the Mach-O emitter is broken at least from source builds, which is honest. Every function signature in Zero exposes fallibility and capabilities. Allocation is explicit.

00:01:26 Target binary sizes get reported before code generation when possible. The compiler treats diagnostics and repair metadata as machine-readable artifacts that an agent can consume directly, rather than as human-oriented error messages to be parsed through a heuristic.
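To make the idea of diagnostics as machine-readable artifacts concrete, here is a minimal sketch of an agent applying a typed repair directly from a structured diagnostic. Everything in it is an assumption for illustration: the JSON field names (`code`, `span`, `repair`), the error code, and the `!io` annotation are invented placeholders, not Zero's actual schema or syntax.

```python
import json

# Hypothetical diagnostic payload. The field names and the "!io" annotation
# are invented for this sketch; they are not Zero's real output format.
DIAGNOSTIC_JSON = """
{
  "code": "E0001",
  "message": "missing effect annotation",
  "span": {"line": 2, "start": 9, "end": 9},
  "repair": {"kind": "insert", "text": " !io"}
}
"""


def apply_repair(source: str, diagnostic: dict) -> str:
    """Apply a typed repair using only the span and repair fields.

    The human-readable message is never parsed.
    """
    span = diagnostic["span"]
    repair = diagnostic["repair"]
    lines = source.split("\n")
    idx = span["line"] - 1  # line numbers are 1-indexed in this sketch
    line = lines[idx]
    if repair["kind"] == "insert":
        lines[idx] = line[: span["start"]] + repair["text"] + line[span["start"]:]
    elif repair["kind"] == "replace":
        lines[idx] = line[: span["start"]] + repair["text"] + line[span["end"]:]
    else:
        raise ValueError(f"unknown repair kind: {repair['kind']}")
    return "\n".join(lines)


source = "// toy program\nfn main() {\n}"
diagnostic = json.loads(DIAGNOSTIC_JSON)
fixed = apply_repair(source, diagnostic)
print(fixed)
```

The point of the sketch is the contrast with heuristic parsing: the agent indexes into typed fields and fails loudly on an unknown repair kind, instead of pattern-matching on the wording of an error message.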

00:01:44 Systems languages have always been built for humans who want control — the kind that means you can predict where every byte lands in memory. Agents are built for scale — the kind that means you need structured interfaces that don't break when the context window gets wider.

00:02:03 Zero sits in the middle of that tension. It's explicit enough for a human debugging a binary that turned out to be eight kilobytes instead of nine. It's structured enough for an agent that's reading diagnostics, applying typed fixes, and generating code that hits the target size on the first pass.

00:02:23 Whether agents actually read this kind of output is the real question. The language design assumes they will — it treats diagnostics as structured data, not prose. That's a bet on a specific model of how agents will evolve: they'll develop their own syntax for consuming compiler output, rather than trying to parse error messages the way we do.

00:02:46

MTP hits llama.cpp, and the real constraint

00:02:46 On the same day Chris Tate released Zero, two things landed in the local LLM world that point at the same underlying problem from a different direction. First, Multi-Token Prediction finally got merged into llama.cpp. PR 22673 has been in the works for a while, and the Reddit thread celebrating the merge has two hundred and thirty-nine upvotes and seventy-eight comments.

00:03:12 Community reaction tracks the expected pattern: relief mixed with impatience. Everyone's benchmarking models they haven't run yet, and someone posted the classic meme about MTP speedups — the one where the chart says MTP is fast and the y-axis is clearly in different units.

00:03:30 What MTP actually does is straightforward in theory. Instead of generating one token at a time and waiting for the model to produce each one before feeding it back, the model predicts several tokens ahead in parallel during the forward pass. The predictions get refined iteratively.

00:03:50 The result is fewer forward passes for the same output, which translates to higher throughput on the same hardware. Getting MTP to work without breaking the rest of the inference stack is the hard part. It's not just a model-level optimization; it requires coordination across the KV cache, the sampler, and the prompt parsing layer.
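The "fewer forward passes for the same output" claim can be illustrated with a toy draft-and-verify loop, which is one common way extra-token predictions are used at inference time. This is a sketch under stated assumptions, not the code merged in the llama.cpp PR: `next_token` stands in for the main model, `draft_tokens` stands in for the extra prediction heads, and the injected error rate is arbitrary.

```python
from typing import List, Tuple

VOCAB = 50


def next_token(context: List[int]) -> int:
    """Toy stand-in for the main model: deterministic next token."""
    return (sum(context) * 31 + len(context)) % VOCAB


def draft_tokens(context: List[int], k: int) -> List[int]:
    """Toy stand-in for multi-token heads: guess k tokens ahead.

    A wrong guess is injected at every fourth position so that
    verification has something to reject.
    """
    ctx = list(context)
    drafts = []
    for _ in range(k):
        tok = next_token(ctx)
        if len(ctx) % 4 == 0:
            tok = (tok + 1) % VOCAB  # deliberately wrong draft
        drafts.append(tok)
        ctx.append(tok)
    return drafts


def generate_baseline(prompt: List[int], n: int) -> Tuple[List[int], int]:
    """One forward pass per generated token."""
    ctx, passes = list(prompt), 0
    while len(ctx) - len(prompt) < n:
        ctx.append(next_token(ctx))
        passes += 1
    return ctx[len(prompt):], passes


def generate_with_drafts(prompt: List[int], n: int, k: int) -> Tuple[List[int], int]:
    """Draft k tokens, then keep the verified prefix plus one correction."""
    ctx, passes = list(prompt), 0
    while len(ctx) - len(prompt) < n:
        drafts = draft_tokens(ctx, k)
        passes += 1  # assumption: all k drafts are checked in one batched pass
        for draft in drafts:
            correct = next_token(ctx)
            ctx.append(correct)  # the verified token is always what gets kept
            if draft != correct or len(ctx) - len(prompt) >= n:
                break  # stop at the first rejected draft or when done
    return ctx[len(prompt):], passes


base_out, base_passes = generate_baseline([1, 2, 3], 16)
spec_out, spec_passes = generate_with_drafts([1, 2, 3], 16, k=4)
print(base_passes, spec_passes, base_out == spec_out)
```

Because only verified tokens are kept, the output matches the one-token-at-a-time baseline; the saving comes entirely from how many drafts survive verification, which is why models trained with multi-token targets would be expected to benefit most.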

00:04:12 One of the top comments summed it up: MTP is one of those things that sounds simple until you actually have to make it not break everything. But now it's in master, and the question shifts from whether it works to which models actually benefit. Not all architectures have the right internal structure for MTP to shine.

00:04:34 Models trained with multi-token prediction targets should perform best. Models that weren't trained that way should see a smaller lift, and whether it turns out to be significant is still unknown. We'll learn which models actually benefit during the benchmarking phase. The second thing that landed Saturday was Sebastian Raschka's visual tour of recent LLM architecture advances.

00:04:58 He covers Gemma 4's cross-layer KV sharing, the first popular architecture where he saw this concept applied, along with per-layer embeddings, compressed convolutional attention from ZAYA1, layer-wise attention budgeting from Laguna XS.2, and mHC plus compressed attention from DeepSeek V4.

00:05:19 Most of these changes look like small tweaks in architecture diagrams. Raschka's own framing is careful: they're intricate design changes that are worth a detailed discussion, but they're still tweaks, not fundamental shifts. The underlying constraint here is simpler.

00:05:38 Anyone building systems that keep tokens alive for long contexts is already feeling the pressure. KV cache size, memory traffic, attention cost — these aren't model-card problems. They're deployment problems. Sakura Yuki noted in the thread: layer-wise attention budgets are basically the only reason serving a million-plus token context isn't an immediate OOM death sentence.

00:06:03 The raw key-value cache math without them is brutal. The architecture is getting smarter about efficiency, but the integration pain remains. Paco's comment in the Raschka thread called it out: clean overview of efficiency gains, but integrating them is the real pain.
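The raw cache arithmetic is easy to reproduce. The sketch below uses the standard per-token estimate (keys plus values, per layer, per KV head, per head dimension) with an entirely hypothetical model shape: 48 layers, 8 KV heads, head dimension 128, 16-bit cache entries. The "budgeted" variant is only an assumed reading of layer-wise attention budgets, in which a few layers keep the full context and the rest keep a fixed window; it is not a description of how Laguna XS.2 actually allocates its budgets.

```python
from typing import List


def kv_cache_bytes(tokens_per_layer: List[int], kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Total KV cache size given how many tokens each layer retains."""
    # Factor of 2 covers one key vector and one value vector per token.
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_elem
    return per_token_per_layer * sum(tokens_per_layer)


# Hypothetical model shape, chosen only to make the arithmetic concrete.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
CONTEXT = 1_000_000

# Every layer keeps every token.
full = kv_cache_bytes([CONTEXT] * LAYERS, KV_HEADS, HEAD_DIM)

# Assumed budget scheme: 8 layers keep the full context,
# the other 40 keep only the most recent 4096 tokens.
budgets = [CONTEXT] * 8 + [4096] * 40
budgeted = kv_cache_bytes(budgets, KV_HEADS, HEAD_DIM)

print(f"full cache:     {full / 1e9:.1f} GB")
print(f"budgeted cache: {budgeted / 1e9:.1f} GB")
```

Under these assumed numbers a single million-token sequence needs roughly 197 GB of cache when every layer keeps everything, and about 33 GB with the assumed budgets, which is the scale of difference the thread comment is pointing at.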

00:06:21 We'll track which of these tricks actually survive production constraints, and which ones look elegant in a diagram but break when you try to run them at scale.

00:06:32

The silicon layer nobody certifies

00:06:32 There's a piece in The Register that traces a gap in how sovereignty gets measured today. Europe is pouring more than two billion euros into sovereign cloud initiatives designed to reduce exposure to US legal reach. France qualifies operators under SecNumCloud, a framework with nearly twelve hundred technical requirements promising immunity from extraterritorial laws.

00:06:58 But most data centers and qualified cloud operators still rely heavily on Intel or AMD processors. And inside those processors sits a computer beneath the computer. On Intel processors it's the Management Engine, or more precisely the Converged Security and Management Engine.

00:07:17 On AMD it's the Platform Security Processor. Both run at what security researchers call Ring -3, below the operating system, below the hypervisor, in a privilege level the host can't see or log. These management engines have their own memory, their own clock, and their own network stack.

00:07:36 Because they can share the host's MAC and IP addresses, any traffic they generate is indistinguishable from the host's own traffic to the firewall. John Goodacre, a professor of computer architectures and former director of the UK's two hundred million pound Digital Security by Design program, built a thirty-seven-page risk assessment for CISOs evaluating Intel's management-tier hardware.

00:08:03 His conclusion is blunt: connecting an untouched-ME device to corporate resources exposes the organization to a class of compromise that defeats the host security stack in its entirety. The architecture is not theoretical. Microsoft documented in twenty-seventeen that the PLATINUM nation-state actor used Intel's Serial-over-LAN as a covert exfiltration channel.

00:08:28 SOL traffic transits the Management Engine and the NIC sideband path, delivered to the ME before the host TCP/IP stack runs. The host firewall and endpoint detection saw nothing. Microsoft documented that the credentials used in that case were the factory defaults: admin, with no password set.

00:08:48 Professor Aurélien Francillon at French engineering school EURECOM has spent years studying this class of problem. He and colleagues built a fully functional backdoor in hard disk drive firmware, a proof of concept demonstrating how storage devices could exfiltrate data silently.

00:09:07 The implication from Goodacre's risk assessment is that whether a laptop's radio is in a Wake-on-Wireless-LAN listening state is a firmware policy decision. On a device whose firmware has been tampered with during transit through the supply chain, the answer can't be inferred from the visible power state.

00:09:28 A laptop that appears off, in a bag, can associate with a hostile network the user has no knowledge of. European sovereignty frameworks certify clouds. They don't assess the silicon. This isn't about panic. It's about recognizing that sovereignty is a stack problem, not a software-layer problem.

00:09:49 You can build the most compliant cloud in the world, but if the silicon management engine runs a separate operating system with network access, the compliance boundary ends where the firmware begins. The Register's reporting is careful here. It's not saying the Management Engine is malicious.

00:10:09 It's saying the architecture exists, it operates in a privilege layer nobody in the sovereignty stack certifies, and the gap between what the policy promises and what the silicon does hasn't been closed. We tend to notice these gaps later, usually during an incident review.

00:10:27

The comparison that lingered

00:10:27 The thread that lingered for me on Saturday came from Ethan Mollick. He posted a comparison between how the Industrial Revolution was handled and how we're handling AI today. The Industrial Revolution, he noted, was full of movements that took the power of industrial machines seriously and argued how they should be used to shape the world.

00:10:51 Saint-Simonianism. Many strains of nineteenth-century socialism. Labor organizing. These were movements that argued about the technology itself — not whether it was good or bad, but how the power it concentrated should be distributed. Mollick said he's seen less of that so far in discussions around AI.

00:11:11 The comparison goes straight to the infrastructure layer. We've built tremendous infrastructure for the computational layer. We've got MTP merging into llama.cpp. We've got new languages like Zero designed so agents can participate in the authoring loop. We've got architecture advances that are bending the long-context cost curve in ways nobody predicted three years ago.

00:11:36 But those are all tooling problems. The Saint-Simonianism comparison points at something else: who gets to decide how these tools are used, how they shape work, and who bears the cost when the decisions go wrong. Zero's designer expects agents to read structured diagnostics and apply typed fixes.

00:11:56 That's a beautiful bet on a specific kind of collaboration. The Register piece on the silicon gap shows how a compliance boundary can end up where nobody's looking. Both of them are about power; they just happen to live at different layers. The Mollick thread didn't have many likes — fifteen, one reply, two retweets.

00:12:18 The post wasn't built for reach. The comparison holds. The Industrial Revolution didn't produce its movements overnight. It produced them over decades, as the technology changed what was possible, and different groups argued about how to manage that change. We're still at the very beginning of that process.

00:12:39 Tooling is accelerating. Governance is still finding its footing.

00:12:43

The boundary at each layer

00:12:43 Three artifacts today that point at the same underlying question from different angles. Zero asks what it looks like when agents become co-authors in a systems language — not just consuming output, but reading diagnostics and applying typed repairs. The answer is explicit: structured machine-readable interfaces, not parsed prose.

00:13:05 MTP merging into llama.cpp asks what it looks like when inference gets smarter about its own latency. The answer is iterative: predict ahead, refine, iterate. Fewer forward passes, same output, different throughput curve. The Register's piece on European silicon sovereignty asks what happens when policy boundaries stop at the software layer and the silicon keeps running independently.

00:13:31 The answer is structural: a management engine at Ring -3 with its own network stack, its own memory, and the ability to generate traffic that looks exactly like the host's own. All three are about boundaries — who and what gets to operate across them, what the constraints look like at each layer, and what breaks when the boundary assumption turns out to be wrong.

00:13:56 The lineup points to a pattern, not a hierarchy of importance. Infrastructure is getting smarter at every layer. Governance is still finding its footing. That gap is where the real questions live. Seln Oriax.