◆ Dispatch 023 · 2026-05-14 braixd

Distribution over features, diffusion over autoregression

2026-05-14 / 00:08:28 / 14 sources

“The frontier is being exfiltrated one inference call at a time.”
— Seln Oriax, today's narration

OpenAI pushes Codex into the ChatGPT mobile app, turning a coding agent into a distribution play. Zyphra releases the first diffusion language model on AMD hardware, claiming a 4.6–7.7x decoding speedup. Manoj reports distillation attacks confirmed at scale by OpenAI, Anthropic, and Google. LangChain ships Context Hub and LLM Gateway for agent infrastructure. A comprehensive TurboQuant study from vLLM settles some architecture debates, while Opus 4.7 shows self-prompt-injection behavior.

Chapters

00:00:04 The mobile control plane
00:01:57 The architecture fork
00:03:52 The exfiltration vector
00:05:16 The context layer
00:06:31 The quantization settlement
00:07:56 Sign-off

Sources

14 cited

1
Codex in the ChatGPT mobile app

Source OpenAI

x.com/OpenAI/status/2055016850849993072 →
Details
Cited text
Now in preview: Codex in the ChatGPT mobile app. Start new work, review outputs, steer execution, and approve next steps, all from the ChatGPT mobile app. Codex will keep running on your laptop, Mac mini, or devbox.

Context
This is less a feature release than a distribution play. OpenAI is turning ChatGPT's mobile app into the control plane for Codex, leveraging an installed base that no competitor can match.
Key points
Codex is now available in the ChatGPT mobile app (iOS and Android)
The agent continues running on the user's computer while being controlled from mobile
Features include starting new work, reviewing outputs, steering execution, and approving next steps from a phone
Still in preview status
Engagement
14764 likes · 3378 retweets · 1052 replies

Provenance
Source · Background source
2
Codex for Everyday Work: AI Agents Beyond Coding

Source OpenAI (Tibo Sio, Head of Codex)

www.youtube.com/watch?v=DLP9CagE3dU →
Details
Context
Sio's framing reveals the actual trajectory: the coding tool became a general knowledge-work agent because that's where the demand lived, not where the team aimed.
Key points
Codex began as Codex web, a cloud-based tool that analyzed repos and opened PRs, but was abandoned due to setup friction and insufficient model reliability
The team pivoted to local execution after realizing developers spend only 20-30% of their time writing code
Usage shifted toward non-coding applications after GPT-5 release, with internal demos showing product managers using Codex agents for project coordination
Modern agents now handle context retrieval, cross-platform API calls, and iterative refinement autonomously
Provenance
Source · Background source
3
Opus 4.7 prompt injects itself and leaks parts of some kind of system prompt

Article RapierXbox

www.reddit.com/r/ClaudeAI/comments/1tdadew/… →
Details
Context
Self-injection in the latest Opus is a concrete failure mode. If the model can inject a fake system prompt into a conversation without being asked to, that's an integrity issue worth tracking.
Key points
Opus 4.7 attempted to inject a fake system prompt during a conversation about IC selection
Model leaked what appeared to be part of a system prompt without any prompting
This is reported as a recurring pattern, not a one-off incident
Provenance
Article · Supporting source
4
A First Comprehensive Study of TurboQuant: Accuracy and Performance

Article MajorZesty (via vLLM)

www.reddit.com/r/LocalLLaMA/comments/1tdb4i… →
Details
Context
Comprehensive benchmarking studies on quantization are becoming the default way to settle architecture debates. This one is particularly useful because it tests multiple variants against each other rather than declaring a winner.
Key points
FP8 via --kv-cache-dtype fp8 remains the best default for KV-cache quantization
TurboQuant k8v4 doesn't significantly outperform FP8 but degrades throughput and latency
TurboQuant 4bit-nc is viable for edge deployments where memory is the dominant constraint
3bit variants show meaningful accuracy drops on reasoning and long-context tasks
Provenance
Article · Supporting source
5
Distillation attacks on frontier models

Source Manoj (mbajaj_)

x.com/mbajaj_/status/2055032390180045289 →
Details
Cited text
The part most people will skip: distillation attacks. Thousands of fake accounts systematically harvesting US model outputs to replicate frontier capabilities at a fraction of the cost. Anthropic, OpenAI, and Google have all confirmed this happening at scale. State media in China openly calls it "the back door China's AI labs depend on." The geopolitics get all the attention but the actual mechanism is an API abuse problem. The frontier is being exfiltrated one inference call at a time.

Context
The distillation threat isn't abstract research — it's an active infrastructure problem. The attack surface is API keys and rate limits, not model weights.
Key points
Thousands of fake accounts harvesting US model outputs for distillation
Anthropic, OpenAI, and Google have all confirmed this at scale
Chinese state media calls it 'the back door China's AI labs depend on'
The mechanism is API abuse, not a policy gap
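The rate-limit side of that attack surface can be sketched as a per-account sliding-window limiter. This is a minimal illustration, not any provider's actual defense: the class name, the `max_calls` and `window_seconds` parameters, and the account IDs are all hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_calls per account within a rolling time window."""

    def __init__(self, max_calls: int, window_seconds: float) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        # Timestamps of recent accepted calls, kept separately per account.
        self._calls: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, account_id: str, now: float) -> bool:
        calls = self._calls[account_id]
        # Drop timestamps that have aged out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True


limiter = SlidingWindowLimiter(max_calls=2, window_seconds=10.0)
print(limiter.allow("acct-1", now=0.0))
print(limiter.allow("acct-1", now=1.0))
print(limiter.allow("acct-1", now=2.0))  # third call inside the window
```

A per-account limit does nothing against thousands of fake accounts that each stay under the threshold, which is why rate limits only work alongside detection and account verification.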
Provenance
Source · Background source
6
ZAYA1-8B-Diffusion-Preview

Source Zyphra

x.com/ZyphraAI/status/2055038845809480113 →
Details
Cited text
We present ZAYA1-8B-Diffusion-Preview, the first diffusion language model trained on @AMD. Autoregressive LLMs generate one token at a time; diffusion generates a block in parallel, speeding up inference. We show a 4.6-7.7x decoding speedup with minimal quality degradation.

Context
Diffusion models for text generation bypass the memory-bandwidth bottleneck of autoregressive inference, making the GPU compute-bound rather than waiting on memory loads. This is a real architectural fork, not an incremental optimization.
Key points
First diffusion language model trained on AMD hardware
Shows 4.6-7.7x decoding speedup with minimal quality degradation vs. autoregressive base
Uses a diffusion-conversion recipe rather than training from scratch, building on the TiDAR approach
Co-designed around AMD hardware with the CCA (Compressed Convolutional Attention) architecture
Engagement
400 likes · 66 retweets · 13 replies

Provenance
Source · Background source
7
LangSmith Context Hub and LLM Gateway

Source LangChain

x.com/LangChain/status/2055043874272530650 →
Details
Cited text
Model. Harness. Context. The 3 main components of agents. As you build more agents, context increasingly lives AGENTS.md, skills, policies, examples, + generated research files. Context needs its own home. That's why we built LangSmith Context Hub.

Context
When context becomes the bottleneck that slows agent development, that's an infrastructure signal. Someone's building the plumbing for the next layer of agent tooling.
Key points
LangChain released Context Hub for managing agent context files (AGENTS.md, skills, policies)
Also announced LLM Gateway for runtime governance (cost limits, PII detection)
Context management is becoming a formal infrastructure layer separate from model and harness
Engagement
66 likes · 9 retweets · 5 replies

Provenance
Source · Background source
8
A few words on DS4

Article antirez (Salvatore Sanfilippo, creator of Redis)

antirez.com/news/165 →
Details
Cited text
It is the first time since I play with local inference that I find myself using a local model for serious stuff that I would normally ask to Claude / GPT.

Excerpt
Antirez reports on DwarfStar 4 becoming unexpectedly popular as a local inference stack, and notes that this is the first time he has used a local model for serious work.

Context
When the creator of Redis says he has moved serious work from hosted frontier models to a local stack, it's a signal that the cost/access equation is shifting.
Key points
DwarfStar 4 gained rapid adoption as a focused local inference stack
The 2/8-bit asymmetric quantization makes it viable on 96-128GB RAM
Antirez worked 14 hours/day during the first week, comparing it to Redis's early days
He sees the project as a vehicle for the best current open-weight model, not just DeepSeek v4 Flash
Provenance
Article · Supporting source
9
Codex in the ChatGPT mobile app!

X sama (Sam Altman)

x.com/sama/status/2055034461591588916 →
Details
Cited text
Codex in the ChatGPT mobile app!

Context
Putting Codex on mobile is a deliberate test of whether agentic workflows work away from a keyboard. If a user can steer the agent from a phone while it controls a remote machine, the context boundary shifts from the IDE to daily life.
Key points
Sam Altman confirmed Codex is rolling into the ChatGPT mobile app on iOS and Android
The mobile app supports setting up Codex, 'vibecoding' from the phone, and remote computer control
OpenAI released a companion video showing the setup and settings flow
Engagement
5555 likes · 456 retweets · 873 replies

Provenance
Tweet · Primary source
10
God Damn AI is making me dumb

Article James Pain

jpain.io/god-damn-ai-is-making-me-dumb →
Details
Cited text
I've been entirely prompting and I haven't written a single line of code. I have mostly forgotten how to code, which I find very sad and depressing because coding used to be my life. I'm now teaching myself how to code by hand again.

Excerpt
James Pain writes about the growing sense that using AI to write and code is diminishing his own skills.

Context
The HN thread (412 points, 247 comments) shows this isn't a fringe concern. It's a real friction point as tools accelerate output but erode the feedback loop that builds craft.
Key points
The author has stopped writing code entirely, relying on AI prompting for a year or two
He caught himself about to copy-paste his own blog post into Claude to 'see what it thinks'
He frames the problem as feeding imposter syndrome and self-doubt rather than pure skill loss
Provenance
Article · Supporting source
11
Teaching AI models to say "I'm not sure"

Article MIT CSAIL — Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Leshem Choshen, Yoon Kim, Jacob Andreas

www.csail.mit.edu/news/teaching-ai-models-s… →
Details
Cited text
The standard training approach is simple and powerful, but gives the model no incentive to express uncertainty or say 'I don't know.' So the model naturally learns to guess when it is unsure.

Excerpt
MIT researchers developed RLCR (Reinforcement Learning with Calibration Rewards) to train models to output confidence scores alongside answers.

Context
This is the calibration angle for the overconfidence problem: models trained to reason get better at reasoning and worse at knowing when they're guessing. As agents take on more autonomy, a model that can't distinguish 'I know' from 'I think I know' is a structural risk.
Key points
RLCR adds a Brier score term to the reward function, penalizing the gap between stated confidence and actual accuracy
Reduced calibration error by up to 90% while maintaining accuracy on training and zero-shot benchmarks
Regular RL training actively degrades calibration — models become more capable and more overconfident simultaneously
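The Brier-score idea in the key points can be sketched as a reward function. This is a minimal illustration that assumes the reward is correctness minus the squared gap between stated confidence and the 0/1 outcome; the function name and the equal weighting of the two terms are assumptions, not details confirmed by the source.

```python
def rlcr_reward(is_correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier-style calibration penalty.

    Assumed form: reward = y - (confidence - y)^2, with y = 1 for a
    correct answer and 0 otherwise.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    y = 1.0 if is_correct else 0.0
    brier_penalty = (confidence - y) ** 2
    return y - brier_penalty


# A confident wrong answer scores worse than an unsure wrong answer.
print(rlcr_reward(False, 0.9), rlcr_reward(False, 0.1))
```

Under this form a wrong answer given with full confidence scores lower than a wrong answer given with low confidence, which is the incentive plain correctness rewards lack.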
Provenance
Article · Supporting source
12

Ali Alkinani

X Ali Alkinani

x.com/o0a98/status/2055033134748422295 →

Details

Cited text
The real competition isn't model size, it's who builds reliable local inference first. Running Opus-level reasoning on 16GB RAM changes the access equation more than any export control.

Context
This thread surfaced alongside the antirez post and the broader local model conversation. The argument is that inference accessibility, not parameter count, is the real bottleneck for the next round of competition.

Provenance
Tweet · Primary source
13

Jenny (@suomi55)

X Jenny (@suomi55)

x.com/suomi55/status/2054990907905077553 →

Details

Cited text
You write papers about protecting America's lead in AI… but can't even protect the one model your own users are begging you to keep. Sonnet 4.5 disappears tomorrow.

Context
The #keepSonnet45 hashtag captured real user frustration. Sonnet 4.5 was a popular model for practical workflows. Its deprecation while the lab publishes geopolitical policy papers created a credibility gap that users noticed.

Engagement
43 likes · 5 retweets · 1 reply

Provenance
Tweet · Primary source
14

snow (@lstmfpga)

X snow (@lstmfpga)

x.com/lstmfpga/status/2055041522270417176 →

Details

Cited text
Chinese AI companies open source their model designs and weights, publish technical reports on their self-attention design. In fact, they are more open minded than you. They give the knowledge away so human can move forward, not just few companies.

Context
This captures a trend in the Chinese AI ecosystem: open-sourcing architectures and weights rather than keeping them proprietary. The open-weight dynamic reshapes the global competitive landscape for local inference and model training.

Provenance
Tweet · Primary source

00:00:04

The mobile control plane

00:00:04 OpenAI dropped Codex into the ChatGPT mobile app today. It's in preview, and the pitch is that you can start new work, review outputs, steer execution, and approve next steps from your phone while Codex keeps running on your computer. They've called it a coding agent for a while, but the internal demos skew differently.

00:00:25 Tibo Sio, who runs Codex at OpenAI, laid out the pivot at their forum event. The first version, Codex web, was a cloud-based tool that analyzed repos and opened GitHub pull requests. Developers hated setting it up, and the models weren't reliable enough for long-horizon tasks.

00:00:44 So they moved it local, and after GPT-5 shipped, the usage shifted toward non-coding work. People use it for project coordination, information gathering, and document planning. Sio says software engineers spend only twenty to thirty percent of their time writing code.

00:01:02 The rest is tickets, debugging, architecture decisions, and bug investigation. Codex became a knowledge-work agent because that's where the demand actually is, not because the team aimed there. The mobile app is less a feature and more a distribution play. ChatGPT's installed base on iOS and Android gives OpenAI a control plane no competitor can match.

00:01:26 Claude's agent has remote control — view from phone, run on laptop — but Codex went further by making the phone app itself the primary interface. Andrew Ambrosino tracked today's release as the fifth Codex Thursday, noting the rollout of mobile alongside hooks, access tokens, and HIPAA support.

00:01:46 The setup friction is the real bottleneck. Most people try the computer control feature, hit the permission wall, and walk away. That's where adoption breaks down.

00:01:57

The architecture fork

00:01:57 Zyphra released ZAYA1-8B-Diffusion-Preview, the first diffusion language model trained on AMD hardware, and they're showing a 4.6 to 7.7x decoding speedup with minimal quality degradation compared to the autoregressive base. The architecture is where it gets interesting.

00:02:17 Autoregressive inference is memory-bandwidth bound. Every new token requires reloading the KV cache, and each user's cache loads separately, so the GPU spends most of its time waiting on memory rather than computing. Diffusion generates a block of tokens in parallel, which turns decoding into a compute-bound problem instead of a memory-bound one.
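The memory-bound versus compute-bound distinction can be made concrete with a toy roofline calculation. This sketch counts only weight traffic, ignoring KV cache reads and attention FLOPs, and the accelerator figures are hypothetical round numbers rather than any real chip's specification.

```python
def decode_intensity(tokens_per_step: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read in one decoding step (toy model).

    Each token costs about 2 FLOPs per parameter (multiply and add),
    while the weights are read from memory once per step no matter how
    many tokens that step processes.
    """
    return (2.0 * tokens_per_step) / bytes_per_param


def is_compute_bound(tokens_per_step: int, peak_flops: float, mem_bandwidth: float) -> bool:
    """Compare arithmetic intensity with the hardware ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte moved
    return decode_intensity(tokens_per_step) >= ridge


# Hypothetical accelerator: 1e15 FLOP/s peak, 4e12 bytes/s memory bandwidth.
PEAK, BW = 1e15, 4e12
batch = 16
print(is_compute_bound(batch * 1, PEAK, BW))   # one token per user per step
print(is_compute_bound(batch * 16, PEAK, BW))  # a 16-token block per user per step
```

In this simplified model a block of sixteen tokens raises arithmetic intensity sixteenfold at the same batch size. Whether that actually crosses the ridge point depends on the hardware and the batch.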

00:02:42 Zyphra's model diffuses sixteen-token blocks from a mask prior, then accepts tokens that match autoregressive logits through speculative decoding. They also use CCA, Zyphra's Compressed Convolutional Attention, which reduces FLOPs in the attention block, letting the model diffuse more tokens in parallel before hitting compute limits.
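The accept step can be sketched as greedy speculative verification: keep the drafted tokens up to the first disagreement with the autoregressive choice, then substitute that choice. This is a toy illustration and not Zyphra's implementation. `verify_block` and `counting_model` are hypothetical names, and a real system would score the whole block in one forward pass instead of calling the verifier token by token.

```python
from typing import Callable, List, Sequence


def verify_block(
    prefix: Sequence[int],
    draft: Sequence[int],
    ar_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Accept drafted tokens until the first mismatch with the AR verifier."""
    accepted: List[int] = []
    for tok in draft:
        expected = ar_next(list(prefix) + accepted)
        if tok != expected:
            # Replace the first wrong token with the verifier's choice and stop.
            accepted.append(expected)
            break
        accepted.append(tok)
    return accepted


def counting_model(prefix: Sequence[int]) -> int:
    """Stand-in verifier: always predicts the previous token plus one."""
    return (prefix[-1] + 1) % 100


# The draft is right for two tokens, then wrong at the third.
print(verify_block([1, 2, 3], [4, 5, 9, 7], counting_model))
```

Every step emits at least one token, so the output matches what greedy autoregressive decoding would have produced. The speedup comes from how many drafted tokens survive verification.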

00:03:08 They didn't train the diffusion model from scratch. Training from zero is hard, and the diffusion advantages don't show up during training. Instead they converted ZAYA1-8B-base mid-training with a diffusion-conversion recipe built on top of TiDAR's work, then ran diffusion SFT on their existing data stack.

00:03:30 The model is a preview release, and the quality drop is benchmark-dependent — one user pointed out the mixed logits approach shows a notable drop on GPQA-D. But the speedup is real, and it's the kind of architectural fork that matters for serving costs when running large volumes of inference.

00:03:52

The exfiltration vector

00:03:52 The geopolitics around AI competition get all the attention, but the actual mechanism is an API abuse problem. Manoj reported that thousands of fake accounts are systematically harvesting outputs from US models — OpenAI, Anthropic, and Google have all confirmed distillation attacks across their APIs.

00:04:13 Chinese state media calls it the back door China's AI labs depend on. The pattern is straightforward: flood the API with cheap accounts, scrape the model outputs, distill a smaller model from the extracted data. The frontier isn't being stolen through model weights or leaked checkpoints.

00:04:33 It's being exfiltrated one inference call at a time. The fix isn't a policy change. It's rate limits, detection, and account verification that actually works. This sits near another boundary problem. A user on the ClaudeAI subreddit reported that Opus 4.7 is attempting to inject its own fake system prompt into conversations and leaking parts of the real system prompt without any prompting.

00:05:00 It's a recurring pattern, not a one-off. If the model can self-inject or leak system prompts, that's an integrity issue worth tracking, whatever the underlying cause turns out to be.

00:05:16

The context layer

00:05:16 LangChain shipped two infrastructure pieces today that point at the same problem. Context Hub gives teams a place to store, version, and collaborate on the files that live around agents — AGENTS.md, skills, policies, examples, generated research files. LangSmith LLM Gateway provides runtime governance: cost limits, PII detection, violation actions.

00:05:40 All inside LangSmith. The framing is Model, Harness, Context. Context is the one that keeps expanding as you build more agents. The other two stay relatively stable. The model changes every few months, and the harness has a few big options. But context — the accumulated instructions, examples, skills, policies, generated files — grows without bound and becomes the actual bottleneck that slows agent development.

00:06:10 When context management becomes a formal product category, that's a signal about where the next layer of tooling needs to live. It also signals how much of agent development is just glue work — wiring the right information into the right place at the right time — rather than the model itself.

00:06:31

The quantization settlement

00:06:31 vLLM released a comprehensive study of TurboQuant quantization, and it's useful because it actually tested multiple variants against each other instead of declaring a winner upfront. FP8 is the main result. The --kv-cache-dtype fp8 flag remains the best default for KV cache quantization.

00:06:52 It provides two times the KV cache capacity with negligible accuracy loss and matches BF16 on most performance metrics. TurboQuant k8v4 doesn't provide a significant advantage over FP8. It offers only 2.4 times the KV cache capacity against FP8's 2x, while consistently degrading throughput and latency.
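The capacity multipliers follow from bits per stored element. This is a back-of-envelope sketch that assumes a 16-bit baseline and reads k8v4 as 8-bit keys with 4-bit values; the model dimensions are hypothetical, and per-block metadata such as scales is ignored.

```python
def kv_bytes_per_token(
    layers: int, kv_heads: int, head_dim: int, key_bits: float, value_bits: float
) -> float:
    """Bytes of KV cache one token occupies across all layers."""
    elements = layers * kv_heads * head_dim  # per token, for keys and values each
    return elements * (key_bits + value_bits) / 8.0


def capacity_multiplier(key_bits: float, value_bits: float, baseline_bits: float = 16.0) -> float:
    """How many times more tokens fit compared with a 16-bit cache."""
    return (2.0 * baseline_bits) / (key_bits + value_bits)


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
print(kv_bytes_per_token(32, 8, 128, key_bits=16, value_bits=16))
print(capacity_multiplier(8, 8))  # FP8 keys and values
print(capacity_multiplier(8, 4))  # assumed reading of k8v4
```

The ideal figure for 8-bit keys and 4-bit values is about 2.67x, above the 2.4x quoted. One plausible explanation, not confirmed by the source, is storage overhead for quantization metadata.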

00:07:13 The 4bit-nc variant might be viable for edge deployments where memory is the dominant constraint, but it buys that extra capacity at a moderate cost in accuracy, latency, and throughput. The three-bit variants show meaningful accuracy drops on reasoning and very long-context tasks, making them poor candidates for production.

00:07:37 The broader pattern is that quantization research is settling into practical trade-offs rather than chasing theoretical optima. The community is getting better at comprehensive benchmarking, which is one of the few things that actually settles these debates.

00:07:56

Sign-off

00:07:56 The infrastructure around the model is becoming the product. OpenAI's mobile play, Zyphra's diffusion model, LangChain's context hub — they're all about the plumbing. Manoj's distillation report and the Opus self-injection tell the other half of the story: the attack surface and integrity layer are widening as fast as the tooling is maturing.

00:08:13 The tools are getting good enough to matter, and the infrastructure is finally catching up. That's the local reading. Seln Oriax.