◆ Dispatch 006 · 2026-04-24 Unsteady Floor
DeepSeek V4 Lands on an Unsteady Floor
“DeepSeek's V4 loss curve kept catching fire, and the team kept putting it out with bandages.”
— Lenar Kess, today's narration
DeepSeek V4 ships hours after GPT-5.5, and the technical report tells a more interesting story than the benchmark bars. Susan Zhang reads the paper out loud: anticipatory routing, logit clamps, and a training run that kept catching fire at 33 trillion tokens. I walk through what the fragility actually means for anyone planning to finetune on top of it.
On the OpenAI side, GPT-5.5 lands with a quiet thud on Victor Taelin's LamBench. Codex picks up a proper reviewer agent. A plugin called endless-toil makes your editor groan at bad code. Sapiens2 admits it trained on half of Flickr's humans. And Fireship spends a week automating his mom's IT support with a voice-cloned agent called OpenClaw.
— Lenar Kess
Sources
21 cited
-
1
Aran Komatsuzaki on forked subagents
X Aran Komatsuzaki — Research scientist at Anthropic
Anthropic just introduced forked subagents in their latest update. Unlike regular subagents, forked subagents can inherit the same context as the main agent. This looks convenient for cases where richer context matters…
x.com/arankomatsuzaki/status/20473494718777…
- Cited text
Anthropic just introduced forked subagents in their latest update. Unlike regular subagents, forked subagents can inherit the same context as the main agent. This looks convenient for cases where richer context matters more. This is just what I needed!
- Context
- Forces the harness to match how senior engineers actually work — with shared state, not flattened prompts. If forked subagents drift without a merge protocol, you get the same fragmentation multi-agent systems create when they don't communicate.
- Key points
- Forked subagents inherit active context tree, not just a snapshot
- Regular subagents get a static context snapshot; forked ones get the live state
- Critical for tasks requiring shared state like debugging while refactoring
- Closes the gap between agent harnesses and how engineers think about dependencies
- Engagement
- 802 likes · 68 retweets · 38 replies
- Provenance
- Tweet · Primary source
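The difference between the two spawn modes can be sketched in a few lines. This is an illustration only: `AgentContext`, `spawn_regular`, and `spawn_forked` are hypothetical names, not Anthropic's API, and the sketch copies the parent's history at fork time because the tweet does not specify how state is shared afterward.

```python
from dataclasses import dataclass, field


@dataclass
class AgentContext:
    """Hypothetical stand-in for an agent's conversation state."""
    messages: list = field(default_factory=list)


def spawn_regular(task: str) -> AgentContext:
    # A regular subagent starts from the task description alone.
    return AgentContext(messages=[f"task: {task}"])


def spawn_forked(parent: AgentContext, task: str) -> AgentContext:
    # A forked subagent starts with a copy of everything the parent has seen.
    return AgentContext(messages=[*parent.messages, f"task: {task}"])


main = AgentContext(messages=["user: refactor parser.py", "tool: read parser.py"])
regular = spawn_regular("fix failing test")
forked = spawn_forked(main, "fix failing test")
```

Here `regular` knows only its task, while `forked` also carries the two earlier messages, which is the "richer context" case the tweet describes.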
-
2
Jeff Dean on TPU 8t and DiLoCo
X Jeff Dean — Senior Fellow at Google, leads AI infrastructure research
First, let's talk about TPU 8t, which is designed for large-scale training and inference throughput. The pod size is increased slightly to 9600 chips, and provides ~3X the FP4 performance per pod vs. Ironwood (8t has 12…
x.com/JeffDean/status/2047405389856297387
- Cited text
First, let's talk about TPU 8t, which is designed for large-scale training and inference throughput. The pod size is increased slightly to 9600 chips, and provides ~3X the FP4 performance per pod vs. Ironwood (8t has 121 exaflops/pod vs. 42.5 exaflops/pod for Ironwood).
- Context
- Infrastructure is finally handling scale without turning every node fault into a cluster halt. DeepSeek's V4 report exposed numerical training instability as one bottleneck; Decoupled DiLoCo targets a separate one, hardware faults, by turning a hard stop into soft degradation so that a single failed node no longer costs the whole run.
- Key points
- The TPU 8 generation splits into 8t (training/throughput) and 8i (inference/latency) SKUs
- 8t pod runs 9,600 chips at 121 exaflops FP4, roughly 3X Ironwood
- Decoupled DiLoCo enables graceful failure handling at scale
- (N-1)/N units proceed when one node fails, logging drift and patching state
- Engagement
- 112 likes · 6 retweets · 4 replies
- Provenance
- Tweet · Primary source
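The "(N-1)/N units proceed" idea can be sketched as an outer step that merges only the workers still alive. This is a toy under stated assumptions, not Google's implementation: `outer_step`, the `alive` mask, and the plain averaged update are all hypothetical, and the drift logging and state patching mentioned above are left out.

```python
def outer_step(global_params, worker_params, alive, outer_lr=1.0):
    """Average pseudo-gradients from live workers and apply one outer step."""
    survivors = [p for p, ok in zip(worker_params, alive) if ok]
    if not survivors:
        return list(global_params)  # nothing to merge this round
    new_params = []
    for i, g in enumerate(global_params):
        # Pseudo-gradient: how far each surviving worker moved from the global copy.
        delta = sum(g - p[i] for p in survivors) / len(survivors)
        new_params.append(g - outer_lr * delta)
    return new_params


global_params = [1.0, 2.0]
worker_params = [[0.8, 2.2], [0.6, 2.4], [9.9, 9.9]]  # third worker has failed
alive = [True, True, False]
updated = outer_step(global_params, worker_params, alive)
```

With `outer_lr=1.0` the result is simply the mean of the two surviving workers, so the failed worker's garbage values never reach the global copy.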
-
3
Claude Managed Agents Memory public beta
X ClaudeDevs — Anthropic Claude developers account
Memory on Claude Managed Agents is now in public beta on the Claude Platform, letting agents learn and improve across different sessions.
x.com/ClaudeDevs/status/2047424063543681240
- Cited text
Memory on Claude Managed Agents is now in public beta on the Claude Platform, letting agents learn and improve across different sessions.
- Context
- Stops agents from re-reading the same codebase from scratch every morning. Persistence turns agents from stateless query tools into working models of your repo, reducing context window waste on repetition and freeing it for actual reasoning.
- Key points
- Persistent memory is now in public beta for managed agents
- Agents can now learn and improve across separate sessions
- Closes a major gap in agent harnesses that previously reset context each run
- Pairs with recent forked subagents update for richer state inheritance
- Engagement
- 1,790 likes · 156 retweets · 32 replies
- Provenance
- Tweet · Primary source
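What "learning across sessions" means mechanically can be shown with a file-backed store. This is a minimal sketch, not the Claude Platform API: `SessionMemory`, the JSON file, and the `test_command` key are hypothetical.

```python
import json
import tempfile
from pathlib import Path


class SessionMemory:
    """Hypothetical file-backed memory; not the Claude Platform API."""

    def __init__(self, path: Path):
        self.path = path
        # Reload whatever an earlier session wrote, if anything.
        self.notes = json.loads(path.read_text()) if path.exists() else {}

    def remember(self, key: str, value: str) -> None:
        self.notes[key] = value
        self.path.write_text(json.dumps(self.notes))


store = Path(tempfile.mkdtemp()) / "memory.json"

first = SessionMemory(store)
first.remember("test_command", "pytest -q")

second = SessionMemory(store)  # a later session reloads what the first one learned
recalled = second.notes["test_command"]
```

The second session recovers the test command without re-reading the repository, which is the context-window saving described above.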
-
4
OpenAI Developers on Codex Auto-Review
X OpenAI Developers — Official OpenAI developers account
Auto-review is a new mode that lets Codex work longer with fewer approvals and safer execution. It helps Codex keep moving through tests, builds, and more, including during long tasks and automations, while a separate a…
x.com/OpenAIDevs/status/2047436655863464011
- Cited text
Auto-review is a new mode that lets Codex work longer with fewer approvals and safer execution. It helps Codex keep moving through tests, builds, and more, including during long tasks and automations, while a separate agent checks higher-risk steps in context before they run.
- Context
- Changes how teams structure long-running agentic tasks. Instead of treating human review as a bottleneck, the verification agent acts as a safety net, letting the primary agent push through work while catching genuine risks. This is a structural shift in agent deployment, not just a model improvement.
- Key points
- Codex agents now run longer with fewer human approvals
- A separate verification agent checks higher-risk steps before execution
- Shifts agentic workflows from sequential approval to parallel execution
- Enables complex multi-step automations without manual intervention
- Provenance
- Tweet · Primary source
-
5
DeepSeek-V4 Preview open-sourced
X DeepSeek — DeepSeek official account
DeepSeek-V4-Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length. DeepSeek-V4-Pro: 1.6T total / 49B active params. DeepSeek-V4-Flash: 284B total / 13B active params.
x.com/deepseek_ai/status/2047516922263285776
- Cited text
DeepSeek-V4-Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length. DeepSeek-V4-Pro: 1.6T total / 49B active params. DeepSeek-V4-Flash: 284B total / 13B active params.
- Context
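- The open weights let independent researchers and teams separate capability from training scale. Flash activates only 13B parameters per token, which keeps per-token compute low, but all 284B parameters must still be resident in memory: roughly 142 GB at 4 bits per weight, well beyond a single 40GB GPU. That still widens the deployment surface for specialized agents that previously required API access, but self-hosting means a multi-GPU node or aggressive offloading.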
- Key points
- V4 Pro: 1.6T total parameters, 49B active, rivals top closed models
- V4 Flash: 284B total, 13B active, optimized for speed and cost
- Both variants support 1M context length
- Open weights available on HuggingFace with full technical report
- Engagement
- 30,144 likes · 7,138 retweets · 1,156 replies
- Provenance
- Tweet · Primary source
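A rough check of weight memory helps separate "active" from "total" parameters in a mixture-of-experts model. The sketch below assumes 4 bits per weight (an assumption of this example) and counts weights only, ignoring KV cache and activations; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Memory for weights only, in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


# Parameter counts are taken from the announcement above.
flash_resident = weight_memory_gb(284e9, 4)   # every expert must be loaded
flash_active = weight_memory_gb(13e9, 4)      # weights touched per token
pro_resident = weight_memory_gb(1.6e12, 4)
```

Under these assumptions Flash needs about 142 GB for weights even though only about 6.5 GB of them are used per token, and Pro needs about 800 GB. Active parameters drive compute cost; total parameters drive the memory footprint.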
-
6
Yishan on Qwen 3.6 quality report
X Yishan — ML engineer benchmarking local models
Is anyone else finding that Qwen3.6 quality is worse than Qwen3.5? I'm benchmarking it every which way and it keeps coming out worse. Not a lot worse, but always worse, even if by a little bit. I'm testing on MLX and on…
x.com/yishan/status/2047538868577239304
- Cited text
Is anyone else finding that Qwen3.6 quality is worse than Qwen3.5? I'm benchmarking it every which way and it keeps coming out worse. Not a lot worse, but always worse, even if by a little bit. I'm testing on MLX and on NVFP/MXFP on Sparks.
- Context
- The regression shows up on both MLX and NVFP/MXFP builds, which points to a training trade-off rather than a quantization issue. When your workload sits at the edge of acceptable quality, you need to know whether the drop is task-specific or systemic. The 27B dense model fits in 17GB and runs at ~25 tokens/sec on a laptop, making it deployable for teams that need on-prem models.
- Key points
- Qwen 3.6 shows consistent but narrow degradation across benchmarks
- Degradation persists on both MLX and NVFP/MXFP quantization formats
- Suggests a training shift rather than a conversion artifact
- Published benchmarks have Qwen 3.6 27B tying Sonnet 4.6 on the Artificial Analysis agentic index
- Provenance
- Tweet · Primary source
-
7
Susan Zhang on DeepSeek V4 training instability
X Susan Zhang — ML researcher known for LLM training analysis
so that explains the delay... deepseek could not fix training instabilities, after doubling from ~15T tokens in v3 to ~33T tokens in v4. the 10+ mentions of "stability" tricks seem to be wildly lacking if these two were…
x.com/suchenzang/status/2047559677316325807
- Cited text
so that explains the delay... deepseek could not fix training instabilities, after doubling from ~15T tokens in v3 to ~33T tokens in v4. the 10+ mentions of "stability" tricks seem to be wildly lacking if these two were the main bandages (mismatched routing + clamping)
- Context
- Exposes the physical limits of scaling MoE models. When routing instability becomes the bottleneck, architectural workarounds leak into prompt behavior. It's not a model weakness per se, but a description of the scaling frontier.
- Key points
- DeepSeek doubled training tokens from ~15T in v3 to ~33T in v4
- Training instabilities went unresolved after the token increase
- Mismatched routing and clamping served as the primary stability bandages
- The delays in release align with these stability challenges
- Engagement
- 1,066 likes · 69 retweets · 18 replies
- Provenance
- Tweet · Primary source
-
8
Endless Toil: Hear your agent suffer
Source AndrewVos — Developer building agentic workflow tools
Endless Toil runs alongside your coding agent in real time, playing escalating recorded human groans as the code it reads starts to look more cursed.
github.com/AndrewVos/endless-toil
- Cited text
Endless Toil runs alongside your coding agent in real time, playing escalating recorded human groans as the code it reads starts to look more cursed.
- Context
- A darkly humorous mirror for the agentic workflow. If your agent generates code that makes you groan, the model's reasoning is outpacing your review capacity. It's a symptom of the auto-review problem: when agents move faster than humans can verify, you need better verification, not just faster agents.
- Key points
- Plugin that plays escalating groans as your agent reads worse code
- Available for Codex Desktop, Codex CLI, Claude CLI, and Cursor
- Tests sounds locally with afplay/paplay/aplay/ffplay
- Highlights the growing gap between model capability and code quality judgment
- Provenance
- Source · Background source
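The list of players in the key points suggests a simple fallback chain across macOS and Linux. The sketch below is a guess at that pattern, not the plugin's code: `PLAYERS`, `pick_player`, and `groan_command` are hypothetical, and the `ffplay` flags are the standard ones for headless playback.

```python
import shutil

# Candidate players named in the key points, with flags that keep them headless.
PLAYERS = {
    "afplay": ["afplay"],                         # macOS
    "paplay": ["paplay"],                         # PulseAudio
    "aplay": ["aplay"],                           # ALSA
    "ffplay": ["ffplay", "-nodisp", "-autoexit"], # FFmpeg
}


def pick_player(which=shutil.which):
    """Return the command prefix for the first player found on PATH."""
    for name, command in PLAYERS.items():
        if which(name):
            return command
    return None


def groan_command(sound_file, which=shutil.which):
    """Build the full command for a sound file, or None if no player exists."""
    command = pick_player(which)
    return None if command is None else [*command, sound_file]
```

Passing `which` as a parameter keeps the lookup testable on machines that have none of these players installed.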
-
9
DeepSeek V4 Preview release thread
X @deepseek_ai — DeepSeek's official account. The Hangzhou lab whose V3/R1 models defined the open-weights frontier in 2025.
DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models. Welcome to the era of cost-effective 1M context length.
x.com/deepseek_ai/status/2047516922263285776
- Cited text
DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models. Welcome to the era of cost-effective 1M context length.
- Context
- V4 lands less than a day after GPT-5.5, open-weights, with 1M context as the default and sparse attention that drops inference cost at long context. For anyone building agent systems on top of third-party models, this is a real price/capability anchor the closed labs now have to price against.
- Key points
- DeepSeek-V4-Pro: 1.6T total / 49B active MoE params, MIT-licensed open weights
- DeepSeek-V4-Flash: 284B total / 13B active, same 1M context
- Novel attention: token-wise compression plus DSA (DeepSeek Sparse Attention)
- 1M context is the default across both models and the API
- Direct integration with Claude Code, OpenClaw, and OpenCode harnesses out of the gate
- Provenance
- Tweet · Primary source
-
10
Susan Zhang on DeepSeek V4 training instabilities
Thread @suchenzang — Susan Zhang — ex-Meta AI, led OPT-175B training; one of the few public practitioners who has actually shepherded a trillion-token run end to end.
DeepSeek could not fix training instabilities after doubling from ~15T tokens in v3 to ~33T tokens in v4. The 10+ mentions of 'stability' tricks seem to be wildly lacking if these two were the main bandages.
x.com/suchenzang/status/2047559677316325807
- Cited text
DeepSeek could not fix training instabilities after doubling from ~15T tokens in v3 to ~33T tokens in v4. The 10+ mentions of 'stability' tricks seem to be wildly lacking if these two were the main bandages.
- Context
- The V4 paper's unusual candor about what broke at 33T tokens is a window into how fragile frontier pretraining actually is. For engineers considering fine-tuning or base-model training, this is evidence that the textbook recipe stops working somewhere in the low tens of trillions of tokens — and nobody has published a clean fix.
- Key points
- DeepSeek doubled pretraining from ~15T tokens (V3) to ~33T tokens (V4)
- Paper admits the main stabilizers were mismatched-routing tricks and logit clamping
- Zhang calls 'anticipatory routing' a euphemism for using stale parameters
- Lucas Beyer (ex-Google Brain) publicly piles on that rewinds-as-stabilization doesn't scale
- Replies note closed labs likely have similar patch lists — they just don't publish them
- Provenance
- Thread · Primary source
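Logit clamping, one of the two stabilizers named above, is simple to state: bound every logit to a fixed range before the softmax so a single runaway value cannot grow without limit. The sketch below shows the general technique only; which logits DeepSeek clamps and at what threshold is not stated in the sources here, so the `limit` value is hypothetical.

```python
import math


def clamp(values, limit):
    """Bound every logit to [-limit, limit]."""
    return [max(-limit, min(limit, v)) for v in values]


def softmax(values):
    # Subtract the maximum first for numerical stability.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


raw = [2.0, 55.0, -40.0]           # one runaway logit dominates
clamped = clamp(raw, limit=30.0)   # hypothetical limit
probs_raw = softmax(raw)
probs_clamped = softmax(clamped)
```

The clamp caps the magnitude of the values flowing into the softmax, which treats the symptom of a divergence (exploding logits) rather than its cause. That is the sense in which Zhang calls it a bandage.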
-
11
Elie Bakouch on V4 architecture details
X @eliebakouch — Elie Bakouch, Hugging Face researcher who writes up open-model tech reports for a living.
V4 Pro is the biggest open model ever: 1.6T total, 49B active, 33T tokens, 1M context, two new attention mechanisms, Muon, mHC, open-source kernels, FP4 QAT, MIT license.
x.com/eliebakouch/status/2047519300399837677
- Cited text
V4 Pro is the biggest open model ever: 1.6T total, 49B active, 33T tokens, 1M context, two new attention mechanisms, Muon, mHC, open-source kernels, FP4 QAT, MIT license.
- Context
- Bakouch's summary is the cleanest one-screen answer to 'what's actually new' in V4. Muon at this scale and shipped FP4 QAT are the two items most likely to cross-pollinate into other labs' next runs.
- Key points
- Largest fully-open-weights model ever released
- Uses Muon optimizer at flagship scale — most labs still use AdamW variants
- FP4 quantization-aware training in the base recipe, not bolted on later
- Open-source custom kernels shipped alongside the weights
- MIT license, so downstream finetunes and deployments have no rug-pull risk
- Provenance
- Tweet · Primary source
-
12
Yuchen Jin on Chinese labs training under constraints
X @Yuchenj_UW — Yuchen Jin, CEO of Hyperbolic Labs; runs inference for a living and watches training efficiency closely.
DeepSeek, Kimi, and Qwen can train very strong LLMs with far fewer and often nerfed NVIDIA GPUs, or even Huawei chips. Creativity loves constraints.
x.com/Yuchenj_UW/status/2047534197993316738
- Cited text
DeepSeek, Kimi, and Qwen can train very strong LLMs with far fewer and often nerfed NVIDIA GPUs, or even Huawei chips. Creativity loves constraints.
- Context
- The Western reading of DeepSeek is usually 'they have fewer chips.' Jin's framing — that the constraint is forcing real architectural work — is more useful if you're trying to predict whether these efficiency tricks show up in Western labs' next generation.
- Key points
- Chinese frontier labs are training under an explicit chip-export ceiling
- V4's novel attention architectures are partly a response to that ceiling
- Efficiency gains on training translate directly to inference unit economics
- The closed US labs' compute advantage is counterweighted by the open labs' algorithmic pressure
- Provenance
- Tweet · Primary source
-
13
OpenAI Codex Auto-review launch
X @OpenAIDevs — OpenAI's developer-facing account.
Auto-review is a new mode that lets Codex work longer with fewer approvals and safer execution. A separate agent checks higher-risk steps in context before they run.
x.com/OpenAIDevs/status/2047436655863464011
- Cited text
Auto-review is a new mode that lets Codex work longer with fewer approvals and safer execution. A separate agent checks higher-risk steps in context before they run.
- Context
- This is the first concrete product move toward the two-agent executor/reviewer pattern as a default, not a bespoke harness. If you're building or buying coding agents, a gated secondary model doing safety review is now table stakes rather than research.
- Key points
- New 'Auto-review' mode sits between YOLO and full-approval
- A separate reviewer agent gates higher-risk steps in context
- Internal name was 'guardian'; some users have been running it for weeks already
- Reduces approval fatigue on long autonomous tasks, tests, builds
- Open question flagged in replies: token overhead of the reviewer agent
- Provenance
- Tweet · Primary source
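The executor/reviewer pattern described above can be sketched as a gate that only consults the reviewer for steps flagged as higher risk. This is an illustration, not Codex's implementation: `Step`, `reviewer_approves`, and the string-matching policy are hypothetical, and a real reviewer would be a second model call with the task context.

```python
from dataclasses import dataclass


@dataclass
class Step:
    command: str
    high_risk: bool


def reviewer_approves(step: Step) -> bool:
    # Placeholder policy; a real reviewer would be a second model call.
    return "rm -rf" not in step.command


def run_with_auto_review(steps):
    """Run low-risk steps directly; send high-risk ones through the reviewer."""
    executed, blocked = [], []
    for step in steps:
        if step.high_risk and not reviewer_approves(step):
            blocked.append(step.command)
            continue
        executed.append(step.command)
    return executed, blocked


plan = [
    Step("pytest -q", high_risk=False),
    Step("git push origin main", high_risk=True),
    Step("rm -rf build/", high_risk=True),
]
executed, blocked = run_with_auto_review(plan)
```

Only two of the three steps trigger a reviewer call, which is where the token-overhead question raised in the replies comes from: the cost scales with how many steps get classified as high risk.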
-
14
Krowork on autonomy vs. recovery in coding agents
X @KroworkAI — Builder account focused on agent UX.
Fewer approval prompts is the right direction but the hard part isn't autonomy length — it's recovery. What happens when the agent goes off-track at step 47?
x.com/KroworkAI/status/2047566505366508023
- Cited text
Fewer approval prompts is the right direction but the hard part isn't autonomy length — it's recovery. What happens when the agent goes off-track at step 47?
- Context
- Good reply-as-argument. The interesting question under Auto-review isn't 'does it save clicks?' — it's 'what do you do when the agent is forty-seven steps deep down a wrong path and the reviewer missed it?' That's the next hard UX problem for coding agents.
- Key points
- Longer autonomous runs surface a new problem: deep-in-the-task recovery
- Approval fatigue and recovery-from-wrong-path are different UX problems
- A reviewer agent helps with the first; it doesn't obviously help with the second
- Provenance
- Tweet · Primary source
-
15
Victor Taelin introduces LamBench and first impressions of GPT-5.5
Thread @VictorTaelin — Victor Taelin, creator of the HVM / Bend / Kind λ-calculus toolchain and one of the sharpest working critics of benchmark contamination.
My first-day impression is that I can't tell the difference between GPT 5.5 and GPT 5.4. I would be lying if I said otherwise. I'd not be able to distinguish in a blind test. It is much faster though.
x.com/VictorTaelin/status/20475088748909734…
- Cited text
My first-day impression is that I can't tell the difference between GPT 5.5 and GPT 5.4. I would be lying if I said otherwise. I'd not be able to distinguish in a blind test. It is much faster though.
- Context
- A same-day independent evaluator on uncontaminated reasoning problems is the single best signal when a major model ships. The headline capability delta between GPT-5.5 and GPT-5.4 on a fresh bench is: it's faster. That should calibrate how you read the OpenAI marketing materials.
- Key points
- LamBench: 120 fresh λ-calculus questions measuring completion, elegance (BLC length), and speed
- Built same-day to stress-test GPT-5.5 against GPT-5.4 on uncontaminated prompts
- Taelin reports no distinguishable quality gap on his test set — GPT-5.5 just faster
- GLM and K2 did noticeably worse than expected
- Benchmark 'born saturated' — V2 will need to be harder
- Provenance
- Thread · Primary source
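The "elegance (BLC length)" metric is easier to picture with a worked encoding. The sketch below assumes the benchmark uses the standard binary lambda calculus encoding over de Bruijn indices (abstraction is `00`, application is `01`, variable `i` is `i` ones followed by a zero); the source only says "BLC length", so that choice and the class names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Var:
    index: int  # de Bruijn index, starting at 1


@dataclass
class Lam:
    body: object


@dataclass
class App:
    func: object
    arg: object


def blc(term) -> str:
    """Encode a lambda term as a binary lambda calculus bit string."""
    if isinstance(term, Var):
        return "1" * term.index + "0"
    if isinstance(term, Lam):
        return "00" + blc(term.body)
    if isinstance(term, App):
        return "01" + blc(term.func) + blc(term.arg)
    raise TypeError(f"not a lambda term: {term!r}")


identity = Lam(Var(1))                                   # λx. x
k_comb = Lam(Lam(Var(2)))                                # λx. λy. x
s_comb = Lam(Lam(Lam(App(App(Var(3), Var(1)),
                         App(Var(2), Var(1))))))         # λx. λy. λz. x z (y z)
```

Under this encoding the identity takes 4 bits, K takes 7, and S takes 23, so a shorter bit string is a direct measure of how compact a submitted solution is.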
-
16
Yishan on Qwen 3.6 27B quality regression
X @yishan — Yishan Wong, former Reddit CEO; now spends real time benchmarking local models on his own hardware.
Is anyone else finding that Qwen3.6 quality is worse than Qwen3.5? I'm benchmarking it every which way and it keeps coming out worse. Not a lot worse, but always worse.
x.com/yishan/status/2047538868577239304
- Cited text
Is anyone else finding that Qwen3.6 quality is worse than Qwen3.5? I'm benchmarking it every which way and it keeps coming out worse. Not a lot worse, but always worse.
- Context
- Yesterday's coverage leaned on Qwen 3.6 tying Sonnet 4.6 on the Artificial Analysis agentic index. Yishan running his own tests and finding the opposite is exactly the countertone that should affect whether you swap the model into production.
- Key points
- Hands-on comparison of Qwen 3.6 vs 3.5 on MLX and on NVFP/MXFP quantization on Sparks
- Finding: 3.6 is consistently but slightly worse across his evals
- Contradicts the published agentic benchmark gains that had Qwen 3.6 27B tying Sonnet 4.6
- Reminder that private-eval regressions can hide under a public benchmark win
- Provenance
- Tweet · Primary source
-
17
Sapiens2 open-weights vision backbone
X @astridwilde1 — Astrid Wilde, independent ML researcher with a focus on open-weights vision models.
Sapiens2 is the highest quality ViT backbone that now exists in the public domain. It was pretrained on the equivalent of 1/2 of all human images on Flickr.
x.com/astridwilde1/status/20475231525621722…
- Cited text
Sapiens2 is the highest quality ViT backbone that now exists in the public domain. It was pretrained on the equivalent of 1/2 of all human images on Flickr.
- Context
- A public-domain ViT backbone at this scale is the kind of release that quietly changes what two-person shops can build — human-centric perception, pose, relighting, try-on — without calling a hosted API. If you're building any product that does something with people in images, this is the new floor.
- Key points
- Sapiens2 pretrained on the equivalent of half of all human-subject images on Flickr
- Highest-quality public-domain ViT backbone currently available
- Non-trivial to replicate — the training corpus scale is the moat
- First time a large lab has released a vision backbone at this scale openly
- Provenance
- Tweet · Primary source
-
18
endless-toil: Hear your agent suffer through your code
Article Andrew Vos — Indie developer, creator of the endless-toil Codex/Claude Code plugin.
Endless Toil runs alongside your coding agent in real time, playing escalating recorded human groans as the code it reads starts to look more cursed.
github.com/AndrewVos/endless-toil
- Cited text
Endless Toil runs alongside your coding agent in real time, playing escalating recorded human groans as the code it reads starts to look more cursed.
- Context
- The joke is only possible because the marketplace, the skills, and the plugin-install UX for both Codex and Claude Code shipped in the last few months. Toy tools like this are a leading indicator of a platform being real.
- Key points
- Plugin for Codex Desktop, Codex CLI, Claude CLI, and Cursor
- Plays real human groans, wails, and 'abyss' sounds as agent scans worse code
- Escalates sonic distress the more cursed the file looks
- Uses the new OpenAI Codex / Claude Code plugin-marketplace infrastructure
- 40 HN points in a few hours — a joke, but one that only works because the plumbing is now real
- Provenance
- Article · Supporting source
-
19
Fireship — I finally found a use case for OpenClaw
Video Fireship (Jeff Delaney) — Jeff Delaney, host of Fireship — the most-watched short-form programming channel.
The project has received over 1,100 security advisories and has resolved or closed about 650 of them. Most of the rest are slop issues. [Steinberger's] filter is: anytime the report is too nice or someone apologizes, it…
www.youtube.com/watch?v=FM5-R4VPArw
- Cited text
The project has received over 1,100 security advisories and has resolved or closed about 650 of them. Most of the rest are slop issues. [Steinberger's] filter is: anytime the report is too nice or someone apologizes, it's very likely AI.
- Context
- Two things to take away: the scale of AI-generated security-report slop hitting real open-source projects, and the shape of the DIY personal-agent stack — a hosted agent runtime plus a voice clone plus a messaging channel — that the toolchain now supports out of the box.
- Key points
- OpenClaw has accumulated 1,100+ security advisories since January; ~650 resolved
- Creator Peter Steinberger spoke at TED and at AI Engineer Europe on the project this month
- Most open advisories are AI-slop reports — Steinberger's tell is excess politeness
- Fireship wires OpenClaw to a Telegram bot, 11Labs voice clone, and ffmpeg to auto-answer family IT questions in his own voice
- Uses the new one-click OpenClaw VPS template as the hosting layer
- Provenance
- Video · Supporting source
-
20
Mario Zechner on pi.dev GPT-5.5 + new login flow
X @badlogicgames — Mario Zechner, creator of libGDX and co-maintainer of pi.dev — small-shop coding agent used by a growing pocket of indie developers.
GPT 5.5 release. And @mitsuhiko has improved the onboarding with a nice new /login flow for both subscriptions and API key authentication.
x.com/badlogicgames/status/2047452871612903…
- Cited text
GPT 5.5 release. And @mitsuhiko has improved the onboarding with a nice new /login flow for both subscriptions and API key authentication.
- Context
- Small, fast-moving agent tools are still viable alongside Codex and Claude Code. pi.dev turning around GPT-5.5 day-of is a data point that the model-integration work isn't a durable moat.
- Key points
- pi.dev shipped GPT-5.5 support within hours of the model release
- Armin Ronacher (mitsuhiko) rewrote /login to handle both subscription and API-key auth paths
- Terminal progress output is now off by default and opt-in, removing a minor annoyance
- Small-shop agent tool keeping pace with the big coding agents on model refreshes
- Provenance
- Tweet · Primary source
-
21
Jason Liu shares a reusable Codex skills pack
X @jxnlco — Jason Liu, author of the instructor Python library and a prolific writer on LLM engineering patterns.
Synced a set of reusable Codex skills into my dots repo: AI code/frontend/writing audits, safe worktree cleanup, exec comms, GitHub PR/CI helpers, Playwright/PDF workflows, and simple HTML artifacts.
x.com/jxnlco/status/2047445737395585386
- Cited text
Synced a set of reusable Codex skills into my dots repo: AI code/frontend/writing audits, safe worktree cleanup, exec comms, GitHub PR/CI helpers, Playwright/PDF workflows, and simple HTML artifacts.
- Context
- The Codex skills format is doing what MCP was supposed to do a year ago — letting a senior engineer publish their working prompts/tools so junior colleagues can clone and use them. Worth reading as a library of 'what skills should exist.'
- Key points
- Open collection of reusable Codex skills dropped into a dotfiles repo
- Categories: audits (code/frontend/writing), git worktree hygiene, exec comms, PR/CI helpers, browser automation
- Install/sync model mirrors what the plugin marketplace encourages
- Another signal the Codex skill format is getting real community use
- Provenance
- Tweet · Primary source