DeepSeek V4 Lands on an Unsteady Floor

1

Aran Komatsuzaki on forked subagents

X Aran Komatsuzaki — Research scientist at Anthropic

Anthropic just introduced forked subagents in their latest update. Unlike regular subagents, forked subagents can inherit the same context as the main agent. This looks convenient for cases where richer context matters…

x.com/arankomatsuzaki/status/20473494718777… →

Details

Cited text: Anthropic just introduced forked subagents in their latest update. Unlike regular subagents, forked subagents can inherit the same context as the main agent. This looks convenient for cases where richer context matters more. This is just what I needed!
Context: Forces the harness to match how senior engineers actually work — with shared state, not flattened prompts. If forked subagents drift without a merge protocol, you get the same fragmentation multi-agent systems create when they don't communicate.
Key points: Forked subagents inherit active context tree, not just a snapshot
Regular subagents get a static context snapshot; forked ones get the live state
Critical for tasks requiring shared state like debugging while refactoring
Closes the gap between agent harnesses and how engineers think about dependencies
Engagement: 802 likes · 68 retweets · 38 replies
Provenance: Tweet · Primary source

2

Jeff Dean on TPU 8t and DiLoCo

X Jeff Dean — Senior Fellow at Google, leads AI infrastructure research

First, let's talk about TPU 8t, which is designed for large-scale training and inference throughput. The pod size is increased slightly to 9600 chips, and provides ~3X the FP4 performance per pod vs. Ironwood (8t has 12…

x.com/JeffDean/status/2047405389856297387 →

Details

Cited text: First, let's talk about TPU 8t, which is designed for large-scale training and inference throughput. The pod size is increased slightly to 9600 chips, and provides ~3X the FP4 performance per pod vs. Ironwood (8t has 121 exaflops/pod vs. 42.5 exaflops/pod for Ironwood).
Context: Infrastructure is starting to handle scale without turning every node fault into a cluster halt. DeepSeek's V4 report showed how much instability can cost a long training run; Decoupled DiLoCo addresses the hardware side of that problem by turning a hard stop into soft degradation, so a single node fault no longer costs the whole run.
Key points: The TPU 8 generation splits into 8t (training/throughput) and 8i (inference/latency) SKUs
8t pod runs 9,600 chips at 121 exaflops FP4, roughly 3X Ironwood
Decoupled DiLoCo enables graceful failure handling at scale
(N-1)/N units proceed when one node fails, logging drift and patching state
Engagement: 112 likes · 6 retweets · 4 replies
Provenance: Tweet · Primary source
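
A quick arithmetic check on the figures quoted above, as a minimal sketch. The only inputs are the numbers in the tweet (121 and 42.5 exaflops per pod, 9,600 chips); the function names are illustrative, and the 8-unit example for the (N-1)/N point is an assumed value, not something the source states.

```python
# Figures quoted in the tweet above; nothing else is assumed about the hardware.
TPU_8T_POD_EXAFLOPS = 121.0
IRONWOOD_POD_EXAFLOPS = 42.5
TPU_8T_CHIPS_PER_POD = 9600


def pod_speedup(new_exaflops: float, old_exaflops: float) -> float:
    """Ratio of per-pod throughput between two generations."""
    return new_exaflops / old_exaflops


def per_chip_petaflops(pod_exaflops: float, chips: int) -> float:
    """Average per-chip throughput implied by a pod figure (1 exaflop = 1000 petaflops)."""
    return pod_exaflops * 1000 / chips


def surviving_fraction(n_units: int, failed: int = 1) -> float:
    """Share of training units that keep going when `failed` units drop out."""
    return (n_units - failed) / n_units


print(round(pod_speedup(TPU_8T_POD_EXAFLOPS, IRONWOOD_POD_EXAFLOPS), 2))       # 2.85
print(round(per_chip_petaflops(TPU_8T_POD_EXAFLOPS, TPU_8T_CHIPS_PER_POD), 1))  # 12.6
print(surviving_fraction(8))  # 0.875, with a hypothetical 8 units and one failure
```

The "roughly 3X" in the tweet works out to about 2.85X on the quoted pod numbers.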

3

Claude Managed Agents Memory public beta

X ClaudeDevs — Anthropic Claude developers account

Memory on Claude Managed Agents is now in public beta on the Claude Platform, letting agents learn and improve across different sessions.

x.com/ClaudeDevs/status/2047424063543681240 →

Details

Cited text: Memory on Claude Managed Agents is now in public beta on the Claude Platform, letting agents learn and improve across different sessions.
Context: Stops agents from re-reading the same codebase from scratch every morning. Persistence turns agents from stateless query tools into working models of your repo, reducing context window waste on repetition and freeing it for actual reasoning.
Key points: Persistent memory is now in public beta for managed agents
Agents can now learn and improve across separate sessions
Closes a major gap in agent harnesses that previously reset context each run
Pairs with recent forked subagents update for richer state inheritance
Engagement: 1790 likes · 156 retweets · 32 replies
Provenance: Tweet · Primary source

4

OpenAI Developers on Codex Auto-Review

X OpenAI Developers — Official OpenAI developers account

Auto-review is a new mode that lets Codex work longer with fewer approvals and safer execution. It helps Codex keep moving through tests, builds, and more, including during long tasks and automations, while a separate a…

x.com/OpenAIDevs/status/2047436655863464011 →

Details

Cited text: Auto-review is a new mode that lets Codex work longer with fewer approvals and safer execution. It helps Codex keep moving through tests, builds, and more, including during long tasks and automations, while a separate agent checks higher-risk steps in context before they run.
Context: Changes how teams structure long-running agentic tasks. Instead of treating human review as a bottleneck, the verification agent acts as a safety net, letting the primary agent push through work while catching genuine risks. This is a structural shift in agent deployment, not just a model improvement.
Key points: Codex agents now run longer with fewer human approvals
A separate verification agent checks higher-risk steps before execution
Shifts agentic workflows from sequential approval to parallel execution
Enables complex multi-step automations without manual intervention
Provenance: Tweet · Primary source

5

DeepSeek-V4 Preview open-sourced

X DeepSeek — DeepSeek official account

DeepSeek-V4-Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length. DeepSeek-V4-Pro: 1.6T total / 49B active params. DeepSeek-V4-Flash: 284B total / 13B active params.

x.com/deepseek_ai/status/2047516922263285776 →

Details

Cited text: DeepSeek-V4-Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length. DeepSeek-V4-Pro: 1.6T total / 49B active params. DeepSeek-V4-Flash: 284B total / 13B active params.
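Context: The open weights let independent researchers and teams separate capability from training scale. Flash activates only 13B parameters per token, which cuts per-token compute, but all 284B weights still have to be stored: roughly 142GB at 4 bits per weight, well beyond a single 40GB GPU without offloading. Even so, open weights widen the deployment surface for specialized agents that previously required API access.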
Key points: V4 Pro: 1.6T total parameters, 49B active, rivals top closed models
V4 Flash: 284B total, 13B active, optimized for speed and cost
Both variants support 1M context length
Open weights available on HuggingFace with full technical report
Engagement: 30144 likes · 7138 retweets · 1156 replies
Provenance: Tweet · Primary source
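
A rough way to size these checkpoints, as a sketch under stated assumptions. In a mixture-of-experts model the active parameter count drives per-token compute, while the total parameter count drives how much weight storage is needed. The bits-per-weight values below are assumptions chosen for illustration (the announcement does not state the checkpoint precision), and the estimate ignores KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB,
    which simplifies to params_billion * bits / 8.
    """
    return params_billion * bits_per_weight / 8


# Parameter counts from the announcement; precisions are assumed.
MODELS = {
    "V4-Pro (total)": 1600.0,
    "V4-Flash (total)": 284.0,
    "V4-Flash (active only)": 13.0,
}

for name, params in MODELS.items():
    for bits in (4, 8):
        print(f"{name}: ~{weight_memory_gb(params, bits):.1f} GB at {bits}-bit")
```

At an assumed 4 bits per weight, Flash's full 284B weights come to about 142 GB, while the 13B active slice alone would be about 6.5 GB.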

6

Yishan on Qwen 3.6 quality report

X Yishan — ML engineer benchmarking local models

Is anyone else finding that Qwen3.6 quality is worse than Qwen3.5? I'm benchmarking it every which way and it keeps coming out worse. Not a lot worse, but always worse, even if by a little bit. I'm testing on MLX and on…

x.com/yishan/status/2047538868577239304 →

Details

Cited text: Is anyone else finding that Qwen3.6 quality is worse than Qwen3.5? I'm benchmarking it every which way and it keeps coming out worse. Not a lot worse, but always worse, even if by a little bit. I'm testing on MLX and on NVFP/MXFP on Sparks.
Context: The regression more likely reflects a training trade-off than a quantization issue, since it shows up across runtimes and formats. When your workload sits at the edge of quality, you need to know whether the drop is task-specific or systemic. The 27B dense model fits in 17GB and runs at ~25 tokens/sec on a laptop, making it deployable for teams that need on-prem models.
Key points: Qwen 3.6 shows consistent but narrow degradation across benchmarks
Degradation persists on MLX and on NVFP/MXFP quantized builds
Suggests a training shift rather than a conversion artifact
Sits at a tight benchmark position against Sonnet 4.6 on Agentic Index
Provenance: Tweet · Primary source

7

Susan Zhang on DeepSeek V4 training instability

X Susan Zhang — ML researcher known for LLM training analysis

so that explains the delay... deepseek could not fix training instabilities, after doubling from ~15T tokens in v3 to ~33T tokens in v4. the 10+ mentions of "stability" tricks seem to be wildly lacking if these two were…

x.com/suchenzang/status/2047559677316325807 →

Details

Cited text: so that explains the delay... deepseek could not fix training instabilities, after doubling from ~15T tokens in v3 to ~33T tokens in v4. the 10+ mentions of "stability" tricks seem to be wildly lacking if these two were the main bandages (mismatched routing + clamping)
Context: Exposes the physical limits of scaling MoE models. When routing instability becomes the bottleneck, architectural workarounds leak into prompt behavior. It's not a model weakness per se, but a description of the scaling frontier.
Key points: DeepSeek doubled training tokens from ~15T in v3 to ~33T in v4
Training instabilities remained unresolved as the token count doubled
Mismatched routing and clamping served as the primary stability bandages
The delays in release align with these stability challenges
Engagement: 1066 likes · 69 retweets · 18 replies
Provenance: Tweet · Primary source

8

Endless Toil: Hear your agent suffer

Source AndrewVos — Developer building agentic workflow tools

Endless Toil runs alongside your coding agent in real time, playing escalating recorded human groans as the code it reads starts to look more cursed.

github.com/AndrewVos/endless-toil →

Details

Cited text: Endless Toil runs alongside your coding agent in real time, playing escalating recorded human groans as the code it reads starts to look more cursed.
Context: A darkly humorous mirror for the agentic workflow. If your agent generates code that makes you groan, the model's reasoning is outpacing your review capacity. It's a symptom of the auto-review problem: when agents move faster than humans can verify, you need better verification, not just faster agents.
Key points: Plugin that plays escalating groans as your agent reads worse code
Available for Codex Desktop, Codex CLI, Claude CLI, and Cursor
Tests sounds locally with afplay/paplay/aplay/ffplay
Highlights the growing gap between model capability and code quality judgment
Provenance: Source · Background source
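
The list of players above suggests a simple "use whichever audio command is installed" approach. The sketch below shows that idea and is not taken from the plugin's code: the lookup order, the `find_player` and `groan_level` names, and the 0.0 to 1.0 "cursedness" score are all hypothetical.

```python
import shutil
from typing import Optional, Sequence

# Player names come from the item above; the lookup order is an assumption.
CANDIDATE_PLAYERS = ("afplay", "paplay", "aplay", "ffplay")


def find_player(candidates: Sequence[str] = CANDIDATE_PLAYERS) -> Optional[str]:
    """Return the first candidate command found on PATH, or None if none is installed."""
    for name in candidates:
        if shutil.which(name):
            return name
    return None


def groan_level(cursedness: float, levels: int = 5) -> int:
    """Map a hypothetical 0.0-1.0 'cursedness' score to an escalation level 0..levels-1."""
    clamped = min(max(cursedness, 0.0), 1.0)
    return min(int(clamped * levels), levels - 1)


player = find_player()
print(player if player else "no supported audio player found")
print(groan_level(0.5))  # middle of the scale
```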

9

DeepSeek V4 Preview release thread

X @deepseek_ai — DeepSeek's official account. The Hangzhou lab whose V3/R1 models defined the open-weights frontier in 2025.

DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models. Welcome to the era of cost-effective 1M context length.

x.com/deepseek_ai/status/2047516922263285776 →

Details

Cited text: DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models. Welcome to the era of cost-effective 1M context length.
Context: V4 lands less than a day after GPT-5.5, open-weights, with 1M context as the default and sparse attention that drops inference cost at long context. For anyone building agent systems on top of third-party models, this is a real price/capability anchor the closed labs now have to price against.
Key points: DeepSeek-V4-Pro: 1.6T total / 49B active MoE params, MIT-licensed open weights
DeepSeek-V4-Flash: 284B total / 13B active, same 1M context
Novel attention: token-wise compression plus DSA (DeepSeek Sparse Attention)
1M context is the default across both models and the API
Direct integration with Claude Code, OpenClaw, and OpenCode harnesses out of the gate
Provenance: Tweet · Primary source

10

Susan Zhang on DeepSeek V4 training instabilities

Thread @suchenzang — Susan Zhang — ex-Meta AI, led OPT-175B training; one of the few public practitioners who has actually shepherded a trillion-token run end to end.

DeepSeek could not fix training instabilities after doubling from ~15T tokens in v3 to ~33T tokens in v4. The 10+ mentions of 'stability' tricks seem to be wildly lacking if these two were the main bandages.

x.com/suchenzang/status/2047559677316325807 →

Details

Cited text: DeepSeek could not fix training instabilities after doubling from ~15T tokens in v3 to ~33T tokens in v4. The 10+ mentions of 'stability' tricks seem to be wildly lacking if these two were the main bandages.
Context: The V4 paper's unusual candor about what broke at 33T tokens is a window into how fragile frontier pretraining actually is. For engineers considering fine-tuning or base-model training, this is evidence that the textbook recipe stops working somewhere in the low tens of trillions of tokens — and nobody has published a clean fix.
Key points: DeepSeek doubled pretraining from ~15T tokens (V3) to ~33T tokens (V4)
Paper admits the main stabilizers were mismatched-routing tricks and logit clamping
Zhang calls 'anticipatory routing' a euphemism for using stale parameters
Lucas Beyer (ex-Google Brain) publicly agrees, adding that rewinds-as-stabilization doesn't scale
Replies note closed labs likely have similar patch lists — they just don't publish them
Provenance: Thread · Primary source
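
For readers unfamiliar with clamping as a stabilizer, here is a generic sketch of the idea: bound logits to a fixed range before the softmax so that a single extreme value cannot saturate the distribution. This is not DeepSeek's recipe. The source does not say which logits are clamped or what limit is used, so the `limit` value and the example logits are assumptions.

```python
import math
from typing import List


def clamp(values: List[float], limit: float) -> List[float]:
    """Bound every value to the range [-limit, limit]."""
    return [max(-limit, min(limit, v)) for v in values]


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax (subtracts the max before exponentiating)."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


raw_logits = [2.0, 50.0, -40.0]          # illustrative values with one outlier
clamped_logits = clamp(raw_logits, limit=10.0)

print(clamped_logits)                     # [2.0, 10.0, -10.0]
print(softmax(raw_logits)[1])             # effectively all mass on the outlier
print(softmax(clamped_logits)[1])         # slightly less concentrated after clamping
```

Clamping hides the outlier rather than removing its cause, which is consistent with the thread's description of it as a bandage.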

11

Elie Bakouch on V4 architecture details

X @eliebakouch — Elie Bakouch, Hugging Face researcher who writes up open-model tech reports for a living.

V4 Pro is the biggest open model ever: 1.6T total, 49B active, 33T tokens, 1M context, two new attention mechanisms, Muon, mHC, open-source kernels, FP4 QAT, MIT license.

x.com/eliebakouch/status/2047519300399837677 →

Details

Cited text: V4 Pro is the biggest open model ever: 1.6T total, 49B active, 33T tokens, 1M context, two new attention mechanisms, Muon, mHC, open-source kernels, FP4 QAT, MIT license.
Context: Bakouch's summary is the cleanest one-screen answer to 'what's actually new' in V4. Muon at this scale and shipped FP4 QAT are the two items most likely to cross-pollinate into other labs' next runs.
Key points: Largest fully-open-weights model ever released
Uses Muon optimizer at flagship scale — most labs still use AdamW variants
FP4 quantization-aware training in the base recipe, not bolted on later
Open-source custom kernels shipped alongside the weights
MIT license, so downstream finetunes and deployments have no rug-pull risk
Provenance: Tweet · Primary source

12

Yuchen Jin on Chinese labs training under constraints

X @Yuchenj_UW — Yuchen Jin, CEO of Hyperbolic Labs; runs inference for a living and watches training efficiency closely.

DeepSeek, Kimi, and Qwen can train very strong LLMs with far fewer and often nerfed NVIDIA GPUs, or even Huawei chips. Creativity loves constraints.

x.com/Yuchenj_UW/status/2047534197993316738 →

Details

Cited text: DeepSeek, Kimi, and Qwen can train very strong LLMs with far fewer and often nerfed NVIDIA GPUs, or even Huawei chips. Creativity loves constraints.
Context: The Western reading of DeepSeek is usually 'they have fewer chips.' Jin's framing — that the constraint is forcing real architectural work — is more useful if you're trying to predict whether these efficiency tricks show up in Western labs' next generation.
Key points: Chinese frontier labs are training under an explicit chip-export ceiling
V4's novel attention architectures are partly a response to that ceiling
Efficiency gains on training translate directly to inference unit economics
The closed US labs' compute advantage is counterweighted by the open labs' algorithmic pressure
Provenance: Tweet · Primary source

13

OpenAI Codex Auto-review launch

X @OpenAIDevs — OpenAI's developer-facing account.

Auto-review is a new mode that lets Codex work longer with fewer approvals and safer execution. A separate agent checks higher-risk steps in context before they run.

x.com/OpenAIDevs/status/2047436655863464011 →

Details

Cited text: Auto-review is a new mode that lets Codex work longer with fewer approvals and safer execution. A separate agent checks higher-risk steps in context before they run.
Context: This is the first concrete product move toward the two-agent executor/reviewer pattern as a default, not a bespoke harness. If you're building or buying coding agents, a gated secondary model doing safety review is now table stakes rather than research.
Key points: New 'Auto-review' mode sits between YOLO and full-approval
A separate reviewer agent gates higher-risk steps in context
Internal name was 'guardian'; some users have been running it for weeks already
Reduces approval fatigue on long autonomous tasks, tests, builds
Open question flagged in replies: token overhead of the reviewer agent
Provenance: Tweet · Primary source
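
The executor/reviewer pattern described above can be sketched in a few lines. This is an illustration of the general shape, not Codex's implementation: the `Step` type, the numeric risk score, the threshold, and the toy reviewer rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    command: str
    risk: float  # hypothetical 0.0-1.0 risk score assigned by the harness


def run_with_review(
    steps: List[Step],
    reviewer: Callable[[Step], bool],
    risk_threshold: float = 0.5,
) -> List[str]:
    """Run low-risk steps directly; send higher-risk steps to the reviewer first.

    Stops at the first step the reviewer rejects and returns the commands that ran.
    """
    executed: List[str] = []
    for step in steps:
        if step.risk >= risk_threshold and not reviewer(step):
            break
        executed.append(step.command)  # a real harness would execute the command here
    return executed


def toy_reviewer(step: Step) -> bool:
    """Stand-in for the second agent: approve anything that is not a recursive delete."""
    return "rm -rf" not in step.command


plan = [
    Step("pytest", 0.1),
    Step("git push", 0.6),
    Step("rm -rf build", 0.9),
    Step("make", 0.1),
]
print(run_with_review(plan, toy_reviewer))  # ['pytest', 'git push']
```

Only steps at or above the threshold reach the reviewer, which is one way to limit the token overhead raised in the replies.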

14

Krowork on autonomy vs. recovery in coding agents

X @KroworkAI — Builder account focused on agent UX.

Fewer approval prompts is the right direction but the hard part isn't autonomy length — it's recovery. What happens when the agent goes off-track at step 47?

x.com/KroworkAI/status/2047566505366508023 →

Details

Cited text: Fewer approval prompts is the right direction but the hard part isn't autonomy length — it's recovery. What happens when the agent goes off-track at step 47?
Context: Good reply-as-argument. The interesting question under Auto-review isn't 'does it save clicks?' — it's 'what do you do when the agent is forty-seven steps deep down a wrong path and the reviewer missed it?' That's the next hard UX problem for coding agents.
Key points: Longer autonomous runs surface a new problem: deep-in-the-task recovery
Approval fatigue and recovery-from-wrong-path are different UX problems
A reviewer agent helps with the first; it doesn't obviously help with the second
Provenance: Tweet · Primary source
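
One common answer to the recovery question is periodic checkpoints with rollback. The sketch below is a minimal illustration with hypothetical names and a toy state. It assumes the failure is detected as soon as it happens, which is the easy case; the scenario in the tweet, where a bad step goes unnoticed, is harder.

```python
import copy
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]


def run_with_checkpoints(
    state: State,
    actions: List[Callable[[State], None]],
    is_healthy: Callable[[State], bool],
    checkpoint_every: int = 10,
) -> Tuple[State, int]:
    """Apply actions in order, snapshotting the state every `checkpoint_every` steps.

    If the state becomes unhealthy, return the last snapshot and the step it was
    taken at. Otherwise return the final state and the total number of steps.
    """
    checkpoint = copy.deepcopy(state)
    checkpoint_step = 0
    for i, action in enumerate(actions, start=1):
        action(state)
        if not is_healthy(state):
            return checkpoint, checkpoint_step
        if i % checkpoint_every == 0:
            checkpoint = copy.deepcopy(state)
            checkpoint_step = i
    return state, len(actions)


def increment(s: State) -> None:
    s["count"] += 1


def corrupt(s: State) -> None:
    s["broken"] = True


# 46 good steps, a bad step 47, then 13 steps that never run.
actions = [increment] * 46 + [corrupt] + [increment] * 13
final_state, kept_step = run_with_checkpoints(
    {"count": 0}, actions, lambda s: not s.get("broken", False)
)
print(final_state, kept_step)  # {'count': 40} 40
```

The six steps between the last checkpoint and the failure are lost, which is the trade-off behind choosing the checkpoint interval.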

15

Victor Taelin introduces LamBench and first impressions of GPT-5.5

Thread @VictorTaelin — Victor Taelin, creator of the HVM / Bend / Kind λ-calculus toolchain and one of the sharpest working critics of benchmark contamination.

My first-day impression is that I can't tell the difference between GPT 5.5 and GPT 5.4. I would be lying if I said otherwise. I'd not be able to distinguish in a blind test. It is much faster though.

x.com/VictorTaelin/status/20475088748909734… →

Details

Cited text: My first-day impression is that I can't tell the difference between GPT 5.5 and GPT 5.4. I would be lying if I said otherwise. I'd not be able to distinguish in a blind test. It is much faster though.
Context: A same-day independent evaluator on uncontaminated reasoning problems is the single best signal when a major model ships. The headline capability delta between GPT-5.5 and GPT-5.4 on a fresh bench is: it's faster. That should calibrate how you read the OpenAI marketing materials.
Key points: LamBench: 120 fresh λ-calculus questions measuring completion, elegance (BLC length), and speed
Built same-day to stress-test GPT-5.5 against GPT-5.4 on uncontaminated prompts
Taelin reports no distinguishable quality gap on his test set — GPT-5.5 just faster
GLM and K2 did noticeably worse than expected
Benchmark 'born saturated' — V2 will need to be harder
Provenance: Thread · Primary source
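
For context on "elegance (BLC length)": binary lambda calculus encodes a term in de Bruijn notation as a bit string, with `00` for an abstraction, `01` for an application, and a variable with index n as n ones followed by a zero. Shorter encodings mean smaller programs. The sketch below implements that standard encoding; how LamBench itself scores answers is not described in the source, so treat the link to its metric as an assumption.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Var:
    index: int  # de Bruijn index, starting at 1


@dataclass(frozen=True)
class Lam:
    body: "Term"


@dataclass(frozen=True)
class App:
    fn: "Term"
    arg: "Term"


Term = Union[Var, Lam, App]


def blc_encode(term: Term) -> str:
    """Encode a de Bruijn lambda term as a binary lambda calculus bit string."""
    if isinstance(term, Var):
        return "1" * term.index + "0"
    if isinstance(term, Lam):
        return "00" + blc_encode(term.body)
    if isinstance(term, App):
        return "01" + blc_encode(term.fn) + blc_encode(term.arg)
    raise TypeError(f"not a lambda term: {term!r}")


IDENTITY = Lam(Var(1))                       # λx. x
K = Lam(Lam(Var(2)))                         # λx. λy. x
S = Lam(Lam(Lam(App(App(Var(3), Var(1)), App(Var(2), Var(1))))))  # λx. λy. λz. x z (y z)

for name, term in (("I", IDENTITY), ("K", K), ("S", S)):
    bits = blc_encode(term)
    print(name, bits, len(bits))
```

The three combinators encode to 4, 7, and 23 bits respectively.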

16

Yishan on Qwen 3.6 27B quality regression

X @yishan — Yishan Wong, former Reddit CEO; now spends real time benchmarking local models on his own hardware.

Is anyone else finding that Qwen3.6 quality is worse than Qwen3.5? I'm benchmarking it every which way and it keeps coming out worse. Not a lot worse, but always worse.

x.com/yishan/status/2047538868577239304 →

Details

Cited text: Is anyone else finding that Qwen3.6 quality is worse than Qwen3.5? I'm benchmarking it every which way and it keeps coming out worse. Not a lot worse, but always worse.
Context: Yesterday's coverage leaned on Qwen 3.6 tying Sonnet 4.6 on the Artificial Analysis agentic index. Yishan running his own tests and finding the opposite is exactly the counterpoint that should inform whether you swap the model into production.
Key points: Hands-on comparison of Qwen 3.6 vs 3.5 on MLX and on NVFP/MXFP quantized builds running on Sparks
Finding: 3.6 is consistently but slightly worse across his evals
Contradicts the published agentic benchmark gains that had Qwen 3.6 27B tying Sonnet 4.6
Reminder that private-eval regressions can hide under a public benchmark win
Provenance: Tweet · Primary source

17

Sapiens2 open-weights vision backbone

X @astridwilde1 — Astrid Wilde, independent ML researcher with a focus on open-weights vision models.

Sapiens2 is the highest quality ViT backbone that now exists in the public domain. It was pretrained on the equivalent of 1/2 of all human images on Flickr.

x.com/astridwilde1/status/20475231525621722… →

Details

Cited text: Sapiens2 is the highest quality ViT backbone that now exists in the public domain. It was pretrained on the equivalent of 1/2 of all human images on Flickr.
Context: A public-domain ViT backbone at this scale is the kind of release that quietly changes what two-person shops can build — human-centric perception, pose, relighting, try-on — without calling a hosted API. If you're building any product that does something with people in images, this is the new floor.
Key points: Sapiens2 pretrained on the equivalent of half of all human-subject images on Flickr
Highest-quality public-domain ViT backbone currently available
Non-trivial to replicate — the training corpus scale is the moat
First time a large lab has released a vision backbone at this scale openly
Provenance: Tweet · Primary source

18

endless-toil: Hear your agent suffer through your code

Article Andrew Vos — Indie developer, creator of the endless-toil Codex/Claude Code plugin.

Endless Toil runs alongside your coding agent in real time, playing escalating recorded human groans as the code it reads starts to look more cursed.

github.com/AndrewVos/endless-toil →

Details

Cited text: Endless Toil runs alongside your coding agent in real time, playing escalating recorded human groans as the code it reads starts to look more cursed.
Context: The joke is only possible because the marketplace, the skills, and the plugin-install UX for both Codex and Claude Code shipped in the last few months. Toy tools like this are a leading indicator of a platform being real.
Key points: Plugin for Codex Desktop, Codex CLI, Claude CLI, and Cursor
Plays real human groans, wails, and 'abyss' sounds as agent scans worse code
Escalates sonic distress the more cursed the file looks
Uses the new OpenAI Codex / Claude Code plugin-marketplace infrastructure
40 HN points in a few hours — a joke, but one that only works because the plumbing is now real
Provenance: Article · Supporting source

19

Fireship — I finally found a use case for OpenClaw

Video Fireship — Jeff Delaney, host of Fireship, the most-watched short-form programming channel.

The project has received over 1,100 security advisories and has resolved or closed about 650 of them. Most of the rest are slop issues. [Steinberger's] filter is: anytime the report is too nice or someone apologizes, it…

www.youtube.com/watch?v=FM5-R4VPArw →

Details

Cited text: The project has received over 1,100 security advisories and has resolved or closed about 650 of them. Most of the rest are slop issues. [Steinberger's] filter is: anytime the report is too nice or someone apologizes, it's very likely AI.
Context: Two things to take away: the scale of AI-generated security-report slop hitting real open-source projects, and the shape of the DIY personal-agent stack — a hosted agent runtime plus a voice clone plus a messaging channel — that the toolchain now supports out of the box.
Key points: OpenClaw has accumulated 1,100+ security advisories since January; ~650 resolved
Creator Peter Steinberger spoke at TED and at AI Engineer Europe on the project this month
Most open advisories are AI-slop reports — Steinberger's tell is excess politeness
Fireship wires OpenClaw to a Telegram bot, an ElevenLabs voice clone, and ffmpeg to auto-answer family IT questions in his own voice
Uses the new one-click OpenClaw VPS template as the hosting layer
Provenance: Video · Supporting source

20

Mario Zechner on pi.dev GPT-5.5 + new login flow

X @badlogicgames — Mario Zechner, creator of libGDX and co-maintainer of pi.dev — small-shop coding agent used by a growing pocket of indie developers.

GPT 5.5 release. And @mitsuhiko has improved the onboarding with a nice new /login flow for both subscriptions and API key authentication.

x.com/badlogicgames/status/2047452871612903… →

Details

Cited text: GPT 5.5 release. And @mitsuhiko has improved the onboarding with a nice new /login flow for both subscriptions and API key authentication.
Context: Small, fast-moving agent tools are still viable alongside Codex and Claude Code. pi.dev turning around GPT-5.5 day-of is a data point that the model-integration work isn't a durable moat.
Key points: pi.dev shipped GPT-5.5 support within hours of the model release
Armin Ronacher (mitsuhiko) rewrote /login to handle both subscription and API-key auth paths
Terminal progress output, a minor annoyance for some users, is now off by default and opt-in
Small-shop agent tool keeping pace with the big coding agents on model refreshes
Provenance: Tweet · Primary source

21

Jason Liu shares a reusable Codex skills pack

X @jxnlco — Jason Liu, author of the instructor Python library and a prolific writer on LLM engineering patterns.

Synced a set of reusable Codex skills into my dots repo: AI code/frontend/writing audits, safe worktree cleanup, exec comms, GitHub PR/CI helpers, Playwright/PDF workflows, and simple HTML artifacts.

x.com/jxnlco/status/2047445737395585386 →

Details

Cited text: Synced a set of reusable Codex skills into my dots repo: AI code/frontend/writing audits, safe worktree cleanup, exec comms, GitHub PR/CI helpers, Playwright/PDF workflows, and simple HTML artifacts.
Context: The Codex skills format is doing what MCP was supposed to do a year ago — letting a senior engineer publish their working prompts/tools so junior colleagues can clone and use them. Worth reading as a library of 'what skills should exist.'
Key points: Open collection of reusable Codex skills dropped into a dotfiles repo
Categories: audits (code/frontend/writing), git worktree hygiene, exec comms, PR/CI helpers, browser automation
Install/sync model mirrors what the plugin marketplace encourages
Another signal the Codex skill format is getting real community use
Provenance: Tweet · Primary source