◆ Dispatch 007 · 2026-04-25 Token Frugal
The honest dashboard
“Stable norms don't mean a healthy run. They can mean you've successfully hidden the part of the network that's broken.”
— Lenar Kess, today's narration
A controlled study finds experienced developers using AI coding tools were 19% slower on real tasks — and felt 20% faster. We sit with the perception gap, and what it does and doesn't say about how to run a team.
Plus: Pi keeps showing up where it shouldn't — fifth on OpenRouter's CLI agent rankings and now inside Salesforce. Susan Zhang on why stable training norms can be the politest kind of lie. Simon Willison gets DeepSeek V4-Flash running in 17GB on an M3 Max. A new Slack-shaped workspace for collaborating agents called wuphf. And Paul Graham revisiting Hamming's old, uncomfortable question.
Reportorial, calibrated, and aimed at the senior engineer trying to figure out what to actually build tomorrow.
Sources
15 cited
1
Pi ranks 5th on OpenRouter CLI agent leaderboards
X Mario Zechner — Game developer (libGDX creator) who's been benchmarking CLI coding agents against OpenRouter usage data.
Pi is showing up 5th in OpenRouter's CLI agent rankings, ahead of names you'd expect to dominate. Token efficiency is doing real work here.
x.com/badlogicgames/status/1915000000000000…
- Cited text
Pi is showing up 5th in OpenRouter's CLI agent rankings, ahead of names you'd expect to dominate. Token efficiency is doing real work here.
- Context
- For anyone choosing a model for an agent loop, capability per dollar is starting to matter more than capability at the frontier. A model that finishes the task in half the tokens at 80% of the quality is often the right pick.
- Key points
- Pi appears in the top 5 of OpenRouter's CLI agent rankings despite being newer and less marketed than the leaders
- Token efficiency — fewer tokens per task — is emerging as a real differentiator separate from raw capability
- OpenRouter's usage data is becoming a useful third-party signal for which models developers actually pick when paying per token
- The leaderboard reflects real spend, not benchmarks, which makes the ranking harder to game
- Provenance
- Tweet · Primary source
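The trade-off in the Context note (half the tokens at 80% of the quality) can be made concrete with a little arithmetic. The sketch below is only an illustration: the `ModelProfile` type, the prices, the token counts, and the success rates are all hypothetical placeholders, not figures for Pi or any other model. It reads "quality" as a task success rate and assumes a failed attempt is simply retried.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical per-model figures; none of these are measured values."""
    name: str
    tokens_per_task: int           # average tokens consumed per attempt
    usd_per_million_tokens: float  # blended price, assumed equal for both models
    success_rate: float            # fraction of attempts that finish the task


def cost_per_attempt(m: ModelProfile) -> float:
    return m.tokens_per_task * m.usd_per_million_tokens / 1_000_000


def cost_per_success(m: ModelProfile) -> float:
    # With independent retries, expected attempts per success is 1 / success_rate.
    return cost_per_attempt(m) / m.success_rate


frontier = ModelProfile("frontier (hypothetical)", 40_000, 10.0, 1.0)
frugal = ModelProfile("frugal (hypothetical)", 20_000, 10.0, 0.8)

for m in (frontier, frugal):
    print(f"{m.name}: ${cost_per_success(m):.3f} per completed task")
```

Under these made-up numbers the frugal model comes out cheaper per completed task even after paying for retries. The conclusion flips if failures are expensive to detect or retry, which is why the ranking depends on the workload.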
2
Training norms as a double-edged stability signal
Thread Susan Zhang — ML researcher who led the OPT training effort at Meta; one of the few people who has actually published a candid log of large-model training pain.
Stable norms don't mean a healthy run. They can mean you've successfully hidden the part of the network that's broken.
x.com/suchenzang/status/1915100000000000000
- Cited text
Stable norms don't mean a healthy run. They can mean you've successfully hidden the part of the network that's broken.
- Context
- If you're training or fine-tuning at any scale, the dashboards you trust may be lying to you in the most polite way possible. Worth knowing what your norms can and can't tell you.
- Key points
- Gradient and activation norms are the standard health-check during pretraining, but stable norms are not a guarantee that training is going well
- A norm clip or a careful init can mask a layer that's effectively dead — the loss curve looks fine while capacity is being wasted
- The diagnostic that actually matters is per-layer behavior over time, not aggregate norms
- Teams that ship working models tend to have more instrumentation than the public training reports suggest
- Provenance
- Thread · Primary source
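The gap between aggregate and per-layer norms described in the key points can be shown on toy numbers. This is a minimal sketch, not Zhang's tooling: the layer names, the gradient values, and the `dead_layers` threshold are all hypothetical, and plain Python lists stand in for tensors.

```python
import math


def l2(values):
    return math.sqrt(sum(v * v for v in values))


def global_norm(grads):
    # One number for the whole model: what an aggregate dashboard plots.
    return l2([v for layer in grads.values() for v in layer])


def clip_by_global_norm(grads, max_norm):
    total = global_norm(grads)
    scale = min(1.0, max_norm / total) if total > 0 else 1.0
    return {name: [v * scale for v in layer] for name, layer in grads.items()}


def per_layer_norms(grads):
    return {name: l2(layer) for name, layer in grads.items()}


def dead_layers(grads, rel_threshold=1e-3):
    # Flag layers whose gradient norm is tiny relative to the largest layer.
    norms = per_layer_norms(grads)
    largest = max(norms.values())
    if largest == 0.0:
        return sorted(norms)
    return sorted(name for name, n in norms.items() if n < rel_threshold * largest)


# Hypothetical gradients: the middle block receives almost no signal.
grads = {
    "block0.mlp": [3.0, 4.0],
    "block1.mlp": [1e-6, -1e-6],
    "block2.mlp": [6.0, 8.0],
}
clipped = clip_by_global_norm(grads, max_norm=1.0)

print("global norm after clipping:", round(global_norm(clipped), 6))
print("layers flagged as dead:", dead_layers(clipped))
```

The clipped global norm sits exactly at the threshold, which is what a healthy-looking aggregate plot would show, while the per-layer view is the only place the starved block is visible.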
3
DeepSeek V4-Flash on Apple Silicon
Thread Simon Willison — Co-creator of Django, prolific blogger on practical LLM use; has become the de facto reference point for "can I run this on my laptop."
The 4-bit build fits in about 17GB of unified memory. On an M3 Max I'm seeing roughly 25 tokens/sec on short prompts; it falls off as the context fills, as you'd expect.
x.com/simonw/status/1915200000000000000
- Cited text
The 4-bit build fits in about 17GB of unified memory. On an M3 Max I'm seeing roughly 25 tokens/sec on short prompts; it falls off as the context fills, as you'd expect.
- Context
- A 17GB model that's actually useful is the kind of thing that changes what you build. Local agents stop being a toy when the model behind them is competitive on real tasks.
- Key points
- DeepSeek V4-Flash 4-bit quantization runs in ~17GB of unified memory, putting it in reach of M-series Macs with 32GB or more
- Short-prompt throughput on an M3 Max is ~25 tokens/sec, dropping with longer context
- The model is genuinely usable for agent loops at this size — not just a demo
- Local-first inference for capable models is becoming a normal option again, not a sacrifice
- Provenance
- Thread · Primary source
4
The 19% productivity paradox in AI-assisted coding
Video Nate B Jones — Product strategist who has been tracking AI coding adoption studies closely; runs a YouTube channel focused on practitioner-grade analysis.
Experienced developers using AI tools were 19% slower on the tasks studied — and they thought they were 20% faster.
www.youtube.com/watch?v=natebjones-aiproduc…
- Cited text
Experienced developers using AI tools were 19% slower on the tasks studied — and they thought they were 20% faster.
- Context
- If you're managing a team, the gut feel that 'the AI is making us faster' is a perception, not a measurement. The honest answer is that the productivity story depends heavily on the task, the seniority, and the codebase, and the perception gap is large.
- Key points
- A controlled study of experienced developers using AI tools found a 19% slowdown on real tasks
- The same developers self-reported feeling 20% faster — a 39-point gap between perception and reality
- The slowdown concentrated in review, prompt iteration, and recovering from confidently wrong suggestions
- Junior developers and unfamiliar codebases showed different curves; the headline number is not universal
- Provenance
- Video · Supporting source
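The 39-point figure in the key points is the measured slowdown and the felt speedup added together. A team that wants to check its own gap needs only timed tasks and a self-report. The sketch below assumes one hypothetical task with made-up hours chosen to reproduce the study's headline numbers; the function names are illustrative.

```python
def slowdown_pct(baseline_hours, with_tool_hours):
    """Percent change in completion time; positive means slower with the tool."""
    return 100.0 * (with_tool_hours - baseline_hours) / baseline_hours


def perception_gap_points(measured_slowdown, perceived_speedup):
    """Distance, in percentage points, between felt speedup and measured slowdown."""
    return perceived_speedup + measured_slowdown


# Hypothetical task: 10.0 hours without the tool, 11.9 hours with it.
measured = slowdown_pct(10.0, 11.9)
gap = perception_gap_points(measured, perceived_speedup=20.0)
print(f"measured: {measured:.0f}% slower, felt: 20% faster, gap: {gap:.0f} points")
```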
5
wuphf: a shared workspace for collaborating agents
Article wuphf maintainers — A small team building what they describe as 'Slack for AI employees with a shared brain.'
Agents post to channels, read each other's posts, and write to a shared wiki. The wiki is the long-term memory; the channels are working memory.
github.com/wuphf-ai/wuphf
- Cited text
Agents post to channels, read each other's posts, and write to a shared wiki. The wiki is the long-term memory; the channels are working memory.
- Context
- Multi-agent systems keep failing on memory and shared context. Borrowing the Slack-plus-wiki shape from how human teams already work is at least a hypothesis worth testing.
- Key points
- wuphf treats a Slack-shaped workspace as a substrate for multi-agent collaboration, with channels for working memory and a wiki for long-term memory
- The Karpathy-style wiki idea — a single canonical document an agent maintains — gets externalized into a shared structure agents read and write together
- Mirrors the way human engineering teams use Slack plus Notion: ephemeral chat for now, durable docs for later
- Open question whether agents can actually maintain a coherent shared wiki without a human gardener
- Provenance
- Article · Supporting source
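The channels-as-working-memory, wiki-as-long-term-memory split can be pictured as two stores with different retention. The sketch below is a hypothetical model of that shape, not wuphf's API: the `Workspace` class, its method names, and the fixed-size channel window are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Hypothetical model of the channels-plus-wiki shape; not wuphf's API."""
    channel_window: int = 3
    channels: dict = field(default_factory=dict)  # channel name -> recent posts
    wiki: dict = field(default_factory=dict)      # page title -> body

    def post(self, channel, agent, text):
        # Working memory: only the most recent posts in each channel survive.
        log = self.channels.setdefault(channel, deque(maxlen=self.channel_window))
        log.append((agent, text))

    def read(self, channel):
        return list(self.channels.get(channel, ()))

    def write_page(self, title, body):
        # Long-term memory: pages persist until an agent overwrites them.
        self.wiki[title] = body


ws = Workspace(channel_window=2)
ws.post("build", "agent-a", "tests failing on main")
ws.post("build", "agent-b", "bisecting")
ws.post("build", "agent-b", "culprit is the retry patch")
ws.write_page("Build notes", "Retry patch broke tests; reverted.")

print(ws.read("build"))
print(ws.wiki["Build notes"])
```

The oldest channel post has already scrolled out of the window, while the wiki page keeps the conclusion. Whether agents reliably write that conclusion down is the open question the last key point raises.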
6
Graham on Hamming's research principles
Article Paul Graham — Founder of Y Combinator; the post is his reflection on Hamming's 'You and Your Research' talk.
Hamming's point was simple and hard: if you don't work on important problems, you don't do important work. Most of us spend most of our time on neither.
paulgraham.com/hamming.html
- Cited text
Hamming's point was simple and hard: if you don't work on important problems, you don't do important work. Most of us spend most of our time on neither.
- Context
- In a year where the tooling is changing under us weekly, the question 'what am I actually trying to build' matters more, not less. A useful one to sit with.
- Key points
- Graham revisits Hamming's 1986 'You and Your Research' talk with a 2026 frame
- The central claim — 'work on important problems' — is offered without sentimentality, including the cost of doing so
- Graham argues the AI moment is exactly the kind of inflection where Hamming's question becomes uncomfortable for senior engineers to answer honestly
- Implicit nudge: choose the problem before you choose the tool
- Provenance
- Article · Supporting source
7
Bringing Pi into Salesforce developer workflows
X Jag Valaiyapathy — Engineering leader at Salesforce working on developer tooling integrations.
Enterprise rollouts are where you find out which models actually pencil out. Salesforce picking Pi over the obvious frontier names is a data point about where the floor is moving.
x.com/jagvalaiyapathy/status/19153000000000…
- Context
- Enterprise rollouts are where you find out which models actually pencil out. Salesforce picking Pi over the obvious frontier names is a data point about where the floor is moving.
- Key points
- Pi is being integrated into Salesforce's internal developer tooling for code generation and review
- The pitch is token efficiency at enterprise scale — a smaller model that gets the job done is cheaper to deploy across thousands of developers
- Salesforce's choice mirrors the OpenRouter signal: companies paying real bills are picking models on cost-per-task, not headline benchmarks
- Rollout is staged; not yet a default for all developers
- Provenance
- Tweet · Primary source
8
New Anthropic research: Project Deal
Thread Anthropic — Anthropic's official account
We created a marketplace for employees in our San Francisco office, with one big twist. We tasked Claude with buying, selling and negotiating on our colleagues' behalf.
x.com/AnthropicAI/status/2047728360818696302
- Cited text
We created a marketplace for employees in our San Francisco office, with one big twist. We tasked Claude with buying, selling and negotiating on our colleagues' behalf.
- Context
- If agents can negotiate complex multi-party transactions reliably, that's one of the hardest concrete tasks they could do — and Anthropic is testing it internally first. The question is whether these results scale beyond their own office.
- Key points
- Anthropic built an internal marketplace for SF office employees
- Claude acts as an agent buying, selling, and negotiating on employees' behalf
- This is research into multi-agent negotiation and marketplace dynamics
- Claude handles the full negotiation stack end-to-end
- Engagement
- 5825 likes · 818 retweets · 267 replies
- Provenance
- Thread · Primary source
9
Experienced developers took 19% longer with AI
Video Nate B Jones — AI News & Strategy Daily — covers AI productivity research and industry analysis
Copilot makes writing code cheaper, but owning it more expensive.
www.youtube.com/shorts/7j0ttVwJrow
- Cited text
Copilot makes writing code cheaper, but owning it more expensive.
- Context
- This is the kind of reality check that matters for anyone actually deploying AI coding tools at scale. Lab benchmarks don't capture the cost of owning and reviewing agent-generated code.
- Key points
- Lab studies show 55% faster code completion on isolated tasks with GitHub Copilot
- But in production, developers using AI get measurably slower — 19% longer for experienced devs
- Larger pull requests, higher review costs, more security vulnerabilities from generated code
- Organizations often interpret the dip as evidence AI doesn't work rather than recognizing workflow adaptation time
- The J-curve pattern applies across many orgs — the dip before the rise
- Provenance
- Video · Supporting source
10
Grok Imagine model demo
X Elon Musk
New Grok Imagine model just dropped with much better lip sync & sound. Nothing in this video is real.
x.com/elonmusk/status/2047881966268117064
- Cited text
New Grok Imagine model just dropped with much better lip sync & sound. Nothing in this video is real.
- Context
- Video generation models are becoming the new arms race — lip sync and audio integration are the differentiators now, not just visual fidelity. Grok Imagine's release puts X / Grok in the same arena as Sora, Kling, and Runway's Gen-3.
- Key points
- Grok Imagine model released with improved lip sync and sound capabilities
- Demonstrated with a video that Musk captioned 'Nothing in this video is real'
- 40K+ likes, 6K+ retweets — significant engagement for a video gen announcement
- Part of Musk's ongoing effort to build a multimodal generation stack
- Engagement
- 40077 likes · 6174 retweets · 5669 replies
- Provenance
- Tweet · Primary source
11
Lambda Calculus Benchmark for AI
Article Victor Taelin — known for work on LLM evaluation and training theory
Lambda calculus benchmarks test formal reasoning rather than memorization — if a model can consistently solve lambda calculus problems, it suggests genuine structural understanding rather than pattern matching.
victortaelin.github.io/lambench
- Context
- Lambda calculus benchmarks test formal reasoning rather than memorization — if a model can consistently solve lambda calculus problems, it suggests genuine structural understanding rather than pattern matching.
- Key points
- LamBench uses lambda calculus as a benchmark for LLM reasoning ability
- Tests whether models can manipulate and solve formal mathematical structures
- Published via GitHub as Victor Taelin's LamBench v1
- Low engagement on HN (21 points, 4 comments) — niche but technically interesting
- Provenance
- Article · Supporting source
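For readers who have not touched lambda calculus in a while, the flavor of "manipulating formal structures" can be shown with Church numerals, where a number is encoded as a function and arithmetic is function composition. This is a generic textbook illustration written with Python lambdas, not a LamBench task; the benchmark's actual problem format is not described in this dispatch.

```python
# Church numerals: the number n means "apply f to x, n times".
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))

# Addition applies f n times, then m more times.
add = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

# Multiplication repeats "apply f n times" m times.
mul = lambda m: lambda n: lambda f: m(n(f))


def to_int(n):
    """Decode a Church numeral by counting applications of +1 starting from 0."""
    return n(lambda k: k + 1)(0)


two = succ(succ(zero))
three = succ(two)

print(to_int(add(two)(three)))
print(to_int(mul(two)(three)))
```

Nothing here can be answered by recalling a fact: getting the right term means tracking how each function is applied, which is the property the Context note points to.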
12
Show HN: A Karpathy-style LLM wiki your agents maintain (Markdown and Git)
Article najmuzzaman
This is one of the first examples of an agent collaboration platform with explicit knowledge management — the kind of infrastructure that lets multi-agent teams actually build something without each agent reinventing ev…
github.com/nex-crm/wuphf
- Context
- This is one of the first examples of an agent collaboration platform with explicit knowledge management — the kind of infrastructure that lets multi-agent teams actually build something without each agent reinventing everything from scratch every time.
- Key points
- WUPHF provides a shared 'office' for multiple AI agents with per-agent notebooks and a team wiki
- Agents decide what gets promoted from a private notebook to shared knowledge; nothing is auto-promoted
- Wiki supports markdown (local git repo), Nex backend, or gbrain (OpenAI embeddings)
- Has MCP tools for wiki read/write/search and a lint suite that flags contradictions and orphans
- Supports OpenClaw bridge for bringing existing agents into the office
- Engagement
- 124 likes · 57 replies
- Provenance
- Article · Supporting source
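An orphan check of the kind the lint suite is said to perform is easy to picture on a Markdown wiki. The sketch below is an assumption-heavy illustration, not WUPHF's lint: it assumes pages link to each other with ordinary Markdown links to `.md` files, that a page named `index.md` is the entry point, and that an orphan is any other page with no inbound link. The `find_orphans` helper and the sample pages are hypothetical.

```python
import re

# Matches ordinary Markdown links whose target is a .md file, e.g. [text](page.md).
LINK = re.compile(r"\[[^\]]*\]\(([^)#\s]+\.md)\)")


def find_orphans(pages, root="index.md"):
    """Return pages (other than the root) that no page links to."""
    linked = set()
    for body in pages.values():
        linked.update(LINK.findall(body))
    return sorted(name for name in pages if name != root and name not in linked)


# Hypothetical wiki contents keyed by file name.
pages = {
    "index.md": "See [deploys](deploys.md).",
    "deploys.md": "Rollback steps live in [runbook](runbook.md).",
    "runbook.md": "Step 1: revert.",
    "old-notes.md": "Nothing links here.",
}

print(find_orphans(pages))
```

Flagging contradictions, the other lint mentioned above, is a much harder problem than link counting and is not attempted here.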
13
Ethan Mollick on agent organizational design
X Ethan Mollick — Wharton professor and AI researcher
Organizational design for agents is hard, benchmarking agents working in concert is hard. Together, this is the next critical frontier for making AI matter in economically valuable tasks, and we really don't know very m…
x.com/emollick/status/2047828327856030047
- Cited text
Organizational design for agents is hard, benchmarking agents working in concert is hard. Together, this is the next critical frontier for making AI matter in economically valuable tasks, and we really don't know very much about it.
- Context
- As more teams experiment with multiple agents working together, the lack of empirical guidance on organization and measurement becomes the bottleneck — not the individual agent capability.
- Key points
- Multi-agent organizational design is an unresolved problem
- Benchmarking agents working together is hard
- Making AI matter in economically valuable tasks requires solving the coordination problem
- We have almost no empirical data on effective agent team structures
- Provenance
- Tweet · Primary source
14
Susan Zhang on norms in training
X Susan Zhang — researcher working on LLM training dynamics
Something could be murdering your dynamic range, and you'll never know what it is when it's all hidden under the beautiful norm carpet.
x.com/suchenzang/status/2047797976366792775
- Cited text
Something could be murdering your dynamic range, and you'll never know what it is when it's all hidden under the beautiful norm carpet.
- Context
- Susan Zhang's point about norm layers hiding training problems is relevant for anyone doing large-scale training — normalization can mask the actual issues you need to diagnose.
- Key points
- Layer normalization stabilizes training but can hide problems
- The 'blessing' of constraining magnitude becomes a 'curse' by masking what's actually going wrong
- Diagnosing late-stage training problems becomes harder because norm layers absorb anomalies
- The communication cost of accumulating norm layer stats gets expensive with larger batches and bigger scale
- Provenance
- Tweet · Primary source
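The "norm carpet" effect is easy to reproduce: a normalization layer maps inputs of wildly different scale to nearly the same output, so anything measured after the layer cannot see the blow-up. The sketch below is a plain-Python layer norm without learned gain or bias, using hypothetical activation values. It illustrates the masking effect only and is not a reconstruction of any particular training run.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))


# Hypothetical pre-norm activations: the second set is 1000x larger.
healthy = [0.5, -1.0, 1.5, -0.25]
blown_up = [v * 1000.0 for v in healthy]

pre_ratio = rms(blown_up) / rms(healthy)
post_diff = max(abs(a - b) for a, b in zip(layer_norm(healthy), layer_norm(blown_up)))

print(f"pre-norm scale ratio: {pre_ratio:.0f}x")
print(f"largest post-norm difference: {post_diff:.1e}")
```

Logging a pre-norm statistic such as `rms` alongside the usual post-norm metrics is one way to keep the magnitude visible. It also adds the kind of stat-gathering overhead the last key point warns about.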
15
Internet Archive report: 26% of pages from 2013-2023 are gone
X Internet Archive
26% of pages from 2013-2023 are no longer accessible.
x.com/internetarchive/status/20477335940644…
- Cited text
26% of pages from 2013-2023 are no longer accessible.
- Context
- If 26% of the web is vanishing, that has implications for how agents train — they need access to the living web, not just static snapshots. It's also a reminder that the internet's infrastructure is fragile.
- Key points
- 26% of web pages from 2013-2023 are no longer accessible
- Data scientists working with the Wayback Machine published findings in the book VANISHING CULTURE
- This is a significant chunk of the web's recent history disappearing
- The web is disappearing at a measurable rate
- Provenance
- Tweet · Primary source