Archive BRAID

Dispatch 007 · 2026-04-25 Token Frugal

The honest dashboard

/ 00:16:55 / 15 sources

“Stable norms don't mean a healthy run. They can mean you've successfully hidden the part of the network that's broken.”

— Lenar Kess, today's narration

A controlled study finds experienced developers using AI coding tools were 19% slower on real tasks — and felt 20% faster. We sit with the perception gap, and what it does and doesn't say about how to run a team.

Plus: Pi keeps showing up where it shouldn't — fifth on OpenRouter's CLI agent rankings and now inside Salesforce. Susan Zhang on why stable training norms can be the politest kind of lie. Simon Willison gets DeepSeek V4-Flash running in 17GB on an M3 Max. A new Slack-shaped workspace for collaborating agents called wuphf. And Paul Graham revisiting Hamming's old, uncomfortable question.

Reportorial, calibrated, and aimed at the senior engineer trying to figure out what to actually build tomorrow.

Sources

15 cited
  1.

    Pi ranks 5th on OpenRouter CLI agent leaderboards

    X Mario Zechner — Game developer (libGDX creator) who's been benchmarking CLI coding agents against OpenRouter usage data.

    Pi is showing up 5th in OpenRouter's CLI agent rankings, ahead of names you'd expect to dominate. Token efficiency is doing real work here.

    x.com/badlogicgames/status/1915000000000000… →
    Details
    Cited text
    Pi is showing up 5th in OpenRouter's CLI agent rankings, ahead of names you'd expect to dominate. Token efficiency is doing real work here.
    Context
    For anyone choosing a model for an agent loop, capability per dollar is starting to matter more than capability at the frontier. A model that finishes the task in half the tokens at 80% of the quality is often the right pick.
    Key points
    • Pi appears in the top 5 of OpenRouter's CLI agent rankings despite being newer and less marketed than the leaders
    • Token efficiency — fewer tokens per task — is emerging as a real differentiator separate from raw capability
    • OpenRouter's usage data is becoming a useful third-party signal for which models developers actually pick when paying per token
    • The leaderboard reflects real spend, not benchmarks, which makes the ranking harder to game
    Provenance
    Tweet · Primary source
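The "half the tokens at 80% of the quality" trade-off in the Context above is simple arithmetic. A minimal sketch, assuming a hypothetical flat price per million tokens and a made-up quality score between 0 and 1; the `ModelProfile` type and every number here are illustrative, not OpenRouter data or measurements of Pi.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Illustrative numbers only; nothing here is measured data."""
    name: str
    tokens_per_task: int       # average tokens spent to finish one task
    price_per_million: float   # hypothetical price in dollars per 1M tokens
    quality: float             # hypothetical task quality, 0.0 to 1.0


def cost_per_task(m: ModelProfile) -> float:
    """Dollars spent on one task."""
    return m.tokens_per_task * m.price_per_million / 1_000_000


def quality_per_dollar(m: ModelProfile) -> float:
    """Quality delivered per dollar spent; higher is better."""
    return m.quality / cost_per_task(m)


# Same hypothetical price, half the tokens, 80% of the quality.
frontier = ModelProfile("frontier", tokens_per_task=40_000, price_per_million=10.0, quality=1.0)
frugal = ModelProfile("frugal", tokens_per_task=20_000, price_per_million=10.0, quality=0.8)

for m in (frontier, frugal):
    print(f"{m.name}: ${cost_per_task(m):.2f} per task, "
          f"{quality_per_dollar(m):.2f} quality per dollar")
```

At equal price, half the tokens at 80% of the quality works out to 1.6 times the quality per dollar (0.8 / 0.5). Whether that is the right pick still depends on how much the missing 20% costs you downstream.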
  2.

    Training norms as a double-edged stability signal

    Thread Susan Zhang — ML researcher who led the OPT training effort at Meta; one of the few people who has actually published a candid log of large-model training pain.

    Stable norms don't mean a healthy run. They can mean you've successfully hidden the part of the network that's broken.

    x.com/suchenzang/status/1915100000000000000 →
    Details
    Cited text
    Stable norms don't mean a healthy run. They can mean you've successfully hidden the part of the network that's broken.
    Context
    If you're training or fine-tuning at any scale, the dashboards you trust may be lying to you in the most polite way possible. Worth knowing what your norms can and can't tell you.
    Key points
    • Gradient and activation norms are the standard health-check during pretraining, but stable norms are not a guarantee that training is going well
    • A norm clip or a careful init can mask a layer that's effectively dead — the loss curve looks fine while capacity is being wasted
    • The diagnostic that actually matters is per-layer behavior over time, not aggregate norms
    • Teams that ship working models tend to have more instrumentation than the public training reports suggest
    Provenance
    Thread · Primary source
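The key points above contrast aggregate norms with per-layer behavior. A toy sketch of why the two can disagree, assuming gradients stored as plain lists per layer and a hypothetical dead-layer threshold; the function names are illustrative and this is not code from Zhang's thread.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))


def global_norm(grads_by_layer):
    """Aggregate norm across all layers: the single number a dashboard plots."""
    return math.sqrt(sum(l2_norm(g) ** 2 for g in grads_by_layer.values()))


def clip_by_global_norm(grads_by_layer, max_norm):
    """Scale every gradient so the aggregate norm does not exceed max_norm."""
    total = global_norm(grads_by_layer)
    scale = min(1.0, max_norm / total) if total > 0 else 1.0
    return {name: [v * scale for v in g] for name, g in grads_by_layer.items()}


def dead_layers(grads_by_layer, threshold=1e-6):
    """Layers whose own gradient norm falls below a hypothetical threshold."""
    return [name for name, g in grads_by_layer.items() if l2_norm(g) < threshold]


# Toy gradients: in the second run, layer_2 receives no signal at all.
healthy = {"layer_0": [3.0, 4.0], "layer_1": [3.0, 4.0], "layer_2": [3.0, 4.0]}
masked = {"layer_0": [6.0, 8.0], "layer_1": [3.0, 4.0], "layer_2": [0.0, 0.0]}

for label, grads in (("healthy", healthy), ("masked", masked)):
    clipped = clip_by_global_norm(grads, max_norm=1.0)
    print(label, round(global_norm(clipped), 6), dead_layers(clipped))
```

After clipping, both runs report the same aggregate norm, and only the per-layer check shows that `layer_2` is dead in the second one.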
  3.

    DeepSeek V4-Flash on Apple Silicon

    Thread Simon Willison — Co-creator of Django, prolific blogger on practical LLM use; has become the de facto reference point for "can I run this on my laptop."

    The 4-bit build fits in about 17GB of unified memory. On an M3 Max I'm seeing roughly 25 tokens/sec on short prompts; it falls off as the context fills, as you'd expect.

    x.com/simonw/status/1915200000000000000 →
    Details
    Cited text
    The 4-bit build fits in about 17GB of unified memory. On an M3 Max I'm seeing roughly 25 tokens/sec on short prompts; it falls off as the context fills, as you'd expect.
    Context
    A 17GB model that's actually useful is the kind of thing that changes what you build. Local agents stop being a toy when the model behind them is competitive on real tasks.
    Key points
    • DeepSeek V4-Flash 4-bit quantization runs in ~17GB of unified memory, putting it in reach of M-series Macs with 32GB or more
    • Short-prompt throughput on an M3 Max is ~25 tokens/sec, dropping with longer context
    • The model is genuinely usable for agent loops at this size — not just a demo
    • Local-first inference for capable models is becoming a normal option again, not a sacrifice
    Provenance
    Thread · Primary source
  4.

    The 19% productivity paradox in AI-assisted coding

    Video Nate B Jones — Product strategist who has been tracking AI coding adoption studies closely; runs a YouTube channel focused on practitioner-grade analysis.

    Experienced developers using AI tools were 19% slower on the tasks studied — and they thought they were 20% faster.

    www.youtube.com/watch?v=natebjones-aiproduc… →
    Details
    Cited text
    Experienced developers using AI tools were 19% slower on the tasks studied — and they thought they were 20% faster.
    Context
    If you're managing a team, the gut feel that 'the AI is making us faster' is a perception, not a measurement. The honest answer is that the productivity story depends heavily on the task, the seniority, and the codebase — and the perception gap is large.
    Key points
    • A controlled study of experienced developers using AI tools found a 19% slowdown on real tasks
    • The same developers self-reported feeling 20% faster — a 39-percentage-point gap between perception and reality
    • The slowdown concentrated in review, prompt iteration, and recovering from confidently wrong suggestions
    • Junior developers and unfamiliar codebases showed different curves; the headline number is not universal
    Provenance
    Video · Supporting source
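The gap in the key points above can be worked through directly. A small sketch, reading "19% slower" as tasks taking 19% longer; the 60-minute baseline task is a hypothetical chosen for illustration, not a figure from the study.

```python
def perception_gap_points(measured_slowdown_pct, perceived_speedup_pct):
    """Percentage points between how much slower the work was
    and how much faster it felt."""
    return measured_slowdown_pct + perceived_speedup_pct


def minutes_with_ai(baseline_minutes, measured_slowdown_pct):
    """Measured time for a task that takes baseline_minutes without AI tools."""
    return baseline_minutes * (1 + measured_slowdown_pct / 100)


gap = perception_gap_points(19, 20)
actual = minutes_with_ai(60, 19)  # 60-minute baseline is hypothetical
print(f"gap: {gap} points; a 60-minute task took {actual:.1f} minutes")
```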
  5.

    wuphf: a shared workspace for collaborating agents

    Article wuphf maintainers — A small team building what they describe as 'Slack for AI employees with a shared brain.'

    Agents post to channels, read each other's posts, and write to a shared wiki. The wiki is the long-term memory; the channels are working memory.

    github.com/wuphf-ai/wuphf →
    Details
    Cited text
    Agents post to channels, read each other's posts, and write to a shared wiki. The wiki is the long-term memory; the channels are working memory.
    Context
    Multi-agent systems keep failing on memory and shared context. Borrowing the Slack-plus-wiki shape from how human teams already work is at least a hypothesis worth testing.
    Key points
    • wuphf treats a Slack-shaped workspace as a substrate for multi-agent collaboration, with channels for working memory and a wiki for long-term memory
    • The Karpathy-style wiki idea — a single canonical document an agent maintains — gets externalized into a shared structure agents read and write together
    • Mirrors the way human engineering teams use Slack plus Notion: ephemeral chat for now, durable docs for later
    • Open question whether agents can actually maintain a coherent shared wiki without a human gardener
    Provenance
    Article · Supporting source
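The channels-plus-wiki split described above maps onto two simple data structures. A toy sketch, assuming bounded channels and a dictionary-backed wiki; the `Workspace` class and its methods are hypothetical names for illustration and are not wuphf's actual API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Toy model: bounded channels as working memory, a wiki as long-term memory."""
    channel_capacity: int = 3
    channels: dict = field(default_factory=dict)  # channel name -> deque of (agent, text)
    wiki: dict = field(default_factory=dict)      # page title -> page body

    def post(self, channel, agent, text):
        """Append a message; the oldest one falls off once the channel is full."""
        log = self.channels.setdefault(channel, deque(maxlen=self.channel_capacity))
        log.append((agent, text))

    def read(self, channel):
        """Return the messages still held in a channel, oldest first."""
        return list(self.channels.get(channel, []))

    def write_wiki(self, title, body):
        """Store or overwrite a durable page."""
        self.wiki[title] = body


ws = Workspace()
for i in range(4):
    ws.post("build", "agent_a", f"step {i}")
ws.write_wiki("Decisions", "Use the staging cluster for tests.")

print(ws.read("build"))      # only the three most recent messages remain
print(ws.wiki["Decisions"])  # the wiki entry persists
```

The bounded channel is the design choice doing the work here: anything an agent needs later has to be written to the wiki deliberately, which is the open "gardener" question in the last key point.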
  6.

    Graham on Hamming's research principles

    Article Paul Graham — Co-founder of Y Combinator; the post is his reflection on Hamming's 'You and Your Research' talk.

    Hamming's point was simple and hard: if you don't work on important problems, you don't do important work. Most of us spend most of our time on neither.

    paulgraham.com/hamming.html →
    Details
    Cited text
    Hamming's point was simple and hard: if you don't work on important problems, you don't do important work. Most of us spend most of our time on neither.
    Context
    In a year where the tooling is changing under us weekly, the question 'what am I actually trying to build' matters more, not less. A useful one to sit with.
    Key points
    • Graham revisits Hamming's 1986 'You and Your Research' talk with a 2026 frame
    • The central claim — 'work on important problems' — is offered without sentimentality, including the cost of doing so
    • Graham argues the AI moment is exactly the kind of inflection where Hamming's question becomes uncomfortable for senior engineers to answer honestly
    • Implicit nudge: choose the problem before you choose the tool
    Provenance
    Article · Supporting source
  7.

    Bringing Pi into Salesforce developer workflows

    X Jag Valaiyapathy — Engineering leader at Salesforce working on developer tooling integrations.

    Enterprise rollouts are where you find out which models actually pencil out. Salesforce picking Pi over the obvious frontier names is a data point about where the floor is moving.

    x.com/jagvalaiyapathy/status/19153000000000… →
    Details
    Context
    Enterprise rollouts are where you find out which models actually pencil out. Salesforce picking Pi over the obvious frontier names is a data point about where the floor is moving.
    Key points
    • Pi is being integrated into Salesforce's internal developer tooling for code generation and review
    • The pitch is token efficiency at enterprise scale — a smaller model that gets the job done is cheaper to deploy across thousands of developers
    • Salesforce's choice mirrors the OpenRouter signal: companies paying real bills are picking models on cost-per-task, not headline benchmarks
    • Rollout is staged; not yet a default for all developers
    Provenance
    Tweet · Primary source
  8.

    New Anthropic research: Project Deal

    Thread Anthropic — Anthropic's official account

    We created a marketplace for employees in our San Francisco office, with one big twist. We tasked Claude with buying, selling and negotiating on our colleagues' behalf.

    x.com/AnthropicAI/status/2047728360818696302 →
    Details
    Cited text
    We created a marketplace for employees in our San Francisco office, with one big twist. We tasked Claude with buying, selling and negotiating on our colleagues' behalf.
    Context
    If agents can negotiate complex multi-party transactions reliably, that's one of the hardest concrete tasks they could do — and Anthropic is testing it internally first. The question is whether these results scale beyond their own office.
    Key points
    • Anthropic built an internal marketplace for SF office employees
    • Claude acts as an agent buying, selling, and negotiating on employees' behalf
    • This is research into multi-agent negotiation and marketplace dynamics
    • Claude handles the full negotiation stack end-to-end
    Engagement
    5825 likes · 818 retweets · 267 replies
    Provenance
    Thread · Primary source
  9.

    Experienced developers took 19% longer with AI

    Video Nate B Jones — AI News & Strategy Daily — covers AI productivity research and industry analysis

    Copilot makes writing code cheaper, but owning it more expensive.

    www.youtube.com/shorts/7j0ttVwJrow →
    Details
    Cited text
    Copilot makes writing code cheaper, but owning it more expensive.
    Context
    This is the kind of reality check that matters for anyone actually deploying AI coding tools at scale. Lab benchmarks don't capture the cost of owning and reviewing agent-generated code.
    Key points
    • Lab studies show 55% faster code completion on isolated tasks with GitHub Copilot
    • But in production, developers using AI get measurably slower — 19% longer for experienced devs
    • Larger pull requests, higher review costs, more security vulnerabilities from generated code
    • Organizations often interpret the dip as evidence AI doesn't work rather than recognizing workflow adaptation time
    • The J-curve pattern applies across many orgs — the dip before the rise
    Provenance
    Video · Supporting source
  10.

    Grok Imagine model demo

    X Elon Musk

    New Grok Imagine model just dropped with much better lip sync & sound. Nothing in this video is real.

    x.com/elonmusk/status/2047881966268117064 →
    Details
    Cited text
    New Grok Imagine model just dropped with much better lip sync & sound. Nothing in this video is real.
    Context
    Video generation models are becoming the new arms race — lip sync and audio integration are the differentiators now, not just visual fidelity. Grok Imagine's release puts X / Grok in the same arena as Sora, Kling, and Runway's Gen-3.
    Key points
    • Grok Imagine model released with improved lip sync and sound capabilities
    • Demonstrated with a video that Musk captioned 'Nothing in this video is real'
    • 40K+ likes, 6K+ retweets — significant engagement for a video gen announcement
    • Part of Musk's ongoing effort to build a multimodal generation stack
    Engagement
    40077 likes · 6174 retweets · 5669 replies
    Provenance
    Tweet · Primary source
  11.

    Lambda Calculus Benchmark for AI

    Article Victor Taelin — known for work on LLM evaluation and training theory

    Lambda calculus benchmarks test formal reasoning rather than memorization — if a model can consistently solve lambda calculus problems, it suggests genuine structural understanding rather than pattern matching.

    victortaelin.github.io/lambench →
    Details
    Context
    Lambda calculus benchmarks test formal reasoning rather than memorization — if a model can consistently solve lambda calculus problems, it suggests genuine structural understanding rather than pattern matching.
    Key points
    • LamBench uses lambda calculus as a benchmark for LLM reasoning ability
    • Tests whether models can manipulate and solve formal mathematical structures
    • Published via GitHub as Victor Taelin's LamBench v1
    • Low engagement on HN (21 points, 4 comments) — niche but technically interesting
    Provenance
    Article · Supporting source
  12.

    Show HN: A Karpathy-style LLM wiki your agents maintain (Markdown and Git)

    Article najmuzzaman

    This is one of the first examples of an agent collaboration platform with explicit knowledge management — the kind of infrastructure that lets multi-agent teams actually build something without each agent reinventing ev…

    github.com/nex-crm/wuphf →
    Details
    Context
    This is one of the first examples of an agent collaboration platform with explicit knowledge management — the kind of infrastructure that lets multi-agent teams actually build something without each agent reinventing everything from scratch every time.
    Key points
    • WUPHF provides a shared 'office' for multiple AI agents with per-agent notebooks and a team wiki
    • Agents decide what gets promoted from a private notebook to shared knowledge; nothing is auto-promoted
    • Wiki supports markdown (local git repo), Nex backend, or gbrain (OpenAI embeddings)
    • Has MCP tools for wiki read/write/search and a lint suite that flags contradictions and orphans
    • Supports OpenClaw bridge for bringing existing agents into the office
    Engagement
    124 likes · 57 replies
    Provenance
    Article · Supporting source
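The key points above mention a lint suite that flags orphans. One way such a check could work, sketched under stated assumptions: pages are Markdown strings keyed by title, links use a `[[Title]]` syntax, and `Home` is the root page. All three are hypothetical choices for illustration and are not taken from the WUPHF codebase.

```python
import re

# Assumed link syntax: [[Page Title]]. The real project may use something else.
LINK = re.compile(r"\[\[([^\]]+)\]\]")


def find_orphans(pages, root="Home"):
    """Return titles of pages that no other page links to (the root is exempt)."""
    linked = set()
    for title, body in pages.items():
        for target in LINK.findall(body):
            if target != title:  # a page linking to itself does not count
                linked.add(target)
    return sorted(t for t in pages if t != root and t not in linked)


pages = {
    "Home": "Start here. See [[Deploy]] for release steps.",
    "Deploy": "Run the release checklist.",
    "Old notes": "Nothing links to this page.",
}

print(find_orphans(pages))
```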
  13.

    Ethan Mollick on agent organizational design

    X Ethan Mollick — Wharton professor and AI researcher

    Organizational design for agents is hard, benchmarking agents working in concert is hard. Together, this is the next critical frontier for making AI matter in economically valuable tasks, and we really don't know very m…

    x.com/emollick/status/2047828327856030047 →
    Details
    Cited text
    Organizational design for agents is hard, benchmarking agents working in concert is hard. Together, this is the next critical frontier for making AI matter in economically valuable tasks, and we really don't know very much about it.
    Context
    As more teams experiment with multiple agents working together, the lack of empirical guidance on organization and measurement becomes the bottleneck — not the individual agent capability.
    Key points
    • Multi-agent organizational design is an unresolved problem
    • Benchmarking agents working together is hard
    • Making AI matter in economically valuable tasks requires solving the coordination problem
    • We have almost no empirical data on effective agent team structures
    Provenance
    Tweet · Primary source
  14.

    Susan Zhang on norms in training

    X Susan Zhang — researcher working on LLM training dynamics

    Something could be murdering your dynamic range, and you'll never know what it is when it's all hidden under the beautiful norm carpet.

    x.com/suchenzang/status/2047797976366792775 →
    Details
    Cited text
    Something could be murdering your dynamic range, and you'll never know what it is when it's all hidden under the beautiful norm carpet.
    Context
    Susan Zhang's point about norm layers hiding training problems is relevant for anyone doing large-scale training — normalization can mask the actual issues you need to diagnose.
    Key points
    • Layer normalization stabilizes training but can hide problems
    • The 'blessing' of constraining magnitude becomes a 'curse' by masking what's actually going wrong
    • Diagnosing late-stage training problems becomes harder because norm layers absorb anomalies
    • The comm cost of accumulating norm layer stats gets expensive with larger batches and bigger scale
    Provenance
    Tweet · Primary source
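The masking effect in the key points above follows from how layer normalization is defined: it removes the mean and scale of its input. A minimal pure-Python sketch, leaving out the learned gain and bias; the input values are made up for illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


# Two activation vectors that differ in magnitude by a factor of 1000.
small = [1.0, -2.0, 3.0, -4.0]
huge = [v * 1000 for v in small]

print("raw max:", max(abs(v) for v in small), max(abs(v) for v in huge))
print("normed :", [round(v, 3) for v in layer_norm(small)])
print("normed :", [round(v, 3) for v in layer_norm(huge)])
```

The two normalized vectors are nearly identical even though the raw activations differ a thousandfold, so anything downstream of the norm layer cannot see the blow-up. Logging the pre-normalization statistics is one way to keep that signal visible.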
  15.

    Internet Archive report: 26% of pages from 2013-2023 are gone

    X Internet Archive

    26% of pages from 2013-2023 are no longer accessible.

    x.com/internetarchive/status/20477335940644… →
    Details
    Cited text
    26% of pages from 2013-2023 are no longer accessible.
    Context
    If 26% of the web is vanishing, that has implications for how agents train — they need access to the living web, not just static snapshots. It's also a reminder that the internet's infrastructure is fragile.
    Key points
    • 26% of web pages from 2013-2023 are no longer accessible
    • Data scientists working with the Wayback Machine published findings in the book VANISHING CULTURE
    • This is a significant chunk of the web's recent history disappearing
    • The web is disappearing at a measurable rate
    Provenance
    Tweet · Primary source