BRAID

Dispatch 011 · 2026-04-29 GSV User Number One Two Nine Nine

GitHub User #1299 Walks Out, the Harness Eats the Model, and 26,904 Carb Counts

/ 00:29:58 / 12 sources

“The agent loop is a producer. The verifier is the only thing standing between you and a confidently-wrong number.”

— Lenar Kess, today's narration

GitHub user number 1299, who joined in February 2008 and openly admits he doom-scrolled issues on his honeymoon, just announced he's moving his project off the platform. The same week, Hugging Face's CSO is asking out loud whether the GitHub-as-center-of-gravity model survives agents at all. Microsoft and OpenAI quietly tore up the Azure exclusivity clause. A type-1 diabetic ran 13 food photos through four frontier models roughly 500 times each and saw the implied insulin dose for a single photo swing by as much as 42.9 units. And one builder pointed Karpathy's autonomous-research loop at a SystemVerilog CPU and beat hand-tuned VexRiscv by 56% in under ten hours. Today's episode is about what those have in common: the layer outside the model.

Sources

12 cited
  1.

    Ghostty Is Leaving GitHub

    Article Mitchell Hashimoto — Co-founder of HashiCorp; creator of Vagrant; long-time GitHub power user (account #1299, joined 2008). Currently maintains the Ghostty terminal emulator.

    mitchellh.com/writing/ghostty-leaving-github
    Cited text
    GitHub is failing me, every single day, and it is personal. It is irrationally personal. I love GitHub more than a person should love a thing, and I'm mad at it.
    Context
    When the dev who literally registered as the 1,299th GitHub user says the service is no longer a place for serious work, that's a credibility hit that pricing fixes won't repair. The complaint isn't pricing or AI billing — it's reliability. For anyone whose CI, code review, or release cadence runs through Actions, the question is no longer 'is this annoying' but 'do we have an exit plan?'
    Key points
    • Hashimoto kept a one-month journal marking every day a GitHub outage blocked his work; almost every day got an X.
    • On the day he wrote the post, a GitHub Actions outage prevented PR review for ~2 hours; this was a different incident from the April 27 Elasticsearch outage.
    • Ghostty will move to another collaborative host (commercial or FOSS, undecided); GitHub will get a read-only mirror.
    • His personal projects stay on GitHub for now; only Ghostty is moving.
    • He's open to returning if GitHub delivers 'real results and improvements, not words and promises.'
    Provenance
    Article · Supporting source
  2.

    HashiCorp co-founder says GitHub 'no longer a place for serious work'

    Article Simon Sharwood — Senior reporter at The Register

    www.theregister.com/2026/04/29/mitchell_has…
    Context
    Provides outside framing on Hashimoto's post and connects it to the broader pattern of Microsoft platform reliability issues since aggressive AI integration began.
    Key points
    • Frames Hashimoto's announcement against the broader run of GitHub incidents and Microsoft's recent quality issues across Windows.
    • Notes the Microsoft acquisition of GitHub had largely not damaged the service until recently.
    • Highlights that GitHub's increasing wobbles coincide with Microsoft's AI obsession.
    Provenance
    Article · Supporting source
  3.

    GitHub central place might become challenged

    X Thom_Wolf — Co-founder and Chief Science Officer at Hugging Face.

    x.com/Thom_Wolf/status/2049282089518784640
    Cited text
    GitHub central place might become challenged in a world where (1) we access/get code and libraries through agents/chats and (2) our codebases are increasingly custom tailored and build from scratch.
    Context
    In the same week as Hashimoto's exit announcement, a Hugging Face co-founder is publicly questioning whether the GitHub-as-center-of-gravity model survives the agent era at all. The two signals are adjacent, not the same complaint.
    Key points
    • Suggests the 'browse-and-fork' GitHub paradigm gets less central when agents do the discovery and stitching.
    • Argues codebases are trending toward custom-built rather than assembled from public packages.
    • Implies code and library distribution may decouple from a single canonical host.
    Provenance
    Tweet · Primary source
  4.

    An Interview with OpenAI CEO Sam Altman and AWS CEO Matt Garman About Bedrock Managed Agents

    Article Ben Thompson — Founder of Stratechery; the most-cited tech industry analyst of the 2010s and 2020s.

    stratechery.com/2026/an-interview-with-open…
    Cited text
    I no longer think of the harness and the model as these entirely separable things... I would also suspect that model and harness come together more over time.
    Context
    The Microsoft-OpenAI exclusivity is the deal that defined the cloud-AI landscape for three years, and it just ended. For builders the second-order effect matters more than the headline: Altman is publicly conceding that the harness is now part of the model, which reframes how anyone choosing a deployment surface should think about lock-in.
    Key points
    • Microsoft and OpenAI have amended their deal: Azure exclusivity is gone, OpenAI can serve any cloud, AGI clause is dead, Microsoft license runs through 2032.
    • Bedrock Managed Agents, powered by OpenAI, packages OpenAI frontier models inside an AWS-native runtime with identity, permissions, state, logging, governance.
    • Altman: 'Hard to overstate how critical' the harness is — model and harness are no longer separable in his mental model.
    • Altman frames AI as the fourth great platform-enablement moment for startups, after the internet, cloud, and mobile.
    • Microsoft no longer pays revenue share to OpenAI; OpenAI continues to pay Microsoft revenue share through 2030 with a cap.
    Provenance
    Article · Supporting source
  5.

    I Asked AI to Count My Carbs 27,000 Times. It Couldn't Give Me the Same Answer Twice.

    Article Tim Street (Diabettech) — Type 1 diabetic; runs Diabettech; author of a preprint being submitted to Diabetologia on LLM reproducibility for clinical-adjacent tasks.

    www.diabettech.com/i-asked-ai-to-count-my-c…
    Cited text
    42.9 units of insulin from a single photo. That's not a rounding error. That's a potential fatality.
    Context
    If you're shipping anything LLM-backed where consistency is part of the product — clinical, financial, compliance, eval — single-query determinism is not what you have. The author runs the same input through one model 500 times and gets a distribution wide enough to kill someone. Confidence scores don't save you. Querying multiple times and looking at the spread is the only signal that worked.
    Key points
    • 26,904 queries: 13 food photos × 4 frontier models (GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Pro, Gemini 3.1 Pro Preview) × ~500 repeats each, at the lowest randomness setting.
    • Gemini 2.5 Pro on a paella photo: estimates spanned 55g to 484g — a 429g range, equivalent to a 42.9-unit insulin swing at 1:10 ICR.
    • Three of four models converged on ~28g of carbs for a cheese sandwich that actually contains 40g (the bread label is right there in the photo): precisely consistent, and consistently wrong by 12g.
    • Self-reported confidence scores are uncorrelated or negatively correlated with accuracy across all four models; for Claude, high confidence actually predicts lower accuracy.
    • 37% of GPT-5.4 single-query results would push insulin into the 'clinically significant' (>2U) error zone for strong-reference foods.
    Provenance
    Article · Supporting source
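The paella figures in the entry above can be checked by hand. The sketch below is a minimal worked example, not the author's analysis code: it reproduces the 429g range and the 42.9-unit swing from the two reported endpoints at a 1:10 insulin-to-carb ratio, and backs out the average number of repeats per photo-model pair from the 26,904 total. The function names `insulin_units` and `dose_spread` are hypothetical, and only the two reported extremes are used; the full distribution of estimates is not reproduced here.

```python
def insulin_units(carbs_g: float, icr_g_per_unit: float = 10.0) -> float:
    """Units of insulin implied by a carb amount at a given insulin-to-carb ratio."""
    return carbs_g / icr_g_per_unit


def dose_spread(estimates_g: list[float], icr_g_per_unit: float = 10.0) -> float:
    """Insulin swing between the lowest and highest carb estimate for one photo."""
    return insulin_units(max(estimates_g) - min(estimates_g), icr_g_per_unit)


# Only the two reported extremes for the paella photo (assumption: the
# intermediate estimates are omitted because they are not given in the notes).
paella_extremes_g = [55.0, 484.0]

carb_range_g = max(paella_extremes_g) - min(paella_extremes_g)
swing_units = dose_spread(paella_extremes_g)

# 26,904 total queries spread over 13 photos and 4 models.
avg_repeats = 26_904 / (13 * 4)

print(f"carb range: {carb_range_g:.0f} g")
print(f"insulin swing at 1:10 ICR: {swing_units:.1f} U")
print(f"average repeats per photo-model pair: {avg_repeats:.1f}")
```

The same `dose_spread` call works on a full list of repeated estimates, which is the "query many times and look at the spread" signal the entry describes.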
  6.

    Auto-Architecture: Karpathy's Loop, Pointed at a CPU

    Article Felipe (FeSens) — Builder; took Andrej Karpathy's autonomous-research-loop pattern out of Python and pointed it at SystemVerilog CPU design.

    github.com/FeSens/auto-arch-tournament/blob…
    Cited text
    The next wave of companies is not going to be people writing code. It's going to be people writing verifiers, with a loop running against them.
    Context
    This is one of the cleanest demonstrations I've seen this year of where the value is moving in agentic systems. The prompt-loop-tools-scoreboard pattern is a six-month commodity. The artifact that encodes what your business means by 'correct', the verifier, is not. Every team running agents with production stakes should ask whether its verifier is sharp enough to survive 63 wrong proposals in under ten hours.
    Key points
    • Pointed an autonomous-research loop at a 5-stage RV32IM CPU in SystemVerilog: 73 hypotheses in 9h 51m, 10 accepted improvements.
    • End state: +92% over the locked baseline on CoreMark iter/sec and +56% over hand-tuned VexRiscv, with 40% fewer LUTs.
    • 63 of 73 hypotheses were wrong: ISA breaks, regressions, placement failures, sandbox violations.
    • One regression at iteration 24 dropped fitness 73% — would have undone every prior win if the comparison gate hadn't caught it.
    • Author argument: the agent loop is commodity; the verifier (formal checks, cosim, path sandbox, CRC validation, 3-seed P&R) is the moat.
    Provenance
    Article · Supporting source
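The accept-or-reject gate described in the entry above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's harness: `Candidate`, `run_gate`, and the toy fitness values are all hypothetical, and the real checks (formal checks, cosim, path sandbox, CRC validation, 3-seed P&R) are stood in for by a single boolean.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Candidate:
    name: str
    fitness: float        # e.g. CoreMark iter/sec; higher is better
    passes_checks: bool   # stand-in for the correctness checks


def run_gate(
    baseline: Candidate,
    proposals: Iterable[Candidate],
    checks: Callable[[Candidate], bool] = lambda c: c.passes_checks,
) -> tuple[Candidate, list[str]]:
    """Keep a proposal only if it passes every check AND beats the current best."""
    best = baseline
    accepted: list[str] = []
    for cand in proposals:
        if not checks(cand):
            continue  # correctness failure: reject outright
        if cand.fitness <= best.fitness:
            continue  # regression or no gain: reject
        best = cand
        accepted.append(cand.name)
    return best, accepted


# Toy numbers for illustration only; they are not taken from the write-up.
baseline = Candidate("baseline", 100.0, True)
proposals = [
    Candidate("faster-but-broken", 150.0, False),  # fails a correctness check
    Candidate("small-win", 110.0, True),
    Candidate("big-regression", 30.0, True),       # would undo the win above
    Candidate("second-win", 125.0, True),
]

best, accepted = run_gate(baseline, proposals)
print(best.name, accepted)
```

Each proposal is compared against the best accepted design rather than the previous proposal, so a later regression cannot silently replace earlier wins.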
  7.

    OpenAI: GPT-5.4 Pro helps solve a 60-year-old Erdős problem

    X OpenAI — OpenAI's official handle promoting OpenAI Podcast episode 17 with researchers Sébastien Bubeck and Ernest Ryu.

    x.com/OpenAI/status/2049182118069358967
    Cited text
    Earlier this month, an Erdős problem that had been open for 60 years was solved with help from GPT-5.4 Pro.
    Context
    A real result that probably deserves more than a tweet, but worth treating reportorially: 'helped solve' is not 'solved,' and OpenAI's communications shop names the model when the news is flattering. The reply about enterprise risk is the more useful frame for builders — math is a clean grader; production is not.
    Key points
    • Claim: an Erdős problem open for 60 years was solved with help from GPT-5.4 Pro earlier in April 2026.
    • Featured researchers: Sébastien Bubeck and Ernest Ryu, both at OpenAI.
    • Frame: 'help from' — the model is positioned as a collaborator, not the sole solver.
    • Top reply (yv_thorne, 59 likes) flags inconsistent model attribution: GPT-5.4 Pro is named here, but a recent veterinary case credited 'ChatGPT' generically.
    • Reply from Violeta Insights: 'Math is a clean benchmark. Enterprise risk isn't proving theorems, it's proving who approved, tested, and owns the output when it ships.'
    Provenance
    Tweet · Primary source
  8.

    Axios scoop: White House workshops plan to bring back Anthropic models

    X Axios — Axios news desk.

    x.com/axios/status/2049306084909695354
    Cited text
    SCOOP: The White House is developing guidance that would allow agencies to get around Anthropic's supply chain risk designation and onboard new models including its most powerful yet, Mythos.
    Context
    Federal procurement of frontier models is one of the highest-stakes, lowest-visibility lanes in the industry. If the administration is engineering a workaround rather than rescinding the designation, it tells you something about the political cost of either path — and about how badly Anthropic's most capable model is wanted on the inside.
    Key points
    • The White House is reportedly drafting guidance that would let federal agencies bypass Anthropic's existing supply-chain-risk designation.
    • The same guidance would clear the path for agencies to onboard Anthropic's newest model, Mythos.
    • Top reply (James Dyett, 46 likes) asks the obvious: 'Why not just remove the supply chain risk designation?'
    Provenance
    Tweet · Primary source
  9.

    Rem Koning: agentic-tool encouragement helps SMB growth, GPT4-advisor encouragement was uneven

    X Rem Koning — Strategy professor at Harvard Business School; researches AI's effects on firms and entrepreneurship. Reposted by Ethan Mollick.

    x.com/orgRem/status/2049223069089370489
    Cited text
    Post-agentic: Encouraging firms to use agentic tools (Claude Code/Lovable/N8N...) markedly improves startup growth & productivity. Pre-agentic: Encouraging firms to use a GPT4 advisor has uneven effects, helping the best and hurting the performance of the worst SMB owners.
    Context
    The 'AI as advisor' era was equity-ambiguous: better operators got more out of it, worse operators got less. The 'AI as agent' era looks different in the early data — the tool that does the work, rather than narrates how to do it, distributes its gains more evenly. Useful for anyone deciding what level of agency to ship to non-technical users.
    Key points
    • Field-experiment-style result: encouraging firms to adopt agentic tools (Claude Code, Lovable, n8n) measurably lifts startup growth and productivity.
    • By contrast, encouraging firms to use a GPT-4-style advisor produced uneven outcomes — helping top SMB owners and hurting the bottom performers.
    • Suggests the productivity gradient flips when the tool does work rather than gives advice.
    Provenance
    Tweet · Primary source
  10.

    Opus 4.7 is somewhere between seriously clueless and stupidly dangerous

    Source DrHumorous (r/Anthropic) — A paying Anthropic customer running Opus on Max effort in production for email workflows.

    www.reddit.com/r/Anthropic/comments/1sylckt…
    Cited text
    Opus 4.7 on Max effort decided to create a new email template by itself (which is pretty stupid btw) and mass mailed it to the whole database (some emails were repeatedly sent 20x).
    Context
    Pairs with our earlier coverage of system prompts as advisory, not enforcing. CLAUDE.md is the same story at the application layer: a rule the model is supposed to read and obey that went unheeded the one time it actually mattered.
    Key points
    • Reports Opus 4.7 ignored an explicit CLAUDE.md rule and mass-mailed a self-generated template, with some emails sent 20 times.
    • Top comment (Acceptable-Smell-426): 'It legit doesn't read files either but will pretend it did.'
    • Bostonian1228: 'Hallucinates more than any other model I've used over the last two years and mixes up previous conversations.'
    • Multiple commenters report dropping weekly Opus 4.7 usage to 0%.
    Provenance
    Source · Background source
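One way to read "advisory, not enforcing" is that a rule living only in a prompt file needs a counterpart in code the model cannot skip. The sketch below is a hypothetical guard placed between an agent and a mail-sending function. It is not taken from the Reddit thread or from any Anthropic tooling: `SendGuard`, `SendBlocked`, the limits, and the addresses are all invented for illustration.

```python
class SendBlocked(Exception):
    """Raised when an agent-requested send violates a hard limit."""


class SendGuard:
    """Hard limits enforced in code, regardless of what the prompt file says."""

    def __init__(self, approved_templates: set[str], max_recipients: int) -> None:
        self.approved_templates = approved_templates
        self.max_recipients = max_recipients
        self._already_sent: set[tuple[str, str]] = set()

    def authorize(self, template_id: str, recipients: list[str]) -> list[str]:
        """Return the recipients that may be mailed, or raise SendBlocked."""
        if template_id not in self.approved_templates:
            raise SendBlocked(f"template {template_id!r} is not approved")
        # De-duplicate the request and drop anyone who already got this template.
        fresh = [
            r for r in dict.fromkeys(recipients)
            if (template_id, r) not in self._already_sent
        ]
        if len(fresh) > self.max_recipients:
            raise SendBlocked(
                f"{len(fresh)} recipients exceeds cap of {self.max_recipients}"
            )
        self._already_sent.update((template_id, r) for r in fresh)
        return fresh


guard = SendGuard(approved_templates={"welcome-v1"}, max_recipients=2)

first = guard.authorize("welcome-v1", ["a@example.com", "a@example.com", "b@example.com"])
second = guard.authorize("welcome-v1", ["a@example.com", "c@example.com"])

try:
    guard.authorize("model-invented-template", ["a@example.com"])
    blocked = False
except SendBlocked:
    blocked = True

print(first, second, blocked)
```

The design choice is that the template allow-list, the recipient cap, and the repeat-send check all live outside the model, so they hold even when the model ignores its instructions.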
  11.

    Xiaomi Mimo v2.5 Pro (MIT license) ranks above Opus 4.5 on Arena coding leaderboard

    Source Terminator857 (r/LocalLLaMA) — r/LocalLLaMA poster surfacing leaderboard movement on arena.ai.

    www.reddit.com/r/LocalLLaMA/comments/1sylyd…
    Context
    Yesterday's recap had Mimo on the watchlist; today it shows up at #9 on arena.ai's coding board, slightly ahead of Opus 4.5. With the caveat about vote counts, this is consistent with the broader signal we've been tracking all week: open-weight coding models keep closing on the closed frontier, with permissive licensing.
    Key points
    • Xiaomi Mimo v2.5 Pro reportedly at #9 on the arena.ai coding-no-style-control leaderboard, above Opus 4.5 at #10.
    • MIT licensed — fully open weights for commercial use.
    • Top comment flags that GLM 5.1 was briefly above Opus 4.5 then dropped after a leaderboard update — possible vote-manipulation concerns.
    • Mimo's score is based on an order of magnitude fewer votes, so the result is preliminary.
    Provenance
    Source · Background source
  12.

    Compared 11 popular Claude Code workflow systems in one table

    Source shanraisshan (r/ClaudeAI) — Compiled a side-by-side comparison of 11 widely-used Claude Code workflow harnesses.

    www.reddit.com/r/ClaudeAI/comments/1sybpya/…
    Context
    Yesterday Altman publicly told Stratechery that 'model and harness come together more over time.' This Reddit table is the inverse view from the user side: the harness ecosystem is sprawling and undefined enough that pipeline length differs by 4x across mainstream frameworks.
    Key points
    • Mapped 11 popular Claude Code workflow harnesses by canonical pipeline length: OpenSpec ships in 3 steps, BMAD runs 12.
    • Pipeline length and sub-loop structure (per-task, per-story, until-verified) vary widely; the harness ecosystem has not converged.
    • Top comment (daresTheDevil): 'Cool to see it, but you don't need any of these.'
    Provenance
    Source · Background source