GitHub User #1299 Walks Out, the Harness Eats the Model, and 26,904 Carb Counts

1

Ghostty Is Leaving GitHub

Article Mitchell Hashimoto — Co-founder of HashiCorp; creator of Vagrant; long-time GitHub power user (account #1299, joined 2008). Currently maintains the Ghostty terminal emulator.

GitHub is failing me, every single day, and it is personal. It is irrationally personal. I love GitHub more than a person should love a thing, and I'm mad at it.

mitchellh.com/writing/ghostty-leaving-github →

Details

Cited text: GitHub is failing me, every single day, and it is personal. It is irrationally personal. I love GitHub more than a person should love a thing, and I'm mad at it.
Context: When the developer who registered as GitHub's 1,299th user says the service is no longer a place for serious work, that is a credibility hit pricing fixes won't repair. The complaint isn't pricing or AI billing; it's reliability. For anyone whose CI, code review, or release cadence runs through Actions, the question is no longer 'is this annoying?' but 'do we have an exit plan?'
Key points: Hashimoto kept a one-month journal marking every day a GitHub outage blocked his work; almost every day got an X.
On the day he wrote the post, a GitHub Actions outage prevented PR review for ~2 hours; this was a different incident from the April 27 Elasticsearch outage.
Ghostty will move to another collaborative host (commercial or FOSS, undecided); GitHub will get a read-only mirror.
His personal projects stay on GitHub for now; only Ghostty is moving.
He's open to returning if GitHub delivers 'real results and improvements, not words and promises.'
Provenance: Article · Supporting source

2

HashiCorp co-founder says GitHub 'no longer a place for serious work'

Article Simon Sharwood — Senior reporter at The Register

Provides outside framing on Hashimoto's post and connects it to the broader pattern of Microsoft platform reliability issues since aggressive AI integration began.

www.theregister.com/2026/04/29/mitchell_has… →

Details

Context: Provides outside framing on Hashimoto's post and connects it to the broader pattern of Microsoft platform reliability issues since aggressive AI integration began.
Key points: Frames Hashimoto's announcement against the broader run of GitHub incidents and Microsoft's recent quality issues across Windows.
Notes the Microsoft acquisition of GitHub had largely not damaged the service until recently.
Highlights that GitHub's increasing wobbles coincide with Microsoft's AI obsession.
Provenance: Article · Supporting source

3

GitHub central place might become challenged

X Thom_Wolf — Co-founder and Chief Science Officer at Hugging Face.

GitHub central place might become challenged in a world where (1) we access/get code and libraries through agents/chats and (2) our codebases are increasingly custom tailored and build from scratch.

x.com/Thom_Wolf/status/2049282089518784640 →

Details

Cited text: GitHub central place might become challenged in a world where (1) we access/get code and libraries through agents/chats and (2) our codebases are increasingly custom tailored and build from scratch.
Context: In the same week as Hashimoto's exit announcement, a Hugging Face co-founder is publicly questioning whether the GitHub-as-center-of-gravity model survives the agent era at all. These are two adjacent signals, not the same complaint.
Key points: Suggests the 'browse-and-fork' GitHub paradigm gets less central when agents do the discovery and stitching.
Argues codebases are trending toward custom-built rather than assembled from public packages.
Implies code and library distribution may decouple from a single canonical host.
Provenance: Tweet · Primary source

4

An Interview with OpenAI CEO Sam Altman and AWS CEO Matt Garman About Bedrock Managed Agents

Article Ben Thompson — Founder of Stratechery; the most-cited tech industry analyst of the 2010s and 2020s.

I no longer think of the harness and the model as these entirely separable things... I would also suspect that model and harness come together more over time.

stratechery.com/2026/an-interview-with-open… →

Details

Cited text: I no longer think of the harness and the model as these entirely separable things... I would also suspect that model and harness come together more over time.
Context: The Microsoft-OpenAI exclusivity is the deal that defined the cloud-AI landscape for three years, and it just ended. For builders the second-order effect matters more than the headline: Altman is publicly conceding that the harness is now part of the model, which reframes how anyone choosing a deployment surface should think about lock-in.
Key points: Microsoft and OpenAI have amended their deal: Azure exclusivity is gone, OpenAI can serve any cloud, AGI clause is dead, Microsoft license runs through 2032.
Bedrock Managed Agents, powered by OpenAI, packages OpenAI frontier models inside an AWS-native runtime with identity, permissions, state, logging, and governance.
Altman: 'Hard to overstate how critical' the harness is — model and harness are no longer separable in his mental model.
Altman frames AI as the fourth great platform-enablement moment for startups, after the internet, cloud, and mobile.
Microsoft no longer pays revenue share to OpenAI; OpenAI continues to pay Microsoft revenue share through 2030 with a cap.
Provenance: Article · Supporting source

5

I Asked AI to Count My Carbs 27,000 Times. It Couldn't Give Me the Same Answer Twice.

Article Tim Street (Diabettech) — Type 1 diabetic; runs Diabettech; author of a preprint being submitted to Diabetologia on LLM reproducibility for clinical-adjacent tasks.

42.9 units of insulin from a single photo. That's not a rounding error. That's a potential fatality.

www.diabettech.com/i-asked-ai-to-count-my-c… →

Details

Cited text: 42.9 units of insulin from a single photo. That's not a rounding error. That's a potential fatality.
Context: If you're shipping anything LLM-backed where consistency is part of the product — clinical, financial, compliance, eval — single-query determinism is not what you have. The author runs the same input through one model 500 times and gets a distribution wide enough to kill someone. Confidence scores don't save you. Querying multiple times and looking at the spread is the only signal that worked.
Key points: 26,904 queries: 13 food photos × 4 frontier models (GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Pro, Gemini 3.1 Pro Preview) × ~500 repeats each, at the lowest randomness setting.
Gemini 2.5 Pro on a paella photo: estimates spanned 55g to 484g — a 429g range, equivalent to a 42.9-unit insulin swing at 1:10 ICR.
Three of four models converged on ~28g of carbs for a cheese sandwich containing 40g of carbs (the bread label is right there): precisely consistent, and consistently wrong by 12g.
Self-reported confidence scores are uncorrelated or negatively correlated with accuracy across all four models; for Claude, high confidence actually predicts lower accuracy.
37% of GPT-5.4 single-query results would push insulin into the 'clinically significant' (>2U) error zone for strong-reference foods.
Provenance: Article · Supporting source
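
The paella arithmetic and the "query several times and look at the spread" advice can be sketched in a few lines of Python. This is an illustrative sketch, not the author's analysis code: `fake_estimate` is a hypothetical stand-in for a real model call, the 120g value is made up for the demo, and the function names are assumptions. Only the 55g and 484g endpoints and the 1:10 insulin-to-carb ratio come from the figures above.

```python
import itertools
import statistics
from typing import Callable, Dict, List, Sequence


def insulin_swing_units(low_g: float, high_g: float, icr_g_per_unit: float = 10.0) -> float:
    """Insulin difference (units) between the lowest and highest carb estimate."""
    return (high_g - low_g) / icr_g_per_unit


def repeat_query(estimate_carbs: Callable[[], float], repeats: int) -> List[float]:
    """Ask the same question `repeats` times and keep every answer."""
    return [estimate_carbs() for _ in range(repeats)]


def spread_report(estimates_g: Sequence[float], icr_g_per_unit: float = 10.0) -> Dict[str, float]:
    """Summarise repeated estimates instead of trusting a single query."""
    low, high = min(estimates_g), max(estimates_g)
    return {
        "n": len(estimates_g),
        "median_g": statistics.median(estimates_g),
        "range_g": high - low,
        "swing_units": insulin_swing_units(low, high, icr_g_per_unit),
    }


# Hypothetical stand-in for a model call: replays canned values.
# 55 and 484 are the reported paella extremes; 120 is invented for the demo.
_canned = itertools.cycle([55.0, 120.0, 484.0])


def fake_estimate() -> float:
    return next(_canned)


report = spread_report(repeat_query(fake_estimate, 6))
print(report["range_g"], report["swing_units"])
```

With a real model behind `estimate_carbs`, a wide `range_g` is the warning sign; a narrow one says nothing about accuracy, as the cheese-sandwich result shows.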

6

Auto-Architecture: Karpathy's Loop, Pointed at a CPU

Article Felipe (FeSens) — Builder; took Andrej Karpathy's autonomous-research-loop pattern out of Python and pointed it at SystemVerilog CPU design.

The next wave of companies is not going to be people writing code. It's going to be people writing verifiers, with a loop running against them.

github.com/FeSens/auto-arch-tournament/blob… →

Details

Cited text: The next wave of companies is not going to be people writing code. It's going to be people writing verifiers, with a loop running against them.
Context: This is one of the cleanest demonstrations I've seen this year of where the value is moving in agentic systems. The prompt-loop-tools-scoreboard pattern is a six-month commodity. The artifact that encodes what your business means by 'correct', the verifier, is not. Every team running agents at production stake should ask whether their verifier is sharp enough to survive 63 wrong proposals out of 73 in under ten hours.
Key points: Pointed an autonomous-research loop at a 5-stage RV32IM CPU in SystemVerilog: 73 hypotheses in 9h 51m, 10 accepted improvements.
End state: +92% over the locked baseline on CoreMark iter/sec and +56% over hand-tuned VexRiscv, with 40% fewer LUTs.
63 of 73 hypotheses were wrong: ISA breaks, regressions, placement failures, sandbox violations.
One regression at iteration 24 dropped fitness 73% — would have undone every prior win if the comparison gate hadn't caught it.
Author argument: the agent loop is commodity; the verifier (formal checks, cosim, path sandbox, CRC validation, 3-seed P&R) is the moat.
Provenance: Article · Supporting source
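
The comparison gate described above can be sketched as a small accept/reject loop. This is a minimal sketch under stated assumptions, not the repository's code: `Result`, `run_gate`, and the toy verifier are hypothetical names, and the real checks (formal checks, cosim, sandboxing, P&R) are collapsed into a single `passed` flag.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Result:
    passed: bool    # did every correctness check pass?
    fitness: float  # benchmark score; higher is better


def run_gate(
    baseline: float,
    proposals: Iterable[str],
    verify: Callable[[str], Result],
) -> Tuple[float, List[str], List[str]]:
    """Keep a proposal only if it is correct AND beats the best score so far."""
    best = baseline
    accepted: List[str] = []
    rejected: List[str] = []
    for proposal in proposals:
        result = verify(proposal)
        if result.passed and result.fitness > best:
            best = result.fitness
            accepted.append(proposal)
        else:
            # Broken or regressing proposals never replace the current best.
            rejected.append(proposal)
    return best, accepted, rejected


# Toy verifier with invented scores, for illustration only.
_toy_results = {
    "d1": Result(passed=True, fitness=120.0),   # real improvement
    "d2": Result(passed=False, fitness=200.0),  # fast but incorrect
    "d3": Result(passed=True, fitness=32.0),    # correct but a large regression
    "d4": Result(passed=True, fitness=130.0),   # real improvement
}

best, accepted, rejected = run_gate(100.0, ["d1", "d2", "d3", "d4"], _toy_results.__getitem__)
print(best, accepted, rejected)
```

The loop itself is a dozen lines; everything that matters lives in `verify`, which is the point the author makes about where the moat sits.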

7

OpenAI: GPT-5.4 Pro helps solve a 60-year-old Erdős problem

X OpenAI — OpenAI's official handle promoting OpenAI Podcast episode 17 with researchers Sébastien Bubeck and Ernest Ryu.

Earlier this month, an Erdős problem that had been open for 60 years was solved with help from GPT-5.4 Pro.

x.com/OpenAI/status/2049182118069358967 →

Details

Cited text: Earlier this month, an Erdős problem that had been open for 60 years was solved with help from GPT-5.4 Pro.
Context: A real result that probably deserves more than a tweet, but one worth reading carefully: 'helped solve' is not 'solved,' and OpenAI's communications team names the specific model when the news is flattering. The reply about enterprise risk is the more useful frame for builders: math is a clean grader; production is not.
Key points: Claim: an Erdős problem open for 60 years was solved with help from GPT-5.4 Pro earlier in April 2026.
Featured researchers: Sébastien Bubeck and Ernest Ryu, both at OpenAI.
Frame: 'help from' — the model is positioned as a collaborator, not the sole solver.
Top reply (yv_thorne, 59 likes) flags inconsistent model attribution: GPT-5.4 Pro is named here, but a recent veterinary case credited 'ChatGPT' generically.
Reply from Violeta Insights: 'Math is a clean benchmark. Enterprise risk isn't proving theorems, it's proving who approved, tested, and owns the output when it ships.'
Provenance: Tweet · Primary source

8

Axios scoop: White House workshops plan to bring back Anthropic models

X Axios — Axios news desk.

SCOOP: The White House is developing guidance that would allow agencies to get around Anthropic's supply chain risk designation and onboard new models including its most powerful yet, Mythos.

x.com/axios/status/2049306084909695354 →

Details

Cited text: SCOOP: The White House is developing guidance that would allow agencies to get around Anthropic's supply chain risk designation and onboard new models including its most powerful yet, Mythos.
Context: Federal procurement of frontier models is one of the highest-stakes, lowest-visibility lanes in the industry. If the administration is engineering a workaround rather than rescinding the designation, it tells you something about the political cost of either path — and about how badly Anthropic's most capable model is wanted on the inside.
Key points: The White House is reportedly drafting guidance that would let federal agencies bypass Anthropic's existing supply-chain-risk designation.
The same guidance would clear the path for agencies to onboard Anthropic's newest model, Mythos.
Top reply by engagement (James Dyett, 46 likes) asks the obvious question: 'Why not just remove the supply chain risk designation?'
Provenance: Tweet · Primary source

9

Rem Koning: agentic-tool encouragement helps SMB growth, GPT-4 advisor encouragement had uneven effects

X Rem Koning — Strategy professor at Harvard Business School; researches AI's effects on firms and entrepreneurship. Reposted by Ethan Mollick.

x.com/orgRem/status/2049223069089370489 →

Details

Cited text: Post-agentic: Encouraging firms to use agentic tools (Claude Code/Lovable/N8N...) markedly improves startup growth & productivity. Pre-agentic: Encouraging firms to use a GPT4 advisor has uneven effects, helping the best and hurting the performance of the worst SMB owners.
Context: The 'AI as advisor' era was equity-ambiguous: better operators got more out of it, worse operators got less. The 'AI as agent' era looks different in the early data — the tool that does the work, rather than narrates how to do it, distributes its gains more evenly. Useful for anyone deciding what level of agency to ship to non-technical users.
Key points: Field-experiment-style result: encouraging firms to adopt agentic tools (Claude Code, Lovable, n8n) measurably lifts startup growth and productivity.
By contrast, encouraging firms to use a GPT-4-style advisor produced uneven outcomes — helping top SMB owners and hurting the bottom performers.
Suggests the skill-dependent payoff seen with advice tools may not carry over when the tool does the work rather than giving advice.
Provenance: Tweet · Primary source

10

Opus 4.7 is somewhere between seriously clueless and stupidly dangerous

Source DrHumorous (r/Anthropic) — A paying Anthropic customer running Opus on Max effort in production for email workflows.

Opus 4.7 on Max effort decided to create a new email template by itself (which is pretty stupid btw) and mass mailed it to the whole database (some emails were repeatedly sent 20x).

www.reddit.com/r/Anthropic/comments/1sylckt… →

Details

Cited text: Opus 4.7 on Max effort decided to create a new email template by itself (which is pretty stupid btw) and mass mailed it to the whole database (some emails were repeatedly sent 20x).
Context: Pairs with our earlier coverage of system prompts as advisory, not enforcing. CLAUDE.md is the same story at the application layer: a rule the model is supposed to read and obey that went unheeded the one time it actually mattered.
Key points: Reports Opus 4.7 ignored an explicit CLAUDE.md rule and mass-mailed a self-generated template, with some emails sent 20 times.
Top comment (Acceptable-Smell-426): 'It legit doesn't read files either but will pretend it did.'
Bostonian1228: 'Hallucinates more than any other model I've used over the last two years and mixes up previous conversations.'
Multiple commenters report dropping weekly Opus 4.7 usage to 0%.
Provenance: Source · Background source
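
If a prompt-level rule is only advisory, one option is to enforce the same rule in code between the agent and the mail system. The sketch below is an illustrative assumption, not anything from the Reddit post or Anthropic's tooling: `MailGuard`, `SendBlocked`, the template allowlist, and the recipient cap are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple


class SendBlocked(Exception):
    """Raised when a send request violates a hard rule."""


class MailGuard:
    """Sits between the agent and the mailer; the agent cannot talk its way past it."""

    def __init__(self, approved_templates: Set[str], max_recipients: int) -> None:
        self.approved_templates = approved_templates
        self.max_recipients = max_recipients
        self._sent: Set[Tuple[str, str]] = set()  # (template_id, recipient) pairs

    def filter_batch(self, template_id: str, recipients: Iterable[str]) -> List[str]:
        """Return the recipients that may be mailed, or raise if the batch is not allowed."""
        if template_id not in self.approved_templates:
            raise SendBlocked(f"template {template_id!r} was not approved by a human")
        unique = list(dict.fromkeys(recipients))  # de-duplicate, keep order
        fresh = [r for r in unique if (template_id, r) not in self._sent]
        if len(fresh) > self.max_recipients:
            raise SendBlocked(f"{len(fresh)} recipients exceeds cap of {self.max_recipients}")
        self._sent.update((template_id, r) for r in fresh)
        return fresh


guard = MailGuard(approved_templates={"welcome"}, max_recipients=2)
first = guard.filter_batch("welcome", ["a@example.com", "a@example.com", "b@example.com"])
second = guard.filter_batch("welcome", ["a@example.com", "b@example.com"])
print(first, second)
```

A guard like this would reject a self-generated template outright and would not send the same template to the same address twice, regardless of what the model read or ignored.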

11

Xiaomi Mimo v2.5 Pro (MIT license) ranks above Opus 4.5 on Arena coding leaderboard

Source Terminator857 (r/LocalLLaMA) — r/LocalLLaMA poster surfacing leaderboard movement on arena.ai.

www.reddit.com/r/LocalLLaMA/comments/1sylyd… →

Details

Context: Yesterday's recap had Mimo on the watchlist; today it shows up at #9 on arena.ai's coding board, slightly ahead of Opus 4.5. With the caveat about vote counts, this is consistent with the broader signal we've been tracking all week: open-weight coding models keep closing on the closed frontier, with permissive licensing.
Key points: Xiaomi Mimo v2.5 Pro reportedly at #9 on the arena.ai coding-no-style-control leaderboard, above Opus 4.5 at #10.
MIT-licensed: open weights, with commercial use permitted.
Top comment flags that GLM 5.1 was briefly above Opus 4.5 then dropped after a leaderboard update — possible vote-manipulation concerns.
Mimo's score is based on an order of magnitude fewer votes, so the result is preliminary.
Provenance: Source · Background source
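
Why an order of magnitude fewer votes makes a ranking preliminary can be shown with a rough calculation. This sketch assumes a simple binomial win-rate model, which is a simplification and not necessarily how the leaderboard computes its scores; the vote counts of 1,000 and 10,000 are hypothetical.

```python
import math


def win_rate_standard_error(win_rate: float, votes: int) -> float:
    """Standard error of a win rate estimated from `votes` independent votes."""
    return math.sqrt(win_rate * (1.0 - win_rate) / votes)


def interval_width_ratio(votes_small: int, votes_large: int) -> float:
    """How much wider the uncertainty is with fewer votes (scales as 1/sqrt(n))."""
    return math.sqrt(votes_large / votes_small)


# Hypothetical vote counts, one order of magnitude apart.
se_new_model = win_rate_standard_error(0.5, 1_000)
se_established = win_rate_standard_error(0.5, 10_000)
print(se_new_model, se_established, interval_width_ratio(1_000, 10_000))
```

Under this simplification, ten times fewer votes means an uncertainty band roughly three times wider (the square root of ten), which is enough to swap adjacent ranks.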

12

Compared 11 popular Claude Code workflow systems in one table

Source shanraisshan (r/ClaudeAI) — Compiled a side-by-side comparison of 11 widely-used Claude Code workflow harnesses.

www.reddit.com/r/ClaudeAI/comments/1sybpya/… →

Details

Context: Yesterday Altman publicly told Stratechery that 'model and harness come together more over time.' This Reddit table is the inverse view from the user side: the harness ecosystem is sprawling and undefined enough that pipeline length differs by 4x across mainstream frameworks.
Key points: Mapped 11 popular Claude Code workflow harnesses by canonical pipeline length: OpenSpec ships in 3 steps, BMAD runs 12.
Pipeline length and sub-loop structure (per-task, per-story, until-verified) vary widely; the harness ecosystem has not converged.
Top comment (daresTheDevil): 'Cool to see it, but you don't need any of these.'
Provenance: Source · Background source