Gowers fed an open question from a Mel Nathanson paper into ChatGPT 5.5 Pro and got a quadratic upper bound in 17 minutes. On a harder follow-up — tightening Isaac Rajagopal's exponential-in-r² bound — the model produced a polynomial-in-r result in under two hours, using k-dissociated sets in a way Rajagopal called original and clever. Gowers judges the work to be a perfectly reasonable chapter…
◆ Braid Daily · 2026-05-09
A Fields Medalist watches ChatGPT do real combinatorics
Gowers logs an LLM producing an original additive-combinatorics result. Mozilla's bug pipeline gets a deep follow-up. METR's eval suite…
The lead
The bar for new PhD problems just moved
A recent experience with ChatGPT 5.5 Pro
Timothy Gowers
A Fields Medalist documents, in his own voice, an LLM producing a non-trivial original mathematical idea — with Rajagopal, the author of the prior paper, certifying it as 'almost certainly correct' at the level of ideas. Gowers also flags the open practical question: arXiv refuses AI-written content, so where does work like this live?
“It is no longer enough that somebody asks a problem: it needs to be hard enough for an LLM not to be able to solve it.”
METR evaluated an early Claude Mythos Preview
r/singularity
METR estimates a 50%-time-horizon of at least 16 hours for Claude Mythos Preview, with a 95% CI from 8.5 to 55 hours. Only 5 of the 228 tasks in their suite are 16+ hours long, so the suite has hit its measurement ceiling and METR is publishing the result with the explicit caveat that they need to build harder tasks.
“We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.”
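For readers who want the arithmetic behind a "50%-time-horizon": a minimal sketch, assuming the horizon is read off a logistic curve of success probability against log2 of task length. The coefficients below are hypothetical, picked only so the curve crosses 50% at 16 hours; they are not METR's fitted values. The 5-of-228 figure is the one quoted above.

```python
import math


def success_probability(task_hours, intercept, slope):
    """Logistic model of success against log2 of task length in hours."""
    z = intercept + slope * math.log2(task_hours)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon(intercept, slope, p=0.5):
    """Task length in hours at which the curve crosses probability p."""
    logit_p = math.log(p / (1.0 - p))
    return 2.0 ** ((logit_p - intercept) / slope)


# Hypothetical coefficients (an assumption for illustration only).
INTERCEPT, SLOPE = 4.0, -1.0
horizon = time_horizon(INTERCEPT, SLOPE)

# Share of the suite at or above 16 hours, from the figures in the item.
long_task_share = 5 / 228
```

With only about 2% of tasks at or beyond the estimated horizon, the upper part of the curve is pinned down by very few data points, which is consistent with the wide upper end of the interval METR reports.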
AI is rewriting vulnerability disclosure
AI is breaking two vulnerability cultures
Jeff Kaufman
Kaufman argues both the Linux 'bugs are bugs' tradition and 90-day coordinated-disclosure embargoes assumed scarce attention — and AI removes that assumption. The recent ESP vulnerability was independently re-reported nine hours after Kim's report. His practical recommendation: shrink embargoes, and keep shrinking them. He also tested Gemini 3.1 Pro, ChatGPT-Thinking 5.5, and Claude Opus 4.7 on a raw Copy Fail diff; Gemini and GPT recognized the security fix, Claude did not.
“Embargoes can increase risk: they create a false sense of non-urgency and limit which actors can work to fix a flaw.”
The React2Shell story
lachlan.nz
A pentester's full walkthrough of CVE-2025-55182 in React Server Components' Flight wire format — including a thenable trick that hands an attacker chainable function calls, ultimately reaching Webpack's Module._load for RCE in React itself. The kind of bug that took a week of obsession plus deep JavaScript runtime knowledge, not a CI scanner.
“React's near-impeccable track record in security made the notion of finding a vulnerability like this seem ridiculous. The examples I gave above of vulnerable application code actually appeared inside React itself.”
DHH on Copilot's PR review hit ratio
@dhh
DHH reports Copilot's PR review going from 1-in-10 to 7-in-10 at finding real issues. That is notable because he is a frequent skeptic of AI tooling marketing and the praise is unprompted. His one complaint is about state, not capability: review tools that re-raise rejected concerns turn signal into ticket spam.
“Hit ratio went from 1/10 to 7/10. Impressive! (Just wish it would not re-raise concerns that have been given a 👎 once already).”
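The state problem is small enough to sketch. The following is a hypothetical illustration, not how Copilot works: it assumes each concern can be reduced to a (path, rule, message) triple and that a 👎 is stored as a fingerprint of that triple. All names here are invented for the example.

```python
import hashlib


def fingerprint(path, rule, message):
    """Stable key for a concern that ignores line numbers, which shift between pushes."""
    raw = f"{path}|{rule}|{message.strip().lower()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class ReviewState:
    """Remembers which concerns a human has already rejected."""

    def __init__(self):
        self.rejected = set()

    def reject(self, path, rule, message):
        """Record a thumbs-down so the same concern is not raised again."""
        self.rejected.add(fingerprint(path, rule, message))

    def filter_new(self, concerns):
        """Keep only concerns that have not been rejected before."""
        return [c for c in concerns if fingerprint(*c) not in self.rejected]


state = ReviewState()
state.reject("app/models/user.rb", "n-plus-one", "Possible N+1 query")

second_pass = [
    ("app/models/user.rb", "n-plus-one", "possible N+1 query "),
    ("app/models/user.rb", "nil-check", "Missing nil guard"),
]
surviving = state.filter_new(second_pass)
```

Here `surviving` keeps only the nil-guard concern. A real tool would need fuzzier matching than an exact hash, since a model rarely words the same concern identically twice.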
Alignment, formal methods, and the trust boundary
Teaching Claude why
Anthropic
Anthropic shows receipts on which shapes of alignment data work. Training on near-IID prompts with aligned answers cut blackmail from 22% to 15%. Adding deliberation about values to the responses dropped it to 3% on the same data. A 3M-token OOD 'difficult advice' set matched a 30M+ token in-distribution honeypot, making it 28x more efficient, and it generalizes better. Document training on the constitution plus fictional admirable-AI stories cut blackmail from 65% to 19%, and the effect persists through downstream RL.
“Training on examples where the assistant displays admirable reasoning for its aligned behavior works better than training on the aligned behavior alone.”
Can LLMs model real-world systems in TLA+?
ACM SIGOPS
SysMoBench grades LLM-generated TLA+ specs across 11 systems in four phases. Frontier LLMs cluster near 100% on syntax but only ~46% on conformance and ~41% on invariants — they recite the textbook protocol, not the implementation. Concrete failure: Claude Sonnet's ZooKeeper FLE spec uses set-union for recvVotes when the code uses a per-peer overwriting map. Specula, an agent built on Claude Code/Codex, scores full conformance — raw-LLM modeling and agent-driven modeling are very different beasts.
“What Claude produced was not a spec for Etcd. It was a spec from the appendix of the Raft paper.”
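The recvVotes failure is easy to see in miniature. The sketch below is an illustration in Python, not ZooKeeper's code or the benchmark's TLA+; the function names and the two-message trace are hypothetical. It contrasts a model that accumulates every vote in a set with one that keeps a single, overwritable slot per peer.

```python
def record_vote_set_union(recv_votes, peer, vote):
    """Textbook-style model: every vote ever received stays in the set."""
    return recv_votes | {(peer, vote)}


def record_vote_overwrite(recv_votes, peer, vote):
    """Implementation-style model: one slot per peer, newest vote wins."""
    updated = dict(recv_votes)
    updated[peer] = vote
    return updated


# Hypothetical trace: peer 2 first votes for candidate 1, then switches to 3.
trace = [(2, 1), (2, 3)]

as_set = frozenset()
as_map = {}
for peer, vote in trace:
    as_set = record_vote_set_union(as_set, peer, vote)
    as_map = record_vote_overwrite(as_map, peer, vote)

# The set model still holds the stale vote; the map model has dropped it.
stale_vote_in_set = (2, 1) in as_set
stale_vote_in_map = as_map.get(2) == 1
```

After the trace, the set holds two votes from peer 2 while the map holds one. Any quorum check written over the set can count a vote the implementation has already discarded, which is the kind of divergence a conformance phase is meant to catch.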
Codex can use Chrome directly on macOS and Windows
OpenAI
A new Codex Chrome extension runs against the user's real Chrome profile — same cookies, same sessions, same logged-in apps — and creates its own tab group rather than seizing the whole browser. With code execution available, Codex skips the screenshot-reason-click loop and scripts repetitive web work directly. The trust boundary just moved: an agent with full access to your authenticated browser is a different beast from a sandboxed browser-use agent.
“Same profile, same session, same cookies, same tabs, same logged-in apps.”
Open weights
Qwen 35B-A3B is very usable on 12GB of VRAM
r/LocalLLaMA
Qwen3.6-35B-A3B-MTP at IQ4_XS runs usably on a 3060 with -ncmoe tuning to keep enough MoE blocks on GPU, leaving room for 16k–32k context. The MTP draft heads stay native and continue to deliver speedup — practical follow-through on Wednesday's MTP-on-local-models thread.
“12GB VRAM feels like a very practical size for this model. It lets you keep enough MoE blocks on GPU that plain decoding becomes quite strong, while still leaving room for useful context sizes like 16k/32k.”
AI2 ships EMO with document-level routing
r/LocalLLaMA
AI2's EMO is a 1B-active, 14B-total MoE trained on 1T tokens, released open on Hugging Face. The interesting bit is that routing happens at document granularity — experts cluster around domains like health, news, and code rather than around token-level surface patterns. A clean experiment with implications for both interpretability and inference batching.
“Document-level routing. Experts cluster around domains like health, news, etc. instead of surface patterns.”
Companion episode
A Fields Medalist, a PhD chapter, and the week the bar moved
Two stories from the same family today: an LLM extending a published combinatorics paper, and Mozilla's pipeline finding a 20-year-old XSLT bug. Different fields, same tooling. Kaufman's piece is the one to read after both — the disclosure norms we inherited assumed attention was scarce, and that assumption is the next thing to give.