A Fields Medalist watches ChatGPT do real combinatorics

Braid Daily · 2026-05-09

Gowers logs an LLM producing an original additive-combinatorics result. Mozilla's bug pipeline gets a deep follow-up. METR's eval suite…

An LLM produced an original combinatorics result this week. The same family of tools is surfacing 15-year-old Firefox bugs.

The lead

1

Gowers fed an open question from a Mel Nathanson paper into ChatGPT 5.5 Pro and got a quadratic upper bound in 17 minutes. On a harder follow-up — tightening Isaac Rajagopal's exponential-in-r² bound — the model produced a polynomial-in-r result in under two hours, using k-dissociated sets in a way Rajagopal called original and clever. Gowers judges the work to be a perfectly reasonable chapter…

Read source

The bar for new PhD problems just moved

2

A recent experience with ChatGPT 5.5 Pro

Timothy Gowers

A Fields Medalist documents, in his own voice, an LLM producing a non-trivial original mathematical idea — with Rajagopal, the author of the prior paper, certifying it as 'almost certainly correct' at the level of ideas. Gowers also flags the open practical question: arXiv refuses AI-written content, so where does work like this live?

“It is no longer enough that somebody asks a problem: it needs to be hard enough for an LLM not to be able to solve it.”

Read source

METR evaluated an early Claude Mythos Preview

r/singularity

METR estimates a 50%-time-horizon of at least 16 hours for Claude Mythos Preview, with a 95% CI of 8.5 to 55 hours. Only 5 of the 228 tasks in the suite run 16 hours or longer, so the suite has hit its measurement ceiling. METR is publishing the result with the explicit caveat that it needs to build harder tasks.
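
For readers new to the metric: a 50%-time-horizon is the task length at which a model is expected to succeed half the time. Below is a minimal sketch, assuming success probability follows a logistic curve in log2 of task length. The function names and the coefficients are hypothetical, picked only so the arithmetic lands on 16 hours; they are not METR's fitted values.

```python
import math

def success_probability(task_hours: float, intercept: float, slope: float) -> float:
    """Assumed logistic model: P(success) as a function of log2(task length in hours)."""
    z = intercept + slope * math.log2(task_hours)
    return 1.0 / (1.0 + math.exp(-z))

def time_horizon(p: float, intercept: float, slope: float) -> float:
    """Task length (hours) at which the modelled success probability equals p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - intercept) / slope)

# Hypothetical coefficients, chosen so the 50% horizon is exactly 16 hours.
INTERCEPT, SLOPE = 4.0, -1.0  # negative slope: longer tasks are harder

horizon_50 = time_horizon(0.5, INTERCEPT, SLOPE)

# Share of the suite at or beyond 16 hours, using the figures quoted in the item.
long_task_share = 5 / 228
```

With roughly 2% of tasks at 16 hours or longer, a curve like this has very few data points near the horizon, which is consistent with METR reporting the figure as a lower bound.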

“We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.”

Read source

AI is rewriting vulnerability disclosure

3

AI is breaking two vulnerability cultures

Jeff Kaufman

Kaufman argues both the Linux 'bugs are bugs' tradition and 90-day coordinated-disclosure embargoes assumed scarce attention — and AI removes that assumption. The recent ESP vulnerability was independently re-reported nine hours after Kim's report. His practical recommendation: shrink embargoes, and keep shrinking them. He also tested Gemini 3.1 Pro, ChatGPT-Thinking 5.5, and Claude Opus 4.7 on a raw Copy Fail diff; Gemini and GPT recognized the security fix, Claude did not.

“Embargoes can increase risk: they create a false sense of non-urgency and limit which actors can work to fix a flaw.”

Read source

The React2Shell story

lachlan.nz

A pentester's full walkthrough of CVE-2025-55182 in React Server Components' Flight wire format — including a thenable trick that hands an attacker chainable function calls, ultimately reaching Webpack's Module._load for RCE in React itself. The kind of bug that took a week of obsession plus deep JavaScript runtime knowledge, not a CI scanner.

“React's near-impeccable track record in security made the notion of finding a vulnerability like this seem ridiculous. The examples I gave above of vulnerable application code actually appeared inside React itself.”

Read source

DHH on Copilot's PR review hit ratio

@dhh

DHH reports that Copilot's PR reviews went from finding real issues 1 time in 10 to 7 times in 10. Notable because he's a frequent skeptic of AI tooling marketing; this is unprompted praise. His one complaint is about state, not capability: review tools that re-raise rejected concerns turn signal into ticket spam.

“Hit ratio went from 1/10 to 7/10. Impressive! (Just wish it would not re-raise concerns that have been given a 👎 once already).”

Read source

Alignment, formal methods, and the trust boundary

3

Teaching Claude why

Anthropic

Anthropic shows receipts on which shapes of alignment data work. Training on near-IID prompts with aligned answers cut blackmail from 22% to 15%. Adding deliberation about values to the responses dropped it to 3% on the same data. A 3M-token OOD 'difficult advice' set matched a 30M+ token in-distribution honeypot: 28x more efficient, with better generalization. Document training on the constitution plus fictional admirable-AI stories cut blackmail from 65% to 19%, and the effect persists through downstream RL.

“Training on examples where the assistant displays admirable reasoning for its aligned behavior works better than training on the aligned behavior alone.”

Read source

Can LLMs model real-world systems in TLA+?

ACM SIGOPS

SysMoBench grades LLM-generated TLA+ specs across 11 systems in four phases. Frontier LLMs cluster near 100% on syntax but only ~46% on conformance and ~41% on invariants — they recite the textbook protocol, not the implementation. Concrete failure: Claude Sonnet's ZooKeeper FLE spec uses set-union for recvVotes when the code uses a per-peer overwriting map. Specula, an agent built on Claude Code/Codex, scores full conformance — raw-LLM modeling and agent-driven modeling are very different beasts.
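
The recvVotes mismatch is easy to see in miniature. The sketch below is illustrative Python, not TLA+ and not ZooKeeper's code; the function names, the vote shape, and the example messages are all hypothetical. It contrasts a set-union model, which keeps every vote ever received, with a per-peer map in which a newer vote from the same peer overwrites the older one.

```python
from typing import Dict, Set, Tuple

Vote = Tuple[int, int]  # (sender_id, proposed_leader_id); hypothetical shape

def recv_set_union(recv_votes: Set[Vote], vote: Vote) -> Set[Vote]:
    """Textbook-style model: accumulate every vote ever seen."""
    return recv_votes | {vote}

def recv_overwrite(recv_votes: Dict[int, int], vote: Vote) -> Dict[int, int]:
    """Per-peer map: the latest vote from a sender replaces its earlier one."""
    sender, proposed = vote
    updated = dict(recv_votes)
    updated[sender] = proposed
    return updated

# Peer 2 first votes for leader 1, then changes its vote to leader 3.
messages = [(2, 1), (2, 3)]

as_set: Set[Vote] = set()
as_map: Dict[int, int] = {}
for msg in messages:
    as_set = recv_set_union(as_set, msg)
    as_map = recv_overwrite(as_map, msg)

# The set model still counts peer 2's stale vote for leader 1; the map does not.
votes_for_1_in_set = sum(1 for _, leader in as_set if leader == 1)
votes_for_1_in_map = sum(1 for leader in as_map.values() if leader == 1)
```

In this toy run the set ends up holding both of peer 2's votes while the map holds only the latest, so a quorum check written against the set can count a vote the implementation has already discarded.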

“What Claude produced was not a spec for Etcd. It was a spec from the appendix of the Raft paper.”

Read source

Codex can use Chrome directly on macOS and Windows

OpenAI

A new Codex Chrome extension runs against the user's real Chrome profile — same cookies, same sessions, same logged-in apps — and creates its own tab group rather than seizing the whole browser. With code execution available, Codex skips the screenshot-reason-click loop and scripts repetitive web work directly. The trust boundary just moved: an agent with full access to your authenticated browser is a different beast from a sandboxed browser-use agent.

“Same profile, same session, same cookies, same tabs, same logged-in apps.”

Read source

Open weights

2

Qwen 35B-A3B is very usable on 12GB of VRAM

r/LocalLLaMA

Qwen3.6-35B-A3B-MTP at IQ4_XS runs usably on a 3060 with -ncmoe tuning to keep enough MoE blocks on GPU, leaving room for 16k–32k context. The MTP draft heads stay native and continue to deliver speedup — practical follow-through on Wednesday's MTP-on-local-models thread.

“12GB VRAM feels like a very practical size for this model. It lets you keep enough MoE blocks on GPU that plain decoding becomes quite strong, while still leaving room for useful context sizes like 16k/32k.”

Read source

AI2 ships EMO with document-level routing

r/LocalLLaMA

AI2's EMO is a 1B-active, 14B-total MoE trained on 1T tokens, released open on Hugging Face. The interesting bit is that routing happens at document granularity — experts cluster around domains like health, news, and code rather than around token-level surface patterns. A clean experiment with implications for both interpretability and inference batching.
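
To make the granularity difference concrete, here is a toy sketch in plain Python. It is not EMO's router: the scores, the mean-pooling step, and the top-1 choice are all assumptions made for illustration. Token-level routing picks an expert per token; document-level routing pools the scores and sends every token in the document to the same expert.

```python
from typing import List

def argmax(values: List[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def route_per_token(scores: List[List[float]]) -> List[int]:
    """Token-level routing: each token independently picks its top expert."""
    return [argmax(token_scores) for token_scores in scores]

def route_per_document(scores: List[List[float]]) -> List[int]:
    """Document-level routing (assumed mean pooling): one expert for all tokens."""
    n_experts = len(scores[0])
    pooled = [
        sum(token_scores[e] for token_scores in scores) / len(scores)
        for e in range(n_experts)
    ]
    expert = argmax(pooled)
    return [expert] * len(scores)

# Hypothetical router scores: 3 tokens in one document, 3 experts.
doc_scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.7, 0.2, 0.1],
]

token_level = route_per_token(doc_scores)
document_level = route_per_document(doc_scores)
```

In this toy case the token-level router splits the document across two experts, while the document-level router keeps it on one. That is the property that could matter for batching, since a whole document can be dispatched to a single expert.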

“Document-level routing. Experts cluster around domains like health, news, etc. instead of surface patterns.”

Read source

Companion episode

A Fields Medalist, a PhD chapter, and the week the bar moved

· 00:31:21

Two stories from the same family today: an LLM extending a published combinatorics paper, and Mozilla's pipeline finding a 20-year-old XSLT bug. Different fields, same tooling. Kaufman's piece is the one to read after both — the disclosure norms we inherited assumed attention was scarce, and that assumption is the next thing to give.