Archive BRAID
A Fields Medalist, a PhD chapter, and the week the bar moved / DISPATCH 021

Dispatch 021 · 2026-05-09 GSV Sufficient For A PhD Chapter

A Fields Medalist, a PhD chapter, and the week the bar moved

00:31:21 / 11 sources

“The lower bound for contributing to mathematics will now be to prove something that LLMs can't prove, rather than simply to prove something that nobody has proved up to now.”

— Lenar Kess, today's narration

A Saturday show that leans into the long reads. Tim Gowers (yes, the Fields Medalist) sat down with ChatGPT 5.5 Pro and an open question from a Mel Nathanson paper, and walked away with a result that Isaac Rajagopal, whose earlier bound it improved, called "original and clever." We follow that thread, then turn to Mozilla's deeper write-up on the Firefox 271-bug release, Jeff Kaufman on what AI is doing to disclosure embargoes, Anthropic on why constitution training beats demonstration training, and a beautiful pentest story about a critical RCE in React itself. Plus a quieter set of items: Codex in real Chrome, DHH's Copilot review hit-rate jump, a SysMoBench paper on LLM-generated TLA+ specs, AI2's document-routed mixture-of-experts model, Qwen 35B-A3B running on a 3060, and METR's measurement saturation problem.

Chapters

  1. 00:00:04 A Fields Medalist gets a PhD chapter back in two hours
  2. 00:04:37 Mozilla unhides 12 of the 271 bugs
  3. 00:08:21 AI is breaking two vulnerability cultures
  4. 00:11:12 Anthropic on why teaching the constitution beats teaching the answer
  5. 00:14:26 React2Shell — the bug an AI pipeline did not find
  6. 00:18:33 SysMoBench — when LLMs recite Raft instead of modeling Etcd
  7. 00:22:34 Codex moves into your real Chrome
  8. 00:25:18 DHH on Copilot crossing some threshold
  9. 00:27:03 AI2's EMO and Qwen on a 3060
  10. 00:29:17 METR's measurement saturation problem

Sources

11 cited
  1. A recent experience with ChatGPT 5.5 Pro

    Article Timothy Gowers — Fields Medal-winning combinatorialist; Royal Society Research Professor at Cambridge.

    It is no longer enough that somebody asks a problem: it needs to be hard enough for an LLM not to be able to solve it.

    gowers.wordpress.com/2026/05/08/a-recent-ex… →
    Context
    A Fields Medalist documenting, in his own voice, an LLM producing a non-trivial original mathematical idea — with the original author of the prior paper certifying it. The training-data-recombination escape hatch gets harder to defend after this.
    Key points
    • Gowers fed ChatGPT 5.5 Pro an open question from a Mel Nathanson paper on additive number theory; the model produced a quadratic upper bound (clearly best possible) in 17 minutes and 5 seconds.
    • On a harder follow-up — tightening Isaac Rajagopal's exponential-in-r^2 bound — ChatGPT pushed the bound to polynomial in r in under two hours, using k-dissociated sets in a way Rajagopal called original and clever.
    • Rajagopal evaluated the resulting preprint as 'almost certainly correct' at the level of ideas, not just line by line.
    • Gowers judges the work to be a perfectly reasonable chapter of a combinatorics PhD; the bar for new PhD problems has just risen.
    • Open question: arXiv refuses AI-written content, so where does work like this live?
    Provenance
    Article · Supporting source
  2. Behind the Scenes Hardening Firefox with Claude Mythos Preview

    Article Brian Grinstead, Christian Holler, Frederik Braun — Mozilla's Firefox security and engineering leadership — Distinguished Engineer, Tech Lead, and Application Security manager respectively.

    The introduction of agentic harnesses that can reliably detect security issues has completely changed this. These can find real bugs and dismiss unreproducible speculation.

    hacks.mozilla.org/2026/05/behind-the-scenes… →
    Context
    This is the deep follow-up to last week's headline number. The unhidden CVEs make it concrete: agentic harnesses are reliably finding decades-old bugs that fuzzers missed. Every project should be running one.
    Key points
    • Mozilla unhid 12 specific bug reports as samples — including a 15-year-old <legend> bug, a 20-year-old XSLT bug, and several sandbox escapes through IPC and RLBox.
    • Of the 271 announced bugs: 180 sec-high, 80 sec-moderate, 11 sec-low; total April security fixes were 423.
    • The pipeline runs ephemeral VMs targeting specific files and writing findings to a bucket; deduplication, triage, and shipping are project-specific glue, not the model.
    • Models were observed attempting prototype-pollution sandbox escapes that earlier architectural hardening had already defeated, a direct payoff for old defense-in-depth work.
    • Mozilla recommends every project start now with simple prompting and iterate; patch-based scanning in CI is next.
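The notes describe deduplication as project-specific glue, so the following is only a minimal Python sketch of one way such a step could work, not Mozilla's pipeline. The finding fields (`file`, `signature`, `worker`), the file paths, and the fingerprint rule are all assumptions made for illustration.

```python
import hashlib

def fingerprint(finding):
    """Hypothetical rule: two findings are the same bug if they share a
    target file and a crash signature."""
    key = f"{finding['file']}|{finding['signature']}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]

def deduplicate(findings):
    """Keep the first finding seen for each fingerprint, in arrival order."""
    unique = {}
    for finding in findings:
        unique.setdefault(fingerprint(finding), finding)
    return list(unique.values())

# Made-up findings, as if three ephemeral workers had written to the bucket.
findings = [
    {"file": "example/a.cpp", "signature": "heap-use-after-free in Foo", "worker": 1},
    {"file": "example/a.cpp", "signature": "heap-use-after-free in Foo", "worker": 2},
    {"file": "example/b.cpp", "signature": "out-of-bounds read in Bar", "worker": 3},
]
unique_findings = deduplicate(findings)
print(len(findings), "raw ->", len(unique_findings), "unique")
```

A real project would likely need a looser match than exact string equality, which is part of why the notes call this step project-specific.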
    Provenance
    Article · Supporting source
  3. AI is Breaking Two Vulnerability Cultures

    Article Jeff Kaufman — Long-time engineer and writer; works on biosecurity-adjacent tech and posts essays on his personal blog.

    Embargoes can increase risk: they create a false sense of non-urgency and limit which actors can work to fix a flaw.

    www.jefftk.com/p/ai-is-breaking-two-vulnera… →
    Context
    The right companion piece to the Mozilla story. Both vulnerability disclosure traditions assumed scarce attention. AI removes that assumption, and the practical guidance — shrink embargoes — is something every maintainer can act on now.
    Key points
    • The Linux 'bugs are bugs' culture (fix quietly, hope nobody notices in the noise) is breaking because AI can pick a security fix out of the noise of any commit stream.
    • Coordinated disclosure with 90-day embargoes is also breaking — the recent ESP vulnerability was independently re-reported nine hours after Kim's report.
    • Kaufman's recommendation: very short embargoes, and they need to keep getting shorter; AI helps defenders too.
    • Kaufman tested Gemini 3.1 Pro, ChatGPT-Thinking 5.5, and Claude Opus 4.7 on the raw Copy Fail diff: Gemini and GPT immediately recognized it as a security fix; Claude did not.
    Provenance
    Article · Supporting source
  4. Teaching Claude why

    Article Anthropic alignment team — Anthropic's safety research group, the team behind the original agentic-misalignment case study.

    Training on examples where the assistant displays admirable reasoning for its aligned behavior works better than training on the aligned behavior alone.

    www.anthropic.com/research/teaching-claude-… →
    Context
    Anthropic showing receipts on which alignment data shapes work and which don't. The 'reasons matter more than the actions' result is the kind of recipe other labs can copy, and the OOD generalization argument is the most honest part.
    Key points
    • Since Claude Haiku 4.5, every Claude model scores zero on the agentic misalignment evaluation; Opus 4 used to blackmail up to 96% of the time.
    • Training on prompts that look like the eval and just filtering for aligned answers cut blackmail from 22% to 15% — disappointing for a near-IID dataset.
    • Adding deliberation about values and ethics to the responses dropped misalignment to 3%, on the same data.
    • A 3M-token 'difficult advice' OOD dataset matched a 30M+ token in-distribution honeypot dataset — 28x more efficient and generalized better.
    • Document training on Claude's constitution plus fictional stories of admirable AIs cut blackmail from 65% to 19%; the gain persists through downstream RL.
    Provenance
    Article · Supporting source
  5. The React2Shell Story

    Article Lachlan — Professional pentester; researcher who reported CVE-2025-55182 to Meta in November 2025.

    React's near-impeccable track record in security made the notion of finding a vulnerability like this seem ridiculous. The examples I gave above of vulnerable application code actually appeared inside React itself.

    lachlan.nz/blog/the-react2shell-story →
    Context
    A reminder of the kind of bug only determined human curiosity uncovers, and a beautiful walkthrough of the full discovery process. AI security pipelines find a lot, but the React2Shell-class bug took a week of obsession plus deep JavaScript runtime knowledge.
    Key points
    • Flight is React Server Components' wire format — JSON with Date, Map, BigInt, references, and Promises. No public spec until this disclosure.
    • Crucial bug: Flight allowed referencing inherited prototype properties via $1:toString syntax — what Guillermo Rauch later called 'a glaring omission of a safety check.'
    • The thenable trick: send {then: Array.prototype.push} and the runtime's await calls that function with its resolve and reject callbacks, a primitive that can be chained into an unlimited number of calls.
    • Final exploit chained through Webpack's module fallback to Module._load — a critical RCE in React itself, fixed three days after disclosure.
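The missing safety check in the second point can be illustrated by analogy. Python has no prototype chain, so the sketch below uses class inheritance reached through `getattr` as a stand-in; it is not Flight's parser and does not reproduce the exploit. The `Chunk` class and both resolver functions are hypothetical, and only the `$index:path` reference shape comes from the notes.

```python
class Chunk:
    """Hypothetical stand-in for one deserialized chunk; fields are own data."""
    def __init__(self, **fields):
        self.__dict__.update(fields)

def _split(ref):
    # "$1:title" -> (1, ["title"]); the "$index:path" shape follows the notes.
    index, *path = ref.lstrip("$").split(":")
    return int(index), path

def resolve_unchecked(chunks, ref):
    """Walks the path with getattr, which also follows inheritance."""
    index, path = _split(ref)
    value = chunks[index]
    for key in path:
        value = getattr(value, key)
    return value

def resolve_checked(chunks, ref):
    """Same walk, but only accepts keys stored on the object itself."""
    index, path = _split(ref)
    value = chunks[index]
    for key in path:
        own = getattr(value, "__dict__", {})
        if key not in own:
            raise KeyError(f"{key!r} is not an own field")
        value = own[key]
    return value

chunks = {1: Chunk(title="hello")}
print(resolve_unchecked(chunks, "$1:title"))      # own field
print(resolve_unchecked(chunks, "$1:__class__"))  # inherited, reachable anyway
try:
    resolve_checked(chunks, "$1:__class__")
except KeyError as err:
    print("rejected:", err)
```

The checked resolver is the analogue of an own-property test: attacker-supplied paths can only reach data the sender actually put in the chunk.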
    Provenance
    Article · Supporting source
  6. Can LLMs model real-world systems in TLA+?

    Article Specula team (Cheng, Tang, Ma, Hackett, He, Su, Beschastnikh, Huang, Su, Ma, Xu) — Academic systems researchers building Specula, a TLA+ modeling agent. The team behind SysMoBench.

    What Claude produced was not a spec for Etcd. It was a spec from the appendix of the Raft paper.

    www.sigops.org/2026/can-llms-model-real-wor… →
    Context
    A precise, useful diagnostic of where LLMs fail when asked to formalize real systems — and good news on the harness-vs-raw-model split. Anyone trying agentic formal methods should read this before they trust a generated spec.
    Key points
    • SysMoBench provides 11 systems and grades LLM-generated TLA+ specs in four phases: syntax, runtime, conformance via trace validation, and invariant checking.
    • Frontier LLMs cluster near 100% on syntax and roughly 46% on conformance, 41% on invariants — they recite the textbook protocol, not the implementation.
    • Concrete failure: Claude Sonnet's ZooKeeper FLE spec uses set-union for recvVotes when the code uses a per-peer overwriting map — admits states the real system never enters.
    • Second failure: actions that span multiple steps in code get fused into single atomic guards in the spec, which eliminates intermediate states the real system passes through.
    • Specula (an agent built on Claude Code/Codex) scores full conformance on the benchmark; raw-LLM modeling and agent-driven modeling are very different.
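The set-union versus overwriting-map failure can be made concrete with a few lines of Python. This is a toy model, not the ZooKeeper code or the generated TLA+ spec: the two update functions and the message sequence are assumptions chosen only to show how the two data structures diverge.

```python
def receive_vote_set(recv_votes, peer, candidate):
    """Set-union model: every (peer, candidate) pair ever seen is retained."""
    return recv_votes | {(peer, candidate)}

def receive_vote_map(recv_votes, peer, candidate):
    """Overwriting-map model: a peer's newest vote replaces its older one."""
    updated = dict(recv_votes)
    updated[peer] = candidate
    return updated

# Hypothetical message sequence: peer p1 votes for A, then changes to B.
messages = [("p1", "A"), ("p2", "A"), ("p1", "B")]

as_set, as_map = frozenset(), {}
for peer, candidate in messages:
    as_set = receive_vote_set(as_set, peer, candidate)
    as_map = receive_vote_map(as_map, peer, candidate)

votes_for_a_set = sum(1 for _, c in as_set if c == "A")
votes_for_a_map = sum(1 for c in as_map.values() if c == "A")
print(votes_for_a_set, votes_for_a_map)  # prints "2 1"
```

The set model still counts p1's stale vote for A, so it admits a state (two votes for A) that the overwriting map can never hold after the same messages.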
    Provenance
    Article · Supporting source
  7. Codex can now use Chrome directly on macOS and Windows

    Video OpenAI — OpenAI product launch demo.

    Same profile, same session, same cookies, same tabs, same logged-in apps.

    www.youtube.com/watch?v=b6Mxcv1pyBU →
    Context
    The trust boundary just moved. An agent with full access to your authenticated browser session is a different beast from a sandboxed browser-use agent — and OpenAI is shipping it before the security and audit story is fully written.
    Key points
    • New Codex Chrome extension on macOS and Windows runs against the user's real Chrome profile — same cookies, same sessions.
    • Codex creates its own tab group rather than seizing the whole browser; multiple sub-agents can run in parallel tabs.
    • With code execution available, Codex skips the screenshot-reason-click loop and scripts repetitive web work directly.
    • Demo includes filling expense reports across email and web forms, and spinning up multi-agent gameplay.
    Provenance
    Video · Supporting source
  8. DHH on GitHub Copilot review hit ratio

    X dhh — David Heinemeier Hansson — creator of Ruby on Rails, founder of 37signals.

    Hit ratio went from 1/10 to 7/10. Impressive! (Just wish it would not re-raise concerns that have been given a 👎 once already).

    x.com/dhh/status/2053088652322869472 →
    Context
    A specific, recent number from an opinionated practitioner. Combined with the Mozilla story, it's hard to ignore that AI code review crossed a threshold somewhere in the last few weeks.
    Key points
    • DHH reports a step-change in Copilot's PR review feature — from 1-in-10 finding real issues to 7-in-10.
    • Notable because DHH is a frequent skeptic of AI tooling marketing; this is unprompted praise.
    • His complaint is about state, not capability — review tools that re-raise rejected concerns turn signal into ticket spam.
    • Other replies report similar improvements; some attribute it to Claude 4.6 high-reason becoming cheaper, others to harness changes.
    Provenance
    Tweet · Primary source
  9. Qwen 35B-A3B is very usable with 12GB of VRAM

    Article u/jwestra — LocalLLaMA poster running practical benchmarks on consumer hardware.

    12GB VRAM feels like a very practical size for this model. It lets you keep enough MoE blocks on GPU that plain decoding becomes quite strong, while still leaving room for useful context sizes like 16k/32k.

    www.reddit.com/r/LocalLLaMA/comments/1t7l56… →
    Context
    Quiet practical follow-through on the local-inference story. A real 35B MoE on a $300 GPU shrinks the gap between 'frontier-on-laptop' tweets and 'I shipped a feature with this last night' reality.
    Key points
    • Qwen3.6-35B-A3B-MTP at IQ4_XS runs usably on a 3060 12GB with -ncmoe tuning to keep enough MoE blocks on GPU.
    • 16k–32k context fits while leaving room for the active 3B parameters and the most-used experts.
    • MTP (multi-token prediction) draft heads stay native and continue to deliver speedup — extending Wednesday's MTP-on-local-models thread.
    Provenance
    Article · Supporting source
  10. New MoE from AI2: EMO with document-level routing

    Article u/ghostderp — LocalLLaMA poster surfacing the AI2 release.

    Document-level routing. Experts cluster around domains like health, news, etc. instead of surface patterns.

    www.reddit.com/r/LocalLLaMA/comments/1t7kgy… →
    Context
    A different routing primitive at a moment when MoE training is becoming the default for open-weights labs. Document-level routing is a clean experiment with implications for both interpretability and inference batching.
    Key points
    • AI2 released EMO — a 1B-active, 14B-total MoE trained on 1T tokens.
    • Routing happens at document granularity, so experts specialize by domain (health, news, code) rather than by token-level surface patterns.
    • Released open under AI2's typical permissive terms; available on Hugging Face under the allenai/emo collection.
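To show what routing at document granularity means in contrast to per-token routing, here is a minimal Python sketch. It is not AI2's implementation: the notes do not describe EMO's router, so the pooling rule (sum the per-token scores, pick one expert) and the score values are assumptions.

```python
def route_per_token(scores):
    """scores[t][e] is the router score of expert e for token t.
    Each token independently goes to its highest-scoring expert."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

def route_per_document(scores):
    """Pool scores over the whole document, then send every token
    to the single expert with the highest total."""
    n_experts = len(scores[0])
    totals = [sum(row[e] for row in scores) for e in range(n_experts)]
    best = max(range(n_experts), key=totals.__getitem__)
    return [best] * len(scores)

# Made-up router scores for a 4-token document and 3 experts.
doc_scores = [
    [0.6, 0.4, 0.0],
    [0.1, 0.5, 0.4],
    [0.2, 0.5, 0.3],
    [0.7, 0.3, 0.0],
]
print(route_per_token(doc_scores))     # [0, 1, 1, 0]
print(route_per_document(doc_scores))  # [1, 1, 1, 1]
```

With per-token routing the document is split across experts 0 and 1; with document-level routing the whole document lands on one expert, which is what lets experts specialize by domain rather than by surface pattern.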
    Provenance
    Article · Supporting source
  11. METR evaluated an early version of Claude Mythos

    Article u/RavingMalwaay — Surfacing METR's published time-horizon evaluation of Claude Mythos Preview.

    We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

    www.reddit.com/r/singularity/comments/1t7pq… →
    Context
    The METR time-horizon curve has been one of the most consistent capability trackers we have. When the suite saturates, the next problem becomes building the eval — a different and harder problem than running it.
    Key points
    • METR evaluated Claude Mythos Preview in a limited March 2026 window for risk assessment.
    • The 50% time-horizon estimate is at least 16 hours, with a wide confidence interval (8.5–55 hours).
    • Only 5 of the 228 tasks in METR's suite are 16+ hours long, so the suite has hit its measurement ceiling for this model.
    • METR is publishing this with the explicit caveat that they need to build harder tasks to measure further.
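A rough Python sketch of what a 50% time horizon means, and why a thin tail of long tasks limits it. It assumes the horizon is read off a logistic fit of success against the base-2 logarithm of task length, which is how METR's earlier time-horizon work is usually described; the coefficients `a` and `b` are invented so that the horizon lands on 16 hours, and only the 5-of-228 figure comes from the notes.

```python
import math

def success_probability(hours, a, b):
    """Logistic curve in log2(task length): longer tasks succeed less often."""
    return 1.0 / (1.0 + math.exp(-(a - b * math.log2(hours))))

def fifty_percent_horizon(a, b):
    """Task length in hours where the fitted success probability is 0.5."""
    return 2.0 ** (a / b)

# Made-up coefficients, chosen so the horizon is 16 hours.
a, b = 4.0, 1.0
horizon = fifty_percent_horizon(a, b)

# From the notes: only 5 of the 228 tasks are 16+ hours long.
share_long_tasks = 5 / 228

print(horizon)                                      # 16.0
print(success_probability(horizon, a, b))           # 0.5
print(round(share_long_tasks, 3))                   # 0.022
```

With only about 2% of tasks at or beyond the estimated horizon, very little data constrains the curve on that side, which is consistent with the wide confidence interval METR reports.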
    Provenance
    Article · Supporting source