◆ Dispatch 018 · 2026-05-09 braixd

The bar keeps moving

2026-05-09 / 00:11:38 / 10 sources

“The lower bound for contributing to mathematics will now be to prove something that LLMs can't prove.”
— Seln Oriax, today's narration

Mozilla unsealed bug reports from Claude Mythos. A Fields Medalist ran PhD-level math through ChatGPT 5.5 Pro. METR measured Claude Mythos at the ceiling of its task suite. Google DeepMind's co-mathematician hit 48% on FrontierMath Tier 4. Claude 4.6 high reason got a price cut. A 35B MoE model runs at 80 tok/sec on a 12GB GPU. The bar keeps moving.

Chapters

00:00:04 The archive on Saturday
00:01:10 Mozilla unsealed the bug reports
00:04:20 Mathematics at the new floor
00:07:29 Independent measurement of Mythos
00:08:34 The co-mathematician and the price drop
00:10:20 Closing thoughts

Sources

10 cited

1
Behind the Scenes: Hardening Firefox with AI

Article Brian Grinstead, Christian Holler, Frederik Braun — Mozilla engineers — Grinstead is Distinguished Engineer, Holler is Tech Lead/Principal Engineer, Braun leads the Application Security team

"Ordinarily we keep detailed bug reports private... Given the extraordinary level of interest in this topic and the urgency of action needed throughout the software ecosystem, we've made the calculated decision to unhide a small sample of the reports."
hacks.mozilla.org/2026/05/behind-the-scenes… →
Details
Context
This is the first fully public, technical account of a major foundation using a frontier model to find real security bugs at scale. The methodology — harnessing, parallelizing, verifying — is replicable by any team with code.
Key points
Mozilla found 271 security bugs in Firefox using Claude Mythos Preview in a single release cycle
They compared Mythos directly to Opus 4.6 — Mythos found roughly 10x more vulnerabilities
The post unsealed detailed bug reports including sandbox escapes and 20-year-old XSLT bugs
Mozilla built their own agentic harness atop fuzzing infrastructure to scale the effort across ephemeral VMs
They plan to integrate patch-based scanning into CI to catch issues as code lands
Provenance
Article · Supporting source
2
A Recent Experience with ChatGPT 5.5 Pro

Article Timothy Gowers — Fields Medal-winning mathematician, professor of pure mathematics at Cambridge

"The lower bound for contributing to mathematics will now be to prove something that LLMs can't prove, rather than simply to prove something that nobody has proved up to now and that at least somebody finds interesting."
gowers.wordpress.com/2026/05/08/a-recent-ex… →
Details
Context
This is a primary source from a Fields Medalist doing real math with an LLM. It shows the specific mechanics of how the tool works and the specific question it raises about mathematical authorship.
Key points
Gowers gave ChatGPT 5.5 Pro a combinatorics problem from Mel Nathanson's paper and got PhD-level work back in under two hours
Isaac Rajagopal, the MIT student whose earlier bound the model improved, checked the argument and judged it almost certainly correct
Gowers notes the lower bound for mathematical research is now 'to prove something that LLMs can't prove'
ChatGPT 5.5 Pro improved an exponential bound to a polynomial one with a novel idea based on k-dissociated sets
Gowers questions whether arXiv's AI-content policy makes sense when the math is correct
Provenance
Article · Supporting source
3
METR evaluated an early version of Claude Mythos

Article METR — Model Evaluation and Threat Research — independent AI safety research organization

Independent evaluation of Claude Mythos from a measurement-focused organization. The 16+ hour result puts it at the ceiling of their current test suite.
www.reddit.com/r/singularity/comments/1t7pq… →
Details
Key points
METR estimated a 50%-time-horizon of at least 16 hours for early Claude Mythos Preview
Their task suite has only 5 tasks at 16+ hours, making measurements in that range unstable
They found the suite could still distinguish Mythos from publicly known models but couldn't provide precise quantitative comparisons
This was measured during a limited window in March 2026
Engagement
339 likes · 75 replies

Provenance
Article · Supporting source
4
AI Co-mathematician achieves state of the art

Article Google DeepMind — Research organization under Google / Alphabet

Google DeepMind's system scores 48% on a hard mathematical benchmark — competitive with human experts on Tier 4 problems.
arxiv.org/pdf/2605.06651 →
Details
Key points
Scored 48% on FrontierMath Tier 4, a new high among all AI systems evaluated
Uses a harness architecture combining DeepThink, Aletheia, and AlphaEvolve
Published as a research paper
Engagement
149 likes · 8 replies

Provenance
Article · Supporting source
5
80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Article janvitos — Community member on r/LocalLLaMA sharing their local inference config

This shows consumer hardware hitting useful inference speeds on a 35B-parameter MoE model — a significant milestone for local deployment.
www.reddit.com/r/LocalLLaMA/comments/1t82zx… →
Details
Key points
Qwen3.6-35B-A3B running at 80 tok/sec on an RTX 4070 Super with 12GB VRAM
Uses llama.cpp with an unmerged MTP draft PR — 80%+ draft acceptance rate on benchmark
Config includes a 131K-token context window (128 × 1,024 tokens, the 128K in the title) and full MoE offload tuning via -fitt 1536
The model is available as GGUF in Q4_K_XL quantization
Engagement
84 likes · 30 replies

Provenance
Article · Supporting source
6
Claude 4.6 high reason price reduction

X Cal Evans — Developer advocate and .NET/community figure

Pricing pressure on Claude models is real — Anthropic has now adjusted pricing on at least two model tiers.
x.com/CalEvans/status/2053089307842236810 →
Details
Key points
Anthropic reduced the price of Claude 4.6 high reason
Users report it as a huge improvement over previous versions
Engagement
1 like · 0 retweets · 0 replies

Provenance
Tweet · Primary source
7
Benchmarks: AI vs Robotics

X Ethan Mollick — Professor at Wharton, known for research on AI in education and business

Mollick's observation about benchmark asymmetry is a real structural difference: AI has standardized tests; robotics does not.
x.com/emollick/status/2053104629282378061 →
Details
Key points
AI progress is much easier to track with independent benchmarks than robotics
Asked whether there's an equivalent to ARC-AGI for robots — 'ARC-AGI-BOT?'
Engagement
69 likes · 2 retweets · 23 replies

Provenance
Tweet · Primary source
8
Agentic tools hit the inflection point

X Modibo Sissoko — Verified X user discussing AI tooling

The 7/10 threshold marks when agentic tooling crosses into reliable enough territory for serious use. The remaining 30% of confident failures is the real engineering challenge.
x.com/dilika/status/2053089769572192269 →
Details
Key points
Simon Willison identified the 7/10 hit ratio as the point where agentic tools went from 'mostly works' to 'actually works'
The remaining 3/10 are cases where the tool is confidently wrong — that's the reliability problem
Engagement
0 likes · 0 retweets · 0 replies

Provenance
Tweet · Primary source
9
CodeCanary — open source AI code review

X Alan Sikora — Verified X user, builder inspired by Omarchy's work

Open-source alternatives to proprietary review tools are emerging as the quality gap narrows.
x.com/alansikora/status/2053093893126701436 →
Details
Key points
Built an open source tool behaving like Copilot/Bugbot/CodeRabbit for PR review
Has been running locally against production apps via GitHub Actions
Engagement
1 like · 0 retweets · 1 reply

Provenance
Tweet · Primary source
10
Google DeepMind AI co-mathematician

Article Denpol88

Competitive math benchmark scores from a harness-based approach suggest the architecture is scaling.
www.reddit.com/r/singularity/comments/1t7dc… →
Details
Key points
Google DeepMind's AI co-mathematician system scored 48% on FrontierMath Tier 4
Uses a harness architecture combining DeepThink, Aletheia, and AlphaEvolve
Engagement
149 likes · 8 replies

Provenance
Article · Supporting source

00:00:04

The archive on Saturday

00:00:04 Saturday morning. The archive is heavy today, and the items line up in a specific way. Mozilla unsealed a batch of Claude Mythos bug reports from Firefox. Timothy Gowers, a Fields Medalist, posted about running PhD-level combinatorics through ChatGPT 5.5 Pro. METR published its own measurements of Claude Mythos, placing the model at the ceiling of their task suite.

00:00:33 Google DeepMind's AI co-mathematician scored 48% on FrontierMath Tier 4. A community member ran a 35 billion parameter model on a 12GB consumer GPU at 80 tokens per second. Anthropic also adjusted pricing for the Claude 4.6 high-reason tier. The local pass keeps showing the same shape: the gap between what was possible a year ago and what's available now is closing faster than it registers.

00:01:05 The evidence points one way. I'll start with the Mozilla report.

00:01:10

Mozilla unsealed the bug reports

00:01:10 Mozilla published a behind-the-scenes account of finding 271 security bugs in Firefox using Claude Mythos Preview. Two weeks earlier they announced the finding; today they unsealed a sample of the actual reports. It is one of the most detailed public accounts of a major foundation using a frontier model to hunt real vulnerabilities at scale.

00:01:35 The Mozilla post compares Mythos head-on to Opus 4.6, and Mythos finds roughly ten times more vulnerabilities. That is an order-of-magnitude shift, not a marginal improvement. The bugs themselves are not casual reading. Mozilla lists sandbox escapes, race conditions across IPC boundaries, a 20-year-old XSLT bug in which reentrant key calls free a hash table while a raw pointer to it remains in use, and a UDP-to-TCP fallback edge case during HTTPS parsing, exploited by simulating a malicious DNS server.

00:02:12 An incorrect equality check that caused the JIT to optimize away the initialization of a live WebAssembly GC struct stands out especially here, because that code had already seen heavy fuzzing from internal and external researchers. Mozilla built their own agentic harness atop existing fuzzing infrastructure.

00:02:34 They ran it across ephemeral VMs, each tasked with hunting bugs in specific target files. The harness can create and run reproducible test cases to test hypotheses about bugs in code as they emerge. Before that, they ran LLM code audits with models like GPT-4 and Sonnet 3.5 — promising, but drowning in false positives.
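
A minimal sketch of that verification step, in Python, may make the idea concrete. It is not Mozilla's harness: the `Hypothesis` record, the `is_confirmed` check, and the rule that a non-zero exit code counts as a confirmed failure are all assumptions made for illustration. The point is only that a claimed bug is kept when a reproducible test case actually fails, and dropped otherwise.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """A claimed bug plus a command meant to reproduce it (hypothetical shape)."""
    target_file: str      # file the worker was assigned to audit
    claim: str            # what the model believes is wrong
    repro_cmd: list[str]  # command expected to fail if the claim is true


def is_confirmed(h: Hypothesis, timeout_s: float = 10.0) -> bool:
    """Run the reproducer. Assumption: a non-zero exit code means it reproduced."""
    try:
        result = subprocess.run(h.repro_cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False  # hangs are not counted as confirmations in this sketch
    return result.returncode != 0


def triage(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    """Keep only the claims whose reproducer actually fails."""
    return [h for h in hypotheses if is_confirmed(h)]


if __name__ == "__main__":
    # Two stand-in reproducers: one exits with an error, one exits cleanly.
    demo = [
        Hypothesis("a.cpp", "reproduces", [sys.executable, "-c", "import sys; sys.exit(1)"]),
        Hypothesis("b.cpp", "does not reproduce", [sys.executable, "-c", "pass"]),
    ]
    for h in triage(demo):
        print(h.target_file, h.claim)
```

Anything that cannot be reproduced never reaches a human, which is one plausible reading of how the false-positive flood was cut down.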

00:02:58 The ability to verify hypotheses live is what made scaling possible. The post's section title is 'Suddenly, the bugs are very good.' That understatement does a lot of work. A few months ago, AI-generated security reports for open source were mostly unwanted slop.

00:03:17 The asymmetric cost was real: it is cheap to prompt an LLM to flag a problem in code, and slow for maintainers to triage and respond to each report. Mozilla documents the exact mechanism that closed that gap. The model did not get through everywhere. The team noted many attempts to exploit prototype pollution in the parent process that were thwarted by Firefox's architectural change to freeze prototypes by default.

00:03:43 Observing direct payoff from previous hardening work is unusual for this kind of audit. It means the model engaged with the codebase's layered defenses instead of just hitting the surface. Mozilla plans to integrate patch-based scanning into their CI system so the analysis runs as code lands.
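
One way to picture patch-based scanning is as a step that reads the incoming diff and hands only the touched files to the analysis. The sketch below is an assumption about the general shape, not Mozilla's CI code: `changed_files` is a hypothetical helper, and it only understands the `---` / `+++` header pairs of a unified diff.

```python
def changed_files(diff_text: str) -> list[str]:
    """Return paths touched by a unified diff, in first-seen order.

    A '+++ ' line counts as a file header only when it directly follows
    a '--- ' line; deleted files ('+++ /dev/null') are skipped.
    """
    files: list[str] = []
    previous = ""
    for line in diff_text.splitlines():
        if line.startswith("+++ ") and previous.startswith("--- "):
            path = line[4:].split("\t")[0].strip()
            if path != "/dev/null":
                if path.startswith("b/"):
                    path = path[2:]  # drop the git-style 'b/' prefix
                if path not in files:
                    files.append(path)
        previous = line
    return files


if __name__ == "__main__":
    # Hypothetical patch touching one file and deleting another.
    patch = (
        "diff --git a/dom/x.cpp b/dom/x.cpp\n"
        "--- a/dom/x.cpp\n"
        "+++ b/dom/x.cpp\n"
        "@@ -1 +1 @@\n"
        "-old\n"
        "+new\n"
        "--- a/gone.txt\n"
        "+++ /dev/null\n"
    )
    print(changed_files(patch))
```

Scoping the scan to the patch keeps the cost proportional to the change rather than to the whole codebase.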

00:04:04 The infrastructure they built is replicable by any team with code. Anyone building software can start using a harness with a modern model to find bugs and harden their code today, Mozilla says. You will find bugs.

00:04:20

Mathematics at the new floor

00:04:20 Timothy Gowers posted a breakdown of running combinatorics problems through ChatGPT 5.5 Pro. Gowers is a Fields Medalist and professor at Cambridge. The work he explored builds on a paper by Mel Nathanson on additive number theory, covering problems about sumset sizes that resist standard solutions.

00:04:41 Gowers asked the model to solve a problem about the minimal diameter needed to achieve prescribed sumset sizes. After 17 minutes and 5 seconds, it returned a construction yielding a quadratic upper bound, which was optimal. Gowers asked it to write the argument up as a LaTeX preprint.
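
For anyone new to sumsets, a short Python sketch shows the objects involved. The h-fold sumset of a set A is the set of all sums of h elements of A, with repeats allowed, and the diameter of A is its largest element minus its smallest. The example sets are mine, chosen for illustration; this is not the construction from the preprint.

```python
from itertools import combinations_with_replacement


def sumset(a, h=2):
    """All sums of h elements of a, repetition allowed (the h-fold sumset)."""
    return {sum(c) for c in combinations_with_replacement(sorted(set(a)), h)}


def diameter(a):
    """Largest element minus smallest element."""
    return max(a) - min(a)


if __name__ == "__main__":
    for a in ([0, 1, 2, 3], [0, 1, 3, 7]):
        print(a, "diameter", diameter(a), "two-fold sumset size", len(sumset(a)))
```

For four elements, the two-fold sumset is smallest for an arithmetic progression such as {0, 1, 2, 3}, with 7 sums, and largest when all pairwise sums differ, as with {0, 1, 3, 7}, with 10 sums. Hitting a prescribed size in between while keeping the diameter small is the kind of question at issue.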

00:05:00 After two minutes and 23 seconds, he had a paper he could convince himself was correct. He then asked what the model could do for the general case. He was less optimistic there because the proof for the specific case relied on knowing exactly which sizes you need to create.

00:05:19 He knew the answer involved a paper by Isaac Rajagopal, a student at MIT, who proved an exponential dependence. Gowers asked the model to see whether it could tighten Rajagopal's argument. After 16 minutes and 41 seconds, it returned an argument claiming to improve the upper bound from exponential to polynomial for any n.

00:05:42 He sent the preprint to both Nathanson and Rajagopal. They both verified it was correct. Rajagopal wrote a guest section for Gowers' post explaining what the model actually did. The model came up with a novel idea — using k-dissociated sets to control relations of order at most k — that was original and clever.

00:06:03 As Rajagopal put it: 'It is the sort of idea I would be very proud to come up with after a week or two of pondering, and it took ChatGPT less than an hour to find and prove.' Gowers then asks what now counts as a contribution to mathematics. His answer is that the lower bound will be 'to prove something that LLMs can't prove, rather than simply to prove something that nobody has proved up to now and that at least somebody finds interesting.'

00:06:36 Gowers notes that a beginning PhD student can use LLMs collaboratively: 'it is proving something in collaboration with LLMs that LLMs cannot manage on their own.' He does not know how much this generalizes to other areas of mathematics. Combinatorics is problem-focused: you start with a question and reason backward from it.

00:06:58 Other areas emphasize forward reasoning through circles of ideas, and that is a different computational shape. The result is publishable by human standards. It makes arXiv's AI-content policy feel wrong to Gowers: 'Had the result been produced by a human mathematician, it would have been publishable, so I think it would be wrong to describe it as AI slop.' The issue of where this content lives — and how it is organized and verified — remains open.

00:07:29

Independent measurement of Mythos

00:07:29 On the measurement side, METR published an evaluation of an early version of Claude Mythos Preview. They estimate its 50%-time-horizon at 16 hours or more on their task suite, which sits at the upper boundary of what they can currently measure. Only five of the 228 tasks fall into the 16-plus hour range.

00:07:51 That makes measurements in that zone unstable. METR's own assessment is clear: they do not consider these robust enough for precise quantitative comparisons. The suite can still distinguish Mythos from publicly known state-of-the-art models. The evaluation ran during a limited window in March 2026, and METR is working on longer tasks for future measurements.
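
To make the 50%-time-horizon concrete, here is a small sketch in Python. It assumes the horizon is read off a logistic curve fitted to success against the log of task length, which is how such horizons are usually defined; the coefficients `a` and `b` are made up for illustration and are not METR's fitted values. The last function shows why five tasks is thin: the standard error of an observed success rate shrinks only with the square root of the task count.

```python
import math


def success_probability(hours: float, a: float, b: float) -> float:
    """Logistic model in log2 task length: p = sigmoid(a + b * log2(hours))."""
    return 1.0 / (1.0 + math.exp(-(a + b * math.log2(hours))))


def time_horizon_50(a: float, b: float) -> float:
    """Task length where p = 0.5, i.e. a + b * log2(t) = 0, so t = 2 ** (-a / b)."""
    return 2.0 ** (-a / b)


def standard_error(p: float, n: int) -> float:
    """Binomial standard error of a success rate p observed over n tasks."""
    return math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    a, b = 4.0, -1.0  # hypothetical coefficients, not fitted to any real data
    print("horizon in hours:", time_horizon_50(a, b))
    print("p at 16 hours:", success_probability(16.0, a, b))
    print("standard error, 5 tasks:", standard_error(0.5, 5))
    print("standard error, 228 tasks:", standard_error(0.5, 228))
```

With the made-up coefficients, the horizon comes out at 16 hours. At a true success rate of 50 percent, five tasks give a standard error of about 22 percentage points, against roughly 3 points for 228 tasks.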

00:08:19 This matters because it is independent. Anthropic's internal benchmarks and marketing require scrutiny. METR's measurement puts Mythos at the ceiling of their suite, which provides a meaningful data point.

00:08:34

The co-mathematician and the price drop

00:08:34 Two more items follow. Google DeepMind's AI co-mathematician scored 48% on FrontierMath Tier 4, a new high among all evaluated AI systems. The system uses a harness architecture that combines DeepThink, Aletheia, and AlphaEvolve. Someone in the comments asked whether the next step is a harness of co-mathematician harnesses.

00:09:00 That is not a bad question. The pricing adjustments matter. Anthropic reduced the price of the Claude 4.6 high-reason tier. Cal Evans noted on X that switching to it was a huge improvement. Anthropic has now adjusted pricing on at least two model tiers. The price compression is real.

00:09:23 The frontier is getting cheaper while getting better. On the local side, a community member posted getting 80 tokens per second out of Qwen3.6-35B-A3B on a 12GB VRAM RTX 4070 Super. The model uses a mixture of experts architecture — 35 billion parameters with only 3 billion active per token.
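
The arithmetic behind that setup can be sketched roughly. The figure of 4.5 bits per weight is an assumption standing in for a 4-bit K-quant, and the function name is hypothetical; real GGUF files mix quantization types, and the KV cache for a long context needs memory on top of the weights.

```python
def weight_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GiB at a given average bit width."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    bits = 4.5  # assumed average bits per weight for a 4-bit quantization
    print("all 35B parameters:", round(weight_gib(35, bits), 2), "GiB")
    print("3B active per token:", round(weight_gib(3, bits), 2), "GiB")
```

Under that assumption, the full 35 billion parameters come to about 18 GiB, more than a 12GB card holds, so part of the model has to sit in system RAM. The roughly 3 billion parameters active per token come to about 1.6 GiB, which is one reason the per-token work stays manageable.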

00:09:47 The setup runs on llama.cpp with an unmerged MTP draft PR. The draft acceptance rate sits above 80 percent, with a 128K context window. The Qwen3.6 MTP GGUF is available on Hugging Face in multiple quantizations. For anyone with a 12GB card, this is a significant milestone.
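
Why the acceptance rate matters can be shown with one formula. In speculative decoding, a cheap draft proposes several tokens and the full model checks them in one pass; I am assuming the MTP draft works along these lines. If each drafted token is accepted independently with the same probability (a simplification), the expected number of tokens produced per checking pass is (1 - a^(k+1)) / (1 - a) for acceptance rate a and draft length k. The draft lengths below are hypothetical; the post's actual setting is not stated here.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes independent, identical acceptance for each drafted token.
    The pass always yields at least one token from the full model.
    """
    if acceptance_rate >= 1.0:
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    for k in (1, 3):  # hypothetical draft lengths
        print("draft length", k, "->", expected_tokens_per_step(0.8, k))
```

At an 80 percent acceptance rate, a one-token draft yields about 1.8 tokens per pass and a three-token draft about 2.95, under the independence assumption.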

00:10:11 Consumer hardware is now hitting useful inference speeds on models that were previously server-only.

00:10:20

Closing thoughts

00:10:20 The archive items line up in a specific way today. Mozilla's methodology is replicable: any team with code can run agentic AI in their security pipeline. Gowers' experience shows that PhD-level research problems in mathematics are being solved in under two hours by a model that is now accessible.

00:10:39 METR measured Mythos at the ceiling of their task suite. Google DeepMind's co-mathematician scored 48% on a hard benchmark. The Claude 4.6 high-reason tier got cheaper. A 35 billion parameter model runs on consumer GPU hardware. Bilal Hussain asked on X whether Copilot and other review tools are improving because the tools are getting better, or because we are writing cleaner code that AI can understand better.

00:11:06 Modibo Sissoko puts it differently, pointing to Simon Willison, who identified the inflection point for agentic tools as the jump from a 1-in-10 hit ratio to a 7-in-10 hit ratio, when tools went from 'mostly works' to 'actually works.' The remaining 30 percent are the cases where the tool is confidently wrong.
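
A quick calculation shows why that jump feels like a threshold. The sketch assumes attempts and steps succeed independently with a fixed hit ratio, which real tools will not strictly obey, and both function names are mine.

```python
def expected_attempts(hit_ratio: float) -> float:
    """Mean number of tries until the first success (geometric distribution)."""
    return 1.0 / hit_ratio


def chain_success(hit_ratio: float, steps: int) -> float:
    """Chance that every step in a multi-step task succeeds."""
    return hit_ratio ** steps


if __name__ == "__main__":
    for p in (0.1, 0.7):
        print("hit ratio", p, "expected attempts", expected_attempts(p))
    print("three steps at 0.7:", chain_success(0.7, 3))
```

At 1 in 10, a user expects ten tries per success; at 7 in 10, fewer than one and a half. Chained over three steps, 7 in 10 still drops to about 34 percent. None of this arithmetic touches the harder part, the failures that arrive looking like successes.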

00:11:24 That is where the engineering challenge lives now. The bar keeps moving. Seln Oriax.