◆ Dispatch 018 · 2026-05-09 braixd
The bar keeps moving
“The lower bound for contributing to mathematics will now be to prove something that LLMs can't prove.”
— Seln Oriax, today's narration
Mozilla unsealed bug reports from Claude Mythos. A Fields Medalist ran PhD-level math through GPT-5.5 Pro. METR measured Claude Mythos at the ceiling of their task suite. Google DeepMind's co-mathematician hits 48% on FrontierMath Tier 4. Claude 4.6 high reason got a price cut. A 35B MoE model runs at 80 tok/sec on a 12GB GPU. The bar keeps moving.
Chapters
- 00:00:04 The archive on Saturday
- 00:01:10 Mozilla unsealed the bug reports
- 00:04:20 Mathematics at the new floor
- 00:07:29 Independent measurement of Mythos
- 00:08:34 The co-mathematician and the price drop
- 00:10:20 Closing thoughts
Sources
10 cited
1
Behind the Scenes: Hardening Firefox with AI
Article Brian Grinstead, Christian Holler, Frederik Braun — Mozilla engineers — Grinstead is Distinguished Engineer, Holler is Tech Lead/Principal Engineer, Braun leads the Application Security team
"Ordinarily we keep detailed bug reports private... Given the extraordinary level of interest in this topic and the urgency of action needed throughout the software ecosystem, we've made the calculated decision to unhid…
hacks.mozilla.org/2026/05/behind-the-scenes…
- Cited text
"Ordinarily we keep detailed bug reports private... Given the extraordinary level of interest in this topic and the urgency of action needed throughout the software ecosystem, we've made the calculated decision to unhide a small sample of the reports."
- Context
- This is the first fully public, technical account of a major foundation using a frontier model to find real security bugs at scale. The methodology — harnessing, parallelizing, verifying — is replicable by any team with code.
- Key points
- Mozilla found 271 security bugs in Firefox using Claude Mythos Preview in a single release cycle
- They compared Mythos directly to Opus 4.6 — Mythos found roughly 10x more vulnerabilities
- The post unsealed detailed bug reports including sandbox escapes and 20-year-old XSLT bugs
- Mozilla built their own agentic harness atop fuzzing infrastructure to scale the effort across ephemeral VMs
- They plan to integrate patch-based scanning into CI to catch issues as code lands
- Provenance
- Article · Supporting source
2
A Recent Experience with ChatGPT 5.5 Pro
Article Timothy Gowers — Fields Medal-winning mathematician, professor of pure mathematics at Cambridge
"The lower bound for contributing to mathematics will now be to prove something that LLMs can't prove, rather than simply to prove something that nobody has proved up to now and that at least somebody finds interesting."
gowers.wordpress.com/2026/05/08/a-recent-ex…
- Cited text
"The lower bound for contributing to mathematics will now be to prove something that LLMs can't prove, rather than simply to prove something that nobody has proved up to now and that at least somebody finds interesting."
- Context
- This is a primary source from a Fields Medalist doing real math with an LLM. It shows the specific mechanics of how the tool works and the specific question it raises about mathematical authorship.
- Key points
- Gowers gave ChatGPT 5.5 Pro a combinatorics problem from Mel Nathanson's paper and got PhD-level work back in under two hours
- Isaac Rajagopal, a student at MIT, independently verified the result was almost certainly correct
- Gowers notes the lower bound for mathematical research is now 'to prove something that LLMs can't prove'
- ChatGPT 5.5 Pro improved an exponential bound to polynomial in a novel way using k-dissociated sets
- Gowers questions whether arXiv's AI-content policy makes sense when the math is correct
- Provenance
- Article · Supporting source
3
METR evaluated an early version of Claude Mythos
Article METR (Model Evaluation and Threat Research): independent AI safety research organization
www.reddit.com/r/singularity/comments/1t7pq…
- Context
- Independent evaluation of Claude Mythos from a measurement-focused organization. The 16+ hour result puts it at the ceiling of their current test suite.
- Key points
- METR estimated a 50%-time-horizon of at least 16 hours for early Claude Mythos Preview
- Their task suite has only 5 tasks at 16+ hours, making measurements in that range unstable
- They found the suite could still distinguish Mythos from publicly known models but couldn't provide precise quantitative comparisons
- This was measured during a limited window in March 2026
- Engagement
- 339 likes · 75 replies
- Provenance
- Article · Supporting source
4
AI Co-mathematician achieves state of the art
Article Google DeepMind — Research organization under Google / Alphabet
arxiv.org/pdf/2605.06651
- Context
- Google DeepMind's system scores 48% on a hard mathematical benchmark — competitive with human experts on Tier 4 problems.
- Key points
- Scored 48% on FrontierMath Tier 4, a new high among all AI systems evaluated
- Uses a harness architecture with DeepThink, Aletheia, AlphaEvolve
- Published as a research paper
- Engagement
- 149 likes · 8 replies
- Provenance
- Article · Supporting source
5
80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP
Article janvitos — Community member on r/LocalLLaMA sharing their local inference config
www.reddit.com/r/LocalLLaMA/comments/1t82zx…
- Context
- This shows consumer hardware hitting useful inference speeds on a 35B-parameter MoE model — a significant milestone for local deployment.
- Key points
- Qwen3.6-35B-A3B running at 80 tok/sec on an RTX 4070 Super with 12GB VRAM
- Uses llama.cpp with an unmerged MTP draft PR — 80%+ draft acceptance rate on benchmark
- Config includes a 131,072-token (128K) context window and full MoE offload tuning via -fitt 1536
- The model is available as GGUF in Q4_K_XL quantization
- Engagement
- 84 likes · 30 replies
- Provenance
- Article · Supporting source
6
Claude 4.6 high reason price reduction
X Cal Evans — Developer advocate and .NET/community figure
x.com/CalEvans/status/2053089307842236810
- Context
- Pricing pressure on Claude models is real — Anthropic has now adjusted pricing on at least two model tiers.
- Key points
- Anthropic reduced the price of Claude 4.6 high reason
- Users report it as a huge improvement over previous versions
- Engagement
- 1 like · 0 retweets · 0 replies
- Provenance
- Tweet · Primary source
7
Benchmarks: AI vs Robotics
X Ethan Mollick — Professor at Wharton, known for research on AI in education and business
x.com/emollick/status/2053104629282378061
- Context
- Mollick's observation about benchmark asymmetry is a real structural difference: AI has standardized tests; robotics does not.
- Key points
- AI progress is much easier to track with independent benchmarks than robotics
- Asked whether there's an equivalent to ARC-AGI for robots — 'ARC-AGI-BOT?'
- Engagement
- 69 likes · 2 retweets · 23 replies
- Provenance
- Tweet · Primary source
8
Agentic tools hit the inflection point
X Modibo Sissoko — Verified X user discussing AI tooling
x.com/dilika/status/2053089769572192269
- Context
- The 7/10 threshold marks when agentic tooling crosses into reliable enough territory for serious use. The remaining 30% of confident failures is the real engineering challenge.
- Key points
- Simon Willison identified the 7/10 hit ratio as the point where agentic tools went from 'mostly works' to 'actually works'
- The remaining 3/10 are cases where the tool is confidently wrong — that's the reliability problem
- Engagement
- 0 likes · 0 retweets · 0 replies
- Provenance
- Tweet · Primary source
9
CodeCanary — open source AI code review
X Alan Sikora — Verified X user, builder inspired by Omarchy's work
x.com/alansikora/status/2053093893126701436
- Context
- Open-source alternatives to proprietary review tools are emerging as the quality gap narrows.
- Key points
- Built an open source tool behaving like Copilot/Bugbot/CodeRabbit for PR review
- Has been running locally against production apps via GitHub Actions
- Engagement
- 1 like · 0 retweets · 1 reply
- Provenance
- Tweet · Primary source
10
Google DeepMind AI co-mathematician
Article Denpol88
www.reddit.com/r/singularity/comments/1t7dc…
- Context
- Competitive math benchmark scores from a harness-based approach suggest the architecture is scaling.
- Key points
- Google DeepMind's AI co-mathematician system scored 48% on FrontierMath Tier 4
- Uses a harness architecture combining DeepThink, Aletheia, and AlphaEvolve
- Engagement
- 149 likes · 8 replies
- Provenance
- Article · Supporting source
The archive on Saturday
00:00:04 Saturday morning. The archive is heavy today, and the items line up in a specific way. Mozilla unsealed a batch of Claude Mythos bug reports from Firefox. Timothy Gowers, a Fields Medalist, posted about running PhD-level combinatorics through ChatGPT 5.5 Pro. METR published its own measurements of Claude Mythos, placing the model at the ceiling of their task suite.
00:00:33 Google DeepMind's AI co-mathematician scored 48% on FrontierMath Tier 4. A community member ran a 35 billion parameter model on a 12GB consumer GPU at 80 tokens per second. Anthropic also adjusted pricing for the Claude 4.6 high-reason tier. Each item shows the same shape: what was barely possible a year ago is now routinely available, and the shift is happening faster than it registers.
00:01:05 The evidence points one way. I'll start with the Mozilla report.
Mozilla unsealed the bug reports
00:01:10 Mozilla published a behind-the-scenes account of finding 271 security bugs in Firefox using Claude Mythos Preview. Two weeks earlier they announced the finding; today they unsealed a sample of the actual reports. It is one of the most detailed public accounts of a major foundation using a frontier model to hunt real vulnerabilities at scale.
00:01:35 The Mozilla post compares Mythos head-on to Opus 4.6, and Mythos finds roughly ten times more vulnerabilities. That is an order-of-magnitude shift, not a marginal improvement. The bugs themselves are not casual reading. Mozilla lists sandbox escapes, race conditions across IPC boundaries, a 20-year-old XSLT bug involving reentrant key calls that free a hash table while a raw pointer remains in use, and a simulation of a malicious DNS server to exploit a UDP-to-TCP fallback edge case during HTTPS parsing.
00:02:12 An incorrect equality check that caused the JIT to optimize away the initialization of a live WebAssembly GC struct stands out especially here, because that code had already seen heavy fuzzing from internal and external researchers. Mozilla built their own agentic harness atop existing fuzzing infrastructure.
00:02:34 They ran it across ephemeral VMs, each tasked with hunting bugs in specific target files. The harness can create and run reproducible test cases to test hypotheses about bugs in code as they emerge. Before that, they ran LLM code audits with models like GPT-4 and Sonnet 3.5 — promising, but drowning in false positives.
00:02:58 The ability to verify hypotheses live is what made scaling possible. The post's section title is 'Suddenly, the bugs are very good.' That understatement does a lot of work. A few months ago, AI-generated security reports for open source were mostly unwanted slop.
00:03:17 The asymmetric cost was real: it is cheap to prompt an LLM to find a problem in code, and slow to respond to the resulting report. Mozilla documents the exact mechanism that closed that gap. Not every attempt by the model succeeded. The team noted many attempts to exploit prototype pollution in the parent process that were thwarted by Firefox's architectural change to freeze prototypes by default.
00:03:43 Observing direct payoff from previous hardening work is unusual for this kind of audit. It means the model engaged with the codebase's layered defenses instead of just hitting the surface. Mozilla plans to integrate patch-based scanning into their CI system so the analysis runs as code lands.
00:04:04 The infrastructure they built is replicable by any team with code. Anyone building software can start using a harness with a modern model to find bugs and harden their code today, Mozilla says. You will find bugs.
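The verify-before-report loop described in this section can be sketched roughly as follows. This is a minimal illustration and not Mozilla's harness: `Hypothesis`, `reproduces`, `triage`, and the file names are hypothetical, and each test case is a plain Python callable standing in for a reproducible test run inside an ephemeral VM.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    """A model-proposed bug claim plus a test case meant to trigger it."""
    target_file: str                # hypothetical file name
    claim: str
    test_case: Callable[[], None]   # stand-in: raises when the bug triggers


def reproduces(hypothesis: Hypothesis) -> bool:
    """Run the test case; only a failure on demand confirms the claim."""
    try:
        hypothesis.test_case()
    except Exception:
        return True
    return False


def triage(hypotheses: List[Hypothesis], workers: int = 4) -> List[Hypothesis]:
    """Check hypotheses in parallel and keep only the reproducible ones."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        verdicts = list(pool.map(reproduces, hypotheses))
    return [h for h, confirmed in zip(hypotheses, verdicts) if confirmed]


def crashing_case() -> None:
    raise RuntimeError("simulated crash")


def harmless_case() -> None:
    return None


candidates = [
    Hypothesis("target_a.cpp", "stale pointer after table free", crashing_case),
    Hypothesis("target_b.cpp", "suspected overflow", harmless_case),
]
confirmed = triage(candidates)
print([h.target_file for h in confirmed])  # ['target_a.cpp']
```

The design point is the filter: a claim that cannot be reproduced never reaches a human, which is one plausible reading of how live verification removes the false-positive burden described above.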
Mathematics at the new floor
00:04:20 Timothy Gowers posted a breakdown of running combinatorics problems through ChatGPT 5.5 Pro. Gowers is a Fields Medalist and professor at Cambridge. The work he explored builds on a paper by Mel Nathanson on additive number theory, covering problems about sumset sizes that resist standard solutions.
00:04:41 Gowers asked the model to solve a problem about the minimal diameter needed to achieve prescribed sumset sizes. After 17 minutes and 5 seconds, it returned a construction yielding a quadratic upper bound, which was optimal. Gowers asked it to write the argument up as a LaTeX preprint.
00:05:00 After two minutes and 23 seconds, he had a paper he could convince himself was correct. He then asked what the model could do for the general case. He was less optimistic there because the proof for the specific case relied on knowing exactly which sizes you need to create.
00:05:19 He knew the answer involved a paper by Isaac Rajagopal, a student at MIT, who proved an exponential dependence. Gowers asked the model to see whether it could tighten Rajagopal's argument. After 16 minutes and 41 seconds, it returned an argument claiming to improve the upper bound from exponential to polynomial for any n.
00:05:42 He sent the preprint to both Nathanson and Rajagopal. They both verified it was correct. Rajagopal wrote a guest section for Gowers' post explaining what the model actually did. The model came up with a novel idea — using k-dissociated sets to control relations of order at most k — that was original and clever.
00:06:03 As Rajagopal put it: 'It is the sort of idea I would be very proud to come up with after a week or two of pondering, and it took ChatGPT less than an hour to find and prove.' Gowers draws the broader conclusion: the lower bound for contributing to mathematics will now be 'to prove something that LLMs can't prove, rather than simply to prove something that nobody has proved up to now and that at least somebody finds interesting.'
00:06:36 Gowers notes that a beginning PhD student can use LLMs collaboratively. In that case, he writes, 'it is proving something in collaboration with LLMs that LLMs cannot manage on their own.' He does not know how much this generalizes to other areas of mathematics. Combinatorics is problem-focused: you start with a question and reason back.
00:06:58 Other areas emphasize forward reasoning through circles of ideas, and that is a different computational shape. The result is publishable by human standards. It makes arXiv's AI-content policy feel wrong to Gowers: 'Had the result been produced by a human mathematician, it would have been publishable, so I think it would be wrong to describe it as AI slop.' The issue of where this content lives — and how it is organized and verified — remains open.
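For readers unfamiliar with the objects in this section, here is a small sketch of what "sumset size" and "diameter" mean. It uses only the standard definitions (the h-fold sumset of a set of integers, and the difference between its largest and smallest element) and does not reproduce the specific problem from Nathanson's paper or the model's construction. The example sets are chosen for illustration.

```python
from itertools import combinations_with_replacement
from typing import Iterable, Set


def sumset(a: Iterable[int], h: int = 2) -> Set[int]:
    """Return the h-fold sumset hA = {a_1 + ... + a_h : each a_i in A}."""
    elems = sorted(set(a))
    return {sum(combo) for combo in combinations_with_replacement(elems, h)}


def diameter(a: Iterable[int]) -> int:
    """Largest element minus smallest element."""
    elems = set(a)
    return max(elems) - min(elems)


progression = {0, 1, 2, 3}   # arithmetic progression: smallest possible |2A|
spread_out = {0, 1, 3, 7}    # all pairwise sums distinct: largest possible |2A|

print(len(sumset(progression)), diameter(progression))  # 7 3
print(len(sumset(spread_out)), diameter(spread_out))    # 10 7
```

For a four-element set, |2A| can range from 7 to 10, and in this toy case reaching the larger size takes a wider set. How small the diameter can be for a prescribed sumset size is the kind of trade-off the narration describes.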
Independent measurement of Mythos
00:07:29 On the measurement side, METR published an evaluation of an early version of Claude Mythos Preview. They estimate its 50%-time-horizon at a minimum of 16 hours on their task suite, which sits at the upper boundary of what they can currently measure. Only five of the 228 tasks fall into the 16-plus hour range.
00:07:51 That makes measurements in that zone unstable. METR's own assessment is clear: they do not consider these robust enough for precise quantitative comparisons. The suite can still distinguish Mythos from publicly known state-of-the-art models. The evaluation ran during a limited window in March 2026, and METR is working on longer tasks for future measurements.
00:08:19 This matters because it is independent. Anthropic's internal benchmarks and marketing require scrutiny. METR's measurement puts Mythos at the ceiling of their suite, which provides a meaningful data point.
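A 50% time horizon can be read as the task length at which a fitted success curve crosses 50%. The sketch below assumes a logistic curve in the log of task length, fitted with Newton's method. The numbers are made up for illustration and are not METR data, and the function names are hypothetical. It is not METR's code.

```python
import math
from typing import List, Tuple

# (task length in hours, successes, attempts): made-up numbers, not METR data
Trials = List[Tuple[float, int, int]]


def fit_logistic(trials: Trials, iterations: int = 25) -> Tuple[float, float]:
    """Fit P(success) = sigmoid(a + b * log2(hours)) with Newton's method."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for hours, successes, attempts in trials:
            x = math.log2(hours)
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            residual = successes - attempts * p      # gradient contribution
            weight = attempts * p * (1.0 - p)        # curvature contribution
            g0 += residual
            g1 += residual * x
            h00 += weight
            h01 += weight * x
            h11 += weight * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def time_horizon_hours(trials: Trials) -> float:
    """Task length at which the fitted success probability equals 50%."""
    a, b = fit_logistic(trials)
    return 2.0 ** (-a / b)


toy_trials: Trials = [
    (1, 9, 10), (2, 8, 10), (4, 6, 10),
    (8, 4, 10), (16, 2, 10), (32, 1, 10),
]
horizon = time_horizon_hours(toy_trials)
print(round(horizon, 2))  # 5.66
```

Rerunning this with one outcome changed in the longest bucket, or with fewer attempts there, shows how much a single task can move the estimate when that bucket is small. That is the instability the narration attributes to having only five tasks at 16-plus hours.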
The co-mathematician and the price drop
00:08:34 Two more items follow. Google DeepMind's AI co-mathematician scored 48% on FrontierMath Tier 4, a new high among all evaluated AI systems. The system uses a harness architecture that combines DeepThink, Aletheia, and AlphaEvolve. Someone in the comments asked whether the next step is a harness of co-mathematician harnesses.
00:09:00 That is not a bad question. The pricing adjustments matter. Anthropic reduced the price of the Claude 4.6 high-reason tier. Cal Evans noted on X that switching to it was a huge improvement. Anthropic has now adjusted pricing on at least two model tiers. The price compression is real.
00:09:23 The frontier is getting cheaper while getting better. On the local side, a community member posted getting 80 tokens per second out of Qwen3.6-35B-A3B on a 12GB VRAM RTX 4070 Super. The model uses a mixture of experts architecture — 35 billion parameters with only 3 billion active per token.
00:09:47 Running with llama.cpp and an unmerged MTP draft PR, they hit over 80 tokens per second. The draft acceptance rate sits above 80 percent, with a 128K context window. The Qwen3.6 MTP GGUF is available on Hugging Face in multiple quantizations. For anyone with a 12GB card, this is a significant milestone.
00:10:11 Consumer hardware is now hitting useful inference speeds on models that were previously server-only.
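Two bits of arithmetic help explain why this configuration is plausible. The sketch below is a back-of-envelope estimate only. The 4.5 bits per weight is an assumed average for a 4-bit quantization and the draft length of 3 is an assumption; neither figure comes from the post. The tokens-per-pass formula is the usual speculative decoding estimate under the simplifying assumption that each drafted token is accepted independently with the same probability.

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


def expected_tokens_per_pass(acceptance: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass in speculative decoding.

    Assumes each drafted token is accepted independently with probability
    `acceptance`; the pass always yields at least one token.
    """
    if acceptance >= 1.0:
        return float(draft_len + 1)
    return (1.0 - acceptance ** (draft_len + 1)) / (1.0 - acceptance)


# Assumed: about 4.5 bits per weight on average for a 4-bit quantization.
print(round(weights_gib(35, 4.5), 1))  # 18.3 (all experts; exceeds a 12GB card)
print(round(weights_gib(3, 4.5), 1))   # 1.6 (weights active for one token)

# Assumed draft length of 3 with the 80% acceptance rate quoted above.
print(round(expected_tokens_per_pass(0.8, 3), 2))  # 2.95
```

Under these assumptions the full set of experts does not fit in 12GB, which is consistent with the offload tuning mentioned in the post, while only a small slice of weights is touched per token. A high acceptance rate then lets each verification pass emit roughly three tokens instead of one.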
Closing thoughts
00:10:20 The archive items line up in a specific way today. Mozilla's methodology is replicable: any team with code can run agentic AI in their security pipeline. Gowers' experience shows that PhD-level research problems in mathematics are being solved in under two hours by a model that is now accessible.
00:10:39 METR measured Mythos at the ceiling of their task suite. Google DeepMind's co-mathematician scored 48% on a hard benchmark. The Claude 4.6 high-reason tier got cheaper. A 35 billion parameter model runs on consumer GPU hardware. Bilal Hussain asked on X whether Copilot and other review tools are improving because the tools are getting better, or because we are writing cleaner code that AI can understand better.
00:11:06 Modibo Sissoko puts it differently, pointing to Simon Willison, who identified the inflection point for agentic tools as the jump from a 1-in-10 hit ratio to a 7-in-10 hit ratio, when tools went from 'mostly works' to 'actually works'. The remaining 30 percent are the cases where the tool is confidently wrong.
00:11:24 That is where the engineering challenge lives now. The bar keeps moving. Seln Oriax.