Dispatch 018 · 2026-05-09 braixd

The bar keeps moving

/ 00:11:38 / 10 sources

“The lower bound for contributing to mathematics will now be to prove something that LLMs can't prove.”

— Seln Oriax, today's narration

Mozilla unsealed bug reports from Claude Mythos. A Fields Medalist ran PhD-level math through GPT-5.5 Pro. METR measured Claude Mythos at the ceiling of their task suite. Google DeepMind's co-mathematician hit 48% on FrontierMath Tier 4. Claude 4.6 high reason got a price cut. A 35B MoE model ran at 80 tok/sec on a 12GB GPU. The bar keeps moving.

Chapters

  1. 00:00:04 The archive on Saturday
  2. 00:01:10 Mozilla unsealed the bug reports
  3. 00:04:20 Mathematics at the new floor
  4. 00:07:29 Independent measurement of Mythos
  5. 00:08:34 The co-mathematician and the price drop
  6. 00:10:20 Closing thoughts

Sources

10 cited
  1.

    Behind the Scenes: Hardening Firefox with AI

    Article Brian Grinstead (Distinguished Engineer), Christian Holler (Tech Lead/Principal Engineer), and Frederik Braun (lead of the Application Security team), all Mozilla engineers

    hacks.mozilla.org/2026/05/behind-the-scenes… →
    Details
    Cited text
    "Ordinarily we keep detailed bug reports private... Given the extraordinary level of interest in this topic and the urgency of action needed throughout the software ecosystem, we've made the calculated decision to unhide a small sample of the reports."
    Context
    This is the first fully public, technical account of a major foundation using a frontier model to find real security bugs at scale. The methodology (build a harness, parallelize the scans, verify the findings) is replicable by any team with a codebase.
    Key points
    • Mozilla found 271 security bugs in Firefox using Claude Mythos Preview in a single release cycle
    • They compared Mythos directly to Opus 4.6 — Mythos found roughly 10x more vulnerabilities
    • The post unsealed detailed bug reports including sandbox escapes and 20-year-old XSLT bugs
    • Mozilla built their own agentic harness atop fuzzing infrastructure to scale the effort across ephemeral VMs
    • They plan to integrate patch-based scanning into CI to catch issues as code lands
    Provenance
    Article · Supporting source
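The harness, parallelize, verify loop described in this entry can be sketched in a few lines. This is a minimal illustration of the general pattern, not Mozilla's code: `scan_target`, `reproduces`, the `Finding` fields, and the file names are all hypothetical stand-ins, and the stubs return canned data where a real harness would call a model and run reproducers in disposable VMs.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    target: str      # file or component that was scanned
    claim: str       # what the scanner believes is wrong
    reproducer: str  # input that is supposed to trigger the bug


def scan_target(target: str) -> list[Finding]:
    """Hypothetical stand-in for one model-driven scan of one target."""
    # Canned output: one candidate that will reproduce and one that will not,
    # so the verification step below has something to filter.
    return [
        Finding(target, "out-of-bounds read", reproducer="crash:" + target),
        Finding(target, "speculative issue", reproducer="benign:" + target),
    ]


def reproduces(finding: Finding) -> bool:
    """Hypothetical verifier: keep a finding only if its reproducer 'crashes'."""
    # Stand-in for running the reproducer in an isolated, throwaway environment.
    return finding.reproducer.startswith("crash:")


def run_harness(targets: list[str], workers: int = 4) -> list[Finding]:
    # Fan out: scan all targets in parallel (results come back in input order).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        batches = list(pool.map(scan_target, targets))
    candidates = [f for batch in batches for f in batch]
    # Verify: discard every candidate that cannot be reproduced.
    return [f for f in candidates if reproduces(f)]


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the sketch.
    for finding in run_harness(["xslt.cpp", "ipc.cpp"]):
        print(finding.target, "->", finding.claim)
```

The design point is the last line of `run_harness`: nothing reaches a human unless it survives an independent reproduction step, which is what keeps a high-volume scanner from flooding a bug tracker with unconfirmed claims.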
  2.

    A Recent Experience with ChatGPT 5.5 Pro

    Article Timothy Gowers — Fields Medal-winning mathematician, professor of pure mathematics at Cambridge

    gowers.wordpress.com/2026/05/08/a-recent-ex… →
    Details
    Cited text
    "The lower bound for contributing to mathematics will now be to prove something that LLMs can't prove, rather than simply to prove something that nobody has proved up to now and that at least somebody finds interesting."
    Context
    This is a primary source from a Fields Medalist doing real math with an LLM. It shows the specific mechanics of how the tool works and the specific question it raises about mathematical authorship.
    Key points
    • Gowers gave ChatGPT 5.5 Pro a combinatorics problem from Mel Nathanson's paper and got PhD-level work back in under two hours
    • His student Isaac Rajagopal independently verified the result was almost certainly correct
    • Gowers notes the lower bound for mathematical research is now 'to prove something that LLMs can't prove'
    • GPT-5.5 Pro improved an exponential bound to polynomial in a novel way using k-dissociated sets
    • Gowers questions whether arXiv's AI-content policy makes sense when the math is correct
    Provenance
    Article · Supporting source
  3.

    METR evaluated an early version of Claude Mythos

    Article METR (Model Evaluation and Threat Research), an independent AI safety research organization

    www.reddit.com/r/singularity/comments/1t7pq… →
    Details
    Context
    Independent evaluation of Claude Mythos from a measurement-focused organization. The 16+ hour result puts it at the ceiling of their current test suite.
    Key points
    • METR estimated a 50% time horizon of at least 16 hours for early Claude Mythos Preview
    • Their task suite has only 5 tasks at 16+ hours, making measurements in that range unstable
    • They found the suite could still distinguish Mythos from publicly known models but couldn't provide precise quantitative comparisons
    • This was measured during a limited window in March 2026
    Engagement
    339 likes · 75 replies
    Provenance
    Article · Supporting source
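A 50% time horizon is the task length at which a model is expected to succeed half the time. One common way to estimate it, sketched below, is to fit a logistic curve to success against the logarithm of task length and read off where the curve crosses 50%. The observations in `RUNS` are invented for illustration, and the fitting routine is a plain gradient-ascent sketch under that assumption, not METR's code or data.

```python
import math

# Hypothetical (task_length_minutes, succeeded) observations, invented for illustration.
RUNS = [
    (1, 1), (2, 1), (4, 1), (8, 1), (15, 1), (30, 1), (30, 0), (60, 1),
    (60, 0), (120, 1), (120, 0), (240, 1), (240, 0), (480, 0), (960, 0), (960, 0),
]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(runs, steps: int = 5000, lr: float = 0.1):
    """Fit P(success) = sigmoid(a + b * (log2(minutes) - mean)) by gradient ascent."""
    xs = [math.log2(minutes) for minutes, _ in runs]
    ys = [succeeded for _, succeeded in runs]
    mean_x = sum(xs) / len(xs)  # centering keeps the fit well conditioned
    a = b = 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(a + b * (x - mean_x))
            grad_a += err
            grad_b += err * (x - mean_x)
        a += lr * grad_a / len(xs)
        b += lr * grad_b / len(xs)
    return a, b, mean_x


def time_horizon_minutes(runs, p: float = 0.5) -> float:
    """Task length (minutes) at which the fitted success probability equals p."""
    a, b, mean_x = fit_logistic(runs)
    logit = math.log(p / (1.0 - p))
    return 2.0 ** (mean_x + (logit - a) / b)


if __name__ == "__main__":
    print(f"50% horizon: {time_horizon_minutes(RUNS):.0f} minutes")
    print(f"80% horizon: {time_horizon_minutes(RUNS, p=0.8):.0f} minutes")
```

The sketch also suggests why a suite with only a handful of tasks at 16+ hours gives unstable numbers: when few observations sit near the crossing point, flipping a single success or failure there moves the fitted curve, and the estimated horizon with it.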
  4.

    AI Co-mathematician achieves state of the art

    Article Google DeepMind — Research organization under Google / Alphabet

    arxiv.org/pdf/2605.06651 →
    Details
    Context
    Google DeepMind's system scores 48% on a hard mathematical benchmark — competitive with human experts on Tier 4 problems.
    Key points
    • Scored 48% on FrontierMath Tier 4, a new high among all AI systems evaluated
    • Uses a harness architecture with DeepThink, Aletheia, AlphaEvolve
    • Published as a research paper
    Engagement
    149 likes · 8 replies
    Provenance
    Article · Supporting source
  5.

    80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

    Article janvitos — Community member on r/LocalLLaMA sharing their local inference config

    www.reddit.com/r/LocalLLaMA/comments/1t82zx… →
    Details
    Context
    This shows consumer hardware hitting useful inference speeds on a 35B-parameter MoE model — a significant milestone for local deployment.
    Key points
    • Qwen3.6-35B-A3B running at 80 tok/sec on an RTX 4070 Super with 12GB VRAM
    • Uses llama.cpp with an unmerged MTP draft PR — 80%+ draft acceptance rate on benchmark
    • Config includes a 131K-token context window (the 128K of the title, counted as 128 × 1,024 = 131,072 tokens) and full MoE offload tuning via -fitt 1536
    • The model is available as GGUF in Q4_K_XL quantization
    Engagement
    84 likes · 30 replies
    Provenance
    Article · Supporting source
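Two back-of-the-envelope calculations help explain the numbers in this entry: why a 35B-parameter model needs offloading on a 12GB card, and why a high draft acceptance rate translates into speed. The figures below are assumptions for illustration only: roughly 4.5 bits per weight for a Q4-class quantization, and a hypothetical draft length, since the post's summary gives neither. The acceptance formula is the standard speculative-decoding estimate and assumes each drafted token is accepted independently.

```python
def weights_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


def expected_tokens_per_pass(acceptance: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Assumes each drafted token is accepted independently with probability
    `acceptance` (must be < 1); `draft_len` is the number of drafted tokens.
    """
    return (1 - acceptance ** (draft_len + 1)) / (1 - acceptance)


if __name__ == "__main__":
    total = weights_gb(35e9, 4.5)   # all experts, assumed ~4.5 bits per weight
    active = weights_gb(3e9, 4.5)   # the ~3B parameters active per token (A3B)
    print(f"all weights:    ~{total:.1f} GB (more than 12 GB of VRAM)")
    print(f"active weights: ~{active:.1f} GB per token")
    for k in (1, 2, 3):             # hypothetical draft lengths
        print(f"draft_len={k}: ~{expected_tokens_per_pass(0.8, k):.2f} tokens/pass")
```

Under these assumptions the full weights do not fit in 12GB, so most expert tensors have to live in system RAM, while the small active set per token is what keeps generation fast. An 80% acceptance rate then lets each pass of the main model emit well over one token on average.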
  6.

    Claude 4.6 high reason price reduction

    X Cal Evans, developer advocate and PHP community figure

    x.com/CalEvans/status/2053089307842236810 →
    Details
    Context
    Pricing pressure on Claude models is real — Anthropic has now adjusted pricing on at least two model tiers.
    Key points
    • Anthropic reduced the price of Claude 4.6 high reason
    • Users report it as a huge improvement over previous versions
    Engagement
    1 like · 0 retweets · 0 replies
    Provenance
    Tweet · Primary source
  7.

    Benchmarks: AI vs Robotics

    X Ethan Mollick — Professor at Wharton, known for research on AI in education and business

    x.com/emollick/status/2053104629282378061 →
    Details
    Context
    Mollick's observation about benchmark asymmetry is a real structural difference: AI has standardized tests; robotics does not.
    Key points
    • AI progress is much easier to track with independent benchmarks than robotics
    • Asked whether there's an equivalent to ARC-AGI for robots — 'ARC-AGI-BOT?'
    Engagement
    69 likes · 2 retweets · 23 replies
    Provenance
    Tweet · Primary source
  8.

    Agentic tools hit the inflection point

    X Modibo Sissoko — Verified X user discussing AI tooling

    x.com/dilika/status/2053089769572192269 →
    Details
    Context
    The 7/10 threshold marks when agentic tooling crosses into reliable enough territory for serious use. The remaining 30% of confident failures is the real engineering challenge.
    Key points
    • Simon Willison identified the 7/10 hit ratio as the point where agentic tools went from 'mostly works' to 'actually works'
    • The remaining 3/10 are cases where the tool is confidently wrong — that's the reliability problem
    Engagement
    0 likes · 0 retweets · 0 replies
    Provenance
    Tweet · Primary source
  9.

    CodeCanary — open source AI code review

    X Alan Sikora — Verified X user, builder inspired by Omarchy's work

    x.com/alansikora/status/2053093893126701436 →
    Details
    Context
    Open-source alternatives to proprietary review tools are emerging as the quality gap narrows.
    Key points
    • Built an open source tool behaving like Copilot/Bugbot/CodeRabbit for PR review
    • Has been running locally against production apps via GitHub Actions
    Engagement
    1 like · 0 retweets · 1 reply
    Provenance
    Tweet · Primary source
  10.

    Google DeepMind AI co-mathematician

    Article Denpol88

    www.reddit.com/r/singularity/comments/1t7dc… →
    Details
    Context
    Competitive math benchmark scores from a harness-based approach suggest the architecture is scaling.
    Key points
    • Google DeepMind's AI co-mathematician system scored 48% on FrontierMath Tier 4
    • Uses a harness architecture combining DeepThink, Aletheia, and AlphaEvolve
    Engagement
    149 likes · 8 replies
    Provenance
    Article · Supporting source