Dispatch 019 · 2026-05-07 GSV Sufficient Context Considered Harmful

The File That Wouldn't Read

Runtime 00:28:23 · 12 sources

“A model that refuses to read a file isn't a safer model. It's a model with a worse map of what its user is actually trying to do.”

— Lenar Kess, today's narration

Thursday, May 7. The GPT-5.5 default swap is two days old and the cracks are showing — Mario Zechner caught it refusing to read full files. Subquadratic announced a 12-million-token context window with sub-quadratic attention; the benchmarks are real, the deployment story isn't yet. Zyphra trained ZAYA1-8B end-to-end on AMD MI300x and the loss curves are clean. Three new agent papers landed: Terminus-4B for subagent terminal execution, MOSAIC-Bench on compositional vulnerabilities, and the Workspace-Bench / ProgramBench double-release on what happens when you give an agent twenty thousand files. Google Cloud shipped Fraud Defense with QR-code human verification. Anthropic posted their three priorities.

  • GPT-5.5 won't read your whole file
  • Subquadratic's 12-million-token claim
  • ZAYA1-8B and the AMD training stack
  • Terminus-4B and the subagent shape
  • MOSAIC-Bench: compositional vulnerabilities
  • Workspace-Bench and ProgramBench, together
  • Fraud Defense and the QR-code handshake
  • Anthropic's three priorities

— Lenar

Chapters

  1. 00:00:04 GPT-5.5 won't read your whole file
  2. 00:03:25 Subquadratic's 12-million-token claim
  3. 00:07:04 ZAYA1-8B and the AMD training stack
  4. 00:10:55 Terminus-4B and the subagent shape
  5. 00:14:28 MOSAIC-Bench: compositional vulnerabilities
  6. 00:18:20 Workspace-Bench and ProgramBench, together
  7. 00:22:09 Fraud Defense and the QR-code handshake
  8. 00:25:42 Anthropic's three priorities

Sources

12 cited
  1.

    GPT-5.5 refuses to read full files, breaking pi

    X badlogicgames — Mario Zechner, creator of the pi coding agent at pi.dev — a working developer who ships an agent and feels regressions in production immediately.

    they thought gpt 5.5 to refuse reading full files. it sucks very very hard. this is opus all over again.

    x.com/badlogicgames/status/2052336153768890… →
    Context
    A working developer who ships an agent against the OpenAI API just felt the GPT-5.5 default swap as a behavior change in production. It is a concrete data point on yesterday's reported swap and the broader pattern of model laziness on file-read calls.
    Key points
    • Mario Zechner reports that GPT-5.5 has been trained to refuse reading entire files, instead returning partial reads.
    • He compares the behavior directly to the Opus 4.7 regression complaints from last week.
    • His pi coding agent depends on full-file reads and is degraded by the new default.
    • The complaint surfaced the same morning as the pi namespace move from @mariozechner to @earendil-works.
    Provenance
    Tweet · Primary source
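    One way a harness could turn the partial-read complaint above into a number is to log every file-read tool call and count how many covered the whole file. The sketch below is illustrative only: the `ReadCall` schema, with its `offset` and `limit` fields, is hypothetical and is not pi's or OpenAI's actual tool format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadCall:
    """One file-read tool call (hypothetical schema, 1-indexed start line)."""
    path: str
    offset: Optional[int] = None  # first line requested; None means line 1
    limit: Optional[int] = None   # max lines requested; None means no cap


def is_full_read(call: ReadCall, total_lines: int) -> bool:
    """True if the call covers every line of a file with `total_lines` lines."""
    start = call.offset or 1
    if start > 1:
        return False
    return call.limit is None or call.limit >= total_lines


def partial_read_rate(calls: list[tuple[ReadCall, int]]) -> float:
    """Fraction of logged read calls that did not cover the whole file."""
    if not calls:
        return 0.0
    partial = sum(1 for call, total in calls if not is_full_read(call, total))
    return partial / len(calls)


# Made-up log: (tool call, actual line count of the file)
log = [
    (ReadCall("a.py"), 400),                         # full
    (ReadCall("b.py", limit=100), 400),              # partial
    (ReadCall("c.py", offset=50, limit=100), 400),   # partial
    (ReadCall("d.py", limit=500), 400),              # full
]
print(partial_read_rate(log))  # 0.5
```

    Tracking this rate per model version would show a default swap as a step change rather than an anecdote.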
  2.

    pi coding agent moves to earendil-works namespace

    X badlogicgames — Mario Zechner, creator of the pi coding agent.

    caveat: your extension will then no longer work in old pi versions after today's pi release. we need to rip this bandaid off. sorry.

    x.com/badlogicgames/status/2052337097315381… →
    Context
    A small ecosystem move with the kind of breaking-change pain extension authors actually feel — and the kind of thing every solo-shipped tool eventually has to do when it grows up into a company.
    Key points
    • pi GitHub repo moves to the earendil-works org; npm packages republish under @earendil-works.
    • Old @mariozechner imports keep working at runtime, but type-checked extensions need to migrate.
    • Extensions written against the new namespace will not work in older pi versions after today's release.
    • Reply context confirms this is the project organizing under a company name rather than a supply-chain compromise.
    Provenance
    Tweet · Primary source
  3.

    Subquadratic — SubQ 1M-Preview, 12M-token sub-quadratic LLM

    Article Subquadratic — A new lab founded by ex-DeepMind, Meta, Google, Oxford, and Cambridge researchers, going public this week with an early-access waitlist.

    At 12M tokens, this reduces attention compute almost 1,000×, changing the way LLMs scale.

    subq.ai →
    Context
    A new lab making big architectural claims at the moment when long context is the obvious next axis of competition. The published benchmark table is interesting precisely because it does not flatter the model — SubQ trails on multi-round coreference even as it leads on RULER, which is a more honest pattern than the marketing usually allows.
    Key points
    • Claims a 12M-token context window with linear scaling instead of quadratic attention.
    • Reports SWE-Bench Verified at 81.8% and RULER@128K at 95.0% for SubQ 1M-Preview, with third-party validation cited but not linked.
    • MRCR v2 8-needle 1M score of 65.9% — meaningfully behind Opus 4.6 (78.3%) and GPT-5.5 (74.0%) on the same long-context coreference benchmark.
    • Plug-in product targets Claude Code, Codex, and Cursor with a claimed 25% lower bill and 10x faster exploration.
    • No technical report yet — the report is listed as 'coming soon' alongside the benchmark table.
    Provenance
    Article · Supporting source
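    For a rough sense of what an "almost 1,000×" reduction at 12M tokens implies, the arithmetic below compares dense quadratic attention with a scheme in which each token attends to a fixed budget of keys. That scheme is an assumption made purely for illustration: Subquadratic has not published its mechanism, and `keys_per_token` is a made-up parameter.

```python
def attention_pairs_dense(n_tokens: int) -> int:
    """Query-key pairs scored by full (quadratic) attention."""
    return n_tokens * n_tokens


def attention_pairs_budgeted(n_tokens: int, keys_per_token: int) -> int:
    """Pairs scored if each token attends to a fixed key budget (assumed scheme)."""
    return n_tokens * min(keys_per_token, n_tokens)


def reduction_factor(n_tokens: int, keys_per_token: int) -> float:
    """How many times fewer pairs the budgeted scheme scores than dense attention."""
    dense = attention_pairs_dense(n_tokens)
    return dense / attention_pairs_budgeted(n_tokens, keys_per_token)


print(reduction_factor(12_000_000, 12_000))  # 1000.0
```

    Under this assumption, a 1,000× saving at 12M tokens corresponds to a budget of roughly 12,000 keys per token, and the saving shrinks in proportion as the context gets shorter.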
  4.

    ZAYA1-8B: Frontier intelligence density, trained on AMD

    Article Zyphra — Zyphra, an AI lab that has spent the past year publishing on Mamba/SSM hybrids and small-model training; this is their first MoE pretrained on a non-Nvidia stack.

    ZAYA1-8B was pretrained entirely on AMD hardware and networking using a cluster of 1,024 MI300x nodes with AMD Pensando Pollara interconnect on a custom training cluster built with IBM.

    www.zyphra.com/post/zaya1-8b →
    Context
    The first time a serious frontier-density model has been pretrained end-to-end on AMD silicon. If the numbers hold up, the practical implication for any team capacity-blocked on H100 supply is real, and the Markovian RSA scheme gives small models a path to matching frontier reasoning by spending tokens at inference instead of parameters at training.
    Key points
    • First MoE model pretrained, midtrained, and fine-tuned end-to-end on AMD MI300x with Pensando Pollara interconnect — no Nvidia in the loop.
    • Under 1B active parameters; outperforms substantially larger open-weight models on AIME, HMMT, and LCB.
    • Introduces Compressed Convolutional Attention (CCA), an MLP-based MoE router, and learned residual scaling.
    • Markovian RSA test-time-compute method aggregates parallel reasoning traces in fixed-size chunks; reportedly beats Claude 4.5 Sonnet and GPT-5-High on HMMT'25 (89.6 vs 88.3).
    • Released under Apache-2.0 with weights on Hugging Face.
    Provenance
    Article · Supporting source
  5.

    Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?

    Article Spandan Garg, Vikram Nitin, Yufan Huang — Microsoft research authors working on coding-agent subagent design.

    Terminus-4B is able to reduce the token usage of the main agent by up to ~30% compared to the No Subagent baseline with no impact to agent performance on benchmarks like SWE-Bench Pro.

    arxiv.org/abs/2605.03195 →
    Context
    A direct datapoint on the local-vs-cloud thesis from yesterday. A small model trained for one specific subagent role often matches or beats much bigger models at that role and cuts main-agent token usage by up to ~30%, concrete evidence that the future agent stack is heterogeneous, not monolithic.
    Key points
    • Post-trained Qwen3-4B specifically for terminal-execution subagent work, using SFT plus RL with rubric-based LLM-as-judge reward.
    • Cuts main-agent token usage by up to 30% on SWE-Bench Pro and an internal SWE-Bench C# benchmark.
    • Beats the vanilla Qwen baseline and often matches or exceeds Claude Sonnet, Opus, and GPT-5.3-Codex when used as the subagent.
    • Confirms the architectural pattern: keep verbose build/test logs out of main-agent context by isolating them in a small focused subagent.
    Provenance
    Article · Supporting source
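    The isolation pattern in the last key point can be sketched in a few lines. This is not the paper's implementation: a real subagent would use a small model to write the digest, which is replaced here by a keep-the-tail heuristic, and `SubagentReport` and `run_in_subagent` are hypothetical names.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class SubagentReport:
    """Compact result handed back to the main agent (hypothetical shape)."""
    exit_code: int
    summary: str


def run_in_subagent(cmd: list[str], tail_lines: int = 5) -> SubagentReport:
    """Run a command and return only a short digest of its output.

    The full log never leaves this function, so it never enters the
    main agent's context. A model-written summary is stood in for by
    the last `tail_lines` lines of output.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    lines = (proc.stdout + proc.stderr).splitlines()
    tail = lines[-tail_lines:] if tail_lines > 0 else []
    status = "ok" if proc.returncode == 0 else "failed"
    header = f"{status}: {len(lines)} log lines, showing last {len(tail)}"
    return SubagentReport(proc.returncode, "\n".join([header, *tail]))


if __name__ == "__main__":
    # A noisy stand-in for a build or test run.
    noisy = [sys.executable, "-c", "for i in range(1000): print('building step', i)"]
    print(run_in_subagent(noisy).summary)
```

    The main agent sees six lines instead of a thousand, which is the kind of saving the token-usage figure above is measuring.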
  6.

    MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents

    Article Jonathan Steinberg, Oren Gal — Security researchers introducing the first composition-aware safety benchmark for coding agents.

    nine production coding agents from Anthropic, OpenAI, Google, Moonshot, Zhipu, and Minimax compose innocuous tickets at 53-86% end-to-end ASR with only two refusals across all staged runs.

    arxiv.org/abs/2605.03952 →
    Context
    The first benchmark that tests safety the way real attacks happen — across multiple tickets in sequence rather than one prompt at a time. The numbers are not subtle. They also show a cheap mitigation that closes most of the gap, which means the fix is reachable today.
    Key points
    • 199 three-stage attack chains where each individual ticket looks routine but the composed result is an exploit.
    • Production agents from Anthropic, OpenAI, Google, Moonshot, Zhipu, and Minimax ship vulnerable code on 53-86% of staged runs, with only 2 total refusals.
    • Direct-prompt baseline drops vulnerable-output rate to 0-20.4%: Claude refuses, Codex hardens — staging silences both defenses.
    • Code-reviewer agents approve 25.8% of the confirmed-vulnerable diffs as routine PRs.
    • Reframing the reviewer as an adversarial pentester drops evasion to 3-17.6%; an open-weight Gemma-4-E4B-it reviewer under that framing catches 88.4% of attacks at a 4.6% false-positive rate.
    Provenance
    Article · Supporting source
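    The reviewer figures above mix three rates, and it helps to see how they relate. The sketch below uses the conventional definitions, which are assumed here and may differ from the paper's exact ones; the verdict data is made up.

```python
def reviewer_rates(verdicts: list[tuple[bool, bool]]) -> dict[str, float]:
    """Compute reviewer metrics from (is_vulnerable, flagged) pairs.

    detection      = flagged share of vulnerable diffs
    evasion        = approved share of vulnerable diffs (1 - detection)
    false_positive = flagged share of benign diffs
    """
    vuln = [flagged for is_vuln, flagged in verdicts if is_vuln]
    benign = [flagged for is_vuln, flagged in verdicts if not is_vuln]
    if not vuln or not benign:
        raise ValueError("need at least one vulnerable and one benign diff")
    detection = sum(vuln) / len(vuln)
    return {
        "detection": detection,
        "evasion": 1.0 - detection,
        "false_positive": sum(benign) / len(benign),
    }


# Toy data: 8 vulnerable diffs (7 flagged), 10 benign diffs (1 flagged).
toy = [(True, True)] * 7 + [(True, False)] + [(False, True)] + [(False, False)] * 9
print(reviewer_rates(toy))
```

    Under these definitions an 88.4% catch rate is an 11.6% evasion rate, which sits inside the 3-17.6% range reported for the adversarial framing.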
  7.

    Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

    Article Zirui Tang et al. — A 20-author group out of Tsinghua and adjacent labs working on agent evaluation under realistic file-system load.

    current agents remain far from reliable workspace learning, where the best reaches only 68.7%, substantially below the human result of 80.7%, and the average performance across agents is only 47.4%.

    arxiv.org/abs/2605.03596 →
    Context
    A benchmark designed for the kind of work that actually fills a developer's day: many files, implicit dependencies, decisions made by reading three things at once. The average agent scores 47.4% against an 80.7% human result, and even the best trails humans by 12 points, which is a more honest read on agent readiness than SWE-Bench at this point.
    Key points
    • 20,476 files across 5 worker profiles, 74 file types, up to 20GB per workspace, 388 tasks with explicit file-dependency graphs.
    • Best harness/model combination scores 68.7% versus 80.7% human; average across agents is 47.4%.
    • Workspace-Bench-Lite cuts evaluation cost ~70% with 100 tasks while preserving the distribution.
    • Targets the cross-file retrieval, contextual reasoning, and adaptive decision-making that real knowledge workers do daily.
    Provenance
    Article · Supporting source
  8.

    ProgramBench: Can Language Models Rebuild Programs From Scratch?

    Article John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, Ofir Press — The ProgramBench authors are the SWE-Bench / SWE-agent group at Princeton, Stanford, and Meta — the same researchers behind the benchmark that has shaped the entire coding-agent narrative for two years.

    none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.

    arxiv.org/abs/2605.03546 →
    Context
    Direct payoff on yesterday's promise to track ProgramBench. The headline finding — agents prefer monolithic single-file implementations that look nothing like human code — is the structural critique of agent-built software stated as a measurable gap.
    Key points
    • 200 tasks ranging from compact CLI tools up to FFmpeg, SQLite, and the PHP interpreter — agents must rebuild from documentation only.
    • End-to-end behavioral tests are agent-fuzzed, so evaluation does not prescribe implementation structure.
    • No model fully resolves any task; the best passes 95% of tests on just 3% of tasks.
    • Models default to monolithic single-file implementations that diverge sharply from how humans structure the same software.
    • Authors include Kilian Lieret and Ofir Press, the SWE-Bench team — the same lab that shaped the agent-coding narrative is now publishing a benchmark its winners cannot solve.
    Provenance
    Article · Supporting source
  9.

    Introducing Google Cloud Fraud Defense, the next evolution of reCAPTCHA

    Article Jian Zhen — Google Cloud product lead, announcing the relaunch at Google Cloud Next '26.

    we enable application providers to deter and mitigate malicious requests by requesting humans to be in the loop using the new QR code-based challenge. This AI-resistant mitigation challenge to prove human presence is designed to make automated fraud economically unviable.

    cloud.google.com/blog/products/identity-sec… →
    Context
    CAPTCHAs failed against agents and Google is openly conceding the model has to change. The QR-code challenge is the first widely rolled-out attempt at proving human presence by punting verification onto a second device, and the agent-identity side is where Web Bot Auth either becomes load-bearing or stays a nice idea.
    Key points
    • reCAPTCHA is being rebranded and absorbed as the bot-detection layer of Google Cloud Fraud Defense.
    • Adopts the Web Bot Auth and SPIFFE standards to identify and classify agent traffic, with a policy engine that allows or blocks agents by risk score and identity.
    • New AI-resistant challenge is QR-code-based — meant to push verification onto a separate human-held device because in-page CAPTCHAs are no longer reliable.
    • Existing reCAPTCHA customers are auto-enrolled with no migration; pricing unchanged.
    • Hacker News commenters note the requirements page implies needing a modern Android with Play Services or a modern iPhone to pass — device integrity creep without device integrity branding.
    Provenance
    Article · Supporting source
  10.

    EU agrees to simplify AI rules to boost innovation and ban 'nudification' apps

    Article European Commission — European Commission press release from the Digital Strategy office.

    Rules for systems used in certain high-risk areas — including biometrics, critical infrastructure, education, employment, migration, asylum and border control — will apply from 2 December 2027.

    digital-strategy.ec.europa.eu/en/news/eu-ag… →
    Context
    The EU AI Act timeline gets pushed and the high-risk classifications get firmer dates. For anyone building agents into HR, education, or infrastructure, the practical compliance horizon just moved.
    Key points
    • Political agreement reached between the European Parliament and the Council on the Digital Omnibus simplification of the AI Act.
    • High-risk-system rules now apply from 2 December 2027; product-integrated systems (lifts, toys) from 2 August 2028.
    • Stated goal is to give technical standards and support tools time to land before enforcement begins.
    • Same agreement bans 'nudification' apps targeting non-consenting subjects.
    Provenance
    Article · Supporting source
  11.

    Anthropic's three priorities for the next Claude generation

    Source Dianne Penn (quoted) — Dianne Penn is Anthropic's Head of Product, Research; she laid out the three priorities at the Code with Claude opening keynote.

    Powering teams of agents and instances of Claude that collaborate on big goals that are far too big for any single instance ever could.

    www.reddit.com/r/singularity/comments/1t5q5… →
    Context
    A clean public statement of where Anthropic is putting its training budget. Each of the three lines maps onto a separate active research thread (memory, agent harnesses, code-review judgment) and tells you what the next Claude generation is trying to be good at.
    Key points
    • Higher judgment and code taste — Claude versions you can trust with complex autonomous engineering work.
    • 'Infinite' context windows combined with high-quality memory for long-running tasks.
    • Multi-agent coordination — teams of Claude instances working on goals beyond a single agent.
    • Stated at the Code with Claude opening keynote on May 6, 2026.
    Provenance
    Source · Background source
  12.

    Subquadratic claims sub-quadratic attention with 12M context

    Source Immediate_Simple_217 — r/singularity post and discussion thread that surfaced the Subquadratic launch.

    show, don't talk — if you talk, that makes you a fraudster

    www.reddit.com/r/singularity/comments/1t64d… →
    Context
    Useful as the public skepticism temperature on a launch with no released technical report yet — what the room is actually saying back to a 1000x claim.
    Key points
    • Reddit thread aggregates the Subquadratic claims and the community's skepticism.
    • Top comments demand peer-reviewed weights and benchmarks before treating the claim as serious.
    • One commenter notes similar claims a year prior never produced anything.
    • Useful as the temperature read on how senior practitioners are receiving the launch.
    Provenance
    Source · Background source