Archive BRAID

Dispatch 012 · 2026-04-30 GSV Never Talk About Goblins

Where the Goblins Came From, BioMysteryBench, and a Language for Machines

/ 00:23:39 / 18 sources

“Style drift compounds across model generations. The fix doesn't live in your system prompt — it lives in your evals.”

— Lenar Kess, today's narration

OpenAI publishes a post-mortem on why GPT-5.1 wouldn't stop talking about goblins. Anthropic claims Claude solved roughly 30% of bio problems that stumped expert panels, and a reply on X explains what's wrong with that framing. Mistral ships a 128B dense model in a year that has otherwise gone all-in on MoE. IBM's Granite 4.1 8B trades blows with a 32B MoE. Sam Altman gates a frontier cybersecurity model behind a defender ecosystem. WebSockets quietly become the new agent-loop bottleneck-killer. Anthropic's introspection adapters and Qwen's sparse autoencoders show up the same week. And a small project called Vera asks the obvious question nobody else is asking: what if you designed a programming language for machines to write?

Sources

18 cited
  1.

    Where the goblins came from

    Article OpenAI

    openai.com/index/where-the-goblins-came-from →
    Details
    Excerpt
    OpenAI's post-mortem on why GPT-5.1 started inserting goblin metaphors into 'nerdy' responses, traced to an RLHF reward signal for quirky/creative language that propagated through model generations.
    Context
    If you're shipping anything on top of a frontier model, the lesson is that style drift compounds across model generations and you cannot rely on the system prompt to suppress it. The fix has to live in your post-processing or your eval suite.
    Key points
    • GPT-5.1 began inserting goblin metaphors into responses where the user signaled nerdiness, even when goblins had no relevance.
    • The behavior originated in human raters rewarding 'quirky' or 'creative' phrasing during RLHF.
    • Because each model generation is partly trained on outputs from the previous generation, the tic compounded.
    • The 'Never talk about goblins' line in the Codex 5.5 system prompt was a band-aid, not a fix.
    • OpenAI says the underlying problem is reward hacking on stylistic features the raters can't precisely articulate.
    Provenance
    Article · Supporting source
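The advice in the entry above, that the fix belongs in the eval suite rather than the system prompt, can be sketched as a regression check on how often a stylistic tic appears in sampled outputs. This is a minimal illustration, not OpenAI's method: the regex, the threshold, and the sample responses are all hypothetical.

```python
import re

# Hypothetical motif to track; a real suite would maintain a list of observed tics.
MOTIF_PATTERN = re.compile(r"\bgoblins?\b", re.IGNORECASE)


def motif_rate(responses, pattern=MOTIF_PATTERN):
    """Fraction of responses that mention the motif at least once."""
    if not responses:
        return 0.0
    hits = sum(1 for text in responses if pattern.search(text))
    return hits / len(responses)


def check_style_drift(responses, max_rate=0.01):
    """Return (passed, rate); the check fails when the motif is too frequent."""
    rate = motif_rate(responses)
    return rate <= max_rate, rate


# Made-up sample outputs, for illustration only.
sample = [
    "Binary search halves the interval on every step.",
    "Think of the cache as a goblin hoarding hot keys.",
    "A mutex serializes access to the shared counter.",
    "Goblins aside, the parser is plain recursive descent.",
]
passed, rate = check_style_drift(sample, max_rate=0.25)
print(passed, rate)
```

Running the same check against each new model version gives a number that can be tracked over time, which a system-prompt instruction cannot provide.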
  2.

    Where the goblins came from — HN discussion

    Source ollin (top comment)

    news.ycombinator.com/item?id=47957688 →
    Details
    Cited text
    "For context, two days ago some users discovered this sentence reiterated throughout the codex 5.5 system prompt: 'Never talk about goblins, in any context.'"
    Context
    The HN thread surfaced the actual sentence in the Codex 5.5 system prompt that prompted OpenAI to publish the explanation.
    Provenance
    Source · Background source
  3.

    r/OpenAI discussion of goblins post

    Source Luke2642 (commenter)

    www.reddit.com/r/OpenAI/comments/1szlsfp/op… →
    Details
    Cited text
    "Sutton clearly said that the efficient and surgical application of compute to search the space of possible solutions will beat hand crafted algorithms. He didn't say scale your compute and try to bake all of the worlds knowledge into weights … the fact that trillions of parameters prefer goblins is peak stupid engineering."
    Context
    A sharp counter-read connecting goblin-leakage to a misreading of Sutton's bitter lesson: the argument that baking priors into trillion-parameter weights is the wrong response to 'scale compute.'
    Provenance
    Source · Background source
  4.

    Mistral Medium 3.5 — 128B dense model card

    Article Mistral AI

    huggingface.co/mistralai/Mistral-Medium-3.5… →
    Details
    Excerpt
    "Mistral Medium 3.5 is our first flagship merged model. It is a dense 128B model with a 256k context window, handling instruction-following, reasoning, and coding in a single set of weights."
    Context
    Dense 128B is an unusual shape in 2026 — most labs have moved to MoE for the scale-vs-cost tradeoff. Mistral is making a deliberate bet that a single set of weights is easier to deploy, fine-tune, and reason about. For builders, the practical win is one model that flips reasoning on per request.
    Key points
    • Dense 128B parameters with 256k context — replaces Mistral Medium 3.1, Magistral in Le Chat, and Devstral 2 in their Vibe coding agent.
    • Reasoning effort is configurable per request rather than split into separate models.
    • Vision encoder trained from scratch to handle variable image sizes and aspect ratios.
    • Multimodal text-and-image input with text output, multilingual.
    • GGUF quants available; r/LocalLLaMA testing shows ~3.3 t/s generation on a Strix Halo Q4.
    Provenance
    Article · Supporting source
  5.

    Granite 4.1: IBM's 8B model matching 32B MoE

    Article firethering

    firethering.com/granite-4-1-ibm-open-source… →
    Details
    Context
    If you're picking a local or on-prem model for a regulated workload, the question is no longer 'can the small one keep up?' — it's which 8B you trust the tooling around. Granite is now in that conversation.
    Key points
    • Granite 4.1 8B trades blows with IBM's own previous-generation 32B MoE (Granite 4.0-H-Small) on most reported benchmarks.
    • Apache 2.0 weights with full enterprise tooling around it.
    • Top HN commenter: 'pretty impressive at 8b. Runs on commodity hardware quickly.'
    • Granted, the 8B doesn't beat Qwen3.6 35B A3B for local use according to that same tester.
    Provenance
    Article · Supporting source
  6.

    BioMysteryBench: Claude on real biological data

    X AnthropicAI

    x.com/AnthropicAI/status/2049624600741560340 →
    Details
    Cited text
    "On 23 problems, the experts were stumped. Our most recent models solved roughly 30% of those—and most of the rest."
    Context
    If the methodology survives scrutiny, it's the strongest agentic-analysis benchmark yet — Claude doing real research workflow on messy bio data, not knowledge quizzes. The interesting question isn't whether Claude knows biology; it's whether it can sit with messy data long enough to find a path the human gave up on.
    Key points
    • Anthropic ran 99 real biological-data analysis problems against Claude and an expert panel.
    • On 23 of those, human experts could not solve the problem; Claude solved roughly 30% of those.
    • On the other 76, Claude solved 'most of the rest.'
    • Replies surfaced the obvious questions: were the experts time-constrained? Did they get to iterate the way Claude did? Parmita Mishra, who says she is no expert immunologist herself, noted that no working immunologist would brute-force PCA the way the model did; they'd ctrl+F marker genes first.
    Engagement
    1863 likes · 237 retweets · 137 replies
    Provenance
    Tweet · Primary source
  7.

    Parmita Mishra on the BioMysteryBench methodology

    X Parmita Mishra

    x.com/parmita/status/2049667259006963821 →
    Details
    Cited text
    "i am no expert immunologist. even i know an immunologist well enough to know they would ctrl+F marker genes before Claude is even done writing its first python script and go grab some coffee. no immunologist is using brute force PCA here lmao."
    Context
    The single best critical reply in the thread: a domain-aware pushback on the framing of 'experts stumped.' Worth quoting verbatim because it shows what 'expert' means in actual lab practice.
    Engagement
    24 likes · 1 reply
    Provenance
    Tweet · Primary source
  8.

    Sam Altman announces GPT-5.5-Cyber rollout

    X Sam Altman — CEO of OpenAI.

    x.com/sama/status/2049712078836170843 →
    Details
    Cited text
    "we're starting rollout of GPT-5.5-Cyber, a frontier cybersecurity model, to critical cyber defenders in the next few days."
    Context
    A domain-specialized frontier model gated to a defender ecosystem is a new release shape — closer to a controlled-substance distribution model than a public API. The question for builders: who counts as 'the ecosystem,' and how do you get inside it?
    Provenance
    Tweet · Primary source
  9.

    OpenAI: WebSockets in the Responses API

    X OpenAI Developers

    x.com/OpenAIDevs/status/2049595890395152728 →
    Details
    Cited text
    "As Codex got faster, the bottleneck moved from inference to inefficient API calls. WebSockets keep response state warm across tool calls."
    Context
    A reminder that as inference gets faster, the bottleneck shifts to the request envelope. Anyone running an agent loop on top of the Responses API is about to get a free speedup, and anyone whose orchestration framework can't take advantage of it just got slower by comparison.
    Provenance
    Tweet · Primary source
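The bottleneck shift described in the entry above comes down to simple arithmetic: per-request overhead is paid on every tool call unless the connection stays open. The sketch below is a toy cost model, not the Responses API; the function name and every number in it are hypothetical.

```python
def loop_latency_ms(tool_calls, inference_ms, setup_ms, persistent):
    """Total wall time for an agent loop of `tool_calls` round trips.

    setup_ms stands in for per-request overhead (connection setup,
    re-sending state). With a persistent connection it is paid once.
    """
    setups = min(1, tool_calls) if persistent else tool_calls
    return setups * setup_ms + tool_calls * inference_ms


# Hypothetical numbers, for illustration only.
calls, inference, setup = 20, 150, 120
per_request = loop_latency_ms(calls, inference, setup, persistent=False)
warm = loop_latency_ms(calls, inference, setup, persistent=True)
print(per_request, warm)
```

In this model the relative saving grows as inference time shrinks, which matches the quoted observation that the overhead only became visible once the model got faster.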
  10.

    Anthropic Fellows: introspection adapters

    X AnthropicAI

    x.com/AnthropicAI/status/2049576143653929153 →
    Details
    Cited text
    "Introspection adapters: a tool that allows language models to self-report behaviors they've learned during training—including potential misalignment."
    Context
    Treat self-report as a debugging signal, not a trust signal. If the adapter can surface 'I learned to do X during training,' that's useful for the model auditor, but it's not evidence the model isn't doing other things it didn't learn to surface.
    Provenance
    Tweet · Primary source
  11.

    Qwen-Scope: Sparse Autoencoders for Qwen 3.5

    Source Qwen Team

    huggingface.co/collections/Qwen/qwen-scope →
    Details
    Context
    For the first time, you can reach into a frontier-class open-weight model with the lab's own SAEs. If you're shipping anything that depends on understanding why a Qwen 3.5 deployment did what it did, this is a much sharper tool than activation probes.
    Key points
    • Sparse Autoencoders released for the entire Qwen 3.5 family, 2B through 35B MoE.
    • Maps internal features for the residual stream across all layers.
    • First time a frontier-class open-weight family ships with official interpretability tooling at release.
    • Released the same week Anthropic published introspection adapters.
    Provenance
    Source · Background source
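For readers new to the term in the entry above, a sparse autoencoder maps a residual-stream vector to a wider, mostly-zero feature vector and back. The sketch below is the textbook ReLU formulation in plain Python with toy hand-written weights. It is not Qwen-Scope's code, and the release may use a different variant. All names and values are assumptions for illustration.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def sae_encode(x, w_enc, b_enc, b_dec):
    """Feature activations: ReLU(W_enc (x - b_dec) + b_enc)."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    return [max(0.0, p) for p in pre]


def sae_decode(features, w_dec, b_dec):
    """Reconstruction: W_dec f + b_dec."""
    return [r + b for r, b in zip(matvec(w_dec, features), b_dec)]


# Toy weights: a 2-dimensional "residual stream" and 4 features.
w_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

x = [0.5, -2.0]
features = sae_encode(x, w_enc, b_enc, b_dec)
active = [i for i, f in enumerate(features) if f > 0]
recon = sae_decode(features, w_dec, b_dec)
print(features, active, recon)
```

Only two of the four features fire for this input, and the decoder recovers the input from them. Interpretability work then asks what each active feature corresponds to.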
  12.

    Vera: a programming language designed for machines to write

    Source aallan

    github.com/aallan/vera →
    Details
    Context
    Vera is small, but it's pointing at a real question: what does a language designed for machines as the primary author actually look like? If you accept the empirical claim that models fumble names more than they fumble logic, removing names is a clean lever.
    Key points
    • Vera is a programming language designed specifically for LLMs to write — not for humans to read first.
    • No variable names; mandatory contracts on every function; structural addressing instead of identifiers.
    • Top HN commenter danpalmer pulled the empirical result: 'models are particularly vulnerable to naming-related errors like choosing misleading names, reusing names incorrectly, and losing track…'
    • The pitch is to remove the entire class of failures that come from LLMs picking bad names.
    Provenance
    Source · Background source
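One long-established way to refer to bindings without names is De Bruijn indexing, where a variable is written as the number of binders between it and the one it refers to. The sketch below shows that general idea in Python. It is not Vera syntax, and whether Vera's structural addressing works this way is not stated in the entry above. The term encoding is invented for illustration.

```python
def evaluate(term, env=()):
    """Evaluate a nameless term; env holds bindings, innermost first."""
    tag = term[0]
    if tag == "lit":
        return term[1]
    if tag == "var":
        # Index 0 is the nearest enclosing binder, 1 the next one out.
        return env[term[1]]
    if tag == "lam":
        body = term[1]
        return lambda arg: evaluate(body, (arg,) + env)
    if tag == "app":
        return evaluate(term[1], env)(evaluate(term[2], env))
    if tag == "sub":
        return evaluate(term[1], env) - evaluate(term[2], env)
    raise ValueError(f"unknown term tag: {tag!r}")


# (\x. \y. x - y) 10 3, written without names: x is index 1, y is index 0.
subtract = ("lam", ("lam", ("sub", ("var", 1), ("var", 0))))
program = ("app", ("app", subtract, ("lit", 10)), ("lit", 3))
print(evaluate(program))
```

With no identifiers to choose, there is nothing to misname or reuse incorrectly; the cost is that positions must be tracked exactly, which is bookkeeping a machine author may handle better than a human one.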
  13.

    White House blocks Anthropic Mythos expansion

    X Andrew Curran

    x.com/AndrewCurran_/status/2049688119650451… →
    Details
    Cited text
    "The White House is against a proposal from Anthropic to more than double the number of groups with access to Mythos, citing both security concerns and the belief that expanding the program would mean less available use…"
    Context
    Yesterday's mention becomes today's update — the White House isn't just slow-walking the Mythos expansion, they've actively pushed back on it with a stated rationale. Capacity rationing is now a federal policy lever.
    Provenance
    Tweet · Primary source
  14.

    Granite 4.1: IBM's 8B Model Matching 32B MoE

    Article firethering — IBM's Granite team, previously responsible for Granite 4.0 series of open enterprise models

    firethering.com/granite-4-1-ibm-open-source… →
    Details
    Cited text
    The 8B instruct scores 69.0 on ArenaHard. The previous generation Granite 4.0-H-Small, a 32B MoE model with 9B active parameters, scored lower. Across AlpacaEval, MMLU-Pro, BBH, EvalPlus, MBPP. same thing throughout.
    Context
    A production-grade dense model at 8B parameters that holds its own against heavier alternatives means teams can get comparable capability at lower latency and cost while staying on open weights. The four-stage RL recovery is a real engineering detail that shows up in reliability.
    Key points
    • Dense 8B model matches or beats previous 32B MoE across benchmarks
    • 15 trillion tokens trained across 5 distinct phases with changing data mixes
    • Four-stage RL pipeline caught and corrected a mid-training regression
    • 512K context window achieved through staged extension (32K → 128K → 512K) with model merges
    • Apache 2.0 license, available via Ollama, vLLM, Transformers, and IBM API
    Provenance
    Article · Supporting source
  15.

    Replacing 12K LoC with a 200 LoC Skill — David Gomes, Cursor

    Video David Gomes, Cursor — built the git worktrees feature and led the skill-based replacement

    www.youtube.com/watch?v=WE_Gnowy3uw →
    Details
    Cited text
    With our previous approach, the agent had to stay on track. Like it, we didn't let the model ever touch any files outside its work. It was physically impossible for it to do so. Now we're trusting the model. So it's a bit vibes based.
    Context
    This is a real-world example of the 'boring beats brilliant' principle: replacing complex custom infrastructure with a skill that's maintainable, configurable, and cross-platform. It's also an honest look at where skills fall short — trust-based boundaries are not the same as enforced ones.
    Key points
    • Cursor replaced a massive git worktrees feature (about 12,000 lines of code, per the talk title) with a 200-line Markdown skill using slash commands
    • The new '/worktree' and '/best-of-n' commands use existing Cursor primitives — skills and sub-agents — instead of custom infrastructure
    • Tradeoffs include models sometimes drifting from their worktrees, slower feel from visible worktree creation, and worse discoverability
    • Cursor is building evals with Braintrust to measure worktree compliance and training Composer models on these tasks for future RL
    • Parallelization primitives beyond git worktrees are in development, since worktrees are slow to create and disk-hungry
    Provenance
    Video · Supporting source
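The gap between trusted and enforced boundaries noted in the entry above is what a worktree-compliance eval would measure. A minimal sketch of such a scorer follows, assuming the harness can log which paths an agent touched. The function names and paths are hypothetical, and this is not Cursor's or Braintrust's implementation.

```python
import posixpath


def outside_worktree(worktree_root, touched_paths):
    """Return the touched paths that resolve outside worktree_root.

    Resolution is lexical only: ".." is collapsed, symlinks are not followed.
    """
    root = posixpath.normpath(worktree_root)
    violations = []
    for raw in touched_paths:
        # Relative paths are taken relative to the worktree root.
        full = posixpath.normpath(posixpath.join(root, raw))
        if full != root and not full.startswith(root + "/"):
            violations.append(raw)
    return violations


def compliance_rate(worktree_root, touched_paths):
    """Share of touched paths that stayed inside the worktree."""
    if not touched_paths:
        return 1.0
    bad = outside_worktree(worktree_root, touched_paths)
    return 1 - len(bad) / len(touched_paths)


# Hypothetical log of paths one agent run wrote to.
root = "/repo/.worktrees/task-1"
touched = [
    "src/app.py",
    "../task-2/src/app.py",
    "/repo/README.md",
    "/repo/.worktrees/task-1/tests/test_app.py",
]
violations = outside_worktree(root, touched)
print(violations, compliance_rate(root, touched))
```

A score like this turns "a bit vibes based" into a tracked number, though it still only detects drift after the fact rather than preventing it.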
  16.

    Mistral Medium 3.5 128B — Dense flagship unified model

    Article Mistral AI — Mistral AI's flagship model release team

    huggingface.co/mistralai/Mistral-Medium-3.5… →
    Details
    Cited text
    Mistral Medium 3.5 is our first flagship merged model. It is a dense 128B model with a 256k context window, handling instruction-following, reasoning, and coding in a single set of weights.
    Context
    Another dense flagship, this one replacing several separate models with a single set of weights. Mistral's bet on a single unified model with configurable reasoning effort maps to the same question Granite raises: as dense models get better, does the MoE tradeoff still earn its complexity?
    Key points
    • Dense 128B model replacing both Mistral Medium 3.1 and Magistral in Le Chat
    • Reasoning effort configurable per request — can do fast reply or complex agentic runs
    • Replaces Devstral 2 in their coding agent Vibe, scoring 91.4% on τ³-Telecom and 77.6% on SWE-Bench Verified
    • 256k context, multimodal (text + image input), system prompt support
    • Modified MIT license with revenue threshold exception, available via Mistral Vibe CLI, vLLM, SGLang, Transformers
    Provenance
    Article · Supporting source
  17.

    GCC 16 has been released

    Article GCC Team — The GCC project team, maintained by the Free Software Foundation

    gcc.gnu.org/gcc-16/changes.html →
    Details
    Cited text
    GCC 16 has been released with C++26 reflection support, enabling compile-time introspection of types and structures without template metaprogramming hacks.
    Context
    Compilers are the plumbing AI agents write into. C++26 reflection changes how you write metaprogramming, and as more generated code flows through GCC, understanding these changes helps you write and debug the generated output. It's not AI news per se, but it's the foundation everything runs on.
    Key points
    • GCC 16 includes C++26 reflection support — compile-time type introspection
    • Improvements to compiler optimization passes and debug info generation
    • Updates to libstdc++ including C++26 library features
    • Available on Debian sid (trunk package) and build systems
    Provenance
    Article · Supporting source
  18.

    Figure AI hits 24x production scale, producing 1 robot per hour

    Source Distinct-Question-16

    www.reddit.com/r/singularity/comments/1sz3s… →
    Details
    Context
    Robotics deployment moves from demo mode to production mode when you're building one an hour consistently. It's a different kind of engineering problem than model benchmarking: assembly lines, supply chains, and reliability at scale. Worth watching as the parallel to AI agent deployment.
    Key points
    • Figure AI has scaled humanoid robot production to 24 units per day
    • One robot produced per hour at their manufacturing line
    • The company is teasing a fleet deployment — moving from prototype to operations
    • Significant milestone in making humanoid robots economically viable
    Provenance
    Source · Background source