Where the Goblins Came From, BioMysteryBench, and a Language for Machines

1

Where the goblins came from

Article OpenAI

OpenAI's post-mortem on why GPT-5.1 started inserting goblin metaphors into 'nerdy' responses, traced to an RLHF reward signal for quirky/creative language that propagated through model generations.

openai.com/index/where-the-goblins-came-from →

Details

Excerpt: OpenAI's post-mortem on why GPT-5.1 started inserting goblin metaphors into 'nerdy' responses, traced to an RLHF reward signal for quirky/creative language that propagated through model generations.
Context: If you're shipping anything on top of a frontier model, the lesson is that style drift compounds across model generations and you cannot rely on the system prompt to suppress it. The fix has to live in your post-processing or your eval suite.
Key points: GPT-5.1 began inserting goblin metaphors into responses where the user signaled nerdiness, even when goblins had no relevance.
The behavior originated in human raters rewarding 'quirky' or 'creative' phrasing during RLHF.
Because each model generation is partly trained on outputs from the previous generation, the tic compounded.
The 'Never talk about goblins' line in the Codex 5.5 system prompt was a band-aid, not a fix.
OpenAI says the underlying problem is reward hacking on stylistic features the raters can't precisely articulate.
Provenance: Article · Supporting source
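One way to put the fix in your eval suite, as the Context above suggests, is a regression check that measures how often a known tic appears in sampled outputs and fails when the rate exceeds a budget. This is a minimal sketch: the pattern, the function names, the sample responses, and the threshold are all illustrative assumptions, not anything OpenAI describes.

```python
import re
from typing import Iterable, Tuple

# Hypothetical watch-list; in practice it would come from your own drift reports.
TIC_PATTERN = re.compile(r"\bgoblins?\b", re.IGNORECASE)


def tic_rate(responses: Iterable[str], pattern: re.Pattern = TIC_PATTERN) -> float:
    """Fraction of responses that contain at least one match of the pattern."""
    responses = list(responses)
    if not responses:
        return 0.0
    hits = sum(1 for text in responses if pattern.search(text))
    return hits / len(responses)


def check_style_drift(responses: Iterable[str], max_rate: float = 0.01) -> Tuple[bool, float]:
    """Return (passed, rate) so a CI job can fail when the tic exceeds its budget."""
    rate = tic_rate(responses)
    return rate <= max_rate, rate


# Made-up sample outputs, standing in for responses collected from an eval run.
sample = [
    "Use a hash map for O(1) lookups.",
    "Think of the cache as a goblin hoarding keys.",
    "The parser is recursive descent.",
    "Goblins aside, the fix is a mutex.",
]
passed, rate = check_style_drift(sample, max_rate=0.25)
print(passed, rate)
```

Tracking the rate per model version, rather than a single pass/fail, is what would make compounding drift visible across upgrades.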

2

Where the goblins came from — HN discussion

Source ollin (top comment)

"For context, two days ago some users discovered this sentence reiterated throughout the codex 5.5 system prompt: 'Never talk about goblins, in any context.'"

news.ycombinator.com/item?id=47957688 →

Details

Cited text: "For context, two days ago some users discovered this sentence reiterated throughout the codex 5.5 system prompt: 'Never talk about goblins, in any context.'"
Context: The HN thread surfaced the actual sentence in the Codex 5.5 system prompt that prompted OpenAI to publish the explanation.
Provenance: Source · Background source

3

r/OpenAI discussion of goblins post

Source Luke2642 (commenter)

"Sutton clearly said that the efficient and surgical application of compute to search the space of possible solutions will beat hand crafted algorithms. He didn't say scale your compute and try to bake all of the worlds…

www.reddit.com/r/OpenAI/comments/1szlsfp/op… →

Details

Cited text: "Sutton clearly said that the efficient and surgical application of compute to search the space of possible solutions will beat hand crafted algorithms. He didn't say scale your compute and try to bake all of the worlds knowledge into weights … the fact that trillions of parameters prefer goblins is peak stupid engineering."
Context: A sharp counter-read connecting goblin leakage to a misreading of Sutton's bitter lesson: the argument that baking priors into trillion-parameter weights is the wrong response to 'scale compute.'
Provenance: Source · Background source

4

Mistral Medium 3.5 — 128B dense model card

Article Mistral AI

"Mistral Medium 3.5 is our first flagship merged model. It is a dense 128B model with a 256k context window, handling instruction-following, reasoning, and coding in a single set of weights."

huggingface.co/mistralai/Mistral-Medium-3.5… →

Details

Excerpt: "Mistral Medium 3.5 is our first flagship merged model. It is a dense 128B model with a 256k context window, handling instruction-following, reasoning, and coding in a single set of weights."
Context: Dense 128B is an unusual shape in 2026 — most labs have moved to MoE for the scale-vs-cost tradeoff. Mistral is making a deliberate bet that a single set of weights is easier to deploy, fine-tune, and reason about. For builders, the practical win is one model that flips reasoning on per request.
Key points: Dense 128B parameters with 256k context — replaces Mistral Medium 3.1, Magistral in Le Chat, and Devstral 2 in their Vibe coding agent.
Reasoning effort is configurable per request rather than split into separate models.
Vision encoder trained from scratch to handle variable image sizes and aspect ratios.
Multimodal text-and-image input with text output, multilingual.
GGUF quants available; r/LocalLLaMA testing shows ~3.3 t/s generation at Q4 on a Strix Halo.
Provenance: Article · Supporting source
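The ~3.3 t/s figure is roughly what a back-of-envelope bandwidth bound predicts for a dense model, where every generated token has to read every weight once. The sketch below works that out. The 4.5 bits per weight and the 256 GB/s memory bandwidth are assumptions (a typical effective Q4 size and the nominal Strix Halo figure), not numbers from the model card, and the bound ignores KV-cache reads and compute.

```python
def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Bytes needed to hold the weights at a given quantization."""
    return params * bits_per_weight / 8


def max_tokens_per_second(params: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Upper bound for a dense model: each token streams all weights from memory once."""
    return bandwidth_gb_s * 1e9 / weight_bytes(params, bits_per_weight)


PARAMS = 128e9       # dense parameter count, from the model card
BITS = 4.5           # assumption: effective bits per weight for a Q4 quant
BANDWIDTH = 256.0    # assumption: nominal memory bandwidth in GB/s

size_gb = weight_bytes(PARAMS, BITS) / 1e9
bound = max_tokens_per_second(PARAMS, BITS, BANDWIDTH)
print(f"weights: {size_gb:.0f} GB, upper bound: {bound:.2f} t/s")
```

Under those assumptions the weights take about 72 GB and the ceiling is about 3.6 t/s, which is why a dense 128B feels slow locally where an MoE with a few billion active parameters does not.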

5

Granite 4.1: IBM's 8B model matching 32B MoE

Article firethering

If you're picking a local or on-prem model for a regulated workload, the question is no longer 'can the small one keep up?' — it's which 8B you trust the tooling around. Granite is now in that conversation.

firethering.com/granite-4-1-ibm-open-source… →

Details

Context: If you're picking a local or on-prem model for a regulated workload, the question is no longer 'can the small one keep up?' — it's which 8B you trust the tooling around. Granite is now in that conversation.
Key points: Granite 4.1 8B trades blows with IBM's own 32B MoE on most internal benchmarks.
Apache 2.0 weights with full enterprise tooling around it.
Top HN commenter: 'pretty impressive at 8b. Runs on commodity hardware quickly.'
Granted, the 8B doesn't beat Qwen3.6 35B A3B for local use according to that same tester.
Provenance: Article · Supporting source

6

BioMysteryBench: Claude on real biological data

X AnthropicAI

"On 23 problems, the experts were stumped. Our most recent models solved roughly 30% of those—and most of the rest."

x.com/AnthropicAI/status/2049624600741560340 →

Details

Cited text: "On 23 problems, the experts were stumped. Our most recent models solved roughly 30% of those—and most of the rest."
Context: If the methodology survives scrutiny, it's the strongest agentic-analysis benchmark yet — Claude doing real research workflow on messy bio data, not knowledge quizzes. The interesting question isn't whether Claude knows biology; it's whether it can sit with messy data long enough to find a path the human gave up on.
Key points: Anthropic ran 99 real biological-data analysis problems against Claude and an expert panel.
On 23 of those, human experts could not solve the problem; Claude solved roughly 30% of those.
On the other 76, Claude solved 'most of the rest.'
Replies surfaced the obvious questions: were the experts time-constrained? Did they get to iterate the way Claude did? Parmita Mishra, a self-described non-expert, noted that no working immunologist would brute-force PCA the way the model did; they'd ctrl+F marker genes first.
Engagement: 1863 likes · 237 retweets · 137 replies
Provenance: Tweet · Primary source

7

Parmita Mishra on the BioMysteryBench methodology

X Parmita Mishra

"i am no expert immunologist. even i know an immunologist well enough to know they would ctrl+F marker genes before Claude is even done writing its first python script and go grab some coffee. no immunologist is using b…

x.com/parmita/status/2049667259006963821 →

Details

Cited text: "i am no expert immunologist. even i know an immunologist well enough to know they would ctrl+F marker genes before Claude is even done writing its first python script and go grab some coffee. no immunologist is using brute force PCA here lmao."
Context: The single best critical reply in the thread: domain-aware pushback on the framing of 'experts stumped.' Worth quoting verbatim because it shows what 'expert' means in actual lab practice.
Engagement: 24 likes · 1 reply
Provenance: Tweet · Primary source
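For readers outside biology, the "ctrl+F marker genes" shortcut amounts to a lookup: scan the highly expressed genes for names already known to flag a cell type, before running any dimensionality reduction. A minimal sketch follows. The marker table is a small illustrative panel of commonly used immune markers, and the input gene list is made up; neither comes from the benchmark.

```python
from typing import Dict, Iterable, List

# Illustrative panel only; a real lab would use its own curated marker list.
MARKERS: Dict[str, set] = {
    "T cell": {"CD3D", "CD3E"},
    "B cell": {"CD19", "MS4A1"},
    "Monocyte": {"CD14", "LYZ"},
    "NK cell": {"NKG7", "GNLY"},
}


def marker_hits(expressed_genes: Iterable[str]) -> Dict[str, List[str]]:
    """Map each cell type to the markers found in the gene list (case-insensitive)."""
    genes = {g.upper() for g in expressed_genes}
    return {
        cell: sorted(markers & genes)
        for cell, markers in MARKERS.items()
        if markers & genes
    }


top_genes = ["ms4a1", "CD79A", "CD19", "ACTB"]  # made-up example input
print(marker_hits(top_genes))
```

The point of the critique is that this lookup takes seconds, so a benchmark's "expert baseline" depends heavily on whether the experts were allowed to work this way.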

8

Sam Altman announces GPT-5.5-Cyber rollout

X Sam Altman — CEO of OpenAI.

"we're starting rollout of GPT-5.5-Cyber, a frontier cybersecurity model, to critical cyber defenders in the next few days."

x.com/sama/status/2049712078836170843 →

Details

Cited text: "we're starting rollout of GPT-5.5-Cyber, a frontier cybersecurity model, to critical cyber defenders in the next few days."
Context: A domain-specialized frontier model gated to a defender ecosystem is a new release shape — closer to a controlled-substance distribution model than a public API. The question for builders: who counts as 'the ecosystem,' and how do you get inside it?
Provenance: Tweet · Primary source

9

OpenAI: WebSockets in the Responses API

X OpenAI Developers

"As Codex got faster, the bottleneck moved from inference to inefficient API calls. WebSockets keep response state warm across tool calls."

x.com/OpenAIDevs/status/2049595890395152728 →

Details

Cited text: "As Codex got faster, the bottleneck moved from inference to inefficient API calls. WebSockets keep response state warm across tool calls."
Context: A reminder that as inference gets faster, the bottleneck shifts to the request envelope. Anyone running an agent loop on top of the Responses API is about to get a free speedup, and anyone whose orchestration framework can't take advantage of it just got slower by comparison.
Provenance: Tweet · Primary source
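The shift in bottleneck can be seen with a toy cost model: if each tool call pays a fixed per-request setup cost, that cost grows with the number of calls, while a persistent connection pays it once. This sketch does not model the actual Responses API; the function name and all timings are hypothetical, and keeping state warm may save more than connection setup alone.

```python
def loop_seconds(tool_calls: int, inference_s: float, setup_s: float, reuse_connection: bool) -> float:
    """Wall time for an agent loop; setup is paid once when the connection is reused."""
    setups = min(1, tool_calls) if reuse_connection else tool_calls
    return tool_calls * inference_s + setups * setup_s


# Hypothetical numbers: 40 tool calls, 0.5 s of inference each, 0.2 s of per-request setup.
per_request = loop_seconds(40, inference_s=0.5, setup_s=0.2, reuse_connection=False)
persistent = loop_seconds(40, inference_s=0.5, setup_s=0.2, reuse_connection=True)
print(f"per-request: {per_request:.1f} s, persistent: {persistent:.1f} s")
```

As `inference_s` shrinks, the setup term becomes a larger share of the per-request total, which is the dynamic the quote describes.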

10

Anthropic Fellows: introspection adapters

X AnthropicAI

"Introspection adapters: a tool that allows language models to self-report behaviors they've learned during training—including potential misalignment."

x.com/AnthropicAI/status/2049576143653929153 →

Details

Cited text: "Introspection adapters: a tool that allows language models to self-report behaviors they've learned during training—including potential misalignment."
Context: Treat self-report as a debugging signal, not a trust signal. If the adapter can surface 'I learned to do X during training,' that's useful for the model auditor, but it's not evidence the model isn't doing other things it didn't learn to surface.
Provenance: Tweet · Primary source

11

Qwen-Scope: Sparse Autoencoders for Qwen 3.5

Source Qwen Team

For the first time, you can reach into a frontier-class open-weight model with the lab's own SAEs. If you're shipping anything that depends on understanding why a Qwen 3.5 deployment did what it did, this is a much shar…

huggingface.co/collections/Qwen/qwen-scope →

Details

Context: For the first time, you can reach into a frontier-class open-weight model with the lab's own SAEs. If you're shipping anything that depends on understanding why a Qwen 3.5 deployment did what it did, this is a much sharper tool than activation probes.
Key points: Sparse Autoencoders released for the entire Qwen 3.5 family, 2B through 35B MoE.
Maps internal features for the residual stream across all layers.
First time a frontier-class open-weight family ships with official interpretability tooling at release.
Released the same week Anthropic published introspection adapters.
Provenance: Source · Background source
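For anyone new to SAEs, the core operation is small: encode a residual-stream vector into a wide, mostly zero feature vector, then decode it back. The sketch below is the standard textbook form in plain Python with toy weights. It is not Qwen-Scope's file format or loading API, and the real release may use a different activation or normalization.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sae_encode(x: Vector, w_enc: Matrix, b_enc: Vector, b_dec: Vector) -> Vector:
    """f = ReLU(W_enc (x - b_dec) + b_enc); most entries end up zero."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    return [max(0.0, p + b) for p, b in zip(matvec(w_enc, centered), b_enc)]


def sae_decode(f: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """x_hat = W_dec f + b_dec; reconstructs the activation from the features."""
    return [r + b for r, b in zip(matvec(w_dec, f), b_dec)]


def top_features(f: Vector, k: int = 2) -> List[int]:
    """Indices of the k most active features, ignoring inactive ones."""
    active = [(a, i) for i, a in enumerate(f) if a > 0]
    return [i for _, i in sorted(active, reverse=True)[:k]]


# Toy weights: 2-dimensional activation, 4 features (one per signed axis direction).
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
W_DEC = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0, 0.0]
B_DEC = [0.0, 0.0]

x = [2.0, -1.0]
features = sae_encode(x, W_ENC, B_ENC, B_DEC)
print(features, top_features(features), sae_decode(features, W_DEC, B_DEC))
```

In practice the useful step is the last one: reading which feature indices fire on a given prompt and looking up what those features have been labeled as.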

12

Vera: a programming language designed for machines to write

Source aallan

Vera is small, but it's pointing at a real question: what does a language designed for machines as the primary author actually look like? If you accept the empirical claim that models fumble names more than they fumble…

github.com/aallan/vera →

Details

Context: Vera is small, but it's pointing at a real question: what does a language designed for machines as the primary author actually look like? If you accept the empirical claim that models fumble names more than they fumble logic, removing names is a clean lever.
Key points: Vera is a programming language designed specifically for LLMs to write — not for humans to read first.
No variable names; mandatory contracts on every function; structural addressing instead of identifiers.
Top HN commenter danpalmer pulled the empirical result: 'models are particularly vulnerable to naming-related errors like choosing misleading names, reusing names incorrectly, and losing track…'
The pitch is to remove the entire class of failures that come from LLMs picking bad names.
Provenance: Source · Background source
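To make "structural addressing instead of identifiers" concrete, here is a tiny evaluator that uses De Bruijn indices, a standard technique in which a variable is referenced by how many binders away it was introduced rather than by name. This is a generic illustration in Python and does not reproduce Vera's syntax or semantics; every class and function name here is made up for the example.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Var:
    index: int  # 0 = most recent binding, 1 = the one before it, and so on


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Let:
    bound: "Expr"  # value being bound; note there is no name to choose
    body: "Expr"   # expression evaluated with the new binding at index 0


Expr = Union[Lit, Var, Add, Let]


def evaluate(expr: Expr, env: Tuple[int, ...] = ()) -> int:
    """Evaluate an expression; env holds bound values, innermost first."""
    if isinstance(expr, Lit):
        return expr.value
    if isinstance(expr, Var):
        return env[expr.index]
    if isinstance(expr, Add):
        return evaluate(expr.left, env) + evaluate(expr.right, env)
    if isinstance(expr, Let):
        value = evaluate(expr.bound, env)
        return evaluate(expr.body, (value,) + env)
    raise TypeError(f"unknown expression: {expr!r}")


# Named form for comparison: let x = 2 in let y = x + 3 in x + y
program = Let(Lit(2), Let(Add(Var(0), Lit(3)), Add(Var(1), Var(0))))
print(evaluate(program))
```

With no names to pick, there is no way to choose a misleading one or reuse one incorrectly; the tradeoff is that an off-by-one index becomes the new failure mode, which is presumably part of what mandatory contracts are meant to catch.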

13

White House blocks Anthropic Mythos expansion

X Andrew Curran

"The White House is against a proposal from Anthropic to more than double the number of groups with access to Mythos, citing both security concerns and the belief that expanding the program would mean less available use…

x.com/AndrewCurran_/status/2049688119650451… →

Details

Cited text: "The White House is against a proposal from Anthropic to more than double the number of groups with access to Mythos, citing both security concerns and the belief that expanding the program would mean less available use…"
Context: Yesterday's mention becomes today's update — the White House isn't just slow-walking the Mythos expansion, they've actively pushed back on it with a stated rationale. Capacity rationing is now a federal policy lever.
Provenance: Tweet · Primary source

14

Granite 4.1: IBM's 8B Model Matching 32B MoE

Article firethering — IBM's Granite team, previously responsible for Granite 4.0 series of open enterprise models

The 8B instruct scores 69.0 on ArenaHard. The previous generation Granite 4.0-H-Small, a 32B MoE model with 9B active parameters, scored lower. Across AlpacaEval, MMLU-Pro, BBH, EvalPlus, MBPP. same thing throughout.

firethering.com/granite-4-1-ibm-open-source… →

Details

Cited text: The 8B instruct scores 69.0 on ArenaHard. The previous generation Granite 4.0-H-Small, a 32B MoE model with 9B active parameters, scored lower. Across AlpacaEval, MMLU-Pro, BBH, EvalPlus, MBPP. same thing throughout.
Context: A production-grade dense model at 8B parameters that holds its own against heavier alternatives means teams can cut latency and cost without giving up much capability, and without giving up open weights. The four-stage RL recovery is a real engineering detail that shows up in reliability.
Key points: Dense 8B model matches or beats previous 32B MoE across benchmarks
15 trillion tokens trained across 5 distinct phases with changing data mixes
Four-stage RL pipeline caught and corrected a mid-training regression
512K context window achieved through staged extension (32K → 128K → 512K) with model merges
Apache 2.0 license, available via Ollama, vLLM, Transformers, and IBM API
Provenance: Article · Supporting source

15

Replacing 12K LoC with a 200 LoC Skill — David Gomes, Cursor

Video David Gomes — David Gomes, Cursor — built the git worktrees feature and led the skill-based replacement

With our previous approach, the agent had to stay on track. Like it, we didn't let the model ever touch any files outside its work. It was physically impossible for it to do so. Now we're trusting the model. So it's a b…

www.youtube.com/watch?v=WE_Gnowy3uw →

Details

Cited text: With our previous approach, the agent had to stay on track. Like it, we didn't let the model ever touch any files outside its work. It was physically impossible for it to do so. Now we're trusting the model. So it's a bit vibes based.
Context: This is a real-world example of the 'boring beats brilliant' principle: replacing complex custom infrastructure with a skill that's maintainable, configurable, and cross-platform. It's also an honest look at where skills fall short — trust-based boundaries are not the same as enforced ones.
Key points: Cursor replaced a massive git worktrees feature (about 12,000 lines of code) with a 200-line Markdown skill using slash commands
The new '/worktree' and '/best-of-n' commands use existing Cursor primitives (skills and sub-agents) instead of custom infrastructure
Tradeoffs include models sometimes drifting from their worktrees, a slower feel from visible worktree creation, and worse discoverability
Cursor is building evals with Braintrust to measure worktree compliance and training Composer models on these tasks for future RL
Parallelization primitives beyond git worktrees are in development, since worktrees are slow to create and disk-hungry
Provenance: Video · Supporting source
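A worktree-compliance eval of the kind mentioned above could be as simple as checking, after a run, whether every file the agent touched resolves to a location inside its assigned worktree. This is a minimal sketch under that assumption; the function name, the paths, and the idea that you have a list of touched paths are all hypothetical and not taken from Cursor's or Braintrust's tooling.

```python
from pathlib import Path
from typing import Iterable, List


def outside_worktree(worktree_root: str, touched_paths: Iterable[str]) -> List[str]:
    """Return the touched paths that resolve to somewhere outside the worktree root."""
    root = Path(worktree_root).resolve()
    violations = []
    for raw in touched_paths:
        candidate = Path(raw)
        if not candidate.is_absolute():
            candidate = root / candidate  # relative paths are relative to the worktree
        resolved = candidate.resolve()    # collapses ".." so escapes are visible
        if resolved != root and root not in resolved.parents:
            violations.append(raw)
    return violations


# Hypothetical run log: two edits stay inside the worktree, two escape it.
touched = [
    "src/app.py",
    "../../README.md",
    "/repo/.worktrees/task-1/docs/notes.md",
    "/repo/src/main.py",
]
print(outside_worktree("/repo/.worktrees/task-1", touched))
```

The compliance rate is then the share of runs that return an empty list, which turns "a bit vibes based" into a number that can be tracked across model versions.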

16

Mistral Medium 3.5 128B — Dense flagship unified model

Article Mistral AI — Mistral AI's flagship model release team

Mistral Medium 3.5 is our first flagship merged model. It is a dense 128B model with a 256k context window, handling instruction-following, reasoning, and coding in a single set of weights.

huggingface.co/mistralai/Mistral-Medium-3.5… →

Details

Cited text: Mistral Medium 3.5 is our first flagship merged model. It is a dense 128B model with a 256k context window, handling instruction-following, reasoning, and coding in a single set of weights.
Context: Another dense flagship, this one merging what were previously separate models. Mistral's bet on a single unified model with configurable reasoning effort maps to the same question Granite raises: as dense models get better, does the MoE tradeoff still earn its complexity?
Key points: Dense 128B model replacing both Mistral Medium 3.1 and Magistral in Le Chat
Reasoning effort configurable per request — can do fast reply or complex agentic runs
Replaces Devstral 2 in their coding agent Vibe, scoring 91.4% on τ³-Telecom and 77.6% on SWE-Bench Verified
256k context, multimodal (text + image input), system prompt support
Modified MIT license with revenue threshold exception, available via Mistral Vibe CLI, vLLM, SGLang, Transformers
Provenance: Article · Supporting source

17

GCC 16 has been released

Article GCC Team — The GCC project team, maintained by the Free Software Foundation

GCC 16 has been released with C++26 reflection support, enabling compile-time introspection of types and structures without template metaprogramming hacks.

gcc.gnu.org/gcc-16/changes.html →

Details

Cited text: GCC 16 has been released with C++26 reflection support, enabling compile-time introspection of types and structures without template metaprogramming hacks.
Context: Compilers are the plumbing AI agents write into. C++26 reflection changes how you write metaprogramming, and as more generated code flows through GCC, understanding these changes helps you write and debug the generated output. It's not AI news per se, but it's the foundation everything runs on.
Key points: GCC 16 includes C++26 reflection support — compile-time type introspection
Improvements to compiler optimization passes and debug info generation
Updates to libstdc++ including C++26 library features
Available on Debian sid (trunk package) and build systems
Provenance: Article · Supporting source

18

Figure AI hits 24x production scale, producing 1 robot per hour

Source Distinct-Question-16

Robotics deployment moves from demo mode to production mode when you're building one an hour consistently. It's a different kind of engineering problem than model benchmarking: assembly lines, supply chains, and reliabil…

www.reddit.com/r/singularity/comments/1sz3s… →

Details

Context: Robotics deployment moves from demo mode to production mode when you're building one an hour consistently. It's a different kind of engineering problem than model benchmarking: assembly lines, supply chains, and reliability at scale. Worth watching as the parallel to AI agent deployment.
Key points: Figure AI has scaled humanoid robot production to 24 units per day
One robot produced per hour at their manufacturing line
The company is teasing a fleet deployment — moving from prototype to operations
Significant milestone in making humanoid robots economically viable
Provenance: Source · Background source