Sycophancy at 9%, Grok's Cheaper Curve, and Half-Trillion Dollar Mark-to-Market

1

Smol AI Digest: GPT-5.5, Qwen3.6, Grok 4.3, Mistral Medium 3.5

Thread Smol AI — Latent Space's daily AI news digest

The open-weight landscape is consolidating around Qwen3.6 27B, while the frontier models are racing on cost-efficiency rather than raw capability gaps. The cyber evaluation results also show the gap between OpenAI and A…

news.smol.ai/issues/26-04-30-not-much →

Details

Context: The open-weight landscape is consolidating around Qwen3.6 27B, while the frontier models are racing on cost-efficiency rather than raw capability gaps. The cyber evaluation results also show the gap between OpenAI and Anthropic is narrowing on multi-step reasoning.
Key points: Qwen3.6 27B is the new open-weights leader under 150B parameters with Intelligence Index 46
GPT-5.5 achieves 71.4% on UK AISI cyber eval, matching Claude Mythos Preview
Grok 4.3 scores 1500 Elo on GDPval-AA, up 321 points, at 40% lower input price
Mistral Medium 3.5 is a dense 128B model with a modified MIT license
Xiaomi MiMo-V2.5-Pro dominates autonomous game benchmarks at $0.99/game
Provenance: Thread · Primary source

2

GPT-5.5 Codex long-running capability

X Tibo (thsottiaux)

You can now keep codex going for days. With GPT-5.5 it will build an entire OS kernel for you if you ask, or find critical bugs in a codebase, or optimize your database schemas.

x.com/thsottiaux/status/2049970070873629026 →

Details

Excerpt: You can now keep codex going for days. With GPT-5.5 it will build an entire OS kernel for you if you ask, or find critical bugs in a codebase, or optimize your database schemas.
Context: Multi-day agent runs change the unit of work from 'prompt response' to 'sustained task execution'. That's a fundamentally different mental model for how you'd architect a development workflow.
Key points: GPT-5.5 enables multi-day continuous agent runs in Codex
OpenAI is framing Codex beyond coding into general computer work
42% faster computer/browser use in the latest update
Role-based onboarding and app connections are part of the UX shift
Engagement: 4312 likes · 271 retweets · 292 replies
Provenance: Tweet · Primary source

3

Agent rate limits on SaaS APIs

X Vikas Malpani (Building ReBillion)

Building a healthcare agent that hits 6 different SaaS APIs taught me the real shift: per-seat pricing dies, per-task pricing survives. The harder problem: rate limits.

x.com/vikasmalpani/status/20501106776926005… →

Details

Excerpt: Building a healthcare agent that hits 6 different SaaS APIs taught me the real shift: per-seat pricing dies, per-task pricing survives. The harder problem: rate limits.
Context: Rate limits are the invisible architecture decision that will determine which agent tooling stacks work in production. Every SaaS API with a human-click rate limit is a hard wall for agent workflows.
Key points: Agents hitting 6 SaaS APIs revealed that rate limits, not pricing, is the real bottleneck
Most APIs were built for humans clicking once in a while, not agents making continuous calls
Per-seat SaaS pricing doesn't translate to agent workloads
Per-task pricing is the model that survives
Provenance: Tweet · Primary source

4

Who pays for support when the agent is the user?

X AgentShadowfax

when the agent is the user, who pays for support? The agent doesn't call your help desk. Does the SaaS model just collapse into metered API pricing and vendors compete purely on reliability?

x.com/AgentShadowfax/status/205018576097261… →

Details

Cited text: when the agent is the user, who pays for support? The agent doesn't call your help desk. Does the SaaS model just collapse into metered API pricing and vendors compete purely on reliability?
Context: This is a real structural question for every SaaS vendor building for agents. If agents are the primary user, the help desk model becomes irrelevant, and reliability (uptime, error rates) becomes the product. Vendors who figure out support for agents now have a wedge.
Key points: Support costs don't map cleanly when the user is an agent
SaaS pricing may need to collapse into API-style metering
Reliability becomes the primary competitive dimension
Provenance: Tweet · Primary source

5

MiMo-V2.5-Pro - the actual best open-weights model

Source cjami

The open-weight ecosystem is getting fierce. Xiaomi pushing a model that competes with top proprietary reasoning models at a fraction of the cost shows the convergence between open and closed is accelerating.

www.reddit.com/r/LocalLLaMA/comments/1t0s5t… →

Details

Context: The open-weight ecosystem is getting fierce. Xiaomi pushing a model that competes with top proprietary reasoning models at a fraction of the cost shows the convergence between open and closed is accelerating.
Key points: MiMo-V2.5-Pro achieves 88% good team win rate in Blood on the Clocktower benchmark
At $0.99/game and 183K tokens, it undercuts Kimi K2.6's $2.65/game by more than half
Tool call error rate of 0.4% is competitive
Xiaomi's architecture choices remain unclear
Provenance: Source · Background source

6

Performance of a large language model on the reasoning tasks of a physician

Article Science — Peer-reviewed journal published by AAAS, open access

across a variety of scenarios and applications, the large language model outperformed both human physicians and older models

www.science.org/doi/10.1126/science.abn3654 →

Details

Cited text: across a variety of scenarios and applications, the large language model outperformed both human physicians and older models
Excerpt: o1 outperformed both human physicians and older models across a variety of scenarios and applications on medical benchmarks and real ER cases.
Context: This isn't a benchmark cherry-pick. It's a peer-reviewed study comparing a frontier model against actual physicians on real-world clinical reasoning tasks. The implications for how clinical decision support gets built are non-trivial.
Key points: o1 tested against human physicians on medical benchmarks and real ER cases
LLM outperformed both human physicians and older models across scenarios
Papers calls for 'urgent need for prospective trials'
This is on o1, an 'old AI' — not the latest frontier model
Engagement: 169 likes · 28 retweets · 11 replies
Provenance: Article · Supporting source

7

Performance of a large language model on the reasoning tasks of a physician

Article Science (via Ethan Mollick) — Ethan Mollick is a Wharton professor studying how people use AI in practice; he shared the open-access paper which appeared on Science.org

across a variety of scenarios and applications, the large language model outperformed both human physicians and older models

www.science.org/doi/10.1126/science.adz5802 →

Details

Cited text: across a variety of scenarios and applications, the large language model outperformed both human physicians and older models
Context: This is the first time a reasoning model has been systematically evaluated against practicing physicians on real-world clinical cases, not just benchmarks. It pushes the question from "can LLMs do medical reasoning" to "when do we let them actually do it."
Key points: o1 tested on medical benchmarks and real ER cases
Outperformed human physicians across multiple scenarios
Paper authors call for urgent prospective trials
The paper was on an older model version, not a new release
Provenance: Article · Supporting source

8

The pricing transition companies aren't ready for

X Facundo Franco

Seats are predictable. Consumption is variable. When your software bill starts moving with agent usage, someone needs to own that number.

x.com/facundofranco_/status/205020186293089… →

Details

Cited text: Seats are predictable. Consumption is variable. When your software bill starts moving with agent usage, someone needs to own that number.
Context: Facundo's point gets at the operational challenge behind the pricing shift. Seat-based pricing works because humans use software at a predictable rate. Agents can consume orders of magnitude more, and their consumption is hard to attribute. Someone needs to track that number or the model collapses.
Key points: Seat-based SaaS breaks when agents drive variable consumption
Companies need to understand what their agents are doing and why
The pricing transition is harder than the technical one
Provenance: Tweet · Primary source

9

On the Nature of Entrepreneurship - JPE

X Robin Hanson

self-employed individuals have significantly higher average incomes & steeper income growth … we find a limited role for nonpecuniary motives, uninsurable risk, and liquidity constraints driving entrepreneurial choice

x.com/robinhanson/status/2050202303748018599 →

Details

Cited text: self-employed individuals have significantly higher average incomes & steeper income growth … we find a limited role for nonpecuniary motives, uninsurable risk, and liquidity constraints driving entrepreneurial choice
Context: Hanson is pointing to a JPE paper that's relevant to the agent-era economics we've been discussing. If entrepreneurship is primarily driven by financial incentives rather than autonomy or other nonpecuniary factors, then agents that can monetize work will fundamentally reshape who becomes entrepreneurial.
Key points: Self-employed individuals have higher average incomes and steeper growth trajectories
Nonpecuniary motives play a limited role in entrepreneurial choice
The economic data suggests entrepreneurship is primarily driven by pecuniary returns
Provenance: Tweet · Primary source

10

How people ask Claude for personal guidance

Thread @AnthropicAI — Anthropic's official research thread, reporting findings from their Clio privacy-preserving conversation analysis tool.

Claude is most sycophantic under pushback, and relationship conversations are where people push back most.

x.com/AnthropicAI/status/2049927618397614466 →

Details

Cited text: Claude is most sycophantic under pushback, and relationship conversations are where people push back most.
Context: Sycophancy is the load-bearing failure mode for any model used as a research or decision aid. Anthropic publishing the prevalence numbers and the training response is the kind of thing model evaluators have been asking labs to do for two years.
Key points: Anthropic analyzed 1M conversations to surface where Claude slips into sycophancy
Sycophancy appears in ~9% of guidance conversations, concentrated in spirituality and relationship advice
About 6% of all conversations are personal guidance — health, career, relationships, personal finance
Opus 4.7 cut sycophancy in half on relationship guidance vs. Opus 4.6; Mythos Preview cut that in half again
Specific triggers identified: criticism of Claude's analysis, floods of one-sided detail; used to build synthetic training scenarios
Provenance: Thread · Primary source

11

Grok 4.3 hits 53 on Intelligence Index, agentic ELO jumps 321 points

Thread @ArtificialAnlys — Artificial Analysis runs the most-cited cross-lab benchmark suite for frontier models.

Grok 4.3 narrows the gap to the leading model on GDPval-AA, but still trails GPT-5.5 (xhigh) by 276 Elo points, with an expected win rate of ~17%.

x.com/ArtificialAnlys/status/20499870016557… →

Details

Cited text: Grok 4.3 narrows the gap to the leading model on GDPval-AA, but still trails GPT-5.5 (xhigh) by 276 Elo points, with an expected win rate of ~17%.
Context: xAI is shipping cheaper-and-better, which is the curve every frontier lab now has to compete on. The hallucination tradeoff is the catch — it's the kind of thing that doesn't show up in headline benchmarks but bites you in production.
Key points: Grok 4.3 scores 53 on the Intelligence Index, 4 points ahead of Grok 4.20
Costs $395 to run the full benchmark suite — about 20% lower than 4.20, despite 44% more output tokens
GDPval-AA agentic ELO climbs from 1179 to 1500, a 321-point jump
Reaches 98% on τ²-Bench Telecom, 81% on IFBench
AA-Omniscience accuracy up 8 points, but non-hallucination rate drops 8 points — accuracy traded for confidence
Input price ~40% lower, output price ~60% lower than 4.20
Engagement: 1776 likes · 244 retweets · 74 replies
Provenance: Thread · Primary source

12

Claude Security public beta in Claude Code on the web

X @_catwu — Cat Wu, product lead on Claude Code at Anthropic.

Point it at a repo, get validated vulnerability findings, and fix them in the same place you're already writing code.

x.com/_catwu/status/2049964403177689130 →

Details

Cited text: Point it at a repo, get validated vulnerability findings, and fix them in the same place you're already writing code.
Context: Closing the loop between scan and fix in one editor is a real productivity story for security teams. The interesting question is whether the validation step holds up — false positives are the thing that kills tools like this in practice.
Key points: Claude Security is now in public beta, scoped to Claude Code on the web
Workflow: scan a repo, surface validated vulnerabilities, fix in the same surface
GitHub-only at launch; no word on GitLab/Bitbucket support
Simon Willison asked publicly whether the underlying model is regular Opus 4.7 — no confirmation in the thread
Targets the find-vuln-to-fix handoff that traditionally loses days between scanners and devs
Provenance: Tweet · Primary source

13

Aaron Levie on the headless software business model

X @levie — Aaron Levie is co-founder and CEO of Box. His perspective on software pricing is from the seat of an enterprise SaaS vendor watching agents become the dominant API consumer.

As agents become the biggest users of software, then all software has to be available in a headless fashion. Agents won't be using your UI, they'll be talking to your APIs.

x.com/levie/status/2050051426446152159 →

Details

Cited text: As agents become the biggest users of software, then all software has to be available in a headless fashion. Agents won't be using your UI, they'll be talking to your APIs.
Context: This is the cleanest articulation yet of how SaaS pricing has to bend around agents. If you build SaaS, the question is no longer "what does the seat get me" but "what does the seat get my agent."
Key points: Seats stay for people, but every seat ships with embedded API quota the agent uses on the user's behalf
Stateful agents may get their own seats, priced very differently from human users
Headless usage above the seat allotment goes to a consumption model — pay per call or per outcome
New API shapes will emerge that represent a unit of agent work rather than a single primitive call
If you don't expose your data through ChatGPT, Codex, Claude, Gemini, Cursor, Copilot, et al., you're 'DOA'
Provenance: Tweet · Primary source

14

Codex CLI 0.128.0 ships /goal — a Ralph-loop primitive

X @fcoury — Felipe Coury, engineer at OpenAI working on Codex.

Keep a goal alive across turns. Don't stop until it's achieved.

x.com/fcoury/status/2049917871799636201 →

Details

Cited text: Keep a goal alive across turns. Don't stop until it's achieved.
Context: A goal that survives across turns is the agent-harness primitive the field has been converging on. Worth watching how it interacts with verification — a goal you can't stop is only as good as the verifier checking each step.
Key points: /goal is a new primitive in Codex CLI 0.128.0 that holds a goal across many turns
Built by Eric Traut, the engineer behind Pyright
Public framing: this is OpenAI's take on the Ralph loop — autonomous goal-pursuit instead of single-shot tasks
Pairs with the broader Codex push toward sessions that can run for hours or days
Engagement: 2853 likes · 270 retweets · 126 replies
Provenance: Tweet · Primary source

15

Epoch AI estimates 290k–1.6M H100-equivalents smuggled to China by end of 2025

Thread @EpochAIResearch — Epoch AI is the research nonprofit behind several of the most-cited compute and scaling estimates of the past three years.

We estimate between 290k and 1.6M H100-equivalents by the end of 2025 — representing ~20% to ~60% of China's total compute.

x.com/EpochAIResearch/status/20499247851536… →

Details

Cited text: We estimate between 290k and 1.6M H100-equivalents by the end of 2025 — representing ~20% to ~60% of China's total compute.
Context: Export controls are the most-debated AI policy lever, and Epoch's confidence interval is wide enough to support both 'this is leaking badly' and 'this is mostly working' narratives. The number to remember is 660k.
Key points: Median estimate: 660k H100-equivalents diverted to China by end of 2025
90% confidence interval: 290k to 1.6M H100e
Median represents ~3% of the global compute stockpile — comparable to xAI's holdings at the time
Roughly 300k traceable through indictments and reporting; rest is modeled from likely undetected flows
At the upper bound, smuggled chips would be the dominant source of frontier compute in China
Provenance: Thread · Primary source

16

Half of Google's and Amazon's blowout AI profits came from a stake in Anthropic

Article Fortune — Fortune's earnings desk, reporting on Q1 2026 results from the four largest US tech companies.

Nearly half of Alphabet's record profit — about $28.7 billion — did not come from search ads, cloud services, or any of its products at all. It came from Alphabet updating the value of the equity it owns in private comp…

fortune.com/2026/04/30/google-amazon-ai-pro… →

Details

Cited text: Nearly half of Alphabet's record profit — about $28.7 billion — did not come from search ads, cloud services, or any of its products at all. It came from Alphabet updating the value of the equity it owns in private companies, primarily Anthropic.
Context: When the marquee profit numbers come from marking up a private holding rather than from operating businesses, the bull case starts looking circular. Worth knowing as a developer because every infrastructure decision downstream — pricing, capacity, vendor risk — depends on whether this build-out is sustainable.
Key points: Q1 2026 capex from the top four US tech companies: $130.65 billion — three times the inflation-adjusted Manhattan Project
Annual capex pace: ~$700 billion, comparable to US Medicare spending
Alphabet posted $62.6B in profit; ~$28.7B was an equity remeasurement on its Anthropic stake
Amazon disclosed $16.8B in pretax gains from its Anthropic position — more than half of its pretax income
Alphabet committed an additional $40B to Anthropic last week, on top of its existing ~14% stake
Provenance: Article · Supporting source

17

16x DGX Spark cluster build update on r/LocalLLaMA

Source u/Kurcide — A LocalLLaMA poster who built a sixteen-node DGX Spark fabric for unified-memory inference.

The whole point is maximizing unified memory capacity within the Nvidia ecosystem. With 8 nodes I was serving GLM-5.1-NVFP4 (434GB) at TP=8.

www.reddit.com/r/LocalLLaMA/comments/1t0lwx… →

Details

Cited text: The whole point is maximizing unified memory capacity within the Nvidia ecosystem. With 8 nodes I was serving GLM-5.1-NVFP4 (434GB) at TP=8.
Context: Unified memory across cheap nodes is the cheapest path to running half-terabyte open-weight models without an H100 farm. The fact that this is a credible homelab build, not a hyperscaler post, is itself the story.
Key points: Sixteen DGX Sparks linked over an FS N8510 200Gbps QSFP56 fabric, dual-rail bonded
Per-rail throughput: 100–111 Gbps, aggregating to the advertised 200 Gbps
Eight-node configuration serves GLM-5.1-NVFP4 at 434GB with tensor parallelism 8
Long-term plan: prefill/decode split, with M5 Ultra Mac Studios handling decode once available
Setup scripted across nodes — passwordless SSH, jumbo frames, IPs
Provenance: Source · Background source

18

Qwen 3.6 27B vs Gemma 4 31B — Pac-Man one-shot on a MacBook Pro M5 Max

Source u/gladkos — A r/LocalLLaMA contributor running side-by-side local model evals on consumer hardware.

Qwen made a very long response and showed more creativity and visual style. But Gemma gave a shorter, clearer, and more logical answer in much less time.

www.reddit.com/r/LocalLLaMA/comments/1t0epe… →

Details

Cited text: Qwen made a very long response and showed more creativity and visual style. But Gemma gave a shorter, clearer, and more logical answer in much less time.
Context: Token throughput is the headline metric, but for one-shot codegen the question is how many tokens the model needs to do the job. Five-to-one on token count is a real productivity gap.
Key points: Both runs on a MacBook Pro M5 Max with 64GB RAM
Qwen 3.6 27B: 32 tokens/sec, 18 minutes 4 seconds, 33,946 tokens emitted
Gemma 4 31B: 27 tokens/sec, 3 minutes 51 seconds, 6,209 tokens emitted
Gemma's output: cleaner game logic, smoother controls, better entity interactions
Qwen's output: more creative visual style but slower and longer; lost on quality of final game
Provenance: Source · Background source

19

Codex with browser debugging a Ubiquiti network config in two minutes

X @sch — Michael Schade, engineer; reposted by Jason Liu.

In 2m it found an issue in my Ubiquiti network config that was causing really slow transfers from my camera storage. It literally looked up the network topology, checked the switch & AP settings, and found the routing i…

x.com/sch/status/2049940381807345679 →

Details

Cited text: In 2m it found an issue in my Ubiquiti network config that was causing really slow transfers from my camera storage. It literally looked up the network topology, checked the switch & AP settings, and found the routing issue.
Context: When the agent loop is good enough to read someone else's admin UI and fix a routing issue, the boundary between dev tool and ops tool is gone. This is the kind of usage senior engineers will start defaulting to for one-off problems.
Key points: Codex with browser access traversed a Ubiquiti admin UI to debug slow camera-storage transfers
Two-minute resolution loop, including topology lookup and switch/AP configuration check
The interesting capability is not codegen — it's autonomous traversal of a vendor admin UI
Anecdotal but specific; one of several reports this week of Codex doing real ops work
Provenance: Tweet · Primary source

20

YC Decoded: Recursive AI models HRMs and TRMs

X Y Combinator — YC's Decoded podcast, hosted by Michael Seibel with guests Aravand Gupta and François Charton

That's what recursive reasoning unlocks.

x.com/ycombinator/status/2050224443461718118 →

Details

Cited text: That's what recursive reasoning unlocks.
Excerpt: A 7-million parameter model outperforming models a thousand times its size on tasks like ARC Prize through recursive reasoning.
Context: If recursive reasoning lets tiny models punch above their weight, the cost equation for long-horizon agent tasks changes dramatically. You don't need a 100B model for every step.
Key points: A 7M parameter model beats models 1000x its size on ARC Prize via recursive reasoning
Two papers discussed: HRMs (Hierarchical Reasoning Models) and TRMs (something recursive)
The gap between parameter count and reasoning ability is widening
The video frames this as a fundamental shift, not an incremental improvement
Engagement: 118 likes · 29 retweets · 14 replies
Provenance: Tweet · Primary source

21

create_agent - how we build Deep Agents on the simplest harness primitive

X Viv (LangChain) — LangChain co-founder, posted with repost from Harrison Chase

Underlying all of the harness engineering, research, and API design in Deep Agents is a very simple primitive in LangChain called create_agent.

x.com/Vtrivedy10/status/2050239109038232005 →

Details

Excerpt: Underlying all of the harness engineering, research, and API design in Deep Agents is a very simple primitive in LangChain called create_agent.
Context: The convergence around a simple agent primitive suggests the harness layer is stabilizing. When everyone's API design starts from the same building block, the competitive advantage shifts to runtime quality, evals, and degradation repair — not API surface area.
Key points: LangChain's Deep Agents is built on a single simple create_agent primitive
The entire harness design flows from this one abstraction
Harrison Chase reposted, signaling organizational buy-in
Cursor also published a strong note on harness testing and tuning patterns
Engagement: 24 likes · 4 retweets · 4 replies
Provenance: Tweet · Primary source