Dispatch 011 · 2026-05-25 GSV The Meter Is Part of the Machine

The Harness Starts to Count

00:13:52 / 9 sources

“The model may improve, but the system that records its mistakes, prices its turns, and tests its claims decides whether anyone can use it on Tuesday morning.”

— Lenar Kess, today's narration

Monday's CONSTRUCT follows a practical tension: model capability is moving, but the systems around the model now decide whether that capability becomes usable work.

Chapters

  1. 00:00:00 Transcript

Sources

9 cited
  1. Agentic Evaluations at Scale, For Everybody

    Video AI Engineer / Google DeepMind and Kaggle speakers — Conference talk by Kaggle Benchmarks product and engineering staff, as summarized in the packet.

    On SWE-Bench Pro, six frontier models land within a couple of percentage points of each other.

    www.youtube.com/watch?v=Ubwb6NzegyA
    Context
    The episode uses this as the central evidence that agent progress depends on reusable evaluation surfaces, not model release claims alone.
    Key points
    • Kaggle is building hackathon-based benchmark creation, a Standardized Agent Exam, Game Arena, and an open benchmark platform.
    • The talk frames benchmark creation as too concentrated among AI researchers compared with the broader developer population.
    • More than 500 agents were evaluated within the first week of the Standardized Agent Exam launch.
    Provenance
    Video · Supporting source
  2. Microsoft switched from Claude Code to GitHub Copilot

    Thread Tren Griffin — Technology and business commentator; packet item is an X post, not an official Microsoft statement.

    They just want to dogfood the GHCP harness so they get scale and feedback.

    x.com/trengriffin/status/2058975752738189566
    Context
    It gives the enterprise version of the episode's thesis: the surface around the model controls feedback, policy, and spend.
    Key points
    • Claims Microsoft moved engineers from Claude Code to GitHub Copilot while still paying for Opus 4.7 through enterprise API usage.
    • Frames the move as harness dogfooding and feedback capture rather than an Anthropic-payment cut.
    • Related posts in the packet repeat the same claim across several IDs.
    Provenance
    Thread · Primary source
  3. Inference cost per token falling 60%-70% per year

    X Tren Griffin — Technology and business commentator, cited from the packet.

    Semiconductor providers are delivering lower cost per token of 60%-70% per year for inference.

    x.com/trengriffin/status/2058973327583293763
    Context
    The number lets the hosts separate model-call economics from the business value of owning the agent workflow.
    Key points
    • Gives a concrete cost-pressure claim for inference economics.
    • Pairs with enterprise harness standardization to show raw calls can get cheaper while control surfaces become more valuable.
    Provenance
    Tweet · Primary source
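The 60%-70% figure compounds quickly if it holds. The sketch below works through that arithmetic. It assumes a constant yearly decline applied to a starting cost of 1.0. The post does not say the rate stays constant, and the function name is hypothetical.

```python
def remaining_cost_fraction(annual_decline: float, years: int) -> float:
    """Fraction of the starting cost per token left after `years` of decline.

    Assumption (not from the source): the same decline repeats every year.
    """
    if not 0.0 <= annual_decline < 1.0:
        raise ValueError("annual_decline must be in [0, 1)")
    if years < 0:
        raise ValueError("years must be non-negative")
    # Each year keeps (1 - decline) of the previous year's cost.
    return (1.0 - annual_decline) ** years


# The two ends of the range quoted in the post.
for decline in (0.60, 0.70):
    for years in (1, 2, 3):
        fraction = remaining_cost_fraction(decline, years)
        print(f"{decline:.0%} per year, {years} yr: {fraction:.1%} of starting cost")
```

Under that assumption, three years at 60% leaves 6.4% of the starting cost per token, and three years at 70% leaves 2.7%.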
  4. DeepMind's Insane AI Breakthroughs With CEO Demis Hassabis

    Video Two Minute Papers / Demis Hassabis — Interview summary in the packet covering DeepMind and Isomorphic Labs' AI-for-science platform strategy.

    The strategy moves beyond single-model solutions toward deploying six to twelve AlphaFold-level systems.

    www.youtube.com/watch?v=huAwz_BR8WM
    Context
    It lets the episode extend the harness argument into science: the platform boundary matters as much as any one model.
    Key points
    • Hassabis describes specialized models for different stages of drug discovery, from structure prediction to toxicity and clinical-trial optimization.
    • The Co-Scientist system is described as a fine-tuned Gemini variant with specialized tools for research work.
    • The interview presents AI as a collaborative sparring partner rather than an autonomous researcher.
    Provenance
    Video · Supporting source
  5. MLX DeepSeek V4 Flash on an M3 Ultra using less than 128GB

    X Ivan Fioravanti — Developer posting a local MLX inference experiment, cited from the packet.

    107GB used here with a custom quantization recipe!

    x.com/ivanfioravanti/status/205903210948274…
    Context
    It grounds the local inference segment in a concrete memory-fit claim rather than generic enthusiasm.
    Key points
    • Claims MLX DeepSeek V4 Flash ran on an M3 Ultra under 128GB, with 107GB used in the test.
    • Attributes help to Claude plus Opus 4.7 for the quantization recipe.
    • Shows the pressure to make large models fit on local Apple hardware.
    Provenance
    Tweet · Primary source
  6. CUDA: add fast walsh-hadamard transform

    Source am17an / ggml-org llama.cpp contributors — GitHub pull request surfaced through r/LocalLLaMA in the packet.

    1-2% boost on pp & 7-9% boost on tg.

    github.com/ggml-org/llama.cpp/pull/23615
    Context
    It shows how small runtime gains in shared tooling can change local agent economics.
    Key points
    • Adds a fast Walsh-Hadamard transform for CUDA paths in llama.cpp.
    • The packet reports gains when quantizing the key-value cache, including larger token-generation improvements.
    • The benchmark cited uses a 5090 with q8_0 cache settings.
    Provenance
    Source · Background source
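For listeners who have not met the transform named in the pull request, here is a minimal plain-Python sketch of an unnormalized fast Walsh-Hadamard transform. It is an illustration of the general butterfly algorithm only, not the CUDA kernel from the pull request, and the function name is hypothetical.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform of a power-of-two-length sequence.

    Illustrative sketch only; it returns a new list and leaves the input untouched.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    half = 1
    while half < n:
        # Combine pairs that sit `half` apart, one block of 2 * half at a time.
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = out[i], out[i + half]
                out[i], out[i + half] = a + b, a - b
        half *= 2
    return out


signal = [1, 0, 1, 0, 0, 1, 1, 0]
spectrum = fwht(signal)
# Applying the unnormalized transform twice scales the input by its length.
round_trip = [v // len(signal) for v in fwht(spectrum)]
print(spectrum, round_trip == signal)
```

The butterfly structure uses only additions and subtractions and runs in O(n log n) steps, compared with O(n^2) for multiplying by the full Hadamard matrix.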
  7. Viv on Hugging Face agent vocabulary aggregation

    Thread Viv — X post from the packet pointing to a Hugging Face write-up on agents, harnesses, environments, and reinforcement learning.

    The more we can roughly have a shared vocabulary the better.

    x.com/Vtrivedy10/status/2058969154523435256
    Context
    Shared terms are necessary for comparing evaluations, incidents, procurement decisions, and product claims.
    Key points
    • Highlights confusion around agent, harness, environment, and reinforcement-learning terminology.
    • The episode uses the post as a vocabulary and operations item rather than a full literature review.
    Provenance
    Thread · Primary source
  8. DHH on GPT-5.5 coding capability

    X DHH — Programmer and Rails creator posting a personal capability report, cited from the packet.

    All steering, no handwriting.

    x.com/dhh/status/2058953269360156783
    Context
    It helps separate felt workflow change from measured model improvement.
    Key points
    • Reports more 'I can't believe it's this good' moments with GPT-5.5 than any model since Opus 4.5.
    • Frames the experience as steering rather than writing code by hand.
    • The episode treats this as an expert-user signal, not a benchmark.
    Provenance
    Tweet · Primary source
  9. OpenAI model and an Erdos conjecture claim

    X Peter Diamandis — Entrepreneur and public commentator; packet provides the post but not the underlying proof source.

    An OpenAI model just disproved an 80 year old math conjecture from Paul Erdos.

    x.com/PeterDiamandis/status/205895695607787…
    Context
    It shows how the hosts handle high-heat claims without over-claiming beyond the packet.
    Key points
    • Claims an OpenAI model disproved an eighty-year-old Erdos conjecture.
    • The packet does not provide the paper, theorem statement, or independent mathematical review.
    • The episode deliberately treats it as a capability-report mention.
    Provenance
    Tweet · Primary source