◆ Dispatch 049 · 2026-06-06 GSV The Schema Was A Suggestion
When the Harness Carries the Model
“The benchmark measures the model; the repair layer is what you actually ship.”
— Lenar Kess, today's narration
An open-weights model that fumbles tool calls on its own can go toe to toe with a frontier closed model — once you wrap the right error-handling around it. That gap, between what a model scores and what it does inside your repo, runs through everything we covered today.
- Ahmad Awais on Latent Space describes "tool confusion" — open models repeating the same invalid tool call roughly fifty-six times per billion tokens — and Command Code's deterministic repair layer that patches malformed output instead of arguing with the model. The claim that reframes the day: the harness, not the weights, decides whether a cheap model is usable.
- DeepSeek V4 Flash support in llama.cpp (PR #24162) makes the same model runnable locally — but the repair layer that makes it pleasant stays behind Command Code's API. Access to weights isn't access to the experience.
- Knowledge Activation (Bakal et al.) argues AI skills should be the institutional-knowledge unit for agentic development; Mutation Without Variation warns that repeated LLM edits converge rather than diverge — together a hint that skill files plus a converging model could homogenize a codebase.
- Agents' Last Exam, SentinelBench, and Stability vs. Manipulability in LLM judges all poke at the same wound: our scores have drifted from the work, especially for long-running and judge-graded evaluation.
- Anthropic's "When AI builds itself" (via a thin Reddit summary) claims AI is accelerating its own development; a zero-knowledge verification paper offers a cryptographic path to actually check claims like that — and the pause proposals that depend on verification.
- The Washington Post (Elizabeth Dwoskin), via Techmeme, reports an FDA fast track for digital health tech including AI chatbots — the same model behavior that costs a retry in coding costs a patient in a clinic.
Chapters
- 00:00:04 Transcript
Sources
20 cited
1
arXiv cs.AI - Research Science (GLOBAL)
Article Gal Bakal
Knowledge Activation: AI Skills as the Institutional Knowledge Primitive for Agentic Software Development - arXiv:2603.14805v2 Announce Type: replace Abstract: Enterprise software organizations accumulate critical...
arxiv.org/abs/2603.14805
- Context
- This paper directly addresses the bottleneck of 'institutional knowledge' for agentic development, a core topic. It proposes a structured framework (AKUs) that changes how agents interact with enterprise context.
- Provenance
- Article · Supporting source
2
arXiv cs.AI - Research Science (GLOBAL)
Article Kokil Jaidka, Saifuddin Ahmed
How Far Did They Go? The Persuasive Tactics of Covert LLM Agents in a Discontinued Field Experiment - arXiv:2606.05256v1 Announce Type: new Abstract: This study analyzes a publicly released dataset from a discontinued...
arxiv.org/abs/2606.05256
- Context
- Analyzes LLM agents' persuasive tactics in a real-world debate setting (Reddit), directly addressing AI power dynamics and epistemic control.
- Provenance
- Article · Supporting source
3
arXiv cs.AI - Research Science (GLOBAL)
Article Chen Huang, Yuhao Wu, Wenxuan Zhang
What Should Agents Say? Action-state Communication for Efficient Multi-Agent Systems - arXiv:2606.05304v1 Announce Type: new Abstract: Multi-agent systems (MAS) built on large language models are typically organized...
arxiv.org/abs/2606.05304
- Context
- This paper proposes PACT, a method to structure inter-agent communication for efficiency and performance in MAS. It directly impacts agentic coding tools and AI infrastructure costs.
- Provenance
- Article · Supporting source
4
arXiv cs.AI - Research Science (GLOBAL)
Article Matheus Kunzler Maldaner, Adam Fourney, Amanda Swearngin, Hussein Mozannar, Gagan Bansal, Maya Murad, Rafah Hosn, Saleema Amershi
SentinelBench: A Benchmark for Long-Running Monitoring Agents - arXiv:2606.05342v1 Announce Type: new Abstract: AI agents are increasingly asked to carry out work that spans minutes, hours, or longer. Yet the default...
arxiv.org/abs/2606.05342
- Context
- Introduces a new benchmark (SentinelBench) for long-running monitoring agents, directly addressing agentic coding/practice and AI infrastructure.
- Provenance
- Article · Supporting source
5
arXiv cs.AI - Research Science (GLOBAL)
Article Srimonti Dutta, Akshata Kishore Moharir
Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges - arXiv:2606.05384v1 Announce Type: new Abstract: LLM-as-judge evaluation is widely used in benchmarking pipelines,...
arxiv.org/abs/2606.05384
- Context
- This paper directly challenges a core assumption (stability) in LLM evaluation/benchmarking, impacting how models are measured and deployed.
- Provenance
- Article · Supporting source
6
arXiv cs.AI - Research Science (GLOBAL)
Article Zihao Li, Kaifeng Jin, Yuanchen Bei, Jiaru Zou, Avaneesh Kumar, Xuying Ning, Yanjun Zhao, Mengting Ai, Baoyu Jing, Hanghang Tong, Jingrui He
Harnessing Generalist Agents for Contextualized Time Series - arXiv:2606.05404v1 Announce Type: new Abstract: Time series are often embedded in rich contexts that are essential for holistic modeling. Moreover,...
arxiv.org/abs/2606.05404
- Context
- Introduces TimeClaw, an agentic framework for time series analysis, directly addressing AI infrastructure and specialized coding tools.
- Provenance
- Article · Supporting source
7
arXiv cs.AI - Research Science (GLOBAL)
Article Yiyou Sun, Xinyang Han, Weichen Zhang, Yuanbo Pang, Tianyu Wang, Yuhan Cao, Yixiao Huang, Chris Duroiu, Haoyun Zhang, Jeffrey Lin, Weishu Zhang, Tyler Zeng, Ying Yan, Bo Liu, Hanson Wen, Mingyang Xu, Xiaoyuan Liu, Zimeng Chen, Weiyan Shi, Amanda Dsouza, Vincent Sunn Chen, Patrick Bryant, Carl Boettiger, Yamini Rangan, Bradley Rothenberg, Kyle Steinfeld, Arvind Rao, Tapio Schneider, Georgios Yannakakis, Laure Zanna, Kaan Ozbay, Ida Sim, Tarek Zohdi, George Em Karniadakis, Jack Gallant, Teresa Head-gordon, Yushan Li, Wenxi Deng, Tao Sun, Huiqi Wang, Zhun Wang, Justin Xu, Chris Yuhao Liu, Yafei Cheng, Rongwang Hu, Aras Bacho, Shengcao Cao, Zengyi Qin, Yixiong Chen, Hengduan Fan, Hao Liu, Lin Zeng, Shashank Muralidhar Bharadwaj, Litian Gong, Yingxuan Yang, Maojia Song, Ruheng Wang, Zongzheng Zhang, Honglin Bao, Shuo Lu, Jianhong Tu, Zhonghua Wang, Zheng Zhang, Zijiao Chen, yanqiong Jiang, Zhendong Li, Bohan Lyu, Chang Ma, Peiran Xu, Benran Zhang, Shangding Gu, Haoyue Hua, Haoyang Li, Wanzhe Liao, Chengzhi Liu, Junbo Peng, Haoran Sun, Zechen Xu, Bo Chen, Jiayi Cheng, Yi Jiang, Keying Kuang, Yuan Li, Youbang Pan, Ziyan Rao, Alexander Schubert, Yifan Shen, Vincent Siu, Xiatao Sun, Kangqi Zhang, Xiaopan Zhang, Yuchen Zhu, Ishaan Singh Chandok, Lei Ding, Jingxuan Fan, Andrew Glover, Jiaming Hu, Yiran Hu, Wenbo Huang, Zixin Jiang, Haoran Jin, Lukas Kim, Ming Liu, Yang Liu, Alireza Rafiei, Xuhuan Shen, Kunyang Sun, Sophia Sun, Ting Sun, Eric Wang, Yixin Wang, Hanwen Xing, Sihan Xu, Yuzheng Xu, Zhongxing Xu, Zhiling Yan, Boqin Yuan, Ruiqi Zhang, Yifan Zhang, Zibo Zhao, Liana, Santanu Bosu Antu, Haoyue Bai, Carlo Bosio, Joseph Cavanagh, Patricia Cavazos-Rehg, Tianxing Chen, Xuewen Chen, Yipu Chen, Zhu Chenyu, Chen Dai, Stefano De Castro, Yunfu Deng, Kaustubh Dhole, Jiayuan Ding, Chenchen Du, Zhehang Du, Hao Fan, Run-ze Fan, Hengyu Fu, Shi Gu, Yifan Gu, Charlie Guo, Baihe Huang, Baixiang Huang, Rimika Jaiswal, Zhihan Jiang, Ran Jin, Erin Kasson, Xin Lan, Joseph Lee, 
Deren Lei, Chenyu Li, Daofeng Li, Haitao Li, Hongwei Li, Jingyan Li, Xiao Li, Yi Li, Yinsheng Li, Yuangang Li, Zhixu Li, Wenyu Liang, Longtai Liao, Kevin Qinghong Lin, AndyZeyi Liu, Che Liu, Jiaming Liu, Kaiyuan Liu, Xuan Liu, Pan Lu, Wenbo Lv, Yicheng Lv, Qiuyang Mang, Kyle Montgomery, Yuzhou Nie, Ruoxi Ning, Jorin Overwiening, Xu Pan, Layna Paraboschi, Core Francisco Park, Justin Purnomo, Swati Rajwal, Scott Rankin, Bixuan Ren, Yiren Rong, HaoYang Shang, Ventus Shaw, Fiona Shen, Jiawei Shen, Minqi Shi, Qiu Shi, Huaxiu Yao, Tianneng Shi, Jonah So, Vladislav Susoy, Hannah Szlyk, Haocheng Wang, Jialu Wang, Wei Wang, Xinyu Wang, Zehao Wang, Dowling Wong, Angela Wu, Dehao Wu, Fangyu Wu, Mengyuan "Millie" Wu, Yu Wu, Yuchen Wu, Yuhao Wu, Qingpo Wuwu, Weihang Xiao, Yongyi Xiong, Fan Xu, Ruiling Xu, Mingxuan Yan, Benjamin Yang, Jirong Yang, Sen Yang, Xiaoli Yang, Yushi Yang, Haoran Ye, Xiaohu Yu, Zhengming Yu, Chenlong Zhang, Chi Zhang, Hanning Zhang, Hanwen Zhang, Junge Zhang, Kunpeng Zhang, Song Zhang, Wenjin Zhang, Wenshuo Zhang, Ying Zhang, Yizhi Zhang, Brian Zhao, Qijian Zhao, Yimin Zhao, Yuhaohua Zheng, Liwei Zhou, Tianyue Zhou, Sichen Zhu, Siqi Zhu, Yan Zhu, Yishu Zhu, Jierui Zuo, Chonghao Cai, Helena Casademunt, Wenjia Chen, Benjamin Cheng, Nawen Deng, Rao Fu, Tianfu Fu, Yifan Han, Ren He, Zhenyu He, Qiao Jin, Lang Lang, Yuetai Li, Sylvia Liu, Lu Lu, Qing Lu, Subhabrata Mukherjee, Yunqi Ouyang, Yin Ren, Dawei Shi, Haoran Wu, Zhiyue Wu, Hannah Yao, Zhuoran Yi, Jenny Yu, Rhea Zhan, Hang Zhou, Blake Zhu, Junfan Zhu, Alan Yuille, Yang Liu, Russell Alan Poldrack, Jiachen Li, Zhenglu Li, Molei Tao, Jing Huang, Wenqi Shi, Costas Spanos, Lichao Sun, Chenguang Wang, Orson Xu, Zhen Dong, Hector Gomez, Aylin Caliskan, Ali Emami, Haimin Hu, Zhi Li, Lihui Liu, Murphy Niu, Yi Shao, Jianxin Sun, Mikko Tolonen, Ting Wang, Sanjiv Das, Yanjun Gao, Wenbo Guo, Erika J Schneider, Zhiyong Lu, Mark Mueller, Radha Poovendran, Somayeh Sojoudi, Dawn Song
Agents' Last Exam - arXiv:2606.05405v1 Announce Type: new Abstract: Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful...
arxiv.org/abs/2606.05405
- Context
- Introduces a new, large-scale benchmark (ALE) focused on economically valuable, real-world tasks and professional domains, directly addressing the gap between benchmarks and GDP-relevant impact.
- Provenance
- Article · Supporting source
8
arXiv cs.AI - Research Science (GLOBAL)
Article Can Gurkan, Forrest Stonedahl, Uri Wilensky
Mutation Without Variation: Convergence Dynamics in LLM-Driven Program Evolution - arXiv:2606.05408v1 Announce Type: new Abstract: When an LLM repeatedly mutates a program, does it explore new forms or circle back to...
arxiv.org/abs/2606.05408
- Context
- Directly addresses LLM-driven program evolution and structural bias in code generation, a core topic for agentic coding tools.
- Provenance
- Article · Supporting source
9
arXiv cs.AI - Research Science (GLOBAL)
Article Anna Mikeda, Ben Goertzel
A Motivational Architecture for Conversational AGI - arXiv:2606.05411v1 Announce Type: new Abstract: Motivational architectures in cognitive AI have largely been designed for physical agents regulating bodily needs....
arxiv.org/abs/2606.05411
- Context
- This paper proposes a novel motivational architecture for conversational AGI, directly addressing agentic design and the nature of intelligence control.
- Provenance
- Article · Supporting source
10
arXiv cs.AI - Research Science (GLOBAL)
Article Gianluca Guidi, Francesca Dominici, Tiziano Squartini, Callaway Sprinkle, Jonathan Gilmour, Kevin Butler, Eric Bell, Scott Delaney, Falco J. Bargagli-Stoffi
Assessing the Carbon Emissions and Energy Consumption of U.S. Hyperscale Data Centers - arXiv:2606.05420v1 Announce Type: new Abstract: The rapid proliferation of hyperscale data centers (HDCs) in the US, mainly driven...
arxiv.org/abs/2606.05420
- Context
- Directly addresses AI infrastructure (energy/emissions) and power dynamics (US grid reliance), a core topic.
- Provenance
- Article · Supporting source
11
arXiv cs.AI - Research Science (GLOBAL)
Article Rayyan Abdalla, Amir Hussein, Min Wu, Dinesh Manocha
Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models - arXiv:2606.05429v1 Announce Type: new Abstract: Post-training quantization (PTQ) is critical for the efficient...
arxiv.org/abs/2606.05429
- Context
- This is a new arXiv paper detailing an advanced quantization method (SAGE-PTQ) for LLMs, directly addressing inference efficiency and deployment costs.
- Provenance
- Article · Supporting source
12
arXiv cs.AI - Research Science (GLOBAL)
Article Pierre Peigné, Ky Nguyen, Paul Wang
Zero knowledge verification for frontier AI training is possible - arXiv:2606.05433v1 Announce Type: new Abstract: Frontier AI governance frameworks increasingly use cumulative training compute as the primary criterion...
arxiv.org/abs/2606.05433
- Context
- Proposes technical solutions (zkVMs, Merkle commitments) to verify frontier AI training compute, directly impacting governance and control.
- Provenance
- Article · Supporting source
13
arXiv cs.AI - Research Science (GLOBAL)
Article Jiateng Liu, Bingxuan Li, Zhenhailong Wang, Rushi Wang, Kaiwen Hong, Cheng Qian, Jiayu Liu, Denghui Zhang, Katherine Driggs-Campbell, Manling Li, Heng Ji
Brick-Composer: Using MLLMs for Assembly with Diverse Bricks - arXiv:2606.05445v1 Announce Type: new Abstract: We dream of AI agents that can read arbitrary designs and construct real-world objects from reusable...
arxiv.org/abs/2606.05445
- Context
- Directly addresses physical-world AI and agentic capability (building objects). Introduces a new benchmark (BC-Bench) and framework (Brick-Composer), which is core to the field.
- Provenance
- Article · Supporting source
14
arXiv cs.AI - Research Science (GLOBAL)
Article Quanyan Zhu
Insurance of Agentic AI - arXiv:2606.05449v1 Announce Type: new Abstract: Agentic artificial intelligence (AI) systems are transforming the risk landscape by extending beyond information generation to autonomous...
arxiv.org/abs/2606.05449
- Context
- Directly addresses the power dynamics (regulation/capital) surrounding agentic AI risk, a core topic of the podcast.
- Provenance
- Article · Supporting source
15
arXiv cs.AI - Research Science (GLOBAL)
Article Abhinaw Priyadershi, Mandar Pitale, Jelena Frtunikj, Maria Spence
Output Type Before Quality: A Standards-Derived XAI Admissibility Rubric for Autonomous-Driving Safety - arXiv:2606.05461v1 Announce Type: new Abstract: Safety standards for ML-based autonomous driving specify the kind...
arxiv.org/abs/2606.05461
- Context
- This paper defines a structural rubric for XAI in autonomous driving safety, linking ML outputs to regulatory standards (ISO/automotive). This directly impacts physical-world AI and liability.
- Provenance
- Article · Supporting source
16
arXiv cs.AI - Research Science (GLOBAL)
Article Kuangshi Ai, Haichao Miao, Kaiyuan Tang, Shusen Liu, Chaoli Wang
SciVisAgentSkills: Design and Evaluation of Agent Skills for Scientific Data Analysis and Visualization - arXiv:2606.05525v1 Announce Type: new Abstract: Recent advances in agentic visualization have enabled the...
arxiv.org/abs/2606.05525
- Context
- This paper introduces reusable agent skills for scientific visualization (SciVis), directly addressing agentic coding tools and advanced AI infrastructure/capabilities.
- Provenance
- Article · Supporting source
17
Latent Space · 40m41s
Video Latent Space
⚡️Making DeepSeek v4 outperform Opus 4.7 with Taste — @AhmadAwais , CommandCode.ai — https://x.com/MrAhmadAwais/status/2050956678502420612 We sit down with Ahmad Awais, CEO of CommandCodeAI, who developed a lightweight…
www.youtube.com/watch?v=-rIAVuaRjOg
- Context
- Directly addresses agentic coding tools and AI infrastructure (tool-calling reliability/repair logic), a core topic.
- Provenance
- Video · Supporting source
18
r/ClaudeAI: Anthropic Just Published a Major Update on Recursive Self-Improvement: AI Is Already Accelerating Its Own Development (May 2026) - 0 pts · 0 comments
Article nullvector88
Anthropic just dropped a really interesting new piece called “When AI builds itself.” They go deep into how they’re handing over more and more of their own AI development to the AI systems themselves. The numbers...
www.reddit.com/r/ClaudeAI/comments/1ty8f47/…
- Context
- Directly discusses AI acceleration, agentic coding tools, and self-improvement, hitting multiple core topics.
- Provenance
- Article · Supporting source
19
Techmeme - Industry Adjacent (US)
Article
Inside the Trump admin's push to integrate AI into the healthcare system, including an FDA regulatory fast track for digital health tech like AI chatbots (Elizabeth Dwoskin/Washington Post) - Elizabeth Dwoskin /...
www.techmeme.com/260606/p5
- Context
- Directly addresses policy (FDA/Trump admin) and physical-world application (healthcare), impacting regulation and market structure.
- Provenance
- Article · Supporting source
20
r/LocalLLaMA: DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162) - 0 pts · 0 comments
Article Lowkey_LokiSN
In case you're not aware already, the DeepSeek V4 series is finally getting supported on llama.cpp with this PR! The PR is at a very early stage right now, so only try it if you're consciously willing to experiment out...
www.reddit.com/r/LocalLLaMA/comments/1tyb3n…
- Context
- The post announces a working implementation of a frontier model (DeepSeek V4 Flash) for local inference via llama.cpp, directly addressing AI infrastructure and model capability.
- Provenance
- Article · Supporting source
Transcript
00:00:04 lenar: Here's a claim that sounds impossible, and I'll walk you toward why it isn't. Take an open-weights model that, left on its own, fumbles tool calls badly enough to be near-useless for serious coding work. Wrap a thin layer around it, and suddenly it's going toe to toe with one of the strongest closed models out there. No fine-tune. No new weights. Same model file. So what actually changed? That's what Ahmad Awais got into this week, talking to the Latent Space folks about his platform, Command Code.
00:00:34 damra: [tsk] My first instinct is that this is a benchmark trick. 'Outperforms Opus 4.7' is the kind of headline that usually means somebody picked the three tasks where it wins. So before I believe any of it: what's the actual mechanism he's claiming?
00:00:49 lenar: Fair, and he's surprisingly specific, which is what makes it worth our time. He names a failure he calls tool confusion. You give an open model (he tested DeepSeek V4 Pro, Kimi, MiniMax) a schema for a tool call. The model emits something that doesn't match the schema. The runtime sends back a validation error. And instead of fixing it, the model re-issues the same broken call. Again. And again. He puts a number on it: roughly fifty-six bad tool calls per billion tokens, the same wrong call repeating.
00:01:20 damra: Wait, repeating the identical wrong call after it's been told it's wrong? That's not a random sampling glitch. That's the model ignoring the correction entirely.
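The "same wrong call repeating" pattern described here can be counted from a call log. The following is a minimal sketch, not Command Code's method: the record shape (`name`, `arguments`, `valid`) and both function names are assumptions made for illustration.

```python
import json
from typing import Any


def call_fingerprint(name: str, arguments: dict[str, Any]) -> str:
    """Canonical string for a tool call so identical calls compare equal."""
    return json.dumps({"name": name, "arguments": arguments}, sort_keys=True)


def count_repeated_invalid_calls(calls: list[dict[str, Any]]) -> int:
    """Count invalid calls that exactly repeat the immediately preceding invalid call.

    Each entry is a hypothetical record: {"name": str, "arguments": dict, "valid": bool}.
    """
    repeats = 0
    previous_invalid = None
    for call in calls:
        if call["valid"]:
            previous_invalid = None  # a valid call breaks the streak
            continue
        fingerprint = call_fingerprint(call["name"], call["arguments"])
        if fingerprint == previous_invalid:
            repeats += 1
        previous_invalid = fingerprint
    return repeats
```

Dividing that count by tokens processed would give a rate comparable in form to the per-billion-token figure quoted above, though how that figure was actually measured is not stated in the episode.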
00:01:29 lenar: That's exactly his read, and his explanation is where it gets interesting. He thinks it's baked in during training. These models get trained, in his words, to never accept a correction, to hold their answer. So when your error handler does the polite thing and says 'that was invalid, try again,' the model treats that the way it was trained to treat pushback: it digs in.
00:01:51 damra: Okay, that tracks with something annoying I've seen. You tell a model its output broke the parser, it apologizes, and hands you back a near-identical broken blob. The apology is theater; the behavior doesn't move. So what does Command Code do instead of arguing with it?
00:02:08 lenar: It stops arguing. The layer he built does deterministic repair. When the model emits a malformed call, the runtime doesn't bounce an error back; it fixes the output itself. He gives concrete examples: the model emits a JSON string where an array was required, so the layer converts it to an array. The model issues a file read without an offset, so the layer infers the offset. Then it returns the corrected result plus a short repair hint. And he says by about the third attempt, the model's later calls come out clean.
00:02:39 damra: So the trick isn't making the model smarter. It's refusing to let the model's stubbornness reach the loop. You catch the bad output, you patch it, you hand back something that works, and the model, seeing a success, settles down. That's almost behavioral.
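The two repairs named in this exchange (a string where an array was required, and a file read with no offset) can be sketched as a deterministic patch step. This is an illustration under stated assumptions, not Command Code's implementation: the schema shape, the `repair_call` name, and the choice to default a missing offset to 0 are all hypothetical stand-ins.

```python
import json
from typing import Any

# Hypothetical schema for illustration: argument name -> expected Python type.
READ_FILE_SCHEMA = {"paths": list, "offset": int}


def repair_call(
    arguments: dict[str, Any], schema: dict[str, type]
) -> tuple[dict[str, Any], list[str]]:
    """Return (repaired_arguments, hints) without asking the model to retry."""
    repaired = dict(arguments)
    hints: list[str] = []

    for key, expected in schema.items():
        value = repaired.get(key)

        # Case 1: a string arrived where an array was required.
        if expected is list and isinstance(value, str):
            try:
                parsed = json.loads(value)  # e.g. '["a.py", "b.py"]'
            except json.JSONDecodeError:
                parsed = None
            # Use the parsed array if there is one, else wrap the raw string.
            repaired[key] = parsed if isinstance(parsed, list) else [value]
            hints.append(f"'{key}' was a string; converted to an array")

        # Case 2: a required integer (such as a read offset) is missing.
        # Defaulting to 0 is an assumption standing in for "infer the offset".
        elif expected is int and key not in repaired:
            repaired[key] = 0
            hints.append(f"'{key}' was missing; defaulted to 0")

    return repaired, hints
```

The runtime would then execute the repaired call and return the tool result together with the hints, so the model sees a success plus a short note rather than a validation error.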
00:02:53 lenar: And the scale is what made me sit up. He says this has generalized across more than sixteen thousand variations, and that Command Code now pushes something like six hundred billion tokens a day through this. His framing is that this infrastructure layer is what moves an open model like DeepSeek V4 Flash from unusable to competitive.
00:03:11 damra: Here's where I'll push, though. If the repair logic carries that much, then 'DeepSeek beats Opus' is the wrong scoreboard. What he's actually shown is that his harness plus a cheap model beats a bare strong model. Pull the harness out and the comparison falls apart. The accurate version is narrower: the repair layer decides whether the cheap model is viable at all.
00:03:34 lenar: And he half-says that himself. His other observation is that a lot of developers using something like Claude Code never see these tool failures, because the interface hides them behind permission prompts and retries. So they feel slowness and blame the model's intelligence, when what they're hitting is how the coding harness handles errors. The capability and the error handling get blamed for each other.
00:03:56 damra: That last bit is the useful warning for anyone building on open models right now. Your eval of 'is this model good enough' is secretly an eval of your own retry logic. If you swap models without holding the harness constant, you're not measuring what you think you're measuring.
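One way to act on "hold the harness constant" is to score every model and harness pair on the same tasks, then compare models only within a fixed harness. The sketch below assumes a hypothetical `run_task` callable that you supply; nothing here reflects a real benchmark or real results.

```python
from itertools import product
from typing import Callable

# Hypothetical callable: (model, harness, task) -> True if the task passed.
RunTask = Callable[[str, str, str], bool]
Grid = dict[tuple[str, str], float]


def pass_rate_grid(
    models: list[str], harnesses: list[str], tasks: list[str], run_task: RunTask
) -> Grid:
    """Pass rate for every (model, harness) pair on the same task list."""
    if not tasks:
        raise ValueError("need at least one task")
    grid: Grid = {}
    for model, harness in product(models, harnesses):
        passed = sum(run_task(model, harness, task) for task in tasks)
        grid[(model, harness)] = passed / len(tasks)
    return grid


def model_effect(grid: Grid, harness: str, model_a: str, model_b: str) -> float:
    """Pass-rate difference between two models with the harness held fixed."""
    return grid[(model_a, harness)] - grid[(model_b, harness)]
```

Comparing a cheap model inside a repair harness against a strong model with no harness reads two different cells of this grid on both axes at once, which is the confound described above.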
00:04:12 lenar: Which makes the next item sit at an awkward angle. The same model Awais is praising, DeepSeek V4 Flash, is right now getting support in llama.cpp. There's a work-in-progress pull request, number 24162, that someone on the local model subreddit flagged, posting basically 'DeepSeek V4 Flash is amazing,' with the caveat that it's very early and only worth trying if you enjoy living on the experimental edge.
00:04:39 damra: So the model that needed six hundred billion tokens a day of repair infrastructure to shine is about to be runnable on a laptop by hobbyists. Those two facts are in tension. What does a person actually get when they pull that build down today?
00:04:53 lenar: By the poster's own description, an early experiment. The pull request is at a rough stage. You get weights running locally through llama.cpp, which is the thing people care about: no API key, no rate limit, your hardware. What you don't get, and this is the catch, is Awais's repair layer. That's proprietary to Command Code. So the local runner inherits the raw tool-confusion behavior.
00:05:18 damra: Right, and that's the gap nobody markets. The excitement on the thread is about access: I can run the frontier-ish model myself. But access to the weights isn't access to the experience he's describing. The cheap model is right there; the thing that makes it pleasant to code with isn't in the box.
00:05:35 lenar: And I don't want to be sour about it, because local support for a strong open model is real progress, and the error handling is the kind of problem open-source tooling tends to chew through eventually. Someone will write an open repair layer once the behavior is well understood. But today, if you pull that build expecting Awais's results, you'll be disappointed, and you'll blame the wrong thing.
00:05:58 damra: It's a clean illustration of the whole open-weights bargain. You get the model. You also inherit every job the lab's serving stack was doing for you behind the API.
00:06:07 lenar: There's a second piece of what Awais described that connects to a paper out this week, and I think it's the more durable idea. Command Code has a component he calls Taste. When you merge to your main branch, it extracts your per-repository preferences, how this codebase likes things done, into short, skill-like files. Small, reusable, specific to the repo.
00:06:28 damraSo it's watching your merges and distilling 'here's how we actually write code in this project' into a file the agent can load later. That's a memory layer dressed as preferences.
00:06:38 lenarAnd the paper makes that into a named primitive. It's from Gal Bakal and colleagues — title is Knowledge Activation — and they argue that AI skills should be treated as the institutional-knowledge unit for agentic software development. Their framing: enterprises accumulate critical know-how that lives in people's heads and old pull requests, and agents keep re-deriving it badly. So you capture it as discrete, activatable skills.
00:07:04 damraI like the diagnosis more than I trust the cure, and let me say why. The diagnosis is dead-on: every agent I've run into rediscovers the project's conventions from scratch, every session, and gets them subtly wrong. Capturing that once is obviously right. But a skill file is just frozen context. The moment the codebase moves and the skill doesn't, you've got confident, well-formatted, stale instructions. Who keeps the skill current?
00:07:32 lenarThat's the unanswered question in both. Awais's version has a natural trigger — the skill regenerates on merge, so at least it's tied to a real event in the repo's life. The paper is more abstract about maintenance. And there's a sharper worry underneath, which a different paper this week happens to name.
00:07:48 damraGo on.
00:07:49 lenarThere's a study called Mutation Without Variation, looking at what happens when a large language model repeatedly mutates a program. They ask whether it explores new forms, or just keeps circling back to the same handful. Their finding leans toward convergence — the model's edits collapse toward a narrow band rather than spreading out.
00:08:08 damraAnd if you connect that to skills — extracted preferences plus a model that converges — you can talk yourself into a codebase that slowly homogenizes. The skill encodes 'how we do it,' the model already prefers that form, and divergent-but-better approaches stop getting proposed. Let me be careful: that's me joining two papers that didn't cite each other, so treat it as a hypothesis, not a result. But that convergence is the thing I'd test for first.
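One rough way to test for that convergence is sketched below. It is not the metric used in Mutation Without Variation, whose methodology is not covered here. The functions `canonical_form` and `distinct_ratio` are hypothetical names, and the sketch assumes the mutated programs are Python source that parses cleanly.

```python
import ast


def canonical_form(source: str) -> str:
    """Reduce source to its syntax tree so formatting and comments do not count."""
    return ast.dump(ast.parse(source))


def distinct_ratio(variants: list[str]) -> float:
    """Fraction of variants that are structurally distinct (1.0 means all differ)."""
    if not variants:
        raise ValueError("need at least one variant")
    forms = {canonical_form(v) for v in variants}
    return len(forms) / len(variants)


# Four edits of the same line: three are the same tree, one swaps the operands.
edits = ["x = 1 + 2", "x=1+2  # same", "x = 2 + 1", "x = 1 + 2\n"]
print(distinct_ratio(edits))  # 0.5
```

A ratio that falls as the number of mutation rounds grows would be consistent with the collapse described above. Comparing syntax trees only catches exact structural repeats, so it understates convergence between near-identical programs.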
00:08:36 lenarLet's go to measurement, because three papers this week poke at the gap between what we score and what we get. The biggest swing is a benchmark called Agents' Last Exam, which they shorten to A-L-E. The premise in their abstract is blunt: recent systems ace a wide range of benchmarks, and those gains haven't translated into economically meaningful work. So they built a benchmark aimed at economically valuable, real-world tasks across professional domains.
00:09:04 damraThat author list is enormous, by the way — it reads like a few dozen labs pooled effort. Which is its own signal: a lot of people independently decided the existing benchmarks stopped predicting anything useful. What do the actual tasks look like? Because 'economically valuable' is easy to say and hard to grade.
00:09:23 lenarThat's where my read runs out, working from the abstract alone — they describe professional-domain tasks meant to track real economic output rather than puzzle-solving, but I haven't worked through the full methodology, so I won't oversell how well they pulled it off. The intent is what's notable: stop rewarding models for acing tests, and start asking whether the work would survive contact with a paying client.
00:09:46 damraPair that with the second one — SentinelBench. That's from a Microsoft-heavy author group, and it's a benchmark for long-running monitoring agents. The setup they call out is that agents are increasingly asked to do work spanning minutes, hours, or longer, and the default way we evaluate them assumes short, one-shot tasks.
00:10:05 lenarWhich is a real blind spot. A model that's brilliant for thirty seconds and forgets why it's running after twenty minutes will pass almost every benchmark we have and fail the actual job, because the actual job is to sit there and stay coherent. SentinelBench at least tries to score the endurance, not the sprint.
00:10:23 damraAnd the third one is what makes me nervous about the whole evaluation stack. It's called Stability versus Manipulability, looking at large-language-model judges — the practice of using a model to grade other models' outputs. They test robustness under what they call post-decision interaction: the judge renders a verdict, then something pushes back, and they measure whether the judge holds or caves.
00:10:48 lenarAnd let me guess — it caves more than you'd want.
00:10:50 damraThat's the worry the title telegraphs. I haven't run their numbers myself, so I'll keep it at the level of: they think it's enough of a problem to name it manipulability. And a lot of benchmark pipelines use a model as judge now. So a judge that can be talked out of its verdict means your leaderboard has a soft spot — one anyone who knows the trick can lean on. Put the three together and it's one finding from three directions: we're measuring the wrong things, or measuring the right things in fragile ways.
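The verdict-then-pushback setup can be sketched as a simple flip-rate measurement. This is an illustration of the idea, not the paper's protocol or numbers. The `Judge` signature, the `flip_rate` function, and the toy judge below are assumptions; a real harness would call a model where the stub is.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: (answer, pushback messages so far) -> verdict.
Judge = Callable[[str, Sequence[str]], str]


def flip_rate(judge: Judge, answers: Sequence[str], pushback: str) -> float:
    """Fraction of answers whose verdict changes after one pushback message."""
    if not answers:
        raise ValueError("need at least one answer")
    flips = 0
    for answer in answers:
        first = judge(answer, [])
        second = judge(answer, [pushback])
        if second != first:
            flips += 1
    return flips / len(answers)


def toy_judge(answer: str, history: Sequence[str]) -> str:
    """Stand-in judge that caves on borderline answers once anyone objects."""
    if "good" in answer:
        return "pass"
    if history and "borderline" in answer:
        return "pass"
    return "fail"


answers = ["good one", "bad one", "borderline bad", "borderline bad two"]
print(flip_rate(toy_judge, answers, "Are you sure? I think this passes."))  # 0.5
```

A flip rate near zero would indicate a stable judge. With a real, non-deterministic model judge, the same measurement needs a no-pushback baseline, since some verdicts change between repeated calls anyway.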
00:11:20 lenarAnd that loops straight back to Command Code, doesn't it. Awais's whole point was that the model's benchmark score didn't predict its real coding behavior — the harness did. These three papers are the academic version of the same complaint: the score and the job have drifted apart.
00:11:36 lenarNow the item I'll handle most carefully, because the source chain is thin. There's a post going around the Claude subreddit pointing at an Anthropic piece titled 'When AI builds itself.' The poster's summary is that Anthropic is describing handing more and more of its own AI development over to its AI systems, and that the work is, in their words, already accelerating its own development.
00:11:59 damraAnd let me flag the obvious before we touch the substance: that's a Reddit summary of a company blog post, with no engagement on the thread, and the post even references numbers that the excerpt cuts off. So we don't have the figures in front of us. I'm not going to repeat a percentage I can't see.
00:12:15 lenarAgreed, and we shouldn't. What we can talk about is what the claim is actually asserting, because it's a claim labs keep making and it deserves a calm look. 'AI is accelerating its own development' can mean something mundane and true — engineers at every lab now use coding agents heavily, so of course internal development is faster. Or it can mean something much larger and much less demonstrated — that the systems are meaningfully designing their successors. Those are very different statements wearing the same sentence.
00:12:45 damraAnd a company has every incentive to let you hear the second one while only being able to support the first. If I'm filing toward an I-P-O — which Anthropic has been reported to be doing — 'our AI is improving our AI' is a story that prices well. That doesn't make it false. It means I want the artifact, not the blog framing. Show me which parts of the pipeline the models own end to end, with a human only signing off, versus which parts are a person using autocomplete that's gotten very good.
00:13:18 lenarAnd there's a governance thread that connects here, from reporting over the last day or so. Anthropic's been pushing a coordinated development-pause proposal, and the standard objection is verification: how would anyone confirm a lab actually slowed down? A paper out this week speaks to exactly that. It argues zero-knowledge verification for frontier training is possible, using zero-knowledge virtual machines and Merkle commitments to prove properties about a training run without exposing the underlying weights or data.
00:13:48 damraThat's an interesting pairing. If a lab claims its models are doing the building, and separately claims everyone should pause, the same question sits under both: can an outsider check the claim? Zero-knowledge proofs over training compute are at least a mechanism where the answer isn't just 'trust the press release.' Whether it's practical at frontier scale is a different matter — proving things about a training run that large isn't cheap — but naming a cryptographic path is more than hand-waving.
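The Merkle commitment half of that idea is small enough to sketch. The code below shows only a commitment and an inclusion proof over a list of items, such as checkpoint digests. It contains no zero-knowledge proof and does not reproduce the paper's construction. The function names and the `ckpt-*` leaf values are assumptions for illustration.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _leaf_level(leaves: list[bytes]) -> list[bytes]:
    if not leaves:
        raise ValueError("need at least one leaf")
    # Prefix bytes keep leaf hashes and internal-node hashes in separate domains.
    return [_h(b"\x00" + leaf) for leaf in leaves]


def _parent_level(level: list[bytes]) -> list[bytes]:
    return [_h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves: list[bytes]) -> bytes:
    """Single 32-byte commitment to the whole ordered list of leaves."""
    level = _leaf_level(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # simple padding; real schemes treat this more carefully
        level = _parent_level(level)
    return level[0]


def merkle_proof(leaves: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    """Sibling hashes from leaf to root; the flag is True when the sibling is on the left."""
    if not 0 <= index < len(leaves):
        raise IndexError("leaf index out of range")
    level = _leaf_level(leaves)
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        level = _parent_level(level)
        index //= 2
    return proof


def verify_proof(leaf: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the path to the root using only the leaf and its sibling hashes."""
    node = _h(b"\x00" + leaf)
    for sibling, sibling_is_left in proof:
        node = _h(b"\x01" + (sibling + node if sibling_is_left else node + sibling))
    return node == root


checkpoints = [b"ckpt-0", b"ckpt-1", b"ckpt-2"]  # hypothetical checkpoint digests
root = merkle_root(checkpoints)
print(verify_proof(b"ckpt-1", merkle_proof(checkpoints, 1), root))  # True
```

A lab could publish the root while keeping the leaves private, then later show that one item belongs to the committed set by revealing only that item and a short proof. Proving a property of the run without revealing even that item is the part the zero-knowledge machinery would have to supply.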
00:14:16 lenarSo the way I'd hold the Anthropic item: the direction is plausible and probably partly true, the specific magnitude is unverified from where we're sitting, and the most useful response isn't excitement or dismissal — it's asking what verifiable evidence would look like. That zero-knowledge paper is at least gesturing at the answer.
00:14:35 lenarLast one, and it's the one with the most immediate stakes for ordinary people. The Washington Post — Elizabeth Dwoskin's reporting, via Techmeme — has a piece on the Trump administration's push to integrate AI into the healthcare system. The detail that stands out is a regulatory fast track at the Food and Drug Administration for digital health tech, and the example named is AI chatbots.
00:14:58 damraA fast track at the Food and Drug Administration for AI chatbots is a phrase that should make anyone who's watched a model confidently make something up slow down. What does 'fast track' mean here concretely — lighter evidence, quicker review, or a new category that skips part of the process?
00:15:14 lenarI can't tell you the precise mechanism from the summary alone. The reporting describes a fast track and an administration push, and I'd want the actual FDA guidance text before I characterized exactly what's being waived or accelerated. So treat the specifics as reported, not yet pinned down. What's clear is the direction: the agency is being steered to move faster on approving these tools, in a domain where the cost of being wrong is a person acting on a bad answer about their own health.
00:15:41 damraAnd this is where I land differently than I do on the coding stuff. With Command Code, a bad tool call costs you a retry. In a clinical setting, the chatbot's version of tool confusion costs a patient. The same model behavior (confidently restating a wrong answer, ignoring a correction) has a wildly different cost depending on where you deploy it. A fast track is a policy decision to accept more of that risk in exchange for speed.
00:16:07 lenarAnd I'd be fair to the other side, because there is one. Access to care is broken for a lot of people, and a decent triage assistant at three in the morning beats nobody at all. The administration's case is presumably that speed reaches people who currently get nothing. The tension worth holding: the same fast track that helps the underserved also ships unvetted tools to the trusting. And the agency exists precisely to sit in that gap.
00:16:33 damraSo the line that matters is whether the actual guidance, when it's published, sorts by stakes — a chatbot that helps you book an appointment isn't the same product as one that tells you whether your chest pain needs an ambulance. If the fast track treats those the same, that's where it goes wrong.
00:16:48 lenarThat's a good note to end on, because it rhymes with the whole day. Awais showed that the layer around the model decides whether it's usable. The benchmark papers showed our scores have drifted from the work. And the healthcare story is the same idea with the stakes raised: the model is never the whole system, and the consequences live in whatever we wrap around it — the repair logic, the skill files, the review gate, and the regulation. Two things will tell me where this goes next: whether anyone ships an open repair layer for these local DeepSeek builds, and whether the Food and Drug Administration's actual guidance draws its line by stakes rather than by category.