◆ Dispatch 040 · 2026-05-29 GSV Locally Coherent, Globally Not

2026-05-29 / 00:22:01 / 80 sources

“Are you measuring whether the model said no, or whether the model couldn't say yes? Those are different tests, and we mostly run the first.”
— Lenar Kess, today's narration

Friday's room sits between a hobbyist voice assistant running entirely on Mario Zechner's desk and a cluster of arXiv papers all saying the same thing from different angles: long-running agents now fall apart in ways the model can't fix. Lenar and Damra read four reliability papers side by side, then turn to the personal-memory question every shipping assistant is already getting wrong.

Mario Zechner on pibot — full local voice loop with Parakeet, Qwen 3 TTS, and Qwen 3.6 through llama.cpp, with the STT and TTS engines ported from Python into Rust on mlx-c. The runtime detail is the news, not the model lineup.
Ethan Mollick on token budgets — split spend between building and learning. Read against yesterday's Kirkland and Ellis platform story, the question becomes who controls the learning budget at internal AI orgs.
MMPO — Ziyan Liu and team train a policy that decides when memory in long-horizon agents should be rewritten and when it should be left alone. Belief drift comes from over-eager rewrites, not missing updates.
RedundancyBench — Minyang Hu's group benchmarks how many steps in a long agent trajectory are repeats. Stale duplicates of state crowd out the relevant signal in context.
Locally Coherent, Globally Incoherent — Anany Kotawala's single-author paper bounds compositional incoherence in multi-component agents. Defensible local outputs assemble into contradictory global ones.
Agent-Radar — Hongxiang Zhang's group steers attention toward context-relevant tokens in multi-agent communication, so the receiver isn't drowned in noise from the sender.
Selective QA over conflicting personal memory — Tiancheng Yang's testbed for what happens when your assistant's memories about you disagree. No single resolution strategy dominates.
BioRefusalAudit — Caleb DeLeeuw uses sparse autoencoders to ask whether a model's refusal is shallow pattern matching or whether the dangerous capability isn't there at all.
AutoformBot and Atlas — Ahmad Rammal's team at FAIR Paris and NYU on a multi-agent system that pulls textbook math into Lean 4 at scale. Lean is the verifier the agents can't argue with.

Sources

80 cited

1
OpenAI · 47m40s

Video OpenAI

Build Hour: Agents SDK — Build with the next evolution of the Agents SDK. In this Build Hour, you’ll learn how to use the updated Agents SDK to build long-running agents with a model-native harness. Give agents the…
www.youtube.com/watch?v=tK32trvj_b4 →

Context
Directly addresses agentic coding tools, agent infrastructure, and the shifting craft of software engineering with technical depth.
Provenance
Video · Supporting source
2
arXiv cs.AI - Research Science (GLOBAL)

Article Al Kari

The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling - arXiv:2605.28864v1 Announce Type: new Abstract: The Cognitive Categorical Transformer (CCT) is a 306M-parameter...
arxiv.org/abs/2605.28864 →

Context
This is a primary artifact (arXiv paper) detailing a novel, theoretically grounded model architecture (CCT) and providing quantitative evidence of performance improvement via category theory concepts.
Provenance
Article · Supporting source
3
arXiv cs.AI - Research Science (GLOBAL)

Article Jiachen Zhang (Peking University, China Agricultural University), Junyi Lao (Peking University), Chenghao Liu (Peking University), Siyuan Liu (Peking University), Shixin Wu (Peking University), Linsen Zhang (Peking University), Boyu Wang (Peking University), Songfang Huang (Peking University)

VFEAgent: A Multimodal Agent Framework for End-to-End Automated Finite Element Analysis - arXiv:2605.28978v1 Announce Type: new Abstract: Finite Element Analysis (FEA) serves as the cornerstone of modern engineering...
arxiv.org/abs/2605.28978 →

Context
This paper describes an agentic system (VFEAgent) automating a complex, domain-specific engineering workflow (FEA) from multimodal inputs. This is a core example of AI applied to physical-world engineering.
Provenance
Article · Supporting source
4
arXiv cs.AI - Research Science (GLOBAL)

Article Sara Metcalf, William Schoenberg

BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation - arXiv:2605.28994v1 Announce Type: new Abstract: AI tools to support real world decision making must be able to build simulation models that inform...
arxiv.org/abs/2605.28994 →

Context
Establishes a new, open-source benchmark (BEAMS) for AI in critical real-world modeling/simulation, directly impacting AI's utility and trustworthiness.
Provenance
Article · Supporting source
5
arXiv cs.AI - Research Science (GLOBAL)

Article Aisha Najera, Alvin Moon, Vedant Srinivasan, Rajesh Veeraraghavan

When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis - arXiv:2605.29025v1 Announce Type: new Abstract: Federal agencies are deploying large language models (LLMs) to categorize public comment...
arxiv.org/abs/2605.29025 →

Context
Addresses the critical issue of model disagreement in real-world applications (public policy/federal agencies), directly impacting how intelligence is used and interpreted.
Provenance
Article · Supporting source
6
arXiv cs.AI - Research Science (GLOBAL)

Article Diego Gosmar, Deborah A. Dahl

Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching - arXiv:2605.29055v1 Announce Type: new Abstract: Hallucination remains a major reliability barrier for production...
arxiv.org/abs/2605.29055 →

Context
New research on hallucination mitigation, agentic pipelines, and semantic caching directly addresses reliability, infrastructure, and agentic tools.
Provenance
Article · Supporting source
7
arXiv cs.AI - Research Science (GLOBAL)

Article Siddharth Sai, Xiaofei Wen, Muhao Chen

Robust and Efficient Guardrails with Latent Reasoning - arXiv:2605.29068v1 Announce Type: new Abstract: Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world...
arxiv.org/abs/2605.29068 →

Context
This paper proposes COLAGUARD, a novel, efficient guardrail model for LLMs. It directly addresses the core tension between safety robustness and high-throughput deployment, which is critical for real-world AI infrastructure.
Provenance
Article · Supporting source
8
arXiv cs.AI - Research Science (GLOBAL)

Article Tyler Akidau, Tyler Rockwood, Johannes Brüderl, Marc Millstone

The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane - arXiv:2605.29082v1 Announce Type: new Abstract: AI agents are increasingly expected to operate as digital employees:...
arxiv.org/abs/2605.29082 →

Context
This paper proposes a critical architectural solution (ADP) for governing autonomous agents, directly addressing safety, policy, and enterprise data access—a core topic.
Provenance
Article · Supporting source
9
arXiv cs.AI - Research Science (GLOBAL)

Article Yubo Li, Ramayya Krishnan, Rema Padman

The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure - arXiv:2605.29087v1 Announce Type: new Abstract: Reasoning models are evaluated on single-turn benchmarks but...
arxiv.org/abs/2605.29087 →

Context
Reports a specific, measurable failure mode (unfaithful capitulation) in reasoning models under adversarial pressure, directly impacting model reliability and safety.
Provenance
Article · Supporting source
10
arXiv cs.AI - Research Science (GLOBAL)

Article Shreyas Fadnavis, Praitayini Kanakaraj, Felix Wyss

Beyond Consensus: Trace-Level Synthesis in Mixture of Agents - arXiv:2605.29116v1 Announce Type: new Abstract: When multiple LLM agents solve the same problem, standard practice compresses each agent's reasoning into a...
arxiv.org/abs/2605.29116 →

Context
This paper directly addresses agentic systems and the 'craft' of AI reasoning, arguing for trace-level synthesis over simple consensus voting. This is a core technical advance for agentic tools.
Provenance
Article · Supporting source
11
arXiv cs.AI - Research Science (GLOBAL)

Article Yifei He, Rui Yang, Hao Bai, Tong Zhang, Han Zhao

PRO-CUA: Process-Reward Optimization for Computer Use Agents - arXiv:2605.29119v1 Announce Type: new Abstract: Computer use agents (CUAs) have shown strong potential for automating complex digital workflows, yet their...
arxiv.org/abs/2605.29119 →

Context
This paper introduces a new framework (PRO-CUA) for training computer use agents (CUAs), directly addressing agentic coding/workflow automation and AI infrastructure challenges.
Provenance
Article · Supporting source
12
arXiv cs.AI - Research Science (GLOBAL)

Article Dueun Kim, Albert No

The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models - arXiv:2605.29123v1 Announce Type: new Abstract: Masked diffusion language models (MDMs) uniquely support any-order generation, with...
arxiv.org/abs/2605.29123 →

Context
Directly addresses model failure modes and reasoning limitations in diffusion models, a core topic for frontier AI research.
Provenance
Article · Supporting source
13
arXiv cs.AI - Research Science (GLOBAL)

Article Muhammad Zia Hydari, Raja Iqbal, Narayan Ramasubbu

Governing Technical Debt in Agentic AI Systems - arXiv:2605.29129v1 Announce Type: new Abstract: Agentic AI systems are increasingly being explored as production infrastructure: they reason over multiple steps, call...
arxiv.org/abs/2605.29129 →

Context
Defines 'Agentic Technical Debt' and 'Stochastic Tax,' directly addressing governance and infrastructure challenges in agentic AI systems.
Provenance
Article · Supporting source
14
arXiv cs.AI - Research Science (GLOBAL)

Article Daniel Lee, Owen Queen, James Zou

ReasonOps: Operator Segmentation for LLM Reasoning Traces - arXiv:2605.29192v1 Announce Type: new Abstract: Chain-of-thought traces from large reasoning models can span tens of thousands of tokens, yet we lack a...
arxiv.org/abs/2605.29192 →

Context
This paper introduces ReasonOps, a method to analyze and structure LLM reasoning traces, revealing common compositional structures and model fingerprints. This is core research on AI capability and understanding.
Provenance
Article · Supporting source
15
arXiv cs.AI - Research Science (GLOBAL)

Article Tenghao Huang, Kung-Hsiang Huang, Prafulla Kumar Choubey, Yilun Zhou, Muhao Chen, Jonathan May, Chien-Sheng Wu

GTA: Generating Long-Horizon Tasks for Web Agents at Scale - arXiv:2605.29218v1 Announce Type: new Abstract: Web agents, which couple language models with browsing and tool-use capabilities, show promise as open web...
arxiv.org/abs/2605.29218 →

Context
This paper introduces a scalable benchmark (GTA) for web agents, directly addressing the core topic of agentic coding tools and practice. It's a primary artifact with clear downstream consequence for agent development.
Provenance
Article · Supporting source
16
arXiv cs.AI - Research Science (GLOBAL)

Article Jiahao Huang, Fei Cheng, Junfeng Jiang, Zefan Yu, Akiko Aizawa

BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents - arXiv:2605.29225v1 Announce Type: new Abstract: Self-evolving agents improve over time by reflecting on past failures, but...
arxiv.org/abs/2605.29225 →

Context
Introduces BenchTrace, a new benchmark for evaluating self-evolving LLM agents, directly addressing agentic coding tools and agentic practice.
Provenance
Article · Supporting source
17
arXiv cs.AI - Research Science (GLOBAL)

Article Benlong Wu, Weiming Zhang, Kejiang Chen, Han Fang, Nenghai Yu

Provably Secure Agent Guardrail - arXiv:2605.29251v1 Announce Type: new Abstract: As large language models transition from bounded generative engines to agents with expansive execution privileges, AI going out of...
arxiv.org/abs/2605.29251 →

Context
This paper proposes a formal, provably secure guardrail for agents, directly addressing the core risk of autonomous AI systems going rogue. It's a major technical artifact.
Provenance
Article · Supporting source
18
arXiv cs.AI - Research Science (GLOBAL)

Article Yibing Liu, Yangze Liu, Xiaolong Yin, Bin Wang, Chong Zhang, Hao Yin, Zhongyi Han

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories - arXiv:2605.29253v1 Announce Type: new Abstract: Task success can hide process anomalies in real-world agent executions. An...
arxiv.org/abs/2605.29253 →

Context
Introduces OpenClawBench, a large-scale dataset for measuring process-side anomalies in real agent execution. Directly addresses agentic tools and reliability.
Provenance
Article · Supporting source
19
arXiv cs.AI - Research Science (GLOBAL)

Article Shijie Cao, Yuan Yuan, Jing Liu

Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling - arXiv:2605.29262v1 Announce Type: new Abstract: The Dynamic Flexible Job Shop Scheduling Problem...
arxiv.org/abs/2605.29262 →

Context
This paper proposes an agentic framework (RACE-Sched) for dynamic scheduling, directly addressing the core tension between real-time constraints and long-horizon reasoning in industrial control systems.
Provenance
Article · Supporting source
20
arXiv cs.AI - Research Science (GLOBAL)

Article Yang Zhang, Xiukun Wei, Xueru Zhang

When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop - arXiv:2605.29267v1 Announce Type: new Abstract: Foundation models are increasingly trained on synthetic data generated...
arxiv.org/abs/2605.29267 →

Context
Addresses model collapse and alignment failure in multi-model, self-consuming training loops, directly impacting AI infrastructure and control.
Provenance
Article · Supporting source
21
arXiv cs.AI - Research Science (GLOBAL)

Article Wei Zheng, Yang Yan, Yiyang Shao, Jinyang Li, Zeze Chang, Yukuang Jia, Qiming Mao, Chihyung Wang, Jingbin Zhou

Indexing the Unreadable: LLM-Native Recursive Construction and Search of Service Taxonomies - arXiv:2605.29270v1 Announce Type: new Abstract: The era of the Internet of Agents (IoA) is taking shape: LLM agents are...
arxiv.org/abs/2605.29270 →

Context
Addresses a core infrastructure problem (context management/service discovery) for the 'Internet of Agents' (IoA), a key near-future topic.
Provenance
Article · Supporting source
22
arXiv cs.AI - Research Science (GLOBAL)

Article Vaishali Senthil, Ashutosh Hathidara, Sebastian Schreiber

CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval - arXiv:2605.29271v1 Announce Type: new Abstract: Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user...
arxiv.org/abs/2605.29271 →

Context
This paper addresses a core bottleneck for LLM agents: tool retrieval from large API catalogs. It proposes a novel co-training method (CoHyDE) that improves agent capability.
Provenance
Article · Supporting source
23
arXiv cs.AI - Research Science (GLOBAL)

Article Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng, Tong Xu, Yi Zheng, Zhefeng Wang, Enhong Chen

Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models - arXiv:2605.29303v1 Announce Type: new Abstract: Supervised fine-tuning (SFT) followed by reinforcement...
arxiv.org/abs/2605.29303 →

Context
This is a new, technical paper (arXiv) proposing a novel fine-tuning method (EKSFT) for LLMs, directly impacting model training and capability.
Provenance
Article · Supporting source
24
arXiv cs.AI - Research Science (GLOBAL)

Article Yilun Yao, Jiaming Pan, Elsie Dai, Peizhuang Cong, Yaoming Li, Tong Yang

ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression - arXiv:2605.29350v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) language models reduce per-token computation but still require...
arxiv.org/abs/2605.29350 →

Context
This is a primary artifact (arXiv paper) detailing a novel, train-free compression technique (ConMoE) for MoE models, directly impacting AI infrastructure and deployment.
Provenance
Article · Supporting source
25
arXiv cs.AI - Research Science (GLOBAL)

Article Yiqun Liu, Yingsheng Wu, Ruqi Yang, Enrong Zheng, Honglei Qiu, Sijun He, Tai Liang, Jingjing Wu, Yuhan Zhou, Yiwei Zhang, Dongyan Chen, Weihan Yi, Xinqi Li, Siqi Bao

PassNet: Scaling Large Language Models for Graph Compiler Pass Generation - arXiv:2605.29357v1 Announce Type: new Abstract: Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream...
arxiv.org/abs/2605.29357 →

Context
Addresses AI infrastructure (compilers, optimization) and the shifting craft of software engineering (LLMs for compiler passes). Primary artifact (PassNet/PassBench) with clear downstream consequence.
Provenance
Article · Supporting source
26
arXiv cs.AI - Research Science (GLOBAL)

Article Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet - arXiv:2605.29358v1 Announce Type: new Abstract: We demonstrate that sparse autoencoders can extract interpretable features from Claude 3...
arxiv.org/abs/2605.29358 →

Context
A primary artifact (arXiv paper) detailing feature extraction and interpretability from a major proprietary model (Claude 3 Sonnet). Directly addresses model internals and control.
Provenance
Article · Supporting source
27
arXiv cs.AI - Research Science (GLOBAL)

Article Zhihao Liu, Yifan Wu, Jian Lou, Di Wang, Yuxi Zhou, Yuke Hu

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization - arXiv:2605.29396v1 Announce Type: new Abstract: Safety alignment for large language models (LLMs) aims to reduce harmful or unsafe...
arxiv.org/abs/2605.29396 →

Context
This is a primary research artifact (arXiv paper) directly addressing LLM safety and robustness, a core concern in AI infrastructure and power dynamics.
Provenance
Article · Supporting source
28
arXiv cs.AI - Research Science (GLOBAL)

Article Rahul Bissa, Abhishek Vyas, Yash Jain

Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark - arXiv:2605.29400v1 Announce Type: new Abstract: We benchmark three supervised fine-tuned models against...
arxiv.org/abs/2605.29400 →

Context
This is a primary artifact (arXiv paper) detailing a specific benchmark (PiSAR) and showing a massive performance gap between fine-tuned models and frontier zero-shot baselines. It directly impacts agentic coding/behavior prediction.
Provenance
Article · Supporting source
29
arXiv cs.AI - Research Science (GLOBAL)

Article Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen, Xinjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation - arXiv:2605.29430v1 Announce Type: new Abstract: Automatic speech recognition (ASR) is a core component of...
arxiv.org/abs/2605.29430 →

Context
Presents a new, agentic framework (Agentic ASR) for speech recognition, directly addressing the limitations of current single-pass systems. This is a primary artifact changing the developer's mental model for building AI agents.
Provenance
Article · Supporting source
30
arXiv cs.AI - Research Science (GLOBAL)

Article Zeli Su, Zhankai Xu, Tianlei Chen, Longfei Zheng, Xiaolu Zhang, Jun Zhou, Wentao Zhang

The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF - arXiv:2605.29491v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed in...
arxiv.org/abs/2605.29491 →

Context
This paper addresses a critical robustness gap in RAG/agentic systems (distractor instructions), directly impacting LLM reliability and deployment in real-world, noisy data environments.
Provenance
Article · Supporting source
31
arXiv cs.AI - Research Science (GLOBAL)

Article Kevin Wang, Anna Thöni, Benjamin Kempinski, Bobby Cheng, Jianzhu Yao, Benjamin Finch, Leon Guertler, Viraj Nadkarni, Yihan Jiang, Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov, Siyuan Wu, Yu-Chi Cheng, Yan-Ru Ju, Ti-Rong Wu, I-Hsuan Chu, Yu-Yu Yang, I-Chen Wu, Yitian Huang, Qinlu Cao, Yiheng Sun, Yuhong Dai, Hongkun Yao, Jingxuan Fu, Jiwei Zhang, Hao Liao, Mossimo Ebeling, Govind Arun, Sadhvik Bathini, Mihir S Arya, Avinash Anish, Aditya Ranjan, Kirtana Sunil Phatnani, Paval KS, Vrushali Mehta, Aravind S, Nikhil Arora, Tanya Upadhyay, Amol Bandagale, Yuan Lu, ChunEn Hsiao, YuTing Lin, Arvin Chung, Jerry John Thomas, Mathieu Laurière, Leshem Choshen, Yoram Bachrach, Pramod Viswanath, Maria Polukarov, Cheston Tan, Tal Kachman, Atlas Wang

MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs - arXiv:2605.29512v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed as interactive agents,...
arxiv.org/abs/2605.29512 →

Context
Introduces a new, comprehensive multi-agent evaluation platform (Mindgames) and dataset, directly addressing the core topic of agentic tools and power dynamics.
Provenance
Article · Supporting source
32
arXiv cs.AI - Research Science (GLOBAL)

Article Zekai Yu, Qi Meng, Qizhi Chu, Yu Hao, Chuan Shi, Cheng Yang

ParaTool: Shifting Tool Representations from Context to Parameters - arXiv:2605.29561v1 Announce Type: new Abstract: Tool calling extends large language models (LLMs) by enabling grounded interaction with external...
arxiv.org/abs/2605.29561 →

Context
This paper proposes a fundamental architectural shift for tool use in LLMs, moving from context-based documentation to parameter-based integration. This directly impacts agentic coding and LLM infrastructure.
Provenance
Article · Supporting source
33
arXiv cs.AI - Research Science (GLOBAL)

Article Kangrui Wang, Linjie Li, Zhengyuan Yang, Shiqi Chen, Zihan Wang, Li Fei-Fei, Jiajun Wu, Leonidas Guibas, Lijuan Wang, Manling Li

Planning with the Views via Scene Self-Exploration - arXiv:2605.29563v1 Announce Type: new Abstract: Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view...
arxiv.org/abs/2605.29563 →

Context
This paper details a critical planning gap in VLMs (view planning) and proposes a novel self-exploration framework. It directly addresses frontier model capabilities and 3D reasoning.
Provenance
Article · Supporting source
34
arXiv cs.AI - Research Science (GLOBAL)

Article Yang He, Xiao Ding, Bibo Cai, Yufei Zhang, Kai Xiong, Zhouhao Sun, Bing Qin, Ting Liu

DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning - arXiv:2605.29568v1 Announce Type: new Abstract: Tool-Integrated Reasoning (TIR) extends LLM...
arxiv.org/abs/2605.29568 →

Context
This is a primary artifact (arXiv paper) detailing a novel framework (DeepTool) for improving LLM reasoning and tool use via Process-Supervised RL. Directly relates to agentic coding tools and the shifting craft of software engineering.
Provenance
Article · Supporting source
35
arXiv cs.AI - Research Science (GLOBAL)

Article Silu Panda

FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification - arXiv:2605.29586v1 Announce Type: new Abstract: We introduce FinVerBench, a benchmark and validity study for...
arxiv.org/abs/2605.29586 →

Context
This introduces a new, specialized benchmark (FinVerBench) using SEC filings for LLM financial verification. It directly addresses model reliability and real-world application in finance.
Provenance
Article · Supporting source
36
arXiv cs.AI - Research Science (GLOBAL)

Article Junyoung Park, Sunghwan Park, Seongyong Ju, Jaewoo Lee

Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures - arXiv:2605.29629v1 Announce Type: new Abstract: Attack Success Rate (ASR) evaluates each jailbreak with a single yes/no label at the...
arxiv.org/abs/2605.29629 →

Context
This paper introduces a new, more granular safety evaluation metric (TLO) that moves beyond simple success/failure rates. It directly impacts how LLM safety is tested and deployed.
Provenance
Article · Supporting source
37
arXiv cs.AI - Research Science (GLOBAL)

Article Jiajie Fu, Junwen Chen, Mengzhao Wang, Aoxiang He, Maojia Sheng, Xiangyu Ke, Yifan Zhu, Yunjun Gao

VikingMem: A Memory Base Management System for Stateful LLM-based Applications - arXiv:2605.29640v1 Announce Type: new Abstract: Large Language Models have revolutionized interactive applications; however, their finite...
arxiv.org/abs/2605.29640 →

Context
Addresses the critical technical challenge of state management and long-term memory for LLM applications, a core topic for agentic tools and software engineering.
Provenance
Article · Supporting source
38
arXiv cs.AI - Research Science (GLOBAL)

Article Elliot Gestrin, Jendrik Seipp

LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning - arXiv:2605.29649v1 Announce Type: new Abstract: Heuristic search is the dominant paradigm in symbolic AI planning, and the strongest heuristics are...
arxiv.org/abs/2605.29649 →

Context
This paper reports a primary artifact (new heuristic) that uses LLMs to generate domain-independent planning heuristics, directly addressing the 'agentic coding tools' and 'shifting craft of software engineering' topics.
Provenance
Article · Supporting source
39
arXiv cs.AI - Research Science (GLOBAL)

Article Johannes Moll, Jean-Philippe Corbeil, Jiazhen Pan, Martin Hadamitzky, Daniel Rueckert, Lisa Adams, Keno Bressem

GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents - arXiv:2605.29668v1 Announce Type: new Abstract: LLM agents acting in structured environments fail in operational rather than conversational...
arxiv.org/abs/2605.29668 →

Context
This paper introduces GRASP, a method for reliable self-improvement in LLM agents by preventing catastrophic forgetting (regression). This is a core technical advance in agentic systems.
Provenance
Article · Supporting source
40
arXiv cs.AI - Research Science (GLOBAL)

Article Lorenz Kutschka, Bernhard Geiger

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems - arXiv:2605.29676v1 Announce Type: new Abstract: Large language models in Agentic AI systems consume tool schemas and execution...
arxiv.org/abs/2605.29676 →

Context
This paper directly addresses the infrastructure and efficiency of agentic AI systems by proposing and benchmarking token-optimized data formats (TOON, TRON) to replace JSON for tool schemas and execution results.
Provenance
Article · Supporting source
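
To make the token-format idea in the Notation Matters entry concrete, here is a minimal sketch of why a tabular notation can be shorter than JSON for uniform tool results. The `to_tabular` helper, the sample `results` records, and the header syntax are all hypothetical illustrations. This is not the TOON or TRON specification and not the paper's benchmark, and character count is only a rough stand-in for token count.

```python
import json


def to_tabular(name, rows):
    """Encode a uniform list of dicts as one header line plus one line per row.

    Hypothetical, simplified notation for illustration only: it is not the
    TOON or TRON specification, and it does no quoting or escaping.
    """
    if not rows:
        return f"{name}[0]:"
    fields = list(rows[0].keys())
    # Every row must carry the same keys in the same order, because the
    # field names are written once in the header instead of once per row.
    if any(list(row.keys()) != fields for row in rows):
        raise ValueError("rows are not uniform")
    header = f"{name}[{len(rows)}]{{{','.join(fields)}}}:"
    body = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header] + body)


# Made-up tool-execution results, standing in for what an agent might read.
results = [
    {"id": 1, "tool": "search", "status": "ok"},
    {"id": 2, "tool": "fetch", "status": "ok"},
    {"id": 3, "tool": "parse", "status": "error"},
]

# Compact separators so JSON is not penalized for optional whitespace.
as_json = json.dumps({"results": results}, separators=(",", ":"))
as_table = to_tabular("results", results)

print(as_table)
print("json chars:", len(as_json), "tabular chars:", len(as_table))
```

The saving comes from stating the keys once instead of per record, so it only applies to uniform arrays. Nested or ragged data would need a fallback, and whether a model reads such a format as reliably as JSON is the empirical question a benchmark like this one would have to answer.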
41
arXiv cs.AI - Research Science (GLOBAL)

Article Pedro Orvalho, Marta Kwiatkowska, Guillem Alenyà, Felip Manyà

Reliable Reasoning with Large Language Models via Preference-Based Maximum Satisfiability - arXiv:2605.29687v1 Announce Type: new Abstract: Large Language Models (LLMs) excel at understanding natural language but...
arxiv.org/abs/2605.29687 →

Context
This paper proposes a verifiable, structured approach (MaxSAT) for LLMs to solve complex optimization problems, directly addressing LLM reliability and capability in constrained domains like robotics.
Provenance
Article · Supporting source
42
arXiv cs.AI - Research Science (GLOBAL)

Article Yuchen Liu, Yingjie Feng, Lixiong Qin, Jiasi Chen, Jianing Yu, Sheng Gao, Sheng Yang, Weiran Xu

Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling - arXiv:2605.29697v1 Announce Type: new Abstract: In Agentic Search, trajectory-level outcome rewards fail to quantify the...
arxiv.org/abs/2605.29697 →

Context
This is a new arXiv paper proposing a novel step-level reward mechanism (GDCR/SAPO) for agentic search, directly addressing core challenges in agentic AI.
Provenance
Article · Supporting source
43
arXiv cs.AI - Research Science (GLOBAL)

Article Mincheol Kang, Hyunjin Lim, Bomin Kang, Daehee Park

BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices - arXiv:2605.29705v1 Announce Type: new Abstract: Trajectory prediction is a fundamental task for autonomous systems, requiring complex...
arxiv.org/abs/2605.29705 →

Context
New research (arXiv) on deploying complex LLM reasoning (trajectory prediction) to resource-constrained edge devices. Directly impacts autonomous systems and AI infrastructure.
Provenance
Article · Supporting source
44
arXiv cs.AI - Research Science (GLOBAL)

Article Shuaidi Wang, Zhan Zhuang, Ruping Huang, Yu Zhang

NaRA: Noise-Aware LoRA for Parameter-Efficient Fine-Tuning of Diffusion LLMs - arXiv:2605.29716v1 Announce Type: new Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising non-autoregressive...
arxiv.org/abs/2605.29716 →

Context
This is a primary artifact (arXiv paper) detailing a new PEFT method (NaRA) specifically for Diffusion LLMs (dLLMs), improving code generation and reasoning.
Provenance
Article · Supporting source
45
arXiv cs.AI - Research Science (GLOBAL)

Article Yeong-Joon Ju, Seong-Whan Lee

Citation-Closure Retrieval and Per-Rule Attribution for Real-World Regulatory Compliance Question Answering - arXiv:2605.29742v1 Announce Type: new Abstract: Deploying Large Language Models (LLMs) for regulatory...
arxiv.org/abs/2605.29742 →

Context
Addresses regulatory compliance and traceability for LLMs, directly impacting legal/policy use cases (HIPAA, national regulations). Provides a new benchmark (RegOps-Bench) and framework (RefWalk).
Provenance
Article · Supporting source
46
arXiv cs.AI - Research Science (GLOBAL)

Article Yanan Wang, Shuaicong Hu, Jian Liu, Guohui Zhou, Aiguo Wang, Cuiwei Yang

Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence - arXiv:2605.29744v1 Announce Type: new Abstract: The impressive performance of generalist large language...
arxiv.org/abs/2605.29744 →

Context
This paper addresses the architecture of medical AI, focusing on multi-agent systems and the synergy between generalist and specialist models. This is core to the 'where intelligence is built' and 'power dynamics' themes.
Provenance
Article · Supporting source
47
arXiv cs.AI - Research Science (GLOBAL)

Article Omar Benjelloun, Leonardo Martins Bianco, Isabelle Guyon, Thanh Gia Hieu Khuong, Jonathan Lebensold, Sebastian Lobentanzer, Luis Oala, Benedictus Kent Rachmat, Ihsan Ullah, Peyman Vahidi, Joaquin Vanschoren

Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations - arXiv:2605.29786v1 Announce Type: new Abstract: Reproducibility is fundamental to the scientific method, yet remains a critical...
arxiv.org/abs/2605.29786 →

Context
Introduces a formal, machine-actionable metadata standard (Croissant Tasks) for conceptual reproducibility, directly impacting ML evaluation and agentic development.
Provenance
Article · Supporting source
48
arXiv cs.AI - Research Science (GLOBAL)

Article Ashutosh Ojha, Vinay Aggarwal, Ashutosh Srivastava, Siddharth Yedlapati, Yaman K Singla, Jitendra Ajmera

MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains - arXiv:2605.29795v1 Announce Type: new Abstract: Real-world tasks often lack large labeled datasets, motivating extensive work on learning in low-data...
arxiv.org/abs/2605.29795 →

Context
Presents a novel agentic framework (MEMENTO) that treats the web as a learning signal, directly addressing agentic coding/practice and AI infrastructure.
Provenance
Article · Supporting source
49
arXiv cs.AI - Research Science (GLOBAL)

Article Yunbo Tang, Chengyi Yang, Shiyu Liu, Zhishang Xiang, Zerui Chen, Qinggang Zhang, Jinsong Su

SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search - arXiv:2605.29796v1 Announce Type: new Abstract: Agentic search enables LLMs to solve complex multi-hop questions through iterative...
arxiv.org/abs/2605.29796 →

Context
Presents a novel RL framework (SAAS) to solve a critical, practical limitation (over-search) in agentic search, directly impacting agentic coding/tools.
Provenance
Article · Supporting source
50
arXiv cs.AI - Research Science (GLOBAL)

Article Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang, Guanxu Chen, Yuejin Xie, Qinghua Mao, Wanying Qu, Yanxu Zhu, Tianyi Zhou, Leitao Yuan, Zhijie Zheng, Qihao Lin, Yimin Wang, Haoyu Luo, Shuai Shao, Chen Qian, Qingyu Liu, Ling Tang, Ruiyang Qin, Qihan Ren, Junxiao Yang, Kun Wang, Zhiheng Xi, Linfeng Zhang, Ranjie Duan, Bo Zhang, Wenjie Wang, Wen Shen, Qiaosheng Zhang, Yan Teng, Chaochao Lu, Rui Mei, Man Li, Jialing Tao, Xi Lin, Tianhang Zheng, Yong Liu, Quanshi Zhang, Lei Zhu, Xingjun Ma, Junhua Liu, Hui Xue, Xiaoxiang Zuo, Xiangnan He, Chao Shen, Xianglong Liu, Minlie Huang, Jing Shao, Xia Hu

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security - arXiv:2605.29801v1 Announce Type: new Abstract: Modern open-world agents such as OpenClaw exhibit powerful...
arxiv.org/abs/2605.29801 →

Context
This paper introduces a new, lightweight, and scalable alignment framework (AgentDoG 1.5) for AI agents, directly addressing safety risks in advanced agentic systems. It is a primary artifact with clear downstream consequence for agent deployment.
Provenance
Article · Supporting source
51
arXiv cs.AI - Research Science (GLOBAL)

Article Krzysztof Żurawicki, Julia Farganus, Arkadiusz Gaweł, Mateusz Bystroński, Tomasz Jan Kajdanowicz

PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing - arXiv:2605.29815v1 Announce Type: new Abstract: The growing number of submitted papers has motivated the exploration of Large Language Models...
arxiv.org/abs/2605.29815 →

Context
This paper introduces a benchmark (PRAIB) and empirical study on LLM review behavior, directly impacting the reliability and deployment of AI in academic/scientific processes.
Provenance
Article · Supporting source
52
arXiv cs.AI - Research Science (GLOBAL)

Article Haochen Yang, Ke Zhao, Mengyuan Ma, Xingyu Lu, Xiangfeng Wang, Hong Qian

OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation - arXiv:2605.29829v1 Announce Type: new Abstract: Leveraging Large Language Models (LLMs) to automatically...
arxiv.org/abs/2605.29829 →

Context
This is a primary artifact (arXiv paper/tool) detailing a new agentic system (OptSkills) for solving optimization problems using LLMs, directly addressing generalization and skill learning.
Provenance
Article · Supporting source
53
arXiv cs.AI - Research Science (GLOBAL)

Article Minyang Hu, Bo Yang, Zhinuo Zhou, Jiachen Liang, Guo Jiahao, Yiyang Yin, Xiongwei Han

Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories - arXiv:2605.29893v1 Announce Type: new Abstract: LLM-based agents have demonstrated strong capabilities in solving complex tasks...
arxiv.org/abs/2605.29893 →

Context
Introduces a new benchmark (RedundancyBench) and research area for evaluating agent efficiency, directly impacting agentic coding tools and practice.
Provenance
Article · Supporting source
54
arXiv cs.AI - Research Science (GLOBAL)

Article Toru Takahashi

Toward AI Systems That Understand Self and Others: A Multi-Phase Inference Framework for Human Cognitive Diversity and World-Model Alignment - arXiv:2605.29930v1 Announce Type: new Abstract: Mutual misunderstanding in...
arxiv.org/abs/2605.29930 →

Context
Addresses core AI alignment and world-model issues, directly impacting how AI systems interact with human cognitive diversity and social reality.
Provenance
Article · Supporting source
55
arXiv cs.AI - Research Science (GLOBAL)

Article Ahmad Rammal, Niket Patel, Fabian Gloeckle, Amaury Hayat, Julia Kempe, Remi Munos, Charles Arnal, Vivien Cabannes

Formalizing Mathematics at Scale - arXiv:2605.29955v1 Announce Type: new Abstract: We present AutoformBot, a multi-agent system for building an Autoformalized Textbook Library At Scale (Atlas) in Lean 4. AutoformBot...
arxiv.org/abs/2605.29955 →

Context
This describes a multi-agent system (AutoformBot) for autoformalizing complex mathematics (Lean 4) at scale. It's a major artifact demonstrating AI's capability to automate high-level, verifiable knowledge creation, impacting science and education.
Provenance
Article · Supporting source
56
arXiv cs.AI - Research Science (GLOBAL)

Article Kun Feng, Ziwei Shan, Yuchen Fang, Yiyang Tan, Sihan Lu, Shuqi Gu, Lintao Ma, Xingyu Lu, Kan Ren

KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning - arXiv:2605.30002v1 Announce Type: new Abstract: Cross-domain multimodal time series forecasting is a challenging task, requiring models to...
arxiv.org/abs/2605.30002 →

Context
This is a primary artifact (arXiv paper) detailing a novel agentic framework for time series forecasting, directly addressing the intersection of LLMs, agents, and specialized AI infrastructure.
Provenance
Article · Supporting source
57
arXiv cs.AI - Research Science (GLOBAL)

Article Zhen Chen, Yibing Liu, Weihao Xie, Yu Liang, Peilin Chen, Shiqi Wang

RAISE: RAG Design as an Architecture Search Problem - arXiv:2605.30029v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) systems expose numerous design choices spanning query rewriting, chunking,...
arxiv.org/abs/2605.30029 →

Context
Introduces a new, comprehensive framework (RAISE) and benchmark for RAG optimization, directly addressing systematic challenges in AI architecture design.
Provenance
Article · Supporting source
58
arXiv cs.AI - Research Science (GLOBAL)

Article Tong Ye, Hang Yu, Tengfei Ma, Xuhong Zhang, Jianguo Li, Peng Di, Peiyu Liu, Jianwei Yin, Wenhai Wang

Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning - arXiv:2605.30039v1 Announce Type: new Abstract: Large Language Models have demonstrated remarkable progress in general-purpose...
arxiv.org/abs/2605.30039 →

Context
This paper introduces a new paradigm (DOMINO) for synthesizing domain-specific data for LLMs using only reference examples, bypassing manual prompt engineering. This directly impacts LLM training and application.
Provenance
Article · Supporting source
59
arXiv cs.AI - Research Science (GLOBAL)

Article Geremy Loachamín-Suntaxi, Robert Lazar, Dimitrios G. Giovanis, Ioannis G. Kevrekidis, Eleni D. Koronaki

Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection - arXiv:2605.30042v1 Announce Type: new Abstract: Automating scientific computing workflows...
arxiv.org/abs/2605.30042 →

Context
This describes a multi-agent system for scientific computing that addresses semantic drift and action-outcome fidelity, directly impacting agentic coding and AI infrastructure.
Provenance
Article · Supporting source
60
arXiv cs.AI - Research Science (GLOBAL)

Article Matt Y. Cheung, Ashok Veeraraghavan, Hanjie Chen, Guha Balakrishnan

Conformal Certification of Reasoning Trace Prefixes - arXiv:2605.30085v1 Announce Type: new Abstract: Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a...
arxiv.org/abs/2605.30085 →
Details

Context
This paper introduces CROP, a method for certifying valid intermediate reasoning prefixes in LLMs. This directly addresses the reliability and process supervision of AI reasoning, which is core to agentic tools and frontier model safety.
Provenance
Article · Supporting source
61
arXiv cs.AI - Research Science (GLOBAL)

Article Tiancheng Yang, Matthias Schonlau, Ilia Sucholutsky

Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison - arXiv:2605.30087v1 Announce Type: new Abstract: Emerging personal AI agents are moving toward persistent,...
arxiv.org/abs/2605.30087 →
Details

Context
This paper introduces a new benchmark and method for personal AI memory, directly addressing conflict resolution and selective QA, which is core to agentic systems.
Provenance
Article · Supporting source
62
arXiv cs.AI - Research Science (GLOBAL)

Article Hongxiang Zhang, Yuan Tian, Tianyi Zhang

Enhancing Multi-Agent Communication through Attention Steering with Context Relevance - arXiv:2605.30136v1 Announce Type: new Abstract: LLM-based multi-agent systems have demonstrated remarkable performance on complex...
arxiv.org/abs/2605.30136 →
Details

Context
This is a new arXiv paper detailing a technical improvement (Agent-Radar) for multi-agent systems, directly addressing context management and performance degradation in complex AI applications.
Provenance
Article · Supporting source
63
arXiv cs.AI - Research Science (GLOBAL)

Article Ziyan Liu, Zhezheng Hao, Yeqiu Chen, Hong Wang, Jingren Hou, Ruiyi Ding, Yongkang Yang, Wence Ji, Wei Xia, Feng Liu

Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents - arXiv:2605.30159v1 Announce Type: new Abstract: Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing...
arxiv.org/abs/2605.30159 →
Details

Context
This paper introduces a novel optimization method (MMPO) for long-horizon LLM agents, directly addressing memory degradation and belief deviation. This is a core technical advance in agentic AI.
Provenance
Article · Supporting source
64
arXiv cs.AI - Research Science (GLOBAL)

Article Caleb DeLeeuw

BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders - arXiv:2605.30162v1 Announce Type: new Abstract: Biosecurity evaluations of language models typically ask...
arxiv.org/abs/2605.30162 →
Details

Context
Directly addresses model safety, refusal mechanisms, and internal auditing (SAE), which is critical to the power dynamics and reliability of frontier models.
Provenance
Article · Supporting source
65
arXiv cs.AI - Research Science (GLOBAL)

Article Haoming Xu, Weihong Xu, Zongrui Li, Mengru Wang, Yunzhi Yao, Chiyu Wu, Jin Shang, Yu Gong, Shumin Deng

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models - arXiv:2605.30219v1 Announce Type: new Abstract: Long-horizon interactions require language models to manage accumulating...
arxiv.org/abs/2605.30219 →
Details

Context
This is a new arXiv paper introducing a formal benchmark (BeliefTrack) and methods (RL, representation steering) for managing LLM belief states, directly impacting model reliability and capability.
Provenance
Article · Supporting source
66
arXiv cs.AI - Research Science (GLOBAL)

Article A. J. Lew (Unreasonable Labs), Y. Cao (Unreasonable Labs), M. J. Buehler (Unreasonable Labs)

ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure - arXiv:2605.30284v1 Announce Type: new Abstract: Scientific discovery is an inherently creative and...
arxiv.org/abs/2605.30284 →
Details

Context
A new benchmark (ProjectionBench) for evaluating LLMs on scientific hypothesis generation and discovery. This directly addresses the 'AI scientist/co-scientist' frontier and model capabilities.
Provenance
Article · Supporting source
67
arXiv cs.AI - Research Science (GLOBAL)

Article Haowen Wang, Yaxin Du, Jian Yang, Jiajun Wu, Shukai Liu, Yuxuan Zhang, Pingjie Wang, Siheng Chen, Tuney Zheng, Ming Zhou, Xianglong Liu

MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection - arXiv:2605.30288v1 Announce Type: new Abstract: Mid-training has become an important stage in modern LLM development, using large-scale curated...
arxiv.org/abs/2605.30288 →
Details

Context
This paper introduces MIRA, a source-aware data selection framework for mid-training LLMs. It directly addresses the core technical challenge of data curation and model capability enhancement.
Provenance
Article · Supporting source
68
arXiv cs.AI - Research Science (GLOBAL)

Article Yalun Dai, Yangyu Huang, Tongshen Yang, Yonghan Wang, Xin Zhang, Wenshan Wu, Qihao Zhao, Hao Li, Yuanyuan Gao, Kim-Hui Yap, Scarlett Li

Demystifying Data Organization for Enhanced LLM Training - arXiv:2605.30334v1 Announce Type: new Abstract: Large Language Models (LLMs) have revolutionized various fields, yet their training efficiency is heavily...
arxiv.org/abs/2605.30334 →
Details

Context
This paper proposes novel data ordering methods (STR, SAW) and guidelines for optimizing LLM training data organization, directly impacting training efficiency and model performance.
Provenance
Article · Supporting source
69
arXiv cs.AI - Research Science (GLOBAL)

Article Anany Kotawala

Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents - arXiv:2605.30335v1 Announce Type: new Abstract: Multi-component LLM agents assemble probabilistic claims from...
arxiv.org/abs/2605.30335 →
Details

Context
This paper addresses a fundamental failure mode (compositional incoherence) in multi-component LLM agents, directly impacting agentic coding and reliability.
Provenance
Article · Supporting source
70
@michpokrass (Michelle Pokrass)

X michpokrass

we shipped a new version of gpt-5.5 instant today. the previous model was too bullet pilled. the new one improves on some other important dimensions: sycophancy, factuality, and multilingual performance. hope you'll…
x.com/michpokrass/status/2060219759682330970 →
Details

Context
Reports a primary artifact (new model release) and directly relates to the near-future of AI and frontier models.
Provenance
Tweet · Primary source
71
@trengriffin (Tren Griffin)

X trengriffin

Did the CNN reporter call Microsoft to confirm the claim? Nope. Microsoft switching from Claude code to GitHub Copilot (both with Opus 4.7 paid for by enterprise API usage) enables dogfooding of the GHCP harness so…
x.com/trengriffin/status/2060220238147551244 →
Details

Context
Discusses a major shift in AI tooling (Claude to Copilot) and the underlying business/infrastructure dynamics (enterprise API usage, dogfooding), which is central to the podcast's focus.
Provenance
Tweet · Primary source
72
@badlogicgames (Mario Zechner)

X badlogicgames

pibot is now running fully local, using parakeet for STT, qwen3-tts for TTS, and Qwen 3.6 as the local multi-modal LLM via llama.cpp. The STT and TTS inference engines are Rust/mlx-c based. Ported from Python. So, zero…
x.com/badlogicgames/status/2060268257739677… →
Details

Context
Reports a specific, working technical artifact (pibot) and its local, dependency-free stack (Rust/mlx-c, Qwen 3.6), directly relevant to AI infrastructure and tools.
Provenance
Tweet · Primary source
74
Axios - Industry Adjacent (US)

Article Maria Curi

Inside the Democratic resistance on AI - Progressive Democrats taking hardline positions against AI are getting louder. Why it matters: Five influential progressives are shaping a confrontational Democratic message on...
www.axios.com/2026/05/29/inside-democratic-… →
Details

Context
Details specific policy proposals (moratoriums, taxes, labor protections) and political power dynamics (Sanders, AOC, Warren) shaping AI regulation and control.
Provenance
Article · Supporting source
75
Axios - Industry Adjacent (US)

Article Zachary Basu

"The pitchforks are here": Billionaires work to contain AI's populist revolt - America's billionaires are developing their own prescriptions for AI-fueled inequality, anxious to defuse a populist revolt aimed at their...
www.axios.com/2026/05/29/ai-billionaires-te… →
Details

Context
Directly addresses power dynamics, wealth concentration, and policy/regulation (wealth tax, data centers) shaping AI's future.
Provenance
Article · Supporting source
76
Real-time LLM Inference on Standard GPUs: 3k tokens/s per request — 67 pts · 38 comments

Article NicoConstant

https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/ · @mungoman2: This looks very interesting. Possible to get those rates without exotic hardware. But I have to say that the…
blog.kog.ai/real-time-llm-inference-on-stan… →
Details

Context
Directly addresses AI infrastructure (inference, GPUs) and model performance, which is central to the podcast topic.
Provenance
Article · Supporting source
77
Techmeme - Industry Adjacent (US)

Article

OpenAI says it has briefed the White House on its new biodefense program, which uses GPT-Rosalind to help develop biodefense and pandemic preparedness tools (Maria Curi/Axios) - Maria Curi / Axios : OpenAI says it has...
www.techmeme.com/260529/p13 →
Details

Context
Directly addresses the intersection of frontier AI (GPT-Rosalind) and critical infrastructure/policy (biodefense/White House), fitting the power dynamics theme.
Provenance
Article · Supporting source
78
TechCrunch AI - Media Culture (US)

Article Kate Park

This chip startup just raised $135M on a bet that AI’s biggest bottleneck isn’t compute — it’s memory - South Korean chip startup XCENA is betting that AI's real bottleneck is not compute, but memory.
techcrunch.com/2026/05/29/xcena-secures-135… →
Details

Context
Directly addresses AI infrastructure bottlenecks (memory/HBM), a core topic. Funding/valuation adds market/capital dynamics.
Provenance
Article · Supporting source
79
Techmeme - Industry Adjacent (US)

Article

Former Tesla data labelers say FSD relies on laborious mapping for hazards; crash data analysis shows Tesla exaggerates FSD's safety via flawed methodology (Reuters) - Reuters : Former Tesla data labelers say FSD...
www.techmeme.com/260529/p16 →
Details

Context
Directly challenges Tesla's safety claims and methodology for FSD, impacting public trust, regulation, and the viability of autonomous systems.
Provenance
Article · Supporting source
80
@emollick (Ethan Mollick)

X emollick

Reconstructing software engineering around AI is going to take work (even as the ability of AI to code increases at a rapid rate). Organizations are ideally spending tokens for two things: 1) building stuff 2)…
x.com/emollick/status/2060357604044358108 →
Details

Context
Directly addresses the shifting craft of software engineering and the need for organizational investment in AI-assisted development and experimentation.
Provenance
Tweet · Primary source

Transcript

00:00:00 Lenar: A guy named Mario Zechner posted a photo this morning of a small device on his desk: speaker, microphone, a tangle of cables running back to a tiny board. He calls it pibot. He's been building it for a few months. As of this morning, the whole voice loop runs locally on the box: speech-to-text, language model, and text-to-speech, all running without a cloud round trip and without a Python interpreter behind it. He talks. It answers. Everything stays in the room. So here's what we'll walk through over the next half hour. Mario's box sits on the personal end of the story today. Ethan Mollick posted a short note about how organizations should split their AI budget between building and learning. And a cluster of papers landed on arXiv this morning that all read like the same observation from different angles: as we run agents for longer, they fall apart in new ways, and the fixes aren't better models. They're better wrappers around the models. We'll close with a benchmark for conflicting personal memory, an auditing technique that asks how deep a refusal actually goes, and a project that's pulling math textbooks into Lean 4. Damra, where do you want to start?

00:01:06 Damra: Start with the box. It's the only thing in the story you can point at. The stack he's running is Parakeet for speech-to-text, Qwen 3 TTS for synthesis, and Qwen 3.6 as the multimodal large language model behind it all, served through llama.cpp. Parakeet is Nvidia's open recognition family. Qwen 3 TTS is Alibaba's open synthesis model. Qwen 3.6 is the dense multimodal release from earlier this month. What's new in Mario's setup isn't the model lineup. It's the runtime. He ported the recognition and synthesis inference engines from Python into Rust on top of mlx-c, and llama.cpp is already native code. So none of the three stages needs a Python interpreter at runtime.

00:01:50 Lenar: Which matters why? Spell it out for someone who hasn't tried to put one of these on a small device.

00:01:56 Damra: Because the Python runtime is the wall you eventually hit. You can ship a quantized model in a small package. The moment your recognition stack demands torch and your audio pipeline pulls in transformers, you're back to a multi-hundred-megabyte install on a Pi-class machine, plus a cold-start time the user can feel. Rust plus mlx-c keeps you in single-binary territory. The whole assistant fits in a fraction of the disk and starts in a fraction of the time. And on Apple silicon, mlx-c lets him use the unified memory the way the hardware wants to be used.

00:02:30 Lenar: There's a photo of the device sitting on his desk, by the way. It isn't a research result. It's a person saying: the local voice stack is finally good enough that I built one for my apartment, and it works. A year ago that sentence required a workstation under the desk. Two years ago it required a cloud bill.

00:02:47 Damra: And it changes what "agent" means in homes and small offices. If the speech round trip stays on the device, the conversation history stays on the device. That's a different privacy posture than anything that ships an audio buffer to a cloud endpoint. It's also a different latency posture. The reason most voice assistants feel sluggish isn't the model. It's the network leg in both directions.

00:03:11 Lenar: Two things I'll flag and not oversell. One: I haven't run pibot myself. I'm taking Mario at his word that the throughput is conversational. He doesn't post a tokens-per-second number in the screenshot I'm looking at. Two: there's a real model-quality gap between Qwen 3.6 at the size he's running and the frontier hosted models. He isn't claiming parity. He's claiming the local version answers the kinds of questions he's actually asking it. Which is the right test for this category.

00:03:40 Damra: It does not have to beat Opus on a benchmark. It has to be good enough that the user doesn't reach for their phone. And the people who'd build one of these for their kitchen care more about "works without the internet" than about the last few points on MMLU.

00:03:55 Lenar: Ethan Mollick, who teaches at Wharton and writes the One Useful Thing newsletter, posted a thought this morning about how organizations should spend their AI budget. He frames it as two buckets. One: tokens you spend on building things. Two: the version of the tweet I can see ends in an ellipsis, so the second item is cut off. But in context, and given what Mollick has written before, I read the second bucket as tokens you spend learning what works. Tokens against problems you don't yet know how to solve.

00:04:23 Damra: The truncation matters less than the move. He's saying token spend isn't a single line item. It's two different activities with two different success criteria. Building is: ship the artifact. Learning is: find out whether this even makes sense. Those want different review cadences, different teams, and different definitions of done.

00:04:43 Lenar: And the reason I keep that distinction in mind this week is that we spent yesterday on Kirkland and Ellis's five-hundred-million-dollar internal AI platform. Most of that money is going into building. What's harder to see in the K&E story is whether they've reserved enough capacity to learn, to try things that don't work and that they can throw away. Internal AI orgs at that scale almost always under-fund the learning bucket, because the deliverables column is what gets approved at the board meeting.

00:05:11 Damra: And when the board approves a budget, the line items are deliverables. Nobody writes: twenty percent of our token spend will go to ideas we abandon. But that's exactly the spend that tells you which deliverables are worth shipping next. The team that ran experiments and threw them away knows things the team that only shipped doesn't.

00:05:30 Lenar: I'd add one more difference. Experiment tokens have a different review cadence. Build tokens get checked at the end. Experiment tokens have to be checked weekly, because otherwise you can spend a whole quarter's compute against a problem no one can describe well enough to evaluate the result against.

00:05:46 Damra: And the people running the experiment have to be the same people who'd ship the result. If you split the experiment team from the deploy team, the experiment team learns things the deploy team doesn't trust, and the deploy team builds things the experiment team would've talked them out of.

00:06:01 Lenar: Mollick's post is short. The implication is heavier than the post. Read it next to yesterday's K&E story and what you'd ask of any internal AI org becomes: what's the learning budget, who controls it, and how often does it get reset to zero so the team can try again.

00:06:16 Lenar: Four papers landed on arXiv this morning that read as a cluster, even though none of the authors know each other. Each one names a different way that long-running agent sessions fall apart. Together they sketch the shape of where the reliability work is right now.

00:06:31 Damra: Walk me through them. Slowly. Start with the one that connects back to what we covered Wednesday.

00:06:37 Lenar: Right. The first is Meta-Cognitive Memory Policy Optimization, or MMPO, from a team led by Ziyan Liu. The setup is long-horizon agents that keep their context manageable by recursively summarizing their own history. After each step, the agent compresses what it knows into a smaller summary. The problem they name is belief deviation. After enough summarization rounds, the agent's working belief about the world drifts away from what was actually established earlier. The summary is fluent. It's also slightly wrong. And the next summary compresses the slightly wrong version, so the drift compounds.

00:07:14 Damra: Wait. Recursive summarization, you mean the technique every long-context agent has been using for the last year? That's what they're modifying?

00:07:22 Lenar: That's what they're modifying. Their move is to train a policy that decides when to update memory and when to leave it alone. Memory updates become actions the policy can refuse. If the new information doesn't change anything important, the policy leaves the existing summary as it is. They report meaningful gains on long-horizon tasks. The intuition tracks: most of the drift in these systems comes from over-eager rewrites, not from missing updates.
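
A minimal sketch of the idea that memory writes are actions a gate can refuse. This is not the MMPO method: the paper trains the gate, while the `should_write` rule below is a hand-written stand-in, and the `GatedMemory` class, its key/value summary, and the sample observations are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GatedMemory:
    """Running summary that only changes when a gate approves the write."""
    summary: dict = field(default_factory=dict)
    skipped: int = 0

    def should_write(self, key: str, value: str) -> bool:
        # Hypothetical stand-in for a learned policy: write only when the
        # observation adds a new key or contradicts the stored value.
        return self.summary.get(key) != value

    def observe(self, key: str, value: str) -> bool:
        if self.should_write(key, value):
            self.summary[key] = value
            return True
        self.skipped += 1  # refuse the write, leave the summary untouched
        return False


memory = GatedMemory()
observations = [
    ("repo_branch", "main"),
    ("repo_branch", "main"),    # restates a known fact: skipped
    ("tests_passing", "no"),
    ("tests_passing", "no"),    # restates a known fact: skipped
    ("tests_passing", "yes"),   # a real change: written
]
writes = [memory.observe(key, value) for key, value in observations]
print(writes, memory.summary, memory.skipped)
```

The design point is only that every refused write is one fewer chance to paraphrase an established fact into something slightly wrong. What a trained gate would actually condition on is for the paper to say.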

00:07:48 Damra: Which maps cleanly onto what we covered Wednesday: the agent memory degradation work, and the broader observation that persistent memory systems age badly. This is the same family of problem with a learned controller bolted on top. The controller's job is to know when to write.

00:08:04 Lenar: The second paper is RedundancyBench, from a team at Huawei and Hong Kong Polytechnic, lead author Minyang Hu. They ask whether the steps an agent actually takes in a long trajectory are necessary. They build a benchmark for detecting redundant steps after the fact. The headline finding: a meaningful fraction of agent steps in current systems are repeats. The agent re-reads the same file it read fifty steps ago. It re-queries the same endpoint. It re-derives a fact it already had in context.

00:08:34 Damra: Which sounds boring until you do the math on a thousand-step trajectory. If a quarter of your steps are redundant, you're paying for inference and tool calls you don't need, and you're filling the context with stale duplicates of state the agent already established. So the redundancy isn't just a cost line item. It actively makes the next step worse, because the relevant signal is now buried under repetition.
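
A rough sketch of what counting repeats in a trajectory can look like, plus the thousand-step arithmetic. It assumes a step is a dict with `tool` and `args` fields and treats an exact repeat of the same call as redundant. Both are simplifying assumptions for illustration, not the RedundancyBench definition, and the sample trajectory is invented.

```python
import json


def redundant_steps(trajectory):
    """Return indices of steps that repeat an earlier (tool, args) pair."""
    seen = set()
    repeats = []
    for index, step in enumerate(trajectory):
        # Canonical key: tool name plus arguments serialized in sorted order.
        key = (step["tool"], json.dumps(step["args"], sort_keys=True))
        if key in seen:
            repeats.append(index)
        else:
            seen.add(key)
    return repeats


trajectory = [
    {"tool": "read_file", "args": {"path": "config.yaml"}},
    {"tool": "http_get", "args": {"url": "https://example.com/status"}},
    {"tool": "read_file", "args": {"path": "config.yaml"}},               # repeat
    {"tool": "read_file", "args": {"path": "main.py"}},
    {"tool": "http_get", "args": {"url": "https://example.com/status"}},  # repeat
]
repeats = redundant_steps(trajectory)
fraction = len(repeats) / len(trajectory)

# The arithmetic from the discussion: a quarter of a thousand-step run.
wasted_in_long_run = int(0.25 * 1000)
print(repeats, fraction, wasted_in_long_run)
```

Exact matching overcounts in one obvious case: re-reading a file after editing it is a legitimate step, so a real detector would need to track whether the underlying state changed in between.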

00:08:58 Lenar: Third, Anany Kotawala has a single-author paper with my favorite title of the day. Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents. The framing: each sub-agent or sub-component in a multi-agent pipeline produces something defensible on its own. The assembled output is internally inconsistent because the components don't share constraints with each other. Kotawala's contribution is a bound. The paper proves a relationship between how often individual components are locally right and how often the assembly is globally right.

00:09:31 Damra: That's the failure every team building multi-agent pipelines runs into the first time they show a demo to someone outside the room. Every component looks defensible. The end-to-end answer contradicts itself. The planner says one thing. The retriever brings back something inconsistent with that. The summarizer smooths over the conflict and produces a coherent-sounding paragraph that's wrong in a different way than either input.
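
One textbook way to see "locally fine, globally contradictory" with numbers, offered as an illustration and not as Kotawala's bound. Three hypothetical components each emit a valid probability, but together they violate the Fréchet inequalities, which require max(0, P(A) + P(B) - 1) <= P(A and B) <= min(P(A), P(B)). The component names and values are made up.

```python
def frechet_bounds(p_a: float, p_b: float) -> tuple[float, float]:
    """Range that P(A and B) must fall in, given only P(A) and P(B)."""
    return max(0.0, p_a + p_b - 1.0), min(p_a, p_b)


def is_globally_coherent(p_a: float, p_b: float, p_a_and_b: float,
                         tol: float = 1e-9) -> bool:
    low, high = frechet_bounds(p_a, p_b)
    return low - tol <= p_a_and_b <= high + tol


# Hypothetical component outputs. Each is a valid probability on its own.
planner_p_a = 0.8        # "the deploy will succeed"
retriever_p_b = 0.7      # "the tests are green"
summarizer_p_both = 0.3  # "the deploy succeeds and the tests are green"

low, high = frechet_bounds(planner_p_a, retriever_p_b)
coherent = is_globally_coherent(planner_p_a, retriever_p_b, summarizer_p_both)
print(low, high, coherent)
```

With these numbers the joint claim would have to be at least about 0.5, so 0.3 cannot be reconciled with the other two, even though no single component said anything indefensible.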

00:09:55 Lenar: And the fourth, briefly: Agent-Radar, from Hongxiang Zhang at Purdue. Same neighborhood. They study attention steering with context relevance in multi-agent communication. When sub-agents exchange messages, the receiving agent's attention spreads across irrelevant pieces of the incoming message and the relevant signal gets diluted. They propose a steering mechanism that biases attention toward context-relevant tokens.
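
A toy picture of what "biasing attention toward relevant tokens" means, assuming the simplest possible form: add a fixed bonus to the pre-softmax scores of flagged positions. How Agent-Radar decides relevance and where it intervenes is not described in the discussion, so the `steer` function, the `bias` value, and the flat scores are all illustrative assumptions.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer(logits, relevant, bias=2.0):
    """Add a fixed bonus to the scores of positions flagged as relevant."""
    return [x + bias if i in relevant else x for i, x in enumerate(logits)]


# Six incoming tokens with flat scores: attention is spread evenly.
logits = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
relevant = {4}  # hypothetical: position 4 carries the signal

before = softmax(logits)
after = softmax(steer(logits, relevant))
print(round(before[4], 3), round(after[4], 3))
```

Because softmax renormalizes, whatever weight the flagged position gains is taken from the other five, which is the dilution effect run in reverse.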

00:10:21 Damra: So if you read all four side by side (MMPO on memory drift, RedundancyBench on wasted steps, Compositional Incoherence on assembled wrongness, and Agent-Radar on attention dilution), you can see the shape. As agent sessions get longer and as more sub-components get composed together, the new failure modes aren't about whether the model can answer a question. They're about whether the trajectory stays coherent and whether the steps add up to something useful.

00:10:49 lenarAnd the fixes proposed across the four papers are not new model capabilities. They're control layers wrapped around the model. A learned policy that gates memory writes. A benchmark that catches redundancy. A bound that quantifies compositional damage. An attention steering mechanism. Same shape as the harness conversation we had Tuesday. The model is fine. The layer wrapping the model is where the bugs live now.

00:11:13 damraLet me put a brake on one piece. These are all arXiv preprints from today. None of them have replication yet. The MMPO numbers look strong enough that I'd want to see another team rerun the experiments before I bring the policy into production. Kotawala's bound is single-author and the proof needs review.

00:11:31 lenarFair. The direction feels right and it lines up with what people running long agents in production have been complaining about all month. The specific numbers, I'm holding loosely. Anyone shipping an agent today should read MMPO and the redundancy paper this weekend. They might not adopt the methods. They'll recognize the failure modes.

00:11:50 lenarOne more in the same neighborhood, but on the personal-assistant side. Tiancheng Yang at Waterloo, with Matthias Schonlau and Ilia Sucholutsky from Vector, posted a benchmark and method comparison they're calling — and I'll just read the title — Selective QA over Conflicting Multi-Source Personal Memory. The setup is what happens when a personal AI assistant has accumulated memories about you from multiple sources, and those sources disagree.

00:12:16 damraGive me a concrete example. What does the disagreement look like in practice?

00:12:20 lenarThe example they walk through is preference conflict. Your calendar says you prefer morning meetings — the calendar's been saying that for two years. A message you sent two weeks ago says you've started blocking mornings for deep work and you want all meetings after lunch. Which one does the assistant believe when someone messages it asking to book time on your behalf? Both pieces of information were true when they were written. Neither one is a lie. They contradict each other now.

00:12:47 damraAnd the harder version of the same problem: neither source is wrong even today. The calendar is a stated preference. The message is a more recent stated preference. The assistant has to know that recency matters, that explicit statements override inferred ones, that some preferences are revisable and some aren't, and that some contexts in your life override others. That's a lot of judgment to ask a retrieval system to perform.

00:13:13 lenarThey build a diagnostic testbed across several conflict types, and they compare a range of methods — straight retrieval, retrieval with a conflict-resolution step, methods that condition on recency, and methods that condition on source type. The honest summary is that no single method dominates. Different conflict types want different resolution strategies. Systems that try to use one strategy for everything underperform compared to systems that route the conflict type first and then apply a type-specific resolver.
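
The "route first, then resolve" pattern can be sketched as follows, using the calendar-versus-message example from earlier. This is a minimal illustration and not the paper's method: the `Memory` fields, the two conflict types, the classifier rule, and the dates are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Memory:
    source: str     # e.g. "calendar" or "message"
    claim: str
    recorded: date  # when the memory was written
    explicit: bool  # True if the user stated it, False if inferred


def resolve_by_recency(a, b):
    """Prefer whichever memory was recorded most recently."""
    return a if a.recorded >= b.recorded else b


def resolve_by_explicitness(a, b):
    """Prefer the explicit statement; fall back to recency on a tie."""
    if a.explicit != b.explicit:
        return a if a.explicit else b
    return resolve_by_recency(a, b)


def classify_conflict(a, b):
    """Hypothetical router: pick a conflict type from simple metadata."""
    if a.explicit != b.explicit:
        return "explicit_vs_inferred"
    return "stale_preference"


# One resolver per conflict type, looked up after routing.
RESOLVERS = {
    "explicit_vs_inferred": resolve_by_explicitness,
    "stale_preference": resolve_by_recency,
}


def resolve(a, b):
    kind = classify_conflict(a, b)
    return kind, RESOLVERS[kind](a, b)


# Illustrative dates: a two-year-old calendar preference and a
# two-week-old message, both treated as stated preferences.
calendar = Memory("calendar", "prefers morning meetings",
                  date(2024, 5, 29), explicit=True)
message = Memory("message", "mornings blocked; meetings after lunch",
                 date(2026, 5, 15), explicit=True)

kind, winner = resolve(calendar, message)
print(kind, "->", winner.source, ":", winner.claim)
```

The design choice worth noticing is the dictionary lookup. A single-strategy system would hard-code one of the resolver functions, while the routed version only commits to a strategy after it has decided what kind of disagreement it is looking at.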

00:13:43 damraWhich lines up with how humans handle the same problem. You don't have a single algorithm for resolving contradictory information about a friend. You weight sources by recency, by who said it, by how confident they sounded, by whether it was an explicit statement or an inference from behavior. Asking a retrieval system to bake one of those weightings into its index gets you the wrong answer in three out of four cases.

00:14:07 lenarThe reason this matters now — not in two years — is that the products shipping persistent memory right now don't have any of this machinery. When ChatGPT or Claude remember something about you, and that something becomes wrong, the next time the assistant uses that memory it confidently uses the stale version. There's no resolution step. There isn't a conflict-detection step. The newest entry doesn't necessarily win. The most explicit entry doesn't necessarily win. Whatever the retriever surfaces, the model treats as fact.

00:14:37 damraAnd that's a real cost for the user. Not a paper cost — a felt one. The assistant tells a coworker you prefer morning meetings when you've been telling everyone you don't, for the last two weeks. You don't see it happen. You just see the meeting on your calendar and wonder why nothing you say about your schedule sticks.

00:14:55 lenarCaleb DeLeeuw, an independent researcher, posted a paper called BioRefusalAudit. The premise is that current biosecurity evaluations of language models ask the model questions and grade whether it refuses. He argues that's a shallow test. A model can refuse for surface reasons — it pattern-matches on the phrasing of the question — and still have the relevant capability accessible if the question is asked differently.

00:15:21 damraSo how does he test deeper than that? What does the audit actually measure?

00:15:25 lenarHe uses sparse autoencoders — SAEs, the interpretability technique that's been getting attention this year — to look at the internal features the model activates when it's given a biosecurity-adjacent prompt. He asks a different question — not whether the model refused, but whether the model's internal representations contain the dangerous capability even when it refused at the surface. He compares general-purpose SAEs against ones he fine-tuned on the biosecurity domain to make the relevant features sharper.

00:15:55 damraThat's a real distinction. Refusing because the request matches a refusal pattern is different from refusing because the relevant knowledge isn't there. The first is brittle — paraphrase the request, switch language, embed the question in a roleplay, and the pattern stops matching. The second isn't brittle in the same way, because there's nothing to retrieve.

00:16:15 lenarHis finding is roughly that current refusal training in frontier open-weight models operates much more at the first level than the second. The capability is present internally. The refusal is a learned output filter sitting on top. And filters can be bypassed. The depth-versus-surface gap shows up clearly in the SAE features.

00:16:35 damraWhich doesn't mean the filter is worthless. It means it's a layer, not a wall. The work this pushes on is whether we should be measuring refusal depth as a separate quantity from refusal rate. The current public scorecards for model safety mostly report the rate. They don't report the depth. And the depth is what determines how the model behaves against an adversary who's actually trying.

00:16:58 lenarThat's what I'd hand to any safety team running biosecurity evals this quarter. Are you measuring whether the model said no, or whether the model couldn't say yes? Those are different tests, and we mostly run the first. DeLeeuw's paper doesn't solve the second one. It builds the apparatus to ask it.

00:17:15 damraAnd it ties into something bigger. SAE-based auditing is moving from an interpretability curiosity into something safety teams will plausibly run as part of release evaluations within a year. Today's paper is one application. The general technique — read the internal features, don't just read the outputs — is the move.

00:17:34 lenarOne last item, and it's a more cheerful one. A team with Ahmad Rammal at the lead, with people from FAIR Paris and NYU, posted AutoformBot. It's a multi-agent system that builds something they're calling Atlas — an autoformalized textbook library in Lean 4. The headline claim is that the system can take textbook math written in natural language and turn it into machine-checked Lean code at scale.

00:17:58 damraDefine autoformalization for someone who hasn't met the term before.

00:18:02 lenarMathematics written in natural-language proofs (the way textbooks write proofs, with English between the equations and a fair amount of "it is clear that" and "by symmetry" papering over the steps) translated into a proof assistant's formal language. Lean 4 is the proof assistant. It checks every step. If a step doesn't follow from what came before, Lean refuses to compile it. Atlas is their target: a library of textbook math, formalized, that the Lean community can build on.
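
For a sense of what the formal side looks like, here is a tiny Lean 4 example. It is not taken from Atlas. It formalizes the textbook sentence "addition of natural numbers is commutative" using `Nat.add_comm` from Lean's core library, and the theorem names are made up for illustration.

```lean
-- Textbook statement: "addition of natural numbers is commutative."
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A statement whose proof does not match it is rejected.
-- Uncommenting the two lines below makes the file fail to check,
-- because Nat.add_comm a b proves a + b = b + a, not a + b = a.
-- theorem bad_example (a b : Nat) : a + b = a :=
--   Nat.add_comm a b
```

What Lean checks is that the proof establishes the statement as written. Whether that statement faithfully captures what the textbook meant is a separate question that the checker does not answer.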

00:18:30 damraWhy a multi-agent system for that? What's the role split?

00:18:34 lenarBecause formalizing a single theorem from natural language is a multi-stage problem. You have to parse the statement, decide what context to import, translate the statement into Lean, translate each proof step, close gaps that the textbook waved over, and verify that the translated version actually compiles. They give different agents different stages. One reads the textbook. One drafts the Lean statement. One drafts the proof. One closes the gaps the others left. They critique each other's output, and Lean is the ground-truth oracle for whether the final product survives.

00:19:06 damraAnd does it work at the textbook scale they're claiming? Or is this one chapter and a press release?

00:19:12 lenarTheir claim is multi-textbook coverage with a meaningful fraction of theorems closing automatically. I haven't independently checked the numbers. The bigger story underneath — formalized math has been a fifteen-year project mostly run by small teams of dedicated mathematicians who hand-write Lean. The library that exists today, mathlib, was assembled one theorem at a time over a decade. If a multi-agent system can credibly do the bulk translation work, the rate of growth changes by an order of magnitude.

00:19:41 damraAnd once it's in Lean, it's verified. The agent can be wrong about the translation a hundred different ways. It can't be wrong about whether the Lean version compiles. The proof assistant is the ground truth. So unlike most agent benchmarks, this one isn't grading itself. The grading lives outside the loop entirely.

00:20:01 lenarThat's the part that makes this category interesting to me. Most agent benchmarks grade themselves — the same model that produced the answer is involved in judging the answer. This one has an outside verifier that doesn't care about model-style answers. Either the proof closes or it doesn't. There's no rubric, no judge model, and no partial credit.

00:20:21 damraAnd it's an early sign that some agent workflows have natural verification built into the domain. Coding has tests. Math has proofs. Hardware design has simulation. Most other domains do not — which is why so much of the agent literature this week is about catching incoherence inside the trajectory rather than at the output. When you can't verify the output, you have to verify the process. When you can verify the output, the process can be as messy as it needs to be.

00:20:49 lenarThat's where the day lands for me. A working local voice stack on a desk. A short note from Mollick about how to split your AI budget. A cluster of papers all saying that long agents fail in ways the harness has to catch. A benchmark for conflicting personal memory. An auditing technique that asks whether a refusal is shallow or deep. And a math project where the verifier is a proof assistant.

00:21:12 damraThe thread I'm pulling out: the model is rarely the constraint anymore on the things people are trying to build. The layer wrapped around it is. Mollick's learning budget, the MMPO memory policy, the RedundancyBench benchmark, the personal-memory conflict resolver, the SAE-based refusal auditor, and the proof-assistant verifier behind Atlas are different control layers doing the same job. They all sit between the model and the work, deciding what the model gets to do next.

00:21:41 lenarTomorrow is going to be quiet. I'll be reading the MMPO paper end to end and seeing if the numbers hold up to a closer look. If something surprising lands over the weekend, we'll cover it Monday. Lenar Kess.