◆ Dispatch 041 · 2026-05-29 GSV Locally Coherent, Globally Not
“Are you measuring whether the model said no, or whether the model couldn't say yes? Those are different tests, and we mostly run the first.”
— Lenar Kess, today's narration
Friday's room sits between a hobbyist voice assistant running entirely on Mario Zechner's desk and a cluster of arXiv papers all saying the same thing from different angles: long-running agents now fall apart in ways the model can't fix. Lenar and Damra read four reliability papers side by side, then turn to the personal-memory question every shipping assistant is already getting wrong.
- Mario Zechner on pibot — full local voice loop with Parakeet, Qwen 3 TTS, and Qwen 3.6 through llama.cpp, with the STT and TTS engines ported from Python into Rust on mlx-c. The runtime detail is the news, not the model lineup.
- Ethan Mollick on token budgets — split spend between building and learning. Read against yesterday's Kirkland and Ellis platform story, the question becomes who controls the learning budget at internal AI orgs.
- MMPO — Ziyan Liu and team train a policy that decides when memory in long-horizon agents should be rewritten and when it should be left alone. Belief drift comes from over-eager rewrites, not missing updates.
- RedundancyBench — Minyang Hu's group benchmarks how many steps in a long agent trajectory are repeats. Stale duplicates of state crowd out the relevant signal in context.
- Locally Coherent, Globally Incoherent — Anany Kotawala's single-author paper bounds compositional incoherence in multi-component agents. Defensible local outputs assemble into contradictory global ones.
- Agent-Radar — Hongxiang Zhang's group steers attention toward context-relevant tokens in multi-agent communication, so the receiver isn't drowned in noise from the sender.
- Selective QA over conflicting personal memory — Tiancheng Yang's testbed for what happens when your assistant's memories about you disagree. No single resolution strategy dominates.
- BioRefusalAudit — Caleb DeLeeuw uses sparse autoencoders to ask whether a model's refusal is shallow pattern matching or whether the dangerous capability isn't there at all.
- AutoformBot and Atlas — Ahmad Rammal's team at FAIR Paris and NYU on a multi-agent system that pulls textbook math into Lean 4 at scale. Lean is the verifier the agents can't argue with.
Sources
80 cited
1
OpenAI · 47m40s
Video OpenAI
Build Hour: Agents SDK — Build with the next evolution of the Agents SDK. In this Build Hour, you’ll learn how to use the updated Agents SDK to build long-running agents with a model-native harness. Give agents the…
www.youtube.com/watch?v=tK32trvj_b4
- Context
- Directly addresses agentic coding tools, agent infrastructure, and the shifting craft of software engineering with technical depth.
- Provenance
- Video · Supporting source
2
arXiv cs.AI - Research Science (GLOBAL)
Article Al Kari
The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling - arXiv:2605.28864v1 Announce Type: new Abstract: The Cognitive Categorical Transformer (CCT) is a 306M-parameter...
arxiv.org/abs/2605.28864
- Context
- This is a primary artifact (arXiv paper) detailing a novel, theoretically grounded model architecture (CCT) and providing quantitative evidence of performance improvement via category theory concepts.
- Provenance
- Article · Supporting source
3
arXiv cs.AI - Research Science (GLOBAL)
Article Jiachen Zhang (Peking University, China Agricultural University), Junyi Lao (Peking University), Chenghao Liu (Peking University), Siyuan Liu (Peking University), Shixin Wu (Peking University), Linsen Zhang (Peking University), Boyu Wang (Peking University), Songfang Huang (Peking University)
VFEAgent: A Multimodal Agent Framework for End-to-End Automated Finite Element Analysis - arXiv:2605.28978v1 Announce Type: new Abstract: Finite Element Analysis (FEA) serves as the cornerstone of modern engineering...
arxiv.org/abs/2605.28978
- Context
- This paper describes an agentic system (VFEAgent) automating a complex, domain-specific engineering workflow (FEA) from multimodal inputs. This is a core example of AI applied to physical-world engineering.
- Provenance
- Article · Supporting source
4
arXiv cs.AI - Research Science (GLOBAL)
Article Sara Metcalf, William Schoenberg
BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation - arXiv:2605.28994v1 Announce Type: new Abstract: AI tools to support real world decision making must be able to build simulation models that inform...
arxiv.org/abs/2605.28994
- Context
- Establishes a new, open-source benchmark (BEAMS) for AI in critical real-world modeling/simulation, directly impacting AI's utility and trustworthiness.
- Provenance
- Article · Supporting source
5
arXiv cs.AI - Research Science (GLOBAL)
Article Aisha Najera, Alvin Moon, Vedant Srinivasan, Rajesh Veeraraghavan
When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis - arXiv:2605.29025v1 Announce Type: new Abstract: Federal agencies are deploying large language models (LLMs) to categorize public comment...
arxiv.org/abs/2605.29025
- Context
- Addresses the critical issue of model disagreement in real-world applications (public policy/federal agencies), directly impacting how intelligence is used and interpreted.
- Provenance
- Article · Supporting source
6
arXiv cs.AI - Research Science (GLOBAL)
Article Diego Gosmar, Deborah A. Dahl
Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching - arXiv:2605.29055v1 Announce Type: new Abstract: Hallucination remains a major reliability barrier for production...
arxiv.org/abs/2605.29055
- Context
- New research on hallucination mitigation, agentic pipelines, and semantic caching directly addresses reliability, infrastructure, and agentic tools.
- Provenance
- Article · Supporting source
7
arXiv cs.AI - Research Science (GLOBAL)
Article Siddharth Sai, Xiaofei Wen, Muhao Chen
Robust and Efficient Guardrails with Latent Reasoning - arXiv:2605.29068v1 Announce Type: new Abstract: Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world...
arxiv.org/abs/2605.29068
- Context
- This paper proposes COLAGUARD, a novel, efficient guardrail model for LLMs. It directly addresses the core tension between safety robustness and high-throughput deployment, which is critical for real-world AI infrastructure.
- Provenance
- Article · Supporting source
8
arXiv cs.AI - Research Science (GLOBAL)
Article Tyler Akidau, Tyler Rockwood, Johannes Brüderl, Marc Millstone
The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane - arXiv:2605.29082v1 Announce Type: new Abstract: AI agents are increasingly expected to operate as digital employees:...
arxiv.org/abs/2605.29082
- Context
- This paper proposes a critical architectural solution (ADP) for governing autonomous agents, directly addressing safety, policy, and enterprise data access—a core topic.
- Provenance
- Article · Supporting source
9
arXiv cs.AI - Research Science (GLOBAL)
Article Yubo Li, Ramayya Krishnan, Rema Padman
The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure - arXiv:2605.29087v1 Announce Type: new Abstract: Reasoning models are evaluated on single-turn benchmarks but...
arxiv.org/abs/2605.29087
- Context
- Reports a specific, measurable failure mode (unfaithful capitulation) in reasoning models under adversarial pressure, directly impacting model reliability and safety.
- Provenance
- Article · Supporting source
10
arXiv cs.AI - Research Science (GLOBAL)
Article Shreyas Fadnavis, Praitayini Kanakaraj, Felix Wyss
Beyond Consensus: Trace-Level Synthesis in Mixture of Agents - arXiv:2605.29116v1 Announce Type: new Abstract: When multiple LLM agents solve the same problem, standard practice compresses each agent's reasoning into a...
arxiv.org/abs/2605.29116
- Context
- This paper directly addresses agentic systems and the 'craft' of AI reasoning, arguing for trace-level synthesis over simple consensus voting. This is a core technical advance for agentic tools.
- Provenance
- Article · Supporting source
11
arXiv cs.AI - Research Science (GLOBAL)
Article Yifei He, Rui Yang, Hao Bai, Tong Zhang, Han Zhao
PRO-CUA: Process-Reward Optimization for Computer Use Agents - arXiv:2605.29119v1 Announce Type: new Abstract: Computer use agents (CUAs) have shown strong potential for automating complex digital workflows, yet their...
arxiv.org/abs/2605.29119
- Context
- This paper introduces a new framework (PRO-CUA) for training computer use agents (CUAs), directly addressing agentic coding/workflow automation and AI infrastructure challenges.
- Provenance
- Article · Supporting source
12
arXiv cs.AI - Research Science (GLOBAL)
Article Dueun Kim, Albert No
The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models - arXiv:2605.29123v1 Announce Type: new Abstract: Masked diffusion language models (MDMs) uniquely support any-order generation, with...
arxiv.org/abs/2605.29123
- Context
- Directly addresses model failure modes and reasoning limitations in diffusion models, a core topic for frontier AI research.
- Provenance
- Article · Supporting source
13
arXiv cs.AI - Research Science (GLOBAL)
Article Muhammad Zia Hydari, Raja Iqbal, Narayan Ramasubbu
Governing Technical Debt in Agentic AI Systems - arXiv:2605.29129v1 Announce Type: new Abstract: Agentic AI systems are increasingly being explored as production infrastructure: they reason over multiple steps, call...
arxiv.org/abs/2605.29129
- Context
- Defines 'Agentic Technical Debt' and 'Stochastic Tax,' directly addressing governance and infrastructure challenges in agentic AI systems.
- Provenance
- Article · Supporting source
14
arXiv cs.AI - Research Science (GLOBAL)
Article Daniel Lee, Owen Queen, James Zou
ReasonOps: Operator Segmentation for LLM Reasoning Traces - arXiv:2605.29192v1 Announce Type: new Abstract: Chain-of-thought traces from large reasoning models can span tens of thousands of tokens, yet we lack a...
arxiv.org/abs/2605.29192
- Context
- This paper introduces ReasonOps, a method to analyze and structure LLM reasoning traces, revealing common compositional structures and model fingerprints. This is core research on AI capability and understanding.
- Provenance
- Article · Supporting source
15
arXiv cs.AI - Research Science (GLOBAL)
Article Tenghao Huang, Kung-Hsiang Huang, Prafulla Kumar Choubey, Yilun Zhou, Muhao Chen, Jonathan May, Chien-Sheng Wu
GTA: Generating Long-Horizon Tasks for Web Agents at Scale - arXiv:2605.29218v1 Announce Type: new Abstract: Web agents, which couple language models with browsing and tool-use capabilities, show promise as open web...
arxiv.org/abs/2605.29218
- Context
- This paper introduces a scalable benchmark (GTA) for web agents, directly addressing the core topic of agentic coding tools and practice. It's a primary artifact with clear downstream consequence for agent development.
- Provenance
- Article · Supporting source
16
arXiv cs.AI - Research Science (GLOBAL)
Article Jiahao Huang, Fei Cheng, Junfeng Jiang, Zefan Yu, Akiko Aizawa
BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents - arXiv:2605.29225v1 Announce Type: new Abstract: Self-evolving agents improve over time by reflecting on past failures, but...
arxiv.org/abs/2605.29225
- Context
- Introduces BenchTrace, a new benchmark for evaluating self-evolving LLM agents, directly addressing agentic coding tools and agentic practice.
- Provenance
- Article · Supporting source
17
arXiv cs.AI - Research Science (GLOBAL)
Article Benlong Wu, Weiming Zhang, Kejiang Chen, Han Fang, Nenghai Yu
Provably Secure Agent Guardrail - arXiv:2605.29251v1 Announce Type: new Abstract: As large language models transition from bounded generative engines to agents with expansive execution privileges, AI going out of...
arxiv.org/abs/2605.29251
- Context
- This paper proposes a formal, provably secure guardrail for agents, directly addressing the core risk of autonomous AI systems going rogue. It's a major technical artifact.
- Provenance
- Article · Supporting source
18
arXiv cs.AI - Research Science (GLOBAL)
Article Yibing Liu, Yangze Liu, Xiaolong Yin, Bin Wang, Chong Zhang, Hao Yin, Zhongyi Han
OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories - arXiv:2605.29253v1 Announce Type: new Abstract: Task success can hide process anomalies in real-world agent executions. An...
arxiv.org/abs/2605.29253
- Context
- Introduces OpenClawBench, a large-scale dataset for measuring process-side anomalies in real agent execution. Directly addresses agentic tools and reliability.
- Provenance
- Article · Supporting source
19
arXiv cs.AI - Research Science (GLOBAL)
Article Shijie Cao, Yuan Yuan, Jing Liu
Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling - arXiv:2605.29262v1 Announce Type: new Abstract: The Dynamic Flexible Job Shop Scheduling Problem...
arxiv.org/abs/2605.29262
- Context
- This paper proposes an agentic framework (RACE-Sched) for dynamic scheduling, directly addressing the core tension between real-time constraints and long-horizon reasoning in industrial control systems.
- Provenance
- Article · Supporting source
20
arXiv cs.AI - Research Science (GLOBAL)
Article Yang Zhang, Xiukun Wei, Xueru Zhang
When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop - arXiv:2605.29267v1 Announce Type: new Abstract: Foundation models are increasingly trained on synthetic data generated...
arxiv.org/abs/2605.29267
- Context
- Addresses model collapse and alignment failure in multi-model, self-consuming training loops, directly impacting AI infrastructure and control.
- Provenance
- Article · Supporting source
21
arXiv cs.AI - Research Science (GLOBAL)
Article Wei Zheng, Yang Yan, Yiyang Shao, Jinyang Li, Zeze Chang, Yukuang Jia, Qiming Mao, Chihyung Wang, Jingbin Zhou
Indexing the Unreadable: LLM-Native Recursive Construction and Search of Service Taxonomies - arXiv:2605.29270v1 Announce Type: new Abstract: The era of the Internet of Agents (IoA) is taking shape: LLM agents are...
arxiv.org/abs/2605.29270
- Context
- Addresses a core infrastructure problem (context management/service discovery) for the 'Internet of Agents' (IoA), a key near-future topic.
- Provenance
- Article · Supporting source
22
arXiv cs.AI - Research Science (GLOBAL)
Article Vaishali Senthil, Ashutosh Hathidara, Sebastian Schreiber
CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval - arXiv:2605.29271v1 Announce Type: new Abstract: Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user...
arxiv.org/abs/2605.29271
- Context
- This paper addresses a core bottleneck for LLM agents: tool retrieval from large API catalogs. It proposes a novel co-training method (CoHyDE) that improves agent capability.
- Provenance
- Article · Supporting source
23
arXiv cs.AI - Research Science (GLOBAL)
Article Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng, Tong Xu, Yi Zheng, Zhefeng Wang, Enhong Chen
Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models - arXiv:2605.29303v1 Announce Type: new Abstract: Supervised fine-tuning (SFT) followed by reinforcement...
arxiv.org/abs/2605.29303
- Context
- This is a new, technical paper (arXiv) proposing a novel fine-tuning method (EKSFT) for LLMs, directly impacting model training and capability.
- Provenance
- Article · Supporting source
24
arXiv cs.AI - Research Science (GLOBAL)
Article Yilun Yao, Jiaming Pan, Elsie Dai, Peizhuang Cong, Yaoming Li, Tong Yang
ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression - arXiv:2605.29350v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) language models reduce per-token computation but still require...
arxiv.org/abs/2605.29350
- Context
- This is a primary artifact (arXiv paper) detailing a novel, train-free compression technique (ConMoE) for MoE models, directly impacting AI infrastructure and deployment.
- Provenance
- Article · Supporting source
25
arXiv cs.AI - Research Science (GLOBAL)
Article Yiqun Liu, Yingsheng Wu, Ruqi Yang, Enrong Zheng, Honglei Qiu, Sijun He, Tai Liang, Jingjing Wu, Yuhan Zhou, Yiwei Zhang, Dongyan Chen, Weihan Yi, Xinqi Li, Siqi Bao
PassNet: Scaling Large Language Models for Graph Compiler Pass Generation - arXiv:2605.29357v1 Announce Type: new Abstract: Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream...
arxiv.org/abs/2605.29357
- Context
- Addresses AI infrastructure (compilers, optimization) and the shifting craft of software engineering (LLMs for compiler passes). Primary artifact (PassNet/PassBench) with clear downstream consequence.
- Provenance
- Article · Supporting source
26
arXiv cs.AI - Research Science (GLOBAL)
Article Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet - arXiv:2605.29358v1 Announce Type: new Abstract: We demonstrate that sparse autoencoders can extract interpretable features from Claude 3...
arxiv.org/abs/2605.29358
- Context
- A primary artifact (arXiv paper) detailing feature extraction and interpretability from a major proprietary model (Claude 3 Sonnet). Directly addresses model internals and control.
- Provenance
- Article · Supporting source
27
arXiv cs.AI - Research Science (GLOBAL)
Article Zhihao Liu, Yifan Wu, Jian Lou, Di Wang, Yuxi Zhou, Yuke Hu
Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization - arXiv:2605.29396v1 Announce Type: new Abstract: Safety alignment for large language models (LLMs) aims to reduce harmful or unsafe...
arxiv.org/abs/2605.29396
- Context
- This is a primary research artifact (arXiv paper) directly addressing LLM safety and robustness, a core concern in AI infrastructure and power dynamics.
- Provenance
- Article · Supporting source
28
arXiv cs.AI - Research Science (GLOBAL)
Article Rahul Bissa, Abhishek Vyas, Yash Jain
Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark - arXiv:2605.29400v1 Announce Type: new Abstract: We benchmark three supervised fine-tuned models against...
arxiv.org/abs/2605.29400
- Context
- This is a primary artifact (arXiv paper) detailing a specific benchmark (PiSAR) and showing a massive performance gap between fine-tuned models and frontier zero-shot baselines. It directly impacts agentic coding/behavior prediction.
- Provenance
- Article · Supporting source
29
arXiv cs.AI - Research Science (GLOBAL)
Article Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen, Xinjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen
Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation - arXiv:2605.29430v1 Announce Type: new Abstract: Automatic speech recognition (ASR) is a core component of...
arxiv.org/abs/2605.29430 →Details
- Excerpt
- Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation - arXiv:2605.29430v1 Announce Type: new Abstract: Automatic speech recognition (ASR) is a core component of...
- Context
- Presents a new, agentic framework (Agentic ASR) for speech recognition, directly addressing the limitations of current single-pass systems. This is a primary artifact changing the developer's mental model for building AI agents.
- Provenance
- Article · Supporting source
-
30
arXiv cs.AI - Research Science (GLOBAL)
Article Zeli Su, Zhankai Xu, Tianlei Chen, Longfei Zheng, Xiaolu Zhang, Jun Zhou, Wentao Zhang
The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF - arXiv:2605.29491v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed in...
arxiv.org/abs/2605.29491
- Context
- This paper addresses a critical robustness gap in RAG/agentic systems (distractor instructions), directly impacting LLM reliability and deployment in real-world, noisy data environments.
- Provenance
- Article · Supporting source
-
31
arXiv cs.AI - Research Science (GLOBAL)
Article Kevin Wang, Anna Thöni, Benjamin Kempinski, Bobby Cheng, Jianzhu Yao, Benjamin Finch, Leon Guertler, Viraj Nadkarni, Yihan Jiang, Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov, Siyuan Wu, Yu-Chi Cheng, Yan-Ru Ju, Ti-Rong Wu, I-Hsuan Chu, Yu-Yu Yang, I-Chen Wu, Yitian Huang, Qinlu Cao, Yiheng Sun, Yuhong Dai, Hongkun Yao, Jingxuan Fu, Jiwei Zhang, Hao Liao, Mossimo Ebeling, Govind Arun, Sadhvik Bathini, Mihir S Arya, Avinash Anish, Aditya Ranjan, Kirtana Sunil Phatnani, Paval KS, Vrushali Mehta, Aravind S, Nikhil Arora, Tanya Upadhyay, Amol Bandagale, Yuan Lu, ChunEn Hsiao, YuTing Lin, Arvin Chung, Jerry John Thomas, Mathieu Laurière, Leshem Choshen, Yoram Bachrach, Pramod Viswanath, Maria Polukarov, Cheston Tan, Tal Kachman, Atlas Wang
MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs - arXiv:2605.29512v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed as interactive agents,...
arxiv.org/abs/2605.29512
- Context
- Introduces a new, comprehensive multi-agent evaluation platform (Mindgames) and dataset, directly addressing the core topic of agentic tools and power dynamics.
- Provenance
- Article · Supporting source
-
32
arXiv cs.AI - Research Science (GLOBAL)
Article Zekai Yu, Qi Meng, Qizhi Chu, Yu Hao, Chuan Shi, Cheng Yang
ParaTool: Shifting Tool Representations from Context to Parameters - arXiv:2605.29561v1 Announce Type: new Abstract: Tool calling extends large language models (LLMs) by enabling grounded interaction with external...
arxiv.org/abs/2605.29561
- Context
- This paper proposes a fundamental architectural shift for tool use in LLMs, moving from context-based documentation to parameter-based integration. This directly impacts agentic coding and LLM infrastructure.
- Provenance
- Article · Supporting source
-
33
arXiv cs.AI - Research Science (GLOBAL)
Article Kangrui Wang, Linjie Li, Zhengyuan Yang, Shiqi Chen, Zihan Wang, Li Fei-Fei, Jiajun Wu, Leonidas Guibas, Lijuan Wang, Manling Li
Planning with the Views via Scene Self-Exploration - arXiv:2605.29563v1 Announce Type: new Abstract: Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view...
arxiv.org/abs/2605.29563
- Context
- This paper details a critical planning gap in VLMs (view planning) and proposes a novel self-exploration framework. It directly addresses frontier model capabilities and 3D reasoning.
- Provenance
- Article · Supporting source
-
34
arXiv cs.AI - Research Science (GLOBAL)
Article Yang He, Xiao Ding, Bibo Cai, Yufei Zhang, Kai Xiong, Zhouhao Sun, Bing Qin, Ting Liu
DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning - arXiv:2605.29568v1 Announce Type: new Abstract: Tool-Integrated Reasoning (TIR) extends LLM...
arxiv.org/abs/2605.29568
- Context
- This is a primary artifact (arXiv paper) detailing a novel framework (DeepTool) for improving LLM reasoning and tool use via Process-Supervised RL. Directly relates to agentic coding tools and the shifting craft of software engineering.
- Provenance
- Article · Supporting source
-
35
arXiv cs.AI - Research Science (GLOBAL)
Article Silu Panda
FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification - arXiv:2605.29586v1 Announce Type: new Abstract: We introduce FinVerBench, a benchmark and validity study for...
arxiv.org/abs/2605.29586
- Context
- This introduces a new, specialized benchmark (FinVerBench) using SEC filings for LLM financial verification. It directly addresses model reliability and real-world application in finance.
- Provenance
- Article · Supporting source
-
36
arXiv cs.AI - Research Science (GLOBAL)
Article Junyoung Park, Sunghwan Park, Seongyong Ju, Jaewoo Lee
Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures - arXiv:2605.29629v1 Announce Type: new Abstract: Attack Success Rate (ASR) evaluates each jailbreak with a single yes/no label at the...
arxiv.org/abs/2605.29629
- Context
- This paper introduces a new, more granular safety evaluation metric (TLO) that moves beyond simple success/failure rates. It directly impacts how LLM safety is tested and deployed.
- Provenance
- Article · Supporting source
-
37
arXiv cs.AI - Research Science (GLOBAL)
Article Jiajie Fu, Junwen Chen, Mengzhao Wang, Aoxiang He, Maojia Sheng, Xiangyu Ke, Yifan Zhu, Yunjun Gao
VikingMem: A Memory Base Management System for Stateful LLM-based Applications - arXiv:2605.29640v1 Announce Type: new Abstract: Large Language Models have revolutionized interactive applications; however, their finite...
arxiv.org/abs/2605.29640
- Context
- Addresses the critical technical challenge of state management and long-term memory for LLM applications, a core topic for agentic tools and software engineering.
- Provenance
- Article · Supporting source
-
38
arXiv cs.AI - Research Science (GLOBAL)
Article Elliot Gestrin, Jendrik Seipp
LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning - arXiv:2605.29649v1 Announce Type: new Abstract: Heuristic search is the dominant paradigm in symbolic AI planning, and the strongest heuristics are...
arxiv.org/abs/2605.29649
- Context
- This paper reports a primary artifact (new heuristic) that uses LLMs to generate domain-independent planning heuristics, directly addressing the 'agentic coding tools' and 'shifting craft of software engineering' topics.
- Provenance
- Article · Supporting source
-
39
arXiv cs.AI - Research Science (GLOBAL)
Article Johannes Moll, Jean-Philippe Corbeil, Jiazhen Pan, Martin Hadamitzky, Daniel Rueckert, Lisa Adams, Keno Bressem
GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents - arXiv:2605.29668v1 Announce Type: new Abstract: LLM agents acting in structured environments fail in operational rather than conversational...
arxiv.org/abs/2605.29668
- Context
- This paper introduces GRASP, a method for reliable self-improvement in LLM agents by preventing catastrophic forgetting (regression). This is a core technical advance in agentic systems.
- Provenance
- Article · Supporting source
-
40
arXiv cs.AI - Research Science (GLOBAL)
Article Lorenz Kutschka, Bernhard Geiger
Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems - arXiv:2605.29676v1 Announce Type: new Abstract: Large language models in Agentic AI systems consume tool schemas and execution...
arxiv.org/abs/2605.29676
- Context
- This paper directly addresses the infrastructure and efficiency of agentic AI systems by proposing and benchmarking token-optimized data formats (TOON, TRON) to replace JSON for tool schemas and execution results.
- Provenance
- Article · Supporting source
-
41
arXiv cs.AI - Research Science (GLOBAL)
Article Pedro Orvalho, Marta Kwiatkowska, Guillem Alenyà, Felip Manyà
Reliable Reasoning with Large Language Models via Preference-Based Maximum Satisfiability - arXiv:2605.29687v1 Announce Type: new Abstract: Large Language Models (LLMs) excel at understanding natural language but...
arxiv.org/abs/2605.29687
- Context
- This paper proposes a verifiable, structured approach (MaxSAT) for LLMs to solve complex optimization problems, directly addressing LLM reliability and capability in constrained domains like robotics.
- Provenance
- Article · Supporting source
-
42
arXiv cs.AI - Research Science (GLOBAL)
Article Yuchen Liu, Yingjie Feng, Lixiong Qin, Jiasi Chen, Jianing Yu, Sheng Gao, Sheng Yang, Weiran Xu
Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling - arXiv:2605.29697v1 Announce Type: new Abstract: In Agentic Search, trajectory-level outcome rewards fail to quantify the...
arxiv.org/abs/2605.29697
- Context
- This is a new arXiv paper proposing a novel step-level reward mechanism (GDCR/SAPO) for agentic search, directly addressing core challenges in agentic AI.
- Provenance
- Article · Supporting source
-
43
arXiv cs.AI - Research Science (GLOBAL)
Article Mincheol Kang, Hyunjin Lim, Bomin Kang, Daehee Park
BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices - arXiv:2605.29705v1 Announce Type: new Abstract: Trajectory prediction is a fundamental task for autonomous systems, requiring complex...
arxiv.org/abs/2605.29705
- Context
- New research (arXiv) on deploying complex LLM reasoning (trajectory prediction) to resource-constrained edge devices. Directly impacts autonomous systems and AI infrastructure.
- Provenance
- Article · Supporting source
-
44
arXiv cs.AI - Research Science (GLOBAL)
Article Shuaidi Wang, Zhan Zhuang, Ruping Huang, Yu Zhang
NaRA: Noise-Aware LoRA for Parameter-Efficient Fine-Tuning of Diffusion LLMs - arXiv:2605.29716v1 Announce Type: new Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising non-autoregressive...
arxiv.org/abs/2605.29716
- Context
- This is a primary artifact (arXiv paper) detailing a new PEFT method (NaRA) specifically for Diffusion LLMs (dLLMs), improving code generation and reasoning.
- Provenance
- Article · Supporting source
-
45
arXiv cs.AI - Research Science (GLOBAL)
Article Yeong-Joon Ju, Seong-Whan Lee
Citation-Closure Retrieval and Per-Rule Attribution for Real-World Regulatory Compliance Question Answering - arXiv:2605.29742v1 Announce Type: new Abstract: Deploying Large Language Models (LLMs) for regulatory...
arxiv.org/abs/2605.29742
- Context
- Addresses regulatory compliance and traceability for LLMs, directly impacting legal/policy use cases (HIPAA, national regulations). Provides a new benchmark (RegOps-Bench) and framework (RefWalk).
- Provenance
- Article · Supporting source
-
46
arXiv cs.AI - Research Science (GLOBAL)
Article Yanan Wang, Shuaicong Hu, Jian Liu, Guohui Zhou, Aiguo Wang, Cuiwei Yang
Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence - arXiv:2605.29744v1 Announce Type: new Abstract: The impressive performance of generalist large language...
arxiv.org/abs/2605.29744
- Context
- This paper addresses the architecture of medical AI, focusing on multi-agent systems and the synergy between generalist and specialist models. This is core to the 'where intelligence is built' and 'power dynamics' themes.
- Provenance
- Article · Supporting source
-
47
arXiv cs.AI - Research Science (GLOBAL)
Article Omar Benjelloun, Leonardo Martins Bianco, Isabelle Guyon, Thanh Gia Hieu Khuong, Jonathan Lebensold, Sebastian Lobentanzer, Luis Oala, Benedictus Kent Rachmat, Ihsan Ullah, Peyman Vahidi, Joaquin Vanschoren
Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations - arXiv:2605.29786v1 Announce Type: new Abstract: Reproducibility is fundamental to the scientific method, yet remains a critical...
arxiv.org/abs/2605.29786
- Context
- Introduces a formal, machine-actionable metadata standard (Croissant Tasks) for conceptual reproducibility, directly impacting ML evaluation and agentic development.
- Provenance
- Article · Supporting source
-
48
arXiv cs.AI - Research Science (GLOBAL)
Article Ashutosh Ojha, Vinay Aggarwal, Ashutosh Srivastava, Siddharth Yedlapati, Yaman K Singla, Jitendra Ajmera
MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains - arXiv:2605.29795v1 Announce Type: new Abstract: Real-world tasks often lack large labeled datasets, motivating extensive work on learning in low-data...
arxiv.org/abs/2605.29795
- Context
- Presents a novel agentic framework (MEMENTO) that treats the web as a learning signal, directly addressing agentic coding/practice and AI infrastructure.
- Provenance
- Article · Supporting source
-
49
arXiv cs.AI - Research Science (GLOBAL)
Article Yunbo Tang, Chengyi Yang, Shiyu Liu, Zhishang Xiang, Zerui Chen, Qinggang Zhang, Jinsong Su
SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search - arXiv:2605.29796v1 Announce Type: new Abstract: Agentic search enables LLMs to solve complex multi-hop questions through iterative...
arxiv.org/abs/2605.29796
- Context
- Presents a novel RL framework (SAAS) to solve a critical, practical limitation (over-search) in agentic search, directly impacting agentic coding/tools.
- Provenance
- Article · Supporting source
-
50
arXiv cs.AI - Research Science (GLOBAL)
Article Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang, Guanxu Chen, Yuejin Xie, Qinghua Mao, Wanying Qu, Yanxu Zhu, Tianyi Zhou, Leitao Yuan, Zhijie Zheng, Qihao Lin, Yimin Wang, Haoyu Luo, Shuai Shao, Chen Qian, Qingyu Liu, Ling Tang, Ruiyang Qin, Qihan Ren, Junxiao Yang, Kun Wang, Zhiheng Xi, Linfeng Zhang, Ranjie Duan, Bo Zhang, Wenjie Wang, Wen Shen, Qiaosheng Zhang, Yan Teng, Chaochao Lu, Rui Mei, Man Li, Jialing Tao, Xi Lin, Tianhang Zheng, Yong Liu, Quanshi Zhang, Lei Zhu, Xingjun Ma, Junhua Liu, Hui Xue, Xiaoxiang Zuo, Xiangnan He, Chao Shen, Xianglong Liu, Minlie Huang, Jing Shao, Xia Hu
AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security - arXiv:2605.29801v1 Announce Type: new Abstract: Modern open-world agents such as OpenClaw exhibit powerful...
arxiv.org/abs/2605.29801
- Context
- This paper introduces a new, lightweight, and scalable alignment framework (AgentDoG 1.5) for AI agents, directly addressing safety risks in advanced agentic systems. It is a primary artifact with clear downstream consequence for agent deployment.
- Provenance
- Article · Supporting source
-
51
arXiv cs.AI - Research Science (GLOBAL)
Article Krzysztof Żurawicki, Julia Farganus, Arkadiusz Gaweł, Mateusz Bystroński, Tomasz Jan Kajdanowicz
PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing - arXiv:2605.29815v1 Announce Type: new Abstract: The growing number of submitted papers has motivated the exploration of Large Language Models...
arxiv.org/abs/2605.29815
- Context
- This paper introduces a benchmark (PRAIB) and empirical study on LLM review behavior, directly impacting the reliability and deployment of AI in academic/scientific processes.
- Provenance
- Article · Supporting source
-
52
arXiv cs.AI - Research Science (GLOBAL)
Article Haochen Yang, Ke Zhao, Mengyuan Ma, Xingyu Lu, Xiangfeng Wang, Hong Qian
OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation - arXiv:2605.29829v1 Announce Type: new Abstract: Leveraging Large Language Models (LLMs) to automatically...
arxiv.org/abs/2605.29829
- Context
- This is a primary artifact (arXiv paper/tool) detailing a new agentic system (OptSkills) for solving optimization problems using LLMs, directly addressing generalization and skill learning.
- Provenance
- Article · Supporting source
-
53
arXiv cs.AI - Research Science (GLOBAL)
Article Minyang Hu, Bo Yang, Zhinuo Zhou, Jiachen Liang, Guo Jiahao, Yiyang Yin, Xiongwei Han
Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories - arXiv:2605.29893v1 Announce Type: new Abstract: LLM-based agents have demonstrated strong capabilities in solving complex tasks...
arxiv.org/abs/2605.29893
- Context
- Introduces a new benchmark (RedundancyBench) and research area for evaluating agent efficiency, directly impacting agentic coding tools and practice.
- Provenance
- Article · Supporting source
-
54
arXiv cs.AI - Research Science (GLOBAL)
Article Toru Takahashi
Toward AI Systems That Understand Self and Others: A Multi-Phase Inference Framework for Human Cognitive Diversity and World-Model Alignment - arXiv:2605.29930v1 Announce Type: new Abstract: Mutual misunderstanding in...
arxiv.org/abs/2605.29930
- Context
- Addresses core AI alignment and world-model issues, directly impacting how AI systems interact with human cognitive diversity and social reality.
- Provenance
- Article · Supporting source
-
55
arXiv cs.AI - Research Science (GLOBAL)
Article Ahmad Rammal, Niket Patel, Fabian Gloeckle, Amaury Hayat, Julia Kempe, Remi Munos, Charles Arnal, Vivien Cabannes
Formalizing Mathematics at Scale - arXiv:2605.29955v1 Announce Type: new Abstract: We present AutoformBot, a multi-agent system for building an Autoformalized Textbook Library At Scale (Atlas) in Lean 4. AutoformBot...
arxiv.org/abs/2605.29955
- Context
- This describes a multi-agent system (AutoformBot) for autoformalizing complex mathematics (Lean 4) at scale. It's a major artifact demonstrating AI's capability to automate high-level, verifiable knowledge creation, impacting science and education.
- Provenance
- Article · Supporting source
-
56
arXiv cs.AI - Research Science (GLOBAL)
Article Kun Feng, Ziwei Shan, Yuchen Fang, Yiyang Tan, Sihan Lu, Shuqi Gu, Lintao Ma, Xingyu Lu, Kan Ren
KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning - arXiv:2605.30002v1 Announce Type: new Abstract: Cross-domain multimodal time series forecasting is a challenging task, requiring models to...
arxiv.org/abs/2605.30002
- Context
- This is a primary artifact (arXiv paper) detailing a novel agentic framework for time series forecasting, directly addressing the intersection of LLMs, agents, and specialized AI infrastructure.
- Provenance
- Article · Supporting source
-
57
arXiv cs.AI - Research Science (GLOBAL)
Article Zhen Chen, Yibing Liu, Weihao Xie, Yu Liang, Peilin Chen, Shiqi Wang
RAISE: RAG Design as an Architecture Search Problem - arXiv:2605.30029v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) systems expose numerous design choices spanning query rewriting, chunking,...
arxiv.org/abs/2605.30029
- Context
- Introduces a new, comprehensive framework (RAISE) and benchmark for RAG optimization, directly addressing systematic challenges in AI architecture design.
- Provenance
- Article · Supporting source
-
58
arXiv cs.AI - Research Science (GLOBAL)
Article Tong Ye, Hang Yu, Tengfei Ma, Xuhong Zhang, Jianguo Li, Peng Di, Peiyu Liu, Jianwei Yin, Wenhai Wang
Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning - arXiv:2605.30039v1 Announce Type: new Abstract: Large Language Models have demonstrated remarkable progress in general-purpose...
arxiv.org/abs/2605.30039
- Context
- This paper introduces a new paradigm (DOMINO) for synthesizing domain-specific data for LLMs using only reference examples, bypassing manual prompt engineering. This directly impacts LLM training and application.
- Provenance
- Article · Supporting source
-
59
arXiv cs.AI - Research Science (GLOBAL)
Article Geremy Loachamín-Suntaxi, Robert Lazar, Dimitrios G. Giovanis, Ioannis G. Kevrekidis, Eleni D. Koronaki
Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection - arXiv:2605.30042v1 Announce Type: new Abstract: Automating scientific computing workflows...
arxiv.org/abs/2605.30042
- Context
- This describes a multi-agent system for scientific computing that addresses semantic drift and action-outcome fidelity, directly impacting agentic coding and AI infrastructure.
- Provenance
- Article · Supporting source
-
60
arXiv cs.AI - Research Science (GLOBAL)
Article Matt Y. Cheung, Ashok Veeraraghavan, Hanjie Chen, Guha Balakrishnan
Conformal Certification of Reasoning Trace Prefixes - arXiv:2605.30085v1 Announce Type: new Abstract: Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a...
arxiv.org/abs/2605.30085
- Context
- This paper introduces CROP, a method for certifying valid intermediate reasoning prefixes in LLMs. This directly addresses the reliability and process supervision of AI reasoning, which is core to agentic tools and frontier model safety.
- Provenance
- Article · Supporting source
-
61
arXiv cs.AI - Research Science (GLOBAL)
Article Tiancheng Yang, Matthias Schonlau, Ilia Sucholutsky
Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison - arXiv:2605.30087v1 Announce Type: new Abstract: Emerging personal AI agents are moving toward persistent,...
arxiv.org/abs/2605.30087
- Context
- This paper introduces a new benchmark and method for personal AI memory, directly addressing conflict resolution and selective QA, which is core to agentic systems.
- Provenance
- Article · Supporting source
-
62
arXiv cs.AI - Research Science (GLOBAL)
Article Hongxiang Zhang, Yuan Tian, Tianyi Zhang
Enhancing Multi-Agent Communication through Attention Steering with Context Relevance - arXiv:2605.30136v1 Announce Type: new Abstract: LLM-based multi-agent systems have demonstrated remarkable performance on complex...
arxiv.org/abs/2605.30136
- Context
- This is a new arXiv paper detailing a technical improvement (Agent-Radar) for multi-agent systems, directly addressing context management and performance degradation in complex AI applications.
- Provenance
- Article · Supporting source
-
63
arXiv cs.AI - Research Science (GLOBAL)
Article Ziyan Liu, Zhezheng Hao, Yeqiu Chen, Hong Wang, Jingren Hou, Ruiyi Ding, Yongkang Yang, Wence Ji, Wei Xia, Feng Liu
Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents - arXiv:2605.30159v1 Announce Type: new Abstract: Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing...
arxiv.org/abs/2605.30159
- Context
- This paper introduces a novel optimization method (MMPO) for long-horizon LLM agents, directly addressing memory degradation and belief deviation. This is a core technical advance in agentic AI.
- Provenance
- Article · Supporting source
-
64
arXiv cs.AI - Research Science (GLOBAL)
Article Caleb DeLeeuw
BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders - arXiv:2605.30162v1 Announce Type: new Abstract: Biosecurity evaluations of language models typically ask...
arxiv.org/abs/2605.30162
- Context
- Directly addresses model safety, refusal mechanisms, and internal auditing (SAE), which is critical to the power dynamics and reliability of frontier models.
- Provenance
- Article · Supporting source
-
65
arXiv cs.AI - Research Science (GLOBAL)
Article Haoming Xu, Weihong Xu, Zongrui Li, Mengru Wang, Yunzhi Yao, Chiyu Wu, Jin Shang, Yu Gong, Shumin Deng
When Should Models Change Their Minds? Contextual Belief Management in Large Language Models - arXiv:2605.30219v1 Announce Type: new Abstract: Long-horizon interactions require language models to manage accumulating...
arxiv.org/abs/2605.30219
- Context
- This is a new arXiv paper introducing a formal benchmark (BeliefTrack) and methods (RL, representation steering) for managing LLM belief states, directly impacting model reliability and capability.
- Provenance
- Article · Supporting source
-
66
arXiv cs.AI - Research Science (GLOBAL)
Article A. J. Lew (Unreasonable Labs), Y. Cao (Unreasonable Labs), M. J. Buehler (Unreasonable Labs)
ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure - arXiv:2605.30284v1 Announce Type: new Abstract: Scientific discovery is an inherently creative and...
arxiv.org/abs/2605.30284
- Context
- A new benchmark (ProjectionBench) for evaluating LLMs on scientific hypothesis generation and discovery. This directly addresses the 'AI scientist/co-scientist' frontier and model capabilities.
- Provenance
- Article · Supporting source
-
67
arXiv cs.AI - Research Science (GLOBAL)
Article Haowen Wang, Yaxin Du, Jian Yang, Jiajun Wu, Shukai Liu, Yuxuan Zhang, Pingjie Wang, Siheng Chen, Tuney Zheng, Ming Zhou, Xianglong Liu
MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection - arXiv:2605.30288v1 Announce Type: new Abstract: Mid-training has become an important stage in modern LLM development, using large-scale curated...
arxiv.org/abs/2605.30288
- Context
- This paper introduces MIRA, a source-aware data selection framework for mid-training LLMs. It directly addresses the core technical challenge of data curation and model capability enhancement.
- Provenance
- Article · Supporting source
-
68
arXiv cs.AI - Research Science (GLOBAL)
Article Yalun Dai, Yangyu Huang, Tongshen Yang, Yonghan Wang, Xin Zhang, Wenshan Wu, Qihao Zhao, Hao Li, Yuanyuan Gao, Kim-Hui Yap, Scarlett Li
Demystifying Data Organization for Enhanced LLM Training - arXiv:2605.30334v1 Announce Type: new Abstract: Large Language Models (LLMs) have revolutionized various fields, yet their training efficiency is heavily...
arxiv.org/abs/2605.30334
- Context
- This paper proposes novel data ordering methods (STR, SAW) and guidelines for optimizing LLM training data organization, directly impacting training efficiency and model performance.
- Provenance
- Article · Supporting source
-
69
arXiv cs.AI - Research Science (GLOBAL)
Article Anany Kotawala
Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents - arXiv:2605.30335v1 Announce Type: new Abstract: Multi-component LLM agents assemble probabilistic claims from...
arxiv.org/abs/2605.30335
- Context
- This paper addresses a fundamental failure mode (compositional incoherence) in multi-component LLM agents, directly impacting agentic coding and reliability.
- Provenance
- Article · Supporting source
-
70
@michpokrass (Michelle Pokrass)
X michpokrass
we shipped a new version of gpt-5.5 instant today. the previous model was too bullet pilled. the new one improves on some other important dimensions: sycophancy, factuality, and multilingual performance. hope you'll…
x.com/michpokrass/status/2060219759682330970
- Context
- Reports a primary artifact (new model release) and directly relates to the near-future of AI and frontier models.
- Provenance
- Tweet · Primary source
-
71
@trengriffin (Tren Griffin)
X trengriffin
Did the CNN reporter call Microsoft to confirm the claim? Nope. Microsoft switching from Claude code to GitHub Copilot (both with Opus 4.7 paid for by enterprise API usage) enables dogfooding of the GHCP harness so…
x.com/trengriffin/status/2060220238147551244
- Context
- Discusses a major shift in AI tooling (Claude to Copilot) and the underlying business/infrastructure dynamics (enterprise API usage, dogfooding), which is central to the podcast's focus.
- Provenance
- Tweet · Primary source
-
72
@badlogicgames (Mario Zechner)
X badlogicgames
pibot is now running fully local, using parakeet for STT, qwen3-tts for TTS, and Qwen 3.6 as the local multi-modal LLM via llama.cpp. The STT and TTS inference engines are Rust/mlx-c based. Ported from Python. So, zero…
x.com/badlogicgames/status/2060268257739677…
- Context
- Reports a specific, working technical artifact (pibot) and its local, dependency-free stack (Rust/mlx-c, Qwen 3.6), directly relevant to AI infrastructure and tools.
- Provenance
- Tweet · Primary source
-
74
Axios - Industry Adjacent (US)
Article Maria Curi
Inside the Democratic resistance on AI - Progressive Democrats taking hardline positions against AI are getting louder. Why it matters: Five influential progressives are shaping a confrontational Democratic message on...
www.axios.com/2026/05/29/inside-democratic-…
- Context
- Details specific policy proposals (moratoriums, taxes, labor protections) and political power dynamics (Sanders, AOC, Warren) shaping AI regulation and control.
- Provenance
- Article · Supporting source
-
75
Axios - Industry Adjacent (US)
Article Zachary Basu
"The pitchforks are here": Billionaires work to contain AI's populist revolt - America's billionaires are developing their own prescriptions for AI-fueled inequality, anxious to defuse a populist revolt aimed at their...
www.axios.com/2026/05/29/ai-billionaires-te…
- Context
- Directly addresses power dynamics, wealth concentration, and policy/regulation (wealth tax, data centers) shaping AI's future.
- Provenance
- Article · Supporting source
-
76
Real-time LLM Inference on Standard GPUs: 3k tokens/s per request — 67 pts · 38 comments
Article NicoConstant
https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/ · @mungoman2: This looks very interesting. Possible to get those rates without exotic hardware. But I have to say that the…
blog.kog.ai/real-time-llm-inference-on-stan…
- Context
- Directly addresses AI infrastructure (inference, GPUs) and model performance, which is central to the podcast topic.
- Provenance
- Article · Supporting source
-
77
Techmeme - Industry Adjacent (US)
Article
OpenAI says it has briefed the White House on its new biodefense program, which uses GPT-Rosalind to help develop biodefense and pandemic preparedness tools (Maria Curi/Axios) - Maria Curi / Axios : OpenAI says it has...
www.techmeme.com/260529/p13
- Context
- Directly addresses the intersection of frontier AI (GPT-Rosalind) and critical infrastructure/policy (biodefense/White House), fitting the power dynamics theme.
- Provenance
- Article · Supporting source
-
78
TechCrunch AI - Media Culture (US)
Article Kate Park
This chip startup just raised $135M on a bet that AI’s biggest bottleneck isn’t compute — it’s memory - South Korean chip startup XCENA is betting that AI's real bottleneck is not compute, but memory.
techcrunch.com/2026/05/29/xcena-secures-135…
- Context
- Directly addresses AI infrastructure bottlenecks (memory/HBM), a core topic. Funding/valuation adds market/capital dynamics.
- Provenance
- Article · Supporting source
-
79
Techmeme - Industry Adjacent (US)
Article
Former Tesla data labelers say FSD relies on laborious mapping for hazards; crash data analysis shows Tesla exaggerates FSD's safety via flawed methodology (Reuters) - Reuters : Former Tesla data labelers say FSD...
www.techmeme.com/260529/p16
- Context
- Directly challenges Tesla's safety claims and methodology for FSD, impacting public trust, regulation, and the viability of autonomous systems.
- Provenance
- Article · Supporting source
-
80
@emollick (Ethan Mollick)
X emollick
Reconstructing software engineering around AI is going to take work (even as the ability of AI to code increases at a rapid rate). Organizations are ideally spending tokens for two things: 1) building stuff 2)…
x.com/emollick/status/2060357604044358108
- Context
- Directly addresses the shifting craft of software engineering and the need for organizational investment in AI-assisted development and experimentation.
- Provenance
- Tweet · Primary source
Transcript
00:00:00 lenar: A guy named Mario Zechner posted a photo this morning of a small device on his desk — speaker, microphone, a tangle of cables running back to a tiny board. He calls it pibot. He's been building it for a few months. As of this morning, the whole voice loop runs locally on the box: speech-to-text, language model, and text-to-speech, all running without a cloud round trip and without a Python interpreter behind it. He talks. It answers. Everything stays in the room. So here's what we'll walk through over the next half hour. Mario's box sits on the personal end of the story today. Ethan Mollick posted a short note about how organizations should split their AI budget between building and learning. And a cluster of papers landed on arXiv this morning that all read like the same observation from different angles — as we run agents for longer, they fall apart in new ways, and the fixes aren't better models. They're better wrappers around the models. We'll close with a benchmark for conflicting personal memory, an auditing technique that asks how deep a refusal actually goes, and a project that's pulling math textbooks into Lean 4. Damra, where do you want to start?
00:01:06 damra: Start with the box. It's the only thing in the story you can point at. The stack he's running is Parakeet for speech-to-text, Qwen 3 TTS for synthesis, and Qwen 3.6 as the multimodal large language model behind it all, served through llama.cpp. Parakeet is Nvidia's open recognition family. Qwen 3 TTS is Alibaba's open synthesis model. Qwen 3.6 is the dense multimodal release from earlier this month. What's new in Mario's setup isn't the model lineup. It's the runtime. He ported the recognition and synthesis inference engines from Python into Rust on top of mlx-c. So none of those four components need a Python interpreter at runtime.
00:01:50 lenar: Which matters why? Spell it out for someone who hasn't tried to put one of these on a small device.
00:01:56 damra: Because Python is what you eventually hit. You can ship a quantized model in a small package. The moment your recognition stack demands torch and your audio pipeline pulls in transformers, you're back to a multi-hundred-megabyte install on a Pi-class machine — and a cold-start time the user can feel. Rust plus mlx-c keeps you in single-binary territory. The whole assistant fits in a fraction of the disk and starts in a fraction of the time. And on Apple silicon, mlx-c lets him use the unified memory the way the hardware wants to be used.
00:02:30 lenar: There's a photo of the device sitting on his desk, by the way. It isn't a research result. It's a person saying — the local voice stack is finally good enough that I built one for my apartment, and it works. A year ago that sentence required a workstation under the desk. Two years ago it required a cloud bill.
00:02:47 damra: And it changes what "agent" means in homes and small offices. If the speech round trip stays on the device, the conversation history stays on the device. That's a different privacy posture than anything that ships an audio buffer to a cloud endpoint. It's also a different latency posture. The reason most voice assistants feel sluggish isn't the model. It's the network leg in both directions.
00:03:11 lenar: Two things I'll flag and not oversell. One — I haven't run pibot myself. I'm taking Mario at his word that the throughput is conversational. He doesn't post a tokens-per-second number in the screenshot I'm looking at. Two — there's a real model-quality gap between Qwen 3.6 at the size he's running and the frontier hosted models. He isn't claiming parity. He's claiming the local version answers the kinds of questions he's actually asking it. Which is the right test for this category.
00:03:40 damra: It does not have to beat Opus on a benchmark. It has to be good enough that the user doesn't reach for their phone. And the cohort of people who'd build one of these for their kitchen — they care more about "works without the internet" than about the last few points on MMLU.
00:03:55 lenar: Ethan Mollick — who teaches at Wharton and writes the One Useful Thing newsletter — posted a thought this morning about how organizations should spend their AI budget. He frames it as two buckets. One — tokens you spend on building things. Two — and the version of the tweet I can see ends in an ellipsis, so the second item is cut off. But in context, and given what Mollick has written before, the second bucket is tokens you spend learning what works. Tokens against problems you don't yet know how to solve.
00:04:23 damra: The truncation matters less than the move. He's saying token spend isn't a single line item. It's two different activities with two different success criteria. Building is — ship the artifact. Learning is — find out whether this even makes sense. Those want different review cadences, different teams, and different definitions of done.
00:04:43 lenar: And the reason I keep that distinction in mind this week is that we spent yesterday on Kirkland and Ellis's five-hundred-million-dollar internal AI platform. Most of that money is going into building. What's harder to see in the K&E story is whether they've reserved enough capacity to learn — to try things that don't work, that they can throw away. Internal AI orgs at that scale almost always under-fund the learning bucket, because the deliverables column is what gets approved at the board meeting.
00:05:11 damra: And when the board approves a budget, the line items are deliverables. Nobody writes — twenty percent of our token spend will go to ideas we abandon. But that's exactly the spend that tells you which deliverables are worth shipping next. The team that ran experiments and threw them away knows things the team that only shipped doesn't.
00:05:30 lenar: I'd add one more. Experiment tokens have a different review cadence. Build tokens get checked at the end. Experiment tokens have to be checked weekly, because otherwise you can spend a quarter of compute against a problem no one can describe well enough to evaluate the result against.
00:05:46 damra: And the people running the experiment have to be the same people who'd ship the result. If you split the experiment team from the deploy team, the experiment team learns things the deploy team doesn't trust, and the deploy team builds things the experiment team would've talked them out of.
00:06:01 lenar: Mollick's post is short. The implication is heavier than the post. Read it next to yesterday's K&E story and what you'd ask of any internal AI org becomes — what's the learning budget, who controls it, and how often does it get reset to zero so the team can try again.
00:06:16 lenar: Four papers landed on arXiv this morning that read as a cluster, even though none of the authors know each other. Each one names a different way that long-running agent sessions fall apart. Together they sketch the shape of where the reliability work is right now.
00:06:31 damra: Walk me through them. Slowly. Start with the one that connects back to what we covered Wednesday.
00:06:37 lenar: Right. The first is Meta-Cognitive Memory Policy Optimization — MMPO — from a team led by Ziyan Liu. The setup is long-horizon agents that keep their context manageable by recursively summarizing their own history. After each step, the agent compresses what it knows into a smaller summary. The problem they name is belief deviation. After enough summarization rounds, the agent's working belief about the world drifts away from what was actually established earlier. The summary is fluent. It's also slightly wrong. And the next summary compresses the slightly wrong version, so the drift compounds.
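[Editor's sketch] The compounding described here can be illustrated with back-of-envelope arithmetic. This is a toy model and not anything taken from the MMPO paper: it assumes every summarization round independently preserves a fixed fraction of the facts that survived the round before, and the 0.98 retention rate is an invented number chosen only to show the shape of the curve.

```python
# Toy model of belief drift under recursive summarization.
# Assumption (not from the MMPO paper): each round keeps a fixed
# fraction `retention` of the facts that survived the previous round.

def surviving_fraction(retention: float, rounds: int) -> float:
    """Fraction of originally established facts still intact after `rounds` summaries."""
    if not 0.0 <= retention <= 1.0:
        raise ValueError("retention must be between 0 and 1")
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    return retention ** rounds


if __name__ == "__main__":
    # Hypothetical 2 percent loss per round.
    for rounds in (1, 10, 50, 100):
        print(rounds, round(surviving_fraction(0.98, rounds), 3))
```

Under those assumptions a loss that looks negligible per round leaves roughly 82 percent of the original facts after 10 rounds and about 13 percent after 100, which is why a fluent but slightly wrong summary matters more the longer the session runs.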
00:07:14 damra: Wait — recursive summarization, you mean the technique every long-context agent has been using for the last year? That's what they're modifying?
00:07:22 lenar: That's what they're modifying. Their move is to train a policy that decides when to update memory and when to leave it alone. Memory updates become actions the policy can refuse. If the new information doesn't change anything important, the policy leaves the existing summary as it is. They report meaningful gains on long-horizon tasks. The intuition tracks — most of the drift in these systems comes from over-eager rewrites, not from missing updates.
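[Editor's sketch] A minimal illustration of memory updates as actions that can be refused. MMPO trains this decision as a policy; the version below swaps in a hand-written heuristic purely to show the control flow. The names `novelty`, `GatedMemory`, and `should_write`, the word-overlap score, and the 0.5 threshold are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


def novelty(memory: str, observation: str) -> float:
    """Share of the observation's words that the memory does not already contain."""
    seen = set(memory.lower().split())
    words = observation.lower().split()
    if not words:
        return 0.0
    return sum(w not in seen for w in words) / len(words)


@dataclass
class GatedMemory:
    summary: str = ""
    threshold: float = 0.5  # hypothetical cutoff; MMPO learns this decision instead
    writes: int = 0
    skips: int = 0

    def should_write(self, observation: str) -> bool:
        """Heuristic stand-in for the learned write/skip policy."""
        return novelty(self.summary, observation) >= self.threshold

    def step(self, observation: str) -> bool:
        """Apply the gate; return True if memory was rewritten."""
        if self.should_write(observation):
            # Stand-in for an LLM rewriting the summary: append only.
            self.summary = (self.summary + " " + observation).strip()
            self.writes += 1
            return True
        self.skips += 1  # leave the existing summary untouched
        return False


if __name__ == "__main__":
    mem = GatedMemory()
    for obs in ["config lives in settings.toml",
                "config lives in settings.toml",
                "tests fail on python 3.12"]:
        print(mem.step(obs), "|", mem.summary)
```

The point of the structure is that skipping is a first-class outcome: a repeated observation leaves the summary byte-for-byte unchanged, so there is no rewrite for drift to creep into.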
00:07:48 damra: Which maps cleanly onto what we covered Wednesday — the agent memory degradation work, and the broader observation that persistent memory systems age badly. This is the same family of problem with a learned controller bolted on top. The controller's job is to know when to write.
00:08:04 lenar: The second paper is RedundancyBench, from a team at Huawei and Hong Kong Polytechnic, lead author Minyang Hu. They ask whether the steps an agent actually takes in a long trajectory are necessary. They build a benchmark for detecting redundant steps after the fact. The headline finding — a meaningful fraction of agent steps in current systems are repeats. The agent re-reads the same file it read fifty steps ago. It re-queries the same endpoint. It re-derives a fact it already had in context.
00:08:34 damra: Which sounds boring until you do the math on a thousand-step trajectory. If a quarter of your steps are redundant, you're paying for inference and tool calls you don't need, and you're filling the context with stale duplicates of state the agent already established. So the redundancy isn't just a cost line item. It actively makes the next step worse, because the relevant signal is now buried under repetition.
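[Editor's sketch] To make that arithmetic concrete, here is a minimal way to flag exact repeats in a trajectory and measure how much of the run they take up. It is not RedundancyBench's detection method, which this conversation doesn't describe. It assumes a step is just a (tool, argument) pair and counts only verbatim repeats, so it would wrongly flag a re-read that was needed because the file had changed.

```python
from typing import Iterable, List, Tuple

# Hypothetical step representation: (tool name, argument string).
Step = Tuple[str, str]


def redundant_indices(trajectory: Iterable[Step]) -> List[int]:
    """Indices of steps that exactly repeat an earlier step."""
    seen = set()
    repeats = []
    for i, step in enumerate(trajectory):
        if step in seen:
            repeats.append(i)
        else:
            seen.add(step)
    return repeats


def redundancy_rate(trajectory: Iterable[Step]) -> float:
    """Fraction of steps that are exact repeats (0.0 for an empty trajectory)."""
    steps = list(trajectory)
    if not steps:
        return 0.0
    return len(redundant_indices(steps)) / len(steps)


if __name__ == "__main__":
    run = [
        ("read_file", "a.py"),
        ("http_get", "/status"),
        ("read_file", "a.py"),   # same file read again
        ("read_file", "b.py"),
    ]
    print(redundant_indices(run), redundancy_rate(run))
```

At the quarter-redundant rate used in the example above, a thousand-step trajectory spends 250 steps on repeats, and each of those also leaves a duplicate result sitting in context.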
00:08:58 lenar: Third — Anany Kotawala has a single-author paper with my favorite title of the day. Locally Coherent, Globally Incoherent. Bounding compositional incoherence in multi-component LLM agents. The framing — each sub-agent or sub-component in a multi-agent pipeline produces something defensible on its own. The assembled output is internally inconsistent because the components don't share constraints with each other. Kotawala's contribution is a bound. He proves a relationship between how often individual components are locally right and how often the assembly is globally right.
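[Editor's sketch] For intuition about what a local-to-global relationship can look like, here are two textbook probability facts. They are not Kotawala's bound, which isn't stated in this conversation. If each of n components is right with probability at least p, the union bound guarantees that all of them are right at once with probability at least 1 - n(1 - p), whatever the dependence between them. If their errors happened to be independent and each were right with probability exactly p, the figure would be p to the power n. Both treat "every component right" as a stand-in for global correctness, which is itself an assumption.

```python
def all_right_lower_bound(p: float, n: int) -> float:
    """Union-bound floor on P(all n components right), given each is right w.p. >= p."""
    if not 0.0 <= p <= 1.0 or n < 1:
        raise ValueError("need 0 <= p <= 1 and n >= 1")
    return max(0.0, 1.0 - n * (1.0 - p))


def all_right_if_independent(p: float, n: int) -> float:
    """P(all n components right) under the extra assumption of independent errors."""
    if not 0.0 <= p <= 1.0 or n < 1:
        raise ValueError("need 0 <= p <= 1 and n >= 1")
    return p ** n


if __name__ == "__main__":
    # Hypothetical pipeline: components that are each right 95 percent of the time.
    for n in (1, 3, 5, 10):
        print(n, all_right_lower_bound(0.95, n), all_right_if_independent(0.95, n))
```

Even in this simplified picture, five components that are each right 95 percent of the time are only guaranteed to be jointly right 75 percent of the time, so per-component accuracy that looks comfortable can leave little margin once the pieces are assembled.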
00:09:31 damra: That's the failure every team building multi-agent pipelines runs into the first time they show a demo to someone outside the room. Every component looks defensible. The end-to-end answer contradicts itself. The planner says one thing. The retriever brings back something inconsistent with that. The summarizer smooths over the conflict and produces a coherent-sounding paragraph that's wrong in a different way than either input.
00:09:55 lenar: And the fourth, briefly — Agent-Radar, from Hongxiang Zhang at Purdue. Same neighborhood. They study attention steering with context relevance in multi-agent communication. When sub-agents exchange messages, the receiving agent's attention spreads across irrelevant pieces of the incoming message and the relevant signal gets diluted. They propose a steering mechanism that biases attention toward context-relevant tokens.
00:10:21 damraSo if you read all four side by side — MMPO on memory drift, RedundancyBench on wasted steps, Compositional Incoherence on assembled wrongness, and Agent-Radar on attention dilution — you can see the shape. As agent sessions get longer and as more sub-components get composed together, the new failure modes aren't about whether the model can answer a question. They're about whether the trajectory stays coherent and whether the steps add up to something useful.
00:10:49 lenarAnd the fixes proposed across the four papers are not new model capabilities. They're control layers wrapped around the model. A learned policy that gates memory writes. A benchmark that catches redundancy. A bound that quantifies compositional damage. An attention steering mechanism. Same shape as the harness conversation we had Tuesday. The model is fine. The layer wrapping the model is where the bugs live now.
00:11:13 damraLet me put a brake on one piece. These are all arXiv preprints from today. None of them has been replicated yet. The MMPO numbers look strong, strong enough that I'd want to see another team rerun the experiments before I bring the policy into production. Kotawala's bound is single-author and the proof needs review.
00:11:31 lenarFair. The direction feels right and it lines up with what people running long agents in production have been complaining about all month. The specific numbers, I'm holding loosely. Anyone shipping an agent today should read MMPO and the redundancy paper this weekend. They might not adopt the methods. They'll recognize the failure modes.
00:11:50 lenarOne more in the same neighborhood, but on the personal-assistant side. Tiancheng Yang at Waterloo, with Matthias Schonlau and Ilia Sucholutsky from Vector, posted a benchmark and method comparison they're calling — and I'll just read the title — Selective QA over Conflicting Multi-Source Personal Memory. The setup is what happens when a personal AI assistant has accumulated memories about you from multiple sources, and those sources disagree.
00:12:16 damraGive me a concrete example. What does the disagreement look like in practice?
00:12:20 lenarThe example they walk through is preference conflict. Your calendar says you prefer morning meetings — the calendar's been saying that for two years. A message you sent two weeks ago says you've started blocking mornings for deep work and you want all meetings after lunch. Which one does the assistant believe when someone messages it asking to book time on your behalf? Both pieces of information were true when they were written. Neither one is a lie. They contradict each other now.
00:12:47 damraAnd the harder version of the same problem: neither source is wrong even today. The calendar is a stated preference. The message is a more recent stated preference. The assistant has to know that recency matters, that explicit statements override inferred ones, that some preferences are revisable and some aren't, and that some contexts in your life override others. That's a lot of judgment to ask a retrieval system to perform.
00:13:13 lenarThey build a diagnostic testbed across several conflict types, and they compare a range of methods — straight retrieval, retrieval with a conflict-resolution step, methods that condition on recency, and methods that condition on source type. The honest summary is that no single method dominates. Different conflict types want different resolution strategies. Systems that try to use one strategy for everything underperform compared to systems that route the conflict type first and then apply a type-specific resolver.
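The route-then-resolve pattern can be sketched as follows. This is an illustration under assumptions, not the paper's method: the `Memory` fields, the two conflict labels, the resolvers, and the choice to abstain on unrecognized conflicts are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Memory:
    text: str
    source: str      # e.g. "calendar" or "message"
    written: date
    explicit: bool   # True if the user stated it directly

def classify_conflict(a: Memory, b: Memory) -> str:
    """Hypothetical router: label the conflict by what separates the two memories."""
    if a.explicit != b.explicit:
        return "explicit_vs_inferred"
    if a.written != b.written:
        return "temporal"
    return "unknown"

def resolve_temporal(a: Memory, b: Memory) -> Memory:
    # Newest entry wins.
    return max(a, b, key=lambda m: m.written)

def resolve_explicit(a: Memory, b: Memory) -> Memory:
    # Direct statement wins over an inferred one.
    return a if a.explicit else b

RESOLVERS = {
    "temporal": resolve_temporal,
    "explicit_vs_inferred": resolve_explicit,
}

def resolve(a: Memory, b: Memory):
    """Route on conflict type first, then apply the type-specific resolver."""
    kind = classify_conflict(a, b)
    resolver = RESOLVERS.get(kind)
    if resolver is None:
        return kind, None  # assumption: abstain rather than guess
    return kind, resolver(a, b)

# The meeting-preference example from the conversation, with assumed dates.
calendar = Memory("prefers morning meetings", "calendar", date(2024, 5, 29), True)
message = Memory("mornings blocked, meetings after lunch", "message", date(2026, 5, 15), True)
kind, chosen = resolve(calendar, message)
print(kind, chosen.source)
```

The point of the dispatch table is that adding a new conflict type means adding a classifier branch and a resolver, not retuning one global weighting.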
00:13:43 damraWhich lines up with how humans handle the same problem. You don't have a single algorithm for resolving contradictory information about a friend. You weight sources by recency, by who said it, by how confident they sounded, by whether it was an explicit statement or an inference from behavior. Asking a retrieval system to bake one of those weightings into its index gets you the wrong answer in three out of four cases.
00:14:07 lenarThe reason this matters now — not in two years — is that the products shipping persistent memory right now don't have any of this machinery. When ChatGPT or Claude remember something about you, and that something becomes wrong, the next time the assistant uses that memory it confidently uses the stale version. There's no resolution step. There isn't a conflict-detection step. The newest entry doesn't necessarily win. The most explicit entry doesn't necessarily win. Whatever the retriever surfaces, the model treats as fact.
00:14:37 damraAnd that's a real cost for the user. Not a paper cost — a felt one. The assistant tells a coworker you prefer morning meetings when you've been telling everyone you don't, for the last two weeks. You don't see it happen. You just see the meeting on your calendar and wonder why nothing you say about your schedule sticks.
00:14:55 lenarCaleb DeLeeuw, an independent researcher, posted a paper called BioRefusalAudit. The premise is that current biosecurity evaluations of language models ask the model questions and grade whether it refuses. He argues that's a shallow test. A model can refuse for surface reasons — it pattern-matches on the phrasing of the question — and still have the relevant capability accessible if the question is asked differently.
00:15:21 damraSo how does he test deeper than that? What does the audit actually measure?
00:15:25 lenarHe uses sparse autoencoders — SAEs, the interpretability technique that's been getting attention this year — to look at the internal features the model activates when it's given a biosecurity-adjacent prompt. He asks a different question — not whether the model refused, but whether the model's internal representations contain the dangerous capability even when it refused at the surface. He compares general-purpose SAEs against ones he fine-tuned on the biosecurity domain to make the relevant features sharper.
00:15:55 damraThat's a real distinction. Refusing because the request matches a refusal pattern is different from refusing because the relevant knowledge isn't there. The first is brittle — paraphrase the request, switch language, embed the question in a roleplay, and the pattern stops matching. The second isn't brittle in the same way, because there's nothing to retrieve.
00:16:15 lenarHis finding is roughly that current refusal training in frontier open-weight models operates much more at the first level than the second. The capability is present internally. The refusal is a learned output filter sitting on top. And filters can be bypassed. The depth-versus-surface gap shows up clearly in the SAE features.
00:16:35 damraWhich doesn't mean the filter is worthless. It means it's a layer, not a wall. The work this pushes on is whether we should be measuring refusal depth as a separate quantity from refusal rate. The current public scorecards for model safety mostly report the rate. They don't report the depth. And the depth is what determines how the model behaves against an adversary who's actually trying.
00:16:58 lenarThat's what I'd hand to any safety team running biosecurity evals this quarter. Are you measuring whether the model said no, or whether the model couldn't say yes? Those are different tests, and we mostly run the first. DeLeeuw's paper doesn't solve the second one. It builds the apparatus to ask it.
00:17:15 damraAnd it ties into something bigger. SAE-based auditing is moving from an interpretability curiosity into something safety teams will plausibly run as part of release evaluations within a year. Today's paper is one application. The general technique — read the internal features, don't just read the outputs — is the move.
00:17:34 lenarOne last item, and it's a more cheerful one. A team with Ahmad Rammal at the lead, with people from FAIR Paris and NYU, posted AutoformBot. It's a multi-agent system that builds something they're calling Atlas — an autoformalized textbook library in Lean 4. The headline claim is that the system can take textbook math written in natural language and turn it into machine-checked Lean code at scale.
00:17:58 damraDefine autoformalization for someone who hasn't met the term before.
00:18:02 lenarMathematics written as natural-language proofs, the way textbooks write them, with English between the equations and a fair amount of "it is clear that" and "by symmetry" papering over the steps, translated into a proof assistant's formal language. Lean 4 is the proof assistant. It checks every step. If a step in the formal proof doesn't follow, Lean refuses to compile it. Atlas is their target: a library of textbook math, formalized, that the Lean community can build on.
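For a sense of what the target looks like, here is a tiny hand-written Lean 4 example. It is not taken from Atlas; the theorem name `add_comm_example` is made up for illustration, and `Nat.add_comm` is the existing core library lemma it relies on.

```lean
-- Textbook sentence: "Addition of natural numbers is commutative."
-- Formal statement, with a proof term that Lean checks.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Lean accepts this only because the proof term has exactly the stated type. What Lean checks is the proof against the formal statement as written; whether that statement says the same thing as the textbook sentence is a separate question.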
00:18:30 damraWhy a multi-agent system for that? What's the role split?
00:18:34 lenarBecause formalizing a single theorem from natural language is a multi-stage problem. You have to parse the statement, decide what context to import, translate the statement into Lean, translate each proof step, close gaps that the textbook waved over, and verify that the translated version actually compiles. They give different agents different stages. One reads the textbook. One drafts the Lean statement. One drafts the proof. One closes the gaps the others left. They critique each other's output, and Lean is the ground-truth oracle for whether the final product survives.
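The draft, check, and repair loop with an outside oracle can be sketched generically. This is a sketch under assumptions, not AutoformBot's architecture: `formalize`, `draft`, and `check` are hypothetical names, and the stubs below stand in for a model call and a real Lean invocation.

```python
from typing import Callable, Optional

def formalize(
    statement: str,
    draft: Callable[[str, Optional[str]], str],
    check: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> Optional[str]:
    """Draft a candidate, ask the external checker, feed its error back, retry."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = draft(statement, feedback)
        feedback = check(candidate)  # None means the checker accepted it
        if feedback is None:
            return candidate
    return None  # nothing survived the checker within the budget

# Stubs for illustration only.
def stub_draft(statement: str, feedback: Optional[str]) -> str:
    # Pretend the drafter only gets it right after seeing an error message.
    return "repaired proof" if feedback else "first draft"

def stub_check(candidate: str) -> Optional[str]:
    # Pretend checker: accepts exactly one candidate, rejects the rest.
    return None if candidate == "repaired proof" else "type mismatch"

result = formalize("a + b = b + a", stub_draft, stub_check)
print(result)
```

The agents can disagree and retry as much as they like inside the loop; the only thing that decides success is the return value of `check`, which sits outside the agents.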
00:19:06 damraAnd does it work at the textbook scale they're claiming? Or is this one chapter and a press release?
00:19:12 lenarTheir claim is multi-textbook coverage with a meaningful fraction of theorems closing automatically. I haven't independently checked the numbers. The bigger story underneath — formalized math has been a fifteen-year project mostly run by small teams of dedicated mathematicians who hand-write Lean. The library that exists today, mathlib, was assembled one theorem at a time over a decade. If a multi-agent system can credibly do the bulk translation work, the rate of growth changes by an order of magnitude.
00:19:41 damraAnd once it's in Lean, it's verified. The agent can be wrong about the translation a hundred different ways. It can't be wrong about whether the Lean version compiles. The proof assistant is the ground truth. So unlike most agent benchmarks, this one isn't grading itself. The grading lives outside the loop entirely.
00:20:01 lenarThat's the part that makes this category interesting to me. Most agent benchmarks grade themselves — the same model that produced the answer is involved in judging the answer. This one has an outside verifier that doesn't care about model-style answers. Either the proof closes or it doesn't. There's no rubric, no judge model, and no partial credit.
00:20:21 damraAnd it's an early sign that some agent workflows have natural verification built into the domain. Coding has tests. Math has proofs. Hardware design has simulation. Most other domains do not — which is why so much of the agent literature this week is about catching incoherence inside the trajectory rather than at the output. When you can't verify the output, you have to verify the process. When you can verify the output, the process can be as messy as it needs to be.
00:20:49 lenarThat's where the day lands for me. A working local voice stack on a desk. A short note from Mollick about how to split your AI budget. A cluster of papers all saying that long agents fail in ways the harness has to catch. A benchmark for conflicting personal memory. An auditing technique that asks whether a refusal is shallow or deep. And a math project where the verifier is a proof assistant.
00:21:12 damraThe thread I'm pulling out — the model is rarely the constraint anymore on the things people are trying to build. The layer wrapped around it is. Mollick's learning budget, the MMPO memory policy, the RedundancyBench redundancy detector, the personal-memory conflict resolver, the SAE-based refusal auditor, and the proof-assistant verifier behind Atlas — different control layers, same job. They all sit between the model and the work, deciding what the model gets to do next.
00:21:41 lenarTomorrow is going to be quiet. I'll be reading the MMPO paper end to end and seeing if the numbers hold up to a closer look. If something surprising lands over the weekend, we'll cover it Monday. Lenar Kess.