Dispatch 041 · 2026-05-29 GSV Locally Coherent, Globally Not

Locally coherent, globally not

00:22:01 · 80 sources

“Are you measuring whether the model said no, or whether the model couldn't say yes? Those are different tests, and we mostly run the first.”

— Lenar Kess, today's narration

Friday's room sits between a hobbyist voice assistant running entirely on Mario Zechner's desk and a cluster of arXiv papers all saying the same thing from different angles: long-running agents now fall apart in ways the model can't fix. Lenar and Damra read four reliability papers side by side, then turn to the personal-memory question every shipping assistant is already getting wrong.

  • Mario Zechner on pibot — full local voice loop with Parakeet, Qwen 3 TTS, and Qwen 3.6 through llama.cpp, with the STT and TTS engines ported from Python into Rust on mlx-c. The runtime detail is the news, not the model lineup.
  • Ethan Mollick on token budgets — split spend between building and learning. Read against yesterday's Kirkland & Ellis platform story, the question becomes who controls the learning budget at internal AI orgs.
  • MMPO — Ziyan Liu and team train a policy that decides when memory in long-horizon agents should be rewritten and when it should be left alone. Belief drift comes from over-eager rewrites, not missing updates.
  • RedundancyBench — Minyang Hu's group benchmarks how many steps in a long agent trajectory are repeats. Stale duplicates of state crowd out the relevant signal in context.
  • Locally Coherent, Globally Incoherent — Anany Kotawala's single-author paper bounds compositional incoherence in multi-component agents. Defensible local outputs assemble into contradictory global ones.
  • Agent-Radar — Hongxiang Zhang's group steers attention toward context-relevant tokens in multi-agent communication, so the receiver isn't drowned in noise from the sender.
  • Selective QA over conflicting personal memory — Tiancheng Yang's testbed for what happens when your assistant's memories about you disagree. No single resolution strategy dominates.
  • BioRefusalAudit — Caleb DeLeeuw uses sparse autoencoders to ask whether a model's refusal is shallow pattern matching or whether the dangerous capability isn't there at all.
  • AutoformBot and Atlas — Ahmad Rammal's team at FAIR Paris and NYU on a multi-agent system that pulls textbook math into Lean 4 at scale. Lean is the verifier the agents can't argue with.
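
To make the RedundancyBench item above concrete, here is a minimal sketch of one way to count repeated steps in an agent trajectory. It is an illustration only, not the paper's metric: the function name `repeat_fraction`, the tuple representation of a step, and the exact-match definition of a "repeat" are all assumptions made for this example.

```python
def repeat_fraction(steps):
    """Fraction of steps that exactly repeat an earlier step.

    Hypothetical helper: a "step" is any hashable value, and a repeat
    is defined here as an exact match with a previously seen step.
    """
    seen = set()
    repeats = 0
    for step in steps:
        if step in seen:
            repeats += 1
        else:
            seen.add(step)
    return repeats / len(steps) if steps else 0.0


# Made-up trajectory of (tool, argument) pairs, for illustration only.
trajectory = [
    ("read_file", "config.yaml"),
    ("run_tests", ""),
    ("read_file", "config.yaml"),
    ("edit_file", "app.py"),
    ("run_tests", ""),
    ("read_file", "config.yaml"),
]

print(repeat_fraction(trajectory))  # 0.5
```

Exact matching is the crudest possible definition. A real benchmark would likely also need to catch near-duplicates, such as the same file re-read after an unrelated edit.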

Chapters

  1. 00:00:00 Transcript

Sources

80 cited
  1. 1

    OpenAI · 47m40s

    Video OpenAI

    Build Hour: Agents SDK — Build with the next evolution of the Agents SDK. In this Build Hour, you’ll learn how to use the updated Agents SDK to build long-running agents with a model-native harness. Give agents the…

    www.youtube.com/watch?v=tK32trvj_b4 →
    Context
    Directly addresses agentic coding tools, agent infrastructure, and the shifting craft of software engineering with technical depth.
    Provenance
    Video · Supporting source
  2. 2

    arXiv cs.AI - Research Science (GLOBAL)

    Article Al Kari

    The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling - arXiv:2605.28864v1 Announce Type: new Abstract: The Cognitive Categorical Transformer (CCT) is a 306M-parameter...

    arxiv.org/abs/2605.28864 →
    Context
    This is a primary artifact (arXiv paper) detailing a novel, theoretically grounded model architecture (CCT) and providing quantitative evidence of performance improvement via category theory concepts.
    Provenance
    Article · Supporting source
  3. 3

    arXiv cs.AI - Research Science (GLOBAL)

    Article Jiachen Zhang (Peking University, China Agricultural University), Junyi Lao (Peking University), Chenghao Liu (Peking University), Siyuan Liu (Peking University), Shixin Wu (Peking University), Linsen Zhang (Peking University), Boyu Wang (Peking University), Songfang Huang (Peking University)

    VFEAgent: A Multimodal Agent Framework for End-to-End Automated Finite Element Analysis - arXiv:2605.28978v1 Announce Type: new Abstract: Finite Element Analysis (FEA) serves as the cornerstone of modern engineering...

    arxiv.org/abs/2605.28978 →
    Context
    This paper describes an agentic system (VFEAgent) automating a complex, domain-specific engineering workflow (FEA) from multimodal inputs. This is a core example of AI applied to physical-world engineering.
    Provenance
    Article · Supporting source
  4. 4

    arXiv cs.AI - Research Science (GLOBAL)

    Article Sara Metcalf, William Schoenberg

    BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation - arXiv:2605.28994v1 Announce Type: new Abstract: AI tools to support real world decision making must be able to build simulation models that inform...

    arxiv.org/abs/2605.28994 →
    Context
    Establishes a new, open-source benchmark (BEAMS) for AI in critical real-world modeling/simulation, directly impacting AI's utility and trustworthiness.
    Provenance
    Article · Supporting source
  5. 5

    arXiv cs.AI - Research Science (GLOBAL)

    Article Aisha Najera, Alvin Moon, Vedant Srinivasan, Rajesh Veeraraghavan

    When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis - arXiv:2605.29025v1 Announce Type: new Abstract: Federal agencies are deploying large language models (LLMs) to categorize public comment...

    arxiv.org/abs/2605.29025 →
    Context
    Addresses the critical issue of model disagreement in real-world applications (public policy/federal agencies), directly impacting how intelligence is used and interpreted.
    Provenance
    Article · Supporting source
  6. 6

    arXiv cs.AI - Research Science (GLOBAL)

    Article Diego Gosmar, Deborah A. Dahl

    Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching - arXiv:2605.29055v1 Announce Type: new Abstract: Hallucination remains a major reliability barrier for production...

    arxiv.org/abs/2605.29055 →
    Context
    New research on hallucination mitigation, agentic pipelines, and semantic caching directly addresses reliability, infrastructure, and agentic tools.
    Provenance
    Article · Supporting source
  7. 7

    arXiv cs.AI - Research Science (GLOBAL)

    Article Siddharth Sai, Xiaofei Wen, Muhao Chen

    Robust and Efficient Guardrails with Latent Reasoning - arXiv:2605.29068v1 Announce Type: new Abstract: Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world...

    arxiv.org/abs/2605.29068 →
    Context
    This paper proposes COLAGUARD, a novel, efficient guardrail model for LLMs. It directly addresses the core tension between safety robustness and high-throughput deployment, which is critical for real-world AI infrastructure.
    Provenance
    Article · Supporting source
  8. 8

    arXiv cs.AI - Research Science (GLOBAL)

    Article Tyler Akidau, Tyler Rockwood, Johannes Brüderl, Marc Millstone

    The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane - arXiv:2605.29082v1 Announce Type: new Abstract: AI agents are increasingly expected to operate as digital employees:...

    arxiv.org/abs/2605.29082 →
    Context
    This paper proposes a critical architectural solution (ADP) for governing autonomous agents, directly addressing safety, policy, and enterprise data access—a core topic.
    Provenance
    Article · Supporting source
  9. 9

    arXiv cs.AI - Research Science (GLOBAL)

    Article Yubo Li, Ramayya Krishnan, Rema Padman

    The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure - arXiv:2605.29087v1 Announce Type: new Abstract: Reasoning models are evaluated on single-turn benchmarks but...

    arxiv.org/abs/2605.29087 →
    Context
    Reports a specific, measurable failure mode (unfaithful capitulation) in reasoning models under adversarial pressure, directly impacting model reliability and safety.
    Provenance
    Article · Supporting source
  10. 10

    arXiv cs.AI - Research Science (GLOBAL)

    Article Shreyas Fadnavis, Praitayini Kanakaraj, Felix Wyss

    Beyond Consensus: Trace-Level Synthesis in Mixture of Agents - arXiv:2605.29116v1 Announce Type: new Abstract: When multiple LLM agents solve the same problem, standard practice compresses each agent's reasoning into a...

    arxiv.org/abs/2605.29116 →
    Context
    This paper directly addresses agentic systems and the 'craft' of AI reasoning, arguing for trace-level synthesis over simple consensus voting. This is a core technical advance for agentic tools.
    Provenance
    Article · Supporting source
  11. 11

    arXiv cs.AI - Research Science (GLOBAL)

    Article Yifei He, Rui Yang, Hao Bai, Tong Zhang, Han Zhao

    PRO-CUA: Process-Reward Optimization for Computer Use Agents - arXiv:2605.29119v1 Announce Type: new Abstract: Computer use agents (CUAs) have shown strong potential for automating complex digital workflows, yet their...

    arxiv.org/abs/2605.29119 →
    Context
    This paper introduces a new framework (PRO-CUA) for training computer use agents (CUAs), directly addressing agentic coding/workflow automation and AI infrastructure challenges.
    Provenance
    Article · Supporting source
  12. 12

    arXiv cs.AI - Research Science (GLOBAL)

    Article Dueun Kim, Albert No

    The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models - arXiv:2605.29123v1 Announce Type: new Abstract: Masked diffusion language models (MDMs) uniquely support any-order generation, with...

    arxiv.org/abs/2605.29123 →
    Context
    Directly addresses model failure modes and reasoning limitations in diffusion models, a core topic for frontier AI research.
    Provenance
    Article · Supporting source
  13. 13

    arXiv cs.AI - Research Science (GLOBAL)

    Article Muhammad Zia Hydari, Raja Iqbal, Narayan Ramasubbu

    Governing Technical Debt in Agentic AI Systems - arXiv:2605.29129v1 Announce Type: new Abstract: Agentic AI systems are increasingly being explored as production infrastructure: they reason over multiple steps, call...

    arxiv.org/abs/2605.29129 →
    Context
    Defines 'Agentic Technical Debt' and 'Stochastic Tax,' directly addressing governance and infrastructure challenges in agentic AI systems.
    Provenance
    Article · Supporting source
  14. 14

    arXiv cs.AI - Research Science (GLOBAL)

    Article Daniel Lee, Owen Queen, James Zou

    ReasonOps: Operator Segmentation for LLM Reasoning Traces - arXiv:2605.29192v1 Announce Type: new Abstract: Chain-of-thought traces from large reasoning models can span tens of thousands of tokens, yet we lack a...

    arxiv.org/abs/2605.29192 →
    Context
    This paper introduces ReasonOps, a method to analyze and structure LLM reasoning traces, revealing common compositional structures and model fingerprints. This is core research on AI capability and understanding.
    Provenance
    Article · Supporting source
  15. 15

    arXiv cs.AI - Research Science (GLOBAL)

    Article Tenghao Huang, Kung-Hsiang Huang, Prafulla Kumar Choubey, Yilun Zhou, Muhao Chen, Jonathan May, Chien-Sheng Wu

    GTA: Generating Long-Horizon Tasks for Web Agents at Scale - arXiv:2605.29218v1 Announce Type: new Abstract: Web agents, which couple language models with browsing and tool-use capabilities, show promise as open web...

    arxiv.org/abs/2605.29218 →
    Context
    This paper introduces a scalable benchmark (GTA) for web agents, directly addressing the core topic of agentic coding tools and practice. It's a primary artifact with clear downstream consequence for agent development.
    Provenance
    Article · Supporting source
  16. 16

    arXiv cs.AI - Research Science (GLOBAL)

    Article Jiahao Huang, Fei Cheng, Junfeng Jiang, Zefan Yu, Akiko Aizawa

    BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents - arXiv:2605.29225v1 Announce Type: new Abstract: Self-evolving agents improve over time by reflecting on past failures, but...

    arxiv.org/abs/2605.29225 →
    Context
    Introduces BenchTrace, a new benchmark for evaluating self-evolving LLM agents, directly addressing agentic coding tools and agentic practice.
    Provenance
    Article · Supporting source
  17. 17

    arXiv cs.AI - Research Science (GLOBAL)

    Article Benlong Wu, Weiming Zhang, Kejiang Chen, Han Fang, Nenghai Yu

    Provably Secure Agent Guardrail - arXiv:2605.29251v1 Announce Type: new Abstract: As large language models transition from bounded generative engines to agents with expansive execution privileges, AI going out of...

    arxiv.org/abs/2605.29251 →
    Context
    This paper proposes a formal, provably secure guardrail for agents, directly addressing the core risk of autonomous AI systems going rogue. It's a major technical artifact.
    Provenance
    Article · Supporting source
  18. 18

    arXiv cs.AI - Research Science (GLOBAL)

    Article Yibing Liu, Yangze Liu, Xiaolong Yin, Bin Wang, Chong Zhang, Hao Yin, Zhongyi Han

    OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories - arXiv:2605.29253v1 Announce Type: new Abstract: Task success can hide process anomalies in real-world agent executions. An...

    arxiv.org/abs/2605.29253 →
    Context
    Introduces OpenClawBench, a large-scale dataset for measuring process-side anomalies in real agent execution. Directly addresses agentic tools and reliability.
    Provenance
    Article · Supporting source
  19. 19

    arXiv cs.AI - Research Science (GLOBAL)

    Article Shijie Cao, Yuan Yuan, Jing Liu

    Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling - arXiv:2605.29262v1 Announce Type: new Abstract: The Dynamic Flexible Job Shop Scheduling Problem...

    arxiv.org/abs/2605.29262 →
    Context
    This paper proposes an agentic framework (RACE-Sched) for dynamic scheduling, directly addressing the core tension between real-time constraints and long-horizon reasoning in industrial control systems.
    Provenance
    Article · Supporting source
  20. 20

    arXiv cs.AI - Research Science (GLOBAL)

    Article Yang Zhang, Xiukun Wei, Xueru Zhang

    When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop - arXiv:2605.29267v1 Announce Type: new Abstract: Foundation models are increasingly trained on synthetic data generated...

    arxiv.org/abs/2605.29267 →
    Context
    Addresses model collapse and alignment failure in multi-model, self-consuming training loops, directly impacting AI infrastructure and control.
    Provenance
    Article · Supporting source
  21. 21

    arXiv cs.AI - Research Science (GLOBAL)

    Article Wei Zheng, Yang Yan, Yiyang Shao, Jinyang Li, Zeze Chang, Yukuang Jia, Qiming Mao, Chihyung Wang, Jingbin Zhou

    Indexing the Unreadable: LLM-Native Recursive Construction and Search of Service Taxonomies - arXiv:2605.29270v1 Announce Type: new Abstract: The era of the Internet of Agents (IoA) is taking shape: LLM agents are...

    arxiv.org/abs/2605.29270 →
    Context
    Addresses a core infrastructure problem (context management/service discovery) for the 'Internet of Agents' (IoA), a key near-future topic.
    Provenance
    Article · Supporting source
  22. 22

    arXiv cs.AI - Research Science (GLOBAL)

    Article Vaishali Senthil, Ashutosh Hathidara, Sebastian Schreiber

    CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval - arXiv:2605.29271v1 Announce Type: new Abstract: Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user...

    arxiv.org/abs/2605.29271 →
    Context
    This paper addresses a core bottleneck for LLM agents: tool retrieval from large API catalogs. It proposes a novel co-training method (CoHyDE) that improves agent capability.
    Provenance
    Article · Supporting source
  23. 23

    arXiv cs.AI - Research Science (GLOBAL)

    Article Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng, Tong Xu, Yi Zheng, Zhefeng Wang, Enhong Chen

    Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models - arXiv:2605.29303v1 Announce Type: new Abstract: Supervised fine-tuning (SFT) followed by reinforcement...

    arxiv.org/abs/2605.29303 →
    Context
    This is a new, technical paper (arXiv) proposing a novel fine-tuning method (EKSFT) for LLMs, directly impacting model training and capability.
    Provenance
    Article · Supporting source
  24. 24

    arXiv cs.AI - Research Science (GLOBAL)

    Article Yilun Yao, Jiaming Pan, Elsie Dai, Peizhuang Cong, Yaoming Li, Tong Yang

    ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression - arXiv:2605.29350v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) language models reduce per-token computation but still require...

    arxiv.org/abs/2605.29350 →
    Context
    This is a primary artifact (arXiv paper) detailing a novel, train-free compression technique (ConMoE) for MoE models, directly impacting AI infrastructure and deployment.
    Provenance
    Article · Supporting source
  25. 25

    arXiv cs.AI - Research Science (GLOBAL)

    Article Yiqun Liu, Yingsheng Wu, Ruqi Yang, Enrong Zheng, Honglei Qiu, Sijun He, Tai Liang, Jingjing Wu, Yuhan Zhou, Yiwei Zhang, Dongyan Chen, Weihan Yi, Xinqi Li, Siqi Bao

    PassNet: Scaling Large Language Models for Graph Compiler Pass Generation - arXiv:2605.29357v1 Announce Type: new Abstract: Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream...

    arxiv.org/abs/2605.29357 →
    Context
    Addresses AI infrastructure (compilers, optimization) and the shifting craft of software engineering (LLMs for compiler passes). Primary artifact (PassNet/PassBench) with clear downstream consequence.
    Provenance
    Article · Supporting source
  26. 26

    arXiv cs.AI - Research Science (GLOBAL)

    Article Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan

    Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet - arXiv:2605.29358v1 Announce Type: new Abstract: We demonstrate that sparse autoencoders can extract interpretable features from Claude 3...

    arxiv.org/abs/2605.29358 →
    Context
    A primary artifact (arXiv paper) detailing feature extraction and interpretability from a major proprietary model (Claude 3 Sonnet). Directly addresses model internals and control.
    Provenance
    Article · Supporting source
  27. 27

    arXiv cs.AI - Research Science (GLOBAL)

    Article Zhihao Liu, Yifan Wu, Jian Lou, Di Wang, Yuxi Zhou, Yuke Hu

    Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization - arXiv:2605.29396v1 Announce Type: new Abstract: Safety alignment for large language models (LLMs) aims to reduce harmful or unsafe...

    arxiv.org/abs/2605.29396 →
    Context
    This is a primary research artifact (arXiv paper) directly addressing LLM safety and robustness, a core concern in AI infrastructure and power dynamics.
    Provenance
    Article · Supporting source
  28. 28

    arXiv cs.AI - Research Science (GLOBAL)

    Article Rahul Bissa, Abhishek Vyas, Yash Jain

    Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark - arXiv:2605.29400v1 Announce Type: new Abstract: We benchmark three supervised fine-tuned models against...

    arxiv.org/abs/2605.29400 →
    Context
    This is a primary artifact (arXiv paper) detailing a specific benchmark (PiSAR) and showing a massive performance gap between fine-tuned models and frontier zero-shot baselines. It directly impacts agentic coding/behavior prediction.
    Provenance
    Article · Supporting source
  29. 29

    arXiv cs.AI - Research Science (GLOBAL)

    Article Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen, Xinjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen

    Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation - arXiv:2605.29430v1 Announce Type: new Abstract: Automatic speech recognition (ASR) is a core component of...

    arxiv.org/abs/2605.29430 →
    Context
    Presents a new, agentic framework (Agentic ASR) for speech recognition, directly addressing the limitations of current single-pass systems. This is a primary artifact changing the developer's mental model for building AI agents.
    Provenance
    Article · Supporting source
  30. 30

    arXiv cs.AI - Research Science (GLOBAL)

    Article Zeli Su, Zhankai Xu, Tianlei Chen, Longfei Zheng, Xiaolu Zhang, Jun Zhou, Wentao Zhang

    The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF - arXiv:2605.29491v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed in...

    arxiv.org/abs/2605.29491 →
    Context
    This paper addresses a critical robustness gap in RAG/agentic systems (distractor instructions), directly impacting LLM reliability and deployment in real-world, noisy data environments.
    Provenance
    Article · Supporting source
  31. 31

    arXiv cs.AI - Research Science (GLOBAL)

    Article Kevin Wang, Anna Thöni, Benjamin Kempinski, Bobby Cheng, Jianzhu Yao, Benjamin Finch, Leon Guertler, Viraj Nadkarni, Yihan Jiang, Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov, Siyuan Wu, Yu-Chi Cheng, Yan-Ru Ju, Ti-Rong Wu, I-Hsuan Chu, Yu-Yu Yang, I-Chen Wu, Yitian Huang, Qinlu Cao, Yiheng Sun, Yuhong Dai, Hongkun Yao, Jingxuan Fu, Jiwei Zhang, Hao Liao, Mossimo Ebeling, Govind Arun, Sadhvik Bathini, Mihir S Arya, Avinash Anish, Aditya Ranjan, Kirtana Sunil Phatnani, Paval KS, Vrushali Mehta, Aravind S, Nikhil Arora, Tanya Upadhyay, Amol Bandagale, Yuan Lu, ChunEn Hsiao, YuTing Lin, Arvin Chung, Jerry John Thomas, Mathieu Laurière, Leshem Choshen, Yoram Bachrach, Pramod Viswanath, Maria Polukarov, Cheston Tan, Tal Kachman, Atlas Wang

    MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs - arXiv:2605.29512v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed as interactive agents,...

    arxiv.org/abs/2605.29512 →
    Context
    Introduces a new, comprehensive multi-agent evaluation platform (Mindgames) and dataset, directly addressing the core topic of agentic tools and power dynamics.
    Provenance
    Article · Supporting source
  32. 32

    arXiv cs.AI - Research Science (GLOBAL)

    Article Zekai Yu, Qi Meng, Qizhi Chu, Yu Hao, Chuan Shi, Cheng Yang

    ParaTool: Shifting Tool Representations from Context to Parameters - arXiv:2605.29561v1 Announce Type: new Abstract: Tool calling extends large language models (LLMs) by enabling grounded interaction with external...

    arxiv.org/abs/2605.29561 →
    Context
    This paper proposes a fundamental architectural shift for tool use in LLMs, moving from context-based documentation to parameter-based integration. This directly impacts agentic coding and LLM infrastructure.
    Provenance
    Article · Supporting source
  33. 33

    arXiv cs.AI - Research Science (GLOBAL)

    Article Kangrui Wang, Linjie Li, Zhengyuan Yang, Shiqi Chen, Zihan Wang, Li Fei-Fei, Jiajun Wu, Leonidas Guibas, Lijuan Wang, Manling Li

    Planning with the Views via Scene Self-Exploration - arXiv:2605.29563v1 Announce Type: new Abstract: Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view...

    arxiv.org/abs/2605.29563 →
    Context
    This paper details a critical planning gap in VLMs (view planning) and proposes a novel self-exploration framework. It directly addresses frontier model capabilities and 3D reasoning.
    Provenance
    Article · Supporting source
  34. 34

    arXiv cs.AI - Research Science (GLOBAL)

    Article Yang He, Xiao Ding, Bibo Cai, Yufei Zhang, Kai Xiong, Zhouhao Sun, Bing Qin, Ting Liu

    DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning - arXiv:2605.29568v1 Announce Type: new Abstract: Tool-Integrated Reasoning (TIR) extends LLM...

    arxiv.org/abs/2605.29568 →
    Context
    This is a primary artifact (arXiv paper) detailing a novel framework (DeepTool) for improving LLM reasoning and tool use via Process-Supervised RL. Directly relates to agentic coding tools and the shifting craft of software engineering.
    Provenance
    Article · Supporting source
  35. 35

    arXiv cs.AI - Research Science (GLOBAL)

    Article Silu Panda

    FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification - arXiv:2605.29586v1 Announce Type: new Abstract: We introduce FinVerBench, a benchmark and validity study for...

    arxiv.org/abs/2605.29586 →
    Context
    This introduces a new, specialized benchmark (FinVerBench) using SEC filings for LLM financial verification. It directly addresses model reliability and real-world application in finance.
    Provenance
    Article · Supporting source
  36. 36

    arXiv cs.AI - Research Science (GLOBAL)

    Article Junyoung Park, Sunghwan Park, Seongyong Ju, Jaewoo Lee

    Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures - arXiv:2605.29629v1 Announce Type: new Abstract: Attack Success Rate (ASR) evaluates each jailbreak with a single yes/no label at the...

    arxiv.org/abs/2605.29629 →
    Context
    This paper introduces a new, more granular safety evaluation metric (TLO) that moves beyond simple success/failure rates. It directly impacts how LLM safety is tested and deployed.
    Provenance
    Article · Supporting source
  37. 37

    arXiv cs.AI - Research Science (GLOBAL)

    Article Jiajie Fu, Junwen Chen, Mengzhao Wang, Aoxiang He, Maojia Sheng, Xiangyu Ke, Yifan Zhu, Yunjun Gao

    VikingMem: A Memory Base Management System for Stateful LLM-based Applications - arXiv:2605.29640v1 Announce Type: new Abstract: Large Language Models have revolutionized interactive applications; however, their finite...

    arxiv.org/abs/2605.29640 →
    Context
    Addresses the critical technical challenge of state management and long-term memory for LLM applications, a core topic for agentic tools and software engineering.
    Provenance
    Article · Supporting source
  38. 38

    arXiv cs.AI - Research Science (GLOBAL)

    Article Elliot Gestrin, Jendrik Seipp

    LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning - arXiv:2605.29649v1 Announce Type: new Abstract: Heuristic search is the dominant paradigm in symbolic AI planning, and the strongest heuristics are...

    arxiv.org/abs/2605.29649 →
    Context
    This paper uses LLMs to evolve domain-independent heuristics for symbolic AI planning, which bears on the 'agentic coding tools' and 'shifting craft of software engineering' topics.
    Provenance
    Article · Supporting source
  39. 39

    arXiv cs.AI - Research Science (GLOBAL)

    Article Johannes Moll, Jean-Philippe Corbeil, Jiazhen Pan, Martin Hadamitzky, Daniel Rueckert, Lisa Adams, Keno Bressem

    GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents - arXiv:2605.29668v1 Announce Type: new Abstract: LLM agents acting in structured environments fail in operational rather than conversational...

    arxiv.org/abs/2605.29668 →
    Context
    This paper introduces GRASP, a method for reliable self-improvement in LLM agents by preventing catastrophic forgetting (regression). This is a core technical advance in agentic systems.
    Provenance
    Article · Supporting source
  40. 40

    arXiv cs.AI - Research Science (GLOBAL)

    Article Lorenz Kutschka, Bernhard Geiger

    Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems - arXiv:2605.29676v1 Announce Type: new Abstract: Large language models in Agentic AI systems consume tool schemas and execution...

    arxiv.org/abs/2605.29676 →
    Context
    This paper directly addresses the infrastructure and efficiency of agentic AI systems by proposing and benchmarking token-optimized data formats (TOON, TRON) to replace JSON for tool schemas and execution results.
    Provenance
    Article · Supporting source
  41. 41

    arXiv cs.AI - Research Science (GLOBAL)

    Article Pedro Orvalho, Marta Kwiatkowska, Guillem Alenyà, Felip Manyà

    Reliable Reasoning with Large Language Models via Preference-Based Maximum Satisfiability - arXiv:2605.29687v1 Announce Type: new Abstract: Large Language Models (LLMs) excel at understanding natural language but...

    arxiv.org/abs/2605.29687 →
    Context
    This paper proposes a verifiable, structured approach (MaxSAT) for LLMs to solve complex optimization problems, directly addressing LLM reliability and capability in constrained domains like robotics.
    Provenance
    Article · Supporting source
  42. 42

    arXiv cs.AI - Research Science (GLOBAL)

    Article Yuchen Liu, Yingjie Feng, Lixiong Qin, Jiasi Chen, Jianing Yu, Sheng Gao, Sheng Yang, Weiran Xu

    Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling - arXiv:2605.29697v1 Announce Type: new Abstract: In Agentic Search, trajectory-level outcome rewards fail to quantify the...

    arxiv.org/abs/2605.29697 →
    Context
    This is a new arXiv paper proposing a novel step-level reward mechanism (GDCR/SAPO) for agentic search, directly addressing core challenges in agentic AI.
    Provenance
    Article · Supporting source
  43. 43

    arXiv cs.AI - Research Science (GLOBAL)

    Article Mincheol Kang, Hyunjin Lim, Bomin Kang, Daehee Park

    BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices - arXiv:2605.29705v1 Announce Type: new Abstract: Trajectory prediction is a fundamental task for autonomous systems, requiring complex...

    arxiv.org/abs/2605.29705 →
    Context
    New research (arXiv) on deploying complex LLM reasoning (trajectory prediction) to resource-constrained edge devices. Directly impacts autonomous systems and AI infrastructure.
    Provenance
    Article · Supporting source
  44. 44

    arXiv cs.AI - Research Science (GLOBAL)

    Article Shuaidi Wang, Zhan Zhuang, Ruping Huang, Yu Zhang

    NaRA: Noise-Aware LoRA for Parameter-Efficient Fine-Tuning of Diffusion LLMs - arXiv:2605.29716v1 Announce Type: new Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising non-autoregressive...

    arxiv.org/abs/2605.29716 →
    Context
    This is a primary artifact (arXiv paper) detailing a new PEFT method (NaRA) specifically for Diffusion LLMs (dLLMs), improving code generation and reasoning.
    Provenance
    Article · Supporting source
  45. 45

    arXiv cs.AI - Research Science (GLOBAL)

    Article Yeong-Joon Ju, Seong-Whan Lee

    Citation-Closure Retrieval and Per-Rule Attribution for Real-World Regulatory Compliance Question Answering - arXiv:2605.29742v1 Announce Type: new Abstract: Deploying Large Language Models (LLMs) for regulatory...

    arxiv.org/abs/2605.29742 →
    Context
    Addresses regulatory compliance and traceability for LLMs, directly impacting legal/policy use cases (HIPAA, national regulations). Provides a new benchmark (RegOps-Bench) and framework (RefWalk).
    Provenance
    Article · Supporting source
  46. 46

    arXiv cs.AI - Research Science (GLOBAL)

    Article Yanan Wang, Shuaicong Hu, Jian Liu, Guohui Zhou, Aiguo Wang, Cuiwei Yang

    Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence - arXiv:2605.29744v1 Announce Type: new Abstract: The impressive performance of generalist large language...

    arxiv.org/abs/2605.29744 →
    Context
    This paper addresses the architecture of medical AI, focusing on multi-agent systems and the synergy between generalist and specialist models. This is core to the 'where intelligence is built' and 'power dynamics' themes.
    Provenance
    Article · Supporting source
  47. 47

    arXiv cs.AI - Research Science (GLOBAL)

    Article Omar Benjelloun, Leonardo Martins Bianco, Isabelle Guyon, Thanh Gia Hieu Khuong, Jonathan Lebensold, Sebastian Lobentanzer, Luis Oala, Benedictus Kent Rachmat, Ihsan Ullah, Peyman Vahidi, Joaquin Vanschoren

    Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations - arXiv:2605.29786v1 Announce Type: new Abstract: Reproducibility is fundamental to the scientific method, yet remains a critical...

    arxiv.org/abs/2605.29786 →
    Context
    Introduces a formal, machine-actionable metadata standard (Croissant Tasks) for conceptual reproducibility, directly impacting ML evaluation and agentic development.
    Provenance
    Article · Supporting source
  48. 48

    arXiv cs.AI - Research Science (GLOBAL)

    Article Ashutosh Ojha, Vinay Aggarwal, Ashutosh Srivastava, Siddharth Yedlapati, Yaman K Singla, Jitendra Ajmera

    MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains - arXiv:2605.29795v1 Announce Type: new Abstract: Real-world tasks often lack large labeled datasets, motivating extensive work on learning in low-data...

    arxiv.org/abs/2605.29795 →
    Context
    Presents a novel agentic framework (MEMENTO) that treats the web as a learning signal, directly addressing agentic coding/practice and AI infrastructure.
    Provenance
    Article · Supporting source
  49. 49

    arXiv cs.AI - Research Science (GLOBAL)

    Article Yunbo Tang, Chengyi Yang, Shiyu Liu, Zhishang Xiang, Zerui Chen, Qinggang Zhang, Jinsong Su

    SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search - arXiv:2605.29796v1 Announce Type: new Abstract: Agentic search enables LLMs to solve complex multi-hop questions through iterative...

    arxiv.org/abs/2605.29796 →
    Context
    Presents a novel RL framework (SAAS) to solve a critical, practical limitation (over-search) in agentic search, directly impacting agentic coding/tools.
    Provenance
    Article · Supporting source
  50. 50

    arXiv cs.AI - Research Science (GLOBAL)

    Article Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang, Guanxu Chen, Yuejin Xie, Qinghua Mao, Wanying Qu, Yanxu Zhu, Tianyi Zhou, Leitao Yuan, Zhijie Zheng, Qihao Lin, Yimin Wang, Haoyu Luo, Shuai Shao, Chen Qian, Qingyu Liu, Ling Tang, Ruiyang Qin, Qihan Ren, Junxiao Yang, Kun Wang, Zhiheng Xi, Linfeng Zhang, Ranjie Duan, Bo Zhang, Wenjie Wang, Wen Shen, Qiaosheng Zhang, Yan Teng, Chaochao Lu, Rui Mei, Man Li, Jialing Tao, Xi Lin, Tianhang Zheng, Yong Liu, Quanshi Zhang, Lei Zhu, Xingjun Ma, Junhua Liu, Hui Xue, Xiaoxiang Zuo, Xiangnan He, Chao Shen, Xianglong Liu, Minlie Huang, Jing Shao, Xia Hu

    AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security - arXiv:2605.29801v1 Announce Type: new Abstract: Modern open-world agents such as OpenClaw exhibit powerful...

    arxiv.org/abs/2605.29801 →
    Context
    This paper introduces a new, lightweight, and scalable alignment framework (AgentDoG 1.5) for AI agents, directly addressing safety risks in advanced agentic systems. It is a primary artifact with clear downstream consequence for agent deployment.
    Provenance
    Article · Supporting source
  51. 51

    arXiv cs.AI - Research Science (GLOBAL)

    Article Krzysztof Żurawicki, Julia Farganus, Arkadiusz Gaweł, Mateusz Bystroński, Tomasz Jan Kajdanowicz

    PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing - arXiv:2605.29815v1 Announce Type: new Abstract: The growing number of submitted papers has motivated the exploration of Large Language Models...

    arxiv.org/abs/2605.29815 →
    Context
    This paper introduces a benchmark (PRAIB) and empirical study on LLM review behavior, directly impacting the reliability and deployment of AI in academic/scientific processes.
    Provenance
    Article · Supporting source
  52. 52

    arXiv cs.AI - Research Science (GLOBAL)

    Article Haochen Yang, Ke Zhao, Mengyuan Ma, Xingyu Lu, Xiangfeng Wang, Hong Qian

    OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation - arXiv:2605.29829v1 Announce Type: new Abstract: Leveraging Large Language Models (LLMs) to automatically...

    arxiv.org/abs/2605.29829 →
    Context
    This is a primary artifact (arXiv paper/tool) detailing a new agentic system (OptSkills) for solving optimization problems using LLMs, directly addressing generalization and skill learning.
    Provenance
    Article · Supporting source
  53. 53

    arXiv cs.AI - Research Science (GLOBAL)

    Article Minyang Hu, Bo Yang, Zhinuo Zhou, Jiachen Liang, Guo Jiahao, Yiyang Yin, Xiongwei Han

    Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories - arXiv:2605.29893v1 Announce Type: new Abstract: LLM-based agents have demonstrated strong capabilities in solving complex tasks...

    arxiv.org/abs/2605.29893 →
    Context
    Introduces a new benchmark (RedundancyBench) and research area for evaluating agent efficiency, directly impacting agentic coding tools and practice.
    Provenance
    Article · Supporting source
  54. 54

    arXiv cs.AI - Research Science (GLOBAL)

    Article Toru Takahashi

    Toward AI Systems That Understand Self and Others: A Multi-Phase Inference Framework for Human Cognitive Diversity and World-Model Alignment - arXiv:2605.29930v1 Announce Type: new Abstract: Mutual misunderstanding in...

    arxiv.org/abs/2605.29930 →
    Context
    Addresses core AI alignment and world-model issues, directly impacting how AI systems interact with human cognitive diversity and social reality.
    Provenance
    Article · Supporting source
  55. 55

    arXiv cs.AI - Research Science (GLOBAL)

    Article Ahmad Rammal, Niket Patel, Fabian Gloeckle, Amaury Hayat, Julia Kempe, Remi Munos, Charles Arnal, Vivien Cabannes

    Formalizing Mathematics at Scale - arXiv:2605.29955v1 Announce Type: new Abstract: We present AutoformBot, a multi-agent system for building an Autoformalized Textbook Library At Scale (Atlas) in Lean 4. AutoformBot...

    arxiv.org/abs/2605.29955 →
    Context
    This describes a multi-agent system (AutoformBot) for autoformalizing complex mathematics (Lean 4) at scale. It's a major artifact demonstrating AI's capability to automate high-level, verifiable knowledge creation, impacting science and education.
    Provenance
    Article · Supporting source
  56. 56

    arXiv cs.AI - Research Science (GLOBAL)

    Article Kun Feng, Ziwei Shan, Yuchen Fang, Yiyang Tan, Sihan Lu, Shuqi Gu, Lintao Ma, Xingyu Lu, Kan Ren

    KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning - arXiv:2605.30002v1 Announce Type: new Abstract: Cross-domain multimodal time series forecasting is a challenging task, requiring models to...

    arxiv.org/abs/2605.30002 →
    Details
    Excerpt
    KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning - arXiv:2605.30002v1 Announce Type: new Abstract: Cross-domain multimodal time series forecasting is a challenging task, requiring models to...
    Context
    This is a primary artifact (arXiv paper) detailing a novel agentic framework for time series forecasting, directly addressing the intersection of LLMs, agents, and specialized AI infrastructure.
    Key points
    • This is a primary artifact (arXiv paper) detailing a novel agentic framework for time series forecasting, directly addressing the intersection of LLMs, agents, and specialized AI infrastructure.
    Provenance
    Article · Supporting source
  57. 57

    arXiv cs.AI - Research Science (GLOBAL)

    Article Zhen Chen, Yibing Liu, Weihao Xie, Yu Liang, Peilin Chen, Shiqi Wang

    RAISE: RAG Design as an Architecture Search Problem - arXiv:2605.30029v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) systems expose numerous design choices spanning query rewriting, chunking,...

    arxiv.org/abs/2605.30029 →
    Details
    Excerpt
    RAISE: RAG Design as an Architecture Search Problem - arXiv:2605.30029v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) systems expose numerous design choices spanning query rewriting, chunking,...
    Context
    Introduces a new, comprehensive framework (RAISE) and benchmark for RAG optimization, directly addressing systematic challenges in AI architecture design.
    Key points
    • Introduces a new, comprehensive framework (RAISE) and benchmark for RAG optimization, directly addressing systematic challenges in AI architecture design.
    Provenance
    Article · Supporting source
  58. 58

    arXiv cs.AI - Research Science (GLOBAL)

    Article Tong Ye, Hang Yu, Tengfei Ma, Xuhong Zhang, Jianguo Li, Peng Di, Peiyu Liu, Jianwei Yin, Wenhai Wang

    Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning - arXiv:2605.30039v1 Announce Type: new Abstract: Large Language Models have demonstrated remarkable progress in general-purpose...

    arxiv.org/abs/2605.30039 →
    Details
    Excerpt
    Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning - arXiv:2605.30039v1 Announce Type: new Abstract: Large Language Models have demonstrated remarkable progress in general-purpose...
    Context
    This paper introduces a new paradigm (DOMINO) for synthesizing domain-specific data for LLMs using only reference examples, bypassing manual prompt engineering. This directly impacts LLM training and application.
    Key points
    • This paper introduces a new paradigm (DOMINO) for synthesizing domain-specific data for LLMs using only reference examples, bypassing manual prompt engineering. This directly impacts LLM training and application.
    Provenance
    Article · Supporting source
  59. 59

    arXiv cs.AI - Research Science (GLOBAL)

    Article Geremy Loachamín-Suntaxi, Robert Lazar, Dimitrios G. Giovanis, Ioannis G. Kevrekidis, Eleni D. Koronaki

    Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection - arXiv:2605.30042v1 Announce Type: new Abstract: Automating scientific computing workflows...

    arxiv.org/abs/2605.30042 →
    Details
    Excerpt
    Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection - arXiv:2605.30042v1 Announce Type: new Abstract: Automating scientific computing workflows...
    Context
    This describes a multi-agent system for scientific computing that addresses semantic drift and action-outcome fidelity, directly impacting agentic coding and AI infrastructure.
    Key points
    • This describes a multi-agent system for scientific computing that addresses semantic drift and action-outcome fidelity, directly impacting agentic coding and AI infrastructure.
    Provenance
    Article · Supporting source
  60. 60

    arXiv cs.AI - Research Science (GLOBAL)

    Article Matt Y. Cheung, Ashok Veeraraghavan, Hanjie Chen, Guha Balakrishnan

    Conformal Certification of Reasoning Trace Prefixes - arXiv:2605.30085v1 Announce Type: new Abstract: Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a...

    arxiv.org/abs/2605.30085 →
    Details
    Excerpt
    Conformal Certification of Reasoning Trace Prefixes - arXiv:2605.30085v1 Announce Type: new Abstract: Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a...
    Context
    This paper introduces CROP, a method for certifying valid intermediate reasoning prefixes in LLMs. This directly addresses the reliability and process supervision of AI reasoning, which is core to agentic tools and frontier model safety.
    Key points
    • This paper introduces CROP, a method for certifying valid intermediate reasoning prefixes in LLMs. This directly addresses the reliability and process supervision of AI reasoning, which is core to agentic tools and frontier model safety.
    Provenance
    Article · Supporting source
  61. 61

    arXiv cs.AI - Research Science (GLOBAL)

    Article Tiancheng Yang, Matthias Schonlau, Ilia Sucholutsky

    Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison - arXiv:2605.30087v1 Announce Type: new Abstract: Emerging personal AI agents are moving toward persistent,...

    arxiv.org/abs/2605.30087 →
    Details
    Excerpt
    Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison - arXiv:2605.30087v1 Announce Type: new Abstract: Emerging personal AI agents are moving toward persistent,...
    Context
    This paper introduces a diagnostic testbed and a method comparison for question answering over conflicting personal AI memory, directly addressing conflict resolution and selective QA, which is core to agentic systems.
    Key points
    • This paper introduces a diagnostic testbed and a method comparison for question answering over conflicting personal AI memory, directly addressing conflict resolution and selective QA, which is core to agentic systems.
    Provenance
    Article · Supporting source
  62. 62

    arXiv cs.AI - Research Science (GLOBAL)

    Article Hongxiang Zhang, Yuan Tian, Tianyi Zhang

    Enhancing Multi-Agent Communication through Attention Steering with Context Relevance - arXiv:2605.30136v1 Announce Type: new Abstract: LLM-based multi-agent systems have demonstrated remarkable performance on complex...

    arxiv.org/abs/2605.30136 →
    Details
    Excerpt
    Enhancing Multi-Agent Communication through Attention Steering with Context Relevance - arXiv:2605.30136v1 Announce Type: new Abstract: LLM-based multi-agent systems have demonstrated remarkable performance on complex...
    Context
    This is a new arXiv paper detailing a technical improvement (Agent-Radar) for multi-agent systems, directly addressing context management and performance degradation in complex AI applications.
    Key points
    • This is a new arXiv paper detailing a technical improvement (Agent-Radar) for multi-agent systems, directly addressing context management and performance degradation in complex AI applications.
    Provenance
    Article · Supporting source
  63. 63

    arXiv cs.AI - Research Science (GLOBAL)

    Article Ziyan Liu, Zhezheng Hao, Yeqiu Chen, Hong Wang, Jingren Hou, Ruiyi Ding, Yongkang Yang, Wence Ji, Wei Xia, Feng Liu

    Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents - arXiv:2605.30159v1 Announce Type: new Abstract: Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing...

    arxiv.org/abs/2605.30159 →
    Details
    Excerpt
    Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents - arXiv:2605.30159v1 Announce Type: new Abstract: Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing...
    Context
    This paper introduces a novel optimization method (MMPO) for long-horizon LLM agents, directly addressing memory degradation and belief deviation. This is a core technical advance in agentic AI.
    Key points
    • This paper introduces a novel optimization method (MMPO) for long-horizon LLM agents, directly addressing memory degradation and belief deviation. This is a core technical advance in agentic AI.
    Provenance
    Article · Supporting source
  64. 64

    arXiv cs.AI - Research Science (GLOBAL)

    Article Caleb DeLeeuw

    BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders - arXiv:2605.30162v1 Announce Type: new Abstract: Biosecurity evaluations of language models typically ask...

    arxiv.org/abs/2605.30162 →
    Details
    Excerpt
    BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders - arXiv:2605.30162v1 Announce Type: new Abstract: Biosecurity evaluations of language models typically ask...
    Context
    Directly addresses model safety, refusal mechanisms, and internal auditing (SAE), which is critical to the power dynamics and reliability of frontier models.
    Key points
    • Directly addresses model safety, refusal mechanisms, and internal auditing (SAE), which is critical to the power dynamics and reliability of frontier models.
    Provenance
    Article · Supporting source
  65. 65

    arXiv cs.AI - Research Science (GLOBAL)

    Article Haoming Xu, Weihong Xu, Zongrui Li, Mengru Wang, Yunzhi Yao, Chiyu Wu, Jin Shang, Yu Gong, Shumin Deng

    When Should Models Change Their Minds? Contextual Belief Management in Large Language Models - arXiv:2605.30219v1 Announce Type: new Abstract: Long-horizon interactions require language models to manage accumulating...

    arxiv.org/abs/2605.30219 →
    Details
    Excerpt
    When Should Models Change Their Minds? Contextual Belief Management in Large Language Models - arXiv:2605.30219v1 Announce Type: new Abstract: Long-horizon interactions require language models to manage accumulating...
    Context
    This is a new arXiv paper introducing a formal benchmark (BeliefTrack) and methods (RL, representation steering) for managing LLM belief states, directly impacting model reliability and capability.
    Key points
    • This is a new arXiv paper introducing a formal benchmark (BeliefTrack) and methods (RL, representation steering) for managing LLM belief states, directly impacting model reliability and capability.
    Provenance
    Article · Supporting source
  66. 66

    arXiv cs.AI - Research Science (GLOBAL)

    Article A. J. Lew (Unreasonable Labs), Y. Cao (Unreasonable Labs), M. J. Buehler (Unreasonable Labs)

    ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure - arXiv:2605.30284v1 Announce Type: new Abstract: Scientific discovery is an inherently creative and...

    arxiv.org/abs/2605.30284 →
    Details
    Excerpt
    ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure - arXiv:2605.30284v1 Announce Type: new Abstract: Scientific discovery is an inherently creative and...
    Context
    A new benchmark (ProjectionBench) for evaluating LLMs on scientific hypothesis generation and discovery. This directly addresses the 'AI scientist/co-scientist' frontier and model capabilities.
    Key points
    • A new benchmark (ProjectionBench) for evaluating LLMs on scientific hypothesis generation and discovery. This directly addresses the 'AI scientist/co-scientist' frontier and model capabilities.
    Provenance
    Article · Supporting source
  67. 67

    arXiv cs.AI - Research Science (GLOBAL)

    Article Haowen Wang, Yaxin Du, Jian Yang, Jiajun Wu, Shukai Liu, Yuxuan Zhang, Pingjie Wang, Siheng Chen, Tuney Zheng, Ming Zhou, Xianglong Liu

    MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection - arXiv:2605.30288v1 Announce Type: new Abstract: Mid-training has become an important stage in modern LLM development, using large-scale curated...

    arxiv.org/abs/2605.30288 →
    Details
    Excerpt
    MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection - arXiv:2605.30288v1 Announce Type: new Abstract: Mid-training has become an important stage in modern LLM development, using large-scale curated...
    Context
    This paper introduces MIRA, a source-aware data selection framework for mid-training LLMs. It directly addresses the core technical challenge of data curation and model capability enhancement.
    Key points
    • This paper introduces MIRA, a source-aware data selection framework for mid-training LLMs. It directly addresses the core technical challenge of data curation and model capability enhancement.
    Provenance
    Article · Supporting source
  68. 68

    arXiv cs.AI - Research Science (GLOBAL)

    Article Yalun Dai, Yangyu Huang, Tongshen Yang, Yonghan Wang, Xin Zhang, Wenshan Wu, Qihao Zhao, Hao Li, Yuanyuan Gao, Kim-Hui Yap, Scarlett Li

    Demystifying Data Organization for Enhanced LLM Training - arXiv:2605.30334v1 Announce Type: new Abstract: Large Language Models (LLMs) have revolutionized various fields, yet their training efficiency is heavily...

    arxiv.org/abs/2605.30334 →
    Details
    Excerpt
    Demystifying Data Organization for Enhanced LLM Training - arXiv:2605.30334v1 Announce Type: new Abstract: Large Language Models (LLMs) have revolutionized various fields, yet their training efficiency is heavily...
    Context
    This paper proposes novel data ordering methods (STR, SAW) and guidelines for optimizing LLM training data organization, directly impacting training efficiency and model performance.
    Key points
    • This paper proposes novel data ordering methods (STR, SAW) and guidelines for optimizing LLM training data organization, directly impacting training efficiency and model performance.
    Provenance
    Article · Supporting source
  69. 69

    arXiv cs.AI - Research Science (GLOBAL)

    Article Anany Kotawala

    Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents - arXiv:2605.30335v1 Announce Type: new Abstract: Multi-component LLM agents assemble probabilistic claims from...

    arxiv.org/abs/2605.30335 →
    Details
    Excerpt
    Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents - arXiv:2605.30335v1 Announce Type: new Abstract: Multi-component LLM agents assemble probabilistic claims from...
    Context
    This paper bounds a fundamental failure mode (compositional incoherence) in multi-component LLM agents, directly impacting agent reliability.
    Key points
    • This paper bounds a fundamental failure mode (compositional incoherence) in multi-component LLM agents, directly impacting agent reliability.
    Provenance
    Article · Supporting source
  70. 70

    @michpokrass (Michelle Pokrass)

    X michpokrass

    we shipped a new version of gpt-5.5 instant today. the previous model was too bullet pilled. the new one improves on some other important dimensions: sycophancy, factuality, and multilingual performance. hope you'll…

    x.com/michpokrass/status/2060219759682330970 →
    Details
    Excerpt
    we shipped a new version of gpt-5.5 instant today. the previous model was too bullet pilled. the new one improves on some other important dimensions: sycophancy, factuality, and multilingual performance. hope you'll…
    Context
    Reports a primary artifact (new model release) and directly relates to the near-future of AI and frontier models.
    Key points
    • Reports a primary artifact (new model release) and directly relates to the near-future of AI and frontier models.
    Provenance
    Tweet · Primary source
  71. 71

    @trengriffin (Tren Griffin)

    X trengriffin

    Did the CNN reporter call Microsoft to confirm the claim? Nope. Microsoft switching from Claude code to GitHub Copilot (both with Opus 4.7 paid for by enterprise API usage) enables dogfooding of the GHCP harness so…

    x.com/trengriffin/status/2060220238147551244 →
    Details
    Excerpt
    Did the CNN reporter call Microsoft to confirm the claim? Nope. Microsoft switching from Claude code to GitHub Copilot (both with Opus 4.7 paid for by enterprise API usage) enables dogfooding of the GHCP harness so…
    Context
    Discusses a major shift in AI tooling (Claude to Copilot) and the underlying business/infrastructure dynamics (enterprise API usage, dogfooding), which is central to the podcast's focus.
    Key points
    • Discusses a major shift in AI tooling (Claude to Copilot) and the underlying business/infrastructure dynamics (enterprise API usage, dogfooding), which is central to the podcast's focus.
    Provenance
    Tweet · Primary source
  72. 72

    @badlogicgames (Mario Zechner)

    X badlogicgames

    pibot is now running fully local, using parakeet for STT, qwen3-tts for TTS, and Qwen 3.6 as the local multi-modal LLM via llama.cpp. The STT and TTS inference engines are Rust/mlx-c based. Ported from Python. So, zero…

    x.com/badlogicgames/status/2060268257739677… →
    Details
    Excerpt
    pibot is now running fully local, using parakeet for STT, qwen3-tts for TTS, and Qwen 3.6 as the local multi-modal LLM via llama.cpp. The STT and TTS inference engines are Rust/mlx-c based. Ported from Python. So, zero…
    Context
    Reports a specific, working technical artifact (pibot) and its fully local stack (STT and TTS engines in Rust/mlx-c, Qwen 3.6 via llama.cpp), directly relevant to AI infrastructure and tools.
    Key points
    • Reports a specific, working technical artifact (pibot) and its fully local stack (STT and TTS engines in Rust/mlx-c, Qwen 3.6 via llama.cpp), directly relevant to AI infrastructure and tools.
    Provenance
    Tweet · Primary source
  74. 74

    Axios - Industry Adjacent (US)

    Article Maria Curi

    Inside the Democratic resistance on AI - Progressive Democrats taking hardline positions against AI are getting louder. Why it matters: Five influential progressives are shaping a confrontational Democratic message on...

    www.axios.com/2026/05/29/inside-democratic-… →
    Details
    Excerpt
    Inside the Democratic resistance on AI - Progressive Democrats taking hardline positions against AI are getting louder. Why it matters: Five influential progressives are shaping a confrontational Democratic message on...
    Context
    Details specific policy proposals (moratoriums, taxes, labor protections) and political power dynamics (Sanders, AOC, Warren) shaping AI regulation and control.
    Key points
    • Details specific policy proposals (moratoriums, taxes, labor protections) and political power dynamics (Sanders, AOC, Warren) shaping AI regulation and control.
    Provenance
    Article · Supporting source
  75. 75

    Axios - Industry Adjacent (US)

    Article Zachary Basu

    "The pitchforks are here": Billionaires work to contain AI's populist revolt - America's billionaires are developing their own prescriptions for AI-fueled inequality, anxious to defuse a populist revolt aimed at their...

    www.axios.com/2026/05/29/ai-billionaires-te… →
    Details
    Excerpt
    "The pitchforks are here": Billionaires work to contain AI's populist revolt - America's billionaires are developing their own prescriptions for AI-fueled inequality, anxious to defuse a populist revolt aimed at their...
    Context
    Directly addresses power dynamics, wealth concentration, and policy/regulation (wealth tax, data centers) shaping AI's future.
    Key points
    • Directly addresses power dynamics, wealth concentration, and policy/regulation (wealth tax, data centers) shaping AI's future.
    Provenance
    Article · Supporting source
  76. 76

    Real-time LLM Inference on Standard GPUs: 3k tokens/s per request — 67 pts · 38 comments

    Article NicoConstant

    https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/ · @mungoman2: This looks very interesting. Possible to get those rates without exotic hardware. But I have to say that the…

    blog.kog.ai/real-time-llm-inference-on-stan… →
    Details
    Excerpt
    https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/ · @mungoman2: This looks very interesting. Possible to get those rates without exotic hardware. But I have to say that the…
    Context
    Directly addresses AI infrastructure (inference, GPUs) and model performance, which is central to the podcast topic.
    Key points
    • Directly addresses AI infrastructure (inference, GPUs) and model performance, which is central to the podcast topic.
    Provenance
    Article · Supporting source
  77. 77

    Techmeme - Industry Adjacent (US)

    Article

    OpenAI says it has briefed the White House on its new biodefense program, which uses GPT-Rosalind to help develop biodefense and pandemic preparedness tools (Maria Curi/Axios) - Maria Curi / Axios : OpenAI says it has...

    www.techmeme.com/260529/p13 →
    Details
    Excerpt
    OpenAI says it has briefed the White House on its new biodefense program, which uses GPT-Rosalind to help develop biodefense and pandemic preparedness tools (Maria Curi/Axios) - Maria Curi / Axios : OpenAI says it has...
    Context
    Directly addresses the intersection of frontier AI (GPT-Rosalind) and critical infrastructure/policy (biodefense/White House), fitting the power dynamics theme.
    Key points
    • Directly addresses the intersection of frontier AI (GPT-Rosalind) and critical infrastructure/policy (biodefense/White House), fitting the power dynamics theme.
    Provenance
    Article · Supporting source
  78. 78

    TechCrunch AI - Media Culture (US)

    Article Kate Park

    This chip startup just raised $135M on a bet that AI’s biggest bottleneck isn’t compute — it’s memory - South Korean chip startup XCENA is betting that AI's real bottleneck is not compute, but memory.

    techcrunch.com/2026/05/29/xcena-secures-135… →
    Details
    Excerpt
    This chip startup just raised $135M on a bet that AI’s biggest bottleneck isn’t compute — it’s memory - South Korean chip startup XCENA is betting that AI's real bottleneck is not compute, but memory.
    Context
    Directly addresses AI infrastructure bottlenecks (memory rather than compute), a core topic. The $135M raise adds market/capital dynamics.
    Key points
    • Directly addresses AI infrastructure bottlenecks (memory rather than compute), a core topic. The $135M raise adds market/capital dynamics.
    Provenance
    Article · Supporting source
  79. 79

    Techmeme - Industry Adjacent (US)

    Article

    Former Tesla data labelers say FSD relies on laborious mapping for hazards; crash data analysis shows Tesla exaggerates FSD's safety via flawed methodology (Reuters) - Reuters : Former Tesla data labelers say FSD...

    www.techmeme.com/260529/p16 →
    Details
    Excerpt
    Former Tesla data labelers say FSD relies on laborious mapping for hazards; crash data analysis shows Tesla exaggerates FSD's safety via flawed methodology (Reuters) - Reuters : Former Tesla data labelers say FSD...
    Context
    Directly challenges Tesla's safety claims and methodology for FSD, impacting public trust, regulation, and the viability of autonomous systems.
    Key points
    • Directly challenges Tesla's safety claims and methodology for FSD, impacting public trust, regulation, and the viability of autonomous systems.
    Provenance
    Article · Supporting source
  80. 80

    @emollick (Ethan Mollick)

    X emollick

    Reconstructing software engineering around AI is going to take work (even as the ability of AI to code increases at a rapid rate). Organizations are ideally spending tokens for two things: 1) building stuff 2)…

    x.com/emollick/status/2060357604044358108 →
    Details
    Excerpt
    Reconstructing software engineering around AI is going to take work (even as the ability of AI to code increases at a rapid rate). Organizations are ideally spending tokens for two things: 1) building stuff 2)…
    Context
    Directly addresses the shifting craft of software engineering and the need for organizational investment in AI-assisted development and experimentation.
    Key points
    • Directly addresses the shifting craft of software engineering and the need for organizational investment in AI-assisted development and experimentation.
    Provenance
    Tweet · Primary source