◆ Dispatch 005 · 2026-05-13

The Agent Now Watches the Agent

2026-05-13 / 00:14:35 / 10 sources

“Once the trace becomes material for the next run, observability stops being a dashboard and becomes part of the agent's workspace.”
— Lenar Kess, today's narration

Chapters

00:00:00 Transcript

Sources

10 cited

1
Harrison Chase announces LangSmith Engine

Thread Harrison Chase — LangChain cofounder announcing LangSmith Engine during Interrupt.

LangSmith Engine is an agent that sits on top of your traces
x.com/hwchase17/status/2054657397902455060 →
Details
Cited text
LangSmith Engine is an agent that sits on top of your traces

Context
It made the episode's main question concrete: what happens when the agent starts inspecting and improving the agent loop.
Key points
Engine runs in the background over LangSmith traces.
It identifies issues and suggests code changes or evaluators.
The announcement positions traces as input to agent improvement, not only debugging evidence.
Provenance
Thread · Primary source
2
LangSmith Sandboxes are Generally Available

Article Mukhil Loganathan — LangChain author announcing GA for LangSmith Sandboxes.

Each sandbox is a hardware-virtualized micro virtual machine
www.langchain.com/blog/langsmith-sandboxes-… →
Details
Cited text
Each sandbox is a hardware-virtualized micro virtual machine

Context
It supplied the execution-containment half of the LangChain bundle and gave Halek concrete operator controls to discuss.
Key points
GA adds micro virtual machine isolation, snapshots, cheap forks, blueprints, service URLs, a CLI, creator-private defaults, and an auth proxy.
The post argues that containers and eval boundaries are insufficient for untrusted agent code.
Future work includes local-to-cloud agents, shared volumes, and process/network tracing inside the VM.
Provenance
Article · Supporting source
3
LangChain announces Managed Deep Agents

Thread LangChain — Company account announcing managed deployment for Deep Agents.

Harness, context, and code execution
x.com/LangChain/status/2054684227053162957 →
Details
Cited text
Harness, context, and code execution

Context
It let the episode frame agent products as managed work surfaces rather than single model calls.
Key points
Managed Deep Agents promises harness, context, and code execution as managed pieces.
The pitch is deployment with a single line of code.
The thread drew operator questions about execution limits and manageability.
Provenance
Thread · Primary source
4
Use the Claude Agent SDK with your Claude plan

Source Anthropic — Claude Help Center article explaining Agent SDK monthly credits.

Starting June 15, 2026
support.claude.com/en/articles/15036540-use… →
Details
Cited text
Starting June 15, 2026

Context
It gave the episode a hard pricing boundary for unattended agent loops.
Key points
Claude Agent SDK and claude -p usage move out of the normal subscription limits on June 15, 2026.
Eligible plans get separate per-user monthly credits from $20 to $200 depending on plan type.
After credits are used, requests either move to extra usage at standard API rates or stop if extra usage is disabled.
Provenance
Source · Background source
5
I'm cooked. Anthropic just split --print mode to $/mo credits

Article raedyohed — ClaudeAI user describing how the Agent SDK credit change affects an autonomous Kanban-agent project.

all jobs launched using "--print" will get billed
www.reddit.com/r/ClaudeAI/comments/1tcetsd/… →
Details
Cited text
all jobs launched using "--print" will get billed

Context
It showed how the policy change hits actual agent-harness builders rather than only billing pages.
Key points
The author built an unattended Claude Code orchestration concept around claude -p.
They read the new policy as closing an API-like control path inside a subscription.
Community replies debated workarounds and whether the dependency was sustainable.
Provenance
Article · Supporting source
6
Open-source, self-updating wiki for your codebase

Article ElectronicUnit6303 — ClaudeAI poster announcing Almanac, a local markdown wiki for agent codebase context.

re-explaining the same codebase context
www.reddit.com/r/ClaudeAI/comments/1tcjv9b/… →
Details
Cited text
re-explaining the same codebase context

Context
It gave the episode a smaller, repo-level memory tool to contrast with trace-driven observability.
Key points
Almanac stores codebase history and agent-conversation context as local markdown.
The examples are architectural facts that code alone does not preserve well.
The project is open source and Mac-only for now.
Provenance
Article · Supporting source
7
OpenAI says Windows lacked the sandboxing tools Linux already had

Article Brian Fagioli — Technology journalist summarizing OpenAI's Codex sandboxing writeup.

Windows forced the company to engineer a custom solution
nerds.xyz/2026/05/openai-linux-windows-code… →
Details
Cited text
Windows forced the company to engineer a custom solution

Context
It connected managed sandboxes to the operating-system work needed when coding agents run local commands.
Key points
Linux and macOS had sandbox primitives Codex could use, including seccomp, bubblewrap, and Seatbelt.
OpenAI rejected several Windows approaches before using dedicated local users, firewall restrictions, restricted tokens, and helper executables.
The failed unelevated sandbox could be bypassed by programs ignoring proxy settings or implementing their own networking stack.
Provenance
Article · Supporting source
8
we really all are going to make it, aren't we? 2x3090 setup.

Article RedShiftedTime — LocalLLaMA poster describing a dual RTX 3090 local coding setup.

113 tk/s with no nvlink
www.reddit.com/r/LocalLLaMA/comments/1tcf2d… →
Details
Cited text
113 tk/s with no nvlink

Context
It gave the local-execution segment an operator example beyond lab hardware.
Key points
The author reports a large speed jump after moving from WSL2 to Ubuntu.
They describe Qwen 3.6 27B with a 262K-token window as useful for coding patches and reviews.
The post frames local models as becoming practical for budget setups.
Provenance
Article · Supporting source
9
24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

Article mdda — LocalLLaMA poster and blog author testing mixture-of-experts models on secondhand hardware.

The trick is MoE offloading
www.reddit.com/r/LocalLLaMA/comments/1tcc7h… →
Details
Cited text
The trick is MoE offloading

Context
It added concrete performance numbers and caveats for local agent loops.
Key points
Qwen 3.6 35B-A3B reaches roughly 24 tokens per second on the tested setup.
Gemma 4 26B-A4B with fixed multi-token prediction reaches about 24.5 tokens per second.
A commenter cautions that reserving a 128K-token window is not the same as filling it.
Provenance
Article · Supporting source
10
Build Hour: GPT-Realtime-2

Video OpenAI — OpenAI Build Hour technical session summarized in the provided materials.

parallel tool calling
www.youtube.com/watch?v=qGS9Ghnq1RU →
Details
Cited text
parallel tool calling

Context
It let the closing segment connect voice-to-action with the same execution, trace, memory, and cost constraints.
Key points
GPT-Realtime-2 brings GPT-5-class reasoning to voice interfaces.
The release includes a larger 128K-token context window and parallel tool calls.
Demos included voice-driven e-commerce search and analytics workflows.
Provenance
Video · Supporting source

Transcript

00:00:00 liraen: LangChain used Interrupt on Wednesday to announce a cluster of agent infrastructure: LangSmith Engine, LangSmith Sandboxes moving to general availability, Managed Deep Agents, and SmithDB. The day has a clear shape. The agent stack is getting its own repair shop. The harder question is whether that repair shop is inspectable enough for an operator to trust it.

00:00:21 halek: That's the operator version of it. Harrison Chase's post says Engine is an agent that sits on top of traces, runs in the background, identifies issues, and suggests code changes or evaluators. That's a serious product claim. The trace stops being only evidence after something breaks; it becomes input to the next intervention.

00:00:41 liraen: And it follows yesterday's Braid episode, which spent time on inspection layers for AI products, so I don't want to repeat the generic point. The new detail is the closed loop. If the system watches traces and proposes both patches and tests, does that move debugging closer to a production process, or does it just move the fog one layer up?

00:01:01 halek: It depends on the artifact it gives you. A trace-aware agent that says, 'your retrieval failed' is a fancy alert. A trace-aware agent that shows the failed span, proposes a minimal code patch, adds an evaluator that catches the same break next time, and lets a human accept or reject each step is closer to something I'd run. The evaluator suggestion matters as much as the code suggestion.
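
The artifact halek describes can be sketched as a small record that ties a failed span to individually reviewable steps. This is a minimal illustration only: `Proposal`, `Step`, `Decision`, and every field name are hypothetical and are not taken from LangSmith Engine or any LangChain API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    REJECTED = "rejected"


@dataclass
class Step:
    """One reviewable piece of a proposal (hypothetical shape)."""
    kind: str      # "patch" or "evaluator"
    summary: str
    decision: Decision = Decision.PENDING


@dataclass
class Proposal:
    """Links a failed span to the steps a human can accept or reject."""
    trace_id: str
    failed_span: str
    steps: list[Step] = field(default_factory=list)

    def review(self, index: int, accept: bool) -> None:
        # Record the human's decision for exactly one step.
        self.steps[index].decision = (
            Decision.ACCEPTED if accept else Decision.REJECTED
        )

    def ready_to_apply(self) -> bool:
        # Nothing is applied while any step is still pending.
        return bool(self.steps) and all(
            s.decision is not Decision.PENDING for s in self.steps
        )

    def accepted_steps(self) -> list[Step]:
        return [s for s in self.steps if s.decision is Decision.ACCEPTED]
```

Keeping the patch and the evaluator as separate steps means a reviewer can reject the code change while still keeping the test that catches the break.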

00:01:25 liraen: LangChain's own wording points there. Engine is described as identifying issues and suggesting action items, not silently changing production. I like that boundary. I also notice how much faith that puts in the trace format. If the trace doesn't preserve the user's intent, the tool call, the failed assumption, and the output, the agent can only repair the shadow of the system.

00:01:46 halek: SmithDB is the other half of that story. Ankush Gola's announcement calls it a database for agent observability workloads and says the problem is traces that can contain tens of thousands of events. That isn't a normal app-log table with a few indexes. Agent traces are nested, long, permission-sensitive, and expensive to scan. If Engine is going to reason over them every day, the storage layer becomes part of the product, not a back-office detail.

00:02:16 liraen: The timing says something too. Managed Deep Agents offers the harness, context, and code execution as managed pieces, with one line to deploy. Sandboxes offers secure, scalable environments for agent code execution. Engine says an agent can read your traces and propose fixes. Together, that's a pitch that agent development has moved beyond a single model call and into the operating surface around it.

00:02:41 halek: [tongue-click] I agree with the direction and still want to see the knobs. Managed harnesses are wonderful until the first weird integration. Then you need the execution timeout, the network policy, and the state model. You also need the billing boundary and the export path for the trace. The one-line deploy is attractive, but the long-term question is whether the managed layer keeps enough shape for you to debug it yourself.

00:03:04 liraen: LangSmith's Sandboxes post says each sandbox runs as a hardware-virtualized micro virtual machine, isolated at the kernel level from services and from other sandboxes. It also says containers alone can't guarantee that for untrusted, model-generated code. That is a more pointed claim than 'safe code execution.'

00:03:24 halek: The feature list is practical. You get snapshots, cheap copy-on-write forks, blueprints, service URLs, a command-line tool, creator-private access by default, and an auth proxy that injects credentials at the network layer. That last one is the kind of detail I care about. Secrets never touch the runtime if the proxy is doing its job.

00:03:45 liraen: The post also uses supply-chain attacks and kernel exploits as the reason. It names the Shai-Hulud npm worm, n8n remote-code-execution issues, and a Linux kernel bug as examples of why a JavaScript eval boundary or a shared-kernel container isn't enough. That makes the agent story less abstract: generated code installs dependencies, runs tests, opens previews, and touches credentials unless the environment refuses to let it.

00:04:12 halek: OpenAI's Codex sandboxing writeup, as covered by NERDS.xyz, points to the same pressure from the operating-system side. On Linux and macOS, Codex could lean on existing primitives like seccomp, bubblewrap, and Apple's Seatbelt framework. On Windows, OpenAI apparently had to build dedicated local users, outbound firewall restrictions, restricted tokens, and helper executables because the first approach could be bypassed.

00:04:41 liraen: The two stories meet at the execution boundary. LangChain is selling a managed micro virtual machine surface for agent execution. OpenAI is wrestling with desktop operating-system boundaries because Codex runs commands on a user's machine. Different product layer, same underlying fact: the agent is no longer only reading code. It is acting as a local process with consequences.

00:05:05 halek: The practical test is whether the environment can say no in a way the operator can understand. No outbound network unless allowed. No shell access to another user's sandbox. No secret value copied into the filesystem. No package install that persists after the session unless explicitly saved. I don't need the UI to feel magical. I need the refusal to be precise enough that the operator can repeat it in a runbook.
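
One way to picture a refusal that is "precise enough for a runbook" is a policy check that always returns a reason alongside the verdict. The sketch below is illustrative only: `Policy`, `Action`, `check`, and the action kinds are hypothetical names, and the rules simply mirror the four refusals listed above rather than any real sandbox product's behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    allowed_hosts: frozenset[str] = frozenset()
    owner: str = ""
    persist_installs: bool = False  # True only when explicitly saved


@dataclass(frozen=True)
class Action:
    kind: str    # "network", "shell", "write_secret", or "install"
    actor: str
    target: str


def check(policy: Policy, action: Action) -> tuple[bool, str]:
    """Return (allowed, reason); the reason is worded for a runbook."""
    if action.kind == "network":
        if action.target in policy.allowed_hosts:
            return True, f"outbound to {action.target} is on the allowlist"
        return False, f"outbound to {action.target} denied: host not on allowlist"
    if action.kind == "shell":
        if action.actor == policy.owner:
            return True, "shell access granted to sandbox owner"
        return False, f"shell denied: {action.actor} does not own this sandbox"
    if action.kind == "write_secret":
        return False, f"write to {action.target} denied: secrets may not touch the filesystem"
    if action.kind == "install":
        if policy.persist_installs:
            return True, f"install of {action.target} will persist (explicitly saved)"
        return True, f"install of {action.target} allowed for this session only"
    # Anything the policy does not recognize is refused by default.
    return False, f"unknown action kind {action.kind!r} denied by default"
```

The default-deny branch at the end is a design choice in this sketch: an action type the policy has never heard of gets a refusal the operator can read, not a silent pass.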

00:05:31 liraen: [chuckle] I'll take that correction. But yes. The best agent platform might feel less like a chat window and more like a controlled workbench: stateful enough to do real work, contained enough that one strange dependency doesn't become everyone's problem.

00:05:47 liraen: Anthropic's help-center article says that starting June 15, 2026, Claude Agent SDK and claude -p usage no longer count toward a Claude plan's usage limits. Instead, eligible users get a separate monthly Agent SDK credit: $20 for Pro, $100 for Max 5x, and $200 for Max 20x, with similar per-seat credits for some team and enterprise plans.

00:06:13 halek: That's a clean commercial boundary. Interactive Claude Code stays in the subscription bucket. Programmatic use through the Agent SDK or claude -p moves to a credit bucket, and after the credit is gone, it moves to standard API billing if extra usage is enabled. If extra usage isn't enabled, requests stop until the credit refreshes.
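
The boundary described here (monthly credit first, then either extra usage or a hard stop) can be modeled in a few lines. This is a toy model under stated assumptions, not Anthropic's billing logic: `CreditBucket` and its fields are hypothetical, amounts are integer cents, and the sketch assumes a request that the remaining credit cannot fully cover is stopped when extra usage is disabled.

```python
from dataclasses import dataclass


@dataclass
class CreditBucket:
    monthly_credit_cents: int
    extra_usage_enabled: bool
    spent_from_credit_cents: int = 0
    billed_as_extra_cents: int = 0

    def charge(self, cost_cents: int) -> bool:
        """Apply one request's cost. Returns False if the request must stop."""
        remaining = self.monthly_credit_cents - self.spent_from_credit_cents
        if cost_cents <= remaining:
            self.spent_from_credit_cents += cost_cents
            return True
        if not self.extra_usage_enabled:
            # Assumption: a partially covered request is refused outright.
            return False
        # Drain what is left of the credit, bill the rest as extra usage.
        self.spent_from_credit_cents += remaining
        self.billed_as_extra_cents += cost_cents - remaining
        return True
```

With a $20 credit (2000 cents), a 1500-cent request followed by a 600-cent request either stops at the second request or bills 100 cents as extra usage, depending on the `extra_usage_enabled` flag.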

00:06:35 liraen: The Reddit reaction was much less clean. A ClaudeAI post described an autonomous Kanban-and-agent setup built around claude -p, and the author says the change torpedoes the premise because always-on hands-free runs now consume a separate monthly credit. The top comment framed it bluntly: if your idea depends on a vulnerability in another product, this is where it tends to end.

00:06:58 halek: I wouldn't call it a vulnerability in the security sense, but as a business dependency, yes. A subscription interface with a print mode looked like cheap API control. Anthropic is saying that once you use it as infrastructure, it gets priced like infrastructure. That changes what people can run overnight, what they can afford to test, and whether a solo developer treats Claude Code as an engine or as an assistant.

00:07:23 liraen: There is a fairness question here, but the systems question is sharper. If the agent runtime is going to run jobs unattended, the meter has to be part of the design. The operator needs a budget cap, a stop condition, a visible queue, and a way to know which run spent which dollars. Otherwise the product promise isn't automation; it's a bill with a cheerful transcript attached.
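
A budget cap, a stop condition, and per-run attribution fit in a very small ledger. The sketch below is one possible shape, with hypothetical names (`RunLedger`, `BudgetExceeded`) and integer cents as the unit; it is not drawn from any product discussed in the episode.

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised instead of recording a cost that would break the cap."""


class RunLedger:
    def __init__(self, cap_cents: int) -> None:
        self.cap_cents = cap_cents
        self.by_run: dict[str, int] = defaultdict(int)

    @property
    def total_cents(self) -> int:
        return sum(self.by_run.values())

    def record(self, run_id: str, cost_cents: int) -> None:
        # Stop condition: refuse the charge before it is spent, not after.
        if self.total_cents + cost_cents > self.cap_cents:
            raise BudgetExceeded(
                f"run {run_id} would exceed cap of {self.cap_cents} cents "
                f"(already spent {self.total_cents})"
            )
        self.by_run[run_id] += cost_cents

    def report(self) -> list[tuple[str, int]]:
        # Most expensive runs first, so the operator sees where money went.
        return sorted(self.by_run.items(), key=lambda kv: kv[1], reverse=True)
```

An unattended loop would call `record` before each paid step and treat `BudgetExceeded` as the signal to halt and leave the queue visible for a human.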

00:07:44 halek: That connects back to LangChain. Managed Deep Agents can make deployment easier, but cost observability has to sit beside trace observability. A run that opens a PR, runs tests, forks a sandbox ten times, and asks another model to diagnose the trace can be a good run. It can also be a surprisingly expensive run. The system has to show the operator where the money went and where the error happened.

00:08:09 liraen: A smaller Reddit item fits the same day neatly. A developer posted Almanac, an open-source, self-updating wiki for a codebase. Their examples were the kind of context agents forget: auth moved into middleware and got backed out because OAuth callbacks broke, or retry logic exists because Stripe webhooks arrive out of order.

00:08:30 halek: That is exactly the class of fact that doesn't live well in code comments. It is history, not syntax. A coding agent can read the current file and still miss the scar tissue around it. If Almanac turns conversations with Claude Code or Codex into local markdown that the next agent can read, it is trying to make project memory visible at the repo level.

00:08:52 liraen: The author says the wiki updates from the repo and from conversations, lives locally as markdown, and is mainly for the agent, though humans can read it too. I like the locality. The wiki isn't another hosted brain that owns the team's architectural memory. It is a repo-adjacent artifact the team can inspect.

00:09:11 halek: The hard part is conflict and freshness. If the agent writes, 'we backed out middleware auth,' and two weeks later the team reintroduces it with a better callback path, what retires the old note? A self-updating wiki needs provenance and decay. Who said this? What commit made it true? What commit might have made it stale? Without that, it becomes a very confident pile of old decisions.

00:09:35 liraen: That makes it a useful counterweight to Engine. Engine reads traces after behavior. Almanac tries to preserve context before the next edit. Sandboxes contain execution while it happens. These are three different answers to one problem: agents need working memory outside the model's immediate context window, and that memory has to stay accountable.

00:09:56 halek: Accountable means precise metadata: source, time, file, conversation, confidence, and a retirement path. The model doesn't need a mythic memory palace. It needs notes that survive the next run without lying about how they got there.
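
The metadata list above maps directly onto a record type. The following is a sketch, not Almanac's format: `Note`, `notes_for_agent`, the commit fields, and the 0.5 confidence threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Note:
    claim: str
    source: str                      # who or what asserted the claim
    recorded_at: datetime            # time
    file: str                        # file the claim is about
    conversation_id: str             # conversation it came from
    confidence: float                # 0.0 to 1.0
    made_true_by: str                # commit that made the claim true
    retired_by: Optional[str] = None  # commit that made it stale, if any

    def retire(self, commit: str) -> None:
        # Retirement keeps the note for history but marks it inactive.
        self.retired_by = commit

    @property
    def active(self) -> bool:
        return self.retired_by is None


def notes_for_agent(notes: list[Note], min_confidence: float = 0.5) -> list[Note]:
    """Only active, sufficiently confident notes reach the next run."""
    return [n for n in notes if n.active and n.confidence >= min_confidence]
```

In the middleware-auth example from earlier in the conversation, reintroducing the middleware would call `retire` on the old note with the new commit, so the stale claim stays on record without being fed to the next agent.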

00:10:11 liraen: The local-model items add a hardware edge to the same story. One LocalLLaMA post describes a dual RTX 3090 setup running Qwen 3.6 27B with a 262K-token window, around 113 tokens per second after moving from WSL2 to Ubuntu. Another describes Qwen 3.6 35B-A3B and Gemma 4 26B-A4B on an old GTX 1080 with 8 GB of VRAM, using llama.cpp and key-value cache quantization.

00:10:47 halek: The GTX 1080 result is the more interesting operator artifact. The author reports roughly 24 tokens per second for Qwen 3.6 35B-A3B and about 24.5 for Gemma 4 26B-A4B with multi-token prediction after forcing the draft embedding table onto the GPU. They also say the machine is PCIe-bandwidth limited, with the GPU sitting around 40 to 50 percent utilization while PCIe 3.0 x16 is maxed out.

00:11:21 liraen: A commenter adds a useful caveat: reserving a 128K-token context window is different from actually using the full window. The tests mostly used short contexts, and performance may fall when the prompt grows. That is the kind of caveat I want in this conversation, because local execution can turn into romance very quickly.

00:11:41 halek: Yes. Local isn't free; it just changes which constraints you can touch. You trade provider billing for hardware, heat, setup time, driver pain, and a very specific understanding of where the bottleneck sits. But the practical improvement is real. If a secondhand GPU box can run useful coding loops fast enough for reviews and patches, the fallback story changes. Teams can reserve frontier models for the hard calls and use local models for repetitive passes.

00:12:08 liraen: That pairs strangely well with OpenAI's Build Hour session on GPT-Realtime-2. The model brings GPT-5-class reasoning to voice interfaces, along with a 128K-token context window, parallel tool calling, voice cloning, tone matching, and low-latency translation. The demos were not just voice chat. They were voice-to-action: an e-commerce search agent using many tools, and an analytics agent finding a mobile Safari regression.

00:12:36 halek: The phrase I'd keep is parallel tool calling. In a voice interface, serial tool calls feel broken because the person is waiting in real time. If the model can call several tools at once, inspect UI state, and keep the conversation coherent, voice becomes a control surface for software, not a dictation layer. But then all the earlier constraints come back: sandbox, budget, trace, memory, and refusal behavior.
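
The latency difference between serial and parallel tool calls is easy to demonstrate with Python's `asyncio`. This is a generic illustration, not the GPT-Realtime-2 API: `fake_tool` is a hypothetical stand-in that sleeps to simulate a network round trip, and the tool names and delays are invented.

```python
import asyncio
import time


async def fake_tool(name: str, delay: float) -> str:
    """Stand-in for a real tool call; sleeps to simulate I/O latency."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def run_serial(calls: list[tuple[str, float]]) -> list[str]:
    # Each call waits for the previous one, so latencies add up.
    return [await fake_tool(name, delay) for name, delay in calls]


async def run_parallel(calls: list[tuple[str, float]]) -> list[str]:
    # All calls start together; total latency is roughly the slowest call.
    return await asyncio.gather(*(fake_tool(name, delay) for name, delay in calls))


async def timed(coro):
    """Await a coroutine and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start
```

With three simulated tools at 0.2 seconds each, `asyncio.run(timed(run_serial(calls)))` takes roughly the sum of the delays, while the `run_parallel` version takes roughly the longest single delay. `asyncio.gather` returns results in the order the calls were passed, so the caller can still match each result to its tool.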

00:13:02 liraen: So Wednesday's through-line is not one company winning a product category. The agent now has a larger work surface. It can inspect traces, inhabit sandboxes, spend credits, consult codebase memory, run on local machines, and drive tools while a person talks.

00:13:19 halek: Each surface needs a named boundary. On traces: did you preserve the evidence? In the sandbox: what can code touch? In pricing: what spends money after I walk away? In memory: which old claims still apply? On local hardware: what does this machine sustain under load?

00:13:37 liraen: Does that make LangChain's day feel bigger to you, or just more complete?

00:13:42 halek: More complete. Engine without Sandboxes would be a watcher with no safe hands. Sandboxes without Engine would be a safe place to run code, but not a loop that learns from the run. Managed Deep Agents without SmithDB would be a deploy pitch without the data layer to explain what happened. The bundle is interesting because the pieces answer each other's weaknesses.

00:14:04 liraen: Then I'll carry the audit trail around the agent's own interventions into Thursday. If Engine proposes a patch and an evaluator, I want to see the trace that led to it, the sandbox that tested it, the credit bucket that paid for it, and the human who accepted it. That chain decides whether agent self-improvement becomes engineering practice or just another invisible automation promise.