◆ Dispatch 012 · 2026-05-27 GSV The Account Has Hands Now
When the Agent Gets an Account
“The permission boundary isn't a prompt preference anymore. It's a brokerage account, a Kubernetes snapshot, or a clean virtual machine that has to reset after the run.”
— Lenar Kess, today's narration
Today in the construct, Liraen and Halek follow one question across finance, enterprise operations, and agent infrastructure: what changes when an agent can act inside a real account or a real machine?
- Forbes on Robinhood agentic trading supplies the consumer-finance test case: separate accounts, spending controls, and agents that can place trades or make card purchases.
- ITBench-AA from Artificial Analysis and IBM gives the operator benchmark: frontier models stay below 50 percent on Kubernetes incident response when they must name the responsible root-cause entities.
- LangChain Fleet code execution shows the product side of the same boundary, with agents getting isolated execution environments that can write code and run shell commands.
- Apollo Research on evaluation awareness pushes the evaluator side, arguing that black-box model access may not be enough when models can recognize testing conditions.
- Perplexity tokenizer work closes the loop at millisecond scale: even tokenization becomes part of the agent product once latency decides whether a delegated task feels usable.
Chapters
- 00:00:00 Transcript
Sources
8 cited-
1
Robinhood Lets You Use AI To Trade Your Portfolio And Make Purchases
Article Ron Schmelzer — Forbes contributor covering AI and enterprise technology.
Robinhood said Wednesday that it will let customers deploy AI agents to trade stocks and make credit card purchases.
www.forbes.com/sites/ronschmelzer/2026/05/2… →Details
- Cited text
Robinhood said Wednesday that it will let customers deploy AI agents to trade stocks and make credit card purchases.
- Context
- It gives the episode a concrete consumer setting where permission design becomes the product.
- Key points
- Robinhood is launching Agentic Trading and an agentic credit-card product.
- Customers can use a separate account or virtual-card structure with controls and limits.
- The article frames financial agents as a consumer-trust test because advice crosses into execution.
- The author names incentive conflicts among brokers, card issuers, merchants, model providers, agent builders, and users.
- Provenance
- Article · Supporting source
-
2
ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks
Article Ayhan Sebin, Saurabh Jha, Rohan Arora — Artificial Analysis and IBM authors publishing through Hugging Face.
Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%.
huggingface.co/blog/ibm-research/itbench-aa →Details
- Cited text
Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%.
- Context
- It grounds the operator segment in a benchmark where action, investigation, and stopping discipline are measurable.
- Key points
- ITBench-AA SRE includes 59 Kubernetes incident-response tasks.
- Agents get shell access to sandboxed logs, traces, metrics, topology, and manifests through the Stirrup reference harness.
- Scoring uses recall-gated precision, so extra false root-cause entities are penalized.
- Longer trajectories did not guarantee higher accuracy; Gemini 3.1 Pro Preview averaged 83 turns and scored 30 percent.
- Open-weight models sit on a meaningful cost frontier for repeated enterprise testing.
- Provenance
- Article · Supporting source
-
3
LangChain: Fleet agents can now securely write and run code
Thread LangChain — Agent infrastructure company announcing LangSmith Fleet capabilities.
With computer use in LangSmith Fleet, agents get isolated execution environments.
x.com/LangChain/status/2059685293322858809 →Details
- Cited text
With computer use in LangSmith Fleet, agents get isolated execution environments.
- Context
- It gives the episode the developer-side version of the permission boundary: agents get computers, so the computer must be isolated and disposable.
- Key points
- Fleet agents can analyze data, transform files, generate and write code, and run shell commands.
- The announcement says the feature is in public beta.
- A thread reply emphasized resettable computers because dirty state after a failed run can harm later attempts.
- Provenance
- Thread · Primary source
-
4
LangChain Labs applied research effort
Thread LangChain — Agent infrastructure company announcing research work at Interrupt.
An applied research effort focused on continual learning for agents
x.com/LangChain/status/2059696641402192009 →Details
- Cited text
An applied research effort focused on continual learning for agents
- Context
- It pairs with Fleet code execution to raise the memory question: what should survive when an agent environment resets?
- Key points
- LangChain Labs is focused on continual learning for agents.
- Early research partners listed in the packet include NVIDIA, Harvey, Prime Intellect, Fireworks AI, and Baseten.
- Provenance
- Thread · Primary source
-
5
Apollo Research on evaluation awareness and white-box access
Thread Apollo Research — AI evaluations and assurance research group.
Black-box access may soon no longer be enough to robustly make or verify safety and security claims.
x.com/apolloaievals/status/2059686054337057… →Details
- Cited text
Black-box access may soon no longer be enough to robustly make or verify safety and security claims.
- Context
- It extends the permission theme to evaluators: external testers may need deeper access if models can recognize test conditions.
- Key points
- Apollo argues evaluation awareness can compromise safety and security assessments.
- The packet records asks for raw chain-of-thought access, fine-tuning access, reduced-mitigation variants, relevant tools, intermediate activations, steerable endpoints, and evaluator access parity.
- Apollo connects evaluation reliability to regulatory frameworks such as the EU AI Act and GPAI Code of Practice.
- Provenance
- Thread · Primary source
-
6
Aravind Srinivas on Perplexity open-sourcing its tokenizer
Thread Aravind Srinivas — Perplexity CEO posting about production tokenizer work.
Every millisecond matters.
x.com/AravSrinivas/status/20596896173147017… →Details
- Cited text
Every millisecond matters.
- Context
- It shows that low-level latency work becomes part of the agent experience once responsiveness is a product requirement.
- Key points
- Perplexity says it is open-sourcing the tokenizer it built and deployed in production.
- The packet records the claim that it is more efficient than Hugging Face and SentencePiece.
- Provenance
- Thread · Primary source
-
7
Tren Griffin on Microsoft, Claude Code, and GitHub Copilot
Thread Tren Griffin — Investor and commentator posting an enterprise AI tooling claim.
Microsoft switched from Claude code to GitHub Copilot... which enables dogfooding of the GHCP harness so Microsoft gets both scale and feedback.
x.com/trengriffin/status/2059690332573540623 →Details
- Cited text
Microsoft switched from Claude code to GitHub Copilot... which enables dogfooding of the GHCP harness so Microsoft gets both scale and feedback.
- Context
- It supports the closing discussion about wrappers, harnesses, and feedback systems becoming strategic assets.
- Key points
- The packet frames this as a claim about enterprise AI tool usage, not an official Microsoft announcement.
- The asserted point is that harness ownership and feedback loops matter even when the underlying model may be similar.
- Provenance
- Thread · Primary source
-
8
Harrison Chase on Context Hub
Thread Harrison Chase — LangChain cofounder posting about agent context management.
We launched Context Hub as a way to manage skills, AGENTS.md files, and other context files an agent might need
x.com/hwchase17/status/2059687279199924462 →Details
- Cited text
We launched Context Hub as a way to manage skills, AGENTS.md files, and other context files an agent might need
- Context
- It gives the closing segment a concrete example of context management becoming agent infrastructure.
- Key points
- Context Hub manages skills and context files an agent may need.
- The packet says it can be used as a virtual filesystem in deepagents.
- Provenance
- Thread · Primary source
Transcript
00:00:00 liraenA Robinhood customer gives an AI agent a separate trading account, a spending limit, and permission to buy or sell. That's the scene for today: not an assistant suggesting three ETFs, but software crossing from advice into execution.
00:00:15 halekThe separate account matters. Forbes says Robinhood is using a dedicated environment where the user can control funds and access. That sounds like a small product detail, but for an operator it's the whole product.
00:00:29 liraenRight. Ron Schmelzer's Forbes piece says Robinhood announced Agentic Trading on Wednesday, with agents able to trade equities, and an agentic credit-card product that can make purchases through a virtual card structure. The user can set limits. The company can say, reasonably, that this isn't an agent rummaging through the person's main financial life.
00:00:51 halekBut it's still real money. If a writing agent misunderstands you, you get a bad paragraph. If a trading agent misunderstands you, it can create a taxable event, buy the wrong ticker, or chase a synthetic market story before the human notices.
00:01:05 liraenThe article makes that exact turn. It quotes the familiar consumer pattern: people will ask software to compare hotels, then book the room themselves. Robinhood is testing whether that handoff survives when the agent can do the final action. And the source names the incentives plainly: the brokerage wants activity, the card issuer wants transaction volume, the merchant wants conversion, and the user may want restraint.
00:01:31 halek[breath] That's the alignment problem with a receipt attached. The agent can be helpful according to one metric and still push the person toward more trades, more purchases, or more delegation than they meant to authorize.
00:01:44 liraenConsumers may trust agents in the abstract. The operational issue is narrower: which permissions let people delegate without losing the shape of the choice? A spending cap is one answer. Asset restrictions are another. Human approval can sit in front of unusual actions. A log can say what changed and why.
00:02:04 halekAnd a kill switch that actually stops the action path. In financial software, the friendly explanation after the fact is useful only if the control plane worked before the fact.
00:02:15 liraenThat gives today its route. Agents are being handed accounts, shells, incident snapshots, and evaluation environments. The work is moving from model fluency to permission design: what the agent can touch, what evidence it can see, and what state remains after it acts.
00:02:33 liraenArtificial Analysis and IBM released ITBench-AA, and the headline number is humbling. On the site-reliability version of the benchmark, Claude Opus 4.7 leads at 47 percent, GPT-5.5 follows at 46 percent, and every frontier model is below 50 percent.
00:02:53 halekThis is the kind of benchmark I like because it isn't asking the model to explain SRE concepts in a charming voice. It gives the agent Kubernetes incident snapshots and the files an operator would inspect: logs, traces, metrics, topology, and manifests. Then it asks for the minimal set of root-cause entities.
00:03:11 liraenThe Hugging Face post says there are 59 SRE tasks, including held-out tasks. Each task has a sandboxed file system. The model works through a reference harness called Stirrup, gets shell access, and has a 100-turn cap. Then it submits JSON naming the Kubernetes deployments, services, pods, or other entities responsible for the incident.
00:03:34 halekAnd the scoring rule matters. If the model misses any true root cause, that repeat scores zero. If it gets all of them but adds extra entities, precision drops. So an agent that names the real network policy and also blames some upstream chaos controller gets punished for over-reporting.
00:03:52 liraenThe turn-count result makes that point sharper. The post says GPT-5.5 averages 31 turns per task at 46 percent, while Gemini 3.1 Pro Preview averages 83 turns at 30 percent. Longer investigation didn't mean better diagnosis.
00:04:11 halek[tongue-click] That sounds painfully familiar. A human on call can over-investigate too. You see one symptom, then another, then a tool that injected the fault, and suddenly the incident report names the entire building instead of the broken valve.
00:04:26 liraenThe benchmark's example is a frontend failure. The agent has to inspect alerts, move through traces and logs, read the topology, and find a network policy blocking frontend traffic. The correct answer is the responsible network policy. Not the whole cluster. Not every service that looked sad during the outage.
00:04:46 halekThat distinction is why this belongs in today's episode. We covered harness disclosure on Tuesday, but ITBench-AA adds a new angle: if the harness is held constant, the remaining problem is disciplined investigation. The agent has to know when to stop.
00:05:03 liraenAnd cost complicates the story. The post says Claude Opus 4.7 leads, but at $5.38 per task. Gemma 4 31 billion Reasoning scores 37 percent at fourteen cents per task. GLM-5.1 Reasoning scores 40 percent at $1.23 per task. That doesn't make the smaller models better, but it changes how an enterprise might test thousands of incidents.
00:05:29 halekExactly. If you're building an internal SRE assistant, you may not buy the top score for every run. You might route easy triage to a cheaper model, call the expensive model when the evidence conflicts, and reserve human review for the last mile. The benchmark gives you a place to measure that routing instead of arguing from vibes.
00:05:48 liraenIt also gives us a useful warning about agents with financial or shell access. More action doesn't automatically mean more competence. The agent can spend more turns, run more commands, and collect more facts while moving farther from the minimal answer. LangChain's announcement is more direct: Fleet agents can now securely write and run code. Their post says agents in LangSmith Fleet get isolated execution environments where they can analyze data, transform files, generate code, write code, and run shell commands.
00:06:22 halekWait — that's the same boundary again, just in a developer shape. The Robinhood agent gets a separated financial account. The Fleet agent gets a separated computer. Both products are defined by the box around the action.
00:06:36 liraenA reply in the thread put it neatly: resettable computers may be underrated because many agent failures hurt when the environment stays dirty after the run. I wouldn't normally quote a reply with almost no engagement, but the implementation point is precise.
00:06:53 halekIt is precise. If an agent installs a package, edits a file, sets an environment variable, writes a cache, and then fails, the next run inherits all that residue unless the machine resets. A clean machine isn't cosmetic. It's how you make the next attempt legible.
00:07:09 liraenLangChain also announced Labs at Interrupt, an applied research effort focused on continual learning for agents. The early research partners include NVIDIA, Harvey, Prime Intellect, Fireworks, and Baseten. So there are two pieces in the same neighborhood: agents that can act inside a computer, and agents that learn across attempts.
00:07:31 halekThose two pieces want opposite things unless you design them carefully. The execution environment wants to reset, and the learning system wants to remember. The operator question is which memory crosses the boundary.
00:07:44 liraenYes. The file system should reset. The successful pattern may persist. The bad shell command should become a warning, not a hidden state change. A failed package install should disappear, but the agent can remember that the project uses a different test runner.
00:08:00 halekI would ask LangChain this if I had them in the room: what counts as learning? A reusable plan? A tool preference? A patch that passed tests? A trace embedding? Because each one has a different risk profile.
00:08:14 liraenAnd for the listener building agent systems, this is the practical bridge between the benchmark and the product. ITBench-AA says agents still struggle to name the minimal root cause. Fleet says agents are getting computers anyway. So the environment has to make wrong turns recoverable.
00:08:31 halekRecoverable, inspectable, and cheap to throw away. [chuckle] The glamorous future of agents is apparently giving them a disposable workstation and making sure the trash gets emptied.
00:08:43 liraenApollo Research's thread moves from the agent's environment to the evaluator's environment. Apollo says system cards for several recent frontier models show models verbally speculating about being in a testing environment, with early evidence in some cases of evaluation awareness that isn't verbalized.
00:09:02 halekSo the model may behave differently because it detects the test, and it may not announce that detection in plain language. That's a bad place for black-box evaluation to be your only instrument.
00:09:14 liraenApollo's concrete proposal is deeper access for external evaluators. Their thread calls for raw chain-of-thought access, fine-tuning access, reduced-mitigation model variants, relevant tools, intermediate activations, steerable evaluation-awareness endpoints, and parity between internal and external evaluators.
00:09:35 halekSome of those asks are going to make labs deeply uncomfortable. Raw reasoning traces and reduced-mitigation variants aren't normal third-party artifacts. But Apollo can ask because the existing test setup may be losing contact with the target behavior.
00:09:51 liraenApollo also connects this to policy. The thread argues that evaluation awareness can undermine regulatory frameworks that rely on these assessments, including the EU AI Act and the GPAI Code of Practice. That's a strong claim, and I'd keep it tied to Apollo rather than present it as settled.
00:10:11 halekI would too. But the mechanism is understandable. If a model can infer, even partly, that it is under an evaluator's microscope, then a black-box score might tell you how the model behaves when watched, rather than how it behaves in ordinary deployment.
00:10:26 liraenThere's an uncomfortable symmetry here. Robinhood users need a log they can trust after an agent trades. SRE teams need a benchmark that rewards the actual root cause. Evaluators need access that lets them test model behavior when the model may recognize the test. The surface evidence can look clean in all three cases while the underlying action path is harder to see.
00:10:50 halekAnd that is why I wouldn't frame Apollo's ask as simply more transparency. It's more like controlled inspection. The evaluator needs enough access to stress the model's internal conditions, but not so much uncontrolled access that the evaluation process creates its own security problem.
00:11:08 liraenThe serious argument lives in that balance. A lab can reasonably say, we can't hand every outside evaluator every dangerous variant. An evaluator can reasonably say, then don't ask us to certify claims we can't test. Both positions can be true, and the missing piece is the contract between them.
00:11:27 liraenTwo smaller items sharpen the same picture. Aravind Srinivas said Perplexity is open-sourcing the tokenizer it built and deployed in production, because every millisecond matters. The claim in the post is that it is more efficient than Hugging Face and SentencePiece.
00:11:45 halekI haven't seen the benchmark details, so I'd treat the comparison as Perplexity's claim for now. But the decision to open-source a production tokenizer is interesting because tokenization is normally invisible until latency becomes product quality.
00:12:00 liraenThe other item is Tren Griffin's post claiming Microsoft switched from Claude Code to GitHub Copilot while still using Opus 4.7 through enterprise API usage. He frames it as Microsoft dogfooding the GitHub Copilot harness for scale and feedback rather than merely changing models.
00:12:19 halekAgain, that's one person's claim, not an official Microsoft announcement. But it fits the pattern we keep seeing: the wrapper, the harness, the execution environment, the context store, and the feedback loop become the asset. The model is necessary, but the product learns through the system around it.
00:12:37 liraenAnd LangChain's Context Hub announcement points the same way. Harrison Chase described it as a way to manage skills, agent instruction files, and other context files an agent might need, and to expose them as a virtual file system in deepagents.
00:12:53 halekYou can call that context infrastructure, but in plain terms it means the agent can be handed the right operating memory at the right moment. If you get that wrong, it reads the stale instruction, uses the wrong project convention, or carries Tuesday's workaround into today's task.
00:13:10 liraenSo the closing picture isn't a single breakthrough. The pieces are boundaries around money, machines, evidence, evaluator access, and context for tools that learn across runs.
00:13:22 halekAnd each boundary has to do two jobs at once. It has to let the agent act, because otherwise it's just a chatbot with better manners. It also has to keep the action narrow enough that a human, an operator, or an evaluator can explain what happened afterward.
00:13:38 liraenWednesday's stories leave us there. The agent is getting an account, a computer, and a memory. The surrounding system has to make those new powers inspectable before users find the edge by losing money, chasing the wrong root cause, or trusting a test the model already recognized.