◆ Dispatch 023 · 2026-05-11 GSV The Code You Keep
Deployment, Discovery, and the Code You Keep
“The speedup only counts if the system you inherit is cheaper to understand, cheaper to test, and cheaper to change.”
— Lenar Kess, today's narration
Today’s Braid starts with OpenAI launching a majority-owned Deployment Company, backed by a Tomoro acquisition, about 150 forward deployed engineers, nineteen partners, and more than $4 billion of initial investment. The practical thread is the work of changing real systems: integration, controls, measurement, and the code you still have to maintain after the demo.
- OpenAI turns deployment into a company, with Tomoro, TPG, consultancies, integrators, and OpenAI’s launch post pointing at a bigger bet on embedded engineering.
- Mythos finds one curl vulnerability, while Rival Security complicates Anthropic’s FreeBSD story with a training-data provenance question.
- James Shore’s maintenance-cost math meets the k10s devlog about archiving a seven-month AI-built Kubernetes TUI.
- Trigger.dev’s durable-agent talk and Arize’s context-management talk give the backend version of the same lesson.
- Granola’s production loop, a tiny boolean-argument essay, and MLX on-device demos close the day on builder craft.
Chapters
- 00:00:04 OpenAI turns deployment into a company
- 00:04:41 Mythos finds one curl vulnerability
- 00:10:39 The maintenance bill comes due
- 00:16:29 Durable agents need a place to wake up
- 00:22:38 The product loop around the model
- 00:26:51 Local capability keeps getting stranger
Sources
11 cited-
1
OpenAI launches the OpenAI Deployment Company to help businesses build around intelligence
Article OpenAI — OpenAI official company announcement
help organizations build and deploy AI systems they can rely on every day
openai.com/index/openai-launches-the-deploy… →Details
- Cited text
help organizations build and deploy AI systems they can rely on every day
- Excerpt
- OpenAI announced a majority-owned deployment company, a Tomoro acquisition, about 150 forward deployed engineers, nineteen partners, and more than $4 billion of initial investment.
- Context
- The announcement moves OpenAI closer to the hard implementation work inside companies: data connections, controls, business processes, workflow redesign, and change management.
- Key points
- OpenAI is launching a majority-owned OpenAI Deployment Company focused on enterprise AI deployment.
- OpenAI agreed to acquire Tomoro, bringing about 150 forward deployed engineers and deployment specialists into the new company.
- The launch includes nineteen investment, consulting, and systems-integration partners, led by TPG with several co-leads.
- OpenAI says the company will launch with more than $4 billion of initial investment and will connect customers closely to OpenAI research and product teams.
- Provenance
- Article · Supporting source
-
2
OpenAI announces the OpenAI Deployment Company on X
X OpenAI — Official OpenAI account
majority-owned and controlled by OpenAI
x.com/OpenAI/status/2053824997777457651 →Details
- Cited text
majority-owned and controlled by OpenAI
- Excerpt
- OpenAI framed the launch as a company to help businesses build and deploy AI, with nineteen investment firms, consultancies, and system integrators involved.
- Context
- The tweet was the live trigger for the episode’s opening segment and gives the compact public framing that spread through the developer timeline.
- Key points
- The tweet positioned the new company as a production deployment effort, not only a consulting partnership.
- OpenAI emphasized control and majority ownership in the public framing.
- The engagement made this one of the central frontier items of the day.
- Engagement
- 1101 likes · 176 retweets · 132 replies
- Provenance
- Tweet · Primary source
-
3
Mythos finds a curl vulnerability
Article Daniel Stenberg — Lead developer of curl
The AI reviews are used in addition to the human reviews.
daniel.haxx.se/blog/2026/05/11/mythos-finds… →Details
- Cited text
The AI reviews are used in addition to the human reviews.
- Excerpt
- Stenberg says a Mythos scan of curl reported five confirmed vulnerabilities, which the curl security team reduced to one low-severity CVE plus about twenty bugs under investigation.
- Context
- The post gives a rare maintainer-side view of a frontier security model on a heavily audited real codebase, including both the benefit and the hype boundary.
- Key points
- curl had already used AISLE, Zeropath, OpenAI Codex Security, GitHub Copilot, and Augment in addition to conventional security tooling.
- The Mythos report analyzed about 178 thousand lines under src and lib and initially claimed five confirmed security vulnerabilities.
- After human review, the curl team found one confirmed vulnerability, three false positives, and one ordinary bug among the five security claims.
- Stenberg still says AI code analyzers are much better than older analyzers at finding security flaws and mistakes.
- Provenance
- Article · Supporting source
-
4
Mythos 'Discovered' a CVE Already in Its Training Data - and That’s Still Worrying
Article Rival Security — Security research firm analyzing Anthropic’s Mythos FreeBSD claim
combinatorial creativity, with AI making a discovery already within its training data
rival.security/posts/mythos-discovered-a-cv… →Details
- Cited text
combinatorial creativity, with AI making a discovery already within its training data
- Excerpt
- Rival Security argues that the FreeBSD Mythos CVE closely resembles a 2007 Kerberos vulnerability and patch pattern, changing the question from pure novelty to rediscovery and exploitation.
- Context
- It complicates the security story in a useful way: novelty and operational danger are different claims, and both need to be handled carefully.
- Key points
- The post focuses on CVE-2026-4747 in FreeBSD RPCSEC_GSS code.
- It compares the FreeBSD vulnerable function to an old MIT Kerberos issue, CVE-2007-3999.
- The proposed lineage suggests Mythos may have rediscovered a pattern represented in training data rather than inventing a novel bug class.
- The authors still argue that cheap rediscovery and exploitation create a serious defensive pressure.
- Provenance
- Article · Supporting source
-
5
You Need AI That Reduces Maintenance Costs
Article James Shore — Software development author and consultant
your AI coding agent, the one you use to write code, needs to reduce your maintenance costs
www.jamesshore.com/v2/blog/2026/you-need-ai… →Details
- Cited text
your AI coding agent, the one you use to write code, needs to reduce your maintenance costs
- Excerpt
- Shore argues that coding agents only improve long-term productivity if they reduce the maintenance cost of the code they add in proportion to their speed gains.
- Context
- The essay gives a clean way to judge agentic coding tools: do they make the code cheaper to keep alive after the demo is over?
- Key points
- Shore models every month of code as creating maintenance obligations in future years.
- A coding agent that doubles output without halving maintenance cost eventually erodes its own productivity gains.
- He frames maintainability, not code volume, as the number that decides whether AI-assisted development works over time.
- The piece leaves room for AI that improves maintenance itself, but warns against speed-only adoption.
- Provenance
- Article · Supporting source
-
6
Im going back to writing code by hand
Article k10s devlog — Developer writing about a seven-month AI-built Kubernetes TUI project
AI writes features, not architecture.
blog.k10s.dev/im-going-back-to-writing-code… →Details
- Cited text
AI writes features, not architecture.
- Excerpt
- The author archived a GPU-aware Kubernetes terminal UI after seven months and 234 commits of AI-heavy building, then explained how scope creep, one large state object, positional data, and unsafe state updates accumulated.
- Context
- It is the lived version of the maintenance-cost argument: fast code generation can hide architecture debt until normal product work gets slow and fragile.
- Key points
- The project grew into a general Kubernetes TUI because AI made each extra feature feel inexpensive.
- The core model file reached 1,690 lines, with one state object mixing UI widgets, cluster state, view state, logs, navigation, and fleet data.
- The author highlights missing view isolation, flat key dispatch, positional data, and unsafe state mutation from async work.
- The rewrite keeps the human in charge of architecture before asking the model to implement.
- Provenance
- Article · Supporting source
-
7
I keep tripping over true, false, true
Article AllThingsSmitty — Developer writing about JavaScript and TypeScript API readability
I’m not reading code anymore, I’m decoding it.
allthingssmitty.com/2026/05/11/i-keep-tripp… →Details
- Cited text
I’m not reading code anymore, I’m decoding it.
- Excerpt
- The post argues against positional boolean arguments like createUser(user, true, false) and recommends options objects or separate functions when a flag represents a different action.
- Context
- The small API example makes the maintenance theme concrete at the scale of a single function call.
- Key points
- Flag arguments are cheap to write but costly to read at the call site.
- A comment explaining positional booleans is evidence that the API is making the reader decode intent.
- Options objects keep names close to values and survive extra parameters better than positional booleans.
- Some booleans are hiding separate actions that deserve separate functions.
- Provenance
- Article · Supporting source
-
8
Two Roads to Durable Agents: Replay vs. Snapshot
Video Eric Allam, Trigger.dev — Founder at Trigger.dev speaking at AI Engineer
an agent isn’t like a transaction, it’s like a session
www.youtube.com/watch?v=svCnShDvgQg →Details
- Cited text
an agent isn’t like a transaction, it’s like a session
- Excerpt
- Allam contrasts replay-based durable execution with snapshot-and-restore for agent sessions, separating durable context logs from durable execution state.
- Context
- The talk turns agent reliability from a prompt discussion into a backend design problem: long-running agents need recoverable memory and recoverable machines.
- Key points
- Replay works well for workflows because every step can be journaled and retried deterministically.
- Agents strain replay logs because every model call, tool call, result, and turn keeps growing over long sessions.
- Allam separates agent state into context, an append-only log, and execution state, the machine that has files, packages, subprocesses, and servers.
- Trigger.dev moved from process checkpointing toward Firecracker microVM snapshots, with compressed snapshots around 14 MB and restores in hundreds of milliseconds.
- Provenance
- Video · Supporting source
-
9
Hierarchical Memory: Context Management in Agents
Video Sally-Ann Delucia, Arize — Head of product at Arize speaking at AI Engineer
context decides what the model sees, memory decides what survives
www.youtube.com/watch?v=esY99nYXxR4 →Details
- Cited text
context decides what the model sees, memory decides what survives
- Excerpt
- Delucia describes how Arize’s Alex agent hit context limits while analyzing trace data and moved from naive truncation and summarization to head-tail truncation, memory, long-session evals, and sub-agents.
- Context
- The talk gives implementation detail for a problem most agent products are starting to hit: context management is product behavior, not just token packing.
- Key points
- Arize built an agent that analyzes agent trace data, so the agent’s own material could grow until it broke the context window.
- Naive truncation broke follow-up reasoning, and generic summarization did not preserve the right details reliably.
- Their working system keeps head and tail slices, stores the middle, gives the agent a retrieval path, and tests long sessions by loading ten turns and evaluating the eleventh.
- Heavy search work moved into sub-agents so the main conversation could stay smaller.
- Provenance
- Video · Supporting source
-
10
You can't just one shot it
Video Mehedi Hassan, Granola — Product engineer at Granola speaking at AI Engineer
the answer isn’t to one-shot better
www.youtube.com/watch?v=ON5LIT0M4do →Details
- Cited text
the answer isn’t to one-shot better
- Excerpt
- Hassan explains why a meeting-notes chat feature needed tracing, internal tooling, preview links, and product feedback loops rather than a single generic model call.
- Context
- The talk connects model quality to the product loop around it: observability and fast previewing are how teams turn AI features from demos into usable software.
- Key points
- Adding web search can look like one line of code, but complex queries can raise token cost, expand context, and depend on provider behavior outside the app team’s control.
- Different users want different outputs from the same meeting data, so one generic prompt does not serve every role well.
- Granola built its own tracing tools to inspect tool calls, reasoning, search, cost, and outputs in a UI usable by product, data, support, and engineering.
- The team made the Electron app’s render process run as a web shell, so pull requests could get preview links and screenshot-based verification.
- Provenance
- Video · Supporting source
-
11
MLX Genmedia
Video Prince Canuma, Arcee — MLX contributor speaking at AI Engineer
you can build agents that can hear, see, and sound
www.youtube.com/watch?v=zTLJNHj0DeQ →Details
- Cited text
you can build agents that can hear, see, and sound
- Excerpt
- Canuma demonstrates MLX-based on-device vision, speech, video, and agent pipelines on Apple hardware, including real-time object detection, local multimodal models, and MLX audio demos.
- Context
- It gives the episode a capability beat: while enterprise AI is getting more embedded in institutions, local multimodal AI is also becoming more practical for individual builders.
- Key points
- MLX is framed as an Apple Silicon array framework for running AI locally on Mac, iPhone, and iPad.
- Canuma says the MLX ecosystem has more than 1.5 million downloads and over 4,000 ported models.
- The demos include real-time object detection, background blur, local multimodal image understanding, text-to-speech, speech-to-speech, and robotics experiments.
- He says Turbo Quant can reduce key-value cache memory roughly fourfold and enable very long local context depending on hardware and model size.
- Provenance
- Video · Supporting source
OpenAI turns deployment into a company
00:00:04 Before we get into Braid, quick note on IMPULSE, our other daily show. It tracks how AI is changing markets and institutions, along with labor, science, medicine, geopolitics, robotics, media, education, and the open-versus-closed model fight. If Braid is the builder’s table, IMPULSE is the wider room around it.
00:00:25 You can find it at the IMPULSE page on braid.opentangle.com, linked in the show notes. Okay. OpenAI announced the OpenAI Deployment Company today, and the announcement isn't subtle about what OpenAI thinks the next enterprise bottleneck is. The company says it's launching a majority-owned, OpenAI-controlled business that exists to help organizations build and deploy AI systems across their important work.
00:00:53 It has agreed to acquire Tomoro, an applied AI consulting and engineering firm, which gives the new company about 150 forward deployed engineers and deployment specialists on day one. It also has nineteen partners around it: investment firms, consultancies, and systems integrators.
00:01:12 TPG is leading the partnership. Advent, Bain Capital, and Brookfield are co-lead founding partners, and OpenAI says the company will launch with more than four billion dollars of initial investment. That is the factual part. OpenAI isn't just saying, here are better models, good luck wiring them into your procurement system and your field operation and your audit process.
00:01:37 It's saying, we are going to own a company whose job is to put engineers inside those organizations and help change the work. The official copy says these engineers will connect OpenAI models to customer data, tools, controls, and the business processes people use every day.
00:01:56 It also says they will work with business leaders, operators, and frontline teams. The goal is to identify where AI can make the biggest impact, redesign critical workflows around it, and turn those gains into durable systems. I think OpenAI is doing two things at once here.
00:02:14 First, it's trying to capture more of the enterprise value chain. That part is plain. If the model is only a component, then a lot of the money and a lot of the customer learning goes to the people who turn the component into something a business can use every day.
00:02:32 Second, OpenAI is admitting, in public, that deployment isn't a thin wrapper around a chat box. The expensive work is getting the model close enough to the actual business to help, while keeping it away from the places where a bad answer can hurt people or break a process.
00:02:51 There is a Palantir flavor here, because forward deployed engineering is a very particular style of business. You send technical people into messy organizations. They learn the local nouns. They build against real data and old processes. They turn demos into operating systems for a team.
00:03:10 That can be extremely powerful, and it can also make the vendor very sticky. If your AI workflows are designed by a company majority-owned by OpenAI, and those workflows are built to improve as OpenAI’s models and tools come online, you aren't buying a neutral adapter layer.
00:03:29 You are buying into a path. I don't mean that as a dunk. For a lot of companies, that may be exactly what they want. Most enterprise AI projects aren't blocked because nobody can write the prompt. They are blocked because the data is weird and the permissions are weird.
00:03:47 People don't trust the outputs yet, the legal team wants a record, and the team that owns the process often isn't the team that owns the model budget. A forward deployed engineer can help because the job is mostly translation between a capable system and a human organization that has twenty years of scar tissue.
00:04:08 The builder angle is simpler: model capability isn't the product by itself. The product includes the model. It also includes the deployment path, evals, permissions, user experience, and the maintenance story after everyone stops calling it a pilot. OpenAI’s announcement is a corporate move, sure.
00:04:29 Underneath that, it's also a technical claim: the frontier labs think the next serious contest is who can make intelligence usable inside ordinary work without losing control of the details.
Mythos finds one curl vulnerability
00:04:41 Daniel Stenberg, the lead developer of curl, published a useful post today about running Anthropic’s Mythos model against curl. The title is almost the whole attitude: Mythos finds a curl vulnerability. Singular. One. Back in April, Anthropic drew a lot of attention to Mythos and its security work.
00:05:04 Daniel was offered access through the Linux Foundation and Alpha Omega path for open source projects. Access didn't arrive directly, so eventually someone who had access ran the scan for him and sent the report. That report analyzed curl’s git repository, specifically the source and library directories.
00:05:26 The scan covered about 178 thousand lines of code. Daniel gives a lot of context here that matters. curl isn't a random weekend library. He says curl is about 176 thousand non-blank lines of C, 660 thousand words, installed in more than twenty billion instances, running on more than 110 operating systems and 28 CPU architectures.
00:05:50 He also says the project had already been scanned by several AI-powered tools, including AISLE, Zeropath, and OpenAI’s Codex Security. Those tools had already triggered somewhere between two and three hundred bug fixes across the last eight to ten months. So Mythos wasn't walking into untouched ground.
00:06:13 It was walking into a codebase that has years of fuzzing, conventional analyzers, careful compilers, human review, and AI review. A security team takes reports seriously too. The Mythos report claimed five confirmed security vulnerabilities. After Daniel and the curl security team worked through the list, five became one.
00:06:37 Three were false positives around documented API behavior. One was just a bug. One was a confirmed vulnerability, planned as a low-severity CVE for curl 8.21.0 in late June. The report also had about twenty non-security bugs that were described well and are being investigated.
00:06:57 That is a much more interesting outcome than either side of the argument usually wants. If you want the most dramatic possible AI-security story, one low-severity CVE in curl after all that hype falls short. Daniel’s personal conclusion is that the hype around this model was mostly marketing, at least from this one repository.
00:07:21 But if you want the anti-AI story, the post doesn't give you that either. Daniel says AI-powered analyzers are significantly better than old-style analyzers at finding security flaws and mistakes in code. He says modern AI models are good at this now, and projects that haven't scanned their source with AI-powered tooling will likely find a lot of flaws.
00:07:48 He's using them as part of the work. The line I keep coming back to is very plain: AI reviews are used in addition to human reviews. They help; they don't replace the humans. Then Rival Security added a second piece of context on a different Mythos security claim, and this one is about provenance.
00:08:10 Anthropic had discussed CVE-2026-4747, a FreeBSD network file system remote-code-execution issue, as part of the Mythos story. Rival went looking at the history and argued that the vulnerable FreeBSD function is closely related to old MIT Kerberos code that had a very similar bug, CVE-2007-3999, patched back in 2007.
00:08:33 The comparison in their write-up is direct. A fixed stack buffer receives the RPC header fields, then a credential body gets copied without a bounds check. The new FreeBSD fix and the old Kerberos patch even rhyme structurally. Rival’s conclusion isn't that the finding is useless.
00:08:54 Their point is that the discovery may be closer to combinatorial reuse of something represented in training data than to a wholly new bug class. I think that distinction matters, but not in the cheap way. If an AI system can rediscover a 2007 bug pattern in FreeBSD and then help exploit it, the provenance question doesn't make the operational problem disappear.
00:09:21 Attackers don't need philosophical novelty. They need a path from code to exploit. Defenders need a path from old code to fix. The curl post and the Rival post together put the Mythos story at a reasonable altitude. We aren't looking at magic vulnerability omniscience.
00:09:40 We're also not looking at a toy. These systems can scan huge amounts of code and explain candidates. They can find ordinary but dangerous patterns and produce reports maintainers can act on. They also overstate. They call things confirmed before the maintainers confirm them.
00:10:01 They rediscover old shapes. They need human review, and in the best cases they make that review more productive. For builders, the practical answer is process, not fear. If you maintain a serious codebase, AI security review is becoming normal toolchain work. You still need fuzzing and tests.
00:10:22 You still need static analysis, compiler warnings, and maintainers who know the code. But it's getting harder to justify never running an AI analyzer, because the people looking for bugs from the outside are absolutely going to run one.
The maintenance bill comes due
00:10:39 James Shore published an essay with a line that should be taped above every coding-agent adoption plan: your AI coding agent needs to reduce your maintenance costs. He isn't arguing that agents can't make you faster. He's arguing that speed by itself is the wrong number to optimize.
00:10:59 Every month of code you add creates future work. Bugs need fixes, dependencies need upgrades, design mistakes need repair, and the software has to stay alive after the feature lands. His model is intentionally simple. Imagine a team spends a month writing code, and in the following year that month of code costs ten days of maintenance, then five days every year after that.
00:11:25 The exact numbers matter less than the shape. As the codebase grows, more and more of the team’s time goes into keeping old work usable. In his example, after two and a half years, more than half the time is maintenance. After ten years, there's barely room for new work.
00:11:44 Now put a coding agent into that model. If the agent doubles your output, but the code costs the same to maintain per unit, you have doubled the future maintenance work. If the code is harder to understand, or the team stops reading diffs carefully because there are too many of them, the bill gets worse.
00:12:05 Shore’s cleanest claim is arithmetic: if you produce twice as much code, the code needs to cost half as much to maintain. If you produce three times as much, it needs to cost one third as much. Otherwise the speedup decays, and then it turns on you. That could sound abstract, except the k10s devlog gives the concrete version.
00:12:28 The author built a GPU-aware Kubernetes terminal UI across seven months, 234 commits, and around 30 weekends, heavily using Claude. The project started with a clear niche: something like k9s for people running NVIDIA clusters, with GPU utilization and DCGM metrics in a purpose-built interface.
00:12:48 It also tracked idle nodes, temperature, power draw, and memory. Early on, it felt amazing. It gained pod and node views, then deployments and services. It also gained a command palette, live updates, and Vim keybindings. Then the fleet view arrived, and it looked good.
00:13:07 After that, switching back to pods broke. Live updates stopped. Tables showed stale data. Tab counts went wrong. The author finally read the code. The central model file was 1,690 lines. One state object held third-party UI widgets and the Kubernetes client. It also held current resources and watchers.
00:13:28 Logs, describe views, navigation history, mouse handling, fleet state, cached rows, raw objects, layout details, and a lot more lived there too. The update function was about 500 lines with more than a hundred switch branches. The fleet view had been special-cased inside the generic resource-loading path.
00:13:49 View-specific state was being cleared by manual nil assignments scattered through the file. Key handling depended on the current resource string. The same key meant autoscroll in one view and shell into a pod in another. Structured Kubernetes data was flattened into positional string arrays, so the meaning of column three depended on remembering the column order somewhere else.
00:14:16 One async handler even mutated render-visible state from a background command. The writer’s short version is sharp: AI writes features, not architecture. I would soften that slightly. A good model can implement architecture if the architecture exists and is enforced.
00:14:34 It can follow ownership rules, respect typed messages, and split view state if the project already has that pattern. But left to the shortest path, it will often satisfy the prompt in the nearest place. The code gains another branch, another field, another string comparison, and another parameter.
00:14:55 The code compiles, the feature works, and the cost moves into the next change. Shore’s maintenance math gives me a better test. The first week probably got faster. Seven months later, is the system cheaper to change than it would have been without the agent? In k10s, the answer was no, and the author archived the Go version and started rewriting in Rust, with the human doing the architecture first.
00:15:23 The small companion piece today was AllThingsSmitty writing about positional booleans, the classic true, false, true problem. A call that passes a user, then true, then false technically works. TypeScript can tell you the values are booleans. It can't tell the reader whether true means admin, whether email should be sent, whether validation should be skipped, or whether something else is going on without a jump to the function definition.
00:15:54 The author puts it simply: he isn't reading code anymore; he's decoding it. That is the same maintenance story at the scale of one call site. Options objects aren't a grand architecture. Separate functions for separate actions won't win a design award. They are the little choices that decide whether future-you can move quickly without a model re-reading the whole codebase on your behalf.
00:16:21 The speedup only counts if the system you inherit is cheaper to understand, cheaper to test, and cheaper to change.
Durable agents need a place to wake up
00:16:29 The next cluster of items is about what happens when the agent isn't just writing a file and exiting. Eric Allam from Trigger.dev gave a talk called Two Roads to Durable Agents: Replay vs. Snapshot, and it starts from a backend history most web engineers know in their bones.
00:16:49 For decades, the dominant shape was stateless compute. A request arrives, code runs, state lives in the database, and the process doesn't need to remember much. Then durable workflow engines arrived for multi-step side effects. Charge the card, send the receipt, resize the image.
00:17:09 If something fails halfway through, you don't want to charge the card twice, so every step gets recorded and replay can skip the completed work. That replay model works well for workflows. Allam’s point is that agents don't fit it cleanly forever. Once the large language model is orchestrating tools, every model call becomes part of a growing journal.
00:17:34 Tool calls, tool results, and assistant responses pile into it too. A single turn is fine. Many turns start to stretch the log. A session that lasts hours, or eventually days, is different from a transaction. His line is that an agent isn't a transaction; it's a session.
00:17:53 It lasts as long as the user wants it to last. The useful split in the talk is context state versus execution state. Context is the append-only record of what the model saw and said: system messages, user messages, tool calls, tool results, and assistant responses.
00:18:12 That can live in a database or object storage. Another durable store works too. Execution state is the machine the agent is using. It may have cloned a repo and installed packages. It may have started a dev server, opened files, created subprocesses, and built up local state that can't be reconstructed just by reading a chat log.
00:18:36 For that half, Allam argues for snapshot and restore. Save the machine, shut it down, and restore it when the next user turn arrives or when a retry window opens. The numbers in the talk are useful because they pull this out of vague architecture talk. Trigger.dev previously used process checkpointing, then moved toward Firecracker micro VM snapshots.
00:19:01 A naive snapshot of a 512 megabyte machine is 512 megabytes on disk. They compress and layer the snapshot, and he says they can get it down to around 14 megabytes compressed, with snapshots under a second and restores in the hundreds of milliseconds. He also says they are doing around 15 thousand VM starts per minute in a benchmark.
00:19:25 I wouldn't build a roadmap off conference-demo numbers alone, but the design direction is clear: long-running agents are pushing backend teams toward recoverable machines, not only recoverable logs. Sally-Ann Delucia from Arize hit the same class of problem from the context side.
00:19:45 Their agent, Alex, analyzes trace data from AI applications. That means it's often analyzing the very kind of material that makes context grow: prompts and metadata. It also sees spans, conversation history, tool calls, and model outputs. They got stuck in a loop where Alex would run on trace and span data.
00:20:07 The spans would grow, the context limit would hit, the agent would fail, and the failed attempt would add even more data. They tried the obvious moves. Naive truncation, taking the beginning and dropping the rest, made follow-up questions feel like new conversations.
00:20:26 Summarization sounded sensible but was too inconsistent, because the model couldn't reliably decide which details to preserve for later reasoning. Their working approach is more specific. They keep the head and tail, store the middle, keep the latest tool results, avoid resetting the system prompt, and let the agent retrieve stored context when it needs it.
00:20:52 Delucia phrases it nicely: context decides what the model sees, memory decides what survives. They also added long-session evals. Load ten turns, test the eleventh, and see whether the context strategy still works when the conversation has had time to decay. I like that because late-session breakage is the kind of bug you miss when you only test the happy first turn.
00:21:18 A chat that feels smart for five minutes can become confused after twenty turns, and your users usually won't restart just because your context manager wants a clean slate. The Arize team also moved heavy work into sub-agents. Search over hundreds of spans doesn't need to live inside the main conversation.
00:21:40 The main agent can keep the user thread lighter, delegate data-heavy work, and receive a result back. That isn't a magic pattern, but it's a practical one. If everything lives in one conversation, every task pays for every other task’s intermediate mess. Put Trigger.dev and Arize together and you get a clean builder lesson.
00:22:03 Durable agents need two forms of memory. They need a durable account of what has been said and done, and they need a recoverable execution environment for the work that can't be reduced to text. Save only the chat, and you lose the machine. Save only the machine, and you lose the reason the model was doing the work in the first place.
00:22:27 The engineering problem is keeping those two histories aligned without making every turn more expensive than the work it's trying to perform.
The product loop around the model
00:22:38 Mehedi Hassan from Granola gave the product version of the same lesson in a talk called You can’t just one shot it. The setup is a meeting-notes product with chat. You can ask questions about a meeting, across meetings, and across shared context. That sounds like the kind of feature the model vendors make look easy: add the web search tool, call the model, ship the chat box.
00:23:03 Hassan’s examples are the problems you hit immediately after that. Web search looks like one line of code, but complex queries can blow up context and token cost. He mentions each chat costing around ten pence in some cases, which doesn't sound like much until you have a large user base and people ask messy questions all day.
00:23:26 Provider behavior can also change underneath you. He says they had been using a model for a while, then an overnight update degraded web search and the team had no direct control except switching providers. Output shape is another problem. A salesperson may want meeting notes organized around the deal.
00:23:46 An engineer may want action items, blockers, and Linear tickets. HR may want something else entirely. A single generic prompt isn't going to serve all of those people equally well. So Granola started building its own tracing tools. Not as a purity exercise, but because the team needed to inspect tool calls and search behavior.
00:24:09 They also needed reasoning, costs, and outputs in a shape that product, data, customer support, and engineering could all use. Hassan says their founder goes through the agent loop front to back to see what went wrong. That is a very different posture from treating the model as a sealed box and arguing with it through prompt adjectives.
00:24:32 The other nice detail is their testing loop. Granola is an Electron app, which means local product testing has more friction than a normal web app. They split the render process so it can run as a web shell, then put preview links on pull requests. Now people can test variants without building the whole desktop app locally, and Cursor can test changes and upload screenshots into the pull request.
00:24:59 Every desktop app doesn't need to copy that exact pattern. But AI makes variants cheaper, so teams need a faster way to feel and verify those variants. Otherwise you have cheap generation and expensive judgment. AllThingsSmitty’s true, false, true essay fits here too.
00:25:17 It's a tiny post compared with the enterprise announcements and security claims, but the craft lesson matches them. You can create an API that is fast to write and slow to read. An AI feature can be fast to demo and slow to debug. An agent loop can work on turn one and forget the user’s intent on turn twenty.
00:25:38 The small local cost is the same kind of cost as the big system cost. Someone has to interpret what happened. Someone has to know which value means admin, which tool call blew up the context, which provider update changed behavior, and which view owns the state.
00:25:56 So I am getting more allergic to AI product demos that skip the inspection layer. Show me the trace and the eval that catches turn eleven. Show me the preview link, the user role that changes the output, and the place the team looks when the answer feels wrong.
00:26:14 A model can give you a surprising amount, but if the product has no way to see itself, the team ends up debugging through vibes. Granola’s closing line says the answer isn’t to one-shot better. That sentence fits today. Better one-shotting is nice. I will take it.
00:26:32 But the builder advantage is in the loop around the shot. That means the typed interface, the trace, the preview, and the long-session eval. It also means the handoff between context and execution, plus the human taste that decides when plausible code is becoming expensive code.
Local capability keeps getting stranger
00:26:51 The last item is a different kind of capability beat. Prince Canuma from Arcee gave an AI Engineer talk on MLX Genmedia, and the simplest description is: Apple Silicon keeps getting more interesting for local multimodal AI. MLX is an array framework for Apple Silicon.
00:27:09 It gives builders a way to run model work close to the metal on Macs, iPhones, and iPads. Canuma says the ecosystem has more than 1.5 million downloads and over 4,000 ported models, with day-zero support for open models from major labs. The talk is personal at the start.
00:27:28 He talks about wanting to help his father read and navigate after losing his sight, and then connects that to on-device vision and audio. The demos move quickly. Object detection runs in real time on a Mac. Background blur works locally. A local vision-language model can describe an image.
00:27:47 The audio examples include text-to-speech, speech-to-speech, and a modular pipeline where you pick the speech recognizer, language model, and voice model. A small robotics setup uses local vision and audio too. There is also an on-device video generation example, where a user chained generations together to make a more coherent story on a MacBook with 16 gigabytes of video memory.
00:28:14 A couple of caveats. Conference demos are demos. The talk itself has a moment where a demo stumbles, which I appreciated because local AI work still has drivers and model files. Memory pressure and practical weirdness are part of the work too. Also, Canuma is clear that you shouldn't expect a small local open model to behave like the strongest cloud model today.
00:28:38 The local machine is becoming a serious creative surface again, even without parity on every task. The technical detail that stood out was Turbo Quant. Canuma says it can reduce key-value cache memory by about four times, and he connects that to very long local context, depending on model and hardware.
00:28:59 First reference: key-value cache is the memory models use to keep attention state around while generating. If you can shrink that without wrecking output quality, local long-context work becomes less ridiculous. That matters for the same reason the Trigger.dev snapshot talk matters.
00:29:19 Agent systems are hungry for state. They need context and files. They need tools, audio, images, and sometimes a running machine. Some of that will live in the cloud. Some of it's going to live on the laptop in front of you. I like ending there because it keeps the day from becoming only an enterprise deployment story or only a maintenance warning.
00:29:42 OpenAI is building a company to embed engineers into large organizations. Security models are finding bugs and overclaiming some of them, while still changing the defender’s toolkit. Builders are learning, again, that generated code has to be owned after it lands.
00:30:00 Backend teams are building memory and recovery for agents that run longer than a request. Product teams are building traces and previews because AI features need observation as much as they need prompts. Local toolmakers are making the machine on your desk more capable than it was last month.
00:30:20 The line I’m carrying into Tuesday is simple: capability is exciting, but the code you keep decides whether the excitement survives. Lenar Kess.