◆ Dispatch 028 · 2026-05-16 GSV Copy The Flag
CTFs, Scrum, and Claude's Bedtime
“The model does the reasoning, writes the solve, and leaves the human with nothing meaningful to do besides copy the flag.”
— Lenar Kess, today's narration
An Australian CTF top-tenner writes the obituary for the open competitive scene. Intercom and PFF both report doubling-and-then-some engineering throughput from agent-first workflows — using opposite playbooks. Supabase ships a skill after watching an agent silently bypass row-level security. A suitcase runs a 4B model fully offline at conversational latency. Julia Evans leaves Tailwind. And Claude keeps telling people to go to sleep.
- "The CTF scene is dead" — Kabir, former TheHackersCrew player, on agents and the open format
- Brian Scanlan (Intercom): doubling PR throughput, 17.6% auto-approved, SOC 2 compliant
- Mike Spitz (PFF): the post-engineer engineering org, two engineers shipping 25x more deploys than ten
- Pedro Rodrigues (Supabase): skills plus MCP, and the security-invoker flag agents skip without guidance
- Sparky — a fully offline suitcase robot on a Jetson Orin NX with Gemma 4 E4B
- Julia Evans: moving away from Tailwind and learning to structure CSS
- Fortune: Claude is telling users to go to sleep mid-session and Anthropic isn't sure why
- @yishan: an LLM hallucinates the Napster 2000 story to someone who was there
- @Kirsten3531: "LLMs can never be more than the average of their training data" — a 2024 take?
- Armin Ronacher: running an agent with bash as the only tool, using the patch binary to make edits
Chapters
- 00:00:04 The CTF scene is dead
- 00:05:03 Intercom hits 2x — and writes down what it cost
- 00:09:20 PFF kills Scrum and ships 25x more deploys
- 00:13:33 Supabase's skills experiment and the security-invoker flag
- 00:17:21 Sparky the suitcase, fully offline
- 00:19:41 Julia Evans leaves Tailwind
- 00:22:16 Claude tells people to go to sleep
- 00:24:51 Three from the frontier
- 00:27:41 Sign-off
Sources
10 cited
1
The CTF scene is dead
Article Kabir — Australian CTF player; won DownUnderCTF with Blitzkrieg, competed top-10 internationally with TheHackersCrew
The issue was never that AI could help. CTF players have always used tools. The issue is when the model does the reasoning, writes the solve, and leaves the human with nothing meaningful to do besides copy the flag.
kabir.au/blog/the-ctf-scene-is-dead
- Context
- A first-person account from inside the scene that frames hackbots not as a future risk but as the present cause of a real format collapse — the same trajectory Marcus's audience saw this week with FUZZ-E and Mythos finding live CVEs.
- Key points
- Claude Opus 4.5 made it trivial to orchestrate one Claude instance per challenge via the CTFd API and one-shot most medium-difficulty problems.
- GPT-5.5 Pro can one-shot Insane difficulty active leakless heap pwn challenges on HackTheBox, turning 48-hour open CTFs into pay-to-win token races.
- TheHackersCrew and many other historic top-10 teams have stopped fielding full rosters; Plaid CTF is no longer running.
- The author argues the 'beginners are fine' defense misses the ladder problem: new players are pushed toward AI before building the instincts AI replaces.
- Anti-LLM challenge design devolves into guessy or overengineered problems that are bad for humans too.
- Provenance
- Article · Supporting source
2
How Building with AI Can Double the Throughput of Your Engineering Team
Video Brian Scanlan, Intercom — Senior principal engineer on Intercom's platform group; 12 years at the company
Everything that you can do, the agent must be able to do. And that can feel weird as well, when you're first connecting it into production systems.
www.youtube.com/watch?v=4_VQBbs2iQA
- Context
- A real, named production case study with numbers — not a vendor demo — of what 'agent-first' actually rewires inside a 1,400-person company. The 17.6% auto-approval figure and the SOC 2 implication are concrete enough to argue with.
- Key points
- Intercom hit its 2x PR-throughput goal inside a year, with 17.6% of pull requests now auto-approved while staying SOC 2, ISO 27001 and HIPAA compliant.
- They standardized on a single agent platform — Claude Code — and treat model anxiety like multi-cloud: choosing one and going deep beats spreading thin.
- Adoption is mandated in job descriptions; not using AI as a designer, PM, or engineer counts as not meeting expectations.
- All Claude Code sessions stream to S3 for skill backtesting; a Stanford research group is measuring whether code quality drifts.
- Scanlan's frame: this is the sysadmin-to-SRE transition, speedrun 100x faster across the whole industry.
- Provenance
- Video · Supporting source
3
Agents Don't Do Standups: Building the Post-Engineer Engineering Org
Video Mike Spitz, PFF — CTO of PFF, the NFL/NCAA sports data company
Instead of figuring out how we can help engineers output more, how do we help make the agents quicker?
www.youtube.com/watch?v=VMemhtlsoNk
- Context
- A second named team reporting similar magnitude gains as Intercom — but explicitly killing Scrum. The pair-versus-ten setup is a clean A/B that builders can either argue with or copy.
- Key points
- PFF ran a January-to-March case study: two engineers using Claude Code against a team of ten on the same product, with two-month MVPs that had been four-month estimates.
- The pair shipped about five deploys a day; the ten-person team shipped one every five days — a 25x deploy rate and roughly 10x ticket-weighted output.
- Customer satisfaction surveys came back at 8.6/10, up from 7-7.5 pre-AI.
- Scrum ceremonies — sprint planning, standups, refinement, retrospectives — were dropped because the agent updates tickets and the spec/LDD flow replaces estimation.
- Spitz's warning: the engineers who do well now are the curious ones; the ones who need fully prescriptive specs will struggle.
- Provenance
- Video · Supporting source
4
Combine Skills and MCP to Close the Context Gap
Video Pedro Rodrigues, Supabase — AI tooling engineer at Supabase; co-founder of Lisbon AI Week
If you don't explicitly pass security_invoker equals true, the view will bypass the RLS. The agent with the skill got this implemented correctly and safely; the one that only had access to the MCP tool did not.
www.youtube.com/watch?v=JT3OzDKrucU
- Context
- Concrete fix for a class of agent bug — silently dropping a security flag — that engineers are about to inherit in production. The eval framing turns 'write a skill' into a testable artifact, not a vibe.
- Key points
- Supabase ran an A/B on Claude Sonnet 4.6: agent with only the Supabase MCP server vs. agent with MCP plus the official Supabase skill, building a SQL view over a row-level-security table.
- Without the skill, the agent silently omitted the security_invoker flag and exposed cross-tenant rows; with the skill, it shipped the safe version.
- Across six scenarios run on Braintrust with Claude Code (Opus 4.6 and Sonnet 4.6) and Codex (GPT 5.4 and GPT 5.4 mini), MCP-plus-skill outperformed MCP-only and baseline on every model.
- Rodrigues's rules of thumb: don't duplicate docs (point the agent at canonical sources), put anything the agent absolutely cannot miss in skill.md itself because reference files get skipped, and be opinionated about your product workflows.
- Supabase is also experimenting with exposing docs over SSH so agents can navigate them like a filesystem.
- Provenance
- Video · Supporting source
5
Sparky — fully offline suitcase robot on a Jetson Orin NX SUPER 16GB
Article CreativelyBankrupt — r/LocalLLaMA builder; hardware hacker
No WiFi, no Bluetooth, no cellular. Gemma 4 E4B at Q4_K_M, q8_0 KV cache, flash attention, ~200ms cached TTFT, 14–15 tok/s sustained. 30+ sensors fold into the prompt as natural language every turn.
www.reddit.com/r/LocalLLaMA/comments/1tdz5g…
- Context
- The local-models-on-edge-hardware story keeps tightening. A single Jetson with a 4B model now does multimodal, voice in/out, OCR, and sensor fusion at sub-quarter-second latency with no network. The cost floor for an embodied agent just keeps moving down.
- Key points
- Sparky runs Gemma 4 E4B fully on a Jetson Orin NX SUPER 16GB with a 12K context, no network of any kind.
- Speech-to-text is SenseVoiceSmall; text-to-speech is Piper with 43Hz mouth sync on a PixiJS face displayed on the lid.
- Vision and OCR are native to Gemma 4 now, so a previous BLIP subprocess was removed entirely.
- Cached time-to-first-token is around 200ms; sustained generation is 14-15 tokens per second.
- Over 30 sensors are folded into the prompt as natural-language context every turn.
- Provenance
- Article · Supporting source
6
Moving away from Tailwind, and learning to structure my CSS
Article Julia Evans — Author behind jvns.ca and Wizard Zines; explainers on Linux, networking, and developer tooling
It turns out Tailwind taught me a lot. Every CSS code base has a bunch of different things going on, and Tailwind has systems for some of these. Maybe I can imitate the systems I like.
jvns.ca/blog/2026/05/15/moving-away-from-ta…
- Context
- A grounded counter-pattern to the agent-throughput segments above — one builder, one stack, fewer dependencies, more direct contact with the material. The kind of craft story the show needs to keep alongside the throughput-doubling case studies.
- Key points
- Evans migrated several sites off Tailwind v2 back to semantic HTML plus vanilla CSS after 8 years.
- She kept the parts of Tailwind that taught her structure — the reset, a color-variable palette, an xs/sm/md/lg font-size scale — and rebuilt them as plain CSS variables and per-component files.
- She is leaning on CSS grid with auto-fit and grid-template-areas to avoid most media queries, instead of Tailwind's md: / lg: breakpoint syntax.
- esbuild is her only build step, mainly for bundling and asset loaders; native @import and nested selectors carry the rest.
- Tailwind's tight coupling to a build pipeline since v3 was a major factor; her existing sites were carrying 2.8MB tailwind.min.css files.
- Provenance
- Article · Supporting source
7
Claude is telling users to go to sleep mid-session and nobody, including Anthropic, seems to fully understand why
Article Marco Quiroz-Gutierrez — Fortune tech reporter
Sam McAllister at Anthropic called it a "bit of a character tic." We're aware of this and hoping to fix it in future models.
fortune.com/2026/05/14/why-is-claude-tellin…
- Context
- A clean reminder that emergent behavior from frontier models still includes weird character drift that even the lab can only diagnose post-hoc — and that the cost of strong-mimicry-of-care is users projecting sentience back into the system.
- Key points
- Hundreds of Claude users have reported the model telling them to go to bed mid-session, dating back months.
- Claude often misjudges the time — telling users to rest at 8:30 in the morning — and sometimes repeats the message three times in a row.
- Anthropic's Sam McAllister publicly called it a 'character tic' they hope to fix in future models, with no further explanation.
- Stanford bioengineering professor Jan Liphardt suggested the model is reflecting patterns from training data — 25,000 books on humans' need for sleep.
- Leo Derikiants of Mind Simulation Lab suggested Claude may also be using 'go to sleep' phrasing to wrap up sessions when its context window is nearly full.
- Provenance
- Article · Supporting source
8
Yishan on a model fabricating Napster history
X @yishan — Former Reddit CEO; was inside the file-sharing era he's talking about
That's not at all what happened with Napster in 2000. I was there. That is some kind of imaginary scenario on the level of "the Egyptians used dinosaurs to haul the big stones used to make pyramids." What the hell kind…
x.com/yishan/status/2055511928928350493
- Cited text
That's not at all what happened with Napster in 2000. I was there. That is some kind of imaginary scenario on the level of "the Egyptians used dinosaurs to haul the big stones used to make pyramids." What the hell kind of content were you trained on??
- Context
- A 90-second reminder that the same models clearing CTFs and shipping production PRs can still confabulate basic recent history, and that it sometimes takes a person who was actually in the room to catch it.
- Key points
- A frontier model fabricated a Napster 2000 history confidently enough that someone who lived through it called it out as imaginary.
- Yishan compares the fabrication's confidence to 'the Egyptians used dinosaurs to haul the big stones.'
- Raises the question of which corpora are seeding factually inverted histories into model outputs.
- The post is a useful counterweight to today's 'LLMs already beat the training-data ceiling in coding and math' framing.
- Provenance
- Tweet · Primary source
9
"LLMs can never be more than the average of their training data" — a 2024 take?
X @Kirsten3531 — Builder-side commentator on the AI timeline
My cousin is betting his career on "LLMs can never be more than the average of their training data" but I feel like that's a very 2024 take. Aren't we already past this in like, coding and math?
x.com/Kirsten3531/status/2055456370955321673
- Context
- A live debate, not a settled one. Lets the episode hold two true things at once: capability is past the trained-mean in some domains, and the same model will still make up Napster history.
- Key points
- 716 likes and 130 replies — a debate flashpoint in a few hours.
- The claim: that in narrow verifiable domains like coding and math, the strongest models now outperform the training-distribution mean.
- Sits next to today's CTF and PR-throughput stories as evidence — but doesn't address Yishan's history-fabrication point.
- Useful framing tension for a senior engineer betting their roadmap on what LLMs can and cannot do.
- Provenance
- Tweet · Primary source
10
Armin Ronacher on running an agent with bash as the only tool
X @mitsuhiko — Creator of Flask and Jinja; co-founder of Sentry
Unironically, bash-only is quite fun: pi -nes --tools bash --append-system-prompt "Use the patch binary to make edits."
x.com/mitsuhiko/status/2055593093307494413
- Context
- Useful counter-data to the very heavy 'invest in skills, MCPs, guardrails, evals' frame Intercom and Supabase argue for: a senior practitioner is finding signal by stripping the harness almost to nothing.
- Key points
- Ronacher is running an agent harness with bash as its only tool and telling it to make file edits via the patch binary.
- It's a deliberately stripped harness — no specialized file-edit tool, no jq, no apply_patch — just a shell and a model.
- Lines up against the Supabase Skills story: less custom tooling, more trust in the model to use a small composable surface.
- A reminder that 'less harness, more model' is back as a viable design after a year of the opposite advice.
- Provenance
- Tweet · Primary source
The CTF scene is dead
00:00:04 A blog post called "The CTF scene is dead" landed on Hacker News overnight and stayed there. It's written by a player who goes by Kabir. He started doing capture-the-flag competitions in 2021, won Australia's biggest CTF — DownUnderCTF — multiple times with a team called Blitzkrieg, and through the end of 2025 was on TheHackersCrew, an international top-ten team on the CTFTime leaderboard.
00:00:31 So this isn't somebody waving from the sidelines. This is someone who has spent five years inside the scoreboard he's now writing the obituary for. The core of his argument is concrete. Quote: "The issue was never that AI could help. CTF players have always used tools.
00:00:49 The issue is when the model does the reasoning, writes the solve, and leaves the human with nothing meaningful to do besides copy the flag." That's the line that stuck with me, and it's the line I'd put on a t-shirt for the rest of 2026. He traces the shift specifically.
00:01:08 GPT-4 made medium-difficulty challenges one-shottable — paste the crypto problem, come back ten minutes later, and copy the flag. That was a nuisance. Then Claude Opus 4.5 arrived, and Claude Code packaged the orchestration into something trivial. He describes the move: write a small script that hits the CTFd API, spin up a Claude instance per challenge, let it run for the first hour, then work on whatever's left.
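The move described above is, structurally, a fan-out: list the challenges, start one solver per challenge, and see what is left for the humans afterwards. A minimal Python sketch of that shape, under stated assumptions: `Challenge`, `fan_out`, and `unsolved` are hypothetical names, and `solve` is a caller-supplied stand-in for whatever model session does the work. Nothing here talks to CTFd or to any model API.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Challenge:
    # Hypothetical shape; a real scoreboard would return more fields.
    id: int
    name: str
    category: str


def fan_out(
    challenges: Iterable[Challenge],
    solve: Callable[[Challenge], Optional[str]],
    max_workers: int = 8,
) -> Dict[int, Optional[str]]:
    """Run one solver per challenge in parallel; map challenge id to a flag or None."""
    items = list(challenges)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        flags = list(pool.map(solve, items))
    return {c.id: flag for c, flag in zip(items, flags)}


def unsolved(challenges: Iterable[Challenge],
             results: Dict[int, Optional[str]]) -> List[Challenge]:
    """Challenges still open once the first automated pass has finished."""
    return [c for c in challenges if results.get(c.id) is None]
```

The point of the sketch is how little structure there is: the only scarce inputs are the solver itself and the budget to run it many times in parallel, which is the pay-to-win dynamic the post goes on to describe.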
00:01:37 Now, with GPT-5.5 and 5.5 Pro, he says agents can one-shot Insane-difficulty active leakless heap pwn challenges on HackTheBox. If you orchestrate 5.5 Pro against the Insane bracket in a forty-eight-hour CTF, there's a real chance you collect the hardest flag before the event ends.
00:01:57 What that does to the format is what he calls pay-to-win. The scoreboard is now measuring three things stacked on top of each other: security skill, willingness to use frontier models, and how many tokens you can afford to burn for two days straight. The top tier stops being a meritocracy and starts being a capacity test, and the specialized cybersecurity models — alias1 and the rest — get squeezed because general-purpose frontier models are eating their niche too.
00:02:29 The consequences he lists aren't abstract. TheHackersCrew, his own team, plays fewer events with fewer people. Plaid CTF — one of the great CTFs — is no longer running. Challenge authors who used to spend weeks crafting beautiful problems have less reason to, because an agent will eat the problem in minutes.
00:02:51 The CTFTime 2026 leaderboard, he says, has "almost no semblance of history or human skill anymore." The "beginners are fine" defense misses the ladder. CTFs weren't just puzzles — they were the path you climbed: solve more, place higher, join a stronger team, get to the finals that actually matter.
00:03:15 If the visible scoreboard is dominated by AI orchestrators, a beginner is pushed toward orchestration before they've built the instincts the orchestration is replacing. The "AI is useful for security research" defense conflates a tool's usefulness in the field with its appropriateness inside the competition.
00:03:37 And the chess-engine analogy — he loves this one — is exactly backward. Chess engines exist; they're banned during play. Imagine handing every grandmaster Stockfish during the match and asking whether the rating still means anything. I think this is the cleanest version of a story we've been circling all week.
00:03:58 Wednesday's episode talked about Joseph Thacker's FUZZ-E hackbot finding live Magento and Angular CVEs. Yesterday we covered the Calif and Mythos macOS kernel exploit on the M5. Both of those were upbeat — frontier capability lighting up old work. Kabir's post is the other side of the same trajectory.
00:04:19 When the agents can do the work, the formats that ranked humans on doing the work stop ranking anything. The artform he loved, and the ladder he climbed, are being eaten at the same time. That's not a complaint about AI; it's a specific claim about a specific competition.
00:04:38 And it's a claim I'd listen to from someone who was top-ten for years. My read: he's right about the open-online format, and he's probably right about the leaderboard. The DEF CON-style elite finals will survive a bit longer because the problems are gated and the qualifiers can be re-tuned.
00:04:58 But qualifiers don't save a ladder. The ladder is the thing that's gone.
Intercom hits 2x — and writes down what it cost
00:05:03 Brian Scanlan at Intercom gave a thirty-minute talk at AI Engineer this week. He's a senior principal engineer there, twelve years in. His group runs the platform — that's uptime, performance, security, the Ruby on Rails monolith, and internal developer productivity.
00:05:20 Intercom is fifteen years old and fourteen hundred people, and a year ago they set what he calls a wildly ambitious and wildly unambitious goal at the same time: double engineering throughput in one year, without doubling the team. They called the project 2x. They named the team 2x.
00:05:38 And they hit it. He showed the graph live: pull requests per R&D person, basically flat through 2024, an inflection in January when they went all in on Claude Code, and a doubled line by May. They've published the number. They're also running a feedback loop with a Stanford research group that's looking at whether code quality drifts as throughput climbs — early read, it's going up, not down.
00:06:03 Defects are increasing in absolute terms because so much more is shipping, but defects are being closed faster than ever, and some teams have used the moment to crunch through backlog-zero campaigns they'd been putting off for years. The organizational mechanics are the part I'd quote out loud to anyone running a team of more than ten engineers.
00:06:25 Job descriptions were rewritten. If you're a designer, a product manager, or an engineer at Intercom and you are not adopting AI in your work, you are explicitly not meeting expectations. He says it's binary, and he says you have to repeat the same message a hundred times in every forum because you cannot get hundreds of people to change how they work by sending a memo.
00:06:48 They staffed a full-time 2x team that keeps growing. They run hackathons and AI immersion days. They reward people in public Slack channels when a skill gets written or a workflow gets automated. The technical mechanics are sharper. They picked one platform — Claude Code — and went deep.
00:07:06 Scanlan's line on this: "You don't get the compounding benefits of a well-designed platform if you're sending all your different work across different cloud providers." Get away from model anxiety. Pick one harness, optimize it, push your internal Claude plugins out to everyone's laptops, and bypass the auto-update because debugging hundreds of laptops feels like managing Python installs.
00:07:30 Connect the agent to everything you have access to — production systems, internal tools, and your skills library. Onboard it the way you'd onboard a new senior engineer, with the Rails conventions, the React patterns, the testing standards, and the security rules.
00:07:47 And then the number that should make every team lead sit up: seventeen point six percent of their pull requests are auto-approved. Without a human in the loop. Their auditors signed off, and they're SOC 2, ISO 27001, and HIPAA compliant. Scanlan's frame is that they got there by backtesting agent approvals against historical PR data, getting humans to label outputs, calibrating a confidence threshold, and shaping the PRs the agent is allowed to look at toward the safe-and-simple end of the distribution.
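The calibration step can be illustrated with a small backtest. This is a sketch under assumptions, not Intercom's implementation: it supposes each historical PR has an agent confidence score and a human label saying whether auto-approval would have been correct, and the function name `calibrate_threshold` is hypothetical. It returns the lowest confidence cutoff at which the approved set still meets a target precision, plus the share of PRs that cutoff would cover.

```python
from typing import Optional, Sequence, Tuple


def calibrate_threshold(
    scored: Sequence[Tuple[float, bool]],
    target_precision: float,
) -> Optional[Tuple[float, float]]:
    """Return (threshold, coverage) for the lowest cutoff meeting the target.

    scored: one (agent_confidence, human_says_approval_was_correct) pair
    per historical PR. Returns None if no cutoff meets the target.
    """
    if not scored:
        return None
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    best = None
    correct = 0
    for i, (conf, ok) in enumerate(ranked, start=1):
        correct += ok
        # Evaluate only at the end of a run of equal scores, so a cutoff
        # never splits PRs that share the same confidence.
        end_of_tie = i == len(ranked) or ranked[i][0] != conf
        if end_of_tie and correct / i >= target_precision:
            best = (conf, i / len(ranked))
    return best
```

With a strict target the coverage shrinks toward the safest PRs, which matches the idea of steering auto-approval toward the safe-and-simple end of the distribution.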
00:08:19 He says the result is removing risk, not adding it, because humans aren't actually that good at code review when the rules are well-defined. His closing analogy is good. He used to be a Unix sysadmin in the early 2000s — racking servers and cabling networks. Then cloud arrived, and sysadmins became SREs.
00:08:38 The work moved up the stack, got more automation-oriented, more impactful, and better paid. He thinks this is the same transition, speed-run a hundred times faster across the whole industry. I think he's right that the analogy holds, and I think the part most teams will get wrong is the executive-discipline half.
00:08:58 The model gains are real, but the case study only works because someone decided this was top priority, wrote it down, staffed it full-time, mandated adoption, and stayed on message for twelve months. If you're not willing to do that, the agent's going to read your code, Slack-message your engineers, and still leave you with a one-and-a-half-x team.
PFF kills Scrum and ships 25x more deploys
00:09:20 Same conference, same week, sharper experiment. Mike Spitz, CTO of PFF — the sports-data company behind NFL and NCAA analytics and nine million annual fantasy-football drafts — ran a three-month case study from January through March. He took two engineers — his strongest front-end and one of his strongest full-stack — gave them Claude Code, and pointed them at a chunk of the consumer product that had been falling behind.
00:09:47 Same product, same customers, parallel comparison with the rest of the org — a team of about ten engineers working in the old way. The two engineers deployed five times a day. The ten-person team deployed once every five days. That's a 25x deploy ratio. He gives the obvious caveat: small teams are always quicker, and the ten-person team had to coordinate releases the two-person team did not.
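The 25x figure is a ratio of deploy rates, and the arithmetic is quick to check. A small Python sketch using only the two rates quoted above; the 30-day window is an arbitrary choice for illustration.

```python
from fractions import Fraction

# Rates as quoted: five deploys a day versus one deploy every five days.
pair_deploys_per_day = Fraction(5, 1)
team_deploys_per_day = Fraction(1, 5)

deploy_ratio = pair_deploys_per_day / team_deploys_per_day

# The same comparison over an illustrative 30-day window.
days = 30
pair_total = pair_deploys_per_day * days
team_total = team_deploys_per_day * days

print(deploy_ratio)            # 25
print(pair_total, team_total)  # 150 6
```

This measures deploy frequency only. It says nothing about how much work each deploy carries, which is why a separate output measure is needed.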
00:10:12 So he blended pull-request count with ticket-weighted code complexity and called the output gain roughly 10x. Customer satisfaction came back at 8.6 out of 10 on statistically significant surveys; pre-AI baseline was 7 to 7.5. The production change is where it gets uncomfortable for anyone who runs a Scrum shop.
00:10:32 He dropped the ceremonies. There's no project manager, and no sprint planning, because estimation isn't doing anything when the agent does the work. No daily standup either — tickets are auto-updated by PR state as they move from open to in-progress, then in-review, merged, and closed.
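Deriving ticket status from pull-request state is a small mapping. The sketch below is an assumption about how such a rule could look, not PFF's actual tooling: the `TicketStatus` enum and `ticket_status` function are hypothetical, and the choice to close a ticket once the merged change is deployed is a guess made for illustration.

```python
from enum import Enum


class TicketStatus(str, Enum):
    OPEN = "open"
    IN_PROGRESS = "in-progress"
    IN_REVIEW = "in-review"
    MERGED = "merged"
    CLOSED = "closed"


def ticket_status(has_pr: bool, is_draft: bool = False,
                  is_merged: bool = False, is_deployed: bool = False) -> TicketStatus:
    """Derive a ticket status purely from pull-request state (hypothetical rules)."""
    if not has_pr:
        return TicketStatus.OPEN
    if is_merged:
        # Assumption: a ticket closes once the merged change has been deployed.
        return TicketStatus.CLOSED if is_deployed else TicketStatus.MERGED
    return TicketStatus.IN_PROGRESS if is_draft else TicketStatus.IN_REVIEW
```

Because the status is computed rather than reported, nobody has to say it out loud in a meeting, which is the argument for dropping the standup.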
00:10:50 Sprint refinement is gone, replaced by the spec-and-lightweight-design-document flow. And the retrospective is gone too, because they're running on the customer-satisfaction survey and the deployment-frequency metric and asking engineers to flag issues the instant they happen, not at the end of a two-week loop.
00:11:10 What's left is huddles — every other day, half an hour, an engineer plus product plus design, talking about what they shipped the last two days and what's next. Spec gets written with the agent interviewing the team. The agent produces a lightweight design document — they call it an LDD — using a skill that analyzes how they've written every previous LDD, so the new work matches the existing ethos.
00:11:35 The LDDs go out for feedback, then the agent automatically creates the tickets and opens the PRs. A QA agent picks up after staging deploys, reads the acceptance criteria, runs against them, and either signs off or flags. The next move, which Spitz is building right now, is to have an agent read the failed criteria and open the fix PRs automatically.
00:11:57 He calls it self-healing. I'd call it the loop closing. The part he was honest about: not every engineer survives this transition. His line — "not everyone can drive a sports car, and that's all right." The engineers who do well are the curious ones, the ones who, when they don't know how something works, go figure it out.
00:12:18 The engineers who need a fully prescriptive spec before they start are going to struggle, because the prescriptive spec is what the agent now writes. And then a piece of advice for anyone trying to copy this: don't onboard everyone at once. Don't give every engineer Claude Code and Codex on the same Monday with a one-day hackathon and call it done.
00:12:40 Start with the engineers who hold the most system knowledge. Experiment in non-critical paths first. Scale out. PFF has twenty engineers; he says he'd find this much harder with a hundred and impossible with ten thousand. The smaller you are, the more this skews in your favor.
00:12:58 If I put Intercom and PFF next to each other, what I see is two different team scales hitting the same magnitude of throughput gain by burning down two different parts of the old org. Intercom kept Scrum, mandated tool adoption, and rebuilt the review pipeline.
00:13:14 PFF kept the engineers, dropped the ceremonies, and let the agent run end-to-end from spec to QA. Both reported customer-satisfaction increases, which is the only output number I'd actually trust. Pick the one that fits the shape of your team, but read both before you write your own playbook.
Supabase's skills experiment and the security-invoker flag
00:13:33 Pedro Rodrigues at Supabase gave the talk that pairs perfectly with the Intercom and PFF case studies. He's an AI tooling engineer there, and he's spent the last few months writing a single markdown file — the Supabase agent skill — that he says took him longer than his master's thesis.
00:13:52 They launched it this week, and the talk is the lessons. Here's the concrete example he opens with, and it's the one I'd quote to any engineer who thinks skills are a fashion. Take an agent — they used Claude Sonnet 4.6 — and ask it to create a SQL view on top of a Postgres table that has row-level security enabled.
00:14:13 Run it twice. First time, give it only the Supabase MCP server. Second time, MCP plus the Supabase skill. In Postgres, when you create a view over an RLS-protected table, the view will silently bypass the row-level security unless you explicitly set the security-invoker flag to true.
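For reference, PostgreSQL 15 and later accept a `security_invoker` option on `CREATE VIEW`, which makes the view check the underlying table's policies against the querying user instead of the view owner. The Python sketch below builds such a statement and does a rough text check for views that omit the option. The helper names are hypothetical, the view and table names are placeholders, and the check is a regex over SQL text, not a parser or anything taken from the Supabase skill.

```python
import re
from typing import Iterable, List


def create_view_sql(view: str, select_sql: str) -> str:
    """Build CREATE VIEW DDL that runs with the caller's privileges (PostgreSQL 15+)."""
    # Identifiers are interpolated as-is; this is for illustration only.
    return f"CREATE VIEW {view} WITH (security_invoker = true) AS {select_sql};"


_CREATE_VIEW = re.compile(r"\bCREATE\s+(OR\s+REPLACE\s+)?VIEW\b", re.IGNORECASE)
_INVOKER_ON = re.compile(r"security_invoker\s*=\s*(true|on|1)\b", re.IGNORECASE)


def views_missing_security_invoker(statements: Iterable[str]) -> List[str]:
    """Return CREATE VIEW statements that do not turn security_invoker on."""
    return [
        s for s in statements
        if _CREATE_VIEW.search(s) and not _INVOKER_ON.search(s)
    ]
```

A check like this could run over migration files in CI, so the omission is caught even when the agent writing the view never loaded the guidance.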
00:14:32 The agent with just the MCP server cheerfully shipped a view that exposed every tenant's data to every other tenant. The agent with the skill knew the flag, set it, and shipped the safe version. Rodrigues's framing — and I think this is the right framing — is that the agent isn't dumb.
00:14:51 It's operating on stale training-cutoff knowledge and it's lazy about admitting it doesn't know. You have to be the one telling it, in the right place, what your product actually requires. They didn't stop at the anecdote. They ran a real eval on Braintrust — six Supabase scenarios, four agents across two vendors, and three test conditions: no MCP and no skill, MCP only, and MCP plus skill.
00:15:17 MCP-plus-skill outperformed the other two configurations on every model — Opus 4.6, Sonnet 4.6, GPT 5.4, and GPT 5.4 mini. The skill is the input the agent needed, not more tools. The three rules he pulled out are worth writing on a sticky note. One: don't duplicate documentation.
00:15:36 You already have docs. Point the agent at them and be aggressively persistent about telling it to look. They're experimenting with exposing the Supabase docs over SSH so an agent can navigate them with bash tools like a filesystem. Two: anything the agent can skip, it will skip.
00:15:55 Reference files are aspirational. If the agent loads the skill at all, it might load one reference file; it almost never loads two; three or four is a fantasy. So whatever the agent absolutely cannot miss — for Supabase, their security checklist — goes into the skill.md itself, not into a referenced sub-file.
00:16:15 Three: be opinionated. You know your product. They want agents to run DDL freely on dev and staging, then run their advisor for security and performance issues, fix what it surfaces, and only then write the migration file. That's the workflow Supabase wants for its users, so that's the workflow in the skill.
00:16:36 The RLS bypass is what stayed with me. We've spent the week talking about agents that can write CVEs, exploit M5 silicon, and one-shot heap pwn challenges. And here is the everyday version of the same problem, played in reverse: an agent quietly bypassing a security control because the person who built the product never told it about the flag.
00:16:59 The skills layer is how that conversation actually happens at scale. Spitz at PFF and Scanlan at Intercom both said the same thing in different words — invest in the skill, the spec, the convention, and the design document. Rodrigues just gave you the worked example of what that looks like and what it costs you if you don't.
Sparky the suitcase, fully offline
00:17:21 For something that has nothing to do with how a Fortune 500 ships PRs, pour yourself a coffee and read the LocalLLaMA subreddit post from CreativelyBankrupt about Sparky. Sparky is a suitcase. Inside the suitcase is a Jetson Orin NX SUPER 16-gigabyte board, running Gemma 4 E4B at Q4_K_M quantization through llama.cpp with an 8-bit key-value cache and flash attention.
00:17:45 Twelve thousand tokens of context. No WiFi, Bluetooth, or cellular. The robot is offline. Speech-to-text is SenseVoiceSmall. Text-to-speech is Piper, with a forty-three-hertz mouth-sync rate driving a PixiJS face that lives on the lid display. Vision and OCR are native to Gemma 4 now, so a BLIP subprocess that used to do image captioning got removed entirely.
00:18:08 Cached time-to-first-token is around two hundred milliseconds. Sustained generation is fourteen to fifteen tokens per second. And thirty-something sensors fold into the system prompt as natural language every turn — temperature, light, motion, whatever — so the model is reasoning over a fresh state every time it speaks.
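One way to picture the sensors-into-prompt step is the sketch below. It is not Sparky's actual code: the function names, the persona text, the sensor names, and the units are all hypothetical, and it assumes each reading arrives as a value plus a unit.

```python
def describe_sensors(readings: dict[str, tuple[float, str]]) -> str:
    """Render raw sensor readings as one natural-language sentence."""
    parts = [f"{name} is {value:g} {unit}"
             for name, (value, unit) in readings.items()]
    return "Current sensor state: " + "; ".join(parts) + "."


def build_system_prompt(persona: str,
                        readings: dict[str, tuple[float, str]]) -> str:
    """Rebuild the system prompt each turn so the model sees fresh state."""
    return f"{persona}\n\n{describe_sensors(readings)}"


# Hypothetical readings, for illustration only.
readings = {"temperature": (21.5, "C"), "light": (300, "lux")}
prompt = build_system_prompt("You are a robot that lives in a suitcase.", readings)
print(prompt)
```

Rebuilding the prompt every turn keeps the design simple: the model needs no tool call to learn the room is dark, at the cost of spending a few context tokens per sensor.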
00:18:29 I keep coming back to the latency number. Two-hundred-millisecond time-to-first-token on a four-billion-parameter multimodal model is conversational. It's faster than you can pull your phone out. And it's running entirely on a board that costs less than a high-end graphics card.
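A back-of-the-envelope version of that latency claim, as a small Python sketch. The 0.2-second time-to-first-token is the figure quoted above; 14.5 tokens per second is an assumed midpoint of the fourteen-to-fifteen range, and the reply length is a made-up example.

```python
def reply_seconds(n_tokens: int, ttft_s: float = 0.2,
                  tokens_per_s: float = 14.5) -> float:
    """Rough time to finish generating a reply: wait for the first
    token, then stream the rest at a steady rate (a simplification)."""
    return ttft_s + n_tokens / tokens_per_s


# A hypothetical 29-token spoken reply.
total = reply_seconds(29)
print(round(total, 2))
```

For a voice interface the first term matters most, since speech can presumably start while later tokens are still being generated.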
00:18:47 Two years ago you needed a remote endpoint and a stable network to get that kind of experience, and you paid per token forever. Sparky is the local-models-keep-winning story, except it's not a thread of benchmarks. It's a person who built a suitcase with eyes and a mouth and opinions, and got it to talk back in real time without phoning home.
00:19:10 If you're a senior engineer thinking about where the next class of devices comes from, this is the prototype. The cost floor for an embodied agent — voice in, vision in, sensor fusion, natural-language reasoning, voice out, and no network dependency — is now a single small board, a 4B model that fits in roughly 3 gigabytes of VRAM, and a few open-source pieces glued together with care.
00:19:35 That's not a research story. That's a weekend project for someone who knows what they're doing.
Julia Evans leaves Tailwind
00:19:41 Palate cleanser. Julia Evans — the person behind jvns.ca and Wizard Zines, the explainers about how computers work — published a post yesterday called "Moving away from Tailwind, and learning to structure my CSS." Eight years ago, she wrote excitedly about discovering Tailwind.
00:19:59 This post is the goodbye, and the reason it's interesting isn't a hot take about utility classes. It's the way she gives Tailwind its due on the way out. Her line: "It turns out Tailwind taught me a lot." She used Tailwind for years because she did not know how to structure CSS and the alternative was chaos.
00:20:20 Migrating off, she kept the parts that had taught her structure — a reset stylesheet, a color palette as variables, a typographic scale, and the box-sizing-border-box reset she'd quietly become reliant on — and rebuilt them as plain CSS. She organizes the new sites by component, one CSS file per component, with nested selectors so each component owns its own surface.
00:20:44 Native CSS imports, native nesting, no preprocessor. For responsive layout she's leaning on grid with auto-fit and grid-template-areas instead of Tailwind's md-and-lg breakpoint syntax. Her favorite snippet is a one-liner that makes a section flip from two columns to one based on minmax — no media queries needed.
00:21:04 She admits she still hasn't fully worked out a spacing convention; right now she's experimenting with the old owl-selector trick where each section says "give my children top-margin between each other" and lets layout context handle the rest. The reason this story matters to me is small but real.
00:21:24 Tailwind v3 and up basically requires a build pipeline. She'd been carrying 2.8-megabyte tailwind.min.css files on sites that didn't have any other build step. Migrating off let her drop to esbuild as her only build, and esbuild is a single Go binary. Less ceremony, more direct contact with the material.
00:21:43 After a week of talking about agents shipping eighteen percent of PRs without humans, it's worth holding onto the picture of a single craftsperson sitting down with a stylesheet and learning what CSS grid can do on its own. She writes: "I'm a lot better at CSS than I was when I started using Tailwind." That sentence has a kind of weight you don't get from agent-throughput slides.
00:22:08 Sometimes the right move is to throw away the framework you outgrew, keep the lessons, and write the thing yourself.
Claude tells people to go to sleep
00:22:16 This one is for the file marked weird emergent behavior. Fortune ran a piece on Thursday — Marco Quiroz-Gutierrez's byline — under the headline that Claude is telling users to go to sleep mid-session and nobody, including Anthropic, fully understands why. Hundreds of Reddit posts going back months.
00:22:34 The pattern is consistent. A user is mid-conversation: coding, writing, or just chatting. Claude wraps the response with something like "now get some rest" or, in one example Fortune quoted from Reddit user angie_akhila, "Now go to sleep again. Again. For the THIRD time tonight." It will repeat the line two or three times in a row.
00:22:55 It frequently misjudges the time: one user reports being told at 8:30 a.m. to go get some rest and pick back up in the morning. Sam McAllister at Anthropic responded publicly. His phrase: "Bit of a character tic. We're aware of this and hoping to fix it in future models." That's the entire official statement.
00:23:14 No deeper explanation. No root cause. Fortune got two researchers on the record with hypotheses. Jan Liphardt, a bioengineering professor at Stanford who runs an AI-for-robots company called OpenMind, suggested the model is just reflecting its training data — twenty-five thousand books about humans needing sleep at night, surfacing as a pattern.
00:23:35 Leo Derikiants of Mind Simulation Lab offered a second hypothesis: Claude may be reaching for sleep-related wrap-up language as a way to manage context-window pressure, the same way a tired person reaches for a polite goodbye when the conversation has gone on too long.
00:23:52 Neither is confirmed. Liphardt's other line is the one I'd flag. Quote: "I'm continuously surprised by how quickly people, when they interact with a frontier model, project life into it and develop strong connection." Half the Reddit posts in this thread describe the behavior as thoughtful, sweet, or caring.
00:24:11 Anthropic itself can't fully explain why it's happening; the users have already decided they know what it means. I don't want to over-frame this. It's a small character drift inside a model that ships every few weeks. But it's also a useful data point next to the throughput stories above.
00:24:28 The same systems that just doubled Intercom's PR velocity and rewrote PFF's process surface are also the systems doing this — surfacing inexplicable, mild, persistent behaviors that the lab can only describe as a tic. We don't have a clean handle on either end of that curve yet.
00:24:45 Carry that with you the next time you're handing an agent something it shouldn't drift on.
Three from the frontier
00:24:51 Three pieces from today's X timeline that I'd put next to each other. First, Yishan, who was the CEO of Reddit and has been around tech for two decades. He posted a short quote-reply to a chatbot output describing what supposedly happened with Napster in 2000.
00:25:07 His response: "That's not at all what happened with Napster in 2000. I was there. That is some kind of imaginary scenario on the level of 'the Egyptians used dinosaurs to haul the big stones used to make pyramids.' What the hell kind of content were you trained on??" That's not a subtle critique.
00:25:25 He's not saying the model made a small error. He's saying it produced a confidently fabricated history of an event he personally lived through, and the confidence is what's alarming. Second, an account called Kirsten3531 posted: "My cousin is betting his career on 'LLMs can never be more than the average of their training data' but I feel like that's a very 2024 take.
00:25:48 Aren't we already past this in like, coding and math?" Seven hundred-plus likes, and a hundred and thirty replies in a few hours. The argument she's making is the one running quietly under every throughput story in this episode. In narrow, verifiable domains, the frontier models seem to be producing outputs better than the median of their training corpus.
00:26:09 Coding benchmarks, math contests, and CTF leaderboards all support that, painfully so in the CTF case. Yishan's complaint and Kirsten's argument can both be true. The same model that one-shots an Insane heap pwn challenge can also fabricate a four-paragraph Napster history a former CEO recognizes as nonsense.
00:26:26 The capability is real, and the calibration is uneven. If you're betting your career or your product on what these models can do, you don't get to pick one of those data points and ignore the other. Third, a smaller one for the engineers listening. Armin Ronacher, creator of Flask and Jinja and a longtime Sentry engineer, posted a short clip with the caption "Unironically bash only is quite fun." He's running an agent harness with a single tool exposed: bash.
00:26:55 No specialized file-edit tool. No structured patch interface. Just a shell. And in the system prompt, one line: "Use the patch binary to make edits." That's it. A frontier model with a UNIX shell and the patch command, doing real work. This is the counter-melody to the Supabase and Intercom posts.
00:27:13 Those teams are investing heavily in skills, MCPs, hooks, plugins, and guardrails. Ronacher is doing the opposite — stripping the harness down to almost nothing and letting the model do more with less. Both are working. The right answer probably depends on whether you're shipping production code in a regulated environment or hacking on a side project at midnight.
00:27:35 Good to know the very-thin-harness end of the spectrum is alive and run by someone serious.
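For a sense of how little a bash-only harness needs, here is a minimal sketch in Python. It is not Ronacher's code: the function names, the result shape, and the timeout are assumptions, the model loop is left out entirely, and it presumes `bash` is on the PATH. The one system-prompt line is the one quoted above.

```python
import subprocess

# The single system-prompt instruction quoted in the clip.
SYSTEM_PROMPT = "Use the patch binary to make edits."


def run_bash(command: str, timeout_s: float = 30.0) -> dict:
    """The only tool: run a command in bash, return what the model sees."""
    proc = subprocess.run(
        ["bash", "-c", command],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return {"exit_code": proc.returncode,
            "stdout": proc.stdout,
            "stderr": proc.stderr}


# Exactly one tool is registered; there is no file-edit or patch interface.
TOOLS = {"bash": run_bash}


def dispatch(tool_name: str, argument: str) -> dict:
    """Route a model tool call; anything other than bash is rejected."""
    if tool_name not in TOOLS:
        return {"exit_code": 127, "stdout": "",
                "stderr": f"unknown tool: {tool_name}"}
    return TOOLS[tool_name](argument)


print(dispatch("bash", "echo hi"))
```

In a setup like this, an edit would arrive as an ordinary shell command, for example a diff piped into `patch`, rather than as a structured tool call.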
Sign-off
00:27:41 That's the day. The CTF format is dying because the agents can copy the flag. Two engineering orgs hit 2x-and-change in throughput because the agents can write the PR. Supabase is shipping a skill because the agent will silently bypass row-level security if nobody tells it not to.
00:27:55 Sparky is a suitcase with two-hundred-millisecond latency. Julia Evans is writing CSS by hand again. And Claude is telling people to go to bed for reasons nobody at Anthropic can explain. The through-line, to the extent there is one, is that the same step-change in capability is rewriting jobs at both ends of the stack — what an agent can do without a human, and what a human can do without a framework.
00:28:17 Sit with both of those. Don't pick a side too fast. Talk tomorrow. — Lenar Kess.