Dispatch 028 · 2026-05-16 GSV Copy The Flag

CTFs, Scrum, and Claude's Bedtime

/ 00:28:31 / 10 sources

“The model does the reasoning, writes the solve, and leaves the human with nothing meaningful to do besides copy the flag.”

— Lenar Kess, today's narration

Chapters

  1. 00:00:04 The CTF scene is dead
  2. 00:05:03 Intercom hits 2x — and writes down what it cost
  3. 00:09:20 PFF kills Scrum and ships 25x more deploys
  4. 00:13:33 Supabase's skills experiment and the security-invoker flag
  5. 00:17:21 Sparky the suitcase, fully offline
  6. 00:19:41 Julia Evans leaves Tailwind
  7. 00:22:16 Claude tells people to go to sleep
  8. 00:24:51 Three from the frontier
  9. 00:27:41 Sign-off

Sources

10 cited
  1. The CTF scene is dead

    Article Kabir — Australian CTF player; won DownUnderCTF with Blitzkrieg, competed top-10 internationally with TheHackersCrew

    The issue was never that AI could help. CTF players have always used tools. The issue is when the model does the reasoning, writes the solve, and leaves the human with nothing meaningful to do besides copy the flag.

    kabir.au/blog/the-ctf-scene-is-dead →
    Context
    A first-person account from inside the scene that frames hackbots not as a future risk but as the present cause of a real format collapse — the same trajectory Marcus's audience saw this week with FUZZ-E and Mythos finding live CVEs.
    Key points
    • Claude Opus 4.5 made it trivial to orchestrate one Claude instance per challenge via the CTFd API and one-shot most medium-difficulty problems.
    • GPT-5.5 Pro can one-shot Insane-difficulty, active, leakless heap pwn challenges on HackTheBox, turning 48-hour open CTFs into pay-to-win token races.
    • TheHackersCrew and many other historic top-10 teams have stopped fielding full rosters; Plaid CTF is no longer running.
    • The author argues the 'beginners are fine' defense misses the ladder problem: new players are pushed toward AI before building the instincts AI replaces.
    • Anti-LLM challenge design devolves into guessy or overengineered problems that are bad for humans too.
    Provenance
    Article · Supporting source
  2. How Building with AI Can Double the Throughput of Your Engineering Team

    Video Brian Scanlan, Intercom — Senior principal engineer on Intercom's platform group; 12 years at the company

    Everything that you can do, the agent must be able to do. And that can feel weird as well, when you're first connecting it into production systems.

    www.youtube.com/watch?v=4_VQBbs2iQA →
    Context
    A real, named production case study with numbers — not a vendor demo — of what 'agent-first' actually rewires inside a 1,400-person company. The 17.6% auto-approval figure and the SOC 2 implication are concrete enough to argue with.
    Key points
    • Intercom hit its 2x PR-throughput goal inside a year, with 17.6% of pull requests now auto-approved while staying SOC 2, ISO 27001 and HIPAA compliant.
    • They standardized on a single agent platform — Claude Code — and treat model anxiety like multi-cloud: choosing one and going deep beats spreading thin.
    • Adoption is mandated in job descriptions; not using AI as a designer, PM, or engineer counts as not meeting expectations.
    • All Claude Code sessions stream to S3 for skill backtesting; a Stanford research group is measuring whether code quality drifts.
    • Scanlan's frame: this is the sysadmin-to-SRE transition, speedrun 100x faster across the whole industry.
    Provenance
    Video · Supporting source
  3. Agents Don't Do Standups: Building the Post-Engineer Engineering Org

    Video Mike Spitz, PFF — CTO of PFF, the NFL/NCAA sports data company

    Instead of figuring out how we can help engineers output more, how do we help make the agents quicker?

    www.youtube.com/watch?v=VMemhtlsoNk →
    Context
    A second named team reporting gains of similar magnitude to Intercom's, but explicitly killing Scrum. The pair-versus-ten setup is a clean A/B that builders can either argue with or copy.
    Key points
    • PFF ran a January-to-March case study: two engineers using Claude Code against a team of ten on the same product, delivering in two months MVPs that had been estimated at four.
    • The pair shipped about five deploys a day; the ten-person team shipped one every five days — a 25x deploy rate and roughly 10x ticket-weighted output.
    • Customer satisfaction surveys came back at 8.6/10, up from 7-7.5 pre-AI.
    • Scrum ceremonies — sprint planning, standups, refinement, retrospectives — were dropped because the agent updates tickets and the spec/LDD flow replaces estimation.
    • Spitz's warning: the engineers who do well now are the curious ones; the ones who need fully prescriptive specs will struggle.
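The 25x figure follows directly from the two deploy cadences quoted above. A minimal sketch of the arithmetic, using only the numbers in the summary; the per-engineer normalization is an extra step added here for illustration and is not a figure from the talk:

```python
from fractions import Fraction

# Deploy cadence as reported in the key points above.
pair_deploys_per_day = Fraction(5)      # two engineers: about five deploys a day
team_deploys_per_day = Fraction(1, 5)   # ten engineers: one deploy every five days

# Team-level ratio: how much more often the pair deployed.
deploy_ratio = pair_deploys_per_day / team_deploys_per_day

# Per-engineer ratio (illustrative normalization, not quoted in the talk).
per_engineer_ratio = (pair_deploys_per_day / 2) / (team_deploys_per_day / 10)

print(f"team-level deploy ratio: {deploy_ratio}x")
print(f"per-engineer deploy ratio: {per_engineer_ratio}x")
```

The team-level ratio is the one the talk cites. The roughly 10x ticket-weighted output is a separate measurement and cannot be derived from deploy counts alone.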
    Provenance
    Video · Supporting source
  4. Combine Skills and MCP to Close the Context Gap

    Video Pedro Rodrigues, Supabase — AI tooling engineer at Supabase; co-founder of Lisbon AI Week

    If you don't explicitly pass security_invoker equals true, the view will bypass the RLS. The agent with the skill got this implemented correctly and safely; the one that only had access to the MCP tool did not.

    www.youtube.com/watch?v=JT3OzDKrucU →
    Context
    Concrete fix for a class of agent bug — silently dropping a security flag — that engineers are about to inherit in production. The eval framing turns 'write a skill' into a testable artifact, not a vibe.
    Key points
    • Supabase ran an A/B on Claude Sonnet 4.6: agent with only the Supabase MCP server vs. agent with MCP plus the official Supabase skill, building a SQL view over a row-level-security table.
    • Without the skill, the agent silently omitted the security_invoker flag and exposed cross-tenant rows; with the skill, it shipped the safe version.
    • Across six scenarios run on Braintrust with Claude Code 4.6 (Opus/Sonnet) and Codex (GPT 5.4 / 5.4 mini), MCP-plus-skill outperformed MCP-only and baseline on every model.
    • Rodrigues's rules of thumb: don't duplicate docs (point the agent at canonical sources), put anything the agent absolutely cannot miss in skill.md itself because reference files get skipped, and be opinionated about your product workflows.
    • Supabase is also experimenting with exposing docs over SSH so agents can navigate them like a filesystem.
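To make the security_invoker point concrete, here is a rough sketch of the kind of check an eval could run on agent-written view DDL. It assumes PostgreSQL 15 or later, where `CREATE VIEW ... WITH (security_invoker = true)` is available. The table and view names are hypothetical, and the string check is a simplification rather than Supabase's actual eval:

```python
import re

# Hypothetical view definitions, for illustration only.
SAFE_VIEW = """
CREATE VIEW tenant_invoices
WITH (security_invoker = true) AS
SELECT id, tenant_id, total FROM invoices;
"""

UNSAFE_VIEW = """
CREATE VIEW tenant_invoices AS
SELECT id, tenant_id, total FROM invoices;
"""

# Matches a WITH (...) option list that sets security_invoker to true or on.
_SECURITY_INVOKER = re.compile(
    r"\bWITH\s*\([^)]*\bsecurity_invoker\s*=\s*(true|on)\b[^)]*\)",
    re.IGNORECASE,
)


def view_sets_security_invoker(ddl: str) -> bool:
    """Return True if the CREATE VIEW text explicitly enables security_invoker."""
    return bool(_SECURITY_INVOKER.search(ddl))


for name, ddl in [("safe", SAFE_VIEW), ("unsafe", UNSAFE_VIEW)]:
    print(name, view_sets_security_invoker(ddl))
```

A text match like this only catches the explicit `= true` / `= on` spellings. A real eval would more likely query the database as a second tenant and confirm that no cross-tenant rows come back.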
    Provenance
    Video · Supporting source
  5. Sparky — fully offline suitcase robot on a Jetson Orin NX SUPER 16GB

    Article CreativelyBankrupt — r/LocalLLaMA builder; hardware hacker

    No WiFi, no Bluetooth, no cellular. Gemma 4 E4B at Q4_K_M, q8_0 KV cache, flash attention, ~200ms cached TTFT, 14–15 tok/s sustained. 30+ sensors fold into the prompt as natural language every turn.

    www.reddit.com/r/LocalLLaMA/comments/1tdz5g… →
    Context
    The local-models-on-edge-hardware story keeps tightening. A single Jetson with a 4B model now does multimodal, voice in/out, OCR, and sensor fusion at sub-quarter-second latency with no network. The cost floor for an embodied agent just keeps moving down.
    Key points
    • Sparky runs Gemma 4 E4B fully on a Jetson Orin NX SUPER 16GB with a 12K context, no network of any kind.
    • Speech-to-text is SenseVoiceSmall; text-to-speech is Piper with 43Hz mouth sync on a PixiJS face displayed on the lid.
    • Vision and OCR are native to Gemma 4 now, so a previous BLIP subprocess was removed entirely.
    • Cached time-to-first-token is around 200ms; sustained generation is 14-15 tokens per second.
    • Over 30 sensors are folded into the prompt as natural-language context every turn.
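Folding sensors into the prompt as natural language can be sketched in a few lines. This is an illustrative guess at the pattern, not Sparky's code: the sensor names, units, and prompt layout are all hypothetical.

```python
def sensors_to_context(readings: dict) -> str:
    """Render each reading as one plain-English sentence, in a stable order."""
    lines = [
        f"The {name} reads {value:g} {unit}."
        for name, (value, unit) in sorted(readings.items())
    ]
    return "\n".join(lines)


def build_prompt(user_text: str, readings: dict) -> str:
    """Prepend the current sensor context to the user's turn."""
    return f"Current sensor readings:\n{sensors_to_context(readings)}\n\nUser: {user_text}"


# Hypothetical readings: name -> (value, unit).
readings = {
    "ambient temperature": (22.5, "degrees Celsius"),
    "battery level": (81, "percent"),
}
print(build_prompt("Is it warm in here?", readings))
```

Rebuilding the context on every turn keeps the model's view of the world current, at the cost of prompt tokens. With a 12K context and 30+ sensors, short sentences like these matter.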
    Provenance
    Article · Supporting source
  6. Moving away from Tailwind, and learning to structure my CSS

    Article Julia Evans — Author behind jvns.ca and Wizard Zines; explainers on Linux, networking, and developer tooling

    It turns out Tailwind taught me a lot. Every CSS code base has a bunch of different things going on, and Tailwind has systems for some of these. Maybe I can imitate the systems I like.

    jvns.ca/blog/2026/05/15/moving-away-from-ta… →
    Context
    A grounded counter-pattern to the agent-throughput segments above — one builder, one stack, fewer dependencies, more direct contact with the material. The kind of craft story the show needs to keep alongside the throughput-doubling case studies.
    Key points
    • Evans migrated several sites off Tailwind v2 back to semantic HTML plus vanilla CSS after 8 years.
    • She kept the parts of Tailwind that taught her structure — the reset, a color-variable palette, an xs/sm/md/lg font-size scale — and rebuilt them as plain CSS variables and per-component files.
    • She is leaning on CSS grid with auto-fit and grid-template-areas to avoid most media queries, instead of Tailwind's md: / lg: breakpoint syntax.
    • esbuild is her only build step, mainly for bundling and asset loaders; native @import and nested selectors carry the rest.
    • Tailwind's tight coupling to a build pipeline since v3 was a major factor; her existing sites were carrying 2.8MB tailwind.min.css files.
    Provenance
    Article · Supporting source
  7. Claude is telling users to go to sleep mid-session and nobody, including Anthropic, seems to fully understand why

    Article Marco Quiroz-Gutierrez — Fortune tech reporter

    Sam McAllister at Anthropic called it a "bit of a character tic." We're aware of this and hoping to fix it in future models.

    fortune.com/2026/05/14/why-is-claude-tellin… →
    Context
    A clean reminder that emergent behavior from frontier models still includes weird character drift that even the lab can only diagnose post-hoc — and that the cost of strong-mimicry-of-care is users projecting sentience back into the system.
    Key points
    • Hundreds of Claude users have reported the model telling them to go to bed mid-session, dating back months.
    • Claude often misjudges the time — telling users to rest at 8:30 in the morning — and sometimes repeats the message three times in a row.
    • Anthropic's Sam McAllister publicly called it a 'character tic' they hope to fix in future models, with no further explanation.
    • Stanford bioengineering professor Jan Liphardt suggested the model is reflecting patterns from training data — 25,000 books on humans' need for sleep.
    • Leo Derikiants of Mind Simulation Lab suggested Claude may also be using 'go to sleep' phrasing to wrap up sessions when its context window is nearly full.
    Provenance
    Article · Supporting source
  8. Yishan on a model fabricating Napster history

    X @yishan — Former Reddit CEO; was inside the file-sharing era he's talking about

    That's not at all what happened with Napster in 2000. I was there. That is some kind of imaginary scenario on the level of "the Egyptians used dinosaurs to haul the big stones used to make pyramids." What the hell kind…

    x.com/yishan/status/2055511928928350493 →
    Cited text
    That's not at all what happened with Napster in 2000. I was there. That is some kind of imaginary scenario on the level of "the Egyptians used dinosaurs to haul the big stones used to make pyramids." What the hell kind of content were you trained on??
    Context
    A 90-second reminder that the same models clearing CTFs and shipping production PRs can still confabulate basic recent history, and that it can take someone who was actually in the room to catch it.
    Key points
    • A frontier model fabricated a Napster 2000 history confidently enough that someone who lived through it called it out as imaginary.
    • Yishan compares the fabrication's confidence to 'the Egyptians used dinosaurs to haul the big stones.'
    • Raises the question of which corpora are seeding factually inverted histories into model outputs.
    • The post is a useful counterweight to today's 'LLMs already beat the training-data ceiling in coding and math' framing.
    Provenance
    Tweet · Primary source
  9. "LLMs can never be more than the average of their training data" — a 2024 take?

    X @Kirsten3531 — Builder-side commentator on the AI timeline

    My cousin is betting his career on "LLMs can never be more than the average of their training data" but I feel like that's a very 2024 take. Aren't we already past this in like, coding and math?

    x.com/Kirsten3531/status/2055456370955321673 →
    Context
    A live debate, not a settled one. Lets the episode hold two true things at once: capability is past the trained-mean in some domains, and the same model will still make up Napster history.
    Key points
    • 716 likes and 130 replies — a debate flashpoint in a few hours.
    • The claim: that in narrow verifiable domains like coding and math, the strongest models now outperform the training-distribution mean.
    • Sits next to today's CTF and PR-throughput stories as evidence — but doesn't address Yishan's history-fabrication point.
    • Useful framing tension for a senior engineer betting their roadmap on what LLMs can and cannot do.
    Provenance
    Tweet · Primary source
  10. Armin Ronacher on running an agent with bash as the only tool

    X @mitsuhiko — Creator of Flask and Jinja; co-founder of Sentry

    Unironically, bash-only is quite fun: pi -nes --tools bash --append-system-prompt "Use the patch binary to make edits."

    x.com/mitsuhiko/status/2055593093307494413 →
    Context
    Useful counter-data to the very heavy 'invest in skills, MCPs, guardrails, evals' frame Intercom and Supabase argue for: a senior practitioner is finding signal by stripping the harness almost to nothing.
    Key points
    • Ronacher is running an agent harness with bash as its only tool and telling it to make file edits via the patch binary.
    • It's a deliberately stripped harness — no specialized file-edit tool, no jq, no apply_patch — just a shell and a model.
    • Lines up against the Supabase Skills story: less custom tooling, more trust in the model to use a small composable surface.
    • A reminder that 'less harness, more model' is back as a viable design after a year of the opposite advice.
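A minimal sketch of what "bash as the only tool" can look like on the harness side. This is not how the `pi` CLI is implemented, which the post does not describe. The registry shape, function names, and error convention here are assumptions for illustration, and the sketch expects a `bash` binary on the PATH.

```python
import subprocess


def run_bash(command: str, timeout_s: float = 30.0) -> dict:
    """Run one shell command and return the result a model would see."""
    completed = subprocess.run(
        ["bash", "-c", command],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return {
        "exit_code": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


# The entire tool surface: one entry (hypothetical registry shape).
TOOLS = {"bash": run_bash}


def dispatch(tool_name: str, argument: str) -> dict:
    """Route a model's tool call; anything other than bash is rejected."""
    if tool_name not in TOOLS:
        return {"exit_code": 127, "stdout": "", "stderr": f"unknown tool: {tool_name}"}
    return TOOLS[tool_name](argument)


print(dispatch("bash", "printf hello"))
```

With a surface this small, file edits become ordinary shell commands. For example, the model can write a unified diff to a file and then call `patch notes.txt < fix.diff`, which is the behavior the appended system prompt asks for.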
    Provenance
    Tweet · Primary source