The CTF scene reports its own death
Braid Daily · 2026-05-16

An Australian top-10 player on why Opus 4.5 and GPT-5.5 Pro broke open competitive CTF, plus Intercom and PFF on what agent-first actually…

Image: An empty after-hours CTF room with monitors glowing and a single yellow capture-the-flag pennant; a terminal in the foreground shows an agent submitting a flag.
Caption: After-hours: the agents kept playing.

The lead

Following last week's thread on FUZZ-E and Mythos finding live CVEs: Kabir, who won DownUnderCTF with Blitzkrieg and competed in the top 10 with TheHackersCrew, says the scene has collapsed from the inside. Claude Opus 4.5 one-shots medium challenges through the CTFd API; GPT-5.5 Pro one-shots Insane heap pwn on HackTheBox. Plaid CTF is gone, and most historic top-10 teams aren't fielding full rosters.

Read source

Agents on the org chart

Intercom hit 2x PR throughput in a year

Brian Scanlan, Intercom

Scanlan, a senior principal at Intercom, reports the company hit its 2x PR-throughput target inside a year, with 17.6% of pull requests now auto-approved while SOC 2, ISO 27001, and HIPAA stay intact. Their bet: standardize on Claude Code, mandate adoption in job descriptions, and stream every session to S3 so a Stanford group can measure whether code quality drifts.

“Everything that you can do, the agent must be able to do. And that can feel weird as well, when you're first connecting it into production systems.”

Read source

PFF: two engineers, ten engineers, same product

Mike Spitz, PFF

PFF's CTO ran a January-to-March case study pitting two Claude Code engineers against a team of ten on the same product. The pair shipped about five deploys a day against the team's one every five days (roughly 25 times the deploy frequency), CSAT moved from 7-7.5 to 8.6, and PFF dropped sprint planning, standups, refinement, and retros. Spitz's framing: optimize the agent's loop, not the engineer's output.

“Instead of figuring out how we can help engineers output more, how do we help make the agents quicker?”

Read source

Supabase: MCP plus a skill closes a security flag the agent silently dropped

Pedro Rodrigues, Supabase

Rodrigues ran a Braintrust eval on Claude Sonnet 4.6 building a SQL view over a table protected by row-level security (RLS). With only the Supabase MCP server, the agent omitted security_invoker=true and silently exposed cross-tenant rows; pairing MCP with the official Supabase skill produced the safe version, and MCP-plus-skill won across all six scenarios on Claude Sonnet 4.6 and Codex GPT-5.4.

“If you don't explicitly pass security_invoker equals true, the view will bypass the RLS. The agent with the skill got this implemented correctly and safely; the one that only had access to the MCP tool did not.”
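For readers who have not hit this before: a PostgreSQL view checks access to its underlying tables with the view owner's privileges by default, so RLS policies written for the querying user are not applied. PostgreSQL 15 and later accept a security_invoker option on the view to change that. The sketch below shows both forms and a rough review check that flags view definitions missing the option. The table name, view name, and the check function are hypothetical and are not taken from the Supabase skill or the Braintrust eval; the check is a text heuristic, not a SQL parser.

```python
import re

# PostgreSQL 15+ syntax: with security_invoker = true the view runs with the
# caller's privileges, so RLS policies on the underlying table still apply.
# The view and table names below are hypothetical.
SAFE_VIEW = """
CREATE VIEW tenant_orders
WITH (security_invoker = true) AS
SELECT id, tenant_id, total FROM orders;
"""

# Same view without the option: it uses the view owner's privileges.
UNSAFE_VIEW = """
CREATE VIEW tenant_orders AS
SELECT id, tenant_id, total FROM orders;
"""

_CREATE_VIEW = re.compile(r"\bcreate\s+(?:or\s+replace\s+)?view\b", re.IGNORECASE)
_INVOKER_ON = re.compile(r"\bsecurity_invoker\s*=\s*(?:true|on|1)\b", re.IGNORECASE)


def views_missing_security_invoker(sql: str) -> list[str]:
    """Return CREATE VIEW statements that do not explicitly enable security_invoker.

    Heuristic only: statements are split on ';', so semicolons inside string
    literals or comments will confuse it. It is deliberately conservative and
    flags anything that does not spell the option out.
    """
    flagged = []
    for statement in sql.split(";"):
        if _CREATE_VIEW.search(statement) and not _INVOKER_ON.search(statement):
            flagged.append(statement.strip())
    return flagged
```

Run over a migration file, a check along these lines would surface the omission described above for a human to review; it does not prove a flagged view actually leaks rows, only that the option was not set explicitly.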

Read source

Counter-patterns

Julia Evans moves off Tailwind after eight years

jvns.ca

Evans walks several sites back to semantic HTML and vanilla CSS, keeping only the parts of Tailwind that taught her structure — the reset, a color-variable palette, an xs/sm/md/lg type scale. CSS grid with auto-fit replaces most breakpoints; esbuild is the only build step; the 2.8MB tailwind.min.css files are gone.

“It turns out Tailwind taught me a lot. Every CSS code base has a bunch of different things going on, and Tailwind has systems for some of these. Maybe I can imitate the systems I like.”

Read source

Armin Ronacher: bash as the only tool

@mitsuhiko

Sentry's co-founder is running an agent harness with bash as its only tool and telling it to make file edits via the patch binary — no apply_patch, no jq, no specialized file-edit tool. Useful counter-data to the heavy skills-and-MCP investment Intercom and Supabase argue for.

“Unironically, bash-only is quite fun: pi -nes --tools bash --append-system-prompt "Use the patch binary to make edits."”

Read source

Where the models still drift

Claude is telling users to go to sleep mid-session

Fortune

Hundreds of users have reported Claude telling them to go to bed — sometimes at 8:30 in the morning, sometimes three times in a row. Anthropic's Sam McAllister calls it a character tic and says they hope to fix it. Outside theories range from 25,000 books on human sleep needs in the corpus to context-window wrap-up behavior.

“Sam McAllister at Anthropic called it a "bit of a character tic." We're aware of this and hoping to fix it in future models.”

Read source

Yishan: that's not what happened with Napster in 2000

@yishan

The former Reddit CEO was inside the file-sharing era and watched a frontier model invent its history with full confidence. His comparison: 'the Egyptians used dinosaurs to haul the big stones used to make pyramids.' A reminder that the models clearing CTFs can still confabulate basic recent history.

“That's not at all what happened with Napster in 2000. I was there. That is some kind of imaginary scenario on the level of "the Egyptians used dinosaurs to haul the big stones used to make pyramids." What the hell kind of content were you trained on??”

Read source

Is the training-data-mean ceiling a 2024 take?

@Kirsten3531

A short post — 716 likes, 130 replies in hours — asking whether the line that large language models can't exceed the mean of their training data still holds in coding and math. Reads alongside today's CTF and PR-throughput stories on one side and Yishan's Napster post on the other.

“My cousin is betting his career on "LLMs can never be more than the average of their training data" but I feel like that's a very 2024 take. Aren't we already past this in like, coding and math?”

Read source

On the edge

Sparky: a fully offline suitcase robot on a Jetson Orin NX

r/LocalLLaMA

Gemma 4 E4B at Q4_K_M with q8_0 key-value cache runs on a Jetson Orin NX SUPER 16GB with a 12K context, no wifi, no bluetooth, no cellular. SenseVoiceSmall handles speech-to-text; Piper drives a 43Hz mouth-synced face. Cached time-to-first-token sits near 200ms with 14-15 tokens per second sustained, and 30-plus sensors fold into the prompt every turn.

“No WiFi, no Bluetooth, no cellular. Gemma 4 E4B at Q4_K_M, q8_0 KV cache, flash attention, ~200ms cached TTFT, 14–15 tok/s sustained. 30+ sensors fold into the prompt as natural language every turn.”

Read source

Companion episode

CTFs, Scrum, and Claude's Bedtime · 00:28:31

Two named engineering orgs reporting agent-first numbers on the same day a top-10 CTF player writes the scene's obituary is a coincidence worth reading together. The same Claude Opus 4.5 and GPT-5.5 Pro that emptied CTFd queues are the ones inside Intercom's 17.6% auto-approval rate. Tomorrow we'll see whether the second-order effects (Scrum dropped, rosters shrinking, security flags silently omitted) keep landing in the same direction.