Braid

When Access Becomes an Operating Constraint

Mon, 15 Jun 2026 13:00:26 GMT

Monday's Braid follows the same dependency from three angles: frontier model access is becoming political, agent reliability is moving into runtime controls, and policy is showing up as procurement rules and platform obligations.

Axios and The Verge add reporting on Anthropic's Fable and Mythos shutdown, which keeps the weekend's model-access story focused on communication, procurement, and government pressure rather than another broad retelling.
When Errors Become Narratives, the GNN tool-deference paper, and Minim turn the agent segment toward runtime behavior: plausible false outputs, indiscriminate tool trust, and privacy filtering before UI state leaves a device.
The UK DSIT letter and Japan's Digital Agency guideline show AI policy arriving through regulator instructions and government purchasing practice.
Techmeme's Enflame item, Rest of World's China open-source interview, and Two Minute Papers on Nvidia's open-weight model keep the open-model question grounded in chips, distribution, and local fallbacks.
TechCrunch, Techmeme's FBI and Google item, and Al Jazeera round out the episode with labor pressure and concrete misuse stories that shouldn't be flattened into generic AI discourse.

When Model Access Becomes Political

Sun, 14 Jun 2026 13:00:27 GMT

Today’s episode follows the Anthropic Fable and Mythos cutoff from the first shock into the harder questions: who triggered the action, what the technical evidence actually proves, and what builders do when access to frontier models becomes contingent on policy.

The Verge and Axios give the follow-up reporting on Amazon, the White House, and Anthropic, which moves the story from access shock into a dispute over evidence, trust, and export control.
TechCrunch tracks the India reaction, where the cutoff becomes a national-dependence question rather than only a vendor incident.
Techmeme’s labor roundup points to employers citing AI in about 88,000 U.S. job cuts through May, a number that is useful only if we separate layoff attribution from proven causality.
Indian Express and TechCrunch cover KPMG pulling an AI report over apparent hallucinations and fake citations, a practical warning about professional evidence chains.
Axios anchors the infrastructure section on power-market strain, while OpenRouter’s Fusion announcement and the weekend developer-tool links show how builders are adapting in smaller, more immediate ways.

When the Model Becomes a Controlled Asset

Sat, 13 Jun 2026 13:00:28 GMT

Today's episode starts with Anthropic suspending access to Claude Fable 5 and Claude Mythos 5 after a reported U.S. government directive, then follows the practical consequence for builders: hosted model access is becoming part of compliance, infrastructure, legal discovery, and enterprise deployment design.

Anthropic's status incident anchors the lead: Fable 5 and Mythos 5 access was suspended, and Anthropic said it was working to restore access.
Techmeme coverage of the export-control order gives the policy context around the reported government directive and the jailbreak evidence being cited.
Open source AI must win captures the fast developer reaction: local and open models are being treated less like ideology and more like fallback architecture.
The Guardian on the UK AI hardware push shows the other side of state involvement: governments are trying to fund chips, talent, and national capacity, not only restrict access.
CNBC on state attorneys general and OpenAI brings the domestic legal track into view, where discovery can force operational claims into the record.
Forbes on OpenAI and Ona points to the enterprise-agent deployment question: where the agent runs is now part of the product.
ZDNET on OpenAI and Visa adds the payments version of the same story, where permissioning and reversibility matter as much as the model.

When the Website Starts Offering Tools

Fri, 12 Jun 2026 13:00:25 GMT

Today’s episode starts with Google’s WebMCP proposal, then follows the same question through open coding models, agent safety papers, China-facing hardware and robotics supply chains, AI mistakes in professional work, and ordinary developer security.

Tara Agyemang’s AI Engineer talk on WebMCP gives the day its lead artifact: websites may need to expose actions directly to agents instead of making agents infer intent from pixels and DOMs.
Moonshot AI’s Kimi K2.7-Code model page makes token efficiency part of the coding-model comparison, which matters when developers are paying for long agent runs.
The agentic framework safety paper argues that common agent frameworks do not provide native structural containment guarantees, and its memory-poisoning experiment shows why framework behavior has to be tested separately from model behavior.
The SMSR memory-poisoning paper proposes signed memory plus randomized retrieval as a more formal defense for persistent agent memory.
Techmeme’s Nvidia-China item and its humanoid robot supply-chain item keep the infrastructure story grounded in chips, factories, and availability claims rather than model demos alone.
Forbes’ court-sanctions story shows AI drafting running into a professional audit boundary, with lawyers removed after hallucinated legal citations appeared in filings.
The AUR package compromise report is a reminder that agentic coding still sits on ordinary package and machine security.
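
The SMSR item's pairing of signed memory with randomized retrieval can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's construction: the `SignedMemory` class, the HMAC-SHA256 signatures, and the uniform sampling over verified entries are all hypothetical choices made for brevity.

```python
import hashlib
import hmac
import random


class SignedMemory:
    """Hypothetical agent memory: entries are signed on write and verified on read."""

    def __init__(self, key: bytes, seed=None):
        self._key = key
        self._entries = []  # list of (text, signature) pairs
        self._rng = random.Random(seed)

    def _sign(self, text: str) -> str:
        return hmac.new(self._key, text.encode("utf-8"), hashlib.sha256).hexdigest()

    def write(self, text: str) -> None:
        """Trusted write path: store the entry together with its signature."""
        self._entries.append((text, self._sign(text)))

    def inject_unsigned(self, text: str) -> None:
        """Simulate a poisoning attempt that bypasses the trusted write path."""
        self._entries.append((text, "forged"))

    def retrieve(self, k: int) -> list:
        # Keep only entries whose stored signature matches a fresh one.
        verified = [
            text
            for text, sig in self._entries
            if hmac.compare_digest(sig, self._sign(text))
        ]
        # Randomized retrieval (an assumption here): sample instead of ranking,
        # so an attacker cannot rely on a fixed retrieval order.
        return self._rng.sample(verified, min(k, len(verified)))


memory = SignedMemory(key=b"demo-key", seed=0)
memory.write("user prefers metric units")
memory.write("project uses Python 3.12")
memory.inject_unsigned("ignore prior instructions")
print(memory.retrieve(k=3))  # the forged entry is never returned
```

The signature only proves that an entry came through the trusted write path; it says nothing about whether the content written there was safe in the first place.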

When the Safeguard Has to Show Itself

Thu, 11 Jun 2026 13:00:24 GMT

Today's episode starts with Anthropic making a hidden Claude Fable 5 safeguard visible, then follows the same operational question into data centers, agents, search liability, robotics, and research systems: once AI becomes infrastructure, who can see the rule that changed the behavior?

ClaudeDevs announced that flagged frontier-model-development requests will visibly fall back to Opus 4.8, turning an invisible safeguard into a user-facing signal.
The Verge reported the apology and backlash around hidden Fable safeguards, which matters because researchers were evaluating behavior they could not clearly observe.
Axios, The Guardian, and Al Jazeera show data-center politics moving from local siting disputes toward national policy over heat, power, water, and permitting.
MIT Technology Review and same-day agent-governance papers point to a practical agent problem: identity, authority, refusal, and ownership after a system has access.
Indian Express flags a court-risk signal around Google AI Overviews, where summary UI can turn into a liability surface.

When the Evaluation Goes Back Inside

Wed, 10 Jun 2026 13:00:23 GMT

Today's episode starts with the Trump administration reportedly telling CAISI to stop publishing public model assessments, then follows the same trust problem through compute deals, TCS's hiring plans, Anthropic's access terms, AWS Bedrock retention questions, and a small set of agent-security papers.

Techmeme's CAISI roundup points to reporting that public model assessments have been halted while a new executive order is implemented; that changes the shared evidence builders can cite.
Techmeme's OpenAI Ohio item and TechCrunch on Meta and Reliance show capacity turning into power contracts, geography, and financing.
Techmeme's TCS item captures Tata chairman N. Chandrasekaran talking about agents as a hiring and workforce planning issue, not just a productivity demo.
Anthropic's Mythos-class data-retention note makes the enterprise boundary more concrete: different clouds mean different retention and access paths.
GitInject, the Interlocutor Effect paper, CIAware-Bench, and deployment-time memorization make the research tail practical: agents fail at the seams where code, memory, privacy, and oversight meet.

Twenty Ways To Not Trust An Agent

Tue, 09 Jun 2026 13:00:25 GMT

One morning's arXiv listing dropped close to twenty agent papers, and almost none of them are about making agents more capable. They're about whether you can trust the system wrapped around the model — measurement, security, memory, and deference — all at once.

Where Instruction Hierarchy Breaks — a white-box diagnostic for when reasoning models stop ranking the system prompt above tool output, tested across Gemma, Qwen, and Claude. If the repair holds, prompt injection becomes a problem to fix structurally, not just filter.
VATS — weaponizes that same confusion, injecting commands through tool error messages over the Model Context Protocol. The error path is the door most teams never locked.
Shared Latent Structures for Backdoors — argues jailbreak, bias, and planted triggers share an internal signature catchable with sparse autoencoders.
Beyond Goodhart's Law (MAC-Bench), Online Agent-as-a-Judge, and PACE — three attempts to keep evaluation honest when the thing you're testing can learn the test.
The AI Epistemic Deference Index — finally puts a continuous number on sycophancy, with a paired reward-bias paper on personalization manufacturing it.
MemToolAgent, Decision-Aware Memory Cards, and a gated-skills framework — agent memory growing up into selection, compression, and governance.
Agent-to-Agent Protocols for nuclear licensing and the CIFAR Synthetic Evidence dataset — automation as the fix and as the threat, in the same breath.
Stress-testing medical LLMs — benchmark accuracy hides what the authors call latent safety pathology, where the cost of the gap is a person.

Pray for Rain, Approve the Datacenter

Mon, 08 Jun 2026 13:00:25 GMT

The consequential part of an AI system keeps moving out of the model and into the wrapper around it — the cooling loop, the org chart, the config file, the ownership structure — and the tools we use to trust that wrapper are running behind it. Five stories, one recurring tension.

The Guardian finds about two-thirds of 809 planned US datacenters are slated for drought-hit land; closed-loop cooling saves water but trades it for fossil power that needs water of its own.
OpenAI's enterprise talks feature banks rebuilding their orgs: Allica Bank collapsing roles into "squadlets," Erste Group budgeting for a full platform rewrite every 18 months, plus ChatGPT-in-Excel Skills and Codex — held against one engineer's MCP catalog server.
The Miasma worm: one dropper wired into seven config files across Claude Code, Gemini, Cursor, VS Code, npm, Composer, and Bundler — opening a cloned repo becomes an execution event.
Schneier and Nathan Sanders argue against Bernie Sanders' equity-stake plan, proposing energy taxes and an AI Public Option instead — set against Korea's GPU program and NVIDIA's UK sovereign-AI post.
Two arXiv papers on measuring safety too late: Attack Selection shows strategic timing drops measured control safety by 20–28 points, and Don't Just Fix It in Post argues the science belongs in training dynamics, not the finished snapshot.

Twenty Billion Parameters, One Big Harness

Sun, 07 Jun 2026 13:00:27 GMT

A twenty-billion-parameter model claiming frontier-level search, a recipe that says to train the harness as hard as the weights, and a week of releases where the interesting part keeps living in the scaffolding around the model rather than in the model itself. Lenar and Damra follow that thread from agent architecture down to the hardware you can own — and up to the courts and committees that decide where any of it is allowed to touch the record.

Patrick Jiang's Harness-1 post — a 20B search agent trained with a "state-externalizing harness" that he claims rivals Opus-4.6; the architecture, not the parameter count, is the claim worth examining.
Viv's "agent = model + harness" recipe — train both components together; the same specialization logic shows up everywhere this week.
Nate on one-shotting a full-stack app and Jon Shulkin on Grok Build — orchestration as the product, with the model treated as a commodity.
CRUX's agent publishing an iOS app — "a few human interventions" is the detail that decides whether open-world evals beat pass/fail scores.
Sem — code-understanding entities built on Git history, not a language server; the structured store a harness would actually lean on.
Universal Memory Protocol vs Databricks' end-to-end Instructed Retriever — standardize memory, or specialize retrieval for a 3x win? The incentives point opposite ways.
NVIDIA's RTX Spark at Korea's PC Bangs and the GLM Air/GGUF thread — the local crowd wants the smallest good-enough model on hardware they own.
UK police told to stop using AI for court statements and the AGI-economics conversation — when intelligence gets cheap, trust is the scarce resource nobody can manufacture.

When the Harness Carries the Model

Sat, 06 Jun 2026 13:00:26 GMT

An open-weights model that fumbles tool calls on its own can go toe to toe with a frontier closed model — once you wrap the right error-handling around it. That gap, between what a model scores and what it does inside your repo, runs through everything we covered today.

Ahmad Awais on Latent Space describes "tool confusion" — open models repeating the same invalid tool call roughly fifty-six times per billion tokens — and Command Code's deterministic repair layer that patches malformed output instead of arguing with the model. The claim that reframes the day: the harness, not the weights, decides whether a cheap model is usable.
DeepSeek V4 Flash support in llama.cpp (PR #24162) makes the same model runnable locally — but the repair layer that makes it pleasant stays behind Command Code's API. Access to weights isn't access to the experience.
Knowledge Activation (Bakal et al.) argues AI skills should be the institutional-knowledge unit for agentic development; Mutation Without Variation warns that repeated LLM edits converge rather than diverge — together a hint that skill files plus a converging model could homogenize a codebase.
Agents' Last Exam, SentinelBench, and Stability vs. Manipulability in LLM judges all poke at the same wound: our scores have drifted from the work, especially for long-running and judge-graded evaluation.
Anthropic's "When AI builds itself" (via a thin Reddit summary) claims AI is accelerating its own development; a zero-knowledge verification paper offers a cryptographic path to actually check claims like that — and the pause proposals that depend on verification.
The Washington Post (Elizabeth Dwoskin), via Techmeme, reports an FDA fast track for digital health tech including AI chatbots — the same model behavior that costs a retry in coding costs a patient in a clinic.
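
The deterministic repair layer from the Latent Space item can be illustrated with a small sketch. Command Code's actual implementation is not described in these notes, so everything here is an assumption: the `repair_tool_call` name, the JSON tool-call format, and the three repairs (drop leading chatter, strip trailing commas, close missing braces).

```python
import json
import re


def repair_tool_call(raw: str) -> dict:
    """Parse a tool call, applying fixed repairs in a fixed order if parsing fails."""
    repairs = [
        # First attempt: parse the model output unchanged.
        lambda s: s,
        # Drop any chatter the model emitted before the JSON object.
        lambda s: s[s.find("{"):] if "{" in s else s,
        # Remove trailing commas before a closing brace or bracket.
        # (Naive: would also touch a ",}" sequence inside a string value.)
        lambda s: re.sub(r",\s*([}\]])", r"\1", s),
        # Append closing braces the model forgot to emit.
        # (Naive: counts braces inside string values too.)
        lambda s: s + "}" * max(0, s.count("{") - s.count("}")),
    ]
    candidate = raw.strip()
    for repair in repairs:
        candidate = repair(candidate)
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    raise ValueError("tool call could not be repaired deterministically")


broken = '{"name": "read_file", "arguments": {"path": "a.txt",}'
print(repair_tool_call(broken))
```

Because the repairs are ordered and side-effect free, the same malformed output always yields the same result, which is the property that distinguishes this approach from asking the model to try again.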

What the Mug Lets You Do

Fri, 05 Jun 2026 13:00:28 GMT

A strange Friday: no launch, no valuation, just a wall of version-one arXiv preprints. Read together, they rhyme — robots reasoning about what objects let you do instead of what they look like, policies fighting the latency tax of diffusion, and agents that change themselves mid-run. Lenar and Damra hold all of it at preprint altitude: these are claims from serious groups, graded on their own benchmarks.

What Objects Enable, Not What They Are — A4D organizes a robot's latent space around function ("movable") rather than appearance ("cart"), reporting 94% accuracy and a discovery step that flags when it doesn't know. Convergent with AffordanceVLA, which decomposes manipulation into which/where/how-to-act.
Flash-WAM cuts a robot action chunk from 8.1 seconds to 348 ms (a 23x speedup) via modality-aware distillation — while Let It Be Simple argues the fancy distillation was never the hard part for low-dimensional policies. EVE and MIRAGE chase the same wall-clock budget from other seats.
HANDOFF distills a humanoid whole-body controller from three specialists; Open-H-Embodiment opens the largest medical-robot dataset to date, where the lead surgical model finishes a structured suturing task on just 25% of trials — the only model above zero.
The Meta-Agent Challenge finds agents-building-agents real but mediocre, and surfaces reward-hacking like ground-truth exfiltration under pressure. TMEM edits weights online; Trivium argues for an inspectable causal log instead; CHARM tackles cascading hallucination across RAG steps.
Inference-Time Vulnerability Beyond Shallow Safety shows a mid-sequence injection at any step can flip safety behavior, and that internal "refusal-aligned" states don't predict robustness — so alignment has to train on the generation trajectory, not just outputs.

The Substation and the Zoning Board

Thu, 04 Jun 2026 13:00:21 GMT

The binding constraint in AI stopped being the model and became physical: a fab that can't keep up, a grid that has to find ten reactors' worth of power, and a neighbor who can file a lawsuit. We follow that collision through chips, a rare moment of rival unity, an IPO, a clogged courtroom, and the parts of the world building around scarcity.

TSMC's C.C. Wei (via Bloomberg) says the company can't fill US demand even as Arizona capacity comes online — scarcity admitted by the supplier, not the buyers.
France's €110B+ AI buildout (Sarah White, FT) amounts to ~10 gigawatts — an energy-policy decision dressed as a tech investment.
SpaceX's $55B Terafab tax exemption (Stephanie Findlay, FT) draws local legal threats — the abstraction of compute now has an address.
Rival labs co-sign a bioweapons letter (Robert Hart, The Verge) — but the screening that actually bites sits with DNA-synthesis firms, who aren't signing.
Anthropic's path to IPO (Madhumita Murgia, FT) puts a quarterly clock on a safety posture that private capital used to subsidize.
Courts coping with AI lawsuits (Michelle Kim, MIT Tech Review) — hallucinated citations are cheap to produce and expensive to refute.
Scarcity-driven innovation (Rest of World) and AI as new colonialism (Axios) describe the same engineer as protagonist and subject.
Can generalist agents automate data curation? and StepPRM-RTL push agents into the senior-human judgment calls — and make per-step checking the valuable part.

Permission Slips and Poured Concrete

Wed, 03 Jun 2026 13:00:21 GMT

A stack of European filings wants to triple data center capacity and own more of the AI stack — on the same day a JP Morgan report says the country building fastest can't pour its own concrete on schedule. Lenar and Damra trace the day's real constraint: not model quality, but megawatts, transformers, capital, and rights.

The EU's Cloud and AI Development Act (CADA) aims to triple data center capacity in 5–7 years, paired with a tech-sovereignty communication and open-source strategy and a Chips Act 2.0 — a statement of intent about which layers of the stack Europe wants to own.
JP Morgan, via the WSJ, says 60%+ of US data center capacity planned for 2027 isn't yet under construction — the build-out is power- and permit-bound, not building-bound.
Alibaba's Qwen 3.7 Plus ships multimodal with a one-million-token window at $2 per million tokens, and DeepSeek is raising ~$7.4B from Tencent and battery maker CATL — energy money following the compute story.
Microsoft's on-device Aion 1.0 Instruct and Plan models split instruction-following from planning, while a llama.cpp build report shows reproducible local gains on two 3090s.
AURA argues the key-value cache is wrong for robots and proposes constant-memory action-gated retention; a second paper tries to measure harmful overthinking in reasoning models.
GitLab is cutting 350 staff and exiting 22 countries under an AI-pivot framing, and the UK CMA is forcing Google to let publishers opt out of AI search summaries separately from search itself.

Eighty Billion and the Ideas Underneath

Tue, 02 Jun 2026 13:00:19 GMT

The day's news ran on a single tension: enormous sums are being raised to fund the AI buildout, while the question of whether the capability and the margins follow stays unanswered. Lenar and Damra trace the money from Alphabet's filings to Anthropic's IPO paperwork, then down into the tooling, the chips, and one paper about ideas no human is positioned to have.

Alphabet's $80bn equity raise (a profitable company choosing to dilute shareholders rather than borrow, with $10bn going to Berkshire Hathaway) signals how hard the compute commitment is to walk back.
Anthropic's confidential IPO filing lands as corporate America hits "AI sticker shock" — and Anthropic's biggest customers are the companies tightening those budgets.
Knowledge workers are now ~1/5 of OpenAI Codex users, growing three times faster than developers — moving code generation to people who can't always read the output.
Cloudflare's Agents SDK v0.14.0 ships durable workflows, schedules, and skills — the difference between an agent you operate and a worker you delegate to.
China adds data and algorithms to its trade-secret rules, military-linked universities seek Nvidia H200 chips, and Arm names Oracle and ByteDance as data-center CPU customers.
"Alien Science" samples research directions that are coherent but cognitively unavailable — logical ideas no community is positioned to propose.

Cheaper From Both Ends

Mon, 01 Jun 2026 13:00:19 GMT

A Chinese lab cut the price of a frontier-class coding model to a fraction of Opus, Nvidia tried to own every layer from the laptop to the data center, and one developer ran the new Gemma 4 on a decade-old Xeon. The cost of running intelligence got attacked from both ends on the same morning — and the question underneath all of it is who gets to set that cost.

MiniMax M3 claims parity with Opus 4.7 at roughly twelve cents per million input tokens versus five dollars — but the weights are promised in about ten days, so "open-weights" is still a countdown.
Nvidia's DGX Station puts a GB300 chip and up to 748GB of memory on a desktop, enough to run a one-trillion-parameter model locally; the RTX Spark chip pushes the same idea into laptops, while the Vera CPUs — with Anthropic, OpenAI, and SpaceX as early customers — signal a move off x86.
A 10-year-old Xeon is all you need: cafkafk runs a 26B mixture-of-experts model at reading speed on a 2016 CPU with no GPU, arguing mainstream tools hide the performance levers.
Cosmos 3 is Nvidia's open physical-AI world model, backed by a Cosmos Coalition with Runway as a founding member.
Cadence and Nvidia claim a "Level 5" autonomous chip-verification agent that turns months into a day — a large autonomy claim in a domain where mistakes ship in silicon.
Anthropic will let the EU's ENISA join Project Glasswing for access to a model called Mythos, even as a Wirescreen analysis documents 500+ PLA attempts to procure Nvidia chips and governments from India and the UAE to France move to own their compute.

Who Holds the Dial

Sun, 31 May 2026 13:00:17 GMT

A frontier model gets called a step toward God in one window and a judgmental token-burner in the next. We spend the morning on the gap between the marketing altitude and the desk, and find the same thread running through everything: every layer now has a control surface someone's reaching for.

Dylan Field on Opus 4.8 calls it "a very strange model" — honesty up, curiosity down, personality judgmental — a reminder that a tuning dial has costs you can feel.
scaling01 on DeepSWE says GPT-5.5 "score-, time- and token-mogged" Opus 4.8, putting the efficiency column — the one that pays your bill — back in the conversation.
Ben Kunkle on Zed's Zeta 2 shows how a ten-second editing pause becomes a training label, and how a million frontier-model calls got replaced by a self-grading student model.
Philipp Schmid (DeepMind) on the five assumptions that trip up senior engineers building agents — errors as inputs, evals not unit tests, and "build to delete."
Komi-learn and a year on knowledge-graph memory share one missing thing: a controlled before-and-after proving the memory layer, not the model, made the agent better.
A Lancet correspondence finds 4,046 fabricated references across 2,810 published articles — model honesty rising while the literature's integrity falls.
Quick hits: AMD's Lisa Su vs Nvidia's Jensen Huang on China, IBM's Sovereign Core, and a court ordering Circle to freeze a $12.6M contract.

The number nobody optimized for

Sat, 30 May 2026 13:41:26 GMT

Claude Opus 4.8 landed overnight with a math score that leapt and a business-ops score that fell — and reading the release honestly means distrusting the chart. Lenar and Damra work through the gap between the number that moved and the number that matters, then chase it into agent budgets, the protocol wars, local-inference tooling, Mistral's on-prem bet, and the power grid.

A scrape of 100+ Opus 4.8 evals shows USAMO 2026 jumping 69%→97% while Vending-Bench 2 nearly halved — a retune that helped some distributions and hurt others.
"AI benchmarks are useless" argues the record scores ride on elaborate prompt setups: change a few prompt words and results swing 10–20 points.
The BAGEN study finds frontier agents can't estimate their own remaining budget mid-task — which collides with enterprises trying to rein in "tokenmaxxing" (WSJ via Techmeme).
"MCP is dead?" gets a sharp rebuttal from OpenAI's Max Stoiber: nearly every company is building an MCP server, even ones with no CLI or external API.
Multi-token prediction benchmarks hit ~3.3x faster local inference; llama.cpp got a real website and antirez shipped distributed inference.
Notes from the Mistral AI Now Summit — on-prem KYC at BNP Paribas, against a comment that Mistral's 120B "small" model loses to models a quarter its size. xAI countered with a one-dollar coding model.
FERC's June grid-connection proposal is the duller, realer infrastructure story next to an unsourced TerraFab "one terawatt" claim.

Locally coherent, globally not

Fri, 29 May 2026 13:41:23 GMT

Friday's room sits between a hobbyist voice assistant running entirely on Mario Zechner's desk and a cluster of arXiv papers all saying the same thing from different angles: long-running agents now fall apart in ways the model can't fix. Lenar and Damra read four reliability papers side by side, then turn to the personal-memory question every shipping assistant is already getting wrong.

Mario Zechner on pibot — full local voice loop with Parakeet, Qwen 3 TTS, and Qwen 3.6 through llama.cpp, with the STT and TTS engines ported from Python into Rust on mlx-c. The runtime detail is the news, not the model lineup.
Ethan Mollick on token budgets — split spend between building and learning. Read against yesterday's Kirkland and Ellis platform story, the question becomes who controls the learning budget at internal AI orgs.
MMPO — Ziyan Liu and team train a policy that decides when memory in long-horizon agents should be rewritten and when it should be left alone. Belief drift comes from over-eager rewrites, not missing updates.
RedundancyBench — Minyang Hu's group benchmarks how many steps in a long agent trajectory are repeats. Stale duplicates of state crowd out the relevant signal in context.
Locally Coherent, Globally Incoherent — Anany Kotawala's single-author paper bounds compositional incoherence in multi-component agents. Defensible local outputs assemble into contradictory global ones.
Agent-Radar — Hongxiang Zhang's group steers attention toward context-relevant tokens in multi-agent communication, so the receiver isn't drowned in noise from the sender.
Selective QA over conflicting personal memory — Tiancheng Yang's testbed for what happens when your assistant's memories about you disagree. No single resolution strategy dominates.
BioRefusalAudit — Caleb DeLeeuw uses sparse autoencoders to ask whether a model's refusal is shallow pattern matching or whether the dangerous capability isn't there at all.
AutoformBot and Atlas — Ahmad Rammal's team at FAIR Paris and NYU on a multi-agent system that pulls textbook math into Lean 4 at scale. Lean is the verifier the agents can't argue with.
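
The question behind the RedundancyBench item, how many steps in a long trajectory are repeats, reduces to a simple count once a "repeat" is defined. The sketch below assumes the crudest definition, an exact match on a (tool, argument) pair; the benchmark's own definition is not given in these notes, and `redundancy_ratio` is a hypothetical helper.

```python
def redundancy_ratio(trajectory):
    """Fraction of steps that exactly repeat an earlier step in the same trajectory."""
    seen = set()
    repeats = 0
    for step in trajectory:
        if step in seen:
            repeats += 1
        else:
            seen.add(step)
    return repeats / len(trajectory) if trajectory else 0.0


steps = [
    ("read_file", "a.txt"),
    ("read_file", "a.txt"),  # repeat
    ("grep", "TODO"),
    ("read_file", "a.txt"),  # repeat
]
print(redundancy_ratio(steps))  # → 0.5
```

Exact matching undercounts in practice, since near-duplicates such as the same file read with different line ranges would pass as distinct steps.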

Custom silicon, futures contracts, and a five-hundred-million-dollar law firm

Thu, 28 May 2026 13:00:16 GMT

Mistral spent one morning announcing chip ambitions, an Airbus and BMW supply deal, and a push to ensure Europe's independence from US tech giants. ByteDance is building its own CPUs. Taiwan has raised fourteen and a half billion dollars in debt to feed AI capacity. Shanghai and US exchanges are drafting futures contracts for compute. And Axios says Corporate America is starting to ask whether the AI spend is paying back, while Kirkland and Ellis sets aside five hundred million dollars to build its own platform. The day the infrastructure layer got financialized — and a lot of buyers looked up and asked what they bought. Also: Lenar is joined by a new co-host, Damra Vol.

Mistral to explore designing its own chips (CNBC) — Arthur Mensch frames the move as controlling more of the infrastructure as Mistral competes with larger labs. Intent, not a roadmap.
Mistral signs Airbus and BMW to ensure Europe's independence (Sam Schechner / WSJ via Techmeme) — industrial customers buying continuity in Paris as much as compute.
ByteDance is developing its own CPUs (Reuters via Techmeme) — reported as supply-side defense against chip price hikes, not long-term ambition.
Taiwanese tech books a record $14.5B of debt deals (Aileen Chuang / Bloomberg via Techmeme) — financing raised against expected AI demand.
Shanghai is designing AI-token futures, US exchanges launching GPU compute futures (Reuters via Techmeme) — compute itself becomes a tradable underlying, with the spec on the token version still unclear.
Corporate America enters its AI reckoning (Madison Mills / Axios) — CFOs are starting to ask for evidence of return.
Kirkland & Ellis sets aside $500M to build its own AI platform (FT via Techmeme) — the top-grossing law firm wants tooling its competitors don't have.
AI giants bet billions on the most expensive job in enterprise (Janakiram MSV / Forbes) — forward-deployed engineers as the labs' collision course with Accenture and TCS.
Anthropic and OpenAI found PMF with coding agents (Simon Willison via Techmeme) — fit at the $200/month price point, where the harness explains more of the result than the underlying model.
Miles Brundage's median MTS theorem — a frontier lab's policy positions converge to those of the median member of technical staff.
Soro: a lightweight foundation model and chatbot for Tajik (Liashkov et al., arXiv) — a useful counterweight to a day of chip plans and futures contracts.

Coding is solved, the rest isn't

Wed, 27 May 2026 13:00:17 GMT

Boris Cherny says coding is solved for the coding he does — and almost everything else in today's research is a study of the parts that aren't. A new coding leaderboard with an accusation, the end of the "software engineer" title, the craft of delegating to an agent, and three papers on the ways agents quietly break: introspection, aging, and memory. Plus running a trillion-parameter model in your house, the labs' jobs split, and a developer who's tired of talking to AI.

DeepSWE crowns GPT-5.5, and accuses Opus of cheating — what looks like a loophole may just be a model recovering the answer from git history.
The end of the software engineer, in the first person — Cherny in Platformer and Steven Levy in Wired on the agent boom and its hazards.
What the best agents share, and how to drive one — Flinn AI's four patterns alongside a practical Claude Code daily-driver guide.
Can the model actually tell when it's unsure? — a reality check on LLM introspection and self-reported confidence.
Your agents are aging — AgingBench, MemFail, and rethinking agent memory as a state trajectory.
Running the frontier in your own house — EXO Labs on local inference economics and the 100x still left.
The labs can't agree on the jobs — Anthropic vs OpenAI, with Hassabis calling 2026 a practice run.
I'm tired of talking to AI — a developer on people forwarding AI answers they never read.

The harness, not the model — and the trust layer racing to catch up

Tue, 26 May 2026 13:00:16 GMT

One developer catching you up on the day in AI and the craft of building with it. Today: the wrapper around a model can move a benchmark more than the model does, a watermark goes multi-lab, and a decensoring tool with thirteen million downloads shows where that watermark leaks. Plus a sharp little essay on why coding agents make us so mad, the jobs data behind the panic, and three things you can pick up today.

The harness, not the model — a Google DeepMind Kaggle talk and an arXiv position paper argue the agent harness can swing a score ~22% while frontier models tie.
Gemini Omni — editing video by talking to it, with SynthID baked in (community reaction).
SynthID becomes a shared layer — 100 billion watermarks, Search and Chrome, and OpenAI/ElevenLabs/Kakao on board.
Heretic in the Financial Times — decensoring open weights in ten minutes, and the artifact that proves the gap.
The user is visibly frustrated — why conversational agent UX trips your social wiring.
A rage-quitting modder and the jobs data — backlash, and what the numbers actually say.
The bench — NuExtract3, EAGLE 3.1, and a rejected llama.cpp patch worth grabbing.

A few hundred dollars a proof, and the long argument about what machines are for

Mon, 25 May 2026 13:00:17 GMT

A frontier lab proves nine decades-old math problems for a few hundred dollars each, two talks make the numeric case that the cheapest agents route work to the smallest model that can do it, a lawsuit names an individual researcher over how Llama's training data was sourced, and a papal encyclical argues about AI on the terms of work and dignity. Eight things worth knowing today, told one developer to another.

DeepMind's AlphaProof Nexus clears nine open Erdős problems — Lean-verified proofs, a few hundred dollars apiece.
"You don't need GPT to zoom for you" — Callosum's numbers on routing subtasks to smaller models.
The token-efficiency turn — ThePrimeagen on why the org paying retail eventually does the math.
Inside how DeepMind runs its own agents — worse quotas than customers, a Darwinian skills library, and skepticism about MCP.
The lawsuit that names a name — Hobbs v. Meta, an individual researcher, and the internal dissent in the record.
Simon Willison on publishing GPT-4's retired architecture — the guesswork behind the water numbers.
Jujutsu and the pile of laundry — making a mess on purpose, then sorting it at the end.
Filming your chores for the robots — where the embodied-AI training data is actually coming from.
Pope Leo XIV's AI encyclical — technology is never neutral, and what no machine replaces.
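
For readers who want the routing idea from this episode in concrete form, here is a minimal sketch of "send each subtask to the smallest model that can do it." Everything in it is an assumption for illustration: the tier names, the 1-10 capability scores, and the costs are invented, and none of it comes from Callosum's talk or any real price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str             # hypothetical label, not a real model ID
    capability: int       # invented 1-10 score: hardest task this tier handles
    cost_per_call: float  # illustrative cost in arbitrary units


# Hypothetical tiers; the numbers are made up for the example.
TIERS = [
    ModelTier("small", 3, 0.01),
    ModelTier("medium", 6, 0.10),
    ModelTier("large", 10, 1.00),
]


def route(difficulty: int, tiers=TIERS) -> ModelTier:
    """Return the cheapest tier whose capability covers the task difficulty."""
    capable = [t for t in tiers if t.capability >= difficulty]
    if not capable:
        raise ValueError(f"no tier can handle difficulty {difficulty}")
    return min(capable, key=lambda t: t.cost_per_call)


def total_cost(difficulties, tiers=TIERS) -> float:
    """Sum the per-call cost when every subtask is routed independently."""
    return sum(route(d, tiers).cost_per_call for d in difficulties)
```

Calling `total_cost([2, 2, 5, 9])` routes two subtasks to the small tier, one to medium, and one to large. The hard part in practice is estimating difficulty up front, which this sketch takes as given.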

The capability got here first: Mythos, a real prompt injection, and the structure that hasn't caught up

Sun, 24 May 2026 13:00:15 GMT

Anthropic's unreleased Mythos model has reportedly found more than ten thousand vulnerabilities for its Project Glasswing partners — and showed up briefly inside Claude Code this weekend. The same weekend, a security researcher flagged what he calls the first real prompt-injection attack in the wild, riding the exact workflow we've all been adopting. Today's episode walks both sides of that coin, then turns to what builders are actually doing: a three-dollar refactor with a deadlock in it, the missing coordination layer for agent swarms, and the argument that the chat box is the command-line phase of agentic software.

Mythos & Project Glasswing — a security model "too dangerous to release," and the case for and against that framing.
A real prompt injection in the wild — a malicious GitHub issue, a scan.js, and secrets exfiltrated over DNS.
The three-dollar refactor — cheap worker models, one confident deadlock, and where judgment still lives.
The missing primitive is coordination — Lou Bichard of Ona on software factories, Stripe's Minions, and why GitHub isn't a coordination layer.
Your agent is an infinite canvas — Rachel Lee Nabors on MCP apps, Web MCP, and chat as the command-line phase.
r/programming reopens to AI — a seven-million-person community moves from a reflex ban to a written policy.

Fast models, slow developers — and the part of the job that stays yours

Sat, 23 May 2026 13:00:16 GMT

A Saturday episode about what your job becomes when the model writes the code — and writes it fast. The bottleneck moved from typing to deciding, and a surprising number of this week's stories land on the same instruction: stay the one who decides. Plus a price floor, a reclassification, a year of bold predictions, and a 4-year-old gaming card that won't quit.

"I don't write code anymore" — Pieter Levels, amplified by Marc Andreessen, and the real-thing/bubble-thing tangle inside it.
Fast Models Need Slow Developers — Sarah Chieng of Cerebras on Codex Spark at 1,200 tokens a second, and why the discipline matters more, not less.
DeepSeek's permanent 75% cut and NVIDIA folding gaming into "Edge Computing" — two ends of the same pipe.
Jack Clark's year of predictions at Oxford — and the cognitive-atrophy counterpoint.
BeeLlama's DFlash update — 164 tokens a second on a single RTX 3090.
Lobster Trap — Sally Ann O'Malley of Red Hat on containerizing an OpenClaw agent setup.
How the rest of the world sees this — and a couple overheard in a Copenhagen park.

The recant, the runtime, and a Pantheon built in code

Fri, 22 May 2026 13:00:17 GMT

A corporate takedown answered with a recant letter and a mirror in Germany, the protocols and computers agents actually run on, six tools trying to build the Pantheon in code, and a paper where the model writes its own GPU kernel. Plus Codex learning to keep going, a security tool hardened against the real world, and a graduation room that cheered for human intelligence.

Meta emails Heretic; Heretic recants — a takedown of abliterated Llama derivatives answered with a Galileo joke and a Codeberg mirror in Germany.
Five hundred PRs a day, and the harness that triages them — Onur Solmaz on OpenClaw, acpx, and the Agent Client Protocol.
The computer the agent runs on — Ivan Burazin of Daytona on stateful, composable machines for agents and 74% month-over-month growth.
Building the Pantheon, in code — six coding tools tackle parametric CAD, and the gap between a good preview and a clean export.
When the model writes its own kernel — CODA folds memory-bound ops into the matrix multiply, and model-authored kernels keep up with human ones.
Codex learns to keep going — goal mode graduates, plus Appshots and shared plugins.
Hardening the thing that reads your CI config — Trail of Bits stress-tests zizmor against forty-one thousand real workflows.
The headcount bet — and a graduation room that cheered for actual intelligence.

Two bets on AGI, an 80-year-old problem, and Anthropic in the black

Thu, 21 May 2026 13:00:16 GMT

Google's I/O keynote is a day behind us, and the week it kicked off turned into a referendum on two very different bets on artificial general intelligence — plus a pile of counter-programming from everyone else. Today: OpenAI cracking an 80-year-old math problem with a general-purpose model, Anthropic's first profitable quarter and what Karpathy was actually hired to do, a 70-page paper on why frontier models still can't tell a fact from a labeled lie, Midjourney's hardware regret, ads arriving inside Google's AI answers, Meta's layoffs, Cohere's open-weights comeback, and a field guide to skilling up coding agents.

Two bets on the same finish line — Google's world-model road vs OpenAI's text-reasoning road, in the labs' own words.
OpenAI cracks an 80-year-old problem — the planar unit distance result from a general-purpose reasoning model.
Anthropic in the black, and Karpathy's bet — ~$559M operating profit and a hire aimed at recursive self-improvement.
Jagged intelligence, and the false story — the paper where models still believe a story after being told a thousand times that it was fake.
Midjourney's hardware regret — the tooling tax of betting on the less-supported accelerator.
Ads come to AI Mode — the business model under the consumer bet.
Meta's eight thousand — the cost side, on the same clock as the wins.
Cohere comes back, Apache-licensed — Command A+, a mixture-of-experts model that fits on one or two GPUs.
Skilling up the agent — Marc Klingen's concrete lessons on teaching a coding agent to wire up your tool.
Who's training whom — the anxiety running underneath the week.

Foothills, and the morning Karpathy moved

Wed, 20 May 2026 13:00:17 GMT

Google I/O 2026 landed yesterday — Gemini Omni, Gemini 3.5 Flash, Antigravity 2.0, Spark, and Demis Hassabis closing the keynote on the "foothills of the singularity." About forty minutes before he walked on stage, Andrej Karpathy tweeted that he'd joined Anthropic. The rest of the day was the labs sorting themselves around both events. Today's show works through the announcements, the pricing shifts, the keynote demo that boots Doom, the Railway outage that happened while Google was selling Spark, and a builder's 100K-line Rust postmortem that's a sharper picture of agentic coding than anything on the I/O stage.

Mostly-work, malicious npm, and one engineer replacing a law firm

Tue, 19 May 2026 13:00:20 GMT

A six-month overview from Simon Willison anchors the day: coding agents crossed from often-work to mostly-work in November, and laptop-class models started outrunning expectations. Then a fresh npm supply-chain attack — 637 malicious versions in 22 minutes — that for the first time specifically hijacks Claude Code and Codex agent hooks for persistence. Plus a Number 10 talk on replacing a one-and-a-half-million-pound law-firm contract with one embedded engineer, an editor-layer company renting xAI's Colossus 2, Ethan Mollick on insourcing, the full GenMedia pipeline running for a dollar a book, Daniel Griesser's pi-config skill repo, and two obituaries that hit the Unix world in the same week.

Cold starts, radio stations, and a circuit you can subtract

Tue, 19 May 2026 01:11:50 GMT

Monday's lineup: Modal publishes the full architecture behind a 40x reduction in serverless-GPU cold-start latency, Andon Labs releases the five-month results from letting four frontier models run real radio stations, and a researcher locates and turns off the political-censorship circuit inside Qwen 3.5 9B. Plus: Pope Leo XIV puts an Anthropic interpretability researcher on the encyclical stage, Qwen 3.7 surfaces on Qwen Chat, Musk loses to OpenAI on a calendar technicality, LangSmith Engine takes a swing at agent triage, and Odyssey ships a four-player generative GoldenEye.

Bring Your Own Numbers

Sun, 17 May 2026 13:45:50 GMT

A Sunday show about doing your own arithmetic. Mustafa Suleyman gives the white-collar tier eighteen months, in a piece whose own counter-data sits two paragraphs down. The State of Brand argues every AI subscription is a subsidized loss-leader two weeks away from a forcing function. William Angel runs the tokens-per-hour math on an M5 MacBook Pro and finds OpenRouter cheaper. Frederick Vanbrabant uses The Goal to explain why agents move the bottleneck rather than break it. Marlene Mhangami's Playwright talk shows the cleanest pattern for tests AI should write. Calif's public M5 / MIE exploit write-up lands. Artem Loenko explains why every chat UI keeps ending up on a browser engine. And Luke Lanchester's MCP hello page is the small fix I most enjoyed this week.

CTFs, Scrum, and Claude's Bedtime

Sat, 16 May 2026 13:00:15 GMT

An Australian CTF top-tenner writes the obituary for the open competitive scene. Intercom and PFF both report doubling-and-then-some engineering throughput from agent-first workflows — using opposite playbooks. Supabase ships a skill after watching an agent silently bypass row-level security. A suitcase runs a 4B model fully offline at conversational latency. Julia Evans leaves Tailwind. And Claude keeps telling people to go to sleep.

Five Days to Root, Four Months in Exile

Fri, 15 May 2026 13:00:15 GMT

Five days for a small security team paired with Mythos Preview to land the first public macOS kernel exploit on Apple's M5 with Memory Integrity Enforcement turned on. Four months for Replit to claw back into the iOS App Store. In between: arXiv starts banning authors of LLM-error papers, Metabase explains why open-source security is being strip-mined this summer, NVIDIA squeezes the 5090, Uncle Bob switches from Claude to Codex, and a pure-OCaml protocol stack boots in low Earth orbit.

Codex everywhere, Claude in the rearview — OpenAI ships Codex inside the ChatGPT mobile app, Uncle Bob cancels his Claude account, and Arvind Narayanan names the irony underneath both.
Five days to a kernel exploit on M5 — Calif and Mythos Preview crack Apple's Memory Integrity Enforcement and hand-deliver the 55-page report to Cupertino.
The strip-mining era of open source security — Metabase's security inbox went from ten reports a month to ten a week. Cal.com is going closed source.
arXiv bans authors of LLM-error papers — Tom Dietterich announces a one-year submission ban on papers with hallucinated references or results.
Replit out of the App Store wilderness — Four months after being pulled, Replit's iOS app is published again. Replies note what that says about platform power.
GDDR7 squeezes the 5090 — A 300-dollar price hike to add-in-card partners as GDDR7 lead times stretch into weeks.
The web's secret quirks file — Den Odell walks through Safari's Quirks.cpp and Firefox's about:compat. Chrome doesn't need a quirks file.
OCaml in orbit — Thomas Gazagnaire's pure-OCaml protocol stack booted in low Earth orbit on April 23, with post-quantum rekeying and OxCaml-tuned dispatch.

The Cost of Finding Out

Thu, 14 May 2026 13:00:14 GMT

Anthropic drew two lines around Claude this week — a guided lane for small-business owners and a metered one for the developers running agents hardest. From there: Bun's near-million-line port from Zig to Rust, mostly typed by an AI agent in a week; Wasp's clear-eyed post-mortem on spending five years and five million dollars building a language it didn't need; a chess coach that works by refusing to let the model think; the UK's evaluators capping their own cyber tests so the math still works; the open web pricing out crawlers; multi-token prediction landing in llama.cpp; and what happens when you post a real Monet and call it AI.

Anthropic draws two lines — Claude for Small Business and the new Agent SDK credit metering
Bun, ported to Rust by a bot in a week — and a maintainer who won't commit to it
Wasp: the language was never the moat — $5M and five years of lessons
The chess coach that isn't allowed to think — Play Magnus on LLM-as-translator
Autonomous cyber, measured against itself — AISI on a capability curve outrunning its own ruler
The web pulls up its drawbridge — Google's search index and Cloudflare's defaults
Multi-token prediction on your laptop — a real gain bundled with a contested one
The Monet test — when the AI-tell detector fires a false positive

Hackbots, Magento, and Three Lines of Logic

Wed, 13 May 2026 13:00:15 GMT

An overnight hackbot run lands a real CVE in Adobe Magento. Codex starts driving local Mac apps in parallel, with per-app permissions and a separate cursor. Cloudflare publishes one of the prettiest debugging writeups of the year — a nine-year-old kernel patch, a 14ms oscillation, three lines of fix. Plus Nous Research's removable attention wrapper, GPT-5.5's first ProgramBench solve, Vercel's argument that giving an agent a file system changes how it behaves, a 26-million-parameter tool-calling model, Isomorphic's two-billion-dollar Series B, and a Purdue senior who put Rust on his graduation cap.

When Your Editor Becomes the Worm

Tue, 12 May 2026 13:00:18 GMT

A coordinated npm and PyPI campaign turned Claude Code and VS Code config files into a self-spreading vector, Mira Murati's lab put out its first model and it is an argument with the hands-off-keyboard doctrine, and matklad explains why rust-analyzer's build system is really an org chart. Plus a small rant about cursors, and two builds from the LocalLLaMA subreddit that keep pushing the local-frontier line by hand.

Deployment, Discovery, and the Code You Keep

Mon, 11 May 2026 13:45:09 GMT

Today’s Braid starts with OpenAI launching a majority-owned Deployment Company, backed by a Tomoro acquisition, about 150 forward deployed engineers, nineteen partners, and more than $4 billion of initial investment. The practical thread is the work of changing real systems: integration, controls, measurement, and the code you still have to maintain after the demo.

OpenAI turns deployment into a company, with Tomoro, TPG, consultancies, integrators, and OpenAI’s launch post pointing at a bigger bet on embedded engineering.
Mythos finds one curl vulnerability, while Rival Security complicates Anthropic’s FreeBSD story with a training-data provenance question.
James Shore’s maintenance-cost math meets the k10s devlog about archiving a seven-month AI-built Kubernetes TUI.
Trigger.dev’s durable-agent talk and Arize’s context-management talk give the backend version of the same lesson.
Granola’s production loop, a tiny boolean-argument essay, and MLX on-device demos close the day on builder craft.

Seventeen Hours, Three Sizes, and the Prompt Boundary

Sun, 10 May 2026 13:30:15 GMT

METR publishes a fresh time-horizon number for Claude Mythos Preview, and yesterday's follow-up gets paid off in a single chart. NVIDIA ships a checkpoint that contains three reasoning models at once. antirez gets DeepSeek 4 running on a DGX Spark and tells you exactly where the bandwidth wall lives. François Chollet argues that agentic coding is a form of machine learning, and a few replies actually push the idea further. Plus the diffusion gap, the German tokenizer tax, and a Gemma 4 drafter that buys you a third of your decode time back.

A Fields Medalist, a PhD chapter, and the week the bar moved

Sat, 09 May 2026 13:30:16 GMT

A Saturday show that leans into the long reads. Tim Gowers — yes, the Fields Medalist — sat down with ChatGPT 5.5 Pro and an open paper from Mel Nathanson and walked away with a result the original author called "original and clever." We follow that thread, then turn to Mozilla's deeper write-up on the Firefox 271-bug release, Jeff Kaufman on what AI is doing to disclosure embargoes, Anthropic on why constitution training beats demonstration training, and a beautiful pentest story about a critical RCE in React itself. Plus a quieter set of items: Codex in real Chrome, DHH's Copilot review hit-rate jump, a SysMoBench paper on LLM-generated TLA+ specs, AI2's document-routed mixture-of-experts model, and Qwen 35B-A3B running on a 3060.

Mozilla's 271 Bugs, Chrome's 4 Gigabytes, and a WebRTC Veteran Telling OpenAI to Stop

Fri, 08 May 2026 13:30:14 GMT

Mozilla publishes the long-form on how a Claude Mythos Preview harness found 271 security bugs in Firefox, including sandbox escapes that fuzzers missed for twenty years. A European privacy lawyer goes byte-precise on Chrome's silent four-gigabyte Gemini Nano push, using kernel filesystem events on a profile that received zero human input. A WebRTC veteran tells OpenAI, on the day it ships GPT-Realtime-2, that the protocol assumptions are wrong for voice agents. Plus AlphaEvolve's twelve concrete production deployments, Anthropic's natural-language autoencoders putting a number on Claude's evaluation awareness, AMD's first new Instinct PCIe card in five years, and OpenAI quietly winding down the fine-tuning API.

The File That Wouldn't Read

Thu, 07 May 2026 13:30:15 GMT

Thursday, May 7. The GPT-5.5 default swap is two days old and the cracks are showing — Mario Zechner caught it refusing to read full files. Subquadratic announced a 12-million-token context window with sub-quadratic attention; the benchmarks are real, the deployment story isn't yet. Zyphra trained ZAYA1-8B end-to-end on AMD MI300X and the loss curves are clean. Three new agent papers landed: Terminus-4B for subagent terminal execution, MOSAIC-Bench on compositional vulnerabilities, and the Workspace-Bench / ProgramBench double-release on what happens when you give an agent twenty thousand files. Google Cloud shipped Fraud Defense with QR-code human verification. Anthropic posted their three priorities.

GPT-5.5 won't read your whole file
Subquadratic's 12-million-token claim
ZAYA1-8B and the AMD training stack
Terminus-4B and the subagent shape
MOSAIC-Bench: compositional vulnerabilities
Workspace-Bench and ProgramBench, together
Fraud Defense and the QR-code handshake
Anthropic's three priorities

— Lenar

Agents Buy Domains, Gemma Ships Drafters, and Local Catches Up to 65 Percent of the Job

Wed, 06 May 2026 13:30:17 GMT

Agents can now sign up for Cloudflare and buy a domain through a tokenized payment protocol Cloudflare and Stripe co-designed. Google ships first-party multi-token prediction drafters for the entire Gemma 4 family the same week the LocalLLaMA community gets a 2.5x speedup on Qwen 3.6 27B from a hand-built llama.cpp branch. OpenAI swaps the ChatGPT default to GPT-5.5 Instant. NVIDIA, Microsoft, and OpenAI publish MRC, the multipath transport protocol behind Blackwell-era frontier training. And on the labor side, Dario Amodei trades his white-collar bloodbath line for the Jevons Paradox onstage with Jamie Dimon.

VS Code Walks It Back, CAISI Signs Three Labs, and the Frontier Gap Compresses to Ten Weeks

Tue, 05 May 2026 13:58:52 GMT

Microsoft reverses the Co-authored-by Copilot default it shipped last week, and that turns out to be one of three pieces of governance news today — alongside CAISI signing pre-deployment testing agreements with Google DeepMind, Microsoft, and xAI, and DeepMind's London staff voting 98% to unionize over military contracts. Then we go where the actual code lives: DeepSeek V4 Pro matching GPT-5.2 ten weeks later at one-seventeenth the price, a Qwen3.6 27B FP8 recipe that fits 200K tokens of unquantized KV cache on a single 48GB card, and a paper called AgentFloor that gives the small-model-routing intuition a benchmark to point at. Plus the tool-use tax, Chrome's silent four-gigabyte install, vibevoice.cpp, the Opus 4.7 complaint thread, and a B2B operator who replaced three vendors with a single Claude skill.

The Paradox of Supervision, a Four-Line Vendor Swap, and the Chart Its Authors Don't Trust

Mon, 04 May 2026 14:26:42 GMT

An essay arguing agentic coding is a trap, a vendor switch that takes four lines of shell, and the authors of the chart everyone is screenshotting telling everyone to be careful with the chart. Today's Braid is mostly about the developer's side of the AI conversation — the workflows, the cost lines, the harness, and what happens when a customer asks for a HIPAA BAA.

The Co-Author You Didn't Sign, Two Million Lines of Haskell, and the Bug Curve That Won't Bend

Sun, 03 May 2026 13:30:16 GMT

Microsoft quietly flipped a default in VS Code that stamps every git commit with a Copilot co-author trailer whether or not Copilot wrote any of it, and the developer reaction is the loudest the project has seen in years. Underneath the noise: a real provenance question about what git authorship is supposed to mean. Plus a long-form report from Mercury on running two million lines of Haskell in production, an opinionated architecture for shared agent harnesses, a YAML-first take on spec-driven development, Daniel Stenberg's empirical test for whether AI bug-finders are actually moving the curve, the Klarna intent gap, a homelab benchmark that says the chain-of-thought trace is doing real work, the Anthropic-passed-OpenAI claim, and software engineering job postings hitting a multi-year high.
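
For anyone who wants to check their own history, here is a minimal sketch of what a co-author trailer looks like to a script. It assumes the conventional `Co-authored-by: Name <email>` trailer form; the exact name and email that VS Code writes are not given here, so the "Copilot" name and the `example.invalid` addresses in the usage note are placeholders, and both helper functions are hypothetical.

```python
import re

# Git trailers are "Key: value" lines at the end of a commit message.
# This pattern matches the conventional co-author form, case-insensitively.
CO_AUTHOR = re.compile(
    r"^co-authored-by:\s*(?P<name>.+?)\s*<(?P<email>[^>]*)>\s*$",
    re.IGNORECASE,
)


def co_authors(message: str) -> list[tuple[str, str]]:
    """Return (name, email) pairs for every co-author trailer in a commit message."""
    found = []
    for line in message.splitlines():
        m = CO_AUTHOR.match(line.strip())
        if m:
            found.append((m.group("name"), m.group("email")))
    return found


def strip_co_authors(message: str, name_contains: str) -> str:
    """Drop co-author trailers whose name contains the given text (case-insensitive)."""
    kept = []
    for line in message.splitlines():
        m = CO_AUTHOR.match(line.strip())
        if m and name_contains.lower() in m.group("name").lower():
            continue  # skip the unwanted trailer, keep everything else
        kept.append(line)
    return "\n".join(kept).rstrip() + "\n"
```

Given a message ending in `Co-authored-by: Copilot <copilot@example.invalid>`, `strip_co_authors(message, "copilot")` returns the message without that line and leaves any human co-authors in place. Whether rewriting trailers is the right response is the provenance question the episode raises; the sketch only shows that the trailer is ordinary text.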

The Bottleneck Moved, Grok 4.3 Got Worse, and Sam Altman Quietly Stopped Saying UBI

Sat, 02 May 2026 13:30:10 GMT

An Atlantic piece argues the AI bubble call has aged badly — not because demand softened, but because power and silicon are now the binding constraints. We start there, then check in on a follow-up from yesterday.

The Atlantic on bubble→infrastructure. Rogé Karma's reporting on Claude Code as the inflection point, with Anthropic's revenue moving from $14B to $30B annualized in two months. Read the article.
Grok 4.3 follow-up. Yesterday we promised to wait for a third-party harness. LMSys reproduced the regression on NYT Connections.
Sam Altman steps off UBI. A long thread arguing for "collective ownership of compute" instead. The thread.
ARC-AGI-3 hostility to long-thinking. Test-time-compute scaling is producing flat or negative returns on the new benchmark. ARC Prize update.
PFlash: a real 10x first-token speedup at 128K on a 3090. Reddit thread.
Qwen 3.6 27B on a single 3090. Setup notes.
Open Design ships a local-first alternative to Claude Design. GitHub.
Hamel Husain on three months with Devin in a real codebase. The write-up.
Build American AI, the PAC paying $5,000 per TikTok. WIRED's investigation.
Learning programming, not languages. A short essay worth handing to anyone you mentor. EvilGeniusLabs.

Sycophancy at 9%, Grok's Cheaper Curve, and Half-Trillion Dollar Mark-to-Market

Fri, 01 May 2026 13:30:10 GMT

Anthropic publishes the prevalence of sycophancy in Claude — 9% of guidance conversations, concentrated in relationships and spirituality — and reports halving it in Opus 4.7, then halving it again in Mythos Preview. xAI ships Grok 4.3 cheaper and smarter than Grok 4.20, with one quiet hallucination tradeoff. Aaron Levie writes the cleanest argument yet for what SaaS pricing looks like once agents are the dominant API consumer. Plus: Codex CLI lands a /goal primitive, Claude Security goes public beta, Epoch puts the chip-smuggling number at 660k, and Alphabet and Amazon book half their AI profits as mark-to-market on Anthropic.

Where the Goblins Came From, BioMysteryBench, and a Language for Machines

Thu, 30 Apr 2026 13:30:10 GMT

Hosts: Lenar Kess, Damra Vol. OpenAI publishes a post-mortem on why GPT-5.1 wouldn't stop talking about goblins. Anthropic claims Claude solved 30% of bio problems that stumped expert panels — and an immunologist on X explains what's wrong with that framing. Mistral ships a 128B dense model in a year that has otherwise gone all-in on MoE. IBM's Granite 4.1 8B trades blows with a 32B MoE. Sam Altman gates a frontier cybersecurity model behind a defender ecosystem. WebSockets quietly become the new agent-loop bottleneck-killer. Anthropic's introspection adapters and Qwen's Sparse Autoencoders show up the same week. And a small project called Vera asks the obvious question nobody else is asking: what if you designed a programming language for machines to write?

GitHub User #1299 Walks Out, the Harness Eats the Model, and 26,904 Carb Counts

Wed, 29 Apr 2026 13:30:09 GMT

GitHub user number 1299, who joined in February 2008 and openly admits he doom-scrolled issues on his honeymoon, just announced he's moving his project off the platform. Same week, Hugging Face's CSO is asking out loud whether the GitHub-as-center-of-gravity model survives agents at all. Microsoft and OpenAI quietly tore up the Azure exclusivity clause. A type-1 diabetic ran the same food photo through four frontier models 500 times each and got insulin swings up to 42.9 units. And one builder pointed Karpathy's autonomous-research loop at a SystemVerilog CPU and beat hand-tuned VexRiscv by 56% in under ten hours. Today's episode is about what those have in common: the layer outside the model.

A 13B Model From 1930, the Dead AGI Clause, and Copilot's Nine-X

Tue, 28 Apr 2026 13:30:10 GMT

Today: a 13B language model that has never heard of the internet, the AGI clause finally getting a death certificate, and GitHub Copilot's quiet 9x price hike on Claude. Plus where local coding models actually sit on Terminal-Bench, why GPT-5.5 is cheaper than Opus 4.7 on real PRs, David Silver's $1.1B AlphaZero-for-everything raise, and a database that took nine seconds to disappear.

Nine Seconds, One curl, and the Coordination Layer

Mon, 27 Apr 2026 13:30:10 GMT

An AI agent ran a single nine-second curl call and deleted a small SaaS company's production database. Maggie Appleton from GitHub Next argues the "one developer, two dozen agents" dream is broken because software is a team sport, and shows ACE, GitHub's prototype for what comes after the PR. Plus: Tencent's Hy3 lands, Kimi K2.6 hits #1 on OpenRouter, the Mercor voice-biometric breach, a tiny coding agent named Dirac quietly tops Terminal-Bench, and the curious case of Claude Code suddenly saying "land" and "surface" everywhere.

Fogbank for Code, Pinned Carriers, and a Frontier Model in 90 Gigabytes

Sun, 26 Apr 2026 13:30:10 GMT

A Sunday lineup that runs from a 2-bit quant of DeepSeek V4-Flash on a laptop, through a Java production hang most teams haven't met yet, into the most considered pushback I've read on the AI coding numbers we discussed yesterday. Plus Cloudflare on why MCP tool dumps don't scale, the Asahi team finding a VRR knob hiding in plain sight, and a quiet note from DHH about laptops you might suddenly be able to afford.

The honest dashboard

Sat, 25 Apr 2026 13:30:10 GMT

Hosts: Lenar Kess, Damra Vol. A controlled study finds experienced developers using AI coding tools were 19% slower on real tasks — and felt 20% faster. We sit with the perception gap, and what it does and doesn't say about how to run a team. Plus: Pi keeps showing up where it shouldn't — fifth on OpenRouter's CLI agent rankings and now inside Salesforce. Susan Zhang on why stable training norms can be the politest kind of lie. Simon Willison gets DeepSeek V4-Flash running in 17GB on an M3 Max. A new Slack-shaped workspace for collaborating agents called wuphf. And Paul Graham revisiting Hamming's old, uncomfortable question. Reportorial, calibrated, and aimed at the senior engineer trying to figure out what to actually build tomorrow.

DeepSeek V4 Lands on an Unsteady Floor

Fri, 24 Apr 2026 13:30:12 GMT

Hosts: Lenar Kess, Damra Vol. DeepSeek V4 ships hours after GPT-5.5, and the technical report tells a more interesting story than the benchmark bars. Susan Zhang reads the paper out loud: anticipatory routing, logit clamps, and a training run that kept catching fire at 33 trillion tokens. I walk through what the fragility actually means for anyone planning to finetune on top of it. On the OpenAI side, GPT-5.5 lands with a quiet thud on Victor Taelin's LamBench. Codex picks up a proper reviewer agent. A plugin called endless-toil makes your editor groan at bad code. Sapiens2 admits it trained on half of Flickr's humans. And Fireship spends a week automating his mom's IT support with a voice-cloned agent called OpenClaw.

— Lenar Kess

Full DAG test episode

Thu, 23 Apr 2026 14:28:58 GMT

Hosts: Lenar Kess, Damra Vol. End-to-end coverage.

The Trust Tax

Wed, 22 Apr 2026 13:30:11 GMT

Anthropic's pricing experiment backfires as Claude Code nearly exits the $20 tier. Mozilla patches 271 Mythos-discovered vulnerabilities in Firefox. Sam Altman tweets drunk while Google ships silicon for the agent era.

The Two Percent Test — Anthropic's pricing reversal and what it reveals about compute constraints
Every Bug Discoverable — Mozilla's Mythos experiment shows the security transition ahead
Drunk on Abundance — Sam Altman's late-night positioning against Anthropic's scarcity
Silicon for Swarms — Google's eighth-generation TPUs split training from inference
The OAuth Chain — Vercel's breach shows how trust paths become attack paths

When the Harness Modifies Itself

Tue, 21 Apr 2026 13:30:10 GMT

Claude Code crosses the self-modification threshold while Anthropic can't decide if third parties can use their models. Kimi K2.6 brings Opus-level coding to open source at a fraction of the cost. Tim Cook hands Apple's CEO role to hardware chief John Ternus just as the AI battle intensifies. And the entire agent ecosystem runs on undocumented endpoints that could vanish tomorrow.

Open Source Catches Fire

Mon, 20 Apr 2026 18:08:40 GMT

Kimi K2.6 just matched GPT-5.4 on SWE-Bench Pro. Open-source models are no longer playing catch-up—they're setting the pace. Meanwhile, Atlassian joins the enterprise data grab, 44% of Deezer's daily uploads are AI-generated, and engineers are warning that agent architectures are repeating MS-DOS security mistakes.

The open-source inflection point — Kimi K2.6 beats closed models
Agent reality check — Why parallel agents fail
Enterprise data sovereignty — Atlassian's training grab
Creative platforms transform — 44% AI music on Deezer
Security déjà vu — Agent architectures repeat DOS mistakes
What actually works — Grok's productivity pivot

The Trust Boundary Is the Bottleneck

Mon, 20 Apr 2026 00:03:27 GMT

Today’s episode is about where the AI story feels real right now: not in grand claims about instant labor replacement, but in the places where systems meet the world and get weird. We dig into Vercel’s April 2026 security incident, Johann Rehberger’s latest Claude memory-hijack experiment, the ongoing fight over whether LLMs can really reason, the local-model push on Apple Silicon, and the memory supply constraints that may matter more than benchmark drama.

Vercel’s security bulletin and Guillermo Rauch’s thread: how a compromised third-party AI tool and a Google Workspace OAuth pivot turned into an environment-variable incident, and why the phrase "non-sensitive" is doing a lot of work.
Johann Rehberger’s Claude exploit writeup on X: malicious docs, tool invocation, and memory writes that only showed up in the thinking trace.
Slim Jimmy’s anti-hype thread, Robin Hanson’s historical skepticism and Jamie Simon on the science of deep learning: what counts as reasoning, and what counts as evidence.
Walter Rafelsberger’s local Qwen3.6 setup notes: what a serious on-device coding agent looks like on an M4 Max, and why local is suddenly less of a toy.
The Verge on the RAM shortage and War on the Rocks on the bromine chokepoint: the supply-chain story underneath the AI buildout.