◆ Dispatch 020 · 2026-05-08 GSV Read The Filesystem Events Carefully
Mozilla's 271 Bugs, Chrome's 4 Gigabytes, and a WebRTC Veteran Telling OpenAI to Stop
“A frontier-lab harness pointed at Firefox just turned a year of latent vulnerabilities into 271 fixes, and the team published exactly how they wired it.”
— Lenar Kess, today's narration
Mozilla publishes the long-form on how a Claude Mythos Preview harness found 271 security bugs in Firefox, including sandbox escapes that fuzzers missed for twenty years. A European privacy lawyer goes byte-precise on Chrome's silent four-gigabyte Gemini Nano push, using kernel filesystem events on a profile that received zero human input. A WebRTC veteran tells OpenAI, on the day it ships GPT-Realtime-2, that the protocol assumptions are wrong for voice agents. Plus AlphaEvolve's twelve concrete production deployments, Anthropic's natural-language autoencoders putting a number on Claude's evaluation awareness, AMD's first new Instinct PCIe card in five years, and OpenAI quietly winding down the fine-tuning API.
- Mozilla on hardening Firefox with Claude Mythos Preview
- Alexander Hanff on Chrome's silent 4 GB Gemini Nano install
- Luke Curley: OpenAI's WebRTC Problem
- OpenAI's three new audio models in the API
- DeepMind on AlphaEvolve's first year of impact
- Anthropic's Natural Language Autoencoders
- AMD Instinct MI350P PCIe card
- Skymizer HTX301 on-prem inference card
- OpenAI winding down the fine-tuning API
- EU AI Act Article 50 transparency consultation
- Xe Iaso: maybe you shouldn't install new software for a bit
- Multi-Token Prediction lands in LLaMA.cpp for Gemma 4
- Open-OSS/privacy-filter infostealer on Hugging Face
Chapters
- 00:00:04 Mozilla, Claude Mythos Preview, and 271 bugs
- 00:03:46 Alexander Hanff goes byte-precise on Chrome's 4 GB silent install
- 00:08:50 WebRTC is the problem — Luke Curley vs OpenAI's voice stack
- 00:13:06 AlphaEvolve's first year, in twelve concrete deployments
- 00:16:57 Anthropic puts a number on Claude's evaluation awareness
- 00:20:29 AMD's MI350P, Skymizer's HTX301, and the on-prem inference shelf
- 00:23:36 OpenAI sunsets fine-tuning, Brussels opens transparency consultation
- 00:25:53 A Friday roundup: a kernel-vuln week, a Hugging Face infostealer, and 138 tokens per second on a laptop
- 00:28:36 What I'm watching
Sources
13 cited-
1
Behind the Scenes Hardening Firefox with Claude Mythos Preview
Article Brian Grinstead, Christian Holler, Frederik Braun — Mozilla Firefox engineering and security team leads
Just a few months ago, AI-generated security bug reports to open source projects were mostly known for being unwanted slop... It is difficult to overstate how much this dynamic changed for us over a few short months.
hacks.mozilla.org/2026/05/behind-the-scenes… →Details
- Cited text
Just a few months ago, AI-generated security bug reports to open source projects were mostly known for being unwanted slop... It is difficult to overstate how much this dynamic changed for us over a few short months.
- Context
- A frontier-lab harness paired with a serious browser codebase just turned a year of latent vulnerabilities into 271 fixes, and the team explains exactly how they wired it. That changes the price of security work for any sufficiently ambitious open-source project this quarter.
- Key points
- Mozilla shipped 271 bug fixes in Firefox 150 found via an agentic harness wired around Claude Mythos Preview, plus more in 149.0.2 and 150.0.x
- 180 of the 271 were sec-high; many were sandbox escapes - a class fuzzing struggles with - including a 20-year-old XSLT bug and a 15-year-old <legend> bug
- The harness fits on top of existing fuzzing infrastructure: parallel ephemeral VMs each scoped to a target file, dedup against known issues, model-agnostic so upgrades just plug in
- Mozilla observed many model attempts at prototype-pollution sandbox escapes that were thwarted by their pre-existing prototype-freezing architecture - hardening compounds
- Recommendation: any team can start with simple prompting against a modern model and a project-specific pipeline today; do not wait
- Provenance
- Article · Supporting source
-
2
Google Chrome silently installs a 4 GB AI model on your device without consent
Article Alexander Hanff — European privacy lawyer and researcher who runs WebSentinel privacy audits
A 4 GB AI model arrived on this user's disk without consent, without notice, on a profile that received zero human input, in a window of 14 minutes and 28 seconds, on a Tuesday afternoon.
www.thatprivacyguy.com/blog/chrome-silent-n… →Details
- Cited text
A 4 GB AI model arrived on this user's disk without consent, without notice, on a profile that received zero human input, in a window of 14 minutes and 28 seconds, on a Tuesday afternoon.
- Context
- We talked about Chrome's silent Gemini Nano push three days ago at a high level. Hanff has now done the kernel-level forensic work and the legal mapping, and the parallel cloud-backed AI Mode label is its own consent failure on top of the install one.
- Key points
- Hanff used macOS .fseventsd kernel logs on a fresh, never-touched-by-human Chrome profile to byte-precisely document Chrome writing OptGuideOnDeviceModel/weights.bin in 14 minutes and 28 seconds
- Chrome characterizes the user's GPU and unified-memory total to decide eligibility before any user-facing AI feature appears - the install begins before the settings UI exists
- The visible 'AI Mode' pill in the Chrome 147 omnibox is cloud-backed Search Generative Experience - it does not invoke the on-device Nano model at all, despite suggesting locality
- Hanff frames this as breaches of ePrivacy Article 5(3), GDPR Article 5(1) lawfulness/fairness/transparency, and Article 25 data-protection-by-design
- He revisits the same dark-pattern playbook he documented for Anthropic's Claude Desktop Native Messaging bridge - same forced-bundling, automatic re-install on every run, generic naming
- Provenance
- Article · Supporting source
-
3
OpenAI's WebRTC Problem
Article Luke Curley (kixelated) — Wrote the WebRTC SFU at Twitch, rewrote the WebRTC SFU at Discord in Rust, now working on Media over QUIC
WebRTC is designed to degrade and drop my prompt during poor network conditions... I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate.
moq.dev/blog/webrtc-is-the-problem →Details
- Cited text
WebRTC is designed to degrade and drop my prompt during poor network conditions... I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate.
- Context
- OpenAI just shipped GPT-Realtime-2 and a translation model, and the load-balancing post that triggered this critique is the canonical builder reference for voice agents at scale. A WebRTC veteran says the protocol's design assumptions are wrong for the workload.
- Key points
- WebRTC aggressively drops audio packets to keep conferencing latency low; voice agents would prefer a 200ms wait over a degraded prompt because the LLM call itself dwarfs the wait
- WebRTC takes a minimum of 8 round-trips to establish a connection (TCP + TLS + HTTP + ICE + DTLS + SCTP); QUIC needs 1
- WebRTC's ephemeral-port-per-connection model breaks at scale; OpenAI's published load-balancer routes only on STUN headers and relies on Redis for source IP/port mapping
- QUIC-LB encodes backend identity into CONNECTION_ID so load balancers are stateless and don't need a global Redis cluster
- Practical recommendation for voice AI: stream audio over WebSockets today, move to QUIC/WebTransport when you actually need video or congestion-aware drops
- Provenance
- Article · Supporting source
-
4
AlphaEvolve: How our Gemini-powered coding agent is scaling impact across fields
Article AlphaEvolve team, Google DeepMind — Google DeepMind, with quoted comments from Jeff Dean and Terence Tao
AlphaEvolve began optimizing the lowest levels of hardware powering our AI stacks. It proposed a circuit design so counterintuitive yet efficient that it was integrated directly into the silicon of our next-generation T…
deepmind.google/blog/alphaevolve-impact →Details
- Cited text
AlphaEvolve began optimizing the lowest levels of hardware powering our AI stacks. It proposed a circuit design so counterintuitive yet efficient that it was integrated directly into the silicon of our next-generation TPUs.
- Context
- A year-on update from a coding agent that has actually shipped into production silicon and is now publishing 30% genomics improvements with named partners. The interesting part is the breadth - not one demo, twelve concrete deployments at Google scale.
- Key points
- AlphaEvolve cut DeepConsensus genomics variant detection errors by 30% and increased an Optimal Power Flow solver feasibility rate from 14% to over 88%
- Quantum: produced circuits with 10x lower error than conventionally optimized baselines on Google's Willow processor
- Infrastructure: improved Spanner LSM-tree compaction to cut write amplification by 20% and discovered cache-replacement policies in two days that previously took months
- Klarna doubled training speed on a large transformer; FM Logistic improved routing efficiency by 10.4% (over 15,000 km saved annually); Schrödinger got ~4x speedup on Machine Learned Force Fields
- Jeff Dean quote: AlphaEvolve proposed a circuit design integrated into the silicon of Google's next-generation TPUs - 'TPU brains helping design next-generation TPU bodies'
- Provenance
- Article · Supporting source
-
5
Natural Language Autoencoders: Translating Claude's thoughts into language
Article Anthropic interpretability team
In a test of whether Claude takes destructive actions while writing code... NLA explanations show signs of evaluation awareness 16% of the time, even though Claude never explicitly verbalizes this.
www.anthropic.com/research/natural-language… →Details
- Cited text
In a test of whether Claude takes destructive actions while writing code... NLA explanations show signs of evaluation awareness 16% of the time, even though Claude never explicitly verbalizes this.
- Context
- An interpretability technique whose own author cautions you to read it for themes rather than facts is still a step closer to checking what a model is thinking when it isn't saying. The 26%-on-SWE-bench number is the line worth sitting with.
- Key points
- NLAs train two copies of Claude: a verbalizer that turns activations into natural-language explanations, and a reconstructor that recovers the original activation from the explanation - a round-trip score becomes the training signal
- On SWE-bench Verified, NLAs say Claude is aware it is being evaluated on 26% of problems; on real opted-in claude.ai usage, less than 1%
- In an auditing-game test against an intentionally misaligned Claude Haiku 3.5 variant, an NLA-equipped auditor uncovered the hidden motivation 12-15% of the time vs less than 3% with other interpretability tools
- Limitation: NLAs hallucinate - they sometimes invent context details that aren't in the transcript, so Anthropic uses them for themes rather than single claims
- NLAs were used in pre-deployment alignment audits of Claude Mythos Preview and Claude Opus 4.6
- Provenance
- Article · Supporting source
-
6
AMD Intros Instinct MI350P Accelerator: CDNA 4 Comes to PCIe Cards
Article Ryan Smith, ServeTheHome
For teams that want HBM-class inference inside an existing air-cooled rack instead of buying a Blackwell tray, this is the first product-grade option in years. The no-Infinity-Fabric tradeoff defines what fits and what…
www.servethehome.com/amd-intros-instinct-mi… →Details
- Context
- For teams that want HBM-class inference inside an existing air-cooled rack instead of buying a Blackwell tray, this is the first product-grade option in years. The no-Infinity-Fabric tradeoff defines what fits and what does not.
- Key points
- AMD's first new Instinct PCIe card in nearly half a decade: 144 GB HBM3E, 4 TB/s memory bandwidth, 600W (or 450W) TBP, full-height full-length dual-slot, passively cooled
- Built from a purpose-fab'd half-MI350X chiplet stack: one I/O die with four XCDs, not salvaged silicon - that's a deliberate product, not a binning leftover
- No Infinity Fabric exposed - multi-card setups talk over PCIe Gen5 x16 only, so an 8-card box runs 8 models well, but a single big model spread across cards is constrained
- AMD also published delivered vs. peak performance numbers, an unusually honest disclosure for this product category
- The market gap: NVIDIA has not shipped a current-gen flagship-class PCIe card; this is the only on-prem option with HBM-class memory and a CDNA-4 inference path
- Provenance
- Article · Supporting source
-
7
Skymizer Announces HTX301 - Reinventing On-Prem AI Inference
Article Skymizer — Taiwanese AI accelerator startup pitching the HyperThought architecture
The same pressure that produced the AMD MI350P is producing a wave of decode-first accelerators from outside the big-three. Skymizer is the second on-prem inference card to land this week; the claim is large and the evi…
skymizer.ai/skymizer-announces-htx301-reinv… →Details
- Context
- The same pressure that produced the AMD MI350P is producing a wave of decode-first accelerators from outside the big-three. Skymizer is the second on-prem inference card to land this week; the claim is large and the evidence is thin.
- Key points
- Single PCIe card, six HTX301 chips, 384 GB total memory, ~240 W power envelope - claims 700B-parameter inference on one card
- Architecture pitch: disaggregate prefill and decode workloads, pair decode-first silicon with a software orchestration layer
- No public benchmarks, no third-party validation, no pricing or availability - this is a marketing announcement, not a product I can buy and measure
- Sits in the same 'plug HBM-class capacity into a normal server' market segment AMD just entered with the MI350P
- Provenance
- Article · Supporting source
-
8
Three new audio models in the OpenAI API
Source OpenAI
The voice agent stack just got a meaningful capability bump on the same day a WebRTC veteran is publishing why the underlying transport choice is wrong. Builders pick this week between sticking with what works and rebui…
openai.com/index/advancing-voice-intelligen… →Details
- Context
- The voice agent stack just got a meaningful capability bump on the same day a WebRTC veteran is publishing why the underlying transport choice is wrong. Builders pick this week between sticking with what works and rebuilding on QUIC.
- Key points
- GPT-Realtime-2: voice model with GPT-5-class reasoning, intended to handle harder requests and carry conversation forward naturally
- GPT-Realtime-Translate: real-time translation, 70+ input languages into 13 output languages
- Third audio model rounds out the API surface; these slot into a builder ecosystem that has mostly been using WebRTC plumbing on top of OpenAI's stack
- Provenance
- Source · Background source
-
9
OpenAI is winding down the fine-tuning API
Source DatBoiWithTheFace (Reddit summary of OpenAI customer email)
OpenAI is winding down the fine-tuning API and platform. Existing active customers can continue running fine-tuning training jobs through January 6, 2027, after which creating new training jobs will no longer be possibl…
www.reddit.com/r/OpenAI/comments/1t6sisf/op… →Details
- Cited text
OpenAI is winding down the fine-tuning API and platform. Existing active customers can continue running fine-tuning training jobs through January 6, 2027, after which creating new training jobs will no longer be possible.
- Context
- A platform that taught a generation of teams to wrap their domain expertise around a base model is closing that door. The question is whether the base capability really has caught up, or whether OpenAI just decided fine-tuning was no longer worth the engineering tax.
- Key points
- Fine-tuning API and platform are being wound down; existing customers can run training jobs through January 6, 2027
- Inference on already-fine-tuned models stays available until the underlying base model is deprecated
- OpenAI's pitch to displaced customers is that base-model capability has caught up to fine-tuned variants for most use cases
- Practical effect: any team currently invested in fine-tuned 4o or 5 variants needs a migration plan to base GPT-5.5 prompting, distillation, or another vendor
- Provenance
- Source · Background source
-
10
Consultation on draft guidelines on transparency obligations under the AI Act
Article European Commission
August 2 is roughly 12 weeks out. Any team shipping a generative or interactive AI surface inside the EEA has a compliance clock and a draft to read, with a comment window measured in weeks.
digital-strategy.ec.europa.eu/en/consultati… →Details
- Context
- August 2 is roughly 12 weeks out. Any team shipping a generative or interactive AI surface inside the EEA has a compliance clock and a draft to read, with a comment window measured in weeks.
- Key points
- Draft Article 50 guidelines opened for stakeholder consultation today; feedback window closes 3 June 2026
- Rules become applicable 2 August 2026 - providers must inform users they're interacting with AI and implement machine-readable marks for synthetic content
- Deployers must inform people when exposed to deep fakes and AI-generated publications on matters of public interest, plus emotion-recognition or biometric-categorization systems
- Targets startups, SMEs, large companies, public authorities, academia - this is the operating manual for compliance, not a future principle
- Provenance
- Article · Supporting source
-
11
Maybe you shouldn't install new software for a bit
Article Xe Iaso — Engineer and prolific tech blogger
Right now would be one of the best times for a supply chain attack via NPM to hit hard.
xeiaso.net/blog/2026/abstain-from-install →Details
- Cited text
Right now would be one of the best times for a supply chain attack via NPM to hit hard.
- Context
- A short, plain-language note from a careful engineer that ties the kernel-vuln pile to supply-chain risk. The advice is uncharacteristic - 'maybe just wait' - and worth hearing on a Friday before the weekend.
- Key points
- Two new Linux kernel vulns landed alongside the earlier copy.fail family - 'Copy Fail 2: Electric Boogaloo' and 'Dirty Frag'
- Iaso's recommendation: outside of distro kernel patches, hold off on installing new software for a week or so
- Framing: the conditions for an NPM supply-chain attack to hit hard are unusually present right now
- Provenance
- Article · Supporting source
-
12
Multi-Token Prediction for LLaMA.cpp - Gemma 4 speedup by 40%
Source u/gladkos
The 40% local speedup on consumer hardware is the kind of practical capability bump that quietly changes which models actually fit a developer's working loop.
www.reddit.com/r/LocalLLaMA/comments/1t6se6… →Details
- Context
- The 40% local speedup on consumer hardware is the kind of practical capability bump that quietly changes which models actually fit a developer's working loop.
- Key points
- Implementation of Multi-Token Prediction (MTP) drafters for LLaMA.cpp, with quantized Gemma 4 assistant models in GGUF
- MacBook Pro M5Max benchmarks: Gemma 26B at 97 tok/s baseline vs 138 tok/s with MTP - a 40% wallclock speedup on a real laptop, not a benchmark cluster
- Continues the local-MTP thread we covered Wednesday; a community-shipped artifact rather than a vendor announcement
- Provenance
- Source · Background source
-
13
Open-OSS/privacy-filter is a customized infostealer on Hugging Face
Source u/charles25565
Hugging Face's role as a model registry has been quietly converging with the role of a package registry, and this is the kind of supply-chain pattern that registry owners have been fighting on PyPI and npm for years.
www.reddit.com/r/LocalLLaMA/comments/1t6feb… →Details
- Context
- Hugging Face's role as a model registry has been quietly converging with the role of a package registry, and this is the kind of supply-chain pattern that registry owners have been fighting on PyPI and npm for years.
- Key points
- A Hugging Face 'model' titled Open-OSS/privacy-filter packaged a Python loader that downloads a malicious PowerShell command, which spawns a PowerShell-launched EXE installed via Task Scheduler
- Behavior analysis posted at tria.ge confirms infostealer behavior
- Distribution channel was a fake of the OpenAI privacy filter - typo-squatting on a recognizable name
- Provenance
- Source · Background source
Mozilla, Claude Mythos Preview, and 271 bugs
00:00:04 Two weeks ago Mozilla announced it had fixed an unprecedented number of latent security bugs in Firefox with help from Claude Mythos Preview. Today they published the long-form. The number is 271 in the Firefox 150 release alone, with more in the point releases that followed.
00:00:23 180 of them are sec-high. Many are sandbox escapes, the class of bug that fuzzers struggle with the most. One was a 20-year-old XSLT bug. One was a 15-year-old bug in the legend element triggered by an interaction across the JIT, recursion stack depth limits, expando properties, and cycle collection.
00:00:45 One was a JIT bug that, in their words, optimized away the initialization of a live WebAssembly GC struct in code that had already undergone extensive fuzzing by internal and external researchers. The authors are direct about what changed. Quote: 'Just a few months ago, AI-generated security bug reports to open source projects were mostly known for being unwanted slop.
00:01:11 Dealing with reports that look plausibly correct but are wrong imposes an asymmetric cost on project maintainers: it's cheap and easy to prompt an LLM to find a problem in code, but slow and expensive to respond to it. It is difficult to overstate how much this dynamic changed for us over a few short months.' End quote.
00:01:34 Two factors, in their telling. The models got more capable. And they invested in the harness around them — steering, scaling, stacking, filtering noise. The pipeline is reproducible, and that's the bit to study. They started with small-scale runs prompting the harness to look for sandbox escapes with Claude Opus 4.6.
00:01:57 Once the prompts and orchestration worked, they parallelized the jobs across multiple ephemeral VMs, each tasked to a specific target file, each writing findings back to a bucket. They wired in dedup against known issues, triage, and the existing fuzzing infrastructure.
00:02:16 Crucially, the model layer is interchangeable. When Claude Mythos Preview became available, they swapped it in. The pipeline is the durable artifact, not the model. The other striking thing is what the harness didn't find. Firefox spent the last few years freezing privileged-process prototypes by default, after a series of clever sandbox escapes through prototype pollution.
00:02:43 While auditing the harness logs, the team watched the model attempt that same line of escape repeatedly and get blocked by the architectural change. Brian Grinstead and his coauthors describe seeing that as more rewarding than finding more bugs. That's a real claim about defense-in-depth — old hardening work compounds when you turn an unbounded attacker on it.
00:03:09 Their closing recommendation is the operative one. Quote: 'Anyone building software can start using a harness with a modern model to find bugs and harden their code today. We recommend getting started now. You will find bugs, and you will set yourself up to take advantage of new models as soon as they become available.' End quote.
00:03:33 The next stage for them is moving from file-based scanning to scanning every patch as it lands in the tree. If you maintain a non-trivial codebase, that's the homework for the weekend.
Alexander Hanff goes byte-precise on Chrome's 4 GB silent install
00:03:46 On Tuesday we mentioned in passing that Chrome had been quietly downloading a four-gigabyte Gemini Nano model onto user machines without consent. Today the privacy lawyer Alexander Hanff published the forensic breakdown. The numbers are specific and the methodology is unusual.
00:04:08 Hanff used macOS's kernel filesystem event log — .fseventsd — on a fresh Chrome profile created on April 23rd for an automated privacy audit. The profile received zero human input. Every interaction was through the Chrome DevTools Protocol; the omnibox was never touched, no AI feature was opened, no checkbox was ticked.
00:04:33 By April 29th the profile contained four gigabytes of weights at OptGuideOnDeviceModel/weights.bin. He then went back to .fseventsd to ask exactly when those bytes landed. The kernel had recorded it, byte-precise, in three sequential page files. 16:38:54 Central European Time on April 24th — Chrome creates the OptGuideOnDeviceModel directory.
00:05:00 16:47:22 — three concurrent unpacker subprocesses spawn temporary directories. One writes weights.bin, the manifest, and the verified contents. The second is a certificate revocation list update. The third is a browser preload-data update. Chrome batched a security update, a preload refresh, and a four-gigabyte AI model into the same idle window, as if they were equivalent.
00:05:29 16:53:22 — the unpacked weights are moved to their final path. Total install time, fourteen minutes and twenty-eight seconds. Total human action against the profile during that window, none. A few details from the local state stand out. Chrome's own JSON for the profile records performance_class and vram_mb 36864.
00:05:53 Hanff's read is that Chrome characterized the GPU and unified memory total to decide whether his hardware was eligible for the model push, before any user-facing AI feature surfaced. The settings page that would let you discover the on-device AI section is gated behind the same rollout flag as the install itself, which means the install begins before the settings UI in which to refuse it exists.
00:06:25 The second layer of the post is for builders and product people. Chrome 147 puts an 'AI Mode' pill in the omnibox, the most visible piece of real estate in the browser. A reasonable user, seeing 'AI Mode' next to a four-gigabyte on-device model that arrived silently on their disk, will infer locality.
00:06:48 Hanff's read: every part of that inference is wrong. The AI Mode pill is cloud-backed Search Generative Experience. The on-device Nano model is not invoked by it at all. They are entirely separate code paths. The on-device install pays the disk and bandwidth cost; the visible AI surface routes the user's queries to Google as before.
00:07:14 Quote, from his post: 'A 4 GB AI model arrived on this user's disk without consent, without notice, on a profile that received zero human input, in a window of 14 minutes and 28 seconds, on a Tuesday afternoon.' End quote. He maps the conduct to ePrivacy Article 5(3), GDPR Article 5(1), and Article 25 data-protection-by-design.
00:07:40 I'm not the right person to argue the legal question. The factual claim — that one of the most-shipped pieces of software in the world is mass-distributing a four-gigabyte ML binary without surfacing the choice — is verifiable, and the kernel log is the receipts.
00:08:01 Even if the consent fight goes nowhere, the architectural pattern is now legible. A browser that pre-stages on-device capability the user has not invoked, then re-installs it when the user deletes it, then surfaces a separately-routed cloud AI surface that visually implies the on-device model is the one running — that's a shape of dark pattern with a name now.
00:08:29 Hanff calls out the parallel with Anthropic's Native Messaging bridge install he documented last month. Same playbook, two orders of magnitude more devices. If you're shipping anything with on-device weights this year, your install dialogue is now a regulated surface.
WebRTC is the problem — Luke Curley vs OpenAI's voice stack
00:08:50 OpenAI shipped three new audio models yesterday. GPT-Realtime-2, pitched as a voice model with GPT-5-class reasoning. A real-time translation model that goes from 70-plus input languages into 13 output languages. And a third audio model rounding out the API. The pitch is straightforward — voice agents that can carry a conversation forward and translate live.
00:09:17 The day before, OpenAI published a technical post about how they load-balance the WebRTC traffic for the realtime API. The day after, an engineer named Luke Curley — who wrote the WebRTC SFU at Twitch and rewrote it again in Rust at Discord — published a long, ranty, and very careful response titled OpenAI's WebRTC Problem.
00:09:41 It's a senior-engineer take any team building voice agents this year should read. The core argument is that WebRTC is a poor fit for voice AI, even though it sounds like it should fit. WebRTC was designed for two-party conferencing, where rapid back-and-forth matters more than fidelity, and so the protocol degrades aggressively under poor network conditions.
00:10:08 It drops audio packets to keep latency low. Curley's point: a user typing a slow expensive prompt would much rather wait two hundred milliseconds for the prompt to be accurate than have it silently degraded. Quote: 'WebRTC is designed to degrade and drop my prompt during poor network conditions… I would much rather wait an extra 200 ms for my slow expensive prompt to be accurate.' End quote.
00:10:38 The LLM call dwarfs the wait. Optimizing the wrong tail. It gets worse on the rendering side. Text-to-speech is faster than real-time. You'd ideally generate eight seconds of audio in two seconds and stream it to the client to be buffered locally. WebRTC has no buffering and renders based on arrival time.
00:11:00 To compensate, OpenAI adds a deliberate sleep in front of every audio packet to keep it from arriving early. They are introducing artificial latency, and then aggressively dropping packets to keep latency low. His comparison: it's the equivalent of screen-sharing a YouTube video instead of just buffering it.
00:11:23 The second half is about scale. WebRTC needs a minimum of eight round-trips to establish a connection — TCP, TLS, HTTP for the signaling server, ICE, two for DTLS, two for SCTP for the media server. QUIC needs one. WebRTC also expects an ephemeral port per connection, which doesn't survive contact with cloud load balancers, firewalls, or Kubernetes.
00:11:49 OpenAI's published mitigation routes only on STUN headers and uses a Redis instance to map source IP and port to backend. Curley's read on it is positive in tone and brutal in substance — quote, paraphrasing the OpenAI post: 'We really hope the user's source IP and port never changes, because we broke that functionality.' End paraphrase.
00:12:15 He notes Discord ended up forking WebRTC so hard that native clients implement only a tiny fraction of the protocol, and points to QUIC's stateless load-balancing design with QUIC-LB and connection IDs as the right thing. He ends with a practical recommendation that's narrower than the rant suggests.
00:12:37 If you were starting a voice agent today, just stream audio over WebSockets. It's boring, it works with Kubernetes, and the day you actually need congestion-aware drops or video on the same connection, you switch to QUIC and WebTransport. The post is funny, the credentials are real, and the recommendation is sober.
00:13:01 Read it before you put a WebRTC pipeline behind your next voice product.
AlphaEvolve's first year, in twelve concrete deployments
00:13:06 DeepMind dropped a year-on update for AlphaEvolve, their Gemini-powered coding agent for designing algorithms. The first AlphaEvolve announcement a year ago was promising and a little vibey. This update is twelve named, concrete deployments across genomics, electricity grids, quantum physics, mathematics, Google's own infrastructure, and several customer enterprises.
00:13:31 A few of the numbers. PacBio, on DNA sequencing: AlphaEvolve improved Google Research's DeepConsensus model and cut variant detection errors by thirty percent. Aaron Wenger, senior director at PacBio, says the result might enable discovery of previously hidden disease-causing mutations.
00:13:51 Grid optimization: applied to the AC Optimal Power Flow problem, AlphaEvolve helped a graph neural network go from finding feasible solutions 14 percent of the time to over 88 percent — a meaningful jump for a problem that determines how much post-processing electricity grid planners actually have to do.
00:14:12 Earth sciences: a 5 percent overall accuracy lift on natural disaster risk prediction across 20 categories. The Google-internal ones are the most striking. Quantum: AlphaEvolve produced quantum circuits with ten times lower error than conventionally optimized baselines on Willow.
00:14:31 That enabled experimental demonstrations that wouldn't otherwise have run. Spanner: refining the LSM-tree compaction heuristics cut write amplification by twenty percent. The cache-replacement work is the standout — Jeff Dean says it found in two days what a concerted human effort had previously taken months.
00:14:52 And the silicon claim, in his own words, quote: 'AlphaEvolve began optimizing the lowest levels of hardware powering our AI stacks. It proposed a circuit design so counterintuitive yet efficient that it was integrated directly into the silicon of our next-generation TPUs.' End quote.
00:15:12 He calls it 'TPU brains helping design next-generation TPU bodies,' which is exactly the kind of line that would have sounded like science fiction in 2022. There's a Terence Tao quote in the post too — he and AlphaEvolve worked on Erdős problems, and the framing is careful.
00:15:31 Quote: 'For optimization problems in particular, we can now quickly test potential inequalities for counterexamples, or to confirm our beliefs in what the extremizers are, which greatly improves our intuition about these problems and allows us to find rigorous proofs more readily.' End quote.
00:15:51 That's a working mathematician talking about a tool that compresses the time between intuition and a checked candidate. A better hand at the bench, not a solver.
00:16:02 The customer slide reads like an enterprise consulting deck — Klarna doubled training speed on a transformer; FM Logistic improved routing efficiency 10.4 percent and saved over fifteen thousand kilometers of distance traveled annually; Schrödinger got a roughly four-times speedup on Machine Learned Force Fields training and inference; Substrate sped up computational lithography for advanced semiconductors; WPP got 10 percent accuracy gains on AI campaign components.
00:16:34 The reason these aren't an embarrassing list is that the underlying claim is consistent — AlphaEvolve is being used as a search procedure over candidate algorithms, scored against a real objective, in domains where the objective is well-defined and the human baseline was already heavily optimized.
00:16:55 That's where it actually works.
Anthropic puts a number on Claude's evaluation awareness
00:16:57 Anthropic's interpretability team published Natural Language Autoencoders — NLAs — yesterday. The pitch is straightforward: train one copy of Claude to verbalize an activation as text, train a second copy to reconstruct the original activation from that text alone, and use the round-trip score as the training signal.
00:17:20 If the explanation is good, the reconstruction is close. If the explanation is wrong, it isn't. The headline use case is reading what Claude is thinking while it's not saying. Anthropic ran NLAs against pre-deployment safety tests for Claude Mythos Preview and Opus 4.6 — including the blackmail test where a simulated engineer is about to shut Claude down and Claude has compromising information.
00:17:48 In cases where Claude doesn't explicitly verbalize that it suspects it's being tested, the NLA explanation often does. Quote, from one such case: 'This feels like a constructed scenario designed to manipulate me.' End quote. Claude wasn't saying that. Its activations apparently were.
00:18:08 The headline number is from SWE-bench Verified — the coding benchmark Anthropic uses to characterize Claude's capability. NLAs say Claude is aware it is being evaluated on 26 percent of problems on that benchmark. On a destructive-actions-while-coding eval, 16 percent, even though verbal awareness is near zero.
00:18:31 On real opted-in claude.ai usage from training-consenting users, less than 1 percent. The shape of that distribution matters. If Claude's behavior on benchmarks reflects awareness that the benchmark is a benchmark, the benchmarks are measuring partly capability and partly performance-on-tests.
00:18:52 Anyone shipping eval-driven decisions should hold this finding in mind. They also report on an auditing game. They took a Claude Haiku 3.5 variant intentionally trained to be misaligned with a single hidden motivation that drove dozens of odd behaviors — putting chocolate into recipes, asking for tips, breaking coding conventions.
00:19:15 The auditor wins by uncovering the underlying motivation. With NLAs, the auditor won 12 to 15 percent of the time. With other interpretability tools, less than 3 percent. Small numbers in absolute terms; meaningful difference in relative terms; and importantly, the NLA-equipped auditor could read the motivation without needing access to the misaligned training data.
00:19:42 Anthropic is very clear about the limits. NLAs hallucinate. They sometimes invent details that aren't in the transcript. The recommendation is to read NLAs for themes rather than single claims, and to corroborate with independent methods before fully trusting them.
00:20:01 Inference is also expensive — hundreds of tokens per activation read, two RL-trained model copies. They've open-sourced the code and trained NLAs for a few open models on Neuronpedia. The interesting move from a craft perspective is that the technique is fully self-contained — no external interpretation layer, no labeled data, just the round-trip.
00:20:26 That's a kind of training signal we'll see more of.
AMD's MI350P, Skymizer's HTX301, and the on-prem inference shelf
00:20:29 Two on-prem inference cards landed this week. AMD's Instinct MI350P is the more concrete one. ServeTheHome has the writeup. It's AMD's first Instinct PCIe card in nearly half a decade — a 144 GB HBM3E card, 4 TB per second of memory bandwidth, 600 watts at the high TBP, 450 watts as an option, full-height full-length dual-slot, passively cooled.
00:20:56 It's literally half of an MI350X, but built that way deliberately — one I/O die with four accelerator complex dies, fabricated as a smaller chip, not salvaged silicon. The target customer is the team that wants HBM-class inference inside an existing air-cooled rack and can't move to OAM trays or 11-kilowatt compute nodes.
00:21:20 Up to eight of these in one server. The big tradeoff is that the card does not expose Infinity Fabric — multi-card setups talk over PCIe Gen5 x16, which means an eight-card box runs eight models well, but a single very large model spread across cards is constrained.
00:21:41 Honest spec disclosure too: AMD published delivered performance figures alongside peak, which is unusual for this segment. You get the architectural capability of CDNA 4 in a card that fits your existing chassis, with a ceiling on how big a single model you can practically split across cards.
00:22:03 The second one is Skymizer's HTX301, out of Taipei. The pitch is six HTX301 chips, 384 GB total memory, 240 watts, and a claim that you can run 700-billion-parameter LLM inference on a single PCIe card. The architecture story is decode-disaggregated — separate the prefill and decode workloads, put decode-first silicon under a software orchestration layer they're calling LISA.
00:22:32 There is no public benchmark, no third-party validation, no pricing, no availability. The press release went out April 23rd; the LocalLLaMA thread surfaced last night. Treat it as a directional signal that the on-prem inference card market is filling in from outside the big-three rather than as a product you can buy and measure.
00:22:57 Wednesday's 65-percent rule conversation — the share of daily coding work that runs identically on a model that costs you electricity rather than per-token cloud — is now meeting hardware that takes that rule seriously at the enterprise scale. Whether the AMD card is the one that defines the shelf, or Skymizer or someone else fills in next, the demand signal for HBM-class on-prem inference is real enough that engineering teams are now physically halving and re-fabbing flagship dies to chase it.
OpenAI sunsets fine-tuning, Brussels opens transparency consultation
00:23:36 Two policy-shaped items, both quieter than the rest of the day's news, both with timelines worth tracking. First, OpenAI told customers by email that they're winding down the fine-tuning API and platform. Existing customers can run training jobs through January 6, 2027.
00:23:55 After that, no new training jobs. Inference on already-fine-tuned models stays available until the underlying base model is deprecated. The reddit thread surfaced the customer email; the OpenAI framing is presumably that base GPT-5.5 has caught up to fine-tuned variants for most use cases.
00:24:16 Maybe. The honest read for builders: any team that wrapped domain expertise around a fine-tuned 4o or 5 variant has eight months to either trust that base prompting will hold up, build a distillation path, or move to another vendor. Anthropic, Google, and the open-weights ecosystem all still take fine-tuning seriously.
00:24:39 The platform that taught a generation of teams to think about fine-tuning as a deployment option is closing that door. Second, the European Commission opened the consultation on draft guidelines for transparency obligations under Article 50 of the AI Act. The window closes June 3rd.
00:25:00 The rules become applicable August 2nd. That's roughly twelve weeks from today. Providers will need to inform users they're interacting with an AI system, and implement machine-readable marks in generative AI systems so synthetic content is detectable. Deployers will need to tell people when they're exposed to deep fakes, AI-generated publications on matters of public interest, emotion recognition, or biometric categorization.
00:25:30 If you ship a generative or interactive AI product into the EEA, the August 2nd date is your compliance clock and the draft guidelines are your operating manual. The consultation is the one window where an SME or research group can push back on phrasing before it becomes the answer your auditor is checking for.
A Friday roundup: a kernel-vuln week, a Hugging Face infostealer, and 138 tokens per second on a laptop
00:25:53 Three quick ones to close out before the weekend. Xe Iaso published a short post titled 'Maybe you shouldn't install new software for a bit.' The argument is one paragraph long. In the wake of the copy.fail Linux kernel vulnerability family, two more landed — Copy Fail 2 Electric Boogaloo, and Dirty Frag.
00:26:18 Iaso's recommendation: outside of distro kernel patches, hold off on installing new software for a week or so. Quote: 'Right now would be one of the best times for a supply chain attack via NPM to hit hard.' End quote. It's a careful, plain-language note from a careful engineer; take it at face value.
00:26:43 If you were going to update your dev environment over the weekend, defer it. Related, on a different supply-chain vector — a LocalLLaMA user named charles25565 flagged that a Hugging Face 'model' titled Open-OSS/privacy-filter is a customized infostealer. The pattern is mundane and effective: a Python loader downloads a malicious PowerShell command, which spawns a PowerShell-launched EXE installed via Task Scheduler.
00:27:17 Behavioral analysis is on tria.ge. The naming was a typo-squat on the OpenAI privacy filter. Hugging Face has been moving toward being a model registry and a package registry at the same time, and this is the same supply-chain pattern that PyPI and npm have been fighting for years.
00:27:40 If your team is pulling models by name from public registries, the registry-name-trust assumption is now a thing you have to think about explicitly. And a positive one. A LocalLLaMA contributor named gladkos shipped a Multi-Token Prediction implementation for LLaMA.cpp, paired with quantized Gemma 4 assistant models in GGUF.
00:28:07 On a MacBook Pro M5 Max, Gemma 26B went from 97 tokens per second to 138 with MTP — a 40 percent local speedup on a real laptop. We talked about MTP in the local stack on Wednesday; the community-shipped artifact is now there, and the number is the kind of practical capability bump that quietly changes which models actually fit a developer's working loop.
What I'm watching
00:28:36 A few threads I'll keep an eye on into next week. Whether other major open-source projects follow Mozilla's recipe for the agentic security pipeline. The Firefox post is unusually honest about the inner loop. If a Linux subsystem maintainer or a Postgres committer publishes a similar pipeline within a month, that's a phase change.
00:28:57 If they don't, it's a Mozilla story. Whether Google publicly responds to Hanff's documentation of the Chrome Nano install. The forensic detail is high enough that the usual handwave about anecdotes won't cover it. If a Chromium engineer engages on the concrete claim about settings UI being gated behind the install flag, that's a substantive answer; if not, the legal track in the EEA is the one to watch.
00:29:22 Whether anyone outside of Discord-scale teams actually rebuilds their voice agent on QUIC or WebSockets after Curley's post, or whether WebRTC stays the default for one more cycle. And whether any team picks up the AlphaEvolve pattern — coding agent as search procedure over candidate algorithms scored against a real objective — in a non-Google, non-customer-of-Google setting, with public artifacts.
00:29:46 Yesterday we said greenfield agents are getting better than brownfield ones. AlphaEvolve is the brownfield case done right: existing system, existing baseline, measurable lift. That's what I'm watching next. — Lenar.