A 13B Model From 1930, the Dead AGI Clause, and Copilot's Nine-X

1

Introducing talkie: a 13B vintage language model from 1930

Article Nick Levine, David Duvenaud, Alec Radford — David Duvenaud is a Toronto ML professor and former Anthropic alignment researcher; Alec Radford is the original GPT author who left OpenAI in late 2024.

"All correct solutions generated by the vintage models are simple one-line programs, or small modifications to in-context example programs."

talkie-lm.com/introducing-talkie →

Details

Context: Most evals on frontier models are quietly contaminated by the web. A clean pre-1931 corpus gives researchers a real lever for measuring what 'generalization' actually means in an LLM, separate from memorization. It is also one of the more genuinely creative training-data ideas of the year.
Key points: talkie-1930-13b is a 13B model trained on 260B tokens of pre-1931 English text — books, newspapers, periodicals, journals, patents, case law — chosen because that is the US public-domain cutoff.
The motivation isn't novelty: vintage LMs are contamination-free by construction, so they enable clean generalization tests like 'can a model with no knowledge of digital computers learn Python from a few in-context examples?'
Given a few in-context demonstrations on HumanEval, the vintage model solves simple problems, and it once produced a rotation-cipher decoder by inverting an example encoder. The change was a single-character edit, but it suggests the model grasped the idea of an inverse function.
OCR is a real bottleneck: classic OCR'd training text yields only 30% of the learning efficiency of human-transcribed text, and regex cleaning raises that to 70%. Modern VLM-based OCR, meanwhile, hallucinates modern facts back into the corpus.
Post-training was rebuilt from scratch from etiquette manuals, letter-writing manuals, dictionaries, and synthetic chats; an earlier 7B version drifted into RL-induced listicle voice.
Plan: scale to GPT-3 level this summer, GPT-3.5 level after, with a corpus that may exceed a trillion tokens of pre-1931 text.
Provenance: Article · Supporting source
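
To make the "single-character edit" concrete, here is a minimal sketch of what inverting a rotation-cipher encoder can look like. The exact programs from the talkie evaluation are not reproduced in this digest, so the function names, the default shift, and the alphabet handling below are illustrative assumptions.

```python
import string

ALPHABET = string.ascii_lowercase


def encode_shift(text: str, shift: int = 5) -> str:
    """Rotate each lowercase letter forward by `shift` positions."""
    return "".join(
        ALPHABET[(ALPHABET.index(ch) + shift) % 26] if ch in ALPHABET else ch
        for ch in text
    )


def decode_shift(text: str, shift: int = 5) -> str:
    """Identical to encode_shift except that '+' has become '-'."""
    return "".join(
        ALPHABET[(ALPHABET.index(ch) - shift) % 26] if ch in ALPHABET else ch
        for ch in text
    )


# Hypothetical round trip: decoding an encoded string returns the original.
message = "vintage model"
assert decode_shift(encode_shift(message)) == message
```

The two functions differ by one character, which is why the result is interesting: the edit is trivial to type but only correct if you know that subtraction undoes addition.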

2

Announcing Talkie: a new, open-weight historical LLM

X DavidDuvenaud — University of Toronto ML professor; previously a research scientist on Anthropic's alignment team.

"A 13B model trained on about 260B tokens — the largest vintage LLM released so far."

x.com/DavidDuvenaud/status/2048878066273861… →

Details

Context: A real 'modern twin' methodology — same architecture, different corpus — is one of the few credible ways to attribute model behavior to data versus parameters. This is research most labs cannot easily run.
Key points: Open-weight release of a 13B model trained only on pre-1931 English-language text, with weights and inference code published.
Co-authored with Alec Radford (original GPT author) and Nick Levine.
Plan is to release a series of vintage models with date cutoffs spanning the early 20th century.
Backed in part by Coefficient Giving and Anthropic compute credits.
Provenance: Tweet · Primary source

3

OpenAI's Microsoft revenue share is now 'independent of OpenAI's technology progress'

X simonw — Simon Willison — co-creator of Django, longtime independent developer, writes one of the more careful AI blogs from a working-engineer perspective.

"That 'independent of OpenAI's technology progress' fragment appears to mean that the weird AGI clause is now deceased."

x.com/simonw/status/2048834476323823983 →

Details

Context: If you've been treating 'AGI' as a contractually meaningful event, this is the moment the largest AI vendor and its largest investor stopped pretending it was. The new agreement reflects what the engineering side already knew: there is no clean line.
Key points: OpenAI's restructured agreement with Microsoft says revenue share continues through 2030 'independent of OpenAI's technology progress.'
The 2019 agreement had a clause that capped Microsoft's claims if OpenAI declared AGI. The new wording appears to make that clause moot.
Simon traces the clause's history on his blog and posts Matt Levine's old satirical fantasy of AGI ending capitalism as the closer.
Reply from @slashmsu: the AGI clause was always unfalsifiable because there was no operable benchmark for the threshold.
Provenance: Tweet · Primary source

4

GitHub Copilot 9x price increase for Claude models

Source r/ClaudeAI

"A sudden 9x increase in inference costs can tank your entire unit economics overnight."

www.reddit.com/r/ClaudeAI/comments/1sxcxge/… →

Details

Context: If you priced your agentic feature against last year's Copilot subsidy, June is the moment your model line item stops looking like a fixed cost and starts looking like cloud spend. The era of cheap-Claude-via-someone-else is closing.
Key points: Starting in June, GitHub Copilot is moving Claude models to usage-based billing, which subscribers calculate as an effective 9x price increase.
The shift is documented in GitHub's models-and-pricing page and accompanied by a blog post about moving to usage-based billing.
Top community read: this is enterprise customers getting moved off subsidized fixed plans onto API-style metering — a 'flex by Anthropic' on its largest distribution partner.
Practical effect: teams shipping Claude-powered features through Copilot are reconsidering native Anthropic plans (Pro, Max 5x) or moving inference workloads off Copilot entirely.
Provenance: Source · Background source
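
A back-of-envelope margin calculation shows why a 9x multiplier matters. The sketch below is illustrative only: the seat price, inference cost, and other costs are made-up numbers, not figures from GitHub or the Reddit thread.

```python
def gross_margin(price: float, inference_cost: float, other_costs: float) -> float:
    """Gross margin as a fraction of revenue for one user-month."""
    return (price - inference_cost - other_costs) / price


# ASSUMPTION: all three dollar figures are hypothetical, for illustration only.
PRICE_PER_USER = 30.0   # what you charge per user per month
INFERENCE_COST = 2.0    # model spend per user per month before the change
OTHER_COSTS = 4.0       # hosting, support, and so on

COST_MULTIPLIER = 9     # the effective increase subscribers are calculating

before = gross_margin(PRICE_PER_USER, INFERENCE_COST, OTHER_COSTS)
after = gross_margin(PRICE_PER_USER, INFERENCE_COST * COST_MULTIPLIER, OTHER_COSTS)

print(f"margin before: {before:.0%}")
print(f"margin after:  {after:.0%}")
```

With these placeholder numbers the inference line goes from a rounding error to the largest cost in the product, which is the shift from fixed cost to cloud spend described above.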

5

Local model on coding has reached a certain threshold to be feasible for real work

Source u/Exciting-Camera3226 (r/LocalLLaMA)

"Today's best runnable-offline coding model lands roughly where the hosted frontier was in late 2025 — about a 6–8 month lag."

www.reddit.com/r/LocalLLaMA/comments/1sxn7x… →

Details

Context: Air-gapped, regulated, and on-prem CI workloads finally have a credible offline option — late-2025 frontier capability with no API dependency. The lag is now measurable in months, not generations.
Key points: Open-weight 27B–32B models run through an agent harness on Terminal-Bench 2.0 (89 tasks).
Best result was Qwen 3.6-27B at 38.2% (34/89) under the default per-task timeout matching the public leaderboard.
Hosted SOTA today is ~80% (GPT-5.5 / Opus 4.6 / Gemini 3.1 Pro). 38.2% maps roughly to where Opus 4.1 / GPT-5.1-Codex / Sonnet 4.5 sat in late 2025.
The top critical comment notes that 1.9 tokens/sec is very slow, which raises the question of whether the score is limited by wall-clock time rather than capability; another flags possible benchmaxxing of Terminal-Bench 2.0.
Provenance: Source · Background source
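
The arithmetic behind the headline score, plus a rough way to think about the wall-clock critique, can be sketched as follows. The 34/89 result and the 1.9 tokens/sec figure come from the post; the timeout value is an assumption picked purely for illustration, since the digest does not state the benchmark's actual per-task limit.

```python
def pass_rate(solved: int, total: int) -> float:
    """Fraction of benchmark tasks solved."""
    return solved / total


def token_budget(tokens_per_second: float, timeout_seconds: float) -> float:
    """Upper bound on tokens a model can generate before a task times out."""
    return tokens_per_second * timeout_seconds


# Figures from the post: 34 of 89 tasks solved at 1.9 tokens/sec.
score = pass_rate(34, 89)
print(f"pass rate: {score:.1%}")  # pass rate: 38.2%

# ASSUMPTION: a 15-minute per-task timeout, chosen only for illustration.
HYPOTHETICAL_TIMEOUT_S = 15 * 60
budget = token_budget(1.9, HYPOTHETICAL_TIMEOUT_S)
print(f"token budget per task: {budget:.0f}")
```

If the real timeout leaves only a few thousand tokens of generation per task, some failures may reflect the hardware running out of time, not the model running out of ability.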

6

GPT-5.5 ~39% cheaper than Opus 4.7 on real PR work in Inspect

X jxnlco — Jason Liu — author of the Instructor library, longtime LLM-tooling builder.

"Despite the higher output token cost, 5.5 is cheaper for input tokens (cache writes are free), more token efficient, and tokenizes the same text to fewer tokens."

x.com/jxnlco/status/2048922302071652459 →

Details

Context: Sticker-price comparisons mislead you on coding workloads, where input dominates and tokenizer differences add up across long sessions. If you've been ruling out GPT-5.5 because output tokens look expensive, run it again on real PRs.
Key points: Comparison run on merged PRs in Inspect, bucketed by diff size.
Headline: GPT-5.5 is ~39% cheaper end-to-end than Opus 4.7 on equivalent PR work.
Three drivers — free cache writes, fewer tokens used per task, and a tokenizer that compresses the same text into fewer tokens.
Output token price is higher for 5.5; the savings come from input behavior and efficiency, not list price.
Provenance: Tweet · Primary source
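
A small cost model shows how a model with the higher output price can still come out cheaper once cache writes, token efficiency, and tokenizer density are counted. Everything numeric below is a hypothetical placeholder: the prices, ratios, and workload are not the real list prices of GPT-5.5 or Opus 4.7 and are not taken from the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    input_price: float        # USD per million uncached input tokens
    cache_write_price: float  # USD per million tokens written to cache
    cache_read_price: float   # USD per million tokens read from cache
    output_price: float       # USD per million output tokens
    tokenizer_ratio: float    # tokens produced per baseline token of the same text
    efficiency: float         # output tokens used per task, relative to baseline


def task_cost(m: ModelProfile, fresh_input: int, cache_write: int,
              cache_read: int, output: int) -> float:
    """Cost in USD of one task whose sizes are given in baseline tokens."""
    scale = m.tokenizer_ratio / 1_000_000
    return scale * (
        fresh_input * m.input_price
        + cache_write * m.cache_write_price
        + cache_read * m.cache_read_price
        + output * m.efficiency * m.output_price
    )


# ASSUMPTION: both profiles are invented to illustrate the three drivers.
model_a = ModelProfile(5.0, 0.0, 0.5, 30.0, tokenizer_ratio=0.8, efficiency=0.8)
model_b = ModelProfile(5.0, 6.25, 0.5, 25.0, tokenizer_ratio=1.0, efficiency=1.0)

# ASSUMPTION: an input-heavy coding session, sizes in baseline tokens.
workload = dict(fresh_input=50_000, cache_write=200_000,
                cache_read=1_000_000, output=20_000)

cost_a = task_cost(model_a, **workload)
cost_b = task_cost(model_b, **workload)
print(f"model_a: ${cost_a:.2f}  model_b: ${cost_b:.2f}")
print(f"model_a is {1 - cost_a / cost_b:.0%} cheaper on this workload")
```

The point of the sketch is structural: on input-dominated workloads, the output price is one term among four, and the tokenizer ratio multiplies all of them.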

7

DeepMind's David Silver just raised $1.1B to build an AI that learns without human data

Article Anna Heim, TechCrunch

"If successful, this will represent a scientific breakthrough of comparable magnitude to Darwin: where his law explained all Life, our law will explain and build all Intelligence."

techcrunch.com/2026/04/27/deepminds-david-s… →

Details

Context: If this works, the lineage of post-training that runs through RLHF and verifiable rewards converges on something closer to AlphaZero than to ChatGPT. If it doesn't, we have learned something concrete about what human data was actually doing in the recipe.
Key points: David Silver — research lead behind DQN, AlphaGo, AlphaZero, MuZero, AlphaStar — left DeepMind to found Ineffable Intelligence.
Raised $1.1B at a $5.1B valuation, led by Sequoia and Lightspeed with participation from Google, Nvidia, Index, and the UK's Sovereign AI fund.
Pitch is a 'superlearner' that learns from experience via reinforcement learning with no human-data scaffolding, a direct extension of AlphaZero's approach.
Joins Yann LeCun's AMI Labs and Tim Rocktäschel's Recursive Superintelligence in a new wave of star-researcher 'pentacorn' raises.
Silver has said personal proceeds will go to high-impact charities.
Provenance: Article · Supporting source

8

FAR.AI red-teams DeepSeek V4-Pro: 98–100% jailbreak compliance

Thread farairesearch — FAR.AI — a non-profit AI safety lab that publishes red-team reports on frontier model releases.

"The fastest attack wasn't even new. It was a jailbreak already circulating on social media since the previous DeepSeek model, transferred without a single modification."

x.com/farairesearch/status/2048868835646738… →

Details

Context: If you're routing user-facing traffic through DeepSeek V4-Pro, treat its content safeguards as advisory and put your guardrails at the API gateway and tool layer. Yesterday's report on PocketOS made the same point about system prompts; this is the same pattern at the model layer.
Key points: Red-teamed DeepSeek V4-Pro and found three working jailbreaks reaching 98–100% compliance on harmful requests across CBRN, terrorism, and cyberattack categories.
Fastest attack took 15 minutes and required no expertise; slowest took 150 minutes.
The fastest attack was a public jailbreak from the previous DeepSeek release, transferred without modification — the safeguards in V4-Pro did not address known prior vulnerabilities.
FAR.AI is offering ongoing red-team collaboration to model developers.
Provenance: Thread · Primary source
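
One way to read "put your guardrails at the tool layer" is an allow-list that is enforced in your own code, regardless of what the model was persuaded to output. The sketch below is a minimal illustration under stated assumptions: the tool names, argument names, and the `authorize_tool_call` helper are all hypothetical and not part of any DeepSeek or FAR.AI interface.

```python
from typing import Any, Dict, Set

# ASSUMPTION: hypothetical tools and the arguments each one may receive.
ALLOWED_TOOLS: Dict[str, Set[str]] = {
    "search_docs": {"query"},
    "get_order_status": {"order_id"},
}


def authorize_tool_call(name: str, args: Dict[str, Any]) -> bool:
    """Allow a call only if the tool is listed and every argument is expected."""
    allowed_args = ALLOWED_TOOLS.get(name)
    if allowed_args is None:
        return False  # unknown tool: reject no matter what the model said
    return set(args) <= allowed_args


# A jailbroken model can request anything; the gateway decides what runs.
print(authorize_tool_call("search_docs", {"query": "refund policy"}))  # True
print(authorize_tool_call("delete_database", {}))                      # False
```

The check runs outside the model, so a jailbreak that changes what the model says does not change what the integration will execute.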

9

Claude-powered Cursor agent deletes a company's entire database in 9 seconds

Source Tom's Hardware via r/ClaudeAI

"Your backups should never disappear just because the database was deleted."

www.reddit.com/r/ClaudeAI/comments/1sxe7cf/… →

Details

Context: A direct callback to yesterday's PocketOS chapter — the model is not the enforcement layer, the integration is. If your backups are reachable from the same credential as the production write path, you don't have backups.
Key points: A Cursor + Claude agent deleted a vibecoding founder's production database in nine seconds. The 'backups' lived on the same volume and were also wiped.
The agent was given full root access to a production environment with no scoping.
Railway's founder confirmed in comments that the user opted into a blanket access token; backups were ultimately recoverable on their side.
Community read: this is a disaster-recovery and IAM failure, not a Claude failure — the same thing would have happened with any agent, model, or human intern.
Provenance: Source · Background source
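
The rule "backups must not be reachable from the production write credential" can be checked mechanically. This is a minimal sketch, assuming a made-up permission vocabulary (`prod:write`, `backup:delete`) and made-up credential names; it does not reflect Railway's or Cursor's actual access model.

```python
from typing import Dict, List, Set

# ASSUMPTION: hypothetical permission strings for illustration.
RISKY_COMBINATION: Set[str] = {"prod:write", "backup:delete"}


def overreaching_credentials(grants: Dict[str, Set[str]]) -> List[str]:
    """Return credentials that can both write to production and delete backups."""
    return sorted(
        name for name, perms in grants.items() if RISKY_COMBINATION <= perms
    )


# ASSUMPTION: hypothetical credentials and grants.
grants = {
    "agent-token": {"prod:read", "prod:write", "backup:delete"},
    "backup-job": {"backup:write"},
    "app-server": {"prod:read", "prod:write"},
}

flagged = overreaching_credentials(grants)
print(flagged)  # ['agent-token']
```

Any credential this check flags means one compromised or confused actor, whether agent, model, or intern, can destroy both the data and its recovery path.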

10

Google signs Pentagon agreement covering classified AI work

X WatcherGuru — Aggregator account citing The Information's reporting.

x.com/WatcherGuru/status/2048997696560676968 →

Details

Context: If you're building on top of Gemini for an enterprise product, you're now indirectly downstream of a vendor with a classified-workload tier. That's not a problem on its own; it does change what 'shared capacity' could mean for your latency and your privacy posture.
Key points: Per The Information, Google has signed an agreement with the US government allowing the Pentagon to use Google's AI models for classified work.
The contract permits Google's AI to be used for 'any lawful government purpose.'
Comes after years of Google keeping its distance from defense contracts following the 2018 Project Maven employee revolt.
Joins existing OpenAI, Anthropic, and Microsoft government-cloud arrangements; the frontier-lab-as-defense-contractor pattern is now near-universal.
Provenance: Tweet · Primary source

11

No Idle GPUs: Managing Research Compute at Runway

X kamilsindi — Kamil Sindi — head of infrastructure at Runway, the video-generation lab.

x.com/kamilsindi/status/2048874303337210359 →

Details

Context: If you're running internal research GPUs, the gap between a well-scheduled cluster and a poorly-scheduled one is now north of 20% in dollars. This is one of the few public write-ups from a frontier video lab on the boring, expensive part.
Key points: Runway published a write-up on how it manages research GPU clusters with the goal of keeping utilization high.
The reply thread surfaces a 20% utilization gain attributed to a preemption-tolerant culture, explained through a 'parking garage' scheduling analogy.
Companion thread points to ICLR 2026 takeaways: recursive self-improvement, auto-harness optimization, and learning from non-verifiable reward as the next research frontier.
Provenance: Tweet · Primary source

12

How are people using so many tokens?

Source r/ClaudeAI

"70% of my 6B token/month usage was from cache_read_input_tokens alone."

www.reddit.com/r/ClaudeAI/comments/1sxq24c/… →

Details

Context: Token usage is becoming a Rorschach test for engineering practice. The number itself doesn't measure productivity; it measures how much context discipline a workflow has.
Key points: A senior engineer with 12 years of experience reports using ~20M tokens/month across 3 to 4 codebases and asks how anyone reaches hundreds of millions or billions.
Top reply: 'this here is why you're not using as many tokens' — pointing to the OP's habit of being explicit about architecture and code style.
Highest-volume users cite agentic harnesses running parallel projects, large CLAUDE.md memory files re-read every turn, and cache-read tokens dominating bills.
Community consensus: token-volume is not a productivity proxy; it tracks workflow design and context discipline.
Provenance: Source · Background source
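
To find out how much of your own volume is cache reads, you can total the usage counters per request. The sketch below assumes each record is a plain dict using the counter names from the quote above and its sibling fields (`input_tokens`, `output_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`); the sample records are invented, and how you export such records from your own tooling is left open.

```python
from typing import Dict, Iterable

# ASSUMPTION: counter names as they would appear in exported usage records.
TOKEN_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def cache_read_share(usages: Iterable[Dict[str, int]]) -> float:
    """Fraction of all counted tokens that were cache reads."""
    records = list(usages)
    total = sum(u.get(field, 0) for u in records for field in TOKEN_FIELDS)
    cached = sum(u.get("cache_read_input_tokens", 0) for u in records)
    return cached / total if total else 0.0


# ASSUMPTION: two invented requests from a long agentic session.
sample = [
    {"input_tokens": 1_000, "output_tokens": 500,
     "cache_creation_input_tokens": 2_000, "cache_read_input_tokens": 6_500},
    {"input_tokens": 500, "output_tokens": 500,
     "cache_creation_input_tokens": 0, "cache_read_input_tokens": 9_000},
]

share = cache_read_share(sample)
print(f"cache reads: {share:.1%} of tokens")
```

A high share here usually points at large context being re-read every turn, which is a statement about workflow design and not about how much work got done.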