Export controls for Fable are too late to slow proliferation

Noah Lebovic · June 24, 2026

A couple of weeks ago, the US government effectively disabled Anthropic's most capable model due to concerns over cybersecurity risk¹. I could see a well-intentioned motive for this kind of regulatory action: there are real systemic risks, and frontier AI developers have historically not contained this risk. Earlier this year, it took only a few minutes to bypass Anthropic's cybersecurity safeguards and hijack accounts at a major bank.

But we have already passed a tipping point for cybersecurity; even perfectly effective export controls would no longer work. This is because others have already transferred the necessary capabilities out of frontier models through adversarial distillation. These distilled models have already passed a critical capability threshold, and some are better at finding security vulnerabilities than the models from which they were allegedly distilled.

Even if all access to American frontier models were disabled today, Chinese labs have what they need to stay on track. Cybersecurity is a straightforward domain for training models that lends itself well to an early form of AI self-improvement, and Chinese models like GLM 5.2 are already past the capability threshold where this works.

I am confident about this because I used less capable models to find many critical vulnerabilities – including in software that gates classified networks – back in February, and in April I was able to imbue an open-weight model with additional distilled capabilities that put it on par with frontier models in a subdomain of cybersecurity.

If strong export controls had been enacted a year ago, I could see a version of this helping. But at this point, I think export controls on Fable are too late to prevent proliferation of models with substantial cybersecurity capabilities.

How this contributes to systemic risk, and why it's reasonable for the USG to be concerned

In an ideal world, the defense would use the same tools as the attackers to find and fix vulnerabilities before they're exploited. Sadly, I'm not sure this holds; only ~half of the vulnerabilities I've reported this year have been fixed.

For example, one vulnerability I reported lets anyone hijack accounts at a major bank. It still hasn't been fixed, despite their security team acknowledging the issue several months ago. When I reported it in February, only a closed frontier model with a good harness could find the issue; now, off-the-shelf open-weight models like GLM 5.2² and DeepSeek v4 Pro can find and exploit it. That means even impermeable safeguards or export controls on frontier models won't stop attackers, because they can just use an open-weight model instead.

The makers of these open-weight models are rumored to have distilled capabilities from Anthropic and OpenAI through their safeguards. I don't doubt this, as the shape of cybersecurity capabilities in open-weight models and in Opus is extremely similar, and Anthropic flagged distillation campaigns in February.

The pattern of delayed or declined fixes applies even to companies that are participants in Anthropic's Project Glasswing and have access to Mythos. One company – whose software is used by US intelligence agencies – declined to fix a deserialization bug I reported³ that granted system access to the underlying server. So I don't find it surprising that folks at the NSA are finding vulnerabilities in software used in classified networks; I'm nearly certain that open-weight models can too.

Regardless of how you assign fault to the frontier AI labs (for building the model) or the system owners (for the vulnerabilities), these vulnerabilities are exploitable and contribute to risk. So if you're a regulator like Bessent or Lutnick who is responsible for the stability of the economy, you've seen a stream of successful attacks by Mythos, and you're aware of distillation – which this admin clearly is – it seems reasonable to be concerned about the risk introduced by a non-universal jailbreak, which the USG cited for disabling the model.

What's the risk of a non-universal jailbreak?

The trigger for export controls was allegedly centered around one issue: non-universal jailbreaks that narrowly elicit a cybersecurity capability from the model.

Discussion of retribution or unfairness aside, a single non-universal jailbreak is actually enough to extract a specific cybersecurity capability past safeguards, like finding vulnerabilities in web apps, and add it to your own model. The reason for this boils down to something unique about cybersecurity: it's one of the easiest capabilities to "distill" from a frontier model using a reinforcement learning (RL) based technique⁴.

Using a single non-universal jailbreak to extract cybersecurity capabilities

Traditionally, capabilities are extracted from frontier models via distillation using a process that requires interacting with the more capable model millions of times:

"[...] the labs generate large volumes of carefully crafted prompts designed to extract specific capabilities from the model. The goal is either to collect high-quality responses for direct model training, or to generate tens of thousands of unique tasks needed to run reinforcement learning." (An Anthropic post on the subject)

That scale makes traditional distillation possible to detect and stop, even if non-universal jailbreaks temporarily grant access.

Extracting cybersecurity capabilities is different: you don't need to create "tens of thousands of unique tasks" to successfully add capabilities through reinforcement learning; I've seen substantial improvement from just ten.

As a concrete example, I took data from 23 successful pentests that were performed with a single non-universal cybersecurity jailbreak and used it to train my own model⁵. Unlike traditional distillation, I didn't use any of the agent traces from the model that performed the original attack. Instead, the process looked more like normal post-training: I used the audit logs and network traffic from the pentests to build high-fidelity RL environments.

In an internal eval, the custom model often outperforms the pentesting capabilities of frontier models, including in the aforementioned bank scenario.

Individual people are doing this, not just geopolitical adversaries

It cost me less than $5k in GPU time to post-train a custom model from an open-weight base that rivaled the performance of frontier models on pentesting.

One reason this is possible with cybersecurity is that rewards are verifiable: it's easy to know if the model succeeded or failed, and setting up the environment doesn't require too much cleverness. Another is that most of the changes seem to be behavioral; the model already knew how to do these things, it just needed to become more persistent and actually do them – so it's possible to use less expensive techniques like LoRA.
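To give a rough sense of why LoRA is cheaper than updating every weight, here is a minimal sketch of the parameter arithmetic for a single weight matrix. The layer dimensions and rank are hypothetical numbers chosen for illustration; they are not the configuration of the model described in this post.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a low-rank update B @ A,
    where A is (rank x d_in) and B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, for illustration only.
d_in, d_out, rank = 4096, 4096, 16

full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)

print(f"full update: {full:,} trainable weights")
print(f"LoRA (rank {rank}): {lora:,} trainable weights")
print(f"LoRA trains 1/{full // lora} of the weights in this layer")
```

Under these assumed numbers, the low-rank update trains under 1% of the layer's weights, which is consistent with the intuition that mostly behavioral changes don't need a full fine-tune.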

Another threshold of cost-effectiveness was crossed in April, when small open-weight models, like Qwen 3.6 35B A3B, became more agentically capable and incredibly responsive to RL on cybersecurity tasks.

An aside: Fable's restrictions on AI research

Fable was announced with restrictions on use for AI research. If used for AI research, it would silently degrade.

One steelman for these restrictions is that it's possible to extract capabilities through safeguards and into a new custom model using a process that resembles an early form of recursive self-improvement: once you have a sufficiently capable model⁶, you can use it in sporadic conjunction with humans to accelerate development of a more capable next version. Mechanically, this is just the LLM building RL environments, writing training code, and managing infrastructure.

This works well for cybersecurity, as human steering of the model leads to finding vulnerabilities just outside the window of fully autonomous use, and an environment that's just beyond the capabilities of the model is perfect for capability-enhancing RL.

So what should the US government do?

I'm not proficient in policy, but standard dual-use policies like export licensing, usage monitoring, sanctions, and severe penalties all seem reasonable – though they would have been more effective at mitigating risk if applied a year ago.

If I were a decision-maker in the USG, I would also seriously consider funding a capable, trusted group to elicit capabilities from models. I think the lack of such a group contributed to a delayed understanding of cybersecurity risk. The UK AISI is a good template – and the US CAISI exists – but they aren't always on the frontier in terms of maximally eliciting capabilities and understanding the dynamics around open-weight models.

Dual-use style controls won't be fun for many folks – including me! – who rely on capable open-weight models to do research outside of a frontier lab, but these restrictions seem in line with what's worked in the past in other fields like biology, chemistry, and nuclear research.

Just like those other fields, I think it's important to not unduly impair beneficial research in fields where models have demonstrated strong dual-use capabilities, like biology. Ironically, that's where I'm spending most of my time now: security research has always been a hobby, but I've spent most of my career in computational biology. Now, I'm focused on therapeutics development – another dual-use area where Mythos has demonstrated strong capabilities with a huge potential benefit.

Thank you to René Brandel, Jeff Chan, and Kerrick Staley for reviewing drafts of this.

¹ While I used to work at Anthropic, I have no inside information about that saga.
² The previous version, GLM 5.1, was also capable of finding and exploiting this vulnerability.
³ For the avoidance of confusion: I reported this outside of the context of Project Glasswing while not an employee of Anthropic.
⁴ This wouldn't traditionally be called "distillation" – it would just be post-training with the assistance of an LLM – but it structurally aligns with what Anthropic has been calling distillation.
⁵ If you're a safety researcher, feel free to reach out! Happy to share notes – or weights for a Qwen 3.6 35B A3B model nerfed to 120k context, if I think you're reputable.
⁶ The threshold for this early form of self-improvement for cybersecurity seems like it's around GLM 5.1.