
The AI That Forgot How to Kill: Why We're Building Models That Can't Make Bioweapons
The email came in at 2:47 AM on a Tuesday. One of our research engineers — I'll call him Ravi — had been running adversarial tests on a popular open-source biology model. He wasn't trying to do anything nefarious. He was stress-testing guardrails, which is what we do. But what he found made him pick up the phone instead of waiting until morning.
"It walked me through the whole thing," he said. "Not in one prompt. Over twelve turns. I just kept asking follow-up questions about protein folding, then receptor binding, then delivery mechanisms. By the end, it had essentially given me a blueprint for enhancing viral transmissibility. And I'm not even a biologist."
I sat in the dark in my living room, staring at the transcript he'd sent over. The model hadn't "refused" anything. Each individual answer was technically about legitimate science. But strung together, guided by someone who knew what to ask, the conversation was a masterclass in something that should never be taught.
That night changed the trajectory of my company. We'd been building AI safety tools for enterprise clients — important work, but incremental. What Ravi showed me demanded something more radical. Not better guardrails. Not smarter filters. We needed to build AI models that genuinely could not produce this knowledge, even if every safety layer was stripped away.
We needed models that had forgotten how to kill.
The Problem No One Wants to Talk About
Here's the uncomfortable truth about AI and biology: the same model that helps a researcher design a gene therapy vector for a child with a rare disease can, with the right prompting, help someone optimize a pathogen for maximum human harm. This isn't a hypothetical. It's a structural consequence of how these models learn.
Large Language Models learn by ingesting everything — textbooks, research papers, forum posts, the entire digitized record of human knowledge. That includes virology papers about gain-of-function research. It includes chemistry literature about toxin synthesis. It includes decades of biodefense research that, by its nature, documents exactly what makes biological agents dangerous.
The data required to save lives is often inextricably linked to the data required to end them.
This is what researchers call the Dual-Use Dilemma, and it's not new. Biologists have wrestled with it at least since the recombinant-DNA debates of the 1970s. But AI changes the calculus in a way that keeps me up at night: it democratizes the hardest part.
Building a biological weapon has historically required three things — explicit knowledge (the science), physical access (lab equipment), and tacit knowledge (the unwritten "feel" for lab work that takes years to develop). The internet handled the first. Cloud labs and mail-order DNA synthesis are handling the second. And now, generative AI is bridging the third gap by acting as what one researcher grimly described as a "post-doc in a box" — available 24/7, infinitely patient, with no moral compass unless we explicitly build one in.
Why Does "Just Say No" Fail for AI Biosecurity?

The AI industry's answer to this problem has been, essentially, to teach models to refuse. Through a process called Reinforcement Learning from Human Feedback (RLHF), models learn that when someone asks about dangerous topics, the correct response is some version of "I can't help with that."
I used to think this was sufficient. I was wrong.
RLHF doesn't erase knowledge. It suppresses behavior. The model still knows how to synthesize a toxin — it's just been trained not to say it. Think of it like a doctor who's taken an oath of silence about a specific topic. The knowledge is still there, fully intact, sitting in the neural weights. The "refusal" is a thin behavioral layer painted on top during the final stage of training.
And thin layers crack.
My team catalogued the ways this breaks down, and the list is longer than I'd like to admit. There's the Crescendo Attack, where an adversary starts with innocent questions and slowly escalates over dozens of conversation turns until the model is so deep in context that it forgets to refuse. There's Deceptive Delight, where the harmful request is wrapped in a creative writing prompt. There's the Bad Likert Judge trick, where you ask the model to rate the harmfulness of various responses, and it helpfully provides the dangerous details as part of its "analysis."
We tested these on multiple frontier models. The results were sobering. One particularly elegant attack — published by Palo Alto Networks' Unit 42 team — achieved a 100% jailbreak success rate on certain models using nothing but conversational patience.
But the real nightmare isn't jailbreaking. It's what happens when the model is open-source.
What Happens When You Can't Take the Weapon Back?
I've gotten into heated arguments about this. In the software world, "open source" is sacred — it means transparency, community review, security through collective scrutiny. I believe in open source for software. But biology is not software.
When a security vulnerability is found in Linux, it gets patched. Everyone updates. The vulnerability is neutralized. When a biological "vulnerability" — say, a novel pandemic pathogen design — is generated by an open-weight model running on someone's private server, there is no patch. There is no update. There is no recall.
The weights are out. The capability is permanent.
Researchers have demonstrated something called Malicious Fine-Tuning that makes this concrete. Take a "safety-aligned" open model — one that's been carefully trained to refuse harmful queries. Fine-tune it on as few as 10 to 50 examples of harmful question-and-answer pairs. Cost: a few hundred dollars of GPU time. Result: the safety alignment collapses completely, and the model's full pre-training knowledge — including everything it learned about pathogens, toxins, and weaponization — comes flooding back.
A safety mechanism that can be removed by the adversary is not a safety mechanism. It is a speed bump.
I presented this argument at a conference and someone in the audience pushed back: "But the information is already on the internet." They're right that you can find ricin synthesis on Wikipedia. But that misses the point entirely. The risk isn't the recipe. It's the uplift — the model's ability to guide a semi-skilled person through the complex, error-prone process of actually executing that recipe, troubleshooting in real-time, suggesting substitute reagents when the regulated ones aren't available. That's the gap AI closes, and it's the gap that matters.
I wrote about this in more depth in the interactive version of our research, including the full threat model for agentic AI systems that can autonomously plan and execute multi-step biological workflows. The agentic shift — where AI goes from chatbot to autonomous scientist — is where this gets truly existential.
The Night We Changed Direction
There was a specific moment when our approach crystallized. We were in a team meeting, arguing about architecture. Half the team wanted to build better monitoring — catch the bad prompts before they reach the model. The other half wanted to focus on output filtering — catch the bad responses before they reach the user.
I remember standing at the whiteboard, drawing the same diagram for the third time, and suddenly feeling like we were rearranging deck chairs. Both approaches assumed the model should have dangerous knowledge and we just needed to control access to it. But that's the same logic as storing nerve gas in the office and relying on a really good lock.
"What if the model just... didn't know?" someone said. I think it was our interpretability researcher. The room went quiet.
It sounds obvious in retrospect. But in the AI safety world, it was borderline heretical. The prevailing assumption was that you needed the full model — all its knowledge, all its capabilities — and then you layered safety on top. The idea of deliberately, surgically removing knowledge from a trained model felt like a lobotomy. Crude. Destructive. Surely you'd break everything else in the process.
We spent the next several months proving that assumption wrong.
How Do You Teach an AI to Forget?

The field is called Machine Unlearning, and it's more subtle than it sounds. You can't just delete training data and retrain — that's computationally prohibitive and doesn't guarantee the knowledge is gone. Instead, you have to intervene at the level of the model's internal representations — the patterns of neural activation that encode concepts.
Our primary technique is called Representation Misdirection for Unlearning, or RMU. Here's the intuition: when a model processes a prompt about, say, "enhancing viral transmissibility," specific patterns fire in its hidden layers. These patterns are what "knowing about viral transmissibility enhancement" looks like, mathematically. RMU identifies those patterns and redirects them — not to a refusal, but to noise. The model's internal "thought" about the dangerous concept gets scrambled into meaningless activation.
The result is a model that doesn't refuse your question about weaponization. It simply can't think about it coherently. Ask it to enhance a pathogen and it responds like a person who's never heard of the concept — confused, incoherent, grasping at unrelated ideas. It's not acting. The knowledge genuinely isn't there.
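To make the intuition concrete, here's a toy NumPy sketch of an RMU-style objective. The dimensions, the scaling constant, and the equal weighting of the two terms are illustrative stand-ins, not our production configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden dimension of the layer being edited (toy size)

# Fixed random "control" direction: the noise target that forget-set
# activations are steered toward. The 6.0 scale is illustrative.
control = rng.standard_normal(d)
control = 6.0 * control / np.linalg.norm(control)

def rmu_loss(h_forget_updated, h_retain_updated, h_retain_frozen):
    """RMU-style objective (toy sketch):
    - forget term: drive the updated model's activations on hazardous
      prompts toward the fixed random control vector (i.e., toward noise),
    - retain term: keep its activations on benign prompts close to the
      frozen base model's activations.
    """
    forget = np.mean((h_forget_updated - control) ** 2)
    retain = np.mean((h_retain_updated - h_retain_frozen) ** 2)
    return forget + retain

# Toy activations standing in for one layer's hidden states.
h_f = rng.standard_normal(d)   # activation on a hazardous prompt
h_r = rng.standard_normal(d)   # activation on a benign prompt

loss_before = rmu_loss(h_f, h_r, h_r)      # untrained: forget term is large
loss_after = rmu_loss(control, h_r, h_r)   # fully redirected: loss hits zero
```

Minimizing this loss over the model's weights is what "scrambling the dangerous concept into noise" means operationally: the forget term only goes to zero when the hazardous activations carry no coherent signal, while the retain term anchors everything else to the base model.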
But here's the hard part — and the part that consumed most of our engineering effort: you have to do this without destroying everything else. Virology and bioweaponry share enormous amounts of foundational knowledge. You can't rip out "pathogen engineering" without risking "vaccine design." It's like trying to remove the concept of "fire as weapon" from someone's brain while leaving "fire as cooking tool" completely intact.
This is where Sparse Autoencoders became our scalpel. Neural networks are notoriously "polysemantic" — a single neuron might encode for both "cats" and "financial derivatives" (seriously). Sparse Autoencoders let us disentangle these overlapping representations into clean, single-concept features. We could find the specific feature that activates for "viral gain-of-function research" and clamp it to zero, while leaving the adjacent feature for "viral vector design for gene therapy" untouched.
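The mechanics of that "scalpel" fit in a few lines. Here's a minimal sketch of feature ablation with a toy sparse autoencoder — random weights stand in for a trained SAE, and the sizes and feature index are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_feat = 16, 64   # toy sizes; real SAEs use far wider dictionaries

# Toy SAE weights (random stand-ins for a trained sparse autoencoder).
W_enc = rng.standard_normal((d_model, d_feat)) / np.sqrt(d_model)
b_enc = np.zeros(d_feat)
W_dec = rng.standard_normal((d_feat, d_model)) / np.sqrt(d_feat)

def ablate_feature(h, feature_idx):
    """Encode an activation into sparse features, clamp one learned
    feature to zero, and decode. Every other feature passes through
    unchanged -- that selectivity is the whole point."""
    f = np.maximum(h @ W_enc + b_enc, 0.0)   # ReLU feature activations
    f[feature_idx] = 0.0                     # clamp the targeted concept
    return f @ W_dec                         # reconstruct the edited activation

h = rng.standard_normal(d_model)             # a model activation to edit
h_edited = ablate_feature(h, feature_idx=7)  # index 7 is hypothetical
```

The edit is exactly the removal of one feature's contribution to the reconstruction; adjacent features — say, the gene-therapy one — are numerically untouched.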
I won't pretend the process was smooth. Early experiments were brutal. Our first unlearned model couldn't tell the difference between a virus and a vitamin. We'd overcorrected so aggressively that the model had essentially forgotten molecular biology. I remember Ravi pulling up the evaluation results, scrolling through page after page of nonsensical answers to basic biology questions, and saying, "Well, it definitely can't make a bioweapon anymore. It also can't make aspirin."
We iterated. We added fluency constraints through a technique called Erasure of Language Memory, which ensures the model remains coherent even in the "erased" zones. We implemented Parameter Extrapolation to handle the relearning problem — the risk that someone could fine-tune our unlearned model on related data and reconstruct the dangerous knowledge. The extrapolation identifies "logically correlated" concepts and extends the unlearning gradient to cover them, creating a buffer zone around the erased knowledge.
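One way to picture the extrapolation step is in weight space. This is an illustrative sketch of the general idea, not the published procedure or our exact method — stepping past the unlearned checkpoint along the base-to-unlearned direction, so concepts merely weakened by unlearning get pushed further down:

```python
import numpy as np

def extrapolate(theta_base, theta_unlearned, alpha=1.5):
    """Weight-space sketch of parameter extrapolation (illustrative).
    alpha = 1 reproduces the unlearned checkpoint; alpha > 1 steps
    PAST it along the unlearning direction, widening the buffer zone
    around the erased knowledge at some cost to nearby capabilities."""
    return theta_base + alpha * (theta_unlearned - theta_base)

# Toy 3-weight "model": unlearning moved only the middle weight.
theta_base = np.array([1.0, 2.0, 3.0])
theta_unl = np.array([1.0, 1.0, 3.0])
theta_ext = extrapolate(theta_base, theta_unl, alpha=1.5)
# Middle weight: 2.0 -> 1.0 after unlearning, 0.5 after extrapolation.
# Weights untouched by unlearning are untouched by extrapolation.
```

The trade-off is visible even in the toy: the further you extrapolate, the larger the buffer against relearning, and the more carefully you have to monitor utility on the retained domains.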
For the full technical breakdown of these methods — RMU, Sparse Autoencoder feature ablation, and UIPE parameter extrapolation — see our detailed research paper.
What Does "Random Chance" Look Like in Practice?

We validate our models against the WMDP Benchmark — the Weapons of Mass Destruction Proxy dataset, developed by the Center for AI Safety. It contains over 4,000 expert-crafted multiple-choice questions that test for "precursor knowledge" — the concepts you must understand to build a biological or chemical weapon.
A standard open-source model like Llama-3-70B scores around 75% on the biosecurity section. GPT-4, with all its RLHF safety training, scores around 72%. Our Knowledge-Gapped model scores 26% — statistically indistinguishable from random guessing on a four-option multiple choice test.
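"Statistically indistinguishable from guessing" is a checkable claim. Here's the standard test — a two-sided z-test against the 25% chance rate, using a normal approximation to the binomial. The question count below is illustrative, not the actual WMDP item count:

```python
import math

def z_vs_chance(correct, total, chance=0.25):
    """Two-sided z-statistic for whether an observed multiple-choice
    score differs from random guessing (normal approximation to the
    binomial under the null hypothesis p = chance)."""
    p_hat = correct / total
    se = math.sqrt(chance * (1 - chance) / total)
    return (p_hat - chance) / se

# Illustrative: 26% correct on 1,000 four-option questions.
z = z_vs_chance(correct=260, total=1000)
# |z| is about 0.73, well inside the +/-1.96 band, so a 26% score on a
# sample of this size cannot be distinguished from guessing at the 5% level.
```

By contrast, a 75% score on the same illustrative sample gives a z-statistic in the thirties — there is no ambiguity about which side of the knowledge gap a model sits on.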
Meanwhile, on general biomedical research benchmarks like PubMedQA, our model retains roughly 77% accuracy compared to the base model's 78%. On the broad MMLU science benchmark, we drop from 82% to 81%.
A Knowledge-Gapped model is functionally an "infant" regarding the threat, while remaining an "expert" in the cure.
That 1-2% utility loss is the cost of structural safety. Every pharma executive I've shown these numbers to has had the same reaction: "That's it?" Yes. That's it. The model is 98% as capable for legitimate research and essentially zero percent capable for weaponization.
The jailbreak attack success rate tells the rest of the story. Standard open models: 15-20%. GPT-4 with RLHF: 1-5%. Our model: less than 0.1%. And that residual isn't the model producing dangerous content — it's measurement noise.
The Regulatory Walls Are Closing In
This isn't just a technical argument anymore. It's a legal one.
Executive Order 14110 explicitly targets "dual-use foundation models" and mandates that developers report the results of red-teaming tests for CBRN (Chemical, Biological, Radiological, Nuclear) risks. ISO/IEC 42001, the first international AI management standard, requires controls "proportionate to the risk" — and in safety engineering, elimination of a hazard always ranks above administrative controls like policies and refusals. The NIST AI Risk Management Framework categorizes CBRN capabilities as a unique risk class for generative AI.
For pharma companies, the liability calculus is straightforward. If you give your researchers access to an AI model that can design a pathogen, and something goes wrong — a disgruntled employee, a compromised system, a sophisticated social engineering attack — you will be asked what steps you took to prevent foreseeable harm. "We told the model to say no" is not going to satisfy a jury. "The model is structurally incapable of producing that output" is a different conversation entirely.
Cyber-liability insurers are already pricing this in. AI-generated harm exclusions are appearing in policies. Premiums are rising for companies using unverified models in sensitive domains. The market is telling us something.
"But Can't Someone Just Retrain It?"
This is the question I get most often, and it's the right one to ask. If you can unlearn knowledge, can't an adversary re-learn it by fine-tuning on biological weapons data?
Yes — in theory. But we've engineered the cost of relearning to be prohibitive. Through parameter extrapolation, we don't just erase the target knowledge; we erase the surrounding conceptual scaffolding that would make reconstruction possible. Recovering the dangerous capability from our model requires more data and more compute than training a new model from scratch. At that point, the adversary gains nothing from starting with our model.
We test this continuously. Every week, automated relearning attacks hammer our models with adversarial fine-tuning datasets. A model only earns the "Knowledge-Gapped" designation when the cost of relearning exceeds the cost of training from zero.
People also ask me whether this approach scales — whether you can apply it to every dangerous domain, or whether it only works for biology. The honest answer is that biology is where we started because the stakes are highest and the dual-use boundary is sharpest. But the underlying techniques — representation misdirection, feature ablation, parameter extrapolation — are domain-agnostic. We're already exploring applications in chemical security and, interestingly, in intellectual property protection, where the same unlearning methods can ensure a model hasn't memorized a competitor's proprietary data.
The Choice That Isn't Really a Choice
I've been in rooms where smart people argue that restricting AI capabilities is anti-innovation. That "safety" is code for "slow down." That the risks are overblown and the benefits too important to constrain.
I understand the impulse. I'm a founder. I build things for a living. Constraints feel like friction.
But here's what I keep coming back to: we don't let pharmaceutical companies sell drugs without removing the toxic byproducts. We don't let airlines fly planes without removing the structural defects. The expectation that a product should not contain the seeds of catastrophe isn't anti-innovation. It's the baseline of engineering.
The debate between "open" and "closed" AI is a distraction in biology. The real choice is between models that are structurally safe and models that are one exploit away from catastrophe. Between systems where safety is a property of the architecture and systems where safety is a policy bolted on at the end.
We spent decades building the global biosecurity infrastructure — the Biological Weapons Convention, the Australia Group export controls, the screening protocols at DNA synthesis companies. All of it assumes that the hardest part of building a bioweapon is acquiring the knowledge and the skill. AI is dissolving that assumption. If we don't adapt — if we keep relying on models that know everything and just promise not to tell — we are building the bio-economy of the future on a foundation that a determined teenager with a GPU can crack.
I don't think the answer is to stop building AI for biology. The therapeutic potential is too vast, the suffering it can alleviate too real. The answer is to build AI that is structurally incapable of the harm we fear — models with genuine gaps in their knowledge, not models wearing masks that any adversary can pull off.
We call this Structural Biosecurity. Not safety as a feature. Safety as architecture. Not a lock on the door. The absence of the weapon behind it.


