
The $800,000-a-Day Typo: How a Catheter-Confused AI Is Killing Drug Discovery
It was a Tuesday night, and I was staring at a spreadsheet that made no sense.
We'd been running a pilot — testing how well a large language model could screen patient records against the eligibility criteria for an oncology trial. The protocol was straightforward, as oncology protocols go: a novel anticoagulant with a list of exclusion criteria, one of which was "prior cardiac catheterization." Heart catheterization. A catheter threaded into the chambers of the heart to evaluate coronary function. A serious, invasive cardiac procedure.
The AI had flagged a patient as ineligible. Reason: cardiac catheterization. I pulled up the patient's record. The procedure documented was a central venous puncture — a central line placed in the jugular vein for medication delivery. It's a bedside vascular access procedure, performed routinely in the ICU. It is not a heart procedure. It is not even close.
But the model saw "catheter," saw "venous," saw the note was written in a cardiac care unit, and concluded: same thing. The patient was gone. Excluded. Never surfaced to the site coordinator. And here's what haunted me — nobody would have noticed. The system would have silently discarded an eligible patient, and the trial would have been one person shorter, and no one would have known why enrollment was lagging.
That was the moment I stopped believing that better prompts would fix clinical trial recruitment. The problem isn't the model's vocabulary. The problem is that we're using a probability machine to do logic's job.
Why Does 80% of Pharma's Pipeline Get Stuck in Recruitment?
The pharmaceutical industry has a dirty secret that no earnings call likes to dwell on: approximately 80% of clinical trials fail to meet their enrollment timelines. Not because the science is wrong. Not because patients don't exist. Because the process of finding eligible patients and matching them to trials is broken at a fundamental level.
Let me put a dollar figure on that brokenness. According to the Tufts Center for the Study of Drug Development, a single day of delay in drug development now costs roughly $800,000 in lost prescription sales for a high-performing asset. In cardiovascular and hematology, that number climbs past $1.3 million per day. For a six-month enrollment delay on a competitive oncology drug — the kind of delay that happens routinely — that's roughly $145 million even at the conservative rate, enough to render a scientifically superior therapy commercially dead on arrival.
The bottleneck in drug discovery is no longer the science. It's the syntax.
And the operational reality is even grimmer than the financial one. 37% of research sites under-enroll, and 11% fail to enroll a single patient. Each screen failure — a patient who looks eligible on paper but isn't — costs about $1,200. When your AI tool generates 100 "matches" and only 5 are real, you haven't automated recruitment. You've launched a denial-of-service attack on your own clinical sites.
I watched this happen. Site coordinators who'd been excited about our early prototypes started ignoring the match lists entirely. "Your tool gives me garbage," one told me on a call. She wasn't wrong. She went back to manually scanning PDFs. Ctrl+F. The industry's actual state of the art.
The Catheter That Broke My Faith in LLMs
Let me go deeper into that Tuesday night error, because it illustrates something that most AI-in-healthcare pitches gloss over.
When a large language model processes text, it converts words into vectors — points in a high-dimensional mathematical space. Words that appear in similar contexts end up near each other. "Cardiac catheterization" and "central venous catheterization" are, in vector space, practically neighbors. Both involve catheters. Both involve the vascular system. Both appear in clinical notes surrounded by similar medical jargon.
But they are completely different procedures targeting different anatomical structures with different risk profiles and different clinical implications. One goes into the heart. The other goes into a vein. The protocol excluded the first. The patient had the second. And the AI couldn't tell the difference because it doesn't understand anatomy — it understands word proximity.
This isn't a corner case. Studies evaluating AI models for trial matching have identified this exact failure mode: models incorrectly concluding that cardiac catheterization is the same as a central venous puncture, leading to wrongful exclusion. It's a class of error, not a one-off bug.
I brought this to my team the next morning. One of our engineers — brilliant guy, deep learning background — suggested we could fix it with better fine-tuning. More medical training data. Larger context windows. I remember the argument that followed, because it was the argument that shaped our entire technical direction. My position was simple, and I said it probably too bluntly: you cannot fine-tune your way out of a missing ontology.
An LLM doesn't know that "cardiac catheterization" lives on a different branch of the medical procedure tree than "central venous catheterization." It doesn't have a tree. It has a fog of statistical associations. And no amount of training data will give it the rigid, hierarchical understanding that a medical ontology provides — the knowledge that Procedure A is a subtype of "Procedure on heart" while Procedure B is a subtype of "Catheterization of vein," and that these are categorically distinct.
That argument ended with us rebuilding our architecture from the ground up.
What Is Ontology-Driven Phenotyping, and Why Should You Care?

Here's the idea in plain language: instead of asking an AI to read medical records and guess what they mean, we force the AI to translate every medical concept it encounters into a standardized code from SNOMED CT — the world's most comprehensive clinical terminology system — before it makes any decisions.
SNOMED CT isn't a dictionary. It's a massive directed graph where medical concepts are connected by logical relationships. The most important one is the Is-A relationship. "Coronary angiography" is-a "cardiac catheterization" is-a "procedure on heart." "Central venous catheterization" is-a "catheterization of vein" is-a "insertion of vascular catheter." Different branches. Different parents. Different meaning.
So when our system encounters a protocol that excludes "cardiac catheterization" and a patient record that mentions a central line placement, it doesn't compare strings or vectors. It asks the ontology: Is this patient's procedure a subtype of the excluded procedure? The graph answers no. The patient stays eligible. Deterministically. Every time.
We stopped asking "do these words look similar?" and started asking "are these concepts logically related?" That single shift changed everything.
This works even when doctors write in shorthand. "Heart cath," "angio," "LHC," "central line," "CVC insertion" — SNOMED CT maps all of these variants to specific concept IDs. Once you're operating on concept IDs instead of strings, the ambiguity vanishes. You're matching meaning to meaning, not word to word.
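The mechanics of that "concept, not string" check can be sketched in a few lines. This is a deliberately toy model: real SNOMED CT is a directed acyclic graph with multiple parents per concept and numeric concept IDs, while the dictionary below is a single-parent chain with invented names, purely to show what deterministic subsumption looks like.

```python
# Toy stand-in for SNOMED CT. Concept names and the single-parent is-a
# chain are illustrative placeholders, not real SNOMED CT structures.
IS_A = {
    "coronary angiography": "cardiac catheterization",
    "cardiac catheterization": "procedure on heart",
    "central venous catheterization": "catheterization of vein",
    "catheterization of vein": "insertion of vascular catheter",
}

SYNONYMS = {  # shorthand clinicians actually write -> canonical concept
    "heart cath": "cardiac catheterization",
    "LHC": "cardiac catheterization",
    "central line": "central venous catheterization",
    "CVC insertion": "central venous catheterization",
}

def ancestors(concept: str) -> set[str]:
    """Walk the is-a chain upward, collecting every supertype."""
    found = set()
    while concept in IS_A:
        concept = IS_A[concept]
        found.add(concept)
    return found

def is_subtype(candidate: str, excluded: str) -> bool:
    """Deterministic subsumption: is the candidate the excluded
    concept itself, or one of its descendants?"""
    c = SYNONYMS.get(candidate, candidate)
    e = SYNONYMS.get(excluded, excluded)
    return c == e or e in ancestors(c)

# The patient's central line is NOT a subtype of the excluded procedure:
print(is_subtype("central line", "heart cath"))          # False: stays eligible
print(is_subtype("coronary angiography", "heart cath"))  # True: correctly excluded
```

No vectors, no similarity scores: the central line lands on the "catheterization of vein" branch, the excluded procedure lives under "procedure on heart," and the graph answers the same way every time.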
I wrote about the technical architecture behind this — the SNOMED CT hierarchies, the post-coordination for laterality and severity, the construction of computational phenotypes — in the interactive version of our research. But the core insight is simple: medical AI needs a map of medicine, not just a statistical model of medical language.
How Do You Parse "Unless"?

Ontology handles the what — what medical concepts are we talking about? But clinical trial protocols have another layer of complexity that generic AI handles terribly: the logic of eligibility.
Here's a real exclusion criterion from an oncology trial:
"Exclude patients with hypertension, unless it is well-controlled on stable medication for at least 3 months."
A keyword matcher sees "hypertension" and excludes the patient. A boolean filter sees hypertension = TRUE and excludes. Both approaches throw away a patient who has hypertension but is perfectly eligible because their blood pressure has been controlled and stable for months.
This drove me slightly crazy when I first encountered it at scale. We pulled the eligibility criteria from a batch of Phase II and III oncology protocols and found that the majority contained conditional exclusions — "unless" clauses, "except when" clauses, temporal dependencies like "within 6 months" or "completed more than 90 days prior." These aren't edge cases. They're the norm. And every single one of them is a trap for systems that can't reason about conditions, permissions, and time.
We turned to deontic logic — a branch of formal logic that deals with obligations, permissions, and prohibitions. It's the logic of norms and rules, originally developed by philosophers, and it maps perfectly onto clinical trial criteria. Having hypertension is prohibited — unless you also satisfy the permission conditions of controlled blood pressure and stable medication for the required duration. The system models this as a formal logical expression, checks the patient's timeline, and computes eligibility with mathematical precision.
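Here is one way to picture that prohibition-plus-permission structure in code. The field names and the 90-day threshold are hypothetical, not our production schema; the point is only that the "unless" clause becomes an explicit permission check rather than a keyword hit.

```python
from datetime import date

def excluded_for_hypertension(patient: dict, reference: date) -> bool:
    """Prohibition: hypertension. Permission that overrides it:
    well-controlled BP on stable medication for at least ~3 months.
    Field names are illustrative placeholders."""
    if not patient["has_hypertension"]:
        return False  # the prohibition doesn't even apply
    days_stable = (reference - patient["meds_stable_since"]).days
    permission = patient["bp_controlled"] and days_stable >= 90
    return not permission  # excluded only when no permission holds

patient = {
    "has_hypertension": True,
    "bp_controlled": True,
    "meds_stable_since": date(2024, 1, 10),
}
print(excluded_for_hypertension(patient, date(2024, 6, 1)))  # False: eligible
```

A keyword matcher would have thrown this patient away at the word "hypertension." The formal version keeps them, and can show exactly which condition of the permission was satisfied.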
Another pattern we see constantly:
"Patients must not have received prior chemotherapy, unless it was neoadjuvant therapy completed more than 6 months ago."
The AI has to simultaneously verify three things: Did the patient receive chemotherapy? Was its intent neoadjuvant? And did it end more than six months before the reference date? We handle this with what the literature calls Temporal Ensemble Logic — the system builds a timeline of the patient's clinical history and places events within valid observation windows.
A keyword search sees "chemotherapy" in the record and panics. Our system sees chemotherapy, checks the intent attribute, measures the time delta, and correctly determines eligibility.
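The timeline version of that three-part check can be sketched the same way. Event fields here are invented for illustration; a real record would carry coded intent and dates, and "6 months" would be pinned to the protocol's exact definition.

```python
from datetime import date, timedelta

def excluded_for_prior_chemo(events: list[dict], reference: date) -> bool:
    """Exclude any prior chemotherapy, unless it was neoadjuvant
    AND completed more than ~6 months before the reference date."""
    for ev in events:
        if ev["type"] != "chemotherapy":
            continue
        exempt = (
            ev["intent"] == "neoadjuvant"
            and (reference - ev["end_date"]) > timedelta(days=183)
        )
        if not exempt:
            return True  # a non-exempt chemo course excludes the patient
    return False

history = [{"type": "chemotherapy", "intent": "neoadjuvant",
            "end_date": date(2023, 9, 1)}]
print(excluded_for_prior_chemo(history, date(2024, 6, 1)))  # False: eligible
```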
The Architecture Nobody Asked For (But Everyone Needs)

When I describe our approach to investors and pharma executives, I sometimes get a particular look — the look that says "why are you making this so complicated? Just use GPT."
I got that look from a potential partner about a year into our development. He was a smart guy, ran a CRO's digital innovation team, and he genuinely believed that a well-prompted GPT-4 wrapper with some retrieval-augmented generation bolted on would solve the problem. "The models are getting better every quarter," he told me. "You're over-engineering this."
I pulled up our test results. Same dataset, same eligibility criteria. His team's GPT wrapper: variable accuracy between runs — literally different answers on the same patient depending on when you ran it. No audit trail. No way to explain why a patient was included or excluded. And accuracy that ranged from 63% to 87% depending on the complexity of the criteria.
Our neuro-symbolic system: deterministic, reproducible, >95% accuracy, with a complete reasoning trace for every decision.
The FDA doesn't accept "the AI thought so" as a rationale. They need a logic proof. That's not a nice-to-have — it's the difference between a tool that augments clinical research and a toy that impresses demo audiences.
Here's how the architecture actually works, without drowning you in implementation details:
The LLM reads. It ingests the messy, unstructured reality of medical records — scanned PDFs, handwritten notes, physician narratives — and its only job is to extract medical entities and normalize them. It reads "pt complains of chest pain" and outputs the SNOMED concept for chest pain. That's it. The LLM is the perception layer. It never makes an eligibility decision.
The knowledge graph maps. Extracted entities get mapped to SNOMED CT concept IDs, disambiguated by context. "Cold" the virus versus "cold" the temperature. The graph structure resolves the ambiguity.
The logic solver reasons. This is where the actual eligibility determination happens — a deterministic symbolic reasoner that applies deontic logic rules against the patient's structured phenotype. It checks Is-A relationships, calculates temporal durations, evaluates conditional permissions. Given the same inputs, it always produces the same output.
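The division of labor across those three layers can be shown with a minimal sketch. The "LLM" here is a stub and the concept IDs are placeholders; what matters is the shape: the neural layer only extracts, and everything downstream is deterministic.

```python
def perceive(note: str) -> list[str]:
    """Neural layer (stubbed with a lookup for this sketch): extract
    entity mentions from raw text. It never decides eligibility."""
    known = ["chest pain", "central line", "heart cath"]
    return [term for term in known if term in note.lower()]

CONCEPT_IDS = {  # mapping layer: mention -> placeholder concept ID
    "chest pain": "C:chest_pain",
    "central line": "C:central_venous_catheterization",
    "heart cath": "C:cardiac_catheterization",
}

def decide(concepts: set[str], excluded: set[str]) -> bool:
    """Symbolic layer: eligible iff no extracted concept is excluded.
    Same inputs, same answer, every time."""
    return concepts.isdisjoint(excluded)

mentions = perceive("Pt c/o chest pain; central line placed for abx.")
concepts = {CONCEPT_IDS[m] for m in mentions}
print(decide(concepts, excluded={"C:cardiac_catheterization"}))  # True: eligible
```

In the real system the `decide` step is a full reasoner with subsumption and temporal logic rather than a set test, but the boundary is the same: probability stops at perception, and logic takes over from there.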
We also use GraphRAG instead of standard vector-based retrieval. Standard RAG retrieves document chunks based on word similarity. GraphRAG traverses relationships. If a trial excludes "any drug interacting with CYP3A4 enzymes" and a patient is taking Drug B, standard RAG might miss the connection if the patient's record never explicitly says "Drug B is a CYP3A4 inhibitor." GraphRAG knows, because the knowledge graph contains the relationship: Drug B inhibits CYP3A4. Multi-hop reasoning. The kind of connection a pharmacist makes intuitively but a text-matching system never would.
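The multi-hop step is just graph traversal over typed edges. The drug name and the edge set below are invented for illustration; the traversal itself is the point.

```python
from collections import deque

# Toy relationship edges: (source, relation) -> destinations.
# "DrugB" and these edges are hypothetical examples.
EDGES = {
    ("DrugB", "inhibits"): ["CYP3A4"],
    ("CYP3A4", "metabolizes"): ["StudyDrug"],
}

def connected(start: str, target: str) -> bool:
    """Breadth-first search over relationship edges (multi-hop)."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for (src, _rel), dests in EDGES.items():
            if src != node:
                continue
            for dest in dests:
                if dest not in seen:
                    seen.add(dest)
                    queue.append(dest)
    return False

# A text match on the patient's med list would never surface this link:
print(connected("DrugB", "CYP3A4"))  # True: flag the interaction
```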
For the full technical breakdown of the architecture — the Type 4 neuro-symbolic integration, concept-aware decoding, the FHIR/CDISC interoperability layer — see our detailed research paper.
"But Won't the Models Just Get Better?"
People always push back on this point, and I understand why. The trajectory of LLMs is genuinely impressive. Every few months, a new model scores higher on medical benchmarks. So why not wait?
Because the problem isn't capability — it's architecture. An LLM is a probabilistic token predictor. Making it bigger and training it on more medical text makes it a better probabilistic token predictor. It doesn't make it a logic engine. It doesn't give it determinism. It doesn't give it an audit trail. And in a regulated industry where the FDA and EMA need to know exactly why Patient #4,271 was excluded from Trial XYZ-003, "the model predicted this was the most likely answer" is not acceptable.
There's also the privacy problem that doesn't go away with scale. Sending unstructured patient records to cloud-based model APIs — even enterprise ones — creates HIPAA and GDPR exposure that no stack of business associate agreements fully mitigates. Our architecture keeps patient data within secure enclaves. The symbolic reasoning layer and the knowledge graph run locally. The neural layer can be a local open-source model. Protected health information never leaves the firewall.
And then there's the reproducibility issue that I find most damning. Run the same patient record through an LLM twice with the same prompt, and you can get different answers. Change the temperature setting, adjust the context window, rephrase the question slightly — different result. Clinical trials require 100% reproducible decisions. The regulatory framework demands it. The ethics demand it.
The Patients We're Losing
I've spent most of this essay talking about architecture and economics, but I want to end somewhere more honest.
For patients with metastatic cancer, or AML, or a rare genetic disorder, a six-month enrollment delay isn't a line item on a financial model. It's the difference between accessing a potentially curative therapy and not. When our system wrongly excludes an eligible patient — because it confused two catheter procedures, or because it couldn't parse an "unless" clause — that patient doesn't get a notification saying "sorry, the AI made an error." They just never hear about the trial. Their oncologist never gets the alert. The slot goes unfilled, or it goes to someone else, and the patient continues on standard of care, never knowing that an option existed.
That's what I think about when someone tells me to just use a wrapper API.
We built Veriprajna because the gap between what AI promises in healthcare and what it actually delivers is not a marketing problem — it's an engineering problem. The industry chose the easy architecture (throw an LLM at it) instead of the right architecture (give the LLM an ontology and a logic solver and constrain it to do only what it's good at).
We are not going to prompt-engineer our way to precision medicine. We need systems that reason, not systems that guess confidently.
The cure for the recruitment crisis isn't better language models. It's the recognition that eligibility is a logic problem wearing a language costume. Strip away the unstructured text, map it to a medical ontology, apply formal reasoning, and suddenly the 80% of trials that miss enrollment timelines starts to look like a solvable problem rather than an industry inevitability.
Stop matching words. Start matching patients. The difference is a knowledge graph, a logic solver, and the willingness to build something harder than a wrapper.


