A striking editorial image evoking the tension between a too-perfect AI accuracy number and the messy reality of clinical healthcare — specific to the article's domain of healthcare AI accountability.

Artificial IntelligenceHealthcareTechnology

The Most Dangerous Number in AI Right Now Is 99.999%

Ashutosh Singhal March 28, 202614 min read

I was on a call with a hospital CIO last year when he pulled up a vendor's pitch deck and shared his screen. Slide seven had a single number on it, centered, in 72-point font: <0.001% hallucination rate. Below it, in smaller type: "clinically validated."

He looked at me through the webcam and said, "Ashutosh, should I believe this?"

I told him I didn't know — but that the number itself should make him nervous, not reassured. A hallucination rate below one in a hundred thousand, for a system summarizing messy, contradictory, handwritten-and-dictated clinical notes across dozens of specialties? That's not an accuracy claim. That's a magic trick. And in my experience, when someone shows you a magic trick in a sales meeting, you should check your pockets afterward.

A few months later, the Texas Attorney General made that instinct official. In September 2024, the state reached a landmark settlement with Pieces Technologies, a Dallas-based healthcare AI company, over what the AG alleged were deceptive accuracy claims — including that exact 0.001% critical hallucination rate. It was the first enforcement action of its kind against a generative AI company in healthcare, and it didn't require any new AI-specific law. Just the plain old Texas Deceptive Trade Practices Act, the same statute used to go after shady car dealerships.

That settlement changed how I think about everything we build at Veriprajna. Not because we were doing anything wrong, but because it crystallized something I'd been struggling to articulate to clients: the problem with enterprise AI isn't that the models hallucinate. It's that the industry has built an entire go-to-market strategy around pretending they don't.

What Does a 0.001% Hallucination Rate Actually Mean?

Let me walk you through the math, because the math is where the magic trick falls apart.

Large language models are probabilistic systems. They don't "know" things the way a database knows things. They predict the next word — or more precisely, the next token — based on patterns learned during training. The probability of any generated output is the product of the probabilities of every individual token in the sequence. Each token is a tiny bet, and the final output is a long chain of tiny bets multiplied together.

Now, claiming a critical hallucination rate below 0.001% means that fewer than 1 in 100,000 outputs contain an error serious enough to cause clinical harm. To validate that claim with any statistical confidence, you'd need an enormous, perfectly annotated gold-standard dataset — tens of thousands of clinical summaries, each reviewed by domain experts who agree on what counts as "critical." That dataset doesn't exist. Not for Pieces Technologies, not for anyone. Clinical notes are too idiosyncratic, too specialty-specific, too dependent on the individual physician's style and the patient's history.

When someone claims 99.999% accuracy on a task where even human experts disagree on what "correct" looks like, they haven't solved the problem. They've defined it away.

The Texas AG's investigation concluded that the metrics Pieces used were "likely inaccurate" and potentially misleading to the hospitals deploying the tool — which included Houston Methodist, Parkland Hospital, Children's Health System of Texas, and Texas Health Resources. Four major systems. Real patients. Real clinical notes being summarized by a system whose accuracy claims couldn't withstand regulatory scrutiny.

The Night I Stopped Trusting Benchmarks

A side-by-side comparison showing the dramatic gap between automated AI evaluation metrics (95%+ scores) and actual physician review results (23 out of 50 flagged), illustrating why benchmarks are unreliable proxies for clinical safety.

I want to tell you about a moment that rewired my thinking on this.

We were running an internal evaluation of a clinical summarization pipeline — not for a client, just for our own R&D. My team had built what we thought was a solid RAG-based system. Retrieval-Augmented Generation, for the uninitiated, is a technique where instead of asking the model to answer from memory, you first retrieve relevant documents from a knowledge base and feed them to the model as context. It's supposed to ground the output in facts.

Our internal metrics looked great. Faithfulness scores above 95%. Retrieval precision in the high 90s. We were feeling good. Then one of our engineers — Priya, who has this maddening habit of being right about things nobody wants to hear — suggested we do something different. Instead of measuring against our own test set, she pulled fifty real discharge summaries from a public dataset and had two physicians independently review the AI-generated versions.

The results came back on a Thursday night. I remember because I was making dinner and my phone buzzed with a Slack message from Priya that just said: "You should look at this before tomorrow."

The physicians flagged issues in 23 out of 50 summaries. Not catastrophic errors in most cases — a medication dosage pulled from a previous admission instead of the current one, a family history detail attributed to the wrong relative, a lab value that was directionally correct but numerically off. But in a clinical context, "directionally correct but numerically off" can mean the difference between a safe discharge and a readmission.

Our automated metrics had missed almost all of it. The system was generating text that was linguistically fluent and semantically similar to the source material — which is exactly what the metrics measured. But it wasn't generating text that was clinically safe, which is what actually mattered.

That was the night I stopped trusting benchmarks as a proxy for quality. And it's why the Pieces case hit me so hard when it broke. I knew exactly how a company could look at its own numbers, believe them sincerely, and still be dangerously wrong.

Why Did Texas Use a Consumer Protection Law — Not an AI Law?

This is the part that should keep every AI vendor up at night.

The Texas Attorney General didn't wait for Congress to pass an AI regulation. He didn't invoke any novel legal theory. He used the Texas Deceptive Trade Practices–Consumer Protection Act — a statute that's been on the books for decades — and applied it to AI accuracy claims the same way it would be applied to a company lying about the fuel efficiency of a car.

The resulting Assurance of Voluntary Compliance locks Pieces Technologies into a five-year period of heightened transparency. The company must now disclose the definitions and calculation methods behind any accuracy metrics it advertises. It must notify customers of "known or reasonably knowable harmful or potentially harmful uses" of its products. It must provide documentation on its training data and model types. And it must respond to information requests from the AG's office within 30 days.

This is not a slap on the wrist. This is a template.

The first major AI enforcement action in healthcare didn't require new legislation. Existing consumer protection law was enough — and every state has one.

I've talked to enterprise legal teams who assumed they were safe because "there's no AI law yet." That assumption is wrong. If you make a claim about your AI system's performance, and that claim is misleading, you're already exposed under existing law. The Pieces settlement just proved it.

I wrote about the full regulatory implications — including the specific obligations under the settlement — in the interactive version of our research. If you're in procurement, legal, or compliance, it's worth reading closely.

The Wrapper Problem

Here's what I think actually went wrong, architecturally — and why it matters far beyond Pieces Technologies.

Most enterprise AI products shipping today are what the industry calls "wrappers." A wrapper takes a user's input, sends it to a foundational model API — GPT-4, Claude, Gemini — and displays the response with some light formatting and maybe a few guardrails bolted on. It's fast to build, fast to ship, and fast to sell. It's also fundamentally fragile.

A wrapper doesn't understand your data. It doesn't maintain context across a patient's longitudinal record. It doesn't know that Dr. Ramirez in cardiology writes notes differently than Dr. Chen in oncology. It doesn't have access to the institutional knowledge that a nurse with twenty years of experience carries in her head. It just predicts tokens.

I had an argument with an investor about this once — a heated one. He'd seen a demo of a wrapper-based clinical documentation tool and was convinced it was "good enough." His exact words: "Ashutosh, just use GPT. Fine-tune it a little. Ship it. The market won't wait."

I told him the market wouldn't wait, but the regulators would. And the patients would. And the lawsuits would.

He didn't invest. I don't regret the conversation.

The alternative — what we build at Veriprajna and what I think the industry needs to move toward — is deep integration. That means embedding the model into the enterprise's actual data fabric. It means using RAG not as a checkbox feature but as a genuine grounding mechanism, with retrieval pipelines tuned to the specific domain. It means fine-tuning on domain-specific corpora. It means multi-layered human oversight where the humans actually have the authority and the context to catch errors.

Research backs this up. Studies show that 65% of developers report AI "loses relevant context" during complex tasks — and that's in software engineering, where the stakes are a broken build, not a broken patient. In healthcare, context loss isn't a bug. It's a safety event.

What Actually Works: Adversarial AI and the 3.7-Hour Problem

A tiered diagram showing AI use cases mapped to required oversight speed and risk level, illustrating the intervention speed problem and the mismatch between deployment risk and actual oversight.

I'll give Pieces Technologies credit for one thing: their architecture included an Adversarial Detection Module. The idea is sound — use a second AI model to police the first one, scanning generated summaries for discrepancies against the source clinical data. Their technical paper showed the adversarial module was 7.5 times more effective at catching clinically significant hallucinations than random sampling.

That's a real result. It's also not enough.

Here's why. When the adversarial module flagged an error, the flagged summary was routed to a board-certified physician for review. The median time to remedy? 3.7 hours. For a progress note that gets filed at the end of a shift, maybe that's acceptable. For a discharge summary that determines whether a patient goes home today or stays another night, 3.7 hours is an eternity. For a real-time clinical decision support tool — the kind everyone's racing to build — it's useless.

This is what I call the intervention speed problem, and it's one the industry hasn't solved. You can build the best hallucination detection system in the world, but if the correction loop is slower than the clinical workflow, the uncorrected output is what the doctor sees when it matters.

Detection without timely correction is just documentation of failure.

At Veriprajna, we've started thinking about this in tiers. Not every AI use case carries the same risk, and not every use case needs the same speed of human intervention. Administrative scheduling? Audit it weekly. Clinical documentation? Review before it hits the chart. Real-time decision support? The human has to be in the loop before the output is generated, not after.

The AI Safety Level framework emerging in healthcare maps this well — from Level 1 (low-impact administrative tasks) up to Level 5 (autonomous patient interaction). Most organizations I talk to are deploying Level 3 and 4 tools with Level 1 oversight. That's the gap that regulators are going to keep closing.

Why Are Only 5% of Companies Getting Real Value from AI?

A comparison infographic showing how the successful 5% of companies allocate implementation effort (70% organizational, 20% tech, 10% algorithm) versus how the unsuccessful 95% invert that ratio, making the pattern immediately visible.

There's a statistic from enterprise AI research that haunts me: only 5% of companies are achieving measurable business value from AI at scale. Not 50%. Not 25%. Five percent.

The companies in that 5% share a pattern. They spend 70% of their implementation effort on organizational transformation — redesigning workflows, redefining roles, changing how decisions get made. Twenty percent goes to the technology stack. Ten percent goes to the algorithm itself.

Everyone else inverts that ratio. They spend months picking the right model, weeks building the pipeline, and approximately zero time thinking about whether the humans downstream actually trust, understand, or can effectively oversee the AI's output.

I've seen this firsthand. We worked with a team that had built a technically elegant system — beautiful architecture, clean code, impressive benchmarks. But the clinicians it was built for didn't use it. Not because it was bad, but because nobody had asked them what they needed. The tool generated summaries in a format that didn't match their existing workflow. It surfaced information they already knew and buried information they actually needed. It was a solution to a problem defined by engineers, not by the people doing the work.

We spent three weeks just sitting with the clinical staff, watching them work, before we wrote a single line of code for the redesign. That's the 70% that matters.

For the full technical breakdown of evaluation frameworks, adversarial detection architectures, and the ROI patterns that separate the 5% from the 95%, see our detailed research paper.

How Should Enterprises Actually Evaluate AI Accuracy Claims?

People ask me this constantly, so let me be direct.

First, demand definitions. When a vendor tells you their hallucination rate is X%, ask: What counts as a hallucination? Who annotated the test set? How large was it? Was it evaluated by domain experts or by another AI model? If they can't answer these questions clearly, the number is meaningless.

Second, look at the evaluation framework. The best one I've seen for healthcare is Med-HALT — the Medical Domain Hallucination Test. It doesn't just measure whether the model gets the right answer. It tests whether the model can resist giving a confidently wrong answer. One of its subtests, the False Confidence Test, presents the model with a question and a suggested "correct" answer that's actually wrong, then checks whether the model goes along with it. Another test, called "None of the Above," checks whether the model can recognize when none of the provided options are correct — a critical skill, because in medicine, "I don't know" is often the safest answer.

Third, insist on what the FAIR-AI framework calls an "AI Label" — a standardized disclosure that tells the end user what data the model was trained on, what its known failure modes are, and what version is currently deployed. Think of it like a nutrition label for AI. If a vendor won't give you one, ask yourself what they're hiding.

The question isn't "how accurate is your AI?" It's "how do you know — and can you prove it to a regulator?"

The Settlement Changed Everything. Most People Haven't Noticed Yet.

Here's what I think is going to happen over the next two years, and I'm saying this as someone who builds the systems that will be subject to these rules.

The Texas settlement is going to be replicated. Other state AGs are watching. The FTC is watching. The pattern is established: you don't need an AI law to regulate AI claims. You just need a consumer protection statute and a vendor who overpromised.

Enterprise procurement is going to change. Hospital systems and large buyers are going to start requiring independent third-party audits of AI accuracy claims before signing contracts. The settlement explicitly allows for this as an alternative to self-disclosure, and smart buyers are going to demand it.

The wrapper model is going to die — slowly, then all at once. Not because wrappers don't work for low-stakes applications (they do), but because the regulatory cost of deploying an ungrounded system in a high-stakes environment is about to become prohibitive. The companies that survive will be the ones that invested in deep integration when it was hard, not the ones that shipped fast and hoped nobody checked.

And the 0.001% claim? It's going to become a cautionary tale — the enterprise AI equivalent of Theranos's "one drop of blood." A number so perfect it should have been a warning.

I think about that hospital CIO sometimes. The one who showed me the slide with the big number. He didn't buy that system. He told me later that something about the precision of the claim bothered him — it was too clean, too confident for a technology he knew was fundamentally probabilistic.

He was right to be bothered. The hardest thing in enterprise AI isn't building a system that works. It's building a system that tells you honestly when it doesn't. That's the standard now. Not 99.999%. Not a number on a slide. The standard is: can you show your work, stand behind it, and accept the consequences when you're wrong?

That's what we're building toward. Not perfect AI. Honest AI. And I think that's going to matter a lot more than anyone's hallucination rate.