[Image: a pulse oximeter clipped to a child's dark-skinned finger, displaying a reading.]
Artificial Intelligence · Healthcare · Racial Equity

The Pulse Oximeter on My Daughter's Finger Was Lying — And So Is Your Hospital's AI

Ashutosh Singhal · March 30, 2026 · 15 min read

My daughter had a fever of 103 last spring. We were in the ER, and the nurse clipped a pulse oximeter onto her small brown finger. The screen read 97% oxygen saturation. Normal. The nurse smiled. I did not.

I knew — because I'd spent months buried in the clinical literature for a project at Veriprajna — that the device on her finger was almost certainly overestimating her blood oxygen. Not by a trivial amount. By enough to matter. Research published in the New England Journal of Medicine and the British Medical Journal has shown that Black patients are nearly three times more likely than white patients to experience what clinicians call "occult hypoxemia" — a condition where the device says you're fine while your actual oxygen levels are dangerously low. A 2024 Vanderbilt study found that common pulse oximeters failed to detect low oxygen in 7% of children with the darkest skin tones. They missed zero cases in children with the lightest tones.

I looked at that little glowing number on the screen and thought: this is where it starts. Not with a malicious algorithm. Not with a biased dataset. With a $30 piece of hardware that was calibrated on white skin and has been lying about everyone else for thirty years.

That night changed how I think about everything we build at Veriprajna. It's the reason I wrote our research on algorithmic equity in clinical AI, and it's the reason I'm writing this now.

Why Does Your Pulse Oximeter Work Differently on Dark Skin?

The physics is almost insultingly simple. A pulse oximeter shines red and infrared light through your finger and measures how much gets absorbed. Oxygenated and deoxygenated hemoglobin absorb those two wavelengths differently, and the device uses the ratio of the absorptions to estimate your blood oxygen saturation.

Here's the problem: melanin also absorbs light across those same wavelengths. If you calibrate the device primarily on lighter-skinned people — which manufacturers did, and until recently the FDA only required testing on ten subjects total — then the extra absorption from melanin in darker skin gets misread. The device interprets it as more oxygenated hemoglobin than actually exists. Your number looks higher than reality.

This isn't a subtle academic finding. The false negative rate for detecting low oxygen ranges from 1.2% to 26.9% in lighter skin tones. In darker skin tones, it jumps to 7.6% to 62.2%. That's not a rounding error. That's a different medical reality.
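To make that gap concrete, here is a toy simulation. The bias and noise numbers are assumptions I picked for illustration, not figures from the studies above; the point is simply that a systematic upward shift of a few percentage points turns a rare miss into a routine one.

```python
# Illustrative only: a toy simulation of occult hypoxemia. Reported SpO2 is
# modeled as the true arterial saturation plus a systematic bias plus noise.
# The bias and noise values are invented for demonstration, not study data.
import numpy as np

rng = np.random.default_rng(0)
true_sao2 = rng.uniform(85, 97, size=100_000)   # simulated arterial saturation (%)

def occult_hypoxemia_rate(bias: float, noise_sd: float = 2.0) -> float:
    """Fraction of truly hypoxemic patients (SaO2 < 88%) whose reading looks safe (>= 92%)."""
    reported = true_sao2 + bias + rng.normal(0, noise_sd, size=true_sao2.size)
    hypoxemic = true_sao2 < 88
    looks_safe = reported >= 92
    return (hypoxemic & looks_safe).mean() / hypoxemic.mean()

print(f"no systematic bias:    {occult_hypoxemia_rate(0.0):.1%} of hypoxemia missed")
print(f"+3-point upward bias:  {occult_hypoxemia_rate(3.0):.1%} of hypoxemia missed")
```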

When the device on your finger says 97%, and your actual arterial oxygen is 88%, you don't get supplemental oxygen. You don't get escalated care. You get sent home.

I remember sitting in a team meeting after I'd compiled this data, and one of our engineers — a brilliant guy, someone I trust completely — said, "But surely the AI models downstream correct for this?" And I realized that was exactly the assumption that was killing people. The AI doesn't correct for it. The AI amplifies it.

The Cascade Nobody Talks About

[Diagram: a single biased pulse oximeter reading flowing through the hospital data pipeline, from sensor to EHR to AI alert system, with bias at the input stage causing the AI to silently fail to escalate care.]

Here's what happens in a modern hospital. A patient arrives. Vital signs get recorded — including that pulse oximeter reading. Those vitals flow into the Electronic Health Record. And increasingly, an AI system watches that data stream, looking for patterns that suggest deterioration: sepsis, respiratory failure, cardiac events.

If the AI's threshold for a "high-priority" alert is an SpO₂ below 92%, and a Black patient's oximeter reads 93% when their true arterial oxygen is 88%, the alert never fires. The patient doesn't get flagged. The clinician, who is managing fifteen other patients and has learned to trust the system, doesn't intervene.
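The failure mode is mundane enough to fit in a few lines. This is a minimal sketch of that threshold logic; the threshold and the readings are illustrative, not any vendor's actual rule set.

```python
# Minimal sketch of the alert logic described above (hypothetical threshold,
# not any vendor's actual rule set).
SPO2_ALERT_THRESHOLD = 92  # percent

def high_priority_alert(spo2_reading: float) -> bool:
    """Fire a deterioration alert if the recorded SpO2 falls below the threshold."""
    return spo2_reading < SPO2_ALERT_THRESHOLD

reported_spo2 = 93       # what the biased oximeter writes into the EHR
true_arterial_o2 = 88    # what an arterial blood gas would show

print(high_priority_alert(reported_spo2))     # False -> no alert, no escalation
print(high_priority_alert(true_arterial_o2))  # True  -> the alert that should have fired
```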

This isn't a hypothetical. This is the architecture of hundreds of hospitals right now.

I spent a long evening working through the implications of this with my co-founder. We kept coming back to the same uncomfortable realization: the bias isn't in the algorithm. It's in the input. And if you build the most sophisticated, fairness-aware, perfectly calibrated AI model in the world, and you feed it data from a racist thermometer, you get a racist AI with excellent credentials.

What Happens When the Most-Used Sepsis AI Misses 67% of Cases?

If the pulse oximeter story is about hardware bias flowing into software, the Epic Sepsis Model story is about what happens when the software itself was never built to work for everyone.

The Epic Sepsis Model, or ESM, is integrated into the EHR systems of hundreds of American hospitals. It was marketed as a breakthrough — an AI that could identify sepsis before clinicians recognized it, saving lives through early intervention. The developer reported an Area Under the Curve (AUC, a standard performance metric) of 0.76 to 0.83. Respectable numbers.

Then researchers at Michigan Medicine ran an independent external validation. The AUC dropped to 0.63. Sensitivity — the model's ability to actually catch sepsis cases — was 33%. It missed two out of every three cases. The positive predictive value was 12%, meaning 88% of its alerts were false alarms. And it only flagged patients before clinicians did in 6% of cases.
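To translate those metrics to the bedside, here is the back-of-the-envelope arithmetic on an illustrative cohort. The cohort size and sepsis prevalence are my assumptions; only the sensitivity and positive predictive value come from the published external validation.

```python
# Back-of-the-envelope arithmetic on an illustrative cohort. Cohort size and
# sepsis prevalence are assumed; sensitivity and PPV are the published
# external-validation figures.
patients = 10_000
sepsis_prevalence = 0.03   # assumed: 3% of admissions develop sepsis
sensitivity = 0.33         # external validation: fraction of cases the model catches
ppv = 0.12                 # external validation: fraction of alerts that are real

sepsis_cases = patients * sepsis_prevalence      # 300 true cases
caught       = sepsis_cases * sensitivity        # ~99 flagged
missed       = sepsis_cases - caught             # ~201 never flagged
total_alerts = caught / ppv                      # ~825 alerts fired
false_alarms = total_alerts - caught             # ~726 false alarms

print(f"missed sepsis cases: {missed:.0f} of {sepsis_cases:.0f}")
print(f"false alarms per true catch: {false_alarms / caught:.1f}")
```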

I want to sit with that for a moment. A system deployed in hundreds of hospitals, integrated into the workflow that doctors rely on every day, was wrong almost nine times out of ten when it raised an alarm, and it missed the real cases two-thirds of the time.

A sepsis model with 33% sensitivity isn't a safety net. It's a false sense of security with a hospital-wide subscription fee.

But the performance numbers, as bad as they are, aren't the worst part. The worst part is who it fails.

Why Does AI Sepsis Detection Fail Black Patients Specifically?

Black and Hispanic patients experience nearly double the incidence of sepsis compared to white patients, and they often present at younger ages. You'd think that would make them the highest-priority population for an AI detection system. Instead, studies have found that models like the ESM exhibit poor calibration across these groups.

The reason is something called label bias, and once you understand it, you can't unsee it.

Most sepsis models are trained on labels derived from clinical definitions or billing codes. Those labels are generated by human clinicians making human decisions. If doctors are historically slower to order blood cultures for Black patients — whether from implicit bias, communication barriers, or systemic factors — then the training data reflects that delay. The AI learns that "sepsis" looks like the data signatures of white patients, because those are the patients who got diagnosed promptly. It becomes, in effect, blind to the presentation of sepsis in Black patients.
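Label bias is easy to reproduce on synthetic data. The toy simulation below is not a clinical dataset and every number in it is invented, but it shows the mechanism: two groups with identical physiology, one of which has its sepsis cases under-coded in the training labels, and a model that consequently learns a lower sensitivity for exactly that group.

```python
# Toy simulation of label bias (synthetic data, invented numbers).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 20_000
group = rng.integers(0, 2, n)                    # 0 = majority, 1 = minority
sepsis = rng.random(n) < 0.10                    # same true incidence in both groups
signal = rng.normal(sepsis.astype(float), 1.0)   # same physiological signal in both groups

# Label bias: the minority group's sepsis cases are coded only 60% of the time.
coded = sepsis & ((group == 0) | (rng.random(n) < 0.6))

# "group" stands in for race, or for any proxy of it present in EHR data.
X = np.column_stack([signal, group])
model = LogisticRegression().fit(X, coded)       # trained on the biased labels
alert = model.predict_proba(X)[:, 1] > 0.2       # illustrative alert threshold

for g in (0, 1):
    true_cases = (group == g) & sepsis           # evaluate against true sepsis
    print(f"group {g} sensitivity: {alert[true_cases].mean():.0%}")
```

In runs like this, the model's sensitivity for the majority group comes out substantially higher than for the minority group, even though the underlying physiology was generated identically. The entire gap comes from the labels.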

And then the lethal feedback loop closes: the AI misses the patient because the historical data was biased. The clinician misses the patient because they trusted an AI that didn't fire an alert.

I had an argument with a potential investor about this. He said, "Can't you just retrain the model on better data?" As if "better data" were sitting in a warehouse somewhere, waiting to be plugged in. The data is the history. The history is the bias. You can't fix a biased dataset by adding more of the same biased data. You have to change the architecture.

50.3 Deaths Per 100,000: The Number That Should Haunt Healthcare AI

[Infographic consolidating the article's key racial disparity statistics: oximeter error rates, maternal mortality, and AI system miss rates.]

Everything I've described so far — the oximeter lies, the sepsis model failures, the label bias — converges most devastatingly in maternal health.

The CDC reports that Black women face a pregnancy-related mortality rate of 50.3 per 100,000 live births. White women: 14.5. That's not a gap. That's a chasm — 3.5 times the rate. And it persists even when you control for education and income. A Black woman with a college degree is more likely to die in childbirth than a white woman without a high school diploma.

California's Maternal Data Center, one of the most data-rich maternal health environments in the country, found that automated early warning systems missed 40% of severe morbidity cases in Black patients. Forty percent. These are life-threatening complications — hemorrhage, preeclampsia, sepsis — that occur 100 times more frequently than maternal death. The AI was supposed to catch them. It didn't.

Part of the reason involves what researchers call the "weathering" effect — the physiological toll of chronic stress caused by systemic racism. Black women often present with higher baseline blood pressures and altered cardiovascular responses. An AI trained on population averages may interpret these as "normal for this patient" rather than recognizing them as warning signs in a body under chronic duress.

When an AI early warning system misses 40% of severe complications in Black mothers, it's not a technical glitch. It's a system performing exactly as its training data taught it to — which is to say, inequitably.

And here's the number that should make every healthcare executive pay attention: McKinsey estimates that closing the Black maternal health gap could add $24.4 billion to US GDP and save $385 million in annual preventable healthcare costs. This isn't just a moral crisis. It's an economic one.

Black women are 1.79 times more likely to die once a severe complication has occurred compared to white women. That's not about incidence — it's about "failure to rescue." The complication happens, the window for intervention opens, and the system fails to act in time. When the AI doesn't alert, and the clinician is managing a dozen other patients, that window closes.

Why Can't ChatGPT Fix This?

I get this question constantly. Some version of: "Why not just use GPT-4 with medical prompts? It knows a lot about medicine."

It does know a lot about medicine in the same way that someone who's read every textbook but never touched a patient knows a lot about medicine. An LLM is a statistical engine trained on language probabilities. It doesn't understand pathophysiology. It doesn't process real-time waveform data from a bedside monitor. It can't tell you whether a particular SpO₂ reading is trustworthy given the patient's skin tone and the specific device model being used.

Studies found that LLMs achieved only 16.7% accuracy in dose adjustments for renal dysfunction when patient-specific variables were complex. They hallucinate — confidently generating clinical information that sounds authoritative and is completely fabricated. They can't provide the transparent reasoning chain that a clinician needs to verify a recommendation, which is increasingly a regulatory requirement under GDPR and evolving US health regulations.

The healthcare AI market is flooded with what I call "wrapper" applications — thin interfaces over generalized public APIs. They're fine for drafting discharge summaries or summarizing chart notes. They are fundamentally inadequate for deciding whether a 32-year-old Black woman presenting with borderline vitals needs immediate intervention or can wait.

The distinction matters. A wrapper takes a general-purpose language model and points it at a medical question. A deep AI system — what we build at Veriprajna — integrates real-time physiological signals, expert-labeled datasets, and fairness-aware mathematical constraints into the model's architecture from the ground up.

One of these approaches can write a convincing paragraph about sepsis. The other can actually detect it equitably.

How Do You Actually Build Clinical AI That Doesn't Discriminate?

This is where I have to get a little technical, because the solution isn't philosophical — it's mathematical. And the math is what separates deep AI from well-intentioned vaporware.

Traditional machine learning optimization minimizes the average error across the entire dataset. That sounds reasonable until you realize that "average" naturally favors the majority group. If 70% of your training data comes from white patients, the model will optimize for white patients. The error rates for everyone else are just... acceptable losses in the average.

We don't accept that. At Veriprajna, we implement what's called worst-group loss optimization. Instead of minimizing the average error, we minimize the maximum error across all demographic subgroups. Mathematically, we're solving for: minimize the worst-case loss across Black, white, Hispanic, and other populations simultaneously. Research in automated depression detection has shown that while this approach may slightly lower overall accuracy, it significantly improves outcomes for underrepresented groups who are otherwise systematically misclassified.
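In notation, instead of minimizing the average loss over the whole dataset, we minimize over model parameters the maximum over demographic groups of that group's expected loss. A minimal sketch of that objective, not our production training code, looks like this:

```python
# Minimal sketch of worst-group loss (the group-DRO idea). Each training step
# backpropagates the loss of whichever demographic group is currently worst off,
# instead of the batch average.
import torch
import torch.nn.functional as F

def worst_group_loss(logits: torch.Tensor,
                     labels: torch.Tensor,
                     groups: torch.Tensor) -> torch.Tensor:
    """Return the maximum over groups of that group's mean cross-entropy loss."""
    per_example = F.binary_cross_entropy_with_logits(
        logits, labels.float(), reduction="none")
    group_losses = [per_example[groups == g].mean()
                    for g in torch.unique(groups)]
    return torch.stack(group_losses).max()

# Inside an otherwise standard training loop:
#   loss = worst_group_loss(model(x).squeeze(-1), y, group_ids)
#   loss.backward(); optimizer.step()
```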

We also enforce equalized odds — requiring that both the true positive rate and the false positive rate are equal across demographic groups. If a sepsis model has 80% sensitivity for white patients but only 40% for Black patients, it is providing a different tier of care based on race. Full stop. That's not a model performance issue. That's a civil rights issue.
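Auditing for equalized odds is the easy part; engineering toward it is the work. A simple audit, with function and variable names of my own choosing, looks like this:

```python
# Simple equalized-odds audit: the true positive rate and false positive rate
# should match across groups. Names are illustrative.
import numpy as np

def equalized_odds_gaps(y_true, y_pred, groups):
    """Return the largest between-group gap in TPR and in FPR."""
    tprs, fprs = [], []
    for g in np.unique(groups):
        m = groups == g
        tprs.append(y_pred[m & (y_true == 1)].mean())   # sensitivity in group g
        fprs.append(y_pred[m & (y_true == 0)].mean())   # false-alarm rate in group g
    return max(tprs) - min(tprs), max(fprs) - min(fprs)

# A model with 80% sensitivity for white patients and 40% for Black patients
# shows a TPR gap of 0.40 here. We treat that as a deployment blocker.
```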

For the full mathematical framework — including fairness-aware loss functions, adversarial debiasing, and our approach to multimodal signal fusion — I've laid out the technical details in our research paper.

But the math is only one layer. Here's what the full architecture looks like in practice:

You have to fix the inputs. We don't treat a pulse oximeter reading as ground truth. Our models fuse oximetry with heart rate variability, respiratory rate, and lactate trends. If a patient's heart rate and lactate are climbing while SpO₂ remains suspiciously stable, the system flags a signal discrepancy and prompts the clinician to order an arterial blood gas — the gold standard. We're triangulating the patient's true state rather than trusting a single biased sensor.
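A simplified sketch of that discrepancy check, with illustrative thresholds rather than our deployed logic:

```python
# Simplified sketch of the "signal discrepancy" check described above.
# Thresholds and field names are illustrative.
from dataclasses import dataclass

@dataclass
class VitalsTrend:
    spo2: float              # latest reported SpO2 (%)
    heart_rate_slope: float  # change in beats/min per hour
    lactate_slope: float     # change in mmol/L per hour

def flag_signal_discrepancy(v: VitalsTrend) -> bool:
    """Prompt for an arterial blood gas when other signals worsen while SpO2 looks fine."""
    deteriorating = v.heart_rate_slope > 5 or v.lactate_slope > 0.3
    spo2_looks_fine = v.spo2 >= 92
    return deteriorating and spo2_looks_fine

print(flag_signal_discrepancy(VitalsTrend(spo2=95, heart_rate_slope=8, lactate_slope=0.5)))
# True -> suggest confirming with an arterial blood gas instead of trusting the oximeter
```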

You have to fix the labels. We use expert-adjudicated ground truth rather than billing codes. When three sepsis experts independently review a case and agree on the diagnosis timeline, that's a fundamentally different training signal than a billing code that was generated six hours after the patient was already in the ICU.

You have to validate locally. Every deployment starts with a retrospective audit of the institution's own data. We measure something called the Population Stability Index to quantify how different the local patient population is from our training cohort. If the gap is too large, we recalibrate before going live. The Epic Sepsis Model's catastrophic performance drop — from 0.83 AUC internally to 0.63 externally — is what happens when you skip this step.
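The PSI itself is a standard drift metric. Here is a minimal implementation for a single continuous input feature, using the conventional ten-bin form; the 0.2 rule of thumb is a common industry convention, not a statement of our exact recalibration policy.

```python
# Population Stability Index for one continuous feature: compare the local
# population's distribution ("actual") against the training cohort ("expected").
import numpy as np

def population_stability_index(expected: np.ndarray,
                               actual: np.ndarray,
                               bins: int = 10) -> float:
    """PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%)."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf           # catch out-of-range values
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)              # avoid division by or log of zero
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Common rule of thumb: < 0.1 stable, 0.1 to 0.2 moderate shift, > 0.2 recalibrate first.
```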

"But Won't This Slow Down AI Adoption?"

People ask me this, and I understand the impulse behind it. There's a real urgency to get AI into clinical workflows. People are dying while we debate fairness metrics.

But here's what I've learned: deploying a biased AI system fast doesn't save more lives. It saves some lives — disproportionately white, disproportionately wealthy — while creating a false sense of security that actively harms everyone else. The Epic Sepsis Model was deployed fast. It was deployed widely. And it missed two-thirds of sepsis cases while generating false alarms 88% of the time. Speed without equity isn't progress. It's negligence at scale.

The other objection I hear: "Fairness constraints reduce accuracy." This is technically true in the narrowest sense — optimizing for worst-group performance may slightly lower the aggregate metric. But "aggregate accuracy" is the same statistical sleight of hand that let the pulse oximeter crisis persist for thirty years. When your 95% accuracy means 95% for white patients and 62% for Black patients, the aggregate number is a lie.

Optimizing for average accuracy in healthcare AI is like reporting the average temperature in a hospital — it tells you nothing about the patient who's on fire.

What I Think About at 2 AM

I think about the fact that one in three Black women reports being mistreated during maternity care. I think about the 40% of severe morbidity cases that California's AI systems missed in Black patients. I think about my daughter's finger in that pulse oximeter clip, and the nurse's smile, and the number on the screen that I knew was probably wrong.

And I think about the fact that we have the mathematical tools to fix this. Fairness-aware loss functions exist. Multimodal signal fusion exists. Local validation frameworks exist. Worst-group optimization exists. None of this is theoretical. We've built it. Other teams are building it. The knowledge is here.

What's missing is the will. Too many health systems are buying wrapper solutions because they're cheap and fast. Too many AI vendors are reporting aggregate accuracy because subgroup breakdowns would be embarrassing. Too many regulators are testing devices on ten subjects and calling it sufficient.

The path forward isn't complicated. Demand subgroup performance metrics from every AI vendor — sensitivity, specificity, and positive predictive value broken down by race, age, and sex. Reject "99% accuracy" claims that don't show you the denominator. Require independent external validation, not vendor whitepapers. And stop treating fairness as a feature request. It's a design requirement.

Black mothers are dying at 3.5 times the rate of white mothers. AI systems built on biased hardware and biased labels are making it worse. And every day we deploy another wrapper solution without asking who it works for and who it doesn't, we're choosing convenience over lives.

I didn't start Veriprajna to build another chatbot with a medical vocabulary. I started it because I believe deep AI — the kind that interrogates its own inputs, enforces equity mathematically, and validates locally before it touches a single patient — is the only technology that deserves to be in the room when someone's life is on the line.

The question isn't whether AI belongs in healthcare. It does. The question is whether we have the integrity to build it right.
