
A $2.5 Million Fine Exposed What's Really Wrong With AI Lending — And It's Not What You Think
I was sitting in my home office on a Thursday evening in July 2025, scrolling through the Massachusetts Attorney General's press release about Earnest Operations, when I felt something I hadn't expected: relief.
Not because a lender got fined $2.5 million for AI-driven discrimination against Black and Hispanic borrowers. That was infuriating. The relief came from something else — the specificity of the charges. The AG's office didn't just say "your AI is biased." They named the exact variable. They traced the exact mechanism. They showed, in painstaking detail, how a seemingly neutral data point — the Cohort Default Rate of a borrower's college — became a pipeline for racial discrimination baked into code.
For years, my team at Veriprajna and I have been arguing that the way most fintechs deploy AI in lending is architecturally broken. Not just ethically questionable — structurally incapable of fairness. The Earnest settlement was the first major enforcement action that proved us right in the language regulators actually use.
And it won't be the last.
The Variable That Looked Innocent
Here's what Earnest did, and I want you to sit with this because it's more subtle than "the algorithm was racist."
Earnest built an AI-powered student loan refinancing model. One of the inputs was the Cohort Default Rate, or CDR — a metric that tracks how often a school's federal loan borrowers default within the first few years of entering repayment. On paper, this seems reasonable. Schools with high default rates might produce borrowers who struggle to repay. Why wouldn't you factor that in?
Because the CDR doesn't measure individual creditworthiness. It measures institutional outcomes. And those outcomes are shaped by decades of systemic underfunding, intergenerational wealth gaps, and racial segregation in higher education. Historically Black Colleges and Universities carry higher CDRs not because their graduates are less capable, but because the system gave those institutions — and their students — less to work with.
When you penalize an individual for the statistical history of their institution, you're not predicting risk. You're perpetuating it.
The Massachusetts AG alleged that the CDR's predictive power came not from any signal about the borrower, but from its correlation with race and socioeconomic class. A Black graduate of an HBCU with pristine credit, solid income, and zero missed payments would score lower than a white graduate of a well-funded state school — because of where they went to college, not what they did after.
I remember pulling up the settlement documents and reading them aloud to my co-founder over the phone. "They had knockout rules too," I said. "Hard-coded gates that auto-denied anyone without at least a green card." There was a long pause. "So the bias was in the architecture from the start," she said. Yes. From the very first line of the decision tree.
Why Did Nobody Catch This?
This is the part that kept me up that night. Earnest had internal policies. They had model oversight requirements. They had senior review processes for exceptions.
None of it worked.
The investigation revealed that underwriters routinely bypassed the model or applied arbitrary standards without documentation. The human-in-the-loop safeguard — the thing every AI company points to when regulators come knocking — was theater. There was no consistent logging. No independent review. No audit trail that could tell you why a particular override happened.
I've seen this pattern so many times that we gave it a name internally: governance cosplay. The institution has all the right policies on paper. The org chart shows a compliance team. The board deck mentions "responsible AI." But when you open the hood, there's no mechanism connecting the policy to the code. The algorithm runs in one universe; the governance framework exists in another.
The Earnest case made this explicit. Both algorithmic bias and unmonitored human bias coexisted in the same system, making it — as I wrote in our interactive analysis of the case — fundamentally impossible to audit and defend.
What Happens When the Disparity Is 29 Percentage Points?
If Earnest was the scalpel case — precise, variable-level, traceable — then Navy Federal Credit Union is the sledgehammer.
In 2022, Navy Federal, the largest credit union in the United States, approved roughly 77% of white conventional mortgage applicants. For Black applicants? 48.5%. That's a gap of nearly 29 percentage points — the widest of any top 50 mortgage lender in the country.
Navy Federal's defense was predictable: "Public HMDA data doesn't include credit scores or cash-on-hand. You can't draw conclusions without the full picture." It's the same defense every institution reaches for. And it might have worked a decade ago.
It didn't work this time. When independent researchers controlled for more than a dozen variables — income, debt-to-income ratio, property value, neighborhood characteristics — Black applicants were still more than twice as likely to be denied as white applicants with identical profiles.
I remember presenting these numbers at a fintech conference last year. An audience member — a VP of risk at a mid-size lender — raised his hand and said, "But maybe there's something in the data we're not seeing. Some legitimate factor." I asked him: "If your model produces a 29-point racial gap that persists after controlling for every variable you can name, at what point do you stop looking for innocent explanations and start looking at the model?"
He didn't have an answer. Most of the industry doesn't.
In May 2024, a federal judge ruled that disparate impact claims against Navy Federal could proceed to discovery. That means plaintiffs will get to examine the internal logic of the credit union's underwriting algorithm. The era of "our model is proprietary and too complex to explain" is over.
Statistical disparity alone is now enough to survive a motion to dismiss. The burden has shifted: prove your process is fair, or face discovery.
Why Do LLM Wrappers Keep Failing the Fairness Test?
Here's where I need to be blunt about something that a lot of people in AI don't want to hear.
The dominant architecture in fintech AI right now — what I call the "wrapper" model — is structurally incapable of meeting the regulatory standards that already exist, let alone the ones coming in 2026.
A wrapper takes your data, passes it to a third-party large language model like GPT-4 or Gemini, and returns an output. It's fast to build. It demos beautifully. And it is a compliance time bomb.
LLMs predict the next token in a sequence. They don't retrieve facts. They don't perform actuarial calculations. They don't reason about causation. When you ask an LLM to evaluate a loan application, it generates text that sounds like a credit assessment. But it might fabricate a justification for denial that has no basis in the applicant's actual file. The industry calls this hallucination. Regulators call it a violation.
The CFPB has been unambiguous: creditors must provide "accurate and specific reasons" for adverse actions. You cannot tell a denied applicant "the algorithm decided" or cite a vague category like "purchasing history" when the real trigger was a non-traditional data point the model latched onto. "The algorithm decided" is not a legally defensible statement — the Bureau has said so explicitly.
And there's a deeper problem. LLMs are trained on the internet. The internet is saturated with historical biases — racial, gender, socioeconomic. When your wrapper uses an LLM to "evaluate" a borrower's employment history or narrative, the model may apply stereotypes embedded in its training data. Certain nationalities, certain professions, certain zip codes carry invisible weight in the model's latent space. Not because anyone programmed bias in. Because the training data is the bias.
I had an argument about this with an investor early on. He said, "Just use GPT with a good prompt. You're overcomplicating this." I pulled up a demo where we fed the same loan application through a wrapper with two versions — one with a name that coded as white, one with a name that coded as Black. The outputs weren't identical. The tone shifted. The risk language shifted. Not dramatically. Subtly. The kind of subtle that, scaled across millions of decisions, produces a 29-point gap.
He stopped arguing.
What Does "Deep AI" Actually Mean?

I use the term "Deep AI" not as marketing — though I understand the skepticism — but as a technical distinction from what most of the industry is building.
A Deep AI system for lending doesn't call a single model and return an answer. It's a multi-layered architecture where different types of intelligence handle different types of decisions, and every layer is auditable.
Deterministic rule engines handle the things that must be 100% correct — residency requirements, regulatory thresholds, hard compliance checks. These aren't probabilistic. They're logic. They don't hallucinate.
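The determinism described here is easy to make concrete. A minimal sketch of a knockout gate; the rule IDs, thresholds, and applicant fields below are invented for illustration, not any lender's actual policy. The point is that the gate returns the same answer for the same input every time, and every denial names the exact rule that fired.

```python
# Hypothetical knockout rules. Each carries an ID so every denial
# is traceable to a specific, reviewable line of policy.
RULES = [
    ("R-101", "applicant must be of legal age", lambda a: a["age"] >= 18),
    ("R-102", "loan amount within program limit", lambda a: a["amount"] <= 500_000),
    ("R-103", "state must be a licensed market", lambda a: a["state"] in {"MA", "NY", "CA"}),
]

def run_knockouts(applicant):
    """Deterministic gate: pure logic, no probabilities, no hallucination.
    Returns the outcome plus the exact rules that failed, for the audit trail."""
    failed = [(rule_id, desc) for rule_id, desc, check in RULES if not check(applicant)]
    return ("denied", failed) if failed else ("passed", [])

print(run_knockouts({"age": 30, "amount": 600_000, "state": "MA"}))
# → ('denied', [('R-102', 'loan amount within program limit')])
```

Contrast this with Earnest's hard-coded immigration gate: the failure there wasn't that a rule engine existed, but that a discriminatory rule sat inside one with no logging or review around it.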
Gradient-boosted models like XGBoost handle structured credit scoring — the kind of tabular data where interpretability and stability matter more than linguistic fluency. These models are boring. They're also reliable, explainable, and well-understood by regulators.
Fine-tuned LLMs are used — but only for what they're actually good at: extracting entities from unstructured documents, parsing tax returns, reading bank statements. And they're grounded through Retrieval-Augmented Generation, meaning the model can only reference the applicant's actual documents, not its training data's vague associations.
On top of all this sits a continuous monitoring layer that tracks model drift, bias drift, and hallucination rates in real time. When the Disparate Impact Ratio — the ratio of approval rates between protected and control groups — drops below the 0.8 threshold (the four-fifths rule that regulators use as a red flag), the system alerts before a human complaint ever surfaces.
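The four-fifths check that the monitoring layer runs is simple to state in code. A minimal sketch (the function names are mine, not a production API), using Navy Federal's publicly reported 2022 approval rates as the example input:

```python
def disparate_impact_ratio(approvals):
    """approvals: dict mapping group name -> (approved, total).
    Returns each group's DIR relative to the highest-approval group."""
    rates = {g: approved / total for g, (approved, total) in approvals.items()}
    reference = max(rates.values())  # control group = highest approval rate
    return {g: rate / reference for g, rate in rates.items()}

def four_fifths_alerts(approvals, threshold=0.8):
    """Flag any group whose approval rate falls below 80% of the reference."""
    return [g for g, dir_ in disparate_impact_ratio(approvals).items()
            if dir_ < threshold]

# Navy Federal's reported 2022 conventional-mortgage approval rates:
rates = {"white": (771, 1000), "black": (485, 1000)}
print(four_fifths_alerts(rates))  # → ['black']  (DIR ≈ 0.63, far below 0.8)
```

In production this runs continuously over rolling decision windows; the toy version above is the same arithmetic without the streaming machinery.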
This isn't aspirational. We built it because the alternative — the wrapper, the black box, the governance cosplay — keeps producing Earnest settlements and Navy Federal lawsuits.
How Do You Actually Engineer Fairness Into a Model?

People ask me this constantly, and I think they expect the answer to be simple. It's not. But it's also not mysterious.
Fairness engineering means applying mathematical constraints at every stage of the model lifecycle. Before training, you examine your data for representation gaps and use techniques like synthetic oversampling to balance underrepresented demographics. During training, you deploy adversarial debiasing — a technique where a secondary model tries to predict the applicant's race from the primary model's output. If it can, the primary model is leaking protected information, and you retrain until the adversary fails.
After training, you calibrate decision thresholds to ensure equalized odds — meaning the model's true positive and false positive rates are the same across demographic groups. Not equally lenient. Equally accurate. A model that approves everyone isn't fair. A model that's right at the same rate for everyone is.
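The distinction between "equally lenient" and "equally accurate" is easy to make precise. A minimal equalized-odds check, assuming records of the form (group, actually repaid, approved); the data below is synthetic:

```python
def equalized_odds_gap(records):
    """records: iterable of (group, actually_repaid, approved) triples.
    Returns the max gap in true-positive rate and false-positive rate
    across groups; equalized odds holds when both gaps are near zero.
    Assumes every group has both repaid and defaulted applicants."""
    stats = {}
    for group, repaid, approved in records:
        s = stats.setdefault(group, {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
        if repaid:
            s["tp" if approved else "fn"] += 1
        else:
            s["fp" if approved else "tn"] += 1
    tprs = [s["tp"] / (s["tp"] + s["fn"]) for s in stats.values()]
    fprs = [s["fp"] / (s["fp"] + s["tn"]) for s in stats.values()]
    return max(tprs) - min(tprs), max(fprs) - min(fprs)

# Group B's creditworthy applicants are approved far less often: a TPR gap.
records = ([("A", 1, 1)] * 8 + [("A", 1, 0)] * 2 + [("A", 0, 0)] * 5 +
           [("B", 1, 1)] * 5 + [("B", 1, 0)] * 5 + [("B", 0, 0)] * 5)
tpr_gap, fpr_gap = equalized_odds_gap(records)
print(round(tpr_gap, 2), fpr_gap)  # → 0.3 0.0
```

Threshold calibration then means adjusting each group's approval cutoff until both gaps fall inside an agreed tolerance, and documenting the tolerance.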
And then there's explainability. Every adverse action our system generates comes with SHAP values — a mathematically rigorous attribution method that tells you exactly which features drove the decision, and by how much. We generate counterfactual explanations in real time: "If your credit utilization were 15% lower, or your income $5,000 higher, this loan would have been approved." That's not a courtesy. Under current CFPB guidance, it's approaching a requirement.
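For a purely additive scorecard, per-feature attributions are exact (this is the degenerate case SHAP reduces to when features don't interact), and a counterfactual is a one-line solve. The weights, cutoff, and applicant below are invented for illustration, not any real underwriting model:

```python
# Hypothetical additive scorecard. Real systems use SHAP over non-linear
# models; the additive case shows the mechanics exactly.
WEIGHTS = {"income_k": 0.004, "credit_utilization": -0.9, "missed_payments": -0.15}
BASELINE, CUTOFF = 0.5, 0.55

def score(applicant):
    return BASELINE + sum(w * applicant[f] for f, w in WEIGHTS.items())

def contributions(applicant):
    """Exact per-feature attribution: how much each feature moved the
    score from the baseline, and in which direction."""
    return {f: w * applicant[f] for f, w in WEIGHTS.items()}

def counterfactual(applicant, feature):
    """Smallest change to one feature that flips a denial to approval."""
    gap = CUTOFF - score(applicant)
    if gap <= 0:
        return None  # already approved
    return gap / WEIGHTS[feature]

app = {"income_k": 80, "credit_utilization": 0.35, "missed_payments": 0}
# score ≈ 0.505 → denied (cutoff 0.55)
print(f"Approved if income were ${counterfactual(app, 'income_k') * 1000:,.0f} higher")
# → Approved if income were $11,250 higher
```

The counterfactual sentence in the adverse action notice is literally the output of a solve like this, which is why it can be defended line by line.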
Fair AI isn't a model that avoids saying anything offensive. It's a system where every decision can be decomposed, challenged, and defended with math.
For the full technical breakdown of our fairness engineering pipeline and architecture, I've published a detailed research paper that goes deeper than I can here.
The Regulatory Walls Are Closing In
Let me sketch the landscape for anyone who thinks they have time.
The CFPB's 2023 and 2025 guidance on adverse action notices has teeth. SR 11-7 — the Federal Reserve's model risk management standard — now requires documented conceptual soundness, independent validation by teams with no connection to development, and regular outcomes analysis. The NIST AI Risk Management Framework 2.0, released in 2025, introduced the concept of an "AI Bill of Materials" — a complete inventory of every data source, every model (including third-party APIs), and every interaction between components.
This isn't guidance you can ignore. A federal judge just allowed discovery into Navy Federal's algorithm. The Massachusetts AG didn't just fine Earnest — they required the company to overhaul its model governance, implement independent validation, and submit to ongoing monitoring.
The message is clear: if you can't explain your model, you can't defend it. And if you can't defend it, you will pay — in settlements, in litigation costs, in reputational damage, and in the erosion of trust from the communities you claim to serve.
Why "Search for Alternatives" Is the Requirement Nobody's Ready For
There's one regulatory concept that I think will reshape the industry more than any other, and almost nobody is talking about it.
Under current fair lending law, it's not enough to show that your model is accurate. You must actively search for less discriminatory alternatives (LDAs) — models that achieve comparable predictive performance with a smaller disparity gap. If a plaintiff can demonstrate that such an alternative existed and you didn't use it, your model fails the legal test regardless of its accuracy.
Think about what that means operationally. You can't just build one model, test it for bias, and ship it. You need to train multiple configurations — different feature sets, different algorithms, different threshold calibrations — and document why you chose the one you chose. You need evidence that you looked for a fairer option and either found it (and adopted it) or proved that no materially less discriminatory alternative existed.
We spent three months building our LDA search pipeline. Three months where my engineering team kept asking, "Are we overthinking this?" And then the Earnest settlement dropped, and the AG's office specifically cited the company's failure to search for alternatives. We weren't overthinking it. The industry was underthinking it.
The Earnest Lesson Most People Are Missing
I want to close with something that's been nagging at me since July.
Most of the commentary on the Earnest settlement focused on the CDR variable. And yes, that was the headline. But the deeper failure wasn't a bad variable. It was the absence of architecture that would have caught the bad variable before it ever reached production.
Earnest didn't have independent model validation. They didn't have systematic proxy testing. They didn't have auditable human override logging. They didn't have continuous bias monitoring. They had a model, a policy document, and a gap between the two that was wide enough to drive a class action through.
The $2.5 million wasn't the cost of bias. It was the cost of building AI without the infrastructure to know when bias exists.
That's the distinction I keep coming back to. The question isn't "is your AI biased?" — every model trained on historical data carries the fingerprints of historical inequality. The question is: do you have the architecture to detect it, measure it, explain it, and correct it before a regulator does it for you?
Most lenders, if they're honest, would answer no.
We built Veriprajna because we believe the answer has to be yes — not as an aspiration, but as a structural property of the system itself. Fairness isn't a feature you bolt on after launch. It's a load-bearing wall. Remove it, and the whole building comes down.
The first wave of AI in lending was defined by speed and scale. The second wave will be defined by whether your system can survive a subpoena. I know which one I'm building for.


