The Problem
Navy Federal Credit Union rejected more than half of its Black mortgage applicants in 2022. It approved 77% of white applicants. That 29-percentage-point gap was the widest of any top-50 mortgage lender in the country. Meanwhile, Earnest Operations LLC paid a $2.5 million settlement in July 2025 after Massachusetts found its AI lending models disproportionately harmed Black, Hispanic, and non-citizen borrowers.
These are not isolated incidents. They are warning signs of a structural problem across the industry. If your institution uses automated decision-making for credit — and nearly every lender does — you face the same exposure.
The core issue at Earnest was a variable called the Cohort Default Rate (CDR). This metric tracks average loan defaults at specific schools. It sounds neutral. But Historically Black Colleges and Universities carry higher default rates due to decades of systemic underfunding and wealth gaps. By weighting CDR in its model, Earnest penalized individual applicants for the historical disadvantages of their institution — regardless of their personal financial health. The algorithm looked race-neutral on paper. In practice, it encoded discrimination.
At Navy Federal, the problem was even harder to pin down. When researchers controlled for more than a dozen factors — income, debt-to-income ratio, property value, neighborhood — Black applicants were still more than twice as likely to be denied as white applicants with identical profiles. Something inside the credit union's underwriting algorithm was producing bias that traditional credit factors could not explain.
Why This Matters to Your Business
The financial consequences are already here. Earnest paid $2.5 million to settle one state investigation. Navy Federal faces consolidated class-action lawsuits and Congressional inquiries. In May 2024, a federal judge ruled that the disparate impact claims against Navy Federal could move forward, opening the door to discovery of the credit union's internal model logic.
Here is what this means for your institution:
- Regulatory enforcement is getting specific. The CFPB now requires "accurate and specific reasons" for adverse actions. You cannot tell an applicant they were denied for "insufficient income" if the real reason was an algorithmic flag on a non-traditional data point. "The algorithm decided" is not a legal defense.
- Statistical disparity alone can survive a motion to dismiss. Courts are shifting the burden to you. Your institution must prove its underwriting process is both necessary and the least discriminatory option available.
- The gap between policy and practice is a liability. Earnest had internal policies requiring senior oversight for model exceptions. Investigators found underwriters frequently bypassed models and applied arbitrary standards without documentation. That disconnect made the system impossible to audit and impossible to defend.
- National benchmarks expose outliers. Nationally, the CFPB reports average denial rates of 5.6% for white applicants and 16.2% for Black applicants — a 10.6-point gap. Navy Federal's gap was nearly three times that. If your numbers exceed the national averages, regulators will notice.
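The benchmark comparison above is simple arithmetic you can run on your own HMDA numbers. A minimal sketch, using the rates cited in this section (CFPB national averages of 5.6% and 16.2%; Navy Federal's reported 2022 approval rates of 77% for white and roughly 48% for Black applicants) — the function name is illustrative:

```python
# Compare an institution's denial-rate gap to the national benchmark.
# Rates below are the figures cited in the text, expressed as fractions.

def denial_gap(denial_rate_a: float, denial_rate_b: float) -> float:
    """Percentage-point gap between two groups' denial rates."""
    return abs(denial_rate_a - denial_rate_b)

# CFPB national averages: 16.2% Black vs. 5.6% white denial rate.
national_gap = denial_gap(0.162, 0.056)

# Navy Federal 2022: 77% white approval vs. ~48% Black approval,
# so denial rates of 23% and 52% respectively.
institution_gap = denial_gap(0.52, 0.23)

print(f"National gap: {national_gap * 100:.1f} points")
print(f"Institution gap: {institution_gap * 100:.1f} points")
print(f"Multiple of national benchmark: {institution_gap / national_gap:.1f}x")
```

If the last line prints a multiple well above 1.0, your institution is the outlier regulators go looking for.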
Your board, your general counsel, and your regulators will all ask the same question: can you explain how your AI makes lending decisions? If you cannot answer that clearly, you carry the same risk Earnest and Navy Federal now face.
What's Actually Happening Under the Hood
Most AI lending tools today are what the industry calls "wrappers." Think of it like putting a new label on a generic product. These systems take your data, pass it to a general-purpose large language model (LLM) like GPT-4 or Gemini, and return a result. They look sophisticated. Under the hood, they lack the domain-specific rules, transparency layers, and causal reasoning that financial regulators demand.
Here is the fundamental mismatch: financial services requires deterministic logic — the same inputs should always produce the same output. But LLMs are probabilistic. They predict the next word in a sequence. They do not retrieve facts or perform actuarial calculations. In a credit underwriting context, an LLM can generate a "hallucination" — a fabricated justification for a loan denial that sounds reasonable but has no basis in the applicant's actual financial file.
This is not a theoretical risk. When Air Canada's chatbot made up a bereavement fare policy that did not exist, the airline was held liable. The same principle applies to your lending decisions.
Generic AI platforms also lack the vertical context to accurately interpret mortgage documents, tax returns, and bank statements. Without industry-specific training, these models misread income patterns and cash flow, leading to false rejections of creditworthy borrowers. And because LLMs train on vast amounts of internet text saturated with historical biases, they can associate certain nationalities or professions with lower creditworthiness — even when an individual applicant's data is clean.
The worst part: none of this shows up in a surface-level audit. The bias lives in what researchers call the model's "latent space" — the hidden layer where correlations form. Without the right tools, you will not see it until a regulator or plaintiff's attorney does.
What Works (And What Doesn't)
Let's start with what fails under scrutiny:
- "We removed race from the model." Removing the race field does not remove racial bias. Proxy variables like school default rates, zip codes, and shopping patterns carry the same signal. Earnest never asked applicants their race, and it still discriminated.
- "We use a reputable third-party AI vendor." Outsourcing your AI does not outsource your liability. The CFPB holds the lender responsible, not the vendor. If your vendor's model cannot explain its decisions, the compliance gap is yours.
- "Our team reviews edge cases manually." Unstructured human overrides actually make things worse. Earnest's underwriters bypassed the model without documentation, creating a hybrid system where both algorithmic bias and human bias coexisted — and neither was auditable.
What does work is a layered architecture that separates different types of decisions based on what each requires:
Input: Data validation before the AI ever sees it. Every data point passes through checks for accuracy, completeness, consistency, timeliness, relevance, and representativeness. This prevents contaminated data from producing biased results. Your system should know its data lineage — where every input came from and how it was processed.
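A minimal sketch of what that input layer looks like in practice, assuming a simple dict-based application record — the field names, thresholds, and allowed sources are illustrative, not a real schema:

```python
# Pre-model input validation: completeness, plausibility, timeliness, lineage.
from datetime import date

def validate_application(app: dict) -> list[str]:
    """Return a list of validation failures; an empty list means the record passes."""
    errors = []
    # Completeness: required fields must be present and non-null.
    for field in ("income", "debt", "statement_date", "source"):
        if app.get(field) is None:
            errors.append(f"missing:{field}")
    # Plausibility: values must fall in sane ranges.
    if app.get("income") is not None and app["income"] < 0:
        errors.append("invalid:income_negative")
    # Timeliness: bank statements older than 90 days are stale.
    sd = app.get("statement_date")
    if sd is not None and (date.today() - sd).days > 90:
        errors.append("stale:statement_date")
    # Lineage: every record must declare where its data came from.
    if app.get("source") not in ("bureau", "bank_api", "uploaded_doc"):
        errors.append("unknown:data_source")
    return errors

record = {"income": 85000, "debt": 12000,
          "statement_date": date.today(), "source": "bank_api"}
print(validate_application(record))
```

Only records that return an empty error list proceed to scoring; everything else is quarantined with a documented reason, which is what makes the pipeline auditable.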
Processing: The right model for the right job. Hard compliance checks — like residency requirements — run through deterministic rule engines that produce the same answer every time. Structured credit scoring uses interpretable models like gradient-boosted decision trees. LLMs handle only unstructured tasks like document analysis, and they are grounded in the applicant's actual files through a technique called Retrieval-Augmented Generation (RAG) — where you feed the AI the real source documents instead of letting it guess.
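The routing idea can be sketched in a few lines: hard compliance rules run as deterministic checks before any statistical model sees the application. The rule names and thresholds below are illustrative, not a real underwriting policy:

```python
# Deterministic compliance layer: same inputs always produce the same result.

COMPLIANCE_RULES = [
    ("residency_documented", lambda app: app["residency_verified"]),
    ("minimum_age", lambda app: app["age"] >= 18),
]

def run_compliance_layer(app: dict) -> dict:
    """Evaluate hard rules; no sampling or probability involved."""
    failures = [name for name, rule in COMPLIANCE_RULES if not rule(app)]
    if failures:
        return {"decision": "deny", "stage": "rules", "reasons": failures}
    # Applications that pass every hard rule proceed to the interpretable
    # scoring model; only unstructured documents would be routed onward
    # to a RAG-grounded LLM.
    return {"decision": "proceed_to_scoring", "stage": "rules", "reasons": []}

print(run_compliance_layer({"residency_verified": True, "age": 34}))
```

The design choice matters legally: because the rule layer is deterministic, every denial it produces comes with a named rule as the reason — no model interpretation required.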
Output: Every decision gets an explanation. Explainability tools like SHAP (SHapley Additive exPlanations) — a method based on game theory that assigns a specific contribution score to each input factor — generate audit-ready reasons for every adverse action. Better yet, the system produces counterfactual explanations: "If your credit usage were 15% lower, or your income $5,000 higher, this loan would have been approved." That is the level of specificity the CFPB now demands.
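To make the counterfactual idea concrete, here is a toy sketch against a simple linear score — the weights, threshold, and feature names are assumptions for demonstration only, and a production system would run SHAP or a counterfactual search against the actual model:

```python
# Counterfactual explanation on a toy linear approval score.

WEIGHTS = {"income": 0.00001, "credit_utilization": -1.5}
THRESHOLD = 0.5  # score must meet or exceed this for approval

def score(app: dict) -> float:
    return sum(WEIGHTS[k] * app[k] for k in WEIGHTS)

def counterfactual_income(app: dict) -> float:
    """Smallest income increase that flips a denial to an approval."""
    shortfall = THRESHOLD - score(app)
    return max(0.0, shortfall / WEIGHTS["income"])

applicant = {"income": 60000, "credit_utilization": 0.45}
delta = counterfactual_income(applicant)
print(f"Approved if income were ${delta:,.0f} higher")
```

The output is exactly the shape of explanation the CFPB's adverse-action guidance points toward: a specific, verifiable change, not a vague reason code.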
The audit trail advantage is what ties it all together. Every human override gets logged with a mandatory justification. Continuous monitoring tracks both model drift — when incoming data stops matching the training set — and bias drift, triggering real-time alerts when your disparate impact ratio drops below acceptable thresholds. Your compliance team gets a defensible record, not a black box. And when regulators come asking, you can show them exactly how every decision was made, checked, and documented.
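The bias-drift alert in that monitoring loop reduces to a small calculation. A sketch of the disparate impact ratio with the commonly used four-fifths (0.8) alert threshold — the threshold and the toy counts are assumptions your compliance team would set from policy:

```python
# Disparate impact ratio monitoring with a four-fifths-rule alert.

def disparate_impact_ratio(approvals_a, total_a, approvals_b, total_b):
    """Ratio of group A's approval rate to group B's (the reference group)."""
    return (approvals_a / total_a) / (approvals_b / total_b)

def bias_alert(ratio: float, threshold: float = 0.8) -> bool:
    """True when the ratio falls below the configured threshold."""
    return ratio < threshold

# Toy monthly counts: 48% approval for group A vs. 77% for group B.
ratio = disparate_impact_ratio(approvals_a=240, total_a=500,
                               approvals_b=385, total_b=500)
print(f"Disparate impact ratio: {ratio:.2f}, alert: {bias_alert(ratio)}")
```

Run on each decision batch, a ratio below threshold fires the real-time alert described above — before a regulator computes the same number from your HMDA filings.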
The NIST AI Risk Management Framework 2.0 now calls for an "AI Bill of Materials" — a full inventory of your data sources, models, and third-party components. If you cannot produce one today, your institution has a governance gap that regulators are specifically looking for.
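As a starting point, an AI Bill of Materials can be as simple as a structured inventory your governance team maintains and exports. The field names below are a hypothetical shape for illustration, not the NIST schema:

```python
# Hypothetical AI Bill of Materials entry: models, data, third parties.
import json

aibom = {
    "models": [
        {"name": "credit_score_gbdt", "type": "gradient_boosted_trees",
         "version": "2.3.1", "owner": "risk-modeling"},
    ],
    "data_sources": [
        {"name": "bureau_tradelines", "vendor": "credit_bureau",
         "refresh": "daily", "lineage_documented": True},
    ],
    "third_party_components": [
        {"name": "document_ocr_api", "vendor": "external",
         "assessed_for_bias": True},
    ],
}
print(json.dumps(aibom, indent=2))
```

If assembling even this much takes your team weeks, that is the governance gap the framework is designed to surface.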
Key Takeaways
- Navy Federal's 29-point gap in mortgage approval rates between white and Black applicants was the widest of any top-50 lender — and researchers could not explain it away with income, debt, or property value.
- Earnest paid $2.5 million for using a school default rate variable that looked neutral but functioned as a racial proxy in its AI lending model.
- The CFPB now requires specific, accurate reasons for every credit denial — "the algorithm decided" is not a legally defensible answer.
- Generic AI wrappers built on large language models are probabilistic by design and cannot provide the deterministic, explainable decisions that financial regulators require.
- A layered architecture — deterministic rules for compliance, interpretable models for scoring, grounded LLMs for documents — creates the audit trail your institution needs to defend its decisions.
The Bottom Line
Regulators are no longer asking whether you use AI. They are asking whether you can explain every decision your AI makes, prove it is the least discriminatory option available, and document every human override. The cost of not having answers is measured in settlements, class-action discovery, and Congressional hearings. Ask your AI vendor: when your model denies a loan applicant, can it produce a specific counterfactual explanation — exactly what would need to change for approval — and a full audit trail showing why every input variable was used?