A conceptual editorial image showing an algorithmic score standing between a person and a home, representing AI-mediated housing decisions.
Artificial Intelligence · Fair Housing · Machine Learning

The Algorithm That Denied Housing to Black Women — and What It Taught Me About Building AI That Can't Hide

Ashutosh Singhal · March 24, 2026 · 16 min read

I was sitting in my home office on a Tuesday evening, scrolling through the final settlement documents in Louis et al. v. SafeRent Solutions, LLC, when a single detail stopped me cold.

Mary Louis and Monica Douglas — two Black women holding federally funded housing vouchers — had been denied apartments. Not by a landlord who looked them in the eye and said no. By a score. A number between 200 and 800, generated by an algorithm called "Registry ScorePLUS," that decided they were too risky to house. The algorithm didn't know they were Black. It didn't need to. It just knew their credit histories looked like the credit histories of people who had been systematically excluded from financial systems for generations — and it called that "risk."

The settlement was $2.275 million. The injunction lasts five years. And the ruling contained a line that I read three times because I couldn't believe a federal court had actually said it: if a landlord relies primarily on a third-party AI score to make housing decisions, the company that built the score shares liability under the Fair Housing Act.

I closed my laptop and sat there in the dark for a while. Because that ruling didn't just change the tenant screening industry. It changed the entire calculus of what it means to build AI for regulated markets. And it validated something my team at Veriprajna had been arguing — sometimes to skeptical investors, sometimes to our own exhaustion — for years: that the way most companies deploy AI in high-stakes decisions is not just ethically questionable. It's architecturally broken.

What Actually Went Wrong Inside SafeRent's Algorithm?

The technical failure is deceptively simple to describe and maddeningly difficult to fix without rethinking your entire approach to model design.

SafeRent's scoring system leaned heavily on traditional credit history and non-tenancy debt — things like medical bills, old credit card balances, the kind of financial scar tissue that accumulates when you've spent years navigating poverty. What it did not account for was the single most relevant fact about its subjects: housing choice voucher holders have a guaranteed income stream from the federal government. Their rent is subsidized. Their likelihood of missing payments is, statistically, quite different from what a raw credit score would suggest.

But the model didn't know that. Or more precisely, nobody told it to care.

The algorithm didn't discriminate on purpose. It discriminated by design — by treating historically biased data as neutral truth.

Here's where the numbers get damning. As of October 2021, the median credit score for White consumers was 725. For Hispanic consumers, 661. For Black consumers, 612. When you build a model that treats credit score as a primary predictor of "lease performance risk," you're not making a neutral mathematical choice. You're encoding a century of redlining, predatory lending, and wealth inequality into a single feature weight. SafeRent's algorithm looked at Mary Louis's credit history and saw risk. What it should have seen was a woman with guaranteed rent money and a system that had never given her a fair shot at building credit.

Why Did a Court Say the Software Vendor Is Liable?

A diagram showing the legal liability chain established by the SafeRent ruling — how liability flows from algorithm developer through to housing decision, breaking the traditional "we just built the tool" defense.

This is the part that should keep every AI company founder awake at night.

SafeRent tried the obvious defense: we're a technology provider, not a landlord. We don't make housing decisions. We just provide information. The court rejected this argument flatly. The Department of Justice filed a Statement of Interest arguing that when a landlord outsources its decision-making to an algorithm, the developer of that algorithm is functionally part of the decision chain.

Think about what that means for a moment. Every company selling AI-powered scoring, screening, underwriting, or risk assessment in a regulated market just lost the ability to say "we only built the tool."

I remember the conversation with my co-founder the week after the ruling came down. We were on a call, supposedly reviewing a client deliverable, and instead we spent forty-five minutes mapping out every industry where this precedent could apply. Credit scoring. Insurance underwriting. Employment screening. Healthcare triage. The list kept growing. At some point one of us said, "This isn't a housing case. This is the beginning of AI product liability law." We weren't celebrating — we'd been warning about this exact scenario — but there was a grim satisfaction in watching the legal system finally catch up to what the technology had been doing unchecked.

The settlement didn't just cost SafeRent $2.275 million. It imposed a five-year injunction with teeth:

SafeRent can no longer issue automated approve or decline recommendations for voucher holders unless the model is validated for fairness by independent civil rights experts. Without that validation, the system can only provide raw background information — stripped of its predictive scoring. The company must also train its clients on the limitations of scoring models for subsidized populations. And these terms apply nationwide, not just in Massachusetts.

For a deeper look at the settlement structure and its regulatory implications, I wrote an interactive breakdown of the full case analysis.

The LLM Wrapper Trap

About a year before the SafeRent settlement was finalized, I took a meeting with a potential client — a mid-size property management company running about 12,000 units across the Southeast. They'd been approached by a vendor offering an "AI-powered tenant screening solution" built on top of a large language model. The pitch was slick: natural language processing, instant risk summaries, beautiful dashboards. The vendor had raised a Series A. They had logos on their website.

I asked one question: "Can the system explain, for a specific applicant, which features drove the decline decision in a way that satisfies Fair Credit Reporting Act adverse action notice requirements?"

Silence. Then: "We can generate a natural language explanation of the decision."

"Generated by the LLM?"

"Yes."

"So the explanation is a plausible narrative about why the person was declined, not a verified causal trace of the actual model computation?"

More silence.

This is the core problem with what I call "LLM wrappers" — and it's the problem that the SafeRent case illuminated in brutal, expensive detail. A Large Language Model can summarize a lease agreement. It can draft a letter. It can even produce a convincing-sounding explanation of why an applicant was rejected. But it cannot certify that its reasoning is causally connected to the actual decision pathway. It cannot prove that a protected characteristic didn't influence the outcome. It cannot search for less discriminatory alternatives. It hallucinates explanations the same way it hallucinates everything else — by predicting the most statistically likely next token.

In high-stakes decisions, the ability to generate a plausible answer is worth nothing. The ability to prove a fair one is worth everything.

I've had investors tell me, "Just use GPT and add a compliance layer on top." One said it to my face at a pitch event, like it was obvious, like we were overcomplicating things. I wanted to hand him the SafeRent settlement documents and ask which compliance layer would have caught a model that systematically ignored voucher income. The answer is none of them. Because the bias wasn't in the output formatting or the user interface. It was in the feature weights. It was in the training data. It was in the fundamental architecture of what the model was optimized to predict.

How Does HUD's 2024 Guidance Change the Game?

In May 2024, HUD issued guidance that effectively codified the lessons of the SafeRent case into regulatory expectations for the entire housing industry. The standard is "disparate impact" — meaning a system can be illegal even if nobody intended to discriminate, as long as it produces disproportionate negative effects on a protected class that can't be justified by a legitimate, nondiscriminatory interest.

Three requirements stand out:

Feature relevance must be causal, not just correlational. Every data point in a screening model needs a defensible link to actual lease performance. "Credit score predicts default" is not sufficient if credit score is a proxy for race and you haven't tested whether voucher-adjusted income is a better predictor.

Applicants must have a meaningful path to challenge AI results. This means human-in-the-loop review isn't optional — it's mandatory. A system that produces a score with no recourse mechanism is a system waiting to be sued.

Developers must search for Less Discriminatory Alternatives. This is the provision that changes everything. It's not enough to build a model that works. You have to demonstrate that you looked for models that work equally well with less discriminatory impact — and either adopted them or can prove none exist.

That last requirement — the Less Discriminatory Alternative, or LDA — is where most AI companies I've seen fall apart. Not because the math is impossibly hard, but because they've never been forced to do it. They optimize for accuracy. They ship. They move on. The idea that you might need to search through thousands of alternative model configurations to find one that maintains performance while maximizing fairness across demographic groups? That's not a feature request most product managers have ever received.

What We Actually Build Instead

A comparison diagram showing the architectural difference between post-hoc auditing (patch after deployment) versus fairness-as-optimization-constraint (built into training), illustrating why the latter catches bias that the former misses.

I need to be honest about something: when we first started building fairness-aware systems at Veriprajna, we got it wrong.

Our initial approach was post-hoc auditing. Build the model, test it for bias, adjust the thresholds if something looked off. It felt responsible. It felt like enough. It wasn't.

The problem with post-processing is that you're trying to patch outcomes without understanding causes. You can tune per-group decision thresholds so that error rates look similar across groups — post-processing toward a criterion called "Equalized Odds" — but if the underlying model has learned a biased representation of risk, you're just putting makeup on a structural problem. The model still thinks certain people are riskier. You're just overriding it at the last mile. And the first time someone audits the feature importances, the bias is right there, staring back at you.
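To make the "makeup" concrete, here is a deliberately minimal sketch of that last-mile patch. The scores and labels are hypothetical; the cutoffs are chosen per group so that true-positive rates match, which is one half of the Equalized Odds criterion:

```python
# Toy sketch of last-mile threshold patching: pick a separate cutoff per
# group so true-positive rates match. All scores and labels hypothetical.

def tpr_at_threshold(scores, labels, thresh):
    """True-positive rate: fraction of actually-good tenants approved."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= thresh for s in positives) / len(positives)

def pick_threshold(scores, labels, target_tpr):
    """Lowest cutoff whose TPR meets the target for this group."""
    for t in sorted(set(scores), reverse=True):
        if tpr_at_threshold(scores, labels, t) >= target_tpr:
            return t
    return min(scores)

# The model scores group B's good tenants (label == 1) systematically lower
group_a = {"scores": [720, 680, 650, 600, 580], "labels": [1, 1, 1, 0, 0]}
group_b = {"scores": [640, 610, 590, 560, 540], "labels": [1, 1, 1, 0, 0]}

t_a = pick_threshold(group_a["scores"], group_a["labels"], target_tpr=1.0)
t_b = pick_threshold(group_b["scores"], group_b["labels"], target_tpr=1.0)

# Outcomes now match at the last mile, but the learned score gap -- the
# bias itself -- is untouched
assert t_b < t_a
```

Both groups' good tenants now get approved. The model's score gap is still sitting there, one feature-importance audit away.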

The breakthrough — and I use that word carefully, because it felt more like a slow, frustrating accumulation of failures than a eureka moment — came when we started treating fairness as an optimization constraint rather than a post-deployment audit.

Here's what that means in practice. During model training, we don't just minimize prediction error. We simultaneously penalize the model if a secondary "adversarial" network can predict a protected attribute (like race or gender) from the primary model's outputs. If the adversary succeeds — if it can look at the model's predictions and guess who's Black and who's White — the primary model gets penalized and retrained. The result is a model that's been forced to learn features that are genuinely independent of protected characteristics.
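A production version of this trains a second network and backpropagates through both. As a minimal, self-contained sketch of the same pressure — with a covariance penalty standing in for the adversary, and all data synthetic — it looks roughly like this:

```python
# Simplified sketch of fairness-as-training-constraint. A covariance
# penalty between predictions and the protected attribute stands in for
# the adversarial network; all data below is synthetic.
import math, random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic data: feature x correlates with protected attribute a, so an
# unconstrained model leaks a through its predictions
n = 200
a = [i % 2 for i in range(n)]
x = [random.gauss(0.5 if ai else -0.5, 1.0) for ai in a]
y = [1 if random.random() < sigmoid(xi) else 0 for xi in x]

def train(lam, steps=500, lr=0.1):
    """Logistic regression with a cov(p, a)^2 fairness penalty of weight lam."""
    w, b = 0.0, 0.0
    mean_a = sum(a) / n
    for _ in range(steps):
        p = [sigmoid(w * xi + b) for xi in x]
        # prediction-error gradients
        gw = sum((pi - yi) * xi for pi, yi, xi in zip(p, y, x)) / n
        gb = sum(pi - yi for pi, yi in zip(p, y)) / n
        # penalty gradient: push predictions toward statistical
        # independence from the protected attribute
        cov = sum((ai - mean_a) * pi for ai, pi in zip(a, p)) / n
        gcov_w = sum((ai - mean_a) * pi * (1 - pi) * xi
                     for ai, pi, xi in zip(a, p, x)) / n
        w -= lr * (gw + lam * 2 * cov * gcov_w)
        b -= lr * gb
    p = [sigmoid(w * xi + b) for xi in x]
    # "leakage": how strongly final predictions track the attribute
    return abs(sum((ai - mean_a) * pi for ai, pi in zip(a, p)) / n)

leak_plain = train(lam=0.0)   # unconstrained: predictions track a
leak_fair = train(lam=50.0)   # constrained: leakage shrinks
assert leak_fair < leak_plain
```

The mechanism is the same as the adversarial setup: whenever the protected attribute is recoverable from the outputs, the primary model pays for it during training, not after deployment.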

We pair this with what researchers call "counterfactual testing." For every applicant the model evaluates, we ask: if this person's race were different but everything else stayed the same, would the decision change? If the answer is yes, the model fails. Not "flags for review." Fails.
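Here is roughly what that audit looks like as code. The screener and applicants are hypothetical, and the 113-point shift is the median credit gap quoted earlier (725 versus 612) treated as a causal descendant of group membership:

```python
# Sketch of a counterfactual audit on a hypothetical biased screener.
# The 113-point shift mirrors the median credit gap quoted in the text.

def biased_model(applicant):
    # Leans on raw credit score and ignores guaranteed voucher income
    # entirely -- the SafeRent failure mode
    return "approve" if applicant["credit_score"] >= 650 else "decline"

def counterfactual(applicant):
    """Flip the group and its causal descendant (the credit gap); hold all else."""
    twin = dict(applicant)
    gap = 113
    if applicant["group"] == "B":
        twin["group"], twin["credit_score"] = "A", applicant["credit_score"] + gap
    else:
        twin["group"], twin["credit_score"] = "B", applicant["credit_score"] - gap
    return twin

def audit(model, applicants):
    """Fraction of decisions that flip when only the demographic changes."""
    flips = sum(model(app) != model(counterfactual(app)) for app in applicants)
    return flips / len(applicants)

applicants = [
    {"group": "B", "credit_score": 600, "voucher": True},
    {"group": "B", "credit_score": 660, "voucher": True},
    {"group": "A", "credit_score": 700, "voucher": False},
]
flip_rate = audit(biased_model, applicants)
# Any nonzero flip rate fails the audit outright
assert flip_rate > 0
```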

Counterfactual fairness asks the question every civil rights attorney will eventually ask: would this person have been approved if they were White? Your model better have the same answer.

There was a night — I think it was around 2 AM — when we ran our first full counterfactual audit on a prototype screening model we'd built using a public housing dataset. We expected maybe a 3-4% discrepancy. The actual number was closer to 11%. Eleven percent of decisions would have flipped if we changed nothing but the demographic group. My engineer sent me a Slack message that just said: "We have a problem." We spent the next three weeks rebuilding the feature pipeline from scratch, replacing credit score with a composite indicator that weighted voucher income, direct rent payment history, and employment stability. The counterfactual gap dropped to under 1%.

That's the difference between what I call "Deep AI" and an LLM wrapper. It's not about having better prompts or a nicer interface. It's about whether fairness is a property of the system's architecture or a sticker you put on the box.

For the full technical breakdown of our fairness engineering approach — including the adversarial debiasing methodology and the mathematical formalization of the metrics we use — see our research paper on algorithmic integrity and enterprise risk.

Why Can't You Just Audit After Deployment?

People ask me this constantly, and I understand the appeal. Auditing feels cheaper. It feels less disruptive. You build fast, ship fast, audit later, fix what breaks.

The problem is that in regulated markets, "what breaks" is people's lives.

By the time SafeRent's algorithm was challenged in court, it had been running for years. How many Mary Louises were there who never filed a lawsuit? How many families with vouchers were denied housing by an algorithm that couldn't see past their credit score? Those denials don't get reversed by a settlement. Those apartments went to someone else. Those families found somewhere worse to live, or didn't find anywhere at all.

Static audits also miss something critical: data drift. The socioeconomic patterns that a model learned during training shift over time. Voucher utilization rates change. Credit scoring methodologies evolve. Rental markets tighten or loosen. A model that was "fair enough" in 2022 might be discriminatory by 2024 — not because anyone changed the code, but because the world changed around it.

This is why we've moved toward continuous monitoring with automated retraining triggers. The model doesn't just get audited once a year. It gets audited every time it makes a decision, against a battery of fairness metrics — Statistical Parity Difference, Disparate Impact Ratio, Equalized Odds — running in real time. When any metric drifts past a threshold, the system flags it before a human ever sees the output.
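As a rough sketch of that gate — with hypothetical thresholds: a 0.1 cap on Statistical Parity Difference and the common "four-fifths" floor on the Disparate Impact Ratio — the per-batch check looks like this:

```python
# Sketch of a per-batch fairness gate. Thresholds are hypothetical
# defaults; `decisions` is a list of (group, approved) pairs from
# recent model outputs.

def fairness_metrics(decisions):
    rates = {}
    for group in {g for g, _ in decisions}:
        outcomes = [approved for g, approved in decisions if g == group]
        rates[group] = sum(outcomes) / len(outcomes)
    hi, lo = max(rates.values()), min(rates.values())
    return {
        "statistical_parity_difference": hi - lo,          # approval-rate gap
        "disparate_impact_ratio": lo / hi if hi else 1.0,  # worst/best ratio
    }

def gate(decisions, spd_max=0.1, dir_min=0.8):
    """Pass only if every metric is inside its threshold; else flag for review."""
    m = fairness_metrics(decisions)
    ok = (m["statistical_parity_difference"] <= spd_max
          and m["disparate_impact_ratio"] >= dir_min)
    return ok, m

batch = [("A", 1), ("A", 1), ("A", 1), ("A", 0),
         ("B", 1), ("B", 0), ("B", 0), ("B", 0)]
ok, metrics = gate(batch)
assert not ok  # 75% vs 25% approval rates: flagged, not shipped
```

Running this on every decision batch, rather than in an annual audit, is what turns drift from a lawsuit into an alert.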

I think of it like this: you wouldn't build a bridge, inspect it once, and then never check it again. You'd monitor it continuously for stress, fatigue, environmental changes. AI systems that make decisions about people's housing, credit, and employment deserve at least the same engineering rigor we give to concrete and steel.

What Does the EU AI Act Mean for American Companies?

If the SafeRent settlement and HUD guidance represent the current regulatory floor, the EU AI Act — which begins phased enforcement in 2025-2026 — represents where the ceiling is heading.

The Act classifies AI systems used for credit scoring, tenant screening, and employment decisions as "High Risk," subjecting them to mandatory conformity assessments, transparency requirements, and human oversight obligations. American companies that serve European markets, or that serve American markets in ways that European regulators decide to care about, will need to comply.

But here's what I find more interesting than the specific requirements: the EU Act operationalizes the NIST AI Risk Management Framework's four pillars — Govern, Map, Measure, Manage — into legally binding obligations. What was voluntary guidance becomes mandatory practice. The companies that aligned their architectures with these principles early will find compliance straightforward. The companies that treated fairness as a marketing claim will find it expensive.

I've watched this pattern play out in data privacy (GDPR), financial reporting (SOX), and now AI governance. The regulatory trajectory only moves in one direction. Building for tomorrow's requirements today isn't idealism. It's risk management.

The Model Multiplicity Problem Nobody Talks About

A scatter plot infographic showing the accuracy-vs-fairness tradeoff landscape, illustrating how thousands of models with near-identical accuracy have wildly different fairness profiles, and why the LDA search matters.

There's a concept in machine learning research called "model multiplicity" — the observation that for any given dataset, there are potentially millions of models that achieve nearly identical accuracy but have wildly different fairness profiles. Some of those models are deeply biased. Some are remarkably fair. And without an explicit, systematic search for the fair ones, developers will almost always land on whatever the optimizer finds first.

This is the technical foundation of the Less Discriminatory Alternative requirement, and it's why I believe the LDA search will become the single most important capability in regulated AI development over the next decade.

When we conduct an LDA search, we're not just training one model. We're training hundreds, varying feature sets, architectures, hyperparameters, and fairness constraints, then mapping the entire landscape of accuracy-fairness tradeoffs. The goal is to find the model that achieves the business objective — predicting lease performance, assessing credit risk, whatever the task — with the minimum possible discriminatory impact.

Sometimes that search reveals something uncomfortable: the "most accurate" model is also the most biased, because accuracy and historical bias are correlated in the training data. The second-most-accurate model might sacrifice half a percentage point of predictive power while cutting the Disparate Impact Ratio gap by 40%. Is that tradeoff worth it?
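In miniature — with invented model names and invented numbers — the selection step of an LDA search is just this:

```python
# Sketch of the LDA selection rule over a hypothetical candidate pool.
# Each tuple is (model name, accuracy, disparate-impact gap); all values
# are invented for illustration.

candidates = [
    ("gbm_all_features",       0.912, 0.31),  # most accurate, most biased
    ("gbm_no_raw_credit",      0.907, 0.18),
    ("logit_voucher_adjusted", 0.905, 0.09),
    ("logit_baseline",         0.881, 0.12),  # outside the accuracy band
]

def lda_search(pool, accuracy_tolerance=0.01):
    """Among models within tolerance of the best accuracy, minimize bias."""
    best_acc = max(acc for _, acc, _ in pool)
    viable = [c for c in pool if best_acc - c[1] <= accuracy_tolerance]
    return min(viable, key=lambda c: c[2])

chosen = lda_search(candidates)
# Trades 0.7 points of accuracy for a roughly 70% cut in the impact gap
assert chosen[0] == "logit_voucher_adjusted"
```

The real search trains every candidate; the point of the sketch is that the decision rule — accuracy band first, then minimize discriminatory impact — is explicit, documented, and auditable.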

If your model is 0.5% less accurate but 40% less discriminatory, and you chose accuracy — good luck explaining that to a judge.

In the SafeRent case, the fundamental question was whether a model could have predicted lease performance equally well without penalizing voucher holders. The answer, based on everything we know about the data, is almost certainly yes. SafeRent just never looked.

The Night I Almost Agreed to Build a Wrapper

I want to end with a story I haven't told publicly before.

About eighteen months ago, we were approached by a company — I won't name them — that wanted us to build a compliance screening tool for a major financial services client. The budget was significant. The timeline was aggressive. And the spec they handed us was, essentially, an LLM wrapper: take a foundation model, fine-tune it on regulatory documents, add a scoring layer, ship it.

My team was split. Half of them saw the revenue. The other half saw the SafeRent case in slow motion. We had a call that lasted almost three hours. One of my engineers — someone I trust deeply — said something that stuck with me: "We can build what they're asking for in eight weeks. We can build what they need in eight months. If we build what they're asking for, we become the next case study in why this approach fails."

We walked away from the deal. It was the most expensive decision I've made as a founder. I second-guessed it for weeks.

I don't second-guess it anymore.

The SafeRent settlement proved that the market for AI in regulated industries isn't a race to ship fastest. It's a race to ship safest — where "safe" means architecturally fair, legally defensible, and engineered to withstand the kind of forensic scrutiny that a federal court will eventually apply. The companies that understand this will build the systems that last. The companies that don't will build the next $2.275 million cautionary tale.

The era of the black box is over. Not because regulators killed it, but because it was never built to survive contact with reality. The question isn't whether your AI can make a decision. It's whether your AI can defend one.
