Artificial Intelligence · Technology · Machine Learning

The Algorithm That Ate a City: What Predictive Policing's Collapse Taught Me About Building AI That Deserves Trust

Ashutosh Singhal · April 16, 2026 · 14 min read

I was sitting in a conference room in late 2023, watching a prospective client demo their internal AI tool — a chatbot they'd wired up to help their compliance team flag risk in financial documents. The interface was slick. The responses were fast. And about every fourth answer was confidently, dangerously wrong.

When I pointed out a hallucination — the model had invented a regulatory citation that didn't exist — the VP of Engineering shrugged. "Yeah, we know about that. We're hoping the next model update fixes it."

That moment crystallized something I'd been thinking about for months. The enterprise world was sleepwalking into the exact same trap that had already destroyed public trust in AI-powered policing across America. Not because the technology was inherently evil, but because the people deploying it had confused having an AI system with governing one.

At Veriprajna, we build deep AI solutions for high-stakes enterprise environments. But to explain why we build them the way we do — with governance baked in from day zero, with explainability as a non-negotiable, with mathematical fairness constraints woven into the training process — I need to take you somewhere uncomfortable first. I need to take you to Chicago.

56% of Young Black Men in a City, Flagged by a Machine

Chicago's Strategic Subject List — internally called the "heat list" — was supposed to be the future of smart policing. Instead of blanketing neighborhoods with officers, the algorithm would identify specific individuals most likely to be involved in gun violence, either as perpetrators or victims. Precision over brute force. Data over intuition.

The list ballooned to over 400,000 people.

Let that number sit for a moment. In a city of 2.7 million, the algorithm decided that 400,000 individuals were worth flagging. And the demographics were staggering: 56% of Black men in Chicago between the ages of 20 and 29 ended up on that list. In West Garfield Park, 73% of Black males between 10 and 29 were flagged. Ninety-six percent of individuals the system classified as "suspected gang members" were Black or Latino.

Here's what broke my brain when I first dug into the audit data: 57% of the algorithm's priority targets had never been arrested for a violent crime. The system was pulling in low-level misdemeanors — things like minor drug possession or disorderly conduct — and treating them as predictive signals for future gun violence. It was using the machinery of over-policing as evidence to justify more policing.

When an algorithm treats the consequences of bias as proof that bias is warranted, you don't have a prediction engine. You have a discrimination machine running on autopilot.

The Chicago Office of Inspector General eventually documented what many community organizations had been screaming about for years: the SSL was biased along racial lines and largely ineffective at reducing murder rates. It was decommissioned in late 2019, but not before it had sent officers on unannounced visits to thousands of people whose only "crime" was living in a neighborhood that the algorithm had decided was dangerous.

Why Did the Earthquake Model Fail at Predicting Crime?

Three thousand miles west, the LAPD was running its own experiment. Geolitica — formerly PredPol — used a model originally designed to predict earthquake aftershocks. The logic was seductive: just as tremors cluster in space and time, certain types of crime follow predictable spatiotemporal patterns. Feed the algorithm historical incident data — location, time, crime type — and it would generate 500-by-500-foot "hotspot boxes" telling officers where to patrol.

I remember reading the technical documentation and thinking, this is elegant. The math was clean. The interface was intuitive. And the results were catastrophic.

A 2019 audit by the LAPD Inspector General found "significant inconsistencies" in data entry. Officers were logging patrol time at police facilities rather than in the field, contaminating the hotspot data. The system couldn't isolate its own impact from broader policing trends. And in comparable jurisdictions like Plainfield, New Jersey, the prediction success rate was documented at less than 1%.

Less than one percent. A coin flip would have been more useful.

But the deeper problem wasn't accuracy — it was the feedback loop. When the algorithm flagged a predominantly Black or Latino neighborhood as a hotspot, more officers went there. More officers meant more stops. More stops meant more arrests for minor infractions that wouldn't have been enforced in wealthier, whiter areas. Those new arrests flowed back into the training data as "evidence" of high crime, and the algorithm dutifully intensified its predictions for that same neighborhood.

California's Racial and Identity Profiling Act (RIPA) data laid this bare in numbers that are hard to argue with: Black individuals were stopped 126% more frequently than expected based on their population share. Officers conducted 4.7 million vehicle and pedestrian stops in 2023. And here's the kicker — when officers searched Black and Latino individuals at higher rates, they were consistently less likely to find contraband compared to searches of white individuals.

The data was telling us, in plain statistical language, that the system was wrong. And the system kept running anyway.

The LAPD finally terminated its relationship with Geolitica in early 2024. I wrote about the broader implications of these failures — and what they mean for enterprise AI architecture — in the interactive version of our research.

What Happens When Nobody Can Open the Black Box?

There's a term in philosophy of science that kept coming up in my research: epistemic opacity. It means that the system is so complex that even the people operating it can't fully understand how it reaches its conclusions.

Most predictive policing systems were proprietary black boxes. The specific data inputs, the factors weighed, the logic of predictions — all hidden as trade secrets. The police departments using these tools often couldn't explain why a particular person or neighborhood was flagged, even when civil liberties organizations demanded answers.

This isn't just a policing problem. It's the defining vulnerability of how most enterprises are deploying AI right now.

I think about that compliance chatbot I saw demoed. The VP of Engineering couldn't tell me which documents the model had actually retrieved to generate its answer. He couldn't explain why it had hallucinated a regulatory citation. He couldn't tell me whether the system would give a different answer tomorrow if the same question were asked. And his plan was to wait for OpenAI to ship a better model.

That's not an AI strategy. That's a prayer.

The Runaway Feedback Loop Isn't Just a Policing Problem

[Figure: the self-reinforcing bias feedback loop, with labeled stages showing how biased outputs become biased training data, in both policing and enterprise AI.]

Here's where I need to make the connection that I think most people in enterprise AI are missing.

The feedback loop that destroyed predictive policing — where biased outputs generate biased training data, which generates more biased outputs — isn't unique to law enforcement. It's a structural property of any AI system that learns from its own operational environment without independent validation.

Think about an AI-powered hiring tool that screens resumes. If it's trained on historical hiring data from a company that has predominantly hired men for engineering roles, it will learn to associate male-coded language with "good candidates." It will downrank women. The company will hire fewer women. That hiring data will feed back into the next training cycle, and the bias will deepen.

Or consider a financial underwriting model trained on historical loan approvals. If past loan officers were more likely to approve applications from certain zip codes — zip codes that happen to correlate with race due to decades of redlining — the model will learn those patterns. It will deny loans to qualified applicants from those areas. Those denials will become training data. The cycle continues.
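To see how quickly this dynamic hardens, here is a deliberately tiny simulation — a sketch under assumed numbers, not a model of any real deployment. Two neighborhoods have identical true incident rates, but one starts with a single extra arrest in the historical data, and patrols follow the data:

```python
import random

random.seed(42)

# Assumed toy setup: the true incident rate is IDENTICAL everywhere,
# but neighborhood A carries one extra arrest in the historical data.
TRUE_RATE = 0.10
arrest_counts = {"A": 2, "B": 1}

for day in range(365):
    # The "model" names a hotspot using historical arrests alone.
    hotspot = max(arrest_counts, key=arrest_counts.get)
    # Patrols go only to the predicted hotspot, so only that neighborhood
    # can produce new arrests -- which then feed the next day's prediction.
    if random.random() < TRUE_RATE:
        arrest_counts[hotspot] += 1

print(arrest_counts)
```

Neighborhood B never generates new data, so the model never reconsiders it: a one-arrest gap in the starting data becomes, a year later, what looks like overwhelming statistical evidence. That is the runaway loop in miniature, and nothing about it is specific to policing.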

The most dangerous AI systems aren't the ones that are obviously broken. They're the ones that work just well enough to avoid scrutiny while quietly encoding the biases of their training data into automated decisions at scale.

This is why I get frustrated when I hear enterprise leaders talk about AI governance as a "nice to have" or a "phase two" initiative. Governance isn't a feature you bolt on after launch. It's the architecture itself.

Why Are LLM Wrappers Failing in High-Stakes Environments?

[Figure: side-by-side comparison of a simple LLM wrapper (51% accuracy) and a multi-agent architecture (89% accuracy), with labeled components.]

Let me be direct about something: the age of simple LLM wrappers is ending, and most enterprises haven't realized it yet.

An LLM wrapper — a thin layer of prompt engineering and a nice UI on top of a foundational model like GPT-4 or Claude — works fine for drafting emails and summarizing meeting notes. It does not work for legal review, financial compliance, medical triage, or any domain where a wrong answer has material consequences.

We tested this rigorously at Veriprajna. In security vulnerability triage — a domain where you need to distinguish between a minor bug and a critical exploit — a naive LLM wrapper achieved roughly 51% accuracy. That's barely better than random. The model lacked the specialized tools and domain knowledge to make meaningful distinctions. And it had another problem that I've started calling the "fence-sitting" phenomenon: the safety alignments built into foundational models made them reluctant to take firm positions on ambiguous cases. In a triage context, ambiguity is the entire job. An AI that hedges on every edge case isn't augmenting your team — it's creating more work.

Our multi-agent architecture, by contrast — with composable agents, structured workflows, and domain-specific knowledge bases — hit 89% accuracy on the same benchmarks. Not because we used a "better" model, but because we built a system rather than a wrapper.

That difference — 51% versus 89% — is the difference between an AI that generates plausible text and an AI that actually reasons about a domain.

What Does Mathematical Fairness Actually Look Like?

One of the things I've learned building Veriprajna is that "fairness" in AI can't be a vibe. It has to be a number.

When we build systems for high-stakes environments, we define fairness mathematically and monitor it continuously. Two metrics matter most:

Demographic Parity asks: is the probability of a positive outcome independent of a protected attribute like race or gender? If your hiring AI approves 60% of male applicants and 35% of female applicants, you've failed this test.

Equalized Odds goes deeper: are the true positive rates and false positive rates equal across groups? This matters because a system could achieve demographic parity by randomly approving more applications from underrepresented groups — without actually getting better at identifying qualified candidates.

Both metrics need to be monitored simultaneously, and neither is sufficient alone. That's why our bias mitigation strategy operates across the entire AI lifecycle: re-weighting training data before the model ever sees it, incorporating fairness constraints directly into the training process through techniques like adversarial debiasing, and calibrating decision thresholds after training to ensure equitable outcomes across demographic groups.

I know this sounds technical. But here's the plain-English version: if you can't express your fairness criteria as a mathematical equation, you don't have fairness criteria. You have a press release.
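Because both metrics really are just equations, they reduce to a few lines of code. This is an illustrative sketch over hypothetical prediction data, not our production tooling:

```python
def demographic_parity_gap(y_pred, group):
    """Largest difference in positive-outcome rates across groups."""
    rates = {}
    for g in set(group):
        preds = [p for p, gg in zip(y_pred, group) if gg == g]
        rates[g] = sum(preds) / len(preds)
    return max(rates.values()) - min(rates.values())

def equalized_odds_gaps(y_true, y_pred, group):
    """Largest true-positive-rate gap and false-positive-rate gap across groups."""
    tpr, fpr = {}, {}
    for g in set(group):
        pairs = [(t, p) for t, p, gg in zip(y_true, y_pred, group) if gg == g]
        pos = [p for t, p in pairs if t == 1]   # predictions on truly qualified
        neg = [p for t, p in pairs if t == 0]   # predictions on truly unqualified
        tpr[g] = sum(pos) / len(pos) if pos else 0.0
        fpr[g] = sum(neg) / len(neg) if neg else 0.0
    return (max(tpr.values()) - min(tpr.values()),
            max(fpr.values()) - min(fpr.values()))

# The hiring example from above: 60% of men approved, 35% of women.
y_pred = [1] * 60 + [0] * 40 + [1] * 35 + [0] * 65
group = ["M"] * 100 + ["F"] * 100
print(round(demographic_parity_gap(y_pred, group), 2))  # prints 0.25
```

Note that equalized odds requires ground-truth labels as well as predictions, which is exactly why it is the harder, more honest metric: you cannot compute it without knowing who actually was qualified.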

The Regulatory Wave Most Companies Aren't Ready For

While enterprises have been busy experimenting with chatbots, regulators have been busy writing laws.

Over 40 U.S. cities have moved to ban or strictly restrict predictive policing and related AI technologies like facial recognition. San Francisco was first in 2019. Boston, Portland, and Santa Cruz followed. In March 2024, the White House issued a landmark policy requiring federal agencies to conduct independent testing and mandatory impact assessments for any rights-impacting AI systems.

This isn't just a government problem. The EU AI Act, NIST's AI Risk Management Framework, ISO 42001 — these frameworks are converging on a single message: if you deploy AI in high-stakes decisions, you will be required to prove it's fair, explain how it works, and demonstrate that you're monitoring it continuously.

The enterprises that have governance infrastructure in place will adapt. The ones that built LLM wrappers and called it an "AI strategy" will scramble.

I've watched this pattern before, in cybersecurity. Companies that treated security as an afterthought spent years playing catch-up when regulations hit. The ones that built security into their architecture from the start barely noticed. AI governance is following the same trajectory, just faster.

For the full technical breakdown of how we align our governance framework with NIST, ISO 42001, and the EU AI Act, see our research paper.

"Just Use GPT" and Other Expensive Mistakes

People ask me all the time why enterprises shouldn't just use a foundational model with some prompt engineering and call it a day. The answer is the same reason the LAPD shouldn't have used an earthquake model to predict crime.

The tool isn't the problem. The assumption is.

The assumption is that a general-purpose system — whether it's a seismology algorithm or a large language model trained on the internet — can be dropped into a specialized, high-stakes domain without fundamental architectural changes. Without domain-specific reasoning layers. Without explainability. Without continuous bias monitoring. Without governance.

That assumption has been tested. In policing, it destroyed public trust, harmed hundreds of thousands of people, and triggered a nationwide regulatory backlash. In enterprise AI, the consequences are playing out more quietly — in hallucinated legal citations, in biased hiring decisions, in compliance failures that won't surface until an audit or a lawsuit forces them into the light.

The question isn't whether your AI will make a mistake. The question is whether you'll know when it does — and whether you've built the architecture to catch it before it compounds.

At Veriprajna, we don't start with a model. We start with the data. We audit it for quality, accessibility, and historical bias before a single parameter is trained. We build multi-agent architectures where specialized reasoning layers can perform deep research rather than relying on zero-shot calls to a general-purpose model. We implement explainable AI validation so that every decision can be traced, interrogated, and defended. And we monitor continuously — not just for accuracy, but for fairness drift, because what was equitable six months ago may not be equitable today if the underlying data distribution has shifted.
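As a sketch of what monitoring for fairness drift means in practice — window size, threshold, and names are illustrative assumptions, not taken from any client system — a rolling window over live decisions can recompute the parity gap continuously and raise an alert the moment it crosses a tolerance:

```python
from collections import deque

def parity_gap(records):
    """Demographic parity gap over (group, outcome) records."""
    rates = {}
    for g in {g for g, _ in records}:
        outcomes = [o for gg, o in records if gg == g]
        rates[g] = sum(outcomes) / len(outcomes)
    return max(rates.values()) - min(rates.values())

def drift_alerts(decision_stream, window=500, threshold=0.10):
    """Yield the gap whenever the rolling parity gap exceeds the threshold."""
    recent = deque(maxlen=window)  # oldest decisions age out automatically
    for record in recent and decision_stream or decision_stream:
        recent.append(record)
        if len(recent) == window:
            gap = parity_gap(recent)
            if gap > threshold:
                yield gap
```

A system that was equitable at launch, with a gap near zero, starts emitting alerts as soon as the live decision distribution drifts. That is the point: fairness is a property you verify continuously, not once.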

This isn't more expensive than the wrapper approach. It's less expensive — because the cost of deploying an ungoverned AI system in a high-stakes environment isn't measured in engineering hours. It's measured in lawsuits, regulatory fines, reputational damage, and the human cost of automated decisions that nobody can explain or defend.

The Room Where It Happens

I want to end with a moment that stays with me.

We were deep into building a new reasoning layer for a client in financial services. The team had been arguing for two days about whether to prioritize accuracy or explainability in a particular module — one of those arguments where everyone is technically right and the real question is about values, not engineering.

My lead engineer finally said something that shut the room up: "If we can't explain why this model denied someone a loan, then we haven't built an AI system. We've built a more efficient version of the problem we were hired to solve."

She was right. And that sentence has become something close to a design principle for everything we build.

The failures of predictive policing — the 400,000 people on Chicago's heat list, the less-than-1% accuracy in Plainfield, the feedback loops that turned historical racism into mathematical certainty — these aren't cautionary tales from a different industry. They're the clearest possible preview of what happens when you deploy AI without the architecture to earn trust.

The path forward isn't to abandon AI. It's to stop treating governance as overhead and start treating it as the product. The enterprises that understand this will build systems that survive scrutiny. The ones that don't will learn the lesson the LAPD learned, the Chicago PD learned, and that compliance chatbot will eventually learn: an AI system without integrity isn't a tool. It's a liability with a nice interface.
