A conceptual image of a government chatbot interface displaying confidently wrong legal advice, with a .gov badge prominently visible, conveying the tension between official authority and AI unreliability.
Artificial IntelligenceGovernment TechnologyMachine Learning

New York City's AI Chatbot Told People to Break the Law. I Built the Architecture That Makes That Impossible.

Ashutosh SinghalAshutosh SinghalFebruary 3, 202614 min read

A landlord in Brooklyn asks the city's chatbot whether he has to accept Section 8 housing vouchers. The chatbot says no. The landlord turns away a single mother with two kids and a valid voucher. Three months later, the NYC Commission on Human Rights hits him with a six-figure fine.

The landlord followed the government's own advice. The government's own advice was illegal.

This actually happened. Not in some hypothetical stress test, not in a red-team exercise — in production, on a .gov domain, to real people making real decisions about their businesses and their tenants. New York City's "MyCity" chatbot, launched in October 2023 and powered by Microsoft's Azure AI, systematically told business owners to violate city law. It said employers could take a cut of their workers' tips. It said stores could refuse cash. It said landlords could lock out tenants. Every single one of those things is a crime in New York City.

When I first read The Markup's investigation detailing these failures, I wasn't surprised. I was angry — but not surprised. Because what NYC built wasn't a government AI system. It was a liability generator wearing a .gov badge. And the architectural reason it failed is the same reason most government AI deployments will fail unless we fundamentally change how we build them.

My team at Veriprajna has spent years working on this exact problem: how do you make AI systems that interpret law without inventing it? What I want to share here isn't just a critique. It's the architecture we built as an answer — and the hard lessons we learned getting there.

The Night I Realized "Helpful" Is Dangerous

There's a moment that crystallized this whole problem for me. We were testing an early prototype — a system designed to answer questions about municipal codes — and one of my engineers ran a query: "Can I fire an employee for getting pregnant?"

The model said yes.

Not maliciously. Not because it was trained on misogynist data. It said yes because it was trying to be helpful. The user seemed to want permission, and the model — fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to be agreeable and useful — found a way to give it. It cited "at-will employment" principles from its training data and conveniently ignored the Pregnancy Discrimination Act, Title VII, and about forty years of case law.

I remember sitting in our office at 11 PM staring at that output. My engineer, Priya, had already flagged it. She said something I still think about: "The model isn't lying. It's people-pleasing."

That's the core disease. Commercial LLMs are trained to satisfy users. Research on RLHF-driven sycophancy confirms this — models systematically agree with the user's implied premise to maximize "helpfulness" scores. When a landlord asks "Can I refuse this tenant?", the model hears "Help me refuse this tenant" and obliges. When a business owner asks "Can I go cashless?", the model hears "Tell me I can go cashless."

In government, an AI must often be unhelpful to the user's immediate desire in order to be helpful to their long-term compliance. Standard commercial LLMs are not built for that.

A compliance officer's job is to say no. To be the person in the room who kills the convenient answer. We were trying to build a digital compliance officer on top of a technology optimized to never say no.

What Actually Went Wrong With MyCity?

An infographic showing the three specific categories of illegal advice the MyCity chatbot gave, with the actual law it violated and the real penalties for each.

Let me be specific about the scale of the failure, because the details matter.

The MyCity chatbot told business owners that New York City stores could refuse cash payments. NYC Admin Code § 20-840 explicitly prohibits this — the city council passed that law specifically to protect unbanked residents, who are disproportionately low-income, elderly, and undocumented. First violation: $1,000 fine. Subsequent violations: $1,500 each.

It told employers they could take a portion of their workers' tips. Federal law under the FLSA and New York State labor law both prohibit this. Penalties include liquidated damages up to 100% of unpaid wages.

It told landlords they didn't need to accept Section 8 vouchers. The NYC Human Rights Law lists "lawful source of income" as a protected class. Fines for source-of-income discrimination have reached as high as $1 million.

And here's the part that should terrify every government technology officer: when asked directly, the chatbot told users, "Yes, you can use this bot for professional business advice." The disclaimer on the website said the opposite. The model contradicted its own safety wrapper.

Mayor Adams defended the deployment: "You can't stay in a lab forever." But this isn't a beta test for a food delivery app. When you put AI on a .gov domain and brand it as the city's official resource for regulatory compliance, you're not testing software. You're issuing government guidance. And when that guidance is wrong, people go to jail, lose their businesses, or get evicted.

For a deeper look at the specific legal failures and their statutory context, I wrote an interactive breakdown of the full analysis.

Why Can't You Just Fix the Prompts?

This is the question I get from every government CTO. "Can't we just add better instructions? Fine-tune on the local code? Add a disclaimer?"

No. And I need to explain why, because the failure here isn't a bug. It's the architecture.

Large Language Models are probabilistic text generators. They predict the next most likely word based on statistical patterns in their training data. They optimize for plausibility, not truth. In creative writing, that's a feature. In law, it's a catastrophe.

Statutory law is binary. An action is either legal or illegal based on specific text in a specific code section. There is no "probably legal." There is no "statistically likely to be compliant." The NYC cashless ban either exists in Admin Code § 20-840 or it doesn't. The LLM doesn't check § 20-840. It checks what the internet generally says about cash policies and generates the most plausible-sounding response.

This is what I call semantic drift — the model slides from the precise legal definition to the colloquial understanding found in its training data. Most internet text about landlord-tenant relationships discusses landlords' rights to choose tenants. That's the dominant pattern. The specific NYC exception protecting voucher holders is a tiny signal drowned in noise. The model follows the crowd.

Three structural problems make this unfixable with prompts alone:

The model's training data has a knowledge cutoff. The NYC cashless ban was enacted in 2020. If the training corpus is weighted toward pre-2020 text, the model defaults to the older, more common pattern: stores can set their own payment policies.

The model's reasoning is opaque. You cannot trace why it believes tips can be confiscated. There's no citation chain in the neural weights — just statistical associations. You can't audit what you can't see.

Even with Retrieval-Augmented Generation — the standard fix where you feed the model relevant documents — naive implementations fail on legal text. Legal codes are hierarchical structures where a prohibition in Section A depends on a definition in Section B and an exception in Section C. Standard RAG chunks documents into 500-token fragments that sever these connections. The model might retrieve the right section but miss the critical exception three paragraphs away.

The Argument That Almost Derailed Us

About a year into building our system, we had a genuine team crisis. Half the engineering team wanted to keep improving our RAG pipeline — better embeddings, better chunking, better reranking. The other half, led by me, wanted to throw out the entire paradigm.

The RAG advocates had a point. Our retrieval accuracy was improving. We'd gone from 72% to 89% on our benchmark of municipal code queries. That's good. In most AI applications, that's great.

But I kept coming back to what that 11% failure rate meant in practice. If you're a city serving 8 million residents, and 11% of your legal answers are wrong, you're not running a helpful service. You're running a lottery where the prize is a lawsuit.

I said something in that meeting that I think crystallized our direction: "We're not building a system that's usually right. We're building a system that's never confidently wrong."

There's a massive difference. A system that's usually right will still hallucinate a legal permission with full confidence, and a business owner will follow it. A system that's never confidently wrong will refuse to answer when it's uncertain — which is exactly what a responsible civil servant does. "I'm not sure about that — let me refer you to someone who is."

The goal isn't a chatbot that knows the law. The goal is a system that knows what it doesn't know — and says so.

That argument won. We scrapped the "make RAG better" approach and started building what we now call Statutory Citation Enforcement.

How Do You Build AI That Can't Hallucinate Law?

A system architecture diagram showing the three-stage pipeline of Veriprajna's Statutory Citation Enforcement approach: hierarchical knowledge graph retrieval, constrained decoding, and verification agent review.

The principle is deceptively simple: No Citation = No Output.

If our system cannot retrieve a specific, valid section of the official municipal code that directly supports its answer, it is architecturally blocked from generating an answer. Not discouraged. Not prompted to be careful. Blocked. The neural pathway to generate an unsupported claim is literally severed at the decoding layer.

Here's how that works in practice.

We don't chunk legal codes into arbitrary text fragments. We build a hierarchical knowledge graph that mirrors the actual structure of the law — Title, Chapter, Subchapter, Section, Paragraph — with graph edges linking definitions to operative clauses, prohibitions to their exceptions, and violations to their penalties. When someone asks about cashless stores, the system doesn't just search for "cash." It traverses the hierarchy of Title 20 (Consumer Affairs) to locate Subchapter 21, pulling the prohibition, the definition of "retail establishment," and the penalty structure as a connected unit.

Then comes the part that actually matters: constrained decoding. We use Finite State Machine guidance to restrict the model's output vocabulary at inference time. The model must generate its response in a strict JSON schema that includes the claim, the specific citation ID, and the source URL. If the model attempts to cite a code section that doesn't exist in the retrieved context, that token's probability is set to zero. The model cannot hallucinate a citation because the decoding algorithm won't let it form the words.

And before anything reaches the user, a separate verification agent — think of it as a digital supervisor reviewing a clerk's work — checks whether the cited text actually supports the generated claim. Does § 20-840 really say cashless stores are illegal? Does the citation match the answer? If there's a mismatch, the output is killed and the system returns a safe refusal: "I couldn't find a specific regulation addressing your question. Please contact the Department of Small Business Services."

For the full technical architecture — the constrained decoding math, the graph construction methodology, the verification agent design — see our detailed research paper.

Why Does This Matter Beyond New York?

Because the legal exposure is enormous, and most government leaders don't realize it yet.

Consider the doctrine of entrapment by estoppel. If a government official tells you certain conduct is legal, and you rely on that representation, you may have a defense against prosecution. Courts haven't definitively ruled on whether an AI chatbot counts as a "government official" for this purpose — but the functional equivalence is hard to deny. The chatbot is the designated government interface. If courts accept this defense, cities would be legally barred from enforcing their own laws against people who were misled by their own AI. The hallucinations would create accidental legal immunity for lawbreakers.

Then there's the Moffatt v. Air Canada precedent from 2024. Air Canada's chatbot hallucinated a bereavement fare policy. When the passenger relied on it and got burned, Air Canada tried an astonishing defense: the chatbot was a "separate legal entity" responsible for its own actions. The tribunal demolished that argument. Organizations are liable for all information on their platforms, whether it's static text or dynamically generated by AI. You can't disclaim your way out of your own chatbot's promises.

When a government deploys AI that hallucinates legal permissions, it doesn't just create bad user experience. It potentially waives sovereign immunity, enables entrapment defenses, and exposes itself to product liability claims.

The EU AI Act classifies AI in "essential public services" as high-risk, requiring accuracy, transparency, and human oversight. A system that invents laws would be non-compliant. The regulatory walls are closing globally.

"But What About Edge Cases?"

People always push back on the "No Citation = No Output" rule with the same concern: what about questions where the law is genuinely ambiguous? What about novel situations the code doesn't address?

This is actually where the architecture shines, not where it breaks down. When retrieval scores are low — meaning the system can't find a clearly relevant statute — or when the verification agent detects conflicting interpretations, the system triggers what we call a safe refusal. It tells the user: this is a complex question that requires professional counsel, and here's the specific agency to contact.

That's not a failure. That's the system working exactly as designed. A responsible civil servant who doesn't know the answer doesn't make one up. They say, "Let me get you to someone who handles that." The fact that most AI chatbots would rather fabricate an answer than admit uncertainty is the entire problem we're solving.

The other objection I hear: "This sounds expensive and slow compared to just deploying GPT with a prompt." Yes. It is more expensive. It requires building a structured knowledge graph of the entire municipal code, implementing constrained decoding pipelines, and maintaining a verification layer. It requires treating government AI like infrastructure, not a weekend hackathon.

But you know what's more expensive? A class-action lawsuit from every business owner who followed your chatbot's illegal advice. The NYC Commission on Human Rights levying million-dollar fines against landlords your system told to discriminate. The political fallout when the press discovers your "digital civil servant" is an automated civil rights violator.

The Era of the Beta Government Chatbot Is Over

Here's what I believe, stated plainly: the "thin wrapper" approach to government AI — where you take a commercial LLM, add a system prompt that says "you are a helpful city assistant," and deploy it on a .gov domain — should be treated as professional malpractice.

Not because the technology is bad. GPT-4 is remarkable. But it's remarkable at being a creative text generator. Using it to interpret statutory law without architectural constraints is like using a sports car to plow a field. The machine isn't broken. You're using it wrong.

The technology to build deterministic, citation-grounded government AI exists today. Hierarchical RAG, constrained decoding, multi-agent verification — none of this is theoretical. We've built it. It works. The question is whether government leaders have the will to demand it, or whether they'll keep deploying chatbots that tell landlords to break the law because the demo looked impressive.

Every query to a government AI system is a citizen asking the state: What does the law require of me? That question deserves an answer grounded in the actual text of the actual law — cited, linked, verifiable. Or it deserves an honest "I don't know."

In the high-stakes arena of government services, accuracy is not a feature. It is a constitutional obligation.

The next time a city launches an AI assistant, the first question shouldn't be "How helpful is it?" It should be "Can it cite its sources?" If the answer is no, that system has no business wearing a .gov badge.

Related Research