[Figure: A legal brief with citation text fragmenting and dissolving where fake cases appear — the collision of legal citation authority with AI-generated fabrication.]
Artificial Intelligence · Law · Technology

The AI That Invented a Court Case — And the Architecture We Built to Make That Impossible

Ashutosh Singhal · January 24, 2026 · 15 min read

I remember the exact moment I stopped trusting the way most people build legal AI.

It was late on a Tuesday, and I was reading the court transcript from Mata v. Avianca. Not a summary. Not a tweet thread. The actual filing. A lawyer had submitted a brief citing Varghese v. China Southern Airlines, Shaboon v. Egyptair, and Petersen v. Iran Air — complete with docket numbers, dates, and quoted holdings. Convincing enough that opposing counsel had to go looking for them. The cases didn't exist. ChatGPT had invented them. And when the lawyer went back to ChatGPT to double-check, the model cheerfully confirmed its own fabrications: "Yes, those cases indeed exist and can be found in reputable legal databases."

I set the transcript down and thought: this is not a prompting problem. This is an architecture problem. And most of the legal AI industry is pretending otherwise.

That incident — which resulted in a $5,000 fine, a judicial reprimand, and a reputational crater — became the founding case study for what my team at Veriprajna now builds: Citation-Enforced GraphRAG systems for legal AI. Systems where the AI physically cannot output a case citation that doesn't correspond to a verified entry in a Knowledge Graph. Not "probably won't." Cannot.

I want to explain why that distinction matters, what it took to build, and why I believe the era of slapping a chatbot interface on a foundation model and calling it "legal AI" is over.

Why Did ChatGPT Invent a Court Case?

This is the question everyone asks, and almost no one answers correctly.

The common explanation is "hallucination" — a word that has become so overused it's lost its diagnostic value. What actually happened in Mata v. Avianca is more specific and more damning. The model was asked to find precedents about airline liability for passenger injuries. It didn't search a database. It doesn't have one. It predicted the next statistically likely sequence of words.

"Varghese" is a plausible plaintiff name. "China Southern Airlines" is a plausible defendant. A docket number like "2017 WL 3245891" follows the syntactic pattern of real citations. The model assembled these fragments the same way it assembles a poem or a marketing email — by minimizing something called perplexity, which is essentially a measure of how "surprised" the model is by its own output. Low surprise equals fluent text. Fluent text is not the same as true text.

The model is trained to minimize perplexity — how surprised it is by the next word. It is not trained to optimize for provenance — whether that word traces back to something real.

This is the core tension. LLMs optimize for coherence. Law requires provenance. These are fundamentally different objectives, and no amount of prompt engineering bridges the gap. You can tell GPT-4 "You are a careful lawyer, only cite real cases." It will nod and comply — right up until its training data doesn't contain the case you need, at which point it will invent one that sounds right, because sounding right is literally what it's optimized to do.
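To make the coherence-versus-provenance point concrete, here is a toy illustration of perplexity. The probabilities are invented for the example; the point is only that a fluent fabrication can score *better* than a truthful hedge.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability the model
    assigned to each token it emitted. Lower = more 'fluent'."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# A fluent fabrication: every token is highly probable given the last.
fabricated = [0.9, 0.8, 0.85, 0.9]   # e.g. "Varghese v. China Southern..."
# A truthful but awkward answer like "No precedent found" may score worse.
truthful = [0.3, 0.4, 0.5, 0.3]

print(perplexity(fabricated))  # low — the model is "unsurprised" by its lie
print(perplexity(truthful))    # higher, despite being true
```

Nothing in this objective rewards the fabricated sequence for being false or the truthful one for being true — which is exactly the gap the rest of this piece is about.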

Stanford researchers tested this rigorously. General-purpose chatbots, even those with internet access or basic retrieval capabilities, hallucinated between 58% and 82% of the time on complex legal queries. Not edge cases. Routine legal research questions.

The Wrapper Trap

After Mata, I started cataloging the legal AI tools on the market. Most of them were what the industry politely calls "wrappers" — thin user interfaces layered over OpenAI's or Anthropic's API. A system prompt saying "You are a helpful legal assistant." Maybe a PDF upload feature. Maybe a nicer font.

I had a call with a potential client — general counsel at a mid-size firm — who told me they'd been evaluating one of these tools. "It's fast," she said. "But last week it cited a dissenting opinion as if it were the majority holding. My associate almost filed it." She paused. "The scary part is, the case was real. The holding was just... wrong."

That's the thing about legal hallucinations that keeps me up at night. Mata was dramatic because the cases were entirely fabricated. But the subtler errors — real case, wrong holding; valid statute, since repealed; binding precedent from the wrong jurisdiction — are harder to catch and arguably more dangerous. A fake case gets flagged at the first verification step. A real case cited for a proposition it doesn't support? That can survive multiple rounds of review.

The wrapper approach can't solve this because it doesn't own the data layer. It doesn't know which cases exist. It doesn't know which ones have been overruled. It doesn't understand that a Second Circuit decision doesn't bind a Ninth Circuit court. It's a fancy text box connected to a probability engine.

And the economics are brutal. Analysis of the wrapper market shows that while some reach revenue quickly, the vast majority fail because they lack any defensible technology. As foundation models improve, every feature that made the wrapper useful — summarization, drafting, Q&A — gets absorbed into the base model. You're building on rented land, and the landlord is OpenAI.

What Happens When You Give AI a Map of the Law?

[Figure: Side-by-side comparison of Vector RAG retrieving isolated text chunks by similarity versus GraphRAG traversing explicit legal relationships (cites, overrules, interprets) to find structurally connected authority.]

Here's where my team's obsession begins.

The standard fix for hallucination is Retrieval-Augmented Generation — RAG. Instead of relying on the model's memory, you retrieve relevant documents from a database and feed them as context. It's a real improvement. But for law, it's not enough, and I want to explain why with a specific example that drove us crazy for weeks.

We were testing a standard vector RAG pipeline on a question about whether a specific 1990 environmental regulation was still enforceable after a 2023 Supreme Court decision. Vector RAG did what it does: it found text chunks that were semantically similar to the query. It returned the regulation. It returned the Supreme Court opinion. It returned a law review article discussing both.

The LLM stitched them together into a confident, well-written answer that was completely wrong. It treated the law review article — a persuasive but non-binding academic commentary — as if it carried the same weight as the Supreme Court holding. Worse, it missed that the regulation had been effectively invalidated, because the chain of authority connecting the regulation to the invalidating decision ran through an intermediate appellate case that the vector search hadn't retrieved. The connection wasn't semantic. It was structural.

I remember my lead engineer, halfway through debugging this, turning to me and saying: "The problem isn't retrieval. The problem is that vectors don't understand relationships."

She was right. And that's the insight behind GraphRAG — Graph-based Retrieval-Augmented Generation.

Instead of storing legal documents as isolated points in vector space, we map them into a Knowledge Graph: a network where every statute, case, regulation, and legal doctrine is a node, and the relationships between them — cites, overrules, distinguishes, interprets, affirms — are explicit, labeled edges. I wrote about the full architecture in the interactive version of our research.

Vector RAG asks: "Find text that looks like this query." GraphRAG asks: "Find the statute, traverse the 'interprets' edge to find case law, then traverse the 'overrules' edge to make sure it's still valid."

That's not a subtle difference. That's the difference between searching a library by vibes and searching it by the card catalog, the citation index, and the Shepard's report simultaneously.
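The traversal pattern above can be sketched in a few lines. The graph, node names, and edge labels here are hypothetical stand-ins for the production Knowledge Graph; the point is that validity is a structural check, not a similarity score.

```python
# Minimal in-memory sketch of the GraphRAG traversal. Edges are
# (source, relationship, target) triples; all names are illustrative.
EDGES = [
    ("Case A", "INTERPRETS", "Statute B"),
    ("Case D", "INTERPRETS", "Statute B"),
    ("Case C", "OVERRULES",  "Case D"),   # negative treatment: blocking edge
]

def neighbors(node, rel, reverse=False):
    """All nodes connected to `node` by a `rel` edge, in either direction."""
    if reverse:
        return [s for s, r, t in EDGES if r == rel and t == node]
    return [t for s, r, t in EDGES if r == rel and s == node]

def good_law_interpreting(statute):
    """Find cases that interpret `statute`, then drop any that sit on the
    receiving end of an OVERRULES edge — the 'red flag' check."""
    cases = neighbors(statute, "INTERPRETS", reverse=True)
    return [c for c in cases if not neighbors(c, "OVERRULES", reverse=True)]

print(good_law_interpreting("Statute B"))  # Case D is overruled: ['Case A']
```

A vector search has no way to express the second step; "was this case overruled?" is not a question about textual similarity at all.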

How Do You Stop an AI From Inventing a Citation?

[Figure: The KG-Trie constrained decoding process, step by step — the LLM generates a partial citation, the Trie checks valid continuations against the Knowledge Graph, and invalid token paths are blocked (probability set to negative infinity).]

This is the part that took us the longest to get right, and it's the part I'm most proud of.

Having a Knowledge Graph is necessary but not sufficient. The graph gives you structure. But the LLM is still generating text token by token, and at any point it could veer off the graph and start inventing. We needed a mechanism that doesn't just encourage the model to cite real cases — it physically prevents it from citing fake ones.

We call this Graph-Constrained Decoding, and the core mechanism is something called a KG-Trie.

Here's how it works in plain English. We take every valid entity in our Knowledge Graph — every case name, every reporter citation, every docket number — and we build a prefix tree (a Trie) from those identifiers. When the LLM is generating text and reaches a point where it's about to output a citation, the constraint mechanism activates. It checks: what are the valid next tokens according to the Trie?

If the model has generated "Mata v. A" — the Trie allows tokens that complete valid case names starting with that string. "Avianca" is valid. Everything else gets its probability set to negative infinity. Blocked.

If the model tries to generate "Varghese v. Chi" — the Trie finds no valid continuation. The generation is stopped. The model is forced to backtrack and either find a real citation or output something like "No precedent found."

The AI cannot dream up a case because it physically cannot output the token sequence for a case that isn't in the verified database.

This is a structural guarantee, not a probabilistic one. We're not saying "the model is 95% less likely to hallucinate." We're saying the fabrication pathway is closed. The token sequence for a fake citation literally cannot be produced.
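A toy version of the KG-Trie makes the mechanism tangible. This sketch uses word-level tokens for readability (a real decoder works over subword tokens), and the citation list is a hypothetical stand-in for the verified graph.

```python
# Toy KG-Trie: only citations present in the verified set can be
# completed token by token; everything else dead-ends.
VALID_CITATIONS = [
    ["Mata", "v.", "Avianca,", "Inc."],
    ["Mata", "v.", "Avianca"],  # allow the short form too
]

def build_trie(sequences):
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
        node["<END>"] = {}  # marks a complete, valid citation
    return root

def allowed_next(trie, prefix):
    """Tokens the decoder may emit after `prefix`. An empty set is a dead
    end: the decoder must backtrack or emit 'No precedent found'."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return set()
    return set(node) - {"<END>"}

trie = build_trie(VALID_CITATIONS)
print(allowed_next(trie, ["Mata", "v."]))      # {'Avianca,', 'Avianca'}
print(allowed_next(trie, ["Varghese", "v."]))  # set() — fabrication blocked
```

Every token outside the returned set gets its probability forced to zero before sampling, which is what "physically cannot output" means in practice.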

Now, I want to be precise about what this does and doesn't do. It prevents fabrication — inventing a case that doesn't exist. It does not prevent misinterpretation — citing a real case but drawing the wrong conclusion from it. That's a reasoning error, and it still requires human review. But eliminating fabrication is enormous. It takes the most catastrophic failure mode — the Mata scenario — completely off the table.

There was a night, early in development, when we ran our first end-to-end test. We prompted the system with the exact query that had produced the fake citations in Mata. The constrained system tried to generate "Varghese," hit the Trie wall, backtracked, and returned a real case with a valid citation chain. My engineer sent a screenshot to our group chat at 1:47 AM. Nobody replied with words. Just a row of fire emojis.

Why Can't Wrappers Do This?

People ask me this constantly, and the answer is architectural, not commercial.

Graph-Constrained Decoding requires manipulating the model's token probabilities — its logits — in real time during generation. You need access to the inference engine at the decoding level. Standard commercial APIs like GPT-4 don't expose this. You can send a prompt and get a response. You cannot intercept the generation process mid-token and inject constraints.

This is why we build on open-weights models — Llama, Mistral — or deploy through enterprise endpoints that allow custom decoding loops. We host the model. We control the inference pipeline. We inject the KG-Trie constraints directly into the probability distribution of every token as it's generated.

A wrapper, by definition, cannot do this. It's calling someone else's API. It's a passenger, not the pilot.
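What "injecting constraints into the probability distribution" looks like can be shown in pure Python. This is a simplified stand-in for a custom logits processor in a self-hosted decoding loop (the vocabulary and logit values are invented for the example):

```python
import math

def apply_citation_constraint(logits, vocab, allowed_tokens):
    """Set the logit of every disallowed token to -inf, so its softmax
    probability becomes exactly zero — the blocked pathway."""
    return [
        logit if tok in allowed_tokens else -math.inf
        for tok, logit in zip(vocab, logits)
    ]

vocab  = ["Avianca", "Airlines", "Varghese", "Egyptair"]
logits = [1.2, 0.8, 2.5, 0.4]   # the model *prefers* the fabrication
allowed = {"Avianca"}           # but the KG-Trie only permits this one

masked = apply_citation_constraint(logits, vocab, allowed)
best = vocab[masked.index(max(masked))]
print(best)  # 'Avianca' — the higher-logit fabrication is unreachable
```

This step has to run inside the decoding loop, between the model's forward pass and token selection — which is precisely the access a hosted chat API does not give you.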

The Hardest Part Nobody Talks About

Building the constraint mechanism was intellectually satisfying. Building the Knowledge Graph underneath it was a slog.

Legal text is messy in ways that would make a data engineer weep. A single case might be referenced as "Mata v. Avianca," "Mata," "678 F. Supp. 3d 443," "the Avianca case," or simply "Id." — a two-letter abbreviation meaning "the case I just mentioned." All of those must resolve to a single canonical node in the graph. Miss one, and you've got a gap in the citation network.

We spent months building Entity Resolution pipelines that handle deduplication ("Smith v. Jones, 123 F.3d 456" and "Smith, 123 F.3d at 456" are the same case), disambiguation ("Smith v. Jones (1995)" versus "Smith v. Jones (2002)" — different cases, same name), and the particular hell of resolving "Id." references using sliding-window context parsing.

And then there's negative treatment — the "red flag" system. A legal Knowledge Graph that treats overruled cases as valid authority is worse than useless. We ingest citator signals — language like "overruled," "abrogated," "superseded" — and encode them as blocking edges in the graph. When the system traverses a path and hits an OVERRULES edge, that path is invalidated for binding authority. If someone asks about Roe v. Wade on reproductive rights, the graph immediately surfaces the OVERRULES edge from Dobbs v. Jackson. A vector search might still enthusiastically cite Roe because the sheer volume of historical text supporting it dominates the similarity scores.

For the full technical breakdown of the graph schema, entity resolution pipeline, and constraint architecture, see our research paper.

What Does This Actually Mean for a Law Firm?

I had a conversation with a managing partner who put it bluntly: "I don't care about Knowledge Graphs. I care about whether my associates are going to embarrass me in front of a judge."

Fair. So let me translate.

The cost of Mata v. Avianca wasn't $5,000. It was the public humiliation, the client notification requirement, the malpractice exposure, and the signal to every prospective client that this firm doesn't verify its work. For a large firm, one hallucinated filing is an existential reputational event.

Citation-Enforced GraphRAG functions as an insurance policy against fabrication. The wrapper approach offers low upfront cost and unlimited liability. Our approach requires real investment in the data layer and the constraint architecture, but it reduces the risk of citation fabrication to zero.

There's also an efficiency argument that's less obvious. Right now, if a firm uses AI for research, an associate has to verify every single citation. That verification step often takes longer than the research itself, which defeats the purpose. GraphRAG benchmarks show 30-35% improvement over standard RAG on multi-hop reasoning tasks — the kind of complex, connect-the-dots research that actually matters in litigation. More importantly, because the citations are structurally guaranteed to be valid, the human role shifts from "fact checker" to "strategy reviewer." You're not spending three hours confirming that cases exist. You're spending that time on whether the argument is persuasive.

When every citation is structurally verified, the lawyer's job shifts from fact-checking the AI to thinking about strategy. That's where the real leverage is.

And there's a transparency dimension that matters for compliance. A wrapper can't explain why it chose a case. A GraphRAG system can show the exact traversal path: "I selected Case A because it interprets Statute B and was affirmed by Court C, which is binding in your jurisdiction." That audit trail isn't just nice to have — it's becoming a regulatory expectation.

Where Does This Go Next?

The industry is moving from chatbots to agents — AI systems that don't just answer questions but plan and execute multi-step tasks. A legal agent asked to draft a motion to dismiss needs to research the applicable standard, find supporting case law, verify that the cases are good law, check procedural requirements, and assemble the argument.

An agent running on vector search has no map. It has a pile of documents and a good guess. An agent running on a Knowledge Graph has an explicit structure it can traverse: statute → interpreting cases → procedural rules → jurisdiction-specific requirements. The graph is the agent's planning layer.

This is why I believe the investment in graph infrastructure now pays compound returns later. Wrappers leave behind chat logs. Knowledge Graphs leave behind a structured, growing, increasingly valuable map of legal authority that gets more useful with every case added, every relationship encoded, every negative treatment signal ingested.

The Honest Objection

People push back on two fronts, and I want to address both directly.

First: "Isn't this just Westlaw with extra steps?" No. Westlaw is a search engine for humans. It returns documents that a lawyer reads and interprets. What we build is a constraint architecture for AI — a system that governs what the AI can and cannot say. Westlaw helps lawyers find law. GraphRAG prevents AI from inventing it. They're complementary, not competitive.

Second: "Can't you just fine-tune the model to stop hallucinating?" We tried. Early in our work, we experimented with fine-tuning on verified legal datasets. It reduced hallucination rates. It did not eliminate them. A fine-tuned model is still a probability engine. It's a better probability engine, but "better" in legal citation means "wrong less often," and "wrong less often" is not a standard any court will accept. The only way to guarantee zero fabrication is to make fabrication structurally impossible, which means constraining the output space, not just improving the input data.

The End of "Good Enough"

Here's what I keep coming back to. The legal profession is built on a simple premise: when you cite authority, that authority must be real. Not probably real. Not usually real. Real.

In the two years since Mata, courts have been ratcheting up sanctions, issuing standing orders about AI disclosure, and making it clear that "the AI did it" is not a defense. The profession is drawing a line: if you use AI, its output must be verified. And if verifying the output takes longer than doing the work manually, the AI isn't a tool — it's a liability.

The wrapper era solved the wrong problem. It made legal research faster. It needed to make legal research trustworthy. Speed without trust is just efficient malpractice.

What we build at Veriprajna is not a chatbot that happens to know some law. It's a constrained reasoning system where every citation is a verified traversal through a Knowledge Graph, every relationship is explicit and auditable, and the generative model is physically prevented from crossing into fiction.

The profession that invented the concept of binding precedent deserves AI that actually respects it.
