Legal AI Verification & Governance
Westlaw Precision hallucinated on 33% of complex queries in peer-reviewed testing. Lexis+ AI, 17%. Sanctions have crossed $30,000 per incident. Whether your firm uses Harvey, Lexis Protege, or open-source models, we build the citation verification pipeline, knowledge graph infrastructure, and governance systems that make AI output safe to file.
33%
Westlaw Precision hallucination rate
Stanford/JELS, 2025
$30,000
Sixth Circuit sanctions, March 2026
Bloomberg Law
1,222
Documented AI hallucination court cases
Charlotin Database, 2026
Most firms know about Mata v. Avianca: fabricated case names, $5,000 fine, career-ending embarrassment. That was 2023. The problem has evolved. The sanctions have escalated. And the failure mode that should worry you most is the one your current tools cannot catch.
The AI invents a case that does not exist. Varghese v. China Southern Airlines had a convincing docket number, a plausible court, and detailed internal citations. It was entirely fictional. This is what Shepard's and KeyCite catch: a citation that resolves to nothing in the database.
Purpose-built tools reduce this substantially. Harvey and Lexis Protege ground their output in real databases. But "reduce" is not "eliminate," and the February 2026 New Orleans case proved this: the attorney used both ChatGPT and Westlaw Precision AI, and still submitted 11 fabricated or mischaracterized citations.
The AI cites a real case for a proposition it does not support. The docket number is valid. The case exists. KeyCite returns a green flag. But the AI cited the dissent as if it were the majority holding. Or it cited a case that interprets an old version of a statute that was amended two years ago.
This is what the Stanford study's 33% Westlaw hallucination rate actually captures. Not fake citations, but wrong analysis of real citations. Your citation verification tool says the case exists. It does. It just does not say what the AI claims it says. And a junior associate reviewing the output under time pressure will not catch it, because the citation looks right.
A litigation associate asks Harvey to research defenses to a breach of fiduciary duty claim under Delaware law. The AI returns a thorough analysis citing Stone v. Ritter (2006) for the standard of director oversight liability. The citation is real. The holding summary is accurate for 2006.
What the AI missed: the Delaware Supreme Court's 2019 decision in Marchand v. Barnhill significantly expanded the Caremark duty, and subsequent Chancery Court opinions have further developed the "mission critical" regulatory compliance standard. The AI cited binding authority that is technically "good law" (not overruled) but whose practical application has been substantially narrowed by later developments that a citator flag would not catch. Stone still has a green KeyCite flag. The analysis built on it is still wrong for a 2026 filing.
A verification pipeline catches this by checking not just citator status but subsequent citing references, examining whether later cases have distinguished or narrowed the holding, and flagging opinions where the core proposition has been substantively modified even if the case itself remains "good law."
Every platform has strengths. None of them solve the full verification problem. This table is a reference you can bring to your next technology committee meeting.
| Option | What It Does Well | Citation Accuracy | Gaps |
|---|---|---|---|
| Harvey AI | Research, drafting, agentic workflows. 25,000+ custom agents. Full LexisNexis data vault access. $11B valuation, 50% of AmLaw 100. | Grounded in LexisNexis data. Better than generic LLMs. No published independent hallucination rate. | No independent verification layer. Output verification is the user's responsibility. Agentic workflows produce complex multi-step output that needs systematic QA. |
| Westlaw AI / CoCounsel | Deep Research capability. Agentic document review. Built on KeyCite citator system. CoCounsel workflows launched early 2026. | 33% hallucination rate on Precision. 17% on Ask Practical Law. (Stanford/JELS 2025) | Published accuracy data shows significant failure rate on complex queries. KeyCite catches fabricated citations but not contextual hallucination. |
| Lexis+ with Protege | 300+ pre-built workflows. Four specialized agents. Shepard's Citations (gold standard). Replaced Lexis+ AI in Feb 2026. | 17% hallucination rate. Walked back "100% hallucination-free" claim. (Stanford/JELS 2025) | Shepard's coverage lags on state-level administrative decisions. Agentic multi-step workflows are new and unproven at scale. |
| Open-Source LLMs + RAG | Full control over model, data, and verification logic. No vendor lock-in. Can build custom constraint mechanisms. | 58-82% hallucination without purpose-built verification. Highly variable with custom RAG. | Requires significant engineering investment. No built-in citator. Data access challenge: Harvard CAP provides raw text but not editorial enrichments. |
| Big 4 / Large SIs | Brand credibility. Global scale. Can throw bodies at the problem. Existing relationships with firm leadership. | Implement platforms rather than build verification infrastructure. Rely on vendor accuracy claims. | They deploy Harvey or Lexis and call it done. Engagements run $500K-$2M+ for what is essentially platform configuration. No custom verification pipeline expertise. Legal AI is a small practice within a generalist firm. |
| In-House Build | Full control. Deeply customized to firm's practice areas and workflows. | Depends entirely on team capability and sustained investment. | Requires hiring ML engineers, legal data engineers, and NLP specialists. Most firms cannot recruit this talent competitively. Ongoing maintenance burden is substantial. |
Hallucination rates are from peer-reviewed Stanford HAI/JELS study (2025). Harvey has not published independent accuracy benchmarks. Gaps are structural, not quality judgments. Every option on this table does something valuable.
We do not replace your research platform. We build the verification, governance, and infrastructure layers that make your existing tools safe for high-stakes practice.
An automated QA layer between AI output and human review. Takes research output from Harvey, Lexis, Westlaw, or any source. Runs citation existence checks against citator databases. Flags negative treatment. Validates binding authority for the specific jurisdiction and court level. Scores confidence on contextual accuracy by analyzing subsequent citing references.
We reach for graph-based verification when practice areas have dense citation networks (tax, regulatory, patent prosecution). For lighter-touch verification needs (contract review, compliance memos), we build streamlined pipelines with rule-based checks and LLM cross-validation.
Practice-area-specific knowledge graphs built on Neo4j. Nodes for statutes, cases, regulations, and legal concepts. Edges encoding citation relationships, negative treatment, jurisdictional hierarchy, and temporal validity. We start with open data: Harvard Caselaw Access Project (6.7M cases), eCFR, Federal Register, and public court records.
GraphRAG outperforms vector RAG by 14% in retrieval relevance for legal queries. The advantage is sharpest on multi-hop reasoning: "find the most recent Second Circuit case applying the Twombly plausibility standard" is a deterministic graph traversal, not a fuzzy text search. We build graphs for specific practice areas where the citation density justifies the investment.
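The multi-hop traversal described above can be sketched in a few lines. This is a minimal illustration using in-memory Python dicts in place of Neo4j; the case names, years, and the `APPLIES` index are invented placeholders, and a production system would run the equivalent Cypher traversal against the graph database.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Case:
    name: str
    court: str
    year: int

# APPLIES edges: standard -> cases applying it. All entries are placeholders.
APPLIES = {
    "twombly_plausibility": [
        Case("Placeholder v. One", "2d Cir.", 2019),
        Case("Placeholder v. Two", "9th Cir.", 2023),
        Case("Placeholder v. Three", "2d Cir.", 2024),
    ],
}

def most_recent_applying(standard: str, court: str) -> Optional[Case]:
    """Deterministic multi-hop lookup: standard -> applying cases ->
    filter by circuit -> newest. No fuzzy text search involved."""
    hits = [c for c in APPLIES.get(standard, []) if c.court == court]
    return max(hits, key=lambda c: c.year, default=None)

result = most_recent_applying("twombly_plausibility", "2d Cir.")
```

The point of the sketch: once the relationships are explicit edges, "most recent Second Circuit case applying the standard" is a filter and a max, not a similarity search that hopes the right fragment surfaces.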
Not a policy PDF that sits in a shared drive. An enforceable system that implements ABA Opinion 512 requirements: tool approval workflows by practice area, usage logging that tracks which AI tools were used on which client matters, training tracking with completion verification, and audit trails that satisfy malpractice insurers. When 68% of legal professionals have used unapproved AI tools, you need enforcement, not guidelines.
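Enforcement here means the approval check runs in code, not in a memo. A minimal sketch of the tool-approval and usage-logging core follows; the tool names, practice areas, and matter IDs are illustrative assumptions, not a real firm's configuration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative approval matrix: which tools are authorized per practice area.
APPROVED_TOOLS = {
    "litigation": {"harvey", "lexis_protege"},
    "tax": {"lexis_protege"},
}

@dataclass
class UsageLog:
    entries: list = field(default_factory=list)

    def record(self, tool: str, practice_area: str, matter_id: str) -> bool:
        """Log every use; return False when the tool is unapproved so the
        governance workflow can flag it for review."""
        approved = tool in APPROVED_TOOLS.get(practice_area, set())
        self.entries.append({
            "tool": tool,
            "practice_area": practice_area,
            "matter": matter_id,
            "approved": approved,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return approved

log = UsageLog()
log.record("harvey", "litigation", "M-1001")   # approved: logged quietly
log.record("chatgpt", "tax", "M-2002")         # unapproved: logged and flagged
```

Even shadow-AI use leaves a record this way: the unapproved entry is what survives a malpractice inquiry.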
The system includes standing order compliance: a database of 300+ court-specific AI requirements, automatic flagging when a filing enters a jurisdiction with disclosure rules, and templated disclosure language matching each order's specific requirements. Updates continuously as new orders are issued.
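The flagging logic is a lookup keyed by court. The sketch below uses invented court IDs and rule entries; a real system maintains the 300+ current orders and their disclosure templates.

```python
# Hypothetical standing-order index. Court IDs, rule fields, and template
# text are illustrative assumptions, not actual court requirements.
STANDING_ORDERS = {
    "wdnc": {"ai_drafting_barred": True, "disclosure_required": False},
    "flsd": {"ai_drafting_barred": False, "disclosure_required": True,
             "template": ("This filing used generative AI; all citations "
                          "were independently verified by counsel.")},
}

def compliance_flags(court_id: str) -> dict:
    """Return the AI rules in force for a filing's court, or a no-rules
    marker when none are on record."""
    order = STANDING_ORDERS.get(court_id)
    if order is None:
        return {"has_ai_rules": False}
    return {"has_ai_rules": True, **order}

flags = compliance_flags("flsd")  # disclosure required; template available
```

When a filing enters a jurisdiction with rules, the templated language is attached automatically rather than hunted down per matter.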
Harvey's 25,000+ custom agents and LexisNexis Protege's four-agent architecture can now handle multi-step workflows autonomously. A fund formation agent produces a 40-page analysis. A litigation agent drafts discovery requests across multiple claims. These workflows need systematic verification, not ad hoc spot-checking.
We build monitoring and validation layers for agentic legal AI: output verification checkpoints at each workflow stage, provenance tracking that logs which sources the agent consulted, confidence scoring on each claim and citation, and human-in-the-loop gates at decision points the firm defines. The verification scales with the complexity of the agentic workflow.
This is the step-by-step process we build for firms. It sits between AI-generated output and attorney review, catching errors before they reach a filing.
The pipeline receives AI-generated text (from Harvey, Lexis, Westlaw, or any source) and extracts every legal citation using pattern matching and NLP. This includes standard reporter citations (678 F. Supp. 3d 443), short-form references ("Id. at 445"), and statutory citations (28 U.S.C. § 1332). Each citation is canonicalized to a unique identifier, resolving "the Mata case," "Mata v. Avianca," and "678 F. Supp. 3d 443" to the same entity.
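The extraction step can be sketched with a couple of simplified patterns. Real pipelines use full reporter tables and NLP entity linking (resolving "the Mata case" to its reporter cite is out of scope here); these regexes cover only the two citation shapes shown above and are illustrative, not exhaustive.

```python
import re

# Simplified patterns: numbered federal reporters and U.S.C. sections only.
REPORTER = re.compile(r"\b(\d{1,4}) (F\. Supp\. \d?d?|F\.\d?d|U\.S\.) (\d{1,4})\b")
STATUTE = re.compile(r"\b(\d{1,2}) (U\.S\.C\.) § (\d+[a-z]?)\b")

def extract_citations(text: str) -> list:
    """Pull reporter and statutory citations out of AI-generated text."""
    cites = [" ".join(m.groups()) for m in REPORTER.finditer(text)]
    cites += [f"{m.group(1)} U.S.C. § {m.group(3)}" for m in STATUTE.finditer(text)]
    return cites

def canonical_id(cite: str) -> str:
    """Crude canonical key so duplicate forms resolve to one entity."""
    return cite.lower().replace(".", "").replace(" ", "_")
```

In practice extraction is the easy half; canonicalization across short forms and case-name references is where the NLP work lives.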
Each extracted citation is verified against authoritative databases. For case law: does this case exist in the reporter volume cited? For statutes: is this section number valid and current in the cited code? For regulations: does this CFR section exist in the current edition? Citations that fail existence checks are flagged as fabricated. This is the check that would have caught Mata v. Avianca.
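The existence check itself is a lookup. The index below is a stub dict holding only the Mata reporter cite used as an example above; in production the lookup runs against a citator database or a case-law API, and the fabricated cite shown is a made-up placeholder.

```python
# Stub index of known citations. In production: citator database or API.
KNOWN_CASES = {
    "678 F. Supp. 3d 443": "Mata v. Avianca, Inc.",
}

def check_existence(citations: list) -> dict:
    """Map each citation to 'verified' or 'FABRICATED'. A citation that
    resolves to nothing in the database is exactly the Mata failure mode."""
    return {c: ("verified" if c in KNOWN_CASES else "FABRICATED")
            for c in citations}

report = check_existence(["678 F. Supp. 3d 443", "999 F.3d 1"])  # second is a placeholder fake
```

Fabricated citations fail loudly at this stage, before any attorney time is spent.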
Valid citations are checked for negative treatment. Has the case been overruled, reversed, vacated, or distinguished? Is the statute still in force, or has it been amended or repealed? The pipeline goes beyond citator flags: it analyzes subsequent citing references to detect cases where the core proposition has been narrowed even though the case retains a positive citator status. This is the check that catches the Stone v. Ritter problem described above.
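A sketch of the treatment logic, combining the citator flag with a crude signal from subsequent citing references. The 50% threshold and the treatment labels are illustrative assumptions; a real system weighs the citing court's level and recency as well.

```python
from dataclasses import dataclass

@dataclass
class CitingRef:
    treatment: str  # e.g. "follows", "distinguishes", "narrows", "overrules"

def treatment_status(citator_flag: str, citing_refs: list) -> str:
    """Hard negative treatment fails outright; a green citator flag can
    still earn 'caution' when later cases mostly narrow the holding."""
    if citator_flag in {"overruled", "reversed", "vacated"}:
        return "failed"
    negative = sum(r.treatment in {"narrows", "distinguishes"} for r in citing_refs)
    if citing_refs and negative / len(citing_refs) >= 0.5:
        return "caution"  # the Stone v. Ritter pattern: good law, narrowed use
    return "verified"
```

This is the layer a citator flag alone cannot provide: the flag answers "is it good law," the citing-reference analysis answers "is it still the law you think it is."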
The hardest check. The pipeline compares the proposition the AI attributes to the cited case against the actual holding. If the AI writes "the court held that directors have no oversight duty absent red flags," and the cited case actually held the opposite, this is flagged as a contextual hallucination. This uses a second, independent LLM call with the actual case text and the AI's characterization, cross-validated against the knowledge graph's encoded holdings.
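The control flow of the cross-validation can be sketched as follows. The verifier call is stubbed with a deterministic keyword check so the logic is visible and testable; in production `ask_verifier` would be an independent LLM call with the full case text, and its interface here is an assumption, not a real API.

```python
def ask_verifier(source_text: str, claimed_proposition: str) -> bool:
    """Stub for the second, independent model call. Here: 'supported'
    only if the claim's substantive terms appear in the source text."""
    return all(term in source_text.lower()
               for term in claimed_proposition.lower().split()
               if len(term) > 4)

def contextual_check(case_text: str, claim: str, graph_holding: str) -> str:
    """Cross-validate the AI's characterization against both the actual
    case text and the knowledge graph's encoded holding."""
    llm_agrees = ask_verifier(case_text, claim)
    graph_agrees = ask_verifier(graph_holding, claim)
    if llm_agrees and graph_agrees:
        return "verified"
    return "contextual_hallucination" if not llm_agrees else "caution"
```

The two-source design is deliberate: a claim must survive both the raw text and the graph's independently encoded holding before it passes.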
Is the cited case binding or persuasive in the jurisdiction where the filing is being made? A Ninth Circuit opinion cited in a Second Circuit brief is persuasive only. A state trial court opinion has no precedential value. The pipeline validates that binding authorities are correctly identified and flags persuasive-only citations that are presented as controlling law.
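The binding-versus-persuasive check reduces to an explicit hierarchy map. The sketch below covers only two federal circuits for illustration; the full system encodes the complete federal and state hierarchy in the knowledge graph.

```python
# Simplified hierarchy: which courts bind each forum. Illustrative only.
BINDING_IN = {
    "2d Cir.": {"U.S. Supreme Court", "2d Cir."},
    "9th Cir.": {"U.S. Supreme Court", "9th Cir."},
}

def authority_weight(citing_forum: str, cited_court: str) -> str:
    """Classify a cited case as binding or persuasive for the forum
    where the brief is being filed."""
    binding = BINDING_IN.get(citing_forum, {"U.S. Supreme Court"})
    return "binding" if cited_court in binding else "persuasive"

# A Ninth Circuit opinion cited in a Second Circuit brief is persuasive only:
weight = authority_weight("2d Cir.", "9th Cir.")
```

Citations presented as controlling law but classified as persuasive-only are flagged for the reviewing attorney.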
The output is a structured report alongside the AI-generated work product. Each citation receives a status: verified, caution (valid but narrowed/distinguished), or failed (fabricated, overruled, or contextually inaccurate). The reviewing attorney sees exactly which citations need manual attention, reducing the review burden from "check everything" to "check the flagged items." The report becomes part of the matter file for audit trail purposes.
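The report structure can be sketched with the three status tiers described above. Field names are illustrative; the useful property is that the attorney's queue is derived mechanically from the findings.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    VERIFIED = "verified"
    CAUTION = "caution"   # valid but narrowed or distinguished
    FAILED = "failed"     # fabricated, overruled, or contextually inaccurate

@dataclass
class CitationFinding:
    citation: str
    status: Status
    reason: str = ""

def needs_attention(findings: list) -> list:
    """The reviewing attorney checks only flagged items, not everything."""
    return [f for f in findings if f.status is not Status.VERIFIED]
```

Serialized alongside the work product, the full findings list (verified items included) becomes the audit-trail artifact in the matter file.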
Every engagement starts with understanding your firm's specific risk profile, practice areas, and existing technology stack. We build for your workflow, not a generic one.
Phase 1: Weeks 1-3
Phase 2: Weeks 4-10
Phase 3: Weeks 11-16
Answer these questions to understand your firm's current risk exposure and verification maturity. The results give you a framework for prioritizing AI governance investments, whether you work with us or not.
A peer-reviewed Stanford study published in the Journal of Empirical Legal Studies in 2025 tested both platforms systematically. Westlaw Precision hallucinated 33% of the time, with only 42% of responses fully accurate. Lexis+ AI (now Lexis+ with Protege) hallucinated 17% of the time, with just 20% of responses fully accurate. These numbers apply to complex multi-hop queries, the kind associates handle daily in litigation and regulatory work. Simpler lookups perform better.
The critical nuance: LexisNexis quietly walked back its "100% hallucination-free" marketing language after the study, clarifying that the promise applied only to linked legal citations, not to the reasoning around them. Contextual hallucination, citing a real case for a proposition it does not support, is not captured by citation-link accuracy metrics. A verification pipeline needs to check both: does the case exist, and does it say what the AI claims it says.
Over 300 federal and state judges have adopted standing orders or local rules governing AI use in filings, and they vary significantly. Some require only disclosure that AI was used and which tools. Others require certification that every citation has been independently verified. The Western District of North Carolina effectively bars generative AI for drafting entirely, permitting only standard research platforms. Florida enacted a new AI disclosure mandate in February 2026. A federal court has ruled that AI-generated documents are not protected by attorney-client privilege.
The compliance challenge is not reading one order. It is tracking 300+ orders across every jurisdiction where your firm files, keeping them updated as judges revise requirements, and generating the correct disclosure language for each filing. We build automated standing order compliance systems: a database of current requirements mapped by court, automatic flagging when a new filing enters a jurisdiction with AI rules, and templated disclosure language that matches each order's specific requirements. The system updates as new orders are issued.
Harvey is excellent at what it does. At $11B valuation and 50% AmLaw 100 adoption, it is the leading legal AI platform for research, drafting, and workflow automation. With 25,000+ custom agents operating on the platform, it is becoming infrastructure. But Harvey is a generative platform, not a verification system. It produces legal analysis. It does not independently verify that analysis against a second source.
A citation verification pipeline is a separate concern. Think of it as quality assurance for AI output, the same way a firm has document review processes that exist independently of the drafting tools. We build verification layers that take Harvey's output (or Lexis Protege, or Westlaw, or any source) and run automated checks: citation existence against KeyCite/Shepard's, negative treatment flagging, binding authority validation for the specific jurisdiction, and confidence scoring.
This matters particularly with Harvey's agentic workflows, where long-horizon agents handle multi-step processes like fund formation. An autonomous agent producing a 40-page analysis needs systematic verification, not ad hoc spot-checking.
ABA Formal Opinion 512, issued July 2024, is the first comprehensive ethics guidance on generative AI in legal practice. It addresses six obligations: competence, confidentiality, communication, candor toward the tribunal, supervisory responsibilities, and fees.
The practical requirements are specific. Competence means lawyers must understand AI capacity and limitations, and update that understanding periodically, not just attend one CLE. Confidentiality means assessing data exposure before entering client information into any AI tool, which most firms have not done systematically for Harvey, Lexis, or internal tools. Supervision means managerial lawyers must establish firm-wide AI policies and ensure training, not just for lawyers but for all staff who touch AI tools. On fees, lawyers cannot charge clients for time spent learning tools they will regularly use.
Compliance is not a policy document. It requires an enforceable system: tool approval workflows that log which tools are authorized for which practice areas, usage monitoring that flags when unapproved tools are used on client matters (68% of legal professionals have used unapproved AI tools at least once), training tracking with completion verification, and documentation that survives a malpractice inquiry.
Standard vector RAG works by semantic similarity. It finds text that looks like your query. A legal knowledge graph works by structural relationships. It knows that Case A interprets Statute B, that Case C overruled Case A, and that Case D from the Second Circuit is binding while Case E from the Ninth Circuit is persuasive only in the Second Circuit.
The difference matters for three specific failure modes. First, negative treatment: vector RAG cannot distinguish between citing a case and overruling it. A thoroughly discussed overruled case scores high on semantic similarity. A knowledge graph has an explicit OVERRULES edge that blocks retrieval of that case as binding authority. Second, multi-hop reasoning: a question like "find the most recent Second Circuit case applying the Twombly plausibility standard" requires traversing statute to interpretation to circuit to date. Vector RAG retrieves fragments and hopes the LLM connects them. A graph traverses the path deterministically. Third, jurisdictional hierarchy: vector search treats a state trial court opinion the same as a Supreme Court ruling if the text is similar. A knowledge graph encodes court hierarchy and returns binding authority first.
Benchmarks show GraphRAG outperforms vector RAG by 14% in retrieval relevance for legal queries. We build practice-area-specific knowledge graphs on Neo4j, starting with regulatory compliance and tax where citation networks are densest.
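The OVERRULES-edge mechanism in the first failure mode above can be sketched directly. Case names, the edge set, and the court ranks are invented for illustration; the production version is a graph query, not in-memory dicts.

```python
# Explicit negative-treatment edges: (overruling case, overruled case).
OVERRULES = {("Case C", "Case A")}
COURT_RANK = {"U.S. Supreme Court": 0, "2d Cir.": 1, "S.D.N.Y.": 2}

def overruled(case: str) -> bool:
    return any(target == case for _, target in OVERRULES)

def retrieve(candidates: list) -> list:
    """candidates: (case_name, court) pairs from retrieval. Drop overruled
    cases regardless of semantic similarity, then order binding-first."""
    live = [(name, court) for name, court in candidates if not overruled(name)]
    live.sort(key=lambda nc: COURT_RANK.get(nc[1], 99))
    return [name for name, _ in live]
```

A vector store has no equivalent of that first filter: the overruled case's text still scores high on similarity, so only an explicit edge can block it as binding authority.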
Malpractice insurers are actively incorporating AI usage into underwriting decisions in 2026. The risk exposure is specific and documented. If firm lawyers allow AI to make critical legal judgments without attorney oversight, insurers may classify this as unauthorized practice of law, which is typically excluded from coverage. The logic: no attorney oversight means no professional services were rendered by an attorney, which means the malpractice policy does not apply.
This creates a coverage gap where the firm is most exposed. Shadow AI compounds the problem. When 68% of legal professionals have used unapproved tools, the firm has undocumented AI usage on client matters with no audit trail. If a hallucinated citation leads to sanctions or adverse outcomes, the insurer asks: what was your AI governance policy, and can you prove it was followed?
An AI governance system provides the documentation trail: which tools were approved, who was trained, what verification steps were taken on each matter. This is not about avoiding AI. It is about creating the evidentiary record that keeps your coverage intact when something goes wrong.
Our detailed analysis of citation-enforced architectures for legal AI, including GraphRAG technical design, knowledge graph schemas, and implementation blueprints.
The $5,000 Hallucination and the End of the Wrapper Era: Citation-Enforced GraphRAG for Enterprise Legal AI
Technical deep-dive into graph-constrained decoding, legal knowledge graph schema design, and the architecture of citation verification systems.
The Sixth Circuit levied $30,000 in sanctions in March 2026. Some cases have exceeded $100,000 in combined sanctions and attorney fees.
A citation verification pipeline for your highest-risk practice area takes weeks to build and costs a fraction of one sanctions event. The governance system that protects your malpractice coverage takes even less. The question is not whether you can afford to build this. It is whether you can afford not to.