Legal AI Verification & Governance
Westlaw Precision hallucinated on 33% of complex queries in peer-reviewed testing. Lexis+ AI, 17%. Sanctions have crossed $30,000 per incident. Whether your firm uses Harvey, Lexis Protege, or open-source models, we build the citation verification pipeline, knowledge graph infrastructure, and governance systems that make AI output safe to file.
33%
Westlaw Precision hallucination rate
Stanford/JELS, 2025
$30,000
Sixth Circuit sanctions, March 2026
Bloomberg Law
1,222
Documented AI hallucination court cases
Charlotin Database, 2026
Most firms know about Mata v. Avianca: fabricated case names, $5,000 fine, career-ending embarrassment. That was 2023. The problem has evolved. The sanctions have escalated. And the failure mode that should worry you most is the one your current tools cannot catch.
The AI invents a case that does not exist. Varghese v. China Southern Airlines had a convincing docket number, a plausible court, and detailed internal citations. It was entirely fictional. This is what Shepard's and KeyCite catch: a citation that resolves to nothing in the database.
Purpose-built tools reduce this substantially. Harvey and Lexis Protege ground their output in real databases. But "reduce" is not "eliminate," and the February 2026 New Orleans case proved this: the attorney used both ChatGPT and Westlaw Precision AI, and still submitted 11 fabricated or mischaracterized citations.
The AI cites a real case for a proposition it does not support. The docket number is valid. The case exists. KeyCite returns a green flag. But the AI cited the dissent as if it were the majority holding. Or it cited a case that interprets an old version of a statute that was amended two years ago.
This is what the Stanford study's 33% Westlaw hallucination rate actually captures. Not fake citations, but wrong analysis of real citations. Your citation verification tool says the case exists. It does. It just does not say what the AI claims it says. And a junior associate reviewing the output under time pressure will not catch it, because the citation looks right.
A litigation associate asks Harvey to research defenses to a breach of fiduciary duty claim under Delaware law. The AI returns a thorough analysis citing Stone v. Ritter (2006) for the standard of director oversight liability. The citation is real. The holding summary is accurate for 2006.
What the AI missed: the Delaware Supreme Court's 2019 decision in Marchand v. Barnhill significantly expanded the Caremark duty, and subsequent Chancery Court opinions have further developed the "mission critical" regulatory compliance standard. The AI cited binding authority that is technically "good law" (not overruled) but whose practical application has been substantially narrowed by later developments that a citator flag would not catch. Stone still has a green KeyCite flag. The analysis built on it is still wrong for a 2026 filing.
A verification pipeline catches this by checking not just citator status but subsequent citing references, examining whether later cases have distinguished or narrowed the holding, and flagging opinions where the core proposition has been substantively modified even if the case itself remains "good law."
Every platform has strengths. None of them solve the full verification problem. This table is a reference you can bring to your next technology committee meeting.
| Option | What It Does Well | Citation Accuracy | Gaps |
|---|---|---|---|
| Harvey AI | Research, drafting, agentic workflows. 25,000+ custom agents. Full LexisNexis data vault access. $11B valuation, 50% of AmLaw 100. | Grounded in LexisNexis data. Better than generic LLMs. No published independent hallucination rate. | No independent verification layer. Output verification is the user's responsibility. Agentic workflows produce complex multi-step output that needs systematic QA. |
| Westlaw AI / CoCounsel | Deep Research capability. Agentic document review. Built on KeyCite citator system. CoCounsel workflows launched early 2026. | 33% hallucination rate on Precision. 17% on Ask Practical Law. (Stanford/JELS 2025) | Published accuracy data shows significant failure rate on complex queries. KeyCite catches fabricated citations but not contextual hallucination. |
| Lexis+ with Protege | 300+ pre-built workflows. Four specialized agents. Shepard's Citations (gold standard). Replaced Lexis+ AI in Feb 2026. | 17% hallucination rate. Walked back "100% hallucination-free" claim. (Stanford/JELS 2025) | Shepard's coverage lags on state-level administrative decisions. Agentic multi-step workflows are new and unproven at scale. |
| Open-Source LLMs + RAG | Full control over model, data, and verification logic. No vendor lock-in. Can build custom constraint mechanisms. | 58-82% hallucination without purpose-built verification. Highly variable with custom RAG. | Requires significant engineering investment. No built-in citator. Data access challenge: Harvard CAP provides raw text but not editorial enrichments. |
| Big 4 / Large SIs | Brand credibility. Global scale. Can throw bodies at the problem. Existing relationships with firm leadership. | Implement platforms rather than build verification infrastructure. Rely on vendor accuracy claims. | They deploy Harvey or Lexis and call it done. Engagements run $500K-$2M+ for what is essentially platform configuration. No custom verification pipeline expertise. Legal AI is a small practice within a generalist firm. |
| In-House Build | Full control. Deeply customized to firm's practice areas and workflows. | Depends entirely on team capability and sustained investment. | Requires hiring ML engineers, legal data engineers, and NLP specialists. Most firms cannot recruit this talent competitively. Ongoing maintenance burden is substantial. |
Hallucination rates are from peer-reviewed Stanford HAI/JELS study (2025). Harvey has not published independent accuracy benchmarks. Gaps are structural, not quality judgments. Every option on this table does something valuable.
We do not replace your research platform. We build the verification, governance, and infrastructure layers that make your existing tools safe for high-stakes practice.
An automated QA layer between AI output and human review. Takes research output from Harvey, Lexis, Westlaw, or any source. Runs citation existence checks against citator databases. Flags negative treatment. Validates binding authority for the specific jurisdiction and court level. Scores confidence on contextual accuracy by analyzing subsequent citing references.
We reach for graph-based verification when practice areas have dense citation networks (tax, regulatory, patent prosecution). For lighter-touch verification needs (contract review, compliance memos), we build streamlined pipelines with rule-based checks and LLM cross-validation.
Practice-area-specific knowledge graphs built on Neo4j. Nodes for statutes, cases, regulations, and legal concepts. Edges encoding citation relationships, negative treatment, jurisdictional hierarchy, and temporal validity. We start with open data: Harvard Caselaw Access Project (6.7M cases), eCFR, Federal Register, and public court records.
GraphRAG outperforms vector RAG by 14% in retrieval relevance for legal queries. The advantage is sharpest on multi-hop reasoning: "find the most recent Second Circuit case applying the Twombly plausibility standard" is a deterministic graph traversal, not a fuzzy text search. We build graphs for specific practice areas where the citation density justifies the investment.
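The multi-hop traversal described above can be sketched in a few lines. This is a minimal illustration using in-memory Python dicts in place of Neo4j; the case names, years, and the `APPLIES` index are invented placeholders, and a production system would run the equivalent Cypher traversal against the graph database.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Case:
    name: str
    court: str
    year: int

# APPLIES edges: standard -> cases applying it. All entries are placeholders.
APPLIES = {
    "twombly_plausibility": [
        Case("Placeholder v. One", "2d Cir.", 2019),
        Case("Placeholder v. Two", "9th Cir.", 2023),
        Case("Placeholder v. Three", "2d Cir.", 2024),
    ],
}

def most_recent_applying(standard: str, court: str) -> Optional[Case]:
    """Deterministic multi-hop lookup: standard -> applying cases ->
    filter by circuit -> newest. No fuzzy text search involved."""
    hits = [c for c in APPLIES.get(standard, []) if c.court == court]
    return max(hits, key=lambda c: c.year, default=None)

result = most_recent_applying("twombly_plausibility", "2d Cir.")
```

The point of the sketch: once the relationships are explicit edges, "most recent Second Circuit case applying the standard" is a filter and a max, not a similarity search that hopes the right fragment surfaces.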
Not a policy PDF that sits in a shared drive. An enforceable system that implements ABA Opinion 512 requirements: tool approval workflows by practice area, usage logging that tracks which AI tools were used on which client matters, training tracking with completion verification, and audit trails that satisfy malpractice insurers. When 68% of legal professionals have used unapproved AI tools, you need enforcement, not guidelines.
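Enforcement here means the approval check runs in code, not in a memo. A minimal sketch of the tool-approval and usage-logging core follows; the tool names, practice areas, and matter IDs are illustrative assumptions, not a real firm's configuration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative approval matrix: which tools are authorized per practice area.
APPROVED_TOOLS = {
    "litigation": {"harvey", "lexis_protege"},
    "tax": {"lexis_protege"},
}

@dataclass
class UsageLog:
    entries: list = field(default_factory=list)

    def record(self, tool: str, practice_area: str, matter_id: str) -> bool:
        """Log every use; return False when the tool is unapproved so the
        governance workflow can flag it for review."""
        approved = tool in APPROVED_TOOLS.get(practice_area, set())
        self.entries.append({
            "tool": tool,
            "practice_area": practice_area,
            "matter": matter_id,
            "approved": approved,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return approved

log = UsageLog()
log.record("harvey", "litigation", "M-1001")   # approved: logged quietly
log.record("chatgpt", "tax", "M-2002")         # unapproved: logged and flagged
```

Even shadow-AI use leaves a record this way: the unapproved entry is what survives a malpractice inquiry.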
The system includes standing order compliance: a database of 300+ court-specific AI requirements, automatic flagging when a filing enters a jurisdiction with disclosure rules, and templated disclosure language matching each order's specific requirements. Updates continuously as new orders are issued.
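The flagging logic is a lookup keyed by court. The sketch below uses invented court IDs and rule entries; a real system maintains the 300+ current orders and their disclosure templates.

```python
# Hypothetical standing-order index. Court IDs, rule fields, and template
# text are illustrative assumptions, not actual court requirements.
STANDING_ORDERS = {
    "wdnc": {"ai_drafting_barred": True, "disclosure_required": False},
    "flsd": {"ai_drafting_barred": False, "disclosure_required": True,
             "template": ("This filing used generative AI; all citations "
                          "were independently verified by counsel.")},
}

def compliance_flags(court_id: str) -> dict:
    """Return the AI rules in force for a filing's court, or a no-rules
    marker when none are on record."""
    order = STANDING_ORDERS.get(court_id)
    if order is None:
        return {"has_ai_rules": False}
    return {"has_ai_rules": True, **order}

flags = compliance_flags("flsd")  # disclosure required; template available
```

When a filing enters a jurisdiction with rules, the templated language is attached automatically rather than hunted down per matter.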
Harvey's 25,000+ custom agents and LexisNexis Protege's four-agent architecture can now handle multi-step workflows autonomously. A fund formation agent produces a 40-page analysis. A litigation agent drafts discovery requests across multiple claims. These workflows need systematic verification, not ad hoc spot-checking.
We build monitoring and validation layers for agentic legal AI: output verification checkpoints at each workflow stage, provenance tracking that logs which sources the agent consulted, confidence scoring on each claim and citation, and human-in-the-loop gates at decision points the firm defines. The verification scales with the complexity of the agentic workflow.
This is the step-by-step process we build for firms. It sits between AI-generated output and attorney review, catching errors before they reach a filing.
The pipeline receives AI-generated text (from Harvey, Lexis, Westlaw, or any source) and extracts every legal citation using pattern matching and NLP. This includes standard reporter citations (678 F. Supp. 3d 443), short-form references ("Id. at 445"), and statutory citations (28 U.S.C. § 1332). Each citation is canonicalized to a unique identifier, resolving "the Mata case," "Mata v. Avianca," and "678 F. Supp. 3d 443" to the same entity.
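The extraction step can be sketched with a couple of simplified patterns. Real pipelines use full reporter tables and NLP entity linking (resolving "the Mata case" to its reporter cite is out of scope here); these regexes cover only the two citation shapes shown above and are illustrative, not exhaustive.

```python
import re

# Simplified patterns: numbered federal reporters and U.S.C. sections only.
REPORTER = re.compile(r"\b(\d{1,4}) (F\. Supp\. \d?d?|F\.\d?d|U\.S\.) (\d{1,4})\b")
STATUTE = re.compile(r"\b(\d{1,2}) (U\.S\.C\.) § (\d+[a-z]?)\b")

def extract_citations(text: str) -> list:
    """Pull reporter and statutory citations out of AI-generated text."""
    cites = [" ".join(m.groups()) for m in REPORTER.finditer(text)]
    cites += [f"{m.group(1)} U.S.C. § {m.group(3)}" for m in STATUTE.finditer(text)]
    return cites

def canonical_id(cite: str) -> str:
    """Crude canonical key so duplicate forms resolve to one entity."""
    return cite.lower().replace(".", "").replace(" ", "_")
```

In practice extraction is the easy half; canonicalization across short forms and case-name references is where the NLP work lives.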
Each extracted citation is verified against authoritative databases. For case law: does this case exist in the reporter volume cited? For statutes: is this section number valid and current in the cited code? For regulations: does this CFR section exist in the current edition? Citations that fail existence checks are flagged as fabricated. This is the check that would have caught Mata v. Avianca.
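The existence check itself is a lookup. The index below is a stub dict holding only the Mata reporter cite used as an example above; in production the lookup runs against a citator database or a case-law API, and the fabricated cite shown is a made-up placeholder.

```python
# Stub index of known citations. In production: citator database or API.
KNOWN_CASES = {
    "678 F. Supp. 3d 443": "Mata v. Avianca, Inc.",
}

def check_existence(citations: list) -> dict:
    """Map each citation to 'verified' or 'FABRICATED'. A citation that
    resolves to nothing in the database is exactly the Mata failure mode."""
    return {c: ("verified" if c in KNOWN_CASES else "FABRICATED")
            for c in citations}

report = check_existence(["678 F. Supp. 3d 443", "999 F.3d 1"])  # second is a placeholder fake
```

Fabricated citations fail loudly at this stage, before any attorney time is spent.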
Valid citations are checked for negative treatment. Has the case been overruled, reversed, vacated, or distinguished? Is the statute still in force, or has it been amended or repealed? The pipeline goes beyond citator flags: it analyzes subsequent citing references to detect cases where the core proposition has been narrowed even though the case retains a positive citator status. This is the check that catches the Stone v. Ritter problem described above.
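A sketch of the treatment logic, combining the citator flag with a crude signal from subsequent citing references. The 50% threshold and the treatment labels are illustrative assumptions; a real system weighs the citing court's level and recency as well.

```python
from dataclasses import dataclass

@dataclass
class CitingRef:
    treatment: str  # e.g. "follows", "distinguishes", "narrows", "overrules"

def treatment_status(citator_flag: str, citing_refs: list) -> str:
    """Hard negative treatment fails outright; a green citator flag can
    still earn 'caution' when later cases mostly narrow the holding."""
    if citator_flag in {"overruled", "reversed", "vacated"}:
        return "failed"
    negative = sum(r.treatment in {"narrows", "distinguishes"} for r in citing_refs)
    if citing_refs and negative / len(citing_refs) >= 0.5:
        return "caution"  # the Stone v. Ritter pattern: good law, narrowed use
    return "verified"
```

This is the layer a citator flag alone cannot provide: the flag answers "is it good law," the citing-reference analysis answers "is it still the law you think it is."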
The hardest check. The pipeline compares the proposition the AI attributes to the cited case against the actual holding. If the AI writes "the court held that directors have no oversight duty absent red flags," and the cited case actually held the opposite, this is flagged as a contextual hallucination. This uses a second, independent LLM call with the actual case text and the AI's characterization, cross-validated against the knowledge graph's encoded holdings.
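The control flow of the cross-validation can be sketched as follows. The verifier call is stubbed with a deterministic keyword check so the logic is visible and testable; in production `ask_verifier` would be an independent LLM call with the full case text, and its interface here is an assumption, not a real API.

```python
def ask_verifier(source_text: str, claimed_proposition: str) -> bool:
    """Stub for the second, independent model call. Here: 'supported'
    only if the claim's substantive terms appear in the source text."""
    return all(term in source_text.lower()
               for term in claimed_proposition.lower().split()
               if len(term) > 4)

def contextual_check(case_text: str, claim: str, graph_holding: str) -> str:
    """Cross-validate the AI's characterization against both the actual
    case text and the knowledge graph's encoded holding."""
    llm_agrees = ask_verifier(case_text, claim)
    graph_agrees = ask_verifier(graph_holding, claim)
    if llm_agrees and graph_agrees:
        return "verified"
    return "contextual_hallucination" if not llm_agrees else "caution"
```

The two-source design is deliberate: a claim must survive both the raw text and the graph's independently encoded holding before it passes.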
Is the cited case binding or persuasive in the jurisdiction where the filing is being made? A Ninth Circuit opinion cited in a Second Circuit brief is persuasive only. A state trial court opinion has no precedential value. The pipeline validates that binding authorities are correctly identified and flags persuasive-only citations that are presented as controlling law.
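The binding-versus-persuasive check reduces to an explicit hierarchy map. The sketch below covers only two federal circuits for illustration; the full system encodes the complete federal and state hierarchy in the knowledge graph.

```python
# Simplified hierarchy: which courts bind each forum. Illustrative only.
BINDING_IN = {
    "2d Cir.": {"U.S. Supreme Court", "2d Cir."},
    "9th Cir.": {"U.S. Supreme Court", "9th Cir."},
}

def authority_weight(citing_forum: str, cited_court: str) -> str:
    """Classify a cited case as binding or persuasive for the forum
    where the brief is being filed."""
    binding = BINDING_IN.get(citing_forum, {"U.S. Supreme Court"})
    return "binding" if cited_court in binding else "persuasive"

# A Ninth Circuit opinion cited in a Second Circuit brief is persuasive only:
weight = authority_weight("2d Cir.", "9th Cir.")
```

Citations presented as controlling law but classified as persuasive-only are flagged for the reviewing attorney.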
The output is a structured report alongside the AI-generated work product. Each citation receives a status: verified, caution (valid but narrowed/distinguished), or failed (fabricated, overruled, or contextually inaccurate). The reviewing attorney sees exactly which citations need manual attention, reducing the review burden from "check everything" to "check the flagged items." The report becomes part of the matter file for audit trail purposes.
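The report structure can be sketched with the three status tiers described above. Field names are illustrative; the useful property is that the attorney's queue is derived mechanically from the findings.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    VERIFIED = "verified"
    CAUTION = "caution"   # valid but narrowed or distinguished
    FAILED = "failed"     # fabricated, overruled, or contextually inaccurate

@dataclass
class CitationFinding:
    citation: str
    status: Status
    reason: str = ""

def needs_attention(findings: list) -> list:
    """The reviewing attorney checks only flagged items, not everything."""
    return [f for f in findings if f.status is not Status.VERIFIED]
```

Serialized alongside the work product, the full findings list (verified items included) becomes the audit-trail artifact in the matter file.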
Every engagement starts with understanding your firm's specific risk profile, practice areas, and existing technology stack. We build for your workflow, not a generic one.
Phase 1: Weeks 1-3
Phase 2: Weeks 4-10
Phase 3: Weeks 11-16
Answer these questions to understand your firm's current risk exposure and verification maturity. The results give you a framework for prioritizing AI governance investments, whether you work with us or not.
A peer-reviewed Stanford study published in the Journal of Empirical Legal Studies in 2025 tested both platforms systematically. Westlaw Precision hallucinated 33% of the time, with only 42% of responses fully accurate. Lexis+ AI (now Lexis+ with Protege) hallucinated 17% of the time, with just 20% of responses fully accurate. These numbers apply to complex multi-hop queries, the kind associates handle daily in litigation and regulatory work. Simpler lookups perform better.
The critical nuance: LexisNexis quietly walked back its "100% hallucination-free" marketing language after the study, clarifying that the promise applied only to linked legal citations, not to the reasoning around them. Contextual hallucination, citing a real case for a proposition it does not support, is not captured by citation-link accuracy metrics. A verification pipeline needs to check both: does the case exist, and does it say what the AI claims it says.
Over 300 federal and state judges have adopted standing orders or local rules governing AI use in filings, and they vary significantly. Some require only disclosure that AI was used and which tools. Others require certification that every citation has been independently verified. The Western District of North Carolina effectively bars generative AI for drafting entirely, permitting only standard research platforms. Florida enacted a new AI disclosure mandate in February 2026. A federal court has ruled that AI-generated documents are not protected by attorney-client privilege.
The compliance challenge is not reading one order. It is tracking 300+ orders across every jurisdiction where your firm files, keeping them updated as judges revise requirements, and generating the correct disclosure language for each filing. We build automated standing order compliance systems: a database of current requirements mapped by court, automatic flagging when a new filing enters a jurisdiction with AI rules, and templated disclosure language that matches each order's specific requirements. The system updates as new orders are issued.
Harvey is excellent at what it does. At $11B valuation and 50% AmLaw 100 adoption, it is the leading legal AI platform for research, drafting, and workflow automation. With 25,000+ custom agents operating on the platform, it is becoming infrastructure. But Harvey is a generative platform, not a verification system. It produces legal analysis. It does not independently verify that analysis against a second source.
A citation verification pipeline is a separate concern. Think of it as quality assurance for AI output, the same way a firm has document review processes that exist independently of the drafting tools. We build verification layers that take Harvey's output (or Lexis Protege, or Westlaw, or any source) and run automated checks: citation existence against KeyCite/Shepard's, negative treatment flagging, binding authority validation for the specific jurisdiction, and confidence scoring.
This matters particularly with Harvey's agentic workflows, where long-horizon agents handle multi-step processes like fund formation. An autonomous agent producing a 40-page analysis needs systematic verification, not ad hoc spot-checking.
ABA Formal Opinion 512, issued July 2024, is the first comprehensive ethics guidance on generative AI in legal practice. It addresses six obligations: competence, confidentiality, communication, candor toward the tribunal, supervisory responsibilities, and fees.
The practical requirements are specific. Competence means lawyers must understand AI capacity and limitations, and update that understanding periodically, not just attend one CLE. Confidentiality means assessing data exposure before entering client information into any AI tool, which most firms have not done systematically for Harvey, Lexis, or internal tools. Supervision means managerial lawyers must establish firm-wide AI policies and ensure training, not just for lawyers but for all staff who touch AI tools. On fees, lawyers cannot charge clients for time spent learning tools they will regularly use.
Compliance is not a policy document. It requires an enforceable system: tool approval workflows that log which tools are authorized for which practice areas, usage monitoring that flags when unapproved tools are used on client matters (68% of legal professionals have used unapproved AI tools at least once), training tracking with completion verification, and documentation that survives a malpractice inquiry.
Standard vector RAG works by semantic similarity. It finds text that looks like your query. A legal knowledge graph works by structural relationships. It knows that Case A interprets Statute B, that Case C overruled Case A, and that Case D from the Second Circuit is binding while Case E from the Ninth Circuit is persuasive only in the Second Circuit.
The difference matters for three specific failure modes. First, negative treatment: vector RAG cannot distinguish between citing a case and overruling it. A thoroughly discussed overruled case scores high on semantic similarity. A knowledge graph has an explicit OVERRULES edge that blocks retrieval of that case as binding authority. Second, multi-hop reasoning: a question like "find the most recent Second Circuit case applying the Twombly plausibility standard" requires traversing statute to interpretation to circuit to date. Vector RAG retrieves fragments and hopes the LLM connects them. A graph traverses the path deterministically. Third, jurisdictional hierarchy: vector search treats a state trial court opinion the same as a Supreme Court ruling if the text is similar. A knowledge graph encodes court hierarchy and returns binding authority first.
Benchmarks show GraphRAG outperforms vector RAG by 14% in retrieval relevance for legal queries. We build practice-area-specific knowledge graphs on Neo4j, starting with regulatory compliance and tax where citation networks are densest.
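The OVERRULES-edge mechanism in the first failure mode above can be sketched directly. Case names, the edge set, and the court ranks are invented for illustration; the production version is a graph query, not in-memory dicts.

```python
# Explicit negative-treatment edges: (overruling case, overruled case).
OVERRULES = {("Case C", "Case A")}
COURT_RANK = {"U.S. Supreme Court": 0, "2d Cir.": 1, "S.D.N.Y.": 2}

def overruled(case: str) -> bool:
    return any(target == case for _, target in OVERRULES)

def retrieve(candidates: list) -> list:
    """candidates: (case_name, court) pairs from retrieval. Drop overruled
    cases regardless of semantic similarity, then order binding-first."""
    live = [(name, court) for name, court in candidates if not overruled(name)]
    live.sort(key=lambda nc: COURT_RANK.get(nc[1], 99))
    return [name for name, _ in live]
```

A vector store has no equivalent of that first filter: the overruled case's text still scores high on similarity, so only an explicit edge can block it as binding authority.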
Malpractice insurers are actively incorporating AI usage into underwriting decisions in 2026. The risk exposure is specific and documented. If firm lawyers allow AI to make critical legal judgments without attorney oversight, insurers may classify this as unauthorized practice of law, which is typically excluded from coverage. The logic: no attorney oversight means no professional services were rendered by an attorney, which means the malpractice policy does not apply.
This creates a coverage gap where the firm is most exposed. Shadow AI compounds the problem. When 68% of legal professionals have used unapproved tools, the firm has undocumented AI usage on client matters with no audit trail. If a hallucinated citation leads to sanctions or adverse outcomes, the insurer asks: what was your AI governance policy, and can you prove it was followed?
An AI governance system provides the documentation trail: which tools were approved, who was trained, what verification steps were taken on each matter. This is not about avoiding AI. It is about creating the evidentiary record that keeps your coverage intact when something goes wrong.
Our detailed analysis of citation-enforced architectures for legal AI, including GraphRAG technical design, knowledge graph schemas, and implementation blueprints.
The $5,000 Hallucination and the End of the Wrapper Era: Citation-Enforced GraphRAG for Enterprise Legal AI
Technical deep-dive into graph-constrained decoding, legal knowledge graph schema design, and the architecture of citation verification systems.
The Sixth Circuit levied $30,000 in sanctions in March 2026. Some cases have exceeded $100,000 in combined sanctions and attorney fees.
A citation verification pipeline for your highest-risk practice area takes weeks to build and costs a fraction of one sanctions event. The governance system that protects your malpractice coverage takes even less. The question is not whether you can afford to build this. It is whether you can afford not to.