
Every AI Got the Same Tax Question Wrong — And Nobody Noticed Until We Checked the Statute
It was a Tuesday night, and I was staring at three browser tabs — ChatGPT, Claude, Gemini — all telling me the same thing. All three were wrong.
The question was simple enough: under the new One Big Beautiful Bill Act (OBBBA), can a taxpayer deduct interest on a personal car loan? Every model said yes. Every model explained the provision correctly — the dates, the vehicle requirements, even the income limits. And every model placed the deduction in the wrong section of the tax code, which meant the downstream financial advice was garbage.
Not slightly off. Not "well, it depends on interpretation." Just wrong. The kind of wrong that, if a CPA followed it, would trigger an IRS audit. The kind of wrong that could cost a retiree thousands in unexpected Medicare premium surcharges. And the terrifying part? If you didn't already know the answer, you'd never catch it. The AI sounded perfect.
That moment changed how I think about AI in compliance. Not because I was surprised that a language model hallucinated — we all know they do that. What shook me was that all three hallucinated in exactly the same way, for exactly the same reason. They'd learned the wrong answer from the internet, and the internet was so uniformly wrong that there was no signal left for the models to self-correct.
We started calling this phenomenon Consensus Error. And once we understood it, we couldn't unsee it.
What Is Consensus Error, and Why Should You Care?
Most people who work with AI are familiar with hallucinations — moments when a model invents a citation, fabricates a statistic, or confidently states something that isn't true. Those are scary, but they're also somewhat random. You can often catch them because they feel off, or because a quick Google search reveals the fabrication.
Consensus Error is different. It's what happens when an AI gives you the wrong answer not because it lacks data, but because the data it trained on is overwhelmingly, confidently incorrect. The model isn't guessing in the dark. It's reflecting the majority opinion of the internet back at you — and the majority opinion happens to be legally false.
When the internet itself is wrong, every AI trained on it inherits the same delusion. Consensus Error isn't a glitch — it's a feature of how language models learn.
Think about the math for a moment. If 90% of the articles on the web about a particular tax provision describe it incorrectly — using simplified language, conflating two different sections of the code, or just copying each other's mistakes — then the model's internal weights converge on the wrong answer with high confidence. Even if the actual statute exists in the training data, it appears once or twice in dense legal language, while the incorrect interpretation appears thousands of times in accessible, SEO-optimized prose.
The model does what it was designed to do: predict the most likely next token. And the most likely token is the wrong one.
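The imbalance is easy to see in a toy sketch. The counts below are invented for illustration, and real model training is far more complex than word-counting, but the direction of the pressure is the same: frequency wins.

```python
# Toy illustration of Consensus Error (hypothetical counts, not real data):
# when the wrong claim dominates the corpus, frequency-driven learning
# converges on it with high confidence.
from collections import Counter

corpus = (
    ["car loan interest lowers your AGI"] * 9_000          # SEO explainers (wrong)
    + ["deduction sits below the line, AGI unchanged"] * 12  # the statute (right)
)

counts = Counter(corpus)
total = sum(counts.values())
for claim, n in counts.most_common():
    print(f"{n/total:7.2%}  {claim}")
```

The correct statement is present in the corpus. It simply carries about a tenth of a percent of the signal, which is indistinguishable from noise.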
The Car Loan Deduction That Broke Three AIs

Let me walk you through the specific case, because the details matter.
The OBBBA created a new category called "Qualified Passenger Vehicle Loan Interest" — a temporary deduction for interest paid on loans for new, US-assembled passenger vehicles, available for tax years 2025 through 2028. So far, so good. Every AI got this part right.
Here's where it falls apart. The deduction was added to IRC Section 63, which governs Taxable Income. It was not added to IRC Section 62, which governs Adjusted Gross Income (AGI). In tax, this distinction is everything.
A Section 62 deduction — "above the line" — reduces your AGI. Your AGI determines eligibility for dozens of other benefits: student loan repayment plans, medical expense deduction floors, the taxation of Social Security benefits, Medicare premium surcharges. Lower your AGI, and you unlock a cascade of downstream savings.
A Section 63 deduction — "below the line" — reduces your Taxable Income. It still saves you money on your federal tax bill, but it does nothing to your AGI. None of those downstream benefits are affected.
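In plain arithmetic, with hypothetical round numbers, the difference between the two placements looks like this:

```python
# Hypothetical round numbers: the same deduction, two different placements.
gross_income = 120_000
car_loan_interest = 4_000

# Above the line (Section 62 style): the deduction reduces AGI itself,
# so every AGI-keyed benefit (IRMAA, income-driven repayment, state
# coupling) sees the lower number.
agi_if_above_the_line = gross_income - car_loan_interest   # 116,000

# Below the line (Section 63, where the provision actually sits):
# AGI is untouched; only taxable income falls.
agi_actual = gross_income                                  # 120,000
taxable_income_actual = agi_actual - car_loan_interest     # 116,000
```

The federal tax saving is similar either way. The $4,000 gap in reported AGI is where the downstream damage lives.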
Every major LLM told users this deduction would lower their AGI. Every one. Because that's what the financial content ecosystem said. Widely shared explainers and SEO-optimized articles ran headlines like "Car Loan Interest is Now Deductible!" without specifying where in the tax calculation it applied. The distinction between Section 62 and Section 63 is apparently too boring for content creators optimizing for clicks.
But for the taxpayer who follows this advice? The consequences are real.
A retiree who incorrectly reports a lower AGI could face unexpected Medicare IRMAA surcharges. A borrower on an income-driven student loan repayment plan could lose eligibility for lower payments. A taxpayer in a state like Arizona, which couples its income tax to federal AGI, could face a state audit. I wrote about the full cascade of financial consequences in the interactive version of our research — the ripple effects are worse than most people assume.
Why Didn't RAG Fix This?
I can already hear the objection. "Just use RAG. Retrieve the actual statute. Problem solved."
We tried. It didn't solve it.
Retrieval-Augmented Generation — the technique where you feed relevant documents into the model's context window alongside the user's question — is the industry's current answer to hallucinations. And for many use cases, it helps. But for legal reasoning, it has three structural weaknesses that showed up immediately in our testing.
First, tax legislation doesn't read like a narrative. The OBBBA doesn't say "this is a below-the-line deduction." It says "Section 163(h) is amended by inserting after paragraph (3) the following new paragraph..." The model has to reconstruct the logical state of the tax code from a series of amendments. When the retrieved chunk says "deduction allowed" but doesn't explicitly state its position in the calculation flow, the model fills the gap with its training bias — the blog posts.
Second, vector search finds what's similar, not what's relevant by omission. A query about car loan deductions retrieves paragraphs about car loan deductions. It doesn't retrieve the paragraph in Section 62 that defines AGI — because that paragraph doesn't mention car loans. It just excludes them by not listing them. In law, the absence of something from a list is meaningful. In vector similarity search, absence is invisible.
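A crude sketch makes the failure concrete. Here naive term overlap stands in for embedding similarity (real vector search is more sophisticated, but it shares the bias toward shared vocabulary), and the document texts are invented:

```python
# Toy retrieval: term-overlap "similarity" as a stand-in for embeddings.
# Document texts are hypothetical.
docs = {
    "blog_post": "car loan interest deduction new rules for vehicles",
    "sec_62": "adjusted gross income means gross income minus the "
              "deductions listed in this section",  # never mentions car loans
}
query = "car loan interest deduction"

def overlap(a: str, b: str) -> int:
    """Count shared words between query and document."""
    return len(set(a.split()) & set(b.split()))

ranked = sorted(docs, key=lambda d: overlap(query, docs[d]), reverse=True)
# sec_62 ranks dead last, even though its *silence* about car loans
# is the legally decisive fact.
```

The paragraph that settles the question scores zero, because what makes it decisive is what it omits.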
Third — and this is the one that keeps me up at night — RAG solves retrieval, not reasoning. You can put the correct statute in front of the model, and it will still misinterpret it if its internal weights are skewed by millions of incorrect training examples. It becomes a biased reader, seeing what it expects to see rather than what the text actually says.
RAG puts the right document in front of the AI. But a biased reader with the right book still reaches the wrong conclusion.
After weeks of testing different retrieval strategies, chunk sizes, and prompt configurations, my team had an argument that clarified everything. One of our engineers wanted to keep refining the RAG pipeline — better embeddings, more precise chunking, explicit instructions to "only use the retrieved text." Our legal advisor looked at him and said, "You're trying to make a better pair of glasses for someone who can't do math." She was right. The problem wasn't what the model could see. It was what the model could reason about.
What Happens When AI Reads the Law Wrong?
The OBBBA case also includes phase-out rules that compound the problem. The deduction is capped at $10,000 per year and reduced by $200 for each $1,000 by which the taxpayer's income exceeds $100,000 (single filers) or $200,000 (joint filers). At $150,000 or $250,000 respectively, the deduction vanishes entirely.
When we tested this with a hypothetical taxpayer earning $125,000, the models struggled. Some ignored the phase-out entirely. Others applied a reduction curve borrowed from the Child Tax Credit — a completely different provision with different math. The models were pattern-matching to familiar structures rather than computing the actual rule.
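The rule itself is trivially computable. This sketch encodes only the figures quoted above (the statute's exact rounding and income-definition rules may differ), which is exactly the point: there is nothing here to pattern-match, only arithmetic.

```python
def qpvli_deduction_cap(income: int, joint: bool = False) -> int:
    """Phase-out sketch using only the figures quoted in this article:
    $10,000 cap, reduced $200 per $1,000 of income over the threshold.
    (The statute's precise rounding and MAGI rules may differ.)"""
    CAP = 10_000
    threshold = 200_000 if joint else 100_000
    excess = max(0, income - threshold)
    reduction = (excess // 1_000) * 200   # $200 per full $1,000 of excess
    return max(0, CAP - reduction)

print(qpvli_deduction_cap(125_000))              # 5000: the case the models fumbled
print(qpvli_deduction_cap(150_000))              # 0: fully phased out, single
print(qpvli_deduction_cap(250_000, joint=True))  # 0: fully phased out, joint
```

Ten lines of deterministic code get the $125,000 case right every time. The models, pattern-matching against the Child Tax Credit, did not.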
This is the "Arithmetic Hallucination" problem layered on top of Consensus Error. The model doesn't just get the legal framework wrong — it gets the math wrong too. And when you compress these models for efficient deployment through quantization, the reasoning abilities degrade faster than the linguistic abilities. The model sounds more confident while becoming less accurate. That's a dangerous combination in any domain. In tax compliance, it's a liability.
There's another dimension most people miss entirely. The OBBBA also created Section 6050AA, which imposes new reporting requirements on lenders. Any business receiving $600 or more in interest on a qualifying vehicle loan must file an information return with the IRS. When we asked the models what banks and credit unions need to do about the new law, they focused almost exclusively on the borrower's deduction — because that's what the content ecosystem wrote about. The lender's compliance obligation barely registered. For a fintech company using AI to summarize regulatory changes, that omission could mean systematic non-compliance and penalties under IRC Sections 6721 and 6722.
The Architecture That Actually Works

After the car loan debacle, I spent a month convinced that the entire approach of using language models for compliance was fundamentally broken. Then we built something that changed my mind — not about LLMs, but about how they should be used.
The insight was simple, once we saw it: stop asking the AI to reason about law. Let it do what it's good at — understanding language — and hand the reasoning to something that can actually do logic.
This is the core idea behind Neuro-Symbolic AI: a hybrid architecture that combines the linguistic fluency of neural networks with the deterministic rigor of symbolic logic systems. We built what we call a Deterministic Tax Engine, and it works in three stages.
The first stage is the Intent Parser. This is where the LLM lives, doing what LLMs do best. A user uploads an invoice, types a question, or connects a bank feed. The neural layer extracts structured entities: vehicle type, purchase date, loan amount, assembly location, taxpayer income. It doesn't decide anything about tax treatment. It just translates messy human reality into clean data.
The second stage is the Truth Anchor. This is where the symbolic layer takes over. We encode the tax code — not as text to be "read," but as executable logic using specialized languages like Catala (developed at Inria and used to encode French tax law) and PROLEG (a Prolog-based legal reasoning system). The statute becomes a program. The Truth Anchor queries a Knowledge Graph, applies the encoded rules, checks for missing facts, and computes the deterministic result. If the vehicle was assembled in Mexico, the deduction is denied. If the income exceeds the phase-out threshold, the reduction is calculated to the penny. No probability. No guessing.
The third stage is the Response Generator. Another neural layer, but constrained. It receives the logic solver's output — "Deduction: DENIED. Reason: Phase-out exceeded at income level $135,000" — and translates it into clear, human-readable language. The LLM makes the answer readable. It has zero freedom to make the answer different.
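As a minimal sketch, the three-stage split looks like this. All names are hypothetical, the neural stages are stubbed out as plain functions, and the rules use only the simplified single-filer figures quoted earlier:

```python
from dataclasses import dataclass

@dataclass
class LoanFacts:
    """Stage 1 output: the Intent Parser extracts entities. It decides nothing."""
    vehicle_type: str
    assembly_country: str
    loan_year: int
    income: int

def truth_anchor(f: LoanFacts) -> tuple[bool, str]:
    """Stage 2: deterministic encoded rules -- no model, no probabilities.
    Single-filer phase-out only, for brevity."""
    if f.vehicle_type != "passenger":
        return False, "not a qualified passenger vehicle"
    if f.assembly_country != "US":
        return False, f"assembled in {f.assembly_country}, not the US"
    if not 2025 <= f.loan_year <= 2028:
        return False, "outside the 2025-2028 window"
    if f.income >= 150_000:
        return False, f"phase-out exceeded at income ${f.income:,}"
    return True, "all statutory conditions verified"

def response_generator(allowed: bool, reason: str) -> str:
    """Stage 3: wording only. The verdict arrives fixed from stage 2."""
    return f"Deduction {'ALLOWED' if allowed else 'DENIED'}: {reason}."

print(response_generator(*truth_anchor(
    LoanFacts("passenger", "Mexico", 2026, 90_000))))
# Deduction DENIED: assembled in Mexico, not the US.
```

Notice what the language model never touches: the conditionals. Its entire job is the boundary layers, messy input in, readable output out.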
We don't ask the AI to guess the law. We teach it to calculate the law. The neural layer handles language. The symbolic layer handles truth. Neither tries to do the other's job.
This decoupling is what kills Consensus Error. The neural layer might "know" from its training that popular financial sites say car loan interest lowers AGI. But it's never asked to make that determination. It's only asked to extract the vehicle details. The decision lives in the symbolic layer, which has no access to Reddit — only to the encoded statute.
For the full technical architecture and formal analysis, including how Catala handles default-exception logic and how Answer Set Programming checks for consistency across an entire tax return, we published a detailed research paper.
Why Does This Matter Beyond One Tax Deduction?
I'll be honest about something. When I first started explaining Consensus Error to people, the most common response was: "Okay, but that's one edge case. Most of the time, the AI gets it right."
That response misunderstands the nature of the problem. The OBBBA car loan deduction isn't special. It's typical. Every time Congress passes a new provision, the content ecosystem generates thousands of simplified, sometimes incorrect interpretations before the IRS even issues guidance. Every time a state decouples from federal tax treatment, the internet lags behind. Every time a court ruling changes the interpretation of an existing statute, the old interpretation persists in the training data for years.
Consensus Error isn't an edge case. It's the default state for any area of law that's complex, recently changed, or poorly understood by generalist writers. Which is to say: most of tax law.
The broader implication is about what we expect from AI in high-stakes domains. The current paradigm — give the LLM more context, better prompts, bigger windows — is an incremental improvement on a fundamentally mismatched architecture. You're asking a pattern-matching engine to do formal reasoning. Sometimes it gets lucky. In tax compliance, "sometimes" isn't good enough.
From Black Box to Glass Box
There's one more thing that matters enormously to anyone who's ever sat across from an auditor: traceability.
When a standard LLM tells a taxpayer "yes, you can deduct that," and the IRS later disagrees, what's the audit trail? The model predicted that "yes" was the most probable next token given the context. That's not an answer you can present to a revenue agent.
In our Neuro-Symbolic Engine, every determination produces a graph path:
Deduction_Allowed = True because Loan_Date is within the 2025–2028 window (verified), Vehicle_Type is Passenger (verified), Assembly_Location is US (verified), Income is below the phase-out threshold (verified), Rule_Reference is IRC § 163(h)(4).
Every node in that chain is auditable. Every fact can be traced to a source document. Every rule can be traced to a specific section of the code. If something changes — a new IRS ruling, an amended provision — you update the Knowledge Graph and the logic solver, and every affected determination is automatically flagged for review.
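A minimal sketch of such a trace, using a hypothetical schema: each check carries its own verdict and rule reference, so the chain can be replayed for an auditor or flagged when a referenced rule changes.

```python
# Hypothetical audit-trail schema: (check, passed, rule reference).
trace = [
    ("Loan_Date within 2025-2028 window", True, "OBBBA effective dates"),
    ("Vehicle_Type is Passenger",         True, "IRC § 163(h)(4)"),
    ("Assembly_Location is US",           True, "IRC § 163(h)(4)"),
    ("Income below phase-out threshold",  True, "IRC § 163(h)(4)"),
]

deduction_allowed = all(passed for _, passed, _ in trace)

for check, passed, rule in trace:
    print(f"[{'PASS' if passed else 'FAIL'}] {check}  ({rule})")
print("Deduction_Allowed =", deduction_allowed)
```

Compare that to "the model predicted this token." One is evidence; the other is an apology.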
This transforms AI from something you have to trust on faith into something you can verify on demand. People always ask me whether this approach is slower or more expensive than just using an LLM. The honest answer: the initial encoding of the statute takes real work. But once it's encoded, running a determination is nearly instant, and the cost of checking 100% of transactions approaches the cost of checking 1%. That's not a marginal improvement over traditional sampling-based audits. It's a different category of assurance.
The Question Nobody's Asking Yet
Here's what bothers me most. The financial services industry is racing to deploy AI assistants, AI tax preparers, AI compliance monitors. The marketing says "AI-powered." The fine print says "not tax advice." And in between, millions of people are getting answers that sound authoritative, cite real legislation, reference actual dates and dollar amounts — and place the deduction in the wrong section of the tax code.
We're not in a world where AI is obviously wrong and humans can tell. We're in a world where AI is subtly wrong in ways that require deep domain expertise to detect. That's more dangerous than obvious failure, because it erodes the instinct to double-check.
The question isn't whether AI should be used in tax compliance. It absolutely should — the complexity of the code has outpaced human capacity to apply it consistently. The question is whether we're willing to build AI systems that respect the nature of the domain they operate in. Tax law is deterministic. It has right answers and wrong answers, defined by statute, not by popular opinion. An architecture that treats legal truth as a probability distribution will always be vulnerable to Consensus Error, no matter how large the model or how clever the prompt.
In tax law, popularity is not a proxy for truth. An answer repeated ten thousand times on the internet is not more correct than the statute that contradicts it once.
We built Veriprajna because I believe the era of "trust, but verify" is already over for AI in compliance. The systems that earn enterprise trust will be the ones that verify first and speak second — the ones where the logic is auditable, the math is deterministic, and the AI's confidence is backed by something more durable than token probability.
The statute doesn't care what the internet thinks. Your AI shouldn't either.


