Tax Compliance AI
Thomson Reuters "Ready to Review" auto-prepares 1040s. CCH Axcess Expert AI drafts advisory insights across 10,000 firms. Blue J answers tax research questions with a disagree rate under 1 in 700.
The preparation problem is being solved. The verification problem is not. When an AI misclassifies a deduction as above-the-line instead of below-the-line, the 20% accuracy penalty applies to the human who signed the return, not the algorithm that drafted it. We build the verification layer that catches these errors before they reach the IRS.
$126B+
Annual US business tax compliance cost
Fortune, March 2026
8.8% → 22.6%
IRS large corporate audit rate increase
IRS enforcement priorities, 2026
50%
Accountants aware of AI-caused financial losses
Accountancy Age, March 2026
Tax AI failures are not isolated hallucinations. They are systematic biases baked into the training data that produce confidently wrong answers with perfect grammar and plausible-sounding citations.
The One Big Beautiful Bill Act (OBBBA) created a new deduction for qualified passenger vehicle loan interest (QPVLI) under IRC Section 163(h)(4)(A). Congress placed the deduction in Section 63(b)(7), which means it reduces taxable income, not adjusted gross income.
This is a below-the-line deduction. It does not lower AGI.
Yet as of April 2026, H&R Block's own website describes it as an "above-the-line incentive." Thousands of blog posts, SEO-optimized articles, and financial content farms repeat the same misclassification. When LLMs trained on this content answer questions about the OBBBA deduction, they reproduce the error with high confidence because the incorrect characterization appears orders of magnitude more frequently than the correct statutory text.
| Impact Area | If Misclassified as Above-the-Line | Actual Statutory Effect | Financial Consequence |
|---|---|---|---|
| AGI Calculation | Incorrectly lowers AGI | Does not affect AGI | Underpayment of federal tax |
| State Taxes (AGI-coupled states) | Incorrectly lowers state tax | No effect in most states | Multi-state audit exposure |
| Medicare IRMAA Premiums | False premium reduction | No effect on premiums | Unexpected costs for retirees |
| Medical Deduction Floor | Incorrectly lowers 7.5% floor | No effect on floor | Disallowed deductions + interest |
| Student Loan IDR | False qualification | No effect on repayment | Non-compliance with loan terms |
A single above-the-line/below-the-line misclassification cascades through at least five downstream calculations. This is one provision. The IRC has thousands.
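The cascade can be made concrete with a small sketch. All figures below are invented for illustration; the point is that taxable income comes out the same under either classification, which is exactly why the error slips past review while every AGI-keyed threshold silently moves.

```python
# Hypothetical illustration of the above-the-line vs below-the-line cascade.
# Dollar amounts are invented for demonstration only.

def downstream(agi: float, taxable_income: float) -> dict:
    """AGI-dependent amounts that shift if AGI is (incorrectly) reduced."""
    return {
        "taxable_income": taxable_income,
        # Medical expenses are deductible only above 7.5% of AGI.
        "medical_deduction_floor": 0.075 * agi,
    }

gross_income = 120_000.0
qpvli_deduction = 8_000.0

# Correct treatment (below-the-line): AGI is unchanged.
correct = downstream(agi=gross_income,
                     taxable_income=gross_income - qpvli_deduction)

# Misclassified treatment (above-the-line): AGI is wrongly reduced.
wrong = downstream(agi=gross_income - qpvli_deduction,
                   taxable_income=gross_income - qpvli_deduction)

# Taxable income matches either way -- the error is invisible on the
# bottom line, but the medical floor is now $600 too low, so deductions
# get claimed that the statute disallows.
assert correct["taxable_income"] == wrong["taxable_income"]
print(correct["medical_deduction_floor"] - wrong["medical_deduction_floor"])
```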
LLMs do not reason about tax law. They predict the next token based on patterns in training data. When the blogosphere is 90% wrong about a specific provision (common for technical legislative changes), the model's weights converge on the incorrect answer regardless of the prompt.
RAG helps but does not solve this. Blue J retrieves the statute text, but the LLM must still interpret it. Amendment language ("Section 163(h) is amended by inserting...") requires reconstructing the code's current state from fragments. If the model's internal weights are biased by millions of incorrect blog posts, it acts as a biased reader, misinterpreting even correctly retrieved text.
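The amendment-reconstruction problem can be sketched in a few lines. The dictionary keys and statutory snippets below are simplified stand-ins, but the mechanic is real: the "current" statute exists only as a base text plus an ordered list of edits that must be replayed deterministically before the structure becomes visible.

```python
# Minimal sketch (citations abbreviated, text paraphrased) of why amendment
# language forces state reconstruction rather than pattern matching.

section_163h = {
    "163(h)(1)": "General rule: personal interest is not deductible.",
    "163(h)(2)": "Exceptions: qualified residence interest, ...",
}

amendments = [
    # e.g. "Section 163(h) is amended by inserting after paragraph (3) ..."
    ("insert", "163(h)(4)(A)", "Exception: qualified passenger vehicle loan interest."),
    ("insert", "163(h)(4)(B)", "Dollar limitation on the amount allowed under (A)."),
]

for op, cite, text in amendments:
    if op == "insert":
        section_163h[cite] = text
    elif op == "strike":
        section_163h.pop(cite, None)

# Only after replaying the edits does a reader see that (4)(A) is an
# exception to (h)(1), capped by (4)(B) -- structure that next-token
# prediction over blog posts never materializes.
```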
Prompt engineering cannot fix this either. You cannot instruct a probability engine to become a logic solver. The architecture itself must change for provisions where deterministic correctness is required.
Every category below solves a real problem. None of them solve verification of AI-generated tax positions. This table is designed to be pulled up in internal meetings when evaluating tax technology investments.
| Category | Key Players | What They Actually Do | Honest Gaps |
|---|---|---|---|
| Platform Incumbents | Thomson Reuters ONESOURCE+, Wolters Kluwer CCH Axcess Expert AI, Intuit ProConnect | End-to-end compliance: data import, return preparation, filing, workflow automation. ONESOURCE claims 65% reduction in routine reporting. CCH Axcess embedded across 10,000 firms. | Verify their own outputs against their own rules. No cross-platform verification. Agentic AI is workflow automation, not position verification. Data quality issues upstream propagate through. |
| AI Tax Research | Blue J ($122M Series D), TaxGPT ($4.6M), Bizora | Natural language tax research on curated authority databases. Blue J: RAG on GPT-4.1, disagree rate <1/700. Bizora: all 50 states SALT, $30-120/mo. | Probabilistic answers. The 1-in-700 disagree rate measures user disagreement, not ground truth accuracy. Users who don't know the correct answer can't disagree with a wrong one. Not suitable as sole authority for high-penalty positions. |
| Deterministic Tax Engines | Vertex (300M+ rates), Avalara ($8.4B + $500M BlackRock), Sovos (Sovi AI) | Indirect tax calculation: rates, exemptions, filing across 12,000+ jurisdictions. 100% deterministic for covered scenarios. Full audit trails. | Cannot handle natural language. Cannot reason about ambiguous provisions (facts-and-circumstances tests). Adding rules requires manual encoding. Limited to indirect tax; income tax verification is a separate problem. |
| Big 4 / Large SIs | EY+IBM (watsonx), KPMG (Tax AI Accelerator), Deloitte, PwC | Proprietary AI tools for internal use. EY targeting 80% automation of foreign tax compliance. KPMG launched Tax AI Accelerator Feb 2026. PwC claims 20-50% developer productivity gains. | Proprietary tools built for their own engagements, not available to your tax department. Engagements run $500K-$5M+. They implement platforms, not build custom verification layers. Their AI tools verify their own work, not yours. |
| Neuro-Symbolic / Decision Platforms | Rainbird AI (BDO client) | Deterministic graph-based inference with AI guardrailing. BDO cut R&D tax review from 5 hours to seconds. Transparent reasoning chains. | General-purpose platform, not tax-specific. Each use case requires custom knowledge graph construction. BDO case was R&D credits (narrow domain), not general tax compliance. UK-focused. |
| Academic / Research | Catala (INRIA), PROLEG (NII Japan), Sarah Lawsky (Northwestern) | Domain-specific languages for formalizing tax law. Catala excels at default/exception logic. Used by French government for housing benefits. Lawsky demonstrated on IRC Sections 121, 132. | Not production-ready. Catala compiler described as "yet unstable." Full IRC is 4M+ words. Only a few US sections formalized. PROLEG designed for Japanese Civil Code. Years away from enterprise deployment. Full-IRC formalization is beyond Veriprajna's scope as well; we encode production rules in OPA/Rego instead. |
Missing from this table: a vendor-neutral verification layer that sits atop any of these platforms and catches position-level errors deterministically. That is the gap we fill.
Every engagement is custom. These are the capabilities we bring to tax technology work, not products you purchase off a shelf.
We encode high-error-rate IRC provisions in OPA/Rego, creating a deterministic verification layer that tests AI-generated tax positions against statutory logic. We reach for OPA over Catala because OPA is CNCF-graduated with a massive community, generates comprehensive audit trails, and integrates with modern API architectures. Catala is elegant but has no production US tax deployment and an unstable compiler.
A typical initial build covers 10 to 15 provisions, commonly including Section 199A (QBI deduction), Section 163(j) (business interest limitation), Section 1031 (like-kind exchanges), OBBBA QPVLI, Section 280A (home office), and Section 30D (EV credits). These are selected based on error-frequency data and penalty exposure.
The engine takes a structured tax position as input and returns a pass/fail with the specific statutory citation chain. It integrates via REST API with ONESOURCE, CCH Axcess, Blue J, or internal tools.
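To make the contract concrete, here is an illustrative Python equivalent of one encoded rule. In production the rule lives in OPA/Rego behind the REST API; the field names and return shape below are assumptions chosen to mirror the QPVLI example.

```python
# Illustrative stand-in for one deterministic verification rule.
# Production rules are OPA/Rego policies; names here are hypothetical.

from dataclasses import dataclass

@dataclass
class Position:
    provision: str          # e.g. "OBBBA QPVLI"
    classification: str     # "above_the_line" or "below_the_line"

def verify(position: Position) -> dict:
    # Deterministic check: QPVLI sits in Sec. 63(b)(7), so it reduces
    # taxable income only. Any above-the-line classification hard-fails
    # with the statutory citation chain attached.
    if position.provision == "OBBBA QPVLI" and position.classification != "below_the_line":
        return {
            "result": "fail",
            "citations": ["IRC 163(h)(4)(A)", "IRC 63(b)(7)"],
            "reason": "Deduction reduces taxable income, not AGI.",
        }
    return {"result": "pass", "citations": []}

print(verify(Position("OBBBA QPVLI", "above_the_line"))["result"])  # fail
```

The engine never generates an answer; it only accepts or rejects one, which is what keeps the check independent of the model that produced the position.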
We build Neo4j-based knowledge graphs encoding IRC cross-references, amendment chains, and default/exception hierarchies. The graph represents relationships that vector search misses: Section 163(h)(4)(B) places a numeric cap on the exception in Section 163(h)(4)(A), which is itself an exception to the general prohibition in Section 163(h)(1).
Each graph is custom-scoped to the client's tax position universe. A multinational with transfer pricing concerns gets a different graph than a domestic retailer with sales and use tax complexity. We do not attempt to encode the full IRC. That is a multi-year, multi-million-dollar academic exercise. We encode the provisions where your specific audit risk is concentrated.
The knowledge graph enables GraphRAG retrieval: queries traverse the statutory structure, not just keyword similarity. When an LLM asks about the OBBBA deduction, the graph retrieves not just Section 163(h)(4) but the Section 62/63 distinction and the phase-out formula in sequence.
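The traversal idea behind GraphRAG can be sketched without Neo4j. The edge labels below are invented to mirror the relationships described above (production retrieval runs Cypher queries against the real graph); the point is that retrieval walks typed statutory relationships in order, rather than ranking chunks by keyword similarity.

```python
# Sketch of GraphRAG-style retrieval: a breadth-first walk over typed
# statutory relationships. Edges are illustrative, not the real graph.
from collections import deque

edges = {
    "163(h)(4)(A)": [("exception_to", "163(h)(1)"),
                     ("capped_by", "163(h)(4)(B)"),
                     ("placed_in", "63(b)(7)")],
    "63(b)(7)":     [("distinct_from", "62")],   # below- vs above-the-line
}

def retrieve_context(start: str) -> list:
    """Collect (node, relation, target) triples reachable from start."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for relation, target in edges.get(node, []):
            order.append((node, relation, target))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

for triple in retrieve_context("163(h)(4)(A)"):
    print(triple)
```

A vector search for "car loan interest deduction" would surface blog text; the walk above surfaces the Section 62/63 distinction because it is an explicit edge, not a co-occurrence pattern.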
After the Heppner ruling (SDNY, February 2026), using public AI tools for tax research creates a privilege waiver risk. Judge Rakoff held that communications with publicly available AI platforms are not protected by attorney-client privilege. Morgan Lewis advises all in-house tax professionals to rely on closed, internal AI systems.
We design and deploy enterprise AI architectures where no data leaves the client's perimeter. The LLM runs self-hosted or in the client's VPC. The knowledge graph is local. The verification engine processes everything on-premises. For firms needing counsel-directed AI use (strengthening privilege claims under Kovel arrangements), we structure the architecture accordingly.
This is not about building another chatbot. It is about ensuring that your existing AI tax research workflows are defensible if the privilege question arises in litigation or examination.
78% of enterprises run 4-7 ERP systems (Phoenix Strategy Group). Tax data lives in SAP, Oracle, NetSuite, and sometimes Excel spreadsheets maintained by one person who is retiring next year. 50% of tax department leaders cite lack of a sustainable data strategy as their biggest barrier (EY).
We build the connectors. Apache Airflow for orchestration, dbt for GAAP-to-tax-basis transformations, OPA validation rules at each checkpoint to catch data quality issues before they propagate into returns. The goal is structured, validated tax data flowing continuously from source systems into whatever compliance platform you use.
This is the least glamorous work we do and frequently the most valuable. A verification engine is only as good as the data it receives.
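A checkpoint validation from the pipeline described above can be sketched as follows. In production these checks are OPA policies invoked between Airflow tasks; the field names and rules here are invented for illustration.

```python
# Sketch of a data-quality gate between pipeline stages. Field names
# ("entity_id", "debits", "credits") are hypothetical examples.

def validate_trial_balance(rows: list[dict]) -> list[str]:
    """Return violations; an empty list lets the load proceed."""
    errors = []
    for i, row in enumerate(rows):
        if not row.get("entity_id"):
            errors.append(f"row {i}: missing entity_id")
        if row.get("debits") != row.get("credits"):
            errors.append(f"row {i}: trial balance out of balance")
    return errors

rows = [
    {"entity_id": "US-01", "debits": 500.0, "credits": 500.0},
    {"entity_id": "",      "debits": 120.0, "credits": 100.0},
]
print(validate_trial_balance(rows))
# row 1 fails both checks; the batch is blocked before it reaches a return
```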
The GloBE calculation is deterministic. The OECD's January 2026 administrative guidance confirmed that Pillar Two has moved into the compliance phase. The formula is known. The difficulty is feeding it accurate entity-level financial data across every jurisdiction where you operate.
We build custom data pipelines connecting local statutory accounts to GloBE reporting requirements: effective tax rate computation per jurisdiction, qualified domestic minimum top-up tax modeling, and substance-based income exclusion calculations. The pipeline handles GAAP divergence, intercompany eliminations, and currency translation automatically. The deterministic calculation engine sits at the end of a clean data pipeline, not on top of manually reconciled spreadsheets.
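The deterministic step at the end of that pipeline is small. The figures below are illustrative, but the mechanics follow the OECD Model Rules: the top-up percentage is the 15% minimum less the jurisdictional effective tax rate, applied to excess profit (GloBE income less the substance-based income exclusion).

```python
# Worked sketch of the jurisdictional GloBE top-up tax computation.
# Inputs are illustrative; real inputs come from the data pipeline.

MIN_RATE = 0.15

def top_up_tax(covered_taxes: float, globe_income: float, sbie: float) -> float:
    etr = covered_taxes / globe_income          # jurisdictional ETR
    if etr >= MIN_RATE:
        return 0.0                              # at or above the minimum
    excess_profit = max(globe_income - sbie, 0.0)
    return (MIN_RATE - etr) * excess_profit

# Jurisdiction at a 9% ETR on 100M of GloBE income with a 20M carve-out:
# (0.15 - 0.09) * 80M -> 4.8M of top-up tax.
print(top_up_tax(covered_taxes=9_000_000,
                 globe_income=100_000_000,
                 sbie=20_000_000))
```

The formula is the easy part; every input above is the output of the entity-level data work the preceding paragraph describes.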
Every engagement starts with a scoping phase. We do not sell pre-built solutions because every enterprise tax environment is different.
We map your current tax technology stack: which platforms you use, how data flows between ERPs and compliance tools, where manual intervention happens, and which provisions carry the highest penalty exposure. The output is a risk-ranked list of verification targets and a detailed build specification. If the scoping reveals that off-the-shelf tools already solve your problem, we say so. Not every tax department needs a custom verification layer.
We encode the priority provisions in OPA/Rego, construct the relevant knowledge graph segments in Neo4j, build API connectors to your existing platforms, and deploy the verification engine in your environment. Each encoded provision goes through a validation cycle with your senior tax staff. The rule encoding is transparent: your team can read the OPA policies and confirm they match their understanding of the statute.
The verification engine runs in parallel with your existing workflow on real tax positions. We measure catch rate (errors identified), false positive rate (correct positions flagged), and integration stability. Adjustments happen in real time. The pilot period is when the knowledge graph gets refined based on your actual tax position universe, not hypothetical scenarios.
Congress makes an average of 420 changes to the tax code per year (Taxpayer Advocate Service). IRS publishes a continuous stream of notices, revenue rulings, and proposed regulations. We update the OPA rules, extend the knowledge graph, and add coverage for new provisions as your risk profile evolves. The maintenance engagement includes a quarterly review of verification performance metrics and priority adjustments.
We do not prepare tax returns. We do not replace your compliance platform. We do not offer legal advice or serve as your tax advisor. We build the technology layer that makes your existing tools and advisors more reliable. If you need a firm to prepare your returns, Thomson Reuters and Wolters Kluwer have excellent platforms. If you need someone to verify that the AI-assisted positions in those returns are consistent with the statute, that is our work.
Answer six questions about your current tax technology environment. The assessment identifies where verification gaps exist and what foundational steps are needed before building a verification layer.
You need a verification layer that operates independently of the AI tool producing the answer. The core problem with verifying AI tax research is that the same LLM biases that produce the wrong answer also produce convincing-sounding justifications. Asking the AI to "check its work" runs through the same probabilistic weights that generated the error.
Effective verification requires a separate system with deterministic logic. We build these as OPA/Rego policy engines encoding specific IRC provisions. The verification engine takes the AI's conclusion (for example, "this deduction reduces AGI") and tests it against the encoded statute. If the statute says otherwise, the engine returns a hard block with the specific section citation.
This works because the verification layer has no access to blog posts, training data, or popularity signals. It only knows what the statute says. For enterprise deployments, we typically start with 10-15 high-error-rate provisions (Section 199A QBI, Section 163(j) business interest limitation, Section 1031 like-kind exchanges, OBBBA QPVLI) where the penalty exposure is highest. The verification engine integrates via API with whatever tax platform you already use, whether that is ONESOURCE, CCH Axcess, Blue J, or an internal tool.
The CPA or tax advisor is liable. Every major tax software vendor disclaims liability for AI outputs. Thomson Reuters, Intuit, and Wolters Kluwer all include explicit disclaimers that AI-generated content is not tax advice and the professional remains responsible.
The AICPA's revised Statements on Standards for Tax Services (effective January 2024) require members to exercise due professional care when using electronic tools, and state boards of accountancy are drafting AI-specific guidance. The IRS does not care whether a wrong position was generated by a human, an AI, or a magic eight ball. Accuracy-related penalties under IRC Section 6662 apply a 20% penalty on underpayments attributable to negligence or substantial understatement, regardless of the tool used. Fraud penalties under Section 6663 reach 75%.
The February 2026 Heppner ruling adds another layer: if a tax professional uses a public AI tool and inputs privileged client information, that privilege may be waived entirely. This is why we build closed, enterprise-grade verification systems that keep sensitive data within the organization's perimeter. The verification audit trail we generate serves a defensive purpose as well. When an AI-assisted position is later questioned, a deterministic audit trail showing the statutory logic chain is stronger evidence of due diligence than "the AI said so."
It can. The Heppner ruling (February 10, 2026, SDNY, Judge Rakoff) established that communications with publicly available AI platforms are not protected by attorney-client privilege or work product doctrine. The defendant had input information learned from his attorneys into a public AI tool, and the court held this constituted disclosure to a third party, destroying the privilege.
For tax departments, the implications are significant. In-house tax counsel routinely research sensitive positions involving potential exposure, aggressive planning, or audit defense strategies. If that research is conducted through a public AI tool, the analysis, the questions asked, and the data provided may all become discoverable.
Morgan Lewis published detailed guidance in March 2026 recommending that all in-house tax professionals avoid inputting confidential or privileged information into public AI systems and instead rely on closed, internal AI systems accessible only to relevant persons within the organization. Enterprise AI architectures with proper Kovel-type arrangements (where the AI use is directed by counsel) offer stronger protection. We build these as self-hosted or private-cloud deployments where no data leaves the client's environment. The LLM runs within the perimeter, the knowledge graph is local, and the verification engine processes everything on-premises or in the client's VPC.
Blue J and ONESOURCE solve different problems. Blue J is a probabilistic tax research tool. It retrieves relevant authorities via RAG and generates answers grounded in curated sources. Its disagree rate of fewer than 1 in 700 is impressive, but that metric measures user disagreement, not statutory ground truth. A user who does not know the correct answer cannot disagree with a wrong one.
ONESOURCE is a compliance platform. Its deterministic engine handles tax calculation (rates, forms, filing), and ONESOURCE+ adds agentic AI for workflow automation. It is not designed to verify novel tax positions or catch misclassification errors in AI-generated research.
A deterministic verification engine does something neither tool does: it takes a specific tax position and tests it against encoded statutory logic. The engine does not generate answers. It validates them. Think of it as a compiler type-checker for tax positions. The position either satisfies the statutory conditions or it does not. When it does not, the engine returns the specific failure point (for example, "deduction classified as Section 62 but statute places it in Section 63(b)(7)"). This is complementary to both Blue J and ONESOURCE. Blue J generates the research. ONESOURCE prepares the return. The verification engine checks that the position taken is consistent with the statute before the return is filed.
It is a hybrid. The GloBE calculation itself is deterministic and well-suited to automation: compute the effective tax rate per jurisdiction, compare against the 15% minimum, calculate top-up tax. KPMG, EY, and Deloitte all offer Pillar Two calculation engines. The hard part is not the calculation. It is the data.
Pillar Two requires entity-level financial data across every jurisdiction where the multinational operates. That data lives in different ERPs, different chart-of-accounts structures, different local GAAP standards. Only 15% of Southeast Asian organizations report being fully prepared for Pillar Two compliance (EY, 2026). The bottleneck is connecting local statutory accounts to GloBE reporting requirements, not running the formula.
AI helps in two specific places: extracting and normalizing data from disparate sources, and translating between local GAAP treatments and the GloBE framework. We build custom data pipelines using Apache Airflow for orchestration and dbt for transformation, with OPA validation rules at each checkpoint to catch data quality issues before they propagate into the GloBE calculation. The calculation engine itself is deterministic. The data pipeline feeding it is where custom work is needed.
A focused verification engine covering 10-15 high-error-rate IRC provisions typically takes 8-12 weeks for the initial build and runs $150K-$300K depending on the complexity of the provisions and the number of tax platforms that need API integration. That includes the OPA policy encoding, knowledge graph construction for the relevant IRC cross-references, API connectors to your existing tax platform, and a pilot period with real tax positions.
For context, the average business tax return costs $9,090 in preparation alone (Fortune, 2026). A mid-market enterprise filing across 20 states spends $180K+ annually just on preparation labor. The verification engine adds a quality layer on top of that existing spend.
Ongoing maintenance runs $3K-$8K per month, covering annual tax code updates (Congress makes an average of 420 changes per year), new IRS guidance incorporation, and rule expansion. Larger engagements that include Pillar Two pipeline work, ERP data integration, or privilege-safe architecture design are scoped separately and typically run 4-6 months. We price these on a fixed-fee basis after a 2-week scoping engagement ($15K-$25K) that maps your current tax technology stack, identifies the highest-risk positions, and produces a detailed build specification.
The research behind this solution page, available as an interactive whitepaper.
The Stochastic Parrot vs. The Statutory Code: Consensus Error in AI Tax Compliance and the Neuro-Symbolic Remedy
A detailed analysis of how LLMs systematically produce incorrect tax advice through training-data bias, with a proposed neuro-symbolic architecture for deterministic tax verification.
With corporate audit rates rising to 22.6% and accuracy penalties at 20% of underpayment, a single misclassified provision costs more than a verification engine.
Start with a 2-week scoping engagement. We map your tax technology stack, identify your highest-risk provisions, and produce a build specification you can take to leadership.