Causal Inference That Answers Intervention Questions, Not Just Predictions
Causal inference pipelines that answer intervention and counterfactual questions from observational data when A/B testing is impossible or insufficient.
Solutions for Causal & Counterfactual Modeling
AI Procurement Fairness & Supplier Diversity Compliance
Audit your procurement AI for bias. Veriprajna builds vendor-agnostic fairness testing for SAP Ariba, Coupa, GEP, and Ivalua supplier scoring, ensuring FAR Part 19 compliance and provable algorithmic equity.
Ethical Subscription Retention AI
Amazon paid $2.5 billion for a cancel flow that took 6 clicks. Uber is facing 21 state attorneys general over 23 screens to cancel.
Medicare Advantage AI Governance & Algorithmic Compliance
Audit, explain, and defend your Medicare Advantage AI. Explainability middleware, CMS-0057-F compliance architecture, and litigation readiness for health plan algorithms.
Frequently Asked Questions
How much does a causal inference engagement cost compared to running more A/B tests?
The cost comparison depends on what you are trying to learn and whether experimentation is even possible. A/B tests are the gold standard when you can randomize, but many enterprise decisions cannot be tested: you cannot randomly assign prices to customer segments for months, randomly open or close stores, or randomly deny loans to measure disparate impact. In those cases the comparison is not causal inference versus A/B testing but causal inference versus guessing. For decisions where you can run experiments, causal inference from observational data often supplements A/B tests by measuring long-term effects, estimating heterogeneous treatment effects across subgroups, and bridging the gap when experiment sample sizes are too small for reliable subgroup analysis. Engagement scope typically ranges from a focused causal analysis of a single treatment (graph elicitation, estimation, sensitivity analysis, documentation) to a production causal pipeline with automated monitoring. The ROI case is clearest when the intervention cost is high: if you are spending millions on promotions, pricing changes, or policy shifts, knowing which subgroups actually respond versus who would have acted anyway pays for the engagement quickly.
We tried NOTEARS for automated causal discovery and the graph changes every time we rescale variables. Is this broken?
This is a known limitation, not a bug. CausaLens published research demonstrating that NOTEARS lacks scale-invariance: changing the units of a variable (meters to centimeters, dollars to thousands) alters the discovered graph. The algorithm uses an L1 penalty on edge weights, and rescaling changes those weights, which changes which edges survive penalization. DECI and other neural causal discovery methods partially address this but introduce their own issues with mixed-type data and missing values. The deeper problem is that fully automated causal discovery from observational data alone remains an open research problem. These algorithms assume no unmeasured confounders, which no enterprise dataset can guarantee. We use automated discovery as a hypothesis generator: run it, surface candidate edges, then have domain experts validate or reject each one. The production graph is built collaboratively with the people who understand the data-generating process, with formal identifiability checks before any estimation begins.
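The rescaling mechanism can be illustrated with a toy example. This is a minimal sketch, not NOTEARS itself: it uses the closed-form solution of a one-predictor L1-penalized least squares fit, with simulated data and a hypothetical penalty of 0.1, to show how the same relationship keeps or loses its "edge" depending only on the units of the cause.

```python
import numpy as np

def lasso_1d(x, y, alpha):
    """Closed-form minimizer of (1/2n)*||y - b*x||^2 + alpha*|b| for one predictor."""
    n = len(x)
    rho = x @ y / n   # scales linearly with the units of x
    z = x @ x / n     # scales with the square of the units of x
    # Soft-thresholding: the edge is dropped whenever |rho| <= alpha.
    return np.sign(rho) * max(abs(rho) - alpha, 0.0) / z

rng = np.random.default_rng(0)
spend_dollars = rng.normal(size=2000)                 # simulated cause, in dollars
outcome = 0.5 * spend_dollars + rng.normal(size=2000) # simulated effect
spend_thousands = spend_dollars / 1000.0              # same variable, in thousands

alpha = 0.1  # hypothetical penalty strength, identical in both runs
edge_dollars = lasso_1d(spend_dollars, outcome, alpha)
edge_thousands = lasso_1d(spend_thousands, outcome, alpha)

print("edge weight with spend in dollars:  ", edge_dollars)
print("edge weight with spend in thousands:", edge_thousands)
```

With the penalty held fixed, the edge survives when spend is in dollars and is shrunk to exactly zero when spend is in thousands, even though the underlying relationship is unchanged. Standardizing variables before discovery removes this particular dependence but does not address the unmeasured-confounder assumption.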
Our DML pipeline gives wildly different CATE estimates depending on how we split the data. How do we fix this?
This is almost always a cross-fitting implementation problem. Double/debiased machine learning requires that the nuisance models (treatment propensity and outcome regression) be fit on different data folds than the ones used for the final causal estimate. If cross-fitting is not implemented correctly, or if the number of folds is too small, the nuisance models overfit to the estimation sample, and the resulting CATE estimates are driven by that overfitting rather than by actual treatment effect heterogeneity. We diagnose this by checking first-stage prediction quality across folds, testing estimate stability under different fold counts and random seeds, and verifying that the Neyman orthogonality condition holds in practice. If the first-stage models are weak predictors of either treatment or outcome, the orthogonalization that makes DML work breaks down. In that case, the fix is not more folds but better nuisance models, or switching to a method that does not depend on strong first-stage prediction, such as instrumental variables or a direct experimental design.
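The stability check described above can be sketched in a few lines. This is an illustrative example under simplifying assumptions: simulated data, a constant treatment effect rather than a full CATE, and plain least squares standing in for the nuisance models. The function names are hypothetical and are not taken from any library.

```python
import numpy as np

def ols_fit_predict(X_train, y_train, X_test):
    """Fit OLS with an intercept on the training folds, predict on the held-out fold."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def dml_plr(X, t, y, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of a constant effect of t on y."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    t_res = np.empty(len(y))
    y_res = np.empty(len(y))
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        # Nuisance models are fit only on data outside the held-out fold.
        t_res[test_idx] = t[test_idx] - ols_fit_predict(X[train_idx], t[train_idx], X[test_idx])
        y_res[test_idx] = y[test_idx] - ols_fit_predict(X[train_idx], y[train_idx], X[test_idx])
    # Residual-on-residual regression gives the orthogonalized estimate.
    return (t_res @ y_res) / (t_res @ t_res)

# Simulated data with a known effect of 2.0 (assumption for illustration).
rng = np.random.default_rng(42)
n = 5000
X = rng.normal(size=(n, 3))
t = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)
y = 2.0 * t + X @ np.array([1.0, 1.0, -1.0]) + rng.normal(size=n)

# Stability check: vary the fold count and the split seed.
estimates = [dml_plr(X, t, y, n_folds=k, seed=s) for k in (2, 5, 10) for s in (0, 1, 2)]
spread = max(estimates) - min(estimates)
print("estimates:", np.round(estimates, 3), "spread:", round(spread, 4))
```

In a healthy pipeline the spread across folds and seeds is small relative to the standard error of the estimate. A large spread on real data points to overfit nuisance models or leakage between folds.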
Can causal AI help us prove our lending model does not discriminate, for Colorado AI Act compliance?
Causal methods are the right approach here because the regulatory question is inherently causal: does the protected attribute cause the adverse outcome, or is the statistical disparity explained by legitimate risk factors? The Colorado AI Act, effective June 30, 2026, requires deployers of high-risk AI to exercise reasonable care against algorithmic discrimination, including documented disparate impact testing across protected classes. Statistical disparity analysis (comparing approval rates across groups) tells you whether a gap exists. Causal analysis tells you why it exists: whether the gap is driven by the protected attribute itself, by proxies correlated with it, or by legitimate underwriting factors that happen to be distributed differently across groups. We build this as a causal mediation analysis: constructing the DAG of how applicant attributes flow through the model to the decision, identifying direct and indirect causal paths from protected attributes to outcomes, and quantifying how much of the observed disparity flows through each path. The output is a document that maps each causal claim to its identifying assumptions, satisfying the Act's documentation requirements.
What is the difference between causal forests and meta-learners, and which works better at enterprise scale?
Both estimate conditional average treatment effects (CATE), which tell you who benefits most from a treatment. Causal forests (Wager and Athey, building on the causal trees of Athey and Imbens) split data to maximize treatment effect heterogeneity directly. Meta-learners (S-Learner, T-Learner, X-Learner) repurpose standard ML models for causal estimation by training outcome models and computing treatment effects as differences in predictions. At enterprise scale, meta-learners currently win on practicality. A recent large-scale benchmark on 13.98 million customer records showed S-Learner with LightGBM achieving the highest uplift, with the top 20% of customers capturing 77.7% of all incremental conversions. Causal forests require subsampling at that scale due to computational constraints. Meta-learners also integrate more naturally into existing ML pipelines since they use standard supervised learners as base models. We evaluate both on each client's data because the best method depends on the treatment effect structure: causal forests can discover effect heterogeneity along unexpected dimensions, while meta-learners leverage the predictive power of gradient boosting for settings where the treatment effect pattern is smoother.
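To make "differences in predictions" concrete, here is a minimal T-Learner sketch. It assumes simulated data with a randomized treatment and uses least squares as the base learner purely to stay dependency-free. In practice the base learner would usually be a gradient-boosted model, and the helper names here are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least squares with an intercept; stands in for any supervised base learner."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def t_learner_cate(X, t, y):
    """T-Learner: one outcome model per arm, CATE = difference in predictions."""
    mu1 = fit_ols(X[t == 1], y[t == 1])   # outcome model for treated units
    mu0 = fit_ols(X[t == 0], y[t == 0])   # outcome model for control units
    return predict_ols(mu1, X) - predict_ols(mu0, X)

# Simulated data: the true effect grows with the first covariate.
rng = np.random.default_rng(7)
n = 4000
X = rng.normal(size=(n, 2))
t = rng.integers(0, 2, size=n)
true_cate = 1.0 + 2.0 * X[:, 0]
y = X[:, 1] + true_cate * t + rng.normal(size=n)

cate_hat = t_learner_cate(X, t, y)
corr = np.corrcoef(cate_hat, true_cate)[0, 1]
print("correlation between estimated and true CATE:", round(corr, 3))
```

An S-Learner would instead fit a single model on the covariates plus the treatment indicator and difference its predictions with the indicator set to 1 and to 0.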
How do we handle positivity violations when certain customer segments never received the treatment?
A positivity violation means that for some covariate combinations, either every observed unit received the treatment or none did, so the data contain no comparison group for those units. Standard inverse propensity weighting amplifies noise catastrophically for near-zero propensity scores, producing treatment effect estimates dominated by a handful of extreme weights. We diagnose positivity before estimation by examining the propensity score distribution and flagging regions with minimal overlap between treated and control groups. The fix depends on the type of violation. For practical violations (certain segments are rarely treated but treatment is possible), overlap weights or trimming restrict estimation to the region where both treatment arms have adequate representation, and we clearly document which subpopulations have reliable estimates versus which are extrapolated. For structural violations (certain segments can never receive the treatment), the causal question itself is ill-defined for those groups, and the honest answer is to exclude them from the estimand rather than pretend you can estimate something that the data cannot support.
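A pre-estimation overlap check along these lines can be sketched as follows. The example assumes propensity scores have already been estimated and lie strictly between 0 and 1. The trimming bounds of 0.05 and 0.95 are a common convention rather than a rule, and the function name and toy scores are hypothetical.

```python
import numpy as np

def overlap_report(propensity, treated, lo=0.05, hi=0.95):
    """Summarize overlap for estimated propensity scores strictly inside (0, 1)."""
    e = np.asarray(propensity, dtype=float)
    t = np.asarray(treated, dtype=int)
    in_overlap = (e >= lo) & (e <= hi)
    # Inverse propensity weights blow up as e approaches 0 or 1.
    ipw = np.where(t == 1, 1.0 / e, 1.0 / (1.0 - e))
    # Overlap weights are bounded: treated units get 1 - e, controls get e.
    overlap_w = np.where(t == 1, 1.0 - e, e)

    def effective_sample_size(w):
        # Kish effective sample size: small values mean a few units dominate.
        return float(w.sum() ** 2 / (w ** 2).sum())

    return {
        "n_trimmed": int((~in_overlap).sum()),
        "max_ipw_weight": float(ipw.max()),
        "ess_ipw": effective_sample_size(ipw),
        "ess_overlap": effective_sample_size(overlap_w),
        "keep_mask": in_overlap,
    }

# Toy scores: three units sit in regions with almost no overlap.
scores = [0.01, 0.02, 0.30, 0.50, 0.70, 0.98]
treated = [1, 0, 1, 0, 1, 0]
report = overlap_report(scores, treated)
print({k: v for k, v in report.items() if k != "keep_mask"})
```

A low effective sample size under inverse propensity weights, together with a large maximum weight, is the signal that a handful of units are driving the estimate. The keep mask identifies the subpopulation for which estimates are supported by the data.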
Can we use synthetic control methods to measure the impact of a new store opening without a control group?
This is one of the strongest use cases for synthetic control. You have one treated unit (the market where the new store opened) and no randomized control. The synthetic control method constructs a weighted combination of untreated markets that matches the treated market's pre-intervention trajectory on key outcomes. The treatment effect is the divergence between the treated market's actual post-opening performance and the synthetic control's predicted performance. The method works well when you have enough donor markets with similar characteristics and a sufficiently long pre-intervention period to validate the match. It breaks down when no combination of control markets can reproduce the treated market's pre-trend, or when other interventions (competitor openings, macro shocks) hit the treated market simultaneously. We implement synthetic control with placebo tests (applying the method to untreated markets to verify that it does not find spurious effects) and pre-trend fit diagnostics that quantify match quality before anyone looks at post-intervention results.
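The weighting step can be sketched as a constrained least squares problem: find non-negative donor weights that sum to one and best reproduce the treated market's pre-opening series. The example below is a simplified illustration on simulated, standardized data, solved with projected gradient descent. It matches on the outcome series only, omits covariate matching and inference, and all names and numbers are assumptions.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = idx[u - css / idx > 0][-1]
    return np.maximum(v - css[rho - 1] / rho, 0.0)

def synthetic_control_weights(donors_pre, treated_pre, n_iter=5000):
    """donors_pre has shape (T0, J); treated_pre has shape (T0,)."""
    n_donors = donors_pre.shape[1]
    w = np.full(n_donors, 1.0 / n_donors)
    step = 1.0 / np.linalg.norm(donors_pre, 2) ** 2   # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = donors_pre.T @ (donors_pre @ w - treated_pre)
        w = project_to_simplex(w - step * grad)
    return w

# Simulated standardized sales indices: 40 pre-periods, 10 post-periods, 5 donors.
rng = np.random.default_rng(3)
T0, T1 = 40, 10
donors = rng.normal(size=(T0 + T1, 5))
w_true = np.array([0.6, 0.3, 0.1, 0.0, 0.0])
treated_series = donors @ w_true + rng.normal(scale=0.05, size=T0 + T1)
treated_series[T0:] += 2.0   # simulated effect of the store opening

w_hat = synthetic_control_weights(donors[:T0], treated_series[:T0])
pre_rmse = np.sqrt(np.mean((treated_series[:T0] - donors[:T0] @ w_hat) ** 2))
effect = np.mean(treated_series[T0:] - donors[T0:] @ w_hat)
print("weights:", np.round(w_hat, 3), "pre-period RMSE:", round(pre_rmse, 3), "effect:", round(effect, 3))
```

The pre-period RMSE is the fit diagnostic to inspect before looking at the post-period gap. A placebo test reruns the same function with each donor market in turn treated as if it were the one that received the store.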
Is CausaLens worth the platform cost, or should we build on PyWhy open source?
CausaLens provides a no-code Decision Intelligence Platform with causal discovery, estimation, and optimization capabilities, backed by $50M+ in funding and enterprise deployment experience. The PyWhy ecosystem (DoWhy, EconML) is open source, backed by Microsoft Research, with roughly 24,000 GitHub stars across the ecosystem and broad community adoption. The platform versus open-source decision depends on your team's causal inference expertise and your tolerance for methodological opacity. If you have the statistical expertise to validate assumptions, diagnose failures, and interpret sensitivity analyses, PyWhy gives you full control at zero license cost. If you want a managed interface and are comfortable trusting the platform's methodological choices, CausaLens reduces engineering effort. The risk with any platform is that it abstracts away the methodology, and causal inference is precisely the domain where methodological choices (graph structure, estimator selection, sensitivity analysis) are load-bearing. We work with both: building on PyWhy when the client has engineering capacity, integrating with CausaLens when the client has already invested in the platform. The methodology layer is the same regardless of tooling.
What is the risk of deploying treatment effect estimates from observational data without sensitivity analysis?
The risk is that you make a confident business decision based on an estimate that could be entirely explained by an unmeasured confounder. Every observational causal estimate assumes you have measured all relevant confounders. That assumption is untestable. Sensitivity analysis quantifies how wrong you could be: E-values tell you the minimum strength of confounding (relative risk) that an unmeasured variable would need to have with both treatment and outcome to explain away your estimate. Rosenbaum bounds give you the maximum amount of hidden bias consistent with your confidence interval still excluding zero. Without these, a CATE estimate of +15% incremental conversion could be real, or it could be an artifact of a single unmeasured variable. We include sensitivity analysis in every causal deliverable, not as an appendix but as the primary credibility metric. For regulatory submissions, this is not optional: documenting the robustness of causal claims to unmeasured confounding is part of demonstrating reasonable care under frameworks like the Colorado AI Act.
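For a point estimate expressed as a risk ratio, the E-value has a closed form (VanderWeele and Ding): E = RR + sqrt(RR * (RR - 1)), applied to 1/RR when the ratio is below 1. The short sketch below computes it. The function name is hypothetical, and it covers only risk-ratio point estimates, not confidence limits or other effect scales.

```python
import math

def e_value(risk_ratio):
    """E-value for a risk ratio point estimate."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    # Protective effects (RR < 1) are inverted before applying the formula.
    rr = risk_ratio if risk_ratio >= 1 else 1.0 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1.0))

for rr in (1.15, 2.0, 0.5):
    print(f"risk ratio {rr}: E-value {e_value(rr):.2f}")
```

An estimate with an E-value close to 1 can be explained away by a weak unmeasured confounder. The same formula applied to the confidence limit nearest the null shows how much confounding would be needed to make the interval include no effect.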
Build Your AI with Confidence.
Partner with a team that has deep experience in building the next generation of enterprise AI. Let us help you design, build, and deploy an AI strategy you can trust.
Veriprajna Deep Tech Consultancy specializes in building safety-critical AI systems for healthcare, finance, and regulatory domains. Our architectures are validated against established protocols with comprehensive compliance documentation.