
We Spent $35 Billion on AI and Got Almost Nothing Back
The call came on a Tuesday evening. A mid-market healthcare company — one we'd been advising — had just killed their flagship AI project. Nine months of work. Six-figure spend. The CTO sounded exhausted, not angry. "It worked perfectly in the demo," he told me. "Every single time. Then we plugged it into real patient data and it started hallucinating insurance codes."
I didn't know what to say, because I'd heard some version of this story a dozen times that quarter alone. AI that dazzles in a conference room and disintegrates in production. Pilots that generate excitement in month one and budget reviews in month six. The gap between what generative AI promises and what it delivers inside an actual enterprise is the defining tension of this moment in technology.
And now we have the numbers to prove it. The MIT NANDA initiative released a study in mid-2025 that landed like a grenade: of the estimated $30 to $40 billion enterprises have poured into generative AI, roughly 95% of pilots have failed to deliver measurable impact on the profit and loss statement. McKinsey's 2025 survey echoes the finding — 88% of organizations say they're using AI somewhere, but only 39% can point to any EBIT impact at all.
I run Veriprajna, where we build deep AI systems for enterprises. I'm not a neutral observer here. But I've been close enough to the wreckage — and to the rare successes — to have a clear picture of what's actually going wrong. And it's not what most people think.
The Demo Looked Great. Then Reality Showed Up.

That healthcare CTO's experience wasn't unusual. It was practically the median outcome.
MIT's data maps a brutal funnel: 80% of organizations explore generative AI tools. Only 20% get to a pilot. And just 5% ever reach production with measurable business results. The researchers call it a "learning gap," which is a polite way of saying that most companies don't understand what they've bought.
I remember sitting in our office after reading the full MIT report, arguing with my co-founder about whether the 95% number was too dramatic. It wasn't. If anything, it understated the problem, because many of the "successful" 5% had redefined success downward — they'd measured adoption rates or user satisfaction instead of actual revenue impact.
The pattern I keep seeing is this: a team builds a proof of concept using a major LLM. It handles the ten sample queries beautifully. Leadership gets excited. Budget gets approved. Then the system meets the real world — messy data, edge cases, ambiguous inputs, regulatory requirements where "close enough" is a lawsuit — and it falls apart.
The gap between demo-ready AI and production-ready AI is not a gap. It's a canyon, and most companies don't realize they're standing on the wrong side until they've already jumped.
In the MIT study, 60% of users reported that models couldn't learn from feedback over time, 55% said they spent excessive effort providing context for every single prompt, and 40% said the models simply "broke" when they hit non-standard inputs. These aren't exotic failure modes. These are Tuesday.
Why Are Companies Building on Quicksand?
Most of what the enterprise market calls an "AI product" right now is a wrapper — a thin user interface sitting on top of an API call to GPT-4 or Claude or Gemini. You type something in, it goes to the model, the model responds, and the wrapper formats it nicely.
I have a visceral memory of a pitch meeting where a potential client showed me their "AI-powered compliance engine." I asked what happened when the underlying model's behavior changed after a provider update. The room went quiet. They hadn't considered it. Their entire product was a prompt template and a nice dashboard. The "intelligence" they were selling was entirely rented.
This is the wrapper fallacy, and it's everywhere. The approach typically relies on what people in the industry call a "mega-prompt" — you cram rules, data, context, and instructions into a single massive interaction and hope the model sorts it out. I wrote about this architectural problem in more depth in the interactive version of our research, but the short version is that mega-prompts create three fatal problems:
You can't audit them. There's no way to verify that the model followed instructions in the correct order. For compliance-heavy industries, this is a non-starter.
They're economically fragile. Long context windows and retries burn through tokens. And here's a number that shocked me when I first saw it: the difference between an efficient and an inefficient tokenizer can mean a 4.5x cost difference for the same workload. An enterprise processing 100,000 daily customer inquiries could see annual costs jump from $36,500 to over $164,000 just by picking the wrong model for a multilingual use case.
They're brittle. Change three words in a prompt and you get a completely different output. Try building an SLA on that.
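The arithmetic behind that annual jump is worth making explicit. Here's a minimal sketch in Python; the blended per-token price and the per-query token counts are illustrative assumptions chosen to reproduce the figures above, not quoted provider rates:

```python
# Illustrative token-cost model. The blended price and per-query token
# counts are assumptions, not quoted provider rates.
DAILY_QUERIES = 100_000
PRICE_PER_MILLION_TOKENS = 1.00  # USD, assumed blended input/output rate

def annual_cost(tokens_per_query: int) -> float:
    """Annual spend for a fixed daily query volume."""
    daily_tokens = DAILY_QUERIES * tokens_per_query
    daily_cost = daily_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS
    return daily_cost * 365

efficient = annual_cost(1_000)    # efficient tokenizer: ~1,000 tokens/query
inefficient = annual_cost(4_500)  # inefficient tokenizer on the same text

print(f"efficient:   ${efficient:,.0f}/yr")    # $36,500/yr
print(f"inefficient: ${inefficient:,.0f}/yr")  # $164,250/yr
```

The point of the exercise: nothing else changed. Same queries, same volume, same workload — the only variable is how many tokens the model's tokenizer burns on the same text.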
The economic trap is even worse than the technical one. When OpenAI or Anthropic drops their API prices — and they will keep dropping them — wrapper companies see their margins evaporate. They don't own the data. They don't own the workflow. They're reselling someone else's intelligence with a markup, and the moment the landlord cuts rent for everyone, the subletter has no business.
What Does "Deep AI" Actually Mean?

I'll tell you the moment the concept clicked for me.
We were working on a document processing system for a logistics client. The initial approach was straightforward: send the shipping document to an LLM, ask it to extract the relevant fields, return the results. It worked on standard forms. Then we hit a container manifest from a Southeast Asian port with mixed-language annotations, handwritten corrections, and a format that didn't match anything in the training data. The model confidently returned garbage.
My lead engineer, frustrated after a week of prompt engineering that kept producing new failure modes, finally said: "We're asking one brain to do seven jobs. What if we gave each job to a specialist?"
That's deep AI in a sentence. Instead of treating the LLM as an oracle that handles everything, you treat it as one component in a larger system. You decompose the problem. One agent handles query understanding. Another retrieves data from a structured database. A third validates the output against known rules. A fourth formats the response. Each agent has a defined responsibility, and the workflow between them is deterministic — meaning you control the sequence, the logic, and the checkpoints.
Deep AI treats the language model as a gifted intern, not a CEO. You give it specific tasks within a governed structure, not the keys to the building.
The agentic design patterns that make this work aren't theoretical. They're being deployed now:
A reflection pattern where the agent critiques its own output before sending it to the user. A tool use pattern where the agent calls external calculators, APIs, or databases instead of trying to compute answers from memory. A planning pattern that breaks complex goals into sequential steps. And an orchestration pattern where a supervisor agent manages the entire workflow, routing tasks to the right specialist.
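Here's what the orchestration pattern with deterministic handoffs can look like in miniature. The agent functions below are stubs standing in for model calls or services, and every name is illustrative rather than a real framework's API:

```python
# Minimal multi-agent orchestration sketch. Each "agent" is a stub
# standing in for a model call or a deterministic service; the names
# are illustrative, not a real framework's API.

def understand(query: str) -> dict:
    """Specialist 1: parse the user's request into a structured task."""
    return {"intent": "extract_fields", "source": query}

def retrieve(task: dict) -> dict:
    """Specialist 2: pull supporting records from a structured store."""
    task["records"] = ["record-001"]  # stand-in for a database lookup
    return task

def validate(task: dict) -> dict:
    """Specialist 3: check the draft output against known business rules."""
    task["valid"] = bool(task.get("records"))
    return task

def format_response(task: dict) -> str:
    """Specialist 4: render the validated result for the caller."""
    return f"{task['intent']}: {len(task['records'])} record(s), valid={task['valid']}"

def supervisor(query: str) -> str:
    """The sequence and checkpoints live in ordinary code, so every
    step is observable and auditable -- not buried in one mega-prompt."""
    task = understand(query)
    task = retrieve(task)
    task = validate(task)
    if not task["valid"]:
        raise ValueError("validation checkpoint failed")  # fail loudly, visibly
    return format_response(task)

print(supervisor("extract fields from manifest X"))
```

The supervisor is just code. That's the whole trick: the routing, ordering, and failure handling are deterministic and inspectable, and only the narrow specialist tasks get delegated to a model.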
When we rebuilt that logistics system using multi-agent orchestration, the extraction accuracy on non-standard documents went from roughly 60% to over 95%. More importantly, when it did fail, we could see exactly where and why — because the system wasn't a black box anymore. It was a pipeline with observable, auditable steps.
Why Does Token Cost Kill Enterprise AI ROI?
This is the part that doesn't get enough attention.
Everyone talks about model accuracy. Almost nobody talks about the unit economics of running these systems at scale. But I've watched token costs quietly murder the business case for AI projects that were otherwise working perfectly.
The math is straightforward but brutal. Different models tokenize text differently — especially non-English text and complex scripts. A query that costs 800 tokens on one model might cost 4,500 on another. Multiply that by hundreds of thousands of daily interactions, and you're looking at a cost difference that wipes out whatever efficiency gains the AI was supposed to deliver.
I had a moment of genuine alarm when we ran the tokenization analysis for a client operating in Tamil and English. The cost differential between their current model and a more efficient alternative was 4.5x. They were bleeding money on every single interaction and attributing it to "infrastructure costs" in their budget. Nobody had thought to look at the tokenizer.
Deep AI systems address this by being surgical about when they use expensive LLM tokens. High-volume, low-complexity tasks get handled by smaller models or deterministic logic. The expensive reasoning capability gets reserved for the steps where it actually matters. It's the difference between hiring a senior consultant to answer every phone call versus having them focus on the decisions that require judgment.
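One way to sketch that surgical routing in code — the complexity heuristic, the threshold, and the model tiers here are all illustrative assumptions, not a production design:

```python
# Sketch of cost-aware model routing. The keyword heuristic and tiers
# are illustrative assumptions; a real system would use a trained
# classifier or an explicit task taxonomy.

ROUTINE_KEYWORDS = {"status", "hours", "tracking", "password"}

def complexity(query: str) -> float:
    """Crude heuristic: routine keywords and short queries score low."""
    words = query.lower().split()
    if any(w in ROUTINE_KEYWORDS for w in words):
        return 0.1
    return min(1.0, len(words) / 20)

def route(query: str) -> str:
    """Send cheap work to a small model; reserve expensive reasoning."""
    if complexity(query) < 0.3:
        return "small-model"   # high-volume, low-complexity path
    return "large-model"       # judgment calls only

print(route("what are your hours"))  # small-model
print(route("reconcile these three invoices against the disputed purchase order and flag discrepancies"))
```

Even a heuristic this crude changes the unit economics, because the expensive model stops answering the phone for every routine question.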
The 10-20-70 Rule That Nobody Follows

When I talk to executives about why their AI projects stalled, they almost always point to technology. The model wasn't good enough. The data wasn't clean. The integration was too complex.
They're not wrong about any of that. But they're missing the real ratio. The companies that are actually seeing EBIT impact — and McKinsey says only 6% are seeing more than 5% of total EBIT from AI — follow a resource allocation that would surprise most technologists:
10% of the effort goes to choosing and tuning the algorithms. 20% goes to building the data and technology infrastructure. 70% goes to managing people, processes, and cultural transformation.
Seventy percent. Not on technology. On getting humans to change how they work.
I resisted this idea for longer than I should have. I'm an engineer by instinct. I wanted to believe that if we built a better system, adoption would follow. It took a painful project — one where we delivered a technically excellent solution that sat unused for three months because nobody had redesigned the workflow around it — to internalize that the technology is the easy part.
Mid-market firms that follow the 10-20-70 principle improve their EBITDA by 160 to 280 basis points within 24 months. The ones that spend 70% on technology and 10% on change management get expensive shelfware.
The wins aren't glamorous. Revenue cycle management. Cash application automation. Cloud cost optimization. Nobody writes breathless LinkedIn posts about reducing discharged-but-not-final-billed claims backlogs. But Inova Health System cut that backlog by 50% and saved $1.3 million annually. OSF HealthCare's AI virtual assistants saved $1.2 million while increasing revenue by another $1.2 million. UPS saves $400 million a year through AI-based routing.
These aren't pilot results. These are production systems running at scale, built with the kind of deep integration that wrappers can't touch.
What Happens When AI Agents Start Acting on Their Own?
The shift from AI that answers questions to AI that takes actions changes the security calculus entirely.
I've been thinking about this a lot, partly because of a near-miss we had during testing. We were building an agentic system that needed to access a client's ERP to pull inventory data. During a test run, the agent — following a chain of reasoning that was technically logical but contextually wrong — attempted to modify a purchase order instead of just reading it. We had safeguards in place. It didn't go through. But I sat at my desk afterward thinking about what would have happened if we'd been less careful.
This is why standards like the Model Context Protocol (MCP) and the NANDA framework matter so much. MCP — developed by Anthropic — acts as a standardized integration layer between AI agents and enterprise data sources. People call it the "USB-C of AI," which is apt: it means you don't need custom, fragile integrations for every connection. NANDA provides the governance layer — cryptographically verifiable capability attestation (meaning you can prove what an agent is and isn't allowed to do), zero-trust access controls extended to autonomous agents, and centralized audit trails.
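The governance idea itself is simple to sketch, even if the real protocols carry far more machinery. The code below is emphatically not the MCP or NANDA API — just an illustration of deny-by-default capability checks with an audit trail, the kind of safeguard that caught the purchase-order near-miss above:

```python
# Illustration of a capability gate in the spirit of the governance
# layer described above. This is NOT the MCP or NANDA API -- just a
# sketch of zero-trust, deny-by-default checks on agent actions.

ALLOWED = {
    # agent id -> set of (resource, action) pairs it may perform
    "inventory-reader": {("erp.purchase_order", "read")},
}

AUDIT_LOG = []  # centralized record of every attempt, allowed or not

def attempt(agent_id: str, resource: str, action: str) -> bool:
    """Deny by default; log every attempt for the audit trail."""
    permitted = (resource, action) in ALLOWED.get(agent_id, set())
    AUDIT_LOG.append((agent_id, resource, action, permitted))
    return permitted

print(attempt("inventory-reader", "erp.purchase_order", "read"))   # True
print(attempt("inventory-reader", "erp.purchase_order", "write"))  # False: blocked and logged
```

Note what the log captures: the blocked attempt, not just the successful one. When an agent reasons its way to a contextually wrong action, you want the refusal on record, not silently swallowed.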
For the full technical breakdown of these architectural patterns and how they fit together, see our research paper.
The point isn't that agentic AI is dangerous and we should slow down. The point is that the wrapper approach — where you have minimal visibility into what the model is doing and why — becomes genuinely reckless when the model can take real-world actions. Deep AI systems with observable, governed workflows aren't just better engineering. They're the only responsible way to deploy autonomous agents in an enterprise.
"Just Use GPT" and Other Expensive Advice
People ask me all the time whether they should just wait for the models to get better. "GPT-5 will solve this," I heard an investor say at a dinner. "Why build all this infrastructure when the next model version will handle it natively?"
I understand the appeal of this argument. It's clean. It requires no hard work. And it's wrong.
Better models don't fix the wrapper problem. They make it worse. A more powerful model in a mega-prompt architecture is like putting a Formula 1 engine in a car with no steering wheel. You go faster in the wrong direction. The issues that kill enterprise AI — lack of auditability, brittle prompts, no feedback loops, missing business context, uncontrolled costs — are architectural problems, not capability problems.
The shadow AI economy proves this. Over 90% of employees are already secretly using personal ChatGPT or Claude accounts for work because their company's official AI tools are too rigid. The models are capable enough. The systems around them are not.
Better models don't save bad architecture. They just hallucinate faster and more confidently.
The other question I get is about timeline. "How long does this actually take?" The honest answer is 12 to 18 months to go from scattered experiments to AI that moves the P&L. The first three months are discovery — identifying where AI can create value without creating regulatory exposure. Months three through six are data readiness, which is where 58% of CXOs say they get stuck. Months six through twelve are building and iterating multi-agent prototypes — and I mean 30+ iteration cycles against real-world data, not three polished demos. The final phase is production deployment with full operational support: drift detection, bias monitoring, cost governance.
It's not fast. It's not easy. But the companies that do it are the ones showing up in McKinsey's 6% with real EBIT impact.
The Divide Is a Choice
The "GenAI Divide" that MIT identified isn't a technology gap. It's a decision gap.
On one side: companies that treated generative AI as a product to buy, a wrapper to deploy, a demo to show the board. They're the 95%. They spent real money and got press releases.
On the other side: companies that treated AI as an architectural challenge — one that requires decomposing problems, governing workflows, redesigning processes, and doing the unglamorous work of connecting models to the messy reality of enterprise data. They're the 5%. They spent similar money and got EBIT impact.
I think about that healthcare CTO sometimes. The one who called me on a Tuesday, exhausted, having just killed his AI project. He called again four months later. His team had rebuilt the system using a multi-agent approach — separate agents for data extraction, code validation, and compliance checking, with deterministic handoffs between them. It wasn't as elegant as the original demo. It was slower to build. It required more upfront thinking about workflow design and failure modes.
It worked. Not perfectly — nothing does — but reliably enough to deploy, audit, and improve. Reliably enough to show up on a P&L statement.
The era of treating AI like a magic trick is over. What comes next is harder, slower, and less photogenic. It's also the only thing that actually works.


