
Why I Stopped Trusting AI and Started Building Oracles Instead
The email came in at 11:47 PM on a Tuesday. A battery manufacturer we'd been talking to had just pulled a shipment of cells off the line. Not because they failed a test — because their AI-assisted material screening tool had passed a candidate electrolyte that, when a human chemist finally ran the numbers, turned out to be thermodynamically unstable above 150°C. The material would have decomposed inside a battery pack. The decomposition would have released heat. The heat would have triggered what the industry euphemistically calls "thermal runaway" — and what the rest of us call a fire.
Nobody got hurt. But I sat at my desk staring at that email and thinking about the word "plausible." The AI hadn't been wrong in any obvious way. The molecular structure it recommended looked reasonable. The formation energy it predicted was in the right ballpark. It was plausible. It just wasn't true.
That distinction — between plausible and true — is the fault line running through the entire AI industry right now. And it's the reason I built Veriprajna.
The Wrapper Economy Has a Truth Problem
Here's what most people don't realize about the current wave of AI products: the vast majority of them are thin interface layers — "wrappers" — sitting on top of general-purpose Large Language Models. The LLM predicts the next most likely token. The wrapper makes it look like an app. The user assumes they're getting answers. They're getting probabilities.
For writing marketing copy or summarizing meeting notes, this is fine. Probabilities are good enough. But the companies I work with don't have the luxury of "good enough." They make batteries that go into electric vehicles. They produce audio content that gets broadcast globally. For them, an answer that's 99% plausible but 1% physically impossible isn't a rounding error. It's a thermal event or a copyright lawsuit.
When your AI is responsible for something that can catch fire or get you sued, "statistically likely" is not the same as "correct."
I've started calling this the bifurcation of AI. On one side, the Wrapper Economy — fast, accessible, built on stochastic prediction. On the other, what we do at Veriprajna: Deep AI, where every output gets validated against immutable rules before a human ever sees it. Physics. Logic. Provenance. The things that don't care about your training data distribution.
What Happens When AI Predicts Chemistry It Doesn't Understand?
Let me make this concrete with the battery problem, because it haunts me.
Lithium-ion batteries fail through a deterministic sequence of chemical breakdowns. It starts around 80–100°C when the protective layer on the anode — called the Solid Electrolyte Interphase — decomposes. By 110–135°C, the separator melts and the electrolyte starts breaking down into flammable gases. Above 200°C, the cathode collapses, releases oxygen, and you get combustion.
The electrolyte is the critical variable. Traditional liquid electrolytes — typically lithium hexafluorophosphate dissolved in carbonate solvents — are chemically unstable at elevated temperatures. They're literally the fuel source in the combustion event. To prevent thermal runaway, especially in high-voltage or high-temperature applications, we need electrolytes with decomposition energies that keep them stable well beyond that 200°C threshold.
The problem is finding them. The chemical space of possible inorganic crystals contains an estimated 10^100 combinations. For decades, materials scientists have explored this space the way Edison tested filaments: hypothesize a structure, synthesize it in a lab, test it, wait months for results. And human intuition biases us toward modifications of known families — garnets, perovskites — rather than venturing into genuinely novel compositional territory.
So the industry turned to AI. Makes sense. But here's where it went wrong for many teams: they pointed an LLM at the problem. An LLM that had "read" millions of chemistry papers could predict molecular structures — but it predicts tokens, not electron densities. It has no concept of valency rules, no understanding of quantum mechanical forces. It can hallucinate a crystal structure that looks right on paper but violates the laws of physics in ways that only show up when you try to build it.
This is what happened with that late-night email. The AI proposed a candidate. The candidate was plausible. It was not real.
The Oracle Architecture: How We Actually Solve This

After that incident, my team and I had a long, uncomfortable conversation about what we were really building. Were we building AI that generates answers? Or AI that discovers truth?
We chose truth. And truth requires an Oracle.
Our architecture for materials discovery pairs Google DeepMind's GNoME — Graph Networks for Materials Exploration — with rigorous Density Functional Theory validation. The key insight is this: we don't use AI to answer the question. We use AI to propose candidates from a vast search space, and then we validate every single one against the laws of physics before it goes anywhere.
GNoME treats crystal structures as graphs — atoms are nodes, chemical bonds are edges. Unlike an LLM processing linear text, GNoME understands 3D geometry and topology. It's built to be what physicists call E(3)-equivariant: rotate or translate the crystal in space, and scalar outputs like energy stay exactly the same, while any directional outputs rotate along with it. That's not a feature you bolt on. It's a mathematical constraint baked into the architecture. The model cannot violate rotational symmetry.
But even GNoME is probabilistic. It predicts formation energies — the energy required to assemble a crystal from its elements — but those predictions carry uncertainty. A crystal might look stable to the neural network and still be thermodynamically uncompetitive against other possible phases.
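To see why that architectural constraint matters, here is a toy sketch — not GNoME, just a pair potential that depends only on interatomic distances. Any energy function built this way is rotation-invariant by construction, which is the property the real architecture enforces mathematically rather than hoping the model learns it:

```python
import math

def rotate_z(points, theta):
    """Rotate 3D points about the z-axis by angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def toy_energy(points):
    """Toy 'formation energy': a Lennard-Jones-like pair potential summed
    over interatomic distances. Because it depends only on distances, it
    is invariant under any rotation or translation by construction."""
    e = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            e += (1.0 / d**12) - (2.0 / d**6)
    return e

atoms = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (0.5, 0.9, 0.3)]
e0 = toy_energy(atoms)
e1 = toy_energy(rotate_z(atoms, 0.73))
assert abs(e0 - e1) < 1e-9   # rotating the crystal leaves the energy unchanged
```

An LLM predicting a structure as text has no equivalent guarantee: nothing in next-token prediction forces its answer to respect this symmetry.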
So we built the Oracle layer.
Why Does DFT Validation Matter for Battery Safety?
Density Functional Theory is a quantum mechanical method that approximates the solution to the Schrödinger equation. It calculates electron density and total energy with high precision. It's computationally expensive — a single calculation can take hundreds of CPU hours — but it doesn't hallucinate. It solves equations. The answer is either right or it's a numerical error you can quantify and bound.
We run a tiered validation strategy. Machine learning force fields handle initial geometric relaxation — filtering out candidates that are obviously broken. Then PBE-level calculations do high-throughput screening. The survivors get validated with r²SCAN, a meta-GGA functional that accurately predicts lattice constants and formation energies for strongly bound systems. Transition metals get an additional Hubbard U correction to handle self-interaction errors in d-orbitals.
I realize I just threw a lot of physics jargon at you. The point is simpler than the details: we have multiple layers of increasingly expensive and accurate physical simulation, and every candidate must survive all of them before we'd ever recommend it for a battery.
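The shape of that pipeline is simple to sketch. In the Python below, the tier names follow the text, but every predicate and threshold is an invented stand-in, not real physics — the point is the gate structure, where each candidate must survive every increasingly expensive stage:

```python
# Hypothetical tiers, ordered cheapest to most expensive. The predicates
# are illustrative stand-ins for MLFF relaxation, PBE screening, and
# r2SCAN validation -- the cutoffs are invented for the example.

def mlff_relaxes_cleanly(c):       # cheap geometric sanity check
    return c["min_bond_length"] > 0.8          # angstroms, illustrative

def pbe_screen(c):                 # mid-cost high-throughput energy screen
    return c["pbe_energy_above_hull"] < 0.20   # eV/atom, illustrative

def r2scan_validate(c):            # most expensive, most accurate
    return c["r2scan_energy_above_hull"] < 0.05

TIERS = [("MLFF", mlff_relaxes_cleanly), ("PBE", pbe_screen), ("r2SCAN", r2scan_validate)]

def validate(candidate):
    """Return the name of the first tier that rejects the candidate,
    or None if it survives the whole gauntlet."""
    for name, passes in TIERS:
        if not passes(candidate):
            return name
    return None

good = {"min_bond_length": 1.6, "pbe_energy_above_hull": 0.03, "r2scan_energy_above_hull": 0.01}
bad  = {"min_bond_length": 1.6, "pbe_energy_above_hull": 0.03, "r2scan_energy_above_hull": 0.30}
assert validate(good) is None        # survives all tiers
assert validate(bad) == "r2SCAN"     # plausible at cheap tiers, rejected at the accurate one
```

Note what the `bad` candidate illustrates: it passes the cheap screens and only fails at the most accurate tier — exactly the "plausible but not true" failure mode the Oracle exists to catch.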
The metric that matters most is what we call "Distance to Hull." Imagine plotting every possible material in a given compositional space on a chart — composition on one axis, energy on the other. The stable materials form a lower boundary, a "convex hull." Anything above that hull will spontaneously decompose into the materials on it. A material with zero distance to hull is the thermodynamic ground state. A material with distance greater than 100 meV/atom is almost certainly going to fall apart — and in a battery, falling apart means releasing heat.
The convex hull doesn't care about your neural network's confidence score. A material is either thermodynamically stable or it isn't.
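The hull itself is ordinary computational geometry. Here is a minimal sketch for a hypothetical binary A–B system: build the lower convex hull of known (composition, energy) points, then measure how far a candidate sits above it. All phase energies here are invented for illustration:

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive means a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def lower_hull(points):
    """Lower convex hull of (composition, energy) points (Andrew's monotone chain)."""
    hull = []
    for p in sorted(points):
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull

def energy_above_hull(x, e, hull):
    """Vertical distance from (x, e) down to the hull. Zero = ground state."""
    for (x1, e1), (x2, e2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            return e - (e1 + (e2 - e1) * (x - x1) / (x2 - x1))
    raise ValueError("composition outside the hull's range")

# Known phases of a made-up binary A-B system: (fraction of B, eV/atom)
phases = [(0.0, 0.0), (0.25, -0.40), (0.5, -0.55), (0.75, -0.30), (1.0, 0.0)]
hull = lower_hull(phases)

# An AI-proposed candidate at x = 0.5 with predicted energy -0.40 eV/atom
# sits 150 meV/atom above the hull: it will decompose into stable phases.
d = energy_above_hull(0.5, -0.40, hull)
assert abs(d - 0.15) < 1e-9
```

Production phase diagrams span many elements and use tools like pymatgen rather than a twenty-line hull, but the verdict has the same character: a number, not a vibe.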
The Flywheel That Gets Smarter Overnight
What makes this more than a one-shot pipeline is the active learning loop. GNoME generates thousands of candidate structures. We select the ones the model thinks are most promising and the ones where it's most uncertain — exploitation and exploration simultaneously. Those go to the DFT cluster. The true energies come back and get fed into GNoME's training set. The model retrains. Its internal physics gets corrected.
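The batch-selection step of that loop fits in a few lines. The candidate tuples and the exploit/explore split below are illustrative assumptions, but they show the two-headed selection the text describes:

```python
def select_batch(candidates, k_exploit, k_explore):
    """Pick the k_exploit most promising candidates (lowest predicted energy
    above hull) plus the k_explore most uncertain of the rest, for the next
    round of DFT. Each candidate is (id, predicted_energy, uncertainty)."""
    by_promise = sorted(candidates, key=lambda c: c[1])
    chosen = [c[0] for c in by_promise[:k_exploit]]           # exploitation
    rest = by_promise[k_exploit:]
    by_uncertainty = sorted(rest, key=lambda c: -c[2])
    chosen += [c[0] for c in by_uncertainty[:k_explore]]      # exploration
    return chosen

pool = [("A", 0.01, 0.02), ("B", 0.30, 0.25), ("C", 0.05, 0.01), ("D", 0.20, 0.30)]
batch = select_batch(pool, k_exploit=1, k_explore=1)
assert batch == ["A", "D"]   # A looks best; D is where the model is most unsure
```

The uncertain picks are the ones that teach the model the most: when their true DFT energies come back, the retraining step corrects exactly the regions where the network's internal physics was weakest.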
I remember the first time we watched the hit rate climb — the percentage of AI-proposed materials that actually turned out to be stable after DFT validation. Traditional random search sits below 1%. Standard machine learning gets you to maybe 50%. After several active learning cycles, our GNoME-driven pipeline was exceeding 80%.
My co-founder looked at the dashboard and said, "It's not guessing anymore. It's learning what stability means." That was the moment I knew we had something. Not because the number was impressive in isolation, but because the system was converging on physical reality through iteration, not memorization.
I wrote about this architecture in more depth in the interactive version of our research, if you want to see the full workflow.
The Other Kind of Explosion: Copyright in Generative Audio
Now let me tell you about a completely different domain where the same architectural philosophy — propose, then validate — saved us from a different kind of disaster.
A media company approached us about generating audio content at scale. They had a massive library of licensed music and voice recordings. They wanted to use AI to create new content from this library — localized voiceovers, remixed soundtracks, that kind of thing. They'd been experimenting with off-the-shelf generative audio tools.
I asked one question: "Can you prove, for any given output, exactly which licensed sources contributed to it?"
Silence.
This is the black box problem in generative media. Diffusion models — the architecture behind most AI audio and image generators — are trained on massive datasets scraped from the internet. When they generate output, they traverse a high-dimensional latent space to synthesize something new. The output is a mathematical amalgamation of the training data. You cannot trace which training examples influenced which parts of the result.
For a consumer playing around with AI music tools, this is a curiosity. For a global media company, it's an existential legal risk. If a generated audio track contains a four-bar loop that's identical to a copyrighted song, the company is liable for infringement — even if nobody intended it. The courts are actively litigating whether training on copyrighted data constitutes fair use (Andersen v. Stability AI, New York Times v. OpenAI). An enterprise whose content pipeline depends on these tools could wake up one morning to find their entire asset library legally contaminated.
A media company that can't prove the provenance of its AI-generated content is building on sand — legal sand that shifts every time a court issues a ruling.
How Do You Build AI Audio That Can Prove Its Own Innocence?

We rejected the "generate from noise" paradigm entirely. Instead, we built what I think of as Retrieval-Augmented Generation for audio — the same conceptual move that RAG brought to text, but applied to sound.
The pipeline has two phases: deconstruction and reconstruction.
For deconstruction, we use Hybrid Transformer Demucs — a source separation model that takes mixed audio and isolates it into individual stems: vocals, drums, bass, other instruments. The architecture is a U-Net with skip connections (preserving high-frequency detail that would otherwise get lost as the encoder downsamples the signal) and a Transformer encoder at the bottleneck that uses self-attention to analyze the entire audio sequence. It processes audio simultaneously in the time domain and the frequency domain, fusing information from both.
We ran Demucs across the client's entire licensed archive. Thousands of hours of mixed audio, separated into clean, isolated stems, each tagged and indexed by audio features — timbre, pitch, rhythm. We'd turned their back-catalog from a collection of finished songs into a massive library of building blocks.
For reconstruction — specifically for voice content — we use Retrieval-Based Voice Conversion. This is fundamentally different from text-to-speech or diffusion-based voice generation. RVC is speech-to-speech: it takes an input recording (say, a creative director reading a script on their phone) and transforms the timbre to match a licensed target voice, while preserving the original performance's intonation and rhythm.
The critical mechanism is in the name: retrieval. We use HuBERT to extract speaker-agnostic content features from the input. Then, for every frame, we query a FAISS index of feature vectors derived from the licensed voice actor's recordings. We retrieve the closest matching acoustic details — the breathiness, the resonance, the specific vocal quality — from actual authorized recordings. The output sounds like the target voice because we pulled specific data points from their licensed index, not because a neural network dreamed up an approximation.
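Stripped of the neural machinery, the retrieval step is a nearest-neighbor query. The sketch below uses a brute-force distance search as a stand-in for FAISS, and the file names and feature vectors are invented — but the structural point is real: every output frame carries a pointer back to a specific licensed source:

```python
import math

def nearest(frame, index):
    """Brute-force stand-in for a FAISS nearest-neighbor query: return the
    index entry whose feature vector is closest to the input frame."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(index, key=lambda entry: dist(entry["vec"], frame))

# Hypothetical per-frame feature index built from licensed recordings.
# Each entry keeps a pointer back to its source file and timestamp --
# that back-pointer is what makes the output's provenance traceable.
index = [
    {"vec": (0.1, 0.9), "source": "licensed_take_014.wav", "t": 3.20},
    {"vec": (0.8, 0.2), "source": "licensed_take_007.wav", "t": 1.15},
]

input_frame = (0.75, 0.25)   # content feature extracted from the guide recording
match = nearest(input_frame, index)
assert match["source"] == "licensed_take_007.wav"
```

A real pipeline holds millions of vectors per voice and uses FAISS precisely because brute force doesn't scale — but the provenance logic is identical at any scale: the match is a lookup, not a generation.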
I cannot overstate how much this matters legally. In a deepfake model, the target voice lives as opaque neural network weights. In our system, every acoustic detail traces back to a specific, timestamped, licensed recording. The chain of title is unbroken.
The Paperwork That Travels With the Sound
Generating provenance-clean audio is necessary but not sufficient. The asset needs to carry its own proof. We implement the C2PA standard — Coalition for Content Provenance and Authenticity — which embeds tamper-evident provenance data directly into media files using public-key cryptography.
Every audio file we generate ships with a signed manifest: the hash of the input guide track, the ID of the licensed voice model, the complete sequence of processing actions, and the tool version. Any downstream user — a streaming platform, a broadcaster — can validate the signature and confirm the asset was built entirely from authorized sources.
We also adapted the Structural Similarity Index (SSIM) for audio quality control. By comparing spectrograms of the input guide and the output, we catch cases where the AI has distorted the performance — skipped a word, changed the rhythm, hallucinated a pause. Anything below a 0.95 SSIM threshold gets automatically flagged for human review.
For the full technical breakdown of both the materials and audio architectures, see our research paper.
What About Just Using Better Prompts?
People push back on this approach. They tell me we're overengineering the problem. "Just use a better model." "Just fine-tune on your domain data." "Just add a disclaimer."
I had an investor tell me, point blank, "Just use GPT with a good system prompt and save yourself the infrastructure cost." I asked him if he'd put his family in an electric vehicle whose battery electrolyte was selected by a system prompt. He changed the subject.
The deeper objection is about cost and complexity. Yes, running DFT calculations on an HPC cluster is more expensive than calling an API. Yes, building a FAISS-indexed stem database with C2PA signing is harder than pointing a diffusion model at a text prompt. But the question isn't whether deterministic validation is more expensive than probabilistic generation. The question is whether it's more expensive than a battery recall. Or a copyright lawsuit that invalidates your entire content library.
Others ask whether this approach scales. It does — that's what the active learning flywheel is for. The system gets more efficient with every cycle. The hit rate climbs. The cost per validated candidate drops. The stem database grows. You're not just solving today's problem; you're building an engine that compounds.
The End of AI Tourism
I think we're at an inflection point. The era of experimenting with AI — chatbots in the lobby, copilots in the sidebar, wrappers on everything — is ending. Not because those tools aren't useful, but because the enterprises that matter most are now trying to put AI into the core of their operations. Into the R&D lab. Into the production studio. Into the systems where failure has consequences measured in thermal events and litigation, not in awkward chatbot responses.
In those environments, the tolerance for hallucination is zero. Not low. Zero.
The architecture we've built at Veriprajna — for batteries, for audio, for every domain where truth is non-negotiable — rests on a single principle: the generative power of the neural network must be strictly subordinate to the verifying power of the Oracle. The AI proposes. Physics decides. The AI assembles. Provenance proves. The creative capacity of these models is extraordinary. But creativity without accountability is just sophisticated guessing.
For the battery manufacturer, a hallucination is a fire. For the media company, a hallucination is a lawsuit. The only viable architecture constrains generation with verification — every time, without exception.
I don't think the future of AI belongs to the models that generate the most convincing outputs. I think it belongs to the systems that can prove their outputs are true. Constraints don't limit intelligence. They create reality.


