[Image: audio waveforms deconstructed into labeled, transparent layers, representing the article's theme of auditable, traceable AI audio versus opaque black-box generation.]
Artificial Intelligence · Music Industry · Intellectual Property

I Stopped Trusting AI Music Generators the Night One Spit Out a Mariah Carey Vocal Run

Ashutosh Singhal · February 22, 2026 · 14 min read

It was almost midnight, and I was sitting in our office with two engineers and a pot of terrible coffee, running stress tests on a popular generative audio platform. We'd been hired to evaluate whether an ad agency could safely use AI-generated music in a national campaign. So we were methodically prompting the tool — genre by genre, style by style — documenting what came out.

Then one of my engineers, Priya, played back a track she'd generated with a simple prompt: "upbeat pop ballad, female vocalist, powerful range." She didn't mention any artist. She didn't ask for a sound-alike. But what came out of the speakers made all three of us go quiet.

It was unmistakable. The melisma — that cascading, note-bending vocal run — belonged to one person. The tool hadn't "created" a vocal style. It had reconstructed Mariah Carey's signature technique from whatever it had swallowed during training. And it had done it from a prompt that never mentioned her name.

I turned to Priya and said, "If we ship this to a client and someone at Sony hears it, we're not getting sued. The client is getting sued."

That was the night I stopped thinking of generative AI music as a creative tool and started seeing it for what it actually is: a compression algorithm for copyrighted material, dressed up in a text box. And it was the night I committed to building something fundamentally different at Veriprajna.

The Lawsuit That Changed Everything

If you haven't been following the Recording Industry Association of America's cases against Suno and Udio, you should be. This isn't a nuisance suit. It's the music industry's line in the sand.

The RIAA alleges that these platforms engaged in industrial-scale "stream-ripping" — circumventing YouTube's "rolling cipher" technical protections to download millions of copyrighted recordings and feed them directly into their training pipelines. Not incidental ingestion. Not a few songs slipping through. Millions of tracks, deliberately scraped, their expressive features quantified into vectors so the model could reconstruct them on demand.

The legal theory is elegant and devastating: fruit of the poisonous tree. If the training data is illegally obtained, every output is tainted. It doesn't matter if the user had innocent intentions. It doesn't matter if the output isn't a note-for-note copy. The model learned to generate "a song like Mariah Carey" by memorizing the statistical fingerprint of Mariah Carey's actual recordings. That's not inspiration. That's data decompression with a text prompt as the key.

When a model can't tell you where its creative decisions came from, it can't be trusted in a commercial supply chain. Period.

I wrote about the full legal and technical breakdown in the interactive version of our research, but the short version is this: enterprise users of these tools are renting a lawsuit. The platforms' Terms of Service are designed to shift liability back to the user the moment a prompt gets specific. And "specific" is a lower bar than you think.

Why Does "Fair Use" Fail for AI Music?

[Diagram: Black Box generators (scraped data → opaque model → unverifiable output → liability) versus the Source-Separated approach (licensed input → auditable pipeline → provenance-stamped output → ownership).]

This is the question I get most often from executives who want to use these tools. "Isn't training transformative? Isn't it like a musician listening to the radio?"

No. And courts are increasingly saying so.

Fair use in the US hinges on four factors, but the one that kills AI music generators is the fourth: effect on the potential market. When an AI tool charges users $24 a month to generate tracks that directly compete with — and substitute for — the licensed recordings it was trained on, the market harm is not theoretical. It's the business model.

A human musician who listens to Mariah Carey and writes an original song has processed that influence through years of lived experience, physical vocal training, and creative interpretation. A diffusion model that ingests her spectrogram and learns to reverse-engineer it from noise has done something categorically different. It has compressed her work into weights and learned to decompress it on command.

The Udio settlement with Universal Music Group made this painfully concrete. As part of the deal, users of the original platform reportedly can't even download their own creations anymore. Everything is locked in a walled garden. If you built an ad campaign's soundtrack on Udio, that soundtrack may now be commercially useless for any off-platform application.

I watched an agency creative director's face go white when I explained this at a meeting. She had six months of campaign audio sitting on a platform that had just settled a copyright lawsuit. None of it could be exported.

The Night We Argued About the Wrong Problem

For a while, my team and I were obsessed with the wrong question. We kept asking: "How do we make generative AI music safer?" We tried prompt guardrails. We tried output fingerprinting. We tried building classifiers that could detect when a generated track was too close to a known recording.

All of it was patching a broken foundation.

The argument that changed our direction happened over a whiteboard covered in architecture diagrams. One of our senior engineers — I'll call him Raj — kept pushing back on every safeguard I proposed. "You're trying to make a probabilistic system behave deterministically," he said. "It can't. The whole point of diffusion is to reconstruct training data. You're asking it to not do the thing it was designed to do."

He was right. And he was frustrated, because he'd been saying it for weeks and I hadn't been listening.

The question wasn't how to make Black Box generation safer. The question was: why are we generating from scratch at all?

Every enterprise client we talked to already had audio assets. They had demo recordings. They had licensed stock tracks. They had legacy catalog material. They didn't need a model to hallucinate a song from nothing. They needed a model to transform what they already owned — change a voice, modernize a mix, isolate a stem — without breaking the chain of copyright ownership.

That realization was the birth of what we now call the Source-Separated Licensing Engine.

What Is a Source-Separated Licensing Engine?

[Diagram: the Source-Separated Licensing Engine workflow: licensed track input → Deep Source Separation into stems → Voice Conversion on the vocal stem → recombination → C2PA provenance manifest attached to the final output.]

Instead of asking an AI to generate audio from a text prompt — which requires the model to traverse a latent space built from stolen copyrights — we ask the AI to do two very specific, very auditable things:

First, take apart. Using Deep Source Separation, we deconstruct a licensed track into its constituent stems: vocals, drums, bass, and everything else. The AI isn't creating anything. It's isolating what's already there, like a surgeon separating tissue layers.

Then, transform. Using Retrieval-Based Voice Conversion (RVC), we change the vocal identity on the isolated stem. The melody stays. The lyrics stay. The performance stays. But the voice — the timbre, the texture, the grain — comes from a licensed voice model that we trained on recordings from a voice actor who signed a commercial release.

The composition comes from the client's licensed input. The voice comes from our licensed model. Every ingredient has a clear chain of title. There is no latent space of scraped copyrights. There is no probabilistic hallucination. There is no mystery about where any element came from.

We traded the magic of hallucination for the certainty of engineering. And enterprise clients don't want magic — they want assets they can actually own.
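For the engineers in the room, the whole flow fits in a few lines. This is a sketch, not our production code; `separate_stems`, `convert_voice`, and `render` are illustrative names for the two stages described in the next sections, not a published API.

```python
from pathlib import Path

def separate_stems(licensed_mix: Path) -> dict[str, Path]:
    """Stage 1: deconstruct a licensed track into stems (detailed below)."""
    raise NotImplementedError("wraps the source-separation ensemble")

def convert_voice(vocal_stem: Path, voice_model: Path) -> Path:
    """Stage 2: re-voice the isolated vocal with a licensed RVC model."""
    raise NotImplementedError("wraps the retrieval-based voice conversion stage")

def render(licensed_mix: Path, voice_model: Path) -> dict[str, Path]:
    stems = separate_stems(licensed_mix)            # isolation, not invention
    stems["vocals"] = convert_voice(stems["vocals"], voice_model)  # licensed voice applied
    return stems                                    # every input has a chain of title
```

Notice what isn't in there: no prompt, no sampler, no latent space. Every function takes a licensed artifact in and hands a traceable artifact back.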

How Does Deep Source Separation Actually Work?

[Diagram: a mixed audio spectrogram processed through neural-network masking into isolated stems, illustrating the frequency-overlap problem and the masking solution.]

When you listen to a finished song, you're hearing a polyphonic mixture — vocals, drums, bass, guitars, synths, all layered on top of each other. A bass guitar and a kick drum both live in the 50–200 Hz range. A vocal and a piano share the 500 Hz–2 kHz range. Traditional audio filters can't pull them apart without destroying the sound.

Deep Source Separation uses neural networks to solve this. The mixed audio gets converted into a spectrogram — essentially a visual map of frequencies over time — and the network learns to generate a "mask" for each source. Think of it like a stencil: the mask tells the system which frequencies at which moments belong to the drums, which belong to the vocals, which belong to everything else. Apply the mask, and you get a clean isolated stem.
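Here's what that masking step looks like in PyTorch, stripped to the arithmetic. The trained network that predicts the mask is assumed; `apply_stem_mask` is an illustrative helper, not a published API.

```python
import torch

def apply_stem_mask(mix: torch.Tensor, mask: torch.Tensor,
                    n_fft: int = 2048, hop: int = 512) -> torch.Tensor:
    """Isolate one source from a mono mixture via a spectrogram mask.

    `mix` is a 1-D waveform; `mask` is a (n_fft // 2 + 1, frames) tensor
    in [0, 1] that a trained network would predict for the target stem.
    The network itself is assumed; only the masking arithmetic is shown.
    """
    window = torch.hann_window(n_fft)
    spec = torch.stft(mix, n_fft, hop_length=hop, window=window,
                      return_complex=True)   # complex spectrogram: (freqs, frames)
    stem_spec = spec * mask                  # keep only the target's time-frequency energy
    return torch.istft(stem_spec, n_fft, hop_length=hop, window=window,
                       length=mix.shape[-1])
```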

We run an ensemble of the best architectures — Hybrid Transformer Demucs for capturing long-range patterns like a repeating drum beat across an entire song, and MDX-Net for spectral clarity across frequency bands. Running multiple models and averaging the results minimizes "bleeding," those annoying artifacts where you can hear ghost drums in the vocal track.
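The blending itself is almost embarrassingly simple once each model has produced its stems. A minimal sketch, assuming each separator has already returned a dict of waveform arrays of matching shape; the model calls themselves are omitted:

```python
import numpy as np

def blend_stems(stems_a: dict[str, np.ndarray],
                stems_b: dict[str, np.ndarray],
                weight_a: float = 0.5) -> dict[str, np.ndarray]:
    """Average two separators' outputs stem-by-stem to suppress bleed.

    `stems_a` / `stems_b` map stem names ("vocals", "drums", ...) to
    waveforms; e.g. one dict from Hybrid Transformer Demucs, one from
    MDX-Net. Where one model hallucinates ghost drums in the vocal
    stem, the other usually doesn't, so the average is cleaner.
    """
    return {name: weight_a * stems_a[name] + (1.0 - weight_a) * stems_b[name]
            for name in stems_a}
```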

The legal point is what matters: we're performing this separation on tracks the client already owns or has licensed. The AI is a tool for isolation, not invention. The resulting stems are legally derived from the licensed parent track.

Why Does Voice Conversion Matter More Than Voice Generation?

This is where most people's intuition leads them astray. They assume the impressive part of AI audio is generating a voice from nothing. It's not. The impressive part — and the legally defensible part — is converting one voice into another while preserving everything else about the performance.

RVC works by disentangling what is being sung from who is singing it. A model called HuBERT strips the source vocal down to pure linguistic and melodic content — phonemes, prosody, rhythm — while discarding the speaker's identity. It anonymizes the performance.

Then comes the retrieval step, which is the key innovation. Instead of having a neural network guess what the target voice should sound like (which produces that telltale synthetic smoothness), the system searches a pre-built index of the target voice's actual characteristics — breaths, rasps, vowel shapes — and injects real feature snippets into the converted audio. The result sounds authentic because it is authentic. It's built from real samples of the licensed voice, not a statistical approximation.
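If you want to see the core of the retrieval step, here's a simplified sketch using FAISS for the nearest-neighbor search. Real RVC retrieves several neighbors and distance-weights them; I'm showing the single-neighbor case, and the HuBERT-style feature extraction is assumed to have happened upstream.

```python
import numpy as np
import faiss  # pip install faiss-cpu

def retrieve_and_blend(content_feats: np.ndarray,
                       target_bank: np.ndarray,
                       index_rate: float = 0.75) -> np.ndarray:
    """Swap synthetic guesses for real snippets of the licensed voice.

    `content_feats`: (frames, dim) features of the anonymized source
    performance. `target_bank`: (frames, dim) features extracted from
    the licensed voice actor's recordings. Both extractors are assumed
    to exist upstream; only the retrieval is shown.
    """
    content_feats = np.ascontiguousarray(content_feats, dtype=np.float32)
    target_bank = np.ascontiguousarray(target_bank, dtype=np.float32)

    index = faiss.IndexFlatL2(target_bank.shape[1])  # exact L2 search
    index.add(target_bank)                           # built once per voice in practice
    _, nearest = index.search(content_feats, 1)      # closest real frame per source frame
    retrieved = target_bank[nearest[:, 0]]

    # Blend retrieved (real) features into the content features, as RVC's
    # index_rate parameter does; higher means more of the real voice.
    return index_rate * retrieved + (1.0 - index_rate) * content_feats
```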

Finally, a HiFi-GAN vocoder synthesizes the waveform, trained adversarially against real recordings of the target speaker until a discriminator can no longer reliably tell the output from a genuine performance.

The whole thing requires only 30–60 minutes of clean audio from a single speaker to train a voice model. Compare that to Suno or Udio, which need millions of scraped tracks to learn "music." Our approach is surgical where theirs is industrial.

The Delete Button That Black Box Models Don't Have

Here's something that keeps enterprise legal teams up at night: if a voice actor revokes consent, or a licensing deal expires, can you remove their contribution from your AI system?

With large transformer models — the kind that power Suno and Udio — the answer is effectively no. The training data is baked into billions of parameters. Removing a specific artist's influence requires expensive retraining and risks "catastrophic forgetting," where the model loses capabilities far beyond what you intended to remove.

In our architecture, every voice is a separate file. About 50 megabytes. If a voice actor says "I'm done," we delete the file. The separation engine keeps working. Every other voice model keeps working. Compliance with "Right to be Forgotten" requests is instantaneous and surgical.
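The compliance path really is that short. A sketch, with illustrative paths and naming:

```python
from pathlib import Path

VOICE_DIR = Path("/srv/voices")  # illustrative layout: one ~50 MB file per licensed voice

def revoke_voice(voice_id: str) -> None:
    """Honor a consent revocation or an expired license.

    Because each voice is an isolated artifact, "unlearning" is a file
    deletion, not a retraining run. The separation engine and every
    other voice model are untouched.
    """
    (VOICE_DIR / f"{voice_id}.pth").unlink(missing_ok=True)
```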

In a Black Box model, unlearning is a research problem. In our architecture, it's a delete key.

I can't overstate how much this matters as regulations tighten. The EU AI Act will demand transparency about training data. The ability to demonstrate granular control over every component of your AI pipeline isn't a nice-to-have — it's going to be table stakes.

What Happens When Someone Questions Your AI Audio?

Every file that leaves our pipeline carries a C2PA manifest — a cryptographically signed provenance record defined by the Coalition for Content Provenance and Authenticity standard. Think of it as a digital nutrition label that travels with the file and can't be altered without breaking the signature.

The manifest records: the hash of the input audio (proving derivation from a licensed source), the hash of the separation model (proving which tool was used), the hash of the voice model (proving which licensed voice was applied), and Veriprajna's cryptographic signature certifying the integrity of the entire chain.
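Conceptually, the manifest assembly is just hashing and signing. Here's the shape of it in plain Python; the field names are illustrative, and in production these assertions are embedded and signed with standard C2PA tooling rather than written out as loose JSON:

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def build_provenance_record(input_audio: Path, separation_model: Path,
                            voice_model: Path) -> str:
    """The manifest's core assertions, before C2PA embedding and signing."""
    return json.dumps({
        "input_audio_sha256": sha256_of(input_audio),            # proves licensed source
        "separation_model_sha256": sha256_of(separation_model),  # proves which tool was used
        "voice_model_sha256": sha256_of(voice_model),            # proves which licensed voice
        # The signing step adds a cryptographic signature over this record.
    }, indent=2)
```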

If YouTube flags a track, if Spotify questions its copyright status, if a competitor alleges it's a deepfake — the client opens the manifest and the provenance is right there, mathematically verifiable. No ambiguity. No "trust us." Cryptographic proof.

For the full technical architecture of the pipeline and C2PA integration, I've published a detailed research paper that goes deeper than I can here.

"But Isn't This Just Limiting What AI Can Do?"

People ask me this constantly. Usually with a tone that implies I'm being a killjoy.

My answer: I'm not limiting AI. I'm limiting liability. There's a difference.

A Black Box generator that can produce any song from a text prompt is genuinely impressive technology. I don't deny that. But impressive technology that can't tell you where its outputs came from, that can't be audited, that can't guarantee you own what it produces — that technology is a consumer toy, not an enterprise tool.

The US Copyright Office has been increasingly clear: purely AI-generated works likely aren't copyrightable. Typing "make a jazz song" isn't authorship. It's an idea, not an expression. Which means if your competitor rips your AI-generated jingle and uses it in their own ad, you may have no legal recourse.

Our approach preserves copyrightability because there's a human-created guide track at the foundation and human-directed transformation at every step. The AI is a tool in the hands of a creator, not a creator itself. That distinction is the difference between owning your output and hoping nobody steals it.

The Real Cost Equation

I'll be direct about the economics because nobody else in this space seems willing to be.

Training on scraped data is free. The legal liability is uncapped — statutory damages of up to $150,000 per infringed work. If your model ingested ten thousand songs, do the math.
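(The math: 10,000 works × $150,000 is $1.5 billion in maximum statutory exposure.)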

Licensing training data and voice recordings introduces an upfront cost. But it caps your liability at zero. Every component in the chain has a signed agreement behind it. Every output has a provenance manifest attached to it.

The ad agency that hired us for that initial evaluation? They ran the numbers. The cost of our pipeline was a rounding error compared to a single copyright infringement claim. And unlike the Black Box platforms, we could actually guarantee that the rounding error was the total cost — not a down payment on a lawsuit.

The End of "Prompt and Pray"

The RIAA lawsuits against Suno and Udio aren't the end of AI audio. They're the end of the phase where nobody asked where the training data came from. The settlement terms — walled gardens, download restrictions, new licensed platforms — tell you exactly where this is heading. The wild west is closing.

What comes next is what we've been building: sovereign audio pipelines where every artifact has a verifiable origin, where models can be audited and updated and deleted at the component level, where the output is deterministic rather than probabilistic, and where the enterprise client actually owns what they paid for.

I think about that night with Priya and the Mariah Carey vocal run more often than I'd like to admit. Not because it was technically surprising — we knew the models were memorizing training data. But because it made the risk visceral. That wasn't an abstract legal theory playing through our speakers. It was someone's life's work, compressed into weights and reconstructed without permission, ready to be shipped to a client who would have had no idea what they were distributing.

You cannot build a business on a system that can't explain itself. If you don't know what data the model was trained on, you don't own the output. You're not creating. You're gambling.

In an era of synthetic uncertainty, provenance is the product.

We build systems where every note has a name, every voice has a contract, and every file carries proof. That's not a limitation on AI. That's what AI looks like when it's ready for the real world.
