Artificial Intelligence · Music Industry · Technology

75 Million Fake Songs Got Deleted From Spotify. The Real Problem Is the Ones That Didn't.

Ashutosh Singhal · February 23, 2026 · 12 min read

A few months ago, I sat in a meeting with a music distributor who told me something that rewired how I think about the entire audio industry. He pulled up a dashboard showing their daily ingestion pipeline. "See this?" he said, scrolling through a feed of new uploads. "We get about four thousand tracks a day through our platform alone. I'd estimate a third of them were made by someone who spent less time creating the track than you spent brushing your teeth this morning."

I laughed. He didn't.

He wasn't exaggerating. Spotify alone ingests roughly 100,000 new tracks every single day. If you tried to listen to just 30 seconds of each one, it would take you 35 days of continuous playback to get through a single day's uploads. And a growing share of that flood isn't music in any meaningful sense — it's algorithmically generated noise designed to siphon money from the people who actually make art.

This is the audio watermarking problem I've spent the last stretch of my career obsessing over at Veriprajna. Not because watermarking is a sexy technology — it isn't — but because every other solution the industry is relying on has a fatal flaw that nobody wants to talk about honestly.

The $3 Billion Heist Hiding in Your Playlist

Here's the part that should make you angry, whether you're a musician, a listener, or just someone who pays $10.99 a month for a streaming subscription.

The way most major platforms pay artists is called the pro-rata model. All subscription and ad revenue goes into one giant pool. That pool gets divided by the total number of streams on the platform. Your per-stream rate is a fraction of the whole.

This means every fake stream doesn't just steal from the platform — it steals from every real artist. When a bot farm generates a billion plays on AI-generated white noise, it inflates the denominator. The per-stream payout drops for everyone. Your favorite independent artist, the one who spent six months writing an album in their bedroom, gets paid less because a fraud ring in another country uploaded ten thousand rain-sound loops and pointed a botnet at them.
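
The arithmetic of that dilution is worth seeing in miniature. Here's a toy version in Python; every number is invented for illustration, since real pools and stream counts vary by platform and month:

```python
# Toy illustration of pro-rata royalty dilution. All numbers are invented.
revenue_pool = 1_000_000_000            # monthly royalty pool, in dollars
real_streams = 100_000_000_000          # legitimate plays across the platform

rate_before = revenue_pool / real_streams    # $0.01 per stream

# A fraud ring adds a billion bot plays on AI-generated noise tracks.
fake_streams = 1_000_000_000
rate_after = revenue_pool / (real_streams + fake_streams)

artist_plays = 1_000_000                # one independent artist's monthly plays
print(f"artist payout before: ${artist_plays * rate_before:,.2f}")  # $10,000.00
print(f"artist payout after:  ${artist_plays * rate_after:,.2f}")   # ~$9,900.99
print(f"fraud ring extracts:  ${fake_streams * rate_after:,.2f}")   # ~$9.9 million
```

The pool never grows; the fraud ring's take comes entirely out of everyone else's share.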

Industry estimates put the annual damage at $2 billion to $3 billion. Deezer reported that 70% of plays on AI-generated tracks on their platform were flagged as fraudulent. Spotify had to purge over 75 million tracks in 2024 and 2025 alone — a number that rivals the size of the entire historical catalog of recorded music.

Every fraudulent stream isn't just theft from a platform. It's a tax on every legitimate artist, paid invisibly through a shrinking royalty pool.

I remember the night those Spotify purge numbers came out. I was at my desk, and my first reaction was relief — finally, the platforms are taking this seriously. My second reaction, about ten minutes later, was dread. Because 75 million is the number they caught. What about the ones that slipped through?

Why Does Audio Fingerprinting Fail Against AI Music?

[Figure: Side-by-side comparison of audio fingerprinting (identification), which fails against novel AI content, and audio watermarking (authentication), which succeeds by embedding provenance at creation.]

This is the question that led me to start building what we're building. And the answer is deceptively simple once you see it.

The music industry's primary defense system is audio fingerprinting — the technology behind Shazam, YouTube's Content ID, and most rights management platforms. Fingerprinting works by extracting a perceptual signature from a piece of audio and matching it against a massive database of known recordings.

Here's the problem: generative AI doesn't copy. It synthesizes.

When a diffusion model generates a new track, it creates a waveform that has never existed before. There is no entry in any fingerprinting database to match against. To Content ID, a brand-new AI spam track looks exactly like a brand-new human masterpiece. Both are simply "unknown content."
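
You can caricature this failure mode in a few lines. Real systems like Content ID use perceptual hashes of spectral features rather than raw byte hashes, but the database-lookup structure, and its blind spot, are the same:

```python
import hashlib

registered = {}  # fingerprint -> track metadata, built from known recordings

def fingerprint(audio_bytes: bytes) -> str:
    # Stand-in for a perceptual hash; real fingerprints survive re-encoding.
    return hashlib.sha256(audio_bytes).hexdigest()

def register(audio_bytes: bytes, title: str) -> None:
    registered[fingerprint(audio_bytes)] = title

def identify(audio_bytes: bytes) -> str:
    # A never-before-seen waveform has no entry to match against.
    return registered.get(fingerprint(audio_bytes), "unknown content")

register(b"<bytes of a known recording>", "Registered hit single")
print(identify(b"<bytes of a known recording>"))      # -> Registered hit single
print(identify(b"<a brand-new AI-generated track>"))  # -> unknown content
```

The human masterpiece and the AI spam land in the same bucket: unknown.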

I call this the Originality Paradox, and it's the reason I couldn't sleep for about a week after we ran our first tests. We took a set of AI-generated tracks — some clearly derivative of existing artists, some completely novel — and ran them through standard fingerprinting pipelines. The derivative ones occasionally triggered partial matches. The novel ones? Complete silence from the detection system. Not a single flag.

My co-founder looked at the results and said, "So the better the AI gets at being original, the worse our detection gets?" Yes. Exactly. That's the trap.

Fingerprinting is identification technology. It tells you what something is. Watermarking is authentication technology. It tells you where something came from. The music industry has been using the wrong tool.

I wrote about this distinction — and the full technical architecture behind why fingerprinting breaks down — in our interactive whitepaper. But the short version is this: fingerprinting is reactive. It needs the content to already exist and be registered. We needed something proactive — something that embeds provenance at the moment of creation.

The Fraud Got Smarter While We Weren't Looking

[Figure: The modern "low and slow" AI music fraud kill chain, from AI track generation through botnet distribution to royalty pool extraction.]

The other thing that kept me up was learning how the fraud operations actually work now. The old playbook was crude: upload a track, blast it with millions of streams from a single IP address, cash out. Platforms caught that years ago.

The new playbook is terrifyingly elegant. They call it "low and slow."

Instead of one track getting a million fake streams, a fraud ring uses AI to generate ten thousand tracks. Then a botnet plays each track just a hundred times. The aggregate payout is the same, but no single track triggers a viral-spike alert. The fraud hides in the long tail, buried under the sheer volume of legitimate data.
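
A sketch makes the evasion obvious. Assume a hypothetical per-track spike alert; the threshold below is made up, and real platforms use more elaborate heuristics, but the principle holds:

```python
# Hypothetical per-track daily alert threshold for a viral-spike detector.
ALERT_THRESHOLD = 50_000

old_playbook = {"track_00001": 1_000_000}                      # 1 track, 1M plays
new_playbook = {f"track_{i:05d}": 100 for i in range(10_000)}  # 10k tracks x 100

def flagged(catalog: dict) -> list:
    return [track for track, plays in catalog.items() if plays > ALERT_THRESHOLD]

print(sum(old_playbook.values()), flagged(old_playbook))  # 1000000, caught
print(sum(new_playbook.values()), flagged(new_playbook))  # 1000000, zero flags
```

Same million streams, same payout, nothing for a per-track anomaly detector to see.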

And the infrastructure behind these operations has gone enterprise-grade. We're talking residential proxies routing traffic through compromised IoT devices so each stream appears to come from a different home. Headless browsers running scripts that mimic human behavior — mouse movements, pausing, skipping tracks, searching — to fool engagement analytics. AI-generated playlists with SEO-optimized titles like "Chill Lo-Fi for Coding" that mix a few legitimate hits from major artists with dozens of spam tracks, camouflaging the fraud and sometimes even tricking the platform's recommendation algorithm into serving the fake tracks to real listeners.

I sat with our team one afternoon mapping out this kill chain on a whiteboard, and someone said, "This isn't music piracy. This is financial fraud that happens to use audio files as the vehicle." That reframing changed everything for us.

What Happens When You Play a Song Through a Speaker and Re-Record It?

[Figure: How the autocorrelation-based watermark survives the analog gap: the repeating watermark blocks are distorted identically by room acoustics, preserving their internal relationship.]

This is the technical challenge that separates serious watermarking from everything else, and it's the one I'm most proud of our team for tackling.

It's called the Analog Gap — sometimes the Analog Hole. Imagine a deepfake song plays on someone's laptop speakers. The sound travels through the air. Someone records it on their phone. That recording gets uploaded to a platform.

During that journey, the audio signal gets destroyed in ways that are almost comically hostile to data preservation. Sound bounces off walls, floors, and furniture: the microphone receives the direct signal plus thousands of slightly delayed reflections. Cheap speakers cut everything below 300 Hz and above 15 kHz. The recording device doesn't know where the watermark "starts," so the entire signal is desynchronized.

Most watermarking systems that survive MP3 compression — the digital gap — die instantly in the analog gap. And yet, the analog gap is exactly the scenario that matters most for detecting deepfakes shared on social media, played on radio, or captured during live calls.

We spent weeks failing at this before we found the approach that worked. The breakthrough was realizing we shouldn't be comparing the received signal to an external reference at all. Instead, we embed a repeating pattern within the signal itself and use autocorrelation — the signal compares itself to itself.

Here's why that's clever: when audio travels through a reverberant room, the entire signal gets distorted in the same way. Block A and Block B of our repeating watermark both get smeared by the same room acoustics. The relationship between them survives even when the absolute signal is mangled. The detector looks for a periodic spike in the autocorrelation at a known interval, and that spike confirms the watermark's presence without ever needing to know what the original audio sounded like.
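
A self-contained numpy sketch of the principle follows. Everything in it is a toy stand-in, not our production embedder: white noise plays the role of music, a decaying noise burst plays the role of a room, and the watermark is embedded far louder than a real system would allow, just to keep the demo legible:

```python
import numpy as np

rng = np.random.default_rng(0)
SR = 16_000        # sample rate, Hz
PERIOD = 4_000     # watermark repetition interval, in samples

# --- Embed: tile one faint noise block beneath the host audio. ---
host = rng.standard_normal(SR * 4) * 0.3        # stand-in for 4 s of music
pattern = rng.standard_normal(PERIOD) * 0.08    # repeating watermark block
marked = host + np.tile(pattern, len(host) // PERIOD + 1)[: len(host)]

# --- Simulate the analog gap: reverb, band-limiting, desynchronization. ---
room_ir = np.exp(-np.arange(800) / 150.0) * rng.standard_normal(800)
degraded = np.convolve(marked, room_ir)[: len(marked)]         # fake reverb
degraded -= np.convolve(degraded, np.ones(101) / 101, "same")  # crude high-pass
degraded = degraded[rng.integers(0, PERIOD):]   # recorder doesn't know t=0

# --- Detect: look for a correlation spike at the known lag. ---
def lag_correlation(x: np.ndarray, lag: int) -> float:
    a, b = x[:-lag], x[lag:]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

peak = lag_correlation(degraded, PERIOD)
floor = max(abs(lag_correlation(degraded, lag)) for lag in range(3500, 3900, 50))
print(f"at watermark lag: {peak:.4f}, nearby lags: <= {floor:.4f}")
print("watermark detected:", peak > 3 * floor)
```

Block A and Block B went through the same fake room, so their mutual correlation survives; the detector never needs the original audio, only the repetition interval.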

There was a moment in the lab — and I use "lab" loosely, it was really just a conference room with a laptop and a Bluetooth speaker we bought from a convenience store — where we played a watermarked track through that terrible speaker, recorded it on a phone across the room, and ran the detector. When it came back positive, my engineer looked at me and said, very quietly, "That shouldn't have worked." But it did. And that's when I knew we had something.

Can't Attackers Just Remove the Watermark?

This is the first objection everyone raises, and it's the right one.

Sophisticated attackers will absolutely try to use AI to find and strip watermarks. We'd be naive to think otherwise. This is why our training pipeline doesn't just defend against a fixed list of known attacks like "add noise" or "compress to MP3." We use an adversarial training framework — essentially, we train an attacker network alongside our watermarking system. The attacker tries to destroy the watermark while keeping the audio listenable. The encoder adapts to survive the attack. They play this minimax game through thousands of iterations until the watermark survives attacks that didn't even exist when training started.
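
In skeleton form, the loop looks something like the following PyTorch sketch. The three toy conv nets and the random "audio" are stand-ins for the real architectures and training data; the point is the alternating minimax structure, not the specific layers:

```python
import torch
import torch.nn as nn

# Toy stand-ins: encoder embeds a residual, attacker learns distortions,
# detector predicts whether a clip carries the watermark.
encoder  = nn.Conv1d(1, 1, 9, padding=4)
attacker = nn.Conv1d(1, 1, 9, padding=4)
detector = nn.Sequential(nn.Conv1d(1, 4, 9, padding=4),
                         nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(4, 1))

opt_def = torch.optim.Adam([*encoder.parameters(), *detector.parameters()], lr=1e-3)
opt_atk = torch.optim.Adam(attacker.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1_000):
    audio = torch.randn(8, 1, 4096)                      # stand-in for music
    marked = audio + 0.05 * torch.tanh(encoder(audio))   # bounded, faint residual

    # Attacker turn: make the detector miss, but keep the audio close
    # to the original (the fidelity penalty keeps it "listenable").
    attacked = marked.detach() + attacker(marked.detach())
    atk_loss = bce(detector(attacked), torch.zeros(8, 1)) \
             + 10.0 * (attacked - marked.detach()).pow(2).mean()
    opt_atk.zero_grad(); atk_loss.backward(); opt_atk.step()

    # Defender turn: encoder and detector adapt to survive the attacker's
    # current best distortion, without false-flagging clean audio.
    attacked = marked + attacker(marked)
    def_loss = bce(detector(attacked), torch.ones(8, 1)) \
             + bce(detector(audio), torch.zeros(8, 1))
    opt_def.zero_grad(); def_loss.backward(); opt_def.step()
```

Because the attacker keeps improving, the encoder is never allowed to settle on a hiding spot that any fixed distortion list would protect.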

The result: our system achieves attribution accuracy above 98% even under aggressive editing — time-stretching, pitch-shifting, cropping. Even if a fraudster cuts a 30-second clip down to 10 seconds, the detector accumulates enough statistical evidence from the fragment to decode the provenance signature.

For the full technical breakdown of the spread-spectrum embedding, singular value decomposition, and adversarial resistance protocols, see our research paper. But the key insight isn't about any single technique: the watermark lives in the structure of the audio, not on its surface. You can sandblast the surface. The structure endures.

The Nutrition Label for Sound

A watermark by itself is just a flag. It says "this audio has been marked." But marked by whom? For what purpose? To build a real trust ecosystem, you need to connect that acoustic signal to a verifiable identity.

This is where we integrate with C2PA — the Coalition for Content Provenance and Authenticity — an open standard that functions like a nutrition label for digital content. It cryptographically records who created an asset, how it was created (human or AI), and what edits were made.

The vulnerability of metadata-only solutions is obvious: convert a signed WAV to a generic MP3, and the metadata header vanishes. Play it on the radio, and it's gone. But our watermark survives those transformations. So we use the watermark as a soft binding — it carries a unique identifier that points to a cloud-hosted C2PA manifest. Strip the metadata, convert the format, play it through the air and re-record it. The watermark persists. The detector extracts the identifier, queries the ledger, and retrieves the full provenance record.
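
The lookup itself is simple. In the sketch below, decode_watermark_id is a stub and the manifest endpoint is a placeholder URL; what matters is the flow, which C2PA refers to as soft binding:

```python
import json
import urllib.request

MANIFEST_API = "https://example.com/manifests/"   # placeholder endpoint

def decode_watermark_id(samples) -> str:
    """Stub: extract the embedded identifier from the detected watermark."""
    raise NotImplementedError

def fetch_provenance(samples) -> dict:
    asset_id = decode_watermark_id(samples)
    # File headers may be long gone, but the watermark-borne identifier
    # still resolves to the cloud-hosted C2PA manifest.
    with urllib.request.urlopen(MANIFEST_API + asset_id) as resp:
        return json.load(resp)

# manifest = fetch_provenance(recorded_samples)
# -> who signed it, whether it was AI-generated, what edits were made
```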

Provenance should travel with the content, not sit in a header that gets stripped the moment someone clicks "Export as MP3."

And for anyone worried about privacy — a dissident journalist or an anonymous artist shouldn't need to attach their legal name to a file just to prove it's real. C2PA supports pseudonymous claims and selective disclosure. An artist can sign a track as "Verified Creator #892," linked to a credential issued by a trusted third party, without revealing their home address.

Why Not Just Hire More Moderators?

Because it's economically impossible. Research shows human moderators are more accurate at detecting nuance and context, but they cost nearly 40 times more than automated systems. And human hearing is running out of biological headroom: distinguishing a high-quality AI voice clone from a real recording is approaching the limits of what our ears can resolve, even as it remains mathematically tractable for machines.

The industry needs the nuance of human judgment at the scale and cost of software. That's what deterministic watermark detection provides. A watermark is either present or it isn't. There's no confidence score to interpret, no probability curve that requires a human reviewer to break a tie. This allows fully automated action — demonetization, flagging, takedown — with legal-grade certainty.

The Fork in the Road

People sometimes ask me whether I think AI will destroy the music industry. I don't. I think the music industry will be fine — if it stops pretending that the tools built for the last era work in this one.

Fingerprinting was built for a world where content was created by humans and the challenge was identifying copies. We now live in a world where content is created by machines and the challenge is proving origin. These are fundamentally different problems, and they require fundamentally different infrastructure.

Spotify's 1,000-stream minimum threshold for royalty payouts is a policy band-aid. User-centric payment models are a structural improvement. But neither addresses the root cause: platforms cannot currently tell the difference between a new AI track and a new human track. Until that changes, every other fix is downstream.

The generative capability is a commodity now. Anyone with a GPU or an API key can flood the pipeline. The scarcity — and therefore the value — has shifted to provenance. Not what was created, but who created it, how, and whether it's real.

The future of AI music isn't about the model that generates the best melody. It's about the infrastructure that guarantees that melody is real, remunerated, and recognized.

With the EU AI Act and pending US deepfake regulation, watermarking is moving from optional to required. The question isn't whether the industry will adopt provenance standards. It's whether it will adopt them before or after the royalty pools have been bled dry.

I know which side of that bet I'm building on. If you can't watermark it, don't generate it. That's not a slogan. It's the only operational reality that makes a trusted audio internet possible.
