Your News Archive Is the Asset. Stop Letting Google Rent It for Free.

Ashutosh Singhal · May 9, 2026 · 14 min read

A regional daily I sat with last winter had four million monthly readers and a thirty-two-year archive, and the February board pack was the worst document anyone in the room had read in a decade. Organic search referrals down 41% year over year. Programmatic ad rates down another 18%. The affiliate revenue that quietly kept the lights on in 2023 had collapsed to a third of its peak. The CFO put the slide up, looked at me, and asked the only question that mattered: what exactly does Google owe us, and how do we make it pay?

I had to tell her the truth: Google owes them nothing contractually. The unwritten deal that built the web (you let us crawl you, we send you traffic) was rewritten when AI Overviews arrived; as of March 2026 they appear on 48% of all Google searches (theStacc / Search Engine Land). When an AI summary sits above the organic link, the reader rarely scrolls; the Daily Mail's desktop click-through on those queries fell 89%, from 25.23% to 2.79%. The content is still being read. The publisher is just no longer being paid for it. The way back, I'm convinced, is conversational AI built on the publisher's own archive.

That conversation is why my team at Veriprajna now builds conversational AI for publishers — retrieval-augmented generation engines that run over a publisher's own news archive, with the citation enforcement and licensing strategy that make them safe to put under a masthead. Retrieval-augmented generation, or RAG, just means the AI answers only from your documents instead of from whatever it absorbed off the open internet. That distinction turns out to be the whole game. But I want to walk through how we got there, because we got the first version badly wrong, and the way we got it wrong is exactly the way most publishers are about to get it wrong.

The referral economy is over. The licensing economy isn't built yet.

Start with how bad the bleeding is, because the scale is the argument. Google search traffic to publishers fell roughly 33% globally in the year to November 2025, and the news executives the Reuters Institute surveyed expect a further 43% decline over the next three years; a fifth of them brace for losses above 75% (Reuters Institute Journalism Trends 2026). The individual cases run worse, and one of them became a lawsuit: Penske Media, whose affiliate revenue fell more than a third from its peak, sued Google in September 2025, the first major antitrust case alleging the company coerces publishers into supplying content for AI training.

The search engine that built your distribution is now your largest competitor, and it didn't ask permission to switch sides.

So the instinct in every boardroom is identical: if Google won't send us readers, we'll build our own AI and keep them on our site. Correct instinct. The execution is where people get hurt.

We built the obvious thing first. It hallucinated about a councilman.

I'll be honest about our own scar tissue, because I earned it. The first engine we stood up was the version every SaaS vendor will quote you for $60K to $120K and deploy in a few weeks: take the archive, chop every article into chunks, turn the chunks into vector embeddings, and let a chatbot retrieve the nearest matches. It demos beautifully. You ask it about a recent story, it pulls the right article, it sounds great.
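
A minimal sketch of that chunk, embed, retrieve loop, offered as an illustration rather than the engine described above. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the sample chunks and the name "Reyes" are invented for the example.

```python
import math
import re
from collections import Counter

def chunk(text, size=12):
    """Split an article into fixed-size word windows (no overlap, for brevity)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks whose vectors sit nearest the query vector."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Invented sample chunks: the third is about the same man but never names him.
SAMPLE_CHUNKS = [
    "Councilman Daniel Reyes voted on the downtown project",
    "Reyes met the developer twice last year",
    "The Ward representative attended the ribbon cutting with the developer",
]
top = retrieve("Reyes developer history", SAMPLE_CHUNKS, k=2)
```

In this toy run the caption-style chunk falls outside the top two because nothing in its wording ties "the Ward representative" to the name in the query. Similarity of text is the only signal the loop has.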

Then, in a pilot session, an editor asked it the kind of question any reader would ask: what's the history between this city councilman and the developer behind the downtown project? That's a multi-hop question — it means connecting one person across a dozen articles spanning years, where he shows up variously as the full name with title, then just a surname, and once, in an old photo caption, only as "the Ward representative," back before he held the seat at all. The vector search retrieved a few chunks that were textually similar, stitched them into a fluent paragraph, and asserted a relationship the archive did not actually support. It invented the connective tissue.

I watched the editor's face when she caught it. That look — a journalist realizing the machine just fabricated a fact under her newspaper's name — is the thing I now build everything to prevent. We had shipped a confident liar.

That failure is the entire reason I no longer believe a chat widget is a product. It has no idea that those scattered names are one person. It has no model of time — that the man was a private citizen for years before he ever won the seat. And it has no mechanism to check whether the sentence it just generated is actually grounded in a source. On the queries that don't matter, it's fine. On the queries readers actually care about, the longitudinal ones, the ones about people and money over time, it breaks exactly where it most needs to hold.

Why didn't RAG just fix this?

Because "RAG" describes a category, not a competence, and the hard 60% of the work is the part nobody sells.

The first thing I underestimated — badly — was ingestion. I walked in assuming a thirty-year archive was a clean corpus waiting to be embedded. It was not. It was two decades of mangled HTML from three CMS migrations, dead internal links pointing nowhere, and a pre-2005 layer that existed only as scanned microfilm. When we ran ordinary OCR over those scans, headlines, body columns, and photo captions came back fused into a single block of nonsense, because the OCR couldn't see the page layout. We had to bring in layout-aware optical character recognition — the kind that detects columns and separates a headline from a caption from the running text — using tools like Amazon Textract and Azure Document Intelligence. That's the unglamorous work that determines whether the whole thing functions, and it's the line item every chatbot vendor leaves off the quote.
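
To make "layout-aware" concrete, here is a small sketch of what happens after a layout-aware OCR pass has labelled and located each text block. The `Block` shape, the `kind` labels, and the two-column assumption are hypothetical; this is not the response schema of Amazon Textract or Azure Document Intelligence.

```python
from dataclasses import dataclass

@dataclass
class Block:
    kind: str   # "headline", "body", or "caption" (hypothetical labels)
    x: float    # left edge, as a fraction of page width
    y: float    # top edge, as a fraction of page height
    text: str

def reading_order(blocks, column_width=0.5):
    """Keep headline, captions, and body apart; read body text column by column."""
    headlines = sorted((b for b in blocks if b.kind == "headline"), key=lambda b: b.y)
    captions = [b for b in blocks if b.kind == "caption"]
    body = [b for b in blocks if b.kind == "body"]
    # Sort by column index first, then top to bottom inside each column.
    body.sort(key=lambda b: (int(b.x // column_width), b.y))
    return {
        "headline": " ".join(b.text for b in headlines),
        "body": " ".join(b.text for b in body),
        "captions": [b.text for b in captions],
    }

# Invented two-column page for illustration.
page = [
    Block("headline", 0.0, 0.05, "Council approves project"),
    Block("body", 0.0, 0.20, "First paragraph."),
    Block("body", 0.5, 0.20, "Third paragraph."),
    Block("body", 0.0, 0.60, "Second paragraph."),
    Block("caption", 0.5, 0.70, "The site on Main Street."),
]
article = reading_order(page)
```

A plain top-to-bottom read of the same page would interleave the two columns and fold the caption into the story, which is the fused block of nonsense described above.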

The second thing was the councilman problem, which has a name: entity resolution. For the AI to reason about a person, every mention of that person — every byline variant, every honorific, every oblique reference — has to collapse to a single node. We solved it the way the serious shops do, with a knowledge graph: Microsoft's open-source GraphRAG layered on Neo4j, with the graph data science library doing the disambiguation so "Mr. Musk," "Elon Musk," and "the Tesla CEO" become one entity, not three. One implementation in this space reached 12 million nodes. Out of the box, GraphRAG knows nothing about your local political beat or your fifty recurring sources. Teaching it your archive's specific people is custom work, and it's the work that turns "textually similar chunks" into "the actual history of a person."
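
The core idea of entity resolution can be shown in a few lines, as a sketch only. It uses a hand-written alias table in place of GraphRAG and the Neo4j graph data science library; the `ALIASES` table, the entity id format, and the article ids are all assumptions made for illustration.

```python
import re

# Hypothetical alias table; a real one is curated or learned per archive.
ALIASES = {
    "elon musk": "person:elon-musk",
    "mr. musk": "person:elon-musk",
    "the tesla ceo": "person:elon-musk",
}

def normalize(mention):
    """Lowercase a mention and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", mention.strip().lower())

def resolve(mention):
    """Return the canonical entity id for a mention, or None if unknown."""
    return ALIASES.get(normalize(mention))

def build_graph(mentions):
    """Group (article_id, mention) pairs under one node per resolved entity."""
    graph = {}
    for article_id, mention in mentions:
        entity = resolve(mention)
        if entity is not None:
            graph.setdefault(entity, set()).add(article_id)
    return graph

graph = build_graph([
    ("a1", "Mr. Musk"),
    ("a2", "Elon  Musk"),
    ("a3", "the Tesla CEO"),
    ("a4", "an unnamed executive"),
])
```

Three surface forms land on one node with three articles attached, and the unresolved mention is left out rather than guessed at. The custom work is in building that alias knowledge for a local beat, which no lookup table this small can capture.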

A vector search finds passages that sound alike. A knowledge graph knows that two passages are about the same human being. Readers ask about humans.

The third was time. News is inherently temporal — the same office held by different people, the same bill amended across sessions, the same company before and after a merger. We give graph edges valid-start and valid-end timestamps and decompose a question like "who chaired the committee when the bill passed" into temporal sub-queries, drawing on the temporal-RAG research that's emerged over the last year (the T-GRAG and VersionRAG lines of work). Without it, the engine flattens history into a single confused present, which is precisely how you assert that a private citizen held office years before he was ever elected.
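
A minimal sketch of time-bounded edges and the two-step decomposition, assuming an in-memory edge list rather than a graph database. The `Edge` fields, the `events` lookup, and every name and date in the sample data are hypothetical, and this is not the T-GRAG or VersionRAG method.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Edge:
    subject: str
    relation: str
    obj: str
    valid_start: date
    valid_end: Optional[date]  # None means the relationship is still current

def holders_on(edges, relation, obj, when):
    """Subjects whose edge to obj was valid on the given date."""
    return [
        e.subject for e in edges
        if e.relation == relation and e.obj == obj
        and e.valid_start <= when
        and (e.valid_end is None or when < e.valid_end)
    ]

def chair_when_bill_passed(edges, events, bill, committee):
    """Sub-query 1: when did the bill pass? Sub-query 2: who chaired then?"""
    passed_on = events[bill]
    return holders_on(edges, "chaired", committee, passed_on)

# Invented sample data.
edges = [
    Edge("Alice Ng", "chaired", "finance-committee", date(2015, 1, 1), date(2019, 1, 1)),
    Edge("Bo Tran", "chaired", "finance-committee", date(2019, 1, 1), None),
]
events = {"bill-402": date(2018, 6, 30)}
chair = chair_when_bill_passed(edges, events, "bill-402", "finance-committee")
```

Without the validity window, both chairs would match the question equally, and the answer would name whoever the retriever happened to surface first.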

The line that should be in every publisher's RFP: who's liable for the fabricated quote?

Here is the moment that made me read a Slack leak aloud to my own team as a cautionary tale.

In November 2024 the Washington Post launched Ask The Post AI. By late 2025 they'd extended it into an AI-generated podcast, and in December the standards editor's internal messages leaked: the thing was inventing quotes, misattributing sources, and inserting commentary as if it were the paper's own editorial position. "It is truly astonishing that this was allowed to go forward at all," one editor wrote (Semafor, Dec 11, 2025). The technical defect was small and specific — a missing citation-verification step. The reputational damage was global, and it happened to one of the most resourced newsrooms on earth.

I keep that leaked message close because it's the cheapest lesson available in this entire field. The Post had the engineers. What they didn't have, in that workflow, was an enforcement layer that refuses to emit a sentence unless every claim in it traces to a retrieved source, plus a confidence threshold that routes shaky answers to a human review queue before they go live, plus a clear answer to the board-level question: when the machine fabricates a quote, who is liable? That governance is not shrink-wrapped software. It's the difference between a tool you can put under your masthead and a global embarrassment.
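
The shape of such an enforcement layer can be sketched briefly, under loud assumptions. The `support` score here is plain word overlap, a toy stand-in for a real grounding check, and the 0.8 threshold, the source ids, and the sample sentences are invented.

```python
import re

def tokens(text):
    """Lowercased word set for a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def support(claim, source_text):
    """Fraction of the claim's words found in the source (toy grounding score)."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & tokens(source_text)) / len(claim_tokens)

def enforce(claims, sources, threshold=0.8):
    """Cite every claim, or route the whole answer to human review."""
    cited = []
    for claim in claims:
        best_id, best_score = max(
            ((sid, support(claim, text)) for sid, text in sources.items()),
            key=lambda pair: pair[1],
            default=(None, 0.0),
        )
        if best_score < threshold:
            # One weak claim is enough to keep the answer away from readers.
            return {"status": "review", "unsupported": claim}
        cited.append((claim, best_id))
    return {"status": "publish", "cited": cited}

sources = {"art-1": "The council approved the downtown project in 2019."}
grounded = enforce(["The council approved the downtown project."], sources)
invented = enforce(["The councilman said the deal was corrupt."], sources)
```

The design choice worth noting is that the gate fails closed: an answer with no retrievable support goes to the review queue by default, so publication is the exception that has to be earned.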

Contrast it with Ask FT, which the Financial Times built on Anthropic's Claude, grounded strictly in FT journalism with mandatory citations linking back to source articles. Notably, it produced retention lift, not the subscription cannibalization everyone feared. The difference between the two outcomes isn't model quality. It's whether citation enforcement was treated as the product or as a feature to add later.

Why not just buy it, or build it like the FT did?

People ask me this constantly, so let me lay out the landscape honestly, because I've watched publishers lose their runway to each of the wrong answers.

The SaaS chatbot vendors drop a widget on your site for $60K-$120K. I've told you what that buys: vector embeddings, no entity resolution, no temporal reasoning, no citation verification, and your archive living in someone else's cloud. It hallucinates on exactly the queries that matter.

The Big Five in-house builds — the FT, the Times, Bloomberg, the Post, the Guardian — are real and impressive, and they were built by ML teams of six to twenty engineers over twelve to twenty-four months, running into seven figures. A regional daily with a one-to-three-person engineering team cannot replicate that headcount. Full stop. Telling a mid-tier publisher to "build it like the FT" is telling them to grow an organ they don't have.

The big consultancies — Accenture, Deloitte, IBM — will absolutely build it, with engagements that run $1.5M to $5M-plus and a discovery phase that outlasts your runway. They reach for the same Microsoft GraphRAG and Neo4j stack we do, then charge partner-tier rates on top, and they haven't built five publisher archives back to back to know where the bodies are buried.

That gap — between the widget that doesn't work and the seven-figure team you can't staff — is the entire reason mid-tier publishers need a build partner who has already done the unsexy ingestion, entity-resolution, and citation-enforcement work. Not a SaaS subscription, and not a discovery deck. It's the specific thing we built our publisher practice to be, and it's laid out in full on our solution page for conversational AI over news archives.

The part most consultancies get wrong: you need two revenue plays, not one.

Figure: Three publisher revenue plays (retention engine, crawl capture, paid intelligence tier) feeding one revenue model.

Here's the strategic mistake I see even smart publishers make. They treat "build our own AI" and "license our content to the AI companies" as an either/or. It's both, and they capture different money.

Your own conversational engine is the retention play — it keeps readers on your site asking your archive questions instead of bouncing to Perplexity. But it does nothing about the leakage happening right now, where AI crawlers ingest your content for free. For that you need the capture play, and the infrastructure for it only just got built.

Cloudflare launched Pay Per Crawl in January 2026, default-blocking AI crawlers across roughly 20% of all web traffic and letting publishers set Allow, Charge, or Block per crawler at a per-request price — the first infrastructure-level bot tax. Tollbit does something similar bot-by-bot; the Boston Globe, Vox, and Future have piloted it. And in March 2026 the News/Media Alliance signed a deal with ProRata letting 2,200 small and mid-sized publishers opt into a collective licensing pool with a 50/50 revenue share on attribution-tracked AI answers, with a parallel Bria deal for enterprise RAG use. The NMA handles the paperwork a regional daily could never staff alone.
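
The Allow, Charge, or Block decision reduces to a small policy table. The sketch below is a generic illustration, not Cloudflare's or Tollbit's configuration format or API; the crawler names, the price, and the choice of HTTP status codes (200, 402 Payment Required, 403 Forbidden) are assumptions.

```python
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    CHARGE = "charge"
    BLOCK = "block"

# Hypothetical per-crawler policy: names and per-request prices are made up.
POLICY = {
    "example-search-bot": (Action.ALLOW, 0.0),
    "example-ai-bot": (Action.CHARGE, 0.01),
}
DEFAULT = (Action.BLOCK, 0.0)  # unknown crawlers are blocked by default

def decide(crawler, offered_price=0.0):
    """Return the HTTP status a request from this crawler would receive."""
    action, price = POLICY.get(crawler, DEFAULT)
    if action is Action.ALLOW:
        return 200
    if action is Action.CHARGE:
        # Serve the page only if the crawler agrees to at least the set price.
        return 200 if offered_price >= price else 402
    return 403
```

For example, `decide("example-ai-bot")` returns 402 until the crawler offers at least the listed price, while a crawler missing from the table gets 403.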

Crawler tolls capture crawl revenue. They do nothing for query revenue. The honest answer is to run both a toll and a retention engine — anyone who tells you to pick one is selling you the half they happen to build.

A caveat I always give: none of the capture tools stop AI Overviews from summarizing you, because those retrieve at query time, not crawl time. The leakage and the summarization are separate problems. But a publisher who runs Tollbit or Cloudflare on the crawl side, opts into ProRata on the licensing side, and runs their own engine on the retention side has finally rebuilt a business model out of three partial answers. That's the whole strategy, and stitching those four systems — ProRata, Bria, Tollbit, Cloudflare — into one coherent revenue plan alongside the engine is most of what we actually do.

And there's a third play I push hardest with trade and B2B specialty publishers, because it's the one with the most upside: the engine isn't only a retention moat, it's a product you can charge for. A citation-grounded query interface over a deep vertical archive — thirty years of court filings, drug approvals, or deal data on your specific beat, the kind of corpus a compliance officer or analyst will expense a seat to query — is a paid intelligence tier, and eventually an API, that a competitor literally cannot generate because they don't own your corpus. That's the move the FT made at the high end with its thousand-dollars-plus professional tier, and Bloomberg with a system that turns plain English into its own query language across earnings calls and research. The trade press is sitting on the same opening over a narrower, deeper archive, and most of it hasn't noticed yet.

What this looks like when it works

Figure: Six-stage publisher RAG pipeline (OCR, entity resolution, temporal reasoning, hybrid retrieval, reranking, citation enforcement).

The engine that survived our councilman disaster looks nothing like the first one. It ingests the archive with layout-aware OCR that respects the page. It resolves entities into a knowledge graph so a person is one node, not a dozen scattered mentions. It reasons over time so history isn't flattened into the present. It runs hybrid retrieval — sparse keyword matching (BM25) alongside dense semantic search, because a query for "Bill 402" needs an exact match no embedding will reliably surface — and it reranks the candidates before they reach the model, a Cohere or BGE reranker buying 15-30% on retrieval quality for 50-200 milliseconds of latency, a trade I'll take every time on a query that matters. And it refuses, structurally, to emit a claim it can't cite, routing the uncertain ones to an editor instead of a reader.
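
One common way to merge a sparse ranking with a dense one is reciprocal rank fusion. The article does not say which fusion method the engine uses, so treat this as a sketch of one option; the document ids and both input rankings are invented, and the reranker that would run afterwards is not shown.

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Invented rankings for the query "Bill 402".
sparse = ["bill-402-vote", "bill-402-amendment", "budget-hearing"]  # keyword (BM25-style) order
dense = ["zoning-reform", "bill-402-vote", "budget-hearing"]        # embedding order
fused = rrf([sparse, dense])
```

The exact-match document tops the fused list even though the dense ranking placed an unrelated story first, which is the reason for running both retrievers rather than trusting either alone.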

It also bolts onto whatever CMS the publisher already runs, and that integration is never generic. Arc XP exposes a content API but no embedding hooks, so you build alongside it. WordPress VIP lets you register custom endpoints. Brightspot is more flexible. Atypon, used by scholarly publishers, has its own search you have to sit beside. A builder who's seen all four knows where each one fights you.

The newsrooms shifting fastest already feel the direction — cutting commodity general news, which is down 38% in their output, and pouring effort into original investigations, up 91%, and contextual analysis, up 82% (Reuters Institute). Their archive of that distinctive work is the one asset an AI company can't generate and a competitor can't replicate. The mistake is leaving it sitting on a server, indexed for free, while the summarizers monetize it.

That regional daily's CFO asked what Google owed her. The better question, the one I wish more publishers asked first, is what their own archive is worth when they control the engine reading it. Thirty-two years of reporting is a proprietary corpus that no model on earth was trained on — and the only people who can turn it into a conversation a reader will pay to keep having are the people who own it. You can find how we build that, end to end, here. The archive was always the asset. What changed is that you finally have to stop giving it away.
