
Your AI Chatbot Just Became a Legally Binding Employee. Most Companies Haven't Noticed.
A few months after the Moffatt v. Air Canada ruling dropped, I was on a call with a prospective client — a mid-sized fintech company, maybe 200 employees, growing fast. They'd built a customer-facing chatbot using a popular GPT wrapper. Clean UI. Friendly tone. Customers loved it.
I asked one question: "What happens when your bot quotes the wrong interest rate?"
Dead silence. Then their CTO said, "It won't. We've got good prompts."
I pulled up the ruling on my screen and read them the line where the tribunal said Air Canada "could not separate itself from the AI chatbot." That the company was liable for every word the bot generated, same as if a human employee had said it. That the airline's defense — arguing the chatbot was basically a "separate legal entity" responsible for its own mistakes — was rejected with something close to judicial contempt.
The CTO's face changed. Because here's what that ruling actually means: if your AI chatbot promises a customer a 2% rate in a 5% environment, or invents a refund policy that doesn't exist, or hallucinates a warranty term — congratulations, your company just signed a contract. Not metaphorically. Legally.
And the scariest part? Almost nobody building enterprise AI has internalized this.
The Ruling That Rewrote the Risk Profile of Every AI Chatbot
Let me tell you what actually happened in the Moffatt case, because the details matter more than the headlines suggested.
Jake Moffatt's grandmother died. He went to Air Canada's website, found the chatbot, and asked about bereavement fares. The chatbot — confidently, fluently, in the helpful tone these systems are optimized for — told him to buy a full-price ticket now and apply for a bereavement discount within 90 days for a partial refund.
That policy didn't exist. The airline's actual rules, buried in the tariff documents and static pages, said the opposite: no retroactive refunds once you've flown. The chatbot had hallucinated a policy that sounded right because, statistically, the phrase patterns around "bereavement" and "refund" and "90 days" co-occur frequently in airline policy documents across the industry.
When Moffatt asked for his refund and Air Canada said no, he took them to the tribunal. Air Canada's lawyers made an argument I still find breathtaking: they claimed the chatbot should be treated as a separate legal entity, responsible for its own statements. That the correct information was available elsewhere on the website, so the company had done its duty.
The tribunal didn't just reject this. Tribunal Member Christopher Rivers essentially said: there is no meaningful distinction between a human agent, a static webpage, and an interactive bot. They're all the company talking to the customer.
If your AI says it, your company has signed it. The tribunal established that hallucinations are not software bugs — they're negligent misrepresentation.
Three precedents came out of that ruling that should keep every CTO awake at night. Unified liability: it doesn't matter if the information comes from HTML text or a neural network — it's all the company's representation. Duty of care: deploying an unverified probabilistic model for policy dissemination is negligence. And the one that guts most current architectures: the "black box" defense is dead. The internal complexity of your AI system offers zero legal protection.
The damages were $800. The precedent is worth billions in future liability exposure.
Why "Good Prompts" Won't Save You

I need to be blunt about something that a lot of AI consultancies don't want to hear: Retrieval-Augmented Generation is not a compliance solution.
When I first started digging into the Moffatt case details, I expected to find that the chatbot had no access to the correct policy. That would've been a simple retrieval failure — fixable, understandable. Instead, I found something worse. The chatbot actually provided a link to the correct bereavement policy page. It had the right document. It just summarized it wrong.
This is the failure mode that breaks the "just add RAG" narrative. The chatbot retrieved the right context and still hallucinated the answer.
Here's why. Large Language Models are probabilistic engines. They predict the next likely token based on statistical patterns in training data. When an LLM says "refunds are available within 90 days," it's not querying a rules database. It's completing a sentence pattern that's statistically probable based on millions of documents it ingested during training — documents that included countless different refund policies from countless different companies.
Giving the model the correct document helps. But if the retrieved text is complex, if the legal language is dense, if there's a subtle negation buried in a subordinate clause — the model can ignore the retrieved context in favor of its pre-trained biases. This isn't a rare edge case. It's a known failure mode called parametric memory dominance, and it happens more often with precisely the kind of complex policy language that matters most for compliance.
I've seen this firsthand. We were testing a prototype for a client in healthcare, and the system had the correct drug interaction data in its context window — literally right there in the prompt. The model still generated a response that softened a "severe interaction" warning into a "mild caution." Because in the training data, most text about those two drugs together appeared in contexts that minimized the risk. The retrieval was perfect. The generation was dangerous.
RAG provides knowledge, but it does not guarantee adherence. You cannot solve a strict logic problem with a probability engine alone.
The numbers back this up. Global losses attributed to AI hallucinations hit $67.4 billion in 2024. Even the best frontier models — GPT-4o, Gemini 2.0 — retain baseline hallucination rates between 0.7% and 3% depending on task complexity. That sounds small until you do the math: a bank's AI assistant handling a million queries a month at a 0.7% hallucination rate produces 7,000 potential regulatory violations. Every month.
And enterprises are already paying a hidden tax for this unreliability. Forrester estimates that hallucination mitigation costs roughly $14,200 per employee per year in lost productivity — humans double-checking AI work that can't be trusted to stand on its own. The market for hallucination detection tools grew 318% between 2023 and 2025. That's not a sign of a problem being solved. That's a sign of an industry frantically patching a fundamentally flawed approach.
What Does a Chatbot That Can't Lie Look Like?

There was a moment — I remember it clearly because it happened during a late-night architecture session with my team — when the core idea clicked. We were arguing about how to make an LLM "more accurate" for a compliance use case. Better prompts. Better retrieval. Fine-tuning on domain data. And one of my engineers said something that stopped the conversation: "Why are we asking the model to be accurate? It's not designed for accuracy. It's designed for fluency."
She was right. And that reframe changed everything about how we build.
The answer isn't making the probabilistic model less probabilistic. The answer is not letting it make decisions at all when the stakes are high.
We call this a Deterministic Action Layer — a middleware component that sits between the user and the LLM, acting as a traffic controller. When a customer asks about the weather or wants help drafting an email, the LLM does what it's great at: generating fluent, helpful, creative text. But the moment the conversation touches refunds, pricing, legal terms, warranties, privacy policy — anything where a wrong answer creates liability — the system switches modes entirely.
Instead of letting the LLM generate an answer from its weights, the Deterministic Action Layer triggers hard-coded logic. A database query. A decision tree. A pre-written, legally vetted response template. The LLM's role shrinks from "author" to "translator" — it might rephrase the result into a polite sentence, but it cannot add, remove, or reinterpret the information.
Think of it this way. If the Moffatt chatbot had this architecture, here's what would have happened: the semantic router detects the intent — bereavement_refund. Instead of letting the model riff on what it thinks bereavement refund policies usually say, it executes a deterministic function: if ticket_status == 'flown' return NO_REFUND. The response comes back: "Our policy strictly prohibits refunds after travel. Reference: Tariff Rule 45." Boring. Legally airtight. Exactly what was needed.
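That deterministic branch fits in a few lines. This is a minimal sketch, assuming a hypothetical `ticket_status` field and reusing the tariff reference from the example above; the templates and function name are illustrative, not Air Canada's actual logic:

```python
# Hypothetical deterministic branch for the bereavement_refund intent.
# The LLM never authors this answer; it can only relay the template.

NO_REFUND_TEMPLATE = (
    "Our policy strictly prohibits refunds after travel. "
    "Reference: Tariff Rule 45."
)
REFUND_ELIGIBLE_TEMPLATE = (
    "Your ticket may be eligible for a bereavement refund. "
    "Reference: Tariff Rule 45."
)

def bereavement_refund_response(ticket_status: str) -> str:
    """Hard-coded policy logic: no model weights involved in the decision."""
    if ticket_status == "flown":
        return NO_REFUND_TEMPLATE
    return REFUND_ELIGIBLE_TEMPLATE
```

The point of the rigidity is the audit trail: every possible answer this function can emit has been read by a lawyer before deployment.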
I wrote about this architecture in depth in the interactive version of our research, but the core insight is simple: separate the conversation from the compliance. Let the neural network handle the messy, beautiful variability of human language. Let deterministic code handle the parts where being wrong costs money.
The Silence Protocol
There's a specific design pattern we use that I think captures the philosophy better than any architecture diagram. We call it the Silence Protocol.
When a user asks about a topic we've classified as "Compliance Critical," the generative AI's creative capabilities are effectively muted. The system switches from "Author" mode to "Reader" mode. It retrieves the exact text from the database and serves it verbatim, or fills a strict template with variables from a trusted source.
And here's the part that makes some product managers uncomfortable: if the user asks a question that falls into a policy gap — where no deterministic rule exists — the system doesn't improvise. It says: "I cannot answer that question directly. Let me connect you with a human specialist."
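The Silence Protocol is almost embarrassingly simple to express in code. In this sketch, the topic names, the vetted-policy table, and the escalation message are all invented for illustration; the behavior that matters is the fallback, where a policy gap produces an escalation rather than a generated guess:

```python
# "Reader" mode: serve vetted text verbatim for compliance-critical topics.
# A policy gap triggers escalation to a human, never improvisation.

VETTED_POLICIES = {
    "bereavement_refund": "Refunds are not available after travel is complete.",
    "data_retention": "Account data is retained for 7 years as required by law.",
}

ESCALATION_MESSAGE = (
    "I cannot answer that question directly. "
    "Let me connect you with a human specialist."
)

def answer_compliance_topic(topic: str) -> str:
    # Verbatim retrieval or silence -- no third option.
    return VETTED_POLICIES.get(topic, ESCALATION_MESSAGE)
```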
I had a potential client push back on this hard. "Users want instant answers," he said. "A chatbot that says 'I don't know' feels broken."
I asked him which feels more broken: a chatbot that says "let me get you a human," or a chatbot that invents a refund policy the company then has to honor, followed by six months of legal damage control?
In legal terms, creativity regarding contract terms is synonymous with fabrication. The most valuable feature of an enterprise AI isn't what it can say — it's what it's prevented from saying.
We disable creativity for compliance topics because in a post-Moffatt world, an AI that "helpfully" improvises a policy is an AI that's rewriting your contracts in real time without authorization.
How Does the System Know What's Dangerous?
This is the question I get most often, and it's the right one. The architecture only works if the routing layer — the traffic controller — can reliably distinguish between "tell me about your company's history" (safe for LLM generation) and "can I get a refund on this?" (must be handled deterministically).
We use semantic routing, which is fundamentally different from the brittle keyword matching of older chatbot systems. A keyword system looking for "refund" would miss "I want my money back" or "can you reimburse me." Semantic routing converts the user's query into a high-dimensional vector embedding and compares it against predefined canonical examples for restricted topics.
The key detail: this routing layer sits outside the LLM's context window. This matters enormously for security. Prompt injection attacks — where users craft inputs designed to trick the model into ignoring its instructions — are a real and growing threat. But if the routing decision happens before the query ever reaches the model, those attacks become irrelevant to the compliance logic. You can't jailbreak a system that never gives the model the keys in the first place.
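To make the routing idea concrete, here is a toy semantic router. A production system would use a real sentence-embedding model; the bag-of-words vectors and the 0.3 threshold below are stand-ins chosen only to keep the sketch self-contained and runnable:

```python
# Toy semantic router: embed the query, compare against canonical examples
# per restricted topic, route on best cosine similarity above a threshold.
# Queries matching no restricted topic fall through to free-form generation.
import math
from collections import Counter

CANONICAL_EXAMPLES = {
    "refund": [
        "can i get a refund",
        "i want my money back",
        "can you reimburse me",
    ],
    "smalltalk": [
        "tell me about your company history",
        "what can you do",
    ],
}

def embed(text: str) -> Counter:
    # Stand-in for a sentence encoder: word-count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def route(query: str, threshold: float = 0.3) -> str:
    q = embed(query)
    best_topic, best_score = "llm_freeform", 0.0
    for topic, examples in CANONICAL_EXAMPLES.items():
        for example in examples:
            score = cosine(q, embed(example))
            if score > best_score:
                best_topic, best_score = topic, score
    return best_topic if best_score >= threshold else "llm_freeform"
```

Note that "i want my money back" routes to `refund` without containing the word "refund" at all, which is exactly what keyword matching misses; with real embeddings, paraphrases score high even with zero word overlap.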
Once a sensitive intent is detected, we use function calling — a capability in modern LLMs where the model outputs structured data (a JSON object calling a specific function) rather than free-form text. The LLM extracts parameters from the conversation — ticket ID, purchase date, travel date — and passes them to a deterministic code block. Python. SQL. Whatever executes the actual business logic. The model never calculates the refund. It never decides eligibility. It translates natural language into an API call, and translates the API response back into natural language. The deciding is done by code, not by probability.
For the full technical breakdown of the routing architecture, function calling patterns, and our verification pipeline, see our technical deep-dive.
The Regulatory Walls Are Closing In
If the Moffatt precedent wasn't enough motivation, the regulatory landscape is about to make deterministic guardrails non-optional.
The EU AI Act classifies many customer-facing AI systems — especially in transport, banking, and essential services — as High-Risk. Article 14 mandates human oversight: systems must be designed so humans can interpret outputs, intervene, and hit the stop button. A black-box LLM wrapper doesn't satisfy this. A Deterministic Action Layer — where the compliance officer writes the rules that the system executes — does.
GDPR Article 22 grants individuals the right not to be subject to decisions based solely on automated processing when those decisions have legal or significant effects. Denying a refund is a significant effect. Denying a loan application is a significant effect. When a customer asks "why was I denied?", a neural network can't explain its reasoning because it doesn't have reasoning — it has statistical weights. A deterministic logic tree can point to the exact node: "Credit score below threshold" or "Ticket status: flown."
And ISO 42001 — the first global standard for AI governance — requires organizations to map where probabilistic versus deterministic logic is used, measure hallucination rates, and maintain complete audit trails. We designed our architecture specifically to be audit-ready for this standard. Every interaction, every routing decision, every policy execution is logged with a traceable logic path.
This isn't theoretical compliance. I've sat in rooms with enterprise legal teams who are actively rethinking their AI deployments because of these frameworks. The companies that build the guardrails now will deploy AI faster and more broadly than those scrambling to retrofit compliance later.
"But Isn't This Expensive?"
People always ask me this, and I understand the instinct. Building semantic routing, deterministic logic layers, knowledge graphs, verification pipelines — it's undeniably more complex than wrapping an API call in a nice UI.
But let me reframe the question. What's the cost of not building it?
Air Canada's damages were $800. But the legal fees dwarfed that. The reputational damage — "airline argues its own chatbot is a separate legal entity" became a global punchline — is incalculable. And that was a single interaction about a single bereavement fare.
Now imagine a financial services chatbot that hallucinates a loan approval. A healthcare bot that softens a drug interaction warning. An insurance bot that invents coverage terms. We're not talking about $800 anymore. We're talking about class-action territory.
The $14,200 per employee per year that enterprises currently spend on hallucination mitigation — humans manually verifying AI outputs because nobody trusts them — that's the real cost of "cheap" AI. The wrapper is cheap to build and expensive to operate. The deterministic architecture is expensive to build and cheap to trust.
This Is About What Comes Next
I want to end on something that goes beyond the current chatbot conversation, because I think the Moffatt ruling is a preview of a much larger shift.
We're moving from an era of AI chatbots to an era of AI agents — systems that don't just answer questions but take actions. Book flights. Transfer money. Approve claims. Sign agreements. The legal fiction that "the user should verify the information" was already weak when applied to chatbots. It's completely untenable when applied to agents that execute transactions autonomously.
Every company deploying AI that touches money, contracts, or regulated decisions is making a choice right now, whether they realize it or not. They're either building systems where the AI's creativity is bounded by deterministic logic — where the machine can be fluent and helpful within strictly enforced guardrails — or they're deploying eloquent, unsupervised agents with the legal authority to rewrite corporate policy one hallucination at a time.
I know which side of that line I want to be on. I know which side the law is going to demand.
Your chatbot is a legally binding employee. It needs the same training, the same oversight, and the same strict boundaries as a human employee handling corporate funds. You wouldn't let a new hire invent refund policies based on vibes. Don't let your AI do it either.
The black box defense is dead. The wrapper era is ending. And the companies that figure out deterministic action layers first won't just avoid liability — they'll be the ones that actually scale AI into the parts of their business where it matters most, because they'll be the ones whose systems can be trusted.
The question isn't whether your AI is smart enough. It's whether it knows when to shut up.


