[Image: A probabilistic text output contrasted with a structured knowledge graph, illustrating the article's core thesis that justice requires deterministic graph reasoning, not language model guessing.]

Artificial Intelligence · Insurance · Machine Learning

The AI That Decides Who Caused Your Car Crash Is Probably Wrong — Here's Why I'm Building a Better One

Ashutosh Singhal · February 19, 2026 · 15 min read

A few months ago, I watched a demo that made my stomach turn.

A well-funded insurtech startup was showing off their new claims automation tool. They fed a police report into GPT-4, asked it to determine fault in a two-car intersection collision, and out came a beautifully written paragraph assigning 60/40 liability. The founder beamed. The investors nodded. The narrative was clean, confident, and — I was almost certain — wrong.

I asked a simple question: "Run it again."

Same report. Same prompt. This time: 70/30. The model had shifted ten percentage points of someone's financial liability between two runs because it's a probabilistic text generator, not a judge. The room got quiet. Someone muttered something about temperature settings.

That moment crystallized everything my team at Veriprajna had been building toward. We'd spent months studying how LLMs handle legal reasoning, and the results were worse than I expected. Stanford researchers have documented hallucination rates between 69% and 88% when state-of-the-art models respond to specific legal queries. These aren't edge cases. This is the baseline. And the insurance industry is rushing to deploy these systems to decide who pays when your car gets hit.

I'm going to tell you why that's dangerous, and what we're building instead.

The Night the Verbose Driver Won

Before I get into architecture and logic engines, let me tell you about an experiment that radicalized my thinking.

We set up a simple test. Two narratives describing the same intersection collision, written from each driver's perspective. Driver A had clearly run a stop sign — the police report confirmed it, the witness confirmed it, the damage pattern confirmed it. Open and shut.

But we gave Driver A a 500-word narrative. Vivid details about the weather, the glare, the "aggressive acceleration" of the other car. Sophisticated vocabulary. Emotional texture.

Driver B got 50 words: "I stopped at the intersection. I checked for cross traffic. I proceeded. Driver A struck my passenger side."

We fed both accounts to three major LLMs and asked each to assess liability.

Two out of three gave Driver A — the one who ran the stop sign — a more favorable liability split. Not because the facts supported it, but because Driver A told a better story.

I remember sitting in our office past midnight staring at those results. My co-founder walked over, looked at the screen, and said: "So we're building justice for the articulate." That phrase stuck. It's exactly what these systems do.

Researchers call this verbosity bias — the documented tendency of LLMs to award higher confidence scores to longer, more detailed responses, even when the factual content is equivalent or inferior to concise alternatives. The model conflates token density with evidence density. It mistakes eloquence for truth.

When an AI system penalizes brevity and rewards rhetorical flourish, it structurally discriminates against anyone who is less educated, less articulate, or simply more honest.

Think about who gets hurt by this. The elderly driver who gives a straightforward account. The non-native English speaker. The person who just tells the truth without embellishment. These are the people an automated liability system should protect, and instead, it's systematically ruling against them.

Why Does Your AI Agree With Whatever You Tell It?

Verbosity bias wasn't the only failure mode we found. There's something arguably worse: sycophancy.

LLMs are trained through a process called Reinforcement Learning from Human Feedback — RLHF — which rewards "helpfulness" and "agreeableness." This is fine when you're asking for a recipe. It's catastrophic when you're asking for a legal judgment.

We tested this by framing the same police report with different leading prompts. "Analyze this report to determine if the claimant was speeding" versus "Analyze this report to determine if the claimant had the right of way." Same data. Different framing. The model reliably tilted its analysis toward whatever hypothesis the prompt implied.

One of my engineers called it "confirmation bias as a service," and I haven't been able to think of it any other way since.

In a real claims environment, an adjuster might unconsciously frame a query based on their initial read of the situation. The model picks up on that framing and amplifies it. Research shows this happens in two flavors: progressive sycophancy, where the model adjusts its reasoning to arrive at your desired conclusion, and regressive sycophancy, where it abandons correct information to agree with an incorrect challenge. Either way, you don't get an impartial arbiter. You get an echo chamber.

What Happens When AI Reads the Law Wrong?

I need to tell you about the statute problem, because it's the one that keeps me up at night.

LLMs don't "know" traffic law. They've ingested text that includes traffic law, and they predict sequences of tokens that look like legal reasoning. The distinction matters enormously.

We found a case where a model cited a "first-to-arrive" right-of-way rule — common at four-way stops — and applied it to a T-intersection, where through-traffic has absolute right of way. The model didn't flag the mismatch. It just generated a confident, well-structured paragraph applying the wrong law to the wrong situation.

An AI that invents a statute and applies it with confidence isn't making an error. It's manufacturing injustice at scale.

This is what researchers call legal hallucination, and it takes two forms. Factual hallucination: the model infers details not present in the source text to create a coherent narrative. Reading "severe front-end damage," it might conclude the vehicle was speeding, despite no skid mark measurements or telemetry. And legal hallucination proper: the model misinterprets, misapplies, or outright invents traffic codes and case law.

An insurance decision based on a hallucinated version of California Vehicle Code § 21802 exposes the carrier to bad-faith litigation and regulatory penalties. And the insured — the actual human being — gets a wrong verdict delivered with the authority of "AI."

I wrote about these failure modes in depth in the interactive version of our research, if you want to see the full evidence base. But the short version is: LLMs are linguistically brilliant and logically broken, and we're asking them to do logic.

The Argument That Changed Our Architecture

There was a specific argument inside our team that shaped everything we built afterward.

We were debating whether to build a better RAG pipeline — retrieve relevant statutes, feed them to the LLM, constrain its output. The "make the LLM smarter" approach. Half the team was convinced this was the pragmatic path. Ship faster, iterate, improve retrieval quality over time.

I was on the other side, and I was losing the argument until our legal advisor asked a question that silenced the room: "If two witnesses disagree about whether the light was red or green, what does your system do?"

The RAG team paused. An LLM with retrieved context would do what LLMs always do — pick the narrative that feels more coherent, probably the longer one, and generate a resolution. It would hallucinate a consensus.

"It should hold the conflict," I said. "It should say: this is a disputed fact, and I cannot resolve it without additional evidence."

That's not something a language model does. Language models resolve. They complete. They generate the next plausible token. Holding an unresolved contradiction and flagging it as a gap — that requires a fundamentally different kind of system.

That's the day we committed to knowledge graphs.

How Do You Turn a Police Report Into a Graph?

[Diagram: The KGER pipeline, showing how unstructured police report text is transformed into knowledge graph nodes and edges through semantic extraction against a defined ontology.]

What we build at Veriprajna is called Knowledge Graph Event Reconstruction — KGER. The core idea is deceptively simple: stop asking AI to judge, and start asking it to reconstruct.

A police report is unstructured text. It contains entities — drivers, vehicles, roads, traffic signals, witnesses — and relationships between them. Vehicle A was traveling north on Main Street. Vehicle B ran the stop sign at 4th Avenue. The light was green. It was raining.

We use the LLM as a semantic extractor — a very sophisticated clerk. Its job is to read the unstructured text and pull out entities and relationships, mapping them to a strict ontology we've defined. Our ontology covers over 110 entity and relationship types: agents, objects, infrastructure, events, conditions, measurements.

The LLM doesn't decide who's at fault. It catalogs actors and actions. And because its output is constrained to a predefined schema, we can validate everything it produces. If it extracts a "stop sign" where our map database shows no stop sign exists, the system flags a conflict instead of silently accepting the hallucination.

Once extracted, these entities become nodes in a knowledge graph. The relationships become edges. Vehicle_A → TRAVELING_ON → Main_Street. Vehicle_B → VIOLATED → Stop_Sign_1. Witness_A → OBSERVED → Light_State_Green.

The subjective narrative is now an objective topology. And once you have a topology, fault becomes a question of graph traversal and pattern matching — not sentiment analysis.
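To make the extraction-and-validation step concrete, here is a minimal sketch. The ontology slice, field names, and `validate` logic are illustrative stand-ins (Veriprajna's actual 110-type ontology and pipeline are not public); only the pattern matters: the LLM's output is constrained to a schema, cross-checked against ground truth, and conflicts are flagged rather than silently accepted.

```python
# Hypothetical sketch of the KGER extraction step. Entity and relation
# names are illustrative, not the production ontology.
from dataclasses import dataclass

# A tiny slice of an ontology: allowed node types and edge types.
NODE_TYPES = {"Vehicle", "Road", "TrafficControl", "Witness", "Observation"}
EDGE_TYPES = {"TRAVELING_ON", "VIOLATED", "OBSERVED"}

@dataclass(frozen=True)
class Triple:
    subject: str
    subject_type: str
    relation: str
    obj: str
    obj_type: str

def validate(triple, map_database):
    """Return a list of conflicts instead of silently accepting output."""
    conflicts = []
    if triple.subject_type not in NODE_TYPES or triple.obj_type not in NODE_TYPES:
        conflicts.append(f"unknown entity type in {triple}")
    if triple.relation not in EDGE_TYPES:
        conflicts.append(f"unknown relation {triple.relation}")
    # Cross-check extracted infrastructure against ground truth: a stop
    # sign the map database says does not exist is flagged, not accepted.
    if triple.obj_type == "TrafficControl" and triple.obj not in map_database:
        conflicts.append(f"{triple.obj} not found in map database")
    return conflicts

# Edges the LLM extracted from a police report.
extracted = [
    Triple("Vehicle_A", "Vehicle", "TRAVELING_ON", "Main_Street", "Road"),
    Triple("Vehicle_B", "Vehicle", "VIOLATED", "Stop_Sign_1", "TrafficControl"),
]
map_db = {"Stop_Sign_1"}  # infrastructure known to exist at this intersection

graph = {}
for t in extracted:
    issues = validate(t, map_db)
    if issues:
        print("FLAGGED:", issues)  # held as a conflict, never resolved silently
    else:
        graph.setdefault(t.subject, []).append((t.relation, t.obj))
```

The point of the sketch is the asymmetry: valid triples become graph edges, invalid ones become explicit conflicts, and there is no path by which a hallucinated entity enters the topology unnoticed.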

Can You Turn Traffic Law Into Code?

This is the part that gets me genuinely excited, and it's the part most people think is impossible.

Traffic laws are written in natural language, full of vague terms like "immediate hazard" and "safe distance." Courts interpret them through precedent and judgment. How do you make that executable?

The answer is Defeasible Deontic Logic — DDL. Deontic logic deals with obligations, prohibitions, and permissions. "Defeasible" means it handles exceptions. This is exactly what traffic law is: a set of norms with structured exceptions.

Take California Vehicle Code § 21802, the stop sign rule. In natural language: "The driver of any vehicle approaching a stop sign shall stop... The driver shall then yield the right-of-way to any vehicles which have approached from another highway."

In our system, this becomes executable logic:

Rule 1 — Obligation to Stop: If a vehicle is approaching an intersection with a stop sign, the driver is obligated to bring speed to zero at the limit line. If speed is greater than zero at intersection entry, that's a violation.

Rule 2 — Obligation to Yield: If the driver has stopped but another vehicle is in or approaching the intersection, the driver must wait. If they enter while the other vehicle is present and a collision occurs, that's a failure-to-yield violation.

Rule 3 — Exception: If a police officer is directing traffic, the officer's direction overrides the sign. The exception formally defeats the primary rule.

Now here's where it gets powerful. We map the physical graph — the reconstruction of each vehicle's speed and position over time — against this logical template. If the graph shows Vehicle A entered the intersection while Vehicle B was present, the logic engine triggers a yield violation. That's a computed fact, not an opinion.

We don't ask the AI "Was it hazardous?" We calculate the hazard based on physics and apply the law based on logic. The ambiguity disappears.
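The three rules above can be sketched as ordinary code. This is a simplified stand-in for a defeasible logic engine (the field names, and the reduction of DDL to plain conditionals, are my assumptions for illustration), but it shows the key structural property: the exception formally defeats the primary rules, and identical inputs always produce identical violations.

```python
# A minimal sketch of defeasible rule evaluation for the stop-sign
# scenario. Field names are illustrative, not the production engine.

def check_21802(vehicle, intersection):
    """Apply the stop and yield obligations; the officer exception
    defeats both primary rules."""
    violations = []
    # Exception (defeater): an officer's direction overrides the sign.
    if intersection.get("officer_directing"):
        return violations
    # Rule 1: obligation to bring speed to zero at the limit line.
    if vehicle["speed_at_entry"] > 0.0:
        violations.append("21802(a): failed to stop at limit line")
    # Rule 2: obligation to yield to a vehicle in or approaching
    # the intersection.
    if vehicle["entered_while_occupied"] and intersection["collision"]:
        violations.append("21802(b): failed to yield right-of-way")
    return violations

vehicle_a = {"speed_at_entry": 4.2, "entered_while_occupied": True}
result = check_21802(vehicle_a, {"officer_directing": False, "collision": True})
# Deterministic: the same graph facts trigger the same violations, every run.
```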

For vague terms like "immediate hazard," we ground them in physics. We define Immediate_Hazard as Time-to-Collision less than 3.0 seconds, or distance less than braking distance at current speed. The graph calculates TTC from speed and distance nodes. If TTC is below threshold, the hazard node activates, and the corresponding rule fires. No interpretation needed.

For the full technical breakdown of our formalization process and architecture, see our research paper.

The Counterfactual That Proves Causation

Fault isn't just about rule violation. It's about causation. A driver might have an expired license — that's a violation — but if they were rear-ended while stopped at a red light, the expired license didn't cause the accident.

This is where most AI systems fall apart. LLMs can't reason counterfactually. They can't ask: "Would this collision have occurred if Vehicle A had stopped at the sign?" They can only predict what sentence comes next in a crash narrative.

Our system builds what we call Causal Knowledge Graphs. To test causation, we create a counterfactual branch: we modify Vehicle A's speed to zero at the limit line and run the physics simulation forward through the temporal layer. If the collision node disappears in the counterfactual graph, the violation is the proximate cause.

This is the difference between "he was speeding and he crashed" (correlation) and "the speeding caused the crash" (causation). In a multi-vehicle pileup, this matters enormously. You can trace causal chains through the graph, measure what we call "fault centrality" — how central each actor's violations are to the collision event — and produce a mathematically grounded comparative fault split. Not 60/40 because the model felt like it. 80/20 because the topology proves it.
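The counterfactual test itself has a simple shape. The `simulate` function below is a toy stand-in for the physics engine run over the temporal layer, and the zero-speed intervention is the only part taken from the text; everything else is an assumption made to keep the sketch self-contained.

```python
# Hedged sketch of counterfactual causation testing. simulate() is a
# toy placeholder for a physics simulation over the temporal layer.

def simulate(speed_at_limit_line_mps, other_vehicle_in_intersection):
    """Toy forward simulation: does a collision node appear?"""
    entered_without_stopping = speed_at_limit_line_mps > 0.0
    return entered_without_stopping and other_vehicle_in_intersection

def is_proximate_cause(actual_speed_mps, other_present=True):
    """The violation is the proximate cause iff the collision node
    disappears in the branch where the driver actually stopped."""
    actual_collision = simulate(actual_speed_mps, other_present)
    # Intervention: set speed to zero at the limit line and re-run.
    counterfactual_collision = simulate(0.0, other_present)
    return actual_collision and not counterfactual_collision

print(is_proximate_cause(actual_speed_mps=6.0))
```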

Why Can't You Just Make LLMs More Accurate?

People ask me this constantly. "Fine-tune the model on traffic law. Use better prompts. Add guardrails." I understand the impulse. LLMs are easy to deploy, and the outputs look impressive.

But the problem isn't accuracy in the traditional sense. The problem is architectural. A probabilistic text generator will never be deterministic. Run it a hundred times on the same input, and you'll get variation. In a domain where the same facts must yield the same verdict every time — where a ten-point swing in liability means thousands of dollars changing hands — stochasticity isn't a bug to be patched. It's a fundamental disqualifier.

Our graph engine produces the exact same liability determination on the exact same input, every single time. That's not a nice-to-have. For regulatory compliance, for legal defensibility, for basic fairness — it's the minimum requirement.

The other objection I hear: "This sounds expensive and complex compared to an API call." It is more complex to build. But consider the cost of getting it wrong. Claims leakage — paying more than you should due to inaccurate liability — is a massive line item for insurers. A probabilistic system that suggests 50/50 because the narratives are messy, when deterministic logic reveals a clear 100/0 based on a specific right-of-way violation, costs real money on every single claim.

And then there's litigation. Try defending an AI liability decision in court when the system can't explain its reasoning, and running it again produces a different answer. The audit trail from a knowledge graph — "Vehicle A violated Rule 21802(a) at timestamp 12:01:30, and counterfactual simulation confirms this violation as proximate cause" — is a fundamentally different thing to put in front of a judge.

The Sandwich, Not the Black Box

[Diagram: The neuro-symbolic "sandwich" architecture: neural AI layers on the outside handling language, symbolic AI in the middle handling reasoning, with labels showing what each layer does and does not do.]

I want to be clear about something: I'm not anti-LLM. We use LLMs. They're extraordinary tools for processing unstructured language, and we'd be foolish to ignore that.

What I'm against is using them as judges.

Our architecture is what we call a "sandwich." Neural AI on the outside, symbolic AI in the middle. The first neural layer handles ingestion — OCR on police reports, speech-to-text on witness audio, entity extraction from messy unstructured data. The symbolic middle layer builds the graph, fuses data from multiple sources, runs the deontic logic engine, performs causal simulation. The final neural layer translates the structured liability report back into readable natural language, strictly grounded in graph facts.

The LLM never decides. It reads and it writes. The graph reasons.
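As a skeleton, the sandwich is three functions composed in order. Each function here is a placeholder for a real component (the extraction and narration layers would call actual models); only the shape of the pipeline, where the symbolic middle is the sole decision-maker, is the point.

```python
# Skeletal view of the neuro-symbolic sandwich. All three functions
# are illustrative placeholders, not real components.

def neural_ingest(report_text):
    """Layer 1 (neural): LLM as extractor only; returns typed triples."""
    return [("Vehicle_B", "VIOLATED", "Stop_Sign_1")]  # stubbed extraction

def symbolic_reason(triples):
    """Layer 2 (symbolic): graph plus logic engine makes the decision."""
    fault = {"Vehicle_B": 1.0} if ("Vehicle_B", "VIOLATED", "Stop_Sign_1") in triples else {}
    return {"fault": fault, "rule": "21802(a)"}

def neural_narrate(verdict):
    """Layer 3 (neural): render graph facts as prose, nothing more."""
    return f"Vehicle_B violated rule {verdict['rule']}; fault split: {verdict['fault']}"

report = neural_narrate(symbolic_reason(neural_ingest("...police report...")))
```

The design choice this shape enforces: the neural layers can only translate between language and structure, so any liability number in the final narrative must trace back to a computed fact in the middle layer.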

Asking an LLM to read a police report and judge liability is asking a poet to do physics. You'll get a beautiful answer, but it will likely be fiction.

This is what the industry is starting to call neuro-symbolic AI — the fusion of learning and logic. Kennedys IQ, a major legal technology firm, recently launched what they describe as the insurance industry's first neuro-symbolic AI solution, explicitly to eliminate the "black box" concern. The direction is clear. The question is how fast the rest of the industry follows.

Justice Is a Graph, Not a Probability

I think about that demo I watched — the one where liability shifted ten points between runs — more often than I'd like. Not because it was a bad product. The team was talented. The technology was impressive. But impressive isn't the same as right. And in the domain of fault and liability, "mostly right" is wrong.

Every time an AI system assigns fault based on who told a better story, or shifts its verdict because of a temperature setting, or cites a statute that doesn't exist — a real person absorbs that error. They pay a higher premium. They lose a dispute they should have won. They carry fault that belongs to someone else.

We can do better. Not by making language models smarter, but by recognizing what they are and what they aren't. They are brilliant at language. They are terrible at justice. Justice requires determinism — the same facts, the same verdict, every time. It requires auditability — show me exactly which evidence and which statute produced this conclusion. It requires the ability to hold an unresolved conflict and say "I don't know yet" instead of generating a confident fiction.

These aren't features you add to a language model. They're properties of a different kind of system entirely. A system where facts are immutable nodes, laws are executable logic, and fault is found not in the sentiment of a narrative but in the topology of what actually happened.

Justice is a graph. It's time we started building it that way.
