
GPT-4 Failed 99.4% of the Time — So We Stopped Letting It Make Decisions
It was almost midnight, and I was watching our agent book a flight to the wrong city for the third time in a row.
Not a different wrong city each time — the same wrong city. Delhi instead of Dehradun. The user had typed "Dehradun" clearly. The LLM had parsed it correctly in its chain-of-thought reasoning. And then, when it generated the API call, it swapped in the airport code for Delhi. Confidently. Silently. Three times.
My co-founder was on the call. He said, "It knows the right answer. Look at the reasoning trace. It literally says Dehradun. And then it does something else."
That was the night I stopped believing that better prompts would save us.
We'd been building an AI agent for travel booking — the kind that talks to Global Distribution Systems like Amadeus and Sabre, those ancient mainframe-era backends that power every airline reservation on the planet. And we'd been doing what everyone else was doing in 2023: wrapping GPT-4 in a thin orchestration layer, giving it tools, and praying.
The prayer wasn't working.
The Number That Changed Everything
A few weeks after that Dehradun incident, I came across the TravelPlanner benchmark — a rigorous academic evaluation that tests LLMs on multi-day itinerary planning with real constraints: budgets, transportation, dining, accommodation. The kind of thing a competent travel agent does in twenty minutes.
GPT-4's overall success rate: 0.6%.
Not 60%. Not 6%. Zero point six percent.
I read it three times. Then I pulled up the methodology to make sure they hadn't made a mistake. They hadn't. When you ask the most advanced language model in the world to plan a trip that respects a budget, connects flights to hotels to restaurants, and doesn't violate basic temporal logic — it fails 99.4% of the time.
When GPT-4 was asked to plan trips with real-world constraints, it succeeded 0.6% of the time. A neuro-symbolic agent solving the same problem scored 97%.
The system that scored 97% didn't use a smarter model. It used a fundamentally different architecture — one where the LLM translated the user's request into structured data, and then a deterministic solver did the actual planning. The LLM was the translator. The code was the brain.
That benchmark didn't just validate our frustration. It gave us a blueprint.
Why Does Your AI Agent Keep Failing?

Here's the thing nobody in the "AI agent" gold rush wants to talk about: LLMs don't reason. They predict.
When GPT-4 "decides" to call a search API, it's not executing logic. It's predicting the most statistically likely next token based on patterns in its training data. In a conversation, that prediction is usually good enough. In a ten-step API workflow where each step depends on the exact output of the previous one? It's a disaster.
I started calling this the Chain of Probability problem. Assume your LLM gets each step right 90% of the time — a generous estimate for complex tool use. Here's the math:
- 1 step: 90% success
- 5 steps: ~59% success
- 10 steps: ~34% success
A flight booking workflow — search, filter, select, price, collect passenger details, create PNR, validate, pay, ticket — routinely exceeds ten steps. At 34% theoretical success, you're not building software. You're building a slot machine.
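The compounding is easy to verify yourself. A few lines of Python (the function name is mine, not from any real codebase) reproduce the numbers above:

```python
def workflow_success(p_step: float, n_steps: int) -> float:
    """Probability that every step of an n-step workflow succeeds,
    assuming each step independently succeeds with probability p_step."""
    return p_step ** n_steps

for n in (1, 5, 10):
    print(n, round(workflow_success(0.90, n), 3))
# 1 -> 0.9
# 5 -> 0.59
# 10 -> 0.349
```

The exponent is the whole story: reliability decays geometrically with chain length, which is why adding steps hurts far more than intuition suggests.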
And 34% is the ceiling. Real-world performance is worse because of two phenomena we kept hitting in production.
The Hallucination Cascade
The first is what I call the Hallucination Cascade. In a chained architecture, the output of Step 2 becomes the input to Step 3. If the LLM makes a subtle error early — misreading a flight arrival time as 2:00 PM instead of 2:00 AM — that error doesn't get caught. It propagates. The agent books a hotel check-in for the wrong day based on the hallucinated time. The GDS API doesn't know the agent's intent, only its input, so it processes the request successfully. The agent sees a 200 OK response and reinforces its own mistake.
You end up with a "successful" execution trace that produces a catastrophic real-world outcome. The agent thinks it nailed it. The customer shows up at the airport and finds out otherwise.
The second phenomenon is Context Drift. As the agent works through a multi-step plan, the context window fills with intermediate data — search results, API responses, user messages. The model's attention mechanism spreads thinner and thinner across all those tokens. By Step 10, it has effectively "forgotten" the budget constraint it correctly identified in Step 2. The attention scores, governed by the softmax function, dilute across too many irrelevant tokens.
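You can see the dilution in a toy calculation. This is a deliberately simplified stand-in for real transformer attention — one "salient" token (the budget constraint) against a growing crowd of background tokens, with invented logit values:

```python
import math

def attention_weight(salient_logit: float, background_logit: float,
                     n_background: int) -> float:
    """Softmax weight on one salient token amid n uniform background tokens."""
    num = math.exp(salient_logit)
    den = num + n_background * math.exp(background_logit)
    return num / den

# The salient token scores 3.0; each background token scores 1.0.
for n in (10, 100, 1000):
    print(n, round(attention_weight(3.0, 1.0, n), 3))
# 10   -> 0.425
# 100  -> 0.069
# 1000 -> 0.007
```

The salient token's logit never changes — only the amount of other material in the window does. That is Context Drift in miniature: the constraint is still there, but the attention mass on it collapses.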
I watched this happen live during a demo for a potential partner. The agent found a hotel within budget in Step 3. By Step 8, when selecting a restaurant, it had completely lost track of the remaining budget. It recommended a place that would have blown the user's spending limit by 40%. The partner turned to me and said, "So it just... forgets?"
Yeah. It just forgets.
What Happens When AI Meets a Mainframe?
To really understand why we needed a different approach, you have to understand what Global Distribution Systems are like to work with.
Amadeus, Sabre, Travelport — these are the backbone of global air travel. They were designed in the mainframe era, and they behave like it. A flight booking isn't a single API call. It's a finite state machine with a precise sequence of operations that cannot be reordered, skipped, or approximated.
You authenticate and get a session token. That token must be passed in every subsequent header — if the LLM "forgets" it or hallucinates a new one, the entire transaction context is lost. Then you search for flights, and the GDS returns massive nested JSON payloads — often 50KB+ — containing fare basis codes, baggage models, segment references. The LLM needs to extract a specific offerId from that payload to proceed. But LLMs are lossy compressors. They summarize. They truncate. They "helpfully" normalize data formats that the GDS requires to be exact, down to the byte.
One night, we spent four hours debugging a booking failure. The LLM had "corrected" a fare basis code — changed a lowercase letter to uppercase, because that looked more "right" to a model trained on English text. The GDS rejected it with a cryptic error: ERR 1209 - SEQUENCE ERROR. No explanation. No suggestion. Just a wall.
LLMs are lossy compressors. When they transfer data between API calls, they "autocorrect" and "normalize" in ways that break the byte-level exactness enterprise systems require.
And when the GDS returns an error like UC (Unable to Confirm), the LLM has no idea what to do. It's trained to be helpful, so it interprets the error as a glitch and retries the exact same request. Again. And again. We watched agents burn through thousands of tokens and hit API rate limits, stuck in what we started calling the "Loop of Death" — repeatedly banging against a wall they couldn't understand.
The Night We Flipped the Architecture
The turning point came during an argument.
We were three months into the project. My engineering lead wanted to keep improving prompts — longer system messages, more examples, chain-of-thought instructions. "We're so close," he kept saying. "If we just structure the prompt better for the PNR creation step..."
I pulled up our logs. In the previous week, we'd had 47 failed booking attempts in our test environment. Eleven were the Loop of Death. Nine were hallucinated airport codes. Six were the LLM trying to commit a PNR before adding the mandatory "Received From" field — a sequence error that no amount of prompting seemed to fix, because the model had no inherent concept of temporal ordering beyond what it had absorbed from training data.
"We're not close," I said. "We're at the ceiling. The architecture is the problem."
That week, we rewrote everything. We stopped asking the LLM to orchestrate. We stopped letting it decide what step came next. We stopped feeding it raw GDS responses and hoping it would extract the right fields.
Instead, we built a graph.
For the full technical breakdown of what we built and why, I wrote a detailed research paper that goes deep on the architecture.
How Does Neuro-Symbolic AI Actually Work?

The core idea is deceptively simple: control flow is not a language task.
Deciding what to do next in a rigid business process should not be a matter of token prediction. It should be a matter of conditional logic. The decision to "ask for payment" should only fire if "flight is selected" AND "price is confirmed." That's a boolean condition, not a probabilistic suggestion.
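In code, that gate is trivial. The field names here are hypothetical, but the shape is the point — a pure boolean function over state, with no token prediction anywhere:

```python
def ready_for_payment(state: dict) -> bool:
    # Deterministic gate: both conditions must hold before the
    # payment step can fire. No prompt can override this.
    return (state.get("flight_selected") is True
            and state.get("price_confirmed") is True)

assert ready_for_payment({"flight_selected": True, "price_confirmed": True})
assert not ready_for_payment({"flight_selected": True, "price_confirmed": False})
```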
We split our system into two layers:
The LLM became the interface layer — the translator. It parses the user's natural language ("I want a morning flight to Dehradun, not too expensive") into structured data: {origin: "DEL", destination: "DED", date: "2024-03-15", time_preference: "morning", budget: "economy"}. That's what LLMs are genuinely great at: understanding messy human intent.
The graph became the execution layer — the manager. It receives that structured data and executes the business logic using deterministic code. Hard-coded nodes. Typed state schemas. Conditional edges that inspect variables, not vibes.
We used LangGraph to build this, because it gives you the primitives you need: a shared state schema (backed by a database, not a chat history), nodes that are just Python functions, and conditional edges that route based on actual variable values.
The LLM should be the worker — extracting data, summarizing text, formatting JSON — while the manager should be hard-coded software. This inversion of control is the defining characteristic of robust agentic systems.
In our architecture, the LLM literally cannot skip steps. It is structurally impossible for the system to attempt a booking before the selected_offer_id variable is populated in the state. Not because we told the LLM "don't do that" in a prompt, but because the graph edge won't fire. It's like trying to drive through a wall — the code simply doesn't allow it.
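Here's a minimal pure-Python sketch of that guard. In the real system, LangGraph's conditional edges play this role; the state fields and node names below are illustrative, not our production schema:

```python
from typing import Optional, TypedDict

class BookingState(TypedDict):
    search_criteria: Optional[dict]
    selected_offer_id: Optional[str]
    booking_confirmed: bool

def route_after_selection(state: BookingState) -> str:
    # Conditional edge: routing inspects a variable, not model output.
    # The transactor node is unreachable until selected_offer_id exists.
    if state["selected_offer_id"] is None:
        return "collector"   # loop back and ask the user
    return "transactor"      # safe to proceed to PNR creation

state: BookingState = {
    "search_criteria": {"destination": "DED"},
    "selected_offer_id": None,
    "booking_confirmed": False,
}
print(route_after_selection(state))  # -> collector
```

The booking step isn't forbidden by instruction; it's unreachable by construction. That distinction is the entire architecture.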
What Does the Actual System Look Like?

Let me walk you through what happens when someone says "Book me a flight from Mumbai to London next Tuesday."
First, a Collector node — powered by an LLM — parses that sentence into structured fields. It uses guided generation (JSON Mode) to output a specific schema. A Python validator checks if the airport codes are real. "London" is ambiguous — Heathrow or Gatwick? — so the graph routes to a disambiguation node. The LLM doesn't guess. It asks.
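A sketch of the validator's disambiguation logic. The lookup table here is a tiny invented stand-in for a real airport reference dataset:

```python
# Hypothetical lookup; a real system would query an airport reference database.
CITY_TO_AIRPORTS = {
    "london": ["LHR", "LGW", "STN", "LCY"],
    "dehradun": ["DED"],
}

def resolve_city(city: str) -> dict:
    codes = CITY_TO_AIRPORTS.get(city.lower(), [])
    if len(codes) == 1:
        return {"status": "ok", "code": codes[0]}
    if len(codes) > 1:
        # Ambiguous: the graph routes to a disambiguation node.
        # The LLM asks the user; it never guesses.
        return {"status": "ambiguous", "options": codes}
    return {"status": "unknown"}

print(resolve_city("London"))    # ambiguous -> ask the user
print(resolve_city("Dehradun"))  # ok -> DED
```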
Once we have validated search criteria, a Retriever node calls the Amadeus API. This is pure code. No LLM involved. The response comes back, gets cached in the state, and only then does a Summarizer node — an LLM — convert the top five results into a human-readable message. But it's strictly constrained: it can only display data present in the cached JSON. It cannot invent perks or change prices.
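Here's the shape of that deterministic extraction. The payload below is a tiny invented stand-in for a real 50KB GDS response — the structure and field names are illustrative only:

```python
import json

# Trimmed stand-in for a GDS flight-offers payload.
payload = json.loads("""{
  "data": [
    {"id": "OFR-7f3a", "price": {"total": "412.50", "currency": "GBP"},
     "fareBasis": "yIF8x"},
    {"id": "OFR-91bc", "price": {"total": "388.00", "currency": "GBP"},
     "fareBasis": "kLo2w"}
  ]
}""")

# Pure code does the extraction: the offer id and fare basis reach the
# next API call byte-for-byte, with no chance of an LLM "normalizing"
# the casing or truncating the payload.
offers = [{"offer_id": o["id"], "total": o["price"]["total"]}
          for o in payload["data"]]
print(offers[1]["offer_id"])  # -> OFR-91bc, exactly as the GDS returned it
```

Only this extracted slice — not the raw payload — is ever handed to the Summarizer LLM.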
The user picks an option. A Selector node resolves "the second one" to the specific offer_id hash. A Gatekeeper node checks business rules — is this within corporate policy? Is the carrier blacklisted? If there's a violation, the graph suspends. It persists its state to the database, sends an approval request to a manager, and waits. Hours later, when the manager clicks "Approve," the graph reloads the exact state and resumes at the booking node.
Finally, a Transactor node executes the PNR creation sequence — segments, passenger details, pricing, commit — in the exact order the GDS requires. If the GDS returns a price change warning (common in travel), the node halts and asks the user to confirm. It does not auto-book at the higher rate.
Every node transition is logged. Every decision is traceable. An auditor can read the execution log and understand exactly why the system booked a specific flight — not by interpreting a mess of tokens, but by reading a structured record: Node:Gatekeeper | Input: Price=1200 | Rule: Policy_Limit=1000 | Output: REJECT_NEED_APPROVAL.
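A sketch of a Gatekeeper that emits exactly that kind of record. The field names and the policy limit are illustrative:

```python
def gatekeeper(state: dict, policy_limit: float = 1000.0):
    """Check a price against corporate policy and emit an audit record."""
    price = state["price"]
    decision = "PASS" if price <= policy_limit else "REJECT_NEED_APPROVAL"
    # Structured audit line: the decision is reconstructable from variables,
    # not from interpreting a trace of tokens.
    log = (f"Node:Gatekeeper | Input: Price={price:.0f} | "
           f"Rule: Policy_Limit={policy_limit:.0f} | Output: {decision}")
    return decision, log

decision, log = gatekeeper({"price": 1200.0})
print(log)
# Node:Gatekeeper | Input: Price=1200 | Rule: Policy_Limit=1000 | Output: REJECT_NEED_APPROVAL
```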
I wrote about the full architecture, with interactive diagrams, in the online version of the whitepaper.
Isn't This Just... Regular Software Engineering?
People ask me this constantly. "So you're saying we should write code instead of using AI? Revolutionary."
No. I'm saying the AI industry has been so intoxicated by the magic of language models that it forgot the last sixty years of computer science. State machines, typed schemas, conditional branching, transactional integrity — these aren't outdated concepts. They're the reason your bank doesn't accidentally wire money to the wrong account.
The neuro-symbolic approach isn't anti-AI. It's pro-architecture. We use LLMs aggressively — for intent parsing, for disambiguation, for summarization, for handling the genuinely hard problem of understanding what a human means when they type something ambiguous. But we don't let the LLM touch the steering wheel when the car is on the highway.
You can build a chatbot that talks about doing work, or you can architect an agent that does the work. The difference is the graph.
There's also a cost argument that surprised me. Pure LLM agents are expensive — not because inference is costly per call, but because of the failure loops. When an agent gets stuck retrying a GDS error by hallucinating new parameters, it burns thousands of tokens before timing out. A single stuck session can cost $5-$10 in API credits. Our hard-coded error handlers catch those failures at zero token cost. And because we only send the LLM the five relevant fields from a 50KB GDS response instead of the whole thing, we cut context window usage by roughly 90%.
But Won't Models Get Good Enough Eventually?
Maybe. I genuinely don't know if GPT-6 or GPT-7 will be reliable enough to orchestrate ten-step API workflows without guardrails. But I know two things.
First, even if models improve dramatically, the Chain of Probability problem is mathematical, not technological. If your model is 99% reliable per step — an extraordinary achievement — a ten-step workflow still fails 10% of the time. For enterprise transactions, that's still unacceptable. The graph eliminates this entirely because the routing isn't probabilistic.
Second, waiting for models to get better is a luxury most enterprises don't have. They need agents that work now, that are auditable now, that comply with the EU AI Act's transparency requirements now. The neuro-symbolic approach doesn't bet on the future. It builds on proven engineering principles while using the best AI capabilities available today.
The Architecture Is the Product
I've been in enough rooms with investors and enterprise buyers to know that the AI industry is starting to wake up. The question is shifting from "Who has the smartest model?" to "Who has the most robust system?" The demos that dazzle in a conference talk — the ones where an agent flawlessly books a flight in a controlled environment — are cheap. What's expensive, and what matters, is building something that works on the ten-thousandth request as reliably as it did on the first.
We're entering an era where the differentiation won't be the model. It'll be the graph. The state schema. The error handlers. The conditional edges. The boring, rigorous, deterministic software engineering that wraps around the probabilistic magic and keeps it from burning down the house.
The magic was never in the prompt. It was always in the architecture.


