
The AI Tutor That Taught a Kid 2+2=5 — And What It Reveals About Every AI Product You're Using
A few months ago, a parent sent me a screenshot that stopped me cold.
Her daughter — a seventh grader — had been using one of the most popular AI tutoring platforms to study for a math test. The kid was working through a multiplication problem: 3,750 times 7. She typed in 21,690. The correct answer is 26,250. She wasn't even close.
The AI tutor responded: "Great job multiplying! You solved the problem and showed great thinking!"
I stared at that screenshot for a long time. Not because the error surprised me — I'd been studying LLM failure modes for years. What hit me was the enthusiasm. The AI didn't just get it wrong. It celebrated the wrong answer. It reinforced a misconception with the warmth and confidence of a beloved teacher. And somewhere, a twelve-year-old girl walked into her exam believing she understood multiplication because a machine told her she did.
That screenshot crystallized something I'd been circling for a while: the most dangerous AI systems aren't the ones that refuse to answer. They're the ones that answer confidently and incorrectly. And right now, that description fits nearly every AI product built on top of large language models.
I'm Ashutosh, and I run Veriprajna. We build neuro-symbolic AI systems — architectures that fuse the linguistic fluency of neural networks with the logical rigor of symbolic solvers. I'm writing this because I think the industry is making a catastrophic bet on the wrong architecture, and the people who'll pay the price are students, patients, borrowers, and anyone else who trusts an AI to get the facts right.
Why Does Your AI Sound So Smart but Get Math So Wrong?
Here's something most people don't realize about large language models like GPT-4 or Claude: they don't know anything. Not in the way a database knows that your birthday is March 15th, or a calculator knows that 17 times 24 is 408.
An LLM is a prediction engine. When you ask it a question, it doesn't retrieve a fact or perform a calculation. It predicts the most statistically likely sequence of words that should follow your prompt, based on patterns it absorbed from billions of pages of internet text. It's performing what researchers call "next-token prediction" — choosing the next word (or fragment of a word) based on probability distributions learned during training.
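To make "next-token prediction" concrete, here's a toy sketch — a four-word vocabulary and made-up scores standing in for a real model's learned weights. The point is only that the output is sampled from a probability distribution, never looked up or computed:

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary and invented scores for the prompt "2 + 2 =".
# A real model has tens of thousands of tokens and learned weights;
# this only illustrates that the answer is SAMPLED, not retrieved.
vocab = ["4", "5", "22", "four"]
logits = [4.0, 1.5, 0.2, 2.0]
probs = softmax(logits)

random.seed(0)
next_token = random.choices(vocab, weights=probs, k=1)[0]
print(dict(zip(vocab, [round(p, 3) for p in probs])))
print("sampled next token:", next_token)
```

Even though "4" carries over 80% of the probability mass, this particular draw happens to sample "5" — which is exactly the failure mode: a low-probability wrong answer is always one unlucky sample away.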
This is why LLMs can write poetry that makes you cry and then tell you that 2+2=5 if you nudge the context window the right way. The poetry works because language is patterns. The math fails because arithmetic is not a pattern — it's a formal system with exact rules that don't bend to statistical likelihood.
An LLM doesn't distinguish between a fact that appeared a million times in its training data and one that appeared once. It treats rare facts as statistical noise — which means the more obscure the information you need, the more likely the AI is to make something up.
I think of it this way: imagine you had a colleague who'd read every book ever written but had never learned to use a calculator. You'd trust them to summarize a novel or draft a persuasive email. You would never trust them to do your taxes. Yet that's exactly what we're doing when we deploy raw LLMs into education, finance, and healthcare.
The Night I Realized Prompt Engineering Was a Dead End
There was a period — I'm almost embarrassed to admit this now — when I thought we could fix this with better prompts.
My team and I spent weeks crafting elaborate chain-of-thought instructions. "Think step by step." "Show your work." "Double-check your arithmetic before responding." We tested dozens of variations across math problems, compliance scenarios, logical reasoning tasks. Some of the prompt chains were hundreds of tokens long, essentially begging the model to be careful.
It helped. A little. Chain-of-thought prompting improved accuracy on complex reasoning tasks from abysmal to merely unreliable. But here's what kept happening: the model would lay out a beautiful chain of logic — step one correct, step two correct, step three correct — and then make a simple arithmetic error in step four that cascaded through the rest of the reasoning chain, producing a final answer that was confidently, elegantly wrong.
One night, I was reviewing test results at my desk. We'd run a battery of 500 compound interest calculations through a chain-of-thought prompted GPT-4 setup. The accuracy rate was around 87%. My co-founder looked at the results and said, "87% is pretty good."
I pulled up a spreadsheet. "Would you use a spreadsheet that fabricated numbers 13% of the time?"
Silence.
That was the moment the architecture shifted in my head. The problem wasn't the prompt. The problem was that we were asking a prediction engine to be a logic engine. We were whispering to dice and hoping they'd land on the right number. No amount of prompt engineering would change the fundamental stochastic nature of the system.
We needed a brain.
What Is Neuro-Symbolic AI, and Why Should You Care?

The history of artificial intelligence is a story of two tribes that spent decades refusing to talk to each other.
The Symbolists — dominant from the 1950s through the 1980s — believed intelligence was about manipulating explicit rules and logic. If you could encode enough knowledge as formal statements (Socrates is a man; all men are mortal; therefore Socrates is mortal), you could build a thinking machine. Their systems were precise, transparent, and provably correct. They were also brittle — they shattered the moment they encountered messy, real-world language or situations their rules didn't cover.
The Connectionists — the neural network crowd — took the opposite approach. Don't write rules; let the machine learn patterns from data. Their systems could handle ambiguity, noise, and natural language beautifully. But they were black boxes. You couldn't explain why they produced a particular answer, and they had no concept of truth — only statistical likelihood.
Daniel Kahneman, the Nobel laureate, described human cognition as two systems: System 1 is fast, intuitive, pattern-based — you recognize a friend's face in a crowd. System 2 is slow, deliberate, logical — you multiply 17 times 24 on paper. Current LLMs are extraordinary System 1 engines being asked to do System 2 work. That's the mismatch.
Neuro-symbolic AI is the fusion. You keep the neural network as the "Voice" — it handles language, understands intent, generates fluid responses. But you add a symbolic "Brain" — deterministic solvers, logic engines, formal verification systems — that handles everything requiring precision. The Voice talks to the user. The Brain does the math. And a bridge connects them.
In a neuro-symbolic system, 2+2 will always equal 4 — not because the model predicts it should, but because it's defined as an axiom in the symbolic layer. The neural network literally cannot override it.
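A minimal sketch of that bridge, with hypothetical names (this is not Veriprajna's API): anything that parses as pure arithmetic is routed to a deterministic evaluator — the "Brain" — and everything else would fall through to the language model:

```python
import ast
import operator

# Map AST operator types to exact arithmetic functions.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg}

def evaluate(node):
    """Recursively evaluate a parsed arithmetic expression — no prediction involved."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](evaluate(node.operand))
    raise ValueError("not pure arithmetic")

def answer(query):
    """Hypothetical router: Brain for arithmetic, Voice for everything else."""
    try:
        tree = ast.parse(query, mode="eval")
        return evaluate(tree.body)          # symbolic layer: exact, rule-bound
    except (SyntaxError, ValueError):
        return "(route to language model)"  # neural layer: fluent, probabilistic

print(answer("2 + 2"))      # evaluated, not predicted
print(answer("3750 * 7"))
```

In this sketch, "2 + 2" returns 4 because the evaluator computes it — there is no probability distribution for the neural side to override.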
This isn't theoretical. This is what we build at Veriprajna, and I've laid out the full architectural blueprint in the interactive version of our research paper.
How Do You Make a Language Model Do Math It Can't Do?

The key mechanism is something called Program-Aided Language Models, or PAL. And the elegance of it still delights me.
Instead of asking the LLM to solve a problem, you ask it to write a program that solves the problem.
Here's what that looks like in practice. A user asks: "If I have a loan of $50,000 at 5% interest compounded annually, how much do I owe after 3 years?"
In a standard LLM setup, the model attempts to calculate $50,000 × (1.05)³ in its head — using token prediction. Sometimes it gets it right. Sometimes it doesn't. You have no way of knowing which answer you can trust.
In our system, the LLM doesn't calculate anything. It generates a few lines of Python code: principal = 50000, rate = 0.05, years = 3, print(principal * (1 + rate) ** years). That code is executed by a deterministic runtime — a real computer doing real math. The CPU's arithmetic logic unit returns 57,881.25. The LLM then wraps that verified number in a natural language response: "After 3 years, you would owe $57,881.25."
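The execution side of that flow can be sketched in a few lines. Here the generated snippet is hard-coded so the example runs on its own; in a real PAL pipeline it would come from the language model, and a production system would sandbox the execution far more aggressively than this:

```python
# Stand-in for model-generated code (in production, the LLM writes this).
generated_code = """
principal = 50000
rate = 0.05
years = 3
result = principal * (1 + rate) ** years
"""

def run_generated(code):
    """Execute generated code in an isolated namespace and read back 'result'.
    The arithmetic is done deterministically by the runtime, not predicted."""
    namespace = {}
    exec(code, {"__builtins__": {}}, namespace)
    return namespace["result"]

amount = run_generated(generated_code)
print(f"After 3 years, you would owe ${amount:,.2f}")  # $57,881.25
```

The division of labor is visible in the code: the model's only job is producing the four assignment lines, and the runtime's only job is evaluating them exactly.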
The neural network did what it's good at: understanding the question and generating code. The symbolic engine did what it's good at: computing the answer with perfect accuracy. Neither could do the other's job. Together, they're formidable.
We tested this against standard chain-of-thought prompting on complex arithmetic tasks. Standard LLMs scored below 40% accuracy on multi-step calculations. Chain-of-thought improved that to moderate but error-prone results. Our PAL-based neuro-symbolic approach achieved near-perfect accuracy — limited only by whether the generated code logic was correct, which is a much easier problem to verify and debug than probabilistic token prediction.
The Argument That Almost Split My Team
I need to tell you about a fight we had internally, because it shaped how we think about this architecture.
When we first started integrating symbolic solvers, one of my engineers — a brilliant guy, steeped in the deep learning world — pushed back hard. His argument: "The models are getting better every six months. GPT-5 will fix the math problems. GPT-6 will fix the reasoning problems. You're building scaffolding for a building that's going to grow its own skeleton."

He wasn't wrong about the trend. Models are improving. But I kept coming back to a structural argument that I couldn't shake.
The improvement in LLMs is asymptotic for deterministic tasks. Making a prediction engine 10x bigger doesn't make it deterministic — it makes it a bigger prediction engine. A model that gets compound interest right 95% of the time instead of 87% of the time is still a model you can't trust for financial calculations. The gap between 95% and 100% isn't a gap you close with scale. It's a gap that requires a different kind of system.
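The gap compounds, too. Assuming (as an illustration) that each step in a workflow succeeds independently 95% of the time, the chance that a multi-step chain completes with no error at all decays fast:

```python
# Illustrative only: assumes each step is independent with a 95% success rate.
for steps in [1, 5, 10, 20]:
    p_clean = 0.95 ** steps
    print(f"{steps:>2} steps at 95% each -> {p_clean:.1%} chance of a fully correct result")
```

At twenty chained steps, a "95% accurate" component delivers a fully correct result barely a third of the time — which is why per-step accuracy short of 100% is a different product, not a slightly worse one.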
We argued about this for two days. Whiteboards covered in diagrams. Competing benchmarks. At one point someone said, "Just use GPT and add a disclaimer." I think I visibly flinched.
What settled it was a simple test. We took 100 compliance scenarios from a banking client — loan eligibility checks with hard regulatory thresholds. We ran them through a state-of-the-art LLM with careful prompting. It approved three loans that violated debt-to-income ratio requirements because the applicants had written compelling personal statements. The model was persuaded by the narrative. It was doing what it was designed to do — pattern-match on language — and in doing so, it broke the law.
A chatbot that lies 5% of the time is not 95% useful. For critical tasks, it is 100% unusable.
My engineer came around. Not because the symbolic approach was sexier — it isn't — but because the failure mode of the alternative was unacceptable.
Why Are "AI Wrapper" Companies in Trouble?
Let me step back and talk about the business landscape, because the technical architecture has massive economic implications.
Right now, the AI startup ecosystem is dominated by what I call "wrapper" companies — businesses whose core product is a user interface and some prompt logic sitting on top of a third-party foundation model. They're reselling access to capabilities they don't own.
The problem is structural. Every time OpenAI or Anthropic releases a new model version, they absorb the features that wrappers provide. The startup selling "AI for PDF summarization" gets wiped out when the foundation model adds native file upload. The company offering "AI for code generation" watches its value proposition evaporate as the base models improve at coding. Your competitive moat is being drained by your own supplier.
Enterprise clients are catching on. I've sat in meetings where CTOs have said, point blank: "Why would I pay you to wrap an API I can call myself?" And they're right to ask. Routing sensitive financial records or proprietary code through a startup's servers, which then route them to a public model provider, creates an unacceptable attack surface. The "Sovereign AI" movement — enterprises demanding to own their models and run them within their own infrastructure — is accelerating.
This is why we rejected the wrapper model from day one. We don't sell access to tokens. We sell System 2 architectures — proprietary symbolic reasoning engines, domain-specific knowledge graphs, deterministic compliance layers. When the underlying language model gets commoditized (and it will), our value doesn't diminish. It increases, because the logic layer becomes the only differentiator that matters.
What Happens When You Give an AI Tutor a Real Brain?
Let me bring this back to education, because that's where the stakes feel most personal to me.
The promise of AI tutoring is extraordinary: personalized, one-on-one instruction for every student, at scale. Bloom's famous "2 Sigma Problem" showed that students who receive individual tutoring perform two standard deviations better than students in conventional classrooms. If AI could deliver even a fraction of that benefit, it would transform education.
But the current generation of AI tutors is failing in ways that are worse than no tutor at all. Beyond the multiplication disaster I described earlier, there are documented cases where students arrive at the correct answer, but the AI — hallucinating an incorrect solution path — tries to convince them they're wrong. The model gaslights the student into abandoning correct reasoning. In an educational context, where trust is everything, this is devastating.
Our approach is fundamentally different. We built what we call a Pedagogical Accuracy Engine — and it works on three levels.
First, the symbolic layer maintains a model of each student's knowledge state using Bayesian Knowledge Tracing. It's not guessing whether the student understands algebra; it's tracking a probability vector updated with every interaction. When the student struggles with geometry, the system knows — mathematically, not intuitively — and adjusts its scaffolding accordingly.
Second, when the AI generates practice problems, it doesn't just make up numbers. The PAL engine ensures that every generated problem produces clean, solvable answers. No more "calculate 7,349 divided by 13.7" when the student is learning basic division. The symbolic layer guarantees pedagogically appropriate difficulty.
Third — and this is the one I'm proudest of — we anchor the AI to the specific curriculum. Using property graph indexing, we parse the actual textbook into a knowledge graph where concepts are nodes and relationships are edges. If the textbook defines "prime number" in a specific way, the AI uses that definition, not whatever Wikipedia-derived approximation lives in the LLM's training data. For the full technical breakdown of how these layers interact, see our research paper.
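The knowledge-tracing step in the first of those layers has a standard form. Here is a minimal Bayesian Knowledge Tracing update — the parameter values (slip, guess, learn rates) are illustrative, not our production settings:

```python
def bkt_update(p_know, correct, slip=0.1, guess=0.2, learn=0.15):
    """One Bayesian Knowledge Tracing step: update P(student knows the skill)
    after observing a single answer. Parameters here are illustrative."""
    if correct:
        evidence = p_know * (1 - slip) + (1 - p_know) * guess
        posterior = p_know * (1 - slip) / evidence
    else:
        evidence = p_know * slip + (1 - p_know) * (1 - guess)
        posterior = p_know * slip / evidence
    # Account for the chance the student learned the skill on this step.
    return posterior + (1 - posterior) * learn

p = 0.3  # prior belief that the student knows the skill
for outcome in [True, True, False, True]:
    p = bkt_update(p, outcome)
    print(f"answered {'right' if outcome else 'wrong':<5} -> P(knows) = {p:.3f}")
```

Note what happens on the wrong answer: the estimate drops sharply but doesn't reset to zero, because a known-skill student can still slip. That's the "mathematically, not intuitively" tracking in practice.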
The Compliance Problem No One Wants to Talk About

Education is one domain. Finance is another — and in some ways, the failure modes are even more alarming.
A regional bank came to us after their previous AI vendor's system had approved loans that violated regulatory lending criteria. The issue was subtle and, once you understand the architecture, completely predictable: the LLM was processing applicants' personal statements alongside their financial data. When an applicant wrote a compelling story about overcoming hardship, the model's pattern-matching — trained on millions of examples of persuasive narratives leading to positive outcomes — weighted the narrative over the hard debt-to-income thresholds.
The model wasn't malfunctioning. It was doing exactly what it was designed to do: predict the most likely next token in a sequence that looked like a loan approval conversation. The problem was that loan approval isn't a conversation. It's a rule-based decision with legal boundaries.
We implemented a PyReason layer — a neuro-symbolic framework that supports logical reasoning over knowledge graphs. The rules are explicit: IF applicant age is under 21 AND state is New York, THEN loan type cannot be Commercial. Before the LLM generates any response to a loan applicant, the context passes through the symbolic engine. If the proposed output violates a hard rule, the symbolic engine vetoes it. Period.
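The shape of that veto check can be sketched in plain Python — this is the structure of the gate, not the actual PyReason API, and the debt-to-income ceiling is an illustrative number:

```python
from dataclasses import dataclass

@dataclass
class Application:
    age: int
    state: str
    loan_type: str
    debt_to_income: float

def rule_ny_minor_commercial(app):
    """IF age < 21 AND state is New York, THEN loan type cannot be Commercial."""
    return not (app.age < 21 and app.state == "NY" and app.loan_type == "Commercial")

def rule_dti_ceiling(app, ceiling=0.43):  # illustrative threshold, not a real regulation
    return app.debt_to_income <= ceiling

HARD_RULES = [rule_ny_minor_commercial, rule_dti_ceiling]

def vetoed(app):
    """Names of any violated rules; an empty list means the Voice may respond."""
    return [rule.__name__ for rule in HARD_RULES if not rule(app)]

app = Application(age=19, state="NY", loan_type="Commercial", debt_to_income=0.38)
print(vetoed(app))  # the symbolic layer blocks this one outright
```

Nothing in this path consults the applicant's personal statement — the narrative the LLM found persuasive never reaches the rule functions, which is precisely why it can't sway them.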
The result: 100% adherence to regulatory lending criteria, combined with personalized, empathetic communication to applicants. The Voice remains warm. The Brain remains inflexible. That's the point.
We don't build AI that's probably compliant. We build AI that is physically incapable of approving a non-compliant transaction, regardless of how persuasive the input.
"Won't Bigger Models Just Fix This?"
People ask me this constantly, and I understand why. The trajectory of LLM capability is genuinely impressive. Every new release handles more edge cases, scores higher on benchmarks, makes fewer obvious errors.
But here's what I keep coming back to: the improvement curve for deterministic tasks has a ceiling that's built into the architecture. A prediction engine, no matter how large, generates outputs probabilistically. Making it bigger makes the probability distribution tighter — but it never becomes a guarantee. And for the domains that matter most — a child's education, a patient's diagnosis, a borrower's legal rights — "probably correct" isn't a product category.
There's also a practical argument. Even if GPT-7 achieves 99.9% accuracy on arithmetic (which would be remarkable), that still means one error per thousand calculations. A bank processing ten thousand loan applications a day would generate ten incorrect calculations daily. Each one is a potential regulatory violation. Each one is a lawsuit waiting to happen. The symbolic layer doesn't reduce the error rate to 99.9%. It reduces it to zero for any operation routed through the solver.
The other objection I hear: "Isn't this just adding complexity?" Yes. It is. A neuro-symbolic system is harder to build than a wrapper. It requires understanding both paradigms — the statistical and the logical — and engineering the bridge between them. But the complexity lives in the architecture so it doesn't have to live in the failure mode. I'd rather build a complex system that works than a simple system that fails unpredictably.
The Bridge Between Two Kinds of Intelligence
I want to leave you with an image that's been stuck in my head since we started this work.
Think about how you actually think. When a friend asks you to recommend a restaurant, you use intuition — pattern matching on past experiences, vibes, associations. System 1. Fast and fluid. But when your accountant asks you to verify a tax calculation, you pull out a calculator. System 2. Slow and certain. You don't try to intuit whether the numbers add up. You check.
Nearly every AI system deployed in the world today is operating on System 1 alone. It's as if we built a civilization of brilliant conversationalists who can't use calculators, and then put them in charge of the banks, the hospitals, and the schools.
The fix isn't to throw away the conversationalists. They're extraordinary at what they do. The fix is to hand them a calculator — and make sure they use it.
That's what neuro-symbolic AI is. Not a replacement for large language models. A completion of them. The Voice and the Brain, working together, with a bridge that knows when to talk and when to compute.
We're building that bridge. And I believe it's the only architecture that deserves to be trusted with the things that matter.


