
I Watched a Drive-Thru AI Cut Off a Person Who Stutters. Then I Built Something Different.
There's a video that's been making the rounds on Reddit. A woman at a Wendy's drive-thru is trying to order a Baconator. She stutters — a block on the "b" — and the AI cuts her off mid-word, cheerfully suggesting a Frosty. She tries again. The system interprets her repetition as a new order. Three attempts later, she's shouting "AGENT" at a speaker box that doesn't care.
I've watched that video probably thirty times. Not because it's funny — it's not — but because every failure in that interaction maps precisely to an architectural decision somebody made in a conference room, probably while looking at a slide that said "86% success rate."
That remaining 14%? Those are real people. And I'd argue the architecture was never built for them in the first place.
This is the story of why my team at Veriprajna spent the better part of two years rejecting the fastest path to market in voice AI — and what we built instead.
What Does "Enterprise-Grade Voice AI" Actually Mean?
Most companies in our space do something remarkably simple: they connect a microphone to an API. OpenAI, Google, Anthropic — pick your favorite large language model, pipe audio in, get text back, generate a response. Ship it.
I call this the API wrapper approach, and it works beautifully in a demo. Quiet room, clear speaker, simple request. The demo always works.
The drive-thru at 11:47 PM with a diesel truck idling behind you and a toddler screaming in the backseat — that's where architecture actually matters.
The Wendy's FreshAI system — built on Google Cloud's Vertex AI — is probably the highest-profile example of this approach at scale. And the reported customer experiences tell you everything you need to know about its limits: customers needing three or more attempts for simple orders, the system cutting people off mid-sentence to suggest items they didn't ask for, and an experience described as "unusable" for anyone with a speech disfluency.
Yet Wendy's is expanding to 500-600 locations. The reason is simple math — the system increases average check size through upselling, and the labor efficiency numbers look good on a quarterly earnings call. If you're optimizing for the average, the architecture is a success. If you're the person it doesn't work for, the architecture is broken.
I explored this tension in depth in the interactive version of our research. But the core argument is one I want to make personally, because it shaped how we build everything.
The Night We Realized the Microphone Was the Wrong Place to Start
It was about 9 PM on a Thursday in late spring. Me, my co-founder, and two engineers standing in the parking lot of a shuttered Taco Bell we'd gotten permission to use for testing. We had our prototype mounted on a post — a speaker, a mic, some duct tape holding it all together. We'd been running it in the lab for weeks at about 95% accuracy. We felt ready.
The first car that pulled up was a woman in a Honda Civic with her window halfway down. She said "I'd like a number three combo" clearly enough. The system heard "island numb recon bowl." I looked at my co-founder. He looked at the ground.
The accuracy wasn't just bad — it was unusable. We stood in that parking lot for another two hours, running test after test, and the numbers only got worse as the evening traffic picked up. I remember the exact moment I stopped feeling frustrated and started feeling something closer to dread: this wasn't a tuning problem. Our entire approach was wrong.
The problem wasn't the language model. The model was fine. The problem was everything that happened before the audio reached the model. Wind noise. Engine rumble. The mechanical hum of an HVAC unit twenty feet away. A car horn three blocks over. Our system couldn't tell the difference between a human voice and a diesel engine because, at the signal level, nobody had taught it to.
That was the moment I understood something that I think most people in this space still haven't internalized: voice AI is not an NLP problem. It's a signal processing problem first, a linguistics problem second, and an NLP problem third. If your first layer is broken, nothing downstream can save you.
Why Does Drive-Thru AI Keep Cutting People Off?
The culprit is something called Voice Activity Detection — VAD. It's the system that decides when you've started talking and when you've stopped. In most wrapper solutions, it's a simple energy threshold: sound goes above a line, recording starts; sound drops below a line, recording stops.
Think about that in a drive-thru. You pause for half a second to glance at the menu board. The energy drops. The VAD decides you're done. It sends a sentence fragment to the model, the model hallucinates a response to a question you never finished asking, and now you're arguing with a speaker box.
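To make the failure mode concrete, here is roughly what a threshold VAD looks like. This is a minimal sketch with illustrative values, not any vendor's actual code:

```python
import numpy as np

# Minimal sketch of an energy-threshold VAD; frame size, threshold, and
# pause tolerance are illustrative values, not anyone's real settings.
FRAME_MS = 20            # one analysis frame = 20 ms of audio
ENERGY_THRESHOLD = 0.01  # RMS energy above this counts as "speech"
PAUSE_FRAMES = 25        # 25 frames x 20 ms = 500 ms of quiet ends the turn

def naive_vad(frames):
    """Yield (start, end) frame indices for each detected utterance.

    frames: iterable of float32 numpy arrays, one per 20 ms frame.
    """
    in_speech, silent, start = False, 0, None
    for i, frame in enumerate(frames):
        rms = np.sqrt(np.mean(frame ** 2))
        if rms > ENERGY_THRESHOLD:
            if not in_speech:
                in_speech, start = True, i
            silent = 0
        elif in_speech:
            silent += 1
            if silent >= PAUSE_FRAMES:
                # A half-second glance at the menu board lands here: the
                # fragment gets shipped downstream as a "complete" turn.
                yield (start, i - silent)
                in_speech, silent = False, 0
```

Notice what this code cannot distinguish: a diesel engine above the threshold reads as speech, and a silent block mid-word reads as done talking.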
We rebuilt our VAD from scratch. Instead of energy thresholds, we use neural models — Silero, Cobra — that output a probability that the audio actually contains human speech, learned from spectral patterns rather than raw loudness. Instead of a binary on/off, our system gives a confidence level. And instead of a static 500-millisecond pause tolerance, we use a dynamic window of 600 to 1,000 milliseconds that adjusts based on conversational context.
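Here is a sketch of that endpointing logic, assuming the open-source Silero VAD. The probability threshold and context categories are illustrative, not our production configuration:

```python
import torch

# Probability-based endpointing with a dynamic pause window, using the
# open-source Silero VAD. Thresholds and contexts are illustrative.
model, _ = torch.hub.load('snakers4/silero-vad', 'silero_vad')

SAMPLE_RATE = 16_000
CHUNK_SAMPLES = 512                            # Silero's expected chunk size
CHUNK_MS = CHUNK_SAMPLES / SAMPLE_RATE * 1000  # ~32 ms per chunk

def pause_budget_ms(context: str) -> int:
    """Dynamic pause tolerance: 600-1,000 ms depending on context."""
    return {"likely_complete": 600, "mid_utterance": 1000}.get(context, 800)

def endpoint_index(chunks, context="neutral"):
    """Return the chunk index where the turn ends, or None if it never does.

    chunks: iterable of 512-sample float32 torch tensors at 16 kHz.
    """
    silence_ms = 0.0
    for i, chunk in enumerate(chunks):
        prob = model(chunk, SAMPLE_RATE).item()  # P(human speech), not loudness
        if prob > 0.5:
            silence_ms = 0.0                     # confident speech resets the clock
        else:
            silence_ms += CHUNK_MS
            if silence_ms >= pause_budget_ms(context):
                return i
    return None
```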
The trick that made the biggest difference, though, was what we call speculative transcription. The system begins transcribing after just 250 milliseconds of speech but doesn't commit to an endpoint until 600 milliseconds of confirmed silence. That overlap reduces perceived latency by 350 to 600 milliseconds while simultaneously killing premature cut-offs.
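In sketch form, the overlap works something like this. The `asr` and `vad` objects stand in for a generic streaming recognizer and speech detector; their methods are hypothetical placeholders, not a specific library's API:

```python
# Timings mirror the figures above; feed()/partial()/is_speech() are
# hypothetical names for generic interfaces, not a real library's API.
SPECULATE_AFTER_MS = 250   # start decoding once this much speech has landed
COMMIT_SILENCE_MS = 600    # only commit the endpoint after this much quiet

def speculative_transcribe(audio_chunks, asr, vad, chunk_ms=32):
    """Decode speculatively while the endpoint is still open.

    audio_chunks: iterable of fixed-length frames (chunk_ms each).
    """
    pending = []            # frames heard before speculation kicks in
    speech_ms = silence_ms = 0.0
    speculating = False

    for chunk in audio_chunks:
        pending.append(chunk)
        if vad.is_speech(chunk):
            speech_ms += chunk_ms
            silence_ms = 0.0
        else:
            silence_ms += chunk_ms

        if speculating or speech_ms >= SPECULATE_AFTER_MS:
            speculating = True
            for frame in pending:   # flush the backlog, then stream live
                asr.feed(frame)
            pending.clear()

        if speculating and silence_ms >= COMMIT_SILENCE_MS:
            # Endpoint confirmed. The transcript is already mostly decoded,
            # which is where the 350-600 ms of perceived latency goes.
            return asr.partial()

    return asr.partial() if speculating else None
```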
My co-founder argued for weeks that the dynamic pause window was over-engineered. We were in the office late one night — cold coffee, whiteboards covered in latency diagrams — and he pushed his chair back and said, "We're spending three engineering-weeks on a feature that saves half a second. Nobody pauses for a full second at a drive-thru. This is a vanity problem." I said something like, "And if you're wrong, we've built a system that cuts off every customer who needs to think." We didn't talk for the rest of the night. He left around midnight. I stayed and kept running simulations.
Then we tested it with real customers. Turns out, people pause constantly — looking at the menu, turning to ask a passenger what they want, thinking about whether they really need fries. A full second of natural pause is not silence. It's thinking. My co-founder sent me a one-line message after he saw the test results: "You were right. Sorry about the chair."
When you optimize for speed over patience, you build a system that only works for people who already know what they want.
80 Million People
Stuttering affects over 80 million people globally. That number landed differently for me after the parking lot.
Stuttering manifests as repetitions ("b-b-b-baconator"), prolongations ("mmmmilk"), and blocks — silent pauses in the middle of a word where the person physically cannot produce sound.
Now think about what a standard VAD does with a block. The person stops making sound mid-word. The system interprets silence as turn completion. It responds to half a word. The person tries again. The system treats the repetition as a new order. Within ten seconds, you've got a confused AI, a frustrated human, and a line of cars building behind them.
This isn't an edge case. This is a design choice. When you train an ASR (Automatic Speech Recognition) model almost exclusively on "standard" U.S. English — well-articulated, minimal pauses — you are making a decision about who your system is for. Research shows that Conformer-based ASR models (a neural architecture combining convolution with self-attention, and the backbone of most modern systems) degrade so severely on disordered speech that some return negative semantic similarity scores. Not just inaccurate — semantically inverted.
When your AI model returns negative semantic scores on disordered speech, you haven't built a system that struggles with edge cases. You've built a system that was never designed to hear a significant portion of humanity.
An investor told me once, point-blank: "Just use the API and fine-tune later. You're burning runway on a problem that affects a small percentage of customers." I pulled up the numbers on my phone — 80 million people with stuttering alone, before you count accents, ESL (English as a Second Language) speakers, elderly customers, anyone ordering in a noisy car. I watched his face change. "That's not a small percentage," he said. "No," I said. "It's not."
We fine-tune self-supervised models on re-annotated disfluent speech datasets. We use synthetic disfluency insertion — taking fluent transcripts, adding blocks and repetitions, synthesizing them into training audio. It's painstaking work. It's not the kind of thing that shows up on a feature comparison chart. But it's the difference between a system that works for everyone and a system that works for the average.
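The transcript half of that pipeline is simple enough to sketch. The three augmenters below mirror the disfluency types described earlier; the insertion rate and notation are illustrative, and the final step of synthesizing the augmented text into training audio is omitted:

```python
import random

# Transcript-level disfluency insertion. Notation is illustrative; the
# augmented text would then be synthesized into training audio via TTS.

def insert_block(word: str) -> str:
    """A block: the first sound starts, then cuts to silence ('b- baconator')."""
    return f"{word[0]}- {word}"

def insert_repetition(word: str, times: int = 3) -> str:
    """Part-word repetition ('b-b-b-baconator')."""
    return "-".join([word[0]] * times) + f"-{word}"

def insert_prolongation(word: str) -> str:
    """Initial-sound prolongation ('mmmmilk')."""
    return word[0] * 4 + word[1:] if word else word

AUGMENTERS = [insert_block, insert_repetition, insert_prolongation]

def augment(transcript: str, rate: float = 0.15, seed: int = 0) -> str:
    """Apply a random disfluency to roughly `rate` of the words."""
    rng = random.Random(seed)
    return " ".join(
        rng.choice(AUGMENTERS)(word) if rng.random() < rate else word
        for word in transcript.split()
    )

print(augment("i would like a baconator and a frosty", rate=0.3, seed=7))
```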
What Happens When Voice AI Runs on the Edge Instead of the Cloud?
Every word spoken into a Wendy's drive-thru microphone travels across the public internet to a Google data center and back. That round trip costs 100 to 500 milliseconds before the model even begins processing. In voice interaction, the gold standard is sub-300 millisecond response time — anything above that, and the conversation stops feeling natural. By 700 to 900 milliseconds, it feels like a bad phone call. By two seconds, people start talking over the system.
We moved everything to the edge. Local processing on specialized hardware at the restaurant site. Our latency dropped to 5 to 10 milliseconds.
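The arithmetic is worth writing down. With an assumed model-processing budget, the network hop alone decides whether you can ever fit inside the conversational window:

```python
# Back-of-envelope latency budgets in milliseconds, using the figures
# above. The model-processing range (100-250 ms) is an assumption for
# illustration, not a measurement.
NATURAL_BUDGET_MS = 300  # above this, the conversation stops feeling natural

architectures = {
    "cloud": {"network": (100, 500), "processing": (100, 250)},
    "edge":  {"network": (5, 10),    "processing": (100, 250)},
}

for name, stages in architectures.items():
    lo = sum(low for low, _ in stages.values())
    hi = sum(high for _, high in stages.values())
    verdict = "fits" if hi <= NATURAL_BUDGET_MS else "can blow past"
    print(f"{name}: {lo}-{hi} ms end to end, {verdict} the {NATURAL_BUDGET_MS} ms budget")
```

Run it and cloud comes out at 200 to 750 milliseconds end to end, while edge stays inside the window even at the pessimistic end.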
But the real insight wasn't just speed — it was model size. A general-purpose LLM needs to know everything about everything. A domain-specific Small Language Model needs to know that "Dave's Single" is a burger, not a music album. That focus means faster inference, more predictable responses, and a fraction of the computational cost. We've seen 3x speed improvements and 30 to 40% lower operational costs compared to cloud-based approaches.
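Here is a toy illustration of why the narrow domain helps: a noisy hypothesis can be snapped to a known menu item or rejected outright instead of being free-associated into a response. The item names and matching cutoff are mine, for illustration:

```python
import difflib

# The output space is a menu, not the open web: ground or abstain.
MENU = ["Dave's Single", "Baconator", "Frosty", "Spicy Chicken Sandwich"]
_LOWER = [item.lower() for item in MENU]

def ground_to_menu(asr_text: str, cutoff: float = 0.6):
    """Snap a transcription to the closest menu item, or abstain."""
    match = difflib.get_close_matches(asr_text.lower(), _LOWER, n=1, cutoff=cutoff)
    if not match:
        return None  # abstain: ask the customer, don't hallucinate a Frosty
    return MENU[_LOWER.index(match[0])]

print(ground_to_menu("dave single"))             # -> "Dave's Single"
print(ground_to_menu("island numb recon bowl"))  # -> None
```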
The edge architecture also solved a problem we hadn't fully anticipated: reliability. When the internet goes down — and it will — a cloud-dependent system becomes a very expensive paperweight. Our system keeps running. For the full technical breakdown of our edge architecture versus cloud approaches, you can dig into the research paper.
The Regulatory Wall Nobody's Talking About
CAN-ASC-6.2:2025 landed on my desk in early 2025, and I remember reading it with something between relief and vindication — here was a standard that finally said what we'd been building toward: people with disabilities must be involved in the design, testing, and governance of AI systems. Not as an afterthought. From the start. The European Accessibility Act begins enforcement in June 2025 with steep fines, and the ADA is being reinterpreted to cover digital barriers for people with speech disabilities. Retrofitting a non-compliant system across 600 locations costs roughly five times what it costs to build it right from the start.
"What if we're just building a really expensive way to take a burger order?"
That thought hit me at about 2 AM, maybe six months into development. I was alone in the office, staring at a spectrogram of a stuttered word that our system still couldn't parse. We'd been at this for months. We'd burned through most of our initial funding. And the API wrapper companies were shipping product while we were still debugging signal processing pipelines.
I almost called it. Almost decided to just wrap the API, ship something, and iterate later like everyone else.
But "iterate later" is a lie in voice AI. Once you've built your architecture around cloud-dependent, VAD-threshold, fluent-speech-only assumptions, every customer interaction reinforces those assumptions in your training data. You don't iterate toward accessibility. You iterate away from it.
Build for the edge case first, and the average case takes care of itself. Build for the average, and the edge case never gets fixed.
The Turn-Taking Problem That Made Me Rethink Everything
There's a subtlety to human conversation that we take completely for granted. When you say "I'd like a Baconator and..." — that trailing "and" signals that you're not done. A human cashier would wait. Most drive-thru AI doesn't.
We built what I think of as conversational grammar into our endpointing logic. The system parses linguistic cues in real time: conjunctions that signal continuation, pitch changes that signal completion, the phrase "that's all" that means exactly what it says. When a customer says "that's all," our system responds in under 200 milliseconds because the intent is unambiguous. When they trail off with "and..." it waits, even through a full second of silence.
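Simplified, the timing side of that logic looks something like this. The cue lists and wait times are illustrative, and the real system also reads prosodic cues that plain text cannot capture:

```python
# Cue lists and waits are illustrative; production endpointing also uses
# prosody (pitch contours), which plain text cannot represent.
CONTINUATION_CUES = {"and", "with", "plus", "also", "um", "uh"}
COMPLETION_CUES = ("that's all", "that's it", "nothing else", "that'll do it")

def pause_tolerance_ms(partial_transcript: str) -> int:
    """How long to wait through silence before treating the turn as done."""
    text = partial_transcript.lower().strip()
    words = text.split()
    if any(text.endswith(cue) for cue in COMPLETION_CUES):
        return 200    # unambiguous intent: respond fast
    if words and words[-1] in CONTINUATION_CUES:
        return 1000   # a trailing "and" means wait, even a full second
    return 600        # default commit window

print(pause_tolerance_ms("I'd like a Baconator and"))  # -> 1000
print(pause_tolerance_ms("large fries, that's all"))   # -> 200
```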
This is also where our human-in-the-loop philosophy lives. We don't believe AI should handle the entire transaction unsupervised. Simple, transactional requests — the AI handles those. Complex situations, frustrated customers, high-friction moments — those escalate to a human before the interaction breaks down, not after.
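As a sketch, with illustrative signals and thresholds:

```python
from dataclasses import dataclass

# Escalation policy sketch. Signals and thresholds are illustrative; the
# point is that hand-off happens before the interaction breaks down.

@dataclass
class Turn:
    asr_confidence: float   # how sure we are we heard the customer right
    retries: int            # how many times they've had to repeat themselves
    frustrated: bool        # prosody/keyword signal: raised voice, "AGENT"

def route(turn: Turn) -> str:
    if turn.asr_confidence < 0.55 or turn.retries >= 2 or turn.frustrated:
        return "human"   # escalate while the interaction is still salvageable
    return "ai"          # simple, transactional: the AI handles it
```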
The goal was never to replace the human at the drive-thru. It was to make sure no customer ever has to shout "AGENT" at a machine that isn't listening.
I keep coming back to that 86% success rate that Wendy's reported. In most software contexts, 86% would be a failure. Imagine a banking app that processes 86% of transactions correctly. Imagine a navigation system that gets you to the right destination 86% of the time. The drive-thru has somehow normalized a failure rate that would be unacceptable in any other consumer interaction.
This Is an Architecture Problem, Not an AI Problem
The pattern I see across the industry is companies treating voice AI as a software layer — something you bolt on top of existing infrastructure with the right API key. And I understand why. It's fast, it's cheap, and the demos are incredible.
But the drive-thru is not a demo. It's diesel engines and wind and toddlers and accents and stuttering and people who pause to think. It's the full, irreducible complexity of human communication happening in the worst possible acoustic environment. You cannot wrapper your way through that.
The companies that will win this market — and I say this with the bias of someone who's bet his career on it — are the ones willing to go deep. Deep into signal processing. Deep into acoustic modeling. Deep into the linguistics of how people actually talk, not how ASR training data says they should. Deep into edge infrastructure that doesn't depend on a data center a thousand miles away.
There are no shortcuts in voice AI. There is only the rigorous, unglamorous, deeply technical work of building systems that hear every customer. Not 86% of them. Every single one.
That's what enterprise-grade means. And until the industry accepts that definition, we'll keep watching videos of drive-thru speakers that can't understand the word "Baconator."