Voice AI Systems Built Beyond the Platform Ceiling

Custom voice AI pipelines for telephony and product use cases, with latency engineering, domain-specific ASR, and regulatory compliance built in.

The Platform Ceiling Is Real

Voice AI platforms are everywhere in 2026. Retell, Vapi, Bland, and a dozen others let you spin up a phone agent in an afternoon. They work well for appointment booking, simple FAQ routing, and outbound confirmation calls. Then the requirements get harder. The insurance intake call needs to track fifteen data fields across a branching conversation. The medical triage bot needs to recognize drug names that sound like common English words. The payments flow needs PCI-DSS isolation with no card data touching the LLM. And the platform that got you to the demo breaks under the weight of the real use case.

Production voice agent implementations grew 340% year-over-year in 2025. But S&P Global's survey of over 1,000 enterprises found that 42% abandoned the majority of their AI initiatives before production, up from 17% the year before. McDonald's tested drive-through voice AI with IBM for two years across 100+ restaurants before ending the program in June 2024, after the system could not handle accents, background noise, or order corrections. Wendy's FreshAI with Google Cloud, by contrast, delivered a 22-second speed improvement by investing in natural language handling for pauses, corrections, and complex customizations. The difference was engineering depth applied to real-world audio.

We build voice systems for the use cases where platforms hit their ceiling. Custom pipelines from best-of-breed components, architected for the latency budget, vocabulary, and regulatory constraints of each engagement.

Latency Is the Conversation Killer

Users abandon voice interactions when the system takes more than 800 milliseconds to respond. The standard voice AI pipeline runs speech-to-text, then an LLM for reasoning, then text-to-speech. Each step adds 100-300ms plus network overhead, pushing total round-trip latency to 1-2 seconds. That gap between when the caller stops talking and when the system starts responding is where trust dies.

We engineer latency out of the pipeline at every layer. On the ASR side, Deepgram nova-3 delivers sub-300ms streaming transcription with 5-7% median word error rate in production, compared to Whisper's batch-oriented architecture that processes audio in 30-second chunks. On the TTS side, Cartesia Sonic produces first audio in approximately 40ms, and ElevenLabs Flash v2.5 in approximately 75ms. Between them, the LLM inference step is where most latency hides: we use speculative response generation, where the system begins formulating likely responses while the caller is still speaking, and streams the first audio tokens before the full response is generated.

Telephony infrastructure adds its own latency. Twilio's Media Streams WebSocket connections introduce jitter that does not appear in lab testing. Their newer ConversationRelay product reduces this overhead. Genesys AudioHook has similar characteristics. SIP trunk selection, codec transcoding, and jitter buffer tuning each contribute 20-50ms that compound silently. We profile the full audio path from microphone to speaker, not just the AI inference step, because much of the real-world latency accumulates outside the model.
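A latency budget makes the arithmetic above concrete. The following is a minimal sketch, not a measurement: the telephony, ASR upper bound, and TTS figures come from the numbers quoted in this section, while the ASR lower bound and the LLM time-to-first-token range are assumptions chosen only for illustration. The names `Stage`, `STAGES`, and `headroom` are hypothetical.

```python
from dataclasses import dataclass

BUDGET_MS = 800  # response threshold cited above


@dataclass(frozen=True)
class Stage:
    name: str
    low_ms: int   # optimistic estimate
    high_ms: int  # pessimistic estimate


# Values marked "assumed" are placeholders, not vendor measurements.
STAGES = [
    Stage("sip_trunk", 20, 50),
    Stage("codec_transcoding", 20, 50),
    Stage("jitter_buffer", 20, 50),
    Stage("streaming_asr", 150, 300),    # lower bound assumed; upper bound is "sub-300ms"
    Stage("llm_first_token", 200, 400),  # assumed
    Stage("tts_first_audio", 40, 75),
]


def totals(stages):
    """Return (best_case_ms, worst_case_ms) for the whole audio path."""
    return sum(s.low_ms for s in stages), sum(s.high_ms for s in stages)


def headroom(stages, budget_ms=BUDGET_MS):
    """Return (worst_case_headroom_ms, best_case_headroom_ms) against the budget."""
    low, high = totals(stages)
    return budget_ms - high, budget_ms - low


if __name__ == "__main__":
    for stage in STAGES:
        print(f"{stage.name:<20} {stage.low_ms:>4}-{stage.high_ms}ms")
    worst, best = headroom(STAGES)
    print(f"headroom: worst case {worst}ms, best case {best}ms")
```

Under these assumed figures the best case fits the budget with room to spare and the worst case overshoots it, which is why each stage has to be measured on the real audio path rather than estimated.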

Domain Vocabulary Breaks General-Purpose ASR

General-purpose speech recognition optimizes for conversational English. It handles "I'd like to schedule an appointment" well. It does not handle "atorvastatin 40 milligrams QD" or "force majeure clause in section 12.3(b)" or "part number XKCD-4419-REV-C" without help. Custom vocabulary features like phrase boosting exist in most ASR providers, but they are fragile when boosted terms are phonetically similar to common words.

For regulated industries where recognition errors have downstream consequences, we fine-tune ASR models on domain-specific corpora. Medical voice recognition requires training on clinical notes and physician dictations. Legal dictation requires exposure to archaic phrasing and citation formats. Financial services calls reference ticker symbols and proprietary product names that no general model has seen. Google Research demonstrated in 2023 that augmenting training data with synthetic accented speech improved recognition accuracy for underrepresented accents without degrading performance on well-represented ones. We apply the same principle to domain vocabulary: augment the training distribution with the exact terms the system will encounter, validated against real audio from the deployment environment.
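Validating against real audio comes down to comparing reference transcripts with ASR output. The sketch below computes word error rate with a standard word-level edit distance, plus a simple recall figure for a list of domain terms. It is illustrative only: the function names are hypothetical, the example sentence is invented, and real evaluation would also normalize numerals, punctuation, and multi-word terms.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def term_recall(pairs, terms) -> float:
    """Share of domain-term occurrences in the references that appear in the hypotheses."""
    hits = total = 0
    for reference, hypothesis in pairs:
        hyp_words = set(hypothesis.lower().split())
        for word in reference.lower().split():
            if word in terms:
                total += 1
                hits += word in hyp_words
    return hits / total if total else 1.0


if __name__ == "__main__":
    ref = "take atorvastatin 40 milligrams daily"
    hyp = "take a statin 40 milligrams daily"
    print(f"WER: {word_error_rate(ref, hyp):.2f}")
    print(f"term recall: {term_recall([(ref, hyp)], {'atorvastatin'}):.2f}")
```

Tracking term recall alongside overall WER matters because a model can score well on conversational filler while missing exactly the drug names or part numbers the deployment depends on.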

Conversation State Is an Engineering Problem, Not a Prompt Problem

Simple voice agents work by feeding the full conversation history into an LLM context window and relying on the model to track what has been discussed. This works for short, linear interactions. It fails on complex multi-turn conversations where the caller provides information out of order, corrects previous statements, or the conversation branches based on collected data.

We build structured dialog management that separates conversation state from the language model. Required fields, validation rules, branching logic, and escalation triggers are defined in a state machine that the LLM cannot override. The language model handles natural language understanding and generation within the boundaries the state machine sets. This means the system can track that the caller already provided their date of birth in turn three even if they mention a different date in turn seven, that an insurance claim requires all fifteen data fields before routing to adjudication, and that the system cannot commit to a price, a deadline, or a coverage determination without explicit authorization.

For healthcare voice AI, this architecture is not optional. HIPAA requires that the system never disclose protected health information to unauthorized parties, which means the dialog manager must enforce access controls at the conversation level, not rely on prompt instructions that the LLM might ignore under adversarial pressure or context window drift.
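The separation described above can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not a production dialog manager: the field names, validators, and action names are hypothetical, and the LLM's extraction step is out of scope. The language model only proposes slot values; the state object decides whether to accept them.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable


def valid_dob(value: str) -> bool:
    """Accept ISO dates (YYYY-MM-DD) that lie in the past."""
    try:
        return date.fromisoformat(value) < date.today()
    except ValueError:
        return False


# Hypothetical required fields with their validation rules.
REQUIRED: dict[str, Callable[[str], bool]] = {
    "date_of_birth": valid_dob,
    "policy_number": lambda v: v.isalnum() and len(v) == 10,
}

# Actions the agent may not take unless explicitly authorized.
BLOCKED_ACTIONS = {"quote_price", "promise_deadline", "confirm_coverage"}


@dataclass
class DialogState:
    slots: dict[str, str] = field(default_factory=dict)
    pending: dict[str, str] = field(default_factory=dict)
    authorized: set[str] = field(default_factory=set)

    def propose(self, name: str, value: str) -> str:
        """Handle a slot value proposed by the LLM; the state machine has the final say."""
        validator = REQUIRED.get(name)
        if validator is None:
            return "rejected_unknown_field"
        if not validator(value):
            return "rejected_invalid"
        if name in self.slots and self.slots[name] != value:
            # A conflicting value never overwrites silently; ask the caller first.
            self.pending[name] = value
            return "needs_confirmation"
        self.slots[name] = value
        return "accepted"

    def confirm(self, name: str, accepted: bool) -> None:
        """Apply or discard a pending correction after the caller answers."""
        value = self.pending.pop(name, None)
        if accepted and value is not None:
            self.slots[name] = value

    def missing(self) -> list[str]:
        return [name for name in REQUIRED if name not in self.slots]

    def may_perform(self, action: str) -> bool:
        return action not in BLOCKED_ACTIONS or action in self.authorized

    def ready_to_route(self) -> bool:
        return not self.missing() and not self.pending
```

Because the rules live outside the prompt, a date mentioned in turn seven cannot replace the date of birth captured in turn three without an explicit confirmation, and a commitment such as a price quote is refused regardless of what the model generates.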

The Nuance Migration Nobody Is Talking About

Nuance's on-premise speech stack reaches end of sustaining support around June 2026. Roughly 30% of Nuance's customer base is still running on-premise. Microsoft is pushing migration to Dynamics 365 Contact Center and Azure AI services. HCLTech launched a "Nuance Migration Factory" for at-scale transitions. But the real challenge is not swapping ASR engines. It is preserving the dialog logic embedded in VXML grammars that have been refined over years of production use.

Enterprise IVR trees encode business rules, exception handling, and edge cases documented nowhere except the VXML itself. A rip-and-replace migration that starts fresh with an LLM-based system will spend months rediscovering edge cases the old system already handled. We approach Nuance migrations as dialog logic extraction followed by architecture modernization: parse existing VXML flows, extract the state machine and business rules into a portable representation, then implement on modern infrastructure. Realistic timeline is 18-24 months for complex contact centers, not the 90-day plans vendor marketing suggests.
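As a rough illustration of the extraction step, the sketch below parses a small VoiceXML document with Python's standard library and pulls out each form's fields and guarded transitions. The sample VXML is invented for this example, and the walker is deliberately simplified: it does not combine nested conditions, follow external documents, or interpret grammars, all of which a real migration would need.

```python
import xml.etree.ElementTree as ET

# Invented sample; real flows are far larger and span many documents.
SAMPLE_VXML = """<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="collect_account">
    <field name="account_number">
      <prompt>Please say your account number.</prompt>
      <filled>
        <if cond="account_number.length != 8">
          <goto next="#reprompt_account"/>
        <else/>
          <goto next="#verify_identity"/>
        </if>
      </filled>
    </field>
  </form>
  <form id="reprompt_account">
    <block><goto next="#collect_account"/></block>
  </form>
  <form id="verify_identity">
    <field name="date_of_birth"/>
  </form>
</vxml>"""


def local(tag: str) -> str:
    """Strip the namespace prefix ElementTree attaches to tag names."""
    return tag.rsplit("}", 1)[-1]


def collect_transitions(node, guard=None):
    """Return (target, guard) pairs for every <goto> under node."""
    found = []
    branch = guard
    for child in node:
        tag = local(child.tag)
        if tag == "else":
            # Simplification: treats <else/> as the negation of the <if> condition.
            branch = f"not ({node.get('cond')})"
        elif tag == "elseif":
            branch = child.get("cond")
        elif tag == "goto":
            found.append((child.get("next", "").lstrip("#"), branch))
        elif tag == "if":
            # Simplification: an inner condition replaces the outer guard.
            found.extend(collect_transitions(child, child.get("cond")))
        else:
            found.extend(collect_transitions(child, branch))
    return found


def extract_flow(vxml_text: str) -> dict:
    """Map each form id to its field names and outgoing transitions."""
    root = ET.fromstring(vxml_text)
    flow = {}
    for form in root.iter():
        if local(form.tag) != "form":
            continue
        flow[form.get("id")] = {
            "fields": [f.get("name") for f in form.iter() if local(f.tag) == "field"],
            "transitions": collect_transitions(form),
        }
    return flow


def dangling_targets(flow: dict) -> set:
    """Transition targets that do not correspond to any extracted form."""
    targets = {target for state in flow.values() for target, _ in state["transitions"]}
    return targets - set(flow)


if __name__ == "__main__":
    for form_id, state in extract_flow(SAMPLE_VXML).items():
        print(form_id, state)
```

The resulting dictionary is one possible portable representation: it can be diffed, reviewed by the people who own the business rules, and checked for dangling targets before anything is reimplemented.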

Regulatory Compliance Is Not an Add-On

The FCC confirmed in 2024 that TCPA restrictions on artificial or prerecorded voices apply to AI-generated speech. Any voice AI making outbound calls requires prior express consent for informational calls and prior express written consent for marketing calls. Violations carry $500-1,500 per call in statutory damages with strict liability, meaning no intent is required. The one-to-one consent requirement, postponed to April 2026, requires individual consent per seller, closing the lead-generation loophole that many voice AI vendors have been operating through.

Beyond TCPA, voice AI deployments face PCI-DSS requirements when handling payment data (card numbers must never reach the LLM or appear in logs), HIPAA requirements for healthcare interactions (BAAs, minimum necessary access, audit trails), and state-level requirements from Colorado's AI Act (effective June 2026) and Illinois' BIPA. We architect compliance into the voice pipeline from the beginning: consent capture and recording mechanisms, PCI-scoped audio routing that isolates card data from the AI processing path, call recording consent handling that adapts to the caller's jurisdiction, and audit trails that document exactly what the system said and why.
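Audio routing is the primary control for keeping card data out of the AI path. As a secondary guard, a transcript filter can redact anything that looks like a card number before text reaches the LLM or a log. The sketch below is an assumption about how such a filter might look, not a description of a compliance-certified component: it only catches digit-formatted numbers (not digits spoken as words), and the names `redact_card_numbers` and `[REDACTED_PAN]` are hypothetical.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CANDIDATE = re.compile(r"(?:\d[ -]?){12,18}\d")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum used by payment card numbers."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def redact_card_numbers(text: str) -> str:
    """Replace Luhn-valid digit runs with a placeholder; leave other numbers intact."""

    def replace(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED_PAN]" if luhn_valid(digits) else match.group()

    return CANDIDATE.sub(replace, text)


if __name__ == "__main__":
    # 4111 1111 1111 1111 is a widely used test card number, not a real account.
    print(redact_card_numbers("my card is 4111 1111 1111 1111 thanks"))
    print(redact_card_numbers("reference 1234 5678 9012 3456"))
```

The Luhn check keeps the filter from mangling order numbers and reference codes that happen to be long digit strings, at the cost of letting through mistyped card numbers, which is one reason it should only ever back up isolation at the audio layer.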

Build, Buy, or Compose

The build-vs-buy framing is outdated for voice AI in 2026. The real decision is what to compose. Open-source frameworks like Pipecat (from Daily) and LiveKit Agents provide vendor-neutral pipeline orchestration. ASR, LLM, and TTS components can be swapped independently. Telephony infrastructure connects through standard SIP trunks or WebRTC. The question is where your use case needs custom engineering versus where an off-the-shelf component is genuinely sufficient.

We help teams answer this honestly. If the use case is appointment scheduling with standard English speakers and no regulatory constraints, a platform at $0.07-0.09 per minute is the right answer. We will tell you that. If the use case involves domain-specific vocabulary, multi-turn state management, regulated data, or call volumes where per-minute pricing becomes the dominant cost driver, custom composition removes vendor lock-in. At 50,000 minutes per month, the spread between platform pricing and a self-hosted pipeline can exceed $5,000 monthly. A custom build at $50K-300K upfront replaces compounding per-minute platform charges with infrastructure costs.
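The crossover point is simple arithmetic once the inputs are known. The sketch below works in integer cents to avoid floating-point rounding. The 15-cent platform rate and 5-cent self-hosted rate are assumptions picked so that the spread at 50,000 minutes matches the $5,000 monthly figure above; actual rates depend on provider, volume tier, and infrastructure choices.

```python
from typing import Optional


def monthly_savings_cents(minutes: int, platform_cents: int, self_hosted_cents: int) -> int:
    """Monthly spread between platform and self-hosted cost, in cents."""
    return minutes * (platform_cents - self_hosted_cents)


def breakeven_months(
    upfront_dollars: int,
    minutes: int,
    platform_cents: int,
    self_hosted_cents: int,
) -> Optional[int]:
    """Whole months until cumulative savings cover the upfront build cost.

    Returns None when the self-hosted pipeline is not cheaper per minute.
    """
    savings = monthly_savings_cents(minutes, platform_cents, self_hosted_cents)
    if savings <= 0:
        return None
    upfront_cents = upfront_dollars * 100
    return -(-upfront_cents // savings)  # integer ceiling division


if __name__ == "__main__":
    # Assumed rates: 15 cents/min platform, 5 cents/min self-hosted.
    for upfront in (50_000, 300_000):
        months = breakeven_months(upfront, 50_000, 15, 5)
        print(f"${upfront:,} build at 50,000 min/month: break-even in {months} months")
    print("low volume:", breakeven_months(50_000, 5_000, 7, 5), "months")
```

Under these assumptions a $50K build pays back in 10 months and a $300K build in 60, while the same $50K build at 5,000 minutes per month against a 7-cent platform takes 500 months, which is the case where the platform is the right answer.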

Every engagement starts with actual requirements: latency budget, vocabulary complexity, regulatory exposure, call volume projections, and existing telephony infrastructure. We design the architecture, select components, build customizations, and deliver a system the team owns without per-minute vendor dependency.

Frequently Asked Questions

How much does enterprise voice AI cost per minute, and when does custom build make sense?

Platform voice AI runs $0.07-0.20 per minute depending on provider and volume tier. Retell charges $0.07+/min, Vapi advertises $0.05/min but bills across up to five separate invoices, so the all-in rate tends to land higher, and Bland runs $0.09/min with monthly minimums. For comparison, a fully loaded human agent costs $0.42-1.08 per minute. At low volumes, platforms are the right choice. At 50,000+ minutes per month, per-minute charges compound significantly and a custom-built pipeline on open frameworks like Pipecat or LiveKit eliminates ongoing vendor markups. Custom builds run $50K-300K+ upfront but flatten to infrastructure costs only. We model the TCO crossover for each engagement so the decision is grounded in actual volume projections, not assumptions.

How do you get voice AI response latency below 800 milliseconds?

The standard STT-to-LLM-to-TTS pipeline adds 1-2 seconds of round-trip latency. We reduce this at every layer: streaming ASR (Deepgram nova-3 delivers sub-300ms transcription versus Whisper's batch processing), low-latency TTS (Cartesia Sonic at approximately 40ms time-to-first-audio, ElevenLabs Flash at approximately 75ms), speculative response generation that begins formulating answers while the caller is still speaking, and telephony path optimization including SIP trunk selection, codec tuning, and jitter buffer configuration. The telephony infrastructure alone can add 100-200ms of hidden latency that never shows up in lab demos.

What TCPA and regulatory requirements apply to AI voice agents?

The FCC confirmed that TCPA restrictions on artificial or prerecorded voices apply to AI-generated speech. Outbound informational calls require prior express consent. Marketing calls require prior express written consent. Violations carry $500-1,500 per call in statutory damages with strict liability. The one-to-one consent requirement taking effect April 2026 requires individual consent per seller. Beyond TCPA, voice AI handling payment data needs PCI-DSS isolation (card numbers must never reach the LLM), healthcare voice AI needs HIPAA compliance with BAAs and audit trails, and Colorado's AI Act (effective June 2026) adds requirements for high-risk AI systems. We architect compliance into the pipeline from day one, not as a retrofit.

Why does general-purpose ASR fail on medical, legal, and financial terminology?

General ASR models are trained on conversational speech. They optimize for common English vocabulary and struggle with Latin medical terms, archaic legal phrasing, financial ticker symbols, and industrial part numbers. Phrase boosting features help but are fragile when boosted terms sound similar to common words. For regulated industries where recognition errors have downstream consequences, we fine-tune ASR models on domain-specific corpora collected from the actual deployment environment. Google Research demonstrated in 2023 that augmenting training data with synthetic accented speech improved accuracy for underrepresented accents without degrading performance on well-represented ones, and we apply the same principle to domain vocabulary. The target word error rate depends on the domain: below 10% for general customer service, below 5% for healthcare, below 3% for safety-critical transcription.

Should we use OpenAI's Realtime Voice API or build a custom voice pipeline?

OpenAI's gpt-realtime is a single speech-to-speech model with 250-500ms end-to-end latency and no transcription step. It is fast and simple to integrate. The trade-offs: rate limited to approximately 100 simultaneous sessions at Tier 5, no audit trail of what was said (no intermediate transcription), the model sometimes misidentifies languages for speakers with heavy accents, and you are locked to OpenAI from day one. A custom pipeline (STT + LLM + TTS) gives you component-level control, vendor independence, full transcription for compliance, and the ability to swap any component as better options emerge. For regulated use cases or high concurrency, the custom pipeline is the practical choice.

How do you handle Nuance IVR end-of-life migration?

Nuance on-premise sustaining support ends around June 2026, affecting roughly 30% of the customer base. The real challenge is not replacing the ASR engine. It is preserving the dialog logic encoded in VXML grammars refined over years of production use. Enterprise IVR trees contain business rules, exception handling, and edge cases documented nowhere except the VXML itself. We approach migrations as dialog logic extraction first: parse existing VXML flows, extract the state machine and business rules into a portable format, then implement on modern infrastructure while preserving behavioral guarantees. Realistic timeline is 18-24 months for complex contact centers. Vendors promising 90-day migrations are likely discarding logic your organization depends on.

How do you prevent voice AI from making unauthorized commitments or disclosing sensitive information?

We separate conversation state management from the language model. Required fields, validation rules, branching logic, and escalation triggers live in a structured state machine that the LLM cannot override. The language model handles natural language understanding and response generation within boundaries the state machine enforces. This means the system cannot commit to a price, deadline, or coverage determination without explicit authorization, and cannot disclose protected information to unauthorized parties. For regulated industries, this is not optional. HIPAA, PCI-DSS, and financial services regulations require that AI systems enforce access controls at the conversation level, not rely on prompt instructions that models may ignore under adversarial pressure or context drift.

How do you test and monitor voice AI systems in production?

Production voice AI requires monitoring across four layers: infrastructure (latency percentiles, concurrent session capacity, telephony uptime), agent execution (word error rate, intent accuracy, dialog completion rate), user reaction (abandonment rate, repeat-call rate, escalation frequency), and business outcome (cost per resolved interaction, containment rate, customer satisfaction). We target WER below 10% for customer service, below 5% for healthcare, and below 3% for safety-critical use cases. We track time-to-first-audio at P50, P90, and P99 percentiles, not averages. Weekly WER monitoring catches model drift, and automated alerts fire on sustained drops in intent accuracy, rising repetition rates, or increased fallback usage. After any prompt or model change, we run regression test suites against recorded call scenarios before deploying to production traffic.
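To show why percentiles rather than averages are tracked, the sketch below computes P50, P90, and P99 with the nearest-rank method over a handful of time-to-first-audio samples. The sample values are invented for illustration, and the `percentile` helper is a hypothetical name; a production system would compute these over rolling windows from real call telemetry.

```python
from statistics import mean


def percentile(samples, p: int):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p * n / 100
    return ordered[rank - 1]


# Invented time-to-first-audio samples in milliseconds, with one slow outlier.
SAMPLES_MS = [180, 210, 220, 240, 250, 260, 300, 320, 450, 1400]

if __name__ == "__main__":
    print(f"mean: {mean(SAMPLES_MS):.0f}ms")
    for p in (50, 90, 99):
        print(f"P{p}: {percentile(SAMPLES_MS, p)}ms")
```

In this invented data the mean sits well under the 800ms threshold while P99 is far over it, so an average-only dashboard would hide the calls most likely to be abandoned.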

Build Your AI with Confidence.

Partner with a team that has deep experience in building the next generation of enterprise AI. Let us help you design, build, and deploy an AI strategy you can trust.

Veriprajna Deep Tech Consultancy specializes in building safety-critical AI systems for healthcare, finance, and regulatory domains. Our architectures are validated against established protocols with comprehensive compliance documentation.