What is voice AI vs conversational AI vs agentic AI?

Voice AI is a software agent that conducts spoken conversations on a phone call. Conversational AI is the broader umbrella covering voice and chat agents. Agentic AI specifically refers to agents that don't just talk but invoke production APIs to take real action — booking the slot, raising the ticket, processing the refund — inside the conversation. The distinction from non-agentic is whether the agent closes the loop or hands off.

What is MCP in the context of voice AI?

MCP (Model Context Protocol) is a standardised protocol released by Anthropic in late 2024 for AI agents to invoke tools — typed function calls — exposed by a server. The server controls what tools exist, what auth scope each requires, and how invocations are logged. MCP is the production-grade pattern for connecting voice AI to enterprise APIs in 2026, replacing ad-hoc bolted-on webhook integrations.

What latency is acceptable for production voice AI?

Production-grade voice AI hits a round-trip latency (customer finishing speaking to agent starting to respond) of 600–900ms. Below 400ms feels eerily natural; above 1.2s feels broken to most customers. Latency budgets are tight enough that integration architecture (webhook triggers, MCP-controlled tool access, caching layers in front of slow internal APIs) materially shapes the customer experience.

What is WER and what is good production WER for Indian languages?

WER (Word Error Rate) is the standard ASR (Automatic Speech Recognition) quality metric — lower is better. Production-grade Indian-language WER on tuned models is 4–8%; pre-tuning models often run 12–25% on Indian-accented speech. WER varies materially by language, accent, telephony audio quality, and background noise — vendor benchmarks should specify all four conditions.

What is a conversation graph in voice AI deployments?

A conversation graph is the structured map of conversation states, transitions, and tool invocations that defines what the agent does. It is the voice AI equivalent of a script, an objection-handling card, and escalation rules combined. Conversation graphs are versioned, A/B-testable, and form the bulk of the conversation design discipline that distinguishes mature voice AI deployments from quick demos.

Voice AI Glossary 2026 — 60 Terms for Indian Buyers and Builders | Caller Digital

What is the difference between per-minute, per-call, and per-outcome pricing?

Per-minute prices voice AI by minutes of conversation (industry-standard for many India vendors). Per-call prices by completed call regardless of duration. Per-outcome prices by completed business outcome — per recharged subscriber, per recovered cart, per qualified lead, per booked appointment. Per-outcome aligns better with the customer's revenue model; per-minute exposes the customer to dial-volume risk at the vendor's pricing discretion.

What is code-switching and why does it matter for Indian voice AI?

Code-switching is when a speaker mixes two or more languages in the same sentence, for example Hindi-English (Hinglish), Tamil-English, or Hinglish-Marathi. Indian customers do this constantly. Voice AI agents that don't handle code-switching natively force the customer to restart in one language, breaking the conversation. Production-grade Indian voice AI handles code-switching across all of its supported Indian languages (10 or more) with no restart.

This is the reference glossary we send to procurement leads, RFP authors, and IT architects who want to read voice AI vendor decks without getting tangled in jargon. Sixty terms organised into ten clusters: fundamentals, multilingual, integration, compliance, pricing, evaluation, deployment and operations, sales, customer experience, and emerging concepts. Plain language, India context, no marketing dressing.

Fundamentals

Voice AI. A software agent that conducts spoken conversations on a phone call — placing or receiving calls, understanding speech in real time, and responding conversationally. Distinct from a chatbot (text-only) and from IVR (rigid menu-driven, no understanding).

Conversational AI. A broader umbrella that includes voice AI, chat agents, and any AI that holds a multi-turn dialogue. Voice AI is a subset.

Agentic AI. A voice (or chat) agent that doesn't just have a conversation but invokes production APIs to take real action — booking the slot, raising the ticket, processing the refund — inside the conversation. The distinction from non-agentic is whether the agent closes the loop or hands off.

ASR (Automatic Speech Recognition). The component that converts the customer's spoken audio into text the model can process. Quality is measured in WER (Word Error Rate) and improves materially when tuned for Indian-accented speech and Indian languages.

TTS (Text-to-Speech). The component that converts the model's text response into spoken audio the customer hears. Quality is measured in MOS (Mean Opinion Score). India-tuned TTS produces materially more natural Hindi-Hinglish-regional output than generic global TTS.

LLM (Large Language Model). The model that drives the conversation — Anthropic's Claude, OpenAI's GPT, Google's Gemini, Meta's Llama, or India-specific models like Sarvam. The LLM choice affects conversation quality, latency, and per-call cost.

Round-trip latency. End-to-end time from the customer finishing speaking to the agent starting to respond. Production-grade voice AI hits 600–900ms; below 400ms feels eerily natural; above 1.2s feels broken.
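To make the latency thresholds concrete, here is a minimal sketch of a latency budget. The stage names and per-stage timings are hypothetical placeholders, not measurements from any platform; only the 400 ms, 900 ms, and 1.2 s thresholds come from the definition above. The label for the 900 to 1200 ms band is also an assumption, since the glossary does not name it.

```python
# Hypothetical per-stage timings in milliseconds (assumptions for illustration).
STAGE_MS = {
    "end_of_speech_detection": 150,
    "asr_final_transcript": 120,
    "llm_first_token": 300,
    "tts_first_audio": 130,
    "telephony_transport": 100,
}


def round_trip_ms(stages: dict[str, int]) -> int:
    """Sum the stage timings into one round-trip figure."""
    return sum(stages.values())


def classify(latency_ms: int) -> str:
    """Bucket a latency using the thresholds quoted in the glossary."""
    if latency_ms < 400:
        return "eerily natural"
    if latency_ms <= 900:
        return "production-grade or better"
    if latency_ms <= 1200:
        return "slow but tolerable"  # assumed label; the glossary does not name this band
    return "broken"


total = round_trip_ms(STAGE_MS)
print(total, classify(total))
```

A budget like this shows why a slow internal API behind a tool call matters: any stage that grows by a few hundred milliseconds can push the total past the 1.2 s mark.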

Multilingual and India-specific

Code-switching. When a speaker mixes two or more languages in the same sentence — Hindi-English, Tamil-English, Hinglish-Marathi. Indian customers do this constantly. Voice AI agents that don't handle code-switching natively force the customer to restart in one language, breaking the conversation.
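As a rough illustration of what code-switching looks like in a transcript, the sketch below flags an utterance that mixes writing scripts. This is a deliberately simple heuristic and an assumption on our part, not how production language identification works: it only sees script changes (Devanagari, Tamil, Latin), so romanised Hinglish typed entirely in Latin letters would pass undetected.

```python
def script_of(token: str) -> str:
    """Return the script of a token based on Unicode code point ranges."""
    for ch in token:
        cp = ord(ch)
        if 0x0900 <= cp <= 0x097F:  # Devanagari block
            return "devanagari"
        if 0x0B80 <= cp <= 0x0BFF:  # Tamil block
            return "tamil"
    if any(ch.isascii() and ch.isalpha() for ch in token):
        return "latin"
    return "other"


def is_code_switched(transcript: str) -> bool:
    """True when the transcript contains tokens from more than one script."""
    scripts = {script_of(tok) for tok in transcript.split()}
    scripts.discard("other")
    return len(scripts) > 1


print(is_code_switched("मेरा order कब आएगा"))
print(is_code_switched("where is my order"))
```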

Hinglish. Hindi-English mixed code-switching specifically — the most common conversational register in urban Indian customer interactions.

Indian-accented English. A distinct variety with its own phonetic patterns. Voice AI tuned only on US/UK English shows materially worse ASR accuracy on Indian-accented input.

Tier-2/Tier-3 voice coverage. The ability to handle regional dialects and accent variation common in non-metro India. Bhojpuri-inflected Hindi, Deccani Telugu, Rohilkhand Hindi, etc.

Language detection. The agent's ability to detect the customer's preferred language from the first response and switch accordingly, without requiring a menu choice.

Integration and architecture

MCP (Model Context Protocol). A standardised protocol (released by Anthropic in late 2024) for AI agents to invoke tools — typed function calls — exposed by a server. The production-grade pattern for connecting voice AI to enterprise APIs in 2026.

Tool calling. The mechanism by which an agent invokes an external API mid-conversation. MCP is one standard for tool calling; vendor-specific patterns exist as well.
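The sketch below illustrates the general idea of a typed tool call: a server that decides which tools exist, which auth scope each requires, and how invocations are logged. It is not the MCP SDK or any vendor's API. `Tool`, `ToolServer`, `book_slot`, and the `bookings:write` scope are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    name: str
    required_scope: str
    params: dict[str, type]  # parameter name -> expected Python type
    handler: Callable[..., Any]


class ToolServer:
    """Hypothetical server: owns the tool list, scope checks, and the log."""

    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}
        self.log: list[dict[str, Any]] = []

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def invoke(self, name: str, args: dict[str, Any], scopes: set[str]) -> Any:
        tool = self._tools[name]
        if tool.required_scope not in scopes:
            raise PermissionError(f"missing scope {tool.required_scope}")
        for key, expected in tool.params.items():
            if not isinstance(args.get(key), expected):
                raise TypeError(f"{key} must be {expected.__name__}")
        result = tool.handler(**args)
        self.log.append({"tool": name, "args": args, "result": result})
        return result


def book_slot(customer_id: str, slot: str) -> dict[str, str]:
    # Stand-in for a real booking API call.
    return {"status": "booked", "customer_id": customer_id, "slot": slot}


server = ToolServer()
server.register(
    Tool("book_slot", "bookings:write", {"customer_id": str, "slot": str}, book_slot)
)
```

The agent never calls `book_slot` directly; it asks the server, which is what lets the scope check and the log sit in one place.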

Webhook trigger. A push-based pattern for triggering an outbound voice AI call — your commerce stack pings the voice AI platform when an event occurs (cart abandoned, COD order placed, demo form submitted), and the call fires in seconds.

API integration vs middleware integration. API integration means direct platform-to-platform connection; middleware integration involves an intermediate translation layer (Zapier, custom integration servers). API integration runs faster and costs less at run time; middleware is quicker to ship the first time.

Idempotency key. A stable identifier on a tool call that prevents the same action from executing twice if the agent retries. Critical for write operations (refunds, bookings, ticket creation) — without it, a retry creates duplicate refunds.
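A minimal sketch of how an idempotency key prevents a duplicate refund on retry. `RefundService` is a hypothetical in-memory stand-in for a real payments API, and a real implementation would persist the keys rather than hold them in a dictionary.

```python
import uuid


class RefundService:
    """Hypothetical refund API that remembers results by idempotency key."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}
        self.refunds_issued = 0

    def refund(self, idempotency_key: str, order_id: str, amount_inr: int) -> dict:
        # A repeated key returns the stored result instead of refunding again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.refunds_issued += 1
        result = {
            "refund_id": f"rf_{self.refunds_issued}",
            "order_id": order_id,
            "amount_inr": amount_inr,
        }
        self._results[idempotency_key] = result
        return result


service = RefundService()
key = str(uuid.uuid4())  # generated once per intended action, reused on retry
first = service.refund(key, "order_123", 499)
retry = service.refund(key, "order_123", 499)  # e.g. the agent retries after a timeout
print(first == retry, service.refunds_issued)
```

The key is tied to the intended action, not to the HTTP attempt, which is why the agent must generate it before the first call and reuse it on every retry.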

Tenant isolation. The guarantee, on a multi-tenant voice AI platform, that one customer's data and tool access can't bleed into another customer's. Required for multi-brand or B2B SaaS deployments.

Audit log. A queryable record of every conversation, every tool call, every outcome — with timestamp, conversation ID, auth context, and content. Required for compliance defence and operational debugging.

Concurrency. The number of conversations the platform can run simultaneously. India peak workloads (festival weeks, recharge surges, Black Friday/Cyber Monday sales) can require 4,000–10,000 concurrent conversations.
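Required concurrency can be estimated with Little's law: average conversations in progress equal the call arrival rate multiplied by the average call duration. The sketch below applies that formula; the peak volume and call length are hypothetical inputs chosen only to land inside the range quoted above.

```python
import math


def required_concurrency(calls_per_hour: float, avg_call_seconds: float) -> int:
    """Little's law: concurrent calls = arrival rate x average duration."""
    calls_per_second = calls_per_hour / 3600
    return math.ceil(calls_per_second * avg_call_seconds)


# Hypothetical peak: 120,000 connected calls per hour at 3 minutes each.
print(required_concurrency(120_000, 180))
```

This gives the average load at that arrival rate; a real capacity plan would add headroom for bursts within the hour.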

Compliance and regulation

DPDP (Digital Personal Data Protection Act 2023). India's horizontal data-protection law. Always applies to voice AI — every deployment processes personal data.

TRAI DLT (Distributed Ledger Technology). TRAI's framework for governing commercial outbound communications. Mandates registration of senders, headers, templates; classification of calls as transactional vs service vs promotional; DND scrubbing.

RBI FPC (Fair Practices Code). Reserve Bank of India's code governing collection-call conduct for banks, NBFCs, and digital lenders. Calling hours, identity disclosure, no-harassment language, recording retention.

IRDAI overlay. Insurance Regulatory and Development Authority's sectoral compliance for insurer/intermediary voice calls — disclosure, no mis-selling, recorded consent for policy changes.

RERA disclosure. Real Estate Regulatory Authority (state-level) disclosure requirements for real-estate sales calls — registration number, accuracy of marketing claims.

DND (Do Not Disturb). The National DND register; numbers on it must be scrubbed before non-transactional outbound. Mandatory pre-dial check.

Promotional vs transactional. TRAI's classification distinction. Transactional calls (related to an existing transaction) bypass DND; promotional calls (intended to influence a purchase) require consent and DND scrubbing, on top of the DLT registration that applies to commercial senders generally.
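The DND and classification rules above can be sketched as a pre-dial gate. This is a simplification of the rules as this glossary states them, not a complete encoding of the regulations. `may_dial`, the in-memory sets, and the phone numbers are hypothetical.

```python
from enum import Enum


class CallCategory(Enum):
    TRANSACTIONAL = "transactional"
    SERVICE = "service"
    PROMOTIONAL = "promotional"


def may_dial(
    number: str,
    category: CallCategory,
    dnd_numbers: set[str],
    consented_numbers: set[str],
) -> tuple[bool, str]:
    """Return (allowed, reason) for one outbound dial attempt."""
    if category is CallCategory.TRANSACTIONAL:
        return True, "transactional calls bypass DND"
    if number in dnd_numbers:
        return False, "number is on the DND register"
    if category is CallCategory.PROMOTIONAL and number not in consented_numbers:
        return False, "promotional call without recorded consent"
    return True, "passed DND scrub"


dnd = {"9000000001"}
consent = {"9000000002"}
print(may_dial("9000000001", CallCategory.PROMOTIONAL, dnd, consent))
print(may_dial("9000000002", CallCategory.PROMOTIONAL, dnd, consent))
```

Returning the reason alongside the decision makes each blocked dial explainable in the audit log.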

Data residency. Where data is stored and processed. India-region residency is the safe operational default for sensitive verticals (BFSI, healthcare, insurance, telecom).

Retention period. How long data (recordings, transcripts, PII) is kept. Sectoral minimums apply (RBI: 90 days; some regimes require 12+ months; insurance often 3+ years for grievance defence).

Consent capture. The mechanism for obtaining explicit customer consent — at outbound dial start, in the opening seconds of the call, with a verifiable record.

Pricing models

Per-minute pricing. Voice AI priced by minutes of conversation. Industry-standard for many India vendors; ranges meaningfully by volume and complexity.

Per-call pricing. Priced by completed call regardless of duration. Useful when calls have natural length variance.

Per-outcome pricing. Priced by completed outcome — per recharged subscriber, per recovered cart, per qualified lead, per booked appointment. Aligns better with the customer's revenue model.

Volume tier pricing. Per-unit pricing that drops at higher volume thresholds. Standard for high-volume deployments.

Outcome plus retainer. Hybrid model — a base platform retainer plus per-outcome usage. Common for enterprise-tier contracts.
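To compare the pricing models in this cluster on one campaign, the arithmetic can be sketched as below. Every rate and volume here is a hypothetical placeholder, not a market price; the point is only that the same campaign produces a different bill, and a different cost per outcome, under each model.

```python
# Hypothetical campaign and rates (all figures are assumptions, in INR).
COMPLETED_CALLS = 1000
AVG_MINUTES_PER_CALL = 3
OUTCOMES = 150  # e.g. recovered carts

RATE_PER_MINUTE = 5
RATE_PER_CALL = 12
RATE_PER_OUTCOME = 90
RETAINER = 5000
RATE_PER_OUTCOME_WITH_RETAINER = 60

costs = {
    "per_minute": COMPLETED_CALLS * AVG_MINUTES_PER_CALL * RATE_PER_MINUTE,
    "per_call": COMPLETED_CALLS * RATE_PER_CALL,
    "per_outcome": OUTCOMES * RATE_PER_OUTCOME,
    "outcome_plus_retainer": RETAINER + OUTCOMES * RATE_PER_OUTCOME_WITH_RETAINER,
}

# Normalise every model to cost per business outcome for comparison.
cost_per_outcome = {model: total / OUTCOMES for model, total in costs.items()}

for model, total in costs.items():
    print(f"{model}: total {total}, per outcome {cost_per_outcome[model]:.2f}")
```

Under per-minute and per-call pricing the bill is fixed by dial volume whether or not outcomes materialise; under per-outcome pricing it scales with `OUTCOMES` alone.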

Evaluation metrics

Connect rate. Percentage of dialed calls that actually connect to a live customer. Varies by region, time of day, telephony partner.

Conversation completion rate. Percentage of connected calls that reach a defined "complete" state (vs hung up mid-conversation).

Resolution rate / First-call resolution (FCR). Percentage of calls that achieve their intended outcome in a single conversation, no callback or escalation.

Escalation rate. Percentage of calls that route to a human agent. Higher escalation isn't necessarily bad: it is a quality signal when calibrated to the cases that genuinely need a human.
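The four funnel metrics defined so far can be computed from raw call counts as sketched below. The counts are hypothetical. The choice of connected calls as the denominator for resolution and escalation rate is an assumption, since the definitions above do not fix it, so a vendor's reported numbers should state their denominator.

```python
from dataclasses import dataclass


@dataclass
class CallStats:
    dialed: int
    connected: int
    completed: int
    resolved_first_call: int
    escalated: int


def funnel_rates(s: CallStats) -> dict[str, float]:
    """Compute funnel metrics as fractions between 0 and 1."""
    return {
        "connect_rate": s.connected / s.dialed,
        "completion_rate": s.completed / s.connected,
        # Assumption: FCR and escalation are measured against connected calls.
        "first_call_resolution": s.resolved_first_call / s.connected,
        "escalation_rate": s.escalated / s.connected,
    }


stats = CallStats(dialed=1000, connected=400, completed=300,
                  resolved_first_call=240, escalated=40)
for name, value in funnel_rates(stats).items():
    print(f"{name}: {value:.1%}")
```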

WER (Word Error Rate). ASR quality metric. Lower is better. Production-grade Indian-language WER is 4–8%; pre-tuning models often run 12–25%.
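WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the ASR output and a reference transcript, divided by the number of words in the reference. The sketch below is a minimal version of that calculation; it assumes whitespace tokenisation and no text normalisation, both of which real evaluations handle more carefully. The sample sentence is invented.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(wer("mera recharge kal expire ho gaya", "mera recharge kal expire gaya"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.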

MOS (Mean Opinion Score). TTS naturalness metric on a 1–5 scale. Production India-tuned TTS hits 4.0–4.4; below 3.5 sounds robotic.

CSAT/NPS on AI calls. Customer satisfaction or Net Promoter Score specifically on AI-handled calls. Good deployments hit parity or near-parity with human agents on transactional flows.

Deployment and operations

Pilot. A time-boxed (typically 30-day) deployment on a single workflow, single language, with explicit success metrics and a go/no-go decision at the end.

Conversation graph. The structured map of conversation states, transitions, and tool invocations that defines what the agent does. The voice AI equivalent of a script + objection-handling card + escalation rules.
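A conversation graph can be represented as plain data: states, the transitions out of each state, and the tool a state invokes. The sketch below is a hypothetical COD-verification flow. The state names, intent labels, and tool names are invented for illustration, and real graphs carry prompts, timeouts, and retries as well.

```python
GRAPH_VERSION = "cod-verify-v1"  # graphs are versioned like any other artefact

GRAPH = {
    "greet": {"tool": None,
              "transitions": {"wants_order": "verify_address", "cancel": "cancel_order"}},
    "verify_address": {"tool": None,
                       "transitions": {"address_ok": "confirm_order",
                                       "address_wrong": "escalate"}},
    "confirm_order": {"tool": "confirm_cod_order", "transitions": {}},
    "cancel_order": {"tool": "cancel_cod_order", "transitions": {}},
    "escalate": {"tool": "handoff_to_human", "transitions": {}},
}


def run(graph: dict, start: str, intents: list[str]) -> tuple[list[str], list[str]]:
    """Walk the graph for a sequence of detected intents.

    Returns the visited states and the tools invoked along the way.
    An intent with no matching transition falls back to 'escalate'.
    """
    state = start
    path, tools = [state], []
    if graph[state]["tool"]:
        tools.append(graph[state]["tool"])
    for intent in intents:
        if not graph[state]["transitions"]:
            break  # terminal state reached
        state = graph[state]["transitions"].get(intent, "escalate")
        path.append(state)
        if graph[state]["tool"]:
            tools.append(graph[state]["tool"])
    return path, tools


print(run(GRAPH, "greet", ["wants_order", "address_ok"]))
print(run(GRAPH, "greet", ["mumble"]))
```

Keeping the graph as data rather than code is what makes it easy to diff between versions and to run two variants side by side.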

Prompt template. The model-side instructions that shape how the agent speaks, what tone it uses, what constraints it observes. Versioned, auditable, tunable.

Conversation design. The discipline of building production-grade conversation graphs and prompt templates. The voice AI analogue of UX design.

A/B test. Running two prompt or conversation variants in parallel against matched cohorts to measure outcome lift. Standard practice in mature deployments.
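One common way to split customers between two variants is deterministic hashing, so the same customer always lands in the same cohort across retries and callbacks. The sketch below shows that approach as an assumption about how assignment might be done. It produces a stable pseudo-random split, and any matching of cohorts on region or segment would need to be layered on top. The function and experiment names are hypothetical.

```python
import hashlib


def assign_variant(customer_id: str, experiment: str,
                   variants: tuple[str, ...] = ("A", "B")) -> str:
    """Deterministically map a customer to a variant for one experiment."""
    key = f"{experiment}:{customer_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return variants[int(digest, 16) % len(variants)]


counts = {"A": 0, "B": 0}
for i in range(1000):
    counts[assign_variant(f"customer_{i}", "greeting_tone_test")] += 1
print(counts)
```

Including the experiment name in the hash means a customer who gets variant A in one test is not automatically in variant A for every other test.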

Smart retry. Region-aware, voicemail-aware retry logic for unconnected calls. Different retry timing for a Patna number vs a Bangalore number.

Escalation path. The defined route from voice AI to a human agent — with full transcript, tool-call state, and customer context handed off, so the human picks up where the AI left off.

Sales and inside-sales

SDR (Sales Development Representative). The role traditionally responsible for inbound MQL callback and cold outbound. The role voice AI most directly augments or replaces.

MQL (Marketing Qualified Lead). A lead that has shown buying intent — typically by submitting a demo form, downloading content, or attending a webinar.

SQL (Sales Qualified Lead). A lead that has been qualified as a real sales opportunity — typically through structured discovery against a BANT-style rubric.

BANT. A qualification framework — Budget, Authority, Need, Timing. Voice AI agents can run a structured 12-point BANT discovery in 4–6 minutes per prospect.

Speed-to-lead. Time from MQL submission to first sales contact. Conversion drops 7x between 5-min and 30-min response. Voice AI hits sub-15-min reliably.

Demo show-up rate. Percentage of booked demos where the prospect actually shows up. Day-before reminder calls (handled by voice AI) lift this materially.

Customer experience and retention

Cart abandonment. A customer who adds items to a cart but doesn't complete the purchase. Voice AI cart-recovery calls fire within 20–60 minutes of abandonment.

RTO (Return-to-Origin). An order that could not be delivered and is shipped back to the warehouse: refused at the door, address error, fake order. Voice AI COD verification cuts RTO by 30–50% by filtering before dispatch.

NDR (Non-Delivery Report). A delivery attempt that failed — wrong address, customer unavailable, COD refusal. Voice AI NDR-recovery calls capture the failure reason and either re-attempt or close.

No-show rate. Percentage of confirmed appointments where the customer doesn't arrive — relevant for healthcare, hospitality, F&B. Voice AI reminder calls cut this by 30–50%.

Churn. Customer attrition. Voice AI churn-prevention is most effective at the inflection points — porting eligibility for telecom, renewal window for insurance, post-purchase first-30-days for D2C.

Emerging concepts

RAG (Retrieval-Augmented Generation). A pattern where the agent retrieves relevant content from a knowledge base before generating a response. Useful for enterprise deployments where the agent has to answer from policy documents, product catalogs, or FAQs.
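The retrieve-then-generate pattern can be sketched with a toy retriever. Production systems typically use embedding search rather than the keyword overlap shown here, and the knowledge-base sentences below are invented placeholders. The final step of sending the prompt to an LLM is left out because it depends on the vendor.

```python
import re

# Hypothetical policy snippets standing in for a real knowledge base.
KNOWLEDGE_BASE = [
    "Refunds for prepaid orders are processed within 5 working days.",
    "Cash on delivery orders can be cancelled before dispatch.",
    "Appointments can be rescheduled up to 2 hours before the slot.",
]


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by word overlap with the question and keep the top k."""
    q = tokenize(question)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, docs: list[str]) -> str:
    """Put the retrieved context in front of the question for the LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(question, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


print(build_prompt("How long do refunds take for prepaid orders?", KNOWLEDGE_BASE))
```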

Multi-turn coherence. The agent's ability to maintain context across many conversation turns without losing track of who it's talking to or what's been agreed.

Empathy modeling. Agent behaviour that detects customer emotion (frustration, distress, satisfaction) and adjusts tone accordingly. Increasingly table-stakes for sensitive verticals.

Multi-modal handoff. A conversation that starts on voice and seamlessly transitions to WhatsApp or SMS for visual content (room photos, document links, payment QR codes).

On-device voice AI. Running the voice agent on the customer's device or in private cloud for privacy-sensitive deployments. Emerging in healthcare and BFSI.

Synthetic voice cloning. TTS that mimics a specific speaker's voice. Increasingly used for brand differentiation, but it raises consent questions.

If a term you encountered isn't in this list, write to us — the glossary is updated quarterly and we will fold in genuine reader requests in the next revision.
