Multilingual Voice AI in India 2026: Hindi, Tamil, Telugu, Bengali, Marathi and the Code-Switching Reality

    May 8, 2026

    A pan-India consumer-finance brand we work with runs outbound on a 14-million-customer base. Their internal language-distribution audit, run on actual call recordings rather than registered-language fields, found that English-only conversations covered 6% of their customer base. Hindi-only covered another 22%. Hindi-English code-switching covered 31%. The remaining 41% required at least one of Tamil, Telugu, Marathi, Bengali, Kannada, Gujarati, Malayalam, Punjabi, Odia or Assamese — and within that 41%, code-switching with Hindi or English was the norm in roughly two-thirds of conversations.

    In other words: a Hindi-only voice AI deployment serves 22% of the customer base. A Hindi-English deployment serves 59% (the English-only, Hindi-only, and Hindi-English code-switching segments combined). Getting to 95%+ coverage requires 10+ Indian languages, all of them with native code-switching, all of them at production-grade WER on actual telephony audio. There is no shortcut to that coverage profile, and it is the single most important architectural decision in deploying voice AI in India.
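    The coverage arithmetic from the audit can be sketched in a few lines. The segment percentages below are the illustrative figures from this article's single-customer audit, not a general distribution:

```python
# Language-segment shares from the consumer-finance audit described above
# (illustrative figures from this article, not a general distribution).
segments = {
    "english_only": 6,
    "hindi_only": 22,
    "hindi_english_cs": 31,
    "other_regional": 41,  # Tamil, Telugu, Marathi, Bengali, etc.
}

def coverage(deployed_segments):
    """Percent of the customer base a given deployment can serve."""
    return sum(segments[s] for s in deployed_segments)

hindi_only = coverage(["hindi_only"])
hindi_english = coverage(["english_only", "hindi_only", "hindi_english_cs"])
```

    Running your own version of this against an audit of actual call recordings, rather than registered-language fields, is the first step of the buying decision.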

    This is the buyer's guide to that decision.

    Why this matters more in India than anywhere else

    Three structural facts make Indian multilingual voice AI categorically harder than multilingual voice AI in Europe, North America, or Southeast Asia.

    Fact 1: The customer's spoken language often differs from their registered-language field. Internal migration drives a meaningful gap between the language captured at onboarding and the language the customer actually prefers in 2026. A customer registered in Mumbai with "English" as the preference may now live in Bengaluru and prefer Kannada or Hindi-Kannada code-switching. A registered "Hindi" customer from Delhi may have moved to Chennai and now code-switches between Tamil and Hindi. Voice AI deployments that route by the registered-language field misroute 20–30% of calls.
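    One mitigation this implies: trust live language detection over the registered-language field when the detector is confident. A minimal routing sketch, where `detected`, `registered`, and `confidence` are assumed outputs of a hypothetical ASR language-ID stage, and the threshold is an illustrative value:

```python
def route_language(detected: str, registered: str, confidence: float,
                   threshold: float = 0.8) -> str:
    """Prefer the language detected on the live call audio over the CRM
    registered-language field, falling back when detection is uncertain.
    The 0.8 threshold is illustrative, not a recommendation."""
    return detected if confidence >= threshold else registered

# A customer registered as "en" in Mumbai, now speaking Kannada in Bengaluru:
routed = route_language(detected="kn", registered="en", confidence=0.93)
```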

    Fact 2: Code-switching is the norm, not the exception. A typical Indian conversation mixes 2–3 languages within a single sentence. "Sir, aapka loan amount sanctioned ho gaya hai, but documentation pending hai, can you please share the salary slip by tomorrow?" — this single sentence has Hindi structure, English nouns, English verbs, and Hinglish connectors, with a politeness register that's distinctively Indian. Global voice AI models trained on monolingual or bilingual code-switching corpora fail on this density of mixing.

    Fact 3: Telephony audio in India is uniquely degraded. Calls arrive over 4G/5G cellular voice with VoLTE handoffs, signal drops, multi-speaker household acoustics, two-wheeler engine noise, market-stall ambient noise, and connection-quality variance across tier-1, tier-2, and tier-3 networks. Voice AI WER on Indian telephony audio is materially higher than on the same languages tested in studio conditions. Vendors that quote benchmarks from broadcast-quality audio are quoting numbers that have no operational relevance.

    The consequence: voice AI built for India has to be tuned, evaluated, and deployed differently from voice AI ported from elsewhere.

    The 10+ language coverage map

    The languages voice AI for India needs to handle, ordered by approximate addressable speaker population in 2026.

    Tier 1 — non-negotiable for any pan-India deployment.

    • Hindi — 600M+ speakers. The default for North and Central India. Includes wide dialectal variation (Bhojpuri, Awadhi, Haryanvi-influenced, Mumbai-Hindi).
    • Hinglish — code-switched Hindi-English. Effectively a separate target for voice AI given the structural mixing density.
    • English — Indian English specifically. Pronunciation, lexical choices, and prosody differ materially from American or British English.

    Tier 2 — required for 90%+ coverage of pan-India customer bases.

    • Tamil — 80M+ speakers. Tamil Nadu, Puducherry, parts of Karnataka and Kerala, large diaspora.
    • Telugu — 95M+ speakers. Andhra Pradesh, Telangana.
    • Bengali — 100M+ speakers in India. West Bengal, Tripura, large diaspora across India.
    • Marathi — 90M+ speakers. Maharashtra.
    • Gujarati — 60M+ speakers. Gujarat, Mumbai, large business community across India.
    • Kannada — 45M+ speakers. Karnataka.

    Tier 3 — required for true pan-India coverage.

    • Malayalam — 35M+ speakers. Kerala, Lakshadweep, large GCC diaspora returning.
    • Punjabi — 35M+ speakers. Punjab, large diaspora and migrant communities.
    • Odia — 40M+ speakers. Odisha.
    • Assamese — 15M+ speakers. Assam.

    Tier 4 — vertical-specific.

    • Urdu — for specific BFSI, healthcare, and edtech customer cohorts in North India.
    • Bhojpuri — for migrant-worker-heavy verticals (construction, last-mile delivery).
    • Kashmiri, Sindhi, Konkani, Maithili, Manipuri — niche but operationally relevant in vertical-specific deployments.

    A vendor that claims "multilingual voice AI" without Tier 1 + Tier 2 in production with deployed evidence is a vendor not yet ready for pan-India deployment.

    Code-switching: what production-grade means

    Most voice AI vendors claim multilingual capability. Far fewer handle code-switching well. The distinction matters because Indian customers don't pick a language and stay in it.

    Forced restart is the failure mode. A non-code-switching agent encounters Hinglish, fails to parse, and asks the customer to "please choose one language." This breaks the conversation and signals the agent is not native to the customer's communication context.

    Mid-utterance code-switching is the bar. Production-grade Indian voice AI handles "Sir, aapka EMI due hai on the 15th, can you confirm the payment account?" — a Hindi frame, an English noun (EMI), an English temporal anchor, and an English closing request — without any restart, with full intent capture, and with the agent's response in the same code-switched register the customer used.
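    Mixing density in an utterance like this can be quantified by counting language switches across token-level language tags. The tags below are hand-assigned for illustration; a real pipeline would need a token-level language-ID model:

```python
def switch_count(tags):
    """Number of language switches in a sequence of per-token language tags."""
    return sum(1 for a, b in zip(tags, tags[1:]) if a != b)

# "Sir aapka EMI due hai on ..." hand-tagged per token (illustrative):
# Sir(en) aapka(hi) EMI(en) due(en) hai(hi) on(en)
density = switch_count(["en", "hi", "en", "en", "hi", "en"])
```

    Four switches in six tokens is the kind of density a monolingual-stack ASR architecture cannot follow.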

    Listener-led register matching. The agent should match the customer's register. If the customer opens in Hindi-English, the agent responds in Hindi-English. If the customer switches mid-conversation to pure Tamil, the agent switches with them. This is not a UX nicety — it's the differentiator between a deployment that produces conversion and one that produces customer complaints about robotic behaviour.

    Numbers, dates, currencies in multiple formats. "Pandrah tareekh" vs "fifteenth" vs "the 15th". "Pacchas hazaar" vs "fifty thousand" vs "₹50,000". The agent has to recognise all forms in input and produce the form most natural for the customer in output.
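    A toy normaliser makes the requirement concrete. This lookup-table sketch is purely illustrative; a production system needs full per-language spoken-number and date grammars, not a wordlist:

```python
import re

# Toy spoken-form lookup (illustrative only; real systems need full
# per-language spoken-number grammars).
SPOKEN_FORMS = {
    "pandrah": 15, "fifteenth": 15,
    "pacchas hazaar": 50_000, "fifty thousand": 50_000,
}

def normalize_value(utterance: str):
    """Map 'pandrah tareekh', 'the 15th', '₹50,000' etc. to one canonical value."""
    text = utterance.lower().strip()
    for spoken, value in SPOKEN_FORMS.items():
        if spoken in text:
            return value
    m = re.search(r"[\d,]+", text)  # digit forms: "15th", "₹50,000"
    if m:
        return int(m.group().replace(",", ""))
    return None
```

    The output side is the mirror problem: the agent must choose the surface form ("pandrah tareekh" vs "the 15th") that matches the customer's register.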

    Why global models fall short on Indian languages

    Three technical reasons.

    Training data scarcity for Indian-language telephony audio. The dominant ASR models were trained on web-crawled audio (YouTube, podcasts, broadcast). Indian-language presence in those corpora is materially lower than English, Spanish, or Mandarin. Telephony audio at Indian carrier quality is a further-degraded distribution that almost no global training corpus represents.

    Code-switching as a first-class problem. Most multilingual ASR is built as a stack of monolingual models with a language-detection front-end. This architecture fails on mid-utterance code-switching because the language-detection signal flips multiple times per second. India-tuned models build code-switching as a primary objective rather than a downstream patch.

    Prosody and accent variance. A Tamil-speaker speaking English carries Tamil prosody. A Bengali-speaker speaking Hindi carries Bengali prosody. A Bhojpuri-speaker in Mumbai speaks a Mumbai-Bhojpuri-Hindi blend. Global models trained on standard accent profiles undershoot on the actual accent distribution Indian voice AI encounters.

    The operational consequence: WER (Word Error Rate) on global models often runs 12–25% on Indian-accented and code-switched audio. India-tuned models on the same audio routinely run at 4–8% WER. The 3–5x error-rate gap is the gap between a deployment that converts and one that frustrates the customer.
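    WER itself is just word-level edit distance divided by reference length, which makes vendor numbers easy to sanity-check on your own transcripts. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    r, h = reference.split(), hypothesis.split()
    # Standard Levenshtein DP table over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(r)
```

    Run it per language and per code-switching density bucket, not as one blended average: a blended average lets strong English performance mask weak Tamil performance.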

    Vendor evaluation: what to actually test

    Vendor demos are designed to look good. Vendor benchmarks are designed to be quoted favourably. The honest evaluation runs differently.

    1. Provide your own audio. Pull 50 anonymised recordings from your own contact centre, across the 10+ languages and code-switching combinations your customer base actually exhibits. Let the vendor run their ASR and TTS on those recordings, with an objective evaluation against ground-truth transcripts you produce.

    2. Ask for production-deployed evidence per language. Not slides. Not a single demo. Customer references where the language has been in production for 6+ months with measurable outcomes. A vendor that has Hindi and English in production but Tamil/Telugu/Marathi only "available" is a vendor whose other languages are not yet at production grade.

    3. Test code-switching density. Provide audio with high code-switching density (3+ language switches per sentence). Most vendors fail at this density. The ones that don't are the ones to shortlist.

    4. Test telephony degradation. Run the audio through a 4G voice codec, drop 200ms of audio randomly to simulate a network glitch, add ambient noise. The vendor's WER on degraded audio is closer to your real-world WER than their studio-quality benchmark.

    5. Test prosody and accent variance within each language. A Bhojpuri-Hindi speaker, a Mumbai-Hindi speaker, a Punjabi-Hindi speaker, and a Hyderabad-Hindi speaker are all "Hindi" but exhibit very different acoustic profiles. The vendor's coverage on this within-language variance is the difference between a "Hindi-supports" claim and a deployable Hindi capability.
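    Step 4 can be approximated offline before involving any vendor. A minimal sketch with numpy, assuming 16 kHz mono float audio; a real harness would pass audio through an actual AMR/EVS codec rather than this crude band-limit:

```python
import numpy as np

def degrade(audio: np.ndarray, sr: int = 16000, drop_ms: int = 200,
            snr_db: float = 10.0, seed: int = 0) -> np.ndarray:
    """Crudely simulate Indian telephony conditions on 16 kHz mono audio:
    band-limit toward 8 kHz, drop a random drop_ms window, add noise."""
    rng = np.random.default_rng(seed)
    # Crude telephony band-limit: decimate to 8 kHz, hold each sample twice.
    out = np.repeat(audio[::2], 2)[: len(audio)].astype(float)
    # Random dropout to simulate a network glitch.
    n = int(sr * drop_ms / 1000)
    start = int(rng.integers(0, len(out) - n))
    out[start : start + n] = 0.0
    # Additive noise at the target SNR.
    sig_pow = np.mean(out ** 2)
    noise = rng.normal(size=len(out))
    noise *= np.sqrt(sig_pow / (10 ** (snr_db / 10)) / np.mean(noise ** 2))
    return out + noise
```

    Feeding degraded audio through the vendor's ASR and comparing WER against the clean-audio run shows how much of their quoted benchmark survives real network conditions.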

    TTS: the under-evaluated half of the stack

    Buyer attention on multilingual voice AI usually concentrates on ASR (the customer-to-agent direction). TTS (the agent-to-customer direction) is the other half of the experience, and often the half where deployment quality is decided.

    Voice naturalness in regional languages. Synthetic Tamil that sounds robotic, synthetic Telugu with unnatural prosody, synthetic Bengali with English-accented vowels — all are conversion-killers. Native-speaker evaluation of TTS quality, in each language, is non-negotiable.

    Voice consistency across languages. If the agent switches from Hindi to Tamil mid-conversation, the voice should remain recognisably the same agent. Inconsistent voice characteristics across language switches are jarring and signal that the system is stitched together rather than natively multilingual.

    Latency parity across languages. TTS in regional languages should generate at the same latency as English. A 300ms penalty for Tamil generation versus English generation is enough to make the conversation feel asymmetric.
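    Latency parity is straightforward to measure with a wall-clock harness. `synthesize` below is a stand-in for whatever vendor TTS call is under evaluation, not a real API:

```python
import time

def tts_latency_ms(synthesize, samples):
    """Wall-clock TTS latency per language, in milliseconds.

    synthesize(lang, text) is a placeholder for the vendor TTS call;
    samples maps language codes to comparable test sentences."""
    out = {}
    for lang, text in samples.items():
        t0 = time.perf_counter()
        synthesize(lang, text)
        out[lang] = (time.perf_counter() - t0) * 1000.0
    return out

# Dummy synthesizer for illustration; real tests need native-speaker-written
# sentences of comparable length in each language.
latencies = tts_latency_ms(lambda lang, text: None,
                           {"en": "Your EMI is due on the 15th.",
                            "ta": "Your EMI is due on the 15th."})
```

    Flagging any language whose median latency runs meaningfully above the English baseline catches the asymmetry described above before customers do.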

    Code-switched TTS. The agent saying "Sir, your loan amount approved ho gaya hai" must sound natural — not English-voice + Hindi-voice stitched. The TTS must be code-switching-native.

    The 90-day multilingual rollout

    The deployment shape that has worked across pan-India multilingual rollouts.

    Days 1–14: Hindi + English + Hinglish on one workflow. Pick the highest-volume single workflow. Deploy on Hindi-English-Hinglish coverage with code-switching. Production-grade evaluation against actual customer audio.

    Days 15–30: Tier-2 language expansion. Add Tamil, Telugu, Bengali, Marathi, Gujarati, Kannada based on the customer-base distribution. Per-language production evaluation.

    Days 31–60: Tier-3 expansion and multi-workflow. Layer in Malayalam, Punjabi, Odia, Assamese. Expand to additional workflows across the deployment.

    Days 61–90: Vertical-specific languages and continuous improvement. Deploy any vertical-specific languages (Urdu, Bhojpuri, regional dialects). Production telemetry feedback into ongoing acoustic and conversational model improvements.

    By day 90, the deployment is genuinely pan-India in language coverage rather than nominally multilingual.

    Where this is heading

    Three directions over the next 18–24 months.

    Dialect-aware deployment. Not just Hindi but Mumbai-Hindi, Patna-Hindi, Lucknow-Hindi as separable acoustic targets with adapted prosody. Not just Tamil but Chennai-Tamil and Madurai-Tamil. The next frontier of deployment quality in India.

    Voice AI for low-resource Indian languages. Bodo, Manipuri, Konkani, Sindhi, Maithili — languages that haven't yet hit production-grade coverage. The deployment opportunity for vertical-specific use cases (government services, regional financial inclusion, healthcare in tribal districts) is meaningful.

    Continuous personalisation at the customer level. Voice AI that learns each individual customer's preferred language register and code-switching pattern over multiple conversations, and matches it from the second call onwards. The CRM-meets-language-model frontier.

    For Indian voice AI in 2026, multilingual capability is no longer a feature checkbox. It is the architectural decision that decides whether the deployment serves 22% of the customer base or 95%. Talk to us if your business is ready to deploy voice AI that is genuinely multilingual at India scale.


    Kanan Richhariya

    Caller Digital

    © 2025 Caller Digital | All Rights Reserved