Is a Hindi WER under 10% achievable on Indian telephony in production?

Yes, on Indian-trained models with telephony specialisation. Production Hindi WER of 6–10% on Indian carrier telephony, rising to 7–12% with accents and code-switching, is the 2026 benchmark for the leading vendors. Global vendor models running US-trained Hindi land at 22–32% on the same telephony audio and 38–52% once accents and code-switching are added. The gap is durable and reflects different model architectures and training corpora.

Why are Tamil and Telugu WER numbers higher than Hindi for the same vendor?

The training-data investment for Tamil and Telugu has lagged Hindi by 18–24 months across all Indian voice AI vendors. By late 2026, the gap is forecast to halve as platform investment compounds. Buyers deploying in Tamil Nadu or Andhra Pradesh/Telangana should set their accuracy expectation 6–10 percentage points below their Hindi expectation and revisit annually.

How should we run a vendor bake-off on Indian language quality?

Collect 50 real conversational audio samples from your own call recordings, 10 per language, with 30% in background noise and 50% containing code-switches. Get a native-speaker gold-standard transcription. Submit the batch to every vendor under evaluation. Compute WER, entity error rate (EER), and code-switch recovery rate (CSR). Weight telephony + code-switch numbers higher than studio numbers.

What is Entity Error Rate and why does it matter more than WER?

EER measures the error rate specifically on entities that change the meaning of the conversation: customer names, account numbers, IFSC codes, amounts, dates. An 8% WER that misreads 4% of account numbers is operationally worse than a 12% WER that misreads 0.5% of them. For BFSI voice AI, EER under 2% on financial entities is the production threshold. Most vendors do not measure or report it; you have to ask.

Does Indian-language ASR work for older customers and tier-2/3 city speakers?

It works, but the accuracy degrades 4-12 percentage points compared to younger tier-1 speakers — primarily because of dialectal differences (Hindi spoken in Patna vs Mumbai vs Bengaluru by a long-resident Kannada speaker is acoustically distinct), and because older speakers have slower speech and longer pauses that some models misinterpret as end-of-turn. Best-in-class models adapt within the first 5-10 seconds of audio; lesser models stay at the initial accuracy.

Will end-to-end speech-to-speech models replace ASR + LLM + TTS pipelines?

Yes. By 2027 the dominant architecture will be end-to-end speech-to-speech models that bypass the intermediate text representation. Early evidence in 2026 shows a 2–4 percentage point WER improvement on conversational audio. The vendor selection question is whether the vendor's model architecture roadmap is on this second curve (end-to-end) or stuck on the first (the ASR + LLM + TTS pipeline). Ask for the roadmap; first-curve vendors will struggle to keep up by 2027.

Voice AI WER Benchmarks Indian Languages 2026 — Hindi Tamil Telugu Bengali Marathi

What is code-switching and why do global voice AI vendors fail at it?

Code-switching is when the speaker changes language mid-sentence — 'haan boss, payment ho gaya, can you call back later?' has Hindi-English-Hindi-English transitions. Global vendors handle this in two passes (detect language per phrase, transcribe, stitch), which loses 25-40% of words at phrase boundaries. Indian-trained models handle it in a single pass with a code-switch-aware language model and recover 88-94% of the words correctly.

A CTO at a top-three Indian fintech ran a vendor bake-off six months ago that settled a procurement decision in fifteen minutes. He played the same 30-second customer recording, a Hindi-Marathi code-switched payment-confirmation call from a Pune customer, to four voice AI vendors' demo platforms. Three of them produced English transcripts of the Hindi-Marathi audio. One produced an accurate Hindi-Marathi transcript with the code-switch preserved. That fifteen-minute meeting decided the next two years of his platform's voice strategy.

That demo captures the entire WER (word error rate) reality for Indian voice AI in 2026. The vendor pitch decks all claim "multilingual support for Hindi and 22 Indian languages." The production reality is that most of them are running US/EU-trained ASR (automatic speech recognition) models with a language-detection layer bolted on, and the models collapse on the three things that make Indian conversations Indian: code-switching, regional accents, and ambient noise.

This post breaks down what the WER numbers actually look like across the five most-deployed Indian languages, why the gap exists between vendor marketing and production performance, and how a buyer should evaluate.

All WER numbers below are typical industry ranges observed across vendor bake-offs we have run or been close to in 2025–26. Specific vendor numbers vary. Treat these as benchmarks, not absolute claims.

What WER actually means for Indian voice AI

WER = (insertions + deletions + substitutions) / total words in the reference transcript. A WER of 10% means roughly 1 in 10 words is wrong in the transcript. For voice AI used in a conversational loop, WER above 18–20% breaks the conversation: the LLM downstream cannot maintain context, the customer gets confused, the call escalates to a human.
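The formula can be checked by hand on any transcript pair. The sketch below is a minimal word-level edit-distance implementation, assuming whitespace tokenisation and exact string matching (a real evaluation would normalise casing, punctuation, and transliteration first). The function name `word_error_rate` and the sample sentences are illustrative, not taken from any vendor tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative pair: the hypothesis drops one of ten reference words.
reference = "haan boss payment ho gaya can you call back later"
hypothesis = "haan boss payment ho gaya can you call later"
print(f"{word_error_rate(reference, hypothesis):.1%}")  # prints 10.0%
```

Because insertions are counted against the reference length, a transcript with many spurious words can score above 100%.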

The practical conversational threshold for production-grade Indian voice AI:

WER < 8% — conversation feels native. Customer rarely notices the bot is a bot until told.
WER 8–15% — production-acceptable for transactional flows. Customer may repeat one sentence in three.
WER 15–22% — usable for very short flows (under 60 seconds, 2–3 conversational turns). Breaks on anything longer.
WER > 22% — not production-deployable. Customer abandonment rate above 35%.
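For a scoring sheet, the four bands above can be encoded as a small lookup. This is only a sketch: the function name `wer_band` is hypothetical, and the handling of the exact boundary values (8, 15, 22) is an assumption, since the bands as written overlap at their edges.

```python
def wer_band(wer_percent: float) -> str:
    """Map a WER percentage to the deployment bands listed above."""
    if wer_percent < 8:
        return "feels native"
    if wer_percent <= 15:  # assumption: 15 belongs to the lower band
        return "production-acceptable for transactional flows"
    if wer_percent <= 22:  # assumption: 22 belongs to the lower band
        return "very short flows only"
    return "not production-deployable"


for wer in (6, 12, 18, 38):
    print(f"WER {wer}% -> {wer_band(wer)}")
```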

Five-language WER benchmarks: vendor categories observed in 2025–26

Three vendor categories, three different performance tiers:

Category A — global ASR (US-trained, India language-pack added)

Typical examples: large US cloud vendors offering "Hindi" as one of 100+ supported languages.

Language	Studio audio	Telephony, no noise	Telephony + accent + code-switch
Hindi (Mumbai/Delhi)	12–18%	22–32%	38–52%
Tamil (Chennai)	15–22%	28–38%	45–58%
Telugu (Hyderabad)	16–23%	30–40%	47–60%
Bengali (Kolkata)	14–20%	26–36%	42–55%
Marathi (Pune/Mumbai)	15–22%	28–38%	44–57%

The takeaway: these models are unfit for Indian telephony-grade voice AI deployment. The studio numbers look acceptable; the telephony + code-switch numbers — which are the only ones that matter in production — are above the conversational-breakage threshold for every language.

Category B — Indian-trained ASR with code-switch handling

Typical examples: Indian voice AI vendors that have trained or fine-tuned models on Indian conversational corpora.

Language	Studio audio	Telephony, no noise	Telephony + accent + code-switch
Hindi	5–9%	8–13%	10–16%
Tamil	7–11%	11–17%	14–22%
Telugu	8–13%	12–18%	15–23%
Bengali	7–12%	11–17%	14–22%
Marathi	8–13%	12–18%	15–23%

Hindi is production-ready in this category. Tamil, Telugu, Bengali, and Marathi post near-identical ranges in the table above and sit at the production-acceptable edge: usable for transactional flows, not yet for long lead-qualification conversations.

Category C — Indian-trained ASR with telephony-and-noise specialisation

The frontier: vendors who have trained on Indian-carrier telephony audio with synthetic and real call-centre background noise.

Language	Studio audio	Telephony, no noise	Telephony + accent + code-switch
Hindi	4–7%	6–10%	7–12%
Tamil	6–9%	9–14%	11–17%
Telugu	6–10%	10–15%	12–18%
Bengali	6–9%	9–14%	11–17%
Marathi	6–10%	10–15%	12–18%

This is what production Indian voice AI looks like in 2026 at its best: Hindi WER at or below 12% even in worst-case conditions. The other four languages are still 4–8 percentage points behind Hindi because the training-data gap remains.

Why "multilingual" vendors actually fail: the three things their pitch decks don't cover

1. Code-switching is not language detection plus translation

The pattern: "haan boss, payment ho gaya, but kal tak mai office nahi aa paaunga, can you call me back evening time, around 6 PM ke baad?"

Two languages and a hybrid register in a single sentence (Hindi, English, and Hindi-English hybrid phrases such as "6 PM ke baad"). The customer is one person, the speech is continuous, and the language toggles at word and even sub-word boundaries.

Global ASRs handle this in two passes — detect language per phrase, transcribe, stitch. The two-pass approach drops 25–40% of the words because phrase-boundary detection fails on rapid switches. Indian-trained ASRs handle it in a single pass with a code-switch-aware language model that does not enforce one language per phrase.

This is the single biggest performance gap, and it is invisible in vendor pitch demos because vendors test on monolingual reference audio.

2. Indian accent variation is not "Hindi" — it is 15+ regional Hindi sub-dialects

Hindi spoken in Pune is not Hindi spoken in Patna is not Hindi spoken in Mumbai is not Hindi spoken in Lucknow is not Hindi spoken in Bengaluru by a Hindi-speaking customer who has lived there 20 years.

Each sub-dialect has phonetic shifts (vowel lengths, retroflex consonants, sandhi rules) that change the acoustic signature. Vendors that train on a single Hindi reference corpus (typically Delhi/NCR speech) see 15–25 percentage point WER degradation when the customer is from outside the training distribution.

Production-grade Indian voice AI training corpora should cover Hindi from 12+ Indian cities at minimum, with phonetician-supervised dialectal balancing.

3. Telephony codec, jitter, and packet loss are real signal degradation

A studio recording at 16 kHz has the acoustic clarity of a podcast. A live Indian telephony call uses 8 kHz narrowband codecs such as G.711 (µ-law or A-law) or G.729, has jitter spikes of 50–200 ms on Jio/Airtel/Vi long-distance routes, and sees 1–3% packet loss on premium SIP trunks (worse on rural 3G fallback).

Models trained only on studio data see WER rise by 8–14 percentage points on telephony audio. Models trained on a mix of telephony and studio audio handle the codec degradation natively.

This is why the studio WER numbers in vendor decks are misleading and the live-demo numbers (when you make the vendor demo against your own telephony recording) are the only ones that count.
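To get a feel for what the codec does to a clean recording before it reaches the ASR, the degradation can be roughly simulated. The sketch below assumes 16 kHz float samples in [-1, 1], decimates them to 8 kHz by dropping every second sample, and applies the continuous µ-law companding curve with 8-bit quantisation. It is a crude approximation: real G.711 uses a segmented encoding, a proper resampler would low-pass filter first, and jitter and packet loss are not modelled at all. The function names are illustrative.

```python
import math

MU = 255  # µ-law companding parameter


def mu_law_compress(x: float) -> float:
    """Logarithmic companding of a sample in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y: float) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)


def simulate_telephony(samples_16k: list[float]) -> list[float]:
    # Crude 16 kHz -> 8 kHz decimation: keep every second sample.
    # No anti-aliasing filter is applied here (a simplification).
    narrowband = samples_16k[::2]
    degraded = []
    for x in narrowband:
        x = max(-1.0, min(1.0, x))  # clip to the valid range
        y = mu_law_compress(x)
        # Quantise to 256 levels (8 bits) in the companded domain.
        q = round((y + 1) / 2 * 255) / 255 * 2 - 1
        degraded.append(mu_law_expand(q))
    return degraded


# A 440 Hz test tone, 0.1 s at 16 kHz, standing in for studio audio.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
out = simulate_telephony(tone)
print(len(tone), "->", len(out), "samples")
```

Running gold-standard studio samples through a transform like this, and re-scoring WER, gives a rough lower bound on the telephony penalty; the real test is still the vendor's output on your own call recordings.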

The two metrics besides WER that matter

WER is necessary, not sufficient. Two additional metrics that buyers should require in evaluation:

Entity Error Rate (EER) on Indian named entities

WER averages over all words; EER counts only the words that change the conversation's meaning. An 8% WER that includes a 4% error rate on customer names, account numbers, and amounts is worse than a 12% WER with a 0.5% error rate on those same entities.

For BFSI voice AI, EER on PAN numbers, account numbers, IFSC codes, INR amounts, and date phrases ("teesree October") should be under 2%. Most vendors don't measure this; ask for it.
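One way to compute EER in a bake-off is sketched below. It assumes the entity values have already been pulled out of both the gold transcript and the vendor transcript into named slots (by hand or by an extractor), and it counts a slot as wrong when the vendor value is missing or differs after light normalisation. The slot names, the `entity_error_rate` function, and the sample values are all made up for illustration; this is not a standardised definition.

```python
def normalise(value: str) -> str:
    # Keep only letters and digits, lower-cased, so that
    # "HDFC0001234" and "hdfc 0001234" compare equal.
    return "".join(ch for ch in value.lower() if ch.isalnum())


def entity_error_rate(pairs: list[tuple[dict, dict]]) -> float:
    """pairs: (gold_slots, vendor_slots) per audio sample."""
    total = wrong = 0
    for gold, vendor in pairs:
        for slot, gold_value in gold.items():
            total += 1
            if normalise(vendor.get(slot, "")) != normalise(gold_value):
                wrong += 1
    if total == 0:
        raise ValueError("no gold entities to score")
    return wrong / total


# Made-up sample values, for illustration only.
pairs = [
    ({"ifsc": "HDFC0001234", "amount": "4500"},
     {"ifsc": "hdfc 0001234", "amount": "4000"}),   # amount misheard
    ({"account": "001234567890", "date": "3 October"},
     {"account": "001234567890"}),                  # date missing
]
print(f"EER: {entity_error_rate(pairs):.1%}")  # prints EER: 50.0%
```

Comparing extracted slot values rather than searching for substrings avoids false matches such as "4500" being found inside "14500".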

Code-Switch Recovery Rate (CSR)

When the speech switches language mid-sentence, the bot's next-turn response should be in the language the speaker most recently used or the dominant language of the conversation — not a hardcoded English fallback. CSR = % of code-switch conversations where the bot's response language is appropriate.

Indian production threshold: CSR > 90%. Below 80%, the conversation feels foreign to the customer and bot escalation jumps.
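The CSR definition above translates directly into a scoring rule once a reviewer has labelled the language of each customer turn and of the bot's reply. The sketch below assumes those labels are supplied as short language codes by a human scorer; the function names and the sample conversations are hypothetical.

```python
from collections import Counter


def response_is_appropriate(customer_turn_langs: list[str], bot_lang: str) -> bool:
    """Appropriate = most recently used language or the dominant one."""
    most_recent = customer_turn_langs[-1]
    # On a tie, most_common returns the language encountered first.
    dominant = Counter(customer_turn_langs).most_common(1)[0][0]
    return bot_lang in (most_recent, dominant)


def code_switch_recovery_rate(conversations: list[tuple[list[str], str]]) -> float:
    if not conversations:
        raise ValueError("no code-switch conversations to score")
    ok = sum(response_is_appropriate(langs, bot) for langs, bot in conversations)
    return ok / len(conversations)


# Hypothetical labels: (languages of the customer's turns, bot reply language)
conversations = [
    (["hi", "en", "hi"], "hi"),  # matches most recent and dominant
    (["hi", "hi", "en"], "en"),  # matches most recent
    (["mr", "mr", "hi"], "en"),  # hardcoded English fallback: inappropriate
    (["ta", "en", "ta"], "ta"),  # matches most recent and dominant
]
print(f"CSR: {code_switch_recovery_rate(conversations):.0%}")  # prints CSR: 75%
```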

How to evaluate a vendor's Indian-language ASR in 60 minutes

A structured bake-off that any procurement team can run:

Collect 50 real conversational audio samples from your own call recordings, 10 per language (Hindi/Tamil/Telugu/Bengali/Marathi). Include 30% with significant background noise (street, restaurant, traffic).
Have 5 of the 10 samples in each language include at least one code-switch between Hindi (or the regional language) and English mid-sentence.
Get a human transcription baseline from a native speaker for each sample. This is your gold standard.
Submit the same audio batch to every vendor under evaluation. Require the vendor to share the transcribed text within 24 hours.
Compute WER per sample per vendor against the gold standard. Compute EER for named entities. Manually score CSR on the code-switch samples.
Build the vendor scoring matrix. Weight the telephony + code-switch numbers higher than studio numbers because those are the production reality.
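The final step, the weighted scoring matrix, can be as simple as the sketch below. The weights are an assumption chosen only to show telephony + code-switch results dominating the score; pick your own. The per-vendor numbers are the upper ends of the Hindi ranges from the three category tables earlier in this post, used as placeholders rather than as measurements of any specific vendor.

```python
# Assumed weights: production-like conditions count the most.
WEIGHTS = {"studio": 0.1, "telephony": 0.3, "telephony_code_switch": 0.6}

# Placeholder WER percentages (upper end of the Hindi ranges above).
vendors = {
    "category_a": {"studio": 18, "telephony": 32, "telephony_code_switch": 52},
    "category_b": {"studio": 9, "telephony": 13, "telephony_code_switch": 16},
    "category_c": {"studio": 7, "telephony": 10, "telephony_code_switch": 12},
}


def weighted_wer(wer_by_condition: dict[str, float]) -> float:
    """Weighted average WER across recording conditions (lower is better)."""
    return sum(WEIGHTS[cond] * wer_by_condition[cond] for cond in WEIGHTS)


ranking = sorted(vendors, key=lambda name: weighted_wer(vendors[name]))
for name in ranking:
    print(f"{name}: weighted WER {weighted_wer(vendors[name]):.1f}%")
```

EER and CSR can be added as further columns with their own weights; keeping them separate from WER in the matrix makes it easier to see which vendor fails on entities even when its headline WER looks fine.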

Cost of running this bake-off: roughly INR 30,000–50,000 in human transcription + 8–12 hours of an internal analyst's time. Cost of skipping it and signing the wrong vendor: 8–14 months of stalled deployment and INR 50–200 lakh in sunk cost depending on scale.

Where Indian-language ASR is heading 2026–27

Three tracks worth watching:

Tamil and Telugu are closing the gap. Indian-trained Tamil and Telugu models in 2024 sat 8–12 percentage points behind Hindi. By late 2026, that gap is forecast to halve as the training-data investment compounds.
Live-context adaptation. Best-in-class vendors are now training models that adapt per-call to the speaker's specific accent within the first 5–10 seconds of audio. The WER on the second half of the call is materially better than the first half — meaningful for longer flows like lead qualification or KYC.
End-to-end multimodal models. The boundary between ASR, language model, and TTS is dissolving. Single end-to-end models trained on audio-text pairs directly are starting to outperform pipeline systems. This will be the dominant architecture by 2027.

The vendor whose model architecture is on the second curve (end-to-end) rather than the first (the ASR + LLM + TTS pipeline) will have a structural quality advantage that buyers can lock in by signing now.

What this means for your procurement

If your voice AI deployment is in India, in Indian languages, against Indian telephony, then global ASR vendors (Category A) are a category to avoid for the production loop. Indian-trained ASR with code-switch handling (Category B) is the floor. Indian-trained ASR with telephony-and-noise specialisation (Category C) is the buying target.

The WER number on the pitch deck is meaningless without the audio sample it was tested on. Insist on running the bake-off against your own audio. Vendors who refuse have something to hide.

Talk to us if you are running a voice AI vendor bake-off and want a documented WER, EER and CSR baseline against your own conversational corpus — caller.digital's bake-off package includes a 50-sample multi-language audio evaluation and the comparison scoring matrix.

Voice AI WER Benchmarks for Indian Languages 2026: Hindi, Tamil, Telugu, Bengali, Marathi and Why "Multilingual" Vendors Fail in Practice

Caller Digital
