Voice AI WER Benchmarks for Indian Languages 2026: Hindi, Tamil, Telugu, Bengali, Marathi and Why "Multilingual" Vendors Fail in Practice

A CTO at a top-three Indian fintech ran a vendor bake-off six months ago that settled a procurement decision in fifteen minutes. He played the same 30-second customer recording (a Hindi-Marathi code-switched payment-confirmation call from a Pune customer) on four voice AI vendors' demo platforms. Three of them produced English transcripts of the Hindi-Marathi audio. One produced an accurate Hindi-Marathi transcript with the code-switch preserved. That fifteen-minute meeting decided the next two years of his platform's voice strategy.
That demo captures the entire WER (word error rate) reality for Indian voice AI in 2026. The vendor pitch decks all claim "multilingual support for Hindi and 22 Indian languages." The production reality is that most of them are running US/EU-trained ASR (automatic speech recognition) models with a language-detection layer bolted on, and the models collapse on the three things that make Indian conversations Indian: code-switching, regional accents, and ambient noise.
This post breaks down what the WER numbers actually look like across the five most-deployed Indian languages, why the gap exists between vendor marketing and production performance, and how a buyer should evaluate.
All WER numbers below are typical industry ranges observed across vendor bake-offs we have run or been close to in 2025–26. Specific vendor numbers vary. Treat these as benchmarks, not absolute claims.
What WER actually means for Indian voice AI
WER = (insertions + deletions + substitutions) / total words in the reference (human) transcript. A WER of 10% means roughly 1 in 10 reference words is wrong in the transcript. For voice AI used in a conversational loop, WER above 18–20% breaks the conversation: the LLM downstream cannot maintain context, the customer gets confused, and the call escalates to a human.
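The formula can be computed with a word-level edit distance. The sketch below is a minimal illustration, not any vendor's scoring tool. The function name `wer` is our own, and it assumes both transcripts have already been normalised the same way (casing, punctuation, transliteration) so that a plain whitespace split yields comparable words.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("kal" -> "call") and one deletion ("aa") over 10 reference words.
gold = "payment ho gaya but kal tak office nahi aa paaunga"
vendor = "payment ho gaya but call tak office nahi paaunga"
print(wer(gold, vendor))  # 0.2
```

Because insertions count as errors, WER can exceed 100% when a model produces far more words than the reference contains.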
The practical conversational threshold for production-grade Indian voice AI:
- WER < 8% — conversation feels native. Customer rarely notices the bot is a bot until told.
- WER 8–15% — production-acceptable for transactional flows. Customer may repeat one sentence in three.
- WER 15–22% — usable for very short flows (under 60 seconds, 2–3 conversational turns). Breaks on anything longer.
- WER > 22% — not production-deployable. Customer abandonment rate above 35%.
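When scoring many samples, it helps to map each measured WER onto these tiers automatically. The following is a small sketch under one assumption of ours: a value that falls exactly on a shared boundary (15% or 22%) is placed in the better tier. The function name and tier labels are illustrative.

```python
def wer_tier(wer_percent: float) -> str:
    """Map a WER percentage onto the four conversational tiers listed above."""
    if wer_percent < 0:
        raise ValueError("WER cannot be negative")
    if wer_percent < 8:
        return "native"                 # conversation feels native
    if wer_percent <= 15:
        return "production-acceptable"  # fine for transactional flows
    if wer_percent <= 22:
        return "short-flows-only"       # under 60 seconds, 2-3 turns
    return "not-deployable"


for value in (6.5, 12.0, 19.0, 40.0):
    print(value, wer_tier(value))
```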
Five-language WER benchmarks: vendor categories observed in 2025–26
Three vendor categories, three different performance tiers:
Category A — global ASR (US-trained, India language-pack added)
Typical examples: large US cloud vendors offering "Hindi" as one of 100+ supported languages.
| Language | Studio audio | Telephony, no noise | Telephony + accent + code-switch |
|---|---|---|---|
| Hindi (Mumbai/Delhi) | 12–18% | 22–32% | 38–52% |
| Tamil (Chennai) | 15–22% | 28–38% | 45–58% |
| Telugu (Hyderabad) | 16–23% | 30–40% | 47–60% |
| Bengali (Kolkata) | 14–20% | 26–36% | 42–55% |
| Marathi (Pune/Mumbai) | 15–22% | 28–38% | 44–57% |
The takeaway: these models are unfit for Indian telephony-grade voice AI deployment. The studio numbers look acceptable; the telephony + code-switch numbers — which are the only ones that matter in production — are above the conversational-breakage threshold for every language.
Category B — Indian-trained ASR with code-switch handling
Typical examples: Indian voice AI vendors that have trained or fine-tuned models on Indian conversational corpora.
| Language | Studio audio | Telephony, no noise | Telephony + accent + code-switch |
|---|---|---|---|
| Hindi | 5–9% | 8–13% | 10–16% |
| Tamil | 7–11% | 11–17% | 14–22% |
| Telugu | 8–13% | 12–18% | 15–23% |
| Bengali | 7–12% | 11–17% | 14–22% |
| Marathi | 8–13% | 12–18% | 15–23% |
Hindi is production-ready in this category. Tamil, Telugu, Bengali, and Marathi sit at the production-acceptable edge under worst-case conditions: usable for short transactional flows, not yet for long lead-qualification conversations.
Category C — Indian-trained ASR with telephony-and-noise specialisation
The frontier: vendors who have trained on Indian-carrier telephony audio with synthetic and real call-centre background noise.
| Language | Studio audio | Telephony, no noise | Telephony + accent + code-switch |
|---|---|---|---|
| Hindi | 4–7% | 6–10% | 7–12% |
| Tamil | 6–9% | 9–14% | 11–17% |
| Telugu | 6–10% | 10–15% | 12–18% |
| Bengali | 6–9% | 9–14% | 11–17% |
| Marathi | 6–10% | 10–15% | 12–18% |
This is what production Indian voice AI looks like in 2026 at its best: Hindi WER at or under 12% even in worst-case conditions. The other four languages are still roughly 4–6 percentage points behind Hindi in the table above, because the training-data gap remains.
Why "multilingual" vendors actually fail: the three things their pitch decks don't cover
1. Code-switching is not language detection plus translation
The pattern: "haan boss, payment ho gaya, but kal tak mai office nahi aa paaunga, can you call me back evening time, around 6 PM ke baad?"
Two languages in a single sentence, Hindi and English, with Hinglish hybrid phrases such as "evening time" and "6 PM ke baad". The customer is one person, the speech is continuous, and the language toggles mid-phrase rather than at sentence boundaries.
Global ASRs handle this in two passes — detect language per phrase, transcribe, stitch. The two-pass approach drops 25–40% of the words because phrase-boundary detection fails on rapid switches. Indian-trained ASRs handle it in a single pass with a code-switch-aware language model that does not enforce one language per phrase.
This is the single biggest performance gap, and it is invisible in vendor pitch demos because vendors test on monolingual reference audio.
2. Indian accent variation is not "Hindi" — it is 15+ regional Hindi sub-dialects
Hindi spoken in Pune is not Hindi spoken in Patna is not Hindi spoken in Mumbai is not Hindi spoken in Lucknow is not Hindi spoken in Bengaluru by a Hindi-speaking customer who has lived there 20 years.
Each sub-dialect has phonetic shifts (vowel lengths, retroflex consonants, sandhi rules) that change the acoustic signature. Vendors that train on a single Hindi reference corpus (typically Delhi/NCR speech) see 15–25 percentage point WER degradation when the customer is from outside the training distribution.
Production-grade Indian voice AI training corpora should cover Hindi from 12+ Indian cities at minimum, with phonetician-supervised dialectal balancing.
3. Telephony codec, jitter, and packet loss are real signal degradation
A studio recording at 16 kHz has the acoustic clarity of a podcast. A live Indian telephony call uses 8 kHz narrowband codecs such as G.711 (A-law or µ-law) or G.729, has jitter spikes of 50–200 ms on Jio/Airtel/Vi long-distance routes, and sees 1–3% packet loss even on premium SIP trunks (worse on rural 3G fallback).
Models trained on studio data see WER rise by 8–14 percentage points on telephony audio. Models trained on a mix of telephony and studio audio handle the codec degradation natively.
This is why the studio WER numbers in vendor decks are misleading and the live-demo numbers (when you make the vendor demo against your own telephony recording) are the only ones that count.
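To get a feel for what the narrowband path does to a clean signal, a studio clip can be roughly degraded before it is transcribed. The sketch below is only an approximation under stated assumptions: it uses the continuous µ-law companding formula rather than a bit-exact G.711 encoder, averages adjacent samples as a crude stand-in for a proper resampling filter, and does not model G.729, jitter, or packet loss. All function names are our own.

```python
import math

MU = 255  # companding constant used by mu-law


def mu_law_round_trip(sample: float, bits: int = 8) -> float:
    """Compress a sample in [-1, 1] with mu-law, quantise it, and expand it back."""
    x = max(-1.0, min(1.0, sample))
    compressed = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    levels = 2 ** (bits - 1) - 1  # 127 steps per polarity for 8 bits
    quantised = round(compressed * levels) / levels
    expanded = math.expm1(abs(quantised) * math.log1p(MU)) / MU
    return math.copysign(expanded, quantised)


def to_telephony(samples_16k: list[float]) -> list[float]:
    """Crude 16 kHz -> 8 kHz conversion followed by mu-law quantisation."""
    # Averaging each adjacent pair is a very rough low-pass filter before
    # dropping every second sample; a real pipeline would use a resampler.
    pairs = zip(samples_16k[0::2], samples_16k[1::2])
    return [mu_law_round_trip((a + b) / 2) for a, b in pairs]


# 10 ms of a 440 Hz tone sampled at 16 kHz (160 samples in, 80 samples out).
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
narrowband = to_telephony(tone)
print(len(tone), len(narrowband))  # 160 80
```

This is no substitute for real carrier recordings, which remain the only audio worth scoring a vendor against.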
The two metrics besides WER that matter
WER is necessary, not sufficient. Two additional metrics that buyers should require in evaluation:
Entity Error Rate (EER) on Indian named entities
WER averages over all words; entity error rate looks only at the words that change the conversation's meaning. An 8% WER that includes a 4% error rate on customer names, account numbers, and amounts is worse than a 12% WER with a 0.5% error rate on those same entities.
For BFSI voice AI, EER on PAN numbers, account numbers, IFSC codes, INR amounts, and date phrases ("teesree October") should be under 2%. Most vendors don't measure this; ask for it.
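There is no single standard definition of EER, so the following is one possible sketch, not an established metric implementation. It assumes the human transcriber has listed the gold entities for each recording, that entities are written in Latin letters and digits, and that an entity counts as correct only if its normalised form appears in the vendor transcript. The identifiers in the example are made up.

```python
import re


def normalise(text: str) -> str:
    """Lower-case and drop everything except ASCII letters and digits."""
    return re.sub(r"[^0-9a-z]+", "", text.lower())


def entity_error_rate(samples: list[tuple[list[str], str]]) -> float:
    """samples holds one (gold entity strings, vendor transcript) pair per recording."""
    total = 0
    wrong = 0
    for entities, transcript in samples:
        flat = normalise(transcript)
        for entity in entities:
            total += 1
            if normalise(entity) not in flat:
                wrong += 1
    if total == 0:
        raise ValueError("no gold entities supplied")
    return wrong / total


samples = [
    # PAN-style identifier and amount both survive, despite spacing differences.
    (["ABCDE1234F", "4500"], "pan number abcde 1234 f amount 4500 rupees"),
    # IFSC-style code has two digits transposed, so it counts as an error.
    (["ABCD0123456"], "ifsc abcd 0123 465"),
]
print(entity_error_rate(samples))  # 1 wrong out of 3 entities
```

Substring matching is deliberately simple and can over-credit a vendor (for example, "500" would match inside "15000"), so a stricter matcher is worth considering for amounts.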
Code-Switch Recovery Rate (CSR)
When the speech switches language mid-sentence, the bot's next-turn response should be in the language the speaker most recently used or the dominant language of the conversation — not a hardcoded English fallback. CSR = % of code-switch conversations where the bot's response language is appropriate.
Indian production threshold: CSR > 90%. Below 80%, the conversation feels foreign to the customer and bot escalation jumps.
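Since CSR is scored manually, the arithmetic is simple once each conversation has been labelled. The sketch below assumes a reviewer has recorded three labels per code-switch conversation: the language the speaker used most recently, the dominant language of the conversation, and the language of the bot's next response. The function name and the language codes are illustrative.

```python
def code_switch_recovery_rate(conversations: list[tuple[str, str, str]]) -> float:
    """Share of code-switch conversations with an appropriate response language.

    Each tuple is (speaker's most recent language, dominant conversation
    language, bot response language), using any consistent set of labels.
    """
    if not conversations:
        raise ValueError("need at least one scored conversation")
    appropriate = sum(
        1
        for last_used, dominant, response in conversations
        if response in (last_used, dominant)
    )
    return appropriate / len(conversations)


scored = [
    ("hi", "hi", "hi"),  # appropriate
    ("en", "hi", "hi"),  # appropriate: matches the dominant language
    ("mr", "mr", "mr"),  # appropriate
    ("ta", "ta", "en"),  # hardcoded English fallback, not appropriate
    ("hi", "hi", "hi"),  # appropriate
]
print(code_switch_recovery_rate(scored))  # 0.8
```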
How to evaluate a vendor's Indian-language ASR in 60 minutes
A structured bake-off that any procurement team can run:
- Collect 50 real conversational audio samples from your own call recordings, 10 per language (Hindi/Tamil/Telugu/Bengali/Marathi). Include 30% with significant background noise (street, restaurant, traffic).
- Have 5 of the 10 samples in each language include at least one code-switch between Hindi (or the regional language) and English mid-sentence.
- Get a human transcription baseline from a native speaker for each sample. This is your gold standard.
- Submit the same audio batch to every vendor under evaluation. Require the vendor to share the transcribed text within 24 hours.
- Compute WER per sample per vendor against the gold standard. Compute EER for named entities. Manually score CSR on the code-switch samples.
- Build the vendor scoring matrix. Weight the telephony + code-switch numbers higher than studio numbers because those are the production reality.
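The final step can be sketched as a weighted average of WER per audio condition. Everything specific here is an assumption for illustration: the weights (10% studio, 30% clean telephony, 60% telephony with code-switching), the vendor names, and the WER figures are placeholders to be replaced with your own measurements and priorities.

```python
# Assumed weights; they only need to favour the production-like condition.
WEIGHTS = {"studio": 0.1, "telephony": 0.3, "telephony_code_switch": 0.6}


def weighted_wer(wer_by_condition: dict[str, float]) -> float:
    """Weighted average WER (in percent) across the audio conditions."""
    return sum(weight * wer_by_condition[name] for name, weight in WEIGHTS.items())


def rank_vendors(results: dict[str, dict[str, float]]) -> list[str]:
    """Vendor names ordered from lowest (best) to highest weighted WER."""
    return sorted(results, key=lambda vendor: weighted_wer(results[vendor]))


# Hypothetical bake-off results, WER in percent.
results = {
    "vendor_a": {"studio": 14.0, "telephony": 27.0, "telephony_code_switch": 45.0},
    "vendor_b": {"studio": 7.0, "telephony": 11.0, "telephony_code_switch": 14.0},
    "vendor_c": {"studio": 5.0, "telephony": 8.0, "telephony_code_switch": 10.0},
}
for vendor in rank_vendors(results):
    print(vendor, round(weighted_wer(results[vendor]), 1))
```

EER and CSR can be added as extra columns in the same matrix; whether to fold them into one score or treat them as pass/fail gates is a judgement call for the procurement team.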
Cost of running this bake-off: roughly INR 30,000–50,000 in human transcription + 8–12 hours of an internal analyst's time. Cost of skipping it and signing the wrong vendor: 8–14 months of stalled deployment and INR 50–200 lakh in sunk cost depending on scale.
Where Indian-language ASR is heading 2026–27
Three tracks worth watching:
- Tamil and Telugu are closing the gap. Indian-trained Tamil and Telugu models in 2024 sat 8–12 percentage points behind Hindi. By late 2026, that gap is forecast to halve as the training-data investment compounds.
- Live-context adaptation. Best-in-class vendors are now training models that adapt per-call to the speaker's specific accent within the first 5–10 seconds of audio. The WER on the second half of the call is materially better than the first half, which matters for longer flows like lead qualification or KYC.
- End-to-end multimodal models. The boundary between ASR, language model, and TTS is dissolving. Single end-to-end models trained on audio-text pairs directly are starting to outperform pipeline systems. This will be the dominant architecture by 2027.
A vendor whose architecture is already moving to end-to-end models, rather than staying on the pipeline approach, will have a structural quality advantage that buyers can lock in by signing now.
What this means for your procurement
If your voice AI deployment is in India, in Indian languages, against Indian telephony, then global ASR vendors are a category to avoid for the production loop. Indian-trained ASR with code-switch handling (Category B) is the floor. Indian-trained ASR with telephony-and-noise specialisation (Category C) is the buying target.
The WER number on the pitch deck is meaningless without the audio sample it was tested on. Insist on running the bake-off against your own audio. Vendors who refuse have something to hide.
Talk to us if you are running a voice AI vendor bake-off and want a documented WER, EER and CSR baseline against your own conversational corpus — caller.digital's bake-off package includes a 50-sample multi-language audio evaluation and the comparison scoring matrix.