Can AI Understand a Nigerian Accent? The Honest Answer

When a Nigerian business owner asks whether AI can understand their customers, they are usually asking because they have already tried something — a generic IVR, a US-built chatbot, a voice assistant — and watched it fail. Callers got frustrated. Calls got dropped. The business went back to manual answering.

This is not a small problem. It is the central problem of deploying voice AI in Nigeria. And it deserves a straight answer.

Why generic AI speech recognition struggles with Nigerian English

The dominant speech recognition models — the ones powering most consumer and business AI voice tools globally — were trained predominantly on American English audio. Some include British English. A growing minority include Indian English. Very few include Nigerian English in any meaningful proportion.

This creates a measurable accuracy gap. The word error rate (WER) counts the words a model substitutes, drops, or inserts, as a percentage of the words actually spoken. For American English on major commercial STT systems, WER sits below 5%. For Nigerian English, independent evaluations have consistently found WERs in the 18–30% range on standard models, depending on the speaker and network quality.

At 20% WER, roughly one word in five is wrong. In a short, predictable sentence like "I want to check my account balance", the intent usually survives that level of error. When a pharmacist needs a customer to confirm which medication they need, it does not.
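To make the metric concrete, here is a minimal sketch of how WER is usually computed: a word-level edit distance divided by the number of words in the reference. The function name and the example hypothesis are made up for illustration; they are not output from any particular STT system.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "I want to check my account balance"
hypothesis = "I want to check my count balance"  # invented mis-transcription
print(round(word_error_rate(reference, hypothesis), 3))  # 0.143
```

One wrong word out of seven already gives a WER of about 14%. Because insertions count as errors too, WER can exceed 100% on badly garbled output.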

The specific features of Nigerian English that trip generic models

Nigerian English is not "accented American English." It is a distinct variety with its own phonological patterns, intonation contours, and lexical choices that generic models were not trained to expect.

Several specific features cause frequent errors:

Vowel differences. The Nigerian English pronunciation of words like "water," "order," and "quarter" differs substantially from American English. A generic model hears something unexpected and makes an error or substitution.

Rhythm and stress. Nigerian English has a more syllable-timed rhythm than American English, which is stress-timed. Generic models trained on stress-timed speech misread the timing cues and lose track of word boundaries.

Naija expressions. Phrases like "I want to manage," "he is on top of it," "it is not my fault o," or "abeg" have a precise meaning in a Nigerian English context. A generic model either mishears them or omits them completely.

Phone quality. Much Nigerian mobile call audio comes in over MTN or Airtel connections at lower bitrates, with occasional compression artefacts or brief drops. Generic models trained on studio-quality American audio perform noticeably worse on compressed African mobile audio.

What about Hausa, Igbo, and Yoruba?

For the three major Nigerian languages, the situation with generic models is starker. Most commercial STT systems do not support these languages at all. Those that do (a small number of newer models have added basic Yoruba and Hausa support) were trained on limited data sets, often sourced from audiobooks or Wikipedia readings rather than telephone conversations.

Here are concrete examples of what a standard STT system transcribes versus what was actually said:

Hausa — What the caller said: "Ina son sanin farashin magani" (I want to know the price of the medicine)
Generic STT output: "Ina son Sandy for farashin maggon" — unusable

Yoruba — What the caller said: "Ẹ jọ̀ọ́, mo fẹ́ mọ àkókò tí ẹ ṣí" (Please, I want to know your opening time)
Generic STT output: "A joe, mo fe mo akoko ti e si" — diacritics stripped, tonal information lost, meaning partially degraded

Igbo — What the caller said: "Achọrọ m ịnọ ebe ị nọ" (I want to know where you are located)
Generic STT output: "Achoro mi ino ebe i no" — diacritics missing, tonal distinctions erased, context-dependent interpretation required

When diacritics are stripped from Yoruba or Igbo, meaning is lost. Tone marks distinguish words that share the same sequence of consonants and vowels, and underdots distinguish different sounds altogether (ẹ from e, ṣ from s). A model that silently drops this information is not transcribing the language. It is transcribing a degraded approximation of it.
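The loss is easy to reproduce. The sketch below uses Python's standard unicodedata module to show the kind of lossy folding many text pipelines apply, next to a lossless alternative. The function names are illustrative, and the three Yoruba words are a commonly cited example of words that differ only in tone.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Lossy folding: decompose each character, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preserve_diacritics(text: str) -> str:
    """Lossless alternative: normalise to NFC so every mark stays attached."""
    return unicodedata.normalize("NFC", text)


# Three Yoruba words that differ only in their tone marks.
words = {"ọkọ": "husband", "ọkọ́": "hoe", "ọkọ̀": "vehicle"}

folded = {strip_diacritics(w) for w in words}
kept = {preserve_diacritics(w) for w in words}

print(folded)     # {'oko'}
print(len(kept))  # 3
```

After folding, three different words collapse into one string, and nothing downstream can tell them apart. Hausa's hooked letters (ƙ, ɗ) are separate letters rather than combining marks, so this particular folding leaves them alone, but a cruder ASCII-only filter would drop them too.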

Code-switching compounds the problem

In real Nigerian calls, speakers rarely stay in one language for the entire conversation. A caller to a Lagos pharmacy might say: "Good morning, I need to know — ẹ ọ tún wà — is the cough syrup back in stock?" The phrase "ẹ ọ tún wà" means "do you still have it" in Yoruba. The caller inserted it mid-sentence, naturally, without thinking about it.

A generic model would drop the Yoruba phrase entirely, attempt to transcribe it as garbled English, or fail on the whole utterance. Maraba handles the full sentence because language detection runs in parallel with transcription, identifying the switch and applying the correct phonological model for each segment.

What Maraba does differently

Maraba's STT models were trained on Nigerian voice data — specifically, telephone-quality audio from real Nigerian callers in English, Hausa, Igbo, and Yoruba. The training data includes the network conditions, accent variation, and code-switching patterns that characterise actual Nigerian phone calls, not studio recordings.

Three specific technical decisions matter here:

Diacritics and special letters are preserved throughout the pipeline. The STT output retains ẹ, ọ, ṣ for Yoruba; ị, ụ for Igbo; ƙ, ɗ for Hausa. These characters are never folded to their plain ASCII look-alikes or stripped, so the language model downstream receives the full tonal and phonemic information.

Language detection runs at the utterance level, not the call level. The system identifies the language of each individual speech segment rather than detecting the language once at the start of the call and locking in. This is what enables mid-sentence code-switch handling.

The pipeline is optimised for compressed mobile audio. The VAD (voice activity detection) and audio pre-processing layers are tuned for the bitrate and compression characteristics of MTN, Airtel, Glo, and 9mobile calls — not broadband VoIP.
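The difference between call-level and utterance-level detection can be shown with a toy sketch. This is not Maraba's implementation: the detector below is a deliberately crude stand-in (it labels a segment Yoruba if it contains an underdot), and text strings stand in for audio segments. Every name here is hypothetical.

```python
import unicodedata
from typing import Callable, List, Tuple

Detector = Callable[[str], str]

COMBINING_DOT_BELOW = "\u0323"


def toy_detector(segment: str) -> str:
    """Crude stand-in for a real language-ID model: underdot means Yoruba."""
    decomposed = unicodedata.normalize("NFD", segment)
    return "yo" if COMBINING_DOT_BELOW in decomposed else "en"


def call_level(segments: List[str], detect: Detector) -> List[Tuple[str, str]]:
    """Detect once on the first segment and lock that label in for the call."""
    locked = detect(segments[0])
    return [(locked, seg) for seg in segments]


def utterance_level(segments: List[str], detect: Detector) -> List[Tuple[str, str]]:
    """Detect on every segment, so a mid-sentence switch gets its own label."""
    return [(detect(seg), seg) for seg in segments]


segments = [
    "Good morning, I need to know",
    "ẹ ọ tún wà",
    "is the cough syrup back in stock?",
]

print([lang for lang, _ in call_level(segments, toy_detector)])
# ['en', 'en', 'en']
print([lang for lang, _ in utterance_level(segments, toy_detector)])
# ['en', 'yo', 'en']
```

With call-level detection the Yoruba phrase is routed to the English model and comes out garbled or missing. With per-segment labels, each piece can be sent to the model that fits it.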

What accuracy looks like in practice

Across 2.1 million calls handled on the Maraba platform, the overall call resolution rate — calls where the caller's query was understood and answered without requiring human escalation — sits at 82% for English and Yoruba-English code-switching, 79% for Hausa-English, and 77% for Igbo-English. These are live telephone calls, not lab conditions.

For comparison: a clinic with a human receptionist who is also managing walk-in patients, filing records, and handling the doctor's requests typically resolves incoming phone queries correctly on the first attempt roughly 70% of the time — and misses a proportion of calls entirely during busy periods.

The honest caveat

No STT system is perfect. Maraba will occasionally mishear a word — particularly with very heavy network distortion, unusual proper nouns, or highly dialectal speech. When Maraba is uncertain about what was said, it asks the caller to repeat rather than acting on a potentially wrong transcript. That is by design.
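The "ask rather than guess" rule can be sketched as a simple confidence gate. The threshold value, the field names, and the action labels below are assumptions for illustration, not Maraba's actual settings.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical STT engine


REPEAT_THRESHOLD = 0.75  # assumed value, for illustration only


def next_action(transcript: Transcript, threshold: float = REPEAT_THRESHOLD) -> str:
    """Ask the caller to repeat when the transcript is empty or low-confidence."""
    if not transcript.text.strip() or transcript.confidence < threshold:
        return "ask_repeat"
    return "answer"


print(next_action(Transcript("I want to check my account balance", 0.93)))  # answer
print(next_action(Transcript("I want to check my count balance", 0.52)))    # ask_repeat
```

The trade-off sits in the threshold: set it too high and callers are asked to repeat themselves too often, set it too low and the system acts on wrong transcripts.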

The relevant comparison is not "Maraba vs. perfect transcription." It is "Maraba vs. an unanswered call." On that metric, the case is straightforward.

Hear how Maraba handles Nigerian accents on a real call

Start with 50 free calls and test Maraba with your own callers. No card required. Hausa, Igbo, and Yoruba available from the Starter plan.

Request beta →