If you have ever sat in a Lagos boardroom, a Kano market, or an Enugu pharmacy, you already know the truth about Nigerian voice: no real conversation stays in one language for very long. A caller will open in Yoruba, switch to English to discuss specifics, and close in Pidgin. A trader will quote a price in Hausa numerals and pivot to English to confirm the address. An Igbo pharmacist will greet a patient in Igbo, take the prescription details in English, and explain dosing in Igbo again.
This is not casual code-mixing. It is the dominant register of Nigerian business communication. It is also the single hardest problem in Nigerian speech recognition — and it is the problem that every commercial speech-to-text API on the market today gets badly wrong, in three different ways for three different language pairs.
This post documents a public benchmark we ran in April 2026 across three 500-utterance test sets — Hausa-English, Yoruba-English, and Igbo-English code-switched audio — comparing four production speech-recognition systems. It then walks through the architectural decisions behind Orinode STT, the model we built to close the gap for all three. Everything below is reproducible. The eval notebook and reference transcripts will be released on GitHub alongside this post.
What "code-switching" actually means in audio
In the linguistics literature, code-switching is the alternation between two or more languages within a single utterance, conversation, or discourse. There are three types you encounter in Nigerian audio, regardless of which Nigerian language is involved:
- Inter-sentential: the speaker finishes one sentence in Hausa/Yoruba/Igbo and starts the next in English. This is the easiest case — most systems can re-detect language between sentences.
- Intra-sentential: the switch happens inside one sentence, often at a clause boundary. "Zan zo gobe, but can you confirm the price?" is the canonical Hausa example. "Ṣe ó wà lori sale?" embedded in an English sentence is the canonical Yoruba example.
- Lexical insertion: a single foreign word inserted into an otherwise monolingual utterance. "Ina son delivery a yau" (Hausa); "Mo fẹ́ refund" (Yoruba); "Anyị chọrọ invoice" (Igbo).
All three appear in real Nigerian business calls. Lexical insertion is the most common; intra-sentential is the most damaging to traditional STT systems. Our benchmark intentionally over-samples intra-sentential and lexical-insertion cases because those are where commercial systems fail hardest — and the failure modes differ by Nigerian language in ways that are themselves informative.
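To make the three categories concrete, here is a minimal sketch of how an utterance could be bucketed once each token carries a language label. The function name, the label strings, and the rule that punctuation stays attached to the preceding word are all assumptions for illustration; this is not the classifier used to build the benchmark.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (text, language code), e.g. ("delivery", "en")

SENTENCE_END = {".", "?", "!"}


def classify_switch(tokens: List[Token]) -> str:
    """Return 'monolingual', 'lexical_insertion', 'inter_sentential'
    or 'intra_sentential' (hypothetical label names)."""
    langs = [lang for _, lang in tokens]
    if len(set(langs)) <= 1:
        return "monolingual"

    # Lexical insertion: two languages, and one of them contributes a single token.
    counts = {lang: langs.count(lang) for lang in set(langs)}
    if len(counts) == 2 and min(counts.values()) == 1:
        return "lexical_insertion"

    # Inter-sentential: every language change directly follows
    # sentence-final punctuation. Any other change is intra-sentential.
    for i in range(1, len(tokens)):
        if langs[i] != langs[i - 1]:
            prev_text = tokens[i - 1][0]
            if not prev_text or prev_text[-1] not in SENTENCE_END:
                return "intra_sentential"
    return "inter_sentential"


example = [("Ina", "ha"), ("son", "ha"), ("delivery", "en"), ("a", "ha"), ("yau", "ha")]
print(classify_switch(example))  # lexical_insertion
```

The single-token rule is deliberately crude: a two-word English brand name inside a Hausa sentence would be counted as intra-sentential here, so a real classifier would need a span-length threshold rather than an exact count.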
The one-language-per-utterance assumption
Almost every commercially deployed speech-to-text system in 2026 — Whisper, Google Cloud STT, AWS Transcribe, Azure Speech, Wit.ai — was designed around an assumption that is so deeply baked into the architecture that most users never notice it: one audio file gets one language label.
In Whisper specifically, the decoder is conditioned on a single language token (<|ha|>, <|yo|>, <|ig|>, <|en|>) that is fixed at the start of decoding and assumed for the rest of the sequence. Every later step, from what cross-attention picks out of the audio to which sub-word units the language-modelling prior favours, is conditioned on that one token. If the model commits to <|ha|> for the audio, then when an English word appears halfway through, the model has two bad options:
- Force English phonology into the target language's orthography. The English word "delivery" gets transcribed as "diliferi" (under <|ha|>), "ditefírì" (under <|yo|>), or "dilifari" (under <|ig|>): phonetic approximations no Hausa, Yoruba, or Igbo speaker would write, and no downstream NLP system can recognise.
- Skip the foreign segment entirely. The model emits silence tokens or hallucinates target-language words that did not appear in the audio.
Neither option produces a usable transcript. Both produce confident-looking outputs that pass a syntax checker and fail the downstream task. Below are three real examples, one per language pair, from our test set.
Example 1 — Hausa-English (Kano pharmacy)
// "Hello, I want Paracetamol 500mg, do you have it?"
Whisper-large-v3 (language=ha): Sannu, ina son paratamol bashar mai jiki, kuna da shi?
// "Paracetamol 500mg" became "paratamol bashar mai jiki", a Hausa-phonotactic hallucination.
Google Cloud STT (language=ha-NG): Sannu, ina son [INAUDIBLE], kuna da shi?
// Silently dropped the English brand name + dosage.
Orinode STT (per-token language): Sannu, ina son Paracetamol 500mg, kuna da shi?
// Exact match. Diacritics, brand name, dosage, all correct.
Example 2 — Yoruba-English (Lagos real estate)
// "Good morning, I'm calling about the apartment in Lekki — is it still on sale?"
Whisper-large-v3 (language=yo): Eku aaro, I'm calling about the apartment in Lekki — se o wa lori sale?
// Tonal diacritics stripped. "Ẹ" collapsed to "E" (different vowel). "ṣe" became "se" — a different word entirely.
Google Cloud STT (language=yo-NG): eku aro, [INAUDIBLE] in Lekki — se o wa lori se
// Lost the middle of the English clause. Tones gone. "sale" misread as "se". Punctuation dropped.
Orinode STT (per-token language): Ẹ ku àárọ̀, I'm calling about the apartment in Lekki — ṣe ó wà lori sale?
// All tone marks preserved. English span intact. Question mark survives.
Example 3 — Igbo-English (Onitsha logistics)
// "Hello, we sent the package to Onitsha — where is it right now?"
Whisper-large-v3 (language=ig): Ndewo, we sent the package to Onicha — kedu ebe o no ugbu a
// Dot-below diacritics lost: ọ → o, ụ → u. "Onitsha" misspelled as "Onicha". Word boundaries kept but the result is not searchable.
Meta MMS-1B-all (language=ibo): Ndewo we sent the package to oni isha kedu ebe o no ugbo a
// No punctuation. Place name fragmented. Diacritics stripped. Unusable for any downstream task.
Orinode STT (per-token language): Ndeewo, we sent the package to Onitsha — kedụ ebe ọ nọ ugbu a?
// Exact match. ụ, ọ, ọ all preserved. "Onitsha" spelled correctly (an English-language place name embedded in Igbo).
Three languages, three slightly different failure modes — but the same root cause. The Whisper Hausa failure is hallucination (confident gibberish). The Whisper Yoruba failure is diacritic collapse (tonally distinct words merged into one). The Whisper Igbo failure is dot-below stripping plus place-name corruption. In every case, the transcript is unusable for the downstream task and the model is unaware that anything went wrong.
The benchmark
Between January and April 2026, we collected three 500-utterance test sets from anonymised real Nigerian business calls with explicit caller consent for research use: one Hausa-English, one Yoruba-English, one Igbo-English. Each utterance was transcribed independently by three native bilingual speakers per language (Kano + Lagos for Hausa-EN; Lagos + Ibadan for Yoruba-EN; Onitsha + Enugu for Igbo-EN); we used the consensus reference for scoring. Utterances range from 2 to 14 seconds, average length 6.4 seconds.
Domain mix across all three sets is roughly balanced: pharmacy (24%), logistics (22%), real estate (20%), clinics (16%), restaurants (12%), other (6%). 1,500 utterances total. Each set was scored independently.
We ran each utterance through four production systems, using default settings except where noted:
- Whisper-large-v3 via the official OpenAI checkpoint, language auto-detect off, language hint set per test set (ha, yo, ig)
- Google Cloud Speech-to-Text v2, model latest_long, language codes ha-NG / yo-NG / ig-NG where supported
- Meta MMS-1B-all, the multilingual 1B-parameter open-source model
- Orinode STT v2.3, our production model, with detect_code_switch=True
Word error rate (WER) is computed against the consensus reference per ITU-T P.940, with case and punctuation normalised but diacritics preserved.
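The normalisation rule matters more than it looks, because a scorer that strips diacritics would hide exactly the errors this benchmark is about. Below is a minimal sketch of a WER function that folds case and punctuation but keeps tone marks and dot-below characters. The function names are illustrative, and this is a generic word-level Levenshtein scorer written under those assumptions, not the benchmark's own scoring code.

```python
import unicodedata
from typing import List


def normalise(text: str) -> List[str]:
    """Lowercase and drop punctuation, but keep diacritics (NFC-composed)."""
    text = unicodedata.normalize("NFC", text).lower()
    # Unicode punctuation lives in categories starting with "P".
    # Combining tone marks are category "Mn", so they survive this filter.
    kept = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return kept.split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalise(reference), normalise(hypothesis)
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)


# Case and punctuation are ignored; a stripped tone mark is an error.
print(wer("Mo fẹ́ refund.", "mo fẹ́ refund"))
print(wer("Mo fẹ́ refund", "Mo fe refund"))
```

With this rule, "fẹ́" and "fe" count as different words, so a system that drops tone marks is charged a substitution rather than getting a free pass.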
"—" for Google Cloud STT on Igbo: as of April 2026, Google does not ship a production Igbo language model. This is itself the headline finding for Igbo voice AI in 2026 — the largest cloud provider does not support the language at all.
The per-token decomposition
Aggregate WER hides where models actually fail. Decomposing the error by token language — that is, whether the error occurred on a token labelled as the Nigerian language or as English in the reference — is the more revealing view.
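One way to compute such a decomposition is to align the hypothesis against the reference and charge each substitution or deletion to the language of the reference token it hits. The sketch below does that with a standard edit-distance backtrace. Ignoring insertions is an assumption made here for simplicity (an inserted word has no reference token to inherit a label from), and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

RefToken = Tuple[str, str]  # (word, language label from the reference)


def errors_by_language(ref: List[RefToken], hyp: List[str]) -> Dict[str, float]:
    """Share of reference tokens per language that were substituted or deleted."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1][0] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)

    totals: Dict[str, int] = {}
    errors: Dict[str, int] = {}
    for _, lang in ref:
        totals[lang] = totals.get(lang, 0) + 1
        errors.setdefault(lang, 0)

    # Walk the alignment backwards and attribute each error.
    i, j = n, m
    while i > 0:
        word, lang = ref[i - 1]
        if j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if word == hyp[j - 1] else 1):
            if word != hyp[j - 1]:
                errors[lang] += 1  # substitution
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            errors[lang] += 1      # deletion
            i -= 1
        else:
            j -= 1                 # insertion: not charged to any language
    return {lang: errors[lang] / totals[lang] for lang in totals}


ref = [("ina", "ha"), ("son", "ha"), ("delivery", "en"), ("a", "ha"), ("yau", "ha")]
hyp = ["ina", "son", "diliferi", "a", "yau"]
print(errors_by_language(ref, hyp))  # {'ha': 0.0, 'en': 1.0}
```

In this toy case the utterance-level error rate is 20%, which looks tolerable, while the English-token error rate is 100%. That gap is what the aggregate number hides.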
| System | Hausa-EN overall | Hausa-EN HA tokens | Hausa-EN EN tokens | Per-token lang acc. |
|---|---|---|---|---|
| Whisper-large-v3 | 41.7% | 58.9% | 22.3% | — |
| Google Cloud STT | 52.4% | 46.1% | 61.0% | — |
| Meta MMS-1B-all | 34.2% | 39.5% | 27.8% | — |
| Orinode STT v2.3 | 12.6% | 13.4% | 11.8% | 97.2% |
| System | Yoruba-EN overall | Yoruba-EN YO tokens | Yoruba-EN EN tokens | Per-token lang acc. |
|---|---|---|---|---|
| Whisper-large-v3 | 36.8% | 49.2% | 21.4% | — |
| Google Cloud STT | 44.1% | 38.7% | 50.3% | — |
| Meta MMS-1B-all | 30.5% | 36.1% | 23.8% | — |
| Orinode STT v2.3 | 14.1% | 15.6% | 12.4% | 96.8% |
| System | Igbo-EN overall | Igbo-EN IG tokens | Igbo-EN EN tokens | Per-token lang acc. |
|---|---|---|---|---|
| Whisper-large-v3 | 48.2% | 62.4% | 30.1% | — |
| Google Cloud STT | — | — | — | — |
| Meta MMS-1B-all | 38.9% | 44.7% | 31.2% | — |
| Orinode STT v2.3 | 16.7% | 17.9% | 15.0% | 95.9% |
Per-token language accuracy is the share of reference tokens for which the system assigned the correct language label. Whisper, Google, and MMS commit to a single language for the whole utterance, so per-token language accuracy is not meaningful for them and we report it as "—". For Orinode STT, it is the headline number that explains why downstream task accuracy is high across all three language pairs.
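As a rough illustration of that definition, the sketch below aligns hypothesis tokens to reference tokens by their text and counts the reference tokens whose aligned counterpart carries the same language label. Two assumptions are baked in and are not taken from the benchmark: alignment uses Python's difflib rather than the scorer's own aligner, and a reference token with no exact text match counts as incorrectly labelled.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Token = Tuple[str, str]  # (word, language label)


def language_accuracy(ref: List[Token], hyp: List[Token]) -> float:
    """Share of reference tokens whose aligned hypothesis token has the same label."""
    ref_words = [word for word, _ in ref]
    hyp_words = [word for word, _ in hyp]
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)

    correct = 0
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            if ref[block.a + k][1] == hyp[block.b + k][1]:
                correct += 1
    return correct / len(ref) if ref else 0.0


ref = [("mo", "yo"), ("fẹ́", "yo"), ("refund", "en")]
hyp = [("mo", "yo"), ("fẹ́", "yo"), ("refund", "yo")]  # last label is wrong
print(language_accuracy(ref, hyp))
```

Counting unmatched reference tokens as wrong makes the metric conservative: a system cannot score well on language labels for words it failed to transcribe.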
Where the errors actually come from
We classified every Whisper error from all three test sets into six categories. The shares below are pooled across Hausa-EN, Yoruba-EN, and Igbo-EN — the relative distribution is broadly similar, but the dominant category differs slightly by language (phonotactic hallucination dominates Hausa; diacritic loss dominates Yoruba; place-name corruption is higher in Igbo).
The composition matters because it tells you what kind of fix is required. If the dominant error were "noise" or "dialect mismatch", the cure would be more training data. But the dominant errors are all architectural: hallucination, diacritic collapse, silent omission. These do not go away with scale. They require a different decoder.
How we fixed it: per-token language tagging
The core architectural decision behind Orinode STT is that the language label is part of the output sequence, not the input conditioning. The decoder does not commit to a single language at the start. Instead, at every decoding step, it emits a (token, language, confidence) triple, drawing from a unified vocabulary that covers Hausa, Yoruba, Igbo, English, and Nigerian Pidgin simultaneously.
Concretely, the decoder's output head is wider than a monolingual model's: each emission position has a probability distribution over a vocabulary of ~52,000 tokens (combining the BPE vocabularies of all five languages plus shared punctuation) and a separate distribution over the language label. The two distributions are jointly trained with a hierarchical softmax such that token and language hypotheses are mutually informative.
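The shape of a single decoding step can be sketched in a few lines. To be clear about what this is: it uses two independent softmax heads and multiplies their top probabilities to get a confidence, which is a simplification chosen for readability and not the hierarchical softmax described above. The function names, the toy logits, and the product-of-probabilities confidence are all assumptions.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def emit_triple(token_logits: Dict[str, float],
                lang_logits: Dict[str, float]) -> Tuple[str, str, float]:
    """Return one (token, language, confidence) triple for a decoding step."""
    tokens, langs = list(token_logits), list(lang_logits)
    token_probs = softmax([token_logits[t] for t in tokens])
    lang_probs = softmax([lang_logits[name] for name in langs])

    t_idx = max(range(len(tokens)), key=token_probs.__getitem__)
    l_idx = max(range(len(langs)), key=lang_probs.__getitem__)
    # Assumption: confidence is the product of the two head probabilities.
    confidence = token_probs[t_idx] * lang_probs[l_idx]
    return tokens[t_idx], langs[l_idx], confidence


# Toy logits for one step: the audio sounds like English "refund".
token, lang, conf = emit_triple(
    {"refund": 4.0, "rifond": 1.0},
    {"yo": 0.5, "en": 3.0},
)
print(token, lang, round(conf, 2))
```

The point of the sketch is only the output contract: every step yields a token and a language label together, so no single label has to be committed to before decoding starts.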
When the model sees the audio for "Ẹ ku àárọ̀, I'm calling about the apartment in Lekki", the output looks like this:
Every downstream task — intent extraction, named-entity recognition, summary generation, TTS response synthesis, analytics — reads the per-token labels and routes accordingly. The Yoruba span gets Yoruba NLP. The English span gets English NLP. The TTS response can switch voice characteristics at the right token boundary so the agent's reply matches the caller's register.
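Routing by language usually starts by collapsing the token stream into contiguous same-language spans. The sketch below does that for tokens shaped like the API response shown later in this post (text, lang, conf keys); the function name is illustrative, and what each span is then handed to is left out because it depends on the downstream stack.

```python
from itertools import groupby
from typing import Dict, List, Tuple


def language_spans(tokens: List[Dict]) -> List[Tuple[str, str]]:
    """Collapse per-token labels into (language, text) spans, in order."""
    return [
        (lang, " ".join(tok["text"] for tok in group))
        for lang, group in groupby(tokens, key=lambda tok: tok["lang"])
    ]


tokens = [
    {"text": "Ẹ", "lang": "yo", "conf": 0.98},
    {"text": "ku", "lang": "yo", "conf": 0.99},
    {"text": "àárọ̀", "lang": "yo", "conf": 0.97},
    {"text": "I'm", "lang": "en", "conf": 0.99},
    {"text": "calling", "lang": "en", "conf": 0.99},
]
for lang, text in language_spans(tokens):
    print(lang, "->", text)
```

Because groupby only merges adjacent items, a speaker who switches back to Yoruba later in the call produces a second Yoruba span rather than being merged into the first, which is what a TTS or NLP router needs.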
What this required, technically
Per-token language tagging is conceptually simple but it forces three architectural decisions that most off-the-shelf models do not make.
1. Unified tokenizer. A Hausa BPE tokenizer trained alone will produce sub-word units that fragment English (and Yoruba and Igbo) words awkwardly. Orinode STT uses a joint BPE tokenizer trained on a roughly 55/15/12/10/8 split of Nigerian English / Hausa / Yoruba / Igbo / Pidgin text. Diacritics are preserved through Unicode NFC normalisation rather than stripping — this is the single change that prevents the Yoruba tone-mark and Igbo dot-below failures dominating the Whisper/MMS baselines. The resulting vocabulary is denser for each language than a sentencepiece tokenizer trained per-language would be, but it tokenizes code-switched text without artificial boundary effects.
2. Joint language-token loss. During training, the loss function is a weighted sum of next-token cross-entropy and next-language cross-entropy. The weighting is annealed: early in training, language prediction is weighted higher to force the model to learn language-discriminative features; later, token prediction takes over. The annealing schedule was the single training hyperparameter that mattered most for code-switch performance, and it generalised cleanly across all three language pairs.
3. Code-switch-balanced training data. We synthesised 8,400 hours of code-switched audio by splicing monolingual recordings — 3,200 hours Hausa-EN, 2,800 hours Yoruba-EN, 2,400 hours Igbo-EN — and validated 1,200 hours of natural code-switched audio from anonymised call recordings (proportionally split). The synthesised data has prosodic discontinuities that natural code-switching does not have, so the model is trained on a mix that biases toward the natural set. Pure monolingual training (the Whisper default) leaves the model unable to handle code-switch boundaries no matter how much fine-tuning is done.
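The annealed loss in point 2 can be written down compactly. The sketch below assumes a linear schedule and made-up start and end weights (0.7 and 0.2); the post does not state the actual schedule shape or values, so treat every number and name here as a placeholder.

```python
def language_weight(step: int, total_steps: int,
                    start: float = 0.7, end: float = 0.2) -> float:
    """Weight on the language loss, annealed linearly from start to end.

    start, end and the linear shape are hypothetical; total_steps must be > 0.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def joint_loss(token_ce: float, lang_ce: float,
               step: int, total_steps: int) -> float:
    """Weighted sum of next-token and next-language cross-entropy."""
    w = language_weight(step, total_steps)
    return w * lang_ce + (1.0 - w) * token_ce


# Early in training the language term dominates; later the token term does.
for step in (0, 50_000, 100_000):
    print(step, round(language_weight(step, 100_000), 2))
```

Keeping the two weights summing to one means the overall loss scale stays comparable across training, so the schedule changes what the model attends to without also acting as a hidden learning-rate change.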
What you can do with this today
The Orinode STT API exposes per-token language labels through the detect_code_switch flag. The same endpoint handles all three language pairs — you do not pick which Nigerian language; the model detects it. A minimal Python call looks like this:
# pip install requests
import requests

resp = requests.post(
    "https://maraba.ai/api/v1/stt/",
    headers={"X-API-Key": "sk_live_..."},
    json={
        "audio_url": "https://cdn.example.com/call.wav",
        "language_hint": "auto",
        "detect_code_switch": True,
        "preserve_diacritics": True,
    },
)

for tok in resp.json()["tokens"]:
    print(tok["text"], tok["lang"], tok["conf"])

# Ẹ yo 0.98
# ku yo 0.99
# àárọ̀ yo 0.97
# I'm en 0.99
# calling en 0.99
# ...
For real-time use — telephony, live transcription, voice agents — the same model is exposed via a WebSocket endpoint at wss://maraba.ai/api/v1/stt/stream/. Partial transcripts with per-token labels are emitted every ~200 ms. Final latency end-to-end is sub-second on Lagos-region 8 kHz GSM telephony audio.
Reproducibility
Everything in this post is designed to be reproducible. Specifically:
- The three 500-utterance eval sets (Hausa-EN, Yoruba-EN, Igbo-EN), each with consensus reference transcripts and per-token language labels, will be released on Hugging Face Datasets under CC-BY-4.0 by the end of June 2026, subject to caller-consent verification on a final batch.
- The eval notebook, covering data loading, system invocation for all four models, WER computation per ITU-T P.940, and the error-categorisation classifier, is at github.com/orinode/benchmarks.
- The current benchmark figures, machine-readable, live at maraba.ai/benchmarks.json and are updated whenever Orinode STT is retrained.
If you spot a methodological issue or want to run the eval with different decoding settings, the notebook is the place to start. We will accept pull requests.
What this means for Nigerian voice AI
Code-switching is not a niche edge case. It is the dominant mode of Nigerian business voice communication across all three major Nigerian languages, and every commercial speech-recognition system on the market today fails on it at the architectural level. "More training data" does not solve a one-language-per-utterance decoder. Fine-tuning does not solve it. Better acoustic models do not solve it. The fix is structural: the language label has to be part of the output, not the input.
For developers building voice agents, telephony products, transcription tools, or analytics pipelines on Nigerian audio, this matters in three concrete ways. First, any system architected around a single language label per call will under-report intent, drop named entities, and produce confidently wrong summaries — and the failure mode looks slightly different in Hausa, Yoruba, and Igbo, which makes debugging harder. Second, any vendor that does not publish per-token language accuracy is hiding a real failure mode. Third, the only durable fix is to use a model that was trained from the start for code-switched speech — not one that has been fine-tuned on it after the fact.
That is the bet behind Orinode STT: that Nigerian voice is fundamentally different from English voice, and that getting it right requires architectural choices, not just training-data choices. The benchmark above is the evidence we have so far. We also note one finding worth dwelling on: in April 2026, Google Cloud Speech-to-Text does not support Igbo at all, despite Igbo being a first language for roughly 24 million people. That is the kind of gap that does not close without African-led infrastructure.
If you have your own Nigerian audio and want to compare systems, the request format is identical to OpenAI's STT API except for the detect_code_switch flag. The same endpoint handles Hausa-EN, Yoruba-EN, and Igbo-EN with no per-language configuration. Beta API keys are available — request one here. Free during the private beta for the first ten partners.
Frequently asked
Is code-switching really common enough to matter commercially? Yes. In our 2026 sample of 12,000 anonymised Nigerian business calls, 73% contained at least one intra-sentential code-switch (Hausa-EN, Yoruba-EN, Igbo-EN, or Pidgin-EN). 41% contained lexical-insertion code-switching in every utterance. Monolingual Nigerian-language calls were 18% of the sample. Monolingual English calls were 9%.
Why don't bigger models like GPT-4o or Gemini 1.5 solve this? They have the same architectural assumption. They are larger versions of the same one-language-per-utterance decoder. The capability frontier moves with parameter count, but the structural error mode is the same. Bigger models hallucinate more fluently on code-switched audio, which is sometimes worse than failing loudly.
What about Whisper fine-tuning on Nigerian data? We tried this. Fine-tuning Whisper-large-v3 on 6,000 hours of conversational Hausa brought monolingual Hausa WER down to 14.2%. Code-switch WER stayed at 38.4%. Equivalent fine-tunes on Yoruba and Igbo monolingual data show the same pattern — the monolingual WER drops, the code-switch WER barely moves. The decoder's language-conditioning bottleneck does not go away with more training; it goes away with a different decoder.
How does Orinode STT handle Yoruba tonal diacritics? Yoruba's tone marks and dot-below letters (è, é, ē, ẹ̀, ọ́, ṣ) are preserved through Unicode NFC normalisation in the tokenizer and through the decoder. Tonally distinct words like ọkọ ("husband") and ọkọ̀ ("vehicle") are kept as different tokens with different acoustic embeddings. Most off-the-shelf models strip these characters, which collapses such words into one.
What about Igbo dot-below characters? ị, ụ, ọ are preserved identically. The most common Igbo failure mode in Whisper is collapsing ọ → o and ụ → u, which loses phonemic distinctions. Orinode STT keeps them apart with the same NFC-normalised tokenizer.
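The difference between normalising and stripping is easy to see with Python's standard unicodedata module. The sketch below is a generic illustration of the two behaviours, not Orinode's tokenizer code, and the function names are made up for the example.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """What a lossy pipeline does: decompose, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preserve_diacritics(text: str) -> str:
    """NFC: one canonical composed form per character, nothing removed."""
    return unicodedata.normalize("NFC", text)


husband, vehicle = "ọkọ", "ọkọ̀"

# Stripping merges the two Yoruba words into the same string.
print(strip_diacritics(husband) == strip_diacritics(vehicle))        # True
# NFC keeps them distinct.
print(preserve_diacritics(husband) == preserve_diacritics(vehicle))  # False
# NFC also unifies two encodings of the same letter (o + dot below vs. ọ).
print(preserve_diacritics("o\u0323") == "\u1ecd")                    # True
```

The last line is why NFC is the useful choice for a tokenizer: text typed with combining characters and text typed with precomposed characters end up as the same token sequence, without losing the distinction between ọ and o.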
Will you open-source the model? A 250M-parameter variant, Orinode-STT-Small, will be released on Hugging Face under Apache 2.0 by the end of Q3 2026. The production model used by Maraba's telephony stays closed for now. Open-sourcing the small variant is meant to make Nigerian-language voice AI reproducible and accessible to African NLP researchers; the larger model carries production-grade telephony optimisations that we plan to monetise.
Does this work for Pidgin-English code-switching too? Yes — Pidgin is the fifth language in the unified tokenizer. We did not include Pidgin-EN as a separate test set in this benchmark because Pidgin already shares so much English vocabulary that the code-switch boundary is fuzzy and existing WER methodology gets noisy. A dedicated Pidgin benchmark is on the publishing roadmap for Q3 2026.
Orinode STT, Orinode TTS, and Orinode LangID are the speech stack behind Maraba — and they are available to developers as standalone APIs. One endpoint handles Hausa-EN, Yoruba-EN, and Igbo-EN code-switching. Beta keys are free for the first ten partners.