Fine-Tuning Whisper for Hausa, Yoruba, and Igbo

The problem: baseline Whisper WER on Nigerian languages

When we evaluated OpenAI Whisper (small) on Nigerian language test sets before any fine-tuning, the results were not encouraging:

Hausa: ~40% word error rate (WER) on in-domain conversational speech
Yoruba: ~48% WER — tonal marking nearly absent in output
Igbo: ~52% WER — dotted vowels frequently replaced with ASCII equivalents

A 40% WER means roughly 40 errors (substituted, dropped, or inserted words) for every 100 words actually spoken. For a transcription system, this is unusable in production. For a live AI voice agent making real-time decisions based on what a caller said, it is a customer service disaster.
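To make the metric concrete, here is a minimal sketch of WER as word-level edit distance divided by the number of reference words. The function name is hypothetical and this is not the `evaluate` implementation used later in the post. The erroneous hypothesis in the usage example is made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word and one altered word out of seven reference words.
reference = "amma har yanzu ban san farashin ba"
hypothesis = "amma yanzu ban san farashi ba"
print(f"{word_error_rate(reference, hypothesis):.1%}")
```

Because insertions count as errors, WER can exceed 100% when the model hallucinates extra words.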

The causes of high WER on these languages are structural:

Limited representation in pre-training data. Whisper was pre-trained on 680,000 hours of labelled audio collected from the web, most of it English. The proportion that is Nigerian Hausa, Yoruba, or Igbo is tiny: these languages are underrepresented online relative to their real-world speaker populations. The model has simply not seen enough of them.

Diacritic-stripped text in training labels. Even where Nigerian language audio exists in public datasets, the ground-truth transcripts are sometimes written without proper diacritics — plain ASCII approximations. A model trained on such labels learns to output diacritic-free text, which is technically incorrect Hausa, Yoruba, or Igbo.

Little Nigerian-accented English or code-switched data. Code-switching between a Nigerian language and English is pervasive in Nigerian speech. The stock Whisper model tends to hallucinate or fail when it encounters a mid-sentence language switch.

Data collection: what makes good training data for Nigerian languages

The first and most important investment was in data quality. We used three sources:

Mozilla Common Voice

The Common Voice Hausa dataset had approximately 8 hours of validated audio at the time of training. We filtered this aggressively: we kept only clips with two or more validation votes, excluded clips shorter than 1.5 seconds, and excluded clips from any single speaker who had contributed more than 30% of the corpus (to avoid speaker bias). After filtering, we used approximately 5 hours of Common Voice Hausa.

Common Voice Igbo had under 3 hours of validated audio, and Common Voice Yoruba was comparable. We used all of it, with the same validation filtering.
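The three filters above can be sketched as a single pass over clip metadata. This is a minimal sketch under stated assumptions, not our production pipeline: `up_votes` and `client_id` follow the Common Voice column names, while `duration_s` is a hypothetical field you would compute from the audio yourself. Speaker share is measured here by clip count on the unfiltered corpus, which is an assumption, since measuring by duration is equally defensible.

```python
from collections import Counter


def filter_clips(clips, min_votes=2, min_duration_s=1.5, max_speaker_share=0.30):
    """Keep clips that pass the vote, duration, and speaker-dominance filters."""
    per_speaker = Counter(clip["client_id"] for clip in clips)
    total = len(clips)
    # Speakers whose clips make up more than the allowed share of the corpus.
    dominant = {
        speaker for speaker, count in per_speaker.items()
        if count / total > max_speaker_share
    }
    return [
        clip for clip in clips
        if clip["up_votes"] >= min_votes
        and clip["duration_s"] >= min_duration_s
        and clip["client_id"] not in dominant
    ]


clips = (
    [{"client_id": "A", "up_votes": 3, "duration_s": 4.0}] * 4   # 50% of corpus
    + [{"client_id": "B", "up_votes": 2, "duration_s": 2.0},
       {"client_id": "B", "up_votes": 1, "duration_s": 2.0}]     # too few votes
    + [{"client_id": "C", "up_votes": 3, "duration_s": 1.0},     # too short
       {"client_id": "C", "up_votes": 2, "duration_s": 1.5}]
)
kept = filter_clips(clips)
print(len(kept), "of", len(clips), "clips kept")
```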

Masakhane-aligned text corpora

Masakhane does not provide audio, but its text corpora are valuable for vocabulary coverage. We used Masakhane text to generate synthetic training audio using an early TTS model — low-fidelity but sufficient to expose the Whisper encoder to vocabulary it had not seen in the Common Voice recordings. Synthetic audio should never be the primary training source, but it helps with rare words and proper nouns.

Maraba proprietary recordings

This was the most important data source. We assembled 6.5 hours of Hausa audio from real call recordings (with customer consent and anonymisation), recordings from staff members who are native Hausa speakers, and targeted studio sessions with speakers from Kano, Kaduna, and Sokoto. The studio sessions specifically targeted:

Business vocabulary: product orders, prices, appointments, complaints, location enquiries
Code-switching: Hausa-English sentences of the type callers actually produce
Ejective and implosive consonants: deliberate inclusion of words with ƙ, ɗ, ɓ to ensure these phonemes were represented
Fast speech: natural conversational pace, not over-enunciated

Total training data: Hausa ~6.5 hours proprietary + 5 hours Common Voice = ~11.5 hours. Yoruba ~4 hours. Igbo ~3.5 hours. This is still very small by the standards of mainstream ASR — but it was enough to achieve meaningful WER improvements.

Training setup: Whisper fine-tuning with HuggingFace

We used the HuggingFace transformers library to fine-tune openai/whisper-small. The small model was the right choice for our latency requirements — we need sub-1.5 second response times on a live phone call, and the medium model adds 200–400ms of inference time on our hardware that we could not afford.

Python — Whisper fine-tuning configuration

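from dataclasses import dataclass
from typing import Any, Dict, List, Union

import torch
import evaluate
from datasets import load_dataset, Audio
from transformers import (
    WhisperForConditionalGeneration,
    WhisperProcessor,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    """Pads audio features and label ids separately (adapted from the HuggingFace Whisper fine-tuning example)."""
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Audio features are already fixed-length log-mel spectrograms; just batch them.
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Pad label ids, then replace padding with -100 so the loss ignores it.
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # Drop a leading BOS token if present; the model prepends it during training.
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch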

MODEL_ID = "openai/whisper-small"
LANGUAGE = "ha"   # or "yo"; note that "ig" is not among Whisper's built-in language tokens
TASK = "transcribe"

processor = WhisperProcessor.from_pretrained(MODEL_ID, language=LANGUAGE, task=TASK)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)
model.generation_config.language = LANGUAGE
model.generation_config.task = TASK
model.generation_config.forced_decoder_ids = None

# Load and prepare dataset
hausa_dataset = load_dataset("mozilla-foundation/common_voice_13_0", "ha", split="train+validation")
hausa_dataset = hausa_dataset.cast_column("audio", Audio(sampling_rate=16_000))

def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Labels: do NOT lowercase or otherwise normalise; keep Hausa diacritics exactly as written
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

hausa_dataset = hausa_dataset.map(prepare_dataset, remove_columns=hausa_dataset.column_names)

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hausa",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",  # named eval_strategy in newer transformers releases
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=500,
    eval_steps=500,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
)

wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)
    # Do NOT normalise — preserve Hausa diacritics in WER computation
    wer = 100 * wer_metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}

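# Hold out a slice for evaluation so WER is not measured on training clips.
# The 90/10 ratio and the seed are illustrative; a speaker-disjoint split is better still.
split = hausa_dataset.train_test_split(test_size=0.1, seed=42)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    data_collator=DataCollatorSpeechSeq2SeqWithPadding(processor=processor),
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)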

trainer.train()

The most critical line in that code is the comment in prepare_dataset: do NOT lowercase. The HuggingFace Whisper fine-tuning examples in the official documentation apply text normalisation, including lowercasing, before computing WER. For English, this is reasonable. For Hausa, Yoruba, and Igbo, aggressive normalisation of this kind can destroy the diacritics that define the correct orthography. It took us two training runs to identify this as the source of our model outputting ƙ as k and ọ as o.
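If some normalisation is still wanted so that WER is not inflated by encoding differences, one option is to unify only the Unicode form and the whitespace. The following is a minimal sketch using Python's standard `unicodedata` module. Both function names are hypothetical and not part of the training script above. The second function is included only to show what an accent-stripping normaliser does to Yoruba and Igbo text.

```python
import unicodedata


def normalise_for_wer(text: str) -> str:
    """Diacritic-safe normalisation: unify Unicode form and whitespace only."""
    # NFC makes "o" + combining dot below compare equal to the precomposed "ọ".
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def strip_marks(text: str) -> str:
    """What an accent-stripping normaliser does (shown as a warning, not a recipe)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


decomposed_o = "o\u0323"  # "o" followed by COMBINING DOT BELOW
print(normalise_for_wer(decomposed_o) == "\u1ecd")  # True: same letter, one code point
print(strip_marks("ọ"), strip_marks("ṣ"))           # o s: the orthography is lost
```

Applying the same diacritic-safe function to both predictions and references keeps the comparison fair without rewarding diacritic-free output.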

The diacritics problem in ground-truth labels

This issue deserves its own section because it affected us significantly and will affect anyone training Nigerian language ASR.

When we first evaluated our WER numbers after fine-tuning, we saw an oddly high WER even though the transcripts looked correct to native speaker evaluators. The model was outputting alƙawari (correct Hausa, "appointment") and the ground-truth label contained alkawari (ASCII approximation, technically wrong). The WER metric was penalising correct output because the labels were wrong.

The fix required auditing every ground-truth transcript in our dataset for diacritic correctness. We built a simple validator that flagged transcripts containing only ASCII characters for words that commonly contain ƙ, ɗ, ɓ, ị, ụ, ọ, ẹ, ṣ. We then hired native speaker reviewers — three for Hausa, two for Yoruba, two for Igbo — to correct the labels. This was the most expensive part of the data preparation process, but it was essential.

The lesson: for Nigerian language ASR, never trust ground-truth transcripts that come from a pipeline that applied any form of text normalisation or lowercasing. Always validate diacritic coverage before using a corpus for training.
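A validator of the kind described above can be sketched as a lookup from ASCII-folded spellings to the correctly spelled word. This is a minimal sketch, not our actual tool: the two-word lexicon is purely illustrative (a real one would need to be compiled by native speakers), and all function names are hypothetical. Hooked letters such as ƙ, ɗ, and ɓ have no Unicode decomposition, so they are folded with an explicit table.

```python
import unicodedata

# Hooked letters cannot be folded by Unicode decomposition, so map them by hand.
HOOKED_TO_ASCII = str.maketrans(
    {"ƙ": "k", "ɗ": "d", "ɓ": "b", "Ƙ": "K", "Ɗ": "D", "Ɓ": "B"}
)


def ascii_fold(word: str) -> str:
    """Return the ASCII approximation a careless pipeline would produce."""
    decomposed = unicodedata.normalize("NFKD", word.translate(HOOKED_TO_ASCII))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(lexicon):
    """Map each ASCII-folded, lowercased form to its correctly spelled word."""
    return {ascii_fold(word).lower(): word for word in lexicon}


def flag_suspect_words(transcript: str, index):
    """Return (written, expected) pairs where an ASCII spelling shadows a lexicon word."""
    suspects = []
    for token in transcript.split():
        word = token.strip(".,;:!?\"'()")
        expected = index.get(word.lower())
        if word.isascii() and expected is not None and word.lower() != expected.lower():
            suspects.append((word, expected))
    return suspects


index = build_index(["alƙawari", "ɗaya"])   # illustrative lexicon only
print(flag_suspect_words("alkawari", index))  # [('alkawari', 'alƙawari')]
print(flag_suspect_words("alƙawari", index))  # []
```

A flag is only a prompt for human review: some ASCII spellings are legitimate words in their own right, which is why the corrections themselves were made by native speakers.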

Results: WER benchmarks after fine-tuning

After fine-tuning on the combined corpus (proprietary + Common Voice + synthetic), measured against held-out test sets of native speaker conversational speech:

Hausa WER: 40.3% (base Whisper small) → 18.1% (Maraba fine-tuned). 55% relative WER reduction.
Yoruba WER: 48.2% → 23.7%. 51% relative WER reduction.
Igbo WER: 52.1% → 22.4%. 57% relative WER reduction.
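The relative reductions quoted above follow from (baseline − fine-tuned) / baseline. A short check, with a hypothetical helper name:

```python
def relative_wer_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Fraction of the baseline error removed by fine-tuning."""
    return (baseline_wer - finetuned_wer) / baseline_wer


results = {"Hausa": (40.3, 18.1), "Yoruba": (48.2, 23.7), "Igbo": (52.1, 22.4)}
for language, (before, after) in results.items():
    print(f"{language}: {relative_wer_reduction(before, after):.0%}")
# Prints 55%, 51% and 57%, matching the figures above.
```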

These are in-domain results — the test set is conversational Nigerian speech in business contexts (orders, appointments, enquiries). On out-of-domain speech (e.g., formal broadcast Hausa or academic Yoruba), WER is higher because the style differs significantly from our training data.

For comparison, Google Cloud Speech-to-Text does not support Hausa, Yoruba, or Igbo as of 2026. AWS Transcribe supports none of the three. Azure Speech does not support them. There is no directly comparable commercial baseline.

Code-switching: the hardest problem

Training a model to transcribe monolingual Hausa is one challenge. Training it to handle mid-sentence switches between Hausa and English — which is how northern Nigerian callers actually speak — is substantially harder.

Our approach was to include code-switched training examples: recordings where speakers deliberately switched languages mid-sentence. We generated approximately 2 hours of code-switched Hausa-English audio through studio sessions where speakers were given prompts designed to elicit natural code-switching. We also included the naturally code-switched segments from our call recordings, identified by language segmentation using a lightweight LID (Language Identification) model before manual verification.

The result is a model that handles sentences like "I want to book, amma har yanzu ban san farashin ba" (I want to book, but I still don't know the price) without breaking. The English segment is transcribed correctly as English; the Hausa segment preserves its correct Hausa characters.

We still see failures on very rapid switches (switching within the same phrase rather than at phrase boundaries), which is an area for the next training iteration.

Production deployment: latency considerations

On our GPU inference cluster, the Whisper small model with 30 seconds of audio achieves approximately 0.8–1.1 seconds of inference time. For a live phone call, this is acceptable — callers expect a brief pause after they finish speaking. For real-time streaming applications, we run the model in a sliding-window configuration, processing 5-second chunks with 1-second overlap and returning partial transcripts as the call progresses.
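The sliding-window arithmetic can be sketched as follows. This is a minimal sketch assuming 16 kHz mono audio. The function name is hypothetical, and merging the partial transcripts across the overlapping second is deliberately left out. With 5-second windows and 1 second of overlap, each window starts 4 seconds after the previous one.

```python
def sliding_windows(num_samples, sample_rate=16_000, window_s=5.0, overlap_s=1.0):
    """Yield (start, end) sample offsets for overlapping chunks."""
    window = int(window_s * sample_rate)
    hop = int((window_s - overlap_s) * sample_rate)
    if hop <= 0:
        raise ValueError("overlap must be shorter than the window")
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        yield start, end
        if end == num_samples:
            break  # the final (possibly shorter) chunk has been emitted
        start += hop


# 12 seconds of audio -> windows covering 0-5 s, 4-9 s and 8-12 s.
for start, end in sliding_windows(12 * 16_000):
    print(start / 16_000, end / 16_000)
```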

VAD (Voice Activity Detection) is a critical preprocessing step. Without VAD, a 30-second audio chunk containing 20 seconds of silence spends two-thirds of its compute on audio with no speech in it. We use WebRTC VAD to identify speech segments before feeding them to Whisper, which reduces inference load by 40–60% on typical call audio.
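The gating logic looks roughly like this. To keep the sketch dependency-free, it uses a crude RMS-energy check as a stand-in for the real VAD decision, so it is not what runs in our pipeline. The threshold and all names are hypothetical. The py-webrtcvad binding exposes a `Vad.is_speech(frame, sample_rate)` method with the same shape, so it should slot in as the `is_speech` argument.

```python
import array
import math

FRAME_MS = 30  # WebRTC VAD accepts 10, 20 or 30 ms frames of 16-bit mono PCM


def energy_is_speech(frame: bytes, sample_rate: int, threshold: float = 500.0) -> bool:
    """Crude RMS-energy stand-in for a real VAD decision (illustrative threshold)."""
    samples = array.array("h", frame)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > threshold


def speech_segments(pcm: bytes, sample_rate: int = 16_000, is_speech=energy_is_speech):
    """Return (start_s, end_s) spans in which consecutive frames were judged speech."""
    frame_bytes = sample_rate * FRAME_MS // 1000 * 2  # 2 bytes per 16-bit sample
    n_frames = len(pcm) // frame_bytes
    segments, seg_start = [], None
    for i in range(n_frames):
        frame = pcm[i * frame_bytes:(i + 1) * frame_bytes]
        t = i * FRAME_MS / 1000
        if is_speech(frame, sample_rate):
            if seg_start is None:
                seg_start = t  # speech begins
        elif seg_start is not None:
            segments.append((seg_start, t))  # speech ended at this frame
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, n_frames * FRAME_MS / 1000))
    return segments


# 0.3 s silence, 0.6 s of loud signal, 0.3 s silence (synthetic test audio).
silence = array.array("h", [0] * 480 * 10).tobytes()
loud = array.array("h", [2000] * 480 * 20).tobytes()
print(speech_segments(silence + loud + silence))
```

Only the returned spans would then be cut out of the audio and passed to Whisper. A production setup would normally also pad each span slightly and merge spans separated by very short pauses.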

See the voice to text in Nigeria post for a discussion of latency considerations on Nigerian network conditions.

What we would do differently

Start with medium, not small. The small model meets our latency targets but has a ceiling on WER improvement. If we had invested in faster inference hardware earlier, we would have trained on medium from the start.
Invest in data quality review first. We lost two training runs to label quality issues. Auditing 100% of ground-truth transcripts for diacritic correctness before training would have saved significant time.
More dialect coverage. Our Hausa training data is predominantly Kano and Kaduna accent. Sokoto and Bauchi speakers show noticeably higher WER. Targeted data collection from underrepresented dialect regions is the highest-value next step.
Larger batch sizes. We were constrained to batch size 16 by GPU memory. Gradient accumulation helped but is not as effective as true large-batch training.

Using the model via API

If you want to build on top of Maraba's Nigerian language STT without training your own model, the Hausa STT API, Igbo STT API, and Yoruba STT API are available via the developer API. For researchers who want to work with the underlying model or contribute to improving it, reach out through the contact page. We are open to collaboration with academic groups and the Masakhane community.

The full Nigerian language AI resources landscape — datasets, models, papers — is documented in our Nigerian language AI resources guide.

Use our Nigerian language STT without the training overhead

The fine-tuned models run in production. Get an API key and call them directly — ₦5 per minute, no GPU required.
