Why Hausa speech recognition is a hard problem
Hausa is spoken by approximately 70 million people as a first language, with another 20–30 million using it as a lingua franca across West Africa. It is the dominant language of northern Nigeria — Kano, Kaduna, Sokoto, Katsina, Bauchi — and the commercial language of huge swaths of the Nigerian economy. Yet if you search for "Hausa speech to text API" today, you will find almost nothing useful. A few academic papers. Some references to Mozilla Common Voice's small Hausa dataset. Nothing a developer can call and ship against.
The technical reasons for this gap are real:
- Limited training data. The largest publicly available Hausa speech corpus at time of writing is approximately 8 hours in Mozilla Common Voice — a fraction of what English STT systems train on (thousands of hours). The Masakhane project has contributed text corpora but speech data remains scarce.
- Distinctive phonology. Hausa has phonemes that do not exist in English: ejective consonants (ƙ, represented as /k'/ phonetically), implosive consonants (ɓ and ɗ), and long/short vowel distinctions that are phonemically contrastive. A model trained on English will simply mishear these sounds.
- Diacritic complexity. Correct Hausa text uses characters like ƙ (hooked k), ɗ (hooked d), ɓ (hooked b), and the standard extended Latin vowels. Many transcription pipelines silently strip these to plain ASCII, producing text that is technically wrong — changing meaning in ways that matter for downstream NLP.
- Code-switching is the norm. In real Nigerian Hausa speech, speakers regularly switch between Hausa and English within a single sentence. A transcription system that can only handle monolingual Hausa will fail on most real-world recordings.
The Orinode STT model (used by Maraba) addresses all of these. We fine-tuned OpenAI Whisper (small) on 6.5 hours of Nigerian Hausa audio, sourced from Kano, Kaduna, and Sokoto speakers, combined with the Mozilla Common Voice Hausa set and proprietary call recordings. The result is a word error rate (WER) of approximately 18% on in-domain Nigerian Hausa, compared to roughly 40% WER from the base Whisper model on the same test set.
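For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a generic illustration, not Orinode's evaluation code; the function name word_error_rate and the whitespace tokenisation are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "Ina son in yi alƙawari da likita a ranar Talata"
hypothesis = "Ina son in yi alkawari da likita ranar Talata"
# One substitution (alƙawari -> alkawari) and one deletion (a) over 10 words.
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

Note that a stripped hook counts as a full word error here, which is one reason diacritic handling has a visible effect on reported WER.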
Prerequisites
Before you start, you need:
- A Maraba developer account (sign up at maraba.ai)
- An API key from Developer → API Keys in your dashboard
- Python 3.9+ with the requests library installed (pip install requests)
- An audio file containing Hausa speech: WAV, MP3, OGG, or FLAC formats are supported; 16kHz mono is optimal
The STT API is billed at ₦5 per minute of audio. A 30-second Hausa clip costs ₦2.50. There is no minimum charge on the STT endpoint.
Your first Hausa transcription: Python
The endpoint is POST /api/v1/transcribe/. You send a multipart form with the audio file and the language code ha. The response returns the transcript text, detected language, confidence score, and duration, along with per-word timestamps and the cost of the request.
import requests

API_KEY = "your-api-key-here"
AUDIO_FILE = "hausa_sample.wav"

# Send the audio as a multipart upload with the Hausa language code.
with open(AUDIO_FILE, "rb") as f:
    response = requests.post(
        "https://maraba.ai/api/v1/transcribe/",
        headers={"X-API-Key": API_KEY},
        data={"language": "ha"},
        files={"audio": (AUDIO_FILE, f, "audio/wav")},
        timeout=60,  # avoid hanging forever on a stalled connection
    )

# Raise an exception on any 4xx/5xx response before reading the body.
response.raise_for_status()
result = response.json()
print(result["transcript"])
print(f"Confidence: {result['confidence']:.2f}")
print(f"Duration: {result['duration_seconds']:.1f}s")
For a recording of the sentence "Ina son in yi alƙawari da likita a ranar Talata." (I would like to make an appointment with the doctor on Tuesday), the API returns:
{
  "transcript": "Ina son in yi alƙawari da likita a ranar Talata.",
  "language_detected": "ha",
  "confidence": 0.91,
  "duration_seconds": 3.4,
  "words": [
    {"word": "Ina", "start": 0.0, "end": 0.3, "confidence": 0.97},
    {"word": "son", "start": 0.3, "end": 0.55, "confidence": 0.95},
    {"word": "in", "start": 0.55, "end": 0.7, "confidence": 0.93},
    {"word": "yi", "start": 0.7, "end": 0.85, "confidence": 0.98},
    {"word": "alƙawari", "start": 0.85, "end": 1.4, "confidence": 0.88},
    {"word": "da", "start": 1.4, "end": 1.55, "confidence": 0.99},
    {"word": "likita", "start": 1.55, "end": 1.95, "confidence": 0.92},
    {"word": "a", "start": 1.95, "end": 2.1, "confidence": 0.97},
    {"word": "ranar", "start": 2.1, "end": 2.5, "confidence": 0.94},
    {"word": "Talata.", "start": 2.5, "end": 2.9, "confidence": 0.89}
  ],
  "cost_ngn": 0.28
}
Notice that the transcript preserves alƙawari with the hooked ƙ — not the plain ASCII "k". This is critical. In Hausa, ƙ and k are different phonemes. Transcribing alƙawari as "alƙawari" is correct; transcribing it as "alkawari" is phonemically wrong and will cause downstream errors in any NLP pipeline that works with Hausa text.
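The per-word confidence scores in the response can be used to flag words for human review. The following is a minimal sketch, assuming the response shape shown in the sample; the helper name low_confidence_words and the 0.90 threshold are illustrative choices, not part of the API.

```python
from typing import Dict, List


def low_confidence_words(result: Dict, threshold: float = 0.90) -> List[Dict]:
    """Return the word entries whose confidence falls below the threshold."""
    return [w for w in result.get("words", []) if w["confidence"] < threshold]


# Trimmed copy of the sample response (four of the ten words).
sample = {
    "transcript": "Ina son in yi alƙawari da likita a ranar Talata.",
    "words": [
        {"word": "Ina", "start": 0.0, "end": 0.3, "confidence": 0.97},
        {"word": "alƙawari", "start": 0.85, "end": 1.4, "confidence": 0.88},
        {"word": "likita", "start": 1.55, "end": 1.95, "confidence": 0.92},
        {"word": "Talata.", "start": 2.5, "end": 2.9, "confidence": 0.89},
    ],
}

for entry in low_confidence_words(sample):
    # The timestamps let a reviewer jump straight to the uncertain audio.
    print(f"{entry['word']} ({entry['start']:.2f}s to {entry['end']:.2f}s): {entry['confidence']:.2f}")
```

In the sample, the two flagged words are alƙawari and Talata., which is typical: words containing ejectives and words at the end of an utterance tend to score lower.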
JavaScript / Node.js example
const fs = require("fs");
const FormData = require("form-data");
const fetch = require("node-fetch");

const API_KEY = "your-api-key-here";
const AUDIO_FILE = "hausa_sample.wav";

async function transcribeHausa(filePath) {
  const form = new FormData();
  form.append("language", "ha");
  form.append("audio", fs.createReadStream(filePath), {
    filename: filePath,
    contentType: "audio/wav",
  });

  const response = await fetch("https://maraba.ai/api/v1/transcribe/", {
    method: "POST",
    headers: {
      "X-API-Key": API_KEY,
      ...form.getHeaders(),
    },
    body: form,
  });

  if (!response.ok) {
    const error = await response.json();
    throw new Error(`API error ${response.status}: ${error.error}`);
  }
  return response.json();
}

transcribeHausa(AUDIO_FILE)
  .then((result) => {
    console.log("Transcript:", result.transcript);
    console.log("Confidence:", result.confidence);
  })
  .catch(console.error);
The diacritic rule: never call .lower() on Hausa text
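This is important enough to state explicitly. When you receive a Hausa transcript from the API, do not apply Python's .lower(), JavaScript's .toLowerCase(), or any other normalisation to the text you store. Unicode-aware runtimes such as Python 3 do map Ƙ to ƙ correctly, but byte-oriented or locale-dependent case functions on some platforms can corrupt Hausa-specific characters, and lowercasing is often bundled with ASCII folding in text-cleaning helpers, which strips the hooked letters outright:
# RISKY: normalising the transcript puts Hausa diacritics at risk
transcript = result["transcript"]
lowered = transcript.lower()  # fine in Python 3 itself; legacy lowercasing or ASCII folding can turn "ƙ" into "k" or drop it
# CORRECT: preserve the transcript exactly as returned
transcript = result["transcript"]
# Use it as-is. Do not normalise case for Hausa.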
The specific characters to preserve in Hausa text:
- ƙ / Ƙ: hooked k, a distinct phoneme (ejective velar stop). Example: ƙarfi (strength), alƙali (judge).
- ɗ / Ɗ: hooked d, an implosive alveolar stop. Example: ɗaki (room), ɗan (son of).
- ɓ / Ɓ: hooked b, an implosive bilabial stop. Example: ɓangare (side, aspect).
- ƴ / Ƴ: hooked y, a glottalised palatal approximant (less common). Example: ƴar (daughter of).
The uppercase forms, along with lowercase ƙ and ƴ, sit in the Unicode Latin Extended-B block; lowercase ɗ and ɓ sit in the IPA Extensions block. All of them are in the Basic Multilingual Plane and are handled correctly by any system that stores and transmits text as UTF-8. Ensure your database columns, API endpoints, and storage layers are configured for UTF-8 (utf8mb4 in MySQL); most modern stacks already are.
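Before trusting any text-processing step with transcripts, it can help to compare the hooked letters before and after the step. The sketch below is an illustration only; hooked_letters and diacritics_preserved are hypothetical helper names, not part of any Maraba SDK. Run under Python 3, it contrasts Unicode-aware lowercasing with ASCII folding.

```python
HAUSA_HOOKED = set("ƙƘɗƊɓƁƴƳ")


def hooked_letters(text: str) -> list:
    """Return the Hausa hooked letters in text, in order of appearance."""
    return [ch for ch in text if ch in HAUSA_HOOKED]


def diacritics_preserved(original: str, processed: str) -> bool:
    """True if processing kept every hooked letter (case changes are allowed)."""
    before = [ch.lower() for ch in hooked_letters(original)]
    after = [ch.lower() for ch in hooked_letters(processed)]
    return before == after


text = "Ƙarfi da alƙawari"
lowered = text.lower()                            # Unicode-aware lowercasing
folded = text.encode("ascii", "ignore").decode()  # ASCII folding discards non-ASCII letters

print(diacritics_preserved(text, lowered))  # hooks survive Python 3 lowercasing
print(diacritics_preserved(text, folded))   # hooks are lost after ASCII folding
```

A check like this is cheap enough to run as a unit test around every cleaning function in a pipeline, including third-party slugifiers and search-index analysers.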
Handling code-switching audio
Much real-world Nigerian Hausa speech switches between Hausa and English mid-sentence. A caller might say: "Ina son order, but I want to confirm the price first." The Hausa opening shifts to English in the middle. To handle this, use the language=ha-en bilingual mode:
with open("codeswitched_audio.wav", "rb") as f:
    response = requests.post(
        "https://maraba.ai/api/v1/transcribe/",
        headers={"X-API-Key": API_KEY},
        data={"language": "ha-en"},  # bilingual mode
        files={"audio": ("audio.wav", f, "audio/wav")},
    )

result = response.json()
print(result["transcript"])
# Output: "Ina son order, but I want to confirm the price first."
# The transcript preserves the language switch naturally
In bilingual mode the model detects the dominant language of each segment and switches accordingly. Word-level timestamps are still returned, covering the Hausa and the English words alike.
Streaming transcription for real-time use cases
The standard endpoint processes a complete audio file. For real-time applications — such as transcribing a live phone call — use the WebSocket streaming endpoint at wss://maraba.ai/api/v1/transcribe/stream/. This returns partial transcripts as the audio arrives, with a latency of approximately 600–900ms on typical Nigerian network conditions.
Streaming is out of scope for this tutorial but is documented in the full STT API reference.
Common failure modes and fixes
Here are the transcription problems you are most likely to encounter with Hausa audio and how to fix them:
- Vowel length confusion. Hausa vowel length is phonemically contrastive, so two words can differ only in how long a vowel is held. In low-quality audio, the model may mis-transcribe such pairs. Fix: record at 16kHz mono minimum, and avoid heavy audio compression before submission.
- Ejective consonants transcribed as plain ones (for example ƙ as k). This happens when audio is heavily compressed or the speaker's ejective consonant is under-articulated. The model makes its best guess. Fix: use higher bitrate audio. If you are generating synthetic test audio, use the Orinode TTS API with voice=ha-NG to create Hausa audio with correct phoneme rendering.
- English words garbled in code-switched audio. If you submit a bilingual recording with language=ha instead of language=ha-en, the model will try to force English words into Hausa phonology. Fix: use language=ha-en for any audio that may contain English.
- Unsupported format errors. The API returns a 400 with "code": "unsupported_format" if the audio codec is not recognised. Accepted MIME types: audio/wav, audio/mpeg (MP3), audio/ogg, audio/flac. AAC files will be rejected. Convert with ffmpeg: ffmpeg -i input.aac -ar 16000 -ac 1 output.wav.
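A pre-flight check on the client can catch unsupported files before they cost a request. The following is a minimal sketch: the MIME types and the ffmpeg flags come from the text above, while the extension mapping and the helper names mime_type_for and ffmpeg_convert_command are assumptions for illustration.

```python
from pathlib import Path

# MIME types accepted by the API; the file-extension mapping is an assumption.
ACCEPTED_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mpeg",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}


def mime_type_for(path: str) -> str:
    """Return the MIME type to send, or raise if the format would be rejected."""
    suffix = Path(path).suffix.lower()
    if suffix not in ACCEPTED_MIME_TYPES:
        raise ValueError(f"{suffix or 'missing extension'} is not accepted; convert to WAV first")
    return ACCEPTED_MIME_TYPES[suffix]


def ffmpeg_convert_command(src: str, dst: str) -> list:
    """Build the ffmpeg invocation for 16 kHz mono output."""
    return ["ffmpeg", "-i", src, "-ar", "16000", "-ac", "1", dst]


print(mime_type_for("hausa_sample.wav"))
print(" ".join(ffmpeg_convert_command("input.aac", "output.wav")))
```

The command list can be passed to subprocess.run(cmd, check=True) on a machine where ffmpeg is installed. Checking by extension is only a heuristic, since a mislabelled file will still be rejected by the server.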
Rate limits and error codes
Developer accounts have a default rate limit of 60 requests per minute. For batch transcription of large audio archives, use the X-Rate-Limit-Remaining response header to pace your requests, or contact support to request a higher limit.
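One way to pace a batch job with that header is sketched below. It assumes the header value is an integer count and that the quota refills within 60 seconds of reaching zero; the reset window is inferred from the per-minute limit rather than documented, and seconds_to_wait and pace are hypothetical helper names.

```python
import time
from typing import Callable, Mapping


def seconds_to_wait(headers: Mapping[str, str], window_seconds: float = 60.0) -> float:
    """Decide how long to pause before the next request."""
    raw = headers.get("X-Rate-Limit-Remaining")
    if raw is None:
        return 0.0  # header absent: nothing to pace against
    try:
        remaining = int(raw)
    except ValueError:
        return 0.0  # unexpected value: do not block the batch
    # Assumption: once the quota hits zero it refills within one window.
    return window_seconds if remaining <= 0 else 0.0


def pace(headers: Mapping[str, str], sleep: Callable[[float], None] = time.sleep) -> float:
    """Sleep if the quota is exhausted and return the delay that was applied."""
    delay = seconds_to_wait(headers)
    if delay > 0:
        sleep(delay)
    return delay
```

In a batch loop, calling pace(response.headers) after each upload keeps the job under the limit without hard-coding a fixed delay between requests.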
Key error codes returned in {"error": "...", "code": "...", "detail": {}} format:
- auth_failed: invalid or missing API key
- quota_exceeded: you have exceeded your monthly STT quota; top up credits or upgrade plan
- audio_too_long: audio file exceeds the 120-minute limit per request
- unsupported_format: unrecognised audio codec
- language_not_supported: an invalid language code was supplied; valid options include ha, yo, ig, en, ha-en, yo-en, ig-en
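Client code can turn that error body into a typed exception so callers can branch on the code rather than parse messages. This is a sketch under the stated error format; the TranscriptionError class, the error_from_payload helper, and the suggested actions are illustrative and not part of an official client.

```python
class TranscriptionError(Exception):
    """Carries the fields of the API error body plus a suggested next step."""

    def __init__(self, status: int, code: str, message: str, detail: dict, suggestion: str):
        super().__init__(f"{status} {code}: {message}")
        self.status = status
        self.code = code
        self.detail = detail
        self.suggestion = suggestion


# Next steps paraphrased from the error code list above.
SUGGESTED_ACTION = {
    "auth_failed": "check the X-API-Key header",
    "quota_exceeded": "top up credits or upgrade the plan",
    "audio_too_long": "split the file so each request stays under 120 minutes",
    "unsupported_format": "convert the audio to WAV, MP3, OGG, or FLAC",
    "language_not_supported": "use one of ha, yo, ig, en, ha-en, yo-en, ig-en",
}


def error_from_payload(status: int, payload: dict) -> TranscriptionError:
    """Build an exception from the JSON body of a failed response."""
    code = payload.get("code", "unknown")
    return TranscriptionError(
        status=status,
        code=code,
        message=payload.get("error", ""),
        detail=payload.get("detail") or {},
        suggestion=SUGGESTED_ACTION.get(code, "see the API reference"),
    )


# Hypothetical body for a rejected AAC upload.
err = error_from_payload(400, {"error": "Unsupported audio", "code": "unsupported_format", "detail": {}})
print(err, "->", err.suggestion)
```

With the requests example earlier, this would be raised with error_from_payload(response.status_code, response.json()) whenever response.ok is false.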
What you can build with Hausa STT
The practical applications for Hausa speech recognition in the Nigerian market are substantial. Here are the use cases developers are building today:
- Call center transcription. Log and analyse inbound customer calls from northern Nigerian customers. Identify common complaint themes, recurring product questions, and escalation triggers — in Hausa.
- Voice-to-CRM pipelines. Sales reps in Kano making field calls can dictate notes in Hausa. The transcript feeds directly into your CRM without requiring a human transcriptionist.
- Accessibility tools. Voice input for northern Nigerian users who are more comfortable speaking Hausa than typing in English.
- Broadcast monitoring. Northern Nigerian radio stations broadcast heavily in Hausa. STT enables keyword monitoring, compliance logging, and content indexing.
- Court and legal transcription. Many northern Nigerian court proceedings include Hausa testimony. Orinode STT provides a starting-point transcript that a human reviewer can correct.
If you are building a Hausa-language voice application end-to-end, pair the STT API with the Hausa TTS output for a full speech-in, speech-out pipeline. For detecting which Nigerian language a speaker is using before you route to the correct STT model, see the Nigerian Language Detection API guide.
Pricing
The STT API charges ₦5 per minute of audio, billed in 10-second increments. There is no setup fee and no monthly minimum. You pay only for what you transcribe. A 1,000-minute batch transcription of Hausa call recordings costs ₦5,000.
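The arithmetic can be sketched as follows, assuming each file's duration is rounded up to the next 10-second increment. The rounding direction is an assumption, and the sample response earlier reports ₦0.28 for a 3.4-second clip, which looks pro-rated per second, so treat the cost_ngn field in the response as authoritative. The function name estimate_cost_ngn is illustrative.

```python
import math


def estimate_cost_ngn(duration_seconds: float,
                      rate_per_minute: float = 5.0,
                      increment_seconds: int = 10) -> float:
    """Estimate the charge for one file, rounding up to whole billing increments."""
    if duration_seconds <= 0:
        return 0.0
    increments = math.ceil(duration_seconds / increment_seconds)
    billed_seconds = increments * increment_seconds
    return round(billed_seconds / 60 * rate_per_minute, 2)


print(estimate_cost_ngn(30))      # a 30-second clip
print(estimate_cost_ngn(60_000))  # 1,000 minutes of recordings
```

For budgeting a batch of many short clips, summing per-file estimates matters more than total duration, because under this assumption each file is rounded up separately.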
API access is available on all Maraba plans including the free tier. Free plan accounts receive 50 API minutes per month. Starter (₦15,000/month) and Pro (₦45,000/month) plans include higher included API usage; additional usage is charged at the PAYG rate.
Sign up free, get your API key, and make your first Hausa transcription in under five minutes. No credit card required for the free tier.