A question that comes up frequently when Nigerian companies evaluate AI products: "Can we just use ChatGPT for this?" The answer is nuanced. ChatGPT and GPT-4 are powerful language models with genuinely impressive English capabilities. But a conversational AI that answers phone calls in Nigeria requires capabilities at several layers that ChatGPT does not address — and that most Western AI products have never had to solve.
This is not a criticism of those products. They were built for markets where those problems do not exist. But understanding why Nigerian voice AI is its own distinct engineering challenge explains why solutions built in San Francisco rarely work without substantial adaptation when deployed in Lagos.
Layer 1: The acoustic challenge
Phone calls in Nigeria do not arrive under ideal acoustic conditions. A caller from Onitsha Market is surrounded by generator noise, market calls, and traffic. A caller from a Kano shop is in a room with fans, street noise through an open door, and other people speaking. A caller from an Abuja office may be on a 3G connection with packet loss that creates brief audio dropout every few seconds.
The automatic speech recognition (ASR) models on major platforms are optimised for, and tested on, American and British speakers in quiet environments on broadband connections. Word error rates published in research papers are typically measured under controlled conditions. Under Nigerian phone call conditions (GSM audio codec, background noise, variable signal quality), the performance of those models degrades substantially.
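For reference, word error rate is the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. A minimal sketch in Python follows; the function name and the sample transcripts are illustrative and not taken from any particular toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Invented transcripts: the recogniser drops one word out of six.
    clean = word_error_rate("do you have clopidogrel in stock",
                            "do you have clopidogrel in stock")
    noisy = word_error_rate("do you have clopidogrel in stock",
                            "do you have in stock")
    print(f"clean: {clean:.3f}  noisy: {noisy:.3f}")
```

Scoring the same recogniser on studio recordings and on real call recordings with a function like this makes the degradation described above measurable rather than anecdotal.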
Building for Nigerian callers requires training ASR models on recordings that reflect actual Nigerian call conditions: real background noise profiles, GSM codec artefacts, and the acoustic signatures of calls made in markets, clinics, restaurants, and homes across different regions of Nigeria. The model needs to be robust to these conditions, not just to clean studio audio.
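One common way to approximate such conditions when real recordings are scarce is to mix recorded background noise into clean speech at a chosen signal-to-noise ratio. The sketch below shows only that additive-noise step, in plain Python; it does not model GSM codec artefacts or packet loss, and the function names, the synthetic tone, and the 10 dB target are assumptions for illustration rather than a description of any production training pipeline.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise ratio is snr_db."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise clip if it is shorter than the speech, then trim it.
    repeats = len(speech) // len(noise) + 1
    noise = (list(noise) * repeats)[:len(speech)]

    noise_rms = rms(noise)
    if noise_rms == 0:
        raise ValueError("noise clip is silent")
    # SNR in dB is 20 * log10(speech_rms / noise_rms); solve for noise gain.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


if __name__ == "__main__":
    sample_rate = 8000  # narrowband telephony rate
    # Stand-ins for real audio: a 440 Hz tone and random noise.
    speech = [math.sin(2 * math.pi * 440 * t / sample_rate)
              for t in range(sample_rate)]
    rng = random.Random(0)
    market_noise = [rng.gauss(0.0, 1.0) for _ in range(3000)]
    noisy = mix_at_snr(speech, market_noise, snr_db=10)
    print(len(noisy), "samples mixed at 10 dB SNR")
```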
Layer 2: The language problem
GPT-4 handles English very well. It handles Yoruba, Hausa, and Igbo in a rudimentary way — it can translate simple sentences and generate grammatically plausible text in these languages, but its training data for them is thin relative to English, and its speech capabilities (GPT-4 Voice) were not trained on sufficient Nigerian-language audio to handle casual conversational speech.
But the deeper problem is not just that GPT-4 handles Yoruba imperfectly. It is that the task requires handling code-switching within a single utterance — switching between languages mid-sentence, mid-phrase, sometimes mid-word — in a way that Western LLMs were not designed for. GPT-4 processes a single input and produces a single output. When the input contains Yoruba, English, and possibly a Pidgin phrase in sequence, the model needs to understand the boundaries between them and process each in its relevant linguistic context.
Consider a single caller sentence that combines a Yoruba greeting (E kaaro), an Igbo discourse marker (biko, "please"), an English medical query, a Yoruba discourse marker (sha, roughly "anyway" or "just"), a Nigerian Pidgin English construction (make I come), and standard English. A model that processes this as unified English input will produce incorrect output.
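To make the boundary problem concrete, here is a toy sketch of span-level language tagging in Python. The lexicon contains only the fragments named above, the fallback to English is an assumption, and the test utterance is invented for illustration. A real system would use a trained language-identification model over audio or text rather than a lookup table.

```python
# Hypothetical mini-lexicon built only from the fragments named in the text.
LEXICON = {
    ("e", "kaaro"): "yoruba",
    ("biko",): "igbo",
    ("sha",): "yoruba",
    ("make", "i", "come"): "pidgin",
}
MAX_PHRASE = max(len(phrase) for phrase in LEXICON)


def tag_languages(utterance, default="english"):
    """Split an utterance into (language, text) spans via longest-match lookup."""
    tokens = utterance.lower().replace(",", " ").split()
    spans = []
    i = 0
    while i < len(tokens):
        # Try the longest lexicon phrase that starts at position i first.
        for n in range(min(MAX_PHRASE, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in LEXICON:
                label = LEXICON[tuple(tokens[i:i + n])]
                break
        else:
            # No lexicon entry matched: treat the single token as the default.
            n, label = 1, default
        text = " ".join(tokens[i:i + n])
        if spans and spans[-1][0] == label:
            # Merge with the previous span when the language is unchanged.
            spans[-1] = (label, spans[-1][1] + " " + text)
        else:
            spans.append((label, text))
        i += n
    return spans


if __name__ == "__main__":
    # Invented utterance, not a quotation from a real call.
    test_utterance = "E kaaro, biko do you have paracetamol, sha make I come today"
    for language, text in tag_languages(test_utterance):
        print(f"{language:8s} {text}")
```

Even this toy version shows why the boundaries matter: "make I come" is made of English words, so a tagger that works token by token would label it English and miss the Pidgin construction.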
Layer 3: The knowledge problem
GPT-4 has broad general knowledge — it knows what a pharmacy is, what Clopidogrel is used for, and what an appointment means. What it does not know is what a specific pharmacy in Lagos has in stock on a Tuesday afternoon, what Dr Eze's available appointment slots are, or what the delivery status of package ND-7743 is.
An AI that answers business calls must connect general language capability to specific business knowledge. For Maraba, this means the knowledge base system — a structured store of business-specific information that Maraba consults when answering caller queries. The LLM provides language understanding and response generation; the knowledge base provides the facts about the specific business.
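The division of labour can be sketched in a few lines of Python. This is a hypothetical illustration, not Maraba's implementation: the `Fact` type, the keyword-overlap retrieval, the stopword list, the prompt wording, and every sample fact are assumptions, and a production system would more likely use embedding-based search over a proper data store.

```python
from dataclasses import dataclass

# Small, hand-picked stopword list (an assumption for this sketch).
STOPWORDS = {"a", "at", "for", "has", "in", "is", "of", "on", "the", "what"}


@dataclass
class Fact:
    topic: str
    text: str


def tokenize(text):
    """Lowercased content words with surrounding punctuation stripped."""
    words = {w.strip(".,?!").lower() for w in text.split()}
    return words - STOPWORDS


def retrieve(query, facts, top_k=2):
    """Return up to top_k facts sharing the most content words with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(f.topic + " " + f.text)), f)
              for f in facts]
    scored = [(score, f) for score, f in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [f for _, f in scored[:top_k]]


def build_prompt(query, facts):
    """Combine retrieved business facts with the caller's question."""
    retrieved = retrieve(query, facts)
    context = "\n".join(f"- {f.text}" for f in retrieved) or "- (no matching facts)"
    return ("Answer the caller using only these business facts:\n"
            f"{context}\n"
            f"Caller: {query}")


if __name__ == "__main__":
    # Invented sample data for one business.
    knowledge_base = [
        Fact("appointments", "Dr Eze has a free slot on Thursday at 10:00."),
        Fact("delivery", "Package ND-7743 is out for delivery."),
        Fact("stock", "Clopidogrel 75mg is in stock."),
    ]
    print(build_prompt("What is the status of package ND-7743?", knowledge_base))
```

The string returned by `build_prompt` is what would be passed to the language model, so the model supplies the phrasing while the facts come only from the business's own records.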
Building this retrieval-augmented architecture correctly for Nigerian businesses requires understanding how Nigerian businesses structure information — the categories of information callers typically ask for, the terminology they use, the way pricing, availability, and schedules are typically described in different industry verticals. A RAG system tuned for American e-commerce does not automatically work well for a Kano pharmacy or a Port Harcourt logistics company.
Layer 4: The infrastructure constraint
A voice AI system that processes a call must complete its speech-to-text, intent extraction, knowledge base lookup, response generation, and text-to-speech steps within the time budget that feels natural in a phone conversation. In practice, a caller will tolerate about 1.5–2 seconds of silence before they assume the call has dropped or the system has failed.
In the US and Europe, this latency budget is achievable because inference can run on servers physically close to callers — AWS us-east-1 is milliseconds away from most American callers. In Nigeria, latency to major cloud regions — AWS Cape Town (af-south-1), Lagos PoPs, European regions — is higher. A pipeline that runs in 600ms for an American caller may run in 900–1200ms for a Nigerian caller, depending on routing and the specific path the audio travels.
Maraba runs its inference on AWS af-south-1 (Cape Town) and uses Africa's Talking for telephony routing, which keeps audio as local to the African continent as possible. Model optimisation — quantisation, batching, edge caching of frequent responses — brings the total pipeline latency to under 1.8 seconds in most conditions. This is achievable, but it requires engineering attention to every step in the pipeline specifically for the West African network topology.
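A simple way to keep that attention honest is to track a per-stage latency budget. The Python sketch below sums the pipeline stages named earlier plus network time and compares the total with the 1.8-second figure. The per-stage and network numbers are invented placeholders, not measurements, and the function name is hypothetical.

```python
BUDGET_MS = 1800  # "under 1.8 seconds", taken from the text above


def check_budget(stage_ms, network_ms, budget_ms=BUDGET_MS):
    """Sum stage and network latency and compare against the budget."""
    total = sum(stage_ms.values()) + network_ms
    return {
        "total_ms": total,
        "headroom_ms": budget_ms - total,
        "within_budget": total <= budget_ms,
        # The stage to optimise first if the budget is exceeded.
        "slowest_stage": max(stage_ms, key=stage_ms.get),
    }


if __name__ == "__main__":
    # Placeholder timings in milliseconds; real values must be measured.
    stages = {
        "speech_to_text": 350,
        "intent_extraction": 80,
        "knowledge_base_lookup": 60,
        "response_generation": 450,
        "text_to_speech": 300,
    }
    report = check_budget(stages, network_ms=300)
    for key, value in report.items():
        print(f"{key}: {value}")
```

Running a check like this against measured timings from each network path shows which stage consumes the headroom when routing adds a few hundred milliseconds.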
Layer 5: The cultural and contextual layer
Language models trained primarily on Western data carry Western cultural assumptions. GPT-4 knows what a clinic appointment means in an American context. It does not automatically know that a Nigerian caller saying "I will come, God willing" is not postponing indefinitely; it is a common expression of intent that, in context, means "I am planning to come." Nor does it know that a Yoruba caller who opens with an extended greeting before stating their business is not wasting time; they are following a culturally appropriate communication sequence.
A Nigerian-built AI has the advantage of being designed by people who live in and understand the cultural context. Response generation that sounds natural to a Nigerian caller — appropriate register, appropriate greetings and closings, appropriate handling of indirect communication — requires cultural knowledge that cannot be extracted from training data alone. It is built through iteration with real Nigerian callers across real Nigerian industries.
What this means for buyers of AI products
When evaluating an AI product for a Nigerian business, the question to ask is not "does this product support Hausa?" but "was this product built with Nigerian acoustic data, Nigerian code-switching patterns, Nigerian cultural context, and Nigerian infrastructure constraints at the design level?"
A product that supports Hausa via a translation layer will fail in noisy conditions when the translation step degrades. A product built on Western infrastructure without latency optimisation for West Africa will feel slow and unreliable. A product that uses a standard English LLM with minimal Nigerian fine-tuning will produce responses that feel generic or culturally off to Nigerian callers.
The Nigerian market is large enough, and the infrastructure and language challenges distinctive enough, that it requires solutions built specifically for it — not adapted from products designed for London and New York.
Maraba was designed for Nigerian acoustic conditions, Nigerian languages, and Nigerian businesses. Start free with 50 calls — limited beta spots.