Explainer

How AI phone answering actually works — for Nigerian business owners

A caller rings your clinic number and asks in Yoruba about today's doctor hours. Within 1.8 seconds, they get a clear, spoken answer. Here is exactly what happens in between — no jargon, no hype.

Most Nigerian business owners who ask about AI call answering have the same question underneath the question: is this actually real, and how does it work? The demos look clean. But what is actually happening when someone's MTN line connects to your Maraba number and starts speaking Yoruba?

This is the plain-language explanation. No computer science degree required.

The scenario

A patient calls Sunrise Clinic in Abuja at 7:45am. The front desk does not open until 8am. The caller says: "Ẹ káàárọ̀, I want to know if Dr. Adeyemi is around today." She opens in Yoruba and switches to English mid-sentence. This is entirely normal in Lagos and Abuja. Most phone systems cannot handle it. Maraba handles it without pausing.

Here is what happens in the roughly 1.8 seconds between her finishing that sentence and hearing a clear reply.

Step 1: The call arrives (0ms)

The caller dials your dedicated Maraba number. That number is provisioned through Africa's Talking, a telephony infrastructure provider built for the African market. Africa's Talking routes the call through Nigerian network interconnects, handling MTN, Airtel, Glo, and 9mobile traffic natively.

Why does this matter? Because Nigerian network conditions are different from American or European ones. Call quality fluctuates. Latency varies. Africa's Talking is built specifically for these conditions, and Maraba runs on top of it by design, not as an afterthought. If Africa's Talking has a fault on a particular route, the call automatically fails over to Twilio as a secondary layer. From the caller's perspective, nothing changes.
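For the technically curious, the failover idea can be sketched in a few lines of Python. This is only an illustration of "try the primary route, then the secondary": the function names, the `RouteUnavailable` error, and the two stand-in providers are hypothetical and are not calls from the Africa's Talking or Twilio SDKs.

```python
from typing import Callable, Sequence, Tuple


class RouteUnavailable(Exception):
    """Raised by a (hypothetical) provider that cannot carry the call."""


def connect_with_failover(
    call_id: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each provider in order and return (provider_name, session_id)."""
    errors = []
    for name, connect in providers:
        try:
            return name, connect(call_id)
        except RouteUnavailable as exc:
            # Remember why this route failed, then try the next one.
            errors.append(f"{name}: {exc}")
    raise RouteUnavailable("; ".join(errors))


# Hypothetical stand-ins for the primary and secondary telephony layers.
def primary_connect(call_id: str) -> str:
    raise RouteUnavailable("fault on this route")


def secondary_connect(call_id: str) -> str:
    return f"session-{call_id}"


used, session = connect_with_failover(
    "abc123",
    [("primary", primary_connect), ("secondary", secondary_connect)],
)
print(used, session)
```

The caller only ever sees the session that succeeded, which is why nothing changes from her side.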

The call is live. Maraba answers within the first ring.

Step 2: Maraba greets the caller

The moment the call connects, Maraba plays your configured greeting. This is a text-to-speech (TTS) audio file generated from the greeting text you set in your dashboard — something like: "Thank you for calling Sunrise Clinic. I'm Maraba, how can I help you today?"

The greeting is pre-generated so there is no delay. The caller hears it instantly.

Step 3: The caller speaks — Speech-to-Text (STT) begins (~0–800ms)

This is the core of the system. The caller speaks her sentence: "Ẹ káàárọ̀, I want to know if Dr. Adeyemi is around today."

As she speaks, the audio stream is being captured and passed through a Voice Activity Detection (VAD) layer. VAD's job is simple but important: distinguish between speech and silence so the system knows when the caller has finished speaking and when she is just pausing mid-thought. On Nigerian calls, where network hiccups can create brief audio gaps, good VAD makes a significant difference to accuracy.
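If you want to see the idea in code, here is a minimal Python sketch of end-of-utterance detection. It is an assumption-laden toy, not Maraba's VAD: it treats each audio frame as a single loudness number, and the `threshold` and `hangover_frames` values are invented for illustration. The point it shows is the "hangover": a short gap does not end the utterance, only a long enough silence does.

```python
from typing import List, Optional


def find_end_of_utterance(
    frame_energies: List[float],
    threshold: float = 0.1,      # assumed loudness cut-off between speech and silence
    hangover_frames: int = 25,   # assumed number of silent frames that counts as "finished"
) -> Optional[int]:
    """Return the frame index where the utterance is declared finished, or None."""
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= hangover_frames:
                return i
    return None


# 10 frames of speech, a 5-frame network gap, 10 more frames of speech, then silence.
frames = [0.5] * 10 + [0.0] * 5 + [0.5] * 10 + [0.0] * 30
print(find_end_of_utterance(frames))  # 49
```

The 5-frame gap in the middle is ignored because it is shorter than the hangover, which is the behaviour you want on a line with brief audio dropouts.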

Once VAD confirms the caller has finished her utterance, the audio is passed to the Speech-to-Text engine. Maraba uses a custom STT model — one trained on Nigerian voice data, including Yoruba, Hausa, Igbo, and Nigerian English — not a generic model adapted from American English speech. The STT engine converts the audio into a text transcript: "Ẹ káàárọ̀, I want to know if Dr. Adeyemi is around today."

Critically, the diacritics are preserved. The under-dots in ẹ, ọ, and ṣ, which mark distinct sounds, and the accents that mark tone (as in à and á) are not stripped out. This matters both for understanding meaning and for generating a natural spoken reply.
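To make "preserved" concrete, here is a small Python sketch using the standard `unicodedata` module. It contrasts a transcript step that keeps every mark (Unicode NFC normalisation) with a lossy step that throws them away. The function names are illustrative; how Maraba's own pipeline handles text internally is not described here.

```python
import unicodedata


def normalize_transcript(text: str) -> str:
    """Keep every mark, but store each letter in one consistent (NFC) form."""
    return unicodedata.normalize("NFC", text)


def strip_marks(text: str) -> str:
    """What a lossy pipeline does: split letters from marks, then drop the marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


greeting = "Ẹ káàárọ̀"
print(normalize_transcript(greeting))  # marks intact
print(strip_marks(greeting))           # E kaaaro
```

The second line of output is what a system built only for English would work with: the letters survive, but the information carried by the marks is gone.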

Step 4: Language detection runs simultaneously

While STT is running, a language detection model analyses the audio. In this case, it identifies that the caller opened in Yoruba and switched to English. This bilingual detection is what makes Maraba's code-switching support real — it is not a feature bolted on top. The system understands that this caller is most comfortable being responded to in a mix of Yoruba and English, and Maraba's reply reflects that.

This is what most AI systems sold to Nigerian businesses cannot do. They handle English only. Some handle a second language as a separate mode you select from a menu. Maraba handles mid-sentence switching because it was built with that as a baseline, not an addition.
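As a rough picture of what "detecting a switch" produces, here is a toy Python sketch that splits a transcript into language spans. It is deliberately simplistic and works on text with a two-word hint list taken from the example call, whereas the detection described above runs on the audio with a trained model. Treat the names and the approach as assumptions for illustration only.

```python
import unicodedata
from itertools import groupby
from typing import List, Tuple


def _norm(word: str) -> str:
    """Lower-case, strip punctuation, and use one consistent Unicode form."""
    return unicodedata.normalize("NFC", word.strip(".,?!\"'").lower())


# Tiny illustrative hint list; a real system would not rely on a word list.
YORUBA_HINTS = {_norm(w) for w in ("Ẹ", "káàárọ̀")}


def language_spans(transcript: str) -> List[Tuple[str, str]]:
    """Group consecutive words that share a language tag ('yo' or 'en')."""
    tagged = [("yo" if _norm(w) in YORUBA_HINTS else "en", w) for w in transcript.split()]
    return [
        (lang, " ".join(word for _, word in group))
        for lang, group in groupby(tagged, key=lambda pair: pair[0])
    ]


spans = language_spans("Ẹ káàárọ̀, I want to know if Dr. Adeyemi is around today.")
for lang, text in spans:
    print(lang, "->", text)
```

The useful output is the sequence of spans, not a single language label: a Yoruba opening followed by an English request is exactly the signal the reply should respect.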

Step 5: The LLM understands the intent (~800–1,200ms)

The transcript is now text. It goes to the language model (LLM) — the layer responsible for actually understanding what the caller is asking and deciding what to say back.

The LLM does two things simultaneously. First, it classifies the intent. In this case: doctor availability query. Second, it searches the clinic's knowledge base for the relevant information. The knowledge base is what you configure in your Maraba dashboard — your opening hours, your staff schedule, your services, your location instructions, your escalation rules. This is your data, scoped only to your account. The LLM cannot access any other business's information.

In the knowledge base, it finds: Dr. Adeyemi is available on Mondays, Wednesdays, and Fridays. Today is Wednesday. Clinic opens at 8am.

The LLM composes a response. It knows the caller's preferred language mix. It keeps the reply concise — two or three sentences. It does not fabricate information not in the knowledge base. If the answer is not there, it says so and offers to take a message.
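The "answer only from the knowledge base" rule can be sketched in Python like this. It is a simplified stand-in: a real system uses a language model rather than a dictionary lookup, and the `KNOWLEDGE_BASE` layout, the function name, and the fallback wording are all hypothetical. The schedule values are the ones from the example above.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

FALLBACK = "I do not have that information. Would you like to leave a message?"

# Hypothetical knowledge base for one account, using the example clinic's data.
KNOWLEDGE_BASE = {
    "opening_time": "8am",
    "doctor_days": {"Dr. Adeyemi": {"Monday", "Wednesday", "Friday"}},
}


def answer_availability(doctor: str, today: date, kb: dict) -> str:
    """Answer a doctor availability query using only what the knowledge base contains."""
    days = kb.get("doctor_days", {}).get(doctor)
    if days is None:
        # Nothing on file for this doctor: say so instead of guessing.
        return FALLBACK
    if WEEKDAYS[today.weekday()] in days:
        return f"Yes, {doctor} is available today. The clinic opens at {kb['opening_time']}."
    return f"{doctor} is not available today."


wednesday = date(2025, 1, 8)
print(answer_availability("Dr. Adeyemi", wednesday, KNOWLEDGE_BASE))
print(answer_availability("Dr. Okafor", wednesday, KNOWLEDGE_BASE))
```

The second call asks about a doctor who is not in the knowledge base, so the reply is the fallback rather than an invented answer.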

Step 6: Text-to-Speech generates the reply (~1,200–1,600ms)

The text response is passed to the Text-to-Speech engine, which converts it to natural-sounding audio. The voice is configured to match your business — warm and professional, not robotic. Maraba's TTS is trained on Nigerian speech patterns, so the rhythm and intonation sound natural to a Nigerian ear.

Maraba speaks: "Yes, Dr. Adeyemi is available today. The clinic opens at 8am — just about 15 minutes from now. Would you like me to take your name for when you arrive?"

The caller has her answer. She did not wait on hold. She did not hear "press 1 for English." She did not hang up.

Step 7: The call ends — the summary begins

After the call ends, a post-call processing task runs. The full transcript is analysed and structured into a call summary: the caller's intent, the sentiment of the conversation (neutral, positive, frustrated, urgent), a priority tag, and any action items. In this case: Intent — Doctor availability query. Outcome — Resolved. Action — No follow-up required.
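A call summary of this kind is easy to picture as a small data record. The Python sketch below is an assumed shape, not Maraba's actual format: the field names, the `to_message` layout, and the "normal" priority value are hypothetical, while the four sentiment labels and the example intent and outcome come from the description above.

```python
from dataclasses import dataclass, field
from typing import List

SENTIMENTS = {"neutral", "positive", "frustrated", "urgent"}


@dataclass
class CallSummary:
    intent: str
    outcome: str
    sentiment: str
    priority: str  # the set of priority tags is not specified; "normal" is assumed below
    action_items: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.sentiment not in SENTIMENTS:
            raise ValueError(f"unknown sentiment: {self.sentiment}")

    def to_message(self) -> str:
        """Render the summary as short plain text, suitable for a chat message."""
        actions = "; ".join(self.action_items) or "No follow-up required"
        return "\n".join([
            f"Intent: {self.intent}",
            f"Outcome: {self.outcome}",
            f"Sentiment: {self.sentiment}",
            f"Priority: {self.priority}",
            f"Action: {actions}",
        ])


summary = CallSummary(
    intent="Doctor availability query",
    outcome="Resolved",
    sentiment="neutral",   # assumed for this example call
    priority="normal",     # hypothetical tag
)
print(summary.to_message())
```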

Within 60 seconds of the call ending, you receive this summary on WhatsApp and email. Not at the end of the day. Not in a dashboard you have to log into. On WhatsApp, where you actually read things, within a minute.

What the system does not do

Maraba does not guess. If the caller asks something not covered by your knowledge base, Maraba says it does not have that information and offers to take a message or transfer the call. It does not improvise answers that could be wrong. You set the rules; Maraba follows them.

Maraba does not diagnose, give medical advice, give legal advice, or make commitments on behalf of your business beyond what your knowledge base authorises. It is your AI receptionist — it knows what you have told it to know.

What this costs

The Starter plan on Maraba is ₦20,000 per month and covers 200 calls. For a clinic running on that plan, the per-call cost is ₦100. A missed appointment booking at the average Lagos private clinic costs considerably more than ₦100 in lost revenue. For higher volumes, the Pro plan at ₦65,000 per month covers 1,000 calls. Calls beyond the plan limit are charged at ₦0.50 per second — ₦25 minimum per call — billed after the call ends, deducted from your pre-loaded credit.
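The arithmetic behind those figures is simple enough to check yourself. The Python sketch below uses the prices quoted above; it assumes the ₦25 minimum works as a floor on the per-second charge and does not model any rounding of call duration, since neither detail is spelled out here.

```python
STARTER = {"monthly_fee": 20_000, "included_calls": 200}
PRO = {"monthly_fee": 65_000, "included_calls": 1_000}

OVERAGE_RATE_PER_SECOND = 0.50  # naira per second beyond the plan limit
OVERAGE_MINIMUM = 25.0          # naira, minimum per overage call


def cost_per_included_call(plan: dict) -> float:
    """Monthly fee divided by included calls, assuming the full allowance is used."""
    return plan["monthly_fee"] / plan["included_calls"]


def overage_charge(duration_seconds: float) -> float:
    """Charge for one call beyond the plan limit (minimum treated as a floor)."""
    return max(OVERAGE_MINIMUM, duration_seconds * OVERAGE_RATE_PER_SECOND)


print(cost_per_included_call(STARTER))  # 100.0
print(cost_per_included_call(PRO))      # 65.0
print(overage_charge(30))               # 25.0
print(overage_charge(90))               # 45.0
```

Under these assumptions the minimum applies to any overage call shorter than 50 seconds, and a fully used Pro plan works out to ₦65 per call.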

There is a free plan — 50 calls per month, English only — if you want to try the system before committing.

See it working on your own business number

Start free — limited beta spots. Your first 50 calls in English are on us. Upgrade to add Hausa, Igbo, and Yoruba.

Request beta →