Voice to Text in Nigeria: Business Use Cases and APIs

Every business call is untapped data

A medium-sized Lagos business might take 100 inbound calls per day. Over a five-day week, that is 500 conversations with customers, each containing real data: what people want to buy, what problems they are having, which questions the website does not answer, and which products they keep asking about that you do not stock.

Without voice to text, this data evaporates. A staff member might note a few details in a logbook. The rest is lost. Patterns that would be obvious from a dataset of 500 calls — a recurring product question, a delivery complaint pattern, a time-of-day spike in enquiries about a specific service — are invisible because no one can review 500 calls manually.

Voice to text (STT — speech to text) is the technology that converts spoken audio into written text. Once the call is text, you can search it, filter it, feed it into analysis tools, push it to a CRM, or use it to generate automated summaries. The call becomes data instead of a memory.

Use case 1: call transcription for records and compliance

The most straightforward use of voice to text is maintaining a written record of every customer call. In sectors where keeping such records is a regulatory requirement (banking, insurance, fintech, healthcare), transcription is hard to avoid in practice. The Central Bank of Nigeria's consumer protection framework, and comparable rules from other regulators, increasingly expect financial institutions to be able to produce records of customer interactions.

Even where it is not mandated, having a full transcript of every customer call has practical value:

A customer disputes what they were told about a product's terms — you can pull the transcript
A staff member is accused of giving incorrect information — you can review the call
A new team member wants to understand how experienced staff handle a specific type of query — you have a library of real examples

For Nigerian businesses, the critical requirement is that transcription must handle Nigerian English, Nigerian-accented speech, and the major local languages. A transcription system trained only on American or British English will produce poor-quality output on calls with Hausa- or Yoruba-accented English, and may miss entire segments where a caller switches into a Nigerian language.

Use case 2: customer intent analysis

When you have transcripts from hundreds of customer calls, you can start asking questions of the data: what are customers calling about most? What products come up most often? What complaints appear repeatedly? Which questions are callers asking that are not answered on the website?

Without transcription, this kind of analysis means listening to recordings one by one; with it, most of the work can be automated. The results drive product decisions, staff training priorities, and marketing content. A pharmacy that discovers 30% of its calls ask about a specific medication that is often out of stock has a clear restocking priority. A restaurant that sees "delivery to Ajah" mentioned in 40 calls per week knows there is demand in an area it is not serving.

You can start simple: export transcripts to a spreadsheet, sort by keyword, and look for patterns manually. Or pipe transcripts to a simple text classifier — the Maraba API returns structured transcript data that is easy to feed into a downstream analysis pipeline.
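
As a rough sketch of the keyword approach, the snippet below counts how many transcripts mention each topic at least once. The topic names, keyword lists, and sample transcripts are invented for illustration; real categories would come from reading your own calls. It uses plain substring matching, which is crude and can over-match, so treat the counts as a starting point rather than a finished classifier.

```python
from collections import Counter

# Hypothetical topic -> keyword map; replace with terms from your own calls.
TOPIC_KEYWORDS = {
    "delivery": ["delivery", "deliver", "dispatch"],
    "stock": ["in stock", "out of stock", "available"],
    "pricing": ["price", "how much", "cost"],
}


def count_topics(transcripts):
    """Count how many transcripts mention each topic at least once."""
    counts = Counter()
    for text in transcripts:
        lowered = text.lower()
        for topic, keywords in TOPIC_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                counts[topic] += 1
    return counts


# Made-up transcripts standing in for a week of real calls.
sample_transcripts = [
    "Good morning, do you do delivery to Ajah?",
    "How much is the 500mg pack, and is it available today?",
    "Please, is your delivery to Ajah still running this week?",
]

topic_counts = count_topics(sample_transcripts)
for topic, n in topic_counts.most_common():
    print(f"{topic}: {n} of {len(sample_transcripts)} calls")
```

Counting each call once per topic, rather than counting every keyword hit, keeps one long call from dominating the totals.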

Use case 3: CRM integration

A sales call or customer service call typically ends with a staff member manually typing notes into a CRM. This takes time, requires discipline, and often produces incomplete records — the staff member remembers the key point but forgets the caller's address, or the specific product variant they asked about.

With voice to text, the call is transcribed automatically. A lightweight extraction step — either a simple rule-based parser or a small language model — pulls out structured data: caller name, phone number (from the call metadata), product mentioned, delivery address mentioned, follow-up action required. This structured record is posted to the CRM directly, with the full transcript attached for reference.
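
A minimal sketch of the rule-based option might look like the following. The product list, follow-up phrases, field names, and sample transcript are all hypothetical. The name rule also assumes the transcript keeps capitalisation and punctuation, which not every STT output does, so a real pipeline would need fallbacks or a language model for messier cases.

```python
import re

# Hypothetical product list and follow-up cues; adjust to your own catalogue.
PRODUCTS = ["solar inverter", "battery", "generator"]
FOLLOW_UP_CUES = ["call me back", "call back", "send me", "follow up"]


def extract_crm_fields(transcript, caller_number):
    """Pull a few CRM fields out of a transcript with simple rules."""
    lowered = transcript.lower()

    # "my name is X" / "this is X": capture one or two capitalised words.
    name_match = re.search(
        r"(?:my name is|this is)\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)?)", transcript
    )
    # "deliver / delivery / delivered to <place>": capture up to the sentence end.
    address_match = re.search(
        r"deliver(?:y|ed)?\s+(?:it\s+)?to\s+([^.?!]+)", transcript, re.IGNORECASE
    )

    return {
        "caller_name": name_match.group(1) if name_match else None,
        "phone": caller_number,  # taken from call metadata, not the audio
        "products": [p for p in PRODUCTS if p in lowered],
        "delivery_address": address_match.group(1).strip() if address_match else None,
        "needs_follow_up": any(cue in lowered for cue in FOLLOW_UP_CUES),
    }


record = extract_crm_fields(
    "Hello, my name is Chidi Okafor. I want the solar inverter delivered to "
    "12 Admiralty Way, Lekki. Please call me back with the price.",
    caller_number="+2348000000000",
)
print(record)
```

Fields the rules cannot find come back as `None` or an empty list, so the CRM record shows a gap instead of a guess.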

The result is a CRM record that is more complete and consistent, and that does not depend on staff finding time for notes during a busy shift. The Maraba voice agent produces structured post-call summaries automatically, which gives CRM-ready data for every call Maraba handles.

Use case 4: compliance logging for regulated industries

Fintech, microfinance, insurance, and banking businesses in Nigeria face growing regulatory requirements around customer interaction records. The NDPR (Nigeria Data Protection Regulation), together with the Nigeria Data Protection Act 2023 that builds on it, creates obligations around what data you hold and for how long. It also presupposes that you can identify what information was shared with a customer during a call, which requires a record of what was said.

Transcription enables compliance logging that is searchable and auditable. Instead of a call recording that someone has to play back and listen to, you have a text record that can be queried. An auditor asking "how many customers were told about the 3% early repayment fee in Q1?" gets an answer from a database query rather than a manual review of hundreds of recordings.
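
To make the auditor example concrete, here is a small sketch using Python's built-in `sqlite3` module. The table name, column names, dates, and sample rows are assumptions made up for illustration. A keyword match like this is only a first filter: it finds calls where the fee was mentioned, not proof that it was explained correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path in a real deployment
conn.execute(
    """
    CREATE TABLE call_transcripts (
        call_id     TEXT PRIMARY KEY,
        customer_id TEXT,
        call_date   TEXT,  -- ISO 8601 date, so string comparison sorts correctly
        transcript  TEXT
    )
    """
)

# Invented rows standing in for stored transcripts.
rows = [
    ("c1", "cust-01", "2024-01-15", "The early repayment fee is 3% of the balance."),
    ("c2", "cust-02", "2024-02-20", "Your card will be ready on Friday."),
    ("c3", "cust-03", "2024-03-05", "There is a 3% early repayment fee on this loan."),
    ("c4", "cust-01", "2024-05-11", "Yes, the early repayment fee still applies."),
]
conn.executemany("INSERT INTO call_transcripts VALUES (?, ?, ?, ?)", rows)

# Count distinct customers with a Q1 call that mentions the fee.
(told_count,) = conn.execute(
    """
    SELECT COUNT(DISTINCT customer_id)
    FROM call_transcripts
    WHERE call_date BETWEEN '2024-01-01' AND '2024-03-31'
      AND transcript LIKE '%early repayment fee%'
    """
).fetchone()

print(f"Customers told about the early repayment fee in Q1: {told_count}")
```

For larger archives, a full-text index would likely be a better fit than `LIKE`, but the shape of the query stays the same.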

How STT APIs work: technical overview

For developers building voice-to-text pipelines, here is a brief technical overview of how STT works and what to consider for Nigerian deployments.

A speech-to-text API takes audio as input (a WAV, MP3, or OGG file, or a streaming audio feed) and returns text. Under the hood, a classical pipeline involves three stages. Modern end-to-end neural models often merge the last two, but the division is still a useful mental model:

1. Voice Activity Detection (VAD). The audio is scanned to identify the segments containing speech vs. silence or background noise. Only the speech segments are sent to the recogniser.
2. Acoustic modelling. The speech segments are converted into a sequence of phoneme probabilities: essentially, which sounds were spoken.
3. Language modelling + decoding. The phoneme sequence is decoded into words using a language model that represents the probability of word sequences in the target language. This is where language-specific knowledge lives, and why a model trained on English produces poor results on Hausa.
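
The first stage is the easiest to illustrate. The sketch below is a deliberately simple energy-threshold VAD: it splits the audio into 20 ms frames and marks a frame as speech when its RMS energy passes a fixed threshold. The frame length, threshold, and synthetic test signal are assumptions chosen for the example; production systems generally use trained VAD models that cope far better with background noise.

```python
import math


def frame_energies(samples, frame_size):
    """Root-mean-square energy of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energies.append(math.sqrt(sum(s * s for s in frame) / frame_size))
    return energies


def detect_speech(samples, sample_rate, frame_ms=20, threshold=0.1):
    """Return (start_sec, end_sec) spans whose frame energy reaches threshold."""
    frame_size = int(sample_rate * frame_ms / 1000)
    energies = frame_energies(samples, frame_size)
    segments, seg_start = [], None
    for i, energy in enumerate(energies):
        t = i * frame_size / sample_rate
        if energy >= threshold and seg_start is None:
            seg_start = t  # speech begins
        elif energy < threshold and seg_start is not None:
            segments.append((seg_start, t))  # speech ends
            seg_start = None
    if seg_start is not None:  # audio ended while still in speech
        segments.append((seg_start, len(energies) * frame_size / sample_rate))
    return segments


# Synthetic signal: 0.5 s silence, 1.0 s of a 220 Hz tone, 0.5 s silence.
RATE = 8000
silence = [0.0] * (RATE // 2)
tone = [0.5 * math.sin(2 * math.pi * 220 * n / RATE) for n in range(RATE)]

segments = detect_speech(silence + tone + silence, RATE)
print(segments)
```

On this synthetic signal the function reports a single segment from 0.5 s to 1.5 s. Real call audio has line noise and background chatter, which is where a fixed threshold starts to fail.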

Latency considerations on Nigerian networks

For batch transcription (uploading completed call recordings), network latency does not matter much — you upload the file and wait for the result. For real-time or near-real-time transcription (transcribing a live call), latency is critical.

On Nigerian 4G networks, round-trip latency to a Lagos-hosted API is typically 80–150ms. For a streaming STT application where you send 5-second audio chunks and receive partial transcripts, this is workable. For applications requiring sub-200ms response to every spoken word, you need edge deployment — which is beyond the scope of most SME builds.
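
A quick back-of-the-envelope calculation shows where the delay comes from. The function below is a simplified model, not a measurement: it assumes a word spoken at the start of a chunk must wait for the whole chunk to be recorded, then for one network round trip and some server processing. The 300 ms processing figure is a placeholder assumption; the round-trip values are the ones quoted above.

```python
def worst_case_delay_ms(chunk_ms, round_trip_ms, processing_ms):
    """Delay between a word being spoken and its text arriving.

    A word spoken at the very start of a chunk waits for the full chunk
    to be captured before anything is sent, then for the network round
    trip and the server-side processing time.
    """
    return chunk_ms + round_trip_ms + processing_ms


# Compare chunk sizes against the low and high round-trip figures.
for chunk_ms in (5000, 1000, 200):
    for rtt_ms in (80, 150):
        delay = worst_case_delay_ms(chunk_ms, rtt_ms, processing_ms=300)
        print(f"chunk={chunk_ms}ms rtt={rtt_ms}ms -> up to {delay}ms")
```

Under these assumptions the chunk length dominates: with 5-second chunks, the difference between an 80 ms and a 150 ms round trip is barely noticeable, while very short chunks make the network the main cost.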

Practical recommendation for Nigerian developers: use the batch endpoint for post-call transcription (upload the recorded audio after the call ends). Use the streaming endpoint only for live applications where you genuinely need real-time text, and test thoroughly on the actual network conditions of your deployment.

The Orinode STT API: overview

The Orinode STT API handles English, Hausa, Yoruba, Igbo, and bilingual code-switched audio. Here is the simplest possible Python call:

Python — basic transcription

import requests

API_KEY = "your-api-key-here"
AUDIO_PATH = "customer_call.wav"

# Open the recording in binary mode and upload it as multipart form data.
with open(AUDIO_PATH, "rb") as f:
    response = requests.post(
        "https://maraba.ai/api/v1/transcribe/",
        headers={"X-API-Key": API_KEY},
        # language="en" for English, "ha" for Hausa,
        # "yo" for Yoruba, "ig" for Igbo,
        # "ha-en" for bilingual Hausa/English, etc.
        data={"language": "en"},
        files={"audio": (AUDIO_PATH, f, "audio/wav")},
        timeout=120,  # seconds; avoids hanging forever on a stalled upload
    )

# Raise an exception on 4xx/5xx responses instead of parsing an error body.
response.raise_for_status()
result = response.json()

print(result["transcript"])         # Full text of the call
print(result["language_detected"])  # Detected language code
print(result["duration_seconds"])   # Call duration in seconds
print(result["cost_ngn"])           # Cost charged (₦5/min)

The words array in the response provides word-level timestamps, which are useful if you need to synchronise the transcript with call recording playback or highlight specific moments in a call:

Python — word-level timestamps

# Reuses `response` from the basic transcription example above.
result = response.json()

# Each entry carries the word plus its start and end time in seconds.
for word_data in result["words"]:
    print(f"{word_data['start']:.2f}s - {word_data['end']:.2f}s: {word_data['word']}")

# Example output:
# 0.00s - 0.30s: Hello
# 0.30s - 0.60s: I
# 0.60s - 0.85s: want
# 0.85s - 1.10s: to
# 1.10s - 1.55s: order
# ...

For a full technical reference on the STT API including language codes, streaming, and error codes, see the dedicated tutorials: Hausa STT, Igbo STT.
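
One practical use of the words array is jumping to the moment a keyword was spoken. The helper below is a small sketch that assumes each entry has the `word`, `start`, and `end` keys shown in the example above. The function name and the sample data are illustrative and not part of the API.

```python
def find_mentions(words, keyword):
    """Return the start time (in seconds) of every occurrence of keyword."""
    target = keyword.lower()
    return [
        w["start"]
        for w in words
        # Ignore case and trailing punctuation when comparing.
        if w["word"].lower().strip(".,?!") == target
    ]


# Sample data in the same shape as the example output above.
sample_words = [
    {"word": "Hello", "start": 0.00, "end": 0.30},
    {"word": "I", "start": 0.30, "end": 0.60},
    {"word": "want", "start": 0.60, "end": 0.85},
    {"word": "to", "start": 0.85, "end": 1.10},
    {"word": "order", "start": 1.10, "end": 1.55},
]

print(find_mentions(sample_words, "order"))
```

The returned offsets can be passed to an audio player's seek function, so a reviewer lands on the relevant part of the recording without listening from the start.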

Getting started

If you are a developer building a transcription pipeline for a Nigerian business, start with the Maraba free tier — 50 API minutes per month included at no cost. This is enough to process approximately 50 one-minute call recordings and verify that the transcript quality meets your requirements before committing to a paid plan.
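
For budgeting beyond the free tier, a rough estimate is simple arithmetic. The sketch below uses the ₦5 per minute rate and 50 free minutes quoted in this article. The 22 working days, the average call length, pro-rata billing, and the free minutes still applying on a paid plan are all assumptions, so check the actual pricing terms before relying on the figure.

```python
RATE_NGN_PER_MIN = 5          # rate quoted in this article
FREE_MINUTES_PER_MONTH = 50   # free allowance quoted in this article


def monthly_cost_ngn(calls_per_day, avg_call_minutes, working_days=22):
    """Estimate the monthly transcription cost after the free allowance."""
    total_minutes = calls_per_day * avg_call_minutes * working_days
    billable_minutes = max(0, total_minutes - FREE_MINUTES_PER_MONTH)
    return billable_minutes * RATE_NGN_PER_MIN


# 100 calls a day, as in the opening example, at an assumed 3 minutes each.
print(monthly_cost_ngn(calls_per_day=100, avg_call_minutes=3))
```

With those inputs the estimate is 6,600 minutes a month, of which 6,550 are billable, for a total of ₦32,750.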

If you are a business owner looking for automatic post-call transcription without writing any code, the Maraba voice agent produces a structured WhatsApp summary after every call automatically, with no API integration required. See the automated call answering guide for the no-code route.

Turn your business calls into data

Sign up free and start transcribing Nigerian business calls — in English, Hausa, Yoruba, or Igbo. ₦5 per minute, 50 free minutes included.
