Yoruba Text to Speech API: Generate Natural Yoruba Audio

Why Yoruba TTS is technically hard

Yoruba is spoken by approximately 45 million people, primarily in south-western Nigeria — Lagos, Oyo, Ogun, Ondo, Ekiti, and Osun states — and across diaspora communities in the UK, US, and Brazil. It is one of Nigeria's three major languages and the primary language of Lagos, Nigeria's commercial capital and largest city.

Generating natural Yoruba speech is significantly more complex than generating English speech for two reasons:

Tonal orthography. Yoruba is a tonal language with three contrastive tones: high (é), low (è), and mid (e, unmarked). The tone of a syllable changes the word's meaning entirely. Consider: ọkọ (husband), ọ̀kọ̀ (spear), ọkọ̀ (farm). These are three different words written with the same base letters — tone marks are the only distinguishing feature in writing. A TTS system that ignores tones will produce speech where these three words sound identical, which is incomprehensible to a Yoruba listener.

In addition to tone marks, Yoruba uses the dotted vowels ẹ (open e) and ọ (open o), and the dotted consonant ṣ (postalveolar fricative). These are phonemically distinct from their undotted counterparts. The sentence Ẹ káàárọ̀ (Good morning, formal plural) cannot be approximated as "E kaaro" — the two sound completely different and the latter is not correct Yoruba.

Code-switching with English. Lagos Yoruba in particular is heavily mixed with English. A voice response system serving Lagos customers must handle sentences like "Mo fẹ́ place an order for delivery si Ikeja." (I want to place an order for delivery to Ikeja.) — Yoruba grammar with English vocabulary mixed in naturally. A TTS system that switches to English mode when it encounters an English word, and back to Yoruba mode otherwise, produces a jarring experience. The Maraba system handles this natively.

Prerequisites

Maraba developer account — sign up at maraba.ai
API key from Developer → API Keys
Python 3.9+ with requests installed
Your Yoruba text with correct diacritics — see the diacritics guide below before writing your text input

TTS is billed at ₦3 per 1,000 characters. Synthesising a 200-character Yoruba sentence costs ₦0.60. A full hour of Yoruba audio (approximately 90,000 characters of spoken text) costs around ₦270.

Your first Yoruba synthesis: Python

The endpoint is POST /api/v1/synthesize/. You POST a JSON body with the text and voice identifier. The response is audio data in your requested format.

Python

import requests

API_KEY = "your-api-key-here"

text = "Ẹ káàárọ̀, báwo ni mo ṣe le ràn yín lọ́wọ́?"
# Translation: "Good morning, how can I help you?"

response = requests.post(
    "https://maraba.ai/api/v1/synthesize/",
    headers={
        "X-API-Key": API_KEY,
        "Content-Type": "application/json",
    },
    json={
        "text": text,
        "voice": "yo-NG",        # Yoruba (Nigeria) voice
        "format": "wav",         # wav | mp3 | mulaw
        "sample_rate": 16000,    # 8000 for telephony, 16000 for general
    },
)

response.raise_for_status()

# The response body is raw audio bytes
with open("yoruba_greeting.wav", "wb") as f:
    f.write(response.content)

print("Audio saved to yoruba_greeting.wav")
print(f"Content-Type: {response.headers['Content-Type']}")
print(f"X-Duration-Seconds: {response.headers.get('X-Duration-Seconds')}")
print(f"X-Cost-NGN: {response.headers.get('X-Cost-NGN')}")

The example text "Ẹ káàárọ̀, báwo ni mo ṣe le ràn yín lọ́wọ́?" is a formal Yoruba greeting appropriate for a customer service context. The Ẹ (open e with no tone mark, inheriting mid tone) is the formal plural second person — used when addressing a caller respectfully. The multiple tone marks (àá, ọ̀) are essential: without them, the greeting would sound incorrect to a Yoruba speaker.

Available voices and audio formats

The voice parameter accepts the following Yoruba options:

yo-NG — Yoruba (Nigeria), female voice, standard Lagos accent
yo-NG-M — Yoruba (Nigeria), male voice

Audio format options via the format parameter:

wav — uncompressed PCM, suitable for processing and high-quality playback
mp3 — compressed, suitable for web delivery and storage
mulaw — G.711 mu-law encoding, required for telephony integration (8kHz, 8-bit); use this when feeding audio into a phone call or IVR system

For telephony use cases — for example, generating a Yoruba IVR prompt — use "format": "mulaw", "sample_rate": 8000. This matches the audio format expected by most SIP/VoIP infrastructure.

JavaScript example

JavaScript (Node.js)

const fs = require("fs");
const fetch = require("node-fetch");

const API_KEY = "your-api-key-here";

async function synthesiseYoruba(text, outputPath) {
  const response = await fetch("https://maraba.ai/api/v1/synthesize/", {
    method: "POST",
    headers: {
      "X-API-Key": API_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      text: text,
      voice: "yo-NG",
      format: "mp3",
      sample_rate: 16000,
    }),
  });

  if (!response.ok) {
    const error = await response.json();
    throw new Error(`TTS error ${response.status}: ${error.error}`);
  }

  const buffer = await response.buffer();
  fs.writeFileSync(outputPath, buffer);
  console.log(`Saved ${buffer.length} bytes to ${outputPath}`);
  console.log("Duration:", response.headers.get("x-duration-seconds"), "s");
}

synthesiseYoruba(
  "Ẹ káàbọ̀ sí Maraba. Mo jẹ́ Maraba, olùrànlọ́wọ́ rẹ.",
  "welcome_yo.mp3"
).catch(console.error);

The diacritics guide: how to write correct Yoruba text for TTS

The quality of your audio output depends entirely on the correctness of the input text. A Yoruba TTS system can only render the tones it is told — if you submit text with missing tone marks, the output will sound wrong to a native speaker.

Here is a practical reference for the most common Yoruba characters needed in a customer service context:

Essential Yoruba characters for TTS input

ẹ / Ẹ — open e (as in "bed", but more open). Example: ẹ̀wù (cloth), Ẹ káàbọ̀ (Welcome)
ọ / Ọ — open o (as in "bought"). Example: ọjọ́ (day), ọdún (year)
ṣ / Ṣ — sh sound. Example: ṣe (do/make), ṣíṣe (doing)
á é í ó ú — high tone marks on vowels
à è ì ò ù — low tone marks on vowels
a e i o u — mid tone (unmarked). Mid is the default when no mark appears.

Practical tip: write your Yoruba text in a Unicode-aware text editor (VS Code, Google Docs, or the macOS Notes app all handle Yoruba characters correctly). Copy it directly into your API call. Do not pass Yoruba text through any process that strips diacritics — for example, some URL encoders or SMS gateways will corrupt these characters.

One quick validation check: if your Yoruba text contains no diacritics at all (no accented characters, no dotted vowels), it is almost certainly incomplete. Real Yoruba text almost always contains tone marks and dotted letters.

Code-switching: Yoruba with English words

To synthesise text that mixes Yoruba and English — which is very common in Lagos business contexts — you do not need to do anything special. The yo-NG voice handles English loanwords and switches naturally:

Python — code-switching example

mixed_text = (
    "Mo fẹ́ràn pé o pe wa. "
    "Your order number is 4-7-2. "
    "A óò deliver rẹ̀ ní àárọ̀ ọjọ́ kejì."
)
# Translation: "We're glad you called. Your order number is 472.
# We will deliver it tomorrow morning."

response = requests.post(
    "https://maraba.ai/api/v1/synthesize/",
    headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
    json={"text": mixed_text, "voice": "yo-NG", "format": "wav"},
)

The voice will render the English portions with a light Nigerian-accented English — matching how a native Yoruba speaker would naturally pronounce those words — rather than switching to an American or British accent mid-sentence.

Telephony integration: generating IVR prompts

A common use case is generating Yoruba IVR prompts for a call routing system. To do this, generate the audio once, save it as a mulaw file, and serve it as a static audio file to your telephony provider. You do not need to generate it on every call:

Python — IVR prompt generation

prompts = {
    "main_menu": "Ẹ káàbọ̀ sí Eko Pharmacy. Fún àpẹẹrẹ àwọn ìbéèrè nípa oògùn, tẹ ọ̀kan. Fún àwọn àkókò ìṣí, tẹ èjì.",
    "hours": "A ń ṣí láti Ọjọ Ajé dé Ọjọ Àbámẹ́ta, láàárọ̀ mẹ́jọ dé òru mẹ́jọ. A ti pa ní Ọjọ Àìkú.",
    "hold": "Jọwọ dúró díẹ̀. Ìjọba wa yóò dáhùn rẹ̀ lẹ́sẹ̀kẹ̀sẹ̀.",
}

for name, text in prompts.items():
    response = requests.post(
        "https://maraba.ai/api/v1/synthesize/",
        headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
        json={
            "text": text,
            "voice": "yo-NG",
            "format": "mulaw",
            "sample_rate": 8000,  # telephony standard
        },
    )
    response.raise_for_status()
    with open(f"prompts/{name}.ul", "wb") as f:
        f.write(response.content)
    print(f"Generated: prompts/{name}.ul")

Error handling

The TTS API returns errors in the standard Maraba format: {"error": "...", "code": "...", "detail": {}}.

text_too_long — input exceeds 5,000 characters per request; split long texts into chunks
unsupported_voice — the voice identifier is not recognised; check the voice list above
encoding_error — the text could not be processed, usually because of non-UTF-8 characters; ensure your text is UTF-8 encoded
quota_exceeded — TTS character quota exhausted; top up or upgrade

What to build next

With Yoruba TTS working, your natural next steps are:

Pair with Hausa STT and Igbo STT for a full multilingual voice pipeline
Use Orinode STT to transcribe inbound Yoruba speech, then TTS to generate a Yoruba response — a complete voice bot loop
See the Nigerian language AI resources for Yoruba text corpora to use in your knowledge base

Generate your first Yoruba audio in minutes

Sign up free, get your API key, and call the TTS endpoint. ₦3 per 1,000 characters — your first test sentence costs less than one kobo.

Get your API key →