← Back to blog
Developer Tutorial

Yoruba text to speech API: generate natural Yoruba audio

Synthesising natural Yoruba speech is one of the hardest problems in African language technology. Yoruba is tonal — the same written word has three possible tones, each changing meaning — and standard TTS systems ignore this entirely. This tutorial shows you how to use the Orinode TTS API to generate correct, natural-sounding Yoruba audio from text, including proper tone rendering and code-switching support.

Why Yoruba TTS is technically hard

Yoruba is spoken by approximately 45 million people, primarily in south-western Nigeria — Lagos, Oyo, Ogun, Ondo, Ekiti, and Osun states — and across diaspora communities in the UK, US, and Brazil. It is one of Nigeria's three major languages and the primary language of Lagos, Nigeria's commercial capital and largest city.

Generating natural Yoruba speech is significantly more complex than generating English speech for two reasons:

Tonal orthography. Yoruba is a tonal language with three contrastive tones: high (é), low (è), and mid (e, unmarked). The tone of a syllable changes the word's meaning entirely. Consider: ọkọ (husband), ọ̀kọ̀ (spear), ọkọ̀ (farm). These are three different words written with the same base letters — tone marks are the only distinguishing feature in writing. A TTS system that ignores tones will produce speech where these three words sound identical, which is incomprehensible to a Yoruba listener.

In addition to tone marks, Yoruba uses the dotted vowels ẹ (open e) and ọ (open o), and the dotted consonant ṣ (postalveolar fricative). These are phonemically distinct from their undotted counterparts. The sentence Ẹ káàárọ̀ (Good morning, formal plural) cannot be approximated as "E kaaro" — the two sound completely different and the latter is not correct Yoruba.

Code-switching with English. Lagos Yoruba in particular is heavily mixed with English. A voice response system serving Lagos customers must handle sentences like "Mo fẹ́ place an order for delivery si Ikeja." (I want to place an order for delivery to Ikeja.) — Yoruba grammar with English vocabulary mixed in naturally. A TTS system that switches to English mode when it encounters an English word, and back to Yoruba mode otherwise, produces a jarring experience. The Maraba system handles this natively.

Prerequisites

TTS is billed at ₦3 per 1,000 characters. Synthesising a 200-character Yoruba sentence costs ₦0.60. A full hour of Yoruba audio (approximately 90,000 characters of spoken text) costs around ₦270.

Your first Yoruba synthesis: Python

The endpoint is POST /api/v1/synthesize/. You POST a JSON body with the text and voice identifier. The response is audio data in your requested format.

Python
import requests

API_KEY = "your-api-key-here"

text = "Ẹ káàárọ̀, báwo ni mo ṣe le ràn yín lọ́wọ́?"
# Translation: "Good morning, how can I help you?"

response = requests.post(
    "https://maraba.ai/api/v1/synthesize/",
    headers={
        "X-API-Key": API_KEY,
        "Content-Type": "application/json",
    },
    json={
        "text": text,
        "voice": "yo-NG",        # Yoruba (Nigeria) voice
        "format": "wav",         # wav | mp3 | mulaw
        "sample_rate": 16000,    # 8000 for telephony, 16000 for general
    },
)

response.raise_for_status()

# The response body is raw audio bytes
with open("yoruba_greeting.wav", "wb") as f:
    f.write(response.content)

print("Audio saved to yoruba_greeting.wav")
print(f"Content-Type: {response.headers['Content-Type']}")
print(f"X-Duration-Seconds: {response.headers.get('X-Duration-Seconds')}")
print(f"X-Cost-NGN: {response.headers.get('X-Cost-NGN')}")

The example text "Ẹ káàárọ̀, báwo ni mo ṣe le ràn yín lọ́wọ́?" is a formal Yoruba greeting appropriate for a customer service context. The (open e with no tone mark, inheriting mid tone) is the formal plural second person — used when addressing a caller respectfully. The multiple tone marks (àá, ọ̀) are essential: without them, the greeting would sound incorrect to a Yoruba speaker.

Available voices and audio formats

The voice parameter accepts the following Yoruba options:

Audio format options via the format parameter:

For telephony use cases — for example, generating a Yoruba IVR prompt — use "format": "mulaw", "sample_rate": 8000. This matches the audio format expected by most SIP/VoIP infrastructure.

JavaScript example

JavaScript (Node.js)
const fs = require("fs");
const fetch = require("node-fetch");

const API_KEY = "your-api-key-here";

async function synthesiseYoruba(text, outputPath) {
  const response = await fetch("https://maraba.ai/api/v1/synthesize/", {
    method: "POST",
    headers: {
      "X-API-Key": API_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      text: text,
      voice: "yo-NG",
      format: "mp3",
      sample_rate: 16000,
    }),
  });

  if (!response.ok) {
    const error = await response.json();
    throw new Error(`TTS error ${response.status}: ${error.error}`);
  }

  const buffer = await response.buffer();
  fs.writeFileSync(outputPath, buffer);
  console.log(`Saved ${buffer.length} bytes to ${outputPath}`);
  console.log("Duration:", response.headers.get("x-duration-seconds"), "s");
}

synthesiseYoruba(
  "Ẹ káàbọ̀ sí Maraba. Mo jẹ́ Maraba, olùrànlọ́wọ́ rẹ.",
  "welcome_yo.mp3"
).catch(console.error);

The diacritics guide: how to write correct Yoruba text for TTS

The quality of your audio output depends entirely on the correctness of the input text. A Yoruba TTS system can only render the tones it is told — if you submit text with missing tone marks, the output will sound wrong to a native speaker.

Here is a practical reference for the most common Yoruba characters needed in a customer service context:

Essential Yoruba characters for TTS input
  • ẹ / Ẹ — open e (as in "bed", but more open). Example: ẹ̀wù (cloth), Ẹ káàbọ̀ (Welcome)
  • ọ / Ọ — open o (as in "bought"). Example: ọjọ́ (day), ọdún (year)
  • ṣ / Ṣ — sh sound. Example: ṣe (do/make), ṣíṣe (doing)
  • á é í ó ú — high tone marks on vowels
  • à è ì ò ù — low tone marks on vowels
  • a e i o u — mid tone (unmarked). Mid is the default when no mark appears.

Practical tip: write your Yoruba text in a Unicode-aware text editor (VS Code, Google Docs, or the macOS Notes app all handle Yoruba characters correctly). Copy it directly into your API call. Do not pass Yoruba text through any process that strips diacritics — for example, some URL encoders or SMS gateways will corrupt these characters.

One quick validation check: if your Yoruba text contains no diacritics at all (no accented characters, no dotted vowels), it is almost certainly incomplete. Real Yoruba text almost always contains tone marks and dotted letters.

Code-switching: Yoruba with English words

To synthesise text that mixes Yoruba and English — which is very common in Lagos business contexts — you do not need to do anything special. The yo-NG voice handles English loanwords and switches naturally:

Python — code-switching example
mixed_text = (
    "Mo fẹ́ràn pé o pe wa. "
    "Your order number is 4-7-2. "
    "A óò deliver rẹ̀ ní àárọ̀ ọjọ́ kejì."
)
# Translation: "We're glad you called. Your order number is 472.
# We will deliver it tomorrow morning."

response = requests.post(
    "https://maraba.ai/api/v1/synthesize/",
    headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
    json={"text": mixed_text, "voice": "yo-NG", "format": "wav"},
)

The voice will render the English portions with a light Nigerian-accented English — matching how a native Yoruba speaker would naturally pronounce those words — rather than switching to an American or British accent mid-sentence.

Telephony integration: generating IVR prompts

A common use case is generating Yoruba IVR prompts for a call routing system. To do this, generate the audio once, save it as a mulaw file, and serve it as a static audio file to your telephony provider. You do not need to generate it on every call:

Python — IVR prompt generation
prompts = {
    "main_menu": "Ẹ káàbọ̀ sí Eko Pharmacy. Fún àpẹẹrẹ àwọn ìbéèrè nípa oògùn, tẹ ọ̀kan. Fún àwọn àkókò ìṣí, tẹ èjì.",
    "hours": "A ń ṣí láti Ọjọ Ajé dé Ọjọ Àbámẹ́ta, láàárọ̀ mẹ́jọ dé òru mẹ́jọ. A ti pa ní Ọjọ Àìkú.",
    "hold": "Jọwọ dúró díẹ̀. Ìjọba wa yóò dáhùn rẹ̀ lẹ́sẹ̀kẹ̀sẹ̀.",
}

for name, text in prompts.items():
    response = requests.post(
        "https://maraba.ai/api/v1/synthesize/",
        headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
        json={
            "text": text,
            "voice": "yo-NG",
            "format": "mulaw",
            "sample_rate": 8000,  # telephony standard
        },
    )
    response.raise_for_status()
    with open(f"prompts/{name}.ul", "wb") as f:
        f.write(response.content)
    print(f"Generated: prompts/{name}.ul")

Error handling

The TTS API returns errors in the standard Maraba format: {"error": "...", "code": "...", "detail": {}}.

What to build next

With Yoruba TTS working, your natural next steps are:

Generate your first Yoruba audio in minutes

Sign up free, get your API key, and call the TTS endpoint. ₦3 per 1,000 characters — your first test sentence costs less than one kobo.

Get your API key →