This is the hands-on chapter. By the end you will have a working TTS feature that reads English and Czech text aloud, runs for free on your own hardware, and exposes a single clean HTTP endpoint your app calls with { text, lang }. We implement the report's primary recommendation — Piper as the default local engine (fast, free, genuinely usable Czech and English voices, CPU-only) — wrapped behind an adapter interface so you can later swap in a higher-fidelity engine (Coqui XTTS-v2 self-hosted) or a cloud API (Google / Azure / ElevenLabs) for the same call, with no frontend changes.
We cover: install and run Piper (both as a Python library and as a Docker HTTP server), download the EN + CZ voices, a tested Python (FastAPI) server endpoint with on-disk caching, a Node/Express equivalent, a frontend snippet to play the audio, Czech text-normalization tips (numbers, dates, abbreviations, diacritics), the adapter pattern to drop in a cloud backend, a troubleshooting section, and a browser Web Speech API fallback for when the server is down.
Why Piper as the default? It synthesizes faster than real time on a modern laptop CPU, ships pre-trained cs_CZ and en_US/en_GB voices, and has a tiny dependency footprint. The original rhasspy/piper code is MIT-licensed; its successor repository is named piper1-gpl and is GPL-licensed, so check the license of the release you install before redistributing it. Czech quality is "clearly intelligible and natural-enough for reading paragraphs," not studio-grade. When you need studio-grade Czech, the adapter lets you escalate to XTTS-v2 or a cloud neural voice. See Chapters 5–7 and 12 for the full quality/cost comparison that leads to this choice.
# Python 3.10–3.12 recommended (Piper wheels target these)
python3 --version
# A folder for your TTS service
mkdir -p tts-service/voices tts-service/cache
cd tts-service
python3 -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
You need ~200–500 MB of disk for a couple of voices, and roughly 1 GB of RAM while synthesizing. (XTTS-v2 later wants a GPU with ~4–6 GB VRAM for snappy results, or it runs slowly on CPU.)
There are two supported ways to run Piper. Pick one (you can use both — the library for embedding, the Docker server for a clean network boundary).
Option A — Python library (piper-tts), simplest for embedding directly:
pip install piper-tts
# This pulls onnxruntime + the piper CLI. Verify:
piper --help
Option B — Docker HTTP server (recommended for production; isolates native deps). The Wyoming/Piper ecosystem and community images expose an HTTP /api/tts or an OpenAI-compatible /v1/audio/speech endpoint. The most portable approach is to run Piper's own bundled HTTP server inside a container you control (Step 3).
Naming note: the original Piper repo (rhasspy/piper) moved under the OHF-Voice umbrella (OHF-Voice/piper1-gpl) in 2025; the pip package is still piper-tts and voices still live at rhasspy/piper-voices on Hugging Face. If a command name differs in your version, run piper --help; the flags used below (-m, -f, --output_raw) are stable.
Voices are two files each: a .onnx model and a .onnx.json config. They live in the Hugging Face repo rhasspy/piper-voices, organized as <lang>/<locale>/<voice>/<quality>/.
cd voices
# English (US) — a solid medium-quality voice
curl -L -o en_US-lessac-medium.onnx \
https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
curl -L -o en_US-lessac-medium.onnx.json \
https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json
# Czech — the maintained cs_CZ medium voice
curl -L -o cs_CZ-jirka-medium.onnx \
https://huggingface.co/rhasspy/piper-voices/resolve/main/cs/cs_CZ/jirka/medium/cs_CZ-jirka-medium.onnx
curl -L -o cs_CZ-jirka-medium.onnx.json \
https://huggingface.co/rhasspy/piper-voices/resolve/main/cs/cs_CZ/jirka/medium/cs_CZ-jirka-medium.onnx.json
cd ..
Verify the exact path before downloading. Browse https://huggingface.co/rhasspy/piper-voices/tree/main/cs/cs_CZ to confirm the current Czech voice name and available quality levels. The Czech catalog is smaller than English; cs_CZ-jirka-medium has been the practical default, but check for newer additions. If you get a 404, the folder or voice name has changed; the directory listing is authoritative.
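Because the repository layout is regular, the two download URLs can be derived from the voice id alone. The sketch below assumes ids always have the form locale-voice-quality, as the two voices above do; voice_urls is a hypothetical helper name, not part of Piper.

```python
BASE = "https://huggingface.co/rhasspy/piper-voices/resolve/main"


def voice_urls(voice_id: str) -> tuple[str, str]:
    """Return (model_url, config_url) for an id like 'cs_CZ-jirka-medium'."""
    # Assumption: exactly three hyphen-separated parts (locale, voice, quality).
    locale, name, quality = voice_id.split("-")
    lang = locale.split("_")[0]  # "cs_CZ" -> "cs"
    folder = f"{BASE}/{lang}/{locale}/{name}/{quality}"
    return f"{folder}/{voice_id}.onnx", f"{folder}/{voice_id}.onnx.json"


if __name__ == "__main__":
    for url in voice_urls("cs_CZ-jirka-medium"):
        print(url)
```

The function only builds strings; it does not check that the voice exists, so the advice to confirm the path in the directory listing still applies.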
Quality levels trade size/CPU for fidelity:
| Quality | Sample rate | Model size (approx) | When to use |
|---|---|---|---|
| x_low | 16 kHz | ~5–10 MB | embedded / very constrained |
| low | 16 kHz | ~15–25 MB | fast, low CPU |
| medium | 22.05 kHz | ~50–75 MB | default sweet spot |
| high | 22.05 kHz | ~100+ MB | best Piper quality, more CPU |
Prefer medium for reading text aloud; jump to high for the English voice if your CPU has headroom. Czech high voices are not always available — medium is the realistic Czech target with Piper.
Quick smoke test:
echo "Hello, this is a test of natural sounding speech." \
| piper -m voices/en_US-lessac-medium.onnx -f test_en.wav
echo "Ahoj, toto je test českého hlasu. Příliš žluťoučký kůň úpěl ďábelské ódy." \
| piper -m voices/cs_CZ-jirka-medium.onnx -f test_cs.wav
Play test_en.wav / test_cs.wav. The Czech pangram (with all diacritics) is the fastest way to confirm ě š č ř ž ý á í é ú ů ď ť ň are pronounced, not dropped.
This is the core deliverable: one endpoint, POST /tts, that takes { text, lang }, returns audio bytes, and caches results so repeated reads are instant and CPU-free.
requirements.txt:
fastapi
uvicorn[standard]
piper-tts
app.py:
import hashlib
import io
import subprocess
import wave
from pathlib import Path

from fastapi import FastAPI, HTTPException
from fastapi.responses import Response
from pydantic import BaseModel
from piper import PiperVoice  # provided by the piper-tts package

# ---- Configuration -------------------------------------------------
VOICE_DIR = Path("voices")
CACHE_DIR = Path("cache")
CACHE_DIR.mkdir(exist_ok=True)

VOICES = {
    "en": VOICE_DIR / "en_US-lessac-medium.onnx",
    "cs": VOICE_DIR / "cs_CZ-jirka-medium.onnx",
}
# Allowed output formats and their MIME types ("mp3" needs ffmpeg).
FORMATS = {"wav": "audio/wav", "mp3": "audio/mpeg"}

# Lazy-load voices once, keep in memory (loading is the slow part).
_loaded: dict[str, PiperVoice] = {}


def normalize_lang(lang: str) -> str:
    return lang.lower().split("-")[0]  # "cs-CZ" -> "cs"


def get_voice(lang: str) -> PiperVoice:
    if lang not in VOICES:
        raise HTTPException(400, f"Unsupported language: {lang}")
    if lang not in _loaded:
        model_path = VOICES[lang]
        if not model_path.exists():
            raise HTTPException(500, f"Voice model missing: {model_path}")
        _loaded[lang] = PiperVoice.load(str(model_path))
    return _loaded[lang]


# ---- Synthesis -----------------------------------------------------
def synthesize_wav(text: str, lang: str) -> bytes:
    voice = get_voice(lang)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav_file:
        # piper-tts 1.3+ writes WAVs via synthesize_wav(); older releases
        # used synthesize(text, wav_file) for the same job.
        write_wav = getattr(voice, "synthesize_wav", None) or voice.synthesize
        write_wav(text, wav_file)  # writes a valid WAV header + PCM
    return buf.getvalue()


def cache_key(text: str, lang: str, fmt: str) -> str:
    raw = f"{lang}|{fmt}|{text}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


# ---- Optional MP3 transcode (needs ffmpeg on PATH) -----------------
def wav_to_mp3(wav_bytes: bytes) -> bytes:
    p = subprocess.run(
        ["ffmpeg", "-hide_banner", "-loglevel", "error",
         "-i", "pipe:0", "-f", "mp3", "-b:a", "96k", "pipe:1"],
        input=wav_bytes, stdout=subprocess.PIPE, check=True,
    )
    return p.stdout


# ---- API -----------------------------------------------------------
app = FastAPI()


class TTSRequest(BaseModel):
    text: str
    lang: str = "en"     # "en" or "cs"
    format: str = "wav"  # "wav" (default) or "mp3" if ffmpeg present


@app.post("/tts")
def tts(req: TTSRequest):
    text = (req.text or "").strip()
    if not text:
        raise HTTPException(400, "Empty text")
    if len(text) > 5000:
        raise HTTPException(413, "Text too long; split into chunks (see docs)")
    # Whitelist the format: it becomes part of the cache file name.
    if req.format not in FORMATS:
        raise HTTPException(400, f"Unsupported format: {req.format}")
    lang = normalize_lang(req.lang)  # "cs-CZ" and "cs" share one cache entry

    key = cache_key(text, lang, req.format)
    cached = CACHE_DIR / f"{key}.{req.format}"
    if cached.exists():
        data = cached.read_bytes()
    else:
        wav_bytes = synthesize_wav(text, lang)
        data = wav_to_mp3(wav_bytes) if req.format == "mp3" else wav_bytes
        cached.write_bytes(data)

    return Response(content=data, media_type=FORMATS[req.format],
                    headers={"Cache-Control": "public, max-age=31536000"})


@app.get("/healthz")
def healthz():
    return {"ok": True, "voices": list(VOICES.keys())}
Run it:
uvicorn app:app --host 0.0.0.0 --port 8080
Test:
curl -s -X POST http://localhost:8080/tts \
-H "Content-Type: application/json" \
-d '{"text":"Dobrý den, jak se máte?","lang":"cs"}' \
--output out_cs.wav
curl -s -X POST http://localhost:8080/tts \
-H "Content-Type: application/json" \
-d '{"text":"Good morning, the report is ready.","lang":"en","format":"mp3"}' \
--output out_en.mp3
Why this design:
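- Voices load lazily on first use and then stay in memory, so only the first request per language pays the model-loading cost.
- The cache key is a SHA-256 hash of lang, format, and text, so a repeated read is served from disk without running synthesis again.
- WAV is the default because it needs no extra tooling; MP3 is optional and requires ffmpeg. At 22.05 kHz, 16-bit mono, WAV is roughly 3.7× larger over the wire than 96 kbps MP3 (and about 10× larger than low-bitrate Opus): fine on a LAN, but use MP3/Opus for the public internet.

Dockerize it (Step 3b):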
Dockerfile:
FROM python:3.12-slim
RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
COPY voices ./voices
RUN mkdir -p cache
EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
docker build -t my-tts .
docker run -p 8080:8080 -v "$PWD/cache:/app/cache" my-tts
Mounting cache/ as a volume means the cache survives container restarts.
If you'd rather not run Python, shell out to the piper binary from Node. Install Piper (Step 1) so piper is on PATH, then:
// server.js (install with: npm i express; run as an ES module)
import express from "express";
import { spawn } from "node:child_process";
import { createHash } from "node:crypto";
import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";
import path from "node:path";

const app = express();
app.use(express.json({ limit: "1mb" }));

const VOICES = {
  en: "voices/en_US-lessac-medium.onnx",
  cs: "voices/cs_CZ-jirka-medium.onnx",
};
const CACHE = "cache";
mkdirSync(CACHE, { recursive: true });

function key(text, lang) {
  return createHash("sha256").update(`${lang}|${text}`).digest("hex");
}

// Read the voice's sample rate from the .onnx.json config next to the model.
function sampleRateOf(model) {
  const config = JSON.parse(readFileSync(model + ".json", "utf8"));
  return config.audio.sample_rate;
}

function synth(text, model) {
  return new Promise((resolve, reject) => {
    // --output_raw streams 16-bit mono PCM to stdout; we wrap it as WAV below.
    const p = spawn("piper", ["-m", model, "--output_raw"]);
    const chunks = [];
    p.stdout.on("data", (d) => chunks.push(d));
    p.on("close", (code) =>
      code === 0 ? resolve(Buffer.concat(chunks)) : reject(new Error("piper failed")));
    p.on("error", reject);
    p.stdin.write(text);
    p.stdin.end();
  });
}

// Minimal WAV header for 16-bit mono PCM. The sample rate must match the voice
// (medium voices = 22050; low/x_low = 16000).
function wrapWav(pcm, sampleRate) {
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4);
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);              // fmt chunk size
  header.writeUInt16LE(1, 20);               // PCM
  header.writeUInt16LE(1, 22);               // mono
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(sampleRate * 2, 28);  // byte rate
  header.writeUInt16LE(2, 32);               // block align
  header.writeUInt16LE(16, 34);              // bits per sample
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);
  return Buffer.concat([header, pcm]);
}

app.post("/tts", async (req, res) => {
  try {
    const text = (req.body.text || "").trim();
    const lang = (req.body.lang || "en").toLowerCase().split("-")[0];
    const model = VOICES[lang];
    if (!text) return res.status(400).send("Empty text");
    if (!model) return res.status(400).send("Unsupported language");

    const file = path.join(CACHE, key(text, lang) + ".wav");
    let data;
    if (existsSync(file)) {
      data = readFileSync(file);
    } else {
      const pcm = await synth(text, model);
      data = wrapWav(pcm, sampleRateOf(model));
      writeFileSync(file, data);
    }
    res.set("Content-Type", "audio/wav");
    res.set("Cache-Control", "public, max-age=31536000");
    res.send(data);
  } catch (e) {
    res.status(500).send(String(e));
  }
});

app.listen(8080, () => console.log("TTS on :8080"));
Read the sample rate from the voice's .onnx.json (audio.sample_rate) rather than hardcoding 22050: low/x_low voices are 16000 Hz, and if the WAV header does not match, playback sounds chipmunk-fast or slow.
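The same idea carries over to Python if you ever consume Piper's raw PCM output there. This is a minimal sketch that assumes the config file sits next to the model as model-path plus ".json" and that the PCM is 16-bit mono; voice_sample_rate and pcm_to_wav are hypothetical helper names.

```python
import io
import json
import wave
from pathlib import Path


def voice_sample_rate(model_path: str) -> int:
    """Read audio.sample_rate from the .onnx.json next to the model."""
    config = json.loads(Path(f"{model_path}.json").read_text(encoding="utf-8"))
    return int(config["audio"]["sample_rate"])


def pcm_to_wav(pcm: bytes, sample_rate: int) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)         # mono
        w.setsampwidth(2)         # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()
```

Letting the standard wave module write the header avoids hand-computing byte offsets, and passing the rate read from the config keeps the header and the audio in agreement.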
Plain browser code. Send text + language, get audio, play it.
<button id="readBtn">🔊 Read aloud</button>
<script>
  async function speak(text, lang = "en") {
    try {
      const res = await fetch("/tts", {  // proxy or same-origin to :8080
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text, lang, format: "mp3" }),
      });
      if (!res.ok) throw new Error("TTS server error " + res.status);
      const blob = await res.blob();
      const url = URL.createObjectURL(blob);
      const audio = new Audio(url);
      audio.onended = () => URL.revokeObjectURL(url);  // free memory
      await audio.play();
    } catch (err) {
      console.warn("Server TTS failed, falling back to Web Speech:", err);
      browserSpeak(text, lang);  // see Step 8
    }
  }

  document.getElementById("readBtn").addEventListener("click", () => {
    speak("Toto je ukázka čtení textu nahlas.", "cs");
  });
</script>
For long documents, pre-warm the cache: fire the /tts request when the user opens the article, so by the time they click "Read aloud" the response is cached and instant.
Neural TTS reads words, not symbols. Numbers, currency, dates, and abbreviations are where output breaks — and Czech is harder than English because numerals decline by gender and case and abbreviations differ. Normalize before sending text to Piper.
Diacritics first. The single most common Czech bug is mangled diacritics from encoding mistakes. Ensure your whole path is UTF-8: the JSON body, your DB column, and the file you read from. Never strip diacritics "to be safe" — kun vs kůň, byt vs být change meaning and pronunciation. The pangram in Step 2 is your regression test.
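A small guard at the service boundary can catch the most common encoding damage before it reaches the voice. The sketch below rejects text containing U+FFFD (the replacement character that appears after a bad decode) and normalizes to NFC so that a letter plus combining mark becomes one precomposed character. Whether Piper's phonemizer treats decomposed input differently is an assumption worth testing; clean_czech_text is a hypothetical helper name.

```python
import unicodedata


def clean_czech_text(text: str) -> str:
    """Reject visibly corrupted input and normalize diacritics to NFC."""
    if "\ufffd" in text:
        # U+FFFD means bytes were decoded with the wrong encoding upstream.
        raise ValueError("text contains U+FFFD; fix the encoding upstream")
    # NFC turns "u" + combining ring into the single character "ů", etc.
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    decomposed = "ku\u030an\u030c"  # "kůň" written with combining marks
    print(len(decomposed), len(clean_czech_text(decomposed)))
```

Note that this never strips diacritics; it only unifies how they are encoded.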
Numbers and units. Piper's Czech voice does not robustly expand digits to declined Czech words. Two practical strategies:
1. Expand numbers with a library. The num2words package supports lang="cz":
from num2words import num2words
num2words(2026, lang="cz")  # -> "dva tisíce dvacet šest"
# to="ordinal" may raise NotImplementedError for "cz", depending on the
# num2words release; test it before relying on ordinal forms.
2. Replace symbols and common abbreviations with a small map, expanding numbers in the same pass:
import re
from num2words import num2words

CS_REPLACE = {
    r"%": " procent",
    r"€": " eur",
    r"Kč": " korun",
    r"\bč\.\s*": "číslo ",
    r"\bnapř\.": "například",
    r"\batd\.": "a tak dále",
    r"\btj\.": "to jest",
}


def num2words_safe(s: str) -> str:
    try:
        return num2words(int(s), lang="cz")
    except Exception:
        return s  # leave the digits in place if expansion fails


def normalize_cs(text: str) -> str:
    # Expand every run of digits to words, then apply the replacement map.
    text = re.sub(r"\d+", lambda m: num2words_safe(m.group()), text)
    for pat, rep in CS_REPLACE.items():
        text = re.sub(pat, rep, text)
    return text
Caveat / honesty note:
num2words Czech (cz) covers cardinals, but ordinal support depends on the release, and it does not decline by grammatical case to fit the surrounding sentence (e.g. "ve dvou tisících…"). For a reading feature this is acceptable, since listeners understand it, but it is not perfect Czech grammar. If grammatical correctness matters, a cloud neural voice (Google/Azure) does much better number/date expansion internally; that's another reason to keep the adapter (Step 7).
Abbreviations & acronyms. Decide per acronym whether to spell it out (EU → "É Ú") or expand it. Spell-out: insert spaces or periods between letters. Expand: keep a dictionary.
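The per-acronym decision can be sketched as two small tables: one for acronyms to expand and one for acronyms to spell out. The entries below are illustrative placeholders, not a vetted Czech dictionary, and handle_acronyms is a hypothetical helper name. Spelling out here only inserts spaces between letters; mapping each letter to its Czech letter name would be a further dictionary step.

```python
import re

# Illustrative entries only; build your own lists from your content.
EXPANSIONS = {"ČR": "Česká republika", "DPH": "daň z přidané hodnoty"}
SPELL_OUT = {"EU", "USA"}

# Runs of two or more uppercase letters, including Czech capitals.
ACRONYM = re.compile(r"\b[A-ZÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ]{2,}\b")


def handle_acronyms(text: str) -> str:
    def repl(match: re.Match) -> str:
        word = match.group()
        if word in EXPANSIONS:
            return EXPANSIONS[word]   # read as full words
        if word in SPELL_OUT:
            return " ".join(word)     # "EU" -> "E U"
        return word                   # unknown acronyms are left untouched
    return ACRONYM.sub(repl, text)


if __name__ == "__main__":
    print(handle_acronyms("EU a ČR"))
```

Leaving unknown acronyms alone is a deliberate default: a wrong expansion is usually worse than letting the voice attempt the word.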
Sentence chunking for long text. Don't send a 5,000-word article in one call — latency scales with length and you lose caching granularity. Split on sentence boundaries and synthesize per paragraph; cache each chunk; concatenate or play sequentially:
import re


def split_sentences(text: str) -> list[str]:
    # naive but effective; tune for abbreviations
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
Per-paragraph synthesis means a re-read of an edited article only re-synthesizes the changed paragraph — the rest stays cached.
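To see why per-chunk caching limits rework, the sketch below hashes each paragraph with a lang|format|text key and reports which paragraphs of an edited article have no cached counterpart. The helper names chunk_keys and changed_chunks are hypothetical.

```python
import hashlib


def chunk_keys(paragraphs: list[str], lang: str, fmt: str = "wav") -> list[str]:
    """One cache key per paragraph, built as SHA-256 of lang|format|text."""
    return [
        hashlib.sha256(f"{lang}|{fmt}|{p.strip()}".encode("utf-8")).hexdigest()
        for p in paragraphs
    ]


def changed_chunks(old: list[str], new: list[str], lang: str) -> list[int]:
    """Indexes of paragraphs in `new` that are not already cached from `old`."""
    cached = set(chunk_keys(old, lang))
    return [i for i, k in enumerate(chunk_keys(new, lang)) if k not in cached]


if __name__ == "__main__":
    before = ["První odstavec.", "Druhý odstavec.", "Třetí odstavec."]
    after = ["První odstavec.", "Druhý, upravený odstavec.", "Třetí odstavec."]
    print(changed_chunks(before, after, "cs"))
```

Because keys depend only on content, reordering paragraphs also costs nothing: every moved paragraph is still a cache hit.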
The whole point: your frontend always calls the same /tts. Behind it, a TTSBackend interface lets you change engines via one environment variable. Start on Piper (free, local); escalate to XTTS-v2 (better, self-hosted) or a cloud neural voice (best Czech, paid) — without touching frontend or callers.
# backends.py
import io
import wave
from abc import ABC, abstractmethod

import requests
from piper import PiperVoice


class TTSBackend(ABC):
    @abstractmethod
    def synth(self, text: str, lang: str) -> bytes:  # returns WAV/MP3 bytes
        ...


# ---- 1) Local Piper (default, free) --------------------------------
class PiperBackend(TTSBackend):
    def __init__(self, voices: dict[str, str]):
        self.voices = {k: PiperVoice.load(v) for k, v in voices.items()}

    def synth(self, text, lang):
        v = self.voices[lang.split("-")[0]]
        buf = io.BytesIO()
        with wave.open(buf, "wb") as w:
            # piper-tts 1.3+ writes WAVs via synthesize_wav(); older
            # releases used synthesize(text, wav_file) for the same job.
            write_wav = getattr(v, "synthesize_wav", None) or v.synthesize
            write_wav(text, w)
        return buf.getvalue()


# ---- 2) Self-hosted Coqui XTTS-v2 (better quality, GPU) ------------
# Run the coqui-tts (maintained fork: idiap/coqui-ai-TTS) server in Docker,
# then call its HTTP API. XTTS-v2 lists Czech ("cs") among supported langs.
# The endpoint path and JSON fields below are illustrative; match them to
# the server you actually run.
class XTTSBackend(TTSBackend):
    def __init__(self, base_url, speaker_wav=None):
        self.base = base_url.rstrip("/")
        self.speaker = speaker_wav

    def synth(self, text, lang):
        r = requests.post(f"{self.base}/api/tts", json={
            "text": text, "language": lang.split("-")[0],
            "speaker_wav": self.speaker,  # 6–10s reference clip for cloning
        }, timeout=120)
        r.raise_for_status()
        return r.content


# ---- 3) Cloud neural voice (best Czech, paid) ----------------------
# Example: an OpenAI-compatible /v1/audio/speech endpoint (works for OpenAI,
# and many gateways mimic this shape). For Google/Azure use their SDKs but
# keep this same .synth() signature.
class OpenAICompatBackend(TTSBackend):
    def __init__(self, base_url, api_key, voice="alloy", model="tts-1"):
        self.base, self.key = base_url.rstrip("/"), api_key
        self.voice, self.model = voice, model

    def synth(self, text, lang):
        r = requests.post(
            f"{self.base}/v1/audio/speech",
            headers={"Authorization": f"Bearer {self.key}"},
            json={"model": self.model, "voice": self.voice,
                  "input": text, "response_format": "mp3"},
            timeout=60,
        )
        r.raise_for_status()
        return r.content  # mp3 bytes
Selecting the backend at startup:
import os


def build_backend() -> TTSBackend:
    kind = os.getenv("TTS_BACKEND", "piper")
    if kind == "piper":
        return PiperBackend({
            "en": "voices/en_US-lessac-medium.onnx",
            "cs": "voices/cs_CZ-jirka-medium.onnx",
        })
    if kind == "xtts":
        return XTTSBackend(os.environ["XTTS_URL"])
    if kind == "openai":
        return OpenAICompatBackend(
            os.environ["TTS_BASE_URL"], os.environ["TTS_API_KEY"])
    raise ValueError(f"Unknown TTS_BACKEND={kind}")
Your /tts route now calls BACKEND.synth(text, lang) and keeps the same caching wrapper. To go from free-local to premium-Czech you set TTS_BACKEND=openai (or wire Google/Azure) and restart — that's it.
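One way to keep the caching wrapper independent of the engine is to wrap any backend in a small caching class. The sketch below is an assumption about how that wiring could look, not the chapter's reference implementation: CachedBackend and FakeBackend are hypothetical names, and the backend name is included in the cache key so that switching engines never serves audio produced by the previous one.

```python
import hashlib
import tempfile
from pathlib import Path


class CachedBackend:
    """Disk cache around any object with a synth(text, lang) -> bytes method."""

    def __init__(self, inner, cache_dir: Path, name: str):
        self.inner = inner
        self.name = name  # part of the key, so engines never share entries
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.hits = 0
        self.misses = 0

    def synth(self, text: str, lang: str) -> bytes:
        lang = lang.lower().split("-")[0]  # "cs-CZ" -> "cs"
        raw = f"{self.name}|{lang}|{text}".encode("utf-8")
        path = self.cache_dir / (hashlib.sha256(raw).hexdigest() + ".bin")
        if path.exists():
            self.hits += 1
            return path.read_bytes()
        data = self.inner.synth(text, lang)
        path.write_bytes(data)
        self.misses += 1
        return data


class FakeBackend:
    """Stand-in engine for the demo; a real backend returns audio bytes."""

    def __init__(self):
        self.calls = 0

    def synth(self, text: str, lang: str) -> bytes:
        self.calls += 1
        return f"{lang}:{text}".encode("utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        backend = CachedBackend(FakeBackend(), Path(tmp), name="fake")
        backend.synth("Dobrý den", "cs-CZ")
        backend.synth("Dobrý den", "cs")
        print("hits:", backend.hits, "misses:", backend.misses)
```

The hit and miss counters double as the cheapest possible check that the cache is actually being used under load.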
Which to escalate to, for Czech (see Ch. 6–8 for detail):
| Backend | Czech quality | Cost | Hardware | Notes |
|---|---|---|---|---|
| Piper (default) | good / natural-enough | free | CPU, ~1 GB RAM | real-time; the baseline |
| XTTS-v2 (self-host) | very good; supports voice cloning | free (your GPU) | GPU ~4–6 GB VRAM | Czech listed in supported langs; verify on your text |
| Google Cloud TTS | excellent cs-CZ neural | per-char, has free monthly quota | none | strong number/date handling |
| Azure Neural TTS | excellent cs-CZ (e.g. Vlasta/Antonín) | per-char, free tier | none | high naturalness |
| ElevenLabs | excellent multilingual incl. Czech | small free tier, then paid | none | best naturalness, priciest |
Always verify Czech on your own representative text before committing — provider quality varies by sentence type (proper nouns, numbers, foreign words). Confirm current pricing/free-tier limits at the vendor's pricing page; they change.
If your server is down (or you want a zero-infra default), the browser's built-in SpeechSynthesis can read text locally. Quality is OS-voice-dependent and Czech availability is not guaranteed (depends on the user's installed OS voices), but it's a free safety net.
<script>
  function browserSpeak(text, lang = "en") {
    if (!("speechSynthesis" in window)) {
      alert("No speech support available.");
      return;
    }
    const u = new SpeechSynthesisUtterance(text);
    const want = lang.startsWith("cs") ? "cs" : "en";
    // Voices may load async; wait if empty.
    const pick = () => {
      const voices = speechSynthesis.getVoices();
      const match = voices.find(v => v.lang.toLowerCase().startsWith(want));
      if (match) u.voice = match;
      u.lang = match ? match.lang : (want === "cs" ? "cs-CZ" : "en-US");
      speechSynthesis.cancel();  // stop anything in progress
      speechSynthesis.speak(u);
    };
    if (speechSynthesis.getVoices().length === 0) {
      // { once: true } keeps later voiceschanged events from replaying the text.
      speechSynthesis.addEventListener("voiceschanged", pick, { once: true });
    } else {
      pick();
    }
  }
</script>
Wire it as the catch branch in Step 5's speak() (already shown). Behavior:
- Availability of a cs-CZ voice depends on the platform: macOS/iOS usually ship a Czech voice; Android/Chrome varies.
- If no cs voice is found, the code falls back to a default voice reading Czech text with the wrong phonology. Detect that case and show a "Czech voice unavailable offline" notice rather than read it badly.

This tiered design gives you: server Piper (best free, consistent EN+CZ) → browser Web Speech (offline / server-down) → graceful message if even that lacks Czech.
| Symptom | Likely cause | Fix |
|---|---|---|
| Czech diacritics dropped / "????" | non-UTF-8 somewhere in the pipeline | Force UTF-8 on the JSON body, DB column, and file read; verify with the Step-2 pangram |
| Voice plays too fast/slow ("chipmunk") | WAV header sample rate ≠ voice rate | Read audio.sample_rate from the .onnx.json; use 22050 for medium, 16000 for low/x_low |
| 404 downloading voice | voice/quality path changed | Browse the HF repo tree to get the current exact path |
| First request very slow, rest fast | model loads on first call | Load voices at startup and keep the process warm; pre-warm cache for known text |
| Long text times out | one giant synthesis call | Split into sentences/paragraphs (Step 6), synthesize + cache per chunk |
| MP3 endpoint 500s | ffmpeg not installed | apt-get install ffmpeg (it's in the Dockerfile) or use WAV |
| piper: command not found (Node path) | binary not on PATH | Use the pip-installed piper, or full path; or switch to the Python service |
| Czech numbers read as digits/wrong | no normalization | Expand with num2words(lang="cz") + your replacement map (Step 6) |
| High CPU under load | re-synthesizing repeats | Confirm cache is hitting (log hits/misses); cache key must include lang+format |
| Robotic / flat Czech prosody | Piper ceiling for Czech | Acceptable for reading; escalate via adapter to XTTS-v2 or Google/Azure for premium Czech |
Audio format guidance. Use WAV on LAN / localhost (simplest, lossless, no transcode). Use MP3 (~96 kbps) or Opus over the public internet: against 22.05 kHz 16-bit mono WAV (~353 kbps), 96 kbps MP3 cuts bandwidth about 3.7×, and Opus at speech bitrates around 32 kbps cuts it about 10×. Browsers play all three via <audio>/Audio(). If you add Opus, transcode with ffmpeg -f ogg -c:a libopus.
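The size of uncompressed audio follows directly from sample rate, bit depth, and channel count, so the savings from a compressed format are easy to estimate. In this sketch the 32 kbit/s Opus figure is an assumed speech setting, not something Piper prescribes; pcm_kbps is a hypothetical helper name.

```python
def pcm_kbps(sample_rate_hz: int, bits_per_sample: int = 16, channels: int = 1) -> float:
    """Bitrate of uncompressed PCM audio in kbit/s."""
    return sample_rate_hz * bits_per_sample * channels / 1000


if __name__ == "__main__":
    wav_rate = pcm_kbps(22050)  # medium/high Piper voices: 16-bit mono
    # 96 kbit/s matches the ffmpeg setting above; 32 kbit/s Opus is assumed.
    for name, kbps in [("MP3", 96), ("Opus", 32)]:
        print(f"{name} at {kbps} kbit/s: WAV is {wav_rate / kbps:.1f}x larger")
```

For 16 kHz low and x_low voices the WAV bitrate drops to 256 kbit/s, so the relative savings shrink accordingly.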
Latency budget. With Piper medium on a modern laptop CPU, a 1–2 sentence request synthesizes in well under a second; a paragraph in 1–3 s. Caching makes repeats instant. If you need sub-200 ms streaming first-audio, that's a streaming-TTS concern (Ch. 9) — Piper can stream raw PCM, but for a "read my text" feature, synthesize-then-play with a warm cache is simpler and good enough.
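To check these numbers on your own hardware, time a synthesis call and compare it with the duration of the audio it produced. The ratio is often called the real-time factor: below 1.0 means synthesis is faster than playback. The sketch assumes the synth callable returns complete WAV bytes; measure and wav_duration_seconds are hypothetical helper names.

```python
import io
import time
import wave
from typing import Callable


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Length of the audio in seconds, read from the WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / w.getframerate()


def measure(synth: Callable[[str], bytes], text: str) -> tuple[float, float]:
    """Return (seconds spent synthesizing, real-time factor)."""
    start = time.perf_counter()
    wav_bytes = synth(text)
    elapsed = time.perf_counter() - start
    return elapsed, elapsed / wav_duration_seconds(wav_bytes)
```

Run it once with a cold process and once warm: the difference between the two is the model-loading cost that keeping voices in memory avoids.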
- Piper is the default local engine: free, CPU-only, with pre-trained en_US and cs_CZ voices. Czech is natural-enough for reading; English is solid.
- Voices are .onnx + .onnx.json pairs from rhasspy/piper-voices; verify the exact Czech path in the HF tree before downloading, and prefer medium quality.
- The service exposes one endpoint, POST /tts {text, lang}, with SHA-256 caching keyed on lang+format+text and voices kept loaded in RAM. Caching is what keeps it free and fast.
- Normalize Czech text before synthesis: keep diacritics intact end to end, expand numbers with num2words(lang="cz"), map common abbreviations, and chunk long text by sentence/paragraph.
- Keep the engine behind an adapter (TTSBackend.synth(text, lang)) so you can swap Piper → self-hosted XTTS-v2 → Google/Azure/ElevenLabs for premium Czech with one env var and zero frontend changes.