Text-to-Speech for Reading Your Texts — Natural Voices in English & Czech (2026)

Chapter 14: Implementation Guide — Step by Step

Overview

This is the hands-on chapter. By the end you will have a working TTS feature that reads English and Czech text aloud, runs for free on your own hardware, and exposes a single clean HTTP endpoint that your app calls with { text, lang }. We implement the report's primary recommendation: Piper as the default local engine (fast, free, CPU-only, with genuinely usable Czech and English voices). Piper sits behind an adapter interface, so you can later swap in a higher-fidelity engine (self-hosted Coqui XTTS-v2) or a cloud API (Google / Azure / ElevenLabs) behind the same call, with no frontend changes.

We cover: install and run Piper (both as a Python library and as a Docker HTTP server), download the EN + CZ voices, a tested Python (FastAPI) server endpoint with on-disk caching, a Node/Express equivalent, a frontend snippet to play the audio, Czech text-normalization tips (numbers, dates, abbreviations, diacritics), the adapter pattern to drop in a cloud backend, a troubleshooting section, and a browser Web Speech API fallback for when the server is down.

Why Piper as the default? It runs on CPU faster than real time on a modern laptop, ships pre-trained cs_CZ and en_US/en_GB voices, and has a tiny dependency footprint. On licensing, the original rhasspy/piper releases are MIT-licensed, while the successor project piper1-gpl is GPL-3.0, so check the license of the engine version and the voices you ship. Czech quality is "clearly intelligible and natural-enough for reading paragraphs," not studio-grade; when you need studio-grade Czech, the adapter lets you escalate to XTTS-v2 or a cloud neural voice. See Chapters 5–7 and 12 for the full quality/cost comparison that leads to this choice.

Content

Step 0 — Prerequisites

# Python 3.10–3.12 recommended (Piper wheels target these)
python3 --version

# A folder for your TTS service
mkdir -p tts-service/voices tts-service/cache
cd tts-service
python3 -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate

You need ~200–500 MB of disk for a couple of voices, and roughly 1 GB of RAM while synthesizing. (XTTS-v2 later wants a GPU with ~4–6 GB VRAM for snappy results, or it runs slowly on CPU.)

Step 1 — Install Piper

pip install piper-tts
# This pulls onnxruntime + the piper CLI. Verify:
piper --help

Step 2 — Download English + Czech voices

cd voices

# English (US) — a solid medium-quality voice
curl -L -o en_US-lessac-medium.onnx \
  https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
curl -L -o en_US-lessac-medium.onnx.json \
  https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json

# Czech — the maintained cs_CZ medium voice
curl -L -o cs_CZ-jirka-medium.onnx \
  https://huggingface.co/rhasspy/piper-voices/resolve/main/cs/cs_CZ/jirka/medium/cs_CZ-jirka-medium.onnx
curl -L -o cs_CZ-jirka-medium.onnx.json \
  https://huggingface.co/rhasspy/piper-voices/resolve/main/cs/cs_CZ/jirka/medium/cs_CZ-jirka-medium.onnx.json
cd ..

Quality	Sample rate	Model size (approx)	When to use
`x_low`	16 kHz	~5–10 MB	embedded / very constrained
`low`	16 kHz	~15–25 MB	fast, low CPU
`medium`	22.05 kHz	~50–75 MB	default sweet spot
`high`	22.05 kHz	~100+ MB	best Piper quality, more CPU
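
The sample rates in the table are typical values; the number that actually applies to a given voice is recorded in its .onnx.json file. A minimal sketch of reading it, assuming the config sits next to the model and stores the rate under audio.sample_rate (the key the troubleshooting table refers to). The helper name voice_sample_rate is ours, not part of Piper.

```python
import json
from pathlib import Path


def voice_sample_rate(model_path: str) -> int:
    """Return the sample rate stored in the voice's sidecar config.

    Assumes the config lives at "<model_path>.json", e.g.
    voices/en_US-lessac-medium.onnx.json, as downloaded in this step.
    """
    config_path = Path(f"{model_path}.json")
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return int(config["audio"]["sample_rate"])


# Usage (after downloading the voices):
#   voice_sample_rate("voices/cs_CZ-jirka-medium.onnx")
```

Reading the value once at startup and passing it along avoids hard-coding 22050 anywhere a WAV header is built by hand.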

echo "Hello, this is a test of natural sounding speech." \
  | piper -m voices/en_US-lessac-medium.onnx -f test_en.wav

echo "Ahoj, toto je test českého hlasu. Příliš žluťoučký kůň úpěl ďábelské ódy." \
  | piper -m voices/cs_CZ-jirka-medium.onnx -f test_cs.wav
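
If you would rather check the test files programmatically than by ear, the standard-library wave module can report what was written. This is a small sketch under the assumption that the output is plain PCM WAV (which is what the commands above produce); wav_info is a hypothetical helper name.

```python
import io
import wave


def wav_info(wav_bytes: bytes) -> dict:
    """Summarize a PCM WAV file held in memory."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sample_width_bytes": w.getsampwidth(),
            "sample_rate": rate,
            "seconds": frames / rate if rate else 0.0,
        }


# Usage:
#   with open("test_cs.wav", "rb") as f:
#       print(wav_info(f.read()))
```

For the medium voices used here you would expect one channel, a 2-byte sample width, and a sample rate matching the voice's .onnx.json.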

Step 3 — A Python (FastAPI) server endpoint with caching

# requirements.txt
fastapi
uvicorn[standard]
piper-tts

# app.py
import hashlib
import io
import wave
from pathlib import Path

from fastapi import FastAPI, HTTPException
from fastapi.responses import Response
from pydantic import BaseModel
from piper import PiperVoice  # Python API shipped with the piper-tts package

# ---- Configuration -------------------------------------------------
VOICE_DIR = Path("voices")
CACHE_DIR = Path("cache")
CACHE_DIR.mkdir(exist_ok=True)

VOICES = {
    "en": VOICE_DIR / "en_US-lessac-medium.onnx",
    "cs": VOICE_DIR / "cs_CZ-jirka-medium.onnx",
}

# Lazy-load voices once, keep in memory (loading is the slow part).
_loaded: dict[str, PiperVoice] = {}

def get_voice(lang: str) -> PiperVoice:
    lang = lang.lower().split("-")[0]  # "cs-CZ" -> "cs"
    if lang not in VOICES:
        raise HTTPException(400, f"Unsupported language: {lang}")
    if lang not in _loaded:
        model_path = VOICES[lang]
        if not model_path.exists():
            raise HTTPException(500, f"Voice model missing: {model_path}")
        _loaded[lang] = PiperVoice.load(str(model_path))
    return _loaded[lang]

# ---- Synthesis -----------------------------------------------------
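def synthesize_wav(text: str, lang: str) -> bytes:
    voice = get_voice(lang)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav_file:
        # piper-tts 1.2.x writes WAV via synthesize(text, wav_file); newer
        # piper1-gpl releases name that operation synthesize_wav(). Use
        # whichever your installed version provides.
        write_wav = getattr(voice, "synthesize_wav", None) or voice.synthesize
        write_wav(text, wav_file)   # writes a valid WAV header + PCM
    return buf.getvalue()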

def cache_key(text: str, lang: str, fmt: str) -> str:
    raw = f"{lang}|{fmt}|{text}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

# ---- API -----------------------------------------------------------
app = FastAPI()

class TTSRequest(BaseModel):
    text: str
    lang: str = "en"          # "en" or "cs"
    format: str = "wav"       # "wav" (default) or "mp3" if ffmpeg present

@app.post("/tts")
def tts(req: TTSRequest):
    text = (req.text or "").strip()
    if not text:
        raise HTTPException(400, "Empty text")
    if len(text) > 5000:
        raise HTTPException(413, "Text too long; split into chunks (see docs)")

    # Validate before touching the filesystem: fmt becomes a file extension.
    fmt = req.format.lower()
    if fmt not in ("wav", "mp3"):
        raise HTTPException(400, f"Unsupported format: {req.format}")
    lang = req.lang.lower().split("-")[0]   # "cs-CZ" and "cs" share one cache entry
    if lang not in VOICES:
        raise HTTPException(400, f"Unsupported language: {req.lang}")

    key = cache_key(text, lang, fmt)
    cached = CACHE_DIR / f"{key}.{fmt}"

    if cached.exists():
        data = cached.read_bytes()
    else:
        wav_bytes = synthesize_wav(text, lang)
        data = wav_to_mp3(wav_bytes) if fmt == "mp3" else wav_bytes
        cached.write_bytes(data)

    media = "audio/mpeg" if fmt == "mp3" else "audio/wav"
    return Response(content=data, media_type=media,
                    headers={"Cache-Control": "public, max-age=31536000"})

@app.get("/healthz")
def healthz():
    return {"ok": True, "voices": list(VOICES.keys())}

# ---- Optional MP3 transcode (needs ffmpeg on PATH) -----------------
def wav_to_mp3(wav_bytes: bytes) -> bytes:
    import subprocess
    p = subprocess.run(
        ["ffmpeg", "-hide_banner", "-loglevel", "error",
         "-i", "pipe:0", "-f", "mp3", "-b:a", "96k", "pipe:1"],
        input=wav_bytes, stdout=subprocess.PIPE, check=True,
    )
    return p.stdout

uvicorn app:app --host 0.0.0.0 --port 8080

curl -s -X POST http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{"text":"Dobrý den, jak se máte?","lang":"cs"}' \
  --output out_cs.wav

curl -s -X POST http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{"text":"Good morning, the report is ready.","lang":"en","format":"mp3"}' \
  --output out_en.mp3
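
One detail worth considering for the on-disk cache: the endpoint writes the cache file in a single write_bytes call, so a crash mid-write or two concurrent requests for the same text could, in principle, leave or serve a partial file. A common remedy is to write to a temporary file in the same directory and rename it into place. The sketch below shows that pattern; write_cache_atomic is a hypothetical helper, not something the service above already defines.

```python
import os
import tempfile
from pathlib import Path


def write_cache_atomic(path: Path, data: bytes) -> None:
    """Write data to path so readers never observe a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must be on the same filesystem for os.replace to be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


# Usage inside the endpoint, in place of cached.write_bytes(data):
#   write_cache_atomic(cached, data)
```

Because the cache key is a content hash, two requests racing on the same key write identical bytes, so whichever rename lands last is harmless.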

# Dockerfile
FROM python:3.12-slim
RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
COPY voices ./voices
RUN mkdir -p cache
EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]

docker build -t my-tts .
docker run -p 8080:8080 -v "$PWD/cache:/app/cache" my-tts

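Step 4 — Node/Express equivalent (if your stack is JS)

// server.js (ES module syntax: set "type": "module" in package.json; npm i express)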
import express from "express";
import { spawn } from "node:child_process";
import { createHash } from "node:crypto";
import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";
import path from "node:path";

const app = express();
app.use(express.json({ limit: "1mb" }));

const VOICES = {
  en: "voices/en_US-lessac-medium.onnx",
  cs: "voices/cs_CZ-jirka-medium.onnx",
};
const CACHE = "cache";
mkdirSync(CACHE, { recursive: true });

function key(text, lang) {
  return createHash("sha256").update(`${lang}|${text}`).digest("hex");
}

function synth(text, model) {
  return new Promise((resolve, reject) => {
    // --output_raw streams 16-bit mono PCM to stdout; we wrap it as WAV below.
    const p = spawn("piper", ["-m", model, "--output_raw"]);
    const chunks = [];
    p.stdout.on("data", (d) => chunks.push(d));
    p.on("close", (code) =>
      code === 0 ? resolve(Buffer.concat(chunks)) : reject(new Error("piper failed")));
    p.on("error", reject);
    p.stdin.write(text);
    p.stdin.end();
  });
}

// Minimal WAV header for 16-bit mono PCM. Sample rate must match the voice
// (medium voices = 22050; low/x_low = 16000 — read it from the .onnx.json).
function wrapWav(pcm, sampleRate = 22050) {
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4);
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);
  header.writeUInt16LE(1, 20);            // PCM
  header.writeUInt16LE(1, 22);            // mono
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(sampleRate * 2, 28);
  header.writeUInt16LE(2, 32);
  header.writeUInt16LE(16, 34);
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);
  return Buffer.concat([header, pcm]);
}

app.post("/tts", async (req, res) => {
  try {
    const text = (req.body.text || "").trim();
    const lang = (req.body.lang || "en").toLowerCase().split("-")[0];
    const model = VOICES[lang];
    if (!text) return res.status(400).send("Empty text");
    if (!model) return res.status(400).send("Unsupported language");

    const file = path.join(CACHE, key(text, lang) + ".wav");
    let data;
    if (existsSync(file)) {
      data = readFileSync(file);
    } else {
      const pcm = await synth(text, model);
      data = wrapWav(pcm, 22050);
      writeFileSync(file, data);
    }
    res.set("Content-Type", "audio/wav");
    res.set("Cache-Control", "public, max-age=31536000");
    res.send(data);
  } catch (e) {
    res.status(500).send(String(e));
  }
});

app.listen(8080, () => console.log("TTS on :8080"));

Step 5 — Frontend snippet to play it

<button id="readBtn">🔊 Read aloud</button>
<script>
async function speak(text, lang = "en") {
  try {
    const res = await fetch("/tts", {            // proxy or same-origin to :8080
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text, lang, format: "mp3" }),
    });
    if (!res.ok) throw new Error("TTS server error " + res.status);
    const blob = await res.blob();
    const url = URL.createObjectURL(blob);
    const audio = new Audio(url);
    audio.onended = () => URL.revokeObjectURL(url);  // free memory
    await audio.play();
  } catch (err) {
    console.warn("Server TTS failed, falling back to Web Speech:", err);
    browserSpeak(text, lang);                    // see Step 8
  }
}

document.getElementById("readBtn").addEventListener("click", () => {
  speak("Toto je ukázka čtení textu nahlas.", "cs");
});
</script>

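Step 6 — Text normalization (especially for Czech)

# pip install num2words
from num2words import num2words
num2words(2026, lang="cz")     # -> "dva tisíce dvacet šest"

# Ordinals: the Czech converter may not implement to="ordinal" in your
# installed version, so test it before relying on it.
try:
    num2words(3, lang="cz", to="ordinal")
except NotImplementedError:
    pass  # fall back to your own ordinal table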

import re

CS_REPLACE = {
    r"%": " procent",
    r"€": " eur",
    r"Kč": " korun",
    r"\bč\.\s*": "číslo ",
    r"\bnapř\.": "například",
    r"\batd\.": "a tak dále",
    r"\btj\.": "to jest",
}

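def normalize_cs(text: str) -> str:
    # Expand each run of digits to words. This is deliberately naive: decimals
    # ("3,5"), grouped thousands ("1 000"), and dates need their own rules.
    text = re.sub(r"\d+", lambda m: num2words_safe(m.group()), text)
    for pat, rep in CS_REPLACE.items():
        text = re.sub(pat, rep, text)
    # The replacements add leading spaces; collapse any doubled whitespace.
    return re.sub(r"\s{2,}", " ", text).strip()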

def num2words_safe(s: str) -> str:
    from num2words import num2words
    try:
        return num2words(int(s), lang="cz")
    except Exception:
        return s
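
On the diacritics side, text pasted from PDFs or some filesystems can arrive with accents stored as separate combining characters, and Czech typography often uses non-breaking spaces after one-letter prepositions. Whether a given engine treats these the same as the usual forms is something to verify rather than assume, so normalizing first is a cheap safeguard. A minimal sketch using only the standard library (clean_unicode is a hypothetical helper name):

```python
import unicodedata


def clean_unicode(text: str) -> str:
    """Compose accents (NFC) and turn non-breaking spaces into plain spaces."""
    composed = unicodedata.normalize("NFC", text)
    return composed.replace("\u00a0", " ")


# "r" + combining caron becomes the single character "ř", and so on.
print(clean_unicode("Pr\u030ci\u0301lis\u030c"))  # Příliš
```

Running this before normalize_cs also means the same sentence always produces the same cache key, regardless of how its accents were encoded.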

import re
def split_sentences(text: str) -> list[str]:
    # naive but effective; tune for abbreviations
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
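
Sentence splitting is most useful when combined with a size limit, since the endpoint rejects text over 5000 characters. The sketch below packs whole sentences into chunks of at most max_chars, hard-splitting only a sentence that is itself too long. It repeats split_sentences so the block runs on its own; chunk_text and the default of 1000 characters are illustrative choices, not values from the service.

```python
import re


def split_sentences(text: str) -> list[str]:
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Group sentences into chunks no longer than max_chars."""
    chunks: list[str] = []
    current = ""
    for sentence in split_sentences(text):
        # A single oversized sentence is cut at the limit as a last resort.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("A. B. C.", max_chars=5))  # ['A. B.', 'C.']
```

Each chunk can then be sent to /tts separately, which keeps individual requests short and lets repeated paragraphs hit the cache.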

Step 7 — The adapter pattern: swap in XTTS-v2 or a cloud API later

# backends.py
from abc import ABC, abstractmethod
import io, wave
from piper import PiperVoice

class TTSBackend(ABC):
    @abstractmethod
    def synth(self, text: str, lang: str) -> bytes:  # returns WAV/MP3 bytes
        ...

# ---- 1) Local Piper (default, free) --------------------------------
class PiperBackend(TTSBackend):
    def __init__(self, voices: dict[str, str]):
        self.voices = {k: PiperVoice.load(v) for k, v in voices.items()}
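    def synth(self, text, lang):
        v = self.voices[lang.split("-")[0]]
        buf = io.BytesIO()
        with wave.open(buf, "wb") as w:
            # synthesize_wav() on newer piper1-gpl releases, synthesize() on 1.2.x
            write_wav = getattr(v, "synthesize_wav", None) or v.synthesize
            write_wav(text, w)
        return buf.getvalue()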

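# ---- 2) Self-hosted Coqui XTTS-v2 (better quality, GPU) ------------
# Run the coqui-tts (maintained fork: idiap/coqui-ai-TTS) server in Docker,
# then call its HTTP API. XTTS-v2 lists Czech ("cs") among supported langs.
# NOTE: the /api/tts request body below is illustrative. Parameter names and
# encoding differ between XTTS server builds, so match them to the one you deploy.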
import requests
class XTTSBackend(TTSBackend):
    def __init__(self, base_url, speaker_wav=None):
        self.base = base_url.rstrip("/")
        self.speaker = speaker_wav
    def synth(self, text, lang):
        r = requests.post(f"{self.base}/api/tts", json={
            "text": text, "language": lang.split("-")[0],
            "speaker_wav": self.speaker,   # 6–10s reference clip for cloning
        }, timeout=120)
        r.raise_for_status()
        return r.content

# ---- 3) Cloud neural voice (best Czech, paid) ----------------------
# Example: an OpenAI-compatible /v1/audio/speech endpoint (works for OpenAI,
# and many gateways mimic this shape). For Google/Azure use their SDKs but
# keep this same .synth() signature.
class OpenAICompatBackend(TTSBackend):
    def __init__(self, base_url, api_key, voice="alloy", model="tts-1"):
        self.base, self.key = base_url.rstrip("/"), api_key
        self.voice, self.model = voice, model
    def synth(self, text, lang):
        r = requests.post(
            f"{self.base}/v1/audio/speech",
            headers={"Authorization": f"Bearer {self.key}"},
            json={"model": self.model, "voice": self.voice,
                  "input": text, "response_format": "mp3"},
            timeout=60,
        )
        r.raise_for_status()
        return r.content   # mp3 bytes

import os
def build_backend() -> TTSBackend:
    kind = os.getenv("TTS_BACKEND", "piper")
    if kind == "piper":
        return PiperBackend({
            "en": "voices/en_US-lessac-medium.onnx",
            "cs": "voices/cs_CZ-jirka-medium.onnx",
        })
    if kind == "xtts":
        return XTTSBackend(os.environ["XTTS_URL"])
    if kind == "openai":
        return OpenAICompatBackend(
            os.environ["TTS_BASE_URL"], os.environ["TTS_API_KEY"])
    raise ValueError(f"Unknown TTS_BACKEND={kind}")
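
Because every backend shares the same synth() signature, the adapter can also be composed. One option, sketched here as an assumption rather than part of the service above, is a wrapper that tries backends in order and moves on when one raises. FallbackBackend is a hypothetical class; the TTSBackend base is repeated so the block runs on its own.

```python
from abc import ABC, abstractmethod


class TTSBackend(ABC):
    @abstractmethod
    def synth(self, text: str, lang: str) -> bytes:
        ...


class FallbackBackend(TTSBackend):
    """Try each backend in order; return the first successful result."""

    def __init__(self, *backends: TTSBackend):
        if not backends:
            raise ValueError("need at least one backend")
        self.backends = backends

    def synth(self, text: str, lang: str) -> bytes:
        errors = []
        for backend in self.backends:
            try:
                return backend.synth(text, lang)
            except Exception as exc:  # any failure moves on to the next backend
                errors.append(f"{type(backend).__name__}: {exc}")
        raise RuntimeError("all TTS backends failed: " + "; ".join(errors))


# Usage, with the classes from backends.py:
#   backend = FallbackBackend(XTTSBackend(url), PiperBackend(voices))
```

One caveat: the backends above do not all return the same container (Piper yields WAV, the OpenAI-compatible one yields MP3), so a real deployment would need to report or normalize the format alongside the bytes.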

Backend	Czech quality	Cost	Hardware	Notes
Piper (default)	good / natural-enough	free	CPU, ~1 GB RAM	real-time; the baseline
XTTS-v2 (self-host)	very good; supports voice cloning	free (your GPU)	GPU ~4–6 GB VRAM	Czech listed in supported langs; verify on your text
Google Cloud TTS	excellent `cs-CZ` neural	per-char, has free monthly quota	none	strong number/date handling
Azure Neural TTS	excellent `cs-CZ` (e.g. Vlasta/Antonín)	per-char, free tier	none	high naturalness
ElevenLabs	excellent multilingual incl. Czech	small free tier, then paid	none	best naturalness, priciest

Step 8 — Fallback path: browser Web Speech API

<script>
function browserSpeak(text, lang = "en") {
  if (!("speechSynthesis" in window)) {
    alert("No speech support available.");
    return;
  }
  const u = new SpeechSynthesisUtterance(text);
  const want = lang.startsWith("cs") ? "cs" : "en";

  // Voices may load async; wait if empty.
  const pick = () => {
    const voices = speechSynthesis.getVoices();
    const match = voices.find(v => v.lang.toLowerCase().startsWith(want));
    if (match) u.voice = match;
    u.lang = match ? match.lang : (want === "cs" ? "cs-CZ" : "en-US");
    speechSynthesis.cancel();          // stop anything in progress
    speechSynthesis.speak(u);
  };

  if (speechSynthesis.getVoices().length === 0) {
    speechSynthesis.onvoiceschanged = pick;
  } else {
    pick();
  }
}
</script>

Troubleshooting

Symptom	Likely cause	Fix
Czech diacritics dropped / "????"	non-UTF-8 somewhere in the pipeline	Force UTF-8 on the JSON body, DB column, and file read; verify with the Step-2 pangram
Voice plays too fast/slow ("chipmunk")	WAV header sample rate ≠ voice rate	Read `audio.sample_rate` from the `.onnx.json`; use 22050 for medium, 16000 for low/x_low
404 downloading voice	voice/quality path changed	Browse the HF repo tree to get the current exact path
First request very slow, rest fast	model loads on first call	Load voices at startup and keep the process warm; pre-warm cache for known text
Long text times out	one giant synthesis call	Split into sentences/paragraphs (Step 6), synthesize + cache per chunk
MP3 endpoint 500s	`ffmpeg` not installed	`apt-get install ffmpeg` (it's in the Dockerfile) or use WAV
`piper: command not found` (Node path)	binary not on PATH	Use the pip-installed `piper`, or full path; or switch to the Python service
Czech numbers read as digits/wrong	no normalization	Expand with `num2words(lang="cz")` + your replacement map (Step 6)
High CPU under load	re-synthesizing repeats	Confirm cache is hitting (log hits/misses); cache key must include lang+format
Robotic / flat Czech prosody	Piper ceiling for Czech	Acceptable for reading; escalate via adapter to XTTS-v2 or Google/Azure for premium Czech
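
For the "Long text times out" row, synthesizing per chunk leaves you with several WAV files that the client may want as one. A minimal sketch of joining them with the standard-library wave module, assuming every chunk came from the same voice and therefore shares channel count, sample width, and sample rate (concat_wavs is a hypothetical helper):

```python
import io
import wave


def concat_wavs(wav_chunks: list[bytes]) -> bytes:
    """Join PCM WAV files that share the same audio format into one WAV."""
    if not wav_chunks:
        raise ValueError("no WAV chunks to join")
    out = io.BytesIO()
    expected = None
    with wave.open(out, "wb") as writer:
        for chunk in wav_chunks:
            with wave.open(io.BytesIO(chunk), "rb") as reader:
                fmt = (reader.getnchannels(), reader.getsampwidth(),
                       reader.getframerate())
                if expected is None:
                    # The first chunk defines the output format.
                    expected = fmt
                    writer.setnchannels(fmt[0])
                    writer.setsampwidth(fmt[1])
                    writer.setframerate(fmt[2])
                elif fmt != expected:
                    raise ValueError(f"format mismatch: {fmt} != {expected}")
                writer.writeframes(reader.readframes(reader.getnframes()))
    return out.getvalue()
```

Mixing chunks from voices with different sample rates raises an error here instead of producing the "chipmunk" effect described in the table.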

Key References

OHF-Voice / Rhasspy. piper1-gpl (GitHub). 2025. https://github.com/OHF-Voice/piper1-gpl — current home of the Piper TTS engine (formerly rhasspy/piper); CLI flags, build, usage.
Rhasspy. piper-voices (Hugging Face). 2024–2025. https://huggingface.co/rhasspy/piper-voices — official voice catalog; the cs/cs_CZ/... and en/en_US/... .onnx/.onnx.json download paths and quality levels.
PyPI. piper-tts. 2024–2025. https://pypi.org/project/piper-tts/ — the pip package providing the piper CLI and PiperVoice Python API used in this chapter.
Idiap Research Institute. coqui-ai-TTS (maintained Coqui fork, GitHub). 2024–2025. https://github.com/idiap/coqui-ai-TTS — maintained XTTS-v2 fork; supported languages (incl. Czech cs), API/server, voice cloning via speaker_wav.
Coqui. XTTS-v2 model card (Hugging Face). 2023–2024. https://huggingface.co/coqui/XTTS-v2 — XTTS-v2 multilingual model, supported language list, licensing notes.
MDN Web Docs. Web Speech API — SpeechSynthesis. 2025. https://developer.mozilla.org/en-US/docs/Web/API/SpeechSynthesis — browser TTS fallback API, voice selection, cs-CZ availability caveats.
savoirfairelinux. num2words (GitHub/PyPI). 2024–2025. https://github.com/savoirfairelinux/num2words — number-to-words library with Czech (cz) support for text normalization.
Google Cloud. Text-to-Speech — supported voices & pricing. 2025. https://cloud.google.com/text-to-speech/docs/list-voices-and-types — cs-CZ neural voices, per-character pricing and free monthly quota for the cloud adapter path.
Microsoft. Azure AI Speech — language & voice support / pricing. 2025. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support — cs-CZ neural voices (e.g. Vlasta, Antonín) and free-tier limits for the cloud adapter.
FastAPI. FastAPI documentation. 2025. https://fastapi.tiangolo.com/ — framework used for the Python /tts server endpoint, request models, and binary responses.
OpenAI. Audio / Text-to-speech API reference (/v1/audio/speech). 2025. https://platform.openai.com/docs/api-reference/audio/createSpeech — request shape mirrored by the OpenAI-compatible adapter backend.
