Chapter 13: Reference Architectures & Recommendations

Chapter 14 of 17 · ~18 min read

Overview

This is the payoff chapter. Earlier chapters surveyed individual engines (Piper, Kokoro, XTTS, ElevenLabs, Google, Azure) and the trade-offs between local, self-hosted, and cloud TTS. Here we stop comparing parts and start assembling complete, end-to-end systems you can actually drop into your software project.

We present four reference architectures, each with a text diagram, a component list, an explicit English + Czech strategy (Czech is graded separately every time, because it is the hard filter), a cost estimate, honest pros and cons, and a "choose this when" rule. Then we give a single primary recommendation for the reader's stated sweet spot (the most natural-sounding option that stays free or near-free in both languages) and a decision tree that maps your constraints to one of the four.

The guiding principle throughout: decouple your application from the engine. Whichever architecture you pick, make your app talk to an abstraction (ideally an OpenAI-compatible /v1/audio/speech endpoint or a thin internal tts.speak(text, lang, voice) function). That single decision lets you swap Piper for Kokoro, or local for cloud, or English-engine-for-Czech-engine, without touching application code. Every architecture below is built so the caller never knows or cares which engine produced the bytes.
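As a rough illustration of what "talking to an abstraction" can look like on the wire, the sketch below builds a request for an OpenAI-compatible /v1/audio/speech endpoint using only the standard library. The JSON fields (model, input, voice, response_format) follow the OpenAI speech API; the base URL, model name, and voice name are placeholder assumptions, and a given self-hosted server may expect different values.

```python
import json
import urllib.request


def build_speech_request(
    text: str,
    voice: str,
    base_url: str = "http://localhost:8880",  # assumption: wherever your TTS server listens
    model: str = "tts-1",                     # assumption: servers differ in accepted model names
    response_format: str = "mp3",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_via_http(text: str, voice: str, **kwargs) -> bytes:
    """Send the request and return the raw audio bytes from the response body."""
    request = build_speech_request(text, voice, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()
```

Because the caller only supplies text and a voice name, pointing base_url at a different server (local, self-hosted, or cloud) is a configuration change rather than a code change.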

Content

The one design rule that makes all of this work

Before the architectures, internalize this contract. Your application should never import piper or import elevenlabs directly. It should call something like:

# app/tts.py: the ONLY place that knows which engine is used
import hashlib
import os

import httpx

CACHE_DIR = "/var/cache/tts"

def speak(text: str, lang: str = "en", voice: str | None = None) -> bytes:
    """Return MP3/WAV bytes for `text`. Caller is engine-agnostic."""
    # The cache key covers everything that affects the audio output.
    key = hashlib.sha256(f"{lang}|{voice}|{text}".encode()).hexdigest()
    path = os.path.join(CACHE_DIR, f"{key}.mp3")
    if os.path.exists(path):                        # cache hit: free, instant
        with open(path, "rb") as f:
            return f.read()
    audio = _backend_synthesize(text, lang, voice)  # swappable backend
    os.makedirs(CACHE_DIR, exist_ok=True)           # create the cache dir on first use
    with open(path, "wb") as f:
        f.write(audio)
    return audio

The _backend_synthesize function is the only thing that changes between Architectures A, B, C, and D. The cache in front of it is the single most important cost-control mechanism in this entire chapter: if your texts repeat at all (UI strings, articles read more than once, re-reads on scroll-back), the cache turns even a paid cloud engine into something that is free most of the time. Build the cache first, before you choose an engine.

	A. Free-Local	B. Self-Hosted API	C. Premium-Cloud	D. Hybrid
EN engine	Kokoro / XTTS	Kokoro (Docker)	ElevenLabs/Google/Azure	Kokoro (local)
CZ engine	Piper / XTTS	Piper/XTTS (Docker)	ElevenLabs/Google/Azure	Cloud `cs-CZ`
EN quality	Good	Good	Excellent	Good
CZ quality	Okay (utilitarian)	Okay (utilitarian)	Best	Best
Marginal cost	$0	~$0 (VPS/serverless)	per-char (cache-reduced)	$0 EN / low CZ
Ops burden	Medium (run models)	Medium (run service)	None	Low-Medium
Offline/private	Yes	Yes (your infra)	No	EN yes / CZ no
Integration	Direct lib calls	OpenAI-compatible API	Vendor SDK/API	Both
Best for	Hard $0, offline	Pragmatic default	Quality-first	Free+quality EN/CZ
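The "Best for" and quality rows of the table can be read as a rough selection rule. The sketch below encodes one such reading as a function; it is only an interpretation of the table, not the chapter's own decision tree, and the three boolean inputs are hypothetical names chosen for illustration.

```python
def pick_architecture(
    *,
    hard_zero_or_offline: bool,
    need_excellent_english: bool,
    need_best_czech: bool,
) -> str:
    """Return "A", "B", "C", or "D" based on a simplified reading of the table."""
    if hard_zero_or_offline:
        # Table: A is "Best for: Hard $0, offline".
        return "A"
    if need_excellent_english:
        # Table: only C lists EN quality as "Excellent".
        return "C"
    if need_best_czech:
        # Table: D keeps English local at $0 and sends Czech to the cloud.
        return "D"
    # Table: B is the "Pragmatic default".
    return "B"
```

For example, a project that can tolerate a small cloud bill and cares mostly about Czech naturalness lands on D, while one with no special constraints falls through to B.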

Text-to-Speech for Reading Your Texts — Natural Voices in English & Czech (2026)

Architecture A — FREE-LOCAL (everything on your own machine/server, $0)

A1 — Best-per-language: Kokoro (EN) + Piper (CZ)

A2 — One engine for both: XTTS-v2 (Coqui)

Architecture B — BALANCED SELF-HOSTED (OpenAI-compatible TTS server in Docker)

Architecture C — PREMIUM-CLOUD (top naturalness, cost controlled by caching)

Architecture D — HYBRID (local for the cheap-good language, cloud for the weak one)

Side-by-side comparison

Primary recommendation for the reader's sweet spot

Decision tree

A minimal hybrid backend (Architecture D), end to end

Key Takeaways

Key References