This is the payoff chapter. Earlier chapters surveyed individual engines (Piper, Kokoro, XTTS, ElevenLabs, Google, Azure) and the trade-offs between local, self-hosted, and cloud TTS. Here we stop comparing parts and start assembling complete, end-to-end systems you can actually drop into your software project.
We present four reference architectures, each with a text diagram, a component list, an explicit English + Czech strategy (Czech is graded separately every time, because it is the hard filter), a real cost estimate, honest pros/cons, and a "choose this when" rule. Then we give a single primary recommendation for the reader's stated sweet spot — the best natural-sounding option that stays free or near-free, both languages — and a decision tree to map your constraints to one of the four.
The guiding principle throughout: decouple your application from the engine. Whichever architecture you pick, make your app talk to an abstraction (ideally an OpenAI-compatible /v1/audio/speech endpoint or a thin internal tts.speak(text, lang, voice) function). That single decision lets you swap Piper for Kokoro, or local for cloud, or English-engine-for-Czech-engine, without touching application code. Every architecture below is built so the caller never knows or cares which engine produced the bytes.
Before the architectures, internalize this contract. Your application should never import piper or import elevenlabs directly. It should call something like:
    # app/tts.py: the ONLY place that knows which engine is used
    import hashlib
    import os

    # Assumes app/ is a package and the backend lives in app/tts_backend.py.
    from app.tts_backend import _backend_synthesize

    CACHE_DIR = "/var/cache/tts"

    def speak(text: str, lang: str = "en", voice: str | None = None) -> bytes:
        """Return MP3/WAV bytes for `text`. Caller is engine-agnostic."""
        key = hashlib.sha256(f"{lang}|{voice}|{text}".encode()).hexdigest()
        path = os.path.join(CACHE_DIR, f"{key}.mp3")
        if os.path.exists(path):  # cache hit: free, instant
            with open(path, "rb") as f:
                return f.read()
        audio = _backend_synthesize(text, lang, voice)  # swappable backend
        os.makedirs(CACHE_DIR, exist_ok=True)  # create the cache dir on first write
        with open(path, "wb") as f:
            f.write(audio)
        return audio
The _backend_synthesize function is the only thing that changes between Architecture A, B, C, and D. The cache in front of it is the single most important cost-control mechanism in this entire chapter: if your texts repeat at all (UI strings, articles read more than once, re-reads on scroll-back), the cache turns even a paid cloud engine into something that's free most of the time. Build the cache first, before you choose an engine.
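One possible hardening of that cache, offered as a sketch rather than a required design: the variant below takes the engine as a parameter and writes through a temporary file, so a crash or a concurrent request never leaves a half-written file in the cache. The names `cached_speak`, `backend` and `cache_dir` are illustrative and not part of the snippet above.

```python
import hashlib
import os
import tempfile
from typing import Callable, Optional

# Hypothetical engine signature: (text, lang, voice) -> audio bytes.
Backend = Callable[[str, str, Optional[str]], bytes]


def cached_speak(text: str, lang: str, voice: Optional[str],
                 backend: Backend, cache_dir: str) -> bytes:
    """Return audio for `text`, calling `backend` only on a cache miss."""
    key = hashlib.sha256(f"{lang}|{voice}|{text}".encode()).hexdigest()
    path = os.path.join(cache_dir, f"{key}.mp3")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    audio = backend(text, lang, voice)
    os.makedirs(cache_dir, exist_ok=True)
    # Write to a temp file in the same directory, then rename it into place.
    # os.replace is atomic, so readers never see a partial cache entry.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(audio)
    os.replace(tmp_path, path)
    return audio
```

Passing the backend in as an argument also makes the cache easy to test with a stub engine, which is how you can confirm that a repeated string triggers exactly one synthesis call.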
A second rule: pick the engine per language, not per app. Czech and English have very different "best free option" answers. The cleanest architectures below route by lang.
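Routing by language can be as small as a dictionary lookup. The sketch below assumes each engine is wrapped in a function taking `(text, voice)`; `make_router` and the two stub engines are hypothetical names used only for illustration, standing in for real Kokoro or Piper calls.

```python
from typing import Callable, Dict, Optional

# Hypothetical per-engine signature: (text, voice) -> audio bytes.
Engine = Callable[[str, Optional[str]], bytes]


def make_router(engines: Dict[str, Engine]) -> Callable[[str, str, Optional[str]], bytes]:
    """Build a synthesize(text, lang, voice) function that dispatches on `lang`."""

    def synthesize(text: str, lang: str, voice: Optional[str] = None) -> bytes:
        try:
            engine = engines[lang]
        except KeyError:
            # Fail loudly: sending Czech text to an English voice would be worse.
            raise ValueError(f"no TTS engine registered for lang={lang!r}") from None
        return engine(text, voice)

    return synthesize


# Stub engines standing in for real ones (for example Kokoro for "en", Piper for "cs").
def fake_english(text: str, voice: Optional[str]) -> bytes:
    return b"EN:" + text.encode("utf-8")


def fake_czech(text: str, voice: Optional[str]) -> bytes:
    return b"CS:" + text.encode("utf-8")


synthesize = make_router({"en": fake_english, "cs": fake_czech})
```

Swapping the Czech engine then means changing one dictionary entry; the caller keeps passing `lang` and never learns which engine answered.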
Architecture A (Free-Local). The pitch: zero marginal cost, zero data leaving your box, no API keys, works offline. The trade-off is a lower naturalness ceiling and the operational burden of running models yourself. This is the architecture the reader's "free + EN + CZ" brief points at most directly.
There are two sub-variants: A1 uses the best free engine per language (better quality, more moving parts), and A2 uses one engine for both languages (simpler ops, voice cloning).
                 ┌──────────────────────────────────────────┐
    your app ──► │ speak(text, lang) + content-hash cache   │
                 └─────┬───────────────────────┬────────────┘
               lang=en │                       │ lang=cs
                       ▼                       ▼
              ┌────────────────┐      ┌────────────────┐
              │   Kokoro-82M   │      │  Piper cs_CZ   │
              │ (ONNX/PyTorch) │      │  (ONNX, fast)  │
              └────────────────┘      └────────────────┘
                       ▼                       ▼
                  natural EN            intelligible CZ
                 (Apache-2.0)          (MIT, CPU-only OK)
- English: Kokoro-82M (Apache-2.0), running on ONNX or PyTorch.
- Czech: Piper cs_CZ voices. Piper is a fast, local neural TTS (now maintained as OHF-Voice/piper1-gpl) with prebuilt Czech voices in the rhasspy/piper-voices dataset under cs/. It runs on a Raspberry Pi 4, so CPU-only is fine. The Czech output is clear and intelligible but not "premium-natural": somewhat flat prosody, occasional stress quirks. It's the best truly free, truly local Czech option, but be honest with yourself that it won't match Azure/Google Czech neural voices.

EN quality: good (free-tier best-in-class). CZ quality (graded explicitly): acceptable / utilitarian. Fine for reading text aloud, not for marketing-grade narration.
Cost: $0 software. Hardware: runs on CPU; both models comfortably fit in a few GB RAM. A GPU helps Kokoro latency but isn't required.
Pros: $0 marginal, offline, private, no rate limits, no vendor lock-in, permissive licenses. Cons: you run/patch two engines; Czech naturalness is the weak link; CPU latency is higher than cloud; no SSML-rich control.
    your app ──► speak(text, lang, speaker_wav) ──► XTTS-v2 (one model)
                                                     ├─ lang="en"   natural EN
                                                     └─ lang="cs"   Czech (in 17-lang set)
                                                     + voice cloning from a short clip
A2 runs a single model: XTTS-v2 covers Czech (cs) alongside English in its 17-language set, plus voice cloning from a ~6-second reference clip. One model, both languages, consistent voice.

EN quality: very good. CZ quality (graded explicitly): good but uneven; better prosody than Piper, but verify pronunciation on your own text.

Cost: $0 software (mind the CPML license for commercial use); GPU recommended.
When to choose Architecture A: privacy/offline is mandatory, budget is hard $0, you're comfortable running models, and Czech "good enough to read text" clears your bar. Choose A1 for permissive licensing and CPU-friendliness; choose A2 if you want one cloned voice across both languages and have a GPU (and your use is non-commercial or you've cleared the license).
Architecture B (Self-Hosted API). The pitch: same free engines as Architecture A, but wrapped in a standard HTTP API your app already knows how to call: the OpenAI /v1/audio/speech shape. Your app stays a thin client; the engine lives in a container you can run on a cheap VPS, a homelab box, or a serverless GPU when you need speed. This is the architecture most projects should actually ship, because it gives you A's $0 economics with C's clean integration and the ability to scale or swap later.
    ┌──────────┐    POST /v1/audio/speech     ┌────────────────────────────┐
    │ your app │ ───────────────────────────► │ Docker container (GPU/CPU) │
    │ (any     │    {model, input, voice}     │ Kokoro-FastAPI OR          │
    │  lang)   │ ◄─────────────────────────── │ openedai-speech (Piper/    │
    └──────────┘          audio/mpeg          │ XTTS/Kokoro backends)      │
         │                                    └────────────────────────────┘
         └── content-hash cache in front (CDN / Redis / disk)
Server options:

- Kokoro-FastAPI (remsky/Kokoro-FastAPI): a Dockerized FastAPI wrapper around Kokoro-82M with CPU ONNX and NVIDIA GPU PyTorch support, OpenAI-compatible endpoints, streaming, and auto-stitching of long text. Best when English is your priority and you want the cleanest Kokoro deployment.
- openedai-speech: an OpenAI /v1/audio/speech-compatible server that can front multiple backends (Piper for fast/light, XTTS for cloning/quality). Best when you want one endpoint that can serve Piper-Czech and a richer English voice behind the same API.

Language routing: select the engine through the request's voice/model field. Map voice="en_kokoro" → Kokoro; voice="cs_piper" → Piper Czech (via openedai-speech), or run two containers and route in your _backend_synthesize. The OpenAI shape means switching to a paid OpenAI-compatible provider later (Architecture C) is a one-line base-URL change.

EN quality: good (Kokoro). CZ quality (graded explicitly): same as A. Piper-Czech is utilitarian; XTTS-Czech is good but uneven (license caveat). The server doesn't change the engine's voice quality; it changes your operations.

Cost: a few dollars per month for a CPU VPS; serverless GPU runs cents per active minute, and near $0 with caching at low volume.
Pros: standard API, app stays portable, trivial to swap to cloud later, scales horizontally, one cache layer serves everything. Cons: you operate a service (uptime, updates, monitoring); GPU-on-demand has cold-start latency; Czech quality unchanged from local.
When to choose: you want $0/near-$0 economics and a clean, future-proof integration; you may add more apps/clients; you anticipate possibly moving to paid quality for one language without re-plumbing. For most readers building a real feature, this is the pragmatic default.
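To make the "one-line base-URL change" concrete, here is a minimal stdlib-only sketch of a client for the OpenAI-shaped endpoint. The `ROUTES` table is an assumption: the keys reuse the illustrative voice names from this section, port 8880 matches the Kokoro-FastAPI address used in this chapter's closing example, and the second entry's port, model and voice values are placeholders you would replace with whatever your own server exposes.

```python
import json
import urllib.request

# Hypothetical routing table: logical voice key -> server and request fields.
ROUTES = {
    "en_kokoro": {"base_url": "http://localhost:8880/v1",
                  "model": "kokoro", "voice": "af_heart"},
    # Placeholder values for a second container serving Czech.
    "cs_piper": {"base_url": "http://localhost:8000/v1",
                 "model": "tts-1", "voice": "cs_piper"},
}


def build_speech_request(text: str, voice_key: str,
                         routes: dict = ROUTES) -> urllib.request.Request:
    """Build a POST request in the OpenAI /v1/audio/speech shape."""
    route = routes[voice_key]
    body = json.dumps({
        "model": route["model"],
        "voice": route["voice"],
        "input": text,
        "response_format": "mp3",
    }).encode("utf-8")
    return urllib.request.Request(
        url=route["base_url"] + "/audio/speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, voice_key: str, timeout: float = 60.0) -> bytes:
    """Send the request and return the audio bytes (needs a running server)."""
    with urllib.request.urlopen(build_speech_request(text, voice_key),
                                timeout=timeout) as resp:
        return resp.read()
```

Moving a voice to a different server, local or paid, then means editing its `base_url` (and adding an `Authorization` header if the provider requires one); the calling code does not change.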
Architecture C (Premium-Cloud). The pitch: when "must sound real" beats "must be free," cloud neural TTS wins decisively, and Czech is where this matters most, because the cloud's lead over free engines is largest in Czech. The cost objection is answered not by avoiding the cloud but by caching aggressively, so you pay once per unique string and never again.
    ┌──────────┐  speak(text, lang)            ┌──────────────────────────────┐
    │ your app │ ───► CACHE ──(miss)──────────►│ Cloud TTS provider           │
    └──────────┘        │                      │ ElevenLabs / Google / Azure  │
                        │ (hit=free)           └──────────────────────────────┘
                        ▼                                     │
                 CDN + disk/Redis ◄───────────────────────────┘  store MP3 by content hash
Provider options:

- Google Cloud TTS: Czech (cs-CZ) voices, including its newer Chirp 3: HD voice line, which explicitly lists cs-CZ among its supported languages. Pricing (verified May 2026): Chirp 3: HD = 1M characters/month free, then $30/1M characters; standard voices are far cheaper, with a 4M-char free tier. That free tier alone covers a lot of reading. Often the best price/quality for Czech if ElevenLabs is overkill.
- Azure: neural Czech voices (cs-CZ, e.g. Vlasta/Antonín). Solid quality, rich SSML, free tier plus pay-as-you-go per character. Good enterprise/Microsoft-stack fit.

Language strategy: the simplest setup uses one provider for both lang values. You can also mix: ElevenLabs for Czech (where the gap is biggest), Google/Azure or even local for English (where free is already good). That mix is Architecture D.

EN quality: excellent. CZ quality (graded explicitly): best available. ElevenLabs > Google HD ≈ Azure neural, all clearly above any free local Czech.

Cost: pay-per-character; with caching, dominated by your first-read volume of unique text, not total plays.
Pros: top naturalness in both languages, especially Czech; zero ops; SSML control; low latency. Cons: ongoing cost (mitigated, not eliminated, by caching); data leaves your box (privacy/compliance review needed); vendor lock-in unless you keep the OpenAI-compatible abstraction; free tiers won't cover real production volume.
When to choose: naturalness — especially Czech naturalness — is non-negotiable, your unique-text volume is bounded (so caching keeps the bill small), and offline/privacy isn't a hard requirement.
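The effect of caching on the bill is simple arithmetic, sketched below. The free allowance and per-million price are the Chirp 3: HD figures quoted in this section; the workload (1,000 articles of 2,000 characters, each played 50 times) is invented purely for illustration, and `len()` is only an approximation of how a provider counts billable characters.

```python
def billable_characters(requests: list[str], cached: bool = True) -> int:
    """Characters sent to the provider for a stream of synthesis requests."""
    if cached:
        # With a content-hash cache, each unique string is synthesized once.
        return sum(len(text) for text in set(requests))
    return sum(len(text) for text in requests)


def monthly_cost_usd(requests: list[str], free_chars: int,
                     usd_per_million: float, cached: bool = True) -> float:
    """Estimated bill after subtracting the monthly free allowance."""
    billable = max(0, billable_characters(requests, cached) - free_chars)
    return billable / 1_000_000 * usd_per_million


# Illustrative workload: 1,000 distinct 2,000-character articles, 50 plays each.
articles = [f"{i:04d}" + "a" * 1996 for i in range(1000)]
plays = articles * 50

with_cache = monthly_cost_usd(plays, free_chars=1_000_000, usd_per_million=30.0)
without_cache = monthly_cost_usd(plays, free_chars=1_000_000,
                                 usd_per_million=30.0, cached=False)
print(f"with cache: ${with_cache:.2f}, without cache: ${without_cache:.2f}")
```

For this made-up workload the cached bill comes to $30 against $2,970 uncached, which is the "pay for unique text, not total plays" point in numbers. The estimate assumes all unique text is new within the month and that the cache is never evicted.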
Architecture D (Hybrid). The pitch: the sweet-spot architecture for this reader specifically. The asymmetry is the whole insight: free English is already good (Kokoro), but free Czech is only okay (Piper/XTTS). So spend money only where free falls short: route Czech to a cloud voice, keep English local and free.
                 ┌──────────────────────────────────────────────┐
    your app ──► │ speak(text, lang) + content-hash cache       │
                 └─────┬───────────────────────┬────────────────┘
               lang=en │                       │ lang=cs
                       ▼                       ▼
              ┌────────────────┐    ┌──────────────────────┐
              │ Kokoro (local) │    │ Cloud CZ voice       │
              │ FREE, offline  │    │ (Google HD / Eleven) │
              └────────────────┘    │ paid, but CACHED     │
                                    └──────────────────────┘
- English: Kokoro, local and free.
- Czech: a cloud cs-CZ voice (Google HD for best price/quality, ElevenLabs for max naturalness), behind the same content-hash cache so repeated Czech reads are free after the first.
- Routing: _backend_synthesize plus the cache. You can even start fully local (Architecture A), find that Czech quality isn't enough, and flip only the Czech branch to cloud, with no app changes.

EN quality: good, free. CZ quality (graded explicitly): best available (cloud), cost-controlled by caching.

Cost: $0 for English; near $0 for Czech at low/repetitive volume, scaling with unique Czech characters.
Pros: best quality-per-dollar for a bilingual EN/CZ app; pay only where free is weak; degrades gracefully (cloud down → fall back to local Czech). Cons: two code paths; Czech still leaves your box (privacy); you manage one API key.
When to choose: you want natural Czech and tight budget and you're fine with English-good-enough-free. This is the recommended architecture for the reader's stated goal.
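The graceful-degradation path mentioned in the pros (cloud down, fall back to local Czech) can be sketched as a small wrapper. `with_fallback` and the stub engines are hypothetical names; in a real system `primary` would be the cloud Czech call and `fallback` a local Piper call.

```python
import logging
from typing import Callable

Synth = Callable[[str], bytes]
log = logging.getLogger("tts")


def with_fallback(primary: Synth, fallback: Synth) -> Synth:
    """Return a synthesizer that tries `primary` and falls back on any error."""

    def synthesize(text: str) -> bytes:
        try:
            return primary(text)
        except Exception as exc:  # broad on purpose: network, auth, quota errors
            log.warning("primary TTS failed (%s); using fallback", exc)
            return fallback(text)

    return synthesize


# Stubs standing in for a cloud engine that is down and a local engine that works.
def cloud_down(text: str) -> bytes:
    raise ConnectionError("provider unreachable")


def local_czech(text: str) -> bytes:
    return b"LOCAL:" + text.encode("utf-8")


czech = with_fallback(cloud_down, local_czech)
```

One design point to decide deliberately: if fallback audio is written to the content-hash cache, listeners keep getting the lower-quality version after the cloud recovers, so it may be worth skipping the cache write on the fallback path.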
| | A. Free-Local | B. Self-Hosted API | C. Premium-Cloud | D. Hybrid |
|---|---|---|---|---|
| EN engine | Kokoro / XTTS | Kokoro (Docker) | ElevenLabs/Google/Azure | Kokoro (local) |
| CZ engine | Piper / XTTS | Piper/XTTS (Docker) | ElevenLabs/Google/Azure | Cloud cs-CZ |
| EN quality | Good | Good | Excellent | Good |
| CZ quality | Okay (utilitarian) | Okay (utilitarian) | Best | Best |
| Marginal cost | $0 | ~$0 (VPS/serverless) | per-char (cache-reduced) | $0 EN / low CZ |
| Ops burden | Medium (run models) | Medium (run service) | None | Low-Medium |
| Offline/private | Yes | Yes (your infra) | No | EN yes / CZ no |
| Integration | Direct lib calls | OpenAI-compatible API | Vendor SDK/API | Both |
| Best for | Hard $0, offline | Pragmatic default | Quality-first | Free+quality EN/CZ |
The brief is: the best natural-sounding option that stays free or near-free, EN + CZ both natural, technical reader who can run Docker. That points to a clear staged plan, not a single product:
1. Build the engine-agnostic speak() function + content-hash cache first (the snippet above). This is the highest-leverage code you'll write; it makes everything reversible and cheap.
2. Start with Architecture B (self-hosted, OpenAI-compatible) running Kokoro for English and Piper for Czech in Docker. Cost: ~$0. Run Kokoro-FastAPI (or openedai-speech if you want one endpoint for both). You now have a real, portable TTS feature at zero marginal cost with a standard API.
3. Listen to the Czech output on your actual text. If Piper-Czech (or XTTS-Czech) is good enough for "read my text aloud," stop here: you're free and natural enough. Be honest in the listening test: read 5–10 of your real Czech passages, including names and numbers.
4. If Czech isn't natural enough, flip to Architecture D: keep Kokoro-English free and local; switch only the Czech branch to a cloud cs-CZ voice. Start with Google Cloud TTS HD (best price/quality, generous free tier; ElevenLabs only if you need maximum expressiveness). Because the cache sits in front, repeated Czech is still effectively free; you pay only for first-time unique Czech characters, which for most apps is a tiny, bounded bill.
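For the listening test, it helps to render your passages to files once and compare engines side by side. The helper below is a sketch: `render_samples` is a hypothetical name, `synthesize` is whatever Czech backend you are evaluating, and the two passages are invented examples that you should replace with your own text.

```python
import os
from typing import Callable, Iterable


def render_samples(passages: Iterable[str], synthesize: Callable[[str], bytes],
                   out_dir: str, ext: str = "mp3") -> list[str]:
    """Write one audio file per passage and return the paths, in order."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, passage in enumerate(passages, start=1):
        path = os.path.join(out_dir, f"sample_{i:02d}.{ext}")
        with open(path, "wb") as f:
            f.write(synthesize(passage))
        paths.append(path)
    return paths


# Illustrative Czech passages with a name and a time; use your real text instead.
PASSAGES = [
    "Dobrý den, paní Nováková.",
    "Vlak do Brna odjíždí v 17:45 z třetího nástupiště.",
]
```

Running it once per engine into separate directories (say `out/piper` and `out/cloud`) gives you matching file names to play back to back.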
In one line: Kokoro (EN, free, local) + Piper (CZ, free, local) behind an OpenAI-compatible Docker server and a content-hash cache — and upgrade only the Czech branch to Google HD cloud if your ears say Piper isn't natural enough. That sequence guarantees you land at the cheapest point that still clears the "Czech must sound natural" filter, and never overpays.
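A note on commercial use: if your project is commercial, prefer Kokoro (Apache-2.0) and Piper for the local branch, and avoid the official XTTS-v2 weights (non-commercial CPML). For Piper, check which codebase you ship: the original rhasspy/piper is MIT, while the piper1-gpl successor is GPL-licensed, as its name indicates, and individual voices carry their own licenses. Cloud providers are fine for commercial use under their terms.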
    START: Do you NEED offline / fully-private (data must not leave your box)?
    │
    ├── YES ─► Architecture A (Free-Local).
    │          ├─ Need ONE cloned voice across EN+CZ & have a GPU & non-commercial?
    │          │    └─► A2: XTTS-v2 (mind CPML license)
    │          └─ Otherwise ─► A1: Kokoro (EN) + Piper (CZ)
    │
    └── NO ─► Is hard $0 the top priority (quality second)?
              │
              ├── YES ─► Architecture B (self-hosted OpenAI-compatible, Kokoro+Piper).
              │          Then LISTEN to Czech on your real text:
              │          ├─ Czech good enough? ── STOP. Done, free.
              │          └─ Czech not enough?  ──► go to Architecture D.
              │
              └── NO (quality first) ─► Is it BOTH languages that must be premium,
                                        and is your unique-text volume bounded?
                                        ├── YES ─► Architecture C (Premium-Cloud + caching).
                                        │          Czech-expressive priority → ElevenLabs.
                                        │          Best price/quality → Google HD / Azure.
                                        └── NO ─► Architecture D (Hybrid):  ◄── RECOMMENDED DEFAULT
                                                  Kokoro EN (free, local) + cloud cs-CZ (cached).
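The same tree can be written as a small function, which is handy if you want the choice recorded in code or config. This is a direct transcription of the branches above; the function and parameter names are illustrative.

```python
def choose_architecture(needs_offline: bool,
                        hard_zero_budget: bool = False,
                        both_premium_and_bounded: bool = False,
                        cloned_voice_gpu_noncommercial: bool = False,
                        czech_good_enough: bool = True) -> str:
    """Map the decision-tree answers to "A1", "A2", "B", "C" or "D"."""
    if needs_offline:
        # Offline / fully private: stay local, pick the sub-variant.
        return "A2" if cloned_voice_gpu_noncommercial else "A1"
    if hard_zero_budget:
        # Start self-hosted; move to the hybrid only if Czech fails the listening test.
        return "B" if czech_good_enough else "D"
    # Quality first: full cloud only if both languages must be premium
    # and the volume of unique text is bounded.
    return "C" if both_premium_and_bounded else "D"
```

For example, `choose_architecture(needs_offline=False, hard_zero_budget=True, czech_good_enough=False)` returns `"D"`, matching the "Czech not enough" branch.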
    # app/tts_backend.py: the swappable core; everything else stays the same
    import httpx
    from google.cloud import texttospeech  # pip install google-cloud-texttospeech

    # --- English: local Kokoro via OpenAI-compatible server (Architecture B) ---
    def _en_local(text: str) -> bytes:
        r = httpx.post(
            "http://localhost:8880/v1/audio/speech",  # Kokoro-FastAPI
            json={"model": "kokoro", "voice": "af_heart",
                  "input": text, "response_format": "mp3"},
            timeout=60,
        )
        r.raise_for_status()
        return r.content

    # --- Czech: cloud cs-CZ (only the bytes that need it; cache makes it cheap) ---
    _g = texttospeech.TextToSpeechClient()

    def _cs_cloud(text: str) -> bytes:
        resp = _g.synthesize_speech(
            input=texttospeech.SynthesisInput(text=text),
            voice=texttospeech.VoiceSelectionParams(
                language_code="cs-CZ",           # Czech neural / HD voice
                name="cs-CZ-Chirp3-HD-Aoede"),   # verify exact voice name in console
            audio_config=texttospeech.AudioConfig(
                audio_encoding=texttospeech.AudioEncoding.MP3),
        )
        return resp.audio_content

    def _backend_synthesize(text: str, lang: str, voice: str | None) -> bytes:
        # `voice` is accepted for interface compatibility; each branch uses a fixed voice.
        return _cs_cloud(text) if lang == "cs" else _en_local(text)
Wire this _backend_synthesize into the cached speak() from the top of the chapter and you have the recommended system: free natural English, premium natural Czech, paid only on cache misses. Swapping the Czech branch to ElevenLabs (or back to local Piper) is one function — exactly the portability the design rule bought you.
speak(text, lang) with a disk/CDN cache makes every later choice reversible and turns paid engines near-free on repeated text.