This chapter frames the entire report. You want to add a text-to-speech (TTS) feature to your own software so it can read your texts aloud — and it has to sound genuinely natural in both English and Czech, ideally while staying free or near-free. That combination is harder than it sounds. Most of the headline-grabbing 2025–2026 TTS models are English-first; many sound stunning in English and robotic (or simply unsupported) in Czech. So Czech quality is treated throughout this report as a hard filter, graded explicitly for every option rather than assumed.
Here we define the problem precisely, introduce the four decision axes you should weigh (naturalness, cost, hosting/privacy, language coverage), sketch a quick map of the 2026 landscape, preview the recommendation categories the rest of the report builds toward, explain how to read the report, and give you a working glossary of TTS jargon (MOS, prosody, zero-shot voice cloning, streaming, SSML, phonemes, vocoder). It closes with a one-screen orientation table.
Content
The problem, stated precisely
You are not buying a consumer "read-aloud" app. You are an engineer adding a TTS capability to software you control. That changes everything about how to choose. The real requirements are:
It reads your text — arbitrary text your app produces, not a fixed script. So you need an API, a library, or a self-hostable engine you can call programmatically.
It must sound real — listeners should not immediately think "that's a robot." For long-form reading especially, prosody (rhythm, stress, intonation) matters more than raw audio fidelity.
English AND Czech, both natural — this is the binding constraint. English is the easy case; almost everything does English well. Czech is the differentiator. Czech is a West Slavic language with rich inflection, palatalization, the notorious ř, vowel-length distinctions, and front-loaded word stress. Many engines either don't ship a Czech voice at all, or ship one that mangles these. Whenever a model's Czech support is weak, unverified, or community-reported rather than vendor-confirmed, this report says so out loud.
Free or near-free — you are happy to run code, Docker, or a self-hosted server, and you'll consider cloud APIs too, but the sweet spot is "best natural-sounding option that costs roughly nothing." That can mean fully local/offline (no per-character fee at all), or a cloud API whose free tier covers your real volume.
A useful mental anchor: ~1 million characters ≈ 160,000–200,000 words ≈ about two typical full-length novels of reading. Keep that conversion in mind every time you see a "free tier of X characters/month": it tells you instantly whether a free tier is generous or a toy.
A worked volume estimate (do this first)
Before choosing anything, estimate your real monthly character volume, because it decides almost everything. A rough rule of thumb: English averages ~5 characters per word plus a space, so ≈ 6 characters per word; Czech words run a touch longer on average, so call it ~6.5. Using the ~6 figure:
| Your usage | Words/month | ≈ Characters/month | Fits which free tier? |
|---|---|---|---|
| Personal app, light reading | 5,000 | ~30K | Azure and Google with room to spare; already above ElevenLabs free (~10K) |
| Small product, moderate | 50,000 | ~300K | Azure (500K), Google premium (1M) |
| Active product | 150,000 | ~1M | Google premium free tier, just barely; Azure overflows |
| Heavy / long-form library | 1,000,000+ | ~6M+ | No free tier; go self-hosted or pay |
The lesson: at light-to-moderate volume, a cloud free tier is literally $0 and beats the hassle of self-hosting. At heavy volume, self-hosting is the only option that stays free, because its marginal cost is $0 once the hardware is paid for. The crossover sits roughly in the 0.5–1M chars/month band, which is exactly why this report's recommendation hinges on your number. Measure it (count the characters your app would send in a typical month) before reading further.
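The arithmetic behind this estimate can be sketched in a few lines of Python. This is a minimal illustration, assuming the chapter's rough ratios (about 6 characters per English word, about 6.5 per Czech word) and its approximate free-tier sizes. The names `CHARS_PER_WORD`, `FREE_TIERS`, `estimate_chars`, and `tiers_that_fit` are invented for this example, and the tier sizes should be checked against the live vendor pages.

```python
# Rough ratios from this chapter; they are rules of thumb, not measurements.
CHARS_PER_WORD = {"en": 6.0, "cs": 6.5}

# Approximate monthly free tiers quoted in this chapter (verify before relying on them).
FREE_TIERS = {
    "ElevenLabs free": 10_000,
    "Azure Neural": 500_000,
    "Google premium": 1_000_000,
}


def estimate_chars(words_per_month: int, lang: str = "en") -> int:
    """Convert a monthly word count into an approximate character count."""
    return round(words_per_month * CHARS_PER_WORD[lang])


def tiers_that_fit(chars_per_month: int) -> list[str]:
    """Return the free tiers whose monthly allowance covers the given volume."""
    return [name for name, limit in FREE_TIERS.items() if chars_per_month <= limit]


if __name__ == "__main__":
    for words in (5_000, 50_000, 150_000, 1_000_000):
        chars = estimate_chars(words, "en")
        fits = tiers_that_fit(chars) or "none"
        print(f"{words:>9,} words ~ {chars:>9,} chars -> {fits}")
```

For a real app, replacing the word-based estimate with `len(text)` summed over a month of actual requests would give a more trustworthy number.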
The four decision axes
Every TTS choice is a trade-off across four axes. The whole report is organized so you can score candidates on each.
Axis 1 — Naturalness / quality
This is what "sounds real" actually means, decomposed:
Segmental quality — are individual sounds (phonemes) rendered correctly? For Czech this is where weak engines fail: wrong ř, flattened long vowels, anglicized consonants.
Prosody — stress, rhythm, intonation, sentence melody. Modern neural and LLM-based TTS win here; older concatenative/parametric voices sound "flat" precisely because their prosody is weak.
Expressiveness — emotion, emphasis, natural pauses. Matters for engaging long-form reading; less critical for short UI prompts.
The field's traditional yardstick is MOS (Mean Opinion Score); see the glossary. Treat published MOS numbers with caution: they are measured on different datasets, with different listeners, often in English only, and vendors tend to quote them selectively. A 4.x MOS in English tells you nothing about Czech. This report quotes MOS only where a published figure exists, and labels it as approximate.
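To make the caution concrete: a MOS is just the mean of listener ratings on the 1–5 scale, so a small listening panel leaves a wide margin of error. The sketch below computes a MOS together with an approximate 95% confidence interval, using a normal approximation. The function name `mos_with_ci` and the ratings in the demo are invented for illustration and are not real test data.

```python
import math
import statistics


def mos_with_ci(ratings: list[int], z: float = 1.96) -> tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mean = statistics.mean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * sem


if __name__ == "__main__":
    # Hypothetical ratings from ten listeners for one Czech clip (not real data).
    czech_ratings = [4, 5, 3, 4, 4, 5, 3, 4, 2, 4]
    mos, half_width = mos_with_ci(czech_ratings)
    print(f"MOS = {mos:.2f} +/- {half_width:.2f} (n={len(czech_ratings)})")
```

If you run your own informal Czech listening test, reporting the interval alongside the mean makes it harder to over-read a small difference between two engines.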
Axis 2 — Cost
Three cost models exist, and they behave very differently:
Per-character / per-second cloud pricing — e.g. ~$4–$30 per 1M characters depending on voice tier and vendor. Predictable, scales linearly with usage, but recurring forever. Free tiers (Google ~1M premium chars/month, Azure ~500K/month) can make this effectively free at modest volume.
Subscription credits — e.g. ElevenLabs' tiered monthly credit buckets. Best quality, but credits burn fast and the free tier is tiny (~10K chars/month, attribution-required, non-commercial).
Self-hosted / local = $0 marginal cost — you pay once in hardware and setup time, then synthesize unlimited characters for free. This is the structurally cheapest path if you have (or can rent) the compute, and it's why local engines dominate the "near-free" sweet spot.
Always compute your monthly character volume first. A free cloud tier you never exceed is the cheapest possible option — zero dollars, zero hardware, zero ops.
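The per-character model reduces to one line of arithmetic: only characters beyond the free tier are billed. The sketch below assumes the approximate figures this report quotes (Google premium: ~1M free characters, then ~$30 per 1M; Azure Neural: ~500K free, then ~$15 per 1M) and a simple linear overage. `CloudPlan` and `monthly_cost_usd` are illustrative names, and real vendor billing may differ in its details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloudPlan:
    name: str
    free_chars_per_month: int
    usd_per_million_chars: float


# Approximate figures quoted in this report; verify against live pricing pages.
GOOGLE_PREMIUM = CloudPlan("Google premium", 1_000_000, 30.0)
AZURE_NEURAL = CloudPlan("Azure Neural", 500_000, 15.0)


def monthly_cost_usd(plan: CloudPlan, chars_per_month: int) -> float:
    """Cost of one month: only characters beyond the free tier are billed."""
    billable = max(0, chars_per_month - plan.free_chars_per_month)
    return billable * plan.usd_per_million_chars / 1_000_000


if __name__ == "__main__":
    for volume in (300_000, 1_000_000, 6_000_000):
        for plan in (GOOGLE_PREMIUM, AZURE_NEURAL):
            cost = monthly_cost_usd(plan, volume)
            print(f"{plan.name:<15} {volume:>9,} chars -> ${cost:,.2f}/month")
```

Comparing that monthly figure with what a GPU box would cost you per month is the quickest way to see on which side of the crossover you sit.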
Axis 3 — Hosting / privacy
Four deployment shapes, each with different privacy and ops implications:
| Shape | Where it runs | Privacy | Ops burden | Marginal cost |
|---|---|---|---|---|
| Fully local / offline | On the user's device or your own machine | Best: text never leaves the device | Per-device install, but no server | $0 |
| Self-hosted server | Your Docker/VM/GPU box, behind your own API | Good: you control the data | You run & scale it | $0 + infra |
| Cloud API | Vendor's servers (Google/Azure/ElevenLabs…) | Text sent to a third party | Lowest: they operate it | Per-use (often free tier) |
| Browser (Web Speech API) | The visitor's browser, using OS voices | On-device, but uncontrollable quality | Almost none | $0 |
For privacy-sensitive text (personal notes, internal documents), local or self-hosted is the honest choice — cloud TTS means shipping your text to a vendor. For maximum reach with minimal effort, cloud or browser wins.
Axis 4 — Language coverage (incl. Czech, the hard filter)
The make-or-break axis for this project. Grade each candidate separately for English and Czech, because they almost never match. Questions to ask of every option:
Does it ship a dedicated Czech voice/model, or only "multilingual" coverage that may degrade on Czech?
Is Czech vendor-confirmed or only community-reported?
How many Czech voices, and male/female options?
Does Czech support SSML / pronunciation control so you can fix edge cases?
A quick verified read of the 2026 field on Czech specifically:
ElevenLabs — Czech via Multilingual v2 / v3 / Flash; widely considered the most natural Czech available, but premium-priced and a tiny free tier.
Google Cloud TTS — Czech (cs-CZ) across Standard, WaveNet, and the newer Chirp 3: HD voices; strong, with a usable free tier.
Azure Neural TTS — Czech cs-CZ-VlastaNeural (F) and cs-CZ-AntoninNeural (M); good neural quality, full SSML, ~500K free chars/month.
Coqui XTTS-v2 — Czech is one of its 17 explicitly supported languages, with voice cloning; best self-hosted Czech naturalness, but needs a GPU for real-time and carries a non-commercial license (CPML).
Piper — Czech cs_CZ-jirka (low/medium); fast, CPU-friendly, fully local, but Czech sounds noticeably less natural than the cloud/XTTS options.
Kokoro — superb quality-to-size in English, but Czech is not an officially supported language as of 2026 — so it's effectively English-only for this project.
Web Speech API — Czech depends on the user's OS voice (e.g. Microsoft Jakub on Windows, Zuzana on macOS/iOS); inconsistent and generally dated. Fallback only.
The headline: Czech eliminates several otherwise-excellent English engines. That single fact reshapes the recommendation.
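The "hard filter" idea can be expressed as a tiny data structure plus one function. The sketch below transcribes the Czech status of each engine from the list above into an enum and drops everything that does not speak Czech on its own. The class and function names are invented for this example, and the statuses only record whether Czech is present, not how good it sounds.

```python
from dataclasses import dataclass
from enum import Enum


class Czech(Enum):
    SUPPORTED = "supported"        # the engine ships or lists Czech itself
    OS_DEPENDENT = "os-dependent"  # Czech only if the user's OS has a voice
    UNSUPPORTED = "unsupported"    # no official Czech


@dataclass(frozen=True)
class Candidate:
    name: str
    czech: Czech


# Statuses transcribed from the list above; re-check them before relying on this.
CANDIDATES = (
    Candidate("ElevenLabs", Czech.SUPPORTED),
    Candidate("Google Cloud TTS", Czech.SUPPORTED),
    Candidate("Azure Neural TTS", Czech.SUPPORTED),
    Candidate("Coqui XTTS-v2", Czech.SUPPORTED),
    Candidate("Piper", Czech.SUPPORTED),
    Candidate("Kokoro", Czech.UNSUPPORTED),
    Candidate("Web Speech API", Czech.OS_DEPENDENT),
)


def czech_filter(candidates, allow_os_dependent: bool = False) -> list[str]:
    """Keep only candidates that can speak Czech; OS-dependent ones are opt-in."""
    accepted = {Czech.SUPPORTED}
    if allow_os_dependent:
        accepted.add(Czech.OS_DEPENDENT)
    return [c.name for c in candidates if c.czech in accepted]


if __name__ == "__main__":
    print(czech_filter(CANDIDATES))
    print(czech_filter(CANDIDATES, allow_os_dependent=True))
```

Keeping the OS-dependent case opt-in mirrors the report's stance that the browser is a fallback rather than a primary Czech engine.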
A quick map of the 2026 landscape
The field has split into a few clear families:
LLM-based / "neural codec" cloud voices — Google Chirp 3 HD, ElevenLabs v3, Azure HD. The most natural, expressive output; cloud-only; pay per use (with free tiers). The quality frontier.
Open multilingual neural models you can self-host — Coqui XTTS-v2 (the long-standing favorite; 17 languages incl. Czech; voice cloning), Fish Speech, and similar. Near-cloud quality if you have a GPU; $0 marginal cost; check licenses.
Lightweight local engines — Piper (VITS-based, CPU-friendly, Raspberry-Pi-capable, broad language list incl. Czech) and Kokoro (82M params, excellent English, limited language coverage). The "runs anywhere, free forever" tier.
Browser-native — the Web Speech API, free and zero-infra but quality is whatever the OS provides. The universal fallback.
Note one important historical fact: Coqui (the company) shut down in early 2024, but its open-source models — especially XTTS-v2 — remain widely used and downloadable. So "Coqui TTS" in 2026 means the open-source toolkit and models, not an active vendor.
Preview: the recommendation categories
The rest of the report drives toward four named recommendation buckets. You'll likely pick one as primary and one as fallback.
Free-local (the default for "free + private") — runs on your own hardware, $0 per character, text never leaves your control. Best for privacy and unlimited volume. Czech-capable picks here: Piper (CPU, fast, decent Czech) and XTTS-v2 (GPU, best local Czech naturalness, watch the license).
Balanced self-hosted (the engineer's sweet spot) — a containerized TTS server (Docker/API) you call from your app. More natural than browser, still $0 marginal cost, full privacy, scalable. Often XTTS-v2 or a Piper/Fish-Speech server behind a thin HTTP API.
Premium-cloud (best naturalness, minimal ops) — Google Chirp 3 HD, Azure Neural, or ElevenLabs. Best-in-class Czech and English with near-zero engineering, and possibly free at your volume thanks to free tiers. The pragmatic choice if your monthly characters fit a free tier and your text isn't privacy-sensitive.
Browser fallback (zero-cost, zero-infra) — Web Speech API. Use when you need something everywhere with no servers and no bills, and you can accept inconsistent Czech quality.
The headline thesis of the whole report: for the free-or-near-free + natural-Czech sweet spot, the contest is between a self-hosted open model (Piper for CPU, XTTS-v2 for GPU) and a premium cloud free tier (Google/Azure) — and which wins depends mostly on your monthly volume, your hardware, and how sensitive your text is.
A 60-second decision flow
Run your situation through this before diving into the option chapters:
START
│
├─ Is the text privacy-sensitive (must not leave your control)?
│    YES ──► Local/self-hosted ONLY.
│             ├─ GPU available?     ──► XTTS-v2.
│             └─ No GPU / CPU only? ──► Piper (cs_CZ-jirka).
│    NO  ──► continue
│
├─ Is your monthly volume comfortably under a cloud free tier
│  (≈ <1M premium chars/mo)?
│    YES ──► Premium cloud free tier.
│             ├─ Want the most natural Czech? ──► Google Chirp 3 HD.
│             └─ Need strong SSML control?    ──► Azure Neural.
│    NO  ──► continue
│
├─ High volume but want top quality and can pay a little?
│    YES ──► ElevenLabs (best Czech) or pay-as-you-go Google/Azure.
│    NO  ──► Self-host (XTTS-v2 on GPU, or Piper on CPU) for $0/char.
│
└─ Need a zero-infra universal fallback that works in any browser?
     ──► Web Speech API (accept inconsistent Czech).
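The same flow can be written as a small function, which is handy if you want to encode the decision in a config check or a test. This is a sketch under stated assumptions: the parameter names and return strings are invented here, the 1M-character threshold is the approximate free-tier size quoted in this chapter, and the final browser-fallback question is left out because it is an add-on to whichever primary you pick.

```python
def self_hosted_pick(has_gpu: bool) -> str:
    """Local branch of the flow: XTTS-v2 needs a GPU, Piper runs on CPU."""
    return "XTTS-v2" if has_gpu else "Piper (cs_CZ-jirka)"


def recommend(privacy_sensitive: bool, has_gpu: bool, chars_per_month: int,
              can_pay: bool, free_tier_chars: int = 1_000_000) -> str:
    """Walk the decision flow top to bottom and return the primary pick."""
    if privacy_sensitive:
        return self_hosted_pick(has_gpu)
    # Strict "<" as a stand-in for "comfortably under" the free tier.
    if chars_per_month < free_tier_chars:
        return "Premium cloud free tier (Google Chirp 3 HD or Azure Neural)"
    if can_pay:
        return "ElevenLabs or pay-as-you-go Google/Azure"
    return self_hosted_pick(has_gpu)


if __name__ == "__main__":
    print(recommend(privacy_sensitive=False, has_gpu=False,
                    chars_per_month=300_000, can_pay=False))
```

Note that the function says nothing about licensing: if it returns XTTS-v2 for a commercial product, the CPML non-commercial terms still need to be checked separately.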
What "calling" each shape looks like (orientation snippets)
You'll see full integration code in later chapters; these one-liners just orient you to the shape of each integration so the trade-offs feel concrete.
Browser (Web Speech API) — zero install, runs in the visitor's browser:
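const u = new SpeechSynthesisUtterance("Ahoj, toto je test.");
u.lang = "cs-CZ";
// getVoices() can return [] until the "voiceschanged" event has fired, and some
// platforms report the tag as "cs_CZ"; if no match, the default voice is used.
const czech = speechSynthesis.getVoices().find(v => v.lang.replace("_", "-") === "cs-CZ");
if (czech) u.voice = czech;
speechSynthesis.speak(u); // quality = whatever the OS provides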
Local CLI (Piper) — one binary + two model files, fully offline:
echo "Dobrý den, jak se máte?" | \
piper --model cs_CZ-jirka-medium.onnx --output_file out.wav
Self-hosted server (XTTS-v2 via a Python/Docker API) — call your own endpoint:
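# your app -> your TTS container; text never leaves your infra
# (the /tts path and JSON fields depend on the server wrapper you deploy)
import requests
resp = requests.post("http://localhost:8020/tts",
                     json={"text": "Ahoj světe", "language": "cs"}, timeout=60)
resp.raise_for_status()  # fail loudly instead of saving an error page as audio
with open("out.wav", "wb") as f:
    f.write(resp.content)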
Cloud API (Google/Azure/ElevenLabs) — vendor does the heavy lifting:
# pseudocode shape: authenticate -> send text + voice -> get audio bytes
# (tts_client and the voice id are placeholders; each vendor's SDK and voice names differ)
audio = tts_client.synthesize(text="Ahoj světe",
                              language="cs-CZ", voice="cs-CZ-Chirp3-HD")
The pattern to notice: browser = zero install, little control; local/self-hosted = your box, your data, your ops; cloud = their box, their bill, least effort. Your four-axis priorities pick the shape.
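Because every server-side shape boils down to "text in, audio bytes out", one way to keep your app independent of the choice is a thin wrapper that tries a primary backend and falls back to a second one. The sketch below is an assumption-laden illustration, not any vendor's API: `Synthesizer`, `synthesize_with_fallback`, and the two stand-in backends are invented names, and real backends would wrap Piper, an XTTS-v2 server, or a cloud SDK.

```python
from typing import Callable, Sequence, Tuple

# A backend is any callable that turns (text, language tag) into audio bytes.
Synthesizer = Callable[[str, str], bytes]


def synthesize_with_fallback(
    text: str, lang: str, backends: Sequence[Tuple[str, Synthesizer]]
) -> Tuple[str, bytes]:
    """Try each named backend in order; return (name, audio) from the first success."""
    errors = []
    for name, backend in backends:
        try:
            return name, backend(text, lang)
        except Exception as exc:  # broad on purpose: any failure triggers the next backend
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all TTS backends failed: " + "; ".join(errors))


# Stand-in backends for illustration only; they do not produce real audio.
def unreachable_cloud(text: str, lang: str) -> bytes:
    raise ConnectionError("cloud endpoint unreachable")


def fake_local(text: str, lang: str) -> bytes:
    return f"[{lang}] {text}".encode("utf-8")


if __name__ == "__main__":
    used, audio = synthesize_with_fallback(
        "Ahoj světe", "cs-CZ", [("cloud", unreachable_cloud), ("local", fake_local)]
    )
    print(used, len(audio), "bytes")
```

The browser shape does not fit this wrapper, since the Web Speech API runs on the visitor's side; there the fallback decision lives in your front-end code instead.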
How to read this report
If you want the answer fast — read this chapter's orientation table, then jump to the final recommendation chapter.
If you care most about naturalness — read the chapters on cloud LLM voices (Chirp 3 HD, ElevenLabs) and the XTTS-v2 deep-dive, and pay attention to the Czech-specific evaluation sections.
If you care most about cost/privacy — read the local-engine chapters (Piper, Kokoro, XTTS-v2 self-hosting) and the cost-modeling chapter.
If you're shipping to many users' browsers — read the Web Speech API chapter.
Every option chapter grades Czech explicitly and flags where support is vendor-confirmed vs. community-reported vs. absent. Trust those labels; do not assume "multilingual" means "good Czech."
Numbers are labeled. Where you see MOS, $/1M chars, VRAM, or latency, they're drawn from 2025–2026 sources and marked approximate when they are. Verify pricing against the live vendor page before you commit — cloud pricing and free tiers change.
Glossary of TTS terms
MOS (Mean Opinion Score) — the standard subjective quality metric: human listeners rate audio naturalness on a 1–5 scale; the mean is the MOS. ~4.0+ is "good," ~4.5+ approaches human-level in the test conditions. Crucial caveat: MOS is dataset- and listener-dependent and usually measured in English — a high English MOS says nothing about Czech.
Prosody — the rhythm, stress, and intonation of speech (the "melody"). The single biggest driver of whether TTS sounds human vs. robotic, especially over long passages.
Phoneme — the smallest unit of sound in a language. TTS often converts text → phonemes (via a grapheme-to-phoneme / G2P step) before generating audio. Czech G2P quality (especially ř, palatalized consonants, vowel length) is a key reason some engines sound off in Czech.
SSML (Speech Synthesis Markup Language) — an XML-based markup you wrap around text to control pronunciation, pauses, emphasis, rate, pitch, and to spell out numbers/dates/abbreviations. Supported well by Azure and Google; partially or not at all by lightweight local engines. Your escape hatch for fixing edge-case pronunciations.
Zero-shot voice cloning — generating a new speaker's voice from just a few seconds of reference audio, without retraining the model. XTTS-v2 and ElevenLabs offer this. Powerful, but raises consent/legal and licensing questions.
Streaming — emitting audio incrementally as it's generated, rather than waiting for the whole clip. Lowers perceived latency (you hear the first words in <200 ms in good setups), essential for interactive read-aloud.
Vocoder — the component that turns an intermediate acoustic representation (e.g. a mel-spectrogram) into an actual audio waveform. Modern neural vocoders (and end-to-end models that fold this step in, like VITS) are a big reason 2020s TTS sounds far better than 2010s TTS.
Real-time factor (RTF) — synthesis time ÷ audio duration. RTF < 1 means faster than real time (good for live reading); RTF > 1 means you wait longer than the clip lasts (batch-only). XTTS-v2 hits RTF ≈ 0.1–0.2 on a strong GPU but runs several times slower than real time on CPU (RTF well above 1).
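As a concrete picture of the SSML entry above, the sketch below wraps plain text in a minimal SSML document with a voice, a speaking rate, and an optional trailing pause. It uses only the Python standard library. The function name `build_ssml` is invented for this example, the default voice is the Azure Czech voice named earlier in this chapter, and which elements a given vendor actually honors is an assumption you should check in that vendor's documentation.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text: str, lang: str = "cs-CZ", voice: str = "cs-CZ-VlastaNeural",
               rate: str = "medium", pause_ms: int = 0) -> str:
    """Wrap plain text in a minimal SSML document (speak > voice > prosody)."""
    body = escape(text)  # & < > must be escaped or the XML is invalid
    if pause_ms > 0:
        body += f'<break time="{pause_ms}ms"/>'
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice)}>'
        f'<prosody rate={quoteattr(rate)}>{body}</prosody>'
        f'</voice></speak>'
    )


if __name__ == "__main__":
    ssml = build_ssml("Dobrý den, jak se máte?", rate="slow", pause_ms=300)
    ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
    print(ssml)
```

Escaping the input matters in practice: app-generated text routinely contains `&` or `<`, and an unescaped character makes the whole request fail rather than just one word.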
Key Takeaways
The task is engineer-grade: read arbitrary text via API/library/self-hosted engine, sounding natural in both English and Czech, for free or near-free.
Judge every option on four axes: naturalness, cost, hosting/privacy, and language coverage — and grade Czech separately from English, because they rarely match.
Czech is the hard filter. It eliminates several top English-only engines (e.g. Kokoro has no official Czech). Always check whether Czech is vendor-confirmed, not just "multilingual."
Remember ~1M characters ≈ 160K–200K words; use it to judge whether a free tier is generous (Google ~1M premium/mo, Azure ~500K/mo) or a toy (ElevenLabs free ~10K/mo).
The 2026 landscape splits into LLM cloud voices (most natural), self-hostable open models (XTTS-v2/Fish, near-cloud quality, $0 marginal cost), lightweight local (Piper/Kokoro), and browser-native (Web Speech API, fallback).
The report's four recommendation buckets: free-local, balanced self-hosted, premium-cloud, browser fallback — the free-or-near-free Czech sweet spot is a fight between self-hosted open models and a premium cloud free tier, decided by your volume, hardware, and privacy needs.
Orientation table
| Option | EN quality | CZ quality (graded) | Cost | Hosting | Best for |
|---|---|---|---|---|---|
| Piper | Good | Decent (cs_CZ-jirka), clearly below cloud | $0 (local) | Local / self-host, CPU-friendly | Free, private, runs anywhere incl. Pi |
| XTTS-v2 | Excellent | Best self-hosted Czech; voice-clone capable | $0 marginal (needs GPU) | Self-host (GPU) | Best free+private Czech if you have a GPU; CPML = non-commercial |
| Kokoro | Excellent | No official Czech (English-first) | $0 (local) | Local | English-only read-aloud; not for Czech |
| Google Chirp 3 HD | Excellent | Strong, vendor-confirmed cs-CZ | Free ~1M premium chars/mo, then ~$30/1M | Cloud API | Premium Czech with a usable free tier |
| Azure Neural | Excellent | Good (Vlasta/Antonin), full SSML | Free ~500K chars/mo, then ~$15/1M | Cloud API | Cloud Czech needing SSML control |
| ElevenLabs | Best-in-class | Best Czech naturalness, premium | Free ~10K chars/mo (non-commercial), paid from ~$5/mo | Cloud API | Top quality when budget allows |
| Web Speech API | OK–good (OS-dependent) | Inconsistent, OS-dependent, dated | $0 | Browser | Zero-infra universal fallback |
Quality grades are this report's synthesis of 2025–2026 sources; cloud pricing and free tiers change — verify on the live vendor page before committing. Czech grades are deliberately stricter than English grades.
Key References
Coqui-AI. TTS — GitHub (coqui-ai/TTS) & XTTS-v2 model card. 2024–2025. https://github.com/coqui-ai/TTS — Confirms XTTS-v2's 17 supported languages including Czech, voice cloning from ~6 s of audio, and the CPML license.
Hugging Face. coqui/XTTS-v2 model card. 2024. https://huggingface.co/coqui/XTTS-v2 — Languages list (Czech included), license terms, and usage notes for the self-hosted model.
Hexgrad. Kokoro-82M model card. 2025. https://huggingface.co/hexgrad/Kokoro-82M — 82M-parameter model details and the (English-focused) supported-language list, confirming no official Czech support.
ElevenLabs. Models, supported languages & pricing. 2025. https://elevenlabs.io/docs/models and https://elevenlabs.io/pricing — Multilingual v2 / v3 / Flash language coverage (Czech included) and the credit-based plan tiers incl. the limited free plan.
W3C. Speech Synthesis Markup Language (SSML) Version 1.1. 2010 (still the reference spec). https://www.w3.org/TR/speech-synthesis11/ — Authoritative definition of SSML used to control pronunciation, pauses, and emphasis.
ITU-T. Recommendation P.800: Methods for subjective determination of transmission quality (MOS). 1996. https://www.itu.int/rec/T-REC-P.800 — The standards basis for the Mean Opinion Score metric referenced throughout the report.