If you only needed English, this whole report would be three pages long: pick ElevenLabs or Kokoro or Piper, ship it, done. The reason this report exists is the second half of your requirement — Czech must also sound natural. That is a genuinely hard filter, and it is the single most common place where otherwise-excellent TTS systems fall off a cliff.
This chapter explains why Czech is harder than English for text-to-speech, then surveys every engine relevant to your project and grades its Czech support honestly — separating the systems that genuinely speak good Czech from the ones that technically "support" cs but produce something a native speaker will reject in the first sentence. It ends with a concrete, do-it-yourself listening-test protocol so you can judge Czech naturalness for your texts rather than trusting any vendor's marketing.
The recurring theme: for Czech, "supports the language" and "sounds good in the language" are two completely different claims, and almost every vendor conflates them. You have to test, every time.
Czech is a West Slavic language with properties that punish naive TTS systems in ways English never does. The difficulty is not one big problem but a stack of them.
Czech is highly fusional and inflected. Nouns, adjectives, pronouns and numerals decline through seven grammatical cases × singular/plural × (for adjectives) gender, and verbs conjugate extensively. A single English lemma like "city" maps to one or two written forms; the Czech "město" appears as město, města, městu, městě, městem, měst, městům, městech, městy… Each form is a distinct surface string the model must have seen (or generalized to) and must stress and pronounce correctly.
The practical consequence for TTS: the effective vocabulary the model must handle correctly is many times larger than English for the same semantic coverage, and data sparsity bites hard. A form that appears rarely in the training corpus may be mispronounced or mis-stressed. English models reach naturalness with far less data per "word" because English barely inflects.
Czech uses háček and čárka diacritics (for example č, ě, ř, š, ž and á, é, í, ú, ý), and they are not accents you can drop. Removing them changes pronunciation and meaning outright: byt (flat/apartment) vs být (to be); jeste is meaningless where ještě (still/yet) is intended. A model, or a text-normalization front-end, that silently strips or mangles diacritics produces wrong words, not just an accent.
This also makes input hygiene a Czech-specific risk in your pipeline: if your source texts have ever passed through a system that mangled UTF-8, lost diacritics, or used Windows-1250 vs UTF-8 inconsistently, the TTS will faithfully read the corruption.
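A minimal sketch of an input-hygiene check for this risk, using only the Python standard library. The function name `check_czech_text`, the suspect-character set and the 20-letter threshold are illustrative assumptions, not part of any engine; the heuristics will produce false positives on text that legitimately contains characters like Ä.

```python
import unicodedata

# Heuristic: these characters are typical of UTF-8 bytes mis-decoded as
# Latin-1/Windows-1252 (e.g. "Å™" where "ř" was meant), plus U+FFFD.
SUSPECT_CHARS = set("ÃÅÄ\ufffd")

CZECH_DIACRITICS = set("áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ")


def check_czech_text(text, min_letters=20):
    """Return (normalized_text, warnings) for a piece of Czech input.

    min_letters is an arbitrary cutoff (assumption): shorter texts may
    legitimately contain no diacritics at all.
    """
    # NFC folds "c" + combining caron into the single code point "č".
    normalized = unicodedata.normalize("NFC", text)
    warnings = []
    if any(ch in SUSPECT_CHARS for ch in normalized):
        warnings.append("possible mojibake")
    letters = [ch for ch in normalized if ch.isalpha()]
    if len(letters) >= min_letters and not any(
        ch in CZECH_DIACRITICS for ch in letters
    ):
        warnings.append("no diacritics found")
    return normalized, warnings
```

Running this over each source text before synthesis would flag files worth inspecting by hand; it does not attempt to repair anything.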
Czech has the notorious ř (a raised alveolar trill-fricative) found in essentially no other language, plus tongue-twisting consonant clusters and even syllabic consonants — the canonical example strč prst skrz krk ("stick a finger through the throat") contains no vowels at all. A formant synthesizer or a multilingual model that learned its phoneme inventory mostly from vowel-rich languages will approximate ř as a plain r or ž, which natives notice instantly. Clusters like čtvrt, zmrzlina, scvrnkl stress the model's duration and coarticulation modeling.
Czech word stress is fixed on the first syllable of each phonological word (with clitics leaning on their host). That sounds easy — and it means a model that puts stress elsewhere sounds immediately, glaringly foreign. Sentence-level prosody (the rise/fall over phrases, question intonation, the handling of the clitic "second-position" rules) is where even good neural Czech voices reveal themselves: declaratives are usually fine, but yes/no questions, long subordinate clauses, and lists are where naturalness frays. Academic Czech TTS spent decades specifically on prosody modeling for exactly this reason (e.g. prosody-database and generative-prosody work at Czech universities).
This is the most underrated Czech TTS problem and it matters enormously for reading arbitrary texts. In English, "5 dogs" → "five dogs"; the number word is invariant. In Czech the number agrees in case and triggers case agreement on the noun: "5 psů", "2 psi", "o 5 hodinách", "ve 3 dny", and ordinals decline too ("3." → třetí/třetího/třetímu…). Currency, units, dates ("12. 5. 2026"), Roman numerals (Karel IV. → Karel Čtvrtý), and abbreviations (atd., apod., tj., č., s. r. o.) all require a Czech-aware text normalization / front-end. Most lightweight engines have weak or no Czech normalization, so they spell out or mis-decline numbers. For your use case (reading your own texts), test number- and date-heavy sentences specifically.
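To make the agreement problem concrete, here is a deliberately tiny sketch of count-noun agreement in a nominative context. It is an illustration under narrow assumptions (masculine nouns, counts 0 to 10, nominative only); the names `CARDINALS_M`, `noun_form` and `expand_count` are hypothetical, and a real front-end would also need other genders, all seven cases, larger numbers, ordinals and dates.

```python
# Cardinal words for 0-10 in the masculine nominative (illustrative subset).
CARDINALS_M = [
    "nula", "jeden", "dva", "tři", "čtyři", "pět",
    "šest", "sedm", "osm", "devět", "deset",
]


def noun_form(n, one, few, many):
    """Pick the noun form that agrees with count n in a nominative context.

    one  -> nominative singular (n == 1)
    few  -> nominative plural   (n in 2..4)
    many -> genitive plural     (0 and 5 upward)
    """
    if n == 1:
        return one
    if 2 <= n <= 4:
        return few
    return many


def expand_count(n, one, few, many):
    """Spell out 'n <noun>' for 0 <= n <= 10, masculine nouns only."""
    if not 0 <= n <= 10:
        raise ValueError("this sketch only covers 0..10")
    return f"{CARDINALS_M[n]} {noun_form(n, one, few, many)}"


print(expand_count(2, "pes", "psi", "psů"))  # dva psi
print(expand_count(5, "pes", "psi", "psů"))  # pět psů
```

Even this toy version needs three noun forms per word, which is why engines without a Czech-aware front-end tend to get numbers wrong.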
Czech has ~10–11 million speakers. The open multilingual speech corpora that exist for it — Mozilla Common Voice (cs), VoxPopuli (cs), CML-TTS (cs) — are far smaller and noisier than their English counterparts, and clean single-speaker studio TTS corpora (the kind that make a voice sound premium) are scarce and often proprietary. Commercial vendors prioritize large markets, so Czech voices ship later, get fewer voice options, and are less likely to receive the newest model architecture. This is structural and it is why you will repeatedly see "Czech: 1 male + 1 female voice, older tier" while English gets dozens of state-of-the-art voices.
Below, each engine is graded explicitly on Czech. English-good ≠ Czech-good, so I grade Czech separately and bluntly. Grades are qualitative (this report cannot run a formal MOS panel) and should be confirmed with the listening test at the end.
| Engine | Czech available? | Czech quality (honest) | Notes |
|---|---|---|---|
| Piper (cs_CZ-jirka) | Yes: jirka low + medium | Good for the class; clearly the best free/local Czech option for most projects | Fast, CPU-only, tiny. One male voice. Robotic edges on questions/long sentences; weak built-in number normalization |
| Piper community fine-tunes | Yes (e.g. Thomcles/Piper-TTS-Czech, SHrubie/piper-cs) | Variable — sometimes better, sometimes worse than jirka | Worth A/B-testing for your texts; quality depends on the fine-tune dataset |
| XTTS-v2 (Coqui) | Yes — cs is one of 17 langs | Mixed/uneven — can sound natural in calm sentences, but prosody and ř drift, and quality depends heavily on the reference clip | Voice-cloning model; needs GPU for real-time; project unmaintained since Coqui shut down |
| eSpeak-NG (cs) | Yes | Robotic by design: intelligible, not natural | Formant synth. Real value is as a G2P/phonemizer front-end for other models, not as final audio |
| Kokoro-82M | Not in core; some community/third-party Czech claimed | Unverified/weak | Core Kokoro targets EN/FR/JA/ZH/ES/HI/IT/PT/KO. Czech appears only in some third-party integrations — treat as unproven; do not plan around it for production Czech |
| Coqui other models / Festival / Epos / ARTIC | Yes (Czech roots) | Dated or research-grade | Historically important Czech systems; not the natural-2026-sounding output you want for production |
Piper cs_CZ-jirka — the realistic free/local champion. Piper is a fast, local, CPU-friendly neural TTS (VITS-based) from the Rhasspy project. Its official Czech voice is jirka (a single male voice) in low and medium quality tiers, distributed as ONNX in the rhasspy/piper-voices repo. For a fully-offline, zero-cost Czech reader it is the most sensible default: it is comprehensible, the Czech is broadly correct, stress is generally on the first syllable, and it runs anywhere. The honest caveats: it is one male voice, the medium tier still has audible synthetic artifacts on questions and long clauses, and Piper does not ship strong Czech number/date/abbreviation normalization — you will likely need to pre-expand numbers yourself (see Chapter on the build). I could not verify the exact source corpus for jirka from primary sources; treat its dataset provenance as unconfirmed and judge it by ear.
```bash
# Try Czech Piper locally in ~2 minutes
pip install piper-tts
# download the medium Czech voice (onnx + json) from rhasspy/piper-voices
# then:
echo "Ahoj, toto je test české syntézy řeči. Strč prst skrz krk." \
  | piper --model cs_CZ-jirka-medium.onnx --output_file test_cs.wav
```
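For more than one sentence, the same CLI call can be wrapped in a small script. This is a sketch that assumes the `piper` executable is on your PATH and accepts the `--model` and `--output_file` flags exactly as in the command above; the helper names and the output file prefix are hypothetical.

```python
import subprocess


def piper_command(model_path, wav_path):
    """Build the same CLI invocation as the shell example."""
    return ["piper", "--model", model_path, "--output_file", wav_path]


def synthesize(text, model_path, wav_path):
    """Pipe one sentence to Piper on stdin; raises if Piper exits non-zero."""
    subprocess.run(
        piper_command(model_path, wav_path),
        input=text.encode("utf-8"),
        check=True,
    )


def synthesize_all(sentences, model_path, prefix="piper_cs"):
    """Write one WAV per sentence and return the list of file paths."""
    paths = []
    for number, sentence in enumerate(sentences, start=1):
        wav_path = f"{prefix}_{number:02d}.wav"
        synthesize(sentence, model_path, wav_path)
        paths.append(wav_path)
    return paths
```

Calling `synthesize_all(["Ahoj.", "Strč prst skrz krk."], "cs_CZ-jirka-medium.onnx")` would produce `piper_cs_01.wav` and `piper_cs_02.wav`, provided the model file and its JSON config are in the working directory.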
XTTS-v2 lists Czech among its 17 languages and supports voice cloning from a few seconds of reference audio. In practice its Czech is uneven: marketing claims of "native-speaker quality" across all 17 languages are not credible for Czech specifically. Expect decent calm narration but occasional foreign-accented vowels, an imperfect ř, and prosody that wanders on questions and lists. It also needs a GPU for anything near real-time, and Coqui as a company is defunct, so the project is community-maintained. It is the best open voice-cloning route for Czech, but it is not a set-and-forget premium voice.
eSpeak-NG supports Czech and is genuinely useful — just not as your output voice. Its formant synthesis is robotic. Its real role in 2026 is as a grapheme-to-phoneme front-end that many neural systems (including Piper) lean on, and as an accessibility/offline fallback where clarity beats naturalness.
| Provider | Czech voices | Czech quality (honest) | Pricing posture |
|---|---|---|---|
| Azure AI Speech | cs-CZ-VlastaNeural (F), cs-CZ-AntoninNeural (M) | Best-in-class neural Czech for most projects — natural, well-normalized | Generous free tier monthly, then per-character; see Chapter on cloud |
| Google Cloud TTS | cs-CZ Standard / WaveNet / Neural2 + Chirp 3 HD (confirmed) | WaveNet/Neural2 Czech is good; Chirp 3 HD Czech is available and is the newest tier | Free tier (esp. on WaveNet/Standard), then per-character |
| Amazon Polly | Czech support is limited/late (verify current status) | Historically weak/absent for premium Czech | Pay per character |
| ElevenLabs | Czech via Multilingual v2 / v3 | Sometimes very good, but variable — best naturalness ceiling, worst price for high volume | Credit-based; not "near-free" at volume |
Azure ships two neural Czech voices — Vlasta (female) and Antonin (male) — and these are, in my honest assessment, the most reliably natural Czech voices among the mainstream clouds, with the best built-in Czech text normalization (numbers, dates, abbreviations). If you want "premium Czech with the least effort" and a free tier can cover your volume, Azure is the safe pick.
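When auditioning the two Azure voices, the request body is typically an SSML document that names the voice. The sketch below only builds that string; the helper name `azure_ssml` is hypothetical, and sending the result to the service (credentials, region, SDK or REST call) is left out and should follow Azure's current documentation.

```python
from xml.sax.saxutils import escape


def azure_ssml(text, voice="cs-CZ-VlastaNeural", lang="cs-CZ"):
    """Wrap plain text in a minimal SSML document for one named voice."""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">'
        # escape() protects &, < and > so arbitrary text stays valid XML
        f'<voice name="{voice}">{escape(text)}</voice>'
        "</speak>"
    )


print(azure_ssml("Schůzka je 12. 5. 2026 ve 14:30.", voice="cs-CZ-AntoninNeural"))
```

Keeping the voice name as a parameter makes it easy to render the same sentence with both Vlasta and Antonin for a side-by-side comparison.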
Google Cloud offers Czech in Standard, WaveNet and Neural2 tiers — and, importantly, in the newest Chirp 3: HD (LLM-based) generation. Google's documentation confirms that Chirp 3 HD supports cs-CZ, so the mythology-named voices (Aoede, Kore, Leda, Zephyr [female]; Puck, Charon, Fenrir, Orus [male]) are usable in Czech via voice names like cs-CZ-Chirp3-HD-Aoede. Two Czech-specific caveats per Google's docs: pause control and custom pronunciations are NOT available for cs-CZ Chirp 3 HD yet, so you cannot hand-fix a mispronounced word the way you can in some other locales. Older WaveNet (e.g. cs-CZ-Wavenet-B) and Neural2 Czech remain available and are good for most reading tasks; Chirp 3 HD is the most natural Google tier for Czech but should still be auditioned with the torture-test set below — "newest" does not guarantee "best for your text," especially given the missing pronunciation override.
Amazon Polly's Czech support has historically lagged — for a long time Polly had no Czech at all, and its premium/neural Czech (if present) is not its strength. Verify current Polly Czech status directly before considering it; do not assume parity with English.
ElevenLabs has the highest naturalness ceiling of anything here and supports Czech via Multilingual v2 (29 langs) and the newer v3 (70+ langs). When it's good, it's the most human-sounding Czech you can get. But: (1) quality is variable across voices and you must audition specifically in Czech, including ř and questions; and (2) the credit/pricing model makes it the least "near-free" option at any real volume, which collides with your stated free/near-free priority. It's the "premium when nothing else is good enough" choice, not the default.
These exist and matter for context, but most are not the natural-sounding, easy-to-integrate 2026 output you want:
- ARTIC (kky.zcu.cz): a long-running corpus-based Czech TTS research system with both male and female Czech voices (and Slovak/German), the strongest Czech-native academic lineage, with decades of work on Czech prosody and unit selection. Not a drop-in free API.
- Epos (epos.ufe.cz): historically significant open rule-based Czech synthesizer; dated output.
- SpeechTech (speechtech.cz, Czech-native, active since 2000) ships 6 Czech voices (Alena, Iva, Jan, Jiří, Radka, Stanislav) plus Slovak; ReadSpeaker offers lifelike Czech voices in its portfolio; Phonexia and Newton Technologies are more STT-focused. These deliver genuinely native, high-quality Czech but are paid/commercial, not "near-free."

The takeaway: the strongest Czech-native engineering knowledge lives in Czech academia and a few Czech companies, but for a free/near-free, easy-to-integrate, natural-2026 voice, your realistic shortlist is Piper (local/free) → Azure (cloud, free-tier) → ElevenLabs (premium fallback), with XTTS-v2 as the open voice-cloning wildcard.
Never trust a Czech grade — including this chapter's — without your own ears (ideally a native Czech speaker's ears). Here is a tight protocol.
Run every candidate engine through the same sentences and compare. These are chosen to expose the specific failure modes above:
```text
# 1. ř + consonant clusters
Řeřicha přerůstá přes zahradní zeď a tře se o křoví.

# 2. The vowel-less tongue-twister
Strč prst skrz krk.

# 3. Numbers that must decline (cardinals)
V roce 2026 přišlo 1 234 lidí, ale jen 2 zůstali a 5 odešlo.

# 4. Dates, ordinals, time
Schůzka je 12. 5. 2026 ve 14:30, druhý den po 3. výročí.

# 5. Abbreviations
Např. firma s. r. o. dodala zboží (tj. 5 ks) atd., apod.

# 6. Roman numerals / names
Karel IV. vládl ve 14. století; viz s. 42, odst. 3.

# 7. Question intonation (yes/no + wh-)
Půjdeš zítra do kina? A kdo s tebou půjde?

# 8. Long subordinate clause (prosody/phrasing)
Když jsem přišel domů, zjistil jsem, že okno, které jsem ráno zavřel, je dokořán a po podlaze leží rozbité sklo.

# 9. Diacritics-as-meaning minimal pairs
Byt byl prázdný, protože tu nikdo nechtěl být.

# 10. Foreign words / code-switching (common in real texts)
Stáhni si XTTS-v2 z Hugging Face a spusť to v Dockeru.
```
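Because every engine reads the same sentences, it helps to store the outputs under neutral names so the listener cannot tell which engine produced which file. The following is a sketch of one way to do that; the function names and the file-name pattern are assumptions for illustration.

```python
import random
import string


def blind_labels(engines, seed=None):
    """Map each engine name to a neutral label such as 'voice_a'.

    The order is shuffled so labels do not reveal the engine; pass a seed
    to make the assignment reproducible.
    """
    if len(engines) > len(string.ascii_lowercase):
        raise ValueError("too many engines for single-letter labels")
    shuffled = list(engines)
    random.Random(seed).shuffle(shuffled)
    return {
        engine: f"voice_{letter}"
        for engine, letter in zip(shuffled, string.ascii_lowercase)
    }


def blinded_filename(label, sentence_no):
    """File name for one (blinded engine, sentence) pair."""
    return f"{label}_s{sentence_no:02d}.wav"
```

Keep the returned mapping in a file the listener does not open until scoring is finished.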
Score each engine 1–5 on each axis; a native speaker should do this blind if possible:
| Axis | Question to ask |
|---|---|
| Phoneme correctness | Is ř a real ř? Are š/ž/č/ě right? Any English vowels leaking in? |
| Stress | First-syllable stress on every word? Clitics attached correctly? |
| Number/date handling | Are numbers spoken and declined correctly, or spelled/mis-cased? |
| Question intonation | Do yes/no questions actually rise naturally? |
| Phrasing / pauses | Are commas and clause boundaries respected without robotic chop? |
| Overall naturalness | Would a Czech listener think "human" or "robot/foreigner"? |
| Glitches | Dropped words, repeats, mangled diacritics, artifacts? |
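Once the scores are collected, a few lines of Python can average them per engine and per axis. This is a sketch under stated assumptions: the short axis keys are hypothetical stand-ins for the rows of the table above, and every axis, including glitches, is scored so that 5 is best.

```python
from statistics import mean

# Hypothetical short keys for the seven rubric axes (assumption).
AXES = [
    "phonemes", "stress", "numbers", "questions",
    "phrasing", "naturalness", "glitches",
]


def summarize(ratings):
    """Average 1-5 ratings per engine label and axis.

    ratings: iterable of (label, axis, score) tuples.
    Returns {label: {axis: mean, ..., "overall": mean of the axis means}}.
    """
    collected = {}
    for label, axis, score in ratings:
        if axis not in AXES:
            raise ValueError(f"unknown axis: {axis}")
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        collected.setdefault(label, {}).setdefault(axis, []).append(score)

    summary = {}
    for label, axes in collected.items():
        axis_means = {axis: mean(scores) for axis, scores in axes.items()}
        axis_means["overall"] = mean(list(axis_means.values()))
        summary[label] = axis_means
    return summary
```

Averaging the axis means, rather than all raw scores, keeps an axis that happened to receive more ratings from dominating the overall figure.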
- Rename the outputs to neutral names such as voice_a.wav, voice_b.wav so brand bias doesn't color judgment.
- Pre-expand numbers yourself before synthesis (a num2words-style step); doing so dramatically improves every engine and partly equalizes the cheap ones with the premium ones.

- Piper cs_CZ-jirka (medium): comprehensible, fast, offline, one male voice, weak built-in number normalization. The realistic default for your near-free goal.
- Azure (Vlasta, Antonin): reliably natural, with good normalization and a usable free tier. Google is a strong rival: WaveNet/Neural2 Czech is good, and Chirp 3 HD, now confirmed for cs-CZ, is the newest/most natural Google tier (caveat: no pause control or custom pronunciation for cs-CZ yet). Audition both on your texts.

- Piper voices: cs_CZ-jirka (low, medium tiers).
- XTTS-v2: Czech (cs) among the 17 supported languages and the voice-cloning design.
- Azure AI Speech: cs-CZ-VlastaNeural and cs-CZ-AntoninNeural neural Czech voices.
- Google Cloud TTS: Czech (cs-CZ) Standard/WaveNet/Neural2 tiers and full locale list.
- eSpeak-NG: Czech (cs) support and formant-synthesis/G2P role.