This chapter is the practical, Czech-focused companion to the general "local models" survey. The reader's hard filter is that Czech must sound natural — and this is where most of the celebrated 2025–2026 local TTS models fall away. Kokoro, MeloTTS, Dia, Sesame CSM, Chatterbox: all impressive, almost all English-first, none support Czech. So the field of genuinely usable, free, local Czech TTS narrows to a short list, and the quality ranges from "robotic but works" to "good with effort."
We cover, in order of practical usefulness for Czech:
1. Piper (cs_CZ-jirka): the lightweight, CPU-friendly default; one real Czech voice, two quality tiers, with known pronunciation quirks.
2. XTTS-v2 (coqui-tts fork): the best natural local Czech option, which gets there through voice cloning from a Czech sample; needs a GPU.
3. MMS-TTS (facebook/mms-tts-ces): a VITS single-speaker Czech model, trivially installable via transformers, mediocre but serviceable.
4. eSpeak-NG (cs): the robotic formant baseline; almost never the right answer for "natural," but it still matters as the phonemizer behind neural engines such as Piper.

For each: how to install, how to get and point at the Czech voice, copy-pastable code, an honest Czech-specific quality grade, and hardware needs. We end with a concrete recommended local Czech setup.
Honesty up front: there is no free, local, English-and-Czech, effortless, ElevenLabs-grade option in 2026. The realistic sweet spot for natural Czech locally is XTTS-v2 with a cloned Czech reference voice on a GPU, with Piper cs_CZ-jirka-medium as the lightweight CPU fallback. Everything else is a compromise on naturalness, on Czech specifically, or on the "local" constraint.
Czech has features that punish under-trained TTS models:
- The ř phoneme (a raised alveolar trill): unique to Czech, frequently mangled.

The upshot: a model's "supports Czech" claim is necessary but not sufficient. The amount of Czech training data and the quality of the text front-end matter more than the headline language count. This is why we grade Czech explicitly below rather than trusting the language list.
What it is. Piper is a fast, fully local, CPU-first neural TTS based on a VITS-style architecture, originally from the Rhasspy/Home Assistant world. It is the single easiest "real neural voice, runs anywhere, no GPU" option, and it has a genuine Czech voice.
Maintenance status (2026). The project moved from rhasspy/piper to OHF-Voice/piper1-gpl (the "piper1" rewrite, GPL-licensed, cleaner Python API). The pip package is still piper-tts. The huge voice library lives on Hugging Face at rhasspy/piper-voices and the older .onnx + .onnx.json voice files remain compatible with the new runtime.
There is essentially one Czech voice family in the official library, cs_CZ-jirka:
| Voice | Quality tier | Sample rate | File size (approx) | Notes |
|---|---|---|---|---|
| cs_CZ-jirka-low | low | 16 kHz | ~20 MB | Fastest, noticeably rougher |
| cs_CZ-jirka-medium | medium | 22.05 kHz | ~60 MB | Use this one: best stock Czech |
Voice files live under rhasspy/piper-voices at the path cs/cs_CZ/jirka/medium/ (and /low/). Each voice is two files you must keep together:
- cs_CZ-jirka-medium.onnx (the model)
- cs_CZ-jirka-medium.onnx.json (the config / phoneme map)

```bash
python -m venv .venv && source .venv/bin/activate
pip install piper-tts

# Download the Czech voice straight from Hugging Face
python -m piper.download_voices cs_CZ-jirka-medium
# (older CLI: python -m piper.download cs_CZ-jirka-medium)
```
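Because the runtime needs both files side by side, a small pre-flight check can save a confusing load error. The sketch below is not part of Piper: both function names are hypothetical, and the repository layout is inferred from the cs/cs_CZ/jirka/medium/ path mentioned above, assuming voice names of the form locale-speaker-quality.

```python
from pathlib import Path


def repo_paths(voice: str) -> list[str]:
    """Derive the two repo-relative file paths for a voice name.

    Assumes names look like '<lang>_<REGION>-<speaker>-<quality>',
    for example 'cs_CZ-jirka-medium'.
    """
    locale, speaker, quality = voice.split("-")
    lang = locale.split("_")[0]
    base = f"{lang}/{locale}/{speaker}/{quality}/{voice}.onnx"
    return [base, base + ".json"]


def missing_voice_files(model_path: str) -> list[str]:
    """Return whichever files of the .onnx / .onnx.json pair are missing."""
    model = Path(model_path)
    config = Path(str(model) + ".json")
    return [str(p) for p in (model, config) if not p.is_file()]


if __name__ == "__main__":
    print(repo_paths("cs_CZ-jirka-medium"))
    print(missing_voice_files("cs_CZ-jirka-medium.onnx"))
```

An empty list from missing_voice_files means the pair is complete; anything else names the file to fetch again.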
```python
# piper_cs.py: read Czech text to a WAV file, CPU only
import wave

from piper import PiperVoice

# Load the model; the matching .onnx.json config is picked up automatically
voice = PiperVoice.load("cs_CZ-jirka-medium.onnx")

text = "Dobrý den. Toto je ukázka českého hlasu generovaného lokálně na vašem počítači."

# Write the synthesized audio into a standard WAV container
with wave.open("out_cs.wav", "wb") as wav:
    voice.synthesize_wav(text, wav)

print("Wrote out_cs.wav")
```
Or straight from the shell (works for English too — just swap the voice):
```bash
echo "Dnes je krásný den a chci si přečíst svůj text nahlas." \
  | piper -m cs_CZ-jirka-medium.onnx -f out.wav
```
For English in the same project, grab a strong English voice like en_US-lessac-medium or en_US-amy-medium the same way — Piper handles both languages cleanly through the same API, which makes it attractive as a single dependency for EN+CS.
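Since both languages go through the same call, routing by language can be a plain lookup. This is a minimal sketch, assuming the voice files sit in a local voices/ directory (a hypothetical layout). It reuses only the PiperVoice.load and synthesize_wav calls shown above; the VOICES table and both function names are illustrative.

```python
import wave
from pathlib import Path

# Hypothetical layout: voice files downloaded into ./voices
VOICE_DIR = Path("voices")
VOICES = {
    "cs": "cs_CZ-jirka-medium.onnx",
    "en": "en_US-lessac-medium.onnx",
}


def voice_path(lang: str) -> Path:
    """Map a language code to its voice model file."""
    try:
        return VOICE_DIR / VOICES[lang]
    except KeyError:
        raise ValueError(f"no voice configured for language {lang!r}") from None


def speak(text: str, lang: str, out_path: str) -> None:
    """Synthesize text with the voice configured for lang and write a WAV file."""
    # Imported lazily so the lookup above works even where Piper is not installed
    from piper import PiperVoice

    voice = PiperVoice.load(str(voice_path(lang)))
    with wave.open(out_path, "wb") as wav:
        voice.synthesize_wav(text, wav)
```

This version reloads the model on every call, which keeps the sketch short; a long-running app would likely keep one loaded voice per language.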
Honest assessment for Czech specifically:
- A community project, sptsfn/piper-czech-tts, exists specifically because the stock cs_CZ-jirka voice mispronounces words. It adds pronunciation fixes and can read PDFs directly. The existence of this project is itself the honest signal: the base Czech voice has real, recurring G2P errors (foreign words, numbers, some clusters).

Hardware: trivial. Runs faster than real time on a modern laptop CPU; no GPU, no Python-heavy ML stack. RAM footprint is small (well under 1 GB). This is the reason to keep Piper around even if you choose XTTS for quality: it's the dependable, cheap, offline fallback that ships anywhere (including ARM/Raspberry Pi).
When to pick Piper for Czech: you need offline, CPU-only, low-latency, or embedded; you can tolerate "clearly a robot but a decent robot"; or you want one tiny library that does both EN and CS.
What it is. XTTS-v2 is Coqui's multilingual, voice-cloning TTS. It officially lists 17 languages including Czech (cs), and it clones a target voice from a short reference clip (~6+ seconds). For local Czech, this is the highest naturalness you can realistically get for free — but you almost always need to give it a Czech reference voice to make it sound genuinely Czech rather than "an accented multilingual model reading Czech."
Status / how to install (2026). Coqui (the company) shut down in early 2024, but the model is open and the code lives on through the idiap/coqui-ai-TTS fork. Install the maintained package — note it's coqui-tts, not the old TTS:
```bash
pip install coqui-tts   # the maintained idiap fork
# model auto-downloads on first use (~1.8 GB) to your HF cache
```

```python
# xtts_cs.py: natural Czech via voice cloning
import torch
from TTS.api import TTS  # package is coqui-tts, import path is still TTS

device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

tts.tts_to_file(
    text="Dobrý den, vítejte. Tohle je text předčítaný přirozeným českým hlasem.",
    file_path="out_cs_xtts.wav",
    speaker_wav="czech_reference.wav",  # 6–20 s of a CZECH speaker
    language="cs",                      # <-- Czech
)
```
The single biggest lever for natural Czech is the reference clip. Do this:
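- Use a Czech reference speaker. With a non-Czech reference and language="cs", you get Czech words with a foreign mouth, which sounds unnatural. A Czech reference voice anchors the phonetics and prosody to Czech.
- Pick a clip that covers Czech-specific sounds (ř, long vowels, clusters) so the model has them to imitate.
- Compute the speaker latents once and reuse them (see the xtts.get_conditioning_latents() API in the model docs): a big speed win for a reading app.
- With a good Czech reference, ř and clusters are usually handled; it reads paragraphs convincingly.

Hardware: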
| Mode | VRAM / RAM | Speed |
|---|---|---|
| GPU inference (recommended) | ~4 GB min, 6–8 GB comfortable (RTX 3060+) | near real-time or faster; streaming supported |
| CPU inference | 8 GB+ RAM | works but slower than real-time — fine for batch "render the article to MP3," painful for interactive |
| Fine-tuning | 12 GB+ VRAM | hours, plus a dataset |
When to pick XTTS for Czech: you want the most natural local Czech, you have a GPU (or can tolerate batch CPU rendering), and you can supply a clean Czech reference clip. This is the recommended quality option for the reader's goal.
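The reference-clip advice can be partly checked before any GPU time is spent. Here is a minimal sketch using only the standard library. The 6–20 s window comes from the comment in the example above, the function name is hypothetical, and only uncompressed WAV files are handled.

```python
import wave


def check_reference_clip(path: str, min_s: float = 6.0, max_s: float = 20.0) -> list[str]:
    """Return a list of problems found with a WAV reference clip (empty if none)."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()

    problems = []
    if duration < min_s:
        problems.append(f"too short: {duration:.1f} s (want at least {min_s:.0f} s)")
    if duration > max_s:
        problems.append(f"too long: {duration:.1f} s (want at most {max_s:.0f} s)")
    return problems


if __name__ == "__main__":
    for problem in check_reference_clip("czech_reference.wav"):
        print("czech_reference.wav:", problem)
```

Duration is the only property checked here. Whether the speaker is native and whether the clip covers ř and long vowels still needs a human ear.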
What it is. Part of Meta's Massively Multilingual Speech project (1,100+ languages). The Czech model is facebook/mms-tts-ces (ces = ISO 639-3 for Czech), a VITS single-speaker model you can run directly through Hugging Face transformers with no special TTS framework.
```bash
pip install transformers torch scipy
```

```python
# mms_cs.py: Czech TTS via Hugging Face transformers
import scipy.io.wavfile
import torch
from transformers import AutoTokenizer, VitsModel

model = VitsModel.from_pretrained("facebook/mms-tts-ces")
tok = AutoTokenizer.from_pretrained("facebook/mms-tts-ces")

inputs = tok("Toto je český hlas z modelu Meta MMS.", return_tensors="pt")

# Inference only, so skip gradient tracking
with torch.no_grad():
    waveform = model(**inputs).waveform

scipy.io.wavfile.write(
    "out_mms_cs.wav",
    rate=model.config.sampling_rate,
    data=waveform.squeeze().cpu().numpy(),
)
```
- Strengths: easy install (pure transformers), no separate runtime, decent intelligibility, runs on CPU.
- Weaknesses: a notch below Piper cs_CZ-jirka-medium on naturalness, but it's a useful sanity-check baseline and a fully transformers-native option if you're already in that ecosystem.

Hardware: light. CPU is fine; a GPU just speeds it up. Sub-1 GB model.
When to pick MMS: you're already on transformers, want a no-framework Czech baseline, or need a quick A/B against Piper.
What it is. eSpeak-NG is a tiny formant-synthesis engine supporting 100+ languages including Czech (cs). It is not neural. It sounds unmistakably robotic — a classic 1990s computer voice — but it is instant, microscopic, and runs literally anywhere.
```bash
# Debian/Ubuntu
sudo apt-get install espeak-ng
# macOS
brew install espeak-ng

espeak-ng -v cs "Toto je robotický český hlas z programu eSpeak NG." -w out_espeak_cs.wav
```
Hardware: negligible. Runs on a microcontroller-class device.
When to pick eSpeak for Czech: accessibility fallback, extreme resource constraints, or as a phonemizer dependency — not for naturalness.
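To see the phonemizer role directly, eSpeak-NG can print phonemes instead of producing audio. This is a minimal sketch, assuming espeak-ng is on PATH. The -q (no audio) and --ipa (print IPA) options belong to the espeak-ng command line, while the two wrapper function names are hypothetical.

```python
import shutil
import subprocess


def espeak_ipa_cmd(text: str, voice: str = "cs") -> list[str]:
    """Build the espeak-ng command that prints IPA phonemes and plays no audio."""
    return ["espeak-ng", "-v", voice, "-q", "--ipa", text]


def phonemize(text: str, voice: str = "cs") -> str:
    """Run espeak-ng and return the IPA transcription it prints."""
    if shutil.which("espeak-ng") is None:
        raise RuntimeError("espeak-ng not found on PATH")
    result = subprocess.run(
        espeak_ipa_cmd(text, voice), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(phonemize("Tři sta třicet tři stříbrných křepelek."))
```

Running it on a ř-heavy sentence is a quick way to inspect what a phoneme-based front-end would hand to the acoustic model.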
This is the part most "best local TTS 2026" lists won't tell you, and it's the reader's hard filter:
| Model | English quality | Czech support | Verdict for this reader |
|---|---|---|---|
| Kokoro (82M) | Excellent, very efficient | No Czech (EN, plus some zh/ja/fr/es/hi/it/pt) | Skip for Czech |
| MeloTTS | Very good | No Czech (EN/ES/FR/ZH/JA/KO) | Skip for Czech |
| Fish Speech / OpenAudio S1 | Excellent, cloning | Czech not officially listed / unconfirmed — focuses on high-resource langs | Don't rely on it for Czech |
| Dia, Sesame CSM, Chatterbox, Parler | Strong (English) | English-only / no verified Czech | Skip for Czech |
| Piper cs_CZ-jirka | (EN voices excellent) | Yes: real Czech voice | Use (lightweight) |
| XTTS-v2 | Very good | Yes: cs + cloning | Use (best natural) |
| MMS-TTS ces | n/a | Yes: single-speaker VITS | Backup/baseline |
| eSpeak-NG cs | n/a | Yes, but robotic | Phonemizer / last resort |
The pattern: the buzziest 2025–2026 local models are English-first and leave you stranded in Czech. The list that actually serves natural Czech is short: XTTS-v2, Piper, MMS, eSpeak.
For the reader's goal — natural EN+CS, free/local, technical user with Docker:
Tier 1, best natural (needs a GPU):

- XTTS-v2 via coqui-tts for Czech and English, with a clean native-Czech reference clip you record yourself (and an English reference for EN).
- Run it behind a small local HTTP service (a Docker container suits this), so the app sends POST /tts and gets a WAV: the same interface as a cloud API, but free and private.

Tier 2, lightweight / no-GPU fallback (ships anywhere):

- Piper with cs_CZ-jirka-medium for Czech and en_US-lessac-medium (or en_US-amy-medium) for English.
- Add the sptsfn/piper-czech-tts pronunciation fixes if you hit recurring mispronunciations.

Practical hybrid (what I'd actually ship): Piper as the always-available default (instant, offline, zero cost), with an XTTS-v2 "high-quality render" path for content you want to sound genuinely natural (e.g., a "listen" button that renders once and caches the MP3). Cache aggressively: generate each piece of text to audio once and store it. This makes even CPU XTTS practical and keeps everything free.
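The "render once, then serve from cache" idea can be sketched independently of the engine. The version below assumes a synthesizer is any callable that takes text and an output path and writes an audio file there. The cached_render name, the cache layout, and the use of WAV rather than MP3 are all illustrative choices, not something prescribed above.

```python
import hashlib
from pathlib import Path
from typing import Callable

# A synthesizer takes (text, output path) and writes an audio file there.
Synth = Callable[[str, Path], None]


def cached_render(text: str, engine: str, synth: Synth, cache_dir: Path) -> Path:
    """Render text once per (engine, text) pair and reuse the file afterwards."""
    key = hashlib.sha256(f"{engine}\n{text}".encode("utf-8")).hexdigest()
    out = cache_dir / f"{key}.wav"
    if not out.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        tmp = out.with_suffix(".tmp")
        synth(text, tmp)   # slow path: Piper or XTTS-v2 does the work here
        tmp.replace(out)   # rename last, so a crashed render leaves no partial .wav
    return out
```

Including the engine name in the key keeps the Piper and XTTS-v2 renders of the same text apart, so the high-quality path can fill in later without overwriting the fast default.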
Text-normalization tip (applies to all local CS models): none of these have a strong Czech front-end. Before synthesis, expand numbers to Czech words, spell out abbreviations and units, and consider a small dictionary for English brand names. This single preprocessing step does more for perceived Czech naturalness than swapping models.
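A very small version of that preprocessing step might look like the sketch below. It is deliberately limited: it spells only standalone numbers from 0 to 99, always in the basic nominative form (so it ignores Czech gender and case agreement), and the abbreviation table is a hypothetical starter list. Longer numbers, decimals, ordinals, and units are left untouched for a fuller normalizer to handle.

```python
import re

UNITS = [
    "nula", "jedna", "dva", "tři", "čtyři", "pět", "šest", "sedm", "osm", "devět",
    "deset", "jedenáct", "dvanáct", "třináct", "čtrnáct", "patnáct", "šestnáct",
    "sedmnáct", "osmnáct", "devatenáct",
]
TENS = {2: "dvacet", 3: "třicet", 4: "čtyřicet", 5: "padesát",
        6: "šedesát", 7: "sedmdesát", 8: "osmdesát", 9: "devadesát"}

# Hypothetical starter dictionary; extend it with whatever your texts contain.
ABBREVIATIONS = {"např.": "například", "atd.": "a tak dále", "tj.": "to jest"}

# One or two digits that are not part of a longer number or a decimal.
_SMALL_NUMBER = re.compile(r"(?<![\w.,])\d{1,2}(?!\w|[.,]\d)")


def number_to_czech(n: int) -> str:
    """Spell 0..99 as a Czech cardinal (nominative, no gender or case agreement)."""
    if not 0 <= n <= 99:
        raise ValueError("only 0..99 is handled in this sketch")
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    return TENS[tens] if unit == 0 else f"{TENS[tens]} {UNITS[unit]}"


def normalize(text: str) -> str:
    """Expand known abbreviations, then spell out standalone small numbers."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return _SMALL_NUMBER.sub(lambda m: number_to_czech(int(m.group())), text)


if __name__ == "__main__":
    print(normalize("Mám 3 knihy, např. tuto."))
```

Run the normalizer on the text before handing it to any of the engines above. Mistakes of agreement (for example "jedna" where "jeden" is needed) are the main thing to watch for with this simplified approach.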
- Best natural local Czech: XTTS-v2 with a Czech reference clip, via the coqui-tts (idiap) fork. Grade ~B/B+. Needs ~6–8 GB VRAM for comfort; CPU works for batch rendering. Reference-clip quality is the dominant quality lever; never clone a real person without consent.
- Lightweight default: Piper cs_CZ-jirka-medium (pip install piper-tts, voice from rhasspy/piper-voices). One male voice, two tiers, grade ~C+/B−, known pronunciation quirks (hence the sptsfn/piper-czech-tts fix-up project). Runs faster than real time, no GPU, and does English equally easily.
- MMS-TTS facebook/mms-tts-ces is the easiest install (pure transformers, VITS, single speaker), grade ~C: a fine baseline, a notch below Piper.
- eSpeak-NG cs is robotic (grade D). It is not for "natural," but it's the phonemizer backbone under Piper and worth keeping installed.

Sources:

- OHF-Voice/piper1-gpl: install (pip install piper-tts), Python API, and VOICES.md.
- rhasspy/piper-voices: Czech voice files under cs/cs_CZ/jirka/{low,medium} (.onnx + .onnx.json).
- sptsfn/piper-czech-tts: fixes cs_CZ-jirka mispronunciations and reads PDFs; evidence of the base voice's Czech G2P limits.
- idiap/coqui-ai-TTS: pip install coqui-tts; XTTS-v2 usage, languages, and training recipes.
- XTTS-v2 model card: 17 languages (incl. cs), ~6s cloning, cross-language synthesis.
- facebook/mms-tts-ces model card: transformers VitsModel usage.
- VITS documentation in transformers, including the stochastic (non-deterministic) duration predictor.
- eSpeak-NG: Czech (cs) support; widely used as the G2P/phonemizer backbone for neural TTS incl. Piper.