Before you can pick the right text-to-speech (TTS) engine for reading your texts aloud in English and Czech, it helps to understand what is happening under the hood. This chapter explains the technology accessibly but precisely: how raw text becomes natural-sounding speech, the major architectural families (acoustic-model-plus-vocoder, end-to-end, autoregressive vs. non-autoregressive, diffusion/flow-matching, and the newer LLM/codec systems), what actually makes a voice "sound real," how voice cloning works, and the practical engineering knobs — streaming vs. batch, latency, real-time factor, CPU vs. GPU. It closes with how naturalness is measured (MOS and friends), so that when later chapters throw scores around, you know what they mean and how much to trust them.
Throughout, the technology is mapped back to the reader's two hard requirements — natural English AND natural Czech — and to the free/near-free constraint. Czech is a much smaller-data language than English, so some architectural choices matter more for Czech quality than for English. Where that is the case, it is called out.
1. The TTS pipeline in one picture
Almost every neural TTS system, no matter how modern, does some version of three jobs:
Raw text     ──►  [ FRONTEND ]        ──►  [ ACOUSTIC MODEL ]  ──►  [ VOCODER ]     ──►  Audio waveform
"Dr. Nový         text normalization       phonemes/text →          spectrogram →        16/22/24/44 kHz
 má 3 psy."       + grapheme→phoneme       mel-spectrogram          waveform             .wav / PCM
In two-stage (cascaded) systems these are separate models you can swap. In end-to-end systems the middle and right boxes are fused into one network. In the newest LLM/codec systems the spectrogram is replaced by discrete audio tokens that a language model predicts. We will walk through each box, then each architecture family.
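The modularity of the cascaded design can be sketched in a few lines of Python. This is a minimal illustration only: the type aliases, `make_cascaded_tts`, and the three toy stages are hypothetical names invented here, and the toy stages emit placeholder numbers rather than speech.

```python
from typing import Callable, List

# Hypothetical stage signatures, chosen only for this illustration.
Phonemes = List[str]
Spectrogram = List[List[float]]   # one list of mel-bin energies per frame
Waveform = List[float]            # raw audio samples


def make_cascaded_tts(
    frontend: Callable[[str], Phonemes],
    acoustic_model: Callable[[Phonemes], Spectrogram],
    vocoder: Callable[[Spectrogram], Waveform],
) -> Callable[[str], Waveform]:
    """Chain three independently swappable stages into one text -> audio function."""
    def synthesize(text: str) -> Waveform:
        return vocoder(acoustic_model(frontend(text)))
    return synthesize


# Toy stand-ins so the sketch runs; they produce placeholder numbers, not speech.
def toy_frontend(text: str) -> Phonemes:
    # Keep letters only; a real frontend would normalize "3" and run G2P.
    return [ch for ch in text.lower() if ch.isalpha()]


def toy_acoustic_model(phonemes: Phonemes) -> Spectrogram:
    # One fake 2-bin frame per phoneme.
    return [[float(ord(p) % 7), 0.0] for p in phonemes]


def toy_vocoder(frames: Spectrogram) -> Waveform:
    samples_per_frame = 4
    return [frame[0] for frame in frames for _ in range(samples_per_frame)]


tts = make_cascaded_tts(toy_frontend, toy_acoustic_model, toy_vocoder)
audio = tts("Má 3 psy.")
print(len(audio))  # 5 letters x 4 samples per frame
```

In this framing, an end-to-end system corresponds to replacing `acoustic_model` and `vocoder` with a single trained function, and swapping a vocoder means passing a different third argument.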
2. The frontend: text normalization and grapheme-to-phoneme
This is the most under-appreciated part of TTS, and it is disproportionately important for Czech.
Text normalization (TN) turns "non-standard words" into how they are actually spoken:
| Input | English spoken form | Czech spoken form |
| --- | --- | --- |
| 3 psy | n/a (Czech-only input) | "tři psy" (the numeral and noun forms depend on grammatical case) |
| Dr. | "Doctor" | "doktor" |
| 2026 | "twenty twenty-six" | "dva tisíce dvacet šest" |
| 12:30 | "twelve thirty" | "dvanáct třicet" |
| 15 % | "fifteen percent" | "patnáct procent" |
| €5 | "five euros" | "pět eur" |
Czech makes TN harder because numbers and units inflect by grammatical case and gender ("dva psi" vs. "dvě kočky" vs. "pěti psům"). Many TTS engines have rich English TN and weak or absent Czech TN, so a model that technically "supports Czech" may read 3 psy correctly as "tři psy" yet mangle 3. (an ordinal, "třetí") or read 2026 digit by digit. When evaluating an engine for Czech, test it on numbers, dates, abbreviations, and currency, not just plain sentences.
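To make the idea concrete, here is a minimal sketch of lookup-based text normalization. Everything in it (the `normalize` function and the tiny `ABBREVIATIONS`, `CARDINALS`, and `PERCENT` tables) is a hypothetical illustration, not the normalizer of any real engine. It deliberately ignores Czech case and gender agreement and ordinals such as "3.", which is exactly where real Czech TN gets hard.

```python
import re

# Tiny, hand-written tables for illustration only.
ABBREVIATIONS = {
    "en": {"Dr.": "Doctor"},
    "cs": {"Dr.": "doktor"},
}
CARDINALS = {
    "en": {"3": "three", "5": "five", "15": "fifteen"},
    "cs": {"3": "tři", "5": "pět", "15": "patnáct"},  # one fixed form, no case inflection
}
PERCENT = {"en": "percent", "cs": "procent"}


def normalize(text: str, lang: str) -> str:
    """Expand a few abbreviations, percent signs, and known numbers."""
    for abbreviation, spoken in ABBREVIATIONS[lang].items():
        text = text.replace(abbreviation, spoken)

    # "15 %" or "15%" -> "15 percent" / "15 procent"
    text = re.sub(r"(\d+)\s*%", lambda m: f"{m.group(1)} {PERCENT[lang]}", text)

    # Spell out digit runs we have a word for; leave unknown numbers untouched.
    def spell(match: re.Match) -> str:
        return CARDINALS[lang].get(match.group(0), match.group(0))

    return re.sub(r"\d+", spell, text)


print(normalize("Dr. Nový má 3 psy.", "cs"))
print(normalize("15 %", "en"))
```

A production normalizer would need to pick the right Czech form from context (for example "pěti psům" in the dative), which a flat lookup like this cannot do.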
Grapheme-to-phoneme (G2P) converts written letters (graphemes) into sounds (phonemes), usually in IPA. English needs heavy G2P because spelling is irregular ("though," "through," "tough"). Czech is largely phonemic (it is spelled close to how it sounds), which is a genuine advantage: a model can sometimes do well on Czech from raw characters. But Czech still has traps: final devoicing and voicing assimilation ("led" → /lɛt/), the letter ě (which changes how the preceding consonant is pronounced or inserts a /j/), the ř sound, and foreign loanwords.
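The "led" → /lɛt/ trap is an instance of word-final devoicing, and a toy rule for it fits in a few lines. This sketch is an assumption-laden simplification: it works on spelling rather than on real phoneme symbols, the `FINAL_DEVOICING` table and `devoice_final` function are hypothetical names, and a real phonemizer also handles assimilation inside consonant clusters and across word boundaries.

```python
# Voiced obstruent -> voiceless counterpart at the end of a Czech word.
# Written with Czech letters as a stand-in for phoneme symbols.
FINAL_DEVOICING = {
    "b": "p",
    "d": "t",
    "ď": "ť",
    "g": "k",
    "h": "ch",
    "v": "f",
    "z": "s",
    "ž": "š",
}


def devoice_final(word: str) -> str:
    """Apply word-final devoicing to a single lowercase Czech word."""
    if word and word[-1] in FINAL_DEVOICING:
        return word[:-1] + FINAL_DEVOICING[word[-1]]
    return word


for word in ["led", "dub", "pes"]:
    print(word, "->", devoice_final(word))
```

Under this rule "led" (ice) comes out sounding like "let" (flight), which is the behavior a character-input model has to learn implicitly from data.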
Two G2P backends dominate the open-source world:
espeak-ng — a rule-based phonemizer supporting 100+ languages including Czech and English. It is the default phonemizer behind StyleTTS2, Kokoro, and many VITS models. It is fast, tiny, and the reason a lot of "multilingual" models can attempt Czech at all.
phonemizer (Python) — a wrapper that calls espeak-ng/Festival and returns IPA; ubiquitous in TTS training code.
# See exactly what phonemes an engine will receive for Czech vs. English
espeak-ng -v cs -q --ipa "Doktor Nový má tři psy."
# → dˈoktor nˈoviː mˈaː tr̝ɪ psˈɪ
espeak-ng -v en-us -q --ipa "Doctor Novy has three dogs."
# → dˈɑːktɚ nˈoʊvi hɐz θɹˈiː dˈɑːɡz
A practical implication: some end-to-end and LLM-codec models skip explicit G2P and feed raw text/bytes to the network (F5-TTS, Orpheus). That can be great for low-resource languages if the model saw enough of them in training — and a liability for Czech if it didn't. Phoneme-input models (StyleTTS2/Kokoro via espeak-ng) get a more even shot at Czech because espeak-ng already knows Czech pronunciation rules.
3. The two classic boxes: acoustic model + vocoder
Acoustic model
Takes phonemes/text and produces an intermediate mel-spectrogram — a time-frequency picture of the sound (what frequencies are loud, when). It decides the content and prosody: pitch contour, rhythm, stress, duration of each sound. Famous acoustic models:
Tacotron / Tacotron 2 (Google) — sequence-to-sequence with attention; autoregressive; the model that made neural TTS sound human in 2017–2018.
FastSpeech / FastSpeech 2 (Microsoft) — non-autoregressive; predicts duration explicitly, so it is fast and stable (no "babbling/looping" failure of attention).
Glow-TTS — flow-based, uses Monotonic Alignment Search to learn text↔audio alignment without an external aligner.
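Explicit duration prediction, as in the FastSpeech family, boils down to a "length regulator": each phoneme is repeated for as many spectrogram frames as its predicted duration. The sketch below illustrates that one step under stated assumptions. The function names are hypothetical, the durations are hard-coded rather than predicted, and the 22,050 Hz sample rate and 256-sample hop are common but assumed values.

```python
from typing import List


def length_regulate(phonemes: List[str], durations: List[int]) -> List[str]:
    """Repeat each phoneme for its duration (in frames), all in one pass."""
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[str] = []
    for phoneme, n_frames in zip(phonemes, durations):
        frames.extend([phoneme] * n_frames)
    return frames


def frames_to_seconds(n_frames: int, sample_rate: int = 22050, hop_length: int = 256) -> float:
    """Convert a frame count to audio duration (assumed sample rate and hop size)."""
    return n_frames * hop_length / sample_rate


expanded = length_regulate(["l", "ɛ", "t"], [3, 5, 2])
print(expanded)
print(round(frames_to_seconds(len(expanded)), 3), "seconds")
```

Because the output length is fixed before any audio is generated, all frames can be computed in parallel, and the model cannot loop or babble the way an attention-based decoder can.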
Vocoder
Takes the mel-spectrogram and synthesizes the actual waveform (the sound you hear). This is where "real vs. robotic" is often won or lost. Famous vocoders:
WaveNet (DeepMind, 2016) — autoregressive, sample-by-sample; gorgeous quality but painfully slow (the original was far slower than real time).
WaveGlow — flow-based, parallel, much faster.
HiFi-GAN — GAN-based, the modern default: fast, small, near-WaveNet quality. Most open models you will run (StyleTTS2, many VITS) use a HiFi-GAN-style decoder.
The advantage of two stages is modularity (swap a better vocoder under any acoustic model). The disadvantage is a train/inference mismatch: the vocoder is trained on real spectrograms but at inference sees predicted ones, which can introduce artifacts.
4. End-to-end systems (fuse the boxes)
End-to-end models generate waveform directly from text in a single trained network, removing the mismatch problem and simplifying deployment:
VITS (2021) — the breakthrough end-to-end model. It combines a variational autoencoder (VAE), normalizing flows, adversarial (GAN) training, Monotonic Alignment Search, and a stochastic duration predictor. That last piece matters for naturalness: because duration is sampled rather than fixed, the same sentence gets slightly different, human-like rhythm each time. VITS and its many descendants are the backbone of a huge number of free/local models, including most Czech community voices (see Chapter on Piper, which uses VITS).
For the reader: VITS-family models are the workhorses of free, local, multilingual TTS. They are small (tens of MB), run on CPU, and have real Czech voices in the wild. Their ceiling on expressiveness is lower than the newest diffusion/LLM systems, but their floor is reliable.
5. Autoregressive vs. non-autoregressive
This is one of the two biggest axes you will see when comparing engines.
Autoregressive (AR) models generate audio (or audio tokens) one step at a time, each step conditioned on the previous one — like how an LLM writes text token by token.
Pros: often the most natural, expressive prosody; great at long-range intonation; excellent zero-shot voice cloning.
Cons: slower (can't parallelize the time axis); can occasionally "hallucinate," skip words, repeat, or run on; latency grows with length.
Examples: Tortoise-TTS (very high quality, very slow: minutes per clip), XTTS v2 (Coqui's GPT-style model), Orpheus (Llama-based), most LLM-codec systems.
Non-autoregressive (NAR) models generate all frames in parallel (often after predicting durations up front, or by iteratively refining noise).
Pros: fast, stable, predictable latency; less prone to word-skipping; easy to stream chunk-by-chunk.
Cons: historically slightly less expressive (though diffusion/flow-matching has largely closed this gap).
A rough mental model: AR = "maximum naturalness, watch the latency and reliability"; NAR = "fast and dependable, naturalness now very competitive." For a feature that reads your texts (often long), NAR's stability and speed are usually the better fit unless you specifically want the expressive ceiling of an AR cloner.
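The structural difference is easy to see in a toy sketch. The two generator functions and the toy models below are hypothetical and produce meaningless numbers; the point is only the count of sequential model calls, which is what drives AR latency.

```python
from typing import Callable, Dict, List


def generate_autoregressive(
    n_frames: int, next_frame: Callable[[List[float]], float]
) -> List[float]:
    """Each frame depends on the frames before it, so calls happen one after another."""
    frames: List[float] = []
    for _ in range(n_frames):
        frames.append(next_frame(frames))
    return frames


def generate_parallel(n_frames: int, all_frames: Callable[[int], List[float]]) -> List[float]:
    """All frames come out of a single model call."""
    return all_frames(n_frames)


calls: Dict[str, int] = {"ar": 0, "nar": 0}


def toy_next_frame(history: List[float]) -> float:
    calls["ar"] += 1
    previous = history[-1] if history else 0.0
    return 0.5 * previous + 1.0


def toy_all_frames(n_frames: int) -> List[float]:
    calls["nar"] += 1
    return [1.0] * n_frames


ar_audio = generate_autoregressive(100, toy_next_frame)
nar_audio = generate_parallel(100, toy_all_frames)
print(calls)
```

Real NAR systems based on diffusion or flow matching make a small fixed number of refinement calls rather than exactly one, but that number does not grow with the length of the clip.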
6. Diffusion and flow-matching TTS
These are generative techniques used inside (usually NAR) models to produce extremely natural audio.
Diffusion: start from random noise and iteratively "denoise" it into a spectrogram/waveform over many small steps, guided by the text. StyleTTS2 uses style diffusion — it diffuses a latent "style" vector (capturing speaking style/emotion) rather than the raw audio, which is why it is both natural and fast. Diffusion historically needs many sampling steps (slower), but TTS variants use clever tricks to cut that down.
Flow matching (the newer hotness): a generative method that learns a smooth, more-or-less straight "flow" from noise to data, so it needs far fewer sampling steps than classic diffusion for comparable quality. F5-TTS ("A Fairytaler that Fakes Fluent and Faithful speech with Flow matching," arXiv Oct 2024) uses flow matching with a Diffusion Transformer (DiT), no explicit duration model and no phoneme alignment, plus a "Sway Sampling" inference trick. Result: fast NAR generation with strong zero-shot cloning.
F5-TTS reports a strikingly low inference RTF of ~0.15 (about 6.7× faster than real time on GPU), far better than older diffusion TTS — flow-matching's main practical win. Caveat for the reader: F5-TTS was trained on a ~100K-hour, mostly English+Chinese public dataset. It can attempt other languages and has "seamless code-switching," but Czech is not an officially trained language and quality is unverified — exactly the kind of "looks multilingual, isn't really" trap to test before trusting. (A 2025 "Cross-Lingual F5-TTS" line of work targets language-agnostic cloning; community Czech fine-tunes may exist — verify against your own sentences.)
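Why a straight flow needs few sampling steps can be shown with a toy Euler integrator. This is not F5-TTS code: `euler_sample` and `ideal_velocity` are hypothetical names, the "data" is a three-number vector, and the velocity field is the ideal straight-line one instead of a trained network.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    noise: Vector, velocity: Callable[[Vector, float], Vector], steps: int
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 (noise) to t = 1 (data)."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target: Vector = [0.5, -1.0, 2.0]   # stands in for a clean spectrogram
noise: Vector = [0.1, 0.9, -0.4]    # stands in for the random starting point


def ideal_velocity(x: Vector, t: float) -> Vector:
    # On the straight path x(t) = (1 - t) * noise + t * target this equals
    # (target - noise), i.e. a constant velocity pointing at the data.
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


for steps in (1, 4, 32):
    result = euler_sample(noise, ideal_velocity, steps)
    error = max(abs(r - t) for r, t in zip(result, target))
    print(steps, "steps, max error", error)
```

With a perfectly straight path even one Euler step lands on the target. A trained model only approximates such a field, so real systems still use several steps, but far fewer than a curved diffusion trajectory would need.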
7. LLM-based / codec TTS (the 2024–2026 frontier)
The newest paradigm treats speech generation exactly like language modeling. The trick is a neural audio codec that compresses audio into a stream of discrete tokens — like a vocabulary for sound.
Neural audio codecs: SoundStream (Google) and EnCodec (Meta) pioneered Residual Vector Quantization (RVQ) — representing each audio frame with several stacked codebooks (coarse-to-fine). SNAC (Multi-Scale Neural Audio Codec) tokenizes at multiple time scales and is used by Orpheus.
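Residual vector quantization can be illustrated on a single number instead of an audio frame. In the sketch below the codebooks are tiny hand-picked lists of scalars and the function names are hypothetical. Real codecs quantize high-dimensional vectors with learned codebooks, but the coarse-to-fine logic is the same.

```python
from typing import List

# Three stacked "codebooks", from coarse to fine (hand-picked for illustration).
CODEBOOKS: List[List[float]] = [
    [-1.0, 0.0, 1.0],
    [-0.3, 0.0, 0.3],
    [-0.1, 0.0, 0.1],
]


def nearest_index(codebook: List[float], value: float) -> int:
    """Index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))


def rvq_encode(value: float, codebooks: List[List[float]]) -> List[int]:
    """Quantize value stage by stage; each stage encodes the previous stage's residual."""
    residual = value
    codes: List[int] = []
    for codebook in codebooks:
        index = nearest_index(codebook, residual)
        codes.append(index)
        residual -= codebook[index]
    return codes


def rvq_decode(codes: List[int], codebooks: List[List[float]]) -> float:
    """Sum the selected entry from every stage."""
    return sum(codebook[index] for codebook, index in zip(codebooks, codes))


codes = rvq_encode(0.74, CODEBOOKS)
print(codes, rvq_decode(codes, CODEBOOKS))
```

The list of small integers in `codes` plays the role of the discrete audio tokens: it is what a language model would be trained to predict, and the decoder turns it back into a signal.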
The recipe: (1) encode speech to tokens with the codec; (2) train an LLM to predict those audio tokens from text (and a voice prompt); (3) at inference, the LLM autoregressively generates audio tokens, then the codec decoder turns them back into a waveform.
Why it's exciting: an LLM backbone brings context, emotion, and prosody understanding. Orpheus (Canopy Labs, released March 2025) is a fine-tune of Llama-3.2-3B that emits SNAC tokens (~150 tokens/sec of audio), supports non-verbal cues (laughs, sighs), zero-shot cloning from ~10 s of reference audio, and low-latency streaming (~100–200 ms time-to-first-audio as advertised — verify on your hardware). It ships Apache-2.0 on code and weights.
The catch: these are AR (slower, can hallucinate), they are large (a 3B-param model wants a GPU with a few GB VRAM), and — critically for this reader — English-centric. Orpheus's April-2025 multilingual preview added French, German, Spanish, Italian, Mandarin, Korean, and Hindi — but not Czech. Czech support in LLM-codec models is currently weak-to-absent; treat any Czech claim as unverified until you hear it yourself.
8. What makes a voice "sound real"
"Sounding real" is not one thing. It bundles three distinct qualities that are often confused:
Intelligibility — can you understand the words? Even old concatenative/robotic TTS scores high here. This is the floor, not the goal.
Naturalness — does it sound like a human, not a machine? Driven by smooth waveform synthesis (good vocoder), correct prosody, no glitches/buzz, natural breath and micro-timing.
Expressiveness — does it convey emotion, emphasis, intent? Excitement, a question's rising tone, a dramatic pause. This is the ceiling, and where AR/LLM/diffusion models pull ahead.
Prosody is the umbrella term for pitch (intonation/melody), duration/rhythm (how long each sound and pause lasts), stress/emphasis, and loudness, and it largely decides naturalness. A robotic voice usually has flat or wrong prosody, not wrong words. This is why VITS's stochastic duration and StyleTTS2's style vector matter: they inject the natural variability humans have.
For Czech specifically, the danger zones are: wrong stress (Czech stress is almost always on the first syllable — engines trained mostly on English can get this wrong), the ř phoneme (notoriously hard, a common giveaway of a bad Czech voice), vowel length (Czech distinguishes short a vs. long á — getting it wrong changes meaning), and palatalization (ď, ť, ň). When auditioning a Czech voice, listen specifically for ř, first-syllable stress, and long vowels.
9. Voice cloning: zero-shot vs. fine-tuned
A major reason to pick a modern model is voice cloning — making it speak in a chosen voice.
Zero-shot cloning: give the model a few seconds of reference audio at inference time; it imitates that voice immediately, no training. XTTS v2 advertises cloning from ~6 seconds; F5-TTS and Orpheus do this too. Pros: instant, no GPU training. Cons: similarity varies; cross-language cloning (clone an English voice, speak Czech) can carry an accent.
Fine-tuning: actually train/adapt the model on a dataset of the target speaker (minutes to hours of clean audio). Pros: highest fidelity and consistency, better handling of the target language's phonetics. Cons: needs a GPU, data prep, and time.
For a reader who just wants their app to read text in a good voice (not necessarily a specific person's voice), you often don't need cloning at all — a model's built-in voices are simpler and more reliable. Cloning matters if you want a specific or branded voice. Note the license angle: XTTS v2 ships under the Coqui Public Model License (CPML) which is non-commercial — fine for a personal/hobby project, a problem for a commercial product. Permissively licensed alternatives (Kokoro is Apache-2.0) avoid this. (Coqui the company shut down in early 2024; the code lives on as the community fork idiap/coqui-ai-TTS.)
10. Streaming vs. batch, latency, and real-time factor
These engineering metrics decide whether a model is usable in your app's UX.
Real-Time Factor (RTF) = processing time ÷ audio duration produced. RTF < 1 = faster than real time (RTF 0.3 means 1 s of audio took 0.3 s to make). Lower is better. This governs throughput and cost.
Time-to-first-audio (often reported as TTFB, "time to first byte," or simply latency) = how long until the first sound plays. For an app that reads text on a button press, this is what the user feels. A model can have great RTF but bad TTFB if it must generate the whole clip before playing.
Streaming = emit audio in chunks as they are generated, so playback starts almost immediately and continues while the rest is computed. Batch = generate the entire clip, then play. For reading long texts, streaming is a huge UX win — the user hears the first sentence in ~200 ms instead of waiting for the whole paragraph. AR/LLM models stream naturally (token by token); many NAR models can stream by chunking the input text.
| Concept | What it controls | Good number |
| --- | --- | --- |
| RTF | Throughput, server cost | < 1 (ideally < 0.3 on GPU) |
| TTFB / latency | Perceived responsiveness | < 300 ms for "instant" feel |
| Streaming | Whether playback starts early | Supported = much better UX for long text |
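These three quantities are simple to measure yourself. The sketch below assumes a hypothetical `synthesize(sentence)` callable that returns a list of samples; `fake_synthesize` stands in for a real engine so the code runs anywhere. The sentence splitter is deliberately naive and would wrongly split after an abbreviation such as "Dr.", which is one more reason the frontend matters.

```python
import re
import time
from typing import Callable, Iterator, List, Tuple

Samples = List[float]


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / duration of audio produced (lower is better)."""
    return processing_seconds / audio_seconds


def split_into_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def stream_synthesis(text: str, synthesize: Callable[[str], Samples]) -> Iterator[Samples]:
    """Yield audio sentence by sentence so playback can start before the text is done."""
    for sentence in split_into_sentences(text):
        yield synthesize(sentence)


def time_to_first_audio(
    chunks: Iterator[Samples], clock: Callable[[], float] = time.perf_counter
) -> Tuple[float, Samples]:
    """Seconds until the first chunk is available, plus that chunk."""
    start = clock()
    first_chunk = next(chunks)
    return clock() - start, first_chunk


def fake_synthesize(sentence: str) -> Samples:
    # Placeholder engine: silence whose length grows with the text.
    return [0.0] * (100 * len(sentence))


text = "První věta. Druhá věta? Třetí!"
latency, first = time_to_first_audio(stream_synthesis(text, fake_synthesize))
print(split_into_sentences(text))
print("speed-up at RTF 0.15:", round(1 / real_time_factor(0.15, 1.0), 1))
```

In batch mode the same measurement would wrap one `synthesize(text)` call on the full text, so time-to-first-audio grows with the length of the text instead of the length of the first sentence.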
11. CPU vs. GPU inference
This is the make-or-break for the free/local path.
Small models (Kokoro ~82M params, VITS/Piper voices ~10–30M) run comfortably on CPU, often near or faster than real time. This is the sweet spot for a free, self-hosted, no-GPU-bill setup. Kokoro uses a StyleTTS2-derived, decoder-only design with an iSTFTNet vocoder (no diffusion). Its weights are ~300 MB (164 MB in FP16), it needs <2 GB VRAM on GPU (RTF ~0.03 on an A100), and it runs at real time or faster on plain CPU. It reached #1 on the TTS Arena leaderboard in January 2026, beating models 10–100× its size, and is Apache-2.0. That makes Kokoro on CPU a realistic "natural voice, $0 hardware" option for English; its Czech support must be checked per release (its language list has expanded over time, and Czech quality is not guaranteed).
Medium models (StyleTTS2, F5-TTS, XTTS v2) run much faster on a GPU but can run on CPU slowly (seconds-per-sentence), acceptable for offline/batch generation, painful for interactive use.
Large LLM-codec models (Orpheus ~3B) effectively need a GPU (and a few GB of VRAM) for usable latency.
Rule of thumb: parameter count ≈ hardware appetite. If you want free + local + responsive with no GPU, bias toward small NAR/VITS-family models (Kokoro, Piper) and accept a slightly lower expressiveness ceiling. If you have a GPU (or a cheap cloud GPU), the StyleTTS2/F5/XTTS tier opens up. LLM-codec models are the most natural but the most expensive to run.
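The rule of thumb can be turned into back-of-the-envelope arithmetic: weight size is roughly parameter count times bytes per parameter. The helper below is a hypothetical sketch that counts weights only; activations, caches, and framework overhead come on top, so treat the result as a lower bound on memory.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_size_mb(num_params: int, precision: str) -> float:
    """Approximate size of the weights alone, in decimal megabytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6


for name, params in [("Kokoro (~82M)", 82_000_000), ("Orpheus (~3B)", 3_000_000_000)]:
    for precision in ("fp32", "fp16"):
        print(f"{name} {precision}: {weight_size_mb(params, precision):,.0f} MB")
```

For an 82M-parameter model this gives 164 MB in FP16, matching the figure quoted above, while a 3B-parameter model lands in the gigabyte range before any runtime overhead.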
12. How naturalness is measured (MOS and friends)
When later chapters cite a "MOS of 4.2," here is what that means and how skeptical to be.
MOS (Mean Opinion Score) — the gold-standard subjective metric. Human listeners rate samples from 1 (bad) to 5 (excellent), and the ratings are averaged. Human recordings typically score ~4.5–4.7, so a model near 4.0+ is genuinely good. MOS is the most trustworthy metric, but it is expensive, slow, and only comparable within the same study: listener pools, sentences, and instructions differ between studies, so never compare a MOS from one paper directly to another.
CMOS (Comparative MOS) — listeners directly compare two systems (e.g., −3 to +3); better for "is A better than B?" within one test.
SMOS (Similarity MOS) — for voice cloning: how similar is the output to the target speaker? Separate from naturalness.
UTMOS / automatic MOS predictors — neural models (from the VoiceMOS Challenge) that predict MOS without humans. Cheap and fast, great for ranking many models, but biased toward English read speech and not a substitute for human ears — especially for Czech, where these predictors were barely trained.
Crowd leaderboards (e.g., TTS Arena) — humans blind-compare two clips and pick the better; Elo-style ranking. Useful for a community sanity check; Kokoro, for example, has ranked surprisingly high for its size. But again, the prompts are mostly English.
Objective metrics — WER/CER (run an ASR over the output; high error = unintelligible/garbled) measures intelligibility, not naturalness. Useful as a cheap failure detector, especially to catch a model that mangles Czech.
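Word error rate is straightforward to compute once an ASR transcript of the synthesized audio is available. The sketch below assumes you already have that transcript as a string; the function name is hypothetical and the implementation is a plain word-level edit distance, without the text normalization that evaluation toolkits usually apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, insertions, deletions) / reference length."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] = edit distance between the words seen so far and hyp_words[:j]
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            current_row.append(
                min(
                    previous_row[j] + 1,                           # deletion
                    current_row[j - 1] + 1,                        # insertion
                    previous_row[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# What you asked the engine to say vs. what the ASR heard back.
print(word_error_rate("doktor nový má tři psy", "doktor nový má tři psi"))
```

A sudden jump in WER on Czech sentences with numbers or abbreviations is a cheap signal that the frontend, not the voice, is the problem.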
Bottom line for this reader: No published MOS number tells you how a model sounds in Czech reading your texts. Treat benchmark scores as a coarse pre-filter, then do your own blind listening test on Czech sentences with numbers, dates, ř, and long vowels. A 30-minute self-test beats any leaderboard for your specific use case.
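A self-run listening test needs only two pieces of bookkeeping: a shuffled, anonymous playback order and an average with some indication of uncertainty. The sketch below is a minimal illustration with hypothetical function names and made-up ratings. The 95% interval uses a normal approximation, which is rough for the handful of ratings a personal test produces.

```python
import math
import random
import statistics
from typing import List, Sequence, Tuple


def blind_order(sample_ids: Sequence[str], seed: int) -> List[str]:
    """Shuffle the clips so the listener cannot tell which engine produced which."""
    shuffled = list(sample_ids)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def mos_with_ci(ratings: Sequence[int], z: float = 1.96) -> Tuple[float, float]:
    """Mean opinion score and the half-width of an approximate 95% confidence interval."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Made-up 1-5 ratings for one engine, purely for illustration.
ratings = [4, 5, 4, 3, 4]
mean, half_width = mos_with_ci(ratings)
print(f"MOS {mean:.2f} ± {half_width:.2f}")
print(blind_order(["engine_a_01", "engine_b_01", "engine_a_02", "engine_b_02"], seed=7))
```

If the intervals of two engines overlap heavily, the honest conclusion is that your test has not separated them yet and you need more sentences or more listeners.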
Key Takeaways
TTS = frontend (text normalization + grapheme-to-phoneme) → acoustic model (text→mel-spectrogram) → vocoder (spectrogram→waveform); modern systems fuse or replace these stages.
The frontend is decisive for Czech: Czech numbers/dates/units inflect by case and gender, so "supports Czech" ≠ "reads Czech numbers correctly." Czech is phonemic (helps), but watch ř, first-syllable stress, and long vowels.
Autoregressive models (Tortoise, XTTS, Orpheus) maximize expressiveness/cloning but are slower and can hallucinate; non-autoregressive models (VITS, StyleTTS2/Kokoro, F5-TTS) are fast, stable, and now nearly as natural.
Diffusion (StyleTTS2's style diffusion) and flow-matching (F5-TTS) are the techniques behind today's most natural NAR voices; flow-matching needs fewer steps (faster).
LLM-codec TTS (Orpheus, a Llama-3.2-3B fine-tune that emits SNAC tokens) is the most expressive frontier but is large, GPU-hungry, and English-centric; Czech support is weak or unverified.
"Sounds real" = intelligibility (floor) → naturalness (prosody: pitch, rhythm, stress) → expressiveness (emotion, ceiling). Prosody, not word choice, is what separates human from robotic.
Voice cloning is zero-shot (instant, from seconds of audio) or fine-tuned (best fidelity, needs a GPU). For just "read text in a good voice," built-in voices are simpler. Mind licenses: XTTS v2 = non-commercial (CPML); Kokoro = Apache-2.0.
Engineering knobs: RTF < 1 = faster than real time; TTFB/latency = perceived speed; streaming is a big UX win for long text. Param count ≈ hardware appetite — small NAR/VITS models (Kokoro, Piper) run free on CPU; LLM-codec models need a GPU.
MOS (1–5, human) is the trustworthy-but-non-portable gold standard; UTMOS/TTS-Arena are cheap pre-filters biased toward English. No score tells you Czech quality on your texts — always do your own blind listening test.
Key References
Coqui. XTTS v2 model card. Hugging Face. 2023–2024. https://huggingface.co/coqui/XTTS-v2 — Multilingual (incl. Czech) GPT-style autoregressive voice-cloning model; license (CPML, non-commercial) and 6-second cloning claim.
Chen, Y. et al. F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. arXiv:2410.06885. 2024. https://arxiv.org/abs/2410.06885 — Flow-matching + Diffusion Transformer, no duration model/alignment, Sway Sampling; trained on English+Chinese (~100K h).
Li, Y. et al. StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. arXiv:2306.07691. 2023. https://arxiv.org/abs/2306.07691 — Style diffusion + SLM adversarial training; the architecture Kokoro builds on.
hexgrad. Kokoro-82M model card. Hugging Face. 2025. https://huggingface.co/hexgrad/Kokoro-82M — 82M-param Apache-2.0 TTS built on StyleTTS2; lightweight, CPU-friendly, multilingual expansion.
Kim, J., Kong, J., Son, J. Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (VITS). arXiv:2106.06103. 2021. https://arxiv.org/abs/2106.06103 — End-to-end VAE+flow+GAN with Monotonic Alignment Search and stochastic duration predictor; backbone of many free/local Czech voices.
Défossez, A. et al. (Meta AI). High Fidelity Neural Audio Compression (EnCodec). arXiv:2210.13438. 2022. https://arxiv.org/abs/2210.13438 — Neural audio codec with residual vector quantization; the token representation enabling LLM-based TTS.
Zeghidour, N. et al. (Google). SoundStream: An End-to-End Neural Audio Codec. arXiv:2107.03312. 2021. https://arxiv.org/abs/2107.03312 — Pioneering RVQ neural codec underpinning discrete-token speech generation.
Saeki, T. et al. UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022. arXiv:2204.02152. 2022. https://arxiv.org/abs/2204.02152 — Automatic MOS prediction; how naturalness can be scored without human listeners (and its English bias).
Kong, J., Kim, J., Bae, J. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. arXiv:2010.05646. 2020. https://arxiv.org/abs/2010.05646 — The fast, high-quality GAN vocoder used by most modern open TTS systems.
espeak-ng contributors. espeak-ng (GitHub). 2024–2025. https://github.com/espeak-ng/espeak-ng — Rule-based grapheme-to-phoneme for 100+ languages incl. Czech and English; the default phonemizer behind StyleTTS2/Kokoro/VITS.
Ren, Y. et al. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. arXiv:2006.04558. 2020. https://arxiv.org/abs/2006.04558 — Canonical non-autoregressive acoustic model with explicit duration prediction.
Kim, J. et al. Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search. arXiv:2005.11129. 2020. https://arxiv.org/abs/2005.11129 — Introduces Monotonic Alignment Search, later reused by VITS for internal text↔audio alignment.
TTS Arena (Hugging Face). TTS Arena leaderboard. 2025–2026. https://huggingface.co/spaces/TTS-AGI/TTS-Arena — Blind crowd A/B Elo ranking of TTS systems; the leaderboard where Kokoro-82M ranked #1 in Jan 2026.