Text-to-Speech for Reading Your Texts — Natural Voices in English & Czech (2026)


Chapter 02: How Modern Neural TTS Works

Chapter 2 of 17 · ~19 min read

Overview

Before you can pick the right text-to-speech (TTS) engine for reading your texts aloud in English and Czech, it helps to understand what is happening under the hood. This chapter explains the technology accessibly but precisely: how raw text becomes natural-sounding speech, the major architectural families (acoustic-model-plus-vocoder, end-to-end, autoregressive vs. non-autoregressive, diffusion/flow-matching, and the newer LLM/codec systems), what actually makes a voice "sound real," how voice cloning works, and the practical engineering knobs — streaming vs. batch, latency, real-time factor, CPU vs. GPU. It closes with how naturalness is measured (MOS and friends), so that when later chapters throw scores around, you know what they mean and how much to trust them.

Throughout, the technology is mapped back to the reader's two hard requirements (natural English and natural Czech) and to the free or near-free constraint. Far less training data exists for Czech than for English, so some architectural choices matter more for Czech quality than for English quality. Where that is the case, it is called out.

Content

1. The TTS pipeline in one picture

Almost every neural TTS system, no matter how modern, does some version of three jobs:

  Raw text  ──►  [ FRONTEND ]  ──►  [ ACOUSTIC MODEL ]  ──►  [ VOCODER ]  ──►  Audio waveform
 "Dr. Nový    text normalization     phonemes/text →         spectrogram →     16/22/24/44 kHz
  má 3 psy."   + grapheme→phoneme     mel-spectrogram         waveform          .wav / PCM

In two-stage (cascaded) systems these are separate models you can swap. In end-to-end systems the middle and right boxes are fused into one network. In the newest LLM/codec systems the spectrogram is replaced by discrete audio tokens that a language model predicts. We will walk through each box, then each architecture family.
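As a rough illustration of why the stages of a cascaded system can be swapped, the three boxes can be modeled as plain functions that hand data to one another. This is only a sketch of the data flow: every name here (`CascadedTTS`, the `toy_*` functions) and every number (5 frames per symbol, 80 mel bins, a hop of 256 samples per frame) is a hypothetical stand-in, not the API or configuration of any real engine.

```python
from dataclasses import dataclass
from typing import Callable, List

Phonemes = List[str]
MelSpectrogram = List[List[float]]  # one list of mel-bin values per frame
Waveform = List[float]


@dataclass
class CascadedTTS:
    """Two-stage pipeline: each stage is a separate, swappable callable."""
    frontend: Callable[[str], Phonemes]
    acoustic_model: Callable[[Phonemes], MelSpectrogram]
    vocoder: Callable[[MelSpectrogram], Waveform]

    def synthesize(self, text: str) -> Waveform:
        phonemes = self.frontend(text)
        mel = self.acoustic_model(phonemes)
        return self.vocoder(mel)


# Hypothetical constants, chosen only so the shapes are easy to follow.
FRAMES_PER_SYMBOL = 5
MEL_BINS = 80
HOP_LENGTH = 256  # audio samples produced per spectrogram frame


def toy_frontend(text: str) -> Phonemes:
    # Stand-in for normalization + G2P: one "phoneme" per non-space character.
    return [ch for ch in text.lower() if not ch.isspace()]


def toy_acoustic_model(phonemes: Phonemes) -> MelSpectrogram:
    # Stand-in for a neural acoustic model: silent frames of the right shape.
    n_frames = len(phonemes) * FRAMES_PER_SYMBOL
    return [[0.0] * MEL_BINS for _ in range(n_frames)]


def toy_vocoder(mel: MelSpectrogram) -> Waveform:
    # Stand-in for a neural vocoder: HOP_LENGTH samples per frame.
    return [0.0] * (len(mel) * HOP_LENGTH)


tts = CascadedTTS(toy_frontend, toy_acoustic_model, toy_vocoder)
audio = tts.synthesize("má tři psy")
print(len(audio), "samples")
```

Replacing `toy_vocoder` with another callable of the same shape changes nothing else in the pipeline, which is the practical meaning of "separate models you can swap". An end-to-end system would collapse `acoustic_model` and `vocoder` into a single callable.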

2. The frontend: text normalization and grapheme-to-phoneme

This is the most under-appreciated part of TTS, and it is disproportionately important for Czech.

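Text normalization (TN) turns "non-standard words" (abbreviations, numbers, times, percentages, currency amounts) into the words that are actually spoken. Grapheme-to-phoneme (G2P) conversion then maps those words to phonemes, which can be inspected directly with espeak-ng: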

# See exactly what phonemes an engine will receive for Czech vs. English
espeak-ng -v cs -q --ipa "Doktor Nový má tři psy."
#  →  dˈoktor nˈoviː mˈaː tr̝ɪ psˈɪ
espeak-ng -v en-us -q --ipa "Doctor Novy has three dogs."
#  →  dˈɑːktɚ nˈoʊvi hɐz θɹˈiː dˈɑːɡz
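The same check can be scripted, which is handy when comparing many sentences. The sketch below simply shells out to espeak-ng with the flags used above (`-v`, `-q`, `--ipa`); the function names `espeak_ipa_command` and `phonemize` are hypothetical helpers, and the code assumes espeak-ng is installed and on the PATH.

```python
import shutil
import subprocess
from typing import List


def espeak_ipa_command(text: str, voice: str) -> List[str]:
    """Build the argument list for the command-line call shown above."""
    return ["espeak-ng", "-v", voice, "-q", "--ipa", text]


def phonemize(text: str, voice: str) -> str:
    """Return the IPA string espeak-ng prints for `text` in the given voice."""
    if shutil.which("espeak-ng") is None:
        raise RuntimeError("espeak-ng is not installed or not on PATH")
    result = subprocess.run(
        espeak_ipa_command(text, voice),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()
```

Calling `phonemize("Doktor Nový má tři psy.", "cs")` and `phonemize("Doctor Novy has three dogs.", "en-us")` should print the same kind of IPA strings as the two commands above, though the exact symbols can vary between espeak-ng versions.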

Streaming, latency, and real-time factor at a glance:

Concept	What it controls	Target
RTF (real-time factor)	Throughput, server cost	< 1 (ideally < 0.3 on GPU)
TTFB (time to first byte) / latency	Perceived responsiveness	< 300 ms for an "instant" feel
Streaming	Whether playback starts early	Supported, for a much better experience on long text
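Both numeric rows of the table can be measured with a stopwatch around the synthesis call. The sketch below assumes the usual definition of RTF (time spent synthesizing divided by the duration of the audio produced) and treats latency as the time until a streaming engine yields its first audio chunk. The `synthesize` and `stream` arguments are hypothetical placeholders for whatever engine is being tested.

```python
import time
from typing import Callable, Iterable, List


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(
    synthesize: Callable[[str], List[float]], text: str, sample_rate: int
) -> float:
    """Time one batch synthesis call and convert the result to an RTF."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


def time_to_first_chunk(stream: Iterable[bytes]) -> float:
    """Seconds until a (lazy) streaming engine yields its first audio chunk."""
    start = time.perf_counter()
    for _ in stream:
        return time.perf_counter() - start
    raise ValueError("stream produced no audio")


# Worked example: 3 s of compute for 10 s of audio gives an RTF of 0.3.
print(real_time_factor(3.0, 10.0))
```

An RTF below 1 means the engine produces audio faster than it plays back. For `time_to_first_chunk` to be meaningful, `stream` must be a lazy generator, so that the work of producing the first chunk happens inside the timed loop.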

Coqui. XTTS v2 model card. Hugging Face. 2023–2024. https://huggingface.co/coqui/XTTS-v2 — Multilingual (incl. Czech) GPT-style autoregressive voice-cloning model; license (CPML, non-commercial) and 6-second cloning claim.
Chen, Y. et al. F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. arXiv:2410.06885. 2024. https://arxiv.org/abs/2410.06885 — Flow-matching + Diffusion Transformer, no duration model/alignment, Sway Sampling; trained on English+Chinese (~100K h).
Li, Y. et al. StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. arXiv:2306.07691. 2023. https://arxiv.org/abs/2306.07691 — Style diffusion + SLM adversarial training; the architecture Kokoro builds on.
hexgrad. Kokoro-82M model card. Hugging Face. 2025. https://huggingface.co/hexgrad/Kokoro-82M — 82M-param Apache-2.0 TTS built on StyleTTS2; lightweight, CPU-friendly, multilingual expansion.
Kim, J., Kong, J., Son, J. Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (VITS). arXiv:2106.06103. 2021. https://arxiv.org/abs/2106.06103 — End-to-end VAE+flow+GAN with Monotonic Alignment Search and stochastic duration predictor; backbone of many free/local Czech voices.
Canopy Labs. Orpheus TTS (GitHub). 2025. https://github.com/canopyai/Orpheus-TTS — Llama-3B-based LLM/codec TTS using SNAC tokens; emotion/non-verbal cues, zero-shot cloning, low-latency streaming claims.
Défossez, A. et al. (Meta AI). High Fidelity Neural Audio Compression (EnCodec). arXiv:2210.13438. 2022. https://arxiv.org/abs/2210.13438 — Neural audio codec with residual vector quantization; the token representation enabling LLM-based TTS.
Zeghidour, N. et al. (Google). SoundStream: An End-to-End Neural Audio Codec. arXiv:2107.03312. 2021. https://arxiv.org/abs/2107.03312 — Pioneering RVQ neural codec underpinning discrete-token speech generation.
Saeki, T. et al. UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022. arXiv:2204.02152. 2022. https://arxiv.org/abs/2204.02152 — Automatic MOS prediction; how naturalness can be scored without human listeners (and its English bias).
Kong, J., Kim, J., Bae, J. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. arXiv:2010.05646. 2020. https://arxiv.org/abs/2010.05646 — The fast, high-quality GAN vocoder used by most modern open TTS systems.
espeak-ng contributors. espeak-ng (GitHub). 2024–2025. https://github.com/espeak-ng/espeak-ng — Rule-based grapheme-to-phoneme for 100+ languages incl. Czech and English; the default phonemizer behind StyleTTS2/Kokoro/VITS.
Ren, Y. et al. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. arXiv:2006.04558. 2020. https://arxiv.org/abs/2006.04558 — Canonical non-autoregressive acoustic model with explicit duration prediction.
Kim, J. et al. Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search. arXiv:2005.11129. 2020. https://arxiv.org/abs/2005.11129 — Introduces Monotonic Alignment Search, later reused by VITS for internal text↔audio alignment.
TTS Arena (Hugging Face). TTS Arena leaderboard. 2025–2026. https://huggingface.co/spaces/TTS-AGI/TTS-Arena — Blind crowd A/B Elo ranking of TTS systems; the leaderboard where Kokoro-82M ranked #1 in Jan 2026.
Codersera. Install Orpheus 3B TTS on macOS: Complete Guide. 2026. https://codersera.com/blog/install-and-run-orpheus-3b-tts-on-macos-a-complete-guide/ — Practical Orpheus details: Llama-3.2-3B base, SNAC 24 kHz codec, ~150 tokens/sec, GPU/VRAM needs.

Examples of text normalization:

Written	English	Czech
`3 psy`	n/a	"tři psy" (Czech grammar: the numeral governs the noun's case; with 5 or more it would be "pět psů")
`Dr.`	"Doctor"	"doktor"
`2026`	"twenty twenty-six"	"dva tisíce dvacet šest"
`12:30`	"twelve thirty"	"dvanáct třicet"
`15 %`	"fifteen percent"	"patnáct procent"
`€5`	"five euros"	"pět eur"
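A few of the rows above can be reproduced with a very small rule-based normalizer. This is a minimal sketch, not how any particular engine does it: the lexicons are hand-written, cover only the table's examples, and every name (`ABBREVIATIONS`, `NUMBERS`, `PERCENT`, `normalize`) is hypothetical.

```python
import re

# Hypothetical mini-lexicons covering only a few rows of the table above.
ABBREVIATIONS = {
    "en": {"Dr.": "Doctor"},
    "cs": {"Dr.": "doktor"},
}
NUMBERS = {
    "en": {3: "three", 5: "five", 15: "fifteen"},
    "cs": {3: "tři", 5: "pět", 15: "patnáct"},
}
# Simplification: "procent" is the form used with 5 and above; a real Czech
# normalizer must inflect it ("tři procenta") and the numeral itself.
PERCENT = {"en": "percent", "cs": "procent"}


def normalize(text: str, lang: str) -> str:
    """Expand abbreviations, percent signs, and known integers into words."""
    for written, spoken in ABBREVIATIONS[lang].items():
        text = text.replace(written, spoken)
    # "15 %" or "15%" -> "15 percent" / "15 procent"
    text = re.sub(r"(\d+)\s*%", lambda m: f"{m.group(1)} {PERCENT[lang]}", text)

    def spell(match: "re.Match[str]") -> str:
        # Unknown numbers are left as digits rather than guessed.
        return NUMBERS[lang].get(int(match.group()), match.group())

    return re.sub(r"\d+", spell, text)


print(normalize("Dr. Nový má 3 psy.", "cs"))
print(normalize("15 %", "en"))
```

Even this toy version shows why Czech is harder: the English rules are mostly lookups, while correct Czech output needs the numeral and the following noun to agree in case and number, which a flat dictionary cannot express.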

3. The two classic boxes: acoustic model + vocoder

Acoustic model

Vocoder

4. End-to-end systems (fuse the boxes)

5. Autoregressive vs. non-autoregressive

6. Diffusion and flow-matching TTS

7. LLM-based / codec TTS (the 2024–2026 frontier)

8. What makes a voice "sound real"

9. Voice cloning: zero-shot vs. fine-tuned

10. Streaming vs. batch, latency, and real-time factor

11. CPU vs. GPU inference

12. How naturalness is measured (MOS and friends)

Key Takeaways

Key References