Text-to-Speech for Reading Your Texts — Natural Voices in English & Czech (2026)

Chapter 01: Introduction & How to Choose a TTS

Chapter 1 of 17 · ~17 min read

Overview

This chapter frames the entire report. You want to add a text-to-speech (TTS) feature to your own software so it can read your texts aloud — and it has to sound genuinely natural in both English and Czech, ideally while staying free or near-free. That combination is harder than it sounds. Most of the headline-grabbing 2025–2026 TTS models are English-first; many sound stunning in English and robotic (or simply unsupported) in Czech. So Czech quality is treated throughout this report as a hard filter, graded explicitly for every option rather than assumed.

Here we define the problem precisely, introduce the four decision axes you should weigh (naturalness, cost, hosting/privacy, language coverage), sketch a quick map of the 2026 landscape, preview the recommendation categories the rest of the report builds toward, explain how to read the report, and give you a working glossary of TTS jargon (MOS, prosody, zero-shot voice cloning, streaming, SSML, phonemes, vocoder). It closes with a one-screen orientation table.

Content

The problem, stated precisely

You are not buying a consumer "read-aloud" app. You are an engineer adding a TTS capability to software you control. That changes everything about how to choose. The real requirements are:

It reads your text — arbitrary text your app produces, not a fixed script. So you need an API, a library, or a self-hostable engine you can call programmatically.
It must sound real — listeners should not immediately think "that's a robot." For long-form reading especially, prosody (rhythm, stress, intonation) matters more than raw audio fidelity.
English AND Czech, both natural — this is the binding constraint. English is the easy case; almost everything does English well. Czech is the differentiator. Czech is a West Slavic language with rich inflection, palatalization, the notorious ř, vowel-length distinctions, and front-loaded word stress. Many engines either don't ship a Czech voice at all, or ship one that mangles these. Whenever a model's Czech support is weak, unverified, or community-reported rather than vendor-confirmed, this report says so out loud.

A worked volume estimate (do this first)

| Your usage | Words/month | ≈ Characters/month | Fits which free tier? |
|---|---|---|---|
| Personal app, light reading | 5,000 | ~30K | Azure and Google premium; exceeds ElevenLabs free (~10K) |
| Small product, moderate | 50,000 | ~300K | Azure (500K), Google premium (1M) |
| Active product | 150,000 | ~1M | Google premium (right at the ~1M limit); Azure overflows |
| Heavy / long-form library | 1,000,000+ | ~6M+ | No free tier: go self-hosted or pay |
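The arithmetic behind these rows can be reproduced with a short sketch. It assumes roughly 6 characters per word (the ratio implied by 5,000 words ≈ 30K characters) and uses the approximate free-tier sizes quoted in this chapter, which are not vendor-guaranteed limits. The function and constant names are illustrative.

```python
# Rough monthly-volume estimate. Every constant here is an assumption taken
# from this chapter's approximate figures, not a vendor-guaranteed limit.
CHARS_PER_WORD = 6  # implied by the table: 5,000 words ~ 30K characters

FREE_TIER_CHARS = {
    "ElevenLabs free": 10_000,
    "Azure Neural": 500_000,
    "Google premium": 1_000_000,
}


def estimate_chars(words_per_month: int) -> int:
    """Convert a monthly word count into an approximate character count."""
    return words_per_month * CHARS_PER_WORD


def fitting_free_tiers(words_per_month: int) -> list[str]:
    """Return the free tiers whose monthly allowance covers the estimate."""
    chars = estimate_chars(words_per_month)
    return [name for name, limit in FREE_TIER_CHARS.items() if chars <= limit]


for words in (5_000, 50_000, 150_000, 1_000_000):
    print(words, estimate_chars(words), fitting_free_tiers(words))
```

With these figures, even a 30K-character month exceeds the ~10K ElevenLabs allowance, so measure your real text before relying on any single tier.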

| Shape | Where it runs | Privacy | Ops burden | Marginal cost |
|---|---|---|---|---|
| Fully local / offline | On the user's device or your own machine | Best: text never leaves the device | Per-device install, but no server | $0 |
| Self-hosted server | Your Docker/VM/GPU box, behind your own API | Good: you control the data | You run & scale it | $0 + infra |
| Cloud API | Vendor's servers (Google/Azure/ElevenLabs…) | Text sent to a third party | Lowest: they operate it | Per-use (often free tier) |
| Browser (Web Speech API) | The visitor's browser, using OS voices | Usually on-device (some browser voices are network-backed); quality is outside your control | Almost none | $0 |

A 60-second decision flow

START
 │
 ├─ Is the text privacy-sensitive (must not leave your control)?
 │     YES ──► Local/self-hosted ONLY.  GPU available? ──► XTTS-v2.
 │            └─ No GPU / CPU only?     ──► Piper (cs_CZ-jirka).
 │     NO ──► continue
 │
 ├─ Is your monthly volume comfortably under a cloud free tier
 │   (≈ <1M premium chars/mo)?
 │     YES ──► Premium cloud free tier.
 │            └─ Want the most natural Czech?      ──► Google Chirp 3 HD.
 │            └─ Need strong SSML control?         ──► Azure Neural.
 │     NO ──► continue
 │
 ├─ High volume but want top quality and can pay a little?
 │     YES ──► ElevenLabs (best Czech) or pay-as-you-go Google/Azure.
 │     NO  ──► Self-host (XTTS-v2 on GPU, or Piper on CPU) for $0/char.
 │
 └─ Need a zero-infra universal fallback that works in any browser?
        ──► Web Speech API (accept inconsistent Czech).
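The flow above can also be written as a small function, which makes the branch order explicit. This is a sketch under stated assumptions: the function name, its parameters, and the 1M-character threshold constant are illustrative, and treating the browser fallback as a flag checked right after the privacy question is one reading of the diagram, not the only one.

```python
# Assumption: "comfortably under a cloud free tier" is modelled as a single
# threshold, taken from the flow's "< 1M premium chars/mo" hint.
FREE_TIER_LIMIT = 1_000_000


def choose_tts(privacy_sensitive: bool, has_gpu: bool, monthly_chars: int,
               can_pay: bool, need_ssml: bool = False,
               zero_infra_fallback: bool = False) -> str:
    """Walk the decision flow top to bottom and return a suggestion."""
    # Branch 1: sensitive text must stay local or self-hosted.
    if privacy_sensitive:
        return "XTTS-v2" if has_gpu else "Piper (cs_CZ-jirka)"
    # Last branch of the diagram, modelled here as an explicit opt-in flag.
    if zero_infra_fallback:
        return "Web Speech API"
    # Branch 2: volume fits a premium cloud free tier.
    if monthly_chars < FREE_TIER_LIMIT:
        return "Azure Neural" if need_ssml else "Google Chirp 3 HD"
    # Branch 3: high volume, with or without a budget.
    if can_pay:
        return "ElevenLabs or pay-as-you-go Google/Azure"
    return "XTTS-v2" if has_gpu else "Piper (cs_CZ-jirka)"


print(choose_tts(privacy_sensitive=False, has_gpu=False,
                 monthly_chars=300_000, can_pay=False))
```

Keeping the rules in one function makes it easy to revisit the choice when your volume or privacy requirements change.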

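What "calling" each shape looks like (orientation snippets)

Browser (Web Speech API):

// getVoices() may be empty until the browser fires "voiceschanged"
const u = new SpeechSynthesisUtterance("Ahoj, toto je test.");
u.lang = "cs-CZ";
const czech = speechSynthesis.getVoices().find(v => v.lang === "cs-CZ");
if (czech) u.voice = czech;   // otherwise the browser picks a voice from u.lang
speechSynthesis.speak(u);     // quality = whatever the OS provides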

Fully local (Piper CLI):

echo "Dobrý den, jak se máte?" | \
  piper --model cs_CZ-jirka-medium.onnx --output_file out.wav
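The same Piper call can be driven from application code. The sketch below wraps the command line shown above with Python's `subprocess` module. It assumes the `piper` executable is on your PATH and the model file exists locally; the helper names are hypothetical, and only the `--model` and `--output_file` flags from the shell example are used.

```python
import subprocess
from pathlib import Path


def build_piper_command(model: str, output_file: str) -> list[str]:
    """Build the argument list equivalent to the shell example."""
    return ["piper", "--model", model, "--output_file", output_file]


def synthesize_with_piper(text: str,
                          model: str = "cs_CZ-jirka-medium.onnx",
                          output_file: str = "out.wav") -> Path:
    """Pipe text to Piper on stdin and return the path of the WAV file.

    Raises CalledProcessError if Piper exits with a non-zero status, and
    FileNotFoundError if the executable is not installed.
    """
    subprocess.run(
        build_piper_command(model, output_file),
        input=text.encode("utf-8"),  # Czech diacritics need explicit UTF-8
        check=True,
    )
    return Path(output_file)
```

Calling `synthesize_with_piper("Dobrý den, jak se máte?")` should then produce `out.wav`. Passing the arguments as a list avoids shell quoting problems with arbitrary user text.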

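Self-hosted server:

# your app -> your TTS container; text never leaves your infra
# (the endpoint path and JSON fields depend on the server image you run)
import requests
resp = requests.post("http://localhost:8020/tts",
    json={"text": "Ahoj světe", "language": "cs"}, timeout=60)
resp.raise_for_status()  # fail loudly instead of saving an error page as audio
with open("out.wav", "wb") as f:
    f.write(resp.content)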

Cloud API:

# pseudocode shape: authenticate -> send text + voice -> get audio bytes
# (the voice name is illustrative; use a full voice ID from the vendor's list)
audio = tts_client.synthesize(text="Ahoj světe",
                              language="cs-CZ", voice="cs-CZ-Chirp3-HD")

Glossary of TTS terms

MOS (Mean Opinion Score) — the standard subjective quality metric: human listeners rate audio naturalness on a 1–5 scale, and the mean of those ratings is the MOS. ~4.0+ is "good," ~4.5+ approaches human-level in the test conditions. Crucial caveat: MOS is dataset- and listener-dependent and usually measured in English, so a high English MOS says nothing about Czech.
Prosody — the rhythm, stress, and intonation of speech (the "melody"). The single biggest driver of whether TTS sounds human vs. robotic, especially over long passages.
Phoneme — the smallest unit of sound in a language. TTS often converts text → phonemes (via a grapheme-to-phoneme / G2P step) before generating audio. Czech G2P quality (especially ř, palatalized consonants, vowel length) is a key reason some engines sound off in Czech.
SSML (Speech Synthesis Markup Language) — an XML-based markup you wrap around text to control pronunciation, pauses, emphasis, rate, pitch, and to spell out numbers/dates/abbreviations. Supported well by Azure and Google; partially or not at all by lightweight local engines. Your escape hatch for fixing edge-case pronunciations.
Zero-shot voice cloning — generating a new speaker's voice from just a few seconds of reference audio, without retraining the model. XTTS-v2 and ElevenLabs offer this. Powerful, but raises consent/legal and licensing questions.
Streaming — emitting audio incrementally as it's generated, rather than waiting for the whole clip. Lowers perceived latency (you hear the first words in <200 ms in good setups), essential for interactive read-aloud.
Vocoder — the component that turns an intermediate acoustic representation (e.g. a mel-spectrogram) into an actual audio waveform. Modern neural vocoders (and end-to-end models that fold this step in, like VITS) are a big reason 2020s TTS sounds far better than 2010s TTS.
Real-time factor (RTF) — synthesis time ÷ audio duration. RTF < 1 means faster than real time (good for live reading); RTF > 1 means you wait longer than the clip lasts (batch-only). XTTS-v2 reaches RTF ≈ 0.1–0.2 on a strong GPU but runs several times slower than real time on CPU.
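The two numeric terms in this glossary, MOS and RTF, are simple to compute once you have the raw measurements. The sketch below uses only the Python standard library; the function names are illustrative, and the WAV helper assumes an uncompressed PCM file that the `wave` module can read.

```python
import wave
from statistics import mean


def mean_opinion_score(ratings: list[int]) -> float:
    """Average listener ratings given on the 1-5 naturalness scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return float(mean(ratings))


def wav_duration_seconds(path: str) -> float:
    """Length of a PCM WAV file: frame count divided by sample rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = synthesis time / audio duration; below 1 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


print(mean_opinion_score([4, 5, 4, 5]))
print(real_time_factor(2.0, 10.0))
```

To measure RTF for a real engine, time the synthesis call with `time.perf_counter()` and pass the elapsed seconds together with `wav_duration_seconds("out.wav")`.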

Orientation table

| Option | EN quality | CZ quality (graded) | Cost | Hosting | Best for |
|---|---|---|---|---|---|
| Piper | Good | Decent (`cs_CZ-jirka`), clearly below cloud | $0 (local) | Local / self-host, CPU-friendly | Free, private, runs anywhere, including a Raspberry Pi |
| XTTS-v2 | Excellent | Best self-hosted Czech; voice-clone capable | $0 marginal (needs GPU) | Self-host (GPU) | Best free and private Czech if you have a GPU; the CPML license is non-commercial |
| Kokoro | Excellent | No official Czech (English-first) | $0 (local) | Local | English-only read-aloud; not for Czech |
| Google Chirp 3 HD | Excellent | Strong, vendor-confirmed cs-CZ | Free ~1M premium chars/mo, then ~$30/1M | Cloud API | Premium Czech with a usable free tier |
| Azure Neural | Excellent | Good (`Vlasta`/`Antonin`), full SSML | Free ~500K chars/mo, then ~$15/1M | Cloud API | Cloud Czech needing SSML control |
| ElevenLabs | Best-in-class | Best Czech naturalness, premium | Free ~10K chars/mo (non-commercial), paid from ~$5/mo | Cloud API | Top quality when budget allows |
| Web Speech API | OK–good (OS-dependent) | Inconsistent, OS-dependent, dated | $0 | Browser | Zero-infra universal fallback |
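To see what the "free tier, then per-million" pricing in the table means for a given month, a rough overage calculation helps. The sketch below hard-codes the approximate figures from the table for the two pay-per-character cloud options; they are this chapter's estimates, not current vendor prices, and the names are illustrative.

```python
# (free characters per month, approximate USD per 1M characters beyond that)
# Assumption: figures copied from the orientation table, rounded as given.
PRICING = {
    "Google Chirp 3 HD": (1_000_000, 30.0),
    "Azure Neural": (500_000, 15.0),
}


def monthly_cost_usd(option: str, chars: int) -> float:
    """Cost of one month's synthesis after subtracting the free allowance."""
    free_chars, usd_per_million = PRICING[option]
    billable = max(0, chars - free_chars)  # nothing is billed inside the free tier
    return billable / 1_000_000 * usd_per_million


for option in PRICING:
    print(option, monthly_cost_usd(option, 6_000_000))
```

Under these assumed prices, a 6M-character month costs about $150 on Google and about $82.50 on Azure, which is the point where self-hosting starts to look attractive.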
