Text-to-Speech for Reading Your Texts — Natural Voices in English & Czech (2026)


Chapter 09: Quality Benchmarks & Honest Naturalness Matrix

Chapter 10 of 17 · ~20 min read

Overview

By the time you reach this chapter you have met a lot of TTS engines, and almost every one of them claims to be "the most natural" or "human-like" or "indistinguishable from a real voice." They cannot all be right. This chapter is about how to cut through that noise: what TTS quality actually means, how it is measured (MOS, CMOS, MUSHRA, WER), which public leaderboards are worth reading (TTS Arena on Hugging Face, Artificial Analysis), and — crucially — why almost every number you will see is biased toward English and toward the vendor publishing it.

The reader of this report has a hard, non-negotiable filter: the chosen engine must sound natural in both English and Czech. That filter is exactly where public benchmarks fail you, because nearly all of them are English-centric and Czech is rarely scored at all. So this chapter does three things: (1) teaches you to read benchmarks critically, (2) builds an honest quality matrix for the top contenders with explicit EN vs CZ grading and confidence levels, and (3) gives you a concrete, copy-pastable recipe to run your own blind A/B test on your own Czech and English texts — because in the end that is the only benchmark that matters for your project.

Content

Why "naturalness" is hard to measure

TTS quality is not one thing. A voice can be perfectly intelligible (you understand every word) yet sound robotic; or it can be warm and expressive yet mispronounce names; or it can nail a single sentence in a demo and fall apart on a 2,000-word article with numbers, abbreviations, and foreign words. The quality dimensions that matter for reading your texts aloud are roughly:

Naturalness / human-likeness — does it sound like a person, with believable prosody and rhythm?
Intelligibility — are the words clearly recoverable, even hard ones?
Pronunciation correctness — names, numbers, dates, acronyms, units, foreign loanwords.
Prosody & expressiveness — sentence melody, emphasis, pauses, question intonation.
Stability / robustness — does it stay good across long inputs, or hallucinate / skip / repeat?
Speaker consistency — same voice from paragraph to paragraph.

The core metrics: MOS, CMOS, MUSHRA, WER

WER = (Substitutions + Insertions + Deletions) / Total reference words
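
To make the formula concrete, here is a minimal sketch that computes WER with a word-level edit distance. The function name `word_error_rate` and the two sample sentences are illustrative assumptions, not part of any library; a real evaluation would normally use a maintained package plus a text normalizer.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + I + D) / N, computed with a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev is the previous row of the edit-distance table: prev[j] is the cost of
    # aligning the reference words seen so far with the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

ref = "the train leaves at nine fifteen"
hyp = "the train leaves nine fifty"
print(f"WER = {word_error_rate(ref, hyp):.3f}")   # one deletion + one substitution over 6 words
```

Because insertions count as errors, an output that repeats or hallucinates words can push WER above 100%.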

Metric	Type	Scale	Measures	Main weakness
MOS	Subjective	1–5	Absolute naturalness	Pool-dependent, not cross-study comparable
CMOS	Subjective	−3…+3	A-vs-B preference	Relative only, still needs humans
MUSHRA	Subjective	0–100	Many systems at once	Needs trained listeners, careful setup
WER	Objective	0–100%+ (lower is better; can exceed 100%)	Intelligibility	Inherits ASR bias; ignores prosody
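
The "pool-dependent" weakness of MOS is easier to see with numbers. The sketch below averages 1–5 ratings and attaches a 95% interval using a normal approximation (an assumption; with small pools a t-based interval would be slightly wider). The ratings and the helper name `mos_with_ci` are made up for illustration.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for 1-5 ratings, normal-approximation 95% interval."""
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

# Hypothetical ratings from ten listeners per engine (made-up numbers)
engine_x = [4, 5, 4, 4, 3, 5, 4, 4, 5, 4]
engine_y = [4, 4, 3, 4, 4, 5, 3, 4, 4, 4]

for name, ratings in (("X", engine_x), ("Y", engine_y)):
    mean, hw = mos_with_ci(ratings)
    print(f"engine {name}: MOS {mean:.2f} ± {hw:.2f}")
```

With these ten-listener pools the two intervals overlap, so the 0.3-point gap on its own does not show that X is better. Treat a vendor MOS quoted without a listener count and interval with the same caution.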

The honest naturalness matrix: EN vs CZ

Engine	Type / cost	EN quality	EN conf.	CZ supported?	CZ quality	CZ conf.	Notes
ElevenLabs (v2/v3)	Cloud, paid (free tier)	A	High	Yes (70+ langs incl. Czech)	B (best-in-class commercial CZ candidate)	Med	Top of English arenas; Czech is officially supported and generally the strongest commercial Czech voice, but verify on your text — quality varies by voice.
Google Cloud TTS (Chirp 3 HD / Neural2)	Cloud, paid (free tier)	A− / B+	High	Yes (cs-CZ voices)	B	Med	Mature Czech locale; Chirp 3 HD is notably more natural than older WaveNet/Neural2 Czech voices. Reliable, well-documented.
Microsoft Azure Neural TTS	Cloud, paid (free tier)	A− / B+	High	Yes (cs-CZ neural voices)	B	Med	Long-standing, solid Czech neural voices; strong pronunciation engine, good for long-form.
MiniMax Speech-02	Cloud, paid	A	Med–High	Claims 30+ langs; Czech inclusion not explicitly confirmed	C–B (unverified for CZ)	Low	Excellent in English arenas; multilingual claims are vendor-stated on a private benchmark. Czech specifically must be tested before trusting.
OpenAI TTS (gpt-4o-mini-tts etc.)	Cloud, paid	A−/B+	High	Multilingual, follows input language	C–B (CZ uneven)	Low	Pleasant English; Czech can work but accent/pronunciation is inconsistent and not officially a per-locale product. Test heavily.
Coqui XTTS v2	Open / self-host, free	B+	High	Yes — Czech (cs) is an official supported language (17 total)	B− / C+	Med	The strongest free, self-hostable, explicitly-Czech option. Zero-shot voice cloning. Czech is real but a notch below top commercial; original Coqui is defunct, community fork (idiap/coqui-ai-TTS) is maintained.
Kokoro-82M	Open / self-host, free	B+ (great for size)	High	No official Czech	n/a	High	Tiny (82M), fast, well-liked in English arenas. Not a Czech option — do not pick it if Czech is required.
Piper (rhasspy)	Open / self-host, free	B− / C	High	Yes (Czech voice models exist)	C	Med	Fast, runs on a Raspberry Pi. Czech voices exist and are intelligible but clearly synthetic — fine for utility, not for "real-sounding" reading.
Cartesia (Sonic)	Cloud, paid	A	Med–High	Limited multilingual; Czech not a headline locale	C (unverified)	Low	Excellent low-latency English; Czech coverage is weak/unclear — verify before considering.

Run your own blind A/B test (the only benchmark that matters)

import glob, json, os, random, shutil
from collections import Counter

# Layout: samples/<engine>/<id>.wav -> blind A/B over the SAME text id
engines = sorted(d for d in os.listdir("samples") if os.path.isdir(f"samples/{d}"))
ids = sorted({os.path.splitext(os.path.basename(p))[0]
              for p in glob.glob("samples/*/*.wav")})
os.makedirs("blind", exist_ok=True)

votes, key = [], {}
for tid in ids:
    # only engines that actually rendered this text can be compared
    have = [e for e in engines if os.path.exists(f"samples/{e}/{tid}.wav")]
    if len(have) < 2:
        continue
    a, b = random.sample(have, 2)                 # two random engines, same text
    for label, eng in ((f"{tid}-A", a), (f"{tid}-B", b)):
        key[label] = eng                          # hidden mapping: saved, never printed
        shutil.copyfile(f"samples/{eng}/{tid}.wav", f"blind/{label}.wav")
    print(f"\nText {tid}")
    print(f"  Play A: blind/{tid}-A.wav")         # neutral paths, so the engine name stays hidden
    print(f"  Play B: blind/{tid}-B.wav")
    pick = ""
    while pick not in ("a", "b", "tie"):          # re-ask on typos instead of counting them as ties
        pick = input("  Which sounds more natural? [a/b/tie]: ").strip().lower()
    votes.append({"text": tid, "winner": pick, "A": a, "B": b})

# tally: a win is 1 point, a tie gives each side half a point
score = Counter()
for v in votes:
    if v["winner"] == "a":
        score[v["A"]] += 1
    elif v["winner"] == "b":
        score[v["B"]] += 1
    else:
        score[v["A"]] += 0.5
        score[v["B"]] += 0.5
print("\nBlind A/B results:", score.most_common())
with open("results.json", "w") as f:
    json.dump({"votes": votes, "key": key}, f, indent=2)
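
A raw win count from a handful of texts can easily be noise. One common way to check is an exact sign test on the head-to-head wins between two engines, with ties dropped. The sketch below is a minimal version using only the standard library; the helper names and the example votes are hypothetical, and the vote dictionaries simply mirror the shape stored under "votes" in results.json.

```python
from math import comb

def sign_test_p(wins_a: int, wins_b: int) -> float:
    """Two-sided exact binomial p-value for 'both engines are equally preferred'."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def head_to_head(votes, x, y):
    """Count wins for engines x and y over the votes where they met; ties are ignored."""
    wins = {x: 0, y: 0}
    for v in votes:
        if {v["A"], v["B"]} == {x, y} and v["winner"] in ("a", "b"):
            wins[v["A"] if v["winner"] == "a" else v["B"]] += 1
    return wins[x], wins[y]

# Hypothetical votes: engine_x preferred on 9 of 10 Czech texts
votes = [{"text": f"cs{i:02d}", "winner": "a", "A": "engine_x", "B": "engine_y"}
         for i in range(9)]
votes.append({"text": "cs09", "winner": "b", "A": "engine_x", "B": "engine_y"})

wx, wy = head_to_head(votes, "engine_x", "engine_y")
print(f"engine_x {wx} : {wy} engine_y, p = {sign_test_p(wx, wy):.4f}")
```

A 9:1 split over ten texts is unlikely to be chance, while a 6:4 split is entirely compatible with two equally good engines, so collect more texts before declaring a winner on a narrow margin.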

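# Objective sanity check: transcribe every sample with ASR and score WER against its reference text
# pip install faster-whisper jiwer
import glob, os, re
from faster_whisper import WhisperModel
from jiwer import wer

def norm(text):
    # lowercase and drop punctuation so WER counts word errors, not commas
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())

asr = WhisperModel("large-v3")   # multilingual; decent Czech
with open("refs.tsv", encoding="utf-8") as f:               # id<TAB>text, one per line
    refs = dict(l.rstrip("\n").split("\t", 1) for l in f if "\t" in l)

for wav in sorted(glob.glob("samples/*/*.wav")):
    tid = os.path.splitext(os.path.basename(wav))[0]
    if tid not in refs:
        continue                                            # no reference text for this id
    lang = "cs" if tid.startswith("cs") else "en"           # ids starting with "cs" are Czech, all others English
    segs, _ = asr.transcribe(wav, language=lang)
    hyp = " ".join(s.text.strip() for s in segs)
    print(f"{wav}\tWER={wer(norm(refs[tid]), norm(hyp)):.3f}")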
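
Per-file WER lines are hard to compare by eye, so it helps to roll them up per engine and per language. The sketch below assumes you have collected the results as (path, WER) pairs following the samples/<engine>/<id>.wav layout; the `summarize` helper, the engine names, and the numbers are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

def summarize(rows):
    """Average WER per (engine, language) from (wav_path, wer) pairs."""
    buckets = defaultdict(list)
    for path, value in rows:
        parts = path.replace("\\", "/").split("/")   # samples/<engine>/<id>.wav
        engine, tid = parts[-2], parts[-1]
        lang = "cs" if tid.startswith("cs") else "en"
        buckets[(engine, lang)].append(value)
    return {k: mean(v) for k, v in buckets.items()}

rows = [  # made-up numbers for illustration only
    ("samples/engine_x/cs01.wav", 0.10),
    ("samples/engine_x/cs02.wav", 0.20),
    ("samples/engine_x/en01.wav", 0.02),
    ("samples/engine_y/cs01.wav", 0.05),
]
summary = summarize(rows)
for (engine, lang), value in sorted(summary.items()):
    print(f"{engine}\t{lang}\tmean WER={value:.3f}")
```

Note that a mean of per-file WERs weights short and long texts equally, which is not the same as a corpus-level WER; for a rough EN vs CZ comparison across engines it is usually good enough.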

Key References

TTS-AGI. TTS Arena V2: A Community-Driven Benchmark for TTS (Hugging Face blog). 2025. https://huggingface.co/blog/arena-tts - Methodology of the blind, randomized, Elo-style pairwise TTS leaderboard and its V2 model roster.
TTS-AGI. TTS Arena V2 (Hugging Face Space). 2025–2026. https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2 - Live community leaderboard; vote interface and current rankings.
TTS-AGI. About: TTS Arena V2. 2025–2026. https://tts-agi-tts-arena-v2.hf.space/about - States explicitly that the arena is English-only (US/UK accents, four use-case categories) and a fresh Elo restart from V1.
Artificial Analysis. Text to Speech AI Model & Provider Leaderboard. 2025–2026. https://artificialanalysis.ai/text-to-speech/leaderboard - Commercial-focused TTS quality (Elo) / price / speed leaderboard with head-to-head Speech Arena; ~76–83 models tracked in 2026.
Minixhofer, C. et al. TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems. arXiv:2506.19441. 2025. https://arxiv.org/abs/2506.19441 - Objective, distribution-based naturalness benchmark correlating strongly with human ratings; complements preference arenas.
AssemblyAI. How to Evaluate Speech Synthesis & TTS Quality. 2024–2025. https://www.assemblyai.com/blog/how-to-evaluate-speech-synthesis-and-tts-quality - Clear explanation of MOS, CMOS, MUSHRA and WER with scales and limitations.
Deepgram. What is Mean Opinion Score (MOS) and How is it Calculated? 2024. https://www.deepgram.com/ai-glossary/mean-opinion-score-mos - MOS scale, ITU-T P.800 background, and rater-pool caveats.
Papers With Code. Text-to-Speech Synthesis & Naturalness Evaluation of Speech. 2025. https://paperswithcode.com/task/text-to-speech-synthesis - Standard datasets (LibriTTS, VCTK, LJSpeech) and metrics used in academic TTS evaluation.
Coqui. XTTS-v2 model card (Hugging Face). 2024. https://huggingface.co/coqui/XTTS-v2 - Confirms the 17 supported languages including Czech (cs); zero-shot voice cloning.
ElevenLabs. Text-to-Speech capabilities / supported languages. 2025. https://elevenlabs.io/docs/capabilities/text-to-speech - Documents 70+ supported languages including Czech for the multilingual models.
Google Cloud. Chirp 3: HD voices. 2025–2026. https://docs.cloud.google.com/text-to-speech/docs/chirp3-hd - Confirms cs-CZ is among the Chirp 3 HD supported languages (8 voices, 4M/4F); higher naturalness than older Czech WaveNet/Neural2 voices.
TTS-AGI / Pendrokar. Support evaluations of other languages? (TTS Arena discussion). 2025. https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena/discussions/10 - Community thread documenting the English-only limitation and plans for multilingual (Common Voice) evaluation.
MiniMax. MiniMax Speech-02: A New Benchmark in TTS. 2025. https://www.minimax.io/news/minimax-speech-02 - Vendor-claimed multilingual WER/similarity results; example of self-graded benchmark to treat cautiously (Czech inclusion unconfirmed).
rhasspy. Piper: a fast, local neural text to speech system (GitHub). 2024–2025. https://github.com/rhasspy/piper - Open, lightweight engine with Czech voice models; intelligible but clearly synthetic.
