By the time you reach this chapter you have met a lot of TTS engines, and almost every one of them claims to be "the most natural" or "human-like" or "indistinguishable from a real voice." They cannot all be right. This chapter is about how to cut through that noise: what TTS quality actually means, how it is measured (MOS, CMOS, MUSHRA, WER), which public leaderboards are worth reading (TTS Arena on Hugging Face, Artificial Analysis), and — crucially — why almost every number you will see is biased toward English and toward the vendor publishing it.
The reader of this report has a hard, non-negotiable filter: the chosen engine must sound natural in both English and Czech. That filter is exactly where public benchmarks fail you, because nearly all of them are English-centric and Czech is rarely scored at all. So this chapter does three things: (1) teaches you to read benchmarks critically, (2) builds an honest quality matrix for the top contenders with explicit EN vs CZ grading and confidence levels, and (3) gives you a concrete, copy-pastable recipe to run your own blind A/B test on your own Czech and English texts — because in the end that is the only benchmark that matters for your project.
TTS quality is not one thing. A voice can be perfectly intelligible (you understand every word) yet sound robotic; or it can be warm and expressive yet mispronounce names; or it can nail a single sentence in a demo and fall apart on a 2,000-word article with numbers, abbreviations, and foreign words. The quality dimensions that matter for reading your texts aloud are therefore roughly these: intelligibility, naturalness and expressiveness, pronunciation accuracy (names, numbers, abbreviations, foreign words), and consistency over long passages.
No single score captures all of these. The metrics below each capture a slice, which is why you need several of them — and why a top leaderboard rank is necessary but never sufficient.
MOS — Mean Opinion Score. The classic subjective metric, standardized in ITU-T P.800. Human listeners rate audio on a 1–5 scale (1 = bad, 5 = excellent) and you average across raters and samples. A modern "very good" neural TTS lands roughly in the 4.0–4.5 range; natural human recordings typically score around 4.5–4.8 (not a perfect 5, because raters are harsh and recordings have their own flaws).
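MOS limitations are severe and worth internalizing. Above all, a MOS depends on the listener pool and the study setup that produced it, so the same system can score differently in two different studies.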
The practical consequence: never compare MOS numbers across different papers or vendors as if they were on the same ruler. A MOS is only meaningful relative to other systems scored in the same study by the same listeners.
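To make "same study, same listeners" concrete, here is a minimal sketch of how a MOS and its uncertainty could be computed from raw ratings. The system names and ratings are hypothetical, and the interval uses a plain normal approximation (z = 1.96), which is rough for small panels; a t-based interval would be the more careful choice.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))  # standard error
    return mean, z * sem

# Hypothetical 1-5 ratings collected in ONE session (same listeners, same texts).
ratings_by_system = {
    "system_x": [4, 5, 4, 4, 3, 5, 4, 4],
    "system_y": [4, 4, 3, 4, 4, 3, 4, 5],
}
for name, ratings in ratings_by_system.items():
    mean, half = mos_with_ci(ratings)
    print(f"{name}: MOS {mean:.2f} +/- {half:.2f} (n={len(ratings)})")
```

If the two intervals overlap heavily, the honest reading is "no measurable difference in this study", whatever the raw means say.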
CMOS (Comparative MOS). Listeners hear two samples (A vs B) and rate the difference, typically on a −3 to +3 scale (−3 = A much worse, +3 = A much better, 0 = tie). This is far more sensitive than absolute MOS for detecting subtle differences, because you remove the "where is 4 on my personal scale" problem. A mean CMOS near 0 means two systems are perceptually a tie; a mean with a magnitude of about 0.3 to 0.5 is a small but real preference, and the sign tells you which system it favors. CMOS is the right tool when you are deciding between two genuinely good options, which is exactly your situation.
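A small sketch of how CMOS trials could be aggregated. It assumes a hypothetical setup in which the listener rates the first-played clip relative to the second, and playback order is randomized, so each raw score is first re-oriented to "A relative to B". The tie rule (interval includes zero) is a simple convention chosen for this illustration, not a standard.

```python
import math
import statistics

def orient(raw_score, a_played_first):
    """Convert 'first clip vs second clip' into 'A vs B' (positive = A better)."""
    return raw_score if a_played_first else -raw_score

def cmos(scores, z=1.96):
    """scores: A-vs-B ratings in [-3, 3]. Returns (mean, half_width, verdict)."""
    mean = statistics.fmean(scores)
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    if abs(mean) <= half:
        verdict = "tie"
    else:
        verdict = "A preferred" if mean > 0 else "B preferred"
    return mean, half, verdict

# Hypothetical trials: (raw score, whether A was played first).
trials = [(1, True), (-1, False), (0, True), (2, True), (-1, False), (1, True)]
scores = [orient(raw, first) for raw, first in trials]
mean, half, verdict = cmos(scores)
print(f"CMOS {mean:+.2f} +/- {half:.2f}: {verdict}")
```

Randomizing which system plays first matters: without it, a listener's habit of favoring the second clip would leak into the score.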
MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor, ITU-R BS.1534). Listeners rate several systems at once on a 0–100 scale, with a hidden high-quality reference and a deliberately degraded "anchor" mixed in to calibrate the scale. More efficient than MOS when comparing many systems, and the hidden reference/anchor make scores somewhat more stable. You will see it in audio-codec and academic TTS papers more than in product marketing.
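The hidden reference also gives you a way to screen out unreliable listeners. The sketch below drops anyone who fails to recognize the reference too often. The thresholds (reference rated below 90 on more than 15% of trials) mirror the post-screening guidance commonly cited from ITU-R BS.1534, but treat them as adjustable assumptions and check the Recommendation itself; the listener IDs and scores are hypothetical.

```python
def screen_listeners(ref_scores, min_ref=90, max_fail_share=0.15):
    """ref_scores: {listener: [score given to the hidden reference per trial]}.

    Returns the listeners who rated the hidden reference below `min_ref`
    on at most `max_fail_share` of their trials.
    """
    kept = []
    for listener, scores in ref_scores.items():
        fails = sum(1 for s in scores if s < min_ref)
        if fails / len(scores) <= max_fail_share:
            kept.append(listener)
    return kept

# Hypothetical 0-100 scores each listener gave to the hidden reference.
ref_scores = {
    "listener_1": [100, 95, 98, 92],
    "listener_2": [60, 95, 70, 100],   # missed the reference on half the trials
}
print(screen_listeners(ref_scores))
```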
WER — Word Error Rate. This is the only objective, fully automatic metric of the four. You feed the synthesized audio into a strong ASR (speech-to-text) system, transcribe it, and compare to the text you fed in:
WER = (Substitutions + Insertions + Deletions) / Total reference words
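Lower is better; a clean modern English TTS run through a good ASR often lands in the low single-digit percentages. WER measures intelligibility, not naturalness: a flat robotic voice can have a near-zero WER. Its main limitations are that it inherits the errors and language bias of the ASR doing the transcription, and that it ignores prosody entirely.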
WER is most useful as a guardrail: it catches catastrophic failures (skipped sentences, hallucinated words, garbled numbers) that a quick listen might miss. Pair it with a human/CMOS judgment of naturalness — never use it alone.
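To see where the WER formula's three error counts come from, here is a self-contained sketch that computes WER as a word-level edit distance. It splits on whitespace only and does no text normalization, so it is an illustration of the formula rather than a replacement for a dedicated library; the example sentences are made up.

```python
def wer(reference: str, hypothesis: str) -> float:
    """(substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion: reference word missing
                cur[j - 1] + 1,          # insertion: extra hypothesis word
                prev[j - 1] + (r != h),  # substitution, or a free match
            ))
        prev = cur
    return prev[-1] / len(ref)

ref = "the train leaves at nine fifteen"
print(wer(ref, "the train leaves at nine fifty"))  # one substitution in six words
print(wer(ref, "the train leaves nine fifteen"))   # one deletion in six words
```

Note that insertions can push WER above 100%, which is why a hallucinating engine can score worse than one that outputs nothing.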
| Metric | Type | Scale | Measures | Main weakness |
|---|---|---|---|---|
| MOS | Subjective | 1–5 | Absolute naturalness | Pool-dependent, not cross-study comparable |
| CMOS | Subjective | −3…+3 | A-vs-B preference | Relative only, still needs humans |
| MUSHRA | Subjective | 0–100 | Many systems at once | Needs trained listeners, careful setup |
| WER | Objective | 0–100% | Intelligibility | Inherits ASR bias; ignores prosody |
The most-cited community benchmark is TTS Arena by the TTS-AGI group on Hugging Face, now in its V2 generation. Its design borrows directly from the LMSYS Chatbot Arena idea: instead of asking people for an absolute MOS, it shows a listener two anonymized audio clips of the same text generated by two different (randomly chosen) models, the listener picks which sounds better, and those pairwise votes are aggregated into an Elo-style ranking. This is essentially crowd-sourced pairwise preference testing at scale: a binary cousin of CMOS, in which listeners say which clip wins but not by how much.
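For intuition about how pairwise votes turn into a ranking, here is a sketch of the textbook Elo update. This is not the arena's actual code: the K-factor of 32, the starting rating of 1500, and the model names are assumptions for illustration, and the real leaderboard may aggregate votes differently.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B implied by the rating gap."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, a, b, outcome, k=32):
    """outcome: 1.0 if a won, 0.0 if b won, 0.5 for a tie. Mutates ratings."""
    e_a = expected_score(ratings[a], ratings[b])
    ratings[a] += k * (outcome - e_a)
    ratings[b] += k * ((1 - outcome) - (1 - e_a))

# Hypothetical models and votes: (model A, model B, outcome for A).
ratings = {"model_x": 1500.0, "model_y": 1500.0, "model_z": 1500.0}
votes = [("model_x", "model_y", 1.0), ("model_y", "model_z", 0.5),
         ("model_x", "model_z", 1.0), ("model_z", "model_y", 0.0)]
for a, b, outcome in votes:
    update(ratings, a, b, outcome)
print(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))
print(f"30-point gap -> expected win rate {expected_score(1530, 1500):.3f}")
```

The last line is the useful one for reading leaderboards: under this formula a 30-point lead corresponds to winning only about 54% of head-to-head votes, so small Elo gaps mean the systems are close.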
Why this design is good:
V2 expanded the model roster and refined the methodology over the original arena, pulling in a wide range of contemporary systems — commercial engines (e.g., ElevenLabs, MiniMax, Cartesia, PlayHT, Inworld, Hume) alongside open models (e.g., Kokoro, Sesame CSM, and other open releases). The original Hugging Face blog post "TTS Arena: Benchmarking Text-to-Speech Models in the Wild" documents the methodology; the V2 leaderboard is a fresh Elo restart (V1 votes do not carry over).
As a moment-in-time illustration (and only that — rankings move): in 2026 snapshots, the very top of the board is held by frontier commercial systems — recent entries from Inworld, MiniMax (Speech-02 / 2.x HD lines), ElevenLabs (which routinely places several models in the top 10 across its Turbo/Flash/Multilingual tiers), Hume's Octave, and Google's Gemini-based TTS, with Elo spreads of only a few dozen points separating the leaders (i.e., perceptually close). The instructive open-source data point is Kokoro-82M, which sits well down the board (roughly mid-table) yet is remarkable for an 82M-parameter open model with one of the largest vote samples on the leaderboard — proof that you can get respectable English naturalness from a tiny free model, but also a reminder that the open models are not (yet) beating the commercial frontier on raw English naturalness.
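Critical caveats, to read before you trust a rank: the arena is English-centric, and every ranking is a snapshot that moves as votes and models are added.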
Artificial Analysis runs a parallel "Text to Speech" leaderboard and arena (artificialanalysis.ai/text-to-speech). It is more commercial-provider focused than the Hugging Face arena and is widely cited in industry write-ups. Like TTS Arena it uses head-to-head preference voting to produce a quality ranking, and it pairs that with provider-level data on price and speed/latency, which is genuinely useful for an engineering decision. The same English-bias and snapshot caveats apply: it is primarily an English-quality and commercial-coverage view. Use it to compare commercial options on the quality/price/latency triangle, and cross-check its quality ordering against the Hugging Face arena.
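One way to work with a quality/price/latency triangle is to discard every option that is beaten on all three axes by some other option and only argue about what remains. The sketch below does that with a simple Pareto filter. The provider names and numbers are placeholders, not real measurements, and the units are whatever your source reports.

```python
from typing import NamedTuple

class Offer(NamedTuple):
    name: str
    quality: float  # higher is better (for example, an arena rating)
    price: float    # lower is better
    latency: float  # lower is better

def dominates(a: Offer, b: Offer) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    no_worse = a.quality >= b.quality and a.price <= b.price and a.latency <= b.latency
    better = a.quality > b.quality or a.price < b.price or a.latency < b.latency
    return no_worse and better

def pareto_front(offers):
    return [o for o in offers if not any(dominates(p, o) for p in offers)]

# Placeholder values for illustration only.
offers = [
    Offer("provider_a", 1550, 30.0, 0.40),
    Offer("provider_b", 1540, 15.0, 0.25),
    Offer("provider_c", 1500, 20.0, 0.50),  # beaten by provider_b on every axis
    Offer("provider_d", 1480, 5.0, 0.30),
]
print([o.name for o in pareto_front(offers)])
```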
A useful complement to the preference arenas is TTSDS2 (Text-to-Speech Distribution Score 2), a 2025 academic benchmark (arXiv:2506.19441). Instead of crowd votes, it computes how closely a system's synthetic speech matches real human speech distributions across several factors — prosody, speaker identity, intelligibility, and general naturalness — and the authors report it correlates strongly with human ratings (validated against ~11,800 collected listening-test ratings). Why it matters for you: it is objective and reproducible (no rater pool to drift), and the framework is designed to be re-run on new languages and datasets. It is still research-grade and not a product leaderboard, but it is the most credible objective naturalness metric to cite in 2026, and it is the kind of thing you could point at Czech data yourself.
Here is the part the public leaderboards cannot give you: a per-engine grade that separates English from Czech and attaches a confidence level to each grade. The grades below are qualitative bands synthesized from the arenas (for English), from documented language support, and from community listening evidence (for Czech). They are not transcribed MOS numbers; see the warnings above about why that would be dishonest. Confidence reflects how much independent, Czech-specific evidence exists.
Grade legend: A = clearly natural / among the best · B = good, usually natural · C = usable but noticeably synthetic or inconsistent · D = weak / robotic · n/a = not supported. Confidence: High = strong independent evidence · Med = some evidence · Low = sparse, mostly inferred.
| Engine | Type / cost | EN quality | EN conf. | CZ supported? | CZ quality | CZ conf. | Notes |
|---|---|---|---|---|---|---|---|
| ElevenLabs (v2/v3) | Cloud, paid (free tier) | A | High | Yes (70+ langs incl. Czech) | B (best-in-class commercial CZ candidate) | Med | Top of English arenas; Czech is officially supported and generally the strongest commercial Czech voice, but verify on your text — quality varies by voice. |
| Google Cloud TTS (Chirp 3 HD / Neural2) | Cloud, paid (free tier) | A− / B+ | High | Yes (cs-CZ voices) | B | Med | Mature Czech locale; Chirp 3 HD is notably more natural than older WaveNet/Neural2 Czech voices. Reliable, well-documented. |
| Microsoft Azure Neural TTS | Cloud, paid (free tier) | A− / B+ | High | Yes (cs-CZ neural voices) | B | Med | Long-standing, solid Czech neural voices; strong pronunciation engine, good for long-form. |
| MiniMax Speech-02 | Cloud, paid | A | Med–High | Claims 30+ langs; Czech inclusion not explicitly confirmed | C–B (unverified for CZ) | Low | Excellent in English arenas; multilingual claims are vendor-stated on a private benchmark. Czech specifically must be tested before trusting. |
| OpenAI TTS (gpt-4o-mini-tts etc.) | Cloud, paid | A−/B+ | High | Multilingual, follows input language | C–B (CZ uneven) | Low | Pleasant English; Czech can work but accent/pronunciation is inconsistent and not officially a per-locale product. Test heavily. |
| Coqui XTTS v2 | Open / self-host, free | B+ | High | Yes — Czech (cs) is an official supported language (17 total) | B− / C+ | Med | The strongest free, self-hostable, explicitly-Czech option. Zero-shot voice cloning. Czech is real but a notch below top commercial; original Coqui is defunct, community fork (idiap/coqui-ai-TTS) is maintained. |
| Kokoro-82M | Open / self-host, free | B+ (great for size) | High | No official Czech | n/a | High | Tiny (82M), fast, well-liked in English arenas. Not a Czech option — do not pick it if Czech is required. |
| Piper (rhasspy) | Open / self-host, free | B− / C | High | Yes (Czech voice models exist) | C | Med | Fast, runs on a Raspberry Pi. Czech voices exist and are intelligible but clearly synthetic — fine for utility, not for "real-sounding" reading. |
| Cartesia (Sonic) | Cloud, paid | A | Med–High | Limited multilingual; Czech not a headline locale | C (unverified) | Low | Excellent low-latency English; Czech coverage is weak/unclear — verify before considering. |
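If you want to apply the hard filter mechanically, the matrix can be turned into a small data structure and queried. The sketch below is one possible encoding, with several assumptions: where the table gives a range, the lower band is used; letter grades are mapped to integers so they can be compared; "free or free tier" is read from the cost column; and the C+ cut-off is an arbitrary starting point you should move to taste.

```python
GRADE_POINTS = {"A": 12, "B": 9, "C": 6, "D": 3}

def grade_value(grade):
    """Map 'B', 'B-', 'C+' to an integer; None for unsupported ('n/a')."""
    if grade == "n/a":
        return None
    value = GRADE_POINTS[grade[0]]
    if grade.endswith("+"):
        value += 1
    elif grade.endswith("-"):
        value -= 1
    return value

# (engine, free or free tier?, Czech grade taken at the LOWER end of the table's range)
MATRIX = [
    ("ElevenLabs", True, "B"),
    ("Google Cloud TTS", True, "B"),
    ("Microsoft Azure Neural TTS", True, "B"),
    ("MiniMax Speech-02", False, "C"),
    ("OpenAI TTS", False, "C"),
    ("Coqui XTTS v2", True, "C+"),
    ("Kokoro-82M", True, "n/a"),
    ("Piper", True, "C"),
    ("Cartesia (Sonic)", False, "C"),
]

def shortlist(matrix, min_cz="C+"):
    floor = grade_value(min_cz)
    return [name for name, free, cz in matrix
            if free and grade_value(cz) is not None and grade_value(cz) >= floor]

print(shortlist(MATRIX))
```

The output is a list of candidates to test, not a verdict: every Czech grade in the table carries Med or Low confidence.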
How to read this matrix for your decision (natural EN and CZ, free/near-free):
Public leaderboards are English. Vendor demos are cherry-picked. Your texts are yours. So build a tiny, honest, blind test harness. It takes an afternoon and will tell you more about your specific use case than any leaderboard.
Step 1 — Pick representative texts. Choose 8–15 short passages (2–5 sentences each) drawn from your real content, split roughly half English, half Czech. Deliberately include the hard cases that break TTS: numbers and dates, an acronym, a person's name, a foreign loanword, a question, and one long compound sentence. Save them as texts_en.txt and texts_cs.txt.
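The later steps are easier if every passage has a stable ID that encodes its language. A minimal sketch, assuming one passage per line in texts_en.txt and texts_cs.txt, and an ID scheme of the form en01, cs01 (the scheme and the refs.tsv output name are conventions chosen here, not requirements):

```python
from pathlib import Path

def make_ref_rows(passages_by_lang):
    """{'en': [...], 'cs': [...]} -> [(id, text), ...] with ids like 'en01', 'cs01'."""
    rows = []
    for lang, passages in passages_by_lang.items():
        for n, text in enumerate(passages, start=1):
            # Collapse tabs and repeated whitespace so the TSV stays one row per passage.
            rows.append((f"{lang}{n:02d}", " ".join(text.split())))
    return rows

def write_refs(sources, out_path="refs.tsv"):
    """sources: {'en': 'texts_en.txt', 'cs': 'texts_cs.txt'}; writes id<TAB>text lines."""
    passages = {}
    for lang, path in sources.items():
        lines = Path(path).read_text(encoding="utf-8").splitlines()
        passages[lang] = [ln for ln in lines if ln.strip()]
    rows = make_ref_rows(passages)
    Path(out_path).write_text(
        "".join(f"{tid}\t{text}\n" for tid, text in rows), encoding="utf-8")
    return rows

# Usage, once both text files exist:
# write_refs({"en": "texts_en.txt", "cs": "texts_cs.txt"})
```

Reading and writing explicitly as UTF-8 matters for the Czech file; a platform default encoding can silently mangle diacritics.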
Step 2 — Generate the same texts from each candidate engine. Render every passage with every candidate into samples/<engine>/<id>.wav. For self-hosted XTTS, that's a local call; for cloud engines, their SDK. Keep loudness consistent (normalize all clips) so loudness doesn't bias votes.
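Loudness matching can be sketched without any audio library. The functions below scale 16-bit PCM samples to a common RMS level. This is a simplification: RMS is only a proxy for perceived loudness (a proper LUFS measurement per ITU-R BS.1770 is the more careful choice), and the −23 dBFS target is an arbitrary assumption. Reading and writing the WAV files, for example with the standard-library `wave` module, is left out.

```python
import math

def rms_dbfs(samples, full_scale=32768):
    """RMS level of integer PCM samples in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return -math.inf if rms == 0 else 20 * math.log10(rms / full_scale)

def normalize_rms(samples, target_dbfs=-23.0, full_scale=32768):
    """Scale samples so their RMS level matches target_dbfs, clipping if needed."""
    current = rms_dbfs(samples, full_scale)
    if current == -math.inf:        # pure silence: nothing to scale
        return list(samples)
    gain = 10 ** ((target_dbfs - current) / 20)
    lo, hi = -full_scale, full_scale - 1
    return [max(lo, min(hi, round(s * gain))) for s in samples]

# Two synthetic square-wave "clips" at very different levels.
quiet = [1000, -1000] * 100
loud = [8000, -8000] * 100
for name, clip in (("quiet", quiet), ("loud", loud)):
    out = normalize_rms(clip)
    print(f"{name}: {rms_dbfs(clip):.1f} dBFS -> {rms_dbfs(out):.1f} dBFS")
```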
Step 3 — Blind-randomize and listen. Strip engine names from filenames, shuffle, and rate each clip 1–5 (MOS-style) and do pairwise A/B picks (CMOS-style) per text. Keep a hidden key mapping shuffled IDs back to engines. Critically: rate Czech and English separately and keep separate scoreboards — an engine can be A in English and C in Czech.
A minimal harness to randomize, collect votes, and reveal the key:
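import glob, json, os, random, shutil
from collections import Counter

# Layout: samples/<engine>/<id>.wav. Each trial is a blind A/B over the SAME text id.
engines = sorted(d for d in os.listdir("samples") if os.path.isdir(f"samples/{d}"))
ids = sorted({os.path.splitext(os.path.basename(p))[0]
              for p in glob.glob("samples/*/*.wav")})
os.makedirs("blind", exist_ok=True)

votes, key = [], {}
for tid in ids:
    # Only pair engines that actually rendered this text.
    have = [e for e in engines if os.path.exists(f"samples/{e}/{tid}.wav")]
    if len(have) < 2:
        continue
    a, b = random.sample(have, 2)          # two random engines, same text
    label_a, label_b = f"{tid}-A", f"{tid}-B"
    key[label_a], key[label_b] = a, b      # hidden mapping, saved to results.json
    # Copy to neutral filenames so the engine name never appears on screen.
    shutil.copy(f"samples/{a}/{tid}.wav", f"blind/{label_a}.wav")
    shutil.copy(f"samples/{b}/{tid}.wav", f"blind/{label_b}.wav")
    print(f"\nText {tid}")
    print(f"  Play A: blind/{label_a}.wav")
    print(f"  Play B: blind/{label_b}.wav")
    pick = input("  Which sounds more natural? [a/b/tie]: ").strip().lower()
    votes.append({"text": tid, "winner": pick, "A": a, "B": b})

# Tally: one point per win; a tie (or any other answer) gives half a point each.
score = Counter()
for v in votes:
    if v["winner"] == "a":
        score[v["A"]] += 1
    elif v["winner"] == "b":
        score[v["B"]] += 1
    else:
        score[v["A"]] += 0.5
        score[v["B"]] += 0.5
print("\nBlind A/B results:", score.most_common())

with open("results.json", "w", encoding="utf-8") as f:
    json.dump({"votes": votes, "key": key}, f, indent=2)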
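The tally above mixes both languages and counts raw points, which favors engines that happened to be drawn more often. A sketch of the per-language scoreboard that Step 3 asks for, using win rates instead of raw counts. It assumes the vote records have the `text`, `winner`, `A` and `B` fields written to results.json, and that Czech text IDs start with `cs` (an ID convention, not something the harness enforces).

```python
from collections import defaultdict

def scoreboards(votes):
    """Return {lang: {engine: win_rate}}, counting a tie as half a win."""
    points = defaultdict(lambda: defaultdict(float))
    trials = defaultdict(lambda: defaultdict(int))
    for v in votes:
        lang = "cs" if v["text"].startswith("cs") else "en"
        a, b = v["A"], v["B"]
        trials[lang][a] += 1
        trials[lang][b] += 1
        if v["winner"] == "a":
            points[lang][a] += 1
        elif v["winner"] == "b":
            points[lang][b] += 1
        else:
            points[lang][a] += 0.5
            points[lang][b] += 0.5
    return {lang: {e: points[lang][e] / n for e, n in per_engine.items()}
            for lang, per_engine in trials.items()}

# Hypothetical votes in the same shape as results.json["votes"].
votes = [
    {"text": "cs01", "winner": "a", "A": "engine_x", "B": "engine_y"},
    {"text": "cs02", "winner": "tie", "A": "engine_y", "B": "engine_x"},
    {"text": "en01", "winner": "b", "A": "engine_x", "B": "engine_y"},
]
for lang, board in scoreboards(votes).items():
    print(lang, sorted(board.items(), key=lambda kv: kv[1], reverse=True))
```

With only a handful of trials per engine these rates are noisy; repeat the session or add texts before treating a small gap as real.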
Step 4 — Add an objective WER guardrail. Transcribe every generated clip with a strong multilingual ASR (e.g., a Whisper-class model, which handles Czech reasonably) and compute WER against the source text. This catches the silent killers — a skipped clause, a mangled number, a hallucinated word — that a quick listen misses. Use a Czech-capable ASR for the Czech clips, otherwise the WER reflects the ASR's Czech weakness, not the TTS.
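# pip install faster-whisper jiwer
import glob, os, re
from faster_whisper import WhisperModel
from jiwer import wer

def norm(s):
    # Lowercase and drop punctuation so "Hello, world." matches "hello world".
    return " ".join(re.sub(r"[^\w\s]", " ", s.lower()).split())

asr = WhisperModel("large-v3")  # multilingual; decent Czech
with open("refs.tsv", encoding="utf-8") as f:  # one "id<TAB>text" per line
    refs = dict(line.rstrip("\n").split("\t", 1) for line in f if line.strip())

for wav in sorted(glob.glob("samples/*/*.wav")):
    tid = os.path.splitext(os.path.basename(wav))[0]
    lang = "cs" if tid.startswith("cs") else "en"  # assumes ids like cs01, en01
    segs, _ = asr.transcribe(wav, language=lang)
    hyp = " ".join(s.text.strip() for s in segs)
    print(f"{wav}\tWER={wer(norm(refs[tid]), norm(hyp)):.3f}")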
Step 5 — Decide on the combined picture. Rank by your blind A/B/MOS for each language separately, then disqualify anything with alarming Czech WER. The engine that wins your Czech scoreboard while staying free/near-free is your answer — regardless of where it sits on any public English leaderboard.
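The decision rule in Step 5 can be written down so you apply it the same way every time. In this sketch the 15% Czech WER ceiling, the engine names and all numbers are assumptions for illustration; `allowed` stands for whichever engines pass your own cost filter.

```python
from collections import defaultdict

def mean_wer(rows, lang="cs"):
    """rows: (engine, text_id, wer) tuples -> {engine: mean WER for that language}."""
    by_engine = defaultdict(list)
    for engine, tid, value in rows:
        if tid.startswith(lang):
            by_engine[engine].append(value)
    return {e: sum(v) / len(v) for e, v in by_engine.items()}

def rank(win_rate, wer_by_engine, allowed, max_wer=0.15):
    """Drop engines outside `allowed` or above the WER ceiling, then sort by win rate.

    An engine with no WER measurement is treated as failing the guardrail.
    """
    keep = [e for e in win_rate
            if e in allowed and wer_by_engine.get(e, 1.0) <= max_wer]
    return sorted(keep, key=lambda e: win_rate[e], reverse=True)

# Hypothetical Czech results.
wer_rows = [("engine_x", "cs01", 0.40), ("engine_x", "cs02", 0.20),
            ("engine_y", "cs01", 0.00), ("engine_y", "cs02", 0.10),
            ("engine_z", "cs01", 0.05), ("engine_x", "en01", 0.00)]
cs_win_rate = {"engine_x": 0.8, "engine_y": 0.6, "engine_z": 0.7}
print(rank(cs_win_rate, mean_wer(wer_rows), allowed={"engine_x", "engine_y", "engine_z"}))
```

Here the listener favorite, engine_x, is disqualified because it garbles too much Czech text, which is exactly the case the guardrail exists for.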
Every vendor's landing page plays its best clip: a sentence the model was tuned on, in the language it's strongest in (almost always English), at the optimal length, with the best voice, possibly hand-edited. This is the single most misleading "evidence" in the whole field. Specific traps:
The antidote is the harness above: your texts, your languages, blind, with an objective WER guardrail. Two hours of that beats two weeks of reading leaderboards.