This is the "just tell me what to use" chapter. Everything covered earlier in the report — local/offline models, self-hosted projects, cloud APIs, and the browser — is consolidated here into one master scorecard, graded explicitly on the two hard filters that matter for this project: natural-sounding English AND natural-sounding Czech, plus cost, hosting model, voice cloning, streaming, licensing, and setup effort.
Then comes a decision tree that routes you to a concrete pick based on your real constraints (budget, whether Czech is critical, whether you can self-host, whether you need cloning, and how tight your latency budget is), followed by the headline recommendation restated for your exact situation: a developer adding TTS to their own software, who wants the most natural English and Czech that stays free or near-free.
A recurring theme from the research that you must keep in mind while reading this scorecard: Czech is the filter that eliminates most of the "best" English models. Many of the highest-MOS open models (Kokoro, Chatterbox, Sesame CSM, Orpheus in its base form) ship with no Czech at all, or with Czech that is unverified/poor. Grade every row on the Czech column first.
Quality grades are coarse on purpose: published MOS numbers are not comparable across vendors, and almost none of them are measured on Czech. The grades below are an evidence-tiered synthesis of the per-option chapters.
"Cost" is the realistic cost for this project (reading your own texts), not a theoretical list price.
| Option | Type | EN quality | CZ quality | Realistic cost | Hosting | Voice clone | Streaming | License / commercial | Setup effort | Best for |
|---|---|---|---|---|---|---|---|---|---|---|
| Microsoft edge-tts | Cloud (unofficial) | A− | A− | Free (no key) | Their server, free Python lib | No | Yes (chunked) | Gray area: Edge "Read aloud" backend, no commercial ToS | Very low | The free sweet spot for EN+CZ if you accept the gray-area dependency |
| Azure AI Speech (Neural) | Cloud API | A | A | Free 500k chars/mo, then ~$15–16/M (Neural) | Microsoft cloud | Custom Neural Voice (gated, paid) | Yes (real-time + SDK) | Clear commercial ToS | Low–med | Production EN+CZ with a license you can stand behind |
| Google Cloud TTS | Cloud API | A | A− | Free 1M (Std) / lower free tier for premium; Neural2 ~$16/M, Chirp 3 HD priced higher | Google cloud | Instant Custom Voice (gated) | Yes | Clear commercial ToS | Low–med | Generous always-free Czech via Standard/WaveNet; scale up later |
| OpenAI gpt-4o-mini-tts | Cloud API | A | B+ (multilingual, CZ usable but not native-tuned) | ~$0.015/min audio (token-based, est.) | OpenAI cloud | No (preset + steerable style) | Yes | Clear commercial ToS | Very low | Easiest API; great EN, decent CZ; steerable tone |
| ElevenLabs | Cloud API | A | A− (Multilingual v2 / v3 supports Czech) | Free 10k chars/mo (non-commercial); paid tiers from ~$5/mo | ElevenLabs cloud | Yes, best-in-class | Yes (WebSocket) | Free tier non-commercial; paid = commercial | Low | Highest naturalness + cloning if budget allows; CZ is good |
| Kokoro-82M | Local model | A− | — (no Czech) | Free | Local CPU/GPU | No | Partial (short chunks) | Apache-2.0 (very permissive) | Med | English-only projects; fails the Czech filter |
| XTTS v2 (Coqui / forks) | Local / self-host | B+ | B (Czech is a supported language) | Free | Local GPU (or CPU slow) | Yes (3–6s sample) | Yes | CPML — non-commercial (original weights); check fork terms | Med–high | Local Czech voice cloning for personal/non-commercial use |
| Piper | Local model | B | B− (Czech voices exist, dated) | Free | Local CPU (tiny) | No | Yes (fast) | MIT (permissive) | Low–med | Fully offline, low-resource, embedded; CZ acceptable not great |
| Coqui VITS / other CZ checkpoints | Local model | B− | B− | Free | Local | No | Yes | Varies (often MPL/Apache) | Med–high | Tinkerers wanting fully-open Czech weights |
| Fish Speech / OpenAudio | Local / self-host | A− | B (multilingual incl. CZ claimed) | Free (self-host) | Local GPU | Yes | Yes | Check current license (has shifted) | High | Self-hosted multilingual cloning if you verify Czech yourself |
| Orpheus / Sesame CSM / Chatterbox | Local model | A | — (English-first; CZ absent/unverified) | Free | Local GPU | Some (Chatterbox yes) | Yes | Apache/MIT (varies) | High | Best-sounding English locally; fails Czech filter |
| Web Speech API (speechSynthesis) | Browser built-in | C–B (OS-dependent) | C (OS-dependent, often poor) | Free | Client device | No | Yes (native) | N/A (browser API) | Trivial | Zero-cost accessibility fallback; CZ inconsistent |
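The "grade Czech first" rule can be applied mechanically. The sketch below is only an illustration under stated assumptions: it copies a subset of the EN/CZ grades from the scorecard, writes A− as `"A-"` and a missing Czech column as `None`, and assumes the coarse grades can be ranked ordinally. The names `GRADE_RANK`, `czech_first`, and `passes_czech_filter` are hypothetical helpers, not part of any library.

```python
# Ordinal rank of the coarse grades (lower is better); None = no Czech at all.
GRADE_RANK = {"A": 0, "A-": 1, "B+": 2, "B": 3, "B-": 4, "C": 5, None: 6}

# (option, EN grade, CZ grade), copied from the scorecard rows.
SCORECARD = [
    ("edge-tts", "A-", "A-"),
    ("Azure AI Speech", "A", "A"),
    ("Google Cloud TTS", "A", "A-"),
    ("gpt-4o-mini-tts", "A", "B+"),
    ("ElevenLabs", "A", "A-"),
    ("Kokoro-82M", "A-", None),
    ("XTTS v2", "B+", "B"),
    ("Piper", "B", "B-"),
    ("Orpheus / Sesame CSM / Chatterbox", "A", None),
]


def czech_first(rows):
    """Sort by Czech grade first, using the English grade only as a tie-breaker."""
    return sorted(rows, key=lambda row: (GRADE_RANK[row[2]], GRADE_RANK[row[1]]))


def passes_czech_filter(rows, worst_acceptable="B-"):
    """Keep only options whose Czech grade is at least `worst_acceptable`."""
    limit = GRADE_RANK[worst_acceptable]
    return [row for row in rows if GRADE_RANK[row[2]] <= limit]


if __name__ == "__main__":
    for option, en, cz in czech_first(SCORECARD):
        print(f"{option:35s} CZ={cz or 'none':4s} EN={en}")
```

Sorting this way pushes the English-only models to the bottom regardless of their English grade, which is the point of the rule. Cost and licensing are deliberately left out of this sketch.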
Notes that matter:
edge-tts uses the same neural voices as Azure (e.g. Czech cs-CZ-VlastaNeural, cs-CZ-AntoninNeural) but through the free, unofficial endpoint behind Edge's "Read aloud." It is the single best free-and-natural Czech option, with the caveat that there is no commercial license or SLA; Microsoft can change or rate-limit the endpoint at any time. Treat it as "free prototype / personal-use" and have Azure as the drop-in paid fallback (same voice names).

Filtered to EN good + CZ good + free or near-free, the field collapses to a short list:
| Rank | Option | Why it wins | The catch |
|---|---|---|---|
| 1 | edge-tts | Truly free, no key, Azure-grade EN+CZ neural voices, trivial setup | Unofficial/gray-area, no SLA, no commercial license |
| 2 | Azure AI Speech | Same voices as #1 but licensed; 500k chars/mo free | Needs an Azure account/key; paid beyond free tier |
| 3 | Google Cloud TTS | Large always-free tier including Czech; clean ToS | Premium Czech (Chirp/Neural2) costs money |
| 4 | Piper (local) | 100% offline, free forever, no quotas, CZ voices exist | Czech voices are dated/robotic vs cloud neural |
| 5 | XTTS v2 (local) | Free local Czech voice cloning | Non-commercial license; needs a GPU to be fast |
If you only remember one row: prototype on edge-tts, and keep Azure (same voice IDs) as the one-line swap when you need a real license.
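Because the free endpoint has no SLA, one way to keep the paid option ready is a small fallback wrapper. This is a minimal sketch, not a prescribed design: `with_fallback` and the `Synth` type are hypothetical names, and the two backends in the demo are stand-ins for real edge-tts and Azure calls.

```python
from typing import Callable

# A backend maps (text, voice name) to encoded audio bytes.
Synth = Callable[[str, str], bytes]


def with_fallback(primary: Synth, fallback: Synth) -> Synth:
    """Return a synthesizer that tries `primary` and falls back on any failure."""

    def synthesize(text: str, voice: str) -> bytes:
        try:
            return primary(text, voice)
        except Exception:
            # Network error, rate limit, or a changed endpoint: use the
            # licensed backend with the same voice name instead.
            return fallback(text, voice)

    return synthesize


if __name__ == "__main__":
    # Stand-in backends for demonstration only.
    def flaky_free_backend(text: str, voice: str) -> bytes:
        raise ConnectionError("free endpoint unavailable")

    def paid_backend(text: str, voice: str) -> bytes:
        return f"[{voice}] {text}".encode("utf-8")

    speak = with_fallback(flaky_free_backend, paid_backend)
    print(speak("Ahoj", "cs-CZ-VlastaNeural"))
```

The wrapper relies on the voice names being identical on both services, which is what the shortlist notes for edge-tts and Azure.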
For a project that reads your own texts (not mass TTS for thousands of users), volume is usually small:
At that volume:
edge-tts: $0.

gpt-4o-mini-tts: ~70 min × …

The takeaway: for personal-scale reading, every cloud option is in the "a few dollars or free" range. Cost is rarely the deciding factor at this scale; Czech quality, licensing, and offline-vs-cloud are.
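The arithmetic behind such estimates is simple enough to sketch. The prices below are the approximate figures from the scorecard (500k free characters and roughly $16 per million for Azure Neural, roughly $0.015 per minute for gpt-4o-mini-tts); the monthly character count is a hypothetical placeholder, and the function names are illustrative only.

```python
def tiered_cost_usd(chars: int, free_chars: int, usd_per_million: float) -> float:
    """Cost of a per-character plan with a monthly free allowance."""
    billable = max(0, chars - free_chars)
    return billable * usd_per_million / 1_000_000


def per_minute_cost_usd(minutes: float, usd_per_minute: float) -> float:
    """Cost of a plan billed per minute of generated audio."""
    return minutes * usd_per_minute


if __name__ == "__main__":
    monthly_chars = 60_000   # hypothetical volume, an assumption of this sketch
    monthly_minutes = 70     # the ~70 min figure used in the text

    azure = tiered_cost_usd(monthly_chars, free_chars=500_000, usd_per_million=16.0)
    openai = per_minute_cost_usd(monthly_minutes, usd_per_minute=0.015)

    print(f"Azure AI Speech:  ${azure:.2f}/month")
    print(f"gpt-4o-mini-tts:  ${openai:.2f}/month")
```

Plugging in your own volume is the quickest way to confirm whether you stay inside a free tier; vendor prices change, so treat the constants as placeholders to re-check.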
Follow the first branch that matches your hardest constraint.
```
START: I need TTS that reads my texts in natural English AND Czech.
│
├─ Q1: Is hard offline / data-never-leaves-my-box a requirement?
│   ├─ YES
│   │   ├─ Need voice CLONING (your own/a custom voice)?
│   │   │   ├─ YES → XTTS v2 (Czech + cloning). NOTE: weights are
│   │   │   │        NON-COMMERCIAL (CPML). OK for personal/internal only.
│   │   │   │        Needs a GPU for snappy latency.
│   │   │   └─ NO  → Piper (MIT, runs on a CPU/Raspberry-Pi-class box).
│   │   │            Czech voices exist but sound dated: acceptable, not
│   │   │            cloud-grade. Best fully-free offline pick.
│   │   └─ (If English-only after all → Kokoro-82M, Apache-2.0, best
│   │       open English, but it has NO Czech.)
│   └─ NO (cloud is fine) ↓
│
├─ Q2: Is this commercial / shipped to real users (needs a clean license)?
│   ├─ YES ↓
│   │   ├─ Q2a: Top-tier naturalness + want cloning, budget OK?
│   │   │   └─ ElevenLabs (best naturalness + cloning; Czech is good;
│   │   │      paid tier = commercial license).
│   │   ├─ Q2b: Want the most defensible, enterprise-clean EN+CZ?
│   │   │   └─ Azure AI Speech (cs-CZ-Vlasta/Antonin neural, clear ToS,
│   │   │      500k chars/mo free). RECOMMENDED commercial default.
│   │   └─ Q2c: Want generous always-free + Google ecosystem?
│   │       └─ Google Cloud TTS (free Czech via Standard/WaveNet;
│   │          premium tiers cost more; verify Czech on the tier).
│   └─ NO (personal project / internal tool / prototype) ↓
│
├─ Q3: Want the absolute easiest free path that sounds great in EN+CZ?
│   └─ YES → edge-tts. Free, no key, Azure-grade Czech voices,
│            ~5 lines of Python. THE sweet-spot pick for this project.
│            (Gray-area/no-SLA; keep Azure as the drop-in fallback.)
│
└─ Q4: Pure browser/client, zero backend, accessibility-grade is enough?
    └─ YES → Web Speech API (speechSynthesis). Free, trivial, but Czech
             quality depends on the user's OS and is often weak. Use as a
             fallback, not the primary voice.
```
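The tree can also be written as a small function, which makes the branch order explicit. This is a sketch of the routing only: `pick_tts` and its parameters (in particular `commercial_priority`, which stands in for the Q2a/Q2b/Q2c choice) are hypothetical names introduced for illustration.

```python
def pick_tts(
    *,
    offline: bool = False,
    need_cloning: bool = False,
    commercial: bool = False,
    commercial_priority: str = "license",
    browser_only: bool = False,
) -> str:
    """Follow the first branch that matches, in the same order as the tree."""
    # Q1: hard offline requirement.
    if offline:
        # XTTS v2 weights are non-commercial (CPML): personal/internal use only.
        return "XTTS v2" if need_cloning else "Piper"

    # Q2: commercial use needs a clean license.
    if commercial:
        picks = {
            "naturalness": "ElevenLabs",      # Q2a
            "license": "Azure AI Speech",     # Q2b, the recommended default
            "free_tier": "Google Cloud TTS",  # Q2c
        }
        if commercial_priority not in picks:
            raise ValueError(f"unknown priority: {commercial_priority!r}")
        return picks[commercial_priority]

    # Q4: pure client-side, zero backend (this implies answering "no" to Q3).
    if browser_only:
        return "Web Speech API"

    # Q3: the default free path for personal projects and prototypes.
    return "edge-tts"


if __name__ == "__main__":
    print(pick_tts())                                 # personal prototype
    print(pick_tts(commercial=True))                  # shipped product
    print(pick_tts(offline=True, need_cloning=True))  # offline with cloning
```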
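In short: for a personal project or prototype, use edge-tts. For anything commercial, use Azure AI Speech (the same voices as edge-tts, with a license), or Google Cloud TTS if you want the bigger always-free tier.

It is worth stating plainly because it's the crux of this whole report: the 2025–2026 open-source TTS renaissance (Kokoro, Orpheus, Sesame CSM, Chatterbox) is overwhelmingly English-first. Their headline naturalness is real, but the moment you apply the Czech filter, they drop out: they ship with no Czech at all, or with Czech that is unverified or poor.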
That is exactly why the practical Czech winners are the cloud neural voices (Azure/Edge/Google) and, for offline, the older multilingual workhorses (Piper, XTTS) rather than the flashy new English models. Don't be seduced by an English MOS leaderboard — grade Czech first, every time.
The fastest way to validate the headline pick in your own project:
edge-tts (free, EN + CZ, no key), from the command line and from Python:

```bash
pip install edge-tts
# List Czech voices:
edge-tts --list-voices | grep cs-CZ
# Synthesize Czech:
edge-tts --voice cs-CZ-VlastaNeural \
  --text "Dobrý den, toto je test české syntézy řeči." \
  --write-media cz.mp3
# Synthesize English:
edge-tts --voice en-US-AriaNeural \
  --text "Hello, this is an English test." \
  --write-media en.mp3
```

```python
import asyncio
import edge_tts

async def speak(text, voice, out):
    # Synthesize `text` with the given voice and save the audio to `out` (MP3).
    await edge_tts.Communicate(text, voice).save(out)

asyncio.run(speak("Ahoj světe!", "cs-CZ-AntoninNeural", "hello_cs.mp3"))
asyncio.run(speak("Hello world!", "en-GB-RyanNeural", "hello_en.mp3"))
```
Azure AI Speech (same voices, licensed) — when you outgrow the free endpoint, swap to the Azure Speech SDK with the identical voice names (cs-CZ-VlastaNeural, cs-CZ-AntoninNeural). The migration is a few lines, which is the whole point of starting on edge-tts.
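A minimal sketch of what that swap might look like, assuming the `azure-cognitiveservices-speech` package is installed and you have a Speech resource key and region. The environment variable names `SPEECH_KEY` and `SPEECH_REGION` and the helper names are assumptions of this sketch; check the current SDK documentation before relying on it.

```python
import os


def azure_settings(env=os.environ):
    """Read key and region; the variable names are this sketch's assumption."""
    return env["SPEECH_KEY"], env["SPEECH_REGION"]


def azure_speak(text: str, voice: str, out_path: str, key: str, region: str) -> str:
    """Synthesize `text` to a WAV file with the Azure Speech SDK."""
    # Imported lazily so this module loads even without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.SpeechConfig(subscription=key, region=region)
    config.speech_synthesis_voice_name = voice  # same IDs as with edge-tts
    audio = speechsdk.audio.AudioOutputConfig(filename=out_path)
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=config, audio_config=audio)

    result = synthesizer.speak_text_async(text).get()
    if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
        raise RuntimeError(f"Azure synthesis failed: {result.reason}")
    return out_path


if __name__ == "__main__":
    key, region = azure_settings()
    azure_speak("Ahoj světe!", "cs-CZ-AntoninNeural", "hello_cs.wav", key, region)
```

Only the voice name carries over unchanged from the edge-tts example; authentication and the output format are the parts that differ.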
Piper (fully offline) — bash:
```bash
# Download a Czech voice model (.onnx + .onnx.json) from the Piper voices repo,
# then:
echo "Toto je test offline syntézy." | \
  piper --model cs_CZ-*.onnx --output_file cz_offline.wav
```
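To call Piper from your own software rather than a shell, one option is to drive the same CLI through `subprocess`. This is a sketch that assumes the `piper` executable is on your `PATH` and uses only the two flags shown above; the model path in the demo is a hypothetical placeholder, and `piper_command` / `piper_speak` are illustrative names.

```python
import subprocess
from pathlib import Path


def piper_command(model_path, out_path):
    """Build the argument list for the Piper CLI (same flags as the shell example)."""
    return ["piper", "--model", str(model_path), "--output_file", str(out_path)]


def piper_speak(text: str, model_path, out_path) -> Path:
    """Pipe `text` to Piper on stdin and write a WAV file to `out_path`."""
    subprocess.run(
        piper_command(model_path, out_path),
        input=text.encode("utf-8"),  # Piper reads the text from stdin
        check=True,                  # raise CalledProcessError on failure
    )
    return Path(out_path)


if __name__ == "__main__":
    # Hypothetical path: point this at the .onnx voice file you downloaded.
    piper_speak("Toto je test offline syntézy.", "path/to/czech-voice.onnx", "cz_offline.wav")
```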
You are a technical developer adding TTS to your own software, you need natural English and natural Czech, and you want free or near-free. Given everything above:
Start with edge-tts. It is free, has no API key, sounds Azure-grade in both English and Czech, and is ~5 lines to integrate. It directly hits your stated sweet spot.

Architect the integration around a thin synthesize(text, lang) -> audio interface so you can prototype on edge-tts today and swap to Azure/Google/ElevenLabs later without touching the rest of your app.
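One possible shape for that thin interface is sketched below. It is an illustration under assumptions, not a definitive design: `Backend`, `DEFAULT_VOICES`, `edge_backend`, and `make_synthesizer` are hypothetical names, the language-to-voice mapping reuses voice IDs from this chapter, and the edge-tts backend assumes the library's streaming interface yields audio chunks as dictionaries with `type` and `data` keys.

```python
from typing import Callable, Dict, Optional

# A backend turns (text, voice name) into encoded audio bytes.
Backend = Callable[[str, str], bytes]

# Voice IDs from this chapter; edge-tts and Azure share the same names,
# so this mapping survives a backend swap.
DEFAULT_VOICES: Dict[str, str] = {"cs": "cs-CZ-VlastaNeural", "en": "en-US-AriaNeural"}


def edge_backend(text: str, voice: str) -> bytes:
    """edge-tts backend; imported lazily so other backends do not require it."""
    import asyncio
    import edge_tts

    async def collect() -> bytes:
        audio = bytearray()
        async for chunk in edge_tts.Communicate(text, voice).stream():
            if chunk["type"] == "audio":
                audio.extend(chunk["data"])
        return bytes(audio)

    # Note: asyncio.run cannot be called from inside an already running loop.
    return asyncio.run(collect())


def make_synthesizer(backend: Backend, voices: Optional[Dict[str, str]] = None):
    """Bind a backend and a language-to-voice map into synthesize(text, lang)."""
    voice_map = dict(DEFAULT_VOICES if voices is None else voices)

    def synthesize(text: str, lang: str) -> bytes:
        if lang not in voice_map:
            raise ValueError(f"no voice configured for language {lang!r}")
        return backend(text, voice_map[lang])

    return synthesize


if __name__ == "__main__":
    synthesize = make_synthesizer(edge_backend)
    with open("hello_cs.mp3", "wb") as f:
        f.write(synthesize("Ahoj světe!", "cs"))
```

The rest of the application only ever sees `synthesize(text, lang)`, so moving to a licensed backend means passing a different function to `make_synthesizer`.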
The free sweet spot is edge-tts: Azure-grade Czech/English neural voices, no key, trivial setup, with the caveat that it's an unofficial, no-SLA, no-commercial-license endpoint.

Azure AI Speech offers the same voices as edge-tts (same voice IDs, 500k chars/mo free), making it the recommended commercial default and a near-zero-friction upgrade path.

Keep the integration behind a thin interface (synthesize(text, lang)) so you can start free on edge-tts and migrate without rewrites.