This chapter compares the major commercial cloud text-to-speech (TTS) APIs you can call from your own software to read text aloud in English and Czech. The reader's hard filter is that Czech (cs-CZ) must sound genuinely natural, and this is exactly where most "best" TTS models fall down: the headline-grabbing systems (Deepgram Aura, PlayHT's flagship models and, in practice, Cartesia Sonic) are overwhelmingly English-first.
For each provider we grade five things explicitly:
- English naturalness
- Czech support: does cs-CZ exist, in which voice tier, and how good is it (this is graded honestly and separately)
- Voice cloning
- Streaming and latency
- Free tier and pricing

The bottom line up front: for the sweet spot of free or near-free plus natural Czech, the realistic cloud contenders are Google Cloud TTS (Chirp 3 HD), Microsoft Azure Neural, and ElevenLabs (Multilingual v2 / Turbo v2.5 / Flash v2.5), with Amazon Polly as a budget fallback and OpenAI as a wildcard (great quality, undocumented/inconsistent Czech, no cs-CZ voice picker). Deepgram, Cartesia, and PlayHT's flagship models are essentially disqualified for Czech as of mid-2026.
Pricing and language tables below were verified live against vendor pages on 2026-05-31. TTS pricing and free tiers change often — treat all numbers as "verified at writing time, re-check the linked pricing page before you commit."
Almost every cloud TTS API bills by characters of input text (not audio minutes), usually quoted per 1 million characters. Rough conversions to keep in your head:
ElevenLabs is the exception: it bills in "credits," where (for most models) 1 character ≈ 1 credit, but the low-latency Flash/Turbo models cost 0.5 credit/char. OpenAI historically quoted per 1M characters for tts-1/tts-1-hd and per minute of audio (token-based) for gpt-4o-mini-tts. We normalize everything to "$ per 1M characters" where possible.
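As a rough illustration of that normalization, the sketch below turns a monthly character count into an estimated bill. The function name `monthly_cost_usd`, its parameters, and the sample prices and free quotas are illustrative assumptions rather than vendor-confirmed figures; a "unit" here means whatever the provider bills (characters for most APIs, credits for ElevenLabs).

```python
def monthly_cost_usd(
    chars: int,
    price_per_million_units: float,
    free_units: float = 0.0,
    units_per_char: float = 1.0,
) -> float:
    """Estimate one month's TTS bill in USD.

    A "unit" is whatever the provider bills: characters for most APIs,
    credits for ElevenLabs (units_per_char=0.5 for Flash/Turbo).
    """
    # Convert characters to billable units, then subtract the free allowance.
    billable_units = max(0.0, chars * units_per_char - free_units)
    return billable_units / 1_000_000 * price_per_million_units


# Illustrative inputs only; re-check current vendor pricing before relying on them.
chirp = monthly_cost_usd(1_500_000, 30.0, free_units=1_000_000)
azure = monthly_cost_usd(1_500_000, 15.0, free_units=500_000)
print(f"Chirp 3 HD: ${chirp:.2f}  Azure Neural: ${azure:.2f}")
```

With these assumed inputs, 1.5M characters a month comes to about $15 on either service: the larger free quota on one side offsets the lower per-character rate on the other.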
English: ElevenLabs is widely regarded as the most natural-sounding commercial TTS in 2025–2026, especially for expressive long-form reading and audiobooks. Its Multilingual v2 model is the quality reference; Turbo v2.5 trades a little quality for ~250–300 ms latency; Flash v2.5 targets ~75 ms model latency for real-time use.
Czech (cs-CZ): Supported and good. ElevenLabs lists Czech among its 70+ languages (Multilingual v2, Turbo v2.5, and Flash v2.5 all cover Czech), and has a dedicated Czech TTS/voiceover product page. Czech naturalness is among the best available from any cloud API — correct stress and reasonably natural prosody on ordinary prose, clearly better than the older Google WaveNet / Amazon Polly neural Czech voices. It's not flawless (occasional foreign-accent coloring on some cloned/multilingual voices, and numbers/abbreviations can need normalization), but for "natural Czech from a cloud API," ElevenLabs is at or near the top.
Voice cloning: Yes — Instant Voice Cloning (a few seconds of audio, available from paid tiers) and Professional Voice Cloning (30+ min, higher tiers). This is one of ElevenLabs' main differentiators.
Streaming/latency: Yes, WebSocket + HTTP streaming; Flash v2.5 ~75 ms model latency.
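Free tier & pricing (verified 2026-05-31): the free tier is about 10k credits/month for non-commercial use; paid usage works out to roughly $150–300 per 1M characters (credit-based), the priciest of the providers compared here.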
Verdict for the reader: Best Czech quality, but the worst price-per-character of the mainstream options. Good if you read modest volumes and want the best voice; bad if you'll synthesize hours of text and care about "free/near-free."
OpenAI (tts-1, tts-1-hd, gpt-4o-mini-tts): great voices, murky Czech

English: Very natural, especially gpt-4o-mini-tts, which adds steerability: you can pass an instructions field ("speak in a calm, warm tone") to control delivery. Voices: alloy, echo, fable, onyx, nova, shimmer, plus newer ones (ash, coral, sage, ballad, etc.).
Czech (cs-CZ): Unofficial / not a first-class feature. OpenAI does not expose a cs-CZ voice or a language selector — the model auto-detects language from the input text. It can read Czech text and it often sounds passable, but OpenAI does not document Czech as supported, quality is inconsistent run-to-run, and there's an audible English-accent tendency on Czech because the voices are English-trained. Grade: works in a pinch, not reliable for "must sound natural in Czech." Verify with your own Czech samples before depending on it.
Voice cloning: No public custom-voice/cloning in the standard TTS API (the fixed voice set only).
Streaming/latency: Streaming supported (chunked/SSE; realtime via the Realtime API). Latency is good but not the lowest.
Free tier & pricing (verified 2026-05-31):
- tts-1: ~$15 / 1M characters; tts-1-hd: ~$30 / 1M characters.
- gpt-4o-mini-tts: billed by audio tokens, roughly $0.015 per minute of audio (≈ a few dollars per million characters equivalent, often cheaper than tts-1-hd). Confirm on the pricing page; OpenAI changed audio billing units in 2025.

Verdict: Excellent English and cheap-ish, but Czech is officially unsupported and accent-prone; it fails the reader's hard Czech filter unless your own listening tests say otherwise.
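For running your own Czech listening test, a minimal sketch of a gpt-4o-mini-tts call with the instructions field might look like the following. The helper names `build_speech_request` and `synthesize` are hypothetical, the voice choice is arbitrary, and the request carries no language or locale parameter because the API does not expose one.

```python
# pip install openai ; set OPENAI_API_KEY
from pathlib import Path


def build_speech_request(text: str, voice: str = "coral") -> dict:
    """Assemble keyword arguments for the speech endpoint (no network call)."""
    return {
        "model": "gpt-4o-mini-tts",
        "voice": voice,
        "input": text,
        # Steers delivery only; the language is inferred from the input text.
        "instructions": "Speak in a calm, warm tone.",
    }


def synthesize(text: str, out_path: str = "out.mp3") -> Path:
    """Send the request and stream the resulting audio to a file."""
    # Imported lazily so build_speech_request works without the SDK installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    request = build_speech_request(text)
    with client.audio.speech.with_streaming_response.create(**request) as response:
        response.stream_to_file(out_path)
    return Path(out_path)
```

Calling `synthesize("Dobrý den, toto je test českého hlasu.")` should write `out.mp3`; run it several times on the same Czech text, since quality can vary from run to run.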
English: Google Cloud TTS offers multiple tiers. Quality ladder (worst→best): Standard → WaveNet → Neural2 → Studio (premium narration, limited locales) → Chirp 3 HD (2025 flagship, very natural, expressive). There's also Polyglot and instant custom voice on some tiers.
Czech (cs-CZ): Strong and broad — explicitly confirmed. Czech is available across Standard, WaveNet, Neural2, and — critically — Chirp 3 HD. Google's docs explicitly list cs-CZ among the Chirp 3 HD locales (the recently-added batch includes bg-BG, cs-CZ, et-EE, el-GR, hr-HR, hu-HU, lt-LT, lv-LV, ro-RO, sk-SK, sl-SI, sr-RS), served by Chirp 3 HD's 8 named voices (4 male / 4 female) across ~30+ languages. Neural2 Czech is solid; Chirp 3 HD Czech is a clear step up in naturalness. Caveat: for cs-CZ, Chirp 3 HD currently does not support pause control or custom pronunciations (per Google's docs) — fine for plain reading, a limitation if you need SSML-style fine control. This is the most important finding for the reader: Google gives you genuinely modern, natural Czech and a generous free tier.
Voice cloning: Instant Custom Voice is available (consent-gated, enterprise-flavored), but not as frictionless as ElevenLabs. For most readers the prebuilt Chirp 3 HD voices are enough.
Streaming/latency: Streaming synthesis supported (including a low-latency streaming endpoint for Chirp 3 HD); good latency.
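Free tier & pricing (verified 2026-05-31, per month, per voice tier): the free quota is 1M characters/month on WaveNet, Neural2, and Chirp 3 HD, and 4M on Standard. Paid rates per 1M characters: Standard ~$4, WaveNet/Neural2 ~$16, Chirp 3 HD ~$30, Studio ~$160.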
Verdict: Top recommendation for the reader's sweet spot. Natural Czech (Chirp 3 HD), 1M free chars/month on the good tiers, and ~$16–30/1M after that. If you stay under ~1M Czech characters a month, it's effectively free with good-to-excellent quality.
English: Azure AI Speech has a large neural voice catalog that sounds very natural; newer HD / multilingual voices push quality higher and add cross-lingual delivery.
Czech (cs-CZ): Supported with native neural voices — notably cs-CZ-AntoninNeural (male) and cs-CZ-VlastaNeural (female). These are real, native Czech neural voices with good naturalness (comparable to Google Neural2; very usable for reading prose). Azure also has multilingual HD voices that can speak Czech, though native cs-CZ neural voices are the safer bet for accent-free Czech. Grade: solid, reliable native Czech — a strong #2 behind ElevenLabs/Google Chirp 3 HD on quality, but excellent on price.
Voice cloning: Yes — Custom Neural Voice (gated behind an application/approval process for responsible-AI reasons; more enterprise friction than ElevenLabs).
Streaming/latency: Yes, real-time streaming SDK; low latency; good for interactive use.
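Free tier & pricing (verified 2026-05-31): 0.5M free characters/month; paid rates are roughly $15 per 1M characters for Neural and $30 for HD.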
Verdict: Excellent near-free Czech option. 0.5M free chars/month plus genuine native Czech voices (Antonin/Vlasta). Slightly behind Google on free-tier volume (0.5M vs 1M) and arguably on top-end naturalness, but rock-solid and well-documented.
English: Amazon Polly's Neural voices (and the 2024+ Generative and Long-Form voices) are good; Generative is the most natural Polly tier. English is competitive.
Czech (cs-CZ): Supported but weak/dated. Polly's Czech voice is Jitka (cs-CZ, female), and historically Czech has been standard-engine only — i.e., the older concatenative/parametric quality, not the natural neural voices. As of mid-2026 Polly's neural/generative coverage for Czech is limited or absent (verify the voice list for your region). Grade: usable for utility readouts, but it does NOT sound as natural as Google/Azure/ElevenLabs Czech. This is the trap to avoid if naturalness is the priority.
Voice cloning: No custom/cloning in Polly (fixed voices only).
Streaming/latency: Yes, supports streaming.
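Free tier & pricing (verified 2026-05-31, AWS Free Tier = first 12 months): the free tier covers 1M neural and 5M standard characters per month during the first 12 months. Paid rates per 1M characters: Standard ~$4, Neural ~$16, Generative ~$30, Long-Form ~$100.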
Verdict: Cheapest mainstream option and great for English, but Czech is standard-tier and noticeably less natural — fails the hard Czech filter for a reading feature where quality matters.
English: PlayHT's flagship models (Play 3.0 mini, PlayDialog) sound very natural and are tuned for low-latency conversational/agent use.
Czech (cs-CZ): Weak. PlayHT markets Czech support and has a "Czech AI voice generator" page, but its best/flagship models cover only English and a handful of major languages; Czech, where offered, comes from older multilingual voices and is not on par with Google/Azure/ElevenLabs. Treat PlayHT Czech as utility-grade and verify by ear.
Voice cloning: Yes (instant + high-fidelity cloning) — a selling point.
Streaming/latency: Yes, low-latency streaming aimed at voice agents.
Free tier & pricing: Limited free trial; paid plans (Creator ~$31/mo and up) and API pricing roughly in the $10–30/1M-character-equivalent range depending on model/tier. (Pricing structure changes frequently; verify.)
Verdict: Good for English agents and cloning; not a serious natural-Czech option.
English: Deepgram's Aura-2 is an enterprise TTS model optimized for very low-latency voice agents and is priced aggressively (~$30/1M chars, sometimes quoted lower). Quality is good for conversational English.
Czech (cs-CZ): Not supported. Aura/Aura-2 TTS covers only ~7 languages — English, Spanish, German, French, Dutch, Italian, Japanese. No Czech TTS. (Note: Deepgram did add Czech to its speech-to-text in late 2025 — don't confuse that with TTS; the read-aloud side has no Czech.) Disqualified for this reader.
Voice cloning: No public cloning.
Verdict: Great for English real-time agents; irrelevant for Czech.
English: Cartesia's Sonic is known for extremely low latency (~40–90 ms time-to-first-audio) and strong naturalness, which makes it a favorite for real-time voice agents. Sonic-3 (2025) claims state-of-the-art expressiveness.
Czech (cs-CZ): Not clearly supported / unverified. Cartesia advertises a growing multilingual list (English, French, German, Spanish, Portuguese, Chinese, Japanese, and more — on the order of ~15–40 languages depending on model/version). Czech is not clearly on the official supported list as of mid-2026 — if it appears, treat it as new/experimental and test heavily. Grade: assume no reliable Czech until you confirm in the docs.
Voice cloning: Yes (instant voice cloning) — a strength.
Streaming/latency: Best-in-class low latency; great for agents.
Free tier & pricing: Free tier (limited monthly credits) + paid plans; API pricing roughly $15–40 / 1M characters depending on tier/model. Verify.
Verdict: Excellent for English/realtime; Czech is not a confirmed strength — don't bet on it without verifying.
Czech grades: ✅ native & natural · 🟡 supported but dated/utility-grade · ⚠️ works unofficially / accent-prone · ❌ not supported.
| Provider / model | English naturalness | Czech (cs-CZ) | Voice cloning | Streaming / latency | Free tier | Price /1M chars (paid) |
|---|---|---|---|---|---|---|
| ElevenLabs (Multilingual v2 / Turbo / Flash) | Top-tier | ✅ Best cloud Czech; v2/Turbo/Flash all cover cs | ✅ Instant + Professional | ✅ Flash ~75 ms | ~10k credits/mo, non-commercial | ~$150–300 (credit-based, priciest) |
| Google Cloud TTS (Chirp 3 HD / Neural2) | Excellent (Chirp 3 HD) | ✅ Native, incl. Chirp 3 HD | 🟡 Instant Custom Voice (gated) | ✅ Streaming, low latency | 1M chars/mo (WaveNet/Neural2/Chirp); 4M (Standard) | Std ~$4 · Neural2/WaveNet ~$16 · Chirp 3 HD ~$30 · Studio ~$160 |
| Azure AI Speech (Neural / HD) | Excellent | ✅ Native: Antonin, Vlasta | ✅ Custom Neural Voice (gated) | ✅ Real-time SDK | 0.5M chars/mo | Neural ~$15 · HD ~$30 |
| OpenAI (gpt-4o-mini-tts / tts-1-hd) | Excellent, steerable | ⚠️ Unofficial, accent-prone, no cs picker | ❌ | ✅ Streaming/Realtime | None (signup credit only) | tts-1 ~$15 · tts-1-hd ~$30 · mini-tts ~$0.015/min |
| Amazon Polly (Neural / Generative) | Good–excellent (Generative) | 🟡 Jitka, standard-engine only | ❌ | ✅ | 12-mo: 1M neural, 5M std | Std ~$4 · Neural ~$16 · Generative ~$30 · Long-form ~$100 |
| PlayHT (Play 3.0 / PlayDialog) | Excellent (English) | 🟡 Legacy voices only, utility-grade | ✅ | ✅ Low latency | Limited trial | ~$10–30 (varies) |
| Deepgram Aura-2 | Good (agents) | ❌ No Czech TTS (~7 languages: en, es, de, fr, nl, it, ja) | ❌ | ✅ Very low latency | Trial credits | ~$30 |
| Cartesia Sonic-3 | Excellent, lowest latency | ⚠️ Not confirmed — verify | ✅ Instant | ✅ Best-in-class latency | Limited free credits | ~$15–40 |
All figures verified 2026-05-31 against vendor pricing/voice pages; some (Chirp 3 HD, Azure HD, Polly Generative, PlayHT, Cartesia) move frequently — re-check before committing.
In honest order of Czech naturalness:
For the reader's stated goal (natural Czech + free/near-free): start with Google Chirp 3 HD, with Azure Neural as the backup. Both keep you at $0 under ~0.5–1M Czech characters/month, which is a lot of reading. Reserve ElevenLabs for the texts where you want the very best voice and can accept the per-character cost.
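That primary-plus-backup order could be automated along these lines. This is a minimal sketch, assuming you track your own monthly usage: `FreeTier`, `pick_provider`, and the provider labels are hypothetical names, the quotas are the free-tier figures quoted in this chapter, and `len(text)` only approximates how each vendor counts billable characters.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FreeTier:
    name: str
    monthly_free_chars: int
    used_chars: int = 0

    def remaining(self) -> int:
        """Free characters left this month (never negative)."""
        return max(0, self.monthly_free_chars - self.used_chars)


def pick_provider(text: str, tiers: List[FreeTier]) -> Optional[str]:
    """Return the first provider, in priority order, whose free quota fits the text."""
    needed = len(text)
    for tier in tiers:
        if tier.remaining() >= needed:
            tier.used_chars += needed  # record the usage against that quota
            return tier.name
    return None  # every free tier is exhausted; the caller decides whether to pay


# Priority order follows the recommendation: Google first, Azure as backup.
tiers = [
    FreeTier("google-chirp3-hd", 1_000_000),
    FreeTier("azure-neural", 500_000),
]
print(pick_provider("Dobrý den, toto je test českého hlasu.", tiers))
```

Returning `None` rather than silently falling through to a paid tier keeps the decision to spend money explicit, for example reserving ElevenLabs for the texts that justify its per-character cost.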
Google Cloud TTS — Chirp 3 HD, Czech (Python):
```python
# pip install google-cloud-texttospeech ; set GOOGLE_APPLICATION_CREDENTIALS
from google.cloud import texttospeech as tts

client = tts.TextToSpeechClient()
resp = client.synthesize_speech(
    input=tts.SynthesisInput(text="Dobrý den, toto je test českého hlasu."),
    voice=tts.VoiceSelectionParams(
        language_code="cs-CZ",
        name="cs-CZ-Chirp3-HD-Charon",  # verify exact Chirp 3 HD cs-CZ voice name in docs
    ),
    audio_config=tts.AudioConfig(audio_encoding=tts.AudioEncoding.MP3),
)

# Write the returned MP3 bytes to disk; the context manager closes the file.
with open("out.mp3", "wb") as f:
    f.write(resp.audio_content)
```
Azure AI Speech — Czech neural voice (Python):
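```python
# pip install azure-cognitiveservices-speech
import azure.cognitiveservices.speech as speechsdk

cfg = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="westeurope")
cfg.speech_synthesis_voice_name = "cs-CZ-VlastaNeural"  # or cs-CZ-AntoninNeural

# With no audio_config given, the SDK plays through the default speaker.
synth = speechsdk.SpeechSynthesizer(speech_config=cfg)
result = synth.speak_text_async("Dobrý den, toto je test českého hlasu.").get()

# Fail loudly if synthesis was canceled (bad key, wrong region, unknown voice).
if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
    raise RuntimeError(f"Synthesis failed: {result.reason}")
```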
ElevenLabs — Multilingual v2, Czech (Python):
```python
# pip install elevenlabs
from elevenlabs import save
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_KEY")
audio = client.text_to_speech.convert(
    voice_id="VOICE_ID",  # pick a multilingual voice from your library
    model_id="eleven_multilingual_v2",  # or eleven_turbo_v2_5 / eleven_flash_v2_5 (0.5 credit/char)
    text="Dobrý den, toto je test českého hlasu.",
    output_format="mp3_44100_128",
)
save(audio, "out.mp3")  # writes the returned audio chunks to a file
```