This is the money chapter. You want the most natural-sounding English-and-Czech TTS that stays free or near-free, and "near-free" only means something once you can compare apples to apples. So this chapter does three things:
A note on the conversion factor used throughout: spoken audio runs ~150 words/min, English averages ~5 characters/word including the trailing space, so ~750 characters ≈ 1 minute and ~45,000 characters ≈ 1 hour of audio (Voices.com / Speechify converters confirm ~750 chars/min at 150 WPM). Czech words are slightly longer on average but the rate of speech is comparable, so the same ~45k chars/hour rule is a fine planning approximation for both languages. Every "$/hour of audio" figure below is derived as ($/1M chars) × 0.045. (If your TTS reads faster, say ~180 WPM ≈ 900 chars/min, each hour of audio consumes ~54,000 characters, so per-character billing costs about 20% more per audio-hour; I use the 750 figure throughout.)
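As a quick sanity check, the conversion can be written as a couple of helper functions. This is a minimal sketch of the rule of thumb above; the function and constant names are illustrative, not part of any vendor SDK.

```python
# Planning constants from the rule of thumb: ~150 words/min x ~5 chars/word.
CHARS_PER_MINUTE = 750
CHARS_PER_HOUR = CHARS_PER_MINUTE * 60  # 45,000


def audio_hours(chars, chars_per_minute=CHARS_PER_MINUTE):
    """Approximate hours of audio produced by `chars` characters of text."""
    return chars / (chars_per_minute * 60)


def usd_per_audio_hour(usd_per_million_chars, chars_per_minute=CHARS_PER_MINUTE):
    """Convert a per-character list price into dollars per hour of audio."""
    chars_per_hour = chars_per_minute * 60
    return usd_per_million_chars * chars_per_hour / 1_000_000


if __name__ == "__main__":
    print(f"1M chars is about {audio_hours(1_000_000):.1f} hours of audio")
    print(f"$16/1M chars is about ${usd_per_audio_hour(16):.2f} per audio-hour")
```

Passing a different `chars_per_minute` lets you re-run the numbers for your own voice and speaking rate.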
⚠️ Prices change. All figures are list prices verified against vendor pricing pages in May 2026, rounded for readability. Treat them as planning numbers, re-check the live pricing page before committing, and note that several vendors bill in "credits" or "bytes" rather than characters — I flag those.
These are the perpetual-or-trial free allowances. The distinction matters enormously: a perpetual monthly free tier can make a low-volume "read my texts" project genuinely $0 forever, while a 12-month trial just delays the bill.
| Provider | Free allowance | Type | Czech? | Catch / notes |
|---|---|---|---|---|
| Google Cloud TTS | 4M chars/mo (Standard & WaveNet); 1M chars/mo (Neural2 / Chirp 3 HD) | Perpetual, monthly | Yes (cs-CZ neural) | Per-product free tiers; needs billing account on file |
| Azure AI Speech | 500k chars/mo (Neural) on Free (F0) tier | Perpetual, monthly | Yes (cs-CZ: Antonín, Vlasta) | F0 has rate limits; Neural HD voices not free |
| Amazon Polly | 5M chars/mo Standard; 1M chars/mo Neural | 12-month trial only | Standard only — no Neural Czech (see note) | After 12 months, full price |
| ElevenLabs | 10,000 credits/mo (~10 min audio) | Perpetual, monthly | Yes (multilingual v2) | Non-commercial use only; attribution required |
| OpenAI | None (pay-as-you-go from char 1) | — | Partial (accent-tinted) | No free tier; cheap per-minute though |
| PlayAI (PlayHT) | ~12,500 chars trial | Trial | Limited | Small one-time trial |
| Deepgram (Aura) | $200 free credit on signup | One-time credit | No Czech (English-only) | Burns down with use |
Key free-tier reading for your use case: for low-volume personal/text-reading, Google (4M WaveNet + 1M Neural2 chars/month, both free) and Azure's 500k Neural chars/month are the two perpetual free tiers that also support natural Czech. Note a pleasant surprise from the live pricing data: Google's WaveNet tier is only $4/1M and carries a 4M-char/month free allowance (same as Standard), while the newer Neural2 voices are the $16/1M tier with a 1M-char free allowance. So if WaveNet Czech quality is acceptable to you, you get 4M free natural-ish Czech characters/month — very generous. ElevenLabs' free tier sounds the best but is non-commercial, which disqualifies it the moment you ship it inside your software product for others.
Czech caveat on Amazon Polly: Polly's free Neural tier is generous, but Polly does not offer a Czech Neural voice — Czech (cs-CZ) is only available on the older, robotic Standard engine. So Polly's "natural" tier is effectively English-and-friends-only for your purposes. This is the kind of thing that silently kills a plan; verify the voice, not just the language code.
Here is every major paid tier converted to both comparison units. Where a vendor bills in something other than characters (OpenAI bills audio output tokens ≈ per-minute; ElevenLabs bills credits ≈ characters), I converted using the standard factors and label the assumption.
| Service / tier | Native price | $ / 1M chars | $ / hour audio | Czech natural? |
|---|---|---|---|---|
| Google Standard | $4 / 1M chars | $4 | ~$0.18 | weak (robotic) |
| Google WaveNet | $4 / 1M chars | $4 | ~$0.18 | mid (improved) |
| Google Neural2 | $16 / 1M chars | $16 | ~$0.72 | Yes |
| Google Chirp 3 HD | ~$30 / 1M chars* | ~$30 | ~$1.35 | Yes (newest) |
| Google Studio | $160 / 1M chars | $160 | ~$7.20 | limited langs |
| Azure Neural | $16 / 1M chars | $16 | ~$0.72 | Yes |
| Azure Neural HD | $22 / 1M chars (from Mar 2026) | $22 | ~$0.99 | partial |
| Amazon Polly Standard | $4 / 1M chars | $4 | ~$0.18 | Czech = robotic |
| Amazon Polly Neural (NTTS) | $16 / 1M chars | $16 | ~$0.72 | no Czech |
| Amazon Polly Generative | $30 / 1M chars | $30 | ~$1.35 | no Czech |
| OpenAI gpt-4o-mini-tts | ~$0.015 / min audio | ~$20** | ~$0.90 | accent-tinted |
| OpenAI tts-1 / tts-1-hd | $15 / $30 / 1M chars | $15 / $30 | ~$0.68 / ~$1.35 | accent-tinted |
| ElevenLabs Starter | $5/mo → 30k credits | ~$167*** | ~$7.50 | Yes |
| ElevenLabs Creator | $22/mo → ~121k credits | ~$182*** | ~$8.20 | Yes |
| ElevenLabs Pro | $99/mo → ~600k credits | ~$165*** | ~$7.40 | Yes |
| PlayAI (PlayHT) API | ~$0.030 / 1k chars | ~$30 | ~$1.35 | limited |
| Deepgram Aura-2 | ~$0.030 / 1k chars | ~$30 | ~$1.35 | no Czech |
* Google bills Chirp 3 HD per byte of input; UTF-8 Czech with diacritics uses multi-byte characters, so your effective Czech rate can be noticeably higher than the per-character headline. Budget +10–20% for Czech text on byte-billed tiers.
** OpenAI's audio models bill by output audio tokens, which works out to roughly $0.015 per minute of audio; $0.015/min × 60 min ≈ $0.90/hr of audio. The per-character figure is derived from that per-minute rate via the ~750 chars/min factor: $0.015 ÷ 750 chars × 1M ≈ $20/1M chars. Treat it as approximate.
*** ElevenLabs is a subscription, not pure usage. The effective $/1M is the subscription cost divided by included credits (1 credit ≈ 1 char on standard models). Overage and the non-commercial Free tier change the math; this column shows the blended rate at full plan utilization.
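Two of the footnote conversions are easy to reproduce yourself. The sketch below computes the blended subscription rate (plan price divided by included credits, assuming 1 credit ≈ 1 character as stated above) and measures how many UTF-8 bytes a piece of text uses per character. The helper names and sample sentences are illustrative; the Czech sample is a pangram packed with diacritics, so ordinary Czech prose will show a smaller overhead.

```python
def effective_usd_per_million(monthly_usd, included_credits, chars_per_credit=1.0):
    """Blended $/1M chars for a subscription at full plan utilization."""
    return monthly_usd / (included_credits * chars_per_credit) * 1_000_000


def utf8_bytes_per_char(text):
    """Average UTF-8 bytes per character; 1.0 means pure ASCII."""
    return len(text.encode("utf-8")) / len(text)


if __name__ == "__main__":
    # Plan prices and credit counts taken from the table above.
    for plan, usd, credits in [("Starter", 5, 30_000), ("Pro", 99, 600_000)]:
        rate = effective_usd_per_million(usd, credits)
        print(f"{plan}: about ${rate:.0f} per 1M chars")

    sample_en = "The quick brown fox jumps over the lazy dog."
    sample_cs = "Příliš žluťoučký kůň úpěl ďábelské ódy."
    print(f"English bytes/char: {utf8_bytes_per_char(sample_en):.2f}")
    print(f"Czech bytes/char:   {utf8_bytes_per_char(sample_cs):.2f}")
```

Running `utf8_bytes_per_char` over a representative sample of your own Czech text is a better basis for the byte-billing surcharge than any generic percentage.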
What this table says in plain English:
"Local is free" is half-true. The compute is free at the margin, but the fixed costs (hardware, your time) and variable costs (electricity, or serverless GPU $/hr) are real. Let's build the model honestly across the three deployment shapes.
This is the genuine $0-marginal-cost path. Piper (rhasspy/piper) runs faster than real time on a CPU — it even runs on a Raspberry Pi 4 — and Kokoro (82M params) is small enough for snappy CPU inference too.
Verdict: if your texts are English-heavy and "good enough, totally free, totally private" beats "best Czech," CPU-local Piper/Kokoro is unbeatable on cost — essentially $0/hour of audio.
To get natural open-source Czech you generally need a heavier model (e.g. Coqui XTTS-v2, which does Czech and voice cloning, needs ~4 GB VRAM and runs real-time on something like an RTX 3060). Options:
The middle path: no capex, no idle cost, pay per second of GPU time. Verified May-2026 rates:
| Platform / GPU | Rate | ~$ / GPU-hour |
|---|---|---|
| RunPod Serverless RTX 4090 | ~$0.00031/s | ~$1.10 |
| RunPod Serverless A100 80GB | ~$0.00076/s | ~$2.72 |
| RunPod pod RTX 4090 (rented) | — | ~$0.34–0.69 |
| Modal A10G | — | ~$1.10 |
| Modal A100 40GB | — | ~$2.50–3.00 |
Cost per hour of audio on serverless depends on the model's real-time factor (RTF). XTTS on a 4090 synthesizes well faster than real-time — call it ~3–6× faster conservatively. So one hour of audio needs ~10–20 min of GPU wall-time:
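The arithmetic can be sketched as follows, using the RunPod Serverless RTX 4090 rate from the table and the 3–6× speed-up range as inputs. The function name is illustrative, and the model deliberately ignores cold starts and model-loading time, which add to the bill in practice.

```python
def serverless_usd_per_audio_hour(gpu_usd_per_hour, speedup):
    """Cost of one hour of audio when synthesis runs `speedup`x faster than real time.

    `speedup` is audio-seconds produced per GPU-second (the inverse of RTF).
    Cold-start and model-load time are NOT included in this estimate.
    """
    return gpu_usd_per_hour / speedup


if __name__ == "__main__":
    gpu_rate = 1.10  # ~$/GPU-hour, RunPod Serverless RTX 4090 (from the table)
    for speedup in (3, 6):
        gpu_minutes = 60 / speedup
        cost = serverless_usd_per_audio_hour(gpu_rate, speedup)
        print(f"{speedup}x real time: {gpu_minutes:.0f} GPU-min, ${cost:.2f} per audio-hour")
```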
Verdict: serverless GPU lands you at ~$0.20–0.40/hr of audio — below the ~$0.72/hr top cloud-API rate (though above Google's $4/1M WaveNet ≈ $0.18/hr) — with no idle cost and no capex, while letting you run a natural open Czech model you control. The catch is engineering effort (container, model loading, cold starts) and that the open model's Czech must actually be good enough for you.
Now combine it. We compare four strategies at four monthly volumes. Assumptions: top cloud natural rate $16/1M chars (Neural2/Azure Neural); serverless GPU effective ~$0.30/hr of audio; conversion 45k chars = 1 hr of audio, so 1M chars ≈ 22 hr of audio and serverless ≈ $0.30 × 22 ≈ ~$6.60/1M chars.
| Monthly volume | = hours audio | Cloud Neural2 ($16/1M) | Serverless GPU (~$0.30/hr) | Owned GPU box (amortized ~$30/mo + elec) | CPU-local (Piper) |
|---|---|---|---|---|---|
| 100k chars/mo | ~2.2 hr | $1.60 (or $0 in free tier) | ~$0.66 + eng. effort | ~$30 (silly) | $0 |
| 1M chars/mo | ~22 hr | $16 (first 1M Neural2 free → $0; or use $4 WaveNet) | ~$6.60 | ~$31 | $0 |
| 10M chars/mo | ~222 hr | $160 (minus 1M free ≈ $144) | ~$66 | ~$33 | $0 |
| 100M chars/mo | ~2,222 hr | ~$1,584 | ~$660 | ~$40–60 (if box keeps up) | $0 (if CPU keeps up) |
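The table rows can be reproduced with a small cost model, which is handy if your volume or rates differ. This is a sketch under the assumptions stated above ($16/1M chars with 1M free, ~$0.30 per audio-hour on serverless, 45k chars per audio-hour); the function names are illustrative, and the table rounds some results slightly differently.

```python
CHARS_PER_AUDIO_HOUR = 45_000


def cloud_neural2_usd(chars, usd_per_million=16.0, free_chars=1_000_000):
    """Monthly cloud bill after subtracting the monthly free allowance."""
    billable = max(0, chars - free_chars)
    return billable / 1_000_000 * usd_per_million


def serverless_usd(chars, usd_per_audio_hour=0.30):
    """Monthly serverless GPU bill at an effective $/audio-hour rate."""
    return chars / CHARS_PER_AUDIO_HOUR * usd_per_audio_hour


def break_even_chars(fixed_monthly_usd, usd_per_audio_hour=0.30):
    """Monthly volume at which a fixed-cost box matches the serverless bill."""
    return fixed_monthly_usd / usd_per_audio_hour * CHARS_PER_AUDIO_HOUR


if __name__ == "__main__":
    for chars in (100_000, 1_000_000, 10_000_000, 100_000_000):
        print(f"{chars:>11,} chars: cloud ${cloud_neural2_usd(chars):,.2f}, "
              f"serverless ${serverless_usd(chars):,.2f}")
    # Owned box amortized at ~$30/mo (electricity excluded in this sketch).
    print(f"Owned box vs serverless break-even: {break_even_chars(30):,.0f} chars/mo")
```

With these inputs, the owned box overtakes serverless at roughly 4.5M chars/month, before counting electricity or your own time.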
Reading the break-evens:
Let's ground this. If you're reading your own documents/articles/notes:
So a heavy personal-reading habit lands around 300k–1.2M characters/month — comfortably inside the perpetual free tiers. This is the single most important fact in the chapter for you: at your likely volume, natural EN+CZ TTS is free via Google (4M WaveNet + 1M Neural2 chars/mo, both free) or Azure (500k Neural chars/mo), possibly stacked across providers. You don't need to self-host to save money — you'd self-host for privacy, offline use, or because you exceed the free tier.
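Stacking free tiers across providers amounts to tracking how much of each monthly allowance is left and sending each job to the first tier that can absorb it. Below is a minimal sketch of that bookkeeping. The tier names, the preference order, and the `pick_free_tier` helper are hypothetical; the allowances are the ones quoted in this chapter, and no vendor API is called.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FreeTier:
    name: str
    monthly_free_chars: int
    used_chars: int = 0

    @property
    def remaining(self) -> int:
        return max(0, self.monthly_free_chars - self.used_chars)


def pick_free_tier(tiers: list[FreeTier], chars: int) -> Optional[FreeTier]:
    """Return the first tier (in preference order) that can absorb `chars`.

    Records the usage on the chosen tier; returns None if no tier has room.
    """
    for tier in tiers:
        if tier.remaining >= chars:
            tier.used_chars += chars
            return tier
    return None


if __name__ == "__main__":
    # Preference order is an assumption: most natural voices first.
    tiers = [
        FreeTier("google-neural2", 1_000_000),
        FreeTier("azure-neural", 500_000),
        FreeTier("google-wavenet", 4_000_000),
    ]
    for job_chars in (800_000, 400_000, 300_000):
        chosen = pick_free_tier(tiers, job_chars)
        print(job_chars, "->", chosen.name if chosen else "paid tier needed")
```

In a real deployment you would reset `used_chars` at each billing-month boundary and persist the counters, since the providers, not your process, are the source of truth for usage.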
If your software will serve many users (not just you), volumes balloon to 10M+ chars/month and the calculus flips toward serverless GPU with a self-hosted open model (cheapest per unit, ~$0.30/hr) or a paid cloud contract — at which point caching repeated text and choosing the right tier per language matters.
Cache synthesized audio under a key such as hash(text + voice + settings). For "read my texts," re-reads are common; caching can cut billed characters by a large fraction.
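```python
import hashlib
import os


def cache_key(text, voice, fmt="mp3"):
    """Return a deterministic cache path for this (voice, text) pair."""
    h = hashlib.sha256(f"{voice}|{text}".encode("utf-8")).hexdigest()
    return f"tts_cache/{h}.{fmt}"


# Synthesize only on a cache miss. `text` is the input string and `synth` is
# your provider call returning audio bytes; both are supplied by the caller.
path = cache_key(text, "cs-CZ-Neural2")
if not os.path.exists(path):
    audio = synth(text, voice="cs-CZ-Neural2")  # bills here
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(audio)
```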
Cache audio (hash(text+voice) → audio), strip non-spoken characters, and route by language/quality to keep any option near-free.
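Stripping non-spoken characters before synthesis reduces billed characters and also makes cache keys more stable. The sketch below shows one possible normalization pass. Which characters count as "non-spoken" is an assumption here (URLs and a few Markdown markers); adjust the patterns to your own texts.

```python
import re

# Assumed "non-spoken" content: bare URLs and common Markdown markers.
_URL = re.compile(r"https?://\S+")
_MARKUP = re.compile(r"[*`#>]+")
_SPACES = re.compile(r"\s+")


def strip_non_spoken(text):
    """Remove URLs and markup markers, then collapse runs of whitespace."""
    text = _URL.sub(" ", text)
    text = _MARKUP.sub("", text)
    return _SPACES.sub(" ", text).strip()


if __name__ == "__main__":
    raw = "## Ahoj  **světe**, see https://example.com/x now"
    clean = strip_non_spoken(raw)
    print(clean)
    print(f"billed characters: {len(raw)} -> {len(clean)}")
```

Normalize first, then compute the cache key from the cleaned text, so cosmetic differences in the source do not cause cache misses.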