Picking a good-sounding voice is only half the job. The other half is engineering: how do you actually wire TTS into a real web or app project so that it's fast, cheap, robust, and pleasant to listen to — in both English and Czech? This chapter is the practical glue. It covers the three high-level architectures (client-side Web Speech, server-side generation, and a cloud-API proxy), how to stream audio to the browser, how to cache generated audio so you never pay to synthesize the same sentence twice, how to chunk long texts at sentence boundaries and queue them, how to control pronunciation with SSML / lexicons / phoneme overrides, how to normalize messy input (numbers, dates, abbreviations — a special pain point in Czech), which audio formats to ship, and how to highlight words as they are read using word-boundary timestamps. You'll find copy-pastable code for a caching synth endpoint and a streaming frontend player.
Because the reader's hard filter is natural English AND Czech, free-or-near-free, every pattern here is graded for how it interacts with that goal — and word-timing support is called out explicitly, because it varies enormously between engines and is often the deciding factor.
There are really only three places synthesis can happen. Most production apps mix two of them.
| Architecture | Where audio is made | Cost | Czech quality control | Word timings | Best for |
|---|---|---|---|---|---|
| A. Client-side Web Speech API | In the user's browser/OS | Free | Whatever the OS ships (uneven; Czech often present on Win/Android/macOS, weak on Linux) | Yes (boundary events) | Zero-budget MVP, accessibility, "read this page" buttons |
| B. Server-side generation | Your server (local model: Piper/Kokoro/XTTS, or self-hosted API) | Compute only (near-free) | Full control: you choose/finetune the model and lexicons | Depends on engine (Piper/Coqui can emit phoneme times; Kokoro can be coaxed) | Privacy, offline, unlimited volume, consistent voice |
| C. Cloud-API proxy | Vendor cloud (Azure/Google/OpenAI/ElevenLabs), called from your server | $/char or $/min | Vendor-dependent — grade Czech per vendor (see ch. on cloud APIs) | Best-in-class on Azure/Google/ElevenLabs | Highest naturalness with least ops; you proxy + cache |
Key rule: never call a paid cloud TTS directly from the browser. Doing so leaks your API key and removes your ability to cache. Always put a thin proxy endpoint on your server (architecture C) — it hides the key, lets you cache, lets you swap vendors, and lets you add normalization/SSML server-side.
A pragmatic default for the reader's "natural + near-free" target: server-side generation with a local model (B) for the bulk of traffic, optionally falling back to Web Speech (A) if your server is down or the user wants instant playback, with the option to proxy a cloud API (C) for premium voices on demand. Cache aggressively in all cases.
Web Speech API (free, zero infra)

The browser's SpeechSynthesis interface needs no key and no server. The catch is that voices come from the user's OS, so quality and Czech availability vary wildly: Windows (SAPI/Edge voices), macOS/iOS, and Android usually include a Czech voice (cs-CZ); desktop Chrome on Linux often has no Czech voice at all unless one is installed. Always feature-detect and degrade gracefully.
```js
function speak(text, lang = 'cs-CZ') {
  if (!('speechSynthesis' in window)) return false; // no support
  const u = new SpeechSynthesisUtterance(text);
  u.lang = lang;
  u.rate = 1.0;   // 0.1–10, 1 = normal
  u.pitch = 1.0;  // 0–2
  // Prefer a real voice for the requested language (e.g. cs-CZ) if one exists.
  // The list may still be empty on the first call (see the gotchas below).
  const prefix = lang.slice(0, 2).toLowerCase();
  const voice = speechSynthesis.getVoices()
    .find(v => v.lang.toLowerCase().startsWith(prefix));
  if (voice) u.voice = voice;
  // Word highlighting: the boundary event gives char offsets
  u.onboundary = (e) => {
    if (e.name === 'word') highlightAt(e.charIndex, e.charLength);
  };
  speechSynthesis.cancel(); // stop anything queued
  speechSynthesis.speak(u);
  return true;
}
```
Gotchas worth knowing: getVoices() is populated asynchronously — listen for speechSynthesis.onvoiceschanged before reading the list. The boundary event fires for word/sentence boundaries and gives you charIndex (+ charLength in modern browsers), which is exactly what you need to highlight as you read — but support and accuracy are uneven across browsers, and some mobile browsers omit it. Long utterances can be silently truncated or stop after ~15 seconds in some Chromium builds, so chunk into sentences and queue (see §6). Verdict: great for a free fallback and accessibility, not reliable enough to be your only path if natural Czech is a hard requirement.
This is the workhorse. The pattern is identical whether the engine behind it is a local model (Piper, Kokoro, XTTS/Coqui) or a proxied cloud API: normalize → check cache → synthesize on miss → store → serve. The cache key is a hash of everything that affects the output.
```python
# FastAPI: cache-first TTS endpoint (engine-agnostic)
import hashlib, os, tempfile
from typing import Literal
from fastapi import FastAPI
from fastapi.responses import FileResponse
from pydantic import BaseModel

app = FastAPI()
CACHE_DIR = "/data/tts-cache"
os.makedirs(CACHE_DIR, exist_ok=True)
MIME = {"opus": "audio/ogg", "mp3": "audio/mpeg", "wav": "audio/wav"}

class Req(BaseModel):
    text: str
    lang: Literal["cs", "en"] = "cs"
    voice: str = "default"
    rate: float = 1.0
    fmt: Literal["opus", "mp3", "wav"] = "opus"  # validated: it ends up in a file path

def cache_key(r: Req) -> str:
    # ANY parameter that changes the audio must be in the key
    raw = f"{r.text}|{r.lang}|{r.voice}|{r.rate}|{r.fmt}|v3"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def normalize(text: str, lang: str) -> str:
    # Placeholder: see §7 (expand numbers/dates/abbrevs, fix Czech)
    return " ".join(text.split())

def synthesize(text: str, r: Req, out_path: str) -> None:
    # Call Piper / Kokoro / Azure / OpenAI here and write the audio to out_path
    raise NotImplementedError

@app.post("/tts")
def tts(r: Req):
    r.text = normalize(r.text, r.lang)
    key = cache_key(r)
    path = os.path.join(CACHE_DIR, f"{key}.{r.fmt}")
    if not os.path.exists(path):
        # cache miss → pay once; write to a temp file and rename so a
        # concurrent request never serves a half-written clip
        fd, tmp = tempfile.mkstemp(dir=CACHE_DIR, suffix=f".{r.fmt}")
        os.close(fd)
        synthesize(r.text, r, tmp)
        os.replace(tmp, path)
    return FileResponse(path, media_type=MIME[r.fmt],
        headers={"Cache-Control": "public, max-age=31536000, immutable"})
```
Notice three production details baked in: (1) a version suffix (v3) in the key so you can bust the entire cache when you change the model or normalizer; (2) the Cache-Control: immutable header so the browser/CDN never re-requests the same audio (HTTP caches generally store only GET responses, so this pays off once clients fetch the clip from a GET URL rather than through the POST itself); (3) normalization happens before hashing so two strings that read identically share one file.
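The key properties described above can be checked in isolation. The following is a minimal sketch, not the endpoint's actual code: `collapse_ws` is a stand-in normalizer, and the use of a unit-separator character instead of `|` is an assumption made here so that field values containing `|` cannot collide.

```python
import hashlib

KEY_VERSION = "v3"  # bump to invalidate every cached clip at once


def collapse_ws(text: str) -> str:
    """Stand-in normalizer (assumption): only collapses whitespace."""
    return " ".join(text.split())


def cache_key(text: str, lang: str, voice: str, rate: float, fmt: str,
              version: str = KEY_VERSION) -> str:
    # Normalize first so strings that read identically share one key.
    fields = [collapse_ws(text), lang, voice, f"{rate:.2f}", fmt, version]
    # Join with the ASCII unit separator; str.split() treats it as
    # whitespace, so collapse_ws() has already removed it from the text.
    raw = "\x1f".join(fields)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


same_a = cache_key("Dobrý  den", "cs", "default", 1.0, "opus")
same_b = cache_key(" Dobrý den ", "cs", "default", 1.0, "opus")
bumped = cache_key("Dobrý den", "cs", "default", 1.0, "opus", version="v4")
print(same_a == same_b, same_a == bumped)  # True False
```

Formatting the rate with a fixed number of decimals keeps `1` and `1.0` from producing different keys for the same audio.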
Caching is the single highest-leverage optimization. Real text is repetitive — UI strings, headings, common sentences, re-reads of the same article. A content-addressed store (hash → file) means you synthesize each unique normalized string exactly once, ever. Put the files behind a CDN (Cloudflare/CloudFront/Bunny) keyed by the hash in the URL (/audio/<sha256>.opus) and the second listener anywhere in the world pays nothing and waits nothing. For a paid cloud backend this can cut your bill by 90%+; for a local model it removes GPU/CPU latency on repeats. Store the small metadata (key, text, lang, voice, bytes, duration, created_at) in a table so you can do cache analytics and eviction (LRU on last-access if disk is tight).
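As one way to picture the metadata table and LRU eviction, here is a small sketch using SQLite from the standard library. The table name, column layout, and helper names are assumptions for illustration; the caller is expected to delete the audio files for the keys that `evict_lru` returns.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS clips (
    key TEXT PRIMARY KEY,
    text TEXT NOT NULL,
    lang TEXT NOT NULL,
    voice TEXT NOT NULL,
    bytes INTEGER NOT NULL,
    duration REAL NOT NULL,
    created_at REAL NOT NULL,
    last_access REAL NOT NULL
)
"""


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    return db


def record(db, key, text, lang, voice, size, duration, now=None):
    """Register a freshly synthesized clip."""
    now = time.time() if now is None else now
    db.execute("INSERT OR REPLACE INTO clips VALUES (?,?,?,?,?,?,?,?)",
               (key, text, lang, voice, size, duration, now, now))


def touch(db, key, now=None):
    """Mark a cache hit so the clip counts as recently used."""
    now = time.time() if now is None else now
    db.execute("UPDATE clips SET last_access = ? WHERE key = ?", (now, key))


def evict_lru(db, max_bytes):
    """Drop least recently used rows until the total fits; return their keys."""
    total = db.execute("SELECT COALESCE(SUM(bytes), 0) FROM clips").fetchone()[0]
    rows = db.execute(
        "SELECT key, bytes FROM clips ORDER BY last_access ASC").fetchall()
    victims = []
    for key, size in rows:
        if total <= max_bytes:
            break
        db.execute("DELETE FROM clips WHERE key = ?", (key,))
        victims.append(key)
        total -= size
    return victims


db = open_index()
for n, key in enumerate(["a", "b", "c"], start=1):
    record(db, key, "text", "cs", "default", 100, 1.0, now=n)
touch(db, "a", now=4)          # "a" was just played again
print(evict_lru(db, 200))      # ['b']
```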
For short strings, just return the finished file (as above) and set audio.src. For long texts you want playback to start before the whole thing is synthesized — otherwise the user stares at a spinner. Three techniques, in increasing order of complexity:
(a) Simplest: per-sentence files + a queue. Synthesize sentence-by-sentence, return each as a separate small audio file, and play them back-to-back in the browser with an Audio queue (see §6). Prefetch the next 1–2 while the current plays. This is robust, cache-friendly (each sentence is independently cached), and works everywhere. Recommended default.
(b) HTTP chunked / streaming response. Many engines can stream encoded audio as it's produced. OpenAI's /audio/speech, Azure, and ElevenLabs all support streamed responses; you pipe that straight to the client. In the browser, the simplest broadly compatible client still buffers the whole response before playing, which gives up the latency benefit:

```js
// Fetch an audio response, buffer it fully, then play it.
// For true low-latency playback, feed the chunks to MSE or Web Audio instead.
async function playBuffered(url, body) {
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`TTS request failed: ${res.status}`);
  const objectUrl = URL.createObjectURL(await res.blob());
  const audio = new Audio(objectUrl);
  audio.onended = () => URL.revokeObjectURL(objectUrl); // release the blob
  await audio.play();
}
```
For true progressive playback you historically used Media Source Extensions (MSE) — create a MediaSource, append ArrayBuffer chunks to a SourceBuffer as they arrive. MSE is still widely supported and works for streaming Opus/AAC-in-fMP4, but it's fiddly (you must feed properly fragmented containers, and codec strings must match). It is not deprecated as of 2026, but for most TTS use cases the simpler per-sentence queue (a) achieves the same perceived latency with far less code.
(c) WebSocket streaming. ElevenLabs and Azure offer WebSocket APIs that stream audio chunks and timing metadata together, which is ideal when you need word highlighting in real time. You proxy the socket through your server (to hide the key) and forward chunks to the client. Use this only when you genuinely need sub-second first-byte latency plus live word timings.
First-byte latency tip: whatever you choose, the perceived speed is dominated by time-to-first-audio. Synthesize and send the first sentence as fast as possible, then catch up on the rest in the background.
For speech, you almost always want Opus (in an Ogg or WebM container): it's the best quality-per-byte codec for voice, royalty-free, and playable through the <audio> element in current versions of the major browsers (test Safari in particular, where Ogg/WebM Opus support arrived later than elsewhere). MP3 is the universal-compatibility fallback (ancient devices, some embedded webviews). WAV is uncompressed: only use it as an intermediate or when an API requires it.
| Format | Typical speech bitrate | Rel. size (1 min speech) | Browser support | Use when |
|---|---|---|---|---|
| Opus (Ogg/WebM) | 24–32 kbps mono is transparent for voice | ~1× (≈180–240 KB/min) | All modern browsers | Default for web delivery |
| MP3 | 64–96 kbps for comparable quality | ~2–3× | Universal incl. legacy | Fallback / maximum compatibility |
| AAC (m4a) | 48–64 kbps | ~1.5–2× | Broad (esp. Apple) | iOS-heavy audiences, MSE |
| WAV/PCM | 256+ kbps (16-bit/16 kHz) | ~10×+ | All | Intermediate, never ship to users |
Opus at ~24–32 kbps mono is genuinely transparent for a single speaking voice — roughly 2–3× smaller than an equivalent-quality MP3 — so it's both cheaper to store/serve and faster to start. Store two encodings if you must support legacy clients: .opus for modern, .mp3 as fallback, both keyed by the same text hash. Mono and 22–24 kHz sample rate are fine for speech; don't waste bytes on stereo or 48 kHz.
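The size figures follow directly from the bitrates. A quick back-of-the-envelope sketch (raw audio payload only; container overhead adds a little on top, and 1 KB is taken as 1000 bytes here):

```python
def kb_per_minute(bitrate_kbps: float) -> float:
    """Audio payload per minute in kilobytes, ignoring container overhead."""
    return bitrate_kbps * 1000 / 8 * 60 / 1000


def pcm_kbps(sample_rate_hz: int, bits: int = 16, channels: int = 1) -> float:
    """Bitrate of uncompressed PCM in kbps."""
    return sample_rate_hz * bits * channels / 1000


for label, kbps in [("Opus 24", 24), ("Opus 32", 32), ("MP3 64", 64),
                    ("MP3 96", 96), ("WAV 16-bit/16 kHz", pcm_kbps(16000))]:
    print(f"{label:>18}: {kb_per_minute(kbps):6.0f} KB/min")
```

This gives 180 and 240 KB/min for Opus at 24 and 32 kbps, 480 and 720 KB/min for MP3 at 64 and 96 kbps, and 1920 KB/min for 16-bit/16 kHz mono WAV.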
Long inputs must be split — both because many engines cap input length (often a few thousand characters or a token budget) and because you want streaming/early playback. Split at sentence boundaries, never mid-word or mid-number. Naïve split('.') breaks on "Dr.", "tj.", "č.", "10.5", and Czech ordinals written as "5." (which means "fifth"!), so use a real sentence splitter or a guarded regex.
```js
// Split into sentences, then re-pack into chunks under a char budget,
// keeping abbreviations, decimals, and Czech ordinals/dates intact.
// \b is ASCII-only in JS regexes, so a Unicode-aware prefix is needed for "č.".
const ABBR = /(?:^|[^\p{L}\p{N}])(?:tj|atd|apod|mj|tzn|tzv|př|stol|s\.r\.o|a\.s|Dr|Ing|Mgr|Bc|Sv|č|str|odst|písm|sb)\.$/iu;
const NUM_DOT = /\d\.$/; // "10." in "10.5", "23." in "23. 5. 2024"

function splitSentences(text) {
  const parts = text.match(/[^.!?…]+[.!?…]+(?:["')\]]+)?|\S[^.!?…]*$/g) || [text];
  const out = [];
  for (const p of parts) {
    const prev = out[out.length - 1];
    const next = p.trimStart();
    const glued = prev && !/^\s/.test(p);   // no space after the dot: "10.5", "s.r.o."
    const afterAbbr = prev && ABBR.test(prev); // previous "sentence" ended in a known abbreviation
    const afterNum = prev && NUM_DOT.test(prev) && /^[\p{Ll}\d]/u.test(next); // "5. května", "23. 5."
    if (glued || afterAbbr || afterNum) out[out.length - 1] = prev + p;
    else out.push(p.trim());
  }
  return out.filter(Boolean);
}

function chunk(text, maxLen = 600) {
  const sents = splitSentences(text);
  const chunks = []; let buf = '';
  for (const s of sents) {
    if ((buf + ' ' + s).length > maxLen && buf) { chunks.push(buf); buf = s; }
    else buf = buf ? buf + ' ' + s : s;
  }
  if (buf) chunks.push(buf);
  return chunks;
}
```
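A small playback queue with prefetch gives smooth, near-gapless reading and is the most reliable streaming method in practice:

```js
class TTSQueue {
  constructor(getAudioUrl) { this.getAudioUrl = getAudioUrl; this.q = []; this.i = 0; }
  load(text) {
    this.q = chunk(text).map(t => ({ text: t, url: null })); // url becomes a promise on first prefetch
    this.i = 0; this.prefetch(0); this.prefetch(1);
  }
  prefetch(i) {
    const item = this.q[i];
    if (!item) return Promise.resolve(null);
    // Store the promise so overlapping prefetches share one request;
    // a failed fetch resolves to null and the chunk is skipped.
    if (!item.url) item.url = Promise.resolve(this.getAudioUrl(item.text)).catch(() => null);
    return item.url;
  }
  async play() {
    while (this.i < this.q.length) {
      const url = await this.prefetch(this.i); // current chunk
      this.prefetch(this.i + 1);               // warm the next one
      if (url) {
        const a = new Audio(url);
        // Resolve on end, error, or a rejected play() (e.g. autoplay policy) so the loop never hangs
        await new Promise(res => { a.onended = res; a.onerror = res; a.play().catch(res); });
      }
      this.i++;
    }
  }
}
```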
Because each chunk hits your caching endpoint (§3), repeated paragraphs and shared sentences are free on the second pass. Add basic backpressure if you proxy a paid API (limit concurrent in-flight syntheses) and handle errors by skipping or retrying a chunk rather than killing the whole read.
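On the server side, the backpressure and skip-or-retry advice could look roughly like the sketch below. It assumes an async `synth(text)` callable that you supply (hypothetical here, as is `fake_synth`); failed chunks come back as `None` so the rest of the read continues.

```python
import asyncio


async def synthesize_all(chunks, synth, max_in_flight=2, retries=1):
    """Synthesize chunks concurrently under a cap; failed chunks become None."""
    gate = asyncio.Semaphore(max_in_flight)

    async def one(text):
        for attempt in range(retries + 1):
            try:
                async with gate:  # at most max_in_flight calls at once
                    return await synth(text)
            except Exception:
                if attempt == retries:
                    return None  # give up on this chunk, keep the others
                await asyncio.sleep(0.1 * (attempt + 1))  # brief backoff

    # gather() preserves input order, so results line up with chunks
    return await asyncio.gather(*(one(c) for c in chunks))


async def _demo():
    state = {"active": 0, "peak": 0, "failed_once": False}

    async def fake_synth(text):  # hypothetical engine call
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        try:
            await asyncio.sleep(0.01)
            if text == "b" and not state["failed_once"]:
                state["failed_once"] = True
                raise RuntimeError("transient error")
            return f"audio:{text}"
        finally:
            state["active"] -= 1

    results = await synthesize_all(["a", "b", "c", "d"], fake_synth)
    return results, state["peak"]


results, peak = asyncio.run(_demo())
print(results, peak)
```

In the demo the chunk "b" fails once, is retried, and still lands in its original position; the observed peak concurrency never exceeds the cap of 2.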
TTS engines read text, and raw text is full of things that don't map cleanly to speech. "Bylo to 23. 5. 2024 v 14:30 za 1 250 Kč" must become something like "Bylo to dvacátého třetího května dva tisíce dvacet čtyři ve čtrnáct hodin třicet minut za jeden tisíc dvě stě padesát korun." This is text normalization (the "front end" of any TTS system), and it is dramatically harder for Czech than English for three reasons:
Practical guidance: good cloud engines (Azure, Google) and good local Czech voices already do a lot of this internally, so the safest move is to not over-normalize — feed them reasonably clean text and let the engine's Czech front-end handle inflection, because hand-rolling correct Czech grammar is error-prone. Where the engine clearly mishandles something, fix that specific case with SSML (<say-as>, see §8) or with targeted pre-processing. For purely local/offline pipelines (e.g., Piper, which expects already-normalized text via its phonemizer), you'll need more pre-processing: expand common abbreviations from a dictionary, and for numbers consider a Czech-aware library or call out to a normalizer. A defensive baseline normalizer in code:
```python
import re

ABBREV_CS = {
    r"\btj\.": "to jest", r"\bnapř\.": "například",
    r"\batd\.": "a tak dále", r"\bapod\.": "a podobně",
    r"\bč\.\s*": "číslo ", r"\bKč\b": "korun",
    r"°C\b": "stupňů Celsia",  # no leading \b: "°" is not a word character, so "25 °C" would not match
}

def normalize_cs(t: str) -> str:
    t = t.replace("\u00a0", " ")                 # NBSP thousands sep → space
    t = re.sub(r"(?<=\d) (?=\d{3}\b)", "", t)    # "1 250" -> "1250" for the reader
    for pat, rep in ABBREV_CS.items():
        t = re.sub(pat, rep, t, flags=re.IGNORECASE)
    # Let the engine inflect numbers/dates; only collapse whitespace here.
    return re.sub(r"\s+", " ", t).strip()
```
The honest takeaway: Czech normalization is the place most home-grown TTS integrations sound wrong. Test with a fixed set of "hard" strings (prices, dates, times, ordinals, percentages, phone numbers, version numbers) on day one, and prefer engines whose Czech front-end already handles them over hand-rolling grammar.
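A fixed set of hard strings is easy to turn into a regression check. The sketch below is one possible shape, with hypothetical test strings and helper names; it only flags tokens a normalizer left unexpanded (useful for pipelines where you normalize yourself) and says nothing about whether the Czech inflection is correct, which still needs a listening pass.

```python
import re

# Hypothetical hard cases; extend with strings from your own content.
HARD_CASES_CS = [
    "Bylo to 23. 5. 2024 v 14:30 za 1 250 Kč.",
    "Verze 2.4.1 vyšla 5. ledna.",
    "Sleva 15 % platí do 31. 12.",
    "Volejte +420 123 456 789.",
]

# Things a fully normalized string should no longer contain.
LEFTOVER = re.compile(r"\d+|%|°C|\bKč\b|\b(?:tj|např|atd|apod|č)\.")


def leftovers(normalized: str) -> list:
    """Return tokens that still need expansion (digits, symbols, abbreviations)."""
    return LEFTOVER.findall(normalized)


def report(normalize, cases=HARD_CASES_CS) -> dict:
    """Map each case to the tokens the given normalizer failed to expand."""
    return {case: leftovers(normalize(case)) for case in cases}


# With an identity "normalizer", every digit group and symbol is reported.
for case, left in report(lambda s: s).items():
    print(f"{case!r}: {left}")
```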
SSML (Speech Synthesis Markup Language, a W3C standard) is XML you wrap around text to control how it's spoken: pauses, emphasis, rate/pitch, spelling-out, and explicit pronunciation. Support is uneven: Azure and Google have rich SSML; Amazon Polly supports most of it; OpenAI's TTS and ElevenLabs historically support little or none (they take plain text plus their own controls/voice settings); local engines like Piper/Kokoro mostly take plain text (some honor limited markup). Grade SSML support per engine, and don't write SSML your engine ignores.
The most useful SSML tools for an integrator:
```xml
<speak version="1.0" xml:lang="cs-CZ">
  Cena je <say-as interpret-as="cardinal">1250</say-as> korun.
  Sejdeme se <say-as interpret-as="date" format="dmy">23.5.2024</say-as>
  v <say-as interpret-as="time" format="hms24">14:30</say-as>.
  Krátká pauza<break time="400ms"/> a pokračujeme.
  Tohle <emphasis level="strong">zdůrazni</emphasis>.
  <prosody rate="90%" pitch="-1st">Pomalu a níž.</prosody>
  Slovo <phoneme alphabet="ipa" ph="ˈɛlɛnˌbɛrk">Ellenberg</phoneme>.
</speak>
```
- <say-as interpret-as="...">: forces interpretation as cardinal/ordinal/date/time/telephone/characters. This is your scalpel for the normalization problems in §7 when the engine supports it.
- <break time="...">: controls pauses (great for readability of lists and headings).
- <prosody rate/pitch/volume>: rate and pitch control; prefer this over the OS-level rate so it's consistent across engines.
- <emphasis>: relative stress.
- <phoneme alphabet="ipa" ph="...">: per-word phoneme override, the reliable way to fix a mispronounced name, brand, or loanword. Azure and Google support IPA (and SAPI/X-SAMPA on Azure).

Lexicons are reusable pronunciation dictionaries (PLS, the Pronunciation Lexicon Specification, also a W3C standard) you register once and apply to all requests, so you don't repeat <phoneme> everywhere. Azure and Google support custom lexicons (Azure via <lexicon uri="..."/> referencing a hosted .pls file; Google via inline pronunciations / custom voice features). For local engines, the equivalent is editing the phonemizer dictionary. Build a small project lexicon for your domain's proper nouns (product names, place names, people): it's the cheapest way to sound polished, and it's especially valuable in Czech for foreign names that the engine would otherwise read with Czech letter-to-sound rules.
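When SSML is assembled server-side from user text, the main risk is unescaped characters such as `&` or `<` producing invalid XML. The following is a minimal sketch of builder helpers using only the standard library; the helper names are made up for illustration, and whether a given engine honors each tag still has to be checked per engine.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as {attrs}>{escape(text)}</say-as>"


def phoneme(text: str, ipa: str) -> str:
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(text)}</phoneme>'


def pause(ms: int) -> str:
    return f'<break time="{int(ms)}ms"/>'


def speak(*parts: str, lang: str = "cs-CZ") -> str:
    """Wrap already-built fragments; plain text must go through escape() first."""
    return f'<speak version="1.0" xml:lang={quoteattr(lang)}>{"".join(parts)}</speak>'


ssml = speak(
    escape("Cena je "), say_as("1250", "cardinal"), escape(" korun."),
    pause(400),
    escape("Tom & "), phoneme("Ellenberg", "ˈɛlɛnˌbɛrk"), escape("."),
)
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

Parsing the result before sending it is a cheap guard: a malformed request fails locally instead of costing a round trip to the engine.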
Synchronized highlighting ("karaoke" reading) is a flagship UX feature — and whether you can build it depends entirely on whether your engine emits word boundaries / timestamps. Here's the honest landscape:
| Engine | Word-timing mechanism | Notes |
|---|---|---|
| Web Speech API (client) | utterance.onboundary (charIndex, charLength) | Free, real-time; accuracy & support uneven across browsers/mobile |
| Azure Speech | WordBoundary event (+ VisemeReceived, BookmarkReached) with audio offsets in ticks | Excellent, reliable; arguably the best for highlighting + lip-sync |
| Google Cloud TTS | SSML <mark> timepoints returned with the audio | You insert <mark> tags; get back time for each mark — word-level if you mark each word |
| Amazon Polly | Speech marks (separate JSON stream: word/sentence/viseme/ssml) | Mature, well-documented |
| ElevenLabs | Character-level alignment / timestamps (with-timestamps / WS) | Character timings → derive word timings by grouping |
| OpenAI TTS | None (no official word timestamps as of 2026) | If you need highlighting, force-align separately or pick another engine |
| Piper / Kokoro / XTTS (local) | Phoneme/duration data available with effort | Piper exposes phoneme durations; Kokoro needs custom alignment; not turn-key |
If your engine gives character offsets (ElevenLabs, Web Speech), you map them back onto your original text. If it gives word events with audio time offsets (Azure, Polly), you schedule highlights against the playing audio's currentTime. If it gives timepoints from <mark> (Google), you insert a mark before each word, then highlight when each mark's time is reached. A generic client-side scheduler:
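```js
// timings: [{ time: 0.42, charIndex: 0, length: 5 }, ...] sorted by time, aligned to audio
function highlightWhilePlaying(audio, timings, textEl) {
  let i = 0;
  const tick = () => {
    while (i < timings.length && timings[i].time <= audio.currentTime) {
      const t = timings[i];
      markRange(textEl, t.charIndex, t.charIndex + t.length); // wrap in <mark>
      i++;
    }
    // timeupdate fires only a few times per second, which makes word
    // highlights lag; polling once per frame while playing is smoother.
    if (!audio.paused && !audio.ended) requestAnimationFrame(tick);
  };
  audio.addEventListener('play', () => requestAnimationFrame(tick));
  if (!audio.paused) requestAnimationFrame(tick); // already playing
}
```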
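Before the scheduler can run, engine output has to be brought into that `{time, charIndex, length}` shape. The sketch below shows two such conversions under stated assumptions: per-character start times arrive as a list parallel to the text (the exact response shape differs per vendor), and Azure-style offsets are in 100-nanosecond ticks.

```python
TICKS_PER_SECOND = 10_000_000  # one tick is 100 nanoseconds


def ticks_to_seconds(ticks: int) -> float:
    return ticks / TICKS_PER_SECOND


def words_from_char_times(text: str, starts: list) -> list:
    """Group per-character start times into word timings.

    starts[i] is the start time in seconds of text[i] (assumed input shape).
    """
    if len(starts) != len(text):
        raise ValueError("need exactly one start time per character")
    words, begin = [], None
    for i, ch in enumerate(text + " "):  # trailing space flushes the last word
        if ch.isspace():
            if begin is not None:
                words.append({"time": starts[begin], "charIndex": begin,
                              "length": i - begin})
                begin = None
        elif begin is None:
            begin = i
    return words


text = "Dobrý den"
starts = [0.00, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.36, 0.42]
print(words_from_char_times(text, starts))
print(ticks_to_seconds(4_200_000))  # 0.42
```

Punctuation stays attached to its word here, which is usually what you want for highlighting; split it off only if your design marks words without trailing commas and periods.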
If you're stuck with an engine that has no timing output (e.g., OpenAI TTS or a bare local model) but you still want highlighting, the fallback is forced alignment: run a small ASR/aligner (e.g., a WhisperX-style alignment or aeneas) over the generated audio + the source text to produce word timestamps offline, cache them next to the audio file, and replay. It's extra compute but it's a one-time cost per cached clip — and it works with any voice.
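The "cache them next to the audio file" step can be sketched as a sidecar JSON file with a compute-once helper. The `.words.json` suffix and the `align(audio_path, text)` callable are assumptions; plug in whichever aligner you actually use.

```python
import json
import os
import tempfile
from typing import Callable


def load_or_align(audio_path: str, text: str,
                  align: Callable[[str, str], list]) -> list:
    """Return cached word timings for a clip, computing them once if missing."""
    sidecar = os.path.splitext(audio_path)[0] + ".words.json"
    if os.path.exists(sidecar):
        with open(sidecar, encoding="utf-8") as f:
            return json.load(f)
    timings = align(audio_path, text)  # hypothetical aligner call
    # Write to a temp file, then rename, so readers never see a partial file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(sidecar) or ".", suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(timings, f, ensure_ascii=False)
    os.replace(tmp, sidecar)
    return timings


calls = []


def fake_align(audio_path, text):  # stand-in for a real forced aligner
    calls.append(audio_path)
    return [{"time": 0.0, "charIndex": 0, "length": len(text)}]


with tempfile.TemporaryDirectory() as d:
    clip = os.path.join(d, "abc123.opus")
    first = load_or_align(clip, "Ahoj", fake_align)
    second = load_or_align(clip, "Ahoj", fake_align)  # served from the sidecar
print(first == second, len(calls))  # True 1
```

Because the sidecar shares the clip's hash-based name, the timings are invalidated together with the audio whenever the cache key version changes.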
For the reader's free/Czech goal: the cleanest free path to highlighting is the Web Speech API (onboundary) for the zero-cost tier. For a self-hosted natural Czech voice with highlighting, plan for the forced-alignment-on-the-cached-file approach, since local engines rarely emit clean word times out of the box. Among paid options, Azure is the gold standard for word + viseme timing in Czech.
For a technical reader who wants natural EN+CZ, near-free, with good UX:
- POST /tts endpoint (FastAPI/Express) that normalizes → hashes → caches → synthesizes on miss → serves .opus (with .mp3 fallback), behind a CDN keyed on the hash. Engine = a local Czech-capable model for bulk volume; pluggable cloud proxy for premium.
- Web Speech API as an instant, offline fallback.
- onboundary on the free tier; precomputed forced-alignment JSON (cached beside each audio file) for self-hosted voices; Azure WordBoundary if you go premium.
- <phoneme> overrides where the engine supports SSML; otherwise tweak the local phonemizer dictionary.

Key takeaways:

- Three architectures: client-side Web Speech API (free, boundary events), server-side generation (full control, near-free), cloud-API proxy (best naturalness; never call paid TTS from the browser, proxy it to hide keys and enable caching).
- Cache on sha256(normalized text + lang + voice + rate + format + version), serve immutable from a CDN, and you synthesize each unique string exactly once, often a 90%+ cost cut on paid backends and zero repeat latency locally.
- Chunk at sentence boundaries; a naive split('.') will mangle prices, dates, and version numbers.
- For Czech normalization, lean on the engine's front-end and <say-as>/lexicons rather than hand-rolling grammar.
- Use <phoneme> + a PLS lexicon to nail proper nouns/foreign names.
- Word timing varies by engine: Web Speech (onboundary) and Azure (WordBoundary) are turn-key; Google uses <mark> timepoints; ElevenLabs gives character timings; OpenAI and bare local models give nothing, so fall back to forced alignment on the cached audio.