Text-to-Speech for Reading Your Texts — Natural Voices in English & Czech (2026)

Chapter 11: Integration Patterns for Your App

Chapter 12 of 17 · ~20 min read

Overview

Picking a good-sounding voice is only half the job. The other half is engineering: how do you actually wire TTS into a real web or app project so that it's fast, cheap, robust, and pleasant to listen to — in both English and Czech? This chapter is the practical glue. It covers the three high-level architectures (client-side Web Speech, server-side generation, and a cloud-API proxy), how to stream audio to the browser, how to cache generated audio so you never pay to synthesize the same sentence twice, how to chunk long texts at sentence boundaries and queue them, how to control pronunciation with SSML / lexicons / phoneme overrides, how to normalize messy input (numbers, dates, abbreviations — a special pain point in Czech), which audio formats to ship, and how to highlight words as they are read using word-boundary timestamps. You'll find copy-pastable code for a caching synth endpoint and a streaming frontend player.

The reader's hard requirements are natural-sounding English and Czech at free or near-free cost, so every pattern here is assessed against that goal. Word-timing support is called out explicitly because it varies enormously between engines and is often the deciding factor.

Content

1. The three architectures (and when to use each)

There are really only three places synthesis can happen. Most production apps mix two of them.

Architecture	Where audio is made	Cost	Czech quality control	Word timings	Best for
A. Client-side `Web Speech API`	In the user's browser/OS	Free	Whatever the OS ships (uneven; Czech often present on Win/Android/macOS, weak on Linux)	Yes — `boundary` events	Zero-budget MVP, accessibility, "read this page" buttons
B. Server-side generation	Your server (local model: Piper/Kokoro/XTTS, or self-hosted API)	Compute only (near-free)	Full control — you choose/finetune the model and lexicons

function speak(text, lang = 'cs-CZ') {
  if (!('speechSynthesis' in window)) return false;     // no support
  const u = new SpeechSynthesisUtterance(text);
  u.lang = lang;
  u.rate = 1.0;   // 0.1–10, 1 = normal
  u.pitch = 1.0;  // 0–2
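  // Prefer a real voice for the requested language if one exists.
  // getVoices() can return an empty list until 'voiceschanged' has fired.
  const prefix = lang.slice(0, 2).toLowerCase();
  const v = speechSynthesis.getVoices()
    .find(v => v.lang.toLowerCase().startsWith(prefix));
  if (v) u.voice = v;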
  // Word highlighting: the boundary event gives char offsets
  u.onboundary = (e) => {
    if (e.name === 'word') highlightAt(e.charIndex, e.charLength);
  };
  speechSynthesis.cancel();   // stop anything queued
  speechSynthesis.speak(u);
  return true;
}

# FastAPI: cache-first TTS endpoint (engine-agnostic)
import hashlib, os, uuid
from typing import Literal
from fastapi import FastAPI
from fastapi.responses import FileResponse
from pydantic import BaseModel

app = FastAPI()
CACHE_DIR = "/data/tts-cache"
os.makedirs(CACHE_DIR, exist_ok=True)
MIME = {"opus": "audio/ogg", "mp3": "audio/mpeg", "wav": "audio/wav"}

class Req(BaseModel):
    text: str
    lang: Literal["cs", "en"] = "cs"
    voice: str = "default"
    rate: float = 1.0
    # Validated, because fmt ends up in a file name and selects the MIME type
    fmt: Literal["opus", "mp3", "wav"] = "opus"

def cache_key(r: Req) -> str:
    # ANY parameter that changes the audio must be in the key
    raw = f"{r.text}|{r.lang}|{r.voice}|{r.rate}|{r.fmt}|v3"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def normalize(text: str, lang: str) -> str:
    ...  # see §7: expand numbers/dates/abbrevs, fix Czech

def synthesize(text: str, r: Req, out_path: str):
    ...  # call Piper / Kokoro / Azure / OpenAI here, write file

@app.post("/tts")
def tts(r: Req):
    r.text = normalize(r.text, r.lang)
    key = cache_key(r)
    path = os.path.join(CACHE_DIR, f"{key}.{r.fmt}")
    if not os.path.exists(path):
        # Cache miss → pay once. Write to a temp file, then rename, so a
        # concurrent request never serves a half-written file.
        tmp = os.path.join(CACHE_DIR, f"{key}.{uuid.uuid4().hex}.{r.fmt}")
        synthesize(r.text, r, tmp)
        os.replace(tmp, path)
    # HTTP caches generally ignore POST responses; expose a GET variant
    # if browsers or a CDN should honor this header.
    return FileResponse(path, media_type=MIME[r.fmt],
        headers={"Cache-Control": "public, max-age=31536000, immutable"})
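
The cache-first idea in the endpoint above can be exercised without a web framework or a real engine. The following is a minimal sketch under stated assumptions: `get_or_synthesize`, `fake_synth`, and the `"v1"` version tag are hypothetical names introduced for illustration, and the key is JSON-encoded (rather than joined with `|`) so that a separator character inside the text cannot collide with a parameter boundary.

```python
import hashlib
import json
import os
import tempfile
import uuid

def cache_key(text: str, lang: str, voice: str, rate: float, fmt: str) -> str:
    # Every parameter that changes the audio goes into the key.
    # "v1" is a hypothetical version tag for this sketch.
    raw = json.dumps([text, lang, voice, rate, fmt, "v1"], ensure_ascii=False)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def get_or_synthesize(cache_dir, synth, text, lang="cs", voice="default",
                      rate=1.0, fmt="opus"):
    """Return the cached file path, calling synth(text, out_path) only on a miss."""
    key = cache_key(text, lang, voice, rate, fmt)
    path = os.path.join(cache_dir, f"{key}.{fmt}")
    if not os.path.exists(path):
        tmp = os.path.join(cache_dir, f"{uuid.uuid4().hex}.tmp")
        synth(text, tmp)        # the expensive call
        os.replace(tmp, path)   # atomic rename: readers never see a partial file
    return path

# Fake engine (assumption): records each call and writes placeholder bytes.
calls = []
def fake_synth(text, out_path):
    calls.append(text)
    with open(out_path, "wb") as f:
        f.write(b"fake-audio:" + text.encode("utf-8"))

cache_dir = tempfile.mkdtemp()
first = get_or_synthesize(cache_dir, fake_synth, "Dobrý den")
second = get_or_synthesize(cache_dir, fake_synth, "Dobrý den")            # cache hit
other = get_or_synthesize(cache_dir, fake_synth, "Dobrý den", rate=1.2)   # new key
print(len(calls))  # 2: one synthesis per distinct parameter set
```

The second request returns the same path without touching the engine, while changing only `rate` produces a separate cache entry.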

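// Fetch synthesized audio and play it.
// This buffers the whole response before playback (broadest support);
// for true low-latency streaming use MSE or Web Audio instead.
async function playStream(url, body) {
  const res = await fetch(url, {method:'POST',
    headers:{'Content-Type':'application/json'}, body:JSON.stringify(body)});
  if (!res.ok) throw new Error(`TTS request failed: ${res.status}`);
  const blob = await res.blob();
  const objectUrl = URL.createObjectURL(blob);
  const audio = new Audio(objectUrl);
  audio.onended = () => URL.revokeObjectURL(objectUrl);   // free the blob
  await audio.play();
}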
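
On the server side, streaming usually means handing the framework an iterator of byte chunks rather than a finished file. The sketch below shows only that iterator, using the standard library; `iter_audio` and the 16 KiB chunk size are illustrative assumptions. In FastAPI, a generator like this could be passed to `StreamingResponse` with the matching media type.

```python
import io
from typing import BinaryIO, Iterator

def iter_audio(source: BinaryIO, chunk_size: int = 16 * 1024) -> Iterator[bytes]:
    """Yield audio in fixed-size pieces instead of loading it whole."""
    while True:
        piece = source.read(chunk_size)
        if not piece:
            break
        yield piece

# Stand-in for an engine's output stream or an open cache file.
fake_audio = io.BytesIO(b"\x00" * 40_000)
sizes = [len(p) for p in iter_audio(fake_audio)]
print(sizes)  # [16384, 16384, 7232]
```

Because each chunk is sent as soon as it is read, the client can begin decoding before synthesis or disk reads have finished.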

Format	Typical speech bitrate	Rel. size (1 min speech)	Browser support	Use when
Opus (Ogg/WebM)	24–32 kbps mono is transparent for voice	~1× (≈180–240 KB/min)	All modern browsers	Default for web delivery
MP3	64–96 kbps for comparable quality	~2–3×	Universal incl. legacy	Fallback / maximum compatibility
AAC (m4a)	48–64 kbps	~1.5–2×	Broad (esp. Apple)	iOS-heavy audiences, MSE
WAV/PCM	256+ kbps (16-bit/16 kHz)	~10×+	All	Intermediate, never ship to users
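
The size column follows directly from the bitrate. As a quick check, this sketch converts a constant bitrate into payload size per minute; `kb_per_minute` is a hypothetical helper, and the result excludes container overhead, which adds a little on top.

```python
def kb_per_minute(kbps: float) -> float:
    """Payload size in kilobytes for one minute of audio at a constant bitrate."""
    return kbps * 60 / 8   # kilobits/s -> kilobits/min -> kilobytes/min

for name, kbps in [("Opus 24 kbps", 24), ("Opus 32 kbps", 32),
                   ("MP3 64 kbps", 64), ("WAV 16-bit/16 kHz", 256)]:
    print(f"{name}: {kb_per_minute(kbps):.0f} KB/min")
# Opus 24 kbps: 180 KB/min
# Opus 32 kbps: 240 KB/min
# MP3 64 kbps: 480 KB/min
# WAV 16-bit/16 kHz: 1920 KB/min
```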

// Split into sentences, then re-pack into chunks under a char budget,
// keeping abbreviations, decimals, and dotted dates intact.
// The 'u' flag and \p{L} are needed because \b is ASCII-only in JavaScript
// and would miss abbreviations that start with "č" or end in "ř".
const ABBR = /(?:^|[^\p{L}\p{N}])(?:tj|atd|apod|mj|tzn|tzv|např|př|stol|s\.r\.o|a\.s|Dr|Ing|Mgr|Bc|Sv|č|str|odst|písm|sb)\.$/iu;
function splitSentences(text) {
  const parts = text.match(/[^.!?…]+[.!?…]+(?:["')\]]+)?|[^.!?…]+$/g) || [text];
  const out = [];
  for (const p of parts) {
    const prev = out[out.length - 1];
    // Merge if the previous piece ended in a known abbreviation, or if no
    // whitespace follows the period ("3.14", "s.r.o.", "23.5.2024").
    if (prev && (ABBR.test(prev.trimEnd()) || !/^\s/.test(p))) out[out.length - 1] = prev + p;
    else out.push(p.trimStart());
  }
  return out.map(s => s.trim()).filter(Boolean);
}
function chunk(text, maxLen = 600) {
  const sents = splitSentences(text);
  const chunks = []; let buf = '';
  for (const s of sents) {
    // A single sentence longer than maxLen is kept whole rather than cut mid-sentence.
    if ((buf + ' ' + s).length > maxLen && buf) { chunks.push(buf); buf = s; }
    else buf = buf ? buf + ' ' + s : s;
  }
  if (buf) chunks.push(buf);
  return chunks;
}

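class TTSQueue {
  constructor(getAudioUrl) { this.getAudioUrl = getAudioUrl; this.q = []; this.i = 0; }
  load(text) {
    // url holds a Promise once a chunk has been requested
    this.q = chunk(text).map(t => ({ text: t, url: null }));
    this.i = 0; this.prefetch(0); this.prefetch(1);
  }
  // Start at most one request per chunk; concurrent callers share the same
  // promise, so a chunk is never synthesized (and paid for) twice.
  prefetch(i) {
    const item = this.q[i];
    if (!item) return Promise.resolve(null);
    if (!item.url) item.url = Promise.resolve(this.getAudioUrl(item.text));
    return item.url;
  }
  async play() {
    while (this.i < this.q.length) {
      const url = await this.prefetch(this.i);        // current chunk
      this.prefetch(this.i + 1).catch(() => {});      // warm the next one
      const a = new Audio(url);
      await new Promise(res => { a.onended = res; a.onerror = res; a.play().catch(res); });
      this.i++;
    }
  }
}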

import re
ABBREV_CS = {r"\btj\.": "to jest", r"\bnapř\.": "například",
             r"\batd\.": "a tak dále", r"\bapod\.": "a podobně",
             r"\bč\.\s*": "číslo ", r"\bKč\b": "korun",
             # no \b before "°": it is not a word character, so \b would fail after a space
             r"°C\b": "stupňů Celsia"}
def normalize_cs(t: str) -> str:
    # NBSP and narrow NBSP thousands separators → plain space
    t = t.replace("\u00a0", " ").replace("\u202f", " ")
    t = re.sub(r"(?<=\d) (?=\d{3}\b)", "", t)   # "1 250" -> "1250" for the reader
    for pat, rep in ABBREV_CS.items():
        t = re.sub(pat, rep, t, flags=re.IGNORECASE)
    # Let the engine inflect numbers/dates; only collapse whitespace here.
    return re.sub(r"\s+", " ", t).strip()
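
One Czech-specific wrinkle in the mapping above is that "Kč" always expands to "korun", although the noun form depends on the amount (1 koruna, 2 to 4 koruny, otherwise korun). A minimal sketch of amount-aware expansion follows; `koruna_form` and `expand_kc` are hypothetical helpers, they handle whole numbers only, and they do not attempt case inflection after prepositions. Compound amounts such as 22 are mapped to "korun", following the CLDR plural categories for Czech; spoken Czech also allows "dvacet dvě koruny".

```python
import re

def koruna_form(n: int) -> str:
    """Pick the noun form for a whole-number amount."""
    if n == 1:
        return "koruna"
    if 2 <= n <= 4:
        return "koruny"
    return "korun"   # 0, 5 and above, including 22 or 103

def expand_kc(text: str) -> str:
    """Replace '<integer> Kč' with the amount plus the matching noun form."""
    def repl(m: re.Match) -> str:
        n = int(m.group(1))
        return f"{n} {koruna_form(n)}"
    return re.sub(r"(\d+)\s*Kč\b", repl, text)

print(expand_kc("Lístek stojí 3 Kč, káva 45 Kč."))
# Lístek stojí 3 koruny, káva 45 korun.
```

A step like this would need to run before the generic abbreviation pass and after thousands separators have been collapsed, so that "1 250 Kč" is seen as a single number.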

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="cs-CZ">
  Cena je <say-as interpret-as="cardinal">1250</say-as> korun.
  Sejdeme se <say-as interpret-as="date" format="dmy">23.5.2024</say-as>
  v <say-as interpret-as="time" format="hms24">14:30</say-as>.
  Krátká pauza<break time="400ms"/> a pokračujeme.
  Tohle <emphasis level="strong">zdůrazni</emphasis>.
  <prosody rate="90%" pitch="-1st">Pomalu a níž.</prosody>
  Slovo <phoneme alphabet="ipa" ph="ˈɛlɛnˌbɛrk">Ellenberg</phoneme>.
</speak>
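
When SSML like the example above is assembled from user text, characters such as `&` and `<` must be escaped or the engine will reject the document. The following sketch builds a small SSML string with the standard library and checks that it parses as XML; `say_as` and `build_ssml` are hypothetical helpers, and which `interpret-as` values an engine honors for Czech is engine-specific.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

def say_as(value: str, interpret_as: str) -> str:
    """Wrap a value in <say-as>, escaping both the attribute and the text."""
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(value)}</say-as>"

def build_ssml(parts, lang: str = "cs-CZ") -> str:
    """parts: plain strings (escaped here) or ("raw", markup) tuples (trusted markup)."""
    body = "".join(p[1] if isinstance(p, tuple) else escape(p) for p in parts)
    return f'<speak version="1.0" xml:lang={quoteattr(lang)}>{body}</speak>'

ssml = build_ssml([
    "Cena je ",
    ("raw", say_as("1250", "cardinal")),
    " korun u firmy Novák & syn.",
])
root = ET.fromstring(ssml)   # raises ParseError if the markup is not well-formed
print(ssml)
```

Keeping raw markup and plain text as separate part types makes it harder to pass unescaped user input into the document by accident.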

Engine	Word-timing mechanism	Notes
Web Speech API (client)	`utterance.onboundary` (`charIndex`, `charLength`)	Free, real-time; accuracy & support uneven across browsers/mobile
Azure Speech	`WordBoundary` event (+ `VisemeReceived`, `BookmarkReached`) with audio offsets in ticks	Excellent, reliable; arguably the best for highlighting + lip-sync
Google Cloud TTS	SSML `<mark>` timepoints returned with the audio	You insert `<mark>` tags; get back time for each mark — word-level if you mark each word
Amazon Polly	Speech marks (separate JSON stream: word/sentence/viseme/ssml)	Mature, well-documented
ElevenLabs	Character-level alignment / timestamps (`with-timestamps` / WS)	Character timings → derive word timings by grouping
OpenAI TTS	None (no official word timestamps as of 2026)	If you need highlighting, force-align separately or pick another engine
Piper / Kokoro / XTTS (local)	Phoneme/duration data available with effort	Piper exposes phoneme durations; Kokoro needs custom alignment; not turn-key
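
Two conversions from the table come up often in practice: grouping character-level timings into words, and turning tick offsets into seconds. The sketch below assumes a simple input shape (a list of characters plus a parallel list of start times in seconds), which is an illustration rather than any engine's exact response format. The output uses the `time` / `charIndex` / `length` fields that a highlighting loop can consume. The tick helper assumes offsets in 100-nanosecond units, as Azure reports them.

```python
def words_from_chars(chars, starts):
    """Group per-character start times into one entry per word.

    chars  : single characters in text order
    starts : start time in seconds for each character (same length as chars)
    """
    timings, word_start = [], None
    for i, ch in enumerate(chars):
        if ch.isspace():
            if word_start is not None:   # a word just ended
                timings.append({"time": starts[word_start],
                                "charIndex": word_start,
                                "length": i - word_start})
                word_start = None
        elif word_start is None:         # a new word begins
            word_start = i
    if word_start is not None:           # flush the final word
        timings.append({"time": starts[word_start],
                        "charIndex": word_start,
                        "length": len(chars) - word_start})
    return timings

def ticks_to_seconds(ticks: int) -> float:
    """Convert 100-nanosecond ticks to seconds."""
    return ticks / 10_000_000

text = "Dobrý den"
starts = [0.00, 0.05, 0.11, 0.16, 0.22, 0.30, 0.34, 0.40, 0.47]
word_timings = words_from_chars(list(text), starts)
print(word_timings)
# [{'time': 0.0, 'charIndex': 0, 'length': 5}, {'time': 0.34, 'charIndex': 6, 'length': 3}]
```

Punctuation stays attached to the preceding word in this sketch, which is usually acceptable for highlighting.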

// timings: [{ time: 0.42, charIndex: 0, length: 5 }, ...] aligned to audio
// Note: 'timeupdate' fires only a few times per second, so highlights can lag slightly.
function highlightWhilePlaying(audio, timings, textEl) {
  let i = 0;
  audio.ontimeupdate = () => {
    while (i < timings.length && timings[i].time <= audio.currentTime) {
      const t = timings[i];
      markRange(textEl, t.charIndex, t.charIndex + t.length); // wrap in <mark>
      i++;
    }
  };
  audio.onseeked = () => { i = 0; };   // re-sync from the start after a seek
}

2. Client-side: the `Web Speech API` (free, zero infra)