Before you reach for a model, an API key, or a Docker container, remember that your reader's device almost certainly already ships with a text-to-speech engine that costs nothing, requires no hosting, and, when it uses a local voice, sends nothing over the network. The browser exposes it through the Web Speech API (SpeechSynthesis), and every desktop and mobile OS exposes its own native layer: macOS say / AVSpeechSynthesizer, Windows SAPI5 / OneCore, Android TextToSpeech, and iOS AVSpeechSynthesizer.
This chapter is the honest accounting of that "free fallback": where it shines, where it embarrasses you, and — critically for this report — how well it handles English and Czech. The headline tension is that the engine is free and local, but you do not control which voices exist on your reader's machine. That single fact governs everything below. We cover the JavaScript API in depth (with copy-pastable code that selects a Czech voice, handles the asynchronous voiceschanged quirk, and chunks long text), the per-OS quality reality, and when to drop this in versus when it will make your product look broken. We close with the server-side cousins of the same idea — macOS say and Piper — for when you want OS-grade TTS but on your machine, not the user's.
There are two distinct surfaces, and they're easy to conflate:
The Web Speech API (window.speechSynthesis) — a browser JavaScript interface. In the browser, it is almost always a thin shim over the host OS's TTS engine (or, in Chrome and Edge, sometimes over a Google/Microsoft cloud voice). You call speechSynthesis.speak(utterance) and audio comes out of the user's speakers. You never receive an audio buffer — there is no way to capture the output as a file from the standard API.
The OS-native TTS layer: say on macOS, AVSpeechSynthesizer (iOS/macOS), SAPI5 + OneCore on Windows, android.speech.tts.TextToSpeech on Android. These are what you'd call from a native app, or from a server process you control. The macOS say command is special because it can write to a file (say -o out.aiff), making it usable server-side.

The browser path is the "zero-hosting" dream: ship a few lines of JS, the user's device does the work. The OS-native path matters mostly when you control the device (a kiosk, a desktop Electron app) or the server (macOS build agents, a Linux box running Piper).
SpeechSynthesis is supported in essentially every current browser: Chrome, Edge, Safari, Firefox, and their mobile equivalents. Per MDN and caniuse, baseline support has been there for years (Chrome since 33, Safari since 7, Firefox since 49, Edge since 14, well before the switch to Chromium). So the existence of the API is not your problem; which voices it returns is.
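Near-universal support still leaves the odd runtime without the API, so a guard before any TTS call is cheap insurance. A minimal sketch, assuming a hypothetical helper name (`supportsSpeechSynthesis`); it takes the global object as a parameter only so that it can be exercised outside a browser:

```javascript
// Hypothetical helper (not part of any library): reports whether the
// Web Speech synthesis API is present on the given global object.
// `root` defaults to the current global (window in a browser).
function supportsSpeechSynthesis(root = globalThis) {
  return Boolean(root)
    && typeof root.speechSynthesis === 'object'
    && root.speechSynthesis !== null
    && typeof root.SpeechSynthesisUtterance === 'function';
}

// Usage in a page (readAloudButton is an assumed element):
// if (!supportsSpeechSynthesis()) readAloudButton.hidden = true;
```

Note that this only answers "does the API exist?"; it says nothing about which voices are installed.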
The voices come from the underlying platform:
| Browser | Voice source | Czech voice present? |
|---|---|---|
| Chrome (desktop) | Google network voices (localService=false) + installed OS voices | Often yes via Google cs-CZ network voice; quality decent but requires network |
| Edge (desktop) | OS voices + "Online (Natural)" neural voices over the network | Yes — Microsoft Antonín/Vlasta Online (Natural), genuinely good, but network + Edge-only |
| Firefox (desktop) | Only OS-installed voices | Only if the OS has a Czech voice installed |
| Safari (macOS/iOS) | Apple system voices | Yes — Zuzana (cs_CZ), more if user downloads "Enhanced/Premium" |
| Chrome/Firefox (Android) | Android system TTS (Google / Samsung engine) | Depends on installed engine + downloaded cs-CZ data |
The crucial column is the third one. You cannot ship a Czech voice with your web app. You can only ask "does a Czech voice exist on this device right now?" and gracefully degrade if not. This is the defining limitation of the entire approach.
The localService flag: local vs. network, and why it matters

Each SpeechSynthesisVoice has a boolean localService. A value of true means a synthesizer running on the device (offline-capable, private). A value of false means the audio is generated remotely: the text you speak is sent to a server (Google's for Chrome's default voices, Microsoft's for Edge's "Online (Natural)" voices).
Two consequences for your project:
- Offline and privacy: if you need either, filter on voice.localService === true. Network voices silently fail (or fall back) without connectivity.
- Quality: the best-sounding voices are usually the network ones (Edge's "Online (Natural)" voices, Google's cs-CZ). The local ones are the older, more robotic system voices. So there's a direct trade-off: natural-sounding usually means "your text left the device."

```javascript
// Inspect what you've actually got
speechSynthesis.getVoices().forEach(v =>
  console.log(`${v.name} | ${v.lang} | ${v.localService ? 'LOCAL' : 'NETWORK'} | default=${v.default}`)
);
```
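When privacy or offline use is a hard requirement, the localService flag turns into a simple filter. A minimal sketch, assuming a hypothetical helper name (`localVoicesFor`); it receives the voice list as an argument rather than calling `getVoices()` itself, and it normalizes `cs_CZ`-style tags because some platforms report underscores:

```javascript
// Hypothetical helper: keep only on-device voices for one language.
// `voices` is whatever speechSynthesis.getVoices() returned.
function localVoicesFor(voices, langPrefix) {
  const prefix = langPrefix.toLowerCase();
  return voices.filter(v => {
    const tag = v.lang.toLowerCase().replace('_', '-'); // cs_CZ -> cs-cz
    const langMatches = tag === prefix || tag.startsWith(prefix + '-');
    return v.localService === true && langMatches;
  });
}

// Usage in a browser:
// const offlineCzech = localVoicesFor(speechSynthesis.getVoices(), 'cs');
// if (!offlineCzech.length) { /* no private/offline Czech voice here */ }
```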
The voiceschanged async trap (you will hit this)

getVoices() frequently returns an empty array on first call, especially in Chrome and Edge, because voices load asynchronously after page init. The fix is to listen for the voiceschanged event, but you must also call getVoices() once eagerly, because in Safari the event may never fire and the list is already populated. Handle both:
```javascript
function loadVoices() {
  return new Promise(resolve => {
    const voices = speechSynthesis.getVoices();
    if (voices.length) return resolve(voices);   // Safari / warm cache
    speechSynthesis.addEventListener('voiceschanged', () => {
      resolve(speechSynthesis.getVoices());      // Chrome / Edge / Firefox
    }, { once: true });
    // Safety net: some engines never fire the event
    setTimeout(() => resolve(speechSynthesis.getVoices()), 1000);
  });
}
```
The strategy: prefer a high-quality network Czech voice if available, fall back to any local cs voice, and finally bail out to a UI message rather than reading Czech text in a default English voice (which sounds grotesque — the engine will read Czech graphemes with English phonetics).
```javascript
async function pickCzechVoice() {
  const voices = await loadVoices();
  const cz = voices.filter(v => v.lang.toLowerCase().startsWith('cs'));
  if (!cz.length) return null;   // No Czech voice installed at all

  // 1) Prefer an Edge "Online (Natural)" or Google network voice (best quality)
  const natural = cz.find(v =>
    /natural|online|google/i.test(v.name) && !v.localService);
  if (natural) return natural;

  // 2) Otherwise any local Czech voice (offline, lower quality)
  const local = cz.find(v => v.localService);
  if (local) return local;

  // 3) Whatever Czech voice exists
  return cz[0];
}
```
If pickCzechVoice() returns null, you should tell the user their device has no Czech voice and link them to OS instructions (see below) — never silently read Czech in en-US.
```javascript
async function speakCzech(text) {
  const voice = await pickCzechVoice();
  if (!voice) {
    alert('No Czech voice installed on this device. See settings to add one.');
    return;
  }
  const u = new SpeechSynthesisUtterance(text);
  u.voice = voice;
  u.lang = voice.lang;        // belt-and-suspenders; helps some engines
  u.rate = 1.0;               // 0.1–10; ~0.9–1.1 sounds natural
  u.pitch = 1.0;              // 0–2
  u.volume = 1.0;             // 0–1
  u.onerror = e => console.warn('TTS error', e.error);
  speechSynthesis.cancel();   // clear any queued/half-spoken utterance
  speechSynthesis.speak(u);
}
```
A few practical notes:
- A rate above ~1.4 quickly turns robotic on local voices; the network "Natural" voices hold up better.
- Call speechSynthesis.cancel() before a new speak() if you want "barge-in" behaviour, or you'll queue utterances behind each other.
- Use the onstart / onend / onboundary events for UI sync.

Chrome has a long-standing bug where speechSynthesis stops after roughly 15 seconds (a couple hundred characters) if you pass one giant utterance. Two mitigations, often combined:
(a) The pause/resume keep-alive hack — a setInterval that pings pause()+resume() every ~10s to stop the engine going idle:
```javascript
function keepAlive() {
  if (speechSynthesis.speaking) {
    speechSynthesis.pause();
    speechSynthesis.resume();
  }
}
const ka = setInterval(keepAlive, 10000);
// clearInterval(ka) when your whole text is done
```
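The interval above keeps running until something clears it. One way to avoid a stray timer is to scope the keep-alive to a single utterance: start it on the utterance's start event and stop it on end or error. A sketch under stated assumptions: `withKeepAlive` is a hypothetical helper, and the `timers` parameter exists only so that the logic can be tested without real timers.

```javascript
// Hypothetical helper: run the pause/resume keep-alive only while
// `utterance` is playing. `synth` is speechSynthesis in a browser.
function withKeepAlive(synth, utterance, intervalMs = 10000, timers = {
  // Wrapped in arrow functions so the browser timer functions keep their
  // expected `this` binding.
  set: (fn, ms) => setInterval(fn, ms),
  clear: id => clearInterval(id),
}) {
  let id = null;
  const stop = () => {
    if (id !== null) { timers.clear(id); id = null; }
  };
  utterance.addEventListener('start', () => {
    stop(); // guard against a duplicate start event
    id = timers.set(() => {
      if (synth.speaking) { synth.pause(); synth.resume(); }
    }, intervalMs);
  });
  utterance.addEventListener('end', stop);
  utterance.addEventListener('error', stop);
  return utterance;
}

// Usage in a browser:
// const u = withKeepAlive(speechSynthesis, new SpeechSynthesisUtterance(text));
// speechSynthesis.speak(u);
```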
(b) Sentence chunking — the more robust approach: split the text into sentence-sized utterances and queue them. This also gives you per-sentence highlighting and error recovery.
```javascript
function chunkText(text, maxLen = 180) {
  // Split on sentence boundaries, keep the punctuation
  const sentences = text.match(/[^.!?…]+[.!?…]?\s*/g) || [text];
  const chunks = [];
  let buf = '';
  for (const s of sentences) {
    if ((buf + s).length > maxLen && buf) { chunks.push(buf.trim()); buf = s; }
    else buf += s;
  }
  if (buf.trim()) chunks.push(buf.trim());
  return chunks;
}

async function speakLong(text) {
  const voice = await pickCzechVoice();   // or your English picker
  if (!voice) return;
  speechSynthesis.cancel();
  for (const chunk of chunkText(text)) {
    const u = new SpeechSynthesisUtterance(chunk);
    u.voice = voice;
    u.lang = voice.lang;
    u.rate = 1.0;
    speechSynthesis.speak(u);             // queued; engine plays them in order
  }
}
```
Sentence chunking respects Czech punctuation fine (.!?…), but watch for abbreviations and decimal numbers (3.14, tzv.) that contain periods — for production, a slightly smarter splitter or a sentence-segmentation library avoids mid-number breaks.
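One candidate for the "slightly smarter splitter" is the built-in `Intl.Segmenter`, which applies Unicode sentence-boundary rules and therefore does not break inside a decimal such as 3.14. Abbreviations followed by a lowercase word are generally kept together as well, but results depend on the runtime's locale data, so treat this as a sketch rather than a guarantee. `splitSentences` and `chunkBySentence` are hypothetical names, and runtimes without `Intl.Segmenter` would still need the regex version as a fallback.

```javascript
// Sketch: sentence splitting with the built-in Intl.Segmenter.
// Assumes the runtime supports it; check for it before relying on this.
function splitSentences(text, locale = 'cs') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'sentence' });
  return Array.from(segmenter.segment(text), s => s.segment);
}

// Hypothetical alternative to chunkText: packs whole sentences into chunks of
// roughly maxLen characters (a single longer sentence becomes its own chunk).
function chunkBySentence(text, maxLen = 180, locale = 'cs') {
  const chunks = [];
  let buf = '';
  for (const s of splitSentences(text, locale)) {
    if (buf && (buf + s).length > maxLen) { chunks.push(buf.trim()); buf = s; }
    else buf += s;
  }
  if (buf.trim()) chunks.push(buf.trim());
  return chunks;
}

// Usage: swap chunkText(text) for chunkBySentence(text) inside speakLong().
```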
This is where honesty earns its keep. Built-in TTS quality varies enormously, and "is there a Czech voice at all" is a separate axis from "does it sound good."
macOS / iOS (Safari + native). The best of the built-ins. Apple ships Zuzana (cs_CZ) as the system Czech voice, and users can download Enhanced and Premium quality variants in System Settings → Accessibility → Spoken Content → System Voice → Manage Voices. The default (compact) Zuzana is intelligible but plainly synthetic; the Premium download is markedly better and is the single best free, local Czech voice you'll find on a consumer device. English on macOS is excellent (Samantha, and the neural "Siri" voices for system use). Via say on the command line you can list voices with say -v '?' and pick Czech with say -v Zuzana "Dobrý den".
Windows (Edge / Chrome / Firefox + SAPI/OneCore). The messiest platform.
In Edge, Microsoft Antonín Online (Natural) and Microsoft Vlasta Online (Natural) are genuine neural voices that sound very good, and they appear in getVoices() (with localService=false) when running in Edge with network access. This is, in practice, the highest-quality free Czech browser TTS available without any backend on your part, but it only works in Edge, online, and the text is sent to Microsoft.

Android (Chrome / Firefox + Android TTS). Depends entirely on the installed engine (Google Speech Services or Samsung TTS) and whether the user has downloaded the cs-CZ voice data. Google's Czech voice is reasonable. Coverage is uneven across devices and OEMs.
Linux (Firefox / Chromium). The weak spot. Out of the box, Linux often has no usable TTS through the browser, or only espeak/speech-dispatcher — which is the textbook "1990s robot" sound and barely tolerable for Czech. If your reader base is heavily Linux, the browser fallback essentially does not exist for them; this is a strong reason to consider a server-side option (next section).
Quality summary table:
| Platform | Czech voice exists? | Local quality | Best free option | Sends text to server? |
|---|---|---|---|---|
| macOS / iOS | Yes (Zuzana, +Premium download) | Good→Very good (Premium) | Zuzana Premium, local | No |
| Windows + Edge | Yes (Jakub local; Antonín/Vlasta online) | Local: poor; Online: very good | Edge "Online (Natural)" | Online voices: yes |
| Windows + Chrome/FF | Sometimes (registry quirk) | Poor | Limited | Chrome network voice: yes |
| Android | Usually (after download) | Decent (Google) | Google cs-CZ | Often yes |
| Linux | Rarely | Robotic (espeak) | Effectively none in-browser | No |
English, by contrast, is well-served and natural on macOS, iOS, Edge (Natural), and modern Android. The honest bottom line: English built-in TTS is "good enough" almost everywhere; Czech is only reliably good on Apple devices and in Edge online.
Calls to speak() often must be triggered by a user gesture; auto-reading on page load is frequently blocked.

If you like "OS-grade, free, no per-character cost" but hate "I don't control the voice," move the OS engine to the server, where you do control it. Two routes worth knowing:
macOS say (if you have a Mac/Mac mini/CI runner). It writes audio to a file, so it's usable in a pipeline:
```bash
# List Czech voices, then synthesize to a file
say -v '?' | grep cs_CZ
say -v Zuzana -o vystup.aiff "Dobrý den, toto je test."

# Convert to mp3 for the web (needs ffmpeg)
ffmpeg -i vystup.aiff -codec:a libmp3lame -qscale:a 4 vystup.mp3
```
This gives you consistent, decent Czech (especially with the Premium Zuzana installed) and full backend control — at the price of needing macOS hardware. Note Apple's licensing restricts using system voices outside Apple platforms, so this is fine for your own Mac, not for redistributing the voice.
Piper (cross-platform, open source, the real recommendation for self-hosting). Piper is a fast, fully local neural TTS from the Rhasspy/Home Assistant ecosystem. It runs on Linux/Windows/macOS, even a Raspberry Pi, has no per-character cost, and — importantly — ships a Czech voice (cs_CZ, e.g. the jirka voice in low/medium quality) hosted on Hugging Face. Quality is clearly better than espeak and roughly in the "good local neural" tier, though below top cloud voices.
```bash
# Download a Czech Piper voice (model + config) from the rhasspy/piper-voices repo,
# then synthesize to a WAV file: fully offline, no API key, no per-char cost.
echo 'Dobrý den, toto je test české syntézy.' | \
  piper --model cs_CZ-jirka-medium.onnx --output_file out.wav
```
Piper is covered in depth in the local/self-hosted chapter; the point here is that it's the natural upgrade from "browser fallback" when you want the same free/offline ethos but consistent, server-generated, cacheable audio. Architecturally: the browser Web Speech API is the client-side free option (no hosting, inconsistent), and Piper is the server-side free option (you host a small container, fully consistent, can pre-generate and cache files).
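The "pre-generate and cache" idea can be sketched in a few lines of Node.js. This is an illustration under stated assumptions, not a reference implementation: `audioCacheKey`, `synthesizeCached` and the `./tts-cache` directory are hypothetical names, a `piper` binary is assumed to be on the PATH, and the only Piper flags used are the two from the shell example above.

```javascript
const { createHash } = require('node:crypto');
const { existsSync, mkdirSync } = require('node:fs');
const { spawn } = require('node:child_process');
const path = require('node:path');

// Hypothetical cache key: the same model + text always maps to the same name.
function audioCacheKey(model, text) {
  return createHash('sha256')
    .update(`${model}\n${text}`, 'utf8')
    .digest('hex')
    .slice(0, 32);
}

// Synthesize `text` with Piper unless a cached WAV already exists.
// Resolves with the path of the WAV file.
function synthesizeCached(text, { model = 'cs_CZ-jirka-medium.onnx', cacheDir = './tts-cache' } = {}) {
  const outFile = path.join(cacheDir, `${audioCacheKey(model, text)}.wav`);
  if (existsSync(outFile)) return Promise.resolve(outFile); // cache hit
  mkdirSync(cacheDir, { recursive: true });
  return new Promise((resolve, reject) => {
    const proc = spawn('piper', ['--model', model, '--output_file', outFile],
      { stdio: ['pipe', 'ignore', 'inherit'] });
    proc.on('error', reject); // e.g. piper is not installed
    proc.on('close', code =>
      code === 0 ? resolve(outFile) : reject(new Error(`piper exited with code ${code}`)));
    proc.stdin.on('error', () => {}); // failures are reported via close/error
    proc.stdin.end(text); // Piper reads the text from stdin, as in the shell example
  });
}

// Usage:
// synthesizeCached('Dobrý den, toto je test.').then(file => console.log(file));
```

A production version would also want to write to a temporary file and rename it on success, so that a failed run cannot leave a partial WAV that later counts as a cache hit.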
- […] voiceschanged code above.
- […] say + ffmpeg.

- The Web Speech API (SpeechSynthesis) is universally supported in 2026 browsers and costs nothing, but you don't control which voices exist on the user's device; that is its defining limitation.
- The best-sounding voices (Edge's "Online (Natural)" voices, Google's cs-CZ) are network voices (localService=false): they send your text to a server and need connectivity. Filter on localService if you need privacy/offline.
- Handle the voiceschanged event, the ~15s/long-text cutoff (pause/resume hack and/or sentence chunking), and the case where no Czech voice exists (warn the user rather than read Czech in an English voice).
- If you need consistent output, move synthesis to the server with say -o (consistent, needs a Mac) or, better, Piper (open-source neural, free, offline, ships a Czech voice).

- […] getVoices(), voiceschanged behaviour.
- […] SpeechSynthesis.
- […] say -v voice listing and -o file output for server-side synthesis.
- […] cs_CZ Piper voice models.