This chapter is the hands-on deep dive into the six strongest local, free, open-weight TTS models for English naturalness as of mid-2026: Kokoro-82M, F5-TTS, Orpheus-3B, Chatterbox, StyleTTS 2, and XTTS-v2. "Local" here means the model runs on your own hardware — your laptop, a workstation GPU, or a self-hosted Docker box — with no per-character cloud fees once downloaded.
The reader's brief is specific: a TTS feature that reads their text aloud, must sound real, and should stay free or near-free. This chapter answers the English half of that. Each model gets: install steps, model download, a copy-pastable "hello world" Python snippet, hardware reality (does it run on CPU / Apple Silicon, or does it need a GPU and how much VRAM), inference speed (real-time factor, RTF), voice options and cloning, an honest quality assessment, and the license — because the license decides whether you can ship it.
Important scope note on Czech. This chapter grades these models on English. Czech is a hard filter for the reader, and most of the models below are weak or unverified in Czech. I flag each model's Czech status briefly here, but the full Czech evaluation lives in the companion chapter (05b). If you only need English, this chapter is your shortlist. If you need both languages from a single local model, jump ahead: the answer is a very short list (in this chapter only XTTS-v2 has official Czech; the multilingual Chatterbox covers 23 languages but not Czech), and it comes with caveats.
What it is. Kokoro-82M is an 82-million-parameter TTS model from "hexgrad," built on a StyleTTS-2-derived architecture. It is remarkably small for its quality and has repeatedly ranked near the top of open-weight models on the TTS-Arena blind-listening leaderboard despite having far fewer parameters and less training data than its competitors.
License. Apache 2.0 — for both code and weights. This is the cleanest license in this entire chapter. You can ship Kokoro in a commercial product with no fees and no non-commercial trap. For the reader's "free / near-free, ship it" goal, this matters enormously.
Hardware. This is the headline: Kokoro runs comfortably on CPU and Apple Silicon, faster than real-time. No GPU required. The kokoro-onnx runtime generates speech well above real-time on a modern laptop CPU, and the mlx-audio package gives native Apple-Silicon acceleration on M-series Macs. On any consumer GPU it is effectively instant. RAM footprint is tiny (the model is well under 350 MB).
Speed. Faster than real time on CPU (RTF below 1); on GPU the RTF is a small fraction of that (think minutes of audio generated per second). For a read-aloud feature on a laptop, this is the only model in the list whose speed you can take for granted.
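Real-time factor is synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below shows one way to measure it on your own hardware. The names `measure_rtf` and `fake_synthesize` are hypothetical; the stand-in returns silence, and you would swap in a call to whichever model you are testing.

```python
import time
from typing import Callable, Sequence


def measure_rtf(synthesize: Callable[[str], Sequence[float]],
                text: str,
                sample_rate: int) -> float:
    """Return synthesis wall-clock time divided by generated audio duration."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesize() returned no audio")
    return elapsed / audio_seconds


def fake_synthesize(text: str) -> Sequence[float]:
    """Stand-in for a real TTS call: one second of silence at 24 kHz."""
    return [0.0] * 24000


rtf = measure_rtf(fake_synthesize, "Hello there.", sample_rate=24000)
print(f"RTF = {rtf:.4f} ({'faster' if rtf < 1 else 'slower'} than real time)")
```

Measure on text of realistic length and discard the first run, since model loading and warm-up can dominate a single short call.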
Voices & cloning. Kokoro ships a set of pre-built voices: American English voices such as af_heart, af_bella, and am_michael, plus British English voices and voices for several other languages. It does not do arbitrary zero-shot voice cloning from a reference clip; you pick from the bundled voice packs (which can be blended). For "read my text in a pleasant standard voice," that's exactly enough; for "read it in my cloned voice," look at F5-TTS / Chatterbox / XTTS instead.
English quality. Excellent for the size — clean, natural, well-paced narration. It is not the most expressive model (less dramatic range than Orpheus or Chatterbox), but it is consistent, artifact-light, and rarely embarrasses itself on long text — ideal traits for reading documents.
Czech status. Kokoro is multilingual-ish via phoneme packs and exposes lang codes, but Czech is not a first-class, well-supported language and Czech quality is unverified/weak. Treat Kokoro as an English (and a handful of major languages) model. See 05b.
Install + hello world (pip route):
pip install kokoro soundfile
# Phonemizer backend (Linux/WSL):
sudo apt-get install -y espeak-ng
# macOS: brew install espeak-ng
from kokoro import KPipeline
import soundfile as sf
pipeline = KPipeline(lang_code='a') # 'a' = American English, 'b' = British
text = "Hello — this is Kokoro reading your text aloud. It runs on a plain laptop CPU."
# Kokoro yields audio in chunks; concatenate or write per chunk.
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice='af_heart')):
    sf.write(f'kokoro_{i}.wav', audio, 24000)
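The loop above writes one file per chunk. For read-aloud you usually want a single file, so here is a minimal standard-library sketch that joins chunks into one 16-bit mono WAV. `write_chunks_to_wav` is a hypothetical helper, and it assumes each chunk is a sequence of float samples in the range -1.0 to 1.0 at 24 kHz.

```python
import wave
from array import array
from typing import Iterable, Sequence


def write_chunks_to_wav(chunks: Iterable[Sequence[float]],
                        path: str,
                        sample_rate: int = 24000) -> int:
    """Write float chunks (-1.0..1.0) to one 16-bit mono WAV; return the frame count."""
    pcm = array("h")
    for chunk in chunks:
        for sample in chunk:
            clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
            pcm.append(int(clipped * 32767))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())
    return len(pcm)


# Two tiny stand-in chunks; real chunks would come from the TTS pipeline.
frames = write_chunks_to_wav([[0.0, 0.5], [-0.5, 1.0, -1.0]], "combined.wav")
print(frames)
```

The per-sample Python loop is slow on long audio. If the chunks are NumPy arrays, concatenating them with `numpy.concatenate` and calling `sf.write` once is the faster route.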
CPU/ONNX route (no PyTorch, smallest footprint):
pip install kokoro-onnx soundfile
# Download kokoro-v*.onnx and voices-*.bin from the kokoro-onnx releases page first.
import soundfile as sf
from kokoro_onnx import Kokoro
kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
samples, sr = kokoro.create("Hello from the ONNX runtime.", voice="af_heart", speed=1.0, lang="en-us")
sf.write("kokoro_onnx.wav", samples, sr)
Verdict: For the reader's English + free + runs-anywhere requirement, Kokoro is the default recommendation. Apache-2.0, CPU-friendly, genuinely natural. Its only real limits are no zero-shot cloning and weak Czech.
What it is. F5-TTS is a non-autoregressive flow-matching TTS model built on a Diffusion Transformer (the paper's subtitle: "A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"). It does zero-shot voice cloning from a few seconds of reference audio plus its transcript, and the English output is among the best-sounding open clones available.
License (read carefully). The code is MIT, but the released checkpoints were trained on the Emilia dataset, which is CC BY-NC 4.0 (non-commercial). In practice the pretrained weights inherit that non-commercial restriction, which rules them out for commercial deployment. For a personal/free project this is fine; for a product you sell, this is a real legal blocker unless you retrain on permissively licensed data. Do not skip this.
Hardware. F5-TTS effectively wants a GPU. It runs on CPU but slowly (long generations become impractical). A GPU with ~8–12 GB VRAM is a comfortable target; it works on less for short clips. On Apple Silicon it can run via PyTorch MPS but expect modest speed.
Speed. On a modern GPU, RTF is roughly 0.15–0.3 (several times faster than real time) for typical sentences; more sampling steps trade speed for quality. On CPU, RTF climbs well above 1, which is not ideal for interactive read-aloud.
Voices & cloning. Pure zero-shot cloning — give it ref.wav + the reference transcript and it speaks your gen_text in that voice. Quality of the clone depends heavily on clean reference audio. No bundled "stock" voices in the Kokoro sense; you supply the voice.
English quality. Very high — natural prosody and convincing timbre transfer. It is one of the strongest open models for "clone this narrator and read my book in their voice." It can occasionally hallucinate or repeat on awkward inputs (a known flow-matching failure mode), so add normalization/chunking for long documents.
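The chunking advice can be sketched as a sentence-aware splitter that keeps each request short. This is only an illustration: `chunk_text` and the `max_chars` default are hypothetical, and the regular expression is a naive sentence-boundary heuristic that will mis-split abbreviations such as "Dr.".

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut mid-word.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)   # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


parts = chunk_text("First sentence. Second one! A third? And a fourth.", max_chars=30)
print(parts)  # ['First sentence. Second one!', 'A third? And a fourth.']
```

Each chunk is then synthesized separately and the audio is joined afterwards, which also limits how much has to be regenerated if one chunk comes out wrong.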
Czech status. The base English checkpoint is English-centric; community fine-tunes exist for some languages (e.g., a Spanish-F5), but Czech support is not official and is unverified. See 05b.
Install + hello world:
conda create -n f5-tts python=3.10 -y && conda activate f5-tts
pip install f5-tts
# CLI (simplest):
f5-tts_infer-cli \
  --model F5TTS_v1_Base \
  --ref_audio "ref.wav" \
  --ref_text "This is the transcript of the reference clip." \
  --gen_text "Now read my text in this cloned voice." \
  --output_dir out/
# Python API:
from f5_tts.api import F5TTS
f5 = F5TTS() # downloads the base checkpoint on first run
wav, sr, _ = f5.infer(
    ref_file="ref.wav",
    ref_text="This is the transcript of the reference clip.",
    gen_text="Now read my text in this cloned voice.",
    file_wave="f5_out.wav",
)
Verdict: Best free local cloning for English — but the non-commercial weights make it a personal-project pick, not a ship-it-commercially pick, unless you retrain.
What it is. Orpheus-3B is an LLM-based TTS model from Canopy Labs built on a Llama 3.2 3B backbone. It targets the most human-sounding, emotionally expressive speech in the open-source field, with inline emotion tags and zero-shot cloning. There's a finetuned voice set (tara, leah, jess, leo, dan, mia, zac, zoe) plus a pretrained base.
License. Apache 2.0 — clean for commercial use (a big advantage over F5-TTS). Verify the specific checkpoint card, but the project is Apache-licensed.
Hardware. This is a 3-billion-parameter LLM, so it needs a GPU. Plan for roughly 10–16 GB VRAM at fp16 (less with quantized GGUF builds via the community orpheus-tts-local + llama.cpp route). It is not a CPU/laptop model in its full form; the quantized local builds make it possible on smaller hardware but with quality/speed trade-offs.
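A quick back-of-envelope check shows where the VRAM goes. The weights alone take parameters times bytes per parameter. The gap between that number and the planning figure is presumably KV cache, activations, the audio decoder, and runtime overhead, which this sketch does not model. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for model weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"3B params at {label}: {weight_memory_gb(3e9, bits):.1f} GB")
```

This is why 4-bit quantized builds fit on cards that cannot hold the fp16 model: the weights shrink to about a quarter of their fp16 size.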
Speed. Designed for low-latency streaming (~200 ms time-to-first-audio claims via vLLM). Throughput is good on a capable GPU; on weak hardware it is sluggish.
Voices & cloning. Named finetuned voices + zero-shot cloning, plus emotion tags embedded in text:
<laugh> <chuckle> <sigh> <cough> <sniffle> <groan> <yawn> <gasp>
English quality. Among the most lifelike and emotive open models — great for expressive narration, characters, and dialogue where you want laughs/sighs/breathing. For dry document read-aloud it can be more expressive than necessary, but it's a top pick when "sounds real and alive" is the priority and you have a GPU.
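Because the emotion tags are plain text inside the prompt, a typo just becomes unexpected input to the model. A small pre-flight check can catch that. This is a sketch: `find_unknown_tags` is a hypothetical helper, and the tag set is copied from the list in this section rather than read from the library.

```python
import re

# Tag names copied from the emotion-tag list in this section.
SUPPORTED_TAGS = {"laugh", "chuckle", "sigh", "cough", "sniffle", "groan", "yawn", "gasp"}


def find_unknown_tags(prompt: str) -> list[str]:
    """Return angle-bracket tag names in the prompt that are not in SUPPORTED_TAGS."""
    found = re.findall(r"<([A-Za-z_]+)>", prompt)
    return [tag for tag in found if tag not in SUPPORTED_TAGS]


print(find_unknown_tags("Hello <chuckle> there <wink>, how are you <sigh>"))  # ['wink']
```

The check is deliberately case-sensitive, on the assumption that tags must be written exactly as listed.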
Czech status. English-first. Multilingual research releases have appeared, but Czech is unverified. See 05b.
Install + hello world (GPU / vLLM route):
pip install orpheus-speech # pulls vLLM; CUDA GPU expected
from orpheus_tts import OrpheusModel
import wave
model = OrpheusModel(model_name="canopylabs/orpheus-3b-0.1-ft")
audio_chunks = model.generate_speech(
    prompt="Hello <chuckle> this is Orpheus reading your text with real emotion.",
    voice="tara",
)
# generate_speech streams raw 16-bit PCM chunks; write them into a 24 kHz mono WAV.
with wave.open("orpheus.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(24000)
    for chunk in audio_chunks:
        wf.writeframes(chunk)
For low-VRAM machines, see the community orpheus-tts-local project, which runs quantized GGUF weights through an LM server (LM Studio / llama.cpp) and decodes audio tokens with SNAC.
Verdict: Most human/emotive English and a clean Apache-2.0 license — but it is a GPU model. If you have the VRAM and want expressiveness, this is the prestige pick.
What it is. Chatterbox is Resemble AI's open-source TTS model (~0.5B, Llama-backed), with zero-shot cloning, an exaggeration (emotion-intensity) control, and built-in Perth watermarking. Resemble's own blind tests claimed listeners preferred Chatterbox over ElevenLabs roughly 64% of the time (vendor-run, so treat the figure as marketing-flavored, though the model is genuinely strong). A separate Chatterbox Multilingual (23 languages) variant was later released.
License. MIT — for both code and weights, which makes Chatterbox one of the few commercially clean, high-quality, clone-capable models here. This is a meaningful edge over F5-TTS.
Hardware. Prefers a GPU (CUDA). At ~0.5B it is far lighter than Orpheus — a ~6–8 GB VRAM card is comfortable; smaller works for short clips. CPU inference is possible but slow; an Apple-Silicon MPS path exists with moderate performance. A streaming variant (chatterbox-streaming) reduces latency.
Speed. On a decent GPU it is fast enough for interactive use (faster than real time, RTF below 1). On CPU it is not a good read-aloud experience for long text.
Voices & cloning. Zero-shot cloning from a reference clip, plus the exaggeration and cfg_weight dials to push emotional intensity up or calm delivery down — a genuinely useful knob for read-aloud tone.
English quality. Top tier among open models — natural, expressive, and the exaggeration control lets you tune between flat-and-clear (good for documents) and dramatic (good for stories). The Perth watermark is inaudible but present (relevant for provenance/compliance).
Czech status. The original Chatterbox (May 2025) is English-only. The newer Chatterbox Multilingual (v2, Sept 2025) supports 23 languages — Arabic, Danish, German, Greek, English, Spanish, Finnish, French, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Dutch, Norwegian, Polish, Portuguese, Russian, Swedish, Swahili, Turkish, Chinese — and Czech is NOT in that set. So Chatterbox, despite being one of the best models here, does not cover the reader's Czech requirement. It covers Polish and Russian (Slavic neighbors) but not Czech. See 05b.
Install + hello world:
pip install chatterbox-tts
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS
model = ChatterboxTTS.from_pretrained(device="cuda") # "mps" on Apple Silicon, "cpu" otherwise
wav = model.generate(
    "Hello, this is Chatterbox reading your text.",
    audio_prompt_path="ref.wav",  # optional: clone a voice
    exaggeration=0.5,             # 0 = flat, higher = more emotive
    cfg_weight=0.5,
)
ta.save("chatterbox.wav", wav, model.sr)
Verdict: Best MIT-licensed, clone-capable, expressive English model — the strongest "ship it commercially" pick when you have a GPU. The emotion dial is the differentiator.
What it is. StyleTTS 2 is the 2023 model ("Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models") that set the open-source quality bar and whose architecture Kokoro descends from. It reports CMOS results that surpass human recordings on single-speaker LJSpeech and match them on multi-speaker VCTK, with strong zero-shot adaptation on LibriTTS.
License. MIT (code and the released models). Commercially clean — though check individual fine-tuned checkpoints' data licenses if you use third-party voices.
Hardware. Lighter than the LLM-based models; runs on modest GPUs (~4–8 GB VRAM) and is workable on CPU for short clips (slower). It is not as plug-and-play as Kokoro: the repo expects you to wire up phonemizer, models, and config, so the setup is the friction point, not the hardware.
Speed. Fast on GPU (it's a comparatively small, non-LLM model). RTF well under 1 on GPU.
Voices & cloning. Pre-trained single/multi-speaker checkpoints plus zero-shot style/voice adaptation from reference audio. Style diffusion lets you sample varied deliveries.
English quality. Outstanding — still among the most natural-sounding open English TTS for clean inputs. In practice, Kokoro packages StyleTTS-2-class quality with vastly easier installation, which is why most builders now reach for Kokoro unless they specifically need StyleTTS 2's cloning/style controls or want to fine-tune.
Czech status. English/standard-dataset focused; no first-class Czech support, would require training your own. See 05b.
Install + hello world (source route):
git clone https://github.com/yl4579/StyleTTS2.git
cd StyleTTS2
pip install -r requirements.txt
sudo apt-get install -y espeak-ng # phonemizer backend
# Download a pretrained checkpoint (e.g., LibriTTS) into Models/ per the README.
# StyleTTS 2 has no single one-liner API; inference goes through the repo's
# notebook/utils (load config + model, phonemize text, run the diffusion sampler).
# See Demo/Inference_LibriTTS.ipynb in the repo for the canonical minimal example.
Verdict: Top quality, MIT, but high setup friction. For most readers, Kokoro is the easier on-ramp to the same lineage; reach for StyleTTS 2 directly only if you need its cloning/style features or plan to fine-tune.
What it is. XTTS-v2 is Coqui's multilingual voice-cloning model. Coqui (the company) shut down in early 2024, but the model lives on and the codebase is maintained as the Idiap coqui-tts fork. It clones a voice from ~6 seconds of reference audio and supports cross-language cloning.
Why it's in an "English" chapter. Because it is the most relevant bridge to the reader's Czech requirement: XTTS-v2 officially lists Czech among its 17 languages (English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, Hindi). Its English is good (not class-leading vs. 2025 models), but its native Czech support is rare and valuable — covered properly in 05b.
License (the catch). The weights are under the Coqui Public Model License (CPML), which is non-commercial. Like F5-TTS, this makes XTTS-v2 fine for personal/free use but a blocker for a paid product. Note the distinction: the coqui-tts code is under the Mozilla Public License 2.0 (weak copyleft, commercial use allowed), but the XTTS-v2 weights are CPML non-commercial.
Hardware. Runs on GPU comfortably (~4–6 GB VRAM) and on CPU acceptably for short-to-medium text (slower but usable — better CPU behavior than the LLM-based models). Apple-Silicon CPU works; MPS support is partial.
Speed. Faster than real time on GPU (RTF below 1); on CPU expect roughly real time for short clips and slower for long passages. Reasonable for a read-aloud feature on a modest machine.
Voices & cloning. Zero-shot cloning from ~6 s reference, in any of the supported languages, with cross-lingual transfer (clone an English voice, read Czech text). 24 kHz output.
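Cross-lingual use comes down to passing the right language code along with the reference clip. The sketch below maps the 17 language names listed in this section to codes and rejects anything else before a synthesis call is made. The codes are an assumption taken from the XTTS-v2 model card as I understand it, so check them against your installed version. `xtts_language_code` is a hypothetical helper.

```python
# Language name -> code. Names come from this section's list; the codes are assumed
# from the XTTS-v2 model card and should be verified against your installed version.
XTTS_V2_LANGUAGES = {
    "English": "en", "Spanish": "es", "French": "fr", "German": "de",
    "Italian": "it", "Portuguese": "pt", "Polish": "pl", "Turkish": "tr",
    "Russian": "ru", "Dutch": "nl", "Czech": "cs", "Arabic": "ar",
    "Chinese": "zh-cn", "Japanese": "ja", "Hungarian": "hu", "Korean": "ko",
    "Hindi": "hi",
}


def xtts_language_code(name: str) -> str:
    """Return the language code for a listed language, or raise ValueError."""
    try:
        return XTTS_V2_LANGUAGES[name.strip().title()]
    except KeyError:
        raise ValueError(f"{name!r} is not in the XTTS-v2 language list") from None


print(xtts_language_code("czech"))  # cs
```

Failing early on an unsupported language is cheaper than loading the model and discovering the problem at synthesis time.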
English quality. Solid and natural, though the 2025 generation (Chatterbox/Orpheus/F5) generally edges it on raw English realism. Where XTTS-v2 wins is breadth + official Czech.
Install + hello world:
pip install coqui-tts # maintained Idiap fork (use this, not the archived 'TTS')
from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda") # or .to("cpu")
tts.tts_to_file(
    text="Hello, and this same voice can also read Czech.",
    speaker_wav="ref.wav",   # ~6+ seconds of clean reference audio
    language="en",           # use "cs" for Czech
    file_path="xtts_en.wav",
)
Verdict: The local model to beat for Czech, and a competent English cloner — but non-commercial weights. For the reader's bilingual, free, personal case it's a front-runner; for a commercial bilingual product the license forces a different plan.
| Model | Params | English quality (read-aloud) | Cloning | Stock voices | CPU / Apple Silicon | Min GPU VRAM | RTF (GPU, approx) | License | Czech (this report) |
|---|---|---|---|---|---|---|---|---|---|
| Kokoro-82M | 82M | Very good, consistent | No (voice packs) | Yes (EN + some langs) | Yes; faster than RT on CPU/MLX | none needed | ≪1 | Apache 2.0 (commercial OK) | Weak / unverified |
| F5-TTS | ~0.3B+ | Excellent (great clones) | Yes (zero-shot) | No (you supply) | Slow on CPU; GPU preferred | ~8–12 GB | ~0.15–0.3 | MIT code / NC weights | Unofficial / unverified |
| Orpheus-3B | 3B | Most expressive/lifelike | Yes | Yes (8 named) | No (GPU; possible via GGUF) | ~10–16 GB | low (streaming) | Apache 2.0 | Unverified |
| Chatterbox | ~0.5B | Excellent + emotion dial | Yes (zero-shot) | One default | GPU preferred; CPU slow | ~6–8 GB | <1 | MIT (commercial OK) | No; 23-lang variant excludes Czech |
| StyleTTS 2 | ~0.1B | Excellent (clean inputs) | Yes (zero-shot) | Pretrained ckpts | CPU short clips; GPU better | ~4–8 GB | <1 | MIT | No first-class support |
| XTTS-v2 | ~0.46B | Good/natural | Yes (~6 s ref) | Built-in speakers | CPU usable; GPU better | ~4–6 GB | <1 (GPU) | CPML (non-commercial) | Official Czech |
RTF and VRAM figures are approximate and hardware-dependent; treat them as planning guidance, not guarantees.
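The table can also be read as a filter. The sketch below encodes a few of its columns and returns the models that satisfy a set of requirements. The boolean fields are my reading of the table (for example, `cpu_practical` is true only for Kokoro and XTTS-v2, whose rows say CPU use is workable beyond short clips), and the names `TTSModel` and `shortlist` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTSModel:
    name: str
    license: str
    commercial_ok: bool
    cloning: bool
    cpu_practical: bool   # interpretation of the "CPU / Apple Silicon" column
    official_czech: bool


MODELS = [
    TTSModel("Kokoro-82M", "Apache 2.0", True, False, True, False),
    TTSModel("F5-TTS", "MIT code / NC weights", False, True, False, False),
    TTSModel("Orpheus-3B", "Apache 2.0", True, True, False, False),
    TTSModel("Chatterbox", "MIT", True, True, False, False),
    TTSModel("StyleTTS 2", "MIT", True, True, False, False),
    TTSModel("XTTS-v2", "CPML (non-commercial)", False, True, True, True),
]


def shortlist(commercial: bool = False, cloning: bool = False,
              cpu_only: bool = False, czech: bool = False) -> list[str]:
    """Names of models that meet every requirement set to True."""
    return [
        m.name for m in MODELS
        if (not commercial or m.commercial_ok)
        and (not cloning or m.cloning)
        and (not cpu_only or m.cpu_practical)
        and (not czech or m.official_czech)
    ]


print(shortlist(commercial=True, cloning=True))   # ['Orpheus-3B', 'Chatterbox', 'StyleTTS 2']
print(shortlist(commercial=True, cpu_only=True))  # ['Kokoro-82M']
print(shortlist(commercial=True, czech=True))     # []
```

The empty result for commercial plus Czech matches the caveat on XTTS-v2's license: no model in this chapter satisfies both.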
For reading your texts — paragraphs, documents, articles — every model above needs a text pipeline around it, not just raw inference:
These four steps matter more for perceived quality than the choice between two close models.
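As an illustration of what such a pipeline stage can look like, here is a minimal text-normalization sketch, the kind of cleanup this chapter also recommends before F5-TTS. It is an assumption about one plausible stage, not a reconstruction of the steps referred to above. The rules and the abbreviation table are illustrative, and `normalize_for_tts` is a hypothetical helper.

```python
import re

# Illustrative, hypothetical abbreviation table; extend it for your own documents.
ABBREVIATIONS = {"e.g.": "for example", "i.e.": "that is", "vs.": "versus"}


def normalize_for_tts(text: str) -> str:
    """Replace URLs, strip simple markup, expand a few abbreviations, collapse whitespace."""
    text = re.sub(r"https?://\S+", "link", text)  # URLs read badly aloud
    text = re.sub(r"[*`#]+", "", text)            # Markdown emphasis, code, heading marks
    for short, spoken in ABBREVIATIONS.items():
        text = text.replace(short, spoken)
    return re.sub(r"\s+", " ", text).strip()


raw = "Use **Kokoro**, e.g. on a   laptop.\nSee https://example.com/docs for `details`."
print(normalize_for_tts(raw))
# Use Kokoro, for example on a laptop. See link for details.
```

Real documents also need numbers, dates, units, and currency spelled out; a dedicated normalization library handles those better than hand-written rules.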