Text-to-Speech for Reading Your Texts — Natural Voices in English & Czech (2026)

Chapter 05a: Local Models — Best for English

Chapter 5 of 17 · ~20 min read

Overview

This chapter is the hands-on deep dive into the six strongest local, free, open-weight TTS models for English naturalness as of mid-2026: Kokoro-82M, F5-TTS, Orpheus-3B, Chatterbox, StyleTTS 2, and XTTS-v2. "Local" here means the model runs on your own hardware — your laptop, a workstation GPU, or a self-hosted Docker box — with no per-character cloud fees once downloaded.

The reader's brief is specific: a TTS feature that reads their text aloud, sounds real, and stays free or near-free. This chapter answers the English half of that brief. Each model gets install steps, the model download, a copy-pastable "hello world" Python snippet, hardware reality (whether it runs on CPU or Apple Silicon, or needs a GPU and how much VRAM), inference speed (real-time factor, RTF), voice options and cloning, an honest quality assessment, and the license, because the license decides whether you can ship it.

Important scope note on Czech. This chapter grades these models on English. Czech is a hard filter for the reader, and most of the models below are weak or unverified in Czech. I flag each model's Czech status briefly here, but the full Czech evaluation lives in the companion chapter (05b). If you only need English, this chapter is your shortlist. If you need both languages from a single local model, jump ahead: the short list is XTTS-v2 plus, tentatively, the multilingual Chatterbox (the comparison table in this chapter lists its 23-language variant as excluding Czech), and it comes with caveats.

Content

How to read the numbers

RTF (real-time factor) = seconds of compute per second of audio. RTF 0.3 means it generates a 10-second clip in 3 seconds (≈3× faster than real-time). RTF below 1.0 is "faster than real-time" and is what you want for a read-aloud feature. Hardware changes RTF dramatically, so treat all figures as approximate and hardware-dependent.
Quality is graded for an English read-aloud use case (long-form, neutral-to-expressive narration), not for one-shot demo lines. Where a real MOS (mean opinion score) or TTS-Arena Elo result exists, I cite it; otherwise I say "subjective."
Several excellent-sounding models ship under non-commercial terms (for the model weights, even if the code is MIT). If you plan to monetize, that disqualifies them or forces you to retrain. I call this out per model.
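The RTF definition above reduces to a single division. A minimal sketch in Python follows; the function names `real_time_factor` and `speedup` are illustrative and do not come from any TTS library.

```python
def real_time_factor(compute_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return compute_seconds / audio_seconds


def speedup(rtf: float) -> float:
    """How many times faster than real time a given RTF is."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# The example from the text: a 10-second clip generated in 3 seconds.
rtf = real_time_factor(compute_seconds=3.0, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}, about {speedup(rtf):.1f}x real time")
```

Anything with `real_time_factor(...) < 1.0` keeps up with playback; the lower the value, the more headroom a read-aloud feature has on slower hardware.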

pip install kokoro soundfile
# Phonemizer backend (Linux/WSL):
sudo apt-get install -y espeak-ng
# macOS: brew install espeak-ng

from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')          # 'a' = American English, 'b' = British
text = "Hello — this is Kokoro reading your text aloud. It runs on a plain laptop CPU."

# Kokoro yields audio in chunks; concatenate or write per chunk.
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice='af_heart')):
    sf.write(f'kokoro_{i}.wav', audio, 24000)
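For a read-aloud feature, one output file is usually more convenient than one file per chunk. The sketch below joins chunks into a single mono WAV using only the standard library. The helper name `write_chunks_as_wav` is hypothetical, and it assumes each chunk is a 1-D sequence of float samples in the range -1.0 to 1.0 at 24 kHz, which should be checked against what the installed Kokoro version actually yields.

```python
import struct
import wave
from typing import Iterable, Sequence


def write_chunks_as_wav(chunks: Iterable[Sequence[float]], path: str,
                        sample_rate: int = 24000) -> int:
    """Write float chunks to one mono 16-bit WAV; return the sample count."""
    total = 0
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 16-bit PCM
        wf.setframerate(sample_rate)
        for chunk in chunks:
            # Clip to [-1, 1], then scale to signed 16-bit integers.
            pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in chunk]
            wf.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
            total += len(pcm)
    return total


# Stand-in data: two half-second chunks of silence instead of real model output.
fake_chunks = [[0.0] * 12000, [0.0] * 12000]
n = write_chunks_as_wav(fake_chunks, "joined.wav")
print(n / 24000, "seconds written")
```

With the pipeline from the snippet above, the call would look like `write_chunks_as_wav((audio for _, _, audio in pipeline(text, voice='af_heart')), 'kokoro.wav')`, under the same assumptions. The per-sample Python loop is slow for long texts; concatenating arrays with NumPy and writing once with `soundfile` is the faster route when those packages are available.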

pip install kokoro-onnx soundfile
# Download kokoro-v*.onnx and voices-*.bin from the kokoro-onnx releases page first.

import soundfile as sf
from kokoro_onnx import Kokoro

kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
samples, sr = kokoro.create("Hello from the ONNX runtime.", voice="af_heart", speed=1.0, lang="en-us")
sf.write("kokoro_onnx.wav", samples, sr)

conda create -n f5-tts python=3.10 -y && conda activate f5-tts
pip install f5-tts

# CLI (simplest):
f5-tts_infer-cli \
  --model F5TTS_v1_Base \
  --ref_audio "ref.wav" \
  --ref_text "This is the transcript of the reference clip." \
  --gen_text "Now read my text in this cloned voice." \
  --output_dir out/

# Python API:
from f5_tts.api import F5TTS
f5 = F5TTS()   # downloads the base checkpoint on first run
wav, sr, _ = f5.infer(
    ref_file="ref.wav",
    ref_text="This is the transcript of the reference clip.",
    gen_text="Now read my text in this cloned voice.",
    file_wave="f5_out.wav",
)

<laugh> <chuckle> <sigh> <cough> <sniffle> <groan> <yawn> <gasp>

pip install orpheus-speech   # pulls vLLM; CUDA GPU expected

from orpheus_tts import OrpheusModel
import wave

model = OrpheusModel(model_name="canopylabs/orpheus-3b-0.1-ft")
# generate_speech streams raw audio chunks (bytes), not text tokens.
audio_chunks = model.generate_speech(
    prompt="Hello <chuckle> this is Orpheus reading your text with real emotion.",
    voice="tara",
)
with wave.open("orpheus.wav", "wb") as wf:
    wf.setnchannels(1)       # mono
    wf.setsampwidth(2)       # 16-bit PCM
    wf.setframerate(24000)   # 24 kHz output
    for chunk in audio_chunks:
        wf.writeframes(chunk)
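Because the snippet above receives audio as a stream of byte chunks (16-bit mono at 24 kHz), the RTF of a streaming model can be measured by timing the loop and counting bytes. The following is a rough sketch, not part of the Orpheus API: `measure_stream` and `fake_stream` are hypothetical names, and it assumes generation happens lazily while the stream is iterated.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[bytes], sample_rate: int = 24000,
                   sample_width: int = 2) -> Tuple[float, float]:
    """Consume mono PCM chunks; return (audio_seconds, compute_seconds)."""
    start = time.perf_counter()
    total_bytes = 0
    for chunk in chunks:              # assumed: each step triggers generation
        total_bytes += len(chunk)
    compute_seconds = time.perf_counter() - start
    audio_seconds = total_bytes / (sample_rate * sample_width)
    return audio_seconds, compute_seconds


def fake_stream() -> Iterator[bytes]:
    """Stand-in generator: two chunks of one second each (48,000 bytes)."""
    for _ in range(2):
        yield b"\x00" * 48000


audio_s, compute_s = measure_stream(fake_stream())
print(f"{audio_s:.1f} s of audio in {compute_s:.4f} s, RTF = {compute_s / audio_s:.4f}")
```

Passing the generator returned by `model.generate_speech(...)` in place of `fake_stream()` would give a hardware-specific RTF for that prompt. Model load time is excluded, so the first measurement after startup may still look worse than steady state.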

pip install chatterbox-tts

import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")   # "mps" on Apple Silicon, "cpu" otherwise
wav = model.generate(
    "Hello, this is Chatterbox reading your text.",
    audio_prompt_path="ref.wav",   # optional: clone a voice
    exaggeration=0.5,              # 0 = flat, higher = more emotive
    cfg_weight=0.5,
)
ta.save("chatterbox.wav", wav, model.sr)

git clone https://github.com/yl4579/StyleTTS2.git
cd StyleTTS2
pip install -r requirements.txt
sudo apt-get install -y espeak-ng       # phonemizer backend
# Download a pretrained checkpoint (e.g., LibriTTS) into Models/ per the README.

# StyleTTS 2 has no single one-liner API; inference goes through the repo's
# notebook/utils (load config + model, phonemize text, run the diffusion sampler).
# See Demo/Inference_LibriTTS.ipynb in the repo for the canonical minimal example.

pip install coqui-tts          # maintained Idiap fork (use this, not the archived 'TTS')

from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")  # or .to("cpu")
tts.tts_to_file(
    text="Hello — and this same voice can also read Czech.",
    speaker_wav="ref.wav",     # ~6+ seconds of clean reference audio
    language="en",             # use "cs" for Czech
    file_path="xtts_en.wav",
)

Model	Params	English quality (read-aloud)	Cloning	Stock voices	CPU / Apple Silicon	Min GPU VRAM	RTF (GPU, approx)	License	Czech (this report)
Kokoro-82M	82M	Very good, consistent	No (voice packs)	Yes (EN + some langs)	Yes — faster than RT on CPU/MLX	none needed	tiny	Apache 2.0 (commercial OK)	Weak / unverified
F5-TTS	~0.3B+	Excellent (great clones)	Yes (zero-shot)	No (you supply)	Slow on CPU; GPU preferred	~8–12 GB	~0.15–0.3	MIT code / NC weights	Unofficial / unverified
Orpheus-3B	3B	Most expressive/lifelike	Yes	Yes (8 named)	No (GPU needed; GGUF builds possible)	~10–16 GB	low (streaming)	Apache 2.0	Unverified
Chatterbox	~0.5B	Excellent + emotion dial	Yes (zero-shot)	One default	GPU preferred; CPU slow	~6–8 GB	sub-RT	MIT (commercial OK)	No — 23-lang variant excludes Czech
StyleTTS 2	~0.1B	Excellent (clean inputs)	Yes (zero-shot)	Pretrained ckpts	CPU short clips; GPU better	~4–8 GB	<1	MIT	No first-class support
XTTS-v2	~0.46B	Good/natural	Yes (~6 s ref)	Built-in speakers	CPU usable; GPU better	~4–6 GB	<1 (GPU)	CPML (non-commercial)	Official Czech
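The license and Czech columns of the table above can be turned into a small filter. This is a sketch only: the `Model` class and `shortlist` function are illustrative names, the flags are transcribed from the table rather than from the license texts, and `non_commercial=False` means only that the table does not flag the model, not that its terms have been legally reviewed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Model:
    name: str
    license: str
    non_commercial: bool   # True where the table flags non-commercial terms
    official_czech: bool   # True only where the table says "Official Czech"


# Values transcribed from the comparison table.
MODELS = [
    Model("Kokoro-82M", "Apache 2.0", False, False),
    Model("F5-TTS", "MIT code / NC weights", True, False),
    Model("Orpheus-3B", "Apache 2.0", False, False),
    Model("Chatterbox", "MIT", False, False),
    Model("StyleTTS 2", "MIT", False, False),
    Model("XTTS-v2", "CPML", True, True),
]


def shortlist(need_commercial: bool, need_czech: bool) -> List[str]:
    """Names of models that pass the requested license and Czech filters."""
    return [
        m.name for m in MODELS
        if not (need_commercial and m.non_commercial)
        and not (need_czech and not m.official_czech)
    ]


print("Commercial, English only:", shortlist(True, False))
print("Official Czech:", shortlist(False, True))
print("Commercial and official Czech:", shortlist(True, True))
```

The last query comes back empty, which is the caveat in code form: by this table, the only model with official Czech is also the one under a non-commercial license.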

1. Kokoro-82M — the efficiency champion (start here)

2. F5-TTS — best free zero-shot voice cloning (with a license asterisk)

3. Orpheus-3B — most expressive / human-like (needs a GPU)

4. Chatterbox — best "ElevenLabs-killer" claim, emotion dial, MIT (English; multilingual variant)

5. StyleTTS 2 — research-grade quality, fiddlier to run

6. XTTS-v2 — the one with real, official Czech (but non-commercial license)

Head-to-head: English read-aloud comparison

Decision shortcuts for the reader's English need

A note on long-document read-aloud (all models)

Key Takeaways

Key References