Chapter 06: Self-Hosting an Open Model as a Server

Chapter 7 of 17 · ~16 min read

Overview

The previous chapters covered the open TTS models themselves — Piper, Kokoro, XTTS-v2, and friends — as libraries you call from Python. That is fine for a batch script, but a real application wants a service: an HTTP endpoint that your backend (or even your frontend) can POST text to and get audio back. This chapter is the engineering bridge between "I can run piper on the command line" and "my app has a /speak button that works in production, reads English and Czech, and costs almost nothing."

We cover the standard patterns for wrapping an open model as a server, the ready-made projects that have already done this work for you (and how they handle Czech), the runtime decisions that actually matter (CPU vs GPU, concurrency, streaming, caching), and where to host the thing — your own VPS, a small GPU box, or serverless GPU like Modal/RunPod/Replicate — with rough 2026 monthly costs. There is a concrete docker-compose.yml, a Dockerfile, and a sample HTTP request you can paste into your project today.

The bias of this chapter matches the reader's brief: best natural-sounding English + Czech that stays free or near-free. For self-hosting that almost always means one of two answers — Piper (CPU, truly free, has a real Czech voice) or Kokoro (light GPU, more natural English, no native Czech) — wrapped behind an OpenAI-compatible endpoint so you can swap engines without rewriting your client.

Content

Why "OpenAI-compatible" is the format to target

OpenAI's text-to-speech API is a single, dead-simple endpoint, and it has become the de-facto standard that almost every self-host project now imitates. If you build your client against this shape, you can later swap the backend (local Piper → cloud OpenAI → a friend's GPU box) by changing only a base URL and an API key.

The request is a JSON POST to /v1/audio/speech:

{
  "model": "tts-1",
  "input": "Dobrý den, toto je test české syntézy řeči.",
  "voice": "alloy",
  "response_format": "mp3",
  "speed": 1.0
}

The response body is raw audio bytes (not JSON) in the requested container. OpenAI supports `mp3`, `opus`, `aac`, `flac`, `wav`, and `pcm`; self-host projects typically support a subset, with a couple of formats near-universal and others increasingly offered for streaming. Because the official SDKs speak this format, you get clients in Python, JS, Go, etc. for free.
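As a rough illustration of the request shape above, here is a minimal standard-library sketch of a client. The base URL `http://localhost:8000`, the placeholder API key, and the helper names `build_speech_request` and `synthesize` are assumptions for illustration only; they do not come from any particular server or SDK. Swapping backends then amounts to passing a different base URL and key.

```python
import json
import urllib.request


def build_speech_request(base_url, api_key, text, voice="alloy",
                         model="tts-1", response_format="mp3", speed=1.0):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def synthesize(request, out_path, timeout=60):
    """Send the request and write the raw audio bytes to out_path.

    The response body is audio, not JSON, so it is written as-is.
    Returns the number of bytes written.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

A call might look like `synthesize(build_speech_request("http://localhost:8000", "sk-local-placeholder", "Dobrý den."), "out.mp3")`, assuming a compatible server is listening at that (hypothetical) address. Many self-hosted servers ignore the API key, but sending one keeps the client identical across backends.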

| Project | Engines it wraps | OpenAI-compatible | Czech support | Runtime | Docker images | Best for |
|---|---|---|---|---|---|---|
| openedai-speech (matatonic) | Piper, Coqui XTTS-v2 | Yes (`/v1/audio/speech`) | Yes via Piper (`cs_CZ-jirka`); XTTS lists Czech but quality is uneven | CPU (Piper) + GPU (XTTS) | CPU + CUDA images | Drop-in OpenAI replacement, Open WebUI |
| Kokoro-FastAPI (remsky) | Kokoro-82M only | Yes + a streaming endpoint | No native Czech (en/es/fr/hi/it/ja/pt/zh) | CPU (ONNX) or NVIDIA GPU (PyTorch) | `kokoro-fastapi-cpu`, `-gpu` | Most natural English at low cost |
| speaches (speaches-ai) | Kokoro, Piper (TTS) + faster-whisper (STT) | Yes (TTS + STT) | Yes via Piper; not via Kokoro | CPU or GPU | Official image | One server for both STT and TTS |
| Wyoming Piper (rhasspy) | Piper only | No (Wyoming protocol) | Yes (`cs_CZ-jirka`) | CPU | `rhasspy/wyoming-piper` | Home Assistant / voice-assistant stacks |
| openai-edge-tts (travisvn) | Microsoft Edge TTS (cloud relay) | Yes | Yes (Edge has good `cs-CZ` voices) | CPU, tiny | Lightweight image | Free cloud-quality via Edge, but not truly self-contained/offline |

| Model | CPU viable? | GPU helps? | Rough RAM/VRAM | Real-time factor (RTF)* | Notes |
|---|---|---|---|---|---|
| Piper | Yes, designed for CPU | Marginal | ~150–300 MB RAM | well under real-time even on a Raspberry Pi 4 (`medium` voice) | Cheapest possible self-host; the free answer for Czech |
| Kokoro-82M | Yes (ONNX, slower) | Yes, big win | ~1–2 GB VRAM | ~0.1–0.3 RTF on a consumer NVIDIA GPU; several× real-time on CPU but slower | 82M params is tiny, so even modest GPUs fly |
| XTTS-v2 | Painfully slow on CPU | Yes, needed | ~4–6 GB VRAM | near or below real-time on GPU; not practical on CPU for interactive use | Use only if you need voice cloning |
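To make the RTF column concrete, the sketch below uses the usual definition, assumed here because the table's footnote is not reproduced: RTF is synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The function names are illustrative only.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the resulting audio.

    Assumed definition: RTF < 1 means the engine is faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(rtf, audio_seconds):
    """Estimate how long a clip of the given length takes at a given RTF."""
    return rtf * audio_seconds


# Example: 3 s of compute for a 12 s clip gives an RTF of 0.25.
example_rtf = real_time_factor(3.0, 12.0)

# Example: at an RTF of 0.2, one minute of audio needs about 12 s of compute.
example_wait = estimated_synthesis_seconds(0.2, 60.0)
```

Measuring RTF on your own hardware with your own texts is more reliable than any table, since voice, sentence length, and output format all shift the number.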

| Option | Example | Rough cost | When it fits | Caveat |
|---|---|---|---|---|
| Your existing app server / small VPS (CPU) | Hetzner CX/CPX, or a $5–10/mo box | €4–15/mo, often €0 marginal if you already have it | Piper (Czech + English), low/medium volume | No GPU, so Kokoro is CPU-slow and XTTS impractical |
| Bigger CPU VPS | Hetzner CPX31/41, ~4–8 vCPU | ~€15–30/mo | Piper at higher concurrency, or Kokoro-ONNX-CPU with relaxed latency | Still no GPU acceleration |
| Dedicated small GPU server | Hetzner GEX44 (RTX 4000-class) | ~€180–230/mo (verify current) | Always-on Kokoro/XTTS, steady traffic | Pay 24/7 even when idle; only worth it at volume |
| Serverless GPU (per-second) | RunPod serverless, Modal, Replicate | RTX-4090-class ~$0.0003–0.0004/sec; A100-80GB ~$0.0008/sec (RunPod, approx) | Bursty traffic, scale-to-zero, pay only while synthesizing | Cold starts (model load) add latency; container must boot |
| Cloud TTS API instead of self-host | OpenAI / Azure / Google / ElevenLabs | per-character/minute (covered in earlier chapters) | When you don't want to operate anything | Not self-hosted; recurring per-use cost |
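The trade-off between the dedicated and serverless GPU rows comes down to simple arithmetic, sketched below. The figures plugged in are the approximate ones from the table, and the sketch assumes both prices are expressed in the same currency (the table mixes € and $, so convert first). It also ignores cold-start time that some providers bill. The function names are illustrative.

```python
def serverless_monthly_cost(busy_seconds_per_month, price_per_second):
    """Cost of per-second billing: you pay only while the GPU is busy."""
    return busy_seconds_per_month * price_per_second


def break_even_busy_hours(dedicated_monthly_cost, price_per_second):
    """Busy GPU-hours per month at which serverless costs as much as
    an always-on dedicated server. Below this, serverless is cheaper."""
    break_even_seconds = dedicated_monthly_cost / price_per_second
    return break_even_seconds / 3600


# Approximate figures from the table, treated as one currency (assumption):
# a dedicated box at ~200/mo versus serverless at ~0.0004 per second.
hours = break_even_busy_hours(200, 0.0004)
print(f"Break-even at roughly {hours:.0f} busy GPU-hours per month")
```

With those inputs the break-even lands at roughly 139 busy GPU-hours per month, a bit under five hours of continuous synthesis per day. Traffic well below that favours scale-to-zero serverless; steady traffic above it favours the dedicated box.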

Text-to-Speech for Reading Your Texts — Natural Voices in English & Czech (2026)