The previous chapters covered the open TTS models themselves — Piper, Kokoro, XTTS-v2, and friends — as libraries you call from Python. That is fine for a batch script, but a real application wants a service: an HTTP endpoint that your backend (or even your frontend) can POST text to and get audio back. This chapter is the engineering bridge between "I can run piper on the command line" and "my app has a /speak button that works in production, reads English and Czech, and costs almost nothing."
We cover the standard patterns for wrapping an open model as a server, the ready-made projects that have already done this work for you (and how they handle Czech), the runtime decisions that actually matter (CPU vs GPU, concurrency, streaming, caching), and where to host the thing — your own VPS, a small GPU box, or serverless GPU like Modal/RunPod/Replicate — with rough 2026 monthly costs. There is a concrete docker-compose.yml, a Dockerfile, and a sample HTTP request you can paste into your project today.
The bias of this chapter matches the reader's brief: best natural-sounding English + Czech that stays free or near-free. For self-hosting that almost always means one of two answers — Piper (CPU, truly free, has a real Czech voice) or Kokoro (light GPU, more natural English, no native Czech) — wrapped behind an OpenAI-compatible endpoint so you can swap engines without rewriting your client.
OpenAI's text-to-speech API is a single, dead-simple endpoint, and it has become the de-facto standard that almost every self-host project now imitates. If you build your client against this shape, you can later swap the backend (local Piper → cloud OpenAI → a friend's GPU box) by changing only a base URL and an API key.
The request is a JSON POST to /v1/audio/speech:
{
  "model": "tts-1",
  "input": "Dobrý den, toto je test české syntézy řeči.",
  "voice": "alloy",
  "response_format": "mp3",
  "speed": 1.0
}
The response body is raw audio bytes (not JSON) in the requested container. OpenAI supports mp3, opus, aac, flac, wav, and pcm; self-host projects typically support a subset (almost always mp3 and wav, increasingly opus and pcm for streaming). Because the official openai SDKs speak this format, you get clients in Python, JS, Go, etc. for free:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-not-needed")
with client.audio.speech.with_streaming_response.create(
    model="tts-1", voice="cs_CZ-jirka-medium",
    input="Dobrý den, vítejte.",
) as resp:
    resp.stream_to_file("out.mp3")
The trade-off: OpenAI's schema has no first-class field for language or voice cloning reference audio. Self-host projects handle this by overloading the voice field (e.g. voice="cs_CZ-jirka-medium") or by adding non-standard extra endpoints. So target OpenAI-compatible for the 90% path, but expect to drop to a project's native API for cloning or fine language control.
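The overloaded voice field can be decoded explicitly on the server side. The sketch below assumes Piper-style voice identifiers of the form language_REGION-name-quality (as in cs_CZ-jirka-medium) and treats anything else as an opaque alias. VoiceSpec and parse_voice are hypothetical names for illustration, not part of any project's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VoiceSpec:
    """Parsed form of an overloaded OpenAI `voice` value (hypothetical helper)."""
    raw: str
    language: Optional[str] = None  # e.g. "cs"
    region: Optional[str] = None    # e.g. "CZ"
    name: Optional[str] = None      # e.g. "jirka"
    quality: Optional[str] = None   # e.g. "medium"


def parse_voice(voice: str) -> VoiceSpec:
    """Split a Piper-style id like 'cs_CZ-jirka-medium'; pass aliases through."""
    parts = voice.split("-")
    if len(parts) == 3 and "_" in parts[0]:
        language, _, region = parts[0].partition("_")
        return VoiceSpec(voice, language, region, parts[1], parts[2])
    # Plain aliases such as "alloy" carry no language information.
    return VoiceSpec(raw=voice)
```

A server could use the parsed language to pick an engine, and fall back to a configured default when the client sends a plain alias.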
If you only need one model and one language pair, a ~40-line FastAPI app is the most transparent option. Here is a minimal Piper wrapper with the basic production shape (an OpenAI-style request model, voice selection, and a marked spot for a cache lookup):
# app.py
import subprocess
from fastapi import FastAPI
from fastapi.responses import Response
from pydantic import BaseModel
app = FastAPI()
VOICES = {  # OpenAI-style alias -> piper model path
    "alloy": "/voices/en_US-amy-medium.onnx",
    "jirka": "/voices/cs_CZ-jirka-medium.onnx",
}
class SpeechReq(BaseModel):
    model: str = "tts-1"
    input: str
    voice: str = "alloy"
    response_format: str = "wav"  # accepted for compatibility; this wrapper always returns WAV
def synth(text: str, model_path: str) -> bytes:
    # piper reads text on stdin, writes WAV to stdout
    p = subprocess.run(
        ["piper", "--model", model_path, "--output_file", "-"],
        input=text.encode("utf-8"), capture_output=True, check=True,
    )
    return p.stdout
@app.post("/v1/audio/speech")
def speech(req: SpeechReq):
    # unknown voice names fall back to the default alias
    model_path = VOICES.get(req.voice, VOICES["alloy"])
    audio = synth(req.input, model_path)  # add a cache lookup here
    return Response(content=audio, media_type="audio/wav")
This is enough for a single-user app or an internal tool. What it lacks (and why ready-made projects exist): proper async concurrency, format conversion to mp3/opus, request queuing under load, multiple engines, and streaming chunked output. Don't build all of that yourself unless you have a reason to.
This is the recommended path for most readers. Several mature open-source projects already wrap the popular models behind the OpenAI schema, ship Docker images, and handle concurrency and format conversion. Here is how they compare for the English+Czech, free-or-cheap brief.
| Project | Engines it wraps | OpenAI-compatible | Czech support | Runtime | Docker images | Best for |
|---|---|---|---|---|---|---|
| openedai-speech (matatonic) | Piper, Coqui XTTS-v2 | Yes (/v1/audio/speech) | Yes via Piper (cs_CZ-jirka); XTTS lists Czech but quality is uneven | CPU (Piper) + GPU (XTTS) | CPU + CUDA images | Drop-in OpenAI replacement, Open WebUI |
| Kokoro-FastAPI (remsky) | Kokoro-82M only | Yes + a streaming endpoint | No native Czech (en/es/fr/hi/it/ja/pt/zh) | CPU (ONNX) or NVIDIA GPU (PyTorch) | kokoro-fastapi-cpu, -gpu | Most natural English at low cost |
| speaches (speaches-ai) | Kokoro, Piper (TTS) + faster-whisper (STT) | Yes (TTS + STT) | Yes via Piper; not via Kokoro | CPU or GPU | Official image | One server for both STT and TTS |
| Wyoming Piper (rhasspy) | Piper only | No (Wyoming protocol) | Yes (cs_CZ-jirka) | CPU | rhasspy/wyoming-piper | Home Assistant / voice-assistant stacks |
| openai-edge-tts (travisvn) | Microsoft Edge TTS (cloud relay) | Yes | Yes (Edge has good cs-CZ voices) | CPU, tiny | Lightweight image | Free cloud-quality via Edge — but not truly self-contained/offline |
A few honest notes on this table:
- Piper has a real Czech voice (cs_CZ-jirka-medium, trained from a Czech speaker, available in the rhasspy/piper-voices Hugging Face repo). Kokoro has no Czech, full stop: its language list as of early 2026 is English (US/UK), Spanish, French, Hindi, Italian, Japanese, Brazilian Portuguese, and Mandarin. So a Kokoro-only server cannot be your Czech path. If you want Kokoro's nicer English plus Czech, run two engines (Kokoro for en, Piper for cs). Going by the table, speaches can serve both from one container; openedai-speech pairs Piper with XTTS rather than Kokoro.
- openai-edge-tts relays requests to Microsoft Edge's cloud voices, which include good Czech ones (cs-CZ-AntoninNeural, cs-CZ-VlastaNeural). It is free and trivially light. But it is not offline or self-contained: it depends on an undocumented Microsoft endpoint that can change or rate-limit, and it's a gray-area use of a consumer service. Great for a hobby project; risky as a product dependency.
- wyoming-piper exposes Piper over the Wyoming protocol (a simple TCP protocol used by the Rhasspy/Home Assistant voice ecosystem) rather than HTTP. Unless you are integrating with Home Assistant or another Wyoming consumer, skip it; for an app, you want HTTP/OpenAI-compatible. It is mentioned here only so you don't accidentally pick the Wyoming image when you wanted the HTTP one.
This is the single biggest cost lever. Match the model to the hardware you can afford.
| Model | CPU viable? | GPU helps? | Rough RAM/VRAM | Real-time factor (RTF)* | Notes |
|---|---|---|---|---|---|
| Piper | Yes — designed for CPU | Marginal | ~150–300 MB RAM | well under real-time even on a Raspberry Pi 4 (medium voice) | Cheapest possible self-host; the free answer for Czech |
| Kokoro-82M | Yes (ONNX, slower) | Yes, big win | ~1–2 GB VRAM | ~0.1–0.3 RTF on a consumer NVIDIA GPU; several× faster than real-time on CPU, though slower than on GPU | 82M params is tiny, so even modest GPUs fly |
| XTTS-v2 | Painfully slow on CPU | Yes, needed | ~4–6 GB VRAM | near or below real-time on GPU; not practical on CPU for interactive use | Use only if you need voice cloning |
* RTF = seconds of compute per second of audio; lower is better, < 1.0 means faster than real-time. These are approximate community/maintainer figures, not lab benchmarks — verify on your own hardware and text.
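Since these figures need checking on your own hardware, a small timing harness is useful. The sketch below measures RTF for any callable that returns WAV bytes. The synth argument stands in for whichever engine wrapper you use, and fake_synth is a stand-in that only produces silence so the example is self-contained.

```python
import io
import time
import wave
from typing import Callable


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Audio length derived from the WAV header (frames / sample rate)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / w.getframerate()


def measure_rtf(synth: Callable[[str], bytes], text: str) -> float:
    """RTF = seconds of compute per second of audio produced."""
    start = time.perf_counter()
    audio = synth(text)
    elapsed = time.perf_counter() - start
    return elapsed / wav_duration_seconds(audio)


def fake_synth(text: str, rate: int = 22050) -> bytes:
    """Stand-in engine: one second of 16-bit mono silence (illustration only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * rate)
    return buf.getvalue()
```

Run it over several representative texts in each language and average the results; a single short sentence mostly measures startup overhead.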
Practical reading of this table for the brief:
A naive wrapper that runs the model synchronously will serialize every request — fine for one user, a disaster for ten. Three things to get right:
- Run inference off the event loop (run_in_threadpool, or a dedicated worker process) so the async server stays responsive. The ready-made projects already do this.
- Scale out by running several replicas (docker compose --scale). Each is cheap.

Batching (combining multiple requests into one GPU forward pass) helps throughput for high-volume Kokoro/XTTS deployments but adds latency and complexity; for "reading my texts" workloads it's usually unnecessary. The more impactful trick is chunking long input: split the text into sentences or paragraphs, synthesize them in parallel (or stream them), and stitch the results. Kokoro-FastAPI does this auto-stitching for you.
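Offloading and chunking can be combined in a few lines of standard-library asyncio. This is a minimal sketch, assuming a blocking synth(sentence) callable that returns audio bytes; MAX_PARALLEL, split_sentences, and synth_chunked are illustrative names, and the regex splitter is deliberately naive.

```python
import asyncio
import re
from typing import Callable, List

MAX_PARALLEL = 4  # assumed limit; tune to your CPU cores or GPU capacity


def split_sentences(text: str) -> List[str]:
    """Naive splitter on '.', '!' and '?' followed by whitespace (illustration only)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


async def synth_chunked(text: str, synth: Callable[[str], bytes]) -> List[bytes]:
    """Synthesize sentences concurrently in worker threads, results in input order."""
    limit = asyncio.Semaphore(MAX_PARALLEL)

    async def one(sentence: str) -> bytes:
        async with limit:
            # to_thread keeps the blocking engine call off the event loop.
            return await asyncio.to_thread(synth, sentence)

    # gather() returns results in the order the awaitables were passed in.
    return list(await asyncio.gather(*(one(s) for s in split_sentences(text))))
```

How you stitch the chunks depends on the format: raw PCM chunks can be concatenated directly, whereas WAV chunks each carry their own header and need it removed or rewritten first.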
For "read this paragraph aloud," users perceive latency as time-to-first-audio, not total time. Streaming lets playback start while the rest is still synthesizing.
- On the server, return a StreamingResponse: the server flushes audio bytes as they're produced. For raw pcm or wav this is straightforward; the client plays bytes as they arrive. Kokoro-FastAPI exposes a dedicated streaming endpoint and supports pcm/opus for low-latency playback; openedai-speech streams as well.
- In the browser, an <audio> source consumes chunked audio; in a backend pipeline, just write the stream to a file or pipe it onward.

FastAPI streaming sketch:
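import io, re, wave
from fastapi.responses import StreamingResponse
def split_sentences(text):  # naive splitter; use a real sentence tokenizer in production
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
def pcm_frames(wav_bytes):  # drop each per-sentence WAV header so chunks concatenate cleanly
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.readframes(w.getnframes())
@app.post("/v1/audio/speech")
def speech(req: SpeechReq):
    def gen():
        for sentence in split_sentences(req.input):
            yield pcm_frames(synth(sentence, VOICES[req.voice]))  # raw PCM per sentence
    # headerless PCM: the client must know the voice's sample rate and sample width
    return StreamingResponse(gen(), media_type="audio/pcm")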
If your app reads the same texts repeatedly (UI strings, article intros, notification phrases), cache the audio. A cache key is hash(text + voice + format + speed); the value is the audio blob.
- Store the audio blobs on disk (for example /cache/<sha256>.mp3) for persistence across restarts; serve cache hits with zero inference.

import hashlib
def cache_key(text, voice, fmt, speed):
    # every parameter that changes the audio must be part of the key
    raw = f"{text}|{voice}|{fmt}|{speed}".encode()
    return hashlib.sha256(raw).hexdigest()
# on hit: return open(f"/cache/{key}.{fmt}", "rb").read()
For a "read my texts" product where content is often static or semi-static, caching can turn 90%+ of requests into instant file serves, which collapses your GPU/CPU bill.
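The lookup-then-store flow around that key can be sketched with the standard library alone. This is one possible shape, assuming a synth(text, voice) callable that returns audio bytes; get_or_synthesize and the cache directory layout are illustrative, not taken from any of the projects above.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_key(text: str, voice: str, fmt: str, speed: float) -> str:
    """Stable key over every parameter that changes the resulting audio."""
    raw = f"{text}|{voice}|{fmt}|{speed}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def get_or_synthesize(cache_dir: Path, text: str, voice: str, fmt: str,
                      speed: float, synth: Callable[[str, str], bytes]) -> bytes:
    """Return cached audio if present; otherwise synthesize and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(text, voice, fmt, speed)}.{fmt}"
    if path.exists():
        return path.read_bytes()  # cache hit: no inference at all
    audio = synth(text, voice)
    # Write to a temporary name, then rename, so readers never see a partial file.
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_bytes(audio)
    tmp.replace(path)
    return audio
```

This sketch has no eviction and no locking between concurrent writers of the same key; for a small set of static texts that is usually acceptable, but a busy multi-worker deployment would want both.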
| Option | Example | Rough cost | When it fits | Caveat |
|---|---|---|---|---|
| Your existing app server / small VPS (CPU) | Hetzner CX/CPX, or a $5–10/mo box | €4–15/mo, often €0 marginal if you already have it | Piper (Czech + English), low/medium volume | No GPU, so Kokoro is CPU-slow and XTTS impractical |
| Bigger CPU VPS | Hetzner CPX31/41, ~4–8 vCPU | ~€15–30/mo | Piper at higher concurrency, or Kokoro-ONNX-CPU with relaxed latency | Still no GPU acceleration |
| Dedicated small GPU server | Hetzner GEX44 (RTX 4000-class) | ~€180–230/mo (verify current) | Always-on Kokoro/XTTS, steady traffic | Pay 24/7 even when idle; only worth it at volume |
| Serverless GPU (per-second) | RunPod serverless, Modal, Replicate | RTX-4090-class ~$0.0003–0.0004/sec; A100-80GB ~$0.0008/sec (RunPod, approx) | Bursty traffic, scale-to-zero, pay only while synthesizing | Cold starts (model load) add latency; container must boot |
| Cloud TTS API instead of self-host | OpenAI / Azure / Google / ElevenLabs | per-character/minute (covered in earlier chapters) | When you don't want to operate anything | Not self-hosted; recurring per-use cost |
How to read the serverless math: Kokoro generates ~3–10× faster than real-time on a 4090-class GPU, so one minute of audio might be ~6–20 seconds of billed GPU time. At ~$0.0004/sec that's roughly $0.0024–$0.008 per minute of audio in raw compute — but you must add cold-start time (loading the model into VRAM, often 10–40s on a cold container), which on serverless you may or may not be billed for depending on platform. With scale-to-zero and bursty usage, serverless GPU often beats a 24/7 GPU box. With steady high volume, the dedicated box wins.
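The same arithmetic as a pair of small functions, so you can plug in your own measured speedup and your provider's current price. This is a rough sketch: it ignores cold starts, idle billing, and currency conversion, and the function names are illustrative.

```python
def cost_per_audio_minute(speedup: float, price_per_gpu_second: float) -> float:
    """Raw compute cost of one minute of audio.

    speedup is seconds of audio produced per second of GPU time
    (for example 3 to 10 for Kokoro on a 4090-class GPU, per the text above).
    """
    billed_seconds = 60.0 / speedup
    return billed_seconds * price_per_gpu_second


def breakeven_audio_minutes(monthly_box_cost: float, speedup: float,
                            price_per_gpu_second: float) -> float:
    """Monthly audio minutes at which serverless compute equals a flat-rate box."""
    return monthly_box_cost / cost_per_audio_minute(speedup, price_per_gpu_second)
```

With the figures above, cost_per_audio_minute(10, 0.0004) is about 0.0024 and cost_per_audio_minute(3, 0.0004) is about 0.008. For a hypothetical dedicated box at 200 per month in the same currency, breakeven_audio_minutes(200, 3, 0.0004) comes out at roughly 25,000 minutes of audio per month before the box is cheaper on raw compute.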
The honest recommendation for "free or near-free": start with Piper on a CPU VPS (or your existing server). It is genuinely free-to-operate at small scale, covers both Czech and English with real voices, and needs no GPU. Add Kokoro (CPU ONNX, or a serverless-GPU function called only for English) only if you decide Piper's English isn't natural enough for you. Reserve dedicated GPU and XTTS for when you specifically need voice cloning at volume.
This gives you one OpenAI-compatible endpoint that serves Czech via Piper and English via Piper or XTTS, with a persistent cache.
# docker-compose.yml
services:
  tts:
    image: ghcr.io/matatonic/openedai-speech-min:latest  # CPU/Piper image; use the CUDA image for XTTS
    container_name: openedai-speech
    ports:
      - "8000:8000"
    volumes:
      - ./voices:/app/voices   # Piper .onnx voices (incl. cs_CZ-jirka-medium)
      - ./config:/app/config   # voice_to_speaker.yaml mapping OpenAI names -> models
      - ./cache:/app/cache     # persistent audio cache
    environment:
      - TTS_HOME=/app/voices
    restart: unless-stopped
    # For XTTS/Kokoro on GPU, add:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - capabilities: [gpu]
Pull the Czech voice once (Piper voices live in the rhasspy/piper-voices HF repo):
mkdir -p voices && cd voices
# cs_CZ-jirka-medium: .onnx model + .onnx.json config
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/cs/cs_CZ/jirka/medium/cs_CZ-jirka-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/cs/cs_CZ/jirka/medium/cs_CZ-jirka-medium.onnx.json
Map an OpenAI voice name to it in config/voice_to_speaker.yaml, then bring it up:
docker compose up -d
Sample request (works with curl, your backend, or the OpenAI SDK):
curl -s http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-anything" \
  -d '{
    "model": "tts-1",
    "input": "Dobrý den, toto je ukázka české syntézy řeči.",
    "voice": "jirka",
    "response_format": "mp3"
  }' \
  --output cz.mp3
If you want Kokoro's more natural English and don't mind running it separately, its images are CPU-or-GPU ready:
# CPU (ONNX)
docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest
# NVIDIA GPU (PyTorch)
docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest
curl -s http://localhost:8880/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"kokoro","input":"This is a Kokoro voice test.","voice":"af_bella","response_format":"mp3"}' \
--output en.mp3
Then route by language in your app: cs → openedai-speech/Piper on :8000, en → Kokoro on :8880. Two small containers, one natural-sounding bilingual TTS service, near-zero cost on CPU.
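The routing itself can be a small lookup table in your backend. The sketch below uses only the standard library and the ports, model names, and voices from the examples above; BACKENDS, build_request, and speak are hypothetical names, and how you detect the language of a text is left to your application.

```python
import json
import urllib.request

# Language code -> backend settings, matching the two containers above.
BACKENDS = {
    "cs": {"base_url": "http://localhost:8000/v1", "model": "tts-1", "voice": "jirka"},
    "en": {"base_url": "http://localhost:8880/v1", "model": "kokoro", "voice": "af_bella"},
}


def build_request(text: str, language: str,
                  response_format: str = "mp3") -> urllib.request.Request:
    """Build the OpenAI-style POST for whichever backend serves this language."""
    backend = BACKENDS.get(language)
    if backend is None:
        raise ValueError(f"no TTS backend configured for language {language!r}")
    payload = {
        "model": backend["model"],
        "input": text,
        "voice": backend["voice"],
        "response_format": response_format,
    }
    return urllib.request.Request(
        backend["base_url"] + "/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def speak(text: str, language: str) -> bytes:
    """Send the request and return the raw audio bytes."""
    with urllib.request.urlopen(build_request(text, language), timeout=60) as resp:
        return resp.read()
```

Because both backends accept the same request shape, adding a third language or swapping an engine only changes an entry in BACKENDS.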
FROM python:3.11-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
ffmpeg && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir piper-tts fastapi uvicorn
COPY app.py /app/app.py
WORKDIR /app
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]
- Expose a health endpoint such as /health so your orchestrator/load balancer can restart dead replicas.

- Target the OpenAI /v1/audio/speech schema. It's the de-facto standard; building your client against it lets you swap engines (Piper ↔ Kokoro ↔ cloud) by changing a base URL.
- openedai-speech (Piper + XTTS), Kokoro-FastAPI (Kokoro), and speaches (Kokoro/Piper + Whisper STT) already handle concurrency, format conversion, and Docker.
- Piper has a real Czech voice (cs_CZ-jirka-medium); Kokoro has no Czech at all. For bilingual EN+CZ, the clean pattern is Kokoro (or Piper) for English + Piper for Czech, possibly in one container.

- The rhasspy/piper-voices Hugging Face repo hosts cs_CZ-jirka-medium, the Czech Piper voice.
- OpenAI's text-to-speech API defines the /v1/audio/speech request schema (model, input, voice, response_format, speed) that self-host projects imitate.