The problem nobody sells a fix for
My kid loves audiobooks. The commercial platforms barely carry Hungarian children’s books, and none of them carry the one narrator my kid actually prefers: me. I can’t read aloud every evening — but my homelab doesn’t have that excuse.
The platform half (ebook → M4B → Audiobookshelf on k3s) is a story for another post. This one is about the voice: how to go from a phone recording to an audiobook narrated in your own voice, step by step, on hardware with no GPU.
The short version: XTTS-v2 does zero-shot voice cloning from a ~20-second sample. No training, no fine-tuning, no dataset. One clean recording and a flag.
Why XTTS-v2, in 2026?
It’s not the best open TTS model anymore. Chatterbox beats ElevenLabs in blind tests; F5-TTS sounds cleaner. But model selection for a small language is constraint-first, not leaderboard-first: Chatterbox has no Hungarian, NVIDIA’s TTS NIMs have no Hungarian, Kokoro — no Hungarian. XTTS-v2 speaks Hungarian and clones voices and runs on CPU. That intersection has exactly one resident.
I run it via ebook2audiobook, which wraps XTTS with Calibre ingestion and M4B chaptering.
Step 1 — Record ~25 seconds of yourself
Phone voice-memo app, quiet room, ~20 cm from your mouth. Mine came out as 28 seconds of stereo 48 kHz AAC. Two rules that matter more than gear:
- Read the way you want the books narrated. The clone copies prosody — energy, pacing, warmth — not just timbre. A flat recital clones into a flat narrator. I read a children’s tale the way I’d read it at bedtime.
- Don’t peak the mic. My sample hit −0.1 dB max volume — right at the clipping ceiling. It worked, but quieter is safer. Check yours:
ffmpeg -i janos.m4a -af volumedetect -f null - 2>&1 | grep -E 'mean_volume|max_volume'
# mean_volume: -21.4 dB ← fine
# max_volume: -0.1 dB ← living dangerously
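The same reading can be turned into a pass/fail gate before you bother converting the file. A minimal sketch, assuming the volumedetect log format shown above; the function name peak_ok and the -1 dB default threshold are illustrative choices, not something ffmpeg or XTTS prescribes.

```shell
#!/bin/sh
# Sketch: turn the volumedetect reading into a pass/fail headroom check.
# peak_ok and the -1.0 dB default are illustrative, not part of ffmpeg or XTTS.

# Reads ffmpeg's volumedetect log on stdin. Succeeds (exit 0) if max_volume is
# at or below the threshold in dB, fails with 1 if it is louder, and with 2 if
# the input contains no max_volume line at all.
peak_ok() {
    threshold="${1:--1.0}"
    awk -v limit="$threshold" '
        /max_volume:/ {
            # The value is the field right after the "max_volume:" label.
            for (i = 1; i < NF; i++) if ($i == "max_volume:") peak = $(i + 1)
            found = 1
        }
        END {
            if (!found) exit 2
            exit ((peak + 0 <= limit + 0) ? 0 : 1)
        }'
}

# Usage (needs ffmpeg on PATH):
#   ffmpeg -i janos.m4a -af volumedetect -f null - 2>&1 | peak_ok -1.0 \
#     || echo "re-record a little quieter"
```

With the sample above (max_volume of -0.1 dB) the check fails at the default threshold, which matches the "living dangerously" verdict.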
Step 2 — Normalize to what XTTS wants
XTTS expects a mono WAV; 24 kHz matches its internal rate. Trim the silence off both ends while you’re at it:
ffmpeg -i janos.m4a \
-af "silenceremove=start_periods=1:start_threshold=-45dB:start_silence=0.2,\
areverse,silenceremove=start_periods=1:start_threshold=-45dB:start_silence=0.2,\
areverse" \
-ar 24000 -ac 1 janos.wav
(The double-areverse is the classic trick: silenceremove only trims the front, so you flip the audio, trim the front again, flip it back.)
Drop the result where your TTS stack looks for voices. In ebook2audiobook that’s the voices/ tree, organised by language:
voices/hun/adult/male/janos.wav
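Before synthesizing, it may be worth confirming that the converted file really came out mono at 24 kHz. A minimal sketch: is_mono_24k is a hypothetical helper that parses the key=value lines ffprobe prints with its default writer, so the check itself stays independent of ffprobe.

```shell
#!/bin/sh
# Sketch: confirm the converted sample is mono 24 kHz before handing it to XTTS.
# is_mono_24k is a hypothetical helper; it parses ffprobe's key=value output.

is_mono_24k() {
    # stdin: lines such as "sample_rate=24000" and "channels=1", in any order
    awk -F= '
        $1 == "sample_rate" { rate = $2 }
        $1 == "channels"    { ch = $2 }
        END { exit ((rate == 24000 && ch == 1) ? 0 : 1) }'
}

# Usage (needs ffprobe on PATH):
#   ffprobe -v error -select_streams a:0 \
#     -show_entries stream=sample_rate,channels \
#     -of default=noprint_wrappers=1 janos.wav | is_mono_24k \
#     && echo "ok: mono, 24 kHz"
```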
Step 3 — Synthesize
One flag does the cloning. Headless run on the k3s pod:
kubectl exec -n web-audiobooks deploy/ebook2audiobook -- sh -c \
'cd /app && python app.py --headless \
--ebook "/app/ebooks/tale.txt" \
--language hun \
--tts_engine xtts \
--device cpu \
--voice /app/voices/hun/adult/male/janos.wav \
--output_format m4b \
--output_dir /app/audiobooks'
On my 12-core CPU node this runs roughly 3 to 4× slower than real time: a 2-minute tale takes ~8 minutes, and a full children’s book is an overnight job. The first run computes speaker latents from your WAV; after that it’s ordinary synthesis with your voice as the reference.
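For planning a batch, the timing above can be extrapolated linearly. A minimal sketch, assuming the slowdown stays constant over a long run, which the measurement does not confirm. The factor of 4 is derived from the quoted data point (about 8 minutes of compute for a 2-minute tale); estimate_minutes and the 90-minute book length are made up for illustration.

```shell
#!/bin/sh
# Sketch: estimate wall-clock synthesis time from audio length.
# estimate_minutes is a hypothetical helper; the factor comes from the single
# measurement in the text (about 8 minutes of compute for a 2-minute tale).

# $1 = minutes of finished audio, $2 = slowdown factor (compute time / audio time)
estimate_minutes() {
    # Adding 0.5 before the integer conversion rounds to the nearest minute.
    awk -v audio="$1" -v factor="$2" 'BEGIN { printf "%d\n", audio * factor + 0.5 }'
}

estimate_minutes 2 4     # the measured tale: prints 8
estimate_minutes 90 4    # a hypothetical 90-minute book: prints 360 (6 hours)
```

Six hours for a 90-minute book is in line with calling a full book an overnight job.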
Step 4 — A/B before you batch
Render one short book twice — stock narrator and cloned voice — and put both in front of the household jury. Cloning quality is personal in the most literal sense: MOS scores won’t tell you whether it sounds like you. My benchmark has strong opinions and goes to bed at eight.
Only after the clone passes do you re-render the library with --voice.

The manual steps that earn the word “manual”
Things the tutorials skip, learned the slow way:
- Long conversions die with the browser tab. Gradio-style web UIs tie the job to the open page; close the laptop and you get “Conversion cancelled” half a book in. Anything longer than ~15 minutes of audio runs headless under nohup.
- CPU synthesis leaks memory over hours. My pod has a hard 6 Gi limit on a 16 Gi node, and a 6-hour run will hit it. Keep the cap (it protects the other 30 namespaces), and rely on the tool’s --session <id> resume, which picks up at the exact sentence. One catch: headless resume still asks an interactive “Resume? [y]es” prompt, so pipe echo y | into it.
- The per-chapter FLACs survive a crash. If the final M4B muxing step OOMs, don’t re-synthesize: the chapters are sitting in the session’s tmp directory, and ffmpeg will assemble them into a chaptered M4B with a hand-written FFMETADATA file in about two minutes, at near-zero memory.
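The FFMETADATA file for that last rescue does not have to be typed by hand. A minimal sketch that generates one from per-chapter durations: make_ffmetadata, the input format, and the chapter file names are assumptions for illustration, and the real session directory layout is not shown here. Durations could come from ffprobe. Titles containing =, ;, # or a backslash would need escaping, which this sketch skips.

```shell
#!/bin/sh
# Sketch: build an FFMETADATA chapter file from per-chapter durations.
# make_ffmetadata is a hypothetical helper. Input on stdin: one chapter per
# line as "<duration in ms><TAB><title>", in playback order.

make_ffmetadata() {
    awk -F '\t' '
        BEGIN { print ";FFMETADATA1"; start = 0 }
        NF >= 2 {
            end = start + $1
            print "[CHAPTER]"
            print "TIMEBASE=1/1000"      # START and END are in milliseconds
            print "START=" start
            print "END=" end
            print "title=" $2
            start = end                  # next chapter begins where this one ends
        }'
}

# Usage, assuming chapter files named ch01.flac, ch02.flac, ... (illustrative names):
#   printf '61000\tChapter 1\n58500\tChapter 2\n' | make_ffmetadata > chapters.txt
#   printf "file '%s'\n" ch*.flac > list.txt
#   ffmpeg -f concat -safe 0 -i list.txt -i chapters.txt \
#     -map 0:a -map_metadata 1 -map_chapters 1 -c:a aac book.m4b
```

The concat demuxer reads the chapter files one after another, which is why the mux step needs so little memory.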
None of this is hard. It’s just undocumented — which is the gap between “there’s a model for that” and your kid pressing play.
Postscript: the jury came back
The clone failed. Recognizably my timbre, nowhere near natural — I wouldn’t play it to my kid, which is the only metric that exists for this project.
Worth being precise about what failed: the stock XTTS-v2 narrator passed the ear test and the library keeps growing with it. Zero-shot cloning is the part that fell short: a 2023 model conditioning on 26 seconds of a voice it has never seen, in a language that was never its strong suit. The pipeline above is still the right pipeline; the models that run on CPU just aren’t there yet.
The next experiment is already picked: F5-TTS Hungarian, a 2026 fine-tune on 280 hours of actual Hungarian speech, built precisely for short-sample cloning. It needs CUDA, which my node doesn’t have — but a rented spot GPU tests it for the price of an espresso. If it passes the bedtime jury, that’ll be its own post.
Negative results are results. The jury reconvenes when the GPU shows up.
