Voice Track Module f-tts 8 min

TTS pipelines,
tier by tier.

edge-tts, ElevenLabs, voice cloning - what each tier costs in latency and quality.

Prerequisites·21 · Tokens Modalities
TTS Pipelines hero illustration
§ 01

The lesson

edge-tts · ElevenLabs · Coqui XTTS · what each tier costs

THE TIERS

Three quality bands, three cost/latency profiles

Free / fast: edge-tts (Microsoft's Azure TTS wrapped as a free CLI), pyttsx3, espeak. Sub-second per utterance, robotic but intelligible. Good for batch narration where you have script control. Mid: ElevenLabs, Play.ht, Coqui XTTS local (Coqui the company shut down in Jan 2024; XTTS lives on as community forks). ~2-5s per utterance, natural-sounding, supports voice cloning. Frontier: ElevenLabs v2, OpenAI TTS-1-hd. Sub-second time-to-first-byte streaming, indistinguishable from human at <30 seconds.
ARCHITECTURE: NEURAL CODEC LM

VALL-E and the codec language model paradigm

VALL-E framed synthesis as language modeling over audio codec tokens. An audio codec (EnCodec, SoundStream) compresses speech into discrete tokens; a transformer learns to predict those tokens given text + a short speaker reference, streaming token-by-token and decoding back to audio in real time. CosyVoice-class systems are the current generation of that line, while a parallel non-autoregressive family (NaturalSpeech 2's latent diffusion, Voicebox's flow matching) reaches similar quality without token-by-token decoding. This is why voice cloning got cheap — 3 seconds of reference audio gives the model enough to condition on.
VOICE CLONING

Few-shot speaker conditioning

Modern TTS clones a voice from 30 seconds of reference audio. In the XTTS era this meant a speaker embedding computed once via an x-vector or ECAPA-TDNN encoder and reused per utterance; codec LMs instead condition directly on the raw audio prompt's tokens. Ethical guardrails matter: consent verification, watermarking, refusal for political figures. The studio approach: only clone consented internal voices, watermark all output, treat unauthorized voice cloning as misuse.
STREAMING TTFS

Time to first sound is what makes voice assistants feel responsive

For an attendant, total latency matters less than time to first syllable (TTFS). Sub-300ms TTFS is now the conservative floor, not the target — ElevenLabs Flash and Cartesia Sonic quote sub-100ms model latency, with audio chunks streaming as soon as enough tokens are generated. Combined with sub-200ms STT and sub-100ms LLM TTFS, you get a voice interface that feels conversational rather than turn-based.
Then
Tiered cloud TTS APIs (2023-24).
Now · June 2026
Open-weight renaissance (Kokoro, F5-TTS, Chatterbox), sub-100ms streaming, and speech-to-speech as the alternative to TTS-at-the-end.
TTS Pipelines spotlight illustration
§ 02

Further reading.

Done with TTS Pipelines?
Mark it complete — progress is saved in your browser and shows on the course map.