TTS Pipelines — AI Learning Course

§ 01

The lesson

edge-tts · ElevenLabs · Coqui XTTS · what each tier costs

THE TIERS

Three quality bands, three cost/latency profiles

Free / fast: edge-tts (Microsoft's Azure TTS wrapped as a free CLI), pyttsx3, espeak. Sub-second per utterance, robotic but intelligible. Good for batch narration where you have script control. Mid: ElevenLabs, Play.ht, Coqui XTTS local (Coqui the company shut down in Jan 2024; XTTS lives on as community forks). ~2-5s per utterance, natural-sounding, supports voice cloning. Frontier: ElevenLabs v2, OpenAI TTS-1-hd. Sub-second time-to-first-byte streaming, indistinguishable from human at <30 seconds.

ARCHITECTURE: NEURAL CODEC LM

VALL-E and the codec language model paradigm

VALL-E framed synthesis as language modeling over audio codec tokens. An audio codec (EnCodec, SoundStream) compresses speech into discrete tokens; a transformer learns to predict those tokens given text + a short speaker reference, streaming token-by-token and decoding back to audio in real time. CosyVoice-class systems are the current generation of that line, while a parallel non-autoregressive family (NaturalSpeech 2's latent diffusion, Voicebox's flow matching) reaches similar quality without token-by-token decoding. This is why voice cloning got cheap — 3 seconds of reference audio gives the model enough to condition on.

VOICE CLONING

Few-shot speaker conditioning

Modern TTS clones a voice from 30 seconds of reference audio. In the XTTS era this meant a speaker embedding computed once via an x-vector or ECAPA-TDNN encoder and reused per utterance; codec LMs instead condition directly on the raw audio prompt's tokens. Ethical guardrails matter: consent verification, watermarking, refusal for political figures. The studio approach: only clone consented internal voices, watermark all output, treat unauthorized voice cloning as misuse.

STREAMING TTFS

Time to first sound is what makes voice assistants feel responsive

For an attendant, total latency matters less than time to first syllable (TTFS). Sub-300ms TTFS is now the conservative floor, not the target — ElevenLabs Flash and Cartesia Sonic quote sub-100ms model latency, with audio chunks streaming as soon as enough tokens are generated. Combined with sub-200ms STT and sub-100ms LLM TTFS, you get a voice interface that feels conversational rather than turn-based.

Then

Tiered cloud TTS APIs (2023-24).

Now · June 2026

Open-weight renaissance (Kokoro, F5-TTS, Chatterbox), sub-100ms streaming, and speech-to-speech as the alternative to TTS-at-the-end.

§ 02

TTS pipelines,
tier by tier.

The lesson

Three quality bands, three cost/latency profiles

VALL-E and the codec language model paradigm

Few-shot speaker conditioning

Time to first sound is what makes voice assistants feel responsive

Further reading.