THE TIERSThree quality bands, three cost/latency profiles
Free / fast: edge-tts (Microsoft's Azure TTS wrapped as a free CLI), pyttsx3, espeak. Sub-second per utterance, robotic but intelligible. Good for batch narration where you have script control. Mid: ElevenLabs, Play.ht, Coqui XTTS local (Coqui the company shut down in Jan 2024; XTTS lives on as community forks). ~2-5s per utterance, natural-sounding, supports voice cloning. Frontier: ElevenLabs v2, OpenAI TTS-1-hd. Sub-second time-to-first-byte streaming, indistinguishable from human at <30 seconds.
ARCHITECTURE: NEURAL CODEC LMVALL-E and the codec language model paradigm
VALL-E framed synthesis as language modeling over audio codec tokens. An audio codec (EnCodec, SoundStream) compresses speech into discrete tokens; a transformer learns to predict those tokens given text + a short speaker reference, streaming token-by-token and decoding back to audio in real time. CosyVoice-class systems are the current generation of that line, while a parallel non-autoregressive family (NaturalSpeech 2's latent diffusion, Voicebox's flow matching) reaches similar quality without token-by-token decoding. This is why voice cloning got cheap — 3 seconds of reference audio gives the model enough to condition on.
VOICE CLONINGFew-shot speaker conditioning
Modern TTS clones a voice from 30 seconds of reference audio. In the XTTS era this meant a speaker embedding computed once via an x-vector or ECAPA-TDNN encoder and reused per utterance; codec LMs instead condition directly on the raw audio prompt's tokens. Ethical guardrails matter: consent verification, watermarking, refusal for political figures. The studio approach: only clone consented internal voices, watermark all output, treat unauthorized voice cloning as misuse.
STREAMING TTFSTime to first sound is what makes voice assistants feel responsive
For an attendant, total latency matters less than time to first syllable (TTFS). Sub-300ms TTFS is now the conservative floor, not the target — ElevenLabs Flash and Cartesia Sonic quote sub-100ms model latency, with audio chunks streaming as soon as enough tokens are generated. Combined with sub-200ms STT and sub-100ms LLM TTFS, you get a voice interface that feels conversational rather than turn-based.