Speech Synthesis — AI Learning Course

§ 01

Core ideas

CODEC LM

Modern TTS is language modeling over audio tokens

A neural codec (EnCodec, SoundStream) compresses audio into discrete tokens at roughly 75 frames per second, where each frame carries several codebook tokens rather than one. A transformer then predicts those tokens from text plus a speaker reference: the same next-token loop from module 11, pointed at sound. Running the predicted tokens through the codec's decoder reconstructs the waveform.
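To make the token rate concrete, here is a small sketch of the arithmetic. The defaults (75 frames per second, 8 codebooks of 1024 entries each) match the published EnCodec 24 kHz configuration at 6 kbps. The function name `codec_token_budget` is invented for this illustration, and other codecs will use different numbers.

```python
import math


def codec_token_budget(duration_s, frame_rate_hz=75.0, n_codebooks=8, codebook_size=1024):
    """Estimate how many codec tokens a clip of `duration_s` seconds produces.

    Defaults are assumptions taken from EnCodec's 24 kHz / 6 kbps setting.
    """
    frames = math.ceil(duration_s * frame_rate_hz)   # one frame every 1/75 s
    tokens = frames * n_codebooks                    # every frame holds one token per codebook
    bits_per_token = math.log2(codebook_size)        # 1024 entries -> 10 bits
    bitrate_bps = frame_rate_hz * n_codebooks * bits_per_token
    return {"frames": frames, "tokens": tokens, "bitrate_bps": bitrate_bps}


budget = codec_token_budget(10)
print(budget)  # {'frames': 750, 'tokens': 6000, 'bitrate_bps': 6000.0}
```

Ten seconds of speech is 750 frames but 6,000 tokens, which is why the sequence length a codec LM has to model depends on the number of codebooks as much as on the frame rate.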

CLONING

3 seconds of reference is enough to condition a voice

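A speaker encoder distills a short clip into an embedding, and generation conditions on it (some codec LMs skip the embedding and feed the clip's own tokens in as a prompt). That is why cloning got cheap: the model is not trained per voice, it is prompted per voice. Quality generally improves with reference length: about 3 s is enough to capture timbre, while around 30 s also captures cadence and speaking habits.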
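The idea of "one embedding per voice" can be sketched with plain lists. This is a toy stand-in, not a real speaker encoder: production encoders are trained neural networks, whereas the hypothetical `speaker_embedding` below just mean-pools per-frame feature vectors and normalizes the result. It does show the two properties that matter here: a clip of any length collapses to a fixed-size vector, and voices are compared by cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def speaker_embedding(frame_features):
    """Toy encoder (assumption): average the frame vectors, then normalize.

    More reference frames means the average is taken over more evidence,
    which loosely mirrors why longer references give steadier results.
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    dim = len(frame_features[0])
    count = len(frame_features)
    mean = [sum(frame[i] for frame in frame_features) / count for i in range(dim)]
    return l2_normalize(mean)


def cosine_similarity(a, b):
    """Dot product of two unit vectors."""
    return sum(x * y for x, y in zip(a, b))


# Made-up 2-dimensional "features" for two voices.
voice_a = speaker_embedding([[1.0, 0.1], [0.9, 0.0], [1.1, 0.2]])
voice_a_again = speaker_embedding([[1.0, 0.0], [1.0, 0.2]])
voice_b = speaker_embedding([[0.0, 1.0], [0.1, 0.9]])

print(cosine_similarity(voice_a, voice_a_again) > cosine_similarity(voice_a, voice_b))
```

At generation time the same model weights are reused for every speaker; only the conditioning vector changes.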

THE TIERS

Latency and naturalness define the market

Free and fast: edge-tts (used for this course's 125 narrations), near-instant but slightly robotic. Mid: ElevenLabs, XTTS, natural-sounding with roughly 1–5 s of latency. Frontier: low-latency streaming models (ElevenLabs Flash, Cartesia Sonic) with sub-100 ms time-to-first-sound, built for live agents. Pick by use case: batch narration tolerates latency; a phone attendant cannot.
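Time-to-first-sound is easy to measure for any engine that streams audio chunks. The sketch below assumes the engine is exposed as a Python iterable of `bytes`; `fake_tts_stream` is a hypothetical stand-in for a real client, and the 0.1 s budget simply restates the sub-100 ms figure above.

```python
import time
from typing import Iterable, Optional


def time_to_first_audio(chunks: Iterable[bytes]) -> Optional[float]:
    """Seconds from starting to consume the stream until the first non-empty chunk."""
    start = time.perf_counter()
    for chunk in chunks:
        if chunk:  # ignore empty keep-alive chunks
            return time.perf_counter() - start
    return None  # the stream ended without producing audio


def fits_live_budget(ttfa_s: Optional[float], budget_s: float = 0.1) -> bool:
    """True when the measured latency is within the live-agent budget."""
    return ttfa_s is not None and ttfa_s <= budget_s


def fake_tts_stream(startup_delay_s: float, n_chunks: int = 3):
    """Hypothetical engine: waits, then yields short blocks of silent 16-bit PCM."""
    time.sleep(startup_delay_s)
    for _ in range(n_chunks):
        yield b"\x00\x00" * 160


batch_engine = time_to_first_audio(fake_tts_stream(0.2))
live_engine = time_to_first_audio(fake_tts_stream(0.0))
print(fits_live_budget(batch_engine), fits_live_budget(live_engine))
```

Because generators run lazily, the timer starts before the first chunk is requested, so connection and model start-up time are included in the measurement.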

ETHICS

Consent, watermarking, refusal

Voice is identity. Responsible pipelines verify consent for cloned voices, watermark generated audio, and refuse public-figure cloning. The technical barrier to misuse is now near zero, so the guardrails have to live in process and policy.
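The three guardrails can be written down as an explicit gate in front of the cloning step. This is a minimal sketch of the policy only: `CloneRequest`, `Decision`, and `review_clone_request` are hypothetical names, and how consent is actually verified or how a watermark is embedded is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloneRequest:
    speaker_name: str
    consent_verified: bool   # assumption: set by a separate consent-verification step
    is_public_figure: bool   # assumption: set by a separate identity check


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str
    watermark_required: bool


def review_clone_request(req: CloneRequest) -> Decision:
    """Apply the policy in order: refuse public figures, require consent, then watermark."""
    if req.is_public_figure:
        return Decision(False, "public-figure cloning is refused", False)
    if not req.consent_verified:
        return Decision(False, "no verified consent on file", False)
    return Decision(True, "consent verified", True)


print(review_clone_request(CloneRequest("narrator", True, False)))
```

Every allowed request comes back with `watermark_required=True`, so downstream code cannot generate unmarked audio by accident.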

§ 02

The lesson

The platforms turning text into natural speech — and speech back into text.

SPEECH STACK

TTS leaders and the STT inverse

ElevenLabs leads commercial TTS, but the field is crowded: OpenAI, Cartesia, Google, and MiniMax compete on the API side, and a strong open-weight field (Kokoro, F5-TTS, CosyVoice, Chatterbox) closes the gap from below. ElevenLabs' architecture is unpublished, so treat any description of its internals as speculation. OpenAI Whisper solves the inverse problem (speech→text) and is included here for contrast: multilingual STT across 99 languages, famous for robustness to accents and noise, and equally famous for hallucinating text on silence or music. Speech-to-speech models (GPT-realtime-class, Gemini Live) are the post-cascade paradigm: instead of chaining STT, a language model, and TTS, one model listens and speaks with no text pipeline in the middle.
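One common mitigation for hallucinated text on silence is to avoid sending silent audio to the recognizer at all. The sketch below is a simple energy gate under stated assumptions: samples are floats in the range -1.0 to 1.0, the `min_rms` threshold of 0.01 is an arbitrary starting point that needs tuning per microphone, and `should_transcribe` is a hypothetical helper, not part of Whisper. It does nothing for music, which has plenty of energy and needs a proper speech detector.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a segment; 0.0 for an empty segment."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def should_transcribe(samples: Sequence[float], min_rms: float = 0.01) -> bool:
    """Only forward segments loud enough to plausibly contain speech."""
    return rms(samples) >= min_rms


sample_rate = 16000
silence = [0.0] * sample_rate
# One second of a 440 Hz tone at half amplitude, standing in for voiced audio.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

print(should_transcribe(silence), should_transcribe(tone))  # False True
```

In a cascade, this check sits in front of the STT call, so silent stretches never reach the model.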

§ 03

Speech synthesis,
cloning a voice.