Words become math.
Tokens, embeddings, and the loss function — the raw material every model is made of.
Tokenization, in detail.
BPE, byte-level BPE, sentencepiece. The cost trade-offs that show up at scale.
What an embedding actually is.
Token bytes, positional encoding, the semantic word map, and how scale changes what the vectors mean.
Pre-transformer embeddings.
The methods worth knowing - the intuition still applies, and they show why context matters.
Softmax meets cross-entropy.
The loss function for language modeling, slowly. Where the temperature knob actually lives.
Why next-token works.
Why predicting the next token produces emergent reasoning, in-context learning, arithmetic.
Inside the transformer.
Attention from first principles, Q/K/V, and the block that stacks into everything.
Encoder, decoder, encoder-decoder.
Why GPT looks one way, BERT another, T5 a third.
Attention from first principles.
What Q, K, V actually are. Why scaled dot-product, why softmax, why heads.
Q, K, V - the projections.
The matrices that turn embeddings into queries, keys, values.
Additive attention, briefly.
The Bahdanau-style variant that scaled dot-product replaced.
One transformer block, end to end.
Attention, residuals, normalization, FFN.
Encoder-decoder, revisited.
A return to the encoder-decoder pattern.
How models learn.
Pretraining at scale, RLHF and the alignment toolbox, LoRA fine-tuning.
The pretraining pipeline.
Terabytes of raw text become tokens become a loss curve.
Mixed precision and gradient tricks.
The optimizations that turn theoretical training into practical training.
RLHF and the reasoning turn.
The post-training pass that turns a parrot into an assistant.
TRL and the GRPO algorithm.
What changes when reward signal comes from groups of completions.
DPO, KTO, ORPO - the post-PPO landscape.
Why preference learning ate the alignment world.
Models that think first.
Test-time compute, verifiable rewards, and how o1/R1-class reasoning models are trained.
Parameter-efficient fine-tuning.
Why LoRA is cheap, what QLoRA adds.
Models in production.
The decoding loop, KV caches, local engines, RAG, and the chat interfaces on top.
The decoding loop, up close.
KV caches, batching, speculative decoding, paged attention.
The full pipeline as one highway.
Where every previous module fits.
Offline inference engines.
Ollama, llama.cpp, vLLM, SGLang.
Hugging Face and Git LFS.
Cloning a 40 GB model. Model cards, safetensors.
Retrieval-augmented generation.
Semantic search, hybrid retrieval, GraphRAG.
Consumer chat interfaces.
Chat UI patterns. Streaming. Tool-call surfacing.
The agent loop.
Tool use, MCP, and the loop that turned chatbots into coworkers.
The image track.
ViT, CLIP, diffusion, ControlNet, FLUX — pixels as tokens.
Vision Transformers are tokens too.
ViT - 196 patches per 224×224 photo (16×16-pixel patches), 2D positional embeddings, CNN-vs-transformer comparison.
Text and vision, connected.
How text and vision streams talk - CLIP, BLIP, any-to-any native models.
VAEs and the diffusion process.
VAE compression, Stable Diffusion denoising, CLIP text conditioning.
Latent space and control.
Walking the latent space. ControlNet conditioning.
Diffusion math, slowly.
Forward and reverse processes, score matching, why noise schedules matter.
FLUX, ERNIE, HiDream.
Modern image models - U-Net to DiT, diffusion to flow matching, CLIP to LLM text encoders.
IP-Adapter and personalization.
Subject-driven generation, identity preservation across new scenes.
ComfyUI workflow as code.
Treating .json workflows as version-controlled artifacts.
Style-LoRA training.
Locking a visual identity across hundreds of generations.
The video track.
Temporal diffusion, LTX, persona persistence, audio sync.
AI Film Studios and video pioneers.
Sora, Runway Gen-3, Luma, Kling, Higgsfield.
Video diffusion intuition.
Temporal coherence, frame consistency, why static-image tricks don't transfer.
LTX architecture and shot composition.
What's inside a modern video diffusion model - shot composition, motion control.
Persona persistence across frames.
Keeping a character's face, costume, and lighting consistent across hundreds of frames.
Audio sync in video.
Lip-sync, beat-cut, the patterns underneath the AI Music Idols pipeline.
The voice track.
Whisper STT, codec-LM TTS, voice cloning, live attendants.
Speech synthesis, cloning a voice.
ElevenLabs, OpenAI Whisper.
STT pipelines (Whisper).
Whisper architecture, alignment timestamps, language detection.
TTS pipelines, tier by tier.
edge-tts, ElevenLabs, voice cloning - what each tier costs in latency and quality.
Voice cloning, ethically.
Single-shot vs few-shot cloning, where artifacts come from, ethical guardrails.
Turn-taking and attendants.
Sub-300ms first-syllable latency, the patterns underneath the AI Attendant project.