IP-Adapter — AI Learning Course

§ 01

The lesson

Identity preservation across new scenes · image-prompted generation

THE PROBLEM

Text prompts can't describe "this exact face"

Text-to-image works when you can describe what you want in words ("a Victorian-era portrait of a woman with red hair"). But when you want this specific person or this specific product generated in new scenes, text isn't enough. Earlier solutions like DreamBooth required fine-tuning the entire model per subject (hours of compute, GBs of weights). IP-Adapter lets you do it with one reference image and zero training.

IP-ADAPTER

Decoupled cross-attention for image prompts

Ye et al. (Tencent 2023) introduced IP-Adapter. The trick: add a parallel cross-attention path alongside the existing text cross-attention. When you provide a reference image, it gets embedded by CLIP, passed through a small projection network (which also trains), and fed through this second path. The two attention outputs are summed. Only the new cross-attention weights and the projection network are trained (22M parameters for SD1.5, ~50MB checkpoint). Works on top of any existing diffusion model. One caveat: vanilla IP-Adapter transfers style and overall appearance — faithful face identity required the FaceID variants and their successors.

MULTI-REFERENCE

More than one reference image — subject-driven generation

Modern systems take this further. Provide N reference images of the same subject from different angles; the model averages or attends across them, producing far more consistent identity preservation. HiDream-O1 has native multi-reference support (up to 10 reference images). The technique generalizes beyond faces: products, mascots, characters, art styles, brand assets.

WHEN TO REACH FOR IT

IP-Adapter vs DreamBooth vs LoRA — the choice tree

IP-Adapter: one or a few reference images, need it working in 5 minutes, no training. DreamBooth-LoRA (the standard combo): 10–50 images, best quality, an hour of training. DreamBooth's actual contribution was subject fidelity from just 3–5 images of the subject — but full-model DreamBooth fine-tunes are rarely run in 2026; the LoRA variant captures the subject at a fraction of the cost. Combine: a style LoRA plus an identity method covers most production needs.

2026

Identity preservation moved on

InstantID and PuLID — built on dedicated ID embeddings from face-recognition models — superseded vanilla IP-Adapter for faces. The dominant 2026 pattern skips adapters entirely: reference-based editing in unified models (FLUX Kontext, Qwen-Image-Edit, Nano-Banana-class editors, GPT-Image). Hand the model a reference image plus an instruction; identity is preserved natively, no adapter required. One non-optional caveat: generating a real person's likeness requires their consent, and platform policies prohibit non-consensual identity generation (deepfakes).

Then

IP-Adapter (2023): bolt a second cross-attention path onto Stable Diffusion for zero-shot subject transfer.

Now · June 2026

ID-embedding methods (InstantID, PuLID) for faces; reference-based editing in unified models (FLUX Kontext, Qwen-Image-Edit, GPT-Image) for everything else.

§ 02

IP-Adapter and
personalization.