THE PROBLEMText prompts can't describe "this exact face"
Text-to-image works when you can describe what you want in words ("a Victorian-era portrait of a woman with red hair"). But when you want this specific person or this specific product generated in new scenes, text isn't enough. Earlier solutions like DreamBooth required fine-tuning the entire model per subject (hours of compute, GBs of weights). IP-Adapter lets you do it with one reference image and zero training.
IP-ADAPTERDecoupled cross-attention for image prompts
Ye et al. (Tencent 2023) introduced IP-Adapter. The trick: add a parallel cross-attention path alongside the existing text cross-attention. When you provide a reference image, it gets embedded by CLIP, passed through a small projection network (which also trains), and fed through this second path. The two attention outputs are summed. Only the new cross-attention weights and the projection network are trained (22M parameters for SD1.5, ~50MB checkpoint). Works on top of any existing diffusion model. One caveat: vanilla IP-Adapter transfers style and overall appearance — faithful face identity required the FaceID variants and their successors.
MULTI-REFERENCEMore than one reference image — subject-driven generation
Modern systems take this further. Provide N reference images of the same subject from different angles; the model averages or attends across them, producing far more consistent identity preservation. HiDream-O1 has native multi-reference support (up to 10 reference images). The technique generalizes beyond faces: products, mascots, characters, art styles, brand assets.
WHEN TO REACH FOR ITIP-Adapter vs DreamBooth vs LoRA — the choice tree
IP-Adapter: one or a few reference images, need it working in 5 minutes, no training. DreamBooth-LoRA (the standard combo): 10–50 images, best quality, an hour of training. DreamBooth's actual contribution was subject fidelity from just 3–5 images of the subject — but full-model DreamBooth fine-tunes are rarely run in 2026; the LoRA variant captures the subject at a fraction of the cost. Combine: a style LoRA plus an identity method covers most production needs.
2026Identity preservation moved on
InstantID and PuLID — built on dedicated ID embeddings from face-recognition models — superseded vanilla IP-Adapter for faces. The dominant 2026 pattern skips adapters entirely: reference-based editing in unified models (FLUX Kontext, Qwen-Image-Edit, Nano-Banana-class editors, GPT-Image). Hand the model a reference image plus an instruction; identity is preserved natively, no adapter required. One non-optional caveat: generating a real person's likeness requires their consent, and platform policies prohibit non-consensual identity generation (deepfakes).