AI Dictionary
Advanced · ~2 min read · #diffusion #image-generation #generative

Diffusion Model

Generation by gradual denoising

A model family that generates images (as well as audio and video) by starting from pure noise and removing it step by step. The backbone of modern image generation.

[Diagram: From noise to image, step by step. t = T → t = 3 → t = 2 → t = 1 → t = 0, with less noise at every step. Stable Diffusion, DALL·E, Midjourney all use this approach.]
Definition

A diffusion model learns a two-way process: gradually corrupting an image with noise (forward), then reversing those steps to clean noise back into an image (reverse). At generation time we only use the reverse path.
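A minimal sketch of the forward (noising) half, in plain NumPy with a toy linear beta schedule; the schedule values are illustrative, not any specific model's:

```python
import numpy as np

# Toy linear beta schedule: how much noise is mixed in at each step.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)        # cumulative "signal kept" factor
rng = np.random.default_rng(0)

def noise_image(x0, t):
    """Jump straight to step t: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = np.ones((64, 64, 3))                   # stand-in "image"
x_mid = noise_image(x0, T // 2)             # partially noised
x_end = noise_image(x0, T - 1)              # essentially pure Gaussian noise
```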

Typical flow: random Gaussian noise → the model denoises in 30-50 steps → each step makes the image slightly more "meaningful" → final step yields a clean image.
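And a minimal sketch of that reverse loop, with a placeholder standing in for the trained noise-prediction network; the update rule is the standard DDPM one, but the step count and schedule here are toy values:

```python
import numpy as np

T = 50                                        # toy number of sampling steps
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)
rng = np.random.default_rng(0)

def eps_model(x_t, t):
    # Placeholder: a trained network would predict the noise in x_t here.
    return np.zeros_like(x_t)

def sample(shape):
    x = rng.standard_normal(shape)            # t = T: start from pure noise
    for t in reversed(range(T)):
        eps = eps_model(x, t)
        # DDPM update: subtract the predicted noise, rescale the signal.
        x = (x - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                             # re-inject a little fresh noise
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x                                  # t = 0: the "clean" sample

image = sample((64, 64, 3))
```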

Latent diffusion (used by Stable Diffusion) runs the process not in pixel space but in a smaller latent space, which makes it roughly 10-50x faster and far easier on hardware. Stable Diffusion, DALL·E 3, Midjourney, Flux, Imagen: all build on diffusion architectures.
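A quick back-of-envelope check of why latent diffusion is cheaper, using Stable Diffusion v1's shapes (its VAE compresses a 3x512x512 image to a 4x64x64 latent):

```python
# Stable Diffusion v1: the VAE maps a 3x512x512 RGB image to a 4x64x64 latent.
pixel_values = 3 * 512 * 512          # 786,432 values per step in pixel space
latent_values = 4 * 64 * 64           # 16,384 values per step in latent space
print(pixel_values / latent_values)   # 48.0 -> the denoiser touches ~48x fewer values
```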

Beyond images: video (Sora, Runway), audio (AudioLDM), 3D models, protein structures — diffusion is everywhere now.

Analogy

Like uncovering a sculpture from marble. There's a rough block (noise) at first. The sculptor chips away to reveal the shape step by step. The diffusion model has learned which pixels are noise and which belong to the final image, removing them gradually.

Real-world example

You ask Midjourney for "a cat on a rainy Istanbul street, night, photorealistic." What happens:
  1. A CLIP-like text encoder turns the prompt into an embedding.
  2. A 1024x1024 grid of pure Gaussian noise is generated.
  3. The diffusion model denoises it over 30 steps, each step conditioned on the prompt embedding.
  4. Step 1: still blurry, but a form starts to emerge.
  5. Step 15: the main composition is visible (cat + street).
  6. Step 30: details are in place (raindrops, light reflections).
  7. The final image is ready in ~10 seconds.
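Midjourney's stack is not public, but the same flow can be sketched with the open-source Hugging Face diffusers library and a Stable Diffusion checkpoint (the model id below is just an example):

```python
import torch
from diffusers import StableDiffusionPipeline

# Example checkpoint id; any Stable Diffusion checkpoint works the same way.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a cat on a rainy Istanbul street, night, photorealistic",
    num_inference_steps=30,   # the 30 denoising steps described above
    guidance_scale=7.5,       # how strongly each step follows the prompt embedding
).images[0]
image.save("cat.png")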

This logic didn't exist before 2020 — DDPM (2020) and Stable Diffusion (2022) changed the game.

When to use
  • Image generation — text-to-image, image-to-image, inpainting
  • Video generation (Sora, Runway, Pika) — also diffusion-based
  • Audio/music generation — AudioLDM, Stable Audio
  • Scientific modeling — protein folding, molecular design
When not to use
  • Text generation — diffusion is weak for text; Transformers (LLMs) remain standard
  • Low-resource environments — diffusion does 30+ forward passes = expensive
  • Real-time needs — one image takes ~5-30s, video much more
Common pitfalls

Wrong sampling step count

Too few steps (5-10) give poor quality; too many (>100) give diminishing returns and waste time. The sweet spot is 20-50. Samplers like DPM-Solver and Euler reduce the number of steps needed.
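A hedged example with the diffusers library, swapping in the DPM-Solver scheduler so around 25 steps are enough (model id and step count are illustrative):

```python
from diffusers import DPMSolverMultistepScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# Swap the default sampler for DPM-Solver, which converges in fewer steps.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe("a lighthouse at dawn", num_inference_steps=25).images[0]
```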

Copyright and training data issues

Using copyrighted images in training is a legal gray zone; Stability AI and Midjourney are facing lawsuits over it. In production, document your data sources.

Notorious failure modes

Hands (finger count), faces, text inside images, symmetry: diffusion's classic weak spots. Fix with negative prompts ('extra fingers'), ControlNet, or inpainting.
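For the negative-prompt fix, a small diffusers sketch (prompt text and model id are illustrative); ControlNet and inpainting have their own pipelines in the same library:

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe(
    "portrait photo of a pianist, hands on the keys",
    negative_prompt="extra fingers, deformed hands, blurry face",  # steer away from failure modes
    num_inference_steps=30,
).images[0]
```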