Diffusion Models and Related Architecture


A brief overview of diffusion models and the architecture that has led to modern-day implementations.

I’ll be using this as a rough guide as I further explore each model in greater detail.

Foundations (Pre-Diffusion, Influences)

Before diffusion models, generative modeling was dominated by:

  • VAEs (Variational Autoencoders, 2013–2014) → probabilistic latent-variable models with encoder–decoder structure. Key influence on later latent diffusion.
  • GANs (2014) → direct adversarial training for images, not diffusion but often compared.
  • Autoregressive models (e.g., PixelCNN 2016, PixelRNN 2016) → inspired sequential generation ideas later seen in diffusion.

First Diffusion Models

  • Sohl-Dickstein et al. (2015), Deep Unsupervised Learning using Nonequilibrium Thermodynamics
    • First formulation of diffusion probabilistic models (DPMs).
    • Forward process = gradual noise addition, reverse = denoising.
  • DDPM (Denoising Diffusion Probabilistic Models, Ho et al. 2020)
    • First widely adopted modern diffusion model.
    • Uses a U-Net–like architecture.
    • Key idea: reverse process parameterized by neural net.
    • This is the Gaussian-noise baseline that nearly all later image diffusion work builds on.
  • Score-Based Generative Models (Song & Ermon 2019–2020)
    • Independently introduced score matching (with Langevin dynamics), later unified with diffusion via stochastic differential equations.
    • Equivalent to DDPM, but phrased in continuous time.
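
The forward/reverse idea above is easy to sketch numerically. Below is a minimal scalar version of the DDPM closed-form forward process with a linear beta schedule (the schedule constants follow Ho et al. 2020, but the scalar "image" is just an illustrative stand-in for a tensor):

```python
import math, random

# Linear beta schedule from 1e-4 to 0.02 over T steps, as in DDPM.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar_t = prod_{s<=t} (1 - beta_s): the cumulative signal-retention factor.
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

def q_sample(x0, t, eps):
    """Closed-form forward process: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    ab = alpha_bars[t]
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps

random.seed(0)
x0 = 1.0
eps = random.gauss(0.0, 1.0)
early = q_sample(x0, 10, eps)    # small t: mostly signal
late = q_sample(x0, T - 1, eps)  # large t: mostly noise
```

The reverse process is the learned part: a neural net (the U-Net in DDPM) is trained to predict `eps` from `x_t` and `t`, which is exactly the quantity needed to undo one noising step.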

Efficiency & Quality Improvements

  • DDIM (Denoising Diffusion Implicit Models, Song et al. 2020)
    • Deterministic sampling version of DDPM.
    • Much faster inference (fewer steps).
  • Improved DDPMs (Nichol & Dhariwal 2021)
    • Better noise schedules, architecture tweaks, guidance strategies.
  • Classifier Guidance (Dhariwal & Nichol 2021) / Classifier-Free Guidance (Ho & Salimans 2022)
    • Conditioning trick to boost fidelity/controllability.
    • Used in many modern text-to-image systems.
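
The classifier-free guidance trick reduces to one line of arithmetic: the sampler runs the denoiser twice per step, once with the condition and once without, and extrapolates from the unconditional prediction toward the conditional one. The scalars below stand in for the noise-prediction tensors:

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance combination (Ho & Salimans).
    scale = 0 -> unconditional, 1 -> plain conditional,
    > 1 -> amplified conditioning (closer prompt adherence, less diversity)."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

plain = cfg_combine(0.2, 0.5, 1.0)   # scale 1 recovers the conditional prediction
guided = cfg_combine(0.2, 0.5, 7.5)  # 7.5 is a common default in Stable Diffusion
```

Because this is a sampling-time add-on to an already trained model (trained with the condition randomly dropped), it composes with any of the architectures above.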

Latent Space Diffusion

  • VAE-based Latent Representations
    • VAE (2013) ideas are re-used here: instead of diffusing in pixel space, compress to latent first.
  • LDM (Latent Diffusion Models, Rombach et al. 2022 — Stable Diffusion’s core)
    • Diffusion happens in latent space learned by an autoencoder (usually a VAE).
    • Enables much faster and memory-efficient generation.
    • This is the foundation of Stable Diffusion.
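
The latent-space idea can be sketched with stand-in components. `encode`, `decode`, and `denoise_step` below are hypothetical placeholders (not Stable Diffusion's API); the point is only the shape of the pipeline, i.e. that each denoising step touches the small latent rather than full-resolution pixels:

```python
def encode(pixels, factor=8):
    """Stand-in for the VAE encoder: 8x compression by block averaging."""
    return [sum(pixels[i:i + factor]) / factor for i in range(0, len(pixels), factor)]

def decode(latent, factor=8):
    """Stand-in for the VAE decoder: nearest-neighbour upsampling."""
    return [z for z in latent for _ in range(factor)]

def denoise_step(latent):
    """Stand-in for one U-Net denoising step (here: shrink toward 0)."""
    return [0.9 * z for z in latent]

pixels = [1.0] * 512      # a 512-value "image"
latent = encode(pixels)   # 64 values -- the diffusion loop runs here
for _ in range(10):
    latent = denoise_step(latent)
out = decode(latent)      # back to 512 values
```

Stable Diffusion uses an 8x-per-side spatial compression, so a 512x512x3 image becomes a 64x64x4 latent; that shrinkage is where the speed and memory savings come from.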

Text-to-Image Breakthroughs

  • GLIDE (OpenAI, 2021)
    • Diffusion + text conditioning.
    • Predecessor to DALL·E 2.
  • DALL·E 2 (OpenAI, 2022)
    • Uses CLIP embeddings + diffusion decoder.
  • Imagen (Google, 2022)
    • Text-to-image with T5 embeddings.
    • Large-scale high-quality results.
  • Stable Diffusion (2022)
    • Open-source implementation of LDM + text conditioning (via CLIP).
    • Spawned the whole open ecosystem.

Specialized & Advanced Variants

  • Consistency Models (Song et al. 2023)
    • Train for one-step or few-step generation while maintaining diffusion quality.
  • Rectified Flow / Flow Matching (2022–2023)
    • Reformulate diffusion as ODE/flow problem.
    • Faster, cleaner math, used in some next-gen models.
  • Video Diffusion (2022–2023 onward)
    • e.g., Imagen Video, Pika Labs, Runway.
    • Extend LDMs into temporal dimensions.
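
The rectified-flow reformulation is worth a toy sketch (illustrative scalars only, not any paper's exact training code). The straight-line path x_t = (1 - t)·x0 + t·x1 has constant velocity v = x1 - x0, so a model trained to regress v can be sampled by integrating the ODE dx/dt = v with plain Euler steps:

```python
def interpolate(x0, x1, t):
    """Point on the straight path between noise x0 and data x1 at time t."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Flow-matching regression target along the straight path."""
    return x1 - x0

def euler_sample(x_start, velocity_fn, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data)."""
    x, dt = x_start, 1.0 / steps
    for i in range(steps):
        x += dt * velocity_fn(x, i * dt)
    return x

# With the oracle velocity toward a known target, Euler recovers it exactly,
# since the velocity is constant along the straight path.
start, target = -1.0, 2.0
x = euler_sample(start, lambda x, t: target_velocity(start, target))
```

The straighter the learned flow, the fewer integration steps sampling needs, which is why this family (and consistency models, which distill the whole trajectory into one step) is attractive for fast generation.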

Summary of Key Logical Lineage

  • DDPM (2020) ← from original Sohl-Dickstein DPMs (2015).
  • DDIM (2020) is a deterministic/speedup variant of DDPM.
  • Improved DDPMs (2021) refine DDPM.
  • Classifier-free guidance is an add-on, not a new model.
  • LDM (2022) = DDPM in VAE latent space.
  • Stable Diffusion (2022) = LDM + CLIP conditioning.
  • GLIDE / DALL·E 2 / Imagen = pixel-space (often cascaded) diffusion + text conditioning.
  • Consistency Models / Flow Matching = alternative formulations for efficiency.

  • DDPM → DDIM → Improved DDPM → LDM → Stable Diffusion is the direct core lineage.
  • Score-based models are parallel but mathematically equivalent.
  • VAEs reappear as the backbone for LDMs.
  • GANs and autoregressive models were competitors, but diffusion largely superseded them for image generation.