Diffusion Models and Related Architecture


A brief overview of diffusion models and the architecture that has led to modern-day implementations.

I’ll be using this as a rough guide as I further explore each model in greater detail.

Foundations (Pre-Diffusion, Influences)

Before diffusion models, generative modeling was dominated by:

  • VAEs (Variational Autoencoders, 2013–2014) → probabilistic latent-variable models with encoder–decoder structure. Key influence on later latent diffusion.
  • GANs (2014) → direct adversarial training for images, not diffusion but often compared.
  • Autoregressive models (e.g., PixelCNN 2016, PixelRNN 2016) → inspired sequential generation ideas later seen in diffusion.

First Diffusion Models

  • Sohl-Dickstein et al. (2015), Deep Unsupervised Learning using Nonequilibrium Thermodynamics
    • First formulation of diffusion probabilistic models (DPMs).
    • Forward process = gradual noise addition, reverse = denoising.
  • DDPM (Denoising Diffusion Probabilistic Models, Ho et al. 2020)
    • First widely adopted modern diffusion model.
    • Uses a U-Net–like architecture.
    • Key idea: reverse process parameterized by neural net.
    • This is the Gaussian-noise baseline that nearly all later image diffusion work builds on.
  • Score-Based Generative Models (Song & Ermon 2019–2020)
    • Independently introduced score matching (with Langevin dynamics), later unified with diffusion via stochastic differential equations.
    • Equivalent to DDPM, but phrased in continuous time.
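
The forward/reverse idea above is easy to sketch numerically. Below is a minimal scalar version of the DDPM closed-form forward process with a linear beta schedule (the schedule constants follow Ho et al. 2020, but the scalar "image" is just an illustrative stand-in for a tensor):

```python
import math, random

# Linear beta schedule from 1e-4 to 0.02 over T steps, as in DDPM.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar_t = prod_{s<=t} (1 - beta_s): the cumulative signal-retention factor.
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

def q_sample(x0, t, eps):
    """Closed-form forward process: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    ab = alpha_bars[t]
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps

random.seed(0)
x0 = 1.0
eps = random.gauss(0.0, 1.0)
early = q_sample(x0, 10, eps)    # small t: mostly signal
late = q_sample(x0, T - 1, eps)  # large t: mostly noise
```

The reverse process is the learned part: a neural net (the U-Net in DDPM) is trained to predict `eps` from `x_t` and `t`, which is exactly the quantity needed to undo one noising step.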

Efficiency & Quality Improvements

  • DDIM (Denoising Diffusion Implicit Models, Song et al. 2020)
    • Deterministic sampling version of DDPM.
    • Much faster inference (fewer steps).
  • Improved DDPMs (Nichol & Dhariwal 2021)
    • Better noise schedules, architecture tweaks, guidance strategies.
  • Classifier Guidance (Dhariwal & Nichol 2021) / Classifier-Free Guidance (Ho & Salimans 2022)
    • Conditioning trick to boost fidelity/controllability.
    • Used in many modern text-to-image systems.
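
The classifier-free guidance trick reduces to one line of arithmetic: the sampler runs the denoiser twice per step, once with the condition and once without, and extrapolates from the unconditional prediction toward the conditional one. The scalars below stand in for the noise-prediction tensors:

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance combination (Ho & Salimans).
    scale = 0 -> unconditional, 1 -> plain conditional,
    > 1 -> amplified conditioning (closer prompt adherence, less diversity)."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

plain = cfg_combine(0.2, 0.5, 1.0)   # scale 1 recovers the conditional prediction
guided = cfg_combine(0.2, 0.5, 7.5)  # 7.5 is a common default in Stable Diffusion
```

Because this is a sampling-time add-on to an already trained model (trained with the condition randomly dropped), it composes with any of the architectures above.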

Latent Space Diffusion

  • VAE-based Latent Representations
    • VAE (2013) ideas are re-used here: instead of diffusing in pixel space, compress to latent first.
  • LDM (Latent Diffusion Models, Rombach et al. 2022 — Stable Diffusion’s core)
    • Diffusion happens in latent space learned by an autoencoder (usually a VAE).
    • Enables much faster and memory-efficient generation.
    • This is the foundation of Stable Diffusion.
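
The latent-space idea can be sketched with stand-in components. `encode`, `decode`, and `denoise_step` below are hypothetical placeholders (not Stable Diffusion's API); the point is only the shape of the pipeline, i.e. that each denoising step touches the small latent rather than full-resolution pixels:

```python
def encode(pixels, factor=8):
    """Stand-in for the VAE encoder: 8x compression by block averaging."""
    return [sum(pixels[i:i + factor]) / factor for i in range(0, len(pixels), factor)]

def decode(latent, factor=8):
    """Stand-in for the VAE decoder: nearest-neighbour upsampling."""
    return [z for z in latent for _ in range(factor)]

def denoise_step(latent):
    """Stand-in for one U-Net denoising step (here: shrink toward 0)."""
    return [0.9 * z for z in latent]

pixels = [1.0] * 512      # a 512-value "image"
latent = encode(pixels)   # 64 values -- the diffusion loop runs here
for _ in range(10):
    latent = denoise_step(latent)
out = decode(latent)      # back to 512 values
```

Stable Diffusion uses an 8x-per-side spatial compression, so a 512x512x3 image becomes a 64x64x4 latent; that shrinkage is where the speed and memory savings come from.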

Text-to-Image Breakthroughs

  • GLIDE (OpenAI, 2021)
    • Diffusion + text conditioning.
    • Predecessor to DALL·E 2.
  • DALL·E 2 (OpenAI, 2022)
    • Uses CLIP embeddings + diffusion decoder.
  • Imagen (Google, 2022)
    • Text-to-image with T5 embeddings.
    • Large-scale high-quality results.
  • Stable Diffusion (2022)
    • Open-source implementation of LDM + text conditioning (via CLIP).
    • Spawned the whole open ecosystem.

Specialized & Advanced Variants

  • Consistency Models (Song et al. 2023)
    • Train for one-step or few-step generation while maintaining diffusion quality.
  • Rectified Flow / Flow Matching (2022–2023)
    • Reformulate diffusion as ODE/flow problem.
    • Faster, cleaner math, used in some next-gen models.
  • Video Diffusion (2022–2023 onward)
    • e.g., Imagen Video, Pika Labs, Runway.
    • Extend LDMs into temporal dimensions.
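
The rectified-flow reformulation is worth a toy sketch (illustrative scalars only, not any paper's exact training code). The straight-line path x_t = (1 - t)·x0 + t·x1 has constant velocity v = x1 - x0, so a model trained to regress v can be sampled by integrating the ODE dx/dt = v with plain Euler steps:

```python
def interpolate(x0, x1, t):
    """Point on the straight path between noise x0 and data x1 at time t."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Flow-matching regression target along the straight path."""
    return x1 - x0

def euler_sample(x_start, velocity_fn, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data)."""
    x, dt = x_start, 1.0 / steps
    for i in range(steps):
        x += dt * velocity_fn(x, i * dt)
    return x

# With the oracle velocity toward a known target, Euler recovers it exactly,
# since the velocity is constant along the straight path.
start, target = -1.0, 2.0
x = euler_sample(start, lambda x, t: target_velocity(start, target))
```

The straighter the learned flow, the fewer integration steps sampling needs, which is why this family (and consistency models, which distill the whole trajectory into one step) is attractive for fast generation.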

Summary of Key Logical Lineage

  • DDPM (2020) ← from original Sohl-Dickstein DPMs (2015).
  • DDIM (2020) is a deterministic/speedup variant of DDPM.
  • Improved DDPMs (2021) refine DDPM.
  • Classifier-free guidance is an add-on, not a new model.
  • LDM (2022) = DDPM in VAE latent space.
  • Stable Diffusion (2022) = LDM + CLIP conditioning.
  • GLIDE / DALL·E 2 / Imagen = pixel-space (often cascaded) diffusion + text conditioning.
  • Consistency Models / Flow Matching = alternative formulations for efficiency.

  • DDPM → DDIM → Improved DDPM → LDM → Stable Diffusion is the direct core lineage.
  • Score-based models are parallel but mathematically equivalent.
  • VAEs reappear as the backbone for LDMs.
  • GANs and autoregressive models were competitors, but diffusion largely superseded them for image generation.