Diffusion models look intimidating from the outside and almost embarrassingly simple from the inside. You take data, gradually destroy it with noise, and train a network to undo one small step of that destruction. Sample by running the undo loop from pure noise. That is the whole idea.
The forward process
The forward process is fixed and requires no learning: at each step you add a little Gaussian noise, following a schedule chosen ahead of time. After enough steps, the original signal is gone and you are left with something indistinguishable from random noise. Crucially, you can jump to any noise level in closed form, which is what makes training tractable.
The reverse process
The reverse process is where the model lives. Given a noisy sample and the step it came from, the network predicts the noise that was added. Subtract a scaled version of that prediction and you have a slightly cleaner sample. Repeat. The elegance is that a hard generation problem becomes a sequence of easy denoising problems.
What bites you in production
Sampling is slow because it is iterative, so most of the engineering effort goes into doing fewer, smarter steps. Distillation, better solvers, and latent-space diffusion all attack the same cost. The quality ceiling is rarely the model — it is how much compute you are willing to spend per image at serving time.