video thumbnail 39:48
But how do AI videos actually work? | Guest video by @WelchLabsVideo

2025-07-25

[public] 221K views, 14.3K likes

4K

Diffusion models, CLIP, and the math of turning text into images

Welch Labs Book: https://www.welchlabs.com/resources/imaginary-numbers-book

Sections

0:00 - Intro

3:37 - CLIP

6:25 - Shared Embedding Space

8:16 - Diffusion Models & DDPM

11:44 - Learning Vector Fields

22:00 - DDIM

25:25 - Dall E 2

26:37 - Conditioning

30:02 - Guidance

33:39 - Negative Prompts

34:27 - Outro

35:32 - About guest videos + Grant’s Reaction

Special Thanks to:

Jonathan Ho - Jonathan is the first author of the DDPM paper and the Classifier-Free Guidance paper.

https://arxiv.org/pdf/2006.11239

https://arxiv.org/pdf/2207.12598

Preetum Nakkiran - Preetum has an excellent introductory diffusion tutorial:

https://arxiv.org/pdf/2406.08929

Chenyang Yuan - Many of the animations in this video were implemented using manim and Chenyang’s smalldiffusion library: https://github.com/yuanchenyang/smalldiffusion

Chenyang also has a terrific tutorial and MIT course on diffusion models:

https://www.chenyang.co/diffusion.html

https://www.practical-diffusion.org/

Other References

All of Sander Dieleman’s diffusion blog posts are fantastic: https://sander.ai/

CLIP Paper: https://arxiv.org/pdf/2103.00020

DDIM Paper: https://arxiv.org/pdf/2010.02502

Score-Based Generative Modeling: https://arxiv.org/pdf/2011.13456

Wan2.1: https://github.com/Wan-Video/Wan2.1

Stable Diffusion: https://huggingface.co/stabilityai/stable-diffusion-2

Midjourney: https://www.midjourney.com/

Veo: https://deepmind.google/models/veo/

DallE 2 paper: https://cdn.openai.com/papers/dall-e-2.pdf

Code for this video: https://github.com/stephencwelch/manim_videos/tree/master/_2025/sora

Written by: Stephen Welch, with very helpful feedback from Grant Sanderson

Produced by: Stephen Welch, Sam Baskin, and Pranav Gundu

Technical Notes

The noise videos in the opening have been passed through a VAE, which acts very much like a video compressor (the diffusion process actually happens in a compressed “latent” space). This is why the noise videos don’t look like pure salt-and-pepper noise.

6:15 CLIP: Although directly minimizing cosine similarity on a single batch would push our vectors 180 degrees apart, in practice CLIP needs to maximize the uniformity of concepts over the hypersphere it's operating on. For this reason, we animated these vectors as orthogonal-ish. See: https://proceedings.mlr.press/v119/wang20k/wang20k.pdf
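As a rough illustration of the objective discussed in this note, here is a minimal NumPy sketch of a CLIP-style symmetric contrastive loss over a batch of paired embeddings. It is not the CLIP training code: the function names, the random toy embeddings, and the temperature value of 0.07 are all assumptions made for illustration.

```python
import numpy as np

def cosine_similarity_matrix(image_emb, text_emb):
    """Pairwise cosine similarities; entry [i, j] compares image i with text j."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return image_emb @ text_emb.T

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy: matched pairs sit on the diagonal.

    The temperature is an illustrative value, not taken from the video.
    """
    logits = cosine_similarity_matrix(image_emb, text_emb) / temperature
    diag = np.arange(logits.shape[0])
    loss_image_to_text = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_image_to_text + loss_text_to_image)

# Toy batch (hypothetical data): 4 "text" vectors and nearly identical "image" vectors.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
matched = text + 0.01 * rng.normal(size=(4, 8))
shuffled = matched[::-1]  # reversing the batch mismatches every pair

loss_matched = clip_style_loss(matched, text)
loss_shuffled = clip_style_loss(shuffled, text)
print(loss_matched, loss_shuffled)
```

The loss is low when each image embedding is most similar to its own caption and high when the pairs are scrambled. Because every mismatched pair in the batch is pushed apart at once, the embeddings tend to spread out over the sphere rather than collapse into exact opposites.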

Per Chenyang Yuan: at 10:15, the blurry image that results when removing random noise in DDPM is probably due to a mismatch in noise levels when calling the denoiser. When the denoiser is called on x_{t-1} during DDPM sampling, it is expected to have a certain noise level (let's call it sigma_{t-1}). If you generate x_{t-1} from x_t without adding noise, then the noise present in x_{t-1} is always smaller than sigma_{t-1}. This causes the denoiser to remove too much noise, thus pointing towards the mean of the dataset.
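The noise-level mismatch described above can be checked numerically. The sketch below is an idealized illustration, not the code used for the video: it assumes the DDPM paper's notation (beta_t, alpha_t, and the cumulative product alpha_bar_t), a linear beta schedule, and a perfect denoiser that recovers the clean sample exactly. Under those assumptions it compares the noise standard deviation the denoiser expects at step t-1 with the noise that is actually left if the sampler keeps only the posterior mean and skips the fresh noise.

```python
import numpy as np

# Assumed linear beta schedule with 1000 steps (a common DDPM choice).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Look at every step that has a predecessor.
alpha_t = alphas[1:]
beta_t = betas[1:]
abar_t = alpha_bars[1:]
abar_prev = alpha_bars[:-1]

# Noise std the denoiser expects to see in x_{t-1}.
expected_std = np.sqrt(1.0 - abar_prev)

# Noise std left in the posterior mean when the clean sample is known exactly:
# the mean scales x_t by sqrt(alpha_t) * (1 - abar_prev) / (1 - abar_t),
# and x_t carries noise with std sqrt(1 - abar_t).
mean_only_std = np.sqrt(alpha_t) * (1.0 - abar_prev) / np.sqrt(1.0 - abar_t)

# Std of the fresh noise that DDPM sampling normally adds back (posterior variance).
added_std = np.sqrt(beta_t * (1.0 - abar_prev) / (1.0 - abar_t))

print(mean_only_std[-1], expected_std[-1])
```

In this idealized setting, the squared noise left in the mean plus the squared added noise equals the expected squared noise level. Skipping the added noise therefore always leaves x_{t-1} less noisy than the denoiser assumes.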

The text conditioning input to Stable Diffusion is not the 512-dim text embedding vector, but the output of the layer before that, [with dimension 77x512](https://stackoverflow.com/a/79243065).

For the vectors at 31:40 - Some implementations use f(x, t, cat) + alpha(f(x, t, cat) - f(x, t)), and others use f(x, t) + alpha(f(x, t, cat) - f(x, t)). In the second form, an alpha value of 1 corresponds to no guidance, since it returns the plain conditional output f(x, t, cat); in the first form, that happens at alpha = 0. I chose the second format here to keep things simpler.
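The two guidance conventions differ only by an offset of 1 in alpha. A minimal NumPy sketch, using made-up 2D vectors in place of real model outputs (the function names and values here are hypothetical):

```python
import numpy as np

def guide_from_conditional(f_cond, f_uncond, alpha):
    """First convention: start at the conditional output; alpha = 0 means no guidance."""
    return f_cond + alpha * (f_cond - f_uncond)

def guide_from_unconditional(f_cond, f_uncond, alpha):
    """Second convention: start at the unconditional output; alpha = 1 means no guidance."""
    return f_uncond + alpha * (f_cond - f_uncond)

# Stand-ins for f(x, t) and f(x, t, cat) at a single point.
f_uncond = np.array([1.0, 0.0])
f_cond = np.array([1.0, 2.0])

# Both return the plain conditional output when guidance is "off".
no_guidance_a = guide_from_conditional(f_cond, f_uncond, 0.0)
no_guidance_b = guide_from_unconditional(f_cond, f_uncond, 1.0)

# An alpha of 2 in the first convention matches an alpha of 3 in the second.
guided_a = guide_from_conditional(f_cond, f_uncond, 2.0)
guided_b = guide_from_unconditional(f_cond, f_uncond, 3.0)
print(guided_a, guided_b)
```

When comparing guidance scales across implementations, it helps to check which convention is in use, since the same number means a different strength in each.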

At 30:30, the unconditional t=1 vector field looks a bit different from what it did at the 17:15 mark. This comes from using different models trained for different parts of the video, and the difference is likely due to different random initializations.

Premium Beat Music ID: EEDYZ3FP44YX8OWT

