The F=ma of Artificial Intelligence

2025-06-11

[public] 12.9K views, 5.42K likes

Welch Labs

Take your personal data back with Incogni! Use code WELCHLABS and get 60% off an annual plan: http://incogni.com/welchlabs

New Patreon Rewards 29:48 - own a piece of Welch Labs history! https://www.patreon.com/welchlabs

Books & Posters

https://www.welchlabs.com/resources

Sections

0:00 - Intro

2:08 - No more spam calls w/ Incogni

3:45 - Toy Model

5:20 - y=mx+b

6:17 - Softmax

7:48 - Cross Entropy Loss

9:08 - Computing Gradients

12:31 - Backpropagation

18:23 - Gradient Descent

20:17 - Watching our Model Learn

23:53 - Scaling Up

25:45 - The Map of Language

28:13 - The time I quit YouTube

29:48 - New Patreon Rewards!

Special Thanks to Patrons https://www.patreon.com/welchlabs

Juan Benet, Ross Hanson, Yan Babitski, AJ Englehardt, Alvin Khaled, Eduardo Barraza, Hitoshi Yamauchi, Jaewon Jung, Mrgoodlight, Shinichi Hayashi, Sid Sarasvati, Dominic Beaumont, Shannon Prater, Ubiquity Ventures, Matias Forti, Brian Henry, Tim Palade, Petar Vecutin, Nicolas baumann, Jason Singh, Robert Riley, vornska, Barry Silverman, Jake Ehrlich, Mitch Jacobs, Lauren Steely

References

Werbos, P. J. (1994). The roots of backpropagation: from ordered derivatives to neural networks and political forecasting. United Kingdom: Wiley. The Newton quote is on p. 4, where Werbos also expands on the analogy.

Olazaran, Mikel. "A sociological study of the official history of the perceptrons controversy." Social Studies of Science 26.3 (1996): 611-659. The Minsky quote is on p. 393.

Widrow, Bernard. "Generalization and information storage in networks of adaline neurons." Self-organizing systems (1962): 435-461.

Historical Videos

http://youtube.com/watch?v=FwFduRA_L6Q

https://www.youtube.com/watch?v=ntIczNQKfjQ

Code

https://github.com/stephencwelch/manim_videos

Technical Notes

The large Llama training animation shows 8 of the model's 16 layers: specifically layers 1, 2, 7, 8, 9, 10, 15, and 16. Every third attention pattern is shown, and special tokens are ignored. MLP neurons are downsampled using max pooling. Only the weights and gradients above a specific percentile-based threshold are shown, and only query weights are shown going into each attention layer.
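A rough sketch of the downsampling and thresholding described above, assuming NumPy arrays of MLP activations and a single weight matrix; the function names, pooling factor, and percentile are illustrative and not taken from the video's code:

```python
import numpy as np

def max_pool_neurons(activations, factor=32):
    """Downsample MLP neurons by max pooling along the neuron axis.

    activations: array of shape (num_tokens, num_neurons); num_neurons is
    assumed to be divisible by `factor` (an illustrative assumption).
    """
    tokens, neurons = activations.shape
    pooled = activations.reshape(tokens, neurons // factor, factor)
    return pooled.max(axis=-1)

def threshold_by_percentile(values, percentile=99.0):
    """Zero out entries whose magnitude falls below a percentile-based
    threshold -- mirrors showing only the largest weights/gradients."""
    cutoff = np.percentile(np.abs(values), percentile)
    return np.where(np.abs(values) >= cutoff, values, 0.0)

# Example: pool a fake activation matrix and threshold a fake weight matrix.
acts = np.random.randn(16, 4096)           # 16 tokens x 4096 MLP neurons
weights = np.random.randn(4096, 4096)      # a single weight matrix

pooled_acts = max_pool_neurons(acts, factor=32)         # shape (16, 128)
visible_weights = threshold_by_percentile(weights, 99)  # sparse for display
```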

The coordinates of Paris are subtracted from all training examples in the four-city example as a simple normalization; this helps with convergence.
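A minimal sketch of that normalization, assuming (longitude, latitude) pairs stored as NumPy arrays; Paris comes from the video, but the other city names and the approximate coordinates are placeholders:

```python
import numpy as np

# Approximate (longitude, latitude) pairs for a four-city toy example;
# only Paris is known to appear in the video, the rest are illustrative.
cities = {
    "Paris":  np.array([2.35, 48.86]),
    "Madrid": np.array([-3.70, 40.42]),
    "Berlin": np.array([13.40, 52.52]),
    "Rome":   np.array([12.50, 41.90]),
}

# Subtract Paris's coordinates from every training example so inputs are
# centered near the origin, which helps convergence.
paris = cities["Paris"]
normalized = {name: coords - paris for name, coords in cities.items()}

print(normalized["Paris"])  # -> [0. 0.]
```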

In some scenes, the math is carried out at higher precision behind the scenes and the results are rounded for display, which may create apparent inconsistencies.

Written by: Stephen Welch

Produced by: Stephen Welch, Sam Baskin, and Pranav Gundu

Special thanks to: Emily Zhang

Premium Beat IDs

EEDYZ3FP44YX8OWT

MWROXNAY0SPXCMBS

