Visualizing Attention, a Transformer's Heart | Chapter 6, Deep Learning

2024-04-07

248K views, 42.2K likes

3Blue1Brown

Demystifying attention, the key mechanism inside transformers and LLMs.

Instead of sponsored ad reads, these lessons are funded directly by viewers: https://3b1b.co/support

Special thanks to these supporters: https://www.3blue1brown.com/lessons/attention#thanks

An equally valuable form of support is to simply share the videos.

Demystifying self-attention, multiple heads, and cross-attention.

The first pass for the translated subtitles here is machine-generated, and therefore notably imperfect. To contribute edits or fixes, visit https://translate.3blue1brown.com/

And yes, at 22:00 (and elsewhere), "breaks" is a typo.

------------------

Here are a few other relevant resources

Build a GPT from scratch, by Andrej Karpathy

https://youtu.be/kCc8FmEb1nY

If you want a conceptual understanding of language models from the ground up, @vcubingx just started a short series of videos on the topic:

https://youtu.be/1il-s4mgNdI?si=XaVxj6bsdy3VkgEX

If you're interested in the herculean task of interpreting what these large networks might actually be doing, the Transformer Circuits posts by Anthropic are great. In particular, it was only after reading one of these that I started thinking of the value and output matrices together as a single low-rank map from the embedding space to itself, which, at least in my mind, made things much clearer than other sources.

https://transformer-circuits.pub/2021/framework/index.html
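
To make that low-rank point concrete, here is a minimal NumPy sketch (the dimensions and variable names are arbitrary toy choices for illustration, not taken from the video or the post): a value matrix maps embeddings down to a small head dimension, an output matrix maps back up, and their product is a single map on the embedding space whose rank is at most the head dimension.

import numpy as np

# Toy dimensions, chosen only for illustration
d_embed, d_head = 12, 3
rng = np.random.default_rng(0)

W_V = rng.normal(size=(d_embed, d_head))   # value matrix: embedding space -> head space
W_O = rng.normal(size=(d_head, d_embed))   # output matrix: head space -> embedding space

combined = W_V @ W_O                       # one d_embed x d_embed map on the embedding space
print(combined.shape)                      # (12, 12)
print(np.linalg.matrix_rank(combined))     # at most d_head = 3, i.e. a low-rank map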

Site with exercises related to ML programming and GPTs

https://www.gptandchill.ai/codingproblems

History of language models by Brit Cruise, @ArtOfTheProblem

https://youtu.be/OFS90-FX6pg

An early paper on how directions in embedding spaces have meaning:

https://arxiv.org/pdf/1301.3781.pdf
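
If you want to poke at that idea in code, here is a small sketch using gensim's downloadable GloVe vectors, which are static word embeddings similar in spirit to the word2vec embeddings from that paper (not a transformer's context-dependent embeddings); the model name below is one of gensim's standard downloads, not something from the video.

import gensim.downloader as api

# Downloads roughly 65 MB of pretrained 50-dimensional GloVe vectors on first run
vectors = api.load("glove-wiki-gigaword-50")

# "king" - "man" + "woman" famously lands near "queen": directions encode relationships
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))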

------------------

Timestamps:

0:00 - Recap on embeddings

1:39 - Motivating examples

4:29 - The attention pattern

11:08 - Masking

12:42 - Context size

13:10 - Values

15:44 - Counting parameters

18:21 - Cross-attention

19:19 - Multiple heads

22:16 - The output matrix

23:19 - Going deeper

24:54 - Ending
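
The chapters on the attention pattern, masking, and values boil down to one small computation per head. As a rough reference, here is a minimal NumPy sketch of a single attention head in the standard scaled-dot-product form; the dimensions and variable names are arbitrary toy choices, not the ones used in the video.

import numpy as np

def attention_head(X, W_Q, W_K, W_V, causal=True):
    # X holds one embedding per token, as rows: (n_tokens, d_embed)
    Q = X @ W_Q                                  # queries (n_tokens, d_head)
    K = X @ W_K                                  # keys    (n_tokens, d_head)
    V = X @ W_V                                  # values  (n_tokens, d_head)

    scores = Q @ K.T / np.sqrt(Q.shape[-1])      # how relevant each key is to each query
    if causal:                                   # masking: a token never attends to later tokens
        scores = np.where(np.tri(len(X), dtype=bool), scores, -np.inf)

    # Softmax over each row gives the attention pattern
    pattern = np.exp(scores - scores.max(axis=-1, keepdims=True))
    pattern /= pattern.sum(axis=-1, keepdims=True)

    return pattern @ V                           # each token gets a weighted sum of values

# Toy sizes, chosen only for illustration
rng = np.random.default_rng(0)
n_tokens, d_embed, d_head = 5, 8, 4
X = rng.normal(size=(n_tokens, d_embed))
W_Q, W_K, W_V = (rng.normal(size=(d_embed, d_head)) for _ in range(3))
print(attention_head(X, W_Q, W_K, W_V).shape)    # (5, 4)

Multiple heads just run this in parallel with different matrices, and an output matrix maps each head's result back into the embedding space.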

------------------

These animations are largely made using a custom Python library, manim. See the FAQ comments here:

https://3b1b.co/faq#manim

https://github.com/3b1b/manim

https://github.com/ManimCommunity/manim/
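
If you're curious what a manim scene looks like, here is the kind of minimal example used in the community edition's tutorials; this particular scene is just an illustration, not taken from this video's code.

from manim import *

class SquareToCircle(Scene):
    def construct(self):
        square = Square()                        # start with a square
        circle = Circle().set_fill(BLUE, opacity=0.5)

        self.play(Create(square))                # animate drawing the square
        self.play(Transform(square, circle))     # morph it into the circle
        self.wait()

Rendering it with something like "manim -pql scene.py SquareToCircle" previews the animation at low quality.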

All code for specific videos is visible here:

https://github.com/3b1b/videos/

The music is by Vincent Rubinetti.

https://www.vincentrubinetti.com

https://vincerubinetti.bandcamp.com/album/the-music-of-3blue1brown

https://open.spotify.com/album/1dVyjwS8FBqXhRunaG5W5u

------------------

3blue1brown is a channel about animating math, in all senses of the word animate. If you're reading the bottom of a video description, I'm guessing you're more interested than the average viewer in lessons here. It would mean a lot to me if you chose to stay up to date on new ones, either by subscribing here on YouTube or otherwise following on whichever platform below you check most regularly.

Mailing list: https://3blue1brown.substack.com

Twitter: https://twitter.com/3blue1brown

Instagram: https://www.instagram.com/3blue1brown

Reddit: https://www.reddit.com/r/3blue1brown

Facebook: https://www.facebook.com/3blue1brown

Patreon: https://patreon.com/3blue1brown

Website: https://www.3blue1brown.com

