1. What
The same architecture handles different inputs and outputs with remarkable performance
Source: Google, 2017, "Attention Is All You Need"
All it does: predict the next token (word)
2. Basics
Input: an array of numbers (a tensor)
Layer: where the model's computation happens; it holds the weights and is the only place the data gets transformed
Weights: the parameters that get trained
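A minimal numpy sketch of these three ideas, with made-up toy dimensions (not a real model's sizes): the input is a tensor, the layer is just its weights, and the computation is those weights applied to the input.

```python
import numpy as np

# Toy dimensions, purely for illustration.
d_in, d_out = 4, 3

# Input: an array (tensor) of numbers, here a single vector.
x = np.array([1.0, -2.0, 0.5, 3.0])

# A layer is just learned weights applied to the input;
# training adjusts W and b, nothing else.
W = np.random.randn(d_out, d_in) * 0.1
b = np.zeros(d_out)

y = W @ x + b   # the layer's computation
print(y.shape)  # (3,)
```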
3. Parts
Embedding: turn each token into a vector
- Break the text into tokens and vectorize each one, so a sentence becomes a matrix
- The directions of a word's vector matter, since they encode how similar two words are (similar words point in similar directions)
- The weights learn the relations between words
- GPT-3: 50,257 tokens × 12,288 dimensions ≈ 617M weights
- Unembedding: a matrix that turns the final output vector into probabilities over each token
- GPT-3: also 50,257 tokens × 12,288 dimensions ≈ 617M weights
- Use softmax to normalize the output into a probability distribution
- A temperature parameter T controls how much the largest logits dominate: higher T flattens the distribution, lower T sharpens it (see the sketch below)
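A minimal sketch of the embedding, unembedding, and softmax-with-temperature steps described above, using toy sizes instead of GPT-3's 50,257 × 12,288 matrices; all names and initializations here are illustrative assumptions.

```python
import numpy as np

vocab_size, d_model = 10, 8    # toy sizes; GPT-3 uses 50,257 and 12,288

# Embedding matrix: one vector (row) per token in the vocabulary.
W_embed = np.random.randn(vocab_size, d_model) * 0.02
token_id = 3
x = W_embed[token_id]          # a token becomes a vector

# Unembedding matrix: maps a final vector to one score (logit) per token.
W_unembed = np.random.randn(d_model, vocab_size) * 0.02
logits = x @ W_unembed         # shape (vocab_size,)

def softmax_with_temperature(logits, T=1.0):
    # Higher T flattens the distribution; lower T makes the
    # largest logits dominate even more.
    z = (logits - logits.max()) / T
    e = np.exp(z)
    return e / e.sum()

probs = softmax_with_temperature(logits, T=0.8)
next_token = np.random.choice(vocab_size, p=probs)
```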
Attention: find the relevant words that change the meaning of the current word
- Update each word's meaning based on context; GPT-3's context window is 2,048 tokens
- Query: what the current word is asking for, i.e. which other words would update its meaning
- Key: what each word offers as a potential answer to a query; matching keys against queries finds which words should update the query word
- Value matrix: multiplied with the attended (key) word's embedding to produce a modification vector, which is added (attention-weighted) to the original vector, as sketched after this list
- GPT-3: 96 attention heads per layer; each head has query, key, value-down, and value-up matrices; ≈600M weights per layer
- 96 layers, ≈58B attention weights in total
- Scaling up the context size is non-trivial (attention cost grows quadratically with context length)
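A toy single-head self-attention sketch matching the query/key/value description above; the dimensions and variable names are illustrative assumptions (GPT-3 uses d_model = 12,288 with 128-dimensional heads, 96 heads per layer).

```python
import numpy as np

seq_len, d_model, d_head = 5, 16, 4            # toy sizes

X = np.random.randn(seq_len, d_model)          # one vector per token

# Per-head projection matrices.
W_q = np.random.randn(d_model, d_head) * 0.02       # query
W_k = np.random.randn(d_model, d_head) * 0.02       # key
W_v_down = np.random.randn(d_model, d_head) * 0.02  # value (down-projection)
W_v_up = np.random.randn(d_head, d_model) * 0.02    # value (up-projection)

Q = X @ W_q
K = X @ W_k
V = X @ W_v_down @ W_v_up                      # modification vectors

# How well each token's key answers each token's query.
scores = Q @ K.T / np.sqrt(d_head)

# Causal mask: a token may only attend to itself and earlier tokens.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys

# The attention-weighted modifications are added to the original vectors.
X_updated = X + weights @ V
```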
MLP: multi-layer perceptron
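A sketch of one MLP block, assuming the common transformer layout (expand by 4×, apply GELU, project back down, residual add); GPT-3's sizes go 12,288 → 49,152 → 12,288. The sizes and names below are toys.

```python
import numpy as np

d_model, d_hidden = 16, 64        # GPT-3: 12,288 and 4 * 12,288 = 49,152

W1 = np.random.randn(d_model, d_hidden) * 0.02
b1 = np.zeros(d_hidden)
W2 = np.random.randn(d_hidden, d_model) * 0.02
b2 = np.zeros(d_model)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x):
    # Applied to each token's vector independently, with a residual add.
    return x + (gelu(x @ W1 + b1) @ W2 + b2)
```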
4. Vision Transformer
Source: Google Research, "An Image Is Worth 16x16 Words"
Split the image into 16×16 pixel patches
einops reshaping to transform the input (flatten each patch into a vector)
CLS token: serves as a global feature extractor (its output vector represents the whole image)
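A sketch of the ViT input pipeline: an einops rearrange to cut the image into 16×16 patches, a linear projection of each flattened patch, and a prepended CLS token. The image size, projection, and variable names are illustrative assumptions.

```python
import numpy as np
from einops import rearrange

img = np.random.rand(224, 224, 3)      # H x W x C image
patch = 16                             # 16x16 patches -> 14 * 14 = 196 patches

# Flatten each 16x16x3 patch into one vector: shape (196, 768).
patches = rearrange(img, '(h p1) (w p2) c -> (h w) (p1 p2 c)',
                    p1=patch, p2=patch)

d_model = 64                           # toy embedding size
W_proj = np.random.randn(patches.shape[1], d_model) * 0.02
tokens = patches @ W_proj              # one vector per patch

# Learnable CLS token prepended to the sequence; after the transformer
# layers, its output vector is used as the global image representation.
cls = np.random.randn(1, d_model) * 0.02
sequence = np.concatenate([cls, tokens], axis=0)   # (197, 64)
```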