1. What
The same architecture handles different inputs and outputs with remarkable performance
Source: Google, 2017, "Attention Is All You Need"
All it does: predict the next token (word)
2. Basics
Input: an array of numbers (a tensor)
Layer: where the model's computation happens; it holds the weights and is the only place the data gets transformed
Weights: the parameters that get trained
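A minimal numpy sketch of these three ideas, with made-up toy dimensions (not a real model's sizes): the input is a tensor, the layer is just its weights, and the computation is those weights applied to the input.

```python
import numpy as np

# Toy dimensions, purely for illustration.
d_in, d_out = 4, 3

# Input: an array (tensor) of numbers, here a single vector.
x = np.array([1.0, -2.0, 0.5, 3.0])

# A layer is just learned weights applied to the input;
# training adjusts W and b, nothing else.
W = np.random.randn(d_out, d_in) * 0.1
b = np.zeros(d_out)

y = W @ x + b   # the layer's computation
print(y.shape)  # (3,)
```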
3. Parts
Embedding: turn each token into a vector
- Break the text into tokens and vectorize each one, so a sentence becomes a matrix
- The directions of a word's vector matter, since they encode how similar two words are (similar words point in similar directions)
- The weights learn the relations between words
- GPT-3: 50,257 tokens × 12,288 dimensions ≈ 617M weights
- Unembedding: a matrix that turns the final output vector into probabilities over each token
- GPT-3: also 50,257 tokens × 12,288 dimensions ≈ 617M weights
- Use softmax to normalize the output into a probability distribution
- A temperature parameter T controls how much the largest logits dominate: higher T flattens the distribution, lower T sharpens it (see the sketch below)
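A minimal sketch of the embedding, unembedding, and softmax-with-temperature steps described above, using toy sizes instead of GPT-3's 50,257 × 12,288 matrices; all names and initializations here are illustrative assumptions.

```python
import numpy as np

vocab_size, d_model = 10, 8    # toy sizes; GPT-3 uses 50,257 and 12,288

# Embedding matrix: one vector (row) per token in the vocabulary.
W_embed = np.random.randn(vocab_size, d_model) * 0.02
token_id = 3
x = W_embed[token_id]          # a token becomes a vector

# Unembedding matrix: maps a final vector to one score (logit) per token.
W_unembed = np.random.randn(d_model, vocab_size) * 0.02
logits = x @ W_unembed         # shape (vocab_size,)

def softmax_with_temperature(logits, T=1.0):
    # Higher T flattens the distribution; lower T makes the
    # largest logits dominate even more.
    z = (logits - logits.max()) / T
    e = np.exp(z)
    return e / e.sum()

probs = softmax_with_temperature(logits, T=0.8)
next_token = np.random.choice(vocab_size, p=probs)
```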
Attention: find the relevant words that change the meaning of the current word
- Update each word's meaning based on context; GPT-3's context window is 2,048 tokens
- Query: what the current word is asking for, i.e. which other words would update its meaning
- Key: what each word offers as a potential answer to a query; matching keys against queries finds which words should update the query word
- Value matrix: multiplied with the attended (key) word's embedding to produce a modification vector, which is added (attention-weighted) to the original vector, as sketched after this list
- GPT-3: 96 attention heads per layer; each head has query, key, value-down, and value-up matrices; ≈600M weights per layer
- 96 layers, ≈58B attention weights in total
- Scaling up the context size is non-trivial (attention cost grows quadratically with context length)
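A toy single-head self-attention sketch matching the query/key/value description above; the dimensions and variable names are illustrative assumptions (GPT-3 uses d_model = 12,288 with 128-dimensional heads, 96 heads per layer).

```python
import numpy as np

seq_len, d_model, d_head = 5, 16, 4            # toy sizes

X = np.random.randn(seq_len, d_model)          # one vector per token

# Per-head projection matrices.
W_q = np.random.randn(d_model, d_head) * 0.02       # query
W_k = np.random.randn(d_model, d_head) * 0.02       # key
W_v_down = np.random.randn(d_model, d_head) * 0.02  # value (down-projection)
W_v_up = np.random.randn(d_head, d_model) * 0.02    # value (up-projection)

Q = X @ W_q
K = X @ W_k
V = X @ W_v_down @ W_v_up                      # modification vectors

# How well each token's key answers each token's query.
scores = Q @ K.T / np.sqrt(d_head)

# Causal mask: a token may only attend to itself and earlier tokens.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys

# The attention-weighted modifications are added to the original vectors.
X_updated = X + weights @ V
```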
MLP: multi-layer perceptron
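A sketch of one MLP block, assuming the common transformer layout (expand by 4×, apply GELU, project back down, residual add); GPT-3's sizes go 12,288 → 49,152 → 12,288. The sizes and names below are toys.

```python
import numpy as np

d_model, d_hidden = 16, 64        # GPT-3: 12,288 and 4 * 12,288 = 49,152

W1 = np.random.randn(d_model, d_hidden) * 0.02
b1 = np.zeros(d_hidden)
W2 = np.random.randn(d_hidden, d_model) * 0.02
b2 = np.zeros(d_model)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x):
    # Applied to each token's vector independently, with a residual add.
    return x + (gelu(x @ W1 + b1) @ W2 + b2)
```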
4. Vision Transformer
Source: Google Research, "An Image Is Worth 16x16 Words"
Split the image into 16×16 pixel patches
einops reshaping to transform the input (flatten each patch into a vector)
CLS token: serves as a global feature extractor (its output vector represents the whole image)
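A sketch of the ViT input pipeline: an einops rearrange to cut the image into 16×16 patches, a linear projection of each flattened patch, and a prepended CLS token. The image size, projection, and variable names are illustrative assumptions.

```python
import numpy as np
from einops import rearrange

img = np.random.rand(224, 224, 3)      # H x W x C image
patch = 16                             # 16x16 patches -> 14 * 14 = 196 patches

# Flatten each 16x16x3 patch into one vector: shape (196, 768).
patches = rearrange(img, '(h p1) (w p2) c -> (h w) (p1 p2 c)',
                    p1=patch, p2=patch)

d_model = 64                           # toy embedding size
W_proj = np.random.randn(patches.shape[1], d_model) * 0.02
tokens = patches @ W_proj              # one vector per patch

# Learnable CLS token prepended to the sequence; after the transformer
# layers, its output vector is used as the global image representation.
cls = np.random.randn(1, d_model) * 0.02
sequence = np.concatenate([cls, tokens], axis=0)   # (197, 64)
```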