
What is GPT

1. What

The same architecture handles many different kinds of input and output, with remarkable performance

Source: Google, 2017, "Attention Is All You Need"

All it does: predict the next word (token)
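To make that concrete, a toy autoregressive loop; `next_token_probs` and the vocabulary here are made-up stand-ins for the real model, not GPT itself:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]        # toy vocabulary (hypothetical)

def next_token_probs(tokens):
    # Stand-in for the real model. In GPT this would be:
    # embedding -> attention/MLP layers -> unembedding -> softmax.
    logits = rng.normal(size=len(vocab))
    e = np.exp(logits - logits.max())
    return e / e.sum()

tokens = ["the"]
for _ in range(4):                                  # generate 4 more tokens
    probs = next_token_probs(tokens)                # predict the next word...
    tokens.append(str(rng.choice(vocab, p=probs)))  # ...sample it, append, repeat
print(" ".join(tokens))
```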

2. Basics

Input: an array (tensor)

Layer: the building block of the model that holds the weights; passing through the layers is the only way the data gets computed on

Weights: the parameters that get trained
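A tiny sketch (toy sizes, nothing GPT-specific) of what it means for data to flow through a layer of weights:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=(1, 4))   # input: an array/tensor (hypothetical size)
W = rng.normal(size=(4, 3))   # weights: the numbers that get trained
y = x @ W                     # a layer: the input is transformed by the weights
print(y.shape)                # (1, 3)
```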

3. Parts

Embedding: turn each token into a vector

  • Break the text into tokens and vectorize each one, so a sentence becomes a matrix
    • The dimensions of the embedding matter: directions in that space define how similar two words are
    • Through the weights, the model captures the relations between words
    • GPT-3: 50,257 tokens, 12,288 dimensions, ~617M weights
  • Unembedding: a matrix that turns the output vector into a probability for each token
    • GPT-3: also 50,257 tokens × 12,288 dimensions, ~617M weights
    • Softmax normalizes the output into probabilities
    • A temperature parameter T controls how much the largest values dominate (see the sketch after this list)
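A minimal sketch of softmax with a temperature parameter, using made-up logits over a 5-token vocabulary:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Turn unembedding outputs (logits) into probabilities.

    Higher T flattens the distribution (more random sampling);
    lower T sharpens it toward the largest logit.
    """
    scaled = np.asarray(logits, dtype=np.float64) / T
    scaled -= scaled.max()            # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5, 0.1, -1.0]              # hypothetical values
print(softmax_with_temperature(logits, T=1.0))
print(softmax_with_temperature(logits, T=2.0))   # flatter distribution
```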

Attention: find the relevant words in the context that should change the meaning of the current word

  • Updates the meaning of each word based on its context; GPT-3's context window is 2,048 tokens
  • Query: what the current word is looking for, i.e. which other words are relevant to updating it
  • Key: what each word offers as a potential answer to a query, so we can find which words should update the query word
  • Value matrix: applied to the matching words to generate a modification vector that is added to the original vector
    • GPT-3: 96 attention heads per layer; each head has a query, key, value-down, and value-up matrix, roughly 600M weights per layer
      • 96 layers, ~58B attention weights in total
  • Scaling the context size is non-trivial, since attention cost grows quadratically with context length (see the sketch after this list)
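A minimal single-head attention sketch in NumPy, with toy sizes and a causal mask; GPT-3 additionally factors the value map into value-down and value-up matrices and runs 96 such heads in parallel, which is omitted here:

```python
import numpy as np

def attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over token embeddings x."""
    Q = x @ W_q                              # what each token is looking for
    K = x @ W_k                              # what each token offers
    V = x @ W_v                              # the modification each token can contribute
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # query-key match, scaled
    causal = np.tri(len(x), dtype=bool)      # tokens cannot attend to the future
    scores = np.where(causal, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                       # weighted sum of value vectors

# Hypothetical sizes: 4 tokens, embedding dim 8, head dim 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(attention(x, W_q, W_k, W_v).shape)     # (4, 4)
```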

MLP: multi-layer perceptron, the feed-forward block that follows each attention block
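A minimal sketch of such a feed-forward block applied to each token vector independently (toy sizes; GPT-3 expands to roughly 4× the embedding dimension and back):

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU, the nonlinearity commonly used in transformers.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp_block(x, W_in, b_in, W_out, b_out):
    # Expand, apply the nonlinearity, project back down.
    return gelu(x @ W_in + b_in) @ W_out + b_out

# Hypothetical sizes: 4 tokens, embedding dim 8, hidden dim 32.
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))
out = mlp_block(x, rng.normal(size=(8, 32)), np.zeros(32),
                rng.normal(size=(32, 8)), np.zeros(8))
print(out.shape)   # (4, 8)
```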

4. Vision Transformer

Source: Google Research, "An Image is Worth 16x16 Words"

Split the image into 16×16 pixel patches, each treated as a token

einops reshaping transforms the input (image → sequence of flattened patches)

CLS token: serves as a global feature extractor (see the sketch below)
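A minimal patchify sketch with einops, using placeholder data; the real ViT applies a learned linear projection to each patch and a learned CLS embedding rather than zeros:

```python
import numpy as np
from einops import rearrange

images = np.random.randn(2, 3, 224, 224)   # hypothetical batch of 2 RGB images

# Split each image into non-overlapping 16x16 patches and flatten each patch,
# giving a sequence of (224/16)^2 = 196 "words" of dimension 16*16*3 = 768.
patches = rearrange(images, "b c (h p1) (w p2) -> b (h w) (p1 p2 c)", p1=16, p2=16)

# Prepend a CLS token (zeros as a placeholder); after the transformer layers,
# its output vector serves as the global feature for classification.
cls = np.zeros((2, 1, patches.shape[-1]))
sequence = np.concatenate([cls, patches], axis=1)
print(sequence.shape)   # (2, 197, 768)
```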