Mini GPT (Single Training-Step Snapshot, From Scratch)

Realistic GPT Equations (mapped to this sample)

In a real decoder-only Transformer (GPT-style), one block plus the LM head computes:

1) Q = X W_Q, K = X W_K, V = X W_V
2) A = softmax((Q K^T) / sqrt(d_k) + mask)
3) H = A V
4) O = H W_O
5) Y = LayerNorm(X + O)                 # residual connection
6) F = MLP(Y)                           # feed-forward network
7) Y2 = LayerNorm(Y + F)                # second residual + norm
8) logits = Y2_t W_U + b                # for position t (last token in next-token prediction)
9) probs = softmax(logits)
10) loss = -log(probs[target_index])
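The ten equations above map to NumPy almost one-to-one. The following is an illustrative sketch only: the sizes, random seed, ReLU MLP, and bare (no scale/shift) LayerNorm are assumptions for brevity, not the numbers used later in this file:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_k, vocab = 2, 4, 4, 3            # tiny illustrative sizes

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # stability trick
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layernorm(x, eps=1e-5):                    # no learnable scale/shift here
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

X = rng.normal(size=(T, d_model))              # embedded (+ positional) input
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_k)) for _ in range(4))
W1 = rng.normal(size=(d_model, 4 * d_model))   # MLP expand
W2 = rng.normal(size=(4 * d_model, d_model))   # MLP contract
W_U = rng.normal(size=(d_model, vocab))        # LM head
b = np.zeros(vocab)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V                  # (1)
mask = np.triu(np.full((T, T), -np.inf), k=1)        # causal mask
A = softmax(Q @ K.T / np.sqrt(d_k) + mask)           # (2)
H = A @ V                                            # (3)
O = H @ W_O                                          # (4)
Y = layernorm(X + O)                                 # (5)
F = np.maximum(0.0, Y @ W1) @ W2                     # (6) ReLU MLP
Y2 = layernorm(Y + F)                                # (7)
logits = Y2[-1] @ W_U + b                            # (8)
probs = softmax(logits)                              # (9)
loss = -np.log(probs[2])                             # (10) e.g. target id 2
```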

Important training context:
- This is ONE gradient-descent step from random initialization (training from scratch).
- At step 0, all weights (E, W_Q, W_K, W_V, W_U, etc.) are random.
- The model sees one training sample: input "i love" and target next token "you".
- We do forward pass -> compute loss -> backprop gradients -> update weights.

This file computes the attention path exactly with concrete numbers.
To stay beginner-friendly and small, we simplify by:
- using one tiny attention head
- skipping LayerNorm/MLP numeric expansion
- directly projecting the final token state to logits with a small matrix

So this demo is a realistic training snapshot, but intentionally compact.
    

0. Variables and Meaning

token ids: i=0, love=1, you=2

Dimensions used in this tiny example:
T = 2            # sequence length ("i love")
d_model = 2      # model width (tiny to make math visible)
d_k = 2          # key/query width
d_v = 2          # value width
Vocab = 3        # {i, love, you}

Core variables:
E in R^(Vocab x d_model)
  Embedding table. Row j is the learned vector for token id j.

X in R^(T x d_model)
  Input token vectors after embedding lookup.
  Here T=2 rows: first row for "i", second row for "love".

P (or pos) in R^(T x d_model)
  Positional encoding so the model knows token order.

X_tilde = X + P
  Position-aware token representations.

W_Q in R^(d_model x d_k)
W_K in R^(d_model x d_k)
W_V in R^(d_model x d_v)
  Learned linear maps producing queries, keys, values.

Q = X_tilde W_Q in R^(T x d_k)
K = X_tilde W_K in R^(T x d_k)
V = X_tilde W_V in R^(T x d_v)

S = Q K^T in R^(T x T)
  Raw attention scores (how much each token attends to each token).

S_scaled = S / sqrt(d_k)
  Scaling keeps values numerically stable.

A = softmax(S_scaled, row-wise) in R^(T x T)
  Attention weights. Every row sums to 1.
  In decoder-only GPT, this is causal-masked so the token at position t cannot attend to future positions t' > t.

H = A V in R^(T x d_v)
  Mixed contextual representation (weighted sum of value vectors).

W_U in R^(d_v x Vocab)
  Final projection to vocabulary logits in this tiny demo.
  (In full GPT, this corresponds to the language-model head.)

logits = h_last W_U in R^(Vocab)
  Scores for each next-token candidate.

probs = softmax(logits) in R^(Vocab)
  Next-token probabilities.

target_index
  Correct token id for supervision.

loss = -log(probs[target_index])
  Cross-entropy for one training example.

d_logits = probs - one_hot(target_index)
  Gradient at the logits for cross-entropy + softmax.
    

0.1 How to Read This (Beginner Mode)

Goal of this training sample:
Input tokens:  "i love"
Correct next:  "you"

At every numbered step below, ask 4 questions:
1) What goes IN?
2) What operation happens?
3) What comes OUT?
4) Why do we need this step?

Quick map:
- Steps 1-4: build input vectors and model weights
- Steps 5-9: attention computes context-aware token states
- Steps 10-12: convert last token state to next-token probability + loss
- Steps 13-15: compute gradient and update parameters

If a symbol looks scary, check section "0. Variables and Meaning" first.
    

0.2 Shape Checklist (sanity checks)

X_tilde: (2 x 2)
W_Q, W_K, W_V: (2 x 2)
Q, K, V: (2 x 2)
scores = QK^T: (2 x 2)
A = softmax(scores): (2 x 2)
H = A V: (2 x 2)
h_last: (1 x 2)
W_U: (2 x 3)
logits = h_last W_U: (1 x 3)

Rule: inner dimensions must match for matrix multiply.
Example: (1x2) * (2x3) -> (1x3)
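The checklist can be verified mechanically. A minimal NumPy sketch using placeholder zero matrices (shapes only, no real values):

```python
import numpy as np

# placeholder arrays with the shapes from the checklist above
X_tilde = np.zeros((2, 2))
W_U = np.zeros((2, 3))
h_last = X_tilde[-1:]        # slicing with -1: keeps the 2-D shape (1, 2)
logits = h_last @ W_U        # (1x2) * (2x3) -> (1x3): inner dims match
```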
    

1. Vocabulary

Professor Note: We start with a vocabulary because neural networks only understand numbers, not words. Tokenization gives each word (or subword) an ID so math can begin.

i=0, love=1, you=2

2. Embeddings (E)

Why: convert token IDs into dense numeric vectors the model can learn from.

Professor Note: If we fed raw IDs directly (0, 1, 2), the model would treat them like ordered numeric magnitudes. Embeddings fix this by giving each token a learned feature vector where geometric relationships can represent meaning.

E =
i    = [1, 0]
love = [0, 1]
you  = [1, 1]

Input sequence: "i love"

Professor Note: This matrix is the model's current "view" of the sentence before context mixing. Each row is one token position.

X =
[[1,0],
 [0,1]]
---

3. Positional Encoding

Why: tell the model token order; otherwise "i love" and "love i" look identical.

Professor Note: Attention by itself is permutation-friendly. Position information is injected so the model can distinguish first token vs second token and learn grammar/order.

pos0 = [0.1, 0]
pos1 = [0, 0.1]

X + pos =
[[1.1, 0],
 [0, 1.1]]
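Steps 2 and 3 in code, using the exact numbers from this walkthrough (a minimal NumPy sketch):

```python
import numpy as np

E = np.array([[1.0, 0.0],    # i    (id 0)
              [0.0, 1.0],    # love (id 1)
              [1.0, 1.0]])   # you  (id 2)
ids = [0, 1]                 # "i love"
X = E[ids]                   # embedding lookup: rows 0 and 1
P = np.array([[0.1, 0.0],    # pos0
              [0.0, 0.1]])   # pos1
X_tilde = X + P
# X_tilde == [[1.1, 0.0], [0.0, 1.1]]
```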
---

4. Define Weights

Why: these are trainable parameters. During training, gradients update these values.

Professor Note: In real training these start random. Learning means repeatedly adjusting these matrices so the model predicts correct next tokens more often.

W_Q

Professor Note: W_Q creates queries: "what this token wants to look for." Without a query space, each token cannot ask targeted context questions.

[[1,0],
 [0,1]]

W_K

Professor Note: W_K creates keys: "what information this token offers." Query-key dot products become attention compatibility scores.

[[1,0],
 [0,1]]

W_V

Professor Note: W_V creates values: the actual content that gets mixed and passed forward. Scores decide "where to look"; values decide "what to take."

[[1,2],
 [3,4]]

W_U (vocab projection / LM head)

Professor Note: W_U maps hidden features to one score per token in the vocabulary. This is where internal representation becomes a concrete next-token guess.

[[1,0,1],
 [0,1,1]]
---

5. Compute Q, K, V

In: X_tilde and weight matrices. Out: Q, K, V (three learned "views" of the same tokens).

Professor Note: This is the core idea of attention: same token state projected into three roles. It is like one sentence being represented as questions (Q), labels (K), and transferable facts (V).

Q = X_tilde × W_Q

Q =
[[1.1, 0],
 [0, 1.1]]

K = X_tilde × W_K

K =
[[1.1, 0],
 [0, 1.1]]

V = X_tilde × W_V

Row1: [1.1,0] × W_V = [1.1, 2.2]
Row2: [0,1.1] × W_V = [3.3, 4.4]

V =
[[1.1, 2.2],
 [3.3, 4.4]]
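The three projections, using the exact matrices defined in Step 4 (NumPy sketch):

```python
import numpy as np

X_tilde = np.array([[1.1, 0.0],
                    [0.0, 1.1]])
W_Q = np.eye(2)                  # identity in this demo
W_K = np.eye(2)
W_V = np.array([[1.0, 2.0],
                [3.0, 4.0]])

Q = X_tilde @ W_Q                # equals X_tilde (identity weights)
K = X_tilde @ W_K
V = X_tilde @ W_V
# V == [[1.1, 2.2], [3.3, 4.4]]
```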
---

6. Attention Scores + Causal Mask

Why: compute "who should attend to whom", then block future tokens with the mask.

Professor Note: Dot product is a fast similarity test. Bigger score means better match between "what I seek" (Q) and "what you contain" (K). Causal mask enforces no future leakage during training, which is essential for autoregressive generation.

K^T

[[1.1, 0],
 [0, 1.1]]

scores = Q × K^T

[1.1,0]·[1.1,0] = 1.21
[1.1,0]·[0,1.1] = 0

[0,1.1]·[1.1,0] = 0
[0,1.1]·[0,1.1] = 1.21

scores =
[[1.21, 0],
 [0, 1.21]]

Apply decoder causal mask

mask =
[[0, -inf],
 [0,    0]]

scores_masked = scores + mask

= [[1.21, -inf],
   [0,    1.21]]
---

7. Scale

Professor Note: Dividing by sqrt(d_k) prevents large dot products from making softmax too peaky too early. This stabilizes gradients, especially in larger dimensions.

sqrt(2) ≈ 1.414

scaled = scores_masked / sqrt(2)

= [[0.856, -inf],
   [0,     0.856]]
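Steps 6 and 7 together in code (exact numbers; adding the mask before or after scaling gives the same result here, because -inf stays -inf under division):

```python
import numpy as np

Q = np.array([[1.1, 0.0], [0.0, 1.1]])
K = Q.copy()                              # W_Q == W_K == identity here

scores = Q @ K.T                          # [[1.21, 0], [0, 1.21]]
mask = np.array([[0.0, -np.inf],
                 [0.0,  0.0]])
scaled = (scores + mask) / np.sqrt(2.0)   # divide by sqrt(d_k)
# scaled ≈ [[0.856, -inf], [0.0, 0.856]]
```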
---

8. Softmax

Why: turn each score row into attention probabilities that sum to 1.

Professor Note: After softmax, each row becomes a distribution over source positions. That makes attention interpretable as weighted averaging instead of raw scores.

Row1

input row = [0.856, -inf]

Step 1 (stability trick): subtract max value (0.856)
[0.856 - 0.856, -inf - 0.856] = [0, -inf]

Step 2: exponentiate each element
exp(0) = 1
exp(-inf) = 0

Step 3: sum of exponentials
Z = 1 + 0 = 1

Step 4: normalize
p1 = 1 / 1 = 1.000
p2 = 0 / 1 = 0.000

= [1.000,0.000]

Row2

input row = [0, 0.856]

Step 1 (stability trick): subtract max value (0.856)
[0 - 0.856, 0.856 - 0.856] = [-0.856, 0]

Step 2: exponentiate
exp(-0.856) = 0.4249
exp(0) = 1

Step 3: sum of exponentials
Z = 0.4249 + 1 = 1.4249

Step 4: normalize
p1 = 0.4249 / 1.4249 = 0.2982
p2 = 1 / 1.4249 = 0.7018

= [0.298,0.702]

Check: 0.298 + 0.702 = 1.000
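The same four steps as a reusable function (NumPy sketch; exp(-inf) evaluates to 0, exactly as in the hand calculation above):

```python
import numpy as np

def softmax_rows(z):
    z = z - z.max(axis=1, keepdims=True)     # Step 1: subtract row max
    e = np.exp(z)                            # Step 2: exponentiate (exp(-inf) = 0)
    return e / e.sum(axis=1, keepdims=True)  # Steps 3-4: sum and normalize

scaled = np.array([[0.856, -np.inf],
                   [0.0,    0.856]])
A = softmax_rows(scaled)
# A ≈ [[1.000, 0.000], [0.298, 0.702]]
```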
---

9. Attention Output

Why: each output token becomes a weighted blend of value vectors from allowed positions.

Professor Note: This is where context is actually injected. Before this step, each token is mostly local; after this step, each token carries selected information from earlier tokens.

Row1:
1.000*[1.1,2.2] + 0.000*[3.3,4.4]
= [1.1, 2.2]

Row2:
0.298*[1.1,2.2] + 0.702*[3.3,4.4]
= [2.645, 3.745]
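Step 9 in code, using the rounded attention weights from Step 8 (NumPy sketch):

```python
import numpy as np

A = np.array([[1.000, 0.000],
              [0.298, 0.702]])
V = np.array([[1.1, 2.2],
              [3.3, 4.4]])
H = A @ V      # each row is a weighted blend of value vectors
# H ≈ [[1.1, 2.2], [2.645, 3.745]]
```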
---

10. Vocab Projection (Logits)

Why: convert hidden state into one score per vocabulary token.

Professor Note: Logits are unnormalized preferences, not probabilities yet. The largest logit indicates the current best guess before normalization.

Take second token (predict next)

x = [2.645, 3.745]

logits = x × W_U

logit0 = 2.645*1 + 3.745*0 = 2.645
logit1 = 2.645*0 + 3.745*1 = 3.745
logit2 = 2.645*1 + 3.745*1 = 6.39

logits = [2.645, 3.745, 6.39]
---

11. Softmax

Why: convert logits into next-token probabilities.

Professor Note: This gives a probabilistic prediction. During generation, you may sample from this distribution or pick argmax; during training, we compare it against truth.

input logits = [2.645, 3.745, 6.39]

Step 1 (stability trick): subtract max logit (6.39)
[2.645-6.39, 3.745-6.39, 6.39-6.39]
= [-3.745, -2.645, 0]

Step 2: exponentiate
exp(-3.745) = 0.0236
exp(-2.645) = 0.0710
exp(0) = 1

Step 3: sum of exponentials
Z = 0.0236 + 0.0710 + 1 = 1.0946

Step 4: normalize
p0 = 0.0236 / 1.0946 = 0.0216
p1 = 0.0710 / 1.0946 = 0.0649
p2 = 1 / 1.0946 = 0.9135

probs = [0.022, 0.065, 0.913]  (rounded)

Check: 0.0216 + 0.0649 + 0.9135 = 1.0000
---

12. Loss

Why: measure how wrong the prediction is against the correct token.

Professor Note: Cross-entropy punishes confident wrong predictions more than mild uncertainty. Minimizing this loss across many examples is the objective of LLM pretraining.

target = "you" = index 2

Predicted probabilities from Step 11:
probs = [0.0216, 0.0649, 0.9135]

Build one-hot target vector (vocab size 3, target index 2):
y_true = [0, 0, 1]

Cross-entropy for one token:
L = -sum_i y_true[i] * log(probs[i])

Substitute values:
L = -(0*log(0.0216) + 0*log(0.0649) + 1*log(0.9135))
L = -log(0.9135)
L = 0.0905

Rounded:
loss = 0.090

Interpretation:
- If model predicts target with probability 1.0, loss = -log(1) = 0 (best).
- If model predicts target with very small probability, loss becomes large.
- This is why training pushes probability mass toward the correct token.

Mini table: probability vs loss

p(target)=0.90  -> loss=-log(0.90)=0.105
p(target)=0.50  -> loss=-log(0.50)=0.693
p(target)=0.10  -> loss=-log(0.10)=2.303
p(target)=0.01  -> loss=-log(0.01)=4.605

Higher correct-token probability => lower loss.
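Steps 10-12 in code, plus the probability-vs-loss table (NumPy sketch with the walkthrough's numbers):

```python
import numpy as np

x = np.array([2.645, 3.745])               # last token state
W_U = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])
logits = x @ W_U                           # [2.645, 3.745, 6.39]

z = logits - logits.max()                  # stability trick
probs = np.exp(z) / np.exp(z).sum()
loss = -np.log(probs[2])                   # target "you" has id 2
# probs ≈ [0.022, 0.065, 0.913], loss ≈ 0.090

for p in [0.90, 0.50, 0.10, 0.01]:         # probability-vs-loss table
    print(f"p(target)={p:.2f} -> loss={-np.log(p):.3f}")
```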
    

What "from scratch training" means here

This loss is from one sample at one step.
Training repeats this over many batches and many steps, updating parameters each time.
Over time, probabilities for correct targets increase and average loss decreases.
    
---

13. Gradient (logits)

Why: tells us direction/magnitude to change parameters to reduce loss.

Professor Note: probs - one_hot has intuitive meaning: increase score for the true class (negative entry) and decrease scores for others (positive entries).

We use the combined derivative of softmax + cross-entropy:

for each class k:
dL/dlogit_k = probs[k] - y_true[k]

From Step 11 and Step 12:
probs  = [0.0216, 0.0649, 0.9135]
y_true = [0,      0,      1]

So:
dL/dlogit0 = 0.0216 - 0 =  0.0216
dL/dlogit1 = 0.0649 - 0 =  0.0649
dL/dlogit2 = 0.9135 - 1 = -0.0865

d_logits = [0.0216, 0.0649, -0.0865]

Interpretation:
- positive gradient => gradient descent will push that logit DOWN
- negative gradient => gradient descent will push that logit UP
So class 2 (correct class) gets pushed up.
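Step 13 in code: the combined softmax + cross-entropy gradient (NumPy sketch):

```python
import numpy as np

probs = np.array([0.0216, 0.0649, 0.9135])
y_true = np.eye(3)[2]            # one-hot for target id 2: [0, 0, 1]
d_logits = probs - y_true
# d_logits == [0.0216, 0.0649, -0.0865]
```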
---

14. Backprop to W_U

In: last hidden state and d_logits. Out: gradient for W_U.

Professor Note: This uses chain rule. If a hidden feature strongly contributed to an error, its associated weights receive larger updates.

Forward relation:
logits = x W_U
where x = [2.645, 3.745]  (shape 1x2), W_U shape (2x3), logits shape (1x3)

By matrix calculus:
dW_U = x^T × d_logits

Compute row by row:
row 0 gradient = x0 * d_logits = 2.645 * [0.0216, 0.0649, -0.0865]
               = [0.0571, 0.1717, -0.2288]

row 1 gradient = x1 * d_logits = 3.745 * [0.0216, 0.0649, -0.0865]
               = [0.0809, 0.2431, -0.3239]

dW_U =
[[0.0571, 0.1717, -0.2288],
 [0.0809, 0.2431, -0.3239]]

Also backprop to hidden state x (needed to continue backprop into attention block):
dx = d_logits W_U^T
   = [0.0216, 0.0649, -0.0865] * [[1,0],[0,1],[1,1]]
   = [-0.0649, -0.0216]

This dx is the signal that flows backward into attention outputs and earlier matrices.
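Step 14 in code (NumPy sketch; the outer product x^T d_logits gives dW_U, and multiplying by W_U^T sends the gradient back to x):

```python
import numpy as np

x = np.array([[2.645, 3.745]])                     # (1, 2) last hidden state
d_logits = np.array([[0.0216, 0.0649, -0.0865]])   # (1, 3)
W_U = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])

dW_U = x.T @ d_logits                              # (2, 3) weight gradient
dx = d_logits @ W_U.T                              # (1, 2) flows into attention
# dW_U ≈ [[0.0571, 0.1717, -0.2288],
#         [0.0809, 0.2431, -0.3239]]
# dx   ≈ [[-0.0649, -0.0216]]
```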

15. Parameter Update (SGD step)

Why: this is the actual learning step; weights change slightly after each batch.

Professor Note: One step is tiny; learning emerges from millions/billions of such small corrections. Optimizers like AdamW improve this basic SGD idea with adaptive scaling.

After gradients are computed, gradient descent updates each parameter:

W <- W - lr * dW

Take lr = 0.1 and the current W_U:
W_U_old =
[[1,0,1],
 [0,1,1]]

Using dW_U from Step 14:
dW_U =
[[0.0571, 0.1717, -0.2288],
 [0.0809, 0.2431, -0.3239]]

Compute lr*dW_U:
0.1*dW_U =
[[0.00571, 0.01717, -0.02288],
 [0.00809, 0.02431, -0.03239]]

Now subtract element-wise:
W_U_new = W_U_old - 0.1*dW_U

W_U_new =
[[1-0.00571, 0-0.01717, 1-(-0.02288)],
 [0-0.00809, 1-0.02431, 1-(-0.03239)]]

W_U_new =
[[ 0.99429, -0.01717, 1.02288],
 [-0.00809,  0.97569, 1.03239]]

Notice what happened:
- Column for correct class (index 2) increased (both rows got larger).
- Competing classes mostly decreased.
- This makes class "you" more likely next time for similar hidden state x.
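Step 15 in code: plain SGD with lr = 0.1 (NumPy sketch):

```python
import numpy as np

W_U = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])
dW_U = np.array([[0.0571, 0.1717, -0.2288],
                 [0.0809, 0.2431, -0.3239]])
lr = 0.1
W_U_new = W_U - lr * dW_U          # W <- W - lr * dW
# W_U_new ≈ [[ 0.99429, -0.01717, 1.02288],
#            [-0.00809,  0.97569, 1.03239]]
```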

In real training, this is done for all parameters
(E, W_Q, W_K, W_V, W_O/W_U, LayerNorm, MLP, etc.).
    

15.1 Backprop continues beyond W_U (no abstraction)

The training graph does NOT stop at W_U.
Using dx from Step 14, gradients keep flowing backward:

dx -> h_last -> H -> A and V
A  -> S_scaled -> S -> Q and K
Q  -> W_Q and X_tilde
K  -> W_K and X_tilde
V  -> W_V and X_tilde
X_tilde -> E (embedding rows used by input tokens)

So one token-level loss updates many matrices, not only the output head.
That is why transformer training is end-to-end.
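Everything above fits in one short script. A NumPy sketch that reproduces the whole walkthrough (forward pass, output-head gradients as in Steps 13-14, and one SGD step on the head); small rounding differences versus the hand-rounded numbers are expected:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# parameters exactly as defined in Steps 2-4
E   = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
P   = np.array([[0.1, 0.0], [0.0, 0.1]])
W_Q = np.eye(2)
W_K = np.eye(2)
W_V = np.array([[1.0, 2.0], [3.0, 4.0]])
W_U = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])

ids, target = [0, 1], 2                    # "i love" -> "you"

# forward pass (Steps 5-12)
X_tilde = E[ids] + P
Q, K, V = X_tilde @ W_Q, X_tilde @ W_K, X_tilde @ W_V
mask = np.array([[0.0, -np.inf], [0.0, 0.0]])
A = softmax((Q @ K.T + mask) / np.sqrt(2.0))
H = A @ V
logits = H[-1] @ W_U
probs = softmax(logits)
loss = -np.log(probs[target])              # ≈ 0.0905

# backward through the output head (Steps 13-14)
d_logits = probs - np.eye(3)[target]
dW_U = np.outer(H[-1], d_logits)
dx = W_U @ d_logits                        # gradient w.r.t. h_last

# one SGD step on the head (Step 15)
W_U = W_U - 0.1 * dW_U
```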
    
---

16. Key Insight

Every step:
- embedding  → meaning
- attention  → context mixing
- projection → prediction
- loss       → error
- gradient   → correction

17. Beginner Notes (What to remember)

1) Q, K, V are just three different "views" of the same input X_tilde.
   - Q asks: "What am I looking for?"
   - K says: "What information do I contain?"
   - V stores: "What content should be passed forward?"

2) Attention is weighted averaging:
   output_for_token_i = sum_j attention_weight(i,j) * V_j

3) Softmax always converts scores to probabilities.
   Bigger score -> bigger probability, but all probabilities still sum to 1.

4) Loss is small when the model gives high probability to the correct next token.

5) Gradient tells each weight how to change to reduce future loss.