In a real decoder-only Transformer (GPT-style), one attention block is (post-norm layout shown for simplicity; GPT-2 and later models apply LayerNorm before each sublayer instead):
1) Q = X W_Q, K = X W_K, V = X W_V
2) A = softmax((Q K^T) / sqrt(d_k) + mask)
3) H = A V
4) O = H W_O
5) Y = LayerNorm(X + O) # residual connection
6) F = MLP(Y) # feed-forward network
7) Y2 = LayerNorm(Y + F) # second residual + norm
8) logits = Y2_t W_U + b # for position t (last token in next-token prediction)
9) probs = softmax(logits)
10) loss = -log(probs[target_index])
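The steps above can be sketched in NumPy. This is a minimal single-head sketch under simplifying assumptions (random tiny weights, a bare LayerNorm with no learned scale/shift, MLP omitted), not real GPT code:

```python
# A minimal NumPy sketch of steps 1-5 above for one attention block.
# Assumptions: single head, tiny random weights, bare LayerNorm, no MLP.
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean / unit variance (no learned scale/shift).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax_rows(s):
    s = s - s.max(axis=-1, keepdims=True)  # stability trick
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def decoder_block(X, W_Q, W_K, W_V, W_O, d_k):
    T = X.shape[0]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                # step 1
    mask = np.triu(np.full((T, T), -np.inf), k=1)      # causal mask
    A = softmax_rows(Q @ K.T / np.sqrt(d_k) + mask)    # step 2
    H = A @ V                                          # step 3
    O = H @ W_O                                        # step 4
    return layer_norm(X + O)                           # step 5 (MLP omitted)

rng = np.random.default_rng(0)
T, d = 2, 4
X = rng.normal(size=(T, d))
W = [rng.normal(size=(d, d)) * 0.1 for _ in range(4)]
Y = decoder_block(X, *W, d_k=d)
print(Y.shape)  # (2, 4)
```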
Important training context:
- This is ONE gradient-descent step from random initialization (training from scratch).
- At step 0, all weights (E, W_Q, W_K, W_V, W_U, etc.) are random.
- The model sees one training sample: input "i love" and target next token "you".
- We do forward pass -> compute loss -> backprop gradients -> update weights.
This file computes the attention path exactly with concrete numbers.
To stay beginner-friendly and small, we simplify by:
- using one tiny attention head
- skipping LayerNorm/MLP numeric expansion
- directly projecting the final token state to logits with a small matrix
So this demo is a realistic training snapshot, but intentionally compact.
token ids: i=0, love=1, you=2
Dimensions used in this tiny example:
T = 2 # sequence length ("i love")
d_model = 2 # model width (tiny to make math visible)
d_k = 2 # key/query width
d_v = 2 # value width
Vocab = 3 # {i, love, you}
Core variables:
E in R^(Vocab x d_model)
Embedding table. Row j is the learned vector for token id j.
X in R^(T x d_model)
Input token vectors after embedding lookup.
Here T=2 rows: first row for "i", second row for "love".
P (or pos) in R^(T x d_model)
Positional encoding so the model knows token order.
X_tilde = X + P
Position-aware token representations.
W_Q in R^(d_model x d_k)
W_K in R^(d_model x d_k)
W_V in R^(d_model x d_v)
Learned linear maps producing queries, keys, values.
Q = X_tilde W_Q in R^(T x d_k)
K = X_tilde W_K in R^(T x d_k)
V = X_tilde W_V in R^(T x d_v)
S = Q K^T in R^(T x T)
Raw attention scores (how much each token attends to each token).
S_scaled = S / sqrt(d_k)
Scaling keeps values numerically stable.
A = softmax(S_scaled, row-wise) in R^(T x T)
Attention weights. Every row sums to 1.
In decoder-only GPT, this is causal-masked so token i cannot attend to future tokens j>i.
H = A V in R^(T x d_v)
Mixed contextual representation (weighted sum of value vectors).
W_U in R^(d_v x Vocab)
Final projection to vocabulary logits in this tiny demo.
(In full GPT, this corresponds to the language-model head.)
logits = h_last W_U in R^(Vocab)
Scores for each next-token candidate.
probs = softmax(logits) in R^(Vocab)
Next-token probabilities.
target_index
Correct token id for supervision.
loss = -log(probs[target_index])
Cross-entropy for one training example.
d_logits = probs - one_hot(target_index)
Gradient at the logits for cross-entropy + softmax.
Goal of this training sample:
Input tokens: "i love"
Correct next: "you"
At every numbered step below, ask 4 questions:
1) What goes IN?
2) What operation happens?
3) What comes OUT?
4) Why do we need this step?
Quick map:
- Steps 1-4: build input vectors and model weights
- Steps 5-9: attention computes context-aware token states
- Steps 10-12: convert last token state to next-token probability + loss
- Steps 13-15: compute gradient and update parameters
If a symbol looks scary, check section "0. Variables and Meaning" first.
X_tilde: (2 x 2)
W_Q, W_K, W_V: (2 x 2)
Q, K, V: (2 x 2)
scores = QK^T: (2 x 2)
A = softmax(scores): (2 x 2)
H = A V: (2 x 2)
h_last: (1 x 2)
W_U: (2 x 3)
logits = h_last W_U: (1 x 3)
Rule: inner dimensions must match for matrix multiply.
Example: (1x2) * (2x3) -> (1x3)
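A quick NumPy check of this rule, using the demo's own last-token numbers:

```python
# (1x2) @ (2x3) -> (1x3): inner dimensions must match.
import numpy as np

h_last = np.array([[2.645, 3.745]])      # (1 x 2)
W_U = np.array([[1., 0., 1.],
                [0., 1., 1.]])           # (2 x 3)
logits = h_last @ W_U                    # inner dims match (2 == 2)
print(logits.shape)  # (1, 3)
```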
Professor Note: We start with a vocabulary because neural networks only understand numbers, not words. Tokenization gives each word (or subword) an ID so math can begin.
i=0, love=1, you=2
Why: convert token IDs into dense numeric vectors the model can learn from.
Professor Note: If we fed raw IDs directly (0, 1, 2), the model would treat them like ordered numeric magnitudes. Embeddings fix this by giving each token a learned feature vector where geometric relationships can represent meaning.
E:
  i    = [1, 0]
  love = [0, 1]
  you  = [1, 1]
Professor Note: This matrix is the model's current "view" of the sentence before context mixing. Each row is one token position.
X = [[1,0], [0,1]]
---
Why: tell the model token order; otherwise "i love" and "love i" look identical.
Professor Note: Attention by itself is permutation-friendly. Position information is injected so the model can distinguish first token vs second token and learn grammar/order.
pos0 = [0.1, 0]
pos1 = [0, 0.1]
X + pos = [[1.1, 0], [0, 1.1]]
---
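The embedding lookup and positional addition above, written out as a small NumPy sketch:

```python
# Embedding lookup for "i love", then add positional encodings.
import numpy as np

E = np.array([[1., 0.],    # i
              [0., 1.],    # love
              [1., 1.]])   # you
token_ids = [0, 1]         # "i love"
X = E[token_ids]           # lookup: one embedding row per input token
P = np.array([[0.1, 0.0],
              [0.0, 0.1]]) # positional encoding
X_tilde = X + P
print(X_tilde)  # [[1.1 0. ]
                #  [0.  1.1]]
```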
Why: these are trainable parameters. During training, gradients update these values.
Professor Note: In real training these start random. Learning means repeatedly adjusting these matrices so the model predicts correct next tokens more often.
Professor Note: W_Q creates queries: "what this token wants to look for." Without a query space, each token cannot ask targeted context questions.
W_Q = [[1,0], [0,1]]
Professor Note: W_K creates keys: "what information this token offers." Query-key dot products become attention compatibility scores.
W_K = [[1,0], [0,1]]
Professor Note: W_V creates values: the actual content that gets mixed and passed forward. Scores decide "where to look"; values decide "what to take."
W_V = [[1,2], [3,4]]
Professor Note: W_U maps hidden features to one score per token in the vocabulary. This is where the internal representation becomes a concrete next-token guess.
W_U = [[1,0,1], [0,1,1]]
---
In: X_tilde and weight matrices. Out: Q, K, V (three learned "views" of the same tokens).
Professor Note: This is the core idea of attention: same token state projected into three roles. It is like one sentence being represented as questions (Q), labels (K), and transferable facts (V).
Q = [[1.1, 0], [0, 1.1]]
K = [[1.1, 0], [0, 1.1]]
Row1: [1.1, 0] × W_V = [1.1, 2.2]
Row2: [0, 1.1] × W_V = [3.3, 4.4]
V = [[1.1, 2.2], [3.3, 4.4]]
---
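The three projections of X_tilde above, reproduced as a NumPy sketch:

```python
# Project X_tilde into its three roles Q, K, V with the demo's weights.
import numpy as np

X_tilde = np.array([[1.1, 0.0], [0.0, 1.1]])
W_Q = np.eye(2)                       # identity here, so Q == X_tilde
W_K = np.eye(2)                       # identity here, so K == X_tilde
W_V = np.array([[1., 2.], [3., 4.]])
Q, K, V = X_tilde @ W_Q, X_tilde @ W_K, X_tilde @ W_V
print(V)  # [[1.1 2.2]
          #  [3.3 4.4]]
```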
Why: compute "who should attend to whom", then block future tokens with the mask.
Professor Note: Dot product is a fast similarity test. Bigger score means better match between "what I seek" (Q) and "what you contain" (K). Causal mask enforces no future leakage during training, which is essential for autoregressive generation.
Q = K = [[1.1, 0], [0, 1.1]]
[1.1,0]·[1.1,0] = 1.21
[1.1,0]·[0,1.1] = 0
[0,1.1]·[1.1,0] = 0
[0,1.1]·[0,1.1] = 1.21
scores = [[1.21, 0], [0, 1.21]]
mask = [[0, -inf], [0, 0]]
scores_masked = scores + mask = [[1.21, -inf], [0, 1.21]]
---
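The raw scores and the causal mask above, as a NumPy sketch:

```python
# Raw scores Q K^T plus the causal mask (-inf blocks the future token).
import numpy as np

Q = np.array([[1.1, 0.0], [0.0, 1.1]])
K = np.array([[1.1, 0.0], [0.0, 1.1]])
scores = Q @ K.T                      # [[1.21, 0], [0, 1.21]]
mask = np.array([[0.0, -np.inf],      # token 1 cannot attend to token 2
                 [0.0, 0.0]])
scores_masked = scores + mask
print(scores_masked)
```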
Professor Note: Dividing by sqrt(d_k) prevents large dot products from making softmax too peaky too early. This stabilizes gradients, especially at larger dimensions.
sqrt(2) = 1.414
scaled = scores_masked / sqrt(2) = [[0.856, -inf], [0, 0.856]]
---
Why: turn each score row into attention probabilities that sum to 1.
Professor Note: After softmax, each row becomes a distribution over source positions. That makes attention interpretable as weighted averaging instead of raw scores.
input row = [0.856, -inf]
Step 1 (stability trick): subtract max value (0.856)
  [0.856 - 0.856, -inf - 0.856] = [0, -inf]
Step 2: exponentiate each element
  exp(0) = 1
  exp(-inf) = 0
Step 3: sum of exponentials
  Z = 1 + 0 = 1
Step 4: normalize
  p1 = 1 / 1 = 1.000
  p2 = 0 / 1 = 0.000
Result: [1.000, 0.000]

input row = [0, 0.856]
Step 1 (stability trick): subtract max value (0.856)
  [0 - 0.856, 0.856 - 0.856] = [-0.856, 0]
Step 2: exponentiate
  exp(-0.856) = 0.4247
  exp(0) = 1
Step 3: sum of exponentials
  Z = 0.4247 + 1 = 1.4247
Step 4: normalize
  p1 = 0.4247 / 1.4247 = 0.2981
  p2 = 1 / 1.4247 = 0.7019
Result: [0.298, 0.702]
Check: 0.298 + 0.702 = 1.000
---
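The numerically stable row-wise softmax above, as a NumPy sketch applied to both rows at once:

```python
# Row-wise numerically stable softmax over the scaled, masked scores.
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)     # Step 1: subtract row max
    e = np.exp(S)                             # Step 2: exp(-inf) becomes 0
    return e / e.sum(axis=-1, keepdims=True)  # Steps 3-4: normalize each row

scaled = np.array([[0.856, -np.inf],
                   [0.0, 0.856]])
A = softmax_rows(scaled)
print(np.round(A, 3))  # rows sum to 1; second row ≈ [0.298, 0.702]
```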
Why: each output token becomes a weighted blend of value vectors from allowed positions.
Professor Note: This is where context is actually injected. Before this step, each token is mostly local; after this step, each token carries selected information from earlier tokens.
Row1: 1.000*[1.1, 2.2] + 0.000*[3.3, 4.4] = [1.1, 2.2]
Row2: 0.298*[1.1, 2.2] + 0.702*[3.3, 4.4] ≈ [2.645, 3.745]
H = [[1.1, 2.2], [2.645, 3.745]]
---
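The context mix H = A V with the concrete attention weights above, as a NumPy sketch (small rounding differences come from the 3-decimal weights):

```python
# Each output row is a weighted blend of value rows, weighted by A.
import numpy as np

A = np.array([[1.000, 0.000],
              [0.298, 0.702]])
V = np.array([[1.1, 2.2],
              [3.3, 4.4]])
H = A @ V
print(H)  # second row ≈ [2.645, 3.745] (up to rounding of A)
```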
Why: convert hidden state into one score per vocabulary token.
Professor Note: Logits are unnormalized preferences, not probabilities yet. The largest logit indicates the current best guess before normalization.
x = [2.645, 3.745]
logits = x × W_U
logit0 = 2.645*1 + 3.745*0 = 2.645
logit1 = 2.645*0 + 3.745*1 = 3.745
logit2 = 2.645*1 + 3.745*1 = 6.39
logits = [2.645, 3.745, 6.39]
---
Why: convert logits into next-token probabilities.
Professor Note: This gives a probabilistic prediction. During generation, you may sample from this distribution or pick argmax; during training, we compare it against truth.
input logits = [2.645, 3.745, 6.39]
Step 1 (stability trick): subtract max logit (6.39)
  [2.645 - 6.39, 3.745 - 6.39, 6.39 - 6.39] = [-3.745, -2.645, 0]
Step 2: exponentiate
  exp(-3.745) = 0.0236
  exp(-2.645) = 0.0710
  exp(0) = 1
Step 3: sum of exponentials
  Z = 0.0236 + 0.0710 + 1 = 1.0946
Step 4: normalize
  p0 = 0.0236 / 1.0946 = 0.0216
  p1 = 0.0710 / 1.0946 = 0.0649
  p2 = 1 / 1.0946 = 0.9135
probs = [0.022, 0.065, 0.914] (rounded)
Check: 0.0216 + 0.0649 + 0.9135 = 1.0000
---
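Steps 10-11 above (last-token logits, then a stable softmax) as a NumPy sketch:

```python
# Project the last hidden state to logits, then softmax to probabilities.
import numpy as np

h_last = np.array([2.645, 3.745])
W_U = np.array([[1., 0., 1.],
                [0., 1., 1.]])
logits = h_last @ W_U              # [2.645, 3.745, 6.39]
z = logits - logits.max()          # stability trick: subtract max logit
probs = np.exp(z) / np.exp(z).sum()
print(np.round(probs, 4))  # [0.0216 0.0649 0.9135]
```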
Why: measure how wrong the prediction is against the correct token.
Professor Note: Cross-entropy punishes confident wrong predictions more than mild uncertainty. Minimizing this loss across many examples is the objective of LLM pretraining.
target = "you" = index 2
Predicted probabilities from Step 11:
  probs = [0.0216, 0.0649, 0.9135]
Build one-hot target vector (vocab size 3, target index 2):
  y_true = [0, 0, 1]
Cross-entropy for one token:
  L = -sum_i y_true[i] * log(probs[i])
Substitute values:
  L = -(0*log(0.0216) + 0*log(0.0649) + 1*log(0.9135))
  L = -log(0.9135)
  L = 0.0905
Rounded: loss = 0.091
Interpretation:
- If the model predicts the target with probability 1.0, loss = -log(1) = 0 (best).
- If the model predicts the target with very small probability, the loss becomes large.
- This is why training pushes probability mass toward the correct token.
p(target)=0.90 -> loss=-log(0.90)=0.105
p(target)=0.50 -> loss=-log(0.50)=0.693
p(target)=0.10 -> loss=-log(0.10)=2.303
p(target)=0.01 -> loss=-log(0.01)=4.605
Higher correct-token probability => lower loss.
This loss is from one sample at one step.
Training repeats this over many batches and many steps, updating parameters each time.
Over time, probabilities for correct targets increase and average loss decreases.
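The loss computation and the reference points above, as a NumPy sketch:

```python
# Cross-entropy for this example, plus the confidence-vs-loss reference table.
import numpy as np

probs = np.array([0.0216, 0.0649, 0.9135])
target_index = 2                       # token "you"
loss = -np.log(probs[target_index])
print(round(float(loss), 4))  # 0.0905

# Reference points: higher correct-token probability => lower loss.
for p in [0.90, 0.50, 0.10, 0.01]:
    print(p, round(float(-np.log(p)), 3))  # 0.105, 0.693, 2.303, 4.605
```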
---
Why: tells us direction/magnitude to change parameters to reduce loss.
Professor Note: probs - one_hot has an intuitive meaning: increase the score for the true class (negative entry) and decrease the scores for the others (positive entries).
We use the combined derivative of softmax + cross-entropy. For each class k:
  dL/dlogit_k = probs[k] - y_true[k]
From Step 11 and Step 12:
  probs = [0.0216, 0.0649, 0.9135]
  y_true = [0, 0, 1]
So:
  dL/dlogit0 = 0.0216 - 0 = 0.0216
  dL/dlogit1 = 0.0649 - 0 = 0.0649
  dL/dlogit2 = 0.9135 - 1 = -0.0865
d_logits = [0.0216, 0.0649, -0.0865]
Interpretation:
- positive gradient => gradient descent will push that logit DOWN
- negative gradient => gradient descent will push that logit UP
So class 2 (the correct class) gets pushed up.
---
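The fused softmax + cross-entropy gradient from the step above, as a NumPy sketch:

```python
# d_logits = probs - one_hot(target): the gradient at the logits.
import numpy as np

probs = np.array([0.0216, 0.0649, 0.9135])
y_true = np.array([0.0, 0.0, 1.0])   # one-hot for target "you" (index 2)
d_logits = probs - y_true
print(np.round(d_logits, 4))  # [ 0.0216  0.0649 -0.0865]
```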
In: last hidden state and d_logits. Out: gradient for W_U.
Professor Note: This uses chain rule. If a hidden feature strongly contributed to an error, its associated weights receive larger updates.
Forward relation:
logits = x W_U
where x = [2.645, 3.745] (shape 1x2), W_U shape (2x3), logits shape (1x3)
By matrix calculus:
dW_U = x^T × d_logits
Compute row by row:
row 0 gradient = x0 * d_logits = 2.645 * [0.0216, 0.0649, -0.0865]
= [0.0571, 0.1717, -0.2288]
row 1 gradient = x1 * d_logits = 3.745 * [0.0216, 0.0649, -0.0865]
= [0.0809, 0.2431, -0.3239]
dW_U =
[[0.0571, 0.1717, -0.2288],
[0.0809, 0.2431, -0.3239]]
Also backprop to hidden state x (needed to continue backprop into attention block):
dx = d_logits W_U^T
= [0.0216, 0.0649, -0.0865] * [[1,0],[0,1],[1,1]]
= [-0.0649, -0.0216]
This dx is the signal that flows backward into attention outputs and earlier matrices.
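The Step 14 backprop through logits = x W_U, as a NumPy sketch reproducing both dW_U and dx:

```python
# Backprop through logits = x W_U: gradient for W_U and for the hidden state.
import numpy as np

x = np.array([[2.645, 3.745]])                 # last hidden state, (1 x 2)
W_U = np.array([[1., 0., 1.],
                [0., 1., 1.]])                 # (2 x 3)
d_logits = np.array([[0.0216, 0.0649, -0.0865]])
dW_U = x.T @ d_logits                          # (2 x 3): outer product
dx = d_logits @ W_U.T                          # (1 x 2): flows back into attention
print(np.round(dx, 4))  # [[-0.0649 -0.0216]]
```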
Why: this is the actual learning step; weights change slightly after each batch.
Professor Note: One step is tiny; learning emerges from millions/billions of such small corrections. Optimizers like AdamW improve this basic SGD idea with adaptive scaling.
After gradients are computed, gradient descent updates each parameter:
W <- W - lr * dW
Take lr = 0.1 and the current W_U:
W_U_old =
[[1,0,1],
[0,1,1]]
Using dW_U from Step 14:
dW_U =
[[0.0571, 0.1717, -0.2288],
[0.0809, 0.2431, -0.3239]]
Compute lr*dW_U:
0.1*dW_U =
[[0.00571, 0.01717, -0.02288],
[0.00809, 0.02431, -0.03239]]
Now subtract element-wise:
W_U_new = W_U_old - 0.1*dW_U
W_U_new =
[[1-0.00571, 0-0.01717, 1-(-0.02288)],
[0-0.00809, 1-0.02431, 1-(-0.03239)]]
W_U_new =
[[ 0.99429, -0.01717, 1.02288],
[-0.00809, 0.97569, 1.03239]]
Notice what happened:
- Column for correct class (index 2) increased (both rows got larger).
- Competing classes mostly decreased.
- This makes class "you" more likely next time for similar hidden state x.
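The Step 15 update rule W <- W - lr * dW, applied to W_U with the same lr = 0.1 and gradients, as a NumPy sketch:

```python
# One SGD update on W_U with learning rate 0.1.
import numpy as np

W_U_old = np.array([[1., 0., 1.],
                    [0., 1., 1.]])
dW_U = np.array([[0.0571, 0.1717, -0.2288],
                 [0.0809, 0.2431, -0.3239]])
lr = 0.1
W_U_new = W_U_old - lr * dW_U
print(np.round(W_U_new, 5))
# [[ 0.99429 -0.01717  1.02288]
#  [-0.00809  0.97569  1.03239]]
```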
In real training, this is done for all parameters
(E, W_Q, W_K, W_V, W_O/W_U, LayerNorm, MLP, etc.).
The training graph does NOT stop at W_U.
Using dx from Step 14, gradients keep flowing backward:
dx -> h_last -> H -> A and V
A -> S_scaled -> S -> Q and K
Q -> W_Q and X_tilde
K -> W_K and X_tilde
V -> W_V and X_tilde
X_tilde -> E (embedding rows used by input tokens)
So one token-level loss updates many matrices, not only the output head.
That is why transformer training is end-to-end.
---
Every step:
- embedding → meaning
- attention → context mixing
- projection → prediction
- loss → error
- gradient → correction
1) Q, K, V are just three different "views" of the same input X_tilde.
- Q asks: "What am I looking for?"
- K says: "What information do I contain?"
- V stores: "What content should be passed forward?"
2) Attention is weighted averaging:
output_for_token_i = sum_j attention_weight(i,j) * V_j
3) Softmax always converts scores to probabilities.
Bigger score -> bigger probability, but all probabilities still sum to 1.
4) Loss is small when the model gives high probability to the correct next token.
5) Gradient tells each weight how to change to reduce future loss.
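Finally, the whole worked example (steps 1-12) fits in one short script. This is a NumPy sketch under the demo's own simplifications (no LayerNorm/MLP), checking that one forward pass reproduces the loss computed above:

```python
# End-to-end forward pass for "i love" -> "you", reproducing the demo's loss.
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

E = np.array([[1., 0.], [0., 1.], [1., 1.]])   # embeddings for i, love, you
P = np.array([[0.1, 0.0], [0.0, 0.1]])         # positional encodings
W_Q = np.eye(2)
W_K = np.eye(2)
W_V = np.array([[1., 2.], [3., 4.]])
W_U = np.array([[1., 0., 1.], [0., 1., 1.]])

X_tilde = E[[0, 1]] + P                        # "i love", position-aware
Q, K, V = X_tilde @ W_Q, X_tilde @ W_K, X_tilde @ W_V
mask = np.array([[0.0, -np.inf], [0.0, 0.0]])  # causal mask
A = softmax(Q @ K.T / np.sqrt(2) + mask)       # scaled, masked attention
H = A @ V                                      # context mixing
probs = softmax(H[-1] @ W_U)                   # last-token prediction
loss = -np.log(probs[2])                       # target "you"
print(round(float(loss), 4))  # ≈ 0.0905
```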