2025-06-25
Intro to ML, intro to LLM
2-hour talk, for Ooty retreat 2025-06-26, Samuel Shadrach
Pre-requisites
Matrix multiplication, differential calculus
Find C = A @ B.T
@ means matrix multiplication, .T means transpose
A = [ 1 -1  3 ]
    [ 2  3  3 ]
    [-2  0  2 ]
B = [ 7 -1  4 ]
    [-2  0  0 ]
    [ 0 -3 -1 ]
z = y^2 - (sin(theta))^2 + 2 y cos(theta) + 1
Find partial derivatives dz/dy and dz/d(theta)
If you can solve all the above questions, you have covered the pre-reqs.
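To check the matrix answer numerically, here is a quick NumPy sketch (same @ and .T notation as above):
import numpy as np

A = np.array([[ 1, -1,  3],
              [ 2,  3,  3],
              [-2,  0,  2]])
B = np.array([[ 7, -1,  4],
              [-2,  0,  0],
              [ 0, -3, -1]])
C = A @ B.T   # @ is matrix multiplication, .T is transpose
print(C)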
Two-layer fully-connected network, trained on MNIST
Resources
problem statement
- let's say we have 60000 photographs, each 28x28 pixels, grayscale (each pixel value between 0 and 1)
- each of these is already classified into one of ten digits: 0, 1, 2, ..., 9
- we want a program that can quickly classify as many of these as possible
- we cannot hard-code the final answers into the final program, because we want this program to also work well on new images we have never seen.
- there are 10000 photographs we have not seen; our final score will be measured on these.
solution: define the following function using constant weight matrices W1 and W2
forward pass
Y = ReLU( ReLU(X @ W_1) @ W_2 )
loss = - sum (Y dot Y')
definition of ReLU
ReLU(M)_ij = if M_ij > 0, then M_ij, else 0
Y contains prediction (stored as log probabilities), Y' contains actual answer
example (assume N=1 image for now)
Y = [ln(0.10) ln(0.75) ln(0.10) ln(0.05) ln(0) ln(0) ln(0) ln(0) ln(0) ln(0)] = [-2.30 -0.29 -2.30 -3.00 -inf -inf -inf -inf -inf -inf]
Y' = [0 1 0 0 0 0 0 0 0 0]
loss = -ln(0.75) = 0.29
dimensions
X: (N,D)
W_1: (D,E)
W_2: (E,C)
=> Y: (N,C)
example dimensions
X: (60000, 28*28) = (60000, 784)
W_1: (784, 800)
W_2: (800, 10)
=> Y: (60000, 10)
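A quick shape check with random data in NumPy (just to see that the dimensions line up; weights here are arbitrary, not trained):
import numpy as np

N, D, E, C = 60000, 784, 800, 10
X  = np.random.rand(N, D)            # N images, each flattened to 784 pixel values in [0, 1]
W1 = np.random.randn(D, E) * 0.01    # arbitrary small weights
W2 = np.random.randn(E, C) * 0.01

Y = np.maximum(0, np.maximum(0, X @ W1) @ W2)   # ReLU(ReLU(X @ W1) @ W2)
print(Y.shape)                                  # (60000, 10)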
Objective: find W1 and W2, so that as many images get classified into correct classes as possible
training loop
How do we find W1 and W2 fast?
- W1 has 627k values, W2 has 8k values
- even if each cell could only be 0 or 1, that's 2^(627k + 8k) = 2^635k ≈ 2^(6.4 * 10^5) possibilities
- any sort of brute force or iteration is too slow
gradient descent
- given any weight matrices W1 and W2, we will find new W1 and W2 that are slightly better
find gradients of loss with respect to weight matrices
dL / dW2 = ???
dL / dW2_ij = ???
visualise it
L = - ( ReLU( ReLU( X @ W1 ) @ W2 ) dot Y' ), written out entry by entry:
X = [X_00 ... X_ND], W1 = [W1_00 ... W1_DE], W2 = [W2_00 ... W2_EC]
how do we find dL/dW1_00? and how do we repeat this for every value in W1?
rewrite it
H = f1(X,W1)
Y = f2(H,W2)
L = f0(Y,Y')
remember we are finding partial derivatives: all values are held constant, except the one with respect to which we are differentiating
dL / dW2_ij
= df0/dY * df2/dW2_ij
= ...
dL / dW1_ij
= df0/dY * df2/dH * dH/dW1_ij
= ...
final answer, copy-pasted from o3, might contain hallucinations
import numpy as np

# forward
A1 = X @ W1                           # (N, E)
H  = np.maximum(0, A1)                # ReLU
A2 = H @ W2                           # (N, C)
Y  = np.maximum(0, A2)                # ReLU
loss = -(Y * Y_true).sum()            # Y_true is the one-hot matrix Y'

# backward
grad_Y  = -Y_true                     # dL/dY
mask2   = (A2 > 0).astype(float)      # derivative of ReLU at A2
delta2  = grad_Y * mask2              # dL/dA2
grad_W2 = H.T @ delta2                # dL/dW2
delta1  = (delta2 @ W2.T) * (A1 > 0)  # dL/dA1
grad_W1 = X.T @ delta1                # dL/dW1
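Since the formulas above were copy-pasted from o3, one way to sanity-check them is a finite-difference gradient check on tiny random matrices, sketched here:
import numpy as np

def loss_fn(X, W1, W2, Y_true):
    Y = np.maximum(0, np.maximum(0, X @ W1) @ W2)
    return -(Y * Y_true).sum()

rng = np.random.default_rng(0)
N, D, E, C = 4, 6, 5, 3                          # tiny sizes so the check is fast
X  = rng.normal(size=(N, D))
W1 = rng.normal(size=(D, E))
W2 = rng.normal(size=(E, C))
Y_true = np.eye(C)[rng.integers(0, C, size=N)]   # random one-hot labels

# analytical gradient of one entry of W1, using the formulas above
A1 = X @ W1
H  = np.maximum(0, A1)
A2 = H @ W2
delta2  = -Y_true * (A2 > 0)
grad_W1 = X.T @ ((delta2 @ W2.T) * (A1 > 0))

# numerical gradient of the same entry, via finite differences
eps = 1e-6
W1_plus = W1.copy(); W1_plus[0, 0] += eps
numerical = (loss_fn(X, W1_plus, W2, Y_true) - loss_fn(X, W1, W2, Y_true)) / eps
print(grad_W1[0, 0], numerical)                  # the two numbers should agree closely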
conclusion
- we have found the matrices dL/dW1 and dL/dW2, for given constant values of X, Y', W1, W2
- (W1 - e * dL/dW1, W2 - e * dL/dW2) will have slightly lower loss than (W1, W2), for a small enough step size e
- repeat this process millions of times to get a very good (W1, W2) (see the sketch at the end of this section)
- this will run fast on a GPU (that's why we defined the network this way)
- only intensive operation is matrix multiplication
- no copy-paste operation
- discard intermediate values on each iteration, and update W1 and W2
- p.s. in practice we will split our dataset into batches to do batch SGD, use an optimiser such as Adam and use cross-entropy loss. all this is not important for now.
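Putting it all together, a minimal training-loop sketch: tiny random data stands in for MNIST, plain gradient descent, and an arbitrary learning rate (illustrative only; the real homework version loads the actual dataset):
import numpy as np

rng = np.random.default_rng(0)
N, D, E, C = 1000, 784, 800, 10                 # small N so the sketch runs quickly
X = rng.random((N, D))                          # stand-in "images"
Y_true = np.eye(C)[rng.integers(0, C, size=N)]  # stand-in one-hot labels Y'

W1 = rng.normal(0, 0.01, size=(D, E))
W2 = rng.normal(0, 0.01, size=(E, C))
lr = 1e-4                                       # arbitrary small step size e

for step in range(100):
    # forward
    A1 = X @ W1
    H  = np.maximum(0, A1)
    A2 = H @ W2
    Y  = np.maximum(0, A2)
    loss = -(Y * Y_true).sum()

    # backward (same formulas as above)
    delta2  = -Y_true * (A2 > 0)
    grad_W2 = H.T @ delta2
    delta1  = (delta2 @ W2.T) * (A1 > 0)
    grad_W1 = X.T @ delta1

    # gradient descent update: (W1, W2) -> (W1 - e*dL/dW1, W2 - e*dL/dW2)
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2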
optional homework
- derive the backprop formula above
- figure out notation so that this calculation becomes easy to do
- I have heard that einstein notation makes this result easier to derive, I have not practised it myself though
- using pytorch, actually train a two-layer fully connected network on MNIST 60k examples training dataset, obtain 1.6% error on test dataset
- in modern ML, engineers usually rely on pytorch autograd to compute gradients, so they don't have to derive the formulas by hand.
- for this homework assignment, don't use pytorch autograd. hard-code the gradients specified above.
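For the homework, one way to load MNIST into the matrix shapes used above (a sketch assuming torchvision is installed; the training loop with hard-coded gradients is the exercise):
import torch
from torchvision import datasets

train = datasets.MNIST(root="data", train=True,  download=True)
test  = datasets.MNIST(root="data", train=False, download=True)

# flatten each 28x28 uint8 image into a row of 784 floats in [0, 1]
X_train = train.data.reshape(-1, 784).float() / 255.0             # (60000, 784)
X_test  = test.data.reshape(-1, 784).float() / 255.0              # (10000, 784)
Y_train = torch.nn.functional.one_hot(train.targets, 10).float()  # (60000, 10)
Y_test  = torch.nn.functional.one_hot(test.targets, 10).float()   # (10000, 10)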
deep learning is 15 years of accumulated blackbox tricks
- question: ok, but why did we use a two-layer fully connected network in the first place?
- we almost never have mathematical proof for why doing anything is a good idea
  - why is RMS norm a good idea? why is softmax a good idea? why is ReLU a good idea?
  - why did we define Q, K, V this way in attention blocks? why did we use 48 layers not 24?
  - why did we use residual layers? why did we use cosine for positional embeddings?
- we do things if we have empirical evidence it worked before, and outperformed similar ideas
  - but: most ideas have not been tried yet. so we don't know if it actually outperforms similar ideas, just the ones we have tried so far.
  - there is a graveyard of old approaches
- Older architectures: RNN, CNN, LSTM
  - Today: Attention dominates everything
- Older activation functions: Sigmoid, tanh, GELU
  - Today: ReLU dominates everything
- Older loss functions: Mean square loss, hinge loss, logistic loss
  - Today: Cross-entropy loss dominates everything
- Older optimisers: vanilla SGD, RMSProp
  - Today: AdamW (Adam with weight decay) dominates everything
- Older ideas that nobody remembers: dropout, L1 regularisation, etc
- New ideas that might or might not become old one day: Mixture of experts, temperature tuning for long context, etc
- we have intuitions for why we are doing what we are doing
  - but: intuitions are often retroactively justified, after we have empirical evidence it worked. no one publishes intuitions for failed ideas.
  - but: intuitions often come just from looking at hundreds of training runs with slightly different networks. today only big labs can afford this many runs at large scale, so only their researchers can build these intuitions.
Transformer
Resources
Ask o3 this question along with the code: make a list of all the steps in the following forward pass in plain english
- Input: Sequence of tokens
- Output: Logits (log probabilities) for next token
Forward pass
- Tokenise
- Add positional embeddings
- Fuse image embeddings (optional)
- N transformer blocks (let's say N=80)
- RMS Norm
- Multi-headed Attention
- Projection:
X -> Q_1, K_1, V_1
- Reshape:
{Q,K,V}_1 -> {Q,K,V}_2
- Rotary embedding (optional):
{Q,K}_2 -> {Q,K}_3
- Head-wise RMS Norm (optional):
{Q,K}_3 -> {Q,K}_4
- Temperature tuning (optional, for long context):
Q_4 -> Q_TEMP_TUNED
- Duplicate KV cache (for faster computation):
K_4, V_1 -> K_DUPLICATED, V_DUPLICATED
- Scaled dot product attention (torch.nn.functional.scaled_dot_product_attention):
Q_TEMP_TUNED, K_DUPLICATED, V_DUPLICATED -> WO
- Projection:
WO -> OUT
- Add residual
- RMS Norm
- Feed forward / Mixture of experts (ffn.py)
- Either expert gating network
- Or 2-layer fully connected network with SiLU activation function
- Add residual
- Projection
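A compressed sketch of one transformer block from the list above, in PyTorch-style code (heavily simplified: a single attention head, no KV cache, optional steps omitted; names and shapes are illustrative, not the actual repo's code):
import torch
import torch.nn.functional as F

def rms_norm(x, eps=1e-6):
    # divide each row by its root mean square (learned gain omitted)
    return x / torch.sqrt((x * x).mean(dim=-1, keepdim=True) + eps)

def transformer_block(x, W_q, W_k, W_v, W_o, W_up, W_down):
    # x: (T, D) = T token embeddings of width D
    T, _ = x.shape
    d_k = W_q.shape[1]

    # attention sub-block
    h = rms_norm(x)
    Q, K, V = h @ W_q, h @ W_k, h @ W_v
    mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)   # hide future tokens
    attn = torch.softmax((Q @ K.T + mask) / d_k**0.5, dim=-1) @ V
    x = x + attn @ W_o                                                 # projection + residual

    # feed-forward sub-block (2-layer with SiLU)
    h = rms_norm(x)
    x = x + F.silu(h @ W_up) @ W_down                                  # projection + residual
    return x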
Training loop
- Given log probabilities of next token and the actual next token, compute cross-entropy loss
- Compute gradients of the cross-entropy loss with respect to all the weight matrices
- Use AdamW optimiser to do gradient descent
- Spend $100 billion per year mostly on one single training run (yes, really)
- Note: To be technically accurate, this repo is for Llama 4 Scout which only cost ~$10M capex to train (5M hours on H100). State-of-the-art models like Llama 4 Behemoth, GPT4.5, grok-3 likely cost $1-10B capex and used similar architecture but not this exact repo.
important steps in attention block
Projection
Q = X @ W_Q
K = X @ W_K
V = X @ W_V
Scaled dot product attention
Y = softmax[ (Q @ K.T + mask) / sqrt(d_k) ] @ V, where d_k is the dimension of each query/key vector
Softmax normalisation
- replace each cell with e raised to the power of that cell
- divide each cell by the sum of its row
softmax(M)_ij = e^(M_ij) / sum_k e^(M_ik)
example
softmax([[1, 2], [3, -1]]):
row 1: [e^1/(e^1+e^2), e^2/(e^1+e^2)] = [0.269, 0.731]
row 2: [e^3/(e^3+e^-1), e^-1/(e^3+e^-1)] = [0.982, 0.018]
Mask
- this is the data we are going to pretend we don't have access to, and have the model try to predict it
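Putting the projection, softmax and mask together, a minimal single-head NumPy sketch (shapes and the random data are illustrative):
import numpy as np

def softmax(M):
    e = np.exp(M - M.max(axis=-1, keepdims=True))   # subtract row max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

T, D, d_k = 5, 16, 8                      # sequence length, model width, query/key width
rng = np.random.default_rng(0)
X = rng.normal(size=(T, D))
W_Q, W_K, W_V = (rng.normal(size=(D, d_k)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
mask = np.triu(np.full((T, T), -np.inf), k=1)       # hide future tokens from each position
Y = softmax((Q @ K.T + mask) / np.sqrt(d_k)) @ V    # (T, d_k)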
Intuition: How does this work?
- softmax: largest values in a row output approx 1, smallest values in a row output approx 0
- mask: some values get hard-coded to 0
- in softmax(something) @ V, the matrix softmax(something) has most values close to 0 and 1; it tells us which values in V to ignore versus not ignore
- We are hiding some parts of X from ourselves (using mask), paying more attention to other parts of X (using softmax Q K_T), and then finding optimal weight matrices to predict the parts of X that we hid
Intuition: Why does this outperform other known techniques?
- Sequential versus parallel
- Naive way of formulating next-token prediction is as a sequential problem. Attention masks parallelise this.
- RNNs do next-token prediction but their forward pass predicts tokens sequentially.
- This means gradient descent needs to happen across many serial steps. GPUs are good for training parallel stuff not sequential stuff.
- Vanishing gradients problem when doing gradient descent across many sequential steps.
- Pay attention to what?
- CNNs maintain hard-coded sliding windows of which tokens to pay attention to.
- LSTMs and RNNs maintain a shared context that is reused for many tokens.
- Attention layer can look at the tokens and use the tokens themselves to compute which tokens to pay attention to. (Remember Q, K, V are all functions of X.)
misc stuff in transformers
tokenisation
- create a vocabulary of ~100k most commonly used words and phrases
- convert input into 1-hot encoding using this
- basically "hello" will become [0 0 0 0 ....maybe 25k entries .... 0 0 1 0 0 ... maybe 75k entries ... 0]
- why?
- deep learning works on math, not words
- English has fewer than 100k commonly used words, so we can hard-code this into the model instead of training the model to figure it out on its own.
- why not train it? don't know for sure
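A toy sketch of the idea with a hypothetical 5-entry vocabulary (real tokenisers such as BPE split text into subword pieces, which is more involved):
import numpy as np

vocab = {"hello": 0, "world": 1, "my": 2, "name": 3, "is": 4}   # toy vocabulary

def one_hot(token):
    v = np.zeros(len(vocab))
    v[vocab[token]] = 1.0
    return v

print(one_hot("hello"))   # [1. 0. 0. 0. 0.]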
positional embeddings
- calculate some cosine thing of the token, and attach it to the token
- ensures each token also now stores data of which position it is. imagine new input is:
my one name two is three alice four
- why?
- don't know for sure
- intuition: maybe humans speak differently at the start of a paragraph versus in between. so it's helpful to always remember where you are.
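One concrete version of the "cosine thing" is the sinusoidal positional embedding from the original Transformer paper, sketched here (many recent models use rotary embeddings instead, as in the forward pass above):
import numpy as np

def sinusoidal_positional_embedding(num_positions, dim):
    # dim assumed even; each position gets a unique pattern of sines and cosines
    pos = np.arange(num_positions)[:, None]          # (num_positions, 1)
    i = np.arange(dim // 2)[None, :]                 # (1, dim/2)
    angle = pos / (10000 ** (2 * i / dim))
    emb = np.zeros((num_positions, dim))
    emb[:, 0::2] = np.sin(angle)
    emb[:, 1::2] = np.cos(angle)
    return emb                                       # added to the token embeddings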
RMS norm
- divide each cell by root mean square of all cells in that row
- keeps each row at a consistent scale (the row's root mean square becomes 1)
- why?
- don't know for sure
- intuition: gradient descent is better behaved when values stay at a roughly unit scale
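A minimal sketch (real RMSNorm layers also multiply by a learned per-dimension gain, omitted here):
import numpy as np

def rms_norm(x, eps=1e-6):
    # divide each cell by the root mean square of its row
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x / rms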
Residual layer
- let's say we did some stuff: Y = f(X)
- adding residual just means adding the input back in: Y_with_residual = Y + X = f(X) + X
- why?
- don't know for sure
- intuition: "exploding and vanishing gradients" - sometimes if you do gradient descent on weight matrices across this many layers, you get gradients that approach zero or infinity
Mixture of experts
- Not covering in this lecture
N layers put together
- In total there are N sequential layers of transformer blocks. So we are doing gradient descent across N layers to find optimal weight matrices in each layer.
- Intuition for N layers: pay attention to nearby tokens, then not so nearby, then not so nearby
typical hyperparams
- number of layers
  - depends on model size
  - typically 32 to 128 layers
- number of params
  - depends on model size
  - GPT2 XL (2019): 1.5B params
  - GPT3 (2020): 175B params
  - GPT4 (2023): rumoured 1.8T params
  - GPT4.5 (2025): rumoured 12T params, of which 1T active params (mixture of experts)
- bytes per param
  - training
    - typically float32 (4 bytes per weight)
    - mixed precision training is recent, for example deepseek
  - inference
    - quantisation works well: fp16, int8, int4, 1.58-bit
- model size
  - model size in bytes = number of params * bytes per param
How to pick number of params when training a new model
- depends on data and compute available, versus data and compute required
- data and compute required is calculated using the chinchilla scaling law
- typically compute has always been the bottleneck, not data
- epoch.ai forecasts running out of internet data in 2028
- data required
  - (not very good) rule of thumb: 20 tokens/param * number of params
  - typically trained for one epoch - gradient descent is done on every token exactly once
- compute required (a worked example follows this list)
  - (not very good) rule of thumb: 6 FLOP/param/token * number of params * number of tokens
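Plugging the two rules of thumb into numbers, for an illustrative 70B-parameter model:
params = 70e9                   # illustrative model size
tokens = 20 * params            # 20 tokens/param  -> 1.4e12 tokens
flops  = 6 * params * tokens    # 6 FLOP/param/token -> ~5.9e23 FLOP
print(f"{tokens:.1e} tokens, {flops:.1e} FLOP")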
More resources