2025-06-25
Intro to ML, intro to LLM
2-hour talk, for Ooty retreat 2025-06-26, Samuel Shadrach
Pre-requisites
Matrix multiplication, differential calculus
Find C = A @ B.T
@ means matrix multiplication, .T means transpose
A = [ 1 -1  3 ]
    [ 2  3  3 ]
    [-2  0  2 ]
B = [ 7 -1  4 ]
    [-2  0  0 ]
    [ 0 -3 -1 ]
z = y^2 - (sin(theta))^2 + 2 y cos(theta) + 1
Find partial derivatives dz/dy and dz/d(theta)
If you can solve all the above questions, you have covered the pre-reqs.
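To check the matrix answer numerically, here is a quick NumPy sketch (same @ and .T notation as above):
import numpy as np

A = np.array([[ 1, -1,  3],
              [ 2,  3,  3],
              [-2,  0,  2]])
B = np.array([[ 7, -1,  4],
              [-2,  0,  0],
              [ 0, -3, -1]])
C = A @ B.T   # @ is matrix multiplication, .T is transpose
print(C)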
Two-layer fully-connected network, trained on MNIST
Resources
problem statement
- let's say we have 60000 photographs, each 28x28 pixels, grayscale (each pixel value between 0 and 1)
- each of these is already classified into one of ten digits: 0, 1, 2, ..., 9
- we want a program that can quickly classify as many of these as possible
- we cannot hard-code the final answers into the final program, because we want this program to also work well on new images we have never seen.
- there are 10000 photographs we have not seen; our final score will be measured on these.
solution: define the following function using constant weight matrices W1 and W2
forward pass
Y = ReLU( ReLU(X @ W_1) @ W_2 )
loss = - sum (Y dot Y')
definition of ReLU
ReLU(M)_ij = if M_ij > 0, then M_ij, else 0
Y contains prediction (stored as log probabilities), Y' contains actual answer
example (assume N=1 image for now)
Y = [ln(0.10) ln(0.75) ln(0.10) ln(0.05) ln(0) ln(0) ln(0) ln(0) ln(0) ln(0)] = [-2.30 -0.29 -2.30 -3.00 -inf -inf -inf -inf -inf -inf]
Y' = [0 1 0 0 0 0 0 0 0 0]
loss = -ln(0.75) = 0.29
dimensions
X: (N,D)
W_1: (D,E)
W_2: (E,C)
=> Y: (N,C)
example dimensions
X: (60000, 28*28) = (60000, 784)
W_1: (784, 800)
W_2: (800, 10)
=> Y: (60000, 10)
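A quick shape check with random data in NumPy (just to see that the dimensions line up; weights here are arbitrary, not trained):
import numpy as np

N, D, E, C = 60000, 784, 800, 10
X  = np.random.rand(N, D)            # N images, each flattened to 784 pixel values in [0, 1]
W1 = np.random.randn(D, E) * 0.01    # arbitrary small weights
W2 = np.random.randn(E, C) * 0.01

Y = np.maximum(0, np.maximum(0, X @ W1) @ W2)   # ReLU(ReLU(X @ W1) @ W2)
print(Y.shape)                                  # (60000, 10)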
Objective: find W1 and W2, so that as many images get classified into correct classes as possible
training loop
How do we find W1 and W2 fast?
- W1 has 627k values, W2 has 8k values
- even if each cell could only be 0 or 1, that's 2^(627k + 8k) = 2^635k ≈ 2^(6.4 * 10^5) possibilities
- any sort of brute force or iteration is too slow
gradient descent
- given any weight matrices W1 and W2, we will find new W1 and W2 that are slightly better
find gradients of loss with respect to weight matrices
dL / dW2 = ???
dL / dW2_ij = ???
visualise it
L = - ( ReLU( ReLU( X @ W1 ) @ W2 ) dot Y' ), written out entry by entry:
X = [X_00 ... X_ND], W1 = [W1_00 ... W1_DE], W2 = [W2_00 ... W2_EC]
how do we find dL/dW1_00? and how do we repeat this for every value in W1?
rewrite it
H = f1(X,W1)
Y = f2(H,W2)
L = f0(Y,Y')
remember we are finding partial derivatives: all values are held constant, except the one with respect to which we are differentiating
dL / dW2_ij
= df0/dY * df2/dW2_ij
= ...
dL / dW1_ij
= df0/dY * df2/dH * dH/dW1_ij
= ...
final answer, copy-pasted from o3, might contain hallucinations
import numpy as np

# forward
A1 = X @ W1                           # (N, E)
H  = np.maximum(0, A1)                # ReLU
A2 = H @ W2                           # (N, C)
Y  = np.maximum(0, A2)                # ReLU
loss = -(Y * Y_true).sum()            # Y_true is the one-hot matrix Y'

# backward
grad_Y  = -Y_true                     # dL/dY
mask2   = (A2 > 0).astype(float)      # derivative of ReLU at A2
delta2  = grad_Y * mask2              # dL/dA2
grad_W2 = H.T @ delta2                # dL/dW2
delta1  = (delta2 @ W2.T) * (A1 > 0)  # dL/dA1
grad_W1 = X.T @ delta1                # dL/dW1
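Since the formulas above were copy-pasted from o3, one way to sanity-check them is a finite-difference gradient check on tiny random matrices, sketched here:
import numpy as np

def loss_fn(X, W1, W2, Y_true):
    Y = np.maximum(0, np.maximum(0, X @ W1) @ W2)
    return -(Y * Y_true).sum()

rng = np.random.default_rng(0)
N, D, E, C = 4, 6, 5, 3                          # tiny sizes so the check is fast
X  = rng.normal(size=(N, D))
W1 = rng.normal(size=(D, E))
W2 = rng.normal(size=(E, C))
Y_true = np.eye(C)[rng.integers(0, C, size=N)]   # random one-hot labels

# analytical gradient of one entry of W1, using the formulas above
A1 = X @ W1
H  = np.maximum(0, A1)
A2 = H @ W2
delta2  = -Y_true * (A2 > 0)
grad_W1 = X.T @ ((delta2 @ W2.T) * (A1 > 0))

# numerical gradient of the same entry, via finite differences
eps = 1e-6
W1_plus = W1.copy(); W1_plus[0, 0] += eps
numerical = (loss_fn(X, W1_plus, W2, Y_true) - loss_fn(X, W1, W2, Y_true)) / eps
print(grad_W1[0, 0], numerical)                  # the two numbers should agree closely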
conclusion
- we have found the matrices dL/dW1 and dL/dW2, for given constant values of X, Y', W1, W2
- (W1 - e * dL/dW1, W2 - e * dL/dW2) will have slightly lower loss than (W1, W2), for a small enough step size e
- repeat this process millions of times to get a very good (W1, W2) (see the sketch at the end of this section)
- this will run fast on a GPU (that's why we defined the network this way)
- only intensive operation is matrix multiplication
- no copy-paste operation
- discard intermediate values on each iteration, and update W1 and W2
- p.s. in practice we will split our dataset into batches to do batch SGD, use an optimiser such as Adam and use cross-entropy loss. all this is not important for now.
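Putting it all together, a minimal training-loop sketch: tiny random data stands in for MNIST, plain gradient descent, and an arbitrary learning rate (illustrative only; the real homework version loads the actual dataset):
import numpy as np

rng = np.random.default_rng(0)
N, D, E, C = 1000, 784, 800, 10                 # small N so the sketch runs quickly
X = rng.random((N, D))                          # stand-in "images"
Y_true = np.eye(C)[rng.integers(0, C, size=N)]  # stand-in one-hot labels Y'

W1 = rng.normal(0, 0.01, size=(D, E))
W2 = rng.normal(0, 0.01, size=(E, C))
lr = 1e-4                                       # arbitrary small step size e

for step in range(100):
    # forward
    A1 = X @ W1
    H  = np.maximum(0, A1)
    A2 = H @ W2
    Y  = np.maximum(0, A2)
    loss = -(Y * Y_true).sum()

    # backward (same formulas as above)
    delta2  = -Y_true * (A2 > 0)
    grad_W2 = H.T @ delta2
    delta1  = (delta2 @ W2.T) * (A1 > 0)
    grad_W1 = X.T @ delta1

    # gradient descent update: (W1, W2) -> (W1 - e*dL/dW1, W2 - e*dL/dW2)
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2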
optional homework
- derive the backprop formula above
- figure out notation so that this calculation becomes easy to do
- I have heard that einstein notation makes this result easier to derive, I have not practised it myself though
- using pytorch, actually train a two-layer fully connected network on MNIST 60k examples training dataset, obtain 1.6% error on test dataset
- in modern ML, engineers usually rely on pytorch autograd to compute gradients, so they don't have to derive the formulas by hand.
- for this homework assignment, don't use pytorch autograd. hard-code the gradients specified above.
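For the homework, one way to load MNIST into the matrix shapes used above (a sketch assuming torchvision is installed; the training loop with hard-coded gradients is the exercise):
import torch
from torchvision import datasets

train = datasets.MNIST(root="data", train=True,  download=True)
test  = datasets.MNIST(root="data", train=False, download=True)

# flatten each 28x28 uint8 image into a row of 784 floats in [0, 1]
X_train = train.data.reshape(-1, 784).float() / 255.0             # (60000, 784)
X_test  = test.data.reshape(-1, 784).float() / 255.0              # (10000, 784)
Y_train = torch.nn.functional.one_hot(train.targets, 10).float()  # (60000, 10)
Y_test  = torch.nn.functional.one_hot(test.targets, 10).float()   # (10000, 10)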
deep learning is 15 years of accumulated blackbox tricks
- question: ok, but why did we use a two-layer fully connected network in the first place?
- we almost never have mathematical proof for why doing anything is a good idea
  - why is RMS norm a good idea? why is softmax a good idea? why is ReLU a good idea?
  - why did we define Q, K, V this way in attention blocks? why did we use 48 layers not 24?
  - why did we use residual layers? why did we use cosine for positional embeddings?
- we do things if we have empirical evidence it worked before, and outperformed similar ideas
  - but: most ideas have not been tried yet. so we don't know if it actually outperforms similar ideas, just the ones we have tried so far.
  - there is a graveyard of old approaches
- Older architectures: RNN, CNN, LSTM
  - Today: Attention dominates everything
- Older activation functions: Sigmoid, tanh, GELU
  - Today: ReLU dominates everything
- Older loss functions: Mean square loss, hinge loss, logistic loss
  - Today: Cross-entropy loss dominates everything
- Older optimisers: vanilla SGD, RMSProp
  - Today: AdamW (Adam with weight decay) dominates everything
- Older ideas that nobody remembers: dropout, L1 regularisation, etc
- New ideas that might or might not become old one day: Mixture of experts, temperature tuning for long context, etc
- we have intuitions for why we are doing what we are doing
  - but: intuitions are often retroactively justified, after we have empirical evidence it worked. no one publishes intuitions for failed ideas.
  - but: intuitions often come just from looking at hundreds of training runs with slightly different networks. today only big labs can afford this many runs at large scale, so only their researchers can build these intuitions.
Transformer
Resources
Ask o3 this question along with the code: make a list of all the steps in the following forward pass in plain english
- Input: Sequence of tokens
- Output: Logits (log probabilities) for next token
Forward pass
- Tokenise
- Add positional embeddings
- Fuse image embeddings (optional)
- N transformer blocks (let's say N=80)
- RMS Norm
- Multi-headed Attention
- Projection:
X -> Q_1, K_1, V_1
- Reshape:
{Q,K,V}_1 -> {Q,K,V}_2
- Rotary embedding (optional):
{Q,K}_2 -> {Q,K}_3
- Head-wise RMS Norm (optional):
{Q,K}_3 -> {Q,K}_4
- Temperature tuning (optional, for long context):
Q_4 -> Q_TEMP_TUNED
- Duplicate KV cache (for faster computation):
K_4, V_1 -> K_DUPLICATED, V_DUPLICATED
- Scaled dot product attention (torch.nn.functional.scaled_dot_product_attention):
Q_TEMP_TUNED, K_DUPLICATED, V_DUPLICATED -> WO
- Projection:
WO -> OUT
- Add residual
- RMS Norm
- Feed forward / Mixture of experts (ffn.py)
- Either expert gating network
- Or 2-layer fully connected network with SiLU activation function
- Add residual
- Projection
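A compressed sketch of one transformer block from the list above, in PyTorch-style code (heavily simplified: a single attention head, no KV cache, optional steps omitted; names and shapes are illustrative, not the actual repo's code):
import torch
import torch.nn.functional as F

def rms_norm(x, eps=1e-6):
    # divide each row by its root mean square (learned gain omitted)
    return x / torch.sqrt((x * x).mean(dim=-1, keepdim=True) + eps)

def transformer_block(x, W_q, W_k, W_v, W_o, W_up, W_down):
    # x: (T, D) = T token embeddings of width D
    T, _ = x.shape
    d_k = W_q.shape[1]

    # attention sub-block
    h = rms_norm(x)
    Q, K, V = h @ W_q, h @ W_k, h @ W_v
    mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)   # hide future tokens
    attn = torch.softmax((Q @ K.T + mask) / d_k**0.5, dim=-1) @ V
    x = x + attn @ W_o                                                 # projection + residual

    # feed-forward sub-block (2-layer with SiLU)
    h = rms_norm(x)
    x = x + F.silu(h @ W_up) @ W_down                                  # projection + residual
    return x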
Training loop
- Given log probabilities of next token and the actual next token, compute cross-entropy loss
- Compute gradients of the cross-entropy loss with respect to all the weight matrices
- Use AdamW optimiser to do gradient descent
- Spend $100 billion per year mostly on one single training run (yes, really)
- Note: To be technically accurate, this repo is for Llama 4 Scout which only cost ~$10M capex to train (5M hours on H100). State-of-the-art models like Llama 4 Behemoth, GPT4.5, grok-3 likely cost $1-10B capex and used similar architecture but not this exact repo.
important steps in attention block
Projection
Q = X @ W_Q
K = X @ W_K
V = X @ W_V
Scaled dot product attention
Y = softmax[ (Q @ K.T + mask) / sqrt(d_k) ] @ V, where d_k is the dimension of each query/key vector
Softmax normalisation
- replace each cell with e raised to the power of that cell
- divide each cell by the sum of its row
softmax(M)_ij = e^(M_ij) / sum_k e^(M_ik)
example
softmax([[1, 2], [3, -1]]):
row 1: [e^1/(e^1+e^2), e^2/(e^1+e^2)] = [0.269, 0.731]
row 2: [e^3/(e^3+e^-1), e^-1/(e^3+e^-1)] = [0.982, 0.018]
Mask
- this is the data we are going to pretend we don't have access to, and have the model try to predict it
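Putting the projection, softmax and mask together, a minimal single-head NumPy sketch (shapes and the random data are illustrative):
import numpy as np

def softmax(M):
    e = np.exp(M - M.max(axis=-1, keepdims=True))   # subtract row max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

T, D, d_k = 5, 16, 8                      # sequence length, model width, query/key width
rng = np.random.default_rng(0)
X = rng.normal(size=(T, D))
W_Q, W_K, W_V = (rng.normal(size=(D, d_k)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
mask = np.triu(np.full((T, T), -np.inf), k=1)       # hide future tokens from each position
Y = softmax((Q @ K.T + mask) / np.sqrt(d_k)) @ V    # (T, d_k)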
Intuition: How does this work?
- softmax: largest values in a row output approx 1, smallest values in a row output approx 0
- mask: some values get hard-coded to 0
- in softmax(something) @ V, the matrix softmax(something) has most values close to 0 and 1; it tells us which values in V to ignore versus not ignore
- We are hiding some parts of X from ourselves (using mask), paying more attention to other parts of X (using softmax Q K_T), and then finding optimal weight matrices to predict the parts of X that we hid
Intuition: Why does this outperform other known techniques?
- Sequential versus parallel
- Naive way of formulating next-token prediction is as a sequential problem. Attention masks parallelise this.
- RNNs do next-token prediction but their forward pass predicts tokens sequentially.
- This means gradient descent needs to happen across many serial steps. GPUs are good for training parallel stuff not sequential stuff.
- Vanishing gradients problem when doing gradient descent across many sequential steps.
- Pay attention to what?
- CNNs maintain hard-coded sliding windows of which tokens to pay attention to.
- LSTMs and RNNs maintain a shared context that is reused for many tokens.
- Attention layer can look at the tokens and use the tokens themselves to compute which tokens to pay attention to. (Remember Q, K, V are all functions of X.)
misc stuff in transformers
tokenisation
- create a vocabulary of ~100k most commonly used words and phrases
- convert input into 1-hot encoding using this
- basically "hello" will become [0 0 0 0 ....maybe 25k entries .... 0 0 1 0 0 ... maybe 75k entries ... 0]
- why?
- deep learning works on math, not words
- English has fewer than 100k commonly used words, so we can hard-code this into the model instead of training the model to figure it out on its own.
- why not train it? don't know for sure
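A toy sketch of the idea with a hypothetical 5-entry vocabulary (real tokenisers such as BPE split text into subword pieces, which is more involved):
import numpy as np

vocab = {"hello": 0, "world": 1, "my": 2, "name": 3, "is": 4}   # toy vocabulary

def one_hot(token):
    v = np.zeros(len(vocab))
    v[vocab[token]] = 1.0
    return v

print(one_hot("hello"))   # [1. 0. 0. 0. 0.]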
positional embeddings
- calculate some cosine thing of the token, and attach it to the token
- ensures each token also now stores data of which position it is. imagine new input is:
my one name two is three alice four
- why?
- don't know for sure
- intuition: maybe humans speak differently at the start of a paragraph versus in between. so it's helpful to always remember where you are.
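One concrete version of the "cosine thing" is the sinusoidal positional embedding from the original Transformer paper, sketched here (many recent models use rotary embeddings instead, as in the forward pass above):
import numpy as np

def sinusoidal_positional_embedding(num_positions, dim):
    # dim assumed even; each position gets a unique pattern of sines and cosines
    pos = np.arange(num_positions)[:, None]          # (num_positions, 1)
    i = np.arange(dim // 2)[None, :]                 # (1, dim/2)
    angle = pos / (10000 ** (2 * i / dim))
    emb = np.zeros((num_positions, dim))
    emb[:, 0::2] = np.sin(angle)
    emb[:, 1::2] = np.cos(angle)
    return emb                                       # added to the token embeddings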
RMS norm
- divide each cell by root mean square of all cells in that row
- keeps each row at a consistent scale (the row's root mean square becomes 1)
- why?
- don't know for sure
- intuition: gradient descent is better behaved when values stay at a roughly unit scale
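A minimal sketch (real RMSNorm layers also multiply by a learned per-dimension gain, omitted here):
import numpy as np

def rms_norm(x, eps=1e-6):
    # divide each cell by the root mean square of its row
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x / rms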
Residual layer
- let's say we did some stuff: Y = f(X)
- adding residual just means adding the input back in: Y_with_residual = Y + X = f(X) + X
- why?
- don't know for sure
- intuition: "exploding and vanishing gradients" - sometimes if you do gradient descent on weight matrices across this many layers, you get gradients that approach zero or infinity
Mixture of experts
- Not covering in this lecture
N layers put together
- In total there are N sequential layers of transformer blocks. So we are doing gradient descent across N layers to find optimal weight matrices in each layer.
- Intuition for N layers: pay attention to nearby tokens, then not so nearby, then not so nearby
typical hyperparams
- number of layers
  - depends on model size
  - typically 32 to 128 layers
- number of params
  - depends on model size
  - GPT2 XL (2019): 1.5B params
  - GPT3 (2020): 175B params
  - GPT4 (2023): rumoured 1.8T params
  - GPT4.5 (2025): rumoured 12T params, of which 1T active params (mixture of experts)
- bytes per param
  - training
    - typically float32 (4 bytes per weight)
    - mixed precision training is recent, for example deepseek
  - inference
    - quantisation works well: fp16, int8, int4, 1.58-bit
- model size
  - model size in bytes = number of params * bytes per param
How to pick number of params when training a new model
- depends on data and compute available, versus data and compute required
- data and compute required is calculated using the chinchilla scaling law
- typically compute has always been the bottleneck, not data
- epoch.ai forecasts running out of internet data in 2028
- data required
  - (not very good) rule of thumb: 20 tokens/param * number of params
  - typically trained for one epoch - gradient descent is done on every token exactly once
- compute required (a worked example follows this list)
  - (not very good) rule of thumb: 6 FLOP/param/token * number of params * number of tokens
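Plugging the two rules of thumb into numbers, for an illustrative 70B-parameter model:
params = 70e9                   # illustrative model size
tokens = 20 * params            # 20 tokens/param  -> 1.4e12 tokens
flops  = 6 * params * tokens    # 6 FLOP/param/token -> ~5.9e23 FLOP
print(f"{tokens:.1e} tokens, {flops:.1e} FLOP")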
More resources