Q1: Find C = A @ B.T
@ means matrix multiplication, .T means transpose
A = [  1  -1   3 ]
    [  2   3   3 ]
    [ -2   0   2 ]

B = [  7  -1   4 ]
    [ -2   0   0 ]
    [  0  -3  -1 ]
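If you want to check your answer, here is a quick sanity check with numpy (assuming numpy is installed; it is not needed for the exercise itself):

```python
import numpy as np

A = np.array([[ 1, -1,  3],
              [ 2,  3,  3],
              [-2,  0,  2]])
B = np.array([[ 7, -1,  4],
              [-2,  0,  0],
              [ 0, -3, -1]])
print(A @ B.T)   # this is the C asked for in Q1
```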
Q2: z = y^2 - (sin(theta))^2 + 2 y cos(theta) + 1
Find partial derivatives dz/dy and dz/d(theta)
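Similarly, you can check your Q2 answers against sympy (assuming sympy is installed):

```python
import sympy as sp

y, theta = sp.symbols("y theta")
z = y**2 - sp.sin(theta)**2 + 2*y*sp.cos(theta) + 1
print(sp.diff(z, y))       # compare against your dz/dy
print(sp.diff(z, theta))   # compare against your dz/d(theta)
```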
Q3: Write a function in C to sort an array of N integers in time complexity O(N log N). Do not use built-in libraries for sorting. Accept N as input, then accept the integers as input.
Definition of time complexity: Suppose we plot the time required to sort an array as a function of N; call this T(N). Time complexity of O(N log N) means T(N) <= c * N log N for some constant c and all sufficiently large N.
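The exercise asks for C, but as a hedged illustration of one O(N log N) approach, here is merge sort sketched in Python; the same divide-and-merge structure carries over to C:

```python
def merge_sort(a):
    # O(N log N): split the array in half, sort each half, merge the sorted halves
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 5, 6]))   # [1, 2, 5, 5, 6, 9]
```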
If you can solve all the above questions, you have covered the pre-reqs.
Two-layer fully-connected network, trained on MNIST
we have found matrices dL/dW1 and dL/dW2, for given constant values of X, Y', W1, W2
(W1 - e * dL/dW1, W2 - e * dL/dW2) will have slightly less loss than (W1, W2)
repeat this process millions of times to get a very good (W1, W2)
this will run fast on a GPU (that's why we defined the network this way)
only intensive operation is matrix multiplication
no copy-paste operation
discard intermediate values on each iteration, and update W1 and W2
p.s. in practice we will split our dataset into batches to do batch SGD, use an optimiser such as Adam, and use cross-entropy loss. all this is not important for now. (a code sketch of this training loop appears under the optional homework below.)
optional homework
derive the backprop formula above
figure out notation so that this calculation becomes easy to do
I have heard that einstein notation makes this result easier to derive, I have not practised it myself though
using pytorch, actually train a two-layer fully connected network on the MNIST training dataset (60k examples) and obtain ~1.6% error on the test dataset
in modern ML, engineers usually rely on pytorch autograd to compute gradients, so they don't have to derive the formulas by hand.
for this homework assignment, don't use pytorch autograd. hard-code the gradients specified above.
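A minimal sketch of what the hand-coded version could look like in PyTorch (without autograd). This assumes a ReLU hidden layer, softmax + cross-entropy loss, no bias terms, and full-batch gradient descent; the exact gradient formulas referenced above may differ:

```python
import torch

# X: (N, 784) input images, Y: (N, 10) one-hot labels; hidden size and lr are illustrative
def train(X, Y, hidden=256, lr=0.1, steps=10_000):
    W1 = torch.randn(784, hidden) * 0.01
    W2 = torch.randn(hidden, 10) * 0.01
    N = X.shape[0]
    for step in range(steps):
        # forward pass
        H = torch.relu(X @ W1)                 # (N, hidden)
        logits = H @ W2                        # (N, 10)
        P = torch.softmax(logits, dim=1)       # predicted probabilities
        # backward pass, hard-coded (no autograd)
        dlogits = (P - Y) / N                  # gradient of cross-entropy wrt logits
        dW2 = H.T @ dlogits                    # dL/dW2
        dH = dlogits @ W2.T
        dH[H <= 0] = 0                         # gradient through the ReLU
        dW1 = X.T @ dH                         # dL/dW1
        # gradient descent update; intermediates are discarded on the next iteration
        W1 -= lr * dW1
        W2 -= lr * dW2
    return W1, W2
```

Loading MNIST, splitting into batches, and measuring the test error (the 1.6% target) are left to the homework.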
deep learning is 15 years of accumulated blackbox tricks
question: ok but why did we use two-layer fully connected network in the first place?
we almost never have mathematical proof for why doing anything is a good idea
why is RMS norm a good idea? why is softmax a good idea? why is ReLU a good idea?
why did we define Q, K, V this way in attention blocks? why did we use 48 layers not 24?
why did we use residual layer? why did we use cosine for positional embeddings?
we do things if we have empirical evidence it worked before, and outperformed similar ideas
but: most ideas have not been tried yet. so we don't know if it actually outperforms similar ideas, just the ones we have tried so far.
there is a graveyard of old approaches
Older architectures: RNN, CNN, LSTM
Today: Attention dominates everything
Older activation functions: Sigmoid, tanh
Today: ReLU and its smoother variants (GELU, SiLU) dominate everything
Older loss functions: Mean square loss, hinge loss, logistic loss
Today: Cross-entropy loss dominates everything
Older optimisers: vanilla SGD, RMSProp
Today: AdamW (Adam with weight decay) dominates everything
Older ideas that nobody remembers: dropout, L1 regularisation, etc
New ideas that might or might not become old one day: Mixture of experts, temperature tuning for long context, etc
we have intuitions for why we are doing what we are doing
but: intuitions are often retroactively justified, after we have empirical evidence it worked. no one publishes intuitions for failed ideas.
but: intuitions often come just from looking at hundreds of training runs with slightly different networks. today only big labs can afford to do this many runs at large scale, so only their researchers can build these intuitions.
Transformer
Resources
Justin Johnson, University of Michigan, Deep Learning for Computer Vision - Lecture 13: Attention
Scaled dot product attention (torch.nn.functional.scaled_dot_product_attention): Q_TEMP_TUNED, K_DUPLICATED, V_DUPLICATED -> WO
Projection: WO -> OUT
Add residual
RMS Norm
Feed forward / Mixture of experts (ffn.py)
Either expert gating network
Or 2-layer fully connected network with SiLU activation function
Add residual
Projection
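A minimal single-head sketch of the block listed above in PyTorch. The sizes are illustrative; multi-head splitting, positional embeddings, KV duplication and temperature tuning are omitted, and the exact placement of the norms varies between models. nn.RMSNorm requires PyTorch >= 2.4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):              # illustrative sizes
        super().__init__()
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)    # output projection
        self.norm = nn.RMSNorm(d_model)                      # RMS norm (PyTorch >= 2.4)
        self.w1 = nn.Linear(d_model, d_ff, bias=False)       # feed-forward, layer 1
        self.w2 = nn.Linear(d_ff, d_model, bias=False)       # feed-forward, layer 2

    def forward(self, x):                                    # x: (batch, seq, d_model)
        # scaled dot product attention with a causal mask, then projection + residual
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.wo(attn)
        # RMS norm, then a 2-layer feed-forward with SiLU, then residual
        x = x + self.w2(F.silu(self.w1(self.norm(x))))
        return x

block = TransformerBlock()
print(block(torch.randn(1, 10, 512)).shape)   # torch.Size([1, 10, 512])
```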
Training loop
Given log probabilities of next token and the actual next token, compute cross-entropy loss
Compute gradients of all the weight matrices with respect to cross entropy loss
Use the AdamW optimiser to do gradient descent (a minimal sketch of this training loop follows below)
Spend $100 billion per year mostly on one single training run (yes, really)
Note: To be technically accurate, this repo is for Llama 4 Scout which only cost ~$10M capex to train (5M hours on H100). State-of-the-art models like Llama 4 Behemoth, GPT4.5, grok-3 likely cost $1-10B capex and used similar architecture but not this exact repo.
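A hedged sketch of the training-loop steps above in PyTorch. The stand-in `model`, vocabulary size, learning rate and weight decay are placeholders, not values from any real run:

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 50_000, 512
model = torch.nn.Sequential(                   # stand-in for the real stack of transformer blocks
    torch.nn.Embedding(vocab_size, d_model),
    torch.nn.Linear(d_model, vocab_size),      # maps to log probabilities of the next token
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

def train_step(tokens):                        # tokens: (batch, seq) of token ids
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # target at position t is the token at t+1
    logits = model(inputs)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))       # cross-entropy loss
    optimizer.zero_grad()
    loss.backward()                            # gradients of all the weight matrices
    optimizer.step()                           # AdamW update
    return loss.item()

print(train_step(torch.randint(0, vocab_size, (8, 128))))
```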
this is the data we are going to pretend we don't have access to, and have the model try to predict it
Intuition: How does this work?
softmax: the largest values in a row come out as approximately 1, the smallest as approximately 0 (each row sums to 1)
mask: some values get hard-coded to 0 (they are set to -infinity before the softmax, so the softmax outputs 0 for them)
softmax(something) @ V
softmax(something) is a matrix with most values close to 0 or 1; this tells us which rows of V to ignore versus not ignore
We are hiding some parts of X from ourselves (using the mask), paying more attention to other parts of X (using softmax(Q @ K.T)), and then finding optimal weight matrices to predict the parts of X that we hid
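A minimal sketch of this computation in PyTorch; it is roughly what scaled_dot_product_attention with a causal mask does internally (the 1/sqrt(d) scaling is standard but not spelled out above):

```python
import torch

def masked_attention(Q, K, V):
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d**0.5     # how much each token attends to each other token
    # causal mask: token i is not allowed to look at tokens j > i,
    # so those scores become -inf and come out of the softmax as 0
    seq = scores.size(-1)
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)       # rows sum to 1; big scores -> ~1, small -> ~0
    return weights @ V                            # weighted mix of the rows of V

Q = K = V = torch.randn(6, 16)                    # 6 tokens, 16 dims, just to run it
print(masked_attention(Q, K, V).shape)            # torch.Size([6, 16])
```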
Intuition: Why does this outperform other known techniques?
Sequential versus parallel
Naive way of formulating next-token prediction is as a sequential problem. Attention masks parallelise this.
RNNs do next-token prediction but their forward pass predicts tokens sequentially.
This means gradient descent needs to happen across many serial steps. GPUs are good for training parallel stuff not sequential stuff.
Vanishing gradients problem when doing gradient descent across many sequential steps.
Pay attention to what?
CNNs maintain hard-coded sliding windows of which tokens to pay attention to.
LSTMs and RNNs maintain a shared context that is reused for many tokens.
Attention layer can look at the tokens and use the tokens themselves to compute which tokens to pay attention to. (Remember K,Q,V are all a function of X)
misc stuff in transformers
tokenisation
create a vocabulary of the ~100k most commonly used words and phrases
english has fewer than 100k commonly used words, so we can hard-code this into the model instead of training the model to figure it out on its own.
why not train it? don't know for sure
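A toy sketch of the idea (a hypothetical word-level vocabulary; real vocabularies are ~100k entries of words, word pieces and phrases learned from data, not a hand-written dict like this):

```python
# toy vocabulary mapping words to token ids; <unk> catches anything not in it
vocab = {"<unk>": 0, "my": 1, "name": 2, "is": 3, "alice": 4}

def tokenise(text):
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenise("My name is Alice"))   # [1, 2, 3, 4]
print(tokenise("My name is Bob"))     # [1, 2, 3, 0]
```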
positional embeddings
calculate some cosine function of the token's position, and attach it to the token's embedding
ensures each token now also carries the information of which position it is at. imagine the new input is: my one name two is three alice four
why?
don't know for sure
intuition: maybe humans speak differently at the start of a paragraph versus in between. so it's helpful to always remember where you are.
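A sketch of the classic sinusoidal scheme ("the cosine thing"), assuming an even embedding dimension; the exact scheme varies between models:

```python
import torch

def positional_embedding(num_positions, dim):
    # each position gets a vector of sines and cosines at different frequencies;
    # this vector is added to (or concatenated with) the token's embedding
    pos = torch.arange(num_positions).unsqueeze(1)          # (num_positions, 1)
    i = torch.arange(0, dim, 2).unsqueeze(0)                # (1, dim // 2)
    angles = pos / (10000 ** (i / dim))
    pe = torch.zeros(num_positions, dim)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

print(positional_embedding(num_positions=8, dim=16).shape)  # torch.Size([8, 16])
```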
RMS norm
divide each cell by root mean square of all cells in that row
keeps each row at a consistent scale (the root mean square of each row becomes 1)
why?
don't know for sure
intuition: gradient descent is better behaved when values stay at a consistent scale
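A minimal sketch of the operation (the eps is for numerical stability; real RMSNorm layers also multiply by a learned per-dimension gain):

```python
import torch

def rms_norm(x, eps=1e-6):
    # divide each value by the root mean square of its row
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms

x = torch.randn(2, 8)
print(rms_norm(x).pow(2).mean(dim=-1))   # ~1.0 for each row
```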
Residual layer
let's say we did some stuff: Y = f(X)
adding residual just means adding input back in: Y_with_residual = Y + X = f(X) + X
why?
don't know for sure
intuition: "exploding and vanishing gradients" - sometimes if you do gradient descent on weight matrixes across so many layers you get gradients that approach zero or infinity
Mixture of experts
Not covering in this lecture
N layers put together
In total there are N sequential layers of transformer blocks. So we are doing gradient descent across N layers to find optimal weight matrices in each layer.
Intuition for N layers: pay attention to nearby tokens, then not so nearby, then not so nearby
typical hyperparams
number of layers
depends on model size
typically 32 to 128 layers
number of params
depends on model size
GPT2 XL (2019): 1.5B params
GPT3 (2020): 175B params
GPT3.5 based on GPT3
GPT4 (2023): rumoured 1.8T params
o1, o3 based on GPT4
GPT4.5 (2025): rumoured 12T params, of which 1T active params (mixture of experts)
bytes per param
training
typically float32 (4 bytes per weight)
mixed precision training is recent, for example deepseek
inference
quantisation works well: fp16, int8, int4, 1.58-bit
model size
model size in bytes = number of params * bytes per param (for example, 175B params at 4 bytes per param is roughly 700 GB; quantised to int8 it is roughly 175 GB)
How to pick number of params when training a new model
depends on data and compute available, versus data and compute required
data and compute required are calculated using the Chinchilla scaling law
typically compute has always been the bottleneck, not data
epoch.ai forecasts running out of internet data in 2028
data required per param
(not very good) rule of thumb: 20 tokens/param * number of params
typically trained for one epoch - gradient descent is done on every token exactly once
compute required
(not very good) rule of thumb: 6 FLOP/param/token * number of params * number of tokens
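Worked example of the two rules of thumb above, with illustrative numbers (not from any actual training run):

```python
params = 70e9                      # e.g. a 70B-parameter model
tokens = 20 * params               # data rule of thumb    -> 1.4e12 tokens
flops = 6 * params * tokens        # compute rule of thumb -> about 5.9e23 FLOP
print(f"{tokens:.2e} tokens, {flops:.2e} FLOP")
```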