

2025-06-25

Intro to ML, intro to LLM

2-hour talk, for Ooty retreat 2025-06-26, Samuel Shadrach

Pre-requisites

Matrix multiplication, differential calculus

Find C = A @ B.T

@ means matrix multiplication, .T means transpose

    [1  -1 3]
A = [2  3  3]
    [-2 0  2]

    [7  -1 4]
B = [-2 0  0]
    [0  -3 -1]

z = y^2 - (sin(theta))^2 + 2 y cos(theta) + 1

Find partial derivatives dz/dy and dz/d(theta)

If you can solve all the above questions, you have covered the pre-reqs.
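
If you want to check your answers, here is a quick sketch using numpy and sympy (assuming both are installed):

import numpy as np
import sympy as sp

# matrix exercise: C = A @ B.T
A = np.array([[ 1, -1,  3],
              [ 2,  3,  3],
              [-2,  0,  2]])
B = np.array([[ 7, -1,  4],
              [-2,  0,  0],
              [ 0, -3, -1]])
C = A @ B.T
print(C)   # [[ 20  -2   0]
           #  [ 23  -4 -12]
           #  [ -6   4  -2]]

# calculus exercise: partial derivatives of z
y, theta = sp.symbols("y theta")
z = y**2 - sp.sin(theta)**2 + 2*y*sp.cos(theta) + 1
print(sp.diff(z, y))       # 2*y + 2*cos(theta)
print(sp.diff(z, theta))   # -2*y*sin(theta) - 2*sin(theta)*cos(theta)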

Two-layer fully-connected network, trained on MNIST

Resources

problem statement

solution: define the following function using weight matrices W1 and W2

forward pass

Y = ReLU( ReLU(X @ W_1) @ W_2 )
loss = - sum (Y dot Y')

definition of ReLU

ReLU(M)_ij = if M_ij > 0, then M_ij, else 0
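
In numpy this is just an element-wise maximum with zero, e.g.:

import numpy as np

M = np.array([[ 1.0, -2.0],
              [ 0.0,  3.0]])
print(np.maximum(0, M))   # [[1. 0.]
                          #  [0. 3.]]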

Y contains the prediction (stored as log probabilities), Y' contains the actual answer (one-hot encoded)

example (assume N=1 image for now)

Y = [ln(0.15) ln(0.75) ln(0.05) ln(0.05) ln(0) ln(0) ln(0) ln(0) ln(0) ln(0)] = [-1.90 -0.29 -3.00 -3.00 -inf -inf -inf -inf -inf -inf]
Y' = [0 1 0 0 0 0 0 0 0 0]
loss = -ln(0.75) = 0.29
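
A quick numpy check of this loss; only the true-class entry contributes, since Y' is zero everywhere else (indexing the true class directly avoids multiplying ln(0) by 0):

import numpy as np

probs  = np.array([0.15, 0.75, 0.05, 0.05, 0, 0, 0, 0, 0, 0])   # predicted probabilities
Y_true = np.zeros(10)
Y_true[1] = 1.0                                                  # correct class is index 1
loss = -np.log(probs[Y_true.argmax()])                           # only the true class contributes
print(loss)                                                      # approx 0.29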

dimensions

X: (N,D)
W_1: (D,E)
W_2: (E,C)
=> Y: (N,C)

example dimensions

X: (60000, 28*28) = (60000, 784)
W_1: (784, 800)
W_2: (800, 10)
=> Y: (60000, 10)
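
A quick shape check in numpy, using a smaller N so it runs instantly (the full MNIST training set has N = 60000):

import numpy as np

N, D, E, C = 100, 784, 800, 10     # small stand-in for (60000, 784, 800, 10)
X  = np.random.randn(N, D)
W1 = np.random.randn(D, E)
W2 = np.random.randn(E, C)
Y  = np.maximum(0, np.maximum(0, X @ W1) @ W2)
print(Y.shape)                     # (100, 10), i.e. (N, C)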

Objective: find W1 and W2 such that as many images as possible are classified into the correct class

training loop

How do we find W1 and W2 fast?

gradient descent

find gradients of loss with respect to weight matrices

dL / dW2 = ???

dL / dW2_ij = ???

visualise it

L = - | { ReLU { ReLU { [X_00 ... ] [W1_00 .... ] } [W2_00 .... ] } } dot Y' |
      | {      {      { [...  X_ND] [....  W1_DE] } [....  W2_EC] } }        |

how do we find dL/dW1_00, and then repeat this for every entry of W1?

rewrite it

H = f1(X,W1)
Y = f2(H,W2)
L = f0(Y,Y')

remember we are finding partial derivatives: every value is held constant except the one with respect to which we are differentiating

dL / dW2_ij
= df0/dY * df2/dW2_ij
= ...

dL / dW1_ij
= df0/dY * df2/dH * dH/dW1_ij
= ...

final answer, copy-pasted from o3, might contain hallucinations

import numpy as np

# forward
A1 = X @ W1                         # pre-activation of layer 1
H  = np.maximum(0, A1)              # ReLU
A2 = H @ W2                         # pre-activation of layer 2
Y  = np.maximum(0, A2)              # ReLU
loss = -(Y * Y_true).sum()

# backward
grad_Y  = -Y_true                   # dL/dY
mask2   = (A2 > 0).astype(float)    # ReLU derivative at layer 2
delta2  = grad_Y * mask2            # dL/dA2
grad_W2 = H.T @ delta2              # dL/dW2

mask1   = (A1 > 0).astype(float)    # ReLU derivative at layer 1
delta1  = (delta2 @ W2.T) * mask1   # dL/dA1
grad_W1 = X.T @ delta1              # dL/dW1
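
To complete the training loop: a minimal sketch of gradient descent using these gradients, with random placeholder data, a small batch, and an assumed learning rate lr (real training would loop over MNIST batches instead):

import numpy as np

np.random.seed(0)
N, D, E, C = 64, 784, 800, 10                          # small batch instead of all 60000 images
X      = np.random.randn(N, D)                         # placeholder inputs
Y_true = np.eye(C)[np.random.randint(0, C, size=N)]    # placeholder one-hot labels
W1 = 0.01 * np.random.randn(D, E)
W2 = 0.01 * np.random.randn(E, C)
lr = 1e-3                                              # assumed learning rate

for step in range(10):
    # forward
    A1 = X @ W1
    H  = np.maximum(0, A1)
    A2 = H @ W2
    Y  = np.maximum(0, A2)
    loss = -(Y * Y_true).sum()

    # backward (same formulas as above)
    delta2  = -Y_true * (A2 > 0)
    grad_W2 = H.T @ delta2
    delta1  = (delta2 @ W2.T) * (A1 > 0)
    grad_W1 = X.T @ delta1

    # gradient descent step
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2
    print(step, loss)                                  # loss should keep decreasing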

conclusion

optional homework

deep learning is 15 years of accumulated blackbox tricks

Transformer

Resources

Ask o3 this question along with the code: make a list of all the steps in the following forward pass in plain English

Forward pass

Training loop

important steps in attention block

Projection

Q = X @ W_Q
K = X @ W_K
V = X @ W_V

Scaled dot product attention

Y = softmax[ (Q @ K.T) / sqrt(d_k) + mask ] @ V

where d_k is the dimension of each query/key vector

Softmax normalisation

softmax(M)_ij = e^M_ij / sum k (e^M_ik)   (normalise across each row)

example

softmax { [1  2 ] } = [ e/(e+e^2)     e^2/(e+e^2)     ] = [0.268 0.731]
        { [3  -1] }   [ e^3/(e^3+1/e) (1/e)/(e^3+1/e) ]   [0.982 0.018]
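
The same example checked in numpy (normalising each row):

import numpy as np

M = np.array([[1.0,  2.0],
              [3.0, -1.0]])
P = np.exp(M) / np.exp(M).sum(axis=1, keepdims=True)   # row-wise softmax
print(P)   # approx [[0.269 0.731]
           #         [0.982 0.018]]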

Mask
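
Here the mask is typically the causal mask: 0 on and below the diagonal, -inf above it, so each position can only attend to itself and earlier positions. A minimal single-head sketch in numpy, putting projection, scaling, mask, and softmax together, with made-up dimensions (T tokens, model width d_model, head width d_k):

import numpy as np

np.random.seed(0)
T, d_model, d_k = 4, 16, 8            # made-up sizes
X   = np.random.randn(T, d_model)     # one token embedding per row
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

# projection
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# causal mask: -inf above the diagonal, 0 elsewhere
mask = np.triu(np.full((T, T), -np.inf), k=1)

# scaled dot product attention
scores  = Q @ K.T / np.sqrt(d_k) + mask
scores -= scores.max(axis=1, keepdims=True)                            # for numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # row-wise softmax
Y = weights @ V
print(Y.shape)   # (T, d_k)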

Intuition: How does this work?

Intuition: Why does this outperform other known techniques?

misc stuff in transformers

tokenisation

positional embeddings

RMS norm

Residual layer

Mixture of experts

N layers put together

typical hyperparams

How to pick number of params when training a new model

More resources


