2025-04-24
I end up recomputing these numbers many times, so here's a handy reference. Feel free to plug in your own numbers.
FLOP : floating point operation(s). Assume float32 unless specified otherwise.
FLOP/s : floating point operations per second
FLOPs, FLOPS : I will never use this terminology
Given a GPU:
FLOP/$ = (GPU FLOP/s) * (GPU lifespan in s) / (GPU sales price in $)
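A minimal Python sketch of this formula (function and variable names are mine; the example plugs in the H200 numbers used later in this note, with ~$37.5k/GPU assumed from ~$300k per 8xH200 node):

    SECONDS_PER_YEAR = 365 * 24 * 3600

    def flop_per_dollar(gpu_flop_per_s, lifespan_years, price_usd):
        # FLOP/$ = (GPU FLOP/s) * (GPU lifespan in s) / (GPU sales price in $)
        return gpu_flop_per_s * lifespan_years * SECONDS_PER_YEAR / price_usd

    # e.g. one H200 at 67 TFLOP/s float32, 5 year lifespan, ~$37.5k
    print(flop_per_dollar(67e12, 5, 37_500))  # ~2.8e17 FLOP/$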
Given a GPU and an LLM for inference:
$/token = (FLOP / token) / (GPU FLOP/$) = e * (LLM params) / (GPU FLOP/$)
Given a GPU and an LLM for inference:
tokens/s = (GPU FLOP/s) / (FLOP / token) = (GPU FLOP/s) / (e * (LLM params))
where e : number of times each LLM param is accessed (and multiplied) per forward pass
e > 1
(assumes cost of energy consumed over 5 years is much smaller than sales price)
(assumes one inference token per forward pass)
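Both inference formulas as a minimal sketch (function names are mine; e as defined above):

    def usd_per_token(e, llm_params, gpu_flop_per_usd):
        # $/token = (FLOP/token) / (FLOP/$), with FLOP/token = e * params
        return e * llm_params / gpu_flop_per_usd

    def tokens_per_s(e, llm_params, gpu_flop_per_s):
        # tokens/s = (GPU FLOP/s) / (FLOP/token)
        return gpu_flop_per_s / (e * llm_params)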
Assuming Llama3 405B inference, picking a machine
Llama3 405B float32 memory = 405B params * 4 bytes/param = 1620 GB
H200 memory = 141 GB
1620 GB / 141 GB = 11.49
=> At least 12xH200 required
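The same sizing arithmetic as a sketch:

    import math

    params = 405e9
    bytes_per_param = 4        # float32
    h200_mem = 141e9           # bytes of HBM per H200

    print(math.ceil(params * bytes_per_param / h200_mem))  # 12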
Assuming 2x8xH200 SXM
Total FLOP/$ = (2 * 8 * 67 TFLOP/s) * (5 years) / (2 * $300k) = 2.817e17 FLOP/$
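Same arithmetic as a sketch (the $300k per 8xH200 SXM node is an assumed price):

    SECONDS_PER_YEAR = 365 * 24 * 3600

    cluster_flop_per_s = 2 * 8 * 67e12   # 2 nodes x 8 H200 x 67 TFLOP/s float32
    cluster_price_usd = 2 * 300_000      # assuming ~$300k per 8xH200 SXM node

    print(cluster_flop_per_s * 5 * SECONDS_PER_YEAR / cluster_price_usd)  # ~2.817e17 FLOP/$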
Assuming Llama3 405B inference and 2x8xH200 SXM
$/token = e * (405 billion) / (2.817e17 FLOP/$) = e * 1.44e-6 $/token = e * $1.44/1M tokens
tokens/s = (2 * 8 * 67 TFLOP/s) / (e * 405 billion) = (2648/e) tokens/s
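And the final two numbers as a sketch, with e left as a free parameter:

    flop_per_usd = 2.817e17
    flop_per_s = 2 * 8 * 67e12
    params = 405e9

    print(params / flop_per_usd * 1e6)  # ~1.44, i.e. e * $1.44 per 1M tokens
    print(flop_per_s / params)          # ~2648, i.e. (2648/e) tokens/s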
Here's the OpenAI pricing page for comparison.