2025-04-24
I end up recomputing these numbers many times, so here's a handy reference. Feel free to plug in your own numbers.
FLOP : floating point operation(s). Assume float32 unless specified otherwise.
FLOP/s : floating point operations per second
FLOPs, FLOPS : I will never use this terminology
Given a GPU:
FLOP/$ = (GPU FLOP/s) * (GPU lifespan in s) / (GPU sales price in $)
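A minimal Python sketch of this formula (function and variable names are mine; the example plugs in the H200 numbers used later in this note, with ~$37.5k/GPU assumed from ~$300k per 8xH200 node):

    SECONDS_PER_YEAR = 365 * 24 * 3600

    def flop_per_dollar(gpu_flop_per_s, lifespan_years, price_usd):
        # FLOP/$ = (GPU FLOP/s) * (GPU lifespan in s) / (GPU sales price in $)
        return gpu_flop_per_s * lifespan_years * SECONDS_PER_YEAR / price_usd

    # e.g. one H200 at 67 TFLOP/s float32, 5 year lifespan, ~$37.5k
    print(flop_per_dollar(67e12, 5, 37_500))  # ~2.8e17 FLOP/$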
Given a GPU and an LLM for inference:
$/token = (FLOP / token) / (GPU FLOP/$) = e * (LLM params) / (GPU FLOP/$)
Given a GPU and an LLM for inference:
tokens/s = (GPU FLOP/s) / (FLOP / token) = (GPU FLOP/s) / (e * (LLM params))
where e : number of times each LLM param is accessed (and multiplied) per forward pass
e > 1
(assumes cost of energy consumed over 5 years is much smaller than sales price)
(assumes one inference token per forward pass)
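Both inference formulas as a minimal sketch (function names are mine; e as defined above):

    def usd_per_token(e, llm_params, gpu_flop_per_usd):
        # $/token = (FLOP/token) / (FLOP/$), with FLOP/token = e * params
        return e * llm_params / gpu_flop_per_usd

    def tokens_per_s(e, llm_params, gpu_flop_per_s):
        # tokens/s = (GPU FLOP/s) / (FLOP/token)
        return gpu_flop_per_s / (e * llm_params)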
Assuming Llama3 405B inference, picking a machine
Llama3 405B float32 memory = 405B params * 4 bytes/param = 1620 GB
H200 memory = 141 GB
1620 GB / 141 GB = 11.49
=> At least 12xH200 required
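The same sizing arithmetic as a sketch:

    import math

    params = 405e9
    bytes_per_param = 4        # float32
    h200_mem = 141e9           # bytes of HBM per H200

    print(math.ceil(params * bytes_per_param / h200_mem))  # 12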
Assuming 2x8xH200 SXM
Total FLOP/$ = (2 * 8 * 67 TFLOP/s) * (5 years) / (2 * $300k) = 2.817e17 FLOP/$
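Same arithmetic as a sketch (the $300k per 8xH200 SXM node is an assumed price):

    SECONDS_PER_YEAR = 365 * 24 * 3600

    cluster_flop_per_s = 2 * 8 * 67e12   # 2 nodes x 8 H200 x 67 TFLOP/s float32
    cluster_price_usd = 2 * 300_000      # assuming ~$300k per 8xH200 SXM node

    print(cluster_flop_per_s * 5 * SECONDS_PER_YEAR / cluster_price_usd)  # ~2.817e17 FLOP/$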
Assuming Llama3 405B inference and 2x8xH200 SXM
$/token = e * (405 billion) / (2.817e17 FLOP/$) = e * 1.44e-6 $/token = e * $1.44/1M tokens
tokens/s = (2 * 8 * 67 TFLOP/s) / (e * 405 billion) = (2648/e) tokens/s
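And the final two numbers as a sketch, with e left as a free parameter:

    flop_per_usd = 2.817e17
    flop_per_s = 2 * 8 * 67e12
    params = 405e9

    print(params / flop_per_usd * 1e6)  # ~1.44, i.e. e * $1.44 per 1M tokens
    print(flop_per_s / params)          # ~2648, i.e. (2648/e) tokens/s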
Here's the OpenAI pricing page for comparison.