

2026-02-27

Update

I am aware that this is not a problem today, but it will be over the next few years.

2026-02-24

I want data on theoretical limits of distributed training runs

Disclaimer

I would love to read an article on distributed versus centralised training runs. Most publicly available data covers the sizes of centralised compute clusters and of the models trained on them. However, if we try to regulate those, some researchers will attempt the same training runs on distributed compute clusters instead.

I want to understand the limits that fundamental physics places on how much slower a distributed training run must be compared to a centralised one. There are also practical limitations, such as distributed-training codebases being less mature, but I can see researchers eventually overcoming those. What I would find useful to read about are the genuinely physical constraints: for example, how network bandwidth and latency compare to memory bandwidth, and what that implies for gradient synchronisation between nodes.
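To make the bandwidth question concrete, here is a minimal back-of-envelope sketch of per-step gradient synchronisation time for data-parallel training over two kinds of links. All numbers (model size, link speeds, latencies, ring size) are illustrative assumptions I picked, not measurements, and the ring all-reduce cost model is the standard textbook one, not a claim about any particular framework.

```python
# Back-of-envelope: per-step gradient all-reduce time for data-parallel
# training. All constants below are illustrative assumptions.

def sync_time_seconds(param_count, bytes_per_param,
                      bandwidth_bytes_per_s, latency_s, ring_size):
    # Ring all-reduce moves roughly 2*(n-1)/n of the gradient payload
    # per node, across 2*(n-1) latency-bound steps.
    payload = param_count * bytes_per_param * 2 * (ring_size - 1) / ring_size
    return payload / bandwidth_bytes_per_s + 2 * (ring_size - 1) * latency_s

PARAMS = 70e9   # assumed 70B-parameter model
BYTES = 2       # fp16 gradients

# Assumed links: intra-cluster interconnect (400 Gb/s, ~5 us latency)
# versus the public internet (1 Gb/s, ~50 ms latency), 8 nodes each.
cluster = sync_time_seconds(PARAMS, BYTES, 400e9 / 8, 5e-6, 8)
internet = sync_time_seconds(PARAMS, BYTES, 1e9 / 8, 50e-3, 8)

print(f"cluster sync/step:  {cluster:.1f} s")
print(f"internet sync/step: {internet:.1f} s")
print(f"slowdown factor:    {internet / cluster:.0f}x")
```

Under these made-up numbers the internet-linked run pays a roughly 400x penalty per synchronisation step, which is exactly the kind of gap that techniques like gradient compression or infrequent synchronisation try to close; the article I want would pin down how far those can go.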
