I have spent an impressively large amount of time confused by the difference between access time (aka latency) and throughput. Hence I thought it was worth writing this post for myself and other developers.
The numbers
Disclaimer
Numbers may have errors. I haven't personally checked benchmarks for every single number listed here.
I'm only aiming to be accurate up to one order of magnitude, for a cloud server as of 2026.
Throughput
(for a transfer between two components, typically just pick the slower of the two, and use its numbers)
GPU RAM to GPU compute - 10 TB/s
CPU RAM to CPU compute - 10 GB/s
CPU RAM to GPU RAM - 10 GB/s
SSD to GPU RAM - 1 GB/s
SSD to CPU RAM - 1 GB/s
HDD to GPU RAM - 100 MB/s
HDD to CPU RAM - 100 MB/s
HDD to SSD - 100 MB/s
Network to GPU RAM - 100 MB/s
Network to CPU RAM - 100 MB/s
Network to SSD - 100 MB/s
Network to HDD - 100 MB/s
Minimum amount of data transferred, for the total transfer time to be dominated by throughput rather than by access time
Theoretical
GPU RAM - 0.01 us / (0.0001 us/KB) = 100 KB
CPU RAM - 0.1 us / (0.1 us/KB) = 1 KB
SSD - 100 us / (1 us/KB) = 100 KB
HDD - 10 ms / (10 us/KB) = 1 MB
Network - 100 ms / (10 us/KB) = 10 MB
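As a sanity check, here's a small Python sketch (using only the rounded numbers above) that recomputes these break-even sizes as access time × throughput:

```python
# Order-of-magnitude numbers from above: (access time in seconds, throughput in bytes/sec)
components = {
    "GPU RAM": (0.01e-6, 10e12),
    "CPU RAM": (0.1e-6, 10e9),
    "SSD":     (100e-6, 1e9),
    "HDD":     (10e-3, 100e6),
    "Network": (100e-3, 100e6),
}

for name, (access_time, throughput) in components.items():
    # The break-even size is where the throughput term equals the access time.
    break_even_kb = access_time * throughput / 1e3
    print(f"{name:8s} ~ {break_even_kb:,.0f} KB")
```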
Heuristics for thinking about access time versus throughput
For very small amounts of data, the total time is dominated by access time. For sequential access to large amounts of data, the total time is dominated by throughput. For non-sequential access to large amounts of data, the total time might again get dominated by access time. If you want to know how long it takes to transfer 1 KB, you should not look at throughput, you should look at access time. If you want to know how long it takes to transfer 1 GB, you should look at throughput.
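A minimal sketch of that heuristic, modelling the total time as access time plus size divided by throughput (SSD numbers from above):

```python
def transfer_time(size_bytes, access_time_s, throughput_bytes_per_s):
    """Rough model: pay the access time once, then do a sequential transfer."""
    return access_time_s + size_bytes / throughput_bytes_per_s

# SSD: ~100 us access time, ~1 GB/s throughput
for size in (1e3, 1e9):  # 1 KB and 1 GB
    t = transfer_time(size, access_time_s=100e-6, throughput_bytes_per_s=1e9)
    print(f"{size:>13,.0f} bytes: {t:.6f} s")
# 1 KB comes out to ~0.0001 s (almost all access time), 1 GB to ~1 s (almost all throughput).
```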
If you are transferring small amounts of data and running into any of the access time bottlenecks, the best solution is usually to modify your algorithms and data structures so that you transfer larger amounts of data sequentially.
There are lots of complicated data structures, storage and caching schemes for this. Some of these are built into your hardware or OS or database system or whatever, and some need to be handwritten for your specific use case.
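As a toy example of what "make it sequential" can look like: if you need many small records scattered across a large file, sorting the offsets before reading turns a pile of random seeks into a mostly sequential scan. (The file name and record size below are made up for illustration.)

```python
import random

RECORD_SIZE = 4096  # bytes per fixed-size record (assumed for illustration)

def read_records(path, record_ids):
    """Read the requested records sorted by offset, so the disk sees a
    mostly sequential access pattern instead of one random seek per record."""
    results = {}
    with open(path, "rb") as f:
        for rid in sorted(record_ids):
            f.seek(rid * RECORD_SIZE)
            results[rid] = f.read(RECORD_SIZE)
    return results

# e.g. 10,000 records picked at random out of a million
wanted = random.sample(range(1_000_000), 10_000)
records = read_records("data.bin", wanted)
```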
For many tasks nowadays, it is a good idea to sequentially load large chunks of the data into RAM (either from disk or network) and do all your operations in RAM. You can do this explicitly using a RAM filesystem or implicitly using a bash pipeline. You can now rent machines with 128 GB or even 1 TB of RAM on Hetzner or AWS or similar, which is large enough to fit many datasets.
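For the explicit version: on most Linux machines /dev/shm is a tmpfs, i.e. a RAM-backed filesystem, so one way to do this is to copy the data there once and let every subsequent read be served from RAM. (Paths below are hypothetical; the copy disappears on reboot.)

```python
import shutil

src = "/data/big_dataset.csv"          # hypothetical path on disk
ram_copy = "/dev/shm/big_dataset.csv"  # tmpfs: lives in RAM

shutil.copy(src, ram_copy)             # one large sequential read from disk

with open(ram_copy, "rb") as f:
    for line in f:                     # these reads come from RAM, not disk
        ...                            # process each line here
```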
Network throughput has recently been overtaking disk throughput. This changes the optimal solution (only) for jobs that are sequential or can be made at least somewhat sequential.
You can in theory build a machine with a 1 Gbps network (~125 MB/s) and a slow 100 MB/s HDD, making the network throughput higher than the disk throughput. For jobs involving sequential access to a large amount of data, this means the optimal solution is probably to download the data straight to RAM, process it, and upload it elsewhere, without touching the disk at all. Historically this was not done, and all data was downloaded to disk first.
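A sketch of what "straight to RAM" can look like in practice, assuming the requests library and made-up URLs: stream the download in large chunks, process each chunk in memory, and stream the upload, without ever writing to disk.

```python
import requests  # assumes the requests library is installed

SRC = "https://example.com/big_dataset.csv"  # hypothetical source
DST = "https://example.com/upload"           # hypothetical destination

def process(chunk: bytes) -> bytes:
    return chunk.upper()  # stand-in for whatever processing you actually do

def processed_chunks():
    # Large sequential chunks, held only in RAM.
    with requests.get(SRC, stream=True) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):  # 8 MB
            yield process(chunk)

# requests accepts a generator as the body, so the upload streams as well.
requests.post(DST, data=processed_chunks())
```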
For jobs involving non-sequential access to the data, it is possible that downloading it to the disk first is still useful, as you are now bottlenecked by access time, and disk access time is still much faster than network access time. You can try caching a fraction of the dataset from network to RAM directly (by intelligently picking which fraction). However, if your dataset is too large and the task can't be made sequential at all, this won't help, and you will still need to use the disk. (I often wondered why you can't purchase a machine without a disk; this is probably why manufacturers keep shipping them.)
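One simple (not especially intelligent) version of that caching idea is a least-recently-used cache over network fetches, so only cache misses pay the ~100 ms network access time. Assuming the requests library and a made-up shard URL scheme:

```python
from functools import lru_cache

import requests  # assumes the requests library is installed

BASE_URL = "https://example.com/shards"  # hypothetical

@lru_cache(maxsize=256)  # keep the 256 most recently used shards in RAM
def get_shard(shard_id: int) -> bytes:
    # Only a cache miss pays the network access time.
    resp = requests.get(f"{BASE_URL}/{shard_id}")
    resp.raise_for_status()
    return resp.content
```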
Sequential versus non-sequential for different components
(I haven't worked much on low-level GPU code, so I won't give a strong opinion on how to optimise that. In theory, making non-sequential reads sequential should help whenever each read is below 100 KB. GPU data smaller than 100 KB typically means individual floats or vectors, not matrices. So, for example, you should prefer transferring whole matrices from GPU RAM to GPU compute over transferring individual floats or vectors.)
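The same principle is easiest to show from Python with host-to-device copies (assuming PyTorch and a CUDA GPU): one big transfer pays the access time once, a loop of tiny transfers pays it thousands of times.

```python
import torch

x = torch.randn(4096, 4096)  # ~67 MB of float32 in CPU RAM

# Slow: one tiny ~16 KB transfer per row pays the access time 4096 times.
rows_on_gpu = [x[i].to("cuda") for i in range(x.shape[0])]

# Fast: one large sequential transfer, dominated by throughput.
x_on_gpu = x.to("cuda")
```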
For CPU RAM, sequential reads are often not that much faster than non-sequential reads. Unless the RAM size and dataset size are both very large, you don't usually have to think much about how the data is stored inside the RAM, or what order it is read or written in. (You do have to think about L1/L2 caches and branch prediction, which is a whole other topic and out of scope for this post.)
For HDD and SSD, usually sequential read is much faster than non-sequential read. There are therefore lots of efficient data structures and caching schemes to convert non-sequential tasks into sequential ones.
For network, non-sequential reads (aka opening new network connections) are so much slower than sequential reads (aka reading more from an existing connection) that you can often ignore the time taken by the sequential reads, and just count how many non-sequential reads took place. If you are crawling 10 MB of data each from 10 different websites over a 1 Gbps connection, it will take roughly as long as crawling 10 KB of data each from 10 different network locations.
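When the non-sequential reads are against the same host, you can often make them "sequential" by reusing one connection. A sketch with the requests library (URLs hypothetical): a Session keeps the underlying TCP/TLS connection open, so the setup cost is paid roughly once instead of once per request.

```python
import requests  # assumes the requests library is installed

urls = [f"https://example.com/page/{i}" for i in range(100)]  # hypothetical

# Slow: each call may open a fresh connection (a non-sequential read).
pages = [requests.get(u).content for u in urls]

# Faster: the Session reuses the connection to the same host.
with requests.Session() as s:
    pages = [s.get(u).content for u in urls]
```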
Side Note: Caches all the way down
I find it helpful to think of all memory/storage as caches for other memory/storage
L1 cache caches a subset of RAM
RAM caches a subset of disk
Local disk caches a subset of all data (on all disks) available via internet
All disks on the internet cache a subset of all data recorded by sensors
All recorded data is a subset of all information humanity generates via its behaviour
Caching is then basically the problem of figuring out which subsets are worth storing in which place and when.