2025-04-23
All data in RAM
I have found it increasingly obvious that there is no technical barrier to all the world's data ending up stored in RAM in the next 10-20 years. Disks could become obsolete for most use cases. I figured I should write about this publicly so it becomes obvious to others too.
(All data in this article is measured in petabytes for ease; 1 PB = 1024 TB. Your consumer laptop probably has 0.25-1.0 TB of storage in it, so a petabyte is a thousand times that.)
Hobby project that can be done today:
- Purchase a few LTO-9 45 TB tapes from the second-hand market and store CommonCrawl on them. You are now carrying the entire internet (plaintext) in a backpack; a rough tape count is sketched after this list.
- Write your own crawler to also crawl the websites that CommonCrawl does not.
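As a sanity check on the backpack claim, here is a back-of-envelope sketch in Python. The 0.6 PB CommonCrawl figure comes from the Data size section below; the 45 TB capacity and ~200 g cartridge weight are the LTO-9 specs cited later in this post.

    import math

    # Rough tape count, assuming CommonCrawl's WET plaintext is ~0.6 PB
    # and each LTO-9 cartridge holds 45 TB and weighs ~200 g.
    commoncrawl_tb = 0.6 * 1024                  # ~614 TB
    tapes = math.ceil(commoncrawl_tb / 45)
    print(f"~{tapes} LTO-9 tapes, ~{tapes * 0.2:.1f} kg of cartridges")   # ~14 tapes, ~2.8 kg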
Latency
As of today, RAM latency < Network latency < Disk latency
See Latency numbers every developer should know
Latency is the primary reason why developers will prefer storing this data in RAM rather than on disk or tape.
You can also read my other post on AI cloud gaming for an extended discussion on latency. The short version is that it will likely soon be possible to stream your entire computer experience from a datacenter at <10 ms latency. Since the human body operates at latencies higher than 10 ms, it will be indistinguishable to you from something running locally on your machine.
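For context, here are rough, commonly cited orders of magnitude (my own ballpark figures, not measurements of any particular system). The exact numbers vary a lot by hardware and network path, but the ordering is what matters.

    NS, US, MS = 1e-9, 1e-6, 1e-3

    # Rough order-of-magnitude latencies (ballpark figures, not benchmarks).
    latency_seconds = {
        "main memory (RAM) reference":     100 * NS,
        "round trip within a datacenter":  500 * US,
        "spinning disk seek":               10 * MS,
        "round trip across continents":    150 * MS,
        "tape library mount and seek":      60.0,     # very rough
    }

    for name, t in sorted(latency_seconds.items(), key=lambda kv: kv[1]):
        print(f"{name:32s} ~{t * 1e6:>14,.1f} µs")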
Data size
All data formats are ultimately some mix of text, image and video.
For now if we consider just text:
- CommonCrawl is an open source dump of a good fraction of the entire public internet. Its .WET plaintext files total around 0.6 PB.
- Assume hypothetically we could capture every word spoken by every person on earth. A person speaks ~10k words per day.
- Shannon's estimates are that English communicates ~1 bit per character and that a word is ~5 characters, so a person produces ~6 KB/day.
- Assume we store this data for the entire Earth's population for the past 100 years. Total data = 8B persons * ~6 KB/day/person * 365 days/year * 100 years = ~1700 PB. (This and the later estimates are worked out in the code sketch after this list.)
- My blind guess is we can get this down by at least 1-2 more orders of magnitude with more advanced compression techniques. Most times when people say something, they're not the first person in history to say it. Assuming this, we may only need to store 10-100 PB. Let's assume we need to store 100 PB.
- LLMs are probably a good example of this: Llama 3's weights fit in less than 1 TB (0.0001 PB), yet the model can say things similar enough to what most humans say most of the time.
- There's also a lot of metadata and intermediate data generated by processes and input devices, but I would be surprised if, per person, it came to an order of magnitude more than the ~6 KB/day the person generates anyway.
- All of Earth's video and image data is unlikely to fit in a similar size range. However, if we use AI to generate text descriptions of all that data, the result again fits in a comparable size.
- Assume we downsample the video to 1 frame per second and generate 100 bytes of description per frame. Each description only needs to be a diff against the previous description, and we are likely storing the data in a compressed format, not English. 100 bytes is 800 binary flags that have changed since the frame 1 second before.
- At 100 bytes/second we generate ~8500 KB/day/person. This is 3 OOMs more than the previous figure, so if we could hypothetically store this for the entire population for 100 years we get ~2,000,000 PB. This is an upper bound on all the data we would ever really need to store for most tasks.
- For completeness, let's also consider the worst case of storing all video data without much compression.
- Assume 1 hour of 4K 96 fps video takes 30 GB to store. Human perception can't differentiate video at much higher color quality or frame rate than this.
- Assume we want to store 100 years of content for 8 billion people.
- This is 210 billion PB
- It is possible that in the future we invent new types of tasks that require even more storage. For instance, we might scan humans in 3D, or collect lots of biomolecular data from them, or use AI that generates a huge amount of intermediate data (chains of thought?). As of today it's not easy for me to predict this; if someone has spent more time trying to predict this, I'd love to hear from you.
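A minimal sketch of the three estimates above, using the article's own assumptions (10k words/day at ~6 KB/day, 100 bytes/second of description, 30 GB per hour of raw video, 8B people, 100 years). The outputs differ from the rounded figures above by small factors, which doesn't matter at this level of precision.

    PB = 1e15                              # bytes; decimal petabyte is close enough here
    people = 8e9
    days = 365 * 100                       # 100 years

    # 1. Every word spoken: ~10k words/day * 5 chars/word * 1 bit/char ≈ 6 KB/day/person
    words_total = people * 6_000 * days
    print(f"every word spoken:         ~{words_total / PB:,.0f} PB")    # ~1,752 PB

    # 2. Text descriptions of all video: 100 bytes/second, every second of the day
    desc_total = people * 100 * 86_400 * days
    print(f"descriptions of all video: ~{desc_total / PB:,.0f} PB")     # ~2.5 million PB

    # 3. Raw 4K video: 30 GB per hour, 24 hours a day
    video_total = people * 30e9 * 24 * days
    print(f"raw video of everyone:     ~{video_total / PB:,.0f} PB")    # ~210 billion PB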
RAM Cost
As of 2023, RAM costs roughly $1M/PB. See Our World in Data's historical data on RAM costs.
(If you want the off-the-shelf cost for consumers, it's noticeably higher. Hetzner EX44 cloud machines cost $44/mo for 64 GB RAM, or ~$8.5M/PB/year. That price also includes the cost of network, disk, support staff, etc., not just Hetzner's profit margin.)
Seems reasonable that RAM could cost below $100k/PB by 2030. Assuming RAM lasts for 5 years before it gives up, this is $20k/PB/year.
At a hypothetical $20k/PB/year it costs (arithmetic sketched after this list):
- $12k/year to store every word on the public internet today (0.6 PB) - affordable for a rich donor or a group of friends
- $2M/year to store every word ever spoken (100 PB) - affordable for a medium-sized tech startup with Series A funding.
- $40B/year to store descriptions of video of every person ever (~2,000,000 PB) - affordable for a Big Tech corporation, assuming they can capture additional revenue of $5/person/year from 8B people to justify it.
- $4000T/year to store all video of every person ever (~200,000,000,000 PB) - not possible
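The per-scenario costs above are just the storage sizes multiplied by the hypothetical $20k/PB/year; a quick sketch:

    cost_per_pb_year = 20_000              # hypothetical 2030 RAM cost, USD/PB/year

    scenarios_pb = {
        "public internet text (CommonCrawl)":  0.6,
        "every word ever spoken (compressed)": 100,
        "descriptions of all video":           2_000_000,
        "raw video of everyone":               200_000_000_000,
    }

    for name, pb in scenarios_pb.items():
        print(f"{name:38s} ${pb * cost_per_pb_year:,.0f}/year")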
It is possible we don't end up acquiring this much data by 2030, but if so, the reasons will be cultural and incentive-related, not technical.
Also please remember these are only 2030 numbers. If Moore's law does not stop and research into reducing RAM cost does not stop, these numbers could go down by a few more OOMs by 2040 or 2050.
Disk cost, disk weight
I figured I'd include this section just for completeness.
As of 2025, AWS S3 Glacier Deep Archive offers tape storage for ~$12k/PB/year. Assume this cost reduces to $1000/PB/year by 2030.
At a hypothetical $1000/PB/year it costs:
- $600/year to store every word on the public internet as of 2025 - trivially affordable
- $100k/year to store every word ever spoken (100 PB) - affordable for a group of family/friends to pool, or for a rich donor
- $2B/year to store descriptions of every video ever (~2,000,000 PB) - affordable for a Big Tech corporation even if they generate no additional revenue by doing so.
- $200T/year to store all video of every person ever (~200,000,000,000 PB) - not possible
In terms of weight, as of 2025, LTO-9 stores 45 TB at a weight of 200 grams, or ~4.4 kg/PB. Assume this too improves 10x, so it will weigh ~0.44 kg/PB by 2030 (weights sketched in code after this list).
- ~0.25 kg for every word on the public internet as of 2025 (0.6 PB) - fits in your pocket
- 44 kg for every word ever spoken on Earth (100 PB) - fits in a school bag
- 880 metric tonnes for (descriptions of) every video ever (~2,000,000 PB) - a typical freight train carries 5000 metric tonnes, a typical aircraft carries 100 metric tonnes.
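The disk costs follow the same multiplication as the RAM sketch earlier; here is just the weight arithmetic at the assumed 2030 density of ~0.44 kg/PB:

    kg_per_pb = 0.44                       # assumed 2030 tape density (10x better than LTO-9 today)

    for name, pb in [
        ("public internet text",      0.6),
        ("every word ever spoken",    100),
        ("descriptions of all video", 2_000_000),
    ]:
        print(f"{name:28s} ~{pb * kg_per_pb:,.1f} kg")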
Data transport
Cold storage
- If the data is in cold storage
- If you can afford to buy N disks or tapes, you can probably also afford to hide N disks or tapes. All such data will be very easy to transport both legally and illegally. Any govt or large company with govt support can smuggle an aircraft's worth of freight for example.
- It will be very easy to make multiple copies of this data and very easy to destroy individual copies.
Datacenters
- If the data is online in a datacenter
- Every word ever spoken (100 PB) can be stored in a secret datacenter whose location remains unknown
- Descriptions of every video ever (~2,000,000 PB) will probably get stored in a datacenter whose location is known. Maybe you can split this into multiple small datacenters and hide it; I'm unsure how technically feasible this is.
- Datacenters today
- As per this xkcd, the Google and NSA datacenters are the largest, roughly on the order of magnitude of 10,000 PB. GPS locations of these datacenters are public information, along with estimates of their power and water consumption.