2025-06-11
Open Source Search (Summary)
Disclaimer
Summary
- This document describes how to build an open source search engine for the entire internet that runs on a residential server.
- As of 2025 it will cost between $100k and $1M to build and host this server. This cost will fall every year as GPU, RAM and disk prices reduce.
- Most expensive step is GPU capex to generate embeddings for the entire internet.
- Most steps can be done using low-complexity software such as bash scripts (curl --multi, htmlq -tw, curl -X "$LLM_URL", etc).
Main
Why?
- I realised my posts on this topic are sprawling all over the place, without one post to summarise it all. Hence this post.
- If someone donates me $1M I might consider building this. I've written code for more than half the steps, and no step here seems impossibly hard.
Use cases of open source search
- Censorship-resistant internet
- aka internet with no delete button aka Liu Cixin's dark forest
- Any data that reaches any server may end up searchable by everyone, and backed up by everyone forever.
- You can read my other posts for more on the implications of this.
- Privacy-preserving search
- In theory, it will become possible to run searches on an airgapped tails machine. Search indices can be stored on read-only media and memory can be wiped on reboot.
- As of 2025, a handful of intelligence agencies have exclusive access to everyone's thoughts, as everyone is dependent on centrally hosted search engines.
- Search for jobs, houses, friends, partners, etc without relying on a tech company.
- Most tech companies exist just to provide search functionality, and to set the incentives and culture so that both sides upload their data online.
- Not having to rely on them would mean better incentives and culture can be set, and a lower cost-of-exit.
- Niche discovery
- Higher quality search (due to LLMs) makes it easier to connect people interested in a niche. Might be able to spawn subcommunities more easily based on shared thoughts or actions.
- Attention on the internet is currently very heavy-tailed; a few youtubers and social media companies have all of Earth's attention. This phenomenon might weaken.
- Governance
- Can build political applications such as liquid democracy, distributed social media, etc if no politically or economically motivated group can censor data or alter search rankings or upvote counts.
Hardware costs in 2025
Important: All these prices are dropping exponentially. Try forecasting prices in 2030 or 2035. We will eventually end up with the entire text internet stored in your pocket.
Prices taken from hetzner server auction, vast.ai, aws s3 deep archive
Rented
- Compute/Memory
- CPU compute = ($4/thread/mo) / (3.5 GHz/thread) = $1.1/GHz/mo ≈ $0.07 per (GFLOP/s)/mo (assuming ~16 FLOP/cycle)
- CPU RAM = $0.40/GB/mo
- CPU throughput = effectively unlimited; either disk or network throughput is the bottleneck
- GPU compute = ($1.40/h) / (50 TFLOP/s) = $0.008/PFLOP
- GPU RAM = $10/GB/mo
- GPU throughput = ($2/h) / (50 GB/s) ≈ $28 per (GB/s)/mo
- Storage
- SSD = $10/TB/mo
- HDD = $2/TB/mo
- Tape = $1/TB/mo
- SSD throughput = ($20/mo) / (0.5 GB/s) = $16/PB
- HDD throughput = ($0.50/mo) / (100 MB/s) = $2/PB
- Network
- Network throughput = ($48/mo) / (10 Gbps) = $0.015/TB
Self-hosted
- typically 1-100x cheaper than renting
- the price gap between self-hosted and rented/cloud is largest for storage, compared to compute or memory
- Extreme examples
- SSD throughput = ($300/5y) / (10 GB/s) = $0.20/PB
- this is 80x cheaper than rented
- Tape (second-hand LTO-9) = $2/TB/30y = $0.0055/TB/mo
- this is 180x cheaper than rented
Storage
- In theory it is possible to do all the steps below in batches, so that a single node with 8 TB RAID0 is sufficient to crawl, extract, generate indices and store indices for 2 PB of internet plaintext.
- In practice you will likely use a network-based filesystem like ceph. All steps below are fully parallelisable, so the separate nodes don't need high throughput between them.
- Raw plaintext can be stored on second-hand LTO-9 tapes
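A minimal sketch of appending one batch of plaintext to tape; the device path /dev/nst0, the batch directory name and the block size are untested assumptions:
mt -f /dev/nst0 eod                               # seek past existing data so the batch is appended
tar -b 512 -cvf /dev/nst0 plaintext/batch_0042/   # 512 x 512 B = 256 KiB tar blocks suit LTO drives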
Crawling
Figures taken from commoncrawl and internet archive
Figures taken from my own (bad) benchmark
- Crawl rate > (30 HTTP headers / s) / (4 cores) = (~100M headers / mo) / (4 cores)
- Compute required < (1T headers) / (25M headers/mo/core) = 40k core-months
- Compute cost < 40k core-months * $4/core/mo = $160k
- Requesting header involves DNS lookup, TCP handshake, TLS handshake.
- TLS handshake requires compute to multiply large prime numbers; this is likely the real bottleneck. (Not tested.)
- Linux allows increasing the number of open sockets beyond 4096, and the network buffer has enough space to manage this. With a 10 s timeout and 30 headers/s, ~300 connections are open at once.
- Commoncrawl CDX files provide an initial 1T URLs from which to seed the crawl.
Software
- parallel curl requests are sufficient, nothing fancy; see the sketch below. (Not tested on full dataset)
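A minimal sketch of the header crawl, assuming GNU xargs and a urls.txt with one URL per line (e.g. extracted from commoncrawl CDX files); the concurrency level and file layout are untested assumptions:
mkdir -p headers
# one curl per URL, up to 300 in flight, 10 s timeout; output file named by URL hash
<urls.txt xargs -d '\n' -P 300 -n 1 sh -c \
  'curl --silent --head --max-time 10 "$1" \
     > "headers/$(printf "%s" "$1" | md5sum | cut -d" " -f1)"' _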
Plaintext extraction
Figures taken from commoncrawl and internet archive
Data size
- Total plaintext on internet = 2 PB
- Google and NSA datacenters are currently 10,000 PB, mostly for storing video
- Theoretical max plaintext = (10k words /day/person) * 5 bits/word * 8B people * 100y = 1,700 PB
- Theoretical max video (downsampling/embedding gen to 100 bytes/s using AI) = 100 B/s * 8B people * 100y = 2,000,000 PB
- video downsampling cost not considered
Plaintext extraction cost
- compute required to extract plaintext from a webpage is typically less than the compute required for the TLS initialisation for that webpage
- all processing can occur in RAM to avoid hitting disk I/O bottleneck.
Software
do something | htmlq -tw | do something
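A hedged, more concrete version of the pipeline above, assuming raw pages sit in /dev/shm/pages so processing stays in RAM; the 'body' selector is a guess at what to keep:
mkdir -p /dev/shm/plaintext
for f in /dev/shm/pages/*.html; do
  # -t prints text nodes only, -w collapses whitespace
  htmlq -tw 'body' < "$f" > "/dev/shm/plaintext/$(basename "$f" .html).txt"
done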
Embedding generation
Algorithm used
- BM25 or similar
- LLM embedding search
- searches concepts not keywords
- significantly outperforms BM25, although ideal system uses both
Figures taken from openai text-embedding-3-small
Embedding generation cost
- openai price = $0.01/1M input tokens
- assume 200 tokens per chunk, no overlap between chunks
- open source naive price = 175B params * $0.008/PFLOP * 1 FLOP/param/output vector / (200 input tokens/ output vector) = $0.007/1M input tokens
- Embedding generation cost = $0.01/1M input tokens * 2 PB * (1 input token/ 5 bytes) = $4.2M
Performance
- as of 2025, text-embedding-3-small outperforms many models that are overfit to MTEB
- it's also cheaper than hosting it yourself
Software
- Bash pipelines are sufficient; see the sketch below.
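A minimal sketch, assuming an OpenAI-compatible embeddings endpoint in $LLM_URL and a key in $OPENAI_API_KEY; ~150 words is used as a rough stand-in for 200 tokens, and one request per chunk is far slower than batching:
LLM_URL=${LLM_URL:-https://api.openai.com/v1/embeddings}
# squeeze whitespace, group ~150 words per chunk, embed each chunk, keep one JSON vector per line
tr -s '[:space:]' ' ' < doc.txt | xargs -n 150 | while IFS= read -r chunk; do
  jq -n --arg t "$chunk" '{model: "text-embedding-3-small", input: $t}' \
    | curl --silent -X POST "$LLM_URL" \
        -H "Authorization: Bearer $OPENAI_API_KEY" \
        -H "Content-Type: application/json" \
        --data @- \
    | jq -c '.data[0].embedding' >> embeddings.jsonl
done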
Embedding search
Algorithm used
- Search in RAM or disk
- RAM search takes a few seconds - might be optimisable to below 100 ms, but will require custom work on RAM heap management
- Disk search is bottlenecked by disk throughput - >8 GB/s SSDs are available locally but not on cloud.
- FAISS index factory (see Pinecone's guide) - different algos use different bytes per vector, and give different search times and recall.
- Graph-based algos like HNSW and pinecone proprietary algos outperform geometric algos like LSH, k-means clustering, product quantisation, etc
- Future algos
- Seems likely that a better graph algo than HNSW will be discovered in next few years.
- My basic intuition is you want to put 1T vectors into 1M clusters with 1M vectors each, then have a fast way to check which cluster is likely to contain matches to query vector. Then brute force search those clusters.
Software
- One-click software - dragonflyDB for RAM, qdrant for disk
- Can load databases using bash pipelines (see the sketch after this list)
- As of 2025, many other implementations exist but require more work to use.
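A minimal sketch of loading vectors into qdrant over its REST API and running one search; the collection name, the 1536-dim size (text-embedding-3-small) and the localhost:6333 endpoint are assumptions, and a real load would batch the upserts instead of slurping the whole file:
# create a collection for 1536-dim vectors with cosine distance
curl --silent -X PUT 'http://localhost:6333/collections/internet' \
  -H 'Content-Type: application/json' \
  -d '{"vectors": {"size": 1536, "distance": "Cosine"}}'
# upsert every line of embeddings.jsonl as a point, id = line number
jq -c '{id: input_line_number, vector: .}' embeddings.jsonl \
  | jq -s -c '{points: .}' \
  | curl --silent -X PUT 'http://localhost:6333/collections/internet/points?wait=true' \
      -H 'Content-Type: application/json' --data @-
# search: query_embedding.json holds the embedding of the query text
jq -c '{vector: ., limit: 5}' query_embedding.json \
  | curl --silent -X POST 'http://localhost:6333/collections/internet/points/search' \
      -H 'Content-Type: application/json' --data @-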
Latency
If you're hosting the app locally for your own use, latency does not matter; only search time matters. If you're hosting it for other people to use, latency can be relevant, and it may make sense to host multiple edge servers.
- Human body latency
- Nerve conduction latency = 1m / (100 m/s) = 10 ms
- Optical latency = (90 fps)^(-1) = 10 ms
- Computer latency
- CPU, RAM, disk, GPU, I/O device latencies are much below 10 ms
- Network latency is 100 ms for round-trip EU to US, due to speed of light.
- 1 Gbps fibre optic is now popular, sufficient to transmit HD video at below 10 ms latency with no encoding or compression
- 3D or VR data still can't be sent uncompressed