2025-07-21
Open Source Search (Summary)
Disclaimer
- Quick note
- I support a complete ban on frontier AI R&D. This app requiring AI doesn't change that. If the ban is restrictive enough to also prevent the app described below from being built, I might be okay with that.
Summary
- This document describes how to build an open source search engine for the entire internet that runs on a residential server
- As of 2025 it'll cost between $100k and $1M to build and host this server. This cost will drop with every passing year, as GPU, RAM and disk prices fall.
- The most expensive step is the GPU capex to generate embeddings for the entire internet.
- Most steps can be done using low-complexity software such as bash scripts (curl --parallel, htmlq -tw, curl -X POST "$LLM_URL", etc)
Main
Why?
- I realised my posts on this topic sprawl all over the place, without one post to summarise them all. Hence this post.
- If someone donates $1M to me I might consider building this. I've written code for more than half the steps, and no step here seems impossibly hard.
Use cases of open source search
- Censorship-resistant backups
- aka internet with no delete button aka Liu Cixin's dark forest
- Any data that reaches any server may end up backed up by people across multiple countries forever.
- You can read my other posts for more on the implications of censorship-resistant backups and discovery.
- Censorship-resistant discovery
- Any data that reaches any server may end up searchable by everyone forever.
- Currently each country's govt bans channels and websites that it finds threatening. It is harder to block a torrent of a qdrant snapshot than to block a static list of IP addresses and domains. This will reduce the cost-of-entry/exit for a new youtuber.
- Since youtubers can potentially run for govt, subscribing to a youtuber is a (weak) vote for their govt.
- Privacy-preserving search
- In theory, it will become possible to run searches on an airgapped tails machine. Search indices can be stored on read-only media and memory can be wiped on reboot.
- As of 2025, a handful of intelligence agencies have exclusive access to everyone's thoughts, as everyone is dependent on centrally hosted search engines. This could change.
- Search for jobs, houses, friends, partners, etc without relying on a tech company.
- Most tech companies exist just to provide search functionality, and to set the incentives and culture so that both sides upload their data online.
- Not having to rely on them would mean better incentives and culture can be set. Lower cost-of-exit.
- Niche discovery
- Higher quality search (due to LLMs) makes it easier to connect people interested in a niche. Might be able to spawn subcommunities more easily based on shared thoughts or actions.
- Attention on the internet is currently very heavy-tailed; a few youtubers and social media companies capture most of Earth's attention. This phenomenon might weaken.
- Governance
- Can build political applications such as liquid democracy, distributed social media, etc if no politically or economically motivated group can censor data or alter search rankings or upvote counts.
Similar projects
(incomplete list, as of 2025-07 I'm not affiliated with any project listed here)
- Internet Archive, CommonCrawl, gotham-grabber by Freedom of the Press Foundation
- torrent, IPFS, filecoin, veilid
- long list of (mostly failed) decentralised social media projects
Final output (after steps 0 to 5)
- Distribute a torrent of the plaintext and the embedding search database snapshots (qdrant or dragonflyDB or similar) as mentioned below.
- Distribute code for all steps of the pipeline mentioned below. Torrent the code if github or similar website removes it from their site.
Step 0: Estimate hardware costs in 2025
Important: All these prices are dropping exponentially. Try forecasting prices in 2030 or 2035. We will eventually end up with the entire text internet stored in your pocket.
Prices taken from hetzner server auction, vast.ai, aws s3 deep archive
Rented
- Compute/Memory
- CPU compute = ($4/thread/mo) / (3.5 GHz/thread) = $1.1/B cycles = $0.07/GFLOP
- CPU RAM = $0.40/GB/mo
- CPU throughput = infinity, either disk or network throughput is the bottleneck
- GPU compute = ($1.40/h) / (50 TFLOP/s) = $0.008/PFLOP
- GPU RAM = $10/GB/mo
- GPU throughput = ($2/h) / (50 GB/s) = $28/TB
- Storage
- SSD = $10/TB/mo
- HDD = $2/TB/mo
- Tape = $1/TB/mo
- SSD throughput = ($20/mo) / (0.5 GB/s) = $16/PB
- HDD throughput = ($0.50/mo) / (100 MB/s) = $2/PB
- Network
- Network throughput = ($48/mo) / (10 Gbps) = $0.015/TB
Self-hosted
- typically 1-100x cheaper than cloud
- price gap between self-hosted and rented is largest for storage, compared to compute or memory
- Extreme examples
- SSD throughput = ($300/5y) / (10 GB/s) = $0.20/PB
- this is 80x cheaper than rented
- Tape (second-hand LTO-9) = $2/TB/30y = $0.0055/TB/mo
- this is 180x cheaper than rented
Estimate storage required
- In theory it is possible to do all the steps below in batches, so that a single node with 8 TB of RAID0 is sufficient to crawl, extract, generate indices and store indices for 2 PB of internet plaintext.
- In practice you will likely use a network-based filesystem like ceph. All steps below are fully parallelisable, so the separate nodes don't need high throughput between them.
- Raw plaintext can be stored on second-hand LTO-9 tapes
Step 1: Crawling
Figures taken from commoncrawl and internet archive
Figures taken from my own (bad) benchmark
- Crawl rate > (30 HTTP headers / s) / (4 cores) = (~100M headers / mo) / (4 cores)
- Compute required < (1T headers) / (25M headers/mo/core) = 40k core mo
- Compute cost < 40k core mo * $4/core/mo = $160k
- Requesting a header involves a DNS lookup, a TCP handshake and a TLS handshake.
- The TLS handshake requires compute for public-key cryptography; this is likely the real bottleneck. (Not tested.)
- Linux allows increasing the number of open sockets beyond 4096, and the network buffers have enough space to manage this. With a 10 s timeout and 30 headers/s, ~300 connections are open at once.
- Commoncrawl CDX files provide an initial 1T URLs from which to seed the crawl.
Software
- Parallel curl requests are sufficient, nothing fancy. (Not tested on the full dataset.) A minimal sketch follows.
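This sketch assumes urls.txt holds one URL per line; it fetches only response headers (matching the benchmark above) and logs a URL and status code per response:

#!/usr/bin/env bash
# Header-only crawl sketch: 10 s timeout per request, ~300 connections in flight.
<urls.txt xargs -P 300 -I{} \
  curl --head --silent --location --max-time 10 \
       --write-out '{} %{response_code}\n' --output /dev/null {} \
  >> crawl_log.txt

A real crawl would add robots.txt handling, per-domain rate limits and retries.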
Prioritised public datasets
If you can't do the entire internet, here's some datasets you might want to prioritise:
- personal blogs - searchmysite.net 3k blogs, substack, livejournal
- video transcripts - youtube, rumble, rutube, bilibili, youku - at least the top 10k channels each by subscriber count. Use yt-dlp or similar to scrape.
- forums - hackernews, reddit, lesswrong, stackexchange
- books, papers - arxiv, libgen, wikipedia
- code - github
- leaked datasets - wikileaks, distributed denial of secrets
- social media - discord public (discord unveiled) and private, insta public and private, twitter public and private etc - takes effort to get scrapes
It is important to do most of them, not just a few, otherwise your app won't be competitive with existing apps.
Private datasets
Also provide software to create private datasets
- Collect data of every keystroke on private machine. Retrieved using keylogger
- Collect data of all previous AI inputs and outputs, both local LLM calls and API calls. Retrieved by sniffing network traffic locally.
- Collect data from every webpage visited on private machine. Retrieved through browser cache directory.
- Collect data from all user-generated files on private machine. Retrieved directly.
Each user will likely have to individually create their own private dataset, as some of this data may not be present in any public dataset.
Step 2: Plaintext extraction
Figures taken from commoncrawl and internet archive
Data size
- Total plaintext on internet = 2 PB
- Google and NSA datacenters are currently 10,000 PB, mostly for storing video
- Theoretical max plaintext = (10k words /day/person) * 5 bits/word * 8B people * 100y = 1,700 PB
- Theoretical max video (downsampling/embedding gen to 100 bytes/s using AI) = 100 B/s * 8B people * 100y = 2,000,000 PB
- video downsampling cost not considered
Plaintext extraction cost
- The compute required to extract plaintext from a webpage is typically less than the compute required for TLS initialisation for that webpage.
- All processing can occur in RAM to avoid hitting the disk I/O bottleneck.
Software
do something | htmlq -tw | do something
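Expanding the placeholder pipeline above into a runnable sketch, assuming crawled HTML is stored one file per page under html/ and plaintext is written under text/:

#!/usr/bin/env bash
# Plaintext extraction sketch: html/*.html in, text/*.txt out.
mkdir -p text
for f in html/*.html; do
  # -t prints only text nodes, -w skips whitespace-only text nodes
  htmlq -tw 'body' < "$f" > "text/$(basename "$f" .html).txt"
done

At 2 PB scale this loop would be parallelised (e.g. with xargs -P) and fed directly from the crawler's output stream, keeping everything in RAM as noted above.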
Step 3: Embedding generation
Algorithm used
- BM25 or similar
- LLM embedding search
- searches concepts not keywords
- significantly outperforms BM25, although an ideal system uses both
Figures taken from openai text-embedding-3-small
Embedding generation cost
- openai price = $0.01/1M input tokens
- assume 200 tokens per chunk, no overlap between chunks
- open source naive price = 175B params * $0.008/PFLOP * 1 FLOP/param/output vector / (200 input tokens/ output vector) = $0.007/1M input tokens
- Embedding generation cost = $0.01/1M input tokens * 2 PB * (1 input token/ 5 bytes) = $4.2M
Performance
- as of 2025, text-embedding-3-small outperforms many models that are overfit to MTEB
- it's also cheaper than hosting a model yourself
Software
- Bash pipelines are sufficient; a sketch follows below
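A minimal sketch using the openai embeddings endpoint; it assumes OPENAI_API_KEY is set, chunks.txt holds one roughly 200-token chunk per line, and jq is installed. A real run would batch many chunks per request to cut HTTP overhead.

#!/usr/bin/env bash
# Embed each chunk and write one JSON vector per line.
while IFS= read -r chunk; do
  curl -s https://api.openai.com/v1/embeddings \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg t "$chunk" '{model: "text-embedding-3-small", input: $t}')" \
    | jq -c '.data[0].embedding'
done < chunks.txt > embeddings.jsonl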
Step 4: Embedding search
Algorithm used, Search time
- Search in RAM or disk
- RAM search takes a few seconds - might be optimisable to below 100 ms, but that will require custom work on RAM heap management
- Disk search is bottlenecked by disk throughput - >8 GB/s SSDs are available locally but not on cloud.
- FAISS index factory (see Pinecone's guide) - different algos use a different number of bytes per vector, and give different search times and recall.
- Graph-based algos like HNSW and Pinecone's proprietary algos outperform geometric algos like LSH, k-means clustering, product quantisation, etc
- Future algos
- It seems likely that a better graph algo than HNSW will be discovered in the next few years.
- My basic intuition is that you want to put 1T vectors into 1M clusters of 1M vectors each, then have a fast way to check which clusters are likely to contain matches to the query vector, then brute-force search those clusters.
Software
- One-click software - dragonflyDB for RAM, qdrant for disk
- Can load the databases using bash pipelines, as sketched below
- As of 2025, many other implementations exist but require more work to use.
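A sketch of loading the vectors from step 3 into qdrant over its REST API. It assumes qdrant is running on localhost:6333; the collection name "web" is arbitrary, 1536 is the text-embedding-3-small vector size, and the payload fields are placeholders for whatever tags you attach per chunk. Bulk loading would batch many points per request.

#!/usr/bin/env bash
# Create the collection, then upload one point per line of embeddings.jsonl.
curl -s -X PUT localhost:6333/collections/web \
  -H 'Content-Type: application/json' \
  -d '{"vectors": {"size": 1536, "distance": "Cosine"}}'

id=0
while IFS= read -r vec; do
  id=$((id + 1))
  curl -s -X PUT localhost:6333/collections/web/points \
    -H 'Content-Type: application/json' \
    -d "{\"points\": [{\"id\": $id, \"vector\": $vec, \"payload\": {\"source\": \"web\", \"timestamp\": 0}}]}" \
    > /dev/null
done < embeddings.jsonl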
Search filters
- Tags based on data source
- Users can specify that they only want to search reddit, or only their own keylogs, etc
- Filtering by source tag is important
- Tags based on content
- The embeddings can be used to generate automated tags: politics, sports, tech, etc
- Filtering by content tag is important
- Timestamps
- Filtering by time interval is important (see the filter sketch below)
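A sketch of a filtered search against the qdrant collection above; the source and timestamp payload keys are assumptions about how points were tagged at load time, and query.json is a placeholder holding the query embedding as a JSON array:

# Search the "web" collection, restricted to reddit documents from 2024.
vec=$(cat query.json)
curl -s -X POST localhost:6333/collections/web/points/search \
  -H 'Content-Type: application/json' \
  -d "{\"vector\": $vec, \"limit\": 10, \"with_payload\": true,
       \"filter\": {\"must\": [
         {\"key\": \"source\", \"match\": {\"value\": \"reddit\"}},
         {\"key\": \"timestamp\", \"range\": {\"gte\": 1704067200, \"lte\": 1735689599}}]}}"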
Latency
If you're hosting the app locally for your own use, latency does not matter, only search time matters. If you're hosting it for other people to use, latency can be relevant, and it may make sense to host multiple edge servers.
- Human body latency
- Nerve conduction latency = 1m / (100 m/s) = 10 ms
- Optical latency = (90 fps)^(-1) = 10 ms
- Computer latency
- CPU, RAM, disk, GPU, I/O device latencies are much below 10 ms
- Network latency is 100 ms for round-trip EU to US, due to speed of light.
- 1 gbps fibre optic now popular, sufficient to transmit HD video at below 10 ms latency with no encoding or compression
- 3D or VR data still can't be sent uncompressed
Step 5: Tool use
Make sure AI models can also query the embedding search databases. For example, you can put a wrapper in front of qdrant to convert it into an MCP server; a sketch of such a wrapper follows below.
Reasoning models can reason about the source and context of any given piece of information, and estimate how likely it is to match ground-truth.
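A sketch of the wrapper script an agent could call as a tool (an MCP server would expose the same function); EMBED_CMD is a hypothetical command that prints the query embedding as a JSON array, standing in for whichever embedding setup was used in step 3:

#!/usr/bin/env bash
# search_tool.sh "query text" [source_tag] - prints the top 10 matches from qdrant.
query="$1"; source="${2:-}"
vec=$("$EMBED_CMD" "$query")   # hypothetical helper, outputs a JSON array
if [ -n "$source" ]; then
  body="{\"vector\": $vec, \"limit\": 10, \"with_payload\": true, \"filter\": {\"must\": [{\"key\": \"source\", \"match\": {\"value\": \"$source\"}}]}}"
else
  body="{\"vector\": $vec, \"limit\": 10, \"with_payload\": true}"
fi
curl -s -X POST localhost:6333/collections/web/points/search \
  -H 'Content-Type: application/json' -d "$body"

The model then sees the payload text of each hit and can judge its source and context, as noted above.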