2025-07-21
Open Source Search (Summary)
Disclaimer
- Quick note
- I support a complete ban on frontier AI R&D. This app requiring AI doesn't change that. If the ban is restrictive enough to also prevent the app described below from being built, I might be okay with that.
Summary
- This document describes how to build an open source search engine for the entire internet that runs on a residential server
- As of 2025 it'll cost between $100k and $1M to build and host this server. This cost will drop with every passing year, as GPU, RAM and disk prices fall.
- The most expensive step is the GPU capex to generate embeddings for the entire internet.
- Most steps can be done using low-complexity software such as bash scripts (curl --parallel, htmlq -tw, curl -X POST "$LLM_URL", etc)
Main
Why?
- I realised my posts on this topic sprawl all over the place, without one post to summarise them all. Hence this post.
- If someone donates $1M to me I might consider building this. I've written code for more than half the steps, and no step here seems impossibly hard.
Use cases of open source search
- Censorship-resistant backups
- aka internet with no delete button aka Liu Cixin's dark forest
- Any data that reaches any server may end up backed up by people across multiple countries forever.
- You can read my other posts for more on the implications of censorship-resistant backups and discovery.
- Censorship-resistant discovery
- Any data that reaches any server may end up searchable by everyone forever.
- Currently each country's govt bans channels and websites that it finds threatening. It is harder to block a torrent of a qdrant snapshot than to block a static list of IP addresses and domains. This will reduce the cost-of-entry/exit for a new youtuber.
- Since youtubers can potentially run for govt, subscribing to a youtuber is a (weak) vote for their govt.
- Privacy-preserving search
- In theory, it will become possible to run searches on an airgapped tails machine. Search indices can be stored on read-only media and memory can be wiped on reboot.
- As of 2025, a handful of intelligence agencies have exclusive access to everyone's thoughts, as everyone is dependent on centrally hosted search engines. This could change.
- Search for jobs, houses, friends, partners, etc without relying on a tech company.
- Most tech companies exist just to provide search functionality, and to set the incentives and culture so that both sides upload their data online.
- Not having to rely on them would mean better incentives and culture can be set. Lower cost-of-exit.
- Niche discovery
- Higher quality search (due to LLMs) makes it easier to connect people interested in a niche. Might be able to spawn subcommunities more easily based on shared thoughts or actions.
- Attention on the internet is currently very heavy-tailed; a few youtubers and social media companies capture most of Earth's attention. This phenomenon might weaken.
- Governance
- Can build political applications such as liquid democracy, distributed social media, etc if no politically or economically motivated group can censor data or alter search rankings or upvote counts.
Similar projects
(incomplete list, as of 2025-07 I'm not affiliated with any project listed here)
- Internet Archive, CommonCrawl, gotham-grabber by Freedom of the Press Foundation
- torrent, IPFS, filecoin, veilid
- long list of (mostly failed) decentralised social media projects
Final output (after steps 0 to 5)
- Distribute a torrent of the plaintext and the embedding search database snapshots (qdrant or dragonflyDB or similar) as mentioned below.
- Distribute code for all steps of the pipeline mentioned below. Torrent the code if github or similar website removes it from their site.
Step 0: Estimate hardware costs in 2025
Important: All these prices are dropping exponentially. Try forecasting prices in 2030 or 2035. We will eventually end up with the entire text internet stored in your pocket.
Prices taken from hetzner server auction, vast.ai, aws s3 deep archive
Rented
- Compute/Memory
- CPU compute = ($4/thread/mo) / (3.5 GHz/thread) = $1.1/B cycles = $0.07/GFLOP
- CPU RAM = $0.40/GB/mo
- CPU throughput = infinity, either disk or network throughput is the bottleneck
- GPU compute = ($1.40/h) / (50 TFLOP/s) = $0.008/PFLOP
- GPU RAM = $10/GB/mo
- GPU throughput = ($2/h) / (50 GB/s) = $28/TB
- Storage
- SSD = $10/TB/mo
- HDD = $2/TB/mo
- Tape = $1/TB/mo
- SSD throughput = ($20/mo) / (0.5 GB/s) = $16/PB
- HDD throughput = ($0.50/mo) / (100 MB/s) = $2/PB
- Network
- Network throughput = ($48/mo) / (10 Gbps) = $0.015/TB
Self-hosted
- typically 1-100x cheaper than cloud
- price gap between self-hosted and rented is largest for storage, compared to compute or memory
- Extreme examples
- SSD throughput = ($300/5y) / (10 GB/s) = $0.20/PB
- this is 80x cheaper than rented
- Tape (second-hand LTO-9) = $2/TB/30y = $0.0055/TB/mo
- this is 180x cheaper than rented
Estimate storage required
- In theory it is possible to do all the steps below in batches, so that a single node with 8 TB of RAID0 is sufficient to crawl, extract, generate indices and store indices for 2 PB of internet plaintext.
- In practice you will likely use a network-based filesystem like ceph. All steps below are fully parallelisable, so the separate nodes don't need high throughput between them.
- Raw plaintext can be stored on second-hand LTO-9 tapes
Step 1: Crawling
Figures taken from commoncrawl and internet archive
Figures taken from my own (bad) benchmark
- Crawl rate > (30 HTTP headers / s) / (4 cores) = (~100M headers / mo) / (4 cores)
- Compute required < (1T headers) / (25M headers/mo/core) = 40k core mo
- Compute cost < 40k core mo * $4/core/mo = $160k
- Requesting a header involves a DNS lookup, a TCP handshake and a TLS handshake.
- The TLS handshake requires compute for public-key cryptography; this is likely the real bottleneck. (Not tested.)
- Linux allows increasing the number of open sockets beyond 4096, and the network buffers have enough space to manage this. With a 10 s timeout and 30 headers/s, ~300 connections are open at once.
- Commoncrawl CDX files provide an initial 1T URLs from which to seed the crawl.
Software
- Parallel curl requests are sufficient, nothing fancy. (Not tested on the full dataset.) A minimal sketch follows.
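This sketch assumes urls.txt holds one URL per line; it fetches only response headers (matching the benchmark above) and logs a URL and status code per response:

#!/usr/bin/env bash
# Header-only crawl sketch: 10 s timeout per request, ~300 connections in flight.
<urls.txt xargs -P 300 -I{} \
  curl --head --silent --location --max-time 10 \
       --write-out '{} %{response_code}\n' --output /dev/null {} \
  >> crawl_log.txt

A real crawl would add robots.txt handling, per-domain rate limits and retries.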
Prioritised public datasets
If you can't do the entire internet, here's some datasets you might want to prioritise:
- personal blogs - searchmysite.net 3k blogs, substack, livejournal
- video transcripts - youtube, rumble, rutube, bilibili, youku - at least the top 10k channels each by subscriber count. Use yt-dlp or similar to scrape.
- forums - hackernews, reddit, lesswrong, stackexchange
- books, papers - arxiv, libgen, wikipedia
- code - github
- leaked datasets - wikileaks, distributed denial of secrets
- social media - discord public (discord unveiled) and private, insta public and private, twitter public and private etc - takes effort to get scrapes
It is important to do most of them, not just a few, otherwise your app won't be competitive with existing apps.
Private datasets
Also provide software to create private datasets
- Collect data of every keystroke on private machine. Retrieved using keylogger
- Collect data of all previous AI inputs and outputs, both local LLM calls and API calls. Retrieved by sniffing network traffic locally.
- Collect data from every webpage visited on private machine. Retrieved through browser cache directory.
- Collect data from all user-generated files on private machine. Retrieved directly.
Each user will likely have to individually create their own private dataset, as some of this data may not be present in any public dataset.
Step 2: Plaintext extraction
Figures taken from commoncrawl and internet archive
Data size
- Total plaintext on internet = 2 PB
- Google and NSA datacenters are currently 10,000 PB, mostly for storing video
- Theoretical max plaintext = (10k words /day/person) * 5 bits/word * 8B people * 100y = 1,700 PB
- Theoretical max video (downsampling/embedding gen to 100 bytes/s using AI) = 100 B/s * 8B people * 100y = 2,000,000 PB
- video downsampling cost not considered
Plaintext extraction cost
- The compute required to extract plaintext from a webpage is typically less than the compute required for TLS initialisation for that webpage.
- All processing can occur in RAM to avoid hitting the disk I/O bottleneck.
Software
do something | htmlq -tw | do something
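Expanding the placeholder pipeline above into a runnable sketch, assuming crawled HTML is stored one file per page under html/ and plaintext is written under text/:

#!/usr/bin/env bash
# Plaintext extraction sketch: html/*.html in, text/*.txt out.
mkdir -p text
for f in html/*.html; do
  # -t prints only text nodes, -w skips whitespace-only text nodes
  htmlq -tw 'body' < "$f" > "text/$(basename "$f" .html).txt"
done

At 2 PB scale this loop would be parallelised (e.g. with xargs -P) and fed directly from the crawler's output stream, keeping everything in RAM as noted above.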
Step 3: Embedding generation
Algorithm used
- BM25 or similar
- LLM embedding search
- searches concepts not keywords
- significantly outperforms BM25, although an ideal system uses both
Figures taken from openai text-embedding-3-small
Embedding generation cost
- openai price = $0.01/1M input tokens
- assume 200 tokens per chunk, no overlap between chunks
- open source naive price = 175B params * $0.008/PFLOP * 1 FLOP/param/output vector / (200 input tokens/ output vector) = $0.007/1M input tokens
- Embedding generation cost = $0.01/1M input tokens * 2 PB * (1 input token/ 5 bytes) = $4.2M
Performance
- as of 2025, text-embedding-3-small outperforms many models that are overfit to MTEB
- it's also cheaper than hosting a model yourself
Software
- Bash pipelines are sufficient; a sketch follows below
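A minimal sketch using the openai embeddings endpoint; it assumes OPENAI_API_KEY is set, chunks.txt holds one roughly 200-token chunk per line, and jq is installed. A real run would batch many chunks per request to cut HTTP overhead.

#!/usr/bin/env bash
# Embed each chunk and write one JSON vector per line.
while IFS= read -r chunk; do
  curl -s https://api.openai.com/v1/embeddings \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg t "$chunk" '{model: "text-embedding-3-small", input: $t}')" \
    | jq -c '.data[0].embedding'
done < chunks.txt > embeddings.jsonl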
Step 4: Embedding search
Algorithm used, Search time
- Search in RAM or disk
- RAM search takes a few seconds - might be optimisable to below 100 ms, but that will require custom work on RAM heap management
- Disk search is bottlenecked by disk throughput - >8 GB/s SSDs are available locally but not on cloud.
- FAISS index factory (see Pinecone's guide) - different algos use a different number of bytes per vector, and give different search times and recall.
- Graph-based algos like HNSW and Pinecone's proprietary algos outperform geometric algos like LSH, k-means clustering, product quantisation, etc
- Future algos
- It seems likely that a better graph algo than HNSW will be discovered in the next few years.
- My basic intuition is that you want to put 1T vectors into 1M clusters of 1M vectors each, then have a fast way to check which clusters are likely to contain matches to the query vector, then brute-force search those clusters.
Software
- One-click software - dragonflyDB for RAM, qdrant for disk
- Can load the databases using bash pipelines, as sketched below
- As of 2025, many other implementations exist but require more work to use.
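A sketch of loading the vectors from step 3 into qdrant over its REST API. It assumes qdrant is running on localhost:6333; the collection name "web" is arbitrary, 1536 is the text-embedding-3-small vector size, and the payload fields are placeholders for whatever tags you attach per chunk. Bulk loading would batch many points per request.

#!/usr/bin/env bash
# Create the collection, then upload one point per line of embeddings.jsonl.
curl -s -X PUT localhost:6333/collections/web \
  -H 'Content-Type: application/json' \
  -d '{"vectors": {"size": 1536, "distance": "Cosine"}}'

id=0
while IFS= read -r vec; do
  id=$((id + 1))
  curl -s -X PUT localhost:6333/collections/web/points \
    -H 'Content-Type: application/json' \
    -d "{\"points\": [{\"id\": $id, \"vector\": $vec, \"payload\": {\"source\": \"web\", \"timestamp\": 0}}]}" \
    > /dev/null
done < embeddings.jsonl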
Search filters
- Tags based on data source
- Users can specify that they only want to search reddit, or only their own keylogs, etc
- Filtering by source tag is important
- Tags based on content
- The embeddings can be used to generate automated tags: politics, sports, tech, etc
- Filtering by content tag is important
- Timestamps
- Filtering by time interval is important (see the filter sketch below)
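A sketch of a filtered search against the qdrant collection above; the source and timestamp payload keys are assumptions about how points were tagged at load time, and query.json is a placeholder holding the query embedding as a JSON array:

# Search the "web" collection, restricted to reddit documents from 2024.
vec=$(cat query.json)
curl -s -X POST localhost:6333/collections/web/points/search \
  -H 'Content-Type: application/json' \
  -d "{\"vector\": $vec, \"limit\": 10, \"with_payload\": true,
       \"filter\": {\"must\": [
         {\"key\": \"source\", \"match\": {\"value\": \"reddit\"}},
         {\"key\": \"timestamp\", \"range\": {\"gte\": 1704067200, \"lte\": 1735689599}}]}}"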
Latency
If you're hosting the app locally for your own use, latency does not matter, only search time matters. If you're hosting it for other people to use, latency can be relevant, and it may make sense to host multiple edge servers.
- Human body latency
- Nerve conduction latency = 1m / (100 m/s) = 10 ms
- Optical latency = (90 fps)^(-1) = 10 ms
- Computer latency
- CPU, RAM, disk, GPU, I/O device latencies are much below 10 ms
- Network latency is 100 ms for round-trip EU to US, due to speed of light.
- 1 gbps fibre optic now popular, sufficient to transmit HD video at below 10 ms latency with no encoding or compression
- 3D or VR data still can't be sent uncompressed
Step 5: Tool use
Make sure AI models can also query the embedding search databases. For example, you can put a wrapper in front of qdrant to convert it into an MCP server; a sketch of such a wrapper follows below.
Reasoning models can reason about the source and context of any given piece of information, and estimate how likely it is to match ground-truth.
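A sketch of the wrapper script an agent could call as a tool (an MCP server would expose the same function); EMBED_CMD is a hypothetical command that prints the query embedding as a JSON array, standing in for whichever embedding setup was used in step 3:

#!/usr/bin/env bash
# search_tool.sh "query text" [source_tag] - prints the top 10 matches from qdrant.
query="$1"; source="${2:-}"
vec=$("$EMBED_CMD" "$query")   # hypothetical helper, outputs a JSON array
if [ -n "$source" ]; then
  body="{\"vector\": $vec, \"limit\": 10, \"with_payload\": true, \"filter\": {\"must\": [{\"key\": \"source\", \"match\": {\"value\": \"$source\"}}]}}"
else
  body="{\"vector\": $vec, \"limit\": 10, \"with_payload\": true}"
fi
curl -s -X POST localhost:6333/collections/web/points/search \
  -H 'Content-Type: application/json' -d "$body"

The model then sees the payload text of each hit and can judge its source and context, as noted above.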