I support a complete ban on frontier AI R&D. The fact that this app requires AI doesn't change that. If the ban is restrictive enough to also prevent the app described below from being built, I might be okay with that.
Summary
This document describes how to build an open source search engine for the entire internet that runs on a residential server.
As of 2025 it will cost between $100k and $1M to build and host this server. The cost will drop with every passing year as GPU, RAM and disk prices fall.
Most expensive step is GPU capex to generate embeddings for the entire internet.
Most steps can be done using low-complexity software such as bash scripts (curl --parallel, htmlq -tw, curl -X POST "$LLM_URL", etc).
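As an illustration of that level of complexity, a single page could flow through a pipeline like the sketch below; the embeddings endpoint is an assumption (any OpenAI-compatible server reachable at $LLM_URL) and the model name is a placeholder.
  # fetch one page, strip it to plaintext, embed it (model name and endpoint are assumptions)
  curl -s --max-time 10 "https://example.com/some-page" |
    htmlq -tw |
    jq -Rs '{input: ., model: "embedding-model-placeholder"}' |
    curl -s -X POST "$LLM_URL/v1/embeddings" -H 'Content-Type: application/json' --data-binary @-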
Main
Why?
I realised my posts on this topic are sprawling all over the place, without one post to summarise it all. Hence this post.
If someone donates $1M to me I might consider building this. I've written code for more than half of the steps, and no step here seems impossibly hard.
Use cases of open source search
Censorship-resistant backups
aka internet with no delete button aka Liu Cixin's dark forest
Any data that reaches any server may end up backed up by people across multiple countries forever.
You can read my other posts for more on the implications of censorship-resistant backups and discovery.
Censorship-resistant discovery
Any data that reaches any server may end up searchable by everyone forever.
Currently each country's govt bans channels and websites that it finds threatening. It is harder to block a torrent of a qdrant snapshot than to block a static list of IP addresses and domains. This will reduce the cost-of-entry/exit for a new youtuber.
Since youtubers can potentially run for govt, subscribing to a youtuber is a (weak) vote for their govt.
Privacy-preserving search
In theory, it will become possible to run searches on an air-gapped Tails machine. Search indices can be stored on read-only media and memory can be wiped on reboot.
As of 2025, a handful of intelligence agencies have exclusive access to everyone's thoughts, as everyone is dependent on centrally hosted search engines. This could change.
Search for jobs, houses, friends, partners, etc without relying on a tech company.
Most tech companies exist just to provide search functionality, plus the incentives and culture that get both sides to upload their data online.
Not having to rely on them would mean better incentives and culture can be set. Lower cost-of-exit.
Niche discovery
Higher quality search (due to LLMs) makes it easier to connect people interested in a niche. Might be able to spawn subcommunities more easily based on shared thoughts or actions.
Attention on the internet is currently very heavy-tailed; a few youtubers and social media companies capture most of Earth's attention. This phenomenon might weaken.
Governance
Can build political applications such as liquid democracy, distributed social media, etc if no politically or economically motivated group can censor data or alter search rankings or upvote counts.
Similar projects
(incomplete list, as of 2025-08 I'm not affiliated with any project listed here)
Distribute a torrent of the plaintext and the embedding search database snapshots (qdrant or dragonflyDB or similar) as mentioned below.
Distribute code for all steps of the pipeline mentioned below. Torrent the code if github or similar website removes it from their site.
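One possible way to package a snapshot for distribution, assuming the mktorrent CLI and a placeholder tracker URL (any BitTorrent client that can create .torrent files would do equally well):
  # create a torrent of a qdrant snapshot directory, then seed it with any client
  mktorrent -a udp://tracker.example.org:1337/announce -o qdrant-snapshot.torrent qdrant-snapshot/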
Step 0: Estimate hardware costs in 2025
Important: All these prices are dropping exponentially. Try forecasting prices in 2030 or 2035. We will eventually end up with the entire text internet stored in your pocket.
Prices taken from hetzner server auction, vast.ai, aws s3 deep archive
In theory it is possible to do all the steps below in batches, so that a single node with 8 TB RAID0 is sufficient to crawl, extract, generate indices and store indices for 2 PB of internet plaintext.
In practice you will likely use a network-based filesystem like Ceph. All steps below are fully parallelisable, so the separate nodes don't need high-bandwidth links between them.
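For example: 2 PB / 8 TB per batch = 250 batches. Since the batches are independent, they can be processed one after another on a single node, or spread across many nodes with almost no inter-node traffic.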
Raw plaintext can be stored on second-hand LTO-9 tapes
Step 1: Crawling
Figures taken from commoncrawl and internet archive
Requesting a header involves a DNS lookup, a TCP handshake and a TLS handshake.
The TLS handshake needs CPU time for big-number public-key arithmetic; this is the likely real bottleneck. (Not tested.)
Linux allows raising the number of open sockets well beyond 4096, and the network buffers have enough space to manage this. Using a 10 s timeout and 30 headers/s => 10 s × 30/s = 300 connections open at once
Commoncrawl CDX files provide the initial ~1T URLs from which to seed the crawl.
Software
Parallel curl requests are sufficient, nothing fancy. (Not tested on the full dataset.)
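A minimal sketch of one crawl batch, assuming a seed list in urls.txt (e.g. extracted from the Commoncrawl CDX files) and reusing the 10 s timeout and 300 parallel connections from above; file names are placeholders and the limits would need tuning on real hardware.
  # build a curl config pairing each URL with an output file named by its hash
  mkdir -p pages
  while read -r url; do
    printf 'url = "%s"\noutput = "pages/%s.html"\n' "$url" "$(printf '%s' "$url" | sha256sum | cut -d' ' -f1)"
  done < urls.txt > batch.cfg

  # fetch the whole batch in parallel, then strip each page down to plaintext
  curl --parallel --parallel-max 300 --max-time 10 --silent --config batch.cfg
  for f in pages/*.html; do htmlq -tw < "$f" > "${f%.html}.txt"; done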
Prioritised public datasets
If you can't do the entire internet, here's some datasets you might want to prioritise:
personal blogs - searchmysite.net 3k blogs, 512kb.club blogs, substack, livejournal
video transcripts - youtube, rumble, rutube, bilibili, youku - atleast top 10k channels each by subscriber count. use yt-dlp or similar to scrape.
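For the video transcripts, an invocation along these lines could work (a sketch: the channel URL is a placeholder and the output template is an assumption; other platforms need their own yt-dlp extractors or scrapers):
  # download subtitle/auto-caption files for every video on a channel, skipping the videos themselves
  yt-dlp --skip-download --write-subs --write-auto-subs --sub-langs en \
         --output 'transcripts/%(channel)s/%(id)s.%(ext)s' \
         'https://www.youtube.com/@SomeChannel/videos'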
Graph-based algos like HNSW and Pinecone's proprietary algos outperform geometric algos like LSH, k-means clustering, product quantisation, etc.
Future algos
It seems likely that a better graph algo than HNSW will be discovered in the next few years.
My basic intuition is that you want to put 1T vectors into 1M clusters of 1M vectors each, then have a fast way to check which clusters are likely to contain matches to the query vector, then brute-force search those clusters.
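Illustrative arithmetic with those numbers: 1M clusters × 1M vectors per cluster = 1T vectors. If a query only has to brute-force, say, 10 candidate clusters, that is 10 × 1M = 10M distance computations instead of 1T, a 100,000× reduction.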
Software
One-click software - dragonflyDB for RAM, qdrant for disk
Can load databases using bash pipelines (see the sketch below).
As of 2025, many other implementations exist, require more work to use.
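A sketch of that loading path against qdrant's REST API on its default port 6333; the collection name, vector size and the embeddings.jsonl format (one {"id": ..., "vector": [...], "payload": {...}} object per line) are assumptions, and a real loader would batch many points per request.
  # create a collection (vector size depends on the embedding model)
  curl -s -X PUT "http://localhost:6333/collections/internet" \
       -H 'Content-Type: application/json' \
       -d '{"vectors": {"size": 768, "distance": "Cosine"}}'

  # stream pre-computed embeddings into it, one point per request
  while read -r point; do
    printf '{"points": [%s]}' "$point" |
      curl -s -X PUT "http://localhost:6333/collections/internet/points?wait=true" \
           -H 'Content-Type: application/json' --data-binary @-
  done < embeddings.jsonl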
Search filters
Tags based on data source
User can specify they only want to search reddit or they only want to search their own keylogs etc
Search filters based on data-source tags are important.
Tags based on content
Can use the embeddings to generate automated tags: politics, sports, tech, etc
Search filters based on content tags are important.
Timestamps
Search filters based on a time interval are important.
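Putting the three filter types together, a sketch against qdrant's search endpoint; the payload field names (source, topic, timestamp) and values are assumptions, and $QUERY_VECTOR stands for the already-embedded query.
  # semantic search restricted to reddit + politics + posts after 2025-01-01 (unix time)
  curl -s -X POST "http://localhost:6333/collections/internet/points/search" \
       -H 'Content-Type: application/json' \
       -d '{
             "vector": '"$QUERY_VECTOR"',
             "filter": {"must": [
               {"key": "source",    "match": {"value": "reddit"}},
               {"key": "topic",     "match": {"value": "politics"}},
               {"key": "timestamp", "range": {"gte": 1735689600}}
             ]},
             "limit": 10
           }'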
Latency
If you're hosting the app locally for your own use, latency does not matter, only search time matters. If you're hosting it for other people to use, latency can be relevant, and it may make sense to host multiple edge servers.
Human body latency
Nerve conduction latency = 1m / (100 m/s) = 10 ms
Optical latency = (90 fps)^(-1) = 10 ms
Computer latency
CPU, RAM, disk, GPU, I/O device latencies are much below 10 ms
Network latency is 100 ms for round-trip EU to US, due to speed of light.
1 gbps fibre optic is now popular, sufficient to transmit HD video at below 10 ms latency with no encoding or compression
3D or VR data still can't be sent uncompressed
Step 5: Tool use
Make sure the AI can also run searches against the embedding search databases. For example, you can put a wrapper in front of qdrant to convert it into an MCP server.
Reasoning models can reason about the source and context of any given piece of information, and estimate how likely it is to match ground-truth.
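A minimal retrieval-augmented sketch rather than a full MCP server: pull hits from qdrant, then hand them to an OpenAI-compatible chat endpoint behind $LLM_URL ($QUERY is the plain-text question, $QUERY_VECTOR its embedding; the endpoint path, jq field paths and model name are assumptions).
  # 1. retrieve the top 5 payloads for the embedded query
  HITS=$(curl -s -X POST "http://localhost:6333/collections/internet/points/search" \
              -H 'Content-Type: application/json' \
              -d '{"vector": '"$QUERY_VECTOR"', "limit": 5, "with_payload": true}' |
         jq -c '[.result[].payload]')

  # 2. ask the model to answer from those sources and judge how trustworthy each one is
  jq -cn --argjson hits "$HITS" --arg q "$QUERY" '{
    model: "local-model-placeholder",
    messages: [
      {role: "system", content: "Answer from the provided sources and say how reliable each source looks."},
      {role: "user",   content: ($q + "\n\nSources:\n" + ($hits | tostring))}
    ]
  }' | curl -s -X POST "$LLM_URL/v1/chat/completions" -H 'Content-Type: application/json' --data-binary @-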
Subscribe
Enter an email or phone number to subscribe. You will receive at most one update per month.