2025-04-10
Simplify embedding search
tldr: complexity bad, graph algos good, geometric algos bad, buckets good
Why?
- I'm looking for a simple embedding search algorithm because political projects should rely on simple software, not complex software.
- The BitTorrent codebase is simpler than the Ethereum codebase, for example; hence BitTorrent is less likely to get politically captured IMO.
- Examples of such political projects: search feature for distributed social media, spam filter for distributed social media, spam filter for SecureDrop servers that accept leaked documents, search feature for published leaked documents, etc.
- As of 2025, embedding search is a necessary and almost sufficient method for using LLMs on a big dataset. And LLMs are the state-of-the-art AI for basically every task.
- This could change in the future; for instance, it may one day be cheap enough to just feed an internet's worth of tokens as inference input and find the most relevant content that way.
- We also don't yet have good interpretability tools to make sense of layers besides the embedding layer.
- Since feeding everything as input is too expensive and the other layers are unhelpful, we will likely rely on embeddings.
Embedding search: Given a dataset of vectors and a query vector, find top-10 vectors that maximise dot product with query vector.
This is equivalent to finding the top-10 vectors that minimise euclidean distance to the query vector, assuming all vectors are unit vectors (since |a - b|^2 = 2 - 2(a.b) when a and b are unit vectors).
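For concreteness, here is a minimal brute-force sketch of this definition in numpy; the data is random and the sizes are placeholders for illustration.
    # Minimal brute-force top-k embedding search (illustrative sizes, random data).
    import numpy as np

    def top_k_dot(dataset, query, k=10):
        # dataset: (n, d) unit vectors, query: (d,) unit vector
        scores = dataset @ query                    # dot product with every stored vector
        top = np.argpartition(-scores, k)[:k]       # unordered indices of the k best scores
        return top[np.argsort(-scores[top])]        # sorted best-first

    dataset = np.random.randn(10_000, 1536).astype(np.float32)
    dataset /= np.linalg.norm(dataset, axis=1, keepdims=True)   # normalise to unit vectors
    query = dataset[0]                              # query with a stored vector for the demo
    print(top_k_dot(dataset, query))                # index 0 should come first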
Common algos
- Graph-based - HNSW, Pinecone's proprietary (?) algos
- Geometry-based - locality sensitive hashing, k-means clustering, product quantization (subvectors)
- Brute-force
Dataset sizes
- Rule of thumb: 1000 chars (1 KB) -> 1536 dimensional float32 vector (6 KB)
- 1 TB plaintext => 6 TB embeddings, 1 PB plaintext => 6 PB embeddings
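As a quick worked example of this rule of thumb (the chunk size and dimension are just the numbers above):
    # Back-of-the-envelope storage arithmetic for the "1 KB text -> 6 KB embedding" rule of thumb.
    DIM = 1536
    BYTES_PER_VECTOR = DIM * 4            # float32 => 6144 bytes ~= 6 KB per 1 KB chunk of text

    def embedding_bytes(plaintext_bytes, chunk_bytes=1000):
        num_vectors = plaintext_bytes / chunk_bytes
        return num_vectors * BYTES_PER_VECTOR

    print(embedding_bytes(1e12) / 1e12, "TB of embeddings per TB of plaintext")   # ~6.1
    print(embedding_bytes(1e15) / 1e15, "PB of embeddings per PB of plaintext")   # ~6.1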
Brute-force algo
- Common Crawl is ~1 PB of plaintext => a trillion vectors (10^12) => ~6 PB of embeddings by the rule of thumb above.
- Brute force can be done either fully in RAM or by loading from disk into RAM at query time. This is affordable for 10^6 vectors but not for 10^12 vectors.
- If disk throughputs become fast enough, one could do brute-force on a large array of disks in parallel. Join 10,000 disks (1 TB each, 8 GB/s read throughput) in parallel, and you can search a batch of queries in ~15 minutes. Such a cluster costs at least $1M in hardware alone (see the arithmetic sketch after this list).
- A brute-force algo on disk will become affordable within a decade or two, but it is expensive today.
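A rough arithmetic sketch of the disk-array estimate above. The disk specs and the per-disk cost are assumptions consistent with the figures in this list, and the computed time is raw I/O only; the ~15 minute figure above leaves headroom for the dot-product compute on a query batch.
    # Rough arithmetic for brute-force search over a parallel disk array.
    # Disk specs and per-disk cost are assumptions matching the list above.
    NUM_VECTORS = 1e12                 # ~Common Crawl scale
    BYTES_PER_VECTOR = 1536 * 4        # 1536-dim float32 = 6144 bytes
    NUM_DISKS = 10_000
    DISK_THROUGHPUT = 8e9              # 8 GB/s sequential read per disk
    DISK_CAPACITY = 1e12               # 1 TB per disk
    COST_PER_DISK = 100                # assumed, to match the >= $1M hardware figure

    total_bytes = NUM_VECTORS * BYTES_PER_VECTOR
    print("embeddings:", total_bytes / 1e15, "PB; fits on array:", total_bytes <= NUM_DISKS * DISK_CAPACITY)
    print("raw I/O per full scan: ~", round((total_bytes / NUM_DISKS) / DISK_THROUGHPUT / 60, 1), "minutes")
    print("hardware cost: >= $", NUM_DISKS * COST_PER_DISK)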
State-of-the-art algos are captured by ANN-Benchmarks and Big ANN Benchmarks. The main conclusion I draw from them is that graph-based methods beat geometry-based methods.
- My guess is this is because of the curse of dimensionality. You don't lose a lot of search accuracy if you quantise vectors to 4-bit (or maybe even 2-bit or 1-bit), so you are basically searching for bitstrings with low Hamming distance from each other. Intuitions from 2D and 3D geometry no longer make sense to me here.
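To make the bitstring intuition concrete, here is a rough sketch of 1-bit (sign) quantisation plus Hamming-distance search on random data; it illustrates the intuition only, not how HNSW or any production system works.
    # 1-bit (sign) quantisation + Hamming-distance search, to illustrate the bitstring intuition.
    import numpy as np

    def binarise(vectors):
        # Keep only the sign of each coordinate, packed 8 coordinates per byte.
        return np.packbits(vectors > 0, axis=1)

    def hamming_top_k(codes, query_code, k=10):
        # XOR then popcount gives the Hamming distance to every stored code.
        dists = np.unpackbits(codes ^ query_code, axis=1).sum(axis=1)
        top = np.argpartition(dists, k)[:k]
        return top[np.argsort(dists[top])]

    vectors = np.random.randn(10_000, 1536).astype(np.float32)
    codes = binarise(vectors)                       # 1536 bits = 192 bytes per vector
    query = vectors[42] + 0.05 * np.random.randn(1536).astype(np.float32)
    print(hamming_top_k(codes, binarise(query[None, :])[0]))   # 42 should rank near the top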
Pinecone's articles on FAISS are the best resource on this topic that I've found on the internet.
Buckets
- As long as you can put the trillion vectors into buckets of a million vectors each and quickly identify which buckets contain the query's matches, you can then just brute-force search those buckets.
- Additional software complexity to save a few milliseconds of search time within the bucket is not necessarily worth it, depending on the application.
- Indexes must not occupy too much memory. For instance, in LSH each vector is stored along with a (say) 24-bit signature for 24 hyperplanes: 3 bytes / vector. In HNSW, each vector is stored along with the index positions of at least 10 neighbours: >100 bytes / vector.
- If you do the bucket approach, you need to store the index positions of all vectors in the reshuffled (bucketed) order. For 10^12 vectors, that's 5 bytes / vector. Or you can avoid numbering the vectors at all and just store the raw vectors in the reshuffled order; this way you require zero extra space.
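Here is a rough sketch of the bucket idea. Randomly chosen centroids stand in for a real coarse k-means, and the bucket count, probe count and data sizes are arbitrary assumptions. The raw vectors are stored grouped by bucket, so the only required metadata is the bucket boundary offsets; the original-id array is the optional 5-bytes-per-vector variant mentioned above.
    # Bucket-then-brute-force sketch. Random centroids stand in for a real k-means;
    # bucket count, probe count and data sizes are arbitrary assumptions for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.standard_normal((100_000, 128)).astype(np.float32)
    data /= np.linalg.norm(data, axis=1, keepdims=True)

    num_buckets = 100
    centroids = data[rng.choice(len(data), num_buckets, replace=False)]
    assignment = (data @ centroids.T).argmax(axis=1)      # nearest centroid by dot product

    # Store the raw vectors contiguously, grouped by bucket. Bucket boundary offsets are the
    # only required metadata; the `order` array (original ids, ~5 bytes/vector) is optional.
    order = np.argsort(assignment, kind="stable")
    grouped = data[order]
    starts = np.searchsorted(assignment[order], np.arange(num_buckets + 1))

    def search(query, k=10, probes=5):
        buckets = np.argsort(-(centroids @ query))[:probes]            # most promising buckets
        pos = np.concatenate([np.arange(starts[b], starts[b + 1]) for b in buckets])
        scores = grouped[pos] @ query                                  # brute-force inside them
        top = np.argpartition(-scores, k)[:k]
        return order[pos[top[np.argsort(-scores[top])]]]               # map back to original ids

    print(search(data[7]))   # vector 7 should appear among the results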
I've yet to find a state-of-the-art algo for this; I've just shared some intuitions of mine here.