Browse parent directory

unimportant/simplify_embedding_search.html


2025-04-10

Simplify embedding search

tldr complexity bad, graph algo good, geometric algo bad, buckets good

Why?

Embedding search: Given a dataset of vectors and a query vector, find top-10 vectors that maximise dot product with query vector.

This is equivalent to finding top-10 vectors that minimise euclidean distance with query vector, assuming all vectors are unit vectors.

Common algos

Dataset sizes

Brute-force algo

State of the art algos are captured by ANN benchmarks and big ANN benchmarks. The main conclusion from them for me is that graph-based methods beat geometry-based methods.

Pinecone's articles on FAISS are the best resource on the internet on this that I've found.

Buckets

I'm yet to find a state-of-the-art algo for this, I just shared some intuitions of mine here.