2025-04-10
Simple Embedding Search
We want a small, clear way to do embedding search. Complex code is risky for large public infrastructure: BitTorrent's code is simpler than Ethereum's, for example, so it is arguably harder for powerful actors to capture.
Why Embedding Search?
- We use LLMs for most tasks now.
- If we have a huge set of text, we turn each part into a vector (an "embedding").
- At query time, we compare the query's vector to all stored vectors and pick the top ones.
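The compare-against-everything step can be sketched with NumPy; the corpus size and dimension here are toy values, not anything the post fixes:

```python
import numpy as np

def top_k(query: np.ndarray, corpus: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k corpus vectors with the highest dot product."""
    scores = corpus @ query                      # one score per stored vector
    # argpartition finds the k best in O(n); then we sort just those k
    idx = np.argpartition(-scores, k)[:k]
    return idx[np.argsort(-scores[idx])]

# toy usage: 1000 random unit vectors, 64-dimensional
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query = corpus[42]                               # query equal to a stored vector
print(top_k(query, corpus)[0])                   # best match is index 42 itself
```

`argpartition` avoids sorting the whole score array, which matters once the corpus is large.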
How to Do It
- The goal: find the top-10 vectors with the best dot product (or least distance).
- If the vectors are unit-normalized, maximizing the dot product is the same as minimizing the Euclidean distance.
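The link is the identity ||q - v||^2 = 2 - 2 q·v for unit vectors; a quick numeric check:

```python
import numpy as np

rng = np.random.default_rng(1)
q, v = rng.normal(size=(2, 128))
q /= np.linalg.norm(q)
v /= np.linalg.norm(v)

# For unit vectors: ||q - v||^2 = ||q||^2 + ||v||^2 - 2 q.v = 2 - 2 q.v,
# so a bigger dot product always means a smaller distance.
lhs = np.linalg.norm(q - v) ** 2
rhs = 2 - 2 * np.dot(q, v)
print(np.isclose(lhs, rhs))   # True
```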
Common Methods
- Graph-based: e.g. HNSW; Pinecone uses closed-source variants of these tricks.
- Geometry-based: Locality-Sensitive Hashing (LSH), k-means clustering, or product quantization.
- Brute-force: Compare the query to all vectors.
Data Sizes
- One ~1 KB chunk of text => one 1536-dimensional float32 embedding (~6 KB).
- For 1 TB of text, we get ~6 TB of embeddings. For 1 PB of text, ~6 PB embeddings.
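The figures above follow from simple arithmetic:

```python
# Back-of-envelope sizing, matching the numbers in the list above:
# one 1536-dimensional float32 embedding per ~1 KB text chunk.
dim = 1536
bytes_per_float = 4
chunk_bytes = 1024

embedding_bytes = dim * bytes_per_float          # 6144 bytes, about 6 KB
expansion = embedding_bytes / chunk_bytes        # embeddings are 6x the text

text_tb = 1
embeddings_tb = text_tb * expansion
print(embedding_bytes, expansion, embeddings_tb)  # 6144 6.0 6.0
```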
Brute-Force Feasibility
- For big data (like CommonCrawl), we can have 10^12 vectors (~10 PB).
- Brute force in RAM is fine for a few million vectors, but not for a trillion.
- A big disk array could work, but is still very costly today. Maybe in 10–20 years it gets cheaper.
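A rough way to see where brute force stops being fine: a full scan is limited by how fast we can stream the vectors past the CPU. The bandwidth figure below is an assumed order-of-magnitude value, not a measurement:

```python
# Feasibility estimate for a brute-force scan, assuming the search is
# memory-bandwidth-bound (the dot products themselves are cheap).
bytes_per_vector = 6 * 1024           # one 1536-D float32 embedding, ~6 KB

def scan_seconds(n_vectors: int, bandwidth_bytes_per_s: float) -> float:
    """Time to stream every stored vector past the CPU once."""
    return n_vectors * bytes_per_vector / bandwidth_bytes_per_s

ram_bw = 100e9                        # assume ~100 GB/s for one server socket
print(scan_seconds(10**7, ram_bw))    # ~10 million vectors in RAM: under a second
print(scan_seconds(10**12, ram_bw))   # a trillion vectors: tens of thousands of seconds
```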
Graph vs. Geometry
- Published benchmarks generally show graph-based methods beating geometry-based ones.
- In high-dimensional space, embeddings tolerate heavy quantization (e.g. 4 bits per float) with little accuracy loss. The exact geometry matters less than one might expect.
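A minimal sketch of what 4-bit quantization means here, using simple symmetric uniform quantization (one of several possible schemes; the post does not fix one):

```python
import numpy as np

def quantize_4bit(v: np.ndarray):
    """Symmetric uniform quantization to 16 levels (-8..7), one scale per vector."""
    scale = np.abs(v).max() / 7.0
    codes = np.clip(np.round(v / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(2)
v = rng.normal(size=1536).astype(np.float32)
codes, scale = quantize_4bit(v)
v2 = dequantize(codes, scale)
# cosine similarity between original and reconstruction stays close to 1
cos = float(np.dot(v, v2) / (np.linalg.norm(v) * np.linalg.norm(v2)))
print(cos)
```

The codes fit in 4 bits each (here stored in int8 for simplicity), an 8x saving over float32, yet nearest-neighbor rankings change very little.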
Buckets
- We can group vectors into buckets. Then, we find which buckets match our query and do a brute-force search in those.
- Each vector in LSH needs ~3 bytes for a 24-bit signature.
- Each vector in HNSW can need over 100 bytes for neighbor links.
- If we just store raw vectors in some grouped order, we might need no extra space.
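A minimal sketch of the bucket idea, using sign-of-random-hyperplane hashing as one possible grouping (the post does not commit to a specific scheme):

```python
import numpy as np

class BucketIndex:
    """Sign-of-random-hyperplane buckets; brute force inside the query's bucket.

    n_bits hyperplanes give 2**n_bits buckets. The only state beyond the raw
    vectors is one small integer key per vector.
    """

    def __init__(self, vectors: np.ndarray, n_bits: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, vectors.shape[1]))
        self.vectors = vectors
        self.keys = self._key(vectors)                    # bucket id per vector

    def _key(self, x: np.ndarray) -> np.ndarray:
        bits = (x @ self.planes.T) > 0                    # sign pattern per vector
        return bits @ (1 << np.arange(bits.shape[1]))     # pack bits into an int

    def search(self, query: np.ndarray, k: int = 10) -> np.ndarray:
        in_bucket = np.flatnonzero(self.keys == self._key(query[None, :])[0])
        scores = self.vectors[in_bucket] @ query          # brute force in the bucket
        order = np.argsort(-scores)[:k]
        return in_bucket[order]                           # global indices of matches

# toy usage
rng = np.random.default_rng(3)
corpus = rng.normal(size=(2000, 64))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
index = BucketIndex(corpus, n_bits=6)
print(index.search(corpus[7], k=5)[0])                    # 7: finds itself first
```

A real version would probe several nearby buckets to improve recall, but the shape of the code stays this small.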
In sum, a simple bucket-based approach, with brute force inside each bucket, may be good enough. We store the vectors in a plain grouped layout and skip the fancy index structures. This keeps the code small and stable.