I used to think search/discovery/recommendation algos only did embedding search. I have now realised there's an entire pipeline, and not a single step in the pipeline can be skipped.
Start with the entire text internet
Use hard-coded initial list of people/blogs/urls/keywords to filter
Use embedding search to filter further
Use LLM inference with both smart prompts (like Paul Graham prompts or surprising-difference prompts) and the user's context, to filter further
Show the post to N other users, to filter further
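Here's a minimal sketch of how these four stages might chain together. Every function, threshold, and field name is illustrative (I'm assuming posts are dicts with `author` and `text`, and that you supply your own `embed`, `llm`, and `like_rate` callables); it shows the shape of the funnel, not a real implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hardcoded_filter(posts, authors, keywords):
    """Stage 1: cheap pass over everything; keep posts from a hand-curated
    list of people/blogs or containing hand-picked keywords."""
    return [
        p for p in posts
        if p["author"] in authors
        or any(k in p["text"].lower() for k in keywords)
    ]

def embedding_filter(posts, interest_vector, embed, threshold=0.3):
    """Stage 2: keep posts whose embedding is close to the user's interest vector."""
    return [p for p in posts
            if cosine_similarity(embed(p["text"]), interest_vector) > threshold]

def inference_filter(posts, user_context, llm):
    """Stage 3: ask an LLM, with a smart prompt plus the user's context,
    whether each surviving post is worth showing."""
    prompt = ("Reader profile:\n{ctx}\n\n"
              "Would this post genuinely surprise or help this reader? "
              "Answer yes or no.\n\nPost:\n{post}")
    return [p for p in posts
            if llm(prompt.format(ctx=user_context, post=p["text"]))
               .strip().lower().startswith("yes")]

def test_user_filter(posts, like_rate, min_like_rate=0.10):
    """Stage 4: show each post to N real users; keep posts enough of them like."""
    return [p for p in posts if like_rate(p) >= min_like_rate]

def recommend(posts, user, embed, llm, like_rate):
    """The full funnel: each stage is more expensive per token, so it only
    ever sees what the previous, cheaper stage let through."""
    posts = hardcoded_filter(posts, user["followed_authors"], user["keywords"])
    posts = embedding_filter(posts, user["interest_vector"], embed)
    posts = inference_filter(posts, user["context"], llm)
    return test_user_filter(posts, like_rate)
```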
Hard numbers
hard-coded list - Hetzner charges $200/mo for a 1 Gbps, 32-thread machine, which AFAIK can max out egress rather than CPU on this workload => ($200 / (30 * 86400 s)) / (10^9 bps / 8 / 5 bytes per token) ≈ $0.000003/1M tokens (pay per query). You also pay the disk cost of hosting the entire full-text DB (unless someone else is hosting it)
embedding search - OpenAI text-embedding-3-small costs $0.01/1M tokens (pay once). You also pay the disk cost of hosting the entire vector DB
inference - OpenAI gpt-5.2 high costs $1.75/1M tokens (pay per query)
actual test users - $10 CPM with ~100 tokens read per view => $10 / (1000 views * 100 tokens/view) = $100/1M tokens (pay per query)
The text internet contains at least 500B tokens.
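The same arithmetic, written out so the assumptions carry through (5 bytes/token, 100 tokens per view, a 30-day month), plus what each stage would cost over the full ~500B-token text internet:

```python
# Back-of-envelope cost of each stage, per 1M tokens and over the whole
# ~500B-token text internet. Assumptions: 5 bytes/token, 100 tokens/view,
# 30-day month; prices as quoted above.

SECONDS_PER_MONTH = 30 * 86400
INTERNET_TOKENS_M = 500e9 / 1e6          # 500B tokens, in millions of tokens

# Stage 1: hard-coded keyword filter on a $200/mo, 1 Gbps Hetzner box.
tokens_per_second = 1e9 / 8 / 5          # bits/s -> bytes/s -> tokens/s = 25M
dollars_per_second = 200 / SECONDS_PER_MONTH
hardcoded = dollars_per_second / tokens_per_second * 1e6   # $/1M tokens

# Stage 2: embeddings (text-embedding-3-small), pay once per corpus.
embedding = 0.01                         # $/1M tokens

# Stage 3: LLM inference, pay per query.
inference = 1.75                         # $/1M tokens

# Stage 4: real test users at $10 CPM, ~100 tokens read per view.
test_users = 10 / (1000 * 100) * 1e6     # $/1M tokens = $100

for name, per_m in [("hard-coded list", hardcoded), ("embedding search", embedding),
                    ("inference", inference), ("test users", test_users)]:
    print(f"{name:>16}: ${per_m:,.6f} / 1M tokens, "
          f"${per_m * INTERNET_TOKENS_M:,.0f} over 500B tokens")
```

The "over 500B tokens" totals assume a stage sees every token; in practice each later stage only sees what the earlier, cheaper stages let through, so they are upper bounds. The point is the roughly seven-orders-of-magnitude spread between the cheapest and most expensive stage.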
Tips
Even if you work on a much smaller dataset (like just LessWrong or just Substack), you will definitely need to implement actual test users.
You will probably also want to implement at least one of inference or embedding search.
It might be easier to build on top of Bluesky custom feeds or similar, compared to building your own 'like' functionality from scratch for the entire internet (including across multiple social media platforms that don't want to cooperate with each other).
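For reference, a Bluesky custom feed sits behind a single feed-generator endpoint; below is a minimal sketch. The route and response shape follow the AT Protocol feed-skeleton convention as I understand it, and `rank_posts` is a placeholder where the pipeline above would plug in.

```python
# Minimal sketch of a Bluesky feed-generator service. Bluesky asks your server
# for a "feed skeleton" (ordered post URIs); the app hydrates and renders them,
# and likes come back through the normal Bluesky network, so you don't have to
# build 'like' functionality yourself. rank_posts() is a placeholder.
from flask import Flask, jsonify, request

app = Flask(__name__)

def rank_posts(cursor, limit):
    """Placeholder: return post AT-URIs ranked by your pipeline, plus next cursor."""
    uris = ["at://did:plc:example/app.bsky.feed.post/abc123"]   # hypothetical URI
    return uris[:limit], None

@app.route("/xrpc/app.bsky.feed.getFeedSkeleton")
def get_feed_skeleton():
    limit = int(request.args.get("limit", 50))
    cursor = request.args.get("cursor")
    uris, next_cursor = rank_posts(cursor, limit)
    body = {"feed": [{"post": uri} for uri in uris]}
    if next_cursor:
        body["cursor"] = next_cursor
    return jsonify(body)

if __name__ == "__main__":
    app.run(port=8080)
```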