


2025-07-21

Open Source Search (Summary)

Disclaimer

Summary

Main

Why?

Use cases of open source search

Similar projects

(Incomplete list. As of 2025-07, I'm not affiliated with any project listed here.)

Final output (after steps 0 to 5)

Step 0: Estimate hardware costs in 2025

Important: All these prices are dropping exponentially. Try forecasting prices in 2030 or 2035. Eventually you will be able to store the entire text internet in your pocket.
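As a rough way to try such a forecast, here is a minimal sketch that assumes prices halve at a fixed interval. The function name, the starting price, and the halving interval are all hypothetical placeholders, not figures from this page; plug in your own numbers.

```python
def forecast_price(price_now: float, halving_years: float, years_ahead: float) -> float:
    """Project a price forward, assuming it halves every `halving_years` years.

    `halving_years` is an assumption you must supply yourself; this page
    does not state a specific rate of decline.
    """
    return price_now * 0.5 ** (years_ahead / halving_years)


if __name__ == "__main__":
    # Hypothetical inputs: a price of 100 (any currency unit) in 2025
    # and a halving time of 2.5 years.
    for year in (2030, 2035):
        projected = forecast_price(100.0, 2.5, year - 2025)
        print(year, round(projected, 2))
```

The result is very sensitive to the halving interval, so it is worth running the forecast with a pessimistic and an optimistic value.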

Prices taken from the Hetzner server auction, Vast.ai, and AWS S3 Glacier Deep Archive

Rented

Self-hosted

Estimate storage required

Step 1: Crawling

Figures taken from Common Crawl and the Internet Archive

Figures taken from my own (bad) benchmark

Software

Prioritised public datasets

If you can't do the entire internet, here are some datasets you might want to prioritise:

It is important to cover most of them, not just a few; otherwise your app won't be competitive with existing apps.

Private datasets

Also provide software to create private datasets

Each user will likely have to individually create their own private dataset, as some of this data may not be present in any public dataset.

Step 2: Plaintext extraction

Figures taken from Common Crawl and the Internet Archive

Data size

Plaintext extraction cost

Software

Step 3: Embedding generation

Algorithm used

Figures taken from OpenAI text-embedding-3-small
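To get a feel for how much space the embeddings themselves take, here is a minimal back-of-the-envelope sketch. It relies on the default output size of text-embedding-3-small, which is 1536 dimensions. The float32 storage format and the chunk count in the example are assumptions, not figures from this page.

```python
def embedding_storage_bytes(num_chunks: int, dims: int = 1536, bytes_per_value: int = 4) -> int:
    """Raw size of an uncompressed embedding table.

    dims=1536 is the default output size of text-embedding-3-small.
    bytes_per_value=4 assumes float32 storage; quantised formats use less.
    Index overhead and metadata are not included.
    """
    return num_chunks * dims * bytes_per_value


if __name__ == "__main__":
    # Hypothetical corpus of one billion text chunks.
    size = embedding_storage_bytes(1_000_000_000)
    print(f"{size / 1e12:.3f} TB")
```

Halving `bytes_per_value` (float16) or reducing `dims` scales the result linearly, which is a quick way to compare storage options.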

Embedding generation cost

Performance

Software

Step 4: Embedding search

Algorithm used, Search time
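As a baseline for comparing search times, here is a minimal sketch of exact (brute-force) cosine-similarity search in plain Python. This is not necessarily the algorithm this page has in mind; it is only the simplest correct reference that approximate indexes can be measured against. The function names and toy vectors are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query.

    Every stored vector is scored, so the cost grows linearly with corpus size.
    """
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


if __name__ == "__main__":
    toy = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(top_k([1.0, 0.0], toy, k=2))
```

Because the scan is linear in the number of vectors, it only stays practical for small collections; larger ones usually need an approximate nearest-neighbour index.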

Software

Search filters

Latency

If you're hosting the app locally for your own use, latency does not matter; only search time matters. If you're hosting it for other people to use, latency can be relevant, and it may make sense to host multiple edge servers.

Step 5: Tool use

Make sure AI can also query the embedding search databases. For example, you can put a wrapper in front of Qdrant to expose it as an MCP server.
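The wrapper idea can be sketched without any external libraries. The example below is only an illustration of the shape of such a wrapper: it does not use the real MCP SDK or the Qdrant client, and `fake_backend`, `TOOLS`, `handle_tool_call`, and the request format are all hypothetical stand-ins. In a real setup the backend function would call the vector database, and the dispatcher would be replaced by an actual MCP server.

```python
import json
from typing import Callable


def fake_backend(query: str, limit: int) -> list[dict]:
    """Hypothetical stand-in for a vector database query.

    It ranks a tiny hard-coded corpus by word overlap; a real wrapper
    would embed the query and search the database instead.
    """
    corpus = [
        {"id": 1, "text": "open source search engines"},
        {"id": 2, "text": "embedding search with vector databases"},
        {"id": 3, "text": "hardware costs for self-hosting"},
    ]
    words = set(query.lower().split())
    scored = [(len(words & set(doc["text"].split())), doc) for doc in corpus]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [doc for score, doc in ranked if score > 0][:limit]


# Registry of tools that the AI is allowed to call.
TOOLS: dict[str, Callable[[str, int], list[dict]]] = {"search": fake_backend}


def handle_tool_call(request_json: str) -> str:
    """Dispatch a JSON tool call of the form {"name": ..., "arguments": {...}}."""
    request = json.loads(request_json)
    tool = TOOLS.get(request.get("name"))
    if tool is None:
        return json.dumps({"error": "unknown tool"})
    args = request.get("arguments", {})
    hits = tool(args["query"], int(args.get("limit", 5)))
    return json.dumps({"result": hits})


if __name__ == "__main__":
    call = {"name": "search", "arguments": {"query": "embedding search", "limit": 1}}
    print(handle_tool_call(json.dumps(call)))
```

Keeping the tool registry separate from the backend makes it easy to add further tools, such as filtered search, without changing the dispatcher.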

Reasoning models can reason about the source and context of any given piece of information, and estimate how likely it is to match the ground truth.

