search engine in rust


I'm building a search engine from scratch in Rust, working chapter by chapter through Introduction to Information Retrieval by Manning, Raghavan, and Schütze. The goal is to actually understand what production search systems do: not to use them, but to build the pieces myself and hit the walls they're designed around.

Each article is written the day I finish the chapter. Real numbers from a real corpus (11,314 documents from 20 Newsgroups), honest about tradeoffs, including the things that didn't work. Code lives on GitHub.

progress through the book
Currently on Chapter 7 — efficient scoring and top-K retrieval. Six chapters done. Fifteen to go.
articles
fixing the merge: from all-blocks-in-ram to streaming k-way merge

My merge step loaded every block into RAM at once. Worked for 3 blocks on 12K docs. Would crash at 100 blocks on a million docs. Wrote a BlockReader. Now the merge holds one term per block at a time. Memory grows with the number of blocks, not with the size of the index.
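A rough sketch of the streaming k-way merge idea, not the code from the article. It assumes each block yields (term, postings) pairs sorted by term and that blocks are numbered in doc ID order. The name `merge_blocks` is hypothetical, and plain iterators stand in for the article's BlockReader. A min-heap holds at most one pending entry per block.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge sorted (term, postings) streams while holding one pending entry per block.
/// `blocks` stands in for on-disk block readers (an assumption for this sketch).
fn merge_blocks<I>(mut blocks: Vec<I>) -> Vec<(String, Vec<u32>)>
where
    I: Iterator<Item = (String, Vec<u32>)>,
{
    // Min-heap keyed by (term, block index); Reverse flips the default max-heap.
    let mut heap = BinaryHeap::new();
    for (i, block) in blocks.iter_mut().enumerate() {
        if let Some((term, postings)) = block.next() {
            heap.push(Reverse((term, i, postings)));
        }
    }

    // A real merge would write each finished term to disk instead of collecting it.
    let mut merged: Vec<(String, Vec<u32>)> = Vec::new();
    while let Some(Reverse((term, i, postings))) = heap.pop() {
        let same_term = merged.last().map_or(false, |(last, _)| *last == term);
        if same_term {
            // Same term seen in an earlier block: append its postings.
            if let Some((_, list)) = merged.last_mut() {
                list.extend(postings);
            }
        } else {
            merged.push((term, postings));
        }
        // Refill the heap from the block that just gave up an entry.
        if let Some((next_term, next_postings)) = blocks[i].next() {
            heap.push(Reverse((next_term, i, next_postings)));
        }
    }
    merged
}
```

Ties on the term are broken by block index, so postings from earlier blocks come first and the merged list stays sorted under the assumption above.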

my search engine now knows which documents matter most

Ranking 14 documents for "United States of America." First try put a 1,835-line firearms archive at #1 and a 13-line post literally about the USA in last place. Opened the files and figured out why. Cosine normalization fixed it.
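To illustrate why length normalization changes the order, here is a minimal sketch of cosine scoring. It assumes raw term counts as weights; the article's actual weighting scheme may differ, and `cosine_score` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Cosine similarity between two sparse term-weight vectors.
/// Weights are whatever the caller supplies (raw counts in this sketch).
fn cosine_score(query: &HashMap<&str, f64>, doc: &HashMap<&str, f64>) -> f64 {
    // Dot product over the terms the query and document share.
    let dot: f64 = query
        .iter()
        .filter_map(|(term, qw)| doc.get(term).map(|dw| qw * dw))
        .sum();
    // Euclidean lengths of both vectors.
    let q_norm = query.values().map(|w| w * w).sum::<f64>().sqrt();
    let d_norm = doc.values().map(|w| w * w).sum::<f64>().sqrt();
    if q_norm == 0.0 || d_norm == 0.0 {
        0.0
    } else {
        dot / (q_norm * d_norm)
    }
}
```

Without the division, a long document that mentions a query term a few times among thousands of other words can outscore a short document that is entirely about it. Dividing by the vector lengths removes that advantage.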

i compressed my search index by 76% and made everything worse (for now)

Replaced bincode with a hand-rolled variable-byte encoder. 34.8MB → 8.3MB. Also 50% slower. The tradeoff is wrong at 12K docs and right at 1M — and I can tell you exactly why.
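For reference, a minimal sketch of variable-byte encoding over doc ID gaps. It follows the textbook convention where the high bit marks the last byte of each number; the hand-rolled encoder in the article may use a different layout. All function names here are hypothetical.

```rust
/// Append the variable-byte encoding of `n` to `out`.
/// Each byte carries 7 payload bits; the high bit is set only on the final byte.
fn vb_encode_number(mut n: u32, out: &mut Vec<u8>) {
    let mut bytes = [0u8; 5]; // a u32 needs at most 5 groups of 7 bits
    let mut i = bytes.len();
    loop {
        i -= 1;
        bytes[i] = (n % 128) as u8;
        if n < 128 {
            break;
        }
        n /= 128;
    }
    bytes[4] |= 0x80; // mark the terminating byte
    out.extend_from_slice(&bytes[i..]);
}

/// Encode strictly increasing doc IDs as variable-byte gaps.
fn encode_postings(doc_ids: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0;
    for &id in doc_ids {
        vb_encode_number(id - prev, &mut out);
        prev = id;
    }
    out
}

/// Decode gaps back into doc IDs. Assumes well-formed input from `encode_postings`.
fn decode_postings(bytes: &[u8]) -> Vec<u32> {
    let mut doc_ids = Vec::new();
    let (mut n, mut prev) = (0u32, 0u32);
    for &b in bytes {
        if b < 128 {
            n = n * 128 + u32::from(b);
        } else {
            prev += n * 128 + u32::from(b - 128);
            doc_ids.push(prev);
            n = 0;
        }
    }
    doc_ids
}
```

Small gaps fit in one byte, which is where the size win comes from. The cost is that decoding walks the list byte by byte instead of reading fixed-width integers, which is one plausible source of the slowdown.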

my search engine now reads from disk like a real search engine

Moved from all-in-RAM to a dictionary-in-RAM / postings-on-disk architecture. 138,743 terms indexed. Queries do a nanosecond-scale dict lookup, then a single disk seek to exactly the bytes needed. The design should now scale to tens of millions of documents.
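The lookup path can be sketched roughly as follows. This assumes the in-memory dictionary maps each term to a byte offset and length in the postings file; `PostingsRef` and `read_postings` are hypothetical names, and the article's on-disk layout may differ.

```rust
use std::collections::HashMap;
use std::io::{self, Read, Seek, SeekFrom};

/// Where one term's postings live in the postings file (assumed layout).
#[derive(Clone, Copy)]
struct PostingsRef {
    offset: u64,
    len: usize,
}

/// Look the term up in RAM, then do one seek and one read for exactly its bytes.
fn read_postings<R: Read + Seek>(
    dict: &HashMap<String, PostingsRef>,
    store: &mut R,
    term: &str,
) -> io::Result<Option<Vec<u8>>> {
    let entry = match dict.get(term) {
        Some(entry) => *entry,
        None => return Ok(None), // unknown term: no disk access at all
    };
    store.seek(SeekFrom::Start(entry.offset))?;
    let mut buf = vec![0u8; entry.len];
    store.read_exact(&mut buf)?;
    Ok(Some(buf))
}
```

Any `Read + Seek` source works here, so the same function can be exercised against a `std::fs::File` or an in-memory `std::io::Cursor`.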

my search engine now corrects your spelling and finds exact phrases

Added positional phrase verification (37 matches → 14 real matches) and a three-layer spell corrector built from scratch: trigram index → Jaccard filter → Levenshtein. "Uniited Staates of Ameeriica" → corrected and searched in 19ms.
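A simplified sketch of the filter-then-rank idea behind the corrector. It scans a small word list linearly instead of using a trigram index, so it only illustrates the Jaccard and Levenshtein stages. The `$` padding, the threshold parameter, and all names are assumptions for this example.

```rust
use std::collections::HashSet;

/// Character trigrams of a word, padded with '$' at both ends.
fn trigrams(word: &str) -> HashSet<String> {
    let padded: Vec<char> = format!("${}$", word).chars().collect();
    padded.windows(3).map(|w| w.iter().collect::<String>()).collect()
}

/// Jaccard coefficient: size of the intersection over size of the union.
fn jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

/// Levenshtein edit distance, keeping only two rows of the DP table.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur.push(substitute.min(delete).min(insert));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Keep candidates whose trigram overlap clears the threshold, then pick the
/// one with the smallest edit distance.
fn correct<'a>(word: &str, dictionary: &[&'a str], min_jaccard: f64) -> Option<&'a str> {
    let grams = trigrams(word);
    dictionary
        .iter()
        .copied()
        .filter(|cand| jaccard(&grams, &trigrams(cand)) >= min_jaccard)
        .min_by_key(|cand| levenshtein(word, cand))
}
```

The cheap set comparison discards most of the vocabulary, so the more expensive edit distance only runs on a handful of candidates.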

building a search engine from scratch in rust

Indexed 11,314 documents from 20 Newsgroups. Inverted positional index, two-pointer intersection, sorted-by-length query optimization. One structural change (sort once at build, not per query) cut latency 7x.
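A minimal sketch of two-pointer intersection over sorted postings lists, processing the shortest list first. It assumes each list is already sorted by doc ID; `intersect` and `intersect_all` are hypothetical names, not the functions from the article.

```rust
use std::cmp::Ordering;

/// Intersect two sorted doc ID lists by walking both with one pointer each.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}

/// Intersect several lists, shortest first, so intermediate results stay small.
fn intersect_all(mut lists: Vec<&[u32]>) -> Vec<u32> {
    lists.sort_by_key(|list| list.len());
    let mut iter = lists.into_iter();
    let mut acc = match iter.next() {
        Some(first) => first.to_vec(),
        None => return Vec::new(),
    };
    for list in iter {
        if acc.is_empty() {
            break; // nothing can match any more
        }
        acc = intersect(&acc, list);
    }
    acc
}
```

Each pairwise pass is linear in the combined length of the two lists, and starting with the rarest term keeps every later pass bounded by the smallest result so far.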