search engine in rust

I'm building a search engine from scratch in Rust, chapter by chapter through Introduction to Information Retrieval by Manning, Raghavan, and Schütze. The goal is to understand what production search systems actually do: not by using them, but by building the pieces myself and hitting the walls they're designed around.

Each article is written the day I finish the chapter. Real numbers from a real corpus (11,314 documents from 20 Newsgroups), honest about tradeoffs, including the things that didn't work. Code lives on GitHub.

progress through the book

Currently on Chapter 7 — efficient scoring and top-K retrieval. Six chapters done. Fifteen to go.

articles

fixing the merge: from all-blocks-in-ram to streaming k-way merge

interlude · apr 21

My merge step loaded every block into RAM at once. Worked for 3 blocks on 12K docs. Would crash at 100 blocks on a million docs. Wrote a BlockReader. Now the merge holds one term per block at a time, so memory grows with the number of blocks, not the size of the corpus.

systems k-way merge
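
A minimal sketch of the streaming merge idea, not the code from the article. It assumes each block yields `(term, postings)` pairs sorted by term, and that lower-numbered blocks hold lower doc IDs. The `Entry` type and `kway_merge` are hypothetical names, and plain iterators stand in for a file-backed BlockReader.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// One (term, postings) pair as it might be read from a sorted block.
/// Hypothetical layout, chosen for illustration.
type Entry = (String, Vec<u32>);

/// Merges term-sorted blocks while holding only one pending entry per block.
/// `emit` receives each fully merged term exactly once, in sorted order.
fn kway_merge<I>(mut blocks: Vec<I>, mut emit: impl FnMut(String, Vec<u32>))
where
    I: Iterator<Item = Entry>,
{
    // Min-heap keyed on (term, block index): smallest term first,
    // ties broken by block order so postings stay in block order.
    let mut heap = BinaryHeap::new();
    for (i, block) in blocks.iter_mut().enumerate() {
        if let Some((term, postings)) = block.next() {
            heap.push(Reverse((term, i, postings)));
        }
    }

    let mut current: Option<Entry> = None;
    while let Some(Reverse((term, i, postings))) = heap.pop() {
        // Refill from the block that was just consumed.
        if let Some((t, p)) = blocks[i].next() {
            heap.push(Reverse((t, i, p)));
        }
        let same_term = matches!(&current, Some((t, _)) if *t == term);
        if same_term {
            // Same term seen in another block: append its postings.
            if let Some((_, merged)) = current.as_mut() {
                merged.extend(postings);
            }
        } else {
            // New term: flush the previous one and start accumulating.
            if let Some((t, p)) = current.take() {
                emit(t, p);
            }
            current = Some((term, postings));
        }
    }
    if let Some((t, p)) = current {
        emit(t, p);
    }
}
```

The heap never holds more than one entry per block, which is where the memory bound comes from. The merged postings list for the current term still has to fit in memory before it is emitted.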

my search engine now knows which documents matter most

day 5 · apr 21

Ranking 14 documents for "United States of America." First try put a 1,835-line firearms archive at #1 and a 13-line post literally about the USA dead last. Opened the files and figured out why. Cosine normalization fixed it.

tf-idf cosine ranking
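
A rough sketch of why length normalization changes the ranking, under stated assumptions: log-scaled tf times log10 idf for document weights, a weight of 1 for each query term, and division by the document vector's Euclidean length. The function names and the weighting scheme are illustrative and may differ from what the article uses.

```rust
use std::collections::HashMap;

/// Log-scaled tf-idf weight for one term in one document.
/// Assumed scheme: (1 + log10 tf) * log10(N / df).
fn weight(tf: u32, df: u32, n_docs: u32) -> f64 {
    if tf == 0 || df == 0 {
        return 0.0;
    }
    (1.0 + f64::from(tf).log10()) * (f64::from(n_docs) / f64::from(df)).log10()
}

/// Cosine-style score: the sum of the document's weights for the query
/// terms, divided by the length of the document's full weight vector.
fn cosine_score(
    query: &[&str],
    doc_tf: &HashMap<String, u32>,
    df: &HashMap<String, u32>,
    n_docs: u32,
) -> f64 {
    let w = |term: &str, tf: u32| weight(tf, df.get(term).copied().unwrap_or(0), n_docs);

    // Euclidean length over every term in the document, not just query terms.
    let norm = doc_tf
        .iter()
        .map(|(t, &tf)| w(t.as_str(), tf).powi(2))
        .sum::<f64>()
        .sqrt();
    if norm == 0.0 {
        return 0.0;
    }

    let dot: f64 = query
        .iter()
        .map(|&t| w(t, doc_tf.get(t).copied().unwrap_or(0)))
        .sum();
    dot / norm
}
```

A long document that mentions a query term a few times among hundreds of other terms has a large vector length, so its score shrinks. A short document that is mostly about the query term keeps a score close to 1.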

i compressed my search index by 76% and made everything worse (for now)

day 4 · apr 16

Replaced bincode with a hand-rolled variable-byte encoder. 34.8MB → 8.3MB. Also 50% slower. The tradeoff is wrong at 12K docs and right at 1M — and I can tell you exactly why.

compression vbyte
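
For readers who haven't seen variable-byte encoding, here is a small sketch. It assumes the textbook convention (7 payload bits per byte, high bit set on the final byte of each number) and gap-encodes sorted doc IDs. The article's hand-rolled encoder may use a different byte order or flag convention, and the function names are hypothetical.

```rust
/// Appends `n` to `out` as variable-byte: 7 payload bits per byte,
/// most significant group first, high bit set on the final byte.
fn vbyte_encode(mut n: u32, out: &mut Vec<u8>) {
    let start = out.len();
    loop {
        out.push((n & 0x7F) as u8);
        n >>= 7;
        if n == 0 {
            break;
        }
    }
    // Groups were pushed low-to-high; flip them and mark the last byte.
    out[start..].reverse();
    let last = out.len() - 1;
    out[last] |= 0x80;
}

/// Decodes a stream of variable-byte numbers.
fn vbyte_decode(bytes: &[u8]) -> Vec<u32> {
    let mut out = Vec::new();
    let mut n: u32 = 0;
    for &b in bytes {
        n = (n << 7) | u32::from(b & 0x7F);
        if b & 0x80 != 0 {
            out.push(n);
            n = 0;
        }
    }
    out
}

/// Encodes sorted doc IDs as gaps, so most numbers fit in one byte.
/// Assumes `doc_ids` is strictly increasing.
fn encode_postings(doc_ids: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0;
    for &id in doc_ids {
        vbyte_encode(id - prev, &mut out);
        prev = id;
    }
    out
}

/// Reverses `encode_postings` by summing the decoded gaps.
fn decode_postings(bytes: &[u8]) -> Vec<u32> {
    let mut prev = 0;
    vbyte_decode(bytes)
        .into_iter()
        .map(|gap| {
            prev += gap;
            prev
        })
        .collect()
}
```

Three doc IDs stored as raw `u32` values take 12 bytes. Encoded as gaps with this scheme, `[824, 829, 215406]` takes 6. Decoding is a per-byte loop with a branch, so a smaller index can still read slower than a fixed-width one.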

my search engine now reads from disk like a real search engine

day 3 · apr 14

Moved from all-in-RAM to a dictionary-in-RAM / postings-on-disk architecture. 138,743 terms indexed. Queries do a nanosecond-scale dict lookup, then a single disk seek to exactly the bytes needed. The design should now scale to tens of millions of documents.

disk blocks
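
A minimal sketch of the dictionary-in-RAM / postings-on-disk lookup. It assumes the dictionary maps each term to a byte offset and length in a postings file. `DictEntry` and `read_postings` are hypothetical names, and the on-disk layout here is illustrative only.

```rust
use std::collections::HashMap;
use std::io::{self, Read, Seek, SeekFrom};

/// Where one term's postings live in the postings file (assumed layout).
struct DictEntry {
    offset: u64,
    len: u32,
}

/// Looks the term up in the in-memory dictionary, then does one seek and
/// one exact-length read. Returns `Ok(None)` for unknown terms.
fn read_postings<R: Read + Seek>(
    dict: &HashMap<String, DictEntry>,
    file: &mut R,
    term: &str,
) -> io::Result<Option<Vec<u8>>> {
    let entry = match dict.get(term) {
        Some(entry) => entry,
        None => return Ok(None),
    };
    file.seek(SeekFrom::Start(entry.offset))?;
    let mut buf = vec![0u8; entry.len as usize];
    file.read_exact(&mut buf)?;
    Ok(Some(buf))
}
```

Because the function is generic over `Read + Seek`, it works with a `std::fs::File` in practice and with a `std::io::Cursor` over a byte vector in tests.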

my search engine now corrects your spelling and finds exact phrases

day 2 · apr 12

Added positional phrase verification (37 matches → 14 real matches) and a three-layer spell corrector built from scratch: trigram index → Jaccard filter → Levenshtein. "Uniited Staates of Ameeriica" → corrected and searched in 19ms.

phrase search spell correction
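
A compact sketch of the filter-then-rank idea behind the spell corrector. For brevity it scans a small vocabulary instead of consulting a trigram index, pads words with `$` at both ends, and takes the Jaccard threshold as a parameter. All names and the padding choice are assumptions, not the article's implementation.

```rust
use std::collections::HashSet;

/// Character trigrams of a word padded with `$` on both sides.
fn trigrams(word: &str) -> HashSet<String> {
    let chars: Vec<char> = format!("${}$", word).chars().collect();
    chars.windows(3).map(|w| w.iter().collect()).collect()
}

/// Jaccard similarity of two trigram sets: |A ∩ B| / |A ∪ B|.
fn jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let inter = a.intersection(b).count() as f64;
    let union = a.union(b).count() as f64;
    if union == 0.0 {
        0.0
    } else {
        inter / union
    }
}

/// Levenshtein edit distance using two rolling rows.
fn levenshtein(a: &str, b: &str) -> usize {
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur.push(substitute.min(delete).min(insert));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Keeps candidates whose trigram overlap clears `min_jaccard`,
/// then picks the one with the smallest edit distance.
fn correct<'a>(word: &str, vocab: &[&'a str], min_jaccard: f64) -> Option<&'a str> {
    let grams = trigrams(word);
    vocab
        .iter()
        .copied()
        .filter(|cand| jaccard(&grams, &trigrams(cand)) >= min_jaccard)
        .min_by_key(|cand| levenshtein(word, cand))
}
```

The cheap set comparison runs on every candidate, and the quadratic edit distance runs only on the survivors. A real trigram index would shrink the candidate set further before the Jaccard step.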

building a search engine from scratch in rust

day 1 · apr 12

Indexed 11,314 documents from 20 Newsgroups. Inverted positional index, two-pointer intersection, sorted-by-length query optimization. One structural change (sort once at build, not per query) cut latency 7x.

inverted index rust
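
A short sketch of two-pointer intersection with shortest-list-first ordering. It assumes every postings list is already sorted by doc ID, for example at index build time. `intersect` and `intersect_all` are hypothetical names.

```rust
use std::cmp::Ordering;

/// Intersects two sorted postings lists by walking both in lockstep.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}

/// Intersects many lists, shortest first, so the running result
/// stays as small as possible. Stops early once it is empty.
fn intersect_all(mut lists: Vec<&[u32]>) -> Vec<u32> {
    lists.sort_by_key(|l| l.len());
    let mut iter = lists.into_iter();
    let mut acc = match iter.next() {
        Some(first) => first.to_vec(),
        None => return Vec::new(),
    };
    for list in iter {
        if acc.is_empty() {
            break;
        }
        acc = intersect(&acc, list);
    }
    acc
}
```

Each pairwise intersection is linear in the combined length of its inputs. Starting from the rarest term means later, longer lists are only compared against a result that is already small.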