I'm building a search engine from scratch in Rust, chapter by chapter through Manning's Introduction to Information Retrieval. The goal is to actually understand what production search systems do — not to use them, but to build the pieces myself and hit the walls they're designed around.
Each article is written the day I finish the chapter. Real numbers from a real corpus (11,314 documents from 20 Newsgroups), honest about tradeoffs, including the things that didn't work. Code lives on GitHub.
My merge step loaded every block into RAM at once. Worked for 3 blocks on 12K docs. Would crash at 100 blocks on a million docs. Wrote a BlockReader. Now the merge holds one term per block at a time. Constant memory, whatever the scale.
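The shape of that fix, sketched as a toy k-way merge — this is illustrative, not the repo's actual code: here `BlockReader` pulls from an in-memory `Vec` instead of a file, and the heap holds at most one `(term, postings)` entry per block, which is what makes the memory constant.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Toy BlockReader: yields (term, postings) pairs in sorted term order.
// The real one would stream from an on-disk block file.
struct BlockReader {
    entries: std::vec::IntoIter<(String, Vec<u32>)>,
}

impl BlockReader {
    fn new(entries: Vec<(String, Vec<u32>)>) -> Self {
        Self { entries: entries.into_iter() }
    }
    fn next_entry(&mut self) -> Option<(String, Vec<u32>)> {
        self.entries.next()
    }
}

// K-way merge. The heap never holds more than one entry per block,
// so memory stays constant no matter how many blocks there are.
fn merge_blocks(mut readers: Vec<BlockReader>) -> Vec<(String, Vec<u32>)> {
    let mut heap = BinaryHeap::new();
    for (i, r) in readers.iter_mut().enumerate() {
        if let Some((term, postings)) = r.next_entry() {
            heap.push(Reverse((term, i, postings))); // min-heap via Reverse
        }
    }
    let mut out: Vec<(String, Vec<u32>)> = Vec::new();
    while let Some(Reverse((term, i, postings))) = heap.pop() {
        // Same term seen in an earlier block: append to its postings.
        let merged = match out.last_mut() {
            Some((t, p)) if *t == term => {
                p.extend(&postings);
                true
            }
            _ => false,
        };
        if !merged {
            out.push((term, postings));
        }
        // Refill the heap from the block we just consumed from.
        if let Some((t, p)) = readers[i].next_entry() {
            heap.push(Reverse((t, i, p)));
        }
    }
    out
}
```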
Ranking 14 documents for "United States of America." First try put a 1,835-line firearms archive at #1 and a 13-line post literally about the USA dead last. Opened the files and figured out why. Cosine normalization fixed it.
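The core of the fix, in miniature — not the article's code, just the idea: divide each document's tf-idf vector by its Euclidean length before scoring, so a long document can't win on raw term counts alone.

```rust
// Cosine normalization: scale a document's weight vector to unit length,
// so document length stops dominating the score.
fn cosine_normalize(weights: &[f64]) -> Vec<f64> {
    let norm = weights.iter().map(|w| w * w).sum::<f64>().sqrt();
    weights.iter().map(|w| w / norm).collect()
}

// Dot product of the query against the length-normalized document.
fn score(query: &[f64], doc: &[f64]) -> f64 {
    let d = cosine_normalize(doc);
    query.iter().zip(&d).map(|(q, w)| q * w).sum()
}
```

With a query weighting two terms, a 13-line post that's entirely about those terms now outscores a 1,835-line archive that mentions them in passing, because the archive's huge weight on everything else inflates its norm.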
Replaced bincode with a hand-rolled variable-byte encoder. 34.8MB → 8.3MB. Also 50% slower. The tradeoff is wrong at 12K docs and right at 1M — and I can tell you exactly why.
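The encoding scheme itself is the textbook one (7 payload bits per byte, high bit marking the last byte of each integer); a minimal sketch, not the repo's exact implementation:

```rust
// Variable-byte encoding: 7 bits of payload per byte; the high bit
// is set on the final byte of each integer.
fn vbyte_encode(mut n: u32, out: &mut Vec<u8>) {
    let mut bytes = vec![(n % 128) as u8];
    n /= 128;
    while n > 0 {
        bytes.push((n % 128) as u8);
        n /= 128;
    }
    bytes.reverse();
    *bytes.last_mut().unwrap() |= 0x80; // terminator bit
    out.extend(bytes);
}

// Decode a whole stream back into integers.
fn vbyte_decode(bytes: &[u8]) -> Vec<u32> {
    let mut nums = Vec::new();
    let mut n: u32 = 0;
    for &b in bytes {
        if b & 0x80 == 0 {
            n = n * 128 + b as u32; // continuation byte
        } else {
            n = n * 128 + (b & 0x7F) as u32; // final byte
            nums.push(n);
            n = 0;
        }
    }
    nums
}
```

Small gaps compress to one byte, rare large gaps take two or three — which is where the 4x size win comes from, and the per-byte branching is where the decode slowdown comes from.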
Moved from all-in-RAM to a dictionary-in-RAM / postings-on-disk architecture. 138,743 terms indexed. Queries do a nanosecond dict lookup, then a single disk seek to exactly the bytes needed. The whole thing now scales to tens of millions of documents.
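The lookup path can be sketched like this — a simplified stand-in for the real structures, generic over any `Seek + Read` source so an in-memory cursor can play the role of the postings file:

```rust
use std::collections::HashMap;
use std::io::{Read, Seek, SeekFrom};

// Dictionary in RAM: term -> (byte offset, byte length) into the postings store.
// The store is any Seek + Read source: a File in production, a Cursor in tests.
struct Index<R: Read + Seek> {
    dict: HashMap<String, (u64, u64)>,
    postings: R,
}

impl<R: Read + Seek> Index<R> {
    // One hash lookup, one seek, one read of exactly the bytes needed.
    fn lookup(&mut self, term: &str) -> Option<Vec<u8>> {
        let &(offset, len) = self.dict.get(term)?;
        let mut buf = vec![0u8; len as usize];
        self.postings.seek(SeekFrom::Start(offset)).ok()?;
        self.postings.read_exact(&mut buf).ok()?;
        Some(buf)
    }
}
```

The dictionary stays small (one entry per term) while the postings, the part that grows with the corpus, live entirely on disk.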
Added positional phrase verification (37 matches → 14 real matches) and a three-layer spell corrector built from scratch: trigram index → Jaccard filter → Levenshtein. "Uniited Staates of Ameeriica" → corrected and searched in 19ms.
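Two of the three layers, roughly — a sketch of the idea rather than the article's code (padding of trigrams and the trigram index itself omitted for brevity): cheap trigram Jaccard overlap prunes the candidate set, and the expensive edit distance runs only on survivors.

```rust
use std::collections::HashSet;

// Character trigrams of a term.
fn trigrams(s: &str) -> HashSet<String> {
    let chars: Vec<char> = s.chars().collect();
    chars.windows(3).map(|w| w.iter().collect()).collect()
}

// Jaccard overlap of two trigram sets: |A ∩ B| / |A ∪ B|.
fn jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let inter = a.intersection(b).count() as f64;
    let union = a.union(b).count() as f64;
    if union == 0.0 { 0.0 } else { inter / union }
}

// Full edit distance, O(len_a * len_b) with a rolling row —
// run only on candidates that pass the Jaccard filter.
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        let mut cur = vec![i];
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            cur.push((prev[j] + 1).min(cur[j - 1] + 1).min(prev[j - 1] + cost));
        }
        prev = cur;
    }
    prev[b.len()]
}
```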
Indexed 11,314 documents from 20 Newsgroups. Inverted positional index, two-pointer intersection, sorted-by-length query optimization. One structural change (sort once at build, not per query) cut latency 7x.
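The intersection itself is small enough to sketch — a minimal version of the textbook algorithm, not the repo's exact code; processing shortest lists first keeps intermediate results small, and each list's docID ordering is the thing established once at build time:

```rust
// Two-pointer intersection of two docID-sorted postings lists, O(m + n).
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        if a[i] == b[j] {
            out.push(a[i]);
            i += 1;
            j += 1;
        } else if a[i] < b[j] {
            i += 1;
        } else {
            j += 1;
        }
    }
    out
}

// Multi-term AND query: intersect shortest lists first, so the
// running result shrinks as fast as possible.
fn intersect_all(mut lists: Vec<&[u32]>) -> Vec<u32> {
    if lists.is_empty() {
        return Vec::new();
    }
    lists.sort_by_key(|l| l.len());
    let mut result = lists[0].to_vec();
    for l in &lists[1..] {
        result = intersect(&result, l);
    }
    result
}
```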