search engine in rust

2 min read ~240 wpm

I started this project as a chapter-by-chapter walk through Manning's Introduction to Information Retrieval, building each piece in Rust on the 20 Newsgroups corpus (~18,000 emails). Inverted index, BSBI, VByte compression, trigram spell correction, TF-IDF with cosine norm, tiered indexes, proximity scoring. Nine articles, all the math working on a toy corpus.

As of may 19, BM25F is live. The engine now scores docs with field-aware weighting (title, headers, code, body) and per-field length normalization, and the tokenizer splits CamelCase + underscore compounds so RunInstances hits the same docs as "run instances." The next stretch is page-level signals (PageRank from the cross-link graph), then an evaluation harness with hand-judged queries. Code lives on GitHub.

where this is now
Running BM25F over AWS documentation (14,266 markdown files, 18 services). Field-aware scoring (title, headers, code, body) with per-field length normalization. The tokenizer keeps AWS-shaped tokens intact (s3:GetObject, arn:aws:iam, t2.micro) and splits CamelCase + underscore compounds so API_RunInstances emits api, runinstances, run, instances at the same position. Frequency-weighted spell correction. Next: PageRank from the cross-link graph, then an evaluation harness with hand-judged queries.
14,266 docs indexed 18 AWS services 266,624 unique terms < 200ms typical query
engine capabilities
what the engine can do today, what's being built, what's next. hover a node.
Index Spell Compress Rank Proximity Tokenize BM25 Zones PageRank Eval Serve
hover any node to see what it does and where it stands
built
just shipped
in progress
upcoming
articles
CamelCase tokenization and the BM25F ceiling

Title weight 5.0 wasn't enough to push API_RunInstances.md to rank 1 — a single title token couldn't out-vote twenty body hits. The fix was upstream: split RunInstances into run, instances, and the original at tokenize time. Eight previously-impossible queries (DescribeInstances, UpdateStack, PutObject) now return useful results. Two regressions on the original golden set. Where BM25F runs out of headroom.

Fields: extracting structure from AWS markdown for BM25F

A Python analysis script over the whole corpus picked the field set — title, headers, code, body — and cut bold_labels and links into body. A ** token contamination bug from API-reference docs; the fix runs only on non-code fields because * means something in code. Rust's regex crate doesn't support lookbehind (linear-time guarantee), so the Python regex became a capture-group rewrite. BM25F math: IDF stays global, length norm goes per-field, weighting goes inside the pseudo-tf. 13-query golden set with four A's and two D's.

Killing tiers to make room for BM25F

Killing a feature I'd built and was proud of to make room for a different one. Tiers were a workaround for flat tf-idf — BM25F made them redundant. The new on-disk posting format adds a field layer: {term → doc_id → field → positions}. Position gaps reset between fields, doc-id gaps don't. Empty fields cost zero bytes. Refactor walked outward from encode_decode.rs through seven files; the Rust compiler acted as the audit. main.rs shrank from a ~50-line tier loop to one retrieval pass.

the AWS corpus, the tokenizer, and the spell corrector

14,266 markdown files. A character-frequency analysis on the whole corpus picked the keep list (. - _ : / ' @ *). Asymmetric edge trimming so cause: trims but s3:Get* survives. A spell corrector that now resolves instnace to instance instead of instace by sorting on (edit_distance ASC, doc_freq DESC) and dropping rare candidates before Jaccard. Side effect: spell-check latency dropped 2-3x.

why I'm changing course: from the Manning book to AWS docs

The Manning chapters did their job — the math works on a toy corpus from 1997. The artifact at the end is a search engine nobody uses. So I'm pivoting to a real corpus: AWS documentation. Evidence from re:Post, GitHub, devRant on why AWS docs search is genuinely bad, the 18 services that serve 80%+ of workloads, what I'm keeping from Manning, and what I'm rewriting or adding (BM25, zones, PageRank, eval harness, adapter trait).

i replaced strict phrase matching with proximity scoring — and a 13-line doc beat a 1,835-line doc by 4.25x

Day 6 ended with a known flaw: the strict phrase filter was killing real queries. Replaced it with a hand-coded continuous boost — boost = 1 + k/ω, where ω is the smallest window containing all query terms. israeli palestinian went from 0 results to 10. The doc 8802 / doc 9059 ratio nearly doubled (2.3x → 4.25x). Three textbook chapters compounding on the same data.

i built tiered indexes — and 0.49% of postings landed in the top tier

Manning Chapter 7.2.1 — split posting lists by tf, query the top tier first, fall back. Followed the textbook thresholds. Tier 1 ended up holding 0.49% of postings; most queries returned zero. Ran a histogram on 1.7M (term, doc, tf) triples, picked thresholds from the actual data, and along the way uncovered the next thing to fix: strict phrase filtering was killing valid AND-matches.

fixing the merge: from all-blocks-in-ram to streaming k-way merge

My merge step loaded every block into RAM at once. Worked for 3 blocks on 12K docs. Would crash at 100 blocks on a million docs. Wrote a BlockReader. Now the merge holds one term per block at a time. Constant memory, whatever the scale.

my search engine now knows which documents matter most

Ranking 14 documents for "United States of America." First try put a 1,835-line firearms archive at #1 and a 13-line post literally about the USA at last. Opened the files and figured out why. Cosine normalization fixed it.

i compressed my search index by 76% and made everything worse (for now)

Replaced bincode with a hand-rolled variable-byte encoder. 34.8MB → 8.3MB. Also 50% slower. The tradeoff is wrong at 12K docs and right at 1M — and I can tell you exactly why.

my search engine now reads from disk like a real search engine

Moved from all-in-RAM to a dictionary-in-RAM / postings-on-disk architecture. 138,743 terms indexed. Queries do a nanosecond dict lookup, then a single disk seek to exactly the bytes needed. The whole thing now scales to tens of millions of documents.

my search engine now corrects your spelling and finds exact phrases

Added positional phrase verification (37 matches → 14 real matches) and a three-layer spell corrector built from scratch: trigram index → Jaccard filter → Levenshtein. "Uniited Staates of Ameeriica" → corrected and searched in 19ms.

building a search engine from scratch in rust

Indexed 11,314 documents from 20 Newsgroups. Inverted positional index, two-pointer intersection, sorted-by-length query optimization. One structural change (sort once at build, not per query) cut latency 7x.