I started this project as a chapter-by-chapter walk through Manning's Introduction to Information Retrieval, building each piece in Rust on the 20 Newsgroups corpus (~18,000 emails). Inverted index, BSBI, VByte compression, trigram spell correction, TF-IDF with cosine norm, tiered indexes, proximity scoring. Nine articles, all the math working on a toy corpus.
As of may 19, BM25F is live. The engine now scores docs with field-aware weighting (title, headers, code, body) and per-field length normalization, and the tokenizer splits CamelCase + underscore compounds so RunInstances hits the same docs as "run instances." The next stretch is page-level signals (PageRank from the cross-link graph), then an evaluation harness with hand-judged queries. Code lives on GitHub.
s3:GetObject, arn:aws:iam, t2.micro) and splits CamelCase + underscore compounds so API_RunInstances emits api, runinstances, run, instances at the same position. Frequency-weighted spell correction. Next: PageRank from the cross-link graph, then an evaluation harness with hand-judged queries.
Title weight 5.0 wasn't enough to push API_RunInstances.md to rank 1 — a single title token couldn't out-vote twenty body hits. The fix was upstream: split RunInstances into run, instances, and the original at tokenize time. Eight previously-impossible queries (DescribeInstances, UpdateStack, PutObject) now return useful results. Two regressions on the original golden set. Where BM25F runs out of headroom.
A Python analysis script over the whole corpus picked the field set — title, headers, code, body — and cut bold_labels and links into body. A ** token contamination bug from API-reference docs; the fix runs only on non-code fields because * means something in code. Rust's regex crate doesn't support lookbehind (linear-time guarantee), so the Python regex became a capture-group rewrite. BM25F math: IDF stays global, length norm goes per-field, weighting goes inside the pseudo-tf. 13-query golden set with four A's and two D's.
Killing a feature I'd built and was proud of to make room for a different one. Tiers were a workaround for flat tf-idf — BM25F made them redundant. The new on-disk posting format adds a field layer: {term → doc_id → field → positions}. Position gaps reset between fields, doc-id gaps don't. Empty fields cost zero bytes. Refactor walked outward from encode_decode.rs through seven files; the Rust compiler acted as the audit. main.rs shrank from a ~50-line tier loop to one retrieval pass.
14,266 markdown files. A character-frequency analysis on the whole corpus picked the keep list (. - _ : / ' @ *). Asymmetric edge trimming so cause: trims but s3:Get* survives. A spell corrector that now resolves instnace to instance instead of instace by sorting on (edit_distance ASC, doc_freq DESC) and dropping rare candidates before Jaccard. Side effect: spell-check latency dropped 2-3x.
The Manning chapters did their job — the math works on a toy corpus from 1997. The artifact at the end is a search engine nobody uses. So I'm pivoting to a real corpus: AWS documentation. Evidence from re:Post, GitHub, devRant on why AWS docs search is genuinely bad, the 18 services that serve 80%+ of workloads, what I'm keeping from Manning, and what I'm rewriting or adding (BM25, zones, PageRank, eval harness, adapter trait).
Day 6 ended with a known flaw: the strict phrase filter was killing real queries. Replaced it with a hand-coded continuous boost — boost = 1 + k/ω, where ω is the smallest window containing all query terms. israeli palestinian went from 0 results to 10. The doc 8802 / doc 9059 ratio nearly doubled (2.3x → 4.25x). Three textbook chapters compounding on the same data.
Manning Chapter 7.2.1 — split posting lists by tf, query the top tier first, fall back. Followed the textbook thresholds. Tier 1 ended up holding 0.49% of postings; most queries returned zero. Ran a histogram on 1.7M (term, doc, tf) triples, picked thresholds from the actual data, and along the way uncovered the next thing to fix: strict phrase filtering was killing valid AND-matches.
My merge step loaded every block into RAM at once. Worked for 3 blocks on 12K docs. Would crash at 100 blocks on a million docs. Wrote a BlockReader. Now the merge holds one term per block at a time. Constant memory, whatever the scale.
Ranking 14 documents for "United States of America." First try put a 1,835-line firearms archive at #1 and a 13-line post literally about the USA at last. Opened the files and figured out why. Cosine normalization fixed it.
Replaced bincode with a hand-rolled variable-byte encoder. 34.8MB → 8.3MB. Also 50% slower. The tradeoff is wrong at 12K docs and right at 1M — and I can tell you exactly why.
Moved from all-in-RAM to a dictionary-in-RAM / postings-on-disk architecture. 138,743 terms indexed. Queries do a nanosecond dict lookup, then a single disk seek to exactly the bytes needed. The whole thing now scales to tens of millions of documents.
Added positional phrase verification (37 matches → 14 real matches) and a three-layer spell corrector built from scratch: trigram index → Jaccard filter → Levenshtein. "Uniited Staates of Ameeriica" → corrected and searched in 19ms.
Indexed 11,314 documents from 20 Newsgroups. Inverted positional index, two-pointer intersection, sorted-by-length query optimization. One structural change (sort once at build, not per query) cut latency 7x.