For the last few weeks I've been walking through Manning's Introduction to Information Retrieval chapter by chapter, building each piece in Rust. Inverted index, BSBI block construction, VByte and gap encoding, spell correction with trigrams and Jaccard and Levenshtein, TF-IDF with cosine normalization, tiered indexes with data-driven thresholds, proximity scoring, phrase filtering (which I later retired and replaced). Six articles shipped along the way.
The engine works. On the 20 Newsgroups corpus (~18,000 Usenet posts) it indexes in under a minute and answers queries in 20–50ms. Spell correction handles real typos, proximity scoring lifts phrase-like results without the rigidity of strict phrase matching, and the Manning chapters did what they were supposed to do, which is teach the math.
The problem is that the artifact at the end is a search engine for a 1990s Usenet corpus that nobody uses and nobody will. I have the math right and the implementation is fast, but the deliverable is a textbook exercise.
So I'm changing course. Same engine, different corpus, real users.
Why AWS docs
There are two reasons, and one of them I can actually point to with evidence.
AWS documentation is genuinely hard to search. This isn't my opinion; it's a sustained complaint from developers who use the platform every day, going back years.
What threads through those complaints is the same failure mode: developers can't find the page they need. The search experience on docs.aws.amazon.com gives results that are too broad, too tangential, or from the wrong service entirely. People end up resorting to Google with site:docs.aws.amazon.com filters and still struggle, partly because Google's ranking isn't tuned for technical reference docs where exact-match tokens like s3:GetObject or arn:aws:iam::* carry the entire meaning of the query.
That's the provable reason. The other one I trust more from experience than from sourcing: 80%+ of AWS workloads use a small subset of services. EC2, S3, Lambda, IAM, VPC, RDS, DynamoDB, CloudWatch, CloudFormation — the same names show up in every "top AWS services" roundup. Wikipedia's summary lists "Amazon Elastic Compute Cloud (EC2), Amazon Simple Storage Service (Amazon S3)... and AWS Lambda" as the most popular, and Urolime's 2025 industry report cites "EC2's 64% enterprise adoption to S3's use by 90% of Fortune 100 companies." Which means I don't need to index every AWS service. I just need to cover the ones developers actually search for.
I picked 18: EC2, S3, IAM, Lambda, VPC, RDS, CloudWatch, DynamoDB, CloudFormation, API Gateway, Route 53, SQS, SNS, ECS, EKS, CloudFront, Cognito, KMS. Between them they serve almost every real query, and the corpus comes out to about 14,000 documents.
One thing that made the corpus easy to get
My first instinct was to scrape docs.aws.amazon.com as HTML, parse the DOM, and extract text content. While reading through the docs by hand to figure out which tags I'd need, I noticed something: AWS publishes its documentation source on GitHub under the awsdocs/ org. The EKS user guide, the Lambda developer guide, OpenSearch, Panorama — all of them are public repositories of markdown files. The docs site renders these to HTML at build time.
Better still, the URLs are symmetric. Every page at https://docs.aws.amazon.com/.../foo.html is also available at https://docs.aws.amazon.com/.../foo.md. Swap the extension and you get markdown source directly. No HTML parsing, no DOM traversal, no figuring out which <div> is body and which is sidebar. The next article goes into the scraper and the rest, but this single observation removed a whole layer of work I would have done.
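The extension swap is small enough to sketch here. This is a minimal illustration and not the scraper from the next article: `markdown_url` is a hypothetical helper name, and it assumes the `.html` to `.md` symmetry described above holds for whatever page it is given.

```rust
/// Rewrite a rendered docs URL ending in `.html` to its markdown twin.
/// Returns `None` when the path does not end in `.html`.
/// Hypothetical helper; it only rewrites the string and does not check
/// that the `.md` URL actually exists.
fn markdown_url(html_url: &str) -> Option<String> {
    // Drop any `#fragment` or `?query` before looking at the extension.
    let base = html_url
        .split(|c: char| c == '#' || c == '?')
        .next()
        .unwrap_or(html_url);
    base.strip_suffix(".html")
        .map(|stem| format!("{}.md", stem))
}
```

Returning `Option` keeps non-HTML links (PDFs, images, external sites) out of the fetch queue without any special casing at the call site.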
What I'm keeping from Manning
The Manning book taught me how a search engine actually works, and by chapter 7 I had the full pipeline running on the toy corpus. Pretty much none of what comes next is possible without that foundation.
Keeping unchanged:
- The inverted positional index (term → {doc_id → [positions]}).
- BSBI block construction. Index 4,000 docs to memory, flush to disk, repeat, k-way merge at the end.
- VByte + gap encoding for posting lists. 76% compression on the 20 Newsgroups corpus and the same encoding will work here.
- Trigram index for spell correction candidates.
- Tiered indexes with thresholds learned from the actual term-frequency distribution.
- Proximity scoring using (1 + k/ω), where ω is the minimum span containing all query terms. From the previous article.
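For readers who skipped the earlier articles, the VByte + gap encoding item can be sketched roughly as follows. This is an illustration and not the engine's actual code: the function names are made up for this example, doc IDs are assumed to be strictly ascending `u32` values, and the byte layout follows the textbook convention where the high bit marks the last byte of each number.

```rust
/// Append the VByte encoding of `n` to `out`: 7 payload bits per byte,
/// most significant group first, high bit set on the final byte.
fn vbyte_encode(mut n: u32, out: &mut Vec<u8>) {
    let start = out.len();
    loop {
        out.push((n & 0x7f) as u8); // low 7 bits first, reversed below
        n >>= 7;
        if n == 0 {
            break;
        }
    }
    out[start..].reverse();
    let last = out.len() - 1;
    out[last] |= 0x80; // terminator bit
}

/// Gap-encode a sorted posting list, then VByte-encode each gap.
/// Assumes `doc_ids` is ascending; an unsorted list would underflow.
fn encode_postings(doc_ids: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u32;
    for &id in doc_ids {
        vbyte_encode(id - prev, &mut out);
        prev = id;
    }
    out
}

/// Inverse of `encode_postings`: decode each gap and prefix-sum back to IDs.
fn decode_postings(bytes: &[u8]) -> Vec<u32> {
    let mut ids = Vec::new();
    let (mut n, mut prev) = (0u32, 0u32);
    for &b in bytes {
        n = (n << 7) | u32::from(b & 0x7f);
        if b & 0x80 != 0 {
            prev += n;
            ids.push(prev);
            n = 0;
        }
    }
    ids
}
```

With the posting list 824, 829, 215406, the gaps are 824, 5, 214577 and the encoded bytes come out as 06 B8 85 0D 0C B1: six bytes instead of twelve for three raw `u32` values.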
Keeping but rewriting:
- The tokenizer. The Manning version splits on whitespace and strips to alphanumeric, which destroys AWS-shaped tokens like s3:GetObject, arn:aws:s3:::my-bucket, and t2.micro. The new tokenizer keeps a specific set of connector characters that survive in AWS prose; the next article covers what I picked and why.
- The spell corrector. Same trigram + Jaccard + Levenshtein pipeline, but with frequency-weighted ranking. On the AWS corpus the original corrector would suggest instace (a typo that appears once in the corpus) instead of instance (which appears thousands of times). Small fix, big accuracy difference.
- Cleanup. The old cleanup.rs stripped everything that wasn't alphanumeric, while the new one keeps internal characters that carry meaning, trims dangling punctuation at edges, and normalizes ligatures and curly quotes.
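To make the tokenizer and cleanup items concrete, here is a rough sketch of the idea. The connector set below is a guess for illustration only, since the real list is the subject of the next article. Lowercasing is also an assumption, and ligature and curly-quote normalization is left out.

```rust
/// Characters kept when they appear inside a token.
/// Hypothetical set for this example, not the engine's actual list.
const CONNECTORS: &[char] = &[':', '.', '-', '_', '/'];

/// Split on whitespace, keep alphanumerics and connector characters,
/// then trim any non-alphanumeric characters dangling at either edge.
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace()
        .filter_map(|raw| {
            // Drop characters that are neither alphanumeric nor connectors.
            let kept: String = raw
                .chars()
                .filter(|c| c.is_alphanumeric() || CONNECTORS.contains(c))
                .collect();
            // "t2.micro." -> "t2.micro"; a bare "--" disappears entirely.
            let trimmed = kept.trim_matches(|c: char| !c.is_alphanumeric());
            if trimmed.is_empty() {
                None
            } else {
                Some(trimmed.to_lowercase()) // lowercasing is an assumption
            }
        })
        .collect()
}
```

On the sentence "Grant s3:GetObject on arn:aws:s3:::my-bucket, then launch a t2.micro." this keeps s3:getobject, arn:aws:s3:::my-bucket, and t2.micro as single tokens, where an alphanumeric-only filter would collapse them to s3getobject, arnawss3mybucket, and t2micro.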
Adding:
- BM25, then BM25F. Cosine normalization divides by the length of the document vector, which over-rewards short tangential pages: the current top result for "s3 bucket" is a Route 53 troubleshooting page rather than the S3 user guide. BM25's b parameter softens that, and BM25F adds per-zone weights so a token in a page title scores higher than the same token buried in body text.
- Zones. AWS markdown has identifiers (#, ##, code fences, the ** [fieldName] ** pattern in API references) that are structural signals. Same approach I used on Wikipedia film pages for an earlier project.
- PageRank from the link graph. AWS docs cross-link extensively, and a page that 50 other pages point to is more authoritative than one nothing links to.
- An adapter layer. The engine itself stays corpus-agnostic: a CorpusAdapter trait wraps the AWS-specific parts (markdown parsing, link extraction, zone detection), so a different corpus means swapping the adapter and not the math.
- An evaluation harness. 30 hand-judged queries, P@1, P@10, MAP. Without it I can't tell whether any of my BM25 tuning is actually moving anything.
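The role of b in the BM25 item is easiest to see in code. What follows is a sketch of the standard single-field BM25 term score, not the engine's implementation: the `Bm25` struct and its field names are invented for this example, the IDF is the non-negative `ln(1 + ...)` variant (one of several in common use), and k1 = 1.2 with b = 0.75 are common defaults rather than tuned values.

```rust
/// Corpus-level statistics plus the two BM25 knobs.
/// Hypothetical struct for illustration.
struct Bm25 {
    k1: f64,      // term-frequency saturation; 1.2 is a common default
    b: f64,       // length normalization: 0.0 = none, 1.0 = full
    n_docs: f64,  // number of documents in the corpus
    avg_len: f64, // average document length in tokens
}

impl Bm25 {
    /// Inverse document frequency for a term found in `df` documents.
    fn idf(&self, df: f64) -> f64 {
        (1.0 + (self.n_docs - df + 0.5) / (df + 0.5)).ln()
    }

    /// Contribution of one query term to one document's score.
    fn term_score(&self, tf: f64, df: f64, doc_len: f64) -> f64 {
        if tf == 0.0 {
            return 0.0;
        }
        // At b = 0 this factor is 1 and document length is ignored;
        // at b = 1 term frequency is fully scaled by relative length.
        let norm = 1.0 - self.b + self.b * doc_len / self.avg_len;
        self.idf(df) * tf * (self.k1 + 1.0) / (tf + self.k1 * norm)
    }
}
```

A document's score for a query is the sum of `term_score` over the query terms. BM25F keeps the same shape but computes a weighted, per-zone normalized term frequency before the saturation step, which is where title and body weights would plug in.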
The Manning book is a teaching artifact — it exists to explain the math, and the corpus it uses exists so the math has something to operate on. A search engine for AWS docs is a different kind of thing. It has to operate on a real corpus with real structure and real adversarial queries, and the output has to be measurably better than what users get from the AWS docs site itself.
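"Measurably better" comes down to the three numbers named in the harness item: P@1, P@10, and MAP. A minimal sketch of those metrics follows, assuming documents are identified by `u32` IDs and each query's judgments are a set of relevant IDs. The function names are illustrative, not the harness's actual API.

```rust
use std::collections::HashSet;

/// Fraction of the top `k` results that are relevant.
/// Divides by `k` even when fewer than `k` results were returned.
fn precision_at_k(ranked: &[u32], relevant: &HashSet<u32>, k: usize) -> f64 {
    if k == 0 {
        return 0.0;
    }
    let hits = ranked
        .iter()
        .take(k)
        .filter(|d| relevant.contains(*d))
        .count();
    hits as f64 / k as f64
}

/// Average of precision@rank taken at each rank where a relevant doc appears,
/// divided by the total number of relevant docs for the query.
fn average_precision(ranked: &[u32], relevant: &HashSet<u32>) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let mut hits = 0usize;
    let mut sum = 0.0;
    for (i, doc) in ranked.iter().enumerate() {
        if relevant.contains(doc) {
            hits += 1;
            sum += hits as f64 / (i + 1) as f64;
        }
    }
    sum / relevant.len() as f64
}

/// Mean of `average_precision` over (ranking, judgments) pairs, one per query.
fn mean_average_precision(runs: &[(Vec<u32>, HashSet<u32>)]) -> f64 {
    if runs.is_empty() {
        return 0.0;
    }
    let total: f64 = runs
        .iter()
        .map(|(ranked, relevant)| average_precision(ranked, relevant))
        .sum();
    total / runs.len() as f64
}
```

Running the same 30 judged queries before and after a change gives a direct comparison on all three numbers.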
Next article: the AWS corpus, the tokenizer rewrite, and the spell-corrector fix, with numbers.