Why I'm changing course: from the Manning book to AWS docs


For the last few weeks I've been walking through Manning's Introduction to Information Retrieval chapter by chapter, building each piece in Rust. Inverted index, BSBI block construction, VByte and gap encoding, spell correction with trigrams and Jaccard and Levenshtein, TF-IDF with cosine normalization, tiered indexes with data-driven thresholds, proximity scoring, phrase filtering (which I later retired and replaced). Six articles shipped along the way.

The engine works. On the 20 Newsgroups corpus (~18,000 Usenet posts) it indexes in under a minute and answers queries in 20–50ms. Spell correction handles real typos, proximity scoring lifts phrase-like results without the rigidity of strict phrase matching, and the Manning chapters did what they were supposed to do, which is teach the math.

The problem is that the artifact at the end is a search engine for a 1990s newsgroup corpus that nobody uses and nobody will. I have the math right and the implementation is fast, but the deliverable is a textbook exercise.

So I'm changing course. Same engine, different corpus, real users.

Why AWS docs

There are two reasons, and one of them I can actually point to with evidence.

AWS documentation is genuinely hard to search. This isn't my opinion — it's a sustained complaint from developers who use the platform every day, going back years.

The evidence

"the documentation can sometimes be dense and challenging to navigate" (AWS re:Post thread, "Anyone else also thinks AWS documentation is full of fluff and makes finding useful information difficult?")
"Tasks that should take 30 seconds take 30 minutes to figure out." (GitHub issue aws/aws-sdk-js-v3 #4318, from a 10-year AWS developer writing about the v3 SDK docs)
"they have like over 9000 pages. This is so frustrating." (devRant thread, "WHY AWS Docs are so awful?")
"I love you AWS but your documentation and support suck enormously" (Miriam Schwab, Medium, 2018). Still applies in 2026.

What threads through all of these is the same failure mode: developers can't find the page they need. Search on docs.aws.amazon.com returns results that are too broad, too tangential, or from the wrong service entirely. People fall back to Google with a site:docs.aws.amazon.com filter and still struggle, partly because Google's ranking isn't tuned for technical reference docs, where exact-match tokens like s3:GetObject or arn:aws:iam::* carry the entire meaning of the query.
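To make the exact-match point concrete, here is a minimal sketch of a tokenizer that keeps identifiers like s3:GetObject in one piece. It is not the engine's actual tokenizer. The function names and the set of joiner characters are assumptions made for illustration.

```rust
/// Characters allowed inside a token in addition to letters and digits.
/// Assumption: this set is a guess at what AWS-style identifiers need.
fn is_joiner(c: char) -> bool {
    matches!(c, ':' | '*' | '-' | '_' | '/' | '.')
}

/// Split text into lowercase tokens while keeping identifiers such as
/// `s3:GetObject` or `arn:aws:iam::*` whole.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !(c.is_alphanumeric() || is_joiner(c)))
        // Trim punctuation stuck to the edges of a token, e.g. the full
        // stop at the end of a sentence. `*` and `_` are left in place.
        .map(|raw| raw.trim_matches(|c: char| matches!(c, '.' | ':' | '-' | '/')))
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// For contrast: a naive tokenizer that splits on every non-alphanumeric
/// character and therefore breaks `s3:GetObject` into two tokens.
fn tokenize_naive(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}
```

Calling tokenize("Allow s3:GetObject on arn:aws:iam::* resources.") keeps s3:getobject and arn:aws:iam::* as single tokens, while tokenize_naive splits the first into s3 and getobject and drops the wildcard entirely. With the naive version, the query can no longer be matched as the exact identifier the user typed.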

That's the provable reason. The other one I trust more from experience than from sourcing: 80%+ of AWS workloads use a small subset of services. EC2, S3, Lambda, IAM, VPC, RDS, DynamoDB, CloudWatch, CloudFormation — the same names show up in every "top AWS services" roundup. Wikipedia's summary lists "Amazon Elastic Compute Cloud (EC2), Amazon Simple Storage Service (Amazon S3)... and AWS Lambda" as the most popular, and Urolime's 2025 industry report cites "EC2's 64% enterprise adoption to S3's use by 90% of Fortune 100 companies." Which means I don't need to index every AWS service. I just need to cover the ones developers actually search for.

I picked 18: EC2, S3, IAM, Lambda, VPC, RDS, CloudWatch, DynamoDB, CloudFormation, API Gateway, Route 53, SQS, SNS, ECS, EKS, CloudFront, Cognito, KMS. Between them they serve almost every real query, and the corpus comes out to about 14,000 documents.

One thing that made the corpus easy to get

My first instinct was to scrape docs.aws.amazon.com as HTML, parse the DOM, and extract text content. While reading through the docs by hand to figure out which tags I'd need, I noticed something: AWS publishes its documentation source on GitHub under the awsdocs/ org. The EKS user guide, the Lambda developer guide, OpenSearch, Panorama — all of them are public repositories of markdown files. The docs site renders these to HTML at build time.

Better still, the URLs are symmetric. Every page at https://docs.aws.amazon.com/.../foo.html is also available at https://docs.aws.amazon.com/.../foo.md. Swap the extension and you get markdown source directly. No HTML parsing, no DOM traversal, no figuring out which <div> is body and which is sidebar. The next article goes into the scraper and the rest, but this single observation removed a whole layer of work I would have done.
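The extension swap can be sketched as a small helper. This is an illustration, not the scraper from the next article: the function name markdown_url is hypothetical, the example path in the usage note is made up, and dropping any query string or fragment is an assumption about what a fetcher would want.

```rust
/// Rewrite a rendered docs URL ending in `.html` to its `.md` counterpart.
/// Returns `None` when the path does not end in `.html`.
fn markdown_url(html_url: &str) -> Option<String> {
    // Assumption: a query string or fragment is irrelevant when fetching
    // the markdown source, so everything from `?` or `#` onward is dropped.
    let path = match html_url.find(|c: char| c == '?' || c == '#') {
        Some(i) => &html_url[..i],
        None => html_url,
    };
    // Only rewrite paths that actually end in `.html`.
    let stem = path.strip_suffix(".html")?;
    Some(format!("{}.md", stem))
}
```

With a made-up path, markdown_url("https://docs.aws.amazon.com/example/guide/foo.html#section") returns Some("https://docs.aws.amazon.com/example/guide/foo.md"), and a URL that does not end in .html returns None so the caller can skip it.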

What I'm keeping from Manning

The Manning book taught me how a search engine actually works, and by chapter 7 I had the full pipeline running on the toy corpus. Pretty much none of what comes next is possible without that foundation.

Keeping unchanged:

Keeping but rewriting:

Adding:

The Manning book is a teaching artifact — it exists to explain the math, and the corpus it uses exists so the math has something to operate on. A search engine for AWS docs is a different kind of thing. It has to operate on a real corpus with real structure and real adversarial queries, and the output has to be measurably better than what users get from the AWS docs site itself.

Next article: the AWS corpus, the tokenizer rewrite, and the spell-corrector fix, with numbers.