Day 1: Building a search engine from scratch in Rust


I've been working through Introduction to Information Retrieval by Manning, Raghavan & Schütze. I find search engines pretty cool and wanted to actually understand how they work. So I'm building one from scratch in Rust, learning both the language and the IR theory as I go.

What I built today

Indexed 11,314 documents from the 20 Newsgroups dataset. Built an inverted positional index — every unique term maps to which documents contain it and at what exact positions within each document. Around 120K unique terms across the corpus. Then built a query engine on top of it.

The data structure

For each term, I store a mapping of docID → [positions]. So "united" maps to {doc 287: [45, 302], doc 525: [12], ...}. This tells me not just which documents contain a word, but where it appears. The positional data will matter a lot when I add phrase search.
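In Rust, that shape can be sketched as nested maps — term → (docID → positions). The names below are my own illustration, not the repo's actual types:

```rust
use std::collections::HashMap;

// docID -> positions where the term occurs in that document
type Postings = HashMap<u32, Vec<u32>>;

// Build a tiny demo index matching the "united" example above.
fn build_demo_index() -> HashMap<String, Postings> {
    let mut index: HashMap<String, Postings> = HashMap::new();

    // "united" appears in doc 287 at positions 45 and 302...
    index
        .entry("united".to_string())
        .or_default()
        .entry(287)
        .or_default()
        .extend([45, 302]);
    // ...and in doc 525 at position 12.
    index
        .entry("united".to_string())
        .or_default()
        .insert(525, vec![12]);

    index
}

fn main() {
    let index = build_demo_index();
    let postings = &index["united"];
    assert_eq!(postings[&287], vec![45, 302]);
    println!("{:?}", postings.get(&287)); // positions of "united" in doc 287
}
```

The `entry(...).or_default()` chain is the idiomatic way to insert-or-update nested maps without a separate "does this key exist" check.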

How queries actually work

When you search "United States of America", the engine does the following:

1. Looks up each term and pulls its sorted list of docIDs. Four terms, four lists. Each lookup is an O(1) hash-map access.

2. Sorts the four lists by length, shortest first. The result of intersecting two lists can never be bigger than the shorter input. Starting small keeps every intermediate result small. This is a greedy optimization and the book explains why it works.

3. Then runs a two-pointer merge intersection. Two pointers, one per list, both walking forward through sorted docIDs. If values match — store it, advance both pointers. If one is smaller — advance that pointer only. You never go backward.

Cost: O(x + y) where x and y are list lengths. Without sorted lists you'd need a nested loop at O(x × y). For lists of 500 and 5,000 entries, that's 5,500 comparisons instead of 2,500,000. At scale this is the difference between milliseconds and minutes.
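The two-pointer merge in step 3 is short enough to show in full. A minimal sketch, assuming both inputs are already sorted ascending (which the index guarantees):

```rust
// Intersect two sorted docID lists in O(x + y) with two forward-only pointers.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        if a[i] == b[j] {
            out.push(a[i]); // match: keep it, advance both pointers
            i += 1;
            j += 1;
        } else if a[i] < b[j] {
            i += 1; // a is behind: advance it
        } else {
            j += 1; // b is behind: advance it
        }
    }
    out
}

fn main() {
    let a = [2, 7, 11, 31];
    let b = [1, 2, 11, 40, 52];
    assert_eq!(intersect(&a, &b), vec![2, 11]);
}
```

For a multi-term query, step 2 amounts to sorting the lists by length and folding `intersect` over them shortest-first, so each intermediate result stays as small as possible.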

One change that cut query latency 7x

My first version stored docIDs in a HashMap internally, which meant extracting and sorting keys at query time. Sorting 10,000+ keys every time someone searches is wasteful — O(n log n) repeated on every query.

The fix was obvious once I thought about it: sort once when building the index, not every time someone queries. Pre-compute sorted docID lists for every term during construction. At query time, just grab the pre-sorted list.

10.089ms → 1.395ms for the same query. 7x improvement from one structural change. Build once, query many times.
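One way to structure that fix — derive the sorted docID list per term once, when the index is finalized. The struct and field names here are illustrative, not the repo's:

```rust
use std::collections::HashMap;

// Per-term entry: raw positional postings plus a precomputed sorted doc list.
struct TermEntry {
    postings: HashMap<u32, Vec<u32>>, // docID -> positions
    sorted_docs: Vec<u32>,            // computed once at build time
}

impl TermEntry {
    fn finalize(postings: HashMap<u32, Vec<u32>>) -> Self {
        let mut sorted_docs: Vec<u32> = postings.keys().copied().collect();
        sorted_docs.sort_unstable(); // O(n log n), paid once, not per query
        TermEntry { postings, sorted_docs }
    }
}

fn main() {
    let mut postings = HashMap::new();
    postings.insert(525, vec![12]);
    postings.insert(287, vec![45, 302]);
    let entry = TermEntry::finalize(postings);
    assert_eq!(entry.sorted_docs, vec![287, 525]); // query time just borrows this
    assert!(entry.postings.contains_key(&287));    // positions still available
}
```

At query time the engine hands `&entry.sorted_docs` straight to the intersection routine — no allocation, no sort.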

Query: "United States of America" → 37 documents in 1.395ms

What's missing

This is still rudimentary. "United States of America" currently matches any document containing all four words anywhere, even if "united" is in the first sentence and "America" is buried in the last paragraph. The positional data is sitting right there in the index — I'm just not checking adjacency during search yet. So there are false positives in those 37 results.

Tomorrow I'm adding phrase verification — only return documents where the query terms appear consecutively in the right order.
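The core of that check might look something like this — a sketch of my own, not the repo's code: given the sorted position lists of two consecutive query terms within one document, the phrase survives only if some position of the second term is exactly one past a position of the first.

```rust
// Do the two terms appear consecutively ("first second") anywhere in this doc?
// Both position lists are assumed sorted ascending.
fn adjacent(first: &[u32], second: &[u32]) -> bool {
    let (mut i, mut j) = (0, 0);
    while i < first.len() && j < second.len() {
        if second[j] == first[i] + 1 {
            return true; // consecutive occurrence found
        } else if second[j] <= first[i] {
            j += 1; // second-term position too early: advance it
        } else {
            i += 1; // first-term position too early: advance it
        }
    }
    false
}

fn main() {
    // "united" at 45 and 302, "states" at 46: consecutive at 45/46.
    assert!(adjacent(&[45, 302], &[46, 500]));
    assert!(!adjacent(&[45, 302], &[12, 44]));
}
```

It's the same two-pointer pattern as the docID merge, just over positions within a single document, chained pairwise across the query terms.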


Code: github.com/sreenish27/Search_engine_rs