Yesterday I built a basic search engine in Rust that could find documents containing specific words. Today I added two features that take it from a keyword matcher to something that actually starts to behave like a search engine.
Feature 1: Phrase Search
Yesterday, searching "United States of America" returned 37 documents — any document containing all four words anywhere in the text. A document mentioning "united" in paragraph one and "america" in the last sentence would match. That's a false positive.
Today I added positional verification. During indexing, I already store the exact position of every word in every document. Now at query time, after finding documents containing all the words, I check: does "united" appear at position p, "states" at p+1, "of" at p+2, "america" at p+3? If not, it's thrown out.
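The position check can be sketched as follows. This is a minimal illustration, not the engine's actual code: the shape of the positional index (term → sorted positions within one document) and the helper name `contains_phrase` are assumptions.

```rust
use std::collections::HashMap;

/// Does this document contain the query terms as a consecutive phrase?
/// `doc_positions` maps each term to its sorted positions in one document.
fn contains_phrase(doc_positions: &HashMap<&str, Vec<usize>>, phrase: &[&str]) -> bool {
    let first = match doc_positions.get(phrase[0]) {
        Some(positions) => positions,
        None => return false,
    };
    // For each position p of the first word, require word i at p + i.
    first.iter().any(|&p| {
        phrase.iter().enumerate().skip(1).all(|(i, term)| {
            doc_positions
                .get(term)
                .map_or(false, |pos| pos.binary_search(&(p + i)).is_ok())
        })
    })
}
```

Because each term's position list is sorted, the inner lookup is a binary search rather than a linear scan.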
37 results → 14 results. Those 23 eliminated documents were all false positives. The 14 remaining actually contain the exact phrase "United States of America" in sequence.
Feature 2: Spell Correction — the "Did you mean?" you see in Google search
Built this entirely from scratch using three layers:
First layer: a trigram index. Every term in my vocabulary is broken into overlapping 3-character sequences, with $ marking the word boundaries. "castle" becomes: $ca, cas, ast, stl, tle, le$. This index maps each trigram to the terms that contain it, built as a B-tree alongside my main inverted index.
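A minimal sketch of that layer, with a `BTreeMap` standing in for the B-tree; the function names here are illustrative, not the engine's real API:

```rust
use std::collections::{BTreeMap, HashSet};

/// Boundary-padded trigrams: "castle" -> {$ca, cas, ast, stl, tle, le$}.
fn trigrams(term: &str) -> HashSet<String> {
    let padded: Vec<char> = format!("${}$", term).chars().collect();
    padded.windows(3).map(|w| w.iter().collect()).collect()
}

/// Map each trigram to the vocabulary terms containing it.
fn build_trigram_index(vocab: &[&str]) -> BTreeMap<String, Vec<String>> {
    let mut index: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for term in vocab {
        for tri in trigrams(term) {
            index.entry(tri).or_insert_with(Vec::new).push(term.to_string());
        }
    }
    index
}
```

The boundary padding means the first and last characters of a term still participate in two trigrams each, which makes prefix and suffix typos easier to catch.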
Second layer: Jaccard similarity. When a user misspells a word — for now, any query term that isn't in my index_map (the vocabulary) is treated as a misspelling — I break the misspelling into trigrams, look up candidates from the trigram index, and score each candidate by how much trigram overlap it shares with the misspelled word. Jaccard = shared trigrams / total unique trigrams. This filters 120K vocabulary terms down to a handful of candidates.
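The scoring step is just set intersection over union on the two trigram sets. A small self-contained sketch (again with illustrative names, not the engine's actual code):

```rust
use std::collections::HashSet;

/// Boundary-padded trigrams of a term.
fn trigram_set(term: &str) -> HashSet<String> {
    let padded: Vec<char> = format!("${}$", term).chars().collect();
    padded.windows(3).map(|w| w.iter().collect()).collect()
}

/// Jaccard similarity of two terms' trigram sets: |A ∩ B| / |A ∪ B|.
fn jaccard(a: &str, b: &str) -> f64 {
    let (sa, sb) = (trigram_set(a), trigram_set(b));
    let shared = sa.intersection(&sb).count() as f64;
    let total = sa.union(&sb).count() as f64;
    if total == 0.0 { 0.0 } else { shared / total }
}
```

For example, "carot" and "carrot" share the trigrams $ca, car, rot, ot$ out of 7 unique trigrams, so they score well above an unrelated term like "castle" and survive into the edit-distance stage.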
Third layer: edit distance. For the surviving candidates, I compute Levenshtein distance — the minimum number of character insertions, deletions, or substitutions to transform one string into another. "carot" → "carrot" is edit distance 1 (insert one "r"). Candidates are ranked by edit distance and the closest matches become suggestions.
The full pipeline:
user types misspelled term → not found in index → trigram index finds candidates → Jaccard filters weak matches → edit distance ranks the rest → "Did you mean?" suggestion.
The numbers
"Uniited Staates of America" → corrected to "United States of America" → 14 documents in 19.4ms
"astrnomy spacee explration" → corrected to "astronomy space exploration" → 280ms
Both queries had every single word misspelled. Corrected all of them. No external spell-check library. Just trigrams, Jaccard coefficients, and dynamic programming.
What I learned about the math
Edit distance uses a dynamic programming matrix. For two strings of length m and n, you build an (m+1)×(n+1) grid where each cell represents the cost of transforming a prefix of one string into a prefix of the other. Each cell is the minimum of three options: substitute (diagonal), insert (from left), delete (from above). The bottom-right cell gives the final edit distance.
Cost: O(m × n) per pair. That's exactly why the trigram + Jaccard filtering matters — you can't afford to run this against 120K terms. Narrow it down first, then compute.
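The matrix fill described above can be sketched like this; it's a standard textbook Levenshtein implementation, not the engine's exact code:

```rust
/// Levenshtein distance via an (m+1) x (n+1) dynamic programming table.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let (m, n) = (a.len(), b.len());
    // dp[i][j] = cost of turning a[..i] into b[..j]
    let mut dp = vec![vec![0usize; n + 1]; m + 1];
    for i in 0..=m { dp[i][0] = i; } // delete every char of a's prefix
    for j in 0..=n { dp[0][j] = j; } // insert every char of b's prefix
    for i in 1..=m {
        for j in 1..=n {
            let substitute = dp[i - 1][j - 1] + if a[i - 1] == b[j - 1] { 0 } else { 1 };
            let insert = dp[i][j - 1] + 1;
            let delete = dp[i - 1][j] + 1;
            dp[i][j] = substitute.min(insert).min(delete);
        }
    }
    dp[m][n]
}
```

Running it on the earlier example, `levenshtein("carot", "carrot")` is 1, since a single inserted "r" is enough.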
What didn't work: Skip Pointers
I also implemented skip pointers, a technique where you place shortcut pointers every √P entries in a sorted postings list of length P, so you can jump over chunks during intersection instead of walking one element at a time. In theory, this reduces comparisons significantly.
In practice, on 11K documents, it made things worse. The overhead of checking whether to skip at every step costs more than the comparisons it saves when lists are short. The break-even point is tens of thousands of entries per list. I've commented it out for now and am using the simpler two-pointer merge. The code is still there — when I move to a larger corpus like Wikipedia, it'll become relevant.
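For reference, the two-pointer merge that beat skip pointers at this scale looks like this — a minimal sketch, assuming postings lists are sorted `u32` doc IDs:

```rust
/// Intersect two sorted postings lists by walking both with one pointer each.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        if a[i] == b[j] {
            out.push(a[i]); // doc appears in both lists
            i += 1;
            j += 1;
        } else if a[i] < b[j] {
            i += 1; // advance the pointer sitting on the smaller ID
        } else {
            j += 1;
        }
    }
    out
}
```

Every iteration is one comparison and one pointer bump, with no branch asking "should I skip?" — which is exactly the overhead that made skip pointers a net loss on short lists.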
What's next
The engine finds documents but doesn't rank them. A document mentioning "United States" 50 times is treated the same as one mentioning it once. Tomorrow I start working on scoring and ranking — term frequency, inverse document frequency, and the math that makes some results more relevant than others.