Up until today my engine returned 14 documents for "United States of America" and said "here you go." A flat list. No order. A 1,835-line firearms archive got treated the same as a 13-line post that was literally about the USA. Today I fixed that.
Matching isn't ranking
Boolean retrieval tells you which documents contain your words. That's step one. But when 14 documents match, you need to know which ones to look at first. Every document needs a number — a relevance score. Higher number, more relevant, shown first.
Two signals get you there.
Signal 1: Term Frequency
How many times does the query term show up in this document? A document mentioning "america" 10 times is probably more about America than one mentioning it once.
But raw count is a bad proxy. Going from 0 to 1 mention is a massive signal. Going from 99 to 100 adds basically nothing. Diminishing returns. A logarithm captures this:
tf_weight = 1 + log₁₀(tf)
tf = 1 → 1.0
tf = 10 → 2.0
tf = 100 → 3.0
tf = 1000 → 4.0
10x more mentions, 1 extra point. The first mention establishes relevance. Everything after matters less and less.
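The dampening is one line of code. A minimal sketch in Python (the post doesn't show its implementation language, so the function name is mine):

```python
import math

def tf_weight(tf: int) -> float:
    """Log-dampened term frequency: 1 + log10(tf); 0 if the term is absent."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

# Each 10x jump in raw count adds just one point:
for tf in (1, 10, 100, 1000):
    print(tf, tf_weight(tf))
```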
Signal 2: Inverse Document Frequency
Not all terms carry the same weight. My query has four words. Here's how common each one is:
| Term | Docs containing it | IDF |
|---|---|---|
| "of" | 9,882 | 0.059 |
| "states" | 390 | 1.463 |
| "united" | 236 | 1.681 |
| "america" | 235 | 1.683 |
"of" is in almost every document. IDF of 0.059 is basically worthless. Tells you nothing about what a document is about. "america" is in 235 out of 11,314 documents. IDF is 28x higher. When a document contains "america," that actually means something.
Formula: idf = log₁₀(N / df). Rare terms get high weight. Common terms get nearly zero.
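The formula in code, as a sketch (same caveat: Python and these names are my choice, not necessarily the engine's):

```python
import math

def idf(N: int, df: int) -> float:
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

# With N = 11,314 documents, this reproduces the table above:
# idf(11314, 9882) ≈ 0.059  ("of")
# idf(11314, 235)  ≈ 1.683  ("america")
```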
TF-IDF: multiply them
tf-idf = tf_weight × idf
A document mentioning "america" 10 times and "of" 50 times:
america: (1 + log₁₀(10)) × 1.683 = 2.0 × 1.683 = 3.37
of: (1 + log₁₀(50)) × 0.059 = 2.7 × 0.059 = 0.16
"america" contributes 21x more to the score despite "of" appearing 5x more. That's IDF doing its job.
Total document score = sum of tf-idf across all query terms. Do it for all 14 documents, sort, done.
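Putting both signals together, a scoring sketch (the dict shapes here are assumptions, not the engine's actual data structures):

```python
import math

def raw_score(query_terms, doc_tf, N, df):
    """Sum of tf-idf contributions over the query terms (unnormalized)."""
    total = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf > 0 and df.get(term, 0) > 0:
            total += (1 + math.log10(tf)) * math.log10(N / df[term])
    return total

# The worked example above: "america" x10, "of" x50
s = raw_score(["america", "of"], {"america": 10, "of": 50},
              N=11314, df={"america": 235, "of": 9882})
```

Running it gives roughly 3.37 + 0.16, matching the hand calculation.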
First try: it got the ranking completely wrong
Doc 9059: 11.90 ← 1st place
Doc 2153: 7.93
...
Doc 8802: 4.88 ← last place
Looks decisive. Doc 9059 scored 2.5x higher than the bottom. I went and opened both files.
Doc 8802 (ranked last) — 13 lines. A short post, entirely about the United States of America:
"If one reasons that the United States of America at one time represented and protected freedom..."
That's it. Short, focused, directly on topic.
Doc 9059 (ranked first) — 1,835 lines. A massive archive index of firearms legislation, Congressional bills, Supreme Court cases, NRA publications. Mentions "United States" dozens of times across hundreds of entries about completely different things.
Doc 9059 isn't about the United States of America. It just mentions it a lot because it's enormous. Raw TF-IDF rewarded length, not relevance.
The fix: cosine normalization
The problem isn't any single term's tf-idf — the log handles that. The problem is the sum. A 1,835-line document accumulates contributions from thousands of terms. Some of those terms overlap with the query just by chance. More words, more overlap, higher score. Length wins.
The fix: measure focus, not volume.
Every document is a vector of tf-idf weights — one per term in the vocabulary. The "length" of that vector is its Euclidean norm, the square root of the total signal energy:
||d|| = √(w₁² + w₂² + w₃² + ... + wₙ²)
Doc 9059 has high weights across hundreds of topics. Firearms, legislation, constitutional law, ballistics. Huge vector length. Doc 8802 has almost all its weight on one topic. Small vector length.
Divide raw score by vector length:
normalized_score = raw_score / ||d||
This asks: of all the signal in this document, what fraction points toward the query? Doc 8802 — nearly all of it. Doc 9059 — a sliver.
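In code, the normalization is a norm and a division. A sketch with a toy focused document and a toy scattered one (the weights and term names are illustrative, not from the corpus):

```python
import math

def vector_length(weights):
    """Euclidean norm over ALL of a document's tf-idf weights,
    not just the query terms."""
    return math.sqrt(sum(w * w for w in weights.values()))

def cosine_score(raw, length):
    """Fraction of the document's total signal pointing at the query."""
    return raw / length if length > 0 else 0.0

# Same raw query score, very different focus:
focused = {"america": 3.0}
scattered = {"america": 3.0, **{f"topic{i}": 3.0 for i in range(99)}}
```

The focused document normalizes to 1.0; the scattered one, with 99 other equally weighted topics, normalizes to 0.1.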
After normalization
Doc 8802: 0.336 ← 1st (was last)
Doc 8715: 0.237
Doc 287: 0.230
Doc 10785: 0.210
Doc 9047: 0.208
Doc 8819: 0.205
Doc 8690: 0.157
Doc 5553: 0.126
Doc 5466: 0.112
Doc 11288: 0.086
Doc 2153: 0.086
Doc 5194: 0.085
Doc 9059: 0.079 ← 13th (was 1st)
Doc 9391: 0.063
The 13-line post about the USA jumped from last to first. The 1,835-line firearms archive dropped from first to second-to-last.
I opened a few more to sanity check. The top results are short, focused posts where "United States of America" is central to the content. The bottom results are long documents that happen to mention it in passing. The ranking makes sense now.
Precomputing the expensive part
Vector length needs the tf-idf weight of every term in a document — not just the query terms. For a document with 3,000 unique terms, that's 3,000 values to square and sum. Can't do this at query time.
So I compute it during the merge step. The merge already touches every term and every document. For each pair, I compute tf-idf, square it, add to a running sum per document. After the merge finishes, square root everything. One float per document, stored in RAM. At query time: compute raw score, look up precomputed length, divide. One lookup.
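The precomputation pass looks roughly like this. The `postings` shape (term → {doc_id: tf}) is a guess at the index layout, not the engine's actual structure:

```python
import math
from collections import defaultdict

def precompute_lengths(postings, N):
    """One pass over every (term, doc) pair seen during the merge:
    accumulate squared tf-idf weights per document, then take one
    square root per document at the end."""
    sq_sums = defaultdict(float)
    for term, docs in postings.items():
        idf = math.log10(N / len(docs))          # df = number of docs with term
        for doc_id, tf in docs.items():
            w = (1 + math.log10(tf)) * idf
            sq_sums[doc_id] += w * w
    return {d: math.sqrt(s) for d, s in sq_sums.items()}
```

The result is the one-float-per-document table: at query time it's raw score, lookup, divide.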
What I took away
Ranking needs two signals working together. TF alone rewards any document that repeats a word. IDF alone rewards any document containing rare words. The product captures what actually matters — the documents that frequently mention rare, discriminating terms.
Metrics without inspection are dangerous. Doc 9059 scoring 2.5x higher looked like a clear winner. It took opening the actual file to see it was a false positive. I think about this a lot — in search, in product metrics, in general. A number that looks decisive can be completely misleading if you don't look at what's underneath.
Cosine normalization measures focus, not volume. It doesn't penalize long documents for being long. It penalizes documents for being scattered. A 10,000-word deep analysis of American foreign policy would still rank high where most of its signal points in the right direction. The firearms archive ranked low because its signal goes in a hundred different directions.
What's next
There's a scoring function called BM25 that handles all of this — tf saturation, document length, rarity — all in one formula with better math. That's coming.
There's also the question of going beyond exact phrase matching. Not every relevant document contains "united states of america" as an exact consecutive sequence.
But for now, the engine finds documents, ranks them by topical focus, and the ranking is correct. That's a real search engine.