Day 7: I replaced strict phrase matching with proximity scoring

End of Day 6 I left this in the post: "strict filters fail on real queries. Phrase filtering looks correct in unit tests. On real two-word queries, it kills everything." Five queries had returned zero results despite having 13–22 candidate docs after intersection. israeli palestinian was the canonical case — 22 candidates, 0 returned, because no doc writes "israeli palestinian" as a literal 2-word string.

Today I implemented Manning Chapter 7.2.2 — query-term proximity scoring — to fix it. Strict phrase filter went away. A continuous boost based on how tightly the query terms appear in each document took its place.

This post is about the formula, why it's hand-coded rather than derived, and one number I didn't expect.

The setup

Each query term has a posting list mapping doc_id → [positions]. For a 13-line post titled "The Cold War: Who REALLY Won?" the postings for the query united states of america look like this:

united:  doc 8802 → [1]
states:  doc 8802 → [2]
of:      doc 8802 → [3]
america: doc 8802 → [4]

Four terms, all in positions 1–4. The window containing all of them has size 4 — the phrase is exact.

For a 1,835-line firearms archive that mentions the same words throughout:

united:  doc 9059 → [22, 47, 199, ...]
states:  doc 9059 → [23, 156, 304, ...]
of:      doc 9059 → [4, 25, 80, ...]
america: doc 9059 → [105, 240, 891, ...]

The smallest window containing one position from each term is much wider — terms scattered across thousands of positions.

Proximity scoring formalizes this: define ω (omega) as the smallest window containing at least one position from each query term. Smaller ω means a tighter match.

The formula

There's no math derivation here. Manning is explicit:

"It is hand-coded; tune it on the queries you care about."

I picked the simplest shape that encodes both signals I care about:

boost = 1 + k / ω

Where k = number of query terms and ω = smallest window.

Three properties drove the choice:

1. The 1 + makes it a multiplier on the existing score, never a penalty. When ω is huge and the boost contributes nothing, the multiplier is 1 — proximity scoring adds signal but never demotes a doc below its tf-idf score.

2. k / ω couples the two signals. More query terms (k) means a tighter window is harder, so the formula rewards docs that satisfy both — high coverage and small window. At ω = k (terms exactly adjacent, e.g., "gun control" at positions 5 and 6) the multiplier is 2.0. As ω grows, the boost decays toward 1.0.

3. Multiplicative, not additive. A doc with strong tf-idf and tight proximity gets compounded; a doc with weak tf-idf doesn't get artificially lifted by proximity alone.

omega = k recovers the strict-phrase-match case as a special point on a continuous curve. The phrase filter from Day 2 isn't deleted — it's subsumed.

Computing ω

ω is "the smallest window containing at least one position from each list." Naively this is O(|A|×|B|×|C|...) — try every combination. There's a linear-time algorithm using k pointers and a greedy advance:

pub fn window_calc(position_lists: &[&[u32]]) -> u32 {
    let k = position_lists.len();
    let mut pointers = vec![0usize; k];
    let mut best = u32::MAX;

    loop {
        let mut min = u32::MAX;
        let mut max = 0u32;
        let mut min_idx = 0usize;

        for i in 0..k {
            let v = position_lists[i][pointers[i]];
            if v < min { min = v; min_idx = i; }
            if v > max { max = v; }
        }

        best = best.min(max - min + 1);

        pointers[min_idx] += 1;
        if pointers[min_idx] == position_lists[min_idx].len() {
            return best;
        }
    }
}

One pointer per list, all starting at index 0. At each step: read the value at every pointer, compute the current window (max − min + 1), update the running minimum, then advance the pointer at the smallest value. Stop when any list runs off the end.

Why advancing the smallest pointer is correct: positions are sorted ascending, pointers only move forward. Therefore at any step you can never lower the ceiling, only raise the floor. Advancing the pointer at the current minimum is the only move that can shrink the window. Any other advance leaves the window the same or makes it worse.

Cost: O(total positions across all lists). On a 5-term query with average 50 positions per term per doc, that's ~250 operations. Fast.

What changed in `rank_results`

The strict phrase filter was a binary gate — drop the doc or keep it. Replacing it with proximity scoring meant making the filter step disappear and adding a multiplier inside the per-doc scoring loop:

for &doc_id in &results {
    let mut doc_score: f32 = 0.0;
    for term in terms {
        // tf-idf for each term, summed
        let t_f = posting[&doc_id].len() as f32;
        let d_f = term_index[term].doc_freq as f32;
        doc_score += tf_idf(t_f, total_docs, d_f);
    }

    doc_score /= doc_vec_len[&doc_id];   // cosine norm (Day 5)

    if k >= 2 {
        let omega = omega_calc(terms, doc_id, &all_postings);
        doc_score *= 1.0 + (k as f32 / omega as f32);   // proximity boost
    }

    ranked_docs.push((doc_id, doc_score));
}

Three lines. tf-idf, cosine norm, proximity. Each contributes one effect. Each can be turned off independently for ablation.

phrase_check.rs from Day 2 is retired — the strict phrase match it implemented is now omega = k, a single point on the proximity curve.

What happened on `israeli palestinian`

>>> TIER 0 <<<
Documents after intersection: 1
Omega: min=3, max=3, avg=3.00, boost: 1.667

>>> TIER 1 <<<
Documents after intersection: 4
Omega: min=4, max=17, avg=10.25, boost: 1.264

>>> TIER 2 <<<
Documents after intersection: 22
Omega: min=4, max=755, avg=152.86, boost: 1.072

FINAL: 10 results, top hit doc 359 (talk.politics.mideast)

0 → 10. The 22 candidate docs that the strict phrase filter killed in Day 6 are now all scored, ranked, and the best ones surface.

The boost did exactly what the formula predicted. Tier 0 has the doc where "israeli" and "palestinian" appear within 3 positions of each other — boost 1.667. Tier 2 has docs where the terms are 150 positions apart on average — boost barely 1.07. The boost magnitude tracks proximity quality, exactly as designed.

Same behavior on every multi-term query I tested. Tight matches → boost ~2.0. Scattered → boost ~1.1. Single-term queries skip the calculation entirely.

The number I didn't expect

Day 6's signature finding was that doc 8802 (13 lines, "Cold War: Who REALLY Won?") beat doc 9059 (1,835 lines, firearms archive) on united states of america by 2.3x — short focused doc beating long doc-soup, via cosine normalization.

I re-ran the same query today.

# 1  doc 8802  score=0.6723
# 2  doc 9059  score=0.1581

ratio: 4.25x

The ratio nearly doubled. Not because the engine got "better" in some abstract sense — because doc 8802 has the literal phrase tightly (ω = 4, boost = 2.0) and doc 9059 has the words scattered (ω = 40, boost = 1.55). The proximity boost amplified the cosine-norm finding.

Three textbook chapters now cooperating: Chapter 5 (compression), Chapter 6 (cosine norm), Chapter 7 (tiered fallback + proximity). Each one was designed in isolation. None of them were designed for this query. The compounding wasn't an intended feature — it's just what falls out when you stack independently-correct optimizations on the same data.

The textbook doesn't telegraph this. You see it in your own engine, on your own corpus.

What I learned

Continuous beats binary, when the underlying signal is continuous. Phrase filter forced a yes/no decision on something — "are these terms close?" — that has a natural numeric answer (how close). The binary version threw away information. Almost every "filter" in retrieval has the same shape; reach for the continuous version first.

Hand-coded formulas are normal. The boost shape isn't derived from probability theory or learned from data. Manning is explicit about this. The right loop is pick simple, ship, measure, tune — not find the optimal formula. Most of IR works this way until you get to learning-to-rank, where the formula becomes the parameters.

Strict-mode features hide other strict-mode features. The phrase filter was masking how strict the rest of the pipeline could be — once it stopped killing candidates, I had to think about what else was making valid docs invisible. The fix isn't always to add a feature. Sometimes it's to remove one and let downstream signals do the work.

What's next

Cranfield benchmark. 13 hand-picked queries are not an evaluation. The Cranfield collection (1,398 abstracts, 225 queries, expert relevance judgments since the 1960s) is the standard small-scale IR benchmark. I'll re-index Cranfield, run all 225 queries through the engine, compute MAP and P@10 against the qrels file. Then re-run after BM25 lands. Two data points on the same benchmark beats one data point each on two different setups.

Then Chapter 11: BM25. Replaces tf-idf + cosine norm with one formula. Smaller code, better math, the standard retrieval scoring function for the last 30 years. The engine has accumulated four scoring components — tf, idf, cosine norm, proximity boost. BM25 collapses three of them into one.

After that — impact-ordered postings, WAND, block-max WAND. Lucene's path, walked by hand.

Code: github.com/sreenish27/Search_engine_rs

Day 7: I replaced strict phrase matching with proximity scoring — and a 13-line doc beat a 1,835-line doc by 4.25x