Today I implemented Chapter 7.2.1 of the IR textbook — tiered indexes. The idea is simple: split each posting list into multiple tiers by term frequency, query the top tier first, fall back to lower tiers only if you don't have enough results. Pay disk I/O for high-quality candidates, skip lower-quality ones unless you need them.
I followed the textbook's recommended thresholds. Tier 1 ended up holding 0.49% of all postings. Most queries returned zero results.
This is a story about why textbook hyperparameters are corpus-specific, what the data actually said, and what broke when I fixed it.
The mechanic
A posting list for "america" might look like this:
{doc_42: tf=8, doc_91: tf=3, doc_177: tf=1, doc_204: tf=22, ...}
Tiered indexing partitions this by tf threshold:
Tier 1 (tf > T1): high-tf docs — "america" is central
Tier 2 (T2 < tf ≤ T1): medium-tf docs — mentioned multiple times
Tier 3 (tf ≤ T2): low-tf docs — incidental mentions
At query time: run the full intersection + ranking pipeline against tier 1 only. If you have ≥ K results, stop. If not, fall back to tier 2 and append. Still short? Tier 3.
The bet: most queries are satisfied by tier 1 alone. You skip 90%+ of disk reads on common queries.
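The partition rule itself is a few lines. A minimal sketch (standalone illustration, not the engine's actual code; the thresholds here are just for the example — index 0 holds tier 1, index 1 tier 2, index 2 tier 3):

```rust
/// Assign each (doc_id, tf) posting to one of three tiers by raw tf.
/// Index 0 = tier 1 (tf > t1), 1 = tier 2 (t2 < tf <= t1), 2 = tier 3 (tf <= t2).
fn partition_by_tf(postings: &[(u32, u32)], t1: u32, t2: u32) -> [Vec<(u32, u32)>; 3] {
    let mut tiers = [Vec::new(), Vec::new(), Vec::new()];
    for &(doc, tf) in postings {
        let idx = if tf > t1 { 0 } else if tf > t2 { 1 } else { 2 };
        tiers[idx].push((doc, tf));
    }
    tiers
}

fn main() {
    // The "america" posting list from above, with T1 = 5, T2 = 2.
    let tiers = partition_by_tf(&[(42, 8), (91, 3), (177, 1), (204, 22)], 5, 2);
    assert_eq!(tiers[0], vec![(42, 8), (204, 22)]); // central
    assert_eq!(tiers[1], vec![(91, 3)]);            // multiple mentions
    assert_eq!(tiers[2], vec![(177, 1)]);           // incidental
}
```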
The refactor
Before today, my term index was a flat dictionary:
HashMap<String, (u64, u64, u32)>
// offset, length, doc_freq
One (offset, length) pair per term — points to its full posting list on disk.
After:
const NO_OF_TIERS: usize = 3;

pub struct TermEntry {
    pub tiers: [(u64, u64); NO_OF_TIERS], // one (offset, length) per tier
    pub doc_freq: u32,
}
HashMap<String, TermEntry>
Three (offset, length) pairs per term. Each tier's postings written contiguously to disk. The dictionary now tells me where each tier starts and how many bytes to read. Same disk file, different metadata.
The merge step (merge_index_map in block_merge.rs) does the partitioning in one pass. As it consumes each merged term's postings, it splits docs into three HashMaps based on tf threshold, serializes each tier with VByte encoding (Day 4's encoder), writes them to disk, and records (offset, length) per tier into a tier_buffer. Then constructs the TermEntry. Zero clones — the position vectors get moved directly from the merged dict into a tier dict based on positions.len().
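The (offset, length) bookkeeping is the crux of that step. A simplified sketch of what the tier_buffer recording looks like — using an in-memory `Vec<u8>` in place of the real on-disk file, with already-encoded tier bytes as input (names and shapes here are illustrative, not copied from block_merge.rs):

```rust
/// Write each tier's encoded postings contiguously and record, per tier,
/// where it starts and how many bytes it spans.
fn write_tiers(file: &mut Vec<u8>, encoded: [Vec<u8>; 3]) -> [(u64, u64); 3] {
    let mut tier_buffer = [(0u64, 0u64); 3];
    for (i, bytes) in encoded.into_iter().enumerate() {
        tier_buffer[i] = (file.len() as u64, bytes.len() as u64); // (offset, length)
        file.extend_from_slice(&bytes);
    }
    tier_buffer
}

fn main() {
    let mut file = Vec::new();
    let entry = write_tiers(&mut file, [vec![1, 2, 3], vec![4], vec![5, 6]]);
    // Tiers sit back-to-back; the dictionary entry says where each begins.
    assert_eq!(entry, [(0, 3), (3, 1), (4, 2)]);
    assert_eq!(file, vec![1, 2, 3, 4, 5, 6]);
}
```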
The query side passes tier_idx: usize through the pipeline — docid_list, phrase_filter, rank_results all take a tier index and read only that tier's byte range from final_index.bin. The fallback loop lives in main.rs and looks like this:
for tier_idx in 0..NO_OF_TIERS {
    let postings = docid_list(&query_list, &term_index, tier_idx);
    let candidates = intersect_all(postings);
    let filtered = phrase_filter(candidates, &query_list, &term_index, tier_idx);
    let new_results: Vec<u32> = filtered.into_iter()
        .filter(|d| seen.insert(*d)) // dedup across tiers
        .collect();
    let ranked = rank_results(new_results, /* ... */, tier_idx);
    all_ranked.extend(ranked);
    if all_ranked.len() >= K { break; }
}
HashSet::insert returns true if newly inserted, false if already present — that's how the dedup catches docs that already appeared in higher tiers.
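The dedup behavior in isolation (a minimal illustration, with the filter step pulled out into a hypothetical helper):

```rust
use std::collections::HashSet;

/// Keep only docs not already returned by a higher tier.
fn dedup_across_tiers(seen: &mut HashSet<u32>, filtered: Vec<u32>) -> Vec<u32> {
    // insert() returns true for a newly seen doc, false for a repeat.
    filtered.into_iter().filter(|d| seen.insert(*d)).collect()
}

fn main() {
    let mut seen = HashSet::new();
    // Tier 0 surfaces doc 8802: new, so it passes through.
    assert_eq!(dedup_across_tiers(&mut seen, vec![8802]), vec![8802]);
    // Tier 2 surfaces 8802 again plus 9059: the repeat is dropped.
    assert_eq!(dedup_across_tiers(&mut seen, vec![8802, 9059]), vec![9059]);
}
```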
First run: tier 1 was empty
I started with T1 = 20, T2 = 15 — a doc needs tf > 20 to land in tier 1. These came straight from the textbook's recommendation, and they felt reasonable at the time.
Query: "united states of america". Output:
'america': 0 docs (read 1 bytes from offset 877001)
'united': 1 docs (read 54 bytes from offset 7790647)
'states': 2 docs (read 95 bytes from offset 6787717)
'of': 498 docs
Documents after intersection: 0
Zero results. No fallback yet — engine was tier-1-only at this point.
"america" had nothing in tier 1. "united" had one doc. "states" had two. Across all 11,314 documents in the 20 Newsgroups corpus, almost no document mentions any of these words 21+ times. That's an absurd bar for a corpus where the median post is a few hundred words.
The histogram
Time to look at the actual data instead of guessing. I dumped every (term, doc, tf) triple during merge — 1.7 million rows — and ran a histogram.
| tf | count of (term, doc) pairs | % at this tf | cumulative % |
|---|---|---|---|
| 1 | 1,265,121 | 73.1% | 73.1% |
| 2 | 243,940 | 14.1% | 87.2% |
| 3 | 86,600 | 5.0% | 92.2% |
| 4 | 41,496 | 2.4% | 94.6% |
| 5 | 23,999 | 1.4% | 96.0% |
| 10 | 4,334 | 0.25% | 98.5% |
| 20 | 761 | 0.04% | 99.5% |
| 50 | 73 | 0.004% | 99.9% |
73% of all (term, doc) pairs have tf = 1. Another 14% have tf = 2. 87% of the data is "this term appears once or twice in this doc." The tail goes out to tf = 1,269 but it's vanishingly thin past tf = 10.
A threshold of tf > 20 puts 0.49% of postings in tier 1. Of course it returned nothing.
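The histogram is a one-pass fold over the dumped triples. A sketch (the dump format and function name are assumptions, not the engine's actual code):

```rust
use std::collections::BTreeMap;

/// tf histogram over dumped (term_id, doc_id, tf) triples:
/// how many (term, doc) pairs occur at each term frequency.
fn tf_histogram(triples: &[(u32, u32, u32)]) -> BTreeMap<u32, u64> {
    let mut hist = BTreeMap::new();
    for &(_term, _doc, tf) in triples {
        *hist.entry(tf).or_insert(0u64) += 1;
    }
    hist
}

fn main() {
    let triples = [(0, 42, 8), (0, 91, 3), (0, 177, 1), (1, 42, 1)];
    let hist = tf_histogram(&triples);
    assert_eq!(hist[&1], 2); // two (term, doc) pairs at tf = 1
    // Cumulative share at or below each tf, as in the table above.
    let total: u64 = hist.values().sum();
    let mut cumulative = 0u64;
    for (tf, n) in &hist {
        cumulative += n;
        println!("tf = {tf}: {:.1}% cumulative", 100.0 * cumulative as f64 / total as f64);
    }
}
```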
Picking thresholds from the data
| (T1, T2) | tier 1 % | tier 2 % | tier 3 % |
|---|---|---|---|
| (20, 15) | 0.49% | 0.30% | 99.21% |
| (10, 5) | 1.47% | 2.52% | 96.00% |
| (8, 3) | 2.05% | 5.73% | 92.22% |
| (5, 2) | 4.00% | 8.79% | 87.21% |
| (3, 1) | 7.78% | 19.10% | 73.11% |
The textbook wants tier 1 to be small but selective — the docs where a term is genuinely central. (5, 2) puts 4% of postings in tier 1 (a doc has to mention a term 6+ times) and 9% in tier 2 (3-5 mentions). For a corpus where median tf is 1, "6+ mentions" is the natural threshold where a doc transitions from "incidental mention" to "topically about this term."
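The whole sweep in the table reduces to one function over the histogram. A sketch — the toy histogram here has 100 pairs shaped roughly like the real distribution, so the (5, 2) row comes out close to the real 4% / 9% / 87% split:

```rust
/// Percentage of postings per tier for thresholds (t1, t2), given a tf
/// histogram as (tf, pair_count) entries.
fn tier_shares(hist: &[(u32, u64)], t1: u32, t2: u32) -> [f64; 3] {
    let total: u64 = hist.iter().map(|&(_, n)| n).sum();
    let mut counts = [0u64; 3];
    for &(tf, n) in hist {
        let idx = if tf > t1 { 0 } else if tf > t2 { 1 } else { 2 };
        counts[idx] += n;
    }
    counts.map(|c| 100.0 * c as f64 / total as f64)
}

fn main() {
    // Toy histogram with 100 (term, doc) pairs.
    let hist = [(1, 73), (2, 14), (3, 5), (5, 4), (8, 3), (20, 1)];
    assert_eq!(tier_shares(&hist, 5, 2), [4.0, 9.0, 87.0]);   // the chosen split
    assert_eq!(tier_shares(&hist, 20, 15), [0.0, 1.0, 99.0]); // tf > 20 catches almost nothing
}
```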
I changed the constants and re-ran the same query.
Same query, with the fallback wired up
>>> TIER 0 <<<
'america': 5 docs 'united': 8 docs 'states': 19 docs 'of': 3231 docs
Documents after intersection: 2
Documents after phrase filter: 1
Cumulative results so far: 1 / target 10
>>> TIER 1 <<<
'america': 12 docs 'united': 23 docs 'states': 36 docs 'of': 3189 docs
Documents after intersection: 0
No new docs in tier 1, moving on
>>> TIER 2 <<<
'america': 218 docs 'united': 205 docs 'states': 335 docs 'of': 3462 docs
Documents after intersection: 1
Documents after phrase filter: 1
Ranked 1 new docs from tier 2
FINAL RANKED RESULTS (top 2)
# 1 doc 8802 score=0.3785 → talk.politics.guns/54415
# 2 doc 9059 score=0.1617 → talk.politics.guns/54684
Two things to flag here.
The log's TIER 1 — the middle tier; the loop indexes tiers from 0 — contributed nothing for this query: its docs for these terms never intersected on doc IDs. Without fallback past the top tier I'd have returned a single result. Falling through to TIER 2 found another doc.
The tier-2 doc outscored the tier-0 doc by 2.3x. Doc 8802 — the tier-2 hit — is a 13-line post titled "The Cold War: Who REALLY Won?" with the literal phrase "the United States of America" in line 1. Doc 9059 — the tier-0 hit — is a 1,835-line firearms legislation archive that mentions the country dozens of times in passing.
This breaks the textbook's implicit assumption.
The assumption that makes tiered indexes work
Tiered indexes lean on monotonicity of tf with relevance — a doc with high tf for a query term is more relevant than one with low tf. Therefore tier-1 docs (high tf) should outscore tier-2 docs (lower tf). Therefore "stop after tier 1 if you have K results" is safe.
This assumption ignores document length.
Doc 9059 has high tf because it's huge — 1,835 lines means everything appears more times. Doc 8802 has lower tf but it's 13 lines, and the phrase is in line 1. Tier assignment is by raw tf. Ranking is by cosine-normalized tf-idf (Day 5), which divides by document vector length.
Both can be true at the same time. Doc 8802 has lower tf (tier 2) and higher relevance (cosine norm rewards focus over volume). The tier system and the ranker were optimizing for different things — tf vs. tf/length.
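A toy version of the conflict, with hypothetical numbers shaped like doc 9059 vs doc 8802 — tier assignment looks at raw tf, the ranker looks at tf divided by document vector length:

```rust
/// One-term cosine score: tf / ||doc vector||. (idf is the same constant
/// for both docs on a single term, so it cancels in the comparison.)
fn cosine_score(tf: f64, doc_vector_norm: f64) -> f64 {
    tf / doc_vector_norm
}

fn main() {
    // Long doc: tf = 30 puts it in the top tier, but its vector norm is huge.
    let long_doc = cosine_score(30.0, 400.0);
    // Short, focused doc: tf = 3 puts it in a lower tier, tiny norm.
    let short_doc = cosine_score(3.0, 10.0);
    assert!(short_doc > long_doc); // lower tier, higher relevance score
}
```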
In practice this means: tier-1 fallback to tier-2 isn't optional in this engine. It's the only path to good results when the right doc is short. Two textbook chapters cooperated here — Chapter 6's cosine norm scored a short doc highly, and Chapter 7's tiered fallback gave that short doc a chance to be seen.
What broke
Five queries on common topics returned zero results despite having dozens of intersection candidates:
| Query | Tier 2 intersection | After phrase filter |
|---|---|---|
| israeli palestinian | 22 | 0 |
| network protocol | 2 | 0 |
| medical research | 16 | 0 |
| disk encryption | 15 | 0 |
| christian theology | 13 | 0 |
Pattern: intersection finds docs containing both terms. Phrase filter requires those terms to appear in word order at consecutive positions. Almost no doc writes "israeli palestinian" as a literal 2-word string — they write "Israeli-Palestinian conflict," "Palestinian and Israeli," etc. Phrase filter is killing valid AND-matches.
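The strict check, reduced to its core (a sketch; the real phrase_filter works on positions decoded from the index, and generalizes this to n terms):

```rust
/// Strict two-term phrase test: term B must appear at position p + 1 for
/// some occurrence of term A at position p. Position lists are sorted.
fn adjacent(pos_a: &[u32], pos_b: &[u32]) -> bool {
    pos_a.iter().any(|&p| pos_b.binary_search(&(p + 1)).is_ok())
}

fn main() {
    // "new york" written literally: new@4, york@5 -> passes.
    assert!(adjacent(&[4], &[5]));
    // "Palestinian and Israeli": israeli@12, palestinian@10 -> killed,
    // even though it's a perfectly good AND-match.
    assert!(!adjacent(&[12], &[10]));
}
```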
This isn't a tiered index problem. It's a retrieval-design problem the strict phrase filter has been masking the whole time. Tiered indexes made it visible because fallback now produces large candidate pools (tier-2 intersections of 15-40 docs) that all get killed at the phrase-filter stage.
The fix is structural: phrase matching should be a score boost, not a filter. Docs containing the literal phrase should rank higher; docs containing the words individually should still rank. Right now I'm doing strict AND + strict phrase, and there's no third option. Next.
What I learned
Textbook hyperparameters are corpus-specific. The Manning textbook gives you the mechanic; it doesn't give you the numbers. Threshold tuning isn't a hyperparameter search — it's a function of your corpus's tf distribution. 0.49% in tier 1 vs 4.00% in tier 1 is the difference between a broken engine and a working one. Run the histogram first.
Two optimizations can have conflicting assumptions. Tiered indexes assume tf monotonically tracks relevance. Cosine norm explicitly breaks that — a short doc with low tf can outscore a long doc with high tf. The fallback loop is what reconciles them in practice. If I'd implemented tiered indexes without cosine norm, the assumption would be intact but the rankings would be worse. Sometimes a "broken" assumption is what makes the system honest.
Tier 1 alone is rarely sufficient. On 13 test queries, only 4 satisfied K=10 from tier 1 alone. The rest required fallback to tier 2 or tier 3. The engineering implication: tier 1 is a fast path for the easy cases; fallback is the load-bearing part. The textbook framing ("stop after tier 1 if you have K") undersells how often you don't.
Strict filters fail on real queries. Phrase filtering looks correct in unit tests with phrases that exist literally. On real two-word queries, it kills everything. The bug was always there — I just couldn't see it until tiered fallback produced large enough candidate pools to make the failure undeniable.
What's next
Soften the phrase filter into a score boost. Convert the strict adjacency check into a multiplier on the cosine-normalized score: docs containing the literal phrase get a bonus, docs without it still rank. This recovers the 22 docs "israeli palestinian" should have returned without losing the precision phrase matching gives on queries like "new york times".
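The intended shape of that change, as a sketch of the planned design (not implemented yet; `PHRASE_BOOST` is a placeholder that will need tuning):

```rust
const PHRASE_BOOST: f64 = 2.0; // placeholder constant; needs tuning

/// Planned scoring: a literal phrase hit multiplies the cosine-normalized
/// score instead of gating the candidate set.
fn boosted_score(cosine_score: f64, has_literal_phrase: bool) -> f64 {
    if has_literal_phrase { cosine_score * PHRASE_BOOST } else { cosine_score }
}

fn main() {
    // AND-only matches still rank...
    assert_eq!(boosted_score(0.2, false), 0.2);
    // ...and a literal-phrase hit can overtake a stronger AND-only match.
    assert!(boosted_score(0.2, true) > boosted_score(0.3, false));
}
```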
Then Chapter 11: BM25. Replaces the current tf-idf + cosine norm with a single formula that handles tf saturation, document length normalization, and term rarity in one shot. Better math, fewer moving parts.
After that — impact-ordered postings (Lucene's path), then dense retrieval and embeddings. The textbook is a runway, not the destination.
For now: tiered indexes are end-to-end working with data-driven thresholds, fallback wired up, and a clearly-named flaw in phrase filtering that's the next thing to fix. That's a real search engine.