Fixing the merge: from all-blocks-in-RAM to streaming k-way merge


My index construction works in blocks. Process 4,000 documents, flush the in-memory index to disk, clear memory, keep going. For 11,314 documents that's 3 blocks. The whole idea is that the full index doesn't fit in RAM.

My merge step was loading all blocks back into memory at once. Every block deserialized into a HashMap<String, HashMap<u32, Vec<u32>>>, all held simultaneously. For 3 blocks on 12K docs, this worked. For 100 blocks on a million docs, it would crash.

The fix

I wrote a BlockReader — a struct that holds the block file's bytes, a byte offset, and a count of remaining entries. One method: next_entry(). Each call reads one term and its postings from the byte stream, advances the offset, returns. The block is never fully deserialized.
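A minimal sketch of that reader, assuming a simple length-prefixed layout (entry count, then per entry: term length, term bytes, postings count, then (doc_id, tf) pairs, all little-endian u32s). The real on-disk format in the repo may differ, and the encode_block helper is hypothetical, standing in for what the flush step writes:

```rust
// Hypothetical encoder mirroring what a block flush might write.
fn encode_block(entries: &[(&str, Vec<(u32, u32)>)]) -> Vec<u8> {
    let mut out = (entries.len() as u32).to_le_bytes().to_vec();
    for (term, postings) in entries {
        out.extend((term.len() as u32).to_le_bytes());
        out.extend(term.as_bytes());
        out.extend((postings.len() as u32).to_le_bytes());
        for (doc, tf) in postings {
            out.extend(doc.to_le_bytes());
            out.extend(tf.to_le_bytes());
        }
    }
    out
}

// Streaming reader: block bytes, a byte offset, a count of remaining entries.
struct BlockReader {
    bytes: Vec<u8>,
    offset: usize,
    remaining: u32,
}

impl BlockReader {
    fn new(bytes: Vec<u8>) -> Self {
        // First u32 in the block: number of (term, postings) entries.
        let remaining = u32::from_le_bytes(bytes[0..4].try_into().unwrap());
        BlockReader { bytes, offset: 4, remaining }
    }

    fn read_u32(&mut self) -> u32 {
        let v = u32::from_le_bytes(
            self.bytes[self.offset..self.offset + 4].try_into().unwrap(),
        );
        self.offset += 4;
        v
    }

    /// Read one (term, postings) entry and advance; None when exhausted.
    /// Nothing beyond the current entry is ever deserialized.
    fn next_entry(&mut self) -> Option<(String, Vec<(u32, u32)>)> {
        if self.remaining == 0 {
            return None;
        }
        self.remaining -= 1;
        let term_len = self.read_u32() as usize;
        let term =
            String::from_utf8(self.bytes[self.offset..self.offset + term_len].to_vec()).unwrap();
        self.offset += term_len;
        let n_postings = self.read_u32();
        let postings: Vec<(u32, u32)> = (0..n_postings)
            .map(|_| {
                let doc = self.read_u32();
                let tf = self.read_u32();
                (doc, tf)
            })
            .collect();
        Some((term, postings))
    }
}
```

The sketch holds the whole block in a Vec for brevity; in practice the same offset-walking works over a memory-mapped or buffered file, which is what keeps RAM flat.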

The merge opens one BlockReader per block file. Each reader starts at its first term. The terms are alphabetically sorted within each block (I did this during block flush and it pays off now).
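The flush-time sort is just: drain the in-memory map, order by term, write in that order. A sketch, with sorted_entries as a hypothetical helper and flat posting lists standing in for the real per-document position lists:

```rust
use std::collections::HashMap;

// Sketch of the flush-time sort that makes the streaming merge possible:
// collect the in-memory index's entries and sort them alphabetically by
// term before serializing, so every block file is term-ordered on disk.
fn sorted_entries(index: HashMap<String, Vec<u32>>) -> Vec<(String, Vec<u32>)> {
    let mut entries: Vec<_> = index.into_iter().collect();
    entries.sort_by(|a, b| a.0.cmp(&b.0));
    entries
}
```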

The merge loop:

  1. Look at the current term from each reader
  2. Find the smallest
  3. Pull postings for that term from every reader that has it
  4. Merge the postings, compute doc vector lengths, write to disk
  5. Advance those readers to their next entry
  6. Repeat until all readers return None
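The loop above can be sketched like this, with each reader modeled as a peekable iterator of (term, postings) entries. In the real code these are streaming readers over block files, and step 4's doc vector length computation is elided here:

```rust
// K-way merge over sorted per-block entry streams. Only the current
// entry of each stream is held in memory at any point.
fn merge_blocks(blocks: Vec<Vec<(String, Vec<u32>)>>) -> Vec<(String, Vec<u32>)> {
    let mut readers: Vec<_> = blocks
        .into_iter()
        .map(|b| b.into_iter().peekable())
        .collect();
    let mut out = Vec::new();
    loop {
        // Steps 1-2: find the alphabetically smallest current term.
        let smallest = readers
            .iter_mut()
            .filter_map(|r| r.peek().map(|(t, _)| t.clone()))
            .min();
        let Some(term) = smallest else { break }; // step 6: all readers done
        // Steps 3-5: pull postings from every reader at that term,
        // merge them, and advance only those readers.
        let mut postings = Vec::new();
        for r in readers.iter_mut() {
            if r.peek().map_or(false, |(t, _)| *t == term) {
                postings.extend(r.next().unwrap().1);
            }
        }
        postings.sort();
        // In the real merge, this is where the merged postings are
        // written to disk and doc vector lengths get updated.
        out.push((term, postings));
    }
    out
}
```

Because every block is term-sorted, the smallest unprocessed term across all readers is always at the front of some stream, so one linear pass over the streams produces the full merged index in order.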

At any moment, RAM holds at most one term's postings per block: 3 entries in flight instead of 138,743 terms × 3 blocks.

The numbers

Merge time: 12.6s → 10.6s. 16% faster. Not because of algorithmic complexity — both are O(total postings). The speedup comes from not building a massive Vec<HashMap> and repeatedly scanning it.

Search results: identical. Same 14 documents, same tf-idf scores, same cosine-normalized ranking. Doc 8802 still ranks #1.

The real win isn't speed. It's that this merge scales. 3 blocks, 100 blocks, 1,000 blocks — the RAM usage stays constant.


Code: github.com/sreenish27/Search_engine_rs