Day 3: My search engine now reads from disk like a real search engine

Today I tackled Chapter 4 of the IR textbook — index construction at scale. The question: what happens when your data is too big to fit in memory?

The problem

My previous implementation stored the entire inverted index in RAM. 11K documents, ~138K unique terms, all their positions — one big HashMap in memory. Works fine now. But try indexing Wikipedia or a web crawl and your program crashes when RAM fills up.

The solution: Block-based index construction

Instead of building one giant in-memory index, I process documents in blocks of 4,000. After every 4,000 documents, the current index gets serialized to binary on disk and memory is cleared. For my 11K documents, this produces 3 blocks.
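A minimal sketch of the blocked build in Rust. Names like BLOCK_SIZE, index_in_blocks, and flush_block are illustrative, and the line-per-term flush below stands in for the real binary serialization:

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufWriter, Write};

const BLOCK_SIZE: usize = 4_000;

// In-memory index for one block: term -> list of (doc_id, positions).
type BlockIndex = HashMap<String, Vec<(u32, Vec<u32>)>>;

/// Index documents block by block, flushing to disk every BLOCK_SIZE docs.
/// Returns the number of blocks written.
fn index_in_blocks(docs: &[(u32, String)]) -> std::io::Result<usize> {
    let mut block = BlockIndex::new();
    let mut blocks_written = 0;
    for (i, (doc_id, text)) in docs.iter().enumerate() {
        for (pos, term) in text.split_whitespace().enumerate() {
            let postings = block.entry(term.to_lowercase()).or_default();
            match postings.last_mut() {
                // Same doc as the last entry: just append the position.
                Some((id, positions)) if id == doc_id => positions.push(pos as u32),
                _ => postings.push((*doc_id, vec![pos as u32])),
            }
        }
        if (i + 1) % BLOCK_SIZE == 0 {
            flush_block(&block, blocks_written)?;
            block.clear(); // free memory before starting the next block
            blocks_written += 1;
        }
    }
    if !block.is_empty() {
        flush_block(&block, blocks_written)?;
        blocks_written += 1;
    }
    Ok(blocks_written)
}

/// Stand-in serialization: one line per term, sorted so a later merge
/// can walk blocks in lockstep. The real engine writes binary.
fn flush_block(block: &BlockIndex, n: usize) -> std::io::Result<()> {
    let path = std::env::temp_dir().join(format!("block_{n}.bin"));
    let mut w = BufWriter::new(File::create(path)?);
    let mut terms: Vec<_> = block.keys().collect();
    terms.sort();
    for t in terms {
        writeln!(w, "{t}\t{:?}", block[t])?;
    }
    Ok(())
}
```

With 11K documents and a block size of 4,000, this writes exactly the 3 blocks above: two full ones plus a final partial flush.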

After all documents are processed, a merge step combines the blocks into one final index file. Walking the unique terms in alphabetical order, it collects each term's postings from every block that contains it, concatenates them, and writes them contiguously to a single binary file. No deduplication is needed because each block covers a disjoint set of documents.
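The merge can be sketched with in-memory maps standing in for the block files (the real version streams sorted terms from disk rather than holding whole blocks in RAM):

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Simplified stand-in: each block maps term -> postings (doc ids).
/// Blocks cover disjoint document ranges, so concatenation needs no dedup.
fn merge_blocks(blocks: &[BTreeMap<String, Vec<u32>>]) -> BTreeMap<String, Vec<u32>> {
    // Every unique term across all blocks, in alphabetical order.
    let terms: BTreeSet<String> = blocks.iter().flat_map(|b| b.keys().cloned()).collect();
    let mut merged = BTreeMap::new();
    for term in terms {
        // Concatenate block by block; blocks were built in document
        // order, so the result stays sorted by doc id.
        let postings: Vec<u32> = blocks
            .iter()
            .filter_map(|b| b.get(&term))
            .flatten()
            .copied()
            .collect();
        merged.insert(term, postings);
    }
    merged
}
```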

The new architecture: Dictionary in RAM, postings on disk

After merging, the engine has two things:

1. RAM dictionary — each term maps to its byte offset in the disk file, the length of its postings in bytes, and its document frequency. 138,743 terms. Lightweight.

2. Disk file — final_index.bin: 34.8 MB of postings packed contiguously.

When a query arrives, the engine looks up each term in the RAM dictionary (nanoseconds), gets the byte offset, seeks to that exact position on disk, reads only the bytes it needs, and deserializes. No scanning. No loading the full index.

The document frequency in RAM lets the engine decide intersection order without touching disk. Make the cheap decision in memory before paying the expensive disk cost.

What a query actually looks like under the hood

Enter your search query:
Uniteed Sttates of Ameeriica

--- SPELL CHECKING ---
  'uniteed'   → 8,769 candidates → 9 after Jaccard → 1 after edit distance
  'sttates'   → 8,088 candidates → 29 after Jaccard → 1 after edit distance
  'ameeriica' → 4,315 candidates → 4 after Jaccard → 1 after edit distance
  Did you mean: united states of america?
  Spell check time: 415ms

--- RETRIEVING POSTINGS ---
  'america': 235 docs (4,128 bytes from offset 3,478,504)
  'united':  236 docs (4,596 bytes from offset 31,675,908)
  'states':  390 docs (7,700 bytes from offset 27,503,996)
  'of':      9,882 docs (394,124 bytes from offset 21,128,064)
  Retrieval time: 17ms

--- INTERSECTING ---
  37 documents contain all four terms
  Intersection time: 155µs

--- PHRASE FILTERING ---
  14 documents contain the exact phrase
  Phrase filter time: 11ms

Total search time: 444ms

Three misspelled words. 21,172 raw candidates from the trigram index. Filtered to 42 by Jaccard similarity. Down to 3 by edit distance. All three corrected perfectly. Then phrase-verified down to 14 exact matches.
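The three-stage funnel can be sketched like this. The Jaccard threshold is an assumption, and the real engine pulls candidates from its trigram index rather than scanning a slice, but the shape is the same: cheap set overlap first, expensive edit distance last, on a shrinking candidate pool.

```rust
use std::collections::HashSet;

/// Character trigrams of a word (plain sliding windows; padding schemes vary).
fn trigrams(word: &str) -> HashSet<String> {
    let chars: Vec<char> = word.chars().collect();
    chars.windows(3).map(|w| w.iter().collect()).collect()
}

/// Jaccard similarity between two trigram sets: |A ∩ B| / |A ∪ B|.
fn jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let inter = a.intersection(b).count() as f64;
    let union = a.union(b).count() as f64;
    if union == 0.0 { 0.0 } else { inter / union }
}

/// Classic Levenshtein edit distance (two-row dynamic programming).
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Funnel: trigram candidates -> Jaccard cutoff -> best edit distance.
fn correct(word: &str, candidates: &[&str], jaccard_min: f64) -> Option<String> {
    let tg = trigrams(word);
    candidates
        .iter()
        .copied()
        .filter(|c| jaccard(&tg, &trigrams(c)) >= jaccard_min)
        .min_by_key(|c| edit_distance(word, c))
        .map(str::to_string)
}
```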

What the postings data reveals

"of" takes 394KB on disk — a common word in 9,882 documents with all their positions. "america" takes 4KB. That's Zipf's law — frequency drops rapidly with rank.

This is why intersection order matters. Starting with "america" (235 docs) instead of "of" (9,882 docs) keeps intermediate results small. The intersection itself takes 155 microseconds. That's not the bottleneck.
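Rarest-first intersection is only a few lines: sort the postings lists by length (a stand-in for document frequency), then fold a linear two-pointer merge over them, so no intermediate result ever exceeds the rarest term's list.

```rust
use std::cmp::Ordering;

/// Intersect any number of sorted doc-id lists, smallest list first.
fn intersect_all(mut lists: Vec<Vec<u32>>) -> Vec<u32> {
    lists.sort_by_key(|l| l.len()); // start with the rarest term
    let mut lists = lists.into_iter();
    let first = match lists.next() {
        Some(l) => l,
        None => return Vec::new(),
    };
    lists.fold(first, |acc, l| intersect(&acc, &l))
}

/// Linear two-pointer merge of two sorted doc-id lists.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j, mut out) = (0, 0, Vec::new());
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}
```

For the query above, this means "america" ∩ "united" runs first, and the 9,882-doc "of" list is only ever compared against a few hundred survivors.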

The latency tradeoff I'm making

Yes, it's slower. The old architecture caps out at whatever fits in RAM — maybe 500K documents on my MacBook Air before things break. The new architecture handles tens of millions. The postings live on disk, only the dictionary needs to fit in memory.

For 10 million documents, vocabulary might be 1 million terms (Heaps' law — vocabulary grows as roughly the square root of collection size). That dictionary is maybe 50MB in RAM. The postings file could be 50GB on disk. That works. The old approach doesn't.
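The Heaps' law estimate is easy to sanity-check numerically. The law is V = k·T^β for T total tokens; k and β are corpus-dependent fitted constants, and k ≈ 44, β ≈ 0.49 is the textbook's fit for the RCV1 collection — which is where the "roughly square root" intuition comes from.

```rust
/// Heaps' law: predicted vocabulary size for a collection of `tokens` tokens.
/// k and beta are fitted per corpus; 44 and 0.49 are the RCV1 values.
fn heaps_vocab(tokens: f64, k: f64, beta: f64) -> f64 {
    k * tokens.powf(beta)
}
```

Plugging in 10^8 tokens with the RCV1 constants predicts a vocabulary in the high hundreds of thousands — small enough that the dictionary comfortably stays in RAM even as postings grow to tens of gigabytes.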

I want the system to be ready for huge corpora while still performing well, hence making this tradeoff early.

Setup costs (one-time)

Blocking, serializing, and merging all happen once, at index-build time. After that, queries run without rebuilding anything.

What's next

Chapters covered: 1–4. Next: either compression (shrinking that 34.8MB file) or TF-IDF ranking (so results come back in order of relevance instead of just a flat list of doc IDs).


Code: github.com/sreenish27/Search_engine_rs