Day 4: I compressed my search index by 76% and made everything worse (for now)


Today I implemented Chapter 5 of the IR textbook — index compression. I replaced every bincode::serialize and bincode::deserialize call in my Rust search engine with my own VByte encoder/decoder. The index shrank from 34.8MB to 8.3MB. Query latency went up. Both of these are correct.

The idea: store gaps, not values

My postings list for "computer" might be [3, 7, 12, 45, 108] — sorted document IDs. Each stored as a u32 = 4 bytes. But if I store the differences instead — [3, 4, 5, 33, 63] — the numbers get small. Small numbers need fewer bytes.

This is gap encoding. It works because postings lists are sorted. The first entry stays absolute, every subsequent entry is the distance from the previous one. Same trick applies to positional indexes within documents — positions [5, 12, 47] become gaps [5, 7, 35].
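Both directions can be sketched in a few lines. This is a minimal illustration, not the engine's actual code; `to_gaps` and `from_gaps` are hypothetical names:

```rust
// Gap encoding: the first value is stored as-is (its gap from 0),
// every later value as its distance from the previous one.
fn to_gaps(doc_ids: &[u32]) -> Vec<u32> {
    let mut gaps = Vec::with_capacity(doc_ids.len());
    let mut prev = 0u32;
    for &id in doc_ids {
        gaps.push(id - prev); // requires sorted input: id >= prev
        prev = id;
    }
    gaps
}

// Decoding is a running sum that reconstructs the absolute values.
fn from_gaps(gaps: &[u32]) -> Vec<u32> {
    let mut ids = Vec::with_capacity(gaps.len());
    let mut acc = 0u32;
    for &g in gaps {
        acc += g;
        ids.push(acc);
    }
    ids
}
```

Round-tripping `[3, 7, 12, 45, 108]` yields gaps `[3, 4, 5, 33, 63]` and back again, which is why sortedness is the precondition for the whole scheme.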

Variable Byte encoding: the mechanism

A u32 in bincode always takes 4 bytes. The number 3 is stored as 03 00 00 00 — three bytes of pure waste. VByte says: use only as many bytes as you need.

Each byte gets 7 bits of payload and 1 continuation bit. The high bit = 1 means "this is the last byte." High bit = 0 means "keep reading."

The number 5: binary 101 fits in 7 bits. One byte: 10000101 = 0x85. Done. 1 byte instead of 4.

The number 214,577: needs 18 bits. Chunk into 7-bit groups, write most significant first:

Byte 1: 0_0001101 = 0x0D  (continuation 0 — more coming)
Byte 2: 0_0001100 = 0x0C  (continuation 0 — more coming)
Byte 3: 1_0110001 = 0xB1  (continuation 1 — stop here)

3 bytes instead of 4. For small gaps (1, 2, 3), it's 1 byte instead of 4. That's where the 76% comes from.
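The mechanism above can be sketched as an encoder/decoder pair under the post's convention (high bit 1 marks the final byte, most significant chunk written first). Function names are illustrative, not the engine's actual six:

```rust
// Encode one u32 as VByte: 7 payload bits per output byte.
fn vbyte_encode(mut n: u32, out: &mut Vec<u8>) {
    // Collect 7-bit chunks least-significant first.
    let mut chunks = vec![(n & 0x7F) as u8];
    n >>= 7;
    while n > 0 {
        chunks.push((n & 0x7F) as u8);
        n >>= 7;
    }
    // Emit most-significant chunk first; set the high bit on the
    // final (least-significant) chunk to mark end-of-number.
    for (i, &c) in chunks.iter().enumerate().rev() {
        out.push(if i == 0 { c | 0x80 } else { c });
    }
}

// Decode one u32 starting at *pos; advances *pos past the value.
fn vbyte_decode(bytes: &[u8], pos: &mut usize) -> u32 {
    let mut n = 0u32;
    loop {
        let b = bytes[*pos];
        *pos += 1;
        n = (n << 7) | (b & 0x7F) as u32;
        if b & 0x80 != 0 {
            return n; // high bit set: this was the last byte
        }
    }
}
```

Encoding 5 produces the single byte 0x85, and 214,577 produces 0x0D 0x0C 0xB1, matching the worked example above.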

What I actually built

Six functions replaced all of bincode in my system.

The on-disk format for a single term's postings:

[doc_count] [doc_id₁] [pos_count₁] [pos₁] [pos_gap] [pos_gap] ...
            [doc_gap]  [pos_count₂] [pos₁] [pos_gap] ...
            [doc_gap]  [pos_count₃] [pos₁] ...

Every value is VByte-encoded. No separators needed — VByte is self-delimiting (read bytes until you hit one with the high bit set). Counts tell the decoder how many items follow. The format is entirely mine — no library, no framework, every byte accounted for.
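A writer for that layout is short once a VByte encoder exists. This is a sketch under the assumptions stated in the diagram (sorted doc IDs, sorted positions per doc); `write_postings` and `vbyte_push` are illustrative names, not the actual call sites:

```rust
// Minimal VByte encoder (high bit 1 marks the final byte).
fn vbyte_push(mut n: u32, out: &mut Vec<u8>) {
    let mut chunks = vec![(n & 0x7F) as u8];
    n >>= 7;
    while n > 0 {
        chunks.push((n & 0x7F) as u8);
        n >>= 7;
    }
    for (i, &c) in chunks.iter().enumerate().rev() {
        out.push(if i == 0 { c | 0x80 } else { c });
    }
}

/// Serialize one term's postings as
/// [doc_count] [doc_gap] [pos_count] [pos_gap]... per document.
/// `postings`: (doc_id, positions) pairs, both sorted ascending.
fn write_postings(postings: &[(u32, Vec<u32>)]) -> Vec<u8> {
    let mut out = Vec::new();
    vbyte_push(postings.len() as u32, &mut out); // doc_count
    let mut prev_doc = 0u32;
    for (doc_id, positions) in postings {
        vbyte_push(doc_id - prev_doc, &mut out); // first gap = absolute ID
        prev_doc = *doc_id;
        vbyte_push(positions.len() as u32, &mut out); // pos_count
        let mut prev_pos = 0u32;
        for &p in positions {
            vbyte_push(p - prev_pos, &mut out); // position gap
            prev_pos = p;
        }
    }
    out
}
```

Because every field is VByte and counts precede their items, the decoder needs no separators: it just reads the next self-delimiting integer.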

The numbers

Storage: 76% reduction

Metric                 Before (bincode)    After (VByte)      Change
final_index.bin size   34,883,860 bytes    8,284,250 bytes    -76%

Per-term postings size on disk

Term                   Before           After           Reduction
"america" (235 docs)   4,128 bytes      991 bytes       76%
"united" (236 docs)    4,596 bytes      1,168 bytes     75%
"states" (390 docs)    7,700 bytes      1,933 bytes     75%
"of" (9,882 docs)      394,124 bytes    91,347 bytes    77%

The most frequent term ("of") compresses the most. It appears in almost every document, so consecutive doc IDs are close together and its gaps are tiny; most encode to a single byte. This is Zipf's law making compression work exactly as the textbook predicts.

Latency: got worse (and that's fine)

Step                 Before    After     Change
Index construction   17.8s     27.1s     +52%
Merge time           4.9s      7.4s      +51%
Total setup          22.7s     34.5s     +52%
Spell check*         415ms     623ms     +50%
Postings retrieval   17.4ms    24.6ms    +41%
Intersection         155µs     248µs     +60%
Phrase filter        11.5ms    17.1ms    +49%
Total search         444ms     665ms     +50%

*Spell check slowdown is from widening the edit distance threshold from term.len() / 3 to term.len() / 2 — more candidates pass the filter, improving correction accuracy at the cost of speed. Unrelated to compression.

Correctness: identical

Both versions return the same 14 documents for "United States of America" after phrase filtering. Same doc IDs, same order. Compression is lossless.

Why it's slower and why I don't care

Two reasons for the slowdown:

1. Decode cost. Bincode deserializes a u32 with a direct 4-byte memory copy. My VByte decoder runs a loop with bit shifts and masks per byte. For the "of" postings list (9,882 documents), that's ~10,000 VByte decodes versus one bulk memcpy.

2. SSD negates I/O savings. The textbook's compression argument assumes spinning disks where seek time is 5ms and sequential reads are precious. My MacBook has an SSD. Reading 394KB vs 91KB is barely different — both complete in microseconds. The I/O savings are real but small; the CPU decode cost is larger.

Why this is the right tradeoff long-term

At 12,000 documents on a laptop SSD, compression adds CPU overhead that exceeds the I/O savings. The latency increase is the honest cost of doing more work per integer.

But this system isn't staying at 12,000 documents.

The crossover point — where I/O savings exceed decode cost — is somewhere around a few hundred thousand documents. Past that, compressed indexes are strictly faster. Every production search engine (Lucene, Elasticsearch, Google) uses variable-length encoding on postings. I'm paying the learning cost now at a scale where it hurts latency, so the architecture is ready when scale makes it essential.

What I actually learned

Serialization isn't magic. Before today, bincode::serialize was a black box. Now I know exactly what happens: the library walks your data structure, follows every pointer from stack to heap, writes lengths before variable-size data, and lays values out as fixed-width bytes. I replaced it with ~40 lines of Rust that produce a 4x smaller output because I know something bincode doesn't — my integers are small and sorted.

The 76% came from replacing a library with understanding. I didn't add a compression layer on top of bincode. I removed bincode entirely and wrote my own serializer — six call sites, all replaced. The savings aren't from a clever algorithm bolted onto existing code. They're from knowing what my data looks like (sorted integers with small gaps) and writing a format that exploits that structure directly. A general-purpose serializer can't make those assumptions. I can, because I built the index.

Disk format ≠ RAM format. On disk, I want minimal bytes — VByte gaps, sequential, compact. In RAM, I want random access — HashMap, O(1) lookups. Compression is the translation layer. Decode once per query per term, use the HashMap many times during phrase filtering.
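That translation can be sketched for one term: decode the gap stream once, then serve O(1) lookups from a HashMap for the rest of the query. This follows the on-disk layout described earlier, but `read_postings` and `vbyte_read` are illustrative names:

```rust
use std::collections::HashMap;

// Decode one VByte integer starting at *pos; advances *pos.
fn vbyte_read(bytes: &[u8], pos: &mut usize) -> u32 {
    let mut n = 0u32;
    loop {
        let b = bytes[*pos];
        *pos += 1;
        n = (n << 7) | (b & 0x7F) as u32;
        if b & 0x80 != 0 {
            return n;
        }
    }
}

/// Decode [doc_count] [doc_gap] [pos_count] [pos_gap]... into a
/// random-access map: doc_id -> absolute positions.
fn read_postings(bytes: &[u8]) -> HashMap<u32, Vec<u32>> {
    let mut pos = 0usize;
    let mut map = HashMap::new();
    let doc_count = vbyte_read(bytes, &mut pos);
    let mut doc_id = 0u32;
    for _ in 0..doc_count {
        doc_id += vbyte_read(bytes, &mut pos); // undo doc gap
        let pos_count = vbyte_read(bytes, &mut pos);
        let mut p = 0u32;
        let mut positions = Vec::with_capacity(pos_count as usize);
        for _ in 0..pos_count {
            p += vbyte_read(bytes, &mut pos); // undo position gap
            positions.push(p);
        }
        map.insert(doc_id, positions);
    }
    map
}
```

The decode loop is sequential (the format forces it), but after one pass phrase filtering can probe any doc ID in constant time.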

The two laws that make IR compression work: Zipf's law (term frequency falls off roughly as 1/rank, so the most frequent terms dominate the postings and their tiny gaps compress to single bytes) and Heaps' law (vocabulary grows sublinearly with collection size, so the dictionary stays small relative to the postings).

Small improvement: graceful term dropping

I also added handling for query terms that can't be corrected — words that don't exist in the index even after spell correction. Previously the engine would silently return zero results. Now it drops unmatched terms with a warning:

Dropping 'xyzqwk' -- no match found in index
Warning: 1 term(s) dropped, results may be broader than intended

The user sees what happened instead of getting an empty result set. Small thing, but it's the difference between a tool and a product.

What's still broken

My merge still loads all blocks into RAM simultaneously. The whole point of blocked index construction is that blocks don't fit in memory — you merge them with a k-way external merge using a priority queue, one term at a time. The new sequential block format I built today (sorted terms, self-delimiting entries) makes this fix possible — I can now cursor through a block file without loading it all. That's next.
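The planned fix can be sketched with std's BinaryHeap as the priority queue. Here in-memory vectors stand in for file cursors, and all names are illustrative; the real version would advance a VByte reader per block file. One assumption is labeled in the comments:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// K-way merge of blocks, each a (term, postings) stream sorted by term.
/// Pops the globally smallest term, merges duplicates across blocks,
/// then refills the heap from the block that was consumed.
fn kway_merge(blocks: Vec<Vec<(String, Vec<u32>)>>) -> Vec<(String, Vec<u32>)> {
    let mut iters: Vec<_> = blocks.into_iter().map(|b| b.into_iter()).collect();
    // Seed the heap with each block's first entry. Reverse turns the
    // max-heap into a min-heap, so the smallest term surfaces first.
    let mut heap = BinaryHeap::new();
    for (i, it) in iters.iter_mut().enumerate() {
        if let Some((term, postings)) = it.next() {
            heap.push(Reverse((term, i, postings)));
        }
    }
    let mut merged: Vec<(String, Vec<u32>)> = Vec::new();
    while let Some(Reverse((term, i, postings))) = heap.pop() {
        match merged.last_mut() {
            // Same term seen from an earlier block: append postings.
            // Assumes blocks cover ascending doc-ID ranges, so the
            // concatenation stays sorted.
            Some((t, p)) if *t == term => p.extend(postings),
            _ => merged.push((term, postings)),
        }
        // Refill from whichever block just yielded an entry.
        if let Some((t, p)) = iters[i].next() {
            heap.push(Reverse((t, i, p)));
        }
    }
    merged
}
```

Only one entry per block lives in memory at a time, which is the whole point: the merge's footprint becomes O(number of blocks), not O(index size).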

What's next

Chapter 6 — scoring with TF-IDF and the vector space model. Right now my engine returns a flat list of matching document IDs. No ranking. A document that mentions "United States of America" 47 times is treated the same as one that mentions it once. That changes next.


Code: github.com/sreenish27/Search_engine_rs