Today I implemented Chapter 5 of the IR textbook — index compression. I replaced every bincode::serialize and bincode::deserialize call in my Rust search engine with my own VByte encoder/decoder. The index shrank from 34.8MB to 8.3MB. Query latency went up. Both of these are correct.
The idea: store gaps, not values
My postings list for "computer" might be [3, 7, 12, 45, 108] — sorted document IDs. Each stored as a u32 = 4 bytes. But if I store the differences instead — [3, 4, 5, 33, 63] — the numbers get small. Small numbers need fewer bytes.
This is gap encoding. It works because postings lists are sorted. The first entry stays absolute, every subsequent entry is the distance from the previous one. Same trick applies to positional indexes within documents — positions [5, 12, 47] become gaps [5, 7, 35].
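A minimal sketch of the round trip (function names are mine, not the engine's): gaps on write, running sums on read. Treating the first entry as a gap from 0 means it comes back unchanged.

```rust
// Sketch of gap encoding: store differences between sorted values.
fn to_gaps(sorted_ids: &[u32]) -> Vec<u32> {
    let mut prev = 0;
    sorted_ids
        .iter()
        .map(|&id| {
            let gap = id - prev; // distance from the previous entry
            prev = id;
            gap
        })
        .collect()
}

// Inverse: accumulate gaps back into absolute values.
fn from_gaps(gaps: &[u32]) -> Vec<u32> {
    let mut acc = 0;
    gaps.iter()
        .map(|&g| {
            acc += g;
            acc
        })
        .collect()
}

fn main() {
    assert_eq!(to_gaps(&[3, 7, 12, 45, 108]), vec![3, 4, 5, 33, 63]);
    assert_eq!(from_gaps(&[3, 4, 5, 33, 63]), vec![3, 7, 12, 45, 108]);
}
```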
Variable Byte encoding: the mechanism
A u32 in bincode always takes 4 bytes. The number 3 is stored as 03 00 00 00 — three bytes of pure waste. VByte says: use only as many bytes as you need.
Each byte gets 7 bits of payload and 1 continuation bit. The high bit = 1 means "this is the last byte." High bit = 0 means "keep reading."
The number 5: binary 101 fits in 7 bits. One byte: 10000101 = 0x85. Done. 1 byte instead of 4.
The number 214,577: needs 18 bits. Chunk into 7-bit groups, write most significant first:
Byte 1: 0_0001101 = 0x0D (continuation 0 — more coming)
Byte 2: 0_0001100 = 0x0C (continuation 0 — more coming)
Byte 3: 1_0110001 = 0xB1 (continuation 1 — stop here)
3 bytes instead of 4. For small gaps (1, 2, 3), it's 1 byte instead of 4. That's where the 76% comes from.
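The scheme above fits in a few lines of Rust. This is a sketch following the convention used here (high bit set on the final byte), not necessarily the engine's exact code:

```rust
// Encode one u32 as VByte: 7-bit chunks, most significant first,
// high bit set on the last (least significant) chunk.
fn vbyte_encode(mut n: u32) -> Vec<u8> {
    let mut bytes = vec![(n % 128) as u8];
    while n >= 128 {
        n /= 128;
        bytes.push((n % 128) as u8);
    }
    bytes[0] += 128; // mark the terminating byte
    bytes.reverse(); // most significant chunk first
    bytes
}

// Decode a stream of VByte values: accumulate 7 bits per byte,
// emit a value whenever the high bit is set.
fn vbyte_decode(bytes: &[u8]) -> Vec<u32> {
    let mut out = Vec::new();
    let mut n: u32 = 0;
    for &b in bytes {
        if b < 128 {
            n = n * 128 + b as u32; // continuation: keep reading
        } else {
            n = n * 128 + (b as u32 - 128); // final byte: value complete
            out.push(n);
            n = 0;
        }
    }
    out
}

fn main() {
    assert_eq!(vbyte_encode(5), vec![0x85]);
    assert_eq!(vbyte_encode(214_577), vec![0x0D, 0x0C, 0xB1]);
    assert_eq!(vbyte_decode(&[0x0D, 0x0C, 0xB1]), vec![214_577]);
}
```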
What I actually built
Six functions replaced all of bincode in my system:
- `vbyte_encode` / `vbyte_decode` — encode and decode a single integer using the 7-bit + continuation-bit scheme.
- `serialize_postings` / `deserialize_postings` — encode and decode an entire term's postings (all doc IDs + all positions) as one VByte byte stream. Gap encoding happens here: differences are computed on write and accumulated back on read.
- `serialize_block` / `deserialize_block` — encode and decode a full index block (all terms + all their postings). Terms sorted alphabetically, each entry self-delimiting.
The on-disk format for a single term's postings:
```
[doc_count] [doc_id₁] [pos_count₁] [pos₁] [pos_gap] [pos_gap] ...
[doc_gap]   [pos_count₂] [pos₁] [pos_gap] ...
[doc_gap]   [pos_count₃] [pos₁] ...
```
Every value is VByte-encoded. No separators needed — VByte is self-delimiting (read bytes until you hit one with the high bit set). Counts tell the decoder how many items follow. The format is entirely mine — no library, no framework, every byte accounted for.
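A sketch of a writer for this layout (assuming a VByte encoder like the one described above; the engine's actual `serialize_postings` may differ in detail):

```rust
// Append the VByte encoding of n to out: 7-bit chunks, most significant
// first, high bit set on the final byte.
fn vbyte_encode_into(mut n: u32, out: &mut Vec<u8>) {
    let mut chunks = vec![(n % 128) as u8];
    while n >= 128 {
        n /= 128;
        chunks.push((n % 128) as u8);
    }
    chunks[0] += 128; // terminate the least significant chunk
    out.extend(chunks.iter().rev());
}

// Postings for one term: (doc_id, positions) pairs, sorted by doc_id.
// Emits [doc_count] [doc gap] [pos_count] [pos gap] ... as in the layout above.
fn serialize_postings(postings: &[(u32, Vec<u32>)]) -> Vec<u8> {
    let mut out = Vec::new();
    vbyte_encode_into(postings.len() as u32, &mut out);
    let mut prev_doc = 0;
    for (doc_id, positions) in postings {
        vbyte_encode_into(doc_id - prev_doc, &mut out); // first doc = gap from 0
        prev_doc = *doc_id;
        vbyte_encode_into(positions.len() as u32, &mut out);
        let mut prev_pos = 0;
        for &p in positions {
            vbyte_encode_into(p - prev_pos, &mut out); // position gaps
            prev_pos = p;
        }
    }
    out
}

fn main() {
    // Two docs; every value is < 128, so each field is exactly one byte.
    let bytes = serialize_postings(&[(3, vec![1]), (7, vec![2, 5])]);
    assert_eq!(bytes, vec![0x82, 0x83, 0x81, 0x81, 0x84, 0x82, 0x82, 0x83]);
}
```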
The numbers
Storage: 76% reduction
| Metric | Before (bincode) | After (VByte) | Change |
|---|---|---|---|
| `final_index.bin` size | 34,883,860 bytes | 8,284,250 bytes | -76% |
Per-term postings size on disk
| Term | Before | After | Reduction |
|---|---|---|---|
"america" (235 docs) |
4,128 bytes | 991 bytes | 76% |
"united" (236 docs) |
4,596 bytes | 1,168 bytes | 75% |
"states" (390 docs) |
7,700 bytes | 1,933 bytes | 75% |
"of" (9,882 docs) |
394,124 bytes | 91,347 bytes | 77% |
The most frequent term ("of") compresses the most. Because it appears in almost every document, consecutive doc IDs are close together and its gaps are tiny; most gaps encode to a single byte. This is Zipf's law making compression work exactly as the textbook predicts.
Latency: got worse (and that's fine)
| Step | Before | After | Change |
|---|---|---|---|
| Index construction | 17.8s | 27.1s | +52% |
| Merge time | 4.9s | 7.4s | +51% |
| Total setup | 22.7s | 34.5s | +52% |
| Spell check* | 415ms | 623ms | +50% |
| Postings retrieval | 17.4ms | 24.6ms | +41% |
| Intersection | 155µs | 248µs | +60% |
| Phrase filter | 11.5ms | 17.1ms | +49% |
| Total search | 444ms | 665ms | +50% |
*Spell check slowdown is from widening the edit distance threshold from `term.len() / 3` to `term.len() / 2` — more candidates pass the filter, improving correction accuracy at the cost of speed. Unrelated to compression.
Correctness: identical
Both versions return the same 14 documents for "United States of America" after phrase filtering. Same doc IDs, same order. Compression is lossless.
Why it's slower and why I don't care
Two reasons for the slowdown:
1. Decode cost. Bincode deserializes a u32 with a direct 4-byte memory copy. My VByte decoder runs a loop with bit shifts and masks per byte. For the "of" postings list (9,882 documents), that's ~10,000 VByte decodes versus one bulk memcpy.
2. SSD negates I/O savings. The textbook's compression argument assumes spinning disks where seek time is 5ms and sequential reads are precious. My MacBook has an SSD. Reading 394KB vs 91KB is barely different — both complete in microseconds. The I/O savings are real but small; the CPU decode cost is larger.
Why this is the right tradeoff long-term
At 12,000 documents on a laptop SSD, compression adds CPU overhead that exceeds the I/O savings. The latency increase is the honest cost of doing more work per integer.
But this system isn't staying at 12,000 documents.
- At 1 million documents: the `"of"` postings list balloons to hundreds of thousands of doc IDs plus positions. The uncompressed vs compressed difference becomes megabytes per query term. SSD or not, that matters.
- At 10 million documents: the uncompressed index could be tens of gigabytes. Compressed: maybe a quarter of that. The difference between fitting in the OS page cache or not. Cache miss = disk seek. Cache hit = memory speed. Compression determines which side you land on.
- At 100 million documents: you're sharding across machines. Every byte you don't send over the network is latency you don't pay. Compression isn't optional, it's the difference between a system that works and one that doesn't.
The crossover point — where I/O savings exceed decode cost — is somewhere around a few hundred thousand documents. Past that, compressed indexes are strictly faster. Every production search engine (Lucene, Elasticsearch, Google) uses variable-length encoding on postings. I'm paying the learning cost now at a scale where it hurts latency, so the architecture is ready when scale makes it essential.
What I actually learned
Serialization isn't magic. Before today, bincode::serialize was a black box. Now I know exactly what happens: the library walks your data structure, follows every pointer from stack to heap, writes lengths before variable-size data, and lays values out as fixed-width bytes. I replaced it with ~40 lines of Rust that produce a 4x smaller output because I know something bincode doesn't — my integers are small and sorted.
The 76% came from replacing a library with understanding. I didn't add a compression layer on top of bincode. I removed bincode entirely and wrote my own serializer — six call sites, all replaced. The savings aren't from a clever algorithm bolted onto existing code. They're from knowing what my data looks like (sorted integers with small gaps) and writing a format that exploits that structure directly. A general-purpose serializer can't make those assumptions. I can, because I built the index.
Disk format ≠ RAM format. On disk, I want minimal bytes — VByte gaps, sequential, compact. In RAM, I want random access — HashMap, O(1) lookups. Compression is the translation layer. Decode once per query per term, use the HashMap many times during phrase filtering.
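The translation step can be pictured like this (a simplified stand-in: VByte decoding is elided, positions are assumed already decoded, and `load_term` is a name I made up for illustration):

```rust
use std::collections::HashMap;

// Disk side: sequential (doc_gap, positions) pairs.
// RAM side: a random-access map keyed by absolute doc ID.
fn load_term(doc_gaps: &[(u32, Vec<u32>)]) -> HashMap<u32, Vec<u32>> {
    let mut postings = HashMap::new();
    let mut doc_id = 0;
    for (gap, positions) in doc_gaps {
        doc_id += gap; // undo gap encoding once, on load
        postings.insert(doc_id, positions.clone()); // then O(1) lookups
    }
    postings
}

fn main() {
    // Doc IDs 3 and 7 stored as gaps 3 and 4.
    let m = load_term(&[(3, vec![5]), (4, vec![2, 9])]);
    assert_eq!(m.get(&7), Some(&vec![2, 9]));
}
```

Decoding once per query term pays the CPU cost up front; phrase filtering then hits the map many times at memory speed.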
The two laws that make IR compression work:
- Heaps' law: vocabulary grows as ~√T (sublinear). Your dictionary stays manageable.
- Zipf's law: most terms are rare (large gaps, few postings), a few terms are very frequent (tiny gaps, massive compression). The distribution is perfectly shaped for variable-length encoding.
Small improvement: graceful term dropping
I also added handling for query terms that can't be corrected — words that don't exist in the index even after spell correction. Previously the engine would silently return zero results. Now it drops unmatched terms with a warning:
```
Dropping 'xyzqwk' -- no match found in index
Warning: 1 term(s) dropped, results may be broader than intended
```
The user sees what happened instead of getting an empty result set. Small thing, but it's the difference between a tool and a product.
What's still broken
My merge still loads all blocks into RAM simultaneously. The whole point of blocked index construction is that blocks don't fit in memory — you merge them with a k-way external merge using a priority queue, one term at a time. The new sequential block format I built today (sorted terms, self-delimiting entries) makes this fix possible — I can now cursor through a block file without loading it all. That's next.
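The shape of that fix, sketched with in-memory slices standing in for block-file cursors (the function name is mine, and merging the postings of duplicate terms — the real work — is elided here):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// k-way merge over k sorted term streams using a min-heap of
// (term, block index) pairs. Each block contributes one cursor;
// only the current entry of each block needs to be in memory.
fn kway_merge_terms(blocks: &[Vec<&str>]) -> Vec<String> {
    let mut cursors: Vec<_> = blocks.iter().map(|b| b.iter()).collect();
    let mut heap = BinaryHeap::new();
    for (i, c) in cursors.iter_mut().enumerate() {
        if let Some(&term) = c.next() {
            heap.push(Reverse((term, i))); // Reverse turns max-heap into min-heap
        }
    }
    let mut merged = Vec::new();
    while let Some(Reverse((term, i))) = heap.pop() {
        // Duplicate terms from different blocks would have their postings
        // merged here; this sketch just deduplicates the term.
        if merged.last().map_or(true, |last: &String| last.as_str() != term) {
            merged.push(term.to_string());
        }
        if let Some(&next) = cursors[i].next() {
            heap.push(Reverse((next, i))); // advance only the block we consumed
        }
    }
    merged
}

fn main() {
    let blocks = vec![vec!["america", "states"], vec!["of", "states", "united"]];
    assert_eq!(kway_merge_terms(&blocks), vec!["america", "of", "states", "united"]);
}
```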
What's next
Chapter 6 — scoring with TF-IDF and the vector space model. Right now my engine returns a flat list of matching document IDs. No ranking. A document that mentions "United States of America" 47 times is treated the same as one that mentions it once. That changes next.