This article walks through the mathematical and technical details of each step in a multi-zone, BM25-based search architecture for films, explaining the rationale and calculations behind each component.
Database structure and purpose
1. Zone Importance table
Holds Zone Importance Scores for each term and zone, indicating the relevance of zones for specific terms. These scores let the system prioritize zones, ensuring the search targets high-value zones for each query term.
2. TF-IDF table (per zone)
Stores Term Frequency (TF) and Inverse Document Frequency (IDF) for each term-zone pair. The TF score represents how frequently a term appears within a zone for a given film. IDF adjusts term importance based on its frequency across all films, reducing the weight of common words.
3. AllFilms table
Contains general information about each film — title, director, release year. Used for metadata retrieval once relevant film IDs are identified.
4. Zone Vectors table
Pre-computed zone vectors to facilitate cosine similarity calculations with the query vector, used in zone ranking.
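The four tables above can be sketched as a minimal in-memory schema. This is an illustrative sketch only; the table and column names here are assumptions, not the author's actual schema.

```python
import sqlite3

# Hypothetical schemas for the four tables described above.
# Column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE zone_importance (term TEXT, zone TEXT, score REAL,
                              PRIMARY KEY (term, zone));
CREATE TABLE tfidf (term TEXT, zone TEXT, film_id INTEGER,
                    tf REAL, idf REAL,
                    PRIMARY KEY (term, zone, film_id));
CREATE TABLE all_films (film_id INTEGER PRIMARY KEY,
                        title TEXT, director TEXT, release_year INTEGER);
CREATE TABLE zone_vectors (zone TEXT PRIMARY KEY, vector BLOB);
""")
```

Keying the first two tables on (term, zone) makes the per-term lookups in the query-processing step a single indexed read.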
Technical flow and math
1. Query processing and composite score
The query is tokenized into a list of terms. For each term, the system retrieves Zone Importance Scores and calculates a composite score. Terms with a zero composite score are filtered out, effectively acting as a stop-word filter.
Composite score formula. For each term t:
Composite Score(t) = Σ_z Zone Importance Score(t, z)
where the sum runs over all zones z, giving a total relevance score per term; terms whose sum is zero are discarded.
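The composite-score filter can be sketched in a few lines. The data layout (a dict keyed on (term, zone)) is an assumption for illustration:

```python
def composite_score(term, zone_importance, zones):
    """Sum a term's Zone Importance Scores over all zones."""
    return sum(zone_importance.get((term, z), 0.0) for z in zones)

zones = ["plot", "cast", "director_bio"]
zone_importance = {("politics", "plot"): 0.9, ("politics", "cast"): 0.1}

query_terms = ["politics", "the"]
# Terms with a zero composite score act like stop words and are dropped:
# "the" has no importance score in any zone, so it is filtered out.
kept = [t for t in query_terms if composite_score(t, zone_importance, zones) > 0]
```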
2. Zone ranking with cosine similarity
The system calculates cosine similarity between the query vector Q and each zone vector Z to rank zones for each term by similarity. Only the top 3 zones for each term are selected based on similarity, ensuring the search focuses on zones that match the query's semantics.
Cosine similarity formula:
Cosine Similarity(Q, Z) = (Q · Z) / (||Q|| · ||Z||)
Where Q · Z is the dot product of the query and zone vectors, and ||Q||, ||Z|| are their magnitudes. This measure identifies zones aligned with the query's context.
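A direct implementation of the formula, plus the top-3 zone selection, might look like this (vector contents and zone names are placeholders):

```python
import math

def cosine_similarity(q, z):
    """Dot product of q and z divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(q, z))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in z))
    return dot / norm if norm else 0.0

def top_zones(query_vec, zone_vectors, k=3):
    """Rank zones by cosine similarity to the query and keep the top k."""
    ranked = sorted(zone_vectors.items(),
                    key=lambda kv: cosine_similarity(query_vec, kv[1]),
                    reverse=True)
    return [zone for zone, _ in ranked[:k]]
```

Guarding against a zero norm avoids a division-by-zero when a vector is empty or all zeros.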
3. BM25 scoring for final relevance ranking
For each selected word-zone pair, the system retrieves TF-IDF values and applies the BM25 formula to calculate a relevance score per film. BM25 scoring takes into account term frequency, document length normalization, and IDF — reducing the impact of common words and increasing the importance of unique terms.
BM25 scoring formula. For each term t in a selected zone for a given film d:
BM25(t, d) = IDF(t) × [ TF(t,d) × (k1 + 1) ]
/ [ TF(t,d) + k1 × (1 - b + b × (L_d / L_avg)) ]
Here k1 and b are tuning parameters (commonly k1 ≈ 1.2–2.0 and b ≈ 0.75): k1 controls how quickly repeated occurrences of a term saturate, and b controls the strength of length normalization.
IDF(t) adjusts for term uniqueness across films:
IDF(t) = log( (N - n_t + 0.5) / (n_t + 0.5) + 1 )
Where N is the total number of films and n_t is the number of films containing term t.
TF(t, d) — Term frequency of t in film d.
Length normalization. BM25 normalizes by document length L_d relative to the average length L_avg, ensuring fair comparison across films of different lengths.
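The scoring formula above translates directly into code. This is a sketch of standard BM25 under the stated IDF variant; the default parameter values are common choices, not the author's:

```python
import math

def bm25(tf, n_t, N, doc_len, avg_len, k1=1.5, b=0.75):
    """BM25 score of one term in one film.

    tf      - term frequency of the term in the film's zone
    n_t     - number of films containing the term
    N       - total number of films
    doc_len - length of this film's zone text (L_d)
    avg_len - average length across films (L_avg)
    """
    idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_len))
```

Note how the two behaviors described above fall out of the formula: a rarer term (smaller n_t) gets a larger IDF, and a film longer than average (doc_len > avg_len) gets a larger denominator and thus a lower score at the same term frequency.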
4. Filtering and final ranking
The system filters films to include only those containing all query terms across the selected zones. Films are then ranked by their cumulative BM25 scores, giving a final sorted list with the most relevant films at the top.
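The intersect-then-rank step can be sketched as follows, assuming the per-term BM25 scores have already been summed over each term's selected zones (the data shape is an assumption for illustration):

```python
def rank_films(per_term_scores, query_terms):
    """Keep films containing every query term, ranked by cumulative BM25."""
    # per_term_scores: term -> {film_id: BM25 score summed over selected zones}
    candidates = set.intersection(*(set(per_term_scores[t]) for t in query_terms))
    totals = {f: sum(per_term_scores[t][f] for t in query_terms) for f in candidates}
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

scores = {"politics": {1: 2.0, 2: 1.0}, "japan": {1: 1.5, 3: 4.0}}
# Only film 1 contains both terms, so films 2 and 3 are filtered out.
ranked = rank_films(scores, ["politics", "japan"])
```

The set intersection enforces the all-terms requirement before any ranking happens, so a film scoring highly on one term alone never appears in the results.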
Why this worked
The big shift from v1 to v2 was realizing that where a match came from matters as much as the match itself. A film where "politics" appears in the plot summary is a completely different signal than one where "politics" appears in a director's biography. Multi-zone retrieval lets the system weight each signal appropriately. Pair that with BM25 — which handles term saturation and length normalization the way the embeddings-only approach in v1 could not — and the results actually line up with what users mean.
A startup founder used this version to find a Japanese animated film he'd been unable to find anywhere else. That's still my favorite piece of feedback on anything I've built.