filmsearch


A semantic search engine for films — built so you can ask things like "movies about love and space" and get back what you actually mean, not just title matches. Corpus of 54,000+ films scraped from Wikipedia.

This project has two articles because it has two versions. The first attempt failed: I was throwing LLMs at the problem without understanding what retrieval actually is. The second attempt worked because I stopped outsourcing the thinking and built BM25 with multi-zone ranking myself. A startup founder used it to track down a Japanese animated film he hadn't been able to find anywhere else. Still my favorite piece of feedback.

try it → · github →

the journey

Four weeks between versions. The difference wasn't a bigger model — it was building BM25 and multi-zone ranking from scratch.

articles

Multi-zone, BM25-based film search — the technical breakdown

v2 · nov 10 2024

The rebuild. Dropped the LLM-heavy approach and built a proper retrieval system: BM25 scoring, per-zone TF-IDF tables, cosine similarity for zone ranking, composite scores as a stop-word filter. Math and architecture, step by step.

bm25 multi-zone cosine similarity
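
For readers who have not met BM25 scored across zones, here is a minimal sketch of the idea, not the code from the article. The zone names (`title`, `plot`), the fixed zone weights, the `k1` and `b` defaults, and the whitespace tokenizer are all assumptions made for illustration. The v2 write-up ranks zones with cosine similarity rather than hand-set weights.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (assumption: no stemming, no stop-word list).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each string in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long zones are penalised relative to the average.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def multi_zone_scores(query, films, zone_weights):
    """Weighted sum of per-zone BM25 scores (hypothetical fixed weights)."""
    totals = [0.0] * len(films)
    for zone, weight in zone_weights.items():
        zone_texts = [film.get(zone, "") for film in films]
        for i, s in enumerate(bm25_scores(query, zone_texts)):
            totals[i] += weight * s
    return totals


# Toy corpus, invented for this example.
films = [
    {"title": "Interstellar", "plot": "a father travels through space and love crosses time"},
    {"title": "Love Actually", "plot": "several love stories unfold in london"},
    {"title": "Heat", "plot": "a detective hunts a crew of bank robbers"},
]
scores = multi_zone_scores("love space", films, {"title": 2.0, "plot": 1.0})
for film, score in zip(films, scores):
    print(f"{film['title']}: {score:.3f}")
```

Each zone gets its own document-frequency table here, so a word that is common in plots but rare in titles still counts for a lot when it shows up in a title.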

Film Search Platform — first attempt (the one that failed)

v1 · oct 15 2024

The first swing. Scrapy pipeline, structured data via Llama 3.1, 384-dim embeddings in Supabase, cosine similarity. It ran. It didn't really work. Keeping the write-up online because the failure is the more useful artifact.

failed embeddings llm wrapper
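
For context on what v1 was doing at query time, here is a rough sketch of embedding retrieval by cosine similarity. It is an illustration under assumptions, not the v1 code: the vectors are invented 3-dimensional toys standing in for the 384-dim embeddings, the film labels are placeholders, and the in-memory dictionary stands in for the Supabase table.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, film_vecs, k=2):
    """Return the k film labels whose vectors are closest to the query vector."""
    ranked = sorted(
        film_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# Hypothetical embeddings: real ones would come from an embedding model.
film_vecs = {
    "film_a": [0.9, 0.1, 0.0],
    "film_b": [0.0, 1.0, 0.0],
    "film_c": [0.7, 0.7, 0.0],
}
print(top_k([1.0, 0.0, 0.0], film_vecs))
```

The ranking depends entirely on where the embedding model places each film and each query, which is what makes this approach hard to debug when the results look wrong.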