A semantic search engine for films — built so you can ask things like "movies about love and space" and get back what you actually mean, not just title matches. Corpus of 54,000+ films scraped from Wikipedia.
This project has two articles because it has two versions. The first attempt failed: I was throwing LLMs at the problem without understanding what retrieval actually is. The second attempt worked because I stopped outsourcing the thinking and built BM25 with multi-zone ranking myself. A startup founder used it to track down a Japanese animated film he hadn't been able to find anywhere else. Still my favorite piece of feedback.
Four weeks between versions. The difference wasn't a bigger model — it was building BM25 and multi-zone ranking from scratch.
The rebuild. Dropped the LLM-heavy approach and built a proper retrieval system: BM25 scoring, per-zone TF-IDF tables, cosine similarity for zone ranking, composite scores as a stop-word filter. Math and architecture, step by step.
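For a rough sense of what the BM25 scoring in the rebuild involves, here is a minimal single-zone sketch. It is not the project's code: the function name `bm25_scores`, the whitespace tokenization, the three toy documents, and the parameter values `k1=1.5` and `b=0.75` (common defaults) are all assumptions for illustration. The real system scores several zones per film and combines them, which this sketch leaves out.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    Hypothetical helper for illustration; assumes `docs` is non-empty.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF (the "+1 inside the log" variant, always positive).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up for this example), tokenized on whitespace.
docs = [
    "a love story set on a space station".split(),
    "a detective hunts a killer in tokyo".split(),
    "two astronauts fall in love in orbit".split(),
]
scores = bm25_scores("love space".split(), docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The first document matches both query terms and ranks highest, the third matches only "love", and the second matches nothing and scores zero. That also shows the limit of purely lexical scoring: "astronauts" and "orbit" earn no credit for "space".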
The first swing. Scrapy pipeline, structured data extracted via Llama 3.1, 384-dim embeddings stored in Supabase, cosine similarity for search. It ran. It didn't really work. Keeping the write-up online because the failure is the more useful artifact.