Film Search Platform — first attempt (the one that failed)

Editor's note: This was my first attempt at building a film search engine. It didn't work — not in the way I needed it to. The core mistake: I was outsourcing the thinking to an LLM and hoping semantic embeddings would do the rest. They didn't. Four weeks later I rebuilt it from scratch with BM25 and multi-zone ranking and that version works. I'm keeping this write-up up because the failure taught me more than the success did, and the architecture of not understanding retrieval is a useful thing for people to see.


What does it do?

The platform lets users input complex, natural language queries about the type of film they want to watch. For example:

"Find me a Tamil movie that delves into the intersection of politics and crime. I'm looking for something with intricate political plots, corruption, and power struggles, set against the backdrop of organized crime. The protagonist should be an anti-hero, navigating a morally gray world where justice is elusive, and survival requires manipulation, deceit, and sometimes violence."

The system then returns movie recommendations based on the user's request.

Data Source: Wikipedia

To keep things simple and within the scope of available resources, I sourced film data from Wikipedia, pulling the key fields for each film from its article.

I built a scraper using Scrapy to extract all Tamil films and stored the resulting data in a .jsonl file. The structure of each entry:

{
  "film_name": {
    "key_field": "value",
    "another_field": "value",
    ...
  }
}
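Each line of the .jsonl holds one of these single-film objects, so rebuilding the full corpus is just a matter of merging lines. A minimal loader (the function name and field values here are illustrative, not from the actual scraper):

```python
import json

def load_films(path):
    """Load scraped films from a .jsonl file into one dict keyed by film name.

    Each line is a single {"film_name": {...fields...}} object, so merging
    the lines line-by-line reconstructs the whole corpus.
    """
    films = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                films.update(json.loads(line))
    return films
```

One dict per line (rather than one giant JSON array) keeps the file appendable during a long scrape and lets a crashed run resume without corrupting earlier records.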

Technology Stack

Building the Database

Structuring the data

To improve search accuracy and enable future keyword search, I converted the unstructured Wikipedia data into structured records, guided by per-field frameworks:

LLM integration. Used the Groq API (Llama 3.1) with FastAPI to convert unstructured data into structured dictionaries. Used json_repair to clean up LLM outputs. Saved into a new .jsonl.
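The structuring step can be sketched roughly as below. This is a hedged reconstruction, not the actual service code: the prompt wording, the `llama-3.1-8b-instant` model id, and the function names are my assumptions; the real pipeline ran behind FastAPI endpoints.

```python
import json

def build_prompt(framework: dict, raw_text: str) -> str:
    """Build a prompt that asks the model to fill a framework skeleton.

    The framework (e.g. the plot framework shown later) doubles as the
    output schema, which keeps the LLM's JSON roughly on-rails.
    """
    return (
        "Fill in this JSON template using only facts from the text below. "
        "Return JSON only, no commentary.\n\n"
        f"Template:\n{json.dumps(framework, indent=2)}\n\n"
        f"Text:\n{raw_text}"
    )

def structure_field(framework: dict, raw_text: str, api_key: str) -> dict:
    """Call Groq, then repair the (often slightly malformed) JSON reply."""
    from groq import Groq      # lazy imports: only needed for live API calls
    import json_repair

    client = Groq(api_key=api_key)
    reply = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # assumed model id; any Llama 3.1 works
        messages=[{"role": "user", "content": build_prompt(framework, raw_text)}],
    )
    # json_repair tolerates trailing commas, unquoted keys, stray prose, etc.
    return json_repair.loads(reply.choices[0].message.content)
```

The repair step matters: even with "return JSON only" instructions, Llama-class models occasionally emit truncated braces or commentary, and `json_repair` salvages most of those outputs instead of discarding the whole film.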

Embedding. Once structured, I embedded each field as a 384-dimensional vector and stored the vectors in PostgreSQL (Supabase), split across two tables per film.

Framework creation. I identified 12 critical fields (e.g. plot, production, reception, cast) from the 95+ unique fields in the scraped data, and built JSON-based "frameworks" to structure each one. Example for plot:

{
  "genre": [],
  "setting": {
    "time": {
      "period": "",
      "duration": ""
    },
    "place": {
      "geographic": "",
      "specific": []
    },
    "context": {
      "historical": "",
      "social": "",
      "cultural": ""
    }
  }
}

These frameworks guided how data was structured for each field in every film.

Search Architecture

Search combined vector embeddings with an LLM to handle complex natural language queries.

Key steps

  1. User query. The user enters a natural language query. The LLM processes it to identify 1–3 relevant categories — predefined buckets of fields (e.g. "Story" includes Plot, Premise, Synopsis, Summary).
  2. Query structuring. The query is structured according to the frameworks used for the film data.
  3. Vector embedding. The structured query is embedded as a vector and compared against the film vectors via similarity search.
  4. Result aggregation. Films with closest vector matches are retrieved and displayed.
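Steps 3 and 4 boil down to nearest-neighbour search over cosine similarity. The real system did this inside PostgreSQL over 384-dimensional vectors; the stdlib sketch below uses tiny toy vectors just to show the ranking logic (names and dimensions are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, film_vecs, k=3):
    """Rank films by cosine similarity to the already-embedded query."""
    scored = [(name, cosine(query_vec, v)) for name, v in film_vecs.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```

In production this loop is what a pgvector `<=>` (cosine distance) query replaces, with an index doing the heavy lifting instead of a full scan.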

Future enhancements

The architecture was still at an early stage, and several improvements were planned.

Final thoughts

The project was still evolving. The frontend and backend let users type a query and get films back, but there was significant room for further optimization.


What I learned afterward — and the reason this approach failed — is that vector similarity alone doesn't know where in a document a match came from. A film where "politics" shows up in the plot summary is a totally different signal than one where "politics" shows up in a director biography. Treating all text the same flattened exactly the information I needed to rank well. The fix in v2 is called multi-zone retrieval, which is what Google has been doing for twenty years. I just hadn't read enough IR to know it existed yet.
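The core of the multi-zone idea fits in a few lines: score each zone (plot, cast, production, ...) separately, then combine with per-zone weights, so a "politics" hit in the plot counts for far more than the same hit in a production note. The zone names and weights below are illustrative, not the v2 system's actual tuning:

```python
def multizone_score(zone_scores, weights):
    """Combine per-zone relevance scores (e.g. per-zone BM25) with zone weights.

    A match in the plot zone dominates; a match in a cast or production
    zone contributes only a fraction. Unknown zones contribute nothing.
    """
    return sum(weights.get(zone, 0.0) * s for zone, s in zone_scores.items())

# Illustrative weights, not the tuned values from v2.
WEIGHTS = {"plot": 1.0, "reception": 0.4, "production": 0.2, "cast": 0.1}
```

This is the same shape as BM25F-style field weighting: flat similarity over concatenated text throws the zone signal away, while a weighted sum of per-zone scores preserves it.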