Demo, GitHub, and Site Links
- Demo Video: link
- GitHub: sreenish27/filmsearch
- Live Site: filmsearch-kappa.vercel.app
What does it do?
The platform lets users input complex, natural language queries about the type of film they want to watch. For example:
"Find me a Tamil movie that delves into the intersection of politics and crime. I'm looking for something with intricate political plots, corruption, and power struggles, set against the backdrop of organized crime. The protagonist should be an anti-hero, navigating a morally gray world where justice is elusive, and survival requires manipulation, deceit, and sometimes violence."
The system then returns movie recommendations based on the user's request.
Data Source: Wikipedia
To keep things simple and within the scope of available resources, I sourced film data from Wikipedia. Key fields per film:
- Plot summary
- Cast and crew
- Production details
- Public and critical reception
- Budget and box office performance
- And more
I built a scraper using Scrapy to extract all Tamil films and stored the resulting data in a .jsonl file. The structure of each entry:
{
  "film_name": {
    "key_field": "value",
    "another_field": "value",
    ...
  }
}
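As a sketch of how a file in that shape can be consumed downstream (the function name and the sample field names are mine, not the project's):

```python
import json

def load_films(path):
    """Read a .jsonl file where each line is {"film_name": {field: value, ...}}
    and merge the lines into a single {film_name: fields} mapping."""
    films = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                films.update(json.loads(line))
    return films
```

Merging the per-line dicts keeps title lookups O(1) while preserving the one-film-per-line layout that Scrapy's JSON Lines export produces.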
Technology Stack
- Frontend: Next.js, Tailwind CSS
- Backend: Node.js, Express.js
- Web Crawler: Scrapy (Python)
- Database: PostgreSQL (hosted on Supabase, using pgvector for vector data type support)
- LLM API: Groq (Llama 3.1 70B), FastAPI for endpoints
- Vector Embedding: all-MiniLM-L6-v2 (384 dimensions)
Building the Database
Structuring the data
To improve search accuracy and enable keyword search later, I converted the unstructured Wikipedia data into a consistent structure:
LLM integration. I used the Groq API (Llama 3.1 70B), served through FastAPI endpoints, to convert each film's unstructured data into structured dictionaries, cleaned up malformed LLM output with json_repair, and saved the results to a new .jsonl file.
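A minimal sketch of that structuring step could look like this. The model ID matches the stack above, but the prompt wording and function names are my own, and the Groq/json_repair calls assume those packages are installed (they're imported lazily so the prompt helper works without them):

```python
import json

def build_prompt(raw_entry: dict, framework: dict) -> str:
    """Ask the model to fill a framework dict using a film's raw Wikipedia data."""
    return (
        "Fill in this JSON framework using only the film data below. "
        "Return JSON only, no commentary.\n\n"
        f"Framework:\n{json.dumps(framework, indent=2)}\n\n"
        f"Film data:\n{json.dumps(raw_entry, indent=2)}"
    )

def structure_field(raw_entry: dict, framework: dict) -> dict:
    """Call the LLM, then repair whatever JSON-ish text it returns."""
    from groq import Groq                # lazy imports: build_prompt stays
    from json_repair import repair_json  # usable without the SDKs installed
    client = Groq()  # reads GROQ_API_KEY from the environment
    resp = client.chat.completions.create(
        model="llama-3.1-70b-versatile",
        messages=[{"role": "user", "content": build_prompt(raw_entry, framework)}],
    )
    return json.loads(repair_json(resp.choices[0].message.content))
```

json_repair is doing the heavy lifting here: LLMs routinely emit trailing commas or stray prose around the JSON, and repairing is cheaper than re-prompting.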
Embedding. Once structured, I embedded each field as a 384-dimensional vector (all-MiniLM-L6-v2) and stored it in PostgreSQL on Supabase. The database held two tables:
- Films table — title, film details, image
- FilmInfo table — title, vector-embedded fields, raw data
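One way to sketch the embedding step (the flattening helper is my own invention, and sentence-transformers is imported lazily because the model download is heavy):

```python
def framework_to_text(d, prefix=""):
    """Flatten a structured field dict into 'path: value' lines,
    so one field becomes one string to embed. Empty values are skipped."""
    lines = []
    for key, val in d.items():
        path = f"{prefix}{key}"
        if isinstance(val, dict):
            lines.extend(framework_to_text(val, prefix=path + "."))
        elif isinstance(val, list):
            if val:
                lines.append(f"{path}: {', '.join(map(str, val))}")
        elif val:
            lines.append(f"{path}: {val}")
    return lines

def embed_field(structured_field: dict):
    """Encode a flattened field into a 384-dim vector with all-MiniLM-L6-v2."""
    from sentence_transformers import SentenceTransformer  # lazy: ~80MB model
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode("\n".join(framework_to_text(structured_field)))
```

The resulting vector can then go into a pgvector column, e.g. an `INSERT INTO filminfo (title, plot_vec) VALUES (%s, %s)` through psycopg2 with the pgvector adapter (table and column names here are illustrative).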
Framework creation. From the 95+ unique fields in the scraped data, I identified 12 critical ones (plot, production, reception, cast, and so on) and built a JSON-based "framework" to structure each. Example for plot:
{
  "genre": [],
  "setting": {
    "time": {
      "period": "",
      "duration": ""
    },
    "place": {
      "geographic": "",
      "specific": []
    },
    "context": {
      "historical": "",
      "social": "",
      "cultural": ""
    }
  }
}
These frameworks guided how data was structured for each field in every film.
Search Architecture
The search process was powered by vector embeddings and LLMs to handle complex queries.
Key steps
- User query. The user enters a natural language query. The LLM processes it to identify 1–3 relevant categories — predefined buckets of fields (e.g. "Story" includes Plot, Premise, Synopsis, Summary).
- Query structuring. The query is structured according to the frameworks used for the film data.
- Vector embedding. The structured query is embedded as a vector and compared against the film vectors via similarity search.
- Result aggregation. Films with closest vector matches are retrieved and displayed.
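The comparison step above amounts to a cosine nearest-neighbour ranking. In production that ranking happens inside Postgres via pgvector's cosine-distance operator (roughly `SELECT title FROM filminfo ORDER BY plot_vec <=> %s LIMIT 10`); the in-memory version below shows the same idea with numpy, with the function name and data shapes being my own:

```python
import numpy as np

def top_k_films(query_vec, film_vecs, k=3):
    """Rank films by cosine similarity between their stored field vector
    and the query vector. film_vecs: {title: vector}. Returns (title, score) pairs."""
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    scores = {}
    for title, vec in film_vecs.items():
        v = np.asarray(vec, dtype=float)
        scores[title] = float(q @ (v / np.linalg.norm(v)))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```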
Future enhancements
The architecture was still at an early stage. Planned improvements:
- Refining the query structuring logic
- Optimizing vector similarity search for performance
- Exploring keyword-based search in combination with vector search
Final thoughts
The project was still evolving. The frontend and backend allowed users to input queries and get films back, but there was significant potential for further optimization.
What I learned afterward — and the reason this approach failed — is that vector similarity alone doesn't know where in a document a match came from. A film where "politics" shows up in the plot summary is a totally different signal than one where "politics" shows up in a director biography. Treating all text the same flattened exactly the information I needed to rank well. The fix in v2 is called multi-zone retrieval, which is what Google has been doing for twenty years. I just hadn't read enough IR to know it existed yet.
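The multi-zone idea can be sketched as scoring each document zone separately and combining the scores with per-zone weights, so the ranking knows where a match occurred. The zone names and weights below are purely illustrative:

```python
import numpy as np

# Illustrative weights: a match in the plot should count far more
# than the same match in a crew biography.
ZONE_WEIGHTS = {"plot": 1.0, "reception": 0.5, "crew_bio": 0.1}

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def zone_score(query_vec, film_zones):
    """film_zones: {zone_name: vector}. Weighted sum of per-zone similarities."""
    return sum(
        ZONE_WEIGHTS.get(zone, 0.0) * cosine(query_vec, vec)
        for zone, vec in film_zones.items()
    )
```

With this scoring, a film whose plot vector matches "politics" outranks a film where only a director biography matches, which is exactly the distinction a single flattened vector throws away.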