Demo, GitHub, and Site Links
- Demo Video: link
- GitHub: sreenish27/filmsearch
- Live Site: filmsearch-kappa.vercel.app
What does it do?
The platform lets users input complex, natural language queries about the type of film they want to watch. For example:
"Find me a Tamil movie that delves into the intersection of politics and crime. I'm looking for something with intricate political plots, corruption, and power struggles, set against the backdrop of organized crime. The protagonist should be an anti-hero, navigating a morally gray world where justice is elusive, and survival requires manipulation, deceit, and sometimes violence."
The system then returns movie recommendations based on the user's request.
Data Source: Wikipedia
To keep things simple and within the scope of available resources, I sourced film data from Wikipedia. Key fields per film:
- Plot summary
- Cast and crew
- Production details
- Public and critical reception
- Budget and box office performance
- And more
I built a scraper using Scrapy to extract all Tamil films and stored the resulting data in a .jsonl file. The structure of each entry:
{
  "film_name": {
    "key_field": "value",
    "another_field": "value",
    ...
  }
}
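As a sketch of how a file in that shape can be consumed downstream (the function name and the sample field names are mine, not the project's):

```python
import json

def load_films(path):
    """Read a .jsonl file where each line is {"film_name": {field: value, ...}}
    and merge the lines into a single {film_name: fields} mapping."""
    films = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                films.update(json.loads(line))
    return films
```

Merging the per-line dicts keeps title lookups O(1) while preserving the one-film-per-line layout that Scrapy's JSON Lines export produces.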
Technology Stack
- Frontend: Next.js, Tailwind CSS
- Backend: Node.js, Express.js
- Web Crawler: Scrapy (Python)
- Database: PostgreSQL (hosted on Supabase, using pgvector for vector data type support)
- LLM API: Groq (Llama 3.1 70B), FastAPI for endpoints
- Vector Embedding: all-MiniLM-L6-v2 (384 dimensions)
Building the Database
Structuring the data
To improve search accuracy and enable keyword search later, I converted the unstructured Wikipedia data into a consistent structure:
LLM integration. I used the Groq API (Llama 3.1 70B), served through FastAPI endpoints, to convert each film's unstructured data into structured dictionaries, cleaned up malformed LLM output with json_repair, and saved the results to a new .jsonl file.
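A minimal sketch of that structuring step could look like this. The model ID matches the stack above, but the prompt wording and function names are my own, and the Groq/json_repair calls assume those packages are installed (they're imported lazily so the prompt helper works without them):

```python
import json

def build_prompt(raw_entry: dict, framework: dict) -> str:
    """Ask the model to fill a framework dict using a film's raw Wikipedia data."""
    return (
        "Fill in this JSON framework using only the film data below. "
        "Return JSON only, no commentary.\n\n"
        f"Framework:\n{json.dumps(framework, indent=2)}\n\n"
        f"Film data:\n{json.dumps(raw_entry, indent=2)}"
    )

def structure_field(raw_entry: dict, framework: dict) -> dict:
    """Call the LLM, then repair whatever JSON-ish text it returns."""
    from groq import Groq                # lazy imports: build_prompt stays
    from json_repair import repair_json  # usable without the SDKs installed
    client = Groq()  # reads GROQ_API_KEY from the environment
    resp = client.chat.completions.create(
        model="llama-3.1-70b-versatile",
        messages=[{"role": "user", "content": build_prompt(raw_entry, framework)}],
    )
    return json.loads(repair_json(resp.choices[0].message.content))
```

json_repair is doing the heavy lifting here: LLMs routinely emit trailing commas or stray prose around the JSON, and repairing is cheaper than re-prompting.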
Embedding. Once structured, I embedded each field as a 384-dimensional vector (all-MiniLM-L6-v2) and stored it in PostgreSQL on Supabase. The database held two tables:
- Films table — title, film details, image
- FilmInfo table — title, vector-embedded fields, raw data
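One way to sketch the embedding step (the flattening helper is my own invention, and sentence-transformers is imported lazily because the model download is heavy):

```python
def framework_to_text(d, prefix=""):
    """Flatten a structured field dict into 'path: value' lines,
    so one field becomes one string to embed. Empty values are skipped."""
    lines = []
    for key, val in d.items():
        path = f"{prefix}{key}"
        if isinstance(val, dict):
            lines.extend(framework_to_text(val, prefix=path + "."))
        elif isinstance(val, list):
            if val:
                lines.append(f"{path}: {', '.join(map(str, val))}")
        elif val:
            lines.append(f"{path}: {val}")
    return lines

def embed_field(structured_field: dict):
    """Encode a flattened field into a 384-dim vector with all-MiniLM-L6-v2."""
    from sentence_transformers import SentenceTransformer  # lazy: ~80MB model
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode("\n".join(framework_to_text(structured_field)))
```

The resulting vector can then go into a pgvector column, e.g. an `INSERT INTO filminfo (title, plot_vec) VALUES (%s, %s)` through psycopg2 with the pgvector adapter (table and column names here are illustrative).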
Framework creation. From the 95+ unique fields in the scraped data, I identified 12 critical ones (plot, production, reception, cast, and so on) and built a JSON-based "framework" to structure each. Example for plot:
{
  "genre": [],
  "setting": {
    "time": {
      "period": "",
      "duration": ""
    },
    "place": {
      "geographic": "",
      "specific": []
    },
    "context": {
      "historical": "",
      "social": "",
      "cultural": ""
    }
  }
}
These frameworks guided how data was structured for each field in every film.
Search Architecture
The search process was powered by vector embeddings and LLMs to handle complex queries.
Key steps
- User query. The user enters a natural language query. The LLM processes it to identify 1–3 relevant categories — predefined buckets of fields (e.g. "Story" includes Plot, Premise, Synopsis, Summary).
- Query structuring. The query is structured according to the frameworks used for the film data.
- Vector embedding. The structured query is embedded as a vector and compared against the film vectors via similarity search.
- Result aggregation. Films with closest vector matches are retrieved and displayed.
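The comparison step above amounts to a cosine nearest-neighbour ranking. In production that ranking happens inside Postgres via pgvector's cosine-distance operator (roughly `SELECT title FROM filminfo ORDER BY plot_vec <=> %s LIMIT 10`); the in-memory version below shows the same idea with numpy, with the function name and data shapes being my own:

```python
import numpy as np

def top_k_films(query_vec, film_vecs, k=3):
    """Rank films by cosine similarity between their stored field vector
    and the query vector. film_vecs: {title: vector}. Returns (title, score) pairs."""
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    scores = {}
    for title, vec in film_vecs.items():
        v = np.asarray(vec, dtype=float)
        scores[title] = float(q @ (v / np.linalg.norm(v)))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```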
Future enhancements
The architecture was still at an early stage. Planned improvements:
- Refining the query structuring logic
- Optimizing vector similarity search for performance
- Exploring keyword-based search in combination with vector search
Final thoughts
The project was still evolving. The frontend and backend allowed users to input queries and get films back, but there was significant potential for further optimization.
What I learned afterward — and the reason this approach failed — is that vector similarity alone doesn't know where in a document a match came from. A film where "politics" shows up in the plot summary is a totally different signal than one where "politics" shows up in a director biography. Treating all text the same flattened exactly the information I needed to rank well. The fix in v2 is called multi-zone retrieval, which is what Google has been doing for twenty years. I just hadn't read enough IR to know it existed yet.
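The multi-zone idea can be sketched as scoring each document zone separately and combining the scores with per-zone weights, so the ranking knows where a match occurred. The zone names and weights below are purely illustrative:

```python
import numpy as np

# Illustrative weights: a match in the plot should count far more
# than the same match in a crew biography.
ZONE_WEIGHTS = {"plot": 1.0, "reception": 0.5, "crew_bio": 0.1}

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def zone_score(query_vec, film_zones):
    """film_zones: {zone_name: vector}. Weighted sum of per-zone similarities."""
    return sum(
        ZONE_WEIGHTS.get(zone, 0.0) * cosine(query_vec, vec)
        for zone, vec in film_zones.items()
    )
```

With this scoring, a film whose plot vector matches "politics" outranks a film where only a director biography matches, which is exactly the distinction a single flattened vector throws away.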