Movie Recommendation System

A content-based recommendation engine that suggests similar movies based on cosine similarity of movie features.

Python, Pandas, NumPy, Scikit-learn, NLTK, TMDB Dataset

Project Overview

This project implements a content-based movie recommendation system that suggests similar movies based on their features like genres, keywords, cast, crew, and overview.

The system analyzes the TMDB 5000 movies dataset, processes the textual features using NLP techniques, and computes cosine similarity between movies to find the most similar ones. It demonstrates how text vectorization and a similarity measure can drive item-to-item recommendations without any user rating data.

Key Innovation: The system combines multiple features (genres, keywords, cast, crew, overview) into a single "tags" string for each movie and applies stemming so that variants of the same word are counted as one term when the tags are vectorized.

System Architecture

Figure 1: The content-based recommendation system workflow

Technical Highlights

Data Processing

Merged TMDB movies and credits datasets
Extracted relevant features (genres, keywords, etc.)
Handled missing values
Processed JSON-like columns
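
The merge and missing-value steps listed above could look roughly like the sketch below. The two small DataFrames are hypothetical stand-ins for the TMDB movies and credits tables, and joining on the title column is an assumption rather than a detail confirmed by this write-up.

```python
import pandas as pd

# Tiny stand-ins for the two TMDB tables (hypothetical rows, not real data)
movies_raw = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
    "overview": ["A heist in space.", None, "Two friends open a bakery."],
    "genres": ['[{"id": 28, "name": "Action"}]'] * 3,
})
credits_raw = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "cast": ['[{"name": "Actor One"}]'] * 3,
    "crew": ['[{"job": "Director", "name": "Director One"}]'] * 3,
})

# Join the two tables on the shared title column (assumed join key)
merged = movies_raw.merge(credits_raw, on="title")

# Keep only the columns used later, then drop rows with missing values
columns = ["movie_id", "title", "overview", "genres", "cast", "crew"]
merged = merged[columns].dropna().reset_index(drop=True)

print(merged["title"].tolist())
```

Resetting the index after dropping rows keeps row labels and row positions identical, which matters later when a row position is used to look up a line of the similarity matrix.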

Text Processing

Converted JSON to lists of features
Combined multiple features into tags
Applied stemming with Porter Stemmer
Lowercased text and removed spaces inside multi-word names

Vectorization

CountVectorizer with 5000 features
English stop words removal
Bag-of-words representation
Sparse matrix converted to a dense array

Recommendation

Cosine similarity calculation
Top 5 similar movies
Content-based filtering
Fast lookup with movie indices
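
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand with NumPy on toy vectors and then picks the most similar rows. The helper names `cosine_similarity_matrix` and `top_k` are hypothetical and are not part of the project code, which relies on scikit-learn for this step.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid dividing by zero for empty rows
    unit = vectors / norms           # scale every row to length 1
    return unit @ unit.T             # dot products of unit vectors

def top_k(similarity, index, k=5):
    """Positions of the k rows most similar to `index`, excluding itself."""
    order = np.argsort(-similarity[index], kind="stable")
    return [int(i) for i in order if i != index][:k]

# Toy count vectors: rows 0 and 1 share two words, row 2 shares none with row 0
vectors = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

sim = cosine_similarity_matrix(vectors)
print(top_k(sim, 0, k=2))  # [1, 2]
```

Excluding the query row by position, rather than dropping the first sorted entry, is a small design choice of this sketch: it still behaves sensibly if another row happens to be an exact duplicate of the query.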

System Results

Movies Processed

About 4,800

From TMDB dataset

Features Combined

5

Genres, keywords, etc.

Processing Time

<1 min

On standard hardware

Recommendation Quality

The recommendations reflect several kinds of similarity:

Genre-based similarity: movies that share genres
Content-based similarity: movies with similar plot overviews
Cast/crew similarity: movies with the same actors or directors
Keyword matching: movies with similar themes

Recommendation Examples

Input Movie: The Dark Knight

Recommended Movies:

The Dark Knight Rises
Batman Begins
Batman v Superman: Dawn of Justice
Man of Steel
Watchmen

Input Movie: Inception

Recommended Movies:

The Matrix
Shutter Island
Interstellar
The Prestige
Source Code

Input Movie: Toy Story

Recommended Movies:

Toy Story 2
Toy Story 3
Monsters, Inc.
Finding Nemo
Up

Figure 2: Sample recommendations showing the system's capability

Project Links

View Google Colab Notebook
View Visualizations
Download Jupyter File

📊 Project Details

Status: Completed
Dataset: TMDB 5000 Movies
Features: Genres, Keywords, Cast, Crew, Overview
Algorithm: Cosine Similarity

Technology Stack

Core Libraries

Pandas, NumPy, Scikit-learn, NLTK

Data Processing

JSON Parsing, Text Cleaning, Feature Combination

NLP

CountVectorizer, Porter Stemmer, Stop Words

Dataset Information

TMDB 5000 Dataset

About 4,800 movies with detailed metadata
Genres, keywords, cast, crew information
Movie overviews/plot summaries
Ratings and popularity metrics

Features Used

Genres, keywords, overview, cast (top 3), director

JSON-like columns were parsed to extract relevant features

System Visualizations

The data processing pipeline shows how raw movie data is transformed into feature vectors for recommendation.

The cosine similarity matrix shows how movies are related to each other based on their combined features.

The recommendation output shows the top 5 similar movies for a given input movie.

This video demonstrates how the recommendation model works:

Real-time movie recommendation generation
System response to different input movies
Visualization of similarity scoring

Technical Challenges & Solutions

JSON-like Data Parsing

The genres, keywords, cast, and crew columns of the dataset contained JSON-like strings that had to be parsed before use.

Solution: Implemented custom parsing functions using ast.literal_eval to:

Extract genre names from nested dictionaries
Get top 3 cast members
Identify directors from crew
Combine all relevant features

Result: Successfully converted complex JSON structures into usable lists of features.
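
As a small illustration of the parsing described above, the sketch below runs `ast.literal_eval` on two sample strings shaped like the dataset's columns. The sample values are invented placeholders.

```python
import ast

# Invented strings in the same shape as the genres and crew columns
sample_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
sample_crew = (
    '[{"job": "Producer", "name": "Person A"}, '
    '{"job": "Director", "name": "Person B"}]'
)

# literal_eval turns the string into a real list of dictionaries
genres = [item["name"] for item in ast.literal_eval(sample_genres)]

# Keep only crew members whose job is Director
directors = [
    item["name"]
    for item in ast.literal_eval(sample_crew)
    if item["job"] == "Director"
]

print(genres)     # ['Action', 'Adventure']
print(directors)  # ['Person B']
```

`ast.literal_eval` only evaluates Python literals (lists, dictionaries, strings, numbers), so it does not run arbitrary code the way `eval` would.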

Text Preprocessing

Text features needed normalization to improve recommendation quality.

Solution: Implemented the following steps:

Space removal between names (Tom Cruise → TomCruise)
Lowercasing all text
Porter stemming to reduce words to root forms
Combination of all features into a single "tags" column

Result: Cleaned and standardized text improved similarity calculations.
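
The normalization steps above can be tried on a few tokens as follows. This is a sketch only: `normalize` is a hypothetical helper name, and the input tokens are invented.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def normalize(tokens):
    """Fuse multi-word names, lowercase, stem, and join into one string."""
    fused = [token.replace(" ", "").lower() for token in tokens]
    return " ".join(ps.stem(token) for token in fused)

# "loved" and "loving" end up as the same token after stemming
print(normalize(["loved", "loving"]))

# A two-word name becomes a single token, so it cannot match on first name alone
print(normalize(["Tom Cruise"]))
```

Because the stemmer runs on every token, it may also trim the end of a fused name. That is harmless here as long as the same transformation is applied to every movie, since matching only needs tokens to be consistent.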

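Similarity Calculation

Pairwise similarity had to be computed across the whole dataset (roughly 4,800 movies), with each movie represented by a high-dimensional vector.

Solution: Applied the following: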

CountVectorizer with max_features=5000 to limit dimensionality
Cosine similarity for measuring similarity between vectors
Efficient matrix operations with NumPy
Pre-computed similarity matrix for fast recommendations

Result: Once the similarity matrix is pre-computed, each recommendation is a fast row lookup and sort, even across the full dataset.
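
One optional variation, not something the project itself does: scikit-learn's `cosine_similarity` also accepts the sparse matrix that `CountVectorizer` returns, so the dense conversion can be skipped when memory is tight. The sketch below checks on invented tag strings that both routes give the same result.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented tag strings for illustration
tags = [
    "space crew explore distant planet",
    "crew stranded on distant planet",
    "detective hunts killer in rainy city",
]

cv = CountVectorizer(max_features=5000, stop_words="english")
sparse_vectors = cv.fit_transform(tags)          # stays sparse

# Similarity from the sparse matrix versus from a dense copy
sim_from_sparse = cosine_similarity(sparse_vectors)
sim_from_dense = cosine_similarity(sparse_vectors.toarray())

print(np.allclose(sim_from_sparse, sim_from_dense))  # True
```

The resulting similarity matrix is dense either way, so the saving applies to the count vectors rather than to the final matrix.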

Key Code Implementation

Data Parsing and Cleaning

import ast
import pandas as pd
from nltk.stem import PorterStemmer

# `movies` is assumed to be the merged movies + credits DataFrame,
# with rows containing missing values already dropped.

# Convert a JSON-like string into a list of the 'name' values
def convert_to_list(objects):
    return [item['name'] for item in ast.literal_eval(objects)]

# Keep only the first 3 cast members
def convert3(objects):
    return [item['name'] for item in ast.literal_eval(objects)[:3]]

# Return the first crew member whose job is 'Director' (as a one-item list)
def convert_crew(objects):
    for item in ast.literal_eval(objects):
        if item['job'] == 'Director':
            return [item['name']]
    return []

# Apply parsing functions
movies['genres'] = movies['genres'].apply(convert_to_list)
movies['keywords'] = movies['keywords'].apply(convert_to_list)
movies['cast'] = movies['cast'].apply(convert3)
movies['crew'] = movies['crew'].apply(convert_crew)

Text Processing and Feature Combination

# Split the overview text into a list of words
movies['overview'] = movies['overview'].apply(lambda x: x.split())

# Remove spaces inside multi-word names so each name becomes one token
# (overview words are already split, so they need no space removal)
for col in ['genres', 'keywords', 'cast', 'crew']:
    movies[col] = movies[col].apply(lambda x: [i.replace(" ", "") for i in x])

# Combine all features into tags
movies['tags'] = movies['genres'] + movies['overview'] + movies['keywords'] + movies['cast'] + movies['crew']

# Convert list to a single lowercase string
movies['tags'] = movies['tags'].apply(lambda x: " ".join(x).lower())

# Apply stemming to every word in the tags string
ps = PorterStemmer()
def stem(text):
    return " ".join(ps.stem(word) for word in text.split())

movies['tags'] = movies['tags'].apply(stem)

Vectorization and Recommendation

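from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Create count vectors (bag of words, English stop words removed)
cv = CountVectorizer(max_features=5000, stop_words='english')
vectors = cv.fit_transform(movies['tags']).toarray()

# Calculate pairwise cosine similarity between all movies
similarity = cosine_similarity(vectors)

# Make index labels equal to row positions so they line up with the matrix
movies = movies.reset_index(drop=True)

# Recommendation function
def recommend(movie, top_n=5):
    matches = movies.index[movies['title'] == movie]
    if len(matches) == 0:
        print(f"Movie not found: {movie}")
        return
    movie_index = matches[0]
    distances = similarity[movie_index]
    # Sort by similarity, highest first; the first entry is normally the movie itself
    movies_list = sorted(enumerate(distances), reverse=True, key=lambda x: x[1])[1:top_n + 1]

    for i, _ in movies_list:
        print(movies.iloc[i]['title'])

# Example usage
recommend('The Dark Knight')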
