As someone who spends copious amounts of time scouring Netflix for a worthwhile watch, I wanted to marry up two personal interests: cinema and statistical analysis. This is my first attempt at using K-means clustering to identify films with similar characteristics, based on the 'Rotten Tomatoes movies and critic reviews' Kaggle dataset. My approach is deliberately simplified for the time being, as I pre-selected the attributes I would use if approaching this task manually. That said, I plan to develop the model to incorporate additional features and refreshed data, given the dataset's 2020 cut-off.
The program prompts the user for a film and recommends similar films located within the same cluster. Results within a cluster are sorted by Rotten Tomatoes score rather than Euclidean distance, so relatively high emphasis is placed on film scores.
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
%matplotlib inline
pd.set_option('display.max_columns', None)
# Load the dataset
df = pd.read_csv('rotten_tomatoes_movies.csv')
df.head(3)
| | rotten_tomatoes_link | movie_title | movie_info | critics_consensus | content_rating | genres | directors | authors | actors | original_release_date | streaming_release_date | runtime | production_company | tomatometer_status | tomatometer_rating | tomatometer_count | audience_status | audience_rating | audience_count | tomatometer_top_critics_count | tomatometer_fresh_critics_count | tomatometer_rotten_critics_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | m/0814255 | Percy Jackson & the Olympians: The Lightning T... | Always trouble-prone, the life of teenager Per... | Though it may seem like just another Harry Pot... | PG | Action & Adventure, Comedy, Drama, Science Fic... | Chris Columbus | Craig Titley, Chris Columbus, Rick Riordan | Logan Lerman, Brandon T. Jackson, Alexandra Da... | 2010-02-12 | 2015-11-25 | 119.0 | 20th Century Fox | Rotten | 49.0 | 149.0 | Spilled | 53.0 | 254421.0 | 43 | 73 | 76 |
| 1 | m/0878835 | Please Give | Kate (Catherine Keener) and her husband Alex (... | Nicole Holofcener's newest might seem slight i... | R | Comedy | Nicole Holofcener | Nicole Holofcener | Catherine Keener, Amanda Peet, Oliver Platt, R... | 2010-04-30 | 2012-09-04 | 90.0 | Sony Pictures Classics | Certified-Fresh | 87.0 | 142.0 | Upright | 64.0 | 11574.0 | 44 | 123 | 19 |
| 2 | m/10 | 10 | A successful, middle-aged Hollywood songwriter... | Blake Edwards' bawdy comedy may not score a pe... | R | Comedy, Romance | Blake Edwards | Blake Edwards | Dudley Moore, Bo Derek, Julie Andrews, Robert ... | 1979-10-05 | 2014-07-24 | 122.0 | Waner Bros. | Fresh | 67.0 | 24.0 | Spilled | 53.0 | 14684.0 | 2 | 16 | 8 |
The dataset provides many candidate attributes for modelling; however, one challenge is that several of the most useful ones are not in a directly usable format. For example, the 'genres' attribute is non-numeric and its values are not uniformly defined (e.g. Horror, Comedy and Comedy, Horror all appear as distinct entries, with further combinations on top of these). To handle this, the most common genre strings are identified and mapped to new indicator columns in the original dataframe, flagging instances of each common genre in a film's genre string.
# Set attributes in relevant format for k-means clustering
df['release_year'] = pd.to_datetime(df['original_release_date']).dt.year

def encodeAttributes(col, n):
    # Return the top n most frequent entries in the column
    top_values = df[col].value_counts().index[:n]
    return top_values

def appendAttributes(col, n):
    # Append an indicator column for each of the top n entries to the original dataframe
    for item in encodeAttributes(col, n):
        df[item] = df[col].fillna('').str.lower().apply(lambda x: 1 if item.lower() in x else 0)
    return df
# Select most relevant features - manually selected for time-being
df = appendAttributes('genres', 10)
df = appendAttributes('directors', 10)
selected_attributes = ['tomatometer_rating', 'release_year', 'audience_count'] + list(encodeAttributes('genres', 10)) + list(encodeAttributes('directors', 10))
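Note that this is multi-label indicator encoding rather than strict one-hot encoding: a film whose genre string contains several of the top genres gets a 1 in each matching column. A minimal standalone sketch of the substring matching used above, on toy data rather than the Kaggle file:

```python
import pandas as pd

# toy genre strings, including a multi-genre entry and a missing value
toy = pd.DataFrame({'genres': ['Comedy', 'Comedy, Horror', None]})
for genre in ['comedy', 'horror']:
    toy[genre] = toy['genres'].fillna('').str.lower().apply(
        lambda x, g=genre: 1 if g in x else 0)
print(toy[['comedy', 'horror']].values.tolist())  # [[1, 0], [1, 1], [0, 0]]
```

The second row is flagged under both genres, which is what lets combined entries like "Comedy, Horror" contribute to each of their component genre columns.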
# Impute missing values, then standardise feature values
imputer = SimpleImputer(strategy='mean')
scaler = StandardScaler()
X = imputer.fit_transform(df[selected_attributes])
X = scaler.fit_transform(X)
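The usual convention is to impute before scaling, so fill values are computed on the raw feature scale. scikit-learn's Pipeline chains the two steps and keeps the order explicit; a minimal sketch with toy data (variable names here are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

prep = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),  # fill NaNs with column means
    ('scale', StandardScaler()),                 # then centre and rescale to unit variance
])
toy = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
X_toy = prep.fit_transform(toy)
print(X_toy.mean(axis=0).round(6))  # both columns are centred at 0
```

Chaining the steps this way also makes it easy to apply the identical preprocessing to any new rows via `prep.transform`.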
# Plot heatmap showing correlation matrix of selected attributes
corr = df[selected_attributes].corr()
plt.figure(figsize=(8, 6))
heatmap = plt.imshow(corr, cmap='coolwarm', interpolation='nearest')
plt.title("Attribute Correlation Heatmap")
plt.xticks(np.arange(corr.shape[1]), labels=corr.columns, rotation=90)
plt.yticks(np.arange(corr.shape[1]), labels=corr.columns)
plt.colorbar(heatmap, fraction=0.046, pad=0.04)
plt.tight_layout()
plt.show()
The Elbow Method is used here to help determine the optimal number of clusters for K-means. It allows us to find the "elbow point" on a plot of the within-cluster sum of squares (WCSS) against the number of clusters. The WCSS is a measure of how spread out the data points in a cluster are. As the number of clusters increases, the WCSS tends to decrease because the data points are closer to the cluster centers. However, after a certain point, the reduction in WCSS becomes less significant, and that's where the "elbow" appears on the plot. The number of clusters at the elbow is often a good choice for K in K-means.
# Create an empty list to store the inertia (sum of squared distances to the closest cluster center)
inertia = []
# Define a range of values for k
k_range = range(1, 50)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
# Plot the inertia values
plt.figure(figsize=(8, 6))
plt.plot(k_range, inertia, marker='o', linestyle='-', color='b')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.grid(True)
plt.show()
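The elbow is often ambiguous to read off by eye; the silhouette score offers a complementary criterion, with higher values indicating tighter, better-separated clusters. A hedged sketch on small synthetic blobs (rerunning over the full dataset and k range would be slow):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic data with a known cluster structure
X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=42)
scores = {}
for k in range(2, 7):  # silhouette requires at least 2 clusters
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)
best_k = max(scores, key=scores.get)  # k with the highest average silhouette
print(best_k)
```

In practice the silhouette curve and the elbow plot can be read together: agreement between the two lends more confidence to the chosen k.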
Based on the Elbow Method, k=20 was chosen as the number of clusters for this analysis. Plotting the average Rotten Tomatoes score per cluster shows a range of values for this attribute. That said, some clusters share a similar average, which implies differentiation across other attributes. Exploring this further with a parallel coordinates plot reveals major outliers in the attribute data, with some clusters effectively self-identifying around these extreme points. This is not entirely surprising given attributes such as the named-director flags, each of which applies to only a handful of films. It would also explain why different clusters can share similar average scores: the score similarity may simply be an effect of clustering on other attributes rather than a cause.
# Use K-means clustering to identify similar movies based on selected attributes
kmeans = KMeans(n_clusters=20, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X)
df['Cluster'] = clusters
mean_by_category = df.groupby('Cluster')['tomatometer_rating'].mean()
# Display average RT score for each cluster
plt.scatter(mean_by_category.index, mean_by_category)
plt.xlabel('Cluster ID')
plt.ylabel('RT Score')
plt.title('Average Cluster Scores')
plt.grid(True)
plt.show()
scaled_attributes = pd.DataFrame(X, columns = selected_attributes)
scaled_attributes['Cluster'] = clusters
# Plot parallel coordinates chart to display cluster characteristics
plt.figure(figsize=(8, 6))
pd.plotting.parallel_coordinates(scaled_attributes, 'Cluster', colormap='viridis')
plt.title("Parallel Coordinates Plot for K-means Clustering")
plt.xlabel("Attributes")
plt.ylabel("Values")
plt.xticks(fontsize=8, rotation=90)
legend_labels = [str(i) for i in sorted(scaled_attributes['Cluster'].unique())]
custom_legend = plt.legend(title='Cluster', loc='upper right', labels=legend_labels)
plt.show()
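A quick check on the outlier effect described above is the cluster size distribution: on the fitted model this is simply `df['Cluster'].value_counts()`, and clusters anchored by a rare attribute tend to be tiny. A standalone sketch of the mechanism, using a toy feature matrix where one rare binary flag dominates:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# 200 points: one noisy feature, plus a flag that is large for only 5 points
X_demo = np.column_stack([
    rng.normal(size=200),
    np.r_[np.ones(5) * 10, np.zeros(195)],
])
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_demo)
sizes = pd.Series(labels).value_counts()
print(sizes.min())  # the rare-flag points form a much smaller cluster
```

This mirrors what the director indicator columns do at full scale: a flag held by only a few films pulls those films into their own small cluster regardless of the other attributes.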
def getMovieRec():
    # Prompt user for a film and return a dataframe of films located within the same cluster
    print('-' * 78)
    keyword = input("Enter a movie title for similar recommendations: ")
    search_results = df[df['movie_title'].str.contains(keyword, case=False, na=False)]
    if search_results.empty:
        print("No matching titles found.")
        return None
    print('-' * 78)
    print("Search Results:")
    print(search_results[['movie_title', 'release_year']])
    if len(search_results) > 1:
        try:
            search_id = int(input("Please select the key of the relevant movie: "))
            cluster_id = int(search_results['Cluster'].loc[search_id])
        except (ValueError, KeyError):
            print('Please enter a valid movie key.')
            return None
    else:
        cluster_id = int(search_results['Cluster'].iloc[0])
    filtered_df = df[df['Cluster'] == cluster_id].sort_values(by='tomatometer_rating', ascending=False)
    print('-' * 78)
    print('The following titles are recommended:')
    print(filtered_df[['movie_title', 'tomatometer_rating', 'genres']].head(3))
    print('-' * 78)
    return filtered_df
filtered_df = getMovieRec()
if filtered_df is not None:
    key_prompt = int(input('Please provide key of recommended film to obtain more movie info: '))
    print('-' * 78)
    print(filtered_df.head(3)['movie_info'].loc[key_prompt])
    print('-' * 78)
------------------------------------------------------------------------------
Enter a movie title for similar recommendations: Zombieland
------------------------------------------------------------------------------
Search Results:
movie_title release_year
17699 Zombieland 2009.0
17700 Zombieland: Double Tap 2019.0
Please select the key of the relevant movie: 17699
------------------------------------------------------------------------------
The following titles are recommended:
movie_title tomatometer_rating \
8726 John Mulaney: Kid Gorgeous at Radio City 100.0
7514 Harold's Going Stiff 100.0
14249 Tampopo 100.0
genres
8726 Comedy
7514 Comedy, Horror
14249 Art House & International, Comedy
------------------------------------------------------------------------------
Please provide key of recommended film to obtain more movie info: 7514
------------------------------------------------------------------------------
Harold suffers form a disease that slowly causes him to become a zombie.
------------------------------------------------------------------------------
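As noted in the introduction, recommendations are ordered by Rotten Tomatoes score rather than by proximity to the chosen film. An alternative would rank within-cluster candidates by Euclidean distance to the query film's scaled feature row, so the closest matches come first. A minimal sketch on toy vectors (the variable names are illustrative, not from the code above):

```python
import numpy as np

# toy scaled feature rows: the query film plus three same-cluster candidates
query = np.array([0.5, 1.0, 0.0])
candidates = np.array([[0.4, 1.1, 0.0],
                       [2.0, -1.0, 0.5],
                       [0.0, 0.0, 0.0]])
dists = np.linalg.norm(candidates - query, axis=1)  # Euclidean distance per row
order = np.argsort(dists)                           # nearest-first candidate indices
print(order.tolist())  # [0, 2, 1]
```

On the real data this would mean restricting to the query film's cluster, computing distances over the standardised attribute matrix, and sorting by distance instead of (or in combination with) the tomatometer rating.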
The model is functional in creating clusters of films based on the selected attributes. The granularity of these clusters still lacks depth, however, given the limitations of the input data and modelling. To put this into perspective, it is relatively easy to cluster films released in a certain period with similar overall scores, but that won't help identify commonalities based on more nuanced characteristics. If I am looking to satisfy a particular itch for a feature depicting a zombie apocalypse with romantic undertones, the current set-up unfortunately won't suffice. Further development will be required to use movie descriptions as an input to the model, along with additional feature analysis.
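One possible route for the description-based extension mentioned above is TF-IDF vectorisation of the movie_info text, which turns free-form descriptions into numeric vectors that can feed clustering or a cosine-similarity lookup. A hedged sketch on made-up toy descriptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# toy stand-ins for movie_info entries
docs = [
    "zombie apocalypse with romantic undertones",   # the query description
    "a zombie apocalypse overruns the city",        # thematically close
    "a lighthearted romantic comedy in Paris",      # thematically distant
]
tfidf = TfidfVectorizer(stop_words='english')
vectors = tfidf.fit_transform(docs)
# similarity of the query description to the other two
sims = cosine_similarity(vectors[0], vectors[1:]).ravel()
print(sims[0] > sims[1])  # the zombie film scores closer than the rom-com
```

This would not replace the attribute-based clusters so much as complement them, letting the recommender pick up on plot-level similarities that genre and director flags cannot express.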
Title: Rotten Tomatoes Movies and Critic Reviews Dataset
Contributor: Stefano Leone
Publication Date: 2020-10-31
URL: https://www.kaggle.com/datasets/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset