
Clustering - how to recommend movies based on a selected movie?

As my question states, I am working with clustering algorithms. I have clustered movies from IMDb into 15 clusters, and each cluster corresponds to a genre combination. Now I am stuck on the recommendation step: how do I recommend a movie? I tried making up features for a fake movie and running that data through the k-means algorithm, but of course this doesn't work, as it has no data to compare against.

What I want is to select a movie, or create some data, and then get a top ~20 list of movies from the cluster that the selected movie is in.

Currently I am doing it in a very cheap way, by preselecting a cluster for the result:

cluster_test = prediction_result[prediction_result['cluster'] == 5].sort_values(by=['averageRating', 'numVotes'], ascending=False)
cluster_test.head(15)

This shows the top movies from cluster 5 in this instance.
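The hard-coded cluster number could be replaced by looking up the cluster of the selected movie, and made-up movie data could be assigned to an existing cluster with predict() on the already fitted model instead of being clustered on its own. The following is only a minimal sketch under assumptions: the toy DataFrame `movies` stands in for `prediction_result`, the one-hot genre columns and all values are made up, and the helper names `recommend_from_cluster` and `recommend_for_new_movie` are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for the clustered data; titles and values are made up.
movies = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Action': [1, 1, 1, 0, 0, 0],
    'Comedy': [0, 0, 0, 1, 1, 1],
    'averageRating': [7.1, 8.3, 6.0, 7.9, 5.5, 8.8],
    'numVotes': [1200, 5400, 300, 2500, 150, 9100],
})
feature_cols = ['Action', 'Comedy']  # assumed feature columns

# Fit once on the real movies and store each movie's cluster label.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
movies['cluster'] = kmeans.fit_predict(movies[feature_cols])

def recommend_from_cluster(title, df=movies, top_n=20):
    """Top-rated movies in the same cluster as `title`, excluding `title` itself."""
    cluster_id = df.loc[df['title'] == title, 'cluster'].iloc[0]
    same_cluster = df[(df['cluster'] == cluster_id) & (df['title'] != title)]
    ranked = same_cluster.sort_values(by=['averageRating', 'numVotes'], ascending=False)
    return ranked.head(top_n)

def recommend_for_new_movie(features, model=kmeans, df=movies, top_n=20):
    """Assign made-up feature values to an existing cluster, without refitting."""
    new_row = pd.DataFrame([features], columns=feature_cols)
    cluster_id = model.predict(new_row)[0]
    ranked = df[df['cluster'] == cluster_id].sort_values(
        by=['averageRating', 'numVotes'], ascending=False)
    return ranked.head(top_n)

print(recommend_from_cluster('A')['title'].tolist())
print(recommend_for_new_movie({'Action': 0, 'Comedy': 1})['title'].tolist())
```

The made-up movie must use the same feature columns, in the same order and with the same preprocessing, as the data the model was fitted on.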

Here's one way to do it.

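import pandas as pd

# Load the MovieLens movies file (columns: movieId, title, genres)
movie_titles = pd.read_csv('C:\\movies.csv')
# Keep the first 10,000 rows so the 10,000 x 10,000 similarity matrix stays manageable
movie_titles = movie_titles.head(10000).copy()
print(movie_titles.shape)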


#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
movie_titles['genres'] = movie_titles['genres'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(movie_titles['genres'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape


# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

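# Compute the cosine similarity matrix. TfidfVectorizer L2-normalizes each row by
# default, so the plain dot product from linear_kernel equals the cosine similarity.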
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)


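#Construct a reverse map from movie titles to row indices
indices = pd.Series(movie_titles.index, index=movie_titles['title'])
#Keep only the first row for each title, so that indices[title] is always a single index
indices = indices[~indices.index.duplicated(keep='first')]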



# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Drop the movie itself, then keep the 10 most similar movies
    # (slicing [1:11] assumes the movie always sorts first, which ties can break)
    sim_scores = [s for s in sim_scores if s[0] != idx][:10]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return movie_titles['title'].iloc[movie_indices]


get_recommendations('Toy Story (1995)')

Result: [image]

# data:
# https://grouplens.org/datasets/movielens/

This example implements cosine similarity over TF-IDF vectors of the genres. You can use KNN (k-nearest neighbors) to do something very similar, and there are many other ways to solve this kind of problem.
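As a rough illustration of the KNN route, here is a minimal sketch using scikit-learn's NearestNeighbors with cosine distance on the same kind of TF-IDF genre vectors. The tiny catalogue, the names `toy_movies`, `tfidf_toy` and `knn`, and the helper `knn_recommendations` are all made up for this example; it is not the only way to set this up.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Tiny made-up catalogue in the MovieLens 'genres' format (pipe-separated).
toy_movies = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'genres': ['Adventure|Animation|Children', 'Animation|Children|Comedy',
               'Horror|Thriller', 'Crime|Thriller'],
})

tfidf_toy = TfidfVectorizer().fit_transform(toy_movies['genres'])

# Brute-force neighbour search using cosine distance (1 - cosine similarity).
knn = NearestNeighbors(metric='cosine', algorithm='brute').fit(tfidf_toy)

def knn_recommendations(title, n=2):
    """Return the titles of the n nearest neighbours of `title`."""
    idx = toy_movies.index[toy_movies['title'] == title][0]
    # Ask for one extra neighbour, since the query movie is normally returned too.
    n_query = min(n + 1, len(toy_movies))
    _, neighbours = knn.kneighbors(tfidf_toy[idx], n_neighbors=n_query)
    hits = [i for i in neighbours[0] if i != idx][:n]
    return toy_movies['title'].iloc[hits]

print(knn_recommendations('Movie A', n=1).tolist())
```

Unlike the full similarity matrix, this only computes distances for the movie being queried, which may matter once the catalogue gets large.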
