
Clustering - how to recommend a movie based on a selected movie?

As my question states, I am working with clustering algorithms. I have been clustering movies from IMDB; I have 15 clusters, and each cluster contains a genre combination. Now I am stuck on the recommendation step: how can I recommend a movie? I have tried making fake movie data by inventing the features of a movie and then feeding that data into the k-means algorithm, but of course this doesn't work, since there is no existing data to compare it against.

What I want is to be able to select a movie (or create some data) and then get a top ~20 list of movies from the cluster that the selected movie is in.

Currently I am doing this in a very crude way, by just preselecting a cluster for the result:

cluster_test = prediction_result[prediction_result['cluster'] == 5].sort_values(by=['averageRating', 'numVotes'], ascending=False)
cluster_test.head(15)

This shows the top movies from cluster 5 in this instance. A slightly less hard-coded version is sketched below.
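
One small improvement over hard-coding the cluster number is to look up the cluster of the selected movie first and then rank within that cluster. This is only a sketch, and it assumes prediction_result also has a 'title' column (adjust the column names to whatever your dataframe actually uses):

def top_movies_in_same_cluster(title, n=20):
    # Find which cluster the selected movie was assigned to
    cluster_id = prediction_result.loc[prediction_result['title'] == title, 'cluster'].iloc[0]

    # Rank the other movies in that cluster by rating and vote count
    same_cluster = prediction_result[(prediction_result['cluster'] == cluster_id)
                                     & (prediction_result['title'] != title)]
    return same_cluster.sort_values(by=['averageRating', 'numVotes'], ascending=False).head(n)

top_movies_in_same_cluster('Some Movie Title')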

Here's one way to do it.

# Check out all the movies and their respective IDs
movie_titles = pd.read_csv('C:\\movies.csv')
movie_titles = movie_titles.head(10000)
print(movie_titles.shape)


#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
movie_titles['genres'] = movie_titles['genres'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(movie_titles['genres'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape


# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)


#Construct a reverse map of indices and movie titles
indices = pd.Series(movie_titles.index, index=movie_titles['title']).drop_duplicates()



# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return movie_titles['title'].iloc[movie_indices]


get_recommendations('Toy Story (1995)')

Result:

[image: output of get_recommendations('Toy Story (1995)') - a list of the 10 most similar titles]

# data:
# https://grouplens.org/datasets/movielens/
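
If you also want to keep the k-means clusters from the question in the picture, one option is to filter the similarity ranking down to the query movie's cluster. This is only a sketch; it assumes you have merged your cluster labels into movie_titles as a hypothetical 'cluster' column:

def get_recommendations_in_cluster(title, cosine_sim=cosine_sim, n=20):
    idx = indices[title]
    query_cluster = movie_titles['cluster'].iloc[idx]

    # Rank every movie by similarity to the query movie
    sim_scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)

    # Keep only movies from the same cluster, skipping the query itself
    movie_indices = [i for i, _ in sim_scores
                     if i != idx and movie_titles['cluster'].iloc[i] == query_cluster]
    return movie_titles['title'].iloc[movie_indices[:n]]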

This example implements cosine similarity on the TF-IDF genre vectors. You can use KNN to do something very similar, and I'm sure there are many other ways to solve this kind of problem.
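
For instance, a rough KNN-style sketch (not from the original answer) could use scikit-learn's NearestNeighbors on the same TF-IDF matrix with a cosine distance metric:

from sklearn.neighbors import NearestNeighbors

# Fit a nearest-neighbour index on the TF-IDF genre vectors;
# metric='cosine' keeps it equivalent in spirit to the similarity above.
nn = NearestNeighbors(n_neighbors=11, metric='cosine')
nn.fit(tfidf_matrix)

def get_recommendations_knn(title):
    idx = indices[title]
    # Query with the selected movie's own TF-IDF row; the closest hit is the movie itself
    distances, neighbor_idx = nn.kneighbors(tfidf_matrix[idx])
    return movie_titles['title'].iloc[neighbor_idx[0][1:]]

get_recommendations_knn('Toy Story (1995)')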
