How can I suggest movies based on someone's previously watched movies?
For a machine learning exercise I am working on, I am given a dataset where each row contains the following features:
My task is to suggest other movies that the person might like based on these features.
The thing is, I am not given a feature set for movies. I am only given the dataset described above.
I already know I need to generate a feature set for movies. However, I don't know how to approach this.
After I create the feature set, I will convert each movie's feature set into an embedding (vector). Then I will use a similarity-matching library (such as Spotify's Annoy) to find and return the embeddings of similar movies.
The part I am stuck on is how to use the dataset to generate a feature set for each movie.
Imagine that you have a table like this:
+-------+-----+--------+---------------------+
| Name  | Age | Gender | Movie               |
+-------+-----+--------+---------------------+
| John  | 23  | Male   | John the Ripper     |
| Luke  | 18  | Male   | The Star Wars       |
| Ann   | 18  | Female | Mr. Nobody          |
| Alice | 12  | Female | Alice in Wonderland |
| Bruce | 64  | Male   | Armageddon          |
+-------+-----+--------+---------------------+
I. First of all, you need to separate this table into two parts: the features (Name, Age, Gender) and the purpose, that is, the value you want to predict (Movie).
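As a rough sketch of step I, the table can be held as a list of rows and sliced by column. The assumption here, taken from the feature_* and purpose_* lists shown later in this answer, is that the first three columns are the features and the last column is the purpose; the variable names are only illustrative.

```python
# Rows of the example table: (Name, Age, Gender, Movie).
rows = [
    ("John", 23, "Male", "John the Ripper"),
    ("Luke", 18, "Male", "The Star Wars"),
    ("Ann", 18, "Female", "Mr. Nobody"),
    ("Alice", 12, "Female", "Alice in Wonderland"),
    ("Bruce", 64, "Male", "Armageddon"),
]

# Assumption: the first three columns are features, the last is the purpose.
features = [list(row[:3]) for row in rows]
purpose = [row[3] for row in rows]
```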
II. After that, you can encode your strings as numbers. For example:
+------+-----+--------+-------+
| Name | Age | Gender | Movie |
+------+-----+--------+-------+
| 0    | 23  | 1      | 3     |
| 1    | 18  | 1      | 2     |
| 2    | 18  | 0      | 4     |
| 3    | 12  | 0      | 1     |
| 4    | 64  | 1      | 0     |
+------+-----+--------+-------+
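One simple way to do this encoding in plain Python is sketched below. The helper encode_column is hypothetical, and it assigns integers in order of first appearance, so the concrete codes differ from the table above (for instance, Male becomes 0 here). The exact integers are arbitrary as long as each distinct string always maps to the same number.

```python
def encode_column(values):
    """Map each distinct value to an integer, in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)  # next unused integer
        encoded.append(codes[value])
    return encoded, codes


genders = ["Male", "Male", "Female", "Female", "Male"]
encoded_genders, gender_codes = encode_column(genders)
```

Keeping the returned codes dictionary lets you translate a predicted number back into the original string later.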
III. Then you can split your data into two parts: a training set and a test set.
The proportion between these two sets can vary, but the training set is usually larger than the test set.
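A minimal sketch of such a split follows. The helper split_train_test and the 60/40 ratio are assumptions chosen so that the five example rows end up as three training rows and two test rows. The rows are kept in their original order; with real data you would normally shuffle them first.

```python
def split_train_test(samples, train_fraction=0.6):
    """Return (train, test), keeping the original row order."""
    split_index = round(len(samples) * train_fraction)
    return samples[:split_index], samples[split_index:]


# Encoded rows from the table above: [Name, Age, Gender, Movie].
encoded_rows = [
    [0, 23, 1, 3],
    [1, 18, 1, 2],
    [2, 18, 0, 4],
    [3, 12, 0, 1],
    [4, 64, 1, 0],
]
train_rows, test_rows = split_train_test(encoded_rows)
```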
IV. Sometimes you may also need to scale your data. For example:
+------+--------+--------+-------+
| Name | Age    | Gender | Movie |
+------+--------+--------+-------+
| 0.0  | 0.3594 | 1      | 0.6   |
| 0.2  | 0.2813 | 1      | 0.4   |
| 0.4  | 0.2813 | 0      | 0.8   |
| 0.6  | 0.1875 | 0      | 0.2   |
| 0.8  | 1.0000 | 1      | 0.0   |
+------+--------+--------+-------+
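The numbers in this table appear to come from dividing Age by the largest age (64) and dividing the Name and Movie codes by the number of rows (5). That reading is an inference from the values, not something the answer states. Under that assumption, a small sketch with a hypothetical helper scale looks like this:

```python
def scale(values, divisor):
    """Divide every value by the same divisor."""
    return [value / divisor for value in values]


ages = [23, 18, 18, 12, 64]
movie_codes = [3, 2, 4, 1, 0]

# Age is divided by the largest age, so the oldest person maps to 1.0.
scaled_ages = scale(ages, max(ages))
# Codes are divided by the number of rows, so they land in [0, 1).
scaled_movies = scale(movie_codes, len(movie_codes))
```

Scaling keeps a column with large values, such as Age, from dominating any distance computed between rows.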
In this example, after steps I-IV you will get:
feature_train = [[ 0.0, 0.3594, 1 ], [ 0.2, 0.2813, 1 ], [ 0.4, 0.2813, 0 ]]
purpose_train = [ 0.6, 0.4, 0.8 ]
feature_test = [[ 0.6, 0.1875, 0], [0.8, 1.0000, 1]]
purpose_test = [ 0.2, 0.0 ]
That is all it takes to prepare the data in a simple way.
[UPD]
After all these steps, you should train your algorithm on the data, and then you can predict the favorite movie of a chosen person from their Name, Age and Gender.
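The answer does not name a specific algorithm, so the sketch below swaps in a 1-nearest-neighbor rule purely as an illustration: it predicts the purpose value of the closest training row. The function predict_nearest is hypothetical, and the data is copied from the lists above.

```python
feature_train = [[0.0, 0.3594, 1], [0.2, 0.2813, 1], [0.4, 0.2813, 0]]
purpose_train = [0.6, 0.4, 0.8]
feature_test = [[0.6, 0.1875, 0], [0.8, 1.0000, 1]]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_nearest(sample, features, purposes):
    """Return the purpose value of the training row closest to sample."""
    distances = [squared_distance(sample, row) for row in features]
    nearest_index = distances.index(min(distances))
    return purposes[nearest_index]


predictions = [
    predict_nearest(sample, feature_train, purpose_train)
    for sample in feature_test
]
```

A model trained this way can only return movie codes that it saw during training, so with only three training rows the predictions are a demonstration of the mechanics rather than a useful recommendation.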