How to use sklearn's Matrix factorization to predict new users' recommendation scores

I'm trying to apply sklearn.decomposition.NMF to a matrix R that contains data on how users rated items, in order to predict user ratings for items that they have not yet seen.

The matrix's rows are users, its columns are items, and its values are scores, with a score of 0 meaning that the user has not rated this item yet.

With the code below I have only managed to get two matrices that, when multiplied together, give the original matrix back.

import numpy

R = numpy.array([
     [5,3,0,1],
     [4,0,0,1],
     [1,1,0,5],
     [1,0,0,4],
     [0,1,5,4],
    ])

from sklearn.decomposition import NMF
model = NMF(n_components=4)

A = model.fit_transform(R)
B = model.components_

n = numpy.dot(A, B)
print(n)

The problem is that the model does not predict new values in place of the 0's (which would be the predicted scores), but instead recreates the matrix as it was.

How do I get the model to predict user scores in place of my original matrix's zeros?

That is what is supposed to happen.

However, in most cases the number of components will not be so close to the number of products and/or customers.
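As a rough check of this point, the sketch below refits NMF on the R from the question with 1 to 4 components and prints the reconstruction error for each. The `errors` dict and the `init`, `random_state` and `max_iter` settings are my own choices for illustration, not part of the original code.

```python
import numpy as np
from sklearn.decomposition import NMF

R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

errors = {}
for k in (1, 2, 3, 4):
    # init/random_state/max_iter are illustrative settings (assumption)
    model = NMF(n_components=k, init="random", random_state=0, max_iter=2000)
    A = model.fit_transform(R)
    B = model.components_
    # Frobenius norm of the residual: how far A @ B is from R
    errors[k] = np.linalg.norm(R - A @ B)
    print(k, round(errors[k], 3))
```

With as many components as items, the factorization has enough freedom to reproduce R, zeros included, which is why nothing new appears in place of the zeros.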

So, for instance, with 2 components:

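import numpy as np
from sklearn.decomposition import NMF

# R is the ratings matrix defined in the question
model = NMF(n_components=2)
A = model.fit_transform(R)
B = model.components_
R_estimated = np.dot(A, B)
print(np.sum(R-R_estimated))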
-1.678873127048393
R_estimated
array([[5.2558264 , 1.99313836, 0.        , 1.45512772],
       [3.50429478, 1.32891458, 0.        , 0.9701988 ],
       [1.31294288, 0.94415991, 1.94956896, 3.94609389],
       [0.98129195, 0.72179987, 1.52759811, 3.0788454 ],
       [0.        , 0.65008935, 2.84003662, 5.21894555]])

You can see that in this case many of the previous zeros are now nonzero values that you could use as predicted scores. For a bit of context, see https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems).
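To read the predictions off in practice, one option is to keep the known ratings and take the estimate only where R was 0. This is a minimal sketch; `unrated` and `R_filled` are names I made up, and the `init`, `random_state` and `max_iter` settings are assumptions. Keep in mind that the fit treats the zeros as real ratings of 0, so the estimates at those positions are pulled toward 0.

```python
import numpy as np
from sklearn.decomposition import NMF

R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

model = NMF(n_components=2, init="random", random_state=0, max_iter=2000)
A = model.fit_transform(R)
B = model.components_
R_estimated = A @ B

# Positions the users have not rated yet
unrated = (R == 0)

# Keep the original ratings where they exist, use the estimate elsewhere
R_filled = np.where(unrated, R_estimated, R)

for user, item in zip(*np.nonzero(unrated)):
    print(f"user {user}, item {item}: predicted score {R_estimated[user, item]:.2f}")
```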

How to select n_components?

I think the question above is answered, but for completeness the full procedure could look something like the following.

For that we need to know which values in R are real ratings, since those are the ones we want to focus on predicting.

In many cases the zeros in R are exactly those new cases or scenarios. It is common to replace them in R with the average rating per product or per customer, and then compute the decomposition in order to select the ideal n_components. For the selection, one or more criteria can be used to measure the benefit on a test sample.

  1. Create R_with_Averages.
  2. Model selection:
     2.1) Split R_with_Averages into test and training sets.
     2.2) Compare different values of n_components (from 1 up to an arbitrary number) using a metric that only considers the real ratings in R.
     2.3) Select the best model, which gives the best n_components.
  3. Predict with the best model.
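The steps above could be sketched roughly as follows. This is only one possible reading of the procedure: the helper `fill_with_item_averages`, the choice of item (column) averages, the 3 held-out ratings and RMSE as the metric are all assumptions of mine.

```python
import numpy as np
from sklearn.decomposition import NMF

R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

def fill_with_item_averages(R, mask):
    """Hypothetical helper: replace entries outside `mask` with the column
    mean of the entries inside `mask` (global mean if a column has none)."""
    global_mean = R[mask].mean()
    filled = R.copy()
    for j in range(R.shape[1]):
        col = mask[:, j]
        filled[~col, j] = R[col, j].mean() if col.any() else global_mean
    return filled

# 2.1) Hold out a few real ratings as a test sample
rng = np.random.default_rng(0)
known = np.argwhere(R > 0)        # (row, col) pairs of real ratings
rng.shuffle(known)                # shuffles the pairs in place
test_idx, train_idx = known[:3], known[3:]
train_mask = np.zeros(R.shape, dtype=bool)
train_mask[train_idx[:, 0], train_idx[:, 1]] = True

# 1) Build R_with_Averages from the training ratings only
R_with_averages = fill_with_item_averages(R, train_mask)

# 2.2) Compare n_components using RMSE on the held-out real ratings
scores = {}
for k in (1, 2, 3):
    model = NMF(n_components=k, init="random", random_state=0, max_iter=2000)
    R_hat = model.fit_transform(R_with_averages) @ model.components_
    diff = R[test_idx[:, 0], test_idx[:, 1]] - R_hat[test_idx[:, 0], test_idx[:, 1]]
    scores[k] = np.sqrt(np.mean(diff ** 2))

# 2.3) Select the best n_components
best_k = min(scores, key=scores.get)

# 3) Refit on all real ratings and predict
best = NMF(n_components=best_k, init="random", random_state=0, max_iter=2000)
R_pred = best.fit_transform(fill_with_item_averages(R, R > 0)) @ best.components_
print(scores, best_k)
print(np.round(R_pred, 2))
```

On a matrix this small the scores are very noisy, so a single split says little; with real data you would repeat the split several times or use proper cross-validation.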


sklearn's implementation of NMF does not seem to support missing values (NaNs; here the 0 values basically represent unknown ratings corresponding to new users), see the related scikit-learn issue. However, we can use surprise's NMF implementation, as shown in the following code:

import numpy as np
import pandas as pd
from surprise import NMF, Dataset, Reader

R = np.array([
     [5,3,0,1],
     [4,0,0,1],
     [1,1,0,5],
     [1,0,0,4],
     [0,1,5,4],
    ], dtype=float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent

R[R==0] = np.nan
print(R)

# [[ 5.  3. nan  1.]
#  [ 4. nan nan  1.]
#  [ 1.  1. nan  5.]
#  [ 1. nan nan  4.]
#  [nan  1.  5.  4.]]

df = pd.DataFrame(data=R, index=range(R.shape[0]), columns=range(R.shape[1]))
df = pd.melt(df.reset_index(), id_vars='index', var_name='items', value_name='ratings').dropna(axis=0)
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(df[['index', 'items', 'ratings']], reader)

k = 2
algo = NMF(n_factors=k) 
trainset = data.build_full_trainset() 
algo.fit(trainset)
predictions = algo.test(trainset.build_testset()) # predict the known ratings
R_hat = np.zeros_like(R)
for uid, iid, true_r, est, _ in predictions:
    R_hat[uid, iid] = est
predictions = algo.test(trainset.build_anti_testset()) # predict the unknown ratings
for uid, iid, true_r, est, _ in predictions:
    R_hat[uid, iid] = est
print(R_hat)

# [[4.40762528 2.62138084 3.48176319 0.91649316]
# [3.52973408 2.10913555 2.95701406 0.89922637]
# [0.94977826 0.81254138 4.98449755 4.34497549]
# [0.89442186 0.73041578 4.09958967 3.50951819]
# [1.33811051 0.99007556 4.37795636 3.53113236]]

The NMF implementation follows the [NMF:2014] paper, as described in the surprise documentation.
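For reference, this is my transcription of the prediction and update rules as I remember them from the surprise documentation for its NMF algorithm, so treat the notation as an approximation. Here $p_u$ and $q_i$ are the user and item factor vectors, $I_u$ is the set of items rated by user $u$, $U_i$ is the set of users who rated item $i$, and $\lambda_u$, $\lambda_i$ are regularization parameters:

```latex
\hat{r}_{ui} = q_i^{T} p_u

p_{uf} \leftarrow p_{uf} \cdot
  \frac{\sum_{i \in I_u} q_{if} \, r_{ui}}
       {\sum_{i \in I_u} q_{if} \, \hat{r}_{ui} + \lambda_u \, |I_u| \, p_{uf}}

q_{if} \leftarrow q_{if} \cdot
  \frac{\sum_{u \in U_i} p_{uf} \, r_{ui}}
       {\sum_{u \in U_i} p_{uf} \, \hat{r}_{ui} + \lambda_i \, |U_i| \, q_{if}}
```

The sums run only over observed (user, item) pairs, and the multiplicative form keeps the factors non-negative as long as they start out positive.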

Note that the optimization here is performed using the known ratings only, so the predicted values for the known ratings are close to the true ratings, while the predicted values for the unknown ratings are in general not close to 0, as expected.
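To make the "known ratings only" idea concrete without any extra library, here is a minimal numpy sketch of a masked NMF with multiplicative updates. It is not surprise's implementation (no regularization, batch updates instead of SGD), and `masked_nmf`, `mask`, `n_iter` and `eps` are hypothetical names and settings of mine.

```python
import numpy as np

R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

mask = (R > 0).astype(float)   # 1 where a rating is known, 0 otherwise

def masked_nmf(R, mask, k, n_iter=500, eps=1e-9, seed=0):
    """Factor R ~ P @ Q using only the entries where mask == 1."""
    rng = np.random.default_rng(seed)
    P = rng.random((R.shape[0], k)) + 0.1   # user factors, start positive
    Q = rng.random((k, R.shape[1])) + 0.1   # item factors, start positive
    for _ in range(n_iter):
        # Multiplicative updates: unknown entries are masked out of both
        # numerator and denominator, so they do not influence the fit
        P *= ((mask * R) @ Q.T) / ((mask * (P @ Q)) @ Q.T + eps)
        Q *= (P.T @ (mask * R)) / (P.T @ (mask * (P @ Q)) + eps)
    return P, Q

P, Q = masked_nmf(R, mask, k=2)
R_hat = P @ Q

# Error measured on the known ratings only
rmse_known = np.sqrt(((mask * (R - R_hat)) ** 2).sum() / mask.sum())
print(np.round(R_hat, 2))
print(rmse_known)
```

Because the zeros never enter the objective, the entries of `R_hat` at the unknown positions are free to take whatever value the factors imply, which is exactly what we want from a prediction.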

Again, as usual, we can find the number of factors k using cross-validation.
