
Spark MLlib - How to validate implicit feedback collaborative filter

I'm programming in Scala, but the language doesn't matter here.

The inputs to the implicit feedback collaborative filter (ALS.trainImplicit) are, in this case, view counts of products:

  • Rating("user1", "product1", 21.0) // Means that user1 has viewed the details of product1 21 times
  • Rating("user2", "product1", 4.0)
  • Rating("user3", "product2", 7.0)

But the output (MatrixFactorizationModel.recommendProductsForUsers) looks like:

  • Rating("user1", "product1", 0.78)
  • Rating("user2", "product1", 0.63)

The values 0.78 and 0.63 in the output look like something normalized between 0 and 1, but the values in the input were 21, 4, 7, etc.

I don't think it makes sense in this case to calculate the MSE (mean squared error) between the input and the output, as we can when using collaborative filters with explicit feedback.

So, the question is: how do you validate a collaborative filter when using implicit feedback?

Important KPIs for implicit feedback validation include, for example, accuracy and coverage, among many others. It really depends on the use case (How many products do you want to show? How many products do you have to offer?) and the goal you want to achieve.

When I build an implicit feedback ALS model I always calculate these two KPIs. Models with very good accuracy tend to cover a smaller share of the available products. Always calculate the coverage and decide from there.
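Since, as the question notes, the language doesn't matter here, the two KPIs can be sketched in plain Python. This is a minimal illustration under my own assumptions about the data layout (per-user ranked recommendation lists plus a held-out set of items each user actually viewed); the function names are my own, not part of any Spark API:

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of a user's top-k recommendations that appear
    in that user's held-out (actually viewed) items."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    return sum(1 for item in top_k if item in relevant) / len(top_k)

def coverage(recommendations, catalog, k):
    """Fraction of the product catalog that appears in at
    least one user's top-k recommendation list."""
    shown = {item for recs in recommendations.values() for item in recs[:k]}
    return len(shown) / len(catalog)

# user1 viewed product1 in the held-out period, and the model's
# top-2 list for user1 contains it, so precision@2 is 0.5.
p = precision_at_k(["product1", "product2"], {"product1"}, 2)

# Only 2 of the 3 catalog products ever appear in a top-2 list,
# so coverage is 2/3.
c = coverage({"user1": ["product1", "product2"],
              "user2": ["product1"]},
             {"product1", "product2", "product3"}, 2)
```

A model tuned only for high precision will often recommend the same few popular products to everyone, which is exactly the accuracy-versus-coverage trade-off described above.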

Take a closer look at this post: https://stats.stackexchange.com/questions/226825/what-metric-should-i-use-for-assessing-implicit-matrix-factorization-recommender

and this Spark library: https://github.com/jongwook/spark-ranking-metrics
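Another option worth mentioning: the original implicit feedback ALS paper (Hu, Koren & Volinsky, "Collaborative Filtering for Implicit Feedback Datasets") evaluates with an expected percentile rank, which weights each held-out view by how high the item lands in the user's ranked list. Below is a minimal Python sketch under my own assumed data layout (the function name and structures are not a Spark API); lower is better, and 0.5 means no better than random:

```python
def mean_percentile_rank(test_views, ranked_recs):
    """View-count-weighted average percentile position of each
    held-out (user, item) pair within that user's ranked list.
    0.0 = held-out items always ranked first; 0.5 = random."""
    num = den = 0.0
    for (user, item), views in test_views.items():
        ranking = ranked_recs[user]
        pos = ranking.index(item)  # assumes the item is in the list
        # Percentile rank: 0.0 for the first item, 1.0 for the last.
        pct = pos / (len(ranking) - 1) if len(ranking) > 1 else 0.0
        num += views * pct
        den += views
    return num / den if den else 0.0

# user1's 21 held-out views land on the top-ranked item (pct 0.0);
# user2's 4 views land on the second of three items (pct 0.5).
mpr = mean_percentile_rank(
    {("user1", "product1"): 21.0, ("user2", "product1"): 4.0},
    {"user1": ["product1", "product2", "product3"],
     "user2": ["product2", "product1", "product3"]})
# (21 * 0.0 + 4 * 0.5) / 25 = 0.08
```

Note that this metric uses the raw view counts from the input only as weights, so it never compares a 0.78-style score directly against a count like 21, which sidesteps the scale mismatch raised in the question.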
