
User profiling for topic-based recommender system

Original post: 2012-10-06 20:31:45. Tags: python, machine-learning, recommendation-engine, latent-semantic-indexing, topic-modeling

I'm trying to build a topic-based recommender system that suggests relevant text documents to users.

I trained a latent semantic indexing (LSI) model on the Wikipedia corpus using gensim. This lets me easily transform documents into LSI topic distributions. My idea now is to represent users the same way. However, users have a history of viewed articles, as well as ratings of those articles.

So my question is: how should I represent the users?

One idea I had is to represent a user as the aggregation of all the documents they have viewed. But how do I take the ratings into account?

Any ideas?

Thanks

2 answers

I don't think that will work with LSA.

But maybe you could do some sort of k-NN classification, where each user's coordinates are the documents viewed. Each object (= user) sends out radiation whose intensity is inversely proportional to the square of the distance. The intensity is calculated from the ratings on the individual documents.

Then you can place an object (user) in this high-dimensional space and see which other users give it the most 'light'.
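The answer leaves the details open, so here is only one possible reading of it, as a minimal sketch: each user is a vector with one coordinate per document, holding that user's rating (0 for unviewed documents), and the 'light' a new user receives from another user is 1 / distance². The names `brightest_neighbours`, `ratings`, and `new_user`, the toy data, and the small `eps` guard against division by zero are all assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length rating vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brightest_neighbours(target, others, k=2, eps=1e-9):
    """Return the k users whose 'light' on the target is strongest.

    Light follows an inverse-square law: 1 / distance**2.
    eps avoids division by zero when two users have identical ratings.
    """
    light = {
        name: 1.0 / (squared_distance(target, vec) + eps)
        for name, vec in others.items()
    }
    return sorted(light, key=light.get, reverse=True)[:k]


# Hypothetical data: one coordinate per document, value = rating (0 = not viewed).
ratings = {
    "alice": [5, 0, 3, 0],
    "bob": [4, 0, 3, 1],
    "carol": [0, 5, 0, 4],
}
new_user = [5, 0, 2, 0]

neighbours = brightest_neighbours(new_user, ratings, k=2)
print(neighbours)
```

Because the inverse-square weight decreases monotonically with distance, this ranking is the same as plain nearest-neighbour search; the weights only start to matter if you use them to combine the neighbours' ratings.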

But can't Apache Lucene do all of that for you?

"represent a user as the aggregation of all the documents viewed": that might indeed work, given that you are in linear spaces. You can easily add all the document vectors into one big vector.

If you want to include the ratings, you could simply put a coefficient in the sum.

Say you group all documents rated 2 into a vector D2, those rated 3 into D3, etc. You then simply define a user vector as U = c2*D2 + c3*D3 + ... You can play with various forms for c2, c3, ..., but the easiest approach would be to multiply by the rating and divide by the max rating for normalisation.

If your max rating is 5, you could define, for instance, c2 = 2/5, c3 = 3/5, ...
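A minimal sketch of that weighted sum in plain Python, assuming each document has already been converted to a dense topic vector (for example, from the LSI model). Summing `(rating / max_rating) * doc_vector` over all rated documents is equivalent to grouping them into D2, D3, ... first. The helper names, the toy 3-topic vectors, and the cosine-similarity ranking at the end are illustrative assumptions, not part of the answer.

```python
import math


def user_vector(rated_docs, max_rating=5):
    """Build U = sum over docs of (rating / max_rating) * doc_vector."""
    dim = len(rated_docs[0][0])
    profile = [0.0] * dim
    for vec, rating in rated_docs:
        coeff = rating / max_rating  # e.g. c2 = 2/5, c3 = 3/5
        for i, value in enumerate(vec):
            profile[i] += coeff * value
    return profile


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical history: (topic vector, rating) pairs in a 3-topic space.
history = [
    ([1.0, 0.0, 0.0], 5),
    ([0.0, 1.0, 0.0], 2),
    ([0.0, 1.0, 0.0], 3),
]
profile = user_vector(history)

# Assumed usage: rank unseen documents by similarity to the user profile.
candidates = {"doc_a": [1.0, 0.0, 0.0], "doc_b": [0.0, 0.0, 1.0]}
ranked = sorted(candidates, key=lambda d: cosine(profile, candidates[d]), reverse=True)
print(profile, ranked)
```

One thing to keep in mind with this scheme: a low rating still pulls the profile towards that document, only less strongly. If low ratings should push the profile away instead, the coefficients would need to be centred (for example around the middle rating), which is one of the "various forms" you could try.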