How to calculate similarity of weighted values in order to create good clusters

I am trying to create clusters based on objects that contain weighted values.

The values describe songs and the objects are users. For example:

If user1 likes 3 pop songs, 1 rap song, and no hip-hop songs, he will be represented as:

u1 = {3,1,0}

So if I have 3 users with random values, I could have a matrix like this:

3 1 0
0 4 5
1 2 3

u1 = {3,1,0}
u2 = {0,4,5}
u3 = {1,2,3}

My question is: is it possible to create clusters on that kind of data? And which kind of measure is best for finding the similarity between two such data points, something like the Jaccard similarity coefficient?

First I tried to calculate it using binary data, but I lose some information if I do it that way.
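To make the information loss concrete, here is a minimal sketch. It assumes the binary approach means "liked at least one song in the genre" followed by the Jaccard coefficient; the helper names `binarize` and `jaccard` are hypothetical and only for illustration.

```python
def binarize(counts):
    """Map each count to 1 if the user liked at least one song in that genre, else 0."""
    return [1 if c > 0 else 0 for c in counts]


def jaccard(a, b):
    """Jaccard coefficient of two binary vectors: |intersection| / |union|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumption: two all-zero vectors are given similarity 0.0.
    return both / either if either else 0.0


u1, u2, u3 = [3, 1, 0], [0, 4, 5], [1, 2, 3]
b1, b2, b3 = binarize(u1), binarize(u2), binarize(u3)

print(jaccard(b1, b2))  # u1 vs u2
print(jaccard(b1, b3))  # u1 vs u3
print(jaccard(b2, b3))  # u2 vs u3
```

After binarizing, u3 looks exactly as similar to u1 as it does to u2, because the counts that set them apart are gone.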

As a second approach, I tried to calculate the difference between each pair of corresponding values. I sum all the differences, divide by the number of values, and repeat this for each pair of objects.

As an example:

I take u1 and u2 and I get:

u1 = {3,1,0}
u2 = {0,4,5}

|3 - 0| = 3
|1 - 4| = 3
|0 - 5| = 5

(3 + 3 + 5) / 3 = 11/3 

u1 = {3,1,0}
u3 = {1,2,3}

|3 - 1| = 2
|1 - 2| = 1
|0 - 3| = 3

(2 + 1 +3) / 3 = 6/3 = 2

11/3 > 2, so u1 and u3 are more similar than u1 and u2 (a smaller value means more similar).
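The calculation above can be sketched in a few lines of Python. The function name `mean_abs_difference` is made up for this illustration; note that the value is a distance rather than a similarity, so smaller means closer.

```python
def mean_abs_difference(a, b):
    """Average of the absolute per-genre differences between two users."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


u1, u2, u3 = [3, 1, 0], [0, 4, 5], [1, 2, 3]

print(mean_abs_difference(u1, u2))  # (3 + 3 + 5) / 3
print(mean_abs_difference(u1, u3))  # (2 + 1 + 3) / 3
print(mean_abs_difference(u2, u3))  # (1 + 2 + 2) / 3
```

This is the Manhattan (L1) distance divided by the number of dimensions, so it ranks pairs the same way the plain Manhattan distance would.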

But I am not sure this approach is a good one either.

The goal is to compare clusters with other clusters in order to match some search results.

First, this does not seem to be a special case of cluster analysis. In fact, any clustering method should work on this data as well as it does in general. What I mean is that there is nothing "weird" or specific here; you simply have points in N-dimensional space. The only remark is that your current representation distinguishes people who like 10000 songs from people who like 10 songs, even if their music tastes are identical, for example:

[ 10000 0 0 ]
[ 10 0 0 ]

So if you are actually thinking about modeling users' genre preferences, you should consider normalisation, so that each dimension holds a percentage rather than a count (this is one example, as there are numerous ways to do it):

[ 10000 0 0 ] -> [ 1.0 0.0 0.0 ]
[ 10 0 0 ] -> [ 1.0 0.0 0.0 ]
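A minimal sketch of this kind of normalisation, dividing each count by the user's total. The function name `to_proportions` and the handling of a user with no likes at all are assumptions made for the example.

```python
def to_proportions(counts):
    """Convert per-genre counts into per-genre shares that sum to 1."""
    total = sum(counts)
    if total == 0:
        # Assumption: a user with no likes is mapped to the all-zero vector.
        return [0.0 for _ in counts]
    return [c / total for c in counts]


print(to_proportions([10000, 0, 0]))
print(to_proportions([10, 0, 0]))
print(to_proportions([3, 1, 0]))
```

Both heavy and light listeners with the same taste end up at the same point, which is what a taste-based clustering needs.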

The choice of a particular clustering method depends on many things regarding the expected output, rather than the input. You could start with some simple approaches (k-centroids based), and if the results are not satisfactory, go deeper into more advanced methods (hierarchical clustering, DBSCAN, OPTICS, EM, ...).
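As a starting point, here is a rough sketch of one k-centroid method, plain k-means, run on the normalised vectors. Everything in it is an assumption made for illustration: the function names, the choice of k = 2, and the naive initialisation that takes the first k points as starting centroids. A real project would more likely use a library implementation.

```python
def to_proportions(counts):
    """Convert per-genre counts into shares (all zeros if the user has no likes)."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0 for _ in counts]


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100):
    """Plain k-means; returns (labels, centroids)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


users = [[3, 1, 0], [0, 4, 5], [1, 2, 3]]
points = [to_proportions(u) for u in users]
labels, centroids = k_means(points, k=2)
print(labels)
```

With this initialisation u2 and u3 land in the same cluster and u1 in its own. The result of k-means depends on the starting centroids, so several restarts are usually worth trying.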

I would suggest that you use cosine similarity.

Assume that the preferences of users are just vectors (each vector represents one user).

As you understand, different users can listen to different amounts of music, but despite this they might have similar preferences: their vectors then differ in length but point in nearly the same direction.

So, in this model we can claim that the smaller the angle between two vectors, the more similar they are.

Instead of calculating the angle between two vectors directly, we can calculate the cosine of that angle (which is much simpler):

cos(angle) = dot_product(a, b) / (|a| * |b|)

Because the cosine decreases as the angle grows from 0 to 180 degrees, the greater the cosine of the angle between two vectors, the more similar they are.

Your example:

u1 = {3, 1, 0}
u2 = {0, 4, 5}
u3 = {1, 2, 3}

|u1| = sqrt(3^2 + 1^2 + 0^2) = sqrt(10) ~ 3.16
|u2| = sqrt(0^2 + 4^2 + 5^2) = sqrt(41) ~ 6.4
|u3| = sqrt(1^2 + 2^2 + 3^2) = sqrt(14) ~ 3.74

similarity(u1, u2) = dot_product(u1, u2) / (|u1| * |u2|) 
                   = (3*0 + 1*4 + 0*5) / (3.16 * 6.4)
                   = 4 / 20.224 ~ 0.2

similarity(u2, u3) = dot_product(u2, u3) / (|u2| * |u3|) 
                   = (0*1 + 4*2 + 5*3) / (6.4 * 3.74)
                   = 23 / 23.936 ~ 0.96

similarity(u1, u3) = dot_product(u1, u3) / (|u1| * |u3|)
                   = (3*1 + 1*2 + 0*3) / (3.16 * 3.74)
                   = 5 / 11.8184 ~ 0.42

So:

similarity(u1, u2) = 0.2

similarity(u2, u3) = 0.96

similarity(u1, u3) = 0.42

As I see it, the results agree with the input data, because u2 and u3 both like rap and hip-hop and hardly like pop music.
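For reference, the same calculation as a small Python sketch using only the standard library. The function name `cosine_similarity` and the convention of returning 0.0 for an all-zero vector are assumptions of this example, not part of the definition.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a user with no likes is treated as similar to nobody.
        return 0.0
    return dot / (norm_a * norm_b)


u1, u2, u3 = [3, 1, 0], [0, 4, 5], [1, 2, 3]

print(cosine_similarity(u1, u2))
print(cosine_similarity(u2, u3))
print(cosine_similarity(u1, u3))
```

Because the norms are not rounded here, the printed values can differ slightly from a hand calculation, but u2 and u3 still come out as the most similar pair. Scaling a vector does not change the result, so a user with [10000, 0, 0] and a user with [10, 0, 0] get a similarity of 1.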
