
How to calculate similarity of weighted values in order to create good clusters

I am trying to create clusters based on objects containing weighted values.

The values are counts of songs per genre, and the objects are users. For example:

If user1 likes 3 pop songs, 1 rap song, and no hip-hop songs, he will be represented as:

u1 = {3,1,0}

So if I have 3 users with random values, I could have a matrix like this:

3 1 0
0 4 5
1 2 3

u1 = {3,1,0}
u2 = {0,4,5}
u3 = {1,2,3}

My question is: is it possible to create clusters on that kind of data? And what kind of algorithm is best to find the similarity between two such objects, like the Jaccard similarity coefficient?

First I tried to calculate using binary data, but I would lose some information if I did that.

As a second approach, I tried to calculate the similarity between the individual values: I sum the absolute differences, and I repeat this between each pair of objects.

As an example:

I take u1 and u2 and I get:

u1 = {3,1,0}
u2 = {0,4,5}

|3 - 0| = 3
|1 - 4| = 3
|0 - 5| = 5

(3 + 3 + 5) / 3 = 11/3 

u1 = {3,1,0}
u3 = {1,2,3}

|3 - 1| = 2
|1 - 2| = 1
|0 - 3| = 3

(2 + 1 + 3) / 3 = 6/3 = 2

11/3 > 2, so u1 and u3 are more similar (the smaller the average difference, the more similar the users).

But I am not sure this approach is good either.
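The averaged absolute-difference comparison described above (essentially the Manhattan distance divided by the number of dimensions) can be sketched as follows; the function name is my own:

```python
def mean_abs_diff(a, b):
    """Average absolute difference per dimension: lower = more similar."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

u1 = [3, 1, 0]
u2 = [0, 4, 5]
u3 = [1, 2, 3]

print(mean_abs_diff(u1, u2))  # 11/3 ≈ 3.67
print(mean_abs_diff(u1, u3))  # 2.0
```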

The goal of this is to compare clusters against other clusters in order to match some search results.

First, this does not seem to be any special case of cluster analysis. In fact, each clustering method should work as well on this data as it does in general; there is nothing "weird" or specific here, you simply have points in N-dimensional space. The only remark is that your current representation distinguishes people who like 10000 songs from people who like 10 songs, even if their music tastes are identical, for example:

[ 10000 0 0 ]
[ 10 0 0 ]

So if you are actually trying to model users' genre preferences, you should consider normalisation, so that you have (for example, as there are numerous ways to do it) a percentage in each dimension rather than a count:

[ 10000 0 0 ] -> [ 1.0 0.0 0.0 ]
[ 10 0 0 ] -> [ 1.0 0.0 0.0 ]
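The count-to-percentage normalisation above can be sketched as a small helper (the function name is my own):

```python
def normalize(counts):
    """Convert per-genre counts into per-genre fractions that sum to 1."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)  # a user with no songs stays all-zero
    return [c / total for c in counts]

print(normalize([10000, 0, 0]))  # [1.0, 0.0, 0.0]
print(normalize([10, 0, 0]))     # [1.0, 0.0, 0.0]
print(normalize([3, 1, 0]))      # [0.75, 0.25, 0.0]
```

After this step, the two users above become identical points, which matches the intuition that their tastes are the same.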

The choice of a particular clustering method depends on many things regarding the expected output, rather than the input. You could start with some simple approaches (k-centroids based), and if the results are not satisfactory, go deeper into more advanced methods (hierarchical clustering, DBSCAN, OPTICS, EM, ...).
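A k-centroids approach like the one suggested above can be sketched in pure Python; this is a minimal illustrative k-means, not a production implementation (use a library such as scikit-learn for real work):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means sketch: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the centroid with the smallest squared distance to p
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster went empty
                centroids[i] = [sum(dim) / len(cluster) for dim in zip(*cluster)]
    return clusters

users = [[3, 1, 0], [0, 4, 5], [1, 2, 3]]
print(kmeans(users, 2))
```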

I would suggest you use cosine similarity.

Assume that users' preferences are just vectors (each vector represents one user).

As you understand, different users can listen to different amounts of music, but despite this, they might still have similar preferences.

So, in this model we can claim that the smaller the angle between two vectors, the more similar they are.

Instead of calculating the angle between two vectors directly, we can calculate the cosine of that angle, which is much simpler:

cos(angle) = dot_product(u, v) / (|u| * |v|)

Due to the shape of the cosine function: the greater the cosine of the angle between two vectors, the more similar they are.

Your example:

u1 = {3, 1, 0}
u2 = {0, 4, 5}
u3 = {1, 2, 3}

|u1| = sqrt(3^2 + 1^2 + 0^2) = sqrt(10) ~ 3.16
|u2| = sqrt(0^2 + 4^2 + 5^2) = sqrt(41) ~ 6.4
|u3| = sqrt(1^2 + 2^2 + 3^2) = sqrt(14) ~ 3.74

similarity(u1, u2) = dot_product(u1, u2) / (|u1| * |u2|) 
                   = (3*0 + 1*4 + 0*5) / (3.16 * 6.4)
                   = 4 / 20.224 ~ 0.2

similarity(u2, u3) = dot_product(u2, u3) / (|u2| * |u3|) 
                   = (0*1 + 4*2 + 5*3) / (6.4 * 3.74)
                   = 23 / 23.936 ~ 0.96

similarity(u1, u3) = dot_product(u1, u3) / (|u1| * |u3|) 
                   = (3*1 + 1*2 + 0*3) / (3.16 * 3.74)
                   = 5 / 11.8184 ~ 0.42

So:

similarity(u1, u2) = 0.2

similarity(u2, u3) = 0.96

similarity(u1, u3) = 0.42

As you can see, the results correlate with the input data: u2 and u3 both like rap and hip-hop, and hardly like pop music at all.
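The whole calculation above can be sketched as a small self-contained function:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

u1 = [3, 1, 0]
u2 = [0, 4, 5]
u3 = [1, 2, 3]

print(cosine_similarity(u1, u2))  # ≈ 0.20
print(cosine_similarity(u2, u3))  # ≈ 0.96
print(cosine_similarity(u1, u3))  # ≈ 0.42
```

Note that dot_product(u1, u3) = 3*1 + 1*2 + 0*3 = 5, so similarity(u1, u3) comes out as approximately 0.42.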
