简体   繁体   中英

How to calculate similarity between two arrays?

I have two arrays like these: a1=[8,6,4,8,8,9] , a2=[7,3,8,4,3,9,9,5,8,3,5,8] . They may contain different number of integers that can be repeated. How can I calculate similarity between these? Which metric should I use?

UPDATE: numbers represent vote from 1 to 10. Users vote items in one specific category. We don't know items they have been voted. We only know that these items are belong to the same category. So we have arrays of votes in category. The problem is how we could calculate similarity between these users.

Treat your votes as random samples

If you really don't know what they voted for, the only thing you have is the distribution of votes. Ie, you have two samples from two distributions and you need to evaluate the difference between the distributions.

The simplest approach would be to count the number of times the user gave a given vote, ie, convert [8,6,4,8,8,9] to [0,0,0,1,0,1,0,3,9,0] (ie, 3 votes of 8 means 3 in the 8th position). Now your vectors have the same length and you can use cosine similarity.

Fuzzy your data

If you do not have much data, ie, you really need to compare people who voted 1-2 times, you can try "fuzzying the votes", ie, treating a vote for 8 as 1/2 vote for 8, and 1/4 votes for each of 7 and 9. Eg, your vectors [0,0,0,0,0,0,0,0,0,1] and [0,0,0,0,0,0,0,0,1,0] will become [0,0,0,0,0,0,0,0,0.33,0.66] and [0,0,0,0,0,0,0,0.25,0.5,0.25] .

This is equivalent to using an "unusual" dot product: instead of the simple (v,u)=sum_i(v_i*u_i*) , use (Av,u)=sum_ij(a_ij*v_i*u_j) where A is an almost diagonal matrix ( eg , a_ii=4 , a_ij=1 if |ij|=1 , a_ij=0 otherwise). Then the new cosine similarity is then defined as

CS(u,v)=arccos( (Av,u) / sqrt( (Av,v) * (Au,u) ) )

For the tridiagonal example above, the formula looks like:

(Av,u) = 4*sum(v_i,u_i) + sum(v_i,u_{i-1}) + (v_i,u_{i+1})

Use Stats

If you do have a lot of data, ie, every person has votes every number at least 5 times (ie, each length 10 vector has all components >=5), then you can use Chi-squared test or, better yet , Likelihood-ratio test .

recommender system

You should specify which coordinates match (if you, as I suspect, doing a recommender system). Eg, if user 1 votes are [3] and user 2 votes are [4,5] , you need to know whether score 3 is for the same object as score 4 or 5 or for an entirely different object.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM