Fastest way to count the number of differences among rows in a 2D array

Question

I need to compute the number of differences (a score) between every row and every other row of a 2D array; the scores are then used to compute a 'difference distance' of the array, which is useful for statistics. Here is a simple example, but I need to do this on huge 2D arrays of ~100 000 rows and thousands of columns, so I'm looking to speed up my naive code:

import numpy

a = numpy.array([[1,2],[1,2],[1,3],[2,3],[3,3]])
score = 0
scoresquare = 0
for i in range(len(a)):
    for j in range(i+1, len(a)):
        if a[i,0]!=a[j,0] and a[i,1]!=a[j,0] and a[i,1]!=a[j,1] and a[i,0]!=a[j,1]:
            # no value of row i occurs in row j: two differences
            scoretemp = 2
        elif (a[i]==a[j]).all():
            # identical rows: no difference
            scoretemp = 0
        else:
            # anything else counts as one difference
            scoretemp = 1
        print(a[i], a[j], scoretemp, (a[i]==a[j]).all(), (a[i]==a[j]).any())
        score += scoretemp
        scoresquare += scoretemp*scoretemp
print(score, scoresquare)

a[0] is identical to a[1], so their score (number of differences) is 0, but a[0] has one difference with a[2] and two differences with a[4]. To compute the distance (statistics), I need both the sum of the scores and the sum of the squared scores as intermediate values.

reference_row  compared_row  score
[1 2]          [1 2]         0  
[1 2]          [1 3]         1 
[1 2]          [2 3]         1 
[1 2]          [3 3]         2  
[1 2]          [1 3]         1 
[1 2]          [2 3]         1  
[1 2]          [3 3]         2  
[1 3]          [2 3]         1  
[1 3]          [3 3]         1  
[2 3]          [3 3]         1  
Sum_score=11 Sum_scoresquare=15

My code is quite naive and doesn't take advantage of the full strength of arrays, so how can I accelerate this computation? Thanks for your help.

Answer 1

np.in1d tests every element of the first array for membership in the second and returns True for each match, so we negate the result with ~np.in1d to flag the mismatches. np.where then returns the indices that hold a True value, so len(np.where(...)[0]) gives the total number of mismatches. I hope this helps:

>>> import numpy as np
>>> a = np.array([[1,2],[1,2],[1,3],[2,3],[3,3]])
>>> res=[len(np.where(~np.in1d(a[p],a[q]))[0]) for p in range(a.shape[0]) for q in range(p+1,a.shape[0])]
>>> res=np.array(res)
>>> Sum_score=sum(res)
>>> Sum_score_square=sum(res*res)
>>> print(Sum_score, Sum_score_square)
11 15
>>> k=0
>>> for i in range(a.shape[0]):
...     for j in range(i+1,a.shape[0]):
...         print(a[i], a[j], res[k])
...         k+=1


[1 2] [1 2] 0
[1 2] [1 3] 1
[1 2] [2 3] 1
[1 2] [3 3] 2
[1 2] [1 3] 1
[1 2] [2 3] 1
[1 2] [3 3] 2
[1 3] [2 3] 1
[1 3] [3 3] 1
[2 3] [3 3] 1
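
The list comprehension above still calls np.in1d once per pair from Python, which is likely to be slow for ~100 000 rows. As a rough sketch (not benchmarked here), the inner loop can be pushed into NumPy by comparing one row against all later rows with broadcasting. The function name pairwise_mismatch_sums is made up for this illustration, and the approach assumes a small number of columns, because the temporary comparison array has shape (remaining rows, columns, columns).

```python
import numpy as np

def pairwise_mismatch_sums(a):
    """Hypothetical helper: sum of scores and of squared scores over all row pairs i < j.

    The score of a pair is the number of elements of row i that do not
    occur anywhere in row j (same counting as ~np.in1d(a[i], a[j])).
    """
    a = np.asarray(a)
    total = 0
    total_sq = 0
    for i in range(len(a) - 1):
        row = a[i]        # shape (k,)
        rest = a[i + 1:]  # shape (m, k): every later row
        # present[j, c] is True when row[c] occurs somewhere in rest[j]
        present = (rest[:, None, :] == row[None, :, None]).any(axis=2)
        scores = (~present).sum(axis=1)  # one score per later row
        total += int(scores.sum())
        total_sq += int((scores * scores).sum())
    return total, total_sq

a = np.array([[1, 2], [1, 2], [1, 3], [2, 3], [3, 3]])
print(pairwise_mismatch_sums(a))  # (11, 15)
```

For the example array this returns the same totals as above. Like the np.in1d version, it counts the elements of the earlier row that are missing from the later row, so for rows with repeated values (for example [3 3] compared against [1 3]) the result depends on row order and can differ from the rule in the question.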

solution1
1 2014-10-28 11:47:18