
Sum of squared differences between each pair of rows of a 17,000 by 300 matrix

Ok, so I have a matrix with 17,000 rows (examples) and 300 columns (features). I want to compute basically the (squared) Euclidean distance between each possible pair of rows, i.e. the sum of the squared differences for every pair. Obviously that's a lot of work, and IPython, while not completely crashing my laptop, says "(busy)" for a while and then I can't run anything anymore; it certainly seems to have given up, even though I can still move my mouse and everything.

Is there any way to make this work? Here's the function I wrote; I used numpy everywhere I could. What I'm doing is storing the result for each possible pair of rows in a distance matrix. I'm aware that the lower triangular part of the matrix equals the upper triangular part, but exploiting that would only save half the computation time (better than nothing, but not a game changer, I think).

EDIT: I just tried using scipy.spatial.distance.pdist, but it's been running for a good minute now with no end in sight. Is there a better way? I should also mention that I have NaN values in there... but apparently that's not a problem for numpy.

import numpy as np

# dataframe is the 17,000-row by 300-column set of features
features = np.array(dataframe)
distances = np.zeros((17000, 17000))


def sum_diff():
    # for every (i, j) pair, store the sum of squared differences of the two rows
    for i in range(17000):
        for j in range(17000):
            diff = np.array(features[i] - features[j])
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares
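
For reference, here is a minimal sketch of the pdist route mentioned in the edit, assuming features is the array built above. With metric='sqeuclidean', pdist returns the sum of squared differences for each pair as a condensed vector covering the upper triangle, and squareform expands it into the full symmetric matrix. Note that NaN values in the input will propagate into the corresponding entries of the result.

from scipy.spatial.distance import pdist, squareform
import numpy as np

# condensed vector of length n*(n-1)/2, one entry per pair (i, j) with i < j
condensed = pdist(features, metric='sqeuclidean')
# expand to the full symmetric n x n matrix (zeros on the diagonal)
distances = squareform(condensed)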

You could always divide your computation time by 2 by noticing that d(i, i) = 0 and d(i, j) = d(j, i).

But have you had a look at sklearn.metrics.pairwise.pairwise_distances() (available since v0.18, see the documentation)?

You would use it as:

from sklearn.metrics import pairwise
import numpy as np

a = np.array([[0, 0, 0], [1, 1, 1], [3, 3, 3]])
pairwise.pairwise_distances(a)
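
One caveat: with the default metric, pairwise_distances() returns the plain Euclidean distance, i.e. the square root of the sum of squared differences. If you really want the sum of squared differences from the question, something along these lines should work (both variants are, to my knowledge, supported by recent scikit-learn versions):

from sklearn.metrics import pairwise
import numpy as np

a = np.array([[0, 0, 0], [1, 1, 1], [3, 3, 3]])

# squared Euclidean distances via scipy's 'sqeuclidean' metric
d_sq = pairwise.pairwise_distances(a, metric='sqeuclidean')

# equivalent, using the squared=True flag of euclidean_distances
d_sq2 = pairwise.euclidean_distances(a, squared=True)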

The big thing with numpy is to avoid loops and let it do its magic with vectorised operations. There are a few basic improvements that will save you some computation time:

import numpy as np
import timeit

# I reduced the problem size to 1000*300 to keep the timings in a reasonable range
n=1000
features = np.random.rand(n,300)
distances = np.zeros((n,n))


def sum_diff():
    for i in range(n):
        for j in range(n):
            diff = np.array(features[i] - features[j])
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares

#Here I removed the unnecessary copy induced by calling np.array
# -> some improvement
def sum_diff_v0():
    for i in range(n):
        for j in range(n):
            diff = features[i] - features[j]
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares

#Collapsing of the statements -> no improvement
def sum_diff_v1():
    for i in range(n):
        for j in range(n):
            distances[i][j] = np.sum(np.square(features[i] - features[j]))

# Using broadcasting and vectorised operations -> big improvement
def sum_diff_v2():
    for i in range(n):
        distances[i] = np.sum(np.square(features[i] - features),axis=1)

# Computing only half the distance -> 1/2 computation time
def sum_diff_v3():
    for i in range(n):
        distances[i][i+1:] = np.sum(np.square(features[i] - features[i+1:]),axis=1)
    distances[:] = distances + distances.T

print("original :",timeit.timeit(sum_diff, number=10))
print("v0 :",timeit.timeit(sum_diff_v0, number=10))
print("v1 :",timeit.timeit(sum_diff_v1, number=10))
print("v2 :",timeit.timeit(sum_diff_v2, number=10))
print("v3 :",timeit.timeit(sum_diff_v3, number=10))

Edit: For completeness I also timed Camilleri's solution, which is much faster:

from sklearn.metrics import pairwise

def Camilleri_solution():
    distances=pairwise.pairwise_distances(features)

Timing results (in seconds, each function run 10 times on the 1000*300 input):

original : 138.36921879299916
v0 : 111.39915344800102
v1 : 117.7582511530054
v2 : 23.702392491002684
v3 : 9.712442981006461
Camilleri's : 0.6131987979897531

So as you can see, we can easily gain an order of magnitude by using the proper numpy syntax. Note that with only about 1/17th of the rows a single run of v3 takes roughly one second; the cost scales as N^2, so the full 17,000-row problem should take around 290 times as long, i.e. a matter of minutes rather than hours.
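
If you want to go further, the remaining Python loop can be removed entirely with the standard identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b, which turns the whole computation into a single matrix product (essentially what the scikit-learn routine does under the hood). Here is a rough sketch of such a variant, call it v4, reusing the features and distances arrays from the benchmark above; the clip is there because floating-point error can make some entries slightly negative. Also keep in mind that for the full problem the 17000*17000 float64 result alone is about 2.3 GB, so memory may become the real constraint on a laptop.

# Fully vectorised squared Euclidean distances via the dot-product identity
def sum_diff_v4():
    sq_norms = np.sum(np.square(features), axis=1)   # ||a||^2 for every row
    d = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    np.maximum(d, 0, out=d)   # clip tiny negative values caused by rounding
    distances[:] = d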

Forget about numpy, which is only a convenience solution for self-expanding arrays. Use Python lists instead, which have very fast indexing access and are about 15 times faster. Use them like this:

# turn the DataFrame into a plain list of row lists
features = dataframe.values.tolist()
# build the rows independently ([[None]*17000]*17000 would alias one row 17000 times)
distances = [[0.0] * 17000 for _ in range(17000)]

def sum_diff():
    for i in range(17000):
        for j in range(17000):
            sumsquares = 0.0
            for k in range(300):
                diff = features[i][k] - features[j][k]
                sumsquares += diff * diff
            distances[i][j] = sumsquares

I hope this is faster than your solution; just try it and give feedback, please.
