What is the fastest way to compute the Euclidean distances of a very large matrix with complex numbers?

I have a very large input data set of 50,000 samples with 9 dimensions (i.e. a 50000x9 matrix). This data has been transformed using a DFT:

dft_D = data.dot(dft(9).T) / np.sqrt(9)

I want to calculate the Euclidean distance for each pair of rows. I found scipy.spatial.distance.pdist to be the fastest way to calculate Euclidean distances for a matrix of real numbers (e.g. calculating the distances on data takes ~15 seconds). However, this function does not work with complex numbers.

I have tried the solution presented in this SO post, but it gave me serious memory issues (i.e. "Unable to allocate 191. GiB for an array with shape (50000, 50000, 9) and data type complex128"). I have also tried using the EDM defined in this Medium article, but that gave me similar memory issues.

Originally, I calculated these Euclidean distances by iterating over rows and columns using the definition np.sqrt(np.sum(np.square(np.abs(data[i,:] - data[j,:])))). This was awfully slow. I then used the definition described in the docs for sklearn.metrics.pairwise.euclidean_distances (which also doesn't work with complex numbers); it was slightly faster, but still very slow (over 2 hours to run).

This was my final result (note that I only compute half of the full distance matrix, since the distance matrix is symmetric):

import numpy as np

def calculate_euclidean_distance(arr, num_rows):
    # Condensed result: one entry per pair (i, j) with j < i
    dist_matrix = np.empty(num_rows * (num_rows - 1) // 2)
    idx = 0
    # Cache the squared norm of each row. np.vdot conjugates its first
    # argument, so vdot(x, x) is real and non-negative for complex rows.
    dot_dict = {0: np.vdot(arr[0, :], arr[0, :]).real}

    for i in range(1, num_rows):
        dot_dict[i] = np.vdot(arr[i, :], arr[i, :]).real
        i_dot = dot_dict[i]
        for j in range(0, i):
            j_dot = dot_dict[j]
            # |x - y|^2 = |x|^2 - 2*Re<x, y> + |y|^2; clamp tiny negative rounding errors
            sq_dist = i_dot - 2 * np.vdot(arr[i, :], arr[j, :]).real + j_dot
            dist_matrix[idx] = np.sqrt(max(sq_dist, 0.0))
            idx += 1
    return dist_matrix

Is there a faster way to get these distances when complex numbers are involved?

You may use numpy.roll(), which shifts the rows of the input array in a circular manner. It repeats a lot of computations but is much faster despite that. The code below fills the bottom half of the distance matrix:

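import numpy as np

# inp_arr: the (n_samples, n_features) input array, real or complex
n = inp_arr.shape[0]
dist_matrix = np.empty(shape=[n, n])  # the upper triangle is left uninitialized
for i in range(n):
    # After the shift, shifted_arr[j] is inp_arr[j - i] (wrapping around)
    shifted_arr = np.roll(inp_arr, i, axis=0)
    curr_dist = np.sqrt(np.sum(np.square(np.abs(inp_arr - shifted_arr)), axis=1))
    # Keep only the pairs (j, j - i) that did not wrap around
    for j in range(i, n):
        dist_matrix[j, j - i] = curr_dist[j]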

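As a sanity check, the roll-based fill can be compared against a direct pairwise computation on a small array. This is only a minimal sketch: the function name roll_distances and the random complex test array are illustrative assumptions, not part of the answer above, and the inner loop is replaced by an equivalent index assignment.

```python
import numpy as np

def roll_distances(inp_arr):
    """Fill the lower triangle (and diagonal) of the distance matrix using np.roll."""
    n = inp_arr.shape[0]
    dist_matrix = np.zeros((n, n))  # upper triangle stays zero
    for i in range(n):
        shifted_arr = np.roll(inp_arr, i, axis=0)
        curr_dist = np.sqrt(np.sum(np.abs(inp_arr - shifted_arr) ** 2, axis=1))
        rows = np.arange(i, n)
        # Same pairs (j, j - i) as the explicit inner loop, assigned in one step
        dist_matrix[rows, rows - i] = curr_dist[i:]
    return dist_matrix

# Hypothetical small complex test array
rng = np.random.default_rng(0)
z = rng.standard_normal((6, 9)) + 1j * rng.standard_normal((6, 9))
lower = roll_distances(z)

# Direct computation via broadcasting; only feasible for small n
direct = np.sqrt(np.sum(np.abs(z[:, None, :] - z[None, :, :]) ** 2, axis=2))
print(np.allclose(lower, np.tril(direct)))
```

The broadcasting reference builds an (n, n, 9) intermediate array, which is what causes the memory error on the full data set, so it is useful only for checking small cases.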
I don't understand your definition of dft_D. But if you're trying to calculate the distances between rows of the DFT of your original data, these will be the same as the distances between rows of your original data.

According to Parseval's theorem, a vector and its transform have the same magnitude, provided the transform is normalized to be unitary (scaled by 1/sqrt(N)). And by linearity, the transform of the difference of two vectors is equal to the difference of their transforms. Since the Euclidean distance is the magnitude of the difference, it doesn't matter which domain you use to calculate the metric. We can demonstrate with a small sample:

import numpy as np
import scipy.spatial

x = np.random.random((500, 9))  # use a smaller data set for the demo
Sx = np.fft.fft(x) / np.sqrt(x.shape[1])  # numpy's fft doesn't normalize by default
xd = scipy.spatial.distance.pdist(x, metric='euclidean')
# calculate the full square matrix of pairwise distances between transformed rows
n = Sx.shape[0]
Sxd = np.array([np.sqrt(np.sum(np.square(np.abs(Sx[i, :] - Sx[j, :]))))
                for i in range(n) for j in range(n)]).reshape((n, n))
Sxd = scipy.spatial.distance.squareform(Sxd)  # use scipy helper function to get back the same format as pdist
print(np.all(np.isclose(xd, Sxd)))  # should print True

Therefore, just use pdist on the original data.
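If distances on complex-valued data are still needed directly (for example, after a transform that is not unitary), one option is to hand pdist a real array with the real and imaginary parts placed side by side, since |a + bi|^2 = a^2 + b^2. The sketch below is a minimal example under stated assumptions: it takes dft in the question to be scipy.linalg.dft, and it uses a small random array as a stand-in for data.

```python
import numpy as np
from scipy.linalg import dft
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
data = rng.random((500, 9))  # stand-in for the real 50000x9 data set

# Assumed reading of the question's transform: unitary DFT of each row
dft_D = data.dot(dft(9).T) / np.sqrt(9)

# Stack real and imaginary parts side by side: real-valued, shape (500, 18)
dft_D_real = np.hstack([dft_D.real, dft_D.imag])

d_complex = pdist(dft_D_real, metric='euclidean')
d_original = pdist(data, metric='euclidean')
print(np.allclose(d_complex, d_original))
```

The stacked array only doubles the number of columns, so memory use stays proportional to the input plus the condensed distance vector that pdist returns.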
