
Calculating Euclidean distances with Python runs too slowly

I read two datasets from files into numpy arrays like this:

import numpy as np

def read_data(filename):
    # Start with an empty array of 65 integer columns and stack one parsed row per line
    data = np.empty(shape=[0, 65], dtype=int)
    with open(filename) as f:
        for line in f:
            data = np.vstack((data, np.array(list(map(int, line.split(','))), dtype=int)))
    return data

I use numpy to calculate the Euclidean distance between two vectors:

def euclidean_distance(x, z):
    # Euclidean (L2) distance between two 1-D arrays
    return np.linalg.norm(x - z)

After this, I calculate the Euclidean distances like this:

for data in testing_data:
    for data2 in training_data:
        dist = euclidean_distance(data, data2)

My problem is that this code runs very slowly; it takes about 10 minutes to finish. How can I improve this? What am I missing?
I have to use the distances in another algorithm, so speed is very important.

You could use sklearn.metrics.pairwise_distances, which allows you to spread the work across all of your cores. Parallel construction of a distance matrix discusses the same topic and provides a good discussion of the differences between pdist, cdist, and pairwise_distances.
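For comparison, scipy's cdist computes the same full distance matrix in one vectorized call (it has no n_jobs option, so it runs in a single process). A minimal sketch, where X_train and X_test are made-up placeholder arrays standing in for your data:

from scipy.spatial.distance import cdist
import numpy as np

# Placeholder data: 100 training rows and 20 testing rows with 65 integer features each
X_train = np.random.randint(0, 17, size=(100, 65))
X_test = np.random.randint(0, 17, size=(20, 65))

# dists[i, j] is the Euclidean distance between X_train[i] and X_test[j]
dists = cdist(X_train, X_test, metric='euclidean')
print(dists.shape)  # (100, 20)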

If I understand your example correctly, you want the distance between each sample in the training set and each sample in the testing set. To do that you could use:

from sklearn.metrics import pairwise_distances

dist = pairwise_distances(training_data, testing_data, n_jobs=-1)
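Here dist[i, j] is the Euclidean distance between training_data[i] and testing_data[j], so you can reuse the whole matrix in your downstream algorithm instead of recomputing distances in a Python loop. A minimal sanity-check sketch, assuming training_data and testing_data are the arrays returned by read_data and euclidean_distance is the function from the question:

import numpy as np

# The result has one row per training sample and one column per testing sample
assert dist.shape == (len(training_data), len(testing_data))

# Spot-check a single entry against the original per-pair function
i, j = 0, 0
assert np.isclose(dist[i, j], euclidean_distance(training_data[i], testing_data[j]))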
