
Calculating Euclidean distances with Python runs too slowly

I read two datasets from files into numpy arrays like this:

import numpy as np

def read_data(filename):
    # Start with an empty array of 65 integer columns and stack one parsed row per line
    data = np.empty(shape=[0, 65], dtype=int)
    with open(filename) as f:
        for line in f:
            data = np.vstack((data, np.array(list(map(int, line.split(','))), dtype=int)))
    return data

I use numpy to calculate the Euclidean distance between two vectors:

def euclidean_distance(x, z):
    # Euclidean (L2) distance between two 1-D arrays
    return np.linalg.norm(x - z)

After this, I calculate the Euclidean distances like this:

for data in testing_data:
    for data2 in training_data:
        dist = euclidean_distance(data, data2)

My problem is that this code runs very slowly; it takes about 10 minutes to finish. How can I improve this? What am I missing?
I have to use the distances in another algorithm, so speed is very important.

You could use sklearn.metrics.pairwise_distances, which allows you to spread the work across all of your cores. Parallel construction of a distance matrix discusses the same topic and provides a good discussion of the differences between pdist, cdist, and pairwise_distances.
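For comparison, scipy's cdist computes the same full distance matrix in one vectorized call (it has no n_jobs option, so it runs in a single process). A minimal sketch, where X_train and X_test are made-up placeholder arrays standing in for your data:

from scipy.spatial.distance import cdist
import numpy as np

# Placeholder data: 100 training rows and 20 testing rows with 65 integer features each
X_train = np.random.randint(0, 17, size=(100, 65))
X_test = np.random.randint(0, 17, size=(20, 65))

# dists[i, j] is the Euclidean distance between X_train[i] and X_test[j]
dists = cdist(X_train, X_test, metric='euclidean')
print(dists.shape)  # (100, 20)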

If I understand your example correctly, you want the distance between each sample in the training set and each sample in the testing set. To do that you could use:

from sklearn.metrics import pairwise_distances

dist = pairwise_distances(training_data, testing_data, n_jobs=-1)
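Here dist[i, j] is the Euclidean distance between training_data[i] and testing_data[j], so you can reuse the whole matrix in your downstream algorithm instead of recomputing distances in a Python loop. A minimal sanity-check sketch, assuming training_data and testing_data are the arrays returned by read_data and euclidean_distance is the function from the question:

import numpy as np

# The result has one row per training sample and one column per testing sample
assert dist.shape == (len(training_data), len(testing_data))

# Spot-check a single entry against the original per-pair function
i, j = 0, 0
assert np.isclose(dist[i, j], euclidean_distance(training_data[i], testing_data[j]))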
