How can I compute a distance matrix using Euclidean distance for a DataFrame's numerical variables?

This is my dataset: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

This dataset has 7 numerical variables, and as a beginner I could not work out how to compute a distance matrix over them using Euclidean distance. I've tried many approaches found on the internet, but none of them solved it. The data is very large, so I sometimes run into memory errors.

from sklearn.metrics.pairwise import euclidean_distances

X = [[0, 1], [1, 1]]
# distance between rows of X
euclidean_distances(X, X)

# result:
# array([[0., 1.],
#        [1., 0.]])

# get distance to origin
euclidean_distances(X, [[0, 0]])

# Result:
# array([[1.        ],
#        [1.41421356]]) 

This is the example I tried to adapt to my code. I think it works, but I could not apply it properly to my data.

You've already identified the problem: you can't hold the entire NxN matrix in memory. Your data set's header info says that there are 45211 rows in the database. The full distance matrix therefore has 45211 x 45211, or about 2.04 billion, entries. Stored as float64 (the NumPy default) it occupies over 16 GB, and even as float32 it needs about 8 GB. If this is more than your available RAM, or more than your system's allowed limit for a single data object, you're going to get a memory error.
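The size estimate can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the helper name `full_matrix_gb` is made up for illustration, and "GB" here means 10^9 bytes.

```python
# Rough size of a dense N x N distance matrix.
n_rows = 45211  # row count quoted for the Bank Marketing data set


def full_matrix_gb(n, bytes_per_value):
    """Size in GB (10**9 bytes) of a dense n x n matrix (hypothetical helper)."""
    return n * n * bytes_per_value / 1e9


float64_gb = full_matrix_gb(n_rows, 8)  # float64: 8 bytes per entry
float32_gb = full_matrix_gb(n_rows, 4)  # float32: 4 bytes per entry

print(f"float64: {float64_gb:.1f} GB, float32: {float32_gb:.1f} GB")
# prints: float64: 16.4 GB, float32: 8.2 GB
```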

You "solve" the given problem by changing your algorithm to something that doesn't require the entire two-way table in memory at once. Because the distance matrix is symmetric with a zero diagonal, you can also halve the memory requirement by keeping only the upper triangle.
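One way to avoid holding the whole table is to compute it a block of rows at a time and reduce each block before moving on. The following is a minimal sketch using only NumPy, not a definitive implementation. The function name `euclidean_chunks`, the chunk size, and the random stand-in data are all assumptions made for illustration; in practice `X` would be the 7 numerical columns of the DataFrame (for example via `df[numeric_cols].to_numpy()`, where `numeric_cols` is a list you define).

```python
import numpy as np


def euclidean_chunks(X, chunk_rows=1000):
    """Yield (start, block), where block holds the Euclidean distances from
    rows X[start:start + chunk_rows] to every row of X (hypothetical helper)."""
    X = np.asarray(X, dtype=np.float64)
    sq_norms = (X ** 2).sum(axis=1)
    for start in range(0, X.shape[0], chunk_rows):
        stop = start + chunk_rows
        # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b
        sq = sq_norms[start:stop, None] + sq_norms[None, :] - 2.0 * X[start:stop] @ X.T
        np.maximum(sq, 0.0, out=sq)  # clip tiny negatives caused by rounding
        yield start, np.sqrt(sq)


# Stand-in data (assumption): 5 rows with 7 numerical columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 7))

# Example reduction: index of the nearest other row for each row,
# computed without ever storing the full N x N matrix.
nearest = np.empty(X.shape[0], dtype=int)
for start, block in euclidean_chunks(X, chunk_rows=2):
    rows = np.arange(block.shape[0])
    block[rows, start + rows] = np.inf  # ignore each row's distance to itself
    nearest[start:start + block.shape[0]] = block.argmin(axis=1)
```

Peak memory is then roughly `chunk_rows * N` values instead of `N * N`. If scikit-learn is already in use, `sklearn.metrics.pairwise_distances_chunked` offers a similar generator-based interface, and `scipy.spatial.distance.pdist` returns only the upper triangle in condensed form.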
