How can I compute a distance matrix using Euclidean distance for a DataFrame's numerical variables?

Question

This is my dataset: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

This dataset has 7 numerical variables, and as a beginner I could not work out how to compute a distance matrix over them using Euclidean distance. I've tried many things I found on the internet, but could not solve it. The data is very large, so I sometimes run into memory errors.

from sklearn.metrics.pairwise import euclidean_distances

X = [[0, 1], [1, 1]]
# distance between rows of X
euclidean_distances(X, X)

# result:
# array([[0., 1.],
#        [1., 0.]])

# get distance to origin
euclidean_distances(X, [[0, 0]])

# Result:
# array([[1.        ],
#        [1.41421356]])

This is the example I tried to adapt. I think it works, but I could not apply it properly to my own data.

Answer 1

You've already identified your problem: you can't hold the entire NxN matrix in memory. Your dataset's header info says that there are 45211 rows. The full distance matrix for that many rows occupies over 16 GB in float64 (NumPy's default float type), and still about 8 GB in float32. If this is more than your available RAM, or more than your system's allowed limit for a single data object, you're going to get a memory error.
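The size of a dense NxN matrix is N * N * bytes per element, so the estimate is easy to check. A minimal sketch, assuming the 45211-row count from the dataset header; the helper name `full_matrix_bytes` is made up for illustration.

```python
def full_matrix_bytes(n_rows: int, itemsize: int) -> int:
    """Bytes needed for a dense n_rows x n_rows matrix."""
    return n_rows * n_rows * itemsize


n = 45211  # row count stated in the dataset header

# itemsize is the number of bytes per element for each dtype
for name, itemsize in (("float32", 4), ("float64", 8)):
    gb = full_matrix_bytes(n, itemsize) / 1e9
    print(f"{name}: {gb:.1f} GB")

# float32: 8.2 GB
# float64: 16.4 GB
```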

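You "solve" the problem by changing your algorithm to one that doesn't need the entire two-way table in memory at once, for example by working through the distances one block of rows at a time. Because Euclidean distance is symmetric and the diagonal is all zeros, you can also halve the memory requirement by keeping only the upper triangle.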
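Both ideas can be sketched with existing library calls. This is only a sketch on a small made-up DataFrame (the column names and values are placeholders, not the real data). `scipy.spatial.distance.pdist` returns just the upper triangle as a flat "condensed" vector, and `sklearn.metrics.pairwise_distances_chunked` yields the matrix a block of rows at a time so that each block can be reduced and discarded. The nearest-neighbour reduction is an assumed example of what you might want from the distances; substitute whatever summary you actually need.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist
from sklearn.metrics import pairwise_distances_chunked

# Made-up stand-in for the real data; column names are placeholders.
df = pd.DataFrame({
    "age": [30, 33, 40, 40],
    "balance": [0.0, 4.0, 0.0, 1.0],
    "job": ["admin", "services", "admin", "technician"],  # non-numeric, dropped below
})

# Keep only the numerical columns, stored as float32 to halve the input size.
X = df.select_dtypes(include="number").to_numpy(dtype=np.float32)

# 1) Upper triangle only: n*(n-1)/2 distances instead of n*n.
condensed = pdist(X, metric="euclidean")
print(condensed.shape)  # (6,)


# 2) Chunked: handle one block of rows at a time and keep only a summary.
def nearest_neighbour(dist_chunk, start):
    # dist_chunk[i, j] is the distance from row (start + i) to row j.
    d = dist_chunk.copy()
    rows = np.arange(d.shape[0])
    d[rows, start + rows] = np.inf  # ignore each row's distance to itself
    return d.argmin(axis=1)


# working_memory is the target size of each temporary block, in MiB.
nearest = np.concatenate(list(
    pairwise_distances_chunked(X, reduce_func=nearest_neighbour, working_memory=64)
))
print(nearest)
```

The condensed vector still grows quadratically (about 1 billion entries for 45211 rows), so on the full dataset the chunked route is the one that keeps memory bounded.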

solution1
1 2020-04-14 16:57:29
