I have a large dataframe (eg 15k objects), where each row is an object and the columns are the numeric object features. It is in the form:
import pandas as pd
df = pd.DataFrame({'A': [0, 0, 1],
                   'B': [2, 3, 4],
                   'C': [5, 0, 1],
                   'D': [1, 1, 0]},
                  columns=['A', 'B', 'C', 'D'], index=['first', 'second', 'third'])
I want to calculate the pairwise distances of all objects (rows) and read that scipy's pdist() function is a good solution due to its computational efficiency. I can simply call:
from scipy.spatial.distance import pdist
res = pdist(df, 'cityblock')
res
>> array([ 6., 8., 4.])
And see that the res array contains the distances in the following order: [first-second, first-third, second-third].
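A quick way to check that ordering is to line the values up against the row-label pairs. This is only a sketch: it assumes the condensed output follows the same order as itertools.combinations over the index (which matches the three values above), and the names pairs and labelled are made up for illustration.

```python
from itertools import combinations

import pandas as pd
from scipy.spatial.distance import pdist

df = pd.DataFrame({'A': [0, 0, 1], 'B': [2, 3, 4], 'C': [5, 0, 1], 'D': [1, 1, 0]},
                  index=['first', 'second', 'third'])

# condensed distances: one value per unordered pair of rows
res = pdist(df.to_numpy(), 'cityblock')

# row-label pairs in the order (0,1), (0,2), ..., (1,2), ...
pairs = list(combinations(df.index, 2))

# hypothetical helper structure: each pair next to its distance
labelled = list(zip(pairs, res))
for (a, b), d in labelled:
    print(f"{a}-{b}: {d}")
```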
My question is how can I get this in a matrix, dataframe or (less desirably) dict format so I know exactly which pair each distance value belongs to, like below:
        first  second  third
first       0       -      -
second      6       0      -
third       8       4      0
Eventually, I think having the distance matrix as a pandas DataFrame may be convenient, since I may apply some ranking and ordering operations per row (eg find the top N closest objects to object first).
Oh, I found the answer on this webpage. Apparently, there is a dedicated function for that named squareform(), which converts the condensed distance vector returned by pdist() into a square, symmetric matrix. Not deleting my question for the time being in case it may be helpful for someone else.
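from scipy.spatial.distance import squareform
res = pdist(df, 'cityblock')
# expand the condensed distance vector into a symmetric square matrix
squareform(res)
# label rows and columns with the original index
pd.DataFrame(squareform(res), index=df.index, columns=df.index)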
>>         first  second  third
>> first     0.0     6.0    8.0
>> second    6.0     0.0    4.0
>> third     8.0     4.0    0.0
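For the ranking use case mentioned in the question (top N closest objects to a given object), something along these lines should work on the labelled matrix. It is a minimal sketch, not a tuned solution: the helper name closest is made up here, and for 15k rows the full square matrix holds about 225 million floats, so memory may become the limiting factor.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

df = pd.DataFrame({'A': [0, 0, 1], 'B': [2, 3, 4], 'C': [5, 0, 1], 'D': [1, 1, 0]},
                  index=['first', 'second', 'third'])

# square distance matrix labelled with the original row index
dist = pd.DataFrame(squareform(pdist(df.to_numpy(), 'cityblock')),
                    index=df.index, columns=df.index)


def closest(dist, label, n=1):
    """Hypothetical helper: the n nearest other rows to `label`."""
    # drop the object itself (distance 0), then keep the n smallest distances
    return dist.loc[label].drop(label).nsmallest(n)


print(closest(dist, 'first', n=2))
```

The returned Series is indexed by the neighbour labels, so the names and the distances stay together.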