How can I construct a distance or dissimilarity matrix?

I have a df as follows:

0    111155555511111116666611111111
1    555555111111111116666611222222
2    221111114444411111111777777777
3    111111116666666661111111111111
.......
1000  114444111111111111555555111111

I am calculating the distance between each pair of strings. For instance, to get the distance between the first two strings: textdistance.hamming(df[0], df[1]). This returns a single integer.

Now I want to create a df that stores the distances between all pairs of strings. In this case, since I have 1000 strings, I will have a 1000 by 1000 df. The first value in the first row is the distance between string 1 and itself, then string 1 and string 2, and so on. The next row holds string 2 and string 1, string 2 and itself, and so on.

Create all ordered pairs of the Series values with itertools.product, collect the Hamming distances in a list, then convert the list to an array and reshape it for the DataFrame:

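import numpy as np
import pandas as pd
import textdistance
from itertools import product

# df is the Series of strings; product(df, repeat=2) yields every ordered pair
L = [textdistance.hamming(x, y) for x, y in product(df, repeat=2)]
# reshape the flat list of n*n distances into an n x n matrix;
# a new name keeps the original Series in df intact
dist_df = pd.DataFrame(np.array(L).reshape(len(df), len(df)))
print(dist_df)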
    0   1   2   3   4
0   0  14  24  18  15
1  14   0  24  26  19
2  24  24   0  20  23
3  18  26  20   0  19
4  15  19  23  19   0
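
Because the Hamming distance is symmetric and the diagonal is always zero, only the pairs above the diagonal need to be computed. The following is a minimal pure-Python sketch of that idea, not part of the original answer. The helper names hamming and distance_matrix and the sample strings are hypothetical, and the sketch assumes all strings have the same length.

```python
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def distance_matrix(strings):
    """Build a symmetric n x n matrix, computing each unordered pair once."""
    n = len(strings)
    matrix = [[0] * n for _ in range(n)]  # diagonal stays 0
    for i, j in combinations(range(n), 2):
        d = hamming(strings[i], strings[j])
        matrix[i][j] = d  # upper triangle
        matrix[j][i] = d  # mirrored into the lower triangle
    return matrix


# Hypothetical sample data, shorter than the strings in the question
sample = ["1111555", "5555551", "2211111"]
matrix = distance_matrix(sample)
print(matrix)
```

The nested list can be passed to pd.DataFrame(matrix) to get the same layout as the output above.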

EDIT:

To improve performance, use scipy's pdist with a lambda function that wraps the Hamming distance:

import numpy as np
import pandas as pd
import textdistance
from scipy.spatial.distance import pdist, squareform

# prepare a 2-dimensional array of shape M x 1 (M strings, one column each)
transformed_strings = np.array(df).reshape(-1, 1)

# calculate the condensed distance matrix by wrapping the Hamming distance function
distance_matrix = pdist(transformed_strings, lambda x, y: textdistance.hamming(x[0], y[0]))

# expand the condensed form into a square matrix
df1 = pd.DataFrame(squareform(distance_matrix), dtype=int)
print(df1)
    0   1   2   3   4
0   0  14  24  18  15
1  14   0  24  26  19
2  24  24   0  20  23
3  18  26  20   0  19
4  15  19  23  19   0
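
If all strings have the same length, as in the question, the per-pair Python call can be avoided entirely by comparing characters with NumPy broadcasting. This is a rough sketch under that assumption rather than something from the original answer. The function name hamming_matrix and the sample strings are hypothetical. The intermediate boolean array has n * n * length elements, so for 1000 strings of 30 characters it holds about 30 million values.

```python
import numpy as np


def hamming_matrix(strings):
    """Pairwise Hamming distances for equal-length strings via broadcasting."""
    # One row per string, one single-character cell per position: shape (n, length)
    chars = np.array([list(s) for s in strings])
    # Compare every row with every other row: (n, 1, length) vs (1, n, length)
    differs = chars[:, None, :] != chars[None, :, :]
    # Count the differing positions for each pair: shape (n, n)
    return differs.sum(axis=2)


# Hypothetical sample data, shorter than the strings in the question
sample = ["1111555", "5555551", "2211111"]
result = hamming_matrix(sample)
print(result)
```

As with the other solutions, pd.DataFrame(result) turns the array into a DataFrame.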
