How do I construct a distance or dissimilarity matrix?

Question

I have a DataFrame (df) that looks like this:

0    111155555511111116666611111111
1    555555111111111116666611222222
2    221111114444411111111777777777
3    111111116666666661111111111111
.......
1000  114444111111111111555555111111

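I am computing the distance between each pair of strings. For example, to get the distance between the first two strings: textdistance.hamming(df[0], df[1]). This returns an integer.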
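For reference, here is a minimal pure-Python sketch of what a Hamming distance counts, using the first two strings from the listing above. The function name `hamming` and the choice to raise an error on unequal lengths are assumptions made for this illustration; textdistance may treat unequal lengths differently.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # Assumption for this sketch: unequal lengths are rejected outright.
        raise ValueError("strings must have the same length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


first = "111155555511111116666611111111"
second = "555555111111111116666611222222"
print(hamming(first, second))  # prints 14
```

The strings differ at positions 0-3, 6-9 and 24-29, which gives 4 + 4 + 6 = 14.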

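Now I want to create a df that stores the distances between every pair of strings. Since I have 1000 strings, this will be a 1000 x 1000 df. The first value in the first row is the distance between string 1 and itself, then string 1 and string 2, and so on. The next row holds string 2 and string 1, string 2 and itself, and so on.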
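The layout described above can be sketched in plain Python as follows. This is only an illustration under two assumptions: the distance is symmetric and is zero between a string and itself, so each unordered pair needs to be computed only once. The helper names `mismatch_count` and `build_distance_matrix` are hypothetical, and `mismatch_count` is a stand-in for textdistance.hamming on equal-length strings.

```python
from itertools import combinations


def mismatch_count(a, b):
    # Stand-in for a Hamming distance on equal-length strings.
    return sum(x != y for x, y in zip(a, b))


def build_distance_matrix(strings, distance):
    n = len(strings)
    # The diagonal stays 0: a string is at distance 0 from itself.
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = distance(strings[i], strings[j])
        matrix[i][j] = d  # row i, column j
        matrix[j][i] = d  # mirrored entry, by symmetry
    return matrix


sample = [
    "111155555511111116666611111111",
    "555555111111111116666611222222",
    "221111114444411111111777777777",
    "111111116666666661111111111111",
]
matrix = build_distance_matrix(sample, mismatch_count)
for row in matrix:
    print(row)
```

Row i of the result holds the distances from string i to every string, in order, which matches the row-by-row description in the question.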

Answer 1

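Create all combinations of the Series values, collect the Hamming distance of each pair in a list, then convert the list to an array and on to a DataFrame: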

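import numpy as np
import pandas as pd
import textdistance
from itertools import product

# df is assumed to be a Series of strings, so iterating over it yields the strings
L = [textdistance.hamming(x, y) for x, y in product(df, repeat=2)]
# reshape the flat list of n * n distances into an n x n table
df = pd.DataFrame(np.array(L).reshape(len(df), len(df)))
print(df)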
    0   1   2   3   4
0   0  14  24  18  15
1  14   0  24  26  19
2  24  24   0  20  23
3  18  26  20   0  19
4  15  19  23  19   0

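Edit:

For better performance, use the following solution, which wraps the distance function in a lambda. pdist evaluates each unordered pair only once, and squareform then expands the result into the full symmetric matrix: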

import numpy as np
import pandas as pd
import textdistance
from scipy.spatial.distance import pdist, squareform

# prepare a 2-dimensional array of shape M x 1 (M strings, one string per row);
# df here is the original Series of strings
transformed_strings = np.array(df).reshape(-1, 1)

# calculate the condensed distance matrix by wrapping the Hamming distance function
distance_matrix = pdist(transformed_strings, lambda x, y: textdistance.hamming(x[0], y[0]))

# expand the condensed form into a square matrix
df1 = pd.DataFrame(squareform(distance_matrix), dtype=int)
print(df1)
    0   1   2   3   4
0   0  14  24  18  15
1  14   0  24  26  19
2  24  24   0  20  23
3  18  26  20   0  19
4  15  19  23  19   0
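If all strings have the same length, as in the sample data, another option is to skip the per-pair Python call entirely and compare character codes with NumPy broadcasting. This is not part of the answer above; it is a sketch that assumes equal-length strings and enough memory for an n x n x length boolean array (roughly 30 MB for 1000 strings of 30 characters).

```python
import numpy as np

strings = [
    "111155555511111116666611111111",
    "555555111111111116666611222222",
    "221111114444411111111777777777",
    "111111116666666661111111111111",
]

# Encode each string as a row of integer character codes:
# shape (n_strings, string_length).
codes = np.array([[ord(c) for c in s] for s in strings])

# Broadcast every row against every other row, then count the
# mismatching positions along the last axis.
distances = (codes[:, None, :] != codes[None, :, :]).sum(axis=2)
print(distances)
```

The result is a square integer array that can be passed to pd.DataFrame in the same way as the output of squareform.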
