Create Similarity Matrix
I have a matrix that looks like this:
col_1 col_2 value
A B 2.1
A C 1.3
B C 4.6
A D 1.4
....
I would like to get a similarity matrix:
A B C D
A X 2.1 1.3 1.4
B 2.1 X 4.6 ...
C ... ... X ...
D ... ... ... X
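So the row and column names are A, B, C, D, and the values come from the third column of the original matrix. A further complication is that the original matrix has about 10000 rows.
As Roland suggested, you can use dcast():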
library(data.table)
# Name the value column explicitly so dcast() does not have to guess it
dcast(df, col_1 ~ col_2, value.var = "value")
## col_1 B C D
## 1 A 2.1 1.3 1.4
## 2 B NA 4.6 NA
where:
df <- data.frame(
col_1 = c("A", "A", "B", "A"),
col_2 = c("B","C", "C", "D"),
value = c(2.1, 1.3, 4.6, 1.4)
)
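Note that the dcast() output above is not yet square or symmetric (there is no A column, and no C or D row). For comparison, here is a minimal pandas sketch of the same reshaping that also fills in the mirrored half. It assumes the column names from the question and that each pair appears at most once; the function name `pivot_to_symmetric` is made up for illustration.

```python
import numpy as np
import pandas as pd

def pivot_to_symmetric(df):
    """Pivot a long (col_1, col_2, value) table into a square, symmetric frame."""
    # Every label that occurs in either column becomes both a row and a column.
    labels = sorted(set(df["col_1"]) | set(df["col_2"]))
    wide = df.pivot(index="col_1", columns="col_2", values="value")
    wide = wide.reindex(index=labels, columns=labels)
    vals = wide.to_numpy()
    # Where a cell is missing, take the value from the mirrored cell instead.
    sym = np.where(np.isnan(vals), vals.T, vals)
    return pd.DataFrame(sym, index=labels, columns=labels)

df = pd.DataFrame({
    "col_1": ["A", "A", "B", "A"],
    "col_2": ["B", "C", "C", "D"],
    "value": [2.1, 1.3, 4.6, 1.4],
})
sim = pivot_to_symmetric(df)
print(sim)
```

Pairs that never occur in the input, and the diagonal, stay NaN here, which plays the role of the X in the question.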
Using xtabs and mutate_at. sparse = TRUE converts the output to a sparseMatrix:
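library(dplyr)
library(Matrix)  # attaches the t() method for sparse matrices
# Give both columns the same factor levels so the table is square.
# as.character() keeps this working whether or not the columns are factors.
all_labels <- unique(c(as.character(df$col_1), as.character(df$col_2)))
mat <- df %>%
mutate_at(1:2, factor, levels = all_labels) %>%
xtabs(value ~ col_1 + col_2, data=., sparse = TRUE)
# Mirror the upper triangle. The diagonal is empty, so adding the transpose is safe.
# (Copying upper.tri into lower.tri would misplace values, because both are read in column order.)
mat <- mat + t(mat)
Result:
4 x 4 sparse Matrix of class "dgCMatrix"
col_2
col_1 A B C D
A . 2.1 1.3 1.4
B 2.1 . 4.6 .
C 1.3 4.6 . .
D 1.4 . . .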
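The idea behind a sparse result is to store only the pairs that actually occur. A rough standard-library sketch of that idea follows, assuming the input is available as (label, label, value) triples; `build_sparse_similarity` and `lookup` are hypothetical helper names, not part of any library.

```python
from collections import defaultdict

def build_sparse_similarity(rows):
    """Store each observed pair in both orientations; absent pairs take no space."""
    sim = defaultdict(dict)
    for a, b, value in rows:
        sim[a][b] = value
        sim[b][a] = value  # mirror so that lookups are symmetric
    return sim

def lookup(sim, a, b, default=0.0):
    """Return the stored similarity, or `default` for a pair that was never seen."""
    return sim.get(a, {}).get(b, default)

rows = [("A", "B", 2.1), ("A", "C", 1.3), ("B", "C", 4.6), ("A", "D", 1.4)]
sim = build_sparse_similarity(rows)
print(lookup(sim, "C", "B"))  # 4.6
print(lookup(sim, "B", "D"))  # 0.0
```

With about 10000 input rows this holds at most 20000 entries, however many distinct labels there are.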
You can do it as follows. Since no language was specified, I wrote the code in Python:
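# Assumes the data is in a pandas DataFrame called df
import numpy as np
import pandas as pd
# Sample rows from the question; replace with your own data
df = pd.DataFrame({
    'col_1': ['A', 'A', 'B', 'A'],
    'col_2': ['B', 'C', 'C', 'D'],
    'value': [2.1, 1.3, 4.6, 1.4],
})
list_of_labels = ['A', 'B', 'C', 'D']
nb_labels = len(list_of_labels)
similarity = np.zeros((nb_labels, nb_labels))
for l1, l2, val in zip(df['col_1'], df['col_2'], df['value']):
    i = list_of_labels.index(l1)
    j = list_of_labels.index(l2)
    similarity[i, j] = val
    similarity[j, i] = val  # mirror the value so the matrix is symmetric
similarity_df = pd.DataFrame(data=similarity, index=list_of_labels, columns=list_of_labels)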
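For around 10000 rows the loop is fast enough, but the same fill can be done without a Python-level loop. The sketch below is one possible variant, not the original answer's method: it derives the labels from the data rather than hard-coding them and uses NumPy index arrays. The function name `build_similarity` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

def build_similarity(df):
    """Fill a square matrix from (col_1, col_2, value) rows using index arrays."""
    labels = sorted(set(df["col_1"]) | set(df["col_2"]))
    index = pd.Index(labels)
    # Translate each label into its row/column position.
    i = index.get_indexer(df["col_1"])
    j = index.get_indexer(df["col_2"])
    values = df["value"].to_numpy()
    sim = np.zeros((len(labels), len(labels)))
    sim[i, j] = values
    sim[j, i] = values  # mirrored half
    return pd.DataFrame(sim, index=labels, columns=labels)

df = pd.DataFrame({
    "col_1": ["A", "A", "B", "A"],
    "col_2": ["B", "C", "C", "D"],
    "value": [2.1, 1.3, 4.6, 1.4],
})
similarity_df = build_similarity(df)
print(similarity_df)
```

Unobserved pairs and the diagonal are left at 0 here; start from `np.full(..., np.nan)` instead if missing values should be distinguishable from a true zero.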