Create Similarity Matrix
I have a matrix that looks like this:
col_1 col_2 value
A B 2.1
A C 1.3
B C 4.6
A D 1.4
....
I would like to get a similarity matrix:
A B C D
A X 2.1 1.3 1.4
B 2.1 X 4.6 ...
C ... ... X ...
D ... ... ... X
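So the row and column names are A, B, C, D, and the values are taken from the third column and placed into the matrix. A further concern is that the original matrix is about 10,000 rows long.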
As Roland suggested, you can use dcast():
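library(data.table)
# value.var names the column that fills the cells; without it dcast() has to guess.
# Depending on the data.table version, df may need converting first, e.g. setDT(df).
dcast(df, col_1 ~ col_2, value.var = "value")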
## col_1 B C D
## 1 A 2.1 1.3 1.4
## 2 B NA 4.6 NA
where:
df <- data.frame(
col_1 = c("A", "A", "B", "A"),
col_2 = c("B","C", "C", "D"),
value = c(2.1, 1.3, 4.6, 1.4)
)
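For readers working in Python, a rough analogue of this long-to-wide reshape is pandas' pivot(). The sketch below is only an illustration, assuming the sample data from the question; the names wide, square and similarity are made up for the example. Note that the reshape on its own is not square (there is no A column and no C or D row), so the sketch also reindexes over all labels and fills each missing cell from its mirror image.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col_1": ["A", "A", "B", "A"],
    "col_2": ["B", "C", "C", "D"],
    "value": [2.1, 1.3, 4.6, 1.4],
})

# Long-to-wide reshape: one row per col_1 label, one column per col_2 label.
wide = df.pivot(index="col_1", columns="col_2", values="value")

# Reindex both axes over the union of labels so the result is square.
labels = sorted(set(df["col_1"]) | set(df["col_2"]))
square = wide.reindex(index=labels, columns=labels)

# Fill each missing cell from its mirror-image cell to make the matrix symmetric.
similarity = square.combine_first(square.T)
print(similarity)
```

The diagonal and any pair that never occurs in the data stay NaN here. If the same pair can appear more than once, pivot() raises an error; pivot_table() with an explicit aggfunc is one possible alternative.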
Using xtabs and mutate_at. sparse = TRUE converts the output to a sparseMatrix:
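library(dplyr)
library(Matrix)  # provides t() and indexing methods for sparse matrices
# Give both columns the same factor levels so that the table is square.
# as.character() makes this work whether the columns are factors or strings.
mat <- df %>%
  mutate_at(1:2, factor, levels = unique(c(as.character(.$col_1), as.character(.$col_2)))) %>%
  xtabs(value ~ col_1 + col_2, data = ., sparse = TRUE)
# Mirror the upper triangle into the lower one: mat[i, j] <- mat[j, i] for i > j.
# Copying mat[upper.tri(mat)] directly would pair the cells in the wrong order,
# because both triangles are read column by column.
mat[lower.tri(mat)] <- t(mat)[lower.tri(mat)]
Result:
4 x 4 sparse Matrix of class "dgCMatrix"
     col_2
col_1 A   B   C   D
    A .   2.1 1.3 1.4
    B 2.1 .   4.6 .
    C 1.3 4.6 .   .
    D 1.4 .   .   .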
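Whatever tool is used, the mirroring step has to pair each entry (i, j) with (j, i). The same idea in NumPy, as a minimal sketch: it assumes a dense array and hard-codes the four pairs from the question, and the names upper and symmetric are only illustrative.

```python
import numpy as np

# Upper-triangular matrix holding the four pairs from the question
# (row/column order is A, B, C, D).
upper = np.zeros((4, 4))
upper[0, 1] = 2.1  # A-B
upper[0, 2] = 1.3  # A-C
upper[1, 2] = 4.6  # B-C
upper[0, 3] = 1.4  # A-D

# Adding the transpose copies every (i, j) entry to (j, i).
# This is only valid because the diagonal and the lower triangle start at zero.
symmetric = upper + upper.T

# A symmetric matrix equals its own transpose.
assert np.array_equal(symmetric, symmetric.T)
print(symmetric)
```

Going through the transpose avoids depending on the order in which the cells of each triangle are enumerated.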
You can do it as follows. Since no language was specified, I wrote the code in Python:
# I assume that your data is in a pandas DataFrame called df
import numpy as np
import pandas as pd

# df = ...  (load your data here)
list_of_labels = ['A', 'B', 'C', 'D']
nb_labels = len(list_of_labels)
similarity = np.zeros((nb_labels, nb_labels))
for l1, l2, val in zip(df['col_1'], df['col_2'], df['value']):
    i = list_of_labels.index(l1)
    j = list_of_labels.index(l2)
    similarity[i, j] = val
    similarity[j, i] = val  # mirror the value so that the matrix is symmetric
similarity_df = pd.DataFrame(data=similarity, index=list_of_labels, columns=list_of_labels)
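With around 10,000 rows, and possibly many more than four labels, calling list.index() inside the loop rescans the label list on every lookup. A possible vectorised variant is sketched below. It is only a sketch under some assumptions: the labels are derived from the data rather than hard-coded, the diagonal is left as NaN (the question marks it with X), and the sample data stands in for the real DataFrame. The helper name pos is made up for the example.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; replace with the real DataFrame.
df = pd.DataFrame({
    "col_1": ["A", "A", "B", "A"],
    "col_2": ["B", "C", "C", "D"],
    "value": [2.1, 1.3, 4.6, 1.4],
})

# Collect every label that occurs in either column, in sorted order.
labels = sorted(set(df["col_1"]) | set(df["col_2"]))

# Map each label to its row/column position with a dict (constant-time lookups).
pos = {label: k for k, label in enumerate(labels)}
i = df["col_1"].map(pos).to_numpy()
j = df["col_2"].map(pos).to_numpy()
values = df["value"].to_numpy()

# Start from NaN so that pairs missing from the data stay visibly empty.
similarity = np.full((len(labels), len(labels)), np.nan)
similarity[i, j] = values  # fill the (col_1, col_2) cells
similarity[j, i] = values  # mirror them so that the matrix is symmetric

similarity_df = pd.DataFrame(similarity, index=labels, columns=labels)
print(similarity_df)
```

If the same pair occurs more than once, the last value written wins in this sketch.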