How to generate a grouping variable based on correlation?
library(magrittr)
library(dplyr)
V1 <- c("A","A","A","A","A","A","B","B","B","B", "B","B","C","C","C","C","C","C","D","D","D","D","D","D","E","E","E","E","E","E")
V2 <- c("A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F")
cor <- c(1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.9)
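# stringsAsFactors = TRUE keeps V1/V2 as factors (the default before R 4.0),
# which the droplevels() call further down relies on
df <- data.frame(V1, V2, cor, stringsAsFactors = TRUE)
# exclude rows where cor is NA
df <- df[complete.cases(df), ]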
This is the full data frame; cor = NA means a correlation smaller than 0.8.
df
V1 V2 cor
1 A A 1.0
2 A B 0.8
7 B A 0.8
8 B B 1.0
15 C C 1.0
16 C D 0.8
21 D C 0.8
22 D D 1.0
29 E E 1.0
30 E F 0.9
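In the df above, F does not appear in V1, which means F is not of interest.
So here I remove the rows where V2 = F (in general, the rows where V2 takes a value that does not occur in V1).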
V1.LIST <- unique(df$V1)
df.gp <- df[which(df$V2 %in% V1.LIST),]
df.gp
V1 V2 cor
1 A A 1.0
2 A B 0.8
7 B A 0.8
8 B B 1.0
15 C C 1.0
16 C D 0.8
21 D C 0.8
22 D D 1.0
29 E E 1.0
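The same filtering step can also be written with dplyr, which the question already loads. This is only a sketch of an equivalent form, not the asker's own code; it rebuilds the ten non-NA rows by hand so that it runs on its own, and it assumes V1 and V2 are factors.

```r
library(dplyr)

# Rebuild the non-NA rows of the example data from the question
df <- data.frame(
  V1  = c("A", "A", "B", "B", "C", "C", "D", "D", "E", "E"),
  V2  = c("A", "B", "A", "B", "C", "D", "C", "D", "E", "F"),
  cor = c(1, 0.8, 0.8, 1, 1, 0.8, 0.8, 1, 1, 0.9),
  stringsAsFactors = TRUE
)

# Keep only rows whose V2 value also occurs in V1,
# then drop factor levels that are no longer used (here: F)
df.gp <- df %>%
  filter(V2 %in% V1) %>%
  droplevels()
```

Calling droplevels() on the whole data frame removes unused levels from every factor column at once, so a separate call on V2 is not needed in this variant.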
So now df.gp is the data set I need to work with.
I drop the unused levels in V2 (F in this example):
df.gp$V2 <- droplevels(df.gp$V2)
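I do not want to exclude the self-correlation rows, because some V1 may not be correlated with any other variable, and I want each such variable to end up in a group of its own.
Looking at cor, A and B belong together, C and D belong together, and E forms a group by itself.
So the example here should yield three groups.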
I think working on the data directly as a data.frame might complicate things. I took the liberty of converting it back to a matrix.
library(reshape2)
# reshape to wide form: one row per V1, one column per V2, cells hold cor
wide <- dcast(data = df, formula = V1 ~ V2, value.var = "cor")
cormat <- as.matrix(wide[, -1])
row.names(cormat) <- as.character(wide$V1)
cormat
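Once you have the correlation matrix, it is easy to see which column indices hold non-NA values in each row, and which variables share them.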
a <- apply(cormat, 1, function(x) which(!is.na(x)))
a <- data.frame(t(a))
a$var <- row.names(a)
row.names(a) <- NULL
a
X1 X2 var
1 1 2 A
2 1 2 B
3 3 4 C
4 3 4 D
5 5 6 E
Now either X1 or X2 identifies your unique groups.
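To turn those indices into an explicit group label, one option is to renumber the distinct values of X1. The sketch below rebuilds the printed table by hand so it runs on its own; the column name group is made up for illustration. It only works when all members of a group share exactly the same non-NA columns, as they do in this example.

```r
# Index table as printed above, rebuilt by hand so the block runs alone
a <- data.frame(
  X1  = c(1, 1, 3, 3, 5),
  X2  = c(2, 2, 4, 4, 6),
  var = c("A", "B", "C", "D", "E")
)

# Rows sharing the same X1 belong together; relabel them as 1, 2, 3, ...
a$group <- match(a$X1, unique(a$X1))

# List the members of each group
members <- split(a$var, a$group)
members
```

Here A and B get group 1, C and D get group 2, and E gets group 3.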
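Assuming we have already selected the rows with cor >= a, where a is the threshold (0.8 in the question above), the script in this answer is one possible solution.
Using cutree and hclust, we can set the threshold in the script (i.e. h = 0.8) as below.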
cor.gp <- data.frame(cor.gp =
cutree(hclust(1 - as.dist(xtabs(cor ~ V1 + V2, df.gp))), h = 0.8))
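For readers who want to see what that one-liner does step by step, here is a self-contained sketch on the filtered example data. The intermediate names cormat, tree and groups, and the group column, are introduced only for illustration.

```r
# Filtered data from the question: pairs with cor >= 0.8, F already removed
df.gp <- data.frame(
  V1  = c("A", "A", "B", "B", "C", "C", "D", "D", "E"),
  V2  = c("A", "B", "A", "B", "C", "D", "C", "D", "E"),
  cor = c(1, 0.8, 0.8, 1, 1, 0.8, 0.8, 1, 1)
)

# Square V1 x V2 table of correlations; pairs that are not listed become 0
cormat <- xtabs(cor ~ V1 + V2, df.gp)

# Turn correlation into a distance (1 - cor) and cluster hierarchically
tree <- hclust(1 - as.dist(cormat))

# Cut the tree at height 0.8 on the distance scale
groups <- cutree(tree, h = 0.8)

# Attach the group id to every row by looking up its V1 value
df.gp$group <- unname(groups[as.character(df.gp$V1)])
```

Note that h is a height on the 1 - cor scale. Pairs that were filtered out sit at distance 1, while the retained pairs sit at distance 0.2, so any cut between those two values separates the example into the three expected groups.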