How to generate a grouping variable based on correlation?
library(magrittr)
library(dplyr)
V1 <- c("A","A","A","A","A","A","B","B","B","B", "B","B","C","C","C","C","C","C","D","D","D","D","D","D","E","E","E","E","E","E")
V2 <- c("A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F")
cor <- c(1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.9)
df <- data.frame(V1,V2,cor)
# exclude rows where cor is NA
df <- df[complete.cases(df), ]
This is the full data frame; cor = NA represents a correlation smaller than 0.8.
df
   V1 V2 cor
1   A  A 1.0
2   A  B 0.8
7   B  A 0.8
8   B  B 1.0
15  C  C 1.0
16  C  D 0.8
21  D  C 0.8
22  D  D 1.0
29  E  E 1.0
30  E  F 0.9
In the above df, F is not in V1, meaning that F is not of interest.
So here I remove the rows where V2 = F (more generally, rows where V2 takes a value that is not in V1).
V1.LIST <- unique(df$V1)
df.gp <- df[which(df$V2 %in% V1.LIST),]
df.gp
   V1 V2 cor
1   A  A 1.0
2   A  B 0.8
7   B  A 0.8
8   B  B 1.0
15  C  C 1.0
16  C  D 0.8
21  D  C 0.8
22  D  D 1.0
29  E  E 1.0
So now, df.gp is the dataset I need to work on.
I drop the unused level in V2 (which is F in the example).
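# as.factor() is needed in R >= 4.0, where data.frame() no longer converts strings to factors
df.gp$V2 <- droplevels(as.factor(df.gp$V2))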
I do not want to exclude the self-correlations (rows where V1 equals V2): some of the V1 may not be correlated with any other variable, and I would like to put each of those in a separate group.
By looking at cor, A and B are correlated, C and D are correlated, and E belongs to a group by itself.
Therefore, the example here should have three groups.
The way I see this, you may have complicated things by working your data straight into a data.frame. I took the liberty of transforming it back to a matrix.
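library(reshape2)
# Cast to wide format; value.var names the column that fills the cells
wide <- dcast(data = df, formula = V1 ~ V2, value.var = "cor")
# Keep only the numeric columns so the matrix stays numeric
cormat <- as.matrix(wide[, -1])
row.names(cormat) <- as.character(wide$V1)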
cormat
Once you have the correlation matrix, it is easy to see which indices of non-NA values each variable shares with the other variables.
a <- apply(cormat, 1, function(x) which(!is.na(x)))
a <- data.frame(t(a))
a$var <- row.names(a)
row.names(a) <- NULL
a
  X1 X2 var
1  1  2   A
2  1  2   B
3  3  4   C
4  3  4   D
5  5  6   E
Now either X1 or X2 determines your unique groupings.
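As a minimal sketch of turning that into an explicit grouping variable, the first non-NA index can be recoded to consecutive group numbers. The data frame a is re-created here from the output shown above so the snippet runs on its own; the column name group is an assumption and not part of the original answer.

```r
# Re-create the table printed above (values copied from that output)
a <- data.frame(X1  = c(1, 1, 3, 3, 5),
                X2  = c(2, 2, 4, 4, 6),
                var = c("A", "B", "C", "D", "E"))

# Recode X1 to consecutive group ids (hypothetical column name `group`)
a$group <- as.integer(factor(a$X1))

# One row per variable with its group id
a[, c("var", "group")]
```

This relies on every member of a group sharing the same first non-NA index, which holds in this example but may not hold when correlations chain (A with B, B with C, but not A with C).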
The above script is a possible solution, assuming we have already selected the rows with cor >= a, where a is a threshold (taken as 0.8 in the question).
By using cutree and hclust, we can set the threshold in the script (i.e. h = 0.8), as below.
cor.gp <- data.frame(cor.gp =
cutree(hclust(1 - as.dist(xtabs(cor ~ V1 + V2, df.gp))), h = 0.8))
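The one-liner above can be unpacked step by step. The sketch below rebuilds df.gp from the rows shown in the question so that it runs on its own; the intermediate names cormat, tree and groups are assumptions introduced for readability. Note that h is a height on the 1 - cor distance scale, so h = 0.8 cuts the tree below the distance of 1 that xtabs assigns to pairs absent from df.gp.

```r
# Filtered data from the question: pairs with cor >= 0.8, V2 restricted to values in V1
df.gp <- data.frame(
  V1  = c("A", "A", "B", "B", "C", "C", "D", "D", "E"),
  V2  = c("A", "B", "A", "B", "C", "D", "C", "D", "E"),
  cor = c(1, 0.8, 0.8, 1, 1, 0.8, 0.8, 1, 1)
)

# Square table of correlations; pairs that do not appear in df.gp become 0
cormat <- xtabs(cor ~ V1 + V2, df.gp)

# Turn correlations into distances (1 - cor) and cluster hierarchically
tree <- hclust(1 - as.dist(cormat))

# Cut the tree at height 0.8 on the distance scale
groups <- cutree(tree, h = 0.8)
groups
```

Here A and B land in one group, C and D in a second, and E in a third, matching the three groups asked for in the question.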