
Grouping variable for ordered elements using adjacent correlation in R

I have a variable "markr" whose elements are arranged in order. The correlation between each pair of consecutive elements of "markr" is given in the variable corr.

markr <- c("A", "B", "C", "D", "E",  "g", "A1", "B1", "cc", "dd", 
     "f", "gg", "h", "K")
corr <- c(     1,   1,   1,   1, 0.96,   0.5,  0.96,        1 ,   1 ,  
       1 ,  0.85, 0.99, 1)

I need to group markr based on corr without changing the order of the members of markr. The grouping is better explained by the following diagram:

[image: diagram of the grouping]

Members of the object markr whose corr is greater than 0.95 belong to one group. Starting from the first value, when corr drops below 0.95 a second group starts and continues until corr drops below 0.95 again; the process continues to the end of the data. Each group is named after its first and last members, for example A-g, A1-f, gg-k.

Thus the expected output is:

markr <- c("A", "B", "C", "D", "E",  "g", 
           "A1", "B1", "cc", "dd", "f", 
           "gg", "h", "K")
group <- c("A-g", "A-g", "A-g", "A-g","A-g", "A-g", 
           "A1-f",  "A1-f",  "A1-f",  "A1-f","A1-f", 
            "gg-k", "gg-k", "gg-k")
dataf <- data.frame (markr, group) 

dataf 

 markr group
1      A   A-g
2      B   A-g
3      C   A-g
4      D   A-g
5      E   A-g
6      g   A-g
7     A1  A1-f
8     B1  A1-f
9     cc  A1-f
10    dd  A1-f
11     f  A1-f
12    gg  gg-k
13     h  gg-k
14     K  gg-k

How can I automate this process? I have a very large dataset of this kind.

The group number of an element is one plus the number of corr values below 0.95 seen so far:

d1 <- data.frame(
  marker = markr,
  group = cumsum(c(1, corr < .95))
)
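To see why this works, it can help to look at the intermediate vectors. The following is a small sketch on the question's data; the names is_break and group_id are introduced here for illustration only and are not part of the original answer.

```r
markr <- c("A", "B", "C", "D", "E", "g", "A1", "B1", "cc", "dd",
           "f", "gg", "h", "K")
corr <- c(1, 1, 1, 1, 0.96, 0.5, 0.96, 1, 1, 1, 0.85, 0.99, 1)

# corr[i] is the correlation between markr[i] and markr[i + 1],
# so a low corr[i] means that markr[i + 1] opens a new group.
is_break <- corr < 0.95

# Prepend 1 for the first marker, then count the breaks seen so far.
group_id <- cumsum(c(1, is_break))

print(group_id)
# 1 1 1 1 1 1 2 2 2 2 2 3 3 3
```

Because corr has one element fewer than markr, the leading 1 both aligns the lengths and makes the numbering start at 1.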

For the group names, you can use ddply to cut the data.frame into pieces, one per group; it is then easy to extract the first and last element of each piece.

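library(plyr)
# One row per group: paste the first and last marker together.
d2 <- ddply(
  d1, "group", summarize,
  group_name = paste(head(marker, 1), tail(marker, 1), sep = "-")
)
# Attach the group name to every row of the original data.
d <- merge(d1, d2, by = "group")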
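If you would rather avoid the plyr dependency, the same labels can probably be built in base R. The sketch below swaps in ave() for the ddply and merge steps; this is an alternative technique, not the method given in the answer, and the names group_id and group_name are chosen here for illustration. Note that the label is built from the markers as spelled, so the last group comes out as "gg-K" rather than the "gg-k" written in the question.

```r
markr <- c("A", "B", "C", "D", "E", "g", "A1", "B1", "cc", "dd",
           "f", "gg", "h", "K")
corr <- c(1, 1, 1, 1, 0.96, 0.5, 0.96, 1, 1, 1, 0.85, 0.99, 1)

# Same group numbering as in the answer.
group_id <- cumsum(c(1, corr < 0.95))

# ave() applies FUN within each group and returns a vector aligned
# with its input, so the original order of markr is kept.
group_name <- ave(
  markr, group_id,
  FUN = function(x) paste(x[1], x[length(x)], sep = "-")
)

dataf <- data.frame(markr, group = group_name, stringsAsFactors = FALSE)
print(dataf)
```

Since ave() never reorders rows, no merge step is needed, which may also matter for a very large dataset.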
