How can I calculate the correlation between groups in R using dplyr?

Let's say I have a data frame in R that looks like this:

library(dplyr)  # also re-exports tibble()

var = c(rep("A",3),rep("B",3),rep("C",3),rep("D",3),rep("E",3))
y = rnorm(15)
data = tibble(var,y);data

With output:

# A tibble: 15 x 2
   var        y
   <chr>  <dbl>
 1 A     -1.23 
 2 A     -0.983
 3 A      1.28 
 4 B     -0.268
 5 B     -0.460
 6 B     -1.23 
 7 C      1.87 
 8 C      0.416
 9 C     -1.99 
10 D      0.289
11 D      1.70 
12 D     -0.455
13 E     -0.648
14 E      0.376
15 E     -0.887

I want to calculate the correlation of each distinct pair of groups in R using dplyr. Ideally the result would look like this (the third column contains the correlation for each pair):

var1  var2  value
A     B     cor(A,B)
A     C     cor(A,C)
A     D     cor(A,D)
A     E     cor(A,E)
B     C     cor(B,C)
B     D     cor(B,D)
B     E     cor(B,E)
C     D     cor(C,D)
C     E     cor(C,E)
D     E     cor(D,E)

How can I do that in R? Any help?

Additional

Suppose I have another grouping variable, say group2:

var2 = c(rep("A",3),rep("B",3),rep("C",3),rep("D",3),rep("E",3),rep("F",3),
        rep("H",3),rep("I",3))

y2 = rnorm(24)
group2 = c(rep(1,6),rep(2,6),rep(3,6),rep(1,6))
data2 = tibble(var2,group2,y2);data2

The result should ideally look like this:

group  var1  var2  value
1      A     B     cor(A,B)
1      A     H     cor(A,H)
1      A     I     cor(A,I)
1      B     H     cor(B,H)
1      B     I     cor(B,I)
1      H     I     cor(H,I)
2      C     D     cor(C,D)
3      E     F     cor(E,F)

How can I calculate the correlation for each pair of variables in column var2 within each group of group2?

Another possible solution:

library(tidyverse)

# df is assumed to be the question's data frame (columns var and y)
df %>% 
  group_by(var) %>% 
  group_map(~ data.frame(.x) %>% set_names(.y)) %>% 
  bind_cols %>% cor %>% 
  {data.frame(row=rownames(.)[row(.)[upper.tri(.)]], 
              col=colnames(.)[col(.)[upper.tri(.)]], 
              corr=.[upper.tri(.)])}

#>    row col       corr
#> 1    A   B -0.9949738
#> 2    A   C -0.9574502
#> 3    B   C  0.9815368
#> 4    A   D -0.7039708
#> 5    B   D  0.6293137
#> 6    C   D  0.4690460
#> 7    A   E -0.5755463
#> 8    B   E  0.4907660
#> 9    C   E  0.3150499
#> 10   D   E  0.9859711

Here is a one-liner via base R:

data.frame(t(combn(unique(data$var), 2, function(i)
                     list(v1 = i[[1]], 
                          v2 = i[[2]], 
                          value = cor(data$y[data$var %in% i[[1]]], 
                                      data$y[data$var %in% i[[2]]])))))

   X1 X2         X3
1   A  B   0.997249
2   A  C  0.7544987
3   A  D -0.7924587
4   A  E 0.03567887
5   B  C  0.8010711
6   B  D -0.7450683
7   B  E  0.1096579
8   C  D -0.1976141
9   C  E  0.6828033
10  D  E  0.5812632
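
The values here differ from the previous answer's, presumably because the question builds y with rnorm() and no seed, so each run draws new numbers. A minimal sketch of making the example reproducible, assuming an arbitrary seed of 123 and a hypothetical helper make_data() (neither is from the original post):

```r
# Hypothetical helper: rebuild the question's data with a fixed seed
make_data <- function(seed) {
  set.seed(seed)  # fixes the random number stream so rnorm() repeats
  data.frame(var = rep(c("A", "B", "C", "D", "E"), each = 3),
             y   = rnorm(15))
}

first  <- make_data(123)
second <- make_data(123)

identical(first, second)  # TRUE: same seed, same draws
```

With a shared seed, every answer run against this data would report the same correlations.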

1) Add an index column 1, 2, 3, 1, 2, 3, ... and then use read.zoo to convert from long to wide. Take the correlation, reshape it back to long form using as.data.frame.table, and filter out the desired rows.

library(dplyr)
library(zoo)

DF %>%
  mutate(index = sequence(rle(var)$lengths)) %>%
  read.zoo(index = "index", split = "var") %>%
  cor %>%
  as.data.frame.table(responseName = "cor") %>%
  filter(format(Var1) < format(Var2))
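
To make the intermediate steps in 1) concrete, here is a small base R sketch with made-up numbers (the toy vectors are illustrative, not the question's data). It shows what the index column looks like and how as.data.frame.table plus the filter keeps each pair only once.

```r
# Index column: position of each row within its run of identical var values
var   <- rep(c("A", "B", "C"), each = 3)
index <- sequence(rle(var)$lengths)   # 1 2 3 1 2 3 1 2 3

# Wide form: one column per group, rows aligned by index (toy values)
wide <- cbind(A = c(1, 2, 3), B = c(1, 3, 2), C = c(3, 2, 1))
m    <- cor(wide)                     # 3 x 3 correlation matrix

# Long form: one row per (Var1, Var2) cell of the matrix, 9 rows in total
long <- as.data.frame.table(m, responseName = "cor")

# Var1 and Var2 are factors, which do not support "<";
# format() turns them into character strings that can be compared.
pairs <- long[format(long$Var1) < format(long$Var2), ]
pairs
```

The filter drops the diagonal (A with A) and the mirrored duplicates (B with A), leaving the three distinct pairs.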

2) At the expense of one more line of code, we can substitute pivot_wider for read.zoo.

library(dplyr)
library(tidyr)

DF %>%
  mutate(index = sequence(rle(var)$lengths)) %>%
  pivot_wider(id_cols = index, names_from = "var", values_from = "y") %>%
  select(-index) %>%
  cor %>%
  as.data.frame.table(responseName = "cor") %>%
  filter(format(Var1) < format(Var2))

3) A base solution consists of using combn to get the pairs of var and then applying the function f shown below to each pair.

co <- combn(unique(DF$var), 2)
f <- function(v) with(DF, data.frame(t(v), cor = cor(y[var==v[1]], y[var==v[2]])))
do.call("rbind", apply(co, 2, f))

Note

The input in reproducible form:

DF <-
structure(list(var = c("A", "A", "A", "B", "B", "B", "C", "C", 
"C", "D", "D", "D", "E", "E", "E"), y = c(-1.23, -0.983, 1.28, 
-0.268, -0.46, -1.23, 1.87, 0.416, -1.99, 0.289, 1.7, -0.455, 
-0.648, 0.376, -0.887)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15"))
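
For the grouped follow-up in the question (data2 with group2), one possible approach is to compute the pairwise correlations separately within each group. A minimal sketch, assuming dplyr's group_modify() and a hypothetical helper pair_cor() that is not from the original answers; the seed is arbitrary:

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
data2 <- tibble(
  var2   = rep(c("A", "B", "C", "D", "E", "F", "H", "I"), each = 3),
  group2 = c(rep(1, 6), rep(2, 6), rep(3, 6), rep(1, 6)),
  y2     = rnorm(24)
)

# Hypothetical helper: all distinct pairs of var2 within one group's rows
pair_cor <- function(d) {
  pairs <- combn(unique(d$var2), 2)   # 2 x n_pairs character matrix
  data.frame(
    var1  = pairs[1, ],
    var2  = pairs[2, ],
    value = apply(pairs, 2, function(p)
      cor(d$y2[d$var2 == p[1]], d$y2[d$var2 == p[2]]))
  )
}

result <- data2 %>%
  group_by(group2) %>%
  group_modify(~ pair_cor(.x)) %>%    # run the helper once per group
  ungroup()

result
```

This relies on every variable within a group having the same number of observations, since cor() pairs values by position.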
