How can I calculate the correlation between groups in R using dplyr?
Let's say I have a data frame in R that looks like this:
library(dplyr)  # re-exports tibble()
var = c(rep("A",3),rep("B",3),rep("C",3),rep("D",3),rep("E",3))
y = rnorm(15)
data = tibble(var,y);data
With output:
# A tibble: 15 x 2
var y
<chr> <dbl>
1 A -1.23
2 A -0.983
3 A 1.28
4 B -0.268
5 B -0.460
6 B -1.23
7 C 1.87
8 C 0.416
9 C -1.99
10 D 0.289
11 D 1.70
12 D -0.455
13 E -0.648
14 E 0.376
15 E -0.887
I want to calculate the correlation of each distinct pair of var values in R using dplyr. Ideally the result should look like this (the third column contains the correlation for each pair):
var1 | var2 | value |
---|---|---|
A | B | cor(A,B) |
A | C | cor(A,C) |
A | D | cor(A,D) |
A | E | cor(A,E) |
B | C | cor(B,C) |
B | D | cor(B,D) |
B | E | cor(B,E) |
C | D | cor(C,D) |
C | E | cor(C,E) |
D | E | cor(D,E) |
How can I do that in R? Any help?
Additional
If I have another grouping variable, say group2:
var2 = c(rep("A",3),rep("B",3),rep("C",3),rep("D",3),rep("E",3),rep("F",3),
rep("H",3),rep("I",3))
y2 = rnorm(24)
group2 = c(rep(1,6),rep(2,6),rep(3,6),rep(1,6))
data2 = tibble(var2,group2,y2);data2
Ideally, the result should look like this:
group | var1 | var2 | value |
---|---|---|---|
1 | A | B | cor(A,B) |
1 | A | H | cor(A,H) |
1 | A | I | cor(A,I) |
1 | B | H | cor(B,H) |
1 | B | I | cor(B,I) |
1 | H | I | cor(H,I) |
2 | C | D | cor(C,D) |
3 | E | F | cor(E,F) |
How can I calculate the correlation for each pair of variables in column var2 within each group of group2?
Another possible solution:
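library(tidyverse)
# `data` is the tibble from the question (written as `df` in the original answer)
data %>%
  group_by(var) %>%
  # one single-column data frame per group, named after the group
  group_map(~ data.frame(.x) %>% set_names(.y)) %>%
  bind_cols %>% cor %>%
  # keep the upper triangle of the correlation matrix as (row, col, corr)
  {data.frame(row=rownames(.)[row(.)[upper.tri(.)]],
              col=colnames(.)[col(.)[upper.tri(.)]],
              corr=.[upper.tri(.)])}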
#> row col corr
#> 1 A B -0.9949738
#> 2 A C -0.9574502
#> 3 B C 0.9815368
#> 4 A D -0.7039708
#> 5 B D 0.6293137
#> 6 C D 0.4690460
#> 7 A E -0.5755463
#> 8 B E 0.4907660
#> 9 C E 0.3150499
#> 10 D E 0.9859711
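A quick way to sanity-check a single entry of a table like the one above is to compute one pair by hand. The sketch below assumes the three values per group are the rounded ones printed in the question (the names `a` and `b` are only illustrative); with those values the A/B correlation comes out at roughly -0.995.

```r
# Values for groups A and B, copied from the printed tibble in the question.
a <- c(-1.23, -0.983, 1.28)
b <- c(-0.268, -0.460, -1.23)

# Pearson correlation of the two three-element vectors.
ab <- cor(a, b)
ab
```

Because `y` is generated with `rnorm()` and no seed, a fresh run of the question's code will give different numbers; only the fixed values above reproduce this result.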
Here is a one-liner via base R
data.frame(t(combn(unique(data$var), 2, function(i)
list(v1 = i[[1]],
v2 = i[[2]],
value = cor(data$y[data$var %in% i[[1]]],
data$y[data$var %in% i[[2]]])))))
X1 X2 X3
1 A B 0.997249
2 A C 0.7544987
3 A D -0.7924587
4 A E 0.03567887
5 B C 0.8010711
6 B D -0.7450683
7 B E 0.1096579
8 C D -0.1976141
9 C E 0.6828033
10 D E 0.5812632
1) Add an index column 1, 2, 3, 1, 2, 3, ... and then use read.zoo to convert from long to wide. Take the correlation, reshape back to long form using as.data.frame.table and filter out the desired rows.
library(dplyr)
library(zoo)
DF %>%
mutate(index = sequence(rle(var)$lengths)) %>%
read.zoo(index = "index", split = "var") %>%
cor %>%
as.data.frame.table(responseName = "cor") %>%
filter(format(Var1) < format(Var2))
2) At the expense of one more line of code we can substitute pivot_wider for read.zoo.
library(dplyr)
library(tidyr)
DF %>%
mutate(index = sequence(rle(var)$lengths)) %>%
pivot_wider(id_cols = index, names_from = "var", values_from = "y") %>%
select(-index) %>%
cor %>%
as.data.frame.table(responseName = "cor") %>%
filter(format(Var1) < format(Var2))
3) A base R solution uses combn to get the pairs of var values and applies the function f to each pair.
co <- combn(unique(DF$var), 2)
f <- function(v) with(DF, data.frame(t(v), cor = cor(y[var==v[1]], y[var==v[2]])))
do.call("rbind", apply(co, 2, f))
The input in reproducible form.
DF <-
structure(list(var = c("A", "A", "A", "B", "B", "B", "C", "C",
"C", "D", "D", "D", "E", "E", "E"), y = c(-1.23, -0.983, 1.28,
-0.268, -0.46, -1.23, 1.87, 0.416, -1.99, 0.289, 1.7, -0.455,
-0.648, 0.376, -0.887)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"))
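The grouped follow-up in the question (pairs of var2 within each group2) can be handled by applying the same combn idea once per group. The following is a minimal sketch, not a definitive implementation: the helper name `pair_cor` and the `set.seed(1)` call are assumptions added for illustration, and the data is rebuilt to match the shape of data2 from the question.

```r
library(dplyr)

# Data shaped like data2 in the question; the seed is an assumption,
# added only so the sketch gives the same numbers on every run.
set.seed(1)
data2 <- tibble(
  var2   = rep(c("A", "B", "C", "D", "E", "F", "H", "I"), each = 3),
  group2 = rep(c(1, 2, 3, 1), each = 6),
  y2     = rnorm(24)
)

# Hypothetical helper: all pairwise correlations inside one group.
# `d` holds the rows of a single group; the group key arrives via `...`.
pair_cor <- function(d, ...) {
  pairs <- combn(unique(d$var2), 2)  # 2 x n_pairs character matrix
  tibble(
    var1  = pairs[1, ],
    var2  = pairs[2, ],
    value = apply(pairs, 2, function(p) {
      cor(d$y2[d$var2 == p[1]], d$y2[d$var2 == p[2]])
    })
  )
}

# Apply the helper to each group and stack the results.
result <- data2 %>%
  group_by(group = group2) %>%
  group_modify(pair_cor) %>%
  ungroup()

result
```

With this layout, group 1 contains A, B, H and I (six pairs) while groups 2 and 3 contain one pair each, so the result has eight rows, matching the table in the question. The sketch assumes every group has at least two variables with the same number of observations; combn fails for a group with a single variable.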