R: correlation between pairs of variables

Question

I have a dataframe that looks like this:

ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4

There are actually more columns, but the idea is that there are many traits (such as bmi, height, and IQ in the example above), followed by the same number of columns holding the standardized residuals after regressing some variables out (bmi.residuals, height.residuals, and IQ.residuals in the example above). I want to create an object with the correlation between each trait and its residuals, which will look like this:

trait correlation 
bmi 0.85
height 0.90
IQ 0.75

Here the correlation for "bmi" is the correlation between bmi and bmi.residuals, the correlation for "height" is the correlation between height and height.residuals, the correlation for "IQ" is the correlation between IQ and IQ.residuals, and so on.

I could compute the correlations one by one, but there must be a way to automate this when the dataframe has many columns (many traits). Any ideas how? I suspect lapply could come in handy, but I am not sure how.

Answer 1

Another solution, using dplyr and tidyr. The idea is to compute all correlations first, as this is simple and fast enough, then reshape the result into a dataset and keep only the rows where one variable name matches the other but the two are not identical:

df = read.table(text = "
ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4
", header=T)

library(dplyr)
library(tidyr)

# function to use later (to filter out rows)
f = function(x,y) grepl(x,y)
f = Vectorize(f)


df %>% 
  select(-ID) %>%                # remove unnecessary columns
  cor() %>%                      # get all correlations (even ones you don't care about)
  data.frame() %>%               # save result as a dataframe
  mutate(v1 = row.names(.)) %>%  # add row names as a column
  gather(v2,cor, -v1) %>%        # reshape data
  filter(f(v1,v2) & v1 != v2)    # keep pairs that v1 matches v2, but are not the same

#       v1               v2           cor
# 1    bmi    bmi.residuals -3.248544e-17
# 2 height height.residuals -7.837838e-01
# 3     IQ     IQ.residuals  4.487375e-01
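
If dplyr and tidyr are not available, the same idea (compute the full correlation matrix, then keep only the trait/residual pairs) can be sketched in base R. This is only a minimal sketch, assuming every residual column is named exactly `<trait>.residuals`; the names `resid_cols`, `traits` and `result` are introduced here for illustration.

```r
df <- read.table(text = "
ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4
", header = TRUE)

# full correlation matrix of all columns except ID
m <- cor(df[, names(df) != "ID"])

# residual columns, and the trait each one belongs to (assumed naming scheme)
resid_cols <- grep("\\.residuals$", colnames(m), value = TRUE)
traits <- sub("\\.residuals$", "", resid_cols)

# a two-column character matrix selects (row name, column name) pairs
result <- data.frame(
  trait = traits,
  correlation = m[cbind(traits, resid_cols)],
  stringsAsFactors = FALSE
)
result
```

Indexing the matrix by name pairs avoids the reshape and filter steps, and the result already has the trait/correlation layout asked for in the question.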

Another way is to spot the pairs of interest first and then compute only those correlations:

df = read.table(text = "
ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4
", header=T)

library(dplyr)
library(tidyr)

# function to use later (to filter out rows)
f = function(x,y) grepl(x,y)
f = Vectorize(f)

# keep only columns that you want to get correlations
df2 = df %>% select(-ID)

# function to get cor between two variables (columns of df2, looked up by name)
f2 = function(x,y) cor(df2[,x], df2[,y])
f2 = Vectorize(f2)

# stringsAsFactors = FALSE keeps the names as character, so df2[,x] indexes by name
expand.grid(v1=names(df2), v2=names(df2), stringsAsFactors=FALSE) %>%  # get all possible combinations of names
  filter(f(v1,v2) & v1 != v2) %>%              # keep pairs of names where v1 matches v2, but are not the same
  mutate(cor = f2(v1,v2))                      # for those pairs (only) obtain correlation value

#       v1               v2           cor
# 1    bmi    bmi.residuals -3.248544e-17
# 2 height height.residuals -7.837838e-01
# 3     IQ     IQ.residuals  4.487375e-01

I'd suggest you pick the faster one, as the number of rows and columns you have might affect the speed of the above approaches.
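
Which of the two is faster depends on the data, so it is worth measuring. Below is a rough base-R sketch of such a comparison on simulated data, using `system.time()`. The sizes `n_rows` and `n_traits`, the `trait1`, `trait2`, ... column names and the simulated values are all assumptions made up for illustration, and the two timed blocks are base-R stand-ins for the two strategies rather than the dplyr code itself.

```r
set.seed(1)
n_rows <- 1000   # hypothetical number of observations
n_traits <- 50   # hypothetical number of traits

# simulated traits and matching ".residuals" columns
traits <- matrix(rnorm(n_rows * n_traits), n_rows, n_traits,
                 dimnames = list(NULL, paste0("trait", seq_len(n_traits))))
resids <- traits + matrix(rnorm(n_rows * n_traits), n_rows, n_traits)
colnames(resids) <- paste0(colnames(traits), ".residuals")
sim <- data.frame(traits, resids)

trait_names <- colnames(traits)
resid_names <- colnames(resids)

# strategy 1: all pairwise correlations, then pick the pairs of interest
t_all <- system.time({
  m <- cor(sim)
  r_all <- m[cbind(trait_names, resid_names)]
})

# strategy 2: compute only the correlations of interest
t_pairs <- system.time({
  r_pairs <- mapply(function(x, y) cor(sim[[x]], sim[[y]]),
                    trait_names, resid_names)
})

# elapsed seconds for each strategy
c(all_pairs = t_all[["elapsed"]], selected_pairs = t_pairs[["elapsed"]])

# both strategies should give the same correlations
all.equal(r_all, unname(r_pairs))
```

Timings from a single run on small data are noisy, so repeat the measurement or increase the sizes before drawing conclusions.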

Answer 2

Maybe this will work for you:

bmi <- c(26, 27, 23)
height <- c(187, 176, 189)

bmi.residuals <- c(0.1, 0.3, 0.4)
height.residuals <- c(0.3, 0.2, 0.1)

df <- data.frame(bmi, height, bmi.residuals, height.residuals)

corr_df <- data.frame(cor(df))

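# trait names = all columns that are not residual columns
# (named trait_names to avoid masking base::names)
trait_names <- colnames(df)
trait_names <- trait_names[!grepl("residuals", trait_names)]

cors <- data.frame(
  traits = character(length(trait_names)),
  correlation = numeric(length(trait_names)),
  stringsAsFactors = FALSE
)

for (i in seq_along(trait_names)) {
  cors$traits[i] <- trait_names[i]
  # the second column matching the trait name is its .residuals column
  cors$correlation[i] <-
    corr_df[i, which(grepl(trait_names[i], names(corr_df)))[2]]
}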

Input:

> df
  bmi height bmi.residuals height.residuals
1  26    187           0.1              0.3
2  27    176           0.3              0.2
3  23    189           0.4              0.1

The correlation matrix:

> corr_df
                        bmi      height bmi.residuals height.residuals
bmi               1.0000000 -0.78920304   -0.57655666        0.7205767
height           -0.7892030  1.00000000   -0.04676098       -0.1428571
bmi.residuals    -0.5765567 -0.04676098    1.00000000       -0.9819805
height.residuals  0.7205767 -0.14285714   -0.98198051        1.0000000

Output:

> cors
  traits correlation
1    bmi  -0.5765567
2 height  -0.1428571

Beware that this will only work if the original columns come before the .residuals columns.
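
One possible way around the column-order caveat is to look each pair up by row and column name instead of by position. The following is a sketch under the assumption that every trait `x` has a column named exactly `x.residuals`; the residual columns are deliberately placed first to show that the order no longer matters, and `corr_mat` and `trait_names` are names chosen for this example.

```r
# same toy data, but with the residual columns placed first
df <- data.frame(
  bmi.residuals = c(0.1, 0.3, 0.4),
  height.residuals = c(0.3, 0.2, 0.1),
  bmi = c(26, 27, 23),
  height = c(187, 176, 189)
)

corr_mat <- cor(df)

trait_names <- colnames(df)[!grepl("residuals", colnames(df))]

# look each pair up by row and column name, not by position
cors <- data.frame(
  traits = trait_names,
  correlation = vapply(
    trait_names,
    function(tr) corr_mat[tr, paste0(tr, ".residuals")],
    numeric(1),
    USE.NAMES = FALSE
  ),
  stringsAsFactors = FALSE
)
cors
```

This gives the same two correlations as the output shown above (about -0.577 for bmi and -0.143 for height), regardless of how the columns are ordered.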

Answer 3

Here is a short solution:

Suppose you have a dataframe with the variables a, a.resi, b, b.resi

df <- data.frame(a=c(1:10), b=c(1:10),
              a.resi=c(-1:-10), b.resi=c(-1:-10))

First, create a vector (named 'core') with all your core variables (that is, those without the suffix .resi)

core <- names(df)[1:2]

Then, create another vector (named core.resi) that contains the core variable names with the suffix .resi appended, using paste0()

core.resi <- paste0(core, '.resi')

Define a function that takes 3 arguments: a dataframe (Data), x, and y. This function computes the correlation between the columns x and y of the dataframe Data

MyFun <- function(Data, x,y) cor(Data[,x], Data[,y])

Finally, apply the function to the vectors core and core.resi

library(magrittr)  # provides the %>% pipe

mapply(MyFun, x=core, y=core.resi, MoreArgs = list(Data=df)) %>% 
  data.frame()
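
With many columns, picking the core variables by position (`[1:2]`) gets tedious. As a sketch, assuming the suffix is always `.resi`, the pairs can be derived from the column names instead, and the result returned in the trait/correlation layout from the question. The names `ok` and `res` are introduced here for illustration.

```r
df <- data.frame(a = 1:10, b = 1:10,
                 a.resi = -(1:10), b.resi = -(1:10))

# columns ending in ".resi", and the core variable each one belongs to
core.resi <- grep("\\.resi$", names(df), value = TRUE)
core <- sub("\\.resi$", "", core.resi)

# keep only pairs whose core column is actually present
ok <- core %in% names(df)

res <- data.frame(
  trait = core[ok],
  correlation = mapply(function(x, y) cor(df[[x]], df[[y]]),
                       core[ok], core.resi[ok], USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
res
```

In this toy data each `.resi` column is the exact negative of its core column, so both correlations are -1.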

Answer 4

You can try a tidyverse solution:

library(tidyverse)
# d is the data frame from the question
cor(d[,-1]) %>% 
  as_tibble() %>% 
  add_column(Trait=colnames(.)) %>% 
  gather(key, value, -Trait) %>% 
  rowwise() %>% 
  filter(grepl(paste(Trait, collapse = "|"), key)) %>% 
  filter(Trait != key) %>% 
  ungroup()
# A tibble: 3 x 3
   Trait              key         value
   <chr>            <chr>         <dbl>
1    bmi    bmi.residuals -3.248544e-17
2 height height.residuals -7.837838e-01
3     IQ     IQ.residuals  4.487375e-01

Or start from your data.frame directly:

d %>% 
  gather(key, value, -ID) %>% 
  mutate(gr=strtrim(key,2)) %>% 
  split(.$gr) %>% 
  map(~spread(.,key, value)) %>%
  map(~cor(.[-1:-2])[,2]) %>% 
  map(~data.frame(Trait1=names(.)[1], Trait2=names(.)[2], cor=.[1],stringsAsFactors = F)) %>% 
  bind_rows()  
  Trait1           Trait2           cor
1    bmi    bmi.residuals -3.248544e-17
2 height height.residuals -7.837838e-01
3     IQ     IQ.residuals  4.487375e-01
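
Note that grouping by the first two characters (`strtrim(key, 2)`) only works while no two traits share a two-letter prefix. A small base-R sketch of the problem and one possible alternative, stripping the `.residuals` suffix instead; `head.size` is a hypothetical extra trait that is not in the question's data:

```r
# "height" is from the question; "head.size" is a hypothetical extra trait
cols <- c("height", "height.residuals", "head.size", "head.size.residuals")

# first two characters: both traits collapse into the same group "he"
by_prefix <- strtrim(cols, 2)
by_prefix

# stripping the suffix keeps one group per trait
by_trait <- sub("\\.residuals$", "", cols)
by_trait
```

Under that assumption, `mutate(gr = sub("\\.residuals$", "", key))` could take the place of the `strtrim()` call in the pipeline above.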

Answer 1
2 accepted 2017-12-14 10:41:25

Answer 2
1 2017-12-14 06:51:07

Answer 3
1 2017-12-14 08:18:28

Answer 4
1 2017-12-14 08:34:45
