R: Correlation between pairs of variables

Question

I have a data frame that looks like this:

ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4

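There are actually many more columns, but the idea is that there are many traits (such as bmi, height and IQ in the example above), followed by the same number of columns holding the standardized residuals obtained after regressing out some variables (named bmi.residuals, height.residuals and IQ.residuals in the example above). I would like to create an object holding the correlation between each trait and its residuals, like this: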

trait correlation 
bmi 0.85
height 0.90
IQ 0.75

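Here the correlation for "bmi" is the correlation between bmi and bmi.residuals, the correlation for "height" is the correlation between height and height.residuals, the one for "IQ" is between IQ and IQ.residuals, and so on.

I could compute all the correlations one by one, but with many columns (many traits) in the data frame there must be a way to automate this. Any ideas? I suspect lapply could come in handy, but I'm not sure how...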

Answer 1

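Another solution, using dplyr and tidyr. The idea is to compute all correlations first, since that is simple and fast, then reshape the result into a dataset and keep only the rows where the variable names match but are not identical: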

df = read.table(text = "
ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4
", header=TRUE)

library(dplyr)
library(tidyr)

# function to use later (to filter out rows)
f = function(x,y) grepl(x,y)
f = Vectorize(f)


df %>% 
  select(-ID) %>%                # remove unnecessary columns
  cor() %>%                      # get all correlations (even ones you don't care about)
  data.frame() %>%               # save result as a dataframe
  mutate(v1 = row.names(.)) %>%  # add row names as a column
  gather(v2,cor, -v1) %>%        # reshape data
  filter(f(v1,v2) & v1 != v2)    # keep pairs that v1 matches v2, but are not the same

#       v1               v2           cor
# 1    bmi    bmi.residuals -3.248544e-17
# 2 height height.residuals -7.837838e-01
# 3     IQ     IQ.residuals  4.487375e-01
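One caveat with the grepl filter: the trait name is used as a regular expression and may match anywhere in the other name, so a trait called bmi would also pair with a column such as bmi2.residuals. Below is a minimal sketch of a stricter test, assuming every residual column is named exactly `<trait>.residuals`. The column name bmi2.residuals and the helper is_pair are hypothetical and used only for illustration.

```r
# Hypothetical name pairs, including one that merely contains "bmi"
v1 <- c("bmi", "bmi", "height")
v2 <- c("bmi.residuals", "bmi2.residuals", "height.residuals")

# Pattern match, as in the filter above: TRUE whenever v1 occurs anywhere in v2
loose <- mapply(grepl, v1, v2, USE.NAMES = FALSE)

# Exact match (hypothetical helper): TRUE only when v2 is v1 plus ".residuals"
is_pair <- function(trait, other) other == paste0(trait, ".residuals")
strict <- is_pair(v1, v2)

loose   # TRUE TRUE TRUE
strict  # TRUE FALSE TRUE
```

With real data the exact test could replace f(v1,v2) & v1 != v2 in the filter step, at the cost of hard-coding the suffix.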

Another approach is to find the pairs of interest first and then compute the correlations:

df = read.table(text = "
ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4
", header=TRUE)

library(dplyr)
library(tidyr)

# function to use later (to filter out rows)
f = function(x,y) grepl(x,y)
f = Vectorize(f)

# keep only columns that you want to get correlations
df2 = df %>% select(-ID)

# function to get cor between two variables (looks up columns of df2 by name)
f2 = function(x,y) cor(df2[,x], df2[,y])
f2 = Vectorize(f2)

# get all possible combinations of names (as character strings, not factors)
expand.grid(v1=names(df2), v2=names(df2), stringsAsFactors=FALSE) %>%
  filter(f(v1,v2) & v1 != v2) %>%              # keep pairs of names where v1 matches v2, but are not the same
  mutate(cor = f2(v1,v2))                      # for those pairs (only) obtain correlation value

#       v1               v2           cor
# 1    bmi    bmi.residuals -3.248544e-17
# 2 height height.residuals -7.837838e-01
# 3     IQ     IQ.residuals  4.487375e-01

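I suggest you pick whichever approach is faster for your data, since the number of rows and columns can affect the speed of the two approaches above.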
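To check which approach is faster on data of a given size, something like the following rough sketch could be used. It is not the dplyr code above but a base R analogue of the two ideas (full correlation matrix versus matching pairs only), run on simulated data. The sizes n_rows and n_traits and the column names t1, t1.residuals, ... are assumptions chosen for illustration.

```r
set.seed(1)                      # reproducible simulated data
n_rows   <- 200                  # assumed sizes, adjust to resemble your data
n_traits <- 50

# Simulated traits t1..t50 and matching columns t1.residuals..t50.residuals
traits <- paste0("t", seq_len(n_traits))
resids <- paste0(traits, ".residuals")
sim <- as.data.frame(matrix(rnorm(n_rows * n_traits * 2), nrow = n_rows))
names(sim) <- c(traits, resids)

# Approach 1: full correlation matrix, then keep the matching pairs by name
time_all <- system.time({
  cm <- cor(sim)
  cor_all <- cm[cbind(traits, resids)]
})

# Approach 2: compute correlations for the matching pairs only
time_pairs <- system.time({
  cor_pairs <- mapply(function(x, y) cor(sim[[x]], sim[[y]]),
                      traits, resids, USE.NAMES = FALSE)
})

all.equal(cor_all, cor_pairs)    # both approaches should agree
time_all["elapsed"]
time_pairs["elapsed"]
```

The timings depend on the machine and on the data dimensions, so they are best measured on data shaped like your own.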

Answer 2

Maybe this will work for you:

bmi <- c(26, 27, 23)
height <- c(187, 176, 189)

bmi.residuals <- c(0.1, 0.3, 0.4)
height.residuals <- c(0.3, 0.2, 0.1)

df <- data.frame(bmi, height, bmi.residuals, height.residuals)

corr_df <- data.frame(cor(df))

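# trait columns are those without "residuals" in their name
trait_names <- colnames(df)
trait_names <- trait_names[!grepl("residuals", trait_names)]

cors <- data.frame(
  traits = character(length(trait_names)),
  correlation = numeric(length(trait_names)),
  stringsAsFactors = FALSE
)

# row i of corr_df is trait i; the second column matching its name is the residual column
for (i in seq_along(trait_names)) {
  cors$traits[i] <- trait_names[i]
  cors$correlation[i] <-
    corr_df[i, which(grepl(trait_names[i], names(corr_df)))[2]]
}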

Input:

> df
  bmi height bmi.residuals height.residuals
1  26    187           0.1              0.3
2  27    176           0.3              0.2
3  23    189           0.4              0.1

Correlation matrix:

> corr_df
                        bmi      height bmi.residuals height.residuals
bmi               1.0000000 -0.78920304   -0.57655666        0.7205767
height           -0.7892030  1.00000000   -0.04676098       -0.1428571
bmi.residuals    -0.5765567 -0.04676098    1.00000000       -0.9819805
height.residuals  0.7205767 -0.14285714   -0.98198051        1.0000000

Output:

> cors
  traits correlation
1    bmi  -0.5765567
2 height  -0.1428571

Note that this only works if the original columns come before the .residuals columns.
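If the column order cannot be guaranteed, one option is to look up each cell of the correlation matrix by row and column name instead of by position. A minimal sketch, assuming every residual column is named `<trait>.residuals`; the names df_shuffled, corr_mat and cors_by_name are introduced here for illustration only:

```r
# Same toy data as above, but with the residual columns placed first
df_shuffled <- data.frame(
  height.residuals = c(0.3, 0.2, 0.1),
  bmi.residuals    = c(0.1, 0.3, 0.4),
  height           = c(187, 176, 189),
  bmi              = c(26, 27, 23)
)

corr_mat <- cor(df_shuffled)

# Traits are the columns whose name does not end in ".residuals"
traits <- grep("\\.residuals$", colnames(corr_mat), value = TRUE, invert = TRUE)

# Pick each (trait, trait.residuals) cell by row name and column name
cors_by_name <- data.frame(
  traits      = traits,
  correlation = corr_mat[cbind(traits, paste0(traits, ".residuals"))],
  stringsAsFactors = FALSE
)
cors_by_name
```

The values should match the output above (about -0.577 for bmi and -0.143 for height), only listed in the shuffled column order.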

Answer 3

Here is a short solution:

Suppose you have a data frame containing the variables a, b, a.resi and b.resi

df <- data.frame(a=c(1:10), b=c(1:10),
              a.resi=c(-1:-10), b.resi=c(-1:-10))

First, create a vector (named "core") with all the core variables, i.e. those without the suffix .resi

core <- names(df) [1:2]

Then use paste0() to create another vector (named core.resi) containing the core variable names with the suffix .resi.

core.resi <- paste0(core, '.resi')

Define a function with 3 arguments: a data frame (Data), x and y. This function computes the correlation between the columns x and y of the data frame Data

MyFun <- function(Data, x,y) cor(Data[,x], Data[,y])

Finally, apply the function to the vectors core and core.resi

library(magrittr)  # provides the %>% pipe
mapply(MyFun, x=core, y=core.resi, MoreArgs = list(Data=df)) %>% 
  data.frame()
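When the core variables are not simply the first columns, their names can be derived from the suffix instead of from positions. The sketch below assumes that every residual column ends in .resi and has a matching core column; it uses vapply and returns a data frame shaped like the one requested in the question. The names result and correlation are introduced here for illustration.

```r
df <- data.frame(a = 1:10, b = 1:10,
                 a.resi = -(1:10), b.resi = -(1:10))

# Residual columns are those ending in ".resi"; strip the suffix to get the traits
core.resi <- grep("\\.resi$", names(df), value = TRUE)
core      <- sub("\\.resi$", "", core.resi)

# One correlation per (trait, trait.resi) pair, returned as a numeric vector
correlation <- vapply(
  seq_along(core),
  function(i) cor(df[[core[i]]], df[[core.resi[i]]]),
  numeric(1)
)

# In this toy data each residual column is the negated trait, so both correlations are -1
result <- data.frame(trait = core, correlation = correlation)
result
```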

Answer 4

You can try a tidyverse solution:

library(tidyverse)
# d is the data frame from the question
cor(d[,-1]) %>% 
  as_tibble() %>% 
  add_column(Trait=colnames(.)) %>% 
  gather(key, value, -Trait) %>% 
  rowwise() %>% 
  filter(grepl(paste(Trait, collapse = "|"), key)) %>% 
  filter(Trait != key) %>% 
  ungroup()
# A tibble: 3 x 3
   Trait              key         value
   <chr>            <chr>         <dbl>
1    bmi    bmi.residuals -3.248544e-17
2 height height.residuals -7.837838e-01
3     IQ     IQ.residuals  4.487375e-01

Or start directly from the data.frame:

d %>% 
  gather(key, value, -ID) %>% 
  mutate(gr=strtrim(key,2)) %>% 
  split(.$gr) %>% 
  map(~spread(.,key, value)) %>%
  map(~cor(.[-1:-2])[,2]) %>% 
  map(~data.frame(Trait1=names(.)[1], Trait2=names(.)[2], cor=.[1],stringsAsFactors = FALSE)) %>% 
  bind_rows()  
  Trait1           Trait2           cor
1    bmi    bmi.residuals -3.248544e-17
2 height height.residuals -7.837838e-01
3     IQ     IQ.residuals  4.487375e-01

Answer 1
2 accepted 2017-12-14 10:41:25

Answer 2
1 2017-12-14 06:51:07

Answer 3
1 2017-12-14 08:18:28

Answer 4
1 2017-12-14 08:34:45